## Part One:
- To evaluate the industry expert's claim using the GMM model results, I assessed the statistical significance of the estimated parameters (p0 to p3) and the validity of the over-identifying moment conditions through the Hansen J test.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.optimize import minimize
from scipy.stats import chi2

from statsmodels.sandbox.regression.gmm import IV2SLS
from statsmodels.sandbox.regression.gmm import GMM
from sklearn.preprocessing import LabelEncoder

In [1]:
%pip install numpy pandas statsmodels

Note: you may need to restart the kernel to use updated packages.


In [12]:
df = pd.read_csv('https://raw.githubusercontent.com/revasandhir/schulich_data_science/refs/heads/main/midterm_partone.csv?token=GHSAT0AAAAAAC2ESQ3FPS6F3PYQY6OOT7CSZZNEYZA')

In [4]:
# First Stage: Instrumental Variable Regression for Inventory Turnover
model_iv = sm.OLS(df["Inventory Turnover"], 
                  df[["Constant", "Current Ratio", "Quick Ratio", "Debt Asset Ratio"]]).fit()
endog_predict = model_iv.predict(df[["Constant", "Current Ratio", "Quick Ratio", "Debt Asset Ratio"]])
df["Endogenous Param"] = endog_predict

In [5]:
# Second Stage: 2SLS for Stock Change
model_2sls = sm.OLS(df["Stock Change"], 
                    df[["Constant", "Endogenous Param", "Operating Profit", "Interaction Effect"]]).fit()
print(model_2sls.summary())

                            OLS Regression Results                            
Dep. Variable:           Stock Change   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.013
Method:                 Least Squares   F-statistic:                     8.530
Date:                Thu, 07 Nov 2024   Prob (F-statistic):           1.27e-05
Time:                        15:39:16   Log-Likelihood:                -1186.5
No. Observations:                1696   AIC:                             2381.
Df Residuals:                    1692   BIC:                             2403.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Constant              -0.0176      0
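
One caveat on the manual two-stage approach above: running the second stage as a plain OLS on fitted values gives 2SLS-style point estimates, but the printed standard errors come from residuals built with the fitted regressor, so they are generally not the correct 2SLS standard errors. The first stage also conventionally includes the exogenous regressors alongside the instruments. The following is a minimal NumPy sketch on synthetic data (the data-generating process and all variable names are made up for illustration and are not taken from the midterm dataset) showing the point estimate and the residual that the 2SLS variance should use:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                 # instrument (hypothetical)
u = rng.normal(size=n)                 # structural error
x = z + 0.8 * u + rng.normal(size=n)   # endogenous regressor, correlated with u
y = 1.0 + 2.0 * x + u                  # true slope is 2.0

const = np.ones(n)
X = np.column_stack([const, x])        # regressors (constant + endogenous)
Z = np.column_stack([const, z])        # instruments (constant + excluded instrument)

# Stage 1: project every regressor on the full instrument set
first_stage, *_ = np.linalg.lstsq(Z, X, rcond=None)
X_hat = Z @ first_stage

# Stage 2: regress y on the fitted regressors
beta_2sls, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

# Plain OLS for comparison (biased here because x is correlated with u)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2SLS residuals use the ORIGINAL regressors, not the fitted ones
resid = y - X @ beta_2sls
sigma2 = resid @ resid / (n - X.shape[1])
se_2sls = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_hat.T @ X_hat)))

print("2SLS slope:", beta_2sls[1], "std err:", se_2sls[1])
print("OLS slope: ", beta_ols[1])
```

The standard-error formula assumes homoskedastic errors. A dedicated routine such as the `IV2SLS` class the notebook already imports applies this kind of correction internally, which is one reason to prefer it over two separate `sm.OLS` calls.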

In [6]:
# GMM
y_vals = np.array(df["Stock Change"])
x_vals = np.array(df[["Inventory Turnover", "Operating Profit", "Interaction Effect"]])
iv_vals = np.array(df[["Current Ratio", "Quick Ratio", "Debt Asset Ratio"]])

class CustomGMM(GMM):
    def momcond(self, params):
        p0, p1, p2, p3 = params
        endog = self.endog
        exog = self.exog
        inst = self.instrument   

        error0 = endog - p0 - p1 * exog[:,0] - p2 * exog[:,1] - p3 * exog[:,2]
        error1 = error0 * exog[:,1]
        error2 = error0 * exog[:,2]
        error3 = error0 * inst[:,0] 
        error4 = error0 * inst[:,1] 
        error5 = error0 * inst[:,2] 

        g = np.column_stack((error0, error1, error2, error3, error4, error5))
        return g
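
Written out, the moment conditions coded in `momcond` appear to be the following, where the symbols are my own labels: $y_i$ is Stock Change, $x_{1i}, x_{2i}, x_{3i}$ are Inventory Turnover, Operating Profit and Interaction Effect (the column order of `x_vals`), and $w_{1i}, w_{2i}, w_{3i}$ are Current Ratio, Quick Ratio and Debt Asset Ratio (the column order of `iv_vals`):

$$
\begin{aligned}
\varepsilon_i(p) &= y_i - p_0 - p_1 x_{1i} - p_2 x_{2i} - p_3 x_{3i},\\
\mathbb{E}\big[\varepsilon_i(p)\, m_i\big] &= 0, \qquad m_i = (1,\; x_{2i},\; x_{3i},\; w_{1i},\; w_{2i},\; w_{3i})^\top .
\end{aligned}
$$

Inventory Turnover ($x_{1i}$) does not appear in $m_i$ because it is treated as endogenous. With six moment conditions and four parameters, the model has two over-identifying restrictions.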


In [7]:
# Initial guess for GMM parameters
beta0 = np.array([0.1, 0.1, 0.1, 0.1])

In [8]:
# Fit GMM model
gmm_model = CustomGMM(endog=y_vals, exog=x_vals, instrument=iv_vals, k_moms=6, k_params=4)
gmm_results = gmm_model.fit(beta0)

Optimization terminated successfully.
         Current function value: 0.000046
         Iterations: 8
         Function evaluations: 12
         Gradient evaluations: 12
Optimization terminated successfully.
         Current function value: 0.000373
         Iterations: 7
         Function evaluations: 13
         Gradient evaluations: 13
Optimization terminated successfully.
         Current function value: 0.000372
         Iterations: 5
         Function evaluations: 9
         Gradient evaluations: 9
Optimization terminated successfully.
         Current function value: 0.000372
         Iterations: 5
         Function evaluations: 11
         Gradient evaluations: 11
Optimization terminated successfully.
         Current function value: 0.000372
         Iterations: 0
         Function evaluations: 1
         Gradient evaluations: 1


In [9]:
# Display GMM summary
print(gmm_results.summary())

                              CustomGMM Results                               
Dep. Variable:                      y   Hansen J:                       0.6317
Model:                      CustomGMM   Prob (Hansen J):                 0.729
Method:                           GMM                                         
Date:                Thu, 07 Nov 2024                                         
Time:                        15:39:25                                         
No. Observations:                1696                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
p 0           -0.0200      0.021     -0.964      0.335      -0.061       0.021
p 1            0.0011      0.001      1.843      0.065   -6.89e-05       0.002
p 2           -0.1071      0.032     -3.370      0.001      -0.169      -0.045
p 3            0.0011      0.000      2.760      0.0

•	Coefficient Analysis: Judged against the summary above, the coefficients are not uniformly insignificant. p0 (the constant, z = -0.964, p = 0.335) and p1 (z = 1.843, p = 0.065) are not significant at the 5% level, although p1 is significant at the 10% level. p2 (z = -3.370, p = 0.001) is significant at the 1% level, and p3 has a z-value of 2.760, which exceeds the 1.96 critical value of a two-sided 5% test. Given the column order of x_vals, p1, p2 and p3 correspond to Inventory Turnover, Operating Profit and Interaction Effect respectively.

•	Model Validation: The Hansen J statistic, which tests the over-identifying restrictions, has a value of 0.6317 with a p-value of 0.729. We therefore do not reject the null hypothesis that the instruments are uncorrelated with the error term, which is consistent with the instruments being valid.

In conclusion, the instruments pass the over-identification test, and the evidence on the coefficients is mixed: Operating Profit (p2) and the Interaction Effect (p3) have statistically significant coefficients, while the direct effect of Inventory Turnover (p1) is not significant at the 5% level. The GMM results therefore support the industry expert's claim only to the extent that it rests on p2 and p3, and they do not support it at the 5% level where it rests on p1.
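
The p-values printed in the CustomGMM summary can be cross-checked by hand. Under the null hypothesis, the Hansen J statistic is asymptotically chi-squared with degrees of freedom equal to the number of moment conditions minus the number of parameters, here 6 - 4 = 2. The coefficient p-values are two-sided normal tail probabilities of the z statistics. A minimal sketch using the rounded values from the printed summary (the variable names are illustrative):

```python
from scipy.stats import chi2, norm

# Values copied from the printed CustomGMM summary
j_stat = 0.6317
k_moms, k_params = 6, 4              # as passed to CustomGMM
df_overid = k_moms - k_params        # over-identifying restrictions

# Upper-tail chi-squared probability of the J statistic
j_pvalue = chi2.sf(j_stat, df_overid)
print(f"Hansen J p-value: {j_pvalue:.3f}")

# Two-sided p-values recomputed from the printed z statistics.
# The summary cuts off the p-value for p3, so it is recomputed from z here.
z_stats = {"p0": -0.964, "p1": 1.843, "p2": -3.370, "p3": 2.760}
z_pvalues = {name: 2 * norm.sf(abs(z)) for name, z in z_stats.items()}
for name, p in z_pvalues.items():
    print(f"{name}: z = {z_stats[name]:+.3f}, two-sided p = {p:.3f}")
```

Because the inputs are rounded to three or four digits, the recomputed values can differ from the summary in the last printed digit.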


## Part Two: Splitting the Dataset and Initial Model Performance

To evaluate credit rating predictions, I first divided the dataset equally into training (50%) and test (50%) sets. Then, a logistic regression model was fitted to the training set, with credit rating as the dependent variable.

In [13]:
df = pd.read_csv('https://raw.githubusercontent.com/revasandhir/schulich_data_science/refs/heads/main/midterm_parttwo.csv?token=GHSAT0AAAAAAC2ESQ3ED6VZRJVVJRW44KYUZZNEZZQ')

In [71]:
import numpy as np  # needed below for np.percentile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder

In [72]:
# Preprocess the categorical variables (convert to dummy variables)
df = pd.get_dummies(df, drop_first=True)

In [76]:
# Split the dataset into training and test sets (50/50)
X = df.drop('Credit Rating_Positive', axis=1)  # Independent variables
y = df['Credit Rating_Positive']  # Dependent variable (1 if Positive, 0 if Negative)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [77]:
# Fit Logistic Regression Model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

In [78]:
# Predict probabilities on the test set
y_probs = model.predict_proba(X_test)[:, 1]  # Probabilities for the 'Positive' class

In [79]:
# Convert probabilities to predicted class based on default threshold (0.5)
y_pred = (y_probs >= 0.5).astype(int)

In [80]:
# Evaluate the model with default threshold
conf_matrix = confusion_matrix(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [81]:
print("Confusion Matrix (default threshold):")
print(conf_matrix)
print(f"Recall: {recall}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")

Confusion Matrix (default threshold):
[[   0  577]
 [   0 3464]]
Recall: 1.0
Precision: 0.8572135609997525
F1 Score: 0.9231179213857429


In [82]:
# Adjust the threshold so that only about 15% of applications are approved:
# the 85th percentile of the predicted probabilities serves as the cutoff
threshold = np.percentile(y_probs, 85)

In [83]:
# Predict with the new threshold
y_pred_adjusted = (y_probs >= threshold).astype(int)
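
A percentile cutoff combined with `>=` approves exactly the target share only when no predicted probabilities are tied at the cutoff. If several applications share the cutoff score, all of them are approved and the share can exceed 15%. As a sketch of one alternative (the helper `approve_top_fraction` and the toy scores are hypothetical, not part of the notebook), applications can be approved by rank instead:

```python
import numpy as np

def approve_top_fraction(probs, fraction):
    """Approve exactly round(fraction * n) applications with the highest scores."""
    probs = np.asarray(probs, dtype=float)
    n_approve = int(round(fraction * probs.size))
    approved = np.zeros(probs.size, dtype=int)
    if n_approve > 0:
        # Stable sort on negated scores: highest first, ties broken by position
        top_idx = np.argsort(-probs, kind="stable")[:n_approve]
        approved[top_idx] = 1
    return approved

# Toy scores with four applications tied at the top
scores = np.array([0.9, 0.9, 0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3,
                   0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2])

by_percentile = (scores >= np.percentile(scores, 85)).astype(int)
by_rank = approve_top_fraction(scores, 0.15)

print("Approved by percentile cutoff:", by_percentile.sum(), "of", scores.size)
print("Approved by rank:             ", by_rank.sum(), "of", scores.size)
```

Ranking breaks ties arbitrarily (here by position in the array), so whether that is acceptable depends on the bank's policy. In the notebook's test set the percentile cutoff approves 608 of 4,041 applications, so ties have little effect there.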

In [84]:
# Evaluate the model with the adjusted threshold
conf_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)
recall_adjusted = recall_score(y_test, y_pred_adjusted)
precision_adjusted = precision_score(y_test, y_pred_adjusted)
f1_adjusted = f1_score(y_test, y_pred_adjusted)

In [85]:
print("\nConfusion Matrix (adjusted threshold):")
print(conf_matrix_adjusted)
print(f"Recall: {recall_adjusted}")
print(f"Precision: {precision_adjusted}")
print(f"F1 Score: {f1_adjusted}")


Confusion Matrix (adjusted threshold):
[[ 494   83]
 [2939  525]]
Recall: 0.15155889145496534
Precision: 0.8634868421052632
F1 Score: 0.25785854616895876


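In conclusion, setting the threshold to approve only about 15% of applications made the model far more selective: recall fell from 1.0 to 0.152 and the F1 score from 0.923 to 0.258. Precision stayed high at 0.863, but this is only marginally above the 0.857 obtained at the default threshold, where the model approved every application, so it largely reflects the share of positive ratings in the test set. Approved applications are still likely to align with the bank’s new stricter approval criteria, although the model’s ranking adds little beyond that base rate.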
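
The reported metrics follow directly from the two confusion matrices. A minimal sketch, assuming scikit-learn's layout in which rows are true classes and columns are predicted classes, so each matrix reads `[[TN, FP], [FN, TP]]`; the helper name `metrics_from_confusion` is illustrative and not part of the notebook:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return (recall, precision, f1) for the positive class."""
    recall = tp / (tp + fn)        # share of true positives that were approved
    precision = tp / (tp + fp)     # share of approvals that were true positives
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Counts copied from the printed confusion matrices
default_metrics = metrics_from_confusion(tn=0, fp=577, fn=0, tp=3464)
adjusted_metrics = metrics_from_confusion(tn=494, fp=83, fn=2939, tp=525)

# Share of test applications approved under the adjusted threshold
approval_rate = (83 + 525) / (494 + 83 + 2939 + 525)

print("Default  (recall, precision, F1):", default_metrics)
print("Adjusted (recall, precision, F1):", adjusted_metrics)
print("Adjusted approval rate:", approval_rate)
```

The approval rate comes out at roughly 15%, which confirms that the percentile cutoff did what it was meant to do on this test set.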