In [None]:
# Feature 3, Regression Analysis – Train vs Test and Different Models – Linear Regression

# Import standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.pyplot import subplots
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Import specific objects
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)


In [None]:
# Load the CSV file
file_path = r"../data/processed/CleanData_OneHotEncoded.csv"
df = pd.read_csv(file_path)

In [None]:
# Drop one dummy variable from each categorical variable set to avoid multicollinearity
df_dropped = df.drop(columns=['edu_Uneducated','marital_Single','income_Less than $40K'])
df_dropped.columns

In [None]:
#MODEL 1: INCLUDING ONLY DEMOGRAPHIC VARIABLES, CARD MONTHS ON BOOKS, NUMBER OF RELATIONSHIPS WITH BANK
# defining target variable y and predictors matrix
X1 = df_dropped.drop(columns=['Credit_Limit','Unnamed: 0','Months_Inactive_12_mon','Avg_Utilization_Ratio','Total_Revolving_Bal'])
y1 = df_dropped['Credit_Limit']

# Add a constant term for the intercept
X1 = sm.add_constant(X1)

# Fit the model
model1 = sm.OLS(y1, X1).fit()

# Output the summary
print(model1.summary())

In [None]:
#MODEL 2: INCLUDING ALL POTENTIAL PREDICTORS
# defining target variable y and predictors matrix
X2 = df_dropped.drop(columns=['Credit_Limit','Unnamed: 0'])
y2 = df_dropped['Credit_Limit']

# Add a constant term for the intercept
X2 = sm.add_constant(X2)

# Fit the model
model2 = sm.OLS(y2, X2).fit()

# Output the summary
print(model2.summary())

In [None]:
#MODEL 3: SIMPLIFIED MODEL
# defining target variable y and predictors matrix
X3 = df_dropped.drop(columns=['Credit_Limit','Unnamed: 0','Customer_Age','Gender','Months_on_book','Months_Inactive_12_mon',
                              'edu_College','edu_Doctorate','edu_Graduate','edu_High School','edu_Post-Graduate', 'marital_Divorced',
                              'income_$40K - $60K'])
y3 = df_dropped['Credit_Limit']

# Add a constant term for the intercept
X3 = sm.add_constant(X3)

# Fit the model
model3 = sm.OLS(y3, X3).fit()

# Output the summary
print(model3.summary())

## REGRESSION ANALYSIS

### MODEL 1: USING ONLY DEMOGRAPHIC AND BANK CONTACT HISTORY VARIABLES
To explore the client’s main question first, namely whether credit limits can be predicted solely from customers’ demographic characteristics and their contact history with the bank, we fit a linear regression model for our target variable (Credit_Limit). The predictors were limited to the demographic variables in the dataset (Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income) and two contact history variables (Months_on_book, Total_Relationship_Count).
The model’s R-squared is 0.384; that is, only about 38% of the variation in Credit_Limit is explained by the selected variables. This suggests that these variables alone are not sufficient to fit a model that successfully predicts Credit Limit for the purposes of the client’s intended application.
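
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch on synthetic data (the variables and numbers are illustrative assumptions, not values from the project's dataset), using a plain NumPy least-squares fit:

```python
import numpy as np

# Synthetic illustration only: names and numbers are assumptions,
# not values from the project's dataset.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
ss_res = float(resid @ resid)                  # unexplained variation
ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

In the notebook itself, the same quantity is available from the fitted statsmodels results as `model1.rsquared`.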

Despite this, the significance of the individual variables can shed some light on which of them could be relevant to determining a customer’s credit limit if other relevant variables, currently missing from the dataset, were added. Note that because the model fits poorly, the significance of some of these variables could change once other important variables that explain the variability in credit limit are included.
Customer Age, Education Level and Months on book (months since the customer's card was first activated) are not significant in this model. For Marital Status, only being married has a significant effect relative to single customers (the reference group); that is, a divorced customer’s credit limit is not significantly different from that of a single customer.

In this model, gender is significant (being Male is associated with a higher credit limit). Credit limit increases with the number of dependents and decreases with the number of open products (relationship count) that the customer has with the bank, possibly because these products are more often credit-related products than asset-holding accounts. As one would expect, credit limit also increases with income.


### MODEL 2: MODEL 1 + CREDIT CARD USAGE VARIABLES
For this model, we added the three credit card usage variables available in the dataset (Months_Inactive_12_mon, Avg_Utilization_Ratio and Total_Revolving_Bal) to the predictors in Model 1, so that the model uses all potential predictors and accounts for relationships among them that Model 1 could not capture.

The R-squared increased considerably when these predictors were added, but it is still only 0.55. Roughly 45% of the variation in Credit_Limit therefore remains unexplained, presumably driven by variables that are beyond the reach of the selected dataset.

Of the three added credit card usage variables, only Months_Inactive_12_mon (months inactive within the last 12 months) is not significant, as indicated by its p-value.
It is important to note that, unlike in Model 1, Gender is not significant in this model. This illustrates that the differences in credit limit by gender found before were mostly explained by the association between gender and both utilization ratio and revolving balance, a pattern that may reflect the sample more than systemic discrimination based on gender.
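
The loss of significance for Gender is consistent with an omitted-variable effect: when a predictor correlated with gender is left out, its influence is attributed to gender. A minimal sketch on synthetic data follows; every name and number in it is a hypothetical assumption chosen for illustration, not a value from the project's dataset.

```python
import numpy as np

# Synthetic illustration only: all names and numbers below are assumptions,
# not values from the project's dataset.
rng = np.random.default_rng(42)
n = 2000
gender = rng.integers(0, 2, size=n).astype(float)   # hypothetical 0/1 indicator
utilization = 0.5 - 0.3 * gender + rng.normal(scale=0.1, size=n)
# The simulated credit limit depends on utilization only, not on gender.
credit_limit = 10_000 - 8_000 * utilization + rng.normal(scale=1_000, size=n)

def ols_coefs(columns, y):
    """Least-squares coefficients, with the intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Gender coefficient when utilization is omitted vs. included.
gender_coef_short = ols_coefs([gender], credit_limit)[1]
gender_coef_full = ols_coefs([gender, utilization], credit_limit)[1]

print(f"gender coefficient without utilization: {gender_coef_short:.0f}")
print(f"gender coefficient with utilization:    {gender_coef_full:.0f}")
```

In this simulation the gender coefficient is large when utilization is omitted and shrinks toward zero once it is included, even though gender has no direct effect on the simulated credit limit.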

Another difference from the results in Model 1 is the non-significance of the income bracket $40K - $60K. With the caveat of the model’s limited fit, this suggests that there is little material difference between the Credit Limit of customers in this income bracket and that of customers earning Less than $40K (the reference group) once the other predictors in Model 2 are taken into account.

### MODEL 3: SIGNIFICANT VARIABLES ONLY
Model 3 includes only the variables found to be significant in Model 2. The coefficient estimates do not change dramatically compared with their Model 2 counterparts, and the R-squared value is still 0.55.
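
One way to check that dropping predictors costs little explanatory power is a nested-model F statistic, which compares the residual sums of squares of the reduced and full models. The sketch below uses synthetic data and plain NumPy; the variable names and numbers are assumptions for illustration and are not taken from the project's dataset.

```python
import numpy as np

# Synthetic illustration only: names and numbers are assumptions.
rng = np.random.default_rng(1)
n = 1000
x_signal = rng.normal(size=n)         # predictor that drives y
x_noise = rng.normal(size=(n, 2))     # two predictors unrelated to y
y = 3.0 + 2.0 * x_signal + rng.normal(size=n)

def fit_ssr(X, y):
    """Return the residual sum of squares of a least-squares fit."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return float(resid @ resid)

X_reduced = np.column_stack([np.ones(n), x_signal])
X_full = np.column_stack([X_reduced, x_noise])

ssr_reduced = fit_ssr(X_reduced, y)
ssr_full = fit_ssr(X_full, y)

q = X_full.shape[1] - X_reduced.shape[1]   # number of dropped predictors
df_resid_full = n - X_full.shape[1]        # residual degrees of freedom, full model
f_stat = ((ssr_reduced - ssr_full) / q) / (ssr_full / df_resid_full)

print(f"F statistic for the dropped predictors: {f_stat:.2f}")
```

A small F statistic indicates that the dropped predictors add little beyond the reduced model. In the notebook, the `anova_lm` function already imported from statsmodels could report this statistic together with its p-value when given `model3` and `model2`.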


### REGRESSION ANALYSIS TAKEAWAYS
- The weak fit of our first model allowed us to answer one of the questions asked by the client: can credit limits be predicted solely based on the customers’ demographic characteristics and their contact history with a bank? We found that these variables alone are not sufficient to fit a model that successfully predicts Credit Limit for the purposes of the client’s intended application.

- Despite this, the consistent significance of some variables throughout the regression analysis suggests that Dependent Count, Total Relationship Count, Avg. Utilization Ratio, Total Revolving Balance, whether users are Married or not, and their income brackets could be relevant variables to collect in the app’s questionnaire, in combination with other variables unavailable in the dataset.

- One main limitation of the dataset is that it was originally intended to predict credit card churn, not credit limit as in our project. As a result, it is missing some key variables for determining Credit Limit, such as Credit Score and Customer Net Worth, among others. In the next phase of the project we aim to find this complementary data and add it to our analysis in order to fulfill the second part of the client’s request.

- Another limitation is the assumption that this limited dataset is representative. We aim to acquire a more comprehensive dataset that ensures proper representation of credit card users.

