## [9660] Homework # 1 - Logistic Regression
Data file: https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/credit_card_churners_1_2500.csv

## Homework Submission Rules (for all homework assignments)
* Homework is due by 6:05 PM on the due date
  * No late submission will be accepted
* Verify that you are submitting the correct homework file
* Homework file naming convention
  * LastName_FirstName_HwX.ipynb  [Replace X with the homework #]
    * 1 point deducted for submitting homework not complying with naming convention
* Before submission, execute "Kernel -> Restart Kernel and Run All Cells"
  * 1 point deducted for not submitting a cleanly executed notebook

## Homework #1 Requirements
* Load data into dataframe
* Examine data
* Use SimpleImputer to replace missing values
* Prepare data for model training
* Train Logistic Regression model
  * If you get errors, change appropriate hyperparameters to eliminate errors
* Calculate and display model accuracy
  * The final model must have accuracy > 91%
    * Change hyperparameters accordingly to achieve this accuracy level

In [1]:
from datetime import datetime
# Explicit format codes are used because %D and %T are not supported on every platform
print(f'Run time: {datetime.now().strftime("%m/%d/%y %H:%M:%S")}')

Run time: 03/06/24 14:26:00


### Import libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Load data

#### Credit Card Churn Prediction
* https://www.kaggle.com/datasets/anwarsan/credit-card-bank-churn

Business Problem  
A business manager at a consumer credit card bank is facing customer attrition. They want to analyze the data to find the reasons behind it and use those findings to predict which customers are likely to leave.

Columns
* CLIENTNUM: Client number. Unique identifier for the customer holding the account
* Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
* Customer_Age: Age in Years
* Gender: Gender of the account holder
* Dependent_count: Number of dependents
* Education_Level: Educational Qualification of the account holder - High School, College, Post-Graduate
* Marital_Status: Marital Status of the account holder
* Income_Category: Annual Income Category of the account holder
* Card_Category: Type of Card
* Months_on_book: Period of relationship with the bank
* Total_Relationship_Count: Total no. of products held by the customer
* Months_Inactive_12_mon: No. of months inactive in the last 12 months
* Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
* Credit_Limit: Credit Limit on the Credit Card
* Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
* Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
* Total_Trans_Amt: Total Transaction Amount (Last 12 months)
* Total_Trans_Ct: Total Transaction Count (Last 12 months)
* Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
* Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
* Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

In [3]:
# Read data from file (credit_card_churners_1_2500.csv) into dataframe
#  NOTE: Use CLIENTNUM as the index column
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/credit_card_churners_1_2500.csv', sep=',', low_memory=False, index_col="CLIENTNUM")
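
To illustrate what `index_col` does, here is a minimal sketch that reads a two-row in-memory CSV. The values are made up for illustration and are not taken from the homework file; `csv_text` and `toy_df` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "CLIENTNUM,Attrition_Flag,Customer_Age\n1001,1,37\n1002,0,42\n"

# index_col promotes CLIENTNUM from a regular column to the row index
toy_df = pd.read_csv(io.StringIO(csv_text), index_col="CLIENTNUM")

print(toy_df.shape)      # CLIENTNUM is no longer counted as a column
print(toy_df.loc[1002])  # rows can be looked up by client number
```

Because the identifier becomes the index, it is also kept out of the feature matrix later, which avoids training on a meaningless ID.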

### Examine data

In [4]:
# Review dataframe shape
df.shape

(2500, 23)

In [5]:
# Display first few rows of dataframe
df.head()

CLIENTNUM,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
719999508,1,37.0,3,0.0,1.0,0.0,24,3,1,2,...,1.561,2438,45,0.607,0.069,0,1,0,1,0
716713533,1,53.0,3,2.0,2.0,0.0,44,5,2,2,...,0.542,3393,58,0.871,0.065,0,1,0,0,1
711800658,0,42.0,3,0.0,0.0,0.0,36,2,3,3,...,0.577,2465,42,0.355,0.0,1,0,0,0,1
719384433,0,44.0,3,0.0,0.0,0.0,28,5,2,3,...,0.654,2581,57,0.781,0.641,1,0,1,0,0
718894233,1,53.0,2,1.0,1.0,0.0,36,6,2,4,...,0.698,2116,63,0.575,0.471,1,0,0,1,0


In [6]:
# Display distribution counts for target variable Attrition_Flag
print(df['Attrition_Flag'].value_counts())

Attrition_Flag
1    2113
0     387
Name: count, dtype: int64
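
The classes are imbalanced, so it helps to know the accuracy of a trivial model that always predicts the majority class. The sketch below computes that baseline from the counts printed above. It describes the full dataset; the proportion in a particular test split may differ slightly.

```python
# Class counts copied from the value_counts() output above
counts = {1: 2113, 0: 387}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")  # 84.52%
```

Any accuracy figure reported later should be read against this baseline rather than against 50%.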


### Prepare data

##### Check for missing values

In [7]:
df.isnull().sum()

Attrition_Flag                0
Customer_Age                109
Dependent_count               0
Education_Level               0
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
Gender_F                      0
Gender_M                      0
Marital_Status_Divorced       0
Marital_Status_Married        0
Marital_Status_Single         0
dtype: int64

#### Use the SimpleImputer to replace missing values

In [8]:
pd.set_option('display.max_columns', None)

In [9]:
df[df.isnull().any(axis=1)]

CLIENTNUM,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
713003358,0,,3,1.0,2.0,0.0,30,2,2,2,29338.0,159,29179.0,1.017,4983,40,0.379,0.005,0,1,0,0,1
717547383,1,,2,1.0,0.0,0.0,32,2,3,1,2890.0,1392,1498.0,0.736,4233,77,0.791,0.482,1,0,0,1,0
716564433,0,,1,1.0,0.0,0.0,36,3,2,4,4287.0,0,4287.0,0.294,1635,41,0.242,0.000,1,0,0,0,1
788876208,1,,2,1.0,0.0,0.0,27,5,3,3,2265.0,0,2265.0,0.603,3826,71,0.543,0.000,1,0,0,0,1
714030183,1,,0,1.0,2.0,1.0,36,5,3,3,34516.0,1494,33022.0,1.229,3259,65,0.585,0.043,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
718599108,1,,3,1.0,0.0,0.0,39,5,3,2,3230.0,1172,2058.0,0.514,4419,64,0.730,0.363,1,0,0,1,0
789830058,1,,4,0.0,2.0,0.0,34,4,1,2,21573.0,1585,19988.0,0.621,1384,36,2.273,0.073,0,1,0,0,1
708646233,1,,1,0.0,0.0,0.0,36,3,2,1,17116.0,1289,15827.0,0.559,3632,50,0.923,0.075,1,0,1,0,0
713617608,1,,3,0.0,3.0,1.0,33,5,2,3,34516.0,1844,32672.0,1.051,4217,79,0.795,0.053,0,1,0,1,0


In [10]:
# Setup imputer to replace NaN cells with mean of column
#  Both hyperparameters explicitly specified for teaching purposes
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

In [11]:
cols_to_impute = ['Customer_Age']

In [12]:
# Replace the NaN values in the selected columns with the column mean
df[cols_to_impute] = imp_mean.fit_transform(df[cols_to_impute])
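
As a minimal sketch of what mean imputation does, the example below applies `SimpleImputer` to a four-row toy column. The ages are made up for illustration, and `toy` and `toy_imputer` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages; one value is missing
toy = pd.DataFrame({"Customer_Age": [40.0, np.nan, 50.0, 60.0]})

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy[["Customer_Age"]] = toy_imputer.fit_transform(toy[["Customer_Age"]])

print(toy_imputer.statistics_)       # learned column mean: [50.]
print(toy["Customer_Age"].tolist())  # [40.0, 50.0, 50.0, 60.0]
```

One design note: fitting the imputer on the full dataset, as this notebook does, lets test rows influence the imputed mean. A stricter workflow would split first, call `fit_transform` on the training rows, and call `transform` on the test rows.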

#### Check for missing values again

In [13]:
df.isnull().sum()

Attrition_Flag              0
Customer_Age                0
Dependent_count             0
Education_Level             0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
Gender_F                    0
Gender_M                    0
Marital_Status_Divorced     0
Marital_Status_Married      0
Marital_Status_Single       0
dtype: int64

### Separate independent and dependent variables
* Independent variables: All remaining variables except Attrition_Flag
* Dependent variable: Attrition_Flag

In [14]:
X = df.drop("Attrition_Flag", axis = 1)
y = df["Attrition_Flag"]

In [15]:
X.head()

CLIENTNUM,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
719999508,37.0,3,0.0,1.0,0.0,24,3,1,2,11894.0,816,11078.0,1.561,2438,45,0.607,0.069,0,1,0,1,0
716713533,53.0,3,2.0,2.0,0.0,44,5,2,2,19063.0,1236,17827.0,0.542,3393,58,0.871,0.065,0,1,0,0,1
711800658,42.0,3,0.0,0.0,0.0,36,2,3,3,2145.0,0,2145.0,0.577,2465,42,0.355,0.0,1,0,0,0,1
719384433,44.0,3,0.0,0.0,0.0,28,5,2,3,1877.0,1203,674.0,0.654,2581,57,0.781,0.641,1,0,1,0,0
718894233,53.0,2,1.0,1.0,0.0,36,6,2,4,2956.0,1391,1565.0,0.698,2116,63,0.575,0.471,1,0,0,1,0


### Split data into training and test sets

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2024)
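
The split above is purely random. With an imbalanced target, one optional refinement (not part of the original solution) is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. A small sketch on made-up labels, with hypothetical names `X_toy` and `y_toy`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 ones and 20 zeros
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2024, stratify=y_toy
)

# With stratify, both splits keep the 80/20 class ratio
print(y_tr.mean(), y_te.mean())
```

Applying this to the homework data would change which rows land in the test set, so the accuracy figures reported in this notebook would no longer be reproduced exactly.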

### Train Logistic Regression model

In [17]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### If the above produces a warning or error, review the message, look up the documentation for LogisticRegression, change the appropriate model hyperparameter, and re-train the model
* Repeat until the model trains without warnings or errors

In [18]:
model = LogisticRegression(max_iter = 400)
model.fit(X_train, y_train)
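
Raising `max_iter` is one of the two remedies the warning suggests; the other is scaling the data. The sketch below shows the scaling route on synthetic data, assuming `make_classification` output with one artificially inflated feature as a stand-in for columns such as Credit_Limit. The names `X_demo`, `y_demo`, and `pipe` are hypothetical, and this is an alternative to the approach used in this notebook, not a replacement for it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is put on a much larger scale
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] *= 10_000

# Standardize features before fitting, keeping the default max_iter
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = pipe.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from whatever data is passed to `fit`, so the same object can later be fitted on `X_train` and scored on `X_test` without leaking test data.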

### Test model

In [19]:
# Generate predictions against the test set
predictions = model.predict(X_test)

# Print predictions
print(predictions)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 0 1 0 1 1 1 1 1 0 

### Model evaluation

In [20]:
# Print model accuracy
accuracy = model.score(X_test, y_test)
print("accuracy =", round((accuracy * 100), 2), "%")

accuracy = 90.67 %
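
For a classifier, `model.score` returns the fraction of test rows predicted correctly. The sketch below reproduces that calculation by hand on made-up labels and also computes recall for the minority class, since accuracy alone can look good on imbalanced data. All values are illustrative.

```python
# Made-up labels for illustration only
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1]

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
toy_accuracy = correct / len(y_true)

# Recall for class 0: share of actual 0s that were predicted as 0
actual_zeros = sum(t == 0 for t in y_true)
caught_zeros = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = caught_zeros / actual_zeros

print(f"accuracy = {toy_accuracy:.2f}")               # 0.75
print(f"recall for class 0 = {minority_recall:.2f}")  # 0.33
```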


### Goal: Improve model performance to have accuracy > 91%

In [21]:
# Display current model hyperparameters (all defaults except max_iter)
model.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 400,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [22]:
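# Re-train model with different hyperparameters
#  NOTE: The default lbfgs solver does not support the l1 penalty, so the solver is changed to liblinear
model = LogisticRegression(max_iter=400, penalty='l1', solver='liblinear')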
model.fit(X_train, y_train)
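
The L1 penalty differs from the default L2 penalty in that it tends to push some coefficients to exactly zero, which acts as a form of feature selection. The sketch below compares the two on synthetic data; the dataset, the value `C=0.1`, and the variable names are assumptions chosen for illustration, not settings from this homework.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 4 of them informative
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# Same solver and regularization strength, different penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_syn, y_syn)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X_syn, y_syn)

# Count coefficients that are exactly zero under each penalty
l1_zeros = int((l1_model.coef_ == 0).sum())
l2_zeros = int((l2_model.coef_ == 0).sum())
print(f"zero coefficients: L1={l1_zeros}, L2={l2_zeros}")
```

On the homework model, inspecting `model.coef_` after training would show which features, if any, the L1 penalty dropped.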

### Test updated model

In [23]:
# Generate predictions against the test set
predictions = model.predict(X_test)

# Print predictions
print(predictions)

[1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1
 0 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 0 1 0 1 1 1 1 1 0 

### Evaluate updated model

In [24]:
# Print model accuracy
accuracy = model.score(X_test, y_test)
print("accuracy =", round((accuracy * 100), 2), "%")


accuracy = 91.33 %
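
To put the improvement in perspective, the sketch below converts the two reported accuracies into counts of correct predictions. It assumes a test set of 750 rows, which follows from the 2,500-row dataframe and `test_size=0.3`.

```python
n_test = round(2500 * 0.3)  # rows in the test set, from df.shape and test_size

correct = {}
for label, reported_pct in (("first model", 90.67), ("updated model", 91.33)):
    n_correct = round(reported_pct / 100 * n_test)
    # Sanity check: the count reproduces the reported percentage
    assert round(n_correct / n_test * 100, 2) == reported_pct
    correct[label] = n_correct

print(correct)
print("extra correct predictions:", correct["updated model"] - correct["first model"])  # 5
```

With a gain this small on a single split, part of the difference may come from the particular random split; cross-validation would give a steadier estimate.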
