# [3920] Homework #1 - Logistic Regression
Data file: https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/credit_card_churners_1_10k.csv

## Homework Submission Rules (for all homework assignments)
* Homework is due by 2:30 PM on the due date
  * No late submission will be accepted
* Verify that you are submitting the correct homework file
* Homework file naming convention
  * LastName_FirstName_HwX.ipynb  [Replace X with the homework #]
    * 1 point deducted for submitting homework not complying with naming convention
* Before submission, execute "Kernel -> Restart Kernel and Run All Cells"
  * 1 point deducted for not submitting a cleanly executed notebook

## Homework #1 Requirements
* Load data into dataframe
* Examine data
* Use SimpleImputer to replace missing values
* Prepare data for model training
* Train Logistic Regression model (change hyperparameters and re-train as needed)
* Test model and evaluate model performance metrics

In [1]:
from datetime import datetime
# Explicit format codes; the %D and %T shortcuts are not supported on every platform
print(f'Run time: {datetime.now().strftime("%m/%d/%y %H:%M:%S")}')

Run time: 10/05/23 18:17:52


### Import libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

### Load data

#### Credit Card Churn Prediction
* https://www.kaggle.com/datasets/anwarsan/credit-card-bank-churn

Business Problem  
A business manager at a consumer credit card bank is facing customer attrition. The manager wants to analyze the data to find the reasons behind it and use those findings to predict which customers are likely to leave.

Columns
* CLIENTNUM: Client number. Unique identifier for the customer holding the account
* Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
* Customer_Age: Age in Years
* Gender: Gender of the account holder
* Dependent_count: Number of dependents
* Education_Level: Educational Qualification of the account holder - High School, College, Post-Graduate
* Marital_Status: Marital Status of the account holder
* Income_Category: Annual Income Category of the account holder
* Card_Category: Type of Card
* Months_on_book: Period of relationship with the bank
* Total_Relationship_Count: Total no. of products held by the customer
* Months_Inactive_12_mon: No. of months inactive in the last 12 months
* Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
* Credit_Limit: Credit Limit on the Credit Card
* Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
* Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
* Total_Trans_Amt: Total Transaction Amount (Last 12 months)
* Total_Trans_Ct: Total Transaction Count (Last 12 months)
* Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
* Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
* Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

In [3]:
# Read data from file (credit_card_churners_1_10k.csv) into dataframe
#  NOTE: Use CLIENTNUM as the index column
file_path = "https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/credit_card_churners_1_10k.csv"
df = pd.read_csv(file_path, index_col="CLIENTNUM")
df

CLIENTNUM,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
712965183,1,63.0,2,1.0,0.0,0.0,52,5,2,3,...,0.416,1188,35,0.750,0.781,1,0,0,1,0
714225333,1,48.0,4,1.0,0.0,0.0,36,5,1,1,...,0.661,1545,21,0.909,0.264,1,0,0,1,0
710512833,1,38.0,2,1.0,0.0,0.0,29,6,1,1,...,0.615,5178,79,0.756,0.405,1,0,0,1,0
716396358,1,52.0,2,1.0,1.0,0.0,47,5,3,0,...,0.921,1531,35,0.667,0.619,0,1,0,1,0
715609533,0,47.0,3,0.0,0.0,0.0,35,1,3,3,...,0.621,1887,36,0.333,0.000,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
771219033,1,42.0,1,0.0,2.0,0.0,29,4,3,3,...,0.447,4199,73,0.659,0.113,0,1,1,0,0
712864683,0,42.0,3,2.0,0.0,0.0,30,2,3,3,...,0.373,1693,37,0.423,0.000,1,0,0,0,1
788350908,1,38.0,3,1.0,1.0,0.0,31,3,1,3,...,0.651,3548,69,0.500,0.207,0,1,0,0,1
713725683,1,36.0,0,1.0,0.0,0.0,27,6,3,2,...,0.625,2614,79,0.646,0.562,1,0,0,0,1


### Examine data

In [4]:
# Review dataframe shape
df.shape

(10000, 23)

In [5]:
# Display first few rows of dataframe
df.head(10)

CLIENTNUM,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
712965183,1,63.0,2,1.0,0.0,0.0,52,5,2,3,...,0.416,1188,35,0.75,0.781,1,0,0,1,0
714225333,1,48.0,4,1.0,0.0,0.0,36,5,1,1,...,0.661,1545,21,0.909,0.264,1,0,0,1,0
710512833,1,38.0,2,1.0,0.0,0.0,29,6,1,1,...,0.615,5178,79,0.756,0.405,1,0,0,1,0
716396358,1,52.0,2,1.0,1.0,0.0,47,5,3,0,...,0.921,1531,35,0.667,0.619,0,1,0,1,0
715609533,0,47.0,3,0.0,0.0,0.0,35,1,3,3,...,0.621,1887,36,0.333,0.0,1,0,0,0,1
715240683,1,46.0,4,0.0,2.0,0.0,37,3,2,3,...,0.461,3718,66,0.61,0.084,0,1,0,1,0
711422358,1,42.0,5,2.0,1.0,0.0,36,1,2,2,...,0.757,14955,118,0.662,0.221,0,1,1,0,0
711802758,1,63.0,1,1.0,1.0,0.0,53,4,1,2,...,0.583,1700,42,0.355,0.309,0,1,0,1,0
710694033,1,32.0,0,1.0,1.0,0.0,36,2,1,1,...,0.98,13400,104,0.733,0.168,0,1,1,0,0
713995233,1,38.0,2,0.0,0.0,0.0,36,6,4,3,...,0.641,4707,88,0.692,0.0,1,0,0,1,0


In [6]:
# Display distribution counts for target variable Attrition_Flag
df["Attrition_Flag"].value_counts()

Attrition_Flag
1    8392
0    1608
Name: count, dtype: int64
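
The counts above show an imbalanced target, so it helps to know what accuracy a trivial "always predict the majority class" rule would reach before judging any model. The following is a minimal sketch that hard-codes the two counts from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
counts = {1: 8392, 0: 1608}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a rule that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(f"Total rows: {total}")
print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy figure reported later is easier to interpret when set against this baseline.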

### Prepare data

#### Check for missing values

In [7]:
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

Customer_Age    502
dtype: int64


#### Use the SimpleImputer to replace missing values

In [8]:
imputer = SimpleImputer(strategy="mean")

In [9]:
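# Pass index=df.index so the CLIENTNUM row labels survive the conversion back to a dataframe
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)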
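
To make the effect of `strategy="mean"` concrete, here is a small sketch on a hypothetical four-row dataframe (the values and the index labels are made up for illustration, not taken from the homework data). `SimpleImputer` returns a NumPy array, so the column names and index are passed back in when rebuilding the dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one missing Customer_Age value
toy = pd.DataFrame(
    {
        "Customer_Age": [40.0, np.nan, 50.0, 60.0],
        "Dependent_count": [1, 2, 3, 4],
    },
    index=[101, 102, 103, 104],
)

imputer_demo = SimpleImputer(strategy="mean")

# fit_transform learns each column mean and fills NaN with it
toy_imputed = pd.DataFrame(
    imputer_demo.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy_imputed)
print("Learned column means:", imputer_demo.statistics_)
```

The missing age is replaced by the mean of the observed ages (40, 50, 60), and columns without missing values are left unchanged.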

#### Check for missing values again

In [10]:
print(df_imputed.isnull().sum())

Attrition_Flag              0
Customer_Age                0
Dependent_count             0
Education_Level             0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
Gender_F                    0
Gender_M                    0
Marital_Status_Divorced     0
Marital_Status_Married      0
Marital_Status_Single       0
dtype: int64


### Separate independent and dependent variables
* Independent variables: All variables except Attrition_Flag
* Dependent variable: Attrition_Flag

In [11]:
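# NOTE: X is built from the original df, so Customer_Age still contains missing values at this point
X = df.drop(columns=["Attrition_Flag"])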

In [12]:
y = df["Attrition_Flag"]

### Split data into training and test sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (8000, 22)
X_test shape: (2000, 22)
y_train shape: (8000,)
y_test shape: (2000,)
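
With an imbalanced target, a plain random split can leave the training and test sets with slightly different class proportions. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data (`X_demo` and `y_demo` are made-up stand-ins with roughly the same class balance as the homework data) to show that the proportions are preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 84% class 1, 16% class 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 840 + [0] * 160)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```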


### Train Logistic Regression model

In [15]:
## model = LogisticRegression()
## model.fit(X_train, y_train)

### If the cell above raises an error, review the error message, consult the LogisticRegression documentation, and change the model hyperparameters accordingly

In [16]:
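# Fit the imputer on the training split only, then reuse the learned means on the test split
imputer = SimpleImputer(strategy="mean")
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

# Raise max_iter above the default of 100 to give the lbfgs solver more iterations
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)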
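
The impute-then-fit steps above can also be chained in a scikit-learn pipeline. The sketch below is one possible variant, not the approach required by the assignment: it adds a `StandardScaler` step, on the assumption that features on very different scales may be part of why the solver needs more iterations. The data is synthetic and the column meanings in the comments are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two features on very different scales
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([
    rng.normal(45, 8, n),       # age-like feature
    rng.normal(8000, 3000, n),  # credit-limit-like feature
])
y_demo = (X_demo[:, 1] + rng.normal(0, 2000, n) > 8000).astype(int)

# Knock out about 5% of the first column to mimic missing ages
X_demo[rng.random(n) < 0.05, 0] = np.nan

# Impute, scale, then fit; each step is fitted only on the data passed to fit()
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
pipe.fit(X_demo, y_demo)

train_accuracy = pipe.score(X_demo, y_demo)
print(f"Training accuracy on synthetic data: {train_accuracy:.3f}")
```

Because the pipeline bundles the imputer with the model, calling `pipe.predict` on new data applies the same learned means and scaling automatically.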

### Test model

In [17]:
# Generate predictions against the test set
y_pred = model.predict(X_test)
print(y_pred)

[1 0 1 ... 1 1 1]


### Model evaluation

In [18]:
# Print model accuracy
print(accuracy_score(y_test, y_pred))

0.888


In [19]:
# Print classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.45      0.56       322
           1       0.90      0.97      0.94      1678

    accuracy                           0.89      2000
   macro avg       0.83      0.71      0.75      2000
weighted avg       0.88      0.89      0.88      2000



In [20]:
# Print confusion matrix
print(confusion_matrix(y_test, y_pred))

[[ 144  178]
 [  46 1632]]
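
The accuracy and the per-class precision, recall, and F1 values reported above can all be recovered from this confusion matrix. The sketch below hard-codes the matrix from the output above (rows are true labels 0 and 1, columns are predicted labels 0 and 1, which is the layout `confusion_matrix` uses) and recomputes each metric.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: true class (0, 1); columns: predicted class (0, 1)
cm = np.array([[144, 178],
               [46, 1632]])

correct = np.diag(cm)

accuracy = correct.sum() / cm.sum()
recall = correct / cm.sum(axis=1)     # correct / number of true members of each class
precision = correct / cm.sum(axis=0)  # correct / number of predictions of each class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.3f}")
for label in (0, 1):
    print(
        f"class {label}: precision={precision[label]:.2f} "
        f"recall={recall[label]:.2f} f1={f1[label]:.2f}"
    )
```

The recomputed values match the classification report: of the 322 test rows with true label 0, only 144 are predicted as 0, which is where the 0.45 recall for that class comes from.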
