#!/usr/bin/env python
# coding: utf-8
# ## Loading Libraries
# In[1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder,StandardScaler,LabelBinarizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,recall_score,f1_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as make_pipeline_imb
import warnings
warnings.filterwarnings('ignore')
# ## Loading the dataset
# In[2]:
df = pd.read_csv('Churn_Modelling.csv')
df.head()
# In[3]:
df.info()
# In[4]:
df.describe()
# In[5]:
df.isna().sum()
# ## Exploratory Data Analysis
# In[6]:
plt.figure(figsize=(4,4))
output_counts = df['Exited'].value_counts()
plt.pie(output_counts, labels=output_counts.index, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Distribution of Exited (Churn) Status \n')
plt.ylabel('')
plt.show()
# #### Observation:
#
# The target column looks imbalanced, so accuracy alone would not be a good metric for evaluating the model, and we will need balancing techniques such as SMOTE or class weights while modeling
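#
# A minimal sketch of why accuracy misleads on an imbalanced target. The helper names below are illustrative and not part of the original notebook; they are plain Python, so they should accept any iterable of labels, including `df['Exited']`.
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that fall in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    return max(class_shares(labels).values())


# Usage (assumes `df` from the cells above): majority_baseline_accuracy(df['Exited'])
# Any model should be judged against this baseline rather than against 50%.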
# In[27]:
print("Box plot for numerical features:")
plt.figure(figsize=(18,22))
numeric_features = ['Age','Tenure','EstimatedSalary']
for i, column in enumerate(numeric_features):
    plt.subplot(4,2, i + 1)
    sns.boxplot(x=df[column], color='skyblue', width=0.4)
    plt.title(column)
    plt.xlabel('Value')
    plt.ylabel('Frequency')
# #### Observation:
#
# It does not look like there are many outliers in the numerical features, except Age - but we can't do much about it
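#
# To put a number on what the box plots show, the sketch below applies the usual 1.5 x IQR rule, which is also seaborn's default whisker length (`whis=1.5`). The function name is hypothetical, and the quartiles come from the standard library, so they may differ slightly from other quantile conventions.
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    data = sorted(values)
    # 'inclusive' interpolates linearly between the observed data points
    q1, _, q3 = statistics.quantiles(data, n=4, method='inclusive')
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in data if v < lower or v > upper]


# Usage (assumes `df` from the cells above): len(iqr_outliers(df['Age']))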
# In[8]:
print("Count plot for categorical features:")
plt.figure(figsize=(20,22))
for i, column in enumerate(['Geography','Gender','HasCrCard','IsActiveMember']):
    plt.subplot(4,2,i + 1)
    sns.countplot(x=column, data=df)
    plt.title(column)
    plt.xlabel('Value')
    plt.ylabel('Frequency')
# #### Observation:
#
# The count plots indicate that there are more customers from France than Spain or Germany, the gender distribution is nearly balanced, most customers have a credit card, and there is a higher proportion of active members
# In[9]:
print("Observing independent variables based on the Exited Column:")
fig, axes = plt.subplots(3,2, figsize=(24,20))
sns.boxplot(data=df, y='CreditScore', x ='Exited', ax=axes[0,0])
axes[0,0].set_title('Credit Score Distribution by Churn Status')
sns.boxplot(data=df, y='Age', x ='Exited', ax=axes[0,1])
axes[0,1].set_title('Age Distribution by Churn Status')
axes[0,1].tick_params(axis='x', rotation=45)
sns.boxplot(data=df, y='Tenure', x ='Exited', ax=axes[1,0])
axes[1,0].set_title('Tenure Distribution by Churn Status')
axes[1,0].tick_params(axis='x', rotation=45)
sns.boxplot(data=df, y='Balance', x ='Exited', ax=axes[1,1])
axes[1,1].set_title('Balance Distribution by Churn Status')
sns.boxplot(data=df, y='EstimatedSalary', x= 'Exited', ax=axes[2,0])
axes[2,0].set_title('Estimated Salary Distribution by Churn Status')
axes[2,1].axis('off')
plt.show()
# #### Observation:
#
# The credit score distribution between customers who churned and those who did not churn is similar, indicating that the credit score may not be a strong predictor of churn.
#
# The tenure distribution suggests that customers with a shorter tenure are slightly more likely to churn than those with a longer tenure.
#
# Age distribution shows a more pronounced difference - older customers appear more likely to churn than younger ones.
#
# Lastly, the balance distribution is significantly different, with churned customers having higher balances on average. This could indicate that customers with higher balances are at a higher risk of churn.
#
# The estimated salary distribution does not show a stark difference between the churned and retained customers, suggesting that salary may not be a primary factor in the decision to churn.
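#
# A rough numeric companion to the box plots above: the mean and median of a feature for each churn status. This is only a sketch with a hypothetical helper name; it takes two parallel iterables, so pandas columns can be passed directly.
import statistics
from collections import defaultdict


def summarize_by_group(values, groups):
    """Return {group: (mean, median)} for values split by group label."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {
        group: (sum(items) / len(items), statistics.median(items))
        for group, items in buckets.items()
    }


# Usage (assumes `df` from the cells above): summarize_by_group(df['Balance'], df['Exited'])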
# In[10]:
plt.figure(figsize=(12, 5))
sns.histplot(data=df, x='CreditScore', hue='Exited', kde=True)
plt.title('Observing the Distribution of Credit Score based on the Exited column')
plt.show()
# #### Observation:
#
# By observing the distribution of credit scores among customers who have churned versus those who haven't, we can see that there is no stark contrast between the two groups, implying that credit score alone may not be a strong predictor of churn within this dataset.
# In[11]:
bins = [0,669,739,850]
labels = ['Low','Medium','High']
df['CreditScoreGroup'] = pd.cut(df['CreditScore'], bins=bins, labels=labels, include_lowest=True)
plt.figure(figsize=(6,3))
sns.countplot(x = 'CreditScoreGroup', hue = 'Exited', data = df)
plt.title('Observing the Credit Score buckets based on the Exited column')
plt.show()
# #### Observation:
#
# The majority of customers who churned are in the 'High' credit score group, suggesting that credit score might have some influence on churn, albeit not a straightforward one
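#
# Raw counts per bucket depend on how many customers each bucket holds, so it can help to look at the churn rate within each bucket as well. The helper below is an illustrative sketch (its name is not from the original notebook) and assumes churn is coded as 1, as in the `Exited` column.
from collections import Counter


def churn_rate_by_group(groups, exited):
    """Return {group: fraction of that group's rows where exited == 1}."""
    totals, churned = Counter(), Counter()
    for group, flag in zip(groups, exited):
        totals[group] += 1
        churned[group] += int(flag == 1)
    return {group: churned[group] / totals[group] for group in totals}


# Usage (assumes `df` from the cells above): churn_rate_by_group(df['CreditScoreGroup'], df['Exited'])
# A bucket can hold the most churners simply because it is the largest, even when its rate matches the others.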
# In[12]:
plt.figure(figsize=(5,5))
sns.scatterplot(x='Tenure', y='Balance', hue='Exited', data=df)
plt.title('Observing Balance against Tenure based on the Exited column')
plt.show()
# #### Observation:
#
# The scatter plot indicates that there is no clear pattern or correlation between tenure and balance for both churned and retained customers, implying these factors independently do not strongly predict customer churn
# ## Feature Engineering
# ### *Exploring the relationship between features*
#
# a. **Credit Utilization**: Ratio of balance to credit score. Because a credit score is not a credit limit, this is only a loose proxy for how heavily the customer relies on credit, not utilization in the strict sense
#
# b. **Interaction Score**: A composite score based on the number of products, active membership, and credit card possession which can give a holistic view of a customer's engagement with the bank. Higher engagement levels might be associated with lower churn rates
#
# c. **Balance To Salary Ratio**: Ratio of the customer's balance to their estimated salary. This feature can indicate how significant the customer's balance is in relation to their income
#
# d. **Credit Score Age Interaction**: An interaction term between credit score and age to explore if the impact of credit score on churn varies across different age groups
# In[13]:
df['CreditUtilization']=df['Balance']/df['CreditScore']
df['InteractionScore']=df['NumOfProducts']+df['HasCrCard']+df['IsActiveMember']
df['BalanceToSalaryRatio']=df['Balance']/df['EstimatedSalary']
df['CreditScoreAgeInteraction']=df['CreditScore']*df['Age']
# In[14]:
plt.figure(figsize = (8,8))
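# numeric_only=True skips text/categorical columns (Surname, Geography, Gender, CreditScoreGroup), which recent pandas versions refuse to correlate
sns.heatmap(df.drop(['RowNumber','CustomerId'],axis=1).corr(numeric_only=True), annot=True, fmt='.2f')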
plt.show()
# #### Observations:
#
# Notable correlations include a moderate positive relationship between Age and Exited, suggesting older customers are more likely to churn, and a strong positive relationship between Balance and CreditUtilization, which is intuitive as higher balances would likely increase credit utilization rates
# In[15]:
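# numeric_only=True restricts the correlation to numeric columns, which recent pandas versions require
correlation_matrix = df.drop(['RowNumber','CustomerId'],axis=1).corr(numeric_only=True)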
target_correlations = correlation_matrix['Exited']
print(target_correlations)
# #### Observation:
#
# 1. Age shows a moderately positive correlation, suggesting older customers are more likely to churn
# 2. IsActiveMember has a negative correlation, indicating that active members are less likely to churn
# 3. High CreditUtilization and NumOfProducts also appear to be associated with a higher likelihood of churning
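#
# For reference, `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below spells that formula out in plain Python for one feature against a 0/1 target; the function name is illustrative only, and it assumes neither input is constant.
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Co-variation of the two series around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Usage (assumes `df` from the cells above): pearson(df['Age'], df['Exited'])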
# ## Modeling
# In[16]:
cat_col = ['Geography','Gender','CreditScoreGroup']
print("Observing the categorical column distribution before encoding: \n")
for column in cat_col:
    print(column, '\n')
    print(df[column].value_counts(),'\n')
# In[25]:
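# LabelEncoder assigns integer codes in sorted label order, so the codes carry no ordinal meaning
encoder = LabelEncoder()
for column in cat_col:
    # fit_transform refits the encoder on each column in turn
    df[column] = encoder.fit_transform(df[column])
print("Observing the categorical column distribution after encoding: \n")
for column in cat_col:
    print(column, '\n')
    print(df[column].value_counts(),'\n')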
# In[18]:
col_drop = ['Exited','RowNumber','CustomerId','Surname']
X = df.drop(col_drop, axis=1)
y = df['Exited']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaling_columns = ['Age','CreditScore','Balance','EstimatedSalary','CreditUtilization','BalanceToSalaryRatio','CreditScoreAgeInteraction']
scaler = StandardScaler()
scaler.fit(X_train[scaling_columns])
X_train[scaling_columns] = scaler.transform(X_train[scaling_columns])
X_test[scaling_columns] = scaler.transform(X_test[scaling_columns])
# In[24]:
print("Training dataset shape:", X_train.shape, y_train.shape)
print("Test dataset shape:",X_test.shape, y_test.shape)
# In[22]:
# Imbalance is handled per model: class weights, SMOTE inside an imblearn pipeline, or scale_pos_weight
models = {
    'Logistic Regression': LogisticRegression(random_state=42, class_weight='balanced'),
    'Random Forest': RandomForestClassifier(random_state=42, class_weight='balanced'),
    'K-Nearest Neighbors': make_pipeline_imb(SMOTE(random_state=42), KNeighborsClassifier()),
    'Support Vector Machine': make_pipeline_imb(SMOTE(random_state=42), SVC(probability=True, random_state=42)),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss', scale_pos_weight=(len(y_train) - sum(y_train)) / sum(y_train), random_state=42),
    'Gradient Boosting': make_pipeline_imb(SMOTE(random_state=42), GradientBoostingClassifier(random_state=42))
}
# Collect one row per model and build the frame at the end (DataFrame.append was removed in pandas 2.0)
results = []
lb = LabelBinarizer()
lb.fit(y_train)
for name, model in models.items():
    print(f"Model: {name}")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred),'\n')
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred),'\n')
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy Score: {accuracy} \n")
    recall = recall_score(y_test, y_pred, pos_label=1)
    print(f"Recall Score: {recall}")
    f1 = f1_score(lb.transform(y_test), lb.transform(y_pred), pos_label=1)
    print(f"F1 Score: {f1}")
    # ROC AUC needs predicted probabilities for the positive class
    if hasattr(model, "predict_proba"):
        roc_auc = roc_auc_score(lb.transform(y_test), model.predict_proba(X_test)[:, 1])
        print(f"ROC AUC Score: {roc_auc}")
    else:
        roc_auc = None
    results.append({'Model': name, 'Accuracy': accuracy, 'Recall Score': recall, 'F1 Score': f1, 'ROC AUC Score': roc_auc})
    print("-" * 50,'\n')
results_df = pd.DataFrame(results, columns=['Model','Accuracy','Recall Score','F1 Score','ROC AUC Score'])
# In[23]:
results_df
# From the results of the classification models on the churn prediction dataset, we can infer the following:
#
# 1. **Gradient Boosting** has the highest F1 score (0.598391) and the highest ROC AUC score (0.859767) among all the models. This suggests that Gradient Boosting is the most effective model at balancing precision and recall and is the best at distinguishing between churned and non-churned customers.
#
# 2. **XGBoost** also performs well, with a relatively high F1 score (0.586974) and a good ROC AUC score (0.841784). This indicates that XGBoost is another strong model for this task.
#
# 3. **Random Forest** has a high accuracy (0.862000) but a lower F1 score (0.538976) compared to Gradient Boosting and XGBoost. This suggests that while Random Forest is good at predicting the majority class (non-churned customers), it might not be as effective at identifying the minority class (churned customers).
#
# 4. **Support Vector Machine** and **K-Nearest Neighbors** have moderate F1 scores and ROC AUC scores. They perform better than Logistic Regression but are not as effective as Gradient Boosting or XGBoost for this dataset.
#
# 5. **Logistic Regression** has the lowest accuracy (0.703667), F1 score (0.473029), and ROC AUC score (0.764076) among all the models. This indicates that Logistic Regression is the least effective model for predicting customer churn in this dataset.
#
# #### Overall:
#
# Gradient Boosting appears to be the best model for this churn prediction task, followed closely by XGBoost. These models are able to better handle the class imbalance and provide a good balance between precision and recall.
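#
# To make the precision/recall trade-off concrete, the sketch below derives precision, recall, and F1 from the four cells of a binary confusion matrix. The function name is illustrative, and the argument order follows `confusion_matrix(y_test, y_pred).ravel()`, which yields tn, fp, fn, tp for 0/1 labels.
def precision_recall_f1(tn, fp, fn, tp):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Usage (assumes the modeling loop above): precision_recall_f1(*confusion_matrix(y_test, y_pred).ravel())
# tn is accepted only to match that order; it does not enter any of the three scores, which is why a model can post high accuracy on the majority class and still have a low F1.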