# Problem Statement:

Predict churn from the bank customer dataset.

### Dataset Content:


* RowNumber—corresponds to the record (row) number and has no effect on the output.
* CustomerId—contains random values and has no effect on customer leaving the bank.
* Surname—the surname of a customer has no impact on their decision to leave the bank.
* CreditScore—can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.
* Geography—a customer’s location can affect their decision to leave the bank.
* Gender—it’s interesting to explore whether gender plays a role in a customer leaving the bank.
* Age—this is certainly relevant, since older customers are less likely to leave their bank than younger ones.
* Tenure—refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
* Balance—also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
* NumOfProducts—refers to the number of products that a customer has purchased through the bank.
* HasCrCard—denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
* IsActiveMember—active customers are less likely to leave the bank.
* EstimatedSalary—as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
* Exited—whether or not the customer left the bank.


In [None]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import matplotlib.pyplot as plt 
import seaborn as sns  
from sklearn.model_selection import train_test_split, cross_val_score


from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
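from sklearn.metrics import confusion_matrix, classification_report, RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; keep the old name as a thin alias
plot_roc_curve = RocCurveDisplay.from_estimator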

In [None]:
df = pd.read_csv('/kaggle/input/churn-for-bank-customers/churn.csv')
df.head()

In [None]:
df.info()
# No NaN values

### Remove useless columns and see distributions of the variables

In [None]:
# remove useless columns
df.drop(["RowNumber","CustomerId","Surname"], axis = 1, inplace = True)

# Plot histogram grid
df.hist(figsize=(14,14))
plt.show()

### Looking at Correlations between the variables

In [None]:
# Calculate correlations between numeric features
# (numeric_only needs pandas >= 1.5; without it, pandas >= 2.0 raises on the text columns)
correlations = df.corr(numeric_only=True)

# Sort features in order of their correlation with "Exited"
sort_corr_cols = correlations.Exited.sort_values(ascending=False).keys()
sort_corr = correlations.loc[sort_corr_cols,sort_corr_cols]

# Generate a mask for the upper triangle (including the diagonal)
corr_mask = np.triu(np.ones_like(sort_corr, dtype=bool))

# Make the figsize 9x9
plt.figure(figsize=(9,9))

# Plot heatmap of annotated correlations (shown as percentages)
sns.heatmap(sort_corr*100,
            cmap='RdBu',
            annot=True,
            fmt='.0f',
            mask=corr_mask)

plt.title('Correlations by Exited', fontsize=14)
plt.yticks(rotation=0)
plt.show()

In [None]:
def kdeplot(feature):
    # Overlay the feature's density for retained (0) and churned (1) customers
    plt.figure(figsize=(9, 4))
    plt.title(f"KDE Plot for {feature}")
    sns.kdeplot(df[df['Exited'] == 0][feature].dropna(), color='dodgerblue', label='Exited - 0')
    sns.kdeplot(df[df['Exited'] == 1][feature].dropna(), color='orange', label='Exited - 1')
    plt.legend()  # without this call the labels are never displayed

### Looking at Variables with low correlation with the target variable

In [None]:
kdeplot('Tenure')
kdeplot('HasCrCard')
kdeplot('EstimatedSalary')

Although these variables correlate only weakly with the target, they still seem to contribute to predicting the final outcome.
The KDE plots show how each feature's distribution differs between customers who exited and those who stayed: the more the two curves differ, the more that feature helps separate the classes.
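
Reading the difference between two KDE curves by eye is subjective, so here is a rough sketch of one way to put a number on it. It uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs), which is a swapped-in technique and not something the notebook computes. The helper names `ks_statistic` and `separation_by_target` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # evaluate both CDFs at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def separation_by_target(frame, features, target="Exited"):
    """KS statistic of each feature between rows with target == 0 and target == 1."""
    stayed = frame[frame[target] == 0]
    left = frame[frame[target] == 1]
    scores = {f: ks_statistic(stayed[f].dropna(), left[f].dropna()) for f in features}
    return pd.Series(scores).sort_values(ascending=False)

# Hypothetical toy data, NOT the bank dataset
toy = pd.DataFrame({
    "Exited":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Separated":   [1, 2, 3, 4, 10, 11, 12, 13],
    "Overlapping": [1, 2, 3, 4, 1, 2, 3, 4],
})
toy_scores = separation_by_target(toy, ["Separated", "Overlapping"])
print(toy_scores)
```

On the notebook's data the same helper could be called as `separation_by_target(df, ['Tenure', 'HasCrCard', 'EstimatedSalary'])`; values close to 0 would suggest the two curves barely differ.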

### Finding and Removing Outliers in numerical variables

In [None]:
outlier_plot = ["CreditScore","Age","Tenure","Balance","NumOfProducts","EstimatedSalary"]
for i in outlier_plot:
    sns.boxplot(x = df[i])
    plt.show()

In [None]:
# Seems like CreditScore, Age, NumOfProducts have outliers
outliers = ['Age','CreditScore','NumOfProducts']

In [None]:
def outlier_removal(df,column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    cleaned_data = df.loc[(df[column] > fence_low) & (df[column] < fence_high)]
    return cleaned_data
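
To make the IQR rule used above concrete, here is a small worked example on made-up numbers (the ages below are hypothetical, not taken from the dataset). The helper `iqr_fences` is introduced only for this illustration; it computes the same fences as `outlier_removal` and keeps rows strictly inside them.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return the (low, high) fences at k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages with one extreme value
ages = pd.Series([30, 32, 35, 38, 40, 42, 45, 92], name="Age")

low, high = iqr_fences(ages)
kept = ages[(ages > low) & (ages < high)]  # strict inequalities, as in outlier_removal

print(f"fences: ({low}, {high})")
print(f"kept {len(kept)} of {len(ages)} values")
```

Note that each column's fences depend on the rows still present, so removing outliers column by column can give slightly different results depending on the order.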

In [None]:
# clean the dataset by removing outliers
df_cleaned = df
for column in outliers:  # ['Age', 'CreditScore', 'NumOfProducts'], applied in that order
    df_cleaned = outlier_removal(df_cleaned, column)

print(df.shape)
print(df_cleaned.shape)

### Looking at Unique data for Encoding

In [None]:
def unique_counts(df):
    for column in df.columns:
        print(f'{column} :  {df[column].nunique()}')
unique_counts(df_cleaned)

In [None]:
# Gender and Geography need to be encoded
df_cleaned = pd.get_dummies(df_cleaned, columns = ["Geography"])
# Map Gender explicitly instead of running replace over every column
df_cleaned["Gender"] = df_cleaned["Gender"].map({'Female': 0, 'Male': 1})
df_cleaned
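
As a quick illustration of what the encoding step produces, here is the same idea on a tiny made-up frame (the rows are hypothetical). `Geography` is one-hot encoded into one column per country, and `Gender` is mapped to 0/1. Passing `dtype=int` is an optional choice made here so the dummy columns print as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# One column per country; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["Geography"], dtype=int)

# Binary mapping for Gender
encoded["Gender"] = encoded["Gender"].map({"Female": 0, "Male": 1})

print(encoded)
```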

# Modeling

In [None]:
X = df_cleaned.drop(["Exited"], axis=1)
Y = df_cleaned["Exited"]

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 0)

In [None]:
# Helper function for confusion matrix and classification report
def evaluate_model(classifier):
    y_pred = classifier.predict(x_test)  # predict once and reuse
    cf_matrix = confusion_matrix(y_test, y_pred)
    # Show each cell as a share of all test samples
    sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
                fmt='.2%', cmap='Blues')

    print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
# Helper function for test accuracy and 10-fold cross-validation on the training set
def score_model(classifier):
    print(f"Test accuracy: {classifier.score(x_test,y_test)}")
    val = cross_val_score(estimator = classifier, X = x_train, y = y_train, cv = 10)
    print(f"Cross-validation mean: {val.mean()} and std: {val.std()}")

## Logistic Regression

In [None]:
log_clsf = LogisticRegression(max_iter=10000)
log_clsf.fit(x_train,y_train)

score_model(log_clsf)

In [None]:
evaluate_model(log_clsf)

plot_roc_curve(log_clsf, x_test, y_test)  
plt.show() 

## Random Forest

In [None]:
rf_clsf = RandomForestClassifier(random_state = 42, max_depth = 10, n_estimators = 1000)
rf_clsf.fit(x_train, y_train)

score_model(rf_clsf)

In [None]:
evaluate_model(rf_clsf)

plot_roc_curve(rf_clsf, x_test, y_test)  
plt.show() 

## Support Vector Machine (SVM) 

In [None]:
svm_clsf = SVC()
svm_clsf.fit(x_train, y_train)

score_model(svm_clsf)

In [None]:
evaluate_model(svm_clsf)

plot_roc_curve(svm_clsf, x_test, y_test)  
plt.show() 

## KNN

In [None]:
k_values = range(1, 21)
best_knn = []
error = []
for K in k_values:
    model = KNeighborsClassifier(n_neighbors = K)

    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    error.append(np.mean(pred != y_test))
    best_knn.append(model.score(x_test,y_test))

# Get the best-fitting number of neighbours (K starts at 1, so offset the list index)
for i,v in enumerate(best_knn):
    if v == max(best_knn):
        print(f'best n_neighbours = {i + 1}')

curve = pd.Series(error, index=k_values) # elbow curve, indexed by K
curve.plot()
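
The loop above picks K by looking at test-set accuracy, which tends to make the final test score a little optimistic. One possible alternative, sketched here on synthetic data from `make_classification` (not the bank dataset), is to choose K by cross-validation on the training split only. The scaling step is also an addition of this sketch, since KNN is distance based; neither is what the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; every name ending in _demo is hypothetical
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scale inside the pipeline so each CV fold is scaled using only its own training part
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Try K = 1..20 with 5-fold cross-validation on the training split
search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 21))}, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # test data is touched only once, at the end
print(f"best n_neighbors = {best_k}, test accuracy = {test_accuracy:.3f}")
```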

In [None]:
knn_clsf = KNeighborsClassifier(n_neighbors=15)
knn_clsf.fit(x_train, y_train)

score_model(knn_clsf)

In [None]:
evaluate_model(knn_clsf)

plot_roc_curve(knn_clsf, x_test, y_test)  
plt.show() 

## Voting Classification

In [None]:
voting_classfication = VotingClassifier(estimators = [('lg', log_clsf), ('rfg', rf_clsf), ('svc', svm_clsf), ('knn', knn_clsf)])
voting_classfication.fit(x_train, y_train)

print("Test accuracy: ", voting_classfication.score(x_test,y_test))

In [None]:
evaluate_model(voting_classfication)

# Summary:

The best model among those implemented is Random Forest, with an accuracy of 86.8%.
The other models reach accuracies of around 80%, but their AUC is poor, so they shouldn't be used in a real-world scenario.
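
To see why a decent accuracy can sit next to a poor AUC, here is a minimal sketch with a baseline that always predicts the majority class. The 80/20 class split and the random features are assumptions made up for this illustration; they are not the measured class balance of the bank dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 800 retained (0) and 200 churned (1)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))  # features are ignored by the baseline
y_demo = np.array([0] * 800 + [1] * 200)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

acc = accuracy_score(y_demo, baseline.predict(X_demo))
auc = roc_auc_score(y_demo, baseline.predict_proba(X_demo)[:, 1])
print(f"accuracy = {acc:.2f}, AUC = {auc:.2f}")
```

The baseline never identifies a single churned customer, yet its accuracy equals the share of the majority class, while its AUC sits at chance level. This is why AUC is worth checking alongside accuracy when the classes are imbalanced.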



Do post a comment if you have any suggestions!