# Credit Cards - Prediction of Churning Customers

#### Step-by-step guide to building a Random Forest classifier to predict customer churn

1. Data Understanding
2. Data Preparation
3. Feature Engineering
4. Feature Extraction
5. Training
6. Evaluation

# Data Understanding

**Importing the necessary libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

**Loading the dataset**

In [None]:
path = '/kaggle/input/credit-card-customers/'

data = pd.read_csv(path + 'BankChurners.csv')

# drop last two columns
data = data.iloc[:, :-2]
data.head()

**Preliminary Data Description**

In [None]:
print(f"""
    No. of samples  : {data.shape[0]}
    No. of features : {data.shape[1]}
    Missing values  : {data.isnull().sum().sum()}
""")

# Data Preparation

**Categorical features present in the dataset**

In [None]:
data.select_dtypes(include='object').columns.to_list()

**Label Encoding the ordinal features**

In [None]:
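# Encode the target: 1 marks an attrited (churned) customer.
# The result is assigned back instead of relying on in-place modification of a column.
data['Attrition_Flag'] = data['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})

# Gender stays a string column here; it is one-hot encoded in the next step

education_mapping = {
    'Unknown': 0, 'Uneducated': 1, 'High School': 2, 'College': 3,
    'Graduate': 4, 'Post-Graduate': 5, 'Doctorate': 6
}
# Note: .map() turns any label missing from the mapping into NaN
data['Education_Level'] = data['Education_Level'].map(education_mapping)

income_mapping = {
    'Less than $40K': 'less_than_40k', '$40K - $60K': '40k_60k', '$60K - $80K': '60k_80k',
    '$80K - $120K': '80k_120k', '$120K +': 'greater_than_120k',
}
# Only rename the income labels ('Unknown' is left unchanged); the column is one-hot encoded later
data['Income_Category'] = data['Income_Category'].replace(income_mapping)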
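
After a dictionary-based encoding it is worth checking that every label was covered by the mapping. The sketch below uses a made-up series (the `'PhD'` label is invented for illustration and is not in the dataset) to show how an uncovered label can be detected.

```python
import pandas as pd

# Made-up labels for illustration; 'PhD' is deliberately absent from the mapping
levels = pd.Series(['Graduate', 'High School', 'Unknown', 'Doctorate', 'PhD'])

order = {
    'Unknown': 0, 'Uneducated': 1, 'High School': 2, 'College': 3,
    'Graduate': 4, 'Post-Graduate': 5, 'Doctorate': 6
}

# .map() returns NaN for labels that are not keys of the dictionary
encoded = levels.map(order)

# Collect the labels that were not covered by the mapping
unmapped = levels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list means every label was encoded; anything else points to a category the mapping missed.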

**One-Hot encoding the nominal features**

In [None]:
data_dummies = pd.get_dummies(data)

**Dropping the following features**

1. `CLIENTNUM` - it is a customer identifier and carries no information relevant to our analysis.
2. One dummy column per `one-hot encoded` feature is redundant given the others, so it can be dropped. In our case, we are dropping -
    * `Gender_F`
    * `Marital_Status_Unknown`
    * `Income_Category_Unknown`

In [None]:
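# Note: `data_dummies` was created before this drop, so it still contains CLIENTNUM.
# Dropping it there as well would shift the positional column indices used later.
data.drop('CLIENTNUM', axis=1, inplace=True)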

data_dummies.drop(['Gender_F', 'Marital_Status_Unknown', 'Income_Category_Unknown'], axis=1, inplace=True)
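
To see why one dummy column per feature can be dropped, here is a minimal sketch on a toy frame (the rows are made up; `dtype=int` is only used to make the output easier to read). Once `Marital_Status_Unknown` is removed, a row with zeros in all remaining dummy columns still identifies that category.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Marital_Status': ['Married', 'Single', 'Unknown', 'Married'],
    'Customer_Age': [45, 49, 51, 40],
})

# One-hot encode the string column; numeric columns pass through unchanged
dummies = pd.get_dummies(toy, dtype=int)

# Drop one reference category; it is implied when all other dummies are 0
reduced = dummies.drop(columns=['Marital_Status_Unknown'])
print(reduced)
```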

**Dataset description after data preparation**

In [None]:
print(f"""
    No. of samples  : {data_dummies.shape[0]}
    No. of features : {data_dummies.shape[1]}
    Missing values  : {data_dummies.isnull().sum().sum()}
""")

# Feature Engineering

**Checking `skewness` of the numeric features**

In [None]:
df_skew = pd.DataFrame(data_dummies.skew(), columns=['Skewness']).sort_values(by='Skewness')
df_skew.head()

**Applying `log transformation` to the highly skewed features**

In [None]:
## Get all the numeric features in our dataset
numeric_features = data_dummies.skew().index

## We do not want to touch our target feature
if 'Attrition_Flag' in numeric_features:
    numeric_features = numeric_features.drop('Attrition_Flag')

## Getting all the skewed features (skew > 0.5 or skew < -0.5)
skewness = data_dummies[numeric_features].skew()
skewed_features = skewness[skewness.abs() > 0.5].index

## Performing log(1+x) transformation
data_dummies[skewed_features] = np.log1p(data_dummies[skewed_features])
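
The effect of the `log(1+x)` transformation can be illustrated on a small right-skewed series. The values below are made up; the sketch only shows that the skewness shrinks after the transformation, not what the values are for the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed amounts (one large value dominates the tail)
amounts = pd.Series([100, 120, 150, 180, 200, 250, 400, 900, 5000], dtype=float)

skew_before = amounts.skew()
skew_after = np.log1p(amounts).skew()   # log(1 + x) compresses the long right tail

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Note that `np.log1p` is only defined for values greater than -1, so this approach assumes non-negative features.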

**Check the correlation among features using the heatmap**

In [None]:
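# Get the correlation dataframe (numeric columns only; string columns cannot be correlated)
df_corr = data.corr(numeric_only=True)

# Plot the heatmap
fig, ax = plt.subplots(figsize=(10, 8))
# np.bool was removed from NumPy; the built-in bool is the replacement
mask    = np.triu(np.ones_like(df_corr, dtype=bool))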
sns.heatmap(
    df_corr, mask=mask, annot=True, fmt=".2f", vmin = -1, vmax = 1,
    cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9)
)
plt.show()
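
Besides reading the heatmap by eye, strongly correlated pairs can be listed programmatically. This is a minimal sketch on a toy frame; the column names, the values and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 'limit' and 'open_to_buy' move together, 'age' does not
toy = pd.DataFrame({
    'limit':       [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    'open_to_buy': [900.0, 1800.0, 2900.0, 3700.0, 4800.0],
    'age':         [45.0, 30.0, 52.0, 38.0, 41.0],
})

corr = toy.corr()

# Keep only the strictly upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> correlation and filter by magnitude
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```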

# Feature Extraction

In [None]:
# Separate the independent and dependent variable
X = data_dummies.drop("Attrition_Flag", axis = 1)
y = data_dummies["Attrition_Flag"]

# Get the training and testing pairs
from sklearn.model_selection import train_test_split
# random_state fixes the split so that the results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
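
If the target classes are imbalanced, a stratified split keeps the class ratio the same in both parts. The sketch below uses a synthetic target with 20% positives (an assumed ratio, not the dataset's) to illustrate the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20 of them in the positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

print(y_tr.mean(), y_te.mean())
```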

**Chi-square test for feature selection**

[Reference link](https://machinelearningmastery.com/feature-selection-with-categorical-data/)

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

fs = SelectKBest(score_func=chi2, k='all')
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)

for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()

The output above lists each feature's column index alongside its chi-square score. The higher the score, the stronger the feature's association with the target, and the more useful it is likely to be.

In [None]:
X_train = X_train.iloc[:, [5, 6, 7, 9, 12, 13, 14, 15]]
X_test = X_test.iloc[:, [5, 6, 7, 9, 12, 13, 14, 15]]
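
Positional indices are easy to misread, so the chi-square scores can also be paired with column names. The following is a sketch on toy data (feature names and values are made up); it uses `get_support()` to recover the names of the columns kept by `SelectKBest`.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy, non-negative features (chi2 requires non-negative values)
X_toy = pd.DataFrame({
    'informative': [0, 0, 0, 0, 9, 8, 9, 8],
    'constant':    [5, 5, 5, 5, 5, 5, 5, 5],
    'weak':        [1, 2, 1, 2, 2, 1, 2, 1],
})
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# Scores labelled by feature name, highest first
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(scores)

# Names of the columns that were kept
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```

Selecting with `X_train[selected]` instead of `iloc` keeps the choice valid even if the column order changes.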

# Training

**Using a strong ensemble classifier - Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

# The model is a classifier, so it is named accordingly
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
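
A fitted random forest also exposes `feature_importances_`, which can serve as a sanity check on the selected features. The sketch below uses synthetic data in which only the first column determines the label; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: column 0 drives the label, column 1 is pure noise
rng = np.random.default_rng(0)
X_toy = rng.random((200, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, normalised to sum to 1
importances = forest.feature_importances_
print(importances)
```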

# Evaluation

**Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix

arr_cm = confusion_matrix(y_test, y_pred)
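df_cm = pd.DataFrame(arr_cm, index=[0, 1], columns=[0, 1])
fig = plt.figure(figsize=(4,3), dpi=120)
sns.heatmap(df_cm, annot=True, fmt="d")
# confusion_matrix puts true labels on the rows and predictions on the columns
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()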

**Classification Report**

In [None]:
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(y_test, y_pred))

**Accuracy Score**

In [None]:
print('Model Accuracy:', round(accuracy_score(y_test, y_pred)*100, 3), '%')
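
Accuracy alone can look good even when many churners are missed, so it helps to read recall and precision for the positive class alongside it. The following sketch uses made-up labels and predictions to show how these follow from the confusion matrix.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 8 existing customers (0) and 2 churners (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)   # share of actual churners that were caught
precision = tp / (tp + fp)   # share of predicted churners that really churned

print(accuracy, recall, precision)
```

Here the accuracy is 0.8 although only half of the churners are detected.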
