## Project Title 
### Forecasting customer loyalty towards Beta Bank

## Introduction

#### The objective of this project is to develop a machine learning model for Beta Bank to analyze customer loyalty. This model aims to identify customers who may be considering moving their business to another financial institution. By doing so, Beta Bank can investigate the underlying reasons for potential customer churn and devise strategic plans to address these issues, ultimately enhancing customer retention.


In [1]:
## importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats as st
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

In [2]:
## creating the dataframe
df = pd.read_csv('/datasets/Churn.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/datasets/Churn.csv'

In [None]:
# looking into dataframe
df.head()

In [None]:
# convert all column names to lowercase
df.columns = df.columns.str.lower()

In [None]:
df.head()

In [None]:
# inspect the data types of all columns
df.info()

In [None]:
df.shape

In [None]:
# Looking into missing data
df.isnull().sum()

In [None]:
## drop rows with missing values (the gaps are in the tenure column)
df = df.dropna()


In [None]:
df.info()

In [None]:
df.exited.value_counts().plot.pie(autopct ='%.2f')

## In the pie chart above, 1 denotes customers who left the bank and 0 denotes customers who are still with the bank. About 20% of customers have left the bank.
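
The class shares shown in the pie chart can also be read off numerically. The snippet below is a minimal sketch on a hypothetical `toy_exited` column; its values are illustrative and are not taken from Churn.csv.

```python
import pandas as pd

# Hypothetical target column: 1 = customer left, 0 = customer stayed.
toy_exited = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 1, 0], name="exited")

# normalize=True returns the share of each class instead of raw counts.
class_share = toy_exited.value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `df['exited'].value_counts(normalize=True)`.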

In [None]:
df.describe()

In [None]:
## count fully duplicated rows
df.duplicated().sum()

In [None]:
df.drop(['surname', 'rownumber','customerid'],axis=1, inplace=True)

#### Removed the surname, row number, and customer ID columns as they do not seem to contribute value to the model training process.

In [None]:
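# geography and gender are still text columns here, so use numeric columns only (numeric_only needs pandas >= 1.5)
df.corr(numeric_only=True)['exited']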

### Leaving the bank is positively correlated with age: in this data, older customers were more likely to exit.

In [None]:
# one-hot encoded copy, kept for reference (the models below use the integer-coded df)
df_ohe = pd.get_dummies(df, drop_first=True)

# map the categorical columns to integer codes
geo_dict = {'Spain': 0, 'France': 1, 'Germany': 2}
df['geography'] = df['geography'].map(geo_dict)
gen_dict = {'Male': 0, 'Female': 1}
df['gender'] = df['gender'].map(gen_dict)

#### Encoded the categorical columns gender and geography as integer codes so that the machine learning models can use them. A one-hot encoded copy (df_ohe) is also created, but the models below are trained on the integer-coded df.
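
To make the difference between the two encodings concrete, here is a small sketch on a hypothetical four-row frame `toy` (not rows from the dataset). Integer codes keep one column per feature but imply an order between categories, while one-hot encoding creates one binary column per category, minus the dropped first one.

```python
import pandas as pd

# Hypothetical sample with the two categorical columns.
toy = pd.DataFrame({
    "geography": ["Spain", "France", "Germany", "France"],
    "gender": ["Male", "Female", "Female", "Male"],
})

# Integer codes: compact, but 0 < 1 < 2 suggests an ordering that does not exist.
mapped = toy.copy()
mapped["geography"] = mapped["geography"].map({"Spain": 0, "France": 1, "Germany": 2})
mapped["gender"] = mapped["gender"].map({"Male": 0, "Female": 1})

# One-hot encoding: drop_first=True drops the alphabetically first category
# of each column to avoid redundant columns.
one_hot = pd.get_dummies(toy, drop_first=True)

print(mapped)
print(one_hot)
```

Tree-based models are usually insensitive to the implied order, whereas a linear model such as logistic regression may be affected by it, so the choice of encoding can matter for model comparison.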

In [None]:
## assigning variables to features and target
features = df.drop(['exited'],axis=1)
target = df['exited']

## Splitting the data into a training set (80%), a validation set (10%) and a test set (10%)

In [None]:
features_train, features_test, target_train, target_test= train_test_split(features, target, test_size=0.2, random_state=42)
features_valid, features_test, target_valid, target_test = train_test_split(features_test, target_test, test_size =0.5, random_state=42)

In [None]:
features_train.shape

In [None]:
target_train.shape

In [None]:
features_test.shape

In [None]:
target_test.shape

In [None]:
features_valid.shape

In [None]:
target_valid.shape
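
The two-step split can be sanity-checked on toy data. The sketch below assumes hypothetical arrays `X` and `y` with 100 rows and repeats the same two `train_test_split` calls: first 80/20, then the 20% holdout is halved into validation and test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, alternating binary target.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Step 1: hold out 20% of the rows.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Step 2: split the holdout in half into validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

split_sizes = (len(X_train), len(X_valid), len(X_test))
print(split_sizes)
```

With an imbalanced target, passing `stratify=y` to `train_test_split` is a common option for keeping the class shares similar across the splits; the notebook does not use it.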

In [None]:
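numeric = ['estimatedsalary','creditscore', 'age','tenure', 'balance', 'numofproducts', 'hascrcard', 'isactivemember']
scaler = StandardScaler()
scaler.fit(features_train[numeric])  # fit on the training set only
# transform every split with the same scaler, including the training set
features_train[numeric] = scaler.transform(features_train[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])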

#### Standardizing the features so that they are on a comparable scale irrespective of their original values. The scaler is fitted on the training set only and then used to transform the training, validation and test sets.
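
The fit-on-train, transform-everywhere pattern can be illustrated on a tiny hypothetical example. The arrays `train` and `valid` below are made up. The scaler learns the mean and standard deviation from `train` only, so `valid` is scaled with training statistics and no information leaks from the held-out data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits.
train = np.array([[10.0], [20.0], [30.0]])
valid = np.array([[20.0], [40.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learn mean/std on train, then scale it
valid_scaled = toy_scaler.transform(valid)      # reuse the training mean/std

print(train_scaled.ravel())
print(valid_scaled.ravel())
```

After scaling, the training column has mean 0 and unit standard deviation. The validation value 20 maps to 0 because it equals the training mean.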

In [None]:
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)
model.score(features_valid, target_valid)

In [None]:
for depth in range(1, 10):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print('max_depth=', depth, ':', end=' ')
    print(accuracy_score(target_valid, predictions_valid))

In [None]:
## create model via RandomForestClassifier 
model = RandomForestClassifier(random_state =12345)
model.fit(features_train, target_train)
y_pred = model.predict(features_valid)
model.score(features_valid, target_valid)

In [None]:
best_score = 0
best_est = 0
for est in range(1, 20): #choosing hyper parameter
    model = RandomForestClassifier(random_state=54321, n_estimators= est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

In [None]:
## create model via LogisticRegression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)  # train on the training set only, not on the full data
predicted_valid = model.predict(features_valid)
print(accuracy_score(target_valid, predicted_valid))

In [None]:
## apply the best model on test set with RandomForestClassifier
final_model = RandomForestClassifier(random_state=54321, n_estimators= 2) 
final_model.fit(features_train, target_train)
final_model.score(features_test, target_test)

### Models were trained on the training set and tuned on the validation set. The Random Forest Classifier was chosen as the final model and evaluated on the test set.

In [None]:
##  making observations of a frequent class less frequent in the data
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)]
        + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)]
        + [target_ones]
    )
    return features_downsampled, target_downsampled


features_downsampled, target_downsampled = downsample(
    features_train, target_train, 0.1
)


### Downsampling reduces the number of observations of the majority class (here, customers who stayed) in the training set. This balances the classes so that the model does not simply favor the majority class.
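
The effect of the sampling fraction on class balance can be seen on a hypothetical target `toy_target` with 90 zeros and 10 ones. This sketch follows the same idea as `downsample` above, and additionally shuffles the result with `sample(frac=1)`, which is an assumption of this example and not part of the original function.

```python
import pandas as pd

# Hypothetical imbalanced target: 90 stayed (0), 10 left (1).
toy_target = pd.Series([0] * 90 + [1] * 10)

# Keep 10% of the majority class and all of the minority class.
zeros = toy_target[toy_target == 0].sample(frac=0.1, random_state=12345)
ones = toy_target[toy_target == 1]

# Shuffle so that the two classes are not grouped together.
balanced = pd.concat([zeros, ones]).sample(frac=1, random_state=12345)

counts_before = toy_target.value_counts()
counts_after = balanced.value_counts()
print(counts_before)
print(counts_after)
```

The fraction controls the final ratio: here 90 zeros shrink to 9, leaving the classes roughly even. With a different starting imbalance the same fraction can overshoot and turn the former majority into the minority, so it is worth checking `value_counts()` after downsampling.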

In [None]:
# Calculation of F1 score by RandomForestClassifier
final_model = RandomForestClassifier(random_state=54321, n_estimators= 1, class_weight ='balanced') 
final_model.fit(features_downsampled, target_downsampled)
y_pred = final_model.predict(features_test)
print('F1:', f1_score(target_test, y_pred))


In [None]:
# Calculation of F1 score by LogisticRegression
model = LogisticRegression(random_state = 12345, solver='liblinear', class_weight ='balanced')
model.fit(features_downsampled, target_downsampled)
y_pred = model.predict(features_test)
print('F1:', f1_score(target_test, y_pred))


In [None]:
# Calculation of F1 score by DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=12345, class_weight ='balanced')
model.fit(features_downsampled, target_downsampled)
y_pred = model.predict(features_test)    
print('F1:', f1_score(target_test, y_pred))

### The Logistic Regression model gave the highest F1 score, 0.40, on the test data.
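
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The helper `f1_from_counts` and the counts passed to it are hypothetical and chosen only to illustrate one way a score of 0.40 can arise; they are not the notebook's actual confusion matrix.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives.

    Assumes tp + fp > 0 and tp + fn > 0, so precision and recall are defined.
    """
    precision = tp / (tp + fp)  # share of predicted churners who really left
    recall = tp / (tp + fn)     # share of real churners who were caught
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 30/90, recall = 30/60.
toy_f1 = f1_from_counts(tp=30, fp=60, fn=30)
print(round(toy_f1, 2))
```

In this illustration half of the churners are found, but only one in three flagged customers actually leaves, which is the kind of trade-off a low F1 score summarizes.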

In [None]:
# Calculation of AUC-ROC score
model = LogisticRegression(random_state = 12345, solver='liblinear')
model.fit(features_downsampled, target_downsampled)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(auc_roc)

## The AUC-ROC score on the validation set is 0.71. This is above the 0.5 expected from random guessing, though still well short of the ideal value of 1.
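
One way to interpret AUC-ROC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below implements that pairwise definition in plain Python. The helper `pairwise_auc` and the toy labels and scores are hypothetical and serve only to show where the reference values 0.5 and 1 come from.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Assumes binary labels with at least one positive and one negative.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)

# A model that gives every customer the same score cannot rank them at all.
constant_auc = pairwise_auc([0, 1], [0.5, 0.5])
print(constant_auc)
```

Three of the four positive-negative pairs are ordered correctly in the toy data, while the constant scorer lands exactly on the 0.5 baseline.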

## Conclusion
The goal of this project is to develop a machine learning model for Beta Bank to analyze customer loyalty. The model's objective is to identify customers who might be considering switching to a different financial institution. By identifying these customers, Beta Bank can explore the underlying reasons for potential customer churn and develop strategic plans to address these issues, ultimately improving customer retention.

To achieve this, we encoded the categorical columns gender and geography numerically. Our data shows that about 20% of the bank's customers have already left.

We removed the surname, row number, and customer ID columns as they didn't seem to add value to the model's training process. Then, we standardized all the features to ensure they hold equal importance regardless of their values. This standardization was done on the training dataset and applied to both the testing and validation datasets.

Models were trained on the training set and tuned on the validation set. The Random Forest Classifier was chosen as the final model and evaluated on the test set.

Downsampling was then applied to the training set to reduce the number of observations of the majority class (customers who stayed). This balances the classes so that the models do not simply favor the majority class.

The Logistic Regression model provided the highest F1 score, 0.40, on the test data. It also achieved an AUC-ROC score of 0.71 on the validation set, which is above the 0.5 expected from random guessing but still leaves room for improvement.

