# Let us begin with a dilemma...

> Java has a startup, but lately he finds that his customers are leaving the services he provides. So he calls us to help him. As data scientists, we need to look into the data about his customers and find out which customers are likely to leave.

Let us go about this task

# What we will go through in this notebook:
* [First Steps - Preliminary work](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#First-Steps)
* [Data Preprocessing](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Data-Preprocessing)
* [Checking for Missing Values](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Checking-for-Missing-Values(NaN))
* [Visualization of Data](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Data-Visualization)
* [Converting Non numeric features to numeric features](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Converting-non-numeric-features-to-numeric-features)
* [Oversampling](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Generating-new-data-by-oversampling)
* [Scaling](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Scaling)
* [Various Models](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Models)
* [Hyperparameter Tuning](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Hyperparameter-Tuning)
* [Scope For Improvement](https://www.kaggle.com/duttasd28/java-s-dilemma?scriptVersionId=50296668#Scope-for-improvement)

# First Steps

In [None]:
# Import the data
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
np.random.seed(0)
#==========================================================================
#==========================================================================
data = pd.read_csv('/kaggle/input/churn-modelling/Churn_Modelling.csv',
                   index_col = 'RowNumber')
data = data.sample(5000)
data.head()

As we can see, **Exited** is our dependent feature (the target). The other columns are independent features.

Let us count how often each value of the __Exited__ column occurs so that we can tell whether there is class imbalance.

In [None]:
plt.figure(1, dpi=100)
values = data['Exited'].values
# Analysis
plt.text(
    x=0.2,
    y=7.0,
    s = "80%",
    fontsize=44,
    c='#ff8c00'
)
# text
plt.text(
    x=0.2,
    y=6.0,
    s = "of data points are non churn category\npoints, suggesting imbalance",
    c="gray"
)
# Hist
plt.hist(
    values,
    density=True,
    color='gray'
)
# 0 label
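# The label is passed positionally because the `s` keyword
# was renamed to `text` in newer Matplotlib releases
plt.annotate(
    "0",
    xy = (0.05, 7),
    fontsize=12,
    c='white'
)
# 1 label
plt.annotate(
    "1",
    xy = (0.95, 1.5),
    fontsize = 12,
    c='white'
)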
plt.box(on=None)
plt.xlabel('Customer Distribution')

plt.yticks([])
plt.xticks([])
plt.title('Need to implement Data Imbalance Measures')
plt.show();
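
The imbalance read off the histogram can also be checked numerically. The sketch below is only an illustration: it uses a small hypothetical `exited` Series in place of `data['Exited']`, and relies on `value_counts(normalize=True)` to return the share of each class.

```python
import pandas as pd

# Hypothetical stand-in for data['Exited']: 8 customers stayed (0), 2 left (1)
exited = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="Exited")

# Share of each class; a large gap between the two indicates imbalance
class_share = exited.value_counts(normalize=True)
print(class_share)

# Ratio of the majority class to the minority class
imbalance_ratio = class_share.max() / class_share.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

On the real data the same two lines can be applied to `data['Exited']` directly.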

# Data Preprocessing
In this step, we are going to preprocess our data so that we can use it on our models.

Preprocessing involves the following:
* Check for NaN (missing) values in the data
* Visualise the data so that we can derive meaningful insights
* Split into training and test datasets
* Fill in NaN values
* Convert non-numeric features to numeric features so that the models can use them
* Scale the data

Let us go ahead with the first step, __checking for NaN/missing values__

# Checking for Missing Values(NaN)

In [None]:
# check for missing values
data.isnull().any()

Phew! We are lucky: the data has no null values.
Datasets usually do contain null values, and we need to handle them before modelling.

There are various techniques for handling missing values, such as dropping the affected rows or filling them in (imputation).

[This awesome notebook by Kaggle Grandmaster Parul Pandey](https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python) helped me learn a lot. Do check it out!
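
Since this dataset has no gaps, the handling step is not exercised in the notebook. As a minimal sketch of two common options (dropping rows and simple imputation), assuming a hypothetical toy frame whose column names merely mirror the churn data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; not part of the churn dataset
toy = pd.DataFrame({
    "Age": [42.0, np.nan, 39.0, 41.0],
    "Geography": ["France", "Spain", None, "France"],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute, using the median for the numeric column
# and the most frequent value for the categorical column
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Geography"] = filled["Geography"].fillna(filled["Geography"].mode()[0])

print(dropped)
print(filled)
```

Dropping is simplest but loses rows; imputation keeps them at the cost of introducing estimated values.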

# Data Visualization
Here we are going to plot graphs regarding the data to get a deeper insight.

In [None]:
# Import necessary plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Make figures inline
%matplotlib inline

Let us list the columns in the data to understand what we have to work with.
We use the .info() method, which also shows the data type of each column.

In [None]:
data.info()

**Geography, Gender, Surname** are object data types, while the others are either int or float.

# Plotting with Matplotlib and Seaborn

In [None]:
plt.figure(figsize=(8, 8))
sns.set()
sns.boxplot(y = 'CreditScore', x = 'Exited', data = data, palette = 'husl');

In [None]:
plt.figure(figsize=(8, 5))
sns.violinplot(y = 'Exited' , x = 'Gender' , data = data, palette = 'hot');

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x = 'Geography' , data = data);

Let us plot a heatmap of the pairwise correlations between the numeric features. That will help us discard features that are not useful.
It also gives us some idea of which features best predict the dependent column.

In [None]:
plt.figure(figsize=(10, 10))
sns.set(style = 'white')
sns.heatmap(data.select_dtypes(include='number').corr(), annot = True, cmap = 'magma', square = True);
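
The heatmap can be complemented by a ranked list of correlations with the target. The following is a sketch on a hypothetical toy frame (the `Noise` column is invented for illustration); on the real data the same expression could be applied to `data.select_dtypes(include='number')`.

```python
import pandas as pd

# Hypothetical toy data: Age tracks Exited closely, Noise does not
toy = pd.DataFrame({
    "Age":    [20, 25, 50, 55, 30, 60],
    "Noise":  [1, 2, 1, 2, 1, 2],
    "Exited": [0, 0, 1, 1, 0, 1],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["Exited"].drop("Exited")

# Rank by absolute value, since strong negative correlations matter too
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```

Correlation only captures linear relationships, so a low value here does not prove that a feature is useless.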

Pairplot: this plots every pair of variables against each other, which is useful for spotting relationships visually.

In [None]:
# Pairplot
data_random_sample = data.sample(frac = 0.4).reset_index()

plt.figure(figsize=(12, 8))
sns.pairplot(data_random_sample, corner = True, hue = 'Exited');

# Converting non numeric features to numeric features
We convert non-numeric features to numeric features.
We also drop columns that do not seem to contribute anything useful, namely **CustomerId** and **Surname**.

Before encoding, we split the dataset into training and validation sets.

In [None]:
# Drop the uninformative features
data.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)

In [None]:
# Get dependent and independent features
X = data.iloc[:, :-1]
y = data.iloc[:, -1].astype('float')
X.head()

In [None]:
# Splitting to train test dataset
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = 1)
len(y_train), len(y_val)
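
Because the classes are imbalanced, one possible variation is a stratified split, which keeps the class ratio the same in both parts. This is a sketch on hypothetical toy labels, not the split used in this notebook; `stratify` is a standard parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 stayed (0), 20 left (1)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

# Fraction of churned customers in each part
print(y_tr.mean(), y_te.mean())
```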

In [None]:
# Reset the indexes of the split data frames
X_train.reset_index(drop=True, inplace=True)
X_val.reset_index(drop=True, inplace=True)

y_train.reset_index(drop=True, inplace=True)
y_val.reset_index(drop=True, inplace=True)

In [None]:
categorical_cols = [col for col in X_train.columns if X_train[col].dtypes == object]

In [None]:
# Label encoder object
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# Create two empty data frames
X_train_categorical, X_val_categorical = pd.DataFrame(), pd.DataFrame()

# Label Encode the features
for col in categorical_cols:
    X_train_categorical[col] = label_encoder.fit_transform(X_train[col])
    X_val_categorical[col] = label_encoder.transform(X_val[col])

# Drop the non required columns
X_train.drop(categorical_cols, axis = 1, inplace = True)
X_val.drop(categorical_cols, axis = 1, inplace=True)

# put the new columns in the dataframe
X_train = X_train.join(X_train_categorical)
X_val = X_val.join(X_val_categorical)
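
To see what the label encoder actually does, the sketch below fits it on a few hypothetical `Geography` values and prints the learned mapping. `LabelEncoder` sorts the classes, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values standing in for the Geography column
train_geo = ["Spain", "France", "Germany", "France"]
val_geo = ["Germany", "Spain"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_geo)  # learn the mapping on training data
val_codes = encoder.transform(val_geo)          # reuse the same mapping on validation data

# classes_ holds the labels in the order of their integer codes
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(train_codes, val_codes)
```

Note that the integer codes imply an ordering (France < Germany < Spain) that does not exist in reality. Tree-based models usually cope with this, while for linear models one-hot encoding may be a safer choice.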

# Generating new data by oversampling
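Since we have an imbalanced dataset, we will rebalance it with SMOTETomek, which oversamples the minority class with SMOTE and then removes Tomek links (pairs of nearest neighbours from opposite classes).

In [None]:
from imblearn.combine import SMOTETomek
smk = SMOTETomek()
# Resample the training data
# (fit_sample was renamed to fit_resample in newer imbalanced-learn releases)
X_train, y_train = smk.fit_resample(X_train, y_train)

# Resample the validation data; note that this changes its class balance,
# so the scores below describe a balanced validation set
X_val, y_val = smk.fit_resample(X_val, y_val)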
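
The core idea behind SMOTE is to create a synthetic minority sample on the line segment between a minority point and one of its minority-class neighbours. The sketch below illustrates only that interpolation step with two hypothetical points; it is a simplification, not the imbalanced-learn implementation, which also searches for the k nearest neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class points, e.g. (Age, Balance)
sample = np.array([40.0, 1000.0])
neighbour = np.array([50.0, 3000.0])

# Pick a random position on the segment between the two points
lam = rng.uniform(0.0, 1.0)
synthetic = sample + lam * (neighbour - sample)
print(synthetic)
```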

# Final check at the dataset before putting in model
Now we take a final look at the dataset

In [None]:
X_train.shape, X_val.shape

In [None]:
X_train[:5]

In [None]:
y_train.value_counts()

# Scaling
We scale the data so that the features are on a comparable scale.

### Note: some columns hold label-encoded or binary values, so we should not scale every column. Otherwise their meaning would be lost.

In [None]:
columns = ['Balance', 'EstimatedSalary']  ## Columns to modify

## Subtract the mean, divide by standard deviation.
for col in columns:
    colMean = X_train[col].mean()
    colStdDev = X_train[col].std()
    X_train[col] = X_train[col].apply(lambda x : (x - colMean) / colStdDev)
    X_val[col] = X_val[col].apply(lambda x : (x - colMean) / colStdDev)    
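
The same standardisation can be written without `apply`, since pandas arithmetic is vectorised. A minimal sketch on hypothetical toy values, using the training mean and standard deviation for both sets as above:

```python
import pandas as pd

# Hypothetical toy values standing in for the Balance column
train = pd.DataFrame({"Balance": [0.0, 50000.0, 100000.0]})
val = pd.DataFrame({"Balance": [75000.0]})

# Statistics come from the training data only
mean = train["Balance"].mean()
std = train["Balance"].std()  # pandas uses the sample standard deviation (ddof=1)

# Vectorised: the whole column is transformed in one expression
train_scaled = (train["Balance"] - mean) / std
val_scaled = (val["Balance"] - mean) / std

print(train_scaled.tolist(), val_scaled.tolist())
```

scikit-learn's `StandardScaler` does the same job but divides by the population standard deviation (ddof=0), so its output differs slightly on small samples.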

In [None]:
X_train.head()

# Models
We will be using the following models:
* Logistic Regression
* Decision Tree
* Random Forest Classifier
* Extra Trees Classifier
* XGBClassifier
* ANN

In [None]:
# metric
from sklearn.metrics import f1_score
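
The F1 score is the harmonic mean of precision and recall, which makes it more informative than accuracy on imbalanced data. As a worked example on hypothetical labels, computed by hand rather than with `f1_score`:

```python
# Hypothetical true labels and predictions (1 = churned)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count the outcomes that matter for the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # how many predicted churners really churned
recall = tp / (tp + fn)     # how many real churners were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```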

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver = 'lbfgs', max_iter = 300)

# fit the data
model.fit(X_train, y_train)

# Get predictions
y_preds = model.predict(X_val)

# Get score
f1_score(y_val, y_preds)

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

# fit the data
model.fit(X_train, y_train)

# Get predictions
y_preds = model.predict(X_val)

# Get score
f1_score(y_val, y_preds)

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced')

# fit the data
model.fit(X_train, y_train)

# Get predictions
y_preds = model.predict(X_val)

# Get score
f1_score(y_val, y_preds)

# Extra Trees Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()

# fit the data
model.fit(X_train, y_train)

# Get predictions
y_preds = model.predict(X_val)

# Get score
f1_score(y_val, y_preds)

# XGBoost

In [None]:
from xgboost import XGBClassifier

model = XGBClassifier()

# fit the data
model.fit(X_train, y_train)

# Get predictions
y_preds = model.predict(X_val)

# Get score
f1_score(y_val, y_preds)
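
The scikit-learn models above all follow the same fit, predict, score pattern, so the comparison can be sketched as a single loop. This example assumes synthetic data from `make_classification` in place of the churn data, and the dictionary name `candidates` is only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
}

# Fit each model and record its F1 score on the held-out part
scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, candidate.predict(X_te))

# Print the models from best to worst
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```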


# Neural Network(TensorFlow)

In [None]:
from tensorflow import keras as K

In [None]:
model = K.Sequential()

model.add(K.layers.Dense(512, input_dim = 10, activation = 'relu'))

model.add(K.layers.Dense(256, activation = 'relu'))
model.add(K.layers.BatchNormalization())

model.add(K.layers.Dense(64, activation = 'relu'))
model.add(K.layers.Dropout(0.4))

model.add(K.layers.Dense(8, activation = 'relu'))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(1, activation = 'sigmoid'))

model.summary()

In [None]:
opt = K.optimizers.Adam(learning_rate=0.00001)

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

In [None]:
history = model.fit(X_train, y_train, epochs=8, batch_size=32, validation_data=(X_val, y_val))

In [None]:
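# predict_classes was removed from recent TensorFlow releases;
# threshold the sigmoid output at 0.5 instead
y_preds = (model.predict(X_val) > 0.5).astype('int32').ravel()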

In [None]:
f1_score(y_val, y_preds)

# Hyperparameter Tuning
Let us tune the hyperparameters of XGBoost to further improve our results.

We will be using RandomizedSearchCV for this. It samples random combinations from a search space and keeps the best-scoring parameters.


In [None]:
from sklearn.model_selection import RandomizedSearchCV

model = XGBClassifier()  ## Model to tune

# define a parameter dictionary, which contains the search space to explore
paramSearchSpace = {
    'n_estimators' : [10, 25, 70],  ## Number of trees
    'gamma' : [1, 0.05, 0.1],    ## Minimum loss reduction needed to split (regularisation)
    'max_depth' : [2, 3, 5, 7],    ## max depth of tree
    'scale_pos_weight' : [60, 70, 80] ## Weight of the positive class (typical value: num neg / num pos)
}

# make the Randomized Search CV object
clf = RandomizedSearchCV(model, param_distributions = paramSearchSpace)

# Fit with data
clf.fit(X_train, y_train)

# See the best values we obtain
clf.best_params_, clf.best_score_

Randomized Search CV takes time, so please wait!
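
To get a feel for how much of the search space is explored, the sketch below draws candidates with scikit-learn's `ParameterSampler`, which is what `RandomizedSearchCV` uses to pick combinations. With the default `n_iter=10`, only 10 of the possible combinations are tried; `search_space` here is a copy of the dictionary above.

```python
from sklearn.model_selection import ParameterSampler

search_space = {
    "n_estimators": [10, 25, 70],
    "gamma": [1, 0.05, 0.1],
    "max_depth": [2, 3, 5, 7],
    "scale_pos_weight": [60, 70, 80],
}

# Size of the full grid: product of the number of options per parameter
grid_size = 1
for values in search_space.values():
    grid_size *= len(values)

# Draw 10 random combinations, matching the default n_iter of RandomizedSearchCV
sampled = list(ParameterSampler(search_space, n_iter=10, random_state=0))
print(f"{len(sampled)} of {grid_size} combinations tried")
print(sampled[0])
```

Raising `n_iter` covers more of the space at the cost of a proportionally longer run, since every candidate is fitted once per cross-validation fold.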

In [None]:
finalModel = XGBClassifier(**clf.best_params_)
finalModel.fit(X_train, y_train)
y_preds = finalModel.predict(X_val)

## Final f1 score
f1_score(y_val, y_preds)

# Saving Best Model

In [None]:
import pickle
# Dump the model; the with-block closes the file afterwards
with open('ChurnModelFinal.pkl', 'wb') as f:
    pickle.dump(finalModel, f)
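
Loading the model back works the same way in reverse. The sketch below shows the round trip with a hypothetical stand-in object and a temporary file, so that it runs without the trained model; for the real model the path would be `ChurnModelFinal.pkl`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model
stand_in_model = {"name": "demo", "params": {"max_depth": 3}}

# Save to a temporary location
path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Load it back and check that nothing was lost
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.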

# Scope for improvement

* Better Hyperparameter Tuning
* Change optimizer for model

## Thank you!