### Deliverables
For a project, your repository/folder should contain the following:

- README.md with
    - Description of the problem
    - Instructions on how to run the project
- Data
    - You should either commit the dataset you used or provide clear instructions on how to download it
- Notebook (suggested name - notebook.ipynb) with
    - Data preparation and data cleaning
    - EDA, feature importance analysis
    - Model selection process and parameter tuning
- Script train.py (suggested name)
    - Training the final model
    - Saving it to a file (e.g. pickle) or saving it with specialized software (BentoML)
- Script predict.py (suggested name)
    - Loading the model
    - Serving it via a web service (with Flask or specialized software - BentoML, KServe, etc)
- Files with dependencies
    - Pipfile and Pipfile.lock if you use Pipenv
    - or equivalents: conda environment file, requirements.txt or pyproject.toml
- Dockerfile for running the service
- Deployment
    - URL to the service you deployed or
    - Video or image of how you interact with the deployed service

## Midterm project
This workbook contains the analysis for the midterm project.

It covers EDA, model selection, and saving the scripts.

### Load the needed libraries for the analysis

In [32]:
import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import mutual_info_score

import pickle

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Load the data and examine it

Below is a basic description of the columns:

| Feature | Description |
| --- | --- |
| customer_id | Unique identifier for each customer |
| credit_score | Credit score of the customer |
| country | Customer’s country of residence |
| gender | Customer’s gender |
| age | Customer’s age |
| tenure | Number of years the customer has been with the bank |
| balance | Account balance of the customer |
| products_number | Number of products used by the customer |
| credit_card | Whether the customer has a credit card (1: Yes, 0: No) |
| active_member | Whether the customer is an active member (1: Yes, 0: No) |
| estimated_salary | Estimated annual salary of the customer |
| churn | Target variable indicating if the customer churned (1: Yes, 0: No) |

In [9]:
## https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset/data?select=Bank+Customer+Churn+Prediction.csv
df = pd.read_csv("bank_churn_data.csv")

In [5]:
## Examine the first few rows to see the structure of the data
df.head().T

Unnamed: 0,0,1,2,3,4
customer_id,15634602,15647311,15619304,15701354,15737888
credit_score,619,608,502,699,850
country,France,Spain,France,France,Spain
gender,Female,Female,Female,Female,Female
age,42,41,42,39,43
tenure,2,1,8,1,2
balance,0.0,83807.86,159660.8,0.0,125510.82
products_number,1,1,3,2,1
credit_card,1,0,1,0,1
active_member,1,1,0,0,1


In [10]:
## Check the data types. Note that `credit_card` and `active_member` are stored as int
## but are actually logical (0/1) values
df.dtypes

customer_id           int64
credit_score          int64
country              object
gender               object
age                   int64
tenure                int64
balance             float64
products_number       int64
credit_card           int64
active_member         int64
estimated_salary    float64
churn                 int64
dtype: object

In [11]:
## Use describe to explore the data distribution
df.describe()

Unnamed: 0,customer_id,credit_score,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,15690940.0,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,71936.19,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [12]:
## Drop customer ID
del df['customer_id']

In [13]:
df.head()

Unnamed: 0,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


### Set up the framework for logistic regression
Use logistic regression as the first model. The data is first split into training, validation, and test sets (60%/20%/20%).

In [14]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

y_train = df_train.churn.values
y_test = df_test.churn.values
y_val = df_val.churn.values

del df_train['churn']
del df_test['churn']
del df_val['churn']
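
The two-step split above yields a 60%/20%/20% partition, because taking 25% of the remaining 80% gives 20% of the full data. A minimal sketch to check the resulting sizes, assuming an index array of 10,000 entries as a stand-in for the dataset rows (the row count comes from the `describe()` output):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 10,000 rows of the dataset; only the sizes matter here.
rows = np.arange(10_000)

# Same two-step split as in the notebook: 20% test, then 25% of the rest as validation.
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

sizes = {"train": len(train), "val": len(val), "test": len(test)}
shares = {name: size / len(rows) for name, size in sizes.items()}
print(sizes)
print(shares)
```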

In [15]:
df_train.head()

Unnamed: 0,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary
0,789,France,Female,32,7,69423.52,1,1,0,107499.39
1,583,Germany,Female,41,5,77647.6,1,1,0,190429.52
2,767,Germany,Female,35,6,132253.22,1,1,0,115566.57
3,718,France,Male,48,9,0.0,2,1,1,72105.63
4,686,Germany,Male,26,1,57422.62,1,1,1,79189.4


#### Do some EDA using df_full_train
- Check for NAs
- Examine the churn value in relation to other variables
- Check the categorical and numerical variables

In [16]:
df_full_train = df_full_train.reset_index(drop=True)

In [17]:
df_full_train.head()

Unnamed: 0,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
0,628,Germany,Male,29,3,113146.98,2,0,1,124749.08,0
1,626,France,Female,29,4,105767.28,2,0,0,41104.82,0
2,612,Germany,Female,47,6,130024.87,1,1,1,45750.21,1
3,646,Germany,Female,52,6,111739.4,2,0,1,68367.18,0
4,714,Spain,Male,33,8,122017.19,1,0,0,162515.17,0


In [18]:
df_full_train.isna().sum()

credit_score        0
country             0
gender              0
age                 0
tenure              0
balance             0
products_number     0
credit_card         0
active_member       0
estimated_salary    0
churn               0
dtype: int64

In [28]:
## Determine the global churn rate and round to 2 decimal places
global_churn_rate = round(df_full_train.churn.mean(),2)
global_churn_rate

0.2

In [21]:
## Set the numerical and categorical variables
numerical = ['credit_score','age','tenure', 'balance','products_number','estimated_salary']
categorical = ['country', 'gender','credit_card','active_member']

In [22]:
## Number of unique values for the categorical variables
df_full_train[categorical].nunique()

country          3
gender           2
credit_card      2
active_member    2
dtype: int64

### Churn rate and risk ratio of the categorical variables
The section below determines the churn rate of each category and compares it with the global churn rate.

From the tables below:
- Country of residence is important in determining whether a customer will churn
    - Residents of France and Spain are about 16 to 20% less likely to churn than average (risk 0.80 and 0.84)
    - Residents of Germany are about 60% more likely to churn (risk 1.59)
- Having a credit card does not significantly affect churn
- Female customers are about 25% more likely to churn than average, while male customers are about 18% less likely
- Active membership also significantly affects churn
    - Active members are about 30% less likely to churn, while inactive members are about 34% more likely
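
For reference, the `diff` and `risk` columns in the tables of this section can be written as follows, where $\bar{y}_g$ denotes the churn rate within group $g$ and $\bar{y}$ the global churn rate (rounded to 0.2 in this notebook). The symbols are notation introduced here for illustration only.

$$
\text{diff}_g = \bar{y}_g - \bar{y}, \qquad \text{risk}_g = \frac{\bar{y}_g}{\bar{y}}
$$

For example, for Germany: $\text{diff} = 0.318227 - 0.2 \approx 0.118$ and $\text{risk} = 0.318227 / 0.2 \approx 1.59$. A risk above 1 means the group churns more than average, and a risk below 1 means it churns less.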

In [30]:
from IPython.display import display

for c in categorical:
    # Churn rate and group size for each value of the variable
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    # Absolute difference from, and ratio to, the (rounded) global churn rate
    df_group['diff'] = df_group['mean'] - global_churn_rate
    df_group['risk'] = df_group['mean'] / global_churn_rate
    display(df_group)
    print()
    print()

Unnamed: 0_level_0,mean,count,diff,risk
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
France,0.160991,3994,-0.039009,0.804957
Germany,0.318227,2030,0.118227,1.591133
Spain,0.168522,1976,-0.031478,0.842611






Unnamed: 0_level_0,mean,count,diff,risk
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,0.249314,3646,0.049314,1.246572
Male,0.163757,4354,-0.036243,0.818787






Unnamed: 0_level_0,mean,count,diff,risk
credit_card,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.211604,2344,0.011604,1.05802
1,0.199081,5656,-0.000919,0.995403






Unnamed: 0_level_0,mean,count,diff,risk
active_member,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.268424,3908,0.068424,1.342119
1,0.140029,4092,-0.059971,0.700147






### Mutual information of the categorical variables

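Country of residence and active membership carry the most information about churn (mutual information of about 0.013 each), followed by gender. Having a credit card carries almost none.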

In [33]:
def mutual_info_churn_score(series):
    return mutual_info_score(series, df_full_train.churn)
 
mi = df_full_train[categorical].apply(mutual_info_churn_score)
mi

country          0.013117
gender           0.005598
credit_card      0.000100
active_member    0.012872
dtype: float64
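
To make the scores above easier to interpret, the sketch below computes mutual information directly from its definition, the sum over value pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))), using the natural logarithm. The helper `mutual_information` and the toy inputs are hypothetical and written only for this illustration; to my understanding this matches what `mutual_info_score` computes, but that is worth verifying against the scikit-learn documentation.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (n_xy / n) * math.log(n * n_xy / (px[x] * py[y]))
    return mi


# Toy inputs: the first feature fully determines the target, the second is independent of it.
perfect = mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1])
independent = mutual_information(["a", "a", "b", "b"], [0, 1, 0, 1])
print(perfect, independent)
```

A feature that tells nothing about the target scores 0, and the score can never exceed the entropy of the target itself, which is about 0.50 nats at a 20% churn rate. Against that ceiling, values near 0.013 indicate that each categorical variable alone explains only a small part of churn.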