# Scenario
* You are tasked with investigating customer churn
    * churn: when a customer quits a service.  High churn rate = bad for business.
* In the `data/` folder is the dataset you will be working on.


# Complete the following
* Find features that are strong indicators of churn and build visualizations
* Build a model to predict churn. You can build any model you want, including
    * Logistic Regression 
    * KNN
    * Bayesian Classifiers
* Choose an evaluation metric for your model
    * Accuracy vs Precision vs Recall vs F1
* Explain why you chose that metric
* Apply `GridSearchCV` to find the best hyperparameters for your model
* After you build your final model you must have
    * A confusion matrix supporting your model
    * Final Metric Score
* Make sure you have a validation set for your data
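
To make the metric choice concrete, here is a small worked example that computes all four candidate metrics from one confusion matrix. The counts are hypothetical and chosen only for illustration; they do not come from the dataset.

```python
# Hypothetical confusion-matrix counts for a churn classifier (illustrative only).
tp = 120   # churners correctly flagged
fp = 60    # loyal customers wrongly flagged as churners
fn = 280   # churners the model missed
tn = 1540  # loyal customers correctly left alone

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the customers flagged, how many really churned
recall = tp / (tp + fn)     # of the customers who churned, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these made-up counts, accuracy looks respectable while recall is low, which is one reason accuracy alone can be a weak choice when churners are a minority of customers.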


# Can you
* Work in groups? Yes
* Ask cohort-mates for help/advice? Yes
* Check what you did with a cohort-mate? Yes
* Ask me for advice? Yes

This is an opportunity to practice some ML before the Phase 3 project. 


## Things to remember
* A data scientist is good at finding key insights into problems, not just building models
* Validate your model with a confusion matrix and have a validation set

# Setting the random state
Below we set a default random state for all randomized computations.

In [18]:
random_state = 42

# Data Validation
In this section we import and inspect the data to ensure that it is complete and correctly encoded.

## Validation Reports
Below we inspect the data using some standard reports: `df.info()`, value counts for each text column, and the first five rows.

In [17]:
import pandas as pd

df = pd.read_csv('data/Churn_Modelling.csv', index_col='RowNumber')
print(df.info())
for column in df.columns:
    if df[column].dtype == object:
        print(f"\n== {column}==")
        print(df[column].value_counts())
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   Surname          10000 non-null  object 
 2   CreditScore      10000 non-null  int64  
 3   Geography        10000 non-null  object 
 4   Gender           10000 non-null  object 
 5   Age              10000 non-null  int64  
 6   Tenure           10000 non-null  int64  
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64  
 9   HasCrCard        10000 non-null  int64  
 10  IsActiveMember   10000 non-null  int64  
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 1.1+ MB
None

== Surname==
Smith          32
Martin         29
Scott          29
Walker         28
Brown          26
               ..
Patrick     

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


## Validation Conclusions
* The data is complete. 
* We will drop `CustomerId` and `Surname` from the model since they are unlikely to contain any meaningful information.
* We need to dummify `Gender` and `Geography`.

# Holding out Test data for model validation
In this section, we split the provided data into a training set, which will be used for training and cross validation, and a test set, which will be used for model validation. We will use 20% of our data for validation and 80% for training.

In [19]:
from sklearn.model_selection import train_test_split
X = df.drop('Exited',axis=1)
y = df['Exited']
# train_test_split returns both X splits first, then both y splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=random_state)
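
As a quick check of how the split behaves, the sketch below runs `train_test_split` on synthetic data and prints the resulting shapes. It also passes `stratify`, which keeps the class proportions the same in both splits. Using `stratify` is an assumption on our part and not something the cell above does. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 20% positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# The return order is X_train, X_test, y_train, y_test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # row counts follow the 80/20 split
print(y_tr.mean(), y_te.mean())  # class balance in each split
```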

# Building our pipeline
In this section we build a preprocessing pipeline. For now it contains a single step that one-hot encodes the features.

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipeline = Pipeline(
    steps=[
        ('encoding', OneHotEncoder(drop='first'))
    ]
)

In [28]:
ohe = OneHotEncoder()
pd.DataFrame(ohe.fit_transform(X_train))

Unnamed: 0,0
0,"(0, 1121)\t1.0\n (0, 9883)\t1.0\n (0, 1091..."
1,"(0, 6428)\t1.0\n (0, 9379)\t1.0\n (0, 1085..."
2,"(0, 4864)\t1.0\n (0, 8959)\t1.0\n (0, 1078..."
3,"(0, 5306)\t1.0\n (0, 8496)\t1.0\n (0, 1078..."
4,"(0, 7449)\t1.0\n (0, 8476)\t1.0\n (0, 1074..."
...,...
7995,"(0, 988)\t1.0\n (0, 9047)\t1.0\n (0, 10992..."
7996,"(0, 3696)\t1.0\n (0, 8494)\t1.0\n (0, 1090..."
7997,"(0, 247)\t1.0\n (0, 9486)\t1.0\n (0, 10959..."
7998,"(0, 7749)\t1.0\n (0, 9579)\t1.0\n (0, 1089..."


# Fitting the Pipeline

In [23]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('encoding', OneHotEncoder(drop='first'))])

In [26]:
pipeline.transform(X_train)

In [None]:
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
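
The sketch below shows how these imports could fit together: a `Pipeline` that scales the features and fits a logistic regression, tuned with `GridSearchCV` and checked with a confusion matrix on held-out data. It runs on synthetic data from `make_classification`, not on the churn dataset. The choice of logistic regression, the grid of `C` values, and F1 as the scoring metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the churn dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(model, param_grid, scoring="f1", cv=5)
search.fit(X_tr, y_tr)  # cross validation uses the training split only

print(search.best_params_)
cm = confusion_matrix(y_te, search.predict(X_te))
print(cm)  # rows are true classes, columns are predicted classes
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation fold, so the validation folds do not leak into the scaling statistics.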
