# 4 Training and Modeling Data<a id='4_Training_and_Modeling_Data'></a>

## Contents <a id ="Content" > </a>

* [Introduction](#Introduction)
* [Imports](#Imports)
* [Train Test Split](#Train_Test_Split) 
* [Training and Modeling](#Training_and_Modeling)
    * [Model Selection](#Model_Selection)
    * [Evaluation Metrics](#Evaluation-Metrics)
        * [Training and Modeling](#Train_and_Model)
        * [Hyperparameter Tuning and Model Training](#Hyperparameter_Tuning_Training)
            * [Logistic Regression](#Logistic_Regression)
            * [Evaluation](#Evaluation)
* [Additional Models](#AdditionalModels)
* [Summary](#Summary)
* [Recommendations](#Recom)

## Introduction <a id = 'Introduction'></a>

## Imports <a id="Imports"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit


# remove warning
import warnings
warnings.filterwarnings("ignore")


In [2]:
X_train = pd.read_csv("../data/4.X_train.csv")
y_train = pd.read_csv("../data/4.y_train.csv")
X_test = pd.read_csv("../data/4.X_test.csv")
y_test = pd.read_csv("../data/4.y_test.csv")

In [3]:
X_train.shape,y_train.shape

((7088, 43), (7088, 1))

In [4]:
X_test.shape,y_test.shape

((3039, 43), (3039, 1))
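The two shapes above imply 10,127 rows in total, split roughly 70/30. The split itself was done upstream and is not shown here, so the following is only a sketch on synthetic data of how a stratified 30% hold-out would yield the same row counts. `X_demo`, `y_demo`, the three feature columns, and the 16% positive rate are all illustrative assumptions, not the project's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 7088 + 3039  # total rows implied by the train and test shapes

# Synthetic stand-ins (assumptions): 3 numeric features, ~16% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(n_rows, 3))
y_demo = (rng.random(n_rows) < 0.16).astype(int)

# Hold out 30% of the rows, keeping the class ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Stratifying on the label matters here because the positive class is a minority. Without it, the test set's attrition rate could drift away from the training set's by chance.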

In [5]:
X_train.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 7088.0 | -3.454018e-16 | 1.000071 | -2.531322 | -0.661337 | -0.038008 | 0.709986 | 3.327966 |
| Credit_Limit | 7088.0 | 8.597793 | 0.930669 | 7.271217 | 7.841395 | 8.418587 | 9.298671 | 10.449178 |
| Total_Revolving_Bal | 7088.0 | 0.9103482 | 0.638103 | 0.0 | 0.248822 | 1.0 | 1.396112 | 1.977219 |
| Total_Trans_Amt | 7088.0 | 8.167921 | 0.656979 | 6.390241 | 7.674617 | 8.267449 | 8.468633 | 9.824661 |
| Avg_Utilization_Ratio | 7088.0 | 0.275307 | 0.276635 | 0.0 | 0.023 | 0.175 | 0.502 | 0.999 |
| Months_on_book | 7088.0 | -3.454018e-16 | 1.000071 | -2.531322 | -0.661337 | -0.038008 | 0.709986 | 3.327966 |
| Gender_M | 7088.0 | 0.4671275 | 0.498953 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Dependent_count_1 | 7088.0 | 0.1795993 | 0.38388 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Dependent_count_2 | 7088.0 | 0.2650959 | 0.441415 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Dependent_count_3 | 7088.0 | 0.2684819 | 0.443201 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## Training and Modeling <a id=Training_and_Modeling ></a>

### Model Selection <a id=Model_Selection ></a>

Four competing supervised classification algorithms are considered. This section covers:
* Logistic Regression classification

### Training and Modeling <a id=Train_and_Model></a>

### Fit Model on Intercept

In [6]:
logistic_regression = ("model", LogisticRegression(fit_intercept=False,max_iter=500,random_state=632966))

model_params = {"model__C": (np.logspace(start=-4, stop=4, num=30))}

model_pipeline = Pipeline(steps=[logistic_regression])

#cross_validator = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
cross_validator = (StratifiedShuffleSplit(train_size=0.8, random_state=1337, n_splits=200))

# set up grid search
model_grid = (GridSearchCV(estimator=model_pipeline,
                           param_grid=model_params,
                           refit=True, 
                           scoring="roc_auc", 
                           cv=cross_validator))
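The setup above combines a log-spaced grid of 30 `C` values with 200 stratified 80/20 resamples of the training data. As a minimal sketch of what each piece produces, the snippet below builds the same grid and runs a few stratified splits on synthetic labels. `X_demo`, `y_demo`, and the 16% positive rate are illustrative assumptions, and only 5 splits are drawn to keep it quick.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Candidate regularisation strengths: 30 points, log-spaced from 1e-4 to 1e4.
c_grid = np.logspace(start=-4, stop=4, num=30)

# Synthetic labels (assumption): 160 positives out of 1000 rows.
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.ones((y_demo.shape[0], 1))

# Each split keeps 80% for fitting and the remaining 20% for scoring,
# preserving the positive rate in both parts.
splitter = StratifiedShuffleSplit(n_splits=5, train_size=0.8, random_state=1337)
fold_rates = [
    (y_demo[train_idx].mean(), y_demo[valid_idx].mean())
    for train_idx, valid_idx in splitter.split(X_demo, y_demo)
]

print(c_grid[0], c_grid[-1], len(c_grid))
print(fold_rates)
```

With 30 candidates and 200 splits, the grid search performs 6,000 fits plus one final refit, which explains why even an intercept-only model takes a noticeable amount of time.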

In [7]:
%%time
# fit model on intercept (random guesses - baseline performance)
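# the only feature is a column of ones, so with fit_intercept=False its coefficient acts as the intercept
# y_train is a one-column DataFrame; flatten it to the 1-D array scikit-learn expects
null_mod = model_grid.fit(np.ones(shape=X_train.shape[0]).reshape(-1,1), y_train.values.ravel())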

Wall time: 41.2 s


In [8]:
print("Best model parameters - null model ")
print("Cost parameter: {:.03f}".format(null_mod.best_params_["model__C"])) 
print("Best score {:0.3f}".format(null_mod.best_score_))

Best model parameters - null model 
Cost parameter: 0.000
Best score 0.500
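A score of exactly 0.500 is what an intercept-only model should give: it assigns the same probability to every customer, so it cannot rank churners above non-churners. The sketch below checks this on synthetic labels (`y_demo` and the constant 0.16 are illustrative assumptions) using `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels (assumption): 160 positives out of 1000 rows.
y_demo = np.array([1] * 160 + [0] * 840)

# An intercept-only model predicts one and the same probability for every row.
constant_scores = np.full(y_demo.shape[0], 0.16)

auc_constant = roc_auc_score(y_demo, constant_scores)
print(auc_constant)
```

Because every value of `C` ties at 0.5 in this setting, the reported "best" cost parameter is most likely just the first grid value, `1e-4`, which the `{:.03f}` format rounds to `0.000`.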


In [9]:
np.mean(y_train), np.var(y_train)

(Attrition_Numeric    0.160694
 dtype: float64,
 Attrition_Numeric    0.134872
 dtype: float64)
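For a 0/1 target the mean is the positive-class rate `p`, and the population variance is `p * (1 - p)`. The short check below plugs in the mean reported above. It also derives the accuracy of always predicting the majority class, which is an extra reference point added here for illustration rather than something computed in the notebook.

```python
# Positive-class rate, copied from the mean of Attrition_Numeric above.
p = 0.160694

# Variance of a Bernoulli(p) variable (population form, ddof=0).
bernoulli_var = p * (1 - p)

# Accuracy of a classifier that always predicts the majority class (0).
majority_accuracy = 1 - p

print(bernoulli_var, majority_accuracy)
```

The variance agrees with the reported 0.134872 up to rounding of `p`. The high majority-class accuracy is one reason a ranking metric such as ROC AUC is a more informative score than accuracy for this target.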

In [10]:
# reuse model_pipeline and model_params defined in cell [6]

cross_validator_test = (StratifiedShuffleSplit(train_size=0.2, random_state=1337, n_splits=100))

# set up grid search
model_grid_test = (GridSearchCV(estimator=model_pipeline,
                           param_grid=model_params,
                           refit=True, 
                           scoring="roc_auc", 
                           cv=cross_validator_test))

In [11]:
%%time
# fit model on intercept (random guesses - baseline performance)
# y_test is a one-column DataFrame; flatten it to the 1-D array scikit-learn expects
null_mod = model_grid_test.fit(np.ones(shape=X_test.shape[0]).reshape(-1,1), y_test.values.ravel())


In [12]:
print("Best model parameters - null model ")
print("Cost parameter: {:.03f}".format(null_mod.best_params_["model__C"])) 
print("Best score {:0.3f}".format(null_mod.best_score_))

Best model parameters - null model 
Cost parameter: 0.000
Best score 0.500


## Refining The Linear Model

## Additional Models: <a id=AdditionalModels></a>

## Summary <a id =Summary> </a>

## Recommendations <a id = Recom></a>