# 4 Training and Modeling Data<a id='4_Training_and_Modeling_Data'></a>

## Contents <a id ="Content" > </a>

* [Introduction](#Introduction)
* [Imports](#Imports)
* [Train Test Split](#Train_Test_Split) 
* [Training and Modeling](#Training_and_Modeling)
    * [Model Selection](#Model_Selection)
    * [Evaluation Metrics](#Evaluation-Metrics)
        * [Training and Modeling](#Train_and_Model)
        * [Hyperparameter Tuning and Model Training](#Hyperparameter_Tuning_Training)
            * [Logistic Regression](#Logistic_Regression)
            * [Evaluation](#Evaluation)
* [Additional Models](#AdditionalModels)
* [Summary](#Summary)
* [Recommendations](#Recom)

## Introduction <a id = 'Introduction'></a>

## Imports <a id="Imports"></a>

In [25]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Suppress all warnings (this also hides sklearn's ConvergenceWarning, so use with care)
import warnings
warnings.filterwarnings("ignore")


In [2]:
X_train = pd.read_csv("../data/4.X_trained_df.csv")
y_train = pd.read_csv("../data/4.y_trained.csv")
X_test = pd.read_csv("../data/4.X_test.csv")
y_test = pd.read_csv("../data/4.y_test.csv")

In [3]:
X_train.shape,y_train.shape

((7088, 43), (7088, 1))

In [4]:
X_test.shape,y_test.shape

((3039, 43), (3039, 1))

In [5]:
X_train.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 7088.0 | -3.454018e-16 | 1.000071 | -2.531322 | -0.661337 | -0.038008 | 0.709986 | 3.327966 |
| Credit_Limit | 7088.0 | 8.597793 | 0.930669 | 7.271217 | 7.841395 | 8.418587 | 9.298671 | 10.449178 |
| Total_Revolving_Bal | 7088.0 | 0.9103482 | 0.638103 | 0.0 | 0.248822 | 1.0 | 1.396112 | 1.977219 |
| Total_Trans_Amt | 7088.0 | 8.167921 | 0.656979 | 6.390241 | 7.674617 | 8.267449 | 8.468633 | 9.824661 |
| Avg_Utilization_Ratio | 7088.0 | 0.275307 | 0.276635 | 0.0 | 0.023 | 0.175 | 0.502 | 0.999 |
| Months_on_book | 7088.0 | 46.30488 | 8.022018 | 26.0 | 41.0 | 46.0 | 52.0 | 73.0 |
| Gender_M | 7088.0 | 0.4671275 | 0.498953 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Dependent_count_1 | 7088.0 | 0.1795993 | 0.38388 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Dependent_count_2 | 7088.0 | 0.2650959 | 0.441415 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Dependent_count_3 | 7088.0 | 0.2684819 | 0.443201 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## Training and Modeling <a id=Training_and_Modeling ></a>

### Model Selection <a id=Model_Selection ></a>

Four competing supervised classification algorithms are considered. The one covered in this section is:
* Logistic Regression

### Training and Modeling <a id=Train_and_Model></a>

### Initial Not-Even-A-Model

In [6]:
#Calculate the mean of `y_train`
train_mean = y_train.mean()
train_mean

Attrition_Numeric    0.160694
dtype: float64

In [7]:
train_var = y_train.var()
train_var

Attrition_Numeric    0.134891
dtype: float64
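Because `Attrition_Numeric` is a 0/1 label, its mean is simply the share of attrited customers (about 16%), and its sample variance follows from that share alone. The short check below is a sketch that uses only the figures printed above; the variable names `n`, `p`, and `sample_var` are illustrative and not part of the notebook.

```python
# For a 0/1 target with positive rate p over n rows, the sample variance
# (ddof=1, the pandas default for .var()) equals p * (1 - p) * n / (n - 1).
n = 7088       # rows in y_train, taken from the shape printed earlier
p = 0.160694   # mean of Attrition_Numeric, taken from the output above

sample_var = p * (1 - p) * n / (n - 1)
print(f"{sample_var:.6f}")
```

Since `p` is itself a rounded value, the result should agree with the printed variance only up to rounding in the last digit.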

In [19]:
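# Grid-search helper: tunes `model` using cross-validated ROC AUC
def grid_search(X_train, y_train, parameters, model):
    scoring = 'roc_auc'
    verbose = 1
    clf_pipeline = Pipeline([("clf", model)])
    clf_grid = GridSearchCV(clf_pipeline, parameters, scoring=scoring, verbose=verbose)
    # y_train is a one-column DataFrame; flatten it to the 1-D array sklearn expects
    clf_grid.fit(X_train, np.ravel(y_train))

    print("Best parameters for ", model)
    print(clf_grid.best_params_)
    # best_score_ is the mean cross-validated score, not a score on the full training set
    print(f"\nBest cross-validated {scoring} score: {clf_grid.best_score_:.2f}")
    return clf_grid.best_estimator_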

In [20]:
#Setting ranges for each hyperparameter.
log_params = {"clf__solver": ["lbfgs", "sag", "saga"],
               "clf__C": np.arange(0.1,2,0.1), 
               "clf__class_weight": ["balanced", None]
              }

In [None]:


classifier = LogisticRegression(fit_intercept=True,max_iter=500,random_state=632966)
log_best_estimator = grid_search(X_train,y_train, log_params,classifier)
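The score reported by a grid search is a cross-validated estimate on the training data, so the refitted estimator still needs to be scored on held-out rows. The sketch below shows one way to do that with ROC AUC. It runs on synthetic data from `make_classification` (an assumption standing in for the real churn files), and the small parameter grid is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the churn data (assumption): roughly 16% positives
X, y = make_classification(n_samples=600, n_features=8, weights=[0.84], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Same pipeline layout as the helper above: a single step named "clf"
pipe = Pipeline([("clf", LogisticRegression(max_iter=500))])
params = {"clf__C": [0.1, 1.0, 10.0], "clf__class_weight": ["balanced", None]}

grid = GridSearchCV(pipe, params, scoring="roc_auc", cv=5)
grid.fit(X_tr, y_tr)

# ROC AUC needs scores, not hard labels: use the positive-class probability
test_scores = grid.best_estimator_.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_scores)

print("best params:", grid.best_params_)
print(f"cross-validated AUC: {grid.best_score_:.3f}, test AUC: {test_auc:.3f}")
```

A large gap between the cross-validated score and the test score would suggest the tuning overfit the training folds.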

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Single-step pipeline: logistic regression with a fixed seed
logistic_regression = ("model", LogisticRegression(fit_intercept=True,max_iter=500,random_state=632966))

# Inverse regularization strength C: 30 values, log-spaced from 1e-4 to 1e4
model_params = {"model__C": (np.logspace(start=-4, stop=4, num=30))}

model_pipeline = Pipeline(steps=[logistic_regression])

# 5 stratified random splits; each holds out half of the training data for validation
cross_validator = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
# set up grid search
model_grid = (
  GridSearchCV(estimator=model_pipeline,
                           param_grid=model_params,
                           refit=True, 
                           scoring="f1_weighted", 
                           cv=cross_validator))


# Fit the grid search on all training features. Despite the name `null_mod`,
# this is a tuned logistic regression, not an intercept-only baseline.
null_mod = model_grid.fit(X_train, np.ravel(y_train))
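`StratifiedShuffleSplit` draws random train/validation splits while keeping the class ratio roughly constant in each part, which matters when only about 16% of labels are positive. The sketch below illustrates this on toy labels (an assumption, not the notebook's data) with the same splitter settings as above.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels with a 16% positive rate (assumption); features do not affect the split
y = np.array([1] * 16 + [0] * 84)
X = np.zeros((len(y), 1))

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

# Positive rate in the training part and the validation part of each split
rates = [(y[tr].mean(), y[va].mean()) for tr, va in splitter.split(X, y)]
for train_rate, valid_rate in rates:
    print(f"train: {train_rate:.2f}  validation: {valid_rate:.2f}")
```

With `test_size=0.5`, every candidate model is trained on only half of the training rows, which makes each fit cheaper but the score estimates somewhat more pessimistic than a 5-fold scheme would give.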


In [13]:
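# Baseline: predict the training mean for every row of the train and test sets
y_tr_pred = np.array([train_mean] * len(y_train))
y_te_pred = np.array([train_mean] * len(y_test))
y_tr_pred[:5]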

array([[0.16069413],
       [0.16069413],
       [0.16069413],
       [0.16069413],
       [0.16069413]])
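For a classification target, the closest analogue of "always predict the mean" is a classifier that ignores the features and predicts from the class prior. The sketch below uses scikit-learn's `DummyClassifier` on toy labels with a 16% positive rate (an assumption standing in for `y_train`) to show what such a baseline scores on common classification metrics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy labels with a 16% positive rate (assumption); features are ignored by the baseline
y = np.array([1] * 16 + [0] * 84)
X = np.zeros((len(y), 1))

# strategy="prior" always predicts the majority class and returns
# the class frequencies as predicted probabilities
baseline = DummyClassifier(strategy="prior").fit(X, y)
pred = baseline.predict(X)
proba = baseline.predict_proba(X)[:, 1]

acc = accuracy_score(y, pred)    # share of the majority class
rec = recall_score(y, pred)      # no positive is ever predicted
auc = roc_auc_score(y, proba)    # constant scores cannot rank rows

print(f"accuracy={acc:.2f}  recall={rec:.2f}  roc_auc={auc:.2f}")
```

The high accuracy alongside zero recall is the reason accuracy alone is a poor yardstick on an imbalanced target; any real model should be judged against these baseline numbers.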

## Metrics

In [None]:
#R-squared, or coefficient of determination
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

In [None]:
#Mean Absolute Error
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)

In [None]:
#Mean Squared Error
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
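When the prediction is a constant equal to the mean of a 0/1 target with positive rate p, these three metrics have closed forms on the data the mean was computed from: R² is 0, the mean absolute error is 2p(1 - p), and the mean squared error is p(1 - p). The sketch below checks this on toy labels (an assumption, not the notebook's data).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy 0/1 target with positive rate p = 0.16 (assumption)
y = np.array([1.0] * 16 + [0.0] * 84)
p = y.mean()

# Constant prediction: the mean of the target for every row
pred = np.full_like(y, p)

r2 = r2_score(y, pred)
mae = mean_absolute_error(y, pred)
mse = mean_squared_error(y, pred)

print(f"R2={r2:.4f}  MAE={mae:.4f} (2p(1-p)={2 * p * (1 - p):.4f})  "
      f"MSE={mse:.4f} (p(1-p)={p * (1 - p):.4f})")
```

On the test set the same constant generally gives an R² slightly below 0, because the test mean differs a little from the training mean.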

## Refining The Linear Model

## Additional Models: <a id=AdditionalModels></a>

## Summary <a id =Summary> </a>

## Recommendations <a id = Recom></a>