### Cross Validation

#### What is it?
 - Cross validation means building one or more supplementary validation splits out of the training data, in addition to a conventional train/test split, to get a better idea of how a predictive model will perform on "unseen" data before ultimately picking the best model to put into production.
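
As a rough illustration of those supplementary splits, here is a minimal sketch using scikit-learn's `KFold` on ten made-up rows. The toy array and the choice of five shuffled folds are assumptions for the example, not part of the cars walkthrough.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified only by their row index.
X = np.arange(10).reshape(-1, 1)

# Five folds: each fold holds out 2 rows for validation and trains on the other 8.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

validation_rows = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
    validation_rows.extend(val_idx.tolist())

# Across the five folds, every row is held out for validation exactly once.
assert sorted(validation_rows) == list(range(10))
```

Each fold gives one validation score, so the model is judged on several held-out slices instead of a single one.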

When we want to pick the "best" model, we want to examine the best set of parameters and hyperparameters for our models in relation to our data. To do this in an efficient and programmatic way, we can rely on two concepts: cross validation and grid search.

Grid search lets us choose the best set of hyperparameters for our model at a glance by automating a large number of training runs, one for each combination of candidate values.

- Let's examine this on a quick runthrough of a pipeline.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [2]:
# first, let's load up the cars dataset:
cars = pd.read_csv('cars.csv')

In [3]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297899 entries, 0 to 297898
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   price    297899 non-null  int64 
 1   year     297899 non-null  int64 
 2   mileage  297899 non-null  int64 
 3   city     297899 non-null  object
 4   state    297899 non-null  object
 5   vin      297899 non-null  object
 6   make     297899 non-null  object
 7   model    297899 non-null  object
dtypes: int64(3), object(5)
memory usage: 18.2+ MB


In [4]:
# This is a dataset of vehicles
# with sales information.
# Some cars have probably sold unusually high.
# Can we identify which cars have sold high?
# Can we make a classifier if that is the case?

In [5]:
# how do I define sold high?

In [6]:
cars.head()

Unnamed: 0,price,year,mileage,city,state,vin,make,model
0,16472,2015,18681,Jefferson City,MO,KL4CJBSBXFB267643,Buick,EncoreConvenience
1,15749,2015,27592,Highland,IN,KL4CJASB5FB245057,Buick,EncoreFWD
2,16998,2015,13650,Boone,NC,KL4CJCSB0FB264921,Buick,EncoreLeather
3,15777,2015,25195,New Orleans,LA,KL4CJASB4FB217542,Buick,EncoreFWD
4,16784,2015,22800,Las Vegas,NV,KL4CJBSB3FB166881,Buick,EncoreConvenience


In [7]:
# let's try to make an assessment of cars
# that are alike and see if we can predict
# what makes a car sell high

In [8]:
cars.groupby(['make', 'model','year']).price.mean()

make     model          year
AM       General        1997     62489.250000
                        1998     47499.500000
                        1999     48097.500000
                        2000     58658.142857
                        2001     71748.000000
                                    ...      
Porsche  PanameraTurbo  2013     72924.000000
                        2014     81624.333333
                        2015     88990.000000
                        2017    148993.333333
         Panamerabase   2013     43296.833333
Name: price, Length: 5833, dtype: float64

In [9]:
# let's put that mean price back into the original
# dataframe:
cars['mean_price'] = cars.groupby(
    ['make','model','year']).price.transform('mean')
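
The difference between aggregating and `transform` is easy to miss, so here is a small sketch on made-up data. The toy frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "make": ["Buick", "Buick", "Buick", "Ford"],
    "price": [10000, 20000, 30000, 5000],
})

# Aggregating collapses the data to one row per group ...
per_group = toy.groupby("make").price.mean()

# ... while transform keeps the original index, so the result can be
# assigned straight back to the dataframe as a new column.
toy["mean_price"] = toy.groupby("make").price.transform("mean")

print(per_group)
print(toy)
```

Because the transformed result lines up row for row with the original frame, no merge is needed to attach the group mean to each car.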

In [10]:
cars.head()

Unnamed: 0,price,year,mileage,city,state,vin,make,model,mean_price
0,16472,2015,18681,Jefferson City,MO,KL4CJBSBXFB267643,Buick,EncoreConvenience,17291.768786
1,15749,2015,27592,Highland,IN,KL4CJASB5FB245057,Buick,EncoreFWD,16721.350598
2,16998,2015,13650,Boone,NC,KL4CJCSB0FB264921,Buick,EncoreLeather,19080.632911
3,15777,2015,25195,New Orleans,LA,KL4CJASB4FB217542,Buick,EncoreFWD,16721.350598
4,16784,2015,22800,Las Vegas,NV,KL4CJBSB3FB166881,Buick,EncoreConvenience,17291.768786


In [11]:
cars['sold_high'] = (cars.price > cars['mean_price']
                    ).astype(int)

In [12]:
cars['sold_high'].value_counts()

sold_high
0    158520
1    139379
Name: count, dtype: int64

In [13]:
num_feats = [col for col in cars.columns if (
    np.issubdtype(cars[col].dtype, np.number
    ) and cars[col].nunique() > 25)]

In [14]:
num_feats

['price', 'mileage', 'mean_price']

In [15]:
# mean_price was used to define the target, so it would not be appropriate to predict on
num_feats.remove('mean_price')

In [16]:
cars.columns

Index(['price', 'year', 'mileage', 'city', 'state', 'vin', 'make', 'model',
       'mean_price', 'sold_high'],
      dtype='object')

In [17]:
# city may be too granular, so drop it
# vin is unique to each car, so it cannot be a driver of the target
# mean_price is not usable because of the way
# we defined the target
# price is likewise tied directly to the target

In [18]:
cars = cars.drop(columns=['city','vin','price', 'mean_price'])

In [19]:
cat_cols = [col for col in cars.columns if col not in num_feats]

In [20]:
cat_cols

['year', 'state', 'make', 'model', 'sold_high']

In [21]:
# remove the target
cat_cols.remove('sold_high')

In [22]:
# let's quickly encode the categorical features
# I'm going to ignore ordinality for speed on this
# specific runthrough (will go back after MVP)
from sklearn.preprocessing import LabelEncoder

In [23]:
for col in ['state', 'make', 'model']:
    encoder = LabelEncoder()
    cars[col] = encoder.fit_transform(cars[col])
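
As a possible alternative when going back after the MVP: scikit-learn documents `LabelEncoder` as intended for target labels, while `OrdinalEncoder` is the feature-side counterpart and handles several columns in one call. A minimal sketch on made-up rows (the toy frame is an assumption, not the cars data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical features; values are made up for illustration.
toy = pd.DataFrame({
    "state": ["MO", "IN", "NC", "IN"],
    "make": ["Buick", "Ford", "Buick", "Acura"],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[["state", "make"]])

# categories_ holds the sorted categories learned for each column;
# each value is replaced by its position in that sorted list.
print(encoder.categories_)
print(encoded)
```

Either way, the integer codes impose an arbitrary order on the categories, which is the ordinality caveat mentioned in the cell above.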

In [24]:
cars.head()

Unnamed: 0,year,mileage,state,make,model,sold_high
0,2015,18681,28,7,523,0
1,2015,27592,19,7,525,0
2,2015,13650,32,7,526,0
3,2015,25195,22,7,525,0
4,2015,22800,38,7,523,0


In [25]:
# previously:
# train_val, test = train_test_split(cars, stratify=cars['sold_high'],
# random_state=1349, train_size=0.8)
# train, validate = train_test_split(train_val, stratify=train_val['sold_high'],
# random_state=1349, train_size=0.7)

In [26]:
# using cross_val_score:

In [27]:
# split into train and test:

In [28]:
X, y = cars.drop(columns='sold_high'),\
cars[['sold_high']]

In [29]:
# we can feed more than one dataset into a
# single call of train_test_split
# it will give us the split datasets back
# in order (thing1.train, thing1.test, thing2.train, thing2.test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1349)
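
The split in this cell does not pass `stratify`, which is reasonable here because the two classes are close to balanced. For a more lopsided target, a stratified split keeps the class ratio the same in both pieces. A minimal sketch on synthetic labels (the 80/20 imbalance is an assumption for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80% zeros, 20% ones.
y = np.array([0] * 800 + [1] * 200)
X = np.arange(len(y)).reshape(-1, 1)

# stratify takes the label values themselves, not a column name.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=1349, stratify=y)

# Share of positives in the training piece and in the test piece.
print(y_tr.mean(), y_te.mean())
```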

In [30]:
X_train.shape, X_test.shape

((238319, 5), (59580, 5))

In [31]:
y_train.shape, y_test.shape

((238319, 1), (59580, 1))

In [32]:
# a model object is what we need to proceed
# let's make one of those:
from sklearn.tree import DecisionTreeClassifier

In [35]:
clf = DecisionTreeClassifier(max_depth=4)
# note we are not fitting this model
# on our training set before
# we feed it into cross_val_score:
# the function clones the model, splits the training data
# into cv folds, and fits and scores once per fold
cross_val_score(clf, X_train, y_train, cv=5)

array([0.63714753, 0.64090299, 0.63574186, 0.63544814, 0.63743365])
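
To make it concrete what `cross_val_score` does with an unfitted model, here is a sketch of a roughly equivalent manual loop. It runs on synthetic data from `make_classification` rather than the cars data, and fixes `random_state` on the tree so both runs are reproducible. Both are assumptions for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)

# For a classifier, an integer cv means unshuffled stratified folds.
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_model = clone(clf)                      # fresh, unfitted copy
    fold_model.fit(X[train_idx], y[train_idx])   # fit on four of the five folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(clf, X, y, cv=5)
print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

The original `clf` is never fitted in either version, since each fold works on a clone.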

In [34]:
# on its face, we can use this
# with just a train/test split:
# the validation scores across the folds
# give us a reasonable expectation
# of accuracy, and of any drop-off, on unseen data

Let's go a little further though:

In [38]:
# let's do a grid search:
# define a parameter grid:
# a parameter grid will be a dictionary
# of whatever hyperparameters you want to check,
# contingent on your specific model type:
# for a decision tree classifier,
# it'll look a little like this:
param_grid = {
    'max_depth': [None,10, 4, 3, 2],
    'min_samples_leaf': [1, 3, 5, 20],
    'criterion': ['gini', 'entropy'],
}
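
To get a feel for how much work a grid implies, scikit-learn's `ParameterGrid` can enumerate the combinations without training anything. A small sketch using the same grid (the fold count of 5 is `GridSearchCV`'s default):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [None, 10, 4, 3, 2],
    'min_samples_leaf': [1, 3, 5, 20],
    'criterion': ['gini', 'entropy'],
}

# Every combination of the listed values: 5 * 4 * 2 candidates.
combos = list(ParameterGrid(param_grid))
n_folds = 5  # GridSearchCV's default cv

print(len(combos))            # number of candidate settings
print(len(combos) * n_folds)  # number of cross-validation fits
print(combos[0])              # first candidate the search will try
```

The number of fits grows multiplicatively with every list added to the grid, which is worth checking before launching a search on a large dataset.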

In [39]:

gsearch = GridSearchCV(DecisionTreeClassifier(),
                      param_grid)

In [40]:
gsearch

In [41]:
gsearch.fit(X_train, y_train)
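
Once fitted, the search object exposes the winner directly, before digging into `cv_results_`. A minimal sketch on synthetic data with a deliberately small grid; the data, the grid and the fixed `random_state` are assumptions so the example runs quickly without `cars.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs without cars.csv.
X, y = make_classification(n_samples=600, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1349)

small_grid = {'max_depth': [2, 4, None], 'min_samples_leaf': [1, 20]}
gsearch = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=5)
gsearch.fit(X_train, y_train)

print(gsearch.best_params_)  # winning combination
print(gsearch.best_score_)   # its mean cross-validated accuracy

# refit=True (the default) retrains the winner on all of X_train,
# so the search object can score the held-out test set directly.
test_accuracy = gsearch.score(X_test, y_test)
print(test_accuracy)
```

The test set is only touched once, at the very end, which keeps it a fair stand-in for unseen data.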

In [44]:
results = gsearch.cv_results_

In [45]:
results.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_criterion', 'param_max_depth', 'param_min_samples_leaf', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [47]:
results_df_init = pd.DataFrame(results)

In [49]:
results_df_init.shape

(40, 16)

In [52]:
params = pd.DataFrame(results['params'])

In [55]:
params.head(2)

Unnamed: 0,criterion,max_depth,min_samples_leaf
0,gini,,1
1,gini,,3


In [57]:
results_df_init.head(2)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_min_samples_leaf,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.589163,0.011214,0.013424,0.000782,gini,,1,"{'criterion': 'gini', 'max_depth': None, 'min_...",0.616671,0.615643,0.612852,0.617447,0.617859,0.616094,0.001788,32
1,0.554305,0.003057,0.01177,0.000326,gini,,3,"{'criterion': 'gini', 'max_depth': None, 'min_...",0.622692,0.622273,0.621433,0.623636,0.623419,0.622691,0.000797,29


In [60]:
splits = [col for col in results.keys() if 'split' in col]

In [63]:
chopped = pd.concat([params, results_df_init[splits]],axis=1)

In [70]:
# make a new column that says what algorithm was used:
chopped['model_type'] = 'decision_tree'

In [72]:
chopped.head(5)

Unnamed: 0,criterion,max_depth,min_samples_leaf,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,model_type
0,gini,,1,0.616671,0.615643,0.612852,0.617447,0.617859,decision_tree
1,gini,,3,0.622692,0.622273,0.621433,0.623636,0.623419,decision_tree
2,gini,,5,0.632133,0.630707,0.630958,0.632972,0.633342,decision_tree
3,gini,,20,0.660184,0.66096,0.661023,0.660519,0.660743,decision_tree
4,gini,10.0,1,0.67401,0.67573,0.674136,0.673863,0.675094,decision_tree


In [73]:
# concatenate the results for any new model types onto
# this dataframe (or onto a dataframe with congruent
# hyperparameter columns) and sort by score to begin to
# select which model works best
# for your use case!
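
One way that concatenation could look, as a sketch: the helper `tidy_results` is a hypothetical name introduced here, and the second model type (k-nearest neighbors), the tiny grids and the synthetic data are all assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def tidy_results(search, model_type):
    """Hypothetical helper: one row per parameter combination."""
    results = search.cv_results_
    tidy = pd.DataFrame(results['params'])
    tidy['mean_test_score'] = results['mean_test_score']
    tidy['model_type'] = model_type
    return tidy


X, y = make_classification(n_samples=400, n_features=5, random_state=0)

tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), {'max_depth': [2, 4]}, cv=5
).fit(X, y)
knn_search = GridSearchCV(
    KNeighborsClassifier(), {'n_neighbors': [3, 7]}, cv=5
).fit(X, y)

# Hyperparameters that do not apply to a model type show up as NaN.
leaderboard = pd.concat(
    [tidy_results(tree_search, 'decision_tree'),
     tidy_results(knn_search, 'knn')],
    ignore_index=True,
).sort_values('mean_test_score', ascending=False)
print(leaderboard)
```

Sorting on the mean score puts the strongest candidate of any model type at the top, with the per-model hyperparameter columns side by side.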

Conclusion: automation is nice!
 - Make sure you know what you're talkin' about, though :)