<br>
<br>
<br>
<br>

# DAV 6150 Module 2: End-to-End Machine Learning Project Process Flow
<br>
<br>
<br>

## What is the Life Cycle of a Typical Data Science Project?

__Step 1__: Define a question to be investigated

__Step 2__: Identify + acquire the data you believe will help you answer the question(s) you have

__Step 3__: Exploratory Data Analysis (EDA)

__Step 4__: Data Preparation (e.g., "cleansing" your data of extraneous content; transforming the data into a format / structure that is appropriate for the problem at hand; deciding how to handle missing data values; etc.)

__Step 5__: Data Splitting (e.g., separating data into training + validation + testing subsets as needed)

__Step 6__: Model Training + Selection (including selecting which type(s) of model(s) to use)

__Step 7__: Model Testing

__Step 8__: Communicate your findings + revise model and/or data as needed (repeat Steps 2 - 7 as necessary)

__Step 9__: Model Deployment

## Data Splitting: Training, Evaluation / Validation, and Testing Subsets

- Machine learning models should be trained and tested on distinct subsets of the available data.


- Rule of thumb: set aside a substantive percentage of the available data for model testing, e.g., 20% - 35%.


- There is no "ideal" training/testing split. In general, the more data you have available, the more you can safely set aside for model testing without starving the training process and producing an ineffective model.


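- Use __random sampling__ to create the training and testing subsets unless you are working with temporal data (e.g., time series data), whose observations are sequential in nature and typically correlated with observations from neighboring time periods. In that case, split the data chronologically so that the model is tested on observations that occur after those used for training.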


- A portion of the training subset (e.g., a random sample of 5% - 10% of the training subset) should then be set aside for purposes of evaluating / validating the results of the model training process (hence the name __Evaluation__ or __Validation__ subset).


- While this approach can be effective, partitioning the available training data into training and validation subsets can drastically reduce the number of samples which can be used for training the model. If we have too few samples available for model training the resulting model may end up being overfit to the random samples we generate for the training and validation subsets.


- We can avoid the need for the creation of a distinct evaluation / validation subset (and thereby improve the strength of our models) through the use of __cross validation__.
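The training / validation / testing partition described above can be sketched by calling __train_test_split()__ twice. This is a minimal illustration on synthetic data, not part of the flight example: the array names, the 20% testing share, and the 10% validation share are all assumptions chosen for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data (an assumption): 1,000 observations, 3 explanatory variables
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# first split: hold out 20% of all observations as the testing subset
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# second split: hold out 10% of the remaining training data as the validation subset
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.1, random_state=42)

# report the size of each subset
print(len(X_train), len(X_val), len(X_test))
```

Note how the validation subset is carved out of the training data only, so the testing subset remains untouched until the final model evaluation.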


### Automated Data Splitting via scikit-learn: An Example

scikit-learn's __train_test_split()__ function can be used to create training and testing subsets. In typical supervised learning use, you first __separate the response variable for the model from the explanatory variables to be used__ for training the model, and then pass both objects to the function. An example is shown below.


The data set used in the example consists of information related to airline flights departing from the two major airports in Houston, Texas. A summary / description of the data set can be found here: https://cran.r-project.org/web/packages/hflights/hflights.pdf. For purposes of this brief example, let's assume we want to construct a model that predicts the arrival delay for any given flight departing from either of the two airports. Effective use of the __train_test_split()__ function will require us to first segregate the arrival delay variable from the rest of the data attributes within the data set.


In [127]:
# load the pandas library
import pandas as pd

# load the train_test_split function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# start by reading a set of sample data from github. This data set contains information related to flights
# departing from the two major airports in Houston, Texas
filename = "https://raw.githubusercontent.com/jtopor/DAV-5400/master/Project1/hflights.csv"
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
0,2011,1,1,6,1400.0,1500.0,AA,428,N576AA,60.0,...,-10.0,0.0,IAH,DFW,224,7.0,13.0,0,,0
1,2011,1,2,7,1401.0,1501.0,AA,428,N557AA,60.0,...,-9.0,1.0,IAH,DFW,224,6.0,9.0,0,,0
2,2011,1,3,1,1352.0,1502.0,AA,428,N541AA,70.0,...,-8.0,-8.0,IAH,DFW,224,5.0,17.0,0,,0
3,2011,1,4,2,1403.0,1513.0,AA,428,N403AA,70.0,...,3.0,3.0,IAH,DFW,224,9.0,22.0,0,,0
4,2011,1,5,3,1405.0,1507.0,AA,428,N492AA,62.0,...,-3.0,5.0,IAH,DFW,224,9.0,9.0,0,,0


In [12]:
# how many observations are contained within the example data set?
len(df)

20000

In [121]:
# check for missing ArrDelay values
df['ArrDelay'].isnull().sum()

262

In [133]:
# find the index labels of the rows with missing ArrDelay values
indexNames = df[df['ArrDelay'].isnull()].index

# delete the rows having those index labels from the dataframe
df.drop(indexNames, inplace=True)

In [134]:
len(df)

19738
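As an aside, the same row removal can usually be written more compactly with pandas' `dropna()` method and its `subset` argument. The sketch below uses a small hypothetical dataframe (`demo`) rather than the flight data so that it can be run on its own.

```python
import numpy as np
import pandas as pd

# small hypothetical dataframe standing in for the flight data
demo = pd.DataFrame({
    "ArrDelay": [-10.0, np.nan, 3.0, np.nan, -3.0],
    "DepDelay": [0.0, 1.0, 3.0, 5.0, 5.0],
})

# drop every row whose ArrDelay value is missing; other columns are not checked
cleaned = demo.dropna(subset=["ArrDelay"])

# compare the row counts before and after the removal
print(len(demo), len(cleaned))
```

Unlike the `inplace=True` approach used above, this version returns a new dataframe and leaves the original unchanged.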

In [135]:
# check for missing DepDelay values
df['DepDelay'].isnull().sum()

0

In [136]:
# check for missing TaxiOut values
df['TaxiOut'].isnull().sum()

0

In [137]:
# move the response variable (in this case "ArrDelay") to a separate variable
y = df.ArrDelay

In [138]:
# check results
y.head()

0   -10.0
1    -9.0
2    -8.0
3     3.0
4    -3.0
Name: ArrDelay, dtype: float64

In [139]:
# If you want to preserve the original dataframe in its entirety, make a copy of the original dataframe 
# so that the original is preserved
X = df.copy()

In [140]:
# now drop the ArrDelay column from 'X' so that the response variable is removed from the explanatory variables
X.drop('ArrDelay', axis=1, inplace=True)

In [141]:
# The ArrDelay column has been removed from the data set
X.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,AirTime,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
0,2011,1,1,6,1400.0,1500.0,AA,428,N576AA,60.0,40.0,0.0,IAH,DFW,224,7.0,13.0,0,,0
1,2011,1,2,7,1401.0,1501.0,AA,428,N557AA,60.0,45.0,1.0,IAH,DFW,224,6.0,9.0,0,,0
2,2011,1,3,1,1352.0,1502.0,AA,428,N541AA,70.0,48.0,-8.0,IAH,DFW,224,5.0,17.0,0,,0
3,2011,1,4,2,1403.0,1513.0,AA,428,N403AA,70.0,39.0,3.0,IAH,DFW,224,9.0,22.0,0,,0
4,2011,1,5,3,1405.0,1507.0,AA,428,N492AA,62.0,44.0,5.0,IAH,DFW,224,9.0,9.0,0,,0


In [142]:
# Now split the data into training and testing subsets.
# We'll set aside 30% of the data for testing purposes. Remember to specify a value for the initial random_state
# if you want to be able to reproduce the exact same training + testing subsets repeatedly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

In [143]:
# Let's check the results
len(X_train)

13816

In [144]:
len(X_test)

5922

In [145]:
# Let's check the row indices of the new objects to see whether they match
y_test.head()

12934    -8.0
11353   -14.0
6490    -14.0
14765    -2.0
18634    -6.0
Name: ArrDelay, dtype: float64

In [146]:
X_test.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,AirTime,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
12934,2011,1,28,5,1151.0,1243.0,XE,2895,N12946,52.0,40.0,1.0,IAH,CRP,201,4.0,8.0,0,,0
11353,2011,1,25,2,2125.0,2216.0,WN,55,N515SW,51.0,43.0,-5.0,HOU,HRL,276,3.0,5.0,0,,0
6490,2011,1,10,1,640.0,916.0,DL,1960,N368NB,156.0,136.0,-2.0,IAH,MSP,1034,7.0,13.0,0,,0
14765,2011,1,19,3,2113.0,2241.0,XE,2192,N12921,88.0,69.0,-2.0,IAH,MAF,429,7.0,12.0,0,,0
18634,2011,1,2,7,1222.0,1342.0,XE,2962,N11191,140.0,118.0,-3.0,IAH,COS,809,7.0,15.0,0,,0


In [147]:
X_train.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,AirTime,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
17786,2011,1,5,3,1421.0,1549.0,XE,2375,N29917,88.0,68.0,-9.0,IAH,TUL,429,5.0,15.0,0,,0
19830,2011,2,25,5,1611.0,2029.0,CO,210,N57111,198.0,160.0,21.0,IAH,EWR,1400,11.0,27.0,0,,0
4096,2011,1,12,3,1426.0,1640.0,CO,370,N24212,254.0,230.0,1.0,IAH,SFO,1635,7.0,17.0,0,,0
11877,2011,1,29,6,730.0,824.0,WN,2727,N406WN,54.0,44.0,0.0,HOU,MSY,303,4.0,6.0,0,,0
13128,2011,1,27,4,2125.0,2221.0,XE,2648,N14158,116.0,103.0,5.0,IAH,ABQ,744,5.0,8.0,0,,0


In [148]:
y_train.head()

17786    -7.0
19830     1.0
4096    -11.0
11877    -6.0
13128   -12.0
Name: ArrDelay, dtype: float64

The data set has successfully been split into training and testing subsets. We first separated the response variable from the explanatory variables and then used the train_test_split() function to randomly sample items from the data set for inclusion in our training and testing subsets. We have set aside 30% of the data for testing purposes.

__NOTE__: For this simple example we opted to explicitly make a copy of the original data set and delete the response variable from that copy. However, an alternative approach would be to simply create a new dataframe object containing only those attributes you plan to make use of as explanatory variables within your model. There is no need to explicitly create a copy of the entire data set every time you want to split your data into training + testing subsets. However, if you want to make use of the __train_test_split()__ function you must always be able to somehow clearly delineate your response variable from your explanatory variables. 
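The alternative mentioned in the note above might look like the following sketch. It uses a small hypothetical dataframe (`flights`) with made-up values so that it runs on its own; with the real data you would select columns from `df` in the same way.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# small hypothetical dataframe standing in for the flight data
flights = pd.DataFrame({
    "ArrDelay": [-10.0, -9.0, -8.0, 3.0, -3.0, 12.0, 0.0, 7.0, -1.0, 4.0],
    "DepDelay": [0.0, 1.0, -8.0, 3.0, 5.0, 15.0, 2.0, 9.0, 0.0, 6.0],
    "TaxiOut": [13.0, 9.0, 17.0, 22.0, 9.0, 11.0, 14.0, 10.0, 8.0, 12.0],
    "Dest": ["DFW"] * 10,
})

# select only the explanatory variables we intend to use in the model
X_small = flights[["DepDelay", "TaxiOut"]]

# the response variable is kept in its own Series
y_small = flights["ArrDelay"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.3, random_state=12)

# the row indices of the explanatory and response objects stay aligned
print(X_te.index.equals(y_te.index))
```

Selecting columns this way avoids copying attributes that the model will never use, while still keeping the response variable clearly separated from the explanatory variables.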

## Cross Validation

Cross-validation uses __resampling__ of training data to evaluate the performance of machine learning models on a limited data sample. This allows us to avoid the need for the creation of a distinct evaluation / validation subset.


The most common approach to cross validation works as follows:


- __Step 1__: Split the training data into K non-overlapping subsets of roughly equal size, typically formed via random sampling. These subsets are referred to as "__folds__".


- __Step 2__: The model is then trained using __K-1__ of the folds as inputs to the training process. The fold not used for model training is used to evaluate the performance of the model.


- __Step 3__: Model performance metrics are recorded.


- __Step 4__: A different fold is then selected for use as the validation subset.


- __Step 5__: Repeat Steps 2 through 4 until each of the "K" folds has been used as the validation subset. At that point you will have trained and evaluated the model "K" number of times on "K" different subsets of the training data.


This process, referred to as "__K-Fold Cross Validation__", is also sometimes called "V-Fold Cross Validation", where the letter "V" is used to indicate the number of folds (instead of the letter "K").
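To make the five steps concrete, here is a minimal sketch of the loop written out by hand with scikit-learn's `KFold` splitter. The synthetic data, the choice of a linear regression model, and the use of shuffled folds are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# synthetic stand-in for a training subset (an assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Step 1: split the training data into K = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, val_idx in kf.split(X_demo):
    # Step 2: train on K-1 folds and evaluate on the held-out fold
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Step 3: record the performance metric (score() returns R^2 for LinearRegression)
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

    # Steps 4 and 5: the loop moves on to the next fold until all K have been used

print(fold_scores)
print(np.mean(fold_scores))
```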


In Python, we can automate this process using the __cross_val_score()__ function contained within the scikit-learn library:

- https://scikit-learn.org/stable/modules/cross_validation.html

<br>
<br>

*** __NOTE:__ The "score" produced by the __cross_val_score()__ function is dependent on the type of model you are attempting to cross validate. For example, when cross validating a linear regression model, the default metric produced by the __cross_val_score()__ function is __R^2__, while the default metric produced when cross validating a logistic regression model (and most classification algorithms) is __accuracy__. As such, always be certain to closely review the documentation for the __cross_val_score()__ function to ensure you understand the meaning of the metric being output by that function relative to the type of model you are working with.
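If the default metric is not the one you want, __cross_val_score()__ accepts a `scoring` argument. The sketch below contrasts the default R^2 with mean squared error on synthetic data; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression data (an assumption)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

reg = LinearRegression()

# default: the estimator's own score() method is used, which is R^2 for LinearRegression
r2_scores = cross_val_score(reg, X_demo, y_demo, cv=5)

# explicit metric: scikit-learn negates error metrics so that larger values are always better
neg_mse_scores = cross_val_score(
    reg, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")

print(r2_scores.mean())
# flip the sign to report the mean squared error as a positive number
print(-neg_mse_scores.mean())
```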



#### How to select an appropriate value for "K" ? Some suggestions:

1. The value for "K" is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.


2. Set "K" = 10, a value that has been found through experimentation to generally result in a model performance estimate with low bias and modest variance.


3. Set "K" = n, where n is the number of observations contained within the training data. This gives each item in the training data set an opportunity to be used as the model validation dataset. This approach is called "__leave-one-out cross-validation__" (LOOCV).


In practice, "K" is often set to a value of 5 or 10 since these values have been shown to produce testing error rate estimates that exhibit neither high bias nor very high variance (see "An Introduction to Statistical Learning", page 184, James, Witten, Hastie, and Tibshirani, ISBN-13: 978-1461471370).
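The options above can be compared side by side. The following sketch runs 5-fold, 10-fold, and leave-one-out cross validation on a small synthetic data set; the data, the model, and the choice of mean absolute error as the metric are assumptions (R^2 is not defined for a single held-out observation, so it is not suitable for LOOCV).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# small synthetic training set (an assumption): 60 observations
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 2))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=60)

reg = LinearRegression()

# three ways of choosing K; LeaveOneOut is equivalent to K = n
splitters = {
    "K=5": KFold(n_splits=5, shuffle=True, random_state=3),
    "K=10": KFold(n_splits=10, shuffle=True, random_state=3),
    "LOOCV (K=n)": LeaveOneOut(),
}

results = {}
for name, cv in splitters.items():
    scores = cross_val_score(
        reg, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error")
    # store the number of model fits and the mean absolute error
    results[name] = (len(scores), -scores.mean())
    print(name, "fits:", len(scores), "mean MAE:", round(-scores.mean(), 3))
```

The number of model fits grows with K, which is why LOOCV can become expensive on large training sets.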


#### Variations of K-Fold Cross Validation

- __Stratified Cross Validation__: In stratified cross validation we split the data into folds based on some sort of user-specified criteria. A common use of this technique is ensuring that each fold has the same proportion of observations having a given categorical value, e.g., if we have a binary response variable, we ensure that each fold has a proportional number of samples containing each of the possible binary response values.


- __Repeated Cross Validation__: With repeated cross validation we perform a K-fold cross validation N times, using random sampling to create a new set of folds prior to each repetition.
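Both variations are available in scikit-learn as the `StratifiedKFold` and `RepeatedKFold` splitters. The sketch below uses a hypothetical imbalanced binary response (80 zeros and 20 ones) to show what each splitter does; the data is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, StratifiedKFold

# hypothetical imbalanced binary response: 80 zeros and 20 ones
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratified: each validation fold keeps the 80/20 class mix of the full data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positive_rates = [y_demo[val_idx].mean() for _, val_idx in skf.split(X_demo, y_demo)]
print(positive_rates)

# repeated: 5-fold cross validation performed 3 times, with new random folds each time
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
n_splits_total = sum(1 for _ in rkf.split(X_demo))
print(n_splits_total)
```

Either splitter object can be passed to the `cv` parameter of __cross_val_score()__ in place of an integer.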


<br>

__**** IMPORTANT ****__: When using cross validation __you must still split your data set into training and testing subsets__. Cross validation eliminates the need to create a separate evaluation / validation subset but it __DOES NOT__ eliminate the need for a dedicated model testing subset.


### Using scikit-learn's Cross Validation Capabilities: An Example

scikit-learn provides us with an easy-to-use cross validation capability via the __cross_val_score()__ function. To make use of it, the user must first split their data into training and testing subsets (as we did above) and select the machine learning model they believe to be appropriate for their task at hand. We continue our example from above by constructing a small linear regression model for purposes of predicting airline flight delays using the provided "taxi out" and "departure delay" variables.

In [150]:
# load the LinearRegression class from sklearn's 'linear_model' module
from sklearn.linear_model import LinearRegression

# load the cross_val_score function from the sklearn.model_selection module
from sklearn.model_selection import cross_val_score

# create a new dataframe containing only the DepDelay and TaxiOut variables (our explanatory variables for the linear
# regression model)
newX_train = X_train[['DepDelay', 'TaxiOut']].copy()

# sanity check
newX_train.head()

Unnamed: 0,DepDelay,TaxiOut
17786,-9.0,15.0
19830,21.0,27.0
4096,1.0,17.0
11877,0.0,6.0
13128,5.0,8.0


In [155]:
# Assign the model you want to use to a variable
model = LinearRegression()

# fit the model using 5-fold cross validation; note how the 'model' variable created above is used as a parameter for the 
# cross_val_score() function. Also note how we can specify the number of folds to use during cross validation via the 'cv' 
# parameter
scores = cross_val_score(model, newX_train, y_train, cv=5)

# print out the R^2 metrics derived from the K-fold cross validation of our linear regression model
print(scores)

[0.83750277 0.9001159  0.87302602 0.85602415 0.8689559 ]


In [156]:
import numpy as np

# calculate the average R^2 across all 5 folds
np.mean(scores)

0.8671249493162125

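Our 5-fold cross validation shows that our model has a mean cross-validated R^2 of approximately 0.867, i.e., on average it explains roughly 86.7% of the variance in arrival delay within the held-out folds.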
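Cross validation only makes use of the training subset, so a final check against the testing subset is still needed. The sketch below shows one way that last step might look; it uses synthetic data and hypothetical variable names so that it runs on its own, rather than continuing with the flight data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic stand-in data (an assumption): 500 observations, 2 explanatory variables
rng = np.random.default_rng(4)
X_all = rng.normal(size=(500, 2))
y_all = 1.0 * X_all[:, 0] + 0.8 * X_all[:, 1] + rng.normal(scale=0.4, size=500)

# hold out 30% of the data for testing, as in the example above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=12)

final_model = LinearRegression()

# cross validate on the training subset only
cv_r2 = cross_val_score(final_model, X_tr, y_tr, cv=5).mean()

# cross_val_score fits clones of the estimator, so final_model must still be fit explicitly
final_model.fit(X_tr, y_tr)

# evaluate once on the untouched testing subset
test_r2 = final_model.score(X_te, y_te)

print(cv_r2, test_r2)
```

If the testing R^2 is far below the cross-validated R^2, that is a sign the model may not generalize as well as the cross validation results suggest.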