# Creating test and training samples
Our goal might be to design an algorithm to predict flight delays. To test our model, we will want to split our data into two parts:

1.  Training sample: This is the sample we will train our model on.
2.  Testing sample: This is the unseen data that we will test our trained model on. Good performance on this sample, with a model previously trained using only the training data, indicates that our model generalizes well.

A common split is 80% train and 20% test. We could just **randomly split** our data (this is quite typical), but in our case we will do something a bit more sophisticated.

We might guess from our own experience that the **carrier** (American Airlines, Southwest, Delta, etc.) has a big impact on flight delays. So when we split our data, we would like to be sure that the carrier distribution in our test sample is close to the one in our training sample. Will this be true if we just randomly split the data?

Let's first read in our flight data sample as before (the data is again in the */data* folder on GitHub, in a subfolder named *flights*):

In [0]:
import pandas as pd

# Read the flights sample and print the first few rows to the screen
flights = pd.read_csv("https://raw.githubusercontent.com/big-data-analytics-physics/data/master/flights/flights.csv")
print(flights.head())

What does the carrier distribution look like for our full sample? Let's plot it.

In [0]:
# 
# Need this code fragment for plotting
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
import plotly.offline as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go

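def enable_plotly_in_cell():
  # Load require.js and switch plotly to offline notebook mode so that
  # figures render inside the current notebook cell
  from IPython.display import display, HTML
  from plotly.offline import init_notebook_mode
  display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)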

The code below plots the carrier distribution. One new feature: **histnorm="probability"**. This will plot fractions rather than counts in the histogram, which will be helpful when we want to compare the **shape** of histograms with different total counts.

In [0]:
# plotly.plotly moved to the separate chart_studio package in plotly 4;
# it is not needed for the offline plotting used here
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()
trace1 = go.Histogram(
    x=flights['carrier'],
    opacity=0.75,
    name="Carrier",
    histnorm='probability'
)

data = [trace1]
layout = go.Layout(barmode='overlay')
fig = go.Figure(data=data, layout=layout)

iplot(fig, filename='overlaid histogram')
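
To see what **histnorm="probability"** does numerically, here is a minimal sketch using pandas on a small made-up carrier column (the values are illustrative only and are not taken from the flights file). Each fraction is simply the count for that carrier divided by the total number of rows, so the fractions sum to 1.

```python
import pandas as pd

# Made-up carrier codes for illustration (not the real flights data)
carriers = pd.Series(["AA", "AA", "DL", "WN", "WN", "WN", "DL", "AA", "WN", "WN"])

# Raw counts per carrier: what a plain histogram would show
counts = carriers.value_counts()

# Fractions per carrier: counts divided by the total number of rows
fractions = carriers.value_counts(normalize=True)

print(counts)
print(fractions)
```

Because the fractions do not depend on the total number of rows, two samples of very different size can be compared on the same vertical scale.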

# Random split
To split our data sample we will use a utility from the sklearn (or scikit-learn) package which is designed just for this purpose. It takes a dataframe as input, the desired test sample size (the remainder going to the training sample), and a random number seed (an integer).

In [0]:
from sklearn.model_selection import train_test_split
train_flights, test_flights = train_test_split(flights, test_size=0.2, random_state=42)
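
As a quick sanity check of what `train_test_split` returns, the sketch below applies it to a small made-up dataframe (the column names and values are hypothetical, chosen only for illustration). It checks the sizes of the two pieces, that no row ends up in both, and that reusing the same `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe with 1000 rows (not the real flights data)
toy = pd.DataFrame({
    "flight_id": range(1000),
    "carrier": ["AA", "DL", "WN", "UA"] * 250,
})

# Two splits with the same seed
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
toy_train2, toy_test2 = train_test_split(toy, test_size=0.2, random_state=42)

# Sizes of the training and testing pieces
print(len(toy_train), len(toy_test))

# Rows that appear in both pieces (there should be none)
overlap = set(toy_train.index) & set(toy_test.index)
print(len(overlap))

# Same seed, same rows in the test sample
print(toy_test.index.equals(toy_test2.index))
```

Fixing the seed is what makes the split reproducible from one run of the notebook to the next.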

Now let's plot the two samples on the same histogram, using the histnorm feature we just introduced:

In [0]:

enable_plotly_in_cell()
trace1 = go.Histogram(
    x=train_flights['carrier'],
    opacity=0.75,
    name="Training sample",
    histnorm='probability'
)
trace2 = go.Histogram(
    x=test_flights['carrier'],
    opacity=0.75,
    name="Testing sample",
    histnorm='probability'
)

data = [trace1, trace2]
layout = go.Layout(bargap=0.2,
    bargroupgap=0.1)
fig = go.Figure(data=data, layout=layout)

iplot(fig, filename='overlaid histogram')
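
The same comparison can be made with numbers instead of a plot. The sketch below is one possible way to do it, using a randomly generated toy sample (the carrier codes and their probabilities are assumptions made up for this example) and putting the per-carrier fractions of the two pieces side by side.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy sample: 10000 rows with made-up carrier probabilities
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "carrier": rng.choice(["AA", "DL", "WN", "UA"], size=10000,
                          p=[0.5, 0.3, 0.15, 0.05]),
})

# Plain random 80/20 split
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Per-carrier fractions in each piece, aligned on the carrier code
comparison = pd.DataFrame({
    "train": toy_train["carrier"].value_counts(normalize=True),
    "test": toy_test["carrier"].value_counts(normalize=True),
}).fillna(0)
comparison["abs_diff"] = (comparison["train"] - comparison["test"]).abs()

print(comparison)
```

Running the same lines on `train_flights` and `test_flights` would give the numbers behind the histogram above.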

# Stratified sampling
The above split looks really good (mostly because the input sample is huge, about 162k rows!), but we can do even better using **stratified sampling**. This allows us to match the fractions for each of the carriers to much higher precision. We use the same method from sklearn, but give it an argument that tells it which column in our data sample to use for the stratification. Note that it is important that this column be a **categorical** variable: one in which the members of that column fall into distinct categories (in our case the categories are the codes for a small set of airlines).

In [0]:
from sklearn.model_selection import train_test_split
train_flights, test_flights = train_test_split(flights, test_size=0.2, random_state=42, stratify=flights['carrier'])

enable_plotly_in_cell()
trace1 = go.Histogram(
    x=train_flights['carrier'],
    opacity=0.75,
    name="Training sample",
    histnorm='probability'
)
trace2 = go.Histogram(
    x=test_flights['carrier'],
    opacity=0.75,
    name="Testing sample",
    histnorm='probability'
)

data = [trace1, trace2]
layout = go.Layout(bargap=0.2,
    bargroupgap=0.1)
fig = go.Figure(data=data, layout=layout)

iplot(fig, filename='overlaid histogram')
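
To put a single number on how well each split matches the carrier fractions, here is a small sketch on a made-up sample. The helper `max_fraction_gap` is a hypothetical function written for this example (it is not part of sklearn), and the carrier counts are invented. It returns the largest absolute difference between the training and testing fractions over all carriers.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def max_fraction_gap(train, test, column):
    """Largest absolute difference in category fractions between two samples."""
    train_frac = train[column].value_counts(normalize=True)
    test_frac = test[column].value_counts(normalize=True)
    # fill_value=0 handles a category that is missing from one sample
    return train_frac.sub(test_frac, fill_value=0).abs().max()

# Toy sample with invented carrier counts (10000 rows in total)
toy = pd.DataFrame({
    "carrier": np.repeat(["AA", "DL", "WN", "UA"], [5000, 3000, 1500, 500]),
})

# Plain random split
rand_train, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

# Stratified split on the carrier column
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["carrier"]
)

random_gap = max_fraction_gap(rand_train, rand_test, "carrier")
stratified_gap = max_fraction_gap(strat_train, strat_test, "carrier")

print("random split gap:    ", random_gap)
print("stratified split gap:", stratified_gap)
```

With stratification, the number of rows per carrier in each piece can differ from the exact proportional share only by rounding, so the gap should be close to zero, while the random split typically shows a small but visible gap.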