# Dataset preparation
In this workbook we will deal with two (more) issues which come up in dataset preparation: feature scaling and text features.

An extremely important thing to remember: whenever you modify the data sample that you use for training, you must apply the **exact same** procedure to the test (or new, unseen) data before applying your trained model!   As we did earlier, we will make train and test samples to illustrate how this might be done.



## Get the data and make train/test subsamples
As before, we will use the **flights** data sample.   We will use a random split, stratified by carrier, to make our train (80%) and test (20%) samples.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the flights data and split it into train and test samples
flights = pd.read_csv("https://raw.githubusercontent.com/big-data-analytics-physics/data/master/flights/flights.csv")
train_flights,test_flights = train_test_split(flights, test_size=0.2, random_state=42,stratify=flights['carrier'])
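To see what the `stratify` argument does, here is a minimal sketch on a made-up table (the `toy` DataFrame and its carrier codes are assumptions for illustration, not the flights data). It suggests that both samples keep the carrier mix of the full sample:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 flights for carrier "AA" and 20 for carrier "UA"
toy = pd.DataFrame({
    "carrier": ["AA"] * 80 + ["UA"] * 20,
    "dep_delay": range(100),
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["carrier"]
)

# Count the carriers in each sample: both keep the original 80/20 mix
train_counts = toy_train["carrier"].value_counts()
test_counts = toy_test["carrier"].value_counts()
print(train_counts)
print(test_counts)
```

Without `stratify`, the carrier fractions in a small test sample could drift away from those in the training sample just by chance.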

## Feature Scaling
Many fitting algorithms have trouble with input variables (or features) whose ranges differ greatly.   To fix this we can do feature scaling.   Note that **targets** (the thing that we are trying to fit or predict) generally do **not** need to be scaled.   There are two feature scaling techniques which come up often:
1.  min-max scaling: the feature values are shifted and scaled so that they lie between 0 and 1
2.  standardization: the mean (by column) is subtracted from each feature, and the result is divided by the standard deviation so that each column has zero mean and unit variance.
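
The two techniques are easy to work out by hand. The following is a small sketch using numpy on a made-up column of values (the array `x` is an assumption for illustration only):

```python
import numpy as np

# A made-up feature column
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: shift by the minimum, divide by the range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
print(x_standard.mean(), x_standard.std())
```

Note that `x.std()` is the population standard deviation (numpy's default), which is also what sklearn's `StandardScaler` uses.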

We will use different techniques for different columns.  Generally we will not use these techniques for **categorical** columns (something different will be done).

A quick exploration of the data will tell us more about which columns might need which type of scaling.   A simple print and histogram of the data will help here!

In [0]:
# 
# Need this code fragment for plotting
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
import plotly.offline as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go

def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

In [0]:
enable_plotly_in_cell()
print(train_flights.head())
train_flights.hist(bins=50,figsize=(20,15))
plt.show()

Looking at these plots and thinking about the meaning of the data, we will use the following:
0.  Nothing for the categorical variables: carrier, tailnum, origin, dest
1.   min-max scaling for: day, hour (minute could be treated the same way, but the code below leaves it unscaled)
2.   standardization for:  all of the others except arr_delay, minute, day, month, year

Note: day, year, and month could be combined into a new categorical variable, which might have an impact on delays!

**NOTE**: we won't scale arr_delay (the arrival time delay in minutes) because we will assume that this is the quantity we are interested in fitting - this is our **target** variable.

# Min-Max scaling and scikit-learn estimators

sklearn has a number of **estimators** which are incredibly useful for data preparation.   To use an estimator there are generally four steps:
1.  Import the estimator class:
from sklearn.preprocessing import MinMaxScaler

2.  Instantiate the estimator.   For example:
scaler = MinMaxScaler()
There might be optional arguments that control how the estimator works.  Consult the documentation for this.

3.  Fit the data to train the estimator.   For example:
scaler.fit(train_flights[["day"]])     # note the double brackets, explained below
You need to be careful with the type of argument expected by the estimator.  Most Scikit-Learn estimators require that data be strictly 2-dimensional. 

If we only want to transform one column of data and we select a single column like this:
train_flights['day']
then Pandas returns a **Series**, which has only a single dimension. We can force Pandas to create a one-column (2-dimensional) DataFrame by passing a single-item list to the brackets like this:
train_flights[['day']]

4.  Next we need to explicitly transform the data (remember: *fit* just trained the estimator):
scaled_days = scaler.transform(train_flights[["day"]])

NOTE: we **trained** the estimator on the train_flights data.   How do we use this on the test_flights data?   **We only use the transform method!!**  This ensures that we treat our test data **exactly** like the training data:
scaled_days_test = scaler.transform(test_flights[["day"]])



In [0]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train_flights[["day","hour"]])
train_scaled_minmax = scaler.transform(train_flights[["day","hour"]])

test_scaled_minmax = scaler.transform(test_flights[["day","hour"]])


# Standardization
Here we will form a new dataframe by dropping all of the columns we don't want to scale, and then scaling the remaining ones:

In [0]:
from sklearn.preprocessing import StandardScaler
stanScal = StandardScaler()
#
# Get the numerical columns
train_flights_subset = train_flights.drop(["arr_delay","carrier","minute", "hour", "day", "month", "year","origin","dest","tailnum"], axis=1)
stanScal.fit(train_flights_subset)
train_flights_subset_stanscaled = stanScal.transform(train_flights_subset)
for i in range(5):
  print(train_flights_subset_stanscaled[i])
  
#
# Do the same for the test data
test_flights_subset = test_flights.drop(["arr_delay","carrier","minute", "hour", "day", "month", "year","origin","dest","tailnum"], axis=1)
test_flights_subset_stanscaled = stanScal.transform(test_flights_subset)
for i in range(5):
  print(test_flights_subset_stanscaled[i])
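
A quick way to check a standardization is to look at the column means and standard deviations afterwards. The sketch below does this on randomly generated data (the two features and their scales are assumptions for illustration, not the flights columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales
toy_train = rng.normal(loc=[1000.0, 150.0], scale=[700.0, 90.0], size=(800, 2))
toy_test = rng.normal(loc=[1000.0, 150.0], scale=[700.0, 90.0], size=(200, 2))

toy_scaler = StandardScaler().fit(toy_train)
print("learned means:", toy_scaler.mean_)
print("learned standard deviations:", toy_scaler.scale_)

train_std = toy_scaler.transform(toy_train)
test_std = toy_scaler.transform(toy_test)

# Training columns: mean 0 and standard deviation 1 (up to rounding)
print(train_std.mean(axis=0), train_std.std(axis=0))
# Test columns: only approximately 0 and 1, since the training statistics were used
print(test_std.mean(axis=0), test_std.std(axis=0))
```

The test columns will generally not have exactly zero mean and unit variance, and that is fine: what matters is that the same shift and scale were applied to both samples.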


# Dealing with Text Features
Let's take a quick look at one of the text features in our dataset: **origin**.   This is the 3-letter code for the airport the flight originates from:

In [0]:
import plotly.offline as py  # plotly.plotly moved to the separate chart-studio package in plotly 4
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()
trace1 = go.Histogram(
    x=train_flights['origin'],
    opacity=0.75,
    name="Flight Origin",
    histnorm='probability'
)

data = [trace1]
layout = go.Layout(barmode='overlay')
fig = go.Figure(data=data, layout=layout)

iplot(fig, filename='overlaid histogram')

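The origin might be important for determining whether flights are delayed, but how can we use it in our fit?  (With just two items we might simply do the analysis separately for each, but this is not feasible with a large number of origins.)  One standard approach is **one-hot encoding**: each distinct value of the text feature gets its own column, which is 1 for rows with that value and 0 otherwise.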

In [0]:
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn older than 1.2
#
# Fit then transform with one-hot encoding
columnToEncode = 'origin'
onehot_encoder.fit(train_flights[[columnToEncode]])
train_flights_cat_one_hot = onehot_encoder.transform(train_flights[[columnToEncode]])
print("Transformed training data",type(train_flights_cat_one_hot))
for i in range(10):
  print(train_flights.iloc[i][columnToEncode],train_flights_cat_one_hot[i])

#
# Now just transform the test data (don't refit!!)
test_flights_cat_one_hot = onehot_encoder.transform(test_flights[[columnToEncode]])
#
# compare with:
#test_flights_cat_one_hot = onehot_encoder.fit_transform(test_flights[[columnToEncode]])
print("Transformed testing data",type(test_flights_cat_one_hot))
for i in range(10):
  print(test_flights.iloc[i][columnToEncode],test_flights_cat_one_hot[i])
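
One practical wrinkle with fitting on the training data only: the test data may contain a category that never appeared in training. By default the encoder raises an error in that case. A minimal sketch of one way to handle it, using the `handle_unknown="ignore"` option on made-up data (the airport codes below are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical airport codes)
toy_train = pd.DataFrame({"origin": ["JFK", "LGA", "JFK", "EWR"]})
toy_test = pd.DataFrame({"origin": ["LGA", "BOS"]})   # "BOS" never appears in training

# handle_unknown="ignore" encodes unseen categories as a row of all zeros
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train[["origin"]])
print(toy_encoder.categories_)   # the categories learned from the training data

# The default output is a sparse matrix; convert it to a dense numpy array
test_one_hot = toy_encoder.transform(toy_test[["origin"]]).toarray()
print(test_one_hot)
```

Whether an all-zero row is an acceptable encoding for an unseen origin depends on the model that is fit afterwards, so treat this as an option to consider rather than a default.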



# Putting Humpty-Dumpty back together!
The result of our three transforms on the train_flights data:
1.   train_scaled_minmax: a 2-column by many-row numpy array, obtained using min-max scaling
2.   train_flights_subset_stanscaled: a multi-column by many-row numpy array, obtained using standard scaling
3.   train_flights_cat_one_hot: a multi-column by many-row numpy array, obtained using one-hot encoding

If we want to use **ALL** of these in a subsequent fitting program, we need to recombine them into a single numpy array.   We will also pull out the **arr_delay** column - which we did not transform - as a **label** (also known as the target), since we will assume that this is the quantity that we want to fit.
Here is how we do this:



In [0]:
train_flights_labels = train_flights['arr_delay'].copy().values
print("shape labels    ",type(train_flights_labels),train_flights_labels.shape)
print("shape minmax    ",type(train_scaled_minmax),train_scaled_minmax.shape)
print("shape stanscaled",type(train_flights_subset_stanscaled),train_flights_subset_stanscaled.shape)
print("shape one_hot   ",type(train_flights_cat_one_hot),train_flights_cat_one_hot.shape)
train_flights_toFit = np.concatenate([train_scaled_minmax,train_flights_subset_stanscaled,train_flights_cat_one_hot], axis=1)
print("shape ALL       ",type(train_flights_toFit),train_flights_toFit.shape)


test_flights_labels = test_flights['arr_delay'].copy().values
test_flights_toFit = np.concatenate([test_scaled_minmax,test_flights_subset_stanscaled,test_flights_cat_one_hot], axis=1)
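
As an alternative to recombining the pieces by hand, sklearn provides `ColumnTransformer`, which applies a different transform to each group of columns and stacks the results. The following is a minimal sketch on made-up data, not the procedure used above; the toy DataFrames and the column choices are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy stand-ins for the flights samples (values are made up)
toy_train = pd.DataFrame({
    "day": [1, 15, 31, 8],
    "distance": [200.0, 1400.0, 2500.0, 760.0],
    "origin": ["JFK", "LGA", "JFK", "EWR"],
    "arr_delay": [5.0, -3.0, 40.0, 0.0],
})
toy_test = pd.DataFrame({
    "day": [20, 3],
    "distance": [500.0, 1000.0],
    "origin": ["EWR", "JFK"],
    "arr_delay": [12.0, -7.0],
})

preprocess = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["day"]),
        ("standard", StandardScaler(), ["distance"]),
        ("onehot", OneHotEncoder(), ["origin"]),
    ],
    remainder="drop",      # arr_delay is the label, so it is left out of the features
    sparse_threshold=0,    # always return a dense numpy array
)

toy_train_toFit = preprocess.fit_transform(toy_train)   # fit on the training sample only
toy_test_toFit = preprocess.transform(toy_test)         # reuse the fitted transforms
toy_train_labels = toy_train["arr_delay"].values

print(toy_train_toFit.shape, toy_test_toFit.shape)      # (4, 5) (2, 5)
```

The output columns follow the order of the `transformers` list (one min-max column, one standardized column, three one-hot columns), which mirrors the order used in the `np.concatenate` call above.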
