## Load the Data

Thanks, Chris!

## Objectives & Summary

#### Objectives
1. Create two **general wrappers**
1. Use general wrappers to fing **salient features**
1. Compare salient features identified by Lasso and SelectKBest

### Summary
1. create **general regressor** and **general transformer**
1. abstracted input validation into a subroutine/function
1. connected transformer to regressor

In [1]:
from os import chdir, getcwd;
chdir('../../../lib')
assert getcwd().split('/')[-1] == 'lib'
from iowa_data import load_iowa_liquor_store_dataframes

In [2]:
iowa_liquor_store_df, \
    iowa_liquor_store_deltas_df = \
        load_iowa_liquor_store_dataframes()

iowa_liquor_store_df.head()

Unnamed: 0,zip_code,sale_2016,store_count_2016,sale_2015,store_count_2015,sale_2014,store_count_2014,sale_2013,store_count_2013,sale_2012,...,2014Q3,2014Q4,2015Q1,2015Q2,2015Q3,2015Q4,2016Q1,2016Q2,2016Q3,2016Q4
0,50002,32544.92,2,38714.03,2,45283.05,2,39823.07,2,54142.66,...,16272.67,10457.78,11134.32,6578.94,8305.7,12695.07,9380.23,9119.31,5086.63,8958.75
1,50003,203493.66,3,344887.6,3,293773.18,3,328773.2,3,279232.38,...,71935.61,72899.4,71884.2,82246.12,114167.07,76590.18,60382.85,58310.29,48638.25,36162.27
2,50006,67189.24,2,105158.5,2,94319.67,2,91701.59,2,96378.56,...,23488.94,22927.93,25273.88,28463.93,26480.62,24940.09,22461.81,19791.71,13047.71,11888.01
3,50009,1617618.34,9,2409419.0,9,2209157.32,9,2125635.73,9,2066446.19,...,581605.35,644281.31,428414.64,636639.03,672692.13,671672.85,478918.96,455382.31,366348.22,316968.85
4,50010,4646154.79,22,7334014.0,21,7177603.68,21,6740832.92,20,6351832.32,...,1781861.98,2171059.41,1668418.66,1734813.68,1906493.82,2024287.55,1567149.74,1271515.64,1135230.3,672259.11


In [3]:
iowa_liquor_store_deltas_df.head()

Unnamed: 0_level_0,delta_2013_sales,delta_2014_sales,delta_2015_sales,delta_2015_sales,delta_2013_stores,delta_2014_stores,delta_2015_stores,delta_2016_stores,delta_2012Q2,delta_2012Q3,...,delta_2014Q3,delta_2014Q4,delta_2015Q1,delta_2015Q2,delta_2015Q3,delta_2015Q4,delta_2016Q1,delta_2016Q2,delta_2016Q3,delta_2016Q4
zip_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
50002,-7159.795,2729.99,-3284.51,-3084.555,0,0,0,0,4305.485,-496.01,...,2488.655,-2907.445,338.27,-2277.69,863.38,2194.685,-1657.42,-130.46,-2016.34,1936.06
50003,16513.606667,-11666.673333,17038.13,-47131.303333,0,0,0,0,12655.213333,-193.553333,...,-6770.4,321.263333,-338.4,3453.973333,10640.316667,-12525.63,-5402.443333,-690.853333,-3224.013333,-4158.66
50006,-2338.485,1309.04,5419.425,-18984.64,0,0,0,0,-812.455,1257.775,...,-140.96,-280.505,1172.975,1595.025,-991.655,-770.265,-1239.14,-1335.05,-3372.0,-579.85
50009,29537.128778,9280.176667,22251.258889,-87977.812222,-1,0,0,0,13335.009,4768.633,...,920.737778,6963.995556,-23985.185556,23136.043333,4005.9,-113.253333,-21417.098889,-2615.183333,-9892.676667,-5486.596667
50010,2734.681789,4749.005429,7448.096667,-138049.894004,1,1,0,1,15738.596842,-3077.567368,...,8677.15,18533.210952,-23935.27381,3161.667619,8175.244762,5609.225238,-20778.991364,-13437.913636,-6194.788182,-21044.145


## Basic Preprocessing

Create a target vector using the `delta_2015Q4` column.

Create a feature array by removing this column (and any column associated with 2016).

In [17]:
twenty_sixteens = [col for col in iowa_liquor_store_deltas_df.columns if '2016' in col]
y = iowa_liquor_store_deltas_df['delta_2015Q4']
X = iowa_liquor_store_deltas_df.drop(twenty_sixteens + ['delta_2015Q4'], axis=1)

# Implement the Standard Sklearn Template

### Specific Cases: the Least Square Regressor and Standard Scaler

#### `least_squares_regressor_standard_workflow`

1. Pass the data as an argument
1. Create the model inside the function. Use the `sklearn` implementation of OLS.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def least_squares_regressor_standard_workflow(X, y, random_state=None):
    # Receive the data
    # Split the data into training and testing sets; be sure to include an option to specify a random state
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

    # Create a new OLS model 
    model = LinearRegression()
    
    # Fit the model
    model.fit(X_train, y_train)
    
    # Score the model
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    # Return a dictionary containing the model, the train score, and the test score
    return {
        'model' : model,
        'train_score' : train_score,
        'test_score' : test_score
    }

In [5]:
from sklearn.preprocessing import StandardScaler

def scaler_standard_workflow(X, y, random_state=None):
    # Receive the data
    # Split the data into training and testing sets; be sure to include an option to specify a random state
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    
    # Create a new scaler
    scaler = StandardScaler()
    
    # Fit the scaler
    # Fit on the train date
    scaler.fit(X_train)
    
    # transform the data 
    # make sure to transform both!!
    # don't transform the target data
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    # return a dictionary containing the transformed data
    
    return {
        'scaler' : scaler,
        'X_test' : X_test,
        'X_train' : X_train,
        'y_test' : y_test,
        'y_train' : y_train
    }

# Test Your Functions

Test the functions below. 

In [8]:
least_squares_regressor_standard_workflow(X,y)

{'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 'test_score': 0.71769083277544377,
 'train_score': 0.71403806379189172}

In [18]:
#scaler_standard_workflow(X,y)

### Explain why the output of the `scaler_standard_workflow` can not be passed to `least_squares_regressor_standard_workflow`

## General Cases: A general regressor and a general transformer

#### `general_regressor_standard_workflow`

This method should be able to receive data that has already been split. In order to handle this, you will need to

1. Handle the case where one of `X_test` or `y_test` is missing
1. Use `X` and `y` as the training data if testing data is received
1. Perform the train test split if testing data is not received


In [50]:
def validate_inputs(X, y, X_test, y_test, random_state=None):
    if X_test is None and y_test is None:
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    elif X_test is None or y_test is None:
        raise ValueError('You need to pass both X_test and y_test.')
    else:
        X_train = X
        y_train = y

    return X_train, X_test, y_train, y_test

In [51]:
def general_regressor_standard_workflow(model, X, y, random_state=None, X_test=None, y_test=None):
    ''''''

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    #TODO: Fit the model
    model.fit(X_train, y_train)
    
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    return {
        'model': model,
        'train_score': train_score,
        'test_score': test_score
    }
    

In [52]:
def general_transformer_standard_workflow(transformer, X, y, split_data=False, random_state=None, X_test=None, y_test=None):

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

    transformer.fit(X_train, y_train)
    
    transformer.transform(X_train)
    transformer.transform(X_test)

    return {
            'transformer': transformer,
            'X_train': X_train,
            'X_test': X_test,
            'y_train': y_train,
            'y_test': y_test
    }
    

In [59]:
from sklearn.linear_model import Ridge, Lasso, SGDRegressor, ElasticNet, LinearRegression
ridge_output = general_regressor_standard_workflow(Ridge(), X,y)
lasso_output = general_regressor_standard_workflow(Lasso(), X,y)
ols_output = general_regressor_standard_workflow(LinearRegression(), X,y)
sgd_output = general_regressor_standard_workflow(SGDRegressor(), X,y)
elastic_output = general_regressor_standard_workflow(ElasticNet(), X,y)



In [60]:
sgd_output['train_score'], sgd_output['test_score']

(-3.9065682520941555e+30, -8.9975286688470677e+30)

In [61]:
elastic_output['train_score'], elastic_output['test_score']

(0.73637864505842998, 0.40779919327856906)

In [56]:
lasso_output['train_score'], lasso_output['test_score']

(0.83158831820344248, -2.140607179725472)

In [44]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

In [62]:
standard_scaler_output = general_transformer_standard_workflow(StandardScaler(), X,y)
select_k_best = general_transformer_standard_workflow(SelectKBest(f_regression,k=5),X,y)

In [66]:
standard_scaler_output = general_transformer_standard_workflow(SelectKBest(k=5), X, y, 
                                                               random_state=42)
general_regressor_standard_workflow(ElasticNet(),
                                    standard_scaler_output['X_train'],
                                    standard_scaler_output['y_train'],
                                    X_test = standard_scaler_output['X_test'],
                                    y_test = standard_scaler_output['y_test'])

{'model': ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
       max_iter=1000, normalize=False, positive=False, precompute=False,
       random_state=None, selection='cyclic', tol=0.0001, warm_start=False),
 'test_score': 0.3163940396866754,
 'train_score': 0.85761408447987442}

#### Test Your Functions

## Scaling, then Fit-Score

#### Use `general_transformer_standard_workflow` to

1. split the data into training and testing sets
1. scale your data using `Scaler`
1. return the split-scaled data

#### Pass the results of this and a new `Lasso` model that you created to `general_regressor_standard_workflow` and

1. receive the split data
1. fit the `Lasso` model
1. score the `Lasso` model

#### Examine the coefficients of the `Lasso` model and create a list of booleans representing whether or not a feature was used.

Optionally, use the code below to represent which features were identified as "salient".

    plt.matshow(mask.reshape(1, -1), cmap='gray_r')
    plt.xlabel("Sample index")

## Identify Salient Features using ANOVA via `SelectKBest`

#### Use `general_transformer_standard_workflow` to

1. split the data into training and testing sets
1. scale your data using `Scaler`
1. return the split-scaled data

#### Pass the results of this and a new `SelectKBest` model that you created to `general_transformer_standard_workflow` and

1. receive the split data
1. fit the `SelectKBest` model
1. transform the data 
1. return the transformed data
1. Use `.get_support()` to identify the salient features

Compare the salient features identified by `SelectKBest` to the salient features identified by the `Lasso`. 

If you plotted the salient features for the `Lasso`, do the same for the features from `SelectKBest`.

Tweak the `k` value of `SelectKBest` as you see fit.

## Fit models using the Transformed Feature Set received from `SelectKBest`

#### Pass the results of your `SelectKBest` and each of three different models (below) to `general_regressor_standard_workflow` and

1. receive the split data
1. fit the model
1. score the model

#### Use these models

1. `Ridge`
1. `LinearRegression`
1. `SGDRegressor`

## Prepare a Brief Write-Up of What you Found