# Week 6 Problem 1

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do not write your answer anywhere else. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. **If your code does not pass the unit tests, it will not pass the autograder.**

# Due Date: 6 PM, February 26, 2018

In [1]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, f_classif, mutual_info_classif, RFE
from sklearn.preprocessing import RobustScaler, Imputer
from sklearn.svm import SVC, LinearSVC

from nose.tools import assert_equal, assert_true, assert_false, assert_almost_equal
import numpy.testing as npt


The cell below reads in a simulated dataset where y is an unknown function of a, b, and c.

In [2]:
df = pd.read_csv('/home/data_scientist/data/misc/sim.data')
df.head()

Unnamed: 0,a,b,c,y
0,0.004539,0.818678,194.381891,0
1,0.001367,0.243724,245.378577,0
2,1.579454,0.465842,849.943583,0
3,7.189778,0.456895,129.707932,0
4,97.743634,0.319419,120.998294,1


### Problem 1.1
For Problem 1.1, complete the function f_eng: perform an 80/20 split of the features and labels into training and testing sets, fit a RobustScaler on the training features, and use it to transform both the training and testing features. Where applicable, set the random_state argument to 999. Return, in this order: the transformed training features, training labels, transformed testing features, and testing labels.

In [6]:
def f_eng(data):
    '''
    Splits the data into training and testing sets and scales the features
    
    Parameters
    ----------
    data: dataframe containing simulated dataset.
    
    Returns
    -------
    Transformed training features as a two-dimensional numpy array (80% of the rows)
    Training labels as a pandas series (80% of the labels)
    Transformed testing features as a two-dimensional numpy array (20% of the rows)
    Testing labels as a pandas series (20% of the labels)
    
    '''
    # Split the DataFrame into a DataFrame of features and a Series of labels
    # Use the `data` argument rather than the global `df`
    features = data[data.columns[:-1]]
    labels = data.y
    
    X_train, X_test, y_train, y_test \
       = train_test_split(features, labels, test_size = 0.2, random_state = 999)
    
    rs = RobustScaler()
    new_x_train = rs.fit_transform(X_train, y_train)
    new_x_test = rs.transform(X_test)
    
    return new_x_train, y_train, new_x_test, y_test

In [7]:
X_train, y_train, X_test, y_test = f_eng(df.copy())

assert_equal(type(X_train), np.ndarray)
assert_equal(type(X_test), np.ndarray)

assert_equal(type(y_train), pd.core.series.Series)
assert_equal(type(y_test), pd.core.series.Series)

assert_equal(len(X_train), 800, msg='Make sure that you performed an 80/20 split on the training and testing set')
assert_equal(len(y_train), 800, msg='Make sure that you performed an 80/20 split on the training and testing set')


npt.assert_almost_equal(X_test[0:10], [[-0.30738115403404553, -0.9736985887996362, -0.49101207865721125], [0.18488952387724725, 0.03189822187579699, 0.8034181154818177], [-0.2198828359426681, 0.3028030199063013, 0.8448490192685679], [-0.1433424985516722, -0.011974164525011167, -0.12831623314094662], [0.09892255871589173, -0.8613137137863287, 0.13219278852677604], [-0.312907980172404, 0.9049807671626223, 1.0237472568184407], [0.8872733278385387, 0.5737690433939666, -0.4839982999236612], [0.43514883771829216, 1.0018458848881164, -0.49359000079927745], [0.08946178277685655, 0.6132921249115915, 0.7152472385801251], [0.9151026455050681, 0.07884286112089098, -0.3973694534680506]], decimal=2)
npt.assert_almost_equal(X_train[345:355], [[-0.3139239995672711, 0.8450084840904671, -0.06351279246792367], [-0.1487563778794215, -0.5104392475039827, -0.0528450006168689], [-0.19904502938519728, -0.8087579296249421, 1.148338814039014], [-0.09315677202225736, -0.2352769528032403, -0.44881999956456353], [-0.2411588708452692, -0.5054923224648885, 0.18491364134755], [-0.25026611398236226, -0.26454867151904576, 1.5482613893681398], [-0.3061110733202404, 0.47425884363241466, -0.09892128734229258], [-0.30962370002167466, -0.7524807869313633, 0.6640450283476379], [-0.2851410690135015, -0.01612153252307675, 1.0374926446256747], [0.10280334279756671, -0.6158793579748008, 0.32803827308069755]], decimal=2)

assert_equal(y_train[0:10].tolist(), [0, 0, 1, 1, 0, 2, 1, 2, 1, 2])
assert_equal(y_test[100:120].tolist(), [1, 2, 1, 2, 0, 1, 2, 1, 2, 1, 0, 2, 1, 1, 0, 2, 1, 2, 0, 2])

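The reason for fitting the scaler on the training features only can be illustrated with a small sketch. With its default settings, RobustScaler subtracts the median and divides by the interquartile range, and both statistics are learned during `fit`. The toy values below are made up for illustration and are not taken from the simulated dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy data (hypothetical values): a single feature with one large outlier.
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test = np.array([[3.0], [5.0]])

scaler = RobustScaler()
train_scaled = scaler.fit_transform(train)  # learns median and IQR from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent of the default transform: (x - median) / (q75 - q25)
median = np.median(train, axis=0)
q25, q75 = np.percentile(train, [25, 75], axis=0)
manual = (test - median) / (q75 - q25)

print(scaler.center_, scaler.scale_)
print(test_scaled.ravel(), manual.ravel())
```

Because the median and interquartile range are barely affected by the outlier, the test values are scaled sensibly even though the training column contains 100.0.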

The code cell below creates a validation set.

In [8]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=.7, random_state=0)

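To see how many rows each set ends up with, the two splits can be replayed on a placeholder array. This is only a sketch: the array assumes 1000 rows (the unit test above checks for 800 training rows after the first split), and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the simulated dataset (assumed 1000 rows).
X = np.arange(3000, dtype=float).reshape(1000, 3)
y = np.arange(1000) % 3

# First split: 80% training, 20% testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=999)

# Second split: 70% of the training rows are kept for fitting,
# the remaining 30% become the validation set.
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, train_size=0.7, random_state=0)

print(len(X_fit), len(X_val), len(X_te))
```

Under that assumption, 560 rows remain for fitting, 240 are used for validation, and 200 are held out for testing.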
### Problem 1.2

To complete Problem 1.2, finish writing *var_thres* by iterating over the thresholds. For each threshold, create a [VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) feature selector with the current threshold. *Fit the feature selector on the **training data** and then **transform the training data and validation data**.* Create a [support vector classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), fit it on the transformed training data, and get the mean accuracy score on the transformed validation set. Keep track of which threshold gives the support vector classifier the highest mean accuracy and return that threshold. Set the random_state argument to 0 where applicable.

In [26]:
def var_thres(X_train, X_val, y_train, y_val, thresholds):
    '''
    Parameters
    ----------
    X_train: numpy array containing training features
    X_val: numpy array containing validation features
    y_train: pandas series containing training labels
    y_val: pandas series containing validation labels
    thresholds: numpy array containing thresholds
    
    Returns
    -------
    best_threshold: floating point number
    '''
    # Track the validation score obtained at each threshold
    scores = []
    
    # Select features at different variance thresholds
    for thresh in thresholds:
        vt = VarianceThreshold(thresh)
        # Fit the filter on the training data only, then apply it to both sets
        new_x_train = vt.fit_transform(X_train)
        new_x_val = vt.transform(X_val)
        
        # Fit the SVC classifier on the filtered training data
        svc = SVC(random_state = 0)
        svc.fit(new_x_train, y_train)
        
        # Record the mean accuracy on the filtered validation data
        scores.append(svc.score(new_x_val, y_val))
    
    best_threshold = thresholds[np.argmax(scores)]
    return best_threshold


In [27]:
best_threshold = var_thres(X_train, X_val, y_train, y_val, np.linspace(0, .5, 30))
sel = VarianceThreshold(threshold=best_threshold)
new_X_train = sel.fit_transform(X_train, y_train)
new_X_val = sel.transform(X_val)
new_X_test = sel.transform(X_test)

model = SVC(random_state=0)
model.fit(new_X_train, y_train)
val_score = model.score(new_X_val, y_val)
print('Validation Score [using threshold %s]:'%best_threshold, val_score)

assert_true(val_score >= .91)

Validation Score [using threshold 0.0]: 0.933333333333

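A best threshold of 0.0 means that only constant features would be dropped. The sketch below shows how the threshold changes which columns survive. The matrix is a made-up example, not part of the assignment data, with one constant column, one low-variance column, and one high-variance column.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (hypothetical values): constant, low-variance, high-variance columns.
X = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 1.1, 20.0],
    [0.0, 0.9, 30.0],
    [0.0, 1.0, 40.0],
])

# threshold=0.0 keeps every feature whose variance is strictly greater than zero.
keep_nonconstant = VarianceThreshold(threshold=0.0).fit(X)
mask_zero = keep_nonconstant.get_support()

# A higher threshold also removes the low-variance column (variance about 0.005).
keep_high = VarianceThreshold(threshold=0.01).fit(X)
mask_high = keep_high.get_support()
X_reduced = keep_high.transform(X)

print(mask_zero)
print(mask_high)
print(X_reduced.shape)
```

Note that the selector raises an error during fitting if no feature exceeds the threshold, so very large thresholds are not safe to try on scaled data.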

### Problem 1.3

Complete Problem 1.3 by finishing the function *rfe_fit*: create a [Linear Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) and a [recursive feature elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) (RFE) feature selection model. Fit the selector on the training data and transform the training data with it. Use the transformed features to fit the linear support vector classifier. Return the fitted model and feature selector. Set the random_state argument to 0 where applicable.

In [28]:
def rfe_fit(X_train, y_train, num_features=2):
    '''
    Parameters
    ----------
    X_train: numpy array containing training features
    y_train: pandas series containing training labels
    num_features: number of features for RFE to select
    
    Returns
    -------
    LinearSVC model
    RFE model
    '''
    # set the n_features_to_select argument in the RFE constructor to the num_features parameter
    estimator = LinearSVC(random_state=0)
    selector = RFE(estimator, n_features_to_select = num_features)
    new_x_train = selector.fit_transform(X_train, y_train)
    
    estimator.fit(new_x_train, y_train)
    
    return estimator, selector

In [29]:
model2, selector = rfe_fit(X_train, y_train, 2)
new_X_val2 = selector.transform(X_val)
val_score2 = model2.score(new_X_val2, y_val)
print("Validation Score:", val_score2)

assert_almost_equal(val_score2, .8208, places=2)


Validation Score: 0.820833333333

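To inspect which features recursive feature elimination keeps, the fitted selector exposes `support_` and `ranking_`. The following is a minimal sketch on synthetic data from `make_classification`, which is an assumption made for illustration and is not the simulated dataset. The `max_iter` value is likewise an arbitrary choice to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic data (assumption): 5 features, only 2 of them informative.
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

base = LinearSVC(random_state=0, max_iter=10000)
selector = RFE(base, n_features_to_select=2)
X_selected = selector.fit_transform(X, y)

# support_ is a boolean mask of the kept features; ranking_ is 1 for kept
# features and larger for features that were eliminated earlier.
print(selector.support_)
print(selector.ranking_)

# RFE fits clones of the estimator, so `base` still has to be fitted
# explicitly on the reduced features before it can be used for scoring.
base.fit(X_selected, y)
print(base.score(X_selected, y))
```

This mirrors the structure of *rfe_fit*: the selector is fitted first, and the classifier passed to it is then fitted separately on the transformed features.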