# Week 2 Problem 4

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. **If your code does not pass the unit tests, it will not pass the autograder.**

# Due Date: 6 PM, January 29, 2018

In [11]:
import pandas as pd
import numpy as np

import sklearn as sk
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

from nose.tools import assert_equal, assert_true, assert_false

The cell below reads in a simulated dataset where y is an unknown function of a, b, and c.

In [12]:
df = pd.read_csv('/home/data_scientist/data/misc/sim.data')
df.head()

| Unnamed: 0 | a | b | c | y |
|---|---|---|---|---|
| 0 | 0.004539 | 0.818678 | 194.381891 | 0 |
| 1 | 0.001367 | 0.243724 | 245.378577 | 0 |
| 2 | 1.579454 | 0.465842 | 849.943583 | 0 |
| 3 | 7.189778 | 0.456895 | 129.707932 | 0 |
| 4 | 97.743634 | 0.319419 | 120.998294 | 1 |
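
The `Unnamed: 0` column in the output above looks like a row index that was written to the file without a header, although the source does not confirm this. As a minimal sketch under that assumption, the snippet below uses a small in-memory CSV (a hypothetical stand-in built from the first two rows shown above, not the real `sim.data` file) to show how `index_col=0` keeps such a column out of the feature columns.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the data file; the empty first header
# field is an assumption about how the real file was written.
csv_text = """,a,b,c,y
0,0.004539,0.818678,194.381891,0
1,0.001367,0.243724,245.378577,0
"""

# Default read: the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index leaves only a, b, c, and y.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

raw_columns = list(raw.columns)
indexed_columns = list(indexed.columns)
print(raw_columns)
print(indexed_columns)
```

Whether to drop the column is a judgment call; the point is only that any column left in the dataframe will be treated as a feature by code that selects "everything except `y`".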


### Problem 4.1 

In the classify function below, use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split the simulated data (*df*) 70/30 into training and testing sets, respectively. The random_state argument passed into the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function should be 0.

Use a [Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) implemented in scikit-learn to train your model to classify y using any combination of a, b, and c. *In this order:* return your support vector classifier, testing features, and testing labels.

**Using *only* the material you've learned from this course thus far**, your support vector classifier should have a mean accuracy greater than or equal to 90% on the testing set. Feel free to import any functions or classes that have been discussed thus far in the readings.

---
It's important to: 
- set the random_state argument to 0 when splitting the data,
- perform a 70/30 split,
- not modify your labels after splitting your data,
- use a Support Vector Classifier,
- and not train using your testing set.

*Otherwise you risk losing points.*
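
The split requirements above can be checked on a toy dataframe before touching the real data. This is a minimal sketch, not the graded solution: the dataframe `toy`, its 1000 rows (chosen to match the 300-row test set that the unit tests expect), and its random values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the simulated data: 1000 rows, columns a, b, c, y.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'a': rng.rand(1000),
    'b': rng.rand(1000),
    'c': rng.rand(1000),
})
toy['y'] = (toy['a'] > 0.5).astype(int)

features, labels = toy[['a', 'b', 'c']], toy['y']

# 70/30 split with a fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=0)

# Repeating the call with the same seed selects the same rows.
_, X_te_again, _, _ = train_test_split(
    features, labels, test_size=0.3, random_state=0)

split_sizes = (len(X_tr), len(X_te))
same_split = X_te.index.equals(X_te_again.index)
no_overlap = len(set(X_tr.index) & set(X_te.index)) == 0
print(split_sizes, same_split, no_overlap)
```

Because the training and testing indices never overlap, fitting only on `X_tr` and `y_tr` keeps the testing rows out of training.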

In [3]:
def classify(df):
    '''
    Splits simulated data passed in and trains a support vector classifier.
    
    Parameters
    ----------
    df: dataframe containing simulated dataset.
    
    Returns
    -------
    Trained support vector classifier
    30% of the features for testing (as a Pandas Dataframe)
    and 30% of the labels for testing (as a Pandas Series)
    '''
    
    # Use every column except the label as a feature
    # (as written, this also includes the 'Unnamed: 0' column)
    independent_vars = list(df)
    independent_vars.remove('y')
    dependent_var = 'y'

    # 70/30 train/test split with a fixed seed for reproducibility
    X_train, X_test, Y_train, Y_test = \
         train_test_split(df[independent_vars], df[dependent_var],
                          test_size = 0.3, random_state=0)

    # Linear-kernel SVC with penalty parameter C=10
    model = SVC(kernel='linear', C=10)
    model = model.fit(X_train, Y_train)

    # Accuracy on the test set, computed for inspection only (not returned)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(Y_test, y_pred)

    return model, X_test, Y_test

In [4]:
model, test_features, y_test = classify(df)

assert_equal(len(test_features), 300, msg='Per the instructions use 30% of your data for testing')
assert_equal(len(y_test), 300, msg='Per the instructions use 30% of your data for testing')


In [5]:
assert_true(model.score(test_features, y_test) > 0.9, msg='Your Support Vector Machine Classifier should have a mean accuracy greater than 90%')



Creating an 80/20 split for training and testing, respectively. These variables will be used in Problems 4.2 and 4.3.

In [6]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=0)
X_train = train_data.drop('y', axis=1)
X_test = test_data.drop('y', axis=1)
y_train = train_data['y']
y_test = test_data['y']

### Problem 4.2
In the search function below, a [logistic regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model will be passed in. To complete this problem, search over the parameter spaces you deem important and fit the model on the training features and labels passed in. To receive full credit you must use a logistic regression model as the estimator, return a GridSearch object, and achieve a mean accuracy greater than or equal to 90.5%. *You are free to use any of the functions or classes that have been covered thus far.*
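
As a rough illustration of what a fitted grid search exposes, the sketch below runs `GridSearchCV` over a small logistic regression grid on synthetic data. The data from `make_classification`, the grid values, and `cv=3` are assumptions chosen for the example, not recommended settings for this problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with three features (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid: 3 values of C times 2 class_weight options = 6 candidates.
param_grid = {'C': [0.01, 1.0, 100.0], 'class_weight': [None, 'balanced']}

grid = GridSearchCV(LogisticRegression(random_state=0), param_grid,
                    scoring='accuracy', cv=3)
grid.fit(X_tr, y_tr)

best_params = grid.best_params_           # parameter combination with the best CV score
cv_accuracy = grid.best_score_            # mean cross-validated accuracy of that combination
test_accuracy = grid.score(X_te, y_te)    # accuracy of the refit best model on held-out data
print(best_params, cv_accuracy, test_accuracy)
```

Calling `score` on the fitted search object evaluates the best estimator after it has been refit on the full training set, which is why the search object itself can be returned and scored like a model.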

In [7]:
def search(model, train_features, train_labels):
    '''
    Searches over parameter spaces of a logistic regression model
        
    Parameters
    ----------
    model: Logistic Regression Model
    train_features: pandas dataframe containing features to train on
    train_labels: pandas series containing labels to classify
    
    Returns
    -------
    GridSearch Object
    '''   
    search_space = [{'C': [1E6, 1E5],'penalty': ['l1', 'l2'], 'class_weight': ['balanced']}]
    
    # Perform a grid search to find model with best accuracy
    clf = GridSearchCV(model, search_space, scoring='accuracy')
    clf.fit(train_features, train_labels)
    
    return clf

In [8]:
model = search(LogisticRegression(random_state=0), X_train, y_train)

assert_equal(type(model), type(sk.model_selection._search.GridSearchCV(model, param_grid={})))
score = model.score(X_test, y_test)
assert_true(score >= .905, msg='Mean Accuracy is not greater than or equal to 90.5%')

### Problem 4.3
In the d_tree function below, train a decision tree classifier on the training features and labels. Using the trained model, make predictions on the testing features. Return the trained model and the predictions (in this order). To receive full credit, do not use any classifier other than a Decision Tree Classifier, and your classifier should have a mean accuracy score greater than or equal to 94%.
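
The train, predict, and score sequence described above might look like the following on synthetic data. This is a sketch only: the data from `make_classification` and the choice of `max_depth=5` are assumptions for illustration and say nothing about which settings reach the required accuracy on the assignment data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the training portion only, then predict on the held-out portion.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_tr, y_tr)
predictions = tree.predict(X_te)

# accuracy_score on the predictions matches the model's own score method.
acc = accuracy_score(y_te, predictions)
print(acc, tree.score(X_te, y_te))
```

Fixing `random_state` on the classifier makes the fitted tree, and therefore the predictions, repeatable from run to run.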

In [9]:
def d_tree(model, X_train, y_train, X_test):
    '''
    Trains a Decision Tree Classifier on X_train & y_train and creates predictions with X_test.
    
    Parameters
    ----------
    model:  Decision Tree Classifier
    X_train: pandas dataframe containing features to train on
    y_train: pandas series containing labels to classify
    X_test: pandas dataframe containing features to make predictions with
    
    Returns
    -------
    model: Can be either a Decision Tree Classifier or a Grid Search Object (Hint)
    predictions: Predictions from the model using X_test.
    '''
    
    search_space = [{'max_depth': [10],'criterion': ['gini', 'entropy']}]
    
    # Perform a grid search to find model with best accuracy
    clf = GridSearchCV(model, search_space, scoring='accuracy')
    clf.fit(X_train, y_train)
    
    predictions = clf.predict(X_test)
    
    return clf, predictions


In [10]:
model, pred = d_tree(DecisionTreeClassifier(random_state=0), X_train, y_train, X_test)

assert_true(accuracy_score(y_test, pred) >= 0.94)
