# Week 2 Problem 3

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run All*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. **If your code does not pass the unit tests, it will not pass the autograder.**

# Due Date: 6 PM, January 29, 2018

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import seaborn as sns

from nose.tools import assert_equal, assert_in, assert_is_not
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_frame_equal, assert_index_equal

# Tips Dataset

For this assignment, we will use the built-in dataset called ``Tips``, which contains information about how much restaurant staff receive in tips. Suppose we also know whether each customer leaves a review after their visit. The restaurant wants to improve its profile on Yelp and has tasked us with predicting whether a customer will leave a review. We will use a support vector machine model for this problem.

In [2]:
def getData():

    # Load in the dataset as a Pandas DataFrame
    data = sns.load_dataset("tips")
    
    # Create the labels: random 0/1 values, unseeded, so they change on every run
    data['review'] = np.random.randint(0, 2, size=len(data))
    # Return the DataFrame with the added review column
    return data

data = getData()
data.head()

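| | total_bill | tip | sex | smoker | day | time | size | review |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 1 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 1 |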


As you can see, we have binary classes (`review` is either 0 or 1) and 7 features. We expect `total_bill`, `tip` and `size` to be correlated: a larger party tends to run up a bigger bill, which leads to a bigger tip.

In [3]:
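# Compute the correlation matrix of the numeric columns
# (the categorical columns cannot be correlated directly)
data.select_dtypes(include='number').corr(method='pearson')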

| | total_bill | tip | size | review |
|---|---|---|---|---|
| total_bill | 1.0 | 0.675734 | 0.598315 | -0.03661 |
| tip | 0.675734 | 1.0 | 0.489299 | -0.026924 |
| size | 0.598315 | 0.489299 | 1.0 | -0.058571 |
| review | -0.03661 | -0.026924 | -0.058571 | 1.0 |


We can see that those features are indeed correlated. Note that the presence of correlated features affects classification models differently. For a support vector machine, in particular, it does not really matter. However, it is good practice to identify them in case you would like to use other models or feature selection. For example, principal component analysis (PCA) picks the components with maximum variance, so highly correlated features can inflate the effect of the leading components.
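
To make the PCA remark concrete, here is a minimal sketch on synthetic data (not the tips dataset). The array `correlated` and its three columns are assumptions made up for illustration: two columns are near copies of each other and the third is independent.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic data: two strongly correlated columns plus one independent column
base = rng.normal(size=500)
correlated = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=500),  # nearly a copy of the first column
    rng.normal(size=500),               # unrelated to the other two
])

pca = PCA(n_components=3).fit(correlated)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
print(ratios)
```

The two correlated columns share one direction of variation, so the first component should capture well over half of the total variance while the last captures almost none.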

# Cleaning the data

## Question 1

Let us clean the data as follows:
- __Create a new column in the `data` DataFrame called `tip_percentage` which contains the tip as a percentage of the bill.__ Note that the `total_bill` column is the bill before tip. For example, a \$2.16 tip on a \$12 bill is 18%.
- __Remove the `total_bill` and `tip` columns.__
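
As a small illustration of the two steps, the sketch below works on a hypothetical two-row frame named `toy` rather than on `data`. The ratio is kept as a fraction here, which is an assumption of this sketch; multiply by 100 if you prefer percent units.

```python
import pandas as pd

# Hypothetical bills; the first row is the $2.16 tip on a $12 bill from the text
toy = pd.DataFrame({'total_bill': [12.00, 20.00], 'tip': [2.16, 3.00]})

# Tip as a fraction of the pre-tip bill (18% appears as roughly 0.18)
toy['tip_percentage'] = toy['tip'] / toy['total_bill']

# Drop the two source columns, leaving only the derived one
toy = toy.drop(columns=['total_bill', 'tip'])
print(toy)
```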

In [4]:
data = getData()

# Create a new column called tip_percentage
# (stored as a fraction of the pre-tip bill, so 0.18 means 18%)
data['tip_percentage'] = data['tip'] / data['total_bill']

# Remove the total_bill and tip columns; axis=1 selects columns
columns = ['total_bill', 'tip']
data.drop(columns, inplace=True, axis=1)

In [5]:
assert_equal(len(data.columns), 7)
assert_equal(isinstance(data['tip_percentage'], pd.Series), True)


## Question 2

We have several categorical variables (`sex`, `smoker`, `day`, `time`) which we need to encode as indicator variables before we can use them with scikit-learn. Create a function called `data_encoding()` which takes a pandas DataFrame and a list of variable names to encode.

__Hint__: You may use the pandas built-in function `get_dummies()`.
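
For reference, here is roughly what `get_dummies()` does on a tiny, made-up frame. The frame `toy` and its values are hypothetical. The `.astype(int)` cast is an addition of this sketch so that the result is 0/1 whether the installed pandas returns booleans or integers.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({'time': ['Lunch', 'Dinner', 'Lunch'], 'size': [2, 4, 3]})

# One indicator column per category value of the selected column
encoded = pd.get_dummies(toy[['time']]).astype(int)
print(encoded)
```

Each row has a 1 in exactly one of the new `time_*` columns, which is the format scikit-learn estimators expect for categorical inputs.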

In [6]:
def data_encoding(data, colnames):
    '''    
    Parameters
    ----------
    data: A pandas.DataFrame
    colnames: A list of strings
    
    Returns
    -------
    result: A pandas.DataFrame
    '''
    
    # One-hot encode only the requested columns
    result = pd.get_dummies(data[colnames])

    return result

In [7]:
# Encode the data by calling the function above
categorical_data = data_encoding(data, ['sex', 'smoker', 'day', 'time'])

# Contains only our categorical data encoded
categorical_data.head()

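| | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun | time_Lunch | time_Dinner |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |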


In [8]:
assert_equal(len(categorical_data.columns), 10)
assert_equal(isinstance(categorical_data, pd.DataFrame), True)
assert_equal(pd.Series(['sex_Male', 'sex_Female']).isin(categorical_data.columns).all(), True)
assert_equal(pd.Series(['smoker_Yes', 'smoker_No']).isin(categorical_data.columns).all(), True)


Let us create the training and testing sets to be used with our model.

In [9]:
# Join the encoded categorical data with the numerical size column
features = pd.concat([categorical_data, data[['size']]], axis=1)
labels = data.review

# Perform an 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=10)

## Question 3

Write a function called `train_predict_svm()` which trains an `svm.SVC()` model using your training data and makes a prediction based on the testing data (note that there are multiple SVM implementations, so make sure you use the correct one).

Specifically, your function takes the training data (`X_train` and `y_train`), the testing data (`X_test`), and the model parameters `kernel`, `gamma`, and `C`, and outputs an SVM model and the predictions.
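
The fit-then-predict pattern can be sketched on made-up data as follows. The arrays `X_toy` and `y_toy` are hypothetical and deliberately easy to separate; they are not part of the assignment.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 sits near the origin, class 1 near (3.5, 3.5)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [3, 3], [3, 4], [4, 3], [4, 4]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Configure the hyper-parameters, then fit on the training data
toy_model = SVC(kernel='linear', gamma=0.001, C=10)
toy_model.fit(X_toy, y_toy)

# Predict labels for two unseen points, one close to each cluster
toy_pred = toy_model.predict(np.array([[0.5, 0.5], [3.5, 3.5]]))
print(toy_pred)
```

The constructor arguments are stored on the fitted model (for example `toy_model.kernel` and `toy_model.C`), which is what the unit tests for this question inspect.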

In [10]:
def train_predict_svm(X_train, y_train, X_test, kern, gam, c):
    '''
    Trains an svm.SVC model and predicts labels for the test features.

    Parameters
    ----------
    X_train: A pandas.DataFrame of the training features
    y_train: A pandas.Series of the training labels
    X_test: A pandas.DataFrame of the testing features
    kern: A string specifying the kernel
    gam: A float for the kernel coefficient (gamma)
    c: A float for the penalty of the error term (C)

    Returns
    -------
    model: A svm.SVC instance
    prediction: A numpy array
    '''

    # Specify the hyper-parameters and fit the model on the training data
    model = svm.SVC(kernel=kern, gamma=gam, C=c)
    model = model.fit(X_train, y_train)

    # Predict the labels of the test data
    prediction = model.predict(X_test)

    return model, prediction

In [11]:
# Train a few different SVM models
svm_model, pred = train_predict_svm(X_train, y_train, X_test, 'linear', 0.001, 10)
svm_model2, pred2 = train_predict_svm(X_train, y_train, X_test, 'poly', 0.0001, 100)
svm_model3, pred3 = train_predict_svm(X_train, y_train, X_test, 'rbf', 0.01, 10)

In [12]:
assert_equal(isinstance(svm_model, svm.SVC), True)
assert_equal(svm_model.C, 10)
assert_equal(svm_model.gamma, 0.001)
assert_equal(svm_model.kernel, 'linear')

assert_equal(isinstance(svm_model2, svm.SVC), True)
assert_equal(svm_model2.C, 100)
assert_equal(svm_model2.gamma, 0.0001)
assert_equal(svm_model2.kernel, 'poly')


In [13]:
# Print the accuracy of the three models
print('Model 1 (Linear):', accuracy_score(y_test, pred))
print('Model 2 (Poly):', accuracy_score(y_test, pred2))
print('Model 3 (Radial):', accuracy_score(y_test, pred3))

Model 1 (Linear): 0.448979591837
Model 2 (Poly): 0.428571428571
Model 3 (Radial): 0.448979591837


The choice of kernel, gamma and C affects the accuracy of the model. (On this dataset the differences are small because the `review` labels were generated at random.) The process of identifying the best parameters for a model is called *hyper-parameter tuning*. You might recall that we performed a naive parameter tuning in the assignment for k-NN by training several k-NN models and recording their accuracy. This approach would take too long if our parameter space is large or if we have too much data.

Let us explore a simple approach where we tune the hyper-parameters of the SVM by searching through a grid. Note that there are several more advanced tuning techniques that are outside the scope of this class.

Scikit-learn has a built-in grid search function which is very intuitive to use. In the example below, we search 4 different kernels, 3 different values of gamma, and 4 different values of `C`. We train a model for each parameter combination to find the one which yields the best accuracy.
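
To get a feel for how quickly a grid grows, the sketch below counts the candidate settings for a search space of the shape just described, using scikit-learn's `ParameterGrid`. The variable names `grid` and `n_candidates` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same shape of search space as in the text: 4 kernels x 3 gammas x 4 values of C
grid = ParameterGrid({
    'kernel': ['rbf', 'linear', 'sigmoid', 'poly'],
    'gamma': [0.1, 0.01, 0.001],
    'C': [1, 10, 100, 1000],
})

n_candidates = len(grid)
print(n_candidates)    # 4 * 3 * 4 = 48 combinations
print(list(grid)[0])   # one candidate setting, as a dict
```

A grid search fits every one of these candidates once per cross-validation fold, so adding another parameter or a few more values multiplies the training time.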

In [14]:
# Define the search space
search_space = [{'kernel': ['rbf'], 'gamma': [0.1, 0.01, 0.001], 'C': [1, 10, 100, 1000]},
                {'kernel': ['linear'], 'gamma': [0.1, 0.01, 0.001], 'C': [1, 10, 100, 1000]},
                {'kernel': ['sigmoid'], 'gamma': [0.1, 0.01, 0.001], 'C': [1, 10, 100, 1000]},
                {'kernel': ['poly'], 'gamma': [0.1, 0.01, 0.001], 'C': [1, 10, 100, 1000]}]

# Perform a grid search to find model with best accuracy
clf = GridSearchCV(svm.SVC(), search_space, scoring='accuracy')
clf.fit(X_train, y_train)

print('Best C:', clf.best_estimator_.C) 
print('Best Kernel:', clf.best_estimator_.kernel)
print('Best Gamma:', clf.best_estimator_.gamma)

# Make the prediction using the fine-tuned model and compute accuracy
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

Best C: 1
Best Kernel: rbf
Best Gamma: 0.1
Accuracy: 0.448979591837


## Question 4

Create a function called `tuning_SVM()` which performs a grid search for an `svm.SVC()` model. Your function will take the training and testing data along with the search space for the kernel, gamma and C. It will output the parameter values which yield the best accuracy score.
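
A fitted `GridSearchCV` object also exposes the winning settings directly through its `best_params_` and `best_score_` attributes. The sketch below shows them on the iris dataset, which is used only as a stand-in (an assumption of this example, as are the names `demo_space` and `demo_search`).

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

# Iris is a stand-in dataset for this sketch, not part of the assignment
X_demo, y_demo = load_iris(return_X_y=True)

# A deliberately small search space so the example runs quickly
demo_space = [{'kernel': ['rbf', 'linear'], 'gamma': [0.1, 0.01], 'C': [1, 10]}]
demo_search = GridSearchCV(svm.SVC(), demo_space, scoring='accuracy', cv=3)
demo_search.fit(X_demo, y_demo)

# Winning parameter combination and its mean cross-validated accuracy
print(demo_search.best_params_)
print(demo_search.best_score_)
```

Note that `best_score_` is the mean accuracy over the cross-validation folds, which is not the same number as the accuracy of the refit model on the training or testing data.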

In [15]:
def tuning_SVM(X_train, y_train, X_test, search_space):
    '''
    Runs a grid search over search_space for an svm.SVC model.

    Parameters
    ----------
    X_train: A pandas.DataFrame of the features
    y_train: A pandas.Series of the labels
    X_test: A pandas.DataFrame of the features (not used by this implementation)
    search_space: A list of dictionaries

    Returns
    -------
    best_C: An int
    best_kernel: A string
    best_gamma: A float
    best_accuracy: A numpy.float64 (accuracy on the training data)
    '''
    # Perform a grid search to find the model with the best cross-validated accuracy
    clf = GridSearchCV(svm.SVC(), search_space, scoring='accuracy')
    clf.fit(X_train, y_train)

    best_C = clf.best_estimator_.C
    best_kernel = clf.best_estimator_.kernel
    best_gamma = clf.best_estimator_.gamma

    # No test labels are passed in, so score the refit best model on the training data
    y_pred = clf.predict(X_train)
    best_accuracy = accuracy_score(y_train, y_pred)

    return best_C, best_kernel, best_gamma, best_accuracy

In [16]:
# Might take a bit for this test to finish

search_space1 = [{'kernel': ['rbf', 'poly'], 'gamma': [0.1, 0.01, 0.001], 'C': [1, 10, 100, 1000]}]
C1, kernel1, gamma1, accuracy1 = tuning_SVM(X_train, y_train, X_test, search_space1)
assert_equal(type(C1), int)
assert_equal(type(gamma1), float)
assert_equal(type(kernel1), str)
assert_equal(type(accuracy1), np.float64)
assert_equal(kernel1 in ['rbf', 'poly'], True)
assert_equal(gamma1 in [0.1, 0.01, 0.001], True)
assert_equal(C1 in [1, 10, 100, 1000], True)
