# Lab 4: SVM + Neural Networks #


In [1]:
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, ParameterGrid

import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
df_train = pd.read_csv('./lab_4_training.csv')
df_test = pd.read_csv('./lab_4_test.csv')
df_train.head()

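| Unnamed: 0.1 | Unnamed: 0 | gender | age | year | eyecolor | height | miles | brothers | sisters | computertime | exercise | exercisehours | musiccds | playgames | watchtv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1303 | male | 20 | second | green | 73.0 | 210.0 | 0 | 1 | 10.0 | Yes | 5.0 | 50.0 | 1.0 | 15.0 |
| 1 | 36 | male | 20 | third | other | 71.0 | 90.0 | 1 | 0 | 15.0 | Yes | 4.0 | 10.0 | 0.0 | 1.0 |
| 2 | 489 | male | 22 | fourth | hazel | 75.0 | 200.0 | 0 | 1 | 1.0 | Yes | 2.0 | 150.0 | 1.0 | 10.0 |
| 3 | 1415 | male | 19 | second | brown | 72.0 | 35.0 | 2 | 2 | 20.0 | Yes | 5.0 | 100.0 | 0.0 | 7.0 |
| 4 | 616 | male | 22 | fourth | hazel | 71.0 | 15.0 | 2 | 1 | 10.0 | Yes | 7.0 | 10.0 | 0.0 | 5.0 |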


***
### Question 1 ###
Calculate a baseline accuracy measure using the majority class.

**Question 1.a**  
Find the majority class in the training set. If you always predicted this class in the training set, what would your accuracy be?

In [3]:
df_train.gender[df_train.gender == "male"].count() - df_train.gender[df_train.gender == "female"].count()

-120

The majority class is female: the training set contains 120 more female rows than male rows.

In [4]:
accuracy = accuracy_score(df_train.gender, ["female"]*len(df_train))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 53.774%


**Question 1.b**   
If you always predicted this same class (majority from the training set) in the test set, what would your accuracy be?

In [5]:
accuracy = accuracy_score(df_test.gender, ["female"]*len(df_test))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 52.261%
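The baseline in 1.a and 1.b can also be computed directly from label counts. The sketch below is only an illustration: it uses a small made-up label list rather than the lab data, and `majority_baseline` is a hypothetical helper name. It picks the majority class from the training labels only and then scores that constant prediction on each set.

```python
from collections import Counter

def majority_baseline(train_labels, eval_labels):
    """Return the majority class of train_labels and the accuracy of
    always predicting that class on eval_labels."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority)
    return majority, hits / len(eval_labels)

# Toy labels for illustration only; these are not the lab's data.
train_labels = ["female", "female", "male", "female", "male"]
test_labels = ["male", "female", "male", "female"]

majority, train_acc = majority_baseline(train_labels, train_labels)
_, test_acc = majority_baseline(train_labels, test_labels)
print(majority, train_acc, test_acc)
```

Choosing the class from the training labels alone keeps the test-set baseline comparable with the models trained later.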


***
### Question 2 ###
Get started with Neural Networks.

   
Choose a NN implementation and specify which you choose. Be sure the implementation allows you to modify the number of hidden layers and hidden nodes per layer.  

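NOTE: When possible, specify the logsig (sigmoid/logistic) function as the transfer function for the output node and use Levenberg-Marquardt backpropagation. scikit-learn's `MLPClassifier` does not offer Levenberg-Marquardt, so the quasi-Newton `lbfgs` solver is used here instead. In `MLPClassifier`, `activation='logistic'` sets the transfer function of the hidden layers.

Implementation chosen: scikit-learn (`MLPClassifier`).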

**Question 2.a**   
Train a neural network with a single 10 node hidden layer. Only use the Height feature of the dataset to predict the Gender. You will have to change Gender to a 0 and 1 class. After training, use your trained model to predict the class using the height feature from the training set. What was the accuracy of this prediction?

In [6]:
# Encode gender as a 0/1 class (LabelEncoder sorts the labels: female -> 0, male -> 1)
le = LabelEncoder()
le.fit(["male", "female"])
y = le.transform(df_train.gender)
# scikit-learn expects a 2-D feature matrix, so keep height as a one-column DataFrame
X = df_train[['height']]

In [7]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y)

MLPClassifier(activation='logistic', alpha=1e-05, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [8]:
accuracy = accuracy_score(y, clf.predict(X)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 84.654%


**Question 2.b**  
Take the trained model from question 2.a and use it to predict the test set. This can be accomplished by giving the trained model the Height feature values from the test set. What is the accuracy of this model on the test set?

In [9]:
yt = le.transform(df_test.gender)
Xt = df_test.height
Xt = Xt.to_frame()
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%


**Question 2.c**   
Neural networks tend to prefer smaller, normalized feature values. Try taking the log of the height feature in both the training and test sets, or use the `StandardScaler` in scikit-learn to center the continuous values and scale them to unit variance. Repeat questions 2.a and 2.b with the log version and with the centered and scaled version of this feature.

In [10]:
X = np.log(df_train.height)
X = X.to_frame()
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y)
accuracy = accuracy_score(y, clf.predict(X)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 84.654%


In [11]:
Xt = np.log(df_test.height)
Xt = Xt.to_frame()
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%


In [12]:
scaler = StandardScaler()
# StandardScaler expects a 2-D input, so pass height as a one-column DataFrame
X = pd.DataFrame(scaler.fit_transform(df_train[['height']]))
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y)
accuracy = accuracy_score(y, clf.predict(X)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 84.654%


In [13]:
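# NOTE: fit_transform re-estimates the mean and standard deviation on the test set;
# scaler.transform(...) would instead reuse the training-set statistics.
Xt = pd.DataFrame(scaler.fit_transform(df_test[['height']]))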
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%
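A common convention is to estimate the scaling statistics on the training set only and reuse them for the test set. The sketch below shows that pattern with `StandardScaler` on made-up height values (not the lab data); the array names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy heights in inches, shaped (n_samples, 1); illustrative values only.
train_height = np.array([[60.0], [64.0], [68.0], [72.0]])
test_height = np.array([[62.0], [70.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data.
train_scaled = scaler.fit_transform(train_height)
# transform reuses those training statistics; nothing is re-estimated here.
test_scaled = scaler.transform(test_height)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

With this pattern the test values are expressed on the same scale the model saw during training, so a given height maps to the same input value in both sets.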


***
### Question 3 ###
Get started with Support Vector Machines.

  
Choose an SVM implementation and specify which one you chose. Be sure the implementation allows you to choose between linear and RBF kernels.

Implementation chosen: scikit-learn (`SVC`).

**Question 3.a**   
Use the same dataset from 2.a using the linear kernel to find training set prediction accuracy.

In [14]:
X = df_train.height
X = X.to_frame()
clf = SVC(kernel='linear')
clf.fit(X, y) 
accuracy = accuracy_score(y, clf.predict(X)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 83.396%


**Question 3.b**   
Use the same dataset from 2.a with the linear kernel to find the test set prediction accuracy.

In [15]:
Xt = df_test.height
Xt = Xt.to_frame()
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 83.166%


**Question 3.c**   
Use the same dataset from 2.a with the RBF kernel to find the training set prediction accuracy.

In [16]:
clf = SVC(kernel='rbf')
clf.fit(X, y) 
accuracy = accuracy_score(y, clf.predict(X))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 84.654%


**Question 3.d**   
Use the same dataset from 2.a with the RBF kernel to find the test set prediction accuracy.

In [17]:
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%


**Question 3.e**   
Use the same dataset from 2.c (log) with the RBF kernel to find the test set prediction accuracy.

In [18]:
X = np.log(df_train.height)
X = X.to_frame()
Xt = np.log(df_test.height)
Xt = Xt.to_frame()
clf = SVC(kernel='rbf')
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%


**Question 3.f**   
Z-score is a normalization technique. It is the value of a feature minus the average value of that feature in the training set, divided by the standard deviation of that feature in the training set. Repeat question 3.e using the Z-score, note whether the accuracy differs, and comment on why it does or does not change.

In [19]:
X = (df_train.height - np.mean(df_train.height)) / np.std(df_train.height)
X = X.to_frame()
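# NOTE: the test set is standardized with its own mean and standard deviation here;
# the definition above calls for the training-set statistics instead.
Xt = (df_test.height - np.mean(df_test.height)) / np.std(df_test.height)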
Xt = Xt.to_frame()
clf = SVC(kernel='rbf')
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%


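The accuracy does not change. The Z-score is an increasing linear transformation of the single height feature: it shifts and rescales the values without changing their order or the relationship between the feature and the target, so the classifier can separate the classes at the equivalent height threshold.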
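The order-preserving property can be checked on a toy example. The sketch below uses made-up heights and a hypothetical fixed cut-off as a stand-in for a trained classifier (it is not an SVM); it standardizes with training-set statistics and shows that thresholding the raw values and thresholding the Z-scores give the same predictions.

```python
from statistics import mean, pstdev

# Toy heights for illustration only; these are not the lab's data.
train_heights = [62.0, 65.0, 68.0, 71.0, 74.0]
test_heights = [63.0, 66.0, 70.0, 73.0]

# Z-score statistics come from the training set only.
mu = mean(train_heights)
sigma = pstdev(train_heights)

def zscore(values):
    """Standardize values with the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

threshold_raw = 67.0                       # hypothetical decision cut-off in inches
threshold_z = (threshold_raw - mu) / sigma  # the same cut-off on the Z-score scale

raw_predictions = [h > threshold_raw for h in test_heights]
z_predictions = [z > threshold_z for z in zscore(test_heights)]
print(raw_predictions == z_predictions)
```

An RBF kernel is sensitive to feature scale in general, so identical accuracy is not guaranteed for every dataset or every `gamma`; the sketch only shows why a one-feature threshold survives the transformation.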

***

### Question 4 ###
Most of the remaining features in this dataset are categorical. Neither ML method accepts categorical features directly, so transform year, eyecolor, and exercise into a set of binary features, one per unique value of the original feature: mark the binary feature as ‘1’ if the row's value matches that value and ‘0’ otherwise. Using only these binary features, train on the training set and predict the class of the test set.

In [20]:
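# One-hot encode the categorical features: DictVectorizer creates one binary column per value
features = df_train[['year','eyecolor', 'exercise']]
features = features.to_dict('records')
vec = DictVectorizer()
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
features_t = df_test[['year','eyecolor', 'exercise']]
features_t = features_t.to_dict('records')
# NOTE: fit_transform refits the vectorizer on the test set, so the columns only line up with
# the training columns if both sets contain the same category values.
# vec.transform(features_t) would reuse the training columns.
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)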

**Question 4.a**    
What was your accuracy using Neural Network with a single 10 node hidden layer? During training, use a maximum number of iterations of 50. (Expected training time: ~15 mins)

In [21]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1, max_iter = 50)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 59.548%


**Question 4.b**    
What was your accuracy using a SVM with RBF kernel?

In [22]:
clf = SVC(kernel='rbf')
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 58.543%
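The one-hot step used for Question 4 can be made robust to differences between the two sets by fitting the vectorizer on the training records only and calling `transform` on the test records. The sketch below uses made-up records (not the lab data) in which the test set contains an eye color that never appears in training; `DictVectorizer.transform` ignores such unseen values, so the column layout stays fixed.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records for illustration only; these are not the lab's data.
train_records = [
    {"eyecolor": "brown", "exercise": "Yes"},
    {"eyecolor": "green", "exercise": "No"},
]
test_records = [
    {"eyecolor": "green", "exercise": "Yes"},
    {"eyecolor": "hazel", "exercise": "No"},  # "hazel" never appears in training
]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_records)  # learns one column per (feature, value) pair
X_test = vec.transform(test_records)        # reuses the training columns

print(vec.feature_names_)
print(X_test)
```

Because the columns come from the training fit, column `i` means the same thing in `X_train` and `X_test`, which is what the classifier assumes.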


***
### Question 5 ###
Using a NN, does height + eye color predict the test set class better by:

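Answer: the Z-score gives the best test accuracy (87.186%, compared with 85.930% for the original values and 85.678% for the log transform).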

In [23]:
features = df_train[['eyecolor']]
features = features.to_dict('records')
vec = DictVectorizer()
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
X['height'] = df_train.height
features_t = df_test[['eyecolor']]
features_t = features_t.to_dict('records')
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)
Xt['height'] = df_test.height

**Question 5.a**  
Keeping the original feature values?

In [24]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.930%


**Question 5.b**  
Taking the log of the original values?

One is added to the one-hot variables before taking the log, since log(0) is undefined.

In [25]:
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
X = X+1
X['height'] = df_train.height
X = np.log(X)
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)
Xt = Xt+1
Xt['height'] = df_test.height
Xt = np.log(Xt)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.678%
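To see what the plus-one step does to the encoded features, the short sketch below applies it to one made-up row (the values are illustrative assumptions). After the shift, each one-hot entry becomes either log(1) = 0 or log(2), so the binary columns are only rescaled, while height is genuinely compressed.

```python
import math

# One illustrative row: three one-hot eye-color columns plus a height in inches.
one_hot_row = [0.0, 1.0, 0.0]
height = 71.0

# Shift only the one-hot entries by one so that log(0) never occurs.
shifted = [value + 1 for value in one_hot_row]
logged = [math.log(value) for value in shifted] + [math.log(height)]

print(logged)
```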


**Question 5.c**  
Taking the Z-score of the original values?

In [26]:
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
X['height'] = df_train.height
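# Column-wise Z-score using the population standard deviation (ddof=0)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)
Xt['height'] = df_test.height
Xt = (Xt - Xt.mean(axis=0)) / Xt.std(axis=0, ddof=0)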
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 87.186%


***
### Question 6 ###
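Repeat question 5 for exercise hours + eye color.

Answer: the Z-score is again the best (59.548%, compared with 57.789% for the original values and 57.286% for the log transform).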

In [27]:
features = df_train[['eyecolor']]
features = features.to_dict('records')
vec = DictVectorizer()
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
X['exercisehours'] = df_train.exercisehours
features_t = df_test[['eyecolor']]
features_t = features_t.to_dict('records')
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)
Xt['exercisehours'] = df_test.exercisehours

Original values

In [28]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 57.789%


Log (one is added to every variable, including exercise hours, before taking the log)

In [29]:
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
X['exercisehours'] = df_train.exercisehours
X = X+1
X = np.log(X)
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)
Xt['exercisehours'] = df_test.exercisehours
Xt = Xt+1
Xt = np.log(Xt)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 57.286%


Z-score

In [30]:
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
X['exercisehours'] = df_train.exercisehours
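# Column-wise Z-score using the population standard deviation (ddof=0)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)
Xt['exercisehours'] = df_test.exercisehours
Xt = (Xt - Xt.mean(axis=0)) / Xt.std(axis=0, ddof=0)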
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 59.548%


***
### Question 7 ###
Combine the features from questions 4 and 5 with exercise hours from question 6, using the best normalization from questions 5 and 6.

The Z-score is used, since it gave the best test accuracy in both questions.

In [31]:
features = df_train[['year','eyecolor', 'exercise']]  # height and exercisehours are added below
features = features.to_dict('records')
X = vec.fit_transform(features).toarray()
X = pd.DataFrame(X)
X['height'] = df_train.height
X['exercisehours'] = df_train.exercisehours
X = (X - X.mean()) / X.std()
features_t = df_test[['year','eyecolor', 'exercise']]
features_t = features_t.to_dict('records')
Xt = vec.fit_transform(features_t).toarray()
Xt = pd.DataFrame(Xt)
Xt['height'] = df_test.height
Xt['exercisehours'] = df_test.exercisehours
Xt = (Xt - Xt.mean()) / Xt.std()

**Question 7.a**  
What was the NN accuracy on the test set using the single 10 node hidden layer?

In [32]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, activation = "logistic",
                    hidden_layer_sizes=(10,), random_state=1)
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt))
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 87.688%


**Question 7.b**  
What was the SVM accuracy on the test set using the RBF kernel?

In [33]:
clf = SVC(kernel='rbf')
clf.fit(X, y) 
accuracy = accuracy_score(yt, clf.predict(Xt)) 
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 86.683%
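The preprocessing for Question 7 mixes one-hot encoding and Z-scoring. One way to express that in a single object is scikit-learn's `ColumnTransformer`, sketched below on a tiny made-up frame (not the lab data). This is an alternative arrangement under stated assumptions, not the approach used in the cells above: it Z-scores only the numeric columns, leaves the one-hot columns as 0/1, and learns everything from the training frame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames for illustration only; column names follow the lab's dataset.
train = pd.DataFrame({
    "eyecolor": ["brown", "green", "hazel", "brown"],
    "exercise": ["Yes", "No", "Yes", "Yes"],
    "height": [64.0, 70.0, 73.0, 66.0],
    "exercisehours": [2.0, 0.0, 5.0, 3.0],
})
test = pd.DataFrame({
    "eyecolor": ["green", "blue"],  # "blue" is unseen in training
    "exercise": ["No", "Yes"],
    "height": [68.0, 71.0],
    "exercisehours": [1.0, 4.0],
})

preprocess = ColumnTransformer([
    # Unseen categories are encoded as all zeros instead of raising an error.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["eyecolor", "exercise"]),
    ("zscore", StandardScaler(), ["height", "exercisehours"]),
])

X_train = preprocess.fit_transform(train)  # learn categories, means, and scales
X_test = preprocess.transform(test)        # reuse them on the test frame

print(X_train.shape, X_test.shape)
```

The transformed matrices can then be passed to `MLPClassifier` or `SVC` in the same way as `X` and `Xt` above.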


***
### Question 8 - Bonus ###
Can you improve your test set prediction accuracy by 5% or more?  

See how close to that milestone you can get by modifying the tuning parameters of either the neural network (the number of hidden layers, the number of hidden nodes in each layer, the learning rate, aka mu) or the SVM (the kernel, C, and gamma). A good reference on parameter tuning is this guide: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

While the guide is specific to SVMs, and in particular to the C and gamma parameters of the RBF kernel, the method applies generally to any ML technique with tuning parameters.

Please also write a paragraph in a markdown cell below with an explanation of your approach and evaluation metrics.
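One possible starting point for the bonus, following the grid-search procedure recommended in the guide, is a cross-validated search over exponentially spaced values of C and gamma. The sketch below is a minimal illustration on synthetic data from `make_classification`, not on the lab dataset, and the grid bounds and step size are assumptions chosen to keep it fast. Scaling is placed inside the pipeline so that each cross-validation fold is standardized with its own training statistics, and the held-out test set is scored only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the lab data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale inside the pipeline so each CV fold uses only its own training statistics.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Coarse, exponentially spaced grid over C and gamma.
param_grid = {
    "svc__C": [2.0 ** k for k in range(-5, 16, 4)],
    "svc__gamma": [2.0 ** k for k in range(-15, 4, 4)],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# The test set is used once, after the parameters have been selected.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```

A finer grid around the best coarse values is a natural next step. The same pattern applies to `MLPClassifier` by swapping the estimator and searching over parameters such as `hidden_layer_sizes` and `alpha`.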
