# Lab 4: SVM + Neural Networks #
Yiyi Chen & Sayan Sanyal  
2017.02.23

In [1]:
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, ParameterGrid

import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
df_train = pd.read_csv('./lab_4_training.csv')
df_test = pd.read_csv('./lab_4_test.csv')
df_train.head()

Unnamed: 0.1,Unnamed: 0,gender,age,year,eyecolor,height,miles,brothers,sisters,computertime,exercise,exercisehours,musiccds,playgames,watchtv
0,1303,male,20,second,green,73.0,210.0,0,1,10.0,Yes,5.0,50.0,1.0,15.0
1,36,male,20,third,other,71.0,90.0,1,0,15.0,Yes,4.0,10.0,0.0,1.0
2,489,male,22,fourth,hazel,75.0,200.0,0,1,1.0,Yes,2.0,150.0,1.0,10.0
3,1415,male,19,second,brown,72.0,35.0,2,2,20.0,Yes,5.0,100.0,0.0,7.0
4,616,male,22,fourth,hazel,71.0,15.0,2,1,10.0,Yes,7.0,10.0,0.0,5.0


***
### Question 1 ###
Calculate a baseline accuracy measure using the majority class.

In [3]:
majority_gender = df_train['gender'].value_counts().idxmax()
feature = 'gender'

def get_accuracy(df, feature, majority_label):
    accuracy = df[df[feature] == majority_label].shape[0]/df.shape[0]
    print('accuracy: {:.3f}%'.format(accuracy*100))
    
def get_accuracy_clf(clf, X_train, y_train, X_test, y_test):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('accuracy: {:.3f}%'.format(accuracy*100))

**Question 1.a**  
Find the majority class in the training set. If you always predicted this class in the training set, what would your accuracy be?

In [4]:
get_accuracy(df_train, feature, majority_gender)

accuracy: 53.774%


**Question 1.b**   
If you always predicted this same class (majority from the training set) in the test set, what would your accuracy be?

In [5]:
get_accuracy(df_test, feature, majority_gender)

accuracy: 52.261%
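
As a cross-check on the hand-rolled baseline above, the same majority-class rule can be sketched with scikit-learn's `DummyClassifier`. The toy labels and the `*_toy` names below are made up for illustration and are not the lab data.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy stand-ins for the training and test labels (assumed, not the lab data).
y_train_toy = pd.Series(['male', 'male', 'female', 'male', 'female'])
y_test_toy = pd.Series(['female', 'male', 'female', 'male'])

# DummyClassifier ignores the features, so a constant placeholder column is enough.
X_train_toy = [[0]] * len(y_train_toy)
X_test_toy = [[0]] * len(y_test_toy)

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

majority_label = baseline.predict([[0]])[0]
train_accuracy = baseline.score(X_train_toy, y_train_toy)
test_accuracy = baseline.score(X_test_toy, y_test_toy)
print(majority_label, train_accuracy, test_accuracy)
```

The majority label is learned from the training labels only and then reused on the test labels, which mirrors questions 1.a and 1.b.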


***
### Question 2 ###
Get started with Neural Networks.

In [6]:
# Series.reshape is not available in current pandas, so reshape the underlying array
X_train = df_train['height'].values.reshape(-1, 1)
y_train = df_train['gender'].map(lambda x: 0 if x == 'female' else 1)
X_test = df_test['height'].values.reshape(-1, 1)
y_test = df_test['gender'].map(lambda x: 0 if x == 'female' else 1)

**Question 2.a**   
Choose a NN implementation and specify which you choose. Be sure the implementation allows you to modify the number of hidden layers and hidden nodes per layer.  

NOTE: When possible, specify the logsig (sigmoid/logistic) function as the transfer function for the output node and use Levenberg-Marquardt backpropagation. The sklearn MLPClassifier accepts `activation='logistic'`; it does not implement Levenberg-Marquardt, so the quasi-Newton `lbfgs` solver is used here instead.  

**Answer**:  
sklearn.neural_network.MLPClassifier

In [7]:
clf = MLPClassifier(hidden_layer_sizes=(10,),
                    activation='logistic', 
                    solver='lbfgs', 
                    random_state=32462781) # random_state = 324627814 finds a bad local minimum
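
To make the architecture settings concrete, here is a small sketch on synthetic one-feature data (the numbers are invented to loosely resemble heights, not taken from the lab files). It assumes a one-element tuple `(10,)` for `hidden_layer_sizes` and inspects the fitted network's layer count, weight shapes, and output transfer function.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic one-feature data standing in for height (assumed, not the lab data).
rng = np.random.RandomState(0)
X_toy = np.concatenate([rng.normal(64, 2, 50), rng.normal(70, 2, 50)]).reshape(-1, 1)
y_toy = np.array([0] * 50 + [1] * 50)

# One hidden layer with 10 nodes, logistic hidden activation, lbfgs solver.
net = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                    solver='lbfgs', max_iter=200, random_state=0)
net.fit(X_toy, y_toy)

print(net.n_layers_)                  # counts input, hidden, and output layers
print([w.shape for w in net.coefs_])  # one weight matrix per layer transition
print(net.out_activation_)            # output transfer function picked by sklearn
```

Note that in Python `(10)` is just the integer 10, while `(10,)` is a tuple; scikit-learn accepts both for a single hidden layer, but the tuple form extends naturally to several layers, for example `(10, 5)`.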

**Question 2.b**   
Train a neural network with a single 10 node hidden layer. Only use the Height feature of the dataset to predict the Gender. You will have to change Gender to a 0 and 1 class. After training, use your trained model to predict the class using the height feature from the training set. What was the accuracy of this prediction?

In [8]:
clf.fit(X_train, y_train)
accuracy = clf.score(X_train, y_train)
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 84.654%


**Question 2.c**  
Take the trained model from question 2.b and use it to predict the test set. This can be accomplished by taking the trained model and giving it the Height feature values from the test set. What is the accuracy of this model on the test set?

In [9]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%


**Question 2.d**   
Neural Networks tend to prefer smaller, normalized feature values. Try taking the log of the height feature in both training and testing sets, or use a StandardScaler operation in sklearn to center and normalize the data between 0-1 for continuous values. Repeat question 2.c with the log version and with the normalized and centered version of this feature.

In [10]:
# log version
X_train_log = np.log(X_train)
X_test_log = np.log(X_test)

print('log normalization')
get_accuracy_clf(clf, X_train_log, y_train, X_test_log, y_test)

log normalization
accuracy: 85.427%


In [11]:
# normalized and centered version using a Standard scaler
# note: the test set is scaled with its own mean/std here, not the training set's
X_train_norm = StandardScaler().fit_transform(X_train)
X_test_norm = StandardScaler().fit_transform(X_test)

print('Standard Scaler normalization')
get_accuracy_clf(clf, X_train_norm, y_train, X_test_norm, y_test)

Standard Scaler normalization
accuracy: 85.427%


In [12]:
# normalized and centered version using a MinMax scaler
# note: the test set is scaled with its own min/max here, not the training set's
X_train_norm = MinMaxScaler().fit_transform(X_train)
X_test_norm = MinMaxScaler().fit_transform(X_test)

print('MinMax Scaler normalization')
get_accuracy_clf(clf, X_train_norm, y_train, X_test_norm, y_test)

MinMax Scaler normalization
accuracy: 73.618%
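
The scaling cells above fit a separate scaler on the test set. A common alternative, sketched here on synthetic data (the arrays and the `scaled_net` name are assumptions for illustration), is to put the scaler and the classifier in one `Pipeline`, so the scaling statistics come from the training set only and are reused unchanged on the test set. We have not rerun the lab numbers this way.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic one-feature train/test sets (assumed, not the lab data).
rng = np.random.RandomState(0)
X_tr = np.concatenate([rng.normal(64, 2, 100), rng.normal(70, 2, 100)]).reshape(-1, 1)
y_tr = np.array([0] * 100 + [1] * 100)
X_te = np.concatenate([rng.normal(64, 2, 30), rng.normal(70, 2, 30)]).reshape(-1, 1)
y_te = np.array([0] * 30 + [1] * 30)

scaled_net = Pipeline([
    ('scale', MinMaxScaler()),
    ('clf', MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                          solver='lbfgs', max_iter=500, random_state=0)),
])

# fit() learns min/max from X_tr only; score() applies that same mapping to X_te.
scaled_net.fit(X_tr, y_tr)
test_accuracy = scaled_net.score(X_te, y_te)

fitted_scaler = scaled_net.named_steps['scale']
print(fitted_scaler.data_min_, fitted_scaler.data_max_)
print('accuracy: {:.3f}%'.format(test_accuracy * 100))
```

With this layout a test value outside the training range simply maps outside 0-1 instead of redefining the range.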


***
### Question 3 ###
Get started with Support Vector Machines.

**Question 3.a**   
Choose an SVM implementation and specify which you choose. Be sure the implementation allows you to choose between linear and RBF kernels.

**Answer**:   
sklearn.svm.SVC

**Question 3.b**   
Use the same dataset from 2.b using the linear kernel to find training set prediction accuracy.

In [13]:
clf = SVC(kernel='linear', random_state=286501204)

clf.fit(X_train, y_train)
accuracy = clf.score(X_train, y_train)
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 83.396%


**Question 3.c**   
Use the same dataset from 2.b using the linear kernel to find test set prediction accuracy.

In [14]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 83.166%


**Question 3.d**   
Use the same dataset from 2.b using the RBF kernel to find training set prediction accuracy.

In [15]:
clf = SVC(kernel='rbf', random_state=286501204)

clf.fit(X_train, y_train)
accuracy = clf.score(X_train, y_train)
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 84.654%


**Question 3.e**   
Use the same dataset from 2.b using the RBF kernel to find test set prediction accuracy.

In [16]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('accuracy: {:.3f}%'.format(accuracy*100))

accuracy: 85.427%


**Question 3.f**   
Use the same dataset from 2.d (log) using the RBF kernel to find test set prediction accuracy.

In [17]:
clf = SVC(kernel='rbf', random_state=286501204)
print('log normalization')
get_accuracy_clf(clf, X_train_log, y_train, X_test_log, y_test)

log normalization
accuracy: 85.427%


**Question 3.g**   
Z-score is a normalization technique. It is the value of a feature minus the average value for that feature in the training set, divided by the standard deviation of that feature in the training set. Repeat question 3.f using Z-score and note if there is any difference in accuracy and comment on why there is a change or no change in accuracy

In [18]:
# normalized and centered version using a Standard scaler
# note: the test set is standardized with its own mean/std here, not the training set's
X_train_norm = StandardScaler().fit_transform(X_train)
X_test_norm = StandardScaler().fit_transform(X_test)

print('Standard Scaler normalization')
get_accuracy_clf(clf, X_train_norm, y_train, X_test_norm, y_test)

Standard Scaler normalization
accuracy: 85.427%


There is no change in accuracy. Normalisation matters for distance-based algorithms mainly when several features on different scales feed into the same distance metric, because the features with the largest ranges then dominate it. Here we use a single feature, so there are no other dimensions to rebalance, and rescaling it does not change how the examples are ordered along that one axis.
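
For reference, the Z-score definition from question 3.g can be written out by hand and compared with `StandardScaler` fitted on the training values. This is a minimal sketch; the four training heights and two test heights are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented heights for illustration (not the lab data).
train_height = np.array([[60.0], [64.0], [68.0], [72.0]])
test_height = np.array([[66.0], [70.0]])

# Mean and standard deviation are taken from the training set only.
mu = train_height.mean(axis=0)
sigma = train_height.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

z_manual = (test_height - mu) / sigma
z_sklearn = StandardScaler().fit(train_height).transform(test_height)
print(z_manual.ravel(), z_sklearn.ravel())
```

Both routes should give the same values, since `fit` stores the training mean and standard deviation and `transform` applies them to whatever it is given.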

***

### Question 4 ###
The rest of the features in this dataset, barring a few, are categorical. Neither ML method accepts categorical features, so transform year, eyecolor, exercise into a set of binary features, one feature per unique original feature value, and mark the binary feature as ‘1’ if the feature value matches the original value and ‘0’ otherwise. Using only these binary variable transformed features, train and predict the class of the test set.

In [19]:
# clean up one malformatted data entry
# here I use get_dummies to encode categorical variables into binary features
# later I use label encoder and one hot encoder for the same task
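# assign the result back: in-place replace on a selected column is unreliable in current pandas
df_train['year'] = df_train['year'].replace('first"', 'first')
df_test['year'] = df_test['year'].replace('first"', 'first')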
X_train = pd.get_dummies(df_train[['year', 'eyecolor', 'exercise']])
X_test = pd.get_dummies(df_test[['year', 'eyecolor', 'exercise']])
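
One thing to watch with `get_dummies` applied to train and test separately is that the two frames only share the same columns if every category appears in both. A hedged sketch of one way to align them, using invented toy frames (`train_toy`, `test_toy`) instead of the lab data:

```python
import pandas as pd

# Toy frames (assumed): 'hazel' appears only in test, 'blue' and 'green' only in train.
train_toy = pd.DataFrame({'eyecolor': ['brown', 'blue', 'green', 'brown']})
test_toy = pd.DataFrame({'eyecolor': ['brown', 'hazel']})

X_train_toy = pd.get_dummies(train_toy)
X_test_toy = pd.get_dummies(test_toy)

# Reindex the test matrix to the training columns: categories missing from
# the test set become all-zero columns, and unseen ones ('hazel') are dropped.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)

print(list(X_train_toy.columns))
print(X_test_aligned)
```

If the lab's train and test files already contain the same category values, this step changes nothing.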

**Question 4.a**    
What was your accuracy using Neural Network with a single 10 node hidden layer? During training, use a maximum number of iterations of 50. (Expected training time: ~15 mins)

In [20]:
clf = MLPClassifier(hidden_layer_sizes=(10,),
                    activation='logistic', 
                    max_iter=50,
                    solver='lbfgs', 
                    random_state=32462781)

get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

accuracy: 59.548%


**Question 4.b**    
What was your accuracy using a SVM with RBF kernel?

In [21]:
clf = SVC(kernel='rbf', random_state=286501204)

get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

accuracy: 58.794%


***
### Question 5 ###
Using a NN, does height + eye color predict the test set class better by:

In [22]:
# try out different preprocessing to encode categorical variables
# here I use label encoder and one hot encoder, previously I used get_dummies
le = LabelEncoder()
ohe = OneHotEncoder()

# train data
X_train_eyecolor = le.fit_transform(df_train['eyecolor'])
X_train_eyecolor = ohe.fit_transform(X_train_eyecolor.reshape(-1, 1)).toarray()
X_train_height = df_train['height'].values.reshape(-1, 1).copy()

# test data
X_test_eyecolor = ohe.transform(le.transform(df_test['eyecolor']).reshape(-1, 1)).toarray()
X_test_height = df_test['height'].values.reshape(-1, 1).copy()

columns = list(le.classes_) + ['height']

clf = MLPClassifier(hidden_layer_sizes=(10,),
                    activation='logistic', 
                    solver='lbfgs', 
                    random_state=32462781)
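
In scikit-learn 0.20 and later, `OneHotEncoder` accepts string categories directly, so the `LabelEncoder` step is optional. A small sketch under that version assumption, with invented eye colors:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Invented eye colors (not the lab data); 'hazel' is unseen at fit time.
train_eyecolor = np.array(['brown', 'blue', 'green', 'brown']).reshape(-1, 1)
test_eyecolor = np.array(['blue', 'hazel']).reshape(-1, 1)

# handle_unknown='ignore' encodes categories not seen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown='ignore')
train_onehot = encoder.fit_transform(train_eyecolor).toarray()
test_onehot = encoder.transform(test_eyecolor).toarray()

print(encoder.categories_[0])
print(test_onehot)
```

The column order follows `encoder.categories_[0]`, which plays the same role as `le.classes_` in the cell above.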

**Question 5.a**  
Keeping the original feature values?

In [23]:
X_train = pd.DataFrame(np.hstack((X_train_eyecolor, X_train_height)), columns=columns)
X_test = pd.DataFrame(np.hstack((X_test_eyecolor, X_test_height)), columns=columns)

get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

accuracy: 85.930%


**Question 5.b**  
Taking the log of the original values?

In [24]:
X_train_height = np.log(df_train['height'].values).reshape(-1, 1).copy()
X_train = pd.DataFrame(np.hstack((X_train_eyecolor, X_train_height)), columns=columns)

X_test_height = np.log(df_test['height'].values).reshape(-1, 1).copy()
X_test = pd.DataFrame(np.hstack((X_test_eyecolor, X_test_height)), columns=columns)

get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

accuracy: 85.930%


**Question 5.c**  
Taking the Z-score of the original values?

In [25]:
# note: the test set is standardized with its own mean/std here, not the training set's
X_train_height = StandardScaler().fit_transform(df_train['height'].values.reshape(-1, 1)).copy()
X_train = pd.DataFrame(np.hstack((X_train_eyecolor, X_train_height)), columns=columns)

X_test_height = StandardScaler().fit_transform(df_test['height'].values.reshape(-1, 1)).copy()
X_test = pd.DataFrame(np.hstack((X_test_eyecolor, X_test_height)), columns=columns)

get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

accuracy: 86.935%


***
### Question 6 ###
Repeat question 5 for exercise hours + eye color

In [26]:
# 6.a
# label the last column correctly: this question uses exercise hours, not height
columns = list(le.classes_) + ['exercisehours']

X_train_exercise = df_train['exercisehours'].values.reshape(-1, 1).copy()
X_train = pd.DataFrame(np.hstack((X_train_eyecolor, X_train_exercise)), columns=columns)

X_test_exercise = df_test['exercisehours'].values.reshape(-1, 1).copy()
X_test = pd.DataFrame(np.hstack((X_test_eyecolor, X_test_exercise)), columns=columns)

print('original features: ')
get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

# 6.b
# Handle log(0) by adding 1 to all exercise hours
X_train_exercise = np.log(df_train['exercisehours'].values.reshape(-1, 1) + 1).copy()
X_train = pd.DataFrame(np.hstack((X_train_eyecolor, X_train_exercise)), columns=columns)

X_test_exercise = np.log(df_test['exercisehours'].values.reshape(-1, 1) + 1).copy()
X_test = pd.DataFrame(np.hstack((X_test_eyecolor, X_test_exercise)), columns=columns)

print('log of exercise hours: ')
get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

# 6.c
# note: the test set is standardized with its own mean/std here, not the training set's
X_train_exercise = StandardScaler().fit_transform(df_train['exercisehours'].values.reshape(-1, 1)).copy()
X_train = pd.DataFrame(np.hstack((X_train_eyecolor, X_train_exercise)), columns=columns)

X_test_exercise = StandardScaler().fit_transform(df_test['exercisehours'].values.reshape(-1, 1)).copy()
X_test = pd.DataFrame(np.hstack((X_test_eyecolor, X_test_exercise)), columns=columns)

print('z-score hours: ')
get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

original features: 
accuracy: 56.030%
log of exercise hours: 
accuracy: 57.789%
z-score hours: 
accuracy: 56.784%


***
### Question 7 ###
Combine the features from question 4, 5, and exercise hours from question 6 (using the best normalization feature set from questions 5 and 6)

In [27]:
# year, eyecolor, exercise, height, exercisehours
df_train['year'] = df_train['year'].replace('first"', 'first')
df_test['year'] = df_test['year'].replace('first"', 'first')

X_train_categorical = pd.get_dummies(df_train[['year', 'eyecolor', 'exercise']])
X_test_categorical = pd.get_dummies(df_test[['year', 'eyecolor', 'exercise']])

X_train_height = StandardScaler().fit_transform(df_train['height'].values.reshape(-1, 1)).copy()
X_test_height = StandardScaler().fit_transform(df_test['height'].values.reshape(-1, 1)).copy()

X_train_exercise = df_train['exercisehours'].values.reshape(-1, 1).copy()
X_test_exercise = df_test['exercisehours'].values.reshape(-1, 1).copy()

columns = list(X_train_categorical.columns) + ['height', 'exercisehours']

X_train = pd.DataFrame(np.hstack((X_train_categorical, 
                                  X_train_height, 
                                  X_train_exercise)), 
                       columns=columns)
X_test = pd.DataFrame(np.hstack((X_test_categorical, 
                                 X_test_height, 
                                 X_test_exercise)), 
                      columns=columns)

**Question 7.a**  
What was the NN accuracy on the test set using the single 10 node hidden layer?

In [28]:
clf = MLPClassifier(hidden_layer_sizes=(10,),
                    activation='logistic', 
                    solver='lbfgs', 
                    random_state=32462781)

get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

accuracy: 86.683%


**Question 7.b**  
What was the SVM accuracy on the test set using the RBF kernel?

In [29]:
clf = SVC(kernel='rbf', random_state=286501204)

get_accuracy_clf(clf, X_train, y_train, X_test, y_test)

accuracy: 86.432%


***
### Question 8 ###
Can you improve your test set prediction accuracy by 5% or more?  

See how close to that milestone of improvement you can get by modifying the tuning parameters of either Neural Networks (the number of hidden layers, number of hidden nodes in each layer, the learning rate aka mu) or SVM (choice of kernel, C, and gamma). A good walkthrough of parameter tuning is this guide: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf. 

While the guide is specific to SVM, and in particular to the C and gamma parameters of the RBF kernel, the method applies generally to any ML technique with tuning parameters. This question will incorporate using different holdout strategies to conduct tuning on the training set before using the best model to predict the test set. Note that you may reduce the size of the training set and you may also use any feature set and transformation of features you like to improve the prediction.
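
The guide linked above suggests trying exponentially growing sequences of C and gamma and scoring each pair by cross-validation. We did not tune an SVM in this lab, so the following is only a sketch of what such a coarse search could look like; the synthetic data, the `svm_pipeline` name, and the step size of the grid are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumed, not the lab data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

svm_pipeline = Pipeline([('scale', StandardScaler()), ('clf', SVC(kernel='rbf'))])

# Coarse, exponentially spaced grid; a finer grid around the winner would follow.
svm_grid = {
    'clf__C': 2.0 ** np.arange(-5, 16, 4),      # 2^-5, 2^-1, ..., 2^15
    'clf__gamma': 2.0 ** np.arange(-15, 4, 4),  # 2^-15, 2^-11, ..., 2^1
}

search = GridSearchCV(svm_pipeline, svm_grid, cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

Keeping the scaler inside the pipeline means each cross-validation fold is scaled with statistics from its own training portion.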


In [30]:
# NN with MLP Classifier
# Search for best model on training data
featurizer = DictVectorizer()
scaler = StandardScaler(with_mean=False)
model = MLPClassifier()

pipeline = Pipeline([('featurize', featurizer), ('scale', scaler), ('clf', model)])

# initial parameter grid used to evaluate the accuracy of
# different parameter combinations.
# the final parameter grids are included in the cells below
original_param_grid = [{
        'clf__activation': ['logistic', 'tanh', 'relu'],
        'clf__alpha':[0.0001, 0.001, 0.1],
        'clf__solver':['lbfgs', 'sgd', 'adam'],
        'clf__max_iter':[100, 200, 300, 500, 1000],
        'clf__random_state':[1]
    }]

def get_best_score(X_train, y_train, X_val, y_val, pipeline, param_grid):
    """Fit every parameter combination on the training split and return the
    best validation accuracy together with the parameters that produced it."""
    best_score, best_grid = 0, None
    for params in ParameterGrid(param_grid):
        pipeline.set_params(**params)
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_val)
        score = accuracy_score(y_val, y_pred)
        if score > best_score:
            best_score = score
            best_grid = params
    return best_score, best_grid

# note: this redefines get_accuracy from Question 1 with a new signature
def get_accuracy(pipeline, params, X_train, y_train, X_test, y_test, action_description):
    """Refit the pipeline with the given parameters, print the accuracy on
    (X_test, y_test), and return the predictions."""
    pipeline.set_params(**params)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    score = accuracy_score(y_test, y_pred)*100
    print("Score after {}: {:.4f}%".format(action_description, score))
    return y_pred
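
To get a feel for how many fits a grid implies before running it, `ParameterGrid` can simply be counted. The sketch below copies the initial grid from the cell above into a separate, illustrative name (`initial_grid`) and does not fit anything.

```python
from sklearn.model_selection import ParameterGrid

# Same values as original_param_grid above, under a separate illustrative name.
initial_grid = [{
    'clf__activation': ['logistic', 'tanh', 'relu'],
    'clf__alpha': [0.0001, 0.001, 0.1],
    'clf__solver': ['lbfgs', 'sgd', 'adam'],
    'clf__max_iter': [100, 200, 300, 500, 1000],
    'clf__random_state': [1],
}]

grid = ParameterGrid(initial_grid)
n_combinations = len(grid)       # product of the list lengths: 3 * 3 * 3 * 5 * 1
first_candidate = list(grid)[0]  # each candidate is a plain dict for set_params
print(n_combinations)
print(first_candidate)
```

Each candidate is one full pipeline fit in `get_best_score`, and one fit per fold when the same grid is passed to `GridSearchCV`.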

**Question 8.a**  
What was your best accuracy on the test set using the training set as your holdout? (for this question the training set is used for training AND testing)  

For example, specify some value of gamma and C for the RBF SVM kernel. Use the training set to train the model then use the training set again to test the accuracy of the model. Try different values for C and gamma, again training and testing on the training set. Take the most accurate model from your trials and use it to predict the real test set. Report your accuracy. Note that you should only be predicting the test set ONCE.   

**Answer**  
We tried this with a neural network (MLP).

In [31]:
X_train = df_train[["exercisehours", "eyecolor", "year", "exercise", "height"]].copy().to_dict(orient='records')
y_train = df_train['gender']

X_val = df_train[["exercisehours", "eyecolor", "year", "exercise", "height"]].copy().to_dict(orient='records')
y_val = df_train['gender']

X_test = df_test[["exercisehours", "eyecolor", "year", "exercise", "height"]].copy().to_dict(orient='records')
y_test = df_test['gender']

# final parameter grid for 8.a. the initial one is included above for reference
param_grid = [{
        'clf__activation': ['tanh'],
        'clf__alpha':[0.001],
        'clf__solver':['lbfgs'],
        'clf__max_iter':[100000, 500000],
        'clf__random_state':[1]
    }]

In [32]:
best_score, best_grid = get_best_score(X_train, y_train, X_val, y_val, pipeline, param_grid)
print("Accuracy: {:.5f}".format(best_score))
print("Grid: {}".format(best_grid))

Accuracy: 0.92201
Grid: {'clf__activation': 'tanh', 'clf__alpha': 0.001, 'clf__max_iter': 100000, 'clf__random_state': 1, 'clf__solver': 'lbfgs'}


In [33]:
y_pred = get_accuracy(pipeline, best_grid, X_train, y_train, X_test, y_test, 
                      'just tuning on training')

Score after just tuning on training: 92.2111%


**Question 8.b**   
What was your best accuracy on the test set using a 70% train 30% holdout?  

For this question, split the rows of your training set into a subset containing 70% of the rows and another containing the remaining 30%. Train on the 70% and test on the 30%. Tune your parameters and then take the model with the best accuracy on the 30% and use it to predict the real test set. Report your accuracy. Again, prediction of the test set should only be done ONCE.

In [34]:
# updating X_train, y_train, X_val, y_val with 70/30 split
X = df_train[["exercisehours", "eyecolor", "year", "exercise", "height"]].copy()
y = df_train['gender']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=424242)
X_train = X_train.to_dict(orient='records')
X_val = X_val.to_dict(orient='records')

# final parameter grid for 8.b. the initial one is included above for reference
param_grid = [{
        'clf__hidden_layer_sizes': [(5,),(4,),(3,)],
        'clf__activation': ['tanh', 'relu'],
        'clf__alpha':[0.001, 0.1],
        'clf__solver':['lbfgs'],
        'clf__max_iter':[300, 500, 800],
        'clf__random_state':[1]
    }]

In [35]:
best_score, best_grid = get_best_score(X_train, y_train, X_val, y_val, pipeline, param_grid)
print("Accuracy: {:.5f}".format(best_score))
print("Grid: {}".format(best_grid))

Accuracy: 0.86373
Grid: {'clf__alpha': 0.1, 'clf__max_iter': 500, 'clf__solver': 'lbfgs', 'clf__activation': 'tanh', 'clf__random_state': 1, 'clf__hidden_layer_sizes': (4,)}


In [36]:
y_pred = get_accuracy(pipeline, best_grid, X_train, y_train, X_test, y_test, 
                      'tuning on 30% validation')

Score after tuning on 30% validation: 86.1809%


**Question 8.c**  
Finally, what was your best accuracy on the test set using 5-fold cross validation?  

**8.c.i** For this question, use 5-fold cross-validation of the training set in order to tune your parameters and find the best model. After finding the tuning parameter values that give you the best cross-validated accuracy, train the model on the entire training set using those parameter values then use this trained model to predict the real test set and report your accuracy.  

In [37]:
# resetting X_train, y_train to the full training set for 5-fold cross-validation
X_train = df_train[["exercisehours", "eyecolor", "year", "exercise", "height"]].copy().to_dict(orient='records')
y_train = df_train['gender']

# final parameter grid for 8.c. the initial one is included above for reference
param_grid = [{
        'clf__hidden_layer_sizes': [(4,),(10,)],
        'clf__activation': ['tanh', 'relu'],
        'clf__alpha':[0.15, 0.1, 0.2],
        'clf__max_iter':[500, 1000],
        'clf__random_state':[1]
    }]

In [38]:
grid_search = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=5)
grid_search.fit(X_train, y_train)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
print(best_parameters['clf'])

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   14.3s


Best score: 0.850
Best parameters set:
MLPClassifier(activation='relu', alpha=0.1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(4,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)


[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   37.7s finished


In [39]:
y_pred = get_accuracy(pipeline, best_parameters, X_train, y_train, X_test, y_test, 
                      '5-fold validation')

Score after 5-fold validation: 86.4322%


**8.c.ii** Please submit a file called test_set_prediction.txt that includes the predictions (between 0 and 1) of the test set. The predictions should be in the same order as the original test set file.  

Please also submit a file called cv_predictions.txt that includes the predictions (between 0 and 1) of the cross-validated training set. The predictions should be in the same order as the original training set file. These two files will help demonstrate the value of ensemble prediction in the next lab.  

NOTE: See documentation for cross-validation strategies in Python scikit-learn. The cross_val_score function is the simplest to use for this assignment: it takes as inputs the model, features, labels, and number of folds and outputs an array of accuracies over each fold. The KFold iterator function produces indices for training and testing sets, and is good to use when you want to examine custom error metrics.

In [40]:
y_pred_cvtrain = get_accuracy(pipeline, best_parameters, X_train, y_train, X_train, y_train, 
                              '5-fold validation on training')
pd.DataFrame(y_pred).to_csv("test_set_prediction.txt")
pd.DataFrame(y_pred_cvtrain).to_csv("cv_predictions.txt")

Score after 5-fold validation on training: 85.4088%
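
The cell above refits on the full training set and predicts that same set, so those predictions are in-sample. If out-of-fold predictions between 0 and 1 are wanted instead, one option is `cross_val_predict` with `method='predict_proba'`. This is a sketch on synthetic data with an illustrative pipeline (`toy_pipeline`), not a rerun of our model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed, not the lab data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('clf', MLPClassifier(hidden_layer_sizes=(4,), solver='lbfgs',
                          max_iter=500, random_state=1)),
])

# Each row is predicted by a model trained on the four folds that exclude it.
# Column 1 of predict_proba is the estimated probability of class 1.
cv_proba = cross_val_predict(toy_pipeline, X_toy, y_toy, cv=5,
                             method='predict_proba')[:, 1]
cv_accuracy = np.mean((cv_proba >= 0.5).astype(int) == y_toy)
print(cv_proba[:5], cv_accuracy)
```

The returned array keeps the original row order, so it could be written to a file such as cv_predictions.txt directly.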


**Question 8.d**   
Please describe which hold out strategy resulted in highest accuracy on the real test set and the tuning parameter values you used to achieve your high score. Please write up how you went about trying to achieve your accuracy improvement. How many tuning parameter combinations do you use, did you methodologically sweep a range of parameters with some increment? Which changes had the most impact on accuracy improvement? Did you reduce the size of the training set to save computation time? What was the impact of this reduction on accuracy?

**Answer**   
Using the training set itself as the holdout gave us suspiciously good results, but on closer inspection this wasn't that surprising given the significant overlap between the train and test sets. We would argue that the cross-validation results would be more robust to unseen data sets. 

While tuning hyper-parameters for the MLP classifier, we tested many combinations. We started with the architecture of the net: we experimented with single and multiple hidden layers and varied the width of each layer. We found that 4 units in a single hidden layer was the sweet spot. We then swept the other hyper-parameters, starting with large intervals and subsequently zooming in on the best intervals to find the best values. We tried several values for alpha, but most of the tweaking involved the activation functions and the solvers. Interestingly, adam, the default solver, never looked like a candidate in any of our solutions.

Apart from the architecture, the number of iterations was what most affected training time for these nets. Especially for part 8.a, performance increased greatly with more iterations when using tanh as the activation function. We always checked whether the best-scoring value sat at the edge of the list we had entered, and extended the list gradually if that was the case. Although raising the number of iterations was time consuming, we had access to compute resources that meant we did not have to stop short of increasing a value further during validation. Hopefully, we did not compromise on accuracy for the sake of saving time during training.