## Introduction

Support Vector Machine (SVM) is another well-known classification algorithm. The theory is explained well [here](https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72). It is similar to Logistic Regression in the sense that both try to find the best linear boundary to separate the classes. The difference is that SVM looks for the boundary that is as far as possible from the closest data points of each class (the maximum margin), while Logistic Regression estimates the posterior class probability. That means that while SVM natively produces a class label, LR produces probabilities.
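To make the label-versus-probability distinction concrete, here is a minimal sketch on made-up one-dimensional data (the arrays `X_toy`, `y_toy` and `X_new` are assumptions for illustration, not part of the Titanic data). It only shows which kind of output each model exposes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy 1-D data: two well-separated groups (invented for this illustration)
X_toy = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm_clf = SVC(kernel="linear").fit(X_toy, y_toy)
lr_clf = LogisticRegression().fit(X_toy, y_toy)

X_new = np.array([[-0.5], [0.5]])
svm_labels = svm_clf.predict(X_new)            # hard class labels
svm_scores = svm_clf.decision_function(X_new)  # signed scores relative to the hyperplane, not probabilities
lr_probs = lr_clf.predict_proba(X_new)         # one probability per class, each row sums to 1

print(svm_labels, svm_scores, lr_probs)
```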

There is a fundamental difference in the way these models work. Logistic Regression is a <u>parametric model</u>. That means that the mapping from input X to output Y is governed by a vector w of adaptive parameters. During the learning phase, a set of training data is used to estimate these parameters. Predictions for new inputs are then made purely from the learned parameter vector.

SVM belongs to another group of learning algorithms called <u>Kernel Methods</u>. The basic idea here is to find a transformation of the data (expressed through a kernel function) under which the classes become separable. For instance, if the data is not linearly separable, logistic regression will not perform well, but an SVM with a non-linear kernel (such as RBF, the Radial Basis Function) might be able to do the work. RBF is a very popular kernel for SVM and is the default kernel of scikit-learn's SVC.
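As a rough illustration of why the kernel matters, the sketch below compares a linear and an RBF kernel on synthetic concentric circles, a classic case that is not linearly separable. The dataset and its parameters are assumptions chosen for the demo and have nothing to do with the Titanic data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them
X_circ, y_circ = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

# Mean 5-fold cross-validation accuracy for each kernel
linear_acc = cross_val_score(SVC(kernel="linear"), X_circ, y_circ, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X_circ, y_circ, cv=5).mean()

print(f"linear: {linear_acc:.3f}, rbf: {rbf_acc:.3f}")
```

On data like this the RBF kernel should score far higher than the linear one, because it can draw a closed boundary around the inner ring.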



In [48]:
# Data analysis
import numpy as np
import pandas as pd
import random as rnd
from statistics import mean
import math

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Support Vector Classifier
from sklearn.svm import SVC

# KFold Support
from sklearn.model_selection import KFold

In [49]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

I have tried to do as little feature engineering as possible in this case, since SVM does not rely on an independence assumption between the features.

In [50]:
# Removing PassengerId
train_df = train_df.drop('PassengerId', axis=1)

# Converting 'Name': keep only the title (second token) and group rare titles as 'Other'
name_df = train_df.Name.str.split(expand = True)
train_df['Name'] = name_df[1]
train_df.loc[~train_df['Name'].isin(['Mr.', 'Mrs.', 'Miss.', 'Master.']), 'Name'] = 'Other'

# Converting 'Ticket'
train_df['Ticket'] = train_df['Ticket'].map(train_df['Ticket'].value_counts()).astype('int64')

# Converting 'Cabin'
train_df['Cabin'] = train_df['Cabin'].notnull().astype('int64')

# Remove rows with missing 'Embarked' values
train_df = train_df[train_df.Embarked.notnull()]
# Fill the remaining missing values (in this data, the 'Age' column) with the mean age
train_df = train_df.fillna(train_df.Age.mean())

# Encoding 'Sex' categorical data into a numeric column
# (the result of astype must be assigned back, otherwise the conversion is lost)
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1}).astype('int64')
# Encoding 'Embarked' and 'Name' categorical data using the 'One Hot' method - get_dummies in pandas
train_df = pd.get_dummies(train_df, columns=['Embarked'], prefix = ['embarked'])
train_df = pd.get_dummies(train_df, columns=['Name'], prefix = ['name'])


In [51]:
# Splitting into features and dependent variable
# (axis is passed by keyword; recent pandas versions no longer accept it positionally)
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# Scaling Features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
X = pd.DataFrame(scaler.fit_transform(X))
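# A side note on the scaling step above: fitting the scaler on all rows means
# that, during cross-validation, the held-out fold has already influenced the
# scaling. A hedged alternative is to put the scaler and the classifier into a
# scikit-learn pipeline, so the scaler is refit on the training part of each fold.
# The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions for
# illustration), not the Titanic frame.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler is fitted inside each fold, only on that fold's training rows
pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(scores.mean())
```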


### SVM Recipe

The following [article](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) describes a simple recipe for running SVM in a way that can improve our results. Here are the major steps of the recipe:

1. Transform the data to the format of an SVM package.
2. Conduct simple scaling of the data.
3. Consider the RBF Kernel: $K(x, y) = e^{-\gamma \|x - y\|^2}$
4. Use cross-validation to find the best parameters C and γ
5. Use the best parameters C and γ to train on the whole training set
6. Test
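
The RBF kernel from step 3 can be checked by hand. The sketch below evaluates the formula for two made-up vectors `a` and `b` with an arbitrary γ (all three are assumptions for illustration) and compares the result with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])
gamma = 0.5

# ||a - b||^2 = 1^2 + 2^2 = 5, so K(a, b) = exp(-0.5 * 5) = exp(-2.5)
k_manual = np.exp(-gamma * np.sum((a - b) ** 2))

# The same value from scikit-learn (it expects 2-D inputs, one row per sample)
k_sklearn = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```

A larger γ makes the kernel value fall off faster with distance, which is why γ is tuned together with C in step 4.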

### Using Linear Kernel

While the recipe suggests trying the RBF kernel first, I am going to start with the linear kernel, because it is computationally faster to come up with a solution using this kernel. I am going to use the "LinearSVC" class, which is similar to using SVC with kernel="linear", except that it uses the liblinear library (which we already know from the logistic regression) to find the hyperplane. LinearSVC is computationally faster than the regular SVC.
With the linear kernel, we need to find the best value of the parameter C. As described in the article, I will perform a grid search to do so. The value of C grows exponentially, taking the values 2^i for i from -10 to 9. For each value, I ran a 10-fold cross-validation and saved the mean accuracy. The value of C with the best mean accuracy is the chosen one.




In [55]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
c_accuracy_rates = []
c_values = []
probabilities = {0: 0.676, 1: 0.324}
for i in range(-10,10):
    C_parameter = math.pow(2,i)
    c_values.append(C_parameter)
    linear_SVC = LinearSVC(C=C_parameter, class_weight=probabilities, max_iter=1000)
    # shuffle=False keeps the folds deterministic, so random_state is left unset
    # (recent scikit-learn versions raise an error if it is set without shuffling)
    cv = KFold(n_splits=10, shuffle=False)
    iteration_accuracy_rates = []
    for train_index, test_index in cv.split(X):
        # Split the rows into this fold's training part and held-out part
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        linear_SVC.fit(X_train, y_train)
        y_pred = linear_SVC.predict(X_test)
        # Fold accuracy: 1 minus the share of held-out rows that were misclassified
        diff = y_pred - y_test
        iteration_accuracy_rates.append(1.0 - (float(np.count_nonzero(diff)) / len(diff)))
    c_accuracy_rates.append(mean(iteration_accuracy_rates))    
print(c_accuracy_rates)
print(c_values)








[0.7255873340143003, 0.7582226762002042, 0.7807073544433095, 0.7953268641470889, 0.7997957099080695, 0.8144279877425945, 0.8155388151174668, 0.8121680286006129, 0.8144407558733402, 0.8144407558733402, 0.8211950970377937, 0.8200715015321757, 0.8200715015321757, 0.8189479060265576, 0.8189479060265578, 0.8122063329928498, 0.7975229826353422, 0.7964632277834526, 0.7806945863125638, 0.7155132788559755]
[0.0009765625, 0.001953125, 0.00390625, 0.0078125, 0.015625, 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]


In [56]:
print("The Max Value is: " +str(max(c_accuracy_rates)))
max_index = c_accuracy_rates.index(max(c_accuracy_rates))
print("The value of C that fits is: " +str(c_values[max_index]))

The Max Value is: 0.8211950970377937
The value of C that fits is: 1.0


### Summary

We can see some improvement using the linear support vector machine. An important thing to notice is the warning that the algorithm failed to converge within 1000 iterations (the max_iter value set above, which is also the scikit-learn default for LinearSVC). This may indicate that the data is <b>not linearly separable</b>, although a convergence warning alone does not prove it.
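
To see which settings trigger the convergence warning, one option is to capture it programmatically. The sketch below is a minimal example on synthetic data. The helper name `converged` and the data are assumptions for illustration, while `ConvergenceWarning` is the warning class scikit-learn emits in this situation.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

def converged(C, max_iter):
    """Fit a LinearSVC and report whether it finished without a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ConvergenceWarning)
        LinearSVC(C=C, max_iter=max_iter).fit(X_demo, y_demo)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)

# A single iteration is never enough, so this reports False
print(converged(C=1.0, max_iter=1))
```

Running such a check for each value of C in the grid would show whether the warning comes only from the largest values of C, where the solver typically needs more iterations.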

### RBF Kernel

In [64]:

c__rbf_accuracy_rates = []
c_rbf_values = []
probabilities = {0: 0.676, 1: 0.324}
for i in range(-10,10):
    C_parameter = math.pow(2,i)
    c_rbf_values.append(C_parameter)
    rbf_SVC = SVC(kernel="rbf", C=C_parameter, class_weight=probabilities)
    # shuffle=False keeps the folds deterministic, so random_state is left unset
    # (recent scikit-learn versions raise an error if it is set without shuffling)
    cv = KFold(n_splits=10, shuffle=False)
    iteration_accuracy_rates = []
    for train_index, test_index in cv.split(X):
        # Split the rows into this fold's training part and held-out part
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        rbf_SVC.fit(X_train, y_train)
        y_pred = rbf_SVC.predict(X_test)
        # Fold accuracy: 1 minus the share of held-out rows that were misclassified
        diff = y_pred - y_test
        iteration_accuracy_rates.append(1.0 - (float(np.count_nonzero(diff)) / len(diff)))
    c__rbf_accuracy_rates.append(mean(iteration_accuracy_rates))    
print(c__rbf_accuracy_rates)
print(c_rbf_values)
















[0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.7334652706843718, 0.7627170582226762, 0.7851889683350357, 0.807711950970378, 0.8155643513789581, 0.8189351378958121, 0.822305924412666, 0.8200459652706844, 0.8200331971399387, 0.8189223697650664, 0.8222931562819203, 0.8177987742594485, 0.8211695607763023]
[0.0009765625, 0.001953125, 0.00390625, 0.0078125, 0.015625, 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]


In [65]:
print("The Max Value is: " +str(max(c__rbf_accuracy_rates)))
max_index = c__rbf_accuracy_rates.index(max(c__rbf_accuracy_rates))
print("The value of C that fits is: " +str(c_rbf_values[max_index]))

The Max Value is: 0.822305924412666
The value of C that fits is: 8.0
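
The recipe calls for tuning both C and γ, whereas the run above varies only C and leaves γ at the library default. A possible way to search both at once is scikit-learn's `GridSearchCV`. The sketch below uses synthetic data, and the exponent ranges are assumptions chosen to keep the grid small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic stand-in for the scaled feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Exponentially spaced candidates, in the spirit of the recipe
param_grid = {
    "C": [2.0 ** i for i in range(-5, 6, 2)],
    "gamma": [2.0 ** i for i in range(-7, 2, 2)],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=KFold(n_splits=5), scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```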


### Poly Kernel

In [66]:
c__poly_accuracy_rates = []
c_poly_values = []
probabilities = {0: 0.676, 1: 0.324}
for i in range(-10,10):
    C_parameter = math.pow(2,i)
    c_poly_values.append(C_parameter)
    poly_SVC = SVC(kernel="poly", C=C_parameter, class_weight=probabilities)
    # shuffle=False keeps the folds deterministic, so random_state is left unset
    # (recent scikit-learn versions raise an error if it is set without shuffling)
    cv = KFold(n_splits=10, shuffle=False)
    iteration_accuracy_rates = []
    for train_index, test_index in cv.split(X):
        # Split the rows into this fold's training part and held-out part
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        poly_SVC.fit(X_train, y_train)
        y_pred = poly_SVC.predict(X_test)
        # Fold accuracy: 1 minus the share of held-out rows that were misclassified
        diff = y_pred - y_test
        iteration_accuracy_rates.append(1.0 - (float(np.count_nonzero(diff)) / len(diff)))
    c__poly_accuracy_rates.append(mean(iteration_accuracy_rates))    
print(c__poly_accuracy_rates)
print(c_poly_values)


















[0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6175689479060266, 0.6457226762002043, 0.7469611848825332, 0.7672114402451481, 0.8065628192032687, 0.8189096016343207, 0.8256639427987742, 0.8234167517875383, 0.8200459652706844, 0.8200331971399387, 0.8189096016343207, 0.8189096016343207, 0.8144279877425945, 0.8166751787538304]
[0.0009765625, 0.001953125, 0.00390625, 0.0078125, 0.015625, 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]


In [67]:
print("The Max Value is: " +str(max(c__poly_accuracy_rates)))
max_index = c__poly_accuracy_rates.index(max(c__poly_accuracy_rates))
print("The value of C that fits is: " +str(c_poly_values[max_index]))

The Max Value is: 0.8256639427987742
The value of C that fits is: 4.0
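
The three searches above repeat the same loop with a different classifier each time. One way to factor that out is a small helper built on `cross_val_score`. The function name `search_C` and the synthetic data are assumptions for illustration, and the sketch omits the class weights used in the runs above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

def search_C(make_model, X, y, exponents=range(-10, 10), n_splits=10):
    """Return the best C = 2**i and the mean CV accuracy for every candidate."""
    cv = KFold(n_splits=n_splits)
    results = {}
    for i in exponents:
        C = 2.0 ** i
        # Default scoring for classifiers is accuracy; average it over the folds
        results[C] = cross_val_score(make_model(C), X, y, cv=cv).mean()
    best_C = max(results, key=results.get)
    return best_C, results

# Synthetic stand-in for the scaled feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
best_C, results = search_C(lambda C: SVC(kernel="poly", C=C), X_demo, y_demo, exponents=range(-3, 4))

print(best_C, results[best_C])
```

Passing a different factory, for example `lambda C: SVC(kernel="rbf", C=C)`, reuses the same search for another kernel.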


### Conclusions

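Based on the results of these runs, I would submit my predictions based on the polynomial SVC with C = 4, which reached the highest mean cross-validation accuracy (about 0.826, compared with about 0.822 for the RBF kernel and about 0.821 for the linear kernel).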
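
The last two steps of the recipe (train on the whole training set, then test) could look roughly like the sketch below. It uses synthetic data split into a training and a held-out part, which is an assumption for illustration. With the real data, the test set would first need the same preprocessing as the training set, and the class weights used above are omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training and test data
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_tr)

# Train the chosen model (polynomial kernel, C = 4) on the whole training part
final_model = SVC(kernel="poly", C=4.0).fit(scaler.transform(X_tr), y_tr)
predictions = final_model.predict(scaler.transform(X_te))

print(predictions[:10])
```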