# SVM
    SVM (Support Vector Machine) is a supervised machine learning algorithm used for classification and regression problems.
    It works by finding a hyperplane (a decision boundary that separates the data) in an N-dimensional space, where N is the number of features. Its goal is to find the hyperplane with the maximum margin, that is, the largest distance to the closest data points of each class. Those closest points are called support vectors, and they alone define the boundary that divides the data into 2 classes. The hyperplane has one dimension fewer than the feature space (a line for 2 features, a plane for 3), and it is found by solving an optimization problem rather than by trying hyperplanes at random.
    
    The best hyperplane is the one that maximizes the margin between the boundary and the closest data points.
    
   <img src="images/best_hyperplane.png" />
   
    The image shows several candidate hyperplanes. SVM chooses the one that maximizes the margin.
    
    To maximize the margin between the data points and the hyperplane, SVM minimizes the hinge loss:
   <img src="images/hinge_loss_1.png" />
    
    The loss is 0 when the predicted value has the same sign as the real value and lies outside the margin (y·f(x) ≥ 1). Otherwise, the loss grows linearly as 1 − y·f(x):
   <img src="images/loss.png" />
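
    As a minimal sketch of the hinge loss described above, assuming labels encoded as -1/+1 and a raw decision score f(x). The function name `hinge_loss` and the sample values are illustrative and do not come from this notebook:

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Per-sample hinge loss max(0, 1 - y * f(x)) for labels in {-1, +1}."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.maximum(0.0, 1.0 - y_true * scores)

# Illustrative labels and decision scores (assumed values).
y_true = [1, 1, -1, -1]
scores = [2.0, 0.5, -3.0, 0.2]

losses = hinge_loss(y_true, scores)
print(losses)  # the 1st and 3rd samples are correct and outside the margin, so their loss is 0
```

    The second sample is on the correct side but inside the margin, so it still receives a small loss. The fourth sample is on the wrong side and receives the largest loss.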
    
   #### But, what happens when the data cannot be separated by a line?
   
    When the data is not linearly separable, SVM relies on two techniques, Soft Margin and the Kernel Trick:
    
    1 SOFT MARGIN: the model looks for a line that separates the data while tolerating some misclassified points. The degree of tolerance is controlled by the penalty term C. The larger C is, the more each misclassification is penalized, so a lower C tolerates more misclassified data.
    
    2 KERNEL TRICK: the model finds a non-linear decision boundary.
    The kernel trick implicitly maps the existing features into a new feature space, where a linear boundary corresponds to a non-linear boundary in the original space. In sklearn, the kernel options are linear, poly, rbf, sigmoid, precomputed, or a callable.
    
   <img src="images/kernel_boundaries.png" />
   
   - POLYNOMIAL KERNEL: generates new features from polynomial combinations of the existing features, for example X^2.
    
   - RBF KERNEL: (Radial Basis Function) generates new features by measuring the distance between each data point and one or more specific points (centers).
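
    To make the idea of "new features" concrete, here is a small sketch on made-up 1-D data. The feature x^2 is built by hand purely for illustration; a kernel achieves a comparable effect without constructing the new features explicitly. The data and variable names are assumptions, not part of this notebook:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D data (made up): class 1 lies on both sides of class 0,
# so no single threshold on x can separate the two classes.
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = np.array([1, 1, 0, 0, 0, 1, 1])

X_raw = x.reshape(-1, 1)                  # original feature only
X_expanded = np.column_stack([x, x**2])   # add the polynomial feature x^2

# Fit a linear SVM on each representation and score it on the same points.
acc_raw = SVC(kernel='linear').fit(X_raw, y).score(X_raw, y)
acc_expanded = SVC(kernel='linear').fit(X_expanded, y).score(X_expanded, y)

print('Linear SVM on x:        ', acc_raw)
print('Linear SVM on (x, x^2): ', acc_expanded)
```

    In the expanded space, a horizontal line between x^2 = 1 and x^2 = 4 separates the classes, which corresponds to a non-linear boundary on the original axis.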
    
    
    For multiclass problems, SVM applies a one-vs-one approach, fitting a hyperplane for each pair of classes.
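
    As a rough illustration of the pairwise approach, the sketch below fits sklearn's `SVC` on synthetic 4-class data and counts the pairwise decision values it exposes when `decision_function_shape='ovo'` is set. The data is generated with `make_blobs` and is an assumption made for this example:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with 4 classes, used only for illustration.
X_demo, y_demo = make_blobs(n_samples=200, centers=4, random_state=0)
n_classes = len(set(y_demo))

# 'ovo' exposes one decision value per pair of classes.
clf = SVC(kernel='linear', decision_function_shape='ovo').fit(X_demo, y_demo)
n_pairwise = clf.decision_function(X_demo).shape[1]

print('classes:', n_classes)
print('pairwise classifiers:', n_pairwise)  # n_classes * (n_classes - 1) / 2
```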
    
    SVM, like other algorithms, works with numeric features, so we need to apply one-hot encoding for categorical data.

    SVM often achieves high accuracy in classification problems and has many applications, such as face detection and email classification.
    

#### SVM parameters
    - kernel defines the type of function used to transform the dataset
        Transforming the dataset matters because it can make the data linearly separable in the new space. The kernel measures the relation between pairs of points; the available functions include linear, polynomial, radial basis function (RBF) and sigmoid.
        RBF is the most common choice (it is the default in sklearn), since it can model complex boundaries without computing the transformed features explicitly.
    - C defines the regularization of error
        This parameter regulates how soft the margin is. The smaller C is, the wider the margin, which tolerates more misclassified training points. The larger C is, the narrower the margin becomes, which leads to fewer misclassified training points. However, if this value is too high, OVERFITTING takes place (as said before, a lower C means more tolerance to misclassified data).
    - gamma defines how closely the model fits the training data
        Gamma defines how far the influence of each training sample reaches, so a low gamma value leads to a smoother, more general model and a high gamma value can lead to OVERFITTING.
  <img src="images/gamma_influence.png" />
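
    The effect of gamma can be explored with a small experiment. The sketch below uses synthetic two-class data from `make_moons` (an assumption, not this notebook's dataset), keeps C fixed at 1.0, and compares training and test accuracy for three gamma values:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, used only to illustrate the effect of gamma.
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for gamma in [0.01, 1, 100]:
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each gamma.
    results[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f'gamma={gamma}: train={results[gamma][0]:.3f}, test={results[gamma][1]:.3f}')
```

    A large gap between training and test accuracy at a high gamma is the usual sign of overfitting.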

    With this in mind, we will build our SVM model for the Nivel de Adaptabilidad (adaptability level) dataset:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import keras
import tensorflow as tf

In [2]:
dataset = pd.read_csv('training-ds.csv', encoding='utf-8')
print(type(dataset))

<class 'pandas.core.frame.DataFrame'>


In [3]:
dataset[:10]

Unnamed: 0,Tipo de Red,Estudiante de Tecnología,Nivel de Educación,Vive en Ciudad,Tipo de Instituto,Edad,Dispositivo,Tipo de Internet,Situación Financiera,Género,Duración de la Clase,Nivel de Adaptación
0,3G,Si,Universidad,Si,Privado,21-25,Computadora,Wifi,Media,Masculino,03-Jun,Bajo
1,3G,No,Escuela,Si,Privado,Nov-15,Smartphone,Compra Megas,Media,Femenino,01-Mar,Moderado
2,3G,Si,Universidad,Si,Privado,21-25,Smartphone,Compra Megas,Mala,Masculino,01-Mar,Bajo
3,3G,Si,Escuela,Si,Privado,Nov-15,Smartphone,Compra Megas,Media,Masculino,01-Mar,Moderado
4,4G,Si,Universidad,Si,Privado,21-25,Computadora,Wifi,Buena,Masculino,01-Mar,Alto
5,3G,No,Escuela,No,Público,Nov-15,Smartphone,Compra Megas,Media,Masculino,0,Bajo
6,4G,Si,Escuela,Si,Privado,Nov-15,Smartphone,Compra Megas,Media,Masculino,01-Mar,Moderado
7,3G,No,Escuela,Si,Privado,Nov-15,Smartphone,Compra Megas,Media,Masculino,01-Mar,Moderado
8,4G,No,Universidad,Si,Privado,21-25,Smartphone,Compra Megas,Mala,Masculino,01-Mar,Alto
9,4G,No,Escuela,Si,Público,16-20,Smartphone,Wifi,Media,Masculino,01-Mar,Moderado


   #### One-hot encoding
       Like a regular neural network, SVM requires numerical data, so we will one-hot encode our columns:

In [4]:
dataset_one_hot_encoded = pd.get_dummies(dataset)  
dataset_one_hot_encoded

Unnamed: 0,Tipo de Red_2G,Tipo de Red_3G,Tipo de Red_4G,Estudiante de Tecnología_No,Estudiante de Tecnología_Si,Nivel de Educación_Colegio,Nivel de Educación_Escuela,Nivel de Educación_Universidad,Vive en Ciudad_No,Vive en Ciudad_Si,...,Situación Financiera_Mala,Situación Financiera_Media,Género_Femenino,Género_Masculino,Duración de la Clase_0,Duración de la Clase_01-Mar,Duración de la Clase_03-Jun,Nivel de Adaptación_Alto,Nivel de Adaptación_Bajo,Nivel de Adaptación_Moderado
0,0,1,0,0,1,0,0,1,0,1,...,0,1,0,1,0,0,1,0,1,0
1,0,1,0,1,0,0,1,0,0,1,...,0,1,1,0,0,1,0,0,0,1
2,0,1,0,0,1,0,0,1,0,1,...,1,0,0,1,0,1,0,0,1,0
3,0,1,0,0,1,0,1,0,0,1,...,0,1,0,1,0,1,0,0,0,1
4,0,0,1,0,1,0,0,1,0,1,...,0,0,0,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
959,0,1,0,1,0,0,0,1,0,1,...,0,1,0,1,0,1,0,0,0,1
960,0,1,0,1,0,0,1,0,0,1,...,0,1,1,0,0,1,0,0,0,1
961,0,0,1,1,0,0,0,1,0,1,...,0,1,0,1,1,0,0,0,1,0
962,0,1,0,1,0,0,0,1,0,1,...,0,1,0,1,0,1,0,0,0,1
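
    To see what `pd.get_dummies` does on a smaller scale, here is a sketch on a made-up three-row frame. The rows are illustrative and only reuse two column names from the dataset. `.astype(int)` is added so the result prints as 0/1 regardless of the pandas version:

```python
import pandas as pd

# Tiny illustrative frame (made-up rows).
toy = pd.DataFrame({
    'Tipo de Red': ['3G', '4G', '3G'],
    'Dispositivo': ['Computadora', 'Smartphone', 'Smartphone'],
})

# Each categorical column becomes one 0/1 column per category.
encoded = pd.get_dummies(toy).astype(int)
print(encoded)
```

    Every row has exactly one 1 per original column, which is why the encoded frame has many more columns than the source data.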


#### Divide dataset into X and y
    The first 31 columns represent X and the last 3 represent y:

In [5]:
X = dataset_one_hot_encoded.iloc[:,:31]
X

Unnamed: 0,Tipo de Red_2G,Tipo de Red_3G,Tipo de Red_4G,Estudiante de Tecnología_No,Estudiante de Tecnología_Si,Nivel de Educación_Colegio,Nivel de Educación_Escuela,Nivel de Educación_Universidad,Vive en Ciudad_No,Vive en Ciudad_Si,...,Tipo de Internet_Compra Megas,Tipo de Internet_Wifi,Situación Financiera_Buena,Situación Financiera_Mala,Situación Financiera_Media,Género_Femenino,Género_Masculino,Duración de la Clase_0,Duración de la Clase_01-Mar,Duración de la Clase_03-Jun
0,0,1,0,0,1,0,0,1,0,1,...,0,1,0,0,1,0,1,0,0,1
1,0,1,0,1,0,0,1,0,0,1,...,1,0,0,0,1,1,0,0,1,0
2,0,1,0,0,1,0,0,1,0,1,...,1,0,0,1,0,0,1,0,1,0
3,0,1,0,0,1,0,1,0,0,1,...,1,0,0,0,1,0,1,0,1,0
4,0,0,1,0,1,0,0,1,0,1,...,0,1,1,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
959,0,1,0,1,0,0,0,1,0,1,...,1,0,0,0,1,0,1,0,1,0
960,0,1,0,1,0,0,1,0,0,1,...,1,0,0,0,1,1,0,0,1,0
961,0,0,1,1,0,0,0,1,0,1,...,1,0,0,0,1,0,1,1,0,0
962,0,1,0,1,0,0,0,1,0,1,...,1,0,0,0,1,0,1,0,1,0


In [6]:
y = dataset_one_hot_encoded.iloc[:,31:]
y

Unnamed: 0,Nivel de Adaptación_Alto,Nivel de Adaptación_Bajo,Nivel de Adaptación_Moderado
0,0,1,0
1,0,0,1
2,0,1,0
3,0,0,1
4,1,0,0
...,...,...,...
959,0,0,1
960,0,0,1
961,0,1,0
962,0,0,1
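
    Splitting by position works as long as the column count does not change. An alternative sketch, assuming the target columns always share the prefix `Nivel de Adaptación_`, selects them by name. The toy frame and the variable names below are hypothetical:

```python
import pandas as pd

# Small stand-in for the one-hot encoded dataset (made-up values).
toy = pd.DataFrame({
    'Tipo de Red_3G': [1, 0, 1],
    'Tipo de Red_4G': [0, 1, 0],
    'Nivel de Adaptación_Alto': [0, 1, 0],
    'Nivel de Adaptación_Bajo': [1, 0, 0],
    'Nivel de Adaptación_Moderado': [0, 0, 1],
})

# Pick the target columns by prefix instead of by position.
target_prefix = 'Nivel de Adaptación_'
target_cols = [c for c in toy.columns if c.startswith(target_prefix)]

X_toy = toy.drop(columns=target_cols)
y_toy = toy[target_cols]
print(X_toy.shape, y_toy.shape)
```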


#### Splitting data into training and testing
    We will use 80% of the data for training and 20% for testing:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(771, 31) (771, 3)
(193, 31) (193, 3)


In [8]:
# import sklearn
# sklearn.svm.SVC(*, 
#     C=1.0,                          # The regularization parameter
#     kernel='rbf',                   # The kernel type used 
#     degree=3,                       # Degree of polynomial function 
#     gamma='scale',                  # The kernel coefficient
#     coef0=0.0,                      # If kernel = 'poly'/'sigmoid'
#     shrinking=True,                 # To use shrinking heuristic
#     probability=False,              # Enable probability estimates
#     tol=0.001,                      # Stopping criterion
#     cache_size=200,                 # Size of kernel cache
#     class_weight=None,              # The weight of each class
#     verbose=False,                  # Enable verbose output
#     max_iter=-1,                    # Hard limit on iterations (-1 = no limit)
#     decision_function_shape='ovr',  # Shape of decision function: one-vs-rest or one-vs-one
#     break_ties=False,               # How to handle breaking ties
#     random_state=None               # Random state of the model
# )

#### Defining our model:
    Sklearn provides the SVC class for training an SVM classifier.
    We will start with a linear kernel:

In [9]:
from sklearn.svm import SVC
model = SVC(kernel='linear') 
model.fit(X_train, y_train)

ValueError: y should be a 1d array, got an array of shape (771, 3) instead.

    When we pass y as one-hot encoded columns, we get an error. SVC expects a 1-dimensional target, so we will transform it:
    each row is replaced by the position of the column that holds the 1. So if the one-hot encoding for a row was
        Nivel de Adaptación_Alto   |    Nivel de Adaptación_Bajo   |    Nivel de Adaptación_Moderado
                   0               |                1              |                 0
    , the row becomes a single number (0 = Alto, 1 = Bajo, 2 = Moderado):
        Nivel de Adaptación
               1

In [10]:
print("BEFORE-----")
print(y_train.shape)
print(y_test.shape)
print(type(y_train), type(y_test))
print(y_train.iloc[0,:])
print(y_train.iloc[1,:])
print(y_train.iloc[5,:])
y_train = y_train.apply(lambda x: x.argmax(), axis=1).values
print(y_train.shape)
y_test = y_test.apply(lambda x: x.argmax(), axis=1).values
print("AFTER------")
print(y_test.shape)
print(y_train[0])
print(y_train[1])
print(y_train[5])

BEFORE-----
(771, 3)
(193, 3)
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
Nivel de Adaptación_Alto        1
Nivel de Adaptación_Bajo        0
Nivel de Adaptación_Moderado    0
Name: 730, dtype: uint8
Nivel de Adaptación_Alto        0
Nivel de Adaptación_Bajo        1
Nivel de Adaptación_Moderado    0
Name: 223, dtype: uint8
Nivel de Adaptación_Alto        0
Nivel de Adaptación_Bajo        0
Nivel de Adaptación_Moderado    1
Name: 955, dtype: uint8
(771,)
AFTER------
(193,)
0
1
2
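
    The same conversion can also be written without a row-wise `apply`. The sketch below uses a made-up three-row target frame and shows both the integer labels and the class names they stand for. The variable names are illustrative:

```python
import pandas as pd

# Made-up one-hot target with the same column order as in this notebook.
y_toy = pd.DataFrame({
    'Nivel de Adaptación_Alto':     [0, 0, 1],
    'Nivel de Adaptación_Bajo':     [1, 0, 0],
    'Nivel de Adaptación_Moderado': [0, 1, 0],
})

# Position of the 1 in each row: 0 = Alto, 1 = Bajo, 2 = Moderado.
labels = y_toy.to_numpy().argmax(axis=1)

# Column name of the 1 in each row, with the shared prefix removed.
names = y_toy.idxmax(axis=1).str.replace('Nivel de Adaptación_', '', regex=False)

print(labels)
print(names.tolist())
```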


In [11]:
from sklearn.svm import SVC
model = SVC(kernel='linear') 
model.fit(X_train, y_train)

SVC(kernel='linear')

In [12]:
predictions = model.predict(X_test)
print(predictions[:5])

[2 2 1 1 1]


In [13]:
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.6511024643320363
Testing accuracy:  0.6735751295336787


    With a linear kernel, we got 67% test accuracy.
    Now we will try a polynomial kernel:

In [14]:
model = SVC(kernel='poly', degree=2) 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.7976653696498055
Testing accuracy:  0.7357512953367875


In [15]:
model = SVC(kernel='poly', degree=3) 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.8599221789883269
Testing accuracy:  0.7616580310880829


In [16]:
model = SVC(kernel='poly', degree=4) 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.8949416342412452
Testing accuracy:  0.8082901554404145


In [17]:
model = SVC(kernel='poly', degree=5) 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.9027237354085603
Testing accuracy:  0.8134715025906736
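
    The four cells above differ only in the `degree` value, so the same experiment can be written as a loop. The sketch below runs on synthetic data from `make_classification` (an assumption, chosen so the block runs on its own), which means its numbers are not comparable to the results in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data, used only so the loop is runnable on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

scores = {}
for degree in [2, 3, 4, 5]:
    clf = SVC(kernel='poly', degree=degree).fit(X_tr, y_tr)
    # Keep test accuracy per degree so the values can be compared afterwards.
    scores[degree] = clf.score(X_te, y_te)
    print(f'degree={degree}: train={clf.score(X_tr, y_tr):.3f}, test={scores[degree]:.3f}')

best_degree = max(scores, key=scores.get)
print('Best degree on this synthetic data:', best_degree)
```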


    With the polynomial kernel, we reached 81% test accuracy at degree 5.
    For our third experiment, we will try the rbf kernel:

In [18]:
model = SVC(kernel='rbf') 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.8352788586251622
Testing accuracy:  0.7668393782383419


    With default parameters, the RBF kernel gave a slightly lower test accuracy (77%) than the degree-5 polynomial kernel (81%).
    We will see if tuning the SVM parameters helps:

In [19]:
model = SVC(kernel='rbf', C=0.1) 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.6783398184176395
Testing accuracy:  0.6735751295336787


    A low C means that the model tolerates more misclassifications. A value of 0.1 caused underfitting in our model.

In [27]:
model = SVC(kernel='rbf', C=10) 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.9027237354085603
Testing accuracy:  0.8186528497409327


    A larger C value (10) improved our test accuracy to 82%.

#### GridSearchCV

In [42]:
from sklearn.model_selection import GridSearchCV
from sklearn import svm
# Create an SVM classifier for hyperparameter tuning
ml = svm.SVC()

# Define the parameter ranges to search
param_grid = {'C': [0.1, 1, 10, 100, 1000, 10000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(ml, param_grid, refit=True, verbose=1, cv=15)

# Fit the grid search on the training split so the test split stays unseen.
# (The recorded outputs in this notebook came from fitting on X_test, y_test.)
grid_search = grid.fit(X_train, y_train)

Fitting 15 folds for each of 30 candidates, totalling 450 fits


In [43]:
print(grid_search.best_params_)

{'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}


In [44]:
model = SVC(kernel='rbf', C=100, gamma=0.01) 
model.fit(X_train, y_train)
print('Training accuracy: ',model.score(X_train, y_train))
print('Testing accuracy: ',model.score(X_test, y_test))

Training accuracy:  0.8041504539559015
Testing accuracy:  0.7461139896373057
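
    A common way to organise a grid search is to run it on the training split only and score the selected model once on the held-out test split. The sketch below shows that workflow on synthetic data from `make_classification` with a reduced grid; both are assumptions made to keep the example fast and self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Reduced grid to keep the example fast.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['rbf']}

# Cross-validation happens inside the training split only.
search = GridSearchCV(SVC(), param_grid, cv=5, refit=True)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Cross-validated accuracy:', search.best_score_)

# With refit=True, score() uses the best model refitted on the whole training split.
test_acc = search.score(X_te, y_te)
print('Held-out test accuracy:', test_acc)
```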


#### References:
https://datagy.io/python-support-vector-machines/

https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47

https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496