# Support Vector Machines

A Support Vector Machine (SVM) is an extension of the support vector classifier that uses **kernels** to create non-linear boundaries.

**Kernels**

A kernel is a function that quantifies the similarity between two observations. Popular kernels are:
* Linear
* Polynomial 
* Radial
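To make the idea of a kernel more concrete, here is a minimal sketch of the three kernels computed by hand for two observations. The function names, the two sample points and the default values for `degree`, `coef0` and `gamma` are assumptions chosen for illustration; scikit-learn's polynomial kernel additionally scales the dot product by `gamma`, which is left out here for simplicity.

```python
import numpy as np

def linear_kernel(a, b):
    # Plain dot product between the two observations
    return float(np.dot(a, b))

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    # (a . b + coef0) ** degree, a simplified form with gamma fixed to 1
    return float((np.dot(a, b) + coef0) ** degree)

def radial_kernel(a, b, gamma=0.5):
    # exp(-gamma * squared Euclidean distance)
    diff = np.asarray(a) - np.asarray(b)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two made-up observations with two features each
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 0.5])

k_lin = linear_kernel(x1, x2)
k_poly = polynomial_kernel(x1, x2)
k_rbf = radial_kernel(x1, x2)
print(k_lin, k_poly, k_rbf)
```

The radial kernel depends only on the distance between the two points, which is why it can produce closed, curved boundaries around groups of observations.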

# Regression Model

## View dataset

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("Movie_regression.csv", header=0)

In [3]:
df.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,3D_available,Time_taken,Twitter_hastags,Genre,Avg_age_actors,Num_multiplex,Collection
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,YES,109.6,223.84,Thriller,23,494,48000
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,NO,146.64,243.456,Drama,42,462,43200
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,NO,147.88,2022.4,Comedy,38,458,69400
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,YES,185.36,225.344,Drama,45,472,66800
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,NO,176.48,225.792,Drama,55,395,72400


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           494 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int

## Missing Value Imputation

In [5]:
# Time_taken variable has missing values

df['Time_taken'].mean() # Take the mean value of variable

157.3914979757085

In [6]:
df['Time_taken'] = df['Time_taken'].fillna(df['Time_taken'].mean())  # Assign the result back; inplace on a column selection is deprecated in recent pandas
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           506 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int
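As a small illustration of what mean imputation does, the sketch below applies the same idea to a made-up five-row column. The values are hypothetical and only meant to show that `mean()` skips the missing entries and that every gap receives the same replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
toy = pd.DataFrame({"Time_taken": [100.0, np.nan, 140.0, np.nan, 180.0]})

# mean() ignores NaN by default, so it averages the three known values
mean_value = toy["Time_taken"].mean()

# Replace every missing entry with that mean and assign the result back
toy["Time_taken"] = toy["Time_taken"].fillna(mean_value)

n_missing = int(toy["Time_taken"].isna().sum())
print(mean_value, n_missing)
```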

## Dummy Variable Creation

Dummy variables convert categorical variables into numeric variables so that regression or classification techniques can be run on them.

In [7]:
df.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,3D_available,Time_taken,Twitter_hastags,Genre,Avg_age_actors,Num_multiplex,Collection
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,YES,109.6,223.84,Thriller,23,494,48000
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,NO,146.64,243.456,Drama,42,462,43200
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,NO,147.88,2022.4,Comedy,38,458,69400
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,YES,185.36,225.344,Drama,45,472,66800
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,NO,176.48,225.792,Drama,55,395,72400


In [8]:
df = pd.get_dummies(df, columns = ['3D_available','Genre'], drop_first = True)
df.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,109.6,223.84,23,494,48000,1,0,0,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,146.64,243.456,42,462,43200,0,0,1,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,147.88,2022.4,38,458,69400,0,1,0,0
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,185.36,225.344,45,472,66800,1,0,1,0
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,176.48,225.792,55,395,72400,0,0,1,0
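The effect of `drop_first = True` is easier to see on a tiny example. The sketch below uses a made-up `Genre` column; the category "Action" is an assumption added for illustration, standing in for whichever category is dropped as the baseline.

```python
import pandas as pd

# Hypothetical genre column with four categories
toy = pd.DataFrame({"Genre": ["Action", "Comedy", "Drama", "Thriller", "Action"]})

# drop_first=True removes the first category (alphabetically), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["Genre"], drop_first=True)

dummy_columns = list(encoded.columns)
print(dummy_columns)
print(encoded)
```

A row belonging to the baseline category has all dummy columns set to zero, so no information is lost by dropping one column.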


## X-Y Split

This is where we divide our dataset into independent variables and the dependent variable.

The Collection column is our dependent variable. The other 19 columns become our independent variables.

In [9]:
X = df.loc[:, df.columns != 'Collection']   # X will have all data except for collection column
type(X)
X.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,109.6,223.84,23,494,1,0,0,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,146.64,243.456,42,462,0,0,1,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,147.88,2022.4,38,458,0,1,0,0
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,185.36,225.344,45,472,1,0,1,0
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,176.48,225.792,55,395,0,0,1,0


In [10]:
print(X.shape)  # We had 20 columns (including Collection column) and now the X variable only has 19

(506, 19)


In [11]:
Y = df['Collection']
type(Y)
Y.head()

0    48000
1    43200
2    69400
3    66800
4    72400
Name: Collection, dtype: int64

In [12]:
Y.shape

(506,)
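An equivalent way to write the same split is with `DataFrame.drop`. This is only a sketch on a made-up three-row frame; the column values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with two predictors and the target column
toy = pd.DataFrame({
    "Budget": [1.0, 2.0, 3.0],
    "Trailer_views": [10, 20, 30],
    "Collection": [100, 200, 300],
})

X_toy = toy.drop(columns=["Collection"])  # every column except the target
y_toy = toy["Collection"]                 # the target as a Series

print(X_toy.shape, y_toy.shape)
```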

## Test-Train split

We do not train the model on the complete dataset, because we need data the model has never seen to evaluate it. So we split the data into train and test sets.

In [13]:
from sklearn.model_selection import train_test_split
# Keeping random_state fixed makes the split reproducible: the same rows land in train and test on every run.
# test_size = 0.2 splits the data into 80% train and 20% test.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [14]:
x_train.head()  # Prints the first 5 rows of the train data; you can see the indexes are not in order because the rows were shuffled.

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
220,27.1618,67.4,0.493,38612.805,162.0,8.485,8.64,8.485,8.67,8.52,480270,174.68,224.272,23,536,0,0,0,1
71,23.1752,76.62,0.587,33113.355,91.0,7.28,7.4,7.29,7.455,8.16,491978,200.68,263.472,46,400,0,0,0,0
240,22.2658,64.86,0.572,38312.835,127.8,6.755,6.935,6.8,6.84,8.68,470107,204.8,224.32,24,387,1,1,0,0
6,21.7658,70.74,0.476,33396.66,140.1,7.065,7.265,7.15,7.4,8.96,459241,139.16,243.664,41,522,1,0,0,1
417,538.812,91.2,0.321,29463.72,162.6,9.135,9.305,9.095,9.165,6.96,302776,172.16,301.664,60,589,1,0,0,0


In [15]:
x_train.shape

(404, 19)

In [16]:
x_test.shape

(102, 19)
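The role of `random_state` can be checked with a short sketch on made-up data: calling `train_test_split` twice with the same seed should give the same rows in each part. The toy arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same random_state
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

same_split = np.array_equal(a_test, b_test)
print(a_train.shape, a_test.shape, same_split)
```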

## Standardizing Data

Standardizing means rescaling each variable so that its mean is **zero** and its variance is **one**.

Data needs to be standardized so that variables with larger values do not have a greater impact on the SVM prediction than variables with smaller values. For example, if variable 1 has values ranging from 1,000 to 5,000 and variable 2 has values between 0 and 1, variable 1 will dominate the model, even though both variables should be able to contribute on an equal footing. Note that the scaler is fitted on the training data only and then applied to both the training and test sets.
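As a rough check of what the scaler computes, the sketch below standardizes a made-up training set by hand, using z = (x - mean) / standard deviation per column, and compares the result with `StandardScaler`. The numbers are hypothetical and mirror the two value ranges from the example above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one large-valued column and one small-valued column
train = np.array([[1000.0, 0.1], [3000.0, 0.5], [5000.0, 0.9]])
test = np.array([[2000.0, 0.3]])

# Fit on the training data only, then transform both sets with the same statistics
scaler = StandardScaler().fit(train)
train_std = scaler.transform(train)
test_std = scaler.transform(test)

# Manual standardization with the training mean and (population) standard deviation
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(train_std)
print(test_std)
```

After scaling, both columns of the training data have mean 0 and standard deviation 1, so neither dominates distance or dot-product computations inside the kernel.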

In [17]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)

In [18]:
x_test_std

array([[-0.40835869, -1.12872913,  0.83336883, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [ 0.71925111,  0.9988844 , -0.65283979, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [-0.40257488,  0.39610829,  0.05115377, ...,  1.50268577,
        -0.48525664, -0.75225758],
       ...,
       [-0.3982601 , -0.85812418,  0.89420778, ..., -0.66547513,
        -0.48525664,  1.3293319 ],
       [-0.39934279, -0.07637654,  0.58132175, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [-0.40088071, -0.36702631,  0.31189212, ..., -0.66547513,
        -0.48525664, -0.75225758]])

In [19]:
x_test  # Just to show the difference between the original data and the standardized data.

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
329,21.3448,61.48,0.540,35179.815,90.7,7.320,7.460,7.275,7.515,8.62,483051,111.040000,283.616,53,452,0,1,0,0
371,204.6460,91.20,0.369,34529.880,173.5,9.310,9.525,9.320,9.505,7.96,454281,196.000000,268.000,25,609,1,1,0,0
219,22.2850,82.78,0.450,35402.015,165.9,8.175,8.375,8.315,8.405,8.72,451935,123.200000,263.680,21,561,1,1,0,0
403,516.0340,91.20,0.307,29713.695,169.5,9.125,9.310,9.060,9.100,6.96,384237,157.391498,301.328,40,677,1,0,0,1
78,21.1292,80.66,0.563,34618.760,127.2,7.330,7.500,7.450,7.690,8.26,447528,176.480000,303.392,53,377,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,20.4110,56.48,0.590,35457.565,109.2,5.340,5.535,5.285,5.465,8.54,539589,115.880000,303.952,22,383,1,0,1,0
455,115.0474,91.20,0.287,36246.375,160.0,8.695,8.790,8.630,9.015,6.96,408819,155.640000,282.256,36,620,0,0,0,1
60,22.9864,65.26,0.547,31891.255,139.7,6.335,6.420,6.235,6.560,7.06,465689,157.391498,222.992,30,439,0,0,0,1
213,22.8104,76.18,0.511,35413.125,105.8,7.945,8.040,7.910,8.215,7.28,504226,151.240000,204.496,3,441,1,1,0,0


## Training the Regression Model

For a regression model, we use svm.SVR to create the model.

Parameters are the values you pass to the function before training; attributes are information about the model that becomes available once it is trained.

SVR has the following **parameters**:
1. kernel: Specifies the kernel type to be used in the algorithm. Default = 'rbf'
2. degree: Specifies the degree of the polynomial kernel function. **For 'poly' kernel only**
3. gamma: Specifies the kernel coefficient for **'rbf', 'poly' and 'sigmoid'**
4. coef0: Independent term in the kernel function. **It is only significant in 'poly' and 'sigmoid'**
5. tol: Specifies the tolerance for the stopping criterion.
6. C: Specifies the penalty parameter C of the error term.
7. epsilon: Epsilon in the epsilon-SVR model.
8. And more.

It also has the following **attributes**:
1. support_: Shows the indices of support vectors.
2. support_vectors_: Shows support vectors.
3. And more. 
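The difference between parameters and attributes can be sketched on synthetic data. Everything about the data below (the seed, the 40 rows, the linear relationship) is an assumption for illustration; only `SVR`, its arguments and the `support_` and `support_vectors_` attributes come from scikit-learn.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic regression data: y depends linearly on two features plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=40)

# Parameters are chosen before training
model = SVR(kernel="linear", C=10.0, epsilon=0.1)
model.fit(X_toy, y_toy)

# Attributes (note the trailing underscore) exist only after fit()
support_indices = model.support_
support_vectors = model.support_vectors_

print(len(support_indices), support_vectors.shape)
```

The rows of `support_vectors_` are the training observations at the positions listed in `support_`.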

In [28]:
from sklearn.svm import SVR
svr = SVR(kernel='linear', C = 500)

In [29]:
svr.fit(x_train_std, y_train)

SVR(C=500, kernel='linear')

## Predict values using trained model

In [30]:
y_test_pred = svr.predict(x_test_std)
y_train_pred = svr.predict(x_train_std)

In [31]:
y_test_pred

array([54172.88652594, 42360.53493594, 47535.6161038 , 18213.44501931,
       48306.49797341, 39911.30264079, 34154.82438396, 43328.81857968,
       30564.16305183, 47083.06891753, 12883.41530221, 38441.56120077,
       37564.58419372,  8833.24108759, 66743.92804381, 61806.13858547,
       38834.69156187, 65442.06460586, 57070.43984968, 43390.83509428,
       53825.48454162, 39211.12088433, 41594.24518385, 58253.12540463,
       43061.94784582, 13500.91595214, 41078.73255014, 29760.22077231,
       74387.42055942, 44639.66629805, 36802.04559594, 37452.7520558 ,
       37731.81412737, 41015.55708641, 52033.72818305, 35527.23097608,
       25303.12301261, 36494.03996199, 36688.32304647, 35826.51769434,
       47334.38882548, 45971.16054321, 44806.68292955, 27789.57095574,
       50531.91304271, 46829.47085744, 33272.97210625, 40080.92308753,
       15821.17728949, 51706.04815075, 41321.11684481, 39172.79222897,
       47629.5707835 , 67355.55689102, 25492.33272751, 41679.44337221,
      

## Model performance

In [32]:
from sklearn.metrics import mean_squared_error, r2_score

In [33]:
mean_squared_error(y_test, y_test_pred)

160031907.61713442

In [34]:
r2_score(y_train, y_train_pred)

0.7125944752203937

In [35]:
r2_score(y_test, y_test_pred)

0.5028901447877382
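To see what these two metrics measure, the sketch below computes them by hand on four made-up values and compares them with scikit-learn. The arrays are hypothetical. Taking the square root of the mean squared error gives an error in the same units as the target; for the test MSE reported above this is roughly 12,650.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse_manual)

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse, r2_manual)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```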

# Classification Model

We will follow the same steps as in the regression model.

## Preprocessing data

In [11]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [12]:
ds = pd.read_csv('Movie_classification.csv', header=0)
ds.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,3D_available,Time_taken,Twitter_hastags,Genre,Avg_age_actors,Num_multiplex,Collection,Start_Tech_Oscar
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,YES,109.6,223.84,Thriller,23,494,48000,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,NO,146.64,243.456,Drama,42,462,43200,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,NO,147.88,2022.4,Comedy,38,458,69400,1
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,YES,185.36,225.344,Drama,45,472,66800,1
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,NO,176.48,225.792,Drama,55,395,72400,1


In [13]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           494 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int

In [14]:
ds['Time_taken'].mean()

157.3914979757085

In [15]:
ds['Time_taken'] = ds['Time_taken'].fillna(ds['Time_taken'].mean())    # Filling null instances with the column mean
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           506 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int

In [16]:
ds = pd.get_dummies(ds, columns= ['3D_available', 'Genre'], drop_first=True)
ds.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,...,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,Start_Tech_Oscar,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,...,109.6,223.84,23,494,48000,1,1,0,0,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,...,146.64,243.456,42,462,43200,0,0,0,1,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,...,147.88,2022.4,38,458,69400,1,0,1,0,0
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,...,185.36,225.344,45,472,66800,1,1,0,1,0
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,...,176.48,225.792,55,395,72400,1,0,0,1,0


In [17]:
xClassification = ds.loc[:,ds.columns!='Start_Tech_Oscar']
type(xClassification)

pandas.core.frame.DataFrame

In [18]:
xClassification.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,109.6,223.84,23,494,48000,1,0,0,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,146.64,243.456,42,462,43200,0,0,1,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,147.88,2022.4,38,458,69400,0,1,0,0
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,185.36,225.344,45,472,66800,1,0,1,0
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,176.48,225.792,55,395,72400,0,0,1,0


In [19]:
yClassification = ds['Start_Tech_Oscar']
yClassification.head()

0    1
1    0
2    1
3    1
4    1
Name: Start_Tech_Oscar, dtype: int64

In [20]:
from sklearn.model_selection import train_test_split
xClassificationTrain, xClassificationTest, yClassificationTrain, yClassificationTest = train_test_split(xClassification, yClassification, test_size=0.2, random_state=0)

In [21]:
xClassificationTrain.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
220,27.1618,67.4,0.493,38612.805,162.0,8.485,8.64,8.485,8.67,8.52,480270,174.68,224.272,23,536,53400,0,0,0,1
71,23.1752,76.62,0.587,33113.355,91.0,7.28,7.4,7.29,7.455,8.16,491978,200.68,263.472,46,400,43400,0,0,0,0
240,22.2658,64.86,0.572,38312.835,127.8,6.755,6.935,6.8,6.84,8.68,470107,204.8,224.32,24,387,54000,1,1,0,0
6,21.7658,70.74,0.476,33396.66,140.1,7.065,7.265,7.15,7.4,8.96,459241,139.16,243.664,41,522,45800,1,0,0,1
417,538.812,91.2,0.321,29463.72,162.6,9.135,9.305,9.095,9.165,6.96,302776,172.16,301.664,60,589,20800,1,0,0,0


In [22]:
print(xClassificationTrain.shape)
print(xClassificationTest.shape)

(404, 20)
(102, 20)


## Standardizing Data

In [23]:
from sklearn.preprocessing import StandardScaler

scClassification = StandardScaler().fit(xClassificationTrain)
xClassificationTrainStd = scClassification.transform(xClassificationTrain)

In [24]:
xClassificationTestStd = scClassification.transform(xClassificationTest)
xClassificationTestStd

array([[-0.40835869, -1.12872913,  0.83336883, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [ 0.71925111,  0.9988844 , -0.65283979, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [-0.40257488,  0.39610829,  0.05115377, ...,  1.50268577,
        -0.48525664, -0.75225758],
       ...,
       [-0.3982601 , -0.85812418,  0.89420778, ..., -0.66547513,
        -0.48525664,  1.3293319 ],
       [-0.39934279, -0.07637654,  0.58132175, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [-0.40088071, -0.36702631,  0.31189212, ..., -0.66547513,
        -0.48525664, -0.75225758]])

## Training Model

In [25]:
from sklearn import svm

In [26]:
clf_svm_l = svm.SVC(kernel='linear', C = 0.01)
clf_svm_l.fit(xClassificationTrainStd, yClassificationTrain)

SVC(C=0.01, kernel='linear')

## Predict values using trained model

In [27]:
yClassificationTrainPredicted = clf_svm_l.predict(xClassificationTrainStd)
yClassificationTestPredicted = clf_svm_l.predict(xClassificationTestStd)

In [28]:
yClassificationTestPredicted

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0], dtype=int64)

## Model Performance

In regression we used metrics such as R squared and mean squared error; in classification we use accuracy and the confusion matrix instead.

In [29]:
from sklearn.metrics import accuracy_score, confusion_matrix
confusion_matrix(yClassificationTest, yClassificationTestPredicted)

#            predicted 0 | predicted 1
# actual 0 |     11      |     33
# actual 1 |      5      |     53

# Rows are the actual classes and columns are the predicted classes, so there are 11 true negatives,
# 53 true positives, 33 false positives and 5 false negatives.

array([[11, 33],
       [ 5, 53]], dtype=int64)
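The four cells of the matrix can be unpacked and turned into the usual rates. This sketch starts from the matrix reported above and relies on scikit-learn's convention that rows are the actual classes and columns are the predicted classes; precision and recall are added here only as an illustration.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[11, 33],
               [5, 53]])

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

The large number of false positives suggests that this model predicts class 1 for most observations, which matches the array of predictions shown earlier.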

In [30]:
accuracy_score(yClassificationTest, yClassificationTestPredicted)

0.6274509803921569

In [31]:
clf_svm_l.n_support_

# The first number is the number of support vectors in class 0 and the second is the number in class 1

array([186, 189])

## Hyperparameter Tuning

How can we be sure that our C value is the best possible value? For every different value of C the model will be different, and so will the results. We could change the value of C manually and compare the results of each model.

Finding the best C value can be a long process if done manually, so we can use **grid search** from sklearn instead: you provide a set of candidate values for C, and it trains and evaluates a model for each one with cross-validation and reports which value performed best.

In [37]:
from sklearn.model_selection import GridSearchCV
params = {'C': (0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000)}  # Dictionary of hyperparameter values to try; the key 'C' is the parameter name.
clf_svm_l = svm.SVC(kernel='linear')    # Creating the Support Vector Classification object

In [38]:
# Parameters: the model, the parameter grid (as a dictionary), n_jobs = -1 to use all available CPU cores,
#             cv for the number of cross-validation folds, verbose for the amount of messages printed while training (1 here),
#             and scoring for the metric used to evaluate each model.
svm_grid_lin = GridSearchCV(clf_svm_l, params, n_jobs = -1,
                            cv = 10, verbose = 1, scoring = 'accuracy')


In [39]:
svm_grid_lin.fit(xClassificationTrainStd, yClassificationTrain)

Fitting 10 folds for each of 13 candidates, totalling 130 fits


GridSearchCV(cv=10, estimator=SVC(kernel='linear'), n_jobs=-1,
             param_grid={'C': (0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50,
                               100, 500, 1000)},
             scoring='accuracy', verbose=1)

In [40]:
svm_grid_lin.best_params_   # The parameter values of the best model among those tested.

{'C': 0.1}

In [41]:
linsvm_clf = svm_grid_lin.best_estimator_       # Get the best model and save it into a new object

In [42]:
accuracy_score(yClassificationTest, linsvm_clf.predict(xClassificationTestStd))

0.5980392156862745
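Grid search picks the value of C with the best average cross-validation score on the training data, not the best score on the test set. That may be why the tuned model above scores slightly lower on the test set than the model with C = 0.01 did. The sketch below, on synthetic data generated with `make_classification` (an assumption for illustration), shows where the cross-validation scores can be inspected.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

# Small grid of C values, evaluated with 5-fold cross-validation
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="accuracy")
grid.fit(X_toy, y_toy)

best_c = grid.best_params_["C"]
mean_scores = grid.cv_results_["mean_test_score"]  # one average CV accuracy per candidate
best_cv_score = grid.best_score_                   # the highest of those averages

print(best_c, best_cv_score)
print(mean_scores)
```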

## Polynomial Kernel

In [71]:
clf_svm_p3 = svm.SVC(kernel = 'poly', degree=2, C=0.01)
clf_svm_p3.fit(xClassificationTrainStd, yClassificationTrain)

SVC(C=0.01, degree=2, kernel='poly')

In [72]:
yClassificationTrainPredicted = clf_svm_p3.predict(xClassificationTrainStd)
yClassificationTestPredicted = clf_svm_p3.predict(xClassificationTestStd)

In [73]:
accuracy_score(yClassificationTest, yClassificationTestPredicted)

0.5686274509803921

In [74]:
clf_svm_p3.n_support_

array([186, 193])
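A polynomial kernel is most useful when the classes cannot be separated by a straight line. The sketch below is an illustration on synthetic data, not on the movie dataset: it generates two concentric rings with `make_circles` and compares the training accuracy of a linear kernel with that of a degree-2 polynomial kernel. The data and the value C = 1.0 are assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: an inner ring (one class) surrounded by an outer ring (other class)
X_toy, y_toy = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

# Same C, different kernels
linear_clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)
poly_clf = SVC(kernel="poly", degree=2, C=1.0).fit(X_toy, y_toy)

# Accuracy on the training data, enough to show which boundary shape fits the rings
linear_acc = linear_clf.score(X_toy, y_toy)
poly_acc = poly_clf.score(X_toy, y_toy)

print(linear_acc, poly_acc)
```

A degree-2 kernel can represent a circular boundary, which a single straight line cannot, so it should fit this kind of data much better.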

## Radial Kernel

In [75]:
clf_svm_r = svm.SVC(kernel='rbf', gamma=0.5, C=10)  # gamma controls how far the influence of a single training point reaches: larger values make it more local
clf_svm_r.fit(xClassificationTrainStd, yClassificationTrain)

SVC(C=10, gamma=0.5)

In [76]:
yClassificationTrainPredicted = clf_svm_r.predict(xClassificationTrainStd)
yClassificationTestPredicted = clf_svm_r.predict(xClassificationTestStd)

In [77]:
accuracy_score(yClassificationTest, yClassificationTestPredicted)

0.6176470588235294

In [78]:
clf_svm_r.n_support_

array([186, 218])
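The effect of gamma can be seen directly from the radial kernel formula, exp(-gamma * squared distance). The sketch below evaluates it for one fixed pair of points at several gamma values; the squared distance of 4.0 is a made-up number for illustration.

```python
import numpy as np

# Hypothetical squared Euclidean distance between two observations
squared_distance = 4.0

# Similarity assigned by the radial kernel for increasing values of gamma
gammas = [0.001, 0.01, 0.1, 0.5, 1]
similarities = {g: float(np.exp(-g * squared_distance)) for g in gammas}

for g, s in similarities.items():
    print(g, round(s, 4))
```

With a small gamma, even distant points still count as similar, which gives smoother boundaries. With a large gamma, only very close points influence each other, which makes the boundary more flexible and more prone to overfitting.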

## Radial Grid

In [83]:
params = {'C': (0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50),
          'gamma': (0.001, 0.01, 0.1, 0.5, 1)}
clf_svm_r = svm.SVC(kernel='rbf')

In [84]:
svm_grid_rad = GridSearchCV(clf_svm_r, params, n_jobs=-1, cv=3, verbose=1, scoring='accuracy')

In [85]:
svm_grid_rad.fit(xClassificationTrainStd, yClassificationTrain)

Fitting 3 folds for each of 40 candidates, totalling 120 fits


GridSearchCV(cv=3, estimator=SVC(), n_jobs=-1,
             param_grid={'C': (0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50),
                         'gamma': (0.001, 0.01, 0.1, 0.5, 1)},
             scoring='accuracy', verbose=1)
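The "40 candidates, totalling 120 fits" message follows from the size of the grid: every combination of C and gamma is tried, and each combination is trained once per cross-validation fold. A small sketch with `ParameterGrid` reproduces that count using the same parameter dictionary.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
params = {'C': (0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50),
          'gamma': (0.001, 0.01, 0.1, 0.5, 1)}

# ParameterGrid enumerates every (C, gamma) combination
candidates = list(ParameterGrid(params))
n_candidates = len(candidates)   # 8 values of C times 5 values of gamma
n_fits = n_candidates * 3        # one fit per candidate per fold (cv=3)

print(n_candidates, n_fits)
print(candidates[0])
```

Because the number of fits grows multiplicatively with every added parameter, larger grids can become slow quickly.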

In [86]:
svm_grid_rad.best_params_

{'C': 50, 'gamma': 0.001}

In [87]:
radsvm_clf = svm_grid_rad.best_estimator_

In [88]:
accuracy_score(yClassificationTest, radsvm_clf.predict(xClassificationTestStd))

0.6176470588235294