Support Vector Machines (SVMs) are another algorithm that can be used for two-class or multi-class problems. For example, an SVM could be used to distinguish between pictures of dogs and cats, or to identify the type of animal from among several more types. Because SVMs use distance measurements to find the optimal classification boundary, they are easiest to explain and visualize in the two-class situation.

Let's imagine that our data has just two features that determine the outcome class of each data point. We could plot these two features on x/y axes (feature 1 on x and feature 2 on y) and color the data points to indicate the class to which they belong (class A or B). To create a rule for classifying new data, a Support Vector Machine draws the line between the two classes that maximizes the distance between the line and the nearest data points on either side. It looks like this:

Just as decision trees are defined by the nodes that make them up, SVMs are defined by what are called 'support vectors'. Support vectors are the data points that lie closest to the line dividing the two classes. Here, we can see that the blue circles and the red stars are separated by a line drawn as far as possible from both the nearest red star and the nearest blue circle. Maximizing this margin means that when the algorithm is asked to classify new data, a point that lies a little closer to the wrong class than the training data from its own class is still likely to be classified correctly. In our example, this means that even if there is a pair of a similar-looking cat and dog, the boundary will be drawn to distinguish between these two, and therefore also between all of the other, less similar cats and dogs.
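
The maximum-margin idea described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's own experiment: the two clusters come from `make_blobs`, and the names `X_demo`, `y_demo` and `linear_svm` are made up for this example. A linear kernel with a large `C` is assumed here as a stand-in for the hard-margin case.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters stand in for the two classes.
X_demo, y_demo = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                            cluster_std=1.0, random_state=0)

# A linear kernel with a large C approximates the hard-margin SVM described above.
linear_svm = SVC(kernel="linear", C=1000)
linear_svm.fit(X_demo, y_demo)

# The separating line is w . x + b = 0, and the margin width is 2 / ||w||.
w = linear_svm.coef_[0]
b = linear_svm.intercept_[0]
margin_width = 2 / np.linalg.norm(w)

print("number of support vectors:", len(linear_svm.support_vectors_))
print("margin width:", margin_width)

# Support vectors lie on the margin, so |w . x + b| is close to 1 for each of them.
print(np.abs(linear_svm.decision_function(linear_svm.support_vectors_)))
```

Only the few points nearest the boundary end up in `support_vectors_`; moving any other point (without crossing the margin) would leave the line unchanged.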

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import svm
# sklearn.datasets.samples_generator was removed in newer scikit-learn releases
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC, LinearSVC



One hot encoding data

In [2]:
param_grid = {'C': [1, 10, 100], 'gamma': [1, 0.1, 0.001], 'kernel': ['linear', 'rbf']}

In [3]:
grid = GridSearchCV(SVC(),param_grid,refit = True, verbose=2)

In [4]:
#grid.fit(X_train,y_train)

In [5]:
#grid.best_params_

In [6]:
# grid.fit is commented out above, so the chosen parameters are set by hand
grid.best_params_ = {'C': 1, 'gamma': 1, 'kernel': 'rbf'}
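
Since the `grid.fit` call above is commented out, the following is a small, self-contained sketch of how such a search runs end to end. It uses synthetic data from `make_classification` rather than the notebook's data, and the names `X_syn`, `param_grid_demo` and `grid_demo` are hypothetical. The `cv=3` setting is an assumption made to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook's real X_train / y_train are loaded in later cells.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

param_grid_demo = {"C": [1, 10, 100], "gamma": [1, 0.1, 0.001], "kernel": ["linear", "rbf"]}

# refit=True retrains the best combination on the whole training split,
# so the fitted search object can be used directly as a classifier.
grid_demo = GridSearchCV(SVC(), param_grid_demo, refit=True, cv=3)
grid_demo.fit(X_tr, y_tr)

print(grid_demo.best_params_)
print(grid_demo.score(X_te, y_te))
```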

In [7]:
#Test with random search
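
The random search mentioned in the comment above is not implemented in this notebook. One way it might look, as a sketch only, is with `RandomizedSearchCV`, sampling `C` and `gamma` from log-uniform ranges that span the same values as the grid. The data is synthetic, and `param_dist`, `random_search`, `n_iter=10` and `cv=3` are all assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, not the notebook's dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)

# Sample C and gamma from continuous log-uniform ranges instead of a fixed grid.
param_dist = {
    "C": loguniform(1, 100),
    "gamma": loguniform(1e-3, 1),
    "kernel": ["linear", "rbf"],
}

# n_iter controls how many parameter combinations are drawn and evaluated.
random_search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=3, random_state=0)
random_search.fit(X_syn, y_syn)

print(random_search.best_params_)
```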

# KNN DF BALANCED

In [8]:
df2 = pd.read_csv('./data/X_mean_knn.csv')
df3 = pd.read_csv('./data/y_mean_knn.csv')
X = df2
y = df3['Revenue']  # pass the target as a 1-D Series rather than a one-column DataFrame
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
clf = svm.SVC(C= 1, gamma= 1, kernel= 'rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)


0.6783772480133835

In [9]:
prediction = clf.predict(X_test)

In [10]:
print(classification_report(y_test,prediction))
print(confusion_matrix(y_test, prediction))

              precision    recall  f1-score   support

           0       0.66      1.00      0.80      3016
           1       0.99      0.13      0.23      1766

    accuracy                           0.68      4782
   macro avg       0.82      0.56      0.51      4782
weighted avg       0.78      0.68      0.59      4782

[[3013    3]
 [1535  231]]
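
The per-class figures in the report follow directly from the confusion matrix printed above. As a worked check, using only the four counts shown (the variable names are chosen for this example):

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[3013, 3],
               [1535, 231]])
tn, fp, fn, tp = cm.ravel()

recall_1 = tp / (tp + fn)       # share of true class-1 rows that were found
precision_1 = tp / (tp + fp)    # share of class-1 predictions that were correct
accuracy = (tp + tn) / cm.sum()

print(round(recall_1, 2), round(precision_1, 2), round(accuracy, 2))
```

The model almost never raises a false alarm for class 1 (3 false positives), but it misses 1535 of the 1766 true class-1 rows, which is why precision is high while recall is low.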


# OH BALANCED

In [11]:
df2 = pd.read_csv('./data/X_mean_knn.csv')
df3 = pd.read_csv('./data/y_mean_knn.csv')
X = df2
y = df3['Revenue']  # pass the target as a 1-D Series rather than a one-column DataFrame
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
clf = svm.SVC(C= 1, gamma= 1, kernel= 'rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)


0.6685487243831033

In [12]:
prediction = clf.predict(X_test)

In [13]:
print(classification_report(y_test,prediction))
print(confusion_matrix(y_test, prediction))

              precision    recall  f1-score   support

           0       0.65      1.00      0.79      2974
           1       0.99      0.12      0.22      1808

    accuracy                           0.67      4782
   macro avg       0.82      0.56      0.51      4782
weighted avg       0.78      0.67      0.57      4782

[[2972    2]
 [1583  225]]


# OH DF

In [14]:
df4 = pd.read_csv('./data/df_0_oh.csv')
X = df4.drop(columns='Revenue')
y = df4['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
clf = svm.SVC(C= 1, gamma= 1, kernel= 'rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8407367280606717

In [15]:
prediction = clf.predict(X_test)

In [16]:

print(classification_report(y_test,prediction))
print(confusion_matrix(y_test, prediction))

              precision    recall  f1-score   support

           0       0.84      1.00      0.91      3104
           1       0.00      0.00      0.00       588

    accuracy                           0.84      3692
   macro avg       0.42      0.50      0.46      3692
weighted avg       0.71      0.84      0.77      3692

[[3104    0]
 [ 588    0]]



# KNN UNBALANCED

In [17]:
df4 = pd.read_csv('./data/df_mean_knn.csv')
X = df4.drop(columns='Revenue')
y = df4['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
clf = svm.SVC(C= 1, gamma= 1, kernel= 'rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8556338028169014

In [18]:
prediction = clf.predict(X_test)
prediction

array([0., 0., 0., ..., 0., 0., 0.])

In [19]:
print(classification_report(y_test,prediction))
print(confusion_matrix(y_test, prediction))

              precision    recall  f1-score   support

         0.0       0.86      1.00      0.92      3159
         1.0       0.00      0.00      0.00       533

    accuracy                           0.86      3692
   macro avg       0.43      0.50      0.46      3692
weighted avg       0.73      0.86      0.79      3692

[[3159    0]
 [ 533    0]]
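
The second column of this confusion matrix is empty, meaning the model never predicts class 1. A short check on the printed counts shows that the accuracy is then exactly the share of the majority class in the test split:

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[3159, 0],
               [533, 0]])

accuracy = np.trace(cm) / cm.sum()                 # correct predictions / all rows
majority_share = cm.sum(axis=1).max() / cm.sum()   # share of the largest true class

print(accuracy, majority_share)
```

Both values equal the score reported for this run, so the accuracy here carries no information beyond the class balance of the test split.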



# OH BALANCED

In [20]:
df0 = pd.read_csv('./data/X_0_oh.csv')
df1 = pd.read_csv('./data/y_0_oh.csv')

In [21]:
X = df0
y = df1['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
clf = svm.SVC(C= 1, gamma= 1, kernel= 'rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.6755685374504485

In [22]:
prediction = clf.predict(X_test)
print(classification_report(y_test,prediction))
print(confusion_matrix(y_test, prediction))

              precision    recall  f1-score   support

           0       0.66      1.00      0.79      2980
           1       0.98      0.15      0.25      1813

    accuracy                           0.68      4793
   macro avg       0.82      0.57      0.52      4793
weighted avg       0.78      0.68      0.59      4793

[[2974    6]
 [1549  264]]


In [23]:
df1.Revenue.value_counts()

0    10067
1     5907
Name: Revenue, dtype: int64

# DF dropping categorical values

In [24]:
categorical_list = ["Month","OperatingSystems","Browser", "Region","TrafficType"]

In [25]:
df2 = pd.read_csv('./data/X_mean_knn.csv')
df3 = pd.read_csv('./data/y_mean_knn.csv')
#X = df2.drop(categorical_list, axis =1)
X = df2
y = df3['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
clf = svm.SVC(C= 1, gamma= 1, kernel= 'rbf')
clf.fit(X_train, y_train)

SVC(C=1, gamma=1)

In [26]:
prediction = clf.predict(X_test)

In [27]:
print(classification_report(y_test,prediction))
print(confusion_matrix(y_test, prediction))

              precision    recall  f1-score   support

           0       0.66      1.00      0.79      2970
           1       0.99      0.14      0.24      1812

    accuracy                           0.67      4782
   macro avg       0.82      0.57      0.52      4782
weighted avg       0.78      0.67      0.58      4782

[[2967    3]
 [1562  250]]


# DF get_dummies

In [28]:
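def get_dummies(df, varlist):
    """One-hot encode each column named in varlist and drop the original column."""
    for var in varlist:
        # No prefix is passed, so the dummy columns are named after the raw category values.
        df_slice = pd.get_dummies(df[var])
        df = pd.concat([df.drop(var, axis=1), df_slice], axis=1)
    return df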
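
One thing to watch with this helper is that several categorical columns may share the same integer codes, in which case the unprefixed dummy columns end up with duplicate names. The toy frame below is invented for illustration (only the column names are borrowed from the notebook). It shows the collision and one possible alternative, passing `columns=` to `pd.get_dummies` so each dummy is prefixed with its source column.

```python
import pandas as pd

# Toy frame: two categorical columns that share the same integer codes.
toy = pd.DataFrame({"Browser": [1, 2, 1],
                    "Region": [1, 1, 2],
                    "PageValues": [0.0, 5.0, 0.0]})

# Without a prefix, both columns produce dummy columns labelled 1 and 2.
no_prefix = pd.concat([pd.get_dummies(toy["Browser"]),
                       pd.get_dummies(toy["Region"])], axis=1)
print(no_prefix.columns.tolist())

# With columns=, pandas prefixes each dummy with the name of its source column.
with_prefix = pd.get_dummies(toy, columns=["Browser", "Region"])
print(with_prefix.columns.tolist())
```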

In [29]:
df2 = pd.read_csv('./data/X_mean_knn.csv')
df3 = pd.read_csv('./data/y_mean_knn.csv')
categorical_list = ["Month","OperatingSystems","Browser", "Region","TrafficType"]
X = df2
X = get_dummies(X, categorical_list)

In [30]:
X

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,11,12,13,14,15,16,17,18,19,20
0,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.200000,0.200000,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0
1,0.000000,0.000000,0.000000,0.000000,2.000000,64.000000,0.000000,0.100000,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0
2,0.000000,-1.000000,0.000000,-1.000000,1.000000,-1.000000,0.200000,0.200000,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0
3,0.000000,0.000000,0.000000,0.000000,2.000000,0.000000,0.050000,0.140000,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0
4,0.000000,0.000000,0.000000,0.000000,10.000000,627.500000,0.020000,0.050000,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15933,0.800732,0.000000,0.000000,0.000000,20.599634,0.000000,0.004449,0.010881,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0
15934,0.369435,9.235877,0.000000,0.000000,14.000000,370.043350,0.013392,0.027211,133.597782,0.0,...,0,0,0,0,0,0,0,0,0,0
15935,12.095032,0.000000,1.226242,42.880129,151.131210,0.000000,0.000525,0.014963,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0
15936,2.080231,32.699133,0.000000,0.000000,25.561619,1720.782255,0.000000,0.007534,0.000000,0.0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
df2 = pd.read_csv('./data/X_mean_knn.csv')
df3 = pd.read_csv('./data/y_mean_knn.csv')
categorical_list = ["Month","OperatingSystems","Browser", "Region","TrafficType"]


In [32]:
X = df2


In [33]:
X = get_dummies(X, categorical_list)


In [34]:
y = df3['Revenue']


In [35]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)


In [36]:
clf = svm.SVC(C= 1, gamma= 1, kernel= 'rbf')


In [38]:
clf.fit(X_train, y_train)

SVC(C=1, gamma=1)

In [39]:
prediction = clf.predict(X_test)

In [40]:
print(classification_report(y_test,prediction))
print(confusion_matrix(y_test, prediction))

              precision    recall  f1-score   support

           0       0.63      1.00      0.78      2973
           1       1.00      0.05      0.09      1809

    accuracy                           0.64      4782
   macro avg       0.82      0.52      0.44      4782
weighted avg       0.77      0.64      0.52      4782

[[2973    0]
 [1719   90]]


In [None]:
pd.get_dummies(df2.Month)  # Month is a feature column in df2; df3 holds the Revenue target