
# Classifying data using Support Vector Machines (SVM)

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. Given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples. Each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
The most important question that arises when using an SVM is how to choose the right hyperplane.
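
As a minimal sketch of the idea, a linear classifier assigns a class by checking on which side of the hyperplane w · x + b = 0 a point falls. The weights `w`, the offset `b`, and the points `pts` below are made-up values for illustration; they are not fitted from the dataset used in this notebook.

```r
# Hypothetical hyperplane parameters (not learned from any data)
w <- c(1, -1)   # normal vector of the hyperplane
b <- 0.5        # offset

# Three example points, one per row
pts <- matrix(c( 2,  0,
                 0,  2,
                -1, -1),
              ncol = 2, byrow = TRUE)

# Decision value: positive on one side of the hyperplane, negative on the other
decision <- as.vector(pts %*% w + b)

# Class label: 1 on the positive side, 0 on the negative side
predicted <- ifelse(decision > 0, 1, 0)
print(data.frame(decision, predicted))
```

An SVM chooses `w` and `b` so that the margin between the hyperplane and the closest training points of each class is as large as possible.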

Here, we use a dataset of social network ads, loaded from the file social_network_svm.csv.

The features Gender, Age, and EstimatedSalary are used to classify whether or not a person made a purchase (the Purchased column).


In [38]:
library(tidyverse)

In [39]:
dataset = read.csv('https://raw.githubusercontent.com/soumyakrath/IMTPDS2021/main/GitHub-pwd/social_network_svm.csv')

In [40]:
head(dataset)

Unnamed: 0_level_0,User.ID,Gender,Age,EstimatedSalary,Purchased
Unnamed: 0_level_1,<int>,<chr>,<int>,<int>,<int>
1,15624510,Male,19,19000,0
2,15810944,Male,35,20000,0
3,15668575,Female,26,43000,0
4,15603246,Female,27,57000,0
5,15804002,Male,19,76000,0
6,15728773,Male,27,58000,0


In [41]:
install.packages('fastDummies')
library('fastDummies')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [42]:
dataset <- dummy_cols(dataset, select_columns = c('Gender'), remove_selected_columns = TRUE)

In [43]:
head(dataset)

Unnamed: 0_level_0,User.ID,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,15624510,19,19000,0,0,1
2,15810944,35,20000,0,0,1
3,15668575,26,43000,0,1,0
4,15603246,27,57000,0,1,0
5,15804002,19,76000,0,0,1
6,15728773,27,58000,0,0,1
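
For reference, the same kind of 0/1 indicator columns can be produced with base R alone. The sketch below uses a small made-up data frame (`toy`) rather than the real dataset, and builds column names that mirror the `Gender_Female` and `Gender_Male` columns shown above.

```r
# Made-up rows standing in for the real data
toy <- data.frame(Gender = c("Male", "Female", "Female"),
                  Age    = c(19, 26, 27))

# One indicator column per distinct value of Gender
for (lvl in sort(unique(toy$Gender))) {
  toy[[paste0("Gender_", lvl)]] <- as.integer(toy$Gender == lvl)
}
toy$Gender <- NULL   # drop the original text column
print(toy)
```

Because the two indicator columns always sum to 1, either one carries the full information on its own.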


In [44]:
# Drop the first column (User.ID): it is an identifier, not a feature
dataset = dataset[,-1]
head(dataset)

Unnamed: 0_level_0,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>
1,19,19000,0,0,1
2,35,20000,0,0,1
3,26,43000,0,1,0
4,27,57000,0,1,0
5,19,76000,0,0,1
6,27,58000,0,0,1


In [45]:
str(dataset)

'data.frame':	400 obs. of  5 variables:
 $ Age            : int  19 35 26 27 19 27 27 32 25 35 ...
 $ EstimatedSalary: int  19000 20000 43000 57000 76000 58000 84000 150000 33000 65000 ...
 $ Purchased      : int  0 0 0 0 0 0 0 1 0 0 ...
 $ Gender_Female  : int  0 0 1 1 0 0 1 1 0 1 ...
 $ Gender_Male    : int  1 1 0 0 1 1 0 0 1 0 ...


In [46]:
summary(dataset)

      Age        EstimatedSalary    Purchased      Gender_Female 
 Min.   :18.00   Min.   : 15000   Min.   :0.0000   Min.   :0.00  
 1st Qu.:29.75   1st Qu.: 43000   1st Qu.:0.0000   1st Qu.:0.00  
 Median :37.00   Median : 70000   Median :0.0000   Median :1.00  
 Mean   :37.66   Mean   : 69742   Mean   :0.3575   Mean   :0.51  
 3rd Qu.:46.00   3rd Qu.: 88000   3rd Qu.:1.0000   3rd Qu.:1.00  
 Max.   :60.00   Max.   :150000   Max.   :1.0000   Max.   :1.00  
  Gender_Male  
 Min.   :0.00  
 1st Qu.:0.00  
 Median :0.00  
 Mean   :0.49  
 3rd Qu.:1.00  
 Max.   :1.00  

In [47]:
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))


In [48]:
str(dataset)

'data.frame':	400 obs. of  5 variables:
 $ Age            : int  19 35 26 27 19 27 27 32 25 35 ...
 $ EstimatedSalary: int  19000 20000 43000 57000 76000 58000 84000 150000 33000 65000 ...
 $ Purchased      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ Gender_Female  : int  0 0 1 1 0 0 1 1 0 1 ...
 $ Gender_Male    : int  1 1 0 0 1 1 0 0 1 0 ...


In [49]:
summary(dataset)

      Age        EstimatedSalary  Purchased Gender_Female   Gender_Male  
 Min.   :18.00   Min.   : 15000   0:257     Min.   :0.00   Min.   :0.00  
 1st Qu.:29.75   1st Qu.: 43000   1:143     1st Qu.:0.00   1st Qu.:0.00  
 Median :37.00   Median : 70000             Median :1.00   Median :0.00  
 Mean   :37.66   Mean   : 69742             Mean   :0.51   Mean   :0.49  
 3rd Qu.:46.00   3rd Qu.: 88000             3rd Qu.:1.00   3rd Qu.:1.00  
 Max.   :60.00   Max.   :150000             Max.   :1.00   Max.   :1.00  

In [50]:
install.packages('caTools')
library(caTools)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [51]:
# Splitting the dataset into the Training and Test set

set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)

training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)


In [52]:
dim(training_set)
dim(test_set)
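
`sample.split` keeps the class proportions of the label roughly equal in both subsets. The sketch below illustrates that idea in base R with a hypothetical helper, `stratified_split`; it is not the caTools implementation and will not select the same rows. The toy label vector only reuses the class counts of `Purchased` (257 and 143).

```r
# Hypothetical helper: mark about `ratio` of the rows of each class as training rows
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(123)
# Toy labels with the same class counts as Purchased
y <- c(rep(0L, 257), rep(1L, 143))
in_train <- stratified_split(y, 0.75)

# Class proportions are nearly identical in both parts
print(prop.table(table(y[in_train])))
print(prop.table(table(y[!in_train])))
```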

In [53]:
split

In [54]:
head(training_set)

Unnamed: 0_level_0,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
Unnamed: 0_level_1,<int>,<int>,<fct>,<int>,<int>
1,19,19000,0,0,1
3,26,43000,0,1,0
6,27,58000,0,0,1
7,27,84000,0,1,0
8,32,150000,1,1,0
10,35,65000,0,1,0


In [55]:
head(test_set)

Unnamed: 0_level_0,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
Unnamed: 0_level_1,<int>,<int>,<fct>,<int>,<int>
2,35,20000,0,0,1
4,27,57000,0,1,0
5,19,76000,0,0,1
9,25,33000,0,0,1
12,26,52000,0,1,0
18,45,26000,1,0,1


In [56]:
# Feature Scaling

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
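
The cell above standardises the test set with its own mean and standard deviation. A common alternative is to reuse the training set's parameters, so that both sets go through exactly the same transformation. The sketch below shows this with made-up data (`train`, `test`); applying it to this notebook would change the scaled test values and possibly the results reported further down.

```r
set.seed(1)
# Made-up numeric features standing in for Age and EstimatedSalary
train <- data.frame(Age = rnorm(30, 38, 10), Salary = rnorm(30, 70000, 30000))
test  <- data.frame(Age = rnorm(10, 38, 10), Salary = rnorm(10, 70000, 30000))

# Fit the scaling on the training data only
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training mean and standard deviation to the test data
test_scaled <- scale(test, center = centers, scale = scales)

print(round(colMeans(train_scaled), 10))  # zero by construction
print(colMeans(test_scaled))              # generally close to, but not exactly, zero
```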


In [57]:
head(training_set)

Unnamed: 0_level_0,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
Unnamed: 0_level_1,<dbl>,<dbl>,<fct>,<dbl>,<dbl>
1,-1.7655475,-1.4733414,0,-0.9916984,0.9916984
3,-1.0962966,-0.7883761,0,1.0050098,-1.0050098
6,-1.0006894,-0.3602727,0,-0.9916984,0.9916984
7,-1.0006894,0.381773,0,1.0050098,-1.0050098
8,-0.5226531,2.2654277,1,1.0050098,-1.0050098
10,-0.2358313,-0.1604912,0,1.0050098,-1.0050098


In [58]:
head(test_set)

Unnamed: 0_level_0,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
Unnamed: 0_level_1,<dbl>,<dbl>,<fct>,<dbl>,<dbl>
2,-0.3041906,-1.5135434,0,-1.1,1.1
4,-1.0599437,-0.3245603,0,0.9,-0.9
5,-1.8156969,0.2859986,0,-1.1,1.1
9,-1.248882,-1.0957926,0,-1.1,1.1
12,-1.1544129,-0.4852337,0,0.9,-0.9
18,0.6405008,-1.3207353,1,-1.1,1.1


In [59]:
install.packages('e1071')
library(e1071)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [60]:
# Fitting SVM to the Training set

classifier = svm(Purchased ~ .,
				data = training_set,
				type = 'C-classification',
				kernel = 'linear')


In [61]:
summary(classifier)


Call:
svm(formula = Purchased ~ ., data = training_set, type = "C-classification", 
    kernel = "linear")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1 

Number of Support Vectors:  116

 ( 58 58 )


Number of Classes:  2 

Levels: 
 0 1




In [62]:
# Predicting the Test set results

y_pred = predict(classifier, newdata = test_set[-3])


In [63]:
y_pred

In [64]:
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred, dnn=c("Actual","Predicted"))
cm


      Predicted
Actual  0  1
     0 58  6
     1 14 22

##### The accuracy turns out to be 80%: (58 + 22) of the 100 test cases are classified correctly
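
The accuracy figure can be recomputed from the confusion matrix. The sketch below re-enters the matrix by hand under the assumed name `cm_linear`, so it runs on its own; in the notebook the same expression could be applied to `cm`.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm_linear <- matrix(c(58,  6,
                      14, 22),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(cm_linear)) / sum(cm_linear)
print(accuracy)  # 0.8
```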

In [65]:
# Tune the SVM hyperparameters (cost and gamma) with cross-validation on the training set to look for better accuracy
svm_tune <- tune(svm, train.x=training_set[,-3], train.y=training_set$Purchased, 
            kernel="radial", ranges=list(cost=10^(-1:2), gamma=c(.5,1,2)))
print(svm_tune)


Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
    1     2

- best performance: 0.08333333 
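
The object returned by `tune()` can also be inspected programmatically instead of reading the values off the printout. The sketch below is self-contained and therefore uses a two-class subset of the built-in `iris` data as a stand-in (an assumption for illustration, not the notebook's dataset). Note that the reported performance is the cross-validated classification error, so lower is better.

```r
library(e1071)

set.seed(123)  # tune() draws its cross-validation folds at random
# Two-class subset of iris, used only as a stand-in dataset
toy <- droplevels(subset(iris, Species != "setosa"))

tuned <- tune(svm, Species ~ ., data = toy,
              kernel = "radial",
              ranges = list(cost = 10^(-1:2), gamma = c(0.5, 1, 2)))

print(tuned$best.parameters)   # selected cost and gamma
print(tuned$best.performance)  # cross-validated error of that combination
best_svm <- tuned$best.model   # model refit with the selected parameters
```

Using `best.model` directly avoids copying the selected parameter values by hand into a second `svm()` call.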



In [66]:
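# Refit with a radial kernel using cost = 10 and gamma = 0.5.
# Note: the tuning run printed above reports cost = 1 and gamma = 2 as best; tune() draws its
# cross-validation folds at random, so the selected values can differ between runs.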

svm_model_after_tune <- svm(Purchased ~ ., data=training_set, type='C-classification',kernel="radial", cost=10, gamma=0.5)
summary(svm_model_after_tune)


Call:
svm(formula = Purchased ~ ., data = training_set, type = "C-classification", 
    kernel = "radial", cost = 10, gamma = 0.5)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  10 

Number of Support Vectors:  72

 ( 37 35 )


Number of Classes:  2 

Levels: 
 0 1




In [67]:
pred <- predict(svm_model_after_tune, newdata = test_set[-3])

In [68]:
table(test_set[, 3], pred, dnn=c("Actual","Predicted"))

      Predicted
Actual  0  1
     0 57  7
     1  4 32

##### The confusion matrix shows an improved accuracy of 89% on the test set: (57 + 32) of 100 cases are classified correctly
##### The false negative (FN) count drops from 14 to 4, while the false positive (FP) count rises slightly from 6 to 7
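
The error counts of the two models can be compared directly from their confusion matrices. The sketch below re-enters both matrices by hand and uses a hypothetical helper, `error_summary`, which assumes that class 1 (purchased) is the positive class.

```r
# Confusion matrices copied from the outputs above (rows = actual, columns = predicted)
cm_linear <- matrix(c(58, 6, 14, 22), nrow = 2, byrow = TRUE)
cm_radial <- matrix(c(57, 7,  4, 32), nrow = 2, byrow = TRUE)

# Hypothetical helper: treats class 1 as the positive class
error_summary <- function(cm) {
  c(accuracy        = sum(diag(cm)) / sum(cm),
    false_positives = cm[1, 2],                 # actual 0, predicted 1
    false_negatives = cm[2, 1],                 # actual 1, predicted 0
    recall          = cm[2, 2] / sum(cm[2, ]))  # share of actual 1s that were found
}

print(rbind(linear = error_summary(cm_linear),
            radial = error_summary(cm_radial)))
```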