# Support Vector Machines Classification with Python

## Purpose

This post will provide an example of SVM using Python broken into the following steps.

- Data preparation
- Model development
- Model testing

We will use the linear kernel.

## Modules

In [2]:
import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn import model_selection

We are going to use the OFP dataset available in the pydataset module. We want to predict whether someone is single or not.

## Data Preparation

We now need to load our dataset and remove any missing values.

In [3]:
df=pd.DataFrame(data('OFP'))
df=df.dropna()
df.head()

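| | ofp | ofnp | opp | opnp | emr | hosp | numchron | adldiff | age | black | sex | maried | school | faminc | employed | privins | medicaid | region | hlth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 6.9 | yes | male | yes | 6 | 2.881 | yes | yes | no | other | other |
| 2 | 1 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 7.4 | no | female | yes | 10 | 2.7478 | no | yes | no | other | other |
| 3 | 13 | 0 | 0 | 0 | 3 | 3 | 4 | 1 | 6.6 | yes | female | no | 10 | 0.6532 | no | no | yes | other | poor |
| 4 | 16 | 0 | 5 | 0 | 1 | 1 | 2 | 1 | 7.6 | no | male | yes | 3 | 0.6588 | no | yes | no | other | poor |
| 5 | 3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 7.9 | no | female | yes | 6 | 0.6588 | no | yes | no | other | other |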
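
Before relying on `dropna()`, it can help to see how many rows it actually removes. The following is a minimal sketch on a small made-up frame; the `toy` data and variable names are hypothetical and are not taken from the OFP dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not the OFP data).
toy = pd.DataFrame({
    "ofp": [5, 1, np.nan, 16],
    "age": [6.9, 7.4, 6.6, np.nan],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value.
rows_before = len(toy)
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

If `rows_dropped` is zero on the real data, the `dropna()` step is harmless but unnecessary.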


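Looking at the dataset, we need to do something with the variables that contain text. We will create dummy variables for black, sex, employed, maried, and privins. The remaining text variables (medicaid, region, and hlth) are not encoded and are dropped later. The code is below.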

### Dummy Variables

In [4]:
# dtype=int keeps the dummies as 0/1; newer pandas versions return booleans by default
# black: keep the "yes" column as black_person
dummy=pd.get_dummies(df['black'],dtype=int)
df=pd.concat([df,dummy],axis=1)
df=df.rename(columns={"yes": "black_person"})
df=df.drop('no', axis=1)

# sex: keep the "male" column as Male
dummy=pd.get_dummies(df['sex'],dtype=int)
df=pd.concat([df,dummy],axis=1)
df=df.rename(columns={"male": "Male"})
df=df.drop('female', axis=1)

# employed: keep the "yes" column as job
dummy=pd.get_dummies(df['employed'],dtype=int)
df=pd.concat([df,dummy],axis=1)
df=df.rename(columns={"yes": "job"})
df=df.drop('no', axis=1)

# maried: keep the "no" column as single (1 = single)
dummy=pd.get_dummies(df['maried'],dtype=int)
df=pd.concat([df,dummy],axis=1)
df=df.rename(columns={"no": "single"})
df=df.drop('yes', axis=1)

# privins: keep the "yes" column as insured
dummy=pd.get_dummies(df['privins'],dtype=int)
df=pd.concat([df,dummy],axis=1)
df=df.rename(columns={"yes": "insured"})
df=df.drop('no', axis=1)
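
The five blocks above repeat the same pattern: encode one category of a text column as 1 and everything else as 0. As a sketch of a more compact alternative (not the approach used in this post), the same result can be produced with a loop over a mapping. The `toy` frame and the `mapping` dictionary are hypothetical and only mirror the first three rows shown earlier.

```python
import pandas as pd

# Hypothetical toy frame mirroring three of the text columns in the post.
toy = pd.DataFrame({
    "black": ["yes", "no", "yes"],
    "sex": ["male", "female", "female"],
    "maried": ["yes", "yes", "no"],
})

# source column -> (category coded as 1, name of the new 0/1 column)
mapping = {
    "black": ("yes", "black_person"),
    "sex": ("male", "Male"),
    "maried": ("no", "single"),
}

for col, (positive, new_name) in mapping.items():
    # A boolean comparison cast to int gives the same 0/1 dummy column.
    toy[new_name] = (toy[col] == positive).astype(int)

# Remove the original text columns once they are encoded.
toy = toy.drop(columns=list(mapping))
print(toy)
```

This also drops the original text columns in the same step, so no separate cleanup is needed for the encoded variables.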

Let's look at the data frame again.

In [5]:
df

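| | ofp | ofnp | opp | opnp | emr | hosp | numchron | adldiff | age | black | ... | employed | privins | medicaid | region | hlth | black_person | Male | job | single | insured |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 6.9 | yes | ... | yes | yes | no | other | other | 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 7.4 | no | ... | no | yes | no | other | other | 0 | 0 | 0 | 0 | 1 |
| 3 | 13 | 0 | 0 | 0 | 3 | 3 | 4 | 1 | 6.6 | yes | ... | no | no | yes | other | poor | 1 | 0 | 0 | 1 | 0 |
| 4 | 16 | 0 | 5 | 0 | 1 | 1 | 2 | 1 | 7.6 | no | ... | no | yes | no | other | poor | 0 | 1 | 0 | 0 | 1 |
| 5 | 3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 7.9 | no | ... | no | yes | no | other | other | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4402 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.4 | no | ... | no | yes | no | other | other | 0 | 0 | 0 | 0 | 1 |
| 4403 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 7.8 | no | ... | no | yes | no | other | other | 0 | 0 | 0 | 1 | 1 |
| 4404 | 10 | 0 | 20 | 0 | 1 | 1 | 5 | 0 | 7.3 | no | ... | no | yes | no | other | other | 0 | 1 | 0 | 0 | 1 |
| 4405 | 16 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 6.6 | no | ... | no | yes | no | other | other | 0 | 0 | 0 | 0 | 1 |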


If you look at the dataset now, you will see a lot of variables that are no longer necessary. The original text columns are redundant once the dummy variables exist, and medicaid, region, and hlth were never encoded. Below is the code to remove the columns we do not need.

In [6]:
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df.head()

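| | ofp | ofnp | opp | opnp | emr | hosp | numchron | adldiff | age | school | faminc | black_person | Male | job | single | insured |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 6.9 | 6 | 2.881 | 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 7.4 | 10 | 2.7478 | 0 | 0 | 0 | 0 | 1 |
| 3 | 13 | 0 | 0 | 0 | 3 | 3 | 4 | 1 | 6.6 | 10 | 0.6532 | 1 | 0 | 0 | 1 | 0 |
| 4 | 16 | 0 | 5 | 0 | 1 | 1 | 2 | 1 | 7.6 | 3 | 0.6588 | 0 | 1 | 0 | 0 | 1 |
| 5 | 3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 7.9 | 6 | 0.6588 | 0 | 0 | 0 | 0 | 1 |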


### Scaling of Variables

Now we need to scale the data because SVM is sensitive to the scale of the features. The code below applies min-max scaling, so every variable ends up in the range 0 to 1.

In [7]:
df = (df - df.min()) / (df.max() - df.min())
df.head()

| | ofp | ofnp | opp | opnp | emr | hosp | numchron | adldiff | age | school | faminc | black_person | Male | job | single | insured |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.05618 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.25 | 0.0 | 0.069767 | 0.333333 | 0.069717 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.011236 | 0.0 | 0.014184 | 0.0 | 0.166667 | 0.0 | 0.25 | 0.0 | 0.186047 | 0.555556 | 0.067331 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.146067 | 0.0 | 0.0 | 0.0 | 0.25 | 0.375 | 0.5 | 1.0 | 0.0 | 0.555556 | 0.029826 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.179775 | 0.0 | 0.035461 | 0.0 | 0.083333 | 0.125 | 0.25 | 1.0 | 0.232558 | 0.166667 | 0.029926 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 5 | 0.033708 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 | 0.302326 | 0.333333 | 0.029926 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
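
The scaling line above is min-max normalization: each value has the column minimum subtracted and is then divided by the column range. Here is a minimal pure-Python sketch of the same arithmetic for the age column. The helper name `min_max_scale` is hypothetical, and the maximum age of 10.9 is an assumption inferred from the scaled output (it is the value that reproduces 0.069767 for an age of 6.9), not something stated in the post.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# First five ages from the post, plus an assumed column maximum of 10.9.
ages = [6.9, 7.4, 6.6, 7.6, 7.9, 10.9]
scaled = min_max_scale(ages)

for age, s in zip(ages, scaled):
    print(age, round(s, 6))
```

The guard for a constant column matters in practice: the one-line pandas version would produce missing values for any column whose minimum equals its maximum.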


## Model Development

Before developing our model, we need to prepare the train and test sets. We begin by placing our independent and dependent variables in different data frames.

### Independent and Dependent Variables

In [8]:
X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','faminc','black_person','Male','job','insured']]
y=df['single']
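
Listing every predictor by name works, but it is easy to leave one out. A sketch of an equivalent way to build the two objects is to drop the target column instead, and then check how balanced the classes are. The `toy` frame below is hypothetical and much smaller than the real data.

```python
import pandas as pd

# Hypothetical toy frame with two predictors and the target column.
toy = pd.DataFrame({
    "ofp": [5, 1, 13, 16],
    "age": [6.9, 7.4, 6.6, 7.6],
    "single": [0, 0, 1, 0],
})

# Everything except the target becomes the predictor matrix.
X_toy = toy.drop(columns="single")
y_toy = toy["single"]

# Share of each class in the target; useful context when reading accuracy later.
class_share = y_toy.value_counts(normalize=True)
print(class_share)
```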

Now we create our train and test sets. We are using a 70/30 split.

### Train and Test Sets

In [9]:
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)
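
In this post the whole data frame is scaled before the split, so the minimum and maximum of each column are computed with the test rows included. A common alternative, which is not what the original code does, is to learn the scaling from the training rows only and then apply it to both sets. The sketch below shows that pattern with scikit-learn's `MinMaxScaler` on synthetic data; all `_demo` names and the generated data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = (X_demo[:, 0] > 0).astype(int)

# Same 70/30 split and random_state as in the post.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Learn min and max from the training rows only, then apply to both sets.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this approach the test values can fall slightly outside 0 to 1, which is expected because the scaler never saw them.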

Now we need to create the model, or hypothesis, we want to test. We will fit a single model with a linear kernel using LinearSVC. The hyperparameter that needs to be set is C, the regularization parameter, which you will see in the code below.

### Models

In [10]:
h1=svm.LinearSVC(C=1)

In [11]:
h1.fit(X_train,y_train)
h1.score(X_train,y_train)

0.7493514915693904

The accuracy on the training set looks reasonable at almost 75%. Let's evaluate our model on the test set now.
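
The value `C=1` used above is a choice rather than a requirement. In `LinearSVC`, the strength of the regularization is inversely proportional to `C`, so smaller values regularize more strongly. As a rough sketch of how one might compare a few settings, the code below fits the same model with three values of `C` on synthetic data from `make_classification`. The data, the candidate values, and the `max_iter` setting are all assumptions for illustration and say nothing about the best value for the OFP data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data (hypothetical, not the OFP data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Fit one model per candidate C and record train and test accuracy.
results = {}
for C in (0.01, 1, 100):
    model = LinearSVC(C=C, max_iter=10000).fit(X_tr, y_tr)
    results[C] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for C, (train_acc, test_acc) in results.items():
    print(f"C={C}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Comparing train and test accuracy side by side gives a first indication of whether a given `C` overfits.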

## Model Testing

In [12]:
y_pred=h1.predict(X_test)

In [13]:
pd.crosstab(y_test,y_pred)

| single | col_0 = 0.0 | col_0 = 1.0 |
|---|---|---|
| 0.0 | 519 | 191 |
| 1.0 | 172 | 440 |
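
The `col_0` label appears because `y_pred` is a NumPy array without a name. A small sketch of how the table can be labeled explicitly uses the `rownames` and `colnames` arguments of `pd.crosstab`. The `actual` and `predicted` values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels (a Series) and predictions (an unnamed array).
actual = pd.Series([0, 0, 1, 1, 1, 0], name="single")
predicted = np.array([0, 1, 1, 1, 0, 0])

# Name both axes so rows and columns are unambiguous in the output.
table = pd.crosstab(actual, predicted, rownames=["actual"], colnames=["predicted"])
print(table)
```

Rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions.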


Now for additional details.

In [14]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.75      0.73      0.74       710
         1.0       0.70      0.72      0.71       612

    accuracy                           0.73      1322
   macro avg       0.72      0.72      0.72      1322
weighted avg       0.73      0.73      0.73      1322



The overall accuracy on the test set is 73%. The crosstab() function provides a breakdown of the results, and the classification_report() function provides other metrics related to classification. Here, 0 means not single (that is, married) while 1 means single.
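
The figures in the classification report can be reproduced by hand from the four counts in the crosstab. The sketch below does that arithmetic in plain Python. The variable names are hypothetical, and the counts are copied from the crosstab output above, where rows are the true classes and columns are the predicted classes.

```python
# Counts from the crosstab: counts[true_class][predicted_class]
counts = {
    0: {0: 519, 1: 191},
    1: {0: 172, 1: 440},
}

total = sum(sum(row.values()) for row in counts.values())
accuracy = (counts[0][0] + counts[1][1]) / total

precision, recall, f1 = {}, {}, {}
for cls in (0, 1):
    other = 1 - cls
    true_pos = counts[cls][cls]
    predicted_as_cls = counts[cls][cls] + counts[other][cls]  # column total
    actually_cls = counts[cls][cls] + counts[cls][other]      # row total
    precision[cls] = true_pos / predicted_as_cls
    recall[cls] = true_pos / actually_cls
    f1[cls] = 2 * precision[cls] * recall[cls] / (precision[cls] + recall[cls])

print("accuracy:", round(accuracy, 2))
for cls in (0, 1):
    print(cls, round(precision[cls], 2), round(recall[cls], 2), round(f1[cls], 2))
```

Rounded to two decimals, these values match the report: accuracy 0.73, and precision, recall, and f1 of 0.75, 0.73, 0.74 for class 0 and 0.70, 0.72, 0.71 for class 1.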