# Support Vector Machine Regression with Python

## Purpose

This post will provide an example of SVM using Python broken into the following steps.

- Data preparation
- Model development

We will use scikit-learn's SVR with its default settings, which means the RBF kernel rather than a linear one.

## Modules

In [1]:
import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn import model_selection
from statsmodels.tools.eval_measures import mse

We want to predict family income (faminc), which is a continuous variable. The modules imported above are all we need.

## Data Preparation

We now need to load our dataset and remove any missing values.

In [2]:
df=pd.DataFrame(data('OFP'))
df=df.dropna()
df.head()

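| | ofp | ofnp | opp | opnp | emr | hosp | numchron | adldiff | age | black | sex | maried | school | faminc | employed | privins | medicaid | region | hlth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 6.9 | yes | male | yes | 6 | 2.881 | yes | yes | no | other | other |
| 2 | 1 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 7.4 | no | female | yes | 10 | 2.7478 | no | yes | no | other | other |
| 3 | 13 | 0 | 0 | 0 | 3 | 3 | 4 | 1 | 6.6 | yes | female | no | 10 | 0.6532 | no | no | yes | other | poor |
| 4 | 16 | 0 | 5 | 0 | 1 | 1 | 2 | 1 | 7.6 | no | male | yes | 3 | 0.6588 | no | yes | no | other | poor |
| 5 | 3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 7.9 | no | female | yes | 6 | 0.6588 | no | yes | no | other | other |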
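
Before dropping rows, it can help to see how many values are actually missing. The sketch below uses a small made-up frame (`toy`, not the OFP data) to show the check and the effect of `dropna()`; the column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "faminc": [2.881, np.nan, 0.6532, 0.6588],
    "school": [6, 10, np.nan, 3],
})

# Count the missing values in each column before removing anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows that have no missing value in any column.
clean = toy.dropna()
print(f"rows before: {len(toy)}, rows after: {len(clean)}")
```

If the counts are all zero, `dropna()` simply returns the data unchanged.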


We need to change the text variables into dummy variables, and we also need to scale the data. The code below creates the dummy variables, removes the variables that are not needed, and scales the data.

In [3]:
# Encode each two-level text column as a single 0/1 indicator.
# dtype=float keeps the indicators numeric for the scaling step at the end.
dummy=pd.get_dummies(df['black'],dtype=float)
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'],dtype=float)
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'],dtype=float)
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

# For marital status we keep the "no" level and call it "single".
dummy=pd.get_dummies(df['maried'],dtype=float)
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'],dtype=float)
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)

# Drop the original text columns, including the ones we will not use.
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)

# Min-max scale every column to the 0-1 range.
df = (df - df.min()) / (df.max() - df.min())
df.head()

| | ofp | ofnp | opp | opnp | emr | hosp | numchron | adldiff | age | school | faminc | black_person | Male | job | single | insured |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.05618 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.25 | 0.0 | 0.069767 | 0.333333 | 0.069717 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.011236 | 0.0 | 0.014184 | 0.0 | 0.166667 | 0.0 | 0.25 | 0.0 | 0.186047 | 0.555556 | 0.067331 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.146067 | 0.0 | 0.0 | 0.0 | 0.25 | 0.375 | 0.5 | 1.0 | 0.0 | 0.555556 | 0.029826 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.179775 | 0.0 | 0.035461 | 0.0 | 0.083333 | 0.125 | 0.25 | 1.0 | 0.232558 | 0.166667 | 0.029926 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 5 | 0.033708 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 | 0.302326 | 0.333333 | 0.029926 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
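
The encoding pattern above (make dummies, keep one level, rename it) can also be written more compactly. The sketch below is one possible alternative, not the code used in this post; it runs on a made-up three-row frame, and the `black_no` / `sex_female` style names come from the prefixes that `pd.get_dummies` adds when it is given a `columns` list.

```python
import pandas as pd

# Hypothetical toy rows with the same kind of two-level text columns.
toy = pd.DataFrame({
    "black": ["yes", "no", "yes"],
    "sex": ["male", "female", "female"],
    "school": [6, 10, 10],
})

# Encode both columns in one call; dtype=float gives 0.0/1.0 indicators.
encoded = pd.get_dummies(toy, columns=["black", "sex"], dtype=float)

# Keep one indicator per variable and give it a readable name.
encoded = encoded.drop(columns=["black_no", "sex_female"])
encoded = encoded.rename(columns={"black_yes": "black_person", "sex_male": "Male"})
print(encoded)
```

Keeping only one indicator per two-level variable avoids carrying a redundant column, since the dropped level is implied by a 0 in the kept one.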


Before we set up the X dataset (the independent variables) and the y dataset (the dependent variable), a note on scaling.

### Scaling of Variables

SVM is sensitive to the scale of its inputs, which is why the previous cell applied min-max scaling. The cell below repeats that step; since every column already ranges from 0 to 1, it leaves the data unchanged.

In [4]:
df = (df - df.min()) / (df.max() - df.min())
df.head()


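The min-max formula used here is the same transformation that scikit-learn's `MinMaxScaler` applies with its default 0 to 1 range. The sketch below checks that on a small made-up frame (the numbers are illustrative, not the OFP data) and also confirms that scaling already-scaled data changes nothing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({
    "age": [6.9, 7.4, 6.6, 7.6],
    "school": [6.0, 10.0, 10.0, 3.0],
})

# Manual min-max scaling, written the same way as in this post.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same scaling done by scikit-learn.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
same_result = np.allclose(manual, scaled)

# Scaling a second time: each column already has min 0 and max 1.
again = (manual - manual.min()) / (manual.max() - manual.min())
unchanged = np.allclose(manual, again)

print(same_result, unchanged)
```
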

## Model Development

Before developing our model, we need to prepare the train and test sets. We begin by placing our independent and dependent variables in different data frames.

### Independent and Dependent Variables

In [6]:
X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','single','black_person','Male','job','insured']]
y=df['faminc']

Now we create our train and test sets. We are using a 70/30 split.

### Train and Test Sets

In [7]:
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)
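
To make the 70/30 split concrete, the sketch below splits ten rows of random toy data and prints the resulting shapes. It also shows a common variation that this post does not use: fitting the scaler on the training rows only, so the test rows play no part in choosing the minimum and maximum. All names ending in `_toy` and the short `_tr` / `_te` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 10 rows, 3 features.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(10, 3))
y_toy = rng.normal(size=10)

# test_size=0.3 sends 3 of the 10 rows to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)
print(X_tr.shape, X_te.shape)

# Variation: learn the min and max from the training rows only,
# then apply that same transformation to both sets.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

With this variation the scaled test values can fall slightly outside the 0 to 1 range, which is expected.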

Next, we will create our model with the code below.

In [8]:
h1=svm.SVR()

In [9]:
h1.fit(X_train, y_train)

SVR()
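
Calling `svm.SVR()` with no arguments uses scikit-learn's default kernel, which is `"rbf"`. A linear kernel has to be requested explicitly. The sketch below shows the difference on synthetic data; `X_toy`, `y_toy`, and the coefficients used to generate them are made up for illustration.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data with a roughly linear relationship.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(60, 2))
y_toy = 0.5 * X_toy[:, 0] - 0.2 * X_toy[:, 1] + rng.normal(scale=0.01, size=60)

# No arguments: the kernel parameter takes its default value.
default_model = svm.SVR()
print(default_model.kernel)

# A linear kernel must be set explicitly.
linear_model = svm.SVR(kernel="linear")
print(linear_model.kernel)

# With a linear kernel, the fitted model exposes one weight per feature.
linear_model.fit(X_toy, y_toy)
print(linear_model.coef_)
```

The `coef_` attribute is only available for the linear kernel, which is one practical reason to choose it when interpretability matters.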

## Model Evaluation

In [10]:
y_pred=h1.predict(X_test)

In [11]:
print(mse(y_pred,y_test))

0.005797630422076203
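
An MSE on its own is hard to judge, and here it is measured on the min-max scaled target rather than in the original income units. Two things can help: comparing against a baseline that always predicts the mean, and converting the root of the MSE back to original units by multiplying by the target's range (min-max scaling is linear, so errors scale by that range). The sketch below illustrates both with made-up numbers; `target_min` and `target_max` are placeholders, not values from the OFP data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical scaled targets and predictions.
y_true = np.array([0.07, 0.03, 0.03, 0.10, 0.05])
y_hat = np.array([0.06, 0.05, 0.04, 0.08, 0.05])

# MSE and RMSE of the model on the scaled target.
model_mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(model_mse)

# Baseline: always predict the mean (here the mean of y_true for brevity;
# with real data the training-set mean would be the usual choice).
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_mse = mean_squared_error(y_true, baseline_pred)

# Convert the RMSE back to original units using the target's range.
target_min, target_max = 0.0, 50.0  # placeholder values, not from the data
rmse_original_units = rmse * (target_max - target_min)

print(model_mse, baseline_mse, rmse_original_units)
```

A model is only useful if its MSE is clearly below the baseline's.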
