### Lesson 7 - Assignment

Using the Abalone csv file, create a new notebook to build an experiment using support vector machine classifier and regression. Perform each of the following tasks and answer the questions:

- Convert the continuous output value from continuous to binary (0,1) and build an SVC
- Using your best guess for hyperparameters and kernel, what is the percentage of correctly classified results?
- Test different kernels and hyperparameters or consider using sklearn.model_selection.SearchGridCV. Which kernel performed best with what settings?
- Show recall, precision and f-measure for the best model
- Using the original data, with rings as a continuous variable, create an SVR model
- Report on the predicted variance and the mean squared error

*The target field is “Rings”. Since the output is continuous the solution can be handled by a Support Vector Regression or it can be changed to a binary Support Vector Classification by assigning examples that are younger than 11 years old to class: ‘0’ and those that are older (class: ‘1’).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# import data
df = pd.read_csv('abalone.csv')

In [3]:
df.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
Sex               4177 non-null object
Length            4177 non-null float64
Diameter          4177 non-null float64
Height            4177 non-null float64
Whole Weight      4177 non-null float64
Shucked Weight    4177 non-null float64
Viscera Weight    4177 non-null float64
Shell Weight      4177 non-null float64
Rings             4177 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


### Process data
To build the models we will need numeric data as inputs. Therefore we need to turn the categorical data into numeric variables. To do this, I'll use one hot encoding.

In [5]:
# one hot encode Sex column
df = pd.get_dummies(df, columns=["Sex"], prefix=["Sex"])

### Convert the continuous output value from continuous to binary (0,1)
Assigning examples that are younger than 11 years old to class: ‘0’ and those that are older (class: ‘1’)

In [6]:
# assign <=11 to class ‘0’ and those that are older to class: ‘1’
df['Target'] = [0 if x <= 11 else 1 for x in df['Rings']] 

In [7]:
# drop Rings column with continuous variables
df1 = df.drop(['Rings'], axis=1)

### Define predictors (X) and target (Y)

In [8]:
# define X 
X = df1.iloc[:, 0:8]

# define Y
Y = df1.iloc[:, 9]

### Split the data into training and testing sets

In [9]:
# split test and train data
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=1)

### Set the hyperparameters 

In [10]:
cost = 1 # penalty parameter of the error term
gamma = 5 # defines the influence of input vectors on the margins

In [11]:
from sklearn import svm, metrics
from sklearn.metrics import classification_report

# Train a LinearSVC
clf1 = svm.LinearSVC(C=cost).fit(X_train, y_train)
clf1.predict(X_test)
print("LinearSVC")
print(classification_report(clf1.predict(X_test), y_test))

# Test linear, rbf and poly kernels
for k in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(gamma=gamma, kernel=k, C=cost).fit(X_train, y_train)
    clf.predict(X_test)
    print(k)
    print(classification_report(clf.predict(X_test), y_test))

LinearSVC
              precision    recall  f1-score   support

           0       0.92      0.86      0.89       558
           1       0.75      0.85      0.80       278

    accuracy                           0.86       836
   macro avg       0.84      0.85      0.84       836
weighted avg       0.86      0.86      0.86       836

linear
              precision    recall  f1-score   support

           0       0.92      0.86      0.89       564
           1       0.74      0.85      0.79       272

    accuracy                           0.85       836
   macro avg       0.83      0.85      0.84       836
weighted avg       0.86      0.85      0.86       836

rbf
              precision    recall  f1-score   support

           0       0.92      0.86      0.89       564
           1       0.74      0.85      0.79       272

    accuracy                           0.86       836
   macro avg       0.83      0.86      0.84       836
weighted avg       0.87      0.86      0.86       836

### Create an SVR using original data

In [12]:
from sklearn.svm import SVR

In [13]:
# use the original dataframe with rings as a continuous variable
df2 = df.drop(['Target'], axis=1)

### Define predictors (X) and target (Y)

In [14]:
# define X 
X2 = df2.drop('Rings', axis=1)

# define y
Y2 = df2['Rings']

### Split the data into training and testing sets

In [15]:
# split test and train data
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X2, Y2, test_size = 0.2, random_state=1)

In [16]:
regressor = SVR(kernel='rbf', gamma=gamma, C=cost)

# train the model using the training sets
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

### Report on the predicted variance and the mean squared error

In [17]:
from sklearn.metrics import mean_squared_error, r2_score

In [20]:
# print mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
# print r squared
print('Coefficient of determination: %.2f'% r2_score(y_test, y_pred))

Mean squared error: 4.71
Coefficient of determination: 0.52
