# Assignment 7 - Age of Abalone
## Author - Salinee Kingbaisomboon
### UW NetID: 1950831

## Instructions
1. Convert the continuous output value to binary (0, 1) and build an SVC
2. Using your best guess for hyperparameters and kernel, what is the percentage of correctly classified results?
3. Test different kernels and hyperparameters, or consider using sklearn.model_selection.GridSearchCV. Which kernel performed best, and with what settings?
4. Show recall, precision and f-measure for the best model
5. Using the original data, with rings as a continuous variable, create an SVR model
6. Report on the predicted variance and the mean squared error

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm, metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt

import warnings

warnings.filterwarnings("ignore") # To suppress warning

%matplotlib inline

pd.options.display.max_rows = None
pd.options.display.max_columns = None

# Read and perform data cleaning

In [2]:
# Load data
filename = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
df = pd.read_csv(filename, header=None)

# Headers from https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names
feature_names = ['Sex','Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']

# Assign column's names
df.columns = np.append(feature_names, ['Rings'])

# Convert target attribute from continuous to binary
df['Rings Binary'] = np.where(df['Rings'] < 11, 0, 1)

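# Label-encode the columns so that every value is an integer code
# Note: apply() runs LabelEncoder on every column, not only Sex (F = 0, I = 1, M = 2),
# so the continuous measurements are replaced by the rank of each distinct value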
le = preprocessing.LabelEncoder()
df = df.apply(le.fit_transform)

cols = df.columns
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns = cols)
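
As an alternative to label-encoding every column, the categorical and numeric columns can be handled separately. The following is a minimal sketch, not what this notebook does: the `toy` frame and its values are made up for illustration, and one-hot encoding `Sex` is an assumption about what a reader might prefer over integer codes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny hand-made frame standing in for the abalone data (illustrative values only)
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.350, 0.530, 0.440],
})

# One-hot encode the categorical column and min-max scale the numeric one.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
preprocess = ColumnTransformer(
    [
        ("sex", OneHotEncoder(), ["Sex"]),
        ("num", MinMaxScaler(), ["Length"]),
    ],
    sparse_threshold=0,
)

features = preprocess.fit_transform(toy)
print(features.shape)  # three one-hot columns (F, I, M) plus one scaled column
print(features)
```

With this approach the continuous measurements keep their relative spacing, because only `Sex` is recoded.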

In [3]:
# View first five rows of the data frame
df.head()

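| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings | Rings Binary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 66 | 56 | 18 | 715 | 417 | 199 | 263 | 14 | 1 |
| 1 | 2 | 45 | 36 | 17 | 285 | 178 | 94 | 113 | 6 | 0 |
| 2 | 0 | 81 | 67 | 26 | 962 | 480 | 280 | 374 | 8 | 0 |
| 3 | 2 | 63 | 56 | 24 | 718 | 400 | 225 | 273 | 9 | 0 |
| 4 | 1 | 41 | 34 | 15 | 253 | 159 | 76 | 87 | 6 | 0 |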


In [4]:
# Print DataFrame's size
print(df.shape)
# Print DataFrame's data types
# Note: all columns are numeric now (after label encoding), including the target columns
print(df.dtypes)

(4177, 10)
Sex               int32
Length            int64
Diameter          int64
Height            int64
Whole weight      int64
Shucked weight    int64
Viscera weight    int64
Shell weight      int64
Rings             int64
Rings Binary      int64
dtype: object


## Split Test and Train data
Based on **Tips on Practical Use** from https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use, Support Vector Machine algorithms are not scale invariant, so it is **highly recommended to scale the data**.
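
One way to apply that recommendation without letting test rows influence the scaler is to put the scaler and the model in a single pipeline, so the scaling parameters are learned from the training split only. This is a sketch under assumptions: it uses synthetic data from `make_classification` instead of the abalone file, and the `SVC` settings are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the abalone features and the binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

# fit() learns the min/max from the training rows only;
# score() reuses those values to scale the test rows before predicting
scaled_svc = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scaled_svc.fit(X_tr, y_tr)

test_accuracy = scaled_svc.score(X_te, y_te)
print("Test accuracy: {:.2f}".format(test_accuracy))
```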

In [5]:
# Separate the independent variables (AKA Features) from the dependent labels (AKA Target)
targetOutcome_scaled = pd.DataFrame(df_normalized,columns=['Rings Binary'])
allFeatures_scaled = pd.DataFrame(df_normalized,columns=df_normalized.columns.difference(['Rings Binary', 'Rings']))

# Split the Training (90%) and Testing Data (10%)
X_scaled, XX_scaled, Y_scaled, YY_scaled = train_test_split(allFeatures_scaled, targetOutcome_scaled, test_size = 0.1, random_state = 0)

## Guess for Hyperparameters and Kernel

In [6]:
# Test a LinearSVC
clf1 = svm.LinearSVC(C=0.9, max_iter=12000).fit(X_scaled, Y_scaled)
y1_predict = clf1.predict(XX_scaled)
print("LinearSVC")
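# Note: classification_report expects (y_true, y_pred); the arguments here are in the
# opposite order, so the precision and recall columns below are swapped and support
# counts predicted labels
print(classification_report(y1_predict, YY_scaled))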
print('Accuracy of the LinearSVC on the training data is {:.2f} out of 1'.format(clf1.score(X_scaled, Y_scaled)))
print('Accuracy of the LinearSVC on the test data is {:.2f} out of 1'.format(clf1.score(XX_scaled, YY_scaled)))

LinearSVC
              precision    recall  f1-score   support

         0.0       0.90      0.78      0.84       300
         1.0       0.58      0.77      0.66       118

    accuracy                           0.78       418
   macro avg       0.74      0.78      0.75       418
weighted avg       0.81      0.78      0.79       418

Accuracy of the LinearSVC on the training data is 0.78 out of 1
Accuracy of the LinearSVC on the test data is 0.78 out of 1
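
The scikit-learn metric functions take the true labels first and the predictions second, and the order matters for precision and recall. The toy labels below are invented purely to make the effect visible; they are not taken from the abalone results.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: the model predicts class 1 too often
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Correct order: (y_true, y_pred)
precision = precision_score(y_true, y_pred)  # 2 true positives / 4 predicted positives
recall = recall_score(y_true, y_pred)        # 2 true positives / 2 actual positives

# Swapped order: the two values trade places
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(precision, recall)                  # 0.5 1.0
print(swapped_precision, swapped_recall)  # 1.0 0.5

# The F1 score is symmetric in precision and recall, so it is unaffected
print(f1_score(y_true, y_pred) == f1_score(y_pred, y_true))  # True
```

Accuracy and the F1 score are the same either way, which is why a swapped call can go unnoticed.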


## Using sklearn.model_selection.GridSearchCV to find the best settings

In [7]:
parameters = {'kernel':['linear', 'rbf', 'sigmoid'], 'C':[0.1, 0.9, 100], 'gamma': [0.1, 1, 5]}
svc = svm.SVC()
gsc = GridSearchCV(svc, parameters)

grid_result = gsc.fit(X_scaled, Y_scaled)
best_params = grid_result.best_params_

In [8]:
# Best hyperparameters
best_params

{'C': 100, 'gamma': 1, 'kernel': 'rbf'}
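
Because `gamma` has no effect on the linear kernel, a grid written as a list of dictionaries avoids fitting the same linear model once per `gamma` value. The sketch below illustrates that layout on synthetic data; the grid values, `cv=5` and `scoring="accuracy"` are assumptions chosen for the example, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled abalone features and binary target
X_grid, y_grid = make_classification(n_samples=200, n_features=8, random_state=0)

# One sub-grid per kernel, so gamma is only searched where it matters
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1]},
]

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_grid, y_grid)

print(search.best_params_)
print("Mean cross-validated accuracy: {:.2f}".format(search.best_score_))
print("Candidates evaluated:", len(search.cv_results_["params"]))
```

`best_score_` is the mean cross-validated score of the winning candidate, which is a useful companion to `best_params_` when comparing kernels.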

## Set the best Hyperparameters & SVM Model

In [9]:
# Based on the GridSearchCV
cost = best_params['C']
gamma = best_params['gamma']
kernel = best_params['kernel']

In [12]:
clf = svm.SVC(gamma=gamma, kernel=kernel, C=cost).fit(X_scaled, Y_scaled)
y_predicted = clf.predict(XX_scaled)
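# Note: classification_report expects (y_true, y_pred); the arguments here are in the
# opposite order, so the precision and recall columns below are swapped and support
# counts predicted labels
print(classification_report(y_predicted, YY_scaled))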
print('Accuracy of the Best SVM Model on the training data is {:.2f} out of 1'.format(clf.score(X_scaled, Y_scaled)))
print('Accuracy of the Best SVM Model on the test data is {:.2f} out of 1'.format(clf.score(XX_scaled, YY_scaled)))

              precision    recall  f1-score   support

         0.0       0.92      0.80      0.86       299
         1.0       0.62      0.82      0.71       119

    accuracy                           0.81       418
   macro avg       0.77      0.81      0.78       418
weighted avg       0.83      0.81      0.81       418

Accuracy of the Best SVM Model on the training data is 0.80 out of 1
Accuracy of the Best SVM Model on the test data is 0.81 out of 1


## Create SVR Model
Using the original data, with rings as a continuous variable.

In [13]:
# Separate the independent variables (AKA Features) from the dependent labels (AKA Target)
targetOutcome_scaled_original = pd.DataFrame(df_normalized,columns=['Rings'])
allFeatures_scaled_original = pd.DataFrame(df_normalized,columns=df_normalized.columns.difference(['Rings Binary', 'Rings']))

# Split the Training (90%) and Testing Data (10%)
X_scaled_original, XX_scaled_original, Y_scaled_original, YY_scaled_original = train_test_split(allFeatures_scaled_original, targetOutcome_scaled_original, test_size = 0.1, random_state = 0)

In [17]:
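# Note: gamma, kernel and C were tuned by GridSearchCV for the classifier and are reused here without re-tuning
svr = svm.SVR(gamma=gamma, kernel=kernel, C=cost).fit(X_scaled_original, Y_scaled_original)
y_svr_predicted = svr.predict(XX_scaled_original)
# Note: SVR.score returns the R^2 coefficient of determination, not a classification accuracy
print('Accuracy of the SVR on the training data is {:.2f} out of 1'.format(svr.score(X_scaled_original, Y_scaled_original)))
print('Accuracy of the SVR on the test data is {:.2f} out of 1'.format(svr.score(XX_scaled_original, YY_scaled_original)))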

print('Mean Square Error of the SVR on the test data is {:.2f}'.format(mean_squared_error(YY_scaled_original, y_svr_predicted)))

print('Explained Variance Score of the SVR on the test data is {:.2f}'.format(explained_variance_score(YY_scaled_original, y_svr_predicted)))

print('R2 Score of the SVR on the test data is {:.2f}'.format(r2_score(YY_scaled_original, y_svr_predicted)))

Accuracy of the SVR on the training data is 0.60 out of 1
Accuracy of the SVR on the test data is 0.59 out of 1
Mean Square Error of the SVR on the test data is 0.01
Explained Variance Score of the SVR on the test data is 0.59
R2 Score of the SVR on the test data is 0.59
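
To make the three regression metrics concrete, they can be computed by hand and compared with scikit-learn. The numbers below are invented for illustration. The explained variance score and R2 differ only when the residuals have a non-zero mean, so identical values (as in the output above) suggest the predictions are not systematically shifted.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Hypothetical targets and predictions (the predictions run high on average)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([4.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# R2: one minus residual sum of squares over total sum of squares
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: like R2, but the mean of the residuals is removed first
ev = 1 - np.var(residuals) / np.var(y_true)

print(mse, mean_squared_error(y_true, y_pred))        # 0.5 0.5
print(r2, r2_score(y_true, y_pred))                   # both approximately 0.9
print(ev, explained_variance_score(y_true, y_pred))   # both approximately 0.95
```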


***
**Summary:**
1. For the <font color=blue>**best guess**</font>, I used **LinearSVC** with C = 0.9: the test **Accuracy** is 78%.
2. I used **GridSearchCV** to search for the <font color=blue>**best hyperparameters**</font>: the results are
    - **C**: 100
    - **Gamma**: 1
    - **Kernel**: RBF
    - This model yields an **Accuracy** of 80% on the training data and 81% on the test data
3. For the **SVR** model (test data, min-max scaled target):
    - **Mean Square Error** is 0.01
    - **Explained Variance Score** is 0.59
    - **R2 Score** is 0.59
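4. The **SVC** classifies 81% of the test samples correctly, while the **SVR** explains about 59% of the variance in the ring count. Classification accuracy and R2 measure different things, so the two models cannot be ranked by comparing these numbers directly.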
***
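
A mean squared error measured on a min-max scaled target is easier to interpret after converting it back to the target's own units. The sketch below shows the arithmetic; the `rings` values are hypothetical and only serve to give the scaler a range, so the resulting figure is not a result for this data set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical target values, used only to define a range for the scaler
rings = np.array([[1.0], [8.0], [15.0], [29.0]])
scaler = MinMaxScaler().fit(rings)

# data_range_ holds (max - min) per column, here 29 - 1 = 28
value_range = scaler.data_range_[0]

# Min-max scaling divides values by the range, so errors shrink by the same factor:
# RMSE in original units = RMSE in scaled units * range
mse_scaled = 0.01
rmse_original = np.sqrt(mse_scaled) * value_range

print("RMSE in original units: {:.1f}".format(rmse_original))  # 0.1 * 28 = 2.8
```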