# Support Vector Machine 2

Dataset: UCI Skin Segmentation Data Set: http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation#

The skin dataset was collected by randomly sampling B, G, R values from face images of various age groups (young, middle, and old), race groups (white, black, and Asian), and genders, obtained from the FERET database and the PAL database.  
The total sample size is 245057, of which 50859 are skin samples and 194198 are non-skin samples.

Image sources: Color FERET Image Database; PAL Face Database from the Productive Aging Laboratory, The University of Texas at Dallas.

### Import libraries

In [1]:
import pprint as pp
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

### Load data

In [2]:
df = pd.read_csv('data/Skin_NonSkin.txt', header=None, names=['B', 'G', 'R', 'Skin'], sep='\t')

### Examine data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245057 entries, 0 to 245056
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   B       245057 non-null  int64
 1   G       245057 non-null  int64
 2   R       245057 non-null  int64
 3   Skin    245057 non-null  int64
dtypes: int64(4)
memory usage: 7.5 MB


In [4]:
df.shape

(245057, 4)

In [5]:
df.sample(n=10)

          B    G    R  Skin
238080  166  165  115     2
224764   75   74   46     2
203521  199  197  162     2
174390  163  161  113     2
130221  169  170  136     2
180599  255    0  128     2
56446   168  165  114     2
207915  234   99   79     2
277     208  216  253     1
226598   53   58   19     2


In [6]:
# Review distribution of target values
#  1=skin, 2=non-skin
df['Skin'].value_counts()

2    194198
1     50859
Name: Skin, dtype: int64
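
The counts above show an imbalanced target. As a rough sketch, the class proportions and the accuracy of a trivial "always predict non-skin" baseline can be worked out from those counts. The `counts` Series below is re-typed by hand from the output above so the snippet runs on its own; with the loaded data, `df['Skin'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the value_counts() output above (1 = skin, 2 = non-skin)
counts = pd.Series({2: 194198, 1: 50859}, name='Skin')

# Fraction of samples in each class
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()

print(proportions.round(4))
print(f'Majority-class baseline accuracy = {majority_baseline:.4f}')
```

Any accuracy reported later is best read against this baseline of roughly 79%.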

### Separate independent and dependent variables

In [7]:
X = df.drop('Skin', axis=1)     # Independent variables
y = df['Skin']                  # Dependent variable

### Scale the features
Because SVMs are sensitive to features with different ranges, we standardize the features.  
After scaling, each feature is centered on 0 with a standard deviation of 1.

In [8]:
# Instantiate StandardScaler
sc = StandardScaler()

In [9]:
X = sc.fit_transform(X)
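
To make the scaling step concrete, here is a small sketch that checks `StandardScaler` against the formula z = (x - mean) / std, computed per column. The `sample` array holds made-up B, G, R values used only for illustration; it is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical B, G, R rows, for illustration only
sample = np.array([[ 74,  85, 123],
                   [199, 197, 162],
                   [255,   0, 128],
                   [ 53,  58,  19]], dtype=float)

scaler = StandardScaler()
scaled = scaler.fit_transform(sample)

# StandardScaler applies z = (x - mean) / std to each column,
# using the population standard deviation (ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print('Column means after scaling:', scaled.mean(axis=0).round(6))
print('Column stds after scaling: ', scaled.std(axis=0).round(6))
print('Matches manual formula:    ', np.allclose(scaled, manual))
```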

### Split data into training and test sets

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
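
Note that the cells above fit the scaler on the full dataset before splitting, so statistics from the test rows influence the training features. One possible alternative, sketched below, is to split first and let a pipeline fit the scaler on the training split only. The data here is synthetic (`X_demo`, `y_demo` and the rule that generates the labels are assumptions made so the snippet runs without the data file), and `stratify` is an optional extra that keeps the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the B, G, R features and the Skin label (1 or 2)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 256, size=(500, 3)).astype(float)
y_demo = np.where(X_demo[:, 2] > X_demo[:, 0], 1, 2)  # arbitrary rule: R > B

# Split first; stratify preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# The pipeline fits the scaler on the training split only and
# reuses those statistics when transforming the test split
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_tr, y_tr)

print(f'Accuracy on held-out split = {model.score(X_te, y_te):.4f}')
```

With a dataset this large the difference from scaling up front is likely small, but the pipeline form keeps the evaluation clean.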

### Train model with default hyperparameters

In [11]:
# Instantiate SVC classifier with default radial basis function (rbf) kernel
classifier = SVC()

In [12]:
pp.pprint(classifier.get_params())

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
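The `gamma='scale'` default shown above resolves at fit time to `1 / (n_features * X.var())`. The sketch below checks this on synthetic data (`X_demo` and `y_demo` are assumptions, not the skin dataset) by comparing a model fitted with `gamma='scale'` against one given the computed value explicitly. On three standardized features the value comes out at about 1/3.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, standardized stand-in for the three color features
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(300, 3)).astype(float)
X_demo = StandardScaler().fit_transform(raw)
y_demo = np.where(raw[:, 2] > raw[:, 0], 1, 2)  # arbitrary rule: R > B

# What gamma='scale' evaluates to for this training data
gamma_value = 1.0 / (X_demo.shape[1] * X_demo.var())

auto = SVC(gamma='scale').fit(X_demo, y_demo)
manual = SVC(gamma=gamma_value).fit(X_demo, y_demo)

print(f'gamma = {gamma_value:.4f}')
print('Same decision function:',
      np.allclose(auto.decision_function(X_demo),
                  manual.decision_function(X_demo)))
```
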


In [13]:
%%time

# Fit the model
classifier.fit(X_train, y_train)

CPU times: user 12.2 s, sys: 185 ms, total: 12.4 s
Wall time: 12.5 s


### Evaluate model

In [None]:
%%time 

# Predict against test set
y_pred = classifier.predict(X_test)

In [None]:
# Print accuracy score
f'Model accuracy = {accuracy_score(y_test, y_pred):.4f}'
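
Since about 79% of the samples are non-skin, accuracy alone can hide poor performance on the skin class. A confusion matrix and per-class recall give a fuller picture. The labels below are invented for illustration (`y_true_demo`, `y_pred_demo` are hypothetical); in the notebook the same calls would take `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 1 = skin, 2 = non-skin
y_true_demo = np.array([1, 1, 1, 2, 2, 2, 2, 2, 2, 2])
y_pred_demo = np.array([1, 2, 2, 2, 2, 2, 2, 2, 2, 2])

# Rows are true classes, columns are predicted classes, in the order [1, 2]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2])

# Fraction of true skin samples that the model actually found
skin_recall = recall_score(y_true_demo, y_pred_demo, pos_label=1)

print(cm)
print(f'Accuracy    = {accuracy_score(y_true_demo, y_pred_demo):.4f}')
print(f'Skin recall = {skin_recall:.4f}')
```

In this toy case accuracy is 0.8 while only one of three skin samples is detected.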

### Train model with the linear kernel

In [None]:
# Instantiate SVC classifier with the linear kernel
classifier = SVC(kernel='linear')

In [None]:
%%time

# Fit the model
classifier.fit(X_train, y_train)

### Evaluate updated model

In [None]:
%%time 

# Predict against test set
y_pred = classifier.predict(X_test)

In [None]:
# Print accuracy score
f'Model accuracy = {accuracy_score(y_test, y_pred):.4f}'

### Train model with the polynomial kernel

In [None]:
# Instantiate SVC classifier with the polynomial (poly) kernel
classifier = SVC(kernel='poly')

In [None]:
%%time

# Fit the model
classifier.fit(X_train, y_train)

### Evaluate updated model

In [None]:
%%time 

# Predict against test set
y_pred = classifier.predict(X_test)

In [None]:
# Print accuracy score
f'Model accuracy = {accuracy_score(y_test, y_pred):.4f}'
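
The three train-and-evaluate sections above repeat the same steps with a different `kernel` value. A loop such as the one sketched below could collect accuracy and fit time for each kernel side by side. It runs on small synthetic data (`X_demo`, `y_demo` and the `results` dictionary are assumptions for illustration), so the numbers it prints say nothing about the skin dataset.

```python
import time

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the B, G, R features and the Skin label (1 or 2)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 256, size=(600, 3)).astype(float)
y_demo = np.where(X_demo[:, 2] > X_demo[:, 0], 1, 2)  # arbitrary rule: R > B

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

results = {}
for kernel in ('rbf', 'linear', 'poly'):
    # Same default hyperparameters as above; only the kernel changes
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))

    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    accuracy = accuracy_score(y_te, clf.predict(X_te))
    results[kernel] = (accuracy, fit_seconds)

for kernel, (accuracy, fit_seconds) in results.items():
    print(f'{kernel:>6}: accuracy = {accuracy:.4f}, fit time = {fit_seconds:.3f} s')
```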