# Support Vector Machine: Classification

Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both prediction of a binary variable (classification) and quantitative variable (regression problems) even if it is primarily used for classification problems. The goal of SVM is to create a hyperplane that linearly divide n-dimensional data points in two components by searching for an optimal margin that correctly segregate the data into different classes and at the same time be separated as much as possible from all the observations. In
addition to linear classification, it is also possible to compute a non-linear classification using what we call the kernel trick (kernel function) that maps inputs into high dimensional feature spaces. The kernel function adapted to specific problems allows a real flexibility to adapt to different situations. SVM allows to create a classifier, or a discrimination function, that we can generalize and apply for prediction such as in image classification, diagnostics, genomic sequences, or drug discovery.  SVM was developed at
AT&T Bell Laboratories by Vladimir Vapnik and colleagues. To select the optimal hyperplane amongst many hyperplanes that might classify our data, we select the one that has the largest margin or in another words that represents the largest separation between the different classes. It is an optimization problem under constraints where the distance between the nearest data point and the optimal hyperplane (on each side) is maximized. The hyperplane is then called the maximum-margin hyperplane allowing us to create a
maximum-margin classifier. The closest data-points are known as support vectors and margin is an area which generally do not contains any data points. If the optimal hyperplane is too close to data points and the margin too small, it will be difficult to predict new data and the model will fail to generalize well. In non-linear cases, we will need to introduce a kernel function to search for nonlinear separating surfaces. The method will induce a nonlinear transformation of our dataset towards an intermediate space that we call feature space of higher dimension.

## Support Vector Machine: Linear Kernel

In [16]:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import cross_val_score

# load data: Capture the dataset in Python using Pandas DataFrame
csv_data = '../data/datasets/neurons_binary.csv'
df = pd.read_csv(csv_data, delimiter=';')

# Drop row having at least 1 missing value
df = df.dropna()

# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']

# Splitting the data : training and test (20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
Normalize = preprocessing.StandardScaler()
# Transform data
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)

# Model 
model=svm.SVC(kernel='linear')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

# Metrics
results = [metrics.accuracy_score(y_test, y_pred),metrics.precision_score(y_test, y_pred, average='micro'),metrics.recall_score(y_test, y_pred, average='micro'),metrics.f1_score(y_test, y_pred, average='micro'), cross_val_score(model, X_train, y_train, cv=5).mean(), cross_val_score(model, X_train, y_train, cv=5).std()]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score", "Cross-validation mean", "Cross-validation std"], columns={'SVM_linear'})

metrics_dataframe

Unnamed: 0,SVM_linear
Accuracy,0.801471
Precision,0.801471
Recall,0.801471
F1 Score,0.801471
Cross-validation mean,0.934126
Cross-validation std,0.002855


## Support Vector Machine: Radial Basis Function Kernel

In [5]:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import cross_val_score

# load data: Capture the dataset in Python using Pandas DataFrame
csv_data = '../data/datasets/neurons.csv'
df = pd.read_csv(csv_data, delimiter=';')

# Drop row having at least 1 missing value
df = df.dropna()

# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']

# Splitting the data : training and test (20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
Normalize = preprocessing.StandardScaler()
# Transform data
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)

# Model 
model=svm.SVC(kernel='rbf')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

# Metrics
results = [metrics.accuracy_score(y_test, y_pred),metrics.precision_score(y_test, y_pred, average='micro'),metrics.recall_score(y_test, y_pred, average='micro'),metrics.f1_score(y_test, y_pred, average='micro'), cross_val_score(model, X_train, y_train, cv=5).mean(), cross_val_score(model, X_train, y_train, cv=5).std()]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score", "Cross-validation mean", "Cross-validation std"], columns={'SVM_rbf'})

metrics_dataframe

Unnamed: 0,SVM_rbf
Accuracy,0.860115
Precision,0.860115
Recall,0.860115
F1 Score,0.860115
Cross-validation mean,0.902377
Cross-validation std,0.002033


## Support Vector Machine: Sigmoid Kernel

In [6]:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import cross_val_score

# load data: Capture the dataset in Python using Pandas DataFrame
csv_data = '../data/datasets/neurons.csv'
df = pd.read_csv(csv_data, delimiter=';')

# Drop row having at least 1 missing value
df = df.dropna()

# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']

# Splitting the data : training and test (20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
Normalize = preprocessing.StandardScaler()
# Transform data
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)

# Model 
model=svm.SVC(kernel='sigmoid')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

# Metrics
results = [metrics.accuracy_score(y_test, y_pred),metrics.precision_score(y_test, y_pred, average='micro'),metrics.recall_score(y_test, y_pred, average='micro'),metrics.f1_score(y_test, y_pred, average='micro'), cross_val_score(model, X_train, y_train, cv=5).mean(), cross_val_score(model, X_train, y_train, cv=5).std()]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score", "Cross-validation mean", "Cross-validation std"], columns={'SVM_sigmoid'})

metrics_dataframe

Unnamed: 0,SVM_sigmoid
Accuracy,0.755918
Precision,0.755918
Recall,0.755918
F1 Score,0.755918
Cross-validation mean,0.728386
Cross-validation std,0.01085


## Support Vector Machine: Polynomial Kernel

In [7]:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import cross_val_score

# load data: Capture the dataset in Python using Pandas DataFrame
csv_data = '../data/datasets/neurons.csv'
df = pd.read_csv(csv_data, delimiter=';')

# Drop row having at least 1 missing value
df = df.dropna()

# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']

# Splitting the data : training and test (20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
Normalize = preprocessing.StandardScaler()
# Transform data
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)

# Model 
model=svm.SVC(kernel='poly')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

# Metrics
results = [metrics.accuracy_score(y_test, y_pred),metrics.precision_score(y_test, y_pred, average='micro'),metrics.recall_score(y_test, y_pred, average='micro'),metrics.f1_score(y_test, y_pred, average='micro'), cross_val_score(model, X_train, y_train, cv=5).mean(), cross_val_score(model, X_train, y_train, cv=5).std()]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score", "Cross-validation mean", "Cross-validation std"], columns={'SVM_poly'})

metrics_dataframe

Unnamed: 0,SVM_poly
Accuracy,0.727762
Precision,0.727762
Recall,0.727762
F1 Score,0.727762
Cross-validation mean,0.670359
Cross-validation std,0.017285


# Support Vector Machine for Regression

In [17]:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import cross_val_score

# load data: Capture the dataset in Python using Pandas DataFrame
csv_data = '../data/datasets/neurons.csv'
df = pd.read_csv(csv_data, delimiter=';')

# Drop row having at least 1 missing value
df = df.dropna()

# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']

# Splitting the data : training and test (20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
Normalize = preprocessing.StandardScaler()
# Transform data
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)

from sklearn.decomposition import PCA
# define PCA. n_components: number of principal components we want to keep
pca = PCA(n_components=2)

# transform data
X_train = pca.fit_transform(X_test)
X_test = pca.fit_transform(X_test)

# Model 
model=svm.SVC(kernel='sigmoid')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

# Metrics
results = [metrics.accuracy_score(y_test, y_pred),metrics.precision_score(y_test, y_pred, average='micro'),metrics.recall_score(y_test, y_pred, average='micro'),metrics.f1_score(y_test, y_pred, average='micro'), cross_val_score(model, X_train, y_train, cv=5).mean(), cross_val_score(model, X_train, y_train, cv=5).std()]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score", "Cross-validation mean", "Cross-validation std"], columns={'SVM_sigmoid'})

metrics_dataframe

ValueError: Found input variables with inconsistent numbers of samples: [5576, 22300]