# Wine Quality Prediction using Support Vector Machine

## 1. Introduction
Support vector machine (SVM) is a popular machine learning technique well suited for both classification and regression tasks. This notebook provides a brief introduction to SVM and showcases its application to wine quality prediction using UCI's Wine Quality Data Set.

### 1.1 Support Vector Machine
Consider the dataset in the following image.

![image1](..\Resources\svm_img1.png)

In this image, we have a few samples belonging to two classes: one is represented by white circles and the other by dark circles. X<sub>1</sub> and X<sub>2</sub> are two features (also called attributes, measurements or predictors) whose ability to predict the two classes of interest is being studied. The task of the classifier is to find a 'boundary' between the samples of the dataset such that the two classes are separated as well as possible. Classifiers carry out a statistical analysis of the (training) dataset and find such boundaries automatically. While searching for an optimal boundary, a classifier may evaluate many candidates, like H<sub>1</sub>, H<sub>2</sub> and H<sub>3</sub> shown in the figure, and decide on the best one. Note that H<sub>1</sub> does NOT separate all samples, but H<sub>2</sub> and H<sub>3</sub> do. Though both H<sub>2</sub> and H<sub>3</sub> are valid boundaries, H<sub>3</sub> is considered better since it leaves the 'maximum' space on either side for 'unseen' samples (samples that are not in the training set). SVM is formulated to efficiently find this 'maximal margin hyperplane' (the red line) for a given dataset. In other words, SVM is guaranteed to find a maximal margin hyperplane if one exists (i.e. if the data is linearly separable).
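
The comparison between candidate boundaries can be made concrete with a small sketch. The toy points and the three candidate lines below are made up for illustration (they are not the data behind the figure); the code only measures the margin of each candidate, it does not solve the SVM optimization problem.

```python
import numpy as np

# Hypothetical toy data: two features, labels -1 and +1 (not the figure's data).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any sample to the line w.x + b = 0.

    A negative value means at least one sample lies on the wrong side.
    """
    w = np.asarray(w, dtype=float)
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

# Three hypothetical candidates playing the roles of H1, H2 and H3.
candidates = {
    "H1": ([1.0, 0.0], -0.5),  # line x1 = 0.5: puts (1, 0) on the wrong side
    "H2": ([1.0, 0.0], -1.5),  # line x1 = 1.5: separates, but with a small margin
    "H3": ([1.0, 1.0], -3.5),  # line x1 + x2 = 3.5: separates with the widest margin
}

margins = {name: geometric_margin(w, b, X_toy, y_toy)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)

for name, margin in margins.items():
    print(name, round(float(margin), 3))
print("Candidate with the largest margin:", best)
```

An SVM effectively searches over all possible lines rather than a handful of candidates, and returns the one whose margin is largest.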

That brings us to the question: what happens when the data is NOT linearly separable? Consider the dataset below (left part of the image):

![image1](..\Resources\svm_img2.png)

As in the previous dataset, the samples belong to two classes and the two predictors are on the x and y axes. This time, however, there is NO straight line that separates them. What makes SVMs really popular is their ability to work well even on this type of dataset. SVM uses the notion of a 'kernel' to implicitly map the original dataset to a very high dimensional space, where the samples become linearly separable (see the right part of the image). Once a kernel is chosen, the SVM does this automatically for us. The mapping relies on the so-called kernel trick: the algorithm only needs inner products between samples in the high dimensional space, and the kernel computes those directly from the original features, which keeps the computation feasible.
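
To illustrate the idea of a mapping that makes data linearly separable, here is a minimal sketch with an explicit, hand-picked feature map. The ring-shaped data and the map (x, y) -> (x, y, x² + y²) are assumptions chosen for illustration; a kernel SVM does not construct such features explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an inner cluster (class 0) surrounded by a ring (class 1).
n = 100
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
radii = np.concatenate([rng.uniform(0.0, 1.0, size=n),   # inner class
                        rng.uniform(2.0, 3.0, size=n)])  # outer class
X_ring = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y_ring = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Explicit feature map: add the squared distance from the origin as a third feature.
z = (X_ring ** 2).sum(axis=1)
X_lifted = np.column_stack([X_ring, z])

# No straight line separates the classes in 2-D, but in the lifted space
# the flat plane z = 2.25 does.
threshold = 2.25
predicted = (X_lifted[:, 2] > threshold).astype(int)
accuracy = (predicted == y_ring).mean()
print("Accuracy of the plane z = 2.25 in the lifted space:", accuracy)
```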

Support Vector Machines can be applied not only to classification problems but also to regression. The regression variant retains the main features that characterize the maximum margin algorithm: a non-linear function is learned by a linear learning machine operating in a high dimensional, kernel-induced feature space. The capacity of the system is controlled by parameters that do not depend on the dimensionality of the feature space.

In this notebook, we use the Support Vector Regression (SVR) implementation provided by Python's scikit-learn library. The library has several built-in kernels, including the Radial Basis Function (RBF) kernel.
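
scikit-learn's `SVR` is an epsilon-SVR: prediction errors smaller than `epsilon` are not penalized at all. The sketch below writes out that epsilon-insensitive loss on made-up numbers; the helper name `epsilon_insensitive_loss` is introduced here for illustration and is not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical quality scores and predictions.
y_true_demo = np.array([5.0, 6.0, 7.0])
y_pred_demo = np.array([5.4, 7.5, 7.0])

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=1.0)
print(losses)  # only the second prediction misses by more than epsilon
```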

## 2. Load Data
Let us start by loading the data and taking a quick peek at it as a sanity check. The dataset contains details of white and red wines in separate CSV files. We load both files and combine them into a single data frame.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import seaborn as sns
from sklearn.metrics import confusion_matrix
%matplotlib inline


# White wine samples are tagged color = 0, red wine samples color = 1
url_white = '../Resources/winequality-white.csv'
dataset_w = pd.read_csv(url_white, sep=';')
dataset_w["color"] = 0
url_red = '../Resources/winequality-red.csv'
dataset_r = pd.read_csv(url_red, sep=';')
dataset_r["color"] = 1
dataset = pd.concat([dataset_w, dataset_r], axis=0)
print(dataset.shape)
dataset.head()

(6497, 13)


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 | 0 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 | 0 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 | 0 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | 0 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | 0 |


The dataset appears to have been loaded properly. We have 6497 samples in our dataset. The column 'quality' is our target column and the 11 original remaining columns provide various physicochemical properties of the wine samples. We have added one additional column, 'color', which is set to 0 for white wine and 1 for red.

Next, let us look at the distribution of our target column 'quality'.
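
A minimal sketch of one way to tabulate such a distribution with pandas is shown here. The `quality` values are made up so that the snippet runs on its own; in the notebook the same calls would be applied to `dataset["quality"]`.

```python
import pandas as pd

# Hypothetical stand-in for dataset["quality"]; these values are made up.
quality = pd.Series([5, 6, 6, 7, 5, 6, 8, 4, 6, 5], name="quality")

# Count how many samples fall into each quality score, ordered by score.
counts = quality.value_counts().sort_index()
shares = (counts / counts.sum()).round(2)

print(pd.DataFrame({"count": counts, "share": shares}))
```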

## 3. Train and Test Set
Building a machine learning model requires a significant amount of parameter tuning before we arrive at a good model. In this process, however, we might end up overfitting to our data. Hence, at the end of the process we need to confirm that we have not overfit and that our model indeed works well on unseen data. To do so, we keep a few samples out of the training phase, call them the test set, and use them only once, at the end of the training phase. Usually, the performance of the model is reported on such a test set.

In [2]:
X = dataset.drop(columns = "quality")
y = dataset["quality"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1, random_state=0)

print("Number of train samples = " + str(X_train.shape[0]))
print("Number of test samples = " + str(X_test.shape[0]))

Number of train samples = 5847
Number of test samples = 650


We now have 5847 samples in the train set and the remaining 650 (about 10%) in the test set.

## 4. Data Normalization
When dataset columns have very different value ranges, columns with large ranges tend to dominate the loss function (and, for kernel methods, the distances between samples), so columns with smaller ranges effectively receive less importance. Data normalization is a technique to alleviate this undesirable effect.

First, let us confirm that the columns indeed have different value ranges.

In [4]:
dataset[['chlorides', 'density', 'citric acid', 'volatile acidity', 'sulphates', 'pH', 'alcohol', 'fixed acidity', 'residual sugar', 'free sulfur dioxide', 'total sulfur dioxide']].describe().loc[['min','max']]

| | chlorides | density | citric acid | volatile acidity | sulphates | pH | alcohol | fixed acidity | residual sugar | free sulfur dioxide | total sulfur dioxide |
|---|---|---|---|---|---|---|---|---|---|---|---|
| min | 0.009 | 0.98711 | 0.0 | 0.08 | 0.22 | 2.72 | 8.0 | 3.8 | 0.6 | 1.0 | 6.0 |
| max | 0.611 | 1.03898 | 1.66 | 1.58 | 2.0 | 4.01 | 14.9 | 15.9 | 65.8 | 289.0 | 440.0 |


We can say that 'chlorides', 'density', 'citric acid', 'volatile acidity' and 'sulphates' have a low dynamic range; 'pH', 'alcohol', 'fixed acidity' and 'residual sugar' have a medium dynamic range; and 'free sulfur dioxide' and 'total sulfur dioxide' have a very high dynamic range.

There are many ways to normalize this data. We use the frequently used StandardScaler(), which standardizes features by removing the mean and scaling to unit variance. Note that the scaler is fitted on the train set only and then applied unchanged to the test set.

In [5]:
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)
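
To make concrete what this standardization computes, here is a small NumPy sketch on made-up numbers (the `_demo` arrays are hypothetical, not the wine data). It mirrors the fit-on-train, apply-to-test pattern used above.

```python
import numpy as np

# Hypothetical features with very different value ranges (two columns).
X_train_demo = np.array([[0.03, 120.0],
                         [0.05, 200.0],
                         [0.04, 160.0],
                         [0.08,  80.0]])
X_test_demo = np.array([[0.06, 140.0]])

# "Fit": per-column statistics are estimated on the train set only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": the same statistics are applied to both train and test sets.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
```

Reusing the train-set statistics for the test set keeps information about the test samples out of the training phase.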



Now let us do Grid-Search for the best model on this standardized dataset.

## 5. Build and test SVR Model
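We tune three parameters of SVR:
- kernel - linear or non-linear (for example, RBF)
- C      - regularization parameter that penalizes training errors; smaller values mean stronger regularization, which helps prevent overfitting
- gamma  - kernel coefficient (used by the RBF kernel)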
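
To get a feel for what gamma does, recall that the RBF kernel measures the similarity of two samples as exp(-gamma * squared distance). The helper below is a hand-written illustration of that formula on made-up points, not scikit-learn's own implementation.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF similarity between two samples: exp(-gamma * ||a - b||^2)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two hypothetical samples whose squared distance is 2.
a = [0.0, 0.0]
b = [1.0, 1.0]

# Larger gamma makes the similarity fall off faster with distance,
# so each training sample influences a smaller neighbourhood.
for gamma in (0.1, 0.5, 0.9):
    print(gamma, round(rbf_kernel_value(a, b, gamma), 4))
```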

Parameters 'C' and 'gamma' can be set to arbitrary positive floating-point numbers, which means we can try a very large number of different models on the same train set. Hence, we need a mechanism to select the best of these models. That is where cross-validation comes into the picture. The idea is to divide the train set into k subsets, train the model with one particular combination of the above parameters on k-1 subsets, and validate it on the remaining subset. We do this k times for a fixed parameter combination, leaving out a different subset every time. After the k<sup>th</sup> step, we average the performance across the k steps; that average is the performance of that particular parameter combination. We then change the parameter combination and repeat the k steps until we have tried all the parameter combinations of interest. That is a lot of computation. Fortunately, scikit-learn's GridSearchCV() utility can do it automatically for us.
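
The procedure can also be written out by hand to see what is going on. The sketch below is only an illustration under loud assumptions: it uses synthetic data, and a closed-form ridge regression with a single parameter `alpha` stands in for SVR and its parameters so that the example stays short and needs only NumPy. All function names are hypothetical helpers, not library calls.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression (stand-in model); returns weights and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def manual_cv_score(X, y, alpha, k=5, seed=0):
    """Average validation R^2 over k folds for one parameter value."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)  # k roughly equal subsets
    scores = []
    for i in range(k):
        val = folds[i]  # subset left out this round
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = fit_ridge(X[train], y[train], alpha)
        scores.append(r2(y[val], X[val] @ w + b))
    return float(np.mean(scores))

# Hypothetical synthetic regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# The "grid": every candidate value is scored on the same k folds.
grid = [0.01, 1.0, 100.0, 10000.0]
cv_scores = {alpha: manual_cv_score(X_demo, y_demo, alpha) for alpha in grid}
best_alpha = max(cv_scores, key=cv_scores.get)

print(cv_scores)
print("Best alpha:", best_alpha)
```

With several parameters, the dictionary comprehension would simply iterate over every combination instead of a single list of values.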

In [7]:
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
              {'C': [1, 2,4,6,8], 'kernel': ['rbf'],
               'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]


In [8]:

grid_model = GridSearchCV(svm.SVR(kernel='rbf',epsilon=1.0), parameters, cv=5)
grid_model.fit(X_train_norm, y_train)
print('Best Model Score: {:.2f}'.format(grid_model.best_score_))


print('Best Kernel:', grid_model.best_estimator_.kernel)
print('Best C:', grid_model.best_estimator_.C)
print('Best Gamma:', grid_model.best_estimator_.gamma)

Best Model Score: 0.36
Best Kernel: rbf
Best C: 2
Best Gamma: 0.1


In [10]:
import pickle
filename = 'svr_model.model'
# Save the best estimator; the context manager closes the file after writing
with open(filename, 'wb') as f:
    pickle.dump(grid_model.best_estimator_, f)
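
One caveat when saving a model this way: the pickled estimator expects standardized inputs, so the fitted scaler is needed again at prediction time. A possible approach, sketched here on synthetic data, is to bundle the scaler and the regressor in a scikit-learn pipeline and pickle that single object. The data, the hyperparameter values and the in-memory round trip are assumptions for illustration, not the notebook's actual workflow.

```python
import pickle

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data with two features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.05, 100.0], scale=[0.01, 30.0], size=(60, 2))
y_demo = X_demo[:, 1] / 50.0 + rng.normal(scale=0.1, size=60)

# Scaler and regressor travel together, so raw features can be passed in later.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=2, gamma=0.1))
pipeline.fit(X_demo, y_demo)

# Round trip through pickle (in memory here; a file works the same way).
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

same = np.allclose(pipeline.predict(X_demo), restored.predict(X_demo))
print("Restored pipeline reproduces the predictions:", same)
```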

It is common practice to report the model's performance (here, the $R^2$ score) on a hold-out test set.

In [11]:
print(r2_score(y_test, grid_model.predict(X_test_norm)))

0.4477171108966258
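
For reference, the $R^2$ score compares the model's squared errors with those of a baseline that always predicts the mean of the target. The sketch below writes this definition out on made-up values; `r_squared` is a hypothetical helper, not the scikit-learn function used above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical quality scores.
y_true_demo = [5, 6, 7, 6]

perfect = r_squared(y_true_demo, y_true_demo)    # predictions match exactly
baseline = r_squared(y_true_demo, [6, 6, 6, 6])  # always predict the mean

print(perfect, baseline)
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean.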

We tried both the 'linear' and the 'rbf' kernels. For the linear kernel, we tried 4 different penalty values: 1, 10, 100 and 1000. For the RBF kernel, we tried 5 penalty values (1, 2, 4, 6 and 8) combined with 9 different gamma values ranging from 0.1 to 0.9. To assess each combination, we used cross-validation with k = 5.

In spite of such an extensive grid search, the best model's $R^2$ on the test set is only about 0.45 (its cross-validated score is 0.36). Though this number is still small, we have achieved nearly 12% improvement by switching from the Linear Regression algorithm to Support Vector Regression, which is encouraging. Maybe we can do better if we use XGBoost or Random Forest with a more refined grid search.