## Q1

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

## 1.a Data Preprocessing

In [2]:
# the white wine and red wine data are already merged into a single winequality dataset
df = pd.read_csv("winequality.csv")
print(df.head())
print(df.shape)

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  type  
0      9.4        5     0  
1      9.8        5     0  
2 

In [3]:
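# columns 0-10 = 11 features, column 11 = quality, column 12 = wine type
# note: the slice 0:10 keeps only the first 10 features (alcohol, column 10, is dropped);
# use 0:11 to include all 11 features
x, y = df.iloc[:, 0:10], df.iloc[:, 11]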

# normalization
n_x = MinMaxScaler().fit_transform(x)

# 80% for training, 20% for testing 
x_train, x_test, y_train, y_test = train_test_split(n_x, y, test_size=0.20)

print('Training set shape: {}'.format(x_train.shape))
print('Training labels shape: {}'.format(y_train.shape))
print('Test set shape: {}'.format(x_test.shape))
print('Test label shape: {}'.format(y_test.shape))

Training set shape: (5197, 10)
Training labels shape: (5197,)
Test set shape: (1300, 10)
Test label shape: (1300,)
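
One caveat about the cell above: `MinMaxScaler` is fitted on the full dataset before the split, so the test rows influence the scaling, and the split is unseeded, so the accuracies reported below will vary between runs. A minimal sketch of a leak-free variant follows. It uses synthetic stand-in data rather than `winequality.csv`, and `random_state=0` and `stratify=y` are illustrative choices that are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features/labels (assumption: not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split first, with a fixed seed so the result is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training rows only,
# then reuses those min/max values to transform the test rows.
model = make_pipeline(MinMaxScaler(), SVC(C=10, kernel="rbf", gamma="scale"))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy: {:.3f}".format(test_accuracy))
```

Wrapping the scaler and the classifier in one pipeline keeps the two steps together, so the same object can later be passed to cross-validation without re-introducing the leak.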


## Support Vector Machine

In [4]:
def SVM(C, kernel, g="scale"):
    # train an SVC on the global training split and return its test accuracy
    svm = SVC(C=C, kernel=kernel, gamma=g)
    svm.fit(x_train, y_train)

    y_pred = svm.predict(x_test)
    acc_svm = metrics.accuracy_score(y_test, y_pred)
    print('SVM Accuracy for {} at C = {} and gamma = {}:'.format(kernel, C, g), acc_svm)
    
    # cm_svm = confusion_matrix(y_test, y_pred)
    # class_names = ['0', '1', '2', '3', '4', '5', '6']

    # print(cm_svm)
    # print(classification_report(y_test, y_pred, target_names=class_names))
    # disp1 = ConfusionMatrixDisplay(confusion_matrix=cm_svm)
    # disp1.plot()
    return acc_svm
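
The commented-out report above hard-codes seven class names, `'0'` to `'6'`, which do not correspond to the quality scores in the data (the preview shows scores such as 5). A hedged sketch of deriving the labels from the data instead is shown below; the toy `y_true` and `y_pred` arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy quality scores (hypothetical values, for illustration only).
y_true = np.array([5, 5, 6, 6, 7, 5, 6, 4])
y_pred = np.array([5, 6, 6, 6, 6, 5, 5, 5])

# Use every label seen in either array so rows and columns stay aligned.
labels = np.unique(np.concatenate([y_true, y_pred]))
cm = confusion_matrix(y_true, y_pred, labels=labels)
report = classification_report(
    y_true, y_pred, labels=labels,
    target_names=[str(label) for label in labels],
    zero_division=0,  # report 0 instead of warning for classes never predicted
)
print(cm)
print(report)
```

A per-class view like this is useful with wine quality data because accuracy alone can hide poor recall on the rare scores.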

In [5]:
C1 = [1, 10, 50, 100]
C2 = [1, 10, 20, 30]

result = np.empty((3, 4), dtype=float)

# RBF kernel
for idc, C in enumerate(C1):
    acc_svm = SVM(C, 'rbf')
    result[0, idc] = acc_svm

# Polynomial kernel
for idc, C in enumerate(C1):
    acc_svm = SVM(C, 'poly')
    result[1, idc] = acc_svm

# Linear kernel (uses the smaller C values in C2; gamma is ignored by this kernel)
for idc, C in enumerate(C2):
    acc_svm = SVM(C, 'linear')
    result[2, idc] = acc_svm



SVM Accuracy for rbf at C = 1 and gamma = scale: 0.546923076923077
SVM Accuracy for rbf at C = 10 and gamma = scale: 0.5930769230769231
SVM Accuracy for rbf at C = 50 and gamma = scale: 0.5838461538461538
SVM Accuracy for rbf at C = 100 and gamma = scale: 0.5892307692307692
SVM Accuracy for poly at C = 1 and gamma = scale: 0.5592307692307692
SVM Accuracy for poly at C = 10 and gamma = scale: 0.5738461538461539
SVM Accuracy for poly at C = 50 and gamma = scale: 0.573076923076923
SVM Accuracy for poly at C = 100 and gamma = scale: 0.5746153846153846
SVM Accuracy for linear at C = 1 and gamma = scale: 0.5069230769230769
SVM Accuracy for linear at C = 10 and gamma = scale: 0.5269230769230769
SVM Accuracy for linear at C = 20 and gamma = scale: 0.5238461538461539
SVM Accuracy for linear at C = 30 and gamma = scale: 0.5261538461538462


Accuracy results as a table: the rows are the kernels and the columns index the regularization values, where C[0] to C[3] are [1, 10, 50, 100] for the RBF and polynomial kernels and [1, 10, 20, 30] for the linear kernel.

In [6]:
# create the dataframe
df = pd.DataFrame(result, columns=["C[0]", "C[1]", "C[2]", "C[3]"],
                index=["RBF", "Poly", "Linear"])

# print the dataframe
print(df)

            C[0]      C[1]      C[2]      C[3]
RBF     0.546923  0.593077  0.583846  0.589231
Poly    0.559231  0.573846  0.573077  0.574615
Linear  0.506923  0.526923  0.523846  0.526154


## 1.b

Comparing the accuracies in the table above, the RBF kernel performed best overall, with an average accuracy of 57.83% compared to 57.02% for the polynomial kernel and only 52.10% for the linear kernel. This makes sense, since the Gaussian (RBF) kernel is generally the preferred default in SVMs: it suits non-linear data and gives a reasonable separation when there is no prior knowledge of the data. On the other hand, the linear kernel is the most basic kernel and is mostly preferred for text classification, while the polynomial kernel is a more general form of the linear kernel. Furthermore, raising the regularization parameter C from 1 to 10 improves test accuracy for all three kernels, after which the accuracy levels off (for RBF it peaks at C = 10). A higher C penalizes the model more for misclassifying training examples and leads to a smaller margin, while a lower C allows more margin violations and leads to a larger margin. Note that the linear kernel was run with a different set of C values ([1, 10, 20, 30]), so its last two columns are not directly comparable with those of the other kernels.
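
The per-kernel averages quoted above can be recomputed from the accuracy table. This small check uses the rounded values printed in the table, so the last digits may differ slightly from averages taken over the raw results.

```python
import numpy as np

# Accuracies copied from the table above (rows: kernels, columns: C[0]..C[3]).
result = np.array([
    [0.546923, 0.593077, 0.583846, 0.589231],  # RBF
    [0.559231, 0.573846, 0.573077, 0.574615],  # Poly
    [0.506923, 0.526923, 0.523846, 0.526154],  # Linear
])
kernels = ["RBF", "Poly", "Linear"]

# Mean over the four C values for each kernel.
mean_acc = result.mean(axis=1)
for name, acc in zip(kernels, mean_acc):
    print("{:<6} mean accuracy: {:.2%}".format(name, acc))

best_kernel = kernels[int(np.argmax(mean_acc))]
print("Best kernel on average:", best_kernel)
```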

## 1.c Improving the model

In [7]:
C = [1, 10, 50, 100]
gamma = [10**-1, 10**0, 10**1, 10**2]

result = np.empty((len(C), len(gamma)), dtype=float)

for idc, c in enumerate(C): 
    for idg, g in enumerate(gamma): 
        acc_svm = SVM(c, 'rbf', g)
        result[idc, idg] = acc_svm

SVM Accuracy for rbf at C = 1 and gamma = 0.1: 0.49538461538461537
SVM Accuracy for rbf at C = 1 and gamma = 1: 0.5169230769230769
SVM Accuracy for rbf at C = 1 and gamma = 10: 0.5776923076923077
SVM Accuracy for rbf at C = 1 and gamma = 100: 0.6215384615384615
SVM Accuracy for rbf at C = 10 and gamma = 0.1: 0.5246153846153846
SVM Accuracy for rbf at C = 10 and gamma = 1: 0.5284615384615384
SVM Accuracy for rbf at C = 10 and gamma = 10: 0.5884615384615385
SVM Accuracy for rbf at C = 10 and gamma = 100: 0.6376923076923077
SVM Accuracy for rbf at C = 50 and gamma = 0.1: 0.5223076923076924
SVM Accuracy for rbf at C = 50 and gamma = 1: 0.5453846153846154
SVM Accuracy for rbf at C = 50 and gamma = 10: 0.5953846153846154
SVM Accuracy for rbf at C = 50 and gamma = 100: 0.64
SVM Accuracy for rbf at C = 100 and gamma = 0.1: 0.5269230769230769
SVM Accuracy for rbf at C = 100 and gamma = 1: 0.5592307692307692
SVM Accuracy for rbf at C = 100 and gamma = 10: 0.6046153846153847
SVM Accuracy for rbf 

Accuracy results for the RBF kernel as a table: the rows are the regularization parameter C and the columns are the gamma values.

In [8]:
df = pd.DataFrame(result, columns=gamma,
                index=C)

# print the dataframe
print(df)

        0.1       1.0       10.0      100.0
1    0.495385  0.516923  0.577692  0.621538
10   0.524615  0.528462  0.588462  0.637692
50   0.522308  0.545385  0.595385  0.640000
100  0.526923  0.559231  0.604615  0.640000
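
The double loop in cell 7 picks `C` and `gamma` by test-set accuracy, which tends to give an optimistic estimate of how the chosen model generalizes. A common alternative is a cross-validated grid search on the training split only. The sketch below uses scikit-learn's `GridSearchCV` on synthetic stand-in data; the data, the seed, and the 3-fold setting are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data already in [0, 1] (assumption: not the wine data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
y = (X[:, 0] > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Same grid as the manual loops above.
param_grid = {"C": [1, 10, 50, 100], "gamma": [0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)  # hyperparameters are chosen on training folds only

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: {:.3f}".format(search.best_score_))
print("Held-out test accuracy: {:.3f}".format(search.score(X_test, y_test)))
```

With this setup the test split is touched once, after the parameters are fixed, so the final number is a cleaner estimate.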


Here we see that the model improves not only with a larger regularization parameter but also with a larger gamma. In general, a high gamma may lead to overfitting, where the model captures noise in the training data and performs poorly on new, unseen data, while a low gamma may lead to underfitting, where the model is too simple to capture the underlying patterns and performs poorly on both the training and test data. Since the wine dataset is fairly small (6,497 samples) and has few features, a higher gamma appears more suitable here to prevent underfitting. The best accuracy in this grid, 0.64, is reached at gamma = 100 with both C = 50 and C = 100. Because gamma = 100 is the largest value tried, the optimum may lie beyond the grid.
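
To make the role of gamma concrete: the RBF kernel is `K(a, b) = exp(-gamma * ||a - b||^2)`, so a larger gamma makes the similarity fall off faster with distance. After min-max scaling every feature lies in [0, 1], which keeps squared distances small. The numeric sketch below uses two hypothetical scaled samples and the gamma values from the grid; the helper `rbf_kernel_value` is written here for illustration and is not a scikit-learn function.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

# Two hypothetical min-max-scaled samples with 10 features each.
a = np.full(10, 0.40)
b = np.full(10, 0.50)  # squared distance to a is 10 * 0.1**2 = 0.1

for gamma in [0.1, 1, 10, 100]:
    value = rbf_kernel_value(a, b, gamma)
    print("gamma = {:>5}: K(a, b) = {:.4f}".format(gamma, value))
```

With a small gamma the kernel value stays close to 1 for these two points, meaning the model can barely tell them apart, which is consistent with the lower accuracies in the gamma = 0.1 column.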