We compare the performance of three regression methods on the Wine Quality data set (http://archive.ics.uci.edu/ml/datasets/Wine+Quality). We consider only the red wine data, which has 1599 samples: the first 1400 samples are used for training and the last 199 samples for testing. The goal is to build a linear model of the first 11 features (together with a constant term) that predicts the quality of the wine. All models are trained by solving the following optimization problem:
$$
\underset{w,\beta}{\text{minimize}} \quad \sum_{i=1}^{n} l(x_i^{T}w+\beta - y_i)
$$
where $l$ is one of the following loss functions:

• least squares loss $ l(t) = t^{2} $

• Huber loss defined in the previous problem, with M = 1

• deadzone-linear loss $$ l(t) = \left\{
	\begin{array}{ll}
		0  & \mbox{if } |t| \leq 0.5 \\
		|t|- 0.5 & \mbox{if } |t| > 0.5
	\end{array}
\right. $$ 
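As a quick reference, the three losses can be sketched as vectorized NumPy functions. This is only an illustration: the function names are hypothetical, and the Huber form used here ($t^2$ for $|t| \leq M$, $M(2|t| - M)$ otherwise) is an assumption, since the "previous problem" is not reproduced in this notebook. It is the form that CVXPY's `cp.huber` documents.

```python
import numpy as nmp  # same alias as the notebook

def least_squares_loss(t):
    """l(t) = t^2."""
    t = nmp.asarray(t, dtype=float)
    return t ** 2

def huber_loss(t, M=1.0):
    """Assumed Huber form: t^2 if |t| <= M, else M * (2|t| - M)."""
    t = nmp.asarray(t, dtype=float)
    a = nmp.abs(t)
    return nmp.where(a <= M, t ** 2, M * (2 * a - M))

def deadzone_linear_loss(t, width=0.5):
    """l(t) = 0 if |t| <= width, else |t| - width."""
    t = nmp.asarray(t, dtype=float)
    return nmp.maximum(0.0, nmp.abs(t) - width)

# Hypothetical residuals, chosen to hit every branch of each loss
residuals = nmp.array([-2.0, -0.5, 0.0, 0.25, 3.0])
print(least_squares_loss(residuals))
print(huber_loss(residuals))
print(deadzone_linear_loss(residuals))
```

For large residuals the Huber and deadzone-linear losses grow linearly rather than quadratically, so a few badly predicted wines influence the fit less than they do under least squares.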

In [0]:
# importing libraries and reading data from file
import pandas as pnd
import numpy as nmp
dataFile = pnd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep = ';')
dataFile.shape
dataFile.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [0]:
from sklearn.model_selection import train_test_split

featureColNames  = ['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']
outputName = ['quality']
X_features = dataFile[featureColNames].values
y = dataFile[outputName].values
# Split the 1599 samples in file order: the first 1400 for training, the last 199 for testing
X_train, X_test, y_train, y_test = train_test_split(X_features, y, test_size=199, shuffle=False)
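Because the split is made without shuffling, it should be equivalent to plain slicing of the arrays. The sketch below shows the idea on synthetic stand-in arrays (`X_demo` and `y_demo` are hypothetical placeholders with the same shapes as the wine data, not the data itself).

```python
import numpy as nmp  # same alias as the notebook

n_samples, n_train = 1599, 1400

# Synthetic stand-ins for the 1599 x 11 feature matrix and the quality column
X_demo = nmp.arange(n_samples * 11, dtype=float).reshape(n_samples, 11)
y_demo = nmp.arange(n_samples, dtype=float).reshape(n_samples, 1)

# First 1400 rows for training, remaining 199 rows for testing
X_tr, X_te = X_demo[:n_train], X_demo[n_train:]
y_tr, y_te = y_demo[:n_train], y_demo[n_train:]

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Checking the shapes this way is a cheap guard against an off-by-one in the split sizes.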

In [0]:
# Fit an ordinary least-squares model on the training data and predict on the test data
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train,y_train) 
y_prediction = reg.predict(X_test)

In [0]:
# Mean absolute error on the test set
error = nmp.mean(nmp.abs(y_prediction - y_test), axis=0)
print('Mean Absolute Error: ' + str(error))

Mean Absolute Error: [0.53296711]


In [0]:
# Fit the Huber-loss model as a convex optimization problem with CVXPY
import cvxpy as cp
# One weight per feature (11 features) plus a scalar offset b
weight = cp.Variable((11,1))
b = cp.Variable(1)
# Huber threshold M = 1
cost = cp.sum(cp.huber((X_train@weight + b) - y_train, 1))
prob = cp.Problem(cp.Minimize(cost))
prob.solve(verbose = True)

-----------------------------------------------------------------
           OSQP v0.6.0  -  Operator Splitting QP Solver
              (c) Bartolomeo Stellato,  Goran Banjac
        University of Oxford  -  Stanford University 2019
-----------------------------------------------------------------
problem:  variables n = 4212, constraints m = 4200
          nnz(P) + nnz(A) = 26482
settings: linear system solver = qdldl,
          eps_abs = 1.0e-05, eps_rel = 1.0e-05,
          eps_prim_inf = 1.0e-04, eps_dual_inf = 1.0e-04,
          rho = 1.00e-01 (adaptive),
          sigma = 1.00e-06, alpha = 1.60, max_iter = 10000
          check_termination: on (interval 25),
          scaling: on, scaled_termination: off
          warm start: on, polish: on, time_limit: off

iter   objective    pri res    dua res    rho        time
   1  -2.2400e+04   8.00e+00   5.24e+06   1.00e-01   7.69e-03s
 100   5.3542e+02   4.44e-07   1.38e-07   1.51e-01   3.94e-02s
plsh   5.3542e+02   2.05e-14   2.28e-11  

535.4177751956599

In [0]:
# Predict on the test set with the weights and offset learned on the training data
y_prediction = nmp.dot(X_test, weight.value) + b.value
error = nmp.mean(nmp.abs(y_prediction - y_test), axis=0)

print('Mean Absolute Error: ' + str(error))

Mean Absolute Error: [0.53271565]


In [0]:
# Fit the deadzone-linear loss model as a convex optimization problem, to be evaluated on the test data

weight = cp.Variable((11,1))
b = cp.Variable(1)
cost = cp.maximum(0,(cp.abs((X_train@weight + b)- y_train) - 0.5))
obj = cp.sum(cost)
prob = cp.Problem(cp.Minimize(obj))
prob.solve(solver=cp.ECOS)

210.32700397403707
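
The deadzone-linear problem solved above can also be written as a linear program by introducing one slack variable per training sample. The following is a sketch of that standard epigraph reformulation; the slack variables $u_i$ are notation introduced here for illustration and do not appear in the notebook code.

$$
\begin{array}{ll}
\underset{w,\beta,u}{\text{minimize}} & \sum_{i=1}^{n} u_i \\
\text{subject to} & -u_i - 0.5 \leq x_i^{T}w+\beta - y_i \leq u_i + 0.5, \quad i = 1,\dots,n \\
& u_i \geq 0, \quad i = 1,\dots,n
\end{array}
$$

At the optimum each $u_i$ equals $\max(0, |x_i^{T}w+\beta - y_i| - 0.5)$, which is the deadzone-linear loss of the $i$-th residual.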

In [0]:
# Evaluate the deadzone-linear model on the test set
y_prediction = nmp.dot(X_test, weight.value) + b.value
error = nmp.mean(nmp.abs(y_prediction - y_test), axis=0)

print('Mean Absolute Error: ' + str(error))

Mean Absolute Error: [0.54810579]
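
To compare the three models side by side, the test errors printed in the cell outputs above can be collected and ranked. This is a small sketch that hard-codes the reported values rather than recomputing them; the dictionary name and labels are hypothetical.

```python
# Test-set mean absolute errors as reported in the cell outputs above
reported_mae = {
    "least squares": 0.53296711,
    "Huber (M = 1)": 0.53271565,
    "deadzone-linear": 0.54810579,
}

# Print the models from lowest to highest test error
for name, mae in sorted(reported_mae.items(), key=lambda kv: kv[1]):
    print(f"{name:<16} {mae:.4f}")

best = min(reported_mae, key=reported_mae.get)
print("Lowest test error:", best)
```

On this split the Huber and least-squares models are nearly tied, with the Huber model slightly ahead, while the deadzone-linear model has the largest mean absolute error.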
