We test the performance of three classification methods on the ionosphere data
set https://archive.ics.uci.edu/ml/datasets/ionosphere. There are 351 samples. We
use the first 300 samples for training, and the last 51 samples for testing. The goal is to build
a linear model of the 34 features (together with a constant term) to predict the binary (±1)
outcome. All models are trained by solving the following optimization problem.

$$
\begin{equation*}
\begin{aligned}
\underset{w,\beta}{\text{minimize}}
\sum_{i=1}^{n} l(x_i^{T}w+\beta , y_i)
\end{aligned}
\end{equation*}$$
where the loss functions are

• least squares loss $ l(t) = (t-y)^{2} $

• logistic loss $ l(t,y) = log(1+exp(-yt)) $

• hinge loss $ l(t,y) = max(0,1-yt) $


In [2]:
# importing libraries and reading data from file
import pandas as pnd
import numpy as nmp
columnnames = []
for c in range (0,35):
  columnnames.append('col'+str(c))
#columnnames
dataFile = pnd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data', sep = ',',names=columnnames, header=None)
dataFile.shape
dataFile.head()

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34
0,1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1.0,0.0376,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.3409,0.42267,-0.54487,0.18641,-0.453,g
1,1,0,1.0,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1.0,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1.0,-0.13151,-0.453,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.1904,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
2,1,0,1.0,-0.03365,1.0,0.00485,1.0,-0.12062,0.88965,0.01198,0.73082,0.05346,0.85443,0.00827,0.54591,0.00299,0.83775,-0.13644,0.75535,-0.0854,0.70887,-0.27502,0.43385,-0.12062,0.57528,-0.4022,0.58984,-0.22145,0.431,-0.17365,0.60436,-0.2418,0.56045,-0.38238,g
3,1,0,1.0,-0.45161,1.0,1.0,0.71216,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.14516,0.54094,-0.3933,-1.0,-0.54467,-0.69975,1.0,0.0,0.0,1.0,0.90695,0.51613,1.0,1.0,-0.20099,0.25682,1.0,-0.32382,1.0,b
4,1,0,1.0,-0.02401,0.9414,0.06531,0.92106,-0.23255,0.77152,-0.16399,0.52798,-0.20275,0.56409,-0.00712,0.34395,-0.27457,0.5294,-0.2178,0.45107,-0.17813,0.05982,-0.35575,0.02309,-0.52879,0.03286,-0.65158,0.1329,-0.53206,0.02431,-0.62197,-0.05707,-0.59573,-0.04608,-0.65697,g


In [0]:
from sklearn.model_selection import train_test_split
X_features = dataFile[columnnames[0:34]].values
y = dataFile[columnnames[34]].values
#The data sample of 351 records being split into train set(300) and test set(51)
X_train, X_test, y_train, y_test = train_test_split(X_features, y, test_size=0.143, random_state=42, shuffle = False)
#assigning 0 for label 'b' and 1 for label 'g'
for i in range(len(y_train)):
  if(y_train[i] == 'g'):
    y_train[i] = 1
  else:
    y_train[i] = -1
for i in range(len(y_test)):
  if(y_test[i] == 'g'):
    y_test[i] = 1
  else:
    y_test[i] = -1

In [0]:
#Applying Linear regression to fit least square and using the same to predict test data
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train,y_train) 
y_prediction = reg.predict(X_test)

In [42]:
#Calculating accuracy rate
error = 0
for i in range (0,51):
  y_prediction[i] = 1 if y_prediction[i] > 0 else -1
  if(y_test[i] != y_prediction[i]):
    error += 1

error /= .51
print('Accuracy Rate: '+ str(100-error))

Accuracy Rate: 100.0


In [81]:
#Applying Dead-zone linear loss in Convex Optimization form for building the model and using the same to predict test data 
import cvxpy as cp
#Generating weight vector of length 34, as there are 34 features 
weight = cp.Variable((34,1))
b = cp.Variable(1)
cost = cp.maximum(0,(1 - y_train@(X_train@weight + b)))
obj = cp.sum(cost)
prob = cp.Problem(cp.Minimize(obj))
prob.solve(solver=cp.ECOS)

4.2481105734661005e-13

In [82]:
# testing the model
y_prediction = nmp.dot(X_test , weight.value) + b.value
#Calculating accuracy rate
error = 0
for i in range (0,51):
  y_prediction[i] = 1 if y_prediction[i] > 0 else -1
  if(y_test[i] != y_prediction[i]):
    error += 1

error /= .51
print('Accuracy Rate: '+ str(100-error))


Accuracy Rate: 74.50980392156863


In [76]:
#Applying logistic loss in Convex Optimization form for building the model and using the same to predict test data 
import cvxpy as cp
#Generating weight vector of length 34, as there are 34 features 
weight = cp.Variable((34,1))
b = cp.Variable(1)
cost = cp.logistic(-y_train@(X_train@weight + b))
obj = cp.sum(cost)
prob = cp.Problem(cp.Minimize(obj))
prob.solve(solver=cp.ECOS)

-8.396351001675265e-09

In [85]:
# testing the model
y_prediction = nmp.dot(X_test , weight.value) + b.value
#Calculating accuracy rate
error = 0
for i in range (0,51):
  y_prediction[i] = 1 if y_prediction[i] > 0 else -1
  if(y_test[i] != y_prediction[i]):
    error += 1
#print(error)
error /= .51
print('Accuracy Rate: '+ str(100-error))


Accuracy Rate: 74.50980392156863


We see both the logistic loss and hinge loss performs exactly similar. They give prediction accuracy of 74.5%. Whereas, least square beats them all, with 100% accuracy.