### Assignment-3

### Dataset: Red Wine Quality

### The dataset is related to the red variant of "Vinho Verde" wine. It contains 1599 data points whose features are physicochemical properties; the target, quality, is an integer score from 0 to 10. Your task is to classify whether a given wine is good based on its physicochemical properties.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('winequality-red.csv')
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


### (i) Create a new column in the dataset with binary values (i.e., 0 or 1) indicating whether the wine is of good quality. Categorise wines with quality >= 7 as good quality. Drop the original ‘quality’ column.


In [3]:
df['is_good_quality'] = df['quality'].apply(lambda x:1 if x>=7 else 0)

In [4]:
df.drop('quality', axis=1, inplace=True)

In [5]:
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,is_good_quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,0
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,0
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,0
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,0
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,0
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,0
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,0
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,0
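The same labelling can also be written as a vectorized comparison instead of a row-wise `apply`. The sketch below assumes a small made-up frame (`toy`, with invented values) rather than the wine data, and only illustrates the pattern.

```python
import pandas as pd

# Toy frame standing in for the wine data; the values are invented for illustration.
toy = pd.DataFrame({"alcohol": [9.4, 11.2, 12.8, 10.0], "quality": [5, 7, 8, 6]})

# Comparing the whole column at once yields a boolean Series; astype(int) maps it to 0/1.
toy["is_good_quality"] = (toy["quality"] >= 7).astype(int)
toy = toy.drop(columns="quality")

print(toy["is_good_quality"].tolist())
print(toy["is_good_quality"].mean())  # share of rows labelled as good
```

Taking the mean of the new column is a quick way to see how imbalanced the two classes are before modelling.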


### (ii) Perform the data pre-processing steps that you feel are important for the given dataset.


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [7]:
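# Separate the features (X) from the target (y).
# Note: 'Unnamed: 0' is the row index saved in the CSV, not a physicochemical property;
# it stays in X here, so the models below see it as a feature.
X = df.drop('is_good_quality', axis=1)
y = df['is_good_quality']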

In [8]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
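Because only wines with quality >= 7 count as good, the positive class is likely the minority. One option, not used in this notebook, is to pass `stratify` to `train_test_split` so both splits keep the same class ratio. The sketch below uses synthetic labels (`y_demo`, an assumed 20% positive rate) purely to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 40 positives out of 200 (an illustrative ratio).
y_demo = np.array([1] * 40 + [0] * 160)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Switching the notebook's split to a stratified one would change which rows land in the test set, so the reported scores would need to be recomputed.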

In [9]:
# Standardizing the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
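The scaler is fitted on the training split only and then reused on the test split, so no test statistics leak into training. The sketch below checks that on synthetic data (`X_demo`, `y_demo` are made up) and shows an alternative layout, assumed here rather than taken from the notebook, where `make_pipeline` bundles the scaler with the classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (not the wine data).
X_demo = rng.normal(loc=[10.0, 0.5], scale=[3.0, 0.1], size=(200, 2))
y_demo = (X_demo[:, 0] + 20 * X_demo[:, 1] > 20).astype(int)

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# Manual route: learn mean and std from the training rows, apply them to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))

# Pipeline route: fit() scales with training statistics, score() reuses them on new data.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

The pipeline form is convenient with cross-validation, where the scaler has to be refitted inside every fold.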

### (iii) Apply the following classification algorithms to the given dataset (you may use the scikit-learn library unless ‘from scratch’ is specified):
### (a) Logistic Regression
### (b) K-Nearest Neighbors
### (c) Decision Trees Classifier
### (d) Random Forest Classifier
### (e) Logistic Regression from Scratch 

### (iv) Evaluate all your models based on the accuracy and F1 scores obtained on the test dataset.
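Reporting both metrics matters when the classes are imbalanced: accuracy can look high even if few good wines are found, while the F1 score focuses on the positive class. The sketch below uses 100 synthetic labels with 10 positives (an assumed ratio, not measured on this dataset) to make the difference concrete.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 100 synthetic labels with 10 positives (illustrative imbalance).
y_true = np.array([1] * 10 + [0] * 90)

# Baseline that always predicts the majority class.
y_majority = np.zeros_like(y_true)

# A classifier that finds 6 of the 10 positives and raises 4 false alarms.
y_model = y_true.copy()
y_model[:4] = 0      # 4 missed positives
y_model[10:14] = 1   # 4 false positives

for name, pred in [("majority", y_majority), ("model", y_model)]:
    acc = accuracy_score(y_true, pred)
    # zero_division=0 silences the warning when no positives are predicted.
    f1 = f1_score(y_true, pred, zero_division=0)
    print(f"{name}: accuracy={acc:.2f}, f1={f1:.2f}")
```

On this toy data the majority baseline reaches 0.90 accuracy with an F1 of 0.00, while the second classifier scores 0.92 accuracy and 0.60 F1, so the two metrics rank the models very differently.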


In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

In [11]:
# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

# K-Nearest Neighbors
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

# Decision Trees Classifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)

# Random Forest Classifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Evaluating the models
models = [lr_model, knn_model, dt_model, rf_model]
preds = [lr_pred, knn_pred, dt_pred, rf_pred]

for model, pred in zip(models, preds):
    accuracy = accuracy_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    print(f"{model.__class__.__name__}: Accuracy - {accuracy:.4f}, F1 Score - {f1:.4f}",'\n')

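# (e) Logistic Regression from scratch, trained with batch gradient descent

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, theta):
    """Mean binary cross-entropy of the predictions sigmoid(X @ theta)."""
    m = len(y)
    # Clip so that log() never receives exactly 0 or 1
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    cost = (1 / m) * (-y.T @ np.log(h) - (1 - y).T @ np.log(1 - h))
    return cost

def gradient_descent(X, y, theta, learning_rate, iterations):
    """Run `iterations` full-batch updates and return the fitted parameters."""
    m = len(y)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        gradient = X.T @ (h - y) / m
        theta = theta - learning_rate * gradient  # leaves the caller's array untouched
    return theta

# X_train holds the standardized training features, y_train the binary target
X_train = np.c_[np.ones((X_train.shape[0], 1)), X_train]  # Prepending a column of ones for the bias term
theta = np.zeros(X_train.shape[1])  # Initializing all parameters to zero

learning_rate = 0.01
iterations = 1500

# Pass the target as a NumPy array so the arithmetic does not depend on the pandas index
theta = gradient_descent(X_train, y_train.to_numpy(), theta, learning_rate, iterations)

# Now we can use the obtained theta for predictions on the test set
X_test = np.c_[np.ones((X_test.shape[0], 1)), X_test]  # Prepending a column of ones for the bias term
predictions = sigmoid(X_test @ theta)

# Evaluating the model: probabilities of at least 0.5 are labelled as good quality
predictions_binary = (predictions >= 0.5).astype(int)
accuracy = accuracy_score(y_test, predictions_binary)
f1 = f1_score(y_test, predictions_binary)

print(f"Logistic Regression from Scratch: Accuracy - {accuracy:.4f}, F1 Score - {f1:.4f}")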


LogisticRegression: Accuracy - 0.8656, F1 Score - 0.3768 

KNeighborsClassifier: Accuracy - 0.8812, F1 Score - 0.5128 

DecisionTreeClassifier: Accuracy - 0.8844, F1 Score - 0.5934 

RandomForestClassifier: Accuracy - 0.9000, F1 Score - 0.5897 

Logistic Regression from Scratch: Accuracy - 0.8625, F1 Score - 0.3125
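The from-scratch model relies on `X.T @ (h - y) / m` being the gradient of the cross-entropy cost. A quick way to sanity-check such a formula is to compare it with a finite-difference estimate. The sketch below does this on random synthetic data; the helper names (`cost`, `analytic_gradient`, `numeric_gradient`) and the step size are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, theta, eps=1e-12):
    # Clipping keeps log() finite if a probability saturates at 0 or 1.
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(X, y, theta):
    # Same expression as in the gradient descent loop.
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numeric_gradient(X, y, theta, step=1e-6):
    # Central differences: perturb one parameter at a time.
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        bump = np.zeros_like(theta)
        bump[j] = step
        grad[j] = (cost(X, y, theta + bump) - cost(X, y, theta - bump)) / (2 * step)
    return grad

rng = np.random.default_rng(0)
X_demo = np.c_[np.ones(50), rng.normal(size=(50, 3))]  # bias column plus 3 random features
y_demo = rng.integers(0, 2, size=50).astype(float)
theta_demo = rng.normal(size=4)

max_diff = np.max(np.abs(
    analytic_gradient(X_demo, y_demo, theta_demo)
    - numeric_gradient(X_demo, y_demo, theta_demo)
))
print("largest gap between analytic and numeric gradient:", max_diff)
```

A gap close to zero suggests the update rule matches the cost; the gap between the from-scratch and scikit-learn scores is then more likely due to the number of iterations, the learning rate, or the regularization that scikit-learn's `LogisticRegression` applies by default.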
