The dataset covers the red variant of "Vinho Verde" wine. It contains 1599 data points; the features are physicochemical properties and the target is quality, an integer score from 0 to 10. Your task is to classify whether a given wine is good based on its physicochemical properties.

(i) Create a new column in the dataset with binary values (i.e., 0 or 1) indicating whether the wine is of good quality. Treat wines with quality >= 7 as good quality. Drop the original ‘quality’ column.

(ii) Perform the data pre-processing steps that you feel are important for the given dataset.

(iii) Apply the following classification algorithms to the given dataset (you may use the scikit-learn library unless ‘from scratch’ is specified):

- Logistic Regression
- K-Nearest Neighbors
- Decision Trees Classifier
- Random Forest Classifier
- Logistic Regression from Scratch

(iv) Evaluate all your models using the accuracy score and F1 score obtained on the test dataset.

In [55]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('winequality-red.csv')

In [56]:
df['good_quality'] = (df['quality'] >= 7).astype(int)
df = df.drop('quality', axis=1)

In [57]:
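# Standardize the features to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset, before the train/test split,
# so test-set statistics influence the preprocessing.
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df.drop('good_quality', axis=1)), columns=df.columns[:-1])
df_scaled['good_quality'] = df['good_quality']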

In [58]:
X = df_scaled.drop('good_quality', axis=1)
y = df_scaled['good_quality']

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
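
The cells above standardize the whole dataset before splitting, so the test rows contribute to the scaler's mean and variance. A minimal sketch of a leakage-free alternative follows: split first, then fit `StandardScaler` on the training split only. It uses synthetic data from `make_classification` as a stand-in for the wine features (an assumption, as is the roughly 14% positive rate), so its numbers do not correspond to any output in this notebook. The `stratify` argument is an added choice that keeps the class ratio similar in both splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 11 features, minority positive class (assumed ratio).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=42,
)

# Split first, keeping the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then apply it to both splits.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # approximately zero for every column
print(y_tr.mean().round(3), y_te.mean().round(3))  # similar positive rates
```

The test split is transformed with the training statistics, so its column means are close to, but not exactly, zero.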

In [60]:
# Import the classifiers and the evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

In [61]:
# Find the best value of k
# Note: k is selected using test-set accuracy, so the reported accuracy is optimistic.
best_k = 0
highest_accuracy = 0

for k in range(1, 21):  # Testing k from 1 to 20
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    if accuracy > highest_accuracy:
        highest_accuracy = accuracy
        best_k = k

print(f"Best k value: {best_k}")
print(f"Highest accuracy: {highest_accuracy}")

Best k value: 15
Highest accuracy: 0.89375
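
The search above picks k by scoring each candidate on the test set, which makes the reported accuracy an optimistic estimate. One common alternative is to choose k by cross-validation on the training data and touch the test set only once at the end. The sketch below does this with `GridSearchCV` and a scaling pipeline. The data is synthetic (an assumption, standing in for the wine features), and the names `X_demo`, `pipe`, and `best_k_cv` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumed shape and class ratio).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using only the training part of that fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k_cv = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best k by 5-fold CV: {best_k_cv}")
# The test set is used once, after k has been fixed.
print(f"Held-out test accuracy: {search.score(X_te, y_te):.4f}")
```

With an imbalanced target, `scoring="f1"` may be a more informative selection criterion than accuracy.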


In [62]:
# Logistic Regression
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(X_train, y_train)
log_reg_pred = logistic_regression.predict(X_test)

# K-Nearest Neighbors
# Note: this uses k=4, not the best k=15 reported by the search above.
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

# Decision Trees
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
decision_tree_pred = decision_tree.predict(X_test)

# Random Forest
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
random_forest_pred = random_forest.predict(X_test)

In [63]:
# Logistic Regression from Scratch
# Batch gradient descent on the log-loss; note that no intercept (bias) term is fitted.
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        self.theta = np.zeros(X.shape[1])
        m = len(y)
        for _ in range(self.n_iterations):
            z = np.dot(X, self.theta)
            h = self.sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / m
            self.theta -= self.learning_rate * gradient

    def predict(self, X):
        return np.round(self.sigmoid(np.dot(X, self.theta)))

# Instantiate and train Logistic Regression from Scratch
log_reg_scratch = LogisticRegressionScratch()
log_reg_scratch.fit(X_train, y_train)

# Predict on test set
predictions_scratch = log_reg_scratch.predict(X_test)
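
The class above fits no intercept. On standardized features the linear score `z` is then centered near zero, so the model cannot shift its predictions toward the majority class, which may explain part of its gap to scikit-learn's `LogisticRegression`. The sketch below adds a bias term to the same gradient-descent update. The class name `LogisticRegressionWithBias`, the hyperparameter defaults, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

class LogisticRegressionWithBias:
    """Batch gradient-descent logistic regression with an intercept (illustrative)."""

    def __init__(self, learning_rate=0.1, n_iterations=2000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        m, n = X.shape
        self.theta = np.zeros(n)
        self.bias = 0.0
        for _ in range(self.n_iterations):
            h = self._sigmoid(X @ self.theta + self.bias)
            error = h - y
            # Gradients of the mean log-loss with respect to the weights and the bias.
            self.theta -= self.learning_rate * (X.T @ error) / m
            self.bias -= self.learning_rate * error.mean()
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return (self._sigmoid(X @ self.theta + self.bias) >= 0.5).astype(int)

# Toy imbalanced data (assumed): the negative offset makes positives the minority.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
logits = 2.0 * X_toy[:, 0] - 2.0
y_toy = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegressionWithBias().fit(X_toy, y_toy)
print("learned bias:", round(model.bias, 3))
print("training accuracy:", (model.predict(X_toy) == y_toy).mean())
```

An equivalent approach is to append a column of ones to `X` and keep the original update rule unchanged.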


In [65]:
models = ["Logistic Regression", "K-Nearest Neighbors", "Decision Trees", "Random Forest", "Logistic Regression from Scratch"]
predictions = [log_reg_pred, knn_pred, decision_tree_pred, random_forest_pred, predictions_scratch]

for model, pred in zip(models, predictions):
    acc_score = accuracy_score(y_test, pred)
    # 'weighted' averages the per-class F1 scores by class support
    f1 = f1_score(y_test, pred, average='weighted')
    print(f"Model: {model}")
    print(f"Accuracy Score: {acc_score:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("-------------------------------------")

Model: Logistic Regression
Accuracy Score: 0.8656
F1 Score: 0.8442
-------------------------------------
Model: K-Nearest Neighbors
Accuracy Score: 0.8812
F1 Score: 0.8590
-------------------------------------
Model: Decision Trees
Accuracy Score: 0.8812
F1 Score: 0.8790
-------------------------------------
Model: Random Forest
Accuracy Score: 0.9094
F1 Score: 0.9020
-------------------------------------
Model: Logistic Regression from Scratch
Accuracy Score: 0.6750
F1 Score: 0.7222
-------------------------------------
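
Because good wines are the minority class, accuracy and weighted F1 can both look strong for a model that rarely predicts the positive class. The small example below, on made-up labels (an assumption, unrelated to the scores above), compares accuracy, weighted F1, and the F1 of the positive class alone for a classifier that always predicts 0.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (assumed): 8 negatives, 2 positives, and a model that always predicts 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_positive = f1_score(y_true, y_pred, average="binary", pos_label=1, zero_division=0)

print(f"Accuracy:          {acc:.4f}")          # 0.8000
print(f"Weighted F1:       {f1_weighted:.4f}")  # 0.7111
print(f"Positive-class F1: {f1_positive:.4f}")  # 0.0000
```

Reporting `average="binary"` alongside the weighted score shows directly how well each model identifies the good wines.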
