## Assignment-3

Dataset: Red Wine Quality

The dataset covers the red variant of the "Vinho Verde" wine. It contains 1599 samples; the features are physicochemical properties and the target, quality, is an integer score from 0 to 10. Your task is to classify whether a given wine is good based on its physicochemical properties.

(i) Create a new column in the dataset with binary values (i.e., 0 or 1) indicating whether the wine is of good quality. Categorise wines with quality>=7 as good quality. Drop the original ‘quality’ column.

(ii) Perform the data pre-processing steps that you feel are important for the given dataset.

(iii) Apply the following classification algorithms to the given dataset (you may use the scikit-learn library unless ‘from scratch’ is specified):

- Logistic Regression
- K-Nearest Neighbors
- Decision Trees Classifier
- Random Forest Classifier
- Logistic Regression from Scratch

(iv) Evaluate all your models based on the accuracy score and f1 score obtained on the test dataset.
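
The two metrics in part (iv) can diverge when one class is rare, so it may help to see how each is computed from the prediction counts. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the helper name `accuracy_and_f1` and the toy labels are hypothetical and exist only for illustration.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 here when the denominator is 0.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels (made up): eight ordinary wines and two good ones.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(toy_accuracy, toy_f1)  # 0.8 0.5
```

In this toy case the classifier finds only one of the two good wines and raises one false alarm, so accuracy stays at 0.8 while F1 drops to 0.5.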


#### Importing libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# importing dataset
df = pd.read_csv("winequality-red.csv")

# part (i)

df['good_quality'] = (df['quality'] >= 7).astype(int)
df.drop('quality', axis=1, inplace=True)

# part (ii)

X = df.drop('good_quality', axis=1)
y = df['good_quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler() 
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# part (iii)

# Logistic Regression

logistic_model = LogisticRegression()
logistic_model.fit(X_train_scaled, y_train)
logistic_preds = logistic_model.predict(X_test_scaled)

# K-Nearest Neighbors

k_nn_model = KNeighborsClassifier()
k_nn_model.fit(X_train_scaled, y_train)
k_nn_preds = k_nn_model.predict(X_test_scaled)

# Decision Trees Classifier

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_scaled, y_train)
dt_preds = dt_model.predict(X_test_scaled)

# Random Forest Classifier

rf_model = RandomForestClassifier()
rf_model.fit(X_train_scaled, y_train)
rf_preds = rf_model.predict(X_test_scaled)

# Logistic Regression from Scratch

class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.001, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.zeros(n)
        self.bias = 0

        for _ in range(self.num_iterations):
            y_pred = self.sigmoid(np.dot(X, self.weights) + self.bias)
            dw = (1 / m) * np.dot(X.T, (y_pred - y))
            db = (1 / m) * np.sum(y_pred - y)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        return np.round(self.sigmoid(np.dot(X, self.weights) + self.bias))

lr_scratch = LogisticRegressionScratch(learning_rate=0.001, num_iterations=1000)
lr_scratch.fit(X_train_scaled, y_train)
lr_scratch_preds = lr_scratch.predict(X_test_scaled)

# part (iv)
def evaluate_model(y_true, y_pred, model_name):
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"{model_name}:\nAccuracy: {accuracy:.4f}\nF1 Score: {f1:.4f}\n")


evaluate_model(y_test, logistic_preds, "Logistic Regression")
evaluate_model(y_test, k_nn_preds, "K-Nearest Neighbors")
evaluate_model(y_test, dt_preds, "Decision Trees Classifier")
evaluate_model(y_test, rf_preds, "Random Forest Classifier")
evaluate_model(y_test, lr_scratch_preds, "Logistic Regression from scratch")


Logistic Regression:
Accuracy: 0.8656
F1 Score: 0.3768

K-Nearest Neighbors:
Accuracy: 0.8812
F1 Score: 0.5128

Decision Trees Classifier:
Accuracy: 0.8969
F1 Score: 0.6374

Random Forest Classifier:
Accuracy: 0.9031
F1 Score: 0.6076

Logistic Regression from scratch:
Accuracy: 0.8562
F1 Score: 0.5000
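
Accuracy alone can look strong when one class dominates, which is worth keeping in mind when reading scores like those above. As a rough sanity check, the sketch below scores a majority-class baseline with scikit-learn's `DummyClassifier`. The labels and features are synthetic stand-ins (the 15% positive rate is an assumed ratio, not measured from the wine data), and the names `X_demo`, `y_demo`, and `baseline` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in data (an assumption, not the wine dataset): 15% positives.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 15 + [0] * 85)
X_demo = rng.normal(size=(100, 3))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_preds = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, baseline_preds)
# zero_division=0 avoids a warning when no positives are predicted.
baseline_f1 = f1_score(y_demo, baseline_preds, zero_division=0)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline F1 score: {baseline_f1:.4f}")
```

In the notebook itself, the same baseline could be fitted on `X_train_scaled` and `y_train`, with its test predictions passed to `evaluate_model` for a like-for-like comparison.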

