## FINAL ASSIGNMENT
#### Dataset: Red Wine Quality

##### The dataset covers the red variant of "Vinho Verde" wine. It contains 1599 data points; the features are physicochemical properties, and the target is quality, an integer score from 0 to 10. Your task is to classify whether a given wine is good based on its physicochemical properties.

##### (i) Create a new column in the dataset with binary values (i.e., 0 or 1) indicating whether the wine is of good quality. Categorise wines with quality >= 7 as good quality. Drop the original ‘quality’ column.

##### (ii) Perform the data pre-processing steps that you feel are important for the given dataset.

##### (iii) Apply the following classification algorithms to the given dataset (you may use the scikit-learn library unless ‘from scratch’ is specified):

 ##### Logistic Regression
 ##### K-Nearest Neighbors
 ##### Decision Trees Classifier
 ##### Random Forest Classifier
 ##### Logistic Regression from Scratch 

##### (iv) Evaluate all your models based on the accuracy score and F1 score obtained on the test dataset.
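
The two metrics named in (iv) can behave very differently when one class is rare. The following is a minimal sketch in plain Python, using made-up labels rather than the wine data; the helper name `accuracy_and_f1` is hypothetical, and returning 0 for F1 when there are no true positives, false positives, or false negatives is a convention assumed here.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); set to 0 when the denominator is 0 (assumed convention)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up labels: 2 positives out of 10 (not taken from the wine dataset)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_zero = [0] * 10                         # never predicts the positive class
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]       # one correct and one wrong positive

print(accuracy_and_f1(y_true, always_zero))  # (0.8, 0.0)
print(accuracy_and_f1(y_true, one_hit))      # (0.8, 0.5)
```

Both predictors reach the same accuracy, but only F1 separates the one that never finds a positive from the one that finds some. This is why it helps to report both scores when good wines are a minority of the data.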



In [5]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load the dataset
df = pd.read_csv("winequality-red.csv")

# (i) Create a new column for good quality
df['good_quality'] = (df['quality'] >= 7).astype(int)

# Drop the original 'quality' column
df.drop('quality', axis=1, inplace=True)

# (ii) Data pre-processing steps
# Split the data into features (X) and target variable (y)
X = df.drop('good_quality', axis=1)
y = df['good_quality']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features

# Create a StandardScaler object
scaler = StandardScaler() 
# Fit the scaler on the training data and simultaneously transform the features
X_train_scaled = scaler.fit_transform(X_train)
# Transform the test data using the previously fitted scaler (ensures same scaling parameters as training data)
X_test_scaled = scaler.transform(X_test)

# (iii) Apply classification algorithms
# Logistic Regression

# Create a Logistic Regression model instance
logistic_model = LogisticRegression()
# Train the Logistic Regression model on the scaled training data
logistic_model.fit(X_train_scaled, y_train)
# Use the trained Logistic Regression model to make predictions on the scaled test data
logistic_preds = logistic_model.predict(X_test_scaled)

# K-Nearest Neighbors

# Create an instance of the K-Nearest Neighbors (KNN) classifier model
knn_model = KNeighborsClassifier()
# Train the KNN model using the scaled training data and corresponding target values
knn_model.fit(X_train_scaled, y_train)
# Use the trained KNN model to make predictions on the scaled test data
knn_preds = knn_model.predict(X_test_scaled)

# Decision Trees Classifier

# Create an instance of the Decision Tree classifier model
tree_model = DecisionTreeClassifier()
# Train the Decision Tree model using the scaled training data and corresponding target values
tree_model.fit(X_train_scaled, y_train)
# Use the trained Decision Tree model to make predictions on the scaled test data
tree_preds = tree_model.predict(X_test_scaled)

# Random Forest Classifier

# Create an instance of the Random Forest classifier model
rf_model = RandomForestClassifier()
# Train the Random Forest model using the scaled training data and corresponding target values
rf_model.fit(X_train_scaled, y_train)
# Use the trained Random Forest model to make predictions on the scaled test data
rf_preds = rf_model.predict(X_test_scaled)

# Define a custom Logistic Regression class from scratch
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.0001, num_iterations=1500):
        # Initialize the logistic regression model with default or user-defined learning rate and number of iterations
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        # Initialize weights and bias to None
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        # Sigmoid activation function used to squash values between 0 and 1
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        # Get the number of samples (m) and features (n) from the input data X
        m, n = X.shape
        # Initialize weights to zeros and bias to zero
        self.weights = np.zeros(n)
        self.bias = 0

        for _ in range(self.num_iterations):
            # Calculate the predicted values
            y_pred = self.sigmoid(np.dot(X, self.weights) + self.bias)

            # Calculate the gradients
            dw = (1 / m) * np.dot(X.T, (y_pred - y))
            db = (1 / m) * np.sum(y_pred - y)

            # Update weights and bias using gradient descent
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

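    def predict(self, X):
        # Compute predicted probabilities with the trained weights and bias
        probabilities = self.sigmoid(np.dot(X, self.weights) + self.bias)
        # Label a sample 1 when its probability is strictly above 0.5, else 0
        return (probabilities > 0.5).astype(int)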


# Train the Logistic Regression model from scratch

# Create an instance of the Logistic Regression model from scratch
lr_scratch = LogisticRegressionScratch(learning_rate=0.0001, num_iterations=1500)
# Train the Logistic Regression model from scratch using the scaled training data
lr_scratch.fit(X_train_scaled, y_train)
# Use the trained Logistic Regression model from scratch to make predictions on the scaled test data
lr_scratch_preds = lr_scratch.predict(X_test_scaled)

# (iv) Evaluate models
# Define a function to print evaluation metrics
def evaluate_model(y_true, y_pred, model_name):
    # Calculate accuracy using accuracy_score
    accuracy = accuracy_score(y_true, y_pred)
    # Calculate F1 score using f1_score
    f1 = f1_score(y_true, y_pred)
    # Print model evaluation metrics
    print(f"{model_name}:\nAccuracy: {accuracy:.4f}\nF1 Score: {f1:.4f}\n")

# Evaluate and print performance metrics for Logistic Regression model
evaluate_model(y_test, logistic_preds, "Logistic Regression")
# Evaluate and print performance metrics for K-Nearest Neighbors model
evaluate_model(y_test, knn_preds, "K-Nearest Neighbors")
# Evaluate and print performance metrics for Decision Trees Classifier model
evaluate_model(y_test, tree_preds, "Decision Trees Classifier")
# Evaluate and print performance metrics for Random Forest Classifier model
evaluate_model(y_test, rf_preds, "Random Forest Classifier")
# Evaluate and print performance metrics for Logistic Regression from scratch model
evaluate_model(y_test, lr_scratch_preds, "Logistic Regression from scratch")


Logistic Regression:
Accuracy: 0.8656
F1 Score: 0.3768

K-Nearest Neighbors:
Accuracy: 0.8812
F1 Score: 0.5128

Decision Trees Classifier:
Accuracy: 0.8781
F1 Score: 0.5618

Random Forest Classifier:
Accuracy: 0.9125
F1 Score: 0.6500

Logistic Regression from scratch:
Accuracy: 0.8562
F1 Score: 0.5000
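
The from-scratch model above uses `learning_rate=0.0001` with `num_iterations=1500`, a combination under which gradient descent on standardized features likely moves the weights only a short distance from zero. One way to check this is to record the log-loss at each iteration. The sketch below does so on synthetic data, not the wine dataset; the function `fit_logreg`, the synthetic coefficients, and the comparison learning rate of 0.1 are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logreg(X, y, learning_rate, num_iterations):
    """Batch gradient descent for logistic regression.

    Returns the weights, the bias, and the log-loss recorded before each update.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    losses = []
    for _ in range(num_iterations):
        p = sigmoid(X @ w + b)
        # Mean binary cross-entropy; clip probabilities to avoid log(0)
        p_safe = np.clip(p, 1e-12, 1 - 1e-12)
        losses.append(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)))
        # Same gradient formulas as in the class above
        w -= learning_rate * (X.T @ (p - y)) / m
        b -= learning_rate * np.sum(p - y) / m
    return w, b, losses


# Synthetic features drawn from a standard normal (an assumption, not the wine data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
noise = rng.normal(scale=0.5, size=500)
y = (X @ np.array([1.5, -1.0, 0.5]) + noise > 1.0).astype(float)

for lr in (0.0001, 0.1):
    _, _, losses = fit_logreg(X, y, learning_rate=lr, num_iterations=1500)
    print(f"lr={lr}: first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

If the last recorded loss is still close to the first, the model has not converged, and a larger learning rate or more iterations may be worth trying. Whether that holds for the wine data would need to be checked by running the same loss tracking on `X_train_scaled` and `y_train`.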

