### Follow These Instructions

Once you are finished, complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.

# Assignment 3: Classification with Logistic Regression  [ __ /100  marks]


In this assignment we will use the `diabetes` dataset, which was collected and made available by the “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database.

We will use logistic regression to predict whether subjects have diabetes or not.

## Global Toolbox

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve, auc
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
seed=0

## Question 1.1 [ _ /3 marks]

Read the file `diabetes.csv` into a pandas DataFrame. Display the first 5 rows of the DataFrame. 

In [1]:
# ****** your code here ******
df = pd.read_csv("diabetes.csv")
print(df.head())


NameError: name 'pd' is not defined

## Question 1.2 [ _ /6 marks]

(1) How many classes are there? How many features are available to predict the outcome?

**Your answer**: 2 classes (`Outcome` is 0 or 1). There are 8 features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

(2) Is the dataset class-balanced?

In [None]:
# ****** your code here ******
print(df.Outcome.value_counts())

**Your answer**: No

(3) For this classification problem, what is the baseline accuracy and how would you interpret it? Round to 3 decimal places.

In [2]:
# ****** your code here ******

A = df.Outcome[df['Outcome']==0].count()
B = df.Outcome[df['Outcome']==1].count()
baseline_accuracy = round(A/(A+B), 3)
print("Baseline Accuracy is:", baseline_accuracy)

NameError: name 'df' is not defined
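
As a rough illustration of what "baseline accuracy" means, the sketch below computes the accuracy of a classifier that always predicts the most frequent class. The labels in `toy_labels` are hypothetical and are not taken from `diabetes.csv`; the helper name `baseline_accuracy` is also an assumption made for this example.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical toy labels (NOT the diabetes data): 6 negatives, 4 positives
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_baseline = round(baseline_accuracy(toy_labels), 3)
print("Toy baseline accuracy:", toy_baseline)
```

Any trained model should beat this number on held-out data; otherwise it has learned nothing beyond the class frequencies.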

## Question 1.3 [ _ /3 marks]

Use `train_test_split` with `random_state=0` to split the data into training and test sets. Leave `20%` for testing.

In [None]:
# Store all the features into variable "X"
# ****** your code here ******
X = df.drop('Outcome', axis='columns').values

# Store the output class values into variable "y" 
# ****** your code here ******
y = df.Outcome.values

# Split your X and y data using train_test_split 
# ****** your code here ******

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=seed)


## Question 2.1 [ _ /3 marks]

We will use sklearn's `LogisticRegression` to solve the classification problem. Before we move on, answer the following questions by reading the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).


(1) Does `LogisticRegression` use a penalty by default?  If yes, what penalty?

**Your answer**: 
Yes, it uses L2 regularization.

    
(2) If we apply a penalty during learning, what difference do you expect to see in the resulting coefficients (parameters), relative to not applying a penalty during learning?

**Your answer**: 
Applying a penalty during learning generally results in smaller magnitude coefficients compared to not applying a penalty. This occurs because the penalty discourages large coefficients by adding a cost to the loss function that increases with the magnitude of the coefficients. Specifically:

- L1 Penalty (Lasso Regularization): Encourages sparsity by driving some coefficients to zero, which can lead to feature selection.
- L2 Penalty (Ridge Regularization): Shrinks the coefficients towards zero, but typically doesn't result in exactly zero coefficients.
- Elastic Net: Combines L1 and L2 penalties, encouraging both sparsity and coefficient shrinkage.

Without a penalty, the learning algorithm may fit the data too closely, potentially leading to overfitting.
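
The shrinkage effect of an L2 penalty can be seen in a minimal sketch: plain gradient descent on a one-feature logistic regression, with and without an added `(l2 / 2) * w**2` term. This is only an illustration under stated assumptions. The data, the function `fit_logistic_1d`, and the parameter `l2` are hypothetical, and this is not how `LogisticRegression` is implemented (it uses dedicated solvers and expresses the penalty strength through `C`, the inverse of the regularization strength).

```python
import math

def fit_logistic_1d(xs, ys, l2=0.0, lr=0.1, steps=5000):
    """Gradient descent on mean log-loss + (l2 / 2) * w**2; the intercept b is not penalized."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(y = 1)
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        w -= lr * (grad_w + l2 * w)  # the penalty adds l2 * w to the gradient
        b -= lr * grad_b
    return w, b

# Hypothetical toy data with overlapping classes, so the unpenalized fit stays finite
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -1.5, 1.5]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

w_free, _ = fit_logistic_1d(xs, ys, l2=0.0)
w_l2, _ = fit_logistic_1d(xs, ys, l2=1.0)
print(f"no penalty: w = {w_free:.3f}, L2 penalty: w = {w_l2:.3f}")
```

The penalized coefficient keeps its sign but has a smaller magnitude, which is the behaviour described in the answer above.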

    
(3) If using the default settings of `LogisticRegression`, do you need to include a column of 1s in your feature/design matrix? Briefly explain why or why not.


**Your answer**: 
No. `LogisticRegression` fits an intercept term by default (`fit_intercept=True`), so there is no need to manually add a column of 1s to the feature matrix.

## Question 2.2 [ _ /10 marks]

Create a `LogisticRegression` model with `penalty=none`. Let's first train and test this classifier using only "Insulin" as the input feature. Make a scatter plot of the points. Plot your prediction on the same graph.

In [None]:
# Create a LogisticRegression model without regularization
# ****** your code here ******
# Note: scikit-learn 1.2+ deprecates the string 'none'; use penalty=None there
dflr = LogisticRegression(penalty='none')

# Obtain training data and test data
# ****** your code here ******
X = df[['Insulin']].values  # single input feature, kept 2D as sklearn expects
y = df.Outcome.values       # the target is the diabetes outcome, not Insulin
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=seed)

# Fit to your training data using Logistic Regression
# ****** your code here ******
dflr.fit(Xtrain, ytrain)  # fit only after the split, on the training data

# Create a scatter plot of the test data.
# ****** your code here ******
sns.scatterplot(x=Xtest[:, 0], y=ytest, hue=ytest)

# Also plot your prediction using sns.lineplot
# lineplot needs 1d vector x
y_pred = dflr.predict(Xtest)
sorted_indices = np.argsort(Xtest[:, 0])
x_sorted = Xtest[sorted_indices, 0]
y_pred_sorted = y_pred[sorted_indices]
sns.lineplot(x=x_sorted, y=y_pred_sorted)


## Question 2.3 [ _ /10 marks]
Evaluate the classification performance using `Accuracy`, `Recall`, `Precision`, `Sensitivity` and `Specificity`.

In [None]:
# ****** your code here ******
# You can either write a function or not

ytest_hat = dflr.predict(Xtest)

def compute_performance(yhat, y, classes):
    tp = sum(np.logical_and(yhat == classes[1], y == classes[1]))
    tn = sum(np.logical_and(yhat == classes[0], y == classes[0]))
    fp = sum(np.logical_and(yhat == classes[1], y == classes[0]))
    fn = sum(np.logical_and(yhat == classes[0], y == classes[1]))
    print(f"tp: {tp} tn: {tn} fp: {fp} fn: {fn}")
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    sensitivity = recall
    specificity = tn / (fp + tn)
    print("Accuracy:", round(acc, 3), "Recall:", round(recall, 3), "Precision:", round(precision, 3),
          "Sensitivity:", round(sensitivity, 3), "Specificity:", round(specificity, 3))
    
compute_performance(ytest_hat, ytest, dflr.classes_)
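
To make the metric definitions concrete, here is a small worked example that can be checked by hand. The labels and predictions are hypothetical, and `confusion_counts` is a helper invented for this sketch; it mirrors the counting done in `compute_performance` but uses plain Python lists.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

# Hypothetical labels and predictions: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)    # of the predicted positives, how many are real
recall = tp / (tp + fn)       # of the real positives, how many were found
sensitivity = recall          # sensitivity is another name for recall
specificity = tn / (tn + fp)  # of the real negatives, how many were found
print(tp, tn, fp, fn, accuracy, precision, recall, round(specificity, 3))
```

Here 3 of the 4 positives are found (recall 0.75) and 5 of the 6 negatives are correctly rejected (specificity about 0.833).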


## Question 3.1 [ _ /10 marks]

Create another `LogisticRegression` model with `penalty=none`. Train and test this classifier with all features and then evaluate the performance.

In [None]:
# Create a LogisticRegression model without regularization
# ****** your code here ******
# Note: scikit-learn 1.2+ deprecates the string 'none'; use penalty=None there
X = df.drop("Outcome", axis="columns")
y = df.Outcome.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=seed)
DFLR = LogisticRegression(penalty='none')

# Fit to your training data using Logistic Regression
# ****** your code here ******
dflr = DFLR.fit(Xtrain, ytrain)  # fit on the training split only
print(f"Intercepts:\n {dflr.intercept_.round(3)} \n\nCoefficients:\n {dflr.coef_.round(3)}")

# Compute your test predictions, given test inputs
# ****** your code here ******
yhat = dflr.predict(Xtest)
yhat_probs = dflr.predict_proba(Xtest)

# Evaluate the performance on the held-out test set
# ****** your code here ******
compute_performance(yhat, ytest, dflr.classes_)

Does using more features help to improve the classification?

**Your answer** : 

## Question 3.2 [ _ /10 marks]
Let's adjust the decision threshold from 0.5 (default) to 0.4 and 0.6, and then evaluate the performance.

In [None]:
# Using your classifier from last question, adjust the decision threshold and get the updated predictions
# ****** your code here ******
threshold = 0.4
# Threshold the predicted positive-class probability; the threshold is not a test_size
prob_pos = dflr.predict_proba(Xtest)[:, 1]
yhat = np.where(prob_pos > threshold, dflr.classes_[1], dflr.classes_[0])

# Evaluate the performance
# ****** your code here ******
compute_performance(yhat, ytest, dflr.classes_)

In [None]:
# Using your classifier from last question, adjust the decision threshold and get the updated predictions
# ****** your code here ******
threshold = 0.6
# Threshold the predicted positive-class probability; the threshold is not a test_size
prob_pos = dflr.predict_proba(Xtest)[:, 1]
yhat = np.where(prob_pos > threshold, dflr.classes_[1], dflr.classes_[0])

# Evaluate the performance
# ****** your code here ******
compute_performance(yhat, ytest, dflr.classes_)
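
The effect of moving the decision threshold can be sketched on a handful of made-up probabilities. Everything in this example is hypothetical (the values in `probs`, the labels, and the helper `predict_with_threshold`); it only illustrates the general trade-off, not the results on the diabetes data.

```python
def predict_with_threshold(probs, threshold):
    """Predict 1 when the positive-class probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]

# Hypothetical positive-class probabilities and true labels
probs = [0.05, 0.30, 0.45, 0.55, 0.58, 0.70, 0.90, 0.42]
y_true = [0, 0, 1, 0, 1, 1, 1, 0]

summary = {}
for threshold in (0.4, 0.5, 0.6):
    y_pred = predict_with_threshold(probs, threshold)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    summary[threshold] = (round(recall, 3), round(precision, 3))
    print(f"threshold={threshold}: recall={recall:.3f}, precision={precision:.3f}")
```

On this toy data a lower threshold raises recall at the cost of precision, and a higher threshold does the opposite, so the "better" threshold depends on which kind of error is more costly.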

What do you think is a better threshold? 

**Your answer**: 
0.6

## Question 3.3 [ _ /10 marks]

Create a final `LogisticRegression` model with `penalty=l2`, `C=0.01`. Train and test this classifier with all features and then evaluate the performance.

In [None]:
# Create a LogisticRegression model with l2 regularization
# ****** your code here ******
X = df.drop('Outcome', axis='columns').values
y = df.Outcome.values
# Same 20% test split as the earlier questions, so the models are comparable
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=seed)

# Fit to your training data using Logistic Regression
# ****** your code here ******
# The penalty is the string 'l2' (letter l), not the number 12
dflr = LogisticRegression(penalty='l2', C=0.01).fit(Xtrain, ytrain)

# Compute your test predictions, given test inputs
# ****** your code here ******
y_hat = dflr.predict(Xtest)

# Evaluate the performance on the held-out test set
# ****** your code here ******
compute_performance(y_hat, ytest, dflr.classes_)

Does regularization help to improve the classification?

**Your answer** : Yes

## Question 4 [ _ /15 marks]

Plot ROC Curves for the classifiers you used in questions 2.2, 3.1, and 3.3. Use AUC to determine which classifier is the best.

In [None]:
# Use roc_curve to get FPR and TPR for each of the 3 classifiers 
# ****** your code here ******
ytest_prob = dflr.predict_proba(Xtest)

# Plot all of the ROC curves 
# ****** your code here ******
fpr, tpr, _ = roc_curve(ytest, ytest_prob[:,1], pos_label=dflr.classes_[1]) 
ax =sns.lineplot(x=fpr,y=tpr)
ax.set_xlabel("FP Rate")
ax.set_ylabel("TP Rate")

# Determine AUC for each of the ROC curves 
# ****** your code here ******
auc(fpr,tpr).round(3)
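
For intuition about what `roc_curve` and `auc` compute, the following sketch builds ROC points by sweeping a threshold over the scores and integrates them with the trapezoid rule, then compares three sets of scores by AUC. The score lists are hypothetical stand-ins for the positive-class probabilities of three classifiers, and the helpers `roc_points` and `trapezoid_auc` are written for this example only; they are not the scikit-learn implementations.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over every distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Hypothetical labels and scores for three models
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores_by_model = {
    "model_a": [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],  # ranks every positive above every negative
    "model_b": [0.1, 0.6, 0.3, 0.4, 0.5, 0.7, 0.2, 0.9],  # some ranking mistakes
    "model_c": [0.5] * 8,                                  # uninformative constant score
}

aucs = {name: round(trapezoid_auc(*roc_points(y_true, s)), 3)
        for name, s in scores_by_model.items()}
best = max(aucs, key=aucs.get)
print(aucs, "best:", best)
```

In the notebook itself, the same comparison needs one fitted model and one matching test set per question, each kept under its own variable name so that all three curves can be drawn on the same axes.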

Which one is the best classifier?

**Your answer**: Multiclass Logistic Regression

## Question 5 [ _ /10 marks]

Multiclass Logistic Regression

In the classification lab, we trained a binary LR classifier using the _mnist_ dataset to discriminate entries which were equal to 5 from the rest. Use the same dataset to train a multiclass **Logistic Regression** using the [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) with `l2` regularization. So, this time you will have 10 classes, *i.e.*, 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. For training use `max_iter=2000`, `tol=1e-3`, `random_state=seed`. For some `sklearn` functions you can set argument `n_jobs=N` to run them in parallel and speed up computations. A good value for N can be the number of physical CPU cores that your machine possesses (`N=-1` would use all cores). Check the documentation of the functions to take advantage of this where possible.

First load the data and plot a histogram to comment on class distribution qualitatively. For splitting the data into train and test sets, use `test_size=0.5` and `random_state=seed`. What is the balanced accuracy score of your model?

In [None]:
### your stuff
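
As a reminder of what the balanced accuracy score measures, the sketch below computes it as the mean of the per-class recalls and contrasts it with plain accuracy. The labels and predictions are hypothetical (three classes rather than ten) and `balanced_accuracy` is a helper written for this illustration.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical imbalanced labels: class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain_accuracy:.3f}, balanced accuracy={bal_accuracy:.3f}")
```

Plain accuracy is flattered by the large, easy class, while balanced accuracy drops because the two small classes are only half right.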

**Your written answer**: 

## Question 6 [ _ /10 marks]

Run the cell below to see how well your model can recognize a digit drawn by the mouse cursor. Set the variable `final_model`, run the cell, draw on the pop-up canvas, and once you close the canvas you will see the model's recognition of your input.

Although the cell uses your classifier, which has a high balanced accuracy score, it often makes mistakes and its performance seems questionable. Try to explain in words why that is so.

Caveat: The cell below will not run on headless servers; you will need to use a local installation of Python. You might have some fun until you can get it to work, but that's ok, because I want you to try your hands on technicalities and not always rely on online services.

In [3]:
final_model=dflr # use the name of your final model
#!pip install tk-tools
from tkinter import *
import tkinter as tk
from PIL import Image
import io
import matplotlib as mpl

temp_file_name="TEMP_image_TEMP.jpg"
app = Tk()
app.geometry("300x300")
canvas = tk.Canvas(app, bg='white')
canvas.pack(anchor='nw', fill='both', expand=1)
def get_x_and_y(event):
    global lasx, lasy
    lasx, lasy = event.x, event.y

def draw_smth(event):
    global lasx, lasy
    canvas.create_line((lasx, lasy, event.x, event.y), fill='red', width=4)
    lasx, lasy = event.x, event.y
    ps = canvas.postscript(colormode = 'color')
    img = Image.open(io.BytesIO(ps.encode('utf-8')))
    img.save(temp_file_name)

canvas.bind("<Button-1>", get_x_and_y)
canvas.bind("<B1-Motion>", draw_smth)

app.mainloop()
img = Image.open(temp_file_name)
#resize image to 28x28 pixels
img = img.resize((28,28))
#convert rgb to grayscale
img = img.convert("L")
img = np.array(img)
img = 255.0 - img
plt.imshow(img, cmap = mpl.cm.binary); plt.axis("off")
# reshaping to support our model input
img = np.reshape(img, 28*28)

#predicting the class
print('\nInput recognized as ' + str(final_model.predict([img])[0])+'.')

NameError: name 'dflr' is not defined

**Your answer**: The model might perform poorly due to overfitting, differences in feature distribution between training and new data, or because logistic regression's linear decision boundaries might not capture the complex patterns in handwritten digits.