In [None]:
version = "REPLACE_PACKAGE_VERSION"

# Assignment 3: Classification and Evaluation

In this assignment we will build several classification models and compute metrics for gauging their performance. The scenario we'll address is "spam" e-mail detection: an important, widely used supervised machine learning task that attempts to find unsolicited, mass-produced messages with irrelevant and/or inappropriate content (often mass marketing or attempts at fraud). Classifiers built for this task are often called "spam filters".

We treat this task as a binary classification problem: detecting if an email is "spam" (Class == 1) or not (Class == 0, a regular/good e-mail). Email systems will typically automatically move messages detected as "spam" to a "Spam" or "Deleted" folder so the user will not have to read them in their regular inbox.

In this setup, a *false positive* would mark a regular/good e-mail as spam. The key aspect of the "spam" scenario is that false positives are obviously very undesirable, because these would cause people to potentially lose valuable "good" messages. So we want a highly precise spam filter that has few/no false positives, but as we'll see, as a consequence it may let more spam through the filter. This is a classic precision / recall tradeoff, which you'll investigate below.
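
To make the tradeoff concrete, here is a minimal sketch that computes precision and recall from confusion counts. The counts for the two filters are hypothetical, chosen only for illustration; they do not come from the spam dataset.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical filters evaluated on 100 spam messages each.
# The strict filter rarely flags good mail but misses more spam.
strict = precision_recall(tp=60, fp=1, fn=40)
# The lenient filter catches more spam but flags more good mail.
lenient = precision_recall(tp=95, fp=20, fn=5)

print("strict  (precision, recall):", strict)
print("lenient (precision, recall):", lenient)
```

The strict filter has the higher precision and the lower recall, which is the behaviour we usually prefer for spam filtering.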

In [None]:
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

### Question 1. (10 pts)

Import the data from `assets/spam.csv`. What is the ratio of the counts of regular (Class == 0) to spam (Class == 1) observations in the entire dataset?

*This function should return a positive float less than 10.* 

In [None]:
def answer_one():
    
    df = pd.read_csv('assets/spam.csv')
    frac = len(df[df['Class'] == 0]) / len(df[df['Class'] == 1])
    
    return frac

In [None]:
# Autograder tests

stu_ans = answer_one()
assert isinstance(stu_ans, float), "Q1: Your function should return a float. "
assert 0.0 <= stu_ans <= 10.0, "Q1: Your answer must be between 0 and 10. "

del stu_ans

Now prepare the data: we break it into training and testing sets as usual. But there's another critical step, since we're going to be using *regularized* classification methods on this data. We must first *perform feature normalization so that all features are on a standardized scale.* We do this by applying the `StandardScaler` class from `sklearn.preprocessing`.

The details of how we do this are important. You must *first* split the data into training and test sets, and *only after the split*, do the feature normalization. This is to avoid giving information about the range of variables in the test split to the training split, which would be a form of data leakage.
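
As a rough illustration of the split-then-scale order, the sketch below uses synthetic data (an assumption; all `*_demo` names are hypothetical and separate from the assignment's variables). The scaler's statistics come from the training split only, so the transformed training features have mean exactly 0, while the test features are only approximately centred.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, not the spam dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# 1) Split first.
X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# 2) Fit the scaler on the training split only, then apply it to both splits.
scaler_demo = StandardScaler().fit(X_tr_raw)
X_tr = scaler_demo.transform(X_tr_raw)
X_te = scaler_demo.transform(X_te_raw)

print("train means:", X_tr.mean(axis=0))  # zero up to floating-point error
print("test means: ", X_te.mean(axis=0))  # near zero, but not exactly
```

Fitting the scaler on the full data before splitting would let test-set statistics influence the training features, which is the leakage described above.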

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("assets/spam.csv")

X = df.iloc[:, :-1]
y = df.iloc[:, -1]


X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

# Use X_train, X_test, y_train, y_test for all of the following questions.
# Also, X_train_raw and X_test_raw will be useful for Question 7.

### Question 2. (15 pts)

We've seen that so-called *dummy* classifiers can be used as a simple *sanity-check* baseline against which to compare real classifier performance. If your classifier can't do much better than the dummy classifier, you probably have more work to do.

Using `X_train`, `X_test`, `y_train`, and `y_test`, train two dummy classifiers: (A) one that respects the training set's label distribution and (B) one that classifies everything as the majority class of the training data. Where appropriate, make sure to set the *random_state* parameter to zero. 

Then on the test set, for each of the classifiers A and B, compute precision, recall, and accuracy. Report your results as a single tuple as shown below. Once you have the results, it's instructive to compare these two different types of dummy baselines to understand why they are different.

*This function should return a tuple of six floats, like so:*

*`(precision_score_A, recall_score_A, accuracy_score_A, precision_score_B, recall_score_B, accuracy_score_B)`.*

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score

def answer_two():
    
    dummy_strat = DummyClassifier(strategy = 'stratified', random_state=0).fit(X_train, y_train)
    dummy_mf = DummyClassifier(strategy = 'most_frequent', random_state=0).fit(X_train, y_train)
    
    y_pred_mf = dummy_mf.predict(X_test)
    y_pred_strat = dummy_strat.predict(X_test)
    
    precision_score_A = precision_score(y_test, y_pred_strat)
    recall_score_A = recall_score(y_test, y_pred_strat)
    accuracy_score_A = accuracy_score(y_test, y_pred_strat)
    
    precision_score_B = precision_score(y_test, y_pred_mf)
    recall_score_B = recall_score(y_test, y_pred_mf)
    accuracy_score_B = accuracy_score(y_test, y_pred_mf)
    
    # Order: dummy A (stratified) first, then dummy B (most frequent)
    return (precision_score_A, recall_score_A, accuracy_score_A,
            precision_score_B, recall_score_B, accuracy_score_B)


In [None]:
# Autograder tests

stu_ans = answer_two()

assert isinstance(stu_ans, tuple), "Q2: Your function should return a tuple. "
assert len(stu_ans) == 6, "Q2: The length of your returned tuple should be 6. "
assert all([isinstance(item, float) for item in stu_ans]), "Q2: Your tuple should only contain floats. "
#assert isinstance(stu_ans[0], float) and isinstance(stu_ans[1], float) and isinstance(stu_ans[2], float) and isinstance(stu_ans[3], float) and isinstance(stu_ans[4], float) and isinstance(stu_ans[5], float), "Q2: Your tuple should only contain floats. "

del stu_ans

### Question 3. (15 pts)

Using `X_train`, `X_test`, `y_train`, and `y_test`, train an SVC classifier with the default hyper-parameters. What are the accuracy, recall and precision of this classifier on the testing set?

*This function should return a tuple of three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score, precision_score

def answer_three():
    
    model = SVC().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    acc, rec, pre = (accuracy_score(y_test, y_pred), recall_score(y_test, y_pred), precision_score(y_test, y_pred)) # accuracy, recall and precision

    return (acc, rec, pre)

In [None]:
# Autograder tests

stu_ans = answer_three()

assert isinstance(stu_ans, tuple), "Q3: Your function should return a tuple. "
assert len(stu_ans) == 3, "Q3: The length of your returned tuple should be 3. "
assert all([isinstance(item, float) for item in stu_ans]), "Q3: Your tuple should only contain floats. "

del stu_ans

### Question 4. (20 pts)

Train an SVC classifier with default hyper-parameters except for `{"C": 1e9, "gamma": 1e-8}`. What is the confusion matrix on the testing set if we use a threshold of `-100` for the decision function? That is, we classify instances with a raw score greater than -100 under the decision function as Class 1. 

*This function should return a 2x2 numpy array of 4 integers.*
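
The thresholding idea can be sketched on synthetic data as follows. This is an illustration only: the data, the default `SVC`, and the thresholds `-1.0`, `0.0`, `1.0` are assumptions chosen for the demo, not the settings asked for in this question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the spam dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

svc_demo = SVC().fit(X_tr, y_tr)
scores = svc_demo.decision_function(X_te)  # one raw score per test row

def confusion_at(threshold):
    """Confusion matrix when rows with score > threshold are labelled Class 1."""
    y_pred = (scores > threshold).astype(int)
    return confusion_matrix(y_te, y_pred, labels=[0, 1])

for t in (-1.0, 0.0, 1.0):
    # ravel() order for a binary problem: tn, fp, fn, tp
    print(t, confusion_at(t).ravel())
```

Lowering the threshold can only move predictions from Class 0 to Class 1, so the second column of the matrix (predicted positives) never shrinks as the threshold decreases.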

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

def answer_four():
  
    model = SVC(C = 1e9, gamma = 1e-08).fit(X_train, y_train)
    
    y_pred = model.decision_function(X_test)
    y_pred = np.where(y_pred > -100, 1, 0)
    
    conf_mtrx = confusion_matrix(y_test, y_pred)
    
    return conf_mtrx

In [None]:
# Autograder tests

stu_ans = answer_four()

assert isinstance(stu_ans, np.ndarray), "Q4: Your function should return a np.ndarray. "
assert stu_ans.shape == (2, 2), "Q4: Your confusion matrix should be of size 2x2. "

del stu_ans

### Question 5. (20 pts)

Train a logistic regression spam e-mail classifier with default hyper-parameters using `X_train` and `y_train`. Create a precision-recall curve and a Receiver Operating Characteristic (ROC) curve using `y_test` and the probability estimates of being "spam" for X_test.

- Based on the precision-recall curve, what is the recall when the precision is $0.90$?

- Based on the ROC curve, what is the true positive rate when the false positive rate is $0.10$?

Write a function to return the answers. Note you can use a pure programming approach to get the answers: you don't have to actually "plot" the curves. However, it's quite instructive to plot these curves to understand just how e.g. precision and recall trade off. So you can plot the curves and read off the answers. Answers correct up to $\pm 0.02$ are accepted. 

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*
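
One possible way to read operating points off the two curves without plotting is sketched below on synthetic data (an assumption; the `*_demo` names and the selection rules are illustrative, and other reasonable rules, such as taking the nearest point on the curve, can give slightly different values).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the spam dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Probability of the positive class for each test row.
probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, probs)
fpr, tpr, _ = roc_curve(y_te, probs)

# Best recall among operating points whose precision is at least 0.90.
recall_at_p90 = recall[precision >= 0.90].max()

# fpr is non-decreasing, so we can interpolate linearly along the ROC curve.
tpr_at_fpr10 = np.interp(0.10, fpr, tpr)

print(float(recall_at_p90), float(tpr_at_fpr10))
```

The mask `precision >= 0.90` is never empty, because `precision_recall_curve` always appends a final point with precision 1 and recall 0.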

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def answer_five():
    
    model = LogisticRegression().fit(X_train, y_train)
    y_scores = model.predict_proba(X_test)[:, 1]
    
    precision, recall, threshold = precision_recall_curve(y_test, y_scores)
    fpr_, tpr_, thresholds = roc_curve(y_test, y_scores)

    rec, tpr = recall[np.abs(precision - 0.90).argmin()], tpr_[np.abs(fpr_ - 0.10).argmin()] # recall and TP rate
    
    return rec, tpr

In [None]:
# Autograder tests

stu_ans = answer_five()

assert isinstance(stu_ans, tuple), "Q5: Your function should return a tuple. " 
assert len(stu_ans) == 2, "Q5: The length of your tuple should be 2. "
assert stu_ans[0] >= 0.7, "Q5: Your recall value should be greater than 0.7. "
assert stu_ans[1] >= 0.7, "Q5: Your TP rate value should be greater than 0.7. "

del stu_ans

In [None]:
# Remember to comment them out before submitting the notebook

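# from sklearn.metrics import plot_precision_recall_curve

# log_reg = LogisticRegression().fit(X_train, y_train)
# disp = plot_precision_recall_curve(log_reg, X_test, y_test)

# Note: plot_precision_recall_curve was removed in scikit-learn 1.2.
# On scikit-learn >= 1.0 you can use this instead of the two plot lines above:
# from sklearn.metrics import PrecisionRecallDisplay
# disp = PrecisionRecallDisplay.from_estimator(log_reg, X_test, y_test)

# del log_reg, disp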

### Question 6. (15 pts)

Perform a grid search over the hyper-parameters listed below for a logistic regression classifier, using **precision** as the scoring metric and five-fold cross-validation.

**Note: Use the following parameter settings for the logistic regression:**
 * Use the 'liblinear' solver, which supports both L1 and L2 regularization.
 * Set `random_state=42`, since the solver uses randomization internally.

`'penalty': ['l1', 'l2']`

`'C':[0.005, 0.01, 0.05, 0.1, 1, 10]`

From `.cv_results_`, create an array of the mean test scores for each hyper-parameter combination. i.e.

|   `C`   	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.005`** 	|    ?	|   ? 	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.05`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|

<br>

*This function should return a 6 by 2 numpy array of floats that contain the values for each "?" above. Do not return a pd.DataFrame.*
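
Before reshaping the mean test scores, it helps to know the order in which the grid search enumerates candidates. The sketch below uses `ParameterGrid` on a shortened, illustrative grid (`grid_demo` is a hypothetical name) to show that parameter names are sorted alphabetically and the last one varies fastest.

```python
from sklearn.model_selection import ParameterGrid

# Shortened version of the grid, for illustration only.
grid_demo = {'penalty': ['l1', 'l2'], 'C': [0.005, 0.01, 0.05]}

# Keys are sorted ('C' before 'penalty'), and 'penalty' changes fastest.
combos = list(ParameterGrid(grid_demo))
for i, params in enumerate(combos):
    print(i, params)
```

With this ordering, each consecutive pair of scores shares one value of `C`, so a row-major reshape to `(number of C values, 2)` yields one row per `C` with `l1` and `l2` as columns. If in doubt, inspect `cv_results_['params']` on the fitted search to confirm the order.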

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

def answer_six():    

    model = LogisticRegression(random_state = 42, solver = 'liblinear')
    
    grid_values = {'penalty': ['l1', 'l2'], 'C':[0.005, 0.01, 0.05, 0.1, 1, 10]}
    # cv=5 makes the five-fold cross-validation explicit
    grid = GridSearchCV(model, param_grid = grid_values, scoring = 'precision', cv = 5)
    grid.fit(X_train, y_train)

    mean_test_scores = grid.cv_results_['mean_test_score'].reshape(6,2)
    
    return mean_test_scores

In [None]:
# Autograder tests

stu_ans = answer_six()
assert isinstance(stu_ans, np.ndarray), "Q6: Your function should return a np.ndarray. "
assert stu_ans.shape == (6, 2), "Q6: Your np.ndarray should be of shape (6, 2). "

del stu_ans 

In [None]:
# Optional: use the following function to help visualise the results from the grid search

def GridSearch_Heatmap(scores):
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    plt.figure()
    sns.heatmap(scores.reshape(6, 2), xticklabels=['l1','l2'], yticklabels=[0.005, 0.01, 0.05, 0.1, 1, 10])
    plt.yticks(rotation=0);

# Remember to comment it out before submitting the notebook
# GridSearch_Heatmap(answer_six())

### Question 7. Normalizing features when using regularization (5 pts)

Now re-run the code from Question 6 above, but using the raw *unnormalized* training data (i.e. `X_train_raw`, computed in the data preparation cell before Question 2). Return the *highest* precision you obtained from cross-validation in Q6 using normalized features, and the *highest* precision you obtain with the raw, unnormalized features.

Your function should return a two-element tuple of floats `(best_precision_normalized, best_precision_unnormalized)`

It is very instructive to compare the results from (a) using correctly normalized features, vs. (b) forgetting to normalize the features and (c) the dummy baseline you computed in Q2.


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

def answer_seven():  
    
    model = LogisticRegression(random_state = 42, solver = 'liblinear')
    
    grid_values = {'penalty': ['l1', 'l2'], 'C':[0.005, 0.01, 0.05, 0.1, 1, 10]}
    
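    # Same search as Q6 on the normalized features (explicit five-fold CV)
    grid = GridSearchCV(model, param_grid = grid_values, scoring = 'precision', cv = 5)
    grid.fit(X_train, y_train)
    
    # Identical search on the raw, unnormalized features
    grid_raw = GridSearchCV(model, param_grid = grid_values, scoring = 'precision', cv = 5)
    grid_raw.fit(X_train_raw, y_train)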
    
    best_precision_normalized = grid.cv_results_['mean_test_score'].max()
    best_precision_unnormalized = grid_raw.cv_results_['mean_test_score'].max()
    
    return (best_precision_normalized, best_precision_unnormalized)

In [None]:
# Autograder tests

stu_ans = answer_seven()

assert isinstance(stu_ans, tuple), "Q7: Your function should return a tuple. "
assert len(stu_ans) == 2, "Q7: The length of your returned tuple should be 2. "
assert all([isinstance(item, float) for item in stu_ans]), "Q7: Your tuple should only contain floats. "

del stu_ans