<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Classification Metrics
              
</p>
</div>

DS-NTL-010824
<p>Phase 3</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

# Objectives
- Calculate and interpret a confusion matrix
- Calculate and interpret classification metrics such as accuracy, recall, and precision
- Choose classification metrics appropriate to a business problem

In [None]:
import numpy as np
import pandas as pd
#from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

There are many classification metrics for evaluating model performance on a validation/test set:

- The metric you use changes which model you will pick during hyperparameter tuning.


Choice of evaluation metric:
- Has a major impact on how well the model serves its intended goals.

#### Scenario: Identifying Fraudulent Credit Card Transactions
<center><img src = "Images/credit_card.png" width = 400/></center>

In [None]:
credit_data = pd.read_csv('data/credit_fraud_small.csv')

In [None]:
credit_data.info()

The dataset contains a bunch of features:
- The transaction amount
- The relative time of the transaction
- V1-V28: engineered features (the product of feature engineering).

Fraud detection algorithms:
- Typically start from a huge number of features
- Can create a small set of feature combinations that captures most of the variation in the full feature set:
    - Principal component analysis (PCA)
    - V1-V28 are these combination features
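
As a rough illustration of how PCA compresses many features into a few combination features, here is a minimal sketch on synthetic data. The data, the variable names, and the choice of two components are assumptions made for illustration; they are not how V1-V28 were actually produced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 10 observed features
# driven by 2 hidden factors plus a little noise
rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X_raw = hidden @ mixing + 0.1 * rng.normal(size=(500, 10))

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_raw)

# Keep only the 2 combination features with the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
# Fraction of the total variance captured by the 2 components
print(pca.explained_variance_ratio_.sum())
```

Each column of `X_reduced` is a weighted combination of the original ten columns, which is the role V1-V28 play in the credit data.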


Target 'Class':
- 1 if the transaction was fraudulent
- 0 otherwise

In [None]:
credit_data['Class'].unique()

In [None]:
credit_data['Class'].value_counts()

What have we just learned about our target in our dataset?

Run a logistic regression on the credit card fraud data:

In [None]:
# Separate data into feature and target DataFrames
X = credit_data.drop('Class', axis = 1)
y = credit_data['Class']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25,
                                                   random_state=1)
# Scale the data for modeling
cred_scaler = StandardScaler()
cred_scaler.fit(X_train)
X_train_sc = cred_scaler.transform(X_train)
X_test_sc = cred_scaler.transform(X_test)

# Train a logistic regression model with the train data
cred_model = LogisticRegression(random_state=42)
cred_model.fit(X_train_sc, y_train)

## Evaluation

Remember:
- `.score(X, y)` returns the accuracy of our classification model when predicting y from X.

In [None]:
cred_model.score(X_test_sc, y_test)

We got 99.88% accuracy! 
- Our model is good. Right?

Think again.

**Accuracy** = $\frac{TP + TN}{TP + TN + FP + FN}$

- Fraction of correct classifications.
- What the `.score()` method calculates.

**Class 1 (Fraud) = Our positive class**

- TP: True positive
- FP: False positive
- TN: True negative
- FN: False negative

<img src='images/precisionrecall.png' width=70%/>

Easy way to unpack the TP, TN, FP, FN is using the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix 

# nice function to visualize a confusion matrix
#from sklearn.metrics import plot_confusion_matrix # deprecated in scikit-learn 1.0, removed in 1.2
from sklearn.metrics import ConfusionMatrixDisplay


In [None]:
# get predictions
y_pred = cred_model.predict(X_test_sc) 
# calculate confusion matrix
cfmat = confusion_matrix(y_test, y_pred) 

cfmat

Notice the way that sklearn displays its confusion matrix: The rows are \['actually false', 'actually true'\]; the columns are \['predicted false', 'predicted true'\].

So it displays:

$\begin{bmatrix}
TN & FP \\
FN & TP
\end{bmatrix}$
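
To make that layout concrete, here is a small sketch on hand-made labels (the toy arrays `y_true` and `y_hat` are made up for illustration and are not the credit data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 5 actual negatives and 3 actual positives
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Flattening row by row gives TN, FP, FN, TP for a binary problem
tn_toy, fp_toy, fn_toy, tp_toy = cm.ravel()
print(tn_toy, fp_toy, fn_toy, tp_toy)

# The same counts computed by hand with boolean masks
tp_manual = int(np.sum((y_true == 1) & (y_hat == 1)))
fn_manual = int(np.sum((y_true == 1) & (y_hat == 0)))
print(tp_manual, fn_manual)
```

The first row holds the actual negatives (4 correct, 1 false alarm) and the second row holds the actual positives (1 missed, 2 caught).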

In [None]:
tn, fp, fn, tp = cfmat.flatten()
print(tn,fp,fn,tp)

print(cfmat)

In [None]:
#plot_confusion_matrix(cred_model,X_test_sc, y_test)
#plt.show()
ConfusionMatrixDisplay.from_estimator(cred_model, 
                      X_test_sc, y_test);

**Accuracy** = $\frac{TP + TN}{TP + TN + FP + FN}$

**Accuracy:** Fraction of all predictions, positive and negative, that are correct.

In words: How often did my model correctly identify transactions (fraudulent or not fraudulent)? This should give us the same value as we got from the `.score()` method.

In [None]:
acc = (tp + tn) / (tp + tn + fp + fn)
print(acc)

In [None]:
cred_model.score(X_test_sc, y_test)

Our accuracy is great. But is our model doing well?

## True positives and false positives:

In [None]:
# true positives
tp

In [None]:
# false positives
fp

Model not doing well on fraud detection.

But the accuracy is great. What happened?

#### Accuracy is not a great metric when:
- There's a class imbalance
- We care about the detection rate for *each* given class, not just the overall fraction correct.
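
A quick way to see the class-imbalance problem is a baseline that never predicts the positive class. The sketch below uses synthetic labels (10 positives out of 1000, an assumption for illustration) and scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, heavily imbalanced labels: 10 positives among 1000 samples
y_imb = np.zeros(1000, dtype=int)
y_imb[:10] = 1
X_imb = np.zeros((1000, 1))  # features are ignored by this dummy model

# Baseline that always predicts the most frequent class (0)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_imb, y_imb)
y_base = baseline.predict(X_imb)

acc_base = accuracy_score(y_imb, y_base)
rec_base = recall_score(y_imb, y_base)
print(acc_base)  # 990 of 1000 correct
print(rec_base)  # none of the 10 positives found
```

The baseline is 99% accurate while catching no positives at all, so a high accuracy on imbalanced data says little by itself.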

#### A better metric (for this case)

**Precision** = $\frac{TP}{TP + FP}$

**Precision:** Accuracy of positive predictions.

In this case: 
- Of the model's predictions of 'fraudulent', how many were correct?

In [None]:
from sklearn.metrics import precision_score

In [None]:
prec = tp/(tp+fp)
prec

In [None]:
precision_score(y_test, y_pred)

In the given task of detecting credit card fraud:
    
Is precision something that the credit card company cares a lot about?

#### Another metric that could be important

**Recall** = **Sensitivity** = $\frac{TP}{TP + FN}$

**Recall:** Fraction of positives that were correctly identified.



Of the actual fraudulent transactions in our data, how many did our model predict as fraudulent?

In [None]:
from sklearn.metrics import recall_score

In [None]:
rec = tp / (tp + fn)
print(rec)

In [None]:
recall_score(y_test, y_pred)

In this task, is recall an important metric? Why or why not?

#### A metric balancing both recall and precision

In [None]:
from sklearn.metrics import f1_score

An $F$-score is a combination of precision and recall, which can be useful when both are important. 

The $F_1$ score is an equal balance of the two using a [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean).

$$F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}$$
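
To see why a harmonic mean is used rather than a plain average, the sketch below compares the two on a few made-up precision/recall pairs. The helper names `f1_from` and `f1_from_counts` are hypothetical and only for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp, fp, fn):
    """Equivalent form written directly in terms of confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Made-up (precision, recall) pairs
for p, r in [(0.9, 0.9), (1.0, 0.1), (0.5, 0.0)]:
    arithmetic = (p + r) / 2
    print(p, r, round(arithmetic, 3), round(f1_from(p, r), 3))

# Both forms agree on toy counts: TP=2, FP=1, FN=1
print(f1_from(2 / 3, 2 / 3), f1_from_counts(2, 1, 1))
```

With precision 1.0 and recall 0.1 the arithmetic mean is 0.55, but the harmonic mean is only about 0.18: the $F_1$ score stays low unless both precision and recall are reasonably high.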

In [None]:
f1_sc = 2*prec*rec / (prec + rec)
print(f1_sc)

In [None]:
# scikit-learn metrics expect (y_true, y_pred) in that order
f1_score(y_test, y_pred)


**F1 Score Interpretation**

| F1 score | Interpretation |
|---|---|
| > 0.9 | Very good |
| 0.8 - 0.9 | Good |
| 0.5 - 0.8 | OK |
| < 0.5 | Not good |

Which of these metrics do you think a credit card company would care most about when trying to flag fraudulent transactions to deny?

#### `classification_report()`


In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred))

- The top rows show the statistics you would get by treating each label in turn as the "positive" class
- **Support** shows the sample size in each class
- The averages in the bottom two rows are across the rows in the class table above (useful when there are more than two classes)
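
If you want the numbers rather than a printed table, `classification_report` also accepts `output_dict=True`. The sketch below, on made-up labels, uses that to show how the two average rows are built from the per-class rows:

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy labels: 6 actual negatives, 4 actual positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Same content as the printed report, as a nested dictionary
report = classification_report(y_true, y_hat, output_dict=True)

rec0, n0 = report["0"]["recall"], report["0"]["support"]
rec1, n1 = report["1"]["recall"], report["1"]["support"]

# Macro average: plain mean over classes
macro = (rec0 + rec1) / 2
# Weighted average: mean over classes weighted by support
weighted = (rec0 * n0 + rec1 * n1) / (n0 + n1)

print(macro, report["macro avg"]["recall"])
print(weighted, report["weighted avg"]["recall"])
```

The macro average treats every class equally, while the weighted average is dominated by the larger class, which matters when classes are imbalanced.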

## Another example: Breast Cancer Prediction

In [None]:
from sklearn.datasets import load_breast_cancer

Load the data and train/test split

In [None]:
# Load the data
cancer_data_dict = load_breast_cancer()
X_cancer = cancer_data_dict['data']
cancer_feature_names = cancer_data_dict['feature_names']

cancer_features = pd.DataFrame(X_cancer, columns = cancer_feature_names)
cancer_features.head()

In [None]:
y_cancer = cancer_data_dict['target']
cancer_data_dict['target_names']

In [None]:
pd.DataFrame(cancer_data_dict['target']).value_counts()

 - 0 = Malignant
 - 1 = Benign

In [None]:
# Split into train and test
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(cancer_features, y_cancer,random_state=42)

Standard scale and fit the model

In [None]:
# Scale the data
bc_scaler = StandardScaler()
bc_scaler.fit(X_train_bc)
X_train_sc = bc_scaler.transform(X_train_bc)
X_test_sc = bc_scaler.transform(X_test_bc)

# Run the model
bc_model = LogisticRegression(solver='lbfgs', max_iter=100, random_state=42)
bc_model.fit(X_train_sc, y_train_bc)

## Predict on the test set

In [None]:
y_pred = bc_model.predict(X_test_sc)

Calculate the following for this model using scikit-learn's functions:

- Confusion Matrix
- Accuracy
- Precision
- Recall
- F1 Score

In [None]:
confusion_matrix(y_test_bc, y_pred)

#plot_confusion_matrix(bc_model, X_test_sc, y_test_bc);
ConfusionMatrixDisplay.from_estimator(bc_model, X_test_sc, y_test_bc);

In [None]:
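# By default these functions treat label 1 (benign) as the positive class
print('Accuracy: ', bc_model.score(X_test_sc, y_test_bc))
print('Precision:', precision_score(y_test_bc, y_pred))
print('Recall:   ', recall_score(y_test_bc, y_pred))
print('F1 score: ', f1_score(y_test_bc, y_pred))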

In [None]:
print(classification_report(y_test_bc, y_pred))

Which of these metrics matter for this breast cancer detection problem?
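
One detail worth keeping in mind here: in this dataset 1 = benign, and scikit-learn's `precision_score` and `recall_score` treat label 1 as the positive class unless told otherwise. If the quantity of interest is how many malignant tumors were caught, `pos_label=0` can be passed. The sketch below refits a comparable model so that it runs on its own; the variable names and `max_iter=1000` are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same dataset and split seed as above, refit here to be self-contained
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

scaler = StandardScaler().fit(X_tr)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
y_hat = model.predict(scaler.transform(X_te))

# Default: label 1 (benign) is the positive class
recall_benign = recall_score(y_te, y_hat)
# Treat label 0 (malignant) as the positive class instead
recall_malignant = recall_score(y_te, y_hat, pos_label=0)
print(recall_benign, recall_malignant)

# Cross-check against the confusion matrix rows (row 0 = actual malignant)
cm_demo = confusion_matrix(y_te, y_hat)
print(cm_demo[0, 0] / cm_demo[0].sum())
```

The per-class rows of `classification_report` give the same two recall values, one per label.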

#### Which metric to tune model hyperparameters with?

- Accuracy: misleading under class imbalance
    - Sometimes just fine.
- Precision: when false positives are much worse than false negatives
    - DNA crime-scene forensics.
- Recall: when false negatives are much worse than false positives
    - X-ray imaging for cancer prediction.  
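
As a sketch of what tuning with a chosen metric can look like, `GridSearchCV` takes a `scoring` argument. The example below tunes the regularization strength `C` for recall on the malignant class (label 0) of the breast cancer data. The grid values, the pipeline, and the 5-fold setting are assumptions for illustration, not a recommended configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Scorer that measures recall with malignant (label 0) as the positive class
malignant_recall = make_scorer(recall_score, pos_label=0)

search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    scoring=malignant_recall,
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated recall for the best C
```

Swapping `scoring` for `"accuracy"` or `"precision"` may select a different `C`, which is the sense in which the metric changes which model you pick.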

#### Multiclass Classification


**Multiclass classification**: more than two possible values for the target. An example:

- Classifying iris sub-species based on petal/sepal characteristics.

<center><img src = "Images/iris-dataset.png" width = 500 /></center>

Same metrics/methods to evaluate our models:
- Confusion matrices: number of rows/columns equal to the number of classes. 

- Metrics (precision/recall):
    - choose one class to be the "positive" class.
    - rest are assigned to the "negative" class. 
    - compute precision/recall for given "positive" class.

Repeat for each class.
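
The one-class-versus-the-rest procedure can be sketched on made-up three-class labels. Passing `average=None` to scikit-learn's `precision_score` and `recall_score` returns one value per class, which should match the manual loop:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy three-class labels, made up for illustration
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0, 2])

# Manual version: pick one class as "positive", lump the rest as "negative"
manual_precision = []
for cls in [0, 1, 2]:
    true_bin = (y_true == cls).astype(int)
    pred_bin = (y_hat == cls).astype(int)
    manual_precision.append(precision_score(true_bin, pred_bin))

# scikit-learn does the same thing when average=None
per_class_precision = precision_score(y_true, y_hat, average=None)
per_class_recall = recall_score(y_true, y_hat, average=None)

print(manual_precision)
print(per_class_precision)
print(per_class_recall)
```

These per-class values are what the top rows of `classification_report` display for a multiclass problem.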

In [None]:
from sklearn.datasets import load_iris

In [None]:
data_dict = load_iris()
X = data_dict['data']
features = pd.DataFrame(X, columns = data_dict['feature_names'])
#print(data_dict.DESCR)
features.head()

In [None]:
y = data_dict['target']
data_dict['target_names']

- 0 = setosa
- 1 = versicolor
- 2 = virginica

In [None]:
# train-test split 
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(features, y, test_size = 0.3, random_state =42)

In [None]:
# Scale and transform
iris_scaler = StandardScaler()
X_train_iris_sc = iris_scaler.fit_transform(X_train_iris)
X_test_iris_sc = iris_scaler.transform(X_test_iris)

In [None]:
# fit model and get predictions
iris_model = LogisticRegression(max_iter = 10000)
iris_model.fit(X_train_iris_sc, y_train_iris)
y_pred_iris = iris_model.predict(X_test_iris_sc)

Our confusion matrix for the multiclass iris problem.

In [None]:
#plot_confusion_matrix(iris_model, X_test_iris_sc,y_test_iris);
ConfusionMatrixDisplay.from_estimator(iris_model, X_test_iris_sc, 
                      y_test_iris);


In [None]:
# true labels first, then predictions; swapping them swaps precision and recall
print(classification_report(y_test_iris, y_pred_iris))

Some issues with assessing the quality of a model solely on these metrics:
- Need to think a little bit more carefully about probabilities and classification thresholds
- The receiver operating characteristic (ROC) curve (up next)
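
As a preview of why thresholds matter: `.predict()` on a logistic regression labels a sample positive when its predicted probability is above 0.5, but `predict_proba` lets us choose a different cutoff. The sketch below uses synthetic imbalanced data from `make_classification`; the dataset and the three thresholds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives
X_syn, y_syn = make_classification(n_samples=2000, weights=[0.9, 0.1],
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn,
                                          random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of the positive class (column 1)
proba_pos = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in [0.8, 0.5, 0.2]:
    # Label a sample positive when its probability reaches the threshold
    y_hat = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    prec_t = precision_score(y_te, y_hat, zero_division=0)
    print(threshold, round(prec_t, 3), round(recalls[threshold], 3))
```

Lowering the threshold can only add positive predictions, so recall never goes down as the threshold drops, typically at the cost of precision. A single precision or recall number therefore describes one threshold, not the model as a whole.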

# Summary: Which Metric Should I Care About?


Well, it depends.

Accuracy:
- Pro: Takes into account both false positives and false negatives.
- Con: Can be misleadingly high when there is a significant class imbalance. (A lottery-ticket predictor that *always* predicts a loser will be highly accurate.)

Recall:
- Pro: Highly sensitive to false negatives.
- Con: No sensitivity to false positives.

Precision:
- Pro: Highly sensitive to false positives.
- Con: No sensitivity to false negatives.

F1 Score:
- Harmonic mean of recall and precision.

The nature of your business problem will help you determine which metric matters.

Sometimes false positives are much worse than false negatives: Arguably, a model that compares a sample of crime-scene DNA with the DNA in a city's database of its citizens presents one such case. Here a false positive would mean falsely identifying someone as having been present at a crime scene, whereas a false negative would mean only that we fail to identify someone who really was present at the crime scene as such.

On the other hand, consider a model that inputs X-ray images and predicts the presence of cancer. Here false negatives are surely worse than false positives: A false positive means only that someone without cancer is misdiagnosed as having it, while a false negative means that someone with cancer is misdiagnosed as *not* having it.