# Evaluation: Precision & Recall
## Using the evaluation metrics we have learned, we are going to compare how well several types of classifiers perform
### We are going to use a dataset of handwritten digits (MNIST), which we can load through sklearn. Run the code below to do so.


In [None]:
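import numpy as np
# fetch_mldata has been removed from recent scikit-learn releases; fetch_openml loads MNIST instead
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# OpenML stores the labels as strings, so convert them to integers
X, y = mnist['data'], mnist['target'].astype(np.uint8)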

### Now take a look at the shapes of the X matrix and the y vector

In [None]:
print(X.shape)
print(y.shape)

### Now, let's pick one entry and see what number is written. Use indexing to pick the digit at index 36000

In [None]:
print(X[36000])

### You can use the .reshape(28,28) function and plt.imshow() function with the parameters cmap = matplotlib.cm.binary, interpolation="nearest" to make a plot of the number. Be sure to import matplotlib!

In [None]:
import matplotlib
import matplotlib.pyplot as plt
# Reshape the flat 784-pixel row into a 28x28 image and draw it
plt.imshow(X[36000].reshape(28,28), cmap=matplotlib.cm.binary, interpolation='nearest')
plt.show()

### Use indexing to check whether the plot matches the label stored at index 36000

In [None]:
print(y[36000])

### Now let's split the data into training and test sets so we can run a classification. Instead of using sklearn, use indexing to select the first 60000 entries for training, and the rest for testing.

In [None]:
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

### We are going to make a two-class classifier, so let's restrict the target to just one number, for example 5s. Do this by defining new y training and y testing sets that are True only for the number 5

In [None]:
# Boolean targets: True where the digit is a 5, False otherwise
y5_train, y5_test = (y_train == 5), (y_test == 5)

### Let's train a logistic regression to predict if a number is a 5 or not (remember to use the 'just 5s' y training set!)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y5_train)

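### Does the classifier correctly predict the digit at index 36000 that we picked before?

In [None]:
print(logreg.predict(X[36000].reshape(1, -1)))  # the model's answer: is it a 5?
print(int(y[36000]) == 5)  # the true answer, taken from the label
print("Yes it did. It said it's not a five and it's not.")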

In [None]:
y_pred = logreg.predict(X_test)
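
For a two-class logistic regression, the hard labels returned by `predict` correspond to cutting the predicted probability of the positive class at 0.5. The sketch below illustrates that idea in plain Python. The probabilities are made-up values and `apply_cutoff` is a hypothetical helper, not part of sklearn.

```python
# Hypothetical predicted probabilities of "is a 5" for six digits (made-up values).
probabilities = [0.05, 0.40, 0.49, 0.51, 0.80, 0.97]


def apply_cutoff(probs, cutoff=0.5):
    """Label a sample positive when its probability exceeds the cutoff."""
    return [p > cutoff for p in probs]


default_labels = apply_cutoff(probabilities)      # 0.5 cutoff, matching predict()
strict_labels = apply_cutoff(probabilities, 0.9)  # stricter cutoff: fewer positives

print(default_labels)  # [False, False, False, True, True, True]
print(strict_labels)   # [False, False, False, False, False, True]
```

Raising the cutoff trades recall for precision: fewer digits are flagged as 5s, but the ones that are flagged are more likely to be real 5s.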

### To make some comparisons, we are going to make a very dumb classifier, that never predicts 5s. Build the classifier with the code below, and call it using: never_5_clf = Never5Classifier()

In [None]:
from sklearn.base import BaseEstimator
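class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self  # nothing to learn; returning self follows the sklearn convention
    def predict(self, X):
        # Always predict "not a 5", as a 1-D array with one entry per sample
        return np.zeros(len(X), dtype=bool)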

never_5_clf = Never5Classifier()

### Now let's fit and predict on the testing set using our Never 5 classifier

In [None]:
never_5_clf.fit(X_train, y5_train)
never5_pred = never_5_clf.predict(X_test)
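
A baseline like this is useful because it shows how misleading plain accuracy can be when one class is rare. The sketch below uses toy labels in which one sample in ten is a 5, which is roughly the share of 5s in MNIST. The labels are synthetic and the variable names are illustrative only.

```python
# Toy labels: one sample in ten is a 5 (synthetic data, not MNIST itself).
is_five = [i % 10 == 0 for i in range(1000)]
never_five = [False] * len(is_five)  # baseline that never predicts a 5

# Accuracy: share of samples where the prediction equals the truth.
correct = sum(truth == pred for truth, pred in zip(is_five, never_five))
accuracy = correct / len(is_five)

# Number of actual 5s the baseline manages to detect.
fives_found = sum(1 for truth, pred in zip(is_five, never_five) if truth and pred)

print(accuracy)     # 0.9
print(fives_found)  # 0
```

The baseline is right 90% of the time on this toy data without ever finding a single 5, which is why the rest of the notebook looks at precision and recall instead.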

### Let's compare this to the Logistic Regression. Examine the confusion matrix, precision, recall, and F1 score for each. What is the probability cutoff you are using to decide the classes?

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [None]:
# Logistic Regression
print("Confusion Matrix for the Logistic Regression")
print(confusion_matrix(y5_test, y_pred))
print("Precision", precision_score(y5_test, y_pred))
print("Recall", recall_score(y5_test, y_pred))
print("F-Score", f1_score(y5_test, y_pred))

In [None]:
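# Never 5 Classifier
print("Confusion Matrix for the Never 5")
print(confusion_matrix(y5_test, never5_pred))
# No positives are ever predicted, so precision is undefined; zero_division=0 reports it as 0
print("Precision", precision_score(y5_test, never5_pred, zero_division=0))
print("Recall", recall_score(y5_test, never5_pred))
print("F-Score", f1_score(y5_test, never5_pred, zero_division=0))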
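
To see where these numbers come from, the metrics can be recomputed by hand from the four confusion-matrix counts. This is a minimal sketch on made-up labels. `binary_metrics` is a hypothetical helper, and reporting undefined ratios as 0.0 is an assumption made here for convenience.

```python
def binary_metrics(y_true, y_pred):
    """Compute the confusion matrix, precision, recall and F1 for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    # Undefined ratios (division by zero) are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Same layout as sklearn's confusion_matrix: rows are truth, columns are predictions.
    return {"matrix": [[tn, fp], [fn, tp]], "precision": precision, "recall": recall, "f1": f1}


# Made-up example: four real 5s among ten digits.
truth = [True, True, True, True, False, False, False, False, False, False]
model = [True, True, True, False, True, False, False, False, False, False]
never = [False] * len(truth)

model_metrics = binary_metrics(truth, model)
never_metrics = binary_metrics(truth, never)
print(model_metrics)
print(never_metrics)
```

On this toy data the model scores 0.75 on all three metrics, while the never-positive baseline scores 0.0 on all three because it has no true positives.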

### What are the differences you see? Without knowing what each model is, what can these metrics tell you about how well each works?

In [None]:
print("The biggest difference is in the confusion matrix. In the second one there are no True Positives or False Positives, only True Negatives and False Negatives")
print("Without knowing which model is being used, I would choose the first one.")
print("From the metrics we can see the F-Score is higher on the first one, so it balances precision and recall better and is more reliable")

### Now let's examine the ROC curve for each. Use the roc_curve function from sklearn.metrics to help plot the curve for each

In [None]:
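from sklearn.metrics import roc_curve

In [None]:
# Logistic Regression: score with the predicted probability of the positive class,
# not the hard labels, so the curve has one point per threshold
y_scores = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y5_test, y_scores)
plt.plot(fpr, tpr, 'b')
plt.plot([0, 1], [0, 1], 'r--')  # chance line
plt.ylabel('TP Rate')
plt.xlabel('FP Rate')
plt.show()

In [None]:
# Never 5 Classifier: constant predictions, so the curve is only the two endpoints
fpr_never5, tpr_never5, _ = roc_curve(y5_test, np.ravel(never5_pred).astype(float))
plt.plot(fpr_never5, tpr_never5, 'b')
plt.plot([0, 1], [0, 1], 'r--')  # chance line
plt.ylabel('TP Rate')
plt.xlabel('FP Rate')
plt.show()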
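
The idea behind a ROC curve can be sketched without sklearn: sweep the cutoff from strict to loose and record the false positive rate and true positive rate at each step. The example uses made-up scores. `roc_points` is a hypothetical helper and assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct cutoff, from strictest to loosest."""
    positives = sum(1 for t in y_true if t)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t and s >= cutoff)
        fp = sum(1 for t, s in zip(y_true, scores) if not t and s >= cutoff)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up labels and scores for six digits.
y_true = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

curve = roc_points(y_true, scores)
constant_curve = roc_points(y_true, [0.0] * len(y_true))  # a never-positive style scorer

print(curve)
print(constant_curve)  # [(0.0, 0.0), (1.0, 1.0)]
```

A scorer that gives every sample the same value has only one cutoff to sweep, so its curve collapses to the diagonal between the two endpoints.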

### Now find the roc_auc_score for each. 

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
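# Use the predicted probabilities rather than the hard labels so the AUC reflects every threshold
print("AUC Score for Logistic Regression", roc_auc_score(y5_test, logreg.predict_proba(X_test)[:, 1]))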

In [None]:
print("AUC Score for Never 5 Classifier", roc_auc_score(y5_test, never5_pred))

### What does this metric tell you? Which classifier works better with this metric in mind?

In [None]:
print("I would still go with the first one. The second one has an AUC of 0.5, which is no better than random guessing")
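
One way to read the AUC is as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it that way on made-up scores. `pairwise_auc` is a hypothetical helper, written for clarity rather than speed, and it assumes both classes are present.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for six digits.
y_true = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

model_auc = pairwise_auc(y_true, scores)                   # 8 of 9 pairs ranked correctly
constant_auc = pairwise_auc(y_true, [0.0] * len(y_true))   # every pair is a tie

print(model_auc)
print(constant_auc)  # 0.5
```

A scorer that cannot separate any pair, such as one that always outputs the same value, lands at exactly 0.5, which matches the chance diagonal on the ROC plot.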