# Week 1: MSE and R^2, Accuracy and Error, Precision and Recall

#### Making Meaningful Predictions from Data
This week we were introduced to three sets of concepts. We looked at MSE and R^2 and how they relate to one another. We learned how to measure Accuracy and Error mathematically. Finally, we learned about Precision and Recall in terms of true/false positives and negatives.

## Part 1: MSE and R^2

We are going to work with a 2D example of these topics so that they are easy to visualize. In your capstone project at the end of this series you will apply the same ideas to your whole dataset. This example is deliberately simpler, as an introduction to the topics.

In [1]:
import numpy as np
from sklearn import linear_model

### The "Data"

In this example we are going to use a 2D numpy array to calculate two separate values, MSE and R^2. The array defined below will be the "data" for this example.

###### Note: Because this example uses numpy arrays, most of our answers will be returned in array form. Don't worry! This doesn't affect any of our answers!

In [2]:
model = linear_model.LinearRegression()
exArray = np.array([[0], [1], [3], [2],
                   [2],[3],[4], [0]])

In [3]:
print(exArray.shape) #Check the dimensions of exArray using numpy's shape attribute
#If you are unfamiliar with this, don't worry; it is just for this example

exArray[0]

(8, 1)


array([0])

In [4]:
#Note that X and y must have the same number of rows!
y = exArray[4:]
X = exArray[:4]

In [5]:
#We have to fit our model!
model.fit(X,y)
print('Coefficient: \n', model.coef_)

Coefficient: 
 [[0.3]]


In [6]:
#The values of these test arrays are unimportant, so the values listed here are random.
#Try changing them around and running through the below code again to see how they affect MSE and R^2

XTest = np.array([[-2]])
yTest = np.array([[4]])

In [7]:
predictions = model.predict(XTest)

In [15]:
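#TODO Calculate the squared differences as in lecture.
#Note: zip stops at the shorter input. predictions has a single row,
#so only predictions[0] and y[0] are compared here.

differences = [(p-t)**2 for (p,t) in zip(predictions,y)]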

The MSE is just the average (Mean) of these squared differences:

In [16]:
#TODO Calculate the MSE

MSE = np.sum(differences)/len(differences)

print("MSE = " + str(MSE)) #Answer should be .64

MSE = 0.6400000000000005
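As a minimal sketch of the same calculation without numpy, the helper below averages squared differences over paired values. The name `mean_squared_error` is a hypothetical helper defined here, not the scikit-learn function, and the prediction value 1.2 is an assumption inferred from the printed coefficient and MSE (the notebook does not print the prediction itself).

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between paired predictions and targets."""
    if len(predictions) != len(targets):
        # Refuse mismatched lengths instead of silently truncating like zip does
        raise ValueError("predictions and targets must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)


# One pair, as in the cell above: an assumed prediction of 1.2 against y[0] = 2
mse_single = mean_squared_error([1.2], [2])
print(round(mse_single, 2))  # 0.64
```

Raising on mismatched lengths is a design choice for this sketch: it makes it obvious when a one-row prediction is being compared against a four-row target.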


As we saw in the lectures, the R^2 (and the FVU, or "Fraction of Variance Unexplained") normalize the Mean Squared Error based on the variance of the data:

In [17]:
#TODO Calculate R2
FVU = MSE / np.var(y)

#SOLN (there are more ways than this to solve this)
R2 = 1-FVU

print("R2 = " + str(R2)) #Answer should be roughly .707...

R2 = 0.7074285714285713
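The relationship used above is FVU = MSE / Var(y) and R^2 = 1 - FVU. The sketch below repeats it with the standard library, assuming the population variance (which is what `np.var` computes by default). The helper name `r_squared` is hypothetical.

```python
from statistics import pvariance


def r_squared(mse, targets):
    """R^2 = 1 - FVU, where FVU = MSE / population variance of the targets."""
    fvu = mse / pvariance(targets)
    return 1 - fvu


y_values = [2.0, 3.0, 4.0, 0.0]  # the target values used in this notebook
print(pvariance(y_values))  # 2.1875
print(round(r_squared(0.64, y_values), 4))  # 0.7074
```

An R^2 of 1 means the MSE is zero, while an R^2 of 0 means the model does no better than always predicting the mean of y.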


## Part 2: Accuracy and Error

To start measuring accuracy and error, we must first replace our output variable with a binary variable indicating whether each value in the array meets a chosen threshold. For this example we will check whether the values are greater than or equal to 2.

In [18]:
y_class = [(num >= 2) for num in y]
modelA = linear_model.LogisticRegression()
modelA.fit(X, y_class)
y_class

  y = column_or_1d(y, warn=True)


[array([ True]), array([ True]), array([ True]), array([False])]

In [19]:
predictionsA = modelA.predict(X)

In [20]:
correct = predictionsA == y_class

### Classification Diagnostics: Accuracy

In [27]:
#TODO Calculate Accuracy in any way not involving TP, TN, FP, FN

#Flatten y_class so the comparison is element-wise instead of broadcasting to 4x4
accuracy = np.mean(predictionsA == np.ravel(y_class))

print("Accuracy = " + str(accuracy)) #Answer should be .75

Accuracy = 0.75

In [25]:
sum(correct)/len(correct)

array([0.75, 0.75, 0.75, 0.75])

In [24]:
correct

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [False, False, False, False]])
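The 4x4 shape of `correct` comes from numpy broadcasting: `predictionsA` has shape (4,), while `y_class` is a list of four length-1 arrays, which converts to shape (4, 1). A minimal sketch of accuracy on flat sequences avoids this. The helper name `accuracy_score_flat` is hypothetical, and the predictions (all True) are read off from the printout above.

```python
def accuracy_score_flat(predictions, labels):
    """Fraction of predictions that equal their label, compared pair by pair."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    matches = [p == l for p, l in zip(predictions, labels)]
    return sum(matches) / len(matches)


predicted = [True, True, True, True]  # every example was predicted True
actual = [True, True, True, False]    # y >= 2 for y = 2, 3, 4, 0
print(accuracy_score_flat(predicted, actual))  # 0.75
```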

In [28]:
TP = sum([(p and l) for (p,l) in zip(predictionsA, y_class)])
FP = sum([(p and not l) for (p,l) in zip(predictionsA, y_class)])
TN = sum([(not p and not l) for (p,l) in zip(predictionsA, y_class)])
FN = sum([(not p and l) for (p,l) in zip(predictionsA, y_class)])

In [29]:
print("TP = " + str(TP))
print("FP = " + str(FP))
print("TN = " + str(TN))
print("FN = " + str(FN))

TP = [3]
FP = 1
TN = 0
FN = 0
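TP prints as `[3]` while the other counts print as plain integers because `p and l` returns the array `l` whenever `p` is true, whereas `not l` always returns a plain Python bool. A sketch that counts on flat boolean lists gives plain integers throughout. The helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(predictions, labels):
    """Return (TP, FP, TN, FN) for paired boolean predictions and labels."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    tn = sum(1 for p, l in zip(predictions, labels) if not p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    return tp, fp, tn, fn


predicted = [True, True, True, True]  # every example was predicted True
actual = [True, True, True, False]    # y >= 2 for y = 2, 3, 4, 0
tp, fp, tn, fn = confusion_counts(predicted, actual)
print(tp, fp, tn, fn)  # 3 1 0 0
```

The four counts always add up to the number of examples, which is a quick sanity check on any confusion table.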


In [31]:
#TODO Show that calculating accuracy using true/false positives/negatives will get the same answer

TFAccuracy = (TP+TN)/(TP+FP+TN+FN)
TFAccuracy[0]

0.75

In [32]:
TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
print(TPR[0], TNR)

1.0 0.0


In [33]:
BER = 1 - 1/2 * (TPR + TNR)
print("Balanced error rate = " + str(BER)) #Answer should be .5

Balanced error rate = [0.5]
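The balanced error rate averages the error on each class separately: BER = 1 - (TPR + TNR) / 2. The sketch below computes it from plain integer counts. The helper name `balanced_error_rate` is hypothetical, and treating a rate as 0.0 when its class has no examples is an assumption made for this sketch, not something defined in the notebook.

```python
def balanced_error_rate(tp, fp, tn, fn):
    """BER = 1 - (TPR + TNR) / 2, computed from confusion counts."""
    # Assumption: a rate is taken as 0.0 when its class has no examples
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return 1 - 0.5 * (tpr + tnr)


print(balanced_error_rate(3, 1, 0, 0))  # 0.5
```

A BER of 0.5 is what any classifier that always predicts the same class gets, so this model's 0.75 accuracy reflects the imbalance in the labels more than its ability to separate the classes.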


## Part 3: Precision and Recall

Precision and Recall are often used to rank the performance of models. Both can be defined in terms of True Positives, False Positives, and False Negatives. (Given the total number of examples, knowing any three of the true/false positive/negative counts gives you the fourth.)

In [None]:
#TODO Calculate Precision and recall in the terms defined in lecture.

precision = (TP)/(TP+FP)
recall = (TP)/(TP+FN)

precision, recall #Answers should be (.75, 1.0)

The F1 score is the harmonic mean of precision and recall. The harmonic mean is pulled toward the smaller of the two values, which is useful since it's easy to get either good precision or good recall in isolation, but hard for both values to be high simultaneously.

In [None]:
F1 = 2 * (precision*recall) / (precision + recall)
F1 #Answer should be roughly .857...
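As a quick check of the numbers in this part, here is a minimal sketch that computes all three quantities from plain integer counts. The helper name `precision_recall_f1` is hypothetical, and the sketch assumes at least one predicted positive and at least one actual positive, so no denominator is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from confusion counts."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision_value, recall_value, f1_value = precision_recall_f1(3, 1, 0)
print(precision_value, recall_value, round(f1_value, 3))  # 0.75 1.0 0.857
```

F1 can also be written directly in counts as 2TP / (2TP + FP + FN), which here is 6/7.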

## You're All Done!

We have now introduced some basic ideas of classification and ranking. Next week you will use them on a dataset of reviews and see how these ideas can be applied to a proper dataset!