# Chapter 3: Predictive Model Building: Balancing Performance, Complexity, and Big Data

## Assessing Performance of Predictive Models
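For regression problems, common error measures over $m$ examples, where $y_i$ is the actual target and $pred(x_i)$ is the model's prediction, are:
- Mean squared error (MSE): $$\frac{1}{m} \sum_{i=1}^m (y_i - pred(x_i))^2$$
- Mean absolute error (MAE): $$\frac{1}{m} \sum_{i=1}^m |y_i - pred(x_i)|$$
- Root mean squared error (RMSE): $$\sqrt{\frac{1}{m} \sum_{i=1}^m (y_i - pred(x_i))^2}$$, the square root of MSE, expressed in the same units as the target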

## Achieving Harmony Between Model and Data
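- Ordinary least squares (OLS) regression can overfit, particularly when there are many attributes relative to the number of examples
- Two methods for adjusting OLS to control overfitting:
1. **Forward stepwise regression**: controls overfitting by adding attributes one at a time and limiting how many are used
2. **Ridge regression**: controls overfitting by penalizing the size of the regression coefficients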
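
To make the coefficient penalty concrete, here is a minimal sketch of ridge regression for a single attribute with no intercept. It is not the chapter's implementation: the helper name `ridge_slope`, the data, and the `alpha` values are all made up for illustration. With `alpha = 0` the result is the ordinary least squares slope; larger values of `alpha` shrink the slope toward zero.

```python
# Illustrative sketch only: one-attribute ridge regression without an intercept.
# The function name, data, and alpha values are hypothetical.

def ridge_slope(x, y, alpha):
    """Return the slope b that minimizes sum((y_i - b*x_i)**2) + alpha * b**2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the derivative with respect to b to zero gives b = sxy / (sxx + alpha)
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]

# alpha = 0.0 reproduces OLS; larger alpha shrinks the coefficient
for alpha in [0.0, 1.0, 10.0]:
    print("alpha =", alpha, "slope =", ridge_slope(x, y, alpha))
```

In practice the penalty strength is usually chosen by comparing out-of-sample error across several candidate values rather than being fixed in advance.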

---

## Comparison of MSE, MAE and RMSE

In [1]:
__author__ = 'shawn_ng'

#here are some made-up numbers to start with
target = [1.5, 2.1, 3.3, -4.7, -2.3, 0.75]
prediction = [0.5, 1.5, 2.1, -2.2, 0.1, -0.5]

error = []
for i in range(len(target)):
    error.append(target[i] - prediction[i])

#print the errors
print("Errors")
print(error)
#ans: [1.0, 0.60000000000000009, 1.1999999999999997, -2.5,
#-2.3999999999999999, 1.25]

#calculate the squared errors and absolute value of errors
squaredError = []
absError = []
for val in error:
    squaredError.append(val*val)
    absError.append(abs(val))

#print squared errors and absolute value of errors
print("Squared Error")
print(squaredError)
#ans: [1.0, 0.3600000000000001, 1.4399999999999993, 6.25,
#5.7599999999999998, 1.5625]

print("Absolute Value of Error")
print(absError)
#ans: [1.0, 0.60000000000000009, 1.1999999999999997, 2.5,
#2.3999999999999999, 1.25]

#calculate and print mean squared error MSE
print("MSE = ", sum(squaredError)/len(squaredError))
#ans: 2.72875

from math import sqrt
#calculate and print square root of MSE (RMSE)
print("RMSE = ", sqrt(sum(squaredError)/len(squaredError)))
#ans: 1.65189285367

#calculate and print mean absolute error MAE
print("MAE = ", sum(absError)/len(absError))
#ans: 1.49166666667

#compare MSE to target variance
targetDeviation = []
targetMean = sum(target)/len(target)
for val in target:
    targetDeviation.append((val - targetMean)*(val - targetMean))

#print the target variance
print("Target Variance = ", sum(targetDeviation)/len(targetDeviation))
#ans: 7.5703472222222219

#print the target standard deviation (square root of variance)
print("Target Standard Deviation = ",
      sqrt(sum(targetDeviation)/len(targetDeviation)))
#ans: 2.7514263977475797

Errors 
[1.0, 0.6000000000000001, 1.1999999999999997, -2.5, -2.4, 1.25]
Squared Error
[1.0, 0.3600000000000001, 1.4399999999999993, 6.25, 5.76, 1.5625]
Absolute Value of Error
[1.0, 0.6000000000000001, 1.1999999999999997, 2.5, 2.4, 1.25]
MSE =  2.72875
RMSE =  1.651892853668179
MAE =  1.4916666666666665
Target Variance =  7.570347222222222
Target Standard Deviation =  2.7514263977475797
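
The cell above prints the target variance next to the MSE but does not combine the two. One way to read them together, offered here as an interpretation rather than as part of the original cell, is to take the ratio: if the MSE is close to the target variance, the predictions do little better than always guessing the target mean. The sketch below recomputes the same quantities from the same made-up numbers; the variable names are illustrative.

```python
from math import sqrt

# Same made-up numbers as in the cell above
target = [1.5, 2.1, 3.3, -4.7, -2.3, 0.75]
prediction = [0.5, 1.5, 2.1, -2.2, 0.1, -0.5]

n = len(target)
mse = sum((t - p) ** 2 for t, p in zip(target, prediction)) / n

target_mean = sum(target) / n
variance = sum((t - target_mean) ** 2 for t in target) / n

# Ratio near 1.0: little better than predicting the mean; near 0.0: small error
print("MSE / Target Variance = ", mse / variance)
# The same comparison in the units of the target
print("RMSE / Target Standard Deviation = ", sqrt(mse) / sqrt(variance))
```

For these numbers the first ratio is roughly 0.36 and the second roughly 0.60, so the prediction error is about 60 percent of the spread of the targets.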


## Measuring Performance for Classifier Trained on Rocks-Versus-Mines
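The cell in this section defines a `confusionMatrix` function that counts true positives, false negatives, false positives, and true negatives at a given threshold. As a rough illustration of how those counts move as the threshold changes, the sketch below uses a hypothetical helper, `confusion_counts`, on made-up scores and labels. It assumes that a score strictly above the threshold counts as a positive prediction, and it derives the true positive rate and false positive rate that an ROC curve plots.

```python
# Illustrative sketch only: the helper name, scores, and labels are hypothetical.

def confusion_counts(scores, labels, threshold):
    """Return (tp, fn, fp, tn) for real-valued scores against 0/1 labels."""
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold  # assumption: strict inequality
        if label == 1:
            if predicted_positive:
                tp += 1  # positive example, predicted positive
            else:
                fn += 1  # positive example, predicted negative
        else:
            if predicted_positive:
                fp += 1  # negative example, predicted positive
            else:
                tn += 1  # negative example, predicted negative
    return tp, fn, fp, tn

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1]

for threshold in [0.25, 0.5, 0.75]:
    tp, fn, fp, tn = confusion_counts(scores, labels, threshold)
    tpr = tp / (tp + fn)  # true positive rate
    fpr = fp / (fp + tn)  # false positive rate
    print("threshold =", threshold, "counts =", (tp, fn, fp, tn),
          "TPR =", tpr, "FPR =", fpr)
```

Raising the threshold trades false positives for false negatives, which is why a single threshold gives only one point on the ROC curve.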

In [None]:
__author__ = 'shawn_ng'
#use the scikit-learn package to build a classifier on the rocks-versus-mines data
#and assess classifier performance
import urllib.request as urllib2 #Python 3 replacement for the Python 2 urllib2 module
import numpy
import random
from sklearn import datasets, linear_model
from sklearn.metrics import roc_curve, auc
import pylab as pl

def confusionMatrix(predicted, actual, threshold):
    """Count [tp, fn, fp, tn] for scores in predicted against 0/1 labels in actual.

    A score strictly greater than threshold counts as a positive prediction.
    Returns -1 if the two lists differ in length.
    """
    if len(predicted) != len(actual): return -1
    tp = 0.0
    fp = 0.0
    tn = 0.0
    fn = 0.0
    for i in range(len(actual)):
        if actual[i] > 0.5: #labels that are 1.0 (positive examples)
            if predicted[i] > threshold:
                tp += 1.0 #correctly predicted positive
            else:
                fn += 1.0 #incorrectly predicted negative
        else: #labels that are 0.0 (negative examples)
            if predicted[i] > threshold:
                fp += 1.0 #incorrectly predicted positive
            else:
                tn += 1.0 #correctly predicted negative
    rtn = [tp, fn, fp, tn]
    return rtn