# Machine Learning

We’ll use machine learning to refer to creating and using models that are learned from data.

We’ll look at both **supervised models** (in which there is a set of data labeled with the correct answers to learn from) and **unsupervised models** (in which there are no such labels).
There are other types as well, such as semisupervised (in which only some of the data are labeled) and online (in which the model needs to continuously adjust to newly arriving data).

A common danger in machine learning is **overfitting**: producing a model that performs well on the data you train it on but that generalizes poorly to any new data.
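
To make the idea concrete, here is a minimal sketch of an overfit "model" that simply memorizes its training data. The toy task (flagging multiples of 3) and the names `true_label`, `memorizer`, and `fraction_correct` are assumptions made up for this illustration; they are not part of the stock data example.

```python
# Hypothetical toy task: label an integer True when it is a multiple of 3.
def true_label(x):
    return x % 3 == 0

train_xs = list(range(0, 50))     # inputs the model gets to see
test_xs = list(range(50, 100))    # held-out inputs it never sees

# "Training" just memorizes every (input, label) pair.
memory = {x: true_label(x) for x in train_xs}

def memorizer(x):
    """return the memorized label, or guess False for unseen inputs"""
    return memory.get(x, False)

def fraction_correct(model, xs):
    """fraction of inputs in xs that the model labels correctly"""
    return sum(model(x) == true_label(x) for x in xs) / len(xs)

train_acc = fraction_correct(memorizer, train_xs)
test_acc = fraction_correct(memorizer, test_xs)
print("train accuracy", train_acc)
print("test accuracy", test_acc)
```

The memorizer is perfect on the data it was trained on, yet on new inputs it does no better than always guessing False, because it never learned the underlying rule.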

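The other side of this is **underfitting**: producing a model that doesn’t perform well even on the training data. Typically when this happens you decide your model isn’t good enough and keep looking for a better one.

The simplest way to guard against overfitting is to train the model on one part of the data and measure its performance on another part, which is what the split below does.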

In [4]:
import random

def split_data(data, prob):
    """split data into fractions: [prob, 1-prob]"""
    
    results = [], []
    
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

# example

import dateutil.parser
import csv

# Read the tab-separated file; the first row is the header.
with open("../data/stocks.txt", newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)

In [6]:
data[0:20]

[['symbol', 'date', 'closing_price'],
 ['AAPL', '2015-01-23', '112.98'],
 ['AAPL', '2015-01-22', '112.4'],
 ['AAPL', '2015-01-21', '109.55'],
 ['AAPL', '2015-01-20', '108.72'],
 ['AAPL', '2015-01-16', '105.99'],
 ['AAPL', '2015-01-15', '106.82'],
 ['AAPL', '2015-01-14', '109.8'],
 ['AAPL', '2015-01-13', '110.22'],
 ['AAPL', '2015-01-12', '109.25'],
 ['AAPL', '2015-01-09', '112.01'],
 ['AAPL', '2015-01-08', '111.89'],
 ['AAPL', '2015-01-07', '107.75'],
 ['AAPL', '2015-01-06', '106.26'],
 ['AAPL', '2015-01-05', '106.25'],
 ['AAPL', '2015-01-02', '109.33'],
 ['AAPL', '2014-12-31', '110.38'],
 ['AAPL', '2014-12-30', '112.52'],
 ['AAPL', '2014-12-29', '113.91'],
 ['AAPL', '2014-12-26', '113.99']]

In [7]:
print("Data rows ", len(data[1:]))

Data rows  16555


In [11]:
# split_data returns [prob, 1 - prob], so the first list (about 30%) is the test set
test_data, train_data = split_data(data[1:], 0.30)

print("Rows in train data ", len(train_data))
print("Rows in test data ", len(test_data))

Rows in train data  11650
Rows in test data  4905
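
Because the split is random, the exact row counts change from run to run. If a repeatable split is needed, one option is to give the function its own seeded generator. The sketch below assumes a hypothetical helper `split_data_seeded` and a hypothetical `seed` parameter; it uses stand-in rows rather than the stock data.

```python
import random

def split_data_seeded(data, prob, seed=0):
    """like split_data, but with its own seeded generator (hypothetical helper)"""
    rng = random.Random(seed)          # private generator, global state untouched
    results = [], []
    for row in data:
        results[0 if rng.random() < prob else 1].append(row)
    return results

rows = list(range(1000))               # stand-in rows, not the stock data
first = split_data_seeded(rows, 0.30, seed=42)
second = split_data_seeded(rows, 0.30, seed=42)
print(first == second)                 # same seed, same split
print(len(first[0]) + len(first[1]))   # every row lands in exactly one part
```

Using a separate `random.Random` instance rather than calling `random.seed` keeps the rest of the program’s random numbers unaffected.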


In [12]:
test_data[0:10]

[['AAPL', '2015-01-23', '112.98'],
 ['AAPL', '2015-01-14', '109.8'],
 ['AAPL', '2015-01-06', '106.26'],
 ['AAPL', '2015-01-02', '109.33'],
 ['AAPL', '2014-12-29', '113.91'],
 ['AAPL', '2014-12-26', '113.99'],
 ['AAPL', '2014-12-23', '112.54'],
 ['AAPL', '2014-12-22', '112.94'],
 ['AAPL', '2014-12-19', '111.78'],
 ['AAPL', '2014-12-16', '106.75']]

In [19]:
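def train_test_split(x, y, test_pct):
    """split paired inputs x and labels y into train and test sets"""

    # pair each input with its label so they are split together
    pairs = zip(x, y)

    train, test = split_data(pairs, 1 - test_pct)
    # "unzip" the pairs back into separate inputs and labels
    x_train, y_train = zip(*train)
    x_test, y_test = zip(*test)

    return x_train, x_test, y_train, y_test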


x = [[row[0], row[1]] for row in data[1:]]
y = [row[2] for row in data[1:]]

In [20]:
train_x, test_x, train_y, test_y = train_test_split(x, y, 0.30)

In [21]:
print("Rows in train x and train y ", len(train_x), len(train_y))

Rows in train x and train y  11618 11618


In [23]:
print("Rows in test x and test y ", len(test_x), len(test_y))

Rows in test x and test y  4937 4937


# Correctness

- **True Positive**: This message is spam, and we correctly predicted spam
- **False Positive** (Type 1 error): This message is not spam, but we predicted spam
- **False Negative** (Type 2 error): This message is spam, but we predicted not spam
- **True Negative**: This message is not spam, and we correctly predicted not spam
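
The four categories above can be counted directly from paired labels and predictions. This is a minimal sketch; the function name `confusion_counts` and the six example messages are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """count (tp, fp, fn, tn) from paired boolean labels and predictions"""
    tp = fp = fn = tn = 0
    for is_spam, said_spam in zip(actual, predicted):
        if is_spam and said_spam:
            tp += 1        # spam, predicted spam
        elif not is_spam and said_spam:
            fp += 1        # not spam, predicted spam
        elif is_spam and not said_spam:
            fn += 1        # spam, predicted not spam
        else:
            tn += 1        # not spam, predicted not spam
    return tp, fp, fn, tn

# Made-up labels for six messages (True means spam).
actual    = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
print(confusion_counts(actual, predicted))
```

The returned tuple uses the same `(tp, fp, fn, tn)` order as the metric functions in this section.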

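As an example, consider a test that predicts leukemia if and only if a baby is named Luke. Suppose that:

- 5 babies out of 1000 are named Luke
- 14 babies out of 1000 have leukemia

We assume these two factors are independent, so in a population of 1,000,000 babies we would expect the following counts: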


|test     |leukemia|no leukemia|total   |
|---------|--------|-----------|--------|
|Luke     |     70 |      4930 | 5000   |
|not Luke |  13930 |    981070 | 995000 |
|total    |  14000 |    986000 |1000000 |
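
The counts in the table follow from the two rates and the independence assumption. A short sketch of the arithmetic, using integer math to avoid rounding; the variable names are chosen here for illustration:

```python
population = 1_000_000
luke_per_1000 = 5
leukemia_per_1000 = 14

lukes = population * luke_per_1000 // 1000          # babies named Luke
leukemia = population * leukemia_per_1000 // 1000   # babies with leukemia

# Independence: P(Luke and leukemia) = P(Luke) * P(leukemia)
tp = population * luke_per_1000 * leukemia_per_1000 // (1000 * 1000)
fp = lukes - tp                     # named Luke, no leukemia
fn = leukemia - tp                  # not named Luke, leukemia
tn = population - tp - fp - fn      # not named Luke, no leukemia
print(tp, fp, fn, tn)
```

These four values are the cells of the table and the arguments passed to the metric functions in this section.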

In [24]:
def accuracy(tp, fp, fn, tn):
    
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print("Accuracy ", accuracy(70, 4930, 13930, 981070))

Accuracy  0.98114


That seems like a pretty impressive number. But clearly this is not a good test, which means that we probably shouldn’t put a lot of credence in raw accuracy.

It’s common to look at the combination of precision and recall.

- **Precision** measures how accurate our positive predictions are: the fraction of predicted positives that are actually positive.

- **Recall** measures what fraction of the actual positives our model identified.


In [26]:
def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

print("Precision ", precision(70, 4930, 13930, 981070))

Precision  0.014


In [27]:
def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

print("Recall ", recall(70, 4930, 13930, 981070))

Recall  0.005


These are both terrible numbers, **reflecting that this is a terrible model**.

In [28]:
def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

print("F1 Score ", f1_score(70, 4930, 13930, 981070))

F1 Score  0.00736842105263158


The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0 (see [F1 score](https://en.wikipedia.org/wiki/F1_score)).
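
Since `2 * p * r / (p + r)` is a harmonic mean, it stays low whenever either precision or recall is low. The sketch below illustrates this and also guards against division by zero, which the functions above would hit when a model makes no positive predictions. The helpers `safe_ratio` and `safe_f1` are hypothetical, and returning 0.0 for an undefined ratio is an assumed convention, not the only reasonable one.

```python
def safe_ratio(numerator, denominator):
    """return numerator / denominator, or 0.0 when the denominator is zero"""
    return numerator / denominator if denominator else 0.0

def safe_f1(tp, fp, fn, tn):
    p = safe_ratio(tp, tp + fp)        # precision
    r = safe_ratio(tp, tp + fn)        # recall
    return safe_ratio(2 * p * r, p + r)

print(safe_f1(0, 0, 10, 90))           # no positive predictions at all
print(safe_f1(1, 0, 99, 900))          # perfect precision, 1% recall
print(safe_f1(50, 0, 0, 50))           # perfect precision and recall
```

The second call shows that perfect precision cannot rescue the score when recall is only 1%: the result stays close to 0.02.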