# scikit-learn

Old packages:
```
pandas
numpy
matplotlib
seaborn
```

New packages:
```
pip install scikit-learn scipy
```

We will be using `scikit-learn` for linear regression, that is, fitting data to a model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs
import math

%matplotlib inline

# Machine Learning

_Machine learning_ is a **very large** topic, so we're only going to survey it.

Machine learning: "a field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel)

Different types:
* **supervised** - prediction/regression, classification
* **unsupervised** - clustering, organizing

Machine learning "involves observing a set of examples that represent incomplete information about some statistical phenomenon, and then attempting to infer something about the process that generated those examples." (John Guttag, _Introduction to Computation and Programming Using Python_)

(A large amount of what comes below comes from _Introduction to Computation and Programming Using Python_.)

**representation** and **generalization**

* **representation** is extracting structure from data
* **generalization** is making _predictions_ from data


In [4]:
dog_breeds = {"Alaskan Malamute": {"height": 24, "weight": 80, "energy": 4},
              "Bichon Frise": {"height": 10, "weight": 9.5, "energy": 4},
              "Irish Wolfhound": {"height": 32, "weight": 120, "energy": 2},
              "Basset Hound": {"height": 14, "weight": 50, "energy": 2}}

x = pd.DataFrame(dog_breeds)
x = x.transpose()
x

Unnamed: 0,energy,height,weight
Alaskan Malamute,4,24,80.0
Basset Hound,2,14,50.0
Bichon Frise,4,10,9.5
Irish Wolfhound,2,32,120.0


In [5]:
set_a = {'Alaskan Malamute', 'Irish Wolfhound'}
set_b = {'Bichon Frise', 'Basset Hound'}


A _feature vector_ can be generated to describe each dog breed. Each element in the vector describes some aspect of the breed.

`malamute = [4, 24, 80]`
`basset_hound = [2, 14, 50]`

If we just wanted to focus on size, we would generate our feature vector to include only height and weight.

`malamute = [24, 80]`
`basset_hound = [14, 50]`

We need to select elements for our feature vector that are relevant to our situation.
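
As a minimal sketch of that selection step, the snippet below builds size-only feature vectors from the breed data shown earlier. The helper `feature_vector` is a hypothetical name introduced here for illustration; it is not part of pandas or scikit-learn.

```python
# Breed measurements copied from the dog_breeds cell earlier in these notes.
dog_breeds = {
    "Alaskan Malamute": {"height": 24, "weight": 80, "energy": 4},
    "Bichon Frise": {"height": 10, "weight": 9.5, "energy": 4},
    "Irish Wolfhound": {"height": 32, "weight": 120, "energy": 2},
    "Basset Hound": {"height": 14, "weight": 50, "energy": 2},
}


def feature_vector(record, features):
    """Return the values of the chosen features, in the order given.

    Hypothetical helper for illustration only.
    """
    return [record[name] for name in features]


# Keep only the features that describe size.
size_features = ["height", "weight"]
size_vectors = {
    breed: feature_vector(stats, size_features)
    for breed, stats in dog_breeds.items()
}

for breed, vector in size_vectors.items():
    print(breed, vector)
```

Changing `size_features` is all it takes to try a different representation of the same breeds.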

In **supervised learning**, we (the programmers) determine the elements to include in the feature vector and provide labeled training examples (we say these examples are in one set, those are in the other). Applications include spam filtering, automatic labeling of documents, and recommending products.

In **unsupervised learning**, we have feature vectors from somewhere, but they are not labeled. The machine looks for structure in the data that we don't necessarily know about, which can provide insights we might not otherwise see.

## Figuring out feature vectors

Most real-world situations have **too much** data. Checking every possible feature is likely to lead to a giant mess. Using too many features can give you a bad statistical model and can also greatly slow down the machine learning process.

### Feature extraction

It's hard, but necessary. Even if we're trying "unsupervised" learning, we need human input at least to figure out what the feature vectors should be.



In [6]:
animals = [{"_name": "Cobra", "egg-laying": True, "scales": True,
            "poisonous": True, "cold-blooded": True, "num_legs": 0}, 
           {"_name": "Rattlesnake", "egg-laying": True, "scales": True,
            "poisonous": True, "cold-blooded": True, "num_legs": 0},
           {"_name": "Boa constrictor", "egg-laying": False, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 0},
           {"_name": "Alligator", "egg-laying": True, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 4},
           {"_name": "Dart frog", "egg-laying": True, "scales": False,
            "poisonous": True, "cold-blooded": False, "num_legs": 4},
           {"_name": "Salmon", "egg-laying": True, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 0},
           {"_name": "Python", "egg-laying": True, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 0}]
pd.DataFrame(animals)


Unnamed: 0,_name,cold-blooded,egg-laying,num_legs,poisonous,scales
0,Cobra,True,True,0,True,True
1,Rattlesnake,True,True,0,True,True
2,Boa constrictor,True,False,0,False,True
3,Alligator,True,True,4,False,True
4,Dart frog,False,True,4,True,False
5,Salmon,True,True,0,False,True
6,Python,True,True,0,False,True


What _features_ are relevant to classifying an animal as a **reptile**?
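
One rough way to explore this question is to check, feature by feature, whether every reptile in the table has the same value and whether any non-reptile shares it. The sketch below assumes the set `reptiles`, which comes from ordinary biological classification rather than from the table itself.

```python
# Feature table copied from the animals cell above.
animals = [
    {"_name": "Cobra", "egg-laying": True, "scales": True,
     "poisonous": True, "cold-blooded": True, "num_legs": 0},
    {"_name": "Rattlesnake", "egg-laying": True, "scales": True,
     "poisonous": True, "cold-blooded": True, "num_legs": 0},
    {"_name": "Boa constrictor", "egg-laying": False, "scales": True,
     "poisonous": False, "cold-blooded": True, "num_legs": 0},
    {"_name": "Alligator", "egg-laying": True, "scales": True,
     "poisonous": False, "cold-blooded": True, "num_legs": 4},
    {"_name": "Dart frog", "egg-laying": True, "scales": False,
     "poisonous": True, "cold-blooded": False, "num_legs": 4},
    {"_name": "Salmon", "egg-laying": True, "scales": True,
     "poisonous": False, "cold-blooded": True, "num_legs": 0},
    {"_name": "Python", "egg-laying": True, "scales": True,
     "poisonous": False, "cold-blooded": True, "num_legs": 0},
]

# ASSUMPTION: labels supplied here for illustration, not present in the table.
reptiles = {"Cobra", "Rattlesnake", "Boa constrictor", "Alligator", "Python"}

features = [name for name in animals[0] if name != "_name"]

consistent = []   # features on which all reptiles agree
separating = []   # consistent features that no non-reptile shares

for feature in features:
    in_reptiles = {a[feature] for a in animals if a["_name"] in reptiles}
    in_others = {a[feature] for a in animals if a["_name"] not in reptiles}
    if len(in_reptiles) == 1:
        consistent.append(feature)
        if in_reptiles.isdisjoint(in_others):
            separating.append(feature)

print("shared by all reptiles:", consistent)
print("also absent from every non-reptile:", separating)
```

With this table, `scales` and `cold-blooded` are shared by every reptile, but the salmon has both as well, so no single feature separates the two groups.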

## Measuring distance

We can measure the similarity of the features by treating them like coordinates in space and calculating the distance between them.

For instance, if we've got a feature vector [2, 4] and a feature vector [6, 2], we calculate the distance between them using the Pythagorean theorem.
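
As a quick worked sketch of that calculation, the two vectors can be treated as points in the plane. The variable names here are chosen for illustration only.

```python
import math

a = [2, 4]
b = [6, 2]

# The legs of the right triangle are the per-coordinate differences.
dx = b[0] - a[0]   # 4
dy = b[1] - a[1]   # -2

# The hypotenuse is the distance: sqrt(16 + 4) = sqrt(20)
distance = math.sqrt(dx ** 2 + dy ** 2)
print(round(distance, 4))  # 4.4721
```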

**Euclidean distance**

First of all, we need to make sure all of our columns are numbers

In [8]:
df = pd.DataFrame(animals)
for col in df.columns:
    # compare strings with !=, not identity; np.int was removed in NumPy 1.24
    if col != "_name":
        df[col] = df[col].astype(int)

df

Unnamed: 0,_name,cold-blooded,egg-laying,num_legs,poisonous,scales
0,Cobra,1,1,0,1,1
1,Rattlesnake,1,1,0,1,1
2,Boa constrictor,1,0,0,0,1
3,Alligator,1,1,4,0,1
4,Dart frog,0,1,4,1,0
5,Salmon,1,1,0,0,1
6,Python,1,1,0,0,1


In [13]:
# convert the transposed version of our DataFrame to a dictionary

df.index = df.pop('_name')
df.T.to_dict()

{'Alligator': {'cold-blooded': 1,
  'egg-laying': 1,
  'num_legs': 4,
  'poisonous': 0,
  'scales': 1},
 'Boa constrictor': {'cold-blooded': 1,
  'egg-laying': 0,
  'num_legs': 0,
  'poisonous': 0,
  'scales': 1},
 'Cobra': {'cold-blooded': 1,
  'egg-laying': 1,
  'num_legs': 0,
  'poisonous': 1,
  'scales': 1},
 'Dart frog': {'cold-blooded': 0,
  'egg-laying': 1,
  'num_legs': 4,
  'poisonous': 1,
  'scales': 0},
 'Python': {'cold-blooded': 1,
  'egg-laying': 1,
  'num_legs': 0,
  'poisonous': 0,
  'scales': 1},
 'Rattlesnake': {'cold-blooded': 1,
  'egg-laying': 1,
  'num_legs': 0,
  'poisonous': 1,
  'scales': 1},
 'Salmon': {'cold-blooded': 1,
  'egg-laying': 1,
  'num_legs': 0,
  'poisonous': 0,
  'scales': 1}}
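
The same row-keyed dictionary can probably be produced without the transpose by passing `orient="index"` to `DataFrame.to_dict`. The sketch below uses a two-row, two-column subset of the table above to compare the two approaches.

```python
import pandas as pd

# Two rows and two columns taken from the numeric animals table above.
df_small = pd.DataFrame(
    {"cold-blooded": [1, 1], "num_legs": [0, 4]},
    index=["Cobra", "Alligator"],
)

via_transpose = df_small.T.to_dict()                # approach used in the cell above
via_orient = df_small.to_dict(orient="index")       # same mapping, no transpose

print(via_orient)
print(via_transpose == via_orient)
```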

In [23]:
animal_features = {}
for key, value in df.T.items():
    animal_features[key] = np.array(list(value.values))

In [24]:
animal_features

{'Alligator': array([1, 1, 4, 0, 1]),
 'Boa constrictor': array([1, 0, 0, 0, 1]),
 'Cobra': array([1, 1, 0, 1, 1]),
 'Dart frog': array([0, 1, 4, 1, 0]),
 'Python': array([1, 1, 0, 0, 1]),
 'Rattlesnake': array([1, 1, 0, 1, 1]),
 'Salmon': array([1, 1, 0, 0, 1])}

### Euclidean distance

It's a general formula to calculate the distance between two vectors of the same length

LaTeX:

$$distance(V1, V2) = \sqrt{\sum\limits_{i=1}^{len}(V1_i-V2_i)^{2}}$$

This looks complicated, but it really isn't.

We just calculate the difference between corresponding elements of the two vectors, square each difference, sum all the squares together, and take the square root of the sum.

- V1 = [0, 0]
- V2 = [3, 4]

Sum the squared differences:

$$ (0 - 3)^2 + (0 - 4)^2 $$

$$ 9 + 16 $$

$$ 25 $$

Square root of that:

$$ \sqrt{25} $$

$$ 5 $$

Let's write a Euclidean distance function


In [25]:
def euclidean_distance(v1, v2):
    squares = (v1 - v2) ** 2
    return math.sqrt(squares.sum())

In [28]:
euclidean_distance(np.array([0, 0]), np.array([3, 4]))

# write tests!
assert euclidean_distance(np.array([0, 0]), np.array([3, 4])) == 5.0
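
As a cross-check, the hand-written function can be compared against `numpy.linalg.norm` applied to the difference of the two vectors. This is a sketch rather than a recommendation of one over the other; it also uses `math.isclose`, since exact `==` comparisons on floats only work when the result happens to be exactly representable.

```python
import math

import numpy as np


def euclidean_distance(v1, v2):
    """Same function as in the cell above, repeated so this block runs alone."""
    squares = (v1 - v2) ** 2
    return math.sqrt(squares.sum())


v1 = np.array([0, 0])
v2 = np.array([3, 4])

by_hand = euclidean_distance(v1, v2)
by_norm = np.linalg.norm(v1 - v2)   # 2-norm of the difference vector

# Compare with a tolerance instead of ==
print(math.isclose(by_hand, by_norm))
```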

In [33]:
# create a DataFrame with animals as its rows (indexes) as well as
# the columns, each cell contains the euclidean distance between
# the animals

def compare_animals(animal_features, keys=None):
    '''Return a DataFrame of pairwise Euclidean distances between animals.'''
    
    if keys is None:
        # a list gives pandas plain, ordered labels rather than a dict view
        keys = list(animal_features)
    
    col_labels = row_labels = keys
    
    table = []
    
    for row_label in row_labels:
        row = []
        for col_label in col_labels:
            if col_label == row_label:
                row.append('--')
            else:
                row.append(euclidean_distance(animal_features[row_label], animal_features[col_label]))
        
        table.append(row)
    
    df = pd.DataFrame(table, columns=col_labels, index=row_labels)
    return df

compare_animals(animal_features)

Unnamed: 0,Salmon,Alligator,Rattlesnake,Cobra,Boa constrictor,Dart frog,Python
Salmon,--,4,1,1,1,4.358899,0
Alligator,4,--,4.123106,4.123106,4.123106,1.732051,4
Rattlesnake,1,4.123106,--,0,1.414214,4.242641,1
Cobra,1,4.123106,0,--,1.414214,4.242641,1
Boa constrictor,1,4.123106,1.414214,1.414214,--,4.472136,1
Dart frog,4.358899,1.732051,4.242641,4.242641,4.472136,--,4.358899
Python,0,4,1,1,1,4.358899,--


In [34]:
compare_animals(animal_features, ['Rattlesnake', 'Boa constrictor', 'Dart frog', 'Alligator'])

Unnamed: 0,Rattlesnake,Boa constrictor,Dart frog,Alligator
Rattlesnake,--,1.414214,4.242641,4.123106
Boa constrictor,1.414214,--,4.472136,4.123106
Dart frog,4.242641,4.472136,--,1.732051
Alligator,4.123106,4.123106,1.732051,--


We **could** make `num_legs` into a boolean like `has_legs`, so that its larger range (0 to 4) does not dominate the distance.
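
To see how much `num_legs` weighs on the result, the sketch below breaks the squared distance between the Alligator and the Rattlesnake into per-feature contributions, using the feature vectors printed earlier.

```python
import math

# Feature order matches the animal_features arrays shown earlier.
feature_names = ["cold-blooded", "egg-laying", "num_legs", "poisonous", "scales"]
alligator = [1, 1, 4, 0, 1]
rattlesnake = [1, 1, 0, 1, 1]

# Squared difference contributed by each feature.
squared = {
    name: (a - b) ** 2
    for name, a, b in zip(feature_names, alligator, rattlesnake)
}
total = sum(squared.values())

print(squared)                        # num_legs contributes 16
print(total)                          # 17
print(round(math.sqrt(total), 6))     # 4.123106, matching the table
```

Of the 17 units of squared distance, 16 come from `num_legs` alone, which is why the Alligator lands nearer the Dart frog than any snake.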

In [35]:
df['has_legs'] = df.num_legs > 0

In [42]:
df['has_legs'] = df['has_legs'].astype(int)
# errors='ignore' keeps this cell safe to re-run after num_legs is already gone
df = df.drop(columns='num_legs', errors='ignore')

In [43]:
df

_name,cold-blooded,egg-laying,poisonous,scales,has_legs
Cobra,1,1,1,1,0
Rattlesnake,1,1,1,1,0
Boa constrictor,1,0,0,1,0
Alligator,1,1,0,1,1
Dart frog,0,1,1,0,1
Salmon,1,1,0,1,0
Python,1,1,0,1,0


In [44]:
animal_features2 = {}
for key, value in df.T.items():
    animal_features2[key] = np.array(list(value.values))

In [45]:
compare_animals(animal_features2, ['Rattlesnake', 'Boa constrictor', 'Dart frog', 'Alligator'])

Unnamed: 0,Rattlesnake,Boa constrictor,Dart frog,Alligator
Rattlesnake,--,1.414214,1.732051,1.414214
Boa constrictor,1.414214,--,2.236068,1.414214
Dart frog,1.732051,2.236068,--,1.732051
Alligator,1.414214,1.414214,1.732051,--


In [46]:
compare_animals(animal_features2)

Unnamed: 0,Salmon,Alligator,Rattlesnake,Cobra,Boa constrictor,Dart frog,Python
Salmon,--,1,1,1,1,2,0
Alligator,1,--,1.414214,1.414214,1.414214,1.732051,1
Rattlesnake,1,1.414214,--,0,1.414214,1.732051,1
Cobra,1,1.414214,0,--,1.414214,1.732051,1
Boa constrictor,1,1.414214,1.414214,1.414214,--,2.236068,1
Dart frog,2,1.732051,1.732051,1.732051,2.236068,--,2
Python,0,1,1,1,1,2,--
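
A distance of 0 in the table above means two animals have identical feature vectors, so no distance-based method can tell them apart with these features. As a small sketch, the pairs can be listed directly; the vectors below are copied from the `has_legs` table.

```python
from itertools import combinations

# Columns: cold-blooded, egg-laying, poisonous, scales, has_legs
animal_features2 = {
    "Cobra": (1, 1, 1, 1, 0),
    "Rattlesnake": (1, 1, 1, 1, 0),
    "Boa constrictor": (1, 0, 0, 1, 0),
    "Alligator": (1, 1, 0, 1, 1),
    "Dart frog": (0, 1, 1, 0, 1),
    "Salmon": (1, 1, 0, 1, 0),
    "Python": (1, 1, 0, 1, 0),
}

# Pairs of animals whose feature vectors are exactly equal.
identical_pairs = [
    (a, b)
    for a, b in combinations(animal_features2, 2)
    if animal_features2[a] == animal_features2[b]
]

print(identical_pairs)  # [('Cobra', 'Rattlesnake'), ('Salmon', 'Python')]
```

The Salmon and Python pair suggests that this feature set is missing whatever distinguishes a fish from a reptile.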


# References and Further Reading

* [A Few Useful Things to Know about Machine Learning](http://www.astro.caltech.edu/~george/ay122/cacm12.pdf)
* [Visual Intro to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)