In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs

In [7]:
%matplotlib inline

Machine learning is a very large topic, and we're covering it for only one week, so this will be a survey.

## scikit-learn

The Python package we will be using for most everything this week is `scikit-learn`.

Download it: `pip install scikit-learn`.

Learn about it: http://scikit-learn.org/

## What is machine learning?

Lots of definitions. A simple one: "a field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel)

Different types:

* supervised learning: prediction/regression, classification
* unsupervised learning: clustering, organizing

Machine learning "involves observing a set of examples that represent incomplete information about some statistical phenomenon, and then attempting to infer something about the process that generated those examples." (John Guttag, _Introduction to Computation and Programming Using Python_)

(A large amount of what comes below comes from _Introduction to Computation and Programming Using Python._)

Machine learning is at its core about representation and generalization.

* __representation__ is extracting structure from data
* __generalization__ is making predictions from data

## Feature vectors

In [8]:
dog_breeds = {"Alaskan Malamute": {"height": 24, "weight": 80, "energy": 4},
              "Bichon Frise": {"height": 10, "weight": 9.5, "energy": 4},
              "Irish Wolfhound": {"height": 32, "weight": 120, "energy": 2},
              "Basset Hound": {"height": 14, "weight": 50, "energy": 2}}

set_a = {"Alaskan Malamute", "Irish Wolfhound"}
set_b = {"Bichon Frise", "Basset Hound"}

_How were the above separated?_

The information being used here is called a _feature vector_. Each element of the vector describes some feature of the example. _What other feature vectors might we have here? Which ones are more useful than others?_

In __supervised learning__, we have the labels we want to apply to our data and the feature vectors of our data, like we do above. Classification, a supervised learning technique, could take the data above and then, given a new example, place it in the right set based on its height. This is used for many applications: detecting spam or fraud, labeling documents, recommending products.
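As a rough illustration of placing a new example "in the right set based on its height", here is a minimal sketch that copies the label of the labeled breed with the closest height. This nearest-example rule is only one simple way to classify and is an assumption of this sketch. The function name `classify_by_height` and the 27-unit test height are also made up for illustration.

```python
# Heights and set membership copied from the dog_breeds / set_a / set_b cell above.
labeled_heights = {
    "Alaskan Malamute": (24, "set_a"),
    "Irish Wolfhound": (32, "set_a"),
    "Bichon Frise": (10, "set_b"),
    "Basset Hound": (14, "set_b"),
}


def classify_by_height(height):
    """Return the label of the labeled breed whose height is closest (hypothetical helper)."""
    nearest = min(labeled_heights.values(), key=lambda pair: abs(pair[0] - height))
    return nearest[1]


# Hypothetical new example: a dog with height 27 is closest to the Alaskan Malamute.
print(classify_by_height(27))
```

The labels are what make this supervised: the function never decides what the groups are, it only assigns a new example to one of the groups we already named.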

In __unsupervised learning__, we have our feature vectors, but no labels. Unsupervised learning looks for structure in our feature vectors that we do not yet know. Given the dog breeds above, unsupervised learning might break them into tall and short dogs, heavy and light dogs, or high and low energy dogs.
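To make the "tall and short, heavy and light, high and low energy" idea concrete, the sketch below splits the breeds into two groups per feature, cutting at the midpoint between the smallest and largest value. The midpoint rule and the helper name `split_by_feature` are assumptions made for illustration. A real clustering algorithm would find the groups in a more principled way.

```python
dog_breeds = {"Alaskan Malamute": {"height": 24, "weight": 80, "energy": 4},
              "Bichon Frise": {"height": 10, "weight": 9.5, "energy": 4},
              "Irish Wolfhound": {"height": 32, "weight": 120, "energy": 2},
              "Basset Hound": {"height": 14, "weight": 50, "energy": 2}}


def split_by_feature(breeds, feature):
    """Split breeds into (low, high) sets around the midpoint of one feature."""
    values = [traits[feature] for traits in breeds.values()]
    cutoff = (min(values) + max(values)) / 2
    low = {name for name, traits in breeds.items() if traits[feature] <= cutoff}
    high = set(breeds) - low
    return low, high


for feature in ("height", "weight", "energy"):
    low, high = split_by_feature(dog_breeds, feature)
    print(feature, sorted(low), sorted(high))
```

Height and weight produce the same two groups, while energy produces a different pair. No labels were supplied, so the grouping depends entirely on which feature is used.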

### Figuring out our feature vectors

The problem with much of our data is that there's too much of it. If you used every possible feature to organize your data, you would likely end up with just a giant mess. Using too many features can make a bad statistical model, and can also slow down the learning process.

__Feature extraction__ is hard, but is necessary. Even in unsupervised learning, we need human input to decide what feature vectors to use.

Create a `list` of `dict`s that contain the following features:
  - _name (string)
  - egg-laying (bool)
  - scales (bool)
  - poisonous (bool)
  - cold-blooded (bool)
  - num_legs (int)
  
Create a `DataFrame` with this list.

<!---
animals = [{"_name": "Cobra", "egg-laying": True, "scales": True,
            "poisonous": True, "cold-blooded": True, "num_legs": 0}, 
           {"_name": "Rattlesnake", "egg-laying": True, "scales": True,
            "poisonous": True, "cold-blooded": True, "num_legs": 0},
           {"_name": "Boa constrictor", "egg-laying": False, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 0},
           {"_name": "Alligator", "egg-laying": True, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 4},
           {"_name": "Dart frog", "egg-laying": True, "scales": False,
            "poisonous": True, "cold-blooded": False, "num_legs": 4},
           {"_name": "Salmon", "egg-laying": True, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 0},
           {"_name": "Python", "egg-laying": True, "scales": True,
            "poisonous": False, "cold-blooded": True, "num_legs": 0}]
pd.DataFrame(animals)
--->

_What features help determine if an animal is a reptile or not, based off this data?_

## Measuring distance

Let's say we want to use the above data to give us the similarity of two animals. We might ask, for example, if an alligator is more like a cobra or a dart frog.

In order to do this, we can measure the similarity of the feature vectors, but the vectors must be made up of numbers first. Four of ours are booleans, so let's convert them.
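Python already treats `True` as 1 and `False` as 0, so the conversion loses no information. A minimal sketch on a made-up list of flags (not the animal table itself):

```python
# Hypothetical flags for one animal: egg-laying, scales, poisonous, cold-blooded.
flags = [True, True, False, True]

# int() maps True to 1 and False to 0.
as_numbers = [int(flag) for flag in flags]
print(as_numbers)
```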

Iterate over `df.columns` and convert every column except `_name` to `int`.

<!---
df = pd.DataFrame(animals)
for col in df.columns:
    if col != "_name":
        df[col] = df[col].astype(int)
df
--->

Let's create a feature vector for each animal.

You can convert your `DataFrame` to a dictionary with:

`df.T.to_dict()`

Iterate over its `.items()` and, for each animal, store an `np.array` of the list of feature values in a new dictionary keyed by animal name.
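To see the shape that `df.T.to_dict()` returns, here is a small sketch on a toy frame built from two of the dog breeds rather than the animal data. It assumes the row labels are the names. If the index is left at the default 0, 1, 2, ..., those integers become the dictionary keys instead.

```python
import pandas as pd

# Toy frame indexed by name, so the names become the dictionary keys.
toy = pd.DataFrame(
    {"height": [24, 10], "energy": [4, 4]},
    index=["Alaskan Malamute", "Bichon Frise"],
)

# Transposing makes each breed a column; to_dict() then maps
# column label -> {row label -> value}.
by_name = toy.T.to_dict()
print(by_name)
```

Each key is a row label from the original frame, and each value is a dictionary of that row's features.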

<!---
df.index = df.pop("_name")
animal_dict = df.T.to_dict()
animal_features = {}
for key, value in animal_dict.items():
    animal_features[key] = np.array(list(value.values()))
animal_features
--->

Now, we are going to use a formula called the __Euclidean distance.__ This is used to compare the distance between equal-length vectors of numbers.

$$distance(V1, V2) = \sqrt{\sum\limits_{i=1}^{len}(V1_i-V2_i)^{2}}$$

Here's that in English:

The distance between vector 1 and vector 2 is the square root of the sum of the squared differences between their corresponding features.

This sounds really hard, but is much like something we've done before: the Pythagorean theorem. If you have two vectors with two elements each, you could see those as x/y coordinates.

* V1 = [0, 0]
* V2 = [3, 4]

Take the difference of each coordinate squared: $(3 - 0)^2 = 9; (4 - 0)^2 = 16$. 

Sum them: $9 + 16 = 25$.

Now find the square root: $\sqrt{25} = 5$.

The Euclidean distance between these vectors is 5, the same as the hypotenuse of a right triangle using them as coordinates would be. The difference is that the Euclidean distance can be used with vectors of any length.
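As a quick check of that last point, here is the same calculation for a pair of three-element vectors. The vectors are made up for illustration and are not part of the animal data.

* V1 = [1, 2, 3]
* V2 = [4, 6, 15]

$$distance(V1, V2) = \sqrt{(1 - 4)^{2} + (2 - 6)^{2} + (3 - 15)^{2}} = \sqrt{9 + 16 + 144} = \sqrt{169} = 13$$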

Let's write our own Euclidean distance function to help us out.

Make sure it takes 2 vectors (lists or arrays of numbers) as parameters, calculates the squared difference of each pair of elements and stores them as a new vector, and returns the square root of the sum of the numbers in that vector.

<!---
import math

def euclidean_distance(v1, v2):
    squares = (v1 - v2) ** 2
    return math.sqrt(squares.sum())

euclidean_distance(np.array([0, 0]), np.array([3, 4]))
--->

Create a function that takes a dictionary (of animals, for example) and creates a new `DataFrame` with the animals as both columns and rows, where each cell contains the Euclidean distance between the two animals. Display `--` where an animal is compared with itself.

<!---
def compare_animals(animals, keys=None):
    """Given a dictionary of animals -- keys are names, values are NumPy arrays --
    build a table of Euclidean distance between each animal."""
    if keys is None:
        keys = list(animals.keys())
    col_labels = keys
    row_labels = col_labels[:]
    table = []
    for rowl in row_labels:
        row = []
        for coll in col_labels:
            if rowl == coll:
                row.append("--")
            else:
                distance = euclidean_distance(animals[rowl], animals[coll])
                row.append(str(round(distance, 2)))
        table.append(row)

    df = pd.DataFrame(table, columns=col_labels, index=row_labels)
    return df
--->

Let's view the `DataFrame` returned when asked to compare the following animals:

 - Rattlesnake
 - Boa constrictor
 - Dart frog
 - Alligator
 
<!---
compare_animals(animal_features, 
                ['Rattlesnake', 'Boa constrictor', 'Dart frog', 'Alligator'])
--->

Well, that looks wrong. _What might the problem be_?

Of course! `num_legs` doesn't contain a `bool`; it contains the number of legs for the given animal, so it outweighs every other feature in the distance. Let's replace `num_legs` with a 0/1 value that represents whether the animal has legs or not.

  - 0 if the animal has no legs
  - 1 if the animal has 1 or more legs
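The effect of the leg count is easy to check by hand. The sketch below uses `math.dist` from the standard library (Python 3.8+) instead of our own function, and the helper `legs_to_flag` is a made-up name for illustration. The feature values are copied from the animal table, with booleans written as 0/1.

```python
import math

# Feature order: egg-laying, scales, poisonous, cold-blooded, num_legs
alligator = [1, 1, 0, 1, 4]
boa = [0, 1, 0, 1, 0]
dart_frog = [1, 0, 1, 0, 4]

# With the raw leg count, the 4-leg difference contributes 16 to the sum.
raw_to_boa = math.dist(alligator, boa)         # sqrt(1 + 16) = sqrt(17)
raw_to_frog = math.dist(alligator, dart_frog)  # sqrt(1 + 1 + 1) = sqrt(3)


def legs_to_flag(vector):
    """Return a copy with the last element (num_legs) replaced by 0 or 1."""
    return vector[:-1] + [1 if vector[-1] > 0 else 0]


# With a 0/1 flag, legs count no more than any other feature.
flag_to_boa = math.dist(legs_to_flag(alligator), legs_to_flag(boa))         # sqrt(2)
flag_to_frog = math.dist(legs_to_flag(alligator), legs_to_flag(dart_frog))  # sqrt(3)

print(round(raw_to_boa, 2), round(raw_to_frog, 2))
print(round(flag_to_boa, 2), round(flag_to_frog, 2))
```

With the raw count the alligator looks closer to the dart frog than to the boa constrictor. With the 0/1 flag the order flips.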
  
<!---
df = pd.DataFrame(animals)
for col in df.columns:
    if col == "_name":
        pass
    elif col == "num_legs":
        df[col] = df[col].map(lambda x: 0 if x == 0 else 1)
    else:
        df[col] = df[col].astype(int)
--->

Again, we need to convert our `DataFrame` to a dictionary with the animal name as the `key` and a vector of feature values as the `value`. We've done this before, but let's do it again for practice.

<!---
df.index = df.pop("_name")
animal_dict = df.T.to_dict()
animal_features = {}
for key, value in animal_dict.items():
    animal_features[key] = np.array(list(value.values()))
animal_features
--->

How does this change our Euclidean distance for each animal?

 - Rattlesnake
 - Boa constrictor
 - Dart frog
 - Alligator
 
<!---
compare_animals(animal_features, 
                ['Rattlesnake', 'Boa constrictor', 'Dart frog', 'Alligator'])
--->

And for funzies, let's check all animals against each other.

<!---
compare_animals(animal_features)
--->

# References and Further Reading

* [A Few Useful Things to Know about Machine Learning](http://www.astro.caltech.edu/~george/ay122/cacm12.pdf)