In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs
import math

In [2]:
%matplotlib inline

Machine learning is a very large topic, and we're covering it in just one week, so this will be a survey.

# scikit-learn

The Python package we will be using for most everything this week is scikit-learn.

Download it: ```pip install scikit-learn```.

Learn about it: http://scikit-learn.org/

# What is machine learning?
Lots of definitions. A simple one: "a field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel)

Different types:
* supervised learning: prediction/regression, classification
* unsupervised learning: clustering, organizing

Machine learning "involves observing a set of examples that represent incomplete information about some statistical phenomenon, and then attempting to infer something about the process that generated those examples." (John Guttag, *Introduction to Computation and Programming Using Python*)

(A large amount of what follows comes from *Introduction to Computation and Programming Using Python.*)

Machine learning is at its core about representation and generalization.

* __representation__ is extracting structure from data
* __generalization__ is making predictions from data

# Feature vectors

In [None]:
dog_breeds = {"Alaskan Malamute": {"height": 24, "weight": 80, "energy": 4},
              "Bichon Frise": {"height": 10, "weight": 9.5, "energy": 4},
              "Irish Wolfhound": {"height": 32, "weight": 120, "energy": 2},
              "Basset Hound": {"height": 14, "weight": 50, "energy": 2}}

set_a = {"Alaskan Malamute", "Irish Wolfhound"}
set_b = {"Bichon Frise", "Basset Hound"}

*How were the above separated?*

The information being used here is called a feature vector. Each element of the vector describes some feature of the example. *What other feature vectors might we have here? Which ones are more useful than others?*
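
As a minimal sketch of the idea, a breed's dictionary can be flattened into a vector by fixing an order for its features. The `FEATURE_ORDER` list and the `to_feature_vector` helper are names made up for this illustration; they are not part of the lesson's code.

```python
# Hypothetical names for illustration: a fixed feature order and a helper.
FEATURE_ORDER = ["height", "weight", "energy"]

dog_breeds = {
    "Alaskan Malamute": {"height": 24, "weight": 80, "energy": 4},
    "Bichon Frise": {"height": 10, "weight": 9.5, "energy": 4},
}

def to_feature_vector(features):
    """Return the feature values as a list, always in FEATURE_ORDER."""
    return [features[name] for name in FEATURE_ORDER]

vectors = {breed: to_feature_vector(f) for breed, f in dog_breeds.items()}
print(vectors["Alaskan Malamute"])  # [24, 80, 4]
```

Fixing the order matters: two vectors can only be compared element by element if position 0 means the same thing in both.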

In __supervised learning__, we have the labels we want to apply to our data and the feature vectors of our data, like we do above. Classification, a supervised learning technique, could take the data above and then given a new example, place it in the right set based on its height. This is used for many applications: detecting spam or fraud, labeling documents, recommending products.
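
One very simple way to do this kind of classification, sketched here under the assumption that height is the only feature we use, is to give a new dog the label of the labeled dog whose height is closest. The function name `classify_by_height` and the two new heights are invented for the example.

```python
# Labeled examples: (height, label). Heights and sets come from the cell above.
labeled_heights = [
    (24, "set_a"),  # Alaskan Malamute
    (32, "set_a"),  # Irish Wolfhound
    (10, "set_b"),  # Bichon Frise
    (14, "set_b"),  # Basset Hound
]

def classify_by_height(height):
    """Return the label of the labeled example whose height is closest."""
    nearest = min(labeled_heights, key=lambda example: abs(example[0] - height))
    return nearest[1]

# Hypothetical new dogs, not taken from the data above.
print(classify_by_height(28))  # set_a
print(classify_by_height(12))  # set_b
```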

In __unsupervised learning__, we have our feature vectors, but no labels. Unsupervised learning looks for structure in our feature vectors that we do not yet know. Given the dog breeds above, unsupervised learning might break them into tall and short dogs, heavy and light dogs, or high and low energy dogs.
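
To make that concrete, here is a deliberately simple stand-in for a clustering algorithm: sort the breeds by one feature and split them at the widest gap. This is an assumption made for illustration, not a technique the lesson prescribes, and `split_at_largest_gap` is a made-up name.

```python
dog_breeds = {
    "Alaskan Malamute": {"height": 24, "weight": 80, "energy": 4},
    "Bichon Frise": {"height": 10, "weight": 9.5, "energy": 4},
    "Irish Wolfhound": {"height": 32, "weight": 120, "energy": 2},
    "Basset Hound": {"height": 14, "weight": 50, "energy": 2},
}

def split_at_largest_gap(breeds, feature):
    """Split breed names into (low, high) groups at the widest gap in one feature."""
    ordered = sorted(breeds, key=lambda name: breeds[name][feature])
    values = [breeds[name][feature] for name in ordered]
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    cut = gaps.index(max(gaps)) + 1  # first index of the "high" group
    return ordered[:cut], ordered[cut:]

short, tall = split_at_largest_gap(dog_breeds, "height")
light, heavy = split_at_largest_gap(dog_breeds, "weight")
print(short, tall)
print(light, heavy)
```

Note that no labels were supplied, and that the groups depend on which feature we look at: the Basset Hound lands with the short dogs by height but with the heavy dogs by weight.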

## Figuring out our feature vectors

The problem with much of our data is that there's too much of it. If you used every possible feature to organize your data, you would likely end up with just a giant mess. Using too many features can lead to a poor statistical model, and can also slow down the learning process.

__Feature extraction__ is hard, but is necessary. Even in unsupervised learning, we need human input to decide what feature vectors to use.

Create a ```list``` of ```dict```s, each containing the following features:

* _name (string)
* egg_laying (bool)
* scale (bool)
* poisonous (bool)
* cold_blooded (bool)
* num_legs (int)

Create a ```DataFrame``` from this list.

In [3]:
data = [
    {"_name": "Alligator", "egg_laying": True, "scale": True, "poisonous": False, "cold_blooded": True, "num_legs": 4}, 
    {"_name": "Boa Constrictor", "egg_laying": True, "scale": True, "poisonous": False, "cold_blooded": True, "num_legs": 0}, 
    {"_name": "Newt", "egg_laying": True, "scale": False, "poisonous": True, "cold_blooded": True, "num_legs": 4}, 
    {"_name": "Python", "egg_laying": True, "scale": True, "poisonous": False, "cold_blooded": True, "num_legs": 0}, 
    {"_name": "King Cobra", "egg_laying": True, "scale": True, "poisonous": True, "cold_blooded": True, "num_legs": 0}, 
]

In [4]:
original_df = pd.DataFrame(data)
original_df

Unnamed: 0,_name,cold_blooded,egg_laying,num_legs,poisonous,scale
0,Alligator,True,True,4,False,True
1,Boa Constrictor,True,True,0,False,True
2,Newt,True,True,4,True,False
3,Python,True,True,0,False,True
4,King Cobra,True,True,0,True,True


*What features help determine if an animal is a reptile or not, based off this data?*
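
One possible starting point, offered as a sketch rather than an answer: a feature that has the same value for every animal cannot separate any of them. The compact `rows` layout and the `FEATURES` order below are assumptions made to keep the example short; the values are copied from the table above.

```python
FEATURES = ["egg_laying", "scale", "poisonous", "cold_blooded", "num_legs"]

# Same values as the DataFrame above, one tuple per animal in FEATURES order.
rows = {
    "Alligator": (True, True, False, True, 4),
    "Boa Constrictor": (True, True, False, True, 0),
    "Newt": (True, False, True, True, 4),
    "Python": (True, True, False, True, 0),
    "King Cobra": (True, True, True, True, 0),
}

# Count how many distinct values each feature takes across all animals.
distinct_counts = {
    name: len({row[i] for row in rows.values()})
    for i, name in enumerate(FEATURES)
}

constant = [name for name, count in distinct_counts.items() if count == 1]
print(constant)  # ['egg_laying', 'cold_blooded']
```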

# Measuring distance
Let's say we want to use the above data to give us the similarity of two animals. We might ask, for example, if an alligator is more like a cobra or a dart frog.

In order to do this, we can measure the similarity of the feature vectors, but the vectors must be made up of numbers first. Four of ours are booleans, so let's convert them.

Iterate over ```df.columns``` and convert every column except ```_name``` to ```int```.

In [5]:
for column in original_df.columns:
    if column != "_name":
        # The np.int alias was removed in NumPy 1.24; the built-in int works the same way here.
        original_df[column] = original_df[column].astype(int)

In [6]:
original_df

Unnamed: 0,_name,cold_blooded,egg_laying,num_legs,poisonous,scale
0,Alligator,1,1,4,0,1
1,Boa Constrictor,1,1,0,0,1
2,Newt,1,1,4,1,0
3,Python,1,1,0,0,1
4,King Cobra,1,1,0,1,1


Let's create a feature vector for each animal.

You can convert your DataFrame to a dictionary with:

```df.T.to_dict()```

Iterate over its ```.items()``` and, for each animal, store an ```np.array``` of the feature values in a new dict keyed by the animal's name.

In [10]:
original_df.T.to_dict()

{0: {'_name': 'Alligator',
  'cold_blooded': 1,
  'egg_laying': 1,
  'num_legs': 4,
  'poisonous': 0,
  'scale': 1},
 1: {'_name': 'Boa Constrictor',
  'cold_blooded': 1,
  'egg_laying': 1,
  'num_legs': 0,
  'poisonous': 0,
  'scale': 1},
 2: {'_name': 'Newt',
  'cold_blooded': 1,
  'egg_laying': 1,
  'num_legs': 4,
  'poisonous': 1,
  'scale': 0},
 3: {'_name': 'Python',
  'cold_blooded': 1,
  'egg_laying': 1,
  'num_legs': 0,
  'poisonous': 0,
  'scale': 1},
 4: {'_name': 'King Cobra',
  'cold_blooded': 1,
  'egg_laying': 1,
  'num_legs': 0,
  'poisonous': 1,
  'scale': 1}}

In [11]:
animal_dict = original_df.T.to_dict()
new_animal_dict = {}

for key, features in animal_dict.items():
    animal_name = features.pop('_name')
    new_animal_dict[animal_name] = np.array(list(features.values()))

In [13]:
new_animal_dict

{'Alligator': array([1, 1, 0, 4, 1]),
 'Boa Constrictor': array([1, 1, 0, 0, 1]),
 'King Cobra': array([1, 1, 1, 0, 1]),
 'Newt': array([0, 1, 1, 4, 1]),
 'Python': array([1, 1, 0, 0, 1])}

Now we are going to use a formula called the __Euclidean distance__, which you might remember from your movie review homework. It measures the distance between two equal-length vectors of numbers.

$$distance(V1, V2) = \sqrt{\sum\limits_{i=1}^{len}(V1_i-V2_i)^{2}}$$

Here's that in English:

The distance between vector 1 and vector 2 is the square root of the sum of the squared differences between their corresponding features.

This sounds really hard, but is much like something we've done before: the Pythagorean theorem. If you have two vectors with two elements each, you could see those as x/y coordinates.

* V1 = [0, 0]
* V2 = [3, 4]

Square the difference between each pair of coordinates: $(3 - 0)^2 = 9; (4 - 0)^2 = 16$.

Sum them: $9 + 16 = 25$.

Now find the square root: $\sqrt{25} = 5$.

The Euclidean distance between these vectors is 5, the same as the hypotenuse of a right triangle using them as coordinates would be. The difference is that the Euclidean distance can be used with vectors of any length.

Let's write our own Euclidean distance function to help us out.

Make sure it takes two vectors (NumPy arrays of numbers) as parameters, computes the squared difference of each pair of elements, stores those in a new vector, and returns the square root of the sum of the numbers in that vector.

In [14]:
def euclidean_distance(v1, v2):
    squares = (v1 - v2) ** 2
    return math.sqrt(squares.sum())

In [15]:
print(euclidean_distance(np.array([1, 6, 8, 0]), np.array([7, 0, 0, 10])))

assert euclidean_distance(np.array([0, 0]), np.array([3, 4])) == 5

15.362291495737216
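
The printed value can be checked by hand with the formula above, using $V1 = [1, 6, 8, 0]$ and $V2 = [7, 0, 0, 10]$:

$$\sqrt{(1-7)^2 + (6-0)^2 + (8-0)^2 + (0-10)^2} = \sqrt{36 + 36 + 64 + 100} = \sqrt{236} \approx 15.362$$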


Create a function that takes a list of names to compare and a dictionary of feature vectors (of animals, for example), and builds the rows of a new DataFrame that has the animals as both columns and rows, with each cell containing the Euclidean distance between the two animals. Display ```---``` in cells where an animal is compared with itself.

In [16]:
def create_distance_table(search_animals, new_animal_dict):
    data_frame_list = []
    for column_animal in search_animals:
        animal_dict_ = {}
        
        for row_animal in search_animals:
            if row_animal == column_animal:
                animal_dict_[row_animal] = "---"
            else:
                distance = euclidean_distance(new_animal_dict[column_animal], new_animal_dict[row_animal])
                animal_dict_[row_animal] = distance
        data_frame_list.append(animal_dict_)
    return data_frame_list
            

In [18]:
search_animals = ["Alligator", "Boa Constrictor", "Newt", "Python"]
df = pd.DataFrame(create_distance_table(search_animals, new_animal_dict))
df.index = search_animals

In [19]:
df

Unnamed: 0,Alligator,Boa Constrictor,Newt,Python
Alligator,---,4,1.414214,4
Boa Constrictor,4,---,4.242641,0
Newt,1.414214,4.242641,---,4.242641
Python,4,0,4.242641,---


Let's view the DataFrame returned when asked to compare the following animals:
* Rattlesnake
* Boa Constrictor
* Dart frog
* Alligator

Well, that looks wrong. *What might the problem be?*

Of course! ```num_legs``` doesn't contain a ```bool```; it contains the number of legs for the given animal, so its larger values can dominate the distance. Let's replace ```num_legs``` with a boolean that represents whether the animal has legs or not.

* 0 if the animal has no legs
* 1 if the animal has 1 or more legs
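
To see how much one feature can dominate, the sketch below breaks the squared distance into its per-feature parts, using the original leg counts. The feature order and the helper name `squared_differences` are assumptions made for this illustration.

```python
# Assumed feature order for this sketch (matches the DataFrame's column order).
FEATURES = ["cold_blooded", "egg_laying", "num_legs", "poisonous", "scale"]

# Vectors with the original leg counts, copied from the table above.
alligator = [1, 1, 4, 0, 1]
boa = [1, 1, 0, 0, 1]
newt = [1, 1, 4, 1, 0]

def squared_differences(v1, v2):
    """Return each feature's contribution to the squared Euclidean distance."""
    return {name: (a - b) ** 2 for name, a, b in zip(FEATURES, v1, v2)}

print(squared_differences(alligator, boa))
# {'cold_blooded': 0, 'egg_laying': 0, 'num_legs': 16, 'poisonous': 0, 'scale': 0}
print(squared_differences(alligator, newt))
# {'cold_blooded': 0, 'egg_laying': 0, 'num_legs': 0, 'poisonous': 1, 'scale': 1}
```

A single mismatch in leg count contributes 16, while all four 0/1 features disagreeing at once could contribute at most 4.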

Let's look at the animal list again.

In [20]:
new_animal_dict

{'Alligator': array([1, 1, 0, 4, 1]),
 'Boa Constrictor': array([1, 1, 0, 0, 1]),
 'King Cobra': array([1, 1, 1, 0, 1]),
 'Newt': array([0, 1, 1, 4, 1]),
 'Python': array([1, 1, 0, 0, 1])}

In [21]:
original_df['num_legs'] = original_df['num_legs'].astype(bool).astype(int)

In [22]:
original_df

Unnamed: 0,_name,cold_blooded,egg_laying,num_legs,poisonous,scale
0,Alligator,1,1,1,0,1
1,Boa Constrictor,1,1,0,0,1
2,Newt,1,1,1,1,0
3,Python,1,1,0,0,1
4,King Cobra,1,1,0,1,1


In [23]:
animal_dict = original_df.T.to_dict()
new_animal_dict = {}

for key, features in animal_dict.items():
    animal_name = features.pop('_name')
    new_animal_dict[animal_name] = np.array(list(features.values()))

In [24]:
new_animal_dict

{'Alligator': array([1, 1, 0, 1, 1]),
 'Boa Constrictor': array([1, 1, 0, 0, 1]),
 'King Cobra': array([1, 1, 1, 0, 1]),
 'Newt': array([0, 1, 1, 1, 1]),
 'Python': array([1, 1, 0, 0, 1])}

In [25]:
cleaned_df = pd.DataFrame(create_distance_table(search_animals, new_animal_dict))
cleaned_df.index = search_animals
cleaned_df

Unnamed: 0,Alligator,Boa Constrictor,Newt,Python
Alligator,---,1,1.414214,1
Boa Constrictor,1,---,1.732051,0
Newt,1.414214,1.732051,---,1.732051
Python,1,0,1.732051,---


Again, we need to convert our DataFrame to a dictionary that maps each animal's name to a vector of its feature values. We've done this before, but let's do it again for practice.

How does this change our Euclidean distance for each animal?

 - Rattlesnake
 - Boa Constrictor
 - Dart frog
 - Alligator
 
<!---
compare_animals(animal_features, 
                ['Rattlesnake', 'Boa constrictor', 'Dart frog', 'Alligator'])
--->

And for fun, let's check all animals against each other.
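
A minimal sketch of that all-pairs check, using only the standard library (`math.dist`, available from Python 3.8, and `itertools.combinations`) rather than the notebook's own helper functions. The vectors are retyped from the cleaned DataFrame with a fixed feature order, which is an assumption of this sketch.

```python
import math
from itertools import combinations

# Feature order: cold_blooded, egg_laying, num_legs (as 0/1), poisonous, scale.
animals = {
    "Alligator": [1, 1, 1, 0, 1],
    "Boa Constrictor": [1, 1, 0, 0, 1],
    "Newt": [1, 1, 1, 1, 0],
    "Python": [1, 1, 0, 0, 1],
    "King Cobra": [1, 1, 0, 1, 1],
}

# One entry per unordered pair of animals.
distances = {
    (a, b): math.dist(animals[a], animals[b])
    for a, b in combinations(animals, 2)
}

# Print the pairs from most similar to least similar.
for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a:16} {b:16} {d:.3f}")
```

With these five features the Boa Constrictor and the Python are indistinguishable (distance 0), which hints that more features would be needed to tell them apart.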

# References and Further Reading

* [A Few Useful Things to Know about Machine Learning](http://www.astro.caltech.edu/~george/ay122/cacm12.pdf)