# Class 21: Intro to Machine Learning

Plan for today:
- Creating confidence intervals
- Introduction to Machine Learning


In [1]:
import YData

# YData.download.download_class_code(21)   # get class code    
# YData.download.download_class_code(21, True) # get the code with the answers 

# YData.download.download_homework(9)  # downloads the homework 


If you are using Google Colab, you should uncomment and run the code below.

In [None]:
# !pip install https://github.com/lederman/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Intro to Machine Learning:  Features (X) and labels (y)

In supervised machine learning, we use a computer algorithm called a "pattern classifier" to learn the relationship between a set of features X and a label y. When the classifier is given the features X of new examples, it can then predict their labels y. 


In [4]:
penguins = sns.load_dataset("penguins")

penguins = penguins.dropna()

penguins = penguins.sample(frac = 1)

penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
138,Adelie,Dream,37.0,16.5,185.0,3400.0,Female
110,Adelie,Biscoe,38.1,16.5,198.0,3825.0,Female
109,Adelie,Biscoe,43.2,19.0,197.0,4775.0,Male
100,Adelie,Biscoe,35.0,17.9,192.0,3725.0,Female
14,Adelie,Torgersen,34.6,21.1,198.0,4400.0,Male


In [5]:
# Let's explore how many members of each species there are in our data set

species_counts = penguins.groupby("species").agg(count = ('island', 'count'))

species_counts


species,count
Adelie,146
Chinstrap,68
Gentoo,119


#### Questions: 

1. If we had to guess the species of a penguin without knowing any of the penguin's features, which species of penguin should we guess? 
A: Always guess Adelie, the most common species in the data.


2. If we were to follow this optimal guessing strategy, what percent of our guesses would be correct (i.e., what would our classification accuracy be)? 


In [6]:
species_counts['count']/sum(species_counts['count'])

species
Adelie       0.438438
Chinstrap    0.204204
Gentoo       0.357357
Name: count, dtype: float64
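
The "always guess the most common class" strategy can also be expressed as a classifier, which gives a baseline to compare real classifiers against. The sketch below uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"`. The names `X_toy` and `y_toy` are hypothetical stand-ins: the labels are built from the species counts shown above, and the features are a placeholder column of zeros, since this baseline ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels that mirror the species counts shown above (146, 68, 119)
y_toy = np.array(["Adelie"] * 146 + ["Chinstrap"] * 68 + ["Gentoo"] * 119)

# The baseline ignores the features, so a placeholder column of zeros is enough
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_prediction = baseline.predict(X_toy[:1])[0]  # the most common class
baseline_accuracy = baseline.score(X_toy, y_toy)      # proportion of that class

print(baseline_prediction, baseline_accuracy)
```

Any classifier that uses the features should beat this baseline accuracy to be considered useful.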

To begin the classification process, let's store the features (X) and the labels (y) in separate names called `X_penguin_features` and `y_penguin_labels` respectively. 

In [7]:
# get the features and the labels

X_penguin_features = penguins[['bill_length_mm', 
                               'bill_depth_mm',
                               'flipper_length_mm', 
                               'body_mass_g']]

y_penguin_labels = penguins['species']


## 2. k-Nearest Neighbors classifier


To explore classification, let's use a k-Nearest Neighbors classifier to predict the species of a penguin based on particular features the penguin has such as the penguin's bill length and body mass. 

Let's construct a K-Nearest Neighbor classifier (KNN) using 5 neighbors for predictions (i.e., k = 5 so we are using a 5-Nearest Neighbor classifier). 

We can do this using scikit-learn's `KNeighborsClassifier(n_neighbors = )` constructor.  


In [8]:
from sklearn.neighbors import KNeighborsClassifier

# Construct a 5-nearest neighbor classifier
knn = KNeighborsClassifier(n_neighbors = 5) 


Let's now train the classifier (the KNN classifier just stores the data during training)


In [9]:
# “train” the classifier (which for a KNN classifier just involves memorizing the training data)

knn.fit(X_penguin_features, y_penguin_labels) 


Let's now use the classifier to make predictions

In [10]:
# make predictions
penguin_preditions = knn.predict(X_penguin_features)

penguin_preditions[0:10]

array(['Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo',
       'Adelie', 'Adelie', 'Adelie', 'Adelie'], dtype=object)

Let's get the prediction accuracy (classification accuracy), which is the proportion of predictions that are correct.

In [11]:
# get the classification accuracy
np.mean(penguin_preditions == y_penguin_labels)

np.float64(0.8378378378378378)

Let's repeat our analysis with k = 1 to see what happens...

In [12]:
# What happens if k = 1?

# construct a classifier
knn = KNeighborsClassifier(n_neighbors = 1) 

# “train” the classifier (which for a KNN classifier just involves memorizing the training data)
knn.fit(X_penguin_features, y_penguin_labels) 

# make predictions
penguin_preditions = knn.predict(X_penguin_features)

# get classification accuracy
np.mean(penguin_preditions == y_penguin_labels)

np.float64(1.0)

Do we believe we have a perfect classifier???
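
One way to see why k = 1 gives perfect accuracy on the data the classifier was trained on: each training point's nearest neighbor is itself, at distance 0. The sketch below illustrates this on hypothetical toy data (`X_toy`, `y_toy`) whose labels are assigned at random, so there is no real relationship to learn. It uses the classifier's `kneighbors()` method to look up the nearest-neighbor distances.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 points, 2 continuous features, random labels
X_toy = rng.normal(size=(50, 2))
y_toy = rng.choice(["a", "b"], size=50)

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)

# For every training point, find the distance to its single nearest neighbor
distances, neighbor_idx = knn_1.kneighbors(X_toy, n_neighbors=1)

# Accuracy when "testing" on the same points used for training
train_accuracy = knn_1.score(X_toy, y_toy)

print(distances.max())
print(train_accuracy)
```

Even with random labels the training accuracy is perfect, which suggests that this number tells us nothing about how well the classifier will do on new data.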


## 3. Cross-validation

To detect over-fitting and get an honest estimate of how well the classifier works, we need to split our data into a training set and a test set. 

The classifier "learns" the relationship between features (X) and labels (y) on the **training set**.

The classifier makes predictions on the features (X) of the **test set**. 

We compare the classifier's predictions on the test features (X) to the actual labels y, to get a more accurate assessment of the **classification accuracy**.


Let's try this now...



In [13]:
# manually create a training set with 250 examples, and a test set that has the rest of the data

X_train_manual = X_penguin_features.iloc[0:250, :]
y_train_manual = y_penguin_labels.iloc[0:250]

X_test_manual = X_penguin_features.iloc[250:, :]
y_test_manual = y_penguin_labels.iloc[250:]


print(X_train_manual.shape)
print(X_test_manual.shape)


(250, 4)
(83, 4)


In [14]:
from sklearn.model_selection import train_test_split

# split data into a training and test set

X_train, X_test, y_train, y_test = train_test_split(X_penguin_features,  
                                                    y_penguin_labels, 
                                                    random_state = 0)

print(X_train.shape)
print(X_test.shape)

X_train.head(3)


(249, 4)
(84, 4)


Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
328,43.3,14.0,208.0,4575.0
182,40.9,16.6,187.0,3200.0
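
The shapes printed above follow from the default behavior of `train_test_split()`, which holds out 25% of the rows for testing and rounds the test count up: 0.25 × 333 = 83.25, so 84 test rows and 249 training rows. The sketch below makes `test_size` explicit and also passes `stratify`, an optional argument not used in the cell above, which keeps the class proportions similar in both sets. `X_toy` and `y_toy` are hypothetical stand-ins sized like the penguin data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 333 rows, with class sizes matching the penguin counts
X_toy = np.arange(333 * 2).reshape(333, 2)
y_toy = np.array(["Adelie"] * 146 + ["Chinstrap"] * 68 + ["Gentoo"] * 119)

# test_size=0.25 is the default; stratify=y_toy preserves class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print(X_tr.shape, X_te.shape)

# Count how many examples of each class landed in the test set
test_classes, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(test_classes, test_counts)))
```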


In [15]:
from sklearn.neighbors import KNeighborsClassifier


# construct a classifier
knn = KNeighborsClassifier(n_neighbors = 1) 

# “train” the classifier (which for a KNN classifier just involves memorizing the training data)
knn.fit(X_train_manual, y_train_manual) 



In [16]:
# get the predictions

penguin_preditions = knn.predict(X_test_manual)

penguin_preditions

array(['Gentoo', 'Gentoo', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie',
       'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',
       'Chinstrap', 'Adelie', 'Gentoo', 'Chinstrap', 'Adelie', 'Adelie',
       'Chinstrap', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo',
       'Gentoo', 'Chinstrap', 'Adelie', 'Chinstrap', 'Chinstrap',
       'Gentoo', 'Chinstrap', 'Gentoo', 'Chinstrap', 'Adelie', 'Adelie',
       'Chinstrap', 'Adelie', 'Chinstrap', 'Gentoo', 'Chinstrap',
       'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Adelie',
       'Gentoo', 'Chinstrap', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo',
       'Chinstrap', 'Chinstrap', 'Gentoo', 'Gentoo', 'Gentoo',
       'Chinstrap', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo',
       'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie',
       'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo', 'Adelie',
       'Adelie', 'Adelie', 'Gentoo'], dtype=object)

In [17]:
# Get the prediction accuracy 

np.mean(penguin_preditions == y_test_manual)



np.float64(0.7831325301204819)

In [18]:
# Test the classifier on the test set using the .score() method

knn.score(X_test_manual, y_test_manual) # prediction accuracy on the test set

0.7831325301204819

In [19]:
# What happens if we test the classifier on the training set? 

knn.score(X_train_manual, y_train_manual) # prediction accuracy on the training set



1.0

### K-fold cross-validation

In k-fold cross-validation we split our data into k parts (note: the k here has no relation to the k in k-Nearest Neighbors; k is simply a letter frequently used in math to denote integer values).  

To run a k-fold cross-validation analysis, we train the classifier on k-1 parts of the data and test it on the remaining part. We repeat this process k times to get k classification accuracies. We then take the average of these results as our estimate of our overall classification accuracy. 

We can use scikit-learn's `cross_val_score()` function to do this easily...


In [20]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors = 1) # construct knn classifier

# do 5-fold cross-validation
scores = cross_val_score(knn, X_penguin_features,  y_penguin_labels, cv = 5)

print(scores)

print(scores.mean())

[0.85074627 0.89552239 0.88059701 0.8030303  0.77272727]
0.8405246494798734
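
To make the k-fold procedure concrete, here is a sketch of the loop that `cross_val_score()` performs. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so the sketch uses `StratifiedKFold`. As an assumption for illustration, it runs on the iris data bundled with scikit-learn rather than the penguin data, so the numbers will differ from those above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for the penguin data
X_iris, y_iris = load_iris(return_X_y=True)

folds = StratifiedKFold(n_splits=5)
manual_scores = []

for train_idx, test_idx in folds.split(X_iris, y_iris):
    # Train on 4 of the 5 parts...
    knn_fold = KNeighborsClassifier(n_neighbors=1)
    knn_fold.fit(X_iris[train_idx], y_iris[train_idx])
    # ...and test on the remaining part
    manual_scores.append(knn_fold.score(X_iris[test_idx], y_iris[test_idx]))

manual_scores = np.array(manual_scores)

# The same analysis done in one call
library_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                 X_iris, y_iris, cv=5)

print(manual_scores, manual_scores.mean())
print(library_scores)
```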


## 4. Other classifiers

Many other types of classifiers have been created. Scikit-learn makes it very easy to try out a range of classifiers. 

Let's explore the Support Vector Machine and Random Forest classifiers on our penguin data...

In [21]:
# Suppress ConvergenceWarning if we need it.
#import warnings
#from sklearn.exceptions import ConvergenceWarning
#warnings.filterwarnings("ignore", category=ConvergenceWarning)
#warnings.filterwarnings("ignore", category=FutureWarning)


# Try a support vector machine (SVM)

from sklearn.svm import LinearSVC

svm = LinearSVC()   # max_iter=10000

scores = cross_val_score(svm, X_penguin_features,  y_penguin_labels, cv = 5)

print(scores.mean())

0.9879692446856627
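
The penguin features are on very different scales (body mass is in the thousands of grams, bill depth in the tens of millimeters), and a linear SVM may be slow to converge on unscaled features, which is one reason the warning-suppression lines and the `max_iter` comment appear above. A common remedy, not used in the cell above, is to standardize the features inside a pipeline so that the scaling is learned only from each training fold. The sketch below is one way to do this, assuming the bundled iris data as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: the bundled iris data stands in for the penguin data
X_iris, y_iris = load_iris(return_X_y=True)

# Standardize each feature (mean 0, standard deviation 1), then fit the linear SVM
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))

# The scaler is re-fit on the training part of every fold
scaled_scores = cross_val_score(scaled_svm, X_iris, y_iris, cv=5)

print(scaled_scores.mean())
```

The same pipeline idea applies to distance-based classifiers such as KNN, where the feature with the largest numeric range would otherwise dominate the distances.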


In [22]:
# Try a random forest

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier()

scores = cross_val_score(random_forest, X_penguin_features,  y_penguin_labels, cv = 5)

print(scores.mean())

0.9729534147444596


## 5. Building the KNN classifier

So far we have used the KNN classifier (and a few other classifiers). Let's now see if we can write code that will implement the KNN classifier.

We will do this by writing several helper functions that build on each other. These functions are: 

1. `euclid_dist(x1, x2)`: finds the Euclidean distance between two points `x1` and `x2`

2. `get_labels_and_distances(test_point, X_train_features, y_train_labels)`: This function finds the distance between a test point and all the training points. It returns a DataFrame with the distance from all training points and the training labels for each point.

3. `classify_point(test_point, k, X_train_features, y_train_labels)`: Classifies which class a test point belongs to

4. `classify_all_test_data(X_test_data, k, X_train_features, y_train_labels)`: Classifies which class each test point belongs to.


Let's start by writing a function that can get the Euclidean distance between two points `x` and `z`: 

$$dist(x, z) = \sqrt{\sum_{i = 1}^d (x_i - z_i)^2}$$


In [23]:
def euclid_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))


# test our function 
my_vec1 = np.array([1, 2, 3, 4])
my_vec2 = np.array([2, 3, 4, 5])

euclid_dist(my_vec1, my_vec2)

np.float64(2.0)
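
As a check on the result above: each of the four coordinates differs by 1, so the sum of squared differences is 4 and the distance is $\sqrt{4} = 2$. The sketch below repeats the function and compares it to NumPy's built-in `np.linalg.norm()`, which computes the same quantity when applied to the difference of the two vectors.

```python
import numpy as np

def euclid_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

my_vec1 = np.array([1, 2, 3, 4])
my_vec2 = np.array([2, 3, 4, 5])

# Our function and NumPy's vector norm should agree
by_hand = euclid_dist(my_vec1, my_vec2)
by_norm = np.linalg.norm(my_vec1 - my_vec2)

print(by_hand, by_norm)
```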

In [24]:
# Let's now write a function that returns the labels and distances 
# between a test point and all the training points


def get_labels_and_distances(test_point, X_train_features, y_train_labels):
    
    the_distances = []
    
    # get the distance between the test point and all training points
    for i in range(X_train_features.shape[0]):
        the_distances.append(euclid_dist(test_point, X_train_features.iloc[i]))

    
    # Create a DataFrame with the training labels and distances 
    labels_and_distances = pd.DataFrame({'label': y_train_labels, 'distance':the_distances})
    return labels_and_distances



# test our code 

test_data_point = X_test.iloc[0]
test_label = y_test.iloc[0]

labels_and_distances = get_labels_and_distances(test_data_point, X_train, y_train)

labels_and_distances.head(5)

Unnamed: 0,label,distance
0,Adelie,301.030829
328,Gentoo,525.106894
182,Chinstrap,850.212709
326,Gentoo,650.11922
124,Adelie,1000.271808
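
The loop above calls `euclid_dist()` once per training point, which is easy to read but can be slow for large training sets. As an alternative sketch, the same distances can be computed in one step with NumPy broadcasting. The function name `get_labels_and_distances_vectorized` and the toy data are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

def get_labels_and_distances_vectorized(test_point, X_train_features, y_train_labels):
    # Subtract the test point from every training row at once (broadcasting)
    diffs = X_train_features.to_numpy() - np.asarray(test_point)
    # Euclidean distance for each row: square, sum across features, square root
    the_distances = np.sqrt((diffs ** 2).sum(axis=1))
    return pd.DataFrame({"label": y_train_labels, "distance": the_distances})

# Hypothetical toy training data with a shuffled index, as in the penguin data
X_toy = pd.DataFrame({"f1": [0.0, 3.0, 6.0], "f2": [0.0, 4.0, 8.0]}, index=[7, 2, 5])
y_toy = pd.Series(["a", "b", "b"], index=[7, 2, 5])

toy_test_point = pd.Series({"f1": 0.0, "f2": 0.0})

result = get_labels_and_distances_vectorized(toy_test_point, X_toy, y_toy)
print(result)
```

The returned DataFrame keeps the index of the training labels, just like the loop-based version.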


In [25]:
# get the k closest neighbors

k = 5

sorted_labels_dist = labels_and_distances.sort_values("distance")

sorted_labels_dist = sorted_labels_dist.iloc[0:k]

sorted_labels_dist

Unnamed: 0,label,distance
167,Chinstrap,2.012461
165,Chinstrap,2.872281
53,Adelie,9.20489
63,Adelie,14.676853
115,Adelie,27.202206


In [26]:
# get the majority label

count_table = sorted_labels_dist.groupby("label").count().reset_index()

sorted_count_table = count_table.sort_values("distance", ascending = False)

sorted_count_table.iloc[0]["label"]

'Adelie'
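
The majority vote can also be written with Python's `collections.Counter`, shown here as an alternative sketch rather than a replacement for the `groupby()` approach. The list `nearest_labels` is typed in by hand from the five nearest neighbors shown above, ordered from closest to farthest.

```python
from collections import Counter

# Labels of the 5 nearest neighbors shown above, closest first
nearest_labels = ["Chinstrap", "Chinstrap", "Adelie", "Adelie", "Adelie"]

# Count how many times each label appears among the neighbors
label_counts = Counter(nearest_labels)

# most_common(1) returns a list with the single (label, count) pair that has the highest count
majority_label, majority_count = label_counts.most_common(1)[0]

print(majority_label, majority_count)
```

When two labels are tied, `Counter.most_common()` returns the label that was encountered first, so with neighbors listed closest first, the tied label with the closer first neighbor wins. Other implementations may break ties differently.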

In [27]:
# write a function to do the classification on a test point 
# by putting together all the pieces

def classify_point(test_point, k, X_train_features, y_train_labels):
    
    labels_and_distances =  get_labels_and_distances(test_point, 
                                                     X_train_features, 
                                                     y_train_labels)

    sorted_labels_dist = labels_and_distances.sort_values("distance")
    sorted_labels_dist = sorted_labels_dist.iloc[0:k]
    
    
    count_table = sorted_labels_dist.groupby("label").count().reset_index()
    sorted_count_table = count_table.sort_values("distance", ascending = False)
    majority_class = sorted_count_table.iloc[0]["label"]
    
    return majority_class


# test our classifier on one test point

prediction = classify_point(test_data_point, 5, X_train, y_train)

print(prediction)

print(test_label)

Adelie
Chinstrap


In [28]:
# classify a full test set

def classify_all_test_data(X_test_data, k, X_train_features, y_train_labels):
    
    predictions = []
    
    for i in range(X_test_data.shape[0]):
        
        curr_test_point = X_test_data.iloc[i]
        
        curr_prediction = classify_point(curr_test_point, 
                                         k, 
                                         X_train_features, 
                                         y_train_labels)
        
        predictions.append(curr_prediction)

    return np.array(predictions)
    


# test the classifier on the whole test set    

all_predictions = classify_all_test_data(X_test, 5, X_train, y_train)

all_predictions


array(['Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo', 'Adelie',
       'Gentoo', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Adelie',
       'Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo',
       'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo',
       'Chinstrap', 'Gentoo', 'Gentoo', 'Gentoo', 'Adelie', 'Adelie',
       'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo',
       'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo',
       'Gentoo', 'Chinstrap', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo',
       'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Chinstrap',
       'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Gentoo',
       'Gentoo', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo'],
      dtype='<U9')

In [29]:
# get the classification accuracy

np.mean(all_predictions == y_test)



np.float64(0.7857142857142857)
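
A useful sanity check for a hand-written classifier is to compare its accuracy with scikit-learn's `KNeighborsClassifier` on the same split. The sketch below does this with a compact NumPy version of the algorithm (the function `knn_predict` is hypothetical, written only for this comparison) and, as an assumption, the bundled iris data rather than the penguin data. The two accuracies should be close, though they may not match exactly because ties can be broken differently.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_test_arr, k, X_train_arr, y_train_arr):
    predictions = []
    for test_point in X_test_arr:
        # Distance from this test point to every training point
        distances = np.sqrt(((X_train_arr - test_point) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest_idx = np.argsort(distances)[:k]
        # Majority vote (on ties, the smallest label in sorted order wins)
        labels, counts = np.unique(y_train_arr[nearest_idx], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Assumption: the bundled iris data stands in for the penguin data
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)

own_accuracy = np.mean(knn_predict(X_te, 5, X_tr, y_tr) == y_te)
sklearn_accuracy = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)

print(own_accuracy, sklearn_accuracy)
```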