# Machine Learning

## Goals

* Learn the jargon and understand the pipelines of Machine Learning
* Set up a classification model
* Set up a regression model
* Set up an unsupervised (clustering) model
---

Approach to learning: I would like to take an approach where we are playing with data as early as possible. As such, we will be talking about concepts as we encounter them in the data.

---

## What is machine learning?

Loosely, machine learning can be defined as:

```Algorithms that allow a computer to predict patterns in unseen data based on learning done on data that has previously been seen```

Typically, the more data you allow the computer to learn with, the better job it will do.

---

## Why use machine learning?

* Fraud detection
* Spam detection
* Credit risk
* Voice recognition
* Image recognition
* Recommendations (search engines)
* Finding patterns in the stock market
* Housing prices

## Traditional modeling vs machine learning

Usually in science we work with a **white box model**, such as a set of equations, and we fully understand how the model works.

[Image of equations]

With machine learning, we let the machine figure out the details. While we understand the algorithms that allow the computer to learn, we don't always understand the specific insights those algorithms learn from the data (**black box model**).

[Image of black box with inputs outputs]

This feature is sometimes used as an argument against the practice of machine learning.

Here is a thought-provoking [blog entry about this from Rich Sutton](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) (pioneer of reinforcement learning, U of A faculty).

___

## The Machine Learning Process


* **Define the problem**
  * This can be difficult. What are we trying to achieve? What are we trying to predict?
* **Get the data**
  * This is often connected with the problem definition step, because knowing about the data helps clarify what we can do with it
* **Prepare data**
  * Exploratory data analysis and visualization
  * Cleaning data
  * Often the most tedious and time-consuming step
* **Select algorithm**
  * Set up one or more machine learning pipelines
* **Train the model**
  * Feed the algorithm data
* **Test the model**
  * Maybe we need to go back and select a different algorithm to work with?
* **Select the best model**
  * The definition of "best" depends on the type of problem, the type of data, and our goals
* **Predict**
  * Use the model to make predictions based on unseen data
* $$$ (?)


---

## Data science tools in Python we will be using

* Data analysis and cleaning/transforming: **pandas**
* Visualization: **matplotlib** (possibly **seaborn** and **plotly**)
* Scientific computing/number crunching: **numpy**
* Machine learning algorithms: **Scikit-learn**

In [None]:
# Let's make sure that we have the tools available to us ...

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

In [None]:
# ... if not, may need to uncomment one or more of:
# (Replace pip with conda where applicable)

# !pip install numpy
# !pip install pandas
# !pip install matplotlib
# !pip install scikit-learn

## Getting the data for this notebook

We can download the data if we don't have it already

In [None]:
# Download data and solutions

import urllib.request
import os

def download_data(path):
    # Skip the download if the file is already present
    if os.path.exists(path):
        return
    # Create the target folder (e.g. data/titanic), including parent folders
    os.makedirs(os.path.dirname(path), exist_ok=True)
    url = 'https://raw.githubusercontent.com/ualberta-rcg/python-machine-learning/master/notebooks/' + path
    output_file = path
    urllib.request.urlretrieve(url, output_file)
    print("Downloaded " + path)

download_data('data/titanic/train.csv')
download_data('data/titanic/test.csv')
download_data('solutions/titanic-passenger-class.py')
download_data('solutions/titanic-random-forest-pipeline.py')

## More about Machine Learning tools

Four packages that are available to us for free:

* Scikit-learn
  * Easy to understand
  * Great for learning
  * Consistent interface
* TensorFlow
  * From Google
  * Takes advantage of GPUs
* PyTorch
  * From Facebook
  * Takes advantage of GPUs
* Keras
  * Built on top of TensorFlow
  * Easier to understand and use

Some comparisons:
https://towardsdatascience.com/scikit-learn-tensorflow-pytorch-keras-but-where-to-begin-9b499e2547d0

---

## A problem: The Titanic Kaggle challenge

We will learn some of the jargon of machine learning by walking through an example.

[Kaggle](https://www.kaggle.com/) (owned by Google) is a machine learning competition website.

Competitions are either for fun, for money, or might lead to a job offer.

The introductory competition involves the sinking of the [Titanic](https://www.kaggle.com/c/titanic):

```
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
```

In the competition, we are given 891 rows of data (one row per passenger) in the file `train.csv` where we know whether the passenger survived. As the name indicates, we will use this data to train a machine learning model.

There is also a file called `test.csv` that includes information on 418 other passengers, but it does not say whether each passenger survived. To enter the competition, we make predictions for this file and submit them.

I've done two submissions, one which guessed 71% correctly, the other guessed 76% correctly.

We'll explore this problem to learn some of the concepts related to machine learning.

## An initial look ...

In [None]:
train_df = pd.read_csv('data/titanic/train.csv')
train_df.info()

In [None]:
# Let's look at the first 10 records ...
train_df.head(10)

Do you see any good candidates for predictors of survival?

## Features and Labels

In short, the inputs to a machine learning model are called **features** (or **attributes**), and the predicted output is called a **label**.

The **label** that we are trying to predict in the Titanic challenge is the **Survived** column.

We have 11 potential **features** we can use to predict this **label**.

* A feature can be binary, nominal, or numerical.
* We want to choose features that have predictive power.
* We want to choose features that are as independent as possible
    * E.g., if `weight_in_pounds` is a predictive feature in a model, don't also choose `weight_in_kilograms`
* Note that we can also design our own features from the data provided. (E.g., if we are predicting stock prices and we are given the previous opening and closing prices of a stock, maybe the difference would be a good predictor?). The process of designing and choosing features is called **feature engineering**.
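
As a minimal sketch of feature engineering, assume a tiny hand-made table that only mimics the Titanic columns `SibSp` and `Parch`. The rows and the `FamilySize` column name are made up for illustration. A new feature could then be derived from two existing ones:

```python
import pandas as pd

# Hypothetical rows that only mimic the Titanic columns (not real passengers)
toy_df = pd.DataFrame({
    "SibSp": [1, 0, 3],   # siblings/spouses aboard
    "Parch": [0, 0, 2],   # parents/children aboard
})

# Engineered feature: family members aboard, counting the passenger
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1
print(toy_df)
```

Whether such a feature actually has predictive power is something we would have to test with a model.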

To summarize the terminology:

* Feature/Attribute
  * A single variable (binary, nominal, numerical)
* Label/Class
  * Extra information that categorizes/classifies a given instance
* Instance/Feature vector
  * One entity described by features
* Dataset
  * Collection of labeled or unlabeled instances

---

**Question**: what could be possible features for a model that predicts housing prices?

---

## Types of Learning

### Supervised
* All instances in training data are **labeled**
* **Classification** - predicting nominal label
  * We are looking to build models that separate data into distinct classes
  * Algorithms:
    * Decision Trees
    * Random Forest
    * Support Vector Machine
  * E.g.,
    * Did the person survive the Titanic? (True/False)
    * What species of plant is this?
* **Regression** - predicting numerical label
  * Based on previous data, predict a continuous numerical quantity
  * Algorithms
    * Linear regression
    * Polynomial regression
  * E.g.,
    * Predict the high temperature for tomorrow
    * Predict the closing price of a stock tomorrow

### Unsupervised
* There are **no labels** for the instances
* We are trying to find hidden meaning in data without additional guidance
* E.g., Find ten categories that a collection of emails fall into (clustering)
* Algorithms:
  * KMeans
  * DBSCAN

### Reinforcement Learning
* Algorithms learn which actions to take based on the responses (rewards) they get from an environment
* There are no labels up front; feedback only arrives after an action has been taken
* Check this out, a DeepMind model learning to play Atari Breakout: https://www.youtube.com/watch?v=V1eYniJ0Rnk

---

## Exploratory Data Analysis

**Exploratory Data Analysis (EDA)** is poking around data (e.g., looking at descriptive statistics or plots) to gather insights.

---

The `value_counts` method of a pandas `Series` is useful for exploring the data. In essence, it gives us the frequency table for a column in the data.
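
As a small sketch on made-up labels (not the Titanic data), `value_counts` can return either absolute counts or, with `normalize=True`, relative frequencies. The name `toy_survived` is invented for this example:

```python
import pandas as pd

# Made-up labels for illustration (1 = survived, 0 = died)
toy_survived = pd.Series([0, 1, 0, 0, 1])

counts_abs = toy_survived.value_counts()                 # absolute frequencies
counts_rel = toy_survived.value_counts(normalize=True)   # relative frequencies

print(counts_abs)
print(counts_rel)
```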

Let's find out the survival rate...

In [None]:
train_df['Survived'].value_counts()

In [None]:
counts = train_df['Survived'].value_counts()
percent = round(100 * counts[1] / len(train_df))
print('Survival rate: {}%'.format(percent))

We can look at the influence of various features on survival rate...

In [None]:
sex_counts = train_df['Sex'].value_counts()
sex_counts

In [None]:
# This is how the series returned from `value_counts` is organized

print(sex_counts.keys())
print(sex_counts.values)

Let's look at the data for the males alone

In [None]:
male_df = train_df[train_df['Sex'] == 'male']

In [None]:
male_df['Survived'].value_counts()

In [None]:
# and now the survival rate for males ...

counts = male_df['Survived'].value_counts()
percent = round(100 * counts[1] / len(male_df), 2)
print('Survival rate: {}%'.format(percent))

At this point you will have recognized that you have done an almost identical set of operations on two different dataframes ... this sounds like a good time to create a function!

In [None]:
def survival_report(df):
    if len(df) == 0:
        print('Empty data')
        return
    counts = df['Survived'].value_counts()
    print('             Lived: {}'.format(counts.get(1, 0)))
    print('              Died: {}'.format(counts.get(0, 0)))
    print('Chance of survival: {}%'.format(round(100 * counts.get(1, 0) / len(df), 2)))

In [None]:
survival_report(male_df)

The new function makes exploring the data for the female passengers easy ...

In [None]:
survival_report(train_df[train_df['Sex'] == 'female'])

**Question: is gender a good predictor of survival?**

---

## Exercise

Using our `survival_report` function above, explore the survival rates for the values of Passenger class ...

In [None]:
### Your code here ...


In [None]:
# PRINT SOLUTION (copy/paste output into a cell to run)
!cat solutions/titanic-passenger-class.py

## Load solution here

In [None]:
# Features we will use to predict survival
features = ["Pclass", "Sex", "SibSp", "Parch"]

## One-hot encoding

In [None]:
X = pd.get_dummies(train_df[features])

In [None]:
X
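
Most Scikit-learn models expect numeric inputs, so a text column such as `Sex` has to be converted first. One-hot encoding replaces a categorical column with one indicator column per category. Here is a minimal sketch on made-up rows (the `toy_df` and `encoded` names are only for illustration):

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({"Pclass": [1, 3, 2],
                       "Sex": ["male", "female", "female"]})

# Text columns are expanded into one indicator column per category;
# numeric columns such as Pclass pass through unchanged.
encoded = pd.get_dummies(toy_df)
print(encoded.columns.tolist())
print(encoded)
```

Note that `Pclass` is left as a number here, which treats passenger class as an ordered quantity.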

In [None]:
y = train_df['Survived']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

---

## Decision Tree

The model we are going to initially try is a decision tree.

![](https://www.seriouseats.com/images/20100120-flowchart-floorfood.jpg)

A number of questions are asked about the data, and decisions are made based on the answers. These answers are refined or changed based on additional information.
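
To make the idea concrete, here is a hand-written stand-in for a very small decision tree. The questions and outcomes are invented for illustration; a real tree learns its own questions from the training data.

```python
def toy_survival_tree(sex, pclass):
    """Hand-written stand-in for a depth-2 decision tree (rules are invented)."""
    if sex == "female":        # first question
        if pclass <= 2:        # second question, asked only of females
            return 1           # predict survived
        return 0               # predict died
    return 0                   # predict died


print(toy_survival_tree("female", 1))
print(toy_survival_tree("male", 3))
```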

---

### `sklearn.tree.DecisionTreeClassifier`

The [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) looks at your data and figures out what questions to ask automatically.

We can specify an option for `max_depth` (basically, how deep do we want the questions to go).

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3)

---

### `fit`

Scikit-learn models have a `fit` method that trains the model on your features and labels.

In [None]:
# fit trains the model in place and also returns the model itself,
# so the reassignment below is optional

model = model.fit(X_train, y_train)

---

### `predict`

Scikit-learn classifiers and regressors have a `predict` method that lets a trained model make predictions when given unlabeled features.

Here we use our **test data** (data that the model wasn't trained on) as an input

In [None]:
predictions = model.predict(X_test)

In [None]:
print("The first ten predicted labels for the test data")
print(list(predictions[:10]))
print('The ten actual labels')
print(list(y_test[:10]))

### But what did the decision tree do?

In [None]:
# If you want to install graphviz ....
# Note for conda, you may have to install both graphviz and python-graphviz
# !pip install graphviz
# !conda install graphviz python-graphviz

In [None]:
import graphviz
from sklearn.tree import export_graphviz

# Render the fitted tree; class_names follow the label order 0, 1
dot_data = export_graphviz(model,
                           feature_names=list(X.columns),
                           class_names=['Died', 'Survived'],
                           filled=True, rounded=True,
                           special_characters=True,
                           out_file=None)
graph = graphviz.Source(dot_data)
graph

## Measuring the quality of our predictions

Ah, statistics and jargon!

**Accuracy** is the proportion of predictions that are correct.

* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

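**Precision** and **recall** are less obvious terms, and are tied to **Type I** and **Type II** errors in statistics. Precision is the proportion of positive predictions that are actually positive; recall is the proportion of actual positives that are predicted as positive (the same quantity as sensitivity).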

**Type I error** is the rejection of a true null hypothesis (also known as a **"false positive"**)
* E.g., an innocent person is convicted
* E.g., a healthy person got a medical test saying they are sick
* E.g., a legitimate email is marked as spam
* E.g., a passenger predicted to survive the Titanic actually died

**Type II error** is the non-rejection of a false null hypothesis (also known as a **"false negative"**)
* E.g., a guilty person is not convicted
* E.g., a sick person got a medical test saying they are healthy
* E.g., a spam email is not marked as spam
* E.g., a passenger predicted to die actually survived the Titanic

**Sensitivity** is the "True Positive rate"; it measures the proportion of positives that are correctly identified
* E.g., proportion of guilty people correctly convicted
* E.g., proportion of sick people the medical test correctly identifies as sick
* E.g., proportion of spam that is correctly identified as spam
* E.g., proportion of Titanic survivors identified as survivors

**Specificity** is the "True Negative rate"; it measures the proportion of negatives that are correctly identified
* E.g., proportion of innocent people not convicted
* E.g., proportion of healthy people the medical test correctly identifies as healthy
* E.g., proportion of legitimate email that goes to our inbox (not marked as spam)
* E.g., proportion of dead Titanic passengers predicted to be dead

* https://en.wikipedia.org/wiki/Precision_and_recall
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
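
As a worked example, the definitions above can be computed by hand from the four outcome counts. The counts here are invented purely for illustration:

```python
# Hypothetical counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # sensitivity / true positive rate
specificity = tn / (tn + fp)                 # true negative rate

print(accuracy, precision, recall, specificity)
```

With these numbers, accuracy is 0.7, precision is 0.8, recall is about 0.67, and specificity is 0.75, which shows that a single model can look quite different depending on the metric.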

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix

print('Accuracy: {}'.format(accuracy_score(y_test, predictions)))
print('Precision: {}'.format(precision_score(y_test, predictions)))
print('Recall: {}'.format(recall_score(y_test, predictions)))

In [None]:
matrix = confusion_matrix(y_test, predictions)
matrix

In [None]:
tn = matrix[0][0] # True negatives
fn = matrix[1][0] # False negatives
fp = matrix[0][1] # False positives
tp = matrix[1][1] # True positives
print('True negatives: {}'.format(tn))
print('False negatives: {}'.format(fn))
print('False positives: {}'.format(fp))
print('True positives: {}'.format(tp))
print('Sensitivity: {}'.format(tp / (tp + fn)))
print('Specificity: {}'.format(tn / (tn + fp)))

---

### Putting it all together ...

We now have a lot of code spread out over a number of cells in this notebook.

Here we combine it together so we can get a better picture of what the entire pipeline looks like.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# More about this line shortly ...
# np.random.seed(1337)

# Load data
train_df = pd.read_csv('data/titanic/train.csv')

# Choose features and labels
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_df[features])
y = train_df['Survived']

# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Initialize model and fit to training data
model = DecisionTreeClassifier(max_depth=3)
model = model.fit(X_train, y_train)

# Use model to predict on unseen test data
predictions = model.predict(X_test)

# Evaluate how well the model did
print('Accuracy: {}'.format(accuracy_score(y_test, predictions)))
print('Precision: {}'.format(precision_score(y_test, predictions)))
print('Recall: {}'.format(recall_score(y_test, predictions)))

---

## Repeatability ...

Run the above cell a few times. What happens?

A lot of the code related to machine learning depends on the output of random number generators, which can make repeating results difficult.

Scikit-learn uses NumPy's random number generator, and luckily we can **seed** this random number generator (give the generator an initial number so that the **same sequence of random numbers** is generated whenever that seed is used).

**Uncomment the line that looks like this: `np.random.seed(1337)`**

Now run the code several times and watch the results.

Note: the number `1337` isn't special (it's often used for seeds, as it's internet slang for 'leet' or 'elite'). 

Pick whichever number makes you happy -- but use it every time for the same pipeline if you are interested in repeatability.
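
An alternative to seeding the global generator, sketched here on synthetic data rather than the Titanic set, is to pass a seed through the `random_state` argument that both `train_test_split` and `DecisionTreeClassifier` accept. The helper `run_once` and the toy data are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic set)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)


def run_once(seed):
    # Passing the seed explicitly fixes both the split and the tree
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.33, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=3, random_state=seed)
    tree.fit(X_tr, y_tr)
    return tree.score(X_te, y_te)   # accuracy on the held-out part


print(run_once(1337), run_once(1337))
```

Both calls should print the same accuracy, because every source of randomness in the pipeline received the same seed.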


---
---

## If a tree is good, is a forest better?

What if we could set up a whole bunch of decision trees, each asking different questions, then vote on their final decisions to get a prediction? This is what Random Forest does.

Random Forest is called an **ensemble** model, because it combines the results of multiple models to try to get a better answer.
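
The voting idea can be sketched in a few lines of plain Python. The votes are invented, and this shows only the intuition: Scikit-learn's random forest combines its trees by averaging their predicted class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common prediction among the individual trees."""
    return Counter(votes).most_common(1)[0][0]


# Invented predictions from five hypothetical trees for one passenger
tree_votes = [1, 0, 1, 1, 0]
print(majority_vote(tree_votes))
```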

![](https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png)

---

In Scikit-learn, Random Forest is implemented by [`sklearn.ensemble.RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

As with `DecisionTreeClassifier`, we have a setting for the `max_depth` of the generated decision trees.

We also have a setting for the number of trees to use, `n_estimators`.

E.g., `model = RandomForestClassifier(n_estimators=100, max_depth=3)`

---

### Exercise: use a `RandomForestClassifier`

Set up a machine learning pipeline using a `RandomForestClassifier` model:

* Read the Titanic data from a file
* Split the data into training and testing data
* Fit a random forest model to our training data
* Predict labels using our test data.
* Evaluate this model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Your code goes here ...
# Hint: other than a couple of lines of code, it should look
#   very much like the decision tree pipeline above 

In [None]:
# PRINT SOLUTION (copy/paste output into a cell to run)
!cat solutions/titanic-random-forest-pipeline.py

---

## Handling missing data

Let's look at the information about our data set again:

In [None]:
train_df.info()

---

Suppose we thought that age was a good predictor of survival.

We have a problem though:

**Not every row has data for age recorded**

Unfortunately, many machine learning algorithms **don't know how to handle empty data**.

We have some strategies to deal with this:

* We could throw out any rows with missing data
* We could substitute a default value (e.g., a mean if available, zero, -1, 9999, or some other value)

Let's look at how to throw out rows, and how to replace a null value with a mean.

---

### Throwing out rows

First, here is a tool for detecting the presence of missing (null) values:

In [None]:
train_df.isnull()

In [None]:
# Count the missing data in each column
train_df.isnull().sum()

Pandas dataframes have a method called [`dropna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) that can be used to filter out rows that have missing data.

We can specify which columns to look at using the `subset` keyword argument.

Note that this method by default does not modify the dataframe in place, but **returns a new dataframe** with those rows removed.

In [None]:
age_non_null_train_df = train_df.dropna(subset=['Age'])
age_non_null_train_df.info()

# If we decide to go further down this road, we might do either:
#    train_df.dropna(subset=['Age'], inplace=True)
#                or
#    train_df = train_df.dropna(subset=['Age'])

---

### Replacing missing data with a mean

With this strategy, we don't throw out any rows, but instead create a 'fictional' age value for the rows with missing age.

Let's look at the mean age from the non-null values:

In [None]:
train_df.describe()

Pandas `Series` have a method called [`fillna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html) that can replace null values in a column with a specific value.

In [None]:
# We can make a copy of the dataframe if we don't want to
# modify the original (optional)...
age_mean_train_df = train_df.copy()
# Overwrite the column with new data with the missing data filled
age_mean_train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())

In [None]:
age_mean_train_df.describe()

## Exercise

Use age as an additional feature in a machine learning pipeline.

* Use either the dataset with the null rows thrown out (e.g., `train_df = age_non_null_train_df`) or the dataset with the missing age data replaced by mean (e.g., `train_df = age_mean_train_df`). You decide.
* Use whichever classifier algorithm you'd like (`DecisionTreeClassifier` or `RandomForestClassifier`).
* Play with whichever keyword arguments you might like to change ... any effect on the results?

## Overfitting

---
---

## Unsupervised learning

**The problem**: we have some data, and we aren't given a label that neatly categorizes it. But we want to separate the data in some meaningful way (in clusters, using some measure of nearness).

**We need to supply how many clusters we want the algorithm to find ahead of time.**

We don't know what the clusters represent, just that we are hoping that there will be a division in the data that will help us understand it.

---

In the following example, we'll explore unsupervised clustering with an algorithm called KMeans.

First, let's create a function that builds a mock dataset for us.

The function will sample one thousand points in the x-y plane (`blobs`) from 3 different probability distributions using the Scikit-learn function `make_blobs`. We will keep track of which distribution each point is sampled from (`cluster_labels`).

These will be returned from our function and stored in the variables `xy_points` and `labels` (**note: the KMeans algorithm won't know about the label here, but we can use it in this contrived example to examine the output**)

In [None]:
import numpy
from sklearn.datasets import make_blobs

numpy.random.seed(1337)

centers = [[-10, -10], [-10, 13], [8, -1]]

def get_points_and_labels(**kwargs):
    blobs, cluster_labels = make_blobs(n_samples=1000, n_features=2,
                                       centers=centers, cluster_std=5.0)
    return blobs, cluster_labels

xy_points, labels = get_points_and_labels(initialize_seed=True)

Next, let's make a function that will visualize our x-y points, optionally coloring the points if labels are provided.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

def plot_clusters(title, xy_points, labels=None):
    plt.figure()
    plt.title(title)
    xy_points_df = pd.DataFrame(xy_points, columns=['x', 'y'])

    if labels is None:
        plt.scatter(xy_points_df.x, xy_points_df.y, c="grey")
    else:
        xy_points_df['labels'] = pd.Series(labels)
        colours = ["red", "blue", "green"]
        clusters = [0, 1, 2]
        for cluster_id in clusters:
            cluster_data = \
                xy_points_df.loc[xy_points_df["labels"] == cluster_id,
                                 ["x", "y"]]
            plt.scatter(cluster_data.x, cluster_data.y,
                        c=colours[cluster_id])

    plt.show()

Let's look at our x-y points both with and without the labels.

In [None]:
print('First 10 xy_points: \n',xy_points[:10])
plot_clusters('Unlabeled clusters', xy_points)
print('First 10 labels: \n', labels[:10])
plot_clusters('Labeled clusters', xy_points, labels)

### KMeans

From [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering) ...

---

1. k initial "means" (or "seeds", in this case k=3) are randomly generated within the data domain (shown in color).
![](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5e/K_Means_Example_Step_1.svg/200px-K_Means_Example_Step_1.svg.png)

---

2. k clusters are created by associating every observation with the nearest mean.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/K_Means_Example_Step_2.svg/200px-K_Means_Example_Step_2.svg.png)

---

3. The centroid of each of the k clusters becomes the new mean.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/K_Means_Example_Step_3.svg/200px-K_Means_Example_Step_3.svg.png)

---

4. Steps 2 and 3 are repeated until convergence has been reached (the result is not guaranteed to be the best possible clustering)

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/K_Means_Example_Step_4.svg/200px-K_Means_Example_Step_4.svg.png)
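
The four steps can be sketched in NumPy as follows. This is a teaching sketch, not Scikit-learn's implementation, and it makes one simplifying assumption: the initial means are picked from the data points themselves instead of being generated anywhere in the data domain. The names `simple_kmeans` and `toy_points` are made up.

```python
import numpy as np


def simple_kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means iteration; a teaching sketch, not Scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial means
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest mean
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Step 3: move each mean to the centroid of its cluster
        # (an empty cluster keeps its previous mean)
        new_means = np.array([
            points[assignment == j].mean(axis=0)
            if np.any(assignment == j) else means[j]
            for j in range(k)
        ])
        # Step 4: stop once the means no longer move
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, assignment


# Two well-separated made-up blobs around (0, 0) and (10, 10)
blob_rng = np.random.default_rng(1)
toy_points = np.vstack([blob_rng.normal(0, 0.5, (50, 2)),
                        blob_rng.normal(10, 0.5, (50, 2))])
toy_means, toy_assignment = simple_kmeans(toy_points, k=2)
print(toy_means)
```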


---

In Scikit-learn, KMeans is provided by the [`sklearn.cluster.KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) class.

Notice that we specify the number of clusters we want the algorithm to find with `n_clusters`.
The setting `n_init=1000` means that we will try 1000 times with different initial means, then choose the best result.

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=1000)
kmeans.fit(xy_points)

In [None]:
print('\nActual cluster means')
for x_y in centers:
    print('%f,%f' % (x_y[0], x_y[1]))
    
print('\nPredicted cluster means')
for x_y in kmeans.cluster_centers_:
    print('%f,%f' % (x_y[0], x_y[1]))

In [None]:
kmeans_labels = kmeans.predict(xy_points)

print('First 10 actual labels: ', labels[:10])
print('First 10 computed labels: ', kmeans_labels[:10])

plot_clusters('Re-plot of original clusters', xy_points, labels)
plot_clusters('Calculated clusters', xy_points, kmeans_labels)

### Huh?

Notice that many of the predicted labels may not match the actual labels, even though the clusters themselves look right!

KMeans finds clusters, but it has no way of knowing what the actual labels mean. It just detects clusters.

You will notice in the above plots that the shapes of the clusters are pretty close, but the colors of the individual clusters might be swapped.
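
One way to compare two groupings without caring how the clusters are numbered is the adjusted Rand index, available as `sklearn.metrics.adjusted_rand_score`. The sketch below uses made-up labels in which the grouping is identical but the cluster ids are swapped:

```python
from sklearn.metrics import adjusted_rand_score

# Made-up labels: same grouping, but the cluster ids are swapped
true_ids = [0, 0, 1, 1, 2, 2]
found_ids = [2, 2, 0, 0, 1, 1]

# Plain element-wise agreement finds no matches ...
matches = sum(t == f for t, f in zip(true_ids, found_ids))
# ... while the adjusted Rand index ignores how the clusters are named
ari = adjusted_rand_score(true_ids, found_ids)
print(matches, ari)
```

A score of 1.0 means the two groupings are identical up to renaming; values near 0 mean the agreement is no better than chance.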

In [None]:
xy_points2, labels2 = get_points_and_labels()
kmeans_labels2 = kmeans.predict(xy_points2)

plot_clusters('New clusters', xy_points2, labels2)
plot_clusters('New predicted clusters', xy_points2, kmeans_labels2)