# Lecture 15: September 11th, 2023

We made it to the last lecture :') How do we all feel today?

__Reminders and Updates:__ 
* Homework 7 and Homework 8 are due tonight at midnight. Use your tokens to get an extension, if you need it. 
* You can also use your tokens to submit an outcome revision. If you are missing an outcome and would like credit for it, you can use tokens to submit a revision assignment. Let's take a brief trip to Canvas to see what this looks like. 
* Today we will cover K-Means clustering, which is the topic of outcomes P18 and P19. They will be released after today's lecture and are due Wednesday at midnight, if you would like to attempt them (remember, an A only requires 18 outcomes).
* Final project is due Wednesday at midnight! No late submissions will be accepted! 

* Updated student hours with Yasmeen this week: instead of Tuesday student hours, I will host Wednesday student hours from 12:00pm-1:50pm (our normal lecture time). Come with any questions you might have! 


## Final Project Questions

 Q: What if we can't find a perfect linear fit?
 A: That's totally fine! Maybe just point out it doesn't look like a linear fit, and you can't say much about the relationship. You might consider fitting to a higher degree polynomial, as well.

Q: What if our precision is really low? 
A: That's also fine, just don't try to convince us that the model works really well. You might consider trying to improve the model, or at least elaborate on a few things that you could try. Negative results are fine: I'd rather you say "we can't make any conclusion" than try to convince us of something that's not true.

## K-Means Clustering

Today we will see our first example from _unsupervised_ machine learning. Can anyone remind us of what the main difference is from _supervised_ machine learning?

__Main Difference:__ Supervised learning uses labeled data, unsupervised learning uses unlabeled data.

Clustering is one of the most famous examples of unsupervised learning, and is something we will see today.

Here is a nice flowchart from the website [GeeksforGeeks](https://www.geeksforgeeks.org/flowchart-for-basic-machine-learning-models/) which gives a nice outline of some different categories within machine learning. Don't take it too seriously, it's just kind of fun to look at. 

![](flowchart.jpeg)

### StandardScaler

I'm a little surprised we haven't talked about this topic yet, but today we'll be rescaling numeric data so that it has a mean of zero and a standard deviation of one. Why would we want to do something like this?

From the chat: we want to make things not dependent on units! 
Example: Suppose I have a DataFrame with one column in kilograms and one column in grams. We might expect that the grams column has numbers that are much larger than the kilograms column. Even though these are both weights, we might be biased towards grams since they are larger numbers. 
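To make the units point concrete, here is a minimal sketch of standardizing by hand: subtract the mean, then divide by the standard deviation. The weights and the helper name `standardize` are made up for illustration, and this is only meant to mirror the idea behind `StandardScaler`, not its actual implementation.

```python
import statistics

def standardize(values):
    """Rescale values to have mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: divide by n
    return [(v - mean) / std for v in values]

weights_kg = [1.0, 2.0, 4.0]                # made-up weights in kilograms
weights_g = [1000 * w for w in weights_kg]  # the same weights in grams

z_kg = standardize(weights_kg)
z_g = standardize(weights_g)

# After rescaling, the two columns agree (up to floating point rounding),
# even though the raw gram values are 1000 times larger.
print(z_kg)
print(z_g)
```

Once both columns are rescaled, the choice of kilograms versus grams no longer matters.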

* Load the iris dataset from Seaborn and drop any rows with missing values.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
import altair as alt

In [2]:
df = sns.load_dataset("iris").dropna()
df.sample(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
72,6.3,2.5,4.9,1.5,versicolor
139,6.9,3.1,5.4,2.1,virginica
25,5.0,3.0,1.6,0.2,setosa
7,5.0,3.4,1.5,0.2,setosa
68,6.2,2.2,4.5,1.5,versicolor


We still have the standard workflow for scikit-learn:
* import 
* instantiate
* `fit`
* `predict` or `transform`

* Import `StandardScaler` from `sklearn.preprocessing`

In [3]:
from sklearn.preprocessing import StandardScaler

* Instantiate a `StandardScaler` object and name it `scaler`.

In [4]:
scaler = StandardScaler()

In [5]:
type(scaler)

sklearn.preprocessing._data.StandardScaler

* Try fitting StandardScaler to the iris dataset.

In [6]:
scaler.fit(df)

ValueError: could not convert string to float: 'setosa'

What's going on here? Notice the species column has strings, and it doesn't make sense to scale these. When we use StandardScaler, we only want to apply it to numeric columns.

Let's get just the numeric columns of our DataFrame. Notice, the only non-numeric column is "species".

In [7]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [8]:
numcols = [c for c in df.columns if c != "species"]
numcols

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

Here `!=` means not equal. Here's another way we could do it.

In [9]:
[c for c in df.columns if not (c == "species")]

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

I also just want to show you the following function.

In [10]:
from pandas.api.types import is_numeric_dtype

In [11]:
is_numeric_dtype(df["species"])

False

In [12]:
is_numeric_dtype(df["sepal_length"])

True

In [13]:
df.apply(is_numeric_dtype, axis=0)

sepal_length     True
sepal_width      True
petal_length     True
petal_width      True
species         False
dtype: bool

In [14]:
[c for c in df.columns if is_numeric_dtype(df[c])]

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
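One more option, in case it's useful: pandas can pick out numeric columns directly with `select_dtypes`. The small DataFrame `toy` below is made up so that the sketch runs on its own; with the real data you would call the same method on `df`.

```python
import pandas as pd

# A tiny made-up stand-in for the iris DataFrame
toy = pd.DataFrame({
    "sepal_length": [5.1, 6.3],
    "petal_width": [0.2, 1.5],
    "species": ["setosa", "versicolor"],
})

# Keep only the numeric columns, then grab their names as a list
numcols_alt = toy.select_dtypes(include="number").columns.tolist()
print(numcols_alt)  # ['sepal_length', 'petal_width']
```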

Now let's try fitting again.

In [15]:
scaler.fit(df[numcols])

* Call `transform` on the numeric columns of `df`.

In [16]:
df[numcols] = scaler.transform(df[numcols])

In [17]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,-0.900681,1.019004,-1.340227,-1.315444,setosa
1,-1.143017,-0.131979,-1.340227,-1.315444,setosa
2,-1.385353,0.328414,-1.397064,-1.315444,setosa
3,-1.506521,0.098217,-1.283389,-1.315444,setosa
4,-1.021849,1.249201,-1.340227,-1.315444,setosa
...,...,...,...,...,...
145,1.038005,-0.131979,0.819596,1.448832,virginica
146,0.553333,-1.282963,0.705921,0.922303,virginica
147,0.795669,-0.131979,0.819596,1.053935,virginica
148,0.432165,0.788808,0.933271,1.448832,virginica


* Check that the mean and standard deviation of the resulting columns are 0 and 1, respectively. 

In [18]:
df[numcols].mean(axis=0)

sepal_length   -4.736952e-16
sepal_width    -7.815970e-16
petal_length   -4.263256e-16
petal_width    -4.736952e-16
dtype: float64

These values are not exactly zero, but very close. This is all we can expect (remember, we never expect floats to be exactly equal to something else).

In [19]:
df[numcols].std(axis=0)

sepal_length    1.00335
sepal_width     1.00335
petal_length    1.00335
petal_width     1.00335
dtype: float64

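Similarly, these are not exactly equal to 1, but they are very close, and this is good enough. The small gap is not a rounding error: the pandas `std` method divides by n-1 by default (the sample standard deviation), while `StandardScaler` divides by n (the population standard deviation).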
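If you're curious where the specific number 1.00335 comes from, here is a small sketch. It relies on pandas `std` dividing by n-1 by default and `StandardScaler` dividing by n, and on the iris dataset having 150 rows. The list `values` is made-up data used only to show the two conventions side by side.

```python
import math
import statistics

n = 150  # number of rows in the iris dataset
# Ratio between the "divide by n-1" and "divide by n" standard deviations
print(round(math.sqrt(n / (n - 1)), 5))  # 1.00335

values = [2.0, 4.0, 4.0, 6.0]  # made-up data
mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)          # divide by n
z = [(v - mean) / pop_std for v in values]   # standardized values

print(statistics.pstdev(z))  # population std of z: 1 (up to rounding)
print(statistics.stdev(z))   # sample std of z: sqrt(4/3), slightly above 1
```

So a standard deviation of 1.00335 is what we should expect from pandas after scaling 150 rows.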

***

First, let me show you how to implement K-Means clustering in scikit-learn. Then we'll go through the details of how it works.

* Import `KMeans` from `sklearn.cluster` and instantiate a `KMeans` object.

In [20]:
from sklearn.cluster import KMeans

In [21]:
kmeans = KMeans()


* Fit our `KMeans` object to the numeric columns of `df`.

Notice that when we fit here there's no target (unlike in Linear Regression, for instance). This is because KMeans is unsupervised. Recall that the target is some correct value we're trying to predict. Here, we don't have that.

In [22]:
kmeans.fit(df[numcols])

* Put the predicted clusters into a new column called "cluster".

In [24]:
df["cluster"] = kmeans.predict(df[numcols])


In [25]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,cluster
0,-0.900681,1.019004,-1.340227,-1.315444,setosa,1
1,-1.143017,-0.131979,-1.340227,-1.315444,setosa,7
2,-1.385353,0.328414,-1.397064,-1.315444,setosa,7
3,-1.506521,0.098217,-1.283389,-1.315444,setosa,7
4,-1.021849,1.249201,-1.340227,-1.315444,setosa,1
...,...,...,...,...,...,...
145,1.038005,-0.131979,0.819596,1.448832,virginica,6
146,0.553333,-1.282963,0.705921,0.922303,virginica,2
147,0.795669,-0.131979,0.819596,1.053935,virginica,2
148,0.432165,0.788808,0.933271,1.448832,virginica,6


* Make an altair chart with sepal_length along the x-axis, petal_length along the y-axis, and with colors/shapes determined by the cluster predicted by `kmeans`.

In [26]:
numcols

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [27]:
alt.Chart(df).mark_point(filled=True,size=100).encode(
    x="sepal_length",
    y="petal_length",
    color="cluster:N",
    shape="cluster:N"
)

Usually, we will specify the number of clusters that we want. If not, the default number of clusters is 8. In the code above, it's not that 8 was determined to be a good number of clusters, this is just the default.

In [28]:
help(kmeans)

Help on KMeans in module sklearn.cluster._kmeans object:

class KMeans(_BaseKMeans)
 |  KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
 |  
 |  K-Means clustering.
 |  
 |  Read more in the :ref:`User Guide <k_means>`.
 |  
 |  Parameters
 |  ----------
 |  
 |  n_clusters : int, default=8
 |      The number of clusters to form as well as the number of
 |      centroids to generate.
 |  
 |  init : {'k-means++', 'random'}, callable or array-like of shape             (n_clusters, n_features), default='k-means++'
 |      Method for initialization:
 |  
 |      'k-means++' : selects initial cluster centroids using sampling based on
 |      an empirical probability distribution of the points' contribution to the
 |      overall inertia. This technique speeds up convergence, and is
 |      theoretically proven to be :math:`\mathcal{O}(\log k)`-optimal.
 |      See the description of `n_init` for mo

* Create a new `KMeans` object named `kmeans2` and specify that it should include 2 clusters.

In [30]:
kmeans2 = KMeans(n_clusters=2)


* Fit `kmeans2` and then store the predictions in a new column called cluster2.

In [31]:
kmeans2.fit(df[numcols])

In [32]:
df["cluster2"] = kmeans2.predict(df[numcols])

* Make the same altair chart as above, but now with the colors and shapes determined by cluster2.

In [33]:
alt.Chart(df).mark_point(size=100,filled=True).encode(
    x="sepal_length",
    y="petal_length",
    color="cluster2:N",
    shape="cluster2:N"
)

Notice how much clearer the clusters are here! This is looking a little more promising to me than the previous example.

* Using predictions from cluster2, make a list of altair charts that have the same encodings as before, but with the y-axis going through all columns of `numcols`.

In [34]:
numcols

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [35]:
chart_list = []
for c in numcols:
    chart = alt.Chart(df).mark_point(size=100,filled=True).encode(
        x="sepal_length",
        y=c,
        color="cluster2:N",
        shape="cluster2:N"
    )
    chart_list.append(chart)

In [36]:
#list of altair charts
chart_list

[alt.Chart(...), alt.Chart(...), alt.Chart(...), alt.Chart(...)]

In [37]:
#unpack the charts and stack them vertically
alt.vconcat(*chart_list)
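A quick aside on the `*` in `alt.vconcat(*chart_list)`: it unpacks a list into separate positional arguments. Here is a minimal sketch using plain strings instead of charts. The function `stack` is a made-up stand-in and has nothing to do with Altair itself.

```python
def stack(*items):
    """Accept any number of positional arguments and join them."""
    return " / ".join(items)

labels = ["chart A", "chart B", "chart C"]

# These two calls are equivalent: * unpacks the list into three arguments
print(stack(*labels))
print(stack(labels[0], labels[1], labels[2]))
```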

## Importance of Scaling

### Exercise 

* Make a DataFrame `df` with two columns, “miles” and “cars”, containing the following five data points in (miles, cars): (0,1), (0,5), (1,0), (1,1), (1,5).

* For later use, make a copy of df called `df2`. Be sure to use .copy().

* Using K-Means clustering, divide this `df` data into two clusters. Store the data in a new column called “cluster”.

* Does the result match what you expect?

In [38]:
df = pd.DataFrame([[0,1],[0,5],[1,0],[1,1],[1,5]], columns=["miles","cars"])
df

Unnamed: 0,miles,cars
0,0,1
1,0,5
2,1,0
3,1,1
4,1,5


In [39]:
# here's another way
df = pd.DataFrame({"miles":[0,0,1,1,1],"cars":[1,5,0,1,5]})
df

Unnamed: 0,miles,cars
0,0,1
1,0,5
2,1,0
3,1,1
4,1,5


In [40]:
df2 = df.copy()

Think about how you expect the points to be clustered together. Then compare with what we get below.

In [41]:
kmeans = KMeans(n_clusters=2)


In [42]:
kmeans.fit(df)

In [43]:
df["clusters"] = kmeans.predict(df)

In [44]:
df

Unnamed: 0,miles,cars,clusters
0,0,1,0
1,0,5,1
2,1,0,0
3,1,1,0
4,1,5,1


Some observations: Two data points with 5 cars are in one cluster, while the remaining data points with 0 or 1 cars are in another cluster. 
__Note:__ The names of the clusters are not important.

* Rename the miles column of `df2` to "feet", and convert the values from miles to feet.

In [45]:
df2 = df2.rename({"miles":"feet"},axis=1).copy()
df2

Unnamed: 0,feet,cars
0,0,1
1,0,5
2,1,0
3,1,1
4,1,5


In [46]:
#There are 5280 feet in a mile
#Computing from df["miles"] keeps this cell safe to run more than once
df2["feet"] = 5280*df["miles"]

In [47]:
df2

Unnamed: 0,feet,cars
0,0,1
1,0,5
2,5280,0
3,5280,1
4,5280,5


Notice, `df2` contains the exact same information as `df`, but now the "feet" column is overpowering the "cars" column. This is bad because we only changed the unit of measurement used.

* Using K-Means clustering, divide the `df2` data into two clusters. Store the data in a new column called "cluster".

In [48]:
kmeans2 = KMeans(n_clusters=2)

In [49]:
kmeans2.fit(df2)

In [50]:
df2["clusters"] = kmeans2.predict(df2)


In [51]:
df2

Unnamed: 0,feet,cars,clusters
0,0,1,1
1,0,5,1
2,5280,0,0
3,5280,1,0
4,5280,5,0


Notice now that the clusters are different compared to when we first tried with just `df`. Here, because feet is so much larger than cars, it is influencing the model.
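Here is a small sketch of why this happens. K-Means groups points by Euclidean distance, so we can compare a few distances by hand, first in (miles, cars) and then in (feet, cars). The variable names are just for illustration.

```python
import math

# Three of the data points, measured in (miles, cars)
a_miles, b_miles, c_miles = (0, 1), (0, 5), (1, 1)
# The same three points, measured in (feet, cars)
a_feet, b_feet, c_feet = (0, 1), (0, 5), (5280, 1)

# In miles, the gap in cars (4) is bigger than the gap in miles (1)
print(math.dist(a_miles, b_miles))  # 4.0
print(math.dist(a_miles, c_miles))  # 1.0

# In feet, the same gap in distance (5280) swamps the gap in cars (4)
print(math.dist(a_feet, b_feet))    # 4.0
print(math.dist(a_feet, c_feet))    # 5280.0
```

Nothing about the data changed, only the unit, yet which points count as "close" is completely different.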

Exercise: Apply StandardScaler to `df2` and then try clustering again. See what happens!

__Upshot:__ 
Unless the same unit of measurement is used over the columns, we should normalize (e.g. with StandardScaler).

## The K-Means Algorithm

Reference: [Wikipedia page](https://en.wikipedia.org/wiki/K-means_clustering) on K-Means clustering. (See the standard algorithm section.)

This portion of the lecture I plan to sketch out on the iPad.

![](Teaching-55.jpg)

![](Teaching-56.jpg)

![](Teaching-57.jpg)

![](Teaching-58.jpg)

![](Teaching-59.jpg)

### Iterating by hand

__Question 1:__ Using the following dataset, run the KMeans algorithm by hand, starting with the initial points (3,0) and (1,0). (How many clusters do you expect with this setup?)

Two initial points means we expect 2 clusters.

Warning: Notice the scaling on the x- and y-axes.

In [52]:
import pandas as pd
import altair as alt
from sklearn.cluster import KMeans

df = pd.DataFrame({"x":[0,0,1,1,3,4],"y":[0,10,0,10,0,0]})
datapoints = alt.Chart(df).mark_point(size=100, color="black", filled=True).encode(
    x = "x",
    y = "y"
)
datapoints

__First Iteration:__
* Cluster 1, points closest to (3,0): (3,0), (4,0)
* Cluster 2, points closest to (1,0): (1,0), (0,0), (1,10), (0,10)

It's not too hard to see that (4,0) is closer to (3,0). Similarly, (0,0) is closer to (1,0). For (1,10) we compare the two distances:
* Distance between (1,0) and (1,10): 10
* Distance between (3,0) and (1,10): $\sqrt{2^2 + (-10)^2} = \sqrt{4 + 100} = \sqrt{104} > 10$

Let's now compute the centroids of each cluster.
* Cluster 1: $\frac{1}{2}(3+4,0+0) = (3.5,0)$
* Cluster 2: $\frac{1}{4}(1+0+1+0,0+0+10+10) = (0.5,5)$

In [53]:
round1 = pd.DataFrame({"x":[3.5,0.5],"y":[0,5]})
average1 = alt.Chart(round1).mark_point(size=100,filled=True,color="green").encode(
    x="x",
    y="y"
)

In [54]:
datapoints + average1

__Second Iteration:__
* Cluster 1, points closest to (3.5,0): (3,0), (4,0), (1,0), (0,0)
* Cluster 2, points closest to (0.5,5): (0,10), (1,10)

Compute centroids again: (2,0) and (0.5,10)

In [55]:
round2 = pd.DataFrame({"x":[2,0.5],"y":[0,10]})
average2 = alt.Chart(round2).mark_point(color="orange",size=100,filled=True).encode(
    x="x",
    y="y"
)
datapoints + average1 + average2

Notice that this is where we terminate the algorithm because if we try to reassign points to the nearest centroid, they all stay in the same cluster.
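To double check the hand computation, here is a minimal sketch of the assign-then-average loop in plain Python. The function name `kmeans_by_hand` is made up, and this is a simplified illustration of the standard algorithm rather than what scikit-learn does internally (for example, it assumes no cluster ever ends up empty).

```python
import math

def kmeans_by_hand(points, centroids, max_iter=100):
    """Repeat: assign each point to its nearest centroid, then recompute centroids."""
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so we stop
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its assigned points
        # (assumes every cluster has at least one point)
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        centroids = new_centroids
    return centroids, assignments

points = [(0, 0), (0, 10), (1, 0), (1, 10), (3, 0), (4, 0)]
centroids, labels = kmeans_by_hand(points, [(3, 0), (1, 0)])
print(centroids)  # [(2.0, 0.0), (0.5, 10.0)]
print(labels)     # [0, 1, 0, 1, 0, 0]
```

The final centroids match the ones we found by hand, and the loop stops at the third pass because no point changes cluster.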

Let's check how we did.

In [56]:
kmeans = KMeans(n_clusters=2)
kmeans.fit(df)
df["clusters"] = kmeans.predict(df)

In [57]:
datapoints.encode(
    color="clusters:N"
)
