In [None]:
%reload_ext nb_black

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Generate 2 different sets of points using `np.random.normal`.

* Name the 1st `a` and use a mean of `0`, a standard deviation of `1`, and generate `5` points
* Name the 2nd `b` and use a mean of `4`, a standard deviation of `1`, and generate `5` points

In [None]:
np.random.seed(42)
a = np.random.normal(0, 1, 5)
b = np.random.normal(4, 1, 5)

# Put data into a dataframe's column `x`
# Create a `y` that's all zeros
df = pd.DataFrame({"x": np.hstack((a, b))})
df["y"] = 0

Create a scatter plot of the data.

In [None]:

plt.show()

* Choose `k` rows from the dataframe at random to be the initial centroids.
    * Note that [other implementations](https://en.wikipedia.org/wiki/K-means%2B%2B) choose the initial centroids more rigorously than picking `k` random points.
* Convert the centroids to a numpy array

In [None]:
k = 2

In [None]:
centroids = 
centroids
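One possible way to pick the initial centroids is sketched below. The toy `df` and the `random_state=42` argument are assumptions made so the example is self-contained and repeatable; they are not part of the exercise.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `df` (assumed values, for illustration only)
df = pd.DataFrame({"x": [0.1, -0.3, 0.5, 3.8, 4.2, 4.1], "y": 0})
k = 2

# Sample k distinct rows and convert them to a (k, n_features) numpy array
centroids = df.sample(k, random_state=42).to_numpy()
```

Each row of `centroids` is one of the original data points, so the starting centroids always lie inside the data.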

Add the centroids to the plot

In [None]:
sns.scatterplot(x="x", y="y", data=df)

plt.show()

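We want to build towards a for loop that assigns each point to its nearest centroid.  To measure nearness we'll use squared Euclidean distance (formula below).  Taking the square root would give the Euclidean distance itself, but it doesn't change which centroid is closest, so we skip it.

$$\sum_{i=1}^{n}{(x_i - y_i)^2}$$

aka the sum of squared differences between $x$ and $y$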

* $n$ is the total number of features
* $i$ is the current feature index
* $x_i$ is the current feature value for observation $x$
* $y_i$ is the current feature value for observation $y$
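As a small worked example of the sum of squared differences, here is the calculation for two made-up observations with `n = 2` features. The values of `x` and `y` are chosen only for illustration.

```python
import numpy as np

# Two made-up observations with 2 features each
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# (1 - 4)^2 + (2 - 6)^2 = 9 + 16
sq_dist = np.sum((x - y) ** 2)

# The square root of that sum is the Euclidean distance
dist = np.sqrt(sq_dist)
```

Here `sq_dist` is `25.0` and `dist` is `5.0`.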

The code chunks below contain the beginnings of a for loop, with blanks to fill in, to compute the distance between each row and each centroid.

In [None]:
X = np.array(df)

# for x in X:
x = X[0]

In [None]:
# Take difference between x and centroids
diffs = 
diffs

In [None]:
# Square the differences
sq_diffs = 
sq_diffs

In [None]:
# Sum the squared differences by row
dists = 
dists

In [None]:
# Find the index of the centroid closest to x
label = 
label
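The four steps above can be sketched for a single point as follows. This is one possible way to fill in the blanks, using made-up values for `x` and `centroids`; it relies on numpy broadcasting to subtract `x` from every centroid row at once.

```python
import numpy as np

# Made-up values for illustration: 2 centroids, 1 point, 2 features each
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
x = np.array([3.0, 0.0])

# Difference between x and every centroid (broadcast across rows)
diffs = x - centroids

# Square the differences element-wise
sq_diffs = diffs ** 2

# Sum across features so there is one distance per centroid
dists = sq_diffs.sum(axis=1)

# Index of the closest centroid
label = dists.argmin()
```

With these values `dists` is `[9., 1.]`, so the point is assigned to the centroid at index `1`.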

In [None]:
# Use all of the components you just made to build a for loop
# that assigns a label to each row of X


Add the assigned labels as a column in the below dataframe named `assigned_df`.

In [None]:
assigned_df = df.copy()
assigned_df["label"] = labels

Replot the data with the points colored by cluster assignment

In [None]:
sns.scatterplot(x="x", y="y", hue="label", data=assigned_df)
plt.scatter(centroids[:, 0], centroids[:, 1], c="black", marker="x", s=100)
plt.show()

Aggregate `assigned_df` to update the centroids.
* Group by the `'label'` column and take the mean of every other column.
* Convert this output to a numpy array and assign it to `centroids`

In [None]:
agg_df = 
centroids = agg_df.values
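A minimal sketch of the update step, assuming a small hand-made `assigned_df` (the values are for illustration only). Grouping by `'label'` and averaging gives one row per cluster, which becomes that cluster's new centroid.

```python
import pandas as pd

# Hand-made assignments for illustration: two points per cluster
assigned_df = pd.DataFrame(
    {"x": [0.0, 1.0, 4.0, 6.0], "y": [0, 0, 0, 0], "label": [0, 0, 1, 1]}
)

# Mean of every feature within each cluster
agg_df = assigned_df.groupby("label").mean()

# One row per cluster, one column per feature
centroids = agg_df.to_numpy()
```

Here the new centroids sit at `x = 0.5` and `x = 5.0`, the means of each cluster's points.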

Replot the data colored by `'label'` with the new centroids.

In [None]:
sns.scatterplot(x="x", y="y", hue="label", data=assigned_df)
plt.scatter(centroids[:, 0], centroids[:, 1], c="black", marker="x", s=100)
plt.show()

The process we've been doing is rewritten as functions below.  Take a minute to read over the functions and confirm you understand the logic.

In [None]:
def init_centroids(df, k):
    centroids = df.sample(k).values
    return centroids

In [None]:
def assign_centroids(X, centroids):
    X = np.array(X)
    centroids = np.array(centroids)

    labels = []
    for x in X:
        dists = np.sum((x - centroids) ** 2, axis=1)
        label = dists.argmin()
        labels.append(label)

    return labels

In [None]:
def update_centroids(assigned_df):
    centroid_agg = assigned_df.groupby("label").mean()
    centroids = centroid_agg.values
    return centroids

In [None]:
def plot_kmeans(df, centroids):
    sns.scatterplot(x="x", y="y", hue="label", data=df)
    plt.scatter(centroids[:, 0], centroids[:, 1], c="black", marker="x", s=100)
    plt.show()

Use the functions to:
1. Initialize centroids
2. Assign points to centroids
3. Plot the current step
4. Update centroids

In [None]:
centroids = 
labels = 

assigned_df = df.copy()
assigned_df["label"] = labels


centroids = 

Write a for loop to perform the assigning, plotting, and updating `n` times.

In [None]:
n = 3
centroids = init_centroids(df, 2)

In [None]:
for              :
    

Boom! That's a bona fide k-means algorithm.  For extra practice you might:
* Re-do the process with random `y` values instead of all 0s
    * The same code should work
* Re-do the process with a 3rd feature, `z`
    * The same code should work (I think), but the plotting will only show `x` and `y`
* Wrap the whole process up in a single function or class.  Feature requests below:
    * Give the user the option to turn plots on/off
    * Give the user the option to pass in a `random_state` that is used during centroid initialization
    * Give the user the option to specify a maximum number of iterations before the algorithm stops
    * Cause the algorithm to stop early if the centroids didn't change (i.e. it's converged)
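Some of the feature requests above (a `random_state`, a maximum number of iterations, and early stopping) could be sketched roughly as follows. The function name `kmeans_sketch` and its parameters are hypothetical, plotting is left out, and the per-point loop is replaced with numpy broadcasting; treat it as one possible starting point rather than a reference solution.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=10, random_state=None):
    """Hypothetical helper: plain k-means with early stopping."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(random_state)

    # Pick k distinct rows as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Squared distance from every point to every centroid: (n_points, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Mean of each cluster; an empty cluster keeps its old centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop early once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels


# Made-up data: two well-separated groups along x, with y fixed at 0
X = np.array([[0.0, 0.0], [0.2, 0.0], [-0.1, 0.0],
              [4.0, 0.0], [4.3, 0.0], [3.9, 0.0]])
centroids, labels = kmeans_sketch(X, k=2, max_iter=10, random_state=0)
```

On this toy data the first three points end up in one cluster and the last three in the other, whichever rows are drawn as the initial centroids.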

Now let's do it the `sklearn` way.

In [None]:
data_url = "https://docs.google.com/spreadsheets/d/1RJrLftlRnj6gmrYewqxykVKSyl7aV-Ktd3sUNQILidM/export?format=csv"
startup = pd.read_csv(data_url)
startup = startup.drop(columns="State")
startup.head()

* Create a scaled version of the data with `StandardScaler()`
* Initialize a `KMeans` instance with `k` clusters.
* `.fit()` it to the `scaled` data

In [None]:
k = 4

In [None]:
scaler = StandardScaler()
scaled = 

In [None]:
clst = 
clst.fit(scaled)

* Unscale the `clst.cluster_centers_` using your `StandardScaler` instance.  We need to do this for interpretation.
* Save the unscaled centroids to a dataframe with the same names as the `startup` dataframe

In [None]:
centroids = 
centroids_df = pd.DataFrame(centroids, columns=startup.columns)
centroids_df
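The scale, fit, and unscale steps can be sketched end to end on a toy dataframe. Everything about `toy` (its column names and values) is made up for illustration and is not the `startup` data; `n_init=10` and `random_state=0` are likewise assumptions added for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `startup`: two obvious groups on different scales
toy = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0, 101.0, 102.0, 103.0],
        "feature_b": [10.0, 12.0, 11.0, 200.0, 202.0, 201.0],
    }
)

# Scale every column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Fit k-means on the scaled data
clst = KMeans(n_clusters=2, n_init=10, random_state=0)
clst.fit(scaled)

# Map the centroids from scaled units back to the original units
centroids = scaler.inverse_transform(clst.cluster_centers_)
centroids_df = pd.DataFrame(centroids, columns=toy.columns)
```

Because the scaling is a shift and a rescale of each column, the unscaled centroids are simply the per-cluster means in the original units, which is what makes them interpretable.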

* Interpret the output; try to give names to these clusters that represent their members
* Don't just look at the numbers; visualize the centroids somehow.  
    * A plot?
    * A formatted table?