# K-means clustering

In this section you will cluster books based on their ratings using K-means.

Are there similarities between books when we plot them in a graph? Are there possible clusters we can derive from this spatial layout? 

To cluster our book data, we will use scikit-learn's `KMeans` and a plotting function from Ch. 7 of our book.

## 1. Load the dataset

Load the subset book ratings you created in the previous notebook.

In [None]:
# your code here

## 2. Construct a ratings matrix 

To cluster our data, we need to construct a matrix with all the ratings. Each row of the matrix represents the ratings for a book, and each column represents a user. For each cell in the matrix, the value is the rating given by that user for that book.

Can you construct such a (huge!) matrix with the dataset? (Hint: use the pandas dataframe `pivot` function)
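
As a minimal sketch of how `pivot` reshapes long-format ratings into a matrix, here is a toy example. The column names `book_id`, `user_id` and `rating` are assumptions for illustration; your dataset may use different names.

```python
import pandas as pd

# Toy ratings in long format; column names and values are made up for illustration.
toy_ratings = pd.DataFrame({
    "book_id": ["b1", "b1", "b2", "b3"],
    "user_id": ["u1", "u2", "u1", "u3"],
    "rating": [5, 3, 4, 2],
})

# One row per book, one column per user; cells without a rating become NaN.
toy_matrix = toy_ratings.pivot(index="book_id", columns="user_id", values="rating")
print(toy_matrix)
```

Note that `pivot` raises an error if the same book/user pair occurs more than once; in that case `pivot_table`, which aggregates duplicates, may be a better fit.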

In [None]:
# your code here

## 3. Construct a sparse matrix using scipy.sparse

Most of the cells in the matrix are empty; these are the cases where a user did not rate a particular book.

To work with the data more efficiently, we need to transform it into a *sparse matrix* format. This is a data structure designed for sparse data (like our matrix). It only stores the cells that actually have a value.

Convert your matrix to a `scipy.sparse` `csr_matrix`.

Tip: `csr_matrix` will count "0" values as empty. If your matrix uses "NaN" values for empty cells, you will need to replace those with 0 first.
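
The conversion described in the tip could look roughly like this on a small matrix. The names `toy_matrix` and `toy_sparse` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy book-by-user matrix with NaN for missing ratings (made-up values).
toy_matrix = pd.DataFrame(
    [[5.0, 3.0, np.nan], [4.0, np.nan, np.nan], [np.nan, np.nan, 2.0]],
    index=["b1", "b2", "b3"],
    columns=["u1", "u2", "u3"],
)

# Replace NaN with 0 so those cells are treated as empty, then convert.
toy_sparse = csr_matrix(toy_matrix.fillna(0).to_numpy())

print(toy_sparse.shape)  # dimensions are unchanged
print(toy_sparse.nnz)    # number of values actually stored
```

The sparse matrix no longer carries the row and column labels, so keep the dataframe's index and columns around if you need to map rows back to books later.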

In [None]:
from scipy.sparse import csr_matrix

# your code here

## 4. Cluster!

Now that you have your matrix ready, it is time to cluster! 

Here we create a model for k-means clustering with 3 clusters. Can you fit the model with your data?

In [None]:
from sklearn.cluster import KMeans

kmeans_3_clusters = KMeans(n_clusters=3)

clusters = kmeans_3_clusters.fit(...)

Inspect the `clusters` variable. What does it hold in terms of data?
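
For comparison, here is a sketch of fitting K-means on a tiny made-up matrix and looking at what the fitted object exposes. The names `toy_data` and `toy_model` are hypothetical, and `random_state=0` is only there to make the run repeatable.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Toy matrix: 6 rows by 4 columns, with two obvious groups of rows.
toy_data = csr_matrix(np.array([
    [5, 5, 0, 0],
    [4, 5, 0, 0],
    [5, 4, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
    [0, 0, 5, 5],
], dtype=float))

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
fitted = toy_model.fit(toy_data)

print(fitted is toy_model)            # fit() returns the estimator itself
print(fitted.labels_)                 # one cluster label per row
print(fitted.cluster_centers_.shape)  # (n_clusters, n_columns)
```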

In [None]:
clusters

## 5. Visualise clusters

Below is a (modified) version of the `plot` function discussed in Ch. 7, section 7.4.2 of our book, which displays clusters in a graph.

It does the following:
- It simplifies the data into 2 dimensions
- It creates clusters based on the simplified data
- It shows a plot with the simplified data and the clusters.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

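def plot(user_ratings: csr_matrix, k: int):
    """Reduce the ratings to 2 dimensions with PCA, cluster them, and plot the result."""
    h = 0.2  # step size of the background mesh

    # Simplify the data into 2 dimensions.
    # Depending on your scikit-learn version, PCA may reject sparse input;
    # if it raises an error, pass a dense array instead (e.g. `matrix.toarray()`).
    reduced_data = PCA(n_components=2).fit_transform(user_ratings)

    # Create k clusters based on the simplified data
    kmeans = KMeans(init='k-means++', n_clusters=k, n_init=10)
    kmeans.fit(reduced_data)

    # Build a mesh that covers the data, with a margin of 1 on each side
    x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
    y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Assign every mesh point to its nearest cluster
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)

    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')

    # Draw the data points (black dots) and the cluster centroids (red crosses)
    centroids = kmeans.cluster_centers_
    plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='r', zorder=10)
    plt.title('K-means clustering on PCA-reduced ratings')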

Let's use this function to see how our book data can be clustered. Edit the code below so it uses your ratings matrix.

In [None]:
plot(user_ratings=..., k=9)


## 6. How about users?

Can you use the code above to cluster *users* depending on their ratings?
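
One possible starting point, assuming your matrix has one row per book and one column per user, is to transpose it so that each row describes a user. A sketch on made-up data (`book_by_user` and `user_by_book` are hypothetical names):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy book-by-user matrix: 3 books (rows), 4 users (columns).
book_by_user = csr_matrix(np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 0],
    [1, 0, 0, 2],
], dtype=float))

# Transposing gives one row per user; convert back to CSR format afterwards.
user_by_book = book_by_user.T.tocsr()

print(book_by_user.shape)
print(user_by_book.shape)
```

K-means clusters the rows of whatever matrix it is given, so fitting on the transposed matrix groups users instead of books.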

In [None]:
# code goes here

## 7. Saving clusters

Use the model you created in step 6 to create a dataframe that assigns each user to a cluster. It should have one column for user IDs, and one column for cluster labels. Export the dataframe to the data directory.

Tip: on a trained clustering model, you can use `.predict(data)` to get the cluster label for each row in `data`. (For `KMeans`, `.transform(data)` returns the distances to the cluster centers instead.)
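
A rough sketch of building such a dataframe on toy data follows. It uses `predict`, which returns one cluster label per row. The user IDs, matrix values, column names and output path are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy user-by-book matrix and matching user IDs (all values are made up).
user_ids = ["u1", "u2", "u3", "u4"]
user_by_book = np.array([
    [5, 5, 0],
    [4, 5, 0],
    [0, 0, 5],
    [0, 1, 4],
], dtype=float)

user_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(user_by_book)

# predict() returns one cluster label per row; rows line up with user_ids.
user_clusters = pd.DataFrame({
    "user_id": user_ids,
    "cluster": user_model.predict(user_by_book),
})
print(user_clusters)

# Hypothetical output path; adjust it to your own data directory.
# user_clusters.to_csv("data/user_clusters.csv", index=False)
```

The row order of the matrix must match the order of the user IDs, so take the IDs from the same dataframe the matrix was built from.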

In [None]:
# code goes here