# Unsupervised Learning - An Example of Movie Reviews

## Dataset

Let us import a dataset of movie reviews by different users.

Let's check the shape of the DataFrame. 

As we can see, there are
* 671 users (each row represents reviews submitted by a user)
* 1000 movies reviewed by the users
* not every user reviews every movie -- each NA value indicates that a user did not review the corresponding movie

Let's drop the `user_ID` column as it does not provide useful information.
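The steps above (checking the shape, spotting missing ratings, dropping the identifier column) can be sketched as follows. The tiny DataFrame and its movie names are hypothetical stand-ins for the real dataset, which is not reproduced here; only the `user_ID` column name comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 4 users, 3 movies, NaN = not rated.
df = pd.DataFrame(
    {
        "user_ID": [1, 2, 3, 4],
        "Movie A": [5.0, np.nan, 3.0, 4.0],
        "Movie B": [np.nan, 4.5, np.nan, 2.0],
        "Movie C": [1.0, 2.0, np.nan, np.nan],
    }
)

# (rows, columns): one row per user, one column per movie plus user_ID.
print(df.shape)

# Drop the identifier column so that only rating columns remain.
ratings = df.drop(columns="user_ID")

# Count how many movies each user has not rated.
missing_per_user = ratings.isna().sum(axis=1)
print(missing_per_user.tolist())
```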

Let's visualize the reviews by a heatmap. 

Given the large number of movies and users, let's focus on the first 20 users and first 30 movies.

We can use the argument `annot=True` to show the rating in each cell.
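A minimal sketch of such a heatmap, assuming `seaborn` and `matplotlib` are installed. The random `ratings` DataFrame is a hypothetical stand-in for the real data, and the figure size, number format, and color map are illustrative choices rather than the notebook's own settings.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: 40 users, 50 movies, about half the ratings missing.
rng = np.random.default_rng(0)
ratings = pd.DataFrame(
    rng.integers(1, 6, size=(40, 50)).astype(float),
    columns=[f"Movie {j}" for j in range(50)],
)
ratings = ratings.mask(rng.random(ratings.shape) < 0.5)

# Focus on the first 20 users (rows) and the first 30 movies (columns).
subset = ratings.iloc[:20, :30]

fig, ax = plt.subplots(figsize=(14, 7))
# annot=True writes the rating inside each cell; missing ratings are left blank.
sns.heatmap(subset, annot=True, fmt=".1f", cmap="viridis", ax=ax)
ax.set_xlabel("Movie")
ax.set_ylabel("User")
```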

## Question

If a user didn't rate a movie, how can we predict their rating of the movie?
* For example, the **second user** did not rate the movie **Shawshank Redemption (1994)**. What would be their rating if they decide to watch and rate that movie?

**The Supervised Learning Approach**
* Treat users' ratings of other movies as "features" and the rating of the "focal movie" (e.g., Shawshank Redemption (1994)) as the target. Then we can build a regression/classification model.
* Challenges? 

**The Unsupervised Learning Approach**
* We can divide the users into **clusters**. Then, we can look at the users in the same cluster as the **second user** who rated the movie **Shawshank Redemption (1994)**, and use their average rating as the prediction of the **second user**'s rating of the movie.

## Clustering

Clustering is the process of **dividing a dataset into groups** where points in the same group (or cluster) are more similar to each other than to those in other clusters.

### $k$-Means Clustering

In $k$-means clustering, 
* there are $k$ clusters in total, each of which is represented by a centroid
* the clusters are created so as to minimize the intra-cluster distance, which is the sum of squared distances between the data points and the centroid of the cluster they belong to.
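Written as a formula, the objective in the second bullet looks like this. The notation is introduced here for illustration and does not come from the original notebook: $C_j$ is the set of data points in cluster $j$, $\mu_j$ is its centroid, and $x_i$ is a data point.

$$
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$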

To use $k$-means clustering, we will first need to impute the missing values, and then scale the data.

Let's first replace the `NA` values by 0. 
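A minimal sketch of this imputation step on a hypothetical three-user example. The name `ratings_filled` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with missing ratings.
ratings = pd.DataFrame(
    {"Movie A": [5.0, np.nan, 3.0], "Movie B": [np.nan, 4.5, np.nan]}
)

# Replace every missing rating with 0 so that the clustering step sees no NaN values.
ratings_filled = ratings.fillna(0)

print(ratings_filled)
print(ratings_filled.isna().sum().sum())  # number of missing values left
```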

Next, we may need to scale the data. Since $k$-means clustering is a distance-based method, we generally need to scale the features first. Here, however, all columns are already on the same scale (ratings from 0 to 5), so no further scaling is needed.

To perform $k$-means clustering, we need to import the `KMeans` class from `sklearn.cluster`.

Suppose that we would like to divide the dataset into 10 clusters.
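A minimal sketch of building and fitting the model with 10 clusters. The random `ratings_filled` DataFrame is a hypothetical stand-in for the imputed ratings, and `n_init=10` and `random_state=42` are choices made for this sketch so that the result is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the imputed ratings: 100 users, 12 movies, ratings 0 to 5.
rng = np.random.default_rng(0)
ratings_filled = pd.DataFrame(
    rng.integers(0, 6, size=(100, 12)).astype(float),
    columns=[f"Movie {j}" for j in range(12)],
)

# Build the k-means model (n_init and random_state are assumptions of this sketch).
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)

# Perform the clustering by fitting the model to the data.
kmeans.fit(ratings_filled)

print(kmeans.n_clusters)
```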

In code, this takes two steps: first build the $k$-means model, and then perform the clustering by fitting the model to the data.

The fitted model has a few useful attributes.

The `.cluster_centers_` attribute gives the coordinates of the centroids. We may think of a centroid as the ratings of a representative user in a cluster.
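To read the centroids as "representative users", it can help to wrap them in a DataFrame with the movie names as columns. This is a sketch on hypothetical random data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the imputed ratings: 60 users, 6 movies.
rng = np.random.default_rng(0)
ratings_filled = pd.DataFrame(
    rng.integers(0, 6, size=(60, 6)).astype(float),
    columns=[f"Movie {j}" for j in range(6)],
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(ratings_filled)

# One row per cluster, one column per movie: the ratings of a "representative user".
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=ratings_filled.columns)
centroids.index.name = "cluster"

print(centroids.round(2))
```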

The `.labels_` attribute gives the cluster label of each data point in the dataset.

Each label is a number between 0 and $k-1$, indicating one of the $k$ clusters.

Let's append the labels to the original dataset as a column.
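A minimal sketch of attaching the labels as a new column, again on hypothetical random data. The column name `cluster` is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the imputed ratings: 60 users, 6 movies.
rng = np.random.default_rng(0)
ratings_filled = pd.DataFrame(
    rng.integers(0, 6, size=(60, 6)).astype(float),
    columns=[f"Movie {j}" for j in range(6)],
)
k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(ratings_filled)

# labels_ has one entry per row, in the same order as the rows of the data.
labels = kmeans.labels_

# Append the labels as a column on a copy, leaving the fitted data untouched.
ratings_labeled = ratings_filled.copy()
ratings_labeled["cluster"] = labels

# Number of users in each cluster.
print(ratings_labeled["cluster"].value_counts().sort_index())
```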

The `.inertia_` attribute returns the sum of squared distances of data points to their nearest cluster centroid.
* It measures how compact the clusters are, with lower values indicating that points in the same cluster are closer to each other.
* It does not measure clustering quality in an absolute sense but can be used to compare different values of $k$.
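To make the definition concrete, the sketch below recomputes the inertia by hand from the centroids and labels and compares it with the fitted attribute. The data are hypothetical random ratings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: 60 users, 6 movies, ratings 0 to 5.
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(60, 6)).astype(float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Centroid assigned to each data point, looked up through its cluster label.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# Sum of squared distances between each point and its centroid.
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(manual_inertia, kmeans.inertia_)
```

The two printed values should agree up to floating-point rounding.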


But how do we find the appropriate number of clusters?

### The Elbow Method

The elbow method is a heuristic to determine the value of $k$.
* The method evaluates the inertia for $k=1,2,3,...$.
* As $k$ increases, inertia tends to decrease (why?)
* The method identifies the **elbow point**, at which inertia stops decreasing rapidly.
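The procedure in the list above can be sketched as a loop over candidate values of $k$. The data here are hypothetical: `make_blobs` generates five well-separated groups so that an elbow is visible, and the range of 1 to 10 is an arbitrary choice for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in data with 5 groups (not the movie ratings).
X, _ = make_blobs(n_samples=300, centers=5, n_features=8, random_state=0)

# Fit one model per candidate k and record its inertia.
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Plot inertia against k; the elbow is where the curve flattens out.
fig, ax = plt.subplots()
ax.plot(k_values, inertias, marker="o")
ax.set_xlabel("Number of clusters k")
ax.set_ylabel("Inertia")
```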

From the chart:
* The curve shows a noticeable drop in inertia from $k=1$ to $k=5$.
* After $k=5$, the rate of decrease becomes less pronounced and more gradual.

Let's set `n_clusters` to 5 and redo the clustering.

Let's add the labels as the last column of the `ratings` DataFrame (the dataset after dropping the `user_ID` column).

As we can see, the second user has the cluster label 3. 

We can visualize the clusters using the heatmap.

We will sort the rows according to their cluster labels, so that in the DataFrame the rows of the same cluster are grouped together.
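A minimal sketch of the sorting step, assuming the labels are stored in a column named `cluster` (a name chosen for this sketch). The `ratings` DataFrame is a hypothetical random stand-in that keeps its missing values, while the model is fitted on the zero-filled copy.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in data: 60 users, 6 movies, about 30% of ratings missing.
rng = np.random.default_rng(0)
ratings = pd.DataFrame(
    rng.integers(1, 6, size=(60, 6)).astype(float),
    columns=[f"Movie {j}" for j in range(6)],
)
ratings = ratings.mask(rng.random(ratings.shape) < 0.3)

# Cluster on the zero-filled data, then store the labels next to the raw ratings.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(ratings.fillna(0))
ratings["cluster"] = kmeans.labels_

# Sort the rows so that users in the same cluster sit next to each other.
ratings_sorted = ratings.sort_values("cluster")

print(ratings_sorted["cluster"].tolist())
```

Passing `ratings_sorted.drop(columns="cluster")` to the heatmap then shows each cluster as a contiguous band of rows.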

Next, let's group the `ratings` DataFrame by cluster label and calculate the average rating of each movie within each group.

The resulting DataFrame shows the average rating of each movie in each cluster.

For example, cluster 3 users have an average rating of 4.72 for **Shawshank Redemption (1994)**. It is therefore reasonable to predict that the second user in the dataset (who is in cluster 3) would give the movie a high rating.
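The grouping and the resulting prediction can be sketched as follows. Everything here is a hypothetical stand-in: the random data, the movie name `Movie 0`, and the column name `cluster`. The sketch assumes that `ratings` still contains missing values, so the group means are taken only over users who actually rated each movie.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in data: 60 users, 6 movies, about 30% of ratings missing.
rng = np.random.default_rng(0)
ratings = pd.DataFrame(
    rng.integers(1, 6, size=(60, 6)).astype(float),
    columns=[f"Movie {j}" for j in range(6)],
)
ratings = ratings.mask(rng.random(ratings.shape) < 0.3)
ratings.loc[1, "Movie 0"] = np.nan  # the second user has not rated "Movie 0"

# Cluster on the zero-filled data and store the labels as the last column.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(ratings.fillna(0))
ratings["cluster"] = kmeans.labels_

# Average rating of each movie within each cluster (NaN entries are skipped).
cluster_means = ratings.groupby("cluster").mean()

# Predict the second user's rating as the average within their own cluster.
user_cluster = ratings.loc[1, "cluster"]
prediction = cluster_means.loc[user_cluster, "Movie 0"]

print(f"cluster {user_cluster}: predicted rating {prediction:.2f}")
```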