# Playlist Continuation

You may have used the automatic playlist continuation feature on Spotify. In fact, Spotify hosts a [challenge](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge) to develop a system for automatic playlist continuation, with a dataset of 1 million playlists consisting of over 2 million unique tracks. Let us think of a simplified way to solve this challenge:

1. Represent each playlist as a vector with one entry per unique track, about 2 million entries in total;
2. Find the playlist most similar to what the user is playing;
3. Suggest the songs from that playlist which the user has not listened to.
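
The three steps can be sketched on a toy example. The data below is made up (4 playlists over 6 tracks), and cosine similarity is only one possible choice of similarity measure, assumed here for illustration.

```python
import numpy as np

# Toy data (hypothetical): 4 playlists over 6 tracks, 1 = track is in the playlist
playlists = np.array([
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
])
# Tracks the user has listened to so far
user = np.array([1, 1, 0, 0, 0, 0])


def cosine_similarity(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return (matrix @ vector) / norms


# Step 2: find the most similar playlist
similarity = cosine_similarity(playlists, user)
nearest = int(np.argmax(similarity))

# Step 3: suggest tracks from that playlist which the user has not heard
suggestions = np.flatnonzero((playlists[nearest] == 1) & (user == 0))
print(f"nearest playlist: {nearest}, suggested tracks: {suggestions}")
```

With 2 million tracks, every similarity computation touches 2 million entries per playlist, which is what motivates reducing the dimensionality first.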

It may sound easy, but searching for similar vectors in 2 million dimensions is computationally very expensive. We need a way to reduce the dimensionality of these vectors, and random projection is one simple way. An $m \times n$ matrix $A$ defines a projection $\mathbf{R}^n \to \mathbf{R}^m$ when applied to an $n$-dimensional vector $x$:

$$
Ax = b
$$

The Johnson-Lindenstrauss lemma gives a theoretical guarantee for suitably scaled random projections $\mathbf{R}^n \to \mathbf{R}^m$: with high probability, the distances between the projected points differ from the original distances by at most a small relative error, and this error shrinks as $m$ grows. If you can tolerate that error, you can go ahead with the projection.
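
For reference, one common way to state the guarantee is sketched below. The symbols $\varepsilon$, $N$, $X$ and $f$ are introduced here for illustration only and are not used elsewhere in this tutorial. For any $0 < \varepsilon < 1$ and any set $X$ of $N$ points in $\mathbf{R}^n$, there is a linear map $f: \mathbf{R}^n \to \mathbf{R}^m$ with $m = O(\varepsilon^{-2} \log N)$ such that, for all $u, v \in X$:

$$
(1 - \varepsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon)\,\|u - v\|^2
$$

Notably, the required $m$ depends on the number of points $N$ and the tolerated error $\varepsilon$, not on the original dimension $n$.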

Let us say we are happy to project the playlist vector down to 2 thousand dimensions. Then we need a random matrix of shape $2\ \text{thousand} \times 2\ \text{million}$. Generating and applying a matrix of this size is still a fairly difficult problem, and many solutions have been proposed. For this tutorial, we do a simple version that projects $\mathbf{R}^{200} \to \mathbf{R}^{10}$ using a matrix drawn from the normal distribution. We first import `numpy`.

In [1]:
import numpy as np

We have a playlist vector $x$.

In [2]:
x = np.random.randint(0, 2, size=200)
print(f"x:\n{x}")

x:
[1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1
 0 0 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 0 0
 0 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0
 0 1 1 0 0 0 1 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1
 1 0 1 0 0 0 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 1 0 0 1 0 1
 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0]


And a random matrix whose entries are drawn from a normal distribution with a mean of 0 and a standard deviation of 1.

In [3]:
A = np.random.normal(0, 1, size=(10, 200))
print(f"A:\n{A}")

A:
[[-1.64315459e+00 -1.27025873e+00  5.57921089e-01 ...  9.83378181e-01
  -1.04376228e-01  3.70316225e-01]
 [ 1.02136316e+00 -3.43561062e-01 -4.90366920e-01 ... -2.62505626e+00
  -4.97722256e-01 -8.53965514e-01]
 [-2.25068726e-01  1.45179102e+00  2.08905347e+00 ...  2.74035494e-01
   4.91896507e-01  2.20207021e-01]
 ...
 [-3.09133497e-01  7.76303050e-04 -7.54150289e-01 ...  1.12149276e+00
  -9.11537816e-01 -1.19507722e+00]
 [ 1.36929865e+00  2.26409325e-01  3.22288789e-01 ... -2.25175924e-01
  -8.62266466e-01  4.43092509e-01]
 [ 3.64083242e-01  1.31477953e+00 -2.93859179e-01 ... -1.73815713e+00
   3.10997797e-01  4.48063125e-01]]


Multiply and done.

In [4]:
b = A@x
print(f"b:\n{b}")

b:
[-15.80969941   9.60504752   8.07553026   3.32029057  -3.9975054
 -14.06385587   2.07505982  -8.54465734  15.61955639  -1.04477799]
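
To see how well distances survive such a projection, here is a small empirical check. It is a sketch under assumptions that differ from the cells above: the sizes `n`, `m` and `num_points` are illustrative, and the matrix is divided by $\sqrt{m}$ (which the tutorial's `A` is not) so that projected lengths are comparable to the original ones.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n, m, num_points = 2000, 500, 10  # illustrative sizes, not from the tutorial

# Random binary "playlist" vectors, one per row
X = rng.integers(0, 2, size=(num_points, n)).astype(float)

# Gaussian random matrix, scaled by 1/sqrt(m) so lengths are preserved on average
A = rng.normal(0, 1, size=(m, n)) / np.sqrt(m)
B = X @ A.T  # each row of B is the projection of the matching row of X

# Ratio of projected distance to original distance for every pair of points
ratios = np.array([
    np.linalg.norm(B[i] - B[j]) / np.linalg.norm(X[i] - X[j])
    for i, j in combinations(range(num_points), 2)
])
print(f"distance ratios range from {ratios.min():.3f} to {ratios.max():.3f}")
```

Ratios close to 1 mean the pairwise distances are roughly preserved. Shrinking `m` widens the range of ratios.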


In the exercise, you are asked to create a random matrix whose entries are drawn with specific probabilities. You can refer to [`numpy.random.choice()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html).
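
As a minimal sketch of how `numpy.random.choice()` can build such a matrix, the values and probabilities below are assumptions chosen for illustration (a sparse matrix of -1, 0 and 1), not the ones required by the exercise.

```python
import numpy as np

np.random.seed(0)  # only to make the sketch reproducible

# Hypothetical entry values and probabilities; replace with the exercise's own
values = [-1, 0, 1]
probabilities = [1 / 6, 2 / 3, 1 / 6]

# Draw every entry of a 10 x 200 matrix independently from `values`
A = np.random.choice(values, size=(10, 200), p=probabilities)

x = np.random.randint(0, 2, size=200)
b = A @ x
print(f"A shape: {A.shape}, b shape: {b.shape}")
```

The `p` argument must have the same length as `values` and sum to 1.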

## Further Reading

[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)  
[Nearest Neighbors in Python with Annoy](https://calmcode.io/annoy/intro.html)  
[CS168 Lecture 4: Dimensionality Reduction](https://web.stanford.edu/class/cs168/l/l4.pdf)

# Mushroom Hunting

Now, let us be mushroom hunters who want to understand the underlying attributes of mushrooms. There is a great dataset prepared by researchers at the University of Marburg. The research is published in [Nature Scientific Reports](https://www.nature.com/articles/s41598-021-87602-3).

We first import the libraries and define a helper for reading the mushroom dataset.

In [5]:
import altair as alt
import pandas as pd
import numpy as np

In [6]:
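def calculate_mean(x):
    """
    Transform a string of "[lower, higher]" into its mean.
    A string holding a single integer is converted to that integer.
    """
    from ast import literal_eval

    try:
        # Single value, e.g. "5"
        return int(x)
    except ValueError:
        # Range, e.g. "[10, 20]": parse the list and average its two ends
        x = literal_eval(x)
        lower = int(x[0])
        higher = int(x[-1])
        mean = (lower + higher) / 2
        return mean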
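
To make the behaviour of the helper concrete, here is a small usage sketch. The function is repeated in compact form so the block runs on its own, and the input strings are hypothetical examples rather than values copied from the dataset.

```python
from ast import literal_eval


def calculate_mean(x):
    """Same logic as the helper above, repeated so this block is self-contained."""
    try:
        return int(x)
    except ValueError:
        x = literal_eval(x)
        return (int(x[0]) + int(x[-1])) / 2


# Hypothetical raw cell values
single = calculate_mean("8")            # plain integer string
interval = calculate_mean("[15, 20]")   # range given as "[lower, higher]"
one_item = calculate_mean("[10]")       # list with a single element
print(single, interval, one_item)
```

Note that a plain integer string comes back as an `int`, while a bracketed value comes back as a `float`.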

We first investigate the size of the mushrooms, in particular the stem height and stem width.

In [7]:
mushroom = pd.read_csv("https://mushroom.mathematik.uni-marburg.de/files/PrimaryData/primary_data_edited.csv", sep=";")
mushroom["stem-height"] = mushroom["stem-height"].apply(calculate_mean)
mushroom["stem-width"] = mushroom["stem-width"].apply(calculate_mean)
mushroom.loc[:5, ["stem-height", "stem-width"]]

| | stem-height | stem-width |
|---|---|---|
| 0 | 17.5 | 17.5 |
| 1 | 8.0 | 15.0 |
| 2 | 11.0 | 15.0 |
| 3 | 11.0 | 17.5 |
| 4 | 11.0 | 15.0 |
| 5 | 6.0 | 12.5 |


These two attributes are correlated. Tall and wide stems mean that the mushroom is chunky, and short and thin stems mean that the mushroom is probably tiny. Because the two dimensions carry largely the same information, they can be reduced to one, and Principal Component Analysis (PCA) is a way to do that.

In [8]:
alt.Chart(mushroom).mark_circle().encode(
    x="stem-height",
    y="stem-width",
    tooltip=["stem-height", "stem-width"],
)
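
The correlation can also be put into a number. The sketch below uses only the six rows printed earlier, so the value is illustrative and will differ from the one for the full dataset.

```python
import numpy as np

# The six (stem-height, stem-width) rows shown in the table above
stem_height = np.array([17.5, 8.0, 11.0, 11.0, 11.0, 6.0])
stem_width = np.array([17.5, 15.0, 15.0, 17.5, 15.0, 12.5])

# np.corrcoef returns the 2 x 2 correlation matrix;
# the off-diagonal entry is the Pearson correlation coefficient
r = np.corrcoef(stem_height, stem_width)[0, 1]
print(f"Pearson correlation on six rows: {r:.2f}")
```

A coefficient near 1 or -1 suggests that one direction captures most of the variation, which is the situation where reducing two dimensions to one loses little.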

Let us start applying PCA. We first centre the data.

In [9]:
# Select columns of "stem-height" and "stem-width"
mushroom = mushroom.loc[:, ["stem-height", "stem-width"]]
# Center the data so each feature has zero mean
mushroom = mushroom - mushroom.mean()
mushroom_chart = alt.Chart(mushroom).mark_circle().encode(
    x="stem-height",
    y="stem-width",
    tooltip=["stem-height", "stem-width"],
)
mushroom_chart

Then we arrange the data in row format, where each row is a feature and each column is a data point. This is a purely arbitrary choice that follows most of the mathematical literature. If you know linear algebra, you can apply the same procedure to data in column format.

In [10]:
# Construct data in row format
D = mushroom.to_numpy().transpose()
# Compute covariance matrix
S = np.cov(D)
print(f"S:\n{S}")

S:
[[10.65907716 13.97140073]
 [13.97140073 97.21387283]]


In [11]:
# For centred data, np.cov(D) is equivalent to D @ D.T / dof.
# We divide by D.shape[-1] - 1 because we have D.shape[-1] data points
# and lose one degree of freedom by estimating the mean.
S = D @ D.T / (D.shape[-1] - 1)
print(f"S:\n{S}")

S:
[[10.65907716 13.97140073]
 [13.97140073 97.21387283]]
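
For data kept in column format (one data point per row, as in a DataFrame), the same covariance matrix can be obtained without transposing. The sketch below uses synthetic data, and the array `X` and its sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic column-format data: 100 data points (rows) x 2 features (columns)
X = rng.normal(0, 1, size=(100, 2))
X = X - X.mean(axis=0)  # centre each feature

# Row format, as in the cells above: features are rows
S_row = np.cov(X.T)

# Column format: tell np.cov that the variables are in the columns
S_col = np.cov(X, rowvar=False)

# Manual formula in column format: X.T @ X instead of D @ D.T
S_manual = X.T @ X / (X.shape[0] - 1)

print(np.allclose(S_row, S_col), np.allclose(S_row, S_manual))
```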


In [12]:
# Compute eigenvalues and eigenvectors.
# We use np.linalg.eigh() because S is a symmetric matrix:
# it exploits the symmetry, so it is faster than np.linalg.eig()
# and always returns real eigenvalues, in ascending order.
l, v = np.linalg.eigh(S)
print(f"l:\n{l},\nv:\n{v}")

l:
[ 8.45974247 99.41320753],
v:
[[-0.98783557  0.15550202]
 [ 0.15550202  0.98783557]]


In [13]:
# Sort the eigenvectors by their eigenvalues in descending order:
# the larger the eigenvalue, the more variance its eigenvector explains
idx = np.argsort(l)[::-1]
l = l[idx]
v = v[:,idx]
print(f"v:\n{v}")

v:
[[ 0.15550202 -0.98783557]
 [ 0.98783557  0.15550202]]
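
Since each eigenvalue is the variance along its eigenvector, dividing by their sum shows how much of the total variance each direction explains. This sketch plugs in the two eigenvalues printed above. The name `explained` is introduced here for illustration.

```python
import numpy as np

# Eigenvalues from the output above, sorted in descending order
l = np.array([99.41320753, 8.45974247])

# Fraction of the total variance captured by each principal component
explained = l / l.sum()
print(f"explained variance ratio: {explained}")
```

The first component accounts for roughly 92% of the variance, which is why keeping only one dimension is reasonable here.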


With the eigenvectors, we can plot the projection line on our graph.

In [14]:
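# Two end points along the first eigenvector; 70 and -30 are arbitrary
# lengths chosen so that the line spans the plotted data
eigenvector = pd.DataFrame(np.vstack([v[:, 0]*70, v[:, 0]*-30]), columns=["stem-height", "stem-width"])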

eigenvector_chart = alt.Chart(eigenvector).mark_line(color="grey", opacity=0.8).encode(
    x="stem-height",
    y="stem-width",
    tooltip=["stem-height", "stem-width"],
)

eigenvector_chart + mushroom_chart

Finally, we project the data onto the first eigenvector.

In [15]:
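# We can then use the sorted eigenvectors to transform the original data
# For example, 2D -> 1D: project onto the first eigenvector
mushroom["projection"] = (v.T[:1] @ D).reshape(-1)
mushroom.head(5)

| | stem-height | stem-width | projection |
|---|---|---|---|
| 0 | 10.910405 | 5.343931 | 6.975515 |
| 1 | 1.410405 | 2.843931 | 3.028657 |
| 2 | 4.410405 | 2.843931 | 3.495163 |
| 3 | 4.410405 | 5.343931 | 5.964752 |
| 4 | 4.410405 | 2.843931 | 3.495163 |


The covariance matrix and eigenvalue decomposition steps can be replaced by a single singular value decomposition (SVD) of the centred data. This gives better numerical stability because we never form D @ D.T explicitly, a product that squares the condition number of D and amplifies floating-point rounding errors.

In [16]:
u, s, vh = np.linalg.svd(D, full_matrices=False)
# The singular values come sorted in descending order, so there is no need to sort.
# The columns of u are the principal directions. Their sign is arbitrary,
# so we align the first one with the eigenvector found above before comparing.
pc1 = u[:, 0] * np.sign(u[:, 0] @ v[:, 0])
# 2D -> 1D
mushroom["projection_svd"] = pc1 @ D
mushroom.head(5)

| | stem-height | stem-width | projection | projection_svd |
|---|---|---|---|---|
| 0 | 10.910405 | 5.343931 | 6.975515 | 6.975515 |
| 1 | 1.410405 | 2.843931 | 3.028657 | 3.028657 |
| 2 | 4.410405 | 2.843931 | 3.495163 | 3.495163 |
| 3 | 4.410405 | 5.343931 | 5.964752 | 5.964752 |
| 4 | 4.410405 | 2.843931 | 3.495163 | 3.495163 |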
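
The link between the two routes can be checked numerically. The sketch below uses synthetic data (the generated `height` and `width` values are assumptions, not the mushroom data) and verifies that the squared singular values divided by the degrees of freedom equal the eigenvalues, and that the left singular vectors match the eigenvectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated 2D data in row format: 2 features x 500 data points
height = rng.normal(0, 3, size=500)
width = 1.5 * height + rng.normal(0, 5, size=500)
D = np.vstack([height, width])
D = D - D.mean(axis=1, keepdims=True)  # centre each feature

# Route 1: covariance matrix + eigendecomposition, sorted in descending order
S = np.cov(D)
l, v = np.linalg.eigh(S)
l, v = l[::-1], v[:, ::-1]

# Route 2: SVD of the centred data
u, s, vh = np.linalg.svd(D, full_matrices=False)

# Squared singular values, divided by the degrees of freedom, are the eigenvalues
same_values = np.allclose(s**2 / (D.shape[1] - 1), l)
# Left singular vectors equal the eigenvectors up to sign
same_vectors = np.allclose(np.abs(u), np.abs(v))
print(same_values, same_vectors)
```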


I demonstrated a simple way to project 2D -> 1D. In practice, you normally do PCA on much higher dimensions. For example, here we can see that the cap diameter is correlated with the stem sizes. You can try to find a projection that transforms 3D -> 1D.

In [17]:
mushroom = pd.read_csv("https://mushroom.mathematik.uni-marburg.de/files/PrimaryData/primary_data_edited.csv", sep=";")
mushroom["cap-diameter"] = mushroom["cap-diameter"].apply(calculate_mean)
mushroom["stem-height"] = mushroom["stem-height"].apply(calculate_mean)
mushroom["stem-width"] = mushroom["stem-width"].apply(calculate_mean)
alt.Chart(mushroom).mark_circle().encode(
    x="stem-height",
    y="stem-width",
    size="cap-diameter",
    tooltip=["stem-height", "stem-width", "cap-diameter"],
)
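
As a starting point for the 3D -> 1D case, here is a minimal sketch on synthetic data. The helper name `pca_project` and the generated features are hypothetical stand-ins for cap diameter, stem height and stem width. The same function should work on the real columns once they are arranged in row format.

```python
import numpy as np


def pca_project(D, k=1):
    """Project row-format data (features x data points) onto its top-k principal components."""
    D = D - D.mean(axis=1, keepdims=True)  # centre each feature
    u, s, vh = np.linalg.svd(D, full_matrices=False)
    return u[:, :k].T @ D


rng = np.random.default_rng(0)
# Hidden "overall size" factor that drives all three synthetic features
size = rng.normal(0, 1, size=300)
D = np.vstack([
    10 + 3 * size + rng.normal(0, 0.5, size=300),  # stand-in for cap-diameter
    8 + 2 * size + rng.normal(0, 0.5, size=300),   # stand-in for stem-height
    15 + 5 * size + rng.normal(0, 0.5, size=300),  # stand-in for stem-width
])

projection = pca_project(D, k=1)
print(projection.shape)  # (1, 300)
```

Because the three features share one underlying factor, the single projected dimension tracks that factor closely, up to sign.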

## Further Reading

[Making sense of principal component analysis, eigenvectors & eigenvalues](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues)  
[Eigenfaces, for Facial Recognition](https://jeremykun.com/2011/07/27/eigenfaces/)  
[CS168 Lecture 7: Understanding and Using Principal Component Analysis (PCA)](https://web.stanford.edu/class/cs168/l/l7.pdf)  
[CS168 Lecture 8: How PCA Works](https://web.stanford.edu/class/cs168/l/l8.pdf)