# Movie Cloud Generator

This notebook uses the TMDB 5000 movie dataset to generate a 2-dimensional projection of movies.

The goal is to create a projection that puts "similar" movies closer to each other, in order to create a map of cinema that is easy and intuitive to navigate. The take-away for the user is a self-discovered movie recommendation.

It uses multiple t-SNE (t-distributed stochastic neighbor embedding) operations to reduce the dimensionality of this high-dimensional categorical data.

The resulting artifact is hosted [here](https://giorgos.fun/filmnet).

In [None]:
# Import necessary libraries
import os
import json
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder
from matplotlib import pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
# Read in files
credits = pd.read_csv("/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv")
movies = pd.read_csv("/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv")

In [None]:
# Consolidate everything into a single data frame
# (assumes credits and movies list the same films in the same row order)

dataframe = pd.DataFrame()
dataframe["id"] = movies["id"] # reference
dataframe["title"] = movies["title"] # text
# Missing release dates become "" so the year filter below catches them
dataframe["year"] = movies["release_date"].fillna("").str[:4] # z
dataframe["popularity"] = movies["popularity"] # size
dataframe["genres"] = movies["genres"].apply(json.loads) # color
dataframe["crew"] = credits["crew"].apply(json.loads) # x / y
dataframe["cast"] = credits["cast"].apply(json.loads) # x / y
dataframe["keywords"] = movies["keywords"].apply(json.loads)
dataframe["rating"] = movies["vote_average"]

# Drop movies that lack a year or have an empty list for any feature
dataframe = dataframe[dataframe["year"] != ""]
for column in ["genres", "crew", "cast", "keywords"]:
    dataframe = dataframe[dataframe[column].map(bool)]

dataframe

We are using only part of the original data. 

The movie coordinates will be generated by considering the following questions:
1. Who made it?
2. Who is in it?
3. What is it about?
4. How is it categorized?

The four features we need are therefore:
1. Crew
2. Cast
3. Keywords
4. Genres
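
Each of these columns is stored in the CSV files as a JSON string holding a list of objects, which is why the cell above runs `json.loads` on them. A minimal sketch of what one parsed value looks like, using a made-up `raw_genres` string in place of a real row:

```python
import json

# Hypothetical raw value, shaped like one entry of the "genres" column
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = json.loads(raw_genres)                      # list of dicts
genre_ids = [entry["id"] for entry in parsed]        # numeric ids
genre_names = [entry["name"] for entry in parsed]    # human-readable names
```

Only the `id` of each entry feeds into the encoding step; the names are not used for the projection.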

In [None]:
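# Maximum number of entries kept per movie for each feature
num_genres = 2
num_keywords = 20
num_cast = 50
num_crew = 50

# Encode numerical ids into fixed-width rows of strings.
# Each id gets a feature-specific suffix; rows shorter than `length`
# are padded with empty strings and longer rows are truncated.
def encode(array, length, suffix):
    encoded = np.zeros(shape=(array.shape[0], length), dtype='<U10')
    for i, row in enumerate(array):
        encoded[i,:len(row)] = [str(x['id']) + suffix for x in row][:length]
    return encoded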

cast_encoded = encode(dataframe['cast'], num_cast, 'ca')
crew_encoded = encode(dataframe['crew'], num_crew, 'cr')
genres_encoded = encode(dataframe['genres'], num_genres, 'ge')
keywords_encoded = encode(dataframe['keywords'], num_keywords, 'kw')

keywords_encoded
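
To make the padding and truncation in `encode` concrete, here is a small sketch that reruns the same function on a toy series. The ids in `toy` are made up for illustration and the function body is copied from the cell above.

```python
import numpy as np
import pandas as pd

def encode(array, length, suffix):
    # Same logic as the notebook's encode(): fixed-width rows of suffixed ids
    encoded = np.zeros(shape=(array.shape[0], length), dtype='<U10')
    for i, row in enumerate(array):
        encoded[i, :len(row)] = [str(x['id']) + suffix for x in row][:length]
    return encoded

# Hypothetical input: one movie with three genres, one movie with a single genre
toy = pd.Series([
    [{'id': 28}, {'id': 12}, {'id': 14}],
    [{'id': 18}],
])

# Keep at most two entries per movie
toy_encoded = encode(toy, 2, 'ge')
```

The first row is cut down to its first two ids, and the second row is padded with an empty string.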

## Dimensionality reduction

As we can see above, each feature has been encoded into rows of string identifiers. These identifiers cannot serve as coordinates directly: they first have to be one-hot encoded, reduced in dimension, and projected with t-SNE.

What follows is the creation of a 2-dimensional projection for each of these features, with a scatter plot of each as a preliminary visual check for a usable clustering split.
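
One detail of this step is easy to miss: `OneHotEncoder` treats every column of the encoded array as a separate categorical feature, so the same id appearing in two different positions produces two different one-hot columns. A small sketch with made-up ids illustrates this:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical encoded cast rows: the first two movies share both actors
# but list them in the opposite order.
toy_cast = np.array([
    ['1ca', '2ca'],
    ['2ca', '1ca'],
    ['1ca', ''],
])

toy_one_hot = OneHotEncoder().fit_transform(toy_cast).toarray()

# One output column per (position, value) pair:
# 2 distinct values in position 0 plus 3 distinct values in position 1
n_columns = toy_one_hot.shape[1]

# Number of one-hot features that two movies have in common
shared_0_1 = int(toy_one_hot[0] @ toy_one_hot[1])  # same actors, swapped order
shared_0_2 = int(toy_one_hot[0] @ toy_one_hot[2])  # same actor in position 0
```

In this toy case the first two rows share no one-hot features even though they contain the same ids, so the order in which entries are listed influences the projection.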

In [None]:
enc = OneHotEncoder()
cast_one_hot = enc.fit_transform(cast_encoded)

svd = TruncatedSVD(n_components=20)
svd_result = svd.fit_transform(cast_one_hot)

cast_tsne = TSNE(random_state=0, learning_rate=200, perplexity=5).fit_transform(svd_result)
plt.title("Cast 2D Projection")
plt.scatter(cast_tsne[:,0], cast_tsne[:,1])

In [None]:
enc = OneHotEncoder()
crew_one_hot = enc.fit_transform(crew_encoded)

svd = TruncatedSVD(n_components=50)
svd_result = svd.fit_transform(crew_one_hot)

crew_tsne = TSNE(random_state=0, learning_rate=50, perplexity=5).fit_transform(svd_result)
plt.title("Crew 2D Projection")
plt.scatter(crew_tsne[:,0], crew_tsne[:,1])

In [None]:
enc = OneHotEncoder()
genres_one_hot = enc.fit_transform(genres_encoded)

svd = TruncatedSVD(n_components=10)
svd_result = svd.fit_transform(genres_one_hot)

genres_tsne = TSNE(random_state=0, learning_rate=50, perplexity=30).fit_transform(svd_result)
plt.title("Genres 2D Projection")
plt.scatter(genres_tsne[:,0], genres_tsne[:,1])

In [None]:
enc = OneHotEncoder()
keywords_one_hot = enc.fit_transform(keywords_encoded)

svd = TruncatedSVD(n_components=100)
svd_result = svd.fit_transform(keywords_one_hot)

keywords_tsne = TSNE(random_state=0, learning_rate=50, perplexity=30).fit_transform(svd_result)
plt.title("Keywords 2D Projection")
plt.scatter(keywords_tsne[:,0], keywords_tsne[:,1])
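
The four cells above repeat the same three steps with different settings. As a sketch only, they could be folded into one helper. `project_feature` and the toy input are hypothetical names that do not appear in the notebook, and the `random_state` passed to `TruncatedSVD` is an addition for reproducibility.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.preprocessing import OneHotEncoder

def project_feature(encoded, n_components, learning_rate, perplexity):
    """One-hot encode, reduce with truncated SVD, then project to 2D with t-SNE."""
    one_hot = OneHotEncoder().fit_transform(encoded)
    # random_state here is an assumption; the original cells leave it unset
    reduced = TruncatedSVD(n_components=n_components,
                           random_state=0).fit_transform(one_hot)
    return TSNE(random_state=0, learning_rate=learning_rate,
                perplexity=perplexity).fit_transform(reduced)

# Toy stand-in for an encoded feature: 60 "movies" with 3 random ids each
rng = np.random.default_rng(0)
toy_encoded = rng.integers(0, 10, size=(60, 3)).astype(str)

toy_projection = project_feature(toy_encoded, n_components=5,
                                 learning_rate=50, perplexity=5)
```

With such a helper, the cast projection would read `project_feature(cast_encoded, 20, 200, 5)`, and the other three features would differ only in their arguments.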

## Combining the features

So far so good. Each feature on its own appears to produce a reasonable clustering.
However, combining these features further leads mostly to noise (or random-seeming orderings of movies).

What I am presenting here is the mapping used for the live website prototype, which uses the keywords and genres only. It is still not ideal, but it is in a shape that is easier to parse.

Previous (failed) configuration types have included:
* Using all four feature sets in a single t-SNE (genre, keyword, cast, crew)
* Performing numerical operations on the features, e.g. adding the cast and crew coordinates

Both of these included dozens of trials with different hyper-parameters, algorithms and amounts of data at each step.
None of the configurations I tried came close to the originally desired outcome.
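
The list above mentions adding coordinates, while the cell that follows concatenates them. The two operations differ in the shape of what is handed to the next t-SNE. A sketch with made-up 2D projections (the array names are placeholders):

```python
import numpy as np

# Hypothetical 2D projections for five movies
rng = np.random.default_rng(0)
toy_cast_xy = rng.normal(size=(5, 2))
toy_crew_xy = rng.normal(size=(5, 2))

# Element-wise addition merges the two projections into two columns
summed = toy_cast_xy + toy_crew_xy

# Concatenation keeps both projections side by side as four columns,
# which can then be passed to another t-SNE
stacked = np.concatenate((toy_cast_xy, toy_crew_xy), axis=1)
```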


In [None]:
one_matrix = np.concatenate((keywords_tsne, genres_tsne), axis=1)
one_tsne = TSNE(random_state=0, learning_rate=30, perplexity=40, early_exaggeration=60).fit_transform(one_matrix)
plt.scatter(one_tsne[:, 0], one_tsne[:, 1])

Below is a better view of the final outcome, with points colored by the clusters found by the DBSCAN clustering algorithm.
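
For readers unfamiliar with how DBSCAN reports its result, here is a minimal sketch on made-up points. The `eps` and `min_samples` values are chosen for the toy data and are not the ones used in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points each, plus one isolated point
toy_points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [10.0, 0.0],
])

toy_db = DBSCAN(eps=0.5, min_samples=3).fit(toy_points)
toy_labels = toy_db.labels_

# Points that belong to no cluster are labelled -1 (noise)
toy_n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
toy_n_noise = list(toy_labels).count(-1)
```

Each group becomes its own cluster and the isolated point is marked as noise, which is the same bookkeeping the cell below applies to `one_tsne`.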

In [None]:
db = DBSCAN(eps=3.5, min_samples=20).fit(one_tsne)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)

# #############################################################################
# Plot result (plt is already imported at the top of the notebook)

# Black is removed from the palette and used for noise instead.
plt.figure(figsize=(30, 30))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = one_tsne[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = one_tsne[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

In [None]:
# Save the data to a file that is readable by the website's code
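combined = np.array([one_tsne[:,0], one_tsne[:,1]]).T
# Convert to NumPy first: indexing a Series with [:, np.newaxis] fails on recent pandas
pop = normalize(dataframe['popularity'].to_numpy()[:, np.newaxis], axis=0) * 20
pop_rating = pop.reshape((pop.shape[0],)) * dataframe['rating'].to_numpy()
df = pd.DataFrame(combined * 50)

df['color'] = labels
df['scale'] = np.clip(pop_rating, 0.1, 2) * 50
# Assign by position so each title lines up with its row in one_tsne;
# assigning a Series would align on the filtered index and mismatch rows
df['title'] = (dataframe['title'] + " (" + dataframe['year'] + ")").to_numpy()
df = df.dropna()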

df.to_csv("movie_cloud.csv", sep="@", index=False, header=False)
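
As a quick sanity check on the output format, the file can be read back with the same `@` separator. The sketch below uses a toy frame and an in-memory buffer instead of `movie_cloud.csv`. The column names passed to `read_csv` are hypothetical, since the file is written without a header.

```python
import io
import pandas as pd

# Toy frame with the same column layout as the exported file: x, y, color, scale, title
toy = pd.DataFrame({0: [12.5, -3.0], 1: [7.25, 40.0]})
toy['color'] = [0, -1]
toy['scale'] = [100.0, 5.0]
toy['title'] = ["Avatar (2009)", "Spectre (2015)"]

# Write with the same options as the notebook, but into memory
buffer = io.StringIO()
toy.to_csv(buffer, sep="@", index=False, header=False)
buffer.seek(0)

# Read it back; the names are placeholders for the unnamed columns
loaded = pd.read_csv(buffer, sep="@", header=None,
                     names=["x", "y", "color", "scale", "title"])
```

The `@` separator presumably avoids clashes with commas in movie titles. A title that itself contains `@` would be quoted by `to_csv`, so the consumer needs to handle quoted fields.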