# Bernoulli Embeddings
### *A demonstration using the MovieLens dataset (Part 2)*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yabramuvdi/yabramuvdi.github.io/blob/master/_notebooks/movies_embeddings_analysis.ipynb)

This notebook continues the demonstration of Bernoulli embeddings that started [here](https://yabramuvdi.github.io/movies_embeddings_estimation). Having estimated embeddings for all the movies available in our dataset, we can now evaluate their quality. We will do this by inspecting the nearest neighbors of some movies and by solving a couple of analogy tasks.

## Setup

In [1]:
# install libraries
!pip install annoy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting annoy
  Downloading annoy-1.17.1.tar.gz (647 kB)
     |████████████████████████████████| 647 kB 28.9 MB/s 
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.1-cp37-cp37m-linux_x86_64.whl size=397059 sha256=364ec9b1ce68e9fd490b666032876cfb540d6ee79ca974d0572d9bdf76cb3e72
  Stored in directory: /root/.cache/pip/wheels/81/94/bf/92cb0e4fef8770fe9c6df0ba588fca30ab7c306b6048ae8a54
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.1


In [2]:
# clone the repository with the estimated embeddings
!git clone https://github.com/yabramuvdi/bernoulli-embeddings.git

Cloning into 'bernoulli-embeddings'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 23 (delta 7), reused 15 (delta 2), pack-reused 0
Unpacking objects: 100% (23/23), done.


In [3]:
# get data from MovieLens
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip

--2022-11-05 17:17:54--  https://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip’


2022-11-05 17:18:16 (12.2 MB/s) - ‘ml-25m.zip’ saved [261978986/261978986]



In [4]:
# unzip data
!unzip ml-25m.zip

Archive:  ml-25m.zip
   creating: ml-25m/
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       


In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from annoy import AnnoyIndex
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px

In [6]:
# define data path
model_path = "./bernoulli-embeddings/results/"
data_path = "./ml-25m/"

## Load the embeddings

In [7]:
# load the estimated embeddings
rho = np.load(model_path + "embeddings_final.npy")
print(rho.shape) # (num_movies, emb_dimension)

(6083, 50)


In [8]:
# load the dictionary mapping from movies to their position in 
# the embeddings matrix
with open(model_path + "item2idx.pkl", 'rb') as f:
    item2idx = pickle.load(f)

# reverse the dictionary
idx2item = {v:k for k,v in item2idx.items()}

# estimated embeddings have one more row because of the padding
print(len(item2idx))    

6082
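
The two sizes printed above differ by one (6083 rows in `rho` versus 6082 dictionary entries) because of the padding row. A minimal sketch of a sanity check follows. It assumes, as the `i > 0` filter used later in this notebook suggests, that the padding row sits at index 0 and that real movies occupy the remaining rows. The helper name `check_alignment` is hypothetical and not part of the repository.

```python
import numpy as np

def check_alignment(embeddings, item2idx, padding_idx=0):
    """Return True if every item index points at a valid, non-padding row.

    Assumption: the padding row is `padding_idx` (0 here). This is inferred
    from the notebook, not confirmed by the estimation code.
    """
    indices = set(item2idx.values())
    has_extra_row = embeddings.shape[0] == len(item2idx) + 1
    in_range = all(0 <= i < embeddings.shape[0] for i in indices)
    padding_unused = padding_idx not in indices
    unique = len(indices) == len(item2idx)
    return has_extra_row and in_range and padding_unused and unique

# toy data: 3 items plus one padding row
toy_rho = np.zeros((4, 2))
toy_item2idx = {"a": 1, "b": 2, "c": 3}
print(check_alignment(toy_rho, toy_item2idx))  # True
```

With the objects loaded above, the call would be `check_alignment(rho, item2idx)`.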


## Load movies data

In [9]:
# read the data with the information for movies
df_movies = pd.read_csv(data_path + "movies.csv")
df_movies.columns = ["movie_id", "title", "genres"]

# select only the movies for which we have embeddings
df_movies = df_movies.loc[df_movies["movie_id"].isin(list(item2idx.keys()))]
df_movies

| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 59794 | 201646 | Midsommar (2019) | Drama\|Horror\|Mystery |
| 59844 | 201773 | Spider-Man: Far from Home (2019) | Action\|Adventure\|Sci-Fi |
| 60090 | 202429 | Once Upon a Time in Hollywood (2019) | Comedy\|Drama |
| 60095 | 202439 | Parasite (2019) | Comedy\|Drama |


In [11]:
# create mapping from original movie ID to its name
id2movie = dict(zip(df_movies["movie_id"], df_movies["title"]))
movie2id = dict(zip(df_movies["title"], df_movies["movie_id"]))

In [12]:
# test dictionaries
id2movie[100], movie2id["City Hall (1996)"]

('City Hall (1996)', 100)

## Nearest neighbors

In [22]:
# define auxiliary functions
def build_indexer(vectors, num_trees=10):
    """ we will use a version of approximate nearest neighbors
        (ANNOY: https://github.com/spotify/annoy) to build an indexer
        of the embeddings matrix
    """
    
    # Annoy's 'angular' metric ranks neighbors in the same order as cosine similarity
    indexer = AnnoyIndex(vectors.shape[1], 'angular')
    for i, vec in enumerate(vectors):
        # add movie embedding to indexer
        indexer.add_item(i, vec)
        
    # build trees for searching
    indexer.build(num_trees)
    
    return indexer

def find_nn(item_name, annoy_indexer, item2idx, idx2item, movie2id, id2movie, n=10):
    """ function to find the nearest neighbors of a given item
    """
    
    # name to index in original database
    item = movie2id[item_name]
    
    # original index to model index
    item_index = item2idx[item]
    
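    nearest_indexes, distances = annoy_indexer.get_nns_by_item(item_index, n+1, include_distances=True)

    # drop the first hit (normally the query movie itself) and index 0
    # (presumably the padding row), keeping distances aligned with the movies
    neighbors = [(idx2item[i], d)
                 for i, d in zip(nearest_indexes[1:], distances[1:]) if i > 0]
    
    # get names of movies
    nearest_movies = [id2movie[item] for item, _ in neighbors]
    nearest_distances = [d for _, d in neighbors]
    
    return nearest_movies, nearest_distances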

def list_print(x):
    for i, movie in enumerate(x):
        print(f"{i+1}. {movie}")
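
Annoy returns approximate neighbors. With only a few thousand 50-dimensional vectors, an exact brute-force search is cheap and can serve as a cross-check. The sketch below is not part of the original notebook: the function name `exact_cosine_nn` and the toy matrix are hypothetical, and it uses plain NumPy instead of Annoy.

```python
import numpy as np

def exact_cosine_nn(vectors, query_index, n=10):
    """Exact nearest neighbors of one row under cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # avoid division by zero for all-zero rows (e.g. a padding row)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[query_index]
    # exclude the query itself, then sort by decreasing similarity
    sims[query_index] = -np.inf
    order = np.argsort(-sims)[:n]
    return order, sims[order]

# toy matrix: row 1 points almost in the same direction as row 0
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
idx, sims = exact_cosine_nn(toy, query_index=0, n=2)
print(idx)  # [1 2]
```

With the notebook's objects, `exact_cosine_nn(rho, item2idx[movie2id[movie]])` would return model indices that can be mapped back to titles through `idx2item` and `id2movie`.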

In [16]:
# create an indexer for our estimated embeddings
indexer_rho = build_indexer(rho, 20000)

In [23]:
movie = "Star Wars: Episode VI - Return of the Jedi (1983)"
N = 10
print(f"{N} RHO nearest neighbors of {movie}:\n")
nn_movies, nn_dists = find_nn(movie, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

10 RHO nearest neighbors of Star Wars: Episode VI - Return of the Jedi (1983):

1. Star Wars: Episode IV - A New Hope (1977)
2. Star Wars: Episode V - The Empire Strikes Back (1980)
3. Indiana Jones and the Last Crusade (1989)
4. Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
5. Star Trek II: The Wrath of Khan (1982)
6. Indiana Jones and the Temple of Doom (1984)
7. Star Trek: First Contact (1996)
8. Star Wars: Episode I - The Phantom Menace (1999)
9. Aliens (1986)
10. Terminator, The (1984)
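
`find_nn` also returns the Annoy distances, which the cells in this section do not print. According to the Annoy documentation, the `'angular'` metric is the Euclidean distance between normalized vectors, $\sqrt{2(1 - \cos\theta)}$, so a distance $d$ can be converted back to a cosine similarity as $1 - d^2/2$. A small sketch of that conversion (the helper name `angular_to_cosine` is hypothetical):

```python
import numpy as np

def angular_to_cosine(distances):
    """Convert Annoy 'angular' distances d = sqrt(2 * (1 - cos)) to cosine similarity."""
    d = np.asarray(distances, dtype=float)
    return 1.0 - d ** 2 / 2.0

# distances for identical, orthogonal and opposite directions
print(angular_to_cosine([0.0, np.sqrt(2.0), 2.0]))
```

Here the input would be the second value returned by `find_nn`, for example `angular_to_cosine(nn_dists)`.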


In [24]:
movie = "Harry Potter and the Order of the Phoenix (2007)"
N = 10
print(f"{N} RHO nearest neighbors of {movie}:\n")
nn_movies, nn_dists = find_nn(movie, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

10 RHO nearest neighbors of Harry Potter and the Order of the Phoenix (2007):

1. Harry Potter and the Half-Blood Prince (2009)
2. Harry Potter and the Deathly Hallows: Part 1 (2010)
3. Harry Potter and the Goblet of Fire (2005)
4. Harry Potter and the Deathly Hallows: Part 2 (2011)
5. Harry Potter and the Prisoner of Azkaban (2004)
6. Harry Potter and the Chamber of Secrets (2002)
7. Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
8. Pirates of the Caribbean: Dead Man's Chest (2006)
9. Pirates of the Caribbean: At World's End (2007)
10. Chronicles of Narnia: The Lion, the Witch and the Wardrobe, The (2005)


In [25]:
movie = "Die Hard (1988)"
N = 10
print(f"{N} RHO nearest neighbors of {movie}:\n")
nn_movies, nn_dists = find_nn(movie, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

10 RHO nearest neighbors of Die Hard (1988):

1. Lethal Weapon (1987)
2. Indiana Jones and the Last Crusade (1989)
3. Terminator, The (1984)
4. Indiana Jones and the Temple of Doom (1984)
5. Untouchables, The (1987)
6. Aliens (1986)
7. Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
8. Terminator 2: Judgment Day (1991)
9. Die Hard 2 (1990)
10. Hunt for Red October, The (1990)


In [26]:
movie = "Kill Bill: Vol. 1 (2003)"
N = 10
print(f"{N} RHO nearest neighbors of {movie}:\n")
nn_movies, nn_dists = find_nn(movie, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

10 RHO nearest neighbors of Kill Bill: Vol. 1 (2003):

1. Kill Bill: Vol. 2 (2004)
2. Sin City (2005)
3. Donnie Darko (2001)
4. Shaun of the Dead (2004)
5. Grindhouse (2007)
6. Snatch (2000)
7. V for Vendetta (2006)
8. Battle Royale (Batoru rowaiaru) (2000)
9. Reservoir Dogs (1992)
10. Team America: World Police (2004)


In [27]:
movie = "Lion King, The (1994)"
N = 10
print(f"{N} RHO nearest neighbors of {movie}:\n")
nn_movies, nn_dists = find_nn(movie, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

10 RHO nearest neighbors of Lion King, The (1994):

1. Aladdin (1992)
2. Beauty and the Beast (1991)
3. Pocahontas (1995)
4. Mrs. Doubtfire (1993)
5. Jumanji (1995)
6. Snow White and the Seven Dwarfs (1937)
7. Home Alone (1990)
8. Little Mermaid, The (1989)
9. Pinocchio (1940)
10. Hunchback of Notre Dame, The (1996)


In [28]:
movie = "2001: A Space Odyssey (1968)"
N = 10
print(f"{N} RHO nearest neighbors of {movie}:\n")
nn_movies, nn_dists = find_nn(movie, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

10 RHO nearest neighbors of 2001: A Space Odyssey (1968):

1. Blade Runner (1982)
2. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
3. Clockwork Orange, A (1971)
4. Brazil (1985)
5. Apocalypse Now (1979)
6. Alien (1979)
7. Metropolis (1927)
8. Barry Lyndon (1975)
9. Planet of the Apes (1968)
10. Lawrence of Arabia (1962)


In [29]:
movie = "Exorcist, The (1973)"
N = 10
print(f"{N} RHO nearest neighbors of {movie}:\n")
nn_movies, nn_dists = find_nn(movie, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

10 RHO nearest neighbors of Exorcist, The (1973):

1. Halloween (1978)
2. Poltergeist (1982)
3. Carrie (1976)
4. Omen, The (1976)
5. Rosemary's Baby (1968)
6. Jaws (1975)
7. Misery (1990)
8. Nightmare on Elm Street, A (1984)
9. Shining, The (1980)
10. Texas Chainsaw Massacre, The (1974)


## Analogies

A well-known, and somewhat surprising, use of word embeddings is solving word analogies. The famous example used by [Mikolov et al. (2013)](https://arxiv.org/pdf/1301.3781.pdf) searches for a word $X$ in the embedded space that relates to "woman" in the same way that "king" relates to "man". This task can be expressed as a simple vector arithmetic problem:

$$
\begin{aligned}
\vec{King}^{\,} - \vec{Man}^{\,} &= \vec{X}^{\,} - \vec{Woman}^{\,} \\
\vec{King}^{\,} - \vec{Man}^{\,} + \vec{Woman}^{\,} &= \vec{X}^{\,}
\end{aligned}
$$

Mikolov et al. (2013) find that when performing this operation on their trained embeddings, they are able to recover the word "queen".

$$ \vec{King}^{\,} - \vec{Man}^{\,} + \vec{Woman}^{\,} \approx \vec{Queen}^{\,} $$
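
The arithmetic can be illustrated with a toy example. The 2-dimensional vectors below are made up for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not trained embeddings, and real embeddings only satisfy such relations approximately.

```python
import numpy as np

# hypothetical hand-made vectors: [royalty, gender]
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman
query = toy["king"] - toy["man"] + toy["woman"]

# rank the remaining words by cosine similarity, excluding the three inputs
candidates = [w for w in toy if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(query, toy[w]))
print(best)  # queen
```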

We will play with this idea and try to extend it to our own domain (i.e., movies). The analogies that we will try to solve are:

$$ \vec{\text{Star Wars V}} - \vec{\text{Star Wars IV}} + \vec{\text{LotR I}} \approx \; ? $$

$$ \vec{\text{Harry Potter 6}} - \vec{\text{Harry Potter 5}} + \vec{\text{Kill Bill 1}} \approx \; ? $$

In [41]:
def find_nn_vector(vector, annoy_indexer, item2idx, idx2item, movie2id, id2movie, n=10):
    """ function to find the nearest neighbors of a given vector
    """
    
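    # find the nearest neighbors of the query vector passed as argument
    nearest_indexes, distances = annoy_indexer.get_nns_by_vector(vector, n+1, include_distances=True)

    # drop the closest hit (assumed to be the anchor movie of the analogy) and
    # index 0 (presumably the padding row), keeping distances aligned
    neighbors = [(idx2item[i], d)
                 for i, d in zip(nearest_indexes[1:], distances[1:]) if i > 0]
    
    # get names of movies
    nearest_movies = [id2movie[item] for item, _ in neighbors]
    nearest_distances = [d for _, d in neighbors]
    
    return nearest_movies, nearest_distances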

In [44]:
# define the movies for the analogy task
movie_pos_1 = "Harry Potter and the Half-Blood Prince (2009)"
movie_neg_1 = "Harry Potter and the Order of the Phoenix (2007)"
movie_pos_2 = "Kill Bill: Vol. 1 (2003)"

# get the embedded representation of our movies of interest
emb_pos_1 = rho[item2idx[movie2id[movie_pos_1]]]
emb_neg_1 = rho[item2idx[movie2id[movie_neg_1]]]
emb_pos_2 = rho[item2idx[movie2id[movie_pos_2]]]

# vector arithmetic
query_emb = emb_pos_1 - emb_neg_1 + emb_pos_2
query_emb.shape

(50,)

In [46]:
print(f"Which movie is similar to: {movie_pos_2} in the same sense that {movie_pos_1} is similar to {movie_neg_1}\n")
N = 10
nn_movies, nn_dists = find_nn_vector(query_emb, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

Which movie is similar to: Kill Bill: Vol. 1 (2003) in the same sense that Harry Potter and the Half-Blood Prince (2009) is similar to Harry Potter and the Order of the Phoenix (2007)

1. Kill Bill: Vol. 2 (2004)
2. Sin City (2005)
3. Snatch (2000)
4. Old Boy (2003)
5. Memento (2000)
6. Reservoir Dogs (1992)
7. V for Vendetta (2006)
8. Shaun of the Dead (2004)
9. Battle Royale (Batoru rowaiaru) (2000)
10. Donnie Darko (2001)


In [47]:
# define the movies for the analogy task
movie_pos_1 = "Star Wars: Episode V - The Empire Strikes Back (1980)"
movie_neg_1 = "Star Wars: Episode IV - A New Hope (1977)"
movie_pos_2 = "Lord of the Rings: The Fellowship of the Ring, The (2001)"

# get the embedded representation of our movies of interest
emb_pos_1 = rho[item2idx[movie2id[movie_pos_1]]]
emb_neg_1 = rho[item2idx[movie2id[movie_neg_1]]]
emb_pos_2 = rho[item2idx[movie2id[movie_pos_2]]]

# vector arithmetic
query_emb = emb_pos_1 - emb_neg_1 + emb_pos_2
query_emb.shape

(50,)

In [48]:
print(f"Which movie is similar to: {movie_pos_2} in the same sense that {movie_pos_1} is similar to {movie_neg_1}\n")
N = 10
nn_movies, nn_dists = find_nn_vector(query_emb, indexer_rho, item2idx, idx2item, movie2id, id2movie, N)
list_print(nn_movies)

Which movie is similar to: Lord of the Rings: The Fellowship of the Ring, The (2001) in the same sense that Star Wars: Episode V - The Empire Strikes Back (1980) is similar to Star Wars: Episode IV - A New Hope (1977)

1. Lord of the Rings: The Two Towers, The (2002)
2. Lord of the Rings: The Return of the King, The (2003)
3. Matrix, The (1999)
4. Pirates of the Caribbean: The Curse of the Black Pearl (2003)
5. Shrek (2001)
6. Indiana Jones and the Last Crusade (1989)
7. Star Wars: Episode V - The Empire Strikes Back (1980)
8. Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)
9. Kill Bill: Vol. 1 (2003)
10. Batman Begins (2005)
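
The two analogy cells repeat the same lookups, and `find_nn_vector` relies on dropping the first hit. One possible refactoring, sketched here under assumptions, wraps the arithmetic in a single function and excludes the three input items explicitly by index. It uses an exact cosine search in NumPy rather than Annoy so that the block is self-contained; `solve_analogy` and the toy matrix are hypothetical, and treating index 0 as the padding row is an assumption.

```python
import numpy as np

def solve_analogy(vectors, pos_1, neg_1, pos_2, n=10, exclude=()):
    """Rank rows of `vectors` by cosine similarity to pos_1 - neg_1 + pos_2.

    pos_1, neg_1 and pos_2 are row indices. They are always left out of the
    result, together with any extra indices in `exclude` (e.g. a padding row).
    """
    query = vectors[pos_1] - vectors[neg_1] + vectors[pos_2]
    norms = np.linalg.norm(vectors, axis=1)
    norms = np.where(norms == 0, 1.0, norms)          # guard all-zero rows
    sims = vectors @ query / (norms * np.linalg.norm(query))
    sims[[pos_1, neg_1, pos_2, *exclude]] = -np.inf   # never return the inputs
    return np.argsort(-sims)[:n]

# toy matrix: row 0 is an all-zero "padding" row, rows 1-4 form a parallelogram
toy = np.array([
    [0.0, 0.0],   # 0: padding
    [1.0, 0.0],   # 1: first film of series A
    [1.0, 1.0],   # 2: sequel of series A
    [3.0, 0.0],   # 3: first film of series B
    [3.0, 1.0],   # 4: sequel of series B
    [-1.0, 0.0],  # 5: unrelated item
])
print(solve_analogy(toy, pos_1=2, neg_1=1, pos_2=3, n=1, exclude=(0,)))  # [4]
```

With the notebook's objects, the row indices would come from `item2idx[movie2id[...]]` and the returned indices would be mapped back to titles through `idx2item` and `id2movie`.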
