In [2]:
import pandas as pd

Import Data

In [3]:
ratings = pd.read_csv('ratings.csv')
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [4]:
movies = pd.read_csv('movies.csv')
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


We want to keep only highly rated movies (a rating of 4.0 or above), so that the later steps can find movies liked by the same users.

In [5]:
ratings = ratings[ratings.rating >= 4.0]
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100830,610,166528,4.0,1493879365
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047


In [6]:
movies = movies[['movieId', 'title']]
movies

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)
...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017)
9738,193583,No Game No Life: Zero (2017)
9739,193585,Flint (2017)
9740,193587,Bungo Stray Dogs: Dead Apple (2018)


Retrieve the movies liked by each user and increment the count for a pair of movies every time the two are seen together.

In [7]:
from collections import defaultdict

pairs = defaultdict(int)

# Loop through the entire list of users
for group in ratings.groupby("userId"):
    # List of IDs of movies rated by the current user
    user_movies = list(group[1]["movieId"])

    # Count every time two movies are seen together
    for i in range(len(user_movies)):
        for j in range(i+1, len(user_movies)):
            pairs[(user_movies[i], user_movies[j])] += 1
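The pair counting above can also be sketched with `itertools.combinations`. This is a minimal, standalone illustration on hypothetical toy data (the `liked` dictionary is an assumption, not part of the notebook). It also sorts each user's movie IDs so that a pair is always stored in the same order, which the nested loops only guarantee when the ratings are already ordered by `movieId`.

```python
from collections import Counter
from itertools import combinations

# Toy data (hypothetical): user ID -> IDs of movies that user rated highly
liked = {
    1: [1, 3, 6],
    2: [6, 1],
    3: [3, 6, 1],
}

pair_counts = Counter()
for user_movies in liked.values():
    # Sort and de-duplicate so (1, 6) and (6, 1) count as the same pair
    for pair in combinations(sorted(set(user_movies)), 2):
        pair_counts[pair] += 1

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

In the notebook itself, the same idea would apply to `list(group[1]["movieId"])` for each user group.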


Unpack the two movies and the corresponding score. If the score is at least 20, meaning that at least 20 users liked both movies, add a weighted edge between them to the graph.

In [8]:
import networkx as nx
# Create a networkx graph
G = nx.Graph()

# Try to create an edge between movies that are liked together
for pair in pairs:
    movie1, movie2 = pair
    score = pairs[pair]

    # The edge is only created when the score is high enough
    if score >= 20:
        G.add_edge(movie1, movie2, weight=score)

print("Total number of graph nodes:", G.number_of_nodes())
print("Total number of graph edges:", G.number_of_edges())

Total number of graph nodes: 448
Total number of graph edges: 11266


Hyperparameters:

G: This is the graph on which node embeddings are to be learned.

dimensions: This parameter sets the number of dimensions of the output node embeddings. A higher number of dimensions can capture more complex features, but at the cost of increased computational complexity and potential overfitting.

walk_length: This parameter defines the length of each random walk. Longer walks can capture more information about the graph structure, but they also tend to be more computationally expensive.

num_walks: This is the number of random walks to be performed from each node. More walks can provide more comprehensive information about the neighborhood of each node, enhancing the quality of the resulting embeddings at the cost of increased computation.

p: The return parameter (p) controls the likelihood of immediately revisiting the previous node in the walk. Setting p to 2 (any value above 1) makes the walk less likely to step straight back to the node it just came from. This helps the walk explore outward instead of getting trapped in a local neighborhood.

q: The in-out parameter (q) allows the search to differentiate between “inward” and “outward” nodes. With q set to 1, the random walks are unbiased, treating distant and nearby nodes equally when deciding the next step in the walk. If q is greater than 1, the random walk is biased towards nodes close to the previous node, and if q is less than 1, the random walk is biased towards nodes that are further away.

workers: This parameter specifies the number of worker threads to use in parallel to speed up the random walks. Setting this to 1 means that only a single thread will be used. Using more workers can significantly speed up the computation, especially on multi-core machines.

window: This parameter defines the maximum distance between the current node and its predicted neighbor in node sequences. A larger window size might capture more information about the graph's structure, but it could also include noise by bringing in distant relationships.

min_count: This parameter specifies the minimum count of node occurrences to consider it when training the model. Here, it's set to 1, meaning all nodes that appear in the random walks, even just once, will be included in the training process. This is useful in graphs where rare nodes might still hold significant meaning.

batch_words: This parameter sets the target size, in words (nodes, in the case of Node2Vec), of the batches of examples passed to the worker threads during training. It impacts the training speed and memory usage. A very small value such as 4 splits the walks into many tiny batches, so the per-batch overhead can increase the training time.
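To make the roles of p and q concrete, here is a minimal sketch of the second-order bias that the node2vec algorithm applies when choosing the next step of a walk. The graph, the helper name `search_bias`, and the node labels are hypothetical and only for illustration. Edge weights are ignored here; in the full algorithm this bias is multiplied by the weight of the edge.

```python
def search_bias(prev, candidate, neighbors, p, q):
    """Unnormalised bias for stepping to `candidate` after arriving from `prev`."""
    if candidate == prev:
        return 1.0 / p   # step straight back to the previous node
    if candidate in neighbors[prev]:
        return 1.0       # candidate is also a neighbour of the previous node
    return 1.0 / q       # candidate moves further away from the previous node

# Toy undirected graph (hypothetical), stored as adjacency sets
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

# The walk has just moved from "a" to "b"; where does it go next?
prev, current = "a", "b"
weights = {
    node: search_bias(prev, node, neighbors, p=2, q=1)
    for node in sorted(neighbors[current])
}
total = sum(weights.values())
probs = {node: weight / total for node, weight in weights.items()}

for node, prob in probs.items():
    print(f"{current} -> {node}: {prob:.2f}")
```

With p=2 and q=1, returning to "a" is half as likely as moving to either "c" or "d", which matches the description of p above.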


In [9]:
from node2vec import Node2Vec

node2vec = Node2Vec(G, dimensions=64, walk_length=20, num_walks=200, p=2, q=1, workers=1)

model = node2vec.fit(window=10, min_count=1, batch_words=4)

Computing transition probabilities: 100%|██████████| 448/448 [00:06<00:00, 66.52it/s] 
Generating walks (CPU: 1): 100%|██████████| 200/200 [00:31<00:00,  6.26it/s]


To recommend movies, convert the title into its movieId, look up the 5 most similar node vectors, and convert their IDs back to titles, printing each title with its similarity score.

In [11]:
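def recommend(movie):
    filtered_movies = movies[movies.title == movie]
    if filtered_movies.empty:
        print(f"Movie '{movie}' not found in the dataset.")
        return

    # Node2Vec stores node IDs as strings
    movie_id = str(filtered_movies.movieId.values[0])
    # A movie dropped by the score threshold has no embedding
    if movie_id not in model.wv:
        print(f"Movie '{movie}' is not in the graph.")
        return

    # Print the 5 nearest neighbours with their similarity scores
    for similar_id, similarity in model.wv.most_similar(movie_id, topn=5):
        title = movies[movies.movieId == int(similar_id)].title.values[0]
        print(f'{title}: {similarity:.2f}')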



In [12]:
recommend('Toy Story (1995)')


Lion King, The (1994): 0.58
Monty Python and the Holy Grail (1975): 0.57
Jurassic Park (1993): 0.55
Groundhog Day (1993): 0.54
Sixth Sense, The (1999): 0.52
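The scores printed above are cosine similarities between embedding vectors, which is what `most_similar` ranks by. A minimal sketch of that ranking, using hypothetical two-dimensional toy embeddings and hypothetical helper names rather than the trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (hypothetical): node ID as a string -> vector
embeddings = {
    "1": [1.0, 0.0],
    "2": [1.0, 1.0],
    "3": [0.0, 1.0],
    "4": [-1.0, 0.0],
}

def most_similar(key, topn=2):
    """Rank every other node by cosine similarity to `key`."""
    scores = [
        (other, cosine_similarity(embeddings[key], vector))
        for other, vector in embeddings.items()
        if other != key
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

ranked = most_similar("1")
for node_id, score in ranked:
    print(f"{node_id}: {score:.2f}")
```

A score close to 1 means the two vectors point in nearly the same direction, so the two movies tend to occur in similar random walks.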


In [13]:
recommend('Star Wars (1977)')

Movie 'Star Wars (1977)' not found in the dataset.
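The lookup in `recommend` compares the full title string exactly, so a title that is shortened or formatted differently from the dataset entry is reported as not found. One possible workaround is to search for candidate titles first. The helper `find_titles` and the short title list below are hypothetical and only illustrate the idea.

```python
def find_titles(titles, query):
    """Return every title that contains `query`, ignoring case."""
    needle = query.lower()
    return [title for title in titles if needle in title.lower()]

# Illustrative titles (an assumption, not read from movies.csv)
titles = [
    "Toy Story (1995)",
    "Star Wars: Episode IV - A New Hope (1977)",
    "Stargate (1994)",
]

matches = find_titles(titles, "star wars")
for title in matches:
    print(title)
```

In the notebook, a comparable search could be run on the DataFrame with `movies[movies.title.str.contains("Star Wars", case=False, regex=False)]`, and one of the exact titles it returns can then be passed to `recommend`.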
