In this notebook, we will use traditional machine learning methods to the problem of product recommendation. We will use the MovieLens dataset, which contains user ratings for movies, along with movie metadata. The dataset contains ~1M movie ratings collected from 3706 users. The user ratings are on a scale from 1 to 5 (stars). The dataset also contains identifiers for users and movies, as well as some additional metadata such as movie genres and timestamps of ratings.

Let's start by loading the data in this notebook. Since this is a large dataset and the computations are quite costly, I strongly suggest you use your own (local) computer to run the code (instead of running it in the Cloud using, e.g., Colab). Now, to read the ratings.dat file into a pandas DataFrame, we can use:

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Update the path according to where you locate the file in your computer!
ratings_file_path = '/Users/victormpreciado/PythonProjects/Networks/Data/MovieLens/ratings.dat'

# Assuming the separator is '::' for the MovieLens dataset
ratings = pd.read_csv(ratings_file_path, sep='::', header=None, names=['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python')

# Compute the average of the 'Rating' column
avg_rating = ratings['Rating'].mean()

# Now 'avg_rating' contains the average rating
print(avg_rating)

# Display the first few rows of the dataframe
ratings.head()

3.581564453029317


Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


As you can see above, each row contains the 'Rating' (1 to 5) that user 'UserID' assigns to movie 'MovieID'. We now download a second file named 'users.dat' containing a few attributes for each user, in particular, 'Gender', 'Age' (not the exact age, but the age group), 'Occupation' and 'Zip-code'. Let's load the file and take a look at the data...

In [2]:
# Update the path according to where the file is located
users_file_path = '/Users/victormpreciado/PythonProjects/Networks/Data/MovieLens/users.dat'

# Define column names for the users DataFrame
user_columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

# Load the data
users = pd.read_csv(users_file_path, sep='::', header=None, names=user_columns, engine='python')

# Display the first few rows to verify
users.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


We also have features about each movie in the file 'movies.dat'. The features for each 'MovieID' are: 'Title' (including the year in parentheses) and 'Genres' (more than one genre possible).

In [3]:
# Update the path according to where the file is located
movies_file_path = '/Users/victormpreciado/PythonProjects/Networks/Data/MovieLens/movies.dat'

# Define column names for the movies DataFrame
movie_columns = ['MovieID', 'Title', 'Genres']

# Load the data with ISO-8859-1 encoding
movies = pd.read_csv(movies_file_path, sep='::', header=None, names=movie_columns, engine='python', encoding='ISO-8859-1')

# Display the first few rows to verify
movies.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


After all the data available is loaded, we move on to the task of building a machine learning architecture able to predict the rating that a particular individual would assign to a movie. A simple approach to tackle this problem would be to take as input features the combination of the features of an 'User' with the features of a 'Movie' and consider 'Rating' as the Output variable. In this Notebook, we use a more sophisticated approach called Collaborative Filtering. This technique is based on creating a complete weighted graph where each node represents a user and the weight of each edge connecting two users measures how similar these two individuals are based on their Cosine Similarity, which we explain below.

To build the weighted network of Cosine Similarities, we first create a user-item rectangular matrix. The rows of this rectangular matrix are indexed by the users, while the columns by the movies. The entries are the ratings that individuals give to movies. By convention, we fill this matrix with a zero when an individual has not seen a movie. Since an individual has only seen a small fraction of movies, this built matrix is very sparse (most entries are zero). The code below builds and shows a part of this (sparse and rectangular) matrix...

In [4]:
# Creating the user-item matrix
user_item_matrix = ratings.pivot_table(index='UserID', columns='MovieID', values='Rating')

# Fill missing values with 0s (assuming that missing values mean unrated movies)
user_item_matrix = user_item_matrix.fillna(0)

# Display the user-item matrix
print(user_item_matrix.head()) # This shows part of the first 5 rows of the user-item matrix

MovieID  1     2     3     4     5     6     7     8     9     10    ...  \
UserID                                                               ...   
1         5.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
2         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
3         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
4         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
5         0.0   0.0   0.0   0.0   0.0   2.0   0.0   0.0   0.0   0.0  ...   

MovieID  3943  3944  3945  3946  3947  3948  3949  3950  3951  3952  
UserID                                                               
1         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
2         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
3         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
4         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
5         0.0   0.0   0.0   0.0   0.0   0.0   0

We can now build the complete weighted graph of Cosine Similarities. The Cosine Similarity between two users $\text{User1}$ and $\text{User2}$ is mathematically defined as:

$$
\text{CosSim}(\text{User1}, \text{User2}) = \frac{\sum_{Item=1}^{\text{# of Items}} \text{Rating}(\text{User1}, \text{Item}) \times \text{Rating}(\text{User2}, \text{Item})}{\sqrt{\sum_{Item=1}^{\text{# of Items}} (\text{Rating}(\text{User1}, \text{Item}))^2} \times \sqrt{\sum_{Item=1}^{\text{# of Items}} (\text{Rating}(\text{User2}, \text{Item}))^2}}
$$

where: $\text{Rating(User,Item)}$ is the ratings given by user $\textbf{User}$ to item $\textbf{Item}$.

In [5]:
# Create a user-item matrix for cosine similarity calculation
user_item_matrix = ratings.pivot_table(index='UserID', columns='MovieID', values='Rating').fillna(0)

# Compute user-user cosine similarity matrix
user_similarity = cosine_similarity(user_item_matrix)
print(user_similarity)

[[1.         0.09638153 0.12060981 ... 0.         0.17460369 0.13359025]
 [0.09638153 1.         0.1514786  ... 0.06611767 0.0664575  0.21827563]
 [0.12060981 0.1514786  1.         ... 0.12023352 0.09467506 0.13314404]
 ...
 [0.         0.06611767 0.12023352 ... 1.         0.16171426 0.09930008]
 [0.17460369 0.0664575  0.09467506 ... 0.16171426 1.         0.22833237]
 [0.13359025 0.21827563 0.13314404 ... 0.09930008 0.22833237 1.        ]]


Once we have built the matrix of Cosine Similarities, we build a network-based feature, called 'SimilarUserRatings', as follows: 1) Pick a particular user 'User' and find the top-50 most similar users (according to the Cosine Similarities); 2) Pick a movie 'Movie' and compute the average rating that the top-50 users most similar to 'User' assign to that movie. We can compute and include this new feature in the 'ratings' DataFrame, as follows... (this next cell takes a long time to compute, so be patient and make sure you use your own computer instead of Google Colab) 

In [6]:
# For each user-movie pair, calculate the average rating from top-N similar users (N=50)
def get_similar_users_rating(row, top_n=50):
    similar_users = np.argsort(-user_similarity[row['UserID'] - 1])[:top_n]
    similar_users_ratings = user_item_matrix.iloc[similar_users, user_item_matrix.columns.get_loc(row['MovieID'])]
    # Filter out zero values
    non_zero_ratings = similar_users_ratings[similar_users_ratings != 0]
    # Compute the mean of non-zero values
    return np.mean(non_zero_ratings) if len(non_zero_ratings) > 0 else avg_rating

# The next line creates a new column with 'SimilarUserRatings'
ratings['SimilarUsersRating'] = ratings.apply(get_similar_users_rating, axis=1) 
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,SimilarUsersRating
0,1,1193,5,978300760,4.166667
1,1,661,3,978302109,3.571429
2,1,914,3,978301968,4.470588
3,1,3408,4,978300275,4.117647
4,1,2355,5,978824291,4.263158


Apart from this network-based feature, we will also use features based on attributes to make our predictions. As attribute-based features for the 'Movie', we will only use its genre (using Multi-Hot encoding). As attribute-based features of the 'User', we consider their 'Age' group (as a numerical value), 'Gender' (as a binary variable), and 'Occupation' (as a One-Hot Encoding).

In [7]:
# Multi-hot encode movie genres
movies_genres = movies['Genres'].str.get_dummies(sep='|')

# One-hot encode user's occupation
users_occupation = pd.get_dummies(users['Occupation'], prefix='Occupation')

# Encode gender as binary
users['Gender'] = users['Gender'].map({'F': 0, 'M': 1})

# Combine user features
users_features = pd.concat([users[['UserID', 'Age', 'Gender']], users_occupation], axis=1)

# Merge ratings with user features and movie genres
ratings = ratings.merge(users_features, on='UserID')
ratings = ratings.merge(movies_genres, left_on='MovieID', right_index=True)
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,SimilarUsersRating,Age,Gender,Occupation_0,Occupation_1,Occupation_2,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,1193,5,978300760,4.166667,1,0,False,False,False,...,0,0,0,0,0,1,0,0,0,0
120,2,1193,5,978298413,4.333333,56,1,False,False,False,...,0,0,0,0,0,1,0,0,0,0
1339,12,1193,4,978220179,4.611111,25,1,False,False,False,...,0,0,0,0,0,1,0,0,0,0
1518,15,1193,4,978199279,4.222222,25,1,False,False,False,...,0,0,0,0,0,1,0,0,0,0
1747,17,1193,5,978158471,3.888889,50,1,False,True,False,...,0,0,0,0,0,1,0,0,0,0


We now use a Linear Regression to train a movie rating predictor based on the features engineered above...

In [8]:
# Prepare the dataset for training
X = ratings.drop(['UserID', 'MovieID', 'Rating', 'Timestamp'], axis=1)
y = ratings['Rating']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 0.7915360998096428
