In [46]:
import pandas as pd
import numpy as np

## Data importation and understanding

In [47]:
movies_df = pd.read_csv('ml-latest-small/movies.csv', sep=',', names=['item_id', 'title', 'genres'], engine='python',skiprows=1)
tags_df = pd.read_csv('ml-latest-small/tags.csv', sep=',', names=['user_id', 'item_id', 'tag', 'timestamp'], engine='python',skiprows=1)
ratings_df = pd.read_csv('ml-latest-small/ratings.csv', sep=',', names=['user_id', 'item_id', 'rating', 'timestamp'], engine='python',skiprows=1)

The MovieLens dataset is a widely-used dataset for movie recommendation systems, collected by the GroupLens research group at the University of Minnesota. It consists of multiple tables containing data about movies, ratings, and tags provided by users. Here’s a description of each table and its variables:

### movies_df
- **item_id**: Unique identifier for movies.
- **title**: The title of the movie, typically containing the release year in parentheses.
- **genres**: A pipe-separated list of genres associated with the movie.

This DataFrame contains metadata about the movies. Each movie is identified by a unique ID and has associated attributes like the title and the list of genres it belongs to. The genres are categorical and are typically used to filter or describe the content of the movie.

In [48]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   item_id  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


### tags_df
- **user_id**: Unique identifier for users.
- **item_id**: Unique identifier for movies which corresponds to `item_id` in `movies_df`.
- **tag**: Textual tag provided by the user for a movie. These can be descriptive words or short phrases.
- **timestamp**: The timestamp when the tag was provided. This is typically a Unix time stamp and indicates when the user tagged the movie.

The `tags_df` table contains user-generated metadata for the movies. Each row indicates that a particular user has tagged a movie with a textual descriptor. These tags can be used for content-based filtering or to enhance the information about a movie beyond its genres.

In [49]:
tags_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    3683 non-null   int64 
 1   item_id    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


### ratings_df
- **user_id**: Unique identifier for users, which corresponds to `user_id` in `tags_df`.
- **item_id**: Unique identifier for movies, consistent with `item_id` in both `movies_df` and `tags_df`.
- **rating**: The rating given to a movie by a user. This is typically on a defined scale, like 0.5 to 5 stars.
- **timestamp**: The timestamp when the rating was provided, in the same format as in `tags_df`.

The `ratings_df` table is a record of user ratings for movies. Each row documents that a user has assigned a numerical rating to a movie. This is the core data used in collaborative filtering for recommendation systems.

In [50]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    100836 non-null  int64  
 1   item_id    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


When using these tables for building a recommendation system, one usually starts with the `ratings_df` to build the utility matrix that relates users to movies through the ratings they've provided. `movies_df` can be used to provide readable movie titles and genres for recommendations, and `tags_df` can add additional context for the movies or for more sophisticated recommendation systems that also use content-based filtering methods.

## Exploring the data

In [51]:
# For movies_df to count unique movies
unique_movies = movies_df['item_id'].nunique()
print(f'Unique movies: {unique_movies}')

# For ratings_df to count unique users and movies
unique_users_ratings = ratings_df['user_id'].nunique()
unique_movies_ratings = ratings_df['item_id'].nunique()
print(f'Unique users in ratings: {unique_users_ratings}')
print(f'Unique movies in ratings: {unique_movies_ratings}')

# For tags_df to count unique users, movies, and tags
unique_users_tags = tags_df['user_id'].nunique()
unique_movies_tags = tags_df['item_id'].nunique()
unique_tags = tags_df['tag'].nunique()
print(f'Unique users in tags: {unique_users_tags}')
print(f'Unique movies in tags: {unique_movies_tags}')
print(f'Unique tags: {unique_tags}')

Unique movies: 9742
Unique users in ratings: 610
Unique movies in ratings: 9724
Unique users in tags: 58
Unique movies in tags: 1572
Unique tags: 1589


## Small preprocess

We first transform the columns in the table `ratings_df` to *int32*, as the recommender model needs this adaptation

In [52]:
ratings_df['user_id']=ratings_df['user_id'].astype('int32')
ratings_df['item_id']=ratings_df['item_id'].astype('int32')
ratings_df['timestamp']=ratings_df['timestamp'].astype('int32') #It is required by the model

We split the title of the movies in the `movies_df`, creating another column for the year of the film.

After that, we will get rid of the characters ', The', which are at the last part of some titles.

In [53]:
movies_df[['title', 'year']] = movies_df['title'].str.extract(r'(.+) \((\d{4})\)')

movies_df['title']=movies_df['title'].str.replace(', The', '')
movies_df['genres']=movies_df['genres'].str.replace('|',', ')

## Implementing the recommender

Here's a brief explanation of some common hyperparameters for factorization models and a suggested grid for searching:

1. **n_iter**: The number of epochs to run when training the model. More epochs could lead to better performance but also increase the risk of overfitting and computational time.
   - Suggested grid: `[1, 5, 10, 20]`

2. **embedding_dim**: The size of the latent feature vectors for users and items. Larger dimensions could capture more complex patterns but might overfit.
   - Suggested grid: `[8, 16, 32, 64]`

3. **learning_rate**: The step size at each iteration of the optimization algorithm. A smaller learning rate could lead to more precise convergence but might require more epochs.
   - Suggested grid: `[0.001, 0.01, 0.1]`

4. **l2**: The L2 regularization penalty. Higher values could prevent overfitting but might lead to underfitting if too large.
   - Suggested grid: `[0.0, 1e-6, 1e-5, 1e-3]`

5. **loss**: The loss function to be used. Common options are `pointwise`, `bpr`, or `hinge` loss, which are suitable for different types of recommendation tasks.
   - Suggested values: `['pointwise', 'bpr', 'hinge']`

6. **batch_size**: The number of samples per gradient update. Larger batches provide more accurate estimates of the gradient, but smaller batches might help the model to generalize better.
   - Suggested grid: `[32, 64, 128, 256]`


### Training the model

In [54]:
from spotlight.interactions import Interactions
from spotlight.factorization.explicit import ExplicitFactorizationModel
from spotlight.cross_validation import random_train_test_split

# Step 2: Create the Interactions object
interaction_data = Interactions(
    user_ids=ratings_df['user_id'].values,
    item_ids=ratings_df['item_id'].values,
    ratings=ratings_df['rating'].values,
    timestamps=ratings_df['timestamp'].values
)

# Split the interaction data into training and test sets
train, test = random_train_test_split(interaction_data)

# Initialize the model
model = ExplicitFactorizationModel(
    n_iter=5,           # Number of epochs of training
    embedding_dim=32,   # Latent factors (embedding size)
    use_cuda=False      # If you have a CUDA capable GPU, set to True to speed up training
)

# Fit the model on the training data
model.fit(train, verbose=True)

Epoch 0: loss 4.380506869695839
Epoch 1: loss 0.8117601574117631
Epoch 2: loss 0.5160774380269205
Epoch 3: loss 0.3599795619661336
Epoch 4: loss 0.2889907266100333


In [55]:
from spotlight.evaluation import rmse_score

train_rmse = rmse_score(model, train)
test_rmse = rmse_score(model, test)

print('Train RMSE {:.3f}, test RMSE {:.3f}'.format(train_rmse, test_rmse))

Train RMSE 0.474, test RMSE 1.051


### R1: Community recommender

Given a tag, a list of popular tags and a number of films (n), it returns the most *n* recommended films (ordered by rating) in the tag provided.

If the tag is not in the popular tags list, it returns an error.

In [56]:
# We first filter popular tags, which have more than 15 users
counter_tag= tags_df['tag'].value_counts()
popular_tag = counter_tag[counter_tag > 15]
popular_tag.head()

tag
In Netflix queue     131
atmospheric           36
thought-provoking     24
superhero             24
funny                 23
Name: count, dtype: int64

In [57]:
def community_recommender(tag,popular_tag,n):
    popular_tag = [t.lower() for t in popular_tag.index] 
    tag=tag.lower() # convert everything in lower
    if tag in popular_tag:
        movie_filtered_by_tag = tags_df[tags_df['tag'].str.lower() == tag]
        movie_list = movie_filtered_by_tag['item_id'] #we extract here the movies that has this tag
        movies_filtered = ratings_df[ratings_df['item_id'].isin(movie_list)].groupby(['item_id'])['rating'].mean().reset_index() #here we focuse on those movies and then we extract the mean rating.
        movies_filtered.sort_values(by='rating', ascending=False, inplace=True)  #we sort them 
        top_movie_ids = movies_filtered.iloc[0:n]['item_id'] #here we extract the first n movies
        recommended_movies = movies_df[movies_df['item_id'].isin(top_movie_ids)]['title'].values.tolist() #we extract the movies with that id
        return recommended_movies
    else:
        return 'No matches based on your research'


In [58]:
print("\n".join(community_recommender('suspense',popular_tag,10))) #Example of how to use it.

Usual Suspects
Pulp Fiction
Terminator 2: Judgment Day
Silence of the Lambs
Shining
Rosemary's Baby
Departed
Inception
Captain Phillips
Whiplash


### R2: Rating prediction given a user and a movie

Given a user_id and a movie_title, it predicts the rating for the given movie and user.

In [59]:
def predict_rating_for_user_and_movie_title(model, user_id, movie_title, movies_df):
    # Find the item ID for the given movie title
    item_id = movies_df[movies_df['title'] == movie_title]['item_id'].values
    
    # If the movie title is not found, return a message indicating so
    if item_id.size == 0:
        return "Movie title not found."
    
    # Assuming the first match is the correct one
    item_id = item_id[0]
    
    # Predict scores for all items for the user
    predictions = model.predict(user_ids=np.array([user_id]))
    
    # If the item ID is not in the mapping, return a message indicating the item was not found
    if item_id is None:
        return "Item ID not found in the mapping."
    
    # Extract the prediction for the specific item
    predicted_rating = predictions[item_id]
    
    return predicted_rating

# Example usage:
user_id = 1
movie_title = "Silence of the Lambs"
predicted_rating = predict_rating_for_user_and_movie_title(model, user_id, movie_title, movies_df)
print(f"Predicted rating for user {user_id} and movie '{movie_title}': {predicted_rating}")

Predicted rating for user 1 and movie 'Silence of the Lambs': 4.186103820800781


### R3: N recommended movies given a user

Given a user_id and number of recommendations (n), it returns the *n* recommended movies for that user.

In [60]:
def recommended_movies_by_user(model, user_id, n_movies):
    pred=model.predict(user_ids=user_id)
    sorted_indices = np.argsort(pred)[::-1] #we sort the indices 
    top_indices = sorted_indices[:n_movies]+1 #we extract the top n_movies. Now we have to extract the title of the associated movies to give the recommendation
    recommended_movies = movies_df[movies_df['item_id'].isin(top_indices)]['title'].values.tolist() #we extract the movies with that id
    return recommended_movies


In [61]:
recommended_movies_by_user(model=model, user_id=1, n_movies=10)

['Trust',
 'Manhattan',
 "Jumpin' Jack Flash",
 'Fright Night Part II',
 'Phantasm II']

### R4: N recommended movies for unseen users

Given a list of movies the user has seen and like, it returns the *n* movies most recommended, similar to the movies the user pass.

For that, we need another model which takes as input a list of movies, rather than a user_id.

In [62]:
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.cross_validation import user_based_train_test_split

train, test = user_based_train_test_split(interaction_data)

train = train.to_sequence()
test = test.to_sequence()
model_seq = ImplicitSequenceModel(n_iter=5,
                                  representation='cnn',
                                  loss='bpr')
model_seq.fit(train)

In [66]:
def recommend_next_movies(list_of_movies,model,n_movies): #list_of_movies=titles of movies watched 
    #we firstly check if the titles are in the movie list and remove the ones that are not present. 
    #We also convert the titles in order to avoid uppercase sensibility (e.g. spiderman, Spiderman)
    list_of_movies = [title.lower() for title in list_of_movies]
    list_of_movies = [title for title in list_of_movies if title in movies_df['title'].str.lower().tolist()]
    if not list_of_movies: #we check if it is empty
        return 'The list of films provided is not in the catalogue' 
    indices_movies=movies_df[movies_df['title'].str.lower().isin(list_of_movies)]['item_id']
    pred=model.predict(sequences=np.array(indices_movies))
    sorted_indices = np.argsort(pred)[::-1]
    top_indices = sorted_indices[:n_movies]
    recommended_movies = movies_df[movies_df['item_id'].isin(top_indices)]['title'].values.tolist() 
    return recommended_movies

list_movies = ['Toy Story', 'Jumanji', 'Grumpier Old Men']

recommend_next_movies(list_movies, model_seq, 5)


['Speed',
 'X-Men',
 'Crouching Tiger, Hidden Dragon (Wo hu cang long)',
 'My Big Fat Greek Wedding',
 'Zombieland']