https://www.projectpro.io/article/recommender-systems-python-methods-and-algorithms/413

In [None]:
https://www.datacamp.com/tutorial/recommender-systems-python

# Download the datasets

In [2]:
!wget https://files.grouplens.org/datasets/movielens/ml-latest.zip

--2023-04-13 03:15:22--  https://files.grouplens.org/datasets/movielens/ml-latest.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277113433 (264M) [application/zip]
Saving to: ‘ml-latest.zip’


2023-04-13 03:16:46 (3.20 MB/s) - ‘ml-latest.zip’ saved [277113433/277113433]



In [3]:
!ls ml-latest.zip

ml-latest.zip


# Import the necessary libraries

In [7]:
import zipfile
import pandas as pd
import numpy as np

# Extract the datasets

In [4]:
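# Use a context manager so the archive is closed automatically and the
# `zipfile` module name is not shadowed by a variable.
with zipfile.ZipFile("ml-latest.zip") as archive:
    archive.extractall()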

In [11]:
!ls ml-latest

README.txt	   genome-tags.csv  movies.csv	 tags.csv
genome-scores.csv  links.csv	    ratings.csv


# Load the datasets

In [31]:
genome_tags = pd.read_csv("ml-latest/genome-tags.csv", header=0, names=["tagId", "genome_tag"])
movies = pd.read_csv("ml-latest/movies.csv")
tags = pd.read_csv("ml-latest/tags.csv")
genome_scores = pd.read_csv("ml-latest/genome-scores.csv")
links = pd.read_csv("ml-latest/links.csv")
ratings = pd.read_csv("ml-latest/ratings.csv")

In [32]:
genome_tags.head()

Unnamed: 0,tagId,genome_tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [14]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [15]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,14,110,epic,1443148538
1,14,110,Medieval,1443148532
2,14,260,sci-fi,1442169410
3,14,260,space action,1442169421
4,14,318,imdb top 250,1442615195


In [16]:
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.029
1,1,2,0.02375
2,1,3,0.05425
3,1,4,0.06875
4,1,5,0.16


In [17]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [18]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264


# The Weighted Rating (WR) formula is commonly used in recommendation systems to score items such as movies or products. It blends an item's own average rating (R) with the mean rating across all items (C), weighting the two by the number of votes the item has received (v) relative to a minimum-vote threshold (m). Items with few votes are therefore pulled toward C, while items with many votes keep a score close to R.

    This project uses the Weighted Rating (WR) formula to score movies and builds the recommendation list from those scores.

    Weighted Rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C
    
    WR: Weighted Rating, which is the final calculated value.
    v: Number of votes (or ratings) received by the item.
    m: Minimum votes (or ratings) required for the item to be considered.
    R: Average rating of the item.
    C: Mean vote (or rating) across all items.
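
To make the formula concrete, here is a minimal sketch in plain Python. The function name `weighted_rating` is an illustrative choice and not part of this notebook; the sample numbers for movieId 4, `m`, and `C` are the values that appear in the outputs further down.

```python
def weighted_rating(v, R, m, C):
    """Blend an item's own mean rating R with the global mean C.

    The weight on R grows with the vote count v relative to the threshold m.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Values for movieId 4 as reported in this notebook's outputs.
wr_movie_4 = weighted_rating(v=2989, R=2.87454, m=1855.0, C=3.0685927253973193)
print(round(wr_movie_4, 4))

# With no votes the score falls back to C; with many votes it approaches R.
print(weighted_rating(v=0, R=5.0, m=1855.0, C=3.0))
print(weighted_rating(v=1_000_000, R=5.0, m=1855.0, C=3.0))
```

The two limiting cases show why the formula is useful for ranking: a movie cannot reach the top of the list on a handful of perfect ratings alone.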

# <font color='red'>Now, let's determine the values of each variable that are required for the computation.</font>

# We can create a new dataframe to store the movie information that we want to compute.

In [172]:
movies_query = pd.DataFrame()

# Average rating of the item.

In [173]:
movies_query['R'] = ratings.groupby("movieId").rating.mean()
movies_query.R

movieId
1         3.886649
2         3.246583
3         3.173981
4         2.874540
5         3.077291
            ...   
193876    3.000000
193878    2.000000
193880    2.000000
193882    2.000000
193886    3.250000
Name: R, Length: 53889, dtype: float64

# Mean vote (or rating) across all items.

In [174]:
C = movies_query.R.mean()
print(C, "out of", ratings.rating.max())

3.0685927253973193 out of 5.0
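
Note that `C` above is the mean of the per-movie averages, so every movie counts once regardless of how many ratings it has. An alternative would be the mean over all individual ratings. The sketch below uses a small made-up dataframe (`toy_ratings` is hypothetical, not the MovieLens data) to show that the two can differ noticeably.

```python
import pandas as pd

# Toy data (hypothetical): movie 1 has four ratings, movie 2 has one.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 1, 2],
    "rating": [5.0, 5.0, 5.0, 5.0, 1.0],
})

# Mean of per-movie means, as in the cell above: each movie counts once.
mean_of_movie_means = toy_ratings.groupby("movieId")["rating"].mean().mean()

# Mean over all individual ratings: heavily rated movies dominate.
mean_of_all_ratings = toy_ratings["rating"].mean()

print(mean_of_movie_means, mean_of_all_ratings)
```

Either choice is defensible; which one fits depends on whether the prior should represent a typical movie or a typical rating.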


# Number of votes (or ratings) received by the item.

In [175]:
movies_query['v'] = ratings.groupby("movieId").rating.count()
movies_query.v

movieId
1         68469
2         27143
3         15585
4          2989
5         15474
          ...  
193876        1
193878        1
193880        1
193882        1
193886        2
Name: v, Length: 53889, dtype: int64
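
As a possible shortcut, `R` and `v` can be computed in a single pass with pandas named aggregation. This is a sketch on a small made-up dataframe (`toy_ratings` and `stats` are illustrative names), not a replacement for the cells above.

```python
import pandas as pd

# Toy data (hypothetical) with the same columns as ratings.csv.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0, 3.0, 4.5],
})

# One groupby produces both the average rating (R) and the vote count (v).
stats = toy_ratings.groupby("movieId")["rating"].agg(R="mean", v="count")
print(stats)
```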

# Minimum votes (or ratings) required for the item to be considered.

In [176]:
PERCENTAGE = .95 # use the 95th percentile of per-movie vote counts as m
m = movies_query.v.quantile(PERCENTAGE)
print(f"{PERCENTAGE * 100}%", m)

95.0% 1855.0
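
The choice of `m` controls how strongly scores are pulled toward `C`. The sketch below re-evaluates the formula for two movies from this notebook (movieId 193886 with 2 votes, movieId 1 with 68469 votes) under a few thresholds. The alternative thresholds 1 and 100 and the names `weighted_rating` and `scores` are illustrative assumptions.

```python
def weighted_rating(v, R, m, C):
    """Weighted Rating formula as defined earlier in this notebook."""
    return (v / (v + m)) * R + (m / (v + m)) * C


C = 3.0685927253973193  # mean rating reported above

scores = {}
for m in (1.0, 100.0, 1855.0):
    few_votes = weighted_rating(v=2, R=3.25, m=m, C=C)            # movieId 193886
    many_votes = weighted_rating(v=68469, R=3.886649, m=m, C=C)   # movieId 1
    scores[m] = (few_votes, many_votes)
    print(f"m={m:>7}: few votes -> {few_votes:.4f}, many votes -> {many_votes:.4f}")
```

A larger `m` moves the rarely rated movie almost all the way to `C`, while the heavily rated movie barely changes.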


# Calculate the weighted average ratings using the provided formula.

In [177]:
movies_query.head()

movieId,R,v
1,3.886649,68469
2,3.246583,27143
3,3.173981,15585
4,2.87454,2989
5,3.077291,15474


In [178]:
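def WR(query):
    """Weighted Rating for a row (or a whole dataframe) with columns R and v.

    Relies on the notebook-level values m (vote threshold) and C (mean rating).
    """
    R = query.R
    v = query.v
    return (v / (v + m)) * R + (m / (v + m)) * C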

In [179]:
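# WR only uses column arithmetic, so it can be applied to the whole dataframe
# at once instead of row by row with apply(axis=1).
movies_query["WR"] = WR(movies_query)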

In [180]:
movies_query.head()

movieId,R,v,WR
1,3.886649,68469,3.865071
2,3.246583,27143,3.235197
3,3.173981,15585,3.162772
4,2.87454,2989,2.948852
5,3.077291,15474,3.07636


# Now, let's include the movie title in this dataframe

In [181]:
movies_query = pd.merge(movies_query, movies, on="movieId")
movies_query.reset_index(drop=True, inplace=True)

In [182]:
movies_query.head()

Unnamed: 0,movieId,R,v,WR,title,genres
0,1,3.886649,68469,3.865071,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,3.246583,27143,3.235197,Jumanji (1995),Adventure|Children|Fantasy
2,3,3.173981,15585,3.162772,Grumpier Old Men (1995),Comedy|Romance
3,4,2.87454,2989,2.948852,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,3.077291,15474,3.07636,Father of the Bride Part II (1995),Comedy


# We can now sort the values to identify the top scores, and depending on our requirements, we can select the top 10 or top 20 recommendations for our recommendation system.

In [185]:
movies_query = movies_query.sort_values("WR", ascending=False)
movies_query.reset_index(drop=True, inplace=True)

In [186]:
movies_query.head(10)

Unnamed: 0,movieId,R,v,WR,title,genres
0,318,4.424188,97999,4.399005,"Shawshank Redemption, The (1994)",Crime|Drama
1,858,4.332893,60904,4.295523,"Godfather, The (1972)",Crime|Drama
2,50,4.291959,62180,4.25652,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
3,527,4.257502,71516,4.227443,Schindler's List (1993),Drama|War
4,1221,4.263035,38875,4.208636,"Godfather: Part II, The (1974)",Crime|Drama
5,2959,4.230663,65678,4.198743,Fight Club (1999),Action|Crime|Drama|Thriller
6,1193,4.22292,42181,4.174295,One Flew Over the Cuckoo's Nest (1975),Drama
7,296,4.173971,92406,4.152218,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
8,912,4.210098,31095,4.145834,Casablanca (1942),Drama|Romance
9,904,4.230799,22264,4.141413,Rear Window (1954),Mystery|Thriller


### <font color='green'>Notes: A recommender built only on the Weighted Rating (WR) formula is not personalized: every user sees the same ranked list. The formula relies solely on aggregated ratings and ignores user demographics, contextual information, and temporal dynamics, so it cannot reflect the preferences, tastes, and behavior of individual users. It also treats every rating as equally reliable, regardless of its quality. With sparse or imbalanced data, most movies have far fewer votes than m, so their scores are pulled close to C and become hard to tell apart.</font>