### Chapter 2: (Heuristics) Popularity-based recommendations

In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
#Reading the dataset
df = pd.read_csv('data-1m/dataset_combined.csv')
df.head()

Unnamed: 0,movie_id,title,genres,user_id,rating,timestamp,gender,age,occupation,zipcode,age_desc,occ_desc
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5,978824268,F,1,10,48067,Under 18,K-12 student
1,1,Toy Story (1995),Animation|Children's|Comedy,6,4,978237008,F,50,9,55117,50-55,homemaker
2,1,Toy Story (1995),Animation|Children's|Comedy,8,4,978233496,M,25,12,11413,25-34,programmer
3,1,Toy Story (1995),Animation|Children's|Comedy,9,5,978225952,M,25,17,61614,25-34,technician/engineer
4,1,Toy Story (1995),Animation|Children's|Comedy,10,5,978226474,F,35,1,95370,35-44,academic/educator


In [35]:
#Finding the movies with the highest median rating
popularity_df = df[['title', 'rating']].groupby('title')['rating'].agg(['median', 'count']).sort_values('median', ascending=False)
popularity_df.head(10)

title,median,count
To Live (Huozhe) (1994),5.0,61
One Flew Over the Cuckoo's Nest (1975),5.0,1725
Casablanca (1942),5.0,1669
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954),5.0,628
Schlafes Bruder (Brother of Sleep) (1995),5.0,1
Star Wars: Episode IV - A New Hope (1977),5.0,2991
Schindler's List (1993),5.0,2304
Citizen Kane (1941),5.0,1116
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),5.0,470
Wallace & Gromit: The Best of Aardman Animation (1996),5.0,438


Schlafes Bruder (Brother of Sleep) (1995) has a median rating of 5, but only 1 person has rated it. The median alone ignores how many ratings stand behind it, so a single rating is enough to put a title at the top of the list.

A better way to evaluate movie popularity is to use a **Bayesian average**. It is commonly used in rating systems (like **IMDb's Top 250**) to provide more reliable rankings when items have varying numbers of ratings.

A Bayesian average is a weighted average that helps account for different sample sizes and prevents items with very few ratings from dominating the rankings. 

Here's the Bayesian average formula:

$$
\text{Bayesian Average} = \frac{C \times M + R \times v}{M + v}
$$

Where:
- ***C*** = global mean rating
- ***M*** = minimum ratings required (threshold)
- ***R*** = mean rating for the item
- ***v*** = number of ratings for the item

In probabilistic terms, this is essentially a weighted average between the prior (global mean) and the observed data (item's mean rating), where the weights are determined by the minimum ratings threshold and the number of votes.
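To make the weighting concrete, here is a minimal sketch that writes the formula as an explicit blend of the prior and the item's own rating. The function name `bayesian_average` and the values `C = 3.0` and `M = 33` are illustrative assumptions, not taken from the dataset code.

```python
def bayesian_average(C, M, R, v):
    """Blend the prior C with the item's rating R, given v ratings and threshold M."""
    prior_weight = M / (M + v)   # share of the score that comes from the prior
    data_weight = v / (M + v)    # share that comes from the item's own ratings
    return prior_weight * C + data_weight * R

# Illustrative values (assumptions): prior C = 3.0, threshold M = 33
C, M = 3.0, 33

few = bayesian_average(C, M, R=5.0, v=1)      # a single 5-star rating
many = bayesian_average(C, M, R=5.0, v=3428)  # thousands of ratings

print(round(few, 4), round(many, 4))  # 3.0588 4.9809
```

With one rating the score stays close to the prior, while with thousands of ratings it is almost entirely determined by the item's own rating.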

In [36]:
def calculate_bayesian_avg(df, C, M):
    """
    Calculate the Bayesian average for each movie.
    df: DataFrame with 'median' and 'count' columns
    C: prior rating that movies with few ratings are pulled toward
    M: minimum-votes threshold (the weight given to the prior)
    Note: the per-movie median is used here in place of the mean rating R.
    """
    return (C * M + df['median'] * df['count']) / (M + df['count'])

# Prior rating (C): the median of the per-movie medians,
# used here as a stand-in for the global mean rating
C = popularity_df['median'].median()

# Minimum votes threshold (M): the 25th percentile of the
# per-movie rating counts
M = popularity_df['count'].quantile(0.25)

# Add the Bayesian average to the dataframe
popularity_df['bayesian_avg'] = calculate_bayesian_avg(popularity_df, C, M)
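The cell above works with medians, whereas the formula is defined in terms of mean ratings. As a rough sketch of the mean-based variant, the example below uses a small made-up ratings table; the data, the names `ratings`, `stats`, `C_mean` and `M_toy`, and the threshold of 3 are all assumptions for illustration.

```python
import pandas as pd

# Toy ratings (made-up data, for illustration only)
ratings = pd.DataFrame({
    "title": ["A", "A", "A", "A", "B", "C", "C"],
    "rating": [4, 5, 4, 5, 5, 2, 3],
})

# Per-title mean rating (R) and number of ratings (v)
stats = ratings.groupby("title")["rating"].agg(["mean", "count"])

C_mean = ratings["rating"].mean()  # global mean over all individual ratings
M_toy = 3                          # hypothetical threshold for this toy data

stats["bayesian_avg"] = (
    (C_mean * M_toy + stats["mean"] * stats["count"]) / (M_toy + stats["count"])
)

print(stats.sort_values("bayesian_avg", ascending=False))
```

In this toy data, title B has the highest raw mean from a single rating, but title A ranks first once the number of ratings is taken into account.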

In [37]:
#Let's now look at the most popular movies
popularity_df = popularity_df.sort_values('bayesian_avg', ascending = False).reset_index()
popularity_df.head(10)

Unnamed: 0,title,median,count,bayesian_avg
0,American Beauty (1999),5.0,3428,4.98093
1,Star Wars: Episode IV - A New Hope (1977),5.0,2991,4.978175
2,Saving Private Ryan (1998),5.0,2653,4.975428
3,"Matrix, The (1999)",5.0,2590,4.974838
4,"Silence of the Lambs, The (1991)",5.0,2578,4.974722
5,Raiders of the Lost Ark (1981),5.0,2514,4.974087
6,Fargo (1996),5.0,2513,4.974077
7,"Sixth Sense, The (1999)",5.0,2459,4.973515
8,"Princess Bride, The (1987)",5.0,2318,4.971927
9,Schindler's List (1993),5.0,2304,4.971759


In [38]:
popularity_df[popularity_df['title'] == 'Schlafes Bruder (Brother of Sleep) (1995)']

Unnamed: 0,title,median,count,bayesian_avg
1641,Schlafes Bruder (Brother of Sleep) (1995),5.0,1,3.058824


Schlafes Bruder (Brother of Sleep) (1995) now has a Bayesian average of about 3.06, and the top movies we see are all heavily rated classics from the late 20th century.
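Once the scores exist, serving popularity-based recommendations comes down to returning the highest-scoring titles. The helper below is a hypothetical sketch: `recommend_popular` and its `exclude_titles` parameter are not part of the notebook, and the small `scores` frame stands in for `popularity_df` using three rows from the output above.

```python
import pandas as pd

def recommend_popular(scores, n=5, exclude_titles=()):
    """Return the n highest-scoring titles, skipping any in exclude_titles."""
    candidates = scores[~scores["title"].isin(list(exclude_titles))]
    return candidates.nlargest(n, "bayesian_avg")["title"].tolist()

# Small stand-in for popularity_df (three rows from the output above)
scores = pd.DataFrame({
    "title": [
        "American Beauty (1999)",
        "Star Wars: Episode IV - A New Hope (1977)",
        "Schlafes Bruder (Brother of Sleep) (1995)",
    ],
    "bayesian_avg": [4.98093, 4.978175, 3.058824],
})

# Recommend two titles to a user who has already rated American Beauty
top = recommend_popular(scores, n=2, exclude_titles=["American Beauty (1999)"])
print(top)
```

Every user receives the same ranked list, apart from the titles excluded for them, which is what makes the approach simple to run but not personalized.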

### **Pros:**

1. **Ease of Implementation**: Popularity-based systems are simple to set up, requiring only basic metrics like sales or views, making them quick to deploy.

2. **Scalability**: They handle large datasets well, suitable for platforms with extensive user bases or item catalogs.

3. **Good for New Users**: They provide relevant recommendations for users who haven't yet built up a profile or history.

### **Cons:**

1. **Lack of Personalization**: Recommendations do not reflect individual preferences, potentially leading to irrelevant suggestions.

2. **Over-Promotion of Popular Items**: Popular items might overshadow niche or lesser-known but equally suitable items.

3. **Doesn't Address Niche Markets**: Less common interests or tastes might be overlooked, limiting the diversity of recommendations.