### Chapter 2: (Heuristics) Popularity-based recommendations

In [None]:
#Importing libraries
import pandas as pd
_______ # Import numpy
import matplotlib.pyplot as plt

In [None]:
#Reading the dataset
df = pd._______('data-1m/dataset_combined.csv')
df.head()

In [None]:
#Finding the movies with the highest median rating
popularity_df = df[['title', 'rating']].groupby('_______')['rating'].agg(['median', 'count']).sort_values('median', ascending=False)
popularity_df.head(10)

Schlafes Bruder (Brother of Sleep) (1995) has a median rating of 5, but only 1 person has rated it.
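
The fragility of a raw median ranking can be reproduced on a tiny, made-up table. The sketch below is illustrative only: the column names `movie` and `stars` and all values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ratings (not from the real dataset)
toy = pd.DataFrame({
    "movie": ["Obscure Film"] + ["Well-Known Film"] * 5,
    "stars": [5, 4, 5, 4, 3, 4],
})

# Median rating and number of ratings per movie, best median first
toy_stats = (
    toy.groupby("movie")["stars"]
    .agg(["median", "count"])
    .sort_values("median", ascending=False)
)
print(toy_stats)
```

The film with a single rating lands on top even though the other film has five times as much evidence behind its score.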

A better way to evaluate movie popularity is to use a **Bayesian average**. It's commonly used in rating systems (like **IMDb's Top 250**) to provide more reliable rankings when items have varying numbers of ratings.

A Bayesian average is a weighted average that helps account for different sample sizes and prevents items with very few ratings from dominating the rankings. 

Here's the Bayesian average formula:

$$
\text{Bayesian Average} = \frac{C \times M + R \times v}{M + v}
$$

Where:
- ***C*** = global mean rating
- ***M*** = minimum ratings required (threshold)
- ***R*** = mean rating for the item (the code below uses the per-movie median in its place)
- ***v*** = number of ratings for the item

In probabilistic terms, this is essentially a weighted average between the prior (the global mean) and the observed data (the item's mean rating): the prior carries weight M / (M + v) and the item's own ratings carry weight v / (M + v). The more ratings an item has, the less the prior matters.
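
As a quick check on the formula, here is a minimal scalar sketch. The function name `bayesian_average` and the numbers (a global mean of 3.5 and a threshold of 20) are assumptions chosen for illustration, not values from the dataset.

```python
def bayesian_average(R, v, C, M):
    """Shrink an item's rating R (based on v ratings) toward the prior C with weight M."""
    return (C * M + R * v) / (M + v)

# Hypothetical prior: global mean 3.5, threshold of 20 ratings
C, M = 3.5, 20

few = bayesian_average(R=5.0, v=1, C=C, M=M)      # a single 5-star rating
many = bayesian_average(R=4.5, v=2000, C=C, M=M)  # 2000 ratings averaging 4.5

print(round(few, 3), round(many, 3))
```

With only one rating, the score stays close to the prior, while the heavily rated item keeps a score close to its own average. An item with no ratings at all (`v = 0`) receives exactly `C`.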

In [None]:
def calculate_bayesian_avg(df, C, M):
    """
    Calculate Bayesian average for movies
    df: DataFrame with 'median' and 'count' columns
    C: prior mean (global mean rating)
    M: minimum votes required
    """
    return (C * M + df['median'] * df['_______']) / (M + df['count'])

# Estimate the prior C as the mean of the per-movie median ratings
C = popularity_df['median']._______()

# Set the minimum-votes threshold (M).
# A reasonable data-driven choice is the 25th percentile
# of the per-movie rating counts.
M = popularity_df['count']._______(0.25)

# Add Bayesian average to the dataframe
popularity_df['bayesian_avg'] = calculate_bayesian_avg(popularity_df, C, M)
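
The threshold M controls how hard sparse items are pulled toward the prior. The sketch below uses hypothetical numbers (a prior of 3.5 and a few candidate thresholds) to show the effect on a movie with a single 5-star rating; it is not tied to the dataset.

```python
import numpy as np

C = 3.5                                  # hypothetical global mean
R, v = 5.0, 1                            # one 5-star rating
thresholds = np.array([0, 1, 10, 100])   # candidate values of M

# Bayesian average for each candidate threshold
scores = (C * thresholds + R * v) / (thresholds + v)

for M_candidate, score in zip(thresholds, scores):
    print(f"M={M_candidate:>3}: {score:.3f}")
```

With a threshold of 0 the score is just the raw rating; as the threshold grows, the score moves steadily toward the prior.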

In [None]:
#Let's now look at the most popular movies
popularity_df = popularity_df.sort_values('_______', ascending = False).reset_index()
popularity_df.head(10)

In [None]:
popularity_df[popularity_df['title'] == 'Schlafes Bruder (Brother of Sleep) (1995)']

Schlafes Bruder (Brother of Sleep) (1995) now has a Bayesian average of 3, and the top movies we see are all classics of the 20th century.

### **Pros:**

1. **Ease of Implementation**: Popularity-based systems are simple to set up, requiring only basic metrics like sales or views, making them quick to deploy.

2. **Scalability**: They handle large datasets well, suitable for platforms with extensive user bases or item catalogs.

3. **Good for New Users**: They provide relevant recommendations for users who haven't yet built up a profile or history.

### **Cons:**

1. **Lack of Personalization**: Recommendations do not reflect individual preferences, potentially leading to irrelevant suggestions.

2. **Over-Promotion of Popular Items**: Popular items might overshadow niche or lesser-known but equally suitable items.

3. **Doesn't Address Niche Markets**: Less common interests or tastes might be overlooked, limiting the diversity of recommendations.