## Find the Most Critical Raters

Define a critical rater as one who gives low ratings to movies.

Write code to process a `ratings.csv` file and identify the 10 most critical raters, that is, the 10 raters with the lowest average rating. Only raters with at least 5 ratings should be considered.

## Step 0: Create a Small Dataset for Testing

We want the dataset to be small enough to easily verify the results manually, but also to include a variety of users with different rating behaviors.

See the file `test_ratings.csv` for the dataset we will use for testing. It has four users with a total of 8 ratings for three movies.
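
The file itself is not reproduced in this notebook, but the rows displayed in Step 1 are enough to rebuild it. The following is a minimal sketch, assuming the file holds exactly those eight rows and was written without an index column:

```python
import pandas as pd

# Rows copied from the Step 1 output; the file is assumed to have no index column.
test_rows = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3, 4, 4, 4],
    'movieId': ['Movie1', 'Movie2', 'Movie1', 'Movie2', 'Movie3',
                'Movie1', 'Movie2', 'Movie3'],
    'rating':  [4, 5, 3, 2, 5, 5, 4, 3],
})

# index=False keeps pandas from writing the row index as an extra column.
test_rows.to_csv('test_ratings.csv', index=False)
```

The users were chosen to cover different behaviors: user 2 has a single rating, users 1 and 3 have two each, and user 4 has three.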


## Step 1: Load the Ratings Data
Load the ratings CSV file into a pandas DataFrame. This step ensures the data is available for analysis and in the correct format.

In [21]:
import pandas as pd
test_df = pd.read_csv('test_ratings.csv')
test_df

Unnamed: 0,userId,movieId,rating
0,1,Movie1,4
1,1,Movie2,5
2,2,Movie1,3
3,3,Movie2,2
4,3,Movie3,5
5,4,Movie1,5
6,4,Movie2,4
7,4,Movie3,3


In [22]:
# automated test that the first and last rows are as expected
assert test_df.iloc[0].tolist() == [1, 'Movie1', 4], "First row does not match expected values"
assert test_df.iloc[-1].tolist() == [4, 'Movie3', 3], "Last row does not match expected values"
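
Beyond spot-checking rows, the format itself can be checked right after loading. The sketch below is one possible approach, not part of the original notebook: `check_ratings_format` and `REQUIRED_COLUMNS` are hypothetical names, and the required columns are assumed from the test file.

```python
import io
import pandas as pd

# Assumption: these are the columns the later steps rely on.
REQUIRED_COLUMNS = {'userId', 'movieId', 'rating'}

def check_ratings_format(df):
    """Raise ValueError if df lacks a required column or has non-numeric ratings."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    if not pd.api.types.is_numeric_dtype(df['rating']):
        raise ValueError("Column 'rating' must be numeric")
    return df

# In-memory stand-in for a ratings file, so the sketch runs without test_ratings.csv.
sample_csv = io.StringIO("userId,movieId,rating\n1,Movie1,4\n1,Movie2,5\n")
sample_df = check_ratings_format(pd.read_csv(sample_csv))
```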


## Step 2: Compute Rating Count and Average Rating per User
Group the DataFrame by user and calculate the number of ratings and the mean rating for each user. This step summarizes each user's rating behavior and prepares the data for filtering and ranking.

The results for the test dataset should look like this:

| userId | rating_count | average_rating |
|--------|--------------|----------------|
| 1      | 2            | 4.5            |
| 2      | 1            | 3.0            |
| 3      | 2            | 3.5            |
| 4      | 3            | 4.0            |

In [23]:
def user_rating_stats(df):
    """
    Given a DataFrame with columns 'userId' and 'rating', return a DataFrame with columns:
    - userId
    - rating_count: number of ratings for each user
    - average_rating: average rating for each user
    """
    stats = df.groupby('userId')['rating'].agg(rating_count='count', average_rating='mean').reset_index()
    return stats

In [26]:
# Automated test for user_rating_stats using existing test_df
expected = pd.DataFrame({
    'userId': [1, 2, 3, 4],
    'rating_count': [2, 1, 2, 3],
    'average_rating': [4.5, 3.0, 3.5, 4.0]
})

test_rating_stats = user_rating_stats(test_df)
display(test_rating_stats)

# Sort for comparison
result_sorted = test_rating_stats.sort_values('userId').reset_index(drop=True)
expected_sorted = expected.sort_values('userId').reset_index(drop=True)

assert all(result_sorted['rating_count'] == expected_sorted['rating_count'])
assert all(abs(result_sorted['average_rating'] - expected_sorted['average_rating']) < 0.01)
print('Test passed!')

Unnamed: 0,userId,rating_count,average_rating
0,1,2,4.5
1,2,1,3.0
2,3,2,3.5
3,4,3,4.0


Test passed!
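
Column-by-column assertions work, but pandas also ships a whole-frame comparison, `pd.testing.assert_frame_equal`, which checks column names, index, and values in one call. A minimal sketch of the same test written that way, with the test data inlined so it runs on its own:

```python
import pandas as pd

def user_rating_stats(df):
    # Same aggregation as the function defined in this step.
    return (df.groupby('userId')['rating']
              .agg(rating_count='count', average_rating='mean')
              .reset_index())

ratings = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3, 4, 4, 4],
    'rating': [4, 5, 3, 2, 5, 5, 4, 3],
})
expected_stats = pd.DataFrame({
    'userId': [1, 2, 3, 4],
    'rating_count': [2, 1, 2, 3],
    'average_rating': [4.5, 3.0, 3.5, 4.0],
})

# Floats are compared with a tolerance by default; check_dtype=False
# ignores differences such as int32 vs. int64.
pd.testing.assert_frame_equal(user_rating_stats(ratings), expected_stats,
                              check_dtype=False)
```

If the frames differ, the raised `AssertionError` names the column and the mismatching values, which tends to be easier to debug than a bare `assert all(...)`.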


## Step 3: Filter Users with Too Few Ratings
Filter out users who have rated fewer than 5 movies. Do it with a function that takes a parameter for the minimum number of ratings, so that we can test it with different values even on our small test dataset.

For example, if we set the minimum rating count to 2, the filtered results for the test dataset should look like this:

| userId | rating_count | average_rating |
|--------|--------------|----------------|
| 1      | 2            | 4.5            |
| 3      | 2            | 3.5            |
| 4      | 3            | 4.0            |


In [28]:
def filter_users_by_rating_count(stats_df, min_count=5):
    """
    Given a DataFrame with columns 'userId', 'rating_count', and 'average_rating',
    return a DataFrame with only users who have at least min_count ratings.
    """
    return stats_df[stats_df['rating_count'] >= min_count].reset_index(drop=True)

In [30]:
# Automated test for filter_users_by_rating_count
import pandas as pd

filtered_test_results = filter_users_by_rating_count(test_rating_stats, min_count=2)

expected = pd.DataFrame(
{
    'userId': [1, 3, 4],
    'rating_count': [2, 2, 3],
    'average_rating': [4.5, 3.5, 4.0]
})


assert all(filtered_test_results['userId'] == expected['userId'])
assert all(filtered_test_results['rating_count'] == expected['rating_count'])
assert all(abs(filtered_test_results['average_rating'] - expected['average_rating']) < 0.01)
print('Test passed!')

Test passed!
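
Because the threshold is a parameter, the boundary cases can be exercised on the same four users. The sketch below inlines the Step 2 statistics and tries three thresholds; `kept_by_threshold` is an illustrative name, not part of the notebook.

```python
import pandas as pd

def filter_users_by_rating_count(stats_df, min_count=5):
    # Same filter as the function defined in this step.
    return stats_df[stats_df['rating_count'] >= min_count].reset_index(drop=True)

stats = pd.DataFrame({
    'userId': [1, 2, 3, 4],
    'rating_count': [2, 1, 2, 3],
    'average_rating': [4.5, 3.0, 3.5, 4.0],
})

# min_count=1 keeps everyone, 3 keeps only user 4,
# and 5 (the default) leaves no users in the test dataset.
kept_by_threshold = {
    n: filter_users_by_rating_count(stats, min_count=n)['userId'].tolist()
    for n in (1, 3, 5)
}
```

The empty result at `min_count=5` is worth noting: with the default threshold, the test dataset produces no rows, which is why the tests here pass a smaller value.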


## Step 4: Sort and Select Top Users
Sort users by average rating in ascending order and select the first n users. Because the lowest averages come first, this step identifies the most critical raters.

For example, if we set n=2, the top 2 most critical raters from the filtered results (minimum rating count of 2) should be:

| userId | rating_count | average_rating |
|--------|--------------|----------------|
| 3      | 2            | 3.5            |
| 4      | 3            | 4.0            |

In [32]:
def select_top_critical_raters(stats_df, top_n=10):
    """
    Given a DataFrame with columns 'userId', 'rating_count', and 'average_rating',
    return the top_n users with the lowest average ratings (most critical raters).
    """
    return stats_df.sort_values('average_rating').head(top_n).reset_index(drop=True)

In [33]:
# Automated test for select_top_critical_raters using filtered_test_results
expected = pd.DataFrame({
    'userId': [3, 4],
    'rating_count': [2, 3],
    'average_rating': [3.5, 4.0]
}).reset_index(drop=True)

final_test_result = select_top_critical_raters(filtered_test_results, top_n=2)

assert all(final_test_result['userId'] == expected['userId'])
assert all(final_test_result['rating_count'] == expected['rating_count'])
assert all(abs(final_test_result['average_rating'] - expected['average_rating']) < 0.01)
print('Test passed!')

Test passed!
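
One caveat with sort-then-head: `sort_values` uses `kind='quicksort'` by default, which is not guaranteed to be stable, so users tied on average rating at the cut-off may come out in either order. A possible alternative, shown here only as a sketch, is `DataFrame.nsmallest`, whose `keep` parameter makes the tie rule explicit:

```python
import pandas as pd

# The filtered test statistics from Step 3 (min_count=2), inlined.
filtered = pd.DataFrame({
    'userId': [1, 3, 4],
    'rating_count': [2, 2, 3],
    'average_rating': [4.5, 3.5, 4.0],
})

# nsmallest returns the n rows with the lowest values, sorted ascending.
# keep='first' (the default) resolves ties by original row order.
top_two = filtered.nsmallest(2, 'average_rating', keep='first').reset_index(drop=True)
```

On the test data there are no ties, so this selects the same two users as `select_top_critical_raters`.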


## Step 5: Run on the Real Dataset
Now run the whole sequence of functions on the real dataset to get our answer.

In [34]:
real_ratings_df = pd.read_csv('movielens/ratings.csv')
real_stats = user_rating_stats(real_ratings_df)
filtered_real_stats = filter_users_by_rating_count(real_stats, min_count=5)
top_critical_raters = select_top_critical_raters(filtered_real_stats, top_n=10)
top_critical_raters

Unnamed: 0,userId,rating_count,average_rating
0,442,20,1.275
1,139,194,2.14433
2,508,24,2.145833
3,153,179,2.217877
4,567,385,2.245455
5,311,28,2.339286
6,298,939,2.363685
7,517,400,2.38625
8,308,115,2.426087
9,3,39,2.435897
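
For reuse, the three steps can be folded into a single call. The wrapper below is a sketch, and `find_most_critical_raters` is a hypothetical name; it inlines the same aggregation, filter, and sort used in Steps 2 to 4 and is checked here against the small test data rather than the real file.

```python
import pandas as pd

def find_most_critical_raters(ratings_df, min_count=5, top_n=10):
    """Hypothetical wrapper: per-user stats, then filter, then sort ascending."""
    stats = (ratings_df.groupby('userId')['rating']
             .agg(rating_count='count', average_rating='mean')
             .reset_index())
    eligible = stats[stats['rating_count'] >= min_count]
    return eligible.sort_values('average_rating').head(top_n).reset_index(drop=True)

# Same ratings as the test dataset, inlined so the sketch runs on its own.
small = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3, 4, 4, 4],
    'rating': [4, 5, 3, 2, 5, 5, 4, 3],
})
top_small = find_most_critical_raters(small, min_count=2, top_n=2)
```

With `min_count=2` and `top_n=2`, this returns users 3 and 4, matching the Step 4 test.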
