# Summary
This MovieLens dataset (`ml-latest-small`), released by GroupLens (University of Minnesota), contains 100,836 movie ratings from 610 users. A brief exploratory analysis shows the following:

1. Although there are over 100k ratings, the user-movie interaction matrix is sparse: the calculation below gives a sparsity above 98%. Each user rates only a small fraction of the catalog (every user in this dataset has at least 20 ratings, but the catalog holds 9,742 movies). Most entries are therefore missing, and the challenge is to estimate what rating a user would have given an unseen movie so that we can recommend movies the user is likely to enjoy.

    ```python
    # Calculate sparsity 
    num_users = ratings['userId'].nunique()
    num_movies = movies['movieId'].nunique()
    num_interactions = len(ratings)

    matrix_size = num_users * num_movies
    sparsity = 1 - (num_interactions / matrix_size)

    print("Matrix sparsity:", sparsity)
    ```
2. The histogram of ratings per movie shows that a small number of movies receive many ratings while the vast majority receive very few: half of the rated movies have 3 or fewer ratings, and the most-rated movie has 329. User attention is concentrated on a small portion of the catalog, so popular movies dominate the user-movie interactions. A naive recommender would therefore tend to recommend popular movies, first because they have ratings at all and second because many of them are rated highly. This is popularity bias.
    - Recommending popular movies can work, but it misses the opportunity to provide personalized recommendations.
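3. The heavy-user group (top 10% by number of ratings) and the cold-user group (bottom 10%) contain 61 and 63 users respectively. These are group sizes, not average ratings, and they are close by construction because each group is roughly a decile of the 610 users. The counts therefore say nothing about whether the two groups behave differently. The ratings-per-user summary (minimum 20, median 70.5, maximum 2,698) indicates that activity differs substantially between the two ends, so whether heavy vs cold segmentation helps user-level modeling should be judged by comparing the groups' rating behavior, not their sizes.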
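4. Popular and long-tail movies differ sharply in number: 976 movies fall at or above the 90th-percentile rating count, while 3,446 fall at or below the 10th-percentile count. The long-tail group is far larger than 10% of the 9,724 rated movies because the 10th-percentile threshold is a single rating, which many movies share. The catalog is thus highly skewed toward rarely rated items, which points to a heavy-tailed item distribution.
    - Strategies for sparse long-tail items, such as content features or smoothing, may be worth trying.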
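5. User activity is also skewed. Some users rate over 350 movies (the maximum is 2,698), while half of the users rate about 70 or fewer; every user has at least 20 ratings. A small number of very active users pull the mean (about 165) well above the median.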
6. Rating bias: about 60% of ratings fall between 3 and 4, and ratings above 4 are not uncommon (about 22%), whereas ratings of 2.5 or below make up only about 19%. This suggests that users are more likely to rate a movie they liked, and that a user who dislikes a movie is less likely to rate it at all. The presence of a rating is therefore an implicit signal in addition to the explicit score: we learn a lot about users who liked a movie, but much less about those who did not.
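
The smoothing idea mentioned for long-tail items can be illustrated with a damped mean, which pulls the average of rarely rated movies toward the global mean. This is only a sketch of one possible smoothing scheme, not something the analysis above applies: the toy `ratings` frame, the `damping` pseudo-count of 5, and the column names `raw_mean` and `damped_mean` are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data (NOT taken from MovieLens): movie 10 has five high
# ratings, movie 20 has a single 5.0, movie 30 has three low ratings.
ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "movieId": [10, 10, 10, 10, 10, 20, 30, 30, 30],
    "rating":  [4.5, 4.5, 5.0, 4.5, 4.5, 5.0, 3.0, 2.5, 3.0],
})

damping = 5  # assumed pseudo-count: acts like 5 extra ratings at the global mean
global_mean = ratings["rating"].mean()

stats = ratings.groupby("movieId")["rating"].agg(["sum", "count"])
stats["raw_mean"] = stats["sum"] / stats["count"]
# Damped mean: (sum of ratings + k * global mean) / (number of ratings + k)
stats["damped_mean"] = (stats["sum"] + damping * global_mean) / (stats["count"] + damping)

print(stats.sort_values("damped_mean", ascending=False))
```

With the raw mean, the movie with a single 5.0 rating ranks first; with the damped mean, the movie backed by five ratings overtakes it. The right pseudo-count depends on the data and would need to be tuned.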


In [1]:
import requests
import zipfile
import io
import pandas as pd

In [2]:
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
response = requests.get(url)
zip_file = zipfile.ZipFile(io.BytesIO(response.content))
print(zip_file.namelist())

['ml-latest-small/', 'ml-latest-small/links.csv', 'ml-latest-small/tags.csv', 'ml-latest-small/ratings.csv', 'ml-latest-small/README.txt', 'ml-latest-small/movies.csv']


In [7]:
with zip_file.open("ml-latest-small/movies.csv") as f:
    movies = pd.read_csv(f)
movies.head(3)

   movieId                    title                                       genres
0        1         Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2           Jumanji (1995)                   Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)                               Comedy|Romance


In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [5]:
with zip_file.open("ml-latest-small/ratings.csv") as f:
    ratings = pd.read_csv(f)
ratings.head(3)

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224


In [47]:
# Calculate sparsity 
num_users = ratings['userId'].nunique()
num_movies = movies['movieId'].nunique()
num_interactions = len(ratings)

matrix_size = num_users * num_movies
sparsity = 1 - (num_interactions / matrix_size)

# Convert to a percentage before rounding to avoid floating-point noise in the printed value
sparsity_percent = round(sparsity * 100, 3)

print("Matrix sparsity:", sparsity_percent)


Matrix sparsity: 98.303
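
As a cross-check, the sparsity figure can be reproduced from the row counts reported in this notebook alone. The sketch below is illustrative: `sparsity_percent` is a hypothetical helper, not part of the original analysis, and the second call assumes one might prefer to count only the 9,724 movies that have at least one rating instead of the full 9,742-movie catalog.

```python
# Counts copied from the notebook outputs (ratings.info(), movies.info(),
# and the describe() summaries of ratings per user and per movie).
num_users = 610
num_catalog_movies = 9742   # all movies listed in movies.csv
num_rated_movies = 9724     # movies with at least one rating
num_interactions = 100836


def sparsity_percent(n_users, n_items, n_interactions):
    """Percentage of empty cells in an n_users x n_items interaction matrix."""
    return round((1 - n_interactions / (n_users * n_items)) * 100, 3)


catalog_sparsity = sparsity_percent(num_users, num_catalog_movies, num_interactions)
rated_sparsity = sparsity_percent(num_users, num_rated_movies, num_interactions)

print("Sparsity over full catalog:", catalog_sparsity)
print("Sparsity over rated movies only:", rated_sparsity)
```

The two denominators differ by only 18 movies, so the choice barely moves the result; either way more than 98% of the matrix is empty.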


In [9]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
with zip_file.open("ml-latest-small/tags.csv") as f:
    tags = pd.read_csv(f)
tags.head(3)

   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992


In [11]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


# EDA

In [12]:
import plotly.express as px

In [16]:
rating_counts = ratings["rating"].value_counts().sort_index()
rating_probs = ratings["rating"].value_counts(normalize=True).sort_index()

print("Counts:\n", rating_counts)
print("\nProportions:\n", rating_probs)

Counts:
 rating
0.5     1370
1.0     2811
1.5     1791
2.0     7551
2.5     5550
3.0    20047
3.5    13136
4.0    26818
4.5     8551
5.0    13211
Name: count, dtype: int64

Proportions:
 rating
0.5    0.013586
1.0    0.027877
1.5    0.017762
2.0    0.074884
2.5    0.055040
3.0    0.198808
3.5    0.130271
4.0    0.265957
4.5    0.084801
5.0    0.131015
Name: proportion, dtype: float64
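
To make the skew toward positive ratings concrete, the counts above can be grouped into bands. The following is a small sketch using only the printed counts; the band boundaries (2.5 and below, 3.0 to 4.0, 4.5 and above) and the helper `share` are assumptions made for illustration, and other cut points would be equally defensible.

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {
    0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
    3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211,
}
total = sum(rating_counts.values())


def share(lo, hi):
    """Fraction of all ratings whose value lies in the closed interval [lo, hi]."""
    return sum(c for r, c in rating_counts.items() if lo <= r <= hi) / total


low_share = share(0.5, 2.5)    # assumed "disliked" band
mid_share = share(3.0, 4.0)    # assumed "neutral to liked" band
high_share = share(4.5, 5.0)   # assumed "strongly liked" band
mean_rating = sum(r * c for r, c in rating_counts.items()) / total

print(f"<= 2.5: {low_share:.1%}, 3.0-4.0: {mid_share:.1%}, >= 4.5: {high_share:.1%}")
print(f"Mean rating: {mean_rating:.3f}")
```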


In [17]:
rating_fig = px.bar(
    rating_counts,
    x=rating_counts.index,
    y=rating_counts.values,
    labels={"x": "Rating", "y": "Count"},
    title="Distribution of Ratings (0.5–5)"
)
rating_fig.show()

In [19]:
ratings_per_user = ratings.groupby("userId")["movieId"].count()
print(ratings_per_user.describe())

count     610.000000
mean      165.304918
std       269.480584
min        20.000000
25%        35.000000
50%        70.500000
75%       168.000000
max      2698.000000
Name: movieId, dtype: float64


In [20]:
fig = px.histogram(
    x=ratings_per_user,
    nbins=30,
    labels={"x": "Number of ratings per user", "y": "Number of users"},
    title="Distribution of Ratings per User"
)
fig.show()

In [None]:
ratings_per_movie = ratings.groupby("movieId")['rating'].count()
print(ratings_per_movie.describe())

count    9724.000000
mean       10.369807
std        22.401005
min         1.000000
25%         1.000000
50%         3.000000
75%         9.000000
max       329.000000
Name: rating, dtype: float64


In [26]:
fig = px.histogram(
    x=ratings_per_movie,
    nbins=30,
    labels={"x": "Number of ratings per movie", "y": "Number of movies"},
    title="Distribution of Ratings per Movie"
)
fig.show()

In [54]:
# Users
heavy_user_threshold = ratings_per_user.quantile(0.90)   # top 10%
cold_user_threshold  = ratings_per_user.quantile(0.10)   # bottom 10%

heavy_users = ratings_per_user[ratings_per_user >= heavy_user_threshold]
cold_users  = ratings_per_user[ratings_per_user <= cold_user_threshold]

print("Heavy users (top 10% by #ratings):", len(heavy_users))
print("Cold users (bottom 10% by #ratings):", len(cold_users))

# Movies
popular_movie_threshold   = ratings_per_movie.quantile(0.90)  # 90th percentile of rating counts
longtail_movie_threshold  = ratings_per_movie.quantile(0.10)  # 10th percentile (1 rating here, so ties enlarge the group)

popular_movies = ratings_per_movie[ratings_per_movie >= popular_movie_threshold]
longtail_movies = ratings_per_movie[ratings_per_movie <= longtail_movie_threshold]

print("Popular movies (top 10% by #ratings):", len(popular_movies))
print("Long-tail movies (bottom 10% by #ratings):", len(longtail_movies))

Heavy users (top 10% by #ratings): 61
Cold users (bottom 10% by #ratings): 63
Popular movies (top 10% by #ratings): 976
Long-tail movies (bottom 10% by #ratings): 3446
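
The heavy and cold user counts above are group sizes, so they are similar by design. To see whether the two groups actually behave differently, one option is to compare their activity directly. The sketch below uses a hypothetical `ratings_per_user` series (the values are invented, not the MovieLens counts) and an assumed `summary` layout; it is meant only to show the shape of such a comparison.

```python
import pandas as pd

# Hypothetical ratings-per-user counts for 20 users (NOT the MovieLens values).
ratings_per_user = pd.Series(
    [20, 21, 22, 25, 30, 40, 55, 70, 90, 120,
     160, 200, 300, 450, 700, 900, 1200, 1500, 2000, 2698],
    name="num_ratings",
)

heavy_threshold = ratings_per_user.quantile(0.90)
cold_threshold = ratings_per_user.quantile(0.10)

heavy = ratings_per_user[ratings_per_user >= heavy_threshold]
cold = ratings_per_user[ratings_per_user <= cold_threshold]
total_ratings = ratings_per_user.sum()

# Group size alone looks identical; median activity and share of ratings do not.
summary = pd.DataFrame(
    {
        "n_users": [len(heavy), len(cold)],
        "median_ratings": [heavy.median(), cold.median()],
        "share_of_all_ratings": [heavy.sum() / total_ratings, cold.sum() / total_ratings],
    },
    index=["heavy", "cold"],
)
print(summary)
```

In this toy example both groups contain the same number of users, yet the heavy group accounts for a far larger share of all ratings. Running the same comparison on the real `ratings_per_user` would show how far apart the two groups are in this dataset.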


In [30]:
user_df = ratings_per_user.to_frame(name="num_ratings").reset_index()

fig = px.histogram(
    user_df,
    x="num_ratings",
    nbins=30,
    title="Ratings per User with Heavy/Cold Regions"
)

fig.add_vline(x=heavy_user_threshold, line_dash="dash", annotation_text="Heavy user threshold")
fig.add_vline(x=cold_user_threshold, line_dash="dash", annotation_text="Cold user threshold")
fig.show()

In [33]:
movie_df = ratings_per_movie.to_frame(name="num_ratings").reset_index()

fig = px.histogram(
    movie_df,
    x="num_ratings",
    nbins=30,
    title="Ratings per Movie with Popular/Long-tail Regions"
)

fig.add_vline(x=popular_movie_threshold, line_dash="dash", annotation_text="Popular threshold")
fig.add_vline(x=longtail_movie_threshold, line_dash="dash", annotation_text="Long-tail threshold")
fig.show()
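
The long-tail group is much larger than 10% of movies (3,446 of 9,724 rated movies, about 35%) because a quantile threshold combined with `<=` captures every item tied at that value. The sketch below reproduces the effect on a hypothetical ten-movie series; the counts are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ratings-per-movie counts: four movies tie at a single rating.
ratings_per_movie = pd.Series([1, 1, 1, 1, 2, 3, 5, 9, 40, 300], name="num_ratings")

popular_threshold = ratings_per_movie.quantile(0.90)
longtail_threshold = ratings_per_movie.quantile(0.10)

popular = ratings_per_movie[ratings_per_movie >= popular_threshold]
longtail = ratings_per_movie[ratings_per_movie <= longtail_threshold]

# The 10th-percentile threshold equals the minimum, so every movie tied at
# that minimum lands in the "bottom 10%" group.
print("Long-tail threshold:", longtail_threshold)
print("Long-tail share of movies:", len(longtail) / len(ratings_per_movie))
print("Popular share of movies:", len(popular) / len(ratings_per_movie))
```

When many items share the minimum count, a label such as "bottom 10%" describes the threshold and not the size of the resulting group.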