## Exploring 'Ratings' table
### 1 Libraries and Athena connection

In [1]:
import pandas as pd
import os, sys

sys.path.append(os.path.abspath(".."))

from src.utils import get_athena_connection, read_sql_df
from src.config import DB_ATHENA


# Athena connection
conn = get_athena_connection()
print(f"Connected to Athena database:'{DB_ATHENA}'")

def run_sql(sql: str) -> pd.DataFrame:
    """
    Execute a SQL query on Athena (MovieLens 32M) and return a Pandas DataFrame.
    """
    return read_sql_df(sql, conn=conn)

Connected to Athena database:'movielens32m'


In [2]:
# Verification in Athena

# Show all tables in movielens32m
run_sql(f"SHOW TABLES IN {DB_ATHENA}")

# Preview the ratings_parquet table
run_sql(f"SELECT * FROM {DB_ATHENA}.ratings_parquet LIMIT 5")




Unnamed: 0,userid,movieid,rating,timestamp
0,124206,2600,3.0,2000-08-01 05:56:41
1,124206,2605,2.0,2000-08-01 05:44:59
2,124206,2676,2.0,2000-08-01 05:48:35
3,124206,2692,5.0,2000-08-01 05:52:24
4,124206,2707,4.0,2000-08-01 05:39:21


### 2 General dataset description

In [3]:
run_sql("SHOW COLUMNS FROM ratings_parquet")



Unnamed: 0,field
0,userid
1,movieid
2,rating
3,timestamp


In [4]:
run_sql(f"""
SELECT
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = '{DB_ATHENA}'
  AND table_name   = 'ratings_parquet'
ORDER BY ordinal_position
""")



Unnamed: 0,column_name,data_type,is_nullable
0,userid,integer,YES
1,movieid,integer,YES
2,rating,double,YES
3,timestamp,timestamp(3),YES


Comment

- `userId`: INTEGER  
- `movieId`: INTEGER  
- `rating`: DOUBLE  
- `timestamp`: TIMESTAMP with millisecond precision (`timestamp(3)`)

- The column `is_nullable` is `YES` for all fields.  
  - This means the table allows null (NULL) values.  
  - Because the data was loaded from external files, there are no enforced NOT NULL constraints.


### 3 Dataset individual basic exploration
#### 3.1 Dataset composition

In [5]:
#see first 10 rows

run_sql("SELECT * FROM ratings_parquet LIMIT 10")



Unnamed: 0,userid,movieid,rating,timestamp
0,124206,2600,3.0,2000-08-01 05:56:41
1,124206,2605,2.0,2000-08-01 05:44:59
2,124206,2676,2.0,2000-08-01 05:48:35
3,124206,2692,5.0,2000-08-01 05:52:24
4,124206,2707,4.0,2000-08-01 05:39:21
5,124206,2712,4.0,2000-08-01 05:44:59
6,124206,2734,5.0,2000-08-01 05:50:08
7,124206,2762,3.0,2000-08-01 05:52:55
8,124206,2858,5.0,2000-08-01 05:37:57
9,124206,2912,4.0,2000-08-03 07:30:22


#### 3.2 Missing values

In [6]:
#Count number of missing values
run_sql("""
SELECT
    COUNT(*) - COUNT(userId)    AS missing_userId,
    COUNT(*) - COUNT(movieId)   AS missing_movieId,
    COUNT(*) - COUNT(rating)    AS missing_rating,
    COUNT(*) - COUNT(timestamp) AS missing_timestamp
FROM ratings_parquet
""")




Unnamed: 0,missing_userId,missing_movieId,missing_rating,missing_timestamp
0,0,0,0,0
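
The query above relies on `COUNT(*)` counting every row while `COUNT(column)` skips NULLs, so the difference is the number of missing values in that column. A minimal sketch of that behaviour follows. It uses an in-memory SQLite table as a stand-in for Athena, and the table name and rows are invented for illustration:

```python
import sqlite3

# In-memory SQLite stand-in for Athena; table name and rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ratings (userid INTEGER, movieid INTEGER, rating REAL)")
con.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4.0), (1, 11, None), (2, 10, 3.5), (None, 12, 2.0)],
)

# COUNT(*) counts all rows; COUNT(col) ignores rows where col is NULL.
missing_userid, missing_movieid, missing_rating = con.execute(
    """
    SELECT
        COUNT(*) - COUNT(userid),
        COUNT(*) - COUNT(movieid),
        COUNT(*) - COUNT(rating)
    FROM ratings
    """
).fetchone()
con.close()

print(missing_userid, missing_movieid, missing_rating)
```

In this toy table one `userid` and one `rating` are NULL, so only those two differences are non-zero.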


#### 3.3 Basic statistics

In [7]:
#Identification of maximum, minimum values and counts of ratings
run_sql("""
SELECT
    MIN(userId)                AS min_userId,
    MAX(userId)                AS max_userId,
    COUNT(DISTINCT userId)     AS total_users,
    MIN(movieId)               AS min_movieId,
    MAX(movieId)               AS max_movieId,
    COUNT(DISTINCT movieId)    AS total_movies,
    MIN(rating)                AS min_rating,
    MAX(rating)                AS max_rating,
    AVG(rating)                AS avg_rating,
    MIN(timestamp)             AS min_timestamp,
    MAX(timestamp)             AS max_timestamp,
    COUNT(*)                   AS total_ratings
FROM ratings_parquet
""")




Unnamed: 0,min_userId,max_userId,total_users,min_movieId,max_movieId,total_movies,min_rating,max_rating,avg_rating,min_timestamp,max_timestamp,total_ratings
0,1,200948,200948,1,292757,84432,0.5,5.0,3.540396,1995-01-09 11:46:44,2023-10-13 02:29:07,32000204


Comments

- There are 200,948 unique user IDs providing ratings.  
- A total of 84,432 movies have been rated.  
- Ratings range from 0.5 to 5.0, with an average value of 3.540396.  
- The earliest rating timestamp is from January 9, 1995, at 11:46:44.  
- The most recent rating timestamp is from October 13, 2023, at 02:29:07.  
- The dataset contains a total of 32,000,204 ratings (rows).


#### 3.4 Ratings distribution
##### 3.4.1 Number of ratings and average rating per user

In [8]:
#Number of ratings and average rating per user
run_sql("""
SELECT
    userId,
    COUNT(*)              AS total_ratings,
    ROUND(AVG(rating), 2) AS avg_rating
FROM ratings_parquet
GROUP BY userId
ORDER BY total_ratings DESC, avg_rating DESC
""")




Unnamed: 0,userId,total_ratings,avg_rating
0,175325,33332,3.08
1,17035,9577,2.57
2,55653,9178,3.28
3,123465,9044,2.53
4,171795,9016,3.18
...,...,...,...
200943,160861,20,0.50
200944,135854,20,0.50
200945,69215,20,0.50
200946,5333,20,0.50


Comments
 - User 175325 has rated 33,332 movies, with an average rating of 3.08. At one movie per day, it would take about 91 years to watch them all, so this account is probably a bot or an agency rather than an individual viewer. The count is also far above the next most active users, who each rated between roughly 9,000 and 9,600 movies.
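
The arithmetic behind the 91-year estimate can be checked directly. The figures come from the query output above; the one-movie-per-day pace is the assumption made in the comment:

```python
# Figures taken from the query output above.
top_user_ratings = 33_332
second_user_ratings = 9_577
movies_per_day = 1  # assumption: one movie watched and rated per day

years_needed = top_user_ratings / (movies_per_day * 365)
ratio_to_second = top_user_ratings / second_user_ratings

print(f"{years_needed:.1f} years, {ratio_to_second:.2f}x the second most active user")
```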

In [9]:
# Number of users with only one rating
run_sql("""
SELECT COUNT(*) AS users_with_one_rating
FROM (
    SELECT userId
    FROM ratings_parquet
    GROUP BY userId
    HAVING COUNT(*) = 1
)
""")




Unnamed: 0,users_with_one_rating
0,0


In [10]:
#Total number of ratings
run_sql("SELECT COUNT(*) FROM ratings_parquet")




Unnamed: 0,_col0
0,32000204


##### 3.4.4 Minimum number of ratings provided by a user

In [11]:
# Minimum number of ratings provided by a user
run_sql("""
SELECT MIN(cnt) AS min_ratings_per_user
FROM (
    SELECT COUNT(*) AS cnt
    FROM ratings_parquet
    GROUP BY userId
)
""")





Unnamed: 0,min_ratings_per_user
0,20


In [12]:
#Number of users with less than 20 ratings
run_sql("""
SELECT COUNT(*) AS users_below_20
FROM (
    SELECT userId, COUNT(*) AS cnt
    FROM ratings_parquet
    GROUP BY userId
)
WHERE cnt < 20
""")




Unnamed: 0,users_below_20
0,0


In [13]:
run_sql("SELECT COUNT(DISTINCT userId) FROM ratings_parquet")




Unnamed: 0,_col0
0,200948


Comments:

While exploring the data, we noticed something unexpected: there is not a single user with fewer than 20 ratings.
This is unusual, because our understanding of the official MovieLens datasets, including the (Full)33M version, is that they typically contain many users who rated only a handful of movies, sometimes just one.

We verified that our table is complete and contains the full 32,000,204 ratings, with no missing or truncated data.
To investigate further, we counted the users with exactly one rating and the users with fewer than 20 ratings. Both counts are zero, and the minimum number of ratings per user is exactly 20.

This indicates that the MovieLens 32M dataset we are using is a filtered release, in which users with fewer than 20 ratings were removed before distribution.
The dataset is therefore internally consistent, but it does not include the long tail of low-activity users that we expect in the original GroupLens (Full)33M release.
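
If several thresholds need to be checked at once, one option is a single query with one conditional sum per threshold. The sketch below is illustrative only: it runs against an in-memory SQLite stand-in with invented rows, and the thresholds (2, 10, 20) are arbitrary choices rather than values used in the notebook. The same `CASE` expressions should carry over to Athena, but that has not been verified here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ratings (userid INTEGER, movieid INTEGER)")
# Invented data: user 1 has 1 rating, user 2 has 5, user 3 has 25.
rows = [(1, 1)] + [(2, m) for m in range(5)] + [(3, m) for m in range(25)]
con.executemany("INSERT INTO ratings VALUES (?, ?)", rows)

# One pass over the per-user counts, one conditional sum per threshold.
below_2, below_10, below_20 = con.execute(
    """
    WITH per_user AS (
        SELECT userid, COUNT(*) AS cnt
        FROM ratings
        GROUP BY userid
    )
    SELECT
        SUM(CASE WHEN cnt < 2  THEN 1 ELSE 0 END),
        SUM(CASE WHEN cnt < 10 THEN 1 ELSE 0 END),
        SUM(CASE WHEN cnt < 20 THEN 1 ELSE 0 END)
    FROM per_user
    """
).fetchone()
con.close()

print(below_2, below_10, below_20)
```

On a table filtered like the one explored here, all three sums would be expected to come back as zero.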

In [14]:
run_sql("SHOW COLUMNS FROM movies_parquet")



Unnamed: 0,field
0,movieid
1,title
2,genres


##### 3.4.2 Average ratings per movie ordered from best to worst rating

In [15]:
# Average rating per movie ordered from best to worst rating
run_sql("""
SELECT
    m.movieid,
    m.title,
    ROUND(AVG(r.rating), 2) AS avg_rating,
    COUNT(*)                AS total_ratings
FROM ratings_parquet AS r
JOIN movies_parquet  AS m
  ON r.movieid = m.movieid
GROUP BY m.movieid, m.title
ORDER BY avg_rating DESC, total_ratings DESC
""")




Unnamed: 0,movieid,title,avg_rating,total_ratings
0,234089,"Love, Kennedy (2017)",5.0,4
1,200016,The Nagano Tapes (2018),5.0,3
2,165787,Lonesome Dove Church (2014),5.0,3
3,202936,ReMoved (2013),5.0,3
4,179731,Sound of Christmas (2016),5.0,3
...,...,...,...,...
84427,287109,Donbass. Borderland (2019),0.5,1
84428,130542,Honky,0.5,1
84429,268358,El club del paro (2021),0.5,1
84430,146876,Algorithms (2013),0.5,1


##### 3.4.2 Average ratings per movie ordered from best to worst rating (grouped by title only)

In [19]:
df_avg_ratings = run_sql("""
SELECT
    m.title,
    ROUND(AVG(r.rating), 2) AS media_rating,
    COUNT(*) AS total_ratings
FROM ratings_parquet AS r
JOIN movies_parquet AS m
    ON r.movieid = m.movieid
GROUP BY m.title
ORDER BY media_rating DESC, total_ratings DESC
""")

df_avg_ratings




Unnamed: 0,title,media_rating,total_ratings
0,"Love, Kennedy (2017)",5.0,4
1,ReMoved (2013),5.0,3
2,Lonesome Dove Church (2014),5.0,3
3,The Nagano Tapes (2018),5.0,3
4,David Attenborough's Tasmania (2018),5.0,3
...,...,...,...
84234,Strength and Honour (2007),0.5,1
84235,The Devil's Child (2021),0.5,1
84236,After Midnight (2018),0.5,1
84237,Ghost (2012),0.5,1
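
Grouping by `m.title` alone returns fewer rows than grouping by `m.movieid, m.title` (the last index shown here is 84237, versus 84430 in the previous output). That is what one would expect if several `movieid` values share the same title. A minimal sketch of the effect, and of a query that lists such titles, on an in-memory SQLite stand-in with invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (movieid INTEGER, title TEXT)")
# Invented rows: two different movie IDs share the same title.
con.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(1, "Film A (2000)"), (2, "Film A (2000)"), (3, "Film B (2010)")],
)

# Number of groups when grouping by ID and title vs. by title only.
by_id = con.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM movies GROUP BY movieid, title)"
).fetchone()[0]
by_title = con.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM movies GROUP BY title)"
).fetchone()[0]

# Titles attached to more than one movie ID.
shared = con.execute(
    """
    SELECT title, COUNT(DISTINCT movieid) AS n_ids
    FROM movies
    GROUP BY title
    HAVING COUNT(DISTINCT movieid) > 1
    """
).fetchall()
con.close()

print(by_id, by_title, shared)
```

When titles are shared, the title-only grouping merges the ratings of distinct movies into one average, so grouping by `movieid` as well is the safer choice.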


##### 3.4.3 Average ratings per movie ordered from most rated to least rated

In [16]:
# Average rating per movie ordered from most rated to least rated
run_sql("""
SELECT
    m.movieid,
    m.title,
    ROUND(AVG(r.rating), 2) AS avg_rating,
    COUNT(*)                AS total_ratings
FROM ratings_parquet r
JOIN movies_parquet  m
  ON r.movieid = m.movieid
GROUP BY m.movieid, m.title
ORDER BY total_ratings DESC, avg_rating DESC
""")



Unnamed: 0,movieid,title,avg_rating,total_ratings
0,318,"Shawshank Redemption, The (1994)",4.40,102929
1,356,Forrest Gump (1994),4.05,100296
2,296,Pulp Fiction (1994),4.20,98409
3,2571,"Matrix, The (1999)",4.16,93808
4,593,"Silence of the Lambs, The (1991)",4.15,90330
...,...,...,...,...
84427,146876,Algorithms (2013),0.50,1
84428,274991,InSearchOf (2009),0.50,1
84429,266492,Maximum Achievement: The Brian Tracy Story (2017),0.50,1
84430,275175,Knucklebones (2016),0.50,1


##### 3.4.4 Summary statistics of number of ratings per movie

In [17]:
# Distribution of the number of ratings per movie (percentiles and summary stats)
run_sql("""
WITH counts AS (
    SELECT
        movieid,
        COUNT(*) AS n_ratings
    FROM ratings_parquet
    GROUP BY movieid
)
SELECT
    approx_percentile(n_ratings, 0.25)          AS p25_ratings,
    approx_percentile(n_ratings, 0.50)          AS median_ratings,
    approx_percentile(n_ratings, 0.75)          AS p75_ratings,
    MIN(n_ratings)                              AS min_ratings,
    MAX(n_ratings)                              AS max_ratings,
    AVG(CAST(n_ratings AS DOUBLE))              AS mean_ratings
FROM counts
""")



Unnamed: 0,p25_ratings,median_ratings,p75_ratings,min_ratings,max_ratings,mean_ratings
0,2,5,25,1,102929,379.005638


### Conclusion

- About three quarters of the movies have 25 ratings or fewer (p75 = 25), and the median is only 5 ratings.
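
The gap between the median (5) and the mean (about 379) in the output above points to a heavily right-skewed distribution: a few very popular movies pull the mean far above what a typical movie receives. A small illustration with invented counts, using only the standard library:

```python
import statistics

# Invented per-movie rating counts: many small values and one very large one.
n_ratings = [1, 1, 2, 2, 3, 5, 5, 8, 25, 1000]

median = statistics.median(n_ratings)
mean = statistics.mean(n_ratings)

print(f"median={median}, mean={mean}")
```

A single large value is enough to push the mean well above the median, which is why the percentiles describe the typical movie better than the mean does.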

#### Close connection

In [20]:
conn.close()
print("Athena's connection closed.")

Athena's connection closed.


### Comparison between (Small)100K and 32M
- More ratings: 100K vs. 32M
- More users providing ratings: 610 vs. 200K
- Both datasets are filtered relative to (Full)33M: only users who rated at least 20 movies appear
- (Small)100K is representative of 32M with respect to the most-rated movies: the top 5 by total_ratings contains the same movies, although in a different order.
- Higher median number of ratings per movie: 3 vs. 5
- Higher p75 of ratings per movie (75% of the movies fall at or below it): 9 vs. 25
- The 32M dataset plausibly contains some noise that needs further investigation: for example, user ID 175325 rated more than 33K movies, equivalent to one movie a day for 91 years.