## Exploring the `ratings` table
### 1 Library and duckdb file import

In [1]:

# initial exploration of "ratings" table
import duckdb, pandas as pd
from pathlib import Path

# Connect to the database file; DuckDB creates it if it does not already exist
con = duckdb.connect("movielens100K.duckdb")

### 2 General dataset description

In [2]:
con.sql("DESCRIBE ratings").df()


Unnamed: 0,column_name,column_type,null,key,default,extra
0,userId,INTEGER,YES,,,
1,movieId,INTEGER,YES,,,
2,rating,DOUBLE,YES,,,
3,timestamp,TIMESTAMP WITH TIME ZONE,YES,,,


Comment

- Column types: `userId` and `movieId` are INTEGER, `rating` is DOUBLE, and `timestamp` is TIMESTAMP WITH TIME ZONE.
- The `null` column indicates whether the field accepts NULL values. All four columns show YES, so none has a NOT NULL constraint.
- The `key` column indicates whether the field is part of a key such as a PRIMARY KEY. It is empty for every column, so the table has no declared key.
- The `default` column shows the column's DEFAULT value. None is defined.
- The `extra` column displays additional information about the field, such as auto_increment or generated. It is empty here.


In [3]:
#see data types of each column
con.sql("PRAGMA table_info('ratings')").df()

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,userId,INTEGER,False,,False
1,1,movieId,INTEGER,False,,False
2,2,rating,DOUBLE,False,,False
3,3,timestamp,TIMESTAMP WITH TIME ZONE,False,,False


### 3 Basic exploration of the dataset
#### 3.1 Dataset composition

In [4]:
#see first 10 rows
con.sql("SELECT * FROM ratings LIMIT 10;").df()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,2000-07-30 19:45:03+01:00
1,1,3,4.0,2000-07-30 19:20:47+01:00
2,1,6,4.0,2000-07-30 19:37:04+01:00
3,1,47,5.0,2000-07-30 20:03:35+01:00
4,1,50,5.0,2000-07-30 19:48:51+01:00
5,1,70,3.0,2000-07-30 19:40:00+01:00
6,1,101,5.0,2000-07-30 19:14:28+01:00
7,1,110,4.0,2000-07-30 19:36:16+01:00
8,1,151,5.0,2000-07-30 20:07:21+01:00
9,1,157,5.0,2000-07-30 20:08:20+01:00


#### 3.2 Missing values

In [5]:
#Count number of missing values
con.sql("""
SELECT
    COUNT(*) - COUNT(userId)   AS missing_userId,
    COUNT(*) - COUNT(movieId)  AS missing_movieId,
    COUNT(*) - COUNT(rating)   AS missing_rating,
    COUNT(*) - COUNT(timestamp) AS missing_timestamp
FROM ratings
""").df()


Unnamed: 0,missing_userId,missing_movieId,missing_rating,missing_timestamp
0,0,0,0,0
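
The query above works because `COUNT(*)` counts every row, while `COUNT(column)` skips rows where that column is NULL, so the difference is the number of missing values. The sketch below illustrates this on a made-up four-row table. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without the DuckDB file; the table `toy_ratings`, the connection name `toy_con`, and all values are hypothetical.

```python
import sqlite3

# Stand-in engine (sqlite3, not DuckDB) with a hypothetical toy table.
toy_con = sqlite3.connect(":memory:")
toy_con.execute("CREATE TABLE toy_ratings (userId INTEGER, rating REAL)")
toy_con.executemany(
    "INSERT INTO toy_ratings VALUES (?, ?)",
    [(1, 4.0), (2, None), (None, 3.5), (3, None)],  # None is stored as NULL
)

# COUNT(*) counts all rows; COUNT(col) counts only rows where col is not NULL.
total, non_null_user, non_null_rating = toy_con.execute(
    "SELECT COUNT(*), COUNT(userId), COUNT(rating) FROM toy_ratings"
).fetchone()

missing_userId = total - non_null_user
missing_rating = total - non_null_rating
print(missing_userId, missing_rating)

toy_con.close()
```

In the toy table one `userId` and two `rating` values are NULL, which is what the two differences report. In the real `ratings` table every difference is 0, so no column has missing values.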


#### 3.3 Basic statistics

In [6]:
#Identification of maximum, minimum values and counts of ratings
con.sql("""
SELECT
    MIN(userId)                  AS min_userId,
    MAX(userId)                  AS max_userId,
    COUNT(DISTINCT userId)       AS total_users,
    MIN(movieId)                 AS min_movieId,
    MAX(movieId)                 AS max_movieId,
    COUNT(DISTINCT movieId)      AS total_movies,
    MIN(rating)                  AS min_rating,
    MAX(rating)                  AS max_rating,
    AVG(rating)                  AS med_rating,
    MIN(timestamp)               AS min_timestamp,
    MAX(timestamp)               AS max_timestamp,
    COUNT(*)                     AS total_ratings
        
FROM ratings
""").df()

Unnamed: 0,min_userId,max_userId,total_users,min_movieId,max_movieId,total_movies,min_rating,max_rating,med_rating,min_timestamp,max_timestamp,total_ratings
0,1,610,610,1,193609,9724,0.5,5.0,3.501557,1996-03-29 19:36:55+01:00,2018-09-24 15:27:30+01:00,100836


Comments

- There are 610 unique user IDs providing ratings.
- A total of 9,724 distinct movies have been rated. Movie IDs run from 1 to 193609, so the IDs are not contiguous.
- Ratings range from 0.5 to 5.0, with a mean of 3.501557 (the `med_rating` column, computed with `AVG`).
- The earliest rating timestamp is March 29, 1996, at 19:36:55+01:00.
- The most recent rating timestamp is September 24, 2018, at 15:27:30+01:00.
- The dataset contains a total of 100,836 ratings.
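
The totals above can be combined into a few derived figures. The following is a small arithmetic sketch, not part of the original analysis: it takes the three counts from the output of cell 6 and computes the mean number of ratings per user, the mean per movie, and the share of all possible user-movie pairs that actually have a rating. The variable names are illustrative.

```python
# Figures copied from the output of cell 6
total_users = 610
total_movies = 9724
total_ratings = 100836

ratings_per_user = total_ratings / total_users
ratings_per_movie = total_ratings / total_movies
# Fraction of the users x movies grid that contains a rating
density = total_ratings / (total_users * total_movies)

print(f"mean ratings per user:  {ratings_per_user:.1f}")
print(f"mean ratings per movie: {ratings_per_movie:.2f}")
print(f"matrix density:         {density:.2%}")
```

This gives roughly 165 ratings per user, about 10.4 per movie, and a density of about 1.7%, meaning that most user-movie pairs have no rating.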


#### 3.4 Ratings distribution
##### 3.4.1 Number of ratings and average rating per user

In [7]:
#Number of ratings and average rating per user
con.sql("""
SELECT
    userId,
    COUNT(*)              AS total_ratings,
    ROUND(AVG(rating), 2) AS media_rating
FROM ratings
GROUP BY userId
ORDER BY total_ratings DESC, media_rating DESC
""").df()



Unnamed: 0,userId,total_ratings,media_rating
0,414,2698,3.39
1,599,2478,2.64
2,474,2108,3.40
3,448,1864,2.85
4,274,1346,3.24
...,...,...,...
605,257,20,3.20
606,576,20,3.10
607,207,20,2.88
608,431,20,2.73


Comment

- All 610 users have at least 20 ratings; the most active user (userId 414) has 2,698.

##### 3.4.2 Average ratings per movie ordered from best to worst rating

In [8]:
#average ratings per movie ordered from best to worst rating
con.sql("""
SELECT
    m.title,
    ROUND(AVG(r.rating), 2) AS media_rating,
    COUNT(*)                AS total_ratings
FROM ratings r
JOIN movies m USING (movieId)
GROUP BY m.title
ORDER BY media_rating DESC, total_ratings DESC
""").df()


Unnamed: 0,title,media_rating,total_ratings
0,Lesson Faust (1994),5.0,2
1,Enter the Void (2009),5.0,2
2,Jonah Who Will Be 25 in the Year 2000 (Jonas q...,5.0,2
3,Lamerica (1994),5.0,2
4,Heidi Fleiss: Hollywood Madam (1995),5.0,2
...,...,...,...
9714,"Cincinnati Kid, The (1965)",0.5,1
9715,Son of God (2014),0.5,1
9716,"Crow, The: Wicked Prayer (2005)",0.5,1
9717,My Bloody Valentine (1981),0.5,1
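
The top of this ranking consists of titles with only two ratings each, so a 5.0 average there rests on very little evidence. One common way to make such a ranking more robust is to require a minimum number of ratings per title with a `HAVING` clause. The sketch below shows the idea on made-up data, using Python's built-in `sqlite3` as a stand-in so it runs without the DuckDB file. The table `toy_ratings`, the connection name `toy_con`, and the threshold `MIN_RATINGS` are hypothetical and were not part of the original analysis.

```python
import sqlite3

MIN_RATINGS = 5  # hypothetical threshold; choose it to suit the data

# Stand-in engine (sqlite3, not DuckDB) with hypothetical toy data.
toy_con = sqlite3.connect(":memory:")
toy_con.execute("CREATE TABLE toy_ratings (movieId INTEGER, rating REAL)")
toy_rows = (
    [(1, 5.0)] * 2                     # perfect average, but only 2 ratings
    + [(2, 4.5)] * 3 + [(2, 4.0)] * 2  # 5 ratings
    + [(3, 3.0)] * 5                   # 5 ratings
)
toy_con.executemany("INSERT INTO toy_ratings VALUES (?, ?)", toy_rows)

# Keep only movies with at least MIN_RATINGS ratings, then rank by average.
ranked = toy_con.execute(
    """
    SELECT movieId,
           ROUND(AVG(rating), 2) AS media_rating,
           COUNT(*)              AS total_ratings
    FROM toy_ratings
    GROUP BY movieId
    HAVING COUNT(*) >= ?
    ORDER BY media_rating DESC, total_ratings DESC
    """,
    (MIN_RATINGS,),
).fetchall()
print(ranked)

toy_con.close()
```

Movie 1 drops out despite its perfect average because it has only two ratings, leaving movie 2 ahead of movie 3.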


##### 3.4.3 Average ratings per movie ordered from most rated to least rated

In [9]:
#average ratings per movie ordered from most rated to least rated
con.sql("""
SELECT
    m.title,
    ROUND(AVG(r.rating), 2) AS media_rating,
    COUNT(*)                AS total_ratings
FROM ratings r
JOIN movies m USING (movieId)
GROUP BY m.title
ORDER BY total_ratings DESC, media_rating DESC
""").df()


Unnamed: 0,title,media_rating,total_ratings
0,Forrest Gump (1994),4.16,329
1,"Shawshank Redemption, The (1994)",4.43,317
2,Pulp Fiction (1994),4.20,307
3,"Silence of the Lambs, The (1991)",4.16,279
4,"Matrix, The (1999)",4.19,278
...,...,...,...
9714,War Room (2015),0.50,1
9715,Wizards of the Lost Kingdom II (1989),0.50,1
9716,Baby Boy (2001),0.50,1
9717,Alone in the Dark (2005),0.50,1


##### 3.4.4 Summary statistics of number of ratings per movie

In [10]:
con.sql("""
WITH counts AS (
  SELECT movieId, COUNT(*)::BIGINT AS n_ratings
  FROM ratings
  GROUP BY movieId
)
SELECT
  quantile_cont(n_ratings, 0.25) AS p25_ratings,
  quantile_cont(n_ratings, 0.50) AS median_ratings,
  quantile_cont(n_ratings, 0.75) AS p75_ratings,
  MIN(n_ratings) AS min_ratings,
  MAX(n_ratings) AS max_ratings,
  AVG(n_ratings)::DOUBLE AS mean_ratings
FROM counts
""").df()


Unnamed: 0,p25_ratings,median_ratings,p75_ratings,min_ratings,max_ratings,mean_ratings
0,1.0,3.0,9.0,1,329,10.369807


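Conclusion

- The number of ratings per movie is strongly right-skewed: at least a quarter of movies have a single rating (p25 = 1), half have 3 or fewer (median = 3), and three quarters have 9 or fewer (p75 = 9).
- The mean of about 10.4 ratings per movie lies above the 75th percentile because a small number of heavily rated titles (up to 329 ratings) pull it upward.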
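
To make the percentile columns easier to interpret, here is a pure-Python sketch of a continuous quantile computed by linear interpolation between neighbouring sorted values, which is the usual definition behind functions such as `quantile_cont`. The helper `quantile_cont_sketch` and the list `toy_counts` are hypothetical; this is an illustration of the definition under that assumption, not DuckDB's implementation.

```python
import math

def quantile_cont_sketch(values, q):
    """Continuous quantile via linear interpolation between neighbours."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)  # fractional index into the sorted list
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

# Hypothetical per-movie rating counts with one heavily rated title
toy_counts = [1, 1, 2, 3, 5, 9, 40]

p25 = quantile_cont_sketch(toy_counts, 0.25)
median = quantile_cont_sketch(toy_counts, 0.50)
p75 = quantile_cont_sketch(toy_counts, 0.75)
mean = sum(toy_counts) / len(toy_counts)

print(p25, median, p75, round(mean, 2))
```

In the toy data the single large value lifts the mean (about 8.7) above p75 (7.0), the same pattern seen in the real output, where the mean of 10.37 exceeds the p75 of 9.0.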

Close connection

In [11]:
con.close()
print("Connection closed")

Connection closed
