# DuckDB + MovieLens — Starter Notebook (VS Code)

This notebook is set up to help you:

- connect to a local `.duckdb` database file  
- run useful SQL queries  
- optionally import CSV files into the database  
- export results for further analysis  

> **Tip:** In VS Code, make sure you select the Python kernel from your `.venv` where `duckdb` and `pandas` are installed.



In [10]:

# If you still need to install the dependencies into your venv, run the line below; otherwise comment it out.
%pip install duckdb pandas


Note: you may need to restart the kernel to use updated packages.


In [11]:
import duckdb
import pandas as pd
from pathlib import Path

# Folder containing the MovieLens CSV files, relative to the notebook's working directory
DATA_DIR = Path("..") / "data" / "100k"

links_path = DATA_DIR / "links.csv"
movies_path = DATA_DIR / "movies.csv"
ratings_path = DATA_DIR / "ratings.csv"
tags_path = DATA_DIR / "tags.csv"
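
Before creating any tables, it can help to confirm that the four CSV files are actually where `DATA_DIR` points, since a wrong relative path is a common cause of import errors. The sketch below is one way to do that with the standard library only; the helper name `missing_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths (as Path objects) that do not point to an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Assumed layout, mirroring the cell above: ../data/100k/<name>.csv
DATA_DIR = Path("..") / "data" / "100k"
csv_paths = [DATA_DIR / f"{name}.csv" for name in ("links", "movies", "ratings", "tags")]

missing = missing_files(csv_paths)
if missing:
    print("Missing files:", ", ".join(str(p) for p in missing))
else:
    print("All CSV files found.")
```

If anything is reported as missing, adjusting `DATA_DIR` (or the kernel's working directory) is likely the first thing to check.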

In [12]:
# Open the local database file (created if it does not exist yet)
con = duckdb.connect("movielens100K.duckdb")

# Names of the tables already present in the database
existing = set(con.sql("SHOW TABLES").df()["name"].str.lower())

# movies: created only if missing; imported as-is because the inferred types are fine
if "movies" not in existing:
    con.sql(f"CREATE TABLE movies AS SELECT * FROM read_csv_auto('{movies_path}')")

# ratings: rebuilt on every run, casting the IDs and converting the epoch timestamp
con.sql(f"""
CREATE OR REPLACE TABLE ratings AS
SELECT
    userId::INT     AS userId,
    movieId::INT    AS movieId,
    rating::DOUBLE  AS rating,
    to_timestamp(CAST(timestamp AS BIGINT)) AS timestamp
FROM read_csv_auto('{ratings_path}')
""")

# tags: rebuilt on every run, with the same epoch-to-timestamp conversion
con.sql(f"""
CREATE OR REPLACE TABLE tags AS
SELECT
    userId::INT     AS userId,
    movieId::INT    AS movieId,
    tag,
    to_timestamp(CAST(timestamp AS BIGINT)) AS timestamp
FROM read_csv_auto('{tags_path}')
""")

# links: created only if missing; imported as-is
if "links" not in existing:
    con.sql(f"CREATE TABLE links AS SELECT * FROM read_csv_auto('{links_path}')")

# show final state
con.sql("SHOW TABLES").df()

# Why import the data this way?
# - userId and movieId are stored as integers rather than text
# - rating is stored as a floating-point number (DOUBLE)
# - timestamp in the CSV is a UNIX epoch (seconds since 1970-01-01 UTC),
#   so it is converted to a proper timestamp type to simplify later date/time operations

|   | name    |
|---|---------|
| 0 | links   |
| 1 | movies  |
| 2 | ratings |
| 3 | tags    |
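
To make the epoch conversion concrete, the sketch below performs the same seconds-since-1970 conversion in plain Python. It uses the standard library rather than DuckDB's `to_timestamp`, and the value `964982703` is only an illustrative epoch value, not one read from the database; `epoch_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert UNIX epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


# Illustrative value (an assumption, not read from the CSV files)
example = epoch_to_utc(964982703)
print(example.isoformat())  # 2000-07-30T18:45:03+00:00
```

Once the column holds a real timestamp, filtering by year or grouping by month no longer requires arithmetic on raw integers.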


In [13]:
# Show row counts for all tables in the current DuckDB connection
for table in con.sql("SHOW TABLES").df()["name"]:
    count = con.sql(f"SELECT COUNT(*) AS cnt FROM {table}").df()["cnt"][0]
    print(f"Table '{table}': {count} rows")

Table 'links': 86537 rows
Table 'movies': 86537 rows
Table 'ratings': 33832162 rows
Table 'tags': 2328315 rows
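
The loop above interpolates table names directly into the SQL string, which works for simple names such as `movies` but would break on names containing spaces or quotes. Below is a minimal sketch of one defensive option, assuming standard SQL double-quote identifier quoting (which DuckDB accepts); `quote_ident` and `count_query` are hypothetical helper names, not part of the original notebook.

```python
def quote_ident(name):
    """Wrap an identifier in double quotes, doubling any embedded double quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def count_query(table):
    """Build a row-count query for a single, safely quoted table name."""
    return f"SELECT COUNT(*) AS cnt FROM {quote_ident(table)}"


print(count_query("ratings"))  # SELECT COUNT(*) AS cnt FROM "ratings"
```

The resulting string could be passed to `con.sql(...)` in place of the f-string used in the loop.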


## Close the connection (when done)

In [15]:
con.close()
print("Connection closed.")


Connection closed.
