# DuckDB + MovieLens — Starter Notebook (VS Code)

This notebook is set up to help you:

- connect to a local `.duckdb` database file  
- run useful SQL queries  
- optionally import CSV files into the database  
- export results for further analysis  

> **Tip:** In VS Code, make sure you select the Python kernel from your `.venv` where `duckdb` and `pandas` are installed.
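
To confirm that the selected kernel really is the `.venv` interpreter and that both packages are importable, a quick check along these lines can help. This is a minimal sketch using only the standard library; the helper name `missing_packages` is an assumption for illustration and is not part of the notebook.

```python
import sys
from importlib.util import find_spec


def missing_packages(names):
    """Return the package names that cannot be found by the current interpreter."""
    return [name for name in names if find_spec(name) is None]


# The interpreter path should point inside your .venv folder
print("Interpreter:", sys.executable)
# True when the interpreter runs inside a virtual environment
print("Virtual environment active:", sys.prefix != sys.base_prefix)
# An empty list means both dependencies are available
print("Missing packages:", missing_packages(["duckdb", "pandas"]))
```

If the list is not empty, install the missing packages into the same environment before continuing.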



### 1 Importing libraries, data, and data conversion
#### 1.1 pip install

In [1]:

# Run this cell only if duckdb and pandas are not yet installed in your venv; otherwise comment out the line below.
%pip install duckdb pandas


Note: you may need to restart the kernel to use updated packages.





#### 1.2 Importing libraries and folder paths

In [2]:
import os
from pathlib import Path

import duckdb
import pandas as pd

# Folder that holds the MovieLens CSV files
DATA_DIR = Path("..") / "data" / "100k"

links_csv = DATA_DIR / "links.csv"
movies_csv = DATA_DIR / "movies.csv"
ratings_csv = DATA_DIR / "ratings.csv"
tags_csv = DATA_DIR / "tags.csv"

# Directory for the generated Parquet files
PARQUET_DIR = DATA_DIR / "parquet"
PARQUET_DIR.mkdir(exist_ok=True)

#### 1.3 Defining a helper to create Parquet files

In [3]:
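def safe_copy_to_parquet(csv_path, parquet_path, sql_select):
    """Write the result of `sql_select` to `parquet_path`, replacing any existing file.

    `csv_path` is not used inside the function; the CSV is read by `sql_select`.
    """
    # Remove a stale file so the output always reflects the current query
    if parquet_path.exists():
        print(f"Deleting existing file: {parquet_path}")
        os.remove(parquet_path)

    print(f"Creating parquet file: {parquet_path}")

    # COPY (...) TO runs the SELECT and streams its rows into the Parquet file
    duckdb.sql(f"""
        COPY (
            {sql_select}
        ) TO '{parquet_path}'
        (FORMAT 'parquet');
    """)

    print(f"✔ Finished writing {parquet_path}\n")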
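
The helper above pastes file paths straight into the SQL text, which works for the paths used here but would break on a path that contains a single quote. One possible way to make that safer is sketched below; `sql_path_literal` is a hypothetical helper, not something DuckDB or this notebook provides, and it assumes DuckDB accepts forward-slash paths on Windows.

```python
from pathlib import Path


def sql_path_literal(path):
    """Return `path` as a single-quoted SQL string literal (hypothetical helper)."""
    text = Path(path).as_posix()                 # use forward slashes on every OS
    return "'" + text.replace("'", "''") + "'"   # double any embedded single quote


print(sql_path_literal(Path("data") / "100k" / "movies.csv"))
print(sql_path_literal("user's data/ratings.csv"))
```

The returned value already includes the surrounding quotes, so it would be interpolated as `read_csv_auto({sql_path_literal(movies_csv)})` rather than inside `'...'`.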


#### 1.4 Creating Parquet files
##### 1.4.1 movies.parquet

In [4]:
# convert movies
safe_copy_to_parquet(
    movies_csv,
    PARQUET_DIR / "movies.parquet",
    f"SELECT * FROM read_csv_auto('{movies_csv}')"
)

Deleting existing file: ..\data\100k\parquet\movies.parquet
Creating parquet file: ..\data\100k\parquet\movies.parquet
✔ Finished writing ..\data\100k\parquet\movies.parquet



##### 1.4.2 ratings.parquet

In [5]:
# convert ratings
safe_copy_to_parquet(
    ratings_csv,
    PARQUET_DIR / "ratings.parquet",
    f"""
    SELECT
        userId::INT,
        movieId::INT,
        rating::DOUBLE,
        to_timestamp(CAST(timestamp AS BIGINT)) AS timestamp
    FROM read_csv_auto('{ratings_csv}')
    """
)

Deleting existing file: ..\data\100k\parquet\ratings.parquet
Creating parquet file: ..\data\100k\parquet\ratings.parquet
✔ Finished writing ..\data\100k\parquet\ratings.parquet
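
The `to_timestamp(CAST(timestamp AS BIGINT))` expression in the cell above treats the raw `timestamp` column as seconds since the Unix epoch. As a rough pure-Python illustration of the same conversion (interpreted in UTC), the sketch below uses only the standard library; the function name `epoch_seconds_to_utc` and the sample value are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def epoch_seconds_to_utc(seconds):
    """Convert seconds since 1970-01-01 00:00:00 UTC to a timezone-aware datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


sample = 964982703  # illustrative epoch value, in seconds
print(epoch_seconds_to_utc(sample).isoformat())  # 2000-07-30T18:45:03+00:00
```

If a converted value lands implausibly far in the future, the source column is probably in milliseconds rather than seconds.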



##### 1.4.3 tags.parquet

In [6]:
safe_copy_to_parquet(
    tags_csv,
    PARQUET_DIR / "tags.parquet",
    f"""
    SELECT
        userId::INT,
        movieId::INT,
        tag,
        to_timestamp(CAST(timestamp AS BIGINT)) AS timestamp
    FROM read_csv_auto('{tags_csv}')
    """
)

Deleting existing file: ..\data\100k\parquet\tags.parquet
Creating parquet file: ..\data\100k\parquet\tags.parquet
✔ Finished writing ..\data\100k\parquet\tags.parquet



##### 1.4.4 links.parquet

In [7]:
# convert links
safe_copy_to_parquet(
    links_csv,
    PARQUET_DIR / "links.parquet",
    f"SELECT * FROM read_csv_auto('{links_csv}')"
)

Deleting existing file: ..\data\100k\parquet\links.parquet
Creating parquet file: ..\data\100k\parquet\links.parquet
✔ Finished writing ..\data\100k\parquet\links.parquet



### 2 Creating tables in DuckDB

In [8]:
con = duckdb.connect("movielens100K.duckdb")

# Build each path outside the SQL f-string: reusing double quotes inside an
# f-string expression is a syntax error before Python 3.12.
for table in ("movies", "ratings", "tags", "links"):
    parquet_file = PARQUET_DIR / f"{table}.parquet"
    con.sql(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM read_parquet('{parquet_file}')")


In [9]:
# show final state
con.sql("SHOW TABLES").df()

|   | name    |
|---|---------|
| 0 | links   |
| 1 | movies  |
| 2 | ratings |
| 3 | tags    |


In [10]:
# Show row counts for all tables in the current DuckDB connection
for table in con.sql("SHOW TABLES").df()["name"]:
    count = con.sql(f"SELECT COUNT(*) AS cnt FROM {table}").df()["cnt"][0]
    print(f"Table '{table}': {count} rows")

Table 'links': 9742 rows
Table 'movies': 9742 rows
Table 'ratings': 100836 rows
Table 'tags': 3683 rows


### 3 Closing the connection (when done)

In [11]:
con.close()
print("Connection closed.")


Connection closed.
