# PyDI Profiling Example

This notebook mirrors `PyDI/examples/profiling_example.py`.

It loads a dataset with PyDI's provenance-aware I/O and generates a profiling
report using ydata-profiling. Artifacts are written to `output/profiling`.

If optional dependencies are missing, install:
- `ydata-profiling`
- optionally `sweetviz`
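
Before running the cells below, it can help to check which of the optional packages are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical and not part of PyDI; `ydata_profiling` and `sweetviz` are the import names of the two packages listed above.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names of the optional profiling dependencies
optional = ["ydata_profiling", "sweetviz"]
missing = missing_packages(optional)
if missing:
    print("Missing optional packages:", ", ".join(missing))
else:
    print("All optional profiling packages are available.")
```

Anything reported as missing can be installed with `pip install ydata-profiling sweetviz`.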


In [3]:
# Step 0: Imports and setup
import logging
from pathlib import Path

from PyDI.io import load_csv
from PyDI.profiling import DataProfiler


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Assumes the notebook is run from a directory one level below the repo root.
root = Path.cwd().parent
print(root)


/Users/aaronsteiner/Documents/GitHub/PyDI
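
Deriving the repo root from the current working directory only works when the notebook is started from a folder directly beneath it. A slightly more robust option, sketched here under assumptions, is to walk upward until a known marker is found. The helper `find_repo_root` is hypothetical, and using the `input` directory as the marker is an assumption based on the paths used in this notebook.

```python
from pathlib import Path


def find_repo_root(start, marker="input"):
    """Walk upward from `start` and return the first directory containing `marker`.

    `marker` is an assumption: any file or directory that exists only at the
    repository root would work equally well.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")
```

With this helper, `root = find_repo_root(Path.cwd())` would replace the fixed one-level-up lookup and fail loudly if the marker is never found.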


## Step 1: Locate input and output paths


In [4]:
csv_path = root / "input/schemamatching/data/movie_list.csv"
out_dir = root / "output/profiling"
csv_path, out_dir


(PosixPath('/Users/aaronsteiner/Documents/GitHub/PyDI/input/schemamatching/data/movie_list.csv'),
 PosixPath('/Users/aaronsteiner/Documents/GitHub/PyDI/output/profiling'))

## Step 2: Load the dataset with provenance


In [5]:
if not csv_path.exists():
    raise FileNotFoundError(f"CSV file not found: {csv_path}")

df = load_csv(csv_path, name="movies")
print(f"Loaded {len(df)} rows from {csv_path.name}")
print(f"Columns: {list(df.columns)[:8]} ...")
df.head(3)


INFO:PyDI.io.loaders:Loaded dataset 'movies' via read_csv: shape=(656, 23), source=/Users/aaronsteiner/Documents/GitHub/PyDI/input/schemamatching/data/movie_list.csv


Loaded 656 rows from movie_list.csv
Columns: ['movies_id', 'id', 'year', 'exclude', 'Film', 'Lead Studio', 'Rotten Tomatoes', 'Audience Score'] ...


| | movies_id | id | year | exclude | Film | Lead Studio | Rotten Tomatoes | Audience Score | Story | Genre | ... | Foreign Gross | Worldwide Gross | Budget | Profit | Proftitability | Opening Weekend | Oscar | Bafta | Source | Column |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | movies-0000 | 1 | 2010 | | 127 Hours | Independent | 93.0 | 84 | Escape | Adventure | ... | 42.4 | 60.73 | 18.0 | 42.73 | 337.39% | 0.26 | | | http://boxofficemojo.com/movies/?id=127hours.htm | |
| 1 | movies-0001 | 2 | 2010 | | A Nightmare on Elm Street | Warner Bros. | 13.0 | 40 | Monster Force | Horror | ... | 52.59 | 115.66 | 35.0 | 80.66 | 330.46% | 32.9 | | | | |
| 2 | movies-0002 | 3 | 2010 | | Alice in Wonderland | Disney | 52.0 | 72 | Journey And Return | Adventure | ... | 690.2 | 1024.39 | 200.0 | 824.39 | 512.20% | 116.1 | | | | |
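
The `movies_id` values in the preview follow the pattern of the dataset name, a hyphen, and a zero-padded row number. As a rough illustration of that pattern only, and not PyDI's actual implementation, such identifiers could be generated as follows. The function name `make_record_ids` and the default padding width of 4 are assumptions inferred from the output above.

```python
def make_record_ids(dataset_name, n_rows, width=4):
    """Build identifiers such as 'movies-0000' for rows 0..n_rows-1.

    The width of 4 is an assumption taken from the values shown in the
    preview; it is not a documented PyDI setting.
    """
    return [f"{dataset_name}-{i:0{width}d}" for i in range(n_rows)]


ids = make_record_ids("movies", 3)
print(ids)  # ['movies-0000', 'movies-0001', 'movies-0002']
```

Stable per-row identifiers of this kind make it possible to trace a record back to its source dataset after later integration steps.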


## Step 3: Generate profiling report
Requires `ydata-profiling` (and optionally `sweetviz`).


In [6]:
profiler = DataProfiler()
try:
    report_path = profiler.profile(df, str(out_dir))
    print("\nProfiling report written to:")
    print(report_path)
except ImportError as e:
    logger.error(
        "Profiling requires optional dependencies. Please install: 'ydata-profiling' (and optionally 'sweetviz'). Error: %s",
        e,
    )


INFO:visions.backends:Pandas backend loaded 2.3.2
INFO:visions.backends:Numpy backend loaded 1.26.4
INFO:visions.backends:Pyspark backend NOT loaded
INFO:visions.backends:Python backend loaded


100%|██████████| 23/23 [00:00<00:00, 772.22it/s]
Summarize dataset: 100%|██████████| 97/97 [00:02<00:00, 40.61it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.51s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 216.82it/s]


Profiling report written to:
/Users/aaronsteiner/Documents/GitHub/PyDI/output/profiling/movies_profile.html
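
As a quick sanity check after the run, one might confirm that the report file exists and is not empty. This is a minimal sketch using only the standard library; `report_summary` is a hypothetical helper, not a PyDI function.

```python
from pathlib import Path


def report_summary(report_path):
    """Return (exists, size_in_bytes) for a report file; size is 0 if missing."""
    path = Path(report_path)
    if not path.is_file():
        return (False, 0)
    return (True, path.stat().st_size)
```

Calling `report_summary(report_path)` with the path returned in Step 3 should give `True` and a non-zero size. The report can then be opened in a browser with `webbrowser.open(Path(report_path).resolve().as_uri())` from the standard library.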



