# PyDI Quickstart: Normalize repository data

This notebook mirrors `PyDI/examples/normalization_quickstart.py` and lets you run it step by step.

What this notebook shows:
- Load CSV/XML from the repo with provenance
- Detect column types and units (including header-derived units)
- Apply per-column transformations before normalization (discoverable API)
- Normalize data and save results for sharing

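Run the cells below in order. `repo_root()` assumes the working directory sits two levels below the repository root, so adjust the paths if you run the notebook from elsewhere.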


In [7]:
from __future__ import annotations

from pathlib import Path
import pandas as pd

from PyDI.io.loaders import load_csv, load_xml
from PyDI.normalization.datasets import DatasetNormalizer, create_normalization_config
from PyDI.normalization.columns import ColumnTypeInference
from PyDI.normalization.text import HeaderNormalizer
from PyDI.normalization.transforms import Transforms as T, list_transforms


def repo_root() -> Path:
    """Go two directories up from the current working directory."""
    return Path.cwd().resolve().parents[1]

repo_root()

PosixPath('/Users/aaronsteiner/Documents/GitHub/PyDI')
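
`repo_root()` depends on where the notebook is started. If that is inconvenient, a marker-based lookup can be more forgiving. The following is a minimal sketch, not part of PyDI: the helper name `find_repo_root` and the choice of an `input` directory as the landmark are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Optional[Path] = None, marker: str = "input") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper; `marker` is an assumed landmark directory name.
    """
    here = (start or Path.cwd()).resolve()
    for candidate in (here, *here.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {here} contains '{marker}/'")
```

Calling `find_repo_root()` from any subdirectory of the repository would then return the same root, and it fails loudly instead of silently pointing at the wrong directory.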

## Step 1: Locate input files
We point to the CSV and XML inputs bundled in this repository.


In [8]:
root = repo_root()
csv_path = root / "input" / "movies" / "schemamatching" / "data" / "movie_list.csv"
xml_path = root / "input" / "movies" / "fusion" / "data" / "academy_awards.xml"
csv_path, xml_path


(PosixPath('/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/schemamatching/data/movie_list.csv'),
 PosixPath('/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/fusion/data/academy_awards.xml'))

## Step 2: Load datasets with provenance
CSV via `load_csv`, XML via `load_xml`.


In [9]:
movies = load_csv(csv_path, name="movies")
awards = load_xml(xml_path, name="academy_awards")
print("Movies shape:", movies.shape)
print("Academy awards shape:", awards.shape)

# Show provenance info
print("Movies provenance info:", movies.attrs)
print("Academy awards provenance info:", awards.attrs)

Movies shape: (656, 23)
Academy awards shape: (5700, 7)
Movies provenance info: {'dataset_name': 'movies', 'provenance': {'dataset_name': 'movies', 'reader': 'read_csv', 'loaded_time_utc_iso': '2025-09-15T09:51:26.608685+00:00', 'source_path': '/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/schemamatching/data/movie_list.csv', 'file_size_bytes': 88801, 'modified_time_utc_iso': '2025-08-29T12:59:13.769362+00:00', 'sha256_prefix': '606069231eab113a361d1798e1fffdf939f424e2e3ea1a05f0e95669d0bdd9e8', 'sha256_prefix_bytes': 88801, 'id_column_name': 'movies_id'}}
Academy awards provenance info: {'dataset_name': 'academy_awards', 'provenance': {'dataset_name': 'academy_awards', 'reader': 'read_xml_exploded', 'loaded_time_utc_iso': '2025-09-15T09:51:26.891650+00:00', 'source_path': '/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/fusion/data/academy_awards.xml', 'file_size_bytes': 771854, 'modified_time_utc_iso': '2025-08-29T12:59:13.764462+00:00', 'sha256_prefix': '537dfaeb80d87
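
The provenance dictionary records `file_size_bytes`, `sha256_prefix`, and `sha256_prefix_bytes` for each source file. How PyDI computes the digest is not shown in this notebook. The sketch below assumes `sha256_prefix` is the hex SHA-256 of the first `sha256_prefix_bytes` bytes of the file, and uses only the standard library to check whether a file on disk still matches a recorded entry. Both helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_prefix(path: Path, n_bytes: int) -> str:
    """Hex SHA-256 of the first `n_bytes` bytes of `path` (hypothetical helper)."""
    with path.open("rb") as fh:
        return hashlib.sha256(fh.read(n_bytes)).hexdigest()


def matches_provenance(path: Path, prov: dict) -> bool:
    """Compare a file against the size and digest fields of a provenance dict.

    Assumes the digest covers the first `sha256_prefix_bytes` bytes of the file.
    """
    if path.stat().st_size != prov["file_size_bytes"]:
        return False
    digest = sha256_of_prefix(path, prov["sha256_prefix_bytes"])
    return digest == prov["sha256_prefix"]
```

Under that assumption, `matches_provenance(csv_path, movies.attrs["provenance"])` would tell you whether the CSV changed since it was loaded.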

In [10]:
# display movie dataframe
movies.head(5)

Unnamed: 0,movies_id,id,year,exclude,Film,Lead Studio,Rotten Tomatoes,Audience Score,Story,Genre,...,Foreign Gross,Worldwide Gross,Budget,Profit,Proftitability,Opening Weekend,Oscar,Bafta,Source,Column
0,movies-0000,1,2010,,127 Hours,Independent,93.0,84,Escape,Adventure,...,42.4,60.73,18.0,42.73,337.39%,0.26,,,http://boxofficemojo.com/movies/?id=127hours.htm,
1,movies-0001,2,2010,,A Nightmare on Elm Street,Warner Bros.,13.0,40,Monster Force,Horror,...,52.59,115.66,35.0,80.66,330.46%,32.9,,,,
2,movies-0002,3,2010,,Alice in Wonderland,Disney,52.0,72,Journey And Return,Adventure,...,690.2,1024.39,200.0,824.39,512.20%,116.1,,,,
3,movies-0003,4,2010,,All About Steve,Independent,6.0,35,Comedy,Comedy,...,6.26,40.13,15.0,25.13,267.53%,11.2,,,http://www.the-numbers.com/movies/2009/ABSTV.php,
4,movies-0004,5,2010,y,All Good Things,Independent,33.0,64,The Riddle,Drama,...,0.062,0.64,20.0,-19.36,3.20%,0.037,,,http://www.wikipedia.org,


In [11]:
awards.head(5)

Unnamed: 0,academy_awards_id,id,title,actors_actor_name,date,director_name,oscar
0,academy_awards-0000,academy_awards_1,Biutiful,Javier Bardem,2010-01-01,,
1,academy_awards-0001,academy_awards_2,True Grit,Jeff Bridges,2010-01-01,Joel Coen,
2,academy_awards-0002,academy_awards_2,True Grit,Jeff Bridges,2010-01-01,Ethan Coen,
3,academy_awards-0003,academy_awards_2,True Grit,Hailee Steinfeld,2010-01-01,Joel Coen,
4,academy_awards-0004,academy_awards_2,True Grit,Hailee Steinfeld,2010-01-01,Ethan Coen,


## Step 3: Preview inferred column types
Use `ColumnTypeInference` on a sample to see detected types.


In [12]:
infer = ColumnTypeInference()
summary = infer.get_type_summary(infer.infer_column_types(movies.head(1000)))
summary

Unnamed: 0,column,detected_type,confidence,null_percentage,samples_analyzed,unit_category,specific_unit,format_pattern,sample_values
0,movies_id,string,1.0,0.0,656,,,string,"movies-0000, movies-0001, movies-0002"
1,id,currency,0.998,0.0,656,,,currency,"1, 2, 3"
2,year,numeric,1.0,0.0,656,time,year,numeric,"2010, 2010, 2010"
3,exclude,bool,1.0,80.49,128,,,boolean,"y, y, y"
4,Film,string,0.979,0.15,655,,,string,"127 Hours, A Nightmare on Elm Street, Alice in..."
5,Lead Studio,string,0.984,16.62,547,,,string,"Independent, Warner Bros., Disney"
6,Rotten Tomatoes,numeric,1.0,0.15,655,,,numeric,"93.0, 13.0, 52.0"
7,Audience Score,currency,1.0,0.0,656,,,currency,"84, 40, 72"
8,Story,string,1.0,0.3,654,,,string,"Escape, Monster Force, Journey And Return"
9,Genre,string,1.0,0.0,656,,,string,"Adventure, Horror, Adventure"
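
Type inference is heuristic, and some detections in the summary deserve a second look: `id` and `Audience Score` are both reported as `currency` even though no unit was found. The sketch below shows one way to surface such rows with plain pandas. It rebuilds a few rows of the summary by hand so that it runs on its own; in the notebook you would filter `summary` directly.

```python
import pandas as pd

# A few rows copied from the summary output above.
summary_sample = pd.DataFrame(
    {
        "column": ["movies_id", "id", "year", "Audience Score"],
        "detected_type": ["string", "currency", "numeric", "currency"],
        "confidence": [1.0, 0.998, 1.0, 1.0],
        "specific_unit": [None, None, "year", None],
    }
)

# Flag currency detections that carry no specific unit: a plain integer
# column such as `id` should probably not be treated as money.
suspicious = summary_sample[
    (summary_sample["detected_type"] == "currency")
    & summary_sample["specific_unit"].isna()
]
print(suspicious["column"].tolist())  # ['id', 'Audience Score']
```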


## Step 4: Explore available transforms
List built-in transforms to learn what you can apply before normalization.


In [13]:
for t in list_transforms():
    print(f"- {t['name']}: {t['summary']}")


- lower: Lowercase strings
- upper: Uppercase strings
- strip: Trim leading/trailing whitespace
- normalize_whitespace: Collapse internal whitespace and strip
- to_numeric: Convert to numeric (thousands stripped, errors=coerce)
- to_datetime: Convert to datetime (infer formats)
- fill_na_empty: Fill NA with empty string
- fill_na_zero: Fill NA with 0
- drop_non_ascii: Remove non-ASCII characters


## Step 4b: Normalize column headers
Use a header normalizer to standardize column names (lowercase, remove brackets). This can improve detection and downstream processing.


In [14]:
# Normalize headers on a copy
title = "Original -> Normalized header names (first 10)"
print(title)
hn = HeaderNormalizer(lowercase=True, remove_brackets=True)
movies_norm_headers = hn.normalize_dataframe_headers(movies)
print("Original headers:", list(movies.columns)[:10])
print("Normalized headers:", list(movies_norm_headers.columns)[:10])

Original -> Normalized header names (first 10)
Original headers: ['movies_id', 'id', 'year', 'exclude', 'Film', 'Lead Studio', 'Rotten Tomatoes', 'Audience Score', 'Story', 'Genre']
Normalized headers: ['movies_id', 'id', 'year', 'exclude', 'film', 'lead studio', 'rotten tomatoes', 'audience score', 'story', 'genre']
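
For intuition, the same kind of cleanup can be written with the standard library. This is a rough sketch and not what `HeaderNormalizer` does internally: it interprets "remove brackets" as dropping the bracketed segment entirely, which is an assumption, and `Budget (USD millions)` is a made-up header used only to exercise that case.

```python
import re


def normalize_header(name: str) -> str:
    """Lowercase a header, drop bracketed segments, and tidy whitespace."""
    without_brackets = re.sub(r"[\(\[\{].*?[\)\]\}]", " ", name)
    return re.sub(r"\s+", " ", without_brackets).strip().lower()


headers = ["Film", "Lead Studio", "Budget (USD millions)"]
normalized_headers = [normalize_header(h) for h in headers]
print(normalized_headers)  # ['film', 'lead studio', 'budget']
```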


## Step 4c: Re-run type inference on normalized headers
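Re-run inference on the DataFrame with normalized headers and compare the result with the Step 3 summary. For the first three columns shown below, the detected types are unchanged.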


In [21]:
infer2 = ColumnTypeInference()
summary_norm = infer2.get_type_summary(infer2.infer_column_types(movies_norm_headers.head(1000)))
summary_norm.head(3)

Unnamed: 0,column,detected_type,confidence,null_percentage,samples_analyzed,unit_category,specific_unit,format_pattern,sample_values
0,movies_id,string,1.0,0.0,656,,,string,"movies-0000, movies-0001, movies-0002"
1,id,currency,0.998,0.0,656,,,currency,"1, 2, 3"
2,year,numeric,1.0,0.0,656,time,year,numeric,"2010, 2010, 2010"
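
Comparing two summaries by eye gets tedious with 23 columns. One option is to join them on the lowercased column name and keep only the rows whose detected type changed. The sketch below uses two small hand-made summaries with invented values, purely to show the join; in the notebook you would pass `summary` and `summary_norm` instead.

```python
import pandas as pd

# Invented values for illustration only.
before = pd.DataFrame(
    {"column": ["Film", "Budget"], "detected_type": ["string", "numeric"]}
)
after = pd.DataFrame(
    {"column": ["film", "budget"], "detected_type": ["string", "currency"]}
)

# Join on a case-insensitive key so original and normalized names line up.
before = before.assign(key=before["column"].str.lower())
after = after.assign(key=after["column"].str.lower())
merged = before.merge(after, on="key", suffixes=("_before", "_after"))

changed = merged[merged["detected_type_before"] != merged["detected_type_after"]]
print(changed[["key", "detected_type_before", "detected_type_after"]])
```

Lowercasing is enough as a join key here because, in the Step 4b output, the normalized headers differ from the originals only by case.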


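## Step 5: Define column-level transforms
Each key is a column name (or a tuple of names), and each value is a transform or a list of transforms:
- Clean title text
- Map yes/no flags to booleans
- Convert money-like columns to numeric and scale to millions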


In [16]:
transforms = {
    'Film': [T.strip(), T.normalize_whitespace()],
    'exclude': T.replace({'y': True, 'n': False, '': None}),
}

money_cols = [
    'Domestic Gross', 'Foreign Gross', 'Worldwide Gross',
    'Opening Weekend', 'Box Office Average per Cinema',
    'Budget', 'Profit',
]

to_millions = T.map(lambda v: v / 1_000_000 if pd.notna(v) else v)
transforms[tuple(money_cols)] = [T.to_numeric(), to_millions]

transforms


{'Film': [<function PyDI.normalization.transforms.Transforms.strip.<locals>.<lambda>(s)>,
  <function PyDI.normalization.transforms.Transforms.normalize_whitespace.<locals>.<lambda>(s)>],
 'exclude': <function PyDI.normalization.transforms.Transforms.replace.<locals>.<lambda>(s)>,
 ('Domestic Gross',
  'Foreign Gross',
  'Worldwide Gross',
  'Opening Weekend',
  'Box Office Average per Cinema',
  'Budget',
  'Profit'): [<function PyDI.normalization.transforms.Transforms.to_numeric.<locals>._fn(s: 'pd.Series') -> 'pd.Series'>,
  <function PyDI.normalization.transforms.Transforms.map.<locals>.<lambda>(s)>]}

## Step 6: Configure and run normalization
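Create a configuration, run the normalizer, and capture the results. Because `movies_norm_headers` has lowercased column names while the Step 5 transforms are keyed on the original names, the run below warns about missing targets such as `Film`; only the `exclude` key matches a column.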


In [17]:
cfg = create_normalization_config(
    enable_unit_conversion=True,
    enable_quantity_scaling=True,
)
normalizer = DatasetNormalizer(cfg)
out_dir = root / "output" / "examples" / "quickstart"
out_dir.mkdir(parents=True, exist_ok=True)
normalized, result = normalizer.normalize_dataset(
    movies_norm_headers, output_path=out_dir, column_transforms=transforms)
normalized.shape, f"{result.overall_success_rate:.1%}"


Transform targets missing column(s): ['Film']
Transform targets missing column(s): ['Domestic Gross', 'Foreign Gross', 'Worldwide Gross', 'Opening Weekend', 'Box Office Average per Cinema', 'Budget', 'Profit']


((656, 23), '100.0%')
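
One way to avoid the missing-column warnings above is to align the transform keys with the normalized headers before calling the normalizer. The following sketch assumes that header normalization only changes the case of the affected names, as the Step 4b output suggests. Both helpers are hypothetical, and plain strings stand in for the transform callables so the example runs on its own.

```python
from typing import Any, Dict, Iterable, List, Tuple, Union

Key = Union[str, Tuple[str, ...]]


def lowercase_keys(transforms: Dict[Key, Any]) -> Dict[Key, Any]:
    """Return a copy of `transforms` whose column names are lowercased."""
    remapped: Dict[Key, Any] = {}
    for key, value in transforms.items():
        if isinstance(key, tuple):
            remapped[tuple(name.lower() for name in key)] = value
        else:
            remapped[key.lower()] = value
    return remapped


def missing_targets(transforms: Dict[Key, Any], columns: Iterable[str]) -> List[str]:
    """List transform target columns that are absent from `columns`."""
    available = set(columns)
    targets: List[str] = []
    for key in transforms:
        targets.extend(key if isinstance(key, tuple) else (key,))
    return [name for name in targets if name not in available]


# Placeholder values stand in for the transform callables.
example = {"Film": "strip", ("Budget", "Profit"): "to_numeric"}
columns = ["film", "budget", "profit", "exclude"]

print(missing_targets(example, columns))                  # ['Film', 'Budget', 'Profit']
print(missing_targets(lowercase_keys(example), columns))  # []
```

In the notebook, `missing_targets(transforms, movies_norm_headers.columns)` would flag the mismatch before the normalizer runs. Be aware that once the money-column keys match, the `to_millions` step from Step 5 would actually be applied, which may not be what you want if the values are already expressed in millions.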

In [18]:
normalized

Unnamed: 0,movies_id,id,year,exclude,film,lead studio,rotten tomatoes,audience score,story,genre,...,foreign gross,worldwide gross,budget,profit,proftitability,opening weekend,oscar,bafta,source,column
0,movies-0000,1,201,,127 hours,independent,93.0,84,escape,adventure,...,42.4,60.73,18.0,42.73,337.39%,0.260,,,http://boxofficemojo.com/movies/?id=127hours.htm,
1,movies-0001,2,201,,a nightmare on elm street,warner bros.,13.0,40,monster force,horror,...,52.59,115.66,35.0,80.66,330.46%,32.900,,,,
2,movies-0002,3,201,,alice in wonderland,disney,52.0,72,journey and return,adventure,...,690.2,1024.39,200.0,824.39,512.20%,116.100,,,,
3,movies-0003,4,201,,all about steve,independent,6.0,35,comedy,comedy,...,6.26,40.13,15.0,25.13,267.53%,11.200,,,http://www.the-numbers.com/movies/2009/abstv.php,
4,movies-0004,5,201,True,all good things,independent,33.0,64,the riddle,drama,...,0.062,0.64,20.0,-19.36,3.20%,0.037,,,http://www.wikipedia.org,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
651,movies-0651,652,200,True,whip it,,84.0,73,maturation,drama,...,3,16.0,15.0,1.00,1.07,4.700,,,http://www.boxofficemojo.com/movies/?id=whipit...,
652,movies-0652,653,200,True,whiteout,,7.0,28,pursuit,action,...,1.9,12.2,35.0,-22.80,0.35,4.900,,,http://www.the-numbers.com/movies/records/allb...,
653,movies-0653,654,200,,x-men origins: wolverine,fox,37.0,72,revenge,action,...,193.2,373.1,150.0,223.10,2.49,85.100,,,http://www.the-numbers.com/movies/records/allb...,
654,movies-0654,655,200,True,year one,,14.0,31,quest,adventure,...,26.2,60.2,60.0,0.20,1,19.600,,,http://www.the-numbers.com/movies/records/allb...,


## Step 7: Save and preview output
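Write the normalized CSV and preview selected columns. Note that `year` appears as 201 and 200 in the normalized output, whereas the loaded data in Step 2 shows 2010, so check that column before sharing the file.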


In [19]:
out_path = out_dir / "movies_normalized.csv"
normalized.to_csv(out_path, index=False)
print("Saved:", out_path)
cols = [c for c in ['film', 'year', 'exclude', 'budget', 'profit'] if c in normalized.columns]
normalized[cols].head(10)


Saved: /Users/aaronsteiner/Documents/GitHub/PyDI/output/examples/quickstart/movies_normalized.csv


Unnamed: 0,film,year,exclude,budget,profit
0,127 hours,201,,18.0,42.73
1,a nightmare on elm street,201,,35.0,80.66
2,alice in wonderland,201,,200.0,824.39
3,all about steve,201,,15.0,25.13
4,all good things,201,True,20.0,-19.36
5,alpha and omega,201,,20.0,9.91
6,barry munday,201,True,,0.0
7,black swan,201,,13.0,316.39
8,brooklyn's finest,201,,17.0,19.31
9,buried,201,,2.0,16.38
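
Before sharing the file, it is worth confirming that it reads back cleanly. The sketch below writes a tiny stand-in DataFrame to a temporary directory with `DataFrame.to_csv` and reloads it with `pd.read_csv`; the two-row frame is illustrative and only reuses values from the preview above.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"film": ["127 hours", "buried"], "budget": [18.0, 2.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "movies_normalized.csv"
    df.to_csv(out_path, index=False)  # write directly, no intermediate string
    reloaded = pd.read_csv(out_path)

# Shape and column order should survive the round trip.
same_shape = reloaded.shape == df.shape
same_columns = list(reloaded.columns) == list(df.columns)
print(same_shape, same_columns)
```

CSV does not store dtypes, so columns such as `exclude` (booleans mixed with missing values) may come back with a different dtype and are worth checking separately.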


### Optional: Profile the normalized dataset

You can quickly profile the normalized DataFrame to inspect schema, nulls, and distributions.
- `DataProfiler.summary(df)` returns lightweight stats without extra dependencies.
- `DataProfiler.profile(df, out_dir)` generates a rich HTML report (requires `ydata-profiling`).

If you don’t have optional dependencies installed, the example below will still show the summary and skip the HTML report with a friendly message.


In [20]:
# Profile normalized dataset (optional)
from PyDI.profiling import DataProfiler

profiler = DataProfiler()
# Use a distinct name so the type-inference `summary` from Step 3 is not overwritten
profile_summary = profiler.summary(normalized)
print("Rows:", profile_summary["rows"], " Columns:", profile_summary["columns"])
print("Total nulls:", profile_summary["nulls_total"])

# Try to create an HTML profiling report if the optional dependency is available
out_dir = root / "output" / "profiling"
try:
    report_path = profiler.profile(normalized, str(out_dir))
    print("Saved profiling report to:", report_path)
except ImportError as e:
    print("Skipping HTML profiling (optional dependency missing):", e)


movies:
  Rows: 656
  Columns: 23
  Total nulls: 3,123
  Null percentage: 20.7%
  Null counts per column:
    exclude: 528 (80.5%)
    film: 1 (0.2%)
    lead studio: 109 (16.6%)
    rotten tomatoes: 1 (0.2%)
    story: 2 (0.3%)
    number of theatres in opening weekend: 45 (6.9%)
    box office average per cinema: 54 (8.2%)
    domestic gross: 6 (0.9%)
    foreign gross: 55 (8.4%)
    worldwide gross: 4 (0.6%)
    budget: 11 (1.7%)
    proftitability: 11 (1.7%)
    opening weekend: 6 (0.9%)
    oscar: 640 (97.6%)
    bafta: 644 (98.2%)
    source: 351 (53.5%)
    column: 655 (99.8%)

Rows: 656  Columns: 23
Total nulls: 3123


Saved profiling report to: /Users/aaronsteiner/Documents/GitHub/PyDI/output/profiling/movies_profile.html
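
If you only need the per-column null counts from the summary output above, plain pandas is enough. The sketch below is not `DataProfiler`'s implementation; it computes a comparable table on a small stand-in DataFrame whose values are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "film": ["127 hours", None, "buried", "black swan"],
        "exclude": [None, None, True, None],
        "budget": [18.0, 20.0, 2.0, 13.0],
    }
)

null_counts = df.isna().sum()
null_report = pd.DataFrame(
    {
        "nulls": null_counts,
        "percent": (null_counts / len(df) * 100).round(1),
    }
)
# Keep only columns that have at least one null, as in the report above.
null_report = null_report[null_report["nulls"] > 0]
print(null_report)
```

Replacing `df` with `normalized` would give the same kind of table for the full dataset.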



