# `pandantic` v1: Solving the issue of black box `DataFrames` with `pydantic`

This is the executable code associated with [article-goes-here-when-published]().

## Setup for examples

For our examples we will use publicly available data from [FiveThirtyEight's repository](https://github.com/fivethirtyeight/data/tree/master) that records which movies pass the "Bechdel test". A movie passes the Bechdel test if it contains at least one conversation between two women that isn't about a man. We will be working with CSV data containing Bechdel test results and other information for 1794 movies.

In [1]:
import pydantic
import pandantic
import pandas as pd
from pandantic.plugins import pandas
from typing import Literal, Annotated 

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bechdel/movies.csv")
display(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1794 entries, 0 to 1793
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            1794 non-null   int64  
 1   imdb            1794 non-null   object 
 2   title           1794 non-null   object 
 3   test            1794 non-null   object 
 4   clean_test      1794 non-null   object 
 5   binary          1794 non-null   object 
 6   budget          1794 non-null   int64  
 7   domgross        1777 non-null   float64
 8   intgross        1783 non-null   float64
 9   code            1794 non-null   object 
 10  budget_2013$    1794 non-null   int64  
 11  domgross_2013$  1776 non-null   float64
 12  intgross_2013$  1783 non-null   float64
 13  period code     1615 non-null   float64
 14  decade code     1615 non-null   float64
dtypes: float64(6), int64(3), object(6)
memory usage: 210.4+ KB


None

| | year | imdb | title | test | clean_test | binary | budget | domgross | intgross | code | budget_2013$ | domgross_2013$ | intgross_2013$ | period code | decade code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | tt1711425 | 21 &amp; Over | notalk | notalk | FAIL | 13000000 | 25682380.0 | 42195766.0 | 2013FAIL | 13000000 | 25682380.0 | 42195766.0 | 1.0 | 1.0 |
| 1 | 2012 | tt1343727 | Dredd 3D | ok-disagree | ok | PASS | 45000000 | 13414714.0 | 40868994.0 | 2012PASS | 45658735 | 13611086.0 | 41467257.0 | 1.0 | 1.0 |
| 2 | 2013 | tt2024544 | 12 Years a Slave | notalk-disagree | notalk | FAIL | 20000000 | 53107035.0 | 158607035.0 | 2013FAIL | 20000000 | 53107035.0 | 158607035.0 | 1.0 | 1.0 |
| 3 | 2013 | tt1272878 | 2 Guns | notalk | notalk | FAIL | 61000000 | 75612460.0 | 132493015.0 | 2013FAIL | 61000000 | 75612460.0 | 132493015.0 | 1.0 | 1.0 |
| 4 | 2013 | tt0453562 | 42 | men | men | FAIL | 40000000 | 95020213.0 | 95020213.0 | 2013FAIL | 40000000 | 95020213.0 | 95020213.0 | 1.0 | 1.0 |


## Basic validation

Imagine we are writing a program that only needs the `binary` test result (PASS or FAIL) and a positive `budget` value from each row in the `DataFrame`. We can first encapsulate our expectations with a `pydantic` model as shown below.

Next we simply use the `pandas` plugin via `df.pandantic.validate(...)`. By default, `.validate()` returns `True` if every row *contains* the fields defined in our `pydantic` model and holds valid values for them; extra columns are allowed. Optionally, `strict=True` tightens this so that validation only passes when the model schema and the `DataFrame` columns match 1-to-1.

In [3]:
class BechdelTestMovie(pydantic.BaseModel):
    """Information we need from each row/movie."""
    binary: Literal["PASS", "FAIL"]
    budget: Annotated[int, pydantic.Field(gt=0)]

In [4]:
df.pandantic.validate(BechdelTestMovie)

True
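
To make the difference between the default and `strict=True` behavior concrete, here is a rough sketch of the same checks written with plain `pandas` and `pydantic`. It is not how `pandantic` is implemented; `validate_rows` is a hypothetical helper, and it assumes a failed check should produce `False` rather than raise.

```python
from typing import Annotated, Literal

import pandas as pd
import pydantic


class BechdelTestMovie(pydantic.BaseModel):
    """Information we need from each row/movie."""
    binary: Literal["PASS", "FAIL"]
    budget: Annotated[int, pydantic.Field(gt=0)]


def validate_rows(
    df: pd.DataFrame,
    model: type[pydantic.BaseModel],
    strict: bool = False,
) -> bool:
    """Hypothetical helper: True only if every row validates against `model`."""
    # Strict mode (as described above): columns and model fields must match 1-to-1.
    if strict and set(df.columns) != set(model.model_fields):
        return False
    # to_dict(orient="records") yields one plain dict per row.
    for record in df.to_dict(orient="records"):
        try:
            model.model_validate(record)  # extra keys are ignored by default
        except pydantic.ValidationError:
            return False
    return True


# Three columns, but the model only declares two of them.
sample = pd.DataFrame({
    "title": ["Dredd 3D", "42"],
    "binary": ["PASS", "FAIL"],
    "budget": [45_000_000, 40_000_000],
})

lenient_ok = validate_rows(sample, BechdelTestMovie)
strict_ok = validate_rows(sample, BechdelTestMovie, strict=True)
print(lenient_ok, strict_ok)
```

The extra `title` column is tolerated by the default check but causes the strict check to fail, which mirrors the 1-to-1 requirement described above.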

## DataFrame filtering

While validating a `DataFrame` does allow one to make useful assumptions when writing code, it does not provide much functionality directly. Filtering, however, can add a lot of value by replacing difficult-to-read `pandas` code, especially in cases where there are many specific rules/cases.

To illustrate this, let's pretend we only care to watch movies that pass the Bechdel test and have a budget between $10M and $12M (exclusive). After representing these criteria in a `pydantic` model, we can use `.filter()` on the `DataFrame` to remove invalid rows.

In [11]:
class MovieToWatch(pydantic.BaseModel):
    """A Bechdel test passing movie w/ a medium budget."""
    title: str
    binary: Literal["PASS"]
    budget: Annotated[int, pydantic.Field(gt=10000000, lt=12000000)]

In [12]:
# we can see here that only 7/1794 movies fit these criteria
display(len(df.pandantic.filter(MovieToWatch)))

7
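
For comparison, this is roughly the hand-written `pandas` mask that the `MovieToWatch` model stands in for. It is a sketch on a small made-up sample (titles and budgets copied from the outputs in this notebook), not the full dataset, and it skips the `title: str` type check that the model also enforces.

```python
import pandas as pd

# Small illustrative sample; values copied from rows shown in this notebook.
sample = pd.DataFrame({
    "title": ["Kinsey", "Vera Drake", "Dredd 3D", "2 Guns"],
    "binary": ["PASS", "PASS", "PASS", "FAIL"],
    "budget": [11_000_000, 11_000_000, 45_000_000, 61_000_000],
})

# One boolean condition per rule in the model.
mask = (
    (sample["binary"] == "PASS")
    & (sample["budget"] > 10_000_000)
    & (sample["budget"] < 12_000_000)
)
to_watch = sample[mask]
print(to_watch["title"].tolist())
```

With only three rules the mask is still readable, but every additional rule adds another clause to it, whereas the model keeps each rule next to the field it constrains.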

## Taking things further with iterators

Pandas offers a variety of iterator functions that allow rows to be processed one by one in a memory-efficient manner. `pandantic` builds on these capabilities by only returning valid rows from the iterator, or even returning the `pydantic` model object itself when using `.iterschemas(...)`.

In [13]:
for row_tuple in df.pandantic.itertuples(MovieToWatch):
    print(row_tuple)

(595, 2008, 'tt0962726', 'High School Musical 3: Senior Year', 'ok', 'ok', 'PASS', 11000000, 90559416.0, 274392880.0, '2008PASS', 11904819, 98008500.0, 296963426.0, 2.0, 2.0)
(644, 2008, 'tt0416212', 'The Secret Life of Bees', 'ok', 'ok', 'PASS', 11000000, 37770162.0, 39984023.0, '2008PASS', 11904819, 40876996.0, 43272961.0, 2.0, 2.0)
(712, 2007, 'tt0964587', 'St. Trinian&#39;s', 'ok', 'ok', 'PASS', 11400000, 15000.0, 22446568.0, '2007PASS', 12808396, 16853.0, 25219695.0, 2.0, 2.0)
(960, 2004, 'tt0362269', 'Kinsey', 'ok', 'ok', 'PASS', 11000000, 10214647.0, 13000959.0, '2004PASS', 13565122, 12596630.0, 16032690.0, 3.0, 2.0)
(1004, 2004, 'tt0383694', 'Vera Drake', 'ok', 'ok', 'PASS', 11000000, 3753806.0, 13353855.0, '2004PASS', 13565122, 4629167.0, 16467879.0, 3.0, 2.0)
(1292, 1999, 'tt0139134', 'Cruel Intentions', 'ok', 'ok', 'PASS', 11000000, 38230075.0, 75803716.0, '1999PASS', 15383082, 53463308.0, 106008618.0, 4.0, 3.0)
(1711, 1982, 'tt0084516', 'Poltergeist', 'ok', 'ok', 'PASS', 10

In [14]:
for index, row in df.pandantic.iterrows(MovieToWatch):
    assert isinstance(row, pd.Series)
    display(index)

595

644

712

960

1004

1292

1711

In [15]:
for index, obj in df.pandantic.iterschemas(MovieToWatch):
    display(obj)

MovieToWatch(title='High School Musical 3: Senior Year', binary='PASS', budget=11000000)

MovieToWatch(title='The Secret Life of Bees', binary='PASS', budget=11000000)

MovieToWatch(title='St. Trinian&#39;s', binary='PASS', budget=11400000)

MovieToWatch(title='Kinsey', binary='PASS', budget=11000000)

MovieToWatch(title='Vera Drake', binary='PASS', budget=11000000)

MovieToWatch(title='Cruel Intentions', binary='PASS', budget=11000000)

MovieToWatch(title='Poltergeist', binary='PASS', budget=10700000)
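
As a rough mental model for `.iterschemas(...)`, the generator below yields `(index, model)` pairs for valid rows only, using plain `pandas` and `pydantic`. `iter_valid_models` is a hypothetical name and this is only a sketch of the idea, not `pandantic`'s actual implementation; the sample rows reuse values shown in this notebook.

```python
from collections.abc import Iterator
from typing import Annotated, Literal

import pandas as pd
import pydantic


class MovieToWatch(pydantic.BaseModel):
    """A Bechdel test passing movie w/ a medium budget."""
    title: str
    binary: Literal["PASS"]
    budget: Annotated[int, pydantic.Field(gt=10000000, lt=12000000)]


def iter_valid_models(
    df: pd.DataFrame,
    model: type[pydantic.BaseModel],
) -> Iterator[tuple[object, pydantic.BaseModel]]:
    """Hypothetical helper: yield (index, model instance) for valid rows only."""
    records = df.to_dict(orient="records")
    for index, record in zip(df.index, records):
        try:
            obj = model.model_validate(record)
        except pydantic.ValidationError:
            continue  # silently skip rows that do not satisfy the model
        yield index, obj


sample = pd.DataFrame(
    {
        "title": ["Dredd 3D", "Kinsey", "Poltergeist"],
        "binary": ["PASS", "PASS", "PASS"],
        "budget": [45_000_000, 11_000_000, 10_700_000],
    },
    index=[1, 960, 1711],
)

valid = list(iter_valid_models(sample, MovieToWatch))
for index, movie in valid:
    # Fields are available as typed attributes instead of positional values.
    print(index, movie.title, movie.budget)
```

Working with model instances means downstream code can rely on attribute access such as `movie.budget` rather than on column positions in a tuple.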