# Parsing Field Definitions from a Data Dictionary

For this project, the PLUTO parcel data set comes with a ~100-page data dictionary PDF. This PDF must
be parsed into structured data that can then be stored in the FAISS vector store.

## Parsing Text

The first step is to extract the raw text from the PDF. PyPDF2 is used to start; if it does not work well enough, a multi-modal approach can be tried.

In [1]:
from PyPDF2 import PdfReader

reader = PdfReader("../pluto/pluto_datadictionary.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[4]
text = page.extract_text()

In [2]:
print(text)

PLUTO DATA DICTIONARY April 2025 (25v1.1) 
 
5 
 Description:  The number of the tax lot. 
 
This field contains a one to four-digit tax lot number. 
 
Each tax lot is unique within a tax block (see TAX BLOCK). 
 
Special handling for condominiums: 
 In a condominium complex, each condominium  unit is a separate tax lot and has its 
own lot number. In a residential condominiu m, the condominium units are generally 
the individual apartments; in a commercial c ondominium, the units might be floors in 
an office building, individual retail shops, or blocks of office space. These unit lot 
numbers have values be tween 1001 – 6999.  
 Each unit tax lot has an associated billin g lot number, with values between 7501 – 
7599. Lots in a condominium complex on th e same block will have the same billing 
lot number. To make condominium inform ation more compatible with parcel 
information, the Department of City Pl anning aggregates condominium unit tax lot 
information to the billing lot. For 
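
The printed page carries a running header, a bare page number, and stray whitespace. A minimal cleanup sketch follows, assuming every page uses the same header as the one shown above (only one page was inspected here, so `HEADER_RE` is a guess). `clean_page_text` is a hypothetical helper, not part of PyPDF2. It does not repair words split mid-token such as "condominiu m", and it would also drop a legitimate line consisting only of a short number.

```python
import re

# Assumed patterns, based only on the single page printed above.
HEADER_RE = re.compile(r"^PLUTO DATA DICTIONARY\b.*$")
PAGE_NUMBER_RE = re.compile(r"^\d{1,3}$")


def clean_page_text(raw: str) -> str:
    """Drop the running header, bare page numbers, and blank lines; collapse space runs."""
    kept = []
    for line in raw.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        if not line or HEADER_RE.match(line) or PAGE_NUMBER_RE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = (
    "PLUTO DATA DICTIONARY April 2025 (25v1.1) \n \n5 \n"
    " Description:  The number of the tax lot. \n \n"
    "This field contains a one to four-digit tax lot number. "
)
print(clean_page_text(sample))
```

Whether this is worth doing depends on how well the model copes with the raw text; the results further down were produced without any cleanup.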

## Data Models

Reviewing the document, it looks like three different types of data "elements" can be extracted:

1. `FieldDefinition`: A field's name, description, format, etc.
1. `Abbreviation`: An abbreviation and its full value
1. `CodeLookup`: Code-to-value lookups

In [3]:
from pydantic import BaseModel, Field
from typing import Literal


class FieldDefinition(BaseModel):
    name: str = Field(description="The direct name of the field")
    name_pretty: str = Field(description="A well formatted name of the field")
    description: str = Field(description="The entire description for the field. Format as markdown if needed")
    source: str = Field(description="description of where the data comes from")
    format: Literal['str', 'int', 'float'] = Field(description="The python primitive type of the field")


class Abbreviation(BaseModel):
    abbreviation: str
    description: str


class Code(BaseModel):
    code: str
    description: str

class CodeLookup(BaseModel):
    name: str
    lookup: list[Code]


class DataDictionary(BaseModel):
    field_defintions: list[FieldDefinition]
    abbreviations: list[Abbreviation]
    codes: list[CodeLookup]
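
As a quick illustration of what these models enforce, the sketch below validates a hand-written JSON string against a trimmed copy of `FieldDefinition` and shows that a `format` value outside the `Literal` is rejected. It assumes Pydantic v2 (the `model_*` methods); the sample values, including the `source` string, are made up for the example.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class FieldDefinition(BaseModel):
    # Trimmed copy of the model above, without the Field descriptions.
    name: str
    name_pretty: str
    description: str
    source: str
    format: Literal["str", "int", "float"]


# Valid payload: parses into a typed object.
good = FieldDefinition.model_validate_json(
    '{"name": "lot", "name_pretty": "Lot", '
    '"description": "The number of the tax lot.", '
    '"source": "hypothetical source", "format": "int"}'
)
print(good.format)

# Invalid payload: "date" is not one of the allowed format values.
rejected = False
try:
    FieldDefinition(name="x", name_pretty="X", description="d", source="s", format="date")
except ValidationError as exc:
    rejected = True
    print(f"rejected with {len(exc.errors())} error(s)")
```

The `Literal` constraint means any field that is not a plain string, integer, or float (a date, say) has to be squeezed into one of those three values.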

## Attempting a Single Page

Next, try this response model on a single page using the OpenAI SDK.

In [10]:
import openai
import pathlib

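# Strip surrounding whitespace so a trailing newline in the key file does not end up in the key
client = openai.OpenAI(api_key=pathlib.Path("../openai.key").read_text().strip())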

system_message = """
You are an AI assistant specialized in extracting structured information from a data dictionary PDF. You will be given the complete, page-by-page text of a data dictionary (tables and entries may be split across pages). Your job is to:

1. Consolidate any multi-page or fragmented entries.
2. Extract all field definitions into FieldDefinition objects, capturing:
   - name
   - name_pretty
   - description (preserve markdown where appropriate)
   - source
   - format (one of 'str', 'int', or 'float')
3. Extract all abbreviations into Abbreviation objects.
4. Extract all code lookups into CodeLookup objects, each containing a list of Code items (code + description).

Produce exactly one JSON object matching the `DataDictionary` Pydantic schema (with keys `field_definitions`, `abbreviations`, and `codes`) and nothing else. Do not include any explanatory text or metadata—only the JSON output. 

"""

res = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': text}
    ],
    text_format=DataDictionary
)

In [11]:
print(res.output_parsed.model_dump_json(indent=2))

{
  "field_defintions": [
    {
      "name": "lot",
      "name_pretty": "Lot",
      "description": "The number of the tax lot.  \n\nThis field contains a one to four-digit tax lot number.  \n\nEach tax lot is unique within a tax block (see TAX BLOCK).  \n\nSpecial handling for condominiums:  In a condominium complex, each condominium unit is a separate tax lot and has its own lot number. In a residential condominium, the condominium units are generally the individual apartments; in a commercial condominium, the units might be floors in an office building, individual retail shops, or blocks of office space. These unit lot numbers have values between 1001 – 6999.  \n\nEach unit tax lot has an associated billing lot number, with values between 7501 – 7599. Lots in a condominium complex on the same block will have the same billing lot number. To make condominium information more compatible with parcel information, the Department of City Planning aggregates condominium unit tax lot infor

## Entire document

Next, try sending the entire document in one call. This may fail if the text is too large for the model, but it is worth a shot.

In [12]:
all_text = ""

for page_num, page in enumerate(reader.pages):
    all_text += f"## Page {page_num} \n"
    all_text += page.extract_text()
    all_text += "\n"

print(len(all_text))

108757
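
If a single call does turn out to be too large, one fallback is to group consecutive pages into batches under a character budget and send each batch separately. The sketch below shows only the batching step. `batch_pages` and the `max_chars` default are assumptions chosen for illustration, not tuned values. Entries that straddle a batch boundary would still need to be merged afterwards.

```python
def batch_pages(pages: list[str], max_chars: int = 40_000) -> list[str]:
    """Group consecutive page texts into batches of at most max_chars characters.

    A single page longer than max_chars still gets a batch of its own.
    """
    batches: list[str] = []
    current: list[str] = []
    current_len = 0
    for page_num, page_text in enumerate(pages):
        block = f"## Page {page_num}\n{page_text}\n"
        # Flush the running batch before it would exceed the budget.
        if current and current_len + len(block) > max_chars:
            batches.append("".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += len(block)
    if current:
        batches.append("".join(current))
    return batches


# Toy input: five fake "pages" of 30 characters each.
toy_pages = ["x" * 30 for _ in range(5)]
batches = batch_pages(toy_pages, max_chars=100)
print([len(b) for b in batches])
```

With the real document, the input would be something like `[p.extract_text() for p in reader.pages]`.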


In [13]:
res = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': all_text}
    ],
    text_format=DataDictionary
)

In [14]:
parsed_dictionary = res.output_parsed

print(f"N Fields: {len(parsed_dictionary.field_defintions)}")
print(f"N Abbreviations: {len(parsed_dictionary.abbreviations)}")
print(f"N Codes: {len(parsed_dictionary.codes)}")

N Fields: 92
N Abbreviations: 73
N Codes: 11


## Data Check

Manually checking a few of the parsed entries.

In [16]:
print(parsed_dictionary.field_defintions[15].model_dump_json(indent=2))

{
  "name": "SanitBoro",
  "name_pretty": "SANITATION DISTRICT BORO",
  "description": "The borough of the sanitation district that services the tax lot.",
  "source": "Department of City Planning – Geosupport System, Department of City Planning – Administrative District Base Map files",
  "format": "int"
}


In [17]:
print(parsed_dictionary.abbreviations[15].model_dump_json(indent=2))

{
  "abbreviation": "C",
  "description": "Special Grand Concourse Preservation District"
}


In [19]:
print(parsed_dictionary.codes[3].model_dump_json(indent=2))

{
  "name": "Land Use Categories",
  "lookup": [
    {
      "code": "01",
      "description": "One & Two Family Buildings"
    },
    {
      "code": "02",
      "description": "Multi-Family Walk-Up Buildings"
    },
    {
      "code": "03",
      "description": "Multi-Family Elevator Buildings"
    },
    {
      "code": "04",
      "description": "Mixed Residential & Commercial Buildings"
    },
    {
      "code": "05",
      "description": "Commercial & Office Buildings"
    },
    {
      "code": "06",
      "description": "Industrial & Manufacturing Buildings"
    },
    {
      "code": "07",
      "description": "Transportation & Utility"
    },
    {
      "code": "08",
      "description": "Public Facilities & Institutions"
    },
    {
      "code": "09",
      "description": "Open Space & Outdoor Recreation"
    },
    {
      "code": "10",
      "description": "Parking Facilities"
    },
    {
      "code": "11",
      "description": "Vacant Land"
    }
  ]
}
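
Spot checks like these can be complemented by a small automated pass. The sketch below works on the plain-dict form of the result (for example `parsed_dictionary.model_dump()`) and flags duplicate field names and empty descriptions. `summarize` is a hypothetical helper and the toy input is invented; note that the key is spelled `field_defintions`, as in the `DataDictionary` model.

```python
from collections import Counter


def summarize(dictionary: dict) -> dict:
    """Return duplicate field names and fields whose description is empty."""
    fields = dictionary["field_defintions"]  # key spelled as in the DataDictionary model
    name_counts = Counter(f["name"].lower() for f in fields)
    return {
        "duplicate_names": sorted(n for n, c in name_counts.items() if c > 1),
        "empty_descriptions": [f["name"] for f in fields if not f["description"].strip()],
    }


# Invented toy data, only to exercise the helper.
toy = {
    "field_defintions": [
        {"name": "Lot", "description": "The number of the tax lot."},
        {"name": "lot", "description": "Duplicate entry."},
        {"name": "SanitBoro", "description": "  "},
    ]
}
report = summarize(toy)
print(report)
```

A duplicate name would suggest that an entry split across pages was extracted twice rather than consolidated.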


## Conclusions

The parsed field definitions look OK for a first pass; this is something that could be improved later.

For the abbreviations and code lookups, it may be better to preserve the original format, so that the text
can be passed directly into the system message of the geo assistant for this project.

In [27]:
import json

export_path = pathlib.Path("../pluto/parsed_data_dictionary.json")
export_path.parent.mkdir(parents=True, exist_ok=True)
# Use a context manager so the file is flushed and closed after writing
with open(export_path, "w") as f:
    json.dump(parsed_dictionary.model_dump(), f, indent=2)
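
If the parsed lookups end up being used instead of the raw PDF text, one alternative is to render them as compact plain text before placing them in a system message. This is a sketch only: `render_lookups` and its output layout are assumptions, and the sample reuses the first two "Land Use Categories" entries printed earlier.

```python
def render_lookups(codes: list[dict]) -> str:
    """Render code lookups (in the dict form of CodeLookup) as indented plain text."""
    lines = []
    for lookup in codes:
        lines.append(f"{lookup['name']}:")
        for item in lookup["lookup"]:
            lines.append(f"  {item['code']} = {item['description']}")
        lines.append("")  # blank line between lookup tables
    return "\n".join(lines).rstrip()


sample_codes = [
    {
        "name": "Land Use Categories",
        "lookup": [
            {"code": "01", "description": "One & Two Family Buildings"},
            {"code": "02", "description": "Multi-Family Walk-Up Buildings"},
        ],
    }
]
rendered = render_lookups(sample_codes)
print(rendered)
```

With the exported file, the input would be the `codes` list from `parsed_data_dictionary.json`.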