# Building the Field Definition Vector Store

This notebook uses OpenAI embeddings and FAISS to build a vector store of field definitions.

## Parse Documents

We need a document store, along with ids and associated metadata. All of this is parsed from a previously parsed data dictionary JSON file (see the [field_def_parsing](field_def_parsing.ipynb) notebook).

In [15]:
import json
import pathlib

dict_path = pathlib.Path("../pluto/parsed_data_dictionary.json")

data_dictionary = json.loads(dict_path.read_text())
data_dictionary['field_defintions'][34]

{'name': 'OwnerName',
 'name_pretty': 'OWNER NAME',
 'description': 'The name of the owner of the tax lot. For publicly owned tax lots, owner names have been normalized.',
 'source': 'Department of Finance - Property Tax System (PTS), Department of City Planning – PLUTO_input_research.csv , field ownername',
 'format': 'str'}

In [19]:
docs = []

for i, field_def in enumerate(data_dictionary['field_defintions']):
    docs.append(
        {
            "id": i,
            "text": f"{field_def['name_pretty']}: {field_def['description']}",
            "metadata": {
                "source": field_def['source'],
                "format": field_def['format'],
                "name": field_def['name']
            }
        }
    )

print(json.dumps(docs[3], indent=2))

{
  "id": 3,
  "text": "COMMUNITY DISTRICT: The community district (CD) or joint interest area (JIA) for the tax lot. The city is divided into 59 community districts and 12 joint interest areas, which are large parks or airports not part of any community district. This field consists of three digits, the first of which is the borough code (see BORO CODE). The second and third digits are the community district or joint interest area number.",
  "metadata": {
    "source": "Department of City Planning \u2013 Geosupport System, Department of City Planning \u2013 Administrative District Base Map files",
    "format": "int",
    "name": "CD"
  }
}
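
The loop above could be wrapped in a small function so the same parsing step can be reused outside the notebook. This is a minimal sketch, not the notebook's actual code: the name `build_docs`, the required-key check, and the abbreviated sample record are assumptions added for illustration.

```python
def build_docs(field_defs):
    """Turn parsed field definitions into documents with ids and metadata."""
    # Keys the notebook reads from each field definition.
    required = ("name", "name_pretty", "description", "source", "format")
    docs = []
    for i, field_def in enumerate(field_defs):
        # Hypothetical safeguard: fail early if a definition is incomplete.
        missing = [key for key in required if key not in field_def]
        if missing:
            raise KeyError(f"field definition {i} is missing keys: {missing}")
        docs.append(
            {
                "id": i,
                "text": f"{field_def['name_pretty']}: {field_def['description']}",
                "metadata": {
                    "source": field_def["source"],
                    "format": field_def["format"],
                    "name": field_def["name"],
                },
            }
        )
    return docs


# Abbreviated sample record, shortened from the OwnerName entry shown earlier.
sample = [
    {
        "name": "OwnerName",
        "name_pretty": "OWNER NAME",
        "description": "The name of the owner of the tax lot.",
        "source": "Department of Finance - Property Tax System (PTS)",
        "format": "str",
    }
]
example_docs = build_docs(sample)
print(example_docs[0]["text"])
```

Keeping the id equal to the list position matches what the notebook does, which matters later because search results are looked up by that id.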


## Embed the Documents

We use OpenAI to embed the documents into a vector space.

In [31]:
import openai

# strip whitespace so a trailing newline in the key file is not sent as part of the key
client = openai.OpenAI(api_key=pathlib.Path("../openai.key").read_text().strip())

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[d['text'] for d in docs]
)
# attach each embedding vector to its document
for embedding_data, doc in zip(resp.data, docs):
    doc['embedding'] = embedding_data.embedding

## Create the Index

In [33]:
import faiss
import numpy as np

embeddings_np = np.array([doc['embedding'] for doc in docs], dtype="float32")
dim = embeddings_np.shape[1]

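# we'll use a simple (exact, brute-force) L2 index wrapped in an ID map
index = faiss.IndexFlatL2(dim)
index = faiss.IndexIDMap(index)

# assign ids; FAISS expects them as a 1-D int64 array
ids_np = np.array([doc['id'] for doc in docs], dtype="int64")
index.add_with_ids(embeddings_np, ids_np)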

# store metadata in a dict keyed by the same integer id
metadata_store = {doc['id']: doc["metadata"] for doc in docs}
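
To make clear what this index does at query time, here is a pure-Python sketch of the same idea: an exhaustive scan that ranks every stored vector by squared Euclidean distance and returns the caller-supplied ids. It is an illustration only, not FAISS's implementation; the function names and the toy 2-D vectors are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, vectors_by_id, k):
    """Return the k (distance, id) pairs closest to the query, nearest first."""
    scored = sorted(
        (squared_l2(query, vec), vec_id) for vec_id, vec in vectors_by_id.items()
    )
    return scored[:k]


# Toy data standing in for the embeddings; the ids play the role of doc['id'].
toy_vectors = {10: [1.0, 0.0], 11: [0.0, 1.0], 12: [0.6, 0.8]}
hits = brute_force_search([1.0, 0.0], toy_vectors, k=2)
print(hits)
```

A flat index like this is exact but scans every vector on each query, which is fine for a data dictionary with a few dozen fields.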

## Test the Search

In [36]:
query = "Show me where the buildings are less than 2 stories?"
q_emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query]
).data[0].embedding
q_np = np.array([q_emb], dtype="float32")

# retrieve top-2 nearest neighbors
k = 2
distances, neighbors = index.search(q_np, k)

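# neighbors holds FAISS ids (equal to list positions here); the "score" is a squared L2 distance, so lower is closer
for rank, (idx, dist) in enumerate(zip(neighbors[0], distances[0]), start=1):
    doc = docs[idx]
    meta = metadata_store[idx]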
    print(f"{rank}. id={doc['id']} (score={dist:.4f})")
    print(f"   text: {doc['text']}")
    print(f"   metadata: {meta}\n")

1. id=47 (score=1.2399)
   text: NUMBER OF FLOORS: The number of full and partial floors starting from the ground floor, for the tallest building on the tax lot. A partial floor is a floor that does not span the entire building envelope.
   metadata: {'source': 'Department of Finance – Property Tax System (PTS)', 'format': 'float', 'name': 'NumFloors'}

2. id=46 (score=1.2841)
   text: NUMBER OF BUILDINGS: The number of buildings on the tax lot. Calculated by taking the Building Identification Number (BIN) for every building in DoITT’s Building Footprints dataset and summing the number of buildings per tax lot.
   metadata: {'source': 'Department of Information Technology and Telecommunications – Building Footprints, Department of City Planning – Geosupport System, Department of Finance – Property Tax System (PTS)', 'format': 'int', 'name': 'NumBldgs'}
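
The scores printed above are squared L2 distances, which can be hard to read at a glance. Assuming the embeddings are unit length (OpenAI documents this for its embedding models, but this notebook does not verify it), the identity ‖a − b‖² = 2 − 2·cos(a, b) converts each score to a cosine similarity. The helper name below is hypothetical.

```python
def cosine_from_squared_l2(squared_distance):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    # Assumes both vectors have length 1, so ||a - b||^2 = 2 - 2 * cos(a, b).
    return 1.0 - squared_distance / 2.0


# Scores copied from the search output above.
for name, dist in [("NumFloors", 1.2399), ("NumBldgs", 1.2841)]:
    print(f"{name}: cosine similarity ~ {cosine_from_squared_l2(dist):.4f}")
```

Under that assumption, ranking by smallest L2 distance and ranking by largest cosine similarity give the same order.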



In [37]:
query = "Show me parcels with R2 zoning in Brooklyn."
q_emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query]
).data[0].embedding
q_np = np.array([q_emb], dtype="float32")

# retrieve top-2 nearest neighbors
k = 2
distances, neighbors = index.search(q_np, k)

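# neighbors holds FAISS ids (equal to list positions here); the "score" is a squared L2 distance, so lower is closer
for rank, (idx, dist) in enumerate(zip(neighbors[0], distances[0]), start=1):
    doc = docs[idx]
    meta = metadata_store[idx]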
    print(f"{rank}. id={doc['id']} (score={dist:.4f})")
    print(f"   text: {doc['text']}")
    print(f"   metadata: {meta}\n")

1. id=19 (score=1.1241)
   text: ZONING DISTRICT 1: The zoning district classification of the tax lot. Under the Zoning Resolution, the map of New York City is generally apportioned into three basic zoning district categories: Residence (R), Commercial (C) and Manufacturing (M), which are further divided into a range of individual zoning districts, denoted by different number and letter combinations.
   metadata: {'source': 'Department of City Planning NYC GIS Zoning Features', 'format': 'str', 'name': 'ZoneDist1'}

2. id=71 (score=1.1729)
   text: BORO CODE: The borough in which the tax lot is located. Each code represents a specific borough.
   metadata: {'source': 'Department of Finance - Property Tax System (PTS)', 'format': 'int', 'name': 'BoroCode'}
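
Both test cells repeat the same result loop, and both index `docs` by the returned id, which only works while ids equal list positions. A possible refactor is sketched below: it looks documents up by id in a dict and skips the `-1` ids that FAISS returns when it finds fewer than `k` neighbors. The function name, the returned dict layout, and the toy inputs are assumptions for illustration.

```python
def collect_hits(neighbor_ids, distances, docs_by_id):
    """Pair search results with their documents, skipping empty slots."""
    hits = []
    for rank, (doc_id, dist) in enumerate(zip(neighbor_ids, distances), start=1):
        doc_id = int(doc_id)  # FAISS returns numpy integers
        if doc_id == -1:
            # -1 marks a slot with no neighbor (k larger than the index size).
            continue
        doc = docs_by_id[doc_id]
        hits.append(
            {
                "rank": rank,
                "id": doc_id,
                "distance": float(dist),
                "name": doc["metadata"]["name"],
            }
        )
    return hits


# Toy inputs mirroring the second query's output, plus one empty slot.
# float("inf") is only a placeholder distance for that slot.
toy_docs_by_id = {
    19: {"metadata": {"name": "ZoneDist1"}},
    71: {"metadata": {"name": "BoroCode"}},
}
hits = collect_hits([19, 71, -1], [1.1241, 1.1729, float("inf")], toy_docs_by_id)
for hit in hits:
    print(hit)
```

In the notebook this would be called with `neighbors[0]`, `distances[0]`, and a dict such as `{doc['id']: doc for doc in docs}`.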



## Conclusions

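The search works well on these two test queries: each returns the expected field as the top hit (`NumFloors` for the stories query, `ZoneDist1` for the zoning query). The next step is to codify this notebook and the [field_def_parsing](field_def_parsing.ipynb) notebook.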