<a href="https://colab.research.google.com/github/saifromiitm/dsstuff/blob/main/Extractting_info_from_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Extraction

This notebook shows how to use an LLM to extract structured information from a dataset.

Let's say we have a list of famous addresses.

From each address, we want to extract the state name, the ZIP code (if it exists), and the country.

In [None]:
addresses = [
    {"place": "White House", "address": "1600 Pennsylvania Avenue, Washington DC"},
    {"place": "NYSE", "address": "11 Wall Street New York, NY"},
    {"place": "Empire State Building", "address": "350 Fifth Avenue New York, NY 10118"},
    {"place": "Hollywood sign", "address": "4059 Mt Lee Dr. Hollywood, CA 90068"},
    {"place": "Statue of Liberty", "address": "Statue of Liberty, Liberty Island New York, NY 10004"},
    {"place": "Fatehpur Sikri", "address": "Fatehpur Sikri, UP 283110, Agra"}
]

You need to define the [`OPENAI_API_KEY`](https://platform.openai.com/account/api-keys) in the Secrets tab of Colab to access the OpenAI models.

In [None]:
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')
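
Outside Colab, `google.colab` can't be imported, so the cell above fails. As a minimal sketch, the key could instead be read from an environment variable of the same name. The helper `load_api_key` below is hypothetical and not part of the original notebook.

```python
import os


def load_api_key(name="OPENAI_API_KEY"):
    """Return the key from Colab Secrets if possible, else from the environment.

    Hypothetical helper: the notebook itself only uses Colab Secrets.
    """
    try:
        # Only importable when running inside Colab
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(name)
        except Exception:
            # Secret missing or access not granted: fall back to the environment
            pass

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab Secrets or the environment")
    return key


# Usage (commented out so the cell does not fail when no key is configured):
# api_key = load_api_key()
```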

Let's send a request to the `gpt-3.5-turbo` model and ask for a JSON object.

Along with the address (the user message), we pass these instructions as the system prompt:

```
Extract the state name, ZIP code and country as JSON.
Use {"state_name": ..., "zip_code": ..., "country": 3-letter country code}
```

In [None]:
import json
import requests

def get_address(address):
    url = "https://api.openai.com/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    data = {
        "model": "gpt-3.5-turbo",
        "response_format": { "type": "json_object" },
        "messages": [
            {
                "role": "system",
                "content": """
Extract the state name, ZIP code and country as JSON.
Use {"state_name": ..., "zip_code": ..., "country": 3-letter country code}
"""
            },
            {
                "role": "user",
                "content": address
            }
        ]
    }

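    # Use a timeout and surface HTTP errors here instead of a confusing KeyError below
    response = requests.post(url, headers=headers, json=data, timeout=60)
    response.raise_for_status()
    result = response.json()
    # In JSON mode, the object arrives as a string in the message content
    return json.loads(result["choices"][0]["message"]["content"])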

get_address(addresses[0]['address'])

{'state_name': 'Washington D.C.', 'zip_code': '20500', 'country': 'USA'}

Now let's run this on all the addresses.

In [None]:
from copy import deepcopy
from tqdm import tqdm

addr = deepcopy(addresses)

for item in tqdm(addr):
    item.update(get_address(item["address"]))

100%|██████████| 6/6 [00:04<00:00,  1.23it/s]
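
A loop like this makes one network call per row, so a single timeout or malformed reply stops the whole run. One possible safeguard, sketched here under the assumption that retrying is acceptable for this workload, is a small retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, not something the notebook defines.

```python
import time


def with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponential backoff.

    Hypothetical helper. `sleep` is injectable so the wait can be stubbed out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            # Re-raise on the final attempt; otherwise wait 1x, 2x, 4x, ... base_delay
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the loop above, this would be used as `item.update(with_retries(get_address, item["address"]))`.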


In [None]:
import pandas as pd
pd.DataFrame(addr)

| | place | address | state_name | zip_code | country |
|---|---|---|---|---|---|
| 0 | White House | 1600 Pennsylvania Avenue, Washington DC | Washington DC | 20006 | USA |
| 1 | NYSE | 11 Wall Street New York, NY | New York | 10005 | USA |
| 2 | Empire State Building | 350 Fifth Avenue New York, NY 10118 | New York | 10118 | USA |
| 3 | Hollywood sign | 4059 Mt Lee Dr. Hollywood, CA 90068 | California | 90068 | USA |
| 4 | Statue of Liberty | Statue of Liberty, Liberty Island New York, NY... | New York | 10004 | USA |
| 5 | Fatehpur Sikri | Fatehpur Sikri, UP 283110, Agra | Uttar Pradesh | 283110 | IND |
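
The model can return a ZIP code that is not in the input. The White House address has none, yet the single call earlier returned 20500 and this batch run returned 20006. A crude way to flag such inferred values, sketched below, is to check whether the extracted code appears verbatim in the address text. `zip_in_address` is a hypothetical helper, and a plain substring test like this can also match a street number by accident.

```python
def zip_in_address(address, zip_code):
    """Return True if the extracted ZIP code appears verbatim in the address."""
    return bool(zip_code) and str(zip_code) in address


# Two rows copied from the results above
rows = [
    {"address": "1600 Pennsylvania Avenue, Washington DC", "zip_code": "20006"},
    {"address": "350 Fifth Avenue New York, NY 10118", "zip_code": "10118"},
]

for row in rows:
    # False means the model supplied the ZIP from its own knowledge
    row["zip_in_source"] = zip_in_address(row["address"], row["zip_code"])
```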


This was a simple example, and the model gets the JSON structure right. For a stronger guarantee of the output structure, especially when it is complex, it's best to describe it with a [JSON schema](https://json-schema.org/).

Suppose we want a nested structure like the one below. The cell after it defines the matching JSON schema.

```
{
    "state": {"name": "Washington DC", "code": "DC"},
    "country": {"name": "India", "code": "IND"},
    "zip": {"code": "..."}
}
```

In [None]:
# Target output
# {
#     "state": {"name": "Washington DC", "code": "DC"},
#     "country": {"name": "India", "code": "IND"},
#     "zip": {"code": "..."}
# }

schema = {
  "type": "object",
  "properties": {
    "state": {
      "type": "object",
      "description": "Details about the state",
      "properties": {
        "name": {
          "type": "string",
          "description": "Official state name"
        },
        "code": {
          "type": "string",
          "description": "Official state code"
        }
      },
      "required": ["name", "code"]
    },
    "country": {
      "type": "object",
      "description": "Details about the country",
      "properties": {
        "name": {
          "type": "string",
          "description": "Official country name"
        },
        "code": {
          "type": "string",
          "description": "3-letter country code"
        }
      },
      "required": ["name", "code"]
    },
    "zip": {
      "type": "object",
      "description": "Details about the ZIP code",
      "properties": {
        "code": {
          "type": "string",
          "description": "ZIP code"
        }
      },
      "required": ["code"]
    }
  },
  "required": ["state", "country", "zip"]
}
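
A schema like this can also be used on the receiving side, to check a parsed reply before trusting it. The sketch below is a deliberately small checker that understands only the keywords used in this notebook (`type` of `object` or `string`, `properties`, and `required`). It is not a full JSON Schema validator, and `find_schema_errors` and `zip_schema` are hypothetical names introduced for illustration.

```python
def find_schema_errors(value, schema, path="$"):
    """Return a list of human-readable problems; an empty list means the value fits."""
    errors = []
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        # Every required key must be present
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        # Recurse into the properties that are present
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(find_schema_errors(value[key], sub_schema, f"{path}.{key}"))
    elif expected == "string":
        if not isinstance(value, str):
            errors.append(f"{path}: expected string")
    return errors


# A cut-down schema covering only the "zip" part of the structure
zip_schema = {
    "type": "object",
    "properties": {
        "zip": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        }
    },
    "required": ["zip"],
}

ok = find_schema_errors({"zip": {"code": "10118"}}, zip_schema)
missing = find_schema_errors({"zip": {}}, zip_schema)
```

Here `ok` is an empty list, while `missing` reports that `$.zip.code` is absent.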

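We'll pass this schema to the model using an approach called [function calling](https://platform.openai.com/docs/guides/function-calling). The schema becomes the `parameters` of a function named `extract_address`, and `tool_choice` forces the model to call that function.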

In [None]:
def get_address_schema(address):
    url = "https://api.openai.com/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    data = {
        "model": "gpt-3.5-turbo",
        "response_format": { "type": "json_object" },
        "tools": [
          {"type": "function", "function": {"name": "extract_address", "description": "Extract address details", "parameters": schema}}
        ],
        "tool_choice": {"type": "function", "function": {"name": "extract_address"}},
        "messages": [
            {
                "role": "system",
                "content": "Get address as JSON via extract_address. If unsure, leave fields blank."
            },
            {
                "role": "user",
                "content": address
            }
        ]
    }

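    # Use a timeout and surface HTTP errors here instead of a confusing KeyError below
    response = requests.post(url, headers=headers, json=data, timeout=60)
    response.raise_for_status()
    result = response.json()
    # The function arguments arrive as a JSON string inside the first tool call
    return json.loads(result["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"])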

get_address_schema(addresses[0]['address'])

{'state': {'name': 'District of Columbia', 'code': 'DC'},
 'country': {'name': 'United States', 'code': 'USA'},
 'zip': {'code': ''}}
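
The function above indexes straight into `result`, which raises a bare `KeyError` if the API returns an error body or a message without tool calls. A slightly more defensive version of that last step might look like the sketch below. `parse_tool_arguments` is a hypothetical helper, and `sample` is a hand-written stand-in for the response shape, not real API output.

```python
import json


def parse_tool_arguments(result):
    """Extract the first tool call's arguments from a chat completion response dict."""
    error = result.get("error")
    if error:
        # Error responses carry an "error" object instead of "choices"
        raise RuntimeError(f"API error: {error}")
    message = result["choices"][0]["message"]
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        raise ValueError("No tool call in the response")
    # The arguments are a JSON-encoded string, not a dict
    return json.loads(tool_calls[0]["function"]["arguments"])


# Hand-written sample with the same shape as a function-calling response
sample = {
    "choices": [
        {
            "message": {
                "tool_calls": [
                    {
                        "function": {
                            "name": "extract_address",
                            "arguments": '{"zip": {"code": ""}}',
                        }
                    }
                ]
            }
        }
    ]
}

parsed = parse_tool_arguments(sample)
```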

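Here's another example, this time with an address that contains a fictitious state, Veridonia. The model returns the state name as given but leaves its code and the ZIP code blank, as the system prompt instructs.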

In [None]:
get_address_schema("1234 Elmwood Avenue, Pristina, Veridonia, Canada.")

{'state': {'name': 'Veridonia', 'code': ''},
 'country': {'name': 'Canada', 'code': 'CAN'},
 'zip': {'code': ''}}
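
Nested results like this don't map directly onto table columns. To put them in a table like the earlier one, each record can first be flattened into single-level keys. The sketch below shows one way to do that. `flatten` is a hypothetical helper, and the underscore-joined key names are a choice made here so that they resemble the earlier `state_name` and `zip_code` columns.

```python
def flatten(record, parent=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            # Recurse, carrying the joined prefix down
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# The output shown above
nested = {
    "state": {"name": "Veridonia", "code": ""},
    "country": {"name": "Canada", "code": "CAN"},
    "zip": {"code": ""},
}

flat = flatten(nested)
```

The flat dictionaries can then be merged into each row with `item.update(...)`, as in the earlier loop.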