# Leveraging large language models and vector databases for exploring life cycle inventory databases

## ChatGPT

Let's run a simple query.
For that, open [OpenAI Chat](https://chat.openai.com/), sign up/sign in, and create a new chat (it's free).

![ChatGPT: What is Ecoinvent?](chatgpt-1.png)

ChatGPT knows about Ecoinvent, great! The answer seems pretty satisfactory. Let's continue the conversation.

![ChatGPT: Cool, where is it located?](chatgpt-2.png)

### What is special about this query?

<details>
  
<summary><b>Answer</b></summary>
The conversation appears stateful: I wrote "it" and ChatGPT understood that I meant Ecoinvent.
The context of the conversation must be saved somehow.

</details>


### Now, how popular is Brightcon?

![ChatGPT: what is Brightcon?](chatgpt-3.png)

Sadly, it does not know. A first question then comes to mind:

### How do we teach ChatGPT / LLMs in general new knowledge? How would you do it with a "traditional ML model"?

<details>
  
<summary><b>Answer</b></summary>
<b>Fine-tuning</b>. Take the pretrained model and continue training it on the new data.

ChatGPT has offered fine-tuning only very recently. Regardless, it is quite hard and should not be the first solution that comes to mind.
</details>


### Let's feed ChatGPT with a little extra information

We do a quick Google search for recent LCA conferences, paste a few summaries into the prompt, and try again.

![ChatGPT: now with context, what is Brightcon?](chatgpt-4.png)

## Checkpoint

- LLMs can generate high-quality human-language answers, based on human-language input
- LLMs can be passed context at runtime, which they can use in their responses

## Moving to the API

Let's replicate the previous flows with the API.

In [1]:
import os

import openai

# os.environ["OPENAI_API_KEY"] = xxx

In [2]:
response1 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Ecoinvent?"}
    ]
)
response1

<OpenAIObject chat.completion id=chatcmpl-80ElnrIbYyijVktAD0WAzphbzr0ZH at 0x7fbc9c56c590> JSON: {
  "id": "chatcmpl-80ElnrIbYyijVktAD0WAzphbzr0ZH",
  "object": "chat.completion",
  "created": 1695067467,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Ecoinvent is a widely-used database that provides life cycle inventory (LCI) data for various products and processes. It contains comprehensive information on environmental impacts associated with the entire life cycle of a product, from raw material extraction to manufacturing, use, and disposal. Ecoinvent is commonly utilized in life cycle assessment (LCA) studies, where it allows researchers and practitioners to assess the environmental performance of products and processes and make informed decisions regarding sustainability."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 91,
 

In [3]:
response2 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Ecoinvent?"},
        {"role": "assistant", "content": response1["choices"][0]["message"]["content"]},
        {"role": "user", "content": "Where is it located?"}
    ]
)
response2

<OpenAIObject chat.completion id=chatcmpl-80ElreXteHiJXIdqDWwplHifA7gAG at 0x7fbc9c56cf50> JSON: {
  "id": "chatcmpl-80ElreXteHiJXIdqDWwplHifA7gAG",
  "object": "chat.completion",
  "created": 1695067471,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Ecoinvent is an online database and can be accessed from anywhere with an internet connection. It is not physically located in a specific place as it is hosted on servers and available to users worldwide. The database is managed and maintained by the Ecoinvent Centre, which is based in Switzerland. However, users can access the database and its resources remotely without the need to be physically present at the Ecoinvent Centre."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 127,
    "completion_tokens": 83,
    "total_tokens": 210
  }
}
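
The second call only works because the first question and the assistant's answer are sent again: the API itself keeps no session between calls. Below is a minimal sketch of that bookkeeping. `Conversation` and the `complete` callable are hypothetical names used for illustration (in this notebook, `complete` would wrap `openai.ChatCompletion.create` and return the assistant's text), and a stub stands in for the model so the snippet runs offline.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the message history client-side and resends it on every turn."""

    def __init__(
        self,
        complete: Callable[[List[Message]], str],
        system: str = "You are a helpful assistant.",
    ) -> None:
        # `complete` is a hypothetical hook: it receives the full history
        # and returns the assistant's reply as plain text.
        self._complete = complete
        self.messages: List[Message] = [{"role": "system", "content": system}]

    def ask(self, question: str) -> str:
        self.messages.append({"role": "user", "content": question})
        answer = self._complete(self.messages)
        # Store the reply so that the next question can refer back to it ("it").
        self.messages.append({"role": "assistant", "content": answer})
        return answer


def fake_complete(messages: List[Message]) -> str:
    # Stub standing in for the model: reports how many messages it was sent.
    return f"(reply based on {len(messages)} messages)"


chat = Conversation(fake_complete)
first = chat.ask("What is Ecoinvent?")
second = chat.ask("Where is it located?")
print(first)
print(second)
```

The history grows with every turn, which matters once token limits come into play.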

## What is the `system` role used for?

In [4]:
response_pirate = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a pirate."},
        {"role": "user", "content": "What is Ecoinvent?"}
    ]
)
response_pirate

<OpenAIObject chat.completion id=chatcmpl-80ElvAvVdthUhKTyUOHQaQS9n1pXT at 0x7fbc9c56d1f0> JSON: {
  "id": "chatcmpl-80ElvAvVdthUhKTyUOHQaQS9n1pXT",
  "object": "chat.completion",
  "created": 1695067475,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Arr, Ecoinvent be a database, matey. It be containin' detailed information about the environmental impacts of various products and processes. It be helpin' businesses and researchers analyze and reduce their environmental footprints, makin' their operations more sustainable. 'Tis a valuable resource for those lookin' to make the world a greener place, me heartie."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 77,
    "total_tokens": 99
  }
}

### Breaking down the response

```javascript
{
    "usage": {
        "prompt_tokens": 148,
        "completion_tokens": 95,
        "total_tokens": 243
    }
}
```

What are these **tokens**?

OpenAI provides a [nice web app OpenAI Tokenizer](https://platform.openai.com/tokenizer) to understand how words are broken down into tokens.

They say:

```
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
```

![Tokenizer words](tokenizer-1.png)

### Tokens map sequences of characters to integers

![Tokenizer words](tokenizer-2.png)

### And programmatically with tiktoken

In [5]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [6]:
tokens = enc.encode("What is Ecoinvent?")
tokens

[3923, 374, 469, 7307, 688, 30]

In [7]:
[enc.decode([token]) for token in tokens]

['What', ' is', ' E', 'coin', 'vent', '?']
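
To make the string-to-integer mapping concrete, here is a toy tokenizer. It is only a sketch: the six-entry vocabulary and the integer ids are made up for this example, and the greedy longest-match lookup is a simplification, not the byte pair encoding that `tiktoken` actually implements.

```python
from typing import Dict, List

# Toy vocabulary, made up for this example; real vocabularies are far larger.
TOY_VOCAB: Dict[str, int] = {"What": 0, " is": 1, " E": 2, "coin": 3, "vent": 4, "?": 5}
TOY_IDS: Dict[int, str] = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match lookup against the toy vocabulary."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"No token covers the text at position {pos}")
    return ids


def toy_decode(ids: List[int]) -> str:
    """Map each id back to its string piece and concatenate."""
    return "".join(TOY_IDS[token_id] for token_id in ids)


ids = toy_encode("What is Ecoinvent?")
print(ids)
print(toy_decode(ids))
```

Decoding is just the reverse lookup, which is why the round trip gives back the original string.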

### Why do we care about tokens here?

<details>
  
<summary><b>Answer</b></summary>
Because there is a limit to the size of the context window, and it is measured in tokens.
Also, the pricing is per token.

</details>
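
Both constraints can be estimated before sending a request. The sketch below uses OpenAI's rule of thumb of about 4 characters per token quoted above. The function names, the 4096-token window, and the price are assumptions chosen for illustration, so check the documentation of the model you use; for exact counts, use `tiktoken` as shown earlier.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(prompt: str, context_window: int, reserved_for_answer: int) -> bool:
    """True if the estimated prompt size leaves room for the answer."""
    return estimate_tokens(prompt) + reserved_for_answer <= context_window


def estimate_cost(n_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of n_tokens at a given price per 1000 tokens."""
    return n_tokens / 1000 * price_per_1k_tokens


prompt = "What is Ecoinvent?"
n_tokens = estimate_tokens(prompt)  # 18 characters -> estimate of 5 tokens
print(n_tokens)

# Assumed values for illustration only: a 4096-token window, 500 tokens kept
# for the answer, and a made-up price of 0.002 per 1000 tokens.
print(fits_in_context(prompt, context_window=4096, reserved_for_answer=500))
print(estimate_cost(n_tokens, price_per_1k_tokens=0.002))
```

The estimate (5) is close to, but not the same as, the 6 tokens that `tiktoken` counted for the same string, so treat it as a budget check rather than an exact figure.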


## Checkpoint

- Text is split into tokens before being processed. Tokens are just mappings from strings to integers. Together they define an exhaustive vocabulary.
- There is a context window: we can't pass an unlimited amount of data to LLMs.

## Which applications can you think of in the context of LCAs?

Here is one I used for Brightcon:

- Text generation from a small description

<details>
  
<summary><b>Answer</b></summary>

Let's describe the process of doing an LCA:

1. Collect imperfect data from a customer - for instance a bunch of Excel files with random names for units (seen at the Hackathon): `kg CH4`, `PCS`. <b>Covered application 1: "kg CH4" -> "kg" and "PCS" -> "Item(s)"</b>
2. Create a supply chain in an LCA software, mapping this "company data" to LCI data (like Ecoinvent, but we'll use ELCD here).
3. From the numerous processes in the LCI database, pick the one that most closely matches your data (or refine this simplified approach). <b>Covered application 2: Search and match</b>
4. Compute impacts and do something with them

</details>


# Application 1. Mapping vocabularies

- Company data -> LCI format (units for instance)
- One ontology to another ontology (suggested through the Hackathon)

## Dataset

We assume you did the following:

- Go to [OpenLCA Nexus](https://nexus.openlca.org/)
- Create an account / sign in and download ELCD (a retired free EU LCI database)
- Download OpenLCA V2
- Click "Database > Restore database" and pick the ELCD ".zolca" file you've just downloaded
- Then click "File > Export > JSON-LD"
- Unzip the JSON-LD zip you now have, under `datasets/elcd/`

For licensing reasons, I cannot provide the data as is; you need to follow these steps.

Your folder structure should look like

```bash
tree datasets/ -L 2
datasets/
└── elcd
    ├── actors
    ├── bin
    ├── categories
    ├── context.json
    ├── currencies
    ├── dq_systems
    ├── flow_properties
    ├── flows
    ├── lcia_categories
    ├── lcia_methods
    ├── locations
    ├── meta.info
    ├── nw_sets
    ├── processes
    ├── sources
    └── unit_groups
```

In [8]:
# Here type the datasets root path
DSROOT="/home/selim/code/brightcon/brightcon-2023-llm/datasets"

In [9]:
import glob
import json
import os

from tqdm.notebook import tqdm

def load_units(root: str):
    units_set = set()
    for unit_group_fp in tqdm(glob.glob(os.path.join(root, "unit_groups", "*.json"))):
        with open(unit_group_fp) as unit_group_f:
            unit_group = json.load(unit_group_f)
        for unit in unit_group["units"]:
            units_set.add(unit["name"])
    return units_set

In [10]:
units_set = load_units(os.path.join(DSROOT, "elcd"))
print(f"N units: {len(units_set)}")
print(units_set)

  0%|          | 0/42 [00:00<?, ?it/s]

N units: 202
{'lb av', 'bl (US beer)', 'm2 yr eq organic arable land', 'KRW 2000', '$', 'in3', 'sh tn', 'mBq', 'm2*a', 'dr (Av)', 'ng', 'MJ', 'kJ', 'kt*km', 'l*d', 'pk', 'Ci', 'dal', 'mm3', 'Items*mi', 'EUR 2000', 'CHF 2000', 'u', 'sFr', 'mi2*a', 'm2', 'lb*mi', 'fur', 'btu', 'nmi2', 'dm2', 'a', 'kcal', 'kg R11-Equiv.', 'gill', 'd', 'kg SO2-Equiv.', 'USD 2002', 'µBq', 'qt (US dry)', 'SEK 2000', 'h', 'Nm3', 'nBq', '(mm*m2)/a', 'in2', 'bbl', 'MWh', 'fl oz (Imp)', 'g', 'Items*a', 'kg Ethene-Equiv.', 'bsh (US)', 't*a', 'TJ', 'lb*nmi', 'Yen', 'pt (US fl)', 'CHF 2005', 'mi', 'gal (Imp)', 'kg DCB-Equiv.', 'ch', 'J', 'kWh', 'USD 2000', 'cl', 'dr (Fl)', 't*km', 's', 'gr', 'bl (Imp)', 'mg', 'kg*d', 'GBP 2000', 'mi*a', 'CZK 2000', 'mm2', 'Dozen(s)', 'm3*d', 'ZAR 2000', 'l*a', 'min', 'ftm', '(cmol*m2*a)/kg', 'cg', '(cmol*m2)/kg', 'GJ', 'DKK 2000', 'Bq', 'CAD 2000', 't', 'cm2a', 'yd3', 't*mi', 'bl (US fl)', 'Wh', 'pt (US dry)', 'long tn', 'Rutherford', 'TCE', 'hl', 'm3*km', 'Item(s)', 'dm', 'mm2a', 

## Objective: convert "kg N" into a unit from ELCD above

Concrete goal: make a function

```python
from typing import Set

def map_unit(source_unit: str, units: Set[str]) -> str:
    return dest_unit
```

[Source from Hestia](https://www.hestia.earth/glossary?page=1&query=Excreta%20(kg%20N))

![Hestia kg N](hestia-1.png)

In [11]:
from typing import Set

def map_unit_1(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ]
    )
    return response["choices"][0]["message"]["content"]

In [12]:
map_unit_1("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'The unit "kg N" corresponds to the kilogram of nitrogen.'

In [13]:
map_unit_1("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'kg N corresponds to kilograms of nitrogen.'

In [14]:
map_unit_1("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'The unit "kg N" corresponds to kilograms of nitrogen.'

### Notice a difference?

We don't always get the same output: there is randomness in the generation. This is a bit scary.

We can reduce the randomness by setting `temperature=0`.

From now on this will be the default setting.
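
To see why temperature controls randomness, recall that the model picks each next token from a probability distribution over candidates. A common formulation divides the token scores (logits) by the temperature before applying a softmax. The sketch below illustrates that formulation with made-up scores; treating `temperature=0` as "always pick the top token" is an assumption about the limit case, not a statement about how the OpenAI API is implemented internally.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Limit case: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the top candidate, so repeated calls agree more often. In practice, outputs may still vary slightly even at `temperature=0`.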

### Seems to have worked! But I'd like to use that in my code, how do I do that?

<details>
  
<summary><b>Answer</b></summary>

- Build a regex? Not robust.
- Ask the LLM to return a structured output? Yes please.

</details>


In [15]:
def map_unit_2(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON"
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    return response["choices"][0]["message"]["content"]

In [16]:
map_unit_2("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'{\n  "kg N": "kg of Nitrogen"\n}'

### Cool, that's an improvement, let's parse it

In [17]:
def map_unit_3(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON"
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)
    return parsed_output[source_unit]


In [18]:
map_unit_3("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'kg of Nitrogen'

In [19]:
def map_unit_4(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON with key 'unit' and value the unit from the list of units between the backticks"
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)
    return parsed_output["unit"]


In [20]:
map_unit_4("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'kg N'

In [21]:
def map_unit_5(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON with key 'unit' and value the unit from the list of units between the backticks"
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)
    return parsed_output["unit"]


In [22]:
map_unit_5("kg N", units_set)

The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

'kg'
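
Nothing in the functions so far checks that the returned value really belongs to the allowed list, and earlier attempts returned strings such as `'kg of Nitrogen'`. A small guard makes that failure explicit. This is a sketch; `check_allowed_unit` is a hypothetical helper name, and the three-unit set is a made-up subset used only so the example runs on its own.

```python
from typing import Set


def check_allowed_unit(unit: str, units_set: Set[str]) -> str:
    """Return the unit (stripped of surrounding whitespace) if allowed, else raise."""
    candidate = unit.strip()
    if candidate not in units_set:
        raise ValueError(f"{candidate!r} is not in the list of allowed units")
    return candidate


allowed = {"kg", "g", "Item(s)"}  # small made-up subset for the example
print(check_allowed_unit("kg", allowed))

try:
    check_allowed_unit("kg of Nitrogen", allowed)
except ValueError as err:
    print(err)
```

Raising early lets the caller decide whether to retry the request with a stricter prompt or to flag the unit for manual review.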

## This still feels really brittle. Can we type this?



In [23]:
import pydantic

PYDANTIC_FORMAT_INSTRUCTIONS = """The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {{"properties": {{"foo": {{"title": "Foo", "description": "a list of strings", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}
the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of the schema. The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted.

Here is the output schema:
```
{schema}
```"""

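class UnitResponse(pydantic.BaseModel):
    # Note: the string is passed positionally, so pydantic stores it as the field's
    # default value (see 'default' in the schema output), not as its description.
    # `pydantic.Field(..., description="...")` would make it a required, described field.
    unit: str = pydantic.Field("the mapped unit from the provided exhaustive list")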

In [24]:
UnitResponse.schema()

{'title': 'UnitResponse',
 'type': 'object',
 'properties': {'unit': {'title': 'Unit',
   'default': 'the mapped unit from the provided exhaustive list',
   'type': 'string'}}}

In [25]:
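from typing import Type

def describe_pydantic_schema_as_str(model: Type[pydantic.BaseModel]) -> str:
    """Serialize the model's JSON schema, dropping titles and collapsing described fields."""
    schema = model.schema()
    for key, value in schema.get("properties", {}).items():
        # Titles only repeat the field name, so drop them.
        value.pop("title", None)
        # Replace a field that has a description by the description string itself.
        if "type" in value and "description" in value:
            schema["properties"][key] = value["description"]
    return json.dumps(schema)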

In [26]:
describe_pydantic_schema_as_str(UnitResponse)

'{"title": "UnitResponse", "type": "object", "properties": {"unit": {"default": "the mapped unit from the provided exhaustive list", "type": "string"}}}'

In [27]:
def map_unit_6(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    return json.loads(output_json_str)

In [29]:
map_unit_6("kg N", units_set)

The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

{'unit': 'kg'}

In [30]:
def map_unit_7(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse.parse_obj(parsed_output)
    
    return parsed_output

In [31]:
map_unit_7("kg N", units_set)

The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

{'unit': 'kg'}

### What if the source unit is a sub/super unit?

Let's make our schema more complex and see if this works out of the box.

In [32]:
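class UnitResponse2(pydantic.BaseModel):
    # As with UnitResponse, these strings are positional defaults, not descriptions.
    source_unit: str = pydantic.Field("the source unit")
    allowed_unit: str = pydantic.Field("the allowed unit")
    # Typed as int, so a fractional factor (e.g. 0.001) cannot be represented.
    conversion_factor: int = pydantic.Field("the conversion factor from the provided unit to the mapped unit one")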


def map_unit_8(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse2)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does source unit: `{source_unit}` correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse2.parse_obj(parsed_output)
    
    return parsed_output

In [33]:
map_unit_8("kg N", units_set)

The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

{'source_unit': 'kg N', 'allowed_unit': 'kg', 'conversion_factor': 1}
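
For `kg N` the factor is 1, but a sub-unit needs a fractional factor (1 g is 0.001 kg), which an `int` cannot represent. The sketch below shows how such a mapping could be applied once it has been parsed. `UnitMapping` is a hypothetical plain dataclass with a `float` factor, written for illustration; it is not the `UnitResponse2` model used in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitMapping:
    source_unit: str
    allowed_unit: str
    conversion_factor: float  # multiply a source amount by this factor

    def convert(self, amount: float) -> float:
        """Express an amount given in source_unit in allowed_unit."""
        return amount * self.conversion_factor


grams_to_kg = UnitMapping(source_unit="g", allowed_unit="kg", conversion_factor=0.001)
print(grams_to_kg.convert(2500))
```

The factor is a value the LLM would have to produce, so it is worth checking independently before using it in a calculation.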

In [34]:
def map_unit_9(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse2)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does source unit: `{source_unit}` correspond to?

    {response_formatter}
    Also, provide a detailed explanation.
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse2.parse_obj(parsed_output)
    
    return parsed_output

In [35]:
map_unit_9("kg N", units_set)

The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

JSONDecodeError: Extra data: line 7 column 1 (char 79)
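
The error message hints at the cause: `json.loads` read a complete JSON object and then found more text after it, which is what happens once we ask for an explanation. A minimal reproduction, assuming a made-up reply (`reply` is hypothetical, since the real model output is truncated above):

```python
import json

# Hypothetical model reply: valid JSON followed by the explanation we asked for.
reply = '{"allowed_unit": "kg"}\nExplanation: "kg N" is a mass of nitrogen.'

try:
    json.loads(reply)
    error_message = None
except json.JSONDecodeError as exc:
    # json.loads refuses any non-whitespace text after the first JSON document.
    error_message = exc.msg

print(error_message)
```

So the JSON itself is probably fine; it is the surrounding prose that breaks the parser.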

### Too brittle! How do we fix this?

<details>
    <summary><b>Answer</b></summary>
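    - Add a regex to extract the JSON object from the text response before parsing it. The explanation we asked for comes after the JSON, which is why <code>json.loads</code> fails with "Extra data".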
</details>

In [36]:
import re

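def extract_json(text: str) -> str:
    # Greedy match from the first "{" to the last "}".
    # re.DOTALL lets "." match newlines, so multi-line JSON is captured.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        # Fail with a clear message instead of passing None to json.loads
        raise ValueError(f"No JSON object found in: {text!r}")

    return match.group()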

def map_unit_10(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse2)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does source unit: `{source_unit}` correspond to?

    {response_formatter}
    Also, provide a detailed explanation.
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_str}")
    
    output_json_str = extract_json(output_str)
    print(f"Output JSON: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse2.parse_obj(parsed_output)
    
    return parsed_output

In [None]:
map_unit_10("kg N", units_set)

The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg
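
To look at the extraction step on its own, without calling the API, here is a small sketch. `sample_reply` is a hypothetical model response, not one captured from ChatGPT. `extract_json_regex` mirrors the greedy regex used above. `extract_json_raw_decode` is an alternative that is not part of this notebook: it uses the standard library's `json.JSONDecoder().raw_decode`, which parses the first JSON document and ignores whatever follows.

```python
import json
import re

# Hypothetical model reply: JSON first, free-text explanation after.
sample_reply = """{
    "source_unit": "kg N",
    "allowed_unit": "kg",
    "conversion_factor": 1
}

Explanation: "kg N" is a mass of nitrogen, so it maps to kg."""


def extract_json_regex(text: str) -> str:
    """Greedy regex, as in the notebook: first "{" to last "}"."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found")
    return match.group()


def extract_json_raw_decode(text: str) -> dict:
    """Alternative sketch: parse the first JSON object, ignore trailing text."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


print(json.loads(extract_json_regex(sample_reply)))
print(extract_json_raw_decode(sample_reply))

# If the explanation itself contains braces, the greedy regex grabs too much.
tricky_reply = sample_reply + " See also {kg}."
try:
    json.loads(extract_json_regex(tricky_reply))
    regex_survived = True
except json.JSONDecodeError:
    regex_survived = False
print(regex_survived, extract_json_raw_decode(tricky_reply))
```

The regex is good enough for a workshop, but it is still brittle when the model puts braces in its explanation. Parsing only the first JSON document is one way around that.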

## What we built is a "chain"

See [LangChain chains](https://python.langchain.com/docs/modules/chains/).

The chain was:

- Prompt
- Input variables: the source unit and the set of allowed units
- Structure required in the output: the Pydantic format instructions
- Output parsing: regex + Pydantic validation
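
The four pieces listed above can be wired together as plain functions. This is only a sketch under some assumptions: `fake_llm` is a hypothetical stand-in for the `openai.ChatCompletion.create` call so that it runs offline, the format instructions are shortened, and a simple key and membership check replaces the Pydantic validation. It is not how LangChain implements chains.

```python
import json
import re
from typing import Callable, Set

REQUIRED_KEYS = {"source_unit", "allowed_unit", "conversion_factor"}


def build_prompt(source_unit: str, units_set: Set[str]) -> str:
    # Prompt template + input variables + required output structure.
    fence = "`" * 3
    return (
        "Between the backticks is an exhaustive list of allowed units:\n"
        f"{fence}\n{', '.join(sorted(units_set))}\n{fence}\n"
        f"Which of these units does source unit: `{source_unit}` correspond to?\n"
        f"Answer as JSON with the keys: {', '.join(sorted(REQUIRED_KEYS))}.\n"
        "Also, provide a detailed explanation."
    )


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for the chat completion call: JSON plus prose.
    return (
        '{"source_unit": "kg N", "allowed_unit": "kg", "conversion_factor": 1}\n'
        "Explanation: the source unit is a mass of nitrogen."
    )


def parse_output(text: str, units_set: Set[str]) -> dict:
    # Output parsing: regex extraction, then validation.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in the model output")
    parsed = json.loads(match.group())
    missing = REQUIRED_KEYS - parsed.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Extra check (an assumption, not in the notebook): the unit must be allowed.
    if parsed["allowed_unit"] not in units_set:
        raise ValueError(f"unit not allowed: {parsed['allowed_unit']}")
    return parsed


def run_chain(source_unit: str, units_set: Set[str], llm: Callable[[str], str]) -> dict:
    # The chain: prompt -> model -> parser.
    return parse_output(llm(build_prompt(source_unit, units_set)), units_set)


result = run_chain("kg N", {"kg", "g", "MJ"}, fake_llm)
print(result)
```

Passing the model as a callable keeps the chain testable: swap `fake_llm` for a function that calls the real API and nothing else needs to change.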