# Leveraging large language models and vector databases for exploring life cycle inventory databases

## ChatGPT

Let's run a simple query.
For that, open [OpenAI Chat](https://chat.openai.com/), sign up/sign in, and create a new chat (it's free).

![ChatGPT: What is Ecoinvent?](chatgpt-1.png)

ChatGPT knows about Ecoinvent, great! The answer seems pretty satisfactory. Let's continue the conversation.

![ChatGPT: Cool, where is it located?](chatgpt-2.png)

### What is special about this query?

<details>
  
<summary><b>Answer</b></summary>
This seems stateful: I only wrote "it", yet ChatGPT answered about Ecoinvent, as if it kept a session of my conversation.
The context must be saved somehow.

</details>
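
One plausible way to picture this "session" is a list of messages that grows with every turn and is sent again in full with each new question (the API section later in this notebook does exactly that by hand). The sketch below is only an illustration: `Conversation` and `fake_complete` are hypothetical names, and `fake_complete` stands in for a real model call so that the code runs offline.

```python
class Conversation:
    """Keeps the whole message history and resends it on every turn."""

    def __init__(self, system_prompt="You are a helpful assistant."):
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, question, complete):
        # Add the new user turn to the history.
        self.messages.append({"role": "user", "content": question})
        # The model sees the full history, not just the last question.
        answer = complete(list(self.messages))
        # Store the answer so that later turns can refer to it.
        self.messages.append({"role": "assistant", "content": answer})
        return answer


def fake_complete(messages):
    # Hypothetical stand-in for a model call: it only reports how much
    # history it received.
    return f"(reply based on {len(messages)} messages)"


chat = Conversation()
first = chat.ask("What is Ecoinvent?", fake_complete)
second = chat.ask("Where is it located?", fake_complete)
print(first)
print(second)
```

The second call receives the first question and its answer as well, which is what would let a real model resolve "it".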


### Now, how popular is Brightcon?

![ChatGPT: what is Brightcon?](chatgpt-3.png)

Sadly, it does not know. A first question then comes to mind:

### How do we teach ChatGPT / LLMs in general new knowledge? How would you do it with a "traditional ML model"?

<details>
  
<summary><b>Answer</b></summary>
<b>Fine-tuning</b>. Take the pretrained model and continue training it on the new data.

OpenAI only very recently started offering fine-tuning for the models behind ChatGPT. Regardless, fine-tuning is quite hard and should not be the first solution that comes to mind.
</details>


### Let's feed ChatGPT with a little extra information

We do a quick Google search for recent LCA conferences, paste a few summaries into the prompt, and try again.

![ChatGPT: now with context, what is Brightcon?](chatgpt-4.png)
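
What we did by hand here (pasting summaries above the question) can be sketched as a small prompt builder. This is a minimal illustration, not the exact prompt used in the screenshot: `build_prompt_with_context`, the instruction wording, and the placeholder snippets are all assumptions.

```python
def build_prompt_with_context(question, snippets):
    """Prepend numbered context snippets to a question."""
    context = "\n".join(
        f"[{i}] {snippet.strip()}" for i, snippet in enumerate(snippets, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder snippets: in the session, these were summaries pasted from a web search.
snippets = ["<summary of search result 1>", "<summary of search result 2>"]
prompt = build_prompt_with_context("What is Brightcon?", snippets)
print(prompt)
```

The resulting string is what gets pasted into the chat box, or sent as the user message once we move to the API.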

## Checkpoint

- LLMs can generate high-quality human-language answers, based on human-language input
- LLMs can be passed context at runtime, which they can then use in their responses

## Moving to the API

Let's replicate the previous flows with the API.

In [1]:
import os

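import openai  # the calls below use the pre-1.0 interface of the `openai` package (`openai.ChatCompletion`)

# Uncomment and set your API key if it is not already in the environment:
# os.environ["OPENAI_API_KEY"] = xxx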

In [3]:
response1 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Ecoinvent?"}
    ]
)
response1

<OpenAIObject chat.completion id=chatcmpl-80D7zOhuJZzS5qBEUI2tknSm13GA4 at 0x7f5b185d0ef0> JSON: {
  "id": "chatcmpl-80D7zOhuJZzS5qBEUI2tknSm13GA4",
  "object": "chat.completion",
  "created": 1695061155,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Ecoinvent is a database that provides life cycle inventory (LCI) data for various materials and processes. It is one of the most widely used LCI databases in the world and includes comprehensive data on energy, resource use, emissions, and other environmental impacts associated with the production of goods and services. Researchers, companies, and policymakers use Ecoinvent to assess and compare the environmental performance of products, inform sustainability assessments, and support decision-making processes. It helps users understand the environmental implications of their activities and enables them to make more informed and sustainable choices

In [5]:
response2 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Ecoinvent?"},
        {"role": "assistant", "content": response1["choices"][0]["message"]["content"]},
        {"role": "user", "content": "Where is it located?"}
    ]
)
response2

<OpenAIObject chat.completion id=chatcmpl-80D8rPcECNL8iQdiUHY05RMTQr9In at 0x7f5b181c8ef0> JSON: {
  "id": "chatcmpl-80D8rPcECNL8iQdiUHY05RMTQr9In",
  "object": "chat.completion",
  "created": 1695061209,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Ecoinvent is located in Switzerland. It is managed by the Swiss Centre for Life Cycle Inventories, a non-profit organization based in Zurich. The database is continuously updated and maintained by a team of experts who collect data from various sources, verify and validate it, and ensure its accuracy and relevance. Ecoinvent has gained international recognition and is widely used by researchers, companies, and organizations around the world for life cycle assessment (LCA) studies and environmental impact evaluations."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 148,
    "completion_tokens": 95,
    "total_to

### Breaking down the response

```javascript
{
  "usage": {
    "prompt_tokens": 148,
    "completion_tokens": 95,
    "total_tokens": 243
  }
}
```

What are these **tokens**?

OpenAI provides a nice web app, the [OpenAI Tokenizer](https://platform.openai.com/tokenizer), to see how text is broken down into tokens.

OpenAI's rule of thumb:

```
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
```
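
The rule of thumb quoted above can be turned into a quick estimator. This is only a rough sketch for common English text; the function names are made up for illustration, and the real count always comes from the tokenizer itself.

```python
import math


def estimate_tokens_from_chars(text):
    """Rule of thumb: one token is roughly 4 characters."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text):
    """Rule of thumb: one token is roughly 3/4 of a word (100 tokens ~= 75 words)."""
    n_words = len(text.split())
    return math.ceil(n_words * 4 / 3)


sentence = "Ecoinvent is a database that provides life cycle inventory data."
print(estimate_tokens_from_chars(sentence))
print(estimate_tokens_from_words(sentence))
```

The two estimates will usually differ a little, and both can be far off for code, numbers, or non-English text.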

![Tokenizer words](tokenizer-1.png)

### Tokens map sequences of characters to integers

![Tokenizer words](tokenizer-2.png)

### And programmatically with tiktoken

In [16]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [17]:
tokens = enc.encode("What is Ecoinvent?")
tokens

[3923, 374, 469, 7307, 688, 30]

In [19]:
[enc.decode([token]) for token in tokens]

['What', ' is', ' E', 'coin', 'vent', '?']
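
To make the "pieces of text to integers" idea concrete, here is a toy tokenizer built only from the six pieces and IDs shown above. It is a deliberately simplified sketch: it uses greedy longest-match lookup on a six-entry vocabulary, which is not how `tiktoken` works internally, and `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names.

```python
# Pieces and IDs copied from the tiktoken output above.
TOY_VOCAB = {"What": 3923, " is": 374, " E": 469, "coin": 7307, "vent": 688, "?": 30}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Greedy longest-match lookup (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary piece that matches at the current position.
        match = max(
            (piece for piece in TOY_VOCAB if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token for text at position {pos}: {text[pos:]!r}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def toy_decode(ids):
    """Map each integer back to its piece and concatenate."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = toy_encode("What is Ecoinvent?")
print(ids)
print(toy_decode(ids))
```

A real vocabulary contains enough pieces (down to single bytes) that any input can be encoded; the toy one raises an error as soon as the text leaves these six pieces.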

### Why do we care about tokens here?

<details>
  
<summary><b>Answer</b></summary>
Because there is a limit to the size of the context window, and it is measured in tokens.
Also, the pricing is per token.

</details>
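
Both points can be sketched with a few lines of arithmetic on the `usage` numbers from the response above. The window size and the prices below are placeholders chosen for illustration, not values taken from this notebook; check the documentation and pricing page for the model you actually use. The sketch also assumes that prompt and completion share the same window.

```python
CONTEXT_WINDOW = 4096            # assumed token limit for the model
PRICE_PER_1K_PROMPT = 0.0015     # placeholder price (USD per 1,000 prompt tokens)
PRICE_PER_1K_COMPLETION = 0.002  # placeholder price (USD per 1,000 completion tokens)


def fits_in_context(prompt_tokens, max_completion_tokens, window=CONTEXT_WINDOW):
    """Check that the prompt plus the room reserved for the answer fits the window."""
    return prompt_tokens + max_completion_tokens <= window


def estimate_cost(usage,
                  prompt_price=PRICE_PER_1K_PROMPT,
                  completion_price=PRICE_PER_1K_COMPLETION):
    """Estimate the cost of one call from its `usage` block."""
    return (
        usage["prompt_tokens"] * prompt_price
        + usage["completion_tokens"] * completion_price
    ) / 1000


usage = {"prompt_tokens": 148, "completion_tokens": 95, "total_tokens": 243}
print(fits_in_context(usage["prompt_tokens"], max_completion_tokens=500))
print(f"{estimate_cost(usage):.6f} USD")
```

Because the whole history is resent on every turn, `prompt_tokens` grows with the length of the conversation, and so do the cost and the risk of hitting the limit.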


## Checkpoint

- Text is split into tokens before the model processes it. A tokenizer maps pieces of text to integers, and together these pieces form a fixed vocabulary that covers any input.
- There is a context window: we can't pass an unlimited amount of data to LLMs.

## Which applications can you think of in the context of LCAs?

Here is one I used for Brightcon:

- Text generation from a small description

<details>
  
<summary><b>Answer</b></summary>

Let's describe the process of doing an LCA:

1. Collect imperfect data from a customer - for instance a bunch of Excel files with random names for units (seen at the Hackathon): `kg CH4`, `PCS`. <b>Covered application 1: "kg CH4" -> "kg" and "PCS" -> "Item(s)"</b>
2. Create a supply chain in an LCA software, mapping this "company data" to LCI data (like Ecoinvent, but we'll use ELCD here).
3. From the numerous processes in the LCI database, pick the one that most closely matches your data (or refine this simplified approach). <b>Covered application 2: Search and match</b>
4. Compute impacts and do something with them.

</details>
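
As an illustration of application 1, the unit clean-up could be phrased as a few-shot chat prompt that reuses the two examples from step 1. This is one possible layout under stated assumptions, not the prompt actually used for Brightcon: `ALLOWED_UNITS`, `FEW_SHOT_EXAMPLES`, `build_unit_messages`, and the system instruction are hypothetical.

```python
# Hypothetical target vocabulary: the unit names the LCA software accepts.
ALLOWED_UNITS = ["kg", "Item(s)"]

# The two examples from step 1, used as few-shot demonstrations.
FEW_SHOT_EXAMPLES = [("kg CH4", "kg"), ("PCS", "Item(s)")]


def build_unit_messages(raw_unit):
    """Build a chat `messages` list asking the model to normalize one unit."""
    messages = [
        {
            "role": "system",
            "content": (
                "Map the unit written by the user to exactly one of: "
                + ", ".join(ALLOWED_UNITS)
                + ". Reply with the unit only."
            ),
        }
    ]
    # Each example becomes a user turn followed by the expected assistant reply.
    for raw, clean in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": raw})
        messages.append({"role": "assistant", "content": clean})
    # The unit we actually want normalized goes last.
    messages.append({"role": "user", "content": raw_unit})
    return messages


messages = build_unit_messages("kilograms of CO2")
for message in messages:
    print(message["role"], "->", message["content"])
```

The list has the same shape as the `messages` argument used in the API calls earlier, so it could be passed to the model as is.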
