# Leveraging large language models and vector databases for exploring life cycle inventory databases

Life Cycle Assessment (LCA) has emerged as a vital tool for evaluating the environmental impacts of products and processes throughout their entire life cycle. To conduct comprehensive LCAs, reliable and extensive Life Cycle Inventory Databases (LCID) are essential. However, accessing, interpreting, and utilizing such databases can be challenging due to their sheer size and complexity. This conference presentation proposes an innovative approach that harnesses the power of Large Language Models (LLMs) and vector databases to effectively explore and navigate LCID, specifically focusing on the European Life Cycle Database (ELCD) in the OpenLCA format.

In recent years, LLMs have achieved significant advancements in natural language processing, enabling them to comprehend complex queries and deliver contextually relevant information. Leveraging the capabilities of LLMs, particularly exemplified by the ChatGPT model, allows for intuitive interactions with LCID, facilitating user access and comprehension of extensive LCA data.

Furthermore, to enhance the performance and efficiency of LCA-related tasks, this study introduces vector databases designed to store and manage LCA-specific data structures. The use of vector databases complements the traditional relational databases typically employed for LCID, offering optimized search operations, lower latency, and improved scalability.

The conference presentation will highlight a case study involving ChatGPT and vector databases, which showcases their collective potential to extract relevant information from the ELCD in OpenLCA format. Through a series of real-world scenarios, the feasibility and efficacy of this approach will be demonstrated, illustrating how LLMs can efficiently respond to LCA queries while vector databases efficiently store and retrieve vast amounts of LCA data.

Attendees will gain insights into the benefits of incorporating cutting-edge technology into the LCA domain, fostering better-informed decision-making processes across diverse industries. The proposed methodology not only addresses the challenges of accessing and comprehending complex LCID but also opens new avenues for advancing LCA research and environmental sustainability assessments.

In conclusion, this conference presentation will empower LCA practitioners, researchers, and stakeholders to harness the power of LLMs and vector databases, ultimately unlocking the full potential of Life Cycle Inventory Databases for more comprehensive and insightful life cycle assessments.

## Requirements to run this notebook

See the [brightcon-2023-llm GitHub repo](https://github.com/sapiologie/brightcon-2023-llm) for instructions.

## Plan (1 min)

**1. Introduction to LLMs, embeddings, and the applications covered here**

- I/O diagram of how an LLM works
  - Context window
  - Tokens and embeddings
  - It's just text in, text out
- Internal representation of text - vector embeddings
  - Example of Word2vec
  - Access ChatGPT embeddings
- Applications
  - Vector (~ semantic) search (covered)
  - Question answering (covered)
  - Summarization

**2. Search**

- The dataset: ELCD
- Traditional keyword-based search
- Evolution: text distance
- Moving to vector search with ChromaDB
- Advanced search mechanisms (show Langchain flows and "hybrid search")

**3. Question answering**

- Example of trucks: finding the payload
- Prompt engineering with JSON output parsing
- Going further: an actual schema with Pydantic

**4. Langchain**

- Rewriting 2. and 3. with Langchain
- Exploring this framework
- Criticisms

**5. Resources**

- List of resources
- Takeaway: human/machine boundary, don't diss traditional NLP, see this as a tool in your traditional engineering system instead of a magic wand

## 1. Introduction to LLMs, embeddings, and the applications covered here

### 1.1 LLM internals (1 min)

```
Input text (maximum context window)
-> Tokenization
-> Embeddings
-> LLM
-> Embeddings
-> To text
```

ChatGPT flow
- Write **text input**
- The input is transformed into a series of **tokens**: integer IDs for words or word fragments, produced by Byte Pair Encoding
  - GPT-3 vocabulary: about 50k tokens
  - Source: [OpenAI Tokenizer web app](https://platform.openai.com/tokenizer) and [Tiktoken Python library](https://github.com/openai/tiktoken)

- The ChatGPT model: "GPT-3.5 is a transformer trained as a completion-style model, which means that if we give it a few words as input, it's capable of generating a few more words that are likely to follow them in the training data."
  - Text input goes in, is transformed into tokens (within the context window of 4097 tokens) and passed through the model, which **returns a series of tokens that are decoded into text**.
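To make the Byte Pair Encoding step less abstract, here is a toy sketch of how merges are learned. It is not the tokenizer OpenAI ships: the corpus, the number of merges and the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are all made up for illustration, and real tokenizers work on bytes with a much larger training set.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    pairs = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


# Hypothetical toy corpus, only here to show the mechanics
toy_corpus = ["lower", "lowest", "low", "slow"]
merges, sequences = train_bpe(toy_corpus, num_merges=2)

# Each surviving symbol gets an integer ID, which is what the model actually sees
vocabulary = sorted({symbol for seq in sequences for symbol in seq})
token_ids = {symbol: idx for idx, symbol in enumerate(vocabulary)}
encoded = [[token_ids[symbol] for symbol in seq] for seq in sequences]

print(merges)
print(sequences)
print(encoded)
```

After two merges the frequent fragment "low" has become a single symbol, while rarer letters stay on their own. This is the same effect as a name being split into several tokens by a real tokenizer.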

In [3]:
transport_lorry_process_uuids = [
    "b444f4d0-3393-11dd-bd11-0800200c9a66",
    "b444f4d1-3393-11dd-bd11-0800200c9a66",
    "b444f4d2-3393-11dd-bd11-0800200c9a66",
    "b444f4d3-3393-11dd-bd11-0800200c9a66",
    "b444f4d4-3393-11dd-bd11-0800200c9a66",
    "b4451be0-3393-11dd-bd11-0800200c9a66"
]

In [1]:
# Introductory concept 1 - Tokens
# Useful for reasoning about the context window and output size

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [2]:
tokens = enc.encode("hello world, my name is Selim")
tokens

[15339, 1917, 11, 856, 836, 374, 24082, 318]

In [3]:
for token in tokens:
    dec = enc.decode([token])
    print(f"{token}: {dec}")

15339: hello
1917:  world
11: ,
856:  my
836:  name
374:  is
24082:  Sel
318: im

In [4]:
# Introductory concept 2 - ChatGPT
# Tokens in -> Tokens out
# Requires an API key, e.g. through the OPENAI_API_KEY environment variable
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
response

<OpenAIObject chat.completion id=chatcmpl-7zArIUmyYkuiH13tDO7I7wM5S5855 at 0x7f6a002304d0> JSON: {
  "id": "chatcmpl-7zArIUmyYkuiH13tDO7I7wM5S5855",
  "object": "chat.completion",
  "created": 1694814104,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The World Series in 2020 was played at Globe Life Field in Arlington, Texas."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 53,
    "completion_tokens": 18,
    "total_tokens": 71
  }
}

In [9]:
response['choices'][0]['message']['content']

'The World Series in 2020 was played at Globe Life Field in Arlington, Texas.'
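The `usage` block of the World Series response can be used to budget the context window. The sketch below is only an illustration: `remaining_completion_budget` is a hypothetical helper, and it assumes the 4097-token window quoted earlier is shared between the prompt and the answer.

```python
CONTEXT_WINDOW = 4097  # context window quoted earlier in this notebook


def remaining_completion_budget(prompt_tokens, context_window=CONTEXT_WINDOW):
    """Tokens left for the answer once the prompt has been counted."""
    if prompt_tokens > context_window:
        raise ValueError("prompt does not fit in the context window")
    return context_window - prompt_tokens


# Usage numbers copied from the World Series response
usage = {"prompt_tokens": 53, "completion_tokens": 18, "total_tokens": 71}

# The total is simply prompt + completion
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]

print(remaining_completion_budget(usage["prompt_tokens"]))  # 4044
```

With a 53-token prompt almost the whole window is still available. Long dataset descriptions pasted into the prompt are what eat into this budget.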

In [36]:
len(enc.encode(response['choices'][0]['message']['content']))

KeyError: 'choices'

In [38]:
# Applications - 1 - Question answering

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the distance between Paris and Esch-sur-Alzette?"},
    ]
)
response

<OpenAIObject chat.completion id=chatcmpl-7zBIr8cHE45Lu75c3rPlaIX0NoWKH at 0x7f69e7f8bef0> JSON: {
  "id": "chatcmpl-7zBIr8cHE45Lu75c3rPlaIX0NoWKH",
  "object": "chat.completion",
  "created": 1694815813,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The distance between Paris and Esch-sur-Alzette is approximately 365 kilometers (227 miles) by road."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 23,
    "total_tokens": 54
  }
}

In [40]:
italian_mix_desc = """
The Italian electricity consumption mix is provided by multiple energy carriers.
The Italian specific mix is shown in the pie chart "Power Grid Mix - IT".
The electricity is either produced in energy carrier specific power plants and / or energy carrier specific heat and power plants (CHP).
The Italian specific fuel supply (share of resources used, by import and / or domestic supply) including the Italian specific energy carrier properties (e.g. element and energy contents) are accounted for.
Furthermore Italian specific technology standards of power plants regarding efficiency, firing technology, flue-gas desulphurisation, NOx removal and dedusting are considered.
The Italian electricity consumption mix is modelled as shown in the flow diagram "Modelling of Power Consumption Mix".
It includes imported/exported electricity, distribution losses (in %) and the own use by energy producers.
The data set considers the whole supply chain of the fuels from exploration over extraction and preparation to transport of fuels to the power plants.
The background system is addressed as follows:  Transports: All relevant and known transport processes used are included.
Overseas transports including rail and truck transport to and from major ports for imported bulk resources are included.
Furthermore all relevant and known pipeline and / or tanker transport of gases and oil imports are included.
Energy carriers: Coal, crude oil, natural gas and uranium are modelled according to the specific import situation.
Refinery products: Diesel, gasoline, technical gases, fuel oils, basic oils and residues such as bitumen are modelled via a country-specific, refinery parameterized model.
The refinery model represents the current national standard in refinery techniques (e.g. emission level, internal energy consumption,...) as well as the individual country-specific product output spectrum, which can be quite different from country to country.
Hence the refinery products used show the individual country-specific use of resources. The supply of crude oil is modelled, again, according to the country-specific crude oil situation with the respective properties of the resources.
"""

In [45]:
# Applications - 2 - Summarization

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"""
        ```
        {italian_mix_desc}
        ```
    
        Summarize the 2 key elements of the content between the backticks.
        """},
    ]
)
response

<OpenAIObject chat.completion id=chatcmpl-7zBKEEY3Co8Cm9APq6Yzt8LoPIHCt at 0x7f69e7fd3cb0> JSON: {
  "id": "chatcmpl-7zBKEEY3Co8Cm9APq6Yzt8LoPIHCt",
  "object": "chat.completion",
  "created": 1694815898,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The two key elements of the content between the backticks are: \n1. The description of the Italian electricity consumption mix, including the energy carriers, power plants, and heat and power plants used for electricity production in Italy. It also mentions the fuel supply and import/export situation, as well as the technology standards of power plants in Italy.\n2. The modeling of the Italian electricity consumption mix, which takes into account factors such as imported/exported electricity, distribution losses, and the self-consumption by energy producers. It also considers the entire supply chain of fuels, including exploration, extraction, pre

In [46]:
# From 143 to 146 elements.

### 1.2 Embeddings (1 min)

- Lists of tokens are then mapped to **embeddings**: vector representations of the text. The latest OpenAI embedding model at the time of writing is "text-embedding-ada-002"
  - Dimension: 1536, i.e. the output is a vector of about 1.5k floats (not tokens)
  - It is a full machine learning model in its own right
  - It has a context size of 8192 tokens
  - Rule of thumb for English: 1 token ≈ 4 characters, or 100 tokens ≈ 75 words
  - [Ada 2 announcement](https://openai.com/blog/new-and-improved-embedding-model)
  - [How they work](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)
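The rules of thumb in the list above can be turned into a quick size check before sending text to the embedding model. This is a rough sketch: the helper names are hypothetical, the estimate is only valid for English-like text, and an exact count still needs a real tokenizer such as tiktoken.

```python
import math

CHARS_PER_TOKEN = 4        # rule of thumb: 1 token is about 4 characters
WORDS_PER_100_TOKENS = 75  # rule of thumb: 100 tokens is about 75 words
EMBEDDING_CONTEXT = 8192   # context size quoted for text-embedding-ada-002


def estimate_tokens_from_chars(text):
    """Estimate the token count from the number of characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text):
    """Estimate the token count from the number of whitespace-separated words."""
    return math.ceil(len(text.split()) * 100 / WORDS_PER_100_TOKENS)


def probably_fits(text, limit=EMBEDDING_CONTEXT):
    """Be conservative: use the larger of the two estimates."""
    estimate = max(estimate_tokens_from_chars(text), estimate_tokens_from_words(text))
    return estimate <= limit


question = "What is life cycle assessment?"
print(estimate_tokens_from_chars(question))  # 8
print(estimate_tokens_from_words(question))  # 7
print(probably_fits(question))               # True
```

The two estimates rarely agree exactly, which is a reminder that they are approximations and not a replacement for counting tokens.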


Show an example and the code for **"What is life cycle assessment?"**.

In [16]:
response = openai.Embedding.create(
    input="What is life cycle assessment?",
    model="text-embedding-ada-002"
)
embeddings = response['data'][0]['embedding']
print(f"{embeddings[:10]}...")
print(len(embeddings))

[0.015214473940432072, 0.0019709658809006214, 0.0004404623177833855, -0.02267022430896759, -0.01617608033120632, -0.0011270894901826978, -0.011782984249293804, 0.010432781651616096, -0.019785402342677116, -0.019047731533646584]...
1536


In [30]:
a1 = "Cotton production"
a2 = "Nuclear energy"
query = "Find an activity to power my car"

embed_a1 = openai.Embedding.create(input=a1, model="text-embedding-ada-002")['data'][0]['embedding']
embed_a2 = openai.Embedding.create(input=a2, model="text-embedding-ada-002")['data'][0]['embedding']
embed_query = openai.Embedding.create(input=query, model="text-embedding-ada-002")['data'][0]['embedding']

In [26]:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

In [29]:
cosine_similarity(embed_a1, embed_a2)

0.7777373690392992

In [33]:
# Cosine similarity, not a distance: the higher the value, the closer the meaning
sim_a1_q = cosine_similarity(embed_a1, embed_query)
sim_a2_q = cosine_similarity(embed_a2, embed_query)
print(f"sim 1={sim_a1_q} / sim 2={sim_a2_q}")

sim 1=0.7156581197365439 / sim 2=0.785766316483196


In [35]:
query_2 = "Making of material for textile"
embed_query_2 = openai.Embedding.create(input=query_2, model="text-embedding-ada-002")['data'][0]['embedding']

# Cosine similarity, not a distance: the higher the value, the closer the meaning
sim_a1_q2 = cosine_similarity(embed_a1, embed_query_2)
sim_a2_q2 = cosine_similarity(embed_a2, embed_query_2)
print(f"sim 1={sim_a1_q2} / sim 2={sim_a2_q2}")

sim 1=0.8757795762923071 / sim 2=0.7661570944163845
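The two comparisons above are already a tiny vector search: embed the candidates, embed the query, keep the candidate with the highest cosine similarity. The sketch below generalizes that to any number of candidates. The three-dimensional vectors are invented stand-ins for the 1536-dimensional OpenAI embeddings, and `rank_by_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Plain-Python cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, candidates):
    """Return (name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up toy vectors, NOT real embeddings
toy_embeddings = {
    "Cotton production": [0.9, 0.1, 0.2],
    "Nuclear energy": [0.1, 0.9, 0.3],
}
toy_query = [0.2, 0.8, 0.1]  # stands in for an energy-related query

ranking = rank_by_similarity(toy_query, toy_embeddings)
best_name, best_score = ranking[0]
print(best_name)
```

A vector database does the same ranking, with indexing tricks so that it does not have to compare the query against every stored vector.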


## 2. Search (1 min)

ELCD : TODO description
- Number of activities: N
- Number of elementary flows: K

### 2.1 Keyword-based search (1 min)

TODO put the activities in a Pandas DataFrame and do keyword search
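Until the Pandas version is written, here is a plain-Python stand-in for the keyword search described in this TODO. The activity names are hypothetical placeholders, not actual ELCD entries, and `keyword_search` is an illustrative helper.

```python
# Hypothetical activity names, standing in for the real ELCD process list
activities = [
    {"id": 1, "name": "Lorry transport, 22 t total weight"},
    {"id": 2, "name": "Electricity consumption mix, Italy"},
    {"id": 3, "name": "Cotton production"},
    {"id": 4, "name": "Small lorry transport, 7.5 t total weight"},
]


def keyword_search(records, query):
    """Return the records whose name contains every query term (case-insensitive)."""
    terms = query.lower().split()
    return [
        record
        for record in records
        if all(term in record["name"].lower() for term in terms)
    ]


lorry_hits = keyword_search(activities, "lorry transport")
truck_hits = keyword_search(activities, "truck")

print([record["id"] for record in lorry_hits])  # [1, 4]
print(truck_hits)                               # []
```

The empty result for "truck" is the limitation that motivates the next steps: keyword search only finds what is spelled the same way as the query.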

### 2.2. Inexact search (1 min)

TODO Pandas DataFrame with a text distance function and an inexact query
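One possible text distance for this step is the similarity ratio from Python's standard `difflib` module. The sketch below uses it on a plain list instead of a DataFrame; the activity names and the misspelled query are made up, and other distances (Levenshtein, Jaro-Winkler) would slot into `text_similarity` just as well.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def inexact_search(names, query, top_k=3):
    """Return the top_k names most similar to the query, with rounded scores."""
    scored = sorted(
        ((text_similarity(name, query), name) for name in names),
        reverse=True,
    )
    return [(name, round(score, 2)) for score, name in scored[:top_k]]


# Hypothetical activity names
names = [
    "Cotton production",
    "Nuclear energy",
    "Lorry transport",
    "Electricity consumption mix",
]

# A query with typos still finds the right activity
results = inexact_search(names, "coton producton", top_k=2)
print(results[0][0])
```

This tolerates typos but still compares characters, so a query such as "truck" would not be matched to "Lorry transport". That gap is what vector search addresses.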

### 2.3 Vector search (8 min)

TODO Introduce and instantiate ChromaDB

TODO Show how to get one embedding from OpenAI

TODO Show I've saved them in a JSON format

TODO Talk about the pricing (cost me pennies)

TODO Ingest all JSON files into ChromaDB

TODO Show examples of search in English and French

TODO Advanced search mechanisms - Langchain flows

TODO Advanced search - Hybrid search Vector + keyword
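As a placeholder for the hybrid search item, the sketch below blends a keyword score with a vector similarity using a single weight `alpha`. This is one simple way to combine the two signals, not necessarily how ChromaDB or Langchain do it. The documents, the two-dimensional vectors and the helper names are all invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def keyword_score(text, query):
    """Fraction of query terms that appear in the text (case-insensitive)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    lowered = text.lower()
    return sum(1 for term in terms if term in lowered) / len(terms)


def hybrid_search(docs, query, query_vector, alpha=0.5):
    """Score = alpha * vector similarity + (1 - alpha) * keyword score."""
    results = []
    for doc in docs:
        score = (
            alpha * cosine(doc["vector"], query_vector)
            + (1 - alpha) * keyword_score(doc["text"], query)
        )
        results.append((doc["text"], score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Made-up documents with toy 2-dimensional vectors (NOT real embeddings)
docs = [
    {"text": "Lorry transport, diesel", "vector": [0.9, 0.1]},
    {"text": "Cotton production", "vector": [0.1, 0.9]},
]

# The query vector is close to the lorry document, the query text matches it too
both = hybrid_search(docs, "lorry", [0.8, 0.2], alpha=0.5)
# With alpha=0 only the keywords count, so the text "cotton" wins
keywords_only = hybrid_search(docs, "cotton", [0.8, 0.2], alpha=0.0)

print(both[0][0])
print(keywords_only[0][0])
```

Setting `alpha` to 1 gives pure vector search and 0 gives pure keyword search, so the same function covers the whole range.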

## 3. Question answering (5 min)

TODO Get 3 example datasets from ELCD about trucks and show their description here.

### 3.1 Simple text output (3 min)

TODO Ask questions to get the payload in a text format and show the output using OpenAI API.

### 3.2 Prompt engineering (5 min)

TODO Define it

TODO Simple and manual JSON output parsing (describe JSON in the input).

TODO Parse the JSON with a regex (get the Regex with ChatGPT)
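A minimal sketch of the regex step, assuming the model wraps a single JSON object in some chatty text. The reply string, the `payload` and `unit` keys and their values are invented for illustration, not taken from an actual ELCD answer.

```python
import json
import re


def extract_json(text):
    """Grab the span from the first '{' to the last '}' and parse it as JSON."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model output")
    return json.loads(match.group(0))


# Hypothetical model reply: JSON surrounded by extra sentences
model_output = (
    "Sure! Here is the result:\n"
    '{"payload": 22, "unit": "t"}\n'
    "Let me know if you need anything else."
)

parsed = extract_json(model_output)
print(parsed["payload"], parsed["unit"])  # 22 t
```

The greedy pattern handles nested objects but breaks if the reply contains two separate JSON objects. In that case `json.loads` raises an error, which is at least a loud failure.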

### 3.3 Pydantic schemas (5 min)

TODO Define a Pydantic schema

TODO Describe this schema in the input

TODO Parse and validate the schema in the output
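The three steps above (define, describe, validate) can be previewed with the standard library alone. The sketch below deliberately uses a dataclass as a stand-in, not Pydantic itself, so it runs without extra dependencies. The `TruckPayload` fields and the sample replies are hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TruckPayload:
    """Stand-in schema: what we want the LLM to return for a truck dataset."""
    payload: float
    unit: str

    def __post_init__(self):
        # Minimal hand-written validation (Pydantic would do this for us)
        if isinstance(self.payload, bool) or not isinstance(self.payload, (int, float)):
            raise TypeError("payload must be a number")
        if not isinstance(self.unit, str) or not self.unit:
            raise TypeError("unit must be a non-empty string")
        if self.payload <= 0:
            raise ValueError("payload must be positive")


def describe_schema(cls):
    """Render field names and types as text that can be pasted into a prompt."""
    return "\n".join(
        f"- {field.name}: {getattr(field.type, '__name__', str(field.type))}"
        for field in fields(cls)
    )


def parse_truck_payload(raw_json):
    """Parse the model's JSON reply and validate it against the schema."""
    return TruckPayload(**json.loads(raw_json))


print(describe_schema(TruckPayload))

# Hypothetical model replies
good = parse_truck_payload('{"payload": 22, "unit": "t"}')
print(good)

try:
    parse_truck_payload('{"payload": "heavy", "unit": "t"}')
except TypeError as error:
    print(f"rejected: {error}")
```

The point of a schema is the last case: a malformed answer is rejected at the boundary instead of silently flowing into the rest of the application.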

## 4. Langchain (2 min)

TODO Explain what it is.

TODO Show the tutorials from Deeplearning.ai

### 4.1. Search with Langchain and ChromaDB (2 min)

TODO Walk through the example

### 4.2. JSON output parsing with Langchain (2 min)

TODO Show Pydantic output parser chain

TODO Show another example to parse "units" field from an imperfect CSV

### 4.3. Quick peek at the framework (2 min)

TODO Show main applications

### 4.4. Criticisms (1 min)

Open-source but feels opaque as it becomes bloated.

## 5. Resources (2 min)

https://public.3.basecamp.com/p/RUgMdhPpg72dPP5Y5MNDMXHm

## 6. Conclusion (1 min)

- Very useful as a human to machine correcter
- Try to understand the concepts (there are not that many), then plug the pieces you need into your application
- A good way to think about it: build your application with your usual engineering best practices, then identify bottlenecks that AI could solve. Do not start by trying to squeeze AI in everywhere. For instance, regexes are very powerful at information extraction, so don't ditch them for LLMs right away.
- It is hard to keep up: Hacker News, AlphaSignal, and the Langchain newsletter are good sources of up-to-date information