![A car dashboard with lots of new technical features.](images/dashboard.jpg)

You're working for a well-known car manufacturer who is looking at implementing LLMs into vehicles to provide guidance to drivers. You've been asked to experiment with integrating car manuals with an LLM to create a context-aware chatbot. They hope that this context-aware LLM can be hooked up to a text-to-speech software to read the model's response aloud.

As a proof of concept, you'll integrate several pages from a car manual that contains car warning messages and their meanings and recommended actions. This particular manual, stored as an HTML file, `mg-zs-warning-messages.html`, is from an MG ZS, a compact SUV. Armed with your newfound knowledge of LLMs and LangChain, you'll implement Retrieval Augmented Generation (RAG) to create the context-aware chatbot.

## Before you start

In order to complete the project you will need to create a developer account with OpenAI and store your API key as a secure environment variable. Instructions for these steps are outlined below.

### Create a developer account with OpenAI

1. Go to the [API signup page](https://platform.openai.com/signup). 

2. Create your account (you'll need to provide your email address and your phone number).

3. Go to the [API keys page](https://platform.openai.com/account/api-keys). 

4. Create a new secret key.

<img src="images/openai-new-secret-key.png" width="200">

5. **Take a copy of it**. (If you lose it, delete the key and create a new one.)

### Add a payment method

OpenAI sometimes provides free credits for the API, but this can vary depending on geography. You may need to add debit/credit card details. 

**This project should cost much less than 1 US cents with `gpt-4o-mini` (but if you rerun tasks, you will be charged every time).**

1. Go to the [Payment Methods page](https://platform.openai.com/account/billing/payment-methods).

2. Click Add payment method.

<img src="images/openai-add-payment-method.png" width="200">

3. Fill in your card details.

### Add an environmental variable with your OpenAI key

1. In the workbook, click on "Environment," in the top toolbar and select "Environment variables".

2. Click "Add" to add environment variables.

3. In the "Name" field, type "OPENAI_API_KEY". In the "Value" field, paste in your secret key.

<img src="images/datalab-env-var-details.png" width="500">

4. Click "Create", then you'll see the following pop-up window. Click "Connect," then wait 5-10 seconds for the kernel to restart, or restart it manually in the Run menu.

<img src="images/connect-integ.png" width="500">

### Update to Python 3.10

Due to how frequently the libraries required for this project are updated, you'll need to update your environment to Python 3.10:

1. In the workbook, click on "Environment," in the top toolbar and select "Session details".

2. In the workbook language dropdown, select "Python 3.10".

3. Click "Confirm" and hit "Done" once the session is ready.

## Environment Setup and Package Installation

Ensures that the environment is updated to Python 3.10 and installs the required packages for the project. The `install_if_needed` function checks if the specified version of each package is installed and installs it if necessary.

In [None]:
# Update your environment to Python 3.10 as described above before running this cell
import subprocess
import pkg_resources

def install_if_needed(package, version=None):
    '''Function to ensure that the libraries used are consistent to avoid errors.'''
    try:
        pkg = pkg_resources.get_distribution(package)
        if pkg.version != version:
            raise pkg_resources.VersionConflict(pkg, version)
    except (pkg_resources.DistributionNotFound, pkg_resources.VersionConflict):
        subprocess.check_call(["pip", "install", f"{package}=={version}"])

install_if_needed("langchain-core", "0.3.18")
install_if_needed("langchain-openai", "0.2.8")
install_if_needed("langchain-community", "0.3.7")
install_if_needed("unstructured", "0.14.4")
install_if_needed("langchain-chroma")
install_if_needed("langchain-text-splitters", "0.3.2")
install_if_needed("unstructured")

Defaulting to user installation because normal site-packages is not writeable
Collecting langchain-core==0.3.18
  Using cached langchain_core-0.3.18-py3-none-any.whl.metadata (6.3 kB)
Using cached langchain_core-0.3.18-py3-none-any.whl (409 kB)
Installing collected packages: langchain-core
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 0.3.35
    Uninstalling langchain-core-0.3.35:
      Successfully uninstalled langchain-core-0.3.35


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.3.18 requires langchain-core<1.0.0,>=0.3.34, but you have langchain-core 0.3.18 which is incompatible.
langchain 0.3.18 requires langchain-text-splitters<1.0.0,>=0.3.6, but you have langchain-text-splitters 0.3.2 which is incompatible.
crewai 0.30.11 requires langchain<0.2.0,>=0.1.10, but you have langchain 0.3.18 which is incompatible.
embedchain 0.1.110 requires langchain<0.2.0,>=0.1.4, but you have langchain 0.3.18 which is incompatible.
embedchain 0.1.110 requires langchain-openai<0.2.0,>=0.1.7, but you have langchain-openai 0.2.8 which is incompatible.
langchain-cohere 0.1.5 requires langchain-core<0.3,>=0.1.42, but you have langchain-core 0.3.18 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[3

Successfully installed langchain-core-0.3.18
Defaulting to user installation because normal site-packages is not writeable



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


Defaulting to user installation because normal site-packages is not writeable
Collecting langchain-core<0.4.0,>=0.3.17 (from langchain-community==0.3.7)
  Using cached langchain_core-0.3.35-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.6 (from langchain<0.4.0,>=0.3.7->langchain-community==0.3.7)
  Using cached langchain_text_splitters-0.3.6-py3-none-any.whl.metadata (1.9 kB)
Using cached langchain_core-0.3.35-py3-none-any.whl (413 kB)
Using cached langchain_text_splitters-0.3.6-py3-none-any.whl (31 kB)
Installing collected packages: langchain-core, langchain-text-splitters
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 0.3.18
    Uninstalling langchain-core-0.3.18:
      Successfully uninstalled langchain-core-0.3.18
  Attempting uninstall: langchain-text-splitters
    Found existing installation: langchain-text-splitters 0.3.2
    Uninstalling langchain-text-splitters-0.3.2:
      Successfully uninstalled lan

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
crewai 0.30.11 requires langchain<0.2.0,>=0.1.10, but you have langchain 0.3.18 which is incompatible.
embedchain 0.1.110 requires langchain<0.2.0,>=0.1.4, but you have langchain 0.3.18 which is incompatible.
embedchain 0.1.110 requires langchain-openai<0.2.0,>=0.1.7, but you have langchain-openai 0.2.8 which is incompatible.
langchain-cohere 0.1.5 requires langchain-core<0.3,>=0.1.42, but you have langchain-core 0.3.35 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


Defaulting to user installation because normal site-packages is not writeable



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


Defaulting to user installation because normal site-packages is not writeable
Collecting langchain-text-splitters==0.3.2
  Using cached langchain_text_splitters-0.3.2-py3-none-any.whl.metadata (2.3 kB)
Using cached langchain_text_splitters-0.3.2-py3-none-any.whl (25 kB)
Installing collected packages: langchain-text-splitters
  Attempting uninstall: langchain-text-splitters
    Found existing installation: langchain-text-splitters 0.3.6
    Uninstalling langchain-text-splitters-0.3.6:
      Successfully uninstalled langchain-text-splitters-0.3.6
Successfully installed langchain-text-splitters-0.3.2


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.3.18 requires langchain-text-splitters<1.0.0,>=0.3.6, but you have langchain-text-splitters 0.3.2 which is incompatible.
crewai 0.30.11 requires langchain<0.2.0,>=0.1.10, but you have langchain 0.3.18 which is incompatible.
embedchain 0.1.110 requires langchain<0.2.0,>=0.1.4, but you have langchain 0.3.18 which is incompatible.
embedchain 0.1.110 requires langchain-openai<0.2.0,>=0.1.7, but you have langchain-openai 0.2.8 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


## import required libraries

In [46]:
# Set your API key to a variable
import os
openai_api_key = os.environ["OPENAI_API_KEY"]

# Import the required packages
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_openai import OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser 
from langchain_core.runnables import RunnablePassthrough
from langchain_chroma import Chroma

## Create custom html loader



In [47]:
from typing import List
from langchain.document_loaders.base import BaseLoader
from langchain.docstore.document import Document
from bs4 import BeautifulSoup

class CustomHTMLTableLoader(BaseLoader):
    """
    Custom HTML loader that loads an HTML file containing multiple tables.
    Each table is expected to have at least two columns with headers containing
    'warning' and 'procedure'. For every data row (skipping header rows), it extracts
    the warning message and procedure, combining them into one text string.
    """
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> List[Document]:
        with open(self.file_path, "r", encoding="utf-8") as f:
            html_content = f.read()
        soup = BeautifulSoup(html_content, "html.parser")
        tables = soup.find_all("table")
        rows_text = []

        for table in tables:
            rows = table.find_all("tr")
            if len(rows) < 2:
                continue  # Skip tables without data rows
            # Assume the first row is header
            header_cells = rows[0].find_all(["th", "td"])
            header = [cell.get_text(strip=True).lower() for cell in header_cells]
            if len(header) < 2:
                continue
            # Check if headers contain the expected keywords
            if "warning" not in header[0] or "procedure" not in header[1]:
                continue
            # Process remaining rows (data rows)
            for row in rows[1:]:
                cells = row.find_all("td")
                if len(cells) < 2:
                    continue
                warning_message = cells[0].get_text(strip=True)
                procedure = cells[1].get_text(strip=True)
                # Combine both fields into one string
                row_text = f"Warning Message: {warning_message}\nProcedure: {procedure}"
                rows_text.append(row_text)
        # Join all rows with a delimiter (here we use double newlines)
        combined_text = "\n\n".join(rows_text)
        return [Document(page_content=combined_text)]

In [48]:
# Instantiate the custom loader with your HTML file path
loader = CustomHTMLTableLoader("data/mg-zs-warning-messages.html")
documents = loader.load()

In [49]:
print(documents[0].page_content)

Procedure: Indicates that the cruise controlsystem has detected a fault. Please consult an MG Authorised Repairer as soon as possible.

Procedure: Indicates that the active speed limit system has detected a fault. Contact an MG Authorised Repairer as soon as possible.

Procedure: High engine coolant temperature could result in severe damage. As soon as conditions permit, safely stop the vehicle and switch off the engine and contact an MG Authorised Repairer immediately.

Procedure: Indicates that the engine coolant temperature sensor has failed. As soon as conditions permit, safely stop the vehicle and switch off the engine and contact an MG Authorised Repairer immediately.

Procedure: Indicates that the oil pressure is too low, which may result in severe engine damage. As soon as safety permits, stop the car, switch off the engine and check the engine oil level. Contact anMG Authorised Repairer as soonas possible.

Procedure: Indicates that a failure has occurred that will effect engi

In [50]:
from langchain.text_splitter import TextSplitter

class HTMLTableRowSplitter(TextSplitter):
    """
    Custom text splitter that splits the loaded HTML document (which contains rows joined by
    double newlines) into individual chunks, each representing one table row.
    """
    def split_text(self, text: str) -> List[str]:
        # Split the text on double newlines and strip extra spaces
        return [Document(page_content=chunk.strip()) for chunk in text.split("\n\n") if chunk.strip()]


In [51]:
splitter = HTMLTableRowSplitter()
chunks = splitter.split_text(documents[0].page_content)

print(f"number of rows: {len(chunks)}")

number of rows: 32


## Instantiate the model 

We initializes the OpenAI language model with a predefined temperature setting, which controls the balance between creativity and determinism in responses.

In [52]:
llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)

## Vector database setup

Configure the embedding model and connects it to a Chroma vector store. The embedding model transforms text into numerical representations, allowing for efficient similarity-based retrieval. Then onverts the vector store into a retriever, allowing it to efficiently search for relevant documents based on their embeddings

In [53]:
embedding_model = OpenAIEmbeddings(api_key=openai_api_key, model="text-embedding-3-small")
vector_store = Chroma.from_documents(documents=chunks, embedding=embedding_model)
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 1})

In [54]:
prompt = ChatPromptTemplate.from_template(
"""
You will be provided with a piece of context about car diagnosis. Use it to answer the question at the end clearly. IF you could not find the answer from the given context, simply say 'Sorry, I don't know the answer!
Context: {context} 
Question: {question} 
""") 

In [55]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm 
    | StrOutputParser() 
)

In [56]:
test_question = "The Gasoline Particular Filter Full warning has appeared. What does this mean and what should I do about it?"
answer = chain.invoke(test_question) 
print(answer)

