![A car dashboard with lots of new technical features.](images/dashboard.jpg)

You're working for a well-known car manufacturer who is looking at implementing LLMs into vehicles to provide guidance to drivers. You've been asked to experiment with integrating car manuals with an LLM to create a context-aware chatbot. They hope that this context-aware LLM can be hooked up to a text-to-speech software to read the model's response aloud.

As a proof of concept, you'll integrate several pages from a car manual that contains car warning messages and their meanings and recommended actions. This particular manual, stored as an HTML file, `mg-zs-warning-messages.html`, is from an MG ZS, a compact SUV. Armed with your newfound knowledge of LLMs and LangChain, you'll implement Retrieval Augmented Generation (RAG) to create the context-aware chatbot.

## Before you start

In order to complete the project you will need to create a developer account with OpenAI and store your API key as a secure environment variable. Instructions for these steps are outlined below.

### Create a developer account with OpenAI

1. Go to the [API signup page](https://platform.openai.com/signup). 

2. Create your account (you'll need to provide your email address and your phone number).

3. Go to the [API keys page](https://platform.openai.com/account/api-keys). 

4. Create a new secret key.

<img src="images/openai-new-secret-key.png" width="200">

5. **Take a copy of it**. (If you lose it, delete the key and create a new one.)

### Add a payment method

OpenAI sometimes provides free credits for the API, but this can vary depending on geography. You may need to add debit/credit card details. 

**This project should cost much less than 1 US cents with `gpt-4o-mini` (but if you rerun tasks, you will be charged every time).**

1. Go to the [Payment Methods page](https://platform.openai.com/account/billing/payment-methods).

2. Click Add payment method.

<img src="images/openai-add-payment-method.png" width="200">

3. Fill in your card details.

### Add an environmental variable with your OpenAI key

1. In the workbook, click on "Environment," in the top toolbar and select "Environment variables".

2. Click "Add" to add environment variables.

3. In the "Name" field, type "OPENAI_API_KEY". In the "Value" field, paste in your secret key.

<img src="images/datalab-env-var-details.png" width="500">

4. Click "Create", then you'll see the following pop-up window. Click "Connect," then wait 5-10 seconds for the kernel to restart, or restart it manually in the Run menu.

<img src="images/connect-integ.png" width="500">

### Update to Python 3.10

Due to how frequently the libraries required for this project are updated, you'll need to update your environment to Python 3.10:

1. In the workbook, click on "Environment," in the top toolbar and select "Session details".

2. In the workbook language dropdown, select "Python 3.10".

3. Click "Confirm" and hit "Done" once the session is ready.

## Environment Setup and Package Installation

Ensures that the environment is updated to Python 3.10 and installs the required packages for the project.

## import required libraries

In [3]:
# Set your API key to a variable
import os
openai_api_key = os.environ["OPENAI_API_KEY"]

# Import the required packages
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_openai import OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser 
from langchain_core.runnables import RunnablePassthrough
from langchain_chroma import Chroma

ModuleNotFoundError: No module named 'langchain_chroma'

## Document Loading and splitting
The document will be used by the model is  a manual for car diagnosis and troubleshootings. It is a HTML file that contains several tables with the same columns: 
- `Warning Message`: the warning displayed in the screen
- `Procedure`: the corresponding troubleshooting approach to solve it.

For best data retrieval from this document, the idea is to split the file into chunks so that each chunk should cover a single warning message and its corresponding procedure (troubleshooting). For that, we are gonna create a customized HTML loader and splitter.

### 1. Document loading
The loader should load and combine the tables into a single document containing the rows delimited by a regular delimiter (say `"\n\n"`).

In [None]:
from typing import List
from langchain.document_loaders.base import BaseLoader
from langchain.docstore.document import Document
from bs4 import BeautifulSoup

class CustomHTMLTableLoader(BaseLoader):
    """
    Custom HTML loader that loads an HTML file containing multiple tables.
    Each table is expected to have at least two columns with headers containing
    'warning' and 'procedure'. For every data row (skipping header rows), it extracts
    the warning message and procedure, combining them into one text string.
    """
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> List[Document]:
        with open(self.file_path, "r", encoding="utf-8") as f:
            html_content = f.read()
        soup = BeautifulSoup(html_content, "html.parser")
        tables = soup.find_all("table")
        rows_text = []

        for table in tables:
            rows = table.find_all("tr")
            if len(rows) < 2:
                continue  # Skip tables without data rows
            # Assume the first row is header
            header_cells = rows[0].find_all(["th", "td"])
            header = [cell.get_text(strip=True).lower() for cell in header_cells]
            if len(header) < 2:
                continue
            # Check if headers contain the expected keywords
            if "warning" not in header[0] or "procedure" not in header[1]:
                continue
            # Process remaining rows (data rows)
            for row in rows[1:]:
                cells = row.find_all("td")
                if len(cells) < 2:
                    continue
                warning_message = cells[0].get_text(strip=True)
                procedure = cells[1].get_text(strip=True)
                # Combine both fields into one string
                row_text = f"Warning Message: {warning_message}\nProcedure: {procedure}"
                rows_text.append(row_text)
        # Join all rows with a delimiter (here we use double newlines)
        combined_text = "\n\n".join(rows_text)
        return [Document(page_content=combined_text)]

In [None]:
# Instantiate the custom loader with your HTML file path
loader = CustomHTMLTableLoader("data/mg-zs-warning-messages.html")
documents = loader.load()

In [None]:
print(documents[0].page_content)

Procedure: Indicates that the cruise controlsystem has detected a fault. Please consult an MG Authorised Repairer as soon as possible.

Procedure: Indicates that the active speed limit system has detected a fault. Contact an MG Authorised Repairer as soon as possible.

Procedure: High engine coolant temperature could result in severe damage. As soon as conditions permit, safely stop the vehicle and switch off the engine and contact an MG Authorised Repairer immediately.

Procedure: Indicates that the engine coolant temperature sensor has failed. As soon as conditions permit, safely stop the vehicle and switch off the engine and contact an MG Authorised Repairer immediately.

Procedure: Indicates that the oil pressure is too low, which may result in severe engine damage. As soon as safety permits, stop the car, switch off the engine and check the engine oil level. Contact anMG Authorised Repairer as soonas possible.

Procedure: Indicates that a failure has occurred that will effect engi

### 2. Document Splitting
As we said the splitter also is special (but simple). It will split the Document object into a list of Document objects by a delimiter `"\n\n"`. each chunk now should have the warning message and its procedure only.

In [None]:
from langchain.text_splitter import TextSplitter

class HTMLTableRowSplitter(TextSplitter):
    """
    Custom text splitter that splits the loaded HTML document (which contains rows joined by
    double newlines) into individual chunks, each representing one table row.
    """
    def split_text(self, text: str) -> List[str]:
        # Split the text on double newlines and strip extra spaces
        return [Document(page_content=chunk.strip()) for chunk in text.split("\n\n") if chunk.strip()]


In [5]:
splitter = HTMLTableRowSplitter()
chunks = splitter.split_text(documents[0].page_content)

print(f"number of chunks: {len(chunks)}")

NameError: name 'HTMLTableRowSplitter' is not defined

## The LLM 
We will use OpenAI `gpt-40-mini` model for the reasoning part. we set `temperature=0` so that the model should be strict and straitforward in answering without any crativity.

In [None]:
llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)

## Vector Database and Retriever

In this step we configure the **embedding** model. The embedding model transforms text into numerical representations (embeddings or vertors), allowing for efficient similarity-based retrieval.<br>
Then connect the embedding model to a **Chroma** vector store that works mainly with embeddings.<br>
Finallt, we get the Retriever from the vectorstore intself, allowing it to efficiently search for relevant documents based on their embeddings.

- We will use `text-embedding-3-small` from OpenAI as the embedding model.<br>
- Setting `k=1` in the retriever means that it will get the most similar documents (only 1).

In [None]:
embedding_model = OpenAIEmbeddings(api_key=openai_api_key, model="text-embedding-3-small")
vector_store = Chroma.from_documents(documents=chunks, embedding=embedding_model)
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 1})

## Prompt template

The prompt template is a dynamic template object that can take some variables and give the final prompt the model will take. Here we used 2 variables:

- `question`: the questions posed by the user.
- `context`: the context is the most relevant chunk with the question (got from the retriever).

In [None]:
prompt = ChatPromptTemplate.from_template("""
You will be provided with a piece of context about car diagnosis. Use it to answer the question at the end clearly. IF you could not find the answer from the given context, simply say 'Sorry, I don't know the answer!
Context: {context} 
Question: {question} 
""") 

## The Q&A chain

This is the final pipeline (or chain). It is a set of operations (or tasks) that should be run sequentially starting from the user input to the final answer. I will highlight the main steps below:

1. The user question should go into two directions. First, to the retriever to get the relevant chunk, and the second into the prompt template. 
2. The retrieved chunk will go to the prompt as a context. 
3. The model take the prompt and answer the quesiton based on the context.
4. The final answer will be parsed for clear output.

In [None]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm 
    | StrOutputParser() 
)

## Testing the RAG chain 
We will test the chain with the requested question. the variable `answer` will store the final answer.

In [None]:
test_question = "The Gasoline Particular Filter Full warning has appeared. What does this mean and what should I do about it?"
answer = chain.invoke(test_question) 
print(answer)



## Resources
- Langchain documentation: https://python.langchain.com/api_reference
- RAG track from datacamp: https://app.datacamp.com/learn/skill-tracks/developing-applications-with-langchain
- OpenAI API documentation: https://platform.openai.com/docs/api-reference