## **RAG-based summarization agent**

An LLM-based agent that summarizes a reported automotive issue
using documents retrieved from a given recall dataset.

Steps:


1.   Load and Preprocess the data
2.   Create Embeddings and VectorStore for Document Retrieval
3.   Define LLM and Summarization Chain
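
As a rough orientation, the three steps can be sketched end to end with only the standard library. This is a toy stand-in rather than the notebook's implementation: `embed` uses word counts instead of a SentenceTransformer model, `retrieve` does a linear cosine-similarity scan instead of querying ChromaDB, and `build_prompt` stops short of calling an LLM. All function names and the two sample records are hypothetical.

```python
import math
from collections import Counter

# Step 1: load and preprocess (two made-up records stand in for the dataset)
records = [
    {"make": "FORD", "text": "Throttle cable may stick and hold the throttle open."},
    {"make": "TOYOTA", "text": "Tire label lists an incorrect inflation pressure."},
]

def embed(text):
    """Toy embedding: lowercase word counts (an assumption, not a real model)."""
    return Counter(text.lower().replace(".", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 2: the "vector store" is a list of (embedding, record) pairs
store = [(embed(r["text"]), r) for r in records]

def retrieve(query, make, k=1):
    """Return the k records of the given make that are most similar to the query."""
    q = embed(query)
    candidates = [(cosine(q, e), r) for e, r in store if r["make"] == make]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in candidates[:k]]

# Step 3: build the prompt that would be sent to the LLM
def build_prompt(issue, make, docs):
    context = "\n".join(d["text"] for d in docs)
    return f"Summarize information about {issue} for {make}.\n\nRetrieved Documents:\n{context}"

docs = retrieve("stuck throttle risk", "FORD")
prompt = build_prompt("stuck throttle risk", "FORD", docs)
```

The metadata filter on `make` mirrors the idea of restricting retrieval to the vehicle in question before ranking by similarity.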



## 0. Install requirements


In [None]:
!pip install transformers -U
!pip install chromadb langchain_groq


In [None]:
!pip install tokenizers==0.21

Collecting tokenizers==0.21
  Downloading tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.3
    Uninstalling tokenizers-0.20.3:
      Successfully uninstalled tokenizers-0.20.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2, but you have tokenizers 0.21.0 which is incompatible.
Successfully installed tokenizers-0.21.0

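The conflict reported by pip in the output above can be reasoned about with a small range check. The sketch below is a simplified illustration that handles only plain dotted integer versions; `parse_version` and `satisfies` are hypothetical helpers, not part of pip, and real version specifiers are more involved.

```python
def parse_version(text):
    """Turn '0.20.3' into (0, 20, 3); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))

def satisfies(version, minimum="0.13.2", maximum="0.20.3"):
    """Check minimum <= version <= maximum.

    The defaults are the tokenizers range that chromadb 0.5.23 reports
    in the pip error message.
    """
    return parse_version(minimum) <= parse_version(version) <= parse_version(maximum)

conflict = not satisfies("0.21.0")   # the version installed by the cell above
previous_ok = satisfies("0.20.3")    # the version that was uninstalled
```

Here `conflict` is `True` because `(0, 21, 0)` sorts above `(0, 20, 3)`, which is why pip flags the pinned `tokenizers==0.21` as incompatible with the installed chromadb.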

## 1.   Load and Preprocess the data


In [None]:
import pandas as pd
import chromadb
from langchain_groq import ChatGroq
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from sentence_transformers import SentenceTransformer


In [None]:
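# Load the tab-separated recall file used as the RAG document source
file_path = '/content/drive/MyDrive/colab/predii/FLAT_RCL.txt'

df = pd.read_csv(
    file_path,
    sep='\t',
    dtype=str,             # read every column as text
    header=None,           # the file has no header row; names are assigned later
    on_bad_lines='skip'    # skip malformed rows instead of raising an error
)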

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
0,1,02V288000,FORD,FOCUS,2000,02S41,ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES,FORD MOTOR COMPANY,19990719.0,20010531.0,...,,,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...,"THIS, IN TURN, COULD CAUSE THE BATTERY CABLES ...",DEALERS WILL INSPECT THE BATTERY CABLES FOR TH...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...,000015339000215021000000202,,,
1,2,02V288000,FORD,FOCUS,2001,02S41,ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES,FORD MOTOR COMPANY,19990719.0,20010531.0,...,,,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...,"THIS, IN TURN, COULD CAUSE THE BATTERY CABLES ...",DEALERS WILL INSPECT THE BATTERY CABLES FOR TH...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...,000015339000215022000000202,,,
2,3,02V236000,JAYCO,FT EAGLE 10 SG,2003,,EQUIPMENT:OTHER:LABELS,"JAYCO, INC.",20020730.0,20020813.0,...,,,"ON CERTAIN FOLDING TENT CAMPERS, THE FEDERAL C...","IF THE TIRES WERE INFLATED TO 80 PSI, THEY COU...",OWNERS WILL BE MAILED CORRECT LABELS FOR INSTA...,"ALSO, CUSTOMERS CAN CONTACT THE NATIONAL HIGHW...",000015210000106403000000349,,,
3,4,02V237000,HOLIDAY RAMBLER,ENDEAVOR,2000,,STRUCTURE,MONACO COACH CORP.,,,...,,,"ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUS...",CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE...,DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK S...,CUSTOMERS CAN ALSO CONTACT THE NATIONAL HIGHWA...,000015211000083965000000272,,,
4,5,02V237000,HOLIDAY RAMBLER,ENDEAVOR,1999,,STRUCTURE,MONACO COACH CORP.,,,...,,,"ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUS...",CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE...,DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK S...,CUSTOMERS CAN ALSO CONTACT THE NATIONAL HIGHWA...,000015211000080938000000272,,,


In [None]:
# Check size of dataframe
df.shape

(291375, 27)

In [None]:
# Assign column names taken from the dataset's metadata
column_names = [
    "RECORD_ID", "CAMPNO", "MAKETXT", "MODELTXT", "YEARTXT", "MFGCAMPNO",
    "COMPNAME", "MFGNAME", "BGMAN", "ENDMAN", "RCLTYPECD", "POTAFF", "ODATE",
    "INFLUENCED_BY", "MFGTXT", "RCDATE", "DATEA", "RPNO", "FMVSS", "DESC_DEFECT",
    "CONEQUENCE_DEFECT", "CORRECTIVE_ACTION", "NOTES", "RCL_CMPT_ID",
    "MFR_COMP_NAME", "MFR_COMP_DESC", "MFR_COMP_PTNO"
]

df.columns = column_names



In [None]:
# Standardize the make by converting to uppercase and stripping whitespace
df['MAKETXT'] = df['MAKETXT'].str.upper().str.strip()

# Keep only FORD and TOYOTA records
filtered_df = df[df['MAKETXT'].isin(['FORD', 'TOYOTA'])].copy()

# verify the unique values
print("Unique MAKETXT values : ", filtered_df['MAKETXT'].unique())



Unique MAKETXT values :  ['FORD' 'TOYOTA']
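
The normalization step matters because `isin` compares exact strings. A minimal sketch on a hypothetical four-row sample (the values below are made up for illustration, not taken from the dataset) shows that inconsistently cased or padded makes are still matched once they are standardized, while missing values are simply dropped by the filter:

```python
import pandas as pd

# Hypothetical sample with inconsistent casing, padding, and a missing value
sample = pd.DataFrame({"MAKETXT": [" ford", "Toyota ", "JAYCO", None]})

# Same normalization as in the cell above
sample["MAKETXT"] = sample["MAKETXT"].str.upper().str.strip()

# Exact-match filter on the normalized values
kept = sample[sample["MAKETXT"].isin(["FORD", "TOYOTA"])].copy()
kept_makes = kept["MAKETXT"].tolist()
```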


In [None]:
filtered_df.head()

Unnamed: 0,RECORD_ID,CAMPNO,MAKETXT,MODELTXT,YEARTXT,MFGCAMPNO,COMPNAME,MFGNAME,BGMAN,ENDMAN,...,RPNO,FMVSS,DESC_DEFECT,CONEQUENCE_DEFECT,CORRECTIVE_ACTION,NOTES,RCL_CMPT_ID,MFR_COMP_NAME,MFR_COMP_DESC,MFR_COMP_PTNO
0,1,02V288000,FORD,FOCUS,2000,02S41,ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES,FORD MOTOR COMPANY,19990719,20010531,...,,,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...,"THIS, IN TURN, COULD CAUSE THE BATTERY CABLES ...",DEALERS WILL INSPECT THE BATTERY CABLES FOR TH...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...,000015339000215021000000202,,,
1,2,02V288000,FORD,FOCUS,2001,02S41,ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES,FORD MOTOR COMPANY,19990719,20010531,...,,,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...,"THIS, IN TURN, COULD CAUSE THE BATTERY CABLES ...",DEALERS WILL INSPECT THE BATTERY CABLES FOR TH...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...,000015339000215022000000202,,,
148,149,02V249000,FORD,CROWN VICTORIA,2002,02S39,"FUEL SYSTEM, OTHER:STORAGE:TANK ASSEMBLY",FORD MOTOR COMPANY,20010510,20020322,...,,,"ON CERTAIN NATURAL GAS MODEL VEHICLES, A T-FIT...",A GAS LEAK OF SUFFICIENT QUANTITY CONCENTRATED...,"DEALERS WILL INSTALL A REDESIGNED ""T"" FITTING ...",ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...,000015251000215009000000162,,,
300,301,02V239000,FORD,NAVIGATOR,2000,02L10,EQUIPMENT:OTHER:LABELS,FORD MOTOR COMPANY,19990331,20000806,...,571.0,120.0,CERTAIN 4X2 SPORT UTILITY VEHICLES FAIL TO COM...,CUSTOMERS MAY INFLATE THEIR REAR TIRES BASED O...,A SUPPLEMENTARY LABEL WILL BE SENT TO CUSTOMER...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...,000015220000089746000000349,,,
301,302,02V239000,FORD,EXPEDITION,2000,02L10,EQUIPMENT:OTHER:LABELS,FORD MOTOR COMPANY,19990331,20000806,...,571.0,120.0,CERTAIN 4X2 SPORT UTILITY VEHICLES FAIL TO COM...,CUSTOMERS MAY INFLATE THEIR REAR TIRES BASED O...,A SUPPLEMENTARY LABEL WILL BE SENT TO CUSTOMER...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...,000015220000203219000000349,,,


In [None]:
# New DataFrame with only the six required columns
df_combined = filtered_df[["MAKETXT", "MODELTXT", "YEARTXT",
                           "DESC_DEFECT", "CONEQUENCE_DEFECT", "CORRECTIVE_ACTION"]].copy()

# Combine the defect, consequence, and corrective-action text into a single column
df_combined['combined_summary'] = (
    df_combined['DESC_DEFECT'].fillna('') + ' ' +
    df_combined['CONEQUENCE_DEFECT'].fillna('') + ' '  +
    df_combined['CORRECTIVE_ACTION'].fillna('')
)

# Drop the now-redundant source columns
df_combined = df_combined.drop(columns = ["DESC_DEFECT", "CONEQUENCE_DEFECT", "CORRECTIVE_ACTION"])
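
The `fillna('')` calls are what keep a row's text when one of the three fields is missing. The sketch below uses a hypothetical two-row sample (made-up text, and only two of the three columns for brevity) to contrast concatenation with and without the fill:

```python
import pandas as pd

# Hypothetical sample: the second row has no corrective action recorded
sample = pd.DataFrame({
    "DESC_DEFECT": ["BRAKE LAMP SWITCH MAY SHORT.", "LABEL IS INCORRECT."],
    "CORRECTIVE_ACTION": ["DEALERS WILL REPLACE THE SWITCH.", None],
})

# Without fillna, a missing field turns the whole combined value into NaN
without_fill = sample["DESC_DEFECT"] + " " + sample["CORRECTIVE_ACTION"]

# With fillna, the available text survives
with_fill = sample["DESC_DEFECT"].fillna("") + " " + sample["CORRECTIVE_ACTION"].fillna("")

lost_row = bool(pd.isna(without_fill.iloc[1]))
kept_text = with_fill.iloc[1]
```

Note that the filled version leaves a trailing space when the last field is empty; a `.str.strip()` on the combined column would remove it if that matters downstream.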


In [None]:
df_combined.head()

Unnamed: 0,MAKETXT,MODELTXT,YEARTXT,combined_summary
0,FORD,FOCUS,2000,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...
1,FORD,FOCUS,2001,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...
148,FORD,CROWN VICTORIA,2002,"ON CERTAIN NATURAL GAS MODEL VEHICLES, A T-FIT..."
300,FORD,NAVIGATOR,2000,CERTAIN 4X2 SPORT UTILITY VEHICLES FAIL TO COM...
301,FORD,EXPEDITION,2000,CERTAIN 4X2 SPORT UTILITY VEHICLES FAIL TO COM...


In [None]:
df_combined.shape

(17137, 4)

In [None]:
#df_combined = df_combined[:150]

## 2.   Create Embeddings and VectorStore for Document Retrieval


In [None]:
#Using opensource ChromaDB Vectorstore
client = chromadb.PersistentClient('vectorstore')
collection = client.get_or_create_collection('auto_issues')

#embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Add each combined summary to the collection together with its metadata
if not collection.count():
    for idx, row in df_combined.iterrows():
        embedding = embedding_model.encode(row['combined_summary'])
        collection.add(
            documents=[row['combined_summary']],
            metadatas=[{
                'make': row['MAKETXT'],
                'model': row['MODELTXT'],
                'year': row['YEARTXT']
            }],
            # df_combined does not contain RECORD_ID, so the unique
            # DataFrame index is used as the document ID instead
            ids=[str(idx)],
            embeddings=[embedding.tolist()]
        )
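
Encoding and inserting one row at a time can be slow for the roughly 17,000 rows in `df_combined`. One common alternative is to process rows in batches, since both the embedding model's `encode` and the collection's `add` accept lists. The sketch below shows only the batching logic; `batched` and `index_in_batches` are hypothetical helpers, the batch size is an arbitrary assumption, and stand-in callables replace the real model and database so the example runs on its own.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def index_in_batches(ids, texts, metadatas, encode, add, batch_size=256):
    """Encode and store documents batch by batch; return the number of add calls.

    `encode` and `add` are injected callables (assumptions for this sketch);
    in the notebook they would plausibly be `embedding_model.encode` and
    `collection.add`.
    """
    calls = 0
    for id_batch, text_batch, meta_batch in zip(
        batched(ids, batch_size), batched(texts, batch_size), batched(metadatas, batch_size)
    ):
        embeddings = encode(text_batch)
        add(ids=id_batch, documents=text_batch, metadatas=meta_batch, embeddings=embeddings)
        calls += 1
    return calls

# Stand-in callables so the sketch runs without a model or a database
stored = []
fake_encode = lambda batch: [[float(len(t))] for t in batch]
fake_add = lambda **kwargs: stored.extend(kwargs["ids"])

n_calls = index_in_batches(
    ids=["0", "1", "2", "3", "4"],
    texts=["a", "bb", "ccc", "dddd", "eeeee"],
    metadatas=[{"make": "FORD"}] * 5,
    encode=fake_encode,
    add=fake_add,
    batch_size=2,
)
```

With five documents and a batch size of two, the loop issues three `add` calls instead of five.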


## 3.   Define LLM and Summarization Chain




In [None]:

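# Llama 3.1 8B (instant) model served through the Groq API
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0,      # minimize sampling randomness for repeatable summaries
    groq_api_key=""     # insert your Groq API key here
)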

retrieval_prompt = PromptTemplate(
    input_variables=["issue", "make", "model", "year", "retrieved_documents"],

    template=
    """

    Analyze and summarize information about {issue} for {make} {model} {year}.

    - You should focus on issues specifically matching the vehicle model and year if available.
    - Only stick to the information retrieved from the documents.
    - Summarize into 50 words.

    Retrieved Documents:
    {retrieved_documents}

    Give a concise summary focusing on directly relevant issues
    Do not use any information outside of the retrieved documents.

    """
)

# LLM Chain
retrieval_chain = retrieval_prompt | llm
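
To make the role of the placeholders concrete, the sketch below fills a shortened copy of the template with plain `str.format`. This is a stand-in for illustration only: it does not use `PromptTemplate` or call the LLM, `TEMPLATE` and `fill_prompt` are hypothetical names, and the two documents are placeholder strings.

```python
# Shortened copy of the notebook's template (the instruction bullets are omitted)
TEMPLATE = (
    "Analyze and summarize information about {issue} for {make} {model} {year}.\n\n"
    "Retrieved Documents:\n{retrieved_documents}\n"
)

def fill_prompt(issue, make, model, year, docs):
    """Join the retrieved documents and substitute every placeholder."""
    return TEMPLATE.format(
        issue=issue,
        make=make,
        model=model,
        year=year,
        retrieved_documents="\n \n".join(docs),
    )

prompt_text = fill_prompt(
    "stuck throttle risk", "FORD", "ESCAPE", "2001",
    ["DOC ONE (placeholder text)", "DOC TWO (placeholder text)"],
)
```

The resulting string is what the model would receive as its instruction: the vehicle and issue in the first sentence, followed by the retrieved text it is told to rely on.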


In [None]:
# Summarization function
def summarize_issue(input_data):

    make = input_data['make'].upper()
    model = input_data['model'].upper()
    year = input_data['year']
    issue = input_data['issue']

    query_text = f"{make} {model} {year} {issue}"

    #compute query embedding
    query_embedding = embedding_model.encode(query_text)

    #Retrieve relevant documents
    results = collection.query(
        query_embeddings=[query_embedding],
        where={
            "$and": [
                {"make": make},
                {"year": year}]},
        n_results=10 )

    retrieved_docs = []
    seen_docs = set()

    # Deduplicate the retrieved documents while preserving retrieval order
    for doc in results.get('documents', [[]])[0]:
        if doc not in seen_docs:
            retrieved_docs.append(doc)
            seen_docs.add(doc)

    #top 5 unique documents
    final_docs = retrieved_docs[:5]

    inputs = {
        "issue": issue ,
        "make": make,
        "model": model,
        "year": year,
        "retrieved_documents": "\n \n".join(final_docs)}

    # summarization chain
    summary = retrieval_chain.invoke(inputs)
    return {
        'input' : input_data ,
        'retrieved_documents' : final_docs,
        'summary' : summary.content
    }
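
The function retrieves ten candidates but keeps at most five distinct ones. This is useful here because the dataset appears to repeat the same recall text across several model and year rows (the FOCUS 2000 and 2001 rows shown earlier start with the same description). The deduplication step can be sketched in isolation; `unique_top_k` is a hypothetical helper and the query result below is made up.

```python
def unique_top_k(documents, k=5):
    """Return the first k distinct documents, preserving retrieval order."""
    seen = set()
    unique = []
    for doc in documents:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique[:k]

# Hypothetical query result shaped like the notebook's `results`:
# one inner list of documents per query.
results = {"documents": [["A", "A", "B", "C", "B", "D", "E", "F"]]}
final_docs = unique_top_k(results.get("documents", [[]])[0])
```

Because order is preserved, the documents ranked most similar by the vector store are the ones that survive the cut.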


In [None]:
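# Example query (values as echoed in the output below)
input_data = {
    'make': 'ford',
    'model': 'escape',
    'year': '2001',
    'issue': 'stuck throttle risk'
}

output = summarize_issue(input_data)
print(output)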

{'input': {'make': 'ford', 'model': 'escape', 'year': '2001', 'issue': 'stuck throttle risk'}, 'retrieved_documents': ['VEHICLE DESCRIPTION:  PASSENGER VEHICLES EQUIPPED WITH ADJUSTABLE PEDALS.    IF THE GREASE FROM THE ADJUSTABLE PEDAL ASSEMBLY ENTERS THE STOP LAMP SWITCH, IT CAN CONTAMINATE THE CONTACTS LEADING TO CARBON BUILD UP, AND POTENTIALLY, A SHORT CIRCUIT. A SHORT CIRCUIT COULD LEAD TO EITHER THE BRAKE LAMPS STAYING ON, OR TO A LOSS OF BRAKE LAMP FUNCTION, INCREASING THE RISK OF A CRASH. DEALERS WILL REPLACE THE BRAKE LAMP SWITCH AND WIPE DOWN THE ADJUSTABLE PEDAL ASSEMBLY TO REMOVE EXCESS GREASE.  OWNER NOTIFICATION BEGAN MARCH 22, 2001.  OWNERS WHO TAKE THEIR VEHICLES TO AN AUTHORIZED DEALER ON AN AGREED UPON SERVICE DATE AND DO NOT RECEIVE THE FREE REMEDY WITHIN A REASONABLE TIME SHOULD CONTACT FORD AT 1-800-392-3673.', 'CERTAIN PASSENGER VEHICLES EQUIPPED WITH ADJUSTABLE PADELS ARE BEING RECALLED IN ORDER TO ADJUST THE BRAKE AND ACCELERATOR PEDALS TO A MINIMUM OF 50 MM OF