# Final Notebook

# 1. Setting Up the Dependencies

- Navigate to the project directory.
- Create a virtual environment using `virtualenv`. Activate the virtual environment afterwards.
```bash
virtualenv .tyre-scan-venv
source .tyre-scan-venv/bin/activate
```
- Install the required dependencies listed in the `requirements.txt` file using `pip`:
```bash
pip install -r requirements.txt
```
- Open the notebook and select the virtual environment as its kernel.

In [3]:
# helper functions to extract relevant urls to scrape data for given EAN Code


# imports
import os
from bs4 import BeautifulSoup

# import the custom googlesearch api key
from constants import googlesearch_api_key, googlesearch_cx, googlesearch_api_url
# import the path to cache directory
from constants import url_cache_dir
# import the html tags to parse (or ignore)
from constants import tags_to_use, tags_to_ignore
# import azure openai creds
from constants import azure_openai_creds
# other constants
from constants import max_chunk_size, chunk_intersection_size, llm_model

# langchain imports
from langchain import hub
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain, create_extraction_chain
from langchain.schema import HumanMessage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import AsyncChromiumLoader

from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain.docstore.document import Document
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, PromptTemplate
from langchain.tools import Tool
from langchain_community.utilities import GoogleSearchAPIWrapper
from langchain_openai import AzureOpenAI, AzureOpenAIEmbeddings, AzureChatOpenAI

# utils
from utils import remove_tags_to_ignore, extract_tags_to_use, translate_text

# set azure openai variables
os.environ["OPENAI_API_TYPE"] = azure_openai_creds["openai_api_type"]
os.environ["OPENAI_API_VERSION"] = azure_openai_creds["openai_api_version"]
os.environ["AZURE_OPENAI_ENDPOINT"] = azure_openai_creds["azure_openai_endpoint"]
os.environ["OPENAI_API_KEY"] = azure_openai_creds["openai_api_key"]

# set google api variables
os.environ["GOOGLE_CSE_ID"] = googlesearch_cx
os.environ["GOOGLE_API_KEY"] = googlesearch_api_key

import nest_asyncio
nest_asyncio.apply()

# URL Extraction

- `extract_top_urls`:

The function returns the search results for the given EAN code as a list of dictionaries (a JSON-like object), one per result, each carrying a title, a link, and a snippet. The `extract_url_list` helper is then typically used to pull the URLs out of this list. The number of results is capped at `min(top_k, 10)`, since the Google Custom Search API returns at most 10 results per request.

### Workflow:

1. The function initializes a `GoogleSearchAPIWrapper` to interact with the Google Search API efficiently.

2. It defines an inner function `top_results(query)` that retrieves the top results for a given query, and wraps it in a LangChain `Tool` named "Google Search Snippets."

3. The main function forms a query string in the format "EAN <ean_code>" and uses the custom tool to fetch results from the Google Search API.

4. The results are returned as a JSON object containing information about the top results, including titles, links, and snippets.

In [None]:
def extract_top_urls(ean_code, top_k = 5):
    search = GoogleSearchAPIWrapper()

    def top_results(query):
        return search.results(query, min(top_k, 10))
    
    tool = Tool(
        name="Google Search Snippets",
        description="Search Google for recent results.",
        func=top_results,
    )

    query = f"EAN {ean_code}"
    results = tool.run(query)
    return results

- `extract_url_list`:

This helper function takes the results JSON object from `extract_top_urls` and extracts a list of URLs. It iterates over the result items and collects the `'link'` attribute into a list, providing a convenient way to access the URLs for further processing.

In [None]:
def extract_url_list(results):
    url_list = []
    for url_dict in results:
        url_list.append(url_dict['link'])
    
    return url_list
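
One caveat: when a search finds nothing, the results list holds a single placeholder entry (the same `{'Result': 'No good Google Search Result was found'}` that `is_ean_valid` checks for later in this notebook), and that entry has no `'link'` key. A minimal defensive sketch, assuming that placeholder shape; `extract_url_list_safe` and the sample data are hypothetical and used only for illustration:

```python
def extract_url_list_safe(results):
    """Collect the 'link' of every result item, skipping items without one."""
    return [item["link"] for item in results if "link" in item]


# Illustrative data shaped like the search results described above.
sample_results = [
    {"title": "Tyre A", "link": "https://example.com/a", "snippet": "..."},
    {"title": "Tyre B", "link": "https://example.com/b", "snippet": "..."},
]
no_results = [{"Result": "No good Google Search Result was found"}]

print(extract_url_list_safe(sample_results))
print(extract_url_list_safe(no_results))
```

With this variant an empty list signals "no URLs to scrape" instead of raising a `KeyError`.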

- `scrape`:

The `scrape` function loads a list of URLs with `AsyncChromiumLoader`, which drives a headless Chromium instance and returns one document per URL containing the page HTML and its metadata. Each page is then preprocessed: `remove_tags_to_ignore` strips unwanted tags, `extract_tags_to_use` keeps the text of the relevant tags, and, if `translation` is enabled, the first 50,000 characters of that text are translated with Azure Translator.

In [None]:
import nest_asyncio
nest_asyncio.apply()

def scrape(urls, tags_to_use=tags_to_use, tags_to_ignore=tags_to_ignore, translation=True):
    loader = AsyncChromiumLoader(urls)
    docs = loader.load()

    for doc in docs:
        html_content = doc.page_content

        # preprocess the html_content
        soup = BeautifulSoup(html_content, 'html.parser')

        # remove tags to ignore
        soup = remove_tags_to_ignore(soup, tags_to_ignore)

        # extract tags to use
        text = extract_tags_to_use(soup, tags_to_use)

        if translation:
            text_limit = 50000
            # translate text using azure translator
            text = translate_text(text[:text_limit])
        
        # update the doc
        doc.page_content = text
    return docs
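
The helpers `remove_tags_to_ignore` and `extract_tags_to_use` are imported from `utils` and are not shown in this notebook. The sketch below illustrates the same idea (drop everything inside ignored tags, keep the text inside wanted tags) using only the standard library's `html.parser` instead of BeautifulSoup. The class name, the tag lists, and the sample HTML are assumptions for illustration, not the project's implementation.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Keep text found inside `tags_to_use`, but never inside `tags_to_ignore`."""

    def __init__(self, tags_to_use, tags_to_ignore):
        super().__init__()
        self.tags_to_use = set(tags_to_use)
        self.tags_to_ignore = set(tags_to_ignore)
        self.use_depth = 0      # number of wanted tags currently open
        self.ignore_depth = 0   # number of ignored tags currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_ignore:
            self.ignore_depth += 1
        elif tag in self.tags_to_use:
            self.use_depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags_to_ignore and self.ignore_depth > 0:
            self.ignore_depth -= 1
        elif tag in self.tags_to_use and self.use_depth > 0:
            self.use_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.use_depth > 0 and self.ignore_depth == 0:
            self.parts.append(text)


def extract_text(html, tags_to_use, tags_to_ignore):
    parser = TagFilter(tags_to_use, tags_to_ignore)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample_html = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><nav><p>Menu</p></nav><h1>BKT TR-135</h1>"
    "<p>Size 12.4-32</p></body></html>"
)
text = extract_text(sample_html, ["h1", "p"], ["script", "nav"])
print(text)
```

Here the `<p>Menu</p>` inside `<nav>` is dropped even though `p` is a wanted tag, because ignored tags take precedence.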

- `split_docs`

The `split_docs` function divides each document into smaller chunks while retaining its metadata. It uses `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, so sizes are measured in tokens: each chunk holds at most `max_chunk_size` tokens, and consecutive chunks overlap by up to `chunk_intersection_size` tokens.

In [2]:
def split_docs(docs, max_chunk_size=max_chunk_size, chunk_intersection_size=chunk_intersection_size):
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=max_chunk_size, chunk_overlap=chunk_intersection_size
    )

    splits = splitter.split_documents(docs)

    return splits
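
To make the roles of `max_chunk_size` and `chunk_intersection_size` concrete, here is a simplified sliding-window sketch. It treats whitespace-separated words as stand-ins for tokens, whereas the real splitter counts `tiktoken` tokens and prefers to break on separators such as paragraphs and newlines, so actual chunk boundaries will differ. `window_chunks` is a hypothetical helper.

```python
def window_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` sharing `chunk_overlap` items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


words = "the quick brown fox jumps over the lazy dog today".split()
for chunk in window_chunks(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each window starts `chunk_size - chunk_overlap` items after the previous one, which is why the last word of one chunk reappears as the first word of the next.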

# Check Validity of EAN

- `is_ean_valid`:

The `is_ean_valid` function treats an EAN (European Article Number) code as valid if a Google search for it returns at least one result, based on the output of `extract_top_urls`. It does not verify the check digit of the code; it only flags codes for which no search results are available, which makes it a quick first filter before any scraping is done.

### Workflow:

1. The function attempts to retrieve relevant URLs associated with the provided EAN code using the `extract_top_urls` function.

2. If the `extract_top_urls` function encounters an exception during execution (such as network issues or API limitations), the function catches the exception and assigns a predefined result indicating no good Google search result was found.

3. The function then compares the obtained results with the predefined invalid result. If the results match, indicating no valid URLs were found, the function concludes that the EAN code is invalid.

4. The function returns a boolean value: `True` if the EAN code is valid (at least one relevant URL found) and `False` if it is invalid (no relevant URLs found).


In [22]:
def is_ean_valid(ean_code):
    invalid_ean_result = [{'Result': 'No good Google Search Result was found'}]
    try:
        results = extract_top_urls(ean_code)
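    except Exception as e:
        # Python 3 exceptions have no `.message` attribute, so print the exception itself
        print(e, e.args)
        results = invalid_ean_result

    # the EAN is treated as valid unless the search returned the "no result" placeholder
    return results != invalid_ean_result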
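
The search-based check needs a network round trip and consumes API quota. As a complementary offline pre-filter, which is not part of the original notebook, the standard EAN-13 check digit can be verified first. A minimal sketch; `has_valid_ean13_checksum` is a hypothetical helper name:

```python
def has_valid_ean13_checksum(ean_code):
    """Return True if `ean_code` has 13 digits and a correct check digit."""
    digits = str(ean_code)
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # The first 12 digits are weighted 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check_digit = (10 - total % 10) % 10
    return check_digit == int(digits[12])


for code in (8903094003235, 3528700139716, 4250463418034, 8903094003236):
    print(code, has_valid_ean13_checksum(code))
```

The first three codes are the ones used as examples in this notebook, and all three pass the checksum, including the one listed as a `false` example for the search-based check. A correct check digit is therefore a necessary condition, but it does not guarantee that search results exist.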

In [30]:
ean_code = 8903094003235
# false eg : 4250463418034
# true eg: 8903094003235, 3528700139716
is_ean_valid(ean_code)

True

In [33]:
# extract docs
scraped_docs = scrape(extract_url_list(extract_top_urls(ean_code, top_k=5)))

In [54]:
urls = extract_url_list(extract_top_urls(ean_code, top_k=5))
urls

['https://www.heuver.com/product/b12400032bk0813500/12-4-32-bkt-tr-135-124a6-121a8-8pr-tt',
 'https://shop.bohnenkamp.de/reifen-12-4-32.html',
 'https://www.heuver.es/product/b12400032bk0813500/12-4-32-bkt-tr-135-124a6-121a8-8pr-tt',
 'https://www.heba-reifen.at/produkt/bkt-12-4-32-tt-124a6-121a8-8pr-tr-135-as/',
 'https://www.rengas-online.com/rshop/Renkaat/BKT/TR-135/12-4--32-8PR-TT/R-187932']

# Checking EAN Context

- `check_ean_context`:

The `check_ean_context` function determines whether an EAN (European Article Number) code is associated with a tyre. It splits the scraped content of the extracted URLs into chunks, stores them in a vector store, and asks a Language Model (LLM), given the most relevant chunks as context, to answer with Yes or No.

### Workflow:

1. The function splits the HTML content into smaller chunks using the `split_docs` function while retaining metadata.

2. It initializes an Azure Chat-based OpenAI Language Model (LLM) for querying and obtaining responses.

3. The Azure OpenAI Embeddings model is configured to create embeddings for the text chunks using the Azure Text Embedding deployment.

4. A vector store is created using the Chroma library, incorporating the embeddings and document splits.

5. The function constructs a retrieval chain to query the LLM for the association of the EAN code with tyres. The result is obtained by processing the retrieved context.

6. The vector store collection is deleted after processing so that chunks from one run do not remain in the store for the next.

In [4]:
from prompts import custom_qa_prompt

def check_ean_context(ean_code, docs, max_chunk_size=2500, chunk_intersection_size=200, k=3):
    splits = split_docs(docs, max_chunk_size=max_chunk_size, chunk_intersection_size=chunk_intersection_size)

    llm = AzureChatOpenAI(
        azure_deployment="gpt-4",
        temperature=0.0,
        max_tokens=16
    )

    # define the embedding model
    embeddings = AzureOpenAIEmbeddings(
        azure_deployment = "text-embedding-ada-002",
        openai_api_version = azure_openai_creds["openai_api_version"]
    )

    vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

    # retrieve and generate using the relevant snippets
    retriever = vectorstore.as_retriever(search_kwargs={"k": k})

    # helper function to join splits
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # define the rag chain
    ean_context_rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | custom_qa_prompt
        | llm
        | StrOutputParser()
    )

    query = f"Is this EAN code {ean_code} associated with a tyre? Return Yes or No and nothing else."

    is_ean_context_valid = ean_context_rag_chain.invoke(query)

    vectorstore.delete_collection()

    # tolerate surrounding whitespace, trailing punctuation, and case differences in the reply
    return is_ean_context_valid.strip().lower().startswith("yes")

In [43]:
check_ean_context(ean_code, scraped_docs)

True
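
The retrieve-then-format step inside the chain can be illustrated without any external service. The sketch below replaces the embedding model and Chroma with a toy bag-of-words cosine similarity, so it shows only the mechanics of picking the top `k` chunks and joining them into one context string. `bow`, `cosine`, `toy_retrieve`, and the sample chunks are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_retrieve(query, chunks, k=3):
    """Return the `k` chunks most similar to `query`, best first."""
    q = bow(query)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]


def format_docs(chunks):
    # same joining rule as the helper inside `check_ean_context`
    return "\n\n".join(chunks)


chunks = [
    "BKT TR-135 tractor tyre 12.4-32, EAN 8903094003235",
    "Free shipping on orders above 100 EUR",
    "Tractor tyre load index 124A6, tube type",
]
context = format_docs(toy_retrieve("Is EAN 8903094003235 a tyre?", chunks, k=2))
print(context)
```

The shipping notice shares no words with the question, so it is the chunk left out when `k=2`.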

# Algorithm 1: RAG Pipeline

- `rag_llm_pipeline`

The `rag_llm_pipeline` function extracts specific attributes, such as diameter, width, brand, model, and vehicle type, for a validated EAN (European Article Number) code. It queries a Language Model (LLM) separately for each attribute in `output_schema`, each time supplying the most relevant chunks retrieved from a vector store as context. If a query fails, the attribute is set to `"NA"`.

### Workflow:

1. The function splits the HTML content into smaller chunks using the `split_docs` function while retaining metadata.

2. It initializes an Azure Chat-based OpenAI Language Model (LLM) for querying and obtaining responses.

3. The Azure OpenAI Embeddings model is configured to create embeddings for the text chunks using the Azure Text Embedding deployment.

4. A vector store is created using the Chroma library, incorporating the embeddings and document splits.

5. The function constructs a retrieval chain to query the LLM for each specified attribute separately. The LLM responses are collected in an output dictionary.

6. The vector store collection is deleted after processing so that chunks from one run do not remain in the store for the next.

In [5]:
from prompts import custom_qa_prompt, output_schema

def rag_llm_pipeline(ean_code, docs, llm_model, max_chunk_size, chunk_intersection_size, k, output_schema):
    splits = split_docs(docs, max_chunk_size=max_chunk_size, chunk_intersection_size=chunk_intersection_size)

    llm = AzureChatOpenAI(
        azure_deployment=llm_model,
        temperature=0.0,
        max_tokens=16
    )

    # define the embedding model
    embeddings = AzureOpenAIEmbeddings(
        azure_deployment = "text-embedding-ada-002",
        openai_api_version = azure_openai_creds["openai_api_version"]
    )

    vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

    # retrieve and generate using the relevant snippets
    retriever = vectorstore.as_retriever(search_kwargs={"k": k})

    # helper function to join splits
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # define the rag chain
    rag_qa_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | custom_qa_prompt
        | llm
        | StrOutputParser()
    )
    
    # define the output_dict
    llm_output = {}

    for key in output_schema:
        try:
            query = output_schema[key]
            output = rag_qa_chain.invoke(query)
            llm_output[key] = output
        except Exception as e:
            # Python 3 exceptions have no `.message` attribute, so print the exception itself
            print(e, e.args)
            llm_output[key] = "NA"

    vectorstore.delete_collection()

    return llm_output

In [None]:
llm_model = "gpt-35-turbo"
max_chunk_size = 512
chunk_intersection_size = 128
k = 6

rag_llm_pipeline(
    ean_code,
    scraped_docs,
    llm_model=llm_model,
    max_chunk_size=max_chunk_size,
    chunk_intersection_size=chunk_intersection_size,
    k=k,
    output_schema=output_schema
)
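
`output_schema` is imported from `prompts` and is not shown here; from the loop in `rag_llm_pipeline` it appears to map each attribute name to the question asked for it. The sketch below illustrates that per-attribute loop and its `"NA"` fallback with a stub in place of the RAG chain. The schema contents, `run_attribute_queries`, and `fake_chain` are hypothetical.

```python
def run_attribute_queries(schema, invoke):
    """Ask one question per attribute; store "NA" when a query fails."""
    output = {}
    for key, question in schema.items():
        try:
            output[key] = invoke(question)
        except Exception as exc:
            print(f"query for {key!r} failed: {exc}")
            output[key] = "NA"
    return output


# Hypothetical schema: attribute name -> question sent to the chain.
example_schema = {
    "brand": "What is the brand of the tyre?",
    "width": "What is the width of the tyre?",
    "model": "What is the model of the tyre?",
}


def fake_chain(question):
    """Stub standing in for `rag_qa_chain.invoke`."""
    if "width" in question:
        raise RuntimeError("simulated API error")
    return "BKT" if "brand" in question else "TR-135"


result = run_attribute_queries(example_schema, fake_chain)
print(result)
```

Because each attribute is queried independently, one failed call costs only that attribute rather than the whole result.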

# Algorithm 2: Prompt Updating

- `prompt_llm_pipeline`:

The `prompt_llm_pipeline` function is designed to iteratively extract information related to a tyre, such as diameter, width, brand, model, and vehicle type, from the scraped HTML content associated with a validated EAN (European Article Number) code. This iterative approach involves splitting the text into chunks, prompting a Language Model (LLM) for information, and updating the schema based on the model's responses.

### Workflow:

1. The function defines system and assistant prompts that set the context and tell the model the expected output format; the user message carries the scraped chunk.

2. The top URLs associated with the EAN code are fetched, and the corresponding HTML content is scraped.

3. The scraped content is divided into chunks of `context_size` characters, with an overlap of `context_intersection` characters between consecutive chunks.

4. The function iterates over each chunk, prompting the LLM to extract information and update the schema.

5. For the first chunk, the assistant provides an initial prompt, and for subsequent chunks, the assistant repeats the prompt while incorporating information from the previous iteration.

6. The extracted information is stored and updated iteratively for each chunk.

In [None]:
# helper function to extract ean data of scraped data from EAN value:
def prompt_llm_pipeline(ean_code, context_size=13000, context_intersection=200, top_urls=None, top_k=10, translation=True):
    '''
    `context_size` is measured in characters (roughly 3.5 characters per token).
    We split the scraped text into chunks of sizes `context_size`, where each chunk
    has an intersection for the last `context_intersection` characters with the previous chunk
    and the first `context_intersection` characters with the next chunk.

    We repeatedly extract and update the output of the llm model for each chunk of text, 
    before feeding the `message` as input to the llm again.
    '''

    system_prompt = "Assistant is an intelligent chatbot designed to help users extract information on tyres based on the raw_text scraped from the web. Instructions: \n - The extraction should include information like - diameter - width - height - price - brand - model - vehicle type of tyre.\n - Strictly follow the output format for filling in the data. \n - The summary should have bullet points for each of the above mentioned label. \n Do not include information not present in the scraped data. Skip the label not available."

    assistant_prompt_init_ = "Here is the output format based on the above instruction. Follow this format strictly for output: \n\nOutput Format: \n\nAccording to the scraped data, the following information is available:\n\n- Diameter: [insert diameter here]\n- Width: [insert width here]\n- Height: [insert height here]\n- Price: [insert price or price range here]\n- Brand: [insert brand here]\n- Model: [insert model here]\n- Vehicle Type of Tyre: [insert vehicle type here]\n\n"

    assistant_prompt_repeat_ = "Here is the output obtained from the previous chunk of scraped data. Update this output following these instructions:\n -Here are the outputs from previous iteration, so make sure to retain all the non-redundant information.\n - Change value of any key having \"Not available\", only after getting evidence from current chunk.\n - Follow this output format strictly for output.\n - Update value to multiple values or range depending on updated information from the chunk.\n\n So, here is the output from previous iteration:\n\n"
    
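    if not top_urls:
        # extract top_k urls using the helpers defined earlier in this notebook
        top_urls = extract_url_list(extract_top_urls(ean_code, top_k))

    # scrape the urls and join the page texts into a single string
    docs = scrape(top_urls, translation=translation)
    scraped_content = "\n\n".join(doc.page_content for doc in docs)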

    # initialize an empty list to store the chunks
    chunks = []

    # split the scraped content into chunks of size `context_size`;
    # advancing by `context_size - context_intersection` makes consecutive chunks overlap
    step = context_size - context_intersection
    for i in range(0, len(scraped_content), step):
        # append the chunk to the list
        chunks.append(scraped_content[i:i+context_size])
    
    
    # handle each chunk separately
    
    # for the initial prompt use system_prompt + chunk_prompt_init + assistant_prompt_init
    # for the remaining loop prompts use system_prompt + chunk_prompt_repeat + assistant_prompt_repeat
        
    # initialize a string to hold the llm output from previous iteration
    llm_output = ""
        
    for i, chunk in enumerate(chunks):
        # if first chunk
        if i == 0:
            chunk_prompt_init = "\n\n Web Scraped Chunk No. " + str(i+1) + ":\n\n"
            chunk_prompt_init += "\n\n" + chunk + "\n\n"

            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": chunk_prompt_init},
                {"role": "assistant", "content": assistant_prompt_init_}
            ]
        
        else:

            chunk_prompt_repeat = "\n\n Web Scraped Chunk No. " + str(i+1) + ":\n\n"
            chunk_prompt_repeat += "\n\n" + chunk + "\n\n"

            # build the repeat prompt afresh each time, so that only the output of the
            # previous iteration is appended rather than every earlier output
            assistant_prompt_repeat = assistant_prompt_repeat_ + "\n\n" + llm_output + "\n\n"

            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": chunk_prompt_repeat},
                {"role": "assistant", "content": assistant_prompt_repeat},
                {"role": "assistant", "content": assistant_prompt_init_}
            ]


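        # query the Azure OpenAI deployment named by `llm_model` through the
        # AzureChatOpenAI client imported at the top of the notebook
        llm = AzureChatOpenAI(
            azure_deployment=llm_model,
            temperature=0.0, # make the outputs as deterministic as possible
            max_tokens=200
        )
        response = llm.invoke([(m["role"], m["content"]) for m in messages])

        # store the extracted data
        llm_output = response.content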

    return llm_output
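
The carry-forward loop can be sketched without calling a model. In the example below, `fake_llm` is a stub that merges `key: value` lines found in the current chunk into the previous output; it stands in for the chat completion call and is purely illustrative, as are `char_chunks`, `iterative_extract`, and the sample text.

```python
def char_chunks(text, context_size, context_intersection):
    """Character windows of `context_size` that overlap by `context_intersection`."""
    step = context_size - context_intersection
    return [text[i:i + context_size] for i in range(0, len(text), step)]


def fake_llm(previous_output, chunk):
    """Stub model: start from the previous output and add fields seen in `chunk`."""
    fields = dict(
        line.split(": ", 1) for line in previous_output.splitlines() if ": " in line
    )
    for line in chunk.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            fields.setdefault(key, value)  # keep values found in earlier chunks
    return "\n".join(f"{k}: {v}" for k, v in fields.items())


def iterative_extract(chunks, llm):
    llm_output = ""
    for chunk in chunks:
        # each call sees the current chunk plus the output of the previous call
        llm_output = llm(llm_output, chunk)
    return llm_output


pages = ["Brand: BKT\nModel: TR-135", "Brand: BKT\nWidth: 12.4"]
print(iterative_extract(pages, fake_llm))
print(char_chunks("abcdefghij", 4, 1))
```

Only the most recent output is passed forward, so the prompt size stays bounded no matter how many chunks are processed.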