## Generate synthetic questions and answers to evaluate the CIAM RAG

We use a Llama 3 model (loaded as `llama3` through Ollama) to generate, for each product, 5 questions that a customer in a fashion store might plausibly ask, each with an answer, a supporting excerpt and the product_asin.
We then use the mistral:7b model to rewrite that output as a JSON list, and a JsonOutputParser to parse the response into JSON objects.
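
The two-stage flow described above can be sketched with plain callables in place of the models. This is a minimal illustration under stated assumptions, not the notebook's actual chain: `generate_then_reformat`, `fake_generate` and `fake_reformat` are hypothetical names, and `json.loads` stands in for the output parser.

```python
import json
from typing import Callable, List


def generate_then_reformat(
    product_info: str,
    generate: Callable[[str], str],
    reformat: Callable[[str], str],
    max_attempts: int = 5,
) -> List[dict]:
    """Stage 1 drafts free-form Q&A text; stage 2 rewrites it as a JSON list."""
    draft = generate(product_info)
    last_error = None
    for _ in range(max_attempts):
        try:
            parsed = json.loads(reformat(draft))
            if isinstance(parsed, list):
                return parsed
            last_error = ValueError("expected a JSON list")
        except json.JSONDecodeError as error:
            last_error = error
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error


# Hypothetical stand-ins so the sketch runs without a model server.
def fake_generate(info: str) -> str:
    return f"Q: Is this a slim fit? A: Yes. Source text: {info}"


def fake_reformat(draft: str) -> str:
    item = {"question": "Is this a slim fit?", "answer": "Yes.",
            "excerpt": "SLIM FIT", "product_asin": ""}
    return json.dumps([item])


qa_items = generate_then_reformat("SLIM FIT bootcut jean", fake_generate, fake_reformat)
print(len(qa_items))
```

Separating the drafting step from the formatting step means only the cheaper formatting call has to be retried when the JSON does not parse.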

A noticeable limitation is that some product_asin values are not extracted correctly: when a product's description references another product, the LLM sometimes returns that other product's product_asin as if it were the main product's.
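
One possible mitigation, offered as an assumption rather than something this notebook does: each source record already carries its own product_asin, while the product text sent to the model in the generation cell does not include it, so the value returned by the LLM could simply be overwritten after parsing. The helper name `pin_product_asin` and the sample values below are hypothetical.

```python
from typing import Dict, List


def pin_product_asin(qa_items: List[Dict[str, str]], known_asin: str) -> List[Dict[str, str]]:
    """Return copies of the Q&A items with product_asin forced to the known value."""
    return [{**item, "product_asin": known_asin} for item in qa_items]


# Hypothetical values for illustration only.
llm_items = [
    {"question": "Q1", "answer": "A1", "excerpt": "E1", "product_asin": "WRONG-ASIN"},
    {"question": "Q2", "answer": "A2", "excerpt": "E2"},
]
fixed = pin_product_asin(llm_items, "MAIN-ASIN")
print([item["product_asin"] for item in fixed])
```

The input list is left unchanged, so the model's original guess stays available for measuring how often it was wrong.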

In [23]:
%pip install --quiet python-dotenv

from dotenv import load_dotenv
load_dotenv()


True

In [24]:
from pathlib import Path
import json

product_path = "../training-sets/WOMEN_FASHION_NO_REVIEWS"
qa_json_file_name = f"{Path(product_path).name.lower()}_qa.json"
product_list = json.loads(Path(product_path + "/raw_products.json").read_text())
products = product_list[0:100]

In [25]:
%pip install --quiet langchain langchain_community langsmith


In [26]:
from langchain_community.llms.ollama import Ollama
from langchain_core.prompts import ChatPromptTemplate

llm = Ollama(model="llama3", temperature=0, num_ctx=4000)

In [27]:

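from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.output_parsers import StrOutputParser
from langchain_core.output_parsers import JsonOutputParser

# Shape of one synthetic question/answer record.
class ProductQA(BaseModel):
    product_asin: str  # ASIN of the product the question is about
    question: str      # question a customer might ask
    answer: str        # answer that recommends the product
    excerpt: str       # passage of the product information that supports the answer


# Set up a parser + inject instructions into the prompt template.
parser = JsonOutputParser(pydantic_object=ProductQA)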
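
To our understanding, `JsonOutputParser` uses `pydantic_object` mainly to build the format instructions and returns plain dicts or lists, so a parsed response can still be missing fields. A minimal standard-library check, assuming the four `ProductQA` fields should all be present as strings (`invalid_records` is a hypothetical helper, not part of LangChain):

```python
from typing import Any, List

REQUIRED_FIELDS = ("product_asin", "question", "answer", "excerpt")


def invalid_records(parsed: Any) -> List[int]:
    """Return indexes of items that are not dicts with all four string fields."""
    items = parsed if isinstance(parsed, list) else [parsed]
    bad = []
    for index, item in enumerate(items):
        ok = isinstance(item, dict) and all(
            isinstance(item.get(field), str) for field in REQUIRED_FIELDS
        )
        if not ok:
            bad.append(index)
    return bad


# Hypothetical parser output: the second item lacks three fields.
sample = [
    {"product_asin": "X", "question": "Q", "answer": "A", "excerpt": "E"},
    {"question": "Q only"},
]
print(invalid_records(sample))  # → [1]
```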



In [28]:
import json
import os

def append_to_json(data_dict, file_path):
    # Check if the file exists
    if os.path.exists(file_path):
        # File exists, read its content
        with open(file_path, 'r') as file:
            try:
                file_content = json.load(file)
                if not isinstance(file_content, list):
                    file_content = [file_content]
            except json.JSONDecodeError:
                # File is empty or not valid JSON, start with an empty list
                file_content = []
    else:
        # File doesn't exist, start with an empty list
        file_content = []
    
    # Append the new data
    file_content.append(data_dict)
    
    # Write the updated content back to the file
    with open(file_path, 'w') as file:
        json.dump(file_content, file, indent=2)
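
Because `append_to_json` uses `list.append`, passing it a list (as happens when the parser returns a JSON list) stores that list as a single nested element. If a flat list of records is preferred, a variant along these lines could be used; `extend_json_file` is a hypothetical helper and the temporary file exists only for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def extend_json_file(items, file_path):
    """Add items to the JSON list stored at file_path, flattening list inputs."""
    path = Path(file_path)
    try:
        content = json.loads(path.read_text()) if path.exists() else []
    except json.JSONDecodeError:
        # Empty or invalid file: start over with an empty list.
        content = []
    if not isinstance(content, list):
        content = [content]
    content.extend(items if isinstance(items, list) else [items])
    path.write_text(json.dumps(content, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo_qa.json"
    extend_json_file([{"question": "Q1"}, {"question": "Q2"}], target)
    extend_json_file({"question": "Q3"}, target)
    stored = json.loads(target.read_text())

print(len(stored))  # → 3
```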

In [29]:


system_message = """You are a customer assistant. You are tasked to help customers find products that will match their taste.

Come up with likely questions that can trigger you to recommend this product. Provide the excerpt of the product information
that contains an answer to each question.

Generate 5 of these.

"""

prompt = PromptTemplate(
    template=system_message + "\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

for index, product in enumerate(products):
    print(f"{index + 1} / {len(products)}\n")
    
    product_name = product.get("name", "")
    product_asin = product.get("product_asin", "")
    description = product.get("description", "").strip()
    overall_ratings = product.get("overall_ratings", "")
    total_customers_that_rated = product.get("total_customers_that_rated", "")
    price = product.get("price", "")
    currency = product.get("currency", "")
    reviews = product.get("reviews", "")
    
    combined_product_info = (
        f"{product_name}\n"
        f"{description}\n"
        f"Overall Ratings: {overall_ratings}\n"
        f"Total Customers that rated: {total_customers_that_rated}\n"
        f"Price: {currency}{price}\n"
    )
    
    chain = prompt | llm | StrOutputParser()
    response = chain.invoke({ "query": combined_product_info })
    # print(response)
    
    # Retry the JSON reformatting step up to 5 times if parsing fails.
    times_tried = 0
    while times_tried < 5:
        try:
            strict_output_model = Ollama(model="mistral", temperature=0)
            strict_output_prompt = ChatPromptTemplate.from_template("""Your final output should be a JSON. Each JSON object must follow this pattern:
{{
    "question": "",
    "answer": "",
    "excerpt": "",
    "product_asin": ""
}}
Extract each product object and reconstruct a JSON list that contains each object,
following the pattern specified above.
Return back the JSON in this content: 
{input}
Your final response must be the JSON you constructed nothing more!
""")
            strict_output_chain = strict_output_prompt | strict_output_model | parser
            json_output = strict_output_chain.invoke({
                "input": response
            })
            append_to_json(json_output, qa_json_file_name)
            break
        except Exception as e:
            print(e)
            times_tried += 1 
    

1 / 100

2 / 100

3 / 100

4 / 100

5 / 100

6 / 100

7 / 100

8 / 100

9 / 100

10 / 100

11 / 100

12 / 100

13 / 100

14 / 100

15 / 100

16 / 100

17 / 100

18 / 100

19 / 100

20 / 100

21 / 100

22 / 100

23 / 100

24 / 100

25 / 100

26 / 100

27 / 100

28 / 100

29 / 100

30 / 100

31 / 100

32 / 100

33 / 100

34 / 100

35 / 100

36 / 100

37 / 100

38 / 100

Invalid json output: ```json
[
  {
    "product_asin": "B07RL5Z8C8",
    "question": "What type of jeans would you recommend for someone who wants a slim fit?",
    "answer": "These Women's Mid-Rise Slim Bootcut Jean have a close fit through the hip and thigh, with slight ease through the leg. They're perfect for those who want a slim silhouette.",
    "excerpt": "SLIM FIT: Close fit through hip and thigh, with slight ease through the leg."
  },
  {
    "product_asin": "B07RL5Z8C8",
    "question": "Are there any jeans that are both comfortable and stylish?",
    "answer": "Yes! These Women's Mid-Rise Slim Bootcut Jean ha