# Retrieval Augmented Generation with metadata context

In the previous app, we used textual context to help the LLM give sensible answers. When a dataset combines structured and unstructured data, it can make sense to pass the structured fields to the retriever as metadata. This notebook tries to show the benefit of doing so.
This notebook is inspired by the blog post at https://blog.langchain.dev/a-chunk-by-any-other-name/.

In [1]:
## Load necessary libraries
import pandas as pd
from pyprojroot.here import here
from textwrap3 import wrap

import os
import openai
import sys

sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv

_= load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

## Load the dataset

In [21]:
rawdat = pd.read_csv(here('data/RAW_recipes.csv'))
## keep only the first 10,000 rows so the code runtime stays manageable
rawdat = rawdat.head(10000)
#rawdat.to_csv(here('data/RAW_recipes.csv'), index=False)

In [3]:
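# Peek at the raw tags, which are stored as stringified lists
rawdat.tags[0:2]
# Drop the enclosing square brackets so each tag string reads as a comma-separated list
rawdat['tags'] = rawdat['tags'].apply(lambda x: x.strip('[]'))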

We will be using the recipes dataset from Kaggle (loaded above from `data/RAW_recipes.csv`). It contains a useful mix of structured and unstructured data. The structured fields relevant here are the number of ingredients and the time taken to make the dish; the unstructured fields are the steps and the description of the recipe.
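As a rough illustration of that split, the sketch below separates one recipe into the text that would be embedded and the fields that would travel along as metadata. The record is hypothetical: the name and the numeric values follow the first row of the dataset, but the text fields are shortened placeholders, and `split_record` and the two field lists are names invented for this example.

```python
# Hypothetical record: the text fields are placeholders, not real dataset values
recipe = {
    "name": "arriba baked winter squash mexican style",
    "description": "a placeholder description",
    "steps": "['placeholder step one', 'placeholder step two']",
    "minutes": 55,
    "n_steps": 11,
    "n_ingredients": 7,
}

# Assumed split for this illustration only
UNSTRUCTURED_FIELDS = ["description", "steps"]
STRUCTURED_FIELDS = ["name", "minutes", "n_steps", "n_ingredients"]


def split_record(record):
    """Return (text_to_embed, metadata_to_filter_on) for one recipe."""
    text = " ".join(str(record[field]) for field in UNSTRUCTURED_FIELDS)
    metadata = {field: record[field] for field in STRUCTURED_FIELDS}
    return text, metadata


text, metadata = split_record(recipe)
print(text)
print(metadata)
```

The free text is what similarity search operates on, while the metadata dictionary is what a filter such as "less than 60 minutes" can be checked against.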

In [4]:
import regex as re

## replace missing values (NAs) with a single space
rawdat.fillna(' ', inplace=True)

def remove_non_ascii(text):
    """Drop every run of non-ASCII characters from a string."""
    return re.sub(r'[^\x00-\x7F]+', '', text)

## remove the non-ASCII characters from every string (object) column
for colname in rawdat.columns:
    if rawdat[colname].dtype == "object":
        rawdat[colname] = rawdat[colname].apply(remove_non_ascii)

rawdat.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"'60-minutes-or-less', 'time-to-make', 'course'...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"'30-minutes-or-less', 'time-to-make', 'course'...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"'time-to-make', 'course', 'preparation', 'main...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"'60-minutes-or-less', 'time-to-make', 'course'...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"'weeknight', 'time-to-make', 'course', 'main-i...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8


## Process the dataset
#### Identify the main unstructured text and corresponding metadata to include

In [5]:
# Select necessary columns from rawdat DataFrame
newDF = rawdat[['name', 'steps', 'description', 'ingredients', 'tags', 'minutes', 'n_steps', 'n_ingredients']].copy()

# Concatenate 'description', 'steps', and 'ingredients' columns into a new 'content' column
newDF['content'] = newDF['description'].str.cat([newDF['steps'], newDF['ingredients']], sep=' ')

# Give the count columns more descriptive names
newDF.rename(columns={'n_steps': 'number_of_steps', 'n_ingredients': 'number_of_ingredients'}, inplace=True)


# Display all columns of the first row
newDF.head(1)

Unnamed: 0,name,steps,description,ingredients,tags,minutes,number_of_steps,number_of_ingredients,content
0,arriba baked winter squash mexican style,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...","'60-minutes-or-less', 'time-to-make', 'course'...",55,11,7,autumn is my favorite time of year to cook! th...


## Chunking and Embedding

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

from langchain.text_splitter import CharacterTextSplitter,RecursiveCharacterTextSplitter

persist_dir = 'docs/chroma/'

embedding = OpenAIEmbeddings()

## Let's create the chunks

def chunk_section(section, chunk_size, chunk_overlap):
    """Split one recipe row into overlapping chunks that each carry the row's metadata."""
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n","\n"," ",""],
        chunk_size = chunk_size,
        chunk_overlap= chunk_overlap,
        length_function = len
    )
    chunks = text_splitter.create_documents(
        texts=[section["content"]],
        metadatas=[{"name":section["name"],
                    "tags":section["tags"],
                    "ingredients":section["ingredients"],
                    "minutes":section["minutes"],
                    "number_of_steps":section["number_of_steps"],
                    "number_of_ingredients":section["number_of_ingredients"]}]
    )
    # Flatten each chunk into a plain dict: the chunk text plus the metadata fields used for filtering
    return [{"text": chunk.page_content,
             "name": chunk.metadata["name"],
             "tags": chunk.metadata["tags"],
             "minutes": chunk.metadata["minutes"],
             "number_of_steps": chunk.metadata["number_of_steps"],
             "number_of_ingredients": chunk.metadata["number_of_ingredients"]} for chunk in chunks]



chunked_data = newDF.apply(lambda row: chunk_section(row, 200,50),axis=1)

# Flatten the list of lists
chunked_data = [item for sublist in chunked_data for item in sublist]

# Convert the list of dictionaries to a DataFrame
outDF = pd.DataFrame(chunked_data)



In [7]:
outDF.head()

Unnamed: 0,text,name,tags,minutes,number_of_steps,number_of_ingredients
0,autumn is my favorite time of year to cook! th...,arriba baked winter squash mexican style,"'60-minutes-or-less', 'time-to-make', 'course'...",55,11,7
1,two of my posted mexican-inspired seasoning mi...,arriba baked winter squash mexican style,"'60-minutes-or-less', 'time-to-make', 'course'...",55,11,7
2,", cut into half or fourths', 'remove seeds', '...",arriba baked winter squash mexican style,"'60-minutes-or-less', 'time-to-make', 'course'...",55,11,7
3,"seasoning mix ii', 'for sweet squash , drizzle...",arriba baked winter squash mexican style,"'60-minutes-or-less', 'time-to-make', 'course'...",55,11,7
4,"mix', 'bake at 350 degrees , again depending o...",arriba baked winter squash mexican style,"'60-minutes-or-less', 'time-to-make', 'course'...",55,11,7
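The `outDF` preview above shows several chunks cut from the same recipe, each repeating the recipe-level metadata. The sketch below illustrates that idea with a plain sliding window over characters. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on the listed separators), and `sliding_chunks` and `chunk_with_metadata` are hypothetical helper names.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def chunk_with_metadata(text, metadata, chunk_size=200, chunk_overlap=50):
    """Attach the same recipe-level metadata to every chunk of the text."""
    return [{"text": chunk, **metadata}
            for chunk in sliding_chunks(text, chunk_size, chunk_overlap)]


# Small, made-up values so the overlap is easy to see
sample_text = "abcdefghij" * 3  # 30 characters
rows = chunk_with_metadata(
    sample_text,
    {"name": "demo recipe", "minutes": 55},
    chunk_size=12,
    chunk_overlap=4,
)
for row in rows:
    print(row)
```

Because every chunk keeps its own copy of the metadata, a filter on `minutes` or `number_of_ingredients` can be applied to any chunk without looking up the parent recipe.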


In [8]:
metadata = outDF[['name', 'tags','minutes','number_of_steps','number_of_ingredients']].to_dict('records')

vectorDB_new = Chroma.from_texts(
    texts = outDF['text'].tolist(),
    metadatas = metadata,
    embedding = embedding,
    persist_directory = persist_dir
)

## Retrieval process

In [9]:
## SelfQuery Retrieval 
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo, get_query_constructor_prompt, load_query_constructor_runnable
## Describe each metadata field so the retriever knows which attributes it can filter on
metadata_field_info = [
    AttributeInfo(
        name = "name",
        description = "This is the recipe name",
        type = "string"
    ),
    AttributeInfo(
        name = "tags",
        description = "This is the tags associated with the recipe",
        type = "string"
    ),
    AttributeInfo(
        name = "minutes",
        description = "The total time for the recipe creation in minutes",
        type = "integer"
    ),
    AttributeInfo(
        name = "number_of_steps",
        description = "The number of steps involved in creation of recipe",
        type = "integer"
    ),
    AttributeInfo(
        name = "number_of_ingredients",
        description = "The number of ingredients involved in creation of recipe",
        type = "integer"
    )
]
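A self-query filter can only work if the declared attribute names and types match what is actually stored with each chunk. The following is a minimal sanity check under stated assumptions: plain dictionaries stand in for the `AttributeInfo` entries, `find_schema_problems` is a hypothetical helper, the first record uses values from the dataset preview with shortened tags, and the second record is made up to contain deliberate mistakes.

```python
# Plain-dict stand-in for the declared fields (attribute name -> declared type)
DECLARED_FIELDS = {
    "name": "string",
    "tags": "string",
    "minutes": "integer",
    "number_of_steps": "integer",
    "number_of_ingredients": "integer",
}
PYTHON_TYPES = {"string": str, "integer": int}


def find_schema_problems(declared, records):
    """List every missing field or type mismatch found in the metadata records."""
    problems = []
    for i, record in enumerate(records):
        for field, type_name in declared.items():
            if field not in record:
                problems.append(f"record {i}: missing '{field}'")
            elif not isinstance(record[field], PYTHON_TYPES[type_name]):
                problems.append(f"record {i}: '{field}' should be {type_name}")
    return problems


records = [
    # Values from the dataset preview, with the tags shortened
    {"name": "alouette potatoes", "tags": "'60-minutes-or-less', 'time-to-make'",
     "minutes": 45, "number_of_steps": 11, "number_of_ingredients": 11},
    # Made-up record with a string where an integer is declared and one missing field
    {"name": "demo recipe", "tags": "'easy'", "minutes": "45", "number_of_steps": 3},
]
problems = find_schema_problems(DECLARED_FIELDS, records)
for problem in problems:
    print(problem)
```

Note that numeric values coming out of pandas may be NumPy integers rather than Python `int`, which this simple `isinstance` check would flag; a real check would need to account for that.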

## Generate chains with and without metadata context

In [10]:
document_content_description = "Recipes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorDB_new,
    document_content_description,
    metadata_field_info,
    verbose=True,
    enable_limit=True
)



In [11]:
question = "Can you show me some vegetarian recipes using mexican flavors that take less than 60 minutes with 10 or less ingredients?"

In [12]:
retriever.get_relevant_documents(question)

[Document(page_content="'mexican-style tomatoes', 'vegetarian ground beef', 'onion', 'chili powder', 'cayenne', 'vegetable oil', 'cornbread batter']", metadata={'minutes': 30, 'name': 'chili and cornbread casserole', 'number_of_ingredients': 8, 'number_of_steps': 10, 'tags': "'30-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'breads', 'main-dish', 'beans', 'beef', 'american', 'southwestern-united-states', 'tex-mex', 'oven', 'easy', 'grains', 'dietary', 'spicy', 'one-dish-meal', 'comfort-food', 'inexpensive', 'meat', 'pasta-rice-and-grains', 'taste-mood', 'equipment'"}),
 Document(page_content="'green bell pepper', 'vegetarian refried beans', 'flour tortillas', 'tomatoes', 'cheddar cheese', 'sour cream']", metadata={'minutes': 25, 'name': 'cheesy mexican rice   bean burritos', 'number_of_ingredients': 8, 'number_of_steps': 8, 'tags': "'30-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'pr

In [13]:
## Baseline: Retrieval QA chain over plain similarity search, without metadata filtering
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Named baseline_qa so it does not overwrite the self-query `retriever` defined above
baseline_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=vectorDB_new.as_retriever(),
    return_source_documents=True
)
question = "Can you show me some vegetarian recipes using mexican flavors that take less than 60 minutes?"

response = baseline_qa({"query": question})
print(response['result'])



Here are a few vegetarian recipes using Mexican flavors that take less than 60 minutes to make:

1. **Vegetarian Chili with Tortilla Chips and Sour Cream**: This recipe takes approximately 10-15 minutes. Ingredients include tortilla chips, black beans, vegetarian chili, taco seasoning, shredded cheddar cheese, salsa, and sour cream.

2. **Mexican Bean Soup with Tortilla Chips and Cheese**: This recipe takes approximately 30 minutes to 1 hour. Ingredients include onion, garlic, vegetable oil, stewed tomatoes, chicken broth, beef broth, water, Rotel tomatoes & chilies, and cumin.

3. **Avocado Hummus with Tortilla Chips**: This recipe takes about 30 minutes. Ingredients include garlic, garbanzo beans, lemon juice, onion, avocado, green chili, salt, pepper, plum tomato, and green onion.

I hope you find these recipes helpful for your vegetarian Mexican-inspired meal!


In [14]:
## Build the query constructor with load_query_constructor_runnable
chain = load_query_constructor_runnable(
    llm=ChatOpenAI(
        model="gpt-3.5-turbo",
        temperature=0
    ),
    attribute_info=metadata_field_info,
    document_contents=document_content_description,
    fix_invalid=True
)

In [15]:
query = "Can you show me some vegetarian recipes using mexican flavors that take less than 60 minutes with less than 10 ingredients?"
chain.invoke({"query": query})

StructuredQuery(query='vegetarian mexican flavors', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.LT: 'lt'>, attribute='minutes', value=60), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='number_of_ingredients', value=10)]), limit=None)
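The `StructuredQuery` above separates the semantic part of the question (`'vegetarian mexican flavors'`) from a metadata filter (`minutes < 60` and `number_of_ingredients < 10`). The sketch below shows conceptually what applying such a filter to chunk metadata means. It is an illustration only, not how LangChain or Chroma evaluate filters internally: the nested-tuple representation and the `matches` helper are hypothetical, and the candidate values are taken from the outputs shown in this notebook.

```python
import operator

# Comparator names mirror the ones in the StructuredQuery output
COMPARATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "eq": operator.eq,
}

# Hand-written stand-in for the filter in the StructuredQuery output
structured_filter = ("and", [
    ("lt", "minutes", 60),
    ("lt", "number_of_ingredients", 10),
])


def matches(metadata, node):
    """Recursively evaluate a filter node against one metadata dictionary."""
    op = node[0]
    if op == "and":
        return all(matches(metadata, child) for child in node[1])
    if op == "or":
        return any(matches(metadata, child) for child in node[1])
    _, attribute, value = node
    return COMPARATORS[op](metadata[attribute], value)


candidates = [
    {"name": "cheesy mexican rice   bean burritos", "minutes": 25, "number_of_ingredients": 8},
    {"name": "all in the kitchen chili", "minutes": 130, "number_of_ingredients": 13},
    {"name": "alouette potatoes", "minutes": 45, "number_of_ingredients": 11},
]
kept = [c["name"] for c in candidates if matches(c, structured_filter)]
print(kept)
```

Only the burritos satisfy both conditions here: the chili fails on time, and the potatoes are quick enough but use 11 ingredients.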

In [16]:
sq_retriever = SelfQueryRetriever(
    query_constructor=chain,
    vectorstore=vectorDB_new,
    verbose=True,
)
sq_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=sq_retriever,
    return_source_documents=True
)
def print_result(response_obj):
    """Print each retrieved source chunk with its metadata, then the final answer."""
    print("SOURCES: \n")
    for cnt, source_doc in enumerate(response_obj["source_documents"], start=1):
        print(f"Chunk #{cnt}")
        print("Source Metadata: ", source_doc.metadata)
        print("Source Text:")
        print(source_doc.page_content)
        print("\n")
    print("RESULT: \n")
    print(response_obj["result"] + "\n\n")

In [17]:
response = sq_qa({"query": query})
print_result(response)

SOURCES: 

Chunk #1
Source Metadata:  {'minutes': 25, 'name': 'cheesy mexican rice   bean burritos', 'number_of_ingredients': 8, 'number_of_steps': 8, 'tags': "'30-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'eggs-dairy', 'mexican', 'easy', 'vegetarian', 'cheese', 'dietary', 'comfort-food', 'taste-mood'"}
Source Text:
'green bell pepper', 'vegetarian refried beans', 'flour tortillas', 'tomatoes', 'cheddar cheese', 'sour cream']


Chunk #2
Source Metadata:  {'minutes': 45, 'name': 'rosarita vegetarian chile  chile  relleno bake', 'number_of_ingredients': 8, 'number_of_steps': 15, 'tags': "'60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'north-american', 'main-dish', 'side-dishes', 'beans', 'eggs-dairy', 'mexican', 'easy', 'cheese', 'eggs', 'inexpensive'"}
Source Text:
flour', 'diced green chilies', 'vegetarian refried beans', 'onion', 'salsa verde', 'monterey jack pepp

In [18]:
wrap(response['result'])

[' One possible recipe that fits this criteria is a vegetarian bean and',
 'cheese quesadilla. To make this, you will need flour tortillas,',
 'vegetarian refried beans, diced green chilies, onion, salsa verde,',
 'monterey jack pepper cheese, and Mexican blend cheese. Begin by',
 'heating a large skillet over medium heat and lightly spraying it with',
 'cooking spray. Place a tortilla in the skillet and spread a layer of',
 'refried beans on one half. Top with diced green chilies, onion, and a',
 'sprinkle of cumin and garlic powder. Fold the tortilla in half and',
 'cook for 2-3 minutes on each side, until lightly browned. Repeat with',
 'remaining tortillas and filling ingredients. Serve with a side of',
 'salsa for dipping. This recipe should take less than 30 minutes to',
 'prepare and uses only 8 ingredients.']