There are four key phases to this project:

1. Scrape cooking recipe information from the Betty Crocker data (/data/receipe.txt) and create a list of objects, one per recipe
2. Search Function
3. LLM + RAG Support
4. Actions

Phase 1: Scrape and Clean Data
In this phase the objective is to convert the recipe book text into a list of dictionaries, each representing information about a single recipe. The dictionaries must have the following keys:

`title`: Name of the recipe

`ingredients`: String of ingredients separated by newline characters

`instructions`: String of text explaining how to prepare the recipe

`serving_size`: Number of portions or people this can serve

`notes`: Any additional information in the recipe that doesn't fit the above

Other details:
-  The recipe book has roughly three parts: basic baking, independent recipes (i.e. recipes that do not depend on making other recipes), and dependent recipes

-  Recipe titles should always be uppercase

-  Some recipes do not have an ingredient list, e.g. Hush Puppies. The ingredients are actually in the first sentence of the instructions. *Hint*: you can identify these recipes based on the verb used at the beginning of the sentence.

- Some recipes actually do not have ingredients at all. Instead, they begin with instructions to first make another basic recipe, e.g. Chicken Griddlecakes. In such cases, simply leave ingredients as an empty string.

- Serving sizes are often represented as ranges, e.g. Waffles Supper Royal makes "6 to 8 servings", and should be stored in your dictionary as a list, [6,8] in this case. If the recipe gives a single number, then simply repeat it, e.g. Celery Crescents says it "makes 16" so the `serving_size` should be [16,16]. If no serving size is given in the recipe, e.g. Batter Fried Shrimp, then set it to [0,0].
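
The serving-size rules above can be sketched with regular expressions. This is a minimal sketch, not the required solution: `parse_serving_size` is a hypothetical helper, and the patterns only cover the phrasings quoted above ("6 to 8 servings", "makes 16"), so other wordings in the book would need extra patterns.

```python
import re
from typing import List

# Assumed phrasings, taken from the examples above; the real text may vary.
_NUM_RANGE = r"(\d+)\s*(?:to|-)\s*(\d+)"
_RANGE = re.compile(
    rf"makes\s+(?:about\s+)?{_NUM_RANGE}|{_NUM_RANGE}\s+servings?",
    re.IGNORECASE,
)
_SINGLE = re.compile(
    r"makes\s+(?:about\s+)?(\d+)|(\d+)\s+servings?",
    re.IGNORECASE,
)


def parse_serving_size(text: str) -> List[int]:
    """Return [min, max] servings found in text, or [0, 0] if none is given."""
    match = _RANGE.search(text)
    if match:
        # Only one alternative matched, so exactly two groups are set.
        low, high = (int(g) for g in match.groups() if g is not None)
        return [low, high]
    match = _SINGLE.search(text)
    if match:
        n = int(match.group(1) or match.group(2))
        return [n, n]  # a single number is repeated
    return [0, 0]
```

Ranges are checked before single numbers so that "6 to 8 servings" is not read as just "8 servings".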

# Example

The Cheese Bread recipe has the following text in the recipe book:

>CHEESE BREAD
>
> _Wonderful warm, sliced ½″ thick._
>
>1 egg, beaten
>
>1½ cups milk
>
>3¾ cups Bisquick
>
>¾ cup grated sharp cheese
>
>Heat oven to 350° (mod.). Blend all together. Beat 30 seconds, until well blended. Pour into well greased, waxed paper-lined 9x5x2½″ loaf pan. Bake 1 hr. When serving cold, slice thin.
>
>

The resulting dictionary would look like this:

In [14]:
{'title': 'CHEESE BREAD',
'ingredients': '1 egg, beaten\n1½ cups milk\n3¾ cups Bisquick\n¾ cup grated sharp cheese',
'instructions': 'Heat oven to 350° (mod.). Blend all together. Beat 30 seconds, until well blended. Pour into well greased, waxed paper-lined 9x5x2½″ loaf pan. Bake 1 hr. When serving cold, slice thin.',
'notes': 'Wonderful warm, sliced ½″ thick.',
'serving_size': [0, 0]}

{'title': 'CHEESE BREAD',
 'ingredients': '1 egg, beaten\n1½ cups milk\n3¾ cups Bisquick\n¾ cup grated sharp cheese',
 'instructions': 'Heat oven to 350° (mod.). Blend all together. Beat 30 seconds, until well blended. Pour into well greased, waxed paper-lined 9x5x2½″ loaf pan. Bake 1 hr. When serving cold, slice thin.',
 'notes': 'Wonderful warm, sliced ½″ thick.',
 'serving_size': [0, 0]}
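
A quick way to check that a parsed recipe matches this shape is a small validator. The sketch below is illustrative only: `Recipe`, `REQUIRED_KEYS` and `is_valid_recipe` are hypothetical names, and the checks assume exactly the five keys and types described above.

```python
from typing import List, TypedDict


class Recipe(TypedDict):
    """Shape of one parsed recipe (hypothetical name, for illustration)."""
    title: str
    ingredients: str
    instructions: str
    notes: str
    serving_size: List[int]


REQUIRED_KEYS = {"title", "ingredients", "instructions", "notes", "serving_size"}
STRING_KEYS = ("title", "ingredients", "instructions", "notes")


def is_valid_recipe(obj: dict) -> bool:
    """Return True if obj has exactly the required keys with the expected types."""
    if set(obj) != REQUIRED_KEYS:
        return False
    if not all(isinstance(obj[key], str) for key in STRING_KEYS):
        return False
    if obj["title"] != obj["title"].upper():
        return False  # titles should always be uppercase
    size = obj["serving_size"]
    return (
        isinstance(size, list)
        and len(size) == 2
        and all(isinstance(n, int) for n in size)
    )
```

Running this over every parsed dictionary is a cheap way to spot entries where, for example, `ingredients` came back as a list instead of a newline-separated string.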

In [32]:
from tqdm import tqdm
from langchain_ollama import ChatOllama
import json

In [37]:
class RecipeExtractionLLM():
    def __init__(self):
        self.llm = ChatOllama(
            model="mistral:latest",
            temperature=0.0
        )
        print("Loaded Mistral 7B via Ollama using LangChain")

    def is_recipe_start(self, text: str) -> bool:
        prompt = (
            "### Instruction:\n"
            "Determine if the following line is the START of a recipe title in a cookbook.\n"
            "Respond only with YES or NO.\n\n"
            f"### Input:\n{text}\n\n### Response:"
        )
        response = self.llm.invoke(prompt)
        return response.content.strip().upper().startswith("YES")
    
    def create_recipe_object(self, text_block: str) -> dict:
        prompt = (
            "### Instruction:\n"
            "Extract the recipe information from the following text block. "
            "Return a JSON object with the following keys: 'title', 'ingredients', 'instructions', "
            "'notes', and 'serving_size'.\n"
            "- The title should be the name of the recipe and MUST be in uppercase.\n"
            "- Ingredients MUST be in a string with each ingredient on its own line.\n"
            "- Instructions MUST be a string with full preparation steps.\n"
            "- Notes should include any extra details or comments (e.g., italicized comments or serving tips).\n"
            "- Serving size should be a list of 2 integers like [6, 8] where each of the values represent [min_serving_size, max_serving_size], or [0, 0] if unspecified.\n"
            "- DO NOT make up or hallucinate ingredients, notes, or steps that are not present in the input.\n\n"
            f"### Input:\n{text_block.strip()}\n\n### Response:"
        )
        # print("Running create_recipe_object...")
        response = self.llm.invoke(prompt)

        try:
            return json.loads(response.content.strip())
        except json.JSONDecodeError:
            # Show the raw response so parsing failures can be diagnosed.
            print("⚠️ Failed to parse LLM response, raw output below:")
            print(response.content)
            return {}

receipe_extractor = RecipeExtractionLLM()

Loaded Mistral 7B via Ollama using LangChain
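
Local models often wrap their JSON in extra prose, in which case `json.loads` on the raw response fails and the recipe is dropped. The sketch below shows one tolerant alternative; `extract_json_object` and `normalize_recipe` are hypothetical helpers (not part of LangChain or Ollama), and the normalization only enforces two of the rules stated earlier: uppercase titles and newline-separated ingredient strings.

```python
import json
from typing import Any, Dict


def extract_json_object(raw: str) -> Dict[str, Any]:
    """Return the first JSON object embedded in raw, or {} if none parses."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores any trailing text after it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)
    return {}


def normalize_recipe(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce the title to uppercase and a list of ingredients to one string."""
    recipe = dict(obj)  # copy so the caller's dict is not mutated
    recipe["title"] = str(recipe.get("title", "")).upper()
    ingredients = recipe.get("ingredients", "")
    if isinstance(ingredients, list):
        ingredients = "\n".join(str(item) for item in ingredients)
    recipe["ingredients"] = ingredients
    return recipe
```

In `create_recipe_object`, `json.loads(response.content.strip())` could be swapped for `normalize_recipe(extract_json_object(response.content))`, guarded by a check that the extracted object is non-empty.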


In [None]:
# Use the language model to detect where each recipe starts and to parse each block
all_receipes = [] # where the recipe objects will be stored

with open("../data/receipe.txt", "r", encoding="utf-8") as f:
    lines = [line for i, line in enumerate(f, start=1) if 48 <= i <= 1817]
    curr_receipe = ""

    for line in tqdm(lines, desc="Processing lines"):
        if line.strip() == "": continue

        is_receipe = receipe_extractor.is_recipe_start(line)
        
        if is_receipe:
            print(f"Line - {line.strip()} is a Receipe!")
            if curr_receipe.strip() != "":
                receipe = receipe_extractor.create_recipe_object(curr_receipe)

                if receipe: all_receipes.append(receipe)
            curr_receipe = line  # keep the newline so the title stays on its own line
        else:
            curr_receipe += line

    if curr_receipe.strip() != "":
        receipe = receipe_extractor.create_recipe_object(curr_receipe)

        if receipe: all_receipes.append(receipe)

Processing lines:   0%|          | 1/1770 [00:02<1:08:30,  2.32s/it]

Line - [Illustration: PANCAKES] is a Receipe!


Processing lines:   0%|          | 6/1770 [00:04<14:17,  2.06it/s]  

Line - [Illustration: MUFFINS] is a Receipe!


Processing lines:   0%|          | 7/1770 [00:09<39:14,  1.34s/it]


KeyboardInterrupt: 
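
The run above calls the model once per line, which the progress bar shows taking on the order of seconds per line across 1,770 lines, and it flagged `[Illustration: PANCAKES]` as a recipe title. Since recipe titles are expected to be uppercase, a cheap text check can filter most lines before any LLM call. This is a sketch under assumptions: `looks_like_title` is a hypothetical helper, and it assumes titles are all-caps lines while illustration markers are wrapped in square brackets.

```python
import re

# Lines such as "[Illustration: PANCAKES]" are captions, not recipe titles.
_BRACKETED = re.compile(r"^\[.*\]$")


def looks_like_title(line: str) -> bool:
    """Cheap pre-filter: True for all-caps lines that are not bracketed captions."""
    text = line.strip()
    if not text or _BRACKETED.match(text):
        return False
    letters = [ch for ch in text if ch.isalpha()]
    # Require a few letters so bare quantities or temperatures are rejected.
    return len(letters) >= 3 and all(ch.isupper() for ch in letters)
```

One way to use it is `is_receipe = looks_like_title(line) and receipe_extractor.is_recipe_start(line)`, so the model is only consulted for lines that pass the heuristic.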

In [36]:
basic_recipes

[{'title': 'PANCAKES',
  'ingredients': '2 cups Bisquick\n1 egg\n1 ⅔ cups milk',
  'instructions': 'Beat the ingredients (2 cups Bisquick, 1 egg, 1 ⅔ cups milk) with a rotary beater until well blended.\nPour batter onto a heated griddle.\nTurn pancakes when bubbles appear.',
  'notes': '_Makes about 18 4" pancakes. For thinner pancakes use 2 cups milk._',
  'serving_size': [18, 4]},
 {'title': 'MUFFINS',
  'ingredients': ['2 tablespoons of sugar', '1 egg', '3/4 cup'],
  'instructions': 'Heat oven to 400° (mod. hot). Blend together the ingredients.',
  'notes': '',
  'serving_size': [0, 0]},
 {'title': 'BISQUICK MUFFINS',
  'ingredients': ['milk, 2 cups', 'Bisquick'],
  'instructions': 'Beat vigorously for 30 seconds.\nFill 12 well greased muffin cups ⅔ full.\nBake for 15 minutes.',
  'notes': '_For richer batter_, add 2 tbsp. more sugar, 2 tbsp. melted shortening or salad oil.',
  'serving_size': [0, 0]},
 {'title': 'WAFFLES',
  'ingredients': '2 cups Bisquick\n1 ⅔ cups milk\n1 egg\n2 