# Recipe Parser

This notebook shows how an LLM can extract recipes from web pages and rewrite them in a given format, using OpenAI's GPT-4o-mini model.

For a given URL, the goal is to have the model extract the recipe into a known JSON format, and then rewrite it in a specified Markdown format.

In [None]:
from langchain_openai import ChatOpenAI
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

In [None]:
model = ChatOpenAI(
    model="gpt-4o-mini",
)

Some websites render their content with JavaScript after the initial page load, so we use a headless browser and wait a couple of seconds to give the page time to render.

We use Selenium to load the page and BeautifulSoup to parse the HTML, returning only the text content.

In [None]:
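import time

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)


def load_url(url):
    driver.get(url)
    # implicitly_wait() only sets a timeout for element lookups, so pause
    # explicitly to give client-side JavaScript time to render the page.
    time.sleep(2)
    # Name the parser explicitly to avoid BeautifulSoup's "no parser" warning.
    return BeautifulSoup(driver.page_source, "html.parser").get_text()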
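
The text returned by `get_text()` tends to contain long runs of blank lines and indentation left over from the page layout. Below is a minimal sketch of a clean-up step, assuming we want to shrink the text before sending it to the model. `compact_text` is a hypothetical helper and not part of the original notebook.

```python
import re


def compact_text(text: str) -> str:
    """Collapse layout whitespace so the page text uses fewer tokens."""
    # Squeeze runs of spaces/tabs into one space, then trim each line.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Drop the empty lines left behind by removed markup.
    return "\n".join(line for line in lines if line)


sample = "  Veggie   Chilli \n\n\n\t Serves 4\n   \n"
print(compact_text(sample))
```

If this were adopted, `load_url` could return `compact_text(...)` of the extracted text instead of the raw string.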

Next, we use Pydantic to define the format of the recipe we want to extract. The extracted recipe will be a JSON object with the following structure:

```json
{
    "name": "Name of the recipe",
    "recipe_url": "URL of the recipe",
    "ingredients": [
        {
            "name": "Name of ingredient",
            "quantity": "Quantity of ingredient",
            "unit": "Unit of ingredient",
            "directions": "Directions on how to use the ingredient"
        }
    ],
    "instructions": "Cooking instructions"
}
```

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field
from typing import List, Optional
import json


class Ingredient(BaseModel):
    name: str = Field(..., title="Name of ingredient")
    quantity: str = Field(..., title="Quantity of ingredient")
    unit: str = Field(..., title="Unit of ingredient")
    directions: Optional[str] = Field(..., title="Directions on how to use the ingredient")


class Recipe(BaseModel):
    name: str = Field(..., title="Name of the recipe")
    recipe_url: str = Field(..., title="URL of the recipe")
    ingredients: List[Ingredient] = Field(..., title="List of ingredients")
    instructions: str = Field(..., title="Cooking instructions")


parser = JsonOutputParser(pydantic_object=Recipe)

recipe_extractor_prompt = PromptTemplate(
    template="Extract the recipe.\n{format_instructions}\n{recipe}\n{recipe_url}\n",
    input_variables=["recipe", "recipe_url"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
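
`JsonOutputParser` appears to use the Pydantic model mainly to generate the format instructions, and to return a plain dictionary rather than a validated `Recipe`, so it may be worth checking the result before passing it on. Below is a minimal sketch using only the standard library. `missing_recipe_fields` is a hypothetical helper, the field names are taken from the `Recipe` and `Ingredient` models above, and the URL is a placeholder.

```python
RECIPE_FIELDS = ("name", "recipe_url", "ingredients", "instructions")
# "directions" is typed Optional in the model, so it is skipped here.
INGREDIENT_FIELDS = ("name", "quantity", "unit")


def missing_recipe_fields(recipe: dict) -> list:
    """Return the paths of expected fields that are absent from the dict."""
    missing = [field for field in RECIPE_FIELDS if field not in recipe]
    for index, ingredient in enumerate(recipe.get("ingredients", [])):
        for field in INGREDIENT_FIELDS:
            if field not in ingredient:
                missing.append(f"ingredients[{index}].{field}")
    return missing


example = {
    "name": "Veggie Chilli",
    "recipe_url": "https://example.com/veggie-chilli",  # placeholder URL
    "ingredients": [{"name": "kidney beans", "quantity": "400"}],
    "instructions": "Simmer everything for 30 minutes.",
}
print(missing_recipe_fields(example))
```

An empty list means every expected key is present. It says nothing about whether the values are sensible.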

Next, let's write few-shot instructions describing how the recipe should be transcribed from the JSON content. For this example, we provide a general format followed by two example recipes.

In [None]:
recipe_format = """

The format to write the recipe is the following:

# [Name of Recipe](URL of Recipe)

## Ingredients

* Ingredient 1, cutting instructions (if any)

## Method

Recipe instructions

Below are some example recipes, each one separated by a horizontal line.

----

# [Chicken Teriyaki](https://www.bbcgoodfood.com/recipes/chicken-teriyaki)

## Ingredients

* 600g Boneless Chicken Thighs
* 6 tbsp Clear Honey
* 6 tbsp Soy Sauce (Blue Dragon)
* 6 tbsp Toasted Sesame Seed Oil
* 1 tbsp Tabasco Sauce
* Butter for sauce
* 200g Basmati Rice to serve

## Method

Cut the chicken thighs into strips.

Mix the honey, soy sauce, sesame seed oil, and tabasco in a bowl, and put the sliced chicken in. Marinate for at least 10 minutes.

Get a wok hot (if you drop a little bit of water onto it, it should turn into a sphere).

Using a sieve, drain the sauce from the marinated chicken into another pan.

Bring the marinade sauce to the boil for a few minutes, while you stir fry the chicken for 5 minutes.

Whisk some butter into the boiled marinade sauce.

----

# [Prawn and Chorizo Rice Recipe](https://www.bbcgoodfood.com/recipes/prawn-chorizo-rice)

## Ingredients

* ½ tbsp olive oil
* 90g Cooks’ Ingredients Diced Chorizo
* 1/2 large onion, thinly sliced
* 1 stick celery, halved lengthways then thinly sliced
* 1 red pepper, deseeded and thinly sliced
* 1 clove garlic, crushed
* 1½ tsp Cajun spice
* 120g Arborio rice
* 250ml passata
* 250ml vegetable stock
* 180g packs raw extra large king prawns
* Parsley, chopped

## Method

Heat the oil in a casserole dish or sauté pan over a medium heat. Add the chorizo and fry for 2 minutes, until lightly browned. Remove to a plate with a slotted spoon and leave the oil in the pan.

Add the onion, celery and peppers and cook for 5-7 minutes until soft. Add the garlic and Cajun seasoning. Cook for 30 seconds before adding the rice, passata and stock. Simmer for 20 minutes, stirring occasionally so the base doesn’t stick.

Stir the chorizo back into the pan with the prawns and cook for another 4-5 minutes until piping hot and the rice is tender. Divide between 4 plates and scatter with the parsley.

"""

recipe_output_prompt = PromptTemplate(
    template="Write the recipe in the given format. \n{format_instructions}\n{recipe}\n{recipe_url}\n",
    input_variables=["recipe", "recipe_url"],
    partial_variables={"format_instructions": recipe_format},
)
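
To make the target format concrete, the layout described in `recipe_format` can be sketched as a plain function over the extracted JSON. This only illustrates the expected output shape and is not a replacement for the model, which also handles rewording. `render_recipe_markdown` is a hypothetical helper, and the example data and URL are made up.

```python
def render_recipe_markdown(recipe: dict) -> str:
    """Render a recipe dictionary in the Markdown layout described above."""
    lines = [f"# [{recipe['name']}]({recipe['recipe_url']})", "", "## Ingredients", ""]
    for item in recipe["ingredients"]:
        # Join quantity, unit and name, skipping any empty or missing parts.
        parts = [item.get("quantity"), item.get("unit"), item.get("name")]
        text = " ".join(part for part in parts if part)
        if item.get("directions"):
            text += f", {item['directions']}"
        lines.append(f"* {text}")
    lines += ["", "## Method", "", recipe["instructions"]]
    return "\n".join(lines)


example = {
    "name": "Veggie Chilli",
    "recipe_url": "https://example.com/veggie-chilli",  # placeholder URL
    "ingredients": [
        {"name": "onion", "quantity": "1", "unit": "", "directions": "finely chopped"},
        {"name": "kidney beans", "quantity": "400", "unit": "g", "directions": None},
    ],
    "instructions": "Simmer everything for 30 minutes.",
}
print(render_recipe_markdown(example))
```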

Finally, let's set up our individual chains and an `extract_and_transcribe_recipe` function.

The first chain extracts the recipe into the JSON format specified above, while the second transcribes the recipe into the Markdown format.

In [None]:
recipe_extractor = recipe_extractor_prompt | model | parser
recipe_writer = recipe_output_prompt | model | StrOutputParser()

def extract_and_transcribe_recipe(url):
    print(f"Extracting recipe from {url}")
    print("Loading source")
    url_source = load_url(url)

    print("Converting source into JSON format")
    recipe_json = recipe_extractor.invoke({
        "recipe": url_source,
        "recipe_url": url,
    })
    
    print("Writing recipe in Markdown format")
    return recipe_writer.invoke({
        "recipe": json.dumps(recipe_json),
        "recipe_url": url,
    })


To give a sense of how generic this approach is, we will go through a list of URLs from different sites, extracting each recipe into the preferred Markdown format.

In [None]:
recipe_urls = [
    # Gordon Ramsay
    "https://www.gordonramsay.com/gr/recipes/pan-seared-scallops-with-butternut-squash-puree-and-pomegranate-quince-slaw/",
    # Jamie Oliver
    "https://www.jamieoliver.com/recipes/vegetable-recipes/veggie-chilli/",
    # BBC Good Food
    "https://www.bbcgoodfood.com/recipes/spiced-pumpkin-soup-2",
    # The Pioneer Woman
    "https://www.thepioneerwoman.com/food-cooking/recipes/a85701/how-to-make-chocolate-pudding/",
]

In [None]:
recipes = [extract_and_transcribe_recipe(url) for url in recipe_urls]
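
Page loads and model calls can fail, and with the list comprehension above a single exception stops the whole run. Below is a minimal sketch of a more forgiving loop. `transcribe_all` is a hypothetical helper, and a stub stands in for `extract_and_transcribe_recipe` so the example runs on its own.

```python
def transcribe_all(urls, transcribe):
    """Apply `transcribe` to each URL, collecting failures instead of stopping."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = transcribe(url)
        except Exception as error:  # broad on purpose: keep the batch going
            failures[url] = repr(error)
    return results, failures


def fake_transcribe(url):
    # Stand-in for extract_and_transcribe_recipe; fails for one URL.
    if "broken" in url:
        raise ValueError("could not load page")
    return f"# [Recipe]({url})"


results, failures = transcribe_all(
    ["https://example.com/ok", "https://example.com/broken"], fake_transcribe
)
print(results)
print(failures)
```

In the notebook this would be called as `transcribe_all(recipe_urls, extract_and_transcribe_recipe)`, with the failed URLs available for a retry.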

In [None]:
print("\n\n".join(recipes))