Pydantic validation of an LLM response, to make sure it is valid JSON with the right data types.

In [1]:
!pip install --upgrade fireworks-ai



In [2]:

from pydantic import BaseModel
import fireworks.client



In [3]:
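import os

# Read the key from the environment instead of hard-coding a secret in the notebook.
# FIREWORKS_API_KEY is an assumed variable name; set it before running this cell.
fireworks.client.api_key = os.environ["FIREWORKS_API_KEY"]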

Web scraping https://en.wikipedia.org/wiki/List_of_2023_box_office_number-one_films_in_the_United_States .
Only the data from the table is extracted.



In [4]:
import requests
from bs4 import BeautifulSoup

def scrape_box_office(url):
    """Return the rows of the first 'wikitable' on the page as lists of cell text."""
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        # Fail loudly instead of returning None, which would break the loop below.
        raise RuntimeError(f"Unable to fetch data. Status code: {response.status_code}")

    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable'})
    box_office_data = []
    for row in table.find_all('tr')[1:]:  # skip the header row
        columns = row.find_all(['th', 'td'])
        data = [column.get_text(strip=True) for column in columns]
        box_office_data.append(data)

    return box_office_data

url = 'https://en.wikipedia.org/wiki/List_of_2023_box_office_number-one_films_in_the_United_States'

result = scrape_box_office(url)
# Drop the first cell of each row (the row number) and join the rest with spaces.
string = ""
for row in result:
    for cell in row[1:]:
        string += cell + " "
    string += '\n'


print(string)

January 8, 2023 Avatar: The Way of Water $45,838,986 Black Panther: Wakanda ForeverandAvatar: The Way of Waterbecame the first two films to consecutively top the box office for four consecutive weekends each sinceThe Hunger Games: Mockingjay – Part 2andStar Wars: The Force Awakensin 2015 and 2016. [2] 
January 15, 2023 $32,824,684 Black Panther: Wakanda ForeverandAvatar: The Way of Waterbecame the first two films to consecutively top the box office for five consecutive weekends each sinceStakeoutandFatal Attractionin 1987. [3] 
January 22, 2023 $20,133,106 Avatar: The Way of Waterbecame the first film sinceAvatarto top the box office for six consecutive weekends. It also became the first film sinceSpider-Man: No Way Hometo top the box office in its sixth week of release, as well as for six total weekends. During the weekend, the film surpassed $2 billion at the global box office. [4] 
January 29, 2023 $15,968,532 Avatar: The Way of Waterbecame the first film sinceAvatarto top the box o
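
The rows printed above have a regular shape: a date, an optional film title, then a dollar amount. A deterministic parse of that shape could serve as a cross-check on whatever the model returns later. The following is only a sketch: `parse_row` and `ROW_PATTERN` are hypothetical names, and the pattern is inferred from the few rows visible above, so it would fail on a title that itself contains a dollar amount.

```python
import re

ROW_PATTERN = re.compile(
    r"^(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}) "  # e.g. "January 8, 2023"
    r"(?P<film>.*?)\s*"                         # title, empty on some rows
    r"\$(?P<gross>\d[\d,]*)"                    # e.g. "$45,838,986"
)


def parse_row(line):
    """Return date, film (or None) and gross as an int, or None if the line does not match."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    return {
        "date": match["date"],
        "film": match["film"] or None,
        "gross": int(match["gross"].replace(",", "")),
    }


first = parse_row("January 8, 2023 Avatar: The Way of Water $45,838,986 Black Panther: Wakanda Forever")
second = parse_row("January 15, 2023 $32,824,684 Black Panther: Wakanda Forever")
print(first)
print(second)
```

Rows without a title come back with `film` set to `None`, which makes the missing titles visible before any prompt is sent.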

Specifying a pydantic schema

In [5]:
from pydantic import BaseModel, ValidationError
import json
class WebScrape1(BaseModel):
    date: str
    film: str
    gross: int

In [6]:
response_schema_dict = WebScrape1.model_json_schema()
response_schema_json = json.dumps(response_schema_dict)
response_schema_json

'{"properties": {"date": {"title": "Date", "type": "string"}, "film": {"title": "Film", "type": "string"}, "gross": {"title": "Gross", "type": "integer"}}, "required": ["date", "film", "gross"], "title": "WebScrape1", "type": "object"}'
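
To see what this schema accepts and rejects, here is a minimal sketch, assuming Pydantic v2 (the version that provides `model_json_schema`). The helper `check` is hypothetical, and the sample records are illustrative.

```python
from pydantic import BaseModel, ValidationError


class WebScrape1(BaseModel):
    date: str
    film: str
    gross: int


def check(record):
    """Return (True, model) if the record validates, else (False, list of error types)."""
    try:
        return True, WebScrape1.model_validate(record)
    except ValidationError as e:
        return False, [err["type"] for err in e.errors()]


good = {"date": "January 8, 2023", "film": "Avatar: The Way of Water", "gross": 45838986}
no_film = {"date": "February 26, 2023", "film": None, "gross": 31964803}
raw_gross = {"date": "January 8, 2023", "film": "Avatar: The Way of Water", "gross": "$45,838,986"}

ok_good, model = check(good)
ok_no_film, no_film_errors = check(no_film)
ok_raw, raw_errors = check(raw_gross)
print(ok_good, ok_no_film, no_film_errors, ok_raw, raw_errors)
```

A null `film` and an unconverted `gross` string are both rejected, so the model has to return the title as a string and the amount as a plain integer.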

Creating a prompt

In [7]:
question = "1 January 8, 2023 Avatar: The Way of Water $45,838,986 Black Panther: Wakanda ForeverandAvatar: The Way of Waterbecame the first two films to consecutively top the box office for four consecutive weekends each sinceThe Hunger Games: Mockingjay – Part 2andStar Wars: The Force Awakensin 2015 and 2016. [2] "
prompt = f"""I will provide you with a unstructured data and you have to convert it into the following format and Do not include the provided properties json in the output:
```json
{response_schema_dict}
```
Make sure you only have the required json as output and no other text.

Data: {string}

only generate extracted data json.
The provided format should be strictly followed.

"""

In [8]:
prompt

"I will provide you with a unstructured data and you have to convert it into the following format and Do not include the provided properties json in the output:\n```json\n{'properties': {'date': {'title': 'Date', 'type': 'string'}, 'film': {'title': 'Film', 'type': 'string'}, 'gross': {'title': 'Gross', 'type': 'integer'}}, 'required': ['date', 'film', 'gross'], 'title': 'WebScrape1', 'type': 'object'}\n```\nMake sure you only have the required json as output and no other text.\n\nData: January 8, 2023 Avatar: The Way of Water $45,838,986 Black Panther: Wakanda ForeverandAvatar: The Way of Waterbecame the first two films to consecutively top the box office for four consecutive weekends each sinceThe Hunger Games: Mockingjay – Part 2andStar Wars: The Force Awakensin 2015 and 2016. [2] \nJanuary 15, 2023 $32,824,684 Black Panther: Wakanda ForeverandAvatar: The Way of Waterbecame the first two films to consecutively top the box office for five consecutive weekends each sinceStakeoutandFat

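The prompt printed above shows the schema with single quotes, because the f-string interpolates `response_schema_dict` (a Python dict) rather than `response_schema_json`. Whether this affects the model's output was not tested here. The sketch below only illustrates the difference between the two renderings, using a trimmed stand-in for the schema; `is_valid_json` is a hypothetical helper.

```python
import json

# Stand-in for WebScrape1.model_json_schema(), trimmed for brevity.
schema_dict = {
    "properties": {"gross": {"title": "Gross", "type": "integer"}},
    "required": ["gross"],
}

as_repr = f"{schema_dict}"         # Python repr of the dict: single quotes
as_json = json.dumps(schema_dict)  # JSON text: double quotes


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(as_repr))
print(is_valid_json(as_json))
```

Interpolating `response_schema_json` into the prompt would show the model a schema that is itself valid JSON.
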
Creating functions for text generation and error handling

In [9]:
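def askLLM(prompt):
    """Send a single user message to the model and return the text of its reply."""
    completion = fireworks.client.ChatCompletion.create(
        model="accounts/fireworks/models/mixtral-8x7b-instruct",
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        n=1,
        max_tokens=4000,
        temperature=0.1,
        top_p=0.9,
    )
    return completion.choices[0].message.content


def handle_validation_failure(error, prompt, num_tries):
    """Re-ask the model, prepending the error message to the original prompt."""
    if num_tries > 3:
        return "Couldn't validate LLM response"
    # Separate the pieces with newlines so they do not run together in the prompt.
    out = askLLM(error + "\n" + prompt + "\nMake sure the output is json")
    print(out)
    return out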
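
The two functions above amount to "ask, and ask again with the error if the reply is unusable". A minimal offline sketch of that pattern as a single bounded loop follows. `ask_for_json` and `fake_llm` are hypothetical; `fake_llm` stands in for `askLLM` so the example runs without an API key.

```python
import json


def ask_for_json(ask, prompt, max_tries=3):
    """Call ask(prompt) until the reply parses as JSON, at most max_tries times."""
    current_prompt = prompt
    for attempt in range(1, max_tries + 1):
        reply = ask(current_prompt)
        try:
            return json.loads(reply), attempt
        except json.JSONDecodeError as e:
            # Feed the parser error back to the model on the next attempt.
            current_prompt = (
                f"Output was not valid JSON ({e.msg}).\n{prompt}\n"
                "Make sure the output is JSON."
            )
    raise ValueError(f"No valid JSON after {max_tries} tries")


# Hypothetical stub: the first reply is chatty text, the second is valid JSON.
replies = iter([
    "Sure! Here is the data:",
    '[{"date": "January 8, 2023", "film": "Avatar: The Way of Water", "gross": 45838986}]',
])


def fake_llm(prompt):
    return next(replies)


data, attempts = ask_for_json(fake_llm, "Extract the data.")
print(attempts, data[0]["film"])
```

Raising after the last attempt, rather than returning an error string, keeps a failed run from being mistaken for model output.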


Providing initial prompt and viewing the results

In [10]:
completion = askLLM(prompt)

In [11]:
completion

'[\n{\n"properties": {\n"date": {\n"title": "Date",\n"type": "string"\n},\n"film": {\n"title": "Film",\n"type": "string"\n},\n"gross": {\n"title": "Gross",\n"type": "integer"\n}\n},\n"required": [\n"date",\n"film",\n"gross"\n],\n"type": "object",\n"title": "WebScrape1",\n"data": [\n{\n"date": "January 8, 2023",\n"film": "Avatar: The Way of Water",\n"gross": 45838986\n},\n{\n"date": "January 15, 2023",\n"film": "Black Panther: Wakanda Forever",\n"gross": 32824684\n},\n{\n"date": "January 15, 2023",\n"film": "Avatar: The Way of Water",\n"gross": 32824684\n},\n{\n"date": "January 22, 2023",\n"film": "Avatar: The Way of Water",\n"gross": 20133106\n},\n{\n"date": "January 29, 2023",\n"film": "Avatar: The Way of Water",\n"gross": 15968532\n},\n{\n"date": "February 5, 2023",\n"film": "Knock at the Cabin",\n"gross": 14127170\n},\n{\n"date": "February 12, 2023",\n"film": "Magic Mike\'s Last Dance",\n"gross": 8305317\n},\n{\n"date": "February 19, 2023",\n"film": "Ant-Man and the Wasp: Quantumani

In [12]:
max_tries = 0

In [13]:
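while max_tries < 3:
    try:
        json_objects = json.loads(completion)
        break
    except json.JSONDecodeError:
        # The reply was not valid JSON: ask again, telling the model what went wrong.
        completion = handle_validation_failure("Output Not json", prompt, max_tries)
        max_tries += 1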


[
{"date": "January 8, 2023", "film": "Avatar: The Way of Water", "gross": 45838986},
{"date": "January 15, 2023", "film": "Black Panther: Wakanda Forever", "gross": 32824684},
{"date": "January 22, 2023", "film": "Avatar: The Way of Water", "gross": 20133106},
{"date": "January 29, 2023", "film": "Avatar: The Way of Water", "gross": 15968532},
{"date": "February 5, 2023", "film": "Knock at the Cabin", "gross": 14127170},
{"date": "February 12, 2023", "film": "Magic Mike's Last Dance", "gross": 8305317},
{"date": "February 19, 2023", "film": "Ant-Man and the Wasp: Quantumania", "gross": 106109650},
{"date": "February 26, 2023", "film": null, "gross": 31964803},
{"date": "March 5, 2023", "film": "Creed III", "gross": 58370007},
{"date": "March 12, 2023", "film": "Scream VI", "gross": 44447270},
{"date": "March 19, 2023", "film": "Shazam! Fury of the Gods", "gross": 30111158},
{"date": "March 26, 2023", "film": "John Wick: Chapter 4", "gross": 73817950},
{"date": "April 2, 2023", "film":
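
A retry costs a full generation. When the reply contains a valid JSON array surrounded by extra text, it may be possible to recover the array without asking again. The sketch below uses `json.JSONDecoder.raw_decode` from the standard library; `extract_json_array` is a hypothetical helper and the sample reply is made up for illustration.

```python
import json


def extract_json_array(text):
    """Return the first JSON array found in text, ignoring any text around it."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
            if isinstance(value, list):
                return value
        except json.JSONDecodeError:
            pass  # this "[" did not start a valid array; try the next one
        start = text.find("[", start + 1)
    raise ValueError("no JSON array found")


reply = (
    "Here is the JSON:\n"
    '[{"date": "March 5, 2023", "film": "Creed III", "gross": 58370007}]\n'
    "Let me know if you need more."
)
records = extract_json_array(reply)
print(records[0]["film"])
```

This only handles surrounding text. A reply whose array is itself malformed or truncated still needs a retry.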

Pydantic validation of each extracted JSON object individually. This avoids regenerating the whole response every time a single object has an error. Validation errors are collected and displayed to the user.

In [16]:
failed = []
validated = []
for obj in json_objects:
    try:
        # model_validate / model_dump are the Pydantic v2 replacements
        # for the deprecated parse_obj / dict methods.
        validated_data = WebScrape1.model_validate(obj)
        print("Validation successful!")
        validated.append(obj)
        print("Validated data:", validated_data.model_dump())
    except ValidationError as e:
        print("Validation failed!")
        failed.append(obj)
        print(e.json())

print(failed)

Validation successful!
Validated data: {'date': 'January 8, 2023', 'film': 'Avatar: The Way of Water', 'gross': 45838986}
Validation successful!
Validated data: {'date': 'January 15, 2023', 'film': 'Black Panther: Wakanda Forever', 'gross': 32824684}
Validation successful!
Validated data: {'date': 'January 22, 2023', 'film': 'Avatar: The Way of Water', 'gross': 20133106}
Validation successful!
Validated data: {'date': 'January 29, 2023', 'film': 'Avatar: The Way of Water', 'gross': 15968532}
Validation successful!
Validated data: {'date': 'February 5, 2023', 'film': 'Knock at the Cabin', 'gross': 14127170}
Validation successful!
Validated data: {'date': 'February 12, 2023', 'film': "Magic Mike's Last Dance", 'gross': 8305317}
Validation successful!
Validated data: {'date': 'February 19, 2023', 'film': 'Ant-Man and the Wasp: Quantumania', 'gross': 106109650}
Validation failed!
[{"type":"string_type","loc":["film"],"msg":"Input should be a valid string","input":null,"url":"https://errors


Displaying the records that failed validation

In [17]:
for i in failed:
  print(i)

{'date': 'February 26, 2023', 'film': None, 'gross': 31964803}
{'date': 'May 14, 2023', 'film': None, 'gross': 62008548}
{'date': 'August 6, 2023', 'film': None, 'gross': 53008647}
{'date': 'September 17, 2023', 'film': None, 'gross': 14534579}
{'date': 'September 24, 2023', 'film': None, 'gross': 8550110}
{'date': 'November 5, 2023', 'film': None, 'gross': 37205784}
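
Every failed record has `film` set to null. In the scraped text, some rows omit the title (the January 15 row, for example), which suggests that the title cell in the source table spans several rows. Under that assumption, a record with a missing title could be repaired by copying the title from the preceding record instead of asking the model again. This is a sketch only: `fill_missing_films` is hypothetical, it must run over the full list in table order (not over `failed` alone), and the assumption should be checked against the table before relying on it.

```python
def fill_missing_films(records):
    """Return copies of records with a null 'film' replaced by the previous title."""
    filled = []
    last_film = None
    for record in records:
        record = dict(record)  # do not mutate the caller's data
        if record.get("film") is None:
            record["film"] = last_film  # stays None if there is no previous title
        else:
            last_film = record["film"]
        filled.append(record)
    return filled


records = [
    {"date": "February 19, 2023", "film": "Ant-Man and the Wasp: Quantumania", "gross": 106109650},
    {"date": "February 26, 2023", "film": None, "gross": 31964803},
    {"date": "March 5, 2023", "film": "Creed III", "gross": 58370007},
]
repaired = fill_missing_films(records)
print(repaired[1]["film"])
```

The repaired records can then be passed through the same Pydantic validation as before.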


In [18]:
for i in validated:
  print(i)

{'date': 'January 8, 2023', 'film': 'Avatar: The Way of Water', 'gross': 45838986}
{'date': 'January 15, 2023', 'film': 'Black Panther: Wakanda Forever', 'gross': 32824684}
{'date': 'January 22, 2023', 'film': 'Avatar: The Way of Water', 'gross': 20133106}
{'date': 'January 29, 2023', 'film': 'Avatar: The Way of Water', 'gross': 15968532}
{'date': 'February 5, 2023', 'film': 'Knock at the Cabin', 'gross': 14127170}
{'date': 'February 12, 2023', 'film': "Magic Mike's Last Dance", 'gross': 8305317}
{'date': 'February 19, 2023', 'film': 'Ant-Man and the Wasp: Quantumania', 'gross': 106109650}
{'date': 'March 5, 2023', 'film': 'Creed III', 'gross': 58370007}
{'date': 'March 12, 2023', 'film': 'Scream VI', 'gross': 44447270}
{'date': 'March 19, 2023', 'film': 'Shazam! Fury of the Gods', 'gross': 30111158}
{'date': 'March 26, 2023', 'film': 'John Wick: Chapter 4', 'gross': 73817950}
{'date': 'April 2, 2023', 'film': 'Dungeons & Dragons: Honor Among Thieves', 'gross': 37205784}
{'date': 'Apri