## Notebook 1 - Synthetic Data Generation

Initial Imports

In [42]:
from langchain_openai.chat_models import ChatOpenAI
from pathlib import Path
from dotenv import load_dotenv
import os
import json
import pandas as pd

# Load environment variables from a local .env file
load_dotenv()
# Alternative approach: read the values explicitly
# base_url_voc = os.getenv("OPENAI_BASE")
# api_key_voc = os.getenv("OPENAI_API")

True

Initialisation of OpenAI LLM

In [21]:
completion_model_name = "gpt-4.1"
temperature = 0.0

completion_llm = ChatOpenAI(
    temperature=temperature,
    model=completion_model_name,
    max_tokens=32768,
    max_retries=1,
)

User Prompt

In [22]:
user_prompt = """Your task is to generate real estate listings containing facts about each property.
Specifically, I want you to generate 50 listings with the following properties:
> Neighborhood (string value)
> Price (integer number - in euro)
> Bedrooms (integer number)
> House Size (integer number - in square meters)
> Description (string value - text description of the house)
> Neighborhood Description (string value - text description of the neighborhood)

--Start of Example--
Neighborhood: Green Oaks
Price: 100,000 euro
Bedrooms: 3
House Size: 180 square meters

Description: Welcome to this eco-friendly oasis nestled in the heart of Green Oaks. This charming 3-bedroom, 2-bathroom home boasts energy-efficient features such as solar panels and a well-insulated structure. Natural light floods the living spaces, highlighting the beautiful hardwood floors and eco-conscious finishes. The open-concept kitchen and dining area lead to a spacious backyard with a vegetable garden, perfect for the eco-conscious family. Embrace sustainable living without compromising on style in this Green Oaks gem.

Neighborhood Description: Green Oaks is a close-knit, environmentally-conscious community with access to organic grocery stores, community gardens, and bike paths. Take a stroll through the nearby Green Oaks Park or grab a cup of coffee at the cozy Green Bean Cafe. With easy access to public transportation and bike lanes, commuting is a breeze.
--End of Example--

Output should be a JSON file with the following structure:
{
    House1:
    {
        Neighborhood: Value1,
        Price: Value2,
        Bedrooms: Value3,
        House Size: Value4,
        Description: Value5,
        Neighborhood Description: Value6 },
    House2:
    {
        Neighborhood: Value1,
        Price: Value2,
        Bedrooms: Value3,
        House Size: Value4,
        Description: Value5,
        Neighborhood Description: Value6 },
...}

Ensure that the JSON file is valid and follows the structure above.
"""

Invoke LLM and print output

In [23]:
llm_output = completion_llm.invoke(user_prompt)

In [None]:
print(llm_output.content.split('```json')[1].split('```')[0])
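Splitting on the fence markers works when the reply contains exactly one fenced json block, but it raises an IndexError if the model answers with bare JSON or a differently labelled fence. The following is a minimal sketch of a more tolerant extraction, not the method used in this notebook; the helper name extract_json and the FENCE constant are assumptions introduced for illustration.

```python
import json
import re

# Three backticks, built indirectly so this example is easy to embed in other text.
FENCE = "`" * 3

# Matches an optional "json" label after the opening fence and captures the body.
_FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL,
)

def extract_json(text: str):
    """Parse the JSON object in an LLM reply, fenced or not (hypothetical helper)."""
    match = _FENCED_BLOCK.search(text)
    # Use the fenced body when there is one, otherwise the whole reply.
    candidate = match.group(1) if match else text
    # Trim any surrounding chatter by keeping only the outermost braces.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in model output")
    return json.loads(candidate[start:end + 1])
```

With such a helper, the later cells could call extract_json(llm_output.content) once and reuse the result for both printing and saving.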

Save and Load JSON Data

In [38]:
# Parse the JSON payload out of the fenced block in the reply, then save it
json_data = json.loads(llm_output.content.split('```json')[1].split('```')[0])
Path("../data").mkdir(parents=True, exist_ok=True)
with open('../data/real_estate_listings.json', 'w') as f:
    json.dump(json_data, f, indent=4)

# Load and verify
with open('../data/real_estate_listings.json', 'r') as f:
    loaded_data = json.load(f)
print("Successfully saved and loaded! Number of houses:", len(loaded_data))


Successfully saved and loaded! Number of houses: 50
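The count confirms that 50 entries were saved, but not that each entry has the six requested fields with the requested types. Since the example listing in the prompt writes the price as "100,000 euro", a model could plausibly return strings where integers were asked for. Below is a minimal sketch of a schema check under that assumption; EXPECTED_TYPES and find_schema_problems are hypothetical names, not part of the original notebook.

```python
# Field names and types as requested in the prompt.
EXPECTED_TYPES = {
    "Neighborhood": str,
    "Price": int,
    "Bedrooms": int,
    "House Size": int,
    "Description": str,
    "Neighborhood Description": str,
}

def find_schema_problems(listings: dict) -> list:
    """Return human-readable problems; an empty list means every listing conforms."""
    problems = []
    for house_id, house in listings.items():
        if not isinstance(house, dict):
            problems.append(f"{house_id}: expected an object, got {type(house).__name__}")
            continue
        for field, expected in EXPECTED_TYPES.items():
            if field not in house:
                problems.append(f"{house_id}: missing '{field}'")
                continue
            value = house[field]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(
                    f"{house_id}: '{field}' should be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems
```

Running find_schema_problems(loaded_data) right after loading would surface malformed listings before they are formatted into text.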


Load data into a dataframe (tabular format)

In [43]:
# Each house becomes a column and each field a row
pd.DataFrame(loaded_data)

,House1,House2,House3,House4,House5,House6,House7,House8,House9,House10,...,House41,House42,House43,House44,House45,House46,House47,House48,House49,House50
Neighborhood,Riverside Heights,Old Town,Sunnydale,City Center,Maple Grove,Harbor View,Willow Park,Lakeside Estates,Elm Street,Sunset Hills,...,Magnolia Place,Cloverfield,Blueberry Hill,Redwood Estates,Sunflower Meadows,Pebble Creek,Orchard Park,Crystal Lake,Lavender Fields,Willow Creek
Price,320000,185000,270000,450000,210000,390000,160000,500000,130000,350000,...,230000,210000,250000,470000,260000,220000,280000,350000,230000,210000
Bedrooms,4,2,3,2,3,4,2,5,2,4,...,3,3,3,5,3,3,3,4,3,3
House Size,210,95,140,110,125,200,90,260,80,190,...,120,115,120,240,130,120,140,190,120,115
Description,Spacious 4-bedroom family home with a modern k...,Charming 2-bedroom townhouse with exposed bric...,Modern 3-bedroom home featuring an open-concep...,Luxury 2-bedroom apartment in the heart of the...,Cozy 3-bedroom bungalow with a spacious living...,Elegant 4-bedroom home with stunning harbor vi...,Bright and airy 2-bedroom apartment with a bal...,Stunning 5-bedroom villa with private lake acc...,Affordable 2-bedroom starter home with a renov...,Beautiful 4-bedroom home with a spacious open-...,...,"Charming 3-bedroom home with magnolia trees, u...",Cozy 3-bedroom home with a clover-filled backy...,"Charming 3-bedroom home with blueberry bushes,...",Elegant 5-bedroom estate surrounded by redwood...,Beautiful 3-bedroom home with a sunflower gard...,Well-kept 3-bedroom home with a pebble creek r...,Modern 3-bedroom home with an open-concept lay...,"Beautiful 4-bedroom home with lake views, a mo...","Charming 3-bedroom home with lavender gardens,...",Cozy 3-bedroom home with a willow tree in the ...
Neighborhood Description,Riverside Heights is known for its scenic rive...,"Old Town boasts historic architecture, cobbles...",Sunnydale is a peaceful suburb with excellent ...,"City Center is bustling with energy, offering ...","Maple Grove is a quiet, family-oriented neighb...",Harbor View offers a relaxed coastal lifestyle...,"Willow Park is a green oasis in the city, feat...",Lakeside Estates is an exclusive gated communi...,Elm Street is a friendly neighborhood with loc...,"Sunset Hills is known for its rolling hills, p...",...,Magnolia Place is a family-friendly neighborho...,Cloverfield is a quiet neighborhood with natur...,Blueberry Hill is a peaceful neighborhood with...,Redwood Estates is an upscale area known for i...,Sunflower Meadows is known for its beautiful g...,Pebble Creek is a family-friendly neighborhood...,Orchard Park is a peaceful suburb with excelle...,"Crystal Lake offers scenic lake views, walking...",Lavender Fields is a family-friendly neighborh...,Willow Creek is a quiet neighborhood with natu...


Reformat data into a single column named 'text'

In [44]:
# Create formatted strings for each house
formatted_listings = []
for house_id, house_data in loaded_data.items():
    formatted_text = f"""Neighborhood: {house_data['Neighborhood']}
Price: {house_data['Price']} euro
Bedrooms: {house_data['Bedrooms']}
House Size: {house_data['House Size']} square meters

Description: {house_data['Description']}

Neighborhood Description: {house_data['Neighborhood Description']}"""
    formatted_listings.append(formatted_text)

# Create dataframe with text column
df = pd.DataFrame({'text': formatted_listings})
df


,text
0,Neighborhood: Riverside Heights\nPrice: 320000...
1,Neighborhood: Old Town\nPrice: 185000 euro\nBe...
2,Neighborhood: Sunnydale\nPrice: 270000 euro\nB...
3,Neighborhood: City Center\nPrice: 450000 euro\...
4,Neighborhood: Maple Grove\nPrice: 210000 euro\...
5,Neighborhood: Harbor View\nPrice: 390000 euro\...
6,Neighborhood: Willow Park\nPrice: 160000 euro\...
7,Neighborhood: Lakeside Estates\nPrice: 500000 ...
8,Neighborhood: Elm Street\nPrice: 130000 euro\n...
9,Neighborhood: Sunset Hills\nPrice: 350000 euro...


Save formatted data to CSV

In [46]:
df.to_csv('../data/real_estate_listings_formatted.csv')
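Because to_csv is called without index=False, the file begins with an unnamed index column, and each text value spans several lines inside one quoted field. The sketch below shows how such a file can be read back with the standard csv module so that the multi-line listings survive intact. The two-row sample and the helper read_listing_texts are assumptions for illustration; the sample only mimics the layout of the saved file.

```python
import csv
import io

# Hypothetical sample mimicking the saved layout: empty index header, quoted multi-line text.
sample_csv = (
    ',text\n'
    '0,"Neighborhood: Old Town\nPrice: 185000 euro"\n'
    '1,"Neighborhood: Sunnydale\nPrice: 270000 euro"\n'
)

def read_listing_texts(handle) -> list:
    """Return the 'text' column from an open CSV handle, ignoring the index column."""
    reader = csv.reader(handle)
    header = next(reader)
    text_idx = header.index("text")
    # The csv reader keeps newlines that appear inside quoted fields.
    return [row[text_idx] for row in reader]

texts = read_listing_texts(io.StringIO(sample_csv))
print(len(texts), "listings read")
```

For a real file, the csv documentation recommends opening it with newline="" so that embedded newlines are passed through unchanged.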