# ML vs LLM Comparison: Real Estate Price Prediction

This notebook compares traditional machine learning approaches with Large Language Model (LLM) few-shot learning for house price prediction on the Indian real estate dataset.

## Approaches Compared:

### Traditional ML:
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor
- SVM Regressor

### LLM Approaches:
- Zero-shot learning (no examples)
- One-shot learning (1 example)
- Few-shot learning (5, 10, 20 examples)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb
import openai
import json
import time
from typing import List, Dict, Any
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

In [None]:
# Load the dataset
data = pd.read_csv("data/House Price India.csv")
print(f"Dataset shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print(f"\nFirst few rows:")
data.head()

In [None]:
# Basic statistics and data exploration
print("Dataset Info:")
data.info()  # info() prints its report directly and returns None
print("\nPrice Statistics:")
print(data['Price'].describe())
print(f"\nPrice range: ₹{data['Price'].min():,.0f} to ₹{data['Price'].max():,.0f}")

In [None]:
# Visualize price distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Price distribution
axes[0].hist(data['Price'], bins=50, alpha=0.7, edgecolor='black')
axes[0].set_title('Price Distribution')
axes[0].set_xlabel('Price (₹)')
axes[0].set_ylabel('Frequency')

# Log price distribution
axes[1].hist(np.log(data['Price']), bins=50, alpha=0.7, edgecolor='black')
axes[1].set_title('Log Price Distribution')
axes[1].set_xlabel('Log Price')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Data preprocessing for ML models
def preprocess_data(data):
    # Remove ID and Date columns
    features_to_drop = ['id', 'Date']
    X = data.drop(features_to_drop + ['Price'], axis=1)
    y = data['Price']
    
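    # Handle any missing values with per-column medians.
    # Note: the medians are computed on the full dataset, before the train/test
    # split; for a strict evaluation, compute them on the training split only.
    X = X.fillna(X.median(numeric_only=True))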
    
    return X, y

X, y = preprocess_data(data)
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures: {list(X.columns)}")

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features for algorithms that need it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## Traditional Machine Learning Models

In [None]:
# Train and evaluate ML models
ml_models = {}
ml_results = {}

# Linear Regression
print("Training Linear Regression...")
ml_models['Linear Regression'] = LinearRegression()
ml_models['Linear Regression'].fit(X_train_scaled, y_train)

# Random Forest
print("Training Random Forest...")
ml_models['Random Forest'] = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
ml_models['Random Forest'].fit(X_train, y_train)

# XGBoost
print("Training XGBoost...")
ml_models['XGBoost'] = xgb.XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1)
ml_models['XGBoost'].fit(X_train, y_train)

# SVM Regressor
print("Training SVM...")
ml_models['SVM'] = SVR(kernel='rbf', C=1000, gamma=0.001)
ml_models['SVM'].fit(X_train_scaled, y_train)

print("All models trained successfully!")

In [None]:
# Evaluate ML models
def evaluate_model(name, model, X_test_data, y_true):
    y_pred = model.predict(X_test_data)
    
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    return {
        'MAE': mae,
        'MSE': mse,
        'RMSE': rmse,
        'R2': r2,
        'predictions': y_pred
    }

# Evaluate all ML models
for name, model in ml_models.items():
    if name in ['Linear Regression', 'SVM']:
        results = evaluate_model(name, model, X_test_scaled, y_test)
    else:
        results = evaluate_model(name, model, X_test, y_test)
    
    ml_results[name] = results
    print(f"{name}: MAE={results['MAE']:.2f}, RMSE={results['RMSE']:.2f}, R²={results['R2']:.4f}")

In [None]:
# Visualize ML model performance
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for i, (name, results) in enumerate(ml_results.items()):
    ax = axes[i]
    ax.scatter(y_test, results['predictions'], alpha=0.5)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    ax.set_xlabel('Actual Price')
    ax.set_ylabel('Predicted Price')
    ax.set_title(f'{name}\nR² = {results["R2"]:.4f}, RMSE = {results["RMSE"]:.0f}')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## LLM-based Approaches

**Note:** To run the LLM experiments, you need to:
1. Set your OpenAI API key
2. Uncomment and run the LLM evaluation cells below
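
One way to handle step 1 without pasting the key into the notebook is to read it from an environment variable. The sketch below assumes the key is stored in `OPENAI_API_KEY` (a common convention, but an assumption here); `load_api_key` is a hypothetical helper, not part of this notebook.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises a RuntimeError with a readable message if the variable is
    missing or empty, so the failure shows up before any API call is made.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


# Usage (left commented out so this cell runs without a key):
# openai.api_key = load_api_key()
```

Keeping the key out of the notebook also avoids committing it by accident when the notebook is shared.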

In [None]:
# Set your OpenAI API key here
# openai.api_key = "your-api-key-here"

def create_property_description(row):
    """Convert a property row to natural language description"""
    description = f"""Property Details:
- {int(row['number of bedrooms'])} bedrooms, {row['number of bathrooms']:.1f} bathrooms
- Living area: {int(row['living area'])} sq ft
- Lot area: {int(row['lot area'])} sq ft  
- {row['number of floors']:.1f} floors
- {'Waterfront property' if row['waterfront present'] else 'No waterfront'}
- {int(row['number of views'])} views
- Condition: {int(row['condition of the house'])}/5
- Grade: {int(row['grade of the house'])}/13
- House area (excluding basement): {int(row['Area of the house(excluding basement)'])} sq ft
- Basement area: {int(row['Area of the basement'])} sq ft
- Built in {int(row['Built Year'])}
- {'Renovated in ' + str(int(row['Renovation Year'])) if row['Renovation Year'] > 0 else 'Not renovated'}
- Postal code: {int(row['Postal Code'])}
- Location: ({row['Lattitude']:.4f}, {row['Longitude']:.4f})
- Renovated living area: {int(row['living_area_renov'])} sq ft
- Renovated lot area: {int(row['lot_area_renov'])} sq ft
- {int(row['Number of schools nearby'])} schools nearby
- {row['Distance from the airport']:.1f} km from airport"""
    return description

# Example property description
sample_row = X_test.iloc[0]
sample_description = create_property_description(sample_row)
print("Sample property description:")
print(sample_description)
print(f"\nActual price: ₹{y_test.iloc[0]:,.0f}")

In [None]:
# LLM prediction function
def get_llm_prediction(property_description, examples=None, shot_type="zero"):
    """Get price prediction from LLM using different shot approaches"""
    
    if shot_type == "zero":
        prompt = f"""You are a real estate expert. Based on the property details below, estimate the price in Indian Rupees (₹).

{property_description}

Provide only the numerical price estimate without currency symbol or formatting."""
    
    elif shot_type == "one":
        example = examples[0] if examples else ""
        prompt = f"""You are a real estate expert. Based on the property details, estimate the price in Indian Rupees (₹).

Example:
{example}

Now estimate the price for this property:
{property_description}

Provide only the numerical price estimate without currency symbol or formatting."""
    
    else:  # few-shot
        examples_text = "\n\n".join(examples) if examples else ""
        prompt = f"""You are a real estate expert. Based on the property details, estimate the price in Indian Rupees (₹).

Here are some examples:

{examples_text}

Now estimate the price for this property:
{property_description}

Provide only the numerical price estimate without currency symbol or formatting."""
    
    try:
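        # openai>=1.0 removed openai.ChatCompletion; the module-level client below
        # still honors openai.api_key (on openai<1.0, use openai.ChatCompletion.create)
        response = openai.chat.completions.create(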
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=50,
            temperature=0.1
        )
        
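        prediction_text = response.choices[0].message.content.strip()
        # Extract the first number, keeping any decimal point
        # (dropping every non-digit would turn "4500000.50" into 450000050)
        import re  # local import so this cell does not depend on the imports cell
        match = re.search(r'\d[\d,]*(?:\.\d+)?', prediction_text)
        if match is None:
            raise ValueError(f"No number found in response: {prediction_text!r}")
        prediction = float(match.group().replace(',', ''))
        return prediction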
        
    except Exception as e:
        print(f"Error getting LLM prediction: {e}")
        return None

# Prepare few-shot examples
def prepare_few_shot_examples(n_examples=5):
    """Prepare examples for few-shot learning"""
    example_indices = np.random.choice(len(X_train), n_examples, replace=False)
    examples = []
    
    for idx in example_indices:
        row = X_train.iloc[idx]
        price = y_train.iloc[idx]
        description = create_property_description(row)
        # Show the price as a plain number, matching the output format the prompt asks for
        example = f"{description}\nPrice: {price:.0f}"
        examples.append(example)
    
    return examples

In [None]:
# Uncomment and run this cell to evaluate LLM approaches
# Note: This will make API calls to OpenAI and may incur costs

# llm_results = {}
# n_test_samples = 50  # Adjust based on your budget

# # Sample test data
# test_indices = np.random.choice(len(X_test), min(n_test_samples, len(X_test)), replace=False)

# shot_configs = {
#     'Zero-shot': ('zero', None),
#     'One-shot': ('one', 1), 
#     'Few-shot (5)': ('few', 5),
#     'Few-shot (10)': ('few', 10),
#     'Few-shot (20)': ('few', 20)
# }

# for approach_name, (shot_type, n_examples) in shot_configs.items():
#     print(f"\nEvaluating {approach_name}...")
#     predictions = []
#     actual_prices = []
    
#     # Prepare examples
#     if n_examples:
#         examples = prepare_few_shot_examples(n_examples)
#     else:
#         examples = None
    
#     for idx in test_indices:
#         row = X_test.iloc[idx]
#         actual_price = y_test.iloc[idx]
#         description = create_property_description(row)
        
#         prediction = get_llm_prediction(description, examples, shot_type)
        
#         if prediction is not None:
#             predictions.append(prediction)
#             actual_prices.append(actual_price)
        
#         # Delay to avoid rate limiting
#         time.sleep(0.1)
    
#     if predictions:
#         predictions = np.array(predictions)
#         actual_prices = np.array(actual_prices)
        
#         mae = mean_absolute_error(actual_prices, predictions)
#         mse = mean_squared_error(actual_prices, predictions)
#         rmse = np.sqrt(mse)
#         r2 = r2_score(actual_prices, predictions)
        
#         llm_results[approach_name] = {
#             'MAE': mae,
#             'MSE': mse,
#             'RMSE': rmse,
#             'R2': r2,
#             'n_predictions': len(predictions),
#             'predictions': predictions,
#             'actual': actual_prices
#         }
        
#         print(f"{approach_name}: MAE={mae:.2f}, RMSE={rmse:.2f}, R²={r2:.4f} ({len(predictions)} predictions)")

print("Uncomment the above code to run LLM experiments (requires OpenAI API key)")

## Comprehensive Comparison

In [None]:
# Compare all approaches
def create_comparison_table(ml_results, llm_results=None):
    """Create a comparison table of all approaches"""
    
    all_results = ml_results.copy()
    if llm_results:
        all_results.update(llm_results)
    
    # Create DataFrame for better visualization
    comparison_data = []
    for name, metrics in all_results.items():
        comparison_data.append({
            'Approach': name,
            'MAE': metrics['MAE'],
            'RMSE': metrics['RMSE'],
            'R²': metrics['R2'],
            'Type': 'Traditional ML' if name in ml_results else 'LLM'
        })
    
    df = pd.DataFrame(comparison_data)
    df = df.sort_values('R²', ascending=False)
    
    return df

# Create comparison table (currently only ML results)
comparison_df = create_comparison_table(ml_results)  # Add llm_results when available
print("Performance Comparison (Traditional ML Models):")
print("=" * 60)
print(comparison_df.to_string(index=False, float_format='%.4f'))

# Find best performing model
best_model = comparison_df.iloc[0]
print(f"\nBest performing model: {best_model['Approach']}")
print(f"R² Score: {best_model['R²']:.4f}")
print(f"RMSE: ₹{best_model['RMSE']:,.0f}")

In [None]:
# Visualize performance comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# R² comparison
axes[0].bar(comparison_df['Approach'], comparison_df['R²'], color='skyblue', edgecolor='black')
axes[0].set_title('R² Score Comparison')
axes[0].set_ylabel('R² Score')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3)

# MAE comparison
axes[1].bar(comparison_df['Approach'], comparison_df['MAE'], color='lightcoral', edgecolor='black')
axes[1].set_title('Mean Absolute Error Comparison')
axes[1].set_ylabel('MAE (₹)')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3)

# RMSE comparison
axes[2].bar(comparison_df['Approach'], comparison_df['RMSE'], color='lightgreen', edgecolor='black')
axes[2].set_title('Root Mean Square Error Comparison')
axes[2].set_ylabel('RMSE (₹)')
axes[2].tick_params(axis='x', rotation=45)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Feature Importance Analysis (for tree-based models)

In [None]:
# Feature importance for Random Forest and XGBoost
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Random Forest feature importance
rf_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': ml_models['Random Forest'].feature_importances_
}).sort_values('importance', ascending=True)

axes[0].barh(rf_importance['feature'], rf_importance['importance'])
axes[0].set_title('Random Forest - Feature Importance')
axes[0].set_xlabel('Importance')

# XGBoost feature importance
xgb_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': ml_models['XGBoost'].feature_importances_
}).sort_values('importance', ascending=True)

axes[1].barh(xgb_importance['feature'], xgb_importance['importance'])
axes[1].set_title('XGBoost - Feature Importance')
axes[1].set_xlabel('Importance')

plt.tight_layout()
plt.show()

print("Top 5 most important features (Random Forest):")
print(rf_importance.tail(5).iloc[::-1].to_string(index=False))  # most important first
print("\nTop 5 most important features (XGBoost):")
print(xgb_importance.tail(5).iloc[::-1].to_string(index=False))  # most important first

## Summary and Conclusions

### Key Findings:

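1. **Traditional ML Performance**: 
   - Tree-based models (Random Forest, XGBoost) typically perform best on structured tabular data like this; check the comparison table above for the actual ranking
   - Linear Regression may struggle where price depends non-linearly on features such as location
   - SVR is sensitive to `C`, `gamma`, and the scale of the target, so its scores here reflect a single untuned configuration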

2. **Feature Importance**:
   - Location features (latitude, longitude) are typically highly important
   - House size features (living area, lot area) are crucial predictors
   - Quality indicators (grade, condition) significantly impact price

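3. **LLM Comparison** (when run):
   - Zero-shot: Tests the model's inherent real estate knowledge
   - One-shot: Shows how far a single example can guide predictions
   - Few-shot: Tests whether additional in-context examples (5, 10, 20) improve accuracy
   - LLM metrics are computed on a random sample of up to 50 test properties, while ML metrics use the full test set, so the two sets of scores are not directly comparable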
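
The point about linear models and non-linear relationships can be seen on a toy example. The sketch below uses made-up data (not the housing dataset) in which `y` is exactly `x` squared; `fit_line` and `r_squared` are hypothetical helpers written in plain Python for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data: a fully deterministic but non-linear relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]
r2 = r_squared(ys, preds)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R2={r2:.2f}")
```

The fitted line is flat and its R² is 0, even though `y` is fully determined by `x`. A tree-based model can approximate such a curve with piecewise-constant splits, which is one reason tree ensembles tend to cope better with features like latitude and longitude.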

### Trade-offs:

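- **Traditional ML**: Fast and cheap inference, feature importances available for the tree-based models, but requires a labeled training set
- **LLM Approaches**: Accept natural language descriptions and need few or no labeled examples, but are slower, incur a cost per prediction, and return free text that must be parsed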
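
The cost side of this trade-off can be made concrete by counting API calls. The sketch below uses the shot configurations and the 50-sample test subset from the commented-out evaluation cell; `count_llm_workload` is a hypothetical helper, and it deliberately stops at counts because token prices are not given in this notebook.

```python
def count_llm_workload(shots_per_config, n_test_samples, delay_s=0.1):
    """Count API calls and property descriptions sent across all configurations.

    Each test property triggers one call per configuration, and each call
    carries the in-context examples plus the property being priced.
    """
    n_calls = len(shots_per_config) * n_test_samples
    n_descriptions = sum((k + 1) * n_test_samples for k in shots_per_config)
    return {
        "api_calls": n_calls,
        "descriptions_sent": n_descriptions,
        "min_sleep_seconds": n_calls * delay_s,
    }


# Zero-shot, one-shot, and few-shot with 5, 10, and 20 examples.
workload = count_llm_workload([0, 1, 5, 10, 20], n_test_samples=50)
print(workload)
```

With these settings the experiment makes 250 API calls and sends 2,050 property descriptions, plus roughly 25 seconds of deliberate delay. The trained ML models, by contrast, score the whole test set in a single `predict` call.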

### Next Steps:

1. Set up an OpenAI API key and run the LLM experiments
2. Experiment with different prompt engineering techniques
3. Try ensemble methods combining both approaches
4. Implement cross-validation for more robust evaluation
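
As a starting point for step 3, one simple way to combine the two approaches is a weighted average of the ML and LLM predictions, falling back to the ML prediction when the LLM call failed (`get_llm_prediction` returns `None` on errors). `blend_predictions` and the weight values are illustrative assumptions, not something evaluated in this notebook.

```python
from typing import List, Optional


def blend_predictions(
    ml_preds: List[float],
    llm_preds: List[Optional[float]],
    llm_weight: float = 0.3,
) -> List[float]:
    """Weighted average of ML and LLM price predictions.

    Where the LLM prediction is None, the ML prediction is used unchanged.
    """
    if len(ml_preds) != len(llm_preds):
        raise ValueError("Prediction lists must have the same length")
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError("llm_weight must be between 0 and 1")

    blended = []
    for ml, llm in zip(ml_preds, llm_preds):
        if llm is None:
            blended.append(ml)  # no LLM answer: keep the ML prediction
        else:
            blended.append((1 - llm_weight) * ml + llm_weight * llm)
    return blended


# Made-up numbers for illustration only.
blended = blend_predictions([500000.0, 750000.0], [600000.0, None], llm_weight=0.5)
print(blended)
```

If this idea is pursued, the weight should be chosen on a validation split rather than on the test set, so that the reported test metrics stay unbiased.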