| Model                  | Best for                     | Pros                              | Cons                         |
|------------------------|------------------------------|-----------------------------------|------------------------------|
| **Linear Regression**  | Simple, linear relationships | Fast, interpretable               | Limited to linear data, sensitive to outliers |
| **Random Forest**      | Complex, tabular data        | Handles outliers, captures non-linear patterns | Slower, less interpretable |
| **XGBoost**            | High-stakes, complex data    | High accuracy, handles non-linear data, regularization | Complex tuning, resource-intensive |


In practice, start with Linear Regression if you expect a roughly linear relationship. If it's not enough, try Random Forest. If you need the best possible performance and can afford more tuning, use XGBoost.
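The "start simple, then escalate" advice can be sketched with cross-validation. This is a minimal illustration, not the notebook's actual workflow: it uses scikit-learn's synthetic `make_friedman1` data as a stand-in for a real dataset, the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are hypothetical, and XGBoost is left out only to keep the dependencies small.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic non-linear data, used only as a stand-in for a real dataset.
X_demo, y_demo = make_friedman1(n_samples=300, noise=1.0, random_state=42)

# Candidates ordered from simplest to most flexible.
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Compare mean 5-fold cross-validated R2 before committing to a heavier model.
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R2 = {cv_scores[name]:.3f}")
```

If the simpler model scores about as well as the more flexible one, it is usually the better choice because it is faster and easier to interpret.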

In [57]:
# Modeling Notebook - Dimensionality Reduction and Training

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.decomposition import PCA

In [58]:
# Load the feature-engineered data
file_path = '../../data/feature_engineered_immo_data.csv'  # Relative to this notebook; adjust if your layout differs
data = pd.read_csv(file_path)

In [59]:
# Step 1: Define Features (X) and Target (y)
X = data.drop(columns=['totalRent'])  # 'totalRent' is the target variable
Y = data['totalRent']


In [61]:
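# Step 2: Apply PCA for Dimensionality Reduction (optional)
# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches 95%.
# Caveat: PCA is sensitive to feature scale. If X is not standardized,
# columns with large numeric ranges dominate the explained variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)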

# Print the number of components selected by PCA
print(f"Number of components selected to explain 95% variance: {X_reduced.shape[1]}")


Number of components selected to explain 95% variance: 1
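A single component reaching 95% of the variance is often a sign that one feature's scale dominates the others. The sketch below illustrates that effect on hypothetical data (three independent columns, one with a much larger spread). It does not use the notebook's dataset, and whether this is what happened above is an assumption that would need checking against the real features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three independent features, one on a much larger scale.
X_demo = np.column_stack([
    rng.normal(0.0, 1000.0, size=500),  # large-scale column
    rng.normal(0.0, 1.0, size=500),
    rng.normal(0.0, 1.0, size=500),
])

# Without scaling, the large-scale column carries almost all of the variance.
n_raw = PCA(n_components=0.95).fit(X_demo).n_components_

# After standardization, each column contributes comparably.
X_scaled = StandardScaler().fit_transform(X_demo)
n_scaled = PCA(n_components=0.95).fit(X_scaled).n_components_

print(f"Components kept without scaling: {n_raw}")
print(f"Components kept with scaling:    {n_scaled}")
```

On this toy data PCA keeps one component without scaling and all three after standardization, which suggests standardizing the features before reading too much into the component count.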


In [62]:
# Step 3: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, Y, test_size=0.2, random_state=42)
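Note that the split above happens after PCA was fitted on the full dataset, so the test rows influenced the transformation. One common way to avoid this is to split first and wrap the preprocessing and the model in a scikit-learn `Pipeline`, so that scaling and PCA are fitted on the training rows only. The following is a sketch under assumptions: it uses synthetic `make_regression` data, and the names `X_tr`, `X_te`, `pipeline`, `test_r2`, and `n_kept` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and target.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, n_informative=6, noise=10.0, random_state=42
)

# Split first, so the test rows never influence scaling or PCA.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaler, PCA, and model are all fitted on the training data only.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
pipeline.fit(X_tr, y_tr)

test_r2 = pipeline.score(X_te, y_te)  # R2 on the held-out rows
n_kept = pipeline.named_steps["pca"].n_components_
print(f"Components kept: {n_kept}, test R2: {test_r2:.3f}")
```

The same pipeline object can be passed to cross-validation helpers, which then refit the preprocessing inside every fold.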


In [63]:
# Step 4: Initialize and Train Models
# Define models to train
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

In [64]:
# Dictionary to store results
results = {}

# Train, predict, and evaluate each model
for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate evaluation metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Save results
    results[model_name] = {
        'MAE': mae,
        'MSE': mse,
        'R2 Score': r2
    }


In [65]:
# Display results
print("Model Evaluation Results:")
for model_name, metrics in results.items():
    print(f"{model_name}: {metrics}")

Model Evaluation Results:
Linear Regression: {'MAE': np.float64(193.0353279000988), 'MSE': np.float64(63904.140016557125), 'R2 Score': 0.004641938952950175}
Random Forest: {'MAE': np.float64(222.36540682548014), 'MSE': np.float64(85210.09325856206), 'R2 Score': -0.32721531320984454}
XGBoost: {'MAE': np.float64(192.59617854767689), 'MSE': np.float64(63250.71635227815), 'R2 Score': 0.014819534823115488}
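To read these numbers: an R2 close to 0 means a model does about as well as always predicting the mean of the target, and a negative R2 means it does worse than that baseline. MSE is in squared target units, so taking its square root (RMSE) gives an error in the same units as `totalRent`. The small sketch below does that conversion with the MSE values copied, rounded, from the output above; the names `reported_mse` and `rmse` are illustrative.

```python
import math

# MSE values copied (rounded to two decimals) from the printed results.
reported_mse = {
    "Linear Regression": 63904.14,
    "Random Forest": 85210.09,
    "XGBoost": 63250.72,
}

# RMSE is the square root of MSE, in the same units as the target.
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.1f}")
```

Comparing RMSE with MAE for the same model also hints at how much a few large errors contribute, since RMSE penalizes large residuals more heavily.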
