# 13. XGBoost Modeling and Final Comparison

This notebook applies XGBoost to our three city datasets (Berlin, Istanbul, and Munich) and compares its performance against our previous Random Forest models to determine whether it improves on those results, especially for Istanbul and Munich. This is the final iteration of the project's modeling phase.

In [1]:
# Import necessary libraries
import pandas as pd
import os
import sys
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer 
from sklearn.metrics import r2_score, mean_squared_error
import xgboost as xgb

# Add the parent directory (which contains the utils package) to the system path
sys.path.append(os.path.join(os.getcwd(), '..'))

# Import our custom data loading and model utility functions
from utils.data_loader import load_and_clean_data
from utils.model_utils import prepare_features

In [2]:
def train_and_evaluate_xgboost_model(city_name):
    """
    Loads data for a given city, prepares it, trains an XGBoost model,
    and evaluates its performance.
    """
    print(f"\n--- Starting XGBoost Modeling for {city_name} ---")

    # Load and prepare data
    df_city = load_and_clean_data(city_name)
    df_city.drop(columns=['host_since', 'calendar_last_scraped', 'first_review', 'last_review'], errors='ignore', inplace=True)
    X, y = prepare_features(df_city, target_column='price')
    imputer = SimpleImputer(strategy='mean')
    X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

    # --- BUG FIX: clean feature names for XGBoost ---
    # Removes characters such as [, ] and <, which XGBoost rejects in feature names
    X.columns = X.columns.str.replace(r'[\[\]<]', '', regex=True)
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize the XGBoost Regressor model
    xg_model = xgb.XGBRegressor(
        n_estimators=100,
        max_depth=5,
        learning_rate=0.1,
        n_jobs=-1,
        random_state=42
    )

    print(f"Training XGBoost model for {city_name}...")
    
    # Train the model
    xg_model.fit(X_train, y_train)

    # Make predictions
    y_pred = xg_model.predict(X_test)

    # Evaluate performance
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    print(f"\n--- XGBoost Model Performance for {city_name} ---")
    print(f"R-squared (R²) score: {r2:.4f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

    return r2, rmse
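
The function above fits the mean imputer on the full feature matrix before splitting, so test-set values contribute to the imputation means. A minimal sketch of one alternative, assuming toy data and illustrative names (`X_demo`, `y_demo`) that are not part of the project, is to split first and fit the imputer on the training rows only. The scores reported in this notebook were produced with the original ordering, not with this variant.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with missing values (illustrative only, not project data)
X_demo = pd.DataFrame({
    "accommodates": [2, 4, np.nan, 3, 6, 2, np.nan, 5, 4, 1],
    "bedrooms": [1, 2, 1, np.nan, 3, 1, 2, np.nan, 2, 1],
})
y_demo = pd.Series([60, 120, 80, 95, 210, 55, 130, 160, 110, 40])

# Split first, so the test rows never influence the imputation statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn column means from the training rows only, then reuse them for the test rows
imputer = SimpleImputer(strategy="mean")
X_tr_imp = pd.DataFrame(imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_imp = pd.DataFrame(imputer.transform(X_te), columns=X_te.columns, index=X_te.index)

print(imputer.statistics_)  # per-column means computed from the training split
```

Keeping the original index on the imputed frames keeps `X_tr_imp` aligned with `y_tr` if the two are later joined or inspected side by side.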

In [3]:
# Run the model for Berlin
berlin_r2, berlin_rmse = train_and_evaluate_xgboost_model('berlin')

# Run the model for Istanbul
istanbul_r2, istanbul_rmse = train_and_evaluate_xgboost_model('istanbul')

# Run the model for Munich
munich_r2, munich_rmse = train_and_evaluate_xgboost_model('munich')


--- Starting XGBoost Modeling for berlin ---
Loading cleaned data for Berlin from processed directory...
Categorical features have been one-hot encoded.
Shape of features (X) after encoding: (9135, 8936)
Training XGBoost model for berlin...

--- XGBoost Model Performance for berlin ---
R-squared (R²) score: 0.7264
Root Mean Squared Error (RMSE): 49.01

--- Starting XGBoost Modeling for istanbul ---
Loading cleaned data for Istanbul from processed directory...
Categorical features have been one-hot encoded.
Shape of features (X) after encoding: (3340, 3524)
Training XGBoost model for istanbul...

--- XGBoost Model Performance for istanbul ---
R-squared (R²) score: 0.3368
Root Mean Squared Error (RMSE): 174.30

--- Starting XGBoost Modeling for munich ---
Loading cleaned data for Munich from processed directory...
Categorical features have been one-hot encoded.
Shape of features (X) after encoding: (4687, 4871)
Training XGBoost model for munich...

--- XGBoost Model Performance for muni

# Airbnb Price Prediction Case Study

## Project Overview

This project is a comprehensive machine learning case study focused on predicting Airbnb prices in three major European cities: Berlin, Istanbul, and Munich. The primary goal was to explore how different machine learning models perform on varying urban market data and to develop a robust, high-performance solution.

The project followed an iterative workflow:

1.  **Data Preprocessing and Exploratory Data Analysis (EDA):** I began by cleaning and preparing the raw data, followed by a detailed EDA to understand the key factors influencing price in each city.
2.  **Baseline Modeling:** A simple Linear Regression model (V1) was established as a baseline to set a benchmark for performance.
3.  **Advanced Modeling:** I transitioned to more powerful tree-based ensemble models, starting with a Random Forest Regressor (V2).
4.  **Model Optimization:** I performed a hyperparameter tuning process to find the optimal settings for the Random Forest model (V3).
5.  **Advanced Algorithm Implementation:** Recognizing that the Random Forest model might have reached its performance limit, especially in Istanbul and Munich, I implemented and evaluated the more advanced XGBoost algorithm.
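
As a rough illustration of what the tuning in step 4 can look like, here is a minimal sketch using scikit-learn's `GridSearchCV`. The synthetic data, the parameter grid, and the choice of grid search itself are assumptions made for this example; the search method and values used in the project are not documented in this summary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the project itself used the city listing datasets
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Hypothetical search grid, chosen only to keep the example fast
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

# 3-fold cross-validated search over the grid, scored by R²
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.4f}")
```

If the best cross-validated score is close to the score of the default configuration, that is consistent with a model that is already near its ceiling for the given features.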

## Final Model Performance Summary: The Ultimate Comparison

This table represents the culmination of all modeling efforts, providing a clear comparison of each model's performance.

| Model | City | R² Score | RMSE |
|---|---|---|---|
| **Linear Regression (V1)** | Berlin | 0.008 | 93.31 |
| | Istanbul | 0.022 | 211.71 |
| | Munich | -0.002 | 142.19 |
| **Random Forest (V2/V3)** | Berlin | 0.7133 | 50.17 |
| | Istanbul | 0.3140 | 177.27 |
| | Munich | 0.4684 | 103.57 |
| **XGBoost (Final Model)** | **Berlin** | **0.7264** | **49.01** |
| | **Istanbul** | **0.3368** | **174.30** |
| | **Munich** | **0.5598** | **94.24** |
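
To make the size of the XGBoost gains easier to read, the short sketch below computes the relative RMSE reduction over Random Forest for each city. The numbers are copied from the table above; the variable names are illustrative.

```python
# RMSE values taken from the comparison table
rf_rmse = {"Berlin": 50.17, "Istanbul": 177.27, "Munich": 103.57}
xgb_rmse = {"Berlin": 49.01, "Istanbul": 174.30, "Munich": 94.24}

# Relative reduction in RMSE when moving from Random Forest to XGBoost, in percent
reduction = {
    city: 100 * (rf_rmse[city] - xgb_rmse[city]) / rf_rmse[city]
    for city in rf_rmse
}

for city, pct in reduction.items():
    print(f"{city}: {pct:.1f}% lower RMSE with XGBoost")
```

This works out to roughly 2.3% for Berlin, 1.7% for Istanbul, and 9.0% for Munich, so the improvement is concentrated in Munich.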

## Key Insights and Conclusions

Through this project, I gained several key insights and demonstrated core data science competencies:

-   **Algorithmic Choice Matters:** The jump in performance from Linear Regression (V1) to the ensemble models (Random Forest and XGBoost) was dramatic: in Berlin, R² rose from 0.008 to above 0.71. This confirms that a simple linear model is insufficient for this complex, non-linear data.
-   **Validation and Optimization:** My hyperparameter tuning of the Random Forest model (V3) showed that the initial default model was already performing near its peak.
-   **Knowing When to Iterate:** When the Random Forest model's performance plateaued, I chose to implement XGBoost, a gradient-boosting algorithm that often performs well on tabular data of this kind. This decision produced a clear gain in Munich (R² from 0.4684 to 0.5598) and smaller gains in Istanbul (0.3140 to 0.3368) and Berlin (0.7133 to 0.7264), supporting my hypothesis that a more powerful model could capture more of the complexity in those markets.
-   **Data-Driven Problem Solving:** I encountered and resolved technical challenges, such as a `ValueError` caused by special characters in feature names, demonstrating my ability to diagnose and fix real-world coding issues.
-   **Market-Specific Dynamics:** The final results show that my models perform best in Berlin (R² of about 0.73), while the lower scores in Munich (about 0.56) and Istanbul (about 0.34) suggest that each city has market factors not captured by the current dataset, which is a valuable business insight for future work.
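
The feature-name fix mentioned above can be illustrated in isolation. This is a minimal sketch with made-up column names, using the same regular expression as the notebook to strip `[`, `]`, and `<` from column labels.

```python
import pandas as pd

# Hypothetical one-hot column names containing characters XGBoost rejects
X_demo = pd.DataFrame(
    columns=["price_[0-50]", "nights<7", "room_type_Entire home/apt"]
)

# Strip '[', ']' and '<' from every column name
cleaned = X_demo.columns.str.replace(r"[\[\]<]", "", regex=True)
X_demo.columns = cleaned

print(list(X_demo.columns))
# ['price_0-50', 'nights7', 'room_type_Entire home/apt']

# Removing characters can make two names collide, so it is worth checking
print("Names still unique:", X_demo.columns.is_unique)
```

Replacing the characters with an underscore instead of an empty string is a common variation that lowers the chance of two cleaned names colliding.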