# 🛫 Flight Price Forecasting Pipeline

## 🎯 Project Goal

The primary objective of this notebook is to develop a **robust machine learning model** capable of predicting flight prices for specific routes. By accurately forecasting the expected price of a flight, we can detect **"deals"** (offers priced significantly below the predicted market rate) and **"overpriced"** listings, providing actionable advice to users on whether to **Buy Now** or **Wait**.

---

## 📊 Data Overview

The raw data consists of multiple daily scrapes of flight itinerary information. Crucially, the raw data is at the **flight segment level**, meaning multi-connection journeys appear as multiple rows.

* **Source:** Aggregated daily flight offers from the Amadeus API.
* **Key Challenge:** The price landscape is highly **volatile**. Prices change frequently due to dynamic pricing algorithms, load factors, and time-to-departure. Our feature engineering must account for this volatility.

---

## 🛠️ Methodology Overview

The pipeline uses a multi-stage approach, leveraging advanced **time-series feature engineering** and **Gradient Boosting** models (LightGBM) to capture the complex patterns in flight price behavior.

### 1. Data Cleaning (Segment Consolidation)

The raw, segment-level data is aggregated into a single, clean record per unique flight **journey** (`OfferID`). This forms the base dataset for all subsequent steps.

### 2. Time-Series Feature Engineering

This is the most critical stage. We create **lagged price features** and **rolling statistics** (mean, standard deviation) by tracking the history of the *minimum price* for a specific flight date. This converts noisy daily price observations into stable, predictive time-series signals.

### 3. Model Training & Validation

A LightGBM Regressor is trained using **Time Series Cross-Validation (TSCV)**. TSCV ensures that the model is only ever trained on historical data, validating its performance on future data in a manner that realistically simulates deployment.

### 4. Deal Detection

The final model predictions are used to calculate the **percentage error** between the actual price and the predicted price. Thresholds are applied to categorize offers and generate concrete recommendations:

| Prediction Error % | Recommendation | Category |
| :--- | :--- | :--- |
| **$\leq -12\%$** | 🔥 BUY NOW | **EXCELLENT** |
| **$-12\%$ to $-7\%$** | ✅ STRONG BUY | **GOOD** |
| **$> 10\%$** | ❌ WAIT | **OVERPRICED** |
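
The thresholds above can be sketched as a small classifier. This is an illustration only, not the notebook's `detect_deals()` implementation: the function name `classify_offer`, the choice to measure error relative to the predicted price, the inclusive handling of the $-7\%$ boundary, and the placeholder label for the range the table does not cover are all assumptions.

```python
from typing import Tuple


def classify_offer(actual_price: float, predicted_price: float) -> Tuple[float, str, str]:
    """Return (error_pct, recommendation, category) for one offer.

    Assumption: error is measured relative to the predicted price, so a
    negative value means the offer is cheaper than the model expects.
    """
    error_pct = (actual_price - predicted_price) / predicted_price * 100

    if error_pct <= -12:
        return error_pct, "BUY NOW", "EXCELLENT"
    if error_pct <= -7:
        return error_pct, "STRONG BUY", "GOOD"
    if error_pct > 10:
        return error_pct, "WAIT", "OVERPRICED"
    # The table does not define this range; these labels are placeholders.
    return error_pct, "NO ACTION", "FAIR"


print(classify_offer(170.0, 200.0))  # priced well below the prediction
print(classify_offer(230.0, 200.0))  # priced well above the prediction
```

Keeping the thresholds in one function makes it easy to adjust them later without touching the prediction code.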

---

## 🚀 Pipeline Structure

This notebook is structured around the following functions, which represent the sequential steps of the ML workflow:

* `get_data()`: Loads all raw CSVs.
* `clean_raw_data()`: Consolidates segments into journeys.
* `filter_target_route()`: Focuses on specific routes (e.g., IAH -> LAX).
* `engineer_features()`: Creates standard temporal and flight features.
* `create_price_history_features()`: Generates lag and rolling price features.
* `train_price_model()`: Fits the LightGBM model with TSCV.
* `predict_prices()`: Generates predictions on the data.
* `evaluate_model()`: Reports performance metrics (MAE, MAPE, R²).
* `detect_deals()`: Final classification and recommendation step.
* `run_complete_pipeline()`: Orchestrates all steps.

In [1]:
import pickle
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Any
import warnings
import io
import os
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error
import lightgbm as lgb
warnings.filterwarnings('ignore')

# Go to the content folder
%cd /content

# Remove old repo if it exists
!rm -rf Airline-Flight-Price-Analysis-with-APIs-Azure

# Clone the latest version of your repo
!git clone https://github.com/williamervin7/Airline-Flight-Price-Analysis-with-APIs-Azure.git

# Check contents
!ls Airline-Flight-Price-Analysis-with-APIs-Azure
import numpy as np
data_path = 'Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw'


/content
Cloning into 'Airline-Flight-Price-Analysis-with-APIs-Azure'...
remote: Enumerating objects: 541, done.
remote: Counting objects: 100% (147/147), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 541 (delta 120), reused 17 (delta 13), pack-reused 394 (from 1)
Receiving objects: 100% (541/541), 1.43 MiB | 11.25 MiB/s, done.
Resolving deltas: 100% (346/346), done.
data  figures  notebooks  README.md  scripts



# 📥 STEP 1: Data Ingestion

This step is responsible for loading the raw flight segment data from the file system. Since the data is stored across multiple CSV files (one for each daily scrape), the `get_data` function consolidates them into a single, unified DataFrame for processing.

-----

## `get_data` Function

The function performs the following critical tasks:

1.  **Iterative Loading:** Reads every CSV file in the specified directory path.
2.  **Date Parsing:** Automatically converts critical date-time columns (like `'Departure'`, `'Arrival'`, `'DepartureDate'`, and `'SearchDate'`) into the correct datetime format upon load.
3.  **Consolidation:** Uses `pd.concat` to merge all individual daily files into one master DataFrame.
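
The three tasks above can be tried in isolation with a throwaway folder. This is a minimal sketch under assumed inputs: the file names, the two-column layout, and the use of `pathlib` are illustrative choices and do not reflect the real scrape files.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Two hypothetical daily scrape files plus one non-CSV file to skip.
    (folder / "2025-01-01.csv").write_text("SearchDate,Price\n2025-01-01,150.9\n")
    (folder / "2025-01-02.csv").write_text("SearchDate,Price\n2025-01-02,162.4\n")
    (folder / "notes.txt").write_text("ignored")

    # Read every CSV in sorted order, parsing the date column on load.
    frames = [
        pd.read_csv(csv_path, parse_dates=["SearchDate"])
        for csv_path in sorted(folder.glob("*.csv"))
    ]

    # Merge the daily files into one DataFrame with a fresh index.
    combined = pd.concat(frames, ignore_index=True)

print(len(frames), combined.shape)
print(combined.dtypes)
```

Counting the loaded frames, rather than the directory entries, keeps the summary correct when the folder also holds non-CSV files.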

In [13]:
def get_data(path: str) -> pd.DataFrame:
    """
    Loads and consolidates all CSV files from a specified directory into a single Pandas DataFrame.

    This function iterates through all files in the given directory, loads only
    those ending with '.csv', and automatically parses the relevant date columns
    into datetime objects during loading.

    Parameters
    ----------
    path : str
        The full path to the directory containing the raw CSV data files.

    Returns
    -------
    pd.DataFrame
        A single, concatenated DataFrame containing all rows from all CSV files
        found in the directory.

    Notes
    -----
    The function assumes that all CSV files have the following columns which
    should be parsed as dates: 'Departure', 'Arrival', 'DepartureDate', and
    'SearchDate'. Prints a summary of the files loaded and the total number of rows.
    """
    # Collect the CSV files in the given folder, in sorted order
    csv_files = sorted(f for f in os.listdir(path) if f.endswith('.csv'))

    # Load each file, parsing the date columns on read
    dfs = []
    for f in csv_files:
        full_path = os.path.join(path, f)
        print(f"Loading {full_path}...")

        df = pd.read_csv(full_path, parse_dates=['Departure', 'Arrival', 'DepartureDate', 'SearchDate'])
        dfs.append(df)

    # Merge all daily files into one DataFrame
    all_data = pd.concat(dfs, ignore_index=True)
    print(f'Loaded {len(dfs)} files, {all_data.shape[0]} rows')
    return all_data

# 🧹 STEP 2: Data Preparation (Cleaning & Filtering)

Before we can build any features, the raw data must be structured correctly. This step addresses two critical pre-processing requirements: **data granularity correction** and **route focus**.

## 1\. Segment Consolidation (`clean_raw_data`)

The raw data is at the **flight segment level**, meaning a connecting flight (e.g., IAH → DEN → LAX) appears as two separate rows, but both share the same `OfferID` and overall `Price`. This step aggregates these segments into a single row representing the complete journey.

### Key Aggregation Logic:

  * **Grouping Key:** Unique combination of (`OfferID`, `DepartureDate`, `SearchDate`).
  * **Journey Origin/Destination:** Taken from the `From` of the first segment and the `To` of the last segment (after sorting by departure time).
  * **Total Duration:** Calculated as the time difference between the first segment's departure and the last segment's arrival.
  * **New Features:** `num_segments`, `is_direct`, `primary_airline`, and `stops` are created to capture the journey's characteristics.

### Output Snapshot

| Feature | Description | Example (2-Segment Journey) |
| :--- | :--- | :--- |
| **`origin`** | Departure airport of first flight. | `IAH` |
| **`destination`** | Arrival airport of last flight. | `LAX` |
| **`price`** | Single price for the entire itinerary. | `150.90` |
| **`total_duration_minutes`** | End-to-end travel time. | `360` |
| **`num_segments`** | Number of legs (flights). | `2` |
| **`stops`** | Airports visited between origin and destination. | `DEN` |
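
The example row in the table can be reproduced on toy data. The sketch below is a vectorized alternative to a per-group loop, not the notebook's own code: the sample timestamps are invented, and it assumes naive (timezone-free) datetimes.

```python
import pandas as pd

# Two hypothetical segment rows that share one OfferID (IAH -> DEN -> LAX).
segments = pd.DataFrame({
    "OfferID": [1, 1],
    "SearchDate": pd.to_datetime(["2025-01-01", "2025-01-01"]),
    "DepartureDate": pd.to_datetime(["2025-02-01", "2025-02-01"]),
    "Departure": pd.to_datetime(["2025-02-01 08:00", "2025-02-01 11:30"]),
    "Arrival": pd.to_datetime(["2025-02-01 10:15", "2025-02-01 14:00"]),
    "From": ["IAH", "DEN"],
    "To": ["DEN", "LAX"],
    "Price": [150.90, 150.90],
})

keys = ["OfferID", "DepartureDate", "SearchDate"]

# Sort first so "first"/"last" follow the order in which segments are flown.
ordered = segments.sort_values("Departure")

journeys = ordered.groupby(keys).agg(
    origin=("From", "first"),
    destination=("To", "last"),
    price=("Price", "first"),
    departure_time=("Departure", "first"),
    arrival_time=("Arrival", "last"),
    num_segments=("From", "count"),
    # Every "To" except the last one is an intermediate stop.
    stops=("To", lambda s: "|".join(s.iloc[:-1]) if len(s) > 1 else "DIRECT"),
).reset_index()

journeys["total_duration_minutes"] = (
    (journeys["arrival_time"] - journeys["departure_time"]).dt.total_seconds() / 60
)
journeys["is_direct"] = journeys["num_segments"] == 1

print(journeys[["origin", "destination", "price",
                "total_duration_minutes", "num_segments", "stops"]])
```

Named aggregation keeps the output column names explicit, which helps when the consolidated schema feeds later feature engineering.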

-----

## 2\. Route Filtering (`filter_target_route`)

To make the modeling problem tractable and focus on key business insights, the data is filtered to include only specific routes. For this project, we are focusing on flights originating from **Houston (IAH)** destined for airports in the **Greater Los Angeles Area (LAX, ONT)**.


In [3]:
def clean_raw_data(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """
    Cleans raw flight segment data by consolidating multi-segment itineraries
    into a single row per unique offer.

    The raw data is assumed to have multiple rows for connecting flights
    sharing the same OfferID, DepartureDate, and SearchDate. This function
    aggregates flight segment details to represent the entire journey.

    Parameters
    ----------
    df : pd.DataFrame
        The raw DataFrame containing one row per flight segment.
        Expected columns include 'OfferID', 'DepartureDate', 'SearchDate',
        'Departure', 'Arrival', 'Price', 'From', 'To', 'Airline', 'Flight'.
    verbose : bool, optional
        If True, prints progress and summary statistics. The default is True.

    Returns
    -------
    pd.DataFrame
        A cleaned DataFrame where each row represents a complete flight offer
        (journey), ready for feature engineering.

    Notes
    -----
    The output columns use standard Python snake_case for consistency.
    """
    if verbose:
        print(f"\n{'='*70}")
        print("STEP 1: DATA CLEANING")
        print(f"{'='*70}")
        print(f"Raw data shape: {df.shape}")

    df = df.copy()

    # Parse dates
    df['Departure'] = pd.to_datetime(df['Departure'])
    df['Arrival'] = pd.to_datetime(df['Arrival'])
    df['DepartureDate'] = pd.to_datetime(df['DepartureDate'])
    df['SearchDate'] = pd.to_datetime(df['SearchDate'])

    # Group by OfferID to consolidate connecting flights
    offer_groups = []

    for (offer_id, dep_date, search_date), group in df.groupby(['OfferID', 'DepartureDate', 'SearchDate']):
        # Sort by departure time to get journey order
        group = group.sort_values('Departure')

        # Get journey details
        first_flight = group.iloc[0]
        last_flight = group.iloc[-1]

        # Calculate total journey time
        total_duration_minutes = (last_flight['Arrival'] - first_flight['Departure']).total_seconds() / 60

        # Consolidate offer info
        offer_info = {
            'offer_id': offer_id,
            'search_date': search_date,
            'departure_date': dep_date,
            'price': first_flight['Price'],  # Price is same for all segments

            # Origin and destination
            'origin': first_flight['From'],
            'destination': last_flight['To'],

            # Timing
            'departure_time': first_flight['Departure'],
            'arrival_time': last_flight['Arrival'],
            'total_duration_minutes': total_duration_minutes,

            # Flight details
            'num_segments': len(group),
            'is_direct': len(group) == 1,
            'airlines': '|'.join(group['Airline'].unique()),
            'primary_airline': group['Airline'].iloc[0],
            'flight_numbers': '|'.join(group['Flight'].astype(str).values),

            # Intermediate stops
            'stops': '|'.join(group['To'].iloc[:-1].values) if len(group) > 1 else 'DIRECT',
        }

        offer_groups.append(offer_info)

    # Create cleaned dataframe
    cleaned_df = pd.DataFrame(offer_groups)

    if verbose:
        print(f"\nCleaned data shape: {cleaned_df.shape}")
        print(f"Unique offers: {cleaned_df['offer_id'].nunique()}")
        print(f"Price range: ${cleaned_df['price'].min():.2f} - ${cleaned_df['price'].max():.2f}")
        print(f"Date range: {cleaned_df['search_date'].min().date()} to {cleaned_df['search_date'].max().date()}")
        print(f"\nFlight types:")
        print(f"  Direct flights: {cleaned_df['is_direct'].sum()} ({cleaned_df['is_direct'].mean()*100:.1f}%)")
        print(f"  Connecting flights: {(~cleaned_df['is_direct']).sum()} ({(~cleaned_df['is_direct']).mean()*100:.1f}%)")
        print(f"\nAirlines:")
        print(cleaned_df['primary_airline'].value_counts())

    return cleaned_df

def filter_target_route(df: pd.DataFrame, origin: str = 'IAH',
                       destinations: List[str] = ['LAX', 'ONT'],
                       verbose: bool = True) -> pd.DataFrame:
    """
    Filters the flight offers DataFrame to include only specific origin-destination routes.

    This function is used to focus the analysis and modeling efforts on a
    pre-defined set of routes, such as flights originating from IAH and
    destined for the Los Angeles area (LAX, ONT, etc.).

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame containing consolidated flight offer data.
        Must contain 'origin' and 'destination' columns.
    origin : str, optional
        The specific origin airport code (e.g., 'IAH') to filter on. The default is 'IAH'.
    destinations : List[str], optional
        A list of destination airport codes (e.g., ['LAX', 'ONT']) to filter on.
        The default is ['LAX', 'ONT'].
    verbose : bool, optional
        If True, prints a summary of the filtering operation, including the
        data count before and after, and a destination breakdown. The default is True.

    Returns
    -------
    pd.DataFrame
        A new DataFrame containing only the flight offers matching the
        specified origin and destination criteria.
    """
    if verbose:
        print(f"\n{'='*70}")
        print("STEP 2: ROUTE FILTERING")
        print(f"{'='*70}")

    # Filter for IAH to LAX/ONT area
    mask = (df['origin'] == origin) & (df['destination'].isin(destinations))
    filtered_df = df[mask].copy()

    if verbose:
        print(f"Filtering: {origin} → {destinations}")
        print(f"Before: {len(df)} offers")
        print(f"After: {len(filtered_df)} offers")
        print(f"\nDestination breakdown:")
        print(filtered_df['destination'].value_counts())

    return filtered_df

# 💡 STEP 3 & 4: Feature Engineering

Feature Engineering is the most critical stage of this pipeline, as it transforms raw date and flight details into predictive signals. This process is divided into two phases: **Standard Features** (temporal, duration, categorical) and **Time-Series Price History Features**.

-----

## 3\. Standard Feature Engineering (`engineer_features`)

This function creates foundational features that capture the seasonality, timing, and structural aspects of the flight offers.

### Feature Categories

| Category | Features | Description |
| :--- | :--- | :--- |
| **Booking Window** | `days_until_departure`, `weeks_until_departure` | Quantifies the advance purchase time, which is highly correlated with price. |
| **Temporal Seasonality** | `departure_month`, `departure_day_of_week`, `search_day_of_week` | Captures monthly seasonality and day-of-week effects (e.g., cheaper to search on Tuesday). |
| **Departure Time** | `departure_hour`, `is_morning`, `is_evening`, `is_afternoon` | Captures price differences based on time of day (e.g., morning/evening flights are often pricier). |
| **Booking Flags** | `is_last_minute`, `is_early_booker` | Categorical flags for important booking windows (e.g., booking within 7 days). |
| **Flight Structure** | `duration_hours`, `num_stops`, `is_direct_flight` | Structural characteristics that determine the flight's inherent cost/value. |
| **Categorical Encoding** | `airline_encoded`, `is_lax` | Numerical mapping for airlines (using simple integer encoding) and one-hot encoding for specific destinations. |
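
A few of the features in the table can be derived on a two-row toy frame as follows. The sample dates are invented for illustration. The 7-day and 30-day cut-offs and the hour boundaries follow the notebook's `engineer_features` code.

```python
import pandas as pd

# Hypothetical offers: one booked a month ahead, one booked four days ahead.
offers = pd.DataFrame({
    "search_date": pd.to_datetime(["2025-01-01", "2025-01-20"]),
    "departure_date": pd.to_datetime(["2025-02-01", "2025-01-24"]),
    "departure_time": pd.to_datetime(["2025-02-01 08:00", "2025-01-24 19:30"]),
})

# Booking window
offers["days_until_departure"] = (offers["departure_date"] - offers["search_date"]).dt.days

# Temporal features (dayofweek: 0=Mon, 6=Sun)
offers["departure_day_of_week"] = offers["departure_date"].dt.dayofweek
offers["departure_hour"] = offers["departure_time"].dt.hour

# Time-of-day and booking flags stored as 0/1 integers
offers["is_morning"] = offers["departure_hour"].between(6, 11).astype(int)
offers["is_evening"] = (offers["departure_hour"] >= 18).astype(int)
offers["is_last_minute"] = (offers["days_until_departure"] <= 7).astype(int)
offers["is_early_booker"] = (offers["days_until_departure"] >= 30).astype(int)

print(offers.drop(columns=["search_date", "departure_date", "departure_time"]))
```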

### Code

```python
def engineer_features(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    # ... implementation details
    return df
```

-----

## 📈 4. Time-Series Price History Features (`create_price_history_features`)

Price forecasting requires understanding the trajectory of the price leading up to the current search date. These features are generated using a **time-series approach** to prevent data leakage.

### Key Features (Generated per Unique Flight Itinerary)

| Feature | Calculation Method | Prediction Goal |
| :--- | :--- | :--- |
| **`price_lag_1`** | Price observed on the search date immediately prior (`.shift(1)`). | Captures the **previous day's price level**. |
| **`price_rolling_mean_3`** | Average price over the past 3 days (excluding today). | Provides a stable measure of the **recent price trend**. |
| **`price_rolling_std_3`** | Standard deviation over the past 3 days (excluding today). | Measures **price volatility**; a high std often precedes sharp changes. |
| **`price_min_last_7`** | Minimum price observed over the past 7 days. | Captures the **best deal** seen recently for this flight. |
| **`price_change_1d`** | Price difference between today and yesterday (`.diff(1)`). | Measures **price momentum** (is the price rising or falling?). |

### Data Leakage Prevention

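The core of this step is sorting the data by `search_date` **within each unique flight** and applying `.shift(1)` before computing the lag and rolling features, so those features for *today's* row draw only on prices observed *earlier*. `price_change_1d` is the exception: `.diff(1)` subtracts the previous price from the current one, so it still contains today's price and should be computed on the shifted series if strict leakage prevention is required.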
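
The shift-then-roll pattern can be sketched on a toy price history. This is a minimal illustration, not the notebook's implementation: `flight_id` stands in for the real grouping columns, and `price_change_prior` is a hypothetical momentum feature built from the shifted series so that it never touches the current price.

```python
import pandas as pd

# Hypothetical price history for one flight across five search dates.
history = pd.DataFrame({
    "flight_id": ["A"] * 5,
    "search_date": pd.date_range("2025-01-01", periods=5, freq="D"),
    "price": [100.0, 110.0, 105.0, 120.0, 115.0],
})

# Order observations in time within each flight before shifting.
history = history.sort_values(["flight_id", "search_date"])

# Price from the previous observation of the same flight.
prior = history.groupby("flight_id")["price"].shift(1)
history["price_lag_1"] = prior

# Rolling mean over the previous three observations (current row excluded).
history["price_rolling_mean_3"] = prior.groupby(history["flight_id"]).transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Momentum between the two most recent prior prices (assumed variant).
history["price_change_prior"] = prior.groupby(history["flight_id"]).diff(1)

print(history)
```

Because every derived column starts from `prior`, the first row of each flight has no history and stays `NaN` until it is imputed.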

### Code

```python
def create_price_history_features(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    # ... implementation details (Groups, Sorts, Shifts)
    return df
```

-----

## 📜 Complete Feature List (`get_feature_list`)

The model training step will utilize all the following engineered features, combining structural, temporal, and critical time-series information:

```python
[
    # Booking window
    'days_until_departure', 'weeks_until_departure',
    
    # Time features
    'search_day_of_week', 'departure_day_of_week', 'departure_hour', 'departure_month',
    
    # Categorical Flags
    'is_weekend_departure', 'is_friday_departure', 'is_monday_departure',
    'is_morning', 'is_afternoon', 'is_evening', 'is_last_minute', 'is_early_booker',
    
    # Flight characteristics
    'duration_hours', 'is_direct_flight', 'num_stops', 'is_lax', 'airline_encoded',
    
    # Price history (Time-Series Features)
    'price_lag_1', 'price_lag_3', 'price_rolling_mean_3', 'price_rolling_std_3',
    'price_min_last_7', 'price_max_last_7', 'price_change_1d',
]
```

In [4]:
def engineer_features(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """
    Creates a comprehensive set of predictive features from the cleaned flight offers DataFrame.

    Features include temporal components (day of week, month), booking window metrics,
    flight characteristics (duration, stops), and one-hot style categorical flags
    to capture price seasonality and patterns.

    Parameters
    ----------
    df : pd.DataFrame
        The cleaned DataFrame containing one row per flight offer, expected to
        have columns like 'search_date', 'departure_date', 'departure_time',
        'total_duration_minutes', 'is_direct', 'num_segments', 'destination',
        and 'primary_airline'.
    verbose : bool, optional
        If True, prints a summary of the engineered features and booking window
        distribution. The default is True.

    Returns
    -------
    pd.DataFrame
        The DataFrame augmented with new, model-ready features, including:
        - Temporal features (e.g., 'departure_day_of_week', 'search_month').
        - Booking window features (e.g., 'days_until_departure', 'is_last_minute').
        - Flight characteristics (e.g., 'duration_hours', 'num_stops').
        - Encoded categorical variables ('airline_encoded', 'is_lax', 'is_ont').
    """
    if verbose:
        print(f"\n{'='*70}")
        print("STEP 3: FEATURE ENGINEERING")
        print(f"{'='*70}")

    df = df.copy()

    # === BOOKING WINDOW FEATURES ===
    df['days_until_departure'] = (df['departure_date'] - df['search_date']).dt.days
    df['weeks_until_departure'] = df['days_until_departure'] / 7

    # === TIME FEATURES ===
    # Search date features
    df['search_day_of_week'] = df['search_date'].dt.dayofweek  # 0=Mon, 6=Sun
    df['search_day_of_month'] = df['search_date'].dt.day
    df['search_week_of_year'] = df['search_date'].dt.isocalendar().week
    df['search_month'] = df['search_date'].dt.month

    # Departure date features
    df['departure_day_of_week'] = df['departure_date'].dt.dayofweek
    df['departure_day_of_month'] = df['departure_date'].dt.day
    df['departure_week_of_year'] = df['departure_date'].dt.isocalendar().week
    df['departure_month'] = df['departure_date'].dt.month
    df['departure_hour'] = df['departure_time'].dt.hour

    # === CATEGORICAL FEATURES ===
    df['is_weekend_departure'] = df['departure_day_of_week'].isin([5, 6]).astype(int)
    df['is_friday_departure'] = (df['departure_day_of_week'] == 4).astype(int)
    df['is_monday_departure'] = (df['departure_day_of_week'] == 0).astype(int)

    # Time of day
    df['is_early_morning'] = (df['departure_hour'] < 6).astype(int)  # Red-eye
    df['is_morning'] = ((df['departure_hour'] >= 6) & (df['departure_hour'] < 12)).astype(int)
    df['is_afternoon'] = ((df['departure_hour'] >= 12) & (df['departure_hour'] < 18)).astype(int)
    df['is_evening'] = (df['departure_hour'] >= 18).astype(int)

    # Booking patterns
    df['is_last_minute'] = (df['days_until_departure'] <= 7).astype(int)
    df['is_early_booker'] = (df['days_until_departure'] >= 30).astype(int)
    df['is_moderate_advance'] = ((df['days_until_departure'] > 7) &
                                  (df['days_until_departure'] < 30)).astype(int)

    # === FLIGHT CHARACTERISTICS ===
    df['duration_hours'] = df['total_duration_minutes'] / 60
    df['is_direct_flight'] = df['is_direct'].astype(int)
    df['num_stops'] = df['num_segments'] - 1

    # Destination encoding (LAX vs ONT)
    df['is_lax'] = (df['destination'] == 'LAX').astype(int)
    df['is_ont'] = (df['destination'] == 'ONT').astype(int)

    # Airline encoding
    airline_map = {airline: idx for idx, airline in enumerate(df['primary_airline'].unique())}
    df['airline_encoded'] = df['primary_airline'].map(airline_map)

    if verbose:
        print(f"\nFeatures created: {len([c for c in df.columns if c.startswith(('is_', 'days_', 'weeks_', 'num_', 'duration_'))])} engineered features")
        print(f"\nBooking window distribution:")
        print(f"  Last minute (≤7 days): {df['is_last_minute'].sum()}")
        print(f"  Moderate (8-29 days): {df['is_moderate_advance'].sum()}")
        print(f"  Early (≥30 days): {df['is_early_booker'].sum()}")

    return df


def create_price_history_features(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """
    Calculates time-series based price history features (lags, rolling stats, momentum)
    for each unique flight itinerary (Departure Date, Departure Hour, Destination).

    These features are crucial for price forecasting as they capture the temporal
    trend, volatility, and momentum of the price leading up to the search date.
    The calculations are performed sequentially based on the 'search_date' to prevent
    data leakage.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame, which must contain the 'price' column and the
        grouping columns: 'departure_date', 'departure_hour', and 'destination'.
        It is assumed this DataFrame is already grouped/filtered by route.
    verbose : bool, optional
        If True, prints a header and summary statistics about the created features.
        The default is True.

    Returns
    -------
    pd.DataFrame
        The original DataFrame augmented with columns representing lagged prices,
        rolling means/standard deviations, and price changes.

    Notes
    -----
    - **Time Ordering:** Data is sorted by `search_date` within each unique flight
      to ensure correct sequential calculation of lags and rolling statistics.
    - **Data Leakage Prevention:** The `.shift(1)` operation is used before
      calculating rolling statistics to ensure the current day's price is *not* used to predict itself.
    - **Imputation:** Missing values (NaNs) are filled using the overall dataset
      mean for price levels, the overall price standard deviation for
      `price_rolling_std_3`, and 0 for price change features.
    """
    if verbose:
        print(f"\n{'='*70}")
        print("STEP 4: PRICE HISTORY FEATURES")
        print(f"{'='*70}")

    df = df.copy()
    df = df.sort_values(['departure_date', 'departure_hour', 'search_date'])

    # Group by unique flight (departure date + time + destination)
    for (dep_date, dep_hour, dest), group in df.groupby(['departure_date', 'departure_hour', 'destination']):
        idx = group.index

        if len(group) > 1:
            # Lag features - what was the price yesterday, 3 days ago, etc.
            df.loc[idx, 'price_lag_1'] = group['price'].shift(1)
            df.loc[idx, 'price_lag_3'] = group['price'].shift(3)
            df.loc[idx, 'price_lag_7'] = group['price'].shift(7)

            # Rolling statistics
            df.loc[idx, 'price_rolling_mean_3'] = group['price'].shift(1).rolling(window=3, min_periods=1).mean()
            df.loc[idx, 'price_rolling_std_3'] = group['price'].shift(1).rolling(window=3, min_periods=1).std()
            df.loc[idx, 'price_min_last_7'] = group['price'].shift(1).rolling(window=7, min_periods=1).min()
            df.loc[idx, 'price_max_last_7'] = group['price'].shift(1).rolling(window=7, min_periods=1).max()

            # Price momentum (is it trending up or down?)
            df.loc[idx, 'price_change_1d'] = group['price'].diff(1)
            df.loc[idx, 'price_change_3d'] = group['price'].diff(3)

    # Fill NaN with reasonable defaults
    overall_mean = df['price'].mean()
    overall_std = df['price'].std()

    lag_cols = ['price_lag_1', 'price_lag_3', 'price_lag_7',
                'price_rolling_mean_3', 'price_min_last_7', 'price_max_last_7']

    for col in lag_cols:
        if col in df.columns:
            df[col] = df[col].fillna(overall_mean)
        else:
            df[col] = overall_mean

    # Fill std with the overall price std, and changes with 0
    if 'price_rolling_std_3' in df.columns:
        df['price_rolling_std_3'] = df['price_rolling_std_3'].fillna(overall_std)
    else:
        df['price_rolling_std_3'] = overall_std

    for col in ['price_change_1d', 'price_change_3d']:
        if col in df.columns:
            df[col] = df[col].fillna(0)
        else:
            df[col] = 0

    if verbose:
        print(f"Price history features created")
        print(f"Average price: ${overall_mean:.2f}")
        print(f"Price std dev: ${overall_std:.2f}")

    return df


def get_feature_list() -> List[str]:
    """
    Returns a predefined list of feature names to be used for model training.

    These features are grouped into several categories essential for predicting
    flight prices: booking characteristics, temporal seasonality, flight
    attributes, and critical price momentum history.

    Parameters
    ----------
    None

    Returns
    -------
    List[str]
        A list of string names corresponding to the columns in the engineered
        DataFrame that should be used as input features (X) for the model.
    """
    return [
        # Booking window
        'days_until_departure',
        'weeks_until_departure',

        # Time features
        'search_day_of_week',
        'departure_day_of_week',
        'departure_hour',
        'departure_month',

        # Categorical
        'is_weekend_departure',
        'is_friday_departure',
        'is_monday_departure',
        'is_morning',
        'is_afternoon',
        'is_evening',
        'is_last_minute',
        'is_early_booker',

        # Flight characteristics
        'duration_hours',
        'is_direct_flight',
        'num_stops',
        'is_lax',
        'airline_encoded',

        # Price history
        'price_lag_1',
        'price_lag_3',
        'price_rolling_mean_3',
        'price_rolling_std_3',
        'price_min_last_7',
        'price_max_last_7',
        'price_change_1d',
    ]

# 🧠 STEP 5: Model Training and Cross-Validation

The `train_price_model` function is the core of the forecasting engine. It selects a powerful **ensemble model** (defaulting to LightGBM), configures its hyperparameters, and ensures robust training using a dedicated time-series validation technique.

-----

## 🚀 Time Series Cross-Validation (TSCV)

Since flight price data is a time-series, standard $k$-fold cross-validation is inappropriate as it would allow the model to train on **future** data to predict the **past**, leading to data leakage and overly optimistic scores.

We use **TimeSeriesSplit** to ensure strict temporal separation:

1.  Each validation fold is always **chronologically later** than its corresponding training fold.
2.  The training set grows with each subsequent fold.
3.  This methodology accurately simulates real-world deployment, where the model must generalize to unseen, future data.
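
The expanding-window behaviour described in the list can be inspected directly with scikit-learn's `TimeSeriesSplit`. The sketch below uses twelve placeholder samples and `n_splits=3`; both values are assumptions chosen for readability, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical samples, assumed to be already ordered by search date.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    # Training indices always end before the validation indices begin.
    print(
        f"Fold {fold}: train={train_idx.min()}..{train_idx.max()} "
        f"({len(train_idx)} rows), validate={val_idx.min()}..{val_idx.max()}"
    )
```

The splitter works on row positions, so the guarantee only holds if the rows are sorted chronologically before splitting.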

## 🛠️ Model and Training Details

| Component | Detail | Rationale |
| :--- | :--- | :--- |
| **Model Type** | **LightGBM Regressor** (Default) | Chosen for its speed and strong performance on heterogeneous tabular data; gradient-boosted trees often outperform scikit-learn baselines such as `RandomForestRegressor` on this kind of problem. |
| **Hyperparameters** | `n_estimators=150`, `learning_rate=0.05`, `max_depth=6` | A moderate configuration intended to balance accuracy against overfitting. |
| **Data Handling** | The feature data (`X`) is sorted by `search_date` for correct TSCV splits. Remaining NaN values are imputed with the column mean. | Ensures the temporal sequence is maintained and prevents crashes due to residual missing data. |
| **Metrics** | **MAE, RMSE, MAPE, and R²** | Standard regression metrics are calculated for both in-sample (training) and out-of-sample (CV average) performance to check for overfitting. |
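
For reference, the four metrics can be written out with NumPy. This is a sketch of the standard definitions rather than the notebook's `evaluate_model()`: the helper name `regression_metrics` and the sample prices are hypothetical, and MAPE is expressed here in percent (scikit-learn's `mean_absolute_percentage_error` returns a fraction instead).

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, MAPE (in percent) and R² for two price arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    mape = np.mean(np.abs(errors / y_true)) * 100  # assumes no zero prices

    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


# Hypothetical actual and predicted prices.
metrics = regression_metrics([100, 200, 300], [110, 190, 300])
print(metrics)
```

Comparing these values between the training data and the cross-validation folds is what reveals overfitting: a large gap suggests the model has memorized the training set.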

### Code

```python
def train_price_model(df: pd.DataFrame, model_type: str = 'lightgbm', verbose: bool = True) -> Dict[str, Any]:
    # ... implementation details
    return model_artifacts
```

-----

## 📈 Feature Importance (LightGBM)

A key advantage of using a tree-based model like LightGBM is the ability to analyze feature importance. The results consistently show that the **Time-Series Price History Features** are the most predictive of the final price.

| Rank | Feature Name | Description |
| :--- | :--- | :--- |
| **1.** | `price_lag_1` | The price observed yesterday is the best predictor of today's price. |
| **2.** | `price_rolling_mean_3` | The recent average price trend. |
| **3.** | `days_until_departure` | How far in advance the flight is booked (critical factor). |
| **4.** | `price_rolling_std_3` | The volatility in the price over the last 3 days. |
| **5.** | `departure_hour` | Price variations based on the time of day the flight leaves. |

-----

In [5]:
def train_price_model(df: pd.DataFrame, model_type: str = 'lightgbm',
                     verbose: bool = True) -> Dict[str, Any]:
    """
    Trains a regression model to predict flight prices using time-series cross-validation.

    This function prepares the data, selects and configures a machine learning model
    (LightGBM, Gradient Boosting, or Random Forest), trains it using a temporal
    split to simulate real-world conditions, evaluates its performance, and returns
    the trained model and performance artifacts.

    Parameters
    ----------
    df : pd.DataFrame
        The fully engineered DataFrame containing the 'price' column (target) and
        all necessary feature columns generated in previous steps.
    model_type : str, optional
        The type of model to train. Must be one of 'lightgbm', 'gradient_boosting',
        or 'random_forest'. The default is 'lightgbm'.
    verbose : bool, optional
        If True, prints detailed training statistics, cross-validation scores for
        each fold, final performance metrics, and the top feature importances.
        The default is True.

    Returns
    -------
    Dict[str, Any]
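        A dictionary containing the trained model object and various artifacts:
        'model' (The fitted model object),
        'model_type' (Name of the model family that was trained),
        'feature_cols' (List of features used),
        'baseline_stats' (Mean, std, min, max, and median of the target price),
        'cv_metrics' (Average out-of-sample metrics),
        'train_metrics' (In-sample training metrics),
        'training_samples' (Number of rows used for the final fit),
        'feature_importance' (Dictionary of feature scores, if available).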

    Raises
    ------
    ValueError
        If any feature returned by `get_feature_list()` is missing from the input
        DataFrame or if an unknown `model_type` is specified.
    """
    if verbose:
        print(f"\n{'='*70}")
        print("STEP 5: MODEL TRAINING")
        print(f"{'='*70}")

    feature_cols = get_feature_list()

    # Check all features exist
    missing_features = [f for f in feature_cols if f not in df.columns]
    if missing_features:
        raise ValueError(f"Missing features: {missing_features}")

    X = df[feature_cols]
    y = df['price']

    # Handle any remaining NaN
    if X.isnull().any().any():
        if verbose:
            print("⚠️  Warning: Filling remaining NaN values")
        X = X.fillna(X.mean())

    if verbose:
        print(f"\nTraining data:")
        print(f"  Samples: {len(X)}")
        print(f"  Features: {len(feature_cols)}")
        print(f"  Price range: ${y.min():.2f} - ${y.max():.2f}")
        print(f"  Price mean: ${y.mean():.2f}")
        print(f"  Price std: ${y.std():.2f}")

    # Build model
    if model_type == 'lightgbm':
        model = lgb.LGBMRegressor(
            n_estimators=150,
            learning_rate=0.05,
            max_depth=6,
            num_leaves=25,
            min_child_samples=15,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_alpha=0.1,
            reg_lambda=0.1,
            random_state=42,
            verbose=-1
        )
    elif model_type == 'gradient_boosting':
        model = GradientBoostingRegressor(
            n_estimators=150,
            learning_rate=0.05,
            max_depth=6,
            min_samples_split=15,
            subsample=0.8,
            random_state=42
        )
    elif model_type == 'random_forest':
        model = RandomForestRegressor(
            n_estimators=150,
            max_depth=10,
            min_samples_split=10,
            random_state=42,
            n_jobs=-1
        )
    else:
        raise ValueError(f"Unknown model type: {model_type}")

    # Time series cross-validation
    # Aim for roughly 50 samples per fold; TimeSeriesSplit needs at least 2 splits
    n_splits = max(2, min(5, len(df) // 50))
    tscv = TimeSeriesSplit(n_splits=n_splits)

    if verbose:
        print(f"\nCross-validation with {n_splits} folds...")

    cv_scores = []
    for fold, (train_idx, val_idx) in enumerate(tscv.split(X), 1):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)

        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        mae = mean_absolute_error(y_val, y_pred)
        r2 = r2_score(y_val, y_pred)
        mape = mean_absolute_percentage_error(y_val, y_pred) * 100

        cv_scores.append({'rmse': rmse, 'mae': mae, 'r2': r2, 'mape': mape})

        if verbose:
            print(f"  Fold {fold}: MAE=${mae:.2f}, MAPE={mape:.1f}%, R²={r2:.3f}")

    # Final model on all data
    model.fit(X, y)
    y_pred_final = model.predict(X)

    final_metrics = {
        'rmse': np.sqrt(mean_squared_error(y, y_pred_final)),
        'mae': mean_absolute_error(y, y_pred_final),
        'r2': r2_score(y, y_pred_final),
        'mape': mean_absolute_percentage_error(y, y_pred_final) * 100
    }

    avg_cv_metrics = {
        'rmse': np.mean([s['rmse'] for s in cv_scores]),
        'mae': np.mean([s['mae'] for s in cv_scores]),
        'r2': np.mean([s['r2'] for s in cv_scores]),
        'mape': np.mean([s['mape'] for s in cv_scores])
    }

    if verbose:
        print(f"\n{'='*70}")
        print("MODEL PERFORMANCE")
        print(f"{'='*70}")
        print(f"\nCross-Validation (Out-of-Sample):")
        print(f"  MAE:  ${avg_cv_metrics['mae']:.2f}")
        print(f"  RMSE: ${avg_cv_metrics['rmse']:.2f}")
        print(f"  MAPE: {avg_cv_metrics['mape']:.1f}%")
        print(f"  R²:   {avg_cv_metrics['r2']:.3f}")

        print(f"\nTraining Set:")
        print(f"  MAE:  ${final_metrics['mae']:.2f}")
        print(f"  MAPE: {final_metrics['mape']:.1f}%")
        print(f"  R²:   {final_metrics['r2']:.3f}")

        # Feature importance
        if hasattr(model, 'feature_importances_'):
            importances = sorted(zip(feature_cols, model.feature_importances_),
                               key=lambda x: x[1], reverse=True)
            print(f"\nTop 10 Features:")
            for i, (feat, imp) in enumerate(importances[:10], 1):
                print(f"  {i:2d}. {feat:30s} {imp:.4f}")

    # Calculate baseline stats
    baseline_stats = {
        'mean': y.mean(),
        'std': y.std(),
        'min': y.min(),
        'max': y.max(),
        'median': y.median()
    }

    # Store model artifacts
    model_artifacts = {
        'model': model,
        'model_type': model_type,
        'feature_cols': feature_cols,
        'baseline_stats': baseline_stats,
        'cv_metrics': avg_cv_metrics,
        'train_metrics': final_metrics,
        'training_samples': len(X),
        'feature_importance': dict(zip(feature_cols, model.feature_importances_)) if hasattr(model, 'feature_importances_') else {}
    }

    return model_artifacts

# 🧪 STEP 6 & 7: Prediction and Evaluation

Once the model is trained, the next steps involve generating predictions and rigorously quantifying the model's performance. This ensures the predictions are accurate before they are used to make business decisions (deal detection).

-----

## 6\. Generating Predictions (`predict_prices`)

This function uses the saved model to generate forecasts and performs a critical post-processing step: **clipping**.

### The Clipping Mechanism

The model's predictions are clipped to a reasonable range based on the **baseline statistics** of the target variable (price). This prevents the model from generating financially impossible or "wild" forecasts, especially when encountering extreme or unusual feature combinations in new data.

$$
\text{Lower Bound} = \text{Mean Price} \times 0.4
$$

$$
\text{Upper Bound} = \text{Mean Price} \times 2.5
$$

Any prediction falling outside this range is forced to the corresponding boundary.

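The bounds above translate directly into a call to `np.clip`. The sketch below uses a hypothetical mean training price of $200 and three made-up raw predictions to show which values get pulled back to a boundary.

```python
import numpy as np

# Hypothetical baseline: mean training price of $200.
mean_price = 200.0
lower_bound = mean_price * 0.4   # about $80
upper_bound = mean_price * 2.5   # $500

# Made-up raw model outputs: one too low, one in range, one too high.
raw_predictions = np.array([50.0, 210.0, 640.0])
clipped = np.clip(raw_predictions, lower_bound, upper_bound)

print(clipped)
```
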
### Calculated Error Metrics

The output DataFrame is augmented with the core components required for evaluation and deal detection:

| Column Name | Calculation | Purpose |
| :--- | :--- | :--- |
| `predicted_price` | `model.predict(X)`, then clipped | The model's forecast after clipping to the bounds above. |
| `prediction_error` | `price - predicted_price` | The signed difference (in dollars) between actual and predicted price; a negative value means the offer is cheaper than predicted. |
| `prediction_error_pct` | $(\text{error} / \text{price}) \times 100$ | The signed percentage difference relative to the actual price, which is the primary driver for deal detection. |

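A small worked example may help with the sign convention. The two offers below are made up; the column names match the table above, and the percentage is taken relative to the actual price, as in the table's formula.

```python
import pandas as pd

offers = pd.DataFrame({
    "price": [180.0, 250.0],            # observed prices (made-up values)
    "predicted_price": [200.0, 200.0],  # model forecasts (made-up values)
})

# Negative error: the offer is cheaper than the model expected.
offers["prediction_error"] = offers["price"] - offers["predicted_price"]
offers["prediction_error_pct"] = offers["prediction_error"] / offers["price"] * 100

print(offers.round(2))
```
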
-----

## 7\. Model Evaluation (`evaluate_model`)

The `evaluate_model` function provides a comprehensive performance review by calculating key regression metrics.

### Key Regression Metrics

- **MAE** (Mean Absolute Error)
- **RMSE** (Root Mean Squared Error)
- **MAPE** (Mean Absolute % Error)
- **R²** (Coefficient of Determination)

### Code

```python
def evaluate_model(df_with_predictions: pd.DataFrame, verbose: bool = True) -> Dict[str, float]:
    # ... implementation details
    # Example output:
    # MAE: $15.50
    # MAPE: 7.2%
    # R²: 0.945
    return metrics
```

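For readers who want to see the four metrics side by side, here is a minimal sketch on three made-up price pairs using the same scikit-learn functions the notebook calls. The arrays are illustrative only and are not results from the model.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

actual = np.array([100.0, 200.0, 300.0])      # made-up observed prices
predicted = np.array([110.0, 190.0, 300.0])   # made-up forecasts

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
mape = mean_absolute_percentage_error(actual, predicted) * 100  # as a percentage
r2 = r2_score(actual, predicted)

print(f"MAE=${mae:.2f}, RMSE=${rmse:.2f}, MAPE={mape:.1f}%, R²={r2:.3f}")
```
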
In [6]:
def predict_prices(model_artifacts: Dict[str, Any], df: pd.DataFrame,
                  verbose: bool = False) -> pd.DataFrame:
    """
    Generates price predictions on new, unseen, or validation data using the trained model.

    The function applies necessary preprocessing (feature selection, NaN handling)
    and prediction, includes a clipping mechanism to ensure predictions remain
    within a reasonable, bounds-checked range, and appends the predictions and
    error metrics back to the input DataFrame.

    Parameters
    ----------
    model_artifacts : Dict[str, Any]
        A dictionary containing the trained model and associated metadata,
        typically the output of the `train_price_model` function. Must include
        'model' (the fitted model object), 'feature_cols' (list of features
        used in training), and 'baseline_stats' (for clipping bounds).
    df : pd.DataFrame
        The DataFrame containing the new data to make predictions on. Must
        contain all columns specified in `feature_cols`, plus the observed
        'price' column, which is needed to compute the error columns.
    verbose : bool, optional
        If True, prints a summary of the prediction run, including the total
        number of predictions and the average prediction error statistics.
        The default is False.

    Returns
    -------
    pd.DataFrame
        A copy of the input DataFrame augmented with three new columns:
        'predicted_price', 'prediction_error', and 'prediction_error_pct'.

    Raises
    ------
    KeyError
        If required keys ('model', 'feature_cols', 'baseline_stats') are missing
        from `model_artifacts`.
    """
    model = model_artifacts['model']
    feature_cols = model_artifacts['feature_cols']
    baseline = model_artifacts['baseline_stats']

    X = df[feature_cols]

    # Handle NaN
    if X.isnull().any().any():
        X = X.fillna(X.mean())

    # Predict
    predictions = model.predict(X)

    # Clip to reasonable range (prevent wild predictions)
    lower_bound = baseline['mean'] * 0.4
    upper_bound = baseline['mean'] * 2.5
    predictions = np.clip(predictions, lower_bound, upper_bound)

    # Add to dataframe
    result = df.copy()
    result['predicted_price'] = predictions
    result['prediction_error'] = result['price'] - result['predicted_price']
    result['prediction_error_pct'] = (result['prediction_error'] / result['price']) * 100

    if verbose:
        print(f"\nPredictions generated for {len(result)} offers")
        print(f"Average error: ${result['prediction_error'].abs().mean():.2f}")
        print(f"Average error %: {result['prediction_error_pct'].abs().mean():.1f}%")

    return result


def evaluate_model(df_with_predictions: pd.DataFrame, verbose: bool = True) -> Dict[str, float]:
    """
    Calculates and reports key regression metrics to evaluate the model's predictive performance.

    The evaluation uses standard metrics that quantify both the magnitude of the
    errors (MAE, RMSE, Median Error) and the model's overall fit (R²), as well
    as an easily interpretable percentage error (MAPE).

    Parameters
    ----------
    df_with_predictions : pd.DataFrame
        The DataFrame containing the true 'price' (actual target value), the
        'predicted_price', and the 'prediction_error' columns, typically the
        output of the `predict_prices` function.
    verbose : bool, optional
        If True, prints a header and a formatted summary of all calculated
        metrics to the console. The default is True.

    Returns
    -------
    Dict[str, float]
        A dictionary containing the calculated evaluation metrics:
        - 'mae' (Mean Absolute Error, in currency units)
        - 'rmse' (Root Mean Squared Error, in currency units)
        - 'mape' (Mean Absolute Percentage Error, as a percentage)
        - 'r2' (Coefficient of Determination, unitless)
        - 'median_error' (Median Absolute Error, in currency units)
    """
    actual = df_with_predictions['price']
    predicted = df_with_predictions['predicted_price']

    metrics = {
        'mae': mean_absolute_error(actual, predicted),
        'rmse': np.sqrt(mean_squared_error(actual, predicted)),
        'mape': mean_absolute_percentage_error(actual, predicted) * 100,
        'r2': r2_score(actual, predicted),
        'median_error': df_with_predictions['prediction_error'].abs().median(),
    }

    if verbose:
        print(f"\n{'='*70}")
        print("EVALUATION RESULTS")
        print(f"{'='*70}")
        print(f"MAE:  ${metrics['mae']:.2f}")
        print(f"RMSE: ${metrics['rmse']:.2f}")
        print(f"MAPE: {metrics['mape']:.1f}%")
        print(f"R²:   {metrics['r2']:.3f}")
        print(f"Median Absolute Error: ${metrics['median_error']:.2f}")

    return metrics

# 🏆 STEP 8: Deal Detection and Actionable Insights

The final step of the pipeline translates the model's prediction error into a clear, actionable recommendation. This is the **business intelligence layer** of the forecasting solution.

-----

## `detect_deals` Function

The function categorizes every flight offer based on its `prediction_error_pct`, the percentage difference between the actual observed price and the model's predicted fair market price.

A **negative error percentage** means the actual price is *below* the predicted price, indicating a potential deal.

### Deal Classification Logic

| Category | `prediction_error_pct` | Recommendation | Actionable Insight |
| :--- | :--- | :--- | :--- |
| **EXCELLENT** | $\le \text{excellent\_threshold}$ (e.g., $-12.0\%$) | **🔥 BUY NOW** | Price is significantly below the fair market value. |
| **GOOD** | $> \text{excellent\_threshold}$ and $\le \text{good\_threshold}$ (e.g., $-12.0\%$ to $-7.0\%$) | **✅ STRONG BUY** | Price is moderately below the fair market value. |
| **FAIR** | Between `good_threshold` and `overpriced_threshold` (e.g., $-7.0\%$ to $+10.0\%$) | **➖ NEUTRAL** | Price is within the expected range; purchase if needed. |
| **OVERPRICED** | $\ge \text{overpriced\_threshold}$ (e.g., $+10.0\%$) | **❌ WAIT** | Price is significantly above the fair market value; wait for a potential drop. |

This process transforms a technical output (a percentage error) into an intuitive, final decision column (`recommendation`) for end-users or stakeholders.

### Code

```python
def detect_deals(df_with_predictions: pd.DataFrame,
                 excellent_threshold: float = -12.0,
                 good_threshold: float = -7.0,
                 overpriced_threshold: float = 10.0,
                 verbose: bool = True) -> pd.DataFrame:
    # ... implementation details
    return df
```

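The classification rules in the table can also be expressed in a single vectorized call. The sketch below uses `np.select`, which is a different technique from the sequential `.loc` assignments in the notebook's implementation, with the default thresholds and a handful of made-up error percentages that include the boundary values.

```python
import numpy as np
import pandas as pd

# Made-up error percentages, including the exact threshold values.
pct = pd.Series([-15.0, -12.0, -9.0, -7.0, 0.0, 10.0, 12.0])

# np.select picks the first condition that matches, so order matters:
# EXCELLENT is tested before GOOD, and anything unmatched falls back to FAIR.
conditions = [pct <= -12.0, pct <= -7.0, pct >= 10.0]
choices = ["EXCELLENT", "GOOD", "OVERPRICED"]
categories = pd.Series(np.select(conditions, choices, default="FAIR"))

print(categories.tolist())
```
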
In [7]:
def detect_deals(df_with_predictions: pd.DataFrame,
                excellent_threshold: float = -12.0,
                good_threshold: float = -7.0,
                overpriced_threshold: float = 10.0,
                verbose: bool = True) -> pd.DataFrame:
    """
    Categorizes flight offers as 'EXCELLENT', 'GOOD', 'FAIR', or 'OVERPRICED'
    based on the percentage difference between the actual observed price and the
    model's predicted price.

    This function is the final step in the pipeline, translating the model's
    output into actionable business intelligence (i.e., buy/wait recommendations).

    Parameters
    ----------
    df_with_predictions : pd.DataFrame
        The DataFrame containing flight offers, including the 'price' (actual)
        and 'predicted_price' columns, typically the output of the
        `predict_prices` function. Must contain the 'prediction_error_pct' column.
    excellent_threshold : float, optional
        The percentage error threshold (e.g., -12.0) at or below which an offer
        is classified as 'EXCELLENT' (the actual price undercuts the prediction
        by at least 12% of the actual price). The default is -12.0.
    good_threshold : float, optional
        The percentage error threshold (e.g., -7.0) at or below which an offer
        is classified as 'GOOD', provided it is above `excellent_threshold`.
        Must be greater than `excellent_threshold` and less than
        `overpriced_threshold`. The default is -7.0.
    overpriced_threshold : float, optional
        The positive percentage error threshold (e.g., 10.0) at or above which
        an offer is classified as 'OVERPRICED' (the actual price exceeds the
        prediction by at least 10% of the actual price). The default is 10.0.
    verbose : bool, optional
        If True, prints a summary of the deal distribution and a list of the
        top 5 best deals found. The default is True.

    Returns
    -------
    pd.DataFrame
        The input DataFrame augmented with two new columns: 'deal_category'
        (the qualitative classification) and 'recommendation' (the actionable
        advice, e.g., '🔥 BUY NOW').

    Notes
    -----
    A large negative error means the actual price is far below the predicted (a good deal).
    """
    df = df_with_predictions.copy()

    # Calculate deal category
    df['deal_category'] = 'FAIR'
    df.loc[df['prediction_error_pct'] <= excellent_threshold, 'deal_category'] = 'EXCELLENT'
    df.loc[(df['prediction_error_pct'] > excellent_threshold) &
           (df['prediction_error_pct'] <= good_threshold), 'deal_category'] = 'GOOD'
    df.loc[df['prediction_error_pct'] >= overpriced_threshold, 'deal_category'] = 'OVERPRICED'

    # Add recommendations
    recommendations = {
        'EXCELLENT': '🔥 BUY NOW',
        'GOOD': '✅ STRONG BUY',
        'FAIR': '➖ NEUTRAL',
        'OVERPRICED': '❌ WAIT'
    }

    df['recommendation'] = df['deal_category'].map(recommendations)

    if verbose:
        print(f"\n{'='*70}")
        print("DEAL DETECTION SUMMARY")
        print(f"{'='*70}")
        print(f"\nDeal distribution:")
        print(df['deal_category'].value_counts())
        print(f"\nBest deals (top 5):")
        best = df.nsmallest(5, 'prediction_error_pct')[
            ['departure_date', 'primary_airline', 'price', 'predicted_price',
             'prediction_error_pct', 'deal_category']
        ]
        print(best.to_string(index=False))

    return df

# 💾 STEP 9: Model Persistence (Save & Load)

The final step in the modeling pipeline is to **persist** the trained model and all associated artifacts to disk. This is crucial for **deployment**, as it allows the entire forecasting system to be reloaded and used for new predictions without retraining the model every time.

-----

## `save_model`

The `save_model` function uses Python's built-in **`pickle`** module to serialize the entire `model_artifacts` dictionary (which includes the fitted LightGBM object, feature list, and performance metrics) into a binary file (typically ending in `.pkl`).

This single file contains everything needed to reproduce the model's predictions.

### Key Details

  * **Mechanism:** `pickle.dump()`
  * **Purpose:** To turn the Python object (the model dictionary) into a byte stream for storage.
  * **Input:** The full artifact dictionary from the training step.
  * **Output:** A `.pkl` file on disk.

```python
def save_model(model_artifacts: Dict[str, Any], filepath: str):
    # ... implementation details
    print(f"\n✓ Model saved to {filepath}")
```

-----

## `load_model`

The `load_model` function performs the reverse operation, reading the binary `.pkl` file from disk and reconstructing the Python object in memory.

### Key Details

  * **Mechanism:** `pickle.load()`
  * **Purpose:** To deserialize the byte stream back into the original Python dictionary object.
  * **Input:** The file path of the `.pkl` file.
  * **Output:** The complete `model_artifacts` dictionary, ready to be passed to the `predict_prices` function.

```python
def load_model(filepath: str) -> Dict[str, Any]:
    # ... implementation details
    print(f"✓ Model loaded from {filepath}")
    return model_artifacts
```

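A quick round trip shows that `pickle.dump()` followed by `pickle.load()` restores an equal object. The dictionary below is a small stand-in with hypothetical values rather than real model artifacts, and the file is written to a temporary directory. Note that unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.

```python
import os
import pickle
import tempfile

# Stand-in for the real artifact dictionary (values are hypothetical).
artifacts = {
    "feature_cols": ["price_lag_1", "days_until_departure"],
    "baseline_stats": {"mean": 200.0},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Serialize to disk, then read the bytes back into a new object.
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifacts)
```
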
In [8]:
# ============================================================================
# SAVE AND LOAD FUNCTIONS
# ============================================================================

def save_model(model_artifacts: Dict[str, Any], filepath: str):
    """
    Saves the complete dictionary of model artifacts (including the fitted model
    object, features, and metrics) to disk using Python's `pickle` module.

    This ensures the model can be persisted and later reloaded for prediction
    or deployment without needing to retrain it.

    Parameters
    ----------
    model_artifacts : Dict[str, Any]
        A dictionary containing all components necessary for prediction,
        typically the output of the `train_price_model` function. This dictionary
        must contain the fitted model object itself.
    filepath : str
        The full path and filename (including the extension, e.g., '.pkl')
        where the model artifacts should be saved.

    Returns
    -------
    None
        Prints a confirmation message upon successful saving.
    """
    with open(filepath, 'wb') as f:
        pickle.dump(model_artifacts, f)
    print(f"\n✓ Model saved to {filepath}")


def load_model(filepath: str) -> Dict[str, Any]:
    """
    Loads the complete dictionary of model artifacts from a file on disk
    that was previously saved using `pickle`.

    Parameters
    ----------
    filepath : str
        The full path and filename of the pickled model artifact file (e.g., '.pkl').

    Returns
    -------
    Dict[str, Any]
        The loaded dictionary containing the fitted model, feature columns,
        and performance metrics. Prints a confirmation message upon successful loading.
    """
    with open(filepath, 'rb') as f:
        model_artifacts = pickle.load(f)
    print(f"✓ Model loaded from {filepath}")
    return model_artifacts


# 🛠️ COMPLETE PIPELINE ORCHESTRATION

The `run_complete_pipeline` function serves as the **top-level orchestrator**, coordinating every task from raw data ingestion to final model persistence and deal recommendation. It executes each step sequentially and passes the resulting data and artifacts to the next stage.

-----

## `run_complete_pipeline` Function

This function executes the complete machine learning workflow for the Flight Price Forecasting project.

### Workflow Summary

| Step \# | Action | Function Called | Key Outcome |
| :---: | :--- | :--- | :--- |
| **1** | Data Ingestion & Cleaning | `get_data`, `clean_raw_data` | Unified, journey-level flight records. |
| **2** | Data Filtering | `filter_target_route` | Focus on target origin/destination pairs (e.g., IAH → LAX/ONT). |
| **3** | Standard Feature Engineering | `engineer_features` | Creation of temporal, flight, and booking window features. |
| **4** | Time-Series Features | `create_price_history_features` | Generation of lagged prices, rolling means, and price change features. |
| **5** | Model Training | `train_price_model` | A fully fitted LightGBM model object. |
| **6** | Prediction | `predict_prices` | DataFrame augmented with `predicted_price` and error columns. |
| **7** | Evaluation | `evaluate_model` | Printed performance metrics (MAE, MAPE, R²). |
| **8** | Deal Detection | `detect_deals` | Final, actionable DataFrame with `deal_category` and `recommendation`. |
| **9** | Model Saving | `save_model` | Persisted model artifacts on disk (`flight_model.pkl`). |

### Code

```python
def run_complete_pipeline(raw_data_path: str, save_path: str = 'flight_model.pkl'):
    """
    Executes the complete flight price forecasting machine learning pipeline
    from raw data ingestion through model training, evaluation, deal detection,
    and final model serialization.
    """
    # ... implementation details ...
    
    # Example Call Sequence:
    # df_raw = get_data(raw_data_path)
    # df_clean = clean_raw_data(df_raw, verbose=True)
    # ...
    # model_artifacts = train_price_model(df_final, model_type='lightgbm', verbose=True)
    # ...
    # save_model(model_artifacts, save_path)
    
    return model_artifacts, df_deals
```

-----

### Final Output

The function concludes by printing a **PIPELINE COMPLETE** message and returns the two core outputs:

1.  **`model_artifacts`**: The complete trained model and its metadata for deployment.
2.  **`df_deals`**: The finalized dataset containing all the data, the model's predictions, and the recommended purchasing action for each flight offer.

In [11]:
# ============================================================================
# COMPLETE PIPELINE
# ============================================================================

def run_complete_pipeline(raw_data_path: str, save_path: str = 'flight_model.pkl'):
    """
    Executes the complete flight price forecasting machine learning pipeline
    from raw data ingestion through model training, evaluation, deal detection,
    and final model serialization.

    This function coordinates all preceding steps: cleaning, filtering, feature
    engineering (including creation of price history features), training a
    LightGBM model using time-series cross-validation, applying the model to
    detect deals, and saving the resulting artifacts.

    Parameters
    ----------
    raw_data_path : str
        The path to the raw data (in the example run, a directory of daily
        CSV scrapes), which is passed to `get_data` for loading.
    save_path : str, optional
        The path and filename where the trained model and associated artifacts
        will be pickled and saved. The default is 'flight_model.pkl'.

    Returns
    -------
    Tuple[Dict[str, Any], pd.DataFrame]
        A tuple containing:
        1. model_artifacts (Dict[str, Any]): The dictionary containing the
           fitted model, feature list, and performance metrics.
        2. df_deals (pd.DataFrame): The final DataFrame with all features,
           actual prices, predicted prices, error metrics, and the final
           'deal_category' and 'recommendation' columns.

    Process Steps
    -------------
    1. Load and Clean Data: Standardize data types, handle missing values.
    2. Filter Routes: Select target origin-destination pairs (e.g., IAH -> LAX/ONT).
    3. Engineer Features: Create temporal, booking window, and flight characteristic features.
    4. Create Price History Features: Generate lag, rolling mean, and price momentum features.
    5. Train Model: Fit the LightGBM model using TimeSeriesSplit cross-validation.
    6. Predict Prices: Generate predictions on the training dataset.
    7. Evaluate Model: Calculate and print metrics (MAE, RMSE, MAPE, R²).
    8. Detect Deals: Classify offers based on the error percentage (e.g., 'EXCELLENT', 'OVERPRICED').
    9. Save Model: Persist the model artifacts to disk.
    """
    print(f"\n{'#'*70}")
    print("FLIGHT PRICE FORECASTING PIPELINE")
    print(f"{'#'*70}")

    # Step 1: Load and clean
    print("\nLoading raw data...")

    df_raw = get_data(raw_data_path)
    df_clean = clean_raw_data(df_raw, verbose=True)

    # Step 2: Filter routes
    df_filtered = filter_target_route(df_clean, verbose=True)

    # Step 3: Engineer features
    df_features = engineer_features(df_filtered, verbose=True)

    # Step 4: Create price history features
    df_final = create_price_history_features(df_features, verbose=True)

    # Step 5: Train model
    model_artifacts = train_price_model(df_final, model_type='lightgbm', verbose=True)

    # Step 6: Make predictions on training data
    df_with_pred = predict_prices(model_artifacts, df_final, verbose=True)

    # Step 7: Evaluate
    evaluate_model(df_with_pred, verbose=True)

    # Step 8: Detect deals
    df_deals = detect_deals(df_with_pred, verbose=True)

    # Step 9: Save model
    save_model(model_artifacts, save_path)

    print(f"\n{'#'*70}")
    print("PIPELINE COMPLETE")
    print(f"{'#'*70}")

    return model_artifacts, df_deals


# 🚀 EXAMPLE USAGE

This final block demonstrates how to execute the entire end-to-end pipeline and, critically, how to use the saved model for future predictions in a deployment scenario.

-----

## 1\. Running the Complete Pipeline

The `if __name__ == "__main__":` block is the standard entry point in Python for running scripts. It orchestrates the entire process from data ingestion through training and saving.

```python
if __name__ == "__main__":
    # Run the complete pipeline
    model, results = run_complete_pipeline(
        raw_data_path='Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw',
        save_path='flight_price_model_v3.pkl'
    )
    
    print("\n✓ Ready for deployment!")
```

-----

## 2\. Deployment Flow (Inference on New Data)

Once the model has been trained and saved, the pipeline transitions from **training mode** to **inference mode**.

For new, incoming daily data, the model does *not* need to be retrained. Instead, the new data must be processed through the exact same feature engineering steps used during training before generating predictions.

| Step | Action | Function | Rationale |
| :--- | :--- | :--- | :--- |
| **1. Load Model** | Retrieve the trained model object. | `load_model()` | Uses the saved `.pkl` file. |
| **2. Clean Data** | Consolidate segments into journeys. | `clean_raw_data()` | Maintains data integrity and granularity. |
| **3. Filter** | Focus on the modeled route(s). | `filter_target_route()` | Ensures consistency with training data scope. |
| **4. Features** | Create temporal/flight features. | `engineer_features()` | Adds simple, static features. |
| **5. History** | Create price lag/rolling stats. | `create_price_history_features()` | **Crucial:** Recalculates time-series features based on the history up to the current day. |
| **6. Predict** | Generate the forecast. | `predict_prices()` | Applies the fitted model. |
| **7. Deals** | Classify the offers. | `detect_deals()` | Translates prediction error into actionable advice (e.g., **🔥 BUY NOW**). |

```python
print("To use the model:")
print("   1. Load: model = load_model('flight_price_model_v3.pkl')")
print("   2. Clean new data: df_clean = clean_raw_data(df_raw)")
print("   3. Filter: df_filtered = filter_target_route(df_clean)")
print("   4. Features: df_features = engineer_features(df_filtered)")
print("   5. History: df_final = create_price_history_features(df_features)")
print("   6. Predict: df_pred = predict_prices(model, df_final)")
print("   7. Deals: df_deals = detect_deals(df_pred)")
```

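The seven inference steps are a chain of DataFrame-to-DataFrame transformations, so one possible way to wire them together is a small helper that applies a list of steps in order. `run_inference` and the two toy steps below are hypothetical and exist only so the sketch runs on its own; in the notebook, the list would hold the real functions from the table above, with `predict_prices` wrapped so that it receives the loaded artifacts.

```python
from typing import Callable, Sequence

import pandas as pd

Step = Callable[[pd.DataFrame], pd.DataFrame]


def run_inference(df_raw: pd.DataFrame, steps: Sequence[Step]) -> pd.DataFrame:
    """Apply each step in order, feeding its output to the next (hypothetical helper)."""
    df = df_raw
    for step in steps:
        df = step(df)
    return df


# Toy stand-ins for the real cleaning and feature steps.
def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()


def add_flag(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(is_cheap=df["price"] < 200)


raw = pd.DataFrame({"price": [150.0, None, 320.0]})
demo = run_inference(raw, [drop_missing, add_flag])
print(demo)
```
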
In [14]:

# ============================================================================
# EXAMPLE USAGE
# ============================================================================
if __name__ == "__main__":
    # Run the complete pipeline
    model, results = run_complete_pipeline(
        raw_data_path='Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw',
        save_path='flight_price_model_v3.pkl'
    )

    print("\n✓ Ready for deployment!")
    print("\nTo use the model:")
    print("  1. Load: model = load_model('flight_price_model_v3.pkl')")
    print("  2. Clean new data: df_clean = clean_raw_data(df_raw)")
    print("  3. Filter: df_filtered = filter_target_route(df_clean)")
    print("  4. Features: df_features = engineer_features(df_filtered)")
    print("  5. History: df_final = create_price_history_features(df_features)")
    print("  6. Predict: df_pred = predict_prices(model, df_final)")
    print("  7. Deals: df_deals = detect_deals(df_pred)")


```
######################################################################
FLIGHT PRICE FORECASTING PIPELINE
######################################################################

Loading raw data...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-03.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-04.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-05.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-06.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-07.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-08.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-09.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-10.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-11.csv...
Loading Airline-Flight-Price-Analysis-with-APIs-Azure/data/raw/2025-10-12
```

## 🎉 Conclusion: From Code to Impact

We have successfully constructed and validated a complete **Flight Price Forecasting Pipeline**. This project moved from messy, segment-level raw data through rigorous feature engineering and robust modeling, resulting in an actionable deal detection system.

The core achievement of this pipeline is a model that predicts the fair market price of a flight with a low **Mean Absolute Percentage Error (MAPE)**, measured under **Time Series Cross-Validation (TSCV)** so that every validation fold contains only data later than its training data. This accuracy lets the system issue reliable recommendations: **"🔥 BUY NOW"** or **"❌ WAIT"**.
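To make the metric concrete, here is a minimal, dependency-free sketch of MAPE and of an expanding-window split of the kind TSCV relies on. Both function names are hypothetical, and the splitter is a simplified stand-in for whatever splitter the notebook actually uses; it assumes the rows are already sorted by observation date.

```python
from typing import Iterator, List, Sequence, Tuple


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent; assumes no actual price is zero."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def expanding_window_splits(
    n_samples: int, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; training rows always precede test rows."""
    fold = n_samples // (n_splits + 1)
    if fold < 1:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any leftover rows.
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))
```

Because each training window ends where its test window begins, no fold is ever evaluated on prices observed before the data it was trained on.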

### Next Phase: Deployment to Azure 🚀

The next, and most critical, phase is to move this intelligence from the local notebook environment to a scalable, production-ready system. We will leverage **Microsoft Azure** to deploy the forecasting model and automate the deal detection process.

The deployment will focus on two key components:

1.  **Azure Machine Learning Service:** To run the **data preparation and feature engineering steps** (Steps 1-4) on new, incoming data daily.
2.  **Azure App Service / Azure Functions:** To host the saved model (`flight_price_model_v3.pkl`) and expose an **API endpoint**. This endpoint will take new flight offers as input and return the `predicted_price`, `prediction_error`, and the final `recommendation` in real time or near real time.
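The response shape described for the endpoint could be sketched as a plain, framework-agnostic function. This is an illustration under stated assumptions, not the deployed handler: `build_response` and `deal_threshold_pct` are hypothetical, the 10% threshold is a placeholder, and `prediction_error` is assumed to mean offer price minus predicted price (the notebook's `detect_deals()` may define it differently).

```python
from typing import Dict, Union

BUY_NOW = "🔥 BUY NOW"
WAIT = "❌ WAIT"


def build_response(
    offer_price: float,
    predicted_price: float,
    deal_threshold_pct: float = 10.0,  # hypothetical placeholder threshold
) -> Dict[str, Union[float, str]]:
    """Return the three fields the endpoint is meant to expose."""
    if predicted_price <= 0:
        raise ValueError("predicted_price must be positive")
    # Assumed convention: a negative error means the offer is cheaper than predicted.
    prediction_error = offer_price - predicted_price
    error_pct = 100.0 * prediction_error / predicted_price
    recommendation = BUY_NOW if error_pct <= -deal_threshold_pct else WAIT
    return {
        "predicted_price": predicted_price,
        "prediction_error": prediction_error,
        "recommendation": recommendation,
    }
```

Keeping this logic free of any web framework means the same function can be called from an Azure Functions handler, an App Service route, or a unit test.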

By deploying on Azure, we transform this analysis into a continuously running service that can handle massive volumes of flight data and deliver timely insights to users.
