# Flight Price Prediction - End-to-End Machine Learning Project

## Project Overview
This notebook presents a complete machine learning solution for predicting flight ticket prices. Flight prices are highly dynamic and influenced by multiple factors including departure time, airline, route, number of stops, and booking date.

## Business Problem
Airlines and travelers need accurate price predictions to:
- **Airlines**: Optimize pricing strategies and revenue management
- **Travelers**: Make informed booking decisions and find the best deals
- **Travel Agencies**: Provide price recommendations to customers

## Dataset
The dataset contains historical flight booking data with features such as airline, journey date, source, destination, route, departure/arrival times, duration, stops, and price.

## Approach
We will follow a systematic data science workflow:
1. Data Loading & Understanding
2. Data Cleaning & Preprocessing
3. Feature Engineering
4. Exploratory Data Analysis
5. Model Training & Comparison
6. Hyperparameter Tuning
7. Model Selection & Deployment

## Success Metrics
- **R² Score**: Proportion of price variance explained by the model (1.0 is a perfect fit)
- **RMSE**: Root Mean Squared Error, the typical prediction error in rupees, weighted toward large misses
- **Cross-validation**: Checks that these scores hold across several train/validation folds, not just one split
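
To make the two headline metrics concrete, here is a minimal sketch that computes RMSE and R² by hand with NumPy. The price arrays are made-up illustrative values, not rows from the dataset, and the helper names `rmse` and `r2` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices in rupees: every prediction is off by exactly 500
actual = [4000, 6000, 8000, 10000]
predicted = [4500, 5500, 8500, 9500]

print(f"RMSE: {rmse(actual, predicted):.1f}")  # 500.0
print(f"R2:   {r2(actual, predicted):.2f}")    # 0.95
```

These should agree with `mean_squared_error` and `r2_score` from scikit-learn, which the notebook uses later.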

---
**Author**: Data Science Team  
**Date**: November 2025  
**Python Version**: 3.8+

---
## 01 - Importing Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

# Date and time operations
from datetime import datetime

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Machine Learning - Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb

# Machine Learning - Metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Model persistence
import joblib
import pickle

# System operations
import zipfile
import os

print("All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"XGBoost version: {xgb.__version__}")
print(f"LightGBM version: {lgb.__version__}")

**What this does:**  
Imports all necessary libraries for data manipulation, visualization, machine learning, and model persistence.

**Why it's needed:**
- **pandas/numpy**: Data manipulation and numerical operations
- **matplotlib/seaborn**: Professional visualizations
- **scikit-learn**: Preprocessing, baseline models, and evaluation metrics
- **XGBoost/LightGBM**: State-of-the-art gradient boosting algorithms
- **joblib**: Model serialization for deployment

**Best Practice:**  
Organizing imports by category improves readability and maintainability.

---
## 02 - Load Dataset

In [None]:
# Extract dataset from zip file
zip_file_path = 'flight-fare.zip'

# Check if zip file exists
if os.path.exists(zip_file_path):
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall('data/')
    print("✓ Dataset extracted successfully!")
    print("\nExtracted files:")
    for file in os.listdir('data/'):
        print(f"  - {file}")
else:
    if not os.path.exists('data/'):
        os.makedirs('data/')
    print(f"⚠ {zip_file_path} not found, will create sample data")

# Load dataset
data_files = [f for f in os.listdir('data/') if f.endswith(('.csv', '.xlsx', '.xls'))] if os.path.exists('data/') else []

if data_files:
    data_file = f"data/{data_files[0]}"
    print(f"\nLoading: {data_file}")
    if data_file.endswith('.csv'):
        df = pd.read_csv(data_file)
    else:
        df = pd.read_excel(data_file)
    print(f"✓ Dataset loaded! Shape: {df.shape}")
else:
    # Create realistic sample data
    print("\nCreating sample dataset for demonstration...")
    np.random.seed(42)
    n = 10819
    df = pd.DataFrame({
        'Airline': np.random.choice(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet', 
                                     'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia'], n),
        # Lowercase 'h' is the hourly alias; uppercase 'H' is deprecated in recent pandas
        'Date_of_Journey': pd.date_range('2019-03-01', periods=n, freq='3h').strftime('%d/%m/%Y'),
        'Source': np.random.choice(['Delhi', 'Kolkata', 'Mumbai', 'Chennai', 'Bangalore'], n),
        'Destination': np.random.choice(['Cochin', 'Bangalore', 'Delhi', 'Hyderabad', 'Mumbai'], n),
        'Route': np.random.choice(['DEL-BOM', 'BOM-DEL', 'DEL-BLR', 'BLR-DEL', 'MAA-BOM'], n),
        'Dep_Time': [f"{np.random.randint(0,24):02d}:{np.random.randint(0,60):02d}" for _ in range(n)],
        'Arrival_Time': [f"{np.random.randint(0,24):02d}:{np.random.randint(0,60):02d}" for _ in range(n)],
        'Duration': [f"{np.random.randint(1,25)}h {np.random.randint(0,60):02d}m" for _ in range(n)],
        'Total_Stops': np.random.choice(['non-stop', '1 stop', '2 stops', '3 stops', '4 stops'], 
                                        n, p=[0.25, 0.35, 0.25, 0.10, 0.05]),
        'Additional_Info': np.random.choice(['No info', 'In-flight meal not included', 
                                             'No check-in baggage included', '1 Short layover'], n),
        'Price': (np.random.gamma(2.5, 2500, n) + np.random.normal(3000, 1000, n)).astype(int)
    })
    df['Price'] = df['Price'].clip(lower=1500, upper=80000)
    print(f"✓ Sample dataset created! Shape: {df.shape}")

print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nFirst 5 rows:")
display(df.head())

**What this does:**  
Extracts and loads the flight fare dataset, with automatic fallback to sample data creation.

**Why it's needed:**
- **Automatic extraction**: Handles zipped datasets seamlessly
- **Format flexibility**: Supports CSV and Excel files
- **Fallback mechanism**: Creates realistic sample data if file unavailable
- **Initial inspection**: Shows dataset shape and preview

**Insights:**  
The dataset typically contains 10,000+ flight records with ~11 features. Early shape validation ensures data loaded correctly.

---
## 03 - Understanding Data

Deep dive into dataset structure, data types, and distributions.

In [None]:
# Dataset dimensions
print("="*70)
print("DATASET DIMENSIONS")
print("="*70)
print(f"Total Rows: {df.shape[0]:,}")
print(f"Total Columns: {df.shape[1]}")
print(f"Total Data Points: {df.shape[0] * df.shape[1]:,}")

**What this does:**  
Displays dataset dimensions (rows and columns).

**Why it's needed:**  
Understanding scale helps determine computational requirements and whether sampling is needed.

**Insight:**  
~10,000 rows with 11 columns provides sufficient data for robust model training without computational constraints.

In [None]:
# Column information
print("="*70)
print("COLUMN INFORMATION")
print("="*70)
print(df.dtypes)
print("\n" + "="*70)
print("DATASET INFO")
print("="*70)
df.info()

**What this does:**  
Shows column names, data types, and non-null counts.

**Why it's needed:**
- Identifies categorical vs numerical features
- Reveals missing values
- Guides encoding strategies

**Expected findings:**  
Object types need encoding; numeric types ready for modeling.

In [None]:
# Statistical summary
print("="*70)
print("STATISTICAL SUMMARY")
print("="*70)
display(df.describe())

# Target variable analysis
price_col = [col for col in df.columns if 'price' in col.lower()][0]
print(f"\n{'='*70}")
print(f"TARGET VARIABLE: {price_col}")
print(f"{'='*70}")
print(f"Minimum: ₹{df[price_col].min():,.2f}")
print(f"Maximum: ₹{df[price_col].max():,.2f}")
print(f"Mean: ₹{df[price_col].mean():,.2f}")
print(f"Median: ₹{df[price_col].median():,.2f}")
print(f"Std Dev: ₹{df[price_col].std():,.2f}")
print(f"\nPrice Range: ₹{df[price_col].max() - df[price_col].min():,.2f}")

**What this does:**  
Provides statistical summary of numerical features including central tendency and spread.

**Why it's needed:**
- Identifies outliers and distribution shape
- Shows price range (critical for business context)
- Reveals skewness through mean vs median

**Key Insights:**  
Wide price ranges indicate diverse flight types. High std dev shows significant pricing variability.

In [None]:
# Unique values for categorical columns
print("="*70)
print("UNIQUE VALUE COUNTS - CATEGORICAL FEATURES")
print("="*70)

categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

for col in categorical_cols:
    unique_count = df[col].nunique()
    print(f"\n{col}:")
    print(f"  Unique values: {unique_count}")
    if unique_count <= 15:
        print(f"  Values: {sorted(df[col].unique()[:15].tolist())}")
    else:
        print(f"  Top 10: {df[col].value_counts().head(10).index.tolist()}")

print(f"\n{'='*70}")
print("SUMMARY")
print(f"{'='*70}")
print(f"Categorical columns: {len(categorical_cols)}")
print(f"Numerical columns: {len(df.select_dtypes(include=[np.number]).columns)}")

**What this does:**  
Counts unique values in each categorical column.

**Why it's needed:**
- **Cardinality assessment**: Determines encoding strategy
- **Data validation**: Identifies unexpected values
- **Feature planning**: Low cardinality → OneHot, High cardinality → Label/Target encoding

**Decision criteria:**
- ≤10 unique: OneHot encoding (the threshold applied in Section 5.4)
- 11-50 unique: Label encoding
- >50 unique: Target encoding or feature engineering
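
The decision criteria can be sketched as a small helper. The function name `choose_encoding`, its threshold parameters, and the `Flight_Code` column are hypothetical and used only for illustration. The demo frame is synthetic.

```python
import pandas as pd

def choose_encoding(series, onehot_max=10, label_max=50):
    """Pick an encoding strategy from the number of distinct values."""
    n_unique = series.nunique()
    if n_unique <= onehot_max:
        return "onehot"
    if n_unique <= label_max:
        return "label"
    return "target_or_engineered"

# Synthetic frame: 3, 20, and 60 distinct values respectively
n = 60
demo = pd.DataFrame({
    "Airline": [["IndiGo", "Vistara", "GoAir"][i % 3] for i in range(n)],
    "Route": [f"R{i % 20}" for i in range(n)],
    "Flight_Code": [f"F{i}" for i in range(n)],  # hypothetical high-cardinality column
})

plan = {col: choose_encoding(demo[col]) for col in demo.columns}
print(plan)
```

The thresholds are conventions rather than rules, so they are worth revisiting if one-hot encoding produces too many columns.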

---
## 04 - Data Cleaning

Handle missing values, duplicates, and data quality issues.

In [None]:
# Missing value analysis
print("="*70)
print("MISSING VALUE ANALYSIS")
print("="*70)

missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing_Count': missing.values,
    'Percentage': missing_pct.values
}).sort_values('Missing_Count', ascending=False)

missing_df = missing_df[missing_df['Missing_Count'] > 0]

if len(missing_df) > 0:
    display(missing_df)
    print(f"\nTotal columns with missing values: {len(missing_df)}")
else:
    print("✓ No missing values found!")

print(f"\nDataset shape: {df.shape}")

**What this does:**  
Identifies and quantifies missing values across all columns.

**Why it's needed:**
- Many estimators, including scikit-learn's LinearRegression, raise errors on NaN inputs (XGBoost and LightGBM can handle them natively)
- Determines imputation vs deletion strategy
- Assesses data quality

**Strategy:**
- <5% missing: Drop rows
- 5-30%: Impute (mean/median/mode)
- >30%: Consider dropping feature
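
As a rough sketch, the strategy above can be turned into a per-column recommendation. The helper name `missing_strategy` and its parameters are hypothetical, and the demo frame is synthetic with missing rates chosen to hit each branch.

```python
import numpy as np
import pandas as pd

def missing_strategy(df, drop_rows_below=5.0, drop_column_above=30.0):
    """Map each column's missing percentage to a suggested treatment."""
    pct = df.isnull().mean() * 100

    def decide(p):
        if p == 0:
            return "none"
        if p < drop_rows_below:
            return "drop_rows"
        if p <= drop_column_above:
            return "impute"
        return "consider_dropping_column"

    return pct.apply(decide)

# Synthetic 100-row frame with 2%, 10%, 50%, and 0% missing values
demo = pd.DataFrame({
    "Route": ["DEL-BOM"] * 98 + [None] * 2,
    "Duration": [120.0] * 90 + [np.nan] * 10,
    "Additional_Info": ["No info"] * 50 + [None] * 50,
    "Price": [5000] * 100,
})

recommendation = missing_strategy(demo).to_dict()
print(recommendation)
```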

In [None]:
# Handle missing values
rows_before = len(df)
df = df.dropna()
rows_after = len(df)

print("="*70)
print("MISSING VALUE TREATMENT")
print("="*70)
print(f"Rows before: {rows_before:,}")
print(f"Rows after: {rows_after:,}")
print(f"Rows dropped: {rows_before - rows_after:,}")
print(f"Data retained: {(rows_after/rows_before*100):.2f}%")
print("\n✓ Missing values handled")

**What this does:**  
Removes rows with missing values.

**Why it's needed:**  
Ensures complete data for all features, preventing errors during model training.

**Alternative:**  
For production with missing data, consider mean/median imputation for numerical features.
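
A minimal sketch of that alternative, assuming median imputation for numeric columns and most-frequent-value imputation for categorical ones. The tiny frame is synthetic. In a real pipeline the fill values should be computed on the training split only and then reused on test data.

```python
import numpy as np
import pandas as pd

# Synthetic rows with one missing value per column
demo = pd.DataFrame({
    "Duration_Minutes": [90.0, 120.0, np.nan, 600.0],
    "Airline": ["IndiGo", None, "IndiGo", "Vistara"],
})

imputed = demo.copy()
num_cols = imputed.select_dtypes(include=[np.number]).columns
cat_cols = imputed.select_dtypes(exclude=[np.number]).columns

for col in num_cols:
    # Median is robust to the long right tail typical of durations and prices
    imputed[col] = imputed[col].fillna(imputed[col].median())

for col in cat_cols:
    # mode() ignores missing values; take the first if several values tie
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```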

In [None]:
# Duplicate analysis
print("="*70)
print("DUPLICATE ANALYSIS")
print("="*70)

duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates:,}")

if duplicates > 0:
    print(f"Percentage: {(duplicates/len(df)*100):.2f}%")
    df = df.drop_duplicates()
    print(f"✓ Duplicates removed! New shape: {df.shape}")
else:
    print("✓ No duplicates found")

**What this does:**  
Identifies and removes duplicate rows.

**Why it's needed:**
- Prevents bias toward repeated patterns
- Avoids data leakage in train/test split
- Ensures unique information per record

In [None]:
# Outlier detection
print("="*70)
print("OUTLIER DETECTION - PRICE")
print("="*70)

Q1 = df[price_col].quantile(0.25)
Q3 = df[price_col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df[price_col] < lower) | (df[price_col] > upper)]

print(f"Q1: ₹{Q1:,.2f}")
print(f"Q3: ₹{Q3:,.2f}")
print(f"IQR: ₹{IQR:,.2f}")
print(f"\nBounds: ₹{lower:,.2f} to ₹{upper:,.2f}")
print(f"Outliers: {len(outliers):,} ({len(outliers)/len(df)*100:.2f}%)")
print("\n✓ Retaining outliers (legitimate premium/budget flights)")
print(f"Final shape: {df.shape}")

**What this does:**  
Uses IQR method to detect price outliers.

**Why it's needed:**  
Identifies potentially erroneous data points.

**Decision:**  
We retain outliers because:
- High prices = business/first class
- Low prices = promotions/budget airlines
- Tree models handle outliers well
- Removing would lose valuable patterns
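
For intuition, here is the IQR rule worked through on eight made-up prices. The values are illustrative only. Flagging outliers in a boolean mask instead of dropping them matches the decision to retain them.

```python
import pandas as pd

# Hypothetical prices in rupees with one extreme fare
prices = pd.Series([3000, 4000, 5000, 6000, 7000, 8000, 9000, 40000])

q1 = prices.quantile(0.25)   # 4750.0 with pandas' default linear interpolation
q3 = prices.quantile(0.75)   # 8250.0
iqr = q3 - q1                # 3500.0
lower = q1 - 1.5 * iqr       # -500.0
upper = q3 + 1.5 * iqr       # 13500.0

# Flag rather than drop, so the rows stay available for modeling
is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier].tolist())  # [40000]
```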

---
## 05 - Feature Engineering

Transform raw features into meaningful ML inputs.

### 5.1 Date and Time Features

In [None]:
# Extract date features
print("="*70)
print("DATE FEATURE ENGINEERING")
print("="*70)

date_cols = [col for col in df.columns if 'date' in col.lower() or 'journey' in col.lower()]
if date_cols:
    date_col = date_cols[0]
    print(f"Date column: {date_col}")
    print(f"Sample: {df[date_col].head(3).tolist()}")

    df[date_col] = pd.to_datetime(df[date_col], dayfirst=True, errors='coerce')

    df['Journey_Day'] = df[date_col].dt.day
    df['Journey_Month'] = df[date_col].dt.month
    df['Journey_Year'] = df[date_col].dt.year
    df['Journey_Weekday'] = df[date_col].dt.dayofweek
    df['Journey_Weekend'] = (df['Journey_Weekday'] >= 5).astype(int)

    print(f"\n✓ Created: Journey_Day, Journey_Month, Journey_Year, Journey_Weekday, Journey_Weekend")
    df = df.drop(columns=[date_col])
    print(f"✓ Dropped original: {date_col}")

print(f"\nShape: {df.shape}")

**What this does:**  
Extracts temporal features from journey date.

**Why it's needed:**
- **Seasonality**: Prices vary by month (holidays, vacations)
- **Day patterns**: Weekday vs weekend pricing
- **Model compatibility**: ML needs numeric inputs

**Feature importance:**
- Month captures seasonal demand
- Weekday shows business vs leisure patterns
- Weekend flag lets the model pick up any weekend price premium
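
A short sketch of why day-first parsing matters, assuming the journey dates are stored as `DD/MM/YYYY` strings (the format the sample generator produces). Passing an explicit `format` is a stricter alternative to `dayfirst=True`. The three dates are illustrative.

```python
import pandas as pd

# "01/03/2019" is 1 March 2019 here; month-first parsing would read it as 3 January
dates = pd.Series(["01/03/2019", "02/03/2019", "24/12/2019"])
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

features = pd.DataFrame({
    "Journey_Day": parsed.dt.day,
    "Journey_Month": parsed.dt.month,
    "Journey_Weekday": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})
features["Journey_Weekend"] = (features["Journey_Weekday"] >= 5).astype(int)

print(features)
```

With `errors="coerce"`, any string that does not match the format becomes `NaT`, so it is worth checking `parsed.isna().sum()` before deriving features.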

In [None]:
# Time feature extraction
print("="*70)
print("TIME FEATURE ENGINEERING")
print("="*70)

def extract_hour_minute(time_str):
    try:
        if pd.isna(time_str):
            return None, None
        time_str = str(time_str).strip()
        if ':' in time_str:
            parts = time_str.split(':')
            hour = int(parts[0])
            minute = int(parts[1][:2])
            return hour, minute
        return None, None
    except (ValueError, IndexError):  # malformed time strings only
        return None, None

def time_to_minutes(hour, minute):
    if hour is not None and minute is not None:
        return hour * 60 + minute
    return None

def categorize_time(hour):
    if pd.isna(hour):  # also catches NaN, which None becomes once stored in a column
        return 'Unknown'
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 24:
        return 'Evening'
    else:
        return 'Night'

# Departure time
dep_cols = [col for col in df.columns if 'dep' in col.lower() and 'time' in col.lower()]
if dep_cols:
    dep_col = dep_cols[0]
    print(f"Processing: {dep_col}")
    df[['Dep_Hour', 'Dep_Minute']] = df[dep_col].apply(lambda x: pd.Series(extract_hour_minute(x)))
    df['Dep_Time_Minutes'] = df.apply(lambda x: time_to_minutes(x['Dep_Hour'], x['Dep_Minute']), axis=1)
    df['Dep_Time_Period'] = df['Dep_Hour'].apply(categorize_time)
    df = df.drop(columns=[dep_col])
    print(f"✓ Created: Dep_Hour, Dep_Minute, Dep_Time_Minutes, Dep_Time_Period")

# Arrival time
arr_cols = [col for col in df.columns if 'arr' in col.lower() and 'time' in col.lower()]
if arr_cols:
    arr_col = arr_cols[0]
    print(f"Processing: {arr_col}")
    df[['Arr_Hour', 'Arr_Minute']] = df[arr_col].apply(lambda x: pd.Series(extract_hour_minute(x)))
    df['Arr_Time_Minutes'] = df.apply(lambda x: time_to_minutes(x['Arr_Hour'], x['Arr_Minute']), axis=1)
    df['Arr_Time_Period'] = df['Arr_Hour'].apply(categorize_time)
    df = df.drop(columns=[arr_col])
    print(f"✓ Created: Arr_Hour, Arr_Minute, Arr_Time_Minutes, Arr_Time_Period")

print(f"\nShape: {df.shape}")

**What this does:**  
Extracts hour/minute and creates categorical time periods from departure/arrival times.

**Why it's needed:**
- **Time-of-day pricing**: Early-morning and late-night flights are often cheaper
- **Business hours**: Premium pricing for convenient times
- **Multiple representations**: Both continuous and categorical capture different patterns

**Features created:**
- Hour/Minute: Granular information
- Minutes since midnight: Single continuous value that preserves the spacing between times
- Time period: Categorical for broad patterns
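
The same features can also be derived without row-wise `apply`. The sketch below is a vectorized alternative, not the notebook's method. It assumes times look like `HH:MM`, optionally followed by a date suffix such as `01:10 22 Mar` (an assumed format for arrival times). The sample values are made up.

```python
import pandas as pd

times = pd.Series(["22:20", "05:50", "01:10 22 Mar", None])

# Capture leading hour and minute; non-matching or missing rows become NaN
parts = times.str.extract(r"^\s*(\d{1,2}):(\d{2})")
hours = pd.to_numeric(parts[0])
minutes = pd.to_numeric(parts[1])

minutes_since_midnight = hours * 60 + minutes

# Left-closed bins reproduce the Night/Morning/Afternoon/Evening boundaries
period = pd.cut(hours, bins=[0, 6, 12, 18, 24], right=False,
                labels=["Night", "Morning", "Afternoon", "Evening"])

print(pd.DataFrame({"minutes": minutes_since_midnight, "period": period}))
```

Missing times stay missing in both outputs instead of being silently assigned to a period.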

### 5.2 Duration Processing

In [None]:
# Duration feature engineering
print("="*70)
print("DURATION PROCESSING")
print("="*70)

def duration_to_minutes(duration_str):
    try:
        if pd.isna(duration_str):
            return None
        duration_str = str(duration_str).strip()
        total_minutes = 0
        if 'h' in duration_str:
            hours = int(duration_str.split('h')[0].strip())
            total_minutes += hours * 60
        if 'm' in duration_str:
            minutes_part = duration_str.split('h')[-1] if 'h' in duration_str else duration_str
            minutes = int(minutes_part.replace('m', '').strip())
            total_minutes += minutes
        return total_minutes
    except ValueError:  # int() failed on a malformed duration string
        return None

dur_cols = [col for col in df.columns if 'duration' in col.lower()]
if dur_cols:
    dur_col = dur_cols[0]
    print(f"Processing: {dur_col}")
    df['Duration_Minutes'] = df[dur_col].apply(duration_to_minutes)
    df['Duration_Hours'] = df['Duration_Minutes'] / 60
    df['Duration_Category'] = pd.cut(df['Duration_Hours'], bins=[0, 2, 5, 10, 50], 
                                     labels=['Short', 'Medium', 'Long', 'Very_Long'])
    df = df.drop(columns=[dur_col])
    print(f"✓ Created: Duration_Minutes, Duration_Hours, Duration_Category")
    print(f"  Range: {df['Duration_Minutes'].min():.0f}-{df['Duration_Minutes'].max():.0f} min")

print(f"\nShape: {df.shape}")

**What this does:**  
Converts duration strings to numerical minutes/hours and categorical bins.

**Why it's needed:**
- Duration is a strong price predictor
- ML needs numeric inputs
- Categorical bins capture non-linear relationships

**Insight:**  
Duration often ranks among the top features in importance analysis.
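
As an alternative sketch, the duration strings can be parsed with two regular expressions instead of string splitting. This assumes the format is some combination of `<n>h` and `<n>m`. The sample values are illustrative.

```python
import pandas as pd

durations = pd.Series(["2h 50m", "19h", "5m", "25h 05m", None])

# Extract each unit separately; a missing unit counts as zero
hours = pd.to_numeric(durations.str.extract(r"(\d+)\s*h", expand=False)).fillna(0)
mins = pd.to_numeric(durations.str.extract(r"(\d+)\s*m", expand=False)).fillna(0)

# Keep genuinely missing durations as NaN rather than 0 minutes
total_minutes = (hours * 60 + mins).where(durations.notna())

print(total_minutes.tolist())
```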

### 5.3 Stops Processing

In [None]:
# Total stops engineering
print("="*70)
print("STOPS PROCESSING")
print("="*70)

def stops_to_number(stops_str):
    if pd.isna(stops_str):
        return 0
    stops_str = str(stops_str).lower()
    if 'non' in stops_str:
        return 0
    for i in range(5):
        if str(i) in stops_str:
            return i
    return 0

stops_cols = [col for col in df.columns if 'stop' in col.lower()]
if stops_cols:
    stops_col = stops_cols[0]
    print(f"Processing: {stops_col}")
    df['Total_Stops_Num'] = df[stops_col].apply(stops_to_number)
    df['Is_Direct_Flight'] = (df['Total_Stops_Num'] == 0).astype(int)
    print(f"\n✓ Created: Total_Stops_Num, Is_Direct_Flight")
    print(f"Stop distribution: {df['Total_Stops_Num'].value_counts().sort_index().to_dict()}")
    df = df.drop(columns=[stops_col])

print(f"\nShape: {df.shape}")

**What this does:**  
Converts stop descriptions to numerical values and creates direct flight indicator.

**Why it's needed:**
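- Stop count is closely tied to price
- Numerical encoding enables math operations
- Binary feature isolates the non-stop vs connecting price difference

**Business insight:**  
Non-stop flights often carry a premium on a given route, but connecting itineraries tend to cover longer routes, so the average price by stop count can point either way. Confirm the direction with the bar chart in Section 6.2.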

### 5.4 Categorical Encoding

In [None]:
# Encoding strategy
print("="*70)
print("CATEGORICAL ENCODING STRATEGY")
print("="*70)

# Include 'category' as well: Duration_Category from pd.cut is categorical, not object
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"Categorical features: {len(categorical_features)}")

for col in categorical_features:
    unique = df[col].nunique()
    strategy = "OneHot" if unique <= 10 else "Label"
    print(f"  {col}: {unique} unique → {strategy}")

# Apply encoding
original_shape = df.shape

low_card = [col for col in categorical_features if df[col].nunique() <= 10 and col != 'Route']
high_card = [col for col in categorical_features if df[col].nunique() > 10 or col == 'Route']

if low_card:
    print(f"\nOneHot encoding: {low_card}")
    df = pd.get_dummies(df, columns=low_card, drop_first=True, dtype=int)

label_encoders = {}
if high_card:
    print(f"Label encoding: {high_card}")
    for col in high_card:
        if col in df.columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].astype(str))
            label_encoders[col] = le

print(f"\n{'='*70}")
print("ENCODING COMPLETE")
print(f"{'='*70}")
print(f"Shape before: {original_shape}")
print(f"Shape after: {df.shape}")
print(f"New columns: {df.shape[1] - original_shape[1]}")

**What this does:**  
Applies OneHot encoding to low-cardinality and Label encoding to high-cardinality features.

**Why this approach:**
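- **OneHot**: Creates binary columns without imposing an artificial order on categories
- **Drop_first**: Removes one redundant column per feature, avoiding perfect multicollinearity in linear models
- **Label**: Keeps dimensionality manageable, at the cost of an arbitrary integer order (tree models tolerate this better than linear ones)
- **Stored encoders**: Reapply the same mapping to new data at prediction time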

**Result:**  
All categorical variables now numerical and ML-ready.
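
A small sketch of what the two encoders produce, using a four-row synthetic frame. It shows which column `drop_first=True` removes and how a stored `LabelEncoder` maps a new value at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({
    "Source": ["Delhi", "Mumbai", "Chennai", "Delhi"],
    "Route": ["DEL-BOM", "BOM-DEL", "MAA-BOM", "DEL-BOM"],
})

# One-hot: categories are sorted, and the first ("Chennai") becomes the implicit baseline
encoded = pd.get_dummies(demo, columns=["Source"], drop_first=True, dtype=int)

# Label: integer codes follow the sorted order of the class names
le = LabelEncoder()
encoded["Route"] = le.fit_transform(encoded["Route"])

print(encoded)
print(list(le.classes_))

# Reusing the fitted encoder on an incoming record
new_code = int(le.transform(["MAA-BOM"])[0])
print(new_code, le.inverse_transform([new_code])[0])
```

Note that `LabelEncoder.transform` raises a `ValueError` for values it did not see during fitting, so unseen routes need explicit handling in deployment.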

In [None]:
# Final feature verification
print("="*70)
print("FINAL FEATURE SET")
print("="*70)
print(f"Total features: {df.shape[1]}")
print(f"Total samples: {df.shape[0]:,}")
print(f"\nFeatures ({df.shape[1]}):")
for i, col in enumerate(df.columns, 1):
    print(f"  {i:2d}. {col:35s} {str(df[col].dtype):10s} ({df[col].nunique()} unique)")

print(f"\n✓ Feature Engineering Complete!")

**What this does:**  
Lists all engineered features with types and cardinality.

**Why it's needed:**  
Quality check ensuring all features are numerical and ready for modeling.

**Next steps:**  
EDA, visualization, and model training.

---
## 06 - Exploratory Data Analysis

Visual and statistical analysis of patterns and relationships.

### 6.1 Univariate Analysis

In [None]:
# Price distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].hist(df[price_col], bins=50, edgecolor='black', alpha=0.7, color='skyblue')
axes[0].set_title('Flight Price Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Price (₹)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].axvline(df[price_col].mean(), color='red', linestyle='--', linewidth=2, 
                label=f'Mean: ₹{df[price_col].mean():,.0f}')
axes[0].axvline(df[price_col].median(), color='green', linestyle='--', linewidth=2, 
                label=f'Median: ₹{df[price_col].median():,.0f}')
axes[0].legend()
axes[0].grid(alpha=0.3)

axes[1].boxplot(df[price_col], vert=True, patch_artist=True,
                boxprops=dict(facecolor='lightblue'),
                medianprops=dict(color='red', linewidth=2))
axes[1].set_title('Price Box Plot', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Price (₹)', fontsize=12)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('price_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("="*70)
print("PRICE STATISTICS")
print("="*70)
print(f"Skewness: {df[price_col].skew():.3f}")
print(f"Kurtosis: {df[price_col].kurtosis():.3f}")

**What this does:**  
Visualizes price distribution with histogram and box plot.

**Why it's needed:**
- Shows distribution shape (skewness, outliers)
- Compares mean vs median
- Identifies typical price ranges

**Insights:**
- Right-skewed common in pricing (few expensive flights)
- Median more robust than mean
- Tree models handle skewness better than linear

### 6.2 Bivariate Analysis

In [None]:
# Price vs key features
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Duration vs Price
if 'Duration_Hours' in df.columns:
    axes[0].scatter(df['Duration_Hours'], df[price_col], alpha=0.4, s=20, color='coral')
    axes[0].set_xlabel('Duration (hours)', fontsize=12)
    axes[0].set_ylabel('Price (₹)', fontsize=12)
    axes[0].set_title('Price vs Duration', fontsize=14, fontweight='bold')
    axes[0].grid(alpha=0.3)
    corr = df[price_col].corr(df['Duration_Hours'])
    axes[0].text(0.05, 0.95, f'Correlation: {corr:.3f}', transform=axes[0].transAxes,
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Stops vs Price
if 'Total_Stops_Num' in df.columns:
    stops_price = df.groupby('Total_Stops_Num')[price_col].mean()
    axes[1].bar(stops_price.index, stops_price.values, alpha=0.7, color='steelblue')
    axes[1].set_xlabel('Number of Stops', fontsize=12)
    axes[1].set_ylabel('Average Price (₹)', fontsize=12)
    axes[1].set_title('Average Price by Stops', fontsize=14, fontweight='bold')
    axes[1].grid(alpha=0.3, axis='y')
    # Use the actual stop counts as x positions so labels stay aligned with their bars
    for x, v in zip(stops_price.index, stops_price.values):
        axes[1].text(x, v, f'₹{v:,.0f}', ha='center', va='bottom')

plt.tight_layout()
plt.savefig('bivariate_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

**What this does:**  
Analyzes relationships between price and key predictors.

**Why it's needed:**
- Identifies strong predictive features
- Reveals linear vs non-linear relationships
- Validates domain knowledge

**Key insights:**
- Duration shows positive correlation
- Average price by stop count shows whether non-stop flights are priced above or below connecting ones
- Strong relationships suggest good predictive power

### 6.3 Multivariate Analysis

In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 8))

numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
corr_matrix = df[numerical_cols].corr()

sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5, vmin=-1, vmax=1)
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold', pad=20)
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(rotation=0, fontsize=8)
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

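# Top correlations with price, ranked by absolute strength so that strong
# negative relationships are not pushed to the bottom of the list
price_corr = (corr_matrix[price_col]
              .drop(price_col)   # remove the trivial self-correlation
              .dropna()          # constant columns (e.g. a single year) yield NaN
              .sort_values(key=abs, ascending=False))
print("="*70)
print(f"TOP 10 CORRELATIONS WITH {price_col}")
print("="*70)
for i, (feat, corr) in enumerate(price_corr.head(10).items(), 1):
    print(f"{i:2d}. {feat:30s}: {corr:+.4f}")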

**What this does:**  
Creates correlation heatmap and identifies top predictors.

**Why it's needed:**
- Feature selection guidance
- Multicollinearity detection
- Relationship visualization

**Key findings:**
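- High absolute correlation with price = strong linear predictor
- Highly correlated features may be redundant (e.g. Duration_Minutes and Duration_Hours carry the same information)
- Tree models tolerate multicollinearity in predictions, though it splits importance between the correlated features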
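
To go from a heatmap to a concrete list of redundant features, one option is to scan the upper triangle of the correlation matrix. The helper name `correlated_pairs` and the 0.9 threshold are assumptions for illustration, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic features: hours is an exact rescaling of minutes, Dep_Hour is independent
rng = np.random.default_rng(0)
minutes = rng.integers(60, 1500, size=200)
demo = pd.DataFrame({
    "Duration_Minutes": minutes,
    "Duration_Hours": minutes / 60,
    "Dep_Hour": rng.integers(0, 24, size=200),
})

pairs = correlated_pairs(demo)
print(pairs)
```

For a linear model, dropping one feature from each reported pair is usually enough. Tree ensembles can keep both.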

---
## 07 - Train/Test Split

In [None]:
# Prepare data
print("="*70)
print("DATA PREPARATION")
print("="*70)

X = df.drop(columns=[price_col])
y = df[price_col]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Ensure all numeric
non_numeric = X.select_dtypes(include=['object', 'category']).columns.tolist()  # 'category' covers pd.cut output
if non_numeric:
    print(f"\nConverting non-numeric: {non_numeric}")
    for col in non_numeric:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

print(f"\n{'='*70}")
print("SPLIT SUMMARY")
print(f"{'='*70}")
print(f"Training: {X_train.shape[0]:,} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing:  {X_test.shape[0]:,} samples ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTrain target: Mean=₹{y_train.mean():,.0f}, Std=₹{y_train.std():,.0f}")
print(f"Test target:  Mean=₹{y_test.mean():,.0f}, Std=₹{y_test.std():,.0f}")
print("\n✓ Train/test split complete")

**What this does:**  
Splits data into 80% training and 20% testing sets.

**Why it's needed:**
- **Unbiased evaluation**: Test set simulates unseen data
- **Detect overfitting**: A model that memorized the training set scores poorly on held-out data
- **Reproducibility**: Random state ensures consistent splits

**Configuration:**
- 80/20 split: Standard practice
- Random state=42: Reproducible results
- Similar distributions validate good split

**Critical rule:**  
NEVER use test data for training decisions.
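
A brief sketch of the two properties this section relies on: a fixed `random_state` reproduces the same split, and train and test rows never overlap. The arrays are placeholders, not the flight data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows, 2 features, target equals the row index
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Repeating the call with the same seed yields the same partition
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
print(np.array_equal(X_test, X_test_again))
print(set(y_train) & set(y_test))  # empty set: no row appears on both sides
```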

---
## 08 - Model Training & Comparison

Train multiple algorithms and compare performance.

In [None]:
# Initialize results storage
results = []

def evaluate_model(model, name, X_train, X_test, y_train, y_test):
    import time
    print(f"\n{'='*70}")
    print(f"TRAINING: {name}")
    print(f"{'='*70}")

    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    cv_scores = cross_val_score(model, X_train, y_train, cv=5, 
                                 scoring='neg_root_mean_squared_error', n_jobs=-1)
    cv_rmse = -cv_scores.mean()

    results.append({
        'Model': name,
        'Train_RMSE': train_rmse,
        'Test_RMSE': test_rmse,
        'Train_R2': train_r2,
        'Test_R2': test_r2,
        'CV_RMSE': cv_rmse,
        'Time': train_time
    })

    print(f"✓ Completed in {train_time:.2f}s")
    print(f"Train RMSE: ₹{train_rmse:,.0f} | R²: {train_r2:.4f}")
    print(f"Test RMSE:  ₹{test_rmse:,.0f} | R²: {test_r2:.4f}")
    print(f"CV RMSE: ₹{cv_rmse:,.0f}")
    print(f"R² gap (train - test): {(train_r2-test_r2)*100:.1f} points")

    return model

print("="*70)
print("MODEL TRAINING PIPELINE")
print("="*70)
print(f"Training samples: {len(X_train):,}")
print(f"Features: {X_train.shape[1]}")

**What this does:**  
Creates evaluation function for standardized model assessment.

**Why it's needed:**
- Consistent metrics across models
- Multiple perspectives (RMSE, R², CV)
- Overfitting detection
- Training time tracking

**Metrics:**
- **RMSE**: Prediction error in rupees
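- **R²**: Share of variance explained (1 is perfect; can be negative on held-out data when the model is worse than predicting the mean)
- **CV RMSE**: Averaged over 5 folds of the training set, so less dependent on a single split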
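
One detail of the CV metric is easy to miss: scikit-learn scorers follow a "higher is better" convention, so `neg_root_mean_squared_error` returns negated RMSE values that must be flipped back. A minimal sketch on synthetic data (the coefficients and noise level are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear target with noise of standard deviation 300
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 5000 + 1000 * X[:, 0] - 500 * X[:, 1] + rng.normal(scale=300, size=200)

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")

fold_rmse = -scores  # flip the sign to recover RMSE per fold
print("Per-fold RMSE:", np.round(fold_rmse, 1))
print(f"Mean: {fold_rmse.mean():.1f}, Std: {fold_rmse.std():.1f}")
```

Reporting the per-fold spread alongside the mean gives a sense of how stable the estimate is.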

### 8.1 Linear Regression

In [None]:
lr_model = evaluate_model(LinearRegression(), 'Linear Regression', 
                         X_train, X_test, y_train, y_test)

**What this does:**  
Trains Linear Regression baseline model.

**Why this model:**
- Baseline performance benchmark
- Interpretable coefficients
- Fastest training/prediction
- Assumes linear relationships

**Expected:**  
Typically underperforms due to non-linear flight pricing patterns.

### 8.2 Decision Tree

In [None]:
dt_model = evaluate_model(DecisionTreeRegressor(max_depth=20, min_samples_split=10, random_state=42),
                         'Decision Tree', X_train, X_test, y_train, y_test)

**What this does:**  
Trains Decision Tree with depth limit.

**Why this model:**
- Captures non-linear relationships
- No feature scaling needed
- Discovers interactions automatically
- Interpretable decision rules

**Configuration:**
- max_depth=20: Caps tree depth to limit overfitting
- min_samples_split=10: Requires at least 10 samples to split a node, which smooths predictions

### 8.3 Random Forest

In [None]:
rf_model = evaluate_model(RandomForestRegressor(n_estimators=100, max_depth=20, 
                                               min_samples_split=10, random_state=42, n_jobs=-1),
                         'Random Forest', X_train, X_test, y_train, y_test)

**What this does:**  
Trains ensemble of 100 decision trees.

**Why this model:**
- Reduces overfitting via averaging
- Handles outliers well
- Reliable feature importance
- Excellent generalization

**Expected:**  
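On real fare data, typically a top-2 performer, often with R² > 0.85. On the synthetic fallback dataset, Price is generated independently of the features, so expect test R² near zero for every model.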
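
To illustrate the feature-importance point, the sketch below fits a small forest on synthetic data where price depends on duration and stops but not on a pure-noise column. The price formula and the `Noise` column are invented for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Duration_Minutes": rng.integers(60, 1500, size=n),
    "Total_Stops_Num": rng.integers(0, 4, size=n),
    "Noise": rng.normal(size=n),  # unrelated to the target by construction
})
# Invented relationship, used only to give the forest something to learn
y = (3000 + 5 * X["Duration_Minutes"] + 1500 * X["Total_Stops_Num"]
     + rng.normal(scale=200, size=n))

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances are normalized to sum to 1
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` is a common cross-check.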

### 8.4 XGBoost

In [None]:
xgb_model = evaluate_model(xgb.XGBRegressor(n_estimators=100, max_depth=8, learning_rate=0.1, 
                                           random_state=42, n_jobs=-1, tree_method='hist'),
                          'XGBoost', X_train, X_test, y_train, y_test)

**What this does:**  
Trains XGBoost gradient boosting model.

**Why this model:**
- Strong accuracy on tabular data
- Sequential error correction
- Built-in regularization
- Widely used in winning competition entries

**Expected:**  
Often achieves R² > 0.90 on flight prices.
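
**Sketch (illustrative):**  
"Sequential error correction" in its simplest form: plain gradient boosting for squared error, built from scikit-learn trees. This is not XGBoost's actual algorithm (which adds second-order gradients and regularization); the data and all names below are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 4000 + 300 * X_demo[:, 0] ** 2 + rng.normal(0, 500, 300)  # made-up target

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean price
train_rmse = []

for _ in range(50):
    residuals = y_demo - prediction               # what the ensemble still gets wrong
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_tree.fit(X_demo, residuals)              # next tree learns the remaining error
    prediction += learning_rate * weak_tree.predict(X_demo)  # small corrective step
    train_rmse.append(float(np.sqrt(np.mean((y_demo - prediction) ** 2))))

print(f"RMSE after 1 round:   {train_rmse[0]:.0f}")
print(f"RMSE after 50 rounds: {train_rmse[-1]:.0f}")
```

Each round fits a small tree to the current residuals, so training error shrinks step by step; `learning_rate` controls how much of each correction is applied.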

### 8.5 LightGBM

In [None]:
lgb_model = evaluate_model(lgb.LGBMRegressor(n_estimators=100, max_depth=8, learning_rate=0.1,
                                            random_state=42, n_jobs=-1, verbosity=-1),
                          'LightGBM', X_train, X_test, y_train, y_test)

**What this does:**  
Trains LightGBM fast gradient boosting model.

**Why this model:**
- Faster than XGBoost
- Memory efficient
- Leaf-wise growth strategy
- Excellent for large datasets

**Expected:**  
Matches/exceeds XGBoost while training faster.

### 8.6 Model Comparison

In [None]:
# Results comparison
results_df = pd.DataFrame(results).sort_values('Test_R2', ascending=False)

print("\n" + "="*70)
print("MODEL COMPARISON RESULTS")
print("="*70)
display(results_df)

print("\n" + "="*70)
print("RANKING (by Test R²)")
print("="*70)
for i, row in enumerate(results_df.itertuples(), 1):
    print(f"{i}. {row.Model:20s} - R²: {row.Test_R2:.4f} | RMSE: ₹{row.Test_RMSE:,.0f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

models = results_df['Model']
x_pos = np.arange(len(models))

# R² comparison
axes[0].bar(x_pos, results_df['Train_R2'], alpha=0.6, label='Train', color='lightblue')
axes[0].bar(x_pos, results_df['Test_R2'], alpha=0.8, label='Test', color='coral')
axes[0].set_ylabel('R² Score', fontsize=12)
axes[0].set_title('R² Score Comparison', fontsize=14, fontweight='bold')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(models, rotation=45, ha='right')
axes[0].legend()
axes[0].grid(alpha=0.3, axis='y')

# RMSE comparison
axes[1].bar(x_pos, results_df['Test_RMSE'], alpha=0.7, color='steelblue')
axes[1].set_ylabel('RMSE (₹)', fontsize=12)
axes[1].set_title('Test RMSE (Lower is Better)', fontsize=14, fontweight='bold')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(models, rotation=45, ha='right')
axes[1].grid(alpha=0.3, axis='y')

# Training time
axes[2].bar(x_pos, results_df['Time'], alpha=0.7, color='lightgreen')
axes[2].set_ylabel('Time (seconds)', fontsize=12)
axes[2].set_title('Training Time', fontsize=14, fontweight='bold')
axes[2].set_xticks(x_pos)
axes[2].set_xticklabels(models, rotation=45, ha='right')
axes[2].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

# Top 2 for tuning
top_2 = results_df.head(2)['Model'].tolist()
print(f"\n{'='*70}")
print("TOP 2 MODELS FOR HYPERPARAMETER TUNING")
print(f"{'='*70}")
print(f"1. {top_2[0]}")
print(f"2. {top_2[1]}")

**What this does:**  
Compares all models and identifies top 2 for tuning.

**Why it's needed:**
- Objective model selection
- Visualizes trade-offs
- Guides tuning strategy

**Interpretation:**
- Best: Highest Test R², lowest Test RMSE
- Small train-test gap indicates good generalization
- XGBoost/LightGBM typically win on tabular data
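
**Sketch (illustrative):**  
Reading the train-test gap alongside the ranking. The scores below are invented for the example (they are not this notebook's results); only the column names `Train_R2` and `Test_R2` mirror the results table.

```python
import pandas as pd

# Hypothetical scores for three placeholder models
demo_results = pd.DataFrame({
    "Model": ["Model A", "Model B", "Model C"],
    "Train_R2": [0.98, 0.91, 0.62],
    "Test_R2": [0.84, 0.89, 0.60],
})

# Gap between training and test fit: large values suggest overfitting
demo_results["Gap"] = demo_results["Train_R2"] - demo_results["Test_R2"]
ranked = demo_results.sort_values("Test_R2", ascending=False).reset_index(drop=True)
print(ranked)
```

In this toy table, Model A fits the training data best but has the largest gap, Model C has a small gap but is weak on both sets, and Model B ranks first on Test R² with a small gap.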

---
## 09 - Hyperparameter Tuning

Optimize top 2 models using RandomizedSearchCV.

In [None]:
# Store tuned models
tuned_models = {}

print("="*70)
print("HYPERPARAMETER TUNING")
print("="*70)
print(f"Method: RandomizedSearchCV")
print(f"Cross-validation: 3-fold")
print(f"Iterations: 20")
print(f"Metric: RMSE")

**What this does:**  
Prepares hyperparameter optimization for top 2 models.

**Why tuning:**
- Typically a modest accuracy gain (a few percent in R²)
- Reduces overfitting
- Optimal complexity
- Competitive edge

**Strategy:**
- RandomizedSearchCV: Faster than GridSearch
- 3-fold CV: Balances reliability & speed
- 20 iterations: Good exploration-time trade-off
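
**Sketch (illustrative):**  
A rough count of why random search is cheaper here. The per-parameter value counts are read off the parameter grids in the tuning cell of this section; the variable names are hypothetical.

```python
from math import prod

# Number of candidate values per hyperparameter in each search space
values_per_param = {
    "Random Forest": [3, 4, 3, 3, 3],
    "XGBoost": [3, 3, 3, 3, 3, 3],
    "LightGBM": [3, 4, 3, 3, 3, 3],
}
n_iter, cv_folds = 20, 3

grid_sizes = {name: prod(counts) for name, counts in values_per_param.items()}
random_search_fits = n_iter * cv_folds

for name, size in grid_sizes.items():
    print(f"{name}: full grid = {size} combinations ({size * cv_folds} fits), "
          f"random search = {random_search_fits} fits")
```

An exhaustive grid would need 324, 729, and 972 combinations respectively, each fitted 3 times; sampling 20 combinations caps the cost at 60 fits per model.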

In [None]:
# Tune top models
import time

for model_name in top_2:
    print(f"\n{'='*70}")
    print(f"TUNING: {model_name}")
    print(f"{'='*70}")

    if 'Random Forest' in model_name:
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [15, 20, 25, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'max_features': ['sqrt', 'log2', 0.5]
        }
        base_model = RandomForestRegressor(random_state=42, n_jobs=-1)

    elif 'XGBoost' in model_name:
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [5, 7, 9],
            'learning_rate': [0.01, 0.05, 0.1],
            'subsample': [0.7, 0.8, 0.9],
            'colsample_bytree': [0.7, 0.8, 0.9],
            'min_child_weight': [1, 3, 5]
        }
        base_model = xgb.XGBRegressor(random_state=42, n_jobs=-1, tree_method='hist')

    elif 'LightGBM' in model_name:
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [5, 7, 9, -1],
            'learning_rate': [0.01, 0.05, 0.1],
            'num_leaves': [31, 63, 127],
            'subsample': [0.7, 0.8, 0.9],
            'colsample_bytree': [0.7, 0.8, 0.9]
        }
        # In LightGBM, `subsample` only takes effect when subsample_freq > 0
        base_model = lgb.LGBMRegressor(random_state=42, n_jobs=-1, verbosity=-1, subsample_freq=1)
    else:
        continue

    random_search = RandomizedSearchCV(base_model, param_grid, n_iter=20, cv=3,
                                       scoring='neg_root_mean_squared_error',
                                       random_state=42, n_jobs=-1, verbose=0)

    start = time.time()
    random_search.fit(X_train, y_train)
    elapsed = time.time() - start

    print(f"✓ Completed in {elapsed:.1f}s")
    print(f"\nBest parameters:")
    for param, value in random_search.best_params_.items():
        print(f"  {param}: {value}")

    best_model = random_search.best_estimator_
    y_test_pred = best_model.predict(X_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)

    original_r2 = results_df[results_df['Model'] == model_name]['Test_R2'].values[0]
    improvement = ((test_r2 - original_r2) / original_r2) * 100

    print(f"\nTuned Performance:")
    print(f"  Test RMSE: ₹{test_rmse:,.0f}")
    print(f"  Test R²: {test_r2:.4f}")
    print(f"  Improvement: {improvement:+.2f}%")

    tuned_models[model_name] = best_model

print(f"\n{'='*70}")
print("✓ HYPERPARAMETER TUNING COMPLETE")
print(f"{'='*70}")

**What this does:**  
Performs hyperparameter tuning on top 2 models using RandomizedSearchCV.

**Search spaces:**
- **Random Forest**: Number of trees, depth, minimum samples per split and leaf, features per split
- **XGBoost**: Boosting rounds, depth, learning rate, row/column sampling, minimum child weight
- **LightGBM**: Boosting rounds, leaves, depth, learning rate, row/column sampling

**Why these parameters:**
- Control model complexity
- Balance bias-variance tradeoff
- Prevent overfitting
- Optimize convergence

**Expected:**
Typically 1-5% R² improvement, sometimes higher for XGBoost/LightGBM.
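
**Sketch (illustrative):**  
Reading the result of a `RandomizedSearchCV` run scored with `neg_root_mean_squared_error`. scikit-learn maximizes scores, so RMSE is reported negated and `best_score_` must be sign-flipped. The synthetic data and the small search space are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = 3000 + 500 * X_demo[:, 0] + rng.normal(0, 300, 200)  # made-up target

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={"n_estimators": [20, 50], "max_depth": [3, 5, None]},
    n_iter=4,                                   # sample 4 of the 6 combinations
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_demo, y_demo)

best_cv_rmse = -search.best_score_              # flip the sign to get RMSE
print(f"Best params: {search.best_params_}")
print(f"Best CV RMSE: {best_cv_rmse:.0f}")
```

The cross-validated RMSE from `best_score_` is computed on training folds only, so it complements the test-set figures printed by the tuning cell.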

---
## 10 - Final Production Model Selection

Comprehensive evaluation and selection for deployment.

In [None]:
# Compare tuned models
final_comparison = []

for name, model in tuned_models.items():
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)

    cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                 scoring='neg_root_mean_squared_error', n_jobs=-1)
    cv_rmse = -cv_scores.mean()

    final_comparison.append({
        'Model': f'{name} (Tuned)',
        'Train_RMSE': train_rmse,
        'Test_RMSE': test_rmse,
        'Train_R2': train_r2,
        'Test_R2': test_r2,
        'MAE': test_mae,
        'CV_RMSE': cv_rmse,
        'Overfit_Gap': train_r2 - test_r2
    })

final_df = pd.DataFrame(final_comparison)

print("="*70)
print("FINAL MODEL COMPARISON")
print("="*70)
display(final_df)

# Select best
best_idx = final_df['Test_R2'].idxmax()
best_model_name = final_df.loc[best_idx, 'Model']
best_model_r2 = final_df.loc[best_idx, 'Test_R2']
best_model_rmse = final_df.loc[best_idx, 'Test_RMSE']

print(f"\n{'='*70}")
print("PRODUCTION MODEL SELECTED")
print(f"{'='*70}")
print(f"Model: {best_model_name}")
print(f"Test R²: {best_model_r2:.4f} ({best_model_r2*100:.2f}% variance explained)")
print(f"Test RMSE: ₹{best_model_rmse:,.2f}")
print(f"Test MAE: ₹{final_df.loc[best_idx, 'MAE']:,.2f}")

# Get the actual model
best_key = best_model_name.replace(' (Tuned)', '')
final_model = tuned_models[best_key]

**What this does:**  
Evaluates all tuned models and selects the best for production.

**Evaluation criteria:**
1. **Test R²**: Primary metric - highest wins
2. **Test RMSE**: Business interpretability
3. **CV RMSE**: Robustness check
4. **Overfitting gap**: Should be small
5. **MAE**: Average absolute error

**Selection rationale:**  
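The model with the highest Test R² and lowest RMSE, provided its overfitting gap is acceptable, is expected to generalize best to new data. The code selects on Test R² alone; the other criteria are read from the comparison table as sanity checks.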
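
**Sketch (illustrative):**  
One possible way to fold the overfitting gap into the selection itself. This is an assumption for illustration, not the notebook's method; the scores and the 0.10 tolerance are hypothetical, and only the column names mirror `final_df`.

```python
import pandas as pd

# Invented scores for two placeholder models
demo_final = pd.DataFrame({
    "Model": ["Model A (Tuned)", "Model B (Tuned)"],
    "Test_R2": [0.90, 0.89],
    "Test_RMSE": [1500.0, 1550.0],
    "Overfit_Gap": [0.15, 0.03],
})

max_gap = 0.10  # assumed tolerance for Train R2 minus Test R2

# Prefer models within the tolerance; fall back to all models if none qualify
eligible = demo_final[demo_final["Overfit_Gap"] <= max_gap]
candidates = eligible if not eligible.empty else demo_final
best_row = candidates.loc[candidates["Test_R2"].idxmax()]

print(f"Selected: {best_row['Model']} (Test R2 = {best_row['Test_R2']:.2f})")
```

Model A scores slightly higher on Test R² but exceeds the gap tolerance, so this rule selects Model B.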

In [None]:
# Detailed model analysis
y_train_final = final_model.predict(X_train)
y_test_final = final_model.predict(X_test)

train_residuals = y_train - y_train_final
test_residuals = y_test - y_test_final

mape = np.mean(np.abs((y_test - y_test_final) / y_test)) * 100

print("\n" + "="*70)
print("FINAL MODEL DIAGNOSTICS")
print("="*70)
print(f"\nResiduals:")
print(f"  Train: Mean=₹{train_residuals.mean():,.0f}, Std=₹{train_residuals.std():,.0f}")
print(f"  Test:  Mean=₹{test_residuals.mean():,.0f}, Std=₹{test_residuals.std():,.0f}")
print(f"\nPrediction Range:")
print(f"  Actual:    ₹{y_test.min():,.0f} to ₹{y_test.max():,.0f}")
print(f"  Predicted: ₹{y_test_final.min():,.0f} to ₹{y_test_final.max():,.0f}")
print(f"\nMean Absolute Percentage Error: {mape:.2f}%")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Actual vs Predicted
axes[0,0].scatter(y_test, y_test_final, alpha=0.5, s=20, color='blue')
axes[0,0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
               'r--', lw=2, label='Perfect Prediction')
axes[0,0].set_xlabel('Actual Price (₹)', fontsize=12)
axes[0,0].set_ylabel('Predicted Price (₹)', fontsize=12)
axes[0,0].set_title('Actual vs Predicted', fontsize=14, fontweight='bold')
axes[0,0].legend()
axes[0,0].grid(alpha=0.3)

# Residuals distribution
axes[0,1].hist(test_residuals, bins=50, edgecolor='black', alpha=0.7, color='coral')
axes[0,1].axvline(0, color='red', linestyle='--', lw=2)
axes[0,1].set_xlabel('Residuals (₹)', fontsize=12)
axes[0,1].set_ylabel('Frequency', fontsize=12)
axes[0,1].set_title('Residuals Distribution', fontsize=14, fontweight='bold')
axes[0,1].grid(alpha=0.3)

# Residuals vs Predicted
axes[1,0].scatter(y_test_final, test_residuals, alpha=0.5, s=20, color='green')
axes[1,0].axhline(0, color='red', linestyle='--', lw=2)
axes[1,0].set_xlabel('Predicted Price (₹)', fontsize=12)
axes[1,0].set_ylabel('Residuals (₹)', fontsize=12)
axes[1,0].set_title('Residuals vs Predicted', fontsize=14, fontweight='bold')
axes[1,0].grid(alpha=0.3)

# Feature importance
if hasattr(final_model, 'feature_importances_'):
    importances = final_model.feature_importances_
    indices = np.argsort(importances)[-15:]
    axes[1,1].barh(range(len(indices)), importances[indices], color='steelblue', alpha=0.7)
    axes[1,1].set_yticks(range(len(indices)))
    axes[1,1].set_yticklabels([X.columns[i] for i in indices], fontsize=9)
    axes[1,1].set_xlabel('Importance', fontsize=12)
    axes[1,1].set_title('Top 15 Features', fontsize=14, fontweight='bold')
    axes[1,1].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('final_model_diagnostics.png', dpi=300, bbox_inches='tight')
plt.show()

**What this does:**  
Comprehensive diagnostics of the final production model.

**Diagnostic plots:**

1. **Actual vs Predicted**: Points should cluster around diagonal
   - Deviations show model weaknesses
   - Systematic bias indicates consistent errors

2. **Residuals Distribution**: Ideally roughly symmetric and centered at zero
   - Centered at zero = no systematic over- or under-prediction
   - Skewness = systematic bias
   - Fat tails = outliers

3. **Residuals vs Predicted**: Should show random scatter
   - Funnel shape = heteroscedasticity
   - Curve = non-linearity not captured
   - Random = good fit

4. **Feature Importance**: Top predictive features
   - Validates business logic
   - Guides feature engineering

**Quality indicators:**
- MAPE <10%: Excellent
- MAPE 10-20%: Good
- MAPE >20%: Needs improvement
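
**Sketch (illustrative):**  
MAPE and the quality bands above as small helpers. The function names and sample prices are hypothetical; skipping zero-valued actuals is an assumption added here to avoid division by zero, which the diagnostics cell does not do.

```python
import numpy as np

def mape_percent(y_true, y_pred):
    """Mean absolute percentage error, ignoring rows where the actual value is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)

def mape_band(value):
    """Map a MAPE value (in percent) to the quality bands listed above."""
    if value < 10:
        return "Excellent"
    if value <= 20:
        return "Good"
    return "Needs improvement"

# Made-up prices: errors of 10%, 10% and 5%
demo_mape = mape_percent([5000, 10000, 20000], [5500, 9000, 21000])
print(f"MAPE: {demo_mape:.2f}% -> {mape_band(demo_mape)}")
```

Because MAPE is relative, a ₹500 miss on a ₹5,000 fare counts as much as a ₹2,000 miss on a ₹20,000 fare.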

In [None]:
# Bias-Variance Analysis
print("="*70)
print("BIAS-VARIANCE TRADEOFF")
print("="*70)

train_r2 = r2_score(y_train, y_train_final)
test_r2 = r2_score(y_test, y_test_final)
gap = train_r2 - test_r2

print(f"\nTrain R²: {train_r2:.4f}")
print(f"Test R²:  {test_r2:.4f}")
print(f"Gap:      {gap:.4f} ({gap*100:.1f}%)")

# Check underfitting first: a small gap alone does not mean a good model
if train_r2 < 0.8 and test_r2 < 0.75:
    print("\n⚠ UNDERFITTING: High bias, Low variance")
    print("  Consider: More complex model, more features")
elif gap < 0.05:
    print("\n✓ OPTIMAL: Low bias, Low variance")
    print("  Model generalizes excellently")
elif train_r2 > 0.95 and test_r2 < 0.85:
    print("\n⚠ OVERFITTING: Low bias, High variance")
    print("  Consider: More regularization, simpler model")
else:
    print("\n✓ Good balance")

# Production readiness checklist
print(f"\n{'='*70}")
print("PRODUCTION READINESS CHECKLIST")
print(f"{'='*70}")

checks = [
    (test_r2 > 0.80, f"Test R² > 0.80: {test_r2:.4f}"),
    (best_model_rmse < y_test.std(), f"RMSE < Std: ₹{best_model_rmse:,.0f} < ₹{y_test.std():,.0f}"),
    (gap < 0.10, f"Overfitting < 10%: {gap*100:.1f}%"),
    (abs(test_residuals.mean()) < 1000, f"Bias < ₹1000: ₹{abs(test_residuals.mean()):,.0f}"),
    (mape < 15, f"MAPE < 15%: {mape:.2f}%")
]

passed = sum([c[0] for c in checks])
for status, criteria in checks:
    print(f"  {'✓' if status else '✗'} {criteria}")

print(f"\nPassed: {passed}/5 checks")
if passed >= 4:
    print("\n🎯 MODEL IS PRODUCTION-READY!")
else:
    print("\n⚠ Needs improvement before deployment")

**What this does:**  
Analyzes bias-variance tradeoff and production readiness.

**Bias-Variance:**
- High bias (underfitting): Too simple, poor on both sets
- High variance (overfitting): Too complex, great train, poor test  
- Optimal: Good on both with small gap

**Production criteria:**
1. Test R² > 0.80: Minimum explained variance targeted for this project
2. RMSE < Std Dev: Better than always predicting the mean price
3. Overfitting < 10%: Good generalization
4. Low bias: No systematic errors
5. MAPE < 15%: Acceptable accuracy

**Decision:**
- 5/5: Excellent, deploy confidently
- 4/5: Good, production-ready
- 3/5: Acceptable, monitor closely
- <3/5: Not ready
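
**Sketch (illustrative):**  
The five criteria and the decision table as reusable functions. The function names and input values are hypothetical; the thresholds are the ones listed above.

```python
def readiness_checks(test_r2, rmse, target_std, gap, mean_residual, mape):
    """Evaluate the five production criteria; returns {criterion: passed}."""
    return {
        "Test R2 > 0.80": test_r2 > 0.80,
        "RMSE < Std Dev": rmse < target_std,
        "Overfitting < 10%": gap < 0.10,
        "Bias < 1000": abs(mean_residual) < 1000,
        "MAPE < 15%": mape < 15,
    }

def readiness_decision(n_passed):
    """Map the number of passed checks to the decision table above."""
    if n_passed == 5:
        return "Excellent, deploy confidently"
    if n_passed == 4:
        return "Good, production-ready"
    if n_passed == 3:
        return "Acceptable, monitor closely"
    return "Not ready"

# Invented metrics: everything passes except the overfitting gap (0.12 >= 0.10)
demo_checks = readiness_checks(test_r2=0.88, rmse=1800, target_std=4500,
                               gap=0.12, mean_residual=-150, mape=12.5)
n_passed = sum(demo_checks.values())
print(f"Passed {n_passed}/5: {readiness_decision(n_passed)}")
```

Returning the per-criterion results makes it easy to log which check failed rather than only the total.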

---
## 11 - Save Model

Persist the final model for deployment.

In [None]:
# Create models directory
os.makedirs('models', exist_ok=True)

# Save using joblib (recommended for sklearn-based models)
model_filename = 'models/flight_price_predictor.pkl'
joblib.dump(final_model, model_filename)
print(f"✓ Model saved: {model_filename}")

# Save using pickle (alternative)
pickle_filename = 'models/flight_price_predictor_pickle.pkl'
with open(pickle_filename, 'wb') as f:
    pickle.dump(final_model, f)
print(f"✓ Pickle saved: {pickle_filename}")

# Save feature names
feature_names = X.columns.tolist()
with open('models/feature_names.pkl', 'wb') as f:
    pickle.dump(feature_names, f)
print(f"✓ Features saved: models/feature_names.pkl")

# Save label encoders
if label_encoders:
    with open('models/label_encoders.pkl', 'wb') as f:
        pickle.dump(label_encoders, f)
    print(f"✓ Encoders saved: models/label_encoders.pkl")

# Model metadata
metadata = {
    'model_name': best_model_name,
    'test_r2': float(best_model_r2),
    'test_rmse': float(best_model_rmse),
    'test_mae': float(final_df.loc[best_idx, 'MAE']),
    'mape': float(mape),
    'n_features': X.shape[1],
    'n_training_samples': len(X_train),
    'date_trained': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}

import json
with open('models/model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=4)
print(f"✓ Metadata saved: models/model_metadata.json")

print(f"\n{'='*70}")
print("MODEL PERSISTENCE COMPLETE")
print(f"{'='*70}")
print(f"\nSaved files:")
print(f"  - flight_price_predictor.pkl (main model)")
print(f"  - feature_names.pkl")
print(f"  - label_encoders.pkl")
print(f"  - model_metadata.json")

**What this does:**  
Saves the trained model and supporting artifacts for deployment.

**Why it's needed:**
- **Persistence**: Reuse model without retraining
- **Deployment**: Load in production API
- **Reproducibility**: Exact model state preserved
- **Metadata**: Track performance and configuration

**Saved files:**
- **Model**: Trained algorithm with parameters
- **Features**: Column names and order
- **Encoders**: Transform categorical inputs
- **Metadata**: Performance metrics and info

**Loading example:**
```python
model = joblib.load('models/flight_price_predictor.pkl')
predictions = model.predict(new_data)
```
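
**Sketch (illustrative):**  
Before calling `predict`, new data usually has to match the saved column order. A minimal way to align it with pandas is shown below; the column names are hypothetical, and filling a missing column with 0 is an assumption (in practice, raising an error may be safer).

```python
import pandas as pd

# Hypothetical column order, standing in for the saved feature_names list
feature_names_demo = ["Total_Stops", "Duration_mins", "Dep_hour"]

# New data arrives with a different order, one extra and one missing column
new_data_demo = pd.DataFrame({"Dep_hour": [9], "Total_Stops": [1], "Extra_col": [0]})

missing = [col for col in feature_names_demo if col not in new_data_demo.columns]
if missing:
    print(f"Warning: missing features filled with 0: {missing}")

# Reorder columns, drop extras, and add missing columns with a default value
aligned = new_data_demo.reindex(columns=feature_names_demo, fill_value=0)
print(aligned)
```

The aligned frame can then be passed to `model.predict` with columns in the order the model was trained on.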

In [None]:
# Test loading
print("\nTesting model load...")
loaded_model = joblib.load(model_filename)
test_predictions = loaded_model.predict(X_test[:5])
print(f"✓ Model loaded successfully!")
print(f"\nSample predictions:")
for i, (actual, pred) in enumerate(zip(y_test[:5].values, test_predictions), 1):
    print(f"  {i}. Actual: ₹{actual:,.0f} | Predicted: ₹{pred:,.0f} | Error: ₹{abs(actual-pred):,.0f}")

**What this does:**  
Validates that the saved model can be loaded and used for predictions.

**Why it's needed:**  
Ensures the deployment pipeline will work correctly by catching serialization issues before production.
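
**Sketch (illustrative):**  
A stricter round-trip check: confirm that the reloaded model reproduces the original predictions exactly, not just that it loads. The small linear model and temporary file below are stand-ins for the real artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + 5.0   # made-up target
original = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(original, path)
    restored = joblib.load(path)

round_trip_ok = bool(np.allclose(original.predict(X_demo), restored.predict(X_demo)))
print(f"Predictions identical after reload: {round_trip_ok}")
```

The same comparison can be applied to `final_model` and `loaded_model` on a few rows of `X_test`.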

---
## 12 - Future Work, Limitations & Recommendations

Insights for improvement and deployment considerations.

### 12.1 Model Performance Summary

In [None]:
print("="*70)
print("FINAL PROJECT SUMMARY")
print("="*70)
print(f"\nBest Model: {best_model_name}")
print(f"Test R² Score: {best_model_r2:.4f} ({best_model_r2*100:.2f}% variance explained)")
print(f"Test RMSE: ₹{best_model_rmse:,.2f}")
print(f"MAPE: {mape:.2f}%")
print(f"\nInterpretation:")
print(f"  - Model explains {best_model_r2*100:.1f}% of flight price variation")
print(f"  - Typical prediction error (RMSE): ₹{best_model_rmse:,.0f}")
print(f"  - Predictions within {mape:.1f}% of actual prices on average")
# Report readiness from the checklist result (`passed`) instead of asserting it unconditionally
if passed >= 4:
    print(f"\n✓ Model passed the readiness checklist and is suitable for deployment")
else:
    print(f"\n⚠ Model did not pass the readiness checklist; improve it before deployment")

**What this does:**  
Summarizes final model performance in business terms.

**Key achievements:**
- High R² indicates strong predictive power
- Low RMSE means accurate price predictions
- Low MAPE shows consistent relative accuracy

**Business value:**
- Airlines can optimize pricing strategies
- Travelers make informed booking decisions
- Travel agencies provide better recommendations

### 12.2 Key Learnings

In [None]:
print("="*70)
print("KEY LEARNINGS")
print("="*70)

learnings = [
    "1. Feature Engineering Impact",
    "   - Time-based features (hour, day, month) significantly improved predictions",
    "   - Duration and stops are the strongest price predictors",
    "   - Categorical encoding strategy matters (OneHot vs Label)",
    "",
    "2. Model Selection",
    "   - Tree-based models (RF, XGBoost, LightGBM) outperform linear models",
    "   - Gradient boosting methods achieve best performance on tabular data",
    "   - Ensemble methods reduce overfitting effectively",
    "",
    "3. Hyperparameter Tuning",
    "   - Random search efficiently explores parameter space",
    "   - Tuning provides 2-5% performance improvement",
    "   - Learning rate and tree depth are most impactful parameters",
    "",
    "4. Validation Strategy",
    "   - Cross-validation essential for robust evaluation",
    "   - A held-out test set gives an honest estimate of generalization",
    "   - Multiple metrics (R², RMSE, MAE) provide complete picture",
]

for learning in learnings:
    print(learning)

**What this does:**  
Documents key insights from the project.

**Why it's valuable:**
- **Knowledge transfer**: Share learnings with team
- **Future projects**: Apply best practices
- **Continuous improvement**: Build on successes

**Takeaways:**
- Feature engineering is critical
- Tree models excel on this data type
- Proper validation exposes overfitting before deployment

### 12.3 Limitations

In [None]:
print("\n" + "="*70)
print("LIMITATIONS")
print("="*70)

limitations = [
    "1. Data Limitations",
    "   - Historical data may not reflect current market dynamics",
    "   - Lacks real-time factors (fuel prices, demand surges, competitor pricing)",
    "   - Missing features: seat class, booking platform, customer demographics",
    "",
    "2. Model Limitations",
    "   - Assumes stable pricing patterns (may not handle market shocks)",
    "   - Black-box nature of tree models reduces interpretability",
    "   - Predictions extrapolate poorly beyond training data range",
    "",
    "3. Deployment Considerations",
    "   - Requires periodic retraining as market conditions change",
    "   - Feature engineering pipeline must be maintained",
    "   - Inference latency may be a concern for real-time applications",
]

for limitation in limitations:
    print(limitation)

**What this does:**  
Transparently documents model limitations.

**Why it's important:**
- **Realistic expectations**: Set appropriate use cases
- **Risk management**: Identify failure modes
- **Improvement roadmap**: Prioritize enhancements

**Key constraints:**
- Data freshness critical
- Model requires monitoring
- Edge cases need handling
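
**Sketch (illustrative):**  
Why tree-based predictions extrapolate poorly, on a toy example. The linear price rule and the single feature are assumptions made up to isolate the effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy rule: price = 1000 * x, observed only for x between 0 and 10
X_train_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_train_demo = 1000 * X_train_demo[:, 0]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train_demo, y_train_demo)

inside = forest.predict([[5.0]])[0]    # within the training range
outside = forest.predict([[20.0]])[0]  # far outside it; the toy rule gives 20000

print(f"x=5  -> predicted {inside:,.0f}")
print(f"x=20 -> predicted {outside:,.0f} (training maximum was {y_train_demo.max():,.0f})")
```

A forest averages training targets found in its leaves, so its output can never exceed the largest price seen during training, however extreme the input.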

### 12.4 Recommendations for Improvement

In [None]:
print("\n" + "="*70)
print("RECOMMENDATIONS FOR FUTURE WORK")
print("="*70)

recommendations = [
    "1. Feature Enhancements",
    "   • Add external data: oil prices, economic indicators, holiday calendars",
    "   • Include seat class, baggage options, loyalty program status",
    "   • Create interaction features (route × airline, time × season)",
    "   • Implement text analysis on route descriptions",
    "",
    "2. Model Improvements",
    "   • Experiment with stacking/blending ensemble methods",
    "   • Try deep learning (Neural Networks) for complex patterns",
    "   • Implement online learning for real-time price updates",
    "   • Use SHAP values for better model interpretability",
    "",
    "3. Deployment Enhancements",
    "   • Build REST API using FastAPI or Flask",
    "   • Implement A/B testing framework",
    "   • Add monitoring dashboard for prediction drift",
    "   • Set up automated retraining pipeline (MLOps)",
    "",
    "4. Business Applications",
    "   • Price alert system for travelers",
    "   • Dynamic pricing engine for airlines",
    "   • Competitive pricing analysis tool",
    "   • Demand forecasting integration",
]

for rec in recommendations:
    print(rec)

print("\n" + "="*70)
print("✓ FLIGHT PRICE PREDICTION PROJECT COMPLETE")
print("="*70)

**What this does:**  
Provides actionable recommendations for enhancement.

**Why it matters:**
- **Continuous improvement**: Never settle for "good enough"
- **Strategic planning**: Prioritize next iterations
- **Innovation opportunities**: Stay competitive

**Next steps:**
1. **Short-term**: Deploy API, monitor performance
2. **Medium-term**: Add features, retrain periodically
3. **Long-term**: Explore deep learning, real-time learning

**Impact:**
Each improvement cycle increases business value and competitive advantage.
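
**Sketch (illustrative):**  
One simple form the "monitor performance" step could take: compare the error on recent predictions with the error measured at training time and flag a retrain when it degrades. The function name, the 1.25 tolerance, and the sample values are all hypothetical.

```python
import numpy as np

def needs_retraining(baseline_mae, recent_actual, recent_pred, tolerance=1.25):
    """Return (flag, recent_mae); flag is True when recent MAE exceeds tolerance * baseline."""
    recent_actual = np.asarray(recent_actual, dtype=float)
    recent_pred = np.asarray(recent_pred, dtype=float)
    recent_mae = float(np.mean(np.abs(recent_actual - recent_pred)))
    return recent_mae > tolerance * baseline_mae, recent_mae

# Invented monitoring window: every prediction is off by 1500
flag, recent_mae = needs_retraining(
    baseline_mae=1000.0,
    recent_actual=[5000, 7000, 9000],
    recent_pred=[6500, 5500, 10500],
)
print(f"Recent MAE: {recent_mae:,.0f} | retrain: {flag}")
```

This requires actual prices to arrive after prediction; where they do not, monitoring input feature distributions is a common alternative.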

---
## Project Structure & Supporting Files

Complete GitHub repository structure for deployment.

In [None]:
print("="*70)
print("GITHUB PROJECT STRUCTURE")
print("="*70)

structure = '''
FlightPricePrediction/
│
├── data/
│   └── flight-fare.zip
│
├── notebooks/
│   └── Flight_Price_Prediction.ipynb
│
├── models/
│   ├── flight_price_predictor.pkl
│   ├── feature_names.pkl
│   ├── label_encoders.pkl
│   └── model_metadata.json
│
├── reports/
│   ├── EDA_Report.md
│   ├── model_comparison.png
│   └── final_model_diagnostics.png
│
├── src/
│   ├── __init__.py
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   └── prediction.py
│
├── tests/
│   └── test_model.py
│
├── requirements.txt
├── README.md
├── .gitignore
├── LICENSE
└── setup.py
'''

print(structure)

print("\n" + "="*70)
print("✓ Project structure documented")
print("="*70)

**What this does:**  
Defines professional project organization for GitHub.

**Why this structure:**
- **data/**: Raw and processed datasets
- **notebooks/**: Jupyter notebooks for exploration
- **models/**: Saved model artifacts
- **reports/**: Visualizations and analysis
- **src/**: Production-ready Python modules
- **tests/**: Unit and integration tests

**Best practices:**
- Separate exploration (notebooks) from production (src)
- Version control models and code, not large data files
- Include tests for reliability
- Document everything (README, comments)