## Airfare Price Prediction - README

### 1. Project Overview
#### Goal: Predict airfare prices using historical flight data and evaluate multiple regression models (OLS, Ridge, Lasso).

#### Task Summary:
- Data cleaning and feature engineering
- Data storage in an Azure Postgres database
- Exploratory Data Analysis (EDA)
- Model selection and validation
- Model diagnostics and visualizations

### 2. Project File Structure

- airfare_packages/ – Custom modules
    - __init__.py – Package initializer
    - airfare_classes.py – Class definitions (e.g., model wrappers)
    - airfare_etl.py – Data cleaning & preprocessing logic
    - airfare_modeling.py – Model training, tuning, evaluation
    - airfare_visualization.py – Plots, diagnostics, and visual displays

- data/ – Raw data
    - Scraped_dataset.csv – Raw source data for the project
- airfare_etl.ipynb – ETL workflow notebook
- airfare_eda.ipynb – Exploratory Data Analysis
- airfare_models.ipynb – Model building, training, and testing
- README.ipynb – Project overview and documentation
- requirements.txt – Packages needed to run the project


### 3. Data Description

### Source
- Kaggle Dataset: https://www.kaggle.com/datasets/yashdharme36/airfare-ml-predicting-flight-fares?select=Scraped_dataset.csv

### Dataset Info
#### Raw Data Set
- Rows: 452088
- Columns: 8

| Column Name         | Data Type         | Description                                                                 |
|---------------------|-------------------|-----------------------------------------------------------------------------|
| `Date of Booking`   | `date`            | The date on which the flight was booked                                     |
| `Date of Journey`   | `date`            | The scheduled departure date of the flight                                  |
| `Airline-Class`     | `string`          | Concatenated string with airline name, flight code, and class (e.g., `SpiceJet \nSG-8169\nECONOMY`) |
| `Departure Time`    | `string`          | Time and city of departure (e.g., `20:00\nDelhi`)                           |
| `Arrival Time`      | `string`          | Time and city of arrival (e.g., `22:05\nMumbai`)                            |
| `Duration`          | `string`          | Total flight duration in hours and minutes (e.g., `02h 05m`)                |
| `Total Stops`       | `string`          | Number of stops (e.g., non-stop, 1 stop, 2 stops)                           |
| `Price`             | `string`          | Ticket price with a thousands separator (e.g., `5,335`)                     |



#### Cleaned Data Set
- Rows: 452088
- Columns: 12

| Column Name         | Data Type         | Description                                                                 |
|---------------------|-------------------|-----------------------------------------------------------------------------|
| `date_of_booking`   | `datetime`        | The date on which the flight was booked                                    |
| `date_of_journey`   | `datetime`        | The scheduled departure date of the flight                                 |
| `airline`           | `string`          | Name of the airline (e.g., IndiGo, Air India)                              |
| `flight_code`       | `string`          | Code of the flight (e.g., AI202, 6E415)                                     |
| `class`             | `string`          | Travel class (e.g., Economy, Business)                                     |
| `connections`       | `int`             | Number of connections (0 = direct, 1+ = layovers)                          |
| `duration_minutes`  | `int`             | Duration of the flight in minutes                                          |
| `departure_time`    | `string` / `time` | Departure time (can be parsed into time or used to extract hour features) |
| `source_city`       | `string`          | City from which the flight departs                                         |
| `arrival_time`      | `string` / `time` | Arrival time (can be parsed similarly to departure_time)                   |
| `destination_city`  | `string`          | City at which the flight arrives                                           |
| `price`             | `float`           | **Target variable**: ticket price                                           |

### 4. Data Processing
#### Raw Data Cleaning
- `Airline-Class` → `airline`, `flight_code`, `class`
    - Parsed `Airline-Class` into the `airline`, `flight_code`, and `class` columns
- `Date of Booking`, `Date of Journey` → `date_of_booking`, `date_of_journey`
    - Converted columns to 'd/m/y' format
- `Total Stops` → `connections`
    - Converted text to integers
        - Ex: '1-Stop' → 1
- `Duration` → `duration_minutes`
    - Converted hours and minutes to total minutes
        - Ex: '02h 05m' → 125
- `Departure Time` → `departure_time`, `source_city`
    - Parsed `Departure Time` into the `departure_time` and `source_city` columns
        - Converted `departure_time` to 'H:M' format
- `Arrival Time` → `arrival_time`, `destination_city`
    - Parsed `Arrival Time` into the `arrival_time` and `destination_city` columns
        - Converted `arrival_time` to 'H:M' format
- `Price` → `price`
    - Converted to float
        - Ex: '5,335' → 5335.0
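
The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the code in `airfare_etl.py`: the function name `clean_raw`, the sample rows (including the second flight code), the `%d/%m/%Y` date format, and the assumption that the raw fields are separated by real newline characters are all assumptions made for the example.

```python
import pandas as pd

# Two made-up rows shaped like the raw columns described above (not real data).
raw = pd.DataFrame({
    "Date of Booking": ["15/01/2023", "15/01/2023"],
    "Date of Journey": ["16/01/2023", "20/01/2023"],
    "Airline-Class": ["SpiceJet \nSG-8169\nECONOMY", "Air India \nAI-805\nBUSINESS"],
    "Departure Time": ["20:00\nDelhi", "06:30\nDelhi"],
    "Arrival Time": ["22:05\nMumbai", "11:45\nKolkata"],
    "Duration": ["02h 05m", "05h 15m"],
    "Total Stops": ["non-stop", "1-stop"],
    "Price": ["5,335", "42,220"],
})


def clean_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning rules listed above."""
    out = pd.DataFrame(index=df.index)

    # Assumed day/month/year input format.
    out["date_of_booking"] = pd.to_datetime(df["Date of Booking"], format="%d/%m/%Y")
    out["date_of_journey"] = pd.to_datetime(df["Date of Journey"], format="%d/%m/%Y")

    # "airline \ncode\nclass" -> three columns.
    parts = df["Airline-Class"].str.split("\n", expand=True)
    out["airline"] = parts[0].str.strip()
    out["flight_code"] = parts[1].str.strip()
    out["class"] = parts[2].str.strip()

    # First integer in the text; values without a digit (non-stop) become 0.
    stops = df["Total Stops"].str.extract(r"(\d+)", expand=False)
    out["connections"] = stops.fillna("0").astype(int)

    # "02h 05m" -> 2 * 60 + 5 = 125.
    hm = df["Duration"].str.extract(r"(\d+)h\s*(\d+)m").astype(int)
    out["duration_minutes"] = hm[0] * 60 + hm[1]

    # "HH:MM\nCity" -> time string and city.
    dep = df["Departure Time"].str.split("\n", expand=True)
    out["departure_time"] = dep[0].str.strip()
    out["source_city"] = dep[1].str.strip()
    arr = df["Arrival Time"].str.split("\n", expand=True)
    out["arrival_time"] = arr[0].str.strip()
    out["destination_city"] = arr[1].str.strip()

    # "5,335" -> 5335.0
    out["price"] = df["Price"].str.replace(",", "", regex=False).astype(float)
    return out


clean = clean_raw(raw)
print(clean.dtypes)
```

If the raw file stores the separator as a literal backslash followed by `n` rather than a newline character, the split pattern would need to change accordingly.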

#### ML Data Prep
- Dropped Columns
    - `flight_code`: removed as it provides little predictive value or is redundant.
- Categorical Variables → One-Hot Encoded
    - `airline`, `class`, `source_city`, `destination_city`, and any other non-date/non-numeric columns.
    - One-hot encoding is applied with `drop_first=True` to avoid multicollinearity.
- Datetime Variables → Transformed
    - `date_of_booking`, `date_of_journey`:
        - `year` (numerical)
        - `day_of_week` (Monday–Sunday as category)
        - `is_weekend` (1 if Saturday/Sunday, else 0)
        - `day_of_month_category` (categorical bucketization, via the `day_of_month_category` function)
- Time Variables → Transformed
    - `departure_time`, `arrival_time`:
        - `time_of_day_category` (e.g., morning, afternoon; via the `time_of_day_category` function)
        - `minute_category` (bucketed, via the `minute_category` function)
- Feature Additions
    - Day-of-month category (e.g., beginning, mid, end of month)
    - Day-of-week name as a categorical
    - Weekend flag (binary)
    - Time-of-day and minute buckets for time-based features
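
A rough sketch of the date, time, and one-hot steps described above, using pandas. The bucket boundaries inside `day_of_month_category` and `time_of_day_category` are assumptions chosen for illustration (the project's own functions may cut differently), the helpers `add_date_features` and `add_time_features` and the output column names are hypothetical, and the two sample rows are made up. Minute buckets are left out to keep the example short.

```python
import pandas as pd


def day_of_month_category(day: int) -> str:
    # Assumed cut points; the project's function may use different ones.
    if day <= 10:
        return "beginning"
    if day <= 20:
        return "mid"
    return "end"


def time_of_day_category(hour: int) -> str:
    # Assumed cut points; the project's function may use different ones.
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def add_date_features(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical helper: derive calendar features from a datetime column."""
    out = df.copy()
    d = out[col]
    out[f"{col}_year"] = d.dt.year
    out[f"{col}_day_of_week"] = d.dt.day_name()
    out[f"{col}_is_weekend"] = (d.dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6
    out[f"{col}_day_of_month_category"] = d.dt.day.map(day_of_month_category)
    return out.drop(columns=[col])


def add_time_features(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical helper: bucket an 'H:M' string column by hour."""
    out = df.copy()
    hour = pd.to_datetime(out[col], format="%H:%M").dt.hour
    out[f"{col}_time_of_day"] = hour.map(time_of_day_category)
    return out.drop(columns=[col])


# Made-up sample rows: 2023-01-16 is a Monday, 2023-01-21 is a Saturday.
flights = pd.DataFrame({
    "date_of_journey": pd.to_datetime(["2023-01-16", "2023-01-21"]),
    "departure_time": ["20:00", "06:30"],
    "class": ["ECONOMY", "BUSINESS"],
    "price": [5335.0, 42220.0],
})

features = add_date_features(flights, "date_of_journey")
features = add_time_features(features, "departure_time")

categorical = [
    "class",
    "date_of_journey_day_of_week",
    "date_of_journey_day_of_month_category",
    "departure_time_time_of_day",
]
# drop_first=True drops one level per variable (the first in sorted order).
encoded = pd.get_dummies(features, columns=categorical, drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True`, the dropped level of each variable becomes the baseline that the remaining dummy coefficients are measured against.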






### 5. Modeling Techniques
- Function: `produce_forward_selection_lms`
    - Performs forward stepwise feature selection up to `max_features`
    - Scored using adjusted R² across `cv_folds` folds
- Ridge Regression
    - Fits a Ridge regression over the penalties in `alpha_list` using `RidgeCV`
- Lasso Regression
    - Fits a Lasso regression over the penalties in `alpha_list` using `LassoCV`
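
The three techniques above could look roughly like this with scikit-learn. This is a sketch on synthetic data, not the implementation in `airfare_modeling.py`: `forward_select`, `cv_adjusted_r2`, and `adjusted_r2` are hypothetical helpers, the signature of `produce_forward_selection_lms` is not reproduced here, and the values in `alpha_list` are placeholders.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import KFold


def adjusted_r2(r2: float, n: int, p: int) -> float:
    # Penalize R² for the number of predictors p given n observations.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def cv_adjusted_r2(X, y, cols, cv_folds):
    # Mean adjusted R² of an OLS fit on the given columns across folds.
    scores = []
    folds = KFold(n_splits=cv_folds, shuffle=True, random_state=0)
    for train, test in folds.split(X):
        model = LinearRegression().fit(X[train][:, cols], y[train])
        r2 = model.score(X[test][:, cols], y[test])
        scores.append(adjusted_r2(r2, len(test), len(cols)))
    return float(np.mean(scores))


def forward_select(X, y, max_features, cv_folds=5):
    # Greedily add the column that gives the best cross-validated score.
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = {j: cv_adjusted_r2(X, y, selected + [j], cv_folds) for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data: only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected = forward_select(X, y, max_features=2)

alpha_list = [0.01, 0.1, 1.0, 10.0]  # placeholder penalty grid
ridge = RidgeCV(alphas=alpha_list).fit(X, y)
lasso = LassoCV(alphas=alpha_list, cv=5).fit(X, y)
print(selected, ridge.alpha_, lasso.alpha_)
```

Both `RidgeCV` and `LassoCV` expose the chosen penalty as `alpha_` after fitting, which makes it easy to check whether the best value sits at the edge of the grid and the grid should be widened.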


### 6. Evaluation Metrics

#### Core Metrics
| Metric    | Description                                                                                  | Goal              |
|-----------|----------------------------------------------------------------------------------------------|-------------------|
| `adj_r2`  | **Adjusted R-squared**: measures model fit while penalizing for the number of predictors     | Higher is better  |
| `mse`     | **Mean Squared Error**: average of squared prediction errors                                 | Lower is better   |
| `rmse`    | **Root Mean Squared Error**: square root of MSE; interpretable in the same units as the target | Lower is better |
| `mae`     | **Mean Absolute Error**: average of absolute prediction errors                               | Lower is better   |
| `aic`     | **Akaike Information Criterion**: trade-off between model complexity and goodness-of-fit     | Lower is better   |
| `bic`     | **Bayesian Information Criterion**: similar to AIC but with a stronger penalty for complexity | Lower is better  |
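
For reference, the metrics in the table can be computed from predictions along these lines. The function `regression_metrics` is a hypothetical helper, and the AIC/BIC expressions assume the common Gaussian form with constants dropped, `n * ln(SSE / n) + 2k` and `n * ln(SSE / n) + k * ln(n)`, with `k` counting the predictors plus an intercept; the project's own code may use a different convention.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_predictors: int) -> dict:
    """Hypothetical helper returning the six metrics listed in the table."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    resid = y_true - y_pred

    sse = float(np.sum(resid ** 2))                      # sum of squared errors
    sst = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    mse = sse / n
    r2 = 1 - sse / sst
    k = n_predictors + 1  # assumed: predictors plus intercept

    return {
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(resid))),
        "aic": n * float(np.log(mse)) + 2 * k,
        "bic": n * float(np.log(mse)) + k * float(np.log(n)),
    }


# Tiny made-up example with one predictor.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_predictors=1)
print(metrics)
```

Because AIC and BIC depend on the sample size and on which constants are kept, they are only comparable between models evaluated on the same data with the same formula.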

#### Final Model Selection
- The final model was selected based on the **highest average ranking across all evaluation metrics**
- This approach ensures the chosen model offers strong overall performance, balancing **accuracy**, **interpretability**, and **generalizability**.
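
One way to implement an average-rank selection like the one described above is sketched below with pandas. The model names and scores are invented purely to show the mechanics and are not the project's results; it also assumes that the "highest" ranking corresponds to the lowest mean rank number, with rank 1 being best on each metric.

```python
import pandas as pd

# Invented scores for three placeholder models (not project results).
scores = pd.DataFrame(
    {
        "adj_r2": [0.70, 0.90, 0.80],
        "rmse": [30.0, 10.0, 20.0],
        "mae": [25.0, 12.0, 8.0],
    },
    index=["model_a", "model_b", "model_c"],
)

higher_is_better = {"adj_r2"}

# Rank 1 is best on every metric: descending for adj_r2, ascending for errors.
ranks = pd.DataFrame({
    col: scores[col].rank(ascending=col not in higher_is_better)
    for col in scores.columns
})

average_rank = ranks.mean(axis=1)
best_model = average_rank.idxmin()
print(average_rank.sort_values())
print("selected:", best_model)
```

Here `model_c` wins on `mae`, but `model_b` has the better average rank across all three metrics, which is the behavior the selection rule is meant to capture.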


### 7. Best Performing Model
- `RidgeCV` model
    - Metrics
        - `mse`: 66208685.28
        - `adj_r2`: 0.84
        - `rmse`: 8136.87
        - `mae`: 5306.90
        - `aic`: 814238.23
        - `bic`: 814674.19

    - Important Features (by coefficient magnitude)
        - `class_ECONOMY`
        - `class_PREMIUMECONOMY`
        - `class_FIRST`
        - `connections`
        - `destination_city_Kolkata`


### 8. Conclusion
- Model Diagnostics
    - The best-performing model achieved a respectable Adjusted R² of approximately 0.84, indicating a strong overall fit.
    - However, diagnostic plots reveal systematic issues in the residuals, particularly at the extremes of the predicted values.
    - Specifically, the variance of the residuals is not constant across fitted values, which violates the homoscedasticity assumption and suggests the model struggles to make consistent predictions across the full price range.
- Potential Improvements
    - Transformation of the target variable (e.g., log transformation)
    - Tree-based models (e.g., random forest) or GLMs
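
As an illustration of the first suggestion, a log-transformed target can be wrapped around a Ridge model with scikit-learn's `TransformedTargetRegressor`. This is a sketch on synthetic data with multiplicative noise, not something fitted in this project; the choice of `log1p`/`expm1`, the `alpha` value, and the data-generating process are assumptions for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, strictly positive target whose spread grows with its level,
# loosely mimicking the heteroscedastic pattern described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.exp(8 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=300))

# The regressor is fit on log1p(y); predictions are mapped back with expm1.
log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X, y)

pred = log_model.predict(X)  # predictions are on the original price scale
print(round(log_model.score(X, y), 3))
```

Fitting on the log scale turns multiplicative errors into additive ones, which often stabilizes the residual variance; whether it helps on the airfare data would need to be checked against the same diagnostics.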