## Dataset
The dataset has 11 columns: 10 input features and the target variable (Price).

1. **Airline**: The name of the airline company. It is a categorical feature with 6 distinct airlines.
2. **Flight**: The flight code of the plane. It is a categorical feature.
3. **Source City**: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4. **Departure Time**: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5. **Stops**: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6. **Arrival Time**: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7. **Destination City**: City where the flight will land. It is a categorical feature having 6 unique cities.
8. **Class**: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9. **Duration**: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10. **Days Left**: This is a derived feature calculated by subtracting the booking date from the trip date.
11. **Price**: The target variable; it stores the ticket price.
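
The two derived features described above (Days Left and the binned departure time) could be produced roughly as follows. This is only a sketch: the raw column names (`booking_date`, `trip_date`, `dep_hour`), the bin edges, and the bin labels are assumptions made for illustration, since the dataset used here ships with these features already computed.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
raw = pd.DataFrame({
    "booking_date": pd.to_datetime(["2022-02-10", "2022-02-10"]),
    "trip_date": pd.to_datetime(["2022-02-11", "2022-03-01"]),
    "dep_hour": [5, 19],
})

# Days Left: trip date minus booking date, expressed in whole days.
raw["days_left"] = (raw["trip_date"] - raw["booking_date"]).dt.days

# Departure Time: group the hour of day into six labelled bins (assumed edges and labels).
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]
raw["departure_time"] = pd.cut(raw["dep_hour"], bins=edges, labels=labels, right=False)

print(raw[["days_left", "departure_time"]])
```

With `right=False` each bin includes its left edge, so an hour of 4 falls into the second bin rather than the first.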

In [1]:
# import libraries needed for exploratory data analysis (eda) and feature engineering (fe)
import os
import time
import datetime
import pandas as pd
pd.set_option('display.max_columns',None)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# import libraries needed for model training
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor, ExtraTreesRegressor, BaggingRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from catboost import CatBoostRegressor
from xgboost import XGBRegressor


In [None]:
pd.set_option('display.max_columns',None) #display all possible columns
for dirname, _, filenames in os.walk('../data'): 
    for filename in filenames:
        print(os.path.join(dirname, filename)) #list all files in the data directory

In [None]:
df=pd.read_csv('../data/clean_dataset.csv') #load data into dataframe
df.head(5) #display head (top 5 rows)

In [None]:
df.tail(5) #display tail (last 5 rows)

In [None]:
print("Shape:", df.shape) #get total shape of dataset, total rows and columns
print("Number of Columns:", df.shape[1])
print("Number of Rows:", df.shape[0])

In [None]:
df.info() #quick info about data

In [None]:
df.describe().transpose() #statistics for numerical datatypes

In [8]:
df.drop('Unnamed: 0',axis=1, inplace = True) #drop unwanted column permanently

In [None]:
df.isna().sum() #number of missing values per column

In [None]:
df = df.dropna() #drop rows with any NA values (dropna returns a new dataframe, so reassign it)

In [None]:
print("Number of Duplicates: ", df.duplicated().sum())

In [None]:
df = df.drop_duplicates() #drop duplicate rows (drop_duplicates returns a new dataframe, so reassign it)

In [None]:
df.nunique() #number of unique values in each column

In [None]:
df.columns #show all columns

In [None]:
numerical_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

print('Numerical Features : {} : {}'.format(len(numerical_features), numerical_features))
print('Categorical Features : {} : {}'.format(len(categorical_features), categorical_features))


In [None]:
#get unique values in categorical columns
for column in categorical_features:
    unique_values = df[column].unique()
    print(f"Unique values in column '{column}': {unique_values}")

In [None]:
df.info() #quick info about data

In [None]:
df.describe().transpose() #statistics for numerical datatypes

In [19]:
x = df.drop(columns=['price']) #dataframe with all columns used as predictors
y = df['price'] #series with the target to be predicted


In [None]:
print(x.head())
print(type(x)) #datatype of x

In [None]:
print(y.head())
print(type(y)) #datatype of y

## Data Encoding & Feature Scaling

**Encoding** transforms **categorical data** into numerical representations that machine learning algorithms can work with.
Common types of encoding:
1. One-Hot Encoding (OHE)
   Good for categories with no inherent order or relationship. Each category is represented as a binary vector. This is the most widely used technique.
2. Label Encoding
   Each category is assigned an integer value. Suitable for features with two distinct categories (e.g. Business and Economy class); with more categories the integers imply an order that may not exist.
3. Ordinal Encoding
   Similar to label encoding, but an explicit mapping can be provided for the integer assignments, which suits ordered categories (e.g. education degree or t-shirt size).
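
To make the difference between these encodings concrete, here is a small sketch on a toy dataframe. The columns `city` and `degree` and their values are made up for illustration and are not part of the flight dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data: 'city' has no natural order, 'degree' does.
toy = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi"],
    "degree": ["Bachelor", "High School", "Master"],
})

# One-hot encoding: one binary column per category (categories are sorted alphabetically).
ohe = OneHotEncoder()
city_encoded = ohe.fit_transform(toy[["city"]]).toarray()

# Ordinal encoding: the explicit category order decides which integer each value gets.
ordinal = OrdinalEncoder(categories=[["High School", "Bachelor", "Master"]])
degree_encoded = ordinal.fit_transform(toy[["degree"]])

print(ohe.categories_[0], city_encoded, sep="\n")
print(degree_encoded.ravel())
```

The one-hot output has as many columns as there are distinct cities, while the ordinal output stays a single column whose values respect the order given in `categories`.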

Scaling brings numerical features onto a comparable range so that features with large values do not dominate those with small values. StandardScaler is one of the most commonly used scalers for numerical features.

Standardization is a data preparation method that involves adjusting the input (features) by first centering them (subtracting the mean from each data point) and then dividing them by the standard deviation, resulting in the data having a mean of 0 and a standard deviation of 1.

**StandardScaler** standardizes the input data so that all features share a comparable scale, which is crucial for machine learning algorithms that are sensitive to differences in feature scales.
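
As a quick check of the standardization described above, the following sketch compares a manual computation (subtract the mean, divide by the standard deviation) with `StandardScaler` on a made-up column of three durations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature (e.g. durations in hours), shaped as one column.
durations = np.array([[2.0], [4.0], [6.0]])

# Manual standardization: center on the mean, then divide by the (population) standard deviation.
manual = (durations - durations.mean(axis=0)) / durations.std(axis=0)

# StandardScaler performs the same computation.
scaled = StandardScaler().fit_transform(durations)

print(manual.ravel())
print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

Both results agree, and the scaled column has a mean of 0 and a standard deviation of 1.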


**ColumnTransformer** allows different features of the input to be transformed separately; the features generated by each transformer are concatenated to form a single feature space.
For example, here one-hot encoding is applied to the categorical features and StandardScaler to the numerical features.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numerical_features = x.select_dtypes(exclude="object").columns
categorical_features = x.select_dtypes(include="object").columns

numerical_transformer = StandardScaler()
ohe_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", ohe_transformer, categorical_features),
         ("StandardScaler", numerical_transformer, numerical_features),
    ]
)
X = preprocessor.fit_transform(x) #fit the preprocessor on x and store the transformed data in X
print(f"Shape of original data (x): {x.shape}")
print(f"Shape of transformed data (X): {X.shape}")


## Split Training & Test Data

- Data needs to be split into training and test sets. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- x is the original dataset of independent features.
- X is the encoded dataset of independent features.
- y is the dependent variable that needs to be predicted.
- fit_transform() is applied to the training dataset.
- transform() is applied to the test dataset.
- The fit() method calculates the mean and variance of each feature present in the data.
- The transform() method transforms all the features using the respective mean and variance.
- The fit_transform() method is used on the training data so that the scaling parameters are learned from that data and applied to it in one step.
- To avoid bias (data leakage), the test data is never fitted; it is only transformed with the parameters learned from the training data.
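
The fit/transform rule in the list above can be sketched on synthetic data. The arrays `features` and `target` below are random placeholders, not the flight dataset, and a fixed `random_state` is used only to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: 100 rows, 2 numerical features.
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
target = rng.normal(size=100)

# Split first, so that the test rows play no part in fitting the scaler.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)  # learns mean_ and var_ from the training rows only
f_test_scaled = scaler.transform(f_test)        # reuses the training statistics

print(f_train_scaled.shape, f_test_scaled.shape)
print(scaler.mean_)
```

The scaled training columns have a mean of exactly 0, while the scaled test columns are only close to 0, because they were centered with the training mean.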


In [None]:
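from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=None) #using 20% to test and 80% for training.
ohe_transformer.set_params(handle_unknown="ignore") #categories that appear only in the test rows are encoded as all zeros
X_train = preprocessor.fit_transform(x_train) #learn encoding and scaling parameters from the training rows only
X_test = preprocessor.transform(x_test) #reuse the training parameters on the test rows
print(f"Shape of training data : {X_train.shape}")
print(f"Shape of test data : {X_test.shape}")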

## Regression Model Performance Metrics

### MAE (Mean Absolute Error)
MAE is the average absolute difference between predicted and actual values, expressed in the same units as the target. The smaller the MAE, the better the model’s predictions align with the actual data.

### MSE (Mean Squared Error)
Mean squared error (MSE) measures the amount of error in statistical models. It assesses the average squared difference between the observed and predicted values. When a model has no error, the MSE equals zero. As model error increases, its value increases. The mean squared error is also known as the mean squared deviation (MSD).

### RMSE (Root Mean Square Error)
The root mean square error (RMSE) is the square root of the MSE and measures the typical difference between a statistical model’s predicted values and the actual values, in the same units as the target. When the residuals have zero mean, it equals their standard deviation. Residuals are the differences between the observed values and the model’s predictions. A value of 0 means that the predicted values perfectly match the actual values, which is rarely seen in practice. Lower RMSE values indicate that the model fits the data better and makes more precise predictions.
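
The three error metrics described so far (MAE, MSE and RMSE) can be worked out by hand on a tiny example. The `actual` and `predicted` prices below are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted prices.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 400.0])

errors = actual - predicted        # [-10, 10, -30, 0]
mae = np.mean(np.abs(errors))      # (10 + 10 + 30 + 0) / 4 = 12.5
mse = np.mean(errors ** 2)         # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                # square root of the MSE

print(mae, mse, round(rmse, 2))

# The scikit-learn helpers give the same values.
print(mean_absolute_error(actual, predicted), mean_squared_error(actual, predicted))
```

The RMSE is larger than the MAE here because squaring gives the single error of 30 more weight than the two errors of 10.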


### R-Squared (R²)
R-Squared (R²) is a statistical measure of the proportion of variance in the dependent variable that can be predicted or explained by the independent variables.
In other words, R-Squared shows how well a regression model (built on the independent variables) predicts the observed outcome (the dependent variable).
R-Squared is also commonly known as the coefficient of determination. It is a goodness-of-fit measure for linear regression analysis. Higher R-squared values suggest a better fit, but they do not necessarily mean the model is a good predictor in an absolute sense.
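
As a small worked example, R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted prices.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 400.0])

ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation: 1100
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean: 50000
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(actual, predicted))
```

A model that always predicted the mean of `actual` would have `ss_res == ss_tot` and therefore an R² of 0.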

### Adjusted R-Squared (R²)
Adjusted R-squared addresses a limitation of R-squared, especially in multiple regression (models with more than one independent variable): plain R-squared does not decrease when more variables are added to a least-squares fit, even if they are irrelevant. Unlike R-squared, adjusted R-squared penalizes the addition of unnecessary variables.
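
The penalty can be seen in the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The helper name `adjusted_r2` and the sample values below are illustrative assumptions.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R² for a score computed on n_samples rows with n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R² and sample size, different numbers of predictors.
few_predictors = adjusted_r2(0.90, n_samples=100, n_features=10)
many_predictors = adjusted_r2(0.90, n_samples=100, n_features=50)

print(round(few_predictors, 4), round(many_predictors, 4))
```

For the same R² of 0.90, the model with 50 predictors ends up with a noticeably lower adjusted score than the one with 10. Note that `n_samples` should be the number of rows on which R² was actually computed.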

In [None]:
#Initialise dataframe for Regression Performance Metrics
performance_metrics={
    'Model Name':[], 
    'MAE':[] ,
#    'MSE':[] ,
    'RMSE':[] ,
    'R2 Score':[],
    'Adjusted R2 Score':[] ,
    'Training Duration':[],
    'Predection Duration':[],
    'Evaluation Duration':[]
    }
df_ModelPerformance=pd.DataFrame(performance_metrics)
print(type(df_ModelPerformance))
df_ModelPerformance.head()

In [25]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error

#Define a function to evaluate model
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = root_mean_squared_error(true, predicted)
    r2_square = r2_score(true, predicted)
    return mae, mse, rmse, r2_square

In [None]:
#Define Models

models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(),
    "Bagging": BaggingRegressor(),
    "ExtraTrees": ExtraTreesRegressor(),
    #"SVR": SVR(),
    #"K-Neighbors": KNeighborsRegressor(n_neighbors=5),
    "Random Forest": RandomForestRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "XGBRegressor": XGBRegressor(), 
    "Gradient Boost": GradientBoostingRegressor(),
    "CatBoosting": CatBoostRegressor(verbose=False),
    "AdaBoost": AdaBoostRegressor()
}

for key, value in models.items():
    model_name = key
    model = value
    test_performance_metrics = {}

    print('-'*80)
    
    t1=time.time()
    print(f'{datetime.datetime.fromtimestamp(t1).strftime("%Y-%m-%d %H:%M:%S")} - {model_name} - performing training')
    model.fit(X_train, y_train) # Training the Model with training dataset

    # Predicting Values of test dataset
    
    t2=time.time()
    #print(f'{datetime.datetime.fromtimestamp(t2).strftime("%Y-%m-%d %H:%M:%S")} - {model_name} - predecting training dataset')
    #y_train_pred = model.predict(X_train)
    
    t3=time.time()
    print(f'{datetime.datetime.fromtimestamp(t3).strftime("%Y-%m-%d %H:%M:%S")} - {model_name} - predicting test dataset')
    y_test_pred = model.predict(X_test)
    
    # Evaluating Model Performance
    
    t4=time.time()
    #print(f'{datetime.datetime.fromtimestamp(t4).strftime("%Y-%m-%d %H:%M:%S")} - {model_name} - evaluating performance of training dataset')
    #model_train_mae ,model_train_mse, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    
    
    t5=time.time()
    print(f'{datetime.datetime.fromtimestamp(t5).strftime("%Y-%m-%d %H:%M:%S")} - {model_name} - evaluating performance of test dataset')
    model_test_mae ,model_test_mse, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)    
    
    t6=time.time()
    #model_train_adjusted_r2 = (1 - (1-model_train_r2)*(len(y)-1)/(len(y)-x.shape[1]-1))
    #model_train_mae = round(model_train_mae,2)
    ##model_train_mse = round(model_train_mse,2)
    #model_train_rmse = round(model_train_rmse,2)
    #model_train_r2 = round(model_train_r2,2)
    #model_train_adjusted_r2 = round(model_train_adjusted_r2,2)
    #model_train_duration = round(float(t2-t1),2)
    #model_train_pred_duration = round(float(t3-t2),2)
    #model_train_eval_duration = round(float(t5-t4),2)

    model_test_adjusted_r2 = (1 - (1-model_test_r2)*(len(y_test)-1)/(len(y_test)-x.shape[1]-1)) #n is the number of test rows, since R2 was computed on the test set
    model_test_mae = round(model_test_mae,2)
    #model_test_mse = round(model_test_mse,2)
    model_test_rmse = round(model_test_rmse,2)
    model_test_r2 = round(model_test_r2,2)
    model_test_adjusted_r2 = round(model_test_adjusted_r2,2)
    model_test_duration = round(float(t2-t1),2) #training time: fit started at t1 and finished at t2
    model_test_pred_duration = round(float(t4-t3),2)
    model_test_eval_duration = round(float(t6-t5),2)
    
    
    #train_performance_metrics=pd.DataFrame({'Model Name':f'{model_name} (Train)', 
    #                                    'MAE':[model_train_mae] ,
    #                                    #'MSE':[model_train_mse] ,
    #                                    'RMSE':[model_train_rmse] ,
    #                                    'R2 Score':[model_train_r2],
    #                                    'Adjusted R2 Score':[model_train_adjusted_r2],
    #                                    'Training Duration':[model_train_duration],
    #                                    'Predection Duration':[model_train_pred_duration],
    #                                    'Evaluation Duration':[model_train_eval_duration]
    #                                    })

    test_performance_metrics=pd.DataFrame({'Model Name':f'{model_name} (Test)', 
                                        'MAE':[model_test_mae] ,
                                        #'MSE':[model_test_mse] ,
                                        'RMSE':[model_test_rmse] ,
                                        'R2 Score':[model_test_r2],
                                        'Adjusted R2 Score':[model_test_adjusted_r2],
                                         'Training Duration':[model_test_duration],
                                        'Predection Duration':[model_test_pred_duration],
                                        'Evaluation Duration':[model_test_eval_duration]
                                        })

    #df_ModelPerformance = pd.concat([train_performance_metrics,df_ModelPerformance], ignore_index=True)
    df_ModelPerformance = pd.concat([test_performance_metrics,df_ModelPerformance], ignore_index=True)
print('-'*80)

In [None]:
pd.set_option('display.max_columns',None)

os.makedirs('../outputs', exist_ok=True) #make sure the output directory exists before writing
filepath = f'../outputs/{time.strftime("%Y%m%d_%H%M%S")}_ModelPerformance.csv'
df_ModelPerformance.to_csv(filepath)
df_ModelPerformance

#df_ModelPerformance.drop(df_ModelPerformance.tail(1).index,inplace=True)