In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [4]:
data_path = 'merge.csv'  # renamed from `dir` to avoid shadowing the built-in dir()
df = pd.read_csv(data_path)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18451 entries, 0 to 18450
Data columns (total 24 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Year                                 18451 non-null  int64  
 1   Quarter                              18451 non-null  int64  
 2   Total_Mkt_Fare                       18451 non-null  float64
 3   depart_city                          18451 non-null  int64  
 4   arrival_city                         18451 non-null  int64  
 5   airlineID                            18451 non-null  int64  
 6   Passengers_by_Carrier                18451 non-null  int64  
 7   CarriersMktShare                     18451 non-null  float64
 8   CarrierAvgFare                       18451 non-null  float64
 9   Carrier_MinFareIncrement             18451 non-null  int64  
 10  CarrierMinPassangerShare             18451 non-null  float64
 11  CarrierMaxFareIncrement     

### Feature Details:
To build a predictive model for customer ticket pricing in the domestic US airline market, we selected variables for their expected effect on fares and their relevance to the airline industry. Together they cover market demand, operational costs, competitive positioning, and external environmental conditions.

Year and quarter variables were integrated into the model to account for temporal fluctuations in travel demand and market conditions. Understanding the seasonal variations in ticket pricing is crucial, as fares tend to fluctuate in response to changes in travel patterns, such as heightened demand during peak seasons and holidays.

Departure and arrival cities were included as variables to account for geographical variations in market demand and operational costs. The choice of departure and arrival locations can significantly impact ticket pricing, reflecting differences in airport fees, route popularity, and regional economic conditions.

Attributes related to airline characteristics, such as airline identity, market share, and average fare, were incorporated into the model to capture the competitive landscape and branding strategies of carriers. Airlines with larger market shares or higher perceived quality may command premium fares, while budget carriers may offer more competitively priced options, thereby influencing overall pricing dynamics.

The state of departure and arrival was considered in the model to reflect the regional economic conditions, regulatory environments, and local market dynamics that can influence ticket pricing. Variances in economic prosperity, taxation policies, and regulatory frameworks across states can lead to disparities in ticket prices for flights originating from or arriving at different locations.

Operational costs, including indicators such as fuel costs and maintenance fees, were included to capture the cost structures of airlines. These costs directly impact pricing decisions, as airlines may adjust fares to account for fluctuations in operational expenses, with cost-intensive operations typically translating to higher ticket prices for passengers.

Weather conditions, encompassing elements such as temperature, precipitation, and wind speed, were integrated into the model to account for their impact on flight operations and passenger demand. Weather disruptions can lead to flight delays or cancellations, affecting supply and demand dynamics and potentially resulting in fare adjustments by airlines.

Finally, flight distance was considered as a variable to reflect the operational costs associated with longer flights. Longer flight distances typically incur higher operational expenses, leading to proportionally higher ticket prices to offset these costs.

Together, these variables are intended to cover the main factors that influence customer ticket pricing in the domestic US airline market, so that the model's results can support pricing and strategy decisions by industry stakeholders.

In [5]:
X = df.loc[:, df.columns != 'Total_Mkt_Fare']
y = df.Total_Mkt_Fare

# first split: 50% training, 50% held back (test_size=0.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=21)

# divide the held-back half in two: 25% hold-out (test), 25% validation, 50% train
X_hold, X_valid, y_hold, y_valid = train_test_split(X_test, y_test, test_size=0.5,
                                                    random_state=21)

In [6]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_hold.shape, y_hold.shape

((9225, 23), (9225,), (4613, 23), (4613,), (4613, 23), (4613,))
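
The shapes above follow from applying `test_size=0.5` twice. A minimal sketch, assuming a synthetic single-column frame with the same row count as `df` (the names `X_demo`, `y_demo`, and `split_sizes` are hypothetical and not part of the original notebook), reproduces the row counts and confirms that the three sets do not overlap:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fare data (assumption: only the row count matters here)
n_rows = 18451
X_demo = pd.DataFrame({"feature": np.arange(n_rows)})
y_demo = pd.Series(np.arange(n_rows, dtype=float))

# Stage 1: half of the rows go to training, half are held back
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=21)

# Stage 2: the held-back half is split evenly into hold-out and validation sets
X_ho, X_va, y_ho, y_va = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=21)

split_sizes = {"train": len(X_tr), "valid": len(X_va), "hold": len(X_ho)}
split_fractions = {name: round(size / n_rows, 3) for name, size in split_sizes.items()}
print(split_sizes)
print(split_fractions)
```

The fractions come out at roughly 0.50, 0.25, and 0.25, which matches the 50/25/25 description rather than a 70/30 split.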

In [7]:
# make dataframe to store results
results = pd.DataFrame(
    columns = ['Model', 'ori_RMSE_valid', 'ori_MAE_valid', 'ori_R2_valid',
               'tune_RMSE_valid', 'tune_MAE_valid', 'tune_R2_valid',
               'test_RMSE', 'test_MAE', 'test_R2'])
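
One way to fill such a table is with a small helper that returns the three metrics imported at the top of the notebook. The sketch below is an assumption about how rows might be computed, not the notebook's own code: `regression_scores` and `demo_results` are hypothetical names, only the first four columns are used, and the toy fare values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_scores(y_true, y_pred):
    """Return (RMSE, MAE, R2) for one set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    return rmse, mae, r2

# Toy values (illustrative only, not taken from merge.csv)
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])

rmse, mae, r2 = regression_scores(y_true_demo, y_pred_demo)

# Subset of the results columns, one row per model
demo_results = pd.DataFrame(
    columns=['Model', 'ori_RMSE_valid', 'ori_MAE_valid', 'ori_R2_valid'])
demo_results.loc[len(demo_results)] = ['demo_model', rmse, mae, r2]
print(demo_results)
```

Taking the square root of `mean_squared_error` explicitly keeps the helper independent of which scikit-learn version is installed.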

# **Gradient Boosting Regressor**

In [8]:
from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor()
gbr = gb_model.fit(X_train, y_train)

### GB Feature Importance

In [22]:
# Get feature importances
feature_importances = gb_model.feature_importances_

[CV] END ...max_depth=4, max_features=log2, n_estimators=116; total time=   0.9s
[CV] END ...max_depth=4, max_features=log2, n_estimators=166; total time=   1.3s
[CV] END ...max_depth=4, max_features=log2, n_estimators=166; total time=   1.3s
[CV] END ...max_depth=4, max_features=None, n_estimators=133; total time=   5.5s
[CV] END ...max_depth=4, max_features=None, n_estimators=133; total time=   5.5s
[CV] END ...max_depth=5, max_features=sqrt, n_estimators=116; total time=   1.1s
[CV] END ...max_depth=5, max_features=sqrt, n_estimators=116; total time=   1.1s
[CV] END ...max_depth=5, max_features=log2, n_estimators=100; total time=   1.0s
[CV] END ...max_depth=5, max_features=log2, n_estimators=100; total time=   1.0s
[CV] END ....max_depth=3, max_features=log2, n_estimators=50; total time=   0.3s
[CV] END ....max_depth=3, max_features=log2, n_estimators=50; total time=   0.3s
[CV] END ....max_depth=4, max_features=log2, n_estimators=50; total time=   0.5s
[CV] END ...max_depth=6, max

In [19]:
# DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})

In [20]:
# Sort in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Results
feature_importance_df

|    | Feature                  | Importance |
|---:|:-------------------------|-----------:|
|  7 | CarrierAvgFare           |   0.967867 |
|  0 | Year                     |   0.010637 |
|  8 | Carrier_MinFareIncrement |   0.004259 |
|  6 | CarriersMktShare         |   0.003328 |
|  5 | Passengers_by_Carrier    |   0.003283 |
|  4 | airlineID                |   0.002323 |
| 14 | Avg_TDOMT_COST_carrier   |   0.002302 |
| 22 | distance                 |   0.001712 |
|  3 | arrival_city             |   0.000753 |
| 13 | arrival_state            |   0.000735 |
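
To see how concentrated these scores are, a running total helps. The sketch below re-enters the ten Gradient Boosting values shown above by hand (the frame name `gb_top10` and the `Cumulative` column are hypothetical) and adds a cumulative sum:

```python
import pandas as pd

# The ten Gradient Boosting importances listed above, copied by hand
gb_top10 = pd.DataFrame({
    "Feature": [
        "CarrierAvgFare", "Year", "Carrier_MinFareIncrement", "CarriersMktShare",
        "Passengers_by_Carrier", "airlineID", "Avg_TDOMT_COST_carrier",
        "distance", "arrival_city", "arrival_state",
    ],
    "Importance": [
        0.967867, 0.010637, 0.004259, 0.003328, 0.003283,
        0.002323, 0.002302, 0.001712, 0.000753, 0.000735,
    ],
})

# Running share of total importance, from most to least important feature
gb_top10["Cumulative"] = gb_top10["Importance"].cumsum()
print(gb_top10)
```

The first row alone accounts for about 96.8% of the total, and the ten listed features together for about 99.7%.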


# **Random Forest Regressor**

In [9]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rfr = rf_model.fit(X_train, y_train)

### RF Feature Importance

In [16]:
# Get feature importances
feature_importance = rf_model.feature_importances_

# DataFrame for feature importances
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importance
})

# Sort in descending order
feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)

# Results
feature_importances_df

|    | Feature                   | Importance |
|---:|:--------------------------|-----------:|
|  7 | CarrierAvgFare            |   0.946377 |
|  0 | Year                      |   0.016122 |
|  5 | Passengers_by_Carrier     |   0.004486 |
|  8 | Carrier_MinFareIncrement  |   0.004253 |
|  6 | CarriersMktShare          |   0.003744 |
| 22 | distance                  |   0.003050 |
| 14 | Avg_TDOMT_COST_carrier    |   0.002343 |
|  4 | airlineID                 |   0.001789 |
| 15 | avg_TDOMT_GALLONS_carrier |   0.001658 |
|  2 | depart_city               |   0.001646 |
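
The two rankings are easier to compare side by side. The sketch below re-enters both top-ten lists by hand and joins them on `Feature`; the names `gb_top10`, `rf_top10`, `comparison`, and `in_both` are hypothetical. A missing value means the feature fell outside that model's displayed top ten, not that its importance is zero.

```python
import pandas as pd

# Top-ten importances from the Gradient Boosting output, copied by hand
gb_top10 = pd.DataFrame({
    "Feature": [
        "CarrierAvgFare", "Year", "Carrier_MinFareIncrement", "CarriersMktShare",
        "Passengers_by_Carrier", "airlineID", "Avg_TDOMT_COST_carrier",
        "distance", "arrival_city", "arrival_state",
    ],
    "Importance": [
        0.967867, 0.010637, 0.004259, 0.003328, 0.003283,
        0.002323, 0.002302, 0.001712, 0.000753, 0.000735,
    ],
})

# Top-ten importances from the Random Forest output, copied by hand
rf_top10 = pd.DataFrame({
    "Feature": [
        "CarrierAvgFare", "Year", "Passengers_by_Carrier", "Carrier_MinFareIncrement",
        "CarriersMktShare", "distance", "Avg_TDOMT_COST_carrier",
        "airlineID", "avg_TDOMT_GALLONS_carrier", "depart_city",
    ],
    "Importance": [
        0.946377, 0.016122, 0.004486, 0.004253, 0.003744,
        0.003050, 0.002343, 0.001789, 0.001658, 0.001646,
    ],
})

# Outer join keeps features that appear in only one of the two lists
comparison = gb_top10.merge(
    rf_top10, on="Feature", how="outer", suffixes=("_gb", "_rf"))
comparison = comparison.sort_values(
    "Importance_gb", ascending=False, na_position="last").reset_index(drop=True)

# Features that made the top ten in both models
in_both = comparison.dropna()
print(comparison)
```

Eight features appear in both lists, and `CarrierAvgFare` leads both by a wide margin.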



#### Interpreting Feature Importance for Model Tuning

Feature importance scores describe how much each feature contributes to one fitted model's predictions, so they need to be read with that model in mind. Here, both tree ensembles rank "CarrierAvgFare" first by a wide margin: its importance is 0.967867 in the Gradient Boosting model and 0.946377 in the Random Forest model, and no other feature exceeds 0.02 in either table.

Several considerations explain why these importance values were not used to fine-tune the other models built for customer ticket pricing prediction.

First, importance scores are model-specific. The two tree ensembles above agree closely, but that agreement need not carry over to models such as Support Vector Machines or Neural Networks, which do not produce impurity-based importances at all. A ranking for them would have to come from a different method, and it may weight "CarrierAvgFare" differently.
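
Permutation importance is one model-agnostic way to compare feature rankings across estimators. It is not used in the original notebook; the sketch below applies scikit-learn's `permutation_importance` to synthetic data in which one column dominates by construction. The column names (`dominant`, `minor`, `noise`) and all numeric settings are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the target depends strongly on one feature
rng = np.random.default_rng(21)
n = 600
X_syn = pd.DataFrame({
    "dominant": rng.normal(size=n),
    "minor": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_syn = 10.0 * X_syn["dominant"] + 1.0 * X_syn["minor"] + rng.normal(scale=0.5, size=n)

X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.5, random_state=21)

models = {
    "gb": GradientBoostingRegressor(random_state=21),
    "rf": RandomForestRegressor(n_estimators=50, random_state=21),
}

perm_table = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    # Score drop on held-out data when each column is shuffled in turn
    result = permutation_importance(
        model, X_val, y_val, n_repeats=5, random_state=21)
    perm_table[name] = pd.Series(result.importances_mean, index=X_syn.columns)

perm_df = pd.DataFrame(perm_table)
print(perm_df.sort_values("gb", ascending=False))
```

Because the score drop is measured on held-out data, the same procedure works for any fitted regressor, including ones without a `feature_importances_` attribute.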

Second, models differ in their capacity to capture interactions between features and the target variable. A model that exploits interactions well may draw on features that a simpler model effectively ignores, so the same feature can matter more under one modeling approach than under another.

Third, preprocessing and feature engineering affect importance. Models differ in the transformations they need (for example, scaling or encoding), and a feature's measured importance can change once it has been transformed.

Fourth, hyperparameters affect importance. Settings such as tree depth or the number of features considered per split change how a model uses its inputs during training, and the importance assigned to each feature can shift accordingly.
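
The effect of a single hyperparameter can be illustrated on synthetic data. In the sketch below, which is an assumption-laden example and not part of the original analysis, a `proxy` column is strongly correlated with the true `signal` column, and the same Gradient Boosting model is fitted with two `max_features` settings:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (assumption): "proxy" is a noisy copy of "signal"
rng = np.random.default_rng(21)
n = 500
signal = rng.normal(size=n)
X_syn = pd.DataFrame({
    "signal": signal,
    "proxy": signal + rng.normal(scale=0.3, size=n),
    "noise": rng.normal(size=n),
})
y_syn = 5.0 * signal + rng.normal(scale=0.5, size=n)

importance_by_setting = {}
for max_features in (None, 1):
    # None: every split may use any feature; 1: each split sees one random feature
    model = GradientBoostingRegressor(max_features=max_features, random_state=21)
    model.fit(X_syn, y_syn)
    importance_by_setting[str(max_features)] = pd.Series(
        model.feature_importances_, index=X_syn.columns)

importance_df = pd.DataFrame(importance_by_setting)
print(importance_df.round(3))
```

Each column of `importance_df` sums to 1, so the two settings can be compared directly as shares of total importance.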

Finally, the evaluation metric or scoring function used during hyperparameter tuning has an indirect effect. Different metrics can select different hyperparameter settings, and the refitted model may then distribute importance differently.
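
A minimal sketch of how the scoring function enters a randomized search is given below. It uses `RandomizedSearchCV` (imported at the top of the notebook) on synthetic data. The parameter names match those in the cross-validation log above, but the sampling ranges, `n_iter`, `cv`, and both scoring strings are assumptions, not the notebook's actual search settings.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (assumption, stands in for the fare features)
rng = np.random.default_rng(21)
n = 300
X_syn = pd.DataFrame(rng.normal(size=(n, 4)), columns=["f0", "f1", "f2", "f3"])
y_syn = 3.0 * X_syn["f0"] - 2.0 * X_syn["f1"] + rng.normal(scale=0.5, size=n)

# Hypothetical search space over the parameters named in the CV log
param_distributions = {
    "max_depth": randint(3, 7),            # integers 3..6
    "max_features": ["sqrt", "log2", None],
    "n_estimators": randint(50, 201),      # integers 50..200
}

searches = {}
for scoring in ("neg_root_mean_squared_error", "r2"):
    search = RandomizedSearchCV(
        GradientBoostingRegressor(random_state=21),
        param_distributions=param_distributions,
        n_iter=4,
        cv=2,
        scoring=scoring,
        random_state=21,
    )
    search.fit(X_syn, y_syn)
    searches[scoring] = search
    print(scoring, search.best_params_, round(search.best_score_, 3))
```

Each search refits its best candidate on the full data, so `searches[...].best_estimator_.feature_importances_` can be inspected afterwards to check whether the choice of metric changed the ranking.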

In short, feature importance is informative about the model that produced it, but it should be weighed against the broader modeling process and the requirements of the prediction task before it is used to tune other models.