# Modeling Notebook

Below is the notebook that the data scientist used to build the model. We fit a simple Lasso regression and report both cross-validation and out-of-sample scores to check that the model predicts accurately (R2 is our metric).

The final deployed model should use `flight_prices_training.csv` as its training data.

### Train Test Split

In [4]:
import pandas as pd
from joblib import dump, load
from sklearn.model_selection import train_test_split

df = pd.read_csv("flight_prices_training.csv")
train, test = train_test_split(df, test_size=0.2)
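
The split above is drawn at random on every run, so the scores reported later will vary slightly between runs. A minimal sketch of a repeatable split, assuming a toy DataFrame in place of the flight data (the values and the seed 42 are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the flight data (values are made up for illustration).
toy = pd.DataFrame({"duration": range(10), "price": range(100, 110)})

# A fixed seed (42 is arbitrary) makes the split repeatable across runs.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```

Whether to fix the seed is a design choice: it makes results reproducible, at the cost of tying the reported metrics to one particular split.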

In [13]:
list(df['class'].unique())

['Economy', 'Business']

### Preprocessing

In [5]:
train = train.drop(columns=['flight'])
test = test.drop(columns=['flight'])

num_cols = ['days_left', 'duration']
cat_cols = ['airline', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class']

train = pd.get_dummies(train, prefix = cat_cols, columns = cat_cols)
test = pd.get_dummies(test, prefix = cat_cols, columns = cat_cols)

y_train = train['price']
X_train = train.drop(['price'], axis=1)
y_test = test['price']
X_test = test.drop(['price'], axis=1)
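
Because `pd.get_dummies` is applied to `train` and `test` separately, the two frames only end up with the same columns if every category appears in both splits. That is likely with a dataset of this size but not guaranteed. One possible safeguard, sketched here on made-up toy frames (the names `toy_train`, `toy_test`, `enc_train`, and `enc_test` are hypothetical), is to reindex the test frame to the training columns:

```python
import pandas as pd

# Toy frames (made-up values): the test split lacks the 'Business' class.
toy_train = pd.DataFrame({"class": ["Economy", "Business"], "duration": [2.0, 3.5]})
toy_test = pd.DataFrame({"class": ["Economy"], "duration": [4.0]})

enc_train = pd.get_dummies(toy_train, prefix=["class"], columns=["class"])
enc_test = pd.get_dummies(toy_test, prefix=["class"], columns=["class"])

# Reindex the test frame to the training columns; categories missing
# from the test split become all-zero columns.
enc_test = enc_test.reindex(columns=enc_train.columns, fill_value=0)

print(list(enc_test.columns))
```

The same reindexing also drops any test-only category that the model never saw during training.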

In [3]:
train.columns

Index(['duration', 'days_left', 'price', 'airline_AirAsia',
       'airline_Air_India', 'airline_GO_FIRST', 'airline_Indigo',
       'airline_SpiceJet', 'airline_Vistara', 'source_city_Bangalore',
       'source_city_Chennai', 'source_city_Delhi', 'source_city_Hyderabad',
       'source_city_Kolkata', 'source_city_Mumbai', 'departure_time_Afternoon',
       'departure_time_Early_Morning', 'departure_time_Evening',
       'departure_time_Late_Night', 'departure_time_Morning',
       'departure_time_Night', 'stops_one', 'stops_two_or_more', 'stops_zero',
       'arrival_time_Afternoon', 'arrival_time_Early_Morning',
       'arrival_time_Evening', 'arrival_time_Late_Night',
       'arrival_time_Morning', 'arrival_time_Night',
       'destination_city_Bangalore', 'destination_city_Chennai',
       'destination_city_Delhi', 'destination_city_Hyderabad',
       'destination_city_Kolkata', 'destination_city_Mumbai', 'class_Business',
       'class_Economy'],
      dtype='object')

### Model Fitting and Cross Validation

In [5]:
from sklearn.model_selection import cross_validate
from sklearn import linear_model

lasso = linear_model.Lasso(alpha=.1, max_iter=5000)
cv_results = cross_validate(lasso, X_train, y_train, cv=5, return_estimator=True)
print("Cross Val R2 Score: ", cv_results['test_score'].mean())

Cross Val R2 Score:  0.9104630167894708
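
The mean score hides how much the folds differ. Since `return_estimator=True` is set, `cross_validate` also returns one fitted model per fold. A small sketch of inspecting both, using synthetic data rather than the flight prices (the data and the `toy_` names are assumptions for illustration):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_validate

# Synthetic regression data (made up): y depends linearly on two features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

toy_cv = cross_validate(
    linear_model.Lasso(alpha=0.1, max_iter=5000),
    X_toy, y_toy, cv=5, return_estimator=True,
)

# One R2 score and one fitted estimator per fold.
print(toy_cv["test_score"])
print(toy_cv["test_score"].std())
print([est.coef_ for est in toy_cv["estimator"]])
```

A small standard deviation across folds suggests the score does not depend heavily on which rows land in which fold.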


### Final Out of Sample Testing

In [6]:
from sklearn.metrics import r2_score
lasso = lasso.fit(X_train, y_train)
predicted = lasso.predict(X_test)
print("Out of Sample R2 Score: ", r2_score(y_test, predicted))

Out of Sample R2 Score:  0.913620937680295
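
For reference, R2 is the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. A plain-Python sketch of that definition (the function name `r2` is ours for illustration; the notebook itself relies on scikit-learn's `r2_score`):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1, and always predicting the mean of `y_true` gives 0, so a score near 0.91 means the model explains roughly 91% of the variance in price on the held-out rows.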


In [7]:
dump(lasso,"model.joblib")

['model.joblib']
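
Before deploying the saved file it can be worth checking that it loads back and predicts identically. A minimal round-trip sketch on a tiny made-up dataset (the file name `toy_model.joblib` and the data are hypothetical; the real artifact is `model.joblib`):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn import linear_model

# Tiny made-up dataset, used only to have a fitted model to save.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = linear_model.Lasso(alpha=0.1).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")  # hypothetical file name
    dump(toy_model, path)
    restored = load(path)

# The restored model should reproduce the original predictions.
print(np.allclose(toy_model.predict(X_toy), restored.predict(X_toy)))
```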

In [8]:
df = pd.read_csv("flight_prices_training.csv")
df = df.drop(columns=['flight'])
cat_cols = ['airline', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class']
predict = pd.get_dummies(df, prefix = cat_cols, columns = cat_cols)
X = predict.drop(['price'], axis=1)
# model = load("model.joblib")
# df['price'] = model.predict(X)
# df.to_csv("prediction.csv")

In [9]:
predict

Unnamed: 0,duration,days_left,price,airline_AirAsia,airline_Air_India,airline_GO_FIRST,airline_Indigo,airline_SpiceJet,airline_Vistara,source_city_Bangalore,...,arrival_time_Morning,arrival_time_Night,destination_city_Bangalore,destination_city_Chennai,destination_city_Delhi,destination_city_Hyderabad,destination_city_Kolkata,destination_city_Mumbai,class_Business,class_Economy
0,4.92,47,4496,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,1
1,24.67,8,7425,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
2,16.25,40,1998,1,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,1
3,7.50,39,47657,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
4,11.75,33,54684,0,1,0,0,0,0,1,...,0,1,0,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
240117,4.92,13,42457,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
240118,7.00,41,56702,0,0,0,0,0,1,0,...,0,1,1,0,0,0,0,0,1,0
240119,17.92,40,5227,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,1
240120,8.83,45,5817,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1


In [10]:
X.columns

Index(['duration', 'days_left', 'airline_AirAsia', 'airline_Air_India',
       'airline_GO_FIRST', 'airline_Indigo', 'airline_SpiceJet',
       'airline_Vistara', 'source_city_Bangalore', 'source_city_Chennai',
       'source_city_Delhi', 'source_city_Hyderabad', 'source_city_Kolkata',
       'source_city_Mumbai', 'departure_time_Afternoon',
       'departure_time_Early_Morning', 'departure_time_Evening',
       'departure_time_Late_Night', 'departure_time_Morning',
       'departure_time_Night', 'stops_one', 'stops_two_or_more', 'stops_zero',
       'arrival_time_Afternoon', 'arrival_time_Early_Morning',
       'arrival_time_Evening', 'arrival_time_Late_Night',
       'arrival_time_Morning', 'arrival_time_Night',
       'destination_city_Bangalore', 'destination_city_Chennai',
       'destination_city_Delhi', 'destination_city_Hyderabad',
       'destination_city_Kolkata', 'destination_city_Mumbai', 'class_Business',
       'class_Economy'],
      dtype='object')