# Decisioning that can be powered using the data shared

Based on the data shared, I have listed a few problem statements that can further be explored:

1. __Forecasting future demand and growth__ : Estimating future demand and growth is an important activity for OYO, as it determines how various operational aspects should be planned. For example, call center staffing to accommodate increased customer requests during booking surges, dynamic pricing, expansion to new territories and investment in scalable backend technology are all business-critical decisions that depend on expected demand. However, since we cannot be conclusive about the span of the data we have (single year or multi-year), it would be difficult to create accurate projections of growth. This is because we cannot separate the impact of seasonality in demand from long-term growth using a single year's worth of data.

2. __Predicting likelihood of booking cancellations__ : As nearly 37% of bookings are cancelled, it is important to be able to predict which bookings will be cancelled, to avoid losing revenue by holding a room unavailable. If we flag a booking as likely to cancel, we may want to avoid freezing that particular room for a predetermined amount of time and still accept alternate bookings on the same room. Another way to ensure that a customer is serious about the booking is to request an advance deposit for risky bookings. The main limiting factor here is that we don't have sufficient historical data per customer to identify individual cancellation trends.
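
To make the concern raised under the first use case concrete, here is a minimal sketch (not part of the original analysis) of why one year of monthly data cannot separate a linear growth trend from month-of-year seasonality. The helper `design_matrix` is a hypothetical name introduced for illustration, and the model form (one trend column plus twelve month indicators) is an assumption.

```python
import numpy as np

def design_matrix(n_months):
    """Hypothetical design matrix: a linear trend plus one indicator per calendar month."""
    t = np.arange(n_months)
    trend = t.reshape(-1, 1).astype(float)
    month_dummies = np.eye(12)[t % 12]  # row i marks the calendar month of observation i
    return np.hstack([trend, month_dummies])

for n_months in (12, 24):
    X = design_matrix(n_months)
    rank = np.linalg.matrix_rank(X)
    # when rank < number of columns, trend and seasonal effects cannot be told apart
    print(f"{n_months} months: {X.shape[1]} unknowns, rank {rank}")
```

With 12 months the matrix has fewer independent directions than unknowns, so any growth can be absorbed into the monthly effects; with 24 months the trend becomes identifiable.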


We will now focus on the second use case, creating a model to predict booking cancellations. This should be a tractable task, since the cancellation rate is sizeable (so the target is not heavily imbalanced) and there is sufficient data.

# 1. Data Load and Cleansing

In [76]:
#import standard libraries
import pandas as pd
import pandas_profiling
import seaborn as sns
import numpy as np
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score,confusion_matrix

#set seed for reproducibility
np.random.seed(2021)

#load in constants such as data paths and locations
from constants import TRAIN_DIR,ARTIFACTS_PATH
#load the raw data
data = pd.read_csv(TRAIN_DIR)

#fill missing values with 0 and cast to integer ('chidren' is the column name as spelled in the raw data)
data['chidren'] = data['chidren'].fillna(0).astype(int)

To minimize training-serving skew, it is good practice to have a single data preparation function that can be called during both offline training and online inference. In this case, since inference is offline, we will not worry about the case where we don't have labels on the evaluation data. This function is likely to be reused in a production setting, so we may want to enrich it with better documentation.

Here, we will split our data randomly rather than by time, to ensure the model learns seasonal trends as well.

In [34]:
def prepare_data(data_type = 'train',
                 data = None,
                 artifacts_path = '',
                 cat_cols = [],
                 num_cols = [],
                 target_col = ''
                 ):
    '''
        Description : Function to transform raw booking transaction data into modelling data to
                      be used in prediction of customer cancellations
        
        Parameters : 
                
                data_type (str) : ['train','test'] - specify train or test to help the function
                                                     decide if it needs to train data transformers or 
                                                     load pre built transformers
                                                     
                data (pd.DataFrame) - input pandas df on which we want to apply data transformation
                
                artifacts_path - path on hard disk to save data transformation artifacts
                
                cat_cols - list of categorical columns in the input data
                
                num_cols - list of numerical columns in the input data
                
                target_col - target column in the input data. no transformation will occur on this column
                
        Returns :
                
                out (pd.DataFrame) - transformed data ready to be used for modelling purposes
    
    '''
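    #common feature engineering
    #work on a copy so the caller's frame is not modified (this also avoids pandas' SettingWithCopyWarning)
    data = data.copy()
    #note: despite its name, same_room_flag is 1 when the requested and assigned room types differ, and 0 when they match
    data['same_room_flag'] = (data.roomType != data.assignedType).astype(int)
    
    custom_features = ['same_room_flag']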
    
    if data_type == 'train':
        #apply mean imputation on numeric data and constant ('missing') imputation on categorical data
        imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean').fit(data[num_cols])
        imputed_num_data = imp_mean.transform(data[num_cols])
        imp_cat = SimpleImputer(missing_values=np.nan, strategy='constant',fill_value = 'missing').fit(data[cat_cols])
        imputed_cat_data = imp_cat.transform(data[cat_cols])
        
        #get scaled values of numeric features
        standard_scaler = StandardScaler().fit(imputed_num_data)
        numerical_features_df = pd.DataFrame(standard_scaler.transform(imputed_num_data),columns = num_cols)
        print('numerical_features_df data shape {}'.format(numerical_features_df.shape))

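        #get dummies for categorical features
        #note: the `sparse` argument was renamed `sparse_output` in scikit-learn 1.2; adjust it to match the installed version
        ohe_encoder = OneHotEncoder(handle_unknown='ignore',sparse = False, drop = 'first').fit(imputed_cat_data)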
        categorical_features_df = pd.DataFrame(ohe_encoder.transform(imputed_cat_data),columns = ohe_encoder.get_feature_names_out(cat_cols))
        print('categorical_features_df data shape {}'.format(categorical_features_df.shape))

        #save our data transformers
        joblib.dump(ohe_encoder,artifacts_path+'ohe_obj.pkl')
        joblib.dump(standard_scaler,artifacts_path+'standard_scaler.pkl')
        joblib.dump(imp_mean,artifacts_path+'imp_mean.pkl')
        joblib.dump(imp_cat,artifacts_path+'imp_cat.pkl')

    elif data_type == 'test':
        #load our data transformers and call the transform function on the test data
        ohe_encoder = joblib.load(artifacts_path+'ohe_obj.pkl')
        standard_scaler = joblib.load(artifacts_path+'standard_scaler.pkl')
        imp_mean = joblib.load(artifacts_path+'imp_mean.pkl')
        imp_cat = joblib.load(artifacts_path+'imp_cat.pkl')
        
        #call transform
        imp_cat_df = imp_cat.transform(data[cat_cols])
        imp_num_df = imp_mean.transform(data[num_cols])
        categorical_features_df = pd.DataFrame(ohe_encoder.transform(imp_cat_df),columns = ohe_encoder.get_feature_names_out(cat_cols))
        numerical_features_df = pd.DataFrame(standard_scaler.transform(imp_num_df),columns = num_cols)
        print('categorical_features_df data shape {}'.format(categorical_features_df.shape))
        print('numerical_features_df data shape {}'.format(numerical_features_df.shape))
    
    #concat data together
    out = pd.concat([numerical_features_df,categorical_features_df,data[custom_features + [target_col]].reset_index(drop = True)], axis = 1)
    
    print('processed data shape {}'.format(out.shape))
    return out
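
As an alternative to saving four separate artifacts, the same imputation, scaling and encoding steps can be bundled into one object with the `Pipeline` and `ColumnTransformer` classes imported at the top of the notebook. The following is a minimal sketch under stated assumptions, not the method used above: `build_preprocessor` is a hypothetical helper, the toy frames and the deposit values other than 'Non Refund' are made up, and `drop='first'` is omitted for simplicity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(num_cols, cat_cols):
    """Hypothetical helper: impute + scale numeric columns, impute + one-hot categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0.0 asks ColumnTransformer to always return a dense array
    return ColumnTransformer(
        [("num", numeric, num_cols), ("cat", categorical, cat_cols)],
        sparse_threshold=0.0,
    )

# toy data standing in for the booking records (values are made up)
toy_train = pd.DataFrame({
    "numberNights": [1.0, 2.0, np.nan, 4.0],
    "deposit": ["No Deposit", "Non Refund", np.nan, "No Deposit"],
})
toy_test = pd.DataFrame({
    "numberNights": [3.0, np.nan],
    "deposit": ["Refundable", "Non Refund"],  # 'Refundable' is unseen during fit
})

# fit on training data only, then reuse the fitted object on test data
preprocessor = build_preprocessor(["numberNights"], ["deposit"]).fit(toy_train)
X_train = preprocessor.transform(toy_train)
X_test = preprocessor.transform(toy_test)
print(X_train.shape, X_test.shape)
```

Because the fitted `preprocessor` is a single object, it could be persisted with one `joblib.dump` call and loaded at inference time, which leaves less room for the train and test paths to drift apart.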

In [62]:
#define cat and numeric column lists for modelling. country was initially excluded as it may need embeddings;
#it is now included as one-hot encoded columns (see the observations in section 3)
#other variables excluded - 'arrivalDay' since we don't know the day of week in the absence of a year column
cat_cols = ['arrivalMonth','segment','roomType','assignedType','customerSegment','deposit','type','country']
num_cols = ['time2Checkin','numberWeekendnights','numberNights','adults','chidren','changesFlag',\
            'repeatFlag','historicCancellations','historicBookings','waitingDays','numberofRequests']
target = 'canceledFlag'

#adopt an 80-20 split for train and test data
train_df, test_df = train_test_split(data, test_size=0.2, random_state=0)

#train the transformers on training data, transform the data and save the artifacts to disk
train_df = prepare_data(data_type = 'train',
                 data = train_df,
                 artifacts_path = ARTIFACTS_PATH,
                 cat_cols = cat_cols,
                 num_cols = num_cols,
                 target_col = target)

#load the transformers from disk, and transform the test data
test_df = prepare_data(data_type = 'test',
                 data = test_df,
                 artifacts_path = ARTIFACTS_PATH,
                 cat_cols = cat_cols,
                 num_cols = num_cols,
                 target_col = target)

numerical_features_df data shape (95512, 11)
categorical_features_df data shape (95512, 213)
processed data shape (95512, 226)
categorical_features_df data shape (23878, 213)
numerical_features_df data shape (23878, 11)
processed data shape (23878, 226)


Let's visually inspect the final transformed data to pick up any errors in data preparation

In [75]:
train_df.describe()

| | time2Checkin | numberWeekendnights | numberNights | adults | chidren | changesFlag | repeatFlag | historicCancellations | historicBookings | waitingDays | ... | country_UZB | country_VEN | country_VGB | country_VNM | country_ZAF | country_ZMB | country_ZWE | country_missing | same_room_flag | canceledFlag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | ... | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 | 95512.0 |
| mean | -1.701687e-16 | 1.664495e-15 | -4.04898e-16 | -4.423084e-15 | -5.195947e-16 | 1.76129e-16 | -4.863325e-15 | 1.135793e-15 | -8.644845e-16 | -8.302913e-16 | ... | 2.1e-05 | 0.000209 | 1e-05 | 5.2e-05 | 0.000691 | 1e-05 | 4.2e-05 | 0.004209 | 0.124738 | 0.369378 |
| std | 1.000005 | 1.000005 | 1.000005 | 1.000005 | 1.000005 | 1.000005 | 1.000005 | 1.000005 | 1.000005 | 1.000005 | ... | 0.004576 | 0.014469 | 0.003236 | 0.007235 | 0.026278 | 0.003236 | 0.006471 | 0.06474 | 0.330424 | 0.482639 |
| min | -0.9727476 | -0.92674 | -1.305935 | -3.178095 | -0.261105 | -0.3401628 | -0.1824813 | -0.1029315 | -0.0917544 | -0.1311692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.8038555 | -0.92674 | -0.7838474 | 0.2478571 | -0.261105 | -0.3401628 | -0.1824813 | -0.1029315 | -0.0917544 | -0.1311692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | -0.325328 | 0.07330283 | -0.2617598 | 0.2478571 | -0.261105 | -0.3401628 | -0.1824813 | -0.1029315 | -0.0917544 | -0.1311692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.5285151 | 1.073346 | 0.2603277 | 0.2478571 | -0.261105 | -0.3401628 | -0.1824813 | -0.1029315 | -0.0917544 | -0.1311692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 5.942444 | 18.07407 | 24.79844 | 91.03558 | 7.283109 | 31.95111 | 5.480013 | 30.6528 | 48.27422 | 22.15761 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
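
The visual inspection above can be complemented by a few programmatic checks. The sketch below is illustrative only: `check_prepared_frame` is a hypothetical helper and the toy frame is made up. Note that `describe()` reports the sample standard deviation (ddof=1) while `StandardScaler` divides by the population standard deviation (ddof=0), which is why the scaled columns show 1.000005 rather than exactly 1.

```python
import numpy as np
import pandas as pd

def check_prepared_frame(df, scaled_cols, binary_cols, tol=1e-6):
    """Hypothetical helper: return a list of problems found in a prepared frame."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values present")
    for col in scaled_cols:
        # compare against the population standard deviation, as StandardScaler does
        if abs(df[col].mean()) > tol or abs(df[col].std(ddof=0) - 1.0) > tol:
            problems.append(f"{col} is not standardized")
    for col in binary_cols:
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0/1")
    return problems

# toy frame standing in for the prepared training data (values are made up)
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=200)
toy = pd.DataFrame({
    "numberNights": (raw - raw.mean()) / raw.std(),  # np.std defaults to ddof=0
    "same_room_flag": rng.integers(0, 2, size=200),
})
clean_report = check_prepared_frame(toy, ["numberNights"], ["same_room_flag"])

# deliberately corrupt the flag column to show a failing check
toy_bad = toy.assign(same_room_flag=2)
bad_report = check_prepared_frame(toy_bad, ["numberNights"], ["same_room_flag"])
print(clean_report, bad_report)
```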


# 2. Baseline Performance

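Let's use a simple baseline, logistic regression, to set a benchmark to beat. The baseline reaches an accuracy of around 81% on both train and test data. Its balanced accuracy is about 77%; this is the figure printed as 'AUC' below, because `roc_auc_score` is given hard 0/1 predictions rather than predicted probabilities. The model also has a high false negative rate (roughly 37% of actual cancellations are missed), which we need to try and improve.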

In [64]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(train_df.drop(columns = target), train_df[target])

result = model.score(train_df.drop(columns = target), train_df[target])
print('Train Accuracy: ',result)
print('Train AUC: ',roc_auc_score(train_df[target],model.predict(train_df.drop(columns = target))))
print('Train confusion matrix: \n',confusion_matrix(train_df[target],model.predict(train_df.drop(columns = target))))

result = model.score(test_df.drop(columns = target), test_df[target])
print('Test Accuracy: ',result)
print('Test AUC: ',roc_auc_score(test_df[target],model.predict(test_df.drop(columns = target))))
print('Test confusion matrix: \n',confusion_matrix(test_df[target],model.predict(test_df.drop(columns = target))))

Train Accuracy:  0.8098040036854008
Train AUC:  0.7735447614650413
Train confusion matrix: 
 [[54952  5280]
 [12886 22394]]
Test Accuracy:  0.8065164586648798
Test AUC:  0.7717503602014636
Test confusion matrix: 
 [[13595  1339]
 [ 3281  5663]]
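
The cell above passes `model.predict(...)` to `roc_auc_score`. With hard 0/1 labels the ROC curve has a single operating point, so the score equals balanced accuracy. A threshold-independent AUC would use `predict_proba(...)[:, 1]` instead. The sketch below illustrates the difference on synthetic data (an assumption; it does not reproduce the booking results).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the booking data, with roughly 37% positives
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.63, 0.37], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

hard_labels = clf.predict(X_test)          # 0/1 decisions at the default 0.5 threshold
scores = clf.predict_proba(X_test)[:, 1]   # estimated probability of the positive class

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_scores = roc_auc_score(y_test, scores)
balanced_acc = balanced_accuracy_score(y_test, hard_labels)

print("AUC from hard labels :", auc_from_labels)
print("Balanced accuracy    :", balanced_acc)
print("AUC from scores      :", auc_from_scores)
```

The first two printed values coincide, while the third summarizes ranking quality across all thresholds.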


Let's check the coefficients to understand the main predictors of cancellations. The deposit type and room type heavily influence the predictions, while the month of booking is moderately important. Historical bookings and cancellations also appear among the top drivers, as do several country indicators.

In [65]:
#extract model coefficients and sort them according to magnitude
coeff = pd.DataFrame(zip(train_df.drop(columns = target).columns, np.transpose(model.coef_)), columns=['features', 'coef']) 
coeff['coef_abs'] = coeff['coef'].apply(lambda x: abs(x[0]))
coeff.sort_values(by = 'coef_abs', ascending = False).head(30)

| | features | coef | coef_abs |
|---|---|---|---|
| 51 | deposit_Non Refund | [5.042751505660763] | 5.042752 |
| 120 | country_HKG | [2.2257185473506835] | 2.225719 |
| 224 | same_room_flag | [-2.1364754406200017] | 2.136475 |
| 58 | country_ARE | [2.0560608522859076] | 2.056061 |
| 195 | country_SRB | [-1.8505537078048058] | 1.850554 |
| 153 | country_MAC | [1.7527320001434257] | 1.752732 |
| 48 | assignedType_P | [1.7216432098593155] | 1.721643 |
| 37 | roomType_P | [1.7216432098593155] | 1.721643 |
| 7 | historicCancellations | [1.6253694374070138] | 1.625369 |
| 134 | country_JEY | [1.4972888246526672] | 1.497289 |


# 3. Non-linear Algorithm Performance, Feature Selection and Engineering

We will now try to improve upon our baseline using a combination of non-linear algorithms, feature engineering and feature selection. Recall from DIDQ that we observed many correlated variables. We will use an L1 penalty to filter out variables, then fit xgboost with an L2 penalty for better performance, and finally consider some feature engineering.

## Feature Engineering:

1. Difference between room type and assigned room type

In [68]:
#train a Lasso regression model and use the coefficients for feature selection
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.01)
model.fit(train_df.drop(columns = target), train_df[target])

coeff = pd.DataFrame(zip(train_df.drop(columns = target).columns, np.transpose(model.coef_)), columns=['features', 'coef']) 
coeff['coef_abs'] = coeff['coef'].apply(lambda x: abs(x))
coeff.sort_values(by = 'coef_abs', ascending = False).head(40)

| | features | coef | coef_abs |
|---|---|---|---|
| 51 | deposit_Non Refund | 0.413365 | 0.413365 |
| 183 | country_PRT | 0.207337 | 0.207337 |
| 27 | segment_onl | 0.1933 | 0.1933 |
| 224 | same_room_flag | -0.134129 | 0.134129 |
| 10 | numberofRequests | -0.07816 | 0.07816 |
| 0 | time2Checkin | 0.067308 | 0.067308 |
| 5 | changesFlag | -0.026483 | 0.026483 |
| 53 | type_R | -0.021206 | 0.021206 |
| 3 | adults | 0.013091 | 0.013091 |
| 6 | repeatFlag | -0.012052 | 0.012052 |


# Observations:
1. Using the full list of features, we get a model with 83% accuracy and 80% AUC.
2. Using the filtered list of features obtained from fitting a Lasso model, we get a model with 82.5% accuracy and 79% AUC.
3. Thus, by using only 13 out of 55 total features, we get a similar model with minimal degradation in performance.
4. After adding encoded country variables, we observe a 5% improvement in accuracy and a 4% improvement in AUC; however, the model then has 225 variables.
5. After applying Lasso once again, we find that only one country variable, 'country_PRT', is needed to get similar model performance.
6. We have significantly reduced the false negative rate (25%) compared to the baseline (37%), which is satisfactory.

In [74]:
from xgboost import XGBClassifier

#feature list to be passed to the model
#features = [col for col in train_df.columns if target not in col]
features = ['deposit_Non Refund','segment_onl','numberofRequests','time2Checkin',\
            'changesFlag','historicCancellations','adults','chidren','numberNights',\
            'numberWeekendnights','country_PRT','repeatFlag','same_room_flag']

#we will apply only L2 penalty for regularization as L1 has already been applied for feature selection
model = XGBClassifier(reg_lambda = 0.01).fit(train_df[features], train_df[target])
result = model.score(train_df[features], train_df[target])
print('Train Accuracy: ',result)
print('Train AUC: ',roc_auc_score(train_df[target],model.predict(train_df[features])))
print('Train confusion matrix: \n',confusion_matrix(train_df[target],model.predict(train_df[features])))

result = model.score(test_df[features], test_df[target])
print('Test Accuracy: ',result)
print('Test AUC: ',roc_auc_score(test_df[target],model.predict(test_df[features])))
print('Test confusion matrix: \n',confusion_matrix(test_df[target],model.predict(test_df[features])))



Train Accuracy:  0.8498303878046738
Train AUC:  0.8306024396520906
Train confusion matrix: 
 [[54462  5770]
 [ 8573 26707]]
Test Accuracy:  0.8414858865901667
Test AUC:  0.8224882161893967
Test confusion matrix: 
 [[13414  1520]
 [ 2265  6679]]
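
The headline rates can be recomputed directly from the test confusion matrices printed in this notebook. The sketch below uses only those reported counts; `rates` is a hypothetical helper, and the layout `[[TN, FP], [FN, TP]]` follows scikit-learn's `confusion_matrix` convention.

```python
def rates(confusion):
    """Hypothetical helper: derive summary rates from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = confusion
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "false_negative_rate": fn / (tp + fn),
        "balanced_accuracy": (recall + specificity) / 2,
    }

# test confusion matrices as printed above
baseline_test = [[13595, 1339], [3281, 5663]]
xgboost_test = [[13414, 1520], [2265, 6679]]

baseline = rates(baseline_test)
xgb = rates(xgboost_test)
for name, result in (("baseline", baseline), ("xgboost", xgb)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

The balanced accuracy values match the figures printed as 'Test AUC', since those were computed from hard predictions, and the false negative rate falls from about 37% to about 25%.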


A look at the feature importances by gain from the xgboost model broadly agrees with the coefficients reported by the logistic model: the non-refundable deposit indicator dominates, followed by same_room_flag.

In [72]:
#extract feature importances and store them in a pandas df
feature_importances = model.get_booster().get_score(importance_type='gain')
keys = list(feature_importances.keys())
values = list(feature_importances.values())
feature_importances_df = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
feature_importances_df

| | score |
|---|---|
| deposit_Non Refund | 1738.854549 |
| same_room_flag | 83.756869 |
| segment_onl | 69.911758 |
| country_PRT | 48.949408 |
| historicCancellations | 43.079593 |
| numberofRequests | 29.521797 |
| repeatFlag | 21.098753 |
| changesFlag | 12.209634 |
| time2Checkin | 10.69733 |
| adults | 6.116777 |


# 4. Conclusions

1. Using the final model, we can identify about 75% (recall) of potential cancellations. Handling such bookings correctly should increase revenue for OYO and improve the booking experience, as more rooms could be listed as available. With the right intervention strategies, the share of bookings lost to unanticipated cancellations could potentially come down from 37% to around 10% per annum, since only about a quarter of cancellations would go unflagged.

2. The most predictive features used in the model are generated directly at the time of manual entry in the online/app session. We don't see an impact of internally defined customer segments and demographic data on cancellations. This is a critical point, since this model is most likely going to be embedded in the website/app itself and will serve real-time predictions. As a result, the additional latency caused by hits on OYO's internal databases can be avoided.

3. Given the requirements of the model (real-time predictions), it is also important to have a lighter model with lower complexity and a smaller input feature space for low-latency predictions. This is why I would recommend deploying the model with 13 features over the 225-feature model. Usually we need to accept a tradeoff between model accuracy and low-latency predictions; however, in our case the degradation in model performance is negligible.
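
One way to read the "37% to 10%" figure in the first conclusion is as the share of all bookings that are cancelled without being flagged. The sketch below recomputes it from the final model's test confusion matrix. It rests on a best-case assumption, stated here rather than in the analysis above, that every flagged cancellation is successfully mitigated; it also reports how many flagged bookings would not have cancelled.

```python
# counts from the final model's test confusion matrix reported in section 3
tn, fp, fn, tp = 13414, 1520, 2265, 6679
total = tn + fp + fn + tp

# share of all test bookings that were cancelled
cancellation_rate = (tp + fn) / total

# share of all bookings that cancel without being flagged
# (best case: assumes every flagged cancellation is fully mitigated)
unflagged_cancellation_share = fn / total

# share of flagged bookings that really do cancel
precision = tp / (tp + fp)

print(f"cancellation rate            : {cancellation_rate:.1%}")
print(f"unflagged cancellation share : {unflagged_cancellation_share:.1%}")
print(f"precision of the flag        : {precision:.1%}")
```

The precision figure matters for the intervention design: bookings that are flagged but would not have cancelled are the ones exposed to any friction an intervention adds, such as an advance deposit request.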

# 5. Next Steps

1. Explore neural networks as a way to improve model accuracy.
2. Advanced feature engineering, such as embeddings for the country variable.
3. Data collection to build richer customer histories and explore sequential aspects of customer cancellation behaviour.
4. Model deployment and integration with the OYO ecosystem.

In [78]:
#save the final model to disk
joblib.dump(model,ARTIFACTS_PATH+'model.pkl')

['/Users/vishwanathprudhivi/Desktop/Work/Interview/OYO/oyo_case_study/artifacts/model.pkl']