# 

In [None]:
# If additional packages are needed but are not installed by default, uncomment the last two lines of this cell
# and replace <package list> with a list of additional packages.
# This will ensure the notebook has all the dependencies and works everywhere

#import sys
#!{sys.executable} -m pip install <package list>

In [None]:
# Libraries
import pandas as pd
pd.set_option("display.max_columns", 101)
import seaborn as sns 
import numpy as np
import matplotlib.pyplot as plt

## Data Description

Column | Description
:---|:---
`id` | Unique identifier for each booking.
`lead_time` | Time between booking date and reservation date (in days).
`arrival_week` | Week number of the arrival date.
`duration` | Booking duration (in days).
`prev_cancel` | Number of previous bookings that were cancelled by the customer prior to the current booking.
`booking_changes` | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.
`waiting_period` | Waiting period for booking confirmation (in days).
`per_Day_price` | Per night booking price (in US$).
`parking` | Number of car parking spaces required by the customer.
`special_request` | Number of special requests made by the customer.
`segment` | Market segment designation. In categories, “TA” means “Travel Agents” and “TO” means “Tour Operators”.
`deposit` | Whether the customer made a deposit to guarantee the booking.
`cust_type` | Type of booking, assuming one of four categories.
`is_cancelled` | Value indicating if the booking was cancelled (1) or not (0).

## Data Wrangling & Visualization

In [None]:
# The dataset is already loaded below
data = pd.read_csv("")  # add path to file

In [None]:
data.head()

In [None]:
#Explore columns
data.columns

In [None]:
#Description
data.describe()

## Visualization, Modeling, Machine Learning

Build a classification model to determine whether a customer will cancel a booking. Please explain the findings effectively to technical and non-technical audiences using comments and visualizations, if appropriate.
- **Build an optimized model that effectively solves the business problem.**
- **The model's performance will be evaluated on the basis of accuracy.**
- **Read the test.csv file and prepare features for testing.**

In [None]:
#Loading Test data
test_data = pd.read_csv("")  # add path to file
test_data.head()

Key takeaways from the outset:
- `id` is unique for each booking and so cannot be used during classification; we will drop it.
- The output label is `is_cancelled`.
- `segment`, `deposit` and `cust_type` are categorical features, and we will need to test their correlation/independence separately.

We first check the distribution of our output labels.

In [None]:
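#Observe output class balance
y = data.is_cancelled  # y is first used in this cell, so define it before plotting
ax = sns.countplot(x=y)
counts = y.value_counts()  # indexed by class label, so the lookups below do not depend on sort order
print('Yes: ', counts.loc[1])
print('No: ', counts.loc[0])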

The classes are not severely imbalanced, so we do not need class-imbalance fixes such as selective upsampling or downsampling.
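
Because the model is evaluated on accuracy, it can help to express the class balance as proportions and note what a trivial "always predict the majority class" rule would score. The sketch below uses a small made-up label series (`y_demo` is hypothetical, not the booking data) to show the calculation.

```python
import pandas as pd

# Toy labels standing in for data["is_cancelled"] (hypothetical values)
y_demo = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name="is_cancelled")

# Share of each class; normalize=True returns fractions that sum to 1
class_share = y_demo.value_counts(normalize=True).sort_index()
print(class_share)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any trained model should beat this baseline by a clear margin for its accuracy to be meaningful.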


**Describe the most important features of the model to management.**

> #### Task:
- **Visualize the top 10 features and their feature importance.**


To get a basic idea for feature selection, we start with a simple Random Forest feature importance plot, without any significant transformation of our data.

In [None]:
# random forest for feature importance on a classification problem
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from matplotlib import pyplot
# define dataset
X = data.drop(['',''], axis=1)  # drop unneeded features
y = data.is_cancelled

#Encode categorical values
coder1 = LabelEncoder()
labels1 = coder1.fit_transform(X[''])
X[''] = labels1
coder2= LabelEncoder()
labels2 = coder2.fit_transform(X[''])
X[''] = labels2
coder3= LabelEncoder()
labels3 = coder3.fit_transform(X[''])
X[''] = labels3

# define the model
model = RandomForestClassifier()

# fit the model
model.fit(X, y)

# get importance
importance = model.feature_importances_

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
    
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

It appears that feature 5 (waiting_period) out of the 11 features is the least important, followed closely by feature 7 (parking).
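
The plot above labels features by position only. One way to address the "top 10 features" task is to pair each importance with its column name and keep the ten largest. The sketch below does this on synthetic data from `make_classification`; the names `feature_0` … `feature_11`, `X_demo` and `rf` are placeholders for illustration, not the booking columns or the model above.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the booking data (hypothetical feature names)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(12)])

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name, then keep the ten largest
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
top10 = importances.nlargest(10)

# Horizontal bars; sorting ascending puts the most important feature at the top
fig, ax = plt.subplots(figsize=(8, 5))
top10.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Importance")
ax.set_title("Top 10 features by random forest importance")
```

With the real data, passing the encoded `X` in place of `X_demo` would give named bars that are easier to present to management.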

We will now begin our own analysis and observe whether any features can be dropped. To begin our data transformation and detailed analysis, we first work with the numerical input features. We will normalize the data (z-score: subtract the mean and divide by the standard deviation) and draw a box plot of the numerical features to get a quick overview of the data.

In [None]:
num_drop = ['','','','',''] #numerical features to be dropped
x_num = data.drop(num_drop,axis = 1 )
y = data.is_cancelled
x_num.head()

In [None]:
#Normalization (z-score): subtract each column's mean and divide by its standard deviation
pltdata = x_num
data2 = (pltdata - pltdata.mean()) / (pltdata.std())
pltdata = pd.concat([y,data2],axis=1)
pltdata = pd.melt(pltdata,id_vars="is_cancelled",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(9,9))

#Box plot
sns.boxplot(x="features", y="value", hue="is_cancelled", data=pltdata)
plt.xticks(rotation=90)

The box plot shows that the majority of values for prev_cancel, booking_changes and parking are 0, with a handful of severe outliers. While this may say something about the overall importance of those three features, instead of eliminating them outright, we will first examine their correlation using a pair grid plot.

In [None]:
#Pair grid plot
sns.set(style="white")
df = x_num.loc[:,['prev_cancel','booking_changes','parking']]
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)

It looks like all three of those features might be correlated. However, we will not draw final conclusions yet, although at this point it looks like parking can be dropped during feature selection, as it had low importance in our very first plot. We will proceed to a final correlation plot using a heat map.

In [None]:
#Heat map
f,ax = plt.subplots(figsize=(9, 9))
sns.heatmap(x_num.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

At this point there does not appear to be a strong correlation between any of the numerical input features, so we will not drop any of them on that basis. We will now proceed to categorical feature selection. To get a clear idea of the unique values in each, we will also print a short summary of the categories.

In [None]:
cat_drop = ['','','','','','','','','','','']  # categorical features to be dropped
x_cat = data.drop(cat_drop,axis = 1)

#Summary of complete data

def data_summary(df):

    df = pd.DataFrame({'type': df.dtypes,
                       'null_values': df.isna().sum(),
                       'null_values (%)': (df.isna().sum() / df.shape[0]) * 100,
                       'unique': df.nunique()})
    return df

print(data_summary(x_cat))
print(x_cat.segment.unique())
print(x_cat.deposit.unique())
print(x_cat.cust_type.unique())

To begin testing the categorical features, we first need to encode our data (similar to what we did for the initial feature importance plot).

In [None]:
# label-encode the three categorical features (segment, deposit, cust_type)
segcoder = LabelEncoder()
seg_labels = segcoder.fit_transform(x_cat['segment'])
x_cat['segment'] = seg_labels
depcoder = LabelEncoder()
dep_labels = depcoder.fit_transform(x_cat['deposit'])
x_cat['deposit'] = dep_labels
custcoder = LabelEncoder()
cust_labels = custcoder.fit_transform(x_cat['cust_type'])
x_cat['cust_type'] = cust_labels

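To measure the association between each categorical feature and the label, we use a chi-square test of independence; a small p-value indicates that the feature and `is_cancelled` are not independent.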

In [None]:
#Use Chisquare
from scipy.stats import chi2_contingency
def chisq_of_df_cols(df, c1, c2):
    groups = df.groupby([c1, c2]).size()
    ctsum = groups.unstack(c1)
    #print(ctsum)  # prints the contingency table generated for the chi-square test
    # Fill missing category combinations with 0 counts so the table is complete
    return(chi2_contingency(ctsum.fillna(0)))


def chisq_of_df(x_cat):
    for column in x_cat:
        column1 = str(column)
        if column1 == 'is_cancelled':
            continue
        print(column1)  # Print the column name
        chisquared_value = chisq_of_df_cols(x_cat, 'is_cancelled', column1)        
        print('test statistics: ', chisquared_value[0])
        print('P-value: ', chisquared_value[1])

        
#add the output label to test against
x_cat['is_cancelled'] = data.is_cancelled
chisq_of_df(x_cat)  # prints the chi-squared statistic and p-value for each categorical feature
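
The contingency table that the helper above builds with `groupby` and `unstack` can also be produced with `pd.crosstab`, which fills missing combinations with 0 on its own. The sketch below runs the same test on a tiny made-up frame; the category names and counts in `toy` are hypothetical and only illustrate the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up bookings: one categorical feature and the binary label (hypothetical values)
toy = pd.DataFrame({
    "deposit": ["No Deposit"] * 6 + ["Non Refund"] * 4,
    "is_cancelled": [0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
})

# Rows are categories, columns are label values, cells are counts
table = pd.crosstab(toy["deposit"], toy["is_cancelled"])
print(table)

# chi2_contingency returns the statistic, p-value, degrees of freedom
# and the expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(table)
print("test statistic:", chi2)
print("P-value:", p_value)
print("degrees of freedom:", dof)
```

Note that the chi-square test does not depend on how categories are numbered, so it could be run on the raw string columns without label encoding.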

The p-values for all three features are below 0.05, so all categorical features are significant. Based on the analysis so far, we conclude that only id and parking need to be dropped from our input features.

We now create our model, which we will use to predict on the test.csv data. We use Random Forest as our classifier and pandas get_dummies to one-hot encode our categorical features instead of label encoding. We set the number of decision trees to 1000 and keep all other hyperparameters at their defaults.
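
As a quick illustration of what `get_dummies` does, the sketch below encodes a tiny made-up frame. The `lead_time` values and the "No Deposit" category are hypothetical; only the column naming pattern matters here.

```python
import pandas as pd

# Made-up rows with one numeric and one categorical column (hypothetical values)
toy = pd.DataFrame({
    "lead_time": [10, 45, 3],
    "deposit": ["No Deposit", "Refundable", "No Deposit"],
})

# Numeric columns pass through unchanged; each category of a string
# column becomes its own indicator column named "<column>_<category>"
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Unlike label encoding, this does not impose an arbitrary numeric order on the categories, at the cost of one extra column per category.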

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X = data.drop(['id','parking','is_cancelled'], axis=1)
y = data.is_cancelled

# One-hot encode the data using pandas get_dummies
X = pd.get_dummies(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit the model
model = RandomForestClassifier(n_estimators = 1000, random_state = 42)
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

We landed at an accuracy of 84.55% on the held-out split, which we are happy with. We will now begin predicting the test.csv data with our model.

> #### Task:
- **Submit the predictions on the test dataset using your optimized model** <br/>
    For each record in the test set (`test.csv`), you must predict whether a customer will cancel his booking or not. You should submit a CSV file with a header row and one row per test entry. 

The file (`submissions.csv`) should have exactly 2 columns:
   - **id**
   - **is_cancelled**
   

In [None]:
#prep and one-hot encode prediction data
pred_data = test_data.set_index(['id'])
pred_data = pred_data.drop(['parking'],axis=1)
pred_data = pd.get_dummies(pred_data)

In our test data, the deposit column's "Refundable" category is not present in test.csv (so get_dummies does not create the corresponding column), and as a result an error kept appearing. To work around this, I created the corresponding all-zero column in pred_data.

In [None]:
pred_data.insert(17, 'deposit_Refundable', pred_data.cust_type_Group)
pred_data['deposit_Refundable'] = pred_data['deposit_Refundable']*0
pred_data.deposit_Refundable.nunique()
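
Inserting the missing column at a hard-coded position works for this dataset but depends on knowing which column is absent and where it belongs. A more general option, sketched below on made-up frames, is to reindex the encoded prediction data to the training columns. Here `train_X` and `new_X` are hypothetical stand-ins for the encoded `X` and `pred_data`.

```python
import pandas as pd

# Training frame contains both deposit categories (hypothetical values)
train_X = pd.get_dummies(pd.DataFrame({
    "lead_time": [10, 45, 3],
    "deposit": ["No Deposit", "Refundable", "No Deposit"],
}))

# New data lacks the "Refundable" category, so get_dummies omits that column
new_X = pd.get_dummies(pd.DataFrame({
    "lead_time": [7, 21],
    "deposit": ["No Deposit", "No Deposit"],
}))

# Match the training columns and their order; columns missing from the
# new data are created and filled with 0
aligned = new_X.reindex(columns=train_X.columns, fill_value=0)
print(aligned.columns.tolist())
```

In this notebook the equivalent call would be `pred_data.reindex(columns=X.columns, fill_value=0)`, assuming `X` still holds the one-hot encoded training features. Note that reindexing also drops any column that appears only in the new data.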

In [None]:
predictions = model.predict(pred_data)

Once the predictions are made, we will now save the output (id,is_cancelled) into a csv file.

In [None]:
submission_df = test_data[['id']].copy() #copy the id column from test_data

In [None]:
submission_df['is_cancelled'] = predictions
print(submission_df)

In [None]:
#Submission
submission_df.to_csv('submissions.csv',index=False)
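
Before submitting, it can be worth reading the file back to confirm it has exactly the two required columns and one row per test entry. The sketch below shows the check on a made-up frame written to an in-memory buffer; `submission_demo` and its values are hypothetical.

```python
import io
import pandas as pd

# Made-up submission with the required two columns (hypothetical values)
submission_demo = pd.DataFrame({"id": [101, 102, 103], "is_cancelled": [0, 1, 0]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Exactly two columns, one row per entry, and only 0/1 predictions
assert list(check.columns) == ["id", "is_cancelled"]
assert len(check) == len(submission_demo)
assert check["is_cancelled"].isin([0, 1]).all()
```

For the real file, the same assertions could be applied to `pd.read_csv('submissions.csv')`, comparing the row count with `len(test_data)`.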