# Classification
## Import Libraries

In [175]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot

sb.set() # set the default Seaborn style for graphics

## Import Preprocessed Data from the Data Processing step earlier.
Note: The preprocessed data are one-hot encoded for categorical variables and scaled for numerical variables (from Data Pre-Processing.ipynb)

In [176]:
# Read columns used to build models
columns = pd.read_csv('Data/basic_model_columns.csv')['Columns'].to_list()

FileNotFoundError: [Errno 2] No such file or directory: 'Data/basic_model_columns.csv'
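The error above suggests the CSV files were not found relative to the current working directory. A minimal sketch for checking the inputs before loading them is shown here. The file names come from the cells in this notebook, while the helper name `find_missing_files` and the assumption that everything lives under one `Data` folder are illustrative only.

```python
from pathlib import Path

# File names taken from the cells in this notebook; the base folder is an assumption.
EXPECTED_FILES = [
    "basic_model_columns.csv",
    "X_train_undersampled_data.csv",
    "X_test.csv",
    "y_train_undersampled_data.csv",
    "y_test.csv",
]


def find_missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected file names that do not exist under data_dir."""
    base = Path(data_dir)
    return [name for name in names if not (base / name).is_file()]


missing = find_missing_files("Data")
if missing:
    print("Run the data pre-processing notebook first; missing files:", missing)
else:
    print("All input files found.")
```

If files are reported missing, re-running the pre-processing notebook or starting Jupyter from the project root may resolve the error.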

In [None]:
X_train = pd.read_csv('Data/X_train_undersampled_data.csv')[columns]
X_test = pd.read_csv('Data/X_test.csv')[columns]
Y_train = pd.read_csv('Data/y_train_undersampled_data.csv')
Y_test = pd.read_csv('Data/y_test.csv')

print("Train Set :", Y_train.shape, X_train.shape)
print("Test Set  :", Y_test.shape, X_test.shape)

## Attempt 1 - Run basic classification models on the current preprocessed dataset without any additional tuning (e.g. hyperparameter tuning, feature selection)


In [None]:
# Import all essential functions from sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix

# Set up a dataframe to store the results from different models
train_metrics = pd.DataFrame(columns=['Classification Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
test_metrics = pd.DataFrame(columns=['Classification Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

### Logistic Regression Classification Model

In [None]:
from sklearn.linear_model import LogisticRegression

logRegModel = LogisticRegression(max_iter=10000, random_state=47).fit(X_train, Y_train.values.ravel())

# Predict the output based on our training and testing dataset
Y_train_pred = logRegModel.predict(X_train)
Y_test_pred = logRegModel.predict(X_test)

#### Plot Confusion Matrix for Logistic Regression Model

In [None]:
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(Y_train, Y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
axes[0].set_title('Train Data Confusion Matrix')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('Actual Label')

sb.heatmap(confusion_matrix(Y_test, Y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
axes[1].set_title('Test Data Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('Actual Label')

print("Train and Test Data Confusion Matrix:")

#### Calculate General Metrics for Logistic Regression Model

In [None]:
train_metric = {
    "Classification Model": "Logistic Regression",
    "Accuracy": accuracy_score(Y_train, Y_train_pred),
    "Precision": precision_score(Y_train, Y_train_pred),
    "Recall": recall_score(Y_train, Y_train_pred),
    "F1 Score": f1_score(Y_train, Y_train_pred)
}

test_metric = {
    "Classification Model": "Logistic Regression",
    "Accuracy": accuracy_score(Y_test, Y_test_pred),
    "Precision": precision_score(Y_test, Y_test_pred),
    "Recall": recall_score(Y_test, Y_test_pred),
    "F1 Score": f1_score(Y_test, Y_test_pred)
}

# Save to overall metrics dataframe for comparison later
train_metrics = pd.concat([train_metrics, pd.DataFrame.from_records([train_metric])], ignore_index = True)
test_metrics = pd.concat([test_metrics, pd.DataFrame.from_records([test_metric])], ignore_index = True)

# Calculate general metrics for the train set
print("**Training Set Metrics**")
print("Accuracy \t:", train_metric["Accuracy"])
print("Precision \t:", train_metric["Precision"])
print("Recall \t\t:", train_metric["Recall"])
print("F1 Score \t:", train_metric["F1 Score"])

print() # New Line

# Calculate general metrics for the test set
print("**Test Set Metrics**")
print("Accuracy \t:", test_metric["Accuracy"])
print("Precision \t:", test_metric["Precision"])
print("Recall \t\t:", test_metric["Recall"])
print("F1 Score \t:", test_metric["F1 Score"])
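The metric cells for each model repeat the same calculations. One possible refactor is sketched here with a hypothetical helper, `summarise_metrics`, which is not part of the original notebook. It is demonstrated on toy labels rather than the hotel booking data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def summarise_metrics(model_name, y_true, y_pred):
    """Build one row of metrics in the same layout as train_metrics / test_metrics."""
    return {
        "Classification Model": model_name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
    }


# Toy labels for illustration only (not the hotel booking data)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

row = summarise_metrics("Toy Model", y_true, y_pred)
metrics_table = pd.DataFrame.from_records([row])
print(metrics_table)
```

With such a helper, each model section would only need to call it once for the training predictions and once for the test predictions.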

##### Insights based on metrics:
Precision, recall and F1 score all decrease from the training set to the test set, even though accuracy increases. This could suggest an underlying issue, such as the model predicting one class better than the other. The test confusion matrix supports this: the model is much more reliable at predicting class 0 ("Not Canceled") than class 1 ("Canceled"), as shown by the high number of false positives.
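One possible mechanism for accuracy rising while precision falls is a difference in class balance between the two sets. The file names suggest the training set was undersampled while the test set was not, but this is an assumption. The sketch below uses made-up hit rates and class counts to show how the same classifier can score differently on a balanced set and on one dominated by class 0.

```python
def metrics_from_rates(n_pos, n_neg, tpr, tnr):
    """Accuracy, precision and recall for a classifier with fixed per-class hit rates."""
    tp = tpr * n_pos   # positives correctly predicted
    fn = n_pos - tp    # positives missed
    tn = tnr * n_neg   # negatives correctly predicted
    fp = n_neg - tn    # negatives wrongly flagged as positive
    accuracy = (tp + tn) / (n_pos + n_neg)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical rates and class counts, chosen only for illustration
TPR, TNR = 0.70, 0.80
balanced = metrics_from_rates(n_pos=1000, n_neg=1000, tpr=TPR, tnr=TNR)
imbalanced = metrics_from_rates(n_pos=200, n_neg=1800, tpr=TPR, tnr=TNR)

for name, (acc, prec, rec) in [("balanced", balanced), ("imbalanced", imbalanced)]:
    print(f"{name:>10}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

In this toy setting, accuracy is higher on the imbalanced set because the majority class is predicted more reliably, while precision drops because false positives from the large negative class outnumber the true positives. It does not explain a drop in recall, which would need a separate cause.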


### Decision Tree Classification Model

In [None]:
# Import DecisionTreeClassifier model from Scikit-Learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

In [None]:
decisionTreeModel = DecisionTreeClassifier(random_state=47)
decisionTreeModel.fit(X_train, Y_train)

# Predict the output based on our training and testing dataset
Y_train_pred = decisionTreeModel.predict(X_train)
Y_test_pred = decisionTreeModel.predict(X_test)

#### Plot Confusion Matrix for Decision Tree Model

In [None]:
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(Y_train, Y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
axes[0].set_title('Train Data Confusion Matrix')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('Actual Label')

sb.heatmap(confusion_matrix(Y_test, Y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
axes[1].set_title('Test Data Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('Actual Label')

print("Train and Test Data Confusion Matrix:")

#### Calculate General Metrics for Decision Tree Model

In [None]:
train_metric = {
    "Classification Model": "Decision Tree",
    "Accuracy": accuracy_score(Y_train, Y_train_pred),
    "Precision": precision_score(Y_train, Y_train_pred),
    "Recall": recall_score(Y_train, Y_train_pred),
    "F1 Score": f1_score(Y_train, Y_train_pred)
}

test_metric = {
    "Classification Model": "Decision Tree",
    "Accuracy": accuracy_score(Y_test, Y_test_pred),
    "Precision": precision_score(Y_test, Y_test_pred),
    "Recall": recall_score(Y_test, Y_test_pred),
    "F1 Score": f1_score(Y_test, Y_test_pred)
}

# Save to overall metrics dataframe for comparison later
train_metrics = pd.concat([train_metrics, pd.DataFrame.from_records([train_metric])], ignore_index = True)
test_metrics = pd.concat([test_metrics, pd.DataFrame.from_records([test_metric])], ignore_index = True)

# Calculate general metrics for the train set
print("**Training Set Metrics**")
print("Accuracy \t:", train_metric["Accuracy"])
print("Precision \t:", train_metric["Precision"])
print("Recall \t\t:", train_metric["Recall"])
print("F1 Score \t:", train_metric["F1 Score"])

print() # New Line

# Calculate general metrics for the test set
print("**Test Set Metrics**")
print("Accuracy \t:", test_metric["Accuracy"])
print("Precision \t:", test_metric["Precision"])
print("Recall \t\t:", test_metric["Recall"])
print("F1 Score \t:", test_metric["F1 Score"])

##### Insights based on metrics:
In the training metrics, accuracy, precision, recall and F1 score are all close to 1. On the test set, however, there is a sharp drop in almost all four metrics. This could imply that the default decision tree model (without any tuning) is overfitting the training data, including its noise and outliers. As a result, the model does not generalise as well to the test set, which it has not seen before.
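The overfitting pattern described above can be reproduced on synthetic data. The sketch below is not run on the hotel booking data: it uses `make_classification` with some label noise as a stand-in, and compares an unrestricted tree against one with an arbitrarily chosen `max_depth=5`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y); not the hotel booking data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=47)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=47)

scores = {}
for depth in [None, 5]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=47).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f} test={scores[depth][1]:.3f}")
```

A large gap between the train and test scores for the unrestricted tree is the signature of overfitting. Limiting the depth is one of several ways to narrow that gap.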


### K-Nearest Neighbour Classification Model

In [None]:
# Import k nearest neighbour classifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# The default number of neighbours is n_neighbors=5 (this can be tuned further in the future)
kNeighboursModel = KNeighborsClassifier()
kNeighboursModel.fit(X_train, Y_train.values.ravel())

# Predict the output based on our training and testing dataset
Y_train_pred = kNeighboursModel.predict(X_train)
Y_test_pred = kNeighboursModel.predict(X_test)

#### Plot Confusion Matrix for K-Nearest Neighbour Model

In [None]:
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(Y_train, Y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
axes[0].set_title('Train Data Confusion Matrix')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('Actual Label')

sb.heatmap(confusion_matrix(Y_test, Y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
axes[1].set_title('Test Data Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('Actual Label')

print("Train and Test Data Confusion Matrix:")

#### Calculate General Metrics for K-Nearest Neighbour Model

In [None]:
train_metric = {
    "Classification Model": "K-nearest Neighbour",
    "Accuracy": accuracy_score(Y_train, Y_train_pred),
    "Precision": precision_score(Y_train, Y_train_pred),
    "Recall": recall_score(Y_train, Y_train_pred),
    "F1 Score": f1_score(Y_train, Y_train_pred)
}

test_metric = {
    "Classification Model": "K-nearest Neighbour",
    "Accuracy": accuracy_score(Y_test, Y_test_pred),
    "Precision": precision_score(Y_test, Y_test_pred),
    "Recall": recall_score(Y_test, Y_test_pred),
    "F1 Score": f1_score(Y_test, Y_test_pred)
}

# Save to overall metrics dataframe for comparison later
train_metrics = pd.concat([train_metrics, pd.DataFrame.from_records([train_metric])], ignore_index = True)
test_metrics = pd.concat([test_metrics, pd.DataFrame.from_records([test_metric])], ignore_index = True)

# Calculate general metrics for the train set
print("**Training Set Metrics**")
print("Accuracy \t:", train_metric["Accuracy"])
print("Precision \t:", train_metric["Precision"])
print("Recall \t\t:", train_metric["Recall"])
print("F1 Score \t:", train_metric["F1 Score"])

print() # New Line

# Calculate general metrics for the test set
print("**Test Set Metrics**")
print("Accuracy \t:", test_metric["Accuracy"])
print("Precision \t:", test_metric["Precision"])
print("Recall \t\t:", test_metric["Recall"])
print("F1 Score \t:", test_metric["F1 Score"])

##### Insights based on metrics:
The model does not generalise well to new data, as shown by the lower metrics on the test set. When no value of K is specified, the model defaults to 5 neighbours. Such a small number of neighbours could result in overfitting, because it makes the model more sensitive to noise and outliers.

In addition, our dataset has quite a number of features, and K-NN tends to perform poorly on high-dimensional data. As the number of features grows, the distances between data points become increasingly similar, so the "nearest" neighbours are less meaningful.
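The effect of dimensionality on distances can be illustrated with random data. The sketch below uses uniformly random points (an assumption made purely for illustration, not the hotel booking features) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np


def relative_contrast(n_features, n_points=500, seed=47):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in [2, 10, 100, 1000]}
for d, c in contrasts.items():
    print(f"{d:>5} features: relative contrast = {c:.2f}")
```

As the number of features increases, the relative contrast shrinks, meaning the nearest and farthest neighbours become hard to tell apart by distance alone.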

### Linear Support Vector Machine (SVM) Classification Model

In [None]:
# Import SVM from Sklearn
from sklearn.svm import SVC

svmModel = SVC(kernel="linear", random_state=47)
svmModel.fit(X_train, Y_train.values.ravel())

# Predict the output based on our training and testing dataset
Y_train_pred = svmModel.predict(X_train)
Y_test_pred = svmModel.predict(X_test)

#### Plot Confusion Matrix for Linear Support Vector Machine (SVM) Model

In [None]:
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(Y_train, Y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
axes[0].set_title('Train Data Confusion Matrix')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('Actual Label')

sb.heatmap(confusion_matrix(Y_test, Y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
axes[1].set_title('Test Data Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('Actual Label')

print("Train and Test Data Confusion Matrix:")

#### Calculate General Metrics for Linear Support Vector Machine (SVM) Model

In [None]:
train_metric = {
    "Classification Model": "Support Vector Machine",
    "Accuracy": accuracy_score(Y_train, Y_train_pred),
    "Precision": precision_score(Y_train, Y_train_pred),
    "Recall": recall_score(Y_train, Y_train_pred),
    "F1 Score": f1_score(Y_train, Y_train_pred)
}

test_metric = {
    "Classification Model": "Support Vector Machine",
    "Accuracy": accuracy_score(Y_test, Y_test_pred),
    "Precision": precision_score(Y_test, Y_test_pred),
    "Recall": recall_score(Y_test, Y_test_pred),
    "F1 Score": f1_score(Y_test, Y_test_pred)
}

# Save to overall metrics dataframe for comparison later
train_metrics = pd.concat([train_metrics, pd.DataFrame.from_records([train_metric])], ignore_index = True)
test_metrics = pd.concat([test_metrics, pd.DataFrame.from_records([test_metric])], ignore_index = True)

# Calculate general metrics for the train set
print("**Training Set Metrics**")
print("Accuracy \t:", train_metric["Accuracy"])
print("Precision \t:", train_metric["Precision"])
print("Recall \t\t:", train_metric["Recall"])
print("F1 Score \t:", train_metric["F1 Score"])

print() # New Line

# Calculate general metrics for the test set
print("**Test Set Metrics**")
print("Accuracy \t:", test_metric["Accuracy"])
print("Precision \t:", test_metric["Precision"])
print("Recall \t\t:", test_metric["Recall"])
print("F1 Score \t:", test_metric["F1 Score"])

##### Insights based on metrics:
Similar to the logistic regression model, accuracy improves from the training set to the test set, but precision, recall and F1 score are much lower. One possible explanation is that the SVM is trained with kernel="linear", so it is possible that the data is not very linearly separable. In addition, the confusion matrix for the test data clearly shows that the model predicts class 0 ("Not Canceled") very well, but class 1 ("Canceled") less well, as shown by the high false positive count.

### Naive Bayes (Gaussian) Classification Model

Our dataset consists of mixed data types (scaled numerical features and one-hot encoded values), so no single Naive Bayes variant (Gaussian, Bernoulli, Multinomial) is an obvious best fit. Let's try GaussianNB, given the scaled numerical features we have.

In [None]:
# Import Gaussian Naive Bayes Classifier from Sklearn
from sklearn.naive_bayes import GaussianNB

gaussianNBModel = GaussianNB()
gaussianNBModel.fit(X_train, Y_train.values.ravel())

# Predict the output based on our training and testing dataset
Y_train_pred = gaussianNBModel.predict(X_train)
Y_test_pred = gaussianNBModel.predict(X_test)

#### Plot Confusion Matrix for Naive Bayes (Gaussian) Model

In [None]:
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(Y_train, Y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
axes[0].set_title('Train Data Confusion Matrix')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('Actual Label')

sb.heatmap(confusion_matrix(Y_test, Y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
axes[1].set_title('Test Data Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('Actual Label')

print("Train and Test Data Confusion Matrix:")

#### Calculate General Metrics for Naive Bayes (Gaussian) Model

In [None]:
train_metric = {
    "Classification Model": "Gaussian Naive Bayes",
    "Accuracy": accuracy_score(Y_train, Y_train_pred),
    "Precision": precision_score(Y_train, Y_train_pred),
    "Recall": recall_score(Y_train, Y_train_pred),
    "F1 Score": f1_score(Y_train, Y_train_pred)
}

test_metric = {
    "Classification Model": "Gaussian Naive Bayes",
    "Accuracy": accuracy_score(Y_test, Y_test_pred),
    "Precision": precision_score(Y_test, Y_test_pred),
    "Recall": recall_score(Y_test, Y_test_pred),
    "F1 Score": f1_score(Y_test, Y_test_pred)
}

# Save to overall metrics dataframe for comparison later
train_metrics = pd.concat([train_metrics, pd.DataFrame.from_records([train_metric])], ignore_index = True)
test_metrics = pd.concat([test_metrics, pd.DataFrame.from_records([test_metric])], ignore_index = True)

# Calculate general metrics for the train set
print("**Training Set Metrics**")
print("Accuracy \t:", train_metric["Accuracy"])
print("Precision \t:", train_metric["Precision"])
print("Recall \t\t:", train_metric["Recall"])
print("F1 Score \t:", train_metric["F1 Score"])

print() # New Line

# Calculate general metrics for the test set
print("**Test Set Metrics**")
print("Accuracy \t:", test_metric["Accuracy"])
print("Precision \t:", test_metric["Precision"])
print("Recall \t\t:", test_metric["Recall"])
print("F1 Score \t:", test_metric["F1 Score"])

##### Insights based on metrics:
For both the train and test sets, the accuracy is below 55%, which means more than 45% of the samples are incorrectly predicted. This indicates that the trained model is a poor fit for this prediction task. Another interesting observation is the extremely low precision combined with high recall: the model correctly identifies most of the actual positives, but only a small fraction of the samples it predicts as positive are actually positive.

### Comparing the Different Classification Models

In [None]:
# train_metrics.sort_values(by=['Accuracy'], ascending=True,inplace=True)
# train_metrics

In [None]:
# ax = sb.barplot(x="Accuracy", y="Classification Model", data=train_metrics)

In [None]:
test_metrics.sort_values(by=['Accuracy'], ascending=True,inplace=True)
test_metrics

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs = axs.flatten()

colormap = plt.get_cmap('tab10')  # plt.cm.get_cmap is no longer available in recent Matplotlib releases

for i, data in enumerate(['Accuracy', 'Precision', 'Recall', 'F1 Score']):
    colors = colormap.colors[:len(test_metrics['Classification Model'])]
    for index, (model, value) in enumerate(zip(test_metrics['Classification Model'], test_metrics[data])):
        axs[i].bar(model, value, color=colors[index], label=model if i == 0 else "", edgecolor='k')
    axs[i].set_title(data)
    axs[i].set_ylabel('Score')
    axs[i].set_xlabel('Classification Model')
    axs[i].set_xticks(test_metrics['Classification Model'])
    axs[i].set_xticklabels(test_metrics['Classification Model'], rotation=45)
    for index, value in enumerate(test_metrics[data]):
        axs[i].text(index, value, str(round(value, 2)), ha='center', va='bottom')

handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='center', bbox_to_anchor=(0.5, -0.03), ncol=len(test_metrics['Classification Model']))

plt.tight_layout()
plt.show()

#### Insights From Comparing the Different Models
1. The two best models based on accuracy are Logistic Regression and Support Vector Machine, both with an accuracy of 0.81. The two also have similar precision, recall and F1 scores.

2. However, the logistic regression and SVM models appear to have the lowest recall, which measures how many of the actual positive samples the model correctly identifies. Recall is calculated as True Positive / (True Positive + False Negative), so a low recall implies many false negatives: we predicted that the customer would not cancel the hotel booking, but the customer actually canceled. Recall is very important in our use case because the goal of the project is to identify the customers who will cancel their booking. This insight allows hotel management to prepare in advance and minimise disruptions, so it is crucial to reduce the false negative rate.

3. Gaussian Naive Bayes appears to be a poor model for our use case and dataset. This is likely due to the mix of categorical and numerical attributes.
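To make the recall formula in point 2 concrete, the sketch below reads the four cells of a binary confusion matrix and computes recall by hand. The labels are toy values chosen for illustration, not results from the models above.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels for illustration only: 1 = "Canceled", 0 = "Not Canceled"
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, rows are actual classes and columns are predicted classes,
# so ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_recall = tp / (tp + fn)
print("False negatives (missed cancellations):", fn)
print("Recall computed by hand:", manual_recall)
print("Recall from recall_score:", recall_score(y_true, y_pred))
```

In the heatmaps plotted earlier, the false negatives therefore sit in the bottom-left cell (actual label 1, predicted label 0).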

### Next Steps:
The goal here is to improve classification performance. Below are the steps we will take to try to improve on the results of the basic classification models trained above.

- Making use of ensembling (bagging and boosting) techniques to improve the classification performance
- Feature Engineering and Feature Selection
- Hyperparameter Tuning
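As a rough sketch of how two of these steps might be combined, the example below tunes a random forest (a bagging-style ensemble) with a grid search that optimises for recall. The synthetic data, the choice of `RandomForestClassifier` and the small parameter grid are all assumptions for illustration, not the setup used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the hotel booking data
X, y = make_classification(n_samples=300, n_features=10, random_state=47)

# Hypothetical search grid, kept small so the sketch runs quickly
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=47),  # a bagging-style ensemble
    param_grid,
    scoring="recall",  # optimise for recall, the metric highlighted above
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated recall:", round(search.best_score_, 3))
```

Scoring on recall reflects the priority discussed above, although in practice it may need to be balanced against precision (for example by scoring on F1 instead).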