## Holiday Package Prediction

Author: **Ali Raza** \
Project Type: **Mock Project** 


## 1- Problem Statement
Trips and Travel.com aims to optimize its marketing strategy by predicting whether customers are likely to purchase a newly launched holiday package. Using historical data on previous package purchases, the objective is to build a binary classification model that can forecast customer interest. This prediction will help the company decide whether launching a new package is likely to be successful, enabling better resource allocation, reduced marketing costs, and increased conversion rates.

## 2- Data Collection
The dataset is available on Kaggle.


In [None]:
## Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline


In [None]:
df = pd.read_csv("Travel.csv")

df.head()

In [None]:
df.info()

## 3- Data Cleaning

### Handling Missing Values

1. Handling Missing Values
2. Handling Duplicates
3. Check Data Types
4. Understand the Dataset
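
The cells that follow concentrate on missing values and label clean-up. As a minimal sketch of the duplicate and data-type checks from the list above, the snippet below runs the same kind of checks on a small made-up frame. The frame `toy` and its values are hypothetical and are not taken from `Travel.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [34.0, 41.0, 34.0, None],
    "Gender": ["Male", "Female", "Male", "Fe Male"],
})

# Step 1: count missing values per column
missing_per_column = toy.isnull().sum()

# Step 2: count fully duplicated rows, then drop them
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# Step 3: inspect the inferred data types
print(toy.dtypes)
print(missing_per_column)
print("duplicate rows:", n_duplicates)
```

On the real data, the same calls would be applied to `df` instead of `toy`.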

In [None]:

## Checking missing values
df.isnull().sum()

In [None]:
## Checking the categorical features to see whether the
## data contains any inconsistent labels
print(df['Gender'].value_counts())
print(df['MaritalStatus'].value_counts())

## perform the same operation on other categories as well

In [None]:
df['Gender'] = df['Gender'].replace('Fe Male', 'Female')
df['MaritalStatus'] = df['MaritalStatus'].replace('Single', 'Unmarried')

df['Gender'].value_counts()

In [None]:
df['MaritalStatus'].value_counts()

In [None]:
df.columns

In [None]:
## Checking the Missing values
## these are the Features with nan values

## [expression for item in iterable if condition]

features_with_na = [features for features in df.columns if df[features].isnull().sum()>=1]

for i in features_with_na:
    print(i)

In [None]:
for feature in features_with_na:
    print(feature, np.round(df[feature].isnull().mean()*100, 2), '% missing values')

In [None]:
## basic stats on numerical columns

df[features_with_na].select_dtypes(exclude='object').describe()

### Imputing Null values

1. Impute Median value for Age column
2. Impute Mode for Type of Contact
3. Impute Median for Duration of Pitch
4. Impute Mode for Number of Followups as it is discrete feature
5. Impute Mode for Preferred Property Star
6. Impute Median for Number of trips
7. Impute Mode for Number of Children Visiting
8. Impute Median for Monthly Income 
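
The rule of thumb behind this list is median for continuous numeric columns and mode for categorical or discrete ones. A minimal sketch of both on made-up data follows. The frame `toy` is hypothetical, and its two column names are borrowed from the dataset only for readability.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Age": [30.0, None, 40.0, 50.0],
    "TypeofContact": ["Self Enquiry", None, "Self Enquiry", "Company Invited"],
})

# Median for a continuous numeric column (less sensitive to outliers than the mean)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Mode (most frequent value) for a categorical or discrete column
toy["TypeofContact"] = toy["TypeofContact"].fillna(toy["TypeofContact"].mode()[0])

print(toy)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on the selected column, avoids chained-assignment problems in recent pandas versions.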

In [None]:
## Assign the result back; inplace fillna on a selected column may not update df in recent pandas
df['Age'] = df.Age.fillna(df.Age.median())

print(df.Age.isnull().sum())

In [None]:
df.TypeofContact.mode()[0]

In [None]:
df['TypeofContact'] = df.TypeofContact.fillna(df.TypeofContact.mode()[0])
print(df.TypeofContact.isnull().sum())

In [None]:
## Repeating the same procedure for all the NaN-containing features
## (results are assigned back instead of using inplace=True on a selected column)

df['DurationOfPitch'] = df.DurationOfPitch.fillna(df.DurationOfPitch.median())

df['NumberOfFollowups'] = df.NumberOfFollowups.fillna(df.NumberOfFollowups.mode()[0])

df['PreferredPropertyStar'] = df.PreferredPropertyStar.fillna(df.PreferredPropertyStar.mode()[0])

df['NumberOfTrips'] = df.NumberOfTrips.fillna(df.NumberOfTrips.median())

df['NumberOfChildrenVisiting'] = df.NumberOfChildrenVisiting.fillna(df.NumberOfChildrenVisiting.mode()[0])

df['MonthlyIncome'] = df.MonthlyIncome.fillna(df.MonthlyIncome.median())


print(df.isnull().sum())

In [None]:
df.head()

In [None]:
df.drop('CustomerID', inplace=True, axis=1)

df.head()

In [None]:
## Creating new columns and removing unnecessary columns

df['TotalVisiting'] = df['NumberOfPersonVisiting'] + df['NumberOfChildrenVisiting']

df.drop(columns=['NumberOfPersonVisiting','NumberOfChildrenVisiting'], inplace=True)

df.head()

In [None]:
## Getting all the Numeric features

num_features = [feature for feature in df.columns if df[feature].dtype != 'O']

## getting all the Categorical features

cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']

print('Number of numeric features', len(num_features))
print('Number of Categorical Features', len(cat_features))

In [None]:
## Discrete Features 

discrete_features = [feature for feature in num_features if len(df[feature].unique())<=25]

## Continuous Features
continuous_features = [feature for feature in num_features if len(df[feature].unique())>25]

print('Number of Discrete features', len(discrete_features))
print('Number of Continuous features', len(continuous_features))

In [None]:
df.head()

## 4- Train Test Split and Model Training

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(['ProdTaken'], axis=1)
y = df['ProdTaken']

X.head()

In [None]:
y.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape

In [None]:
X.info()

In [None]:
## Creating Column Transformer

cat_features = X.select_dtypes(include='object').columns
num_features = X.select_dtypes(exclude='object').columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features)
    ]
)


In [None]:
preprocessor

In [None]:
## Applying transformation in training dataset ----> fit_transform

X_train = preprocessor.fit_transform(X_train)

print(X_train)

In [None]:
pd.DataFrame(X_train)

In [None]:
## Apply transformation in test data using ----> transform

X_test = preprocessor.transform(X_test)

In [None]:
pd.DataFrame(X_test)

## 5- Random Forest Classifier Training

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay, \
precision_score, recall_score, f1_score, roc_auc_score, roc_curve

In [None]:
models = {

    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train) ## Model training

    ## Making Predictions
    y_train_pred  = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    ## Training set performance
    ## Note: F1 is weighted across classes, precision and recall refer to the
    ## positive class, and ROC AUC is computed from hard class labels
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average= 'weighted')
    model_train_precision = precision_score(y_train, y_train_pred)
    model_train_recall = recall_score(y_train, y_train_pred)
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred)

    ## Test Performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average= 'weighted')
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred)


    print(name)

    print('Model Performance for Training Set')
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print("- F1 score: {:.4f}".format(model_train_f1))
    print("- Precision: {:.4f}".format(model_train_precision))
    print("- Recall: {:.4f}".format(model_train_recall))
    print("- ROC AUC Score: {:.4f}".format(model_train_rocauc_score))

    print('-'*35)

    print("Model Performance for Test set")
    print("- Accuracy: {:.4f}".format(model_test_accuracy))
    print("- F1 score: {:.4f}".format(model_test_f1))
    print("- Precision: {:.4f}".format(model_test_precision))
    print("- Recall: {:.4f}".format(model_test_recall))
    print("- ROC AUC Score: {:.4f}".format(model_test_rocauc_score))

    print('='*35)
    print('\n')
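
The loop above passes the output of `predict` (hard 0/1 labels) to `roc_auc_score`. The metric is more commonly computed from predicted probabilities, and the two numbers can differ noticeably. The sketch below illustrates the gap on invented values. `y_true` and `y_prob` are hypothetical, and thresholding at 0.5 is an assumption about how the hard labels are derived.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.6, 0.45, 0.7, 0.9])

# Hard labels obtained by thresholding at 0.5 (assumed here for illustration)
y_label = (y_prob >= 0.5).astype(int)

# AUC from hard labels only sees one operating point of the classifier
auc_from_labels = roc_auc_score(y_true, y_label)

# AUC from probabilities uses the full ranking of the samples
auc_from_probs = roc_auc_score(y_true, y_prob)

print(auc_from_labels, auc_from_probs)
```

For a fitted scikit-learn classifier, the probability-based variant would use `model.predict_proba(X_test)[:, 1]` as the score.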


In [None]:
## Hyperparameter tuning

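## "auto" was removed from max_features in scikit-learn 1.3;
## "sqrt" is the equivalent setting for classifiers
rf_params = {
    "max_depth": [5,8,15,None,10],
    "max_features": [5,7,"sqrt",8],
    "min_samples_split": [2,8,15,20],
    "n_estimators": [100,200,500,1000]
}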

In [None]:
## Models list for hyperparameter tuning
randomcv_models = [
    ("RF", RandomForestClassifier(), rf_params)
]

In [None]:
from sklearn.model_selection import RandomizedSearchCV

model_param = {}

for name,model,params in randomcv_models:
    ## Named random_search to avoid shadowing Python's built-in random module
    random_search = RandomizedSearchCV(estimator=model,
                                       param_distributions=params,
                                       n_iter=100,
                                       cv=3,
                                       verbose=2,
                                       n_jobs=-1)
    
    random_search.fit(X_train, y_train)
    model_param[name] = random_search.best_params_


for model_name in model_param:
    print(f'---------------- Best Params for {model_name} ----------------')
    print(model_param[model_name])
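
As a rough check on the cost of this search, the following sketch works out how many parameter combinations the grid contains and how many model fits the settings above imply. The dictionary `grid_sizes` is a hypothetical helper that records only how many values each parameter has in `rf_params`.

```python
from math import prod

# Number of candidate values per hyperparameter, as listed in rf_params
grid_sizes = {
    "max_depth": 5,
    "max_features": 4,
    "min_samples_split": 4,
    "n_estimators": 4,
}

# Total number of distinct combinations in the grid
n_combinations = prod(grid_sizes.values())

# Settings used in the RandomizedSearchCV call
n_iter, cv = 100, 3

# Each sampled candidate is fitted once per cross-validation fold
n_fits = n_iter * cv

print(n_combinations, n_fits)
```

So the randomized search samples 100 of the 320 combinations and performs 300 cross-validation fits, plus one final refit on the full training set when the default `refit=True` is used.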

In [None]:
models = {

    "Random Forest": RandomForestClassifier(n_estimators=500, min_samples_split=2,
                                            max_features=8, max_depth=None)
}

for name, model in models.items():
    model.fit(X_train, y_train) ## Model training

    ## Making Predictions
    y_train_pred  = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    ## Training set performance
    ## Note: F1 is weighted across classes, precision and recall refer to the
    ## positive class, and ROC AUC is computed from hard class labels
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average= 'weighted')
    model_train_precision = precision_score(y_train, y_train_pred)
    model_train_recall = recall_score(y_train, y_train_pred)
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred)

    ## Test Performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average= 'weighted')
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred)


    print(name)

    print('Model Performance for Training Set')
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print("- F1 score: {:.4f}".format(model_train_f1))
    print("- Precision: {:.4f}".format(model_train_precision))
    print("- Recall: {:.4f}".format(model_train_recall))
    print("- ROC AUC Score: {:.4f}".format(model_train_rocauc_score))

    print('-'*35)

    print("Model Performance for Test set")
    print("- Accuracy: {:.4f}".format(model_test_accuracy))
    print("- F1 score: {:.4f}".format(model_test_f1))
    print("- Precision: {:.4f}".format(model_test_precision))
    print("- Recall: {:.4f}".format(model_test_recall))
    print("- ROC AUC Score: {:.4f}".format(model_test_rocauc_score))

    print('='*35)
    print('\n')


In [None]:
## Plotting the ROC AUC Curve

from sklearn.metrics import roc_auc_score, roc_curve

## ADD the models to the list that you want to view on the ROC plot

auc_models = [
    {
        'label': 'Random Forest Classifier',
        'model' : RandomForestClassifier(n_estimators=500, min_samples_split=2,
                                            max_features=8, max_depth=None),
        'auc': 0.8398


    }
]

## Loop through all models

for algo in auc_models:
    model = algo['model'] ## select the model
    model.fit(X_train, y_train) ## train the model
    ## Compute false positive rate and true positive rate
    fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
    ## Label the curve with the model name and the AUC value stored in auc_models

    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (algo['label'], algo['auc']))



## Custom settings for the plot

plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity (False Positive Rate)')
plt.ylabel('Sensitivity(True Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.savefig("auc.png")
plt.show()
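
The plot above takes its AUC value from a number typed into `auc_models`. One alternative, sketched here under the assumption that the area should match the plotted curve, is to compute it from the curve's own points with `sklearn.metrics.auc`. The labels and scores below are invented, and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and positive-class scores
toy_y_true = np.array([0, 0, 1, 1])
toy_y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve points, then the area under them via the trapezoidal rule
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_y_true, toy_y_score)
toy_area = auc(toy_fpr, toy_tpr)

# Legend label in the same format as the plot above
toy_label = "%s ROC (area = %0.2f)" % ("Toy model", toy_area)
print(toy_label)
```

Computing the area this way keeps the legend consistent with the curve whenever the model or the data changes.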
 
