# Travel With Us - Project

## Problem Definition

- 'Visit With Us' is a tourism company that currently offers five types of packages to its customers, namely Basic, Standard, Deluxe, Super Deluxe and King. The policy makers of the company want to establish a viable business model to expand the customer base by introducing a new travel package, the 'Wellness Tourism Package'.

## Objective

- To build and compare Ensemble Models using the data of the existing customers
- Use the model to target potential customers (including new customers) who are more likely to purchase the new package

## Contents of the dataset

### Customer details

- CustomerID: Unique customer ID
- ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
- Age: Age of customer
- TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
- CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3
- Occupation: Occupation of customer
- Gender: Gender of customer
- NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
- PreferredPropertyStar: Preferred hotel property rating by customer
- MaritalStatus: Marital status of customer
- NumberOfTrips: Average number of trips in a year by customer
- Passport: Whether the customer has a passport or not (0: No, 1: Yes)
- OwnCar: Whether the customer owns a car or not (0: No, 1: Yes)
- NumberOfChildrenVisiting: Total number of children under the age of 5 planning to take the trip with the customer
- Designation: Designation of the customer in the current organization
- MonthlyIncome: Gross monthly income of the customer

### Customer interaction data

- PitchSatisfactionScore: Sales pitch satisfaction score
- ProductPitched: Product pitched by the salesperson
- NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
- DurationOfPitch: Duration of the pitch by a salesperson to the customer

In [None]:
# Libraries required for data analysis and data visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)

sns.set(color_codes=True)  # adds background to the graph

In [None]:
# Libraries required for model building and performance evaluation
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)
import scipy.stats as stats

# Libraries required to build and tune Decision Tree and Bagging models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Libraries required to build Boosting and Stacking models
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from xgboost import XGBClassifier

In [None]:
# reading the excel dataset

travel_data = pd.read_excel(
   '../input/tourismpackage/Tourism.xlsx', sheet_name='Tourism'
)  # sheet name containing the actual data

In [None]:
# creating a copy of the dataset

data = travel_data.copy()

In [None]:
# viewing the first 10 observations of the dataset

data.head(10)

- CustomerID seems to be all unique
- NaN values present in Age
- 6 columns with categorical data
- ProdTaken, Passport and OwnCar have binary values (0 and 1)
- Remaining columns are either float or int

In [None]:
# viewing the shape of the dataset

data.shape

- There are a total of 4888 rows and 20 columns

In [None]:
# viewing the overall information of the dataset

data.info()

- There are 7 float type, 7 int type and 6 object type columns in the dataset
- Memory used is 763.9+ KB
- Missing values in Age, TypeofContact, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisiting and MonthlyIncome

In [None]:
# getting the total count of null values present in the dataset

data.isna().sum()

- There are missing values in 8 of the 20 columns
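
The raw counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a hypothetical helper named `missing_summary` and a small made-up frame (not the actual tourism data):

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (100 * counts / len(df)).round(2)}
    )
    # keep only the columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# toy frame for illustration only
toy = pd.DataFrame(
    {
        "Age": [41.0, np.nan, 37.0, np.nan],
        "MonthlyIncome": [20993.0, 20130.0, np.nan, 17909.0],
        "Passport": [1, 0, 1, 1],
    }
)
print(missing_summary(toy))
```

On the notebook's frame this would be called as `missing_summary(data)`.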

In [None]:
# checking for duplicated rows

data.duplicated().sum()

- no duplicated observations present in the dataset

### Fixing the data types

In [None]:
# selecting the object type columns

cols = data.select_dtypes("object")
cols.columns

In [None]:
# converting object datatype to category

for i in cols.columns:
    data[i] = data[i].astype("category")

In [None]:
# checking the datatypes

data.dtypes

- The datatypes are now fixed

In [None]:
# checking the statistical summary of the dataset

data.describe().T

- CustomerID is an ID variable and is not useful for predictive modelling
- Age of the customers ranges from 18 to 61 years and the average age is 37 years
- DurationOfPitch ranges from 5 to 127 minutes with an average of 15 minutes, which indicates skewness or the presence of outliers
- NumberOfPersonVisiting ranges from 1 to 5 and the average is 2 persons
- NumberOfFollowups ranges from 1 to 6 with an average of 3
- The average PreferredPropertyStar is 3, whereas the maximum is 5
- The average NumberOfTrips in a year by customer is 3, whereas the maximum is 22
- The average PitchSatisfactionScore is 3, whereas it ranges from 1 to 5
- NumberOfChildrenVisiting is anywhere between 0 and 3 and the average is 1
- MonthlyIncome of customers ranges from Rs.1,000 to Rs.98,000, with an average of Rs.23,000. The difference between the 75th percentile and the maximum indicates skewness in the data
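
Skewness suggested by a gap between the upper quartile and the maximum can be checked numerically. A minimal sketch on made-up income values (illustration only, not the dataset): a mean well above the median and a positive `skew()` both point to a long right tail.

```python
import pandas as pd

# made-up monthly incomes: most values are close together, one is very large
toy_income = pd.Series(
    [17000, 20000, 21000, 22000, 23000, 25000, 26000, 98000], name="MonthlyIncome"
)

print("mean:  ", toy_income.mean())
print("median:", toy_income.median())
print("skew:  ", round(toy_income.skew(), 2))  # a value above 0 means a longer right tail
```

On the notebook's frame, `data["MonthlyIncome"].skew()` would give the corresponding figure for the real column.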

In [None]:
# summary of categorical columns in the dataset

data.describe(include="category").T

- Need to check the values in Gender and MaritalStatus for any data entry errors
- Must look at the unique values to get a better idea of the data
- The most frequent category values include Self Enquiry, Salaried and Male
- The 'Basic' package is the most frequent product, perhaps because most customers are salaried
- Married customers are the largest group, perhaps because they tend to travel as a family rather than alone
- The most common designation of customers is 'Executive'

In [None]:
data.nunique()

- `nunique()` gives the number of unique values in each column
- CustomerID is all unique i.e. it's just a unique identification number of the customer and can be dropped while model building
- There are 3 unique values in Gender and 4 unique values in MaritalStatus, must further check what the values are

In [None]:
# making a list of the categorical values in the dataset

cols_cat = data.select_dtypes("category")
cols_cat.columns

In [None]:
# viewing the unique values in the categorical columns

for i in cols_cat.columns:
    print("\nUnique values in", i, "are :")
    print(data[i].value_counts())
    print("\n")

- Type of contact is either Self Enquiry or Company Invited
- Salaried customers and Small Business owners outnumber Large Business owners. There are only 2 Free Lancers.
- The Gender column appears to have a data entry error that needs to be fixed
- There are 5 different product packages as mentioned in the data background
- Marital status includes 4 unique values, wherein unmarried customers may potentially have a partner to travel along
- There are more Executive and Manager level customers than Senior Managers, AVPs and VPs

In [None]:
# Fixing data entry error in Gender

data["Gender"] = data["Gender"].replace("Fe Male", "Female")

In [None]:
data["Gender"].value_counts()

- The data entry error in Gender column is now fixed
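
Depending on the pandas version, calling `replace` on a category column may emit a warning or leave the old label among the categories. A more defensive sketch, assuming a hypothetical helper named `merge_category_label`, converts to plain objects first and rebuilds the category afterwards:

```python
import pandas as pd


def merge_category_label(series, wrong, right):
    """Replace a mistyped label and rebuild the category so the old label disappears."""
    # work on plain Python objects so the replacement does not depend on category handling
    cleaned = series.astype(object).replace(wrong, right)
    return cleaned.astype("category")


# toy column for illustration only
toy_gender = pd.Series(["Male", "Fe Male", "Female", "Male"], dtype="category")
fixed = merge_category_label(toy_gender, "Fe Male", "Female")
print(fixed.value_counts())
```

In the notebook this would correspond to `data["Gender"] = merge_category_label(data["Gender"], "Fe Male", "Female")`.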

# Exploratory Data Analysis

# Univariate Analysis

In [None]:
# Creating an array of color codes to use in this project

colors = ["#4178FB", "#4DE0FA", "#7DFFC6"]

# Setting custom color palette

sns.set_palette(sns.color_palette(colors))

In [None]:
# Defining a method to print the percentage of data points in the plot


def perc_on_bar(plot, feature):
    """
    plot : Axes object returned by the plotting call (e.g. sns.countplot)
    feature : categorical feature
    the function won't work if a column is passed in hue parameter

    """
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.06  # x position of the label, near the bar centre
        y = p.get_y() + p.get_height()  # y position of the label, at the top of the bar
        plot.annotate(
            percentage,
            (x, y),
            # ha="center",
            # va="center",
            size=12,
            # xytext=(0, 3),
            # textcoords="offset points",
        )  # annotate the percentage
    plt.show()  # show the plot

In [None]:
# Defining a method to plot histogram and boxplot combined in a single plot


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data : dataframe
    feature : dataframe column
    figsize : size of figure (default (12,7))
    kde : whether to show the density curve (default False)
    bins : number of bins (default None / auto)

    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid=2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle marker will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with automatic bins
    ax_hist2.axvline(
        data[feature].mean(), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

### 1. Product Taken

In [None]:
plt.figure(figsize=(5, 5))
ax = sns.countplot(x=data["ProdTaken"])
perc_on_bar(ax, data["ProdTaken"])

- Only 18.8% of the customers have taken the product as opposed to 81.2% of customers who did not take the product. Hence it is an imbalanced classification problem
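
With a minority class this small, a random train/test split can shift the class ratio. A minimal sketch of a stratified split, assuming a 70/30 split and `random_state=1` (both arbitrary choices for illustration) and using only the class counts, 920 purchasers out of 4888 customers, with a placeholder feature:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 920 purchasers out of 4888 customers (578 male + 342 female buyers in the bivariate analysis)
y = np.array([1] * 920 + [0] * 3968)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, for illustration only

# stratify=y keeps the share of purchasers (about 18.8%) the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(f"overall: {y.mean():.3f}  train: {y_train.mean():.3f}  test: {y_test.mean():.3f}")
```

Metrics such as recall and F1, which are already imported above, are usually more informative than accuracy in this setting.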

### 2. Age

In [None]:
histogram_boxplot(data, "Age")

- Age is approximately normally distributed, with a slight right skew
- No outliers spotted

### 3. Type of Contact

In [None]:
plt.figure(figsize=(5, 5))
ax = sns.countplot(x=data["TypeofContact"])
perc_on_bar(ax, data["TypeofContact"])

- 70.5% of customers made a self-enquiry, as opposed to 29% who were invited by the company

### 4. City Tier

In [None]:
plt.figure(figsize=(7, 5))
ax = sns.countplot(x=data["CityTier"])
perc_on_bar(ax, data["CityTier"])

- The largest number of customers are from Tier 1 cities
- Followed by Tier 3 and then Tier 2
- The distribution of the values suggests City Tier can be treated as a 'category'
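
Since the data dictionary describes the tiers as ordered (Tier 1 > Tier 2 > Tier 3), an ordered categorical keeps that ranking available. A minimal sketch, assuming the column holds the integers 1, 2 and 3 and using a made-up series:

```python
import pandas as pd

# toy values for illustration only
toy_tier = pd.Series([1, 3, 2, 1, 3, 1], name="CityTier")

# categories are listed from lowest to highest rank, so Tier 3 < Tier 2 < Tier 1
tier_order = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
tier = toy_tier.astype(tier_order)

print(tier.min(), tier.max())  # lowest-ranked and highest-ranked tier present
print((tier > 2).sum())        # number of rows ranked above Tier 2, i.e. Tier 1
```

Whether to keep the order or treat the column as an unordered category depends on the model; tree-based models generally cope with either encoding.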

### 5. Duration of Pitch

In [None]:
histogram_boxplot(data, "DurationOfPitch")

- Mean is around 15
- Extreme outliers above 125

### 6. Occupation

In [None]:
plt.figure(figsize=(7, 5))
ax = sns.countplot(x=data["Occupation"])
perc_on_bar(ax, data["Occupation"])

- As seen in the data summary, Salaried customers are the largest group, followed by Small business owners
- Large business owners make up only about 9% of customers
- There are only 2 Free Lancers, as seen earlier in the data analysis

### 7. Gender

In [None]:
plt.figure(figsize=(5, 5))
ax = sns.countplot(x=data["Gender"])
perc_on_bar(ax, data["Gender"])

- Male customers outnumber Female customers

### 8. Number of Person Visiting

In [None]:
plt.figure(figsize=(9, 5))
ax = sns.countplot(x=data["NumberOfPersonVisiting"])
perc_on_bar(ax, data["NumberOfPersonVisiting"])

- 49.1% of customers have 3 persons visiting, which could be a family with one child
- 29% have 2 persons visiting, followed by 21% with 4 persons visiting

### 9. Number of Followups

In [None]:
histogram_boxplot(data, "NumberOfFollowups")

- Average number of follow-ups is 3
- The most common number of follow-ups is 4
- A few outliers at 1 and 6

### 10. Product Pitched

In [None]:
plt.figure(figsize=(9, 5))
ax = sns.countplot(x=data["ProductPitched"])
perc_on_bar(ax, data["ProductPitched"])

- The 'Basic' package was pitched most often (37.7% of pitches), followed by the 'Deluxe' package with 35%
- The 'Standard' package was pitched 15% of the time, 'Super Deluxe' 7% and the 'King' package the least at 4.7%
- The 'King' package, as the name implies, may be the most expensive package offered by the company
- As most of the customers are either salaried or small business owners, it is possible that the 'Basic' and 'Deluxe' packages are pitched to them depending on their income

### 11. Preferred Property Star

In [None]:
plt.figure(figsize=(7, 5))
ax = sns.countplot(x=data["PreferredPropertyStar"])
perc_on_bar(ax, data["PreferredPropertyStar"])

- Customers generally prefer hotels rated 3 stars and above
- 3 star ratings are the most common at 61%, as opposed to 4 star ratings at 18.7% and 5 star ratings at 19.6%
- This may be because 4 star and 5 star rated hotels are much more expensive than 3 star rated hotels

### 12. Marital Status

In [None]:
plt.figure(figsize=(8, 5))
ax = sns.countplot(x=data["MaritalStatus"])
perc_on_bar(ax, data["MaritalStatus"])

- Married customers are the largest group at 47.9%
- Customers with family tend to travel more on a vacation in comparison with divorced, single and unmarried customers

### 13. Number of Trips

In [None]:
histogram_boxplot(data, "NumberOfTrips")

- The most common number of trips in a year by a customer is 2
- Average number of trips in a year by a customer is 3
- Extreme outliers spotted at around 20 trips. These may be customers who travel for business purposes.

### 14. Passport

In [None]:
plt.figure(figsize=(5, 5))
ax = sns.countplot(x=data["Passport"])
perc_on_bar(ax, data["Passport"])

- 70.9% of customers do not own a passport, as opposed to 29.1% who do
- This could also be a reason why a large number of customers are pitched the 'Basic' package by the salesperson

### 15. Pitch Satisfaction Score

In [None]:
histogram_boxplot(data, "PitchSatisfactionScore")

- Pitch satisfaction score ranges from 1 to 5
- Mean and median are around 3

### 16. Own Car

In [None]:
plt.figure(figsize=(5, 5))
ax = sns.countplot(x=data["OwnCar"])
perc_on_bar(ax, data["OwnCar"])

- 62% of customers own a car as opposed to 38% of customers who do not own a car

### 17. Number of Children Visiting

In [None]:
plt.figure(figsize=(8, 5))
ax = sns.countplot(x=data["NumberOfChildrenVisiting"])
perc_on_bar(ax, data["NumberOfChildrenVisiting"])

- 42.6% of the time, one child is travelling with the customer
- 27.3% of the time, two children are travelling with the customer
- 22.1% of the time, the customers travel without any kids

### 18. Designation

In [None]:
plt.figure(figsize=(9, 5))
ax = sns.countplot(x=data["Designation"])
perc_on_bar(ax, data["Designation"])

- Executives form the largest group at 37.7% of customers, followed by Managers at 35.4%
- 15.2% of customers are Senior Managers
- 7% AVPs and 4.7% VPs

### 19. Monthly Income

In [None]:
histogram_boxplot(data, "MonthlyIncome")

- Extreme outliers spotted around Rs.1000 and Rs.1 Lakh
- Mean is around Rs.23000

# Bivariate Analysis

In [None]:
# User defined function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """Print the category counts and plot a stacked bar chart
    data : dataframe
    predictor : independent variable
    target : target variable"""

    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 3, 5))
    # place the legend outside the plot area, to the right of the bars
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.xticks(rotation=0)
    plt.show()

In [None]:
data1 = data.drop("CustomerID", axis=1)
plt.figure(figsize=(14, 7))
# restrict to numeric columns; category columns cannot be correlated directly
sns.heatmap(
    data1.select_dtypes("number").corr(),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)
plt.show()

- Number of persons visiting is positively correlated with the number of children visiting
- Monthly income is positively correlated with Age. As age increases, experience increases and hence income is higher
- Product taken is moderately correlated with Passport, Number of followups and Preferred property star
- Product taken is negatively correlated with Age and Monthly income
- Monthly income, Number of trips and Number of followups are moderately correlated with Number of persons visiting
- Monthly income is moderately correlated with Number of children visiting
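
Reading the target's row off a large heatmap is error-prone, so it can help to list the correlations with `ProdTaken` directly, ordered by strength. A minimal sketch, assuming a hypothetical helper named `correlation_with_target` and a made-up frame:

```python
import pandas as pd


def correlation_with_target(df, target):
    """Correlation of each numeric column with the target, strongest first."""
    numeric = df.select_dtypes("number")  # non-numeric columns are left out
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)


# toy frame for illustration only
toy = pd.DataFrame(
    {
        "ProdTaken": [0, 0, 1, 1, 0, 1],
        "Passport": [0, 0, 1, 1, 0, 1],
        "Age": [50, 45, 30, 28, 33, 41],
        "Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
    }
)
print(correlation_with_target(toy, "ProdTaken"))
```

On the notebook's frame the call would be `correlation_with_target(data1, "ProdTaken")`.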

In [None]:
sns.pairplot(data1, hue="ProdTaken")
plt.show()

- Varying distributions of Product taken across different features are visible, which needs further analysis

In [None]:
data.info()

### Product taken vs Customer information data

In [None]:
cust_info_cols1 = data[
    [
        "Age",
        "NumberOfPersonVisiting",
        "NumberOfChildrenVisiting",
        "MonthlyIncome",
        "NumberOfTrips",
    ]
]

plt.figure(figsize=(15, 8))

for i, variable in enumerate(cust_info_cols1):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(x=data["ProdTaken"], y=data[variable])
    plt.tight_layout()
    plt.title(variable)
plt.show()

- Customers in the age range 28 to 42 seem to have purchased the travel package in larger numbers
- Most customers who purchased a travel package travelled in a group of 2 to 3, with an upper whisker at 4
- The boxplot indicates that customers who purchased the package travel between 2 and 4 times per year
- Most customers are accompanied by 1 to 2 kids
- Customers with a monthly income of about 18K and above seem to show more interest in purchasing travel packages

In [None]:
stacked_barplot(data, "Gender", "ProdTaken")

- 578 out of 2916 Male customers have purchased a travel package
- 342 out of 1972 Female customers have purchased a travel package
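
The raw counts are easier to compare as purchase rates. A small worked calculation using only the figures quoted above:

```python
# (purchasers, total customers) per gender, taken from the crosstab above
purchases = {"Male": (578, 2916), "Female": (342, 1972)}

rates = {gender: bought / total for gender, (bought, total) in purchases.items()}
for gender, rate in rates.items():
    print(f"{gender}: {rate:.1%}")  # Male: 19.8%, Female: 17.3%

overall = sum(b for b, _ in purchases.values()) / sum(t for _, t in purchases.values())
print(f"Overall: {overall:.1%}")  # Overall: 18.8%
```

The overall rate of 920 out of 4888 matches the 18.8% seen in the univariate analysis of ProdTaken, and the gap between the two genders is only about 2.5 percentage points.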

In [None]:
stacked_barplot(data, "MaritalStatus", "ProdTaken")

- The highest percentage of customers who actually purchased a package are single
- Second highest are unmarried customers
- Lastly are the married and divorced customers

In [None]:
stacked_barplot(data, "Occupation", "ProdTaken")

- From the previous analysis, it was seen that Free Lancer customers are too low in number. From this plot it is evident that there are only 2 Free Lancer customers and that they have both bought a travel package.
- Customers who are large business owners have purchased highest number of packages
- Followed equally by small business and salaried customers

In [None]:
stacked_barplot(data, "Designation", "ProdTaken")

- Customers who are Executives seem to have taken the package in highest number followed by Senior Managers and Managers
- VP and AVP are considerably low in number

In [None]:
stacked_barplot(data, "CityTier", "ProdTaken")

- A higher proportion of customers from Tier 2 and Tier 3 cities purchased a travel package than customers from Tier 1

In [None]:
stacked_barplot(data, "OwnCar", "ProdTaken")

- The percentage of customers who purchased a package is almost the same whether or not they own a car 

In [None]:
stacked_barplot(data, "Passport", "ProdTaken")

- Customers who own a passport seem to show more interest in purchasing the travel package than those who do not own one

In [None]:
stacked_barplot(data, "PreferredPropertyStar", "ProdTaken")

- Customers prefer a PropertyStar of 3 and above, with the highest percentage of purchasing customers opting for 5 star

### Product taken vs Customer interaction data

In [None]:
pitch_cols = data[["DurationOfPitch", "PitchSatisfactionScore", "NumberOfFollowups"]]

plt.figure(figsize=(15, 5))

for i, variable in enumerate(pitch_cols):
    plt.subplot(1, 3, i + 1)
    sns.boxplot(x=data["ProdTaken"], y=data[variable])
    plt.tight_layout()
    plt.title(variable)
plt.show()

- The longer the pitch, the more likely the customer is to purchase the product
- Pitch satisfaction score does not seem to impact whether the customer actually purchases the product
- Interestingly, the higher the number of follow-ups, the greater the chances of the customer purchasing the product

In [None]:
stacked_barplot(data, "TypeofContact", "ProdTaken")

- The proportion of customers purchasing a product is almost the same across the different types of contact

In [None]:
stacked_barplot(data, "ProductPitched", "ProdTaken")

- If the product pitched is 'Basic', the customer is most likely to purchase it, followed by the 'Standard' package.
- This may be because these two packages are comparatively less expensive.

### Product taken & Product Pitched vs Age, Number of Person Visiting, Number of Children Visiting

In [None]:
col1 = data[["Age", "NumberOfPersonVisiting", "NumberOfChildrenVisiting"]]

plt.figure(figsize=(12, 8))

for i, variable in enumerate(col1):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x=data["ProdTaken"], y=data[variable], hue=data["ProductPitched"])
    plt.tight_layout()
    plt.title(variable)
    plt.legend(bbox_to_anchor=(1, 1))
plt.show()

#### Observations

**Age** of the customer w.r.t. package purchased :
- `Basic` - Min 18 and max 50 with IQR between 25 and 35 years of age with outliers until around 60 years
- `Deluxe`- Min 21 and max 59 years ; IQR between 32 to 43 years of age
- `King` - Min 27 to max 59 years ; IQR between 42 to 58 years
- `Standard` - Min 19 to max 61 years ; IQR between 42 to 57 years
- `Super Deluxe` - Min 39 to max 47 years ; IQR between 40 and 45 years with outliers around 55 years

Overall, `Basic` and `Deluxe` packages are preferred by younger customers in comparison to other packages. The `King` package is highly preferred by customers above 40 years, the `Standard` package by customers above 33 years and `Super Deluxe` by customers in their 40s.

Average **Number of person visiting** with customers who purchased travel packages is between 2 and 3 and it is common across all the different packages available.

**Number of children visiting** is same for customers with `Basic`,`Deluxe`,`King` and `Super Deluxe` packages.

### Product taken & Product Pitched vs Monthly Income, Number of Trips

In [None]:
col2 = data[["MonthlyIncome", "NumberOfTrips"]]

plt.figure(figsize=(10, 5))

for i, variable in enumerate(col2):
    plt.subplot(1, 2, i + 1)
    sns.boxplot(x=data["ProdTaken"], y=data[variable], hue=data["ProductPitched"])
    plt.tight_layout()
    plt.title(variable)
    plt.legend(bbox_to_anchor=(1, 1))
plt.show()

#### Observations

**Monthly Income** of customers across different packages :
- `Basic` - Min 17K to max 36K with outliers until 39K 
- `Deluxe` - Min 18K to max 37K with outliers until 40K
- `King` - Min 35K to max 39K with outliers around 20K
- `Standard` - Min 18K to max 38K with outliers until 40K
- `Super Deluxe` - Min 28K to max 38.5K with outliers around 21K

Overall, customers with an average monthly income around 22K prefer either `Basic` or `Deluxe` packages ; `King` is preferred by customers with an average monthly income above 35K and `Standard` and `Super Deluxe` packages are preferred by customers with an average monthly income between 22K to 35K.
Hence, we can infer that `Basic` and `Deluxe` packages are very affordable, as they are the basic packages, whereas `King` is the most expensive package of all, followed by `Super Deluxe` and `Standard`.

**Number of trips** per year made by customers who have purchased the travel package is :
- For `Basic`,`King` and `Standard` package customers, it is min 1 and max 7 with IQR between 3 and 4 
- `Deluxe` - Min 1 to 8 with an IQR between 3 and 5
- `Super Deluxe` - 1 to 7 with IQR 1 to 6

### Product taken & Product Pitched vs Customer Interaction Data

In [None]:
pitch_cols = data[
    ["DurationOfPitch", "NumberOfFollowups", "PitchSatisfactionScore"]
].columns.tolist()

plt.figure(figsize=(15, 8))

for i, variable in enumerate(pitch_cols):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x=data["ProdTaken"], y=data[variable], hue=data["ProductPitched"])
    plt.tight_layout()
    plt.title(variable)
    plt.legend(bbox_to_anchor=(1, 1))
plt.show()

#### Observations

**Duration of Pitch** to convince a customer to purchase different packages :
- `Basic`- Min 6 mins to max 38 mins ; IQR between 10 to 22 mins
- `Deluxe` - Min 6 to max 38 mins ; IQR between 11 to 35 mins
- `King` - Around 8-9 mins ; outliers around 29 mins
- `Standard` - Min 6 to max 38 mins ; IQR between 11 to 36 mins
- `Super Deluxe` - Min 8 to max 21 mins ; IQR between 15 to 20 mins ; outliers around 30 mins

Overall, more duration is required to convince a customer to purchase `Deluxe` and `Standard` packages 

**Number of Followups** required to convince a customer to purchase different packages :
- `Basic` - Min 1 to max 6 and IQR between 3 to 5 followups
- `Deluxe` - Same pattern followed for Basic package
- `King` - Min 3 to max 6 and IQR between 3 to 5 followups
- `Standard` - Min 2 to max 6 and IQR between 3 to 4.25 followups ; outliers around 1
- `Super Deluxe` - Min 1 to max 6 and IQR between 2 to 4 followups

Overall, the more follow-ups there are, the greater the chances that a customer will purchase the travel package

**Pitch satisfaction score** acquired across different packages where the customer actually purchased the package :
- `Basic` & `Deluxe` - Min 1 to max 5 ; IQR between 2 to 4
- `King` - Min 2 to max 5 ; IQR between 3 and 4 ; outliers around 1
- `Standard` - Ranges from 1 ; IQR between 3 and 5 
- `Super Deluxe` - IQR between 3 and 5

The pitch satisfaction scores for `Basic` and `Deluxe` packages are the same irrespective of whether the customer purchased the package. Maybe the marketing team has to work on it.
Overall, for `King`, `Standard` and `Super Deluxe` packages, the customers who purchased the packages have given a better satisfaction score than the customers who have not purchased the package.

# Building Customer Profile for different packages

In [None]:
# creating a new dataframe df where the package was actually purchased by the customer
df = data[data["ProdTaken"] == 1]
df.head()

## Grouping data w.r.t different travel packages to build customer profile

### Listing the statistical summary w.r.t. the package taken

In [None]:
df[df["ProductPitched"] == "Basic"].describe().T

In [None]:
df[df["ProductPitched"] == "Standard"].describe().T

In [None]:
df[df["ProductPitched"] == "Deluxe"].describe().T

In [None]:
df[df["ProductPitched"] == "Super Deluxe"].describe().T

In [None]:
df[df["ProductPitched"] == "King"].describe().T
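
The five per-package summaries can also be placed side by side with a single `groupby`. A minimal sketch, assuming a hypothetical helper named `profile_by_package` and a made-up frame of purchasers:

```python
import pandas as pd


def profile_by_package(purchasers, columns, package_col="ProductPitched"):
    """Mean and median of the chosen numeric columns for each package."""
    grouped = purchasers.groupby(package_col, observed=True)
    return grouped[columns].agg(["mean", "median"])


# toy purchasers for illustration only
toy = pd.DataFrame(
    {
        "ProductPitched": ["Basic", "Basic", "King", "King", "Deluxe"],
        "Age": [25, 31, 45, 55, 38],
        "MonthlyIncome": [18000, 22000, 34000, 36000, 23000],
    }
)
summary = profile_by_package(toy, ["Age", "MonthlyIncome"])
print(summary)
```

In the notebook this would be called as `profile_by_package(df, ["Age", "MonthlyIncome", "NumberOfTrips"])`, where `df` holds the customers with `ProdTaken == 1`.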

### Package vs Categorical data

In [None]:
stacked_barplot(df, "ProductPitched", "Gender")

- On an average Male customers have largely purchased `Basic`,`Standard`,`Deluxe` and `Super Deluxe` packages
- Comparatively Female customers have shown much interest in `King` package

In [None]:
stacked_barplot(df, "ProductPitched", "MaritalStatus")

- Married customers highly prefer `Standard` and `Deluxe` packages
- Single customers highly prefer `Basic`, `Super Deluxe` and `King` packages
- Unmarried customers generally prefer `Basic` and `Deluxe` packages
- Divorced customers generally prefer the `Basic` package

In [None]:
stacked_barplot(df, "ProductPitched", "Occupation")

- `Basic` package is largely preferred by Free lancer and Large business customers
- `Deluxe`, `King` and `Standard` packages are highly purchased by Small business customers
- `Super Deluxe` package is preferred by Salaried customers

In [None]:
stacked_barplot(df, "ProductPitched", "Designation")

- Customers who purchased the travel package `King` are mostly VPs,`Basic` are mostly Executives, `Deluxe` are Managers, `Standard` are Senior Managers and `Super Deluxe` are AVPs.
- The price range of the packages offered from low to high are namely `Basic, Deluxe, Standard, Super Deluxe and King` that can be inferred from the designation of the customers w.r.t. their income

In [None]:
stacked_barplot(df, "ProductPitched", "CityTier")

- Customers who purchased `Basic` and `King` packages are mostly from Tier 1 cities
- Customers who purchased `Standard, Deluxe and Super Deluxe` packages are mostly from Tier 3 cities
- Customers from Tier 2 cities generally prefer the `Basic` package

In [None]:
stacked_barplot(df, "ProductPitched", "TypeofContact")

- `Basic, Deluxe and Standard` customers mostly came through their own interest (Self enquiry)
- `Super Deluxe` customers are mostly company invited
- Customers who purchased the `King` package are all Self-enquired

In [None]:
stacked_barplot(df, "ProductPitched", "Passport")

- 50% of the customers choosing `Basic, King and Super Deluxe` packages have passport
- A comparatively high proportion of the `Standard and Deluxe` customers do not own a passport

In [None]:
stacked_barplot(df, "ProductPitched", "OwnCar")

- All `Super Deluxe` customers own a car
- Most of the `King` customers own a car
- About half the `Basic and Deluxe` customers do not own a car 

# Customer Characteristics based on different travel packages

## BASIC package 

- `Age` - Early 20s to early 30s
- `Type of contact` - Self enquiry
- `City Tier` - Mostly from tier 1
- `Occupation` - Salaried
- `Designation` - Mostly preferred by Executives
- `Gender` - Preferred by Male customers
- `Marital status` - Single
- `Number of person visiting` - 2 to 3 on an average
- `Number of children visiting` - 1 kid on an average
- `Passport` - 50% of the customers have a passport
- `Own car` - 50% of the customers own a car
- `Preferred property star` - 3 star and above
- `Number of trips` - 3 trips per year on an average ; Min 1 to max 20 trips/year
- `Monthly Income` - An average of 20K per month

## STANDARD package 

- `Age` - Early 30s to late 40s
- `Type of contact` - Mostly self enquiry
- `City Tier` - Mostly from tier 3
- `Occupation` - Small business owners and Salaried customers
- `Designation` - Senior Managers
- `Gender` - Preferred by Male customers
- `Marital status` - Married
- `Number of person visiting` - 2 to 3 on an average
- `Number of children visiting` - 1 kid on an average
- `Passport` - Most of them do not have a passport
- `Own car` - 50% of the customers do not own a car
- `Preferred property star` - 3 star and above
- `Number of trips` - 3 trips per year on an average ; Min 1 to max 8 trips/year
- `Monthly Income` - An average of 26K per month

## DELUXE package 

- `Age` - Early 30s to late 50s
- `Type of contact` - Only Self enquiry
- `City Tier` - Mostly from tier 3
- `Occupation` - Small business
- `Designation` - Managers
- `Gender` - Preferred by Male customers
- `Marital status` - Married
- `Number of person visiting` - 3 on an average
- `Number of children visiting` - 1 kid on an average
- `Passport` - 50% of the customers do not have a passport
- `Own car` - 75% of the customers do own a car
- `Preferred property star` - 3 star and above
- `Number of trips` - 3 trips per year on an average ; Min 1 to max 8 trips/year
- `Monthly Income` - An average of 23K per month

## SUPER DELUXE package 

- `Age` - Early 30s to late 40s
- `Type of contact` - Company invited
- `City Tier` - Mostly from tier 3
- `Occupation` - Salaried customers
- `Designation` - AVP
- `Gender` - Preferred by Male customers
- `Marital status` - Single
- `Number of person visiting` - 2 to 3 on an average
- `Number of children visiting` - 1 kid on an average
- `Passport` - Most of the customers have a passport
- `Own car` - All customers own a car
- `Preferred property star` - 3 star and above
- `Number of trips` - 3 trips per year on an average ; Min 1 to max 8 trips/year
- `Monthly Income` - An average of 29K per month

## KING package 

- `Age` - Early 40s to late 50s
- `Type of contact` - Self enquiry
- `City Tier` - Mostly from tier 1
- `Occupation` - Small business
- `Designation` - VP
- `Gender` - Preferred by Female customers
- `Marital status` - Single
- `Number of person visiting` - 3 on an average
- `Number of children visiting` - 1 kid on an average
- `Passport` - Most of the customers have a passport
- `Own car` - Most of the customers do own a car
- `Preferred property star` - 4 star and above
- `Number of trips` - 2 trips per year on an average ; Min 1 to max 7 trips/year
- `Monthly Income` - An average of 34K per month

# Key Insights from EDA

Characteristics of customers who seem to be more interested in purchasing the travel package include :

- customers aged between the late 20s and early 40s
- customers who travel in a group of 2 to 4 which includes one child (on average)
- customers from Tier2 and Tier3 cities
- customers who own a passport
- customers who are mostly Executives and Managers
- customers who mostly bought the 'Basic' package after they were pitched
- Female customers with a higher designation seem to be interested in the 'King' package, whereas Male customers show interest in the 'Super Deluxe' package
- Single, unmarried and divorced customers strongly prefer the 'Basic' package, while married customers prefer comparatively more expensive packages
- On average, customers who received three or more follow-ups and a longer pitch are more likely to purchase the product
- gender, number of children visiting and owning a car do not appear to affect the chances of the product being purchased
- On average, customers who travel three or more times a year are more likely to purchase the product
- customers with an average monthly income between 20K and 25K opt for the Basic and Standard packages, those between 23K and 30K opt for the Deluxe and Super Deluxe packages, and those above 33K prefer the King package

# Missing Value Treatment

In [None]:
# checking again for missing values
data.isnull().sum()

In [None]:
# Checking the Age of a customer w.r.t their Designation, Gender and Marital Status

data.groupby(["Designation", "Gender", "MaritalStatus"])["Age"].median()

In [None]:
# Imputing the missing values in Age with the median values as shown in the above analysis

data["Age"] = data.groupby(["Designation", "Gender", "MaritalStatus"])["Age"].transform(
    lambda x: x.fillna(x.median())
)

In [None]:
# checking the highest occurring category in Type of contact

data["TypeofContact"].value_counts()

In [None]:
# Imputing the missing values in Type of Contact with 'Self Enquiry' as it has the highest frequency

data["TypeofContact"] = data["TypeofContact"].fillna("Self Enquiry")

In [None]:
# Checking the median value of Duration of Pitch w.r.t. Type of contact and the Product pitched

data.groupby(["TypeofContact", "ProductPitched"])["DurationOfPitch"].median()

In [None]:
# Imputing the missing values in Duration of Pitch with the median values from the above analysis

data["DurationOfPitch"] = data.groupby(["TypeofContact", "ProductPitched"])[
    "DurationOfPitch"
].transform(lambda x: x.fillna(x.median()))

In [None]:
# Checking the median value of Number of Followups w.r.t. the Product Taken and the Product Pitched

data.groupby(["ProdTaken", "ProductPitched"])["NumberOfFollowups"].median()

In [None]:
# Imputing the missing values in NumberOfFollowups with the median values from the above analysis

data["NumberOfFollowups"] = data.groupby(["ProdTaken", "ProductPitched"])[
    "NumberOfFollowups"
].transform(lambda x: x.fillna(x.median()))

In [None]:
# Checking the median value of Preferred Property Star w.r.t. the Gender and Designation of the customer

data.groupby(["Gender", "Designation"])["PreferredPropertyStar"].median()

In [None]:
# Imputing the missing values in PreferredPropertyStar with the median values from the above analysis

data["PreferredPropertyStar"] = data.groupby(["Gender", "Designation"])[
    "PreferredPropertyStar"
].transform(lambda x: x.fillna(x.median()))

In [None]:
# Checking the median value of Number Of Trips w.r.t. Gender, Marital Status and Designation

data.groupby(["Gender", "MaritalStatus", "Designation"])["NumberOfTrips"].median()

In [None]:
# Imputing the missing values in NumberOfTrips with the median values from the above analysis

data["NumberOfTrips"] = data.groupby(["Gender", "MaritalStatus", "Designation"])[
    "NumberOfTrips"
].transform(lambda x: x.fillna(x.median()))

In [None]:
# Checking the Number of Children Visiting w.r.t. Marital Status, Product Pitched and the Number of Person Visiting

data.groupby(["MaritalStatus", "ProductPitched", "NumberOfPersonVisiting"])[
    "NumberOfChildrenVisiting"
].median()

In [None]:
# Imputing the missing values in NumberOfChildrenVisiting with the median values from the above analysis

data["NumberOfChildrenVisiting"] = data.groupby(
    ["MaritalStatus", "ProductPitched", "NumberOfPersonVisiting"]
)["NumberOfChildrenVisiting"].transform(lambda x: x.fillna(x.median()))

In [None]:
# Checking the Monthly Income w.r.t. the Occupation, Designation and Gender of the customer

data.groupby(["Occupation", "Designation", "Gender"])["MonthlyIncome"].mean()

In [None]:
# Imputing the missing values in MonthlyIncome with the mean values from the above analysis

data["MonthlyIncome"] = data.groupby(["Occupation", "Designation", "Gender"])[
    "MonthlyIncome"
].transform(lambda x: x.fillna(x.mean()))

In [None]:
# Checking for the missing values again

data.isnull().sum()

- No missing values remain in the dataset
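
The group-wise imputation used in this section can be sketched on a toy frame. The frame, its values and the overall-median fallback are assumptions added for illustration; the cells above rely on every group having at least one observed value, which is why the final check matters.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the project dataset)
toy = pd.DataFrame(
    {
        "Designation": ["Executive", "Executive", "Executive", "Manager", "Manager", "VP"],
        "Age": [25.0, np.nan, 31.0, 40.0, np.nan, np.nan],
    }
)

# Step 1: fill each missing Age with the median of its Designation group
group_filled = toy.groupby("Designation")["Age"].transform(
    lambda x: x.fillna(x.median())
)

# Step 2 (assumed fallback): a group with no observed value has a NaN median,
# so its rows stay missing; fill those with the overall median instead
toy["Age_imputed"] = group_filled.fillna(toy["Age"].median())

print(toy)
```

In this sketch the single VP row is still missing after step 1, because its group has no observed age, and only the fallback in step 2 fills it.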

# Outlier Detection

In [None]:
data.info()

In [None]:
# creating a list of numerical columns

num_col = ["DurationOfPitch", "NumberOfFollowups", "NumberOfTrips", "MonthlyIncome"]

In [None]:
# Boxplots of numerical columns to view the outliers

plt.figure(figsize=(20, 7))

for i, variable in enumerate(num_col):
    plt.subplot(1, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

# Outlier Treatment

- As seen in the EDA, the columns DurationOfPitch, NumberOfFollowups, NumberOfTrips and MonthlyIncome have outliers. However, I choose not to treat them: real-life data will contain outliers, and I want the model to learn the variations in the data distribution. Also, the tree-based **Bagging and Boosting** algorithms used here split on thresholds, so they are relatively insensitive to extreme feature values.
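
The boxplots above use `whis=1.5`, which corresponds to the usual 1.5 x IQR rule. A minimal sketch of counting the points that fall outside the whiskers is shown below; the helper name `count_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, whis: float = 1.5) -> int:
    """Count values outside [Q1 - whis * IQR, Q3 + whis * IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((series < lower) | (series > upper)).sum())


# Hypothetical monthly incomes in thousands; 98 is an obvious extreme value
incomes = pd.Series([18, 20, 21, 22, 23, 24, 25, 26, 98])
print(count_iqr_outliers(incomes))
```

Applied per column, for example `data[num_col].apply(count_iqr_outliers)`, this would give a numeric counterpart to the plots.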

# Model Building

## Data Preparation

- Since the objective is to build models on data of the existing customers which can be used to target new customers, we can drop the customer interaction data from the dataset as those features will not be available for new customers.
- Also, CustomerID is only an identifier and will not help in model building, so it is dropped as well.

In [None]:
data.drop(
    [
        "CustomerID",
        "DurationOfPitch",
        "NumberOfFollowups",
        "ProductPitched",
        "PitchSatisfactionScore",
    ],
    axis=1,
    inplace=True,
)

In [None]:
# Checking the overall information of the dataset
data.info()

In [None]:
# As seen in the EDA, City Tier takes only a few distinct values, so it is converted to a categorical column

data["CityTier"] = data["CityTier"].astype("category")

In [None]:
# checking the dataset again
data.info()

- The dataset now consists of 6 category, 5 float and 4 integer type columns with no null values. The dataset is now ready for model building
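
Converting `CityTier` to a category matters for the encoding step that follows, because `pd.get_dummies` by default only expands object, string and category columns. A small sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame; CityTier is stored as integers here
toy = pd.DataFrame({"CityTier": [1, 2, 3], "Age": [30, 41, 52]})

# Integer column: get_dummies leaves it untouched
as_int = pd.get_dummies(toy, drop_first=True)

# Category column: get_dummies expands it and drops the first level (tier 1)
toy_cat = toy.assign(CityTier=toy["CityTier"].astype("category"))
as_cat = pd.get_dummies(toy_cat, drop_first=True)

print(list(as_int.columns))
print(list(as_cat.columns))
```

The second call produces the `CityTier_2` and `CityTier_3` columns, while the first keeps a single numeric `CityTier` column.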

## Split Data

In [None]:
X = data.drop(["ProdTaken"], axis=1)
y = data["ProdTaken"]

# creating dummy variables for categorical features
X = pd.get_dummies(X, drop_first=True)

# splitting the data into train and test sets
# using 'stratify' parameter as the distribution of the target classes is imbalanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

In [None]:
print("Shape of Training set :", X_train.shape)
print("Shape of test set :", X_test.shape)
print("\n Percentage of classes in training set :")
print(y_train.value_counts(normalize=True))
print("\n Percentage of classes in test set :")
print(y_test.value_counts(normalize=True))
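
The effect of `stratify` can be checked on a small synthetic target. The 20% positive share below is a made-up figure for illustration, not the project's actual class balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20 positives out of 100 rows
y_toy = pd.Series([1] * 20 + [0] * 80)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# With stratify, both subsets keep the positive share of the full data
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive share in a small test set can drift away from the overall share, which makes Recall on the test set harder to compare across models.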

## Model evaluation criterion

#### Model can make wrong predictions such as :

1. Predicting a customer will purchase a travel package but in reality the customer does not purchase one.
2. Predicting a customer will not purchase a travel package but in reality the customer will purchase the travel package.

#### Prediction of concern :

The second type of error is our major concern, as the 'Visit With Us' travel company plans to launch a new tourism package and wants to harness the available data to make the marketing expenditure more efficient. In order to do so, errors of the second type (i.e. False Negatives) have to be considerably reduced.

#### How to reduce False Negatives :

Recall should be maximized. The higher the Recall, the larger the share of actual buyers that the model identifies as potential customers for the new travel package.
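
As a worked example of the metric, the sketch below derives Recall and Precision from the four confusion-matrix counts. The labels are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = purchased, 0 = did not purchase
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)  # share of actual buyers the model finds
precision = tp / (tp + fp)  # share of predicted buyers who actually buy

print(recall, precision)
```

Here one of the four buyers is missed (a False Negative), so Recall is 3 / 4, while two non-buyers are flagged (False Positives), so Precision is 3 / 5.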

In [None]:
# defining a function to plot the confusion matrix to visualize the model performance


def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn


def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf

# Model Building - Bagging 

## Decision Tree

####  with default parameters

In [None]:
dtree = DecisionTreeClassifier(random_state=1)
dtree.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(dtree, X_test, y_test)

In [None]:
dtree_train_perf = model_performance_classification(dtree, X_train, y_train)
dtree_train_perf

In [None]:
dtree_test_perf = model_performance_classification(dtree, X_test, y_test)
dtree_test_perf

- The decision tree is fully grown, hence overfitting

#### Hyperparameter tuning

In [None]:
# Choose the type of classifier.
dtree_tuned = DecisionTreeClassifier(class_weight={0: 0.18, 1: 0.82}, random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": [1, 4, 7, 15],
    "min_samples_leaf": [2, 3, 5],
    "max_leaf_nodes": [5, 7, 10, 15],
}

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring="recall")
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
dtree_tuned.fit(X_train, y_train)
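
A note on the last step: with the default `refit=True`, `GridSearchCV` already refits the best parameter combination on the full training data, so the extra `fit` call is not strictly needed. The sketch below illustrates this on synthetic data; the data and the small grid are assumptions standing in for `X_train`, `y_train` and the grid above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for X_train / y_train (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=1
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 4], "min_samples_leaf": [2, 5]},
    scoring="recall",
    cv=5,
)
search.fit(X_demo, y_demo)

# best_estimator_ is already fitted on all of X_demo because refit=True by default
best_tree = search.best_estimator_
preds_before = best_tree.predict(X_demo)

# Fitting again is harmless and, with a fixed random_state, changes nothing
preds_after = best_tree.fit(X_demo, y_demo).predict(X_demo)

print(search.best_params_, round(search.best_score_, 3))
print((preds_before == preds_after).all())
```

`best_params_` and `best_score_` are also worth printing in the notebook, since they show which grid point won and its cross-validated Recall.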

In [None]:
confusion_matrix_sklearn(dtree_tuned, X_test, y_test)

In [None]:
dtree_tuned_train_perf = model_performance_classification(dtree_tuned, X_train, y_train)
dtree_tuned_train_perf

In [None]:
dtree_tuned_test_perf = model_performance_classification(dtree_tuned, X_test, y_test)
dtree_tuned_test_perf

- The model is generalising well with tuned parameters

### Visualizing the Decision Tree

In [None]:
# creating a list of column names
feature_names = X_train.columns.to_list()

plt.figure(figsize=(15, 10))
out = tree.plot_tree(
    dtree_tuned,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

In [None]:
# Importance of features in the tree building
print(
    pd.DataFrame(
        dtree_tuned.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

In [None]:
importances = dtree_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

- The Passport feature is given the highest importance in tree building, followed by Designation_Executive, CityTier_3 and Age

## Bagging Classifier

#### with default parameters

In [None]:
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(bagging, X_test, y_test)

In [None]:
bagging_train_perf = model_performance_classification(bagging, X_train, y_train)
bagging_train_perf

In [None]:
bagging_test_perf = model_performance_classification(bagging, X_test, y_test)
bagging_test_perf

- Bagging classifier is overfitting on the training set and performing poorly on the test set in terms of Recall

#### With Hypertuned Decision Tree as base estimator

In [None]:
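# note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; on recent versions pass estimator=dtree_tuned
bagging_dtree_tuned = BaggingClassifier(base_estimator=dtree_tuned, random_state=1)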
bagging_dtree_tuned.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(bagging_dtree_tuned, X_test, y_test)

In [None]:
bagging_dtree_tuned_train_perf = model_performance_classification(
    bagging_dtree_tuned, X_train, y_train
)
bagging_dtree_tuned_train_perf

In [None]:
bagging_dtree_tuned_test_perf = model_performance_classification(
    bagging_dtree_tuned, X_test, y_test
)
bagging_dtree_tuned_test_perf

- Bagging classifier with the tuned decision tree as the base estimator gives a generalized model, but the metrics are still low. Hyperparameter tuning is the next step to check for better Recall on the test data

#### Hyperparameter Tuning

- Hypertuning the bagging classifier with tuned decision tree as base estimator since it is giving a more generalized model

In [None]:
# grid search for bagging classifier
cl1 = dtree_tuned
param_grid = {
    "base_estimator": [cl1],
    # floats are fractions; the integer 1 would mean a single sample / a single feature
    "max_samples": [0.7, 0.8, 0.9, 1.0],
    "n_estimators": [5, 7, 10, 15, 20, 30, 40, 51, 101],
    "max_features": [0.7, 0.8, 0.9, 1.0],
}

grid = GridSearchCV(
    BaggingClassifier(random_state=1, bootstrap=True),
    param_grid=param_grid,
    scoring="recall",
    cv=5,
)
grid.fit(X_train, y_train)

# getting the best estimator
bagging_tuned = grid.best_estimator_
bagging_tuned.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(bagging_tuned, X_test, y_test)

In [None]:
bagging_tuned_train_perf = model_performance_classification(
    bagging_tuned, X_train, y_train
)
bagging_tuned_train_perf

In [None]:
bagging_tuned_test_perf = model_performance_classification(
    bagging_tuned, X_test, y_test
)
bagging_tuned_test_perf

- The tuned bagging classifier gives a generalized model with very good Recall scores on both the train and test sets. However, it performs poorly in terms of Precision
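
One detail worth keeping in mind when building grids for `BaggingClassifier`: an integer `max_samples` or `max_features` is an absolute count, while a float is a fraction of the rows or columns. The sketch below, on synthetic data, shows the difference between `1` and `1.0`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

# Integer 1: every base tree is trained on a single drawn sample
bag_int = BaggingClassifier(n_estimators=5, max_samples=1, random_state=1)
bag_int.fit(X_demo, y_demo)

# Float 1.0: every base tree draws as many samples as there are rows
bag_float = BaggingClassifier(n_estimators=5, max_samples=1.0, random_state=1)
bag_float.fit(X_demo, y_demo)

# estimators_samples_ holds the row indices drawn for each base estimator
print(len(bag_int.estimators_samples_[0]))
print(len(bag_float.estimators_samples_[0]))
```

A grid value written as `1` therefore tests a very different model from one written as `1.0`.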

## Random Forest

#### with default parameters

In [None]:
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(rf, X_test, y_test)

In [None]:
rf_train_perf = model_performance_classification(rf, X_train, y_train)
rf_train_perf

In [None]:
rf_test_perf = model_performance_classification(rf, X_test, y_test)
rf_test_perf

- Random Forest is overfitting on the training data and is performing poorly on the test data in terms of Recall

#### Hyperparameter tuning

In [None]:
# Choose the type of classifier
rf_estimator = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [150, 200, 250],
    "min_samples_leaf": np.arange(5, 10),
    "max_features": np.arange(0.2, 0.7, 0.1),
    "max_samples": np.arange(0.3, 0.7, 0.1),
}

# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring="recall", cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
rf_tuned.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(rf_tuned, X_test, y_test)

In [None]:
rf_tuned_train_perf = model_performance_classification(rf_tuned, X_train, y_train)
rf_tuned_train_perf

In [None]:
rf_tuned_test_perf = model_performance_classification(rf_tuned, X_test, y_test)
rf_tuned_test_perf

- Tuned random forest model is generalizing in comparison to the model with default parameters. However, it is performing poorly in terms of the Recall score

### Feature importance of Random Forest

In [None]:
# Importance of features in the tree building
print(
    pd.DataFrame(
        rf_tuned.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)


In [None]:
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

- Random Forest Classifier has given the highest importance to Monthly Income and Age
- Followed by Passport, Number of trips and Designation_Executive
- Designation_VP has been given the least importance
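
Because the categorical features are dummy-encoded, their importance is spread over several columns. One possible way to read importances per original feature is to sum the dummy columns back together. The sketch below assumes that the text before the first underscore names the original feature, and it uses a hypothetical toy frame rather than the project data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data; column names mimic the dummy-encoded training frame
toy = pd.DataFrame(
    {
        "Age": [25, 32, 47, 51, 38, 29, 44, 36],
        "Passport": [1, 0, 0, 1, 1, 0, 0, 1],
        "Designation_Manager": [0, 1, 0, 0, 1, 0, 0, 1],
        "Designation_VP": [0, 0, 1, 1, 0, 0, 1, 0],
    }
)
target = [1, 0, 0, 1, 1, 0, 0, 1]

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(toy, target)

# Impurity-based importances as a labelled, sorted Series
importances = pd.Series(forest.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))

# Assumed convention: the text before the first "_" names the original feature
by_feature = importances.groupby(lambda col: col.split("_")[0]).sum()
print(by_feature.sort_values(ascending=False))
```

The per-feature totals still sum to 1, so they can be compared directly with the importances of the numeric columns.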

## Comparison of Models - Bagging

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        dtree_train_perf.T,
        dtree_tuned_train_perf.T,
        bagging_train_perf.T,
        bagging_dtree_tuned_train_perf.T,
        bagging_tuned_train_perf.T,
        rf_train_perf.T,
        rf_tuned_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Tuned",
    "Bagging Classifier",
    "Bagging Classifier with dtree_tuned base estimator",
    "Bagging Classifier Tuned",
    "Random Forest",
    "Random Forest Tuned",
]
print("Training performance comparison:")
models_train_comp_df

In [None]:
# testing performance comparison

bagging_models_test_comp_df = pd.concat(
    [
        dtree_test_perf.T,
        dtree_tuned_test_perf.T,
        bagging_test_perf.T,
        bagging_dtree_tuned_test_perf.T,
        bagging_tuned_test_perf.T,
        rf_test_perf.T,
        rf_tuned_test_perf.T,
    ],
    axis=1,
)
bagging_models_test_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Tuned",
    "Bagging Classifier",
    "Bagging Classifier with dtree_tuned base estimator",
    "Bagging Classifier Tuned",
    "Random Forest",
    "Random Forest Tuned",
]
print("\nTesting performance comparison:")
bagging_models_test_comp_df

## Model Performance - Observations (Bagging)

- Overfit models - Decision tree, Bagging Classifier and Random Forest
- Generalized models - Tuned decision tree, Bagging Classifier with tuned decision tree as the base estimator, tuned Bagging Classifier and tuned Random Forest models
- Tuned Bagging Classifier gives the highest Recall in the test set
- The business may choose tuned Decision Tree for a good Recall with a better Precision score

# Model Building - Boosting

## AdaBoost Classifier

#### With default parameters

In [None]:
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(abc, X_test, y_test)

In [None]:
abc_train_perf = model_performance_classification(abc, X_train, y_train)
abc_train_perf

In [None]:
abc_test_perf = model_performance_classification(abc, X_test, y_test)
abc_test_perf

- AdaBoost is generalizing well but it is giving very poor performance in terms of Recall

#### Hyperparameter tuning

In [None]:
# choose the type of classifier
abc_tuned = AdaBoostClassifier(random_state=1)

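# Grid of parameters
# note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use the new key on recent versions
parameters = {  # trying different max_depth for base_estimator
    "base_estimator": [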
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
    "n_estimators": [100],  # np.arange(10, 110, 10),
    "learning_rate": np.arange(0.1, 2, 0.1),
}

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring="recall", cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
abc_tuned.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(abc_tuned, X_test, y_test)

In [None]:
abc_tuned_train_perf = model_performance_classification(abc_tuned, X_train, y_train)
abc_tuned_train_perf

In [None]:
abc_tuned_test_perf = model_performance_classification(abc_tuned, X_test, y_test)
abc_tuned_test_perf

- Tuned AdaBoost classifier is overfitting on the training data. However, Recall has improved in comparison to the model with default parameters

In [None]:
# Importance of features in the tree building
print(
    pd.DataFrame(
        abc_tuned.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

In [None]:
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

- Monthly Income is given the highest feature importance by the tuned Adaboost classifier, followed by Age and Number of trips. 
- Passport, which was given much higher importance by the Bagging models is given lower importance here

## Gradient Boosting Classifier

#### With default parameters

In [None]:
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(gbc, X_test, y_test)

In [None]:
gbc_train_perf = model_performance_classification(gbc, X_train, y_train)
gbc_train_perf

In [None]:
gbc_test_perf = model_performance_classification(gbc, X_test, y_test)
gbc_test_perf

- The model is giving generalized scores on train and test sets. The Recall score is better than the Adaboost model with default parameters.

#### Hyperparameter tuning

- Tuning the gradient boosting model with an AdaBoost classifier as its initial estimator (the `init` parameter, which supplies the starting predictions), since AdaBoost gave a generalized model

In [None]:
gbc_tuned = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=1), random_state=1
)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [250],
    "subsample": [0.8, 0.9, 1],
    # 1.0 (float) means all features; the integer 1 would mean a single feature
    "max_features": [0.7, 0.8, 0.9, 1.0],
}

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring="recall", cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(gbc_tuned, X_test, y_test)

In [None]:
gbc_tuned_train_perf = model_performance_classification(gbc_tuned, X_train, y_train)
gbc_tuned_train_perf

In [None]:
gbc_tuned_test_perf = model_performance_classification(gbc_tuned, X_test, y_test)
gbc_tuned_test_perf

- Recall has improved a little compared to the model with default parameters

In [None]:
# Importance of features in the tree building
print(
    pd.DataFrame(
        gbc_tuned.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

In [None]:
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

- The features with the highest importance are the same as those given by the tuned Random Forest model

## XGBoost Classifier

#### With default parameters

In [None]:
xgb = XGBClassifier(random_state=1, eval_metric="logloss")
xgb.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(xgb, X_test, y_test)

In [None]:
xgb_train_perf = model_performance_classification(xgb, X_train, y_train)
xgb_train_perf

In [None]:
xgb_test_perf = model_performance_classification(xgb, X_test, y_test)
xgb_test_perf

- XGBoost with default parameters is overfitting on the training data and is giving a low Recall score on the test data

#### Hyperparameter tuning

In [None]:
# choose the type of classifier
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")

# Grid of parameters
parameters = {
    "n_estimators": np.arange(50, 100, 20),
    "scale_pos_weight": [5],
    "subsample": [0.9, 1],
    "learning_rate": [0.1],
    "gamma": [3],
    "colsample_bytree": [0.5, 0.7, 0.9, 1],
    "colsample_bylevel": [0.5, 0.7, 0.9, 1],
}

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring="recall", cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
xgb_tuned.fit(X_train, y_train)
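
The grid above fixes `scale_pos_weight` at 5. The XGBoost documentation suggests the ratio of negative to positive instances as a typical value to consider, which can be computed from the training target. The sketch below uses a hypothetical target; in the notebook the same lines would be applied to `y_train`.

```python
import pandas as pd

# Hypothetical target with 1 buyer for every 4 non-buyers
y_toy = pd.Series([1] * 20 + [0] * 80)

counts = y_toy.value_counts()

# Common heuristic: number of negatives divided by number of positives
scale_pos_weight = counts.loc[0] / counts.loc[1]

print(scale_pos_weight)
```

Values around this ratio, rather than a single fixed number, could also be added to the grid.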

In [None]:
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)

In [None]:
xgb_tuned_train_perf = model_performance_classification(xgb_tuned, X_train, y_train)
xgb_tuned_train_perf

In [None]:
xgb_tuned_test_perf = model_performance_classification(xgb_tuned, X_test, y_test)
xgb_tuned_test_perf

- The model is slightly overfit
- However, the hyperparameter-tuned XGBoost classifier gives the highest Recall amongst all the Boosting algorithms

In [None]:
# Importance of features in the tree building
print(
    pd.DataFrame(
        xgb_tuned.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

In [None]:
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

- Passport is given the highest feature importance by the tuned XGBoost classifier
- MaritalStatus_Single, Designation_Executive and CityTier_3 come next in importance

## Stacking Classifier

In [None]:
# Building a stacking model with tuned random forest, bagging classifier and tuned gradient boosting classifier
# as base estimators, and the tuned XGBoost classifier as the final estimator

estimators = [
    ("Random Forest tuned", rf_tuned),
    ("Bagging Classifier", bagging),
    ("Gradient Boosting Tuned", gbc_tuned),
]
final_estimator = xgb_tuned

stacking_classifier = StackingClassifier(
    estimators=estimators, final_estimator=final_estimator
)
stacking_classifier.fit(X_train, y_train)

In [None]:
confusion_matrix_sklearn(stacking_classifier, X_test, y_test)

In [None]:
stacking_train_perf = model_performance_classification(
    stacking_classifier, X_train, y_train
)
stacking_train_perf

In [None]:
stacking_test_perf = model_performance_classification(
    stacking_classifier, X_test, y_test
)
stacking_test_perf

- The model is a little overfit
- However, Stacking classifier is giving the highest Recall in the test set
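
For context on how stacking combines the models: `StackingClassifier` trains the final estimator on cross-validated predictions of the base estimators (5-fold by default), not on the original features. The sketch below uses small stand-in models and synthetic data, which are assumptions and not the tuned models from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and small stand-in models (assumptions)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=20, random_state=1)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # base-model predictions for the final estimator come from 5-fold CV
)
stack.fit(X_demo, y_demo)

# The final estimator sees one column per base model (positive-class probability)
meta_features = stack.transform(X_demo)
print(meta_features.shape)
```

With two base models and a binary target, the final estimator is trained on just two columns, so its own capacity matters less than the quality and diversity of the base models.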

## Comparison of Models - Boosting

In [None]:
# training performance comparison

boosting_models_train_comp_df = pd.concat(
    [
        abc_train_perf.T,
        abc_tuned_train_perf.T,
        gbc_train_perf.T,
        gbc_tuned_train_perf.T,
        xgb_train_perf.T,
        xgb_tuned_train_perf.T,
        stacking_train_perf.T,
    ],
    axis=1,
)
boosting_models_train_comp_df.columns = [
    "AdaBoost Classifier",
    "AdaBoost Classifier Tuned",
    "Gradient Boosting Classifier",
    "Gradient Boosting Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Training performance comparison:")
boosting_models_train_comp_df

In [None]:
# testing performance comparison

boosting_models_test_comp_df = pd.concat(
    [
        abc_test_perf.T,
        abc_tuned_test_perf.T,
        gbc_test_perf.T,
        gbc_tuned_test_perf.T,
        xgb_test_perf.T,
        xgb_tuned_test_perf.T,
        stacking_test_perf.T,
    ],
    axis=1,
)
boosting_models_test_comp_df.columns = [
    "AdaBoost Classifier",
    "AdaBoost Classifier Tuned",
    "Gradient Boosting Classifier",
    "Gradient Boosting Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Testing performance comparison:")
boosting_models_test_comp_df

## Model Performance - Observations (Boosting)

- AdaBoost and Gradient Boosting classifiers are the most generalized models, but they perform poorly in terms of Recall
- XGBoost and Stacking models are slightly overfit, but they give the highest Recall scores on the test set
- We can also look into tuning the XGBoost classifier with different parameters and the stacking classifier with different weak learners to get more generalized models
- The business may choose the stacking model for the highest Recall, or the tuned XGBoost model for a high Recall with a slightly better Precision score

**Note** : class_weight was not added for the Boosting algorithms, as it gave much lower Recall scores during tuning

# Business Insights and Recommendations

- The business can use this predictive model to identify
    - potential customers who may purchase the travel packages
    - potential new customers who may purchase the packages that are offered / packages that are newly launched
    - the features that drive the customer to buy the package
- Features that impact Product taken - Passport, Designation, Marital Status, City Tier, Monthly Income, Age and Number of trips annually
    - customers who own a passport show more interest in buying the product
    - customers with Designation Executive, Marital Status Single and City Tier 3 should be our target customers
    - customers with Monthly Income 15K to 25K and Age 25 to 40 show more interest in buying a travel package
    - the more trips a customer takes annually, the higher the chance of the customer buying the package
- The marketing team should
    - have salespeople spend more time on the pitch with the customer
    - follow up with the customer multiple times
    - encourage customers to get a passport
    - market the 'King' package and reach out to customers for it through company invitations
- Once the 'Wellness Package' is launched, the business can collect data on customer information, their preference, product satisfaction and customer interaction so as to enable data analysis for better results
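
To act on the first recommendation, new customers need to be encoded the same way as the training data before they are scored. The sketch below shows one possible approach: aligning the dummy columns with `reindex` and ranking by predicted purchase probability. All data, column values and the stand-in model are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data, already dummy-encoded like X_train
train = pd.DataFrame(
    {
        "Age": [25, 32, 47, 51, 38, 29],
        "Passport": [1, 0, 0, 1, 1, 0],
        "CityTier_2": [0, 1, 0, 0, 0, 1],
        "CityTier_3": [1, 0, 0, 1, 0, 0],
    }
)
bought = [1, 0, 0, 1, 1, 0]
model = RandomForestClassifier(n_estimators=50, random_state=1).fit(train, bought)

# Hypothetical new customers; none of them lives in a tier 2 city
new_customers = pd.DataFrame(
    {"Age": [27, 45, 33], "Passport": [1, 0, 1], "CityTier": [3, 1, 3]}
)
new_customers["CityTier"] = new_customers["CityTier"].astype("category")
encoded = pd.get_dummies(new_customers, drop_first=True, dtype=int)

# Align to the training columns so absent dummy levels become all-zero columns
encoded = encoded.reindex(columns=train.columns, fill_value=0)

# Rank customers by predicted purchase probability, highest first
scores = pd.Series(model.predict_proba(encoded)[:, 1], index=new_customers.index)
print(scores.sort_values(ascending=False))
```

The marketing team could then contact customers from the top of this ranking downwards, with the cut-off chosen according to the available budget.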