## **HOLIDAY PACKAGE PREDICTION**

**PROBLEM STATEMENT**

"Trips & Travel.Com" wants to establish a viable business model to expand its customer base. One way to do this is to introduce a new package offering. The company currently offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data shows that 18% of customers purchased a package. However, the marketing cost was quite high because customers were contacted at random, without using the available information. The company now plans to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and to support or increase their sense of well-being. This time, the company wants to use the available data on existing and potential customers to make its marketing expenditure more efficient.


## **DATA COLLECTION**

In [36]:
# Import necessary libraries for data manipulation, analysis, and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [37]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('/content/Travel.csv')

## **DATA CLEANING**

In [38]:
# Display the first 2 rows of the DataFrame to get a glimpse of the data
df.head(2)

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0


In [39]:
# Get information about the DataFrame, including data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

In [40]:
# Check the value counts for the 'ProductPitched' column
df['ProductPitched'].value_counts()

ProductPitched,count
Basic,1842
Deluxe,1732
Standard,742
Super Deluxe,342
King,230


In [41]:
# Check the value counts for the 'Gender' column to identify inconsistencies
df['Gender'].value_counts()

Gender,count
Male,2916
Female,1817
Fe Male,155


In [42]:
# Check the value counts for the 'MaritalStatus' column to identify inconsistencies
df['MaritalStatus'].value_counts()

MaritalStatus,count
Married,2340
Divorced,950
Single,916
Unmarried,682


In [43]:
# Merge the 'Unmarried' label into 'Single' so both form one category
df['MaritalStatus'] = df['MaritalStatus'].replace('Unmarried', 'Single')

In [44]:
# Fix the misspelled 'Fe Male' label in 'Gender'
df['Gender'] = df['Gender'].replace('Fe Male', 'Female')

In [45]:
# Verify the changes in 'MaritalStatus' after replacement
df['MaritalStatus'].value_counts()

MaritalStatus,count
Married,2340
Single,1598
Divorced,950
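
The two label fixes above can also be expressed as a single mapping, which keeps every correction in one place. This is a minimal sketch on a toy DataFrame, not the real dataset; the `label_fixes` dictionary and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for the real columns (values are illustrative only)
toy = pd.DataFrame({
    "Gender": ["Male", "Fe Male", "Female", "Fe Male"],
    "MaritalStatus": ["Single", "Unmarried", "Married", "Divorced"],
})

# Hypothetical mapping of variant labels to their canonical form,
# keyed by column name: {column: {old_label: new_label}}
label_fixes = {
    "Gender": {"Fe Male": "Female"},
    "MaritalStatus": {"Unmarried": "Single"},
}

# DataFrame.replace with a nested dict only touches the named columns
cleaned = toy.replace(label_fixes)
gender_counts = cleaned["Gender"].value_counts().to_dict()
print(gender_counts)
print(cleaned["MaritalStatus"].tolist())
```

A nested dictionary limits each replacement to its own column, so a label that happens to appear in another column is left untouched.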


In [46]:
# Check for missing values in each column
df.isnull().sum()

Column,Missing values
CustomerID,0
ProdTaken,0
Age,226
TypeofContact,25
CityTier,0
DurationOfPitch,251
Occupation,0
Gender,0
NumberOfPersonVisiting,0
NumberOfFollowups,45


In [47]:
# Calculate and print the percentage of missing values for columns with missing data
Handling_na = [feature for feature in df.columns if df[feature].isnull().any()]
for feature in Handling_na:
    print(feature, np.round(df[feature].isnull().mean() * 100, 5), '% missing values')

Age 4.62357 % missing values
TypeofContact 0.51146 % missing values
DurationOfPitch 5.13502 % missing values
NumberOfFollowups 0.92062 % missing values
PreferredPropertyStar 0.53191 % missing values
NumberOfTrips 2.86416 % missing values
NumberOfChildrenVisiting 1.35025 % missing values
MonthlyIncome 4.76678 % missing values
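
The same percentages can be computed without an explicit loop, since the mean of a boolean null mask is the fraction of missing entries. The sketch below uses a small made-up DataFrame, so the numbers are not those of the travel dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (illustrative values only)
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 33.0],
    "Gender": ["Male", "Female", "Female", "Male"],
    "MonthlyIncome": [20993.0, 20130.0, np.nan, np.nan],
})

# Fraction of nulls per column, expressed as a percentage
missing_pct = toy.isnull().mean().mul(100)

# Keep only columns that actually have missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct.round(2))
```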


In [48]:
# Get descriptive statistics for numerical columns with missing values
df[Handling_na].select_dtypes(exclude='object').describe()

Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4662.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.622265,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.316387,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,44.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0


In [49]:
# Check the value counts for the 'TypeofContact' column
df['TypeofContact'].value_counts()

TypeofContact,count
Self Enquiry,3444
Company Invited,1419


In [50]:
# Fill missing values with a per-column strategy (mean, median, or mode).
# Assigning the result back avoids calling fillna(..., inplace=True) on a selected column,
# a chained pattern that recent pandas versions deprecate.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['TypeofContact'] = df['TypeofContact'].fillna(df['TypeofContact'].mode()[0])
df['DurationOfPitch'] = df['DurationOfPitch'].fillna(df['DurationOfPitch'].median())
df['NumberOfFollowups'] = df['NumberOfFollowups'].fillna(df['NumberOfFollowups'].mode()[0])
df['PreferredPropertyStar'] = df['PreferredPropertyStar'].fillna(df['PreferredPropertyStar'].mode()[0])
df['NumberOfTrips'] = df['NumberOfTrips'].fillna(df['NumberOfTrips'].mode()[0])
df['NumberOfChildrenVisiting'] = df['NumberOfChildrenVisiting'].fillna(df['NumberOfChildrenVisiting'].mode()[0])
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())
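
One way to make the imputation strategy easier to audit is to collect the fill values in a dictionary and apply them in a single call. The following is a sketch on toy data; the `fill_values` name and the sample rows are assumptions, and the choice of statistic per column simply mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative values only)
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0],
    "TypeofContact": ["Self Enquiry", None, "Self Enquiry"],
    "MonthlyIncome": [20000.0, 22000.0, np.nan],
})

# Hypothetical per-column fill values: mean, mode, and median respectively
fill_values = {
    "Age": toy["Age"].mean(),
    "TypeofContact": toy["TypeofContact"].mode()[0],
    "MonthlyIncome": toy["MonthlyIncome"].median(),
}

# fillna with a dict fills each named column with its own value
filled = toy.fillna(value=fill_values)
print(filled)
```

Note that computing these statistics on the full dataset before the train/test split lets test rows influence the fill values; fitting them on the training portion only is a common alternative.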

In [51]:
# Verify that all missing values have been handled
df.isnull().sum()

Column,Missing values
CustomerID,0
ProdTaken,0
Age,0
TypeofContact,0
CityTier,0
DurationOfPitch,0
Occupation,0
Gender,0
NumberOfPersonVisiting,0
NumberOfFollowups,0


In [52]:
# Display the first row of the DataFrame after handling missing values
df.head(1)

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0


In [53]:
# Create a new feature 'TotalVisiting' by summing 'NumberOfPersonVisiting' and 'NumberOfChildrenVisiting'
df['TotalVisiting'] = df['NumberOfPersonVisiting'] + df['NumberOfChildrenVisiting']
# Drop the original columns as they are no longer needed
df.drop(['NumberOfPersonVisiting' , 'NumberOfChildrenVisiting'] , axis=1, inplace =True)

In [54]:
# Display the first 2 rows of the DataFrame to see the new feature and dropped columns
df.head(2)

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TotalVisiting
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Single,1.0,1,2,1,Manager,20993.0,3.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0


In [55]:
# Get information about the DataFrame after feature engineering
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   CustomerID              4888 non-null   int64  
 1   ProdTaken               4888 non-null   int64  
 2   Age                     4888 non-null   float64
 3   TypeofContact           4888 non-null   object 
 4   CityTier                4888 non-null   int64  
 5   DurationOfPitch         4888 non-null   float64
 6   Occupation              4888 non-null   object 
 7   Gender                  4888 non-null   object 
 8   NumberOfFollowups       4888 non-null   float64
 9   ProductPitched          4888 non-null   object 
 10  PreferredPropertyStar   4888 non-null   float64
 11  MaritalStatus           4888 non-null   object 
 12  NumberOfTrips           4888 non-null   float64
 13  Passport                4888 non-null   int64  
 14  PitchSatisfactionScore  4888 non-null   

In [56]:
# Identify numerical features
num_feature = [feature for feature in df.columns if df[feature].dtype != 'O']
print('no of numerical feature :', len(num_feature))

no of numerical feature : 13


In [57]:
# Identify categorical features
cat_feature = [feature for feature in df.columns if df[feature].dtype == 'O']
print('no of categorical feature :', len(cat_feature))

no of categorical feature : 6


In [58]:
# Identify discrete numerical features (less than or equal to 25 unique values)
discreate_feature = [feature for feature in num_feature if len(df[feature].unique()) <= 25]
print('no of discrete feature :', len(discreate_feature))

no of discrete feature : 9


In [59]:
# Identify continuous numerical features
continuous_feature = [feature for feature in num_feature if feature not in discreate_feature]
print('no of continuous feature :', len(continuous_feature))

no of continuous feature : 4
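
The discrete/continuous split by unique-value count can be sketched with `nunique()` as follows. The toy data and the `threshold` of 3 are assumptions chosen to fit a six-row example; the notebook itself uses a cutoff of 25.

```python
import pandas as pd

# Toy frame: one low-cardinality numeric, one high-cardinality numeric, one text column
toy = pd.DataFrame({
    "CityTier": [1, 2, 3, 1, 2, 3],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0, 17909.0, 18468.0, 18068.0],
    "Occupation": ["Salaried", "Salaried", "Free Lancer",
                   "Salaried", "Small Business", "Salaried"],
})

threshold = 3  # hypothetical cutoff for this toy frame

# Count distinct values in each numeric column
unique_counts = toy.select_dtypes(include="number").nunique()

discrete = unique_counts[unique_counts <= threshold].index.tolist()
continuous = unique_counts[unique_counts > threshold].index.tolist()
print(discrete, continuous)
```

Unlike `len(df[col].unique())`, `nunique()` ignores missing values by default, so the two approaches can differ by one on columns that still contain NaN.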


In [60]:
# Separate features (x) and target variable (y)
x = df.drop(['ProdTaken'] , axis = 1)
y = df['ProdTaken']

In [61]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x,y , test_size = 0.20 , random_state =999 )
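
Because only about 18% of customers purchased a package, a plain random split can leave the train and test sets with somewhat different class proportions. A possible refinement, shown here on synthetic arrays rather than the real data, is to pass `stratify` to `train_test_split`; the array names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 18 positives, loosely mimicking the 18% purchase rate
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 18 + [0] * 82)

# stratify=y_toy keeps the class ratio roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=999, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```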

In [62]:
# Identify categorical and numerical features for preprocessing
cat_features =x.select_dtypes(include ='object').columns
num_features = x.select_dtypes(exclude = "object").columns

# Import necessary preprocessing classes
from sklearn.preprocessing import OneHotEncoder , StandardScaler
from sklearn.compose import ColumnTransformer

#standard = StandardScaler() # Scaler for numerical features (commented out)
onehot = OneHotEncoder(drop ='first') # One-hot encoder for categorical features

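# Create a ColumnTransformer to apply different preprocessing steps to different columns.
# Note: with the scaler commented out and the default remainder='drop', only the
# one-hot encoded categorical columns are passed on; the numerical columns are dropped.
preprocessor = ColumnTransformer(
    [
        ("OneHotEncode", onehot, cat_features)
        #, ("StandardScaler", standard, num_features) # Apply scaler to numerical features (commented out)
    ]
)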
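
By default, `ColumnTransformer` drops every column that is not assigned to a transformer (`remainder='drop'`). The sketch below, on a three-row toy frame, contrasts that default with `remainder='passthrough'`, which keeps the unlisted columns unchanged. The toy columns and variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one categorical and two numeric columns (illustrative values)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [41.0, 49.0, 37.0],
    "Passport": [1, 0, 1],
})
cat_cols = ["Gender"]

# Default remainder='drop': only the encoded categorical columns survive
drop_rest = ColumnTransformer([("onehot", OneHotEncoder(drop="first"), cat_cols)])

# remainder='passthrough': unlisted (numeric) columns are appended as-is
keep_rest = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), cat_cols)],
    remainder="passthrough",
)

dropped_shape = drop_rest.fit_transform(toy).shape
kept_shape = keep_rest.fit_transform(toy).shape
print(dropped_shape, kept_shape)
```

Tree-based models such as random forests do not need scaled inputs, so passing numeric columns through unscaled is a reasonable option for them.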

## **RANDOM FOREST IMPLEMENTATION**

In [63]:
# Apply the preprocessing steps to the training and testing data
x_train =preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)

In [64]:
# Import the RandomForestClassifier model
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the Random Forest model
randomforest = RandomForestClassifier()
randomforest.fit(x_train ,y_train)

# Make predictions on the test set
y_pred = randomforest.predict(x_test)
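
Preprocessing and the classifier can also be bundled into a single scikit-learn `Pipeline`, so the encoder is always fitted on the training data only and reused at prediction time. This is a sketch on toy data under assumed settings (`n_estimators=50`, `random_state=999`, `handle_unknown="ignore"`), not the configuration used for the results in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (illustrative values only)
toy_X = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Passport": [1, 0, 1, 0, 1, 0],
})
toy_y = [1, 0, 1, 0, 1, 0]

model = Pipeline([
    # Encode the categorical column, pass the numeric column through unchanged
    ("prep", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Gender"])],
        remainder="passthrough",
    )),
    # Fixed random_state makes the fitted forest reproducible between runs
    ("forest", RandomForestClassifier(n_estimators=50, random_state=999)),
])

model.fit(toy_X, toy_y)
toy_pred = model.predict(toy_X)
print(toy_pred)
```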

In [65]:
# Import metrics for model evaluation
from sklearn.metrics import accuracy_score , classification_report

In [66]:
# Calculate and print the accuracy score (scikit-learn expects y_true first, then y_pred)
print(accuracy_score(y_test, y_pred))

0.8057259713701431


In [67]:
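# Print the classification report to see precision, recall, and F1-score.
# Note: the arguments are passed as (y_pred, y_test), the reverse of scikit-learn's
# (y_true, y_pred) order, so in the output below the precision and recall columns are
# swapped and "support" counts predicted labels rather than true labels.
print(classification_report(y_pred , y_test))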

              precision    recall  f1-score   support

           0       0.98      0.81      0.89       948
           1       0.09      0.60      0.16        30

    accuracy                           0.81       978
   macro avg       0.54      0.71      0.52       978
weighted avg       0.96      0.81      0.87       978
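
scikit-learn's metric functions take their arguments in the order `(y_true, y_pred)`. Accuracy is symmetric, but precision and recall are not: swapping the arguments exchanges the two. The toy labels below are made up to show the effect and are unrelated to the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 4 true positives in the data, the model flags only 2 rows as positive
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Correct order: true labels first, predictions second
precision = precision_score(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy)

# Swapped order: what is reported as "precision" is really the recall
swapped_precision = precision_score(y_pred_toy, y_true_toy)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(precision, recall, swapped_precision)
print(cm)
```

For an imbalanced target like this one, recall on the positive class is usually the number to watch, since overall accuracy can look acceptable even when most buyers are missed.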

