## Holiday_Package_Prediction

"Travel.Com" wants to establish a viable business model for expanding its customer base. One way to do so is to introduce a new package. The company currently offers five packages: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data shows that 18% of customers purchased a package. However, marketing costs were high because customers were contacted at random, without using the information available about them. The company now plans to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and to support or increase their sense of well-being. This time, the company wants to use the available data on existing and potential customers to make its marketing spend more efficient.

We need to analyze the customer data to provide recommendations to the Policy Maker and the Marketing Team, and to build a model that predicts which potential customers will purchase the newly introduced travel package.

https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction

In [3]:
import seaborn as sns 
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 

In [4]:
dataset = pd.read_csv('Travel.csv')
dataset.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


## Data Cleaning 
1. Handle missing values
2. Handle duplicates
3. Check data types
4. Understand the data types
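
The list above names duplicate handling, but the cells that follow do not show it. A minimal sketch of what such a check might look like, using a small made-up frame (`toy` and its values are assumptions for illustration, not rows from Travel.csv):

```python
import pandas as pd

# Toy frame standing in for the travel data (hypothetical values).
toy = pd.DataFrame({
    "CustomerID": [200000, 200001, 200001, 200002],
    "Age": [41.0, 49.0, 49.0, 37.0],
    "Gender": ["Female", "Male", "Male", "Male"],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped.shape)
```

On the real data the same two calls would be applied to `dataset`; whether any duplicates exist there is not shown in this notebook.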

In [5]:
# count the missing values in each column
dataset.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [6]:
# checking the categorical value 
dataset['Gender'].value_counts()

Gender
Male       2916
Female     1817
Fe Male     155
Name: count, dtype: int64

In [7]:
# 'Fe Male' is a misspelling of 'Female', so merge the two labels
dataset['Gender'] = dataset['Gender'].replace('Fe Male','Female')

In [9]:
dataset['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: count, dtype: int64

In [10]:
# treat 'Single' and 'Unmarried' as one category
dataset['MaritalStatus'] = dataset['MaritalStatus'].replace('Single','Unmarried')

In [11]:
dataset['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Unmarried    1598
Divorced      950
Name: count, dtype: int64
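
The two label fixes above could also be written as one nested mapping passed to `DataFrame.replace`, which keeps every correction in a single place. This is only a sketch on a made-up frame; `toy` and `label_fixes` are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Toy frame with the same inconsistent labels seen in the real data.
toy = pd.DataFrame({
    "Gender": ["Male", "Fe Male", "Female"],
    "MaritalStatus": ["Single", "Unmarried", "Married"],
})

# Outer keys are column names, inner dicts map old label -> new label.
label_fixes = {
    "Gender": {"Fe Male": "Female"},
    "MaritalStatus": {"Single": "Unmarried"},
}
cleaned = toy.replace(label_fixes)

print(cleaned)
```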

In [None]:
dataset['Occupation'].value_counts()
# we don't need to change any value here

Occupation
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: count, dtype: int64

In [None]:
dataset['TypeofContact'].value_counts()
# the labels are consistent here as well, so nothing needs to change

TypeofContact
Self Enquiry       3444
Company Invited    1419
Name: count, dtype: int64

In [15]:
dataset['ProductPitched'].value_counts()
# all labels are distinct, so nothing needs to change

ProductPitched
Basic           1842
Deluxe          1732
Standard         742
Super Deluxe     342
King             230
Name: count, dtype: int64

In [17]:
dataset['Designation'].value_counts()
# no values are repeated

Designation
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: count, dtype: int64


In [27]:
# percentage of missing values for each column that has any
feature_with_null_values = [features for features in dataset.columns if dataset[features].isnull().sum() > 0]
for f in feature_with_null_values:
    print(f,np.round(dataset[f].isnull().mean()*100,3))

Age 4.624
TypeofContact 0.511
DurationOfPitch 5.135
NumberOfFollowups 0.921
PreferredPropertyStar 0.532
NumberOfTrips 2.864
NumberOfChildrenVisiting 1.35
MonthlyIncome 4.767
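
The same percentages can be obtained without an explicit loop, as a sorted Series. A small sketch on made-up data (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (hypothetical data).
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 33.0],
    "CityTier": [3, 1, 1, 1],
    "MonthlyIncome": [20993.0, np.nan, np.nan, 17909.0],
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_pct = toy.isnull().mean().mul(100)

# Keep only columns with missing values, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False).round(3)

print(missing_pct)
```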


In [40]:
dataset[feature_with_null_values].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       4888 non-null   float64
 1   TypeofContact             4863 non-null   object 
 2   DurationOfPitch           4637 non-null   float64
 3   NumberOfFollowups         4843 non-null   float64
 4   PreferredPropertyStar     4862 non-null   float64
 5   NumberOfTrips             4748 non-null   float64
 6   NumberOfChildrenVisiting  4822 non-null   float64
 7   MonthlyIncome             4655 non-null   float64
dtypes: float64(7), object(1)
memory usage: 305.6+ KB


In [41]:
dataset[feature_with_null_values].describe()

Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4888.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.547259,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.104795,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,43.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0


In [32]:
# fill the NaN values

# Age: fill with the median.
# Assign the result back instead of calling
# dataset['Age'].fillna(..., inplace=True): that chained form raises a pandas
# FutureWarning and will stop updating the DataFrame in pandas 3.0.
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())


In [42]:
# TypeofContact: fill with the mode (most frequent value)
dataset['TypeofContact'] = dataset['TypeofContact'].fillna(dataset['TypeofContact'].mode()[0])

In [43]:
dataset['TypeofContact'].isnull().sum()

np.int64(0)

In [44]:
# NumberOfTrips: fill with the median (assigned back to avoid chained inplace fillna)
dataset['NumberOfTrips'] = dataset['NumberOfTrips'].fillna(dataset['NumberOfTrips'].median())


In [46]:
dataset.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                           0
TypeofContact                 0
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips                 0
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [47]:
# PreferredPropertyStar: fill with the mode (assigned back to avoid chained inplace fillna)
dataset['PreferredPropertyStar'] = dataset['PreferredPropertyStar'].fillna(dataset['PreferredPropertyStar'].mode()[0])


In [48]:
# NumberOfFollowups: fill with the mode (assigned back to avoid chained inplace fillna)
dataset['NumberOfFollowups'] = dataset['NumberOfFollowups'].fillna(dataset['NumberOfFollowups'].mode()[0])


In [49]:
# NumberOfChildrenVisiting: fill with the mode (assigned back to avoid chained inplace fillna)
dataset['NumberOfChildrenVisiting'] = dataset['NumberOfChildrenVisiting'].fillna(dataset['NumberOfChildrenVisiting'].mode()[0])


In [50]:
# MonthlyIncome: fill with the median (assigned back to avoid chained inplace fillna)
dataset['MonthlyIncome'] = dataset['MonthlyIncome'].fillna(dataset['MonthlyIncome'].median())


In [52]:
# DurationOfPitch: fill with the median (assigned back to avoid chained inplace fillna)
dataset['DurationOfPitch'] = dataset['DurationOfPitch'].fillna(dataset['DurationOfPitch'].median())


In [53]:
dataset.isnull().sum()

CustomerID                  0
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64
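
The per-column imputations above could be collapsed into one `DataFrame.fillna` call that takes a dict of column to fill value. The sketch below uses a made-up frame; `toy` and `fill_values` are hypothetical names, and the median/mode choices simply mirror the ones used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame with missing numeric and categorical values (hypothetical data).
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 33.0],
    "TypeofContact": ["Self Enquiry", None, "Self Enquiry", "Company Invited"],
    "MonthlyIncome": [20993.0, 20130.0, np.nan, 17909.0],
})

# Median for skew-prone numeric columns, mode for categorical ones.
fill_values = {
    "Age": toy["Age"].median(),
    "TypeofContact": toy["TypeofContact"].mode()[0],
    "MonthlyIncome": toy["MonthlyIncome"].median(),
}

# fillna with a dict fills each listed column with its own value
# and returns a new DataFrame, so no chained assignment is involved.
filled = toy.fillna(fill_values)

print(filled)
```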

In [67]:
# drop CustomerID: it is only an identifier and is not needed for modeling
dataset.drop(['CustomerID'],axis=1,inplace=True)

## Feature Engineering 
Feature Extraction

In [68]:
# NumberOfPersonVisiting and NumberOfChildrenVisiting carry similar information:
# together they give the total number of people visiting,
# so we combine them into a single feature
dataset['totalvisiting'] = dataset['NumberOfPersonVisiting'] + dataset['NumberOfChildrenVisiting']

In [69]:
# drop the two original columns, which are no longer needed
dataset.drop(['NumberOfPersonVisiting','NumberOfChildrenVisiting'],axis=1,inplace=True)

In [70]:
dataset.columns

Index(['ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch',
       'Occupation', 'Gender', 'NumberOfFollowups', 'ProductPitched',
       'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport',
       'PitchSatisfactionScore', 'OwnCar', 'Designation', 'MonthlyIncome',
       'totalvisiting'],
      dtype='object')

In [72]:
# all numeric features
num_features = [feature for feature in dataset.columns if dataset[feature].dtype != 'O']
num_features

['ProdTaken',
 'Age',
 'CityTier',
 'DurationOfPitch',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'Passport',
 'PitchSatisfactionScore',
 'OwnCar',
 'MonthlyIncome',
 'totalvisiting']

In [73]:
# categorical features
cat_feature = [feature for feature in dataset.columns if dataset[feature].dtype == 'O']
cat_feature

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [76]:
# discrete features: numeric columns with at most 25 unique values
descret_fet = [feature for feature in num_features if len(dataset[feature].unique())<=25]
descret_fet

['ProdTaken',
 'CityTier',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'Passport',
 'PitchSatisfactionScore',
 'OwnCar',
 'totalvisiting']

In [78]:
# continuous features: numeric columns that are not discrete
continous_fet = [feature for feature in num_features if feature  not in descret_fet ]
continous_fet

['Age', 'DurationOfPitch', 'MonthlyIncome']

In [83]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
x = dataset.drop(['ProdTaken'],axis=1) # independent features
y = dataset['ProdTaken'] # target (output) feature

In [84]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=42)

In [85]:
x_train.shape,x_test.shape

((3910, 17), (978, 17))
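
Because only a minority of customers purchased a package, a plain random split can leave the two sets with slightly different purchase rates. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A sketch on synthetic data (`toy_x`, `toy_y` and the 20% positive rate are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 customers, 20 of whom purchased (hypothetical).
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({"Age": rng.integers(18, 62, size=100)})
toy_y = pd.Series([1] * 20 + [0] * 80, name="ProdTaken")

# stratify=toy_y keeps the purchase rate the same in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

print(y_tr.mean(), y_te.mean())
```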

In [89]:
# encoding and transforming
cat_features = x.select_dtypes(include='object').columns
num_features = x.select_dtypes(exclude='object').columns

In [91]:
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer

num_transformer = StandardScaler()
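cat_transformer = OneHotEncoder(drop="first")  # drop one level per feature to avoid redundant columns

# one-hot encode the categorical columns and standardize the numeric ones
preprocessor = ColumnTransformer([
    ("OneHotEncoder",cat_transformer,cat_features),
    ("StandardScaler",num_transformer,num_features)
])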

In [92]:
preprocessor

ColumnTransformer
  transformers: [('OneHotEncoder', ...), ('StandardScaler', ...)]
  remainder: 'drop'
  sparse_threshold: 0.3
  n_jobs: None
  transformer_weights: None
  verbose: False
  verbose_feature_names_out: True
  force_int_remainder_cols: 'deprecated'

OneHotEncoder
  categories: 'auto'
  drop: 'first'
  sparse_output: True
  dtype: <class 'numpy.float64'>
  handle_unknown: 'error'
  min_frequency: None
  max_categories: None
  feature_name_combiner: 'concat'

StandardScaler
  copy: True
  with_mean: True
  with_std: True


In [93]:
x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)
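
In this notebook the missing values were filled on the full dataset before the split, so the medians and modes also reflect test rows. An alternative, shown here only as a sketch, is to put the imputers inside the preprocessing step so they are fitted on the training data alone. The frames `train` and `test` and the step names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train/test frames with missing values (hypothetical data).
train = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 33.0],
    "Gender": ["Female", "Male", "Male", "Female"],
})
test = pd.DataFrame({"Age": [np.nan, 50.0], "Gender": ["Male", "Female"]})

# Numeric columns: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(drop="first")),
])
prep = ColumnTransformer([
    ("num", numeric, ["Age"]),
    ("cat", categorical, ["Gender"]),
])

# Statistics are learned from the training frame only,
# then reused unchanged on the test frame.
train_t = prep.fit_transform(train)
test_t = prep.transform(test)

learned_median = prep.named_transformers_["num"].named_steps["impute"].statistics_[0]
print(learned_median)
```

The design choice here is that `fit_transform` sees only training rows, so the test set stays untouched until evaluation.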

In [94]:
pd.DataFrame(x_train)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,-1.020350,1.284279,-0.725271,-0.127737,-0.632399,0.679690,0.782966,-0.382245,-0.774151
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,-0.721400,0.690023,0.282777,-0.725271,1.511598,-0.632399,0.679690,0.782966,-0.459799,0.643615
2,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,-1.020350,0.282777,1.771041,0.418708,-0.632399,0.679690,0.782966,-0.245196,-0.065268
3,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,-0.721400,-1.020350,1.284279,-0.725271,-0.127737,-0.632399,1.408395,-1.277194,0.213475,-0.065268
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,2.400396,-1.720227,-0.725271,1.511598,-0.632399,-0.049015,-1.277194,-0.024889,2.061382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3905,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,-0.653841,1.284279,-0.725271,-0.674182,-0.632399,-1.506426,0.782966,-0.536973,0.643615
3906,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.455047,-0.898180,-0.718725,1.771041,-1.220627,-0.632399,1.408395,0.782966,1.529609,-0.065268
3907,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.455047,1.545210,0.282777,-0.725271,2.058043,-0.632399,-0.777720,0.782966,-0.360576,0.643615
3908,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.455047,1.789549,1.284279,-0.725271,-0.127737,-0.632399,-1.506426,0.782966,-0.252799,0.643615


## Random Forest Classifier Training
Model Training

In [96]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,confusion_matrix,f1_score,precision_score,recall_score,roc_auc_score

random_forest_model = RandomForestClassifier() # by default it builds 100 decision trees
random_forest_model.fit(x_train,y_train)

# make prediction for train and test data
y_train_pred = random_forest_model.predict(x_train)
y_test_pred = random_forest_model.predict(x_test)

# model performance for training data
tr_accuracy_score = accuracy_score(y_train,y_train_pred)
tr_f1score = f1_score(y_train,y_train_pred)
tr_precision = precision_score(y_train,y_train_pred)
tr_recall = recall_score(y_train,y_train_pred)
tr_roc = roc_auc_score(y_train,y_train_pred)  # computed from hard class labels, not predicted probabilities


# model performance for test data
ts_accuracy_score = accuracy_score(y_test,y_test_pred)
ts_f1score = f1_score(y_test,y_test_pred)
ts_precision = precision_score(y_test,y_test_pred)
ts_recall = recall_score(y_test,y_test_pred)
ts_roc = roc_auc_score(y_test,y_test_pred)  # computed from hard class labels, not predicted probabilities
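
The ROC AUC above is computed from the output of `predict`, that is, from 0/1 labels. ROC AUC is usually computed from scores such as the positive-class probability returned by `predict_proba`. The sketch below contrasts the two on synthetic data; the dataset from `make_classification`, the class weights, and the variable names are assumptions for illustration and say nothing about the travel data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (hypothetical stand-in data).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: only one threshold is evaluated.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from positive-class probabilities: every threshold is evaluated.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_scores)
```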


In [100]:
print("Model performance on training data")
print("Accuracy score:",tr_accuracy_score)
print("f1 score:",tr_f1score)
print("Precision:",tr_precision)
print("Recall:",tr_recall)
print("ROC AUC score:",tr_roc)

Model performance on training data
Accuracy score: 1.0
f1 score: 1.0
Precision: 1.0
Recall: 1.0
ROC AUC score: 1.0


In [99]:
print("Model performance on test data")
print("Accuracy score:",ts_accuracy_score)
print("f1 score:",ts_f1score)
print("Precision:",ts_precision)
print("Recall:",ts_recall)
print("ROC AUC score:",ts_roc)

Model performance on test data
Accuracy score: 0.934560327198364
f1 score: 0.8048780487804879
Precision: 0.9635036496350365
Recall: 0.6910994764397905
ROC AUC score: 0.8423731181436565
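
The test metrics above are consistent with a single confusion matrix over the 978 test rows. The counts below are inferred from the reported scores (the notebook does not print them), and the sketch shows how each metric follows from them. It also illustrates that `roc_auc_score` applied to hard class labels equals the mean of recall and specificity, i.e. balanced accuracy, which is why it comes out near 0.842 here.

```python
# Confusion-matrix counts inferred from the reported test metrics
# (assumption: reconstructed for illustration, not output by the notebook).
tp, fp, fn, tn = 132, 5, 59, 782

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # share of predicted buyers who bought
recall = tp / (tp + fn)             # share of actual buyers that were found
specificity = tn / (tn + fp)        # share of non-buyers correctly rejected
f1 = 2 * precision * recall / (precision + recall)

# With 0/1 predictions the ROC curve has a single operating point,
# so its area reduces to the mean of recall and specificity.
balanced_accuracy = (recall + specificity) / 2

print(accuracy, precision, recall, f1, balanced_accuracy)
```

Read this way, the model rarely contacts a non-buyer (5 false positives) but misses 59 of 191 buyers, so recall is the metric with the most room to improve.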
