## Holiday Package Prediction

1) Problem Statement

"Trip and Travel.com" company wants to enable and establish a viable business model to expand the customer base. One of the ways to expand the customer base is to introduce a new offering of packages. Currently, there are 5 types of packages the company is offering Basic, Standard, Deluxe, Siper Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers are purchased the packages. However, the marketing cost was quite high because customers were contacted at random wihtout looking at the available information. The company is now pplanning launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveller to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being. However, this time company wants to harness the availability data of existing and potential customers to make the marketing expenditure more efficient.

2) Data Collection 

https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

In [2]:
df = pd.read_csv("Travel.csv")
df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


 ### Data Cleaning 
 
 #### Handling Missing values
 
 1. Handling Missing values
 2. Handling Duplicates
 3. Check data type
 4. Understand the dataset

In [3]:
df.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [4]:
df['Gender'].value_counts()

Male       2916
Female     1817
Fe Male     155
Name: Gender, dtype: int64

In [5]:
df['MaritalStatus'].value_counts()

Married      2340
Divorced      950
Single        916
Unmarried     682
Name: MaritalStatus, dtype: int64

In [6]:
df['TypeofContact'].value_counts()

Self Enquiry       3444
Company Invited    1419
Name: TypeofContact, dtype: int64

In [7]:
df['Gender'] = df['Gender'].replace('Fe Male', 'Female')
df['MaritalStatus'] = df['MaritalStatus'].replace('Single', 'Unmarried')

In [8]:
df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Unmarried,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [9]:
## Check Missing Values
## These are the features with nan value
features_with_na=[features for features in df.columns if df[features].isnull().sum()>=1]
for feature in features_with_na:
    print(feature,np.round(df[feature].isnull().mean()*100,5), '% missing values')

Age 4.62357 % missing values
TypeofContact 0.51146 % missing values
DurationOfPitch 5.13502 % missing values
NumberOfFollowups 0.92062 % missing values
PreferredPropertyStar 0.53191 % missing values
NumberOfTrips 2.86416 % missing values
NumberOfChildrenVisiting 1.35025 % missing values
MonthlyIncome 4.76678 % missing values


In [10]:
# statistics on numerical columns (Null cols)
df[features_with_na].select_dtypes(exclude='object').describe()

Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4662.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.622265,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.316387,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,44.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0


### Imputing Null values

1. Impute Median value for Age column
2. Impute Mode for Type of Contact
3. Impute Median for Duration of Pitch
4. Impute Mode for numericalofFollow as it is Discrete feature
5. Impute Mode for PreferedPropertyStar
6. Impute Median for NumericalofTrips
7. Impute Mode for NumerOfChildrenVisiting
8. Impute Median for MonthlyIncome

In [11]:
# Age
df.Age.fillna(df.Age.median(), inplace=True)

#TypeofContact
df.TypeofContact.fillna(df.TypeofContact.mode()[0], inplace=True)

#DurationOfPitch
df.DurationOfPitch.fillna(df.DurationOfPitch.mode()[0], inplace=True)

#NumberOfFollowups
df.NumberOfFollowups.fillna(df.NumberOfFollowups.mode()[0], inplace=True)

#PreferredPropertyStar
df.PreferredPropertyStar.fillna(df.PreferredPropertyStar.mode()[0], inplace=True)

#NumberOfTrips
df.NumberOfTrips.fillna(0, inplace=True)

#NumberOfChildrenVisiting
df.NumberOfChildrenVisiting.fillna(df.NumberOfChildrenVisiting.mode()[0], inplace=True)

#MonthlyIncome
df.MonthlyIncome.fillna(df.MonthlyIncome.median(), inplace=True)

In [12]:
df.isnull().sum()

CustomerID                  0
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64

In [13]:
df.drop('CustomerID', inplace=True, axis=1)

### Feature Engineering

Feature Extraction

In [14]:
#create new column for feature
df['TotalVisiting'] = df['NumberOfPersonVisiting'] + df['NumberOfChildrenVisiting']
df.drop(columns=['NumberOfPersonVisiting', 'NumberOfChildrenVisiting'], axis = 1, inplace=True)

In [15]:
df.dtypes

ProdTaken                   int64
Age                       float64
TypeofContact              object
CityTier                    int64
DurationOfPitch           float64
Occupation                 object
Gender                     object
NumberOfFollowups         float64
ProductPitched             object
PreferredPropertyStar     float64
MaritalStatus              object
NumberOfTrips             float64
Passport                    int64
PitchSatisfactionScore      int64
OwnCar                      int64
Designation                object
MonthlyIncome             float64
TotalVisiting             float64
dtype: object

In [16]:
## Get all the numeric features
num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
print("Number of Numerical Features: ", len(num_features))
num_features

Number of Numerical Features:  12


['ProdTaken',
 'Age',
 'CityTier',
 'DurationOfPitch',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'Passport',
 'PitchSatisfactionScore',
 'OwnCar',
 'MonthlyIncome',
 'TotalVisiting']

In [17]:
## Get all the categorical features
cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']
print('Num of Categorical Features :', len(cat_features))

Num of Categorical Features : 6


In [18]:
## Get all the Discrete features
discrete_features = [feature for feature in num_features if len(df[feature].unique()) <= 25]
print('Num of Discrete Features :', len(discrete_features))

Num of Discrete Features : 9


In [19]:
## Get all the continuous features
discrete_features = [feature for feature in num_features if feature not in discrete_features]
print('Num of Discrete Features :', len(discrete_features))

Num of Discrete Features : 3


In [20]:
df.head()

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TotalVisiting
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,Manager,20993.0,3.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0
2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,4.0,Basic,3.0,Unmarried,7.0,1,3,0,Executive,17090.0,3.0
3,0,33.0,Company Invited,1,9.0,Salaried,Female,3.0,Basic,3.0,Divorced,2.0,1,5,1,Executive,17909.0,3.0
4,0,36.0,Self Enquiry,1,8.0,Small Business,Male,3.0,Basic,4.0,Divorced,1.0,0,5,1,Executive,18468.0,2.0


### Train Test Split, Model Training

In [21]:
from sklearn.model_selection import train_test_split
X = df.drop(['ProdTaken'], axis=1)
y = df['ProdTaken']

In [22]:
y.value_counts()

0    3968
1     920
Name: ProdTaken, dtype: int64

In [23]:
X.head()

Unnamed: 0,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TotalVisiting
0,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,Manager,20993.0,3.0
1,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0
2,37.0,Self Enquiry,1,8.0,Free Lancer,Male,4.0,Basic,3.0,Unmarried,7.0,1,3,0,Executive,17090.0,3.0
3,33.0,Company Invited,1,9.0,Salaried,Female,3.0,Basic,3.0,Divorced,2.0,1,5,1,Executive,17909.0,3.0
4,36.0,Self Enquiry,1,8.0,Small Business,Male,3.0,Basic,4.0,Divorced,1.0,0,5,1,Executive,18468.0,2.0


In [24]:
# separate dataset into train and test
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=10)
#X_train.shape, X_test.shape

#X = preprocessor.fit_transform(X)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)


In [25]:
# Create Column Transfer with 3 types of transformers
cat_features = X.select_dtypes(include="object").columns
num_features = X.select_dtypes(exclude="object").columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop="first", handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features)
    ]
)

X_transformed = preprocessor.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=10)

X_train = pd.DataFrame(X_train, columns=preprocessor.get_feature_names_out())
X_test = pd.DataFrame(X_test, columns=preprocessor.get_feature_names_out())


In [26]:
X_train

Unnamed: 0,OneHotEncoder__TypeofContact_Self Enquiry,OneHotEncoder__Occupation_Large Business,OneHotEncoder__Occupation_Salaried,OneHotEncoder__Occupation_Small Business,OneHotEncoder__Gender_Male,OneHotEncoder__ProductPitched_Deluxe,OneHotEncoder__ProductPitched_King,OneHotEncoder__ProductPitched_Standard,OneHotEncoder__ProductPitched_Super Deluxe,OneHotEncoder__MaritalStatus_Married,...,StandardScaler__CityTier,StandardScaler__DurationOfPitch,StandardScaler__NumberOfFollowups,StandardScaler__PreferredPropertyStar,StandardScaler__NumberOfTrips,StandardScaler__Passport,StandardScaler__PitchSatisfactionScore,StandardScaler__OwnCar,StandardScaler__MonthlyIncome,StandardScaler__TotalVisiting
0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.713871,-0.731307,-0.712434,0.529604,0.450515,-0.640524,0.675025,0.782392,-0.230570,-1.477450
1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,-0.713871,0.693889,0.289401,-0.725222,-0.601871,-0.640524,-1.521728,0.782392,0.356801,-0.063495
2,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,...,-0.713871,-0.731307,0.289401,0.529604,-0.601871,-0.640524,1.407276,0.782392,-0.268802,0.643483
3,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,-0.713871,-0.968840,-1.714268,-0.725222,-0.601871,-0.640524,-0.057226,0.782392,0.003580,2.057438
4,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,...,-0.713871,0.337590,1.291235,1.784431,0.450515,1.561221,1.407276,0.782392,0.951590,2.057438
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3905,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.713871,-0.256242,-0.712434,0.529604,-0.075678,-0.640524,-1.521728,-1.278132,-1.245154,0.643483
3906,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.713871,1.050188,0.289401,-0.725222,-0.075678,1.561221,1.407276,0.782392,-0.473088,-0.063495
3907,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,-0.713871,-0.731307,-0.712434,1.784431,1.502902,1.561221,-0.057226,-1.278132,-0.230570,-0.770473
3908,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.713871,0.100057,1.291235,1.784431,-0.601871,1.561221,-0.057226,-1.278132,-0.486974,0.643483


In [27]:
X_test

Unnamed: 0,OneHotEncoder__TypeofContact_Self Enquiry,OneHotEncoder__Occupation_Large Business,OneHotEncoder__Occupation_Salaried,OneHotEncoder__Occupation_Small Business,OneHotEncoder__Gender_Male,OneHotEncoder__ProductPitched_Deluxe,OneHotEncoder__ProductPitched_King,OneHotEncoder__ProductPitched_Standard,OneHotEncoder__ProductPitched_Super Deluxe,OneHotEncoder__MaritalStatus_Married,...,StandardScaler__CityTier,StandardScaler__DurationOfPitch,StandardScaler__NumberOfFollowups,StandardScaler__PreferredPropertyStar,StandardScaler__NumberOfTrips,StandardScaler__Passport,StandardScaler__PitchSatisfactionScore,StandardScaler__OwnCar,StandardScaler__MonthlyIncome,StandardScaler__TotalVisiting
0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.713871,-0.018709,-0.712434,-0.725222,-1.128065,-0.640524,-0.789477,0.782392,-1.099072,-1.477450
1,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.468369,0.100057,0.289401,0.529604,-0.601871,1.561221,1.407276,0.782392,-0.055385,-1.477450
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,-0.713871,-0.256242,-0.712434,1.784431,-1.128065,-0.640524,1.407276,0.782392,0.271206,-1.477450
3,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,-0.713871,1.168955,-0.712434,1.784431,-1.128065,1.561221,-0.057226,-1.278132,-0.584742,0.643483
4,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.468369,-0.018709,0.289401,-0.725222,2.029095,-0.640524,-0.057226,-1.278132,1.322882,2.057438
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
973,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,-0.713871,-0.137475,-0.712434,-0.725222,-0.601871,-0.640524,0.675025,0.782392,0.292320,-0.770473
974,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.713871,-0.256242,1.291235,-0.725222,-0.601871,-0.640524,-0.789477,0.782392,-0.512462,1.350460
975,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.468369,-0.968840,0.289401,0.529604,-0.075678,-0.640524,0.675025,0.782392,0.459135,2.057438
976,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.468369,-0.018709,-0.712434,-0.725222,-0.601871,-0.640524,0.675025,0.782392,-1.131598,-1.477450


### AdaBoost Classifier Training

In [28]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
    f1_score,
    ConfusionMatrixDisplay
)


In [29]:
from sklearn.metrics import roc_auc_score

models={
    "Logistic Regression":LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Adaboost": AdaBoostClassifier()
}
for name, model in models.items():
    model.fit(X_train, y_train)
    
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Training set performance
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average='weighted')
    model_train_precision = precision_score(y_train, y_train_pred)
    model_train_recall = recall_score(y_train, y_train_pred)
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred)

    # Test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted')
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred)

    print(name)
    print('Model performance for Training set')
    print(f"- Accuracy: {model_train_accuracy:.4f}")
    print(f"- F1 score: {model_train_f1:.4f}")
    print(f"- Precision: {model_train_precision:.4f}")
    print(f"- Recall: {model_train_recall:.4f}")
    print(f"- ROC AUC Score: {model_train_rocauc_score:.4f}")

    print(name)
    print('Model performance for Testing set')
    print(f"- Accuracy: {model_test_accuracy:.4f}")
    print(f"- F1 score: {model_test_f1:.4f}")
    print(f"- Precision: {model_test_precision:.4f}")
    print(f"- Recall: {model_test_recall:.4f}")
    print(f"- ROC AUC Score: {model_test_rocauc_score:.4f}")
    
    print("="*35)
    print("\n")


Logistic Regression
Model performance for Training set
- Accuracy: 0.8458
- F1 score: 0.8239
- Precision: 0.7097
- Recall: 0.3478
- ROC AUC Score: 0.6568
Logistic Regression
Model performance for Testing set
- Accuracy: 0.8476
- F1 score: 0.8279
- Precision: 0.5698
- Recall: 0.3043
- ROC AUC Score: 0.6295


Decision Tree
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- ROC AUC Score: 1.0000
Decision Tree
Model performance for Testing set
- Accuracy: 0.9100
- F1 score: 0.9127
- Precision: 0.6952
- Recall: 0.8075
- ROC AUC Score: 0.8688


Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- ROC AUC Score: 1.0000
Random Forest
Model performance for Testing set
- Accuracy: 0.9387
- F1 score: 0.9344
- Precision: 0.9391
- Recall: 0.6708
- ROC AUC Score: 0.8311


Adaboost
Model performance for Training set
- Accuracy: 0.8481
- F1 score: 0.8310
- Precision: 0.

In [30]:
## Hyperparameter Training

rf_params = {"max_depth":[5, 8, 15, None, 10],
            "max_features":[5, 7, "auto", 8],
             "min_samples_split": [2, 8, 15, 20],
            "n_estimators": [100, 200, 500, 1000]}
adaboost_params = {
    "n_estimators": [50, 60, 70, 80, 90],
    "algorithm": ["SAMME", "SAMME.R"]
    
}

In [31]:
## Model list for Hyperparameter tuning
randomcv_models = [
    ("RF", RandomForestClassifier(), rf_params),
    ("AB", AdaBoostClassifier(), adaboost_params)
]

In [32]:
randomcv_models

[('RF',
  RandomForestClassifier(),
  {'max_depth': [5, 8, 15, None, 10],
   'max_features': [5, 7, 'auto', 8],
   'min_samples_split': [2, 8, 15, 20],
   'n_estimators': [100, 200, 500, 1000]}),
 ('AB',
  AdaBoostClassifier(),
  {'n_estimators': [50, 60, 70, 80, 90], 'algorithm': ['SAMME', 'SAMME.R']})]

In [33]:
from sklearn.model_selection import RandomizedSearchCV

model_param = {}
for name, model, params in randomcv_models:
    random = RandomizedSearchCV(estimator=model,
                                param_distributions=params,
                                n_iter=100,
                                cv=3,
                                verbose=2,
                                n_jobs=-1)
    random.fit(X_train, y_train)
    model_param[name] = random.best_params_
    
for model_name in model_param:
    print(f"------------- Best Params for {model_name} -----------")
    print(model_param[model_name])

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Fitting 3 folds for each of 10 candidates, totalling 30 fits
------------- Best Params for RF -----------
{'n_estimators': 200, 'min_samples_split': 2, 'max_features': 8, 'max_depth': None}
------------- Best Params for AB -----------
{'n_estimators': 90, 'algorithm': 'SAMME'}
[CV] END max_depth=8, max_features=auto, min_samples_split=8, n_estimators=100; total time=   0.3s
[CV] END max_depth=None, max_features=7, min_samples_split=20, n_estimators=1000; total time=   4.1s
[CV] END max_depth=10, max_features=8, min_samples_split=20, n_estimators=100; total time=   0.6s
[CV] END max_depth=8, max_features=8, min_samples_split=20, n_estimators=100; total time=   0.6s
[CV] END max_depth=8, max_features=7, min_samples_split=20, n_estimators=1000; total time=   4.4s
[CV] END max_depth=15, max_features=8, min_samples_split=8, n_estimators=100; total time=   0.5s
[CV] END max_depth=15, max_features=8, min_samples_split=8, n_estimat

[CV] END max_depth=None, max_features=7, min_samples_split=20, n_estimators=1000; total time=   4.1s
[CV] END max_depth=15, max_features=auto, min_samples_split=8, n_estimators=500; total time=   2.2s
[CV] END max_depth=5, max_features=auto, min_samples_split=20, n_estimators=200; total time=   0.6s
[CV] END max_depth=None, max_features=7, min_samples_split=2, n_estimators=200; total time=   1.2s
[CV] END max_depth=15, max_features=8, min_samples_split=2, n_estimators=100; total time=   0.5s
[CV] END max_depth=8, max_features=auto, min_samples_split=20, n_estimators=500; total time=   2.1s
[CV] END max_depth=8, max_features=8, min_samples_split=8, n_estimators=100; total time=   0.3s
[CV] END max_depth=8, max_features=7, min_samples_split=15, n_estimators=1000; total time=   3.1s
[CV] END max_depth=8, max_features=8, min_samples_split=20, n_estimators=1000; total time=   3.3s
[CV] END max_depth=None, max_features=8, min_samples_split=20, n_estimators=200; total time=   0.8s
[CV] END ma

In [37]:
from sklearn.metrics import roc_auc_score

models={
    "Random Forest": RandomForestClassifier(n_estimators=1000,min_samples_split=2,max_features=8,max_depth=None),
    "Adaboost": AdaBoostClassifier(n_estimators= 80, algorithm= "SAMME")
}
for name, model in models.items():
    model.fit(X_train, y_train)
    
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Training set performance
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average='weighted')
    model_train_precision = precision_score(y_train, y_train_pred)
    model_train_recall = recall_score(y_train, y_train_pred)
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred)

    # Test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted')
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred)

    print(name)
    print('Model performance for Training set')
    print(f"- Accuracy: {model_train_accuracy:.4f}")
    print(f"- F1 score: {model_train_f1:.4f}")
    print(f"- Precision: {model_train_precision:.4f}")
    print(f"- Recall: {model_train_recall:.4f}")
    print(f"- ROC AUC Score: {model_train_rocauc_score:.4f}")

    print(name)
    print('Model performance for Testing set')
    print(f"- Accuracy: {model_test_accuracy:.4f}")
    print(f"- F1 score: {model_test_f1:.4f}")
    print(f"- Precision: {model_test_precision:.4f}")
    print(f"- Recall: {model_test_recall:.4f}")
    print(f"- ROC AUC Score: {model_test_rocauc_score:.4f}")
    
    print("="*35)
    print("\n")


Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- ROC AUC Score: 1.0000
Random Forest
Model performance for Testing set
- Accuracy: 0.9468
- F1 score: 0.9437
- Precision: 0.9504
- Recall: 0.7143
- ROC AUC Score: 0.8535


Adaboost
Model performance for Training set
- Accuracy: 0.8448
- F1 score: 0.8160
- Precision: 0.7568
- Recall: 0.2951
- ROC AUC Score: 0.6361
Adaboost
Model performance for Testing set
- Accuracy: 0.8538
- F1 score: 0.8237
- Precision: 0.6552
- Recall: 0.2360
- ROC AUC Score: 0.6058




In [None]:
#Plot ROC AUC curve