<font style = "font-family: Arial; font-weight:bold;font-size:2em;color:blue;">Sean Dhanasar</font>

<font style = "font-family: Arial; font-weight:bold;font-size:2em;color:blue;">Project 5: Ensemble Techniques - Travel Package Purchase Prediction</font>


This *Ensemble Techniques - Travel Package Purchase Prediction Project* is the fifth project assignment for the  
**Post Graduate Programme in Data Science and Business Analytics certificate (PGP-DSBA)**  
**University of Texas at Austin**

<font style = "font-family: Arial; font-weight:bold;font-size:2em;color:blue;">Context</font>

"**Visit with us**" travel company wants to retain its customers for a longer time period by launching a long-term travel package.  

The company launched a holiday package last year and **18%** of the customers purchased it; however, *the marketing cost was quite high* because *customers were contacted at random without looking at the available information*.  

The company is now planning to launch a new product, a long-term travel package, and this time it wants to *utilize previously available data* to reduce the marketing cost.  

You as a data scientist at "Visit with us" travel company have to *analyze the trend of existing customers' data* and information to *provide recommendations to the marketing team* and also *build a model to predict which customer is potentially going to purchase the long-term travel package*.


<font style = "font-family: Arial; font-weight:bold;font-size:2em;color:blue;">Objective</font>


1. To predict which customer is more likely to purchase the long-term travel package.

<font style = "font-family: Arial; font-weight:bold;font-size:2em;color:blue;">Data Dictionary</font>

**Customer details:**
1. **CustomerID**: Unique customer ID
2. **ProdTaken**: Product taken flag
3. **Age**: Age of customer
4. **PreferredLoginDevice**: Preferred login device of the customer in last month
5. **CityTier**: City tier
6. **Occupation**: Occupation of customer
7. **Gender**: Gender of customer
8. **NumberOfPersonVisited**: Total number of persons accompanying the customer
9. **PreferredPropertyStar**: Preferred hotel property rating by customer
10. **MaritalStatus**: Marital status of customer
11. **NumberOfTrips**: Average number of trips per year taken by the customer
12. **Passport**: Customer passport flag
13. **OwnCar**: Customer owns a car flag
14. **NumberOfChildrenVisited**: Total number of children accompanying the customer
15. **Designation**: Designation of the customer in the current organization
16. **MonthlyIncome**: Gross monthly income of the customer

**Customer interaction data:**
1. **PitchSatisfactionScore**: Sales pitch satisfactory score
2. **ProductPitched**: Product pitched by a salesperson
3. **NumberOfFollowups**: Total number of follow-ups done by the salesperson after the sales pitch
4. **DurationOfPitch**: Duration of the pitch by a salesperson to the customer


<font style = "font-family: Arial; font-weight:bold;font-size:2em;color:blue;">Expectation</font>

Perform an Exploratory Data Analysis on the data
- Univariate analysis 
- Bivariate analysis 
- Visualizations to identify the patterns and insights 
- Any other exploratory deep dive
- Illustrate the insights based on EDA
- Key meaningful observations on the relationship between variables

Data pre-processing
- Prepare the data for analysis 
- Missing value Treatment, Outlier Treatment, Feature Engineering

Data Preparation
- Prepare data for modelling and check the split

Model building - Bagging
- Bagging classifier
- Random forest
- Decision tree
- Build the model and comment on the model statistics
- Model performance evaluation and improvement
    - Select and explain appropriate metric for model performance evaluation
    - Discuss model performance and explore options to improve performance.

Model building - Boosting
- Adaboost, 
- Gradient Boosting 
- XGboost
- Stacking classifier
- Build the model and comment on the model statistics
- Model performance evaluation and improvement
    - Select and explain appropriate metric for model performance evaluation
    - Discuss model performance and explore options to improve performance.

Actionable Insights & Recommendations
- Compare Bagging & Boosting models 
- Business recommendations and insights

---

# Import all the necessary libraries

In [1]:
# Basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as spy
from scipy import stats
# from scipy.stats import zscore, norm, randint
%matplotlib inline
import copy

In [2]:
# Impute and Encode
from sklearn.impute import SimpleImputer
# from sklearn.preprocessing import LabelEncoder

In [3]:
# Modelling - Preparation, Metrics, Classifiers

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import f1_score, precision_score, recall_score

from sklearn import metrics, tree

from sklearn.tree import DecisionTreeClassifier

# Ensemble Methods Classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import GradientBoostingClassifier

#To install xgboost library use - !pip install xgboost
from xgboost import XGBClassifier

# from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
# from sklearn.linear_model import LogisticRegression

# from sklearn.metrics import roc_auc_score, roc_curve
# from sklearn.metrics import precision_recall_curve

# import statsmodels.api as sm
# from statsmodels.tools.tools import add_constant
# from statsmodels.stats.outliers_influence import variance_inflation_factor

In [4]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Pandas display settings - rows & columns

# Display all rows
# pd.options.display.max_rows = 10000

# Display all columns
pd.set_option("display.max_columns", None)

# Data ingestion 

Initially there were issues reading the **xlsx** file: after its most recent update, **xlrd** could no longer read XLSX files.

I had to update the Pandas library to the most recent version, 1.2.1 (Jan 20, 2021).

In [6]:
# Check to see Pandas version is 1.2.1
print("The version of Pandas library used in this notebook is: ", pd.__version__)

if pd.__version__ != "1.2.1":
    print("Pandas library need to be updated to version 1.2.1")
    # !pip install --upgrade pandas

The version of Pandas library used in this notebook is:  1.1.3
Pandas library need to be updated to version 1.2.1
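
The exact-match check above also flags any release newer than 1.2.1. A minimal sketch of a "minimum version" comparison is shown below. `version_tuple` and `needs_upgrade` are hypothetical helpers introduced here for illustration (they are not part of pandas), and the parsing assumes plain `major.minor.patch` version strings.

```python
from itertools import takewhile


def version_tuple(version):
    """Turn a string such as '1.2.1' into a comparable tuple (1, 2, 1).

    Hypothetical helper: only the leading digits of the first three
    dot-separated pieces are used, so '1.2.0rc1' becomes (1, 2, 0).
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def needs_upgrade(installed, minimum="1.2.1"):
    """Return True only when the installed version is older than the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


print(needs_upgrade("1.1.3"))  # version reported in this notebook
print(needs_upgrade("1.2.1"))  # the target version
print(needs_upgrade("1.3.0"))  # a newer release, which the != check would flag
```

In the notebook, `needs_upgrade(pd.__version__)` could stand in for the `!=` test. Passing `engine="openpyxl"` to `pd.read_excel` is another way to read XLSX files, assuming the openpyxl package is installed.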


In [7]:
# Load dataset
data = pd.read_excel('Tourism.xlsx', sheet_name='Tourism')

# **Data Inspection**

**Preview dataset**

In [8]:
# Preview the dataset
# View the first 5, last 5 and random 10 rows
print('First five rows', '--'*55)
display(data.head())

print('Last five rows', '--'*55)
display(data.tail())

print('Random ten rows', '--'*55)
np.random.seed(1)
display(data.sample(n=10))

First five rows --------------------------------------------------------------------------------------------------------------


Unnamed: 0,CustomerID,ProdTaken,Age,PreferredLoginDevice,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisited,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Super Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Super Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Multi,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Multi,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Multi,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


Last five rows --------------------------------------------------------------------------------------------------------------


Unnamed: 0,CustomerID,ProdTaken,Age,PreferredLoginDevice,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisited,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,Designation,MonthlyIncome
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Super Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Multi,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Multi,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0
4887,204887,1,36.0,Self Enquiry,1,14.0,Salaried,Male,4,4.0,Multi,4.0,Unmarried,3.0,1,3,1,2.0,Executive,24041.0


Random ten rows --------------------------------------------------------------------------------------------------------------


Unnamed: 0,CustomerID,ProdTaken,Age,PreferredLoginDevice,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisited,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,Designation,MonthlyIncome
3015,203015,0,27.0,Company Invited,1,7.0,Salaried,Female,4,6.0,Multi,3.0,Married,5.0,0,4,1,3.0,Executive,23042.0
1242,201242,0,40.0,Self Enquiry,3,13.0,Small Business,Male,2,3.0,King,4.0,Single,2.0,0,4,1,,VP,34833.0
3073,203073,0,29.0,Self Enquiry,2,15.0,Small Business,Male,4,5.0,Multi,3.0,Married,3.0,0,2,0,2.0,Executive,23614.0
804,200804,0,48.0,Company Invited,1,6.0,Small Business,Male,2,1.0,Deluxe,3.0,Single,3.0,0,2,0,0.0,AVP,31885.0
3339,203339,0,32.0,Self Enquiry,1,18.0,Small Business,Male,4,4.0,Super Deluxe,5.0,Divorced,3.0,1,2,0,3.0,Manager,25511.0
3080,203080,1,36.0,Company Invited,1,32.0,Salaried,Female,4,4.0,Multi,4.0,Married,3.0,1,3,0,1.0,Executive,20700.0
2851,202851,0,46.0,Self Enquiry,1,17.0,Salaried,Male,4,4.0,Multi,3.0,Divorced,5.0,0,5,1,1.0,Executive,21332.0
2883,202883,1,32.0,Company Invited,1,27.0,Salaried,Male,4,4.0,Standard,3.0,Divorced,5.0,0,3,1,1.0,Senior Manager,28502.0
1676,201676,0,22.0,Self Enquiry,1,11.0,Salaried,Male,2,1.0,Multi,4.0,Married,2.0,1,4,1,0.0,Executive,17328.0
1140,201140,0,44.0,Self Enquiry,1,13.0,Small Business,Female,2,3.0,King,3.0,Married,1.0,1,4,1,1.0,VP,34049.0


- `CustomerID` is a row identifier and does not add any value. This variable will be removed later.
- There are missing values in the dataset, as indicated by the **NaN** in the `Age` variable.

## Variable List

In [9]:
# Display list of variables in dataset
variable_list = data.columns.tolist()
print(variable_list)

['CustomerID', 'ProdTaken', 'Age', 'PreferredLoginDevice', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisited', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisited', 'Designation', 'MonthlyIncome']


## Dataset shape

In [10]:
shape = data.shape
n_rows = shape[0]
n_cols = shape[1]
print(f"The Dataframe consists of '{n_rows}' rows and '{n_cols}' columns")

The Dataframe consists of '4888' rows and '20' columns


**Data types**

In [11]:
# Check the data types
data.dtypes

CustomerID                   int64
ProdTaken                    int64
Age                        float64
PreferredLoginDevice        object
CityTier                     int64
DurationOfPitch            float64
Occupation                  object
Gender                      object
NumberOfPersonVisited        int64
NumberOfFollowups          float64
ProductPitched              object
PreferredPropertyStar      float64
MaritalStatus               object
NumberOfTrips              float64
Passport                     int64
PitchSatisfactionScore       int64
OwnCar                       int64
NumberOfChildrenVisited    float64
Designation                 object
MonthlyIncome              float64
dtype: object

**Data info**

In [12]:
# Get info of the dataframe columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CustomerID               4888 non-null   int64  
 1   ProdTaken                4888 non-null   int64  
 2   Age                      4662 non-null   float64
 3   PreferredLoginDevice     4863 non-null   object 
 4   CityTier                 4888 non-null   int64  
 5   DurationOfPitch          4637 non-null   float64
 6   Occupation               4888 non-null   object 
 7   Gender                   4888 non-null   object 
 8   NumberOfPersonVisited    4888 non-null   int64  
 9   NumberOfFollowups        4843 non-null   float64
 10  ProductPitched           4888 non-null   object 
 11  PreferredPropertyStar    4862 non-null   float64
 12  MaritalStatus            4888 non-null   object 
 13  NumberOfTrips            4748 non-null   float64
 14  Passport                

- Six (6) variables have been identified as the pandas `object` type. These shall be converted to the `category` type.

**Convert Pandas Objects to Category type**

In [13]:
# Convert variables with "object" type to "category" type
for i in data.columns:
    if data[i].dtypes == "object":
        data[i] = data[i].astype("category") 

# Confirm that there are no variables left with "object" type
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   CustomerID               4888 non-null   int64   
 1   ProdTaken                4888 non-null   int64   
 2   Age                      4662 non-null   float64 
 3   PreferredLoginDevice     4863 non-null   category
 4   CityTier                 4888 non-null   int64   
 5   DurationOfPitch          4637 non-null   float64 
 6   Occupation               4888 non-null   category
 7   Gender                   4888 non-null   category
 8   NumberOfPersonVisited    4888 non-null   int64   
 9   NumberOfFollowups        4843 non-null   float64 
 10  ProductPitched           4888 non-null   category
 11  PreferredPropertyStar    4862 non-null   float64 
 12  MaritalStatus            4888 non-null   category
 13  NumberOfTrips            4748 non-null   float64 
 14  Passport

- The memory usage has decreased from 764 KB to 565 KB
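
The saving comes from how the `category` type stores data: each distinct label is kept once, and every row holds only a small integer code. The sketch below illustrates this on a toy column; the values and the name `toy` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column with a few repeated labels (illustrative values only)
toy = pd.DataFrame(
    {"Occupation": ["Salaried", "Small Business", "Salaried", "Free Lancer"] * 1000}
)

# Memory used when the labels are stored as Python strings
as_object = toy["Occupation"].memory_usage(deep=True)

# Memory used when the labels are stored as category codes
occupation_cat = toy["Occupation"].astype("category")
as_category = occupation_cat.memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
print("dtype of the stored codes:", occupation_cat.cat.codes.dtype)
```

The fewer distinct labels a column has relative to its length, the larger the saving tends to be.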

**Missing value summary function**

In [14]:
def missing_val_chk(data):
    """
    Check a dataframe for missing values
    and display a summary of the affected columns.
    """
    if data.isnull().values.any():
        # Number of missing values in each column, largest first
        missing_vals = pd.DataFrame(data.isnull().sum().sort_values(
            ascending=False)).rename(columns={0: '# missing'})

        # Percentage of rows missing in each column
        missing_vals['percent'] = ((missing_vals['# missing'] / len(data)) *
                                   100).round(decimals=3)

        # Keep only the columns that have missing values
        missing_vals = missing_vals[missing_vals['# missing'] != 0]

        # Display the missing value dataframe
        print("The missing values summary")
        display(missing_vals)
    else:
        print("There are NO missing values in the dataset")

## Missing Values Check

In [15]:
#Applying the missing value summary function
missing_val_chk(data)

The missing values summary


| Variable | # missing | percent |
| :-: | :-: | :-: |
| DurationOfPitch | 251 | 5.135 |
| MonthlyIncome | 233 | 4.767 |
| Age | 226 | 4.624 |
| NumberOfTrips | 140 | 2.864 |
| NumberOfChildrenVisited | 66 | 1.35 |
| NumberOfFollowups | 45 | 0.921 |
| PreferredPropertyStar | 26 | 0.532 |
| PreferredLoginDevice | 25 | 0.511 |
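
The same kind of summary can be sketched more compactly with `isnull().sum()` and `isnull().mean()`. The example below runs on a small made-up dataframe (`toy` and its values are assumptions for illustration); it is an alternative sketch, not the function used in this notebook.

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe with a few missing entries
toy = pd.DataFrame({
    "Age": [41.0, 49.0, np.nan, 33.0],
    "Gender": ["Female", "Male", "Male", "Female"],
    "MonthlyIncome": [20993.0, np.nan, np.nan, 17909.0],
})

# Count and percentage of missing values per column
summary = pd.DataFrame({
    "# missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(3),
})

# Keep only the affected columns, largest count first
summary = summary[summary["# missing"] > 0].sort_values(
    "# missing", ascending=False)
print(summary)
```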


***

## 5 Point Summary

**Numerical type Summary**

In [16]:
# Five point summary of all numerical type variables in the dataset
data.describe().T

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| CustomerID | 4888.0 | 202443.5 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| NumberOfPersonVisited | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


- `ProdTaken` is a binary variable with 18.8% of the rows having a value of 1
- `Age` is fairly symmetrical, with the *mean* (37.6) and *median* (36) being close
- `CityTier` is a categorical ordinal variable with three states
- `DurationOfPitch` is a numerical variable that is highly right skewed: the maximum (127) lies far above the 75th percentile (20)
- `NumberOfPersonVisited` is a categorical ordinal variable with five states
- `PreferredPropertyStar` is a categorical ordinal variable with three states
- `NumberOfTrips` is a numerical variable that is highly right skewed: the maximum (22) lies far above the 75th percentile (4)
- `Passport` is a binary variable with 29.1% of the rows having a value of 1
- `PitchSatisfactionScore` is a categorical ordinal variable with five states
- `OwnCar` is a binary variable with 62% of the rows having a value of 1
- `NumberOfChildrenVisited` is a categorical ordinal variable with four states (0 to 3)
- `MonthlyIncome` is a numerical variable that is highly right skewed: the maximum (98678) lies far above the 75th percentile (25571)

**Categorical type Summary**

In [17]:
data.describe(include=['category']).T

| Variable | count | unique | top | freq |
| :-: | :-: | :-: | :-: | :-: |
| PreferredLoginDevice | 4863 | 2 | Self Enquiry | 3444 |
| Occupation | 4888 | 4 | Salaried | 2368 |
| Gender | 4888 | 3 | Male | 2916 |
| ProductPitched | 4888 | 5 | Multi | 1842 |
| MaritalStatus | 4888 | 4 | Married | 2340 |
| Designation | 4888 | 5 | Executive | 1842 |


- `Gender` has three states which seems a bit odd. Further investigation will be done.

---

**Number of unique states for all variables**

In [18]:
# Check the unique values
data.nunique().to_frame()

| Variable | 0 |
| :-: | :-: |
| CustomerID | 4888 |
| ProdTaken | 2 |
| Age | 44 |
| PreferredLoginDevice | 2 |
| CityTier | 3 |
| DurationOfPitch | 34 |
| Occupation | 4 |
| Gender | 3 |
| NumberOfPersonVisited | 5 |
| NumberOfFollowups | 6 |


* `Age`, `DurationOfPitch`, `NumberOfTrips` & `MonthlyIncome` are numerical variables

---

**Categorical Variable Identification**

The following variables are treated as **categorical**, although several of them are stored as numbers:
* `CustomerID`
* `ProdTaken`
* `PreferredLoginDevice`
* `CityTier`
* `Occupation`
* `Gender`
* `NumberOfPersonVisited`
* `NumberOfFollowups` 
* `ProductPitched`
* `PreferredPropertyStar`
* `MaritalStatus`
* `Passport`
* `PitchSatisfactionScore`
* `OwnCar`
* `NumberOfChildrenVisited`
* `Designation`

---

**Create a list of numerical variables**

In [19]:
numerical_vars = ['Age', 'DurationOfPitch', 'NumberOfTrips', 'MonthlyIncome']

**Create a list of categorical variables**

In [20]:
categorical_vars = [
    'CustomerID', 'ProdTaken', 'PreferredLoginDevice', 'CityTier',
    'Occupation', 'Gender', 'NumberOfPersonVisited', 'NumberOfFollowups',
    'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'Passport',
    'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisited',
    'Designation'
]

---

## Numerical data

In [21]:
data[numerical_vars].describe().T

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |


### Skew Summary

In [22]:
# Display the skew summary for the numerical variables
for var in numerical_vars:
    var_skew = data[var].skew()
    # Branches are checked in order, so a skew of exactly 1 or -1
    # is reported as moderate rather than falling through to symmetrical
    if var_skew > 1:
        print(f"The '{var}' distribution is highly right skewed.\n")
    elif var_skew < -1:
        print(f"The '{var}' distribution is highly left skewed.\n")
    elif var_skew > 0.5:
        print(f"The '{var}' distribution is moderately right skewed.\n")
    elif var_skew < -0.5:
        print(f"The '{var}' distribution is moderately left skewed.\n")
    else:
        print(f"The '{var}' distribution is fairly symmetrical.\n")

The 'Age' distribution is fairly symmetrical.

The 'DurationOfPitch' distribution is highly right skewed.

The 'NumberOfTrips' distribution is highly right skewed.

The 'MonthlyIncome' distribution is highly right skewed.
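
For reference, the sketch below computes a bias-corrected sample skewness by hand, $G_1 = \frac{n}{(n-1)(n-2)} \sum_i \left(\frac{x_i - \bar{x}}{s}\right)^3$, where $s$ is the sample standard deviation. It assumes this is the estimator behind `Series.skew()` (my understanding of the pandas default); `sample_skew` is a hypothetical helper and the input lists are made-up toy values.

```python
from statistics import mean, stdev


def sample_skew(values):
    """Bias-corrected sample skewness (hypothetical helper for illustration)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


symmetric = [1, 2, 3, 4, 5]       # evenly spread around the mean
right_tailed = [1, 2, 3, 4, 50]   # one large value pulls the tail right

print(sample_skew(symmetric))
print(sample_skew(right_tailed))
```

A single large value is enough to push the result past the "highly right skewed" threshold of 1 used above, which is consistent with the large maxima seen for `DurationOfPitch`, `NumberOfTrips` and `MonthlyIncome`.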



**Outlier check function**

In [23]:
# Outlier check
def outlier_count(data):
    """
    This function checks the lower and upper 
    outliers for all numerical variables.
    
    Outliers are found where data points exists either:
    - Greater than `1.5*IQR` above the 75th percentile
    - Less than `1.5*IQR` below the 25th percentile
    """
    numeric = data.select_dtypes(include=np.number).columns.to_list()
    for i in numeric:
        # Get name of series
        name = data[i].name
        # Calculate the IQR for all values and omit NaNs
        IQR = spy.stats.iqr(data[i], nan_policy="omit")
        # Calculate the boxplot upper fence
        upper_fence = data[i].quantile(0.75) + 1.5 * IQR
        # Calculate the boxplot lower fence
        lower_fence = data[i].quantile(0.25) - 1.5 * IQR
        # Calculate the count of outliers above upper fence
        upper_outliers = data[i][data[i] > upper_fence].count()
        # Calculate the count of outliers below lower fence
        lower_outliers = data[i][data[i] < lower_fence].count()
        # Check if there are no outliers
        if (upper_outliers == 0) & (lower_outliers == 0):
            continue
        print(
            f"The '{name}' distribution has '{lower_outliers}' lower outliers and '{upper_outliers}' upper outliers.\n"
        )

### Outlier check

In [24]:
#Applying the Outlier check function for the sub-dataframe of numerical variables
outlier_count(data[numerical_vars])

The 'DurationOfPitch' distribution has '0' lower outliers and '2' upper outliers.

The 'NumberOfTrips' distribution has '0' lower outliers and '109' upper outliers.

The 'MonthlyIncome' distribution has '2' lower outliers and '343' upper outliers.
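
As a worked example of the fence rule, the sketch below recomputes the lower and upper fences from the quartiles reported in the numerical summary above. `iqr_fences` is a hypothetical helper introduced here for illustration; the quartile values are copied from the `describe()` output.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) boxplot fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 25th and 75th percentiles taken from the describe() output
quartiles = {
    "DurationOfPitch": (9.0, 20.0),
    "NumberOfTrips": (2.0, 4.0),
    "MonthlyIncome": (20346.0, 25571.0),
}

fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}

for name, (lower, upper) in fences.items():
    print(f"{name}: lower fence = {lower}, upper fence = {upper}")
```

The `MonthlyIncome` minimum of 1000 lies below its lower fence, while the minimums of `DurationOfPitch` (5) and `NumberOfTrips` (1) lie above theirs, which agrees with the lower outlier counts reported above.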



### Numerical Variable Summary

| Variable| Skew | Outliers | 
| :-: | :-: | :-: |
| **Age** | Fairly symmetrical | No Outliers | 
| **DurationOfPitch** | Highly right skewed | 2 Upper Outliers | 
| **NumberOfTrips** | Highly right skewed | 109 Upper Outliers |
| **MonthlyIncome** | Highly right skewed | 2 Lower & 343 Upper Outliers |

---

## Categorical data

### Unique states

**Detailed investigation of unique values**

In [25]:
# Display the unique values for all categorical variables
for i in categorical_vars:
    print('Unique values in',i, 'are :')
    print(data[i].value_counts())
    print('--'*55)

Unique values in CustomerID are :
200702    1
201479    1
203514    1
201467    1
203518    1
         ..
204257    1
200163    1
202212    1
204261    1
204800    1
Name: CustomerID, Length: 4888, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in ProdTaken are :
0    3968
1     920
Name: ProdTaken, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in PreferredLoginDevice are :
Self Enquiry       3444
Company Invited    1419
Name: PreferredLoginDevice, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in CityTier are :
1    3190
3    1500
2     198
Name: CityTier, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Occupation are :
Sala

- `Gender` - There is an additional state, **"Fe Male"**. This is interpreted as a data entry error, so all instances of **"Fe Male"** will be **replaced** by **"Female"**.
- `MaritalStatus` - There are two states **"Single"** and **"Unmarried"** which are similar in certain contexts but will be left unchanged as such in the EDA.

---

**Replacing "Fe Male" with "Female"**

In [26]:
# Replace "Fe Male" with "Female"
data['Gender'] = data['Gender'].replace({'Fe Male':'Female'})

In [27]:
# Check states in "Gender"
data['Gender'].value_counts()

Male      2916
Female    1972
Name: Gender, dtype: int64

---

### Categorical Variable Summary

Several of the categorical variables are stored in numeric format.

| Variable| Type | Range | 
| :-: | :-: | :-: |
| **CustomerID** |  Nominal | 200000-204887 |
| **ProdTaken**| Nominal | Binary |
| **PreferredLoginDevice**| Nominal | 2 states |
| **CityTier**| Ordinal | 3 states |
| **Occupation**| Nominal | 4 states |
| **Gender**| Nominal | 2 states |
| **NumberOfPersonVisited**| Ordinal | 5 states |
| **NumberOfFollowups**| Ordinal | 6 states |
| **ProductPitched**| Nominal | 5 states |
| **PreferredPropertyStar**| Ordinal | 3 states |
| **MaritalStatus**| Nominal | 4 states |
| **Passport**| Nominal | Binary |
| **PitchSatisfactionScore**| Ordinal | 5 states |
| **OwnCar**| Nominal | Binary |
| **NumberOfChildrenVisited**| Ordinal | 4 states |
| **Designation**| Nominal | 5 states |

---

## Target Variable

Target variable is **`ProdTaken`**

In [28]:
# Checking the distribution of target variable

# Count the different "ProdTaken" states
count = data["ProdTaken"].value_counts()
# Calculate the percentage of each "ProdTaken" state
percentage = data['ProdTaken'].value_counts(normalize=True) * 100
# Join count and percentage series
target_dist = pd.concat([count, percentage], axis=1)
# Set column names
target_dist.columns = ['count', 'percentage']
# Set Index name
target_dist.index.name = "ProdTaken"
# Display target distribution dataframe
target_dist

| ProdTaken | count | percentage |
| :-: | :-: | :-: |
| 0 | 3968 | 81.178396 |
| 1 | 920 | 18.821604 |


**Out of the 4888 customers, only 18.8% purchased the travel package in the previous campaign**

<font color='red'>The target variable is **moderately imbalanced**</font>
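
One consequence of this imbalance can be sketched with simple arithmetic: a model that always predicts the majority class already scores a high accuracy while finding no purchasers at all. The class counts below are taken from the table above; the "always predict the majority class" baseline is an illustration only and is not a model built in this notebook.

```python
# Class counts of ProdTaken from the target distribution table
counts = {0: 3968, 1: 920}
total = sum(counts.values())

# A naive baseline that always predicts the most frequent class
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Such a baseline never predicts class 1, so it identifies no purchasers
purchasers_found = counts[1] if majority_class == 1 else 0

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Purchasers identified: {purchasers_found} of {counts[1]}")
```

This is one reason accuracy alone can be misleading here, and why metrics such as recall, precision and F1 are worth examining alongside it.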

---

**Dropping the `CustomerID` variable**

We shall drop the `CustomerID` variable as it does not add any value to the dataset.

In [29]:
# Drop CustomerID column inplace
data.drop(columns = 'CustomerID', inplace=True)

# Remove CustomerID from "categorical_vars" list
categorical_vars.remove('CustomerID')

---
