# Travel Insurance Prediction Using Binary Logistic Regression
Created on 30/06/2023

#### Dataset
**`TravelInsurancePrediction.csv`: https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data**

A tour and travel company offers a travel insurance package to its customers. The company needs to know which customers would be interested in buying it, based on its historical database. Data is provided for almost 2000 previous customers, and the task is to build a model that predicts whether a customer will be interested in buying the travel insurance package based on certain parameters.

### Import Library

In [None]:
install.packages("lmtest")
install.packages("car")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘zoo’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘numDeriv’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘Rcpp’, ‘RcppEigen’, ‘carData’, ‘abind’, ‘pbkrtest’, ‘quantreg’, ‘lme4’




In [None]:
install.packages("corrplot")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(lmtest)
library(car)
library(corrplot)

Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Loading required package: carData

corrplot 0.92 loaded



In [None]:
library(dplyr)


Attaching package: ‘dplyr’


The following object is masked from ‘package:car’:

    recode


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




# Binary Classification

### Dataset

In [None]:
travel <- read.csv("TravelInsurancePrediction.csv")
travel <- travel[, -1]
head(travel)

| | Age | Employment.Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad | TravelInsurance |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` |
| 1 | 31 | Government Sector | Yes | 400000 | 6 | 1 | No | No | 0 |
| 2 | 31 | Private Sector/Self Employed | Yes | 1250000 | 7 | 0 | No | No | 0 |
| 3 | 34 | Private Sector/Self Employed | Yes | 500000 | 4 | 1 | No | No | 1 |
| 4 | 28 | Private Sector/Self Employed | Yes | 700000 | 3 | 1 | No | No | 0 |
| 5 | 28 | Private Sector/Self Employed | Yes | 700000 | 8 | 1 | Yes | No | 0 |
| 6 | 25 | Private Sector/Self Employed | No | 1150000 | 4 | 0 | No | No | 0 |


In [None]:
sum(is.na(travel))

### Chi-Square Independence Test

In [None]:
chisq.test(travel$Employment.Type, travel$TravelInsurance)


	Pearson's Chi-squared test with Yates' continuity correction

data:  travel$Employment.Type and travel$TravelInsurance
X-squared = 42.754, df = 1, p-value = 6.208e-11


In [None]:
chisq.test(travel$GraduateOrNot, travel$TravelInsurance)


	Pearson's Chi-squared test with Yates' continuity correction

data:  travel$GraduateOrNot and travel$TravelInsurance
X-squared = 0.60551, df = 1, p-value = 0.4365


In [None]:
chisq.test(travel$ChronicDiseases, travel$TravelInsurance)


	Pearson's Chi-squared test with Yates' continuity correction

data:  travel$ChronicDiseases and travel$TravelInsurance
X-squared = 0.57541, df = 1, p-value = 0.4481


In [None]:
chisq.test(travel$FrequentFlyer, travel$TravelInsurance)


	Pearson's Chi-squared test with Yates' continuity correction

data:  travel$FrequentFlyer and travel$TravelInsurance
X-squared = 105.86, df = 1, p-value < 2.2e-16


In [None]:
chisq.test(travel$EverTravelledAbroad, travel$TravelInsurance)


	Pearson's Chi-squared test with Yates' continuity correction

data:  travel$EverTravelledAbroad and travel$TravelInsurance
X-squared = 370.56, df = 1, p-value < 2.2e-16


The chi-squared test of independence checks whether each categorical predictor is associated with the response variable, Travel Insurance. It was run for Employment Type, Graduate Or Not, Chronic Diseases, Frequent Flyer, and Ever Traveled Abroad. <br><br>At the 0.05 significance level, the null hypothesis of independence is not rejected only for **Graduate Or Not** (p = 0.4365) and **Chronic Diseases** (p = 0.4481). Employment Type, Frequent Flyer, and Ever Traveled Abroad all have p-values far below 0.05, so they are significantly associated with Travel Insurance.
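The per-variable tests can also be run in one pass. The following is a minimal sketch on simulated stand-in data, not the Kaggle file: the `demo` data frame, its probabilities, and the seed are assumptions made only for illustration.

```r
# Simulated stand-in data (hypothetical values, not the real dataset)
set.seed(42)
n <- 500
demo <- data.frame(
  FrequentFlyer = sample(c("No", "Yes"), n, replace = TRUE),
  GraduateOrNot = sample(c("No", "Yes"), n, replace = TRUE)
)

# The simulated outcome depends on FrequentFlyer only
p_buy <- ifelse(demo$FrequentFlyer == "Yes", 0.6, 0.3)
demo$TravelInsurance <- rbinom(n, 1, p_buy)

# Run the chi-squared independence test for each categorical predictor
categorical <- c("FrequentFlyer", "GraduateOrNot")
p_values <- sapply(categorical, function(v) {
  chisq.test(table(demo[[v]], demo$TravelInsurance))$p.value
})
print(p_values)

# Predictors for which independence is rejected at the 0.05 level
dependent <- names(p_values)[p_values < 0.05]
print(dependent)
```

Collecting the p-values in a named vector makes it easy to compare predictors side by side instead of reading five separate outputs.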

### Pearson Correlation Test

In [None]:
# Age
print(cor.test(travel$Age, travel$TravelInsurance)$estimate)
print(cor.test(travel$Age, travel$TravelInsurance)$p.value)

       cor 
0.06105985 
[1] 0.006476684


In [None]:
# Annual Income
print(cor.test(travel$AnnualIncome, travel$TravelInsurance)$estimate)
print(cor.test(travel$AnnualIncome, travel$TravelInsurance)$p.value)

      cor 
0.3967632 
[1] 6.635018e-76


In [None]:
# Family Members
print(cor.test(travel$FamilyMembers, travel$TravelInsurance)$estimate)
print(cor.test(travel$FamilyMembers, travel$TravelInsurance)$p.value)

       cor 
0.07990901 
[1] 0.000363208


The Pearson correlation test measures the strength of linear association between the numeric predictors Age, Annual Income, and Family Members and the response variable Travel Insurance. <br><br>All three predictors have p-values below 0.05, so each correlation is statistically significant. This indicates an association with Travel Insurance, not a causal effect. The correlation coefficient shows the strength of that association: **Age** (0.06) and **Family Members** (0.08) are weakly correlated (below 0.3), while **Annual Income** (0.40) shows a low to moderate correlation (between 0.3 and 0.5).
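Because Travel Insurance is binary, the Pearson coefficient here is a point-biserial correlation, and its significance test is equivalent to a pooled-variance two-sample t-test comparing the numeric predictor between the two outcome groups. The sketch below illustrates this on simulated data; the `income` and `buy` vectors and their parameters are hypothetical.

```r
# Simulated stand-in data (hypothetical values, not the real dataset)
set.seed(1)
n <- 400
income <- rnorm(n, mean = 900000, sd = 350000)

# Simulated binary outcome, more likely at higher income
buy <- rbinom(n, 1, plogis((income - 900000) / 350000))

# Pearson correlation test between a numeric and a binary variable
ct <- cor.test(income, buy)

# Pooled-variance t-test of income between the two outcome groups
tt <- t.test(income ~ buy, var.equal = TRUE)

print(ct$estimate)
print(c(cor_t = abs(unname(ct$statistic)), ttest_t = abs(unname(tt$statistic))))
```

The two test statistics agree in absolute value, which is why a significant correlation here can be read as a significant difference in the predictor's mean between buyers and non-buyers.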

### Binary Logistic Regression Model

In [None]:
model <- glm(TravelInsurance ~ ., travel, family = binomial)

In [None]:
summary(model)


Call:
glm(formula = TravelInsurance ~ ., family = binomial, data = travel)

Coefficients:
                                              Estimate Std. Error z value
(Intercept)                                 -5.405e+00  6.340e-01  -8.525
Age                                          7.326e-02  1.851e-02   3.958
Employment.TypePrivate Sector/Self Employed  9.857e-02  1.326e-01   0.743
GraduateOrNotYes                            -1.813e-01  1.562e-01  -1.160
AnnualIncome                                 1.565e-06  1.769e-07   8.844
FamilyMembers                                1.529e-01  3.359e-02   4.551
ChronicDiseases                              8.999e-02  1.211e-01   0.743
FrequentFlyerYes                             4.595e-01  1.365e-01   3.366
EverTravelledAbroadYes                       1.718e+00  1.532e-01  11.211
                                            Pr(>|z|)    
(Intercept)                                  < 2e-16 ***
Age                                         7.57e-05 **

### Multicollinearity Test

In [None]:
print(vif(model))

                Age     Employment.Type       GraduateOrNot        AnnualIncome 
           1.044710            1.146495            1.059921            1.311309 
      FamilyMembers     ChronicDiseases       FrequentFlyer EverTravelledAbroad 
           1.017244            1.005954            1.090805            1.156222 


All eight predictors have VIF values between 1.01 and 1.31, close to 1 and far below the commonly used cutoffs of 5 or 10, so there is no evidence of multicollinearity among them.
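For intuition, the classic VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand with base R on simulated data (`x1`, `x2`, `x3` are hypothetical). Note that `car::vif()` applied to a `glm` works from the coefficient covariance matrix, so its values may differ slightly from this auxiliary-regression version.

```r
# Simulated predictors (hypothetical): x3 is deliberately collinear with x1
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.9 * x1 + rnorm(n, sd = 0.3)
X <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining ones, then 1 / (1 - R^2)
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

In this simulation `x1` and `x3` get large VIF values while `x2` stays near 1, which is the pattern that values such as 1.01 to 1.31 rule out.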

### Likelihood Ratio Significance Test

In [None]:
null <- glm(TravelInsurance ~ 1, travel, family = binomial)

In [None]:
lrtest(model, null)

| | #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 9 | -1034.157 | | | |
| 2 | 1 | -1295.25 | -8 | 522.1857 | 1.219128e-107 |


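The likelihood ratio statistic is 522.19 on 8 degrees of freedom, with a p-value of about 1.2e-107, so the null hypothesis that all eight coefficients are zero is rejected. At least one independent variable is significantly associated with the dependent variable.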
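The statistic reported by `lrtest()` can be reproduced by hand from the two log-likelihoods in the output above: it is twice the difference between the full and null log-likelihoods, compared against a chi-squared distribution with 8 degrees of freedom. A minimal sketch, with the variable names chosen here for illustration:

```r
# Log-likelihoods copied from the lrtest() output
ll_full <- -1034.157  # model with all eight predictors (9 parameters)
ll_null <- -1295.25   # intercept-only model (1 parameter)

# Likelihood ratio statistic and its degrees of freedom
lr_stat <- 2 * (ll_full - ll_null)
df <- 9 - 1

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)

print(lr_stat)
print(p_value)
```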

### Wald Significance Test

In [None]:
summary(model)


Call:
glm(formula = TravelInsurance ~ ., family = binomial, data = travel)

Coefficients:
                                              Estimate Std. Error z value
(Intercept)                                 -5.405e+00  6.340e-01  -8.525
Age                                          7.326e-02  1.851e-02   3.958
Employment.TypePrivate Sector/Self Employed  9.857e-02  1.326e-01   0.743
GraduateOrNotYes                            -1.813e-01  1.562e-01  -1.160
AnnualIncome                                 1.565e-06  1.769e-07   8.844
FamilyMembers                                1.529e-01  3.359e-02   4.551
ChronicDiseases                              8.999e-02  1.211e-01   0.743
FrequentFlyerYes                             4.595e-01  1.365e-01   3.366
EverTravelledAbroadYes                       1.718e+00  1.532e-01  11.211
                                            Pr(>|z|)    
(Intercept)                                  < 2e-16 ***
Age                                         7.57e-05 **

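At the 0.05 level (|z| > 1.96), the predictors that are statistically significant for Travel Insurance are **Age**, **Annual Income**, **Family Members**, **Frequent Flyer**, and **Ever Traveled Abroad**. Employment Type, Graduate Or Not, and Chronic Diseases are not significant (|z| ≤ 1.16).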
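Each Wald z value is the estimate divided by its standard error, and the two-sided p-value follows from the standard normal distribution. The sketch below recomputes both from the estimates and standard errors printed in the summary above; the `coefs` data frame is a hypothetical container built only for this illustration, and small differences from the printed z values come from rounding.

```r
# Estimates and standard errors copied from the summary(model) output
coefs <- data.frame(
  term = c("Age", "Employment.TypePrivate Sector/Self Employed",
           "GraduateOrNotYes", "AnnualIncome", "FamilyMembers",
           "ChronicDiseases", "FrequentFlyerYes", "EverTravelledAbroadYes"),
  estimate = c(7.326e-02, 9.857e-02, -1.813e-01, 1.565e-06,
               1.529e-01, 8.999e-02, 4.595e-01, 1.718e+00),
  std_error = c(1.851e-02, 1.326e-01, 1.562e-01, 1.769e-07,
                3.359e-02, 1.211e-01, 1.365e-01, 1.532e-01),
  stringsAsFactors = FALSE
)

# Wald z statistic and two-sided p-value
coefs$z <- coefs$estimate / coefs$std_error
coefs$p_value <- 2 * pnorm(-abs(coefs$z))
coefs$significant <- coefs$p_value < 0.05

print(coefs[, c("term", "z", "p_value", "significant")])
```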

### Odds Ratio

In [None]:
odds_ratio <- exp(coef(model))
print(odds_ratio)

                                (Intercept) 
                                 0.00449543 
                                        Age 
                                 1.07601556 
Employment.TypePrivate Sector/Self Employed 
                                 1.10359715 
                           GraduateOrNotYes 
                                 0.83418503 
                               AnnualIncome 
                                 1.00000156 
                              FamilyMembers 
                                 1.16518482 
                            ChronicDiseases 
                                 1.09416229 
                           FrequentFlyerYes 
                                 1.58327846 
                     EverTravelledAbroadYes 
                                 5.57096471 


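Ranked by how far the odds ratio is from 1, the five independent variables with the largest per-unit change in the odds of having travel insurance are **Ever Traveled Abroad** (5.57), **Frequent Flyer** (1.58), **Graduate or Not** (0.83), **Family Members** (1.17), and **Employment Type** (1.10), with **Chronic Diseases** (1.09) close behind. Odds ratios describe a one-unit change, so Age (per year) and Annual Income (per currency unit, 1.00000156) are not directly comparable with the binary predictors. Graduate or Not, Employment Type, and Chronic Diseases have small Wald z values (|z| ≤ 1.16), so their odds ratios should be read with caution.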
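Two refinements make the odds ratios easier to read: rescaling a predictor measured in small units, and attaching a Wald confidence interval. The sketch below uses the estimates and standard errors printed in the model summary; the 100,000-unit step for Annual Income is an assumption chosen for illustration.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(AnnualIncome = 1.565e-06, EverTravelledAbroadYes = 1.718e+00)
se  <- c(AnnualIncome = 1.769e-07, EverTravelledAbroadYes = 1.532e-01)

# Odds ratio for a 100,000-unit increase in annual income (assumed step size)
or_income_100k <- exp(est[["AnnualIncome"]] * 1e5)

# 95% Wald confidence interval for the Ever Travelled Abroad odds ratio
z_crit <- qnorm(0.975)
ci_abroad <- exp(est[["EverTravelledAbroadYes"]] +
                 c(-1, 1) * z_crit * se[["EverTravelledAbroadYes"]])

print(or_income_100k)
print(ci_abroad)
```

On this scale, Annual Income has an odds ratio comparable in size to Family Members, which the per-unit value of 1.00000156 hides.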