# Regression modeling with insurance dataset 

Adapted from Lantz (2015), Chapter 6 

- In order for a health insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care for its beneficiaries.
- As a result, insurers invest a great deal of time and money in developing models that accurately forecast medical expenses for the insured population.


- Medical expenses are difficult to estimate because the most costly conditions are rare and seemingly random. Still, some conditions are more prevalent for certain segments of the population.
- For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.


We will try to estimate the expenses (claims) of clients given data on their health and family characteristics.

**SO EXPENSES WILL BE THE DEPENDENT VARIABLE**

## Libraries and dataset

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(psych) # for visualizing relationship among pairs of variables
library(GGally) # for better visualizing relationship among pairs of variables
library(corrplot) # for correlation plots
library(listviewer) # for visualizing nested data structures
library(broom) # for getting a glance of model fit
library(TSrepr) # for evaluating predictive power
options(warn = -1) # suppress warnings

In [None]:
insurance <- fread("../data/csv/04_01_insurance.csv", stringsAsFactors = T)

## Explore data

In [None]:
str(insurance)

### Factor variables

See the levels of factor variables:

In [None]:
insurance %>% purrr::keep(is.factor) %>% purrr::map(levels)

In [None]:
insurance_factors <- insurance %>% purrr::keep(is.factor) %>% # select factor columns
    tidyr::gather() %>% # convert into long format for faceting
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar()

plotly::ggplotly(insurance_factors)

Note that regression can only handle numeric variables, so factors must be converted into dummy variables that take binary values of 0 or 1. R handles this automatically, as we will see.

### Numeric variables

And five-number summaries for the numeric variables:

In [None]:
insurance %>% purrr::keep(is.numeric) %>% sapply(quantile) %>% t()

In [None]:
insurance %>% purrr::keep(is.numeric) %>% # select columns
    tidyr::gather() %>% # reshape into long format in columns "key" and "value"
    ggplot(aes(value)) + # plot value
        facet_wrap(~ key, scales = "free") + # divide into separate plots by key
        geom_density(fill = "green")  # get density plots

Numeric variables other than bmi are not normally distributed.

- Although linear regression does not strictly require a normally distributed dependent variable, the model often fits better 
when this is true.

### Relationships among features

First we create a correlation plot across the numeric variables. It is important that the independent variables (age, bmi and children) are not highly correlated with each other. Such a condition, known as multicollinearity, can distort the results of the regression model.

In [None]:
insurance %>% purrr::keep(is.numeric) %>% cor() %>%
    corrplot::corrplot.mixed(upper = "ellipse",
                             lower = "number",
                             tl.pos = "lt",
                             number.cex = .5,
                             lower.col = "black",
                             tl.cex = 0.7)

Here we see that the independent variables are not highly correlated with each other, while only age is mildly correlated with expenses.
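A correlation matrix only captures pairwise relationships. A common complementary check is the variance inflation factor (VIF): regress each predictor on the remaining ones and compute 1 / (1 - R²). The sketch below is only an illustration on simulated data; the column names mimic ours, but the values are made up, and `vif_one` is a hypothetical helper, not part of any package used in this notebook.

```r
# Simulated stand-in for the numeric predictors (values are made up)
set.seed(1)
toy <- data.frame(
  age      = sample(18:64, 200, replace = TRUE),
  bmi      = rnorm(200, mean = 30, sd = 6),
  children = sample(0:5, 200, replace = TRUE)
)

# VIF of one predictor: regress it on the remaining predictors
# (vif_one is a hypothetical helper written for this sketch)
vif_one <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(toy), vif_one, data = toy)
round(vifs, 2)
```

Values close to 1 suggest that a predictor is barely explained by the others. Since the toy columns are drawn independently, that is what we would expect to see here.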

### Scatterplots

Scatterplots are another way to visualize pairwise relationships among variables. The pairs() function creates a matrix of such scatterplots.

In [None]:
insurance %>% purrr::keep(is.numeric) %>% pairs()

- The relationship between age and expenses displays several relatively straight lines,
- while the bmi versus expenses plot has two distinct groups of points.

The pairs.panels() function from the psych package provides even more visual information on pairs:

In [None]:
insurance %>% purrr::keep(is.numeric) %>% psych::pairs.panels()

- The oval-shaped object on each scatterplot is a correlation ellipse. It provides a visualization of correlation strength.

- The dot at the center of the ellipse indicates the point at the mean values for the x and y axis variables.

- The correlation between the two variables is indicated by the shape of the ellipse; the more it is stretched, the stronger the correlation. An almost perfectly round oval, as with bmi and children, indicates a very weak correlation

- The curve drawn on the scatterplot is called a loess curve. It indicates the general relationship between the x and y axis variables, and whether the relationship is linear.

The ggpairs() function from the GGally package is another neat way to visualize pairwise relationships:

In [None]:
insurance %>% purrr::keep(is.numeric) %>% GGally::ggpairs()

## Partition the dataset

In [None]:
set.seed(19231029)
train_ind <- insurance[,sample(.I, 0.7 * .N)]

In [None]:
insurance_train <- insurance[train_ind]
insurance_test <- insurance[-train_ind]
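It is worth confirming that a split like this is disjoint and covers every row. The following is a minimal base R sketch on a made-up data frame (the `id` column and the values are assumptions for illustration only), using the same idea of sampling 70% of the row indices:

```r
# Toy data frame standing in for the insurance table (hypothetical values)
set.seed(19231029)
toy <- data.frame(id = 1:10, expenses = round(runif(10, 1000, 20000)))

# Draw 70% of the row indices without replacement
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two parts should be disjoint and together cover every row
nrow(toy_train)                               # 7
nrow(toy_test)                                # 3
length(intersect(toy_train$id, toy_test$id))  # 0
```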

## Build and train a model

We will first create a formula for the model as such:

dependent_variable ~ independent_var1 + independent_var2 ...

Although we can write this formula manually with a handful of variables, it may be cumbersome when the number of independent variables is high.

We have a shortcut for that: 

In [None]:
indepvars <- names(insurance) %>% setdiff("expenses")
indepvars

formula1 <- reformulate(indepvars, "expenses")
formula1

When we are using all variables we also have a better shortcut:

In [None]:
formula2 <- reformulate(".", "expenses")
formula2
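To see that a formula built this way matches a hand-written one, we can compare their term labels. This is a small sketch that assumes the column names listed in the text instead of reading them from the data; `update()` is shown as one possible way to drop a predictor afterwards.

```r
# Column names assumed for illustration (stand-in for names(insurance))
cols <- c("age", "sex", "bmi", "children", "smoker", "region", "expenses")

# Build the right-hand side from every column except the response
f_built  <- reformulate(setdiff(cols, "expenses"), response = "expenses")
f_manual <- expenses ~ age + sex + bmi + children + smoker + region

f_built

# Both formulas contain the same terms
same_terms <- identical(attr(terms(f_built), "term.labels"),
                        attr(terms(f_manual), "term.labels"))
same_terms

# Dropping one predictor from an existing formula with update()
f_no_region <- update(f_built, . ~ . - region)
f_no_region
```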

In [None]:
ins_model <- lm(formula2, data = insurance_train)

In [None]:
ins_model

Note that factor variables are automatically converted into dummy variables of 1 or 0 values

### Dummy variables

- Dummy coding allows a nominal feature to be treated as numeric by creating a 
binary variable, often called a dummy variable, for each category of the feature. 

- The dummy variable is set to 1 if the observation falls into the specified category or 0 otherwise. For instance, the sex feature has two categories: male and female. 

- This will be split into two binary variables, which R names sexmale and sexfemale. 

- For observations where sex = male, then sexmale = 1 and sexfemale = 0; conversely, if sex = female, then sexmale = 0 and sexfemale = 1. The same coding applies to variables with three or more categories.

- For example, R split the four-category feature region into four dummy variables: regionnorthwest, regionsoutheast, regionsouthwest, and regionnortheast.

- When adding a dummy variable to a regression model, one category is always left out to serve as the reference category. The estimates are then interpreted relative to the reference.

- In our model, R automatically held out the sexfemale, smokerno, and regionnortheast variables, making female non-smokers in the northeast region the reference group.

To see how R converts factors into dummy variables, we can call the model.matrix() function, which is also called implicitly when lm() is run with factor variables:

In [None]:
model.matrix(formula2, data = insurance_train)

We can also compare the original factor variables with the newly created dummy variables:

In [None]:
insurance_train %>% purrr::keep(is.factor) %>%
    cbind(model.matrix(formula2, data = insurance_train))
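The role of the reference category is easier to see on a tiny example. The sketch below uses a made-up four-row data frame with the same region levels; `relevel()` is one way to choose a different reference category, and the `region2` column is introduced here purely for illustration.

```r
# Toy factor with the same levels as region (rows are made up)
toy <- data.frame(
  region = factor(c("northeast", "northwest", "southeast", "southwest"))
)

# Default treatment coding: the first level (northeast) is the reference,
# so it gets no column of its own
mm <- model.matrix(~ region, data = toy)
colnames(mm)

# relevel() picks a different reference category
toy$region2 <- relevel(toy$region, ref = "southwest")
mm2 <- model.matrix(~ region2, data = toy)
colnames(mm2)
```

With the default coding, the northeast row has 0 in all three dummy columns, which is exactly what makes it the baseline.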

### Interpretation of the model

The intercept is the predicted level of expenses when all independent variables take the value of 0.

- The beta coefficients indicate the estimated increase in expenses for an increase of 
one in each of the features, assuming all other values are held constant.
- For instance, for each additional year of age, we would expect $248.17 higher medical expenses on 
average, assuming everything else is equal.

- The results of the linear regression model make logical sense: old age, smoking, and 
obesity tend to be linked to additional health issues,
- while additional family member dependents may result in an increase in physician visits and preventive care such 
as vaccinations and yearly physical exams.
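The "all else equal" reading of a coefficient can be checked directly: predict for two clients who differ only by one year of age and compare the gap with the age coefficient. This sketch uses simulated data in which the true age effect is set to 250, an arbitrary choice unrelated to our estimates.

```r
# Simulated data: the true age effect is set to 250 (an arbitrary choice)
set.seed(42)
n <- 300
toy <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6)
)
toy$expenses <- 2000 + 250 * toy$age + 300 * toy$bmi + rnorm(n, sd = 3000)

fit <- lm(expenses ~ age + bmi, data = toy)

# Two hypothetical clients who differ only by one year of age
clients <- data.frame(age = c(40, 41), bmi = c(28, 28))
pred <- predict(fit, newdata = clients)

# The gap between the two predictions equals the age coefficient
gap      <- unname(diff(pred))
age_coef <- unname(coef(fit)["age"])
c(gap = gap, age_coef = age_coef)
```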

## Evaluate model performance

The output of the model is as such:

In [None]:
listviewer::jsonedit(ins_model)

### Scatterplot

We can compare the actual and fitted values visually in a scatterplot:

Let's differentiate the color of the points across smoker values:

In [None]:
p1 <- cbind(insurance_train, fitted.values = ins_model$fitted.values) %>%
        ggplot(aes(x = expenses,
                   y = fitted.values,
                   color = smoker)) +
            geom_point() +
            xlab("actual values") +
            ylab("fitted values")

plotly::ggplotly(p1)

The predicted and actual values do not fit well on a line. There is room for improvement

Now, let's get a summary of the model:

In [None]:
ins_sum <- summary(ins_model)
ins_sum

We can have a pretty representation of the nested output of summary:

In [None]:
listviewer::jsonedit(ins_sum, mode = "form")

### Coefficients

To interpret the fit of the model, we will first look at the coefficient estimates along with their standard errors, t-values and p-values:

In [None]:
ins_sum$coefficients

Let's see the coefficients of which variables are significantly different from 0 at a significance level of 5%:

In [None]:
coefs1 <- as.data.table(ins_sum$coefficients, keep.rownames = T)
split(coefs1, coefs1[,5] < 0.05)

### R-squared

Then let's view the r-squared and adjusted r-squared values:

In [None]:
ins_sum$r.squared;
ins_sum$adj.r.squared

So the model explains 76% of the variance in the expenses variable. Note that adjusted R-squared penalizes the model for the number of predictors included.
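Both quantities can be reproduced by hand from the residuals, which makes the penalty in the adjusted version explicit. The sketch below uses the built-in mtcars data purely as a stand-in, since any fitted lm works the same way.

```r
# mtcars ships with R and is used here only as a stand-in
fit <- lm(mpg ~ wt + hp, data = mtcars)

y   <- mtcars$mpg
rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
n   <- length(y)
p   <- length(coef(fit)) - 1     # predictors, excluding the intercept

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```

Because (n - 1) / (n - p - 1) grows with p, adding a predictor raises adjusted R-squared only if it reduces the residual sum of squares enough to offset the penalty.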

### F-statistic

Now let's view the f statistic. This criterion tells us about the overall significance of the predictor variables:

In [None]:
ins_sum$fstatistic

These values do not mean much as is. We can get the p-value of the F-statistic:

In [None]:
fstat <- ins_sum$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)

Or in a more concise way:

In [None]:
do.call(pf, c(ins_sum$fstatistic %>% as.list() %>% unname(), lower.tail = FALSE))

The F-statistic is highly significant, so the coefficients of the predictors are jointly significantly different from zero. To put it differently, the model is significantly better than a model that predicts expenses using just the average of the expenses values (an intercept-only model).

However, a significant F-statistic does not mean our model explains a good portion of the variance in the dependent variable (as measured by R-squared).
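The comparison with an intercept-only model can be made explicit with `anova()`. This is a sketch on the built-in mtcars data, used only as a stand-in; it is expected to reproduce the F-statistic that `summary()` reports.

```r
# Intercept-only model versus a model with predictors (mtcars as a stand-in)
fit_null <- lm(mpg ~ 1, data = mtcars)
fit_full <- lm(mpg ~ wt + hp, data = mtcars)

# F test of the full model against the intercept-only model
cmp <- anova(fit_null, fit_full)
cmp

# Same F statistic as the one stored in the model summary
f_anova   <- cmp$F[2]
f_summary <- unname(summary(fit_full)$fstatistic["value"])
c(f_anova = f_anova, f_summary = f_summary)
```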

A better way to get a glance of the fit of the model:

In [None]:
broom::glance(ins_model)

### AIC and BIC

Note that the AIC and BIC measures are useful for comparing alternative models on the same dataset. We can also calculate them separately:

In [None]:
AIC(ins_model);
BIC(ins_model)

The lower these values are, the better the model fit
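Both criteria are built from the log-likelihood plus a penalty on the number of estimated parameters. The following sketch recomputes them by hand on the built-in mtcars data (a stand-in, not our insurance model) to show where the penalty enters.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
k  <- attr(ll, "df")   # estimated parameters: coefficients plus error variance
n  <- nobs(fit)

# AIC penalizes each parameter by 2, BIC by log(n)
aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

c(aic_manual = aic_manual, bic_manual = bic_manual)
```

Since log(n) exceeds 2 once n is larger than about 7, BIC punishes extra predictors more heavily than AIC on any realistically sized dataset.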

## Predictive power of the model

Now let's see whether the model predicts the test data accurately

In [None]:
expenses_test_predicted <- predict(ins_model, newdata = insurance_test)

We can create a scatterplot across predicted and actual values:

In [None]:
p2 <- cbind(insurance_test, fitted.values = expenses_test_predicted) %>%
        ggplot(aes(x = expenses,
                   y = fitted.values,
                   color = smoker)) +
            geom_point() +
            xlab("actual values") +
            ylab("fitted values")

plotly::ggplotly(p2)

A good way to measure the predictive power of a model on test data is to calculate two criteria:

- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)

They measure how much the predictions deviate from the actual values.

In [None]:
TSrepr::mae(insurance_test$expenses, expenses_test_predicted)
TSrepr::rmse(insurance_test$expenses, expenses_test_predicted)
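For reference, both criteria reduce to one-liners in base R. The helper names `mae_manual` and `rmse_manual` and the four toy values below are made up for illustration.

```r
# Hand-written versions of the two error measures (hypothetical helper names)
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))

# Toy values: the errors are -10, 10, -30 and 0
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 330, 400)

mae_toy  <- mae_manual(actual, predicted)    # 12.5
rmse_toy <- rmse_manual(actual, predicted)   # sqrt(275), about 16.58
c(mae = mae_toy, rmse = rmse_toy)
```

RMSE squares the errors before averaging, so the single error of 30 pulls it well above MAE. RMSE is never smaller than MAE, and a large gap between the two hints at a few large misses.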

We can compare these values for test data with the same criteria calculated for the train data:

In [None]:
TSrepr::mae(insurance_train$expenses, ins_model$fitted.values)
TSrepr::rmse(insurance_train$expenses, ins_model$fitted.values)

Though the test data has slightly higher rmse and mae values than the train data do, the model fits rather similarly on test and train data

## Improving model performance

### Adding non-linear relationships

Not all relationships are linear:

In [None]:
p3 <- insurance %>%
        ggplot(aes(x = age,         # predictor on the x axis
                   y = expenses)) + # dependent variable on the y axis
            geom_point()

plotly::ggplotly(p3)

There is a slightly curved relationship between age and expenses, which may be represented with a quadratic function.

We add the squared term of age as a new variable named age2:

In [None]:
insurance[,age2 := age^2]
insurance
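An alternative to storing the squared term as a column is to wrap it in `I()` inside the formula. The sketch below, on simulated data with an arbitrary quadratic age effect, suggests that both routes give the same coefficients.

```r
# Simulated data with a quadratic age effect (coefficients are arbitrary)
set.seed(7)
toy <- data.frame(age = sample(18:64, 200, replace = TRUE))
toy$expenses <- 3000 + 50 * toy$age + 3 * toy$age^2 + rnorm(200, sd = 2000)

# Option 1: explicit column, as done above
toy$age2 <- toy$age^2
fit_col <- lm(expenses ~ age + age2, data = toy)

# Option 2: I() protects ^ from its special meaning in formulas
fit_I <- lm(expenses ~ age + I(age^2), data = toy)

same_coefs <- all.equal(unname(coef(fit_col)), unname(coef(fit_I)))
same_coefs
```

The explicit column keeps the formula simple when it is built programmatically with reformulate(), which is a reasonable argument for the approach taken in this notebook.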

### Converting a numeric variable to a binary indicator

Suppose we have a hunch that the effect of a feature is not cumulative, rather it has 
an effect only after a specific threshold has been reached.

- For instance, BMI may have zero impact on medical expenditures for individuals in the normal weight range,
- but it may be strongly related to higher costs for the obese (that is, BMI of 30 or above).

In [None]:
insurance[,bmi30 := (bmi >= 30) + 0]

insurance

### Adding interaction effects

- So far, we have only considered each feature's individual contribution to the 
outcome.

- What if certain features have a combined impact on the dependent 
variable?

- For instance, smoking and obesity may have harmful effects separately,  
but it is reasonable to assume that their combined effect may be worse than the  
sum of each one alone.

- When two features have a combined effect, this is known as an interaction.

- If we suspect that two variables interact, we can test this hypothesis by adding their interaction to the model. 

- Interaction effects are specified using the R formula syntax. 

- To have the obesity indicator (bmi30) and the smoking indicator (smoker) interact, we would write a formula in the form expenses ~ bmi30*smoker.

- The * operator is shorthand that instructs R to model expenses ~ bmi30 + smokeryes + bmi30:smokeryes.

- The : (colon) operator in the expanded form indicates that bmi30:smokeryes is the interaction between the two variables.

- Note that the expanded form also automatically included the bmi30 and smoker variables as well as the interaction.
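The expansion of the * operator can be inspected without fitting anything, by asking R for the term labels of each formula. A minimal sketch:

```r
# Term labels R derives from the compact and the expanded formula
compact  <- attr(terms(expenses ~ bmi30 * smoker), "term.labels")
expanded <- attr(terms(expenses ~ bmi30 + smoker + bmi30:smoker), "term.labels")

compact

# Both spellings describe the same model terms
identical(compact, expanded)
```

In the fitted model the interaction shows up as bmi30:smokeryes because smoker is a factor and R appends the non-reference level to the term name.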

### Improved model

In [None]:
set.seed(19231029)
train_ind <- insurance[,sample(.I, 0.7 * .N)]

In [None]:
insurance_train2 <- insurance[train_ind]
insurance_test2 <- insurance[-train_ind]

In [None]:
names(insurance)

In [None]:
predictors <- names(insurance) %>% setdiff(c("smoker", "expenses", "bmi30")) %>% c(paste("bmi30", "smoker", sep = "*"))
predictors

In [None]:
formula3 <- reformulate(predictors, "expenses")
formula3

In [None]:
ins_model2 <- lm(formula3, data = insurance_train2)
ins_sum2 <- summary(ins_model2)

In [None]:
ins_sum2 %>% listviewer::jsonedit(mode = "form")

#### Evaluate improved model performance:

First let's plot actual and fitted values:

In [None]:
p4 <- cbind(insurance_train2, fitted.values = ins_model2$fitted.values) %>%
        ggplot(aes(x = expenses,
                   y = fitted.values,
                   color = smoker)) +
            geom_point() +
            xlab("actual values") +
            ylab("fitted values")

plotly::ggplotly(p4)

We can also create a side-by-side scatterplot of both models to better view the improvement: 

In [None]:
pl.1 <- ggplotly(p1)
pl.2 <- ggplotly(p4)

subplot(list(pl.1,pl.2),nrows=1,shareX=F,shareY=F,titleX=T,titleY=T)

Visual inspection reveals that we now have a better fit

Let's compare the significance of coefficients:

In [None]:
ins_sum$coefficients[ins_sum$coefficients[,4] < 0.05,]

See which predictors are significant in both models:

In [None]:
coefs1 <- as.data.table(ins_sum$coefficients, keep.rownames = T)
split(coefs1, coefs1[,5] < 0.05)

In [None]:
coefs2 <- as.data.table(ins_sum2$coefficients, keep.rownames = T)
split(coefs2, coefs2[,5] < 0.05)

We have 7 significant variables as compared to 5 (+ intercept) in the initial model.

The newly added age2 and bmi30:smokeryes variables are significant while bmi30 separately is not

Let's compare the r squared and adjusted r squared values

In [None]:
summary_list <- list(model1 = ins_sum, model2 = ins_sum2)

sapply(summary_list, "[", c("r.squared", "adj.r.squared"))

Both values are much improved

We can also compare the output of the glance function for the F-statistic, AIC and BIC as well as the R-squared values:

In [None]:
models_list <- list(model1 = ins_model, model2 = ins_model2)

sapply(models_list, broom::glance)

The p-value of the F-statistic is much lower, so the improved model is even more strongly significant as a whole.

Both AIC and BIC values are lower, again an indicator of a better fit.

#### Predictive power of the improved model

In [None]:
expenses_test_predicted2 <- predict(ins_model2, newdata = insurance_test2)

We can create a scatterplot across predicted and actual values:

In [None]:
p5 <- cbind(insurance_test2, fitted.values = expenses_test_predicted2) %>%
        ggplot(aes(x = expenses,
                   y = fitted.values,
                   color = smoker)) +
            geom_point() +
            xlab("actual values") +
            ylab("fitted values")

plotly::ggplotly(p5)

We can also create a side-by-side scatterplot of both models to better view the improvement: 

In [None]:
pl.3 <- ggplotly(p2)
pl.4 <- ggplotly(p5)

subplot(list(pl.3,pl.4),nrows=1,shareX=F,shareY=F,titleX=T,titleY=T)

The fit of the predicted values is also better.

Now let's compare the mae and rmse values of fitted and predicted values with those of the original model.

We first write a wrapper function so that we do not repeat all steps:

In [None]:
mae_rmse <- function(data_train, data_test, fitted, predicted)
{
    c(
        rmse_fitted = TSrepr::rmse(data_train$expenses, fitted),
        rmse_predicted = TSrepr::rmse(data_test$expenses, predicted),
        mae_fitted = TSrepr::mae(data_train$expenses, fitted),
        mae_predicted = TSrepr::mae(data_test$expenses, predicted)
    ) %>% round(2)
        
}

In [None]:
cbind(model1 = mae_rmse(insurance_train, insurance_test, ins_model$fitted.values, expenses_test_predicted),
      model2 = mae_rmse(insurance_train2, insurance_test2, ins_model2$fitted.values, expenses_test_predicted2)
      )

We see that all RMSE and MAE values are lower in the second model than in the first (the lower the error, the better the fit and predictive power).

The RMSE and MAE values of the predicted values are slightly higher than, but not very different from, those of the fitted values. So the model works about as well on the test data as it does on the train data.