# Prediction of insurance charges

This notebook builds a regression model that predicts the annual insurance charges of clients from 6 features.

In [2]:
from IPython.display import Image
Image(url="https://ushoptions.com/wp-content/uploads/2021/12/Affordable-health-insurance.jpg", width=800, height=200)

### Business Understanding

In this example, structured data is available from a .csv file. The data was collected by a U.S. insurance company. For each of the 1338 clients, the following attributes are recorded:
* age
* sex
* smoker
* Body-Mass-Index (BMI)
* number of children
* living region
* annual charges (the target to be predicted)

In [40]:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style='whitegrid', palette='muted', font_scale=1.5)
from warnings import filterwarnings
filterwarnings("ignore")
np.set_printoptions(precision=3)

### Access Data from .csv file

In [41]:
data="../Data/insurance.csv"
insurancedf=pd.read_csv(data,na_values=[" ","null"])
insurancedf.head()

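| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |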


### Understand Data

#### Numeric features:
For numeric variables, standard descriptive statistics such as mean, standard deviation and quantiles are calculated:

In [42]:
insurancedf.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


#### Categorical Features:
For non-numeric features, the possible values and their counts can be calculated as follows:

In [43]:
catFeats=['sex','smoker','region']
for cf in catFeats:
    print("\nFeature %s :"%cf)
    print(insurancedf[cf].value_counts())
    


Feature sex :
male      676
female    662
Name: sex, dtype: int64

Feature smoker :
no     1064
yes     274
Name: smoker, dtype: int64

Feature region :
southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64


### Preprocess Data

#### Transformation of non-numeric Features
Non-numeric features must be transformed to a numeric representation. For this we apply the `LabelEncoder` from scikit-learn, which belongs to the class of *Transformers*:

In [47]:
from sklearn.preprocessing import LabelEncoder
for cf in catFeats:
    insurancedf[cf] = LabelEncoder().fit_transform(insurancedf[cf].values)

In [48]:
insurancedf.head()

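| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 | 3866.8552 |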
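
The integer codes shown above follow from how `LabelEncoder` works: it sorts the distinct values of a column and numbers them in that order. The following minimal sketch illustrates this for the four region names of this dataset. The variable names (`regions`, `encoder`, `mapping`, `restored`) are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The four region names that occur in the dataset
regions = ["southwest", "southeast", "northwest", "northeast"]

encoder = LabelEncoder()
encoder.fit(regions)

# classes_ holds the sorted distinct values; the position is the integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
restored = list(encoder.inverse_transform([3, 2, 1, 0]))
print(restored)
```

Keeping a fitted encoder per column, instead of a throwaway instance inside the loop, would make it possible to map encoded values back to the original categories later on.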


#### One-Hot-Encoding of nominal Features

For **non-binary nominal features**, a transformation into integer codes is not sufficient, because most algorithms interpret integers as ordinal data, i.e. they assume an order and distances between the codes that do not exist. Therefore, non-binary nominal features must be **One-Hot-Encoded**. For columns of pandas dataframes the `get_dummies()` function does the job. In the code cell below, the columns are reordered after One-Hot-Encoding so that the attribute to be predicted (`charges`) remains the last column:

In [49]:
insurancedfOH=pd.get_dummies(insurancedf,columns=["region"])
insurancedfOH.head()
ch=insurancedfOH["charges"]
insurancedfOH.drop(labels=['charges'], axis=1, inplace = True)
insurancedfOH.insert(len(insurancedfOH.columns), 'charges', ch)
insurancedfOH.head()

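| | age | sex | bmi | children | smoker | region_0 | region_1 | region_2 | region_3 | charges |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 0 | 0 | 0 | 1 | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 0 | 0 | 1 | 0 | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 0 | 0 | 1 | 0 | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 0 | 1 | 0 | 0 | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 0 | 1 | 0 | 0 | 3866.8552 |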


```{note} 
Theory says that nominal features must be One-Hot-Encoded. In practice, however, prediction accuracy may be better without One-Hot-Encoding. To find out which option is better, both variants must be implemented and evaluated. Below, the dataset `insurancedf` without One-Hot-Encoding is used for modelling. As an exercise, also apply the One-Hot-Encoded dataset `insurancedfOH` and determine which variant performs better.
```
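
One possible way to compare the two variants is cross-validation with the same model on both encodings. The sketch below is self-contained and therefore uses a small synthetic dataframe `toy`, not the insurance data. The region effects in `region_effect` and the helper `mean_r2` are hypothetical and chosen only to illustrate a nominal feature without a natural order.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the insurance data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "region": rng.integers(0, 4, size=n),  # integer codes as produced by LabelEncoder
})
# Hypothetical, non-monotonic effect of the four regions on the target
region_effect = np.array([0.0, 3000.0, -1000.0, 500.0])
toy["charges"] = (250.0 * toy["age"].to_numpy()
                  + region_effect[toy["region"].to_numpy()]
                  + rng.normal(0, 500, size=n))

def mean_r2(frame):
    """Mean 5-fold cross-validated R^2 of a linear regression on `frame`."""
    X = frame.drop(columns="charges").to_numpy(dtype=float)
    y = frame["charges"].to_numpy()
    return cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

r2_codes = mean_r2(toy)                                       # region as one integer column
r2_onehot = mean_r2(pd.get_dummies(toy, columns=["region"]))  # region as 4 indicator columns

print("R^2 with integer codes:    %.3f" % r2_codes)
print("R^2 with One-Hot-Encoding: %.3f" % r2_onehot)
```

For the exercise in the note, `toy` would be replaced by `insurancedf` and `pd.get_dummies(...)` by `insurancedfOH`. Whether One-Hot-Encoding helps on the real data has to be determined by running the comparison.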

#### Scaling of data
With the exception of decision trees and ensemble methods built from decision trees, many machine learning algorithms work best when the input features have similar scales. This applies in particular to distance-based methods, regularized models and models trained by gradient descent. Since the value ranges of real-world features can differ widely, a corresponding scaling step is usually part of the preprocessing chain. The most common scaling approaches are *normalization (MinMax-scaling)* and *standardization*.

**Normalization:** In order to normalize feature *x*, its minimum $x_{min}$ and maximum $x_{max}$ must be determined. Then the normalized values $x_n^{(i)}$ are calculated from the original values $x^{(i)}$ by

$$
x_n^{(i)}=\frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}.
$$

The range of normalized values is $[0,1]$. A drawback of this type of scaling is its sensitivity to outliers: a single extreme value determines $x_{min}$ or $x_{max}$, so that all non-outliers are compressed into a very small part of the range.
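
As a small worked example of the normalization formula and of the outlier problem, the sketch below applies the formula with NumPy to six made-up values, one of which is an outlier. The array `x` is illustrative and not taken from the dataset.

```python
import numpy as np

# Illustrative values only: five typical values and one outlier (120)
x = np.array([20.0, 22.0, 25.0, 27.0, 30.0, 120.0])

# Min-max normalization as defined in the formula above
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)

# Share of the [0, 1] range that is occupied by the five non-outliers
print("Largest normalized non-outlier:", x_norm[:-1].max())
```

Here the outlier is mapped to 1, while the five remaining values all end up in the lowest tenth of the range.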

**Standardization:** In order to standardize feature *x*, its mean value $\mu_x$ and standard deviation $\sigma_x$ must be determined. Then the standardized values $x_s^{(i)}$ are calculated from the original values $x^{(i)}$ by

$$
x_s^{(i)}=\frac{x^{(i)}-\mu_x}{\sigma_x}
$$

All standardized features have zero mean and a standard deviation of one.
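
The standardization formula can be checked by hand with NumPy and compared with scikit-learn's `StandardScaler`. The sketch below uses made-up values in `x`; the names `x_std` and `x_std_sklearn` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only
x = np.array([20.0, 22.0, 25.0, 27.0, 30.0, 120.0])

# Standardization by hand; np.std uses the population standard deviation (ddof=0)
x_std = (x - x.mean()) / x.std()

# StandardScaler expects a 2D array of shape (n_samples, n_features)
x_std_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("Mean and standard deviation after scaling:", x_std.mean(), x_std.std())
print("Manual result equals StandardScaler result:", np.allclose(x_std, x_std_sklearn))
```

Note that `StandardScaler` divides by the population standard deviation, whereas the `std` row of pandas' `describe()` reports the sample standard deviation (division by $n-1$), so the two values differ slightly.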

In [50]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

normalizer = MinMaxScaler()
normalizer.fit(insurancedf)
insurancedfNormed = normalizer.transform(insurancedf)
print("Min-Max Normalized Data:")
insurancedfNormed

Min-Max Normalized Data:


array([[0.022, 0.   , 0.321, ..., 1.   , 1.   , 0.252],
       [0.   , 1.   , 0.479, ..., 0.   , 0.667, 0.01 ],
       [0.217, 1.   , 0.458, ..., 0.   , 0.667, 0.053],
       ...,
       [0.   , 0.   , 0.562, ..., 0.   , 0.667, 0.008],
       [0.065, 0.   , 0.265, ..., 0.   , 1.   , 0.014],
       [0.935, 0.   , 0.353, ..., 1.   , 0.333, 0.447]])

In [51]:
standardizer = StandardScaler()
standardizer.fit(insurancedf)  # learn columnwise mean and standard deviation
insurancedfStandardized = standardizer.transform(insurancedf)
print("Standardized Data:")
insurancedfStandardized

Standardized Data:


array([[-1.439, -1.011, -0.453, ...,  1.971,  1.344,  0.299],
       [-1.51 ,  0.99 ,  0.51 , ..., -0.507,  0.438, -0.954],
       [-0.798,  0.99 ,  0.383, ..., -0.507,  0.438, -0.729],
       ...,
       [-1.51 , -1.011,  1.015, ..., -0.507,  0.438, -0.962],
       [-1.296, -1.011, -0.798, ..., -0.507,  1.344, -0.93 ],
       [ 1.552, -1.011, -0.261, ...,  1.971, -0.467,  1.311]])

```{note}
As can be seen above, both transformers must be fitted to data by applying the `fit()` method. Within this method the parameters of the transformation are learned. These are the columnwise minimum and maximum in the case of the `MinMaxScaler`, and the columnwise mean and standard deviation in the case of the `StandardScaler`. Once a transformer is fitted (i.e. its parameters are learned), the `transform()` method can be invoked to actually transform the data. In the context of Machine Learning it is important that `fit()` is invoked only on the training data. The fitted transformer is then applied to transform **both training and test data**. It is not valid to learn separate parameters from the test data, because test data is treated as unknown in advance.
```
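
The procedure described in the note can be sketched as follows. The cells above fit the scalers on the complete dataframe for illustration only; in a real experiment the split comes first. The sketch uses a synthetic feature matrix `X_demo` (not the insurance data), and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for the real data (two features)
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(18, 64, size=200),
                          rng.uniform(16, 53, size=200)])

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=0)

scaler = StandardScaler()
scaler.fit(X_tr)                       # parameters are learned from training data only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test data is transformed with the training parameters

print("Training mean after scaling:", X_tr_scaled.mean(axis=0))
print("Test mean after scaling:    ", X_te_scaled.mean(axis=0))
```

The training columns have exactly zero mean after scaling, while the test columns are only approximately centered. This is expected, because the test data did not contribute to the learned parameters.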

### Modelling
In this example a regression model shall be learned that estimates the annual charges of a person from the other 6 features. Since we also want to evaluate the learned model, the labeled data must be split into 2 disjoint sets: one for training and one for testing.

```{note}
Since the goal of this section is to keep things as simple as possible, we neglect One-Hot-Encoding and Scaling here. In an offline experiment it has been shown, that for this data and the applied ML-algorithm, the two transformations yield no significant performance difference.
```

In [52]:
from sklearn.model_selection import train_test_split

Split input-features from output-label:

In [53]:
X=insurancedf.values[:,:-1] # all features, which shall be applied as input for the prediction
y=insurancedf.values[:,-1]  # annual charges, i.e. the output-label that shall be predicted

Note that in the code cell above, the `values` attribute of the pandas dataframe has been used. This attribute contains only the data part of the dataframe, in the form of a numpy array. The variables `X` and `y` are therefore numpy arrays.

Split training- and test-partition:

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)

First 5 rows of the test-partition:

In [55]:
X_test[:5,:]

array([[52.   ,  1.   , 30.2  ,  1.   ,  0.   ,  3.   ],
       [47.   ,  0.   , 29.37 ,  1.   ,  0.   ,  2.   ],
       [48.   ,  1.   , 40.565,  2.   ,  1.   ,  1.   ],
       [61.   ,  1.   , 38.38 ,  0.   ,  0.   ,  1.   ],
       [51.   ,  0.   , 18.05 ,  0.   ,  0.   ,  1.   ]])
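
The effect of `test_size=0.3` and `random_state=0` on the partition can be checked with placeholder arrays. The sketch below assumes only the row count (1338) and the number of input features (6) from the data above; `X_demo`, `y_demo` and the other names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and input features as the dataset
X_demo = np.zeros((1338, 6))
y_demo = np.zeros(1338)

# With 1338 rows and test_size=0.3 the split is 936 training and 402 test rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)

# The same random_state always yields the same partition
idx = np.arange(1338)
first = train_test_split(idx, test_size=0.3, random_state=0)[1]
second = train_test_split(idx, test_size=0.3, random_state=0)[1]
print("Identical test indices:", np.array_equal(first, second))
```

Fixing `random_state` keeps the experiment reproducible, which matters when different models or preprocessing variants are compared on the same partition.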

In scikit-learn a model is learned by calling the `fit(X,y)` method of the corresponding algorithm class. The arguments `X` and `y` are the array of input samples and the corresponding output labels, respectively.

In [56]:
from sklearn.linear_model import LinearRegression
linreg=LinearRegression()
linreg.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In the same way as `LinearRegression` has been applied in the code cell above, any regression algorithm provided by [scikit-learn](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) can be imported and applied. Even conventional feedforward neural networks such as the [Multi Layer Perceptron (MLP) for Regression](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor) are provided.
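
Because all scikit-learn regressors share the same `fit` / `predict` / `score` interface, exchanging the algorithm only requires a different constructor call. The following sketch demonstrates this on synthetic data (not the insurance data). The choice of `RandomForestRegressor` as the second model and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data with a mildly non-linear term
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Every regressor is trained and evaluated with exactly the same calls
models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # R^2 on the test partition
    print("%s: R^2 = %.3f" % (name, scores[name]))
```

In the notebook itself, `X_train`, `y_train`, `X_test` and `y_test` from the cells above would take the place of the synthetic arrays.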

### Evaluation of Regression Models

Once the model has been learned it can be applied for predictions. Here the model output for the test-data is calculated:

In [57]:
ypred=linreg.predict(X_test)

Next, for the first 10 persons of the test-partition the prediction of the model and the true charges are printed:

In [58]:
for pred, target in zip(ypred[:10],y_test[:10]):
    print("Predicted Charges: {0:2.2f} \t True Charges: {1:2.2f}".format(pred,target))

Predicted Charges: 11051.55 	 True Charges: 9724.53
Predicted Charges: 9821.28 	 True Charges: 8547.69
Predicted Charges: 37867.57 	 True Charges: 45702.02
Predicted Charges: 16125.71 	 True Charges: 12950.07
Predicted Charges: 6920.27 	 True Charges: 9644.25
Predicted Charges: 3879.39 	 True Charges: 4500.34
Predicted Charges: 1448.92 	 True Charges: 2198.19
Predicted Charges: 14390.18 	 True Charges: 11436.74
Predicted Charges: 9022.95 	 True Charges: 7537.16
Predicted Charges: 7458.83 	 True Charges: 5425.02


In [83]:
print("Minimum predicted charges: ",ypred.min())
print("Maximum predicted charges: ",ypred.max())

Minimum predicted charges:  -921.5688245477122
Maximum predicted charges:  40123.71002379287
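
The negative minimum above shows that the output of a linear model is not restricted to plausible values. To summarize prediction quality in a single number, common regression metrics can be computed with `sklearn.metrics`. The sketch below is self-contained and therefore uses only the ten (prediction, target) pairs printed above, so its values describe these ten samples and not the whole test partition. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# The ten (prediction, target) pairs copied from the printed output above
pred = np.array([11051.55, 9821.28, 37867.57, 16125.71, 6920.27,
                 3879.39, 1448.92, 14390.18, 9022.95, 7458.83])
true = np.array([9724.53, 8547.69, 45702.02, 12950.07, 9644.25,
                 4500.34, 2198.19, 11436.74, 7537.16, 5425.02])

mae = mean_absolute_error(true, pred)           # mean absolute error
rmse = np.sqrt(mean_squared_error(true, pred))  # root mean squared error
r2 = r2_score(true, pred)                       # coefficient of determination

print("MAE: %.2f  RMSE: %.2f  R^2: %.3f" % (mae, rmse, r2))
```

For an evaluation on the complete test partition, `y_test` and `ypred` from the cells above would be passed instead of the ten copied values.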
