# ML Basics - Predicting Insurance Costs with Linear Regression
This notebook trains a linear regression model on insurance data to predict a person's annual insurance costs.

## Load and Prepare Datasets
Import the standard libraries and the basic functions needed to load the datasets into the notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

import os
for dirname, _, filenames in os.walk('../input/time-series-analysis-datasets/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# load the insurance data from the CSV file into a DataFrame
df_insurance = pd.read_csv('../input/insurance/insurance.csv')

## Exploratory Analysis

In [None]:
df_insurance.shape

In [None]:
df_insurance.head(5)

In [None]:
df_insurance.describe()

## Linear Regression with Scikit Learn
Use the insurance dataset to predict a person's annual insurance costs. As a first step, predict the 'charges' from the single feature 'age'.
The CSV file was loaded into a pandas DataFrame above; the feature and the target are now extracted from it as NumPy arrays. scikit-learn expects the features as a two-dimensional array of shape (n_samples, n_features), which is why 'age' is selected with double brackets.
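
The difference between a one-dimensional and a two-dimensional feature array can be sketched as follows. This is a minimal illustration on a made-up mini-dataset; `df_demo` and its values are assumptions and are not taken from insurance.csv.

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names as insurance.csv
df_demo = pd.DataFrame({
    "age": [19, 33, 46, 60],
    "charges": [1700.0, 4500.0, 8200.0, 13000.0],
})

X_2d = df_demo[["age"]].values   # double brackets -> DataFrame -> 2D array
x_1d = df_demo["age"].values     # single brackets -> Series -> 1D array
y_demo = df_demo["charges"].values

print(X_2d.shape)  # (4, 1)
print(x_1d.shape)  # (4,)

# A 1D feature vector can be reshaped into the 2D layout that fit() expects
X_reshaped = x_1d.reshape(-1, 1)
print(X_reshaped.shape)  # (4, 1)
```

The target `y` can stay one-dimensional; only the feature matrix needs the extra dimension.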

In [None]:
# build the feature matrix X and the target variable y
X = df_insurance[['age']].values
y = df_insurance['charges'].values

# check the shapes and the first 5 rows of X (age)
print(X.shape, y.shape)
print(X[0:5])

### Training and Prediction of the Linear Regression Model
The LinearRegression class comes from the scikit-learn library.
It is trained with 'age' as the feature and 'charges' as the target. The fitted parameters, the 'intercept' and the 'coefficient', help to interpret what the model has learned. Predictions are made for the ages 18 and 65, the limits of a realistic age range.
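
What the intercept and the coefficient mean can be sketched on synthetic data where both are known in advance. The line `charges = 1000 + 250 * age` and the names `ages`, `charges` and `demo_model` are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line charges = 1000 + 250 * age (made-up numbers)
ages = np.array([[18], [30], [45], [65]])
charges = 1000 + 250 * ages.ravel()

demo_model = LinearRegression().fit(ages, charges)

print(demo_model.intercept_)  # close to 1000
print(demo_model.coef_)       # close to [250.]

# predict() evaluates intercept + coefficient * age for every row
manual = demo_model.intercept_ + demo_model.coef_[0] * ages.ravel()
print(np.allclose(manual, demo_model.predict(ages)))
```

On real, noisy data the fitted values are only estimates, but they are read in the same way: the coefficient is the predicted change in 'charges' per additional year of age.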

In [None]:
from sklearn.linear_model import LinearRegression
# instantiate the model
model = LinearRegression()
# pass the data to the model and fit it
model.fit(X,y)
# get the intercept of the linear function
intercept = model.intercept_
# get the coefficient for age of the model
coef = model.coef_

# predict the costs at the age limits 18 and 65
y_pred = model.predict([[18], [65]])

print('intercept {:.3f}, coef.age {} '.format(intercept, coef))
print('predictions: ', y_pred)

### Evaluate the Linear Regression Model
To evaluate the trained model, the mean absolute error and the R2 score are calculated. The mean absolute error is the average size of the prediction error, and the R2 score is the share of the variance in the costs that the model explains.
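
Both metrics can be computed by hand, which makes their definitions concrete. The following sketch uses made-up values for `y_true` and `y_hat` and compares the manual results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Mean absolute error: average of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_hat))

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```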

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score
# predictions for the training inputs, to compare with the true values
y_pred = model.predict(X)

# calculate the mean absolute error and R2 of the predictions on the training data
# (stored as `mae` so that the imported function mean_absolute_error is not overwritten)
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)

print('mean absolute error: {:.3f}, r2: {:.4f}'.format(mae, r2))

*mean absolute error: 9055.150, <br> 
r2: 0.0894*

The mean absolute error of the model's predictions is 9055 € p.a. Simply predicting the average of the training data for every person would give an error of 13,231 € p.a.

The r2_score of 0.0894 means that age explains about 9% of the variance in the costs.
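
A comparison with an "always predict the average" baseline can be sketched with scikit-learn's `DummyRegressor`. The data below is synthetic and the numbers it produces are not the notebook's results; `age_demo` and `charges_demo` are made-up names.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic data: a weak linear age effect hidden in a lot of noise
rng = np.random.default_rng(0)
age_demo = rng.integers(18, 65, size=200).reshape(-1, 1)
charges_demo = 3000 + 250 * age_demo.ravel() + rng.normal(0, 8000, size=200)

# Baseline that ignores the features and always predicts the mean of the target
baseline = DummyRegressor(strategy="mean").fit(age_demo, charges_demo)
linear = LinearRegression().fit(age_demo, charges_demo)

mae_baseline = mean_absolute_error(charges_demo, baseline.predict(age_demo))
mae_linear = mean_absolute_error(charges_demo, linear.predict(age_demo))
r2_baseline = r2_score(charges_demo, baseline.predict(age_demo))
r2_linear = r2_score(charges_demo, linear.predict(age_demo))

print('baseline: mae {:.1f}, r2 {:.3f}'.format(mae_baseline, r2_baseline))
print('linear:   mae {:.1f}, r2 {:.3f}'.format(mae_linear, r2_linear))
```

By definition the mean baseline has an R2 of 0 on the data it was fitted on, so an R2 close to 0 indicates a model that is only slightly better than predicting the average.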

### Expand the Training Set and Split It into Train Set and Test Set
To get a more precise prediction of the costs, the features 'bmi' (body mass index) and 'children' (number of children) are added to the training data.

To get a more reliable evaluation and to detect overfitting, the data is split into a train set and a test set.
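
How the split behaves can be sketched on a small synthetic array. `X_demo` and `y_demo` are made-up data; `test_size` and `random_state` are the same parameters the notebook passes to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples with 2 features each; y_demo holds the row index
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11
)
print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# The same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11
)
print(np.array_equal(X_te, X_te_again))

# No sample ends up in both sets
overlap = set(y_tr) & set(y_te)
print(len(overlap))
```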


In [None]:
# build the new feature matrix with three features
X = df_insurance[['age', 'bmi', 'children']].values
y = df_insurance['charges'].values

# import the train/test split function
from sklearn.model_selection import train_test_split

# split the data into an 80% train set and a 20% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=11)

# check the shapes of the split sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Train and Evaluate the Model with the 20% Test-set

In [None]:
model = LinearRegression()
# pass the training data to the model and fit it
model.fit(X_train, y_train)
# get the intercept of the linear function
intercept = model.intercept_
# get the coefficients of the model (one per feature)
coef = model.coef_

# predictions for the test inputs
y_test_pred = model.predict(X_test)

# evaluation on the test data
from sklearn.metrics import mean_absolute_error, r2_score

# calculate the mean absolute error and R2 of the predictions on the test data
# (stored as `mae` so that the imported function mean_absolute_error is not overwritten)
mae = mean_absolute_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)

print('intercept: {:.3f}, coefs.: {}'.format(intercept, coef))
print('mean absolute error: {:.3f}, r2: {:.3f}'.format(mae, r2))

*intercept: -7262.450, coefs.: [246.83315449 330.9616626  686.75706631]
mean absolute error: 9057.372, r2: 0.083* <br>
The model returns three coefficients, in the order 'age', 'bmi', 'children'. <br><br>

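The R2 score decreases from 9% to 8%, and the mean absolute error increases by 2 € from 9055 € to 9057 €. Note that the first values were measured on the training data and the new ones on the held-out test set, so the two are not strictly comparable.

Perhaps the features 'age', 'bmi' and 'children' do not have a big impact on the estimate of the 'charges', or the added features correlate with 'age' and so contribute little new information.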
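
Whether features correlate with each other or with the target can be checked with a correlation matrix. The following sketch runs on synthetic data with made-up relationships, so its numbers say nothing about the insurance dataset; on the real data the same `corr()` call could be applied to the columns 'age', 'bmi', 'children' and 'charges' of `df_insurance`.

```python
import numpy as np
import pandas as pd

# Synthetic data: bmi and children are independent of age by construction
rng = np.random.default_rng(1)
n = 500
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
children = rng.integers(0, 5, size=n)
charges = 250 * age + 300 * bmi + rng.normal(0, 5000, size=n)

df_demo = pd.DataFrame(
    {"age": age, "bmi": bmi, "children": children, "charges": charges}
)

# Pearson correlation between every pair of columns
corr = df_demo.corr()
print(corr.round(2))
```

A value near 0 between two features means they carry largely independent information; a value near 1 or -1 means one adds little beyond the other.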

In [None]:
costs_40 = model.predict([[40, 20.1, 1]])
print('costs caused by a 40 years old person with a bmi of 20.1 and one child: {:.2f}€ p.a.'.format(costs_40[0]))
# the second example uses age 35, so the printed label says 35 as well
costs_35 = model.predict([[35, 20.1, 2]])
print('costs caused by a 35 years old person with a bmi of 20.1 and two children: {:.2f}€ p.a.'.format(costs_35[0]))

*costs caused by a 40 years old person with a bmi of 20.1 and one child: 9949.96€ p.a.* <br>
*costs caused by a 35 years old person with a bmi of 20.1 and two children: 9202.55€ p.a.*<br>
The linear relationship is recognisable in these predictions. All three coefficients are positive, so the model predicts higher costs for a higher age, a higher bmi and a higher number of children. The second prediction is lower only because that person is five years younger: the lower age outweighs the cost of the additional child.
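
The first prediction can be reproduced by hand from the printed intercept and coefficients. The sketch below copies those values from the notebook output above; because the intercept is only printed with three decimals, the result agrees with the reported value up to rounding.

```python
import numpy as np

# Values copied from the printed model output (order: age, bmi, children)
intercept = -7262.450
coefs = np.array([246.83315449, 330.9616626, 686.75706631])

# 40 years old, bmi of 20.1, one child
person = np.array([40, 20.1, 1])

# Linear model: intercept + sum of coefficient * feature value
prediction = intercept + coefs @ person
print(round(prediction, 2))

# Each additional child adds the 'children' coefficient to the prediction
one_more_child = intercept + coefs @ np.array([40, 20.1, 2])
print(round(one_more_child - prediction, 2))
```

With all other features held fixed, each coefficient is the change in predicted costs per unit of its feature.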

### Sources: 
* https://github.com/PacktPublishing/Practical-Time-Series-Analysis
* Hirschle, Jochen. Machine Learning für Zeitreihen (German Edition), p. 94. Carl Hanser Verlag GmbH & Co. KG. Kindle edition.
* https://www.kaggle.com/mirichoi0218/insurance
