<a href="https://colab.research.google.com/github/ruddysimon/Predictive-Insurance-Premium-Estimation-with-XGBoost/blob/main/Insurance_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Approach**
- Investigating the Data through Exploratory Data Analysis (EDA)
- Creating and Assessing a Baseline Linear Model
- Examining Linear Regression Assumptions
- Preprocessing the Data for Modeling
- Training the Model
- Evaluating the Model's Performance
- Enhancing the Baseline Linear Model
- Introducing a Non-Linear Model: XGBoost
- Data Preprocessing for XGBoost
- Optimizing Model Training with Sklearn's Pipeline
- Evaluating the Performance of the XGBoost Model
- Comparing the XGBoost Model to the Baseline Linear Model

**Install Packages**

In [5]:
# !pip install numpy==1.21.0 --quiet
# !pip install pandas==1.5.2 --quiet
# !pip install plotly==5.11.0 --quiet
# !pip install scikit-learn==1.2.0 --quiet
# !pip install scikit-optimize==0.9.0 --quiet
# !pip install statsmodels==0.13.5 --quiet
# !pip install category_encoders==2.5.1 --quiet
# !pip install xgboost==1.7.2 --quiet
# !pip install nbformat==5.7.1 --quiet
# !pip install matplotlib==3.6.2 --quiet

In [4]:
# Import Dependencies
import pandas as pd
import numpy as np
import plotly.express as px
import sys
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import math
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import RFE

In [6]:
file = pd.read_csv("/content/insurance.csv")
display(file)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |


The dataset has 1338 rows and 7 columns. Each column is described below:


- age: The age of the primary beneficiary.
- sex: The gender of the primary beneficiary.
- bmi: Body Mass Index (BMI) of the primary beneficiary, calculated as $\frac{\text{weight}_{kg}}{(\text{height}_{metres})^2}$, a measure of weight relative to height.
- children: The number of dependents or children the primary beneficiary has.
- smoker: Whether the primary beneficiary is a smoker.
- region: The geographical region in the US where the primary beneficiary resides.
- charges: The individual healthcare expenses billed by health insurance for the primary beneficiary.
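
To make the BMI definition concrete, here is a minimal sketch of the formula. The function name `bmi` and the sample weight and height are illustrative assumptions, not values taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))
```

Because height is squared in the denominator, a small error in height shifts the result more than the same relative error in weight.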

In [8]:
file.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [10]:
# Let's check the dataset data types.
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


The dataset has three numerical features (age, bmi, and children) and three categorical features (sex, smoker, and region), plus the numerical target, charges. None of the columns has missing values, so no imputation is needed during preprocessing.
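
The observations above can also be checked programmatically. The sketch below uses a small hypothetical frame named `sample`, built from the first rows displayed earlier, so that it runs on its own. The same calls should apply to the full `file` DataFrame.

```python
import pandas as pd

# Hypothetical three-row stand-in for the insurance data
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Separate the feature columns by dtype, leaving the target out
features = sample.drop(columns="charges")
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Total count of missing cells across the whole frame
missing_total = int(sample.isna().sum().sum())

print(numeric_cols, categorical_cols, missing_total)
```

Deriving the column lists from the dtypes, rather than typing them by hand, keeps later preprocessing steps in sync with the data.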

## Split the Features (X) and Target (y)

In [11]:
target = 'charges'
# Features: every column except the target
X = file.drop(columns=target)
# Target: the charges billed for each beneficiary
y = file[target]

In [12]:
print(X.shape)
print(y.shape)

(1338, 6)
(1338,)
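
Beyond comparing shapes, a quick sanity check can confirm that the target did not leak into the features and that the rows still line up. This is a minimal sketch on a hypothetical three-row frame. The names `sample`, `X_sample`, and `y_sample` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for the insurance data
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "smoker": ["yes", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

target = "charges"
X_sample = sample.drop(columns=target)
y_sample = sample[target]

# The target must not appear among the features, and rows must stay aligned
assert target not in X_sample.columns
assert X_sample.index.equals(y_sample.index)

print(X_sample.shape, y_sample.shape)
```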
