![](http://pioneerinstitute.org/wp-content/uploads/healthcare_costs_scrabble.jpg)

**Column Descriptions**

age: age of primary beneficiary

sex: gender of the insurance contractor (female, male)

bmi: Body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance
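As a quick illustration of the bmi definition above, here is a minimal sketch of the formula. The function name `bmi` and the sample weight and height are illustrative assumptions, not values from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg, 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2))  # 22.86
```

A value of 22.86 falls inside the 18.5 to 24.9 range quoted above.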

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("../input/insurance/insurance.csv")
print(df.head())
print(f"Shape of data: {df.shape}")

In [None]:
#check for null values
df.isnull().sum()

No null values, proceed!

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


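# Note: .diff() histograms the row-to-row differences of each column,
# not the raw values; drop it to see the raw distributions.
# DataFrame.hist creates its own figure, so no separate plt.figure() is needed.
df[['age', 'bmi', 'children', 'charges']].diff().hist(color="r", alpha=0.8, bins=50, figsize=(12, 6));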

_The row-to-row differences of age, bmi, children, and charges follow a fairly normal distribution._
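To judge the shape of the raw columns rather than their differences, skewness is a quick numeric check. The sketch below uses synthetic data as a stand-in for `df`; the column names are borrowed from the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (values are made up for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "bmi": rng.normal(30, 6, size=1000),        # roughly symmetric
    "charges": rng.lognormal(9, 1, size=1000),  # long right tail
})

# Skewness near 0 suggests a symmetric shape; a large positive value
# suggests a long right tail
skew = demo.skew()
print(skew.round(2))
```

On the notebook's data the same check would be `df[['age', 'bmi', 'children', 'charges']].skew()`.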

In [None]:
import plotly.express as px
fig = px.box(df, y="charges", color="sex", points="all")
fig.show()

**Median charges are about 9.41k dollars for females and 9.37k dollars for males**

**The upper fence for male charges is 40.27k dollars, much higher than the 28.47k dollars for female charges**

**Outliers exist in both male and female charges**
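The medians and upper fences read off the box plot can also be computed directly. The sketch below assumes the usual box plot convention (upper fence = largest observation at or below Q3 + 1.5 × IQR); `box_stats` is a hypothetical helper, the sample rows are made up, and pandas' quantile interpolation may differ slightly from the one plotly uses.

```python
import pandas as pd


def box_stats(frame: pd.DataFrame, value: str, by: str) -> pd.DataFrame:
    """Median and upper fence of `value` for each group in `by`.

    Upper fence = largest observation at or below Q3 + 1.5 * IQR.
    """
    rows = {}
    for name, s in frame.groupby(by)[value]:
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        limit = q3 + 1.5 * (q3 - q1)
        rows[name] = {"median": s.median(), "upper_fence": s[s <= limit].max()}
    return pd.DataFrame.from_dict(rows, orient="index")


# Tiny made-up sample, not the insurance data
demo = pd.DataFrame({
    "sex": ["female"] * 5 + ["male"] * 5,
    "charges": [1, 2, 3, 4, 100, 2, 4, 6, 8, 10],
})
stats = box_stats(demo, "charges", "sex")
print(stats)
```

On the notebook's data the call would be `box_stats(df, "charges", "sex")`, run before the categorical columns are encoded.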

In [None]:
fig = px.box(df, y="charges", color="smoker", points="all")
fig.show()

**The median charge for smokers is significantly higher, at 34.45k dollars, than for non-smokers, at 7.35k dollars**

In [None]:
fig = px.box(df, y="charges", color="region", points="all")
fig.show()

There is no significant visible difference in charges between regions.

In [None]:
fig = px.scatter_matrix(df, color = 'charges')
fig.show()

Scatter plot matrix of the dataframe

# We need to encode the categoricals.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Replace each text (object-dtype) column with integer codes
for c in df.columns:
    if df[c].dtype == 'object':
        lbl = LabelEncoder()
        df[c] = lbl.fit_transform(df[c])

display(df.head())
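`LabelEncoder` assigns integer codes in sorted order, which gives a column like region an ordering that linear and distance-based models may treat as meaningful. One-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on made-up rows that share the dataset's categorical column names; it is an alternative, not the approach this notebook takes.

```python
import pandas as pd

# Made-up rows with the same categorical columns as the dataset
demo = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "bmi": [27.9, 33.8, 33.0],
})

# drop_first=True drops one level per column to avoid redundant columns
encoded = pd.get_dummies(demo, columns=["sex", "smoker", "region"], drop_first=True)
print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice than linear regression or k-nearest neighbours.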

# Splitting and scaling

In [None]:
X = df.drop(['charges'], axis = 1)
y = df['charges']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training set only, then apply the same transform to the test set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
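The split-then-scale order above keeps test-set statistics out of the scaler. A scikit-learn `Pipeline` bundles the same two steps so the order cannot be mixed up. The sketch below is one possible arrangement on synthetic data from `make_regression`; the names `X_demo`, `y_demo`, and `model` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only, then reused at predict time
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(Xtr, ytr)

score = model.score(Xte, yte)  # R^2 on the held-out split
print(round(score, 3))
```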

# Model training & Testing

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor


In [None]:
lr = LinearRegression()

knn = KNeighborsRegressor(n_neighbors=10)

dt = DecisionTreeRegressor(max_depth = 3)

rf = RandomForestRegressor(max_depth = 3, n_estimators=500)

ada = AdaBoostRegressor( n_estimators=50, learning_rate =.01)

gbr = GradientBoostingRegressor(max_depth=2, n_estimators=100, learning_rate =.2)

xgb = XGBRegressor(max_depth = 3, n_estimators=50, learning_rate =.2)

cb = CatBoostRegressor(learning_rate =.01, max_depth =5, verbose = 0)

regressors = [('Linear Regression', lr), ('K Nearest Neighbours', knn),
               ('Decision Tree', dt), ('Random Forest', rf), ('AdaBoost', ada),
              ('Gradient Boosting Regressor', gbr), ('XGBoost', xgb), ('catboost', cb)]


In [None]:
from sklearn.metrics import r2_score

# Set the figure size once, before any bars are drawn
plt.figure(figsize=(20, 8))

for regressor_name, regressor in regressors:
    # Fit regressor to the training set
    regressor.fit(X_train, y_train)

    # Predict on the test set
    y_pred = regressor.predict(X_test)

    # R^2 on the test set as a percentage; multiply first, then round,
    # so that e.g. 0.87 is reported as 87.0 rather than 90
    r2_pct = round(r2_score(y_test, y_pred) * 100, 1)

    print('{:s} : {:.1f} %'.format(regressor_name, r2_pct))
    plt.bar(regressor_name, r2_pct)

plt.ylabel('R^2 on test set (%)')
plt.show()

**The highest R² scores come from the Gradient Boosting Regressor, XGBoost, CatBoost, Random Forest, and Decision Tree models**
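A ranking based on a single train/test split can shift with the random seed. Cross-validation gives a mean and a spread for each model. The sketch below runs on synthetic data from `make_regression` and reuses the Gradient Boosting hyperparameters from above; `X_demo`, `y_demo`, and `gbr_demo` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

gbr_demo = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                     learning_rate=0.2, random_state=0)

# Five-fold cross-validated R^2: one score per fold
scores = cross_val_score(gbr_demo, X_demo, y_demo, cv=5, scoring="r2")
print(f"R^2: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Looping this over the `regressors` list would show whether the gaps between models exceed the fold-to-fold variation.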
