# 1. Introduction: Business Goal & Problem Definition

The goal of this project is to study and predict the best insurance price to charge each customer according to their individual characteristics, allowing the company to set a reasonable and fair market price while maximizing its profits. For that we'll use the Medical Cost Personal Dataset available on Kaggle, containing 1338 observations, each with the following attributes:

* Age: age of primary beneficiary
* Sex: insurance contractor gender (female, male)
* BMI: body mass index, an objective index of body weight relative to height (kg / m^2), computed as weight divided by height squared; ideally 18.5 to 24.9
* Children: number of children covered by health insurance / number of dependents
* Smoker: whether the beneficiary smokes (yes, no)
* Region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
* Charges: Individual medical costs billed by health insurance
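
As a quick illustration of the BMI attribute, here is a minimal sketch of the standard formula (weight in kilograms divided by height in metres squared). The function name `bmi` and the example person are hypothetical and not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = bmi(70, 1.75)
print(f"BMI = {example_bmi:.1f}")  # 70 / 1.75**2 is about 22.9, inside the 18.5 to 24.9 range
```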

# 2. Importing Basic Libraries 

In [None]:
import io
import openpyxl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 3. Data Collection

In [None]:
insurance_ds = pd.read_csv("../input/insurance/insurance.csv", sep=",")

insurance_ds

# 4. Data Preliminary Exploration

In [None]:
#Checking a dataset sample

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.options.display.float_format="{:,.2f}".format
insurance_ds.sample(n=10, random_state=0)

In [None]:
#Checking dataset info by feature

insurance_ds.info(verbose=True, show_counts=True)

In [None]:
#Checking the existence of zeros in rows

(insurance_ds==0).sum(axis=0).to_excel("zeros_per_feature.xlsx")
(insurance_ds==0).sum(axis=0)

In [None]:
#Checking the existence of duplicated rows

insurance_ds.duplicated().sum()

In [None]:
#Checking basic statistical data by feature

insurance_ds.describe(include="all")

# 5. Data Cleaning

We'll perform the following:

1. Remove duplicated rows
2. Convert categorical variables (sex, smoker, region) to dummies
3. Convert all numerical variables to categorical ranges (to be used in the next step when analyzing correlations)

Nothing else needs treatment:

* no missing, zero or invalid values to treat
* no columns to remove
* no outliers to treat
* the entire dataset will be taken

In [None]:
#1

insurance_ds.drop_duplicates(inplace=True)

#2

insurance_ds = pd.concat([insurance_ds, pd.get_dummies(insurance_ds["sex"], prefix="sex")], axis=1)
insurance_ds = pd.concat([insurance_ds, pd.get_dummies(insurance_ds["smoker"], prefix="smoker")], axis=1)
insurance_ds = pd.concat([insurance_ds, pd.get_dummies(insurance_ds["region"], prefix="region")], axis=1)

#3

insurance_ds["age_range"] = np.where(insurance_ds.age>=60, "60+", np.where(insurance_ds.age>=50, "50-60", np.where(insurance_ds.age>=40, "40-50", np.where(insurance_ds.age>=30, "30-40", np.where(insurance_ds.age>=18, "18-30", "18-")))))
insurance_ds["bmi_range"] = np.where(insurance_ds.bmi>=50, "50+", np.where(insurance_ds.bmi>=40, "40-50", np.where(insurance_ds.bmi>=30, "30-40", np.where(insurance_ds.bmi>=20, "20-30", np.where(insurance_ds.bmi>=15, "15-20", "15-")))))
insurance_ds["children_range"] = np.where(insurance_ds.children>5, "5+", np.where(insurance_ds.children==5, "5", np.where(insurance_ds.children==4, "4", np.where(insurance_ds.children==3, "3", np.where(insurance_ds.children==2, "2", np.where(insurance_ds.children==1, "1", "0"))))))
insurance_ds["charges_range"] = np.where(insurance_ds.charges>=50000, "50000+", np.where(insurance_ds.charges>=40000, "40000-50000", np.where(insurance_ds.charges>=30000, "30000-40000", np.where(insurance_ds.charges>=20000, "20000-30000", np.where(insurance_ds.charges>=10000, "10000-20000", "10000-")))))

insurance_ds.to_excel("insurance_ds_clean.xlsx")
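
The nested `np.where` calls in step 3 can be hard to audit. One possible alternative, sketched here under the assumption that the same thresholds are wanted, is `pd.cut` with explicit bin edges. The toy ages below are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages only for illustration; not taken from the insurance dataset
ages = pd.Series([17, 18, 29, 30, 45, 59, 60, 64], name="age")

# Bin edges mirror the age thresholds used above; right=False makes each
# interval closed on the left, e.g. [18, 30), matching the ">=" comparisons
age_bins = [-np.inf, 18, 30, 40, 50, 60, np.inf]
age_labels = ["18-", "18-30", "30-40", "40-50", "50-60", "60+"]

age_range = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(pd.concat([ages, age_range.rename("age_range")], axis=1))
```

Keeping the edges and labels in two lists makes a mistyped threshold easier to spot than in a chain of nested conditions.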

# 6. Data Exploration

In [None]:
#Plotting Categorical Variables

fig, ax = plt.subplots(1, 2)
insurance_ds["sex"].value_counts().plot.bar(color="purple", ax=ax[0])
insurance_ds["sex"].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,textprops={"fontsize": 10},ax=ax[1])
fig.suptitle("Gender Frequency", fontsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 2)
insurance_ds["smoker"].value_counts().plot.bar(color="purple", ax=ax[0])
insurance_ds["smoker"].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,textprops={"fontsize": 10},ax=ax[1])
fig.suptitle("Smoking Frequency", fontsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 2)
insurance_ds["region"].value_counts().plot.bar(color="purple", ax=ax[0])
insurance_ds["region"].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,textprops={"fontsize": 10},ax=ax[1])
fig.suptitle("Region Frequency", fontsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=45)


#Plotting Numerical Variables

for column, title in [("charges", "Charges"), ("age", "Age"), ("bmi", "BMI"), ("children", "Children")]:
    fig, ax = plt.subplots(1, 3)
    fig.suptitle(f"{title} Distribution", fontsize=15)
    #histplot with a KDE overlay replaces distplot, which is deprecated in recent seaborn versions
    sns.histplot(insurance_ds[column], kde=True, stat="density", ax=ax[0])
    sns.boxplot(x=insurance_ds[column], ax=ax[1])
    sns.violinplot(x=insurance_ds[column], ax=ax[2])

In [None]:
#Alternatively using Profile Report to see variables statistics and correlations

# from pandas_profiling import ProfileReport
# profile = ProfileReport(insurance_ds, title="Medical Cost")
# profile.to_file(output_file="Medical_Cost.html")

# 7. Correlations Analysis & Features Selection

In [None]:
#Plotting bar charts, including the categorical ranges created from the numerical variables in the previous step

fig, axarr = plt.subplots(3, 2, figsize=(20, 12))
sns.countplot(x="age_range", hue = "charges_range", data = insurance_ds, ax=axarr[0][0])
sns.countplot(x="bmi_range", hue = "charges_range", data = insurance_ds, ax=axarr[0][1])
sns.countplot(x="children_range", hue = "charges_range", data = insurance_ds, ax=axarr[1][0])
sns.countplot(x="sex", hue = "charges_range", data = insurance_ds, ax=axarr[1][1])
sns.countplot(x="smoker", hue = "charges_range", data = insurance_ds, ax=axarr[2][0])
sns.countplot(x="region", hue = "charges_range", data = insurance_ds, ax=axarr[2][1])

#Deleting original categorical columns

insurance_ds.drop(["sex", "smoker", "region", "age_range", "bmi_range", "children_range", "charges_range"], axis=1, inplace=True)

#Plotting a Heatmap

fig, ax = plt.subplots(1, figsize=(15,15))
sns.heatmap(insurance_ds.corr(), annot=True, fmt=",.2f")
plt.title("Heatmap Correlation", fontsize=20)
plt.tick_params(labelsize=12)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

#Plotting a Pairplot

sns.pairplot(insurance_ds)

In [None]:
#Plotting a Feature Importance

from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot
#Defining Xs and y
X = insurance_ds.drop(["charges"], axis=1)
y = insurance_ds["charges"]
#Defining the model
model = RandomForestRegressor().fit(X, y)
#Getting importance
importance = model.feature_importances_
#Summarizing feature importance
for i,v in enumerate(importance):
    print("Feature:{0:} - Score:{1:,.4f}".format(X.columns[i], v))
#Plotting feature importance
pd.Series(model.feature_importances_[::-1], index=X.columns[::-1]).plot.barh(figsize=(25,25))
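
The scores above are the impurity-based importances that a random forest exposes through `feature_importances_`. A complementary check, sketched here on synthetic data rather than the insurance dataset, is scikit-learn's `permutation_importance`, which shuffles one column at a time and measures how much the score drops. The names `X_toy`, `y_toy` and `forest` are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic data: column 0 drives the target, column 1 is pure noise
X_toy = rng.normal(size=(300, 2))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shuffle each column in turn and record how much the R2 score drops
result = permutation_importance(forest, X_toy, y_toy, n_repeats=5, random_state=0)
for name, mean_drop in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```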


# 8. Data Modelling

In [None]:
#Defining Xs and y

X = insurance_ds[["smoker_no", "bmi", "age", "children"]]
y = insurance_ds[["charges"]]

#Scaling all features

from sklearn.preprocessing import MinMaxScaler
sc_X = MinMaxScaler()
X_scaled = sc_X.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)

#Setting train/test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)
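
The cell above fits the scaler on all rows before splitting. A common alternative is to learn the minimum and range from the training rows only, so that the test rows do not influence the scaling. The sketch below shows the min-max formula `(x - min) / (max - min)` by hand with NumPy; the helper names and toy rows are hypothetical.

```python
import numpy as np

def fit_min_max(X_train):
    """Return per-column minimum and range learned from the training rows only."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(X, col_min, col_range):
    """Scale X with parameters that were learned elsewhere."""
    return (X - col_min) / col_range

# Toy rows: columns are (bmi, age)
X_train_toy = np.array([[20.0, 18.0], [30.0, 40.0], [40.0, 64.0]])
X_test_toy = np.array([[25.0, 70.0]])

col_min, col_range = fit_min_max(X_train_toy)
X_train_toy_scaled = apply_min_max(X_train_toy, col_min, col_range)
X_test_toy_scaled = apply_min_max(X_test_toy, col_min, col_range)
print(X_train_toy_scaled)
print(X_test_toy_scaled)  # the age of 70 lands above 1 because it exceeds the training maximum
```

Values outside the 0 to 1 interval on unseen rows are expected with this approach and simply mean the row lies outside the range seen during training.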

# 9. Machine Learning Algorithms Implementation & Assessment

# 9.1 Polynomial Regression

In [None]:
#Creating a Polynomial Regression model and checking its Metrics

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

#Creating a Linear Regressor
lin_regressor = LinearRegression()

#Trying different polynomial degrees
degrees = [1, 2, 3, 4, 5]
print("Testing degrees:")
for a in degrees:
    poly = PolynomialFeatures(degree=a)
    X_train_degree = poly.fit_transform(X_train)
    X_test_degree = poly.transform(X_test)
    model_pr = lin_regressor.fit(X_train_degree, y_train)
    y_preds_train = model_pr.predict(X_train_degree)
    y_preds_test = model_pr.predict(X_test_degree)
    score_train = r2_score(y_train, y_preds_train)
    score_test = r2_score(y_test, y_preds_test)
    mse_train = mean_squared_error(y_train, y_preds_train)
    mse_test = mean_squared_error(y_test, y_preds_test)
    print("Train: Degree:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_train, mse_train, np.sqrt(mse_train)))
    print("Test : Degree:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_test, mse_test, np.sqrt(mse_test)))       
print("")

#Choosing the best polynomial degree
chosen_degree = 2
poly = PolynomialFeatures(degree=chosen_degree)

#Working on X_train & X_test in the polynomial chosen degree
X_train_degree = poly.fit_transform(X_train)
X_test_degree = poly.transform(X_test)

#Fitting to the Linear Regressor
model_pr = lin_regressor.fit(X_train_degree, y_train)
print(f"Linear Regression Intercept: {model_pr.intercept_}")
print(f"Linear Regression Coefficients: {model_pr.coef_}, \n")

#Getting the predictions & Metrics
y_preds_train = model_pr.predict(X_train_degree)
y_preds_test = model_pr.predict(X_test_degree)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print("Chosen degree:")
print("Train: Degree:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(chosen_degree, score_train, mse_train, np.sqrt(mse_train)))
print("Test : Degree:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(chosen_degree, score_test, mse_test, np.sqrt(mse_test)))   

#Visualizing y_pred in the dataset
X_degree = poly.transform(X_scaled)
y_pred_all = model_pr.predict(X_degree)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_pr.xlsx")
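
To make the polynomial step more concrete, the sketch below shows what `PolynomialFeatures(degree=2)` produces for a single toy row with two features. The values are made up; with the four features selected in section 8, the same expansion yields 15 columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with two features, a = 2 and b = 3
row = np.array([[2.0, 3.0]])

poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform(row)

# Columns are ordered as: 1, a, b, a^2, a*b, b^2
print(expanded)

# With four input features, degree 2 gives 15 output columns
poly_four = PolynomialFeatures(degree=2).fit(np.zeros((1, 4)))
print(poly_four.n_output_features_)
```

A linear regression fitted on these expanded columns is what the section calls polynomial regression.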

# 9.2 Ridge Regression

In [None]:
#Creating a Ridge Regression model and checking its Metrics

from sklearn.linear_model import Ridge

#Trying different alphas
alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
print("Testing alphas:")
for a in alphas:
    model_ridge = Ridge(alpha=a).fit(X_train, y_train)  #features are already min-max scaled, so no extra normalization is applied
    y_preds_train = model_ridge.predict(X_train)
    y_preds_test = model_ridge.predict(X_test)
    score_train = r2_score(y_train, y_preds_train)
    score_test = r2_score(y_test, y_preds_test)
    mse_train = mean_squared_error(y_train, y_preds_train)
    mse_test = mean_squared_error(y_test, y_preds_test)
    print("Train: Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_train, mse_train, np.sqrt(mse_train)))
    print("Test : Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_test, mse_test, np.sqrt(mse_test)))
print("")

#Choosing the best alpha
a_final = 0.000001
model_ridge = Ridge(alpha=a_final).fit(X_train, y_train)
y_preds_train = model_ridge.predict(X_train)
y_preds_test = model_ridge.predict(X_test)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print(f"Ridge Regression Intercept: {model_ridge.intercept_}")
print(f"Ridge Regression Coefficients: {model_ridge.coef_}, \n")
print("Chosen alpha:")
print("Train: Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_train, mse_train, np.sqrt(mse_train)))
print("Test : Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_test, mse_test, np.sqrt(mse_test)))

#Plotting
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_preds_test, lw=0.8, color="red", label="predicted")
plt.legend()

#Visualizing y_pred in the dataset
y_pred_all = model_ridge.predict(X_scaled)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_ridge.xlsx")
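
To show what the `alpha` parameter does, here is a minimal NumPy sketch of the textbook closed-form ridge solution `w = (X^T X + alpha * I)^-1 X^T y` on centred synthetic data. It is an illustration of the penalty, not a description of the solver scikit-learn uses internally; the function name and data are assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for centred X and y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Centre the data so the intercept can be left out of the penalty
X_c = X_toy - X_toy.mean(axis=0)
y_c = y_toy - y_toy.mean()

for alpha in [0.0, 1.0, 100.0]:
    w = ridge_closed_form(X_c, y_c, alpha)
    print(f"alpha={alpha:>6}: coefficients={np.round(w, 3)}, norm={np.linalg.norm(w):.3f}")
```

With `alpha=0` the result matches ordinary least squares, and the coefficient norm shrinks as `alpha` grows, which is why very small alphas behave almost like plain linear regression.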

# 9.3 RidgeCV Regression

In [None]:
#Creating a RidgeCV Regression model and checking its Metrics
from sklearn.linear_model import RidgeCV

#Choosing the best alpha
model_ridge_cv = RidgeCV(alphas=alphas).fit(X_train, y_train)
y_preds_train = model_ridge_cv.predict(X_train)
y_preds_test = model_ridge_cv.predict(X_test)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print(f"RidgeCV Regression Intercept: {model_ridge_cv.intercept_}")
print(f"RidgeCV Regression Coefficients: {model_ridge_cv.coef_}, \n")
print("Train: R2:{0:,.3f}, MSE:{1:,.2f}, RMSE:{2:,.2f}".format(score_train, mse_train, np.sqrt(mse_train)))
print("Test : R2:{0:,.3f}, MSE:{1:,.2f}, RMSE:{2:,.2f}".format(score_test, mse_test, np.sqrt(mse_test)))

#Plotting
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_preds_test, lw=0.8, color="red", label="predicted")
plt.legend()

#Visualizing y_pred in the dataset
y_pred_all = model_ridge_cv.predict(X_scaled)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_ridge_cv.xlsx")
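
The idea behind choosing alpha by cross-validation can be sketched by hand. The example below uses plain 5-fold cross-validation on synthetic data with a closed-form ridge fit; note that `RidgeCV` with its default settings uses an efficient leave-one-out scheme instead, so this is an illustration of the principle rather than a reimplementation. All names and data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centred data; returns coefficients and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def cv_mse(X, y, alpha, n_folds=5):
    """Average validation MSE of ridge over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        w, b = ridge_fit(X[train_mask], y[train_mask], alpha)
        preds = X[val_idx] @ w + b
        errors.append(np.mean((y[val_idx] - preds) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {a: cv_mse(X_toy, y_toy, a) for a in candidate_alphas}
best_alpha = min(scores, key=scores.get)
print(scores)
print("best alpha:", best_alpha)
```

Because each alpha is scored on rows it was not fitted on, the test set stays untouched until the final evaluation.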

# 9.4 Lasso Regression

In [None]:
#Creating a Lasso Regression model and checking its Metrics

from sklearn.linear_model import Lasso

#Trying different alphas
alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
print("Testing alphas:")
for a in alphas:
    model_lasso = Lasso(alpha=a).fit(X_train, y_train)  #features are already min-max scaled, so no extra normalization is applied
    y_preds_train = model_lasso.predict(X_train)
    y_preds_test = model_lasso.predict(X_test)
    score_train = r2_score(y_train, y_preds_train)
    score_test = r2_score(y_test, y_preds_test)
    mse_train = mean_squared_error(y_train, y_preds_train)
    mse_test = mean_squared_error(y_test, y_preds_test)
    print("Train: Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_train, mse_train, np.sqrt(mse_train)))
    print("Test : Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_test, mse_test, np.sqrt(mse_test)))
print("")

#Choosing the best alpha
a_final = 0.000001
model_lasso = Lasso(alpha=a_final).fit(X_train, y_train)
y_preds_train = model_lasso.predict(X_train)
y_preds_test = model_lasso.predict(X_test)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print(f"Lasso Regression Intercept: {model_lasso.intercept_}")
print(f"Lasso Regression Coefficients: {model_lasso.coef_}, \n")
print("Chosen alpha:")
print("Train: Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_train, mse_train, np.sqrt(mse_train)))
print("Test : Alpha:{0:,.6f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_test, mse_test, np.sqrt(mse_test)))

#Plotting
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_preds_test, lw=0.8, color="red", label="predicted")
plt.legend()

#Visualizing y_pred in the dataset
y_pred_all = model_lasso.predict(X_scaled)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_lasso.xlsx")
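
Unlike ridge, the L1 penalty used by Lasso can set coefficients exactly to zero. In the simplified case of orthonormal features, the lasso solution is the least-squares solution passed through a soft-thresholding operator, which the sketch below illustrates. The coefficient values are hypothetical and the threshold stands in for the effect of `alpha`.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; values within the threshold become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Hypothetical unpenalised coefficients, for illustration only
coefs = np.array([3.0, -0.4, 0.05, -2.0])

for threshold in [0.0, 0.1, 0.5]:
    print(f"threshold={threshold}: {soft_threshold(coefs, threshold)}")
```

Small coefficients vanish first as the threshold grows, which is why Lasso is often described as performing feature selection.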

# 9.5 LassoCV Regression

In [None]:
#Creating a LassoCV Regression model and checking its Metrics

from sklearn.linear_model import LassoCV

#Choosing the best alpha
model_lasso_cv = LassoCV(alphas=alphas).fit(X_train, y_train.values.ravel())
y_preds_train = model_lasso_cv.predict(X_train)
y_preds_test = model_lasso_cv.predict(X_test)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print(f"LassoCV Regression Intercept: {model_lasso_cv.intercept_}")
print(f"LassoCV Regression Coefficients: {model_lasso_cv.coef_}, \n")
print("Train: R2:{0:,.3f}, MSE:{1:,.2f}, RMSE:{2:,.2f}".format(score_train, mse_train, np.sqrt(mse_train)))
print("Test : R2:{0:,.3f}, MSE:{1:,.2f}, RMSE:{2:,.2f}".format(score_test, mse_test, np.sqrt(mse_test)))

#Plotting
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_preds_test, lw=0.8, color="red", label="predicted")
plt.legend()

#Visualizing y_pred in the dataset
y_pred_all = model_lasso_cv.predict(X_scaled)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_lasso_cv.xlsx")

# 9.6 Random Forest Regression

In [None]:
#Creating a Random Forest Regression model and checking its Metrics

from sklearn.ensemble import RandomForestRegressor

#Trying different depths
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print("Testing depths:")
for a in depths:
    model_rf = RandomForestRegressor(max_depth=a, random_state=0).fit(X_train,y_train.values.ravel()) 
    y_preds_train = model_rf.predict(X_train)
    y_preds_test = model_rf.predict(X_test)
    score_train = r2_score(y_train, y_preds_train)
    score_test = r2_score(y_test, y_preds_test)
    mse_train = mean_squared_error(y_train, y_preds_train)
    mse_test = mean_squared_error(y_test, y_preds_test)
    print("Train: Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_train, mse_train, np.sqrt(mse_train)))
    print("Test : Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_test, mse_test, np.sqrt(mse_test)))
print("")

#Choosing the best depth
a_final = 4
model_rf = RandomForestRegressor(max_depth=a_final, random_state=0).fit(X_train,y_train.values.ravel()) 
y_preds_train = model_rf.predict(X_train)
y_preds_test = model_rf.predict(X_test)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print("Chosen depth:")
print("Train: Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_train, mse_train, np.sqrt(mse_train)))
print("Test : Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_test, mse_test, np.sqrt(mse_test)))

#Plotting
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_preds_test, lw=0.8, color="red", label="predicted")
plt.legend()

#Visualizing y_pred in the dataset
y_pred_all = model_rf.predict(X_scaled)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_rf.xlsx")

# 9.7 XGBoost Regression

In [None]:
#Creating a XGBoost Regression model and checking its Metrics

from xgboost import XGBRegressor

#Trying different depths
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print("Testing depths:")
for a in depths:
    model_xgb = XGBRegressor(max_depth=a, random_state=0).fit(X_train,y_train) 
    y_preds_train = model_xgb.predict(X_train)
    y_preds_test = model_xgb.predict(X_test)
    score_train = r2_score(y_train, y_preds_train)
    score_test = r2_score(y_test, y_preds_test)
    mse_train = mean_squared_error(y_train, y_preds_train)
    mse_test = mean_squared_error(y_test, y_preds_test)
    print("Train: Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_train, mse_train, np.sqrt(mse_train)))
    print("Test : Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a, score_test, mse_test, np.sqrt(mse_test)))
print("")

#Choosing the best depth
a_final = 2
model_xgb = XGBRegressor(max_depth=a_final, random_state=0).fit(X_train,y_train) 
y_preds_train = model_xgb.predict(X_train)
y_preds_test = model_xgb.predict(X_test)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print("Chosen depth:")
print("Train: Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_train, mse_train, np.sqrt(mse_train)))
print("Test : Depth:{0:,.0f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(a_final, score_test, mse_test, np.sqrt(mse_test)))

#Plotting
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_preds_test, lw=0.8, color="red", label="predicted")
plt.legend()

#Visualizing y_pred in the dataset
y_pred_all = model_xgb.predict(X_scaled)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_xgb.xlsx")

# 9.8 Deep Learning Regression

In [None]:
#Creating a Deep Learning Regression model and checking its Metrics

from keras import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping,ReduceLROnPlateau

#Creating a model
model_dl = Sequential()

#Input and First Hidden Layer
model_dl.add(Dense(units=256, activation="relu", input_dim=X_train.shape[1]))
#Second Hidden Layer
model_dl.add(Dense(units=256, activation="relu"))
#Third Hidden Layer
model_dl.add(Dense(units=256, activation="relu"))
#Output Layer
model_dl.add(Dense(units=1))

#Compiling the neural network
model_dl.compile(optimizer="adam",loss="mean_squared_error")

#Fitting to the model
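#A validation split is needed because both callbacks monitor val_loss; min_lr must stay below Adam's default learning rate (0.001) to have any effect
model_dl.fit(X_train, y_train, validation_split=0.2, callbacks=[EarlyStopping(monitor="val_loss", patience=10), ReduceLROnPlateau(monitor="val_loss", min_lr=1e-5)], epochs=250)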

#Getting the predictions & Metrics
y_preds_train = model_dl.predict(X_train)
y_preds_test = model_dl.predict(X_test)
score_train = r2_score(y_train, y_preds_train)
score_test = r2_score(y_test, y_preds_test)
mse_train = mean_squared_error(y_train, y_preds_train)
mse_test = mean_squared_error(y_test, y_preds_test)
print("Train: R2:{0:,.3f}, MSE:{1:,.2f}, RMSE:{2:,.2f}".format(score_train, mse_train, np.sqrt(mse_train)))
print("Test : R2:{0:,.3f}, MSE:{1:,.2f}, RMSE:{2:,.2f}".format(score_test, mse_test, np.sqrt(mse_test)))

#Plotting
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_preds_test, lw=0.8, color="red", label="predicted")
plt.legend()

#Visualizing y_pred in the dataset
y_pred_all = model_dl.predict(X_scaled)
insurance_ds["charges_predicted"] = y_pred_all
insurance_ds.to_excel("model_dl.xlsx")

# 10. Model Deployment

# 10.1 Entering Independent Variables in Python

In [None]:
#Entering Xs

# smoker_no_input = str(input("Smoker (Yes/No)? "))
# if smoker_no_input == "No":
#     smoker_no_input = int(1)
# else:
#     smoker_no_input = int(0)
# bmi_input = float(input("Enter the BMI: "))
# age_input = float(input("Enter the Age: "))
# children_input = int(input("Children (number of): "))
    
#Defining Xs

# X_mod_dep = pd.DataFrame({"smoker_no": [smoker_no_input], "bmi": [bmi_input], "age": [age_input], "children": [children_input]})
#Example for random client
X_mod_dep = pd.DataFrame({"smoker_no": [1], "bmi": [33], "age": [45], "children": [1]})
    
    
#Scaling X_mod_dep with the scaler already fitted in section 8, so the new row is transformed exactly like the training data

X_mod_dep_scaled = pd.DataFrame(sc_X.transform(X_mod_dep))

#Predicting results

print(f"Predicted Charge = {model_xgb.predict(X_mod_dep_scaled)[0]:,.2f}.")

# 10.2 Entering Independent Variables in a Web Page

Note: this part is meant to be deployed on localhost or on a paid cloud service.

In [None]:
#Copying html content to new editable directory

!mkdir static
!mkdir templates
!cp -r "../input/Medical-Cost/static/input_page.html" "/kaggle/working/static"
!cp -r "../input/Medical-Cost/static/input_page.css" "/kaggle/working/static"
!cp -r "../input/Medical-Cost/templates/input_page.html" "/kaggle/working/templates"
!cp -r "../input/Medical-Cost/templates/input_page.css" "/kaggle/working/templates"

!ls

In [None]:
#Creating the pickle file with the fitted XGBoost model chosen above
import pickle

with open("model.pkl", "wb") as f:
    pickle.dump(model_xgb, f)

#Making an API which receives client details through a GUI and computes the predicted charge based on our model
from flask import Flask, request, jsonify, render_template

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

#Feature order used when the model was trained
feature_names = ["smoker_no", "bmi", "age", "children"]

def predict_charge(values):
    #Scaling the raw inputs with the fitted scaler before predicting, as done for the training data
    raw = pd.DataFrame([values], columns=feature_names)
    scaled = pd.DataFrame(sc_X.transform(raw))
    return float(model.predict(scaled)[0])

@app.route('/')
def home():
    return render_template('input_page.html')

@app.route('/predict',methods=['POST'])
def predict():

    #Form fields are assumed to arrive in the order smoker_no, bmi, age, children
    features = [float(x) for x in request.form.values()]
    output = round(predict_charge(features), 2)

    return render_template('input_page.html', prediction_text='Predicted Charge = {}'.format(output))

@app.route('/results',methods=['POST'])
def results():

    data = request.get_json(force=True)
    #Reading values by name so the JSON key order does not matter
    output = predict_charge([float(data[name]) for name in feature_names])
    return jsonify(output)

if __name__ == "__main__":
    app.run(debug=True)


#Using requests module to call APIs
#app.run() blocks, so run this part from a separate process while the server is up
import requests

url = 'http://localhost:5000/results'
r = requests.post(url,json={'smoker_no':0, 'bmi':0, 'age':0, 'children':0})

print(r.json())

# Open http://127.0.0.1:5000/ in your web browser to see the input form.
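
When a JSON payload is turned into model inputs, the key order of the request should not decide which value goes into which feature. The sketch below shows one way to enforce the training order in plain Python. The helper `payload_to_features` and the constant `FEATURE_ORDER` are hypothetical names introduced for this example.

```python
FEATURE_ORDER = ["smoker_no", "bmi", "age", "children"]  # order used for X in section 8

def payload_to_features(payload: dict) -> list:
    """Return feature values in training order, failing loudly on missing keys."""
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    return [float(payload[name]) for name in FEATURE_ORDER]

# Keys arrive in a different order than the model expects
example_payload = {"age": 45, "children": 1, "bmi": 33, "smoker_no": 1}
features = payload_to_features(example_payload)
print(features)
```

Raising an error on a missing field gives the caller a clearer message than a failure deep inside the model's `predict` call.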

# 11. Conclusions

In this project we went through the whole process: defining the business objective, collecting data, exploring features and distributions, treating data, understanding correlations, selecting relevant features, modelling the data and comparing 8 different algorithms by their metrics, in order to select the best one for predicting the price to charge customers, which is vital for an insurance company that wants to minimize losses and maximize revenue. The chosen model was XGBoost, with an R2 of around 0.85.
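
For reference, the R2 and RMSE figures printed throughout section 9 follow the usual definitions, which can be reproduced by hand. The sketch below uses made-up charges, not values from the dataset, and the function name `r2_and_rmse` is an assumption.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared error)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual_ss = np.sum((y_true - y_pred) ** 2)
    total_ss = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - residual_ss / total_ss
    rmse = np.sqrt(residual_ss / len(y_true))
    return r2, rmse

# Toy charges, not taken from the dataset
y_true_toy = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred_toy = [1100.0, 1900.0, 3200.0, 3800.0]
r2, rmse = r2_and_rmse(y_true_toy, y_pred_toy)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f}")
```

An R2 of 0.85 means the model explains about 85% of the variance in charges, which is different from classification accuracy.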