## Predicting the cost of health insurance for a person

A major insurance company, 4Geeks Insurance S.L., wants to calculate, from the physiological data of its customers, the premium (cost) that each of them should pay. To do this, it has assembled a team of doctors and, using data from other companies and a dedicated study, has gathered a dataset to train a predictive model.

### Step 0: Import Libraries

In [11]:
# Libraries
import pandas as pd

# Reading the CSV straight from a URL can fail locally with an SSL certificate error,
# so it is recommended to download it with requests and parse the text instead
import requests
from io import StringIO

# Graphics
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression


### Step 1: Load the Dataset

In [2]:
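# Download the dataset and load it into a DataFrame
url = "https://raw.githubusercontent.com/4GeeksAcademy/linear-regression-project-tutorial/main/medical_insurance_cost.csv"
response = requests.get(url)
response.raise_for_status()  # Fail early if the download did not succeed
data = pd.read_csv(StringIO(response.text))
data.info()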

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [3]:
# Save the raw dataset to data/raw
data.to_csv("../data/raw/example.csv", index=False)

### Step 2: Preprocessing

In [12]:
# Drop duplicate rows and reset the index
df_raw = data.drop_duplicates().reset_index(drop=True)
df_raw.head()

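|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |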


In [14]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB
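The two `info()` outputs show 1338 rows before the cleanup and 1337 after it, so one row was an exact duplicate. As a minimal sketch of how this can be checked before dropping anything, the snippet below counts duplicates on a small toy frame. The values in `toy` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): the third row repeats the first one exactly
toy = pd.DataFrame({
    "age": [19, 18, 19],
    "smoker": ["yes", "no", "yes"],
    "charges": [16884.924, 1725.5523, 16884.924],
})

# duplicated() flags every repeated row except its first occurrence
n_duplicates = int(toy.duplicated().sum())

# Same cleanup as in the notebook: drop the repeats and rebuild the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Running `data.duplicated().sum()` on the real frame before the cleanup should report the number of rows that `drop_duplicates()` is about to remove.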


In [18]:
# Encode the categorical variables as integers, then scale every variable to [0, 1]
df_raw['sex_n'] = pd.factorize(df_raw['sex'])[0]
df_raw['smoker_n'] = pd.factorize(df_raw['smoker'])[0]
df_raw['region_n'] = pd.factorize(df_raw['region'])[0]
num_variables = ['age', 'bmi', 'children', 'sex_n', 'smoker_n', 'region_n', 'charges']

scaler = MinMaxScaler()
scal_features = scaler.fit_transform(df_raw[num_variables])
df_raw_scal = pd.DataFrame(scal_features, index=df_raw.index, columns=num_variables)
df_raw_scal.head()

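|   | age | bmi | children | sex_n | smoker_n | region_n | charges |
|---|-----|-----|----------|-------|----------|----------|---------|
| 0 | 0.021739 | 0.321227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.251611 |
| 1 | 0.0 | 0.47915 | 0.2 | 1.0 | 1.0 | 0.333333 | 0.009636 |
| 2 | 0.217391 | 0.458434 | 0.6 | 1.0 | 1.0 | 0.333333 | 0.053115 |
| 3 | 0.326087 | 0.181464 | 0.0 | 1.0 | 1.0 | 0.666667 | 0.33301 |
| 4 | 0.304348 | 0.347592 | 0.0 | 1.0 | 1.0 | 0.666667 | 0.043816 |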
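`pd.factorize` numbers the categories in order of first appearance. Since the first row of the dataset is a female smoker from the southwest, those three values are coded as 0, which means `smoker_n = 1` stands for a non-smoker. The sketch below shows how the mapping can be recovered from the second value that `factorize` returns. The `smoker` series is a hand-written stand-in for the real column.

```python
import pandas as pd

# Stand-in for the 'smoker' column (hypothetical values, same first entry as the dataset)
smoker = pd.Series(["yes", "no", "no", "no", "yes"])

# factorize returns the integer codes and the unique labels in order of appearance
codes, uniques = pd.factorize(smoker)

# Build an explicit label -> code dictionary to document the encoding
mapping = {label: code for code, label in enumerate(uniques)}

print(codes.tolist())
print(mapping)
```

Keeping such a dictionary for `sex`, `smoker` and `region` makes the encoded columns easier to interpret later. Note that the integer codes for `region` impose an order on a variable that has none, so one-hot encoding (for example with `pd.get_dummies`) may be worth considering as an alternative.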


In [19]:
df_raw_scal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   float64
 1   bmi       1337 non-null   float64
 2   children  1337 non-null   float64
 3   sex_n     1337 non-null   float64
 4   smoker_n  1337 non-null   float64
 5   region_n  1337 non-null   float64
 6   charges   1337 non-null   float64
dtypes: float64(7)
memory usage: 73.2 KB
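`MinMaxScaler` rescales each column with `(x - min) / (max - min)`. In the scaled `head()` output, age 18 becomes 0.0 and age 19 becomes 0.021739, which is consistent with a minimum age of 18 and a maximum of 64, because 1 / 46 ≈ 0.021739. The maximum of 64 is inferred from that ratio and does not appear in the outputs shown here. A minimal sketch that reproduces the calculation and then undoes it:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three ages: the assumed minimum (18), one sample value (19), the inferred maximum (64)
ages = np.array([[18.0], [19.0], [64.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(ages)

# The same transformation written out by hand
manual = (ages - ages.min()) / (ages.max() - ages.min())

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```

Because `charges` is scaled together with the features, predictions from a model trained on `df_raw_scal` are also in the [0, 1] range, and something like `inverse_transform` is needed to express them as premiums again.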


### Step 3: EDA

In [22]:
# Split into train and test sets
df_train, df_test = train_test_split(df_raw_scal, test_size=0.2, random_state=2024)
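In this notebook the scaler is fitted on the full dataset before the split, so the minimum and maximum of the test rows influence the scaling of the training rows. A common alternative, sketched below, is to split first and fit the scaler on the training set only. This is not what the notebook does, and the `toy` frame with its random values is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy frame with random values (hypothetical, not the insurance data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=50),
    "bmi": rng.uniform(16, 50, size=50),
})

# 1. Split first
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=2024)

# 2. Learn min and max from the training rows only
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train),
                            index=toy_train.index, columns=toy.columns)

# 3. Reuse those training statistics on the test rows
test_scaled = pd.DataFrame(scaler.transform(toy_test),
                           index=toy_test.index, columns=toy.columns)

print(train_scaled.describe().T[["min", "max"]])
```

With this order the training columns span exactly [0, 1], while test values can fall slightly outside that range when the test set contains a value beyond the training minimum or maximum.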

In [23]:
display(df_train.describe(include= 'number').T)

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| age | 1069.0 | 0.46429 | 0.305705 | 0.0 | 0.195652 | 0.456522 | 0.717391 | 1.0 |
| bmi | 1069.0 | 0.395705 | 0.165725 | 0.0 | 0.276029 | 0.38593 | 0.50686 | 1.0 |
| children | 1069.0 | 0.220393 | 0.240153 | 0.0 | 0.0 | 0.2 | 0.4 | 1.0 |
| sex_n | 1069.0 | 0.500468 | 0.500234 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| smoker_n | 1069.0 | 0.789523 | 0.407838 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| region_n | 1069.0 | 0.496102 | 0.371294 | 0.0 | 0.333333 | 0.333333 | 0.666667 | 1.0 |
| charges | 1069.0 | 0.198248 | 0.196882 | 0.0 | 0.05797 | 0.134398 | 0.254139 | 1.0 |
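Two figures in this summary can be checked with simple arithmetic. The count of 1069 follows from the 80/20 split of 1337 rows, where `train_test_split` rounds the test size up. The mean of `smoker_n` is the share of training rows coded 1, and since the first row of the dataset (a smoker) was coded 0, that share corresponds to non-smokers. A short sketch of both checks, using only numbers from the outputs above:

```python
import math

n_rows = 1337        # rows after dropping duplicates
test_size = 0.2

# train_test_split rounds the test set up and gives the remainder to training
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Mean of a 0/1 column times the row count gives the number of rows coded 1
smoker_n_mean = 0.789523
n_non_smokers = round(smoker_n_mean * n_train)
n_smokers = n_train - n_non_smokers

print(n_train, n_test, n_non_smokers, n_smokers)
```

So roughly 79% of the training rows are non-smokers, which is worth keeping in mind when reading the distribution of `charges`.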
