# U.S. Medical Insurance Costs

### Import libraries

In [1]:
import numpy as np
import pandas as pd

### Load CSV

In [2]:
medical_insurance = pd.read_csv("insurance.csv")

### Inspect CSV

In [3]:
medical_insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### Checking for duplicates 

In [3]:
duplicates = medical_insurance.duplicated()
print(duplicates.value_counts())

False    1337
True        1
dtype: int64


In [4]:
medical_insurance = medical_insurance.drop_duplicates()
duplicates = medical_insurance.duplicated()
print(duplicates.value_counts())

False    1337
dtype: int64
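Before dropping a duplicate it can help to look at the rows involved. The sketch below uses a small made-up frame (`toy`, not the real `insurance.csv`) to show how `duplicated(keep=False)` marks every member of a duplicate group, while `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy frame (not the real insurance.csv) with one repeated row.
toy = pd.DataFrame({
    "age": [19, 18, 19, 28],
    "sex": ["male", "male", "male", "female"],
    "charges": [1639.56, 1725.55, 1639.56, 4449.46],
})

# keep=False flags every copy of a repeated row, not only the later ones.
repeated_rows = toy[toy.duplicated(keep=False)]
print(repeated_rows)

# drop_duplicates keeps the first occurrence of each repeated row by default.
deduplicated = toy.drop_duplicates()
print(len(toy), "->", len(deduplicated))
```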


### Organize data into Series

In [4]:
age = medical_insurance.age
sex = medical_insurance.sex
bmi = medical_insurance.bmi
children = medical_insurance.children
smoker = medical_insurance.smoker
region = medical_insurance.region
charges = medical_insurance.charges

### Perform an analysis of the following points:
* average age of the patients
* number of males vs. females in the dataset
* geographical location of the patients
* average yearly medical charges of the patients
* difference in costs between smokers and non-smokers
* difference in costs between males and females
* effect of BMI on the final cost
* effect of having children on the final cost

### Average age

In [26]:
age_average = np.mean(age)
print(round(age_average, 2))

39.22


The average age of the patients is 39.22 years, which suggests that the sample is not dominated by very young or very old patients, either of which could skew the charges.

In [18]:
high_age_avg = medical_insurance.loc[medical_insurance['age'] > medical_insurance.age.mean(), 'charges'].mean()
high_age_avg

16430.512562364456

In [19]:
low_age_avg = medical_insurance.loc[medical_insurance['age'] < medical_insurance.age.mean(), 'charges'].mean()
low_age_avg

10157.217580636494

Customers older than the mean age have higher average charges: about 16,431 versus 10,157, or roughly 1.6 times as much.
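A split at the mean gives only two groups. As a finer-grained sketch, the ages can be binned into bands with `pd.cut` and the charges averaged per band. The helper name `mean_charges_by_age_band`, the band edges, and the toy values are all assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def mean_charges_by_age_band(df, edges=(17, 29, 39, 49, 64)):
    """Average charges per age band; the band edges are an arbitrary choice."""
    # pd.cut builds right-closed intervals: (17, 29], (29, 39], ...
    bands = pd.cut(df["age"], bins=list(edges))
    return df.groupby(bands, observed=True)["charges"].mean()

# Hypothetical toy data, only to show the shape of the result.
toy = pd.DataFrame({
    "age": [20, 25, 35, 45, 60],
    "charges": [1000, 3000, 5000, 7000, 9000],
})
by_band = mean_charges_by_age_band(toy)
print(by_band)
```

On the real data the call would be `mean_charges_by_age_band(medical_insurance)`.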

### Male vs. female patients

In [11]:
sex_different = sex.value_counts()
sex_different

male      675
female    662
Name: sex, dtype: int64

The dataset contains a similar number of male and female patients (675 and 662), so it is well balanced with respect to sex.

In [16]:
female_avg = medical_insurance.loc[medical_insurance['sex'] == 'female', 'charges'].mean()
female_avg

12569.57884383534

In [17]:
male_avg = medical_insurance.loc[medical_insurance['sex'] == 'male', 'charges'].mean()
male_avg

13974.998863762954

On average, males are charged about 1,400 more than females. This gap could have several causes, so it is important to keep it in mind for further analysis.
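One way to probe such a gap is to break it down by a second variable, for example smoking status. The sketch below uses made-up numbers, chosen so that the overall difference between sexes comes entirely from a different share of smokers. It says nothing about what the real data contains.

```python
import pandas as pd

# Hypothetical toy data: two of three males smoke, one of three females does.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "smoker": ["yes", "yes", "no", "yes", "no", "no"],
    "charges": [30000.0, 32000.0, 8000.0, 31000.0, 8200.0, 7800.0],
})

# Overall means differ by sex...
by_sex = toy.groupby("sex")["charges"].mean()
print(by_sex)

# ...but within each smoking group the means are the same in this toy example.
by_sex_and_smoker = toy.groupby(["sex", "smoker"])["charges"].agg(["mean", "count"])
print(by_sex_and_smoker)
```

Running the same two `groupby` calls on `medical_insurance` would show whether the real gap persists within smokers and non-smokers.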

### Regions 

In [13]:
region_different = region.unique()
region_different

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [18]:
sw_avg = medical_insurance.loc[medical_insurance['region'] == 'southwest', 'charges'].mean()
sw_avg

12346.93737729231

In [20]:
nw_avg = medical_insurance.loc[medical_insurance['region'] == 'northwest', 'charges'].mean()
nw_avg

12450.840843950615

In [21]:
ne_avg = medical_insurance.loc[medical_insurance['region'] == 'northeast', 'charges'].mean()
ne_avg

13406.3845163858

In [19]:
se_avg = medical_insurance.loc[medical_insurance['region'] == 'southeast', 'charges'].mean()
se_avg

14735.411437609895

On average, southeast patients are charged the most: about 1,300 more than northeast patients and roughly 2,300 to 2,400 more than northwest and southwest patients.
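The four per-region cells can also be collapsed into a single `groupby`. This is a minimal sketch on made-up data. The helper name `mean_charges_by` is hypothetical, and the toy frame only mimics the `region` and `charges` columns.

```python
import pandas as pd

def mean_charges_by(df, column):
    """Mean of `charges` for each value of `column`, largest first."""
    return df.groupby(column)["charges"].mean().sort_values(ascending=False)

# Hypothetical toy data, not the real insurance.csv.
toy = pd.DataFrame({
    "region": ["southeast", "southwest", "southeast", "northeast", "southwest"],
    "charges": [15000.0, 12000.0, 14000.0, 13000.0, 12500.0],
})
by_region = mean_charges_by(toy, "region")
print(by_region)
```

The same helper would work for `sex` or `smoker`, for example `mean_charges_by(medical_insurance, "smoker")`.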

### Smokers

In [14]:
smoker_avg = medical_insurance.loc[medical_insurance['smoker'] == 'yes', 'charges'].mean()
smoker_avg

32050.23183153285

In [15]:
non_smoker_avg = medical_insurance.loc[medical_insurance['smoker'] == 'no', 'charges'].mean()
non_smoker_avg

8440.660306508935

As expected, smokers are charged far more than non-smokers: about 32,050 versus 8,441 on average, nearly four times as much.

### BMI

In [5]:
medical_insurance.bmi.max()

53.13

In [6]:
medical_insurance.bmi.min()

15.96

In [9]:
medical_insurance.bmi.mean()

30.663396860986538

In [11]:
high_bmi_avg = medical_insurance.loc[medical_insurance['bmi'] > medical_insurance.bmi.mean(), 'charges'].mean()
high_bmi_avg

15801.788405743035

In [12]:
low_bmi_avg = medical_insurance.loc[medical_insurance['bmi'] < medical_insurance.bmi.mean(), 'charges'].mean()
low_bmi_avg

10907.32612810548

Patients with a BMI above the mean (high-BMI patients) also have higher average charges, by about 4,900.
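A complementary check that does not depend on where the data is split is the Pearson correlation between BMI and charges. The sketch below runs on made-up values, so the printed coefficient says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data with charges rising roughly linearly with BMI.
toy = pd.DataFrame({
    "bmi": [20.0, 25.0, 30.0, 35.0, 40.0],
    "charges": [5000.0, 7500.0, 8500.0, 11000.0, 13500.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = toy["bmi"].corr(toy["charges"])
print(round(r, 3))
```

On the real data the equivalent call would be `medical_insurance.bmi.corr(medical_insurance.charges)`.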

### Children

In [13]:
no_children_avg = medical_insurance.loc[medical_insurance['children'] == 0, 'charges'].mean()
no_children_avg

12365.975601635882

In [16]:
children_avg = medical_insurance.loc[medical_insurance['children'] != 0, 'charges'].mean()
children_avg

13949.94109348167

Customers with children are charged about 1,600 more on average than customers without children. This gap is much smaller than those seen for smoking, age and BMI, and is comparable to the gap between males and females.

After analyzing the data, we found that the cost of the insurance is most strongly associated with smoking, followed by age and BMI, with smaller differences by region and gender.
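To compare the variables side by side, the group means reported in the sections above can be turned into gaps and ranked. This is only a rough sketch: it uses raw differences in means, ignores group sizes and interactions between variables, and for region it takes the highest and lowest regional means, which is an arbitrary choice.

```python
# Group means copied from the outputs above: (higher group, lower group).
reported_means = {
    "smoker": (32050.23183153285, 8440.660306508935),
    "age": (16430.512562364456, 10157.217580636494),
    "bmi": (15801.788405743035, 10907.32612810548),
    "region": (14735.411437609895, 12346.93737729231),  # southeast vs. southwest
    "children": (13949.94109348167, 12365.975601635882),
    "sex": (13974.998863762954, 12569.57884383534),
}

# Difference in average charges between the two groups of each variable.
gaps = {name: high - low for name, (high, low) in reported_means.items()}

# Variables ordered from the largest gap to the smallest.
ranking = sorted(gaps, key=gaps.get, reverse=True)
for name in ranking:
    print(f"{name:<9}{gaps[name]:>10.2f}")
```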

In [21]:
max_cost = medical_insurance.charges.max()
max_cost

63770.42801

In [32]:
medical_insurance.loc[medical_insurance.charges == medical_insurance.charges.max()]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
543,54,female,47.41,0,yes,southeast,63770.42801


As an example, the row above shows the customer with the highest charges: a 54-year-old smoker from the southeast region whose age and BMI (47.41) are both above the average.
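The same lookup can be written with `idxmax`, which returns the index label of the largest value, and the selected row can then be compared against the column means. The frame below is made up for illustration and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the real insurance.csv.
toy = pd.DataFrame({
    "age": [23, 54, 41],
    "bmi": [24.5, 47.41, 30.1],
    "smoker": ["no", "yes", "no"],
    "charges": [2400.0, 60000.0, 7100.0],
})

# idxmax gives the index label of the row with the highest charges.
top = toy.loc[toy["charges"].idxmax()]
print(top)

# Check whether that customer is above the mean for each numeric column.
above_mean = {col: bool(top[col] > toy[col].mean()) for col in ("age", "bmi")}
print(above_mean)
```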