# U.S. Medical Insurance Costs

## Project Goals
This project has 5 goals:

1. Find out the average age of the patients in the dataset.
2. Analyze where a majority of the individuals are from.
3. Look at the different costs between smokers vs. non-smokers.
4. Figure out what the average age is for someone who has at least one child in this dataset.
5. Figure out if there is a bias in the dataset in the number of males vs females.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("insurance.csv")
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


## Figure out if there is a bias in the dataset in the number of males vs females

In [3]:
num_males = len(df[df.sex == "male"])
num_males

676

In [4]:
num_females = len(df[df.sex == "female"])
num_females

662

In [5]:
print("Our dataset consists of {} males and {} females.".format(num_males, num_females))
print("Therefore we can conclude that the dataset is roughly balanced between males and females.")

Our dataset consists of 676 males and 662 females.
Therefore we can conclude that the dataset is roughly balanced between males and females.
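One way to put a number on how even the split is would be an exact two-sided binomial test against a 50/50 null. The sketch below is not part of the original notebook: the helper name `two_sided_binomial_p` is hypothetical, and the 50/50 null is an assumption made for illustration. It uses only the two counts printed above.

```python
from math import comb


def two_sided_binomial_p(successes: int, n: int) -> float:
    """Exact two-sided p-value for H0: p = 0.5 (hypothetical helper)."""
    # How far the observed count sits from the expected n / 2.
    deviation = abs(successes - n / 2)
    # Count every outcome at least that far from n / 2, in either direction.
    tail = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= deviation)
    # There are 2**n equally likely outcomes under H0; Python divides the
    # two exact integers without overflowing.
    return tail / 2**n


# Counts taken from the cell output above: 676 males out of 676 + 662 rows.
p_value = two_sided_binomial_p(676, 676 + 662)
print(f"p-value: {p_value:.3f}")
```

A large p-value would only indicate that the male/female split is compatible with an even chance of either sex. It says nothing about whether the sample matches a wider population on other attributes.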


## Find out the average age of the patients in the dataset.

In [6]:
round(np.mean(df["age"]))

39

## Analyze where a majority of the individuals are from.

In [7]:
df["region"]

0       southwest
1       southeast
2       southeast
3       northwest
4       northwest
          ...    
1333    northwest
1334    northeast
1335    southeast
1336    southwest
1337    northwest
Name: region, Length: 1338, dtype: object

In [8]:
unique_regions = df["region"].unique().tolist()
unique_regions

['southwest', 'southeast', 'northwest', 'northeast']

In [9]:
southwest_count = len(df[df.region == 'southwest'])
southeast_count = len(df[df.region == 'southeast'])
northwest_count = len(df[df.region == 'northwest'])
northeast_count = len(df[df.region == 'northeast'])
print("We have {} individuals from the southwest.".format(southwest_count))
print("We have {} individuals from the southeast.".format(southeast_count))
print("We have {} individuals from the northwest.".format(northwest_count))
print("We have {} individuals from the northeast.".format(northeast_count))

We have 325 individuals from the southwest.
We have 364 individuals from the southeast.
We have 325 individuals from the northwest.
We have 324 individuals from the northeast.


In [10]:
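print("Therefore we can conclude that no region holds a majority: the southeast is the largest group, and the other three regions are almost equal in size.")

Therefore we can conclude that no region holds a majority: the southeast is the largest group, and the other three regions are almost equal in size.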
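The per-region counts can also be produced in one step with `Series.value_counts`, and `normalize=True` turns them into shares, which makes the "majority" question direct to answer. This is a sketch rather than the notebook's own method: to stay self-contained it rebuilds a region column from the four counts printed above instead of reading `insurance.csv`, and the names `reported_counts`, `largest_region`, and `has_majority` are illustrative.

```python
import pandas as pd

# Rebuild a region column from the counts printed above (not the raw file).
reported_counts = {"southwest": 325, "southeast": 364, "northwest": 325, "northeast": 324}
regions = pd.Series(
    [name for name, count in reported_counts.items() for _ in range(count)],
    name="region",
)

counts = regions.value_counts()                # rows per region, largest first
shares = regions.value_counts(normalize=True)  # same, as fractions of the total

largest_region = shares.idxmax()          # region with the biggest share
has_majority = bool(shares.max() > 0.5)   # a majority means more than half
print(counts)
print(shares.round(3))
print(largest_region, has_majority)
```

On the real data, the same two calls would be applied to `df["region"]` directly.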


## Look at the different costs between smokers vs. non-smokers.

In [11]:
smokers = df[df.smoker == 'yes']
smokers

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
11,62,female,26.290,0,yes,southeast,27808.72510
14,27,male,42.130,0,yes,southeast,39611.75770
19,30,male,35.300,0,yes,southwest,36837.46700
23,34,female,31.920,1,yes,northeast,37701.87680
...,...,...,...,...,...,...,...
1313,19,female,34.700,2,yes,southwest,36397.57600
1314,30,female,23.655,3,yes,northwest,18765.87545
1321,62,male,26.695,0,yes,northeast,28101.33305
1323,42,female,40.370,2,yes,southeast,43896.37630


In [12]:
non_smokers = df[df.smoker == 'no']
non_smokers

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
5,31,female,25.740,0,no,southeast,3756.62160
...,...,...,...,...,...,...,...
1332,52,female,44.700,3,no,southwest,11411.68500
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350


In [13]:
smokers_average_cost = sum(smokers.charges)/len(smokers)
smokers_average_cost

32050.23183153285

In [14]:
non_smokers_average_cost = sum(non_smokers.charges)/len(non_smokers)
non_smokers_average_cost

8434.268297856199

In [15]:
diff = smokers_average_cost - non_smokers_average_cost

In [16]:
smokers_average_cost / non_smokers_average_cost

3.8000014582983206

In [17]:
print("The difference between the average cost for a smoker and a non-smoker is {}.".format(round(diff)))
print("Therefore we can conclude that smokers pay almost {} times as much as non-smokers on average.".format(round(smokers_average_cost / non_smokers_average_cost)))

The difference between the average cost for a smoker and a non-smoker is 23616.
Therefore we can conclude that smokers pay almost 4 times as much as non-smokers on average.
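The two filtered frames and manual averages can be collapsed into a single `groupby`. The sketch below assumes only the `smoker` and `charges` columns and, to stay self-contained, uses ten rows copied from the smoker and non-smoker tables shown above. Its numbers therefore differ from the full-dataset averages.

```python
import pandas as pd

# Ten rows copied from the tables shown above: five smokers, five non-smokers.
sample = pd.DataFrame({
    "smoker": ["yes"] * 5 + ["no"] * 5,
    "charges": [16884.92400, 27808.72510, 39611.75770, 36837.46700, 37701.87680,
                1725.55230, 4449.46200, 21984.47061, 3866.85520, 3756.62160],
})

# One pass over the data: mean charges per smoker status.
mean_charges = sample.groupby("smoker")["charges"].mean()

difference = mean_charges["yes"] - mean_charges["no"]
ratio = mean_charges["yes"] / mean_charges["no"]
print(mean_charges)
print(f"difference: {difference:.2f}, ratio: {ratio:.2f}")
```

Swapping `sample` for the notebook's `df` would give the group means for the whole dataset without creating intermediate frames.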


## Figure out what the average age is for someone who has at least one child in this dataset.

In [18]:
# children > 0 selects patients with at least one child
at_least_1_child = df[df.children > 0]
at_least_1_child

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
6,46,female,33.440,1,no,southeast,8240.58960
7,37,female,27.740,3,no,northwest,7281.50560
8,37,male,29.830,2,no,northeast,6406.41070
...,...,...,...,...,...,...,...
1328,23,female,24.225,2,no,northeast,22395.74424
1329,52,male,38.600,2,no,southwest,10325.20600
1330,57,female,25.740,2,no,southeast,12629.16560
1332,52,female,44.700,3,no,southwest,11411.68500


In [19]:
avg_age = sum(at_least_1_child.age) / len(at_least_1_child)
avg_age

39.78010471204188

In [20]:
print("The average age for someone who has at least 1 child is {}.".format(round(avg_age)))


The average age for someone who has at least 1 child is 40.
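The filter and the average can be combined with a boolean mask and `.loc`. This sketch uses the first five rows of the table shown at the top of the notebook, so its result reflects that small sample only. The names `has_children` and `avg_age_with_children` are illustrative.

```python
import pandas as pd

# The first five rows of the table shown at the top of the notebook.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
})

# Boolean mask: True for patients with one or more children.
has_children = sample["children"] > 0

# Select the age column for those rows and average it.
avg_age_with_children = sample.loc[has_children, "age"].mean()
print(avg_age_with_children)
```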
