# U.S. Medical Insurance Costs

# Project Goals
* Analyze BMI within each region
* Gain insight into how different factors influence the cost of insurance by comparing group averages
* Compare the three most expensive patients with the three cheapest patients

## BMI By Region
* The four regions represented in the data set are Northwest, Southwest, Northeast, and Southeast
* Find the total number of patients for each region. Then find the number of underweight, healthy, overweight, and obese patients within each region. The BMI ranges for each category can be found in the note below.
* Finally, break down each region and represent each BMI range as a percent of the total population within that region.
    * e.g. Northwest: x% underweight, y% healthy, z% overweight, w% obese; Southwest: (etc)

## Data Averaging Analysis
* Compare average cost of smokers to average cost of non-smokers
* Compare average cost of males to average cost of females
* Compare average cost of patients in age ranges of teens, 20s, 30s, etc.
* Compare average cost of patients with different numbers of children (0, 1, 2, 3, 4, 5)
* Compare average cost of patients in the different BMI ranges (see note below)
* Compare average insurance cost by region (SW, SE, NW, NE)

## Comparative Analysis
* Examine the patients with the three highest and three lowest insurance costs and analyze the factors involved.

### Note: 

According to the CDC, a BMI below 18.5 is in the underweight range, 18.5 to < 25 is the healthy range, 25 to < 30 is overweight, and 30 or higher is obese. These are the ranges I use to compare average insurance costs and to analyze BMI by region.
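The cut-offs in this note can also be written as a small lookup table. This is only a sketch: the names `BMI_BOUNDS`, `BMI_LABELS`, and `classify_bmi` are illustrative and are not used elsewhere in this notebook.

```python
from bisect import bisect_right

# Lower bounds of the healthy, overweight, and obese ranges quoted in the note.
BMI_BOUNDS = [18.5, 25.0, 30.0]
BMI_LABELS = ['underweight', 'healthy', 'overweight', 'obese']

def classify_bmi(bmi):
    """Return the category for a BMI value; each range includes its lower bound."""
    return BMI_LABELS[bisect_right(BMI_BOUNDS, float(bmi))]

# Check the values on either side of each boundary.
for value in (18.4, 18.5, 24.9, 25, 30):
    print(value, classify_bmi(value))
```

Because `bisect_right` counts how many bounds are less than or equal to the value, a BMI of exactly 18.5 lands in the healthy range, which matches the wording of the note.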



# Data Collection

In [18]:
import csv
dataset = []
with open('insurance.csv', newline='') as insurance_csv:
    reader = csv.DictReader(insurance_csv)
    for row in reader:
        dataset.append(row)
#dataset is now a list of dictionaries, one per row in insurance.csv (all values are still strings)
#print(dataset)
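Every value read by `csv.DictReader` is a string, which is why the cells below call `float()` and `int()` repeatedly. One alternative, sketched here under the assumption that the columns are named as in `insurance.csv`, is to convert the numeric columns once while loading. The helper `load_patients` and the two sample rows are made up for illustration so the sketch runs without the file.

```python
import csv
import io

# Two made-up rows in the same column layout as insurance.csv.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
40,female,22.0,2,no,northwest,7000.0
55,male,31.5,0,yes,southeast,42000.5
"""

def load_patients(csv_file):
    """Read rows from an open CSV file and convert the numeric columns once."""
    patients = []
    for row in csv.DictReader(csv_file):
        row['age'] = int(row['age'])
        row['children'] = int(row['children'])
        row['bmi'] = float(row['bmi'])
        row['charges'] = float(row['charges'])
        patients.append(row)
    return patients

patients = load_patients(io.StringIO(SAMPLE_CSV))
print(patients[0]['age'], patients[1]['charges'])
```

With the real file, the same function could be called as `load_patients(open('insurance.csv', newline=''))` inside a `with` block.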

# BMI By Region

In [49]:
def bmi_category(bmi):
    if float(bmi) < 18.5:  #underweight is strictly below 18.5; exactly 18.5 counts as healthy
        return 'underweight'
    elif float(bmi) < 25:
        return 'healthy'
    elif float(bmi) < 30:
        return 'overweight'
    else:
        return 'obese'
def count_cats_group(cats, catname):
    un = 0
    he = 0
    ov = 0
    ob = 0
    for i in cats:
        if i == 'underweight':
            un += 1
        elif i == 'healthy':
            he += 1
        elif i == 'overweight':
            ov += 1
        else:
            ob += 1
    tot = un+he+ov+ob
    print("{catname}: {unper:.2f}% underweight, {heper:.2f}% healthy, {ovper:.2f}% overweight, {obper:.2f}% obese".format(catname=catname, unper=100*un/tot, heper=100*he/tot, ovper=100*ov/tot, obper=100*ob/tot))

nw = 0
sw = 0
ne = 0
se = 0
se_cats = []
ne_cats = []
sw_cats = []
nw_cats = []
for i in dataset:
    if i['region'] == 'southwest':
        sw += 1
        sw_cats.append(bmi_category(i['bmi']))
    elif i['region'] == 'southeast':
        se += 1
        se_cats.append(bmi_category(i['bmi']))
    elif i['region'] == 'northwest':
        nw += 1
        nw_cats.append(bmi_category(i['bmi']))
    else:
        ne += 1
        ne_cats.append(bmi_category(i['bmi']))

print("Northwest: {nw}, Northeast: {ne}, Southwest: {sw}, Southeast: {se}".format(nw=nw, ne=ne, sw=sw, se=se))
#print(nw+sw+ne+se)
#should be 1338
count_cats_group(nw_cats, "Northwest")
count_cats_group(ne_cats, "Northeast")
count_cats_group(sw_cats, "Southwest")
count_cats_group(se_cats, "Southeast")


Northwest: 325, Northeast: 324, Southwest: 325, Southeast: 364
Northwest: 2.15% underweight, 19.38% healthy, 32.92% overweight, 45.54% obese
Northeast: 3.09% underweight, 22.53% healthy, 30.25% overweight, 44.14% obese
Southwest: 1.23% underweight, 14.46% healthy, 31.08% overweight, 53.23% obese
Southeast: 0.00% underweight, 11.26% healthy, 21.98% overweight, 66.76% obese


This analysis shows that, in this data set, obese is the largest BMI category in every region. The southeast has the highest obesity share, at 66.76% of its sample population. The northeast is the "healthiest" region, with the highest healthy share (22.53%) and the lowest obese share (44.14%) of any region.
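The per-region breakdown can also be computed without a separate counter variable for each region and category. The sketch below uses `collections.Counter`; the function name `category_shares` and the sample pairs are hypothetical and stand in for the real `(region, bmi_category)` values.

```python
from collections import Counter, defaultdict

# Hypothetical (region, category) pairs standing in for the real rows.
pairs = [
    ('southeast', 'obese'), ('southeast', 'obese'), ('southeast', 'healthy'),
    ('northeast', 'healthy'), ('northeast', 'overweight'),
]

def category_shares(pairs):
    """Map each region to {category: percent of that region's patients}."""
    counts = defaultdict(Counter)
    for region, category in pairs:
        counts[region][category] += 1
    shares = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        shares[region] = {cat: 100 * n / total for cat, n in counter.items()}
    return shares

shares = category_shares(pairs)
for region, by_cat in shares.items():
    print(region, {cat: round(pct, 2) for cat, pct in by_cat.items()})
```

A category with no patients in a region is simply absent from that region's dictionary, so a caller that needs an explicit 0% would look it up with `by_cat.get(category, 0.0)`.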

# Data Averaging Analysis


## Smokers vs Nonsmokers

In [111]:
smokecosts = []
nonsmokecosts = []
for i in dataset:
    if i['smoker'] == 'yes':
        smokecosts.append(float(i['charges']))
    else:
        nonsmokecosts.append(float(i['charges']))
averagesmokecost = sum(smokecosts)/len(smokecosts)
averagenonsmokecost = sum(nonsmokecosts)/len(nonsmokecosts)
print("There are {numb} smokers in the sample. The average cost for smokers is ${cost:.2f}.".format(numb=len(smokecosts), cost=averagesmokecost))
print("There are {numb} non-smokers in the sample. The average cost for non-smokers is ${cost:.2f}.".format(numb=len(nonsmokecosts), cost=averagenonsmokecost))
print("Patients that smoke pay ${diff:.2f} more on average than patients that don't smoke.".format(diff=averagesmokecost-averagenonsmokecost))

There are 274 smokers in the sample. The average cost for smokers is $32050.23.
There are 1064 non-smokers in the sample. The average cost for non-smokers is $8434.27.
Patients that smoke pay $23615.96 more on average than patients that don't smoke.


## Males vs Females

In [112]:
malecosts = []
femalecosts = []
for i in dataset:
    if i['sex'] == 'male':
        malecosts.append(float(i['charges']))
    else:
        femalecosts.append(float(i['charges']))
averagemalecost = sum(malecosts) / len(malecosts)
averagefemalecost = sum(femalecosts) / len(femalecosts)
print("There are {n} males in the sample. The average cost for males is ${cost:.2f}.".format(n=len(malecosts), cost=averagemalecost))
print("There are {n} females in the sample. The average cost for females is ${cost:.2f}.".format(n=len(femalecosts), cost=averagefemalecost))
print("Men pay ${dif:.2f} more than women, on average.".format(dif = averagemalecost - averagefemalecost))

There are 676 males in the sample. The average cost for males is $13956.75.
There are 662 females in the sample. The average cost for females is $12569.58.
Men pay $1387.17 more than women, on average.


## Different Age Ranges

In [120]:
#10-19 is the first range with entries
#60-70 is the last range with entries
list10s = []
list20s = []
list30s = []
list40s = []
list50s = []
list60s = []
for i in dataset:
    if int(i['age']) <= 19:
        list10s.append(float(i['charges']))
    elif int(i['age']) <= 29:
        list20s.append(float(i['charges']))
    elif int(i['age']) <= 39:
        list30s.append(float(i['charges']))
    elif int(i['age']) <= 49:
        list40s.append(float(i['charges']))
    elif int(i['age']) <= 59:
        list50s.append(float(i['charges']))
    else:
        list60s.append(float(i['charges']))
avg10s = sum(list10s)/len(list10s)
avg20s = sum(list20s)/len(list20s)
avg30s = sum(list30s)/len(list30s)
avg40s = sum(list40s)/len(list40s)
avg50s = sum(list50s)/len(list50s)
avg60s = sum(list60s)/len(list60s)
print("There are {num} patients aged 19 or younger. The average cost for this range is ${cost:.2f}".format(num = len(list10s), cost = avg10s))
print("There are {num} patients aged 20 through 29.  The average cost for this range is ${cost:.2f}".format(num = len(list20s), cost = avg20s))
print("There are {num} patients aged 30 through 39.  The average cost for this range is ${cost:.2f}".format(num = len(list30s), cost = avg30s))
print("There are {num} patients aged 40 through 49.  The average cost for this range is ${cost:.2f}".format(num = len(list40s), cost = avg40s))
print("There are {num} patients aged 50 through 59.  The average cost for this range is ${cost:.2f}".format(num = len(list50s), cost = avg50s))
print("There are {num} patients aged 60 or older.  The average cost for this range is ${cost:.2f}".format(num = len(list60s), cost = avg60s))
print("The difference between the oldest range's average and the youngest range's average is ${:.2f}.".format(avg60s-avg10s))
print("It is clear that average cost of insurance increases as age increases.")

#Note: Use more functions next time to better optimize this code.
      


        

There are 137 patients aged 19 or younger. The average cost for this range is $8407.35
There are 280 patients aged 20 through 29.  The average cost for this range is $9561.75
There are 257 patients aged 30 through 39.  The average cost for this range is $11738.78
There are 279 patients aged 40 through 49.  The average cost for this range is $14399.20
There are 271 patients aged 50 through 59.  The average cost for this range is $16495.23
There are 114 patients aged 60 or older.  The average cost for this range is $21248.02
The difference between the oldest range's average and the youngest range's average is $12840.67.
It is clear that average cost of insurance increases as age increases.
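The note in the cell above about using more functions could be addressed with one grouping helper that every comparison in this section reuses. This is a sketch, not the code that produced the numbers above: `average_by` and the sample rows are hypothetical.

```python
from collections import defaultdict

def average_by(rows, key_fn, value_fn=lambda row: float(row['charges'])):
    """Group rows by key_fn(row) and return {key: (count, mean of value_fn)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(value_fn(row))
    return {key: (len(vals), sum(vals) / len(vals)) for key, vals in groups.items()}

# Hypothetical rows with string values, as csv.DictReader would produce.
rows = [
    {'age': '18', 'smoker': 'no', 'charges': '1000'},
    {'age': '19', 'smoker': 'yes', 'charges': '3000'},
    {'age': '64', 'smoker': 'no', 'charges': '9000'},
]

# Decade buckets: 18 and 19 fall in 10, 64 falls in 60.
by_decade = average_by(rows, lambda row: int(row['age']) // 10 * 10)
by_smoker = average_by(rows, lambda row: row['smoker'])
for decade in sorted(by_decade):
    count, mean = by_decade[decade]
    print("{}s: {} patients, average ${:.2f}".format(decade, count, mean))
```

The decade buckets match the ranges used above only while every age falls between 10 and 69, which the comments in the cell say is the case for this data set.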


## Different Amounts of Children

In [115]:
children = [x for x in range(6)]
child_dict = {}
for i in children:
    chargegroup = []
    for j in dataset:
        if int(j['children']) == i:
            chargegroup.append(float(j['charges']))
    child_dict.update({i: chargegroup})
for i in child_dict:
    if i==1:
        print("There are {num} patients with {i} child. The average cost for this amount is ${avg:.2f}".format(num=len(child_dict[i]), i=i, avg=(sum(child_dict[i])/len(child_dict[i]))))
    else:
        print("There are {num} patients with {i} children. The average cost for this amount is ${avg:.2f}".format(num=len(child_dict[i]), i=i, avg=(sum(child_dict[i])/len(child_dict[i]))))
print("The data is unclear about any influence that number of children has on insurance costs.")


There are 574 patients with 0 children. The average cost for this amount is $12365.98
There are 324 patients with 1 child. The average cost for this amount is $12731.17
There are 240 patients with 2 children. The average cost for this amount is $15073.56
There are 157 patients with 3 children. The average cost for this amount is $15355.32
There are 25 patients with 4 children. The average cost for this amount is $13850.66
There are 18 patients with 5 children. The average cost for this amount is $8786.04
The data is unclear about any influence that number of children has on insurance costs.


## BMI Categories

In [119]:
#bmi_category(bmi) returns classification as a string
bmi_dict = {}
ungroup = []
hegroup = []
ovgroup = []
obgroup = []
for i in dataset:
    if bmi_category(float(i['bmi'])) == 'underweight':
        ungroup.append(float(i['charges']))
    elif bmi_category(float(i['bmi'])) == 'healthy':
        hegroup.append(float(i['charges']))
    elif bmi_category(float(i['bmi'])) == 'overweight':
        ovgroup.append(float(i['charges']))
    else:
        obgroup.append(float(i['charges']))
bmi_dict.update({'underweight': ungroup, 'healthy': hegroup, 'overweight': ovgroup, 'obese': obgroup})
#bmi_dict is now a dictionary with classifications as keys and groups of corresponding insurance costs as values
for i in bmi_dict:
    print("The {bmi_class} patients pay an average of ${avg:.2f} for insurance.".format(bmi_class = i, avg = sum(bmi_dict[i])/len(bmi_dict[i])))
diff = (sum(bmi_dict['obese']) / len(bmi_dict['obese'])) - (sum(bmi_dict['healthy']) / len(bmi_dict['healthy']))
diff2 = (sum(bmi_dict['obese']) / len(bmi_dict['obese'])) - (sum(bmi_dict['underweight']) / len(bmi_dict['underweight']))
print("The difference between the obese range average and healthy range average is ${diff:.2f}.".format(diff = diff))
print("The difference between the obese range average and underweight range average is ${diff:.2f}.".format(diff = diff2))
print("Note: Patients should not be encouraged to force themselves into the underweight category to lower their insurance cost.")
print("The data suggests that average cost of insurance increases as BMI increases.")    
    
    

The underweight patients pay an average of $8657.62 for insurance.
The healthy patients pay an average of $10434.53 for insurance.
The overweight patients pay an average of $10987.51 for insurance.
The obese patients pay an average of $15552.34 for insurance.
The difference between the obese range average and healthy range average is $5117.80.
The difference between the obese range average and underweight range average is $6894.71.
Note: Patients should not be encouraged to force themselves into the underweight category to lower their insurance cost.
The data suggests that average cost of insurance increases as BMI increases.


## Cost by Region

In [110]:
region_dict = {}
nwlist = []
nelist = []
swlist = []
selist = []
for i in dataset:
    if i['region'] == 'northwest':
        nwlist.append(float(i['charges']))
    elif i['region'] == 'northeast':
        nelist.append(float(i['charges']))
    elif i['region'] == 'southwest':
        swlist.append(float(i['charges']))
    else:
        selist.append(float(i['charges']))
region_dict.update({'northwest': nwlist, 'northeast': nelist, 'southwest': swlist, 'southeast': selist})
#region_dict is now a dictionary with region names as keys and lists of charges as values.
for i in region_dict:
    print("Patients in the {region} region pay an average of ${avg:.2f} for health insurance.".format(region=i, avg=sum(region_dict[i])/len(region_dict[i])))
#southeast has the highest average and southwest the lowest, so compare those two regions
cost = (sum(region_dict['southeast'])/len(region_dict['southeast'])) - (sum(region_dict['southwest'])/len(region_dict['southwest']))
print("The difference in average cost between the most expensive and least expensive region is ${cost:.2f}.".format(cost=cost))

Patients in the northwest region pay an average of $12417.58 for health insurance.
Patients in the northeast region pay an average of $13406.38 for health insurance.
Patients in the southwest region pay an average of $12346.94 for health insurance.
Patients in the southeast region pay an average of $14735.41 for health insurance.
The difference in average cost between the most expensive and least expensive region is $2388.47.
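Selecting the extremes with `max` and `min` avoids hard-coding which regions are the most and least expensive. The sketch below uses the four regional averages printed above, rounded to cents; the variable names are illustrative.

```python
# Regional averages copied from the printed output (rounded to cents).
region_avg = {
    'northwest': 12417.58,
    'northeast': 13406.38,
    'southwest': 12346.94,
    'southeast': 14735.41,
}

# max/min over the keys, ranked by each key's average.
most = max(region_avg, key=region_avg.get)
least = min(region_avg, key=region_avg.get)
gap = region_avg[most] - region_avg[least]
print("{} minus {}: ${:.2f}".format(most, least, gap))
```

On these averages the most expensive region is the southeast and the least expensive is the southwest.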


## Averaging Analysis
### Patients that smoke pay $23615.96 more on average than patients that don't smoke.
This is by far the largest gap in average insurance cost among the factors in the data set. For a patient who wants to lower their insurance cost, the averages suggest that quitting smoking is the first thing to try.

### Men pay $1387.17 more than women, on average.
This is the smallest gap in average insurance cost. It's entirely possible that men and women are not charged differently for insurance and that this gap is only due to differences in the other factors.
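One way to check whether a gap like this comes from another factor is to compare averages within groups that share that factor. The sketch below uses smoking status as the second factor and entirely made-up rows, so its numbers say nothing about the real data; `mean_charges_by` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows: more of the men smoke, but charges within each smoking group are similar.
rows = [
    {'sex': 'male', 'smoker': 'yes', 'charges': '30000'},
    {'sex': 'male', 'smoker': 'yes', 'charges': '32000'},
    {'sex': 'male', 'smoker': 'no', 'charges': '8000'},
    {'sex': 'female', 'smoker': 'yes', 'charges': '31000'},
    {'sex': 'female', 'smoker': 'no', 'charges': '8000'},
    {'sex': 'female', 'smoker': 'no', 'charges': '8200'},
]

def mean_charges_by(rows, *fields):
    """Average charges for each combination of the given fields (keys are tuples)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in fields)].append(float(row['charges']))
    return {key: sum(v) / len(v) for key, v in groups.items()}

overall = mean_charges_by(rows, 'sex')
within = mean_charges_by(rows, 'sex', 'smoker')
for key in sorted(within):
    print(key, round(within[key], 2))
```

In this toy sample the overall male average is far higher than the female average, yet the averages within the smoker group are identical. Running the same grouping on `dataset` would show whether the real $1387.17 gap behaves the same way.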

### It is clear that the average cost of insurance increases as age increases.
The difference between the oldest range's average and the youngest range's average is $12840.67. The biggest jump, about $4750, occurs between the 50-59 range and the 60-and-older range. The jumps between the earlier ranges are smaller (roughly $1150 to $2660), so costs appear to rise fastest for the oldest patients.

### The data is unclear about any influence that number of children has on insurance costs.
The first four groups (no children, one child, two children, and three children) suggest an increasing trend. However, the average for four children is lower than for two or three children, and the average for five children is lower than every other group by a significant margin (about $3600 below the second lowest). One possible reason is the much smaller sample size in these last two groups (25 and 18 patients), which makes their averages less reliable.

I would tentatively suggest that having more children results in higher insurance costs, but not with considerable confidence, at least from an average-analysis standpoint. Another form of analysis would likely provide better insight.
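The effect of group size on the reliability of an average can be made concrete with the standard error of the mean. The sketch below uses the standard library's `statistics` module on hypothetical charges; `mean_with_standard_error` is an illustrative name, and none of these numbers come from the data set.

```python
import math
import statistics

def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) for a list of at least two numbers."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical charges: same values repeated, but the second group is far smaller.
large_group = [9000, 11000, 13000, 15000] * 100
small_group = [9000, 11000, 13000, 15000] * 4

for name, values in (('large', large_group), ('small', small_group)):
    mean, sem = mean_with_standard_error(values)
    print("{}: n={}, mean=${:.2f}, standard error=${:.2f}".format(name, len(values), mean, sem))
```

Both groups have the same mean, but the smaller group's standard error is several times larger, so its average is a much rougher estimate.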

### The difference between the obese range average and underweight range average is $6894.71. The data suggests that average cost of insurance increases as BMI increases.

This is one metric that patients have at least some agency over (as opposed to something like age). The regional breakdown shows that more than half of the sample population (about 53%) is in the obese range, and on average these patients pay more than those in the other categories. Patients in the obese category should consider working toward lowering their BMI if they want to lower their insurance costs. While those in the underweight category pay less for insurance on average, it is not advisable to drop below a BMI in the 'healthy' range.

# Comparative Analysis

In [141]:
#Iterate through the dataset and print the three highest insurance costs and three lowest insurance costs
dataset2 = dataset
cheapest = {}
secondcheapest = {}
thirdcheapest = {}
mostexpens = {}
secondmostexpens = {}
thirdmostexpens = {}
highest = 0
lowest = float('inf')
for i in dataset2:
    if float(i['charges']) >= highest:
        highest = float(i['charges'])
        mostexpens.update(i)
highest = 0
for i in dataset2:
    if float(i['charges']) >= highest and float(i['charges']) < float(mostexpens['charges']):
        highest = float(i['charges'])
        secondmostexpens.update(i)
highest = 0
for i in dataset2:
    if float(i['charges']) >= highest and float(i['charges']) < float(secondmostexpens['charges']):
        highest = float(i['charges'])
        thirdmostexpens.update(i)
print("These are the three patients with the highest charges.")
print("Most expensive: {}".format(mostexpens))
print("Second most expensive: {}".format(secondmostexpens))
print("Third most expensive: {}\n".format(thirdmostexpens))

for i in dataset2:
    if float(i['charges']) <= lowest:
        lowest = float(i['charges'])
        cheapest.update(i)
lowest = float('inf')
for i in dataset2:
    if float(i['charges']) <= lowest and float(i['charges']) > float(cheapest['charges']):
        lowest = float(i['charges'])
        secondcheapest.update(i)
lowest = float('inf')
for i in dataset2:
    if float(i['charges']) <= lowest and float(i['charges']) > float(secondcheapest['charges']):
        lowest = float(i['charges'])
        thirdcheapest.update(i)
print("These are the three patients with the lowest charges.")
print("Cheapest: {}".format(cheapest))
print("Second cheapest: {}".format(secondcheapest))
print("Third cheapest: {}".format(thirdcheapest))
    



These are the three patients with the highest charges.
Most expensive: {'age': '54', 'sex': 'female', 'bmi': '47.41', 'children': '0', 'smoker': 'yes', 'region': 'southeast', 'charges': '63770.42801'}
Second most expensive: {'age': '45', 'sex': 'male', 'bmi': '30.36', 'children': '0', 'smoker': 'yes', 'region': 'southeast', 'charges': '62592.87309'}
Third most expensive: {'age': '52', 'sex': 'male', 'bmi': '34.485', 'children': '3', 'smoker': 'yes', 'region': 'northwest', 'charges': '60021.39897'}

These are the three patients with the lowest charges.
Cheapest: {'age': '18', 'sex': 'male', 'bmi': '23.21', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '1121.8739'}
Second cheapest: {'age': '18', 'sex': 'male', 'bmi': '30.14', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '1131.5066'}
Third cheapest: {'age': '18', 'sex': 'male', 'bmi': '33.33', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '1135.9407'}
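The repeated passes over the data set in the cell above can be replaced by the standard library's `heapq.nlargest` and `heapq.nsmallest`. This is a sketch on hypothetical rows rather than the code that produced the output above; the helper `charge` is illustrative.

```python
import heapq

# Hypothetical rows shaped like the csv.DictReader output (all values are strings).
rows = [
    {'age': '18', 'charges': '1200.50'},
    {'age': '45', 'charges': '62000.00'},
    {'age': '30', 'charges': '4500.25'},
    {'age': '60', 'charges': '48000.75'},
    {'age': '25', 'charges': '3100.00'},
]

def charge(row):
    """Sort key: the row's charges as a number."""
    return float(row['charges'])

top_three = heapq.nlargest(3, rows, key=charge)
bottom_three = heapq.nsmallest(3, rows, key=charge)

print("Highest:", [row['charges'] for row in top_three])
print("Lowest:", [row['charges'] for row in bottom_three])
```

One behavioral difference: if two patients had exactly the same charge, this version would list both, whereas the loops above skip to the next strictly different value.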


When observing the second and third most expensive patients, it's interesting to see that despite the second patient's lower age, lower BMI, and fewer children, he still pays about $2571 more than the third patient. Two patients are too few to draw a firm conclusion, but this hints that either having more children brings down the insurance price, different regions weight these factors differently, or some combination of both.

Notably, the three cheapest patients are all 18-year-old nonsmoking men with 0 children from the southeast. The only difference is in BMI, so we can tentatively isolate that variable and analyze it for the southeast region.

* A BMI change of +6.93 (23.21 to 30.14) corresponds to a charge change of +$9.63.
* A BMI change of +3.19 (30.14 to 33.33) corresponds to a charge change of +$4.43.

Both steps give the same ratio, so for these three patients each additional BMI point adds about $1.39 to the charge: Δcharge ≈ 1.39 × ΔBMI.
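The arithmetic behind this estimate can be checked directly from the BMI and charge values of the three cheapest patients printed above. The variable names are illustrative.

```python
# (BMI, charges) for the three cheapest patients, copied from the output above.
patients = [(23.21, 1121.8739), (30.14, 1131.5066), (33.33, 1135.9407)]

slopes = []
# Compare each patient with the next one up in BMI.
for (bmi_a, charge_a), (bmi_b, charge_b) in zip(patients, patients[1:]):
    d_bmi = bmi_b - bmi_a
    d_charge = charge_b - charge_a
    slopes.append(d_charge / d_bmi)
    print("BMI +{:.2f} -> charges +${:.2f}, or ${:.2f} per BMI point".format(
        d_bmi, d_charge, d_charge / d_bmi))
```

Both steps work out to about $1.39 per BMI point. Three patients from one region and one age are a very small basis, so this slope should not be assumed to hold for other groups.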

It is possible that different regions follow different rules when it comes to how the factors influence insurance costs.



# Footnote

Dataset provided by https://www.kaggle.com/mirichoi0218/insurance