# U.S. Medical Insurance Costs

# Scope of the project

## What do I want to achieve with this project

A proper analysis of medical insurance costs that uses every variable provided and draws insights about how each one affects an individual's cost. I can use conditionals to obtain data for a certain age group or BMI group, for example. It will be vital to build multiple grouped lists, so that the data can be grouped and statistics displayed for each group separately.

__The ultimate goal is to have done at least a single piece of analysis for each different field available.__

### Importing data into useful format

Seeing as our data is arranged in tabular form in a csv file, we can use the `csv` library to get our data into dictionaries and then possibly lists so we have our data in a format that is easy to conduct analysis on.
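
One detail worth keeping in mind is that `csv.DictReader` returns every field as a string, so numeric columns need an explicit cast before any arithmetic. The sketch below shows this on a two-row in-memory sample; the values are made up for illustration and are not taken from `insurance.csv`.

```python
import csv
import io

# Hypothetical sample with the same column names as insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,24.5,1,no,northeast,4000.50\n"
    "52,male,31.2,2,yes,southwest,23000.75\n"
)

rows = list(csv.DictReader(sample))

# Every value comes back as str, including the numeric columns
print(type(rows[0]["age"]))

# Cast explicitly before doing arithmetic
ages = [int(row["age"]) for row in rows]
charges = [float(row["charges"]) for row in rows]
print(sum(ages) / len(ages))
```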

### EXAMPLE: Age category

The age category is an interesting variable, but on its own it will not tell us much about the cost; it becomes more informative when combined with other factors. I will analyse data for 3 distinct groups: people below the age of 20 (called 'Zoomers'), people aged 20 to 39 (called 'Millenials'), and people aged 40 to 64 (called 'Boomers').

1. Importing data from `insurance.csv` into dictionaries

In [157]:
import csv

ages = []
sex = []
bmis = []
children = []
smoker = []
region = []
charges = []

with open('insurance.csv', newline = '') as insurance_data:
    data = csv.DictReader(insurance_data)
    for row in data:
        ages.append(row['age'])
        sex.append(row['sex'])
        bmis.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])
        
#create a unique identifier for each row, starting at 1
unique_ids = list(range(1, len(ages) + 1))

2. Put the information in a master dictionary containing data for each column as `key:value` for each unique ID separately

In [158]:
master_dictionary = {}

#Create a list with one tuple per row of our `insurance.csv` file (kept for inspection; not used below)
master_list = list(zip(unique_ids, ages, sex, bmis, children, smoker, region, charges))
#print(master_list)

for i in range(len(unique_ids)):
    master_dictionary[unique_ids[i]] = {"ID": unique_ids[i],
                                        "Age": ages[i],
                                        "Sex": sex[i],
                                        "BMI": bmis[i],
                                        "Children": children[i],
                                        "Smoker": smoker[i],
                                        "Region": region[i],
                                        "Charges": charges[i]
                                       }

#print(master_dictionary)
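
The same master dictionary could also be built in a single pass, without the intermediate column lists. The following is only a sketch: the helper name `build_master_dictionary` is hypothetical, and a made-up in-memory sample stands in for `insurance.csv`.

```python
import csv
import io

def build_master_dictionary(csv_file):
    """Return {id: record} with ids starting at 1, one record per CSV row."""
    master = {}
    for uid, row in enumerate(csv.DictReader(csv_file), start=1):
        master[uid] = {
            "ID": uid,
            "Age": row["age"],
            "Sex": row["sex"],
            "BMI": row["bmi"],
            "Children": row["children"],
            "Smoker": row["smoker"],
            "Region": row["region"],
            "Charges": row["charges"],
        }
    return master

# Made-up sample rows with the same header as insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,24.5,1,no,northeast,4000.50\n"
    "52,male,31.2,2,yes,southwest,23000.75\n"
)
records = build_master_dictionary(sample)
print(records[1]["Sex"], records[2]["Charges"])
```

With the real file, the helper would be called inside `with open('insurance.csv', newline='') as f:` in the same way as the import cell.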

### `AGE` analysis

I will conduct various studies here. For example: how many males vs. females are there in each age bracket? What is the average BMI of a certain age bracket? How many smokers are there in each age bracket? What is the average medical charge for each age bracket?

1. The first thing to do is to compile the data for every unique ID into their respective age brackets so that we can start the analysis.

In [159]:
def separate_age_brackets():
    # Initialise lists to store IDs for each age bracket
    Zoomers = [] #Age 19 and below
    Millenials = [] #Age >=20 and < 40
    Boomers = [] #Age >=40 and < 65
    for item in master_dictionary:
        UID = master_dictionary[item]['ID']
        age = int(master_dictionary[item]['Age'])
        if age > 0 and age < 20:
            Zoomers.append(UID)
            master_dictionary[item]['Age group'] = 'Zoomer'
        elif age >= 20 and age < 40:
            Millenials.append(UID)
            master_dictionary[item]['Age group'] = 'Millenial'
        elif age >= 40 and age < 65:
            Boomers.append(UID)
            master_dictionary[item]['Age group'] = 'Boomer'
        else:
            # Ages outside 1-64 get no 'Age group' key; such a record would raise a KeyError in later cells
            pass
    return Zoomers, Millenials, Boomers
    
Zoomers, Millenials, Boomers = separate_age_brackets()

def dict_by_agegroup():
    age_group_dict = dict()
    for item in master_dictionary:
        age_group = master_dictionary[item]['Age group']
        if age_group not in age_group_dict:
            age_group_dict[age_group] = [master_dictionary[item]]
        else:
            age_group_dict[age_group].append(master_dictionary[item])
    return age_group_dict
            
#Creates a dictionary grouped by age group, now it's easy to analyse data for each group
age_group_dict = dict_by_agegroup()
#print(age_group_dict)
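
The "create the list on first sight, append otherwise" pattern used in `dict_by_agegroup` is what `collections.defaultdict` is designed for. A minimal sketch, assuming records shaped like those in `master_dictionary`; the helper name `group_by` and the sample records are invented for illustration.

```python
from collections import defaultdict

def group_by(records, field):
    """Group record dicts into lists keyed by the value of `field`."""
    groups = defaultdict(list)
    for record in records.values():
        groups[record[field]].append(record)
    return dict(groups)

# Made-up records in the same shape as master_dictionary
records = {
    1: {"ID": 1, "Age": "18", "Age group": "Zoomer"},
    2: {"ID": 2, "Age": "35", "Age group": "Millenial"},
    3: {"ID": 3, "Age": "19", "Age group": "Zoomer"},
}
by_age_group = group_by(records, "Age group")
print({group: len(members) for group, members in by_age_group.items()})
```

Because the field name is a parameter, the same helper would work for any other column, such as `Region`.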

2. What is the male v female count of each age group? What is the average cost for males vs females in each group? What is the BMI difference in male vs female of each age group?

In [160]:
#Get the total number of people in each age group
num_zoomers, num_millenials, num_boomers = len(Zoomers), len(Millenials), len(Boomers)

z_females = 0
z_males = 0
m_females = 0
m_males = 0
b_females = 0
b_males = 0

for item in master_dictionary:
    # Use a separate name so the `sex` column list from the import cell is not overwritten
    person_sex = master_dictionary[item]['Sex']
    age_group = master_dictionary[item]['Age group']
    if age_group == 'Zoomer':
        if person_sex == 'male':
            z_males += 1
        elif person_sex == 'female':
            z_females += 1
    elif age_group == 'Millenial':
        if person_sex == 'male':
            m_males += 1
        elif person_sex == 'female':
            m_females += 1
    elif age_group == 'Boomer':
        if person_sex == 'male':
            b_males += 1
        elif person_sex == 'female':
            b_females += 1

# We now have the number of males and females in each group

for item, females, males in [('Zoomers', z_females, z_males),
                             ('Millenials', m_females, m_males),
                             ('Boomers', b_females, b_males)]:
    print('There are', females, 'female', item + ' in our records.')
    print('There are', males, 'male', item + ' in our records.')
        
print("All in all, there are " + str(z_females+m_females+b_females) + " females and " + str(z_males+m_males+b_males) + " males in our insurance.csv dataset. The numbers are almost equal, meaning our data is not biased towards any sex if any further model is to be based on this dataset.")

# There is a roughly equal number of females and males in each category. This matters because
# a category dominated by one sex would bias our records.

There are 66 female Zoomers in our records.
There are 71 male Zoomers in our records.
There are 262 female Millenials in our records.
There are 275 male Millenials in our records.
There are 334 female Boomers in our records.
There are 330 male Boomers in our records.
All in all, there are 662 females and 676 males in our insurance.csv dataset. The numbers are almost equal, meaning our data is not biased towards any sex if any further model is to be based on this dataset.

In [161]:
def tot_cost():
    #initialise the variables that will store the total cost for each sex within each age group
    zf_cost, zm_cost, mf_cost, mm_cost, bf_cost, bm_cost = 0, 0, 0, 0, 0, 0
    for item in master_dictionary:
        sex = master_dictionary[item]['Sex']
        age_group = master_dictionary[item]['Age group']
        cost = float(master_dictionary[item]['Charges'])
        if age_group == 'Zoomer':
            if sex == 'female':
                zf_cost += cost
            elif sex == 'male':
                zm_cost += cost
        elif age_group == 'Millenial':
            if sex == 'female':
                mf_cost += cost
            elif sex == 'male':
                mm_cost += cost
        elif age_group == 'Boomer':
            if sex == 'female':
                bf_cost += cost
            elif sex == 'male':
                bm_cost += cost
    return zf_cost, zm_cost, mf_cost, mm_cost, bf_cost, bm_cost

#call the function that will return the total costs for each sex within each age group
zF_c, zM_c, mF_c, mM_c, bF_c, bM_c = tot_cost()

MF_totalcost_dict = {"Female": {"Zoomer": zF_c, "Millenial": mF_c, "Boomer": bF_c},
                    "Male": {"Zoomer": zM_c, "Millenial": mM_c, "Boomer": bM_c}}

#print(MF_totalcost_dict)

MF_avgcost_dict = {"Female": {"Zoomer": int(zF_c/z_females), "Millenial": int(mF_c/m_females), "Boomer": int(bF_c/b_females)},
                    "Male": {"Zoomer": int(zM_c/z_males), "Millenial": int(mM_c/m_males), "Boomer": int(bM_c/b_males)}}

print("The data for average costs for females and males in each category:", MF_avgcost_dict)

# It seems that the costs for males are on average higher than those for females in the same age
# group, but this result should be taken with a grain of salt: each age group spans around 20 years,
# which could be the difference between having 0 children and 3 children and could significantly
# alter the charges.

The data for average costs for females and males in each category: {'Female': {'Zoomer': 8067, 'Millenial': 9588, 'Boomer': 15797}, 'Male': {'Zoomer': 8723, 'Millenial': 11570, 'Boomer': 17071}}
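
The per-group averages can also be computed without one variable per combination, by collecting charges in lists keyed by `(sex, age group)` and averaging at the end. This is a hedged sketch rather than the method used above: `average_charges` is a hypothetical helper and the records are invented.

```python
from collections import defaultdict
from statistics import mean

def average_charges(records, *fields):
    """Average the 'Charges' column for each combination of `fields`."""
    buckets = defaultdict(list)
    for record in records.values():
        key = tuple(record[field] for field in fields)
        buckets[key].append(float(record["Charges"]))
    return {key: mean(values) for key, values in buckets.items()}

# Made-up records in the same shape as master_dictionary
records = {
    1: {"Sex": "female", "Age group": "Zoomer", "Charges": "1000.0"},
    2: {"Sex": "female", "Age group": "Zoomer", "Charges": "3000.0"},
    3: {"Sex": "male", "Age group": "Boomer", "Charges": "12000.0"},
}
averages = average_charges(records, "Sex", "Age group")
print(averages)
```

Passing a third field such as `"Children"` would split each group further, which is one way to check whether the number of children explains part of the male/female difference.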

In [170]:
def tot_bmi():
    #initialise the variables that will store the total BMI for each sex within each age group
    zf_bmi, zm_bmi, mf_bmi, mm_bmi, bf_bmi, bm_bmi = 0, 0, 0, 0, 0, 0
    total_bmi_dict = {}
    avg_bmi_dict = {}
    for item in master_dictionary:
        sex = master_dictionary[item]['Sex']
        age_group = master_dictionary[item]['Age group']
        # Each BMI is rounded to a whole number before summing, so the averages are approximate
        bmi = round(float(master_dictionary[item]['BMI']))
        if age_group == 'Zoomer':
            if sex == 'female':
                zf_bmi += bmi
            elif sex == 'male':
                zm_bmi += bmi
        elif age_group == 'Millenial':
            if sex == 'female':
                mf_bmi += bmi
            elif sex == 'male':
                mm_bmi += bmi
        elif age_group == 'Boomer':
            if sex == 'female':
                bf_bmi += bmi
            elif sex == 'male':
                bm_bmi += bmi
    total_bmi_dict.update({'Zoomer':{'Female': zf_bmi, 'Male': zm_bmi}})
    total_bmi_dict.update({'Millenial': {'Female': mf_bmi, 'Male': mm_bmi}})
    total_bmi_dict.update({'Boomer': {'Female': bf_bmi, 'Male': bm_bmi}})
    avg_bmi_dict.update({'Zoomer':{'Female': zf_bmi/z_females, 'Male': zm_bmi/z_males}})
    avg_bmi_dict.update({'Millenial': {'Female': mf_bmi/m_females, 'Male': mm_bmi/m_males}})
    avg_bmi_dict.update({'Boomer': {'Female': bf_bmi/b_females, 'Male': bm_bmi/b_males}})
    return total_bmi_dict, avg_bmi_dict

total_bmi_dict, avg_bmi_dict = tot_bmi()

print(avg_bmi_dict)

{'Zoomer': {'Female': 30.984848484848484, 'Male': 29.0}, 'Millenial': {'Female': 29.31679389312977, 'Male': 30.861818181818183}, 'Boomer': {'Female': 31.107784431137723, 'Male': 31.412121212121214}}
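
Note that `tot_bmi` rounds each BMI to a whole number before summing, which can shift the averages slightly. The sketch below, on invented values, compares that approach with averaging the unrounded values and rounding only for display.

```python
bmis = ["27.9", "33.77", "22.705", "28.88"]  # made-up BMI strings

# Approach used in tot_bmi: round each value first, then average
rounded_first = sum(round(float(b)) for b in bmis) / len(bmis)

# Alternative: average the raw values, round only for display
raw_average = sum(float(b) for b in bmis) / len(bmis)

print(rounded_first, round(raw_average, 2))
```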


### `BMI` category

According to the CDC (https://www.cdc.gov/obesity/basics/adult-defining.html), people with a BMI < 18.5 fall in the 'underweight' category, people with a BMI >= 18.5 and < 25 fall in the 'healthy' range, people with a BMI >= 25 and < 30 fall into the 'overweight' category, and people with a BMI >= 30 are classified as 'obese'. I will now calculate the average medical cost for people in each category.

1. Make a dictionary that is grouped by each BMI category, so that we can easily do the rest of the analysis.

In [189]:
for item in master_dictionary:
    bmi = float(master_dictionary[item]['BMI'])
    if bmi < 18.5:
        master_dictionary[item]['BMI_cat'] = 'Underweight'
    elif 18.5 <= bmi < 25:
        master_dictionary[item]['BMI_cat'] = 'Healthy'
    elif 25 <= bmi < 30:
        master_dictionary[item]['BMI_cat'] = 'Overweight'
    elif bmi >= 30:
        master_dictionary[item]['BMI_cat'] = 'Obese'
                
def bmi_dict():
    # Local name differs from the function name to avoid shadowing it
    grouped = {}
    for item in master_dictionary:
        bmi_cat = master_dictionary[item]['BMI_cat']
        if bmi_cat not in grouped:
            grouped[bmi_cat] = [master_dictionary[item]]
        else:
            grouped[bmi_cat].append(master_dictionary[item])
    return grouped

BMI_dictionary = bmi_dict()
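
The threshold logic can also be expressed as a lookup over sorted boundaries with the standard-library `bisect` module, which keeps the cut-offs in one place. This is a sketch of an alternative, not the code used above; `BMI_BOUNDS`, `BMI_LABELS` and `bmi_category` are hypothetical names.

```python
from bisect import bisect_right

# Lower bounds of 'Healthy', 'Overweight' and 'Obese', as quoted from the CDC above
BMI_BOUNDS = [18.5, 25, 30]
BMI_LABELS = ["Underweight", "Healthy", "Overweight", "Obese"]

def bmi_category(bmi):
    """Map a BMI value to its category label."""
    # bisect_right counts how many lower bounds are <= bmi
    return BMI_LABELS[bisect_right(BMI_BOUNDS, bmi)]

for value in (17.0, 18.5, 24.9, 25, 29.99, 30, 41.3):
    print(value, bmi_category(value))
```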

2. Now calculate the average medical insurance charges (cost) for each BMI category. I will also calculate the number of males and females in each category.

In [188]:
def avg_cost_byBMI():
    tot_cost_u = 0
    tot_cost_h = 0
    tot_cost_ov = 0
    tot_cost_ob = 0
    final_dictionary = {}
    for each_list in BMI_dictionary.values():
        for dictionary in each_list:
            cost = float(dictionary['Charges'])
            if dictionary['BMI_cat'] == 'Underweight':
                tot_cost_u += cost
            elif dictionary['BMI_cat'] == 'Healthy':
                tot_cost_h += cost
            elif dictionary['BMI_cat'] == 'Overweight':
                tot_cost_ov += cost
            else:
                tot_cost_ob += cost
    avg_cost_u = tot_cost_u/len(BMI_dictionary['Underweight'])
    avg_cost_h = tot_cost_h/len(BMI_dictionary['Healthy'])
    avg_cost_ov = tot_cost_ov/len(BMI_dictionary['Overweight'])
    avg_cost_ob = tot_cost_ob/len(BMI_dictionary['Obese'])
    final_dictionary.update({"Average cost for underweights:": ''.join(['$',str(round(avg_cost_u))]),
                       "Average cost for healthy:": ''.join(['$',str(round(avg_cost_h))]),
                       "Average cost for overweights:": ''.join(['$',str(round(avg_cost_ov))]),
                       "Average cost for obese:": ''.join(['$',str(round(avg_cost_ob))])
                      })
    return final_dictionary

average_cost_by_bmicat = avg_cost_byBMI()
print(average_cost_by_bmicat)   

{'Average cost for underweights:': '$8852', 'Average cost for healthy:': '$10409', 'Average cost for overweights:': '$10988', 'Average cost for obese:': '$15552'}
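
The dollar strings built with `''.join` could also be produced with an f-string format specifier. A small sketch on made-up averages; note that the `,` in the specifier adds a thousands separator, which the output above does not have.

```python
average_costs = {"Underweight": 8852.3, "Healthy": 10409.34}  # made-up averages

# Format each average as whole dollars with a thousands separator
formatted = {category: f"${value:,.0f}" for category, value in average_costs.items()}
print(formatted)
```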


In [187]:
count_uf, count_hf, count_ovf, count_obf = 0, 0, 0, 0
count_um, count_hm, count_ovm, count_obm = 0, 0, 0, 0

for item in BMI_dictionary.values():
    for each_dict in item:
        # Use a separate name so the `sex` column list from the import cell is not overwritten
        person_sex = each_dict['Sex']
        BMI_cat = each_dict['BMI_cat']
        if BMI_cat == 'Underweight':
            if person_sex == 'female':
                count_uf += 1
            elif person_sex == 'male':
                count_um += 1
        elif BMI_cat == 'Healthy':
            if person_sex == 'female':
                count_hf += 1
            elif person_sex == 'male':
                count_hm += 1
        elif BMI_cat == 'Overweight':
            if person_sex == 'female':
                count_ovf += 1
            elif person_sex == 'male':
                count_ovm += 1
        elif BMI_cat == 'Obese':
            if person_sex == 'female':
                count_obf += 1
            elif person_sex == 'male':
                count_obm += 1
    
MF_by_BMI_cat = {"Underweight": {"Females": count_uf, "Males": count_um},
                     "Healthy": {"Females": count_hf, "Males": count_hm},
                     "Overweight": {"Females": count_ovf, "Males": count_ovm},
                     "Obese": {"Females": count_obf, "Males": count_obm}
                    }
    
print(MF_by_BMI_cat)

{'Underweight': {'Females': 12, 'Males': 8}, 'Healthy': {'Females': 117, 'Males': 108}, 'Overweight': {'Females': 199, 'Males': 187}, 'Obese': {'Females': 334, 'Males': 373}}
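
Raw counts are easier to compare as shares of the whole dataset. The sketch below reuses the counts printed above and converts them to percentages; the names `category_totals` and `grand_total` are illustrative.

```python
# Counts copied from the output above
MF_by_BMI_cat = {
    "Underweight": {"Females": 12, "Males": 8},
    "Healthy": {"Females": 117, "Males": 108},
    "Overweight": {"Females": 199, "Males": 187},
    "Obese": {"Females": 334, "Males": 373},
}

# Total people per category, then across the whole dataset
category_totals = {cat: sum(by_sex.values()) for cat, by_sex in MF_by_BMI_cat.items()}
grand_total = sum(category_totals.values())

for cat, total in category_totals.items():
    print(f"{cat}: {total} people ({total / grand_total:.1%} of the dataset)")
```

On these counts, a little over half of the records (707 of 1338) fall in the 'Obese' category, which is worth keeping in mind when comparing average costs across categories.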
