# U.S. Medical Insurance Costs

This project seeks to analyze a dataset containing seemingly simulated U.S. medical insurance data in the form of a `.csv` file. The dataset is structured as follows:

* age: this column contains the age of the client in years.
* sex: this column contains biological sex information. It is important to note that the dataset does not contain gender identity information. As such, no insights into gender identity can be gleaned.
* bmi: this column contains the body mass index (BMI) of the client.
* children: this column contains the number of children the client has.
* smoker: this column states whether the client is a smoker.
* region: this column contains the region in which the client resides.
* charges: this column contains the amount paid by the client, presumably per annum.

## Project Scoping

* What is the average age?
* What proportion of the dataset resides in each region?
* What proportion are males or females?
* What proportion are smokers?
* What is the average age of smokers vs. non-smokers? What about in each region?
* What is the average age for a client with a given number of children?
* I would like to answer the same questions as above, replacing age with BMI and charges.

## Importing the Data

Because the data is in `.csv` format, I will use the `csv` module to import the data into lists.

In [2]:
import csv

In [3]:
insurance_data_dict = {}
ages = []
sexes = []
bmis = []
smokers = []
children = []
regions = []
charges = []

with open('insurance.csv') as insurance_file:
    insurance_data = csv.DictReader(insurance_file)
    for row in insurance_data:
        ages.append(int(row['age']))
        sexes.append(row['sex'])
        bmis.append(float(row['bmi']))
        smokers.append(row['smoker'])
        children.append(int(row['children']))
        regions.append(row['region'])
        charges.append(float(row['charges']))

I would like my data to be contained in a more convenient form. For that I will use a dictionary. I will write a function that takes the above lists and adds them to a dictionary with "Client x" as the keys, where x is the client's position in the file starting from 1.

In [4]:
def make_master_dict(age, sex, bmi, smoker, children, region, charges):
    """
    This function will take lists containing the data 
    from file and create a master data dictionary.
    """
    master_dict = {}
    for i in range(len(age)):
        client_dict = {'age': age[i], 'sex': sex[i], 'bmi': bmi[i], 
                      'smoker': smoker[i], 'children': children[i],
                       'region': region[i], 'charges': charges[i]}
        master_dict['Client ' + str(i + 1)] = client_dict
    return master_dict

insurance_data_dict = make_master_dict(ages, sexes, bmis, smokers, children, regions, charges)
# Quick print statement to test the function above
print(insurance_data_dict['Client 1'])

{'age': 19, 'sex': 'female', 'bmi': 27.9, 'smoker': 'yes', 'children': 0, 'region': 'southwest', 'charges': 16884.924}
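
As a side note, the intermediate lists are not strictly required. The sketch below builds the same "Client x" structure in a single pass over `csv.DictReader`. The helper name `load_clients` and the in-memory `SAMPLE_CSV` are assumptions made for illustration: the first sample row mirrors the `Client 1` record printed above, and the second row is made up.

```python
import csv
import io

# Illustrative data only: row 1 mirrors the 'Client 1' record printed above,
# row 2 is invented for this sketch.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
40,male,31.5,2,no,northeast,7000.0
"""

def load_clients(file_obj):
    """Build a {'Client n': {...}} dictionary directly from a CSV file object."""
    clients = {}
    # enumerate(..., start=1) gives the same 1-based client numbering
    for i, row in enumerate(csv.DictReader(file_obj), start=1):
        clients['Client {}'.format(i)] = {
            'age': int(row['age']),
            'sex': row['sex'],
            'bmi': float(row['bmi']),
            'smoker': row['smoker'],
            'children': int(row['children']),
            'region': row['region'],
            'charges': float(row['charges']),
        }
    return clients

sample_clients = load_clients(io.StringIO(SAMPLE_CSV))
print(sample_clients['Client 1'])
```

With the real file, the same helper could presumably be called as `load_clients(open_file)` inside a `with open('insurance.csv') as open_file:` block.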


## Calculations and Analysis

Now that I have a master data dictionary, I can begin my analysis. The first questions I would like to answer are "what is the average age?", "what is the average BMI?" and "what is the average per annum cost paid by clients represented in the dataset?". This preliminary line of questioning will serve well to introduce me to the dataset.

In [5]:
def avg_calc(data_dict, feature):
    '''
    This function will take a numerical feature from the data_dict
    and return the average value.
    '''
    total_feature = 0
    for client, data in data_dict.items():
        total_feature += data[feature]
    avg_feature = float(total_feature) / len(data_dict)
    return avg_feature

avg_age = avg_calc(insurance_data_dict, 'age')
print('The average age in the dataset is {:.1f} years.'.format(avg_age))
avg_bmi = avg_calc(insurance_data_dict, 'bmi')
print('The average BMI in the dataset is {:.1f}.'.format(avg_bmi))
avg_charges = avg_calc(insurance_data_dict, 'charges')
print('The average cost per annum in the dataset is {:.2f} dollars.'.format(avg_charges))

The average age in the dataset is 39.2 years.
The average BMI in the dataset is 30.7.
The average cost per annum in the dataset is 13270.42 dollars.
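
One way to cross-check a hand-rolled average is the standard library's `statistics.mean`. The sketch below assumes a small made-up dictionary, `toy_data`, in the same shape as `insurance_data_dict`; the function name `avg_calc_stats` is also hypothetical.

```python
from statistics import mean

# Hypothetical three-client dictionary shaped like insurance_data_dict.
toy_data = {
    'Client 1': {'age': 20, 'bmi': 25.0, 'charges': 1000.0},
    'Client 2': {'age': 30, 'bmi': 30.0, 'charges': 2000.0},
    'Client 3': {'age': 40, 'bmi': 35.0, 'charges': 6000.0},
}

def avg_calc_stats(data_dict, feature):
    """Average a numerical feature across all clients using statistics.mean."""
    return mean(data[feature] for data in data_dict.values())

toy_avg_age = avg_calc_stats(toy_data, 'age')
toy_avg_charges = avg_calc_stats(toy_data, 'charges')
print('{:.1f}'.format(toy_avg_age))       # 30.0
print('{:.2f}'.format(toy_avg_charges))   # 3000.00
```

A practical difference is the empty case: `statistics.mean` raises `StatisticsError` on empty input, whereas dividing by `len(data_dict)` raises `ZeroDivisionError`.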


Now I would like to learn some information about the categorical features. I will begin by determining the proportion of the dataset population for each of the feature values. To that end, I will develop two functions. The first will take my master data dictionary and produce a second dictionary using a categorical feature's unique values as keys. The value for each key will be a list of the data for every client whose categorical feature matches that unique value.

The second function will use that new categorical dictionary to calculate the proportion of the dataset belonging to each of the unique values of that feature.

In [7]:
def feature_dict_creator(data_dict, feature):
    '''
    This function will create a dictionary containing the client
    data for each of the relevant categorical values within the
    given feature.
    '''
    feature_dict = {}
    for client, data in data_dict.items():
        current_feature_val = data[feature]
        if current_feature_val in feature_dict:
            feature_dict[current_feature_val].append(data)
        else:
            feature_dict[current_feature_val] = [data]
    return feature_dict

def prop_calculator(data_dict, feature):
    '''
    This function will calculate the proportion of the dataset
    belonging to the relevant categorical values for the given
    categorical feature.
    '''
    dict_for_prop_calc = feature_dict_creator(data_dict, feature)
    prop_dict = {}
    # Use a distinct loop name so the `feature` argument is not overwritten
    for feature_val, client_data in dict_for_prop_calc.items():
        current_feature_count = len(client_data)
        prop_feature = float(current_feature_count) / len(data_dict)
        prop_dict[feature_val] = prop_feature * 100
    return prop_dict

sex_prop_dict = prop_calculator(insurance_data_dict, 'sex')
for sex in sex_prop_dict.keys():
    print('{pct:.1f}% of the dataset are {sex}s.'.format(pct = sex_prop_dict[sex], sex = sex))

49.5% of the dataset are females.
50.5% of the dataset are males.
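
When only the proportions are needed, and not the grouped client records, `collections.Counter` offers a shorter route. This is a minimal sketch under assumed names: `prop_counter` and the four-client `toy_data` are hypothetical and not part of the analysis above.

```python
from collections import Counter

# Hypothetical four-client dictionary shaped like insurance_data_dict.
toy_data = {
    'Client 1': {'smoker': 'yes'},
    'Client 2': {'smoker': 'no'},
    'Client 3': {'smoker': 'no'},
    'Client 4': {'smoker': 'no'},
}

def prop_counter(data_dict, feature):
    """Return the percentage of clients holding each value of a categorical feature."""
    counts = Counter(data[feature] for data in data_dict.values())
    total = sum(counts.values())
    return {value: 100 * count / total for value, count in counts.items()}

toy_props = prop_counter(toy_data, 'smoker')
for status, pct in toy_props.items():
    print('{:.1f}% responded "{}"'.format(pct, status))
```

The percentages in the returned dictionary should sum to 100 (up to floating-point rounding), which makes for a quick sanity check.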


In [8]:
smokers_prop_dict = prop_calculator(insurance_data_dict, 'smoker')
for status in smokers_prop_dict.keys():
    print("{pct:.1f}% of the dataset responded \"{status}\" when asked if they smoked.".format(pct = smokers_prop_dict[status], status = status))

20.5% of the dataset responded "yes" when asked if they smoked.
79.5% of the dataset responded "no" when asked if they smoked.


In [9]:
region_prop_dict = prop_calculator(insurance_data_dict, 'region')
for region in region_prop_dict.keys():
    print("{pct:.1f}% of the dataset resides in the {region} region of the U.S.".format(pct = region_prop_dict[region], region = region))

24.3% of the dataset resides in the southwest region of the U.S.
27.2% of the dataset resides in the southeast region of the U.S.
24.3% of the dataset resides in the northwest region of the U.S.
24.2% of the dataset resides in the northeast region of the U.S.


In [10]:
children_prop_dict = prop_calculator(insurance_data_dict, 'children')
for num_children in children_prop_dict.keys():
    print("{pct:.1f}% of the dataset have {children} child(ren).".format(pct = children_prop_dict[num_children], children = num_children))

42.9% of the dataset have 0 child(ren).
24.2% of the dataset have 1 child(ren).
11.7% of the dataset have 3 child(ren).
17.9% of the dataset have 2 child(ren).
1.3% of the dataset have 5 child(ren).
1.9% of the dataset have 4 child(ren).


I would now like to look at how the average values of my numerical features change with different categorical features. For example, I would like to answer questions like "Is there a difference in the average age/BMI/per annum cost of insurance between males and females?". I will write another function to do just that.

In [11]:
def avg_feat_cat(data_dict, num_feat, cat_feat):
    '''
    This function will output the average of a numerical
    feature for a given categorical feature in the data-
    set.
    '''
    cat_org_dict = feature_dict_creator(data_dict, cat_feat)
    cat_avg_dict = {}
    for cat_feature, data_list in cat_org_dict.items():
        total_num_feat = 0
        for data in data_list:
            total_num_feat += data[num_feat]
        avg_num_feature = float(total_num_feat) / len(data_list)
        cat_avg_dict[cat_feature] = avg_num_feature
    return cat_avg_dict

avg_age_sex = avg_feat_cat(insurance_data_dict, 'age', 'sex')
for cat in avg_age_sex:
    print('The average age of {cat}s is {avg:.1f} years.'.format(cat = cat, avg = avg_age_sex[cat]))

The average age of females is 39.5 years.
The average age of males is 38.9 years.
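
The same grouped average can also be computed in a single pass by keeping a running total and count per category, without first storing each group's records. The sketch below uses `collections.defaultdict`; `avg_by_category` and `toy_data` are hypothetical names with made-up values.

```python
from collections import defaultdict

# Hypothetical three-client dictionary shaped like insurance_data_dict.
toy_data = {
    'Client 1': {'sex': 'female', 'age': 30},
    'Client 2': {'sex': 'male', 'age': 45},
    'Client 3': {'sex': 'female', 'age': 50},
}

def avg_by_category(data_dict, num_feat, cat_feat):
    """Average a numerical feature within each value of a categorical feature."""
    totals = defaultdict(float)  # running sum per category
    counts = defaultdict(int)    # number of clients per category
    for data in data_dict.values():
        totals[data[cat_feat]] += data[num_feat]
        counts[data[cat_feat]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

toy_avg = avg_by_category(toy_data, 'age', 'sex')
for cat, avg in toy_avg.items():
    print('The average age of {}s is {:.1f} years.'.format(cat, avg))
```

The trade-off is that this version discards the per-client records, so it suits cases where only the averages are of interest.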


In [12]:
avg_bmi_sex = avg_feat_cat(insurance_data_dict, 'bmi', 'sex')
for cat in avg_bmi_sex:
    print('The average BMI of {cat}s is {avg:.1f}.'.format(cat = cat, avg = avg_bmi_sex[cat]))

The average BMI of females is 30.4.
The average BMI of males is 30.9.


In [13]:
avg_cost_sex = avg_feat_cat(insurance_data_dict, 'charges', 'sex')
for cat in avg_cost_sex:
    print('The average annual insurance cost for {cat}s is {avg:.2f} dollars.'.format(cat = cat, avg = avg_cost_sex[cat]))

The average annual insurance cost for females is 12569.58 dollars.
The average annual insurance cost for males is 13956.75 dollars.


In [14]:
avg_age_smoker = avg_feat_cat(insurance_data_dict, 'age', 'smoker')
for cat in avg_age_smoker:
    print('The average age of clients who responded \"{cat}\" to whether they smoke is {avg:.1f} years.'.format(cat = cat, avg = avg_age_smoker[cat]))

The average age of clients who responded "yes" to whether they smoke is 38.5 years.
The average age of clients who responded "no" to whether they smoke is 39.4 years.


In [15]:
avg_bmi_smoker = avg_feat_cat(insurance_data_dict, 'bmi', 'smoker')
for cat in avg_bmi_smoker:
    print('The average BMI of clients who responded \"{cat}\" to whether they smoke is {avg:.1f}.'.format(cat = cat, avg = avg_bmi_smoker[cat]))

The average BMI of clients who responded "yes" to whether they smoke is 30.7.
The average BMI of clients who responded "no" to whether they smoke is 30.7.


In [16]:
avg_cost_smoker = avg_feat_cat(insurance_data_dict, 'charges', 'smoker')
for cat in avg_cost_smoker:
    print('The average yearly cost of insurance for clients who responded \"{cat}\" to whether they smoke is {avg:.2f} dollars.'.format(cat = cat, avg = avg_cost_smoker[cat]))

The average yearly cost of insurance for clients who responded "yes" to whether they smoke is 32050.23 dollars.
The average yearly cost of insurance for clients who responded "no" to whether they smoke is 8434.27 dollars.


In [17]:
avg_age_region = avg_feat_cat(insurance_data_dict, 'age', 'region')
for cat in avg_age_region:
    print('The average age of clients residing in the {cat} region is {avg:.1f} years.'.format(cat = cat, avg = avg_age_region[cat]))

The average age of clients residing in the southwest region is 39.5 years.
The average age of clients residing in the southeast region is 38.9 years.
The average age of clients residing in the northwest region is 39.2 years.
The average age of clients residing in the northeast region is 39.3 years.


In [18]:
avg_bmi_region = avg_feat_cat(insurance_data_dict, 'bmi', 'region')
for cat in avg_bmi_region:
    print('The average BMI of clients residing in the {cat} region is {avg:.1f}.'.format(cat = cat, avg = avg_bmi_region[cat]))

The average BMI of clients residing in the southwest region is 30.6.
The average BMI of clients residing in the southeast region is 33.4.
The average BMI of clients residing in the northwest region is 29.2.
The average BMI of clients residing in the northeast region is 29.2.


In [19]:
avg_cost_region = avg_feat_cat(insurance_data_dict, 'charges', 'region')
for cat in avg_cost_region:
    print('The average annual cost of insurance for clients residing in the {cat} region is {avg:.2f} dollars.'.format(cat = cat, avg = avg_cost_region[cat]))

The average annual cost of insurance for clients residing in the southwest region is 12346.94 dollars.
The average annual cost of insurance for clients residing in the southeast region is 14735.41 dollars.
The average annual cost of insurance for clients residing in the northwest region is 12417.58 dollars.
The average annual cost of insurance for clients residing in the northeast region is 13406.38 dollars.


In [24]:
avg_age_children = avg_feat_cat(insurance_data_dict, 'age', 'children')
for cat in sorted(list(avg_age_children.keys())):
    print('The average age of clients with {cat} child(ren) is {avg:.1f} years.'.format(cat = cat, avg = avg_age_children[cat]))

The average age of clients with 0 child(ren) is 38.4 years.
The average age of clients with 1 child(ren) is 39.5 years.
The average age of clients with 2 child(ren) is 39.4 years.
The average age of clients with 3 child(ren) is 41.6 years.
The average age of clients with 4 child(ren) is 39.0 years.
The average age of clients with 5 child(ren) is 35.6 years.


In [25]:
avg_bmi_children = avg_feat_cat(insurance_data_dict, 'bmi', 'children')
for cat in sorted(list(avg_bmi_children.keys())):
    print('The average BMI of clients with {cat} child(ren) is {avg:.1f}.'.format(cat = cat, avg = avg_bmi_children[cat]))

The average BMI of clients with 0 child(ren) is 30.6.
The average BMI of clients with 1 child(ren) is 30.6.
The average BMI of clients with 2 child(ren) is 31.0.
The average BMI of clients with 3 child(ren) is 30.7.
The average BMI of clients with 4 child(ren) is 31.4.
The average BMI of clients with 5 child(ren) is 29.6.


In [26]:
avg_cost_children = avg_feat_cat(insurance_data_dict, 'charges', 'children')
for cat in sorted(list(avg_cost_children.keys())):
    print('The average annual cost of insurance for clients with {cat} child(ren) is {avg:.2f} dollars.'.format(cat = cat, avg = avg_cost_children[cat]))

The average annual cost of insurance for clients with 0 child(ren) is 12365.98 dollars.
The average annual cost of insurance for clients with 1 child(ren) is 12731.17 dollars.
The average annual cost of insurance for clients with 2 child(ren) is 15073.56 dollars.
The average annual cost of insurance for clients with 3 child(ren) is 15355.32 dollars.
The average annual cost of insurance for clients with 4 child(ren) is 13850.66 dollars.
The average annual cost of insurance for clients with 5 child(ren) is 8786.04 dollars.


## Conclusions

In this project, I successfully imported the simulated U.S. medical insurance data provided by Codecademy. I then added the data to conveniently structured lists and dictionaries for descriptive analysis. I was able to discern the following important takeaways:

* Smoking appears to affect the yearly cost of insurance the most. Within the data, 20.5% of clients responded as smokers. These clients paid, on average, \$32,050.23 per year while non-smokers paid \$8,434.27, a difference of roughly 3.8 times. To keep insurance costs low, I would suggest clients quit smoking.

* While the average age of clients in each region is relatively consistent (ranging from 38.9 to 39.5 years of age), the average BMI has a broader range (29.2 to 33.4) and so too does the average yearly cost (12346.94 to 14735.41 dollars). When looking at the charges, the southeast region pays the highest yearly insurance costs. The southeast region also has the highest average BMI. This suggests that BMI, along with smoking status, is a key contributor to insurance premiums.

* Males paid higher yearly insurance costs than females. The other numerical features are roughly consistent between males and females in this data, so the cause has yet to be revealed.

* Clients with 5 children on average paid much less than other clients. They also tended to be younger (35.6 years on average). Because only 1.3% of the dataset has 5 children, these averages rest on a small group. It is not immediately apparent whether the number of children is a driving contributor or whether the difference is explained by the change in average age.
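
A possible next step for the open questions above is to average a numerical feature over two categorical features at once, for example charges by number of children and smoking status, while also reporting how many clients fall in each group. The sketch below is only an illustration: `avg_by_two_categories` is a hypothetical helper and `toy_data` holds made-up values, not results from the dataset.

```python
from collections import defaultdict

# Hypothetical four-client dictionary shaped like insurance_data_dict.
toy_data = {
    'Client 1': {'children': 0, 'smoker': 'yes', 'charges': 30000.0},
    'Client 2': {'children': 0, 'smoker': 'no', 'charges': 8000.0},
    'Client 3': {'children': 5, 'smoker': 'no', 'charges': 9000.0},
    'Client 4': {'children': 0, 'smoker': 'no', 'charges': 10000.0},
}

def avg_by_two_categories(data_dict, num_feat, cat_feat_a, cat_feat_b):
    """Return {(value_a, value_b): (average, group_size)} for a numerical feature."""
    groups = defaultdict(list)
    for data in data_dict.values():
        groups[(data[cat_feat_a], data[cat_feat_b])].append(data[num_feat])
    return {key: (sum(vals) / len(vals), len(vals)) for key, vals in groups.items()}

toy_summary = avg_by_two_categories(toy_data, 'charges', 'children', 'smoker')
for key in sorted(toy_summary):
    avg, n = toy_summary[key]
    print('{} children, smoker={}: {:.2f} dollars (n={})'.format(key[0], key[1], avg, n))
```

Reporting the group size next to each average makes it easier to spot groups too small to support a conclusion.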