# U.S. Medical Insurance Costs

## Given a .csv of U.S. Medical Insurance Costs can we:

1. Examine the average cost of insurance?
    * Which region has the highest cost?
    * Which region has the lowest cost?
    * What's the highest overall cost?
    * What's the lowest overall cost? 
    
    <br>
2. Examine the average age of insurance holders.
    * What's the highest age overall?
    * What's the youngest age overall?
    * Is there an oldest region?
    * Is there a youngest region?
    
    <br>
3. Examine the BMI of the insurance holders.
    * What is the average BMI?
    * Is there a region with a particularly high overall BMI?
    * Is there a region with a particularly low overall BMI?
    
    <br>
4. Examine the impact of smoking status on cost.
    * Does smoking mean that insurance cost is higher?
    
    <br>
5. Examine the impact of having children on cost.
    * Does cost increase with the number of children?
    
    <br>
6. What's the average age of insurance holders with at least one child?

In [16]:
import csv
from collections import Counter

In [17]:
# Create variables (as lists) for the data contained in the .csv file
age = []
sex = []
bmi = []
num_of_children = []
smoking_status = []
region = []
charges = []

In [18]:
# To get the data contained in the .csv file into the list variables, we create a function that
# opens the .csv file and reads it row by row using the csv module's DictReader class.

def list_to_data(lst, csv_file, column_name):
    # open the .csv file (newline='' is what the csv module documentation recommends for file objects)
    with open(csv_file, 'r', newline='') as csv_info:
        csv_dict = csv.DictReader(csv_info)
        # Loop through the rows in csv_dict and append the requested column's value to the list variable
        for row in csv_dict:
            lst.append(row[column_name])
        # return the final list
        return lst

To get the information stored in the insurance.csv file into the previously defined variables, we will
call the `list_to_data` function with the list, the .csv file, and the name of the column that matches the list.

In [19]:
ages = list_to_data(age, 'insurance.csv', 'age')
sexes = list_to_data(sex, 'insurance.csv', 'sex')
bmis = list_to_data(bmi, 'insurance.csv', 'bmi')
num_of_children = list_to_data(num_of_children, 'insurance.csv', 'children')
smoking_status = list_to_data(smoking_status, 'insurance.csv', 'smoker')
regions = list_to_data(region, 'insurance.csv', 'region')
cost = list_to_data(charges, 'insurance.csv', 'charges')
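Calling `list_to_data` once per column reopens the file seven times. As a rough alternative sketch (the name `load_columns` is hypothetical and not part of the notebook), every column could be collected in a single pass. The two sample rows below are made up for the demonstration and only mimic the layout of insurance.csv.

```python
import csv
import io

def load_columns(csv_source):
    """Read every column of a CSV into a dict of lists in a single pass."""
    reader = csv.DictReader(csv_source)
    # One empty list per column named in the header row
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns

# Illustrative rows only (not taken from the real dataset)
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,0,no,southwest,2000.0\n"
    "45,male,31.5,2,yes,northeast,21000.0\n"
)
columns = load_columns(sample)
print(columns["age"])
print(columns["region"])
```

With the real file, the same function could be given an open file object, for example `with open('insurance.csv', newline='') as f: columns = load_columns(f)`.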


We now have organized, labeled lists of data to analyze. <br>

In order to analyze the data collected, we will need to build out functions that will perform the analysis on the lists and return the desired value - either an average or a total.
* Analyze cost
    * return the average cost of insurance
    * return the average cost per region
    * return the highest and lowest cost
* Analyze age
    * return the average age of an insurance holder
    * return the average age per region
    * return the highest and lowest ages
* Analyze BMI
    * return the average overall BMI 
    * return the average BMI per region
* Analyze smoking_status
    * return how many smokers vs non-smokers
    * return the cost difference between smokers vs non-smokers
* Analyze children
    * return the average cost of insurance as number of children increases 
    * return the average age of an insurance holder who has at least one child
     <br>

Since a good deal of what we want to look at are averages (average cost, average age, average BMI, etc.) a helper function would be useful. We could then use that function to determine the averages and save them to their own variables for later use if necessary.

Another helper function that may be useful is one that will return the highest and lowest data points in a list. 
These returned values can also be stored for later use if necessary.
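As a minimal sketch of the averaging helper described above (the name `find_list_average` and the sample values are assumptions for illustration), the function converts each string to a float because the csv module returns every value as text.

```python
def find_list_average(lst):
    """Return the mean of a list of numeric strings, rounded to two decimals."""
    if not lst:
        raise ValueError("cannot average an empty list")
    # Values read from the .csv are strings, so convert before summing
    total = sum(float(item) for item in lst)
    return round(total / len(lst), 2)

# Made-up sample values for the demonstration
sample_ages = ['19', '18', '28', '33']
average_age = find_list_average(sample_ages)
print(average_age)
```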

A class may also be useful in analyzing some of the other data. 


In [20]:
# This helper function will determine the highest and lowest values in a given list
def find_high_low(lst):
    # set two variables to track our highest and lowest values
    highest = float('-inf')
    lowest = float('inf')
    # iterate through the list
    for item in lst:
        # if the item is higher than our highest variable, we assign that item to it
        if float(item) > float(highest):
            highest = item
        if float(item) < float(lowest):
            lowest = item
    # finally we return our variables
    return highest, lowest


In [21]:
# The `find_high_low` function can be used to save the highest and lowest values into variables with names corresponding
# to the list passed for later use if necessary

highest_lowest_cost = find_high_low(cost)
highest_lowest_age = find_high_low(ages)
highest_lowest_bmi = find_high_low(bmis)

print("The highest and lowest cost of insurance are: {highest_lowest_cost}".format(highest_lowest_cost=highest_lowest_cost))
print("The oldest and youngest insurance holders are: {highest_lowest_age}".format(highest_lowest_age=highest_lowest_age))
print("The highest and lowest BMIs are: {highest_lowest_bmi}".format(highest_lowest_bmi=highest_lowest_bmi))

The highest and lowest cost of insurance are: ('63770.42801', '1121.8739')
The oldest and youngest insurance holders are: ('64', '18')
The highest and lowest BMIs are: ('53.13', '15.96')
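The same result can likely be reached with Python's built-in `max` and `min`. The sketch below (the name `find_high_low_builtin` is hypothetical) passes `key=float` so that the string values are compared numerically while the original strings are returned, just as `find_high_low` does.

```python
def find_high_low_builtin(lst):
    """Return (highest, lowest) from a list of numeric strings, compared as floats."""
    highest = max(lst, key=float)
    lowest = min(lst, key=float)
    return highest, lowest

# Made-up sample values for the demonstration
sample = ['19', '64', '18', '9']
print(find_high_low_builtin(sample))
# Without key=float the strings would be compared character by character
print(max(sample))
```

The second `print` shows why the conversion matters: compared as text, `'9'` sorts after `'64'`.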


It may be helpful to have a list containing all of the data organized by patient as well. A patient for example would be the first datapoint in each of our lists:<br>
    **Patient1 = {'Age': ages[0], 'Sex': sexes[0], 'BMI': bmis[0], 'Children': num_of_children[0], 'Smoker': smoking_status[0], 'Region': regions[0], 'Charges': cost[0]}**
This would be repeated for each row of data and collected in a list, allowing comparison across patients.
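One way to build such records, sketched here with made-up sample lists (the `sample_` names and `keys` are assumptions for illustration), is to `zip` the parallel lists together so that each tuple holds one patient's values.

```python
# Made-up sample lists standing in for the seven data lists
sample_ages = ['19', '18']
sample_sexes = ['female', 'male']
sample_bmis = ['27.9', '33.8']
sample_children = ['0', '1']
sample_smokers = ['yes', 'no']
sample_regions = ['southwest', 'southeast']
sample_costs = ['16000.0', '1700.0']

keys = ['Age', 'Sex', 'BMI', 'Children', 'Smoker', 'Region', 'Charges']

# zip(...) yields one tuple per patient; dict(zip(keys, row)) labels each value
patients = [
    dict(zip(keys, row))
    for row in zip(sample_ages, sample_sexes, sample_bmis, sample_children,
                   sample_smokers, sample_regions, sample_costs)
]
print(patients[0])
```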

In [22]:
# A Class will be built that takes in information of the Patient (our seven data lists)
class Patient_Info:
    def __init__(self, age_list, sex_list, bmi_list, children_list, smoking_list, region_list, cost_list):
        self.age = age_list
        self.sex = sex_list
        self.bmi = bmi_list
        self.children = children_list
        self.smoking_status = smoking_list
        self.regions = region_list
        self.cost = cost_list
        self.all_patient_info = []
        self.southwest = []
        self.southeast = []
        self.northwest = []
        self.northeast = []

    # The first function will merge together all of our data into a list, making it easier to access and compare 
    # later on
    def build_patient_list(self):
        for i in range(len(self.age)):
            patient = {"Age": self.age[i], 'Sex': self.sex[i], 'BMI': self.bmi[i], 'Children': self.children[i], 'Smoker': self.smoking_status[i],
            'Region': self.regions[i], 'Charges': self.cost[i]}
            self.all_patient_info.append(patient)

    # We should also have lists of these patients sorted by region, since a lot of our data will make comparisons
    # between regions
    def build_region_lists(self):
        #Look at every patient in the patient list
        for patient in self.all_patient_info:
            # if a patient matches a designated region, put that patient in the corresponding region list
            if patient['Region'] == 'southwest':
                self.southwest.append(patient)
            elif patient['Region'] == 'southeast':
                self.southeast.append(patient)
            elif patient['Region'] == 'northwest':
                self.northwest.append(patient)
            else:
                self.northeast.append(patient)
    

    # This function will help find the average of a given item in our dictionary
    def find_average(self, lst, string):
        # Create a running total
        total = 0
        # iterate through every item in the list
        for item in lst:
            # add that item's subvalue to total as a float 
            total += float(item[string])
        # divide the total by the length of the list and return it to find the average, rounded to the second decimal
        return round(total / len(lst), 2)

    # The next three functions will all use the previously created helper function to determine the average
    # of an item in a certain region. We will pass in the region we wish to examine in the function call.
    def avg_cost_per_region(self, region_list):
        return self.find_average(region_list, "Charges")

    def avg_bmi_per_region(self, region_list):
        return self.find_average(region_list, "BMI")
    
    def avg_age_per_region(self, region_list):
        return self.find_average(region_list, 'Age')

    # This function will count the number of smokers vs non-smokers
    def num_smokers(self):
        # Create a Counter object to hold our smoker v non-smoker info
        self.smokers = Counter()
        # Iterate through our patient data
        for patient in self.all_patient_info:
            # if a patient has a 'yes' under their 'Smoker' key, assign it to the 'yes' key of our Counter
            if patient['Smoker'] == 'yes':
                self.smokers['yes'] += 1
            # Otherwise, add the patient to our 'no' key of our Counter
            else:
                self.smokers['no'] += 1
        # Return our Counter object with the total number of smokers vs non-smokers
        return self.smokers

    # This function will give us the difference in charges for a smoker and a non-smoker
    def cost_difference_smokers(self):
        # Create two variables to track total charges for smokers and non-smokers
        yes_total = 0
        no_total = 0
        # Iterate through our patient data
        for patient in self.all_patient_info:
            # if the patient responded 'yes' to being a smoker we'll increase the 'yes_total' variable by their charge
            if patient['Smoker'] == 'yes':
                yes_total += float(patient['Charges'])
            # Otherwise we'll add the charge to the 'no_total'
            else:
                no_total += float(patient['Charges'])
        # We print the absolute difference in total charges between the two groups, rounded to the second decimal
        print('The difference in charge between smokers and non-smokers is $' + str(round(abs(yes_total - no_total), 2))
        + ' with smokers totalling $' + str(round(yes_total, 2)) + ' and non-smokers totalling $' + str(round(no_total, 2)) + '.')
    
    # We want to see what the avg charge is per number of children
    def cost_per_child(self):
        # Create two Counters to handle our number of children and our charges per number
        num_children = Counter()
        charges_per_child = Counter()
        # Go through our list of patients
        for patient in self.all_patient_info:
            # Add the number of children to the appropriate counter, the key will be the number of children with the 
            #value being the response number for that amount
            num_children[patient['Children']] += 1
            # Here we add the charges associated with a particular response for children
            charges_per_child[patient['Children']] += float(patient['Charges'])
        
        # Counters can't be divided by other Counters, so we divide key by key instead. Sorting the two sets of
        # values independently could pair a total with the wrong count, so each total is matched to its own key.
        # Keys are ordered by group size (fewest policy holders first).
        keys_by_group_size = sorted(num_children, key=num_children.get)
        avg = [round(charges_per_child[key] / num_children[key], 2) for key in keys_by_group_size]
        # Finally, we print the average charge associated with having a set number of children
        print(avg)
        
    # This function finds the most common age among the parents in our data, weighted by their number of children
    def age_of_parents(self):
        # Start by setting a Counter to track the ages of our parents
        parent_ages = Counter()
        # Go through our patient list
        for patient in self.all_patient_info:
            # The .csv values are strings, so convert before comparing; continue only if there is at least one child
            if int(patient['Children']) != 0:
                # Use the parent's age as the key and add that parent's number of children to the value
                parent_ages[patient['Age']] += int(patient['Children'])
        # Finally, we return the age with the highest child count in our parent Counter
        return parent_ages.most_common()[0][0]



Now that the Class is built out we can create an instance of it and call the functions within to see what insights our data holds.

In [23]:
#Create an instance of our class
Patient = Patient_Info(ages, sexes, bmis, num_of_children, smoking_status, regions, cost)

We now have an instance of our Patient_Info Class! Within that class we had two functions that built lists in a particular way, the .build_patient_list function and the .build_region_lists function; the next step is to create these two lists to use for the analysis we want to perform.

In [24]:
# Create our patient and region lists for future use. Both of these order our patient data in a particular way. 
# The first one sorts it by patient and the second one creates lists based on what region our patients live in.
Patient.build_patient_list()
Patient.build_region_lists()

We'll start running through the desired analysis now, starting from the top of our previous list:

    * Analyze cost
        * return the average cost of insurance
        * return the average cost per region
        * return the highest and lowest cost

In [25]:
# We can use the .find_average function to help with the first question. 
# We pass in Patient.all_patient_info as that is the variable created by our class constructor to hold all of 
# our patient data, including charges. The string 'Charges' gets passed in to tell our function that we want to 
# examine all of the charges contained within our patient list.

Patient.find_average(Patient.all_patient_info, 'Charges')

13270.42

In [26]:
# We can use the .avg_cost_per_region function to answer the second one. We need to pass in Patient.region_name
# for each region, because each region list is a variable created by the constructor of the Patient_Info class.

northwest = Patient.avg_cost_per_region(Patient.northwest)
northeast = Patient.avg_cost_per_region(Patient.northeast)
southwest = Patient.avg_cost_per_region(Patient.southwest)
southeast = Patient.avg_cost_per_region(Patient.southeast)

print(northwest, northeast, southwest, southeast, sep='\n')

12417.58
13406.38
12346.94
14735.41


In [27]:
# Finding the highest and lowest cost was handled above with the find_high_low function. 
# But it can be formatted here for readability. 

print("The highest cost for insurance is $" + str(highest_lowest_cost[0]) + " while the lowest cost is $" +
str(highest_lowest_cost[1]) + ".")

The highest cost for insurance is $63770.42801 while the lowest cost is $1121.8739.


Our next group of questions involved mostly looking at the ages of the insurance holders.

    * Analyze age
        * return the average age of an insurance holder
        * return the average age per region
        * return the highest and lowest ages


In [28]:
# Again we can use the helper function within the class to examine the average age of the policy holders.
# This time however instead of passing in 'Charges' as our string we will pass in 'Age' as we wish to 
# examine the average age of all of our policy holders.

Patient.find_average(Patient.all_patient_info, 'Age')

39.21

In [29]:
# There is also a function that handles the average age per region. Again we will pass in the Patient.region_name
# for each region in order to find the average age per region.

northwest = Patient.avg_age_per_region(Patient.northwest)
northeast = Patient.avg_age_per_region(Patient.northeast)
southwest = Patient.avg_age_per_region(Patient.southwest)
southeast = Patient.avg_age_per_region(Patient.southeast)
print(northwest, northeast, southwest, southeast)

39.2 39.27 39.46 38.94


In [31]:
# The find_high_low helper function has already been used to find our highest and lowest ages, 
# we will format the output here.

print("The oldest insurance holder is " + str(highest_lowest_age[0]) + ", and the youngest is " + str(highest_lowest_age[1])
 + ". ")

The oldest insurance holder is 64, and the youngest is 18. 


The next section tackles the BMI of our insurance holders.

    * Analyze BMI
        * return the average overall BMI 
        * return the average BMI per region


In [32]:
# This time we'll pass in 'BMI' as our string to find the average BMI of our policy holders.

Patient.find_average(Patient.all_patient_info, 'BMI')

30.66

In [33]:
# Again we will create regional variables and pass in the Patient.region_name to find the average BMI per region
# using the .avg_bmi_per_region function of the Patient_Info class.

northwest = Patient.avg_bmi_per_region(Patient.northwest)
northeast = Patient.avg_bmi_per_region(Patient.northeast)
southwest = Patient.avg_bmi_per_region(Patient.southwest)
southeast = Patient.avg_bmi_per_region(Patient.southeast)

print(northwest, northeast, southwest, southeast)

29.2 29.17 30.6 33.36


The `.avg_cost_per_region`, `.avg_age_per_region`, and `.avg_bmi_per_region` functions all rely on the `.find_average` function created within the class. The only difference between them is the string parameter they pass along: each one averages a different key of the region list, which eliminates the need to repeat the same process for every value within the class. 
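The same idea can be pushed one step further. As a rough sketch (the function `average_by_region` and the sample patients are assumptions, not part of the class above), a single function could group patients by their `'Region'` value and average any key for every region at once.

```python
from collections import defaultdict

def average_by_region(patients, key):
    """Return {region: average of `key`} rounded to two decimals."""
    grouped = defaultdict(list)
    for patient in patients:
        # Values are stored as strings, so convert before averaging
        grouped[patient['Region']].append(float(patient[key]))
    return {region: round(sum(values) / len(values), 2)
            for region, values in grouped.items()}

# Made-up sample patients for the demonstration
sample_patients = [
    {'Region': 'southwest', 'BMI': '30.0', 'Charges': '1000.0'},
    {'Region': 'southwest', 'BMI': '32.0', 'Charges': '3000.0'},
    {'Region': 'northeast', 'BMI': '25.0', 'Charges': '5000.0'},
]
print(average_by_region(sample_patients, 'BMI'))
print(average_by_region(sample_patients, 'Charges'))
```

Grouping by the value found in the data also avoids hard-coding the four region names.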

Our next questions revolved around smokers. 

    * Analyze smoking_status
        * return how many smokers vs non-smokers
        * return the cost difference between smokers vs non-smokers


In [36]:
# The Patient_Info Class has a function, .num_smokers which uses a Counter to tell us how many yes/no responses we 
# received in regard to this value in our patient list.

Patient.num_smokers()

Counter({'yes': 274, 'no': 1064})

The above code shows us that we clearly have fewer smokers than non-smokers! 

In [37]:
# There is also a .cost_difference_smokers function within the class that shows the difference in total charges 
# between smokers and non-smokers.

Patient.cost_difference_smokers()


The difference in charge between smokers and non-smokers is $192297.95 with smokers totalling $8781763.52 and non-smokers totalling $8974061.47.


So while there are only 274 smokers, their combined charges are almost as high as those of all 1064 non-smokers!!
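Because the two groups differ so much in size, a per-person average is easier to compare than the totals. The short calculation below reuses the totals and counts printed above; the variable names are chosen here for illustration.

```python
# Totals and counts taken from the outputs above
smoker_total, smoker_count = 8781763.52, 274
non_smoker_total, non_smoker_count = 8974061.47, 1064

# Average charge per policy holder in each group
smoker_avg = round(smoker_total / smoker_count, 2)
non_smoker_avg = round(non_smoker_total / non_smoker_count, 2)

# How many times higher the average smoker's charge is
ratio = round(smoker_avg / non_smoker_avg, 1)

print(smoker_avg, non_smoker_avg, ratio)
```

On these figures the average smoker is charged roughly $32,050 against roughly $8,434 for the average non-smoker, a little under four times as much.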

Finally, our last questions involved children.

    * Analyze children
        * return the average cost of insurance as number of children increases 
        * return the average age of an insurance holder who has at least one child

In [40]:
# The first question can be answered using the .cost_per_child function of the Patient_Info Class.
# It prints the average cost associated with each number of children reported by our policy holders.
# Note that the values are ordered by group size (fewest policy holders first), not by number of children.

Patient.cost_per_child()

[8786.04, 13850.66, 15355.32, 15073.56, 12731.17, 12365.98]
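A bare list makes it hard to tell which average belongs to which number of children. As a hedged sketch (the function `average_charge_by_children` and the sample patients are assumptions for illustration), returning a dictionary keyed by the number of children keeps each average attached to its label.

```python
from collections import defaultdict

def average_charge_by_children(patients):
    """Return {number of children: average charge}, ordered by number of children."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for patient in patients:
        n = int(patient['Children'])
        totals[n] += float(patient['Charges'])
        counts[n] += 1
    # Divide each total by the count stored under the same key
    return {n: round(totals[n] / counts[n], 2) for n in sorted(totals)}

# Made-up sample patients for the demonstration
sample_patients = [
    {'Children': '0', 'Charges': '1000.0'},
    {'Children': '0', 'Charges': '3000.0'},
    {'Children': '2', 'Charges': '9000.0'},
    {'Children': '1', 'Charges': '4000.0'},
]
print(average_charge_by_children(sample_patients))
```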


In [41]:
# The last question is approached using the .age_of_parents function, which tallies the number of children
# by parent age and then returns the age with the most children (a most-common age, not a true average).

Patient.age_of_parents()

'39'
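The original question asked for an average, while the method above returns the most common age. A minimal sketch of an arithmetic mean over policy holders with at least one child might look like the following; the function `average_parent_age` and the sample patients are assumptions for illustration.

```python
def average_parent_age(patients):
    """Return the mean age of patients with at least one child, or None if there are none."""
    # Values are strings, so convert both the child count and the age
    parent_ages = [float(p['Age']) for p in patients if int(p['Children']) > 0]
    if not parent_ages:
        return None
    return round(sum(parent_ages) / len(parent_ages), 2)

# Made-up sample patients for the demonstration
sample_patients = [
    {'Age': '30', 'Children': '0'},
    {'Age': '40', 'Children': '2'},
    {'Age': '50', 'Children': '1'},
]
print(average_parent_age(sample_patients))
```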