# U.S. Medical Insurance Costs

This is a Python portfolio project from **Codecademy's Data Scientist path**. The objective is to apply what I have learned so far about Python to analyze the U.S. Medical Insurance Costs data, which is provided as a CSV file.

### Importing Modules
For this project, I am importing the **csv** and **string** modules, and some functions from the **operator** module. The csv module will be used to read the data from the **insurance.csv** file. The string module will help convert the CSV values to the correct data types (i.e., numeric values should be converted to *int* or *float* as appropriate). The operator functions will be used in a class method that will allow me to search based on specific parameters.

While **pandas** and **matplotlib** would definitely make the analysis easier, I don't know how to use them yet. Let's see how far I can go with just these modules!

In [1]:
import csv
import string
from operator import lt, le, eq, ne, gt, ge
letters = string.ascii_letters
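
As a side note on the type conversion described above: `csv.DictReader` returns every field as a string, so numeric columns need an explicit conversion. The following is a minimal sketch of one possible approach, trying `int` first and then `float`. The helper name `convert_value` and the two sample rows are made up for illustration, and this is not necessarily how the class in this project does it.

```python
import csv
import io


def convert_value(text):
    """Return text as int or float when possible, otherwise unchanged (hypothetical helper)."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


# Made-up rows that only mimic the column layout of insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,northeast,3000.5\n"
    "45,male,31.2,2,yes,southwest,20000.75\n"
)

rows = [
    {field: convert_value(value) for field, value in row.items()}
    for row in csv.DictReader(sample)
]
print(rows[0])
```

Trying the casts in order avoids scanning each value for letters, at the cost of relying on exceptions for control flow.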

### Creating a class with methods to load and analyze the data
I want to create a class that will allow me to load data from a csv file and perform basic calculations on it using class methods. These are the methods I will define for my class:


- **get_data**: loads data from the CSV file into a dictionary of patient keys and records (stored in the object's data attribute)
- **search**: filters the data based on one or more fields and values (e.g., sex='female') and returns a dictionary with the patient keys and records matching the criteria
- **get_patient_data**: returns the patient record based on a patient key input
- **count_frequency**: returns the number of records that match the given criteria (like search but returns a number instead of a dictionary)
- **percent_of_total**: returns the percentage of total based on count_frequency and a given base data set
- **calculate_total**: returns total value of a specified field in a specified dictionary
- **calculate_average**: returns average of a specified field in a specified dictionary
- **calculate_max**: returns max of a specified field in a specified dictionary
- **calculate_min**: returns min of a specified field in a specified dictionary
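
To illustrate how functions from the operator module can drive the comparison in a search like the one described in the list above, here is a minimal standalone sketch. The function `filter_records`, its parameters, and the sample records are hypothetical and simplified; this is not the class method itself.

```python
from operator import eq, le, ne


def filter_records(records, field, value, op=eq):
    """Keep records where op(record[field], value) is true (hypothetical helper)."""
    return {key: rec for key, rec in records.items() if op(rec[field], value)}


# Made-up records keyed by patient number
records = {
    1: {"age": 25, "children": 0, "smoker": "no"},
    2: {"age": 40, "children": 2, "smoker": "yes"},
    3: {"age": 30, "children": 1, "smoker": "no"},
}

young = filter_records(records, "age", 30, op=le)        # age <= 30
parents = filter_records(records, "children", 0, op=ne)  # children != 0
print(sorted(young), sorted(parents))
```

Passing the comparison as a function keeps the filtering logic in one place, so changing "equal to" into "less than or equal to" only means passing a different operator.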


In [2]:
class Insurance_Data:
    
    def __init__(self, csv_file):
        self.csv_file = csv_file
        self.data = {} # empty dictionary where CSV data will be loaded
        
    def __len__(self):
        return len(self.data) # len will be based on the size of the dictionary
    
    def __repr__(self):
        return repr(self.data) # printing an instance of this class will print the dictionary
        
    def get_data(self):
        
        # newline='' is what the csv module documentation recommends when opening files
        with open(self.csv_file, newline='') as csv_data:
            csv_contents = csv.DictReader(csv_data)
            fields = csv_contents.fieldnames
            
            key = 1 # initializes the first patient key

            for line in csv_contents:
                dict_value = {} # temporary dictionary that will be fed into self.data
                for field in fields:
                    line_value = line[field]

                    # a value containing any letter is kept as a string
                    if any(letter in line_value for letter in letters):
                        dict_value[field] = line_value

                    # no letters and no '.' in value: value is an integer
                    elif '.' not in line_value:
                        dict_value[field] = int(line_value)

                    # else, value is a float
                    else:
                        dict_value[field] = float(line_value)
                
                # patient and record added to self.data
                self.data.update({key: dict_value})
                
                # next patient key
                key = key + 1
            
    def search(self, data=None, charges_op=eq, bmi_op=eq, children_op=eq, age_op=eq, **search_criteria):
        
        results = {}
        
        if data is None:
            data = self.data # If no data dictionary is provided, use the data attribute of the object
        
        for patient, record in data.items():
            criteria_match = 1
            for key, value in search_criteria.items():
                
                if isinstance(value, (int, float)): # handles int and float comparison
                    if key == 'charges' and charges_op(record[key], value):
                        criteria_match *= 1
                    elif key == 'children' and children_op(record[key], value):
                        criteria_match *= 1
                    elif key == 'age' and age_op(record[key], value):
                        criteria_match *= 1
                    elif key == 'bmi' and bmi_op(record[key], value):
                        criteria_match *= 1
                    else:
                        criteria_match *= 0
                else: # for other data types
                    if record[key] == value:
                        criteria_match *= 1
                    else:
                        criteria_match *=0
            
            if criteria_match == 1:
                results.update({patient: record})
        
        return results
    
    def get_patient_data(self, patient_name):
        return self.data.get(patient_name)
    
    def count_frequency(self, data=None, **search_criteria):
        return len(self.search(data, **search_criteria))
    
    def percent_of_total(self, data=None, **search_criteria):
        if data is None:
            data = self.data
        percent = self.count_frequency(data, **search_criteria) / len(data) * 100
        return f'{percent}%'
    
    def calculate_total(self, record_key, data=None):
        if data is None:
            data = self.data
        total = 0
        for item in data.values():
            total += item.get(record_key)
        return total
    
    def calculate_average(self, record_key, data=None):
        if data is None:
            data = self.data
        return self.calculate_total(record_key, data) / len(data)
    
    def calculate_max(self, record_key, data=None):
        if data is None:
            data = self.data
        lst = [record[record_key] for record in data.values()]
        return max(lst)
    
    def calculate_min(self, record_key, data=None):
        if data is None:
            data = self.data
        lst = [record[record_key] for record in data.values()]
        return min(lst)

### Creating an Object
Now that I have created the class and defined its methods, let's create an object for that class.

In [3]:
my_data = Insurance_Data('insurance.csv')
my_data.get_data()

### Exploring the Data
Let's use the methods. It will be good to get an overview of our data: its size, count or proportion based on dimensions, average values, minimum values, and maximum values.

In [30]:
cf = my_data.count_frequency
percent = my_data.percent_of_total
avg = my_data.calculate_average
search = my_data.search
get_max = my_data.calculate_max
get_min = my_data.calculate_min
number_of_records = len(my_data)
male_count = cf(sex='male')
female_count = cf(sex='female')
smokers = cf(smoker='yes')
non_smokers = cf(smoker='no')
with_children = cf(children_op=ne, children=0)
no_children = cf(children=0)
average_age = avg('age')
average_bmi = avg('bmi')
average_cost = avg('charges')
max_cost = get_max('charges')
min_cost = get_min('charges')
max_cost_patients = search(charges=max_cost)
min_cost_patients = search(charges=min_cost)


print(
    f'Here\'s an overview of our data:\n\n\
    Number of records: {number_of_records}\n\
    Males: {male_count}, Females: {female_count}\n\
    Smokers: {smokers}, Non-smokers: {non_smokers}\n\
    With children: {with_children}, No children: {no_children}\n\
    Average age: {average_age}\n\
    Average BMI: {average_bmi}\n\
    Average insurance cost: {average_cost}\n\
      ')

print(f'The highest insurance paid by a patient amounted to {max_cost} dollars.\n')

if len(max_cost_patients) > 1:
    print(f'The patients who paid the highest costs are: {max_cost_patients}\n')
else:
    print(f'The patient who paid the highest cost is: {max_cost_patients}.\n')

print(f'While the lowest insurance paid by a patient was {min_cost} dollars.\n')
if len(min_cost_patients) > 1:
    print(f'The patients who paid the lowest costs are: {min_cost_patients}\n')
else:
    print(f'The patient who paid the lowest cost is: {min_cost_patients}.\n')

Here's an overview of our data:

    Number of records: 1338
    Males: 676, Females: 662
    Smokers: 274, Non-smokers: 1064
    With children: 764, No children: 574
    Average age: 39.20702541106129
    Average BMI: 30.663396860986538
    Average insurance cost: 13270.422265141257
      
The highest insurance paid by a patient amounted to 63770.42801 dollars.

The patient who paid the highest cost is: {544: {'age': 54, 'sex': 'female', 'bmi': 47.41, 'children': 0, 'smoker': 'yes', 'region': 'southeast', 'charges': 63770.42801}}.

While the lowest insurance paid by a patient was 1121.8739 dollars.

The patient who paid the lowest cost is: {941: {'age': 18, 'sex': 'male', 'bmi': 23.21, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1121.8739}}.



In reality, insurance cost is affected by various factors. Statistical methods would be a more rigorous way to test the correlation between cost and the other variables in our dataset, but we will use simple calculations and logic for now.

### Who is paying more between men and women?
Does sex impact insurance costs? And if so, who is paying more? For the first test, let's see what the average costs are for males and females.

In [5]:
males = search(sex='male')
females = search(sex='female')
average_cost_male = avg('charges', males)
average_cost_female = avg('charges', females)
print(f'The average insurance cost for males in our data is: {average_cost_male} dollars.')
print(f'The average insurance cost for females in our data is: {average_cost_female} dollars.')

The average insurance cost for males in our data is: 13956.751177721886 dollars.
The average insurance cost for females in our data is: 12569.57884383534 dollars.


Our initial test using a simple average shows that **men pay more than women**. However, this calculation also captures other variables that potentially affect insurance costs (e.g., smoking, BMI, etc.). It is therefore important to check whether the result holds when all other factors are the same.
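
Because a simple average can hide differences in how each group is composed, it can help to report how many records sit behind each average. A minimal sketch, assuming records shaped like the ones in this project; the function `average_by_groups` and the sample values are invented for illustration and are not taken from the dataset:

```python
from collections import defaultdict


def average_by_groups(records, group_fields, value_field):
    """Return {group: (count, average)} for each combination of group_fields."""
    totals = defaultdict(lambda: [0, 0.0])  # group -> [count, running sum]
    for rec in records.values():
        group = tuple(rec[field] for field in group_fields)
        totals[group][0] += 1
        totals[group][1] += rec[value_field]
    return {group: (n, s / n) for group, (n, s) in totals.items()}


# Made-up records keyed by patient number
records = {
    1: {"sex": "male", "smoker": "yes", "charges": 30000.0},
    2: {"sex": "male", "smoker": "no", "charges": 4000.0},
    3: {"sex": "female", "smoker": "no", "charges": 5000.0},
    4: {"sex": "female", "smoker": "no", "charges": 7000.0},
}

summary = average_by_groups(records, ("sex", "smoker"), "charges")
for group, (count, mean) in sorted(summary.items()):
    print(group, count, mean)
```

In this made-up sample, the single male smoker pulls the overall male average up, which is the kind of effect the paragraph above warns about.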

So for our next step, let's take a specific subset of our data. We will recalculate the average cost for males and females who are 30 years old or younger (i.e., in the prime of their lives), whose BMI does not exceed the upper limit of what is considered normal, who don't have children, don't smoke, and live in the northeast region of the US.

For BMI, ideally, we should only be capturing the healthy range. However, because the search method accepts only one comparison per field, we will use everything at or below the upper limit of a healthy BMI for now. (You can improve on this code by rewriting the search method.)
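
One way to capture only the healthy range is to filter on both a lower and an upper bound at once. The sketch below is a standalone illustration, not a rewrite of the search method: `in_range` is a hypothetical helper, the records are made up, and the 18.5 lower bound is an assumption based on the commonly cited healthy BMI range.

```python
def in_range(records, field, low, high):
    """Keep records where low <= record[field] <= high (hypothetical helper)."""
    return {key: rec for key, rec in records.items() if low <= rec[field] <= high}


# Assumed bounds: 18.5 is a commonly cited lower limit, 24.9 is the upper limit used in this project
HEALTHY_BMI = (18.5, 24.9)

# Made-up records keyed by patient number
records = {
    1: {"bmi": 17.0},
    2: {"bmi": 22.3},
    3: {"bmi": 24.9},
    4: {"bmi": 31.0},
}

healthy = in_range(records, "bmi", *HEALTHY_BMI)
print(sorted(healthy))
```

Because the result has the same shape as the input dictionary, it could be passed as the data argument to the averaging methods in the same way as any other search result.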

In [6]:
normal_bmi = 24.9
males_retest = search(sex='male', smoker='no', children_op=eq, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
females_retest = search(sex='female', smoker='no', children_op=eq, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
average_cost_male = avg('charges', males_retest)
average_cost_female = avg('charges', females_retest)
print(f'The average insurance cost for males in our data is: {average_cost_male} dollars.')
print(f'The average insurance cost for females in our data is: {average_cost_female} dollars.')

The average insurance cost for males in our data is: 1900.4874142857145 dollars.
The average insurance cost for females in our data is: 3064.219825 dollars.


Based on the above, we can see that in this subset **men actually pay less than women.** Let's see if this still holds true if we change another parameter in our criteria. Does it hold for smokers?

In [7]:
normal_bmi = 24.9
males_retest = search(sex='male', smoker='yes', children_op=eq, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
females_retest = search(sex='female', smoker='yes', children_op=eq, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
average_cost_male = avg('charges', males_retest)
average_cost_female = avg('charges', females_retest)
print(f'The average insurance cost for males in our data is: {average_cost_male} dollars.')
print(f'The average insurance cost for females in our data is: {average_cost_female} dollars.')

The average insurance cost for males in our data is: 14943.3172 dollars.
The average insurance cost for females in our data is: 14990.218233333333 dollars.


While women are still paying more than men, the gap is much smaller among smokers. Let's go back to non-smokers and check the impact of having kids.

In [8]:
males_kids = search(sex='male', smoker='no', children_op=ne, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
males_no_kids = search(sex='male', smoker='no', children_op=eq, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
females_kids = search(sex='female', smoker='no', children_op=ne, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
females_no_kids = search(sex='female', smoker='no', children_op=eq, children=0, bmi_op=le, bmi=normal_bmi, age_op=le, age=30, region='northeast')
average_cost_male_kids = avg('charges', males_kids)
average_cost_male_no_kids = avg('charges', males_no_kids)
average_cost_female_kids = avg('charges', females_kids)
average_cost_female_no_kids = avg('charges', females_no_kids)
print(f'The average insurance cost for males with kids in our data is: {average_cost_male_kids} dollars.')
print(f'The average insurance cost for males without kids in our data is: {average_cost_male_no_kids} dollars.')
print(f'The average insurance cost for females with kids in our data is: {average_cost_female_kids} dollars.')
print(f'The average insurance cost for females without kids in our data is: {average_cost_female_no_kids} dollars.')

The average insurance cost for males with kids in our data is: 4466.600759999999 dollars.
The average insurance cost for males without kids in our data is: 1900.4874142857145 dollars.
The average insurance cost for females with kids in our data is: 11657.971218 dollars.
The average insurance cost for females without kids in our data is: 3064.219825 dollars.


In our kids vs. no kids scenario, we can see that **having kids is associated with a higher average insurance cost** than not having kids, in both the male and female subsets of our data.

### Next Steps

The above are just basic investigations of our data. As I progress in my knowledge of Python, I plan to write functions that will test the correlation of variables and predict insurance costs using multiple regression.
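
As a possible starting point for the correlation step, here is a minimal sketch of the Pearson correlation coefficient written with only the standard library. The function name `pearson_r` and the sample numbers are made up for illustration; nothing here is a result from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values with a perfectly linear relationship
ages = [20, 30, 40, 50]
charges = [2000.0, 3000.0, 4000.0, 5000.0]
print(pearson_r(ages, charges))
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. The sketch assumes neither sequence is constant, since that would make the denominator zero.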