# U.S. Medical Insurance Costs Portfolio Project - Jack Wills


## Objective:

In this project I take a csv file containing an example U.S. medical insurance cost dataset from Codecademy and use Python to import, organise, sort and analyse it.

## Workflow:

I split the project into sections to create a workflow that I could follow and work through. The workflow has the following steps:
<ul>
    <li>Review the dataset</li>
    <li>Decide how I would like to organise and analyse the data</li>
    <li>Learn Markdown so I can present the information as such in Jupyter Notebooks</li>
    <li>Break down the project by section, in Markdown, allowing for the Python code to be added later (thus allowing goals to be set)</li>
    <li>Start using Python3 fundamentals to organise and analyse the dataset</li>
</ul>

## Scope:

Python3 will be used to organise and analyse the dataset in the following steps:

<ol>
    <li>Import the csv file into Python3</li>
    <li>Separate the columns of the csv file into individual lists</li>
    <li>Compile the lists into dictionaries that can be organised by category</li>
    <li>Analyse the data by breaking it down into groups and comparing them against other factors to determine what affects the cost of medical insurance</li>
    <li>Summarise the results</li>
</ol>
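
Steps 2 and 3 can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code: the values are made up, only three of the seven columns are shown, and the names `columns` and `records` are hypothetical.

```python
# Made-up values; the real lists are filled from insurance.csv in Part 1.
age = ["19", "42", "60"]
smoker_status = ["yes", "no", "no"]
cost = ["1000.5", "2000.0", "3000.75"]

# One dictionary keyed by column name keeps the separate lists together.
columns = {"age": age, "smoker": smoker_status, "charges": cost}

# zip() pairs up the i-th value of every list, giving one dictionary per record.
records = [dict(zip(columns.keys(), values)) for values in zip(*columns.values())]

print(records[0])
```

Keeping one dictionary per record makes it easier to filter or group by any column later on.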

### Part 1 - Importing csv data to Python3

Using what I learnt about working with files in Python, import the csv file so that it is usable and readable:

In [111]:
# First import the csv library
import csv

# Then create individual lists to store each column's data
age = []
sex = []
bmi = []
num_of_children = []
smoker_status = []
region = []
cost = []

# Next we can open the csv file and save the associated data to each predefined list:
with open('insurance.csv', newline='') as insurance_csv:
    insurance_data = csv.DictReader(insurance_csv)
    for row in insurance_data:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        num_of_children.append(row['children'])
        smoker_status.append(row['smoker'])
        region.append(row['region'])
        cost.append(row['charges'])
        


We can check to see if the values have been saved correctly by printing any of the new lists:

In [125]:
#print(cost)
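
A lighter check than printing a whole list is to compare list lengths and look at the first few values. The sketch below also converts the numeric columns as they are read, since `csv.DictReader` returns every field as a string. It uses a two-row, made-up stand-in for `insurance.csv` (the `sample_csv` name is hypothetical) so that it runs on its own.

```python
import csv
import io

# A two-row stand-in for insurance.csv (made-up values).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,northwest,4000.50\n"
    "45,male,31.0,2,yes,southeast,39000.25\n"
)

age = []
cost = []
for row in csv.DictReader(sample_csv):
    # Every field arrives as a string, so convert numeric columns here.
    age.append(int(row["age"]))
    cost.append(float(row["charges"]))

# Sanity checks: the lists should be the same length, and a slice shows a few values.
print(len(age), len(cost))
print(cost[:5])
```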

### Part 2 - How many records are there, and are they evenly split by sex and by smoker status (binary data)?

In this section we will find out how many rows or records of data there are, and how the dataset is split by sex and by smoker status.

Sex and smoker status are the only binary (two-category) variables in this dataset:

In [73]:
# Calculate the total number of records by counting the rows (NB: any list can be used)
total_rows = 0

for i in age:
    total_rows += 1

print("There are " + str(total_rows) + " records in this dataset.")

There are 1338 records in this dataset.


Next we look at how the dataset is split by sex and by smoker status:

In [104]:
# First count the male and female sexes in the record and find the ratio of sex in the dataset

male_count = 0
female_count = 0

for i in sex:
    if i == "male":
        male_count += 1
    elif i == "female":
        female_count += 1
        
male_pct = round(male_count / total_rows*100, 2)
female_pct = round(female_count / total_rows*100, 2)
        
print("In the dataset the split between gender is {}% Male to {}% Female.  This is a fairly even split.".format(male_pct, female_pct))


#  Now we can apply the same to smoker status
smoker_count = 0
non_smoker_count = 0

for i in smoker_status:
    if i == "yes":
        smoker_count += 1
    elif i == "no":
        non_smoker_count += 1
        
smoker_pct = round(smoker_count / total_rows*100, 2)
non_smoker_pct = round(non_smoker_count / total_rows*100, 2)
        
print("In the dataset the split between smokers is {}% Smokers to {}% Non Smokers.".format(smoker_pct, non_smoker_pct))

In the dataset the split between gender is 50.52% Male to 49.48% Female.  This is a fairly even split.
In the dataset the split between smokers is 20.48% Smokers to 79.52% Non Smokers.


We can also check to see the split by region if we wish to:

In [75]:
# Create new dictionary to store regions and loop through regions list 

region_count = {}

for i in region:
    if i not in region_count:
        region_count[i] = 1
    else:
        region_count[i] += 1
        
print(region_count)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}


As we can see the regions are pretty evenly split into four quadrants.
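
The counting loops in this part could also be written with `collections.Counter` from the standard library. The sketch below assumes a short made-up list in place of the real `region` list and adds a percentage share for each category.

```python
from collections import Counter

# Made-up sample; in the notebook this would be the `region` list from Part 1.
region = ["southwest", "southeast", "southeast", "northwest", "northeast", "southeast"]

region_count = Counter(region)  # tallies each distinct value
total = sum(region_count.values())

# most_common() lists the categories from largest to smallest count.
for name, count in region_count.most_common():
    share = round(count / total * 100, 2)
    print("{}: {} records ({}%)".format(name, count, share))
```

The same approach works unchanged for the `sex` and `smoker_status` lists.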

### Part 3 - Find Min, Max, Means and Medians of Insurance cost then compare groups

Here we will take the minimum and maximum insurance costs overall, then the mean and the median. We can then compare the costs between female/male and smoker/non-smoker. (Note that these can also be calculated with built-in functions by importing the `statistics` module.)


In [103]:
# First we need to use list comprehension to transform the cost list to floats so we can use the data efficiently.

new_cost = [float(i) for i in cost]

min_cost = round(min(new_cost), 2)
max_cost = round(max(new_cost), 2)

print("The minimum insurance cost in the dataset is ${}.  The maximum insurance cost is ${}.".format(min_cost, max_cost))

The minimum insurance cost in the dataset is $1121.87.  The maximum insurance cost is $63770.43.


In [124]:
# To find the mean we divide the total of all costs by the length of the list (using total_rows)
mean_cost = round(sum(new_cost) / total_rows, 2)
print("The mean insurance cost is ${}.".format(mean_cost))

# To find the median we sort the list, then take the middle value for an odd total or the average of the two middle values for an even total (using // for floor division)
new_cost.sort()

if total_rows % 2 == 0:
    median1 = new_cost[total_rows // 2]
    median2 = new_cost[total_rows // 2 - 1]
    median_cost = round((median1 + median2) / 2, 2)
else:
    median_cost = round(new_cost[total_rows // 2], 2)
    
print("The median insurance cost is ${}.".format(median_cost))

The mean insurance cost is $13270.42.
The median insurance cost is $9382.03.
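
As noted at the start of this part, the standard library's `statistics` module can do the same calculations. A minimal sketch, using a made-up `sample_costs` list in place of `new_cost`:

```python
import statistics

# Made-up costs; in the notebook this would be the `new_cost` list of floats.
sample_costs = [1200.0, 3400.0, 5600.0, 40000.0]

mean_cost = round(statistics.mean(sample_costs), 2)
# median() sorts internally and averages the two middle values for an even count.
median_cost = round(statistics.median(sample_costs), 2)

print("Mean: ${}  Median: ${}".format(mean_cost, median_cost))
```

A mean well above the median, as in the results above, usually indicates that a small number of very high costs are pulling the mean upwards.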


In [175]:
# Create a function that takes a column name (sex or smoker status) and its two options, then finds the mean cost for each option and compares them

def mean_diff(param, param1, param2):
    with open("insurance.csv") as data:
        data_set = csv.DictReader(data)
        param1_total = 0
        param2_total = 0

        param1_count = 0
        param2_count = 0

        for row in data_set:
            if row[param] == param1:
                param1_total += float(row['charges'])
                param1_count += 1
            elif row[param] == param2:
                param2_total += float(row['charges'])
                param2_count += 1

        param1_mean = round((param1_total / param1_count), 2)
        param2_mean = round((param2_total / param2_count), 2)
        mean_difference = round((param1_mean - param2_mean), 2)
                        
        print("{} comparison:\nThe mean insurance cost for {} is ${}.  The mean cost for {} is ${}.  The mean difference is ${}.".format(param.capitalize(), param1.upper(), param1_mean, param2.upper(), param2_mean, mean_difference))
                        


Now we can run this function and input gender and smoker status to compare the two means:

In [176]:
mean_diff_gender = mean_diff('sex', 'male', 'female')
mean_diff_smoker = mean_diff('smoker', 'yes', 'no')

Sex comparison:
The mean insurance cost for MALE is $13956.75.  The mean cost for FEMALE is $12569.58.  The mean difference is $1387.17.
Smoker comparison:
The mean insurance cost for YES is $32050.23.  The mean cost for NO is $8434.27.  The mean difference is $23615.96.


From this we can see that males pay slightly more than females on average, and smokers pay considerably more than non-smokers.

We can go further and modify this function to see which region pays the most for insurance on average:

In [171]:
# Modify the above function and add more parameters

def mean_by_region(param, param1, param2, param3, param4):
    with open("insurance.csv") as data:
        data_set = csv.DictReader(data)
        param1_total = 0
        param2_total = 0
        param3_total = 0
        param4_total = 0

        param1_count = 0
        param2_count = 0
        param3_count = 0
        param4_count = 0

        for row in data_set:
            if row[param] == param1:
                param1_total += float(row['charges'])
                param1_count += 1
            elif row[param] == param2:
                param2_total += float(row['charges'])
                param2_count += 1
            elif row[param] == param3:
                param3_total += float(row['charges'])
                param3_count += 1
            elif row[param] == param4:
                param4_total += float(row['charges'])
                param4_count += 1

        param1_mean = round((param1_total / param1_count), 2)
        param2_mean = round((param2_total / param2_count), 2)
        param3_mean = round((param3_total / param3_count), 2)
        param4_mean = round((param4_total / param4_count), 2)
                        
        print("Average insurance cost per region:")
        print("{}: ${}".format(param1.capitalize(), param1_mean))
        print("{}: ${}".format(param2.capitalize(), param2_mean))
        print("{}: ${}".format(param3.capitalize(), param3_mean))
        print("{}: ${}".format(param4.capitalize(), param4_mean))
     

By running this function we should get the following means per region:

In [172]:
mean_region = mean_by_region('region', 'southwest', 'southeast', 'northwest', 'northeast')

Average insurance cost per region:
Southwest: $12346.94
Southeast: $14735.41
Northwest: $12417.58
Northeast: $13406.38


From this we can see that the eastern side of the country pays slightly more for health insurance on average, with the southeast paying the most.
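
Both `mean_diff` and `mean_by_region` follow the same pattern, so they could be folded into one function that handles any number of categories. The sketch below is one way to do that, not the notebook's own code: `group_means` and `sample_rows` are hypothetical names, and the rows are made-up values shaped like `csv.DictReader` output.

```python
from collections import defaultdict

def group_means(rows, key, value="charges"):
    """Return {category: mean of `value`} for each distinct value of `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {category: round(totals[category] / counts[category], 2) for category in totals}

# Made-up rows; the real ones would come from csv.DictReader on insurance.csv.
sample_rows = [
    {"smoker": "yes", "region": "southeast", "charges": "30000"},
    {"smoker": "no", "region": "southeast", "charges": "8000"},
    {"smoker": "no", "region": "northwest", "charges": "6000"},
]

print(group_means(sample_rows, "smoker"))
print(group_means(sample_rows, "region"))
```

Returning a dictionary instead of printing inside the function also lets the results be reused, for example to sort the regions by mean cost.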

### Part 4 - Break down the Age and BMI groups and compare insurance cost

For this we will categorise the ages and BMI values into groups. (Please note that BMI is not a definitive statistic of health, as numerous factors are involved. The categories are set by the WHO and vary around the globe, but they are used to calculate health insurance in the U.S.)

First we can break down our dataset by age and compare the costs:

In [217]:
# First check what the min and max ages are
min_age = min([int(i) for i in age])
max_age = max([int(i) for i in age])
print(min_age)
print(max_age)

18
64


In [216]:
# Now we can break the set into age groups and calculate mean insurance by group
def mean_age_diff(age):
    with open("insurance.csv") as data:
        data_set = csv.DictReader(data)
        total_18_25 = 0
        total_26_35 = 0
        total_36_45 = 0
        total_46_55 = 0
        total_56_65 = 0

        count_18_25 = 0
        count_26_35 = 0
        count_36_45 = 0
        count_46_55 = 0
        count_56_65 = 0

        for row in data_set:
            if int(row[age]) >= 18 and int(row[age]) <= 25:
                total_18_25 += float(row['charges'])
                count_18_25 += 1
            elif int(row[age]) >= 26 and int(row[age]) <= 35:
                total_26_35 += float(row['charges'])
                count_26_35 += 1
            elif int(row[age]) >= 36 and int(row[age]) <= 45:
                total_36_45 += float(row['charges'])
                count_36_45 += 1
            elif int(row[age]) >= 46 and int(row[age]) <= 55:
                total_46_55 += float(row['charges'])
                count_46_55 += 1
            elif int(row[age]) >= 56 and int(row[age]) <= 65:
                total_56_65 += float(row['charges'])
                count_56_65 += 1

        mean_18_25 = round((total_18_25 / count_18_25), 2)
        mean_26_35 = round((total_26_35 / count_26_35), 2)
        mean_36_45 = round((total_36_45 / count_36_45), 2)
        mean_46_55 = round((total_46_55 / count_46_55), 2)
        mean_56_65 = round((total_56_65 / count_56_65), 2)
                                
        print("The mean insurance cost for the 18 to 25 group is ${}".format(mean_18_25))
        print("The mean insurance cost for the 26 to 35 group is ${}".format(mean_26_35))
        print("The mean insurance cost for the 36 to 45 group is ${}".format(mean_36_45))
        print("The mean insurance cost for the 46 to 55 group is ${}".format(mean_46_55))
        print("The mean insurance cost for the 56 to 65 group is ${}".format(mean_56_65))


In [215]:
mean_age = (mean_age_diff('age'))

The mean insurance cost for the 18 to 25 group is $9087.02
The mean insurance cost for the 26 to 35 group is $10495.16
The mean insurance cost for the 36 to 45 group is $13493.49
The mean insurance cost for the 46 to 55 group is $15986.9
The mean insurance cost for the 56 to 65 group is $18795.99


There is a gradual increase of insurance cost with age.
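
The age bands used above could also be kept in one list so that the boundaries are defined in a single place. A minimal sketch, where `AGE_BANDS` and `age_band` are hypothetical names:

```python
# The same five bands as in mean_age_diff, as inclusive (low, high) pairs.
AGE_BANDS = [(18, 25), (26, 35), (36, 45), (46, 55), (56, 65)]

def age_band(age):
    """Return a label such as '26-35', or None if the age is outside every band."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return "{}-{}".format(low, high)
    return None

print(age_band(18), age_band(35), age_band(64))
```

Changing the grouping then only means editing the list, not the chain of `if`/`elif` tests.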

Now we can do the same and divide by BMI group:

BMI can be worked out from height and weight using an adult BMI calculator or a BMI index chart. The adult BMI ranges are:

   - If your BMI is less than 18.5, it falls within the underweight range.
   - If your BMI is 18.5 to <25, it falls within the healthy weight range.
   - If your BMI is 25.0 to <30, it falls within the overweight range.
   - If your BMI is 30.0 or higher, it falls within the obesity range.
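
These ranges translate directly into a small helper. The sketch below is illustrative; `bmi_category` is a hypothetical name, and the labels follow the list above.

```python
def bmi_category(bmi):
    """Map a BMI value to one of the four ranges listed above."""
    if bmi < 18.5:
        return "underweight"
    elif bmi < 25:
        return "healthy weight"
    elif bmi < 30:
        return "overweight"
    else:
        return "obesity"

# Values on the boundaries fall into the higher range.
for value in (18.4, 18.5, 25.0, 30.0):
    print(value, bmi_category(value))
```

Because each test is only reached when the earlier ones have failed, the lower bound of each range does not need to be checked again.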

In [225]:
# Now we can break the set into bmi groups and calculate mean insurance by group
def mean_bmi_diff(bmi):
    with open("insurance.csv") as data:
        data_set = csv.DictReader(data)
        total_underweight = 0
        total_normal = 0
        total_overweight = 0
        total_obese = 0

        count_underweight = 0
        count_normal = 0
        count_overweight = 0
        count_obese = 0

        for row in data_set:
            if float(row[bmi]) < 18.5:
                total_underweight += float(row['charges'])
                count_underweight += 1
            elif float(row[bmi]) >= 18.5 and float(row[bmi]) < 25:
                total_normal += float(row['charges'])
                count_normal += 1
            elif float(row[bmi]) >= 25 and float(row[bmi]) < 30:
                total_overweight += float(row['charges'])
                count_overweight += 1
            elif float(row[bmi]) >= 30:
                total_obese += float(row['charges'])
                count_obese += 1
                
        mean_underweight = round((total_underweight / count_underweight), 2)
        mean_normal = round((total_normal / count_normal), 2)
        mean_overweight = round((total_overweight / count_overweight), 2)
        mean_obese = round((total_obese / count_obese), 2)
                                
        print("The mean insurance cost for the UNDERWEIGHT group is ${}".format(mean_underweight))
        print("The mean insurance cost for the NORMAL group is ${}".format(mean_normal))
        print("The mean insurance cost for the OVERWEIGHT group is ${}".format(mean_overweight))
        print("The mean insurance cost for the OBESE group is ${}".format(mean_obese))

In [226]:
mean_bmi = (mean_bmi_diff('bmi'))

The mean insurance cost for the UNDERWEIGHT group is $8852.2
The mean insurance cost for the NORMAL group is $10409.34
The mean insurance cost for the OVERWEIGHT group is $10987.51
The mean insurance cost for the OBESE group is $15552.34


Here we can see that the mean insurance cost also increases with BMI group.

Now for the final statistical breakdown we will assess by number of children:

In [229]:
# First determine the maximum number of children in the dataset
max_children = max([int(i) for i in num_of_children])
print(max_children)

5


In [246]:
# Break the set into num of children groups and calculate mean insurance by group
def mean_child_diff(child):
    with open("insurance.csv") as data:
        data_set = csv.DictReader(data)
        child_0 = 0
        child_1 = 0
        child_2 = 0
        child_3 = 0
        child_4 = 0
        child_5_plus = 0 

        count_0 = 0
        count_1 = 0
        count_2 = 0
        count_3 = 0
        count_4 = 0
        count_5_plus = 0

        for row in data_set:
            if int(row[child]) == 0:
                child_0 += float(row['charges'])
                count_0 += 1
            elif int(row[child]) == 1:
                child_1 += float(row['charges'])
                count_1 += 1
            elif int(row[child]) == 2:
                child_2 += float(row['charges'])
                count_2 += 1
            elif int(row[child]) == 3:
                child_3 += float(row['charges'])
                count_3 += 1
            elif int(row[child]) == 4:
                child_4 += float(row['charges'])
                count_4 += 1
            elif int(row[child]) >= 5:
                child_5_plus += float(row['charges'])
                count_5_plus += 1
                
        mean_0 = round((child_0 / count_0), 2)
        mean_1 = round((child_1 / count_1), 2)
        mean_2 = round((child_2 / count_2), 2)
        mean_3 = round((child_3 / count_3), 2)
        mean_4 = round((child_4 / count_4), 2)
        mean_5_plus = round((child_5_plus / count_5_plus), 2)
                                
        print("The mean insurance cost for zero children is ${}".format(mean_0))
        print("The mean insurance cost for one child is ${}".format(mean_1))
        print("The mean insurance cost for two children is ${}".format(mean_2))
        print("The mean insurance cost for three children is ${}".format(mean_3))
        print("The mean insurance cost for four children is ${}".format(mean_4))
        print("The mean insurance cost for five or more children is ${}".format(mean_5_plus))

In [247]:
mean_children = (mean_child_diff('children'))

The mean insurance cost for zero children is $12365.98
The mean insurance cost for one child is $12731.17
The mean insurance cost for two children is $15073.56
The mean insurance cost for three children is $15355.32
The mean insurance cost for four children is $13850.66
The mean insurance cost for five or more children is $8786.04


From the above we can see that the mean insurance cost rises with the number of children up to three, then falls for four children and falls further for five or more.

#### Explanation: 
The price should increase with more children if the formula for the insurance cost is:

$$\text{insurance\_cost} = 250 \times \text{age} - 128 \times \text{sex} + 370 \times \text{bmi} + 425 \times \text{num\_of\_children} + 24000 \times \text{smoker} - 12500$$

However, we can see that it does not. Possible explanations include people with four or more children having a lower BMI and/or being non-smokers. Further analysis and comparison in this area might explain the result.
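
To see what the formula predicts, it can be written as a small function. This is a sketch under stated assumptions: the name `estimated_cost` is hypothetical, and the formula needs numbers for `sex` and `smoker`, so the 0/1 encoding used here (1 for male, 0 for female; 1 for smoker, 0 for non-smoker) is an assumption that the notebook does not state.

```python
def estimated_cost(age, sex, bmi, num_of_children, smoker):
    """Evaluate the insurance cost formula above.

    Assumed encoding: sex is 1 for male and 0 for female,
    smoker is 1 for a smoker and 0 for a non-smoker.
    """
    return (250 * age - 128 * sex + 370 * bmi
            + 425 * num_of_children + 24000 * smoker - 12500)

# Made-up person: 40 years old, male, BMI 30, two children, non-smoker.
base = estimated_cost(40, 1, 30, 2, 0)
one_more_child = estimated_cost(40, 1, 30, 3, 0)
as_smoker = estimated_cost(40, 1, 30, 2, 1)

print(base, one_more_child - base, as_smoker - base)
```

With everything else held fixed, each extra child adds 425 to the estimate while smoking adds 24000, so a difference in smoker share between the groups could easily outweigh the effect of the number of children.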

## Conclusion

In this project we found that the dataset was evenly split by sex and by region, but contained far more non-smokers than smokers.

We have established that the western regions have a lower mean cost than the eastern regions. Further analysis would be worthwhile to see which factors in these regions contribute to the difference in cost.

Cost also rises with age group and BMI group, which is predictable. However, cost fell for people with four or more children. This anomalous result could be explored to see if there is any correlation between cost and number of children.


## Learning Outcomes

I have thoroughly enjoyed this project and it has helped me understand and practise Python fundamentals. I built the project using skills learnt on the Python fundamentals course with Codecademy, and used the Python documentation and web searches to further my knowledge and skills. In the process I have learnt how to use Markdown and how to work in the Jupyter Notebook environment.

JDWills