# U.S. Medical Insurance Costs

## Look over your dataset

Open **insurance.csv** and take a look at the file. Take note of how information is organized.<br> 
How will this affect how you analyze the data in Python?<br> 
Is there anything of particular interest to you in the dataset that you want to investigate?<br> 
Think about these things before you jump into Python.<br>

In [5]:
import csv

# print the whole CSV file
# with open("insurance.csv") as insurance:
#    print(insurance.read())

# print the first 10 rows and the header of the CSV file
with open("insurance.csv") as insurance:
    file_reader = csv.reader(insurance, delimiter = ",")
    count = 0
    for row in file_reader:
        print (row)
        if count > 9:
            break
        count += 1
    

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924']
['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523']
['28', 'male', '33', '3', 'no', 'southeast', '4449.462']
['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061']
['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552']
['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216']
['46', 'female', '33.44', '1', 'no', 'southeast', '8240.5896']
['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056']
['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107']
['60', 'female', '25.84', '0', 'no', 'northwest', '28923.13692']


The file has 7 columns: <br>
**age**: number (integer) <br>
**sex**: female, male <br>
**bmi**: number (float) <br>
**children**: number (integer) <br>
**smoker**: yes, no <br>
**region**: southwest, southeast, northwest, northeast <br>
**charges**: number (float) <br>

Each row represents one record.
Fortunately, there is no missing data.
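
The claim about missing data can be checked programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: the helper name `count_missing_values` and the `SAMPLE` string (a few rows copied from the preview above) are assumptions made for illustration.

```python
import csv
import io

# Assumption: a few rows copied from the preview above stand in for the real file.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def count_missing_values(csv_handle):
    """Return {column name: number of empty cells} for an open CSV file."""
    reader = csv.DictReader(csv_handle)
    missing = {name: 0 for name in reader.fieldnames}
    for record in reader:
        for name in reader.fieldnames:
            value = record[name]
            # DictReader fills short rows with None; blank cells come back as "".
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

sample_missing = count_missing_values(io.StringIO(SAMPLE))
print(sample_missing)
```

To run the same check on the real file, pass an open file handle instead, for example `open("insurance.csv", newline="")` inside a `with` block. A dictionary of all zeros would confirm that no cell is empty.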

## Scope your Project

__Basic Queries__<br>
* Find out the average age of the patients in the dataset.
* Analyze where a majority of the individuals are from.
* Look at the different costs between smokers vs. non-smokers.
* Figure out what the average age is for someone who has at least one child in this dataset.

__Data Manipulation__<br>

I will use the facilities that the csv library provides. <br>
I will use the **DictReader** class, which yields one Python dictionary per row, with the column names as keys and every value read as a string.<br> 
The structure of each dictionary is similar to the one we used in the "Medical Insurance Project", once the numeric fields are converted with `int()` or `float()`: <br>

`medical_record[i] = {"age": <number>, "sex": "male/female", "bmi": <number>, "children": <number>, "smoker": "yes/no", "region": "southwest/southeast/northwest/northeast", "charges": <number>}`<br>

I will define a function for each of the requested queries. <br>
Each function will receive the csv file name (and an extra parameter if needed) and will return the result, which is then printed outside the function.
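
To make the record structure concrete, here is a minimal sketch of reading rows with `csv.DictReader` and converting the numeric fields. The functions in this notebook convert values inline instead, so the `load_records` helper and the `SAMPLE` string (rows copied from the preview above) are assumptions made for illustration only.

```python
import csv
import io

# Assumption: two rows copied from the preview above stand in for the real file.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
"""

def load_records(csv_handle):
    """Read every row into a dictionary with numeric fields converted."""
    records = []
    for row in csv.DictReader(csv_handle):
        records.append({
            "age": int(row["age"]),            # read as a string, stored as int
            "sex": row["sex"],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": row["smoker"],
            "region": row["region"],
            "charges": float(row["charges"]),
        })
    return records

medical_records = load_records(io.StringIO(SAMPLE))
print(medical_records[0])
```

Loading the records once like this is an alternative to reopening the file in every query. The functions below keep the simpler approach of reading the file each time.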

## Functions to implement

### Average age of the patients in the dataset.

In [20]:
def average_age(csv_file):
    total_age = 0
    num_records = 0
    with open(csv_file) as insurance_records:
        medical_records = csv.DictReader(insurance_records)
        for record in medical_records:
            total_age += int(record["age"])
            num_records += 1
    return round(total_age / num_records)

# Testing function
print("The average age of the patients in the dataset is {average}".format(average = average_age("insurance.csv")))

The average age of the patients in the dataset is 39


### Where the majority of the individuals are from
Initially, I created the function to focus only on the region, but after reading the project extension task,<br>
I decided to make it generic, so that it returns the number of records for each value of any category. <br>
This makes it easier to identify whether the data is biased. 

In [53]:
def count_by_category(csv_file, category):
    categories = {} # this dictionary will provide the results
    with open(csv_file) as insurance_records:
        medical_records = csv.DictReader(insurance_records)
        for record in medical_records:
            key = record[category]
            if key in categories: # check whether this value is already in the dictionary
                categories[key] = categories[key] + 1 # increment the count for an existing value
            else:
                categories[key] = 1 # first time we see this value: start its count at 1
    return categories

def max_category(category): # this function returns the category value with the most registered customers, and its count
    name = ""
    total = 0
    for key,value in category.items():
        if total < value:
            total = value
            name = key 
    return name, total

# Testing function
regions = count_by_category("insurance.csv", "region")
print(regions)
region, total = max_category(regions)
print("The {0} region is the biggest with {1} members".format(region, total))
sexes = count_by_category("insurance.csv", "sex")
print(sexes)
sex, total = max_category(sexes)
print("There are more {0}s, with {1} members registered".format(sex, total))
smokers = count_by_category("insurance.csv", "smoker")
print(smokers)
smoke, total = max_category(smokers)
if smoke == "yes":
    print("There are more smokers, with {0} members registered".format(total))
else:
    print("There are more non-smokers, with {0} members registered".format(total))

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
The southeast region is the biggest with 364 members
{'female': 662, 'male': 676}
There are more males, with 676 members registered
{'yes': 274, 'no': 1064}
There are more non-smokers, with 1064 members registered
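
The same counting can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the approach used in this notebook: the name `count_by_category_counter` and the `SAMPLE` rows (copied from the preview above) are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Assumption: four rows copied from the preview above stand in for the real file.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
"""

def count_by_category_counter(csv_handle, category):
    """Count how many records fall under each value of the given column."""
    return Counter(record[category] for record in csv.DictReader(csv_handle))

region_counts = count_by_category_counter(io.StringIO(SAMPLE), "region")
# most_common(1) returns a list holding the single (value, count) pair with the highest count.
top_region, top_count = region_counts.most_common(1)[0]
print(region_counts)
print(top_region, top_count)
```

`Counter` handles the "first time seen" case internally, so the explicit `if key in categories` branch is not needed, and `most_common` plays the role of `max_category`.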


### Different Costs between Smokers and Non-Smokers
This function provides the average cost for each value of any categorical column in the file, so the differences between groups can be compared.

In [46]:
def cost_difference_by_category(csv_file, category):
    categories = {} # This dictionary will provide the average charges for the categories selected
    totals = {} # This dictionary will count the number of instances per categories
    with open(csv_file) as insurance_records:
        medical_records = csv.DictReader(insurance_records)
        for record in medical_records:
            key = record[category]
            if key in categories: # check whether this value is already in the dictionary
                categories[key] = categories[key] + float(record["charges"]) # add this record's charges to the running total
                totals[key] += 1
            else:
                categories[key] = float(record["charges"]) # We create a new category in our dictionary
                totals[key] = 1
    # print (categories)
    # print (totals)
    for key, value in totals.items(): # We will get the average value of each category
        categories[key] = categories[key] / value
    return categories

# Test function
print(cost_difference_by_category("insurance.csv", "smoker"))
print(cost_difference_by_category("insurance.csv", "sex"))
print(cost_difference_by_category("insurance.csv", "region"))
print(cost_difference_by_category("insurance.csv", "children"))
print(sorted(cost_difference_by_category("insurance.csv", "age").items(), key=lambda x: x[0]))

{'yes': 32050.23183153285, 'no': 8434.268297856199}
{'female': 12569.57884383534, 'male': 13956.751177721886}
{'southwest': 12346.93737729231, 'southeast': 14735.411437609895, 'northwest': 12417.575373969228, 'northeast': 13406.3845163858}
{'0': 12365.975601635882, '1': 12731.171831635793, '3': 15355.31836681528, '2': 15073.563733958328, '5': 8786.035247222222, '4': 13850.656311199999}
[('18', 7086.2175563623205), ('19', 9747.909334558823), ('20', 10159.697736206897), ('21', 4730.464329642857), ('22', 10012.932801785715), ('23', 12419.820039642855), ('24', 10648.015962142857), ('25', 9838.365310714285), ('26', 6133.825308571429), ('27', 12184.701721428573), ('28', 9069.187564285712), ('29', 10430.158727037038), ('30', 12719.110358148146), ('31', 10196.980573333332), ('32', 9220.300290769232), ('33', 12351.53298730769), ('34', 11613.52812076923), ('35', 11307.182031200002), ('36', 12204.476138), ('37', 18019.9118772), ('38', 8102.733674), ('39', 11778.2429452), ('40', 11772.25131), ('41

### Average age for someone who has at least one child

In [39]:
def avg_age_with_children(csv_file):
    total_analyzed = 0
    age = 0
    with open(csv_file) as insurance_records:
        medical_records = csv.DictReader(insurance_records)
        for record in medical_records:
            if int(record["children"]) >= 1:
                age += int(record["age"])
                total_analyzed += 1
    return round(age / total_analyzed)

#Test function
print ("The average age of the patients with at least one child is {average}".format(average = avg_age_with_children("insurance.csv")))

The average age of the patients with at least one child is 40


### Additional Comments

Once I completed the functions and ran some tests, I found the following:
* There are almost 4 times as many non-smokers as smokers in the dataset (1064 vs. 274). The average cost shows a similar ratio in the other direction: smokers are charged about 3.8 times as much as non-smokers on average.<br>
* Males and females are represented in similar numbers, and their average costs are similar too. More analysis is needed to identify the features that drive cost increases, for example by combining sex and region.<br>
* There also seems to be a bias in the data when analyzing how the number of children field impacts costs. Average costs rise from 0 to 3 children but then drop for 4 and 5 children, which seems odd. Looking deeper into the amount of information in the dataset, you notice that there are fewer records of people with 4 and 5 children than with 0-3, so those averages are less reliable.
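
One way to follow up on these comments is to group by more than one column and report the record count next to each average, so that small groups stand out. The following is a minimal sketch under stated assumptions: `average_charges_by` is a hypothetical helper, not part of the functions above, and `SAMPLE` holds a few rows copied from the preview at the top of the notebook.

```python
import csv
import io
from collections import defaultdict

# Assumption: four rows copied from the preview above stand in for the real file.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
"""

def average_charges_by(csv_handle, columns):
    """Return {(value, ...): (average charges, record count)} per combination of columns."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in csv.DictReader(csv_handle):
        key = tuple(record[column] for column in columns)
        sums[key] += float(record["charges"])
        counts[key] += 1
    # Keep the count next to the average so thin groups are easy to spot.
    return {key: (sums[key] / counts[key], counts[key]) for key in sums}

by_sex_region = average_charges_by(io.StringIO(SAMPLE), ["sex", "region"])
for key, (average, count) in sorted(by_sex_region.items()):
    print(key, round(average, 2), count)
```

Passing `["children"]` as the column list on the real file would show how many records sit behind each average, which helps judge how much weight the 4 and 5 children groups deserve.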