# Guided Project: Explore U.S. Births

The solution provided by Dataquest.io can be found [here](https://github.com/dataquestio/solutions/blob/master/Mission9Solutions.ipynb)

In this guided project, the goal is to practice the basics of Python by analyzing births in the U.S. The project covers:
* How to convert data into a list of lists
* How to calculate summary statistics using dictionaries
* How to create a more general function for calculating summary statistics

----------------------------------------------------------------------------------------------------------------------
## Introduction to the Dataset
The data set used in this notebook comes from the Centers for Disease Control and Prevention's National Center for Health Statistics.

One way to read the file is through a series of commands like this:
<blockquote>
    <p>f = open('US_births_1994-2003_CDC_NCHS.csv')</p>
    <p>births = f.read()</p>
    <p>birth_data = births.split('\n')</p>
    <p>print(birth_data[0:10])</p>
</blockquote>
**BUT** a more 'programmatic' way to do this is...

In [None]:
filename = "US_births_1994-2003_CDC_NCHS.csv"
birth_data = open(filename).read().split("\n")
print(birth_data[0:10])

Looking at the output we see that the data set has the following structure:
* year - Year (1994 to 2003).
* month - Month (1 to 12).
* date_of_month - Day number of the month (1 to 31).
* day_of_week - Day of week (1 to 7).
* births - Number of births that day.

*The ranges can be found by looking at the set for each field, OR by looking at the information provided in Dataquest.*
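
The first option, looking at the set of values in each field, can be sketched as follows. This is a minimal illustration on a few made-up rows; `sample_rows` and `column_ranges` are hypothetical names that are not part of the Dataquest project, and the birth counts are invented. With the real file, the same function could be applied to the parsed rows.

```python
# Made-up rows in the same column order as the CDC file:
# year, month, date_of_month, day_of_week, births
sample_rows = [
    [1994, 1, 1, 6, 8000],
    [1994, 1, 2, 7, 7500],
    [1994, 2, 1, 2, 10000],
]

def column_ranges(rows, column_names):
    """Return {column name: (smallest value, largest value)}."""
    ranges = {}
    for index, name in enumerate(column_names):
        values = {row[index] for row in rows}  # unique values in this column
        ranges[name] = (min(values), max(values))
    return ranges

names = ["year", "month", "date_of_month", "day_of_week", "births"]
sample_ranges = column_ranges(sample_rows, names)
print(sample_ranges)
```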

## Converting Data Into A List Of Lists

Personally, I like to set my functions apart from where I use them. It helps me stay organized.

In [None]:
def read_csv(filename):
    # Read the whole file; the with block closes it automatically.
    with open(filename) as f:
        string_data = f.read()

    # Drop the header row.
    string_list = string_data.split("\n")[1:]
    final_list = []

    for row in string_list:
        # Skip blank lines, such as the one left by a trailing newline.
        if not row:
            continue

        int_fields = []
        for value in row.split(","):
            int_fields.append(int(value))
        final_list.append(int_fields)

    return final_list

The project asks the student to read in the file twice, the second time using a function that can now be copied and pasted for reuse.

In [None]:
file_to_read = "US_births_1994-2003_CDC_NCHS.csv"
cdc_list = read_csv(file_to_read)

In [None]:
cdc_list[0:10]
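
As an alternative sketch, the standard library's `csv` module can handle the line and field splitting. The function name `read_csv_with_module` is hypothetical, and the demo writes a tiny made-up file to a temporary directory rather than reading the real CDC file.

```python
import csv
import os
import tempfile

def read_csv_with_module(filename):
    """Read a CSV of integers, skipping the header row and blank lines."""
    rows = []
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for fields in reader:
            if fields:  # csv.reader yields [] for blank lines
                rows.append([int(value) for value in fields])
    return rows

# Demo on a tiny made-up file (the birth counts are invented).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "w", newline="") as f:
        f.write("year,month,date_of_month,day_of_week,births\n")
        f.write("1994,1,1,6,8000\n")
        f.write("1994,1,2,7,7500\n")
    sample = read_csv_with_module(path)

print(sample)
```

The hand-written `read_csv` is still the better exercise for practicing string methods; this version only shows what the same task looks like with a library doing the parsing.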

## Calculating Number Of Births Each Month

In [None]:
def month_births(list_of_lists):
    births_per_month = {}
    
    for each_list in list_of_lists:
        month_value = each_list[1]
        birth_value = int(each_list[4])
        
        if month_value in births_per_month:
            births_per_month[month_value] = births_per_month[month_value] + birth_value
        else:
            births_per_month[month_value] = birth_value
            
    return births_per_month

In [None]:
cdc_month_births = month_births(cdc_list)
cdc_month_births

## Calculating Number Of Births Each Day Of Week

In [None]:
def dow_births(list_of_lists):
    day_of_week = {}
    
    for each_list in list_of_lists:
        day_value = each_list[3]
        birth_value = int(each_list[4])
        
        if day_value in day_of_week:
            day_of_week[day_value] = day_of_week[day_value] + birth_value
        else:
            day_of_week[day_value] = birth_value
    
    return day_of_week

In [None]:
cdc_day_births = dow_births(cdc_list)
cdc_day_births

For viewing, I don't like that the dictionary begins with 6, so I sorted the keys before printing:

In [None]:
keylist = sorted(cdc_day_births)
for key in keylist:
    print(key,':', cdc_day_births[key])

## Creating A More General Function

In [None]:
def calc_counts(data, column):
    dict_of_data = {}
    
    for each_list in data:
        column_value = each_list[column]
        count_value = int(each_list[4])
        
        if column_value in dict_of_data:
            dict_of_data[column_value] = dict_of_data[column_value] + count_value
        else:
            dict_of_data[column_value] = count_value
            
    return dict_of_data
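
The same accumulation can be written more compactly with `dict.get`, which returns a default when the key is missing. The sketch below is one possible variant, not the project's required solution: the name `calc_counts_get` and the `value_column` parameter are assumptions added for illustration, and the rows are made up.

```python
def calc_counts_get(data, column, value_column=4):
    """Sum the values in value_column, grouped by the values in column."""
    totals = {}
    for row in data:
        key = row[column]
        # get() returns 0 the first time a key is seen.
        totals[key] = totals.get(key, 0) + row[value_column]
    return totals

# Made-up rows: year, month, date_of_month, day_of_week, births
sample_rows = [
    [1994, 1, 1, 6, 100],
    [1994, 1, 2, 7, 90],
    [1994, 2, 1, 2, 120],
]
sample_month_totals = calc_counts_get(sample_rows, 1)
print(sample_month_totals)
```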

In [None]:
cdc_year_births = calc_counts(cdc_list, 0)
cdc_month_births = calc_counts(cdc_list, 1)
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dow_births = calc_counts(cdc_list, 3)

In [None]:
cdc_year_births

In [None]:
cdc_month_births

In [None]:
cdc_dom_births

In [None]:
cdc_dow_births

----------------------------------------------------------------------------------------------------------------------
## Bonus Rounds

Write a function that can calculate the min and max values for any dictionary that's passed in.

In [None]:
def min_max_values(dictionary):
    key_max_value = max(dictionary.keys(), key=(lambda k: dictionary[k]))
    key_min_value = min(dictionary.keys(), key=(lambda k: dictionary[k]))

    max_value = dictionary[key_max_value]
    min_value = dictionary[key_min_value]
    print("The dictionary minimum is:", min_value, "\nThe dictionary maximum is:", max_value)
    return

In [None]:
input_dict = cdc_year_births # <-- Enter your dictionary here.
min_max_values(input_dict)
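
If the results are needed later in the notebook, a variant that returns them instead of printing may be more convenient. This is a sketch under that assumption; `min_max_items` is a hypothetical name and the dictionary is made up.

```python
def min_max_items(dictionary):
    """Return ((min key, min value), (max key, max value))."""
    # dictionary.get is used as the sort key, so keys are compared by value.
    min_key = min(dictionary, key=dictionary.get)
    max_key = max(dictionary, key=dictionary.get)
    return (min_key, dictionary[min_key]), (max_key, dictionary[max_key])

sample_totals = {1994: 300, 1995: 250, 1996: 400}  # made-up totals
smallest, largest = min_max_items(sample_totals)
print("Minimum:", smallest, "Maximum:", largest)
```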

Write a function that extracts the same values across years and calculates the differences between consecutive values to show if number of births is increasing or decreasing. For example, how did the number of births on Saturday change each year between 1994 and 2003?

In [None]:
def year_over_year(births_by_year):
    previous_year_birth = None

    # Sort by year so each comparison is against the preceding year.
    for year, total_births in sorted(births_by_year.items()):
        current_year_birth = int(total_births)

        if previous_year_birth is None:
            print(year, "is the first year.")
        elif current_year_birth > previous_year_birth:
            print("Births increased in the year", year, "by", current_year_birth - previous_year_birth)
        elif current_year_birth < previous_year_birth:
            print("Births decreased in the year", year, "by", previous_year_birth - current_year_birth)
        else:
            print("Births remained the same in the year", year)

        previous_year_birth = current_year_birth

In [None]:
year_over_year(cdc_year_births)
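
The Saturday example from the prompt needs one extra step: filtering the rows to a single day of the week before totaling by year. A minimal sketch follows, assuming that `day_of_week` value 6 means Saturday (1 = Monday through 7 = Sunday). The helper names `yearly_totals_for` and `consecutive_differences` are hypothetical, and the rows are made up.

```python
def yearly_totals_for(data, column, value):
    """Total births per year for rows where row[column] == value."""
    totals = {}
    for row in data:
        if row[column] == value:
            totals[row[0]] = totals.get(row[0], 0) + row[4]
    return totals

def consecutive_differences(totals_by_year):
    """Return {year: total for that year minus total for the previous year}."""
    years = sorted(totals_by_year)
    diffs = {}
    for previous, current in zip(years, years[1:]):
        diffs[current] = totals_by_year[current] - totals_by_year[previous]
    return diffs

# Made-up rows: year, month, date_of_month, day_of_week, births
sample_rows = [
    [1994, 1, 1, 6, 100],
    [1994, 1, 2, 7, 90],
    [1995, 1, 7, 6, 120],
    [1996, 1, 6, 6, 110],
]
saturday_totals = yearly_totals_for(sample_rows, 3, 6)  # assumes 6 = Saturday
saturday_changes = consecutive_differences(saturday_totals)
print(saturday_changes)
```

A positive difference means births on that day rose compared with the previous year; a negative one means they fell.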

Find a way to combine the CDC data with the SSA data, which you can find [here](https://github.com/fivethirtyeight/data/tree/master/births). Specifically, brainstorm ways to deal with the overlapping time periods in the datasets.

In [None]:
file_to_read_two = "US_births_2000-2014_SSA.csv"
cdc_list_two = read_csv(file_to_read_two)

In [None]:
cdc_list_two[0:10]

In [None]:
cdc_two_year_births = calc_counts(cdc_list_two, 0)

In [None]:
cdc_two_year_births

In [None]:
cdc_year_births

The first thing I notice is that the overlapping years, 2000-2003, have different yearly birth totals in the two datasets. Given this, the first step is to investigate the data sources to see whether a difference in methodology explains the discrepancy.
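
One way to quantify the mismatch is to subtract the yearly totals for the years that appear in both dictionaries. The sketch below uses made-up totals and a hypothetical helper, `compare_overlap`; in the notebook, the two yearly-total dictionaries computed above could be passed in instead.

```python
def compare_overlap(first_totals, second_totals):
    """Return {year: second - first} for years present in both dictionaries."""
    shared_years = sorted(set(first_totals) & set(second_totals))
    return {year: second_totals[year] - first_totals[year] for year in shared_years}

# Made-up yearly totals standing in for the CDC and SSA results.
cdc_sample = {1999: 1000, 2000: 1100, 2001: 1200}
ssa_sample = {2000: 1110, 2001: 1215, 2002: 1300}

overlap_diff = compare_overlap(cdc_sample, ssa_sample)
print(overlap_diff)
```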

I'm going to assume the list with more recent values should supersede the other list.

In [None]:
joined_cdc_list = []

for row in cdc_list:
    if row[0] < 2000:
        joined_cdc_list.append(row)
for row in cdc_list_two:
    joined_cdc_list.append(row)

In [None]:
join_check = calc_counts(joined_cdc_list, 0)
join_check