## Guided Project: Explore U.S. Births

The dataset contains the following columns:

+ year: Year (1994 to 2003).
+ month: Month (1 to 12).
+ date_of_month: Day number of the month (1 to 31).
+ day_of_week: Day of week (1 to 7).
+ births: Number of births that day.

Create `read_csv` to read in a CSV file and:
+ remove header row
+ convert string fields to ints
+ return list of lists where each list is a row

In [7]:
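def read_csv(file):
    # Read the whole file; the context manager closes it afterwards
    with open(file) as f:
        data = f.read()
    string_list = data.split("\n")
    final_list = []
    # Skip the header row
    for row in string_list[1:]:
        # A trailing newline leaves an empty string that int() cannot parse
        if not row:
            continue
        string_fields = row.split(",")
        int_fields = []
        for field in string_fields:
            int_fields.append(int(field))
        final_list.append(int_fields)
    return final_list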

cdc_list  = read_csv("US_births_1994-2003_CDC_NCHS.csv")
cdc_list[:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
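
As a point of comparison, the same parsing could be sketched with Python's built-in `csv` module, which handles the line splitting for us. The name `read_csv_rows` is hypothetical and not part of the project; this is a minimal sketch that assumes every field in the file is an integer.

```python
import csv

def read_csv_rows(path):
    """Read a CSV file, skip the header, and return a list of int rows."""
    rows = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (if the file is not empty)
        for fields in reader:
            if not fields:  # csv.reader yields [] for blank lines
                continue
            rows.append([int(field) for field in fields])
    return rows
```

Calling `read_csv_rows("US_births_1994-2003_CDC_NCHS.csv")` should give the same list of lists as `read_csv`, provided the file has the layout described above.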

Create `month_births` to return a dictionary of the total births in each month across the dataset. The 1-12 notation used in the dataset is less clear than abbreviated month names, so we will key the dictionary by those names instead.

In [21]:
def month_births(birthData):
    month_dict = {}
    births_per_month = {}
    #months are 1-12, so we need a filler string
    #at the 0 index of our list
    month_list = ["", "Jan", "Feb", "Mar", "Apr", "May", 
                  "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    for num in range(1,13):
        month_dict[num] = month_list[num]
    for row in birthData:
        month = month_dict[int(row[1])]
        births = int(row[4])
        if month not in births_per_month:
            births_per_month[month] = births
        else:
            births_per_month[month] += births
    return births_per_month

cdc_month_births = month_births(cdc_list)
cdc_month_births

{'Apr': 3185314,
 'Aug': 3525858,
 'Dec': 3301860,
 'Feb': 3018140,
 'Jan': 3232517,
 'Jul': 3498783,
 'Jun': 3296530,
 'Mar': 3322069,
 'May': 3350907,
 'Nov': 3171647,
 'Oct': 3378814,
 'Sep': 3439698}

Create `dow_births` to return a dictionary of the total number of births on each day of the week. The 1-7 notation used in the dataset is less clear than abbreviated day names, so we will key the dictionary by those names instead.

In [22]:
def dow_births(birthData):
    day_dict = {}
    births_per_dow = {}
    #days are 1-7, so we need a filler string
    #at the 0 index of our list
    day_list = ["","Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    for num in range(1,8):
        day_dict[num] = day_list[num]
    for row in birthData:
        day = day_dict[int(row[3])]
        births = int(row[4])
        # Check the results dictionary, not the number-to-name lookup
        if day not in births_per_dow:
            births_per_dow[day] = births
        else:
            births_per_dow[day] += births
    return births_per_dow

cdc_day_births = dow_births(cdc_list)
cdc_day_births

{'Fri': 6233657,
 'Mon': 5789166,
 'Sat': 4562111,
 'Sun': 4079723,
 'Thu': 6288429,
 'Tue': 6446196,
 'Wed': 6322855}

In [11]:
def month_dict():
    month_dict = {}
    month_list = ["", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    for num in range(1,13):
        month_dict[num] = month_list[num]
    return month_dict
def day_dict():
    day_dict = {}
    day_list = ["","Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    for num in range(1,8):
        day_dict[num] = day_list[num]
    return day_dict
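
The two lookup tables built in this cell could also be derived from the standard library's `calendar` module. This is only a sketch: `month_names` and `day_names` are hypothetical names, and the abbreviations depend on the active locale (English names are assumed here). Note that `calendar.day_abbr` is indexed from 0 with Monday first, while the dataset numbers days from 1.

```python
import calendar

# calendar.month_abbr has an empty string at index 0, so months 1-12 map directly
month_names = {num: calendar.month_abbr[num] for num in range(1, 13)}

# calendar.day_abbr starts at 0 (Monday), so shift the dataset's 1-7 by one
day_names = {num: calendar.day_abbr[num - 1] for num in range(1, 8)}
```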

Instead of writing a nearly identical function for each column, we want one function that can return a dictionary of counts for any column in our dataset. I built the `month_dict` and `day_dict` functions above to preserve the formatting used previously.

In [26]:
def calc_counts(birthData, col):
    counts_dict = {}
    if col == 1:
        monthDict = month_dict()
        for row in birthData:
            month = monthDict[row[col]]
            births = row[4]
            if month not in counts_dict:
                counts_dict[month] = births
            else:
                counts_dict[month] += births
    elif col == 3:
        dowDict = day_dict()
        for row in birthData:
            day = dowDict[row[col]]
            births = row[4]
            if day not in counts_dict:
                counts_dict[day] = births
            else:
                counts_dict[day] += births
    else:
        for row in birthData:
            key = row[col]
            births = int(row[4])
            if key not in counts_dict:
                counts_dict[key] = births
            else:
                counts_dict[key] += births
                
    return counts_dict

cdc_year_births = calc_counts(cdc_list, 0)
cdc_month_births = calc_counts(cdc_list, 1)
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dow_births = calc_counts(cdc_list, 3)                

In [27]:
cdc_year_births 

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

In [28]:
cdc_month_births

{'Apr': 3185314,
 'Aug': 3525858,
 'Dec': 3301860,
 'Feb': 3018140,
 'Jan': 3232517,
 'Jul': 3498783,
 'Jun': 3296530,
 'Mar': 3322069,
 'May': 3350907,
 'Nov': 3171647,
 'Oct': 3378814,
 'Sep': 3439698}

In [29]:
cdc_dom_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

In [30]:
cdc_dow_births

{'Fri': 6233657,
 'Mon': 5789166,
 'Sat': 4562111,
 'Sun': 4079723,
 'Thu': 6288429,
 'Tue': 6446196,
 'Wed': 6322855}
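
The three branches of `calc_counts` differ only in how the key is labeled, so one possible refactoring passes the label lookup in as an optional argument. The function name `calc_counts_labeled`, its `labels` parameter, and the `sample_rows` list (four rows taken from the output shown earlier) are assumptions made for this sketch, not part of the project.

```python
def calc_counts_labeled(rows, col, labels=None):
    """Sum births (column 4) grouped by the value found in column `col`.

    `labels` is an optional mapping from raw column values to display names.
    """
    counts = {}
    for row in rows:
        key = row[col]
        if labels is not None:
            key = labels[key]
        # dict.get supplies 0 the first time a key is seen
        counts[key] = counts.get(key, 0) + row[4]
    return counts

# Four rows from January 1994, used only for illustration
sample_rows = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 3, 1, 10142],
    [1994, 1, 8, 6, 8653],
]
day_labels = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu", 5: "Fri", 6: "Sat", 7: "Sun"}

print(calc_counts_labeled(sample_rows, 3, day_labels))
```

With this shape, supporting a new labeled column means passing a different mapping rather than adding another branch.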

Write a function that can calculate the min and max values for any dictionary that's passed in.

In [31]:
def dictMax_Min(dictionary):
    # Start from None so that negative or very large values are handled too
    maxVal = None
    minVal = None
    maxKey = ""
    minKey = ""
    for key in dictionary:
        if maxVal is None or dictionary[key] > maxVal:
            maxVal = dictionary[key]
            maxKey = key
        if minVal is None or dictionary[key] < minVal:
            minVal = dictionary[key]
            minKey = key
    return [(minKey, minVal), (maxKey, maxVal)]

dictMax_Min(cdc_year_births)        

[(1997, 3880894), (2003, 4089950)]
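
For comparison, the same result can be reached with the built-in `min` and `max` functions by passing the dictionary's `get` method as the `key` argument. The name `dict_min_max` is hypothetical, and `year_totals` simply repeats the yearly totals shown above; the sketch assumes a non-empty dictionary.

```python
def dict_min_max(d):
    """Return [(min_key, min_value), (max_key, max_value)] for a non-empty dict."""
    min_key = min(d, key=d.get)  # key whose value is smallest
    max_key = max(d, key=d.get)  # key whose value is largest
    return [(min_key, d[min_key]), (max_key, d[max_key])]

year_totals = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894,
    1998: 3941553, 1999: 3959417, 2000: 4058814, 2001: 4025933,
    2002: 4021726, 2003: 4089950,
}

print(dict_min_max(year_totals))  # [(1997, 3880894), (2003, 4089950)]
```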

Write a function that extracts the same values across years and calculates the differences between consecutive values to show whether the number of births is increasing or decreasing.

+ For example, how did the number of births on Saturday change each year between 1994 and 2003?

In [36]:
from collections import defaultdict

In [96]:
def create_yearly_dict(birthData, col):
    '''
    Creates a dictionary of yearly values for each
    time type from the indicated column
    i.e col=1 results in the number of births in each month
    grouped by year
    '''
    year_dict = defaultdict(dict)
    for row in birthData:
        year = row[0]
        key = row[col]
        births = row[4]
        # defaultdict(dict) creates the inner dict the first time a year is seen
        if key in year_dict[year]:
            year_dict[year][key] += births
        else:
            year_dict[year][key] = births
    return year_dict

def yearly_change(birthData, col, interest):
    '''
    Finds the change from year to year of the indicated interest time
    i.e. if col=1 and interest=3, this returns the difference in total
    births for March of each year
    '''
    year_dict = create_yearly_dict(birthData, col)
    years = sorted(year_dict.keys())
    changes = defaultdict(int)
    # Start at the second year; the first has no previous year to compare with
    for i in range(1, len(years)):
        # Look up the requested value (interest) instead of a hard-coded key
        change = year_dict[years[i]][interest] - year_dict[years[i-1]][interest]
        changes[years[i]] = change
    return changes

In [108]:
# col=1 selects the month column; interest=1 selects January
jan_births_change = yearly_change(cdc_list, 1, 1)

In [109]:
jan_births_change

defaultdict(int,
            {1995: -4692,
             1996: -1730,
             1997: 2928,
             1998: 2129,
             1999: -158,
             2000: 10926,
             2001: 5090,
             2002: -4524,
             2003: -871})
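
The differencing step on its own can be sketched as a small helper that pairs each year with the one before it. The name `consecutive_changes` is hypothetical, and the input below reuses the first three yearly totals from the `cdc_year_births` output rather than any new data.

```python
def consecutive_changes(totals_by_year):
    """Map each year (except the first) to its change from the previous year."""
    years = sorted(totals_by_year)
    # zip pairs every year with its successor: (1994, 1995), (1995, 1996), ...
    return {
        later: totals_by_year[later] - totals_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    }

totals = {1994: 3952767, 1995: 3899589, 1996: 3891494}
print(consecutive_changes(totals))  # {1995: -53178, 1996: -8095}
```

A negative value means fewer births than in the previous year, and a positive value means more.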