# Analysis of U.S. births data

Project completed as part of the Dataquest.io curriculum.

We are working with the dataset compiled by FiveThirtyEight, which can be found [here](https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv).

Each row represents one calendar day. This is the order of the columns:

- `year`: Year (1994 to 2003).
- `month`: Month (1 to 12).
- `date_of_month`: Day number of the month (1 to 31).
- `day_of_week`: Day of week (1 to 7: 1 represents Monday, 7 represents Sunday).
- `births`: Number of births for that specific day.
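
The `day_of_week` encoding described above appears to follow the ISO convention (Monday is 1, Sunday is 7), which is also what Python's `datetime.date.isoweekday()` returns. The sketch below checks that assumption against two rows that appear in the CDC output later in this notebook; the helper name `iso_day_of_week` is hypothetical and not part of the original project.

```python
from datetime import date

def iso_day_of_week(year, month, day):
    """Return the ISO weekday (1 = Monday ... 7 = Sunday) for a calendar date."""
    return date(year, month, day).isoweekday()

# The CDC rows [1994, 1, 1, 6, ...] and [1994, 1, 3, 1, ...] should agree with this.
print(iso_day_of_week(1994, 1, 1))  # 6, a Saturday
print(iso_day_of_week(1994, 1, 3))  # 1, a Monday
```

If the two conventions agree for every row, the `day_of_week` column could be recomputed from the first three columns instead of being trusted as given.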

In [203]:
def read_csv(unsplit_string):
    """Convert raw CSV text into a list of lists of ints, dropping the header."""
    rows = unsplit_string.split('\n')[1:]  # skip the header row

    final_list = []
    for row in rows:
        if not row.strip():
            continue  # ignore blank lines, such as one left by a trailing newline
        final_list.append([int(item) for item in row.split(',')])
    return final_list

In [204]:
# The with block closes the file automatically once it has been read
with open("US_births_1994-2003_CDC_NCHS.csv", 'r') as f:
    raw_string = f.read()
cdc_list = read_csv(raw_string)

cdc_list[0:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
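
As an alternative to splitting the text by hand, the same parsing step can be sketched with the standard-library `csv` module. This is only a sketch under the assumption that every field after the header is an integer; the function name `parse_with_csv_module` and the two-row sample string are illustrative, not part of the original notebook.

```python
import csv
import io

def parse_with_csv_module(text):
    """Parse CSV text into a list of lists of ints, skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # discard the header
    # "if row" skips empty rows, which csv.reader yields for blank lines
    return [[int(field) for field in row] for row in reader if row]

sample = (
    "year,month,date_of_month,day_of_week,births\n"
    "1994,1,1,6,8096\n"
    "1994,1,2,7,7772\n"
)
rows = parse_with_csv_module(sample)
print(rows)  # [[1994, 1, 1, 6, 8096], [1994, 1, 2, 7, 7772]]
```

The `csv` module also copes with quoted fields, which a plain `split(',')` does not, although this particular data set does not seem to need that.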

We have now opened the CSV file and converted it into a list of lists of integers. Next, we calculate the total number of births for each month across all years (one total for every January, one for every February, and so on).

In [205]:
def month_births(lst_of_lsts):
    births_per_month = {}
    for row in lst_of_lsts:
        month = row[1]
        births = row[4]
        if month in births_per_month:
            births_per_month[month] += births
        else:
            births_per_month[month] = births
    
    return births_per_month

In [206]:
cdc_month_births = month_births(cdc_list)
print(cdc_month_births)

{1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314, 5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858, 9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860}


Next, we calculate the total number of births for each day of the week across all years (one total for every Monday, one for every Tuesday, and so on).

In [207]:
def dow_births(lst_of_lsts):
    births_per_dow = {}
    for row in lst_of_lsts:
        day_of_week = row[3]
        births = row[4]
        if day_of_week in births_per_dow:
            births_per_dow[day_of_week] += births
        else:
            births_per_dow[day_of_week] = births
            
    return births_per_dow

In [208]:
cdc_day_births = dow_births(cdc_list)
print(cdc_day_births)

{1: 5789166, 2: 6446196, 3: 6322855, 4: 6288429, 5: 6233657, 6: 4562111, 7: 4079723}


The function below generalizes the two previous ones: it calculates birth totals grouped by any column. We use it to recompute the month and day-of-week totals and to add totals for the `year` and `date_of_month` columns.

In [209]:
def calc_counts(data, column):
    totals = {}
    for row in data:
        keyword = row[column]
        births = row[4]
        if keyword in totals:
            totals[keyword] += births
        else:
            totals[keyword] = births
            
    return totals

In [210]:
cdc_year_births = calc_counts(data=cdc_list, column=0)
cdc_month_births = calc_counts(data=cdc_list, column=1)
cdc_dom_births = calc_counts(data=cdc_list, column=2)
cdc_dow_births = calc_counts(data=cdc_list, column=3)

In [211]:
print ("cdc_year_births: ")
print (cdc_year_births)
print("\n")

print ("cdc_month_births: ")
print (cdc_month_births)
print("\n")

print ("cdc_dom_births: ")
print (cdc_dom_births)
print("\n")

print ("cdc_dow_births: ")
print (cdc_dow_births)

cdc_year_births: 
{2000: 4058814, 2001: 4025933, 2002: 4021726, 2003: 4089950, 1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894, 1998: 3941553, 1999: 3959417}


cdc_month_births: 
{1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314, 5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858, 9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860}


cdc_dom_births: 
{1: 1276557, 2: 1288739, 3: 1304499, 4: 1288154, 5: 1299953, 6: 1304474, 7: 1310459, 8: 1312297, 9: 1303292, 10: 1320764, 11: 1314361, 12: 1318437, 13: 1277684, 14: 1320153, 15: 1319171, 16: 1315192, 17: 1324953, 18: 1326855, 19: 1318727, 20: 1324821, 21: 1322897, 22: 1317381, 23: 1293290, 24: 1288083, 25: 1272116, 26: 1284796, 27: 1294395, 28: 1307685, 29: 1223161, 30: 1202095, 31: 746696}


cdc_dow_births: 
{1: 5789166, 2: 6446196, 3: 6322855, 4: 6288429, 5: 6233657, 6: 4562111, 7: 4079723}
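
The grouping pattern used by `calc_counts` can also be sketched with `collections.defaultdict`, which removes the explicit "is the key already there?" check. The name `calc_counts_defaultdict` and the `value_column` parameter are assumptions added for illustration; the three sample rows are taken from the CDC output shown earlier.

```python
from collections import defaultdict

def calc_counts_defaultdict(data, column, value_column=4):
    """Sum value_column for each distinct value found in column."""
    totals = defaultdict(int)  # missing keys start at 0
    for row in data:
        totals[row[column]] += row[value_column]
    return dict(totals)  # return a plain dict, like calc_counts does

sample_rows = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 8, 6, 8653],
]
print(calc_counts_defaultdict(sample_rows, column=3))  # {6: 16749, 7: 7772}
```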


# Next steps

1. "Writing a function that can calculate the min and max values for any dictionary that's passed in."
2. "Writing a function that extracts the same values across years and calculates the differences between consecutive values to show if number of births is increasing or decreasing."
  + "For example, how did the number of births on Saturday change each year between 1994 and 2003?"
3. "Finding a way to combine the CDC data with the SSA data, which you can find [here](https://github.com/fivethirtyeight/data/tree/master/births). Specifically, brainstorming ways to deal with the overlapping time periods in the datasets."

## 1. Determining max and min

In [212]:
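def max_finder(input_dict):
    """Return the largest value in input_dict (0 if it is empty)."""
    largest = 0  # renamed from max to avoid shadowing the built-in
    just_started = True
    for key in input_dict:
        if just_started:
            largest = input_dict[key]
            just_started = False
        elif input_dict[key] > largest:
            largest = input_dict[key]
    return largest


def min_finder(input_dict):
    """Return the smallest value in input_dict (0 if it is empty)."""
    smallest = 0  # renamed from min to avoid shadowing the built-in
    just_started = True
    for key in input_dict:
        if just_started:
            smallest = input_dict[key]
            just_started = False
        elif input_dict[key] < smallest:
            smallest = input_dict[key]
    return smallest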

In [213]:
print (cdc_year_births)

{2000: 4058814, 2001: 4025933, 2002: 4021726, 2003: 4089950, 1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894, 1998: 3941553, 1999: 3959417}


In [214]:
max_finder(cdc_year_births)

4089950

In [215]:
min_finder(cdc_year_births)

3880894
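
The two finders above return only the value, not the key it belongs to. A minimal sketch of one way to get both, using the built-in `max` and `min` with a `key` function, is shown below. The helper name `extreme_items` is hypothetical; the totals are copied from the `cdc_year_births` output above.

```python
def extreme_items(totals):
    """Return ((max_key, max_value), (min_key, min_value)) for a non-empty dict."""
    max_key = max(totals, key=totals.get)  # key whose value is largest
    min_key = min(totals, key=totals.get)  # key whose value is smallest
    return (max_key, totals[max_key]), (min_key, totals[min_key])

year_totals = {
    2000: 4058814, 2001: 4025933, 2002: 4021726, 2003: 4089950,
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894,
    1998: 3941553, 1999: 3959417,
}
print(extreme_items(year_totals))  # ((2003, 4089950), (1997, 3880894))
```

Unlike `max_finder` and `min_finder`, this sketch raises a `ValueError` on an empty dictionary instead of returning 0.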

## 2. General functions to determine year-to-year change in births (increasing or decreasing) for certain conditions

"For example, how did the number of births on Saturday change each year between 1994 and 2003?"

In [216]:


def birth_conditions_totals(data, start_yr, end_yr,
                fixed_column, fixed_value):
    """Returns a dictionary with the total births for each year,
    counting only rows where fixed_column equals fixed_value.
    
    Arguments:
    data -- a list of lists of ints.
    start_yr -- start year, must be less than or equal to
        end year.
    end_yr -- end year, must be greater than or equal to
        start year.
    fixed_column -- category number, such as day_of_week,
        which is 3. 
    fixed_value -- value number for which we are checking 
        over the years, such as Saturday, which is 6. 
    
    """
    
    trends = {}
    year = start_yr
    while year <= end_yr:
        birth_counter = 0
        
        #for each row in the data, if the year matches our
        #year and the value we're looking for matches the
        #value in that column, then we add up that day's
        #births to our yearly total
        for row in data:
            if row[0] == year and row[fixed_column] == fixed_value:
                birth_counter += row[4]     
                
        trends[year] = birth_counter
        year += 1
    return trends

In [217]:
saturday_births = birth_conditions_totals(data=cdc_list, start_yr=1994, end_yr=2003, fixed_column=3, fixed_value=6)
saturday_births

{1994: 474732,
 1995: 459580,
 1996: 456261,
 1997: 450840,
 1998: 453776,
 1999: 449985,
 2000: 469794,
 2001: 453928,
 2002: 445770,
 2003: 447445}
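
`birth_conditions_totals` scans the whole data set once per year. The same totals can be sketched in a single pass by preparing one counter per year up front. This is an illustrative variant, not the notebook's method; the name `yearly_totals_single_pass` is hypothetical, and the sample rows are taken from outputs shown elsewhere in this notebook.

```python
def yearly_totals_single_pass(data, start_yr, end_yr, fixed_column, fixed_value):
    """Total births per year for rows where fixed_column equals fixed_value."""
    # Start every year at 0 so years with no matching rows still appear
    totals = {year: 0 for year in range(start_yr, end_yr + 1)}
    for row in data:
        year = row[0]
        if year in totals and row[fixed_column] == fixed_value:
            totals[year] += row[4]
    return totals

sample_rows = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 8, 6, 8653],
    [2000, 1, 1, 6, 8843],  # outside the requested range below, so ignored
]
print(yearly_totals_single_pass(sample_rows, 1994, 1995, fixed_column=3, fixed_value=6))
# {1994: 16749, 1995: 0}
```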

In [218]:
def trend_thinker(input_dict, start_yr, end_yr):
    """Returns a dictionary of boolean values with each 
    year from start year to end year as a keyword, 
    excluding the start year.
    
    The value True indicates births increased from the
    previous year.
    The value False indicates births decreased from the
    previous year. 
    In the rare case that births are exactly equal from 
    consecutive years, the value will be marked as
    False. 
    """
    
    trends = {}
    year = start_yr + 1
    while year <= end_yr:
        trends[year] = (input_dict[year] > input_dict[year-1])
        year += 1
    return trends

In [219]:
saturday_annual_trends = trend_thinker(input_dict=saturday_births, start_yr=1994, end_yr=2003)
saturday_annual_trends

{1995: False,
 1996: False,
 1997: False,
 1998: True,
 1999: False,
 2000: True,
 2001: False,
 2002: False,
 2003: True}

In [220]:
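def boolean_dict_thinker(input_dict):
    """Translates a dictionary of booleans into "Increased"
    (True) or "Decreased" (False) labels. Years with equal
    births arrive here as False and are labeled "Decreased".
    """
    output_dict = {}
    for key in input_dict:
        if input_dict[key]:
            output_dict[key] = "Increased"
        else:
            output_dict[key] = "Decreased"
    return output_dict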

In [221]:
real_saturday_annual_trends = boolean_dict_thinker(saturday_annual_trends)
real_saturday_annual_trends

{1995: 'Decreased',
 1996: 'Decreased',
 1997: 'Decreased',
 1998: 'Increased',
 1999: 'Decreased',
 2000: 'Increased',
 2001: 'Decreased',
 2002: 'Decreased',
 2003: 'Increased'}

### Consolidating 3 functions into one

In [222]:
def growth_over_time(data, start_yr, end_yr,
                fixed_column, fixed_value):
    """Returns a dictionary with the value "Increased" or
    "Decreased" for each year after start_yr, comparing births
    with the previous year for rows where fixed_column equals
    fixed_value.
    
    Arguments:
    data -- a list of lists of ints.
    start_yr -- start year, must be less than or equal to
        end year.
    end_yr -- end year, must be greater than or equal to
        start year.
    fixed_column -- category number, such as day_of_week,
        which is 3. 
    fixed_value -- value number for which we are checking 
        over the years, such as Saturday, which is 6. 
    
    """
    
    birth_dict = birth_conditions_totals(data=data, start_yr=start_yr, end_yr=end_yr, 
                fixed_column=fixed_column, fixed_value=fixed_value)
    annual_trends = trend_thinker(birth_dict, start_yr=start_yr, end_yr=end_yr)
    real_annual_trends = boolean_dict_thinker(annual_trends)
    
    return real_annual_trends

Results: "How did the number of births on Saturday change each year between 1994 and 2003?"

In [223]:
growth_over_time(data=cdc_list, start_yr=1994, end_yr=2003, fixed_column=3, fixed_value=6)

{1995: 'Decreased',
 1996: 'Decreased',
 1997: 'Decreased',
 1998: 'Increased',
 1999: 'Decreased',
 2000: 'Increased',
 2001: 'Decreased',
 2002: 'Decreased',
 2003: 'Increased'}
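
The labels above show the direction of each change but not its size. A minimal sketch that computes the signed difference between consecutive years is shown below. The function name `yearly_differences` is hypothetical, the sketch assumes the dictionary keys are consecutive years, and the totals are copied from the `saturday_births` output above.

```python
def yearly_differences(totals):
    """Map each year (except the first) to its change from the previous year."""
    years = sorted(totals)
    # zip pairs each year with the one before it: (1994, 1995), (1995, 1996), ...
    return {year: totals[year] - totals[prev] for prev, year in zip(years, years[1:])}

saturday_totals = {
    1994: 474732, 1995: 459580, 1996: 456261, 1997: 450840, 1998: 453776,
    1999: 449985, 2000: 469794, 2001: 453928, 2002: 445770, 2003: 447445,
}
for year, change in yearly_differences(saturday_totals).items():
    print(year, change)
# 1995 -15152
# 1996 -3319
# 1997 -5421
# 1998 2936
# 1999 -3791
# 2000 19809
# 2001 -15866
# 2002 -8158
# 2003 1675
```

A positive difference corresponds to "Increased" and a negative one to "Decreased" in the output above.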

## 3. Combining CDC data with SSA data
Source: (https://github.com/fivethirtyeight/data/tree/master/births)

In [224]:
(cdc_list[0:10])

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]

In [225]:
# The with block closes the file automatically once it has been read
with open("US_births_2000-2014_SSA.csv", 'r') as g:
    raw_string_g = g.read()
ssa_list = read_csv(raw_string_g)

(ssa_list[0:10])

[[2000, 1, 1, 6, 9083],
 [2000, 1, 2, 7, 8006],
 [2000, 1, 3, 1, 11363],
 [2000, 1, 4, 2, 13032],
 [2000, 1, 5, 3, 12558],
 [2000, 1, 6, 4, 12466],
 [2000, 1, 7, 5, 12516],
 [2000, 1, 8, 6, 8934],
 [2000, 1, 9, 7, 7949],
 [2000, 1, 10, 1, 11668]]

The order of the columns is consistent for both the CDC and SSA data sets:

- `year`: Year.
- `month`: Month (1 to 12).
- `date_of_month`: Day number of the month (1 to 31).
- `day_of_week`: Day of week (1 to 7: 1 represents Monday, 7 represents Sunday).
- `births`: Number of births for that specific day.

### Plan for addressing overlapping time periods

In the new, combined data set, the births value for each overlapping day (a date that appears in both the CDC and SSA data sets) will be the average of the two values.

In [226]:
def averager(num_1, num_2):
    return (num_1 + num_2)/2

Start by copying the entire CDC data set into the new data set. Then make a record of all dates already logged.

In [227]:
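import copy
# deepcopy already holds every CDC row; appending the rows of cdc_list
# again would duplicate them, so no extra append loop is needed
merged_data = copy.deepcopy(cdc_list)
search_data = []

# Record the [year, month, date_of_month] key of every row already logged
for row in merged_data:
    search_data.append(row[0:3])
search_data[0:10]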

[[1994, 1, 1],
 [1994, 1, 2],
 [1994, 1, 3],
 [1994, 1, 4],
 [1994, 1, 5],
 [1994, 1, 6],
 [1994, 1, 7],
 [1994, 1, 8],
 [1994, 1, 9],
 [1994, 1, 10]]

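Run through the SSA data. If a date (row) isn't in the new data set, append the row. Otherwise, replace the births value already stored for that date with the average of the CDC and SSA values, and record the change in `changed_rows`.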

In [228]:
changed_rows = {}
for row in ssa_list:
    date_key = row[0:3]  # [year, month, date_of_month]
    if date_key not in search_data:
        # Date only exists in the SSA data: add it as is
        merged_data.append(row)
        search_data.append(date_key)
    else:
        # Date exists in both data sets: store the average
        spot = search_data.index(date_key)
        average = averager(merged_data[spot][4], row[4])
        merged_data[spot][4] = average
        changed_rows[spot] = average
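
Looking up each date with `in` and `.index()` on a list rescans the list for every SSA row. A minimal sketch of the same merge using a dictionary keyed by `(year, month, date_of_month)` is shown below. It is an alternative under stated assumptions, not the notebook's method: the name `merge_by_date` is hypothetical, and the inputs are trimmed to a few rows taken from the CDC and SSA outputs in this notebook.

```python
def merge_by_date(primary, secondary):
    """Merge two row lists; average the births when a date is in both."""
    merged = {}
    for row in primary:
        merged[tuple(row[:3])] = list(row)  # copy so the inputs are not mutated
    for row in secondary:
        key = tuple(row[:3])
        if key in merged:
            merged[key][4] = (merged[key][4] + row[4]) / 2
        else:
            merged[key] = list(row)
    # Tuples sort by year, then month, then day, so this is chronological
    return [merged[key] for key in sorted(merged)]

cdc_sample = [[2000, 1, 1, 6, 8843], [2000, 1, 2, 7, 7816]]
ssa_sample = [[2000, 1, 2, 7, 8006], [2000, 1, 3, 1, 11363]]
print(merge_by_date(cdc_sample, ssa_sample))
# [[2000, 1, 1, 6, 8843], [2000, 1, 2, 7, 7911.0], [2000, 1, 3, 1, 11363]]
```

Dictionary lookups take roughly constant time, so this version makes one pass over each input instead of searching the growing list for every row.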
        

### Results

In [229]:
merged_data

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498],
 [1994, 1, 11, 2, 11706],
 [1994, 1, 12, 3, 11567],
 [1994, 1, 13, 4, 11212],
 [1994, 1, 14, 5, 11570],
 [1994, 1, 15, 6, 8660],
 [1994, 1, 16, 7, 8123],
 [1994, 1, 17, 1, 10567],
 [1994, 1, 18, 2, 11541],
 [1994, 1, 19, 3, 11257],
 [1994, 1, 20, 4, 11682],
 [1994, 1, 21, 5, 11811],
 [1994, 1, 22, 6, 8833],
 [1994, 1, 23, 7, 8310],
 [1994, 1, 24, 1, 11125],
 [1994, 1, 25, 2, 11981],
 [1994, 1, 26, 3, 11514],
 [1994, 1, 27, 4, 11702],
 [1994, 1, 28, 5, 11666],
 [1994, 1, 29, 6, 8988],
 [1994, 1, 30, 7, 8096],
 [1994, 1, 31, 1, 10765],
 [1994, 2, 1, 2, 11755],
 [1994, 2, 2, 3, 11483],
 [1994, 2, 3, 4, 11523],
 [1994, 2, 4, 5, 11677],
 [1994, 2, 5, 6, 8991],
 [1994, 2, 6, 7, 8309],
 [1994, 2, 7, 1, 10984],
 [1994, 2, 8, 2, 12152],
 [1994, 2, 9, 3

As expected, the first 20 rows of the CDC data set and the first 20 rows of the new data set are the same, because these dates (January 1994) do not appear in the SSA data set.

In [238]:
merged_data[0:20]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498],
 [1994, 1, 11, 2, 11706],
 [1994, 1, 12, 3, 11567],
 [1994, 1, 13, 4, 11212],
 [1994, 1, 14, 5, 11570],
 [1994, 1, 15, 6, 8660],
 [1994, 1, 16, 7, 8123],
 [1994, 1, 17, 1, 10567],
 [1994, 1, 18, 2, 11541],
 [1994, 1, 19, 3, 11257],
 [1994, 1, 20, 4, 11682]]

In [231]:
cdc_list[0:20]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498],
 [1994, 1, 11, 2, 11706],
 [1994, 1, 12, 3, 11567],
 [1994, 1, 13, 4, 11212],
 [1994, 1, 14, 5, 11570],
 [1994, 1, 15, 6, 8660],
 [1994, 1, 16, 7, 8123],
 [1994, 1, 17, 1, 10567],
 [1994, 1, 18, 2, 11541],
 [1994, 1, 19, 3, 11257],
 [1994, 1, 20, 4, 11682]]

As expected, the last 20 rows of the SSA data set and the last 20 rows of the new data set are the same, because these dates (December 2014) do not appear in the CDC data set.

In [232]:
merged_data[len(merged_data)-20:len(merged_data)]

[[2014, 12, 12, 5, 12001],
 [2014, 12, 13, 6, 8596],
 [2014, 12, 14, 7, 7291],
 [2014, 12, 15, 1, 12013],
 [2014, 12, 16, 2, 12748],
 [2014, 12, 17, 3, 12684],
 [2014, 12, 18, 4, 12816],
 [2014, 12, 19, 5, 12714],
 [2014, 12, 20, 6, 8465],
 [2014, 12, 21, 7, 7382],
 [2014, 12, 22, 1, 12799],
 [2014, 12, 23, 2, 12604],
 [2014, 12, 24, 3, 9308],
 [2014, 12, 25, 4, 6749],
 [2014, 12, 26, 5, 10386],
 [2014, 12, 27, 6, 8656],
 [2014, 12, 28, 7, 7724],
 [2014, 12, 29, 1, 12811],
 [2014, 12, 30, 2, 13634],
 [2014, 12, 31, 3, 11990]]

In [233]:
ssa_list[len(ssa_list)-20:len(ssa_list)]

[[2014, 12, 12, 5, 12001],
 [2014, 12, 13, 6, 8596],
 [2014, 12, 14, 7, 7291],
 [2014, 12, 15, 1, 12013],
 [2014, 12, 16, 2, 12748],
 [2014, 12, 17, 3, 12684],
 [2014, 12, 18, 4, 12816],
 [2014, 12, 19, 5, 12714],
 [2014, 12, 20, 6, 8465],
 [2014, 12, 21, 7, 7382],
 [2014, 12, 22, 1, 12799],
 [2014, 12, 23, 2, 12604],
 [2014, 12, 24, 3, 9308],
 [2014, 12, 25, 4, 6749],
 [2014, 12, 26, 5, 10386],
 [2014, 12, 27, 6, 8656],
 [2014, 12, 28, 7, 7724],
 [2014, 12, 29, 1, 12811],
 [2014, 12, 30, 2, 13634],
 [2014, 12, 31, 3, 11990]]

While the merge ran, a dictionary was built on the side to record every occasion on which the merged data set was modified to accommodate an average of the two parent data sets. Each key is the index in the merged data set at which this occurred.

This index also matches the CDC list, because the merged data set starts as a copy of the CDC list and the overlapping dates (2000 onward) sit at the end of the CDC list.

If you subtract 2191 (see below), this index matches the SSA list, because the overlap starts where the SSA list begins: January 1st, 2000 is index 2191 in the CDC list and index 0 in the SSA list.
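
The offset of 2191 can be checked with a short date calculation. The sketch below assumes the CDC list has exactly one row per calendar day starting on January 1st, 1994, with no gaps.

```python
from datetime import date

# Days from the first CDC row (1994-01-01) to the first SSA row (2000-01-01):
# six years of 365 days plus one leap day for 1996
offset = (date(2000, 1, 1) - date(1994, 1, 1)).days
print(offset)  # 2191
```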

In [234]:
changed_rows

{2191: 8963.0,
 2192: 7911.0,
 2193: 11243.0,
 2194: 12867.5,
 2195: 12399.0,
 2196: 12363.0,
 2197: 12398.0,
 2198: 8842.0,
 2199: 7842.5,
 2200: 11543.0,
 2201: 12467.0,
 2202: 12227.5,
 2203: 11685.5,
 2204: 12052.0,
 2205: 8445.0,
 2206: 7578.0,
 2207: 10712.5,
 2208: 12218.0,
 2209: 12244.0,
 2210: 12347.0,
 2211: 11810.0,
 2212: 8744.0,
 2213: 7784.0,
 2214: 11316.5,
 2215: 12423.0,
 2216: 12011.0,
 2217: 12272.0,
 2218: 11803.5,
 2219: 8720.0,
 2220: 7683.5,
 2221: 11007.5,
 2222: 12486.5,
 2223: 12273.5,
 2224: 11934.0,
 2225: 11931.5,
 2226: 8549.0,
 2227: 7796.0,
 2228: 11502.5,
 2229: 12735.0,
 2230: 12169.5,
 2231: 12475.5,
 2232: 12296.5,
 2233: 8755.0,
 2234: 7867.0,
 2235: 12036.0,
 2236: 12864.0,
 2237: 12380.0,
 2238: 12397.0,
 2239: 12284.0,
 2240: 8777.0,
 2241: 7853.0,
 2242: 10508.0,
 2243: 12532.5,
 2244: 12112.0,
 2245: 12291.5,
 2246: 12087.5,
 2247: 8936.5,
 2248: 7776.0,
 2249: 11335.0,
 2250: 11770.0,
 2251: 12539.5,
 2252: 12392.5,
 2253: 12255.0,
 2254: 895

In [235]:
for key in changed_rows:
    print (cdc_list[key])

[2000, 1, 1, 6, 8843]
[2000, 1, 2, 7, 7816]
[2000, 1, 3, 1, 11123]
[2000, 1, 4, 2, 12703]
[2000, 1, 5, 3, 12240]
[2000, 1, 6, 4, 12260]
[2000, 1, 7, 5, 12280]
[2000, 1, 8, 6, 8750]
[2000, 1, 9, 7, 7736]
[2000, 1, 10, 1, 11418]
[2000, 1, 11, 2, 12323]
[2000, 1, 12, 3, 12057]
[2000, 1, 13, 4, 11556]
[2000, 1, 14, 5, 11924]
[2000, 1, 15, 6, 8365]
[2000, 1, 16, 7, 7499]
[2000, 1, 17, 1, 10601]
[2000, 1, 18, 2, 12086]
[2000, 1, 19, 3, 12083]
[2000, 1, 20, 4, 12188]
[2000, 1, 21, 5, 11667]
[2000, 1, 22, 6, 8633]
[2000, 1, 23, 7, 7712]
[2000, 1, 24, 1, 11184]
[2000, 1, 25, 2, 12253]
[2000, 1, 26, 3, 11879]
[2000, 1, 27, 4, 12136]
[2000, 1, 28, 5, 11673]
[2000, 1, 29, 6, 8635]
[2000, 1, 30, 7, 7603]
[2000, 1, 31, 1, 10882]
[2000, 2, 1, 2, 12359]
[2000, 2, 2, 3, 12082]
[2000, 2, 3, 4, 11806]
[2000, 2, 4, 5, 11828]
[2000, 2, 5, 6, 8474]
[2000, 2, 6, 7, 7730]
[2000, 2, 7, 1, 11375]
[2000, 2, 8, 2, 12591]
[2000, 2, 9, 3, 12024]
[2000, 2, 10, 4, 12339]
[2000, 2, 11, 5, 12182]
[2000, 2, 12, 6, 8674]

In [236]:
for key in changed_rows:
    print (ssa_list[key-2191])

[2000, 1, 1, 6, 9083]
[2000, 1, 2, 7, 8006]
[2000, 1, 3, 1, 11363]
[2000, 1, 4, 2, 13032]
[2000, 1, 5, 3, 12558]
[2000, 1, 6, 4, 12466]
[2000, 1, 7, 5, 12516]
[2000, 1, 8, 6, 8934]
[2000, 1, 9, 7, 7949]
[2000, 1, 10, 1, 11668]
[2000, 1, 11, 2, 12611]
[2000, 1, 12, 3, 12398]
[2000, 1, 13, 4, 11815]
[2000, 1, 14, 5, 12180]
[2000, 1, 15, 6, 8525]
[2000, 1, 16, 7, 7657]
[2000, 1, 17, 1, 10824]
[2000, 1, 18, 2, 12350]
[2000, 1, 19, 3, 12405]
[2000, 1, 20, 4, 12506]
[2000, 1, 21, 5, 11953]
[2000, 1, 22, 6, 8855]
[2000, 1, 23, 7, 7856]
[2000, 1, 24, 1, 11449]
[2000, 1, 25, 2, 12593]
[2000, 1, 26, 3, 12143]
[2000, 1, 27, 4, 12408]
[2000, 1, 28, 5, 11934]
[2000, 1, 29, 6, 8805]
[2000, 1, 30, 7, 7764]
[2000, 1, 31, 1, 11133]
[2000, 2, 1, 2, 12614]
[2000, 2, 2, 3, 12465]
[2000, 2, 3, 4, 12062]
[2000, 2, 4, 5, 12035]
[2000, 2, 5, 6, 8624]
[2000, 2, 6, 7, 7862]
[2000, 2, 7, 1, 11630]
[2000, 2, 8, 2, 12879]
[2000, 2, 9, 3, 12315]
[2000, 2, 10, 4, 12612]
[2000, 2, 11, 5, 12411]
[2000, 2, 12, 6, 8836]

The changed values in the merged data set differ from both parent data sets, as they should. Most importantly, the averages are correct. For instance, key 2191 refers to *January 1st, 2000* [2000, 1, 1], where the CDC value is 8843 and the SSA value is 9083:

In [237]:
averager(8843, 9083)

8963.0