# Explore U.S. Births

In this project, we're going to explore U.S. Births using this [dataset](https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv) compiled by FiveThirtyEight.

This dataset contains the following columns:
* **year**: Year (1994 to 2003).
* **month**: Month (1 to 12).
* **date_of_month**: Day number of the month (1 to 31).
* **day_of_week**: Day of week (1 to 7, where 1 is Monday and 7 is Sunday).
* **births**: Number of births that day.

## Reading the dataset from the CSV file and exploring it:

In [6]:
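# Read the whole file and split it into lines; the with block closes the file.
with open(r"US_births_1994-2003.csv", 'r') as f:
    data = f.read().split('\n')
# Skip the header row and drop any empty line left by a trailing newline.
string_list = [line for line in data[1:] if line]
string_list[:5]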

['1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053']

## Converting data into a list of lists:

While a list of strings helps us get a general picture of the dataset, we need to convert it to a more structured format to be able to analyze it. Specifically, we need to convert the dataset into a list of lists where each nested list contains integer values (not strings). We also need to remove the header row.

Here's what we want the data to look like:

```python
[ 
  [1994, 1, 1, 6, 8096],
  [1994, 1, 2, 7, 7772],
  [1994, 1, 3, 1, 10142],
  [1994, 1, 4, 2, 11248],
  [1994, 1, 5, 3, 11053],
...
]
```

In [7]:
# Split each line into its comma-separated fields.
string_fields = []
for string in string_list:
    split_string = string.split(',')
    string_fields.append(split_string)
# Convert every field from a string to an integer.
int_fields = []
for row in string_fields:
    int_row = []
    for value in row:
        int_row.append(int(value))
    int_fields.append(int_row)
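
The two loops above can also be written as a single nested list comprehension. The following is a minimal sketch, using the first five rows shown earlier as stand-in data so that it runs without the CSV file:

```python
# Stand-in for string_list: the first five data rows of the dataset.
sample_lines = [
    '1994,1,1,6,8096',
    '1994,1,2,7,7772',
    '1994,1,3,1,10142',
    '1994,1,4,2,11248',
    '1994,1,5,3,11053',
]

# Outer comprehension: one list per line.
# Inner comprehension: split the line on commas and convert each field to int.
int_fields = [[int(value) for value in line.split(',')] for line in sample_lines]

print(int_fields[0])  # [1994, 1, 1, 6, 8096]
```

The result has the same shape as the list of lists built by the loops, so either form could feed the later steps.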

## Creating a function that reads a CSV file and turns it into a list of lists:

In [8]:
def read_csv(text_file):
    # Read the file and split it into lines; the with block closes the file.
    with open(text_file, "r") as f:
        data = f.read().split('\n')
    # Skip the header row and ignore empty lines (e.g. after a trailing newline).
    string_list = [line for line in data[1:] if line]
    final_data = []
    for string in string_list:
        # Split the line on commas and convert each field to an integer.
        int_row = []
        for value in string.split(','):
            int_row.append(int(value))
        final_data.append(int_row)
    return final_data
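
As an alternative to splitting lines by hand, Python's standard `csv` module can handle the parsing. The sketch below is one possible variant, not the approach this project uses: `read_csv_rows` is a hypothetical name, and the demonstration writes the header plus the first two rows shown earlier to a temporary file so that it runs without the real dataset.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Hypothetical variant of read_csv built on the standard csv module."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        # Convert every field to int and ignore completely empty lines.
        return [[int(value) for value in row] for row in reader if row]


# Demonstration on a small temporary file (stand-in for the real CSV).
sample = (
    "year,month,date_of_month,day_of_week,births\n"
    "1994,1,1,6,8096\n"
    "1994,1,2,7,7772\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)

rows = read_csv_rows(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(rows)  # [[1994, 1, 1, 6, 8096], [1994, 1, 2, 7, 7772]]
```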

### Calculating the number of births per year, month, and day:

Now that the data is in a more usable format, we can start to analyze it. Let's calculate the total number of births that occurred in each year and in each month, across the whole dataset. For the monthly totals, we'll create a dictionary where each key is a unique month and each value is the number of births that happened in that month, across all years:

```python
{  
   1: 3232517,
   2: 3018140,
   3: 3322069,
   4: 3185314,
   5: 3350907,
   6: 3296530,
   7: 3498783,
   8: 3525858,
   9: 3439698,
   10: 3378814,
   11: 3171647,
   12: 3301860
}
```

In [24]:
def year_birth(text_file):
    cdc_list = read_csv(text_file)
    births_per_year = {}
    for row in cdc_list:
        # row[0] is the year and row[4] is the number of births that day.
        if row[0] not in births_per_year:
            births_per_year[row[0]] = row[4]
        else:
            births_per_year[row[0]] += row[4]
    return births_per_year

year_birth(r"US_births_1994-2003.csv")

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

In [25]:
def month_birth(text_file):
    cdc_list = read_csv(text_file)
    births_per_month = {}
    for row in cdc_list:
        # row[1] is the month and row[4] is the number of births that day.
        if row[1] not in births_per_month:
            births_per_month[row[1]] = row[4]
        else:
            births_per_month[row[1]] += row[4]
    return births_per_month

month_birth(r"US_births_1994-2003.csv")

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

Let's now create functions that calculate the total number of births for each day of the month and for each unique day of the week. Here's what we want the day-of-week dictionary to look like:

```python
{
  1: 5789166,
  2: 6446196,
  3: 6322855,
  4: 6288429,
  5: 6233657,
  6: 4562111,
  7: 4079723
}
```

In [26]:
def day_birth(text_file):
    cdc_list = read_csv(text_file)
    births_per_day = {}
    for row in cdc_list:
        # row[2] is the day of the month and row[4] is the number of births that day.
        if row[2] not in births_per_day:
            births_per_day[row[2]] = row[4]
        else:
            births_per_day[row[2]] += row[4]
    return births_per_day

day_birth(r"US_births_1994-2003.csv")

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

In [27]:
def week_birth(text_file):
    cdc_list = read_csv(text_file)
    births_per_week = {}
    for row in cdc_list:
        # row[3] is the day of the week and row[4] is the number of births that day.
        if row[3] not in births_per_week:
            births_per_week[row[3]] = row[4]
        else:
            births_per_week[row[3]] += row[4]
    return births_per_week

week_birth(r"US_births_1994-2003.csv")

{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}

### Calculating number of births with a more general function:

You may have noticed a lot of similarity between the four functions above: they differ only in the column used as the dictionary key. Rather than keeping a separate function for each column, it's better to create a single function that works for any column and takes the column we want as a parameter each time we call it. Note that `calc_counts` takes a 1-based column number (1 for year, 2 for month, and so on).

In [13]:
def calc_counts(text_file, column):
    cdc_list = read_csv(text_file)
    births_per_column = {}
    # column is 1-based (1 = year, 2 = month, ...), so subtract 1 to get the list index.
    index = column - 1
    for row in cdc_list:
        key = row[index]
        if key not in births_per_column:
            births_per_column[key] = row[4]
        else:
            births_per_column[key] += row[4]
    return births_per_column

In [28]:
calc_counts(r"US_births_1994-2003.csv", 1)

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}
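
Each call to `calc_counts` reads the file again. One possible refinement, sketched here as an assumption rather than as part of the original project, is to pass in rows that have already been loaded. The name `calc_counts_from_rows` is hypothetical, and the first five rows of the dataset serve as stand-in data:

```python
def calc_counts_from_rows(rows, column):
    """Sum births (index 4) grouped by the value in a 1-based column.

    Hypothetical helper: takes rows already converted to lists of ints
    instead of a file name.
    """
    totals = {}
    index = column - 1
    for row in rows:
        # dict.get returns 0 the first time a key is seen.
        totals[row[index]] = totals.get(row[index], 0) + row[4]
    return totals


# Stand-in data: the first five rows of the dataset.
sample_rows = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 3, 1, 10142],
    [1994, 1, 4, 2, 11248],
    [1994, 1, 5, 3, 11053],
]

by_year = calc_counts_from_rows(sample_rows, 1)
by_weekday = calc_counts_from_rows(sample_rows, 4)
print(by_year)     # {1994: 48311}
print(by_weekday)  # {6: 8096, 7: 7772, 1: 10142, 2: 11248, 3: 11053}
```

With the full dataset loaded once, the same rows could be reused for every column without reopening the file.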

### Calculating the minimum and maximum number of births:

In [19]:
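def max_counts(text_file, column):
    counts = calc_counts(text_file, column)
    # Start from None rather than 0 so no assumption is made about the totals.
    largest = None
    for key in counts:
        if largest is None or counts[key] > largest:
            largest = counts[key]
    return largest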

In [20]:
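def min_counts(text_file, column):
    counts = calc_counts(text_file, column)
    # Start from None instead of an arbitrary large number such as 10000000.
    smallest = None
    for key in counts:
        if smallest is None or counts[key] < smallest:
            smallest = counts[key]
    return smallest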

In [29]:
max_counts(r"US_births_1994-2003.csv", 1)

4089950

In [31]:
min_counts(r"US_births_1994-2003.csv", 1)

3880894
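
The same minimum and maximum can be found with Python's built-in `min` and `max`, which also makes it easy to report which key produced each value. This is a minimal sketch: `min_max` is a hypothetical helper, and the dictionary is copied from the yearly totals shown above.

```python
def min_max(counts):
    """Return ((min_key, min_value), (max_key, max_value)) for a dict of totals."""
    # key=counts.get makes min/max compare the dictionary values, not the keys.
    min_key = min(counts, key=counts.get)
    max_key = max(counts, key=counts.get)
    return (min_key, counts[min_key]), (max_key, counts[max_key])


# Yearly totals as produced by calc_counts for column 1.
yearly_totals = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894,
    1998: 3941553, 1999: 3959417, 2000: 4058814, 2001: 4025933,
    2002: 4021726, 2003: 4089950,
}

lowest, highest = min_max(yearly_totals)
print(lowest)   # (1997, 3880894)
print(highest)  # (2003, 4089950)
```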

## Next Steps

- Write a function that can calculate the min and max values for any dictionary that's passed in.
- Write a function that extracts the same values across years and calculates the differences between consecutive values to show whether the number of births is increasing or decreasing.
  - For example, how did the number of births on Saturday change each year between 1994 and 2003?
- Find a way to combine the CDC data with the SSA data, which you can find here. Specifically, brainstorm ways to deal with the overlapping time periods in the datasets.
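
As a starting point for the consecutive-differences idea in the list above, here is one possible sketch. It works on the yearly totals already computed in this project rather than on Saturday births, and `yearly_changes` is a hypothetical name:

```python
def yearly_changes(totals):
    """Map each year (except the first) to its change from the previous year."""
    years = sorted(totals)
    changes = {}
    # zip pairs every year with the one that follows it.
    for previous, current in zip(years, years[1:]):
        changes[current] = totals[current] - totals[previous]
    return changes


# Yearly totals as produced by calc_counts for column 1.
yearly_totals = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894,
    1998: 3941553, 1999: 3959417, 2000: 4058814, 2001: 4025933,
    2002: 4021726, 2003: 4089950,
}

changes = yearly_changes(yearly_totals)
print(changes[1995])  # -53178
print(changes[2003])  # 68224
```

A negative value means fewer births than in the previous year. The same function could be applied to a dictionary of per-year Saturday totals once those have been extracted.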