# US SSA Birth Data

We're going to explore some data on births in the US from the US Social Security Administration. The data we'll be using was curated by [FiveThirtyEight](https://github.com/fivethirtyeight/data/tree/master/births).

The data is already in this repository in the `data` directory, but you could also download it from [here](https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv).

```bash
mkdir -p ../data
wget -qO ../data/US_births_2000-2014_SSA.csv https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
```

To get started, we need to read the data from the comma-separated values (CSV) file.

## Environment

I highly encourage you to use [miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage your Python environment; it plays nicely with [VSCode](https://code.visualstudio.com/download). Follow the instructions to install miniconda, which will set you up with the `conda` command line tool for managing your environments.

You can use `requirements.txt` in this repository to set up an environment with the packages we need. To do so, open a terminal window, navigate to the base directory of the repository, and run:

```bash
conda create -n births python=3.10 pip
```

This creates an environment called `births` running Python 3.10, with `pip` installed. Now activate that environment and install the required packages within it:

```bash
conda activate births
pip install -r requirements.txt
```

## I/O with `pandas`

We'll be using `pandas` to read the CSV file and manage data.  I'm going to assume you're using a `conda` environment with `pandas`, `numpy`, `matplotlib`, and `seaborn` installed.  If you're using a different environment, you may have to install some things (e.g., using `pip`).

```python
import pandas as pd
```

The `pd.read_csv()` function automatically parses the column names and infers a data type for each column, returning a `DataFrame` object. Please read [the docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to learn all the gory details about `DataFrame` objects.
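As a minimal sketch of what that looks like, the example below reads a few rows laid out like the SSA file. The column names (`year`, `month`, `date_of_month`, `day_of_week`, `births`) are an assumption about the file's header, and the birth counts are made-up placeholders so the snippet runs without the real file.

```python
import io

import pandas as pd

# Tiny made-up sample in the column layout assumed for the SSA file.
# The birth counts are placeholders, not real data.
sample_csv = io.StringIO(
    "year,month,date_of_month,day_of_week,births\n"
    "2000,1,1,6,9000\n"
    "2000,1,2,7,8000\n"
    "2000,1,3,1,11000\n"
)

# For the real file, pass its path instead of the in-memory sample, e.g.
# births = pd.read_csv("../data/US_births_2000-2014_SSA.csv")
births = pd.read_csv(sample_csv)

print(births.head())   # first few rows
print(births.dtypes)   # data type inferred for each column
```

Checking `births.dtypes` right after loading is a quick way to confirm that the numeric columns were parsed as integers rather than strings.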

## Plotting

`matplotlib` is going to be our go-to plotting library.  We'll also make use of some domain-specific plotting libraries like `seaborn`, `arviz`, etc. throughout the term, many of which are built on top of `matplotlib`.

```python
import matplotlib.pyplot as plt
```

First we'll make a simple histogram.
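One way such a histogram could look is sketched below. To keep the snippet self-contained, it builds a synthetic stand-in table with random placeholder counts instead of loading the CSV; the column names are assumed to match the SSA file, and the choice of 30 bins is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the births table: one row per day of 2000,
# with random placeholder counts (not real data).
dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
rng = np.random.default_rng(0)
births = pd.DataFrame(
    {
        "year": dates.year,
        "month": dates.month,
        "date_of_month": dates.day,
        "day_of_week": dates.dayofweek + 1,  # 1 = Monday, ..., 7 = Sunday
        "births": rng.integers(7000, 14000, size=len(dates)),
    }
)

# Histogram of daily birth counts.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(births["births"], bins=30)
ax.set_xlabel("Births per day")
ax.set_ylabel("Number of days")
```

In a notebook the figure renders inline; in a plain script, call `plt.show()` at the end.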

We will also be using `numpy` *extensively* for efficient number-crunching.

```python
import numpy as np
```

Now let's make a scatter plot.
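A possible version of this plot is sketched below, with births plotted against the day of the week. Which column goes on the x-axis is a guess on my part, and the table is again a synthetic stand-in with placeholder counts and assumed column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the births table (placeholder counts, not real data).
dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
rng = np.random.default_rng(0)
births = pd.DataFrame(
    {
        "month": dates.month,
        "day_of_week": dates.dayofweek + 1,  # 1 = Monday, ..., 7 = Sunday
        "births": rng.integers(7000, 14000, size=len(dates)),
    }
)

# One point per day: day of the week on x, number of births on y.
fig, ax = plt.subplots()
ax.scatter(births["day_of_week"], births["births"], s=5, alpha=0.5)
ax.set_xlabel("Day of week (1 = Monday, 7 = Sunday)")
ax.set_ylabel("Births")
```

Small, semi-transparent markers (`s=5`, `alpha=0.5`) help when many points share the same x value.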

There's a lot of scatter here, so let's use a boolean array to select only the data pertaining to June.
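The selection can be sketched as follows: comparing the `month` column to 6 yields a boolean Series, and indexing the `DataFrame` with it keeps only the rows where the value is `True`. The table here is a synthetic stand-in with placeholder counts, and the column names are assumed to match the SSA file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the births table (placeholder counts, not real data).
dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
rng = np.random.default_rng(0)
births = pd.DataFrame(
    {
        "month": dates.month,
        "date_of_month": dates.day,
        "day_of_week": dates.dayofweek + 1,  # 1 = Monday, ..., 7 = Sunday
        "births": rng.integers(7000, 14000, size=len(dates)),
    }
)

# Boolean mask: True for rows in June, False everywhere else.
is_june = births["month"] == 6

# Indexing with the mask keeps only the rows where it is True.
june = births[is_june]
print(len(june), "rows selected out of", len(births))
```

The resulting `june` table can be passed to the same plotting calls as the full table.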

Scatter plots aren't the best way to look at discrete data. A better tool is a violin plot, which plots a kernel density estimate (basically a smoothed histogram) of the distribution of points within each discrete value. `seaborn` has a nice implementation of such a plotting routine. To install it, open a terminal, activate your conda environment, and run:
```bash
python -m pip install seaborn
```

```python
import seaborn as sns
```
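A violin plot of births by day of the week might look like the sketch below, using `sns.violinplot`. The choice of `day_of_week` as the discrete variable is an assumption, as are the column names, and the table is a synthetic stand-in with placeholder counts.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the births table (placeholder counts, not real data).
dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
rng = np.random.default_rng(0)
births = pd.DataFrame(
    {
        "month": dates.month,
        "day_of_week": dates.dayofweek + 1,  # 1 = Monday, ..., 7 = Sunday
        "births": rng.integers(7000, 14000, size=len(dates)),
    }
)

# One violin per day of the week, showing the distribution of daily births.
ax = sns.violinplot(data=births, x="day_of_week", y="births")
ax.set_title("Births by day of week (synthetic data)")
```

`sns.violinplot` returns the `matplotlib` axes it drew on, so the usual `ax.set_*` methods can be used to adjust labels afterwards.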

Now let's use some more advanced features of `pandas`.  First we'll group the data entries by year and compute the total number of births each year.
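A minimal sketch of this grouping, using `groupby` followed by `sum`, is shown below. The table is tiny and made up so the arithmetic is easy to check by hand, and the column names are assumed to match the SSA file.

```python
import pandas as pd

# Tiny made-up table (placeholder counts, not real data) using the
# column names assumed for the SSA file.
births = pd.DataFrame(
    {
        "year": [2000, 2000, 2001, 2001],
        "month": [1, 2, 1, 2],
        "date_of_month": [1, 1, 1, 1],
        "births": [10, 20, 30, 40],
    }
)

# Group rows that share a year, then sum the births within each group.
births_per_year = births.groupby("year")["births"].sum()
print(births_per_year)
```

The result is a `Series` indexed by year, so `births_per_year.plot()` is one quick way to draw the yearly totals.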

Now by day of the month.
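The same pattern applies with a different grouping key. The sketch below assumes the day of the month is stored in a column named `date_of_month` and again uses a tiny made-up table rather than the real data.

```python
import pandas as pd

# Tiny made-up table (placeholder counts, not real data) using the
# column names assumed for the SSA file.
births = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2000],
        "month": [1, 1, 2, 2],
        "date_of_month": [1, 2, 1, 2],
        "births": [10, 20, 30, 40],
    }
)

# Group rows that share a day of the month, then sum births within each group.
births_per_date = births.groupby("date_of_month")["births"].sum()
print(births_per_date)
```

Keep in mind that the 29th, 30th, and 31st occur in fewer months than other dates, so their totals are naturally lower; replacing `.sum()` with `.mean()` gives a per-day average that is easier to compare across dates.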