# Task 1: Instructions

Load in the dataset with the yearly number of deaths.

- Import `pandas`, aliasing it as `pd`.
- Read in `yearly_deaths_by_clinic.csv` and assign it to the variable `yearly`.
- Print out `yearly`.
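
As a minimal sketch of the reading step, the example below reads CSV text from an in-memory buffer instead of a file so that it runs anywhere. The variable names, column names, and values are made up for illustration and are not the project's data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "year,count\n2001,10\n2002,12\n"

# read_csv accepts a file path or any file-like object.
example = pd.read_csv(io.StringIO(csv_text))

# Printing a DataFrame shows its rows and columns as a table.
print(example)
```

With a real file you would pass the file name as a string instead of the `io.StringIO` buffer.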

## Good to know

To complete this project you need to know some Python and be familiar with `pandas` DataFrames and bootstrap analysis. Here are relevant DataCamp exercises if you need to brush up on your skills:

- From [Data Manipulation with pandas](https://www.datacamp.com/courses/data-manipulation-with-pandas).
    - [Reading in a CSV](https://campus.datacamp.com/courses/data-manipulation-with-pandas/creating-and-visualizing-dataframes?ex=14).
    - [Subsetting rows](https://campus.datacamp.com/courses/data-manipulation-with-pandas/transforming-data?ex=2).
    - [Inspecting a DataFrame](https://campus.datacamp.com/courses/data-manipulation-with-pandas/transforming-data?ex=7).
- From [Statistical Thinking in Python (Part 2)](https://www.datacamp.com/courses/statistical-thinking-in-python-part-2).
    - [Bootstrap analysis](https://campus.datacamp.com/courses/statistical-thinking-in-python-part-2/bootstrap-confidence-intervals?ex=1).

Even if you've taken these courses, you will still find this project challenging unless you use some external _documentation_. Here is a [pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) summarizing the basics of pandas DataFrames. (You could also look at the [official pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html), but be aware that it is _very technical_.)

Finally, know that _Google is your friend_ and a good search pattern is **example of ??? in pandas** where **???** is whatever you need to do. For example, if you need to read in a csv file you could search for [example of reading a csv file in pandas](http://www.google.com/search?q=example+of+reading+a+csv+file+in+pandas).

# Task 2: Instructions

Calculate the yearly proportion of deaths.

- Calculate the proportion of `deaths` per number of `births` and store the result in a new column named `proportion_deaths`.
- Extract the rows from Clinic 1 into `clinic_1` and the rows from Clinic 2 into `clinic_2`.
- Print out `clinic_1`.

Here you need to be able to "pick out" or _subset_ rows and columns in the `yearly` DataFrame. How to do that can be gleaned from the [pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) under the headings **Subset observations** and **Subset Variables**.
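
The two operations involved, creating a column from two others and subsetting rows with a condition, can be sketched on a small made-up DataFrame. All names and values below are hypothetical placeholders, not the project's data.

```python
import pandas as pd

# Hypothetical data; the column names are placeholders.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "hits": [1, 3, 2, 5],
    "tries": [10, 10, 20, 20],
})

# Dividing one column by another works element-wise,
# and assigning the result creates a new column.
df["hit_rate"] = df["hits"] / df["tries"]

# A boolean mask keeps only the rows where the condition is True.
group_a = df[df["group"] == "a"]
print(group_a)
```

The comparison `df["group"] == "a"` produces a Series of `True`/`False` values, and indexing the DataFrame with it returns the matching rows.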

# Task 3: Instructions

Plot the yearly proportion of deaths for both clinics.

- Plot `proportion_deaths` by `year` for the two clinics in a single plot. Use the DataFrame `.plot()` method.
    - Label the plotted lines using the `label` argument to `.plot()`.
    - Change the y-axis label to `"Proportion deaths"` using the `ylabel` parameter in your second call of `.plot()`.
- Save the Axes object returned by the `plot` method into the variable `ax`.

For plotting it is easiest to use the `plot` method that is built into DataFrames. To get two lines into the same plot we need to use a trick you might not have seen before. If `df1` and `df2` are two DataFrames you can plot their data together like this:

```
ax = df1.plot(x="col_a", y="col_b",
              label="df1")
df2.plot(x="col_a", y="col_b",
         label="df2", ax=ax, ylabel="Y Axis Label")
```

By capturing the `ax` object and giving it as an argument in the plot statement we get both lines in the same plot.

# Task 4: Instructions

Load in the dataset with the monthly number of deaths for Clinic 1.

- Read in `monthly_deaths.csv` and assign it to the variable `monthly`. Make sure to tell `read_csv` to parse the `date` column as a date.
- Calculate the proportion of `deaths` per number of `births` and store the result in the new column `monthly["proportion_deaths"]`.
- Print out the first rows in `monthly` using the `.head()` method.

The `read_csv()` function doesn't automatically detect which columns contain dates. You can tell it which columns to parse by passing a list of the date columns as the optional argument `parse_dates`. For example, if `my_data.csv` is a CSV file with a date column `date`, then you can read it in like this:

```
my_df = pd.read_csv("my_data.csv", parse_dates=["date"])
```
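
To see what `parse_dates` changes, the following sketch reads the same made-up CSV text twice and compares the resulting column types. The data and variable names are hypothetical, and an in-memory buffer is used only so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical CSV content with a date column.
csv_text = "date,value\n2020-01-01,5\n2020-02-01,7\n"

# Without parse_dates the date column is read as text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the listed columns are converted to datetimes.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(plain["date"].dtype)
print(parsed["date"].dtype)
```

The exact dtype names printed depend on the installed pandas version, but only the second one is a datetime type.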

# Task 5: Instructions

Plot the monthly proportion of deaths for Clinic 1.

- Plot `proportion_deaths` by `date` for the `monthly` data using the DataFrame `.plot()` method.
    - Change the y-axis label to `"Proportion deaths"`.
- Save the Axes object returned by the `.plot()` method into the variable `ax`.
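
A single-line plot against a date column might look like the sketch below. It assumes `matplotlib` is installed (pandas plotting relies on it) and a pandas version that supports the `ylabel` parameter. The DataFrame, its column names, and the label text are made up for illustration.

```python
import matplotlib

# Non-interactive backend so the sketch also runs without a display.
matplotlib.use("Agg")

import pandas as pd

# Hypothetical time series data.
series_df = pd.DataFrame({
    "when": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "measure": [0.2, 0.5, 0.3],
})

# .plot() returns a matplotlib Axes object, which can be kept for later use.
ax = series_df.plot(x="when", y="measure", ylabel="Measured value")
print(ax.get_ylabel())
```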

# Task 6: Instructions

Make a plot that highlights the effect of handwashing. _The code to define `handwashing_start` is already provided to you using `pandas`' `to_datetime()` function_.

- Split `monthly` into `before_washing` (the rows in `monthly` before `handwashing_start`) and `after_washing` (the rows in `monthly` at and after `handwashing_start`).
- Using the same approach you used in Task 3, plot `proportion_deaths` in `before_washing` and `after_washing` into the same plot. Again, use the DataFrame `.plot()` method twice, saving the Axes object returned by the first call of `.plot()` into the variable `ax`.
    - Label the plotted lines using the `label` argument to `.plot()`.
    - Change the y-axis label to `"Proportion deaths"` in your second call of `.plot()`.

Since the column `monthly["date"]` was read in as a date column we can now compare it to other dates using the comparison operators (`<`, `>=`, `==`, etc.). For example, to pick out the row exactly at `handwashing_start` we could write:

```
at_washing = monthly[monthly["date"] == handwashing_start]
```
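
The same kind of comparison can split a DataFrame in two at a cutoff date. Here is a minimal sketch on made-up data, where `events`, `cutoff`, and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical monthly data with a proper datetime column.
events = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]
    ),
    "value": [4, 3, 2, 1],
})
cutoff = pd.to_datetime("2021-03-01")

# Rows strictly before the cutoff.
before = events[events["date"] < cutoff]

# Rows at and after the cutoff.
from_cutoff = events[events["date"] >= cutoff]

print(len(before), len(from_cutoff))
```

Using `<` for one part and `>=` for the other ensures every row ends up in exactly one of the two parts.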

# Task 7: Instructions

Calculate the average reduction in proportion of deaths due to handwashing.

- Select the column `proportion_deaths` in `before_washing` and assign it to `before_proportion`.
- Do the same for `proportion_deaths` in `after_washing` and assign it to `after_proportion`.
- Calculate the difference in mean monthly proportion of deaths as mean `after_proportion` minus mean `before_proportion`, and assign it to `mean_diff`.

For info on how to calculate the mean of `before_proportion` and `after_proportion` take a look under the heading **Summarize data** in the [pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).
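
As a small illustration of the `.mean()` method on a Series, the sketch below computes a difference in means for two made-up Series. The names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical proportions for two periods.
before_values = pd.Series([0.10, 0.12, 0.08])
after_values = pd.Series([0.02, 0.03, 0.01])

# .mean() reduces a Series to a single number.
difference = after_values.mean() - before_values.mean()

# A negative result means the second period has the lower average.
print(difference)
```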

# Task 8: Instructions

Make a bootstrap analysis of the difference in mean monthly proportion of deaths.

- Within your `for` loop:
    - `boot_before` and `boot_after` should be sampled with replacement from `before_proportion` and `after_proportion`.
    - The difference in means should be appended to `boot_mean_diff`.
- Calculate a 95% `confidence_interval` as the 2.5% and 97.5% quantiles of `boot_mean_diff`.

A bootstrap analysis is a quick way of getting at the uncertainty of an estimate. In your case, the estimate is the `mean_diff` you calculated in Task 7. A bootstrap analysis works by _simulating_ redoing the data collection: values are drawn at random from the data with replacement, so the same value can be drawn many times. Using a `pandas` column `my_col` (also called a Series), this can be done like this:

```
boot_col = my_col.sample(frac=1, replace=True)
```

The estimate is then calculated using `boot_col` instead of `my_col`. This process is repeated a large number of times and the distribution of the bootstrapped estimates represents the uncertainty around the original estimate. If `boot_mean` is a list of bootstrap estimates you can calculate a 95% confidence interval using `pandas`:

```
pd.Series(boot_mean).quantile([0.025, 0.975])
```

If you want to learn more about how the bootstrap works you should check out the course [Statistical Thinking in Python (Part 2)](https://www.datacamp.com/courses/statistical-thinking-in-python-part-2)!
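
Putting the resampling and quantile steps together, a complete bootstrap of a difference in means might look like the sketch below. The data, the variable names, and the choice of 1000 repetitions are assumptions made for illustration. The `random_state` argument is there only to make the sketch reproducible and is not required.

```python
import pandas as pd

# Hypothetical samples; not the project's data.
group_a = pd.Series([0.10, 0.12, 0.08, 0.11, 0.09])
group_b = pd.Series([0.02, 0.03, 0.01, 0.02, 0.04])

boot_diffs = []
for i in range(1000):
    # Resample each group with replacement, keeping the original size.
    boot_a = group_a.sample(frac=1, replace=True, random_state=i)
    boot_b = group_b.sample(frac=1, replace=True, random_state=1000 + i)
    # Recompute the estimate on the resampled data.
    boot_diffs.append(boot_b.mean() - boot_a.mean())

# The 2.5% and 97.5% quantiles bound a 95% confidence interval.
interval = pd.Series(boot_diffs).quantile([0.025, 0.975])
print(interval)
```

If the whole interval lies below zero, as it does for these made-up numbers, the data support a lower mean in the second group.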

# Task 9: Instructions

- Given the data Semmelweis collected, is it `True` or `False` that doctors should wash their hands?

Congratulations, you've made it this far! If you haven't tried it already, you should **check** your project now.

Good luck! :)