# Assignment 3. Analysis of flood data

![Jesus Green lock](res/jesus_green_lock.jpg)

This assignment asks you to analyse data provided by the UK Environment Agency concerning flooding. The agency offers an [API for near real-time data](http://environment.data.gov.uk/flood-monitoring/doc/reference) covering:
* flood warnings and flood alerts
* flood areas to which warnings or alerts apply
* measurements of water levels and flows
* information on the monitoring stations providing those measurements

In this assignment we will be working with historical data of water level measurements, at several monitoring stations in Cambridge and on the Cam. The dataset is available as a CSV file at [https://teachingfiles.blob.core.windows.net/scicomp/flood.csv](https://teachingfiles.blob.core.windows.net/scicomp/flood.csv). If you go home over Christmas and are worried about flooding, and want to extend these analyses to your home area, see [A3. Data import and cleanup](A3.%20Data%20import%20and%20cleanup.ipynb) for details of how to fetch data from a web API.

_Image by [N. Chadwick](http://www.geograph.org.uk/photo/4800494)._

<div class="alert alert-warning">**Goal of the assignment.** 
    This assignment tests your skill at manipulating dataframes and indexed arrays, and your flair at plotting data.
    You should use Pandas and numpy operations for data manipulation, rather than <code style="background-color:inherit">for</code> loops,
    wherever possible. You can organize your code however you like. Please create a new notebook for your answers to this assignment.
</div>

# Part A
This section is worth 1 mark. Check your answers as described in [&sect;0.3](0.%20About%20this%20course.ipynb#grader) using `section='assignment3a'`.

```
# Import modules, and give them short aliases so we can write e.g. np.foo rather than numpy.foo
import numpy as np
import pandas
import matplotlib
import matplotlib.pyplot as plt
# The next line is a piece of magic, to let plots appear in our Jupyter notebooks
%matplotlib inline
```

```
!pip install ucamcl --upgrade
import ucamcl
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp', section='assignment3a')
# paste in whatever section is appropriate for the section of notes / assignment you're working on
```

Waiting for you to log in .. done.


**Question 1.** Import the CSV file and print out a few lines, choosing the lines at random using `np.random.choice`. The file mistakenly includes the village Cam, near Bristol. Remove these rows, and store what's left as the variable `flood`. How many rows are left?
```
# Submit your answer:
GRADER.submit_answer(GRADER.fetch_question('q1'), num_rows)
```
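As a rough sketch of the two operations involved, here they are on a small made-up dataframe. The column names `town` and `value` and all the values are invented for illustration; they are not taken from `flood.csv`. `np.random.choice` picks row positions, and a boolean mask drops the unwanted rows.

```python
import numpy as np
import pandas

# Toy data, invented for illustration; the real file has different columns and values
toy = pandas.DataFrame({'town': ['A', 'B', 'A', 'C', 'B', 'A'],
                        'value': [0.1, 0.4, 0.2, 0.9, 0.5, 0.3]})

# Pick 3 distinct row positions at random, and show those rows
rows = np.random.choice(len(toy), size=3, replace=False)
print(toy.iloc[rows])

# Keep only the rows whose town is not 'C', using a boolean mask
kept = toy.loc[toy['town'] != 'C']
print(len(kept))
```

Passing `replace=False` avoids printing the same row twice.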

**Question 2.** Complete this table of the number of entries in this dataset for each town and river.

|  | Cambridge | Great Shelford | Milton | Weston Bampfylde |
| --- | --- | --- | --- | --- |
| **Bin Brook** | 2665 | 0 | | |
| **River Cam** | | | | |

```
# Submit your answer, as an unstacked indexed array:
GRADER.submit_answer(GRADER.fetch_question('q2'), your_answer.values)  # .as_matrix() was removed in pandas 1.0
```
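If the phrase "unstacked indexed array" is unfamiliar, the following sketch shows the general shape on made-up data. The names `river` and `town` here are just labels for a toy dataframe, and the counts have nothing to do with the real dataset.

```python
import pandas

# Toy data, invented for illustration
toy = pandas.DataFrame({'river': ['X', 'X', 'Y', 'Y', 'Y'],
                        'town':  ['A', 'A', 'A', 'B', 'B']})

# Split by (river, town) and count the rows in each group.
# The result is an indexed array: a Series with a two-level index.
counts = toy.groupby(['river', 'town']).size()

# Unstack the town level into columns; combinations that never occur become 0
table = counts.unstack('town', fill_value=0)
print(table)
```

The result has one row per `river` and one column per `town`, which is the layout of the table above.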

**Question 3.** Each measurement station has a unique `measure_id` and `label`. Complete this table of the number of measurement stations for each town and river. Use only the Pandas operations for split-apply-combine; don't use any numpy operations, Python `for` loops, or list comprehensions.

| | Cambridge | Great Shelford | Milton | Weston Bampfylde |
| --- | --- | --- | --- | --- |
| **Bin Brook** | 1 | 0 | | |
| **River Cam** | | | | |

```
# Submit your answer. Let your_answer be an unstacked indexed array.
GRADER.submit_answer(GRADER.fetch_question('q3'), your_answer.values)  # .as_matrix() was removed in pandas 1.0
```
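One way to count distinct items per group without leaving Pandas is `nunique`, sketched here on made-up data. The column `sensor` and its values are hypothetical, and this is only one possible approach rather than the required one.

```python
import pandas

# Toy data, invented for illustration: several readings per sensor
toy = pandas.DataFrame({'river':  ['X', 'X', 'X', 'Y', 'Y'],
                        'town':   ['A', 'A', 'B', 'B', 'B'],
                        'sensor': ['s1', 's1', 's2', 's3', 's4']})

# Split by (river, town), then count the distinct sensors in each group
n_sensors = toy.groupby(['river', 'town'])['sensor'].nunique()

# Unstack town into columns, filling absent combinations with 0
table = n_sensors.unstack('town', fill_value=0)
print(table)
```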

**Question 4.** 
Each measurement station has low and high reference levels, in columns `low` and `high`. In this dataset, the reference levels are stored for every measurement, but we can verify that every `measure_id` has a unique pair `(low,high)` with
```
assert all(flood.groupby(['measure_id','low','high']).apply(len).groupby('measure_id').apply(len) == 1), "Reference levels non-constant"
```
Add a column `norm_value`, by rescaling `value` linearly so that `value=low` corresponds to `norm_value=0` and `value=high` corresponds to `norm_value=1`.
Use `np.nanpercentile` to find the [_tercile points_](https://en.wiktionary.org/wiki/tercile), the two values that split the entire `norm_value` column into three roughly equal parts.

```
# Submit your answer:
GRADER.submit_answer(GRADER.fetch_question('q4'), [tercile1, tercile2])
```
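To illustrate the kind of linear rescaling described in Question 4, and how `np.nanpercentile` skips missing values, here is a sketch on made-up numbers. It uses quartiles rather than terciles, and the values of `low` and `high` are arbitrary.

```python
import numpy as np

# Toy measurements, invented for illustration; one is missing
value = np.array([1.0, 2.0, 3.0, np.nan, 4.0])
low, high = 1.0, 3.0

# Linear map sending low to 0 and high to 1; values above high map above 1
norm = (value - low) / (high - low)

# Percentiles computed over the non-missing entries only
q25, q75 = np.nanpercentile(norm, [25, 75])
print(norm, q25, q75)
```

The plain `np.percentile` would return `nan` here, because one entry is missing.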

**Question 5.** Complete the following dataframe, listing the number of observations in each tercile and the total number of observations per station.

| label | norm_value_tercile | n | ntot |
| --- | --- | --- | --- |
| Bin Brook | low | 19 | 2665 |
| Bin Brook | med | 1653 | 2665 |
| Bin Brook | high | 993 | 2665 |
| Cambridge Jesus Lock | high | 1906 | 2651 |

```
# Submit your answer:
assert np.array_equal(your_dataframe.columns, ['label','norm_value_tercile','n','ntot']), 'columns are wrong'
GRADER.submit_answer(GRADER.fetch_question('q5'), your_dataframe)
```
<div class="alert alert-info">
Update on 2017-12-09: When submitting a dataframe, please make sure that the columns have the names and order shown. The rows can be in any order.
</div>
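A dataframe of this shape needs two ingredients: labelling each row by the band it falls in, and attaching a per-group total to every row. The sketch below shows one possible way to do each on made-up data. The names `station`, `x` and `band` and the cut points 0.3 and 0.6 are all invented for illustration.

```python
import numpy as np
import pandas

# Toy data, invented for illustration
toy = pandas.DataFrame({'station': ['s1', 's1', 's1', 's2', 's2'],
                        'x': [0.1, 0.5, 0.9, 0.2, 0.8]})

# Label each row by the band its x falls in; the cut points are arbitrary
toy['band'] = pandas.cut(toy['x'], bins=[-np.inf, 0.3, 0.6, np.inf],
                         labels=['low', 'med', 'high'])

# Count rows per (station, band), keeping only combinations that occur
out = toy.groupby(['station', 'band'], observed=True).size().reset_index(name='n')

# transform('sum') returns one value per row, so the total lines up with each count
out['ntot'] = out.groupby('station')['n'].transform('sum')
print(out)
```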

**Question 6.** Complete this dataframe listing the fraction of observations in each tercile per station:

| label | low | med | high |
| --- | --- | --- | --- |
| Bin Brook | 0.007 | 0.620 | 0.373 |
| Cambridge | 0.807 | 0.194 | 0.000 |

```
# Submit your answer:
assert np.array_equal(your_dataframe.columns, ['label','low','med','high']), 'columns are wrong'
GRADER.submit_answer(GRADER.fetch_question('q6'), your_dataframe)
```
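Turning a long table of counts into a wide table of row-wise fractions might look like the following sketch. The data and the names `station`, `band` and `n` are made up, and this is only one of several ways to do it.

```python
import pandas

# Toy counts in long format, invented for illustration
counts = pandas.DataFrame({'station': ['s1', 's1', 's2', 's2'],
                           'band':    ['low', 'high', 'low', 'high'],
                           'n':       [1, 3, 2, 2]})

# One row per station, one column per band
wide = counts.pivot(index='station', columns='band', values='n')

# Divide each row by its own total, so that every row sums to 1
frac = wide.div(wide.sum(axis=1), axis=0)

# Put the columns in the desired order and turn the index back into a column
frac = frac[['low', 'high']].reset_index()
frac.columns.name = None
print(frac)
```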

**Question 7.** Fill in the rest of this indexed array, giving the `low` and `high` values for each measurement station:

| label | ref | |
| --- | --- | --- |
| Bin Brook | high | 0.368 |
| | low | 0.057 |
| Cambridge | high | 1.250 |
| | low | 0.141 |

```
# Submit your answer. Let your_answer be an indexed array.
GRADER.submit_answer(GRADER.fetch_question('q7'), your_answer.reset_index(name='val'))
```
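For reference, `stack` is the inverse of `unstack`: it moves column labels into an inner index level, giving an indexed array with one entry per (row, column) pair. Here is a sketch on made-up data, where the names `station`, `lo` and `hi` are hypothetical.

```python
import pandas

# Toy data, invented for illustration: lo and hi are repeated on every row of a station
toy = pandas.DataFrame({'station': ['s1', 's1', 's2'],
                        'lo': [0.1, 0.1, 0.2],
                        'hi': [1.0, 1.0, 2.0]})

# One row per station, taking the first (lo, hi) seen for each
per_station = toy.groupby('station')[['lo', 'hi']].first()

# Name the column axis, so the new index level gets that name when stacked
per_station.columns.name = 'ref'
arr = per_station.stack()
print(arr)

# reset_index(name=...) turns the indexed array back into a dataframe
print(arr.reset_index(name='val'))
```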

# Part B
This section is worth 1 mark. There is no automated testing of your answers here, but your code may be assessed in the ticking session.

You should pay attention to axis ranges, axis labelling, colour schemes, etc. in reproducing the plot, though you shouldn't aim to be pixel-perfect. You will need to spend time [searching](https://stackoverflow.com/questions/tagged/matplotlib) for how to control matplotlib plots.

**Question 8.** Reproduce this plot:

![fraction of observations in each tercile](res/ass3_q7.png)

**Question 9.** Reproduce this plot:

![Water levels over time with reference](res/ass3_q9_v2.png)


The light shaded area shows the range from `low` to `high` for each station. The dark shaded area shows the inter-tercile range, `low+tercile1*(high-low)` to `low+tercile2*(high-low)` where `tercile1` and `tercile2` are your answers to Question 4. They can be plotted with [`ax.axhspan`](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axhspan.html#matplotlib.axes.Axes.axhspan).
Here are some code snippets for working with datetimes that may be helpful.
```
# Create a column with datetime objects
import datetime, pytz
def as_datetime(s): return datetime.datetime.strptime(s, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=pytz.UTC)
flood['datetime'] = np.vectorize(as_datetime)(flood['t'])

# Date-axis control, taken from http://matplotlib.org/examples/api/date_demo.html
# Given a matplotlib axis, print out date labels nicely
ax.xaxis.set_major_locator(matplotlib.dates.WeekdayLocator(byweekday=matplotlib.dates.MO, tz=pytz.UTC))
ax.xaxis.set_minor_locator(matplotlib.dates.DayLocator(tz=pytz.UTC))
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%a %d %b'))
# then, at the end,
fig.autofmt_xdate(bottom=0.2, rotation=-30, ha='left')
```
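The snippets above assume that `fig` and `ax` already exist. As a minimal sketch of how they might be created, and of what `ax.axhspan` does, here is a plot of made-up data. The times, levels, band limits and colours are all invented, and `pandas.to_datetime` is used here as an alternative way to parse timestamps.

```python
import matplotlib.pyplot as plt
import pandas

# Toy time series, invented for illustration
t = pandas.to_datetime(['2017-12-01T00:00:00Z',
                        '2017-12-02T00:00:00Z',
                        '2017-12-03T00:00:00Z'])
level = [0.2, 0.5, 0.3]

fig, ax = plt.subplots()

# Shade a horizontal band across the full width of the axes, between y=0.1 and y=0.4
ax.axhspan(0.1, 0.4, color='lightgrey', alpha=0.5)

# Draw the series on top of the band
ax.plot(t, level)
ax.set_ylabel('level')

# Tilt the date labels so they don't overlap
fig.autofmt_xdate()
```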

<div class="alert alert-info">
Update on 2017-12-05: Fixed the dark-shaded areas, which had previously been plotted in the wrong place.
</div>