# Repetition from days 1+2

We went through
 * Basic usage of Python: operators, data structures, flow control.
 * Manual and partly automatic parsing of text files.
 * Some basic and not-so-basic plotting: double y-axes, logarithmic y-axis, line graphs, bars and filled areas.

Questions
 * Anything that was (particularly) unclear?
 * Follow-up questions to what we went through?

# Day 3, before lunch: Time series analysis


Next step: read data using pandas, and make use of its time series analysis and statistics functions.

But before we can start with that, we need to understand a little about the basic data structure in pandas, the ``DataFrame``. Very briefly, a ``DataFrame`` is a bit like a table in an Excel/Calc spreadsheet:

 * It has columns and rows (hence two-dimensional structure).
 * Both columns and rows can have a header (a name).
 * ``DataFrames`` also have an index, a unique identifier for each entry (i.e. row) in the table.
 * Columns can have different data types.

Let's see how we can build a ``DataFrame`` from the structures we already know.

In [None]:
import pandas as pd

# interactive, code-along
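As a reference for the code-along, here is a minimal sketch of one way to build a ``DataFrame`` from a dictionary of lists. The values and the names ``data``, ``dates`` and ``example`` are made up for illustration and are not taken from the course files.

```python
import pandas as pd

# Hypothetical example data: three days of made-up observations
data = {
    'station': ['Bulken', 'Bulken', 'Bulken'],   # strings
    'RR': [0.0, 12.5, 3.2],                      # floats (precipitation in mm)
    'wet': [False, True, True],                  # booleans
}
dates = pd.to_datetime(['2018-08-01', '2018-08-02', '2018-08-03'])

# Dictionary keys become column names, the dates become the index
example = pd.DataFrame(data, index=dates)

print(example)
print(example.dtypes)    # one data type per column
print(example.index)     # the index identifies each row
```

Each column keeps its own data type, while all columns share the same index.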

Now that we have a rough idea of what a DataFrame is and how it might be useful, let's see how we can use it for analysing time series. We start by reading the rain time series from Bulken into pandas. pandas comes with its own function to read text files, one that is more versatile than the functions we used previously.

In [None]:
df = pd.read_table(
    'd1s2/rr24_Bulken.txt', encoding='utf8', 
    header=13,
    skipfooter=12, engine='python',  # The file contains 12 footer lines, and skipfooter requires the python parsing engine
    sep=r'\s+',                      # Column separation by one or more whitespace characters (raw string avoids an invalid escape)
    parse_dates=[1,], dayfirst=True, # Second column contains dates, in European format
    index_col=1,                     # Use the date column as index
)


#df.RR.plot()
df

We can apparently directly plot ``df.RR``, but what else can we do with it?

In [None]:
dir(df.RR)

Many things available! We'll dive into a few of them later. 

First, a quick and easy data analysis: cumulative precipitation during that period.

In [None]:
type(df.RR.cumsum())

Note: ``cumsum()`` returns a (time) series, so we can work with that in exactly the same way as ``df.RR``.

In [None]:
df.RR.cumsum().plot()

Second example: find the dates on which the 24-hour precipitation exceeded 50 mm.

In [None]:
df.index[df.RR > 50]

It is worth taking some time to figure out in detail what happens here.

In [None]:
type(df.RR > 50), (df.RR > 50).dtype, len(df.RR)

We're using a boolean time series to select dates.

We could also use any other (random) boolean series of the same length, here 100.

In [None]:
import numpy as np

# interactive, code-along

In [None]:
# interactive, code-along
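The selection mechanism can be sketched on synthetic data. The series ``rr`` below is a made-up stand-in for ``df.RR`` (random values, not the Bulken observations), and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df.RR: 100 days of made-up precipitation values
rng = np.random.default_rng(0)
dates = pd.date_range('2018-08-01', periods=100, freq='D')
rr = pd.Series(rng.gamma(shape=0.5, scale=20.0, size=100), index=dates, name='RR')

mask = rr > 50                    # boolean Series with the same index as rr
print(mask.dtype, mask.sum())     # dtype is bool; the sum counts the True entries

heavy_dates = rr.index[mask]      # dates for which the condition holds

# Any boolean array of matching length works as a selector, even a random one
random_mask = rng.random(100) > 0.9
random_dates = rr.index[random_mask]
print(random_dates)
```

The comparison itself does no selecting; it only produces the boolean series, which is then used as a selector in a second step.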

Using the same mechanism, you can retrieve the entire data rows instead of only the dates.

In [None]:
# interactive, code-along

Although the index now consists of dates, you can still select rows by their integer position (starting the count at zero!).

In [None]:
# Select the 46th and 59th entry in the DataFrame
# interactive, code-along
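The three ways of selecting rows can be compared side by side in a small sketch. The ``frame`` below is synthetic (the values simply count upwards), and the use of ``iloc`` and ``loc`` is one possible approach, not necessarily the one used in the code-along.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: 100 made-up daily values with a date index
dates = pd.date_range('2018-08-01', periods=100, freq='D')
frame = pd.DataFrame({'RR': np.arange(100, dtype=float)}, index=dates)

wet_rows = frame[frame.RR > 90]                    # boolean mask: whole rows, not just dates
by_position = frame.iloc[[45, 58]]                 # 46th and 59th entry; counting starts at zero
by_label = frame.loc[pd.Timestamp('2018-09-15')]   # label-based access uses the date index

print(wet_rows)
print(by_position)
print(by_label)
```

Note that ``iloc`` counts positions, while ``loc`` looks up index labels, here dates.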

### Exercise 1: Monthly cumulative precipitation

The rain time series from Bulken contains three full months, August, September and October 2018.

Exercises: 
 * Print the cumulative precipitation for these three months to find out which month was wettest.
 * Plot the cumulative precipitation for these three months into the same figure. 

Hints: 
 * You will need the ``datetime`` objects from the ``datetime`` package to compare against the time series index.
 * You might need to convert the cumulative rain time series to a ``numpy`` array for the plotting.
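To illustrate the two hints without giving away the exercise, here is a minimal sketch on a synthetic series of ones, so that the cumulative sum simply counts days. The series ``ones`` and the other names are made up.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Synthetic series: the value 1.0 on each of 100 consecutive days
dates = pd.date_range('2018-07-25', periods=100, freq='D')
ones = pd.Series(np.ones(100), index=dates)

# Comparing the index against datetime objects yields a boolean array
in_sept = (ones.index >= datetime(2018, 9, 1)) & (ones.index < datetime(2018, 10, 1))

sept_cumsum = ones[in_sept].cumsum()     # cumulative sum over the selected period only
sept_values = sept_cumsum.to_numpy()     # plain numpy array, e.g. for plotting

print(sept_cumsum.iloc[-1])              # total over the selected period
```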

In [None]:
from datetime import datetime

# try to solve

In [None]:
import matplotlib.pyplot as plt

# try to solve

## Comparing time series

So far we have only worked with one time series. Let's add a second to have some more analysis options to explore.

Unfortunately, the date format of ``rro_Bulken.txt`` is not recognised automatically by pandas, so we need to supply our custom conversion function.

In [None]:
date_parser = lambda datestr: datetime.strptime(datestr, '%d%m%Y')

# interactive, code-along

This ``lambda`` is essentially just a shorthand for defining a function. We can use ``date_parser`` just like any other function.

In [None]:
date_parser('23012019')
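To make the shorthand explicit, the sketch below places the ``lambda`` next to an equivalent regular function definition. The name ``date_parser_long`` is made up for this comparison.

```python
from datetime import datetime

# Shorthand version, as used above
date_parser = lambda datestr: datetime.strptime(datestr, '%d%m%Y')

# Equivalent regular function definition (hypothetical name)
def date_parser_long(datestr):
    return datetime.strptime(datestr, '%d%m%Y')

print(date_parser('23012019'))
print(date_parser_long('23012019'))
```

Both calls return the same ``datetime`` object.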

In [None]:
df2 = pd.read_table('d1s2/rro_Bulken.txt', encoding='latin1', 
                   header=None, names=['Dato', 'Level', 'Discharge', 'p75', 'p50', 'p25'],
                                                    # Custom header information
                   comment='#',                     # Ignore lines starting with #
                   na_values=['----', ],            # Custom marker for missing values
                   sep=r'\s+',                      # Column separation by one or more whitespace characters
                   parse_dates=[0,], date_parser=date_parser, 
                                                    # First column contains dates, custom format
                   index_col=0,                     # Use the date column as index
)

A quick sanity check to see whether we got what we expected.

In [None]:
# interactive, code-along

### Correlation analyses

Is river runoff correlated to precipitation?

In [None]:
# interactive, code-along

Seems like it!

But what about the median discharge and precipitation?

In [None]:
# interactive, code-along
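One possible way to compute such correlations is ``Series.corr``, which aligns the two series on their index before correlating. The sketch below uses synthetic data; ``rain`` and ``runoff`` are made-up stand-ins for the Bulken series, and whether the code-along uses this exact call is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range('2018-08-01', periods=100, freq='D')

# Made-up precipitation, and a made-up runoff that depends on it plus noise
rain = pd.Series(rng.gamma(0.5, 20.0, size=100), index=dates)
runoff = (3.0 * rain + rng.normal(0.0, 5.0, size=100)).iloc[10:]   # shorter period

# Only dates present in both series enter the correlation
r = rain.corr(runoff)
print(f'Pearson correlation: {r:.3f}')
```

Because the alignment happens automatically, the two series do not need to cover the same period for this simple check.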

More advanced statistics require the ``scipy.stats`` module. But to be able to use that module, we'll need to homogenise the two time series:
 * Identical index, i.e. dates
 * Remove NaNs
 
Once that is done, we'll use ``scipy.stats`` to estimate the significance of the above correlations and to calculate lagged correlations between the two time series.

In [None]:
# First combine into common data frame, using the same index

# interactive, code-along

In [None]:
# Then extract the columns we are interested in, keeping only those rows where we have data in both

# interactive, code-along
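The two homogenisation steps can be sketched on tiny made-up series. Using ``pd.concat`` and ``dropna`` is one option among several, and the series ``a`` and ``b`` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Two short made-up series with different periods and some gaps
a = pd.Series([1.0, 2.0, np.nan, 4.0],
              index=pd.date_range('2018-08-01', periods=4, freq='D'), name='RR')
b = pd.Series([10.0, np.nan, 30.0, 40.0, 50.0],
              index=pd.date_range('2018-08-02', periods=5, freq='D'), name='Discharge')

# Step 1: common data frame; dates missing in one series are filled with NaN
combined = pd.concat([a, b], axis=1)

# Step 2: keep only the rows where both columns contain data
both = combined[['RR', 'Discharge']].dropna()

print(combined)
print(both)
```

After the second step, both columns have the same length, share one index and contain no NaNs.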

Now we're finally ready to calculate the Pearson correlation, including its significance. Documentation for the function is given here:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr

In [None]:
import scipy.stats

# interactive, code-along

Cool, seems highly significant!

In [None]:
# interactive, code-along

The correlation between median discharge and precipitation, in contrast, seems spurious.

With that clarified, on to lagged correlations.
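For reference, here is a minimal sketch of how ``scipy.stats.pearsonr`` is called and what it returns. The arrays ``x`` and ``y`` are synthetic, and the 99% threshold is chosen only as an example.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)

# Two made-up, NaN-free arrays of equal length
x = rng.normal(size=80)
y = 0.8 * x + rng.normal(scale=0.5, size=80)

# Returns the correlation coefficient and the two-sided p-value
r, p = scipy.stats.pearsonr(x, y)
print(f'r = {r:.2f}, p = {p:.2g}')

significant = p < 0.01      # significant at the 99% level
print(significant)
```

A small p-value means that a correlation of this strength would be unlikely if the two series were unrelated.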

### Exercise 2: Lagged correlations

We would expect the river discharge to be a delayed response to the precipitation. So we might expect that the maximum correlation is not instantaneous, but occurs at a certain lead/lag.

Exercise: At which lead/lag in days does the correlation between precipitation and discharge reach its maximum?
 * First, calculate the time lagged correlations and their significance for lags between -5 and 5 days. The ``pandas.Series.shift`` function might come in handy to do the actual shifting.
 * Second, visualise the results with a correlation lead/lag time series. Mark portions of the time series that are significant at the 99%-level.
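As a hint on the shifting only, the sketch below shows what ``shift`` does to a tiny made-up series. The values move along the index while the index itself stays fixed.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.date_range('2018-08-01', periods=4, freq='D'))

lagged = s.shift(1)     # the value of 1 Aug now sits on 2 Aug; the first entry becomes NaN
led = s.shift(-1)       # the value of 2 Aug now sits on 1 Aug; the last entry becomes NaN

print(pd.concat([s, lagged, led], axis=1, keys=['original', 'shift(1)', 'shift(-1)']))
```

The NaNs introduced at the ends need to be removed again before passing the data on to ``scipy.stats``.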

First do the analysis.

In [None]:
# try to solve

Then the plot.

In [None]:
# do the plot

### Exercise 3: Smoothing time series

The river discharge might respond to the precipitation integrated over the preceding days. Let's correlate smoothed precipitation with discharge, to see whether we can further increase the correlations. Use running means of 1-5 days centred on the given date in combination with lags between -5 and 5 days to find the maximum correlation.

The ``pandas.Series.rolling`` function might come in handy for calculating the running mean.
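A minimal sketch of a centred running mean on a tiny made-up series, to show the call and the behaviour at the edges:

```python
import pandas as pd

s = pd.Series([0.0, 3.0, 6.0, 3.0, 0.0],
              index=pd.date_range('2018-08-01', periods=5, freq='D'))

# 3-day running mean, centred on the given date
smooth = s.rolling(window=3, center=True).mean()
print(smooth)
```

The first and last entries are NaN, because the 3-day window is incomplete there.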

Again, first the analysis ...

In [None]:
# try to solve

... then the plotting.

In [None]:
# do the plot

### Exercise 4: Fitting a linear model

For the combination of running mean and time lag that yields the maximum correlation, create a linear model to estimate discharge from observed precipitation.

We'll first do some preparatory work, before diving into the actual exercise. 

First step is to extract the time lag / smoothing that yielded the maximum correlation. 
 * ``np.argmax`` yields the position of the maximum in the flattened array. A flattened array is one-dimensional and contains the entries of the array in row-major order, which for arrays created in the default way is the order in which they appear in memory.
 * ``np.unravel_index`` then converts this index into the flattened array back to a multi-dimensional index appropriate for the original array.

In [None]:
# interactive, code-along
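The interplay of the two functions can be sketched on a small made-up array. The layout, with rows as smoothing windows and columns as lags, is an assumption for illustration only.

```python
import numpy as np

# Made-up correlations: rows = smoothing windows, columns = lags
corr = np.array([[0.1, 0.3, 0.2],
                 [0.4, 0.9, 0.5]])

flat_pos = np.argmax(corr)                      # position in the flattened array
idx = np.unravel_index(flat_pos, corr.shape)    # back to (row, column)

print(flat_pos, idx, corr[idx])
```

Here the maximum is the fifth entry of the flattened array (position 4), which corresponds to row 1, column 1 of the original array.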

Second step is to recreate the shifted running mean that gave the maximum correlation.

In [None]:
# interactive, code-along

As a preparation for the linear regression, we then collect all relevant ``Series`` in a new data frame, and rename the shifted / smoothed ``RR`` time series in the process.

In [None]:
# Rename column to avoid having two columns named "RR", then concatenate to one dataframe

# interactive, code-along

Finally, we are ready for the actual exercise:
 * Fit a linear model by linear regression. You can use ``scipy.stats.linregress`` to do the actual regression.
 * Create a new time series containing the river discharge modelled from the precipitation observations.
 * Evaluate the linear model fit using a ``plt.scatter``-plot comparing modelled versus observed discharge.
 * Evaluate the linear model by comparing the modelled and observed discharge time series.
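As a hint on the function call only, here is a minimal sketch of ``scipy.stats.linregress`` on synthetic data with a known slope and intercept. The arrays ``x`` and ``y`` are made up; the exercise itself uses the smoothed precipitation and the discharge.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)

# Made-up data following y = 2 x + 5, plus some noise
x = rng.uniform(0.0, 30.0, size=60)
y = 2.0 * x + 5.0 + rng.normal(scale=1.0, size=60)

fit = scipy.stats.linregress(x, y)
print(fit.slope, fit.intercept, fit.rvalue, fit.pvalue)

# Apply the fitted model to the predictor
y_model = fit.slope * x + fit.intercept
```

The fitted slope and intercept should come out close to the values used to generate the data.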

Steps 1+2: Fitting the linear model and creating the modelled discharge time series

In [None]:
# try to solve

Step 3: Evaluation using scatter plot

In [None]:
# try to solve

Step 4: Evaluation comparing the time series directly

In [None]:
# try to solve