# Analysing water levels of the river Isar

In light of the recent flooding catastrophe with more than 170 dead, I became interested in monitoring, and maybe predicting, the water levels of the river in my hometown Munich: the Isar.

In this project you will get an idea of how I refine my workflow and identify interesting data. I will also improve my skills in web scraping with Beautiful Soup and in time series analysis.

The project is structured in three main parts:

1. Problem formulation and subject understanding
2. Data source identification and web scraping
3. Data wrangling and time series analysis

## 0. Loading packages

In [140]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
from datetime import datetime, timedelta
from datetime import date

now = datetime.today()

import os
from os import walk
dirname = os.getcwd()


In [141]:
#Script handling parameters
scraping = False
concatenating = True
deleting_scraped_files = False

## 0. Project goal and motivation

I started this project to get familiar with two skillsets in my data science learning journey:

Tools and workflow technologies
- Visual Studio Code
- GitHub
- Python scripting

Data science subject
- Time Series analysis
- Time Series forecasting

I also see it as a portfolio project that highlights my current skills in the subjects mentioned above, in storytelling and in general Python programming. I will probably use a lot of resources from practitioners, learners, professionals and amateurs. For this reason I want to share my project and the insights I've gained with you, and I'm more than curious about your comments, suggestions and enhancements.

Acknowledgements

## 1. Problem formulation and subject understanding

The Isar is a 292.3-kilometer-long river that originates in the Karwendel mountains in Tyrol, Austria. It flows through Munich, the capital of Bavaria, and merges with the Danube south of Deggendorf. [Source: wikipedia.com/isar]



# 1. Data collecting phase

## 1.1. Data Source identification

The Bavarian ministry for the environment runs the so-called "Hochwassernachrichtendienst" (HND, flood information service), which provides the water levels of several rivers and lakes. It also offers information on precipitation and other measurements.

We focus on scraping the water levels first. The site is structured as follows:

    Basic Link:           https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen
    Additional Parameter: ?days=0&hours=1

We can address the levels per day for the last 30 days, where 0 is today and 29 is the oldest day. The data is provided at hourly intervals, from hour 0 to hour 23. For now we don't need the most recent data, but this might change when upgrading to a more sophisticated scraping approach. At the moment we focus on getting basic scraping to work.
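
The URL scheme described above can be sketched as a small helper. This is only an illustration: the function name `build_scrape_link` and the range checks are assumptions, not part of the HND site or of the notebook.

```python
from urllib.parse import urlencode

# Base URL taken from the text above
BASIC_LINK = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen"


def build_scrape_link(day: int, hour: int) -> str:
    """Return the table URL for a day offset (0 = today) and an hour (0-23).

    Hypothetical helper for illustration only.
    """
    if day < 0:
        raise ValueError("day must not be negative")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    # urlencode keeps the parameter order of the dict
    return BASIC_LINK + "?" + urlencode({"days": day, "hours": hour})


print(build_scrape_link(0, 1))
# https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen?days=0&hours=1
```

Building the query string in one place avoids repeating the string concatenation for every request.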

In [142]:
#Web scraping properties
link = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen?days=0&hours=1"
basic_link = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen"

## 1.2. Web scraping

Currently we scrape the full month of data and discard the duplicates when we merge the DataFrames. This leads to an unnecessarily long scraping time.
To reduce the scraping time, we want to determine which days need to be scraped by comparing the timestamp of the most recent stored data with the current date.
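
The idea can be sketched as a small, testable helper. The name `days_to_scrape` and the `max_days` cap are assumptions for illustration; the cap of 30 follows the 30-day window mentioned in section 1.1. Using a date difference rather than comparing day-of-month numbers keeps the result correct across month boundaries.

```python
from datetime import datetime


def days_to_scrape(last_scraped: datetime, now: datetime, max_days: int = 30) -> list:
    """Return the day offsets (0 = today) needed to cover the gap since the last scrape.

    Hypothetical helper; max_days is an assumed size of the window the site offers.
    """
    gap = (now.date() - last_scraped.date()).days
    # clamp: never negative, never beyond the available window
    gap = max(0, min(gap, max_days - 1))
    return list(range(gap + 1))


# last scrape on 30 August, today is 2 September: offsets 0 to 3 are needed
print(days_to_scrape(datetime(2021, 8, 30, 21), datetime(2021, 9, 2, 9)))
# [0, 1, 2, 3]
```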

In [144]:
#find the history file and read it as df
hist_files = [f for f in os.listdir() if 'isar_pegel_' in f]

if hist_files:
    file = hist_files[0]
    hist_pegel = pd.read_csv(file)
    #the history is stored newest first, so row 0 holds the latest timestamp
    last_scraped_datetime = pd.to_datetime(hist_pegel['Datum Zeit'][0], format='%Y-%m-%d %H:%M:%S')

    #determine the range of days to scrape (a date difference also works across month boundaries)
    days_scrape  = list(range(0, (now.date() - last_scraped_datetime.date()).days + 1))
else:
    #full scrape if no history available
    file = None
    hist_pegel = None
    days_scrape  = list(range(0, 31))

#full hours to scrape
hours_scrape = [i for i in range(0,24)]


In [145]:
#File handling properties
von_string = str(date.today() - timedelta(days_scrape[-1]))
bis_string = str(date.today())
path = dirname + "/ScrapingData/"

save_string = 'isar_pegel' + '_' + von_string +  '_bis_' + bis_string

The pandas `read_html` function is a very powerful tool for getting HTML tables into a DataFrame quickly. We now automate this approach to scrape the full history of the water levels.

In [146]:
#Web scraping
if scraping:

    #collect one frame per request and concatenate once at the end
    #(DataFrame.append was removed in pandas 2.0)
    scraped_frames = []

    for day in days_scrape:
        #when today, only scrape up to the last full hour; past days get all hours
        last_hour = now.hour - 1 if day == 0 else hours_scrape[-1]

        for hour in hours_scrape:
            if hour <= last_hour:
                scrape_link = basic_link + '?days=' + str(day) + '&hours=' + str(hour)
                #note: read_html treats ',' as a thousands separator by default;
                #if the site uses decimal commas, decimal=',' and thousands='.' keep them intact
                water_level = pd.read_html(scrape_link)
                scraped_frames.append(water_level[0])

    #nothing to save if no full hour was available yet
    if scraped_frames:
        df_water_levels_scrape = pd.concat(scraped_frames, ignore_index=True)
        df_water_levels_scrape['Datum Zeit'] = pd.to_datetime(df_water_levels_scrape['Datum Zeit'], format='%d.%m.%Y, %H:%M')
        df_water_levels_scrape.to_csv((path + save_string + '.csv'), index=False)

In [147]:
#concat files in the scraping folder into a single df

if concatenating:

    #list files in the scraping folder and concat to single df
    scraped_files       = os.listdir(path)
    df_from_each_file   = (pd.read_csv(path + f) for f in scraped_files)
    concatenated_df     = pd.concat(df_from_each_file, axis=0, ignore_index=True)

    #sort and drop duplicates
    concatenated_df     = concatenated_df.sort_values(by=['Datum Zeit'], ascending = False)
    concatenated_df     = concatenated_df.drop_duplicates()

    #concat hist and recent scraping
    concatenated_df_2     = pd.concat([hist_pegel, concatenated_df], axis=0, ignore_index=True)
    concatenated_df_2     = concatenated_df_2.sort_values(by=['Datum Zeit'], ascending = False)

    #saving concatenated df to file
    end_date            = concatenated_df_2['Datum Zeit'].iloc[0][:10]
    start_date          = concatenated_df_2['Datum Zeit'].iloc[-1][:10]
    save_string2        = 'isar_pegel' + '_' + start_date +  '_bis_' + end_date

    #remove the old history file only if one was actually loaded
    if hist_pegel is not None:
        os.remove(file)

    concatenated_df_2   = concatenated_df_2.drop_duplicates()

    concatenated_df_2.to_csv((save_string2 + '.csv'), index=False)


In [148]:
#deleting scraped data
if deleting_scraped_files == True:
    for f in scraped_files:
        os.remove(path + f)
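
The merge-and-deduplicate step from section 1.2 can be illustrated on two tiny frames. The sample values and the column name `level_cm` are made up. Deduplicating on station plus timestamp is a variation chosen for this sketch; the notebook itself calls `drop_duplicates()` on whole rows.

```python
import pandas as pd

# two made-up scrapes that overlap in one row (20:00)
first = pd.DataFrame({
    "Messstelle": ["München", "München"],
    "Datum Zeit": ["2021-09-15 20:00:00", "2021-09-15 19:00:00"],
    "level_cm": [80, 81],  # hypothetical column name
})
second = pd.DataFrame({
    "Messstelle": ["München", "München"],
    "Datum Zeit": ["2021-09-15 21:00:00", "2021-09-15 20:00:00"],
    "level_cm": [79, 80],
})

merged = (
    pd.concat([first, second], ignore_index=True)
    .drop_duplicates(subset=["Messstelle", "Datum Zeit"])  # one row per station and hour
    .sort_values("Datum Zeit", ascending=False)            # newest first, as in the history file
    .reset_index(drop=True)
)
print(merged)
```

The overlapping 20:00 row appears only once in `merged`, so three rows remain.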


Since we are now able to automatically scrape and save the water levels, let's take a look at the data we get.

# 2. Data understanding phase

Before we can work with the data, we need to understand what we have collected. I have separated this task into two main parts:

- The quantitative data understanding  
  Here we want to know what the data looks like in the first place: which variables are in the data, how many observations and NaN values there are, which data types occur, what the summary statistics of the numerical variables are and which unique values the categorical variables take.

- The qualitative data understanding  
  The goal is to really understand the data from a domain point of view: how different rivers, measuring points etc. are connected and what certain variables mean. These questions are best answered by consulting the website (FAQ, reading material, documentation etc.) or by interviewing experts in the field. These insights will help us improve the subsequent analysis and forecasting.
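
The quantitative checks listed above can be bundled into one overview table. This is a minimal sketch on made-up sample data; `quantitative_overview` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def quantitative_overview(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, number of missing values, number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })


# tiny made-up sample, not real measurements
sample = pd.DataFrame({
    "Messstelle": ["A", "B", "A"],
    "Wasserstand [cm]": [100, 95, None],
})

overview = quantitative_overview(sample)
print(len(sample), "observations")
print(overview)
```

For the numerical columns, `sample.describe()` would add the usual summary statistics.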

We start with loading the latest data into a dataframe.

In [149]:
#Load the complete csv to dataframe
#use the first file whose name matches the history pattern, if there is one
hist_files = [f for f in os.listdir() if 'isar_pegel_' in f]
waterlevel_hist = pd.read_csv(hist_files[0]) if hist_files else None


In [150]:
waterlevel_hist.head()

|  | Messstelle | Gewässer | Datum Zeit | Wasser­stand [cm] | Änderung seit 2 Std. [cm] | Abfluss [m³/s] | Melde­stufe | Jähr­lichkeit | Vorher­sage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Stegen | Amper | 2021-09-15 21:00:00 | 129 | 0 | 258 | 0 | --- |  |
| 1 | Bad Tölz Brücke | Isar | 2021-09-15 21:00:00 | 65 | 0 | --- | 0 | --- |  |
| 2 | Oberfinning Seepegel | Windachspeicher | 2021-09-15 21:00:00 | 62509 | 1 | --- | 0 | --- |  |
| 3 | Raisting | Rott | 2021-09-15 21:00:00 | 35 | 0 | 022 | 0 | --- |  |
| 4 | Weilheim | Ammer | 2021-09-15 21:00:00 | 47 | 1 | 977 | 0 | --- |  |


In [151]:
waterlevel_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26280 entries, 0 to 26279
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Messstelle                 26280 non-null  object 
 1   Gewässer                   26280 non-null  object 
 2   Datum Zeit                 26280 non-null  object 
 3   Wasser­stand [cm]          26280 non-null  int64  
 4   Änderung seit 2 Std. [cm]  26280 non-null  int64  
 5   Abfluss [m³/s]             26280 non-null  object 
 6   Melde­stufe                26280 non-null  int64  
 7   Jähr­lichkeit              26280 non-null  object 
 8   Vorher­sage                0 non-null      float64
dtypes: float64(1), int64(3), object(5)
memory usage: 1.8+ MB
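
The `Abfluss [m³/s]` column shows up as `object` because missing readings are written as `---`. A minimal sketch of one way to make it numeric, on made-up sample values, is shown here. Values such as `022` in the output above hint that a decimal comma may have been dropped during parsing; that is an assumption which would need checking against the website and is not handled in this sketch.

```python
import pandas as pd

# made-up sample mimicking the '---' placeholder seen in the output above
sample = pd.DataFrame({"Abfluss [m³/s]": ["116", "---", "201"]})

# errors="coerce" turns anything non-numeric (here '---') into NaN
sample["Abfluss [m³/s]"] = pd.to_numeric(sample["Abfluss [m³/s]"], errors="coerce")

print(sample["Abfluss [m³/s]"].dtype)         # float64
print(sample["Abfluss [m³/s]"].isna().sum())  # 1
```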


In [152]:
waterlevel_hist['Datum Zeit'] = pd.to_datetime(waterlevel_hist['Datum Zeit'], format='%Y-%m-%d %H:%M:%S')

In [153]:
waterlevel_messtellen = waterlevel_hist['Messstelle'].unique()
waterlevel_gewässer = waterlevel_hist['Gewässer'].unique()
waterlevel_meldestufe = waterlevel_hist['Melde­stufe'].unique()


In [154]:
print(waterlevel_messtellen)
print(waterlevel_gewässer)
print(waterlevel_meldestufe)

['Stegen' 'Bad Tölz Brücke' 'Oberfinning Seepegel' 'Raisting' 'Weilheim'
 'Peißenberg' 'Oberammergau' 'Inkofen' 'Fürstenfeldbruck' 'Sylvenstein'
 'Leutstetten' 'Ampermoching' 'Oberfinning Speicherabgabe' 'Eching'
 'Starnberg' 'Eschenlohe Brücke' 'Hohenkammer' 'Berg' 'Schlehdorf'
 'Puppling' 'München' 'Freising' 'Landshut Birket' 'Landau' 'Plattling'
 'Garmisch o. d. Partnachmündung' 'Garmisch u. d. Partnachmündung'
 'Kochel' 'Beuerberg' 'Partenkirchen (alt)' 'Lenggries']
['Amper' 'Isar' 'Windachspeicher' 'Rott' 'Ammer' 'Würm' 'Windach'
 'Starnberger See' 'Loisach' 'Glonn' 'Sempt' 'Partnach' 'Ammersee']
[0 4 1 3 2]


In [156]:
waterlevel_hist[waterlevel_hist['Messstelle'] == 'Sylvenstein']

|  | Messstelle | Gewässer | Datum Zeit | Wasser­stand [cm] | Änderung seit 2 Std. [cm] | Abfluss [m³/s] | Melde­stufe | Jähr­lichkeit | Vorher­sage |
|---|---|---|---|---|---|---|---|---|---|
| 9 | Sylvenstein | Isar | 2021-09-15 21:00:00 | 242 | 0 | 116 | 0 | --- |  |
| 35 | Sylvenstein | Isar | 2021-09-15 20:00:00 | 242 | 0 | 116 | 0 | --- |  |
| 83 | Sylvenstein | Isar | 2021-09-15 19:00:00 | 242 | 0 | 116 | 0 | --- |  |
| 104 | Sylvenstein | Isar | 2021-09-15 18:00:00 | 242 | 0 | 116 | 0 | --- |  |
| 130 | Sylvenstein | Isar | 2021-09-15 17:00:00 | 242 | 0 | 116 | 0 | --- |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26142 | Sylvenstein | Isar | 2021-08-13 04:00:00 | 257 | 0 | 201 | 0 | --- |  |
| 26170 | Sylvenstein | Isar | 2021-08-13 03:00:00 | 257 | 0 | 201 | 0 | --- |  |
| 26192 | Sylvenstein | Isar | 2021-08-13 02:00:00 | 257 | -1 | 201 | 0 | --- |  |
| 26228 | Sylvenstein | Isar | 2021-08-13 01:00:00 | 257 | 0 | 201 | 0 | --- |  |
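
For the time series analysis, a per-station selection like this one is easier to handle once it is indexed by time and sorted oldest first. The following is only a sketch on made-up values; `level_cm` is a hypothetical stand-in for the water level column, and the daily mean is just one possible aggregation.

```python
import pandas as pd

# made-up hourly sample in the same newest-first order as the scraped data
sample = pd.DataFrame({
    "Messstelle": ["Sylvenstein"] * 4,
    "Datum Zeit": pd.to_datetime([
        "2021-09-15 01:00:00", "2021-09-15 00:00:00",
        "2021-09-14 23:00:00", "2021-09-14 22:00:00",
    ]),
    "level_cm": [242, 242, 244, 246],  # hypothetical column name
})

station = (
    sample[sample["Messstelle"] == "Sylvenstein"]
    .set_index("Datum Zeit")
    .sort_index()  # oldest first, as most time series tools expect
)

# aggregate the hourly readings to a daily mean water level
daily_mean = station["level_cm"].resample("D").mean()
print(daily_mean)
```

With this sample, the two readings on 14 September average to 245.0 and the two on 15 September to 242.0.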
