# Analysing water levels of the river Isar

In light of the recent flooding catastrophe with more than 170 dead, I became interested in monitoring, and maybe predicting, the water levels of the river in my hometown Munich: the Isar.

In this project you can follow how I refine my workflow and the way I identify interesting data. I will also improve my skills in web scraping with Beautiful Soup and in time series analysis.

The project is structured in three main parts:

1. Problem formulation and subject understanding
2. Data source identification and web scraping
3. Data wrangling and time series analysis

## 0. Loading packages

In [217]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
from datetime import datetime, timedelta
from datetime import date

import os
from os import walk
dirname = os.getcwd()


In [218]:
#Script handling parameters
scraping = False
concatenating = True
deleting_scraped_files = True

## 1. Problem formulation and subject understanding

The Isar is a 292.3 kilometer long river that rises in Austria, close to the Bavarian border. It flows through Munich, the capital of Bavaria, and joins the Danube (Donau) south of Deggendorf. [Source: wikipedia.com/isar]



## 2. Data source identification and web scraping

The Bavarian environment ministry runs the so-called "Hochwassernachrichtendienst" (HND, the flood information service), which publishes the water levels of several rivers and lakes. It also provides information on precipitation and other measurements.

We focus on scraping the water levels first. The site is structured as follows:

    Basic Link:           https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen
    Additional Parameter: ?days=0&hours=1

The levels can be addressed per day for the last 30 days, where 0 is today and 29 is the oldest day. The data is provided at hourly intervals, from hour 0 to hour 23. For now, we don't need the most recent data, but this might change when upgrading to a more sophisticated scraping approach. At the moment we focus on getting a basic scraper to work.
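
The URL scheme described above can be sketched as a small helper. This is only an illustration: the names `BASE_URL` and `build_url` are made up for this example, and the range checks simply mirror the limits stated in the text (days 0 to 29, hours 0 to 23).

```python
from urllib.parse import urlencode

# Base link taken from the text above
BASE_URL = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen"

def build_url(day, hour, base=BASE_URL):
    """Return the table URL for a day offset (0 = today) and an hour (0-23)."""
    if not 0 <= day <= 29:
        raise ValueError("day must be between 0 and 29")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return base + "?" + urlencode({"days": day, "hours": hour})

# Skip today (day 0) and cover the 29 older days, hour by hour
urls = [build_url(d, h) for d in range(1, 30) for h in range(24)]
print(len(urls))   # 696 (29 days x 24 hours)
print(urls[0])     # ...tabellen?days=1&hours=0
```

Building the query string with `urlencode` instead of string concatenation keeps the parameters in one place if more of them are needed later.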

In [219]:
#we create two lists for the days and hours we want to scrape.
days_scrape  = list(range(1, 30))   #1 to 29, day 0 (today) is skipped
hours_scrape = list(range(0, 24))   #0 to 23

In [220]:
#Web scraping properties
link = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen?days=0&hours=1"
basic_link = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen"

In [221]:
#File handling properties
von_string = str(date.today() - timedelta(days_scrape[-1]))
bis_string = str(date.today())
path = dirname + "/ScrapingData/"

save_string = 'isar_pegel' + '_' + von_string +  '_bis_' + bis_string

The pandas 'read_html' function is a very powerful tool to get HTML tables quickly into a dataframe: it returns a list with one dataframe per table found on the page. We now automate this approach to scrape the full history of the water levels.

In [222]:
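#Web scraping
if scraping:

    #collect one table per day and hour (DataFrame.append was removed in pandas 2.0)
    scraped_tables = []

    for day in days_scrape:
        for hour in hours_scrape:
            scrape_link = basic_link + '?days=' + str(day) + '&hours=' + str(hour)
            water_level = pd.read_html(scrape_link)   #list of all tables on the page
            scraped_tables.append(water_level[0])     #the first table holds the levels

    #combine all tables into one df
    df_water_levels_scrape = pd.concat(scraped_tables)

    df_water_levels_scrape['Datum Zeit'] = pd.to_datetime(df_water_levels_scrape['Datum Zeit'], format='%d.%m.%Y, %H:%M')
    os.makedirs(path, exist_ok=True)   #make sure the scraping folder exists
    df_water_levels_scrape.to_csv(path + save_string + '.csv')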
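
One weakness of a plain nested loop is that a single failed request aborts the whole run. Below is a minimal sketch of a more tolerant variant, not the approach used in this notebook. The function `scrape_levels`, the stand-in `fake_fetch` and the column name 'Wasserstand' are assumptions for illustration; only 'Datum Zeit' and its date format come from the code above.

```python
import time
import pandas as pd

def scrape_levels(urls, fetch, pause=0.0):
    """Fetch one table per URL, skip failures, and return (DataFrame, failures)."""
    frames, failed = [], []
    for url in urls:
        try:
            frames.append(fetch(url))
        except Exception as err:      # e.g. network error or no table on the page
            failed.append((url, err))
        time.sleep(pause)             # optional pause between requests
    if not frames:
        return pd.DataFrame(), failed
    df = pd.concat(frames, ignore_index=True)
    # Parse the German timestamp format used by the HND tables
    df["Datum Zeit"] = pd.to_datetime(df["Datum Zeit"], format="%d.%m.%Y, %H:%M")
    return df, failed

# Stand-in for the real fetcher, which could be: lambda url: pd.read_html(url)[0]
def fake_fetch(url):
    if url.endswith("hours=1"):
        raise ValueError("no tables found")
    return pd.DataFrame({"Datum Zeit": ["17.07.2021, 05:00"], "Wasserstand": [123]})

df, failed = scrape_levels(["x?days=1&hours=0", "x?days=1&hours=1"], fake_fetch)
print(len(df), len(failed))   # 1 1
```

Passing the fetcher in as an argument makes it possible to try the loop without hitting the server, and the list of failures shows which day and hour combinations need a retry.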

In [223]:
#concat files in the scraping folder into a single df

if concatenating:

    #look for a history file in the working directory and read it as df
    #index_col=0 reads the saved index back instead of creating an 'Unnamed: 0' column
    hist_pegel = None
    for file in os.listdir():
        if 'isar_pegel_' in file:
            hist_pegel = pd.read_csv(file, index_col=0)

    #read all files in the scraping folder and concat them to one df
    scraped_files       = os.listdir(path)
    df_from_each_file   = [pd.read_csv(path + f, index_col=0) for f in scraped_files]
    concatenated_df     = pd.concat(df_from_each_file, ignore_index=True)

    #concat hist and recent scraping ('!= None' raises an error for a df)
    if hist_pegel is not None:
        concatenated_df = pd.concat([hist_pegel, concatenated_df], ignore_index=True)

    #drop duplicates and sort newest first; both return a new df, so we reassign
    concatenated_df = concatenated_df.drop_duplicates()
    concatenated_df = concatenated_df.sort_values(by=['Datum Zeit'], ascending=False)
    concatenated_df = concatenated_df.reset_index(drop=True)

    #saving concatenated df to file
    end_date            = concatenated_df['Datum Zeit'].iloc[0][:10]
    start_date          = concatenated_df['Datum Zeit'].iloc[-1][:10]
    save_string2        = 'isar_pegel' + '_' + start_date +  '_bis_' + end_date

    concatenated_df.to_csv(save_string2 + '.csv')

    #deleting scraped data
    if deleting_scraped_files:
        for f in scraped_files:
            os.remove(path + f)
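
A detail that matters for the merge step: 'sort_values' and 'drop_duplicates' return a new dataframe and leave the original untouched, so the result has to be assigned. A small sketch with made-up values shows the effect (the column name 'Wasserstand' is an assumption, only 'Datum Zeit' appears in the code above):

```python
import pandas as pd

df = pd.DataFrame({
    "Datum Zeit": ["2021-07-16 05:00:00", "2021-07-17 05:00:00", "2021-07-16 05:00:00"],
    "Wasserstand": [101, 123, 101],   # made-up values
})

df.drop_duplicates()    # result is discarded, df itself is unchanged
print(len(df))          # 3

# Reassigning keeps the result: drop duplicates, then sort newest first
clean = (df.drop_duplicates()
           .sort_values(by="Datum Zeit", ascending=False)
           .reset_index(drop=True))

end_date = clean["Datum Zeit"].iloc[0][:10]     # newest row comes first
start_date = clean["Datum Zeit"].iloc[-1][:10]  # oldest row comes last
print(start_date, end_date)   # 2021-07-16 2021-07-17
```

Sorting the timestamps as strings works here because the saved format (year, month, day, time) orders the same way alphabetically as it does chronologically.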


