# Analysing water levels of the river Isar

In light of the recent flooding catastrophe with more than 170 dead, I became interested in monitoring, and maybe predicting, the water levels of the river in my hometown Munich: the Isar.

This project gives you an idea of how I refine my workflow and identify interesting data. Along the way I also improve my skills in web scraping with Beautiful Soup and in time series analysis.

The project is structured in three main parts:

1. Problem formulation and subject understanding
2. Datasource identification and webscraping
3. Data wrangling and time series analysis

## Project goal and motivation

I started this project to improve and expand two skill sets in my data science toolbox:

Tools and workflow technologies
- Visual Studio Code
- GitHub
- Python scripting

Data science subject
- Time Series analysis
- Time Series forecasting

I also see it as a portfolio project that highlights my current skills in the subjects mentioned above, in storytelling and in general Python programming. I will probably
use a lot of resources from practitioners, learners, professionals and amateurs. For this reason I want to share my project and the insights I've gained with you, and
I'm more than curious about your comments, suggestions and enhancements.

Acknowledgements

## 0. Loading packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
from datetime import datetime, timedelta
from datetime import date

now = datetime.today()

import os
from os import walk
dirname = os.getcwd()


In [2]:
#Script handling parameters
scraping = True
concatenating = True
deleting_scraped_files = False
save_to = 'CSV'
using_sql = False

In [3]:
#csv date format
format_1 = '%Y-%m-%d %H:%M:%S'
#website date format
format_2 = '%d.%m.%Y, %H:%M'
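
The two format strings can be checked against sample timestamps. This is a minimal sketch; the sample values are made up for illustration and only assume that the website shows timestamps in the shape `04.10.2021, 21:00`.

```python
from datetime import datetime

# Format strings as defined in the cell above
format_1 = '%Y-%m-%d %H:%M:%S'   # csv date format
format_2 = '%d.%m.%Y, %H:%M'     # website date format

# Illustrative sample values (assumed, not scraped)
website_stamp = '04.10.2021, 21:00'
csv_stamp = '2021-10-04 21:00:00'

parsed_website = datetime.strptime(website_stamp, format_2)
parsed_csv = datetime.strptime(csv_stamp, format_1)

# Both strings describe the same point in time
print(parsed_website == parsed_csv)

# Converting a website timestamp into the csv representation
print(parsed_website.strftime(format_1))
```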

## 1. Problem formulation and subject understanding

The Isar is a 292.3 kilometer long river that originates in the Karwendel mountains in Tyrol, Austria. It flows through Munich, the capital of Bavaria, and merges with the Danube south of Deggendorf. [Source: wikipedia.com/isar]



# 1. Data collection phase

## 1.1. Data Source identification

The Bavarian ministry for the environment runs the so-called "Hochwassernachrichtendienst" (HND, flood information service), where the water levels of several rivers and lakes are provided. Additionally, there is information on precipitation and other measurements.

We focus on scraping the water levels first. The site is structured as follows:

    Basic Link:           https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen
    Additional Parameter: ?days=0&hours=1

We can address the levels per day for the last 30 days, where 0 is today and 29 is the oldest day. The data is provided at hourly intervals, with hours running from 0 to 23. For now, we don't need the most recent data, but this might change when upgrading to a more sophisticated scraping approach. At the moment we focus on getting basic scraping to work.
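
The URL scheme described above can be sketched as a small helper. `build_scrape_link` is a hypothetical name introduced here for illustration; the only things taken from the site description are the base address and the `days` and `hours` query parameters.

```python
from urllib.parse import urlencode

# Base address as described above
BASIC_LINK = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen"


def build_scrape_link(day, hour, base=BASIC_LINK):
    """Return the table URL for a day offset (0 = today) and an hour (0 to 23).

    Hypothetical helper, not part of the original notebook.
    """
    if day < 0:
        raise ValueError("day must not be negative")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return base + "?" + urlencode({"days": day, "hours": hour})


# Today, hour 1
print(build_scrape_link(0, 1))

# All links for yesterday
links_yesterday = [build_scrape_link(1, hour) for hour in range(24)]
print(len(links_yesterday))
```

Validating the arguments early keeps a typo in the loop bounds from silently requesting pages that do not exist.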

In [4]:
#Web scraping properties
#link = "https://www.hnd.bayern.de/pegel/meldestufen/isar/tabellen?days=0&hours=1"
basic_link = "https://www.hnd.bayern.de/pegel/tabellen"

## 1.2. Web scraping

Currently we scrape the full month of data and discard the duplicates when we merge the DataFrames. This leads to an unnecessarily long scraping time.
To reduce the scraping time, we determine which days need to be scraped by comparing the timestamp of the most recent data with the
current date.
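
The idea can be isolated in a small function that is easy to test. This is only a sketch: `days_to_scrape` and `max_days` are hypothetical names, and the sketch compares calendar dates rather than elapsed 24-hour periods, which is my assumption about the intended behaviour.

```python
from datetime import datetime


def days_to_scrape(last_scraped, now, max_days=31):
    """Return the day offsets (0 = today) that still need to be scraped.

    Hypothetical helper. `last_scraped` is the newest timestamp in the
    history, or None if there is no history yet.
    """
    if last_scraped is None:
        # no history available: scrape everything the site offers
        return list(range(max_days))

    # difference in calendar days, so a partially scraped day is scraped again
    missing_days = (now.date() - last_scraped.date()).days + 1

    # always scrape today, never go back further than max_days
    missing_days = max(1, min(missing_days, max_days))
    return list(range(missing_days))


now = datetime(2021, 10, 4, 21, 30)
print(days_to_scrape(datetime(2021, 10, 2, 21, 0), now))
print(len(days_to_scrape(None, now)))
```

Using calendar dates avoids a subtle gap: with elapsed time, a last scrape at 20:00 two days ago and a run at 01:00 today counts as only one full day, so the remaining hours of the oldest day would be skipped.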

In [5]:
#list history file and read as df
#defaults: full scrape if no history is available
hist_pegel = None
days_scrape = list(range(0, 31))

for file_name in os.listdir():
    #both conditions are tested explicitly; "'a' and 'b' in s" only tests 'b'
    if 'bay_river_pegel' in file_name and file_name.endswith('.csv'):
        hist_pegel = pd.read_csv(file_name)
        hist_pegel['datumzeit'] = pd.to_datetime(hist_pegel['datumzeit'], format=format_1)
        #the history is saved newest first, so the first row is the last scraped timestamp
        last_scraped_datetime = hist_pegel['datumzeit'].iloc[0]

        #determine the range of days to scrape (calendar days, so partial days are included)
        days_scrape = list(range(0, (now.date() - last_scraped_datetime.date()).days + 1))
        break


#full hours to scrape
hours_scrape = list(range(0, 24))




In [6]:
#File handling properties
von_string = str(date.today() - timedelta(days_scrape[-1]))
bis_string = str(date.today() - timedelta(days_scrape[0]))
path = dirname + "/ScrapingData/"

save_string = 'bay_river_pegel' + '_' + von_string +  '_bis_' + bis_string

As you can see, pandas' 'read_html' function is a very powerful tool to get HTML tables quickly into a DataFrame. We now automate this approach to scrape the full history of the water levels.

In [7]:
#Web scraping
if scraping:

    #collect one DataFrame per scraped hour and concatenate once at the end
    #(DataFrame.append was removed in pandas 2.0)
    scraped_frames = []

    for day in days_scrape:
        for hour in hours_scrape:
            #when today, only scrape up to the last completed hour
            if day == 0 and hour > (now.hour - 1):
                continue

            #past days are scraped for all hours
            scrape_link = basic_link + '?days=' + str(day) + '&hours=' + str(hour)
            water_level = pd.read_html(scrape_link)
            scraped_frames.append(water_level[0])

    #only write a file if at least one table was scraped
    if scraped_frames:
        df_water_levels_scrape = pd.concat(scraped_frames, ignore_index=True)

        #df_water_levels_scrape['Datum Zeit'] = pd.to_datetime(df_water_levels_scrape['Datum Zeit'], format='%d.%m.%Y, %H:%M')
        #make sure the target folder exists before writing
        os.makedirs(path, exist_ok=True)
        df_water_levels_scrape.to_csv((path + save_string + '.csv'), index=False)

In [8]:
#concat files in the scraping folder into a single df

if concatenating == True:
    #list files in the scraping folder and concat to single df
    scraped_files       = os.listdir(path)
    df_from_each_file   = (pd.read_csv(path + f) for f in scraped_files)
    concatenated_df     = pd.concat(df_from_each_file, axis=0, ignore_index=True)

    #formatting of column names
    concatenated_df.columns = concatenated_df.columns.str.replace(' ','')
    concatenated_df.columns = concatenated_df.columns.str.replace('[^a-zA-ZäÄöÖüÜ0-9,]', '_', regex=True)
    concatenated_df.columns = concatenated_df.columns.str.lower()

    #search for non available data and mark as nan. We do this step only on newly scraped data
    #datetime formating from the websites date format
    missing_string = 'Derzeit leider keine aktuellen Daten vorhanden.'
    concatenated_df = concatenated_df[concatenated_df != missing_string]
    concatenated_df['datumzeit'] = pd.to_datetime(concatenated_df['datumzeit'], format=format_2)
    
    #Subset because there are NaN in column "Vorhersage", which would drop everything
    concatenated_df = concatenated_df.dropna(subset=['wasser_stand_cm_'])

  

    #sort and drop duplicates
    concatenated_df     = concatenated_df.sort_values(by=['datumzeit'], ascending = True)
    concatenated_df     = concatenated_df.drop_duplicates(['messstelle','datumzeit'])

    #concat hist and recent scraping
    concatenated_df_2     = pd.concat([hist_pegel, concatenated_df], axis=0, ignore_index=True)
    concatenated_df_2     = concatenated_df_2.sort_values(by=['datumzeit'], ascending = False)

    #Clean up by dropping duplicates in observation point and datetime
    concatenated_df_2   = concatenated_df_2.drop_duplicates(['messstelle','datumzeit'])
  
   

In [9]:
concatenated_df.columns

Index(['messstelle', 'gewässer', 'datumzeit', 'wasser_stand_cm_',
       'änderungseit2std__cm_', 'abfluss_m__s_', 'melde_stufe',
       'jähr_lichkeit', 'vorher_sage'],
      dtype='object')

Since our scraping works as intended, we take the time to investigate the outcome. In theory, there should be 24 observations per day and observation point. We check this by saving one observation point to a dedicated DataFrame and grouping by date.

In [10]:
waterlevel_sylvenstein = concatenated_df_2[concatenated_df_2['messstelle'] == 'Sylvenstein']
waterlevel_sylvenstein.groupby(by=waterlevel_sylvenstein['datumzeit'].dt.date).count()

| datumzeit (index) | messstelle | gewässer | datumzeit | wasser_stand_cm_ | änderungseit2std__cm_ | abfluss_m__s_ | melde_stufe | jähr_lichkeit | vorher_sage |
|---|---|---|---|---|---|---|---|---|---|
| 2021-08-23 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-25 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-26 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-27 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-28 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-29 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-30 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-08-31 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |
| 2021-09-01 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 0 |


The scraped data is indeed what we expected, except for the last day, which is the current day, for which scraping is still in progress.

Lastly, we take a look at the individual observations to see if we can spot any mistakes.
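
Instead of reading the counts by eye, the incomplete days can be filtered directly. The following sketch uses a small synthetic DataFrame (two days of made-up hourly data with two rows removed), so the values are illustrative and not scraped.

```python
import pandas as pd

# Synthetic example: 48 hourly timestamps for one observation point
timestamps = pd.Timestamp('2021-08-23') + pd.to_timedelta(list(range(48)), unit='h')
df = pd.DataFrame({
    'messstelle': 'Sylvenstein',
    'datumzeit': timestamps,
    'wasser_stand_cm_': 256,
})

# Remove two observations on the second day to simulate missing data
df = df.drop(index=[30, 31])

# Count observations per observation point and calendar day
counts = df.groupby(['messstelle', df['datumzeit'].dt.date]).size()

# Keep only the days with fewer than 24 observations
incomplete = counts[counts < 24]
print(incomplete)
```

Grouping by observation point as well means the same check works on the full DataFrame without creating one DataFrame per station.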

In [11]:
waterlevel_sylvenstein

|  | messstelle | gewässer | datumzeit | wasser_stand_cm_ | änderungseit2std__cm_ | abfluss_m__s_ | melde_stufe | jähr_lichkeit | vorher_sage |
|---|---|---|---|---|---|---|---|---|---|
| 518769 | Sylvenstein | Isar | 2021-10-04 21:00:00 | 256 | 0 | 226 | 0 | --- |  |
| 518530 | Sylvenstein | Isar | 2021-10-04 20:00:00 | 256 | 0 | 226 | 0 | --- |  |
| 518285 | Sylvenstein | Isar | 2021-10-04 19:00:00 | 256 | -1 | 226 | 0 | --- |  |
| 518038 | Sylvenstein | Isar | 2021-10-04 18:00:00 | 256 | -3 | 226 | 0 | --- |  |
| 517790 | Sylvenstein | Isar | 2021-10-04 17:00:00 | 257 | -4 | 235 | 0 | --- |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 257453 | Sylvenstein | Isar | 2021-08-23 04:00:00 | 256 | 0 | 192 | 0 | --- |  |
| 255592 | Sylvenstein | Isar | 2021-08-23 03:00:00 | 256 | 0 | 192 | 0 | --- |  |
| 256953 | Sylvenstein | Isar | 2021-08-23 02:00:00 | 256 | +1 | 192 | 0 | --- |  |
| 256691 | Sylvenstein | Isar | 2021-08-23 01:00:00 | 256 | 0 | 192 | 0 | --- |  |


We are happy with the results and save the dataframe to the disk.

### Persistently save data

We want to save our results persistently so that we keep the historical data we have put so much thought and effort into. We use .csv for now and also delete the previous history file to keep things tidy.

In [12]:
#delete current history file
if 'bay_river_pegel' in file_name:
    os.remove(file_name)

#determine filename from content taking into account that values could be missing
end_date            = concatenated_df_2['datumzeit'].iloc[0].strftime('%Y-%m-%d')
start_date          = concatenated_df_2['datumzeit'].iloc[-1].strftime('%Y-%m-%d')
save_string2        = 'bay_river_pegel' + '_' + start_date +  '_bis_' + end_date

#saving concatenated df as .csv to disk
concatenated_df_2.to_csv((save_string2 + '.csv'), index=False)
#concatenated_df_2.to_pickle((save_string2 + '.pkl'))

However, by saving to .csv we lose the data type information of our DataFrame. For example, the datetime column would be loaded as object and not as datetime.
With Python's pickle format (.pkl) this information is also saved to the file, allowing us to load the data efficiently in later steps. We take this opportunity
to further refine our data types by looking at the current formats and seeing if we can optimize our data further.
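
The difference between the two formats can be shown with a round trip. This sketch uses a tiny made-up DataFrame and a temporary folder; the file names are placeholders and not the ones used in this project.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative DataFrame with a datetime column
df = pd.DataFrame({
    'messstelle': ['Sylvenstein', 'Sylvenstein'],
    'datumzeit': pd.to_datetime(['2021-10-04 20:00:00', '2021-10-04 21:00:00']),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example.csv')
    pkl_path = os.path.join(tmp_dir, 'example.pkl')

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    # Plain csv read: the datetime column comes back as text
    from_csv = pd.read_csv(csv_path)
    # Pickle read: dtypes are restored as they were saved
    from_pkl = pd.read_pickle(pkl_path)
    # Alternative for csv: ask pandas to parse the column while reading
    from_csv_parsed = pd.read_csv(csv_path, parse_dates=['datumzeit'])

print(from_csv['datumzeit'].dtype)
print(from_pkl['datumzeit'].dtype)
print(from_csv_parsed['datumzeit'].dtype)
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code. For files that are shared with others, .csv with 'parse_dates' is the safer choice.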

In [13]:
concatenated_df_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 262347 entries, 518781 to 256601
Data columns (total 9 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   messstelle             262347 non-null  object        
 1   gewässer               262347 non-null  object        
 2   datumzeit              262347 non-null  datetime64[ns]
 3   wasser_stand_cm_       262347 non-null  object        
 4   änderungseit2std__cm_  262347 non-null  object        
 5   abfluss_m__s_          262347 non-null  object        
 6   melde_stufe            262347 non-null  object        
 7   jähr_lichkeit          262347 non-null  object        
 8   vorher_sage            0 non-null       object        
dtypes: datetime64[ns](1), object(8)
memory usage: 20.0+ MB
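
Almost all columns are stored as object, so one possible refinement is to convert them to numeric and categorical types. The sketch below works on a small synthetic sample that mimics the values seen in the outputs; which columns to convert is my assumption. The discharge column is left out on purpose: values such as 026 in the output suggest that a decimal separator may have been lost during scraping, which would need to be checked first.

```python
import pandas as pd

# Small synthetic sample mimicking the object columns reported by info()
sample = pd.DataFrame({
    'messstelle': ['Sylvenstein', 'Sylvenstein', 'Sylvenstein'],
    'wasser_stand_cm_': ['256', '257', '256'],
    'änderungseit2std__cm_': ['0', '-4', '+1'],
    'jähr_lichkeit': ['---', '---', '---'],
})

numeric_columns = ['wasser_stand_cm_', 'änderungseit2std__cm_', 'jähr_lichkeit']
for column in numeric_columns:
    # errors='coerce' turns entries that are not numbers (like '---') into NaN
    sample[column] = pd.to_numeric(sample[column], errors='coerce')

# Repeated station names are stored only once with the category dtype
sample['messstelle'] = sample['messstelle'].astype('category')

print(sample.dtypes)
```

On the full DataFrame, `concatenated_df_2.info(memory_usage='deep')` before and after the conversion shows how much memory the refinement saves.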


In [14]:
os.stat(save_string2 + '.csv')

os.stat_result(st_mode=33188, st_ino=37779792, st_dev=16777222, st_nlink=1, st_uid=501, st_gid=20, st_size=15012236, st_atime=1633380850, st_mtime=1633380851, st_ctime=1633380851)

In [15]:
#deleting scraped data
if deleting_scraped_files == True:
    for f in scraped_files:
        os.remove(path + f)


Since we are now able to automatically scrape and save the water levels, let's take a look at the data we get.

In [16]:
for row in concatenated_df_2.iloc[:2].iterrows():
    print(row)

(518781, messstelle                        Schenkenau
gewässer                                 Itz
datumzeit                2021-10-04 21:00:00
wasser_stand_cm_                         157
änderungseit2std__cm_                     -7
abfluss_m__s_                            254
melde_stufe                                0
jähr_lichkeit                            ---
vorher_sage                              NaN
Name: 518781, dtype: object)
(518627, messstelle                               Büg
gewässer                           Schwabach
datumzeit                2021-10-04 21:00:00
wasser_stand_cm_                         166
änderungseit2std__cm_                      1
abfluss_m__s_                            026
melde_stufe                                0
jähr_lichkeit                            ---
vorher_sage                              NaN
Name: 518627, dtype: object)
