# Vizugy portal scraper

We start by scraping the latest available water level data from https://www.vizugy.hu/?mapData=VizmerceLista#mapData, a page that lists the water level measuring stations across Hungary. The tables are read with `pandas.read_html`, which relies on an HTML parser backend such as lxml or BeautifulSoup (bs4), so no hand-written HTML selectors are needed.

Once we have the station list, we follow each station link on the page to collect that station's historical water level data (hourly data).

In [None]:
import pandas as pd 
import os
import tqdm
from sqlalchemy import create_engine

In [None]:
VIZUGY_WEBPAGE = 'https://www.vizugy.hu/'

Connect to target database

In [None]:
# PG_URL holds the SQLAlchemy connection URL of the target Postgres database
engine = create_engine(os.getenv("PG_URL"))
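If `PG_URL` is not set, `os.getenv` returns `None` and the failure only surfaces inside `create_engine`. A minimal sketch of an earlier, clearer check follows; the helper name `require_env` is an assumption for illustration and is not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage in this notebook (hypothetical):
# engine = create_engine(require_env("PG_URL"))
```

This keeps the connection cell unchanged apart from the wrapper call.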

Get the first table from the `VizmerceLista` site

In [None]:
df = pd.read_html(f'{VIZUGY_WEBPAGE}?mapData=VizmerceLista#mapData', extract_links="body")
df = df[0]

Because of `extract_links="body"`, every cell in the table body is a tuple. The first element is the cell text and the second is the link target, or `None` if the cell is not a link.

In [None]:
df

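We split the tuples: for `Vízmérce` the text stays in the original column and the link goes to a new `Vízmérce_url` column. For every other column we keep only the text.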

In [None]:
for col in df.columns:
    if col == 'Vízmérce':
        df[[col, f'{col}_url']] = pd.DataFrame(df[col].to_list(), index=df.index)
    else:
        df[col] = df[col].apply(lambda x: x[0])
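The split can be tried on a small toy frame without hitting the website. This is only a sketch: the station names, links, and the `value` column below are made up and merely mimic the `(text, href)` tuples that `extract_links="body"` produces.

```python
import pandas as pd

# Toy data: every cell is a (text, href) tuple, href is None for plain cells.
toy = pd.DataFrame({
    "Vízmérce": [("Station A", "?id=1"), ("Station B", "?id=2")],
    "value": [("120", None), ("95", None)],
})

for col in list(toy.columns):
    if col == "Vízmérce":
        # Keep the text in place and move the link into a new *_url column.
        toy[[col, f"{col}_url"]] = pd.DataFrame(toy[col].to_list(), index=toy.index)
    else:
        # Plain cells: keep only the text part of the tuple.
        toy[col] = toy[col].apply(lambda cell: cell[0])

print(toy)
```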


In [None]:
df.to_sql('raw_list', con=engine, if_exists='replace', index_label='id')
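To see what `if_exists='replace'` and `index_label='id'` do without a Postgres instance, the same call can be run against an in-memory SQLite database. This is a sketch under that assumption; the table contents are made up, and SQLite stands in for the real target database.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
stations = pd.DataFrame({"name": ["Station A", "Station B"]})  # made-up rows

# Writing twice shows that 'replace' drops and recreates the table
# instead of appending to it.
for _ in range(2):
    stations.to_sql("raw_list", con=conn, if_exists="replace", index_label="id")

stored = pd.read_sql("SELECT * FROM raw_list", conn)
print(stored)
```

The DataFrame index ends up in the `id` column, and the row count stays the same after the second write.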

This is what the link to a subpage (station page) looks like:

In [None]:
df.iloc[0]["Vízmérce_url"]
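The scraping loop builds the full address by concatenating `VIZUGY_WEBPAGE` and this link. A slightly more defensive variant, sketched here, resolves the link with `urllib.parse.urljoin`, which also copes with absolute paths and full URLs. The helper `station_url` and the example links are assumptions for illustration, not values taken from the site.

```python
from urllib.parse import urljoin

VIZUGY_WEBPAGE = "https://www.vizugy.hu/"


def station_url(link: str) -> str:
    """Resolve a link scraped from the station list against the site root."""
    return urljoin(VIZUGY_WEBPAGE, link)


# Hypothetical links; the real ones come from the `Vízmérce_url` column.
print(station_url("?mapData=Example&id=123"))
print(station_url("/index.php?id=1"))
```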


Let's go through every subpage (station page) and collect its hourly table. All of this data ends up in `hourly_data`. Scraping these hundreds of pages takes a while (5 minutes or so).

In [None]:
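df_list = []

for _, row in tqdm.tqdm(df.iterrows(), total=len(df)):
    # read_html returns every table on the station page; index 1 is the hourly table
    station_tables = pd.read_html(f'{VIZUGY_WEBPAGE}{row["Vízmérce_url"]}', parse_dates=True)
    hourly = station_tables[1]
    # Tag each reading with the station it belongs to
    hourly["Vízmérce"] = row["Vízmérce"]
    df_list.append(hourly)

# ignore_index=True gives one continuous index, so the `id` written later is unique
hourly_data = pd.concat(df_list, ignore_index=True)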
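With hundreds of requests, a single failed page aborts the whole loop. One possible mitigation, sketched below, is a small retry wrapper with a growing pause between attempts. The function `fetch_with_retry` and its parameters are assumptions, not part of the original notebook; `fetch` can be any callable, for example a lambda around `pd.read_html`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(fetch: Callable[[str], T], url: str,
                     retries: int = 3, delay: float = 1.0) -> T:
    """Call fetch(url), retrying on any exception with a linearly growing pause."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * attempt)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error


# Demo with a fake fetcher that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch(url: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


print(fetch_with_retry(flaky_fetch, "https://www.vizugy.hu/", delay=0))
```

Inside the loop, the `pd.read_html(...)` call could be passed in as `fetch`, which keeps the rest of the cell unchanged.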


In [None]:
hourly_data

Save to Postgres

In [None]:
hourly_data.to_sql('raw_hourly_data', con=engine, if_exists='replace',
           index_label='id')