# Extraction

We use this notebook to identify which stations are usable. To run it yourself, download the XML file listing the stations from [this address](https://dadosabertos.ana.gov.br/documents/ae318ebacb4b41cda37fbdd82125078b/explore), entering 2 in the **tpEst** field, 4 in the **codBacia** field, and 0 in the **telemetrica** field. This lists every registered **pluviometric** (rain gauge) station (code 2) in the São Francisco basin (code 4) that is not telemetric. The first cell reads the file from `data/HidroInventario.xml`.

The file is not included in the repository because of its size.

In [1]:
# Get stations from HidroInventario.xml

from xml.dom.minidom import parse

# Pass the path so the parser reads bytes and honours the XML encoding declaration
doc = parse("data/HidroInventario.xml")

stations = doc.getElementsByTagName("Estacoes")[0].getElementsByTagName("Table")
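
The cells in this notebook repeat the `getElementsByTagName(...)[0].childNodes[0].nodeValue` chain for every field. As a minimal sketch, that lookup could be wrapped in a small helper that returns `None` instead of raising when a tag is missing or empty. The helper name `get_text` and the sample document are hypothetical: the values are made up, and only the tag names come from this notebook.

```python
from xml.dom.minidom import parseString

# Hypothetical sample: the values are invented; only the tag names
# (Estacoes, Table, Codigo, ...) are taken from the notebook cells.
SAMPLE = """<root>
  <Estacoes>
    <Table>
      <Codigo>111</Codigo>
      <Nome>ESTACAO EXEMPLO</Nome>
      <PeriodoPluviometroInicio>1979-01-01 00:00:00</PeriodoPluviometroInicio>
      <PeriodoPluviometroFim></PeriodoPluviometroFim>
    </Table>
  </Estacoes>
</root>"""


def get_text(node, tag):
    """Return the text of the first <tag> descendant, or None if missing or empty."""
    elements = node.getElementsByTagName(tag)
    if not elements or not elements[0].childNodes:
        return None
    return elements[0].childNodes[0].nodeValue


sample_doc = parseString(SAMPLE)
sample_stations = sample_doc.getElementsByTagName("Estacoes")[0].getElementsByTagName("Table")
first = sample_stations[0]

print(get_text(first, "Codigo"))                 # text of an existing tag
print(get_text(first, "PeriodoPluviometroFim"))  # empty tag
print(get_text(first, "Latitude"))               # tag absent from the sample
```

With a helper like this, an empty `PeriodoPluviometroFim` can be tested with `is None` rather than by catching `IndexError`.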

In [2]:
# Filter out stations that have started working less than 20 years ago or stopped along the way

from datetime import datetime, timedelta

limit_date = datetime.now()
threshold_date = limit_date - timedelta(days=365 * 20)
valid_stations = []

for station in stations:
    try:
        # check start date
        started_working = (
            station.getElementsByTagName("PeriodoPluviometroInicio")[0]
            .childNodes[0]
            .nodeValue
        )
        started_working = datetime.fromisoformat(started_working)

        # skip stations that started less than 20 years ago
        if started_working >= threshold_date:
            continue

        # check if stopped working
        stopped_working = station.getElementsByTagName("PeriodoPluviometroFim")[
            0
        ].childNodes

        if len(stopped_working):
            continue
        else:
            print(
                f'Station {station.getElementsByTagName("Codigo")[0].childNodes[0].nodeValue} started working on {started_working} and never stopped'
            )

        # extract info
        try:
            sub_basin = (
                station.getElementsByTagName("SubBaciaCodigo")[0]
                .childNodes[0]
                .nodeValue
            )
            station_id = (
                station.getElementsByTagName("Codigo")[0].childNodes[0].nodeValue
            )
            latitude = (
                station.getElementsByTagName("Latitude")[0].childNodes[0].nodeValue
            )
            longitude = (
                station.getElementsByTagName("Longitude")[0].childNodes[0].nodeValue
            )
            nome = station.getElementsByTagName("Nome")[0].childNodes[0].nodeValue
            state = station.getElementsByTagName("nmEstado")[0].childNodes[0].nodeValue
            municipality = (
                station.getElementsByTagName("nmMunicipio")[0].childNodes[0].nodeValue
            )

            valid_stations.append(
                {
                    "sub_basin": sub_basin,
                    "station": station_id,
                    "latitude": latitude,
                    "longitude": longitude,
                    "name": nome.title(),
                    "state": state.title(),
                    "municipality": municipality.title(),
                }
            )
        except Exception as error:
            print(f"Failed to extract info from station: {error!r}")

    except IndexError:
        continue

print(f"Found {len(valid_stations)} valid stations")

Station 737027 started working on 1911-08-01 00:00:00 and never stopped
Station 737036 started working on 1914-09-01 00:00:00 and never stopped
Station 737040 started working on 1979-01-01 00:00:00 and never stopped
Station 737042 started working on 1979-02-01 00:00:00 and never stopped
Station 737043 started working on 1979-08-01 00:00:00 and never stopped
Station 737044 started working on 1979-08-01 00:00:00 and never stopped
Station 737045 started working on 1979-07-01 00:00:00 and never stopped
Station 737046 started working on 1979-01-01 00:00:00 and never stopped
Station 737048 started working on 1979-01-01 00:00:00 and never stopped
Station 737049 started working on 1979-01-01 00:00:00 and never stopped
Station 737050 started working on 1979-01-01 00:00:00 and never stopped
Station 737051 started working on 1979-01-01 00:00:00 and never stopped
Station 738032 started working on 1911-09-01 00:00:00 and never stopped
Station 738035 started working on 1953-06-01 00:00:00 and never stopped
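
The filter above approximates 20 years as `365 * 20` days, which falls a few days short of 20 calendar years because it ignores leap days. As a rough sketch, the same rule can be written as a small predicate with an exact calendar-year threshold. The names `years_before` and `is_long_running` are hypothetical, and this version treats a station that started exactly 20 years ago as valid.

```python
from datetime import datetime


def years_before(moment, years):
    """Return the same calendar date `years` earlier, mapping Feb 29 to Feb 28."""
    try:
        return moment.replace(year=moment.year - years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year.
        return moment.replace(year=moment.year - years, day=28)


def is_long_running(started, stopped, now, min_years=20):
    """True if the station started at least `min_years` before `now` and has no end date."""
    return stopped is None and started <= years_before(now, min_years)


# Example with fixed dates so the result does not depend on the current day.
reference = datetime(2024, 1, 1)
print(is_long_running(datetime(1979, 1, 1), None, reference))
print(is_long_running(datetime(2010, 1, 1), None, reference))
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the check reproducible.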

Here we save the stations to a CSV file, which we can then use to download the historical series from the [HidroWeb site](https://www.snirh.gov.br/hidroweb/serieshistoricas).

In [3]:
# Save valid stations to csv

import pandas as pd

# Let pandas open the file itself; this avoids blank lines between rows on Windows
df = pd.DataFrame(valid_stations)
df.to_csv("stations.csv", index=False)
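
Every value extracted from the XML is a string, so the coordinates are written to the CSV as text. The sketch below shows one way the file might be read back to get the station codes for the HidroWeb download. It keeps codes as strings and converts coordinates to floats. The function name `load_stations` and the sample rows are hypothetical, and it assumes the coordinates are plain decimal numbers.

```python
import csv
import io

# Hypothetical rows using the column names written by the cell above.
SAMPLE_CSV = """sub_basin,station,latitude,longitude,name,state,municipality
40,111,-9.5,-37.25,Estacao Exemplo,Alagoas,Municipio Exemplo
41,222,-10.0,-38.5,Outra Estacao,Bahia,Outro Municipio
"""


def load_stations(handle):
    """Read station rows, keeping codes as strings and coordinates as floats."""
    rows = []
    for row in csv.DictReader(handle):
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        rows.append(row)
    return rows


loaded = load_stations(io.StringIO(SAMPLE_CSV))
codes = [row["station"] for row in loaded]
print(codes)
```

For the real file, the same function can be called as `load_stations(f)` inside `with open("stations.csv", newline="") as f:`.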