# KNMI data retrieval examples

This Notebook is the starting point of an exciting journey into meteorological data science with one of the highest-quality Dutch weather data sets, provided openly by KNMI, the Royal Netherlands Meteorological Institute.

In this notebook I’ll guide you through the process of retrieving and preparing weather data with Python, using custom-made helper scripts to keep the workflow streamlined. 

Whether you're coding or just exploring, this Notebook sets the stage for the data analysis and visualization steps in the following parts of the Notebook series.

### Disclaimer
I am not officially affiliated with KNMI, and the scripts and insights provided in this Jupyter Notebook are offered on an unofficial basis for educational and exploratory purposes.

### Word of caution
<u>Important</u>: if you request too much data at once (or, occasionally, when the service is offline), the KNMI web service returns an HTML page instead of the requested data. This leads to an <code>AssertionError</code> in the custom-made retrieval helper script.

Create multiple smaller requests to circumvent this issue. In the unfortunate case the service is offline completely, please try again later or contact KNMI about the expected downtime period.
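
Splitting one large request into smaller ones can be sketched as follows. This is a minimal illustration that uses only the standard library; the function name `split_date_range` and the chunk size of 100 days are my own assumptions, not part of the helper scripts or a documented KNMI limit.

```python
import datetime


def split_date_range(start_date, end_date, max_days):
    """Split an inclusive date range into inclusive chunks of at most max_days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    chunks = []
    chunk_start = start_date
    while chunk_start <= end_date:
        # The chunk end is inclusive, hence the "- 1"
        chunk_end = min(chunk_start + datetime.timedelta(days=max_days - 1),
                        end_date)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + datetime.timedelta(days=1)
    return chunks


# Hypothetical chunk size; pick whatever the service accepts in practice
chunks = split_date_range(datetime.date(2023, 1, 1),
                          datetime.date(2023, 12, 31),
                          max_days=100)
for chunk_start, chunk_end in chunks:
    print(chunk_start, "->", chunk_end)
```

Each chunk could then be passed as `start_date` and `end_date` to the retrieval helper, after which the resulting DataFrames can be combined with `pd.concat(..., ignore_index=True)`.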

## Preprocessing steps
We will start by importing the required external and internal modules, scripts and libraries, and by setting up a few (meta)data files. These cells should always be run before running the cells of one or more of the examples further below.

### Import external libraries
Note: these are only the external libraries used directly within this Notebook. The Python helper scripts, imported in the subsequent step, may require the installation of additional libraries.

In [25]:
import datetime
import pandas as pd

### Internal imports
In the code below the custom Python helper scripts are imported.

In [3]:
import knmi_update_metadata

import knmi_meteo_ingest
import knmi_meteo_transform

#### Optional: get additional info for internal import scripts
If you want a brief description of each of the helper scripts and their content, feel free to uncomment the applicable line(s) of code in the cell block below.

In [3]:
# Optional: uncomment line below for more info on script content
# help(knmi_update_metadata)
# help(knmi_meteo_ingest)
# help(knmi_meteo_transform)

### Optional: update all metadata files
Running this cell is <u>optional</u>, since the metadata is not prone to change much over time. Executing the cell will update all metadata files in the corresponding folder.

To run the cell below, you need a stable internet connection, and the KNMI script-based data retrieval service must be available. The same holds for all subsequent data retrieval steps in this notebook.

In [4]:
# Optional: update parameter and meteo station metadata files
update_metadata = False

if update_metadata:
    knmi_update_metadata.knmi_update_all_metadata()

## Data retrieval examples

In this section we will delve deeper into the KNMI web service data retrieval procedure for scripts by working through some examples.

The aim of showcasing different examples is to give you a hands-on idea of the possibilities, so that you can follow along and experiment further if desired.

The showcased examples in this Jupyter Notebook are:
<ol>
<li>Retrieve all daily data from all KNMI meteo stations for a full year</li>
<li>Retrieve all hourly data from all KNMI meteo stations within a given month</li>
<li>Load selected daily data for selected weather stations within a given distance</li>
</ol>

### Example 1: Retrieve all daily data from all KNMI meteo stations for a full year
In the cells below we will retrieve data for all daily parameters for all automatic KNMI meteo stations for a full year. 

We do so by setting <code>start_date</code> and <code>end_date</code>, while keeping <code>meteo_stns_list</code>, <code>meteo_params_list</code> and <code>mode</code> (default value: <code>'day'</code>) undefined for our request to the KNMI service.

In the example below we use 2023 as our year of interest - feel free to change this as desired.

#### Obtain the raw dataset from KNMI service

In [5]:
# Set start and end dates (inclusive) for data retrieval
start_date = datetime.date(2023, 1, 1)
end_date = datetime.date(2023, 12, 31)

# Optional: uncomment below to make end_date exclusive
# end_date -= datetime.timedelta(days=1)

In [6]:
# Get dataset from KNMI web script service
df_day = knmi_meteo_ingest.knmi_meteo_to_df(meteo_stns_list=None,
                                            meteo_params_list=None,
                                            start_date=start_date,
                                            end_date=end_date)

In [7]:
# Show the result
df_day

Unnamed: 0,STN,YYYYMMDD,DDVEC,FHVEC,FG,FHX,FHXH,FHN,FHNH,FXX,...,VVNH,VVX,VVXH,NG,UG,UX,UXH,UN,UNH,EV24
0,209,20230101,215.0,92.0,96.0,140.0,1.0,30.0,22.0,200.0,...,,,,,,,,,,
1,209,20230102,266.0,61.0,69.0,80.0,2.0,50.0,1.0,100.0,...,,,,,,,,,,
2,209,20230103,196.0,83.0,88.0,150.0,22.0,30.0,3.0,200.0,...,,,,,,,,,,
3,209,20230104,233.0,137.0,141.0,170.0,2.0,120.0,18.0,220.0,...,,,,,,,,,,
4,209,20230105,245.0,82.0,90.0,120.0,2.0,60.0,13.0,160.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17150,391,20231227,176.0,27.0,31.0,50.0,18.0,10.0,1.0,110.0,...,,,,,82.0,95.0,1.0,53.0,21.0,2.0
17151,391,20231228,225.0,48.0,48.0,60.0,11.0,40.0,1.0,130.0,...,,,,,72.0,79.0,4.0,67.0,17.0,3.0
17152,391,20231229,236.0,49.0,50.0,60.0,1.0,40.0,13.0,150.0,...,,,,,81.0,92.0,13.0,72.0,1.0,1.0
17153,391,20231230,213.0,30.0,35.0,50.0,1.0,30.0,6.0,90.0,...,,,,,80.0,87.0,8.0,73.0,14.0,3.0


#### Clean (transform) dataset
We can improve the readability of the dataset by applying some preset transformations to the "raw" dataset we just ingested.

As part of this transformation, all parameters are converted to whole units of measurement (e.g. m/s, J/cm², °C) for better interpretability. Furthermore, the column names are converted to a more readable format.

In case you would like to change any of the transformed column names, feel free to modify the file <code>transform_params_day.json</code> in folder 'transform' as desired.
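
To make the idea concrete, here is a minimal sketch of such a rename-and-rescale step on a single row. The mapping structure and the function name `transform_row` are assumptions for illustration only; the helper script and its JSON file may be organized differently. The divisors follow from comparing the raw and cleaned outputs in this notebook (for example, <code>FHVEC</code> 92.0 becomes 9.2 m/s).

```python
# Hypothetical mapping: raw KNMI code -> (readable name, divisor).
# The real mapping lives in the helper script and its JSON file; the
# structure used here is an assumption for illustration only.
PARAM_MAP = {
    "DDVEC": ("vect_avg_wind_dir", 1),      # degrees, already whole units
    "FHVEC": ("vect_avg_wind_speed", 10),   # 0.1 m/s -> m/s
    "FG": ("day_avg_wind_speed", 10),       # 0.1 m/s -> m/s
    "UG": ("day_avg_humidity", 100),        # percent -> fraction
}


def transform_row(raw_row):
    """Rename known parameters and rescale them; keep missing values as None."""
    cleaned = {}
    for code, value in raw_row.items():
        # Unknown codes are passed through unchanged
        name, divisor = PARAM_MAP.get(code, (code, 1))
        cleaned[name] = None if value is None else value / divisor
    return cleaned


# Values taken from the first row of the raw daily output (station 209)
raw = {"DDVEC": 215.0, "FHVEC": 92.0, "FG": 96.0, "UG": None}
cleaned = transform_row(raw)
print(cleaned)
```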

In [8]:
# Apply transformations to the raw dataset
df_day_cleaned = knmi_meteo_transform.transform_param_values(df_day)

In [9]:
# Show the result
df_day_cleaned

Unnamed: 0,station_code,date,vect_avg_wind_dir,vect_avg_wind_speed,day_avg_wind_speed,max_hour_avg_wind_speed,hour_slot_max_avg_wind_speed,min_hour_avg_wind_speed,hour_slot_min_avg_wind_speed,max_gust_speed,...,hour_slot_min_visibility,max_visibility_cat,hour_slot_min_visibility.1,cloudiness_in_eights_cat,day_avg_humidity,max_humidity,hour_slot_max_humidity,min_humidity,hour_slot_min_humidity,evap_ref
0,209,2023-01-01,215.0,9.2,9.6,14.0,1.0,3.0,22.0,20.0,...,,,,,,,,,,
1,209,2023-01-02,266.0,6.1,6.9,8.0,2.0,5.0,1.0,10.0,...,,,,,,,,,,
2,209,2023-01-03,196.0,8.3,8.8,15.0,22.0,3.0,3.0,20.0,...,,,,,,,,,,
3,209,2023-01-04,233.0,13.7,14.1,17.0,2.0,12.0,18.0,22.0,...,,,,,,,,,,
4,209,2023-01-05,245.0,8.2,9.0,12.0,2.0,6.0,13.0,16.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17150,391,2023-12-27,176.0,2.7,3.1,5.0,18.0,1.0,1.0,11.0,...,,,,,0.82,0.95,1.0,0.53,21.0,0.2
17151,391,2023-12-28,225.0,4.8,4.8,6.0,11.0,4.0,1.0,13.0,...,,,,,0.72,0.79,4.0,0.67,17.0,0.3
17152,391,2023-12-29,236.0,4.9,5.0,6.0,1.0,4.0,13.0,15.0,...,,,,,0.81,0.92,13.0,0.72,1.0,0.1
17153,391,2023-12-30,213.0,3.0,3.5,5.0,1.0,3.0,6.0,9.0,...,,,,,0.80,0.87,8.0,0.73,14.0,0.3


In [10]:
# Get distribution overview of numeric cols
df_day_cleaned.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
station_code,17155.0,297.361702,46.271796,209.0,260.0,286.0,331.0,391.0
vect_avg_wind_dir,16741.0,187.695657,91.645513,1.0,110.0,212.0,249.0,360.0
vect_avg_wind_speed,16741.0,4.629096,2.760371,0.1,2.6,4.1,6.1,17.6
day_avg_wind_speed,16744.0,5.240952,2.699477,0.4,3.3,4.7,6.7,18.2
max_hour_avg_wind_speed,16741.0,7.685264,3.343342,1.0,5.0,7.0,10.0,30.0
hour_slot_max_avg_wind_speed,16741.0,10.907353,6.365938,1.0,6.0,11.0,15.0,24.0
min_hour_avg_wind_speed,16741.0,2.782331,2.251414,0.0,1.0,2.0,4.0,15.0
hour_slot_min_avg_wind_speed,16741.0,9.724509,8.549831,1.0,1.0,6.0,19.0,24.0
max_gust_speed,16742.0,12.056146,4.514247,2.0,9.0,11.0,15.0,41.0
hour_slot_max_gust_speed,16742.0,11.755167,6.309347,1.0,7.0,12.0,16.0,24.0


#### Final Notes
At first sight (skimming over the distributions of the parameters), the data seems to have plausible values for the year 2023. The overview also shows that each parameter is measured at only some of the KNMI meteo stations (see the <code>"count"</code> column).
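
The counts can be turned into a coverage share per parameter. The sketch below simply reuses a few counts from the overview above; the variable names are my own.

```python
# Counts copied from the describe() overview above (year 2023, daily data)
total_rows = 17155  # count of station_code, which is present in every row
counts = {
    "vect_avg_wind_dir": 16741,
    "day_avg_wind_speed": 16744,
    "max_gust_speed": 16742,
}

# Share of rows that hold a value, per parameter
coverage = {name: count / total_rows for name, count in counts.items()}

for name, share in coverage.items():
    print(f"{name}: {share:.1%}")
```

On the full DataFrame, `df_day_cleaned.notna().mean()` should give the same kind of per-column share in one go.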

After ingesting the data and running it through the data cleaning helper scripts, it becomes clear that the dataset is already of relatively high quality, as no obvious outliers show up in the distribution overview above. Those of us working in data will likely appreciate the luxury of working with such a relatively 'clean' and well-documented dataset!

### Example 2: Retrieve all hourly data from all KNMI meteo stations within a given month
In the cells below we will retrieve data for all hourly parameters for all automatic KNMI meteo stations for one full month in a year. 

We do so by setting <code>start_date</code>, <code>end_date</code>, and <code>mode</code> (value: <code>'hour'</code>), while keeping <code>meteo_stns_list</code> and <code>meteo_params_list</code> undefined for our request to the KNMI service.

In the example below we use July 2024 as our month of interest - feel free to change this as desired.


In [11]:
# Set start and end dates (inclusive) for data retrieval
start_date = datetime.date(2024, 7, 1)
end_date = datetime.date(2024, 7, 31)

# Optional: uncomment below to make end_date exclusive
# end_date -= datetime.timedelta(days=1)

In [12]:
# Get dataset from KNMI web script service
df_hour = knmi_meteo_ingest.knmi_meteo_to_df(meteo_stns_list=None,
                                             meteo_params_list=None,
                                             start_date=start_date,
                                             end_date=end_date,
                                             mode="hour")

In [13]:
# Show the result
df_hour

Unnamed: 0,STN,YYYYMMDD,HH,DD,FH,FF,FX,T,T10N,TD,...,VV,N,U,WW,IX,M,R,S,O,Y
0,209,20240701,1,320.0,70.0,60.0,80.0,,,,...,,,,,6,,,,,
1,209,20240701,2,320.0,80.0,90.0,100.0,,,,...,,,,,6,,,,,
2,209,20240701,3,320.0,90.0,90.0,110.0,,,,...,,,,,6,,,,,
3,209,20240701,4,330.0,80.0,70.0,100.0,,,,...,,,,,6,,,,,
4,209,20240701,5,320.0,70.0,70.0,90.0,,,,...,,,,,6,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34219,391,20240731,20,0.0,0.0,0.0,10.0,203.0,,188.0,...,,,91.0,,6,,,,,
34220,391,20240731,21,0.0,0.0,0.0,10.0,193.0,,189.0,...,,,97.0,,6,,,,,
34221,391,20240731,22,0.0,0.0,0.0,10.0,184.0,,177.0,...,,,95.0,,6,,,,,
34222,391,20240731,23,0.0,0.0,0.0,10.0,178.0,,174.0,...,,,97.0,,6,,,,,


#### Clean (transform) dataset
Again, we can improve the readability of the dataset by applying some preset transformations to the "raw" dataset we just ingested.

As part of this transformation, all parameters are converted to whole units of measurement (e.g. m/s, J/cm², °C) for better interpretability. Furthermore, the column names are converted to a more readable format.

In case you would like to change any of the transformed column names, feel free to modify the file <code>transform_params_hour.json</code> in folder 'transform' as desired.
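
The transformed hourly dataset also gets a <code>datetime</code> column. Judging from the output shown further below, hour slot 1 maps to 00:00 and hour slot 24 to 23:00, so the timestamp marks the start of each slot. A minimal sketch of that conversion could look like this; the function name `slot_to_datetime` is hypothetical and the helper script may implement it differently.

```python
import datetime


def slot_to_datetime(yyyymmdd, hour_slot):
    """Return the start time of an hour slot (1-24) on the given date."""
    if not 1 <= hour_slot <= 24:
        raise ValueError("hour_slot must be between 1 and 24")
    day = datetime.datetime.strptime(str(yyyymmdd), "%Y%m%d")
    # Slot 1 covers the first hour of the day, so it starts at 00:00
    return day + datetime.timedelta(hours=hour_slot - 1)


print(slot_to_datetime(20240701, 1))   # first slot of 1 July 2024
print(slot_to_datetime(20240731, 24))  # last slot of 31 July 2024
```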

In [14]:
# Apply transformations to the raw dataset
df_hour_cleaned = knmi_meteo_transform.transform_param_values(df_hour)

In [15]:
# Show the result
df_hour_cleaned

Unnamed: 0,station_code,date,hour,avg_wind_dir_last_10min,hour_avg_wind_speed,avg_wind_speed_last_10min,hour_max_gust_speed,temp,min_temp_10cm_last_six_hours,temp_dew_point,...,cloudiness_in_eights_cat,relative_humidity,weather_code_cat,weather_code_obs_method_cat,fog_observed,rain_observed,snow_observed,lightning_observed,ice_observed,datetime
0,209,20240701,1,320.0,7.0,6.0,8.0,,,,...,,,,6,,,,,,2024-07-01 00:00:00
1,209,20240701,2,320.0,8.0,9.0,10.0,,,,...,,,,6,,,,,,2024-07-01 01:00:00
2,209,20240701,3,320.0,9.0,9.0,11.0,,,,...,,,,6,,,,,,2024-07-01 02:00:00
3,209,20240701,4,330.0,8.0,7.0,10.0,,,,...,,,,6,,,,,,2024-07-01 03:00:00
4,209,20240701,5,320.0,7.0,7.0,9.0,,,,...,,,,6,,,,,,2024-07-01 04:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34219,391,20240731,20,0.0,0.0,0.0,1.0,20.3,,18.8,...,,0.91,,6,,,,,,2024-07-31 19:00:00
34220,391,20240731,21,0.0,0.0,0.0,1.0,19.3,,18.9,...,,0.97,,6,,,,,,2024-07-31 20:00:00
34221,391,20240731,22,0.0,0.0,0.0,1.0,18.4,,17.7,...,,0.95,,6,,,,,,2024-07-31 21:00:00
34222,391,20240731,23,0.0,0.0,0.0,1.0,17.8,,17.4,...,,0.97,,6,,,,,,2024-07-31 22:00:00


In [16]:
# Get distribution overview of numeric cols
df_hour_cleaned.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
station_code,34224.0,296.630435,209.0,260.0,285.5,330.0,391.0,46.501899
date,34224.0,20240716.0,20240701.0,20240708.0,20240716.0,20240724.0,20240731.0,8.944403
hour,34224.0,12.5,1.0,6.75,12.5,18.25,24.0,6.922288
avg_wind_dir_last_10min,33295.0,210.698003,0.0,140.0,220.0,270.0,990.0,122.420986
hour_avg_wind_speed,33292.0,4.343236,0.0,2.0,4.0,6.0,21.0,2.578385
avg_wind_speed_last_10min,33295.0,4.351074,0.0,2.0,4.0,6.0,21.0,2.623047
hour_max_gust_speed,33292.0,6.897693,0.0,4.0,6.0,9.0,28.0,3.48453
temp,25296.0,18.062227,7.9,15.3,17.6,20.3,32.1,3.826874
min_temp_10cm_last_six_hours,4216.0,15.572486,3.6,13.4,15.5,17.7,29.6,3.668605
temp_dew_point,25296.0,13.810911,4.6,12.2,13.8,15.5,21.8,2.473836


#### Final Notes
At first sight (skimming over the distributions of the parameters), the data seems to have plausible values for July 2024. The overview also shows that each parameter is measured at only some of the KNMI meteo stations (see the <code>"count"</code> column).

For the example of July 2024, no ice or snow were observed, which makes sense in Dutch summers. Lightning and fog were observed within 1.3% of the hour slots in the dataset, while the average sunshine was 30.0% per hour. 

This sunshine value seems on the low side for summer. We should probably take into consideration that the day and night hours have equal weight here, and all night hours with value 0 bring this value down a lot. Normally, however, we would only consider daytime hours to calculate the average sunshine percentage. With this in mind, 30% looks more plausible.
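
The effect of the night hours on the average is easy to demonstrate with made-up numbers. The values below are purely illustrative and are not taken from the KNMI data; in particular, the choice of daytime slots and the 45% figure are assumptions.

```python
# Made-up hourly sunshine fractions for one day (hour slots 1-24).
# Assumption for illustration: slots 5-20 count as daytime, with 45% sunshine
# in each of them, and all other slots are night (0% sunshine).
DAYTIME_SLOTS = range(5, 21)

sunshine = {slot: (0.45 if slot in DAYTIME_SLOTS else 0.0)
            for slot in range(1, 25)}

# Average over all 24 slots versus the daytime slots only
mean_all_hours = sum(sunshine.values()) / len(sunshine)
daytime_values = [sunshine[slot] for slot in DAYTIME_SLOTS]
mean_daytime = sum(daytime_values) / len(daytime_values)

print(f"All hours:    {mean_all_hours:.0%}")
print(f"Daytime only: {mean_daytime:.0%}")
```

With 16 of the 24 slots counted as daytime, the all-hours average is two thirds of the daytime average.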

### Example 3: Load selected daily data for selected weather stations within a given distance

In the example below we will use the location of <strong>Utrecht</strong> as midpoint for our search area, and set the search radius at <strong>50 km</strong>. For all stations within this radius, we will load a selection of the day-based meteorological data.

#### Prerequisite for geoprocessing: GeoPandas
<u>Important:</u> performing the geo-operations in the cells below requires that the 'geopandas' package be installed in the virtual environment linked to your Jupyter Notebook.

For those new to GeoPandas: it combines the data analysis tools of Pandas with geo-capabilities. Its main object is the GeoDataFrame, which works very similarly to a Pandas DataFrame. The main addition is that one of the columns represents an item's geometry: its geographical location and/or shape. This enables geo-operations similar to those used in GIS software. See the GeoPandas documentation for further details: https://geopandas.org/en/stable/index.html.

In case you do not want to install geopandas (since its installation might be cumbersome on some systems), continue with <code>enable_geo_ops</code> as <code>False</code>. In that case, you can use the preloaded list of results instead.

In case you do have GeoPandas installed, feel free to change the location of interest to another one within the Netherlands, and to play with the radius buffer distance.

#### Load stations to GeoDataFrame
Let's start by loading the KNMI meteo stations and their locations from the available metadata. These need to be transformed to a GeoDataFrame in order to enable geoprocessing on them later on.

In [17]:
enable_geo_ops = False

if enable_geo_ops:
    import geopandas as gpd

In [18]:
# Load the overview of stations from metadata
df_stns = knmi_meteo_ingest.knmi_load_meteo_stations()

In [19]:
# Transform the station data
df_stns_cleaned = knmi_meteo_transform.transform_stations(df_stns)

In [20]:
df_stns_cleaned

Unnamed: 0,station_code,longitude,latitude,altitude,location_name
0,209,4.518,52.465,0.0,IJmond
1,210,4.43,52.171,-0.2,Valkenburg Zh
2,215,4.437,52.141,-1.1,Voorschoten
3,225,4.555,52.463,4.4,IJmuiden
4,235,4.781,52.928,1.2,De Kooy
5,240,4.79,52.318,-3.3,Schiphol
6,242,4.921,53.241,10.8,Vlieland
7,248,5.174,52.634,0.8,Wijdenes
8,249,4.979,52.644,-2.4,Berkhout
9,251,5.346,53.392,0.7,Hoorn Terschelling


In [21]:
# Function to reduce code duplication
def df_to_gdf(df):
    """Convert df with lon, lat data to gdf."""
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(
        df["longitude"], df["latitude"]),
        crs="EPSG:4326")
    return gdf

In [22]:
if enable_geo_ops:
    gdf_stns = df_to_gdf(df_stns_cleaned)

In [23]:
if enable_geo_ops:
    print(gdf_stns)

#### Load central location point to GeoDataFrame

In [26]:
df_center_point = pd.DataFrame(
    {"city": ["Utrecht"],
     "country": ["The Netherlands"],
     "latitude": [52.08],
     "longitude": [5.13],
    }
)

In [27]:
if enable_geo_ops:
    gdf_cent_pt = df_to_gdf(df_center_point)
    print(gdf_cent_pt)

#### Transform geometries to distance-based CRS
We have now successfully loaded the station data and added its location as a point geometry. In order to work with distances in kilometers, we need to convert the degree-based <code>lon, lat</code> geometry to a local projection that approximates the 2D distances as accurately as possible. In the Netherlands, this is the so-called <code>Amersfoort/RD New</code> Coordinate Reference System (CRS), identified by the code <code>EPSG:28992</code>.

In the cells below we will convert both of our geodatasets from the global geodetic CRS (<code>EPSG:4326</code>) to the Amersfoort/RD New CRS.

In [28]:
if enable_geo_ops:
    gdf_stns = gdf_stns.to_crs("EPSG:28992")
    gdf_cent_pt = gdf_cent_pt.to_crs("EPSG:28992")

In [29]:
if enable_geo_ops:
    print(gdf_cent_pt)

#### Create search polygon - buffer around central point
Great! Now that our geometries of interest are expressed in meters, we can create a buffer around the desired center point.

In [30]:
r_in_km = 50

if enable_geo_ops:
    gdf_cent_pt['geometry'] = (gdf_cent_pt['geometry']
                               .buffer(1000 * r_in_km))

    print(gdf_cent_pt)

Note that our previous point geometry for the center point has changed to a polygon geometry. This polygon approximates the circle with a radius of <code>r_in_km</code> kilometers around the center point, and will be used as the intersection shape to determine which of the KNMI meteo stations lie within this shape, and thus, within the desired search distance.
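
To get a feel for how closely a many-sided polygon approximates a circle, the sketch below compares the area of a regular polygon with that of the true circle. It uses only the standard library; the vertex counts are arbitrary examples, and the number of vertices GeoPandas actually uses depends on the buffer settings.

```python
import math


def regular_polygon(cx, cy, radius, n_vertices):
    """Vertices of a regular polygon inscribed in a circle around (cx, cy)."""
    return [(cx + radius * math.cos(2 * math.pi * i / n_vertices),
             cy + radius * math.sin(2 * math.pi * i / n_vertices))
            for i in range(n_vertices)]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


radius_m = 50_000  # 50 km, expressed in meters
circle_area = math.pi * radius_m ** 2

# Polygon area as a share of the circle area, per vertex count
ratios = {n: polygon_area(regular_polygon(0, 0, radius_m, n)) / circle_area
          for n in (8, 16, 64)}

for n, ratio in ratios.items():
    print(f"{n:>2} vertices: {ratio:.2%} of the circle area")
```

The more vertices, the closer the polygon gets to the circle, so only stations very close to the 50 km boundary could be affected by the approximation.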

#### Use spatial join to find KNMI meteo stations within search polygon
Now we need to do a geoprocessing step called a <em>spatial join</em>. Contrary to regular (or: <em>attribute</em>) joins, spatial joins match records based on their locations instead of on the attributes of the data itself.

In the example below, we will find all KNMI meteo stations whose location points intersect with the search polygon that we have just set up.

In [31]:
if enable_geo_ops:
    gdf_matching_stns = gdf_stns.sjoin(gdf_cent_pt,
                                       how="inner",
                                       predicate="intersects")
    print(gdf_matching_stns["location_name"])

Great, we have found the KNMI meteo stations within a search range of <code>r_in_km</code> around our chosen central point! 
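
Conceptually, for a circular search area the spatial join boils down to a distance test in projected coordinates. The sketch below shows this with made-up coordinates in meters; these are not real RD New values and the station names are placeholders.

```python
import math

# Hypothetical projected coordinates in meters (NOT real RD New values)
center = (0.0, 0.0)
radius_m = 50_000

stations_xy = {
    "station_a": (30_000.0, 30_000.0),   # about 42.4 km from the center
    "station_b": (10_000.0, -20_000.0),  # about 22.4 km from the center
    "station_c": (45_000.0, 45_000.0),   # about 63.6 km from the center
}

# Keep the stations whose straight-line distance is within the radius
matching = [name for name, (x, y) in stations_xy.items()
            if math.hypot(x - center[0], y - center[1]) <= radius_m]
print(matching)
```

A spatial join is more general than this, since it also works for arbitrary polygons, but for a buffer around a single point the outcome should be the same except possibly right at the edge.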

#### Extract station codes

To request the data from the KNMI service, we will still need to extract the station code values. We can do this by selecting the values from the <code>station_code</code> column of the GeoDataFrame containing the matching stations.

In [32]:
if enable_geo_ops:
    matched_stns = list(gdf_matching_stns["station_code"])
    print(matched_stns)

#### After geoprocessing
For those who did not install 'geopandas': we are now done with all geoprocessing operations, so feel free to join in again or experiment with the code further below. The results of the geo-operations of our example (Utrecht + 50 km radius) are also provided in the cell below.

In [33]:
# Use stations found by geoprocessing as hard-coded result
# in case 'geopandas' was not used for running this Notebook
if not enable_geo_ops:
    matched_stns = [210, 215, 240, 260, 265, 269, 344, 348, 356]
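
As a dependency-free cross-check, a great-circle (haversine) distance can stand in for the projected buffer. The sketch below applies it to a handful of stations from the station overview shown earlier. Note that this is a different technique from the RD New buffer used above, so results for stations very close to the 50 km boundary may differ; the function name `haversine_km` is my own.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Utrecht as center point, same coordinates as in this example
center_lat, center_lon = 52.08, 5.13

# A few stations from the station overview: code -> (latitude, longitude)
stations = {
    209: (52.465, 4.518),  # IJmond
    210: (52.171, 4.43),   # Valkenburg Zh
    215: (52.141, 4.437),  # Voorschoten
    240: (52.318, 4.79),   # Schiphol
    249: (52.644, 4.979),  # Berkhout
}

nearby = sorted(code for code, (lat, lon) in stations.items()
                if haversine_km(center_lat, center_lon, lat, lon) <= 50)
print(nearby)
```

For this subset, the stations found are all part of the hard-coded list above.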

#### Request the data from the KNMI service
In this example, we will request data for the first half of 2022 from the KNMI meteo stations within a 50 km radius from Utrecht.

To also show how parameter selection works, we only select the (T) temperature-, (S) solar-, (R) rain- and (Q) irradiance-related parameters to be returned in the cells below.

In [34]:
# Set start and end dates (inclusive) for data retrieval
start_date = datetime.date(2022, 1, 1)
end_date = datetime.date(2022, 7, 1)

# Convert end_date to exclusive
end_date -= datetime.timedelta(days=1)

In [35]:
# Load available daily parameters
params_day = knmi_meteo_ingest.knmi_load_daily_parameters()

# We are interested in T, S, R and Q-related params; filter those
start_chars = ("T", "S", "R", "Q")

sel_params = (params_day["parameter_code"]
              [params_day["parameter_code"]
               .str.startswith(start_chars)]
               .tolist())

# Show the selected parameter codes
print(sel_params)

['TG', 'TN', 'TNH', 'TX', 'TXH', 'T10N', 'T10NH', 'SQ', 'SP', 'Q', 'RH', 'RHX', 'RHXH']
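
The filter above relies on the fact that `startswith` accepts a tuple of prefixes. The same idea in plain Python, using a handful of parameter codes that appear in this notebook:

```python
# A handful of daily parameter codes that appear in this notebook
codes = ["DDVEC", "FHVEC", "FG", "TG", "TN", "SQ", "Q", "RH", "UG", "EV24"]

# str.startswith accepts a tuple and matches if ANY of the prefixes matches
start_chars = ("T", "S", "R", "Q")
selected = [code for code in codes if code.startswith(start_chars)]

print(selected)
```

Keep in mind that a single-letter prefix selects every code that starts with that letter, so it is worth checking the printed list before sending the request.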


In [36]:
# Get dataset from KNMI web script service
df_sel = knmi_meteo_ingest.knmi_meteo_to_df(meteo_stns_list=matched_stns,
                                            meteo_params_list=sel_params,
                                            start_date=start_date,
                                            end_date=end_date)

In [37]:
# Apply transformations to the raw dataset
df_sel_cleaned = knmi_meteo_transform.transform_param_values(df_sel)

# Show the result
df_sel_cleaned

Unnamed: 0,station_code,date,hour_slot_max_temp,min_temp_10cm,six_hour_slot_min_temp_10cm,hour_slot_max_rain_hour_sum,sunshine_hours,max_temp,hour_slot_min_temp,min_temp,rain_sum,global_irradiance,sunshine_day_fraction,day_avg_temp,max_rain_hour_sum
0,215,2022-01-01,14,9.1,6.0,24,0.9,13.3,1,9.7,0.1,209,0.12,12.1,0.1
1,215,2022-01-02,4,8.9,24.0,19,3.0,13.6,24,9.7,3.3,227,0.38,11.6,2.7
2,215,2022-01-03,12,7.7,24.0,1,2.5,10.9,22,8.3,0.0,218,0.32,9.6,0.0
3,215,2022-01-04,1,3.2,24.0,5,0.2,8.9,19,3.9,0.5,113,0.03,6.7,0.4
4,215,2022-01-05,15,2.7,6.0,13,0.9,7.1,1,3.9,5.6,145,0.11,5.8,1.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1262,356,2022-06-26,14,14.1,6.0,1,5.1,21.2,5,14.5,0.0,1769,0.31,17.6,0.0
1263,356,2022-06-27,11,9.4,24.0,13,4.7,19.2,24,11.8,7.1,936,0.28,16.2,3.0
1264,356,2022-06-28,16,8.4,6.0,1,14.3,23.6,2,11.1,0.0,2848,0.86,18.0,0.0
1265,356,2022-06-29,15,11.9,6.0,1,11.0,27.2,4,12.9,0.0,2297,0.66,21.1,0.0


In [38]:
# Summarize the result
df_sel_cleaned.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
station_code,1267.0,290.285714,53.608777,215.0,240.0,269.0,348.0,356.0
hour_slot_max_temp,1267.0,13.456985,3.681473,1.0,12.0,14.0,15.0,24.0
min_temp_10cm,1086.0,3.200829,4.975224,-8.4,-0.3,2.9,6.7,17.5
six_hour_slot_min_temp_10cm,1086.0,13.20442,8.458335,6.0,6.0,6.0,24.0,24.0
hour_slot_max_rain_hour_sum,1267.0,5.423836,7.023398,1.0,1.0,1.0,9.0,24.0
sunshine_hours,1267.0,6.601815,4.686958,0.0,2.15,6.5,10.7,15.5
max_temp,1267.0,14.207893,5.88694,3.5,9.1,13.7,18.6,31.2
hour_slot_min_temp,1267.0,10.538279,9.041632,1.0,4.0,6.0,22.0,24.0
min_temp,1267.0,5.237885,4.64681,-5.8,1.7,4.6,8.5,17.9
rain_sum,1267.0,2.095856,5.177685,0.0,0.0,0.025,1.4,56.4


This last example contained quite a lot of steps, but we have made it through and got what we wanted! Time for a well-deserved coffee / tea break...

## Summary Overview
Hopefully the hands-on examples boosted your confidence to play around with similar data retrieval workflows.

Before wrapping up, I have listed some additional considerations and potential improvements. Feel free to skip these if you think these are not relevant to you.

### Methodological similarities to Medallion Architecture (Data Engineering)
A Data Engineering-minded reader might notice a few resemblances between the methodology used in the Python helper scripts of this notebook and the bronze and silver parts of the Medallion Architecture for data pipelines, where the raw (ingested) datasets would translate to 'bronze' and the cleaned (transformed) datasets to 'silver'. In case you noticed - it's correct!

For those who would like to know more about this principle, see for example: https://www.databricks.com/glossary/medallion-architecture.

### Improvement suggestions (for production environments)
Since no version control is used for the (meta)data in the files, any update simply <strong>overwrites</strong> the existing metadata files in the <code>metadata</code> folder, discarding the old content.

In a production environment, it would be recommended to instead use e.g. tables or Parquet-files, and to make use of one of the SCD (Slowly Changing Dimension) time-slicing types. 
For more information on SCD types to use in that case, see for example: https://en.wikipedia.org/wiki/Slowly_changing_dimension.
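
As a rough idea of what such time-slicing could look like, here is a minimal sketch of an SCD Type 2 style update on station metadata, where old versions are closed instead of overwritten. The function name, the field names `valid_from` and `valid_to`, and the changed altitude value are all hypothetical and not part of the helper scripts.

```python
import datetime


def apply_scd2_update(history, station_code, new_attrs, change_date):
    """Close the current record for a station and append a new version."""
    for record in history:
        if record["station_code"] == station_code and record["valid_to"] is None:
            if all(record[key] == value for key, value in new_attrs.items()):
                return history  # nothing changed, keep the history as-is
            record["valid_to"] = change_date
    history.append({"station_code": station_code, **new_attrs,
                    "valid_from": change_date, "valid_to": None})
    return history


# Hypothetical example: the recorded altitude of a station changes
history = [{"station_code": 240, "altitude": -3.3,
            "valid_from": datetime.date(2020, 1, 1), "valid_to": None}]

apply_scd2_update(history, 240, {"altitude": -3.0},
                  datetime.date(2024, 7, 1))

for record in history:
    print(record)
```

The old record stays available with its validity period, so earlier analyses remain reproducible after a metadata update.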

Also noteworthy is that KNMI offers more stable (production-ready) data products through the use of API keys, available on request by contacting KNMI (link to site in Dutch): https://www.knmi.nl/over-het-knmi/contact/contactformulier. Or check out the following page: https://developer.dataplatform.knmi.nl

### Similar work
For those mainly interested in a clean Python wrapper around the data retrieval API that is under more active maintenance, please also see the <code>knmi-py</code> package: https://github.com/EnergieID/KNMI-py. The organization also developed quite a few more Python wrappers for weather and energy APIs, which might be worthwhile to look at.

Note that they are also not affiliated with KNMI, nor am I affiliated with EnergieID.

### Data Analysis Opportunities
Now that we have shown ways to flexibly retrieve KNMI datasets, a wide array of interesting data analyses has become possible.

We could start answering questions such as:
- Which time(s) of the day tended to be the rainiest/sunniest?
- Which of the measured locations were most/least windy?
- What were the total sums for rainfall and sunshine for the year of interest, and how did this vary throughout The Netherlands?
- Where and when did we see extreme rainfall or drought?
- At what time of the day did we typically see the lowest and highest temperatures, and is there a difference per region?

Since the purpose of this Notebook was only to show the data ingestion process, a deeper dive into analyses is not provided here. 

However, feel free to check the other Notebooks for data analysis examples, or experiment on your own!

## Closing note

Great job on reaching the end of this Notebook! We've learned how to access and prepare KNMI’s meteorological data, laying the groundwork for deeper analysis.

In the follow-up Notebook(s), we’ll focus on analyzing this data to uncover valuable weather trends. Happy to see you there!