# **Weather data exploration**

## 0 - Set up the libraries

In [None]:
#packages for EDA
import numpy as np
import pandas as pd

#packages for visualisation
import matplotlib.pyplot as plt
import seaborn as sns

#package for importing data in json format
import json

#package for interpolation
#from scipy.interpolate import interp1d

## 1 - Load the data

In [None]:
#load the data in the csv format 
hist_weather_df = pd.read_csv('../data/historical_weather.csv')
forecast_weather_df = pd.read_csv('../data/forecast_weather.csv')
weather_station_df = pd.read_csv('../data/weather_station_to_county_mapping.csv')

df_dict = {'historical_weather' : hist_weather_df, 
           'weather_station' : weather_station_df, 
           'forecast_weather' : forecast_weather_df}

In [None]:
for df in df_dict.keys():
    print('name:', df)
    print('shape:', df_dict[df].shape)
    print('columns:', [col for col in df_dict[df].columns], '\n')
#    print(df_dict[df].info())
    print(15*'----')

In [None]:
#read the county id -> county name mapping; the with-block closes the file afterwards
with open('../data/county_id_to_name_map.json', mode = 'r', encoding = 'utf-8') as json_file:
    county_id_name_dict = json.load(json_file)

In [None]:
county_id_name_dict

Estonia has 15 administrative counties in total, see e.g. the __[Wikipedia page](https://en.wikipedia.org/wiki/Counties_of_Estonia)__. Here we have 16 entries, one of which, #12, corresponds to an `Unknown` county. We convert this dictionary into a dataframe for later convenience.

In [None]:
county_id_name_df = pd.DataFrame(data = {'county' : county_id_name_dict.keys(), 
                                         'county_name' : county_id_name_dict.values()})
county_id_name_df['county'] = county_id_name_df['county'].astype('int64')
county_id_name_df.head()

## 2 - Data exploration

### 2.1 - **Weather station**

Geographical coordinates for 112 weather stations, with county name and county id.
- `county_name` - The name of the county where the weather station is located.
- `[longitude/latitude]` - The coordinates of the weather stations.
- `county` - The county id.

In [None]:
weather_station_df.head()

In [None]:
weather_station_df.isnull().sum()

Thus we have a number of `NaNs` in this table. Let us see on the map where the stations with a missing county are.

In [None]:
import plotly.express as px

weather_station_df['size'] = 5

fig = px.scatter_mapbox(
    weather_station_df, 
    lat="latitude", 
    lon="longitude", 
    color="county",
    size='size',
    zoom=6,
    title='Weather Stations Locations'
)

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})
fig.show()

Now it is clear why there are so many missing values in this table: they correspond to weather stations at sea or outside the borders of Estonia. They must correspond to the #12 `Unknown` county in the `county_id_name_dict`. In order to match the json file, we will replace the `NaNs` in this dataframe accordingly.

In [None]:
#make the column county_name upper case to match the names in the json file
#(the rename of county to county_id below is currently disabled)
#weather_station_df.rename(columns = {'county' : 'county_id'}, inplace = True)
weather_station_df['county_name'] = weather_station_df['county_name'].str.upper()

We then check which county names and ids from the json file appear in the weather station table. The entry that is absent should be the `Unknown` county, #12.

In [None]:
county_id_name_df['county_name'].isin(weather_station_df['county_name'])

In [None]:
county_id_name_df['county'].isin(weather_station_df['county'])

We now impute the `NaNs`.

In [None]:
weather_station_df.fillna(value = {'county_name' : 'UNKNOWN', 'county' : 12}, inplace = True)
weather_station_df.head()

For later convenience, we drop the column `size` and convert the column `county` to integer.

In [None]:
weather_station_df.drop(columns = ['size'], inplace = True)
weather_station_df['county'] = weather_station_df['county'].astype('int64')
weather_station_df.head()

### 2.2 - **Weather forecast**

Weather forecasts that would have been available at prediction time. Sourced from the __[European Centre for Medium-Range Weather Forecasts](https://codes.ecmwf.int/grib/param-db/?filter=grib2)__.

- `[latitude/longitude]` - The coordinates of the weather forecast.

- `origin_datetime` - The timestamp of when the forecast was generated. Given in EET+2/EEST+3 timezone.

- `hours_ahead` - The number of hours between the forecast generation and the forecast weather. Each forecast covers 48 hours in total.

- `temperature` - The air temperature at 2 meters above ground in degrees Celsius. Estimated for the end of the 1-hour period.

- `dewpoint` - The dew point temperature at 2 meters above ground in degrees Celsius. Estimated for the end of the 1-hour period.

- `cloudcover_[low/mid/high/total]` - The percentage of the sky covered by clouds in the following altitude bands: 0-2 km, 2-6, 6+, and total. Estimated for the end of the 1-hour period.

- `10_metre_[u/v]_wind_component` - The [eastward/northward] component of wind speed measured 10 meters above surface in meters per second. Estimated for the end of the 1-hour period.

- `data_block_id` - All rows sharing the same data_block_id will be available at the same forecast time. This is a function of what information is available when forecasts are actually made, at 11 AM each morning. For example, if the forecast weather data_block_id for predictions made on October 31st is 100 then the historic weather data_block_id for October 31st will be 101, as the historic weather data is only actually available the next day.

- `forecast_datetime` - The timestamp of the predicted weather. Generated from origin_datetime plus hours_ahead. This represents the start of the 1-hour period for which weather data are forecasted. Given in UTC+00:00 timezone.

- `direct_solar_radiation` - The direct solar radiation reaching the surface on a plane perpendicular to the direction of the Sun accumulated during the hour, in watt-hours per square meter.

- `surface_solar_radiation_downwards` - The solar radiation, both direct and diffuse, that reaches a horizontal plane at the surface of the Earth, accumulated during the hour, in watt-hours per square meter.

- `snowfall` - Snowfall over hour in units of meters of water equivalent.

- `total_precipitation` - The accumulated liquid, comprising rain and snow that falls on Earth's surface over the described hour, in units of meters.
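
The relation between `origin_datetime`, `hours_ahead` and `forecast_datetime` described in the list above can be checked directly. The sketch below uses a small made-up dataframe (`toy_forecast_df` is a hypothetical name and its values are invented for illustration) and assumes that both timestamp columns are expressed in the same timezone.

```python
import pandas as pd

# toy data: one forecast run issued at 02:00 covering three hours (values are made up)
toy_forecast_df = pd.DataFrame({
    "origin_datetime": pd.to_datetime(["2022-01-01 02:00:00"] * 3),
    "hours_ahead": [1, 2, 3],
    "forecast_datetime": pd.to_datetime(
        ["2022-01-01 03:00:00", "2022-01-01 04:00:00", "2022-01-01 05:00:00"]
    ),
})

# rebuild forecast_datetime from its two ingredients
rebuilt = toy_forecast_df["origin_datetime"] + pd.to_timedelta(
    toy_forecast_df["hours_ahead"], unit="h"
)

# True when every row satisfies forecast_datetime == origin_datetime + hours_ahead
is_consistent = (rebuilt == toy_forecast_df["forecast_datetime"]).all()
print(is_consistent)
```

On the real data such a check may fail around clock changes, which is what the Daylight Saving Time section looks into.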

In [None]:
forecast_weather_df.head()

In [None]:
forecast_weather_df.tail()

In [None]:
forecast_weather_df.info(show_counts = True)

#### 2.2.1 - Check for unique values

Unique values for (`latitude`, `longitude`) pairs.

In [None]:
geo_coord_weather_st = list(set(list(zip(forecast_weather_df['latitude'],forecast_weather_df['longitude']))))
print(geo_coord_weather_st)

The number of unique pairs is

In [None]:
Nstations = len(geo_coord_weather_st)
print(Nstations)

which coincides with the number of weather stations. We now create a dataframe with the weather station id's and the corresponding geocoordinates.

In [None]:
#create a dataframe with the id's and corresponding geocoordinates
weather_station_geocoord_df = pd.DataFrame(data = {'station' : [id_ for id_ in range(Nstations)], 'geocoordinates' : geo_coord_weather_st})
#create columns with the latitude and longitude
weather_station_geocoord_df['latitude'] = weather_station_geocoord_df['geocoordinates'].apply(lambda x: x[0])
weather_station_geocoord_df['longitude'] = weather_station_geocoord_df['geocoordinates'].apply(lambda x: x[1])
weather_station_geocoord_df.head()

We will also add the id information into the `weather_station_df`.

In [None]:
weather_station_df['geocoordinates'] = list(zip(weather_station_df['latitude'].round(1),weather_station_df['longitude'].round(1)))
weather_station_df['station'] = weather_station_df['geocoordinates'].apply(lambda x: geo_coord_weather_st.index(x))
weather_station_df.drop(columns = 'geocoordinates', inplace = True)
weather_station_df.head()

In [None]:
weather_station_dict = weather_station_df.to_dict(orient = 'list')
with open('../data/weather_station_to_county_dictionary.json', 'w') as outfile:
    json.dump(weather_station_dict, outfile)

Unique values for `hours_ahead`.

In [None]:
print('all possible hours ahead:', [ha for ha in pd.unique(forecast_weather_df.hours_ahead)])
print('# of possible hours ahead:', len(pd.unique(forecast_weather_df.hours_ahead)))

So the weather is forecasted every hour for up to 48 hours in advance.

#### 2.2.2 - Transform the date/time columns into the datetime type

All timestamps are given in EET/EEST.

In [None]:
#transform to datetime
forecast_weather_df['origin_datetime'] = pd.to_datetime(forecast_weather_df['origin_datetime'])
forecast_weather_df['forecast_datetime'] = pd.to_datetime(forecast_weather_df['forecast_datetime'])

In [None]:
forecast_weather_df.head()

#### 2.2.3 - Check for duplicates

For each weather station, each timestamp in the `forecast_datetime` column should occur only twice or, equivalently, since we have 112 weather stations, each timestamp in `forecast_datetime` should occur 224 times in the dataset. This is because for each `origin_datetime` the weather is forecasted for up to 48 hours, i.e. two days. Taking today as the origin datetime, we get forecasts for tomorrow and the day after tomorrow; the forecast with tomorrow's date as origin datetime then covers the day after tomorrow and the one after that. Hence we have two distinct forecasts for the day after tomorrow. In short, to look for duplicates we have to consider both `origin_datetime` and `forecast_datetime`.

In order to identify the weather stations more easily, we will create the `weather_station_id` column from the `geo_coord_weather_st` list, taking the list index as the corresponding id.

In [None]:
#create weather_station_id column in the dataframe as the index of geo_coord_weather_st for the 
#corresponding tuple in the latitude/longitude columns
forecast_weather_df['weather_station_id'] = list(zip(forecast_weather_df['latitude'],forecast_weather_df['longitude']))
forecast_weather_df['weather_station_id'] = forecast_weather_df['weather_station_id'].apply(lambda x: geo_coord_weather_st.index(x))
forecast_weather_df['weather_station_id']
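
The `list.index` lookup scans the coordinate list once per row, which can be slow on a large table. A possible alternative, sketched on made-up data, is to build a dictionary from coordinate pair to id once and look each row up in it. The names `toy_coords`, `toy_df` and `coord_to_id` are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# toy stand-ins for geo_coord_weather_st and the forecast dataframe (values are made up)
toy_coords = [(57.6, 21.7), (57.6, 22.2), (57.9, 21.7)]
toy_df = pd.DataFrame({
    "latitude": [57.6, 57.9, 57.6, 57.6],
    "longitude": [22.2, 21.7, 21.7, 22.2],
})

# build the lookup once: (latitude, longitude) -> position in the list
coord_to_id = {coord: idx for idx, coord in enumerate(toy_coords)}

# look every row's coordinate pair up in the dictionary
toy_df["weather_station_id"] = [
    coord_to_id[pair] for pair in zip(toy_df["latitude"], toy_df["longitude"])
]
print(toy_df)
```

The ids are the same as those given by `list.index`, as long as the coordinate list contains no repeated pairs.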

In [None]:
forecast_weather_df.head()

Our next task is to go through the dataset corresponding to each weather station and check if there are duplicates. For each station and for each `origin_datetime` we forecast the weather for the next 48 hours, hence no combination of `origin_datetime`, `hours_ahead` and `weather_station_id` should be repeated; otherwise this would indicate duplicates in our dataset.

In [None]:
np.unique(forecast_weather_df[['origin_datetime', 'hours_ahead', 'weather_station_id']].duplicated())

Hence, we have no duplicated data.
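
The counting argument made earlier (two forecast runs per timestamp and station) can also be illustrated on synthetic data. This is a minimal sketch with 3 hypothetical stations and 4 daily runs, not the real dataset; all names and values are assumptions made for the example.

```python
import pandas as pd

# toy setup: 3 stations, 4 daily forecast runs, 48 hourly steps each (made-up values)
n_stations = 3
origins = pd.date_range("2022-01-01 02:00:00", periods=4, freq="D")
rows = [
    {"origin_datetime": origin, "hours_ahead": hour, "weather_station_id": station}
    for origin in origins
    for hour in range(1, 49)
    for station in range(n_stations)
]
toy_df = pd.DataFrame(rows)
toy_df["forecast_datetime"] = toy_df["origin_datetime"] + pd.to_timedelta(
    toy_df["hours_ahead"], unit="h"
)

# how often does each forecast timestamp occur?
counts = toy_df.groupby("forecast_datetime").size()
print(counts.value_counts())

# no (origin, hours_ahead, station) combination is repeated
has_duplicates = toy_df.duplicated(
    subset=["origin_datetime", "hours_ahead", "weather_station_id"]
).any()
print(has_duplicates)
```

Away from the first and the last day of the toy period, every timestamp occurs `2 * n_stations` times, while the key used for the duplicate check stays unique.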

#### 2.2.4 - Check for missing values

In [None]:
forecast_weather_df.isnull().sum()

We have 2 `NaN` values in the column `surface_solar_radiation_downwards`. Let us first find out in which rows these `NaN` values are.

In [None]:
nan_cond = forecast_weather_df['surface_solar_radiation_downwards'].isna() #True for the rows holding a NaN
forecast_weather_df[nan_cond].head()

Let us now plot the time series containing the `NaN` values. For that we will take the 48-hour forecast with `origin_datetime` = 2022-08-11 02:00:00.

In [None]:
station_id = 51
cond = ((forecast_weather_df['weather_station_id'] == station_id) #condition to screen for a time series containing the NaNs
        & (forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2022-08-11')
        & forecast_weather_df['hours_ahead'].between(1,48,inclusive='both')
       )
aux_series = forecast_weather_df[cond].set_index('hours_ahead')['surface_solar_radiation_downwards']
aux_series.plot(figsize = (16,6)); #plot of the time series containing the NaNs

We now interpolate the time series using a polynomial method, so as to get a steep, yet smooth, rise in the first hours of the day.

In [None]:
aux_series.interpolate(method='cubic').plot(figsize = (16,6));

This looks good enough, so we impute the interpolated values into the time series.

In [None]:
aux_series.interpolate(method='polynomial', order = 3, inplace = True)

Since we only have two `NaN`, we can impute them by hand. 

In [None]:
for hour in [3,4]:
    cond = ((forecast_weather_df['weather_station_id'] == station_id) 
            & (forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d %H:%M:%S') == '2022-08-11 02:00:00')
            & (forecast_weather_df['hours_ahead'] == hour)
            )
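    #aux_series is indexed by hours_ahead (starting at 1), so look the value up by label, not by position
    forecast_weather_df.loc[cond, 'surface_solar_radiation_downwards'] = aux_series.loc[hour]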

In [None]:
forecast_weather_df.isnull().sum()

So now we are free from `NaNs`!
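
Label-based and position-based lookups differ when the index starts at 1, as `hours_ahead` does here. The sketch below illustrates this on made-up data and also reads the missing hours from the series instead of typing them in. Linear interpolation is used instead of the cubic one only to keep the example free of a SciPy dependency. The series `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# toy 48-hour forecast for one station, indexed by hours_ahead = 1..48 (made-up values)
hours = np.arange(1, 49)
toy = pd.Series(
    np.sin(hours / 48 * np.pi) * 100.0,
    index=pd.Index(hours, name="hours_ahead"),
)
toy.loc[[3, 4]] = np.nan

# read the missing hours from the data instead of hard-coding them
missing_hours = toy[toy.isna()].index.tolist()

# fill the gaps (linear here; the notebook uses a cubic interpolation)
filled = toy.interpolate(method="linear")

for hour in missing_hours:
    # .loc uses the hours_ahead label, .iloc the zero-based position
    print(hour, filled.loc[hour], filled.iloc[hour])
```

With an index starting at 1, `filled.iloc[hour]` returns the value stored under the label `hour + 1`, so `.loc` is the safer choice when imputing by `hours_ahead`.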

#### 2.2.5 - Check for Daylight Saving Time

The start and end timestamps for DST for Estonia in the years 2021, 2022 and 2023 are

- `2021-03-28 03:00:00` to `2021-10-31 04:00:00`
- `2022-03-27 03:00:00` to `2022-10-30 04:00:00`
- `2023-03-26 03:00:00` to `2023-10-29 04:00:00`

In [None]:
end_dst_2021 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2021-10-30'
forecast_weather_df[end_dst_2021&(forecast_weather_df['weather_station_id']==1)][22:27]

In [None]:
end_dst_2021 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2021-10-31'
forecast_weather_df[end_dst_2021&(forecast_weather_df['weather_station_id']==1)][0:4]

In [None]:
start_dst_2022 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2022-03-26'
forecast_weather_df[start_dst_2022&(forecast_weather_df['weather_station_id']==1)][22:27]

In [None]:
start_dst_2022 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2022-03-27'
forecast_weather_df[start_dst_2022&(forecast_weather_df['weather_station_id']==1)][0:4]

In [None]:
end_dst_2022 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2022-10-29'
forecast_weather_df[end_dst_2022&(forecast_weather_df['weather_station_id']==1)][22:27]

In [None]:
end_dst_2022 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2022-10-30'
forecast_weather_df[end_dst_2022&(forecast_weather_df['weather_station_id']==1)][0:4]

In [None]:
start_dst_2023 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2023-03-25'
forecast_weather_df[start_dst_2023&(forecast_weather_df['weather_station_id']==1)][22:27]

In [None]:
start_dst_2023 = forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2023-03-26'
forecast_weather_df[start_dst_2023&(forecast_weather_df['weather_station_id']==1)][0:4]

We can see that when DST ends, the forecast timestamp `03:00:00` is duplicated, since after `03:59:59` the clock goes back to `03:00:00`. For the same reason, when DST starts the timestamp `03:00:00` is skipped, since after `02:59:59` the clock jumps to `04:00:00`. We can also see that the origin timestamp changes with the DST change: during DST the origin timestamp is at `02:00:00`, while during standard time it is at `01:00:00`.
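
The repeated and skipped local hours can be reproduced with the Python standard library. This is a small sketch which assumes that the IANA zone `Europe/Tallinn` describes the timezone of the forecast timestamps. The helper `local_hours` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

tallinn = ZoneInfo("Europe/Tallinn")

def local_hours(start_utc, n_hours):
    """Return the local wall-clock time of n_hours consecutive UTC hours."""
    return [
        (start_utc + timedelta(hours=h)).astimezone(tallinn).strftime("%Y-%m-%d %H:%M")
        for h in range(n_hours)
    ]

# four consecutive hours around the end of DST in 2021
autumn = local_hours(datetime(2021, 10, 30, 23, tzinfo=timezone.utc), 4)
# four consecutive hours around the start of DST in 2022
spring = local_hours(datetime(2022, 3, 26, 23, tzinfo=timezone.utc), 4)

print(autumn)  # the local hour 03:00 appears twice
print(spring)  # the local hour 03:00 does not appear
```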

#### 2.2.6 - Check for summer time in the time stamps

In the latest version of the `forecast_weather` there are no problems related to summer time.

In [None]:
forecast_weather_df[(forecast_weather_df['origin_datetime'].dt.strftime('%Y-%m-%d') == '2022-03-27')
                   & (forecast_weather_df['weather_station_id'] == 5)].head()

### 2.3 - **Historic weather**

Historic weather data, as described on the competition website.


- `datetime` - This represents the start of the 1-hour period for which weather data are measured. Given in EET+2/EEST+3 timezone.
- `temperature` - Measured at the end of the 1-hour period.
- `dewpoint` - Measured at the end of the 1-hour period.
- `rain` - Different from the forecast conventions. The rain from large scale weather systems of the hour in millimeters.
- `snowfall` - Different from the forecast conventions. Snowfall over the hour in centimeters.
- `surface_pressure` - The air pressure at surface in hectopascals.
- `cloudcover_[low/mid/high/total]` - Different from the forecast conventions. Cloud cover at 0-3 km, 3-8, 8+, and total.
- `windspeed_10m` - Different from the forecast conventions. The wind speed at 10 meters above ground in meters per second.
- `winddirection_10m` - Different from the forecast conventions. The wind direction at 10 meters above ground in degrees.
- `shortwave_radiation` - Different from the forecast conventions. The global horizontal irradiation in watt-hours per square meter.
- `direct_solar_radiation`
- `diffuse_radiation` - Different from the forecast conventions. The diffuse solar irradiation in watt-hours per square meter.
- `[latitude/longitude]` - The coordinates of the weather station.
- `data_block_id`
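
Since the forecast gives the wind as `[u/v]` components while the historic data gives speed and direction, a conversion is needed before the two can be compared. The sketch below shows the usual vector relations on made-up values. It assumes that `winddirection_10m` follows the meteorological convention (the direction the wind blows from, in degrees clockwise from north), which the description above does not state explicitly.

```python
import numpy as np
import pandas as pd

# toy forecast rows with made-up wind components in m/s
toy_df = pd.DataFrame({
    "10_metre_u_wind_component": [0.0, 3.0, 0.0, -4.0],
    "10_metre_v_wind_component": [-2.0, 0.0, 5.0, -3.0],
})
u = toy_df["10_metre_u_wind_component"]  # eastward component
v = toy_df["10_metre_v_wind_component"]  # northward component

# wind speed is the length of the (u, v) vector
toy_df["windspeed_10m"] = np.hypot(u, v)

# assumed convention: direction the wind comes from, clockwise from north
toy_df["winddirection_10m"] = (180.0 + np.degrees(np.arctan2(u, v))) % 360.0
print(toy_df)
```

For instance, a purely eastward wind (`u > 0`, `v = 0`) comes from the west and gets a direction of 270 degrees under this convention.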

In [None]:
hist_weather_df.head()

In [None]:
hist_weather_df.tail()

In [None]:
hist_weather_df.info(show_counts=True)

#### 2.3.1 - Create a weather station id

In [None]:
#create weather_station_id column in the dataframe as the index of geo_coord_weather_st for the 
#corresponding tuple in the latitude/longitude columns
hist_weather_df['weather_station_id'] = list(zip(hist_weather_df['latitude'],hist_weather_df['longitude']))
hist_weather_df['weather_station_id'] = hist_weather_df['weather_station_id'].apply(lambda x: geo_coord_weather_st.index(x))
hist_weather_df['weather_station_id']

#### 2.3.2 - Check for missing values

In [None]:
hist_weather_df.isnull().sum()

There are no missing values!

#### 2.3.3 - Transform the date/time columns into the datetime type

In [None]:
hist_weather_df['datetime'] = pd.to_datetime(hist_weather_df['datetime'])

#### 2.3.4 - Check for duplicates in the dataset

In [None]:
cols_to_drop = ['latitude', 'longitude', 'weather_station_id'] #columns that can be dropped
cols_to_check = 'datetime' #columns to check for duplicates
weather_stations_w_duplicates = [] #list where to keep the ids of the weather stations with duplicated data

for station_id in range(Nstations): #loop through the weather stations ids
    screen_cond = hist_weather_df['weather_station_id'] == station_id #screen for a particular weather station
    aux_df = hist_weather_df[screen_cond].drop(columns = cols_to_drop).copy() #drop columns
    if True in pd.unique(aux_df.duplicated(subset=cols_to_check)): #check for duplicated rows for a subset of the columns
        weather_stations_w_duplicates.append(station_id) #append to the list

In [None]:
weather_stations_w_duplicates

We have duplicated time stamps for the weather stations 31 and 97. Since this duplication does not happen for all weather stations, it is unlikely to be connected to summer time changes. Let us find out at what `datetime` values they occur.

In [None]:
aux_df = hist_weather_df[(hist_weather_df['weather_station_id'].isin(weather_stations_w_duplicates))][['datetime', 'weather_station_id']]
for dup_station_id in weather_stations_w_duplicates: #avoid shadowing the built-in id
    aux_count = aux_df[aux_df['weather_station_id']==dup_station_id].groupby('datetime')['datetime'].count()
    print(aux_count[aux_count > 1])

Finally, let us display the duplicates.

In [None]:
hist_weather_df[hist_weather_df.duplicated(subset=['datetime', 'weather_station_id'], keep=False)]

These duplicates might correspond to sudden changes in the weather conditions. A way around this is to aggregate the dataframe by `datetime` and then take the mean.

In [None]:
mask_all = hist_weather_df.duplicated(subset=['datetime', 'weather_station_id'], keep=False)
aux_df = hist_weather_df[mask_all].groupby(['datetime', 'weather_station_id'], as_index = False).mean()
move_col = aux_df.pop('weather_station_id')
n_cols = len(aux_df.columns.values.tolist())
aux_df.insert(n_cols, 'weather_station_id', move_col)
aux_df

We then write these values into the rows corresponding to the first instance of each duplicate and drop the remaining duplicates.

In [None]:
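keys = ['datetime', 'weather_station_id']
#index of the first instance of each duplicated pair, sorted by keys like aux_df
first_idx = hist_weather_df[mask_all].drop_duplicates(subset=keys).sort_values(keys).index
value_cols = [col for col in aux_df.columns if col not in keys]
hist_weather_df.loc[first_idx, value_cols] = aux_df[value_cols].to_numpy() #write the means into the first instances
hist_weather_df.drop_duplicates(subset=keys, inplace=True) #keep only the first instances
hist_weather_df[hist_weather_df.duplicated(subset=keys, keep=False)]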

And now we are free of duplicates!
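
For reference, averaging and dropping can also be done in a single step by grouping the whole table on the key columns. This is a sketch on made-up data with a hypothetical `toy_df`. On the full table it reorders the rows by key and may be slower than treating only the duplicated rows.

```python
import pandas as pd

# toy data: station 0 reports 10:00 twice with different readings (values are made up)
toy_df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 10:00",
        "2022-01-01 11:00", "2022-01-01 10:00",
    ]),
    "weather_station_id": [0, 0, 0, 1],
    "temperature": [1.0, 3.0, 2.5, -1.0],
})

keys = ["datetime", "weather_station_id"]

# averaging within each key collapses duplicates and leaves unique rows unchanged
deduped = toy_df.groupby(keys, as_index=False).mean()
print(deduped)
```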

#### 2.3.5 - Check for timezone

In the latest version of the data all timestamps are given in the EET/EEST timezone.

#### 2.3.6 - Check for Daylight Saving Time

The start and end timestamps for DST for Estonia in the years 2021, 2022 and 2023 are

- `2021-03-28 03:00:00` to `2021-10-31 04:00:00`
- `2022-03-27 03:00:00` to `2022-10-30 04:00:00`
- `2023-03-26 03:00:00` to `2023-10-29 04:00:00`

In [None]:
end_dst_2021 = hist_weather_df['datetime'].dt.strftime('%Y-%m-%d') == '2021-10-31'
hist_weather_df[end_dst_2021&(hist_weather_df['weather_station_id']==1)].head()

In [None]:
start_dst_2022 = hist_weather_df['datetime'].dt.strftime('%Y-%m-%d') == '2022-03-27'
hist_weather_df[start_dst_2022&(hist_weather_df['weather_station_id']==1)].head()

In [None]:
end_dst_2022 = hist_weather_df['datetime'].dt.strftime('%Y-%m-%d') == '2022-10-30'
hist_weather_df[end_dst_2022&(hist_weather_df['weather_station_id']==1)].head()

In [None]:
start_dst_2023 = hist_weather_df['datetime'].dt.strftime('%Y-%m-%d') == '2023-03-26'
hist_weather_df[start_dst_2023&(hist_weather_df['weather_station_id']==1)].head()

We can see that there is no DST change. Hence the forecasted and historic weather data follow different time conventions.
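
Instead of inspecting the rows around each clock change by eye, one can count the rows per calendar day for a single station: a series that follows DST would show a 23-hour day in spring and a 25-hour day in autumn. Below is a minimal sketch on a made-up hourly series (`toy_times` is hypothetical) that does not shift with DST.

```python
import pandas as pd

# toy hourly timestamps for one station across the 2022 spring clock change,
# recorded without any DST shift (made-up example)
toy_times = pd.Series(
    pd.date_range("2022-03-26 00:00:00", "2022-03-28 23:00:00", freq="60min")
)

# number of rows per calendar day
rows_per_day = toy_times.groupby(toy_times.dt.date).size()
print(rows_per_day)

# True when every day has exactly 24 hourly rows, i.e. no skipped or repeated hour
no_dst_shift = (rows_per_day == 24).all()
print(no_dst_shift)
```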

## 3 - Save cleaned data in the files

For now we will drop the column `weather_station_id` before saving the cleaned dataset files.

In [None]:
weather_station_df.to_csv('../data/weather_station_to_county_mapping_clean.csv', index = False)

In [None]:
county_id_name_df.to_csv('../data/county_id_to_name_map.csv', index = False)

In [None]:
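#drop the helper column weather_station_id before saving
hist_weather_df.drop(columns = 'weather_station_id').to_csv('../data/historical_weather_clean.csv', index = False)
forecast_weather_df.drop(columns = 'weather_station_id').to_csv('../data/forecast_weather_clean.csv', index = False)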

In [None]:
weather_station_geocoord_df.to_csv('../data/weather_stations_geocoord.csv', index = False)