# Weather Data Wrangling
### Data Source:
- Weather data comes from the Meteorological Institute, Universität Hamburg (https://www.mi.uni-hamburg.de/). 
- It includes hourly values for air temperature, relative humidity, wind speed and direction, precipitation amount, duration of sunshine, and cloud coverage in Hamburg.
- Each variable is stored as a single-column .txt file whose name consists of the variable abbreviation followed by the suffix "M60_201201010000-202212312300.txt".
- All weather .txt files contain one entry per hour. Missing values (NaN) are denoted as 99999, the `na_val` used in the code below.

|Variable   |Abbreviation   |
|---|---|
|wind direction in degree   |DD010   |
|wind speed in m/s   |FF010   |
|minutes of sunshine per hour   |GSM   |
|cloud coverage in eighths   |NC   |
|relative humidity in percent   |RH002   |
|precipitation in mm   |RR   |
|Surface-Temperature in °C  |TS000   |
|Air-Temperature 2m in °C  |TT002   |
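
The file layout described above (one value per line, a numeric code for missing values) can also be parsed directly with pandas. The following is a minimal sketch, not the approach used in this notebook: the sample values and the column name `t_air` are made up, and the missing-value code is assumed to be the `na_val` defined further below.

```python
import io
import pandas as pd

MISSING = 99999  # assumption: the missing-value code used as na_val in this notebook

# Stand-in for one single-column weather file (made-up values)
sample_file = io.StringIO("3.1\n2.8\n99999\n2.5\n")

# header=None: the file has no header row
# na_values: entries equal to the code are turned into NaN while reading
df_sample = pd.read_csv(sample_file, header=None, names=["t_air"], na_values=[MISSING])

print(df_sample["t_air"].isna().sum())
```

With this variant the separate replacement step would become unnecessary, at the cost of no longer being able to count the codes per column after loading.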

### Aims:
1. Get the 8 variables into a single time-series DataFrame
2. Replace the missing-value code with NaN
3. Create clean column names (features)
4. Create daily aggregates for each variable
    1. To meaningfully aggregate sunshine time per day, sunrise and sunset times must be obtained using the daylight library (https://pypi.org/project/daylight/)
    2. Calculate the day length
    3. Apply the aggregation function

In [None]:
import sys
# add the parent folder to the module search path (for this session only, system variables stay unchanged)
sys.path.append("..")
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import csv
import re
from os import listdir

warnings.filterwarnings('ignore')

##### 1. Read in .txt weather-data

In [None]:
### list of txt tables in m60
m60 = listdir("../data/weather_m60") 
### Very simple, each has only one column of measurements

m60_s = {} ### dictionary of lists to be read in

### Create a list of data-type float for each weather variable
for txt in m60:
    key = txt.replace("_201201010000-202212312300.txt", "") # Define key name from filename
    path = "../data/weather_m60/" + txt                     # relative path of .txt
    with open(path) as file:                                # Open .txt
        ls_file = file.read().splitlines()                  # read values into a list
    m60_s[key] = [float(i) for i in ls_file]                # cast values to float and store in the dictionary

### Cast dictionary to data-frame
df_m60 = pd.DataFrame(m60_s)

In [None]:
### Attach a Date-time as index
df_m60["datetime"]= pd.date_range("2012-01-01 00:30:00", "2022-12-31 23:30:00", freq="1H")
df_m60.set_index("datetime", inplace=True)
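
A quick consistency check for the index above: the period 2012 to 2022 spans 4018 days (three of the eleven years are leap years), so the hourly range should contain 4018 × 24 = 96432 timestamps, one per line of each file. The sketch below only rebuilds the range. Comparing the count with the number of rows in `df_m60` is left as a commented suggestion.

```python
import pandas as pd

# Same range as above; a Timedelta is used as frequency to stay independent of alias spelling
idx = pd.date_range("2012-01-01 00:30:00", "2022-12-31 23:30:00",
                    freq=pd.Timedelta(hours=1))

# Number of calendar days from 2012-01-01 up to and including 2022-12-31
n_days = (pd.Timestamp("2023-01-01") - pd.Timestamp("2012-01-01")).days
print(n_days, len(idx))  # 4018 96432

# In the notebook, this could guard the assignment of the index:
# assert len(idx) == len(df_m60)
```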

##### 2. Replace Missing Values

In [None]:
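na_val = 99999  # code that marks missing values in the raw files

def count_nine(series, na_val=na_val):
    """Count the entries of a series that equal the missing-value code."""
    return (series == na_val).sum()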

In [None]:
df_m60.apply(count_nine)

In [None]:
### Replace the missing-value code (na_val) with NaN
df_m60 = df_m60.replace(na_val, np.nan)

##### 3. Clean column names
- export hourly dataframe

In [None]:
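### Column names: assign by variable abbreviation, not by position,
### because listdir() returns the files in arbitrary order
col_names = {"DD010": "wind_dir", "FF010": "wind_speed", "GSM": "sun_minutes",
             "NC": "cloud_cover", "RH002": "rel_humid", "RR": "precip",
             "TS000": "t_surface", "TT002": "t_air"}
df_m60.columns = [next(new for abbr, new in col_names.items() if col.startswith(abbr))
                  for col in df_m60.columns]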

### Export as Pickle and csv
df_m60.to_pickle("../data/df_weather_2012_2022.pkl")
df_m60.to_csv("../data/df_weather_2012_2022.csv")


##### 4. Daily data
4.1 Setting up the daylight package and defining objects and functions for its proper usage

In [None]:
### Import libraries
import daylight as dl
import pytz
import datetime as dt

In [None]:
### Define epoch function that returns a Unix timestamp (seconds since 1970-01-01 UTC)
### The default time zone is Coordinated Universal Time (UTC)
### This is the input format expected by the daylight methods
def epoch(year, month, day, hour=0, minute=0, second=0, 
          tz=pytz.UTC): 
    return int(tz.localize(dt.datetime(year, month, day, hour, minute, second)).timestamp())
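
To make the return value of `epoch()` concrete, the sketch below reproduces it with the standard library only. It uses `datetime.timezone.utc` instead of `pytz.UTC`, a substitution made here to keep the example self-contained, and the helper name `epoch_utc` is hypothetical.

```python
import datetime as dt

def epoch_utc(year, month, day, hour=0, minute=0, second=0):
    """Seconds since 1970-01-01 00:00:00 UTC for the given UTC date and time."""
    moment = dt.datetime(year, month, day, hour, minute, second,
                         tzinfo=dt.timezone.utc)
    return int(moment.timestamp())

print(epoch_utc(1970, 1, 1))   # 0
print(epoch_utc(2023, 3, 15))  # 1678838400
```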

In [None]:
### Define Timezone of Hamburg
tz_hh = pytz.timezone("Europe/Berlin")

### Difference between Hamburg and Coordinated Universal Time (UTC) in hours
### Note: evaluated for the current time, so the result is 1.0 (CET) or 2.0 (CEST)
### depending on when the notebook is run
tz_hh_offset = tz_hh.utcoffset(dt.datetime.utcnow()).total_seconds()/3600

### Sunclock object located at the coordinates of Hamburg (Central Station)
sun_hh = dl.Sunclock(53.552736, 10.005490, tz_hh_offset)
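
The UTC offset of Europe/Berlin is not constant: it is 1 hour during standard time (CET) and 2 hours during daylight saving time (CEST). The sketch below evaluates the offset for two fixed dates. It uses the standard-library `zoneinfo` module (Python 3.9+) in place of `pytz`, and the helper `utc_offset_hours` is a hypothetical name. How `dl.Sunclock` uses its offset argument is not covered here.

```python
import datetime as dt
from zoneinfo import ZoneInfo

def utc_offset_hours(when, tz_name="Europe/Berlin"):
    """UTC offset in hours of the given time zone at the local time `when`."""
    aware = when.replace(tzinfo=ZoneInfo(tz_name))
    return aware.utcoffset().total_seconds() / 3600

print(utc_offset_hours(dt.datetime(2022, 1, 15, 12)))  # 1.0 (CET)
print(utc_offset_hours(dt.datetime(2022, 7, 15, 12)))  # 2.0 (CEST)
```

Passing a fixed date instead of the current time makes the offset reproducible between runs.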

In [None]:
### Test Daylength in Minutes
(sun_hh.sunset(epoch(2023,3,15)) - \
    sun_hh.sunrise(epoch(2023,3,15)) ) \
    / 60 

In [None]:
### Function to get unique dates as strings from datetime-series
def unique_dates(dt_series):
    return dt_series.dt.strftime("%Y-%m-%d").unique()

### Define function to split date strings ("YYYY-MM-DD") into integers for epoch()
### Returns an array with one row per date and the columns
###     year as int
###     month as int (1-12)
###     day as int (1-31)
### Each row can be passed to epoch() to build the input for the daylight functions sunrise() and sunset()

def get_epoch_input(dt_unique):
    # Define empty lists for years, months, days
    years = []
    months = []
    days = []

    # extract parts from input and append to corresponding lists as integers
    for dt_str in dt_unique:
        dt_parts = [int(i) for i in dt_str.split("-")]

        years.append(dt_parts[0])
        months.append(dt_parts[1])
        days.append(dt_parts[2])
    
    ### Return as array with one column each for year, month, and day
    ### This makes vectorized computation of day lengths easier
    return np.array([years, 
                months, 
                days]).T

### Define Function to get daylength in minutes
def get_daylength(sun_obj, epoch_input):
    ep_dates = [epoch(row[0], row[1], row[2]) for row in epoch_input] ### Epochs of the dates as input for sunrise/sunset

    # get sunrise, sunset as timestamp integer in seconds
    sr_series = sun_obj.sunrise(ep_dates)
    ss_series = sun_obj.sunset(ep_dates)

    # cast to datetime object
    sr_dt = pd.to_datetime(sr_series, unit = "s")
    ss_dt = pd.to_datetime(ss_series, unit = "s")

    daylen = (ss_series - sr_series)/60

    return pd.DataFrame({"day_minutes": daylen, "sunrise": sr_dt, "sunset": ss_dt})
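
The (year, month, day) array built by `get_epoch_input()` can also be derived from a `DatetimeIndex` without going through strings. This is a sketch of an alternative, assuming the input is a pandas `DatetimeIndex` like the index of `df_m60`; the function name `ymd_array` is hypothetical.

```python
import numpy as np
import pandas as pd

def ymd_array(dt_index):
    """Return an array with one row per unique calendar day: [year, month, day]."""
    days = dt_index.normalize().unique()  # midnight of each day, duplicates removed
    return np.column_stack([days.year, days.month, days.day])

# Toy hourly index covering two days across a year boundary
toy_index = pd.date_range("2012-12-31 22:30:00", periods=4,
                          freq=pd.Timedelta(hours=1))
print(ymd_array(toy_index))
```

The result has the same layout as the output of `get_epoch_input()`, so it could be passed to `get_daylength()` in the same way.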

In [None]:
### Series of the hourly timestamps of the weather measurements (index of df_m60)
dt_series = pd.Series(df_m60.index)

4.2 Apply daylight-functions to get sunrise, sunset and calculate the daylength for each day of the time series

In [None]:
# Series of dates 2012-2022
ud = unique_dates(dt_series)

# Get an Array of year, month, day integers
epoch_input = get_epoch_input(ud)
# Calculate daylength
df_day = get_daylength(sun_hh, epoch_input)
df_day["date"] = pd.to_datetime(ud)
df_day.set_index("date", inplace=True)

##### 4.3 Daylength and weather summary
- Include sunrise and sunset as datetimes in the returned DataFrame
- Summarize the hourly data per day
- Append the day length to the summarized weather data
- Export the DataFrame
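
One detail matters for the daily sums once missing values are NaN: pandas skips NaN when summing, so a day with no valid measurement at all sums to 0 rather than NaN. If such days should stay missing, `min_count=1` can be passed. A minimal sketch on made-up data (the values and the two-day range are invented for illustration):

```python
import numpy as np
import pandas as pd

# Four made-up readings per day; the second day has no valid measurement
idx = pd.date_range("2012-01-01 00:30:00", periods=8, freq=pd.Timedelta(hours=6))
precip = pd.Series([0.0, 1.5, np.nan, 0.5, np.nan, np.nan, np.nan, np.nan], index=idx)

by_day = precip.groupby(precip.index.date)
daily_default = by_day.sum()            # all-NaN day becomes 0.0
daily_strict = by_day.sum(min_count=1)  # all-NaN day stays NaN

print(daily_default.tolist())  # [2.0, 0.0]
print(daily_strict.tolist())   # [2.0, nan]
```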


In [None]:
### Aggregate hourly weather data by day (calendar date of the datetime index)
df_weather_day = df_m60.groupby(df_m60.index.date).agg({
    "wind_dir": "mean",
    "sun_minutes": "sum",
    "rel_humid": "mean",
    "cloud_cover": "mean",
    "t_air": "mean",
    "t_surface": "mean",
    "precip": "sum",
    "wind_speed": "mean"})

df_weather_day.index = pd.to_datetime(df_weather_day.index)

### Append column of daylength to daily-aggregated weather
df_weather_day = pd.concat([df_weather_day, df_day], axis = 1)

### Calculate the fraction of sunshine per day in relation to the actual daylength (sunrise till sunset)
df_weather_day["sun_time_fraction"] = df_weather_day.sun_minutes /  df_weather_day.day_minutes 

# Export data
df_weather_day.to_pickle("../data/df_weather_daily.pkl")
df_weather_day.to_csv("../data/df_weather_daily.csv")
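
A caveat for `wind_dir`: directions in degrees are circular, so an arithmetic mean can mislead near north (350° and 10° average to 180°). A common alternative is the circular mean, which averages the unit vectors of the directions. The sketch below is an optional variant, not the aggregation used above; the function name `circular_mean_deg` is hypothetical and wind speed is not used as a weight.

```python
import numpy as np

def circular_mean_deg(degrees):
    """Circular mean of directions in degrees, ignoring NaN; result in [0, 360)."""
    rad = np.deg2rad(np.asarray(degrees, dtype=float))
    mean_sin = np.nanmean(np.sin(rad))  # mean y-component of the unit vectors
    mean_cos = np.nanmean(np.cos(rad))  # mean x-component of the unit vectors
    return np.rad2deg(np.arctan2(mean_sin, mean_cos)) % 360

print(round(circular_mean_deg([350, 20]), 6))  # 5.0
print(round(circular_mean_deg([80, 100]), 6))  # 90.0
```

Such a function could be passed to `agg` for the `wind_dir` column in place of the mean.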