# Week 04 Assignment weather data

Welcome to week four of the course Programming 1. You will organise your data into the required format and apply smoothing. In this assignment we will work with weather data from the KNMI. A subset of the weather data is available to you in the file `KNMI_20181231.csv`. The data consists of daily weather observations from several stations over several years. Your task is to make a plot similar to the one below.


<img src="images/weather.png" alt="drawing" width="400"/>


Furthermore, the plot needs the following enhancements:

1. proper titles and ticks
2. a slider widget selecting a particular year or all years
3. lines need to be smoothed
4. legends need to be added

Use your creativity. Consider colors, alpha settings, sizes etc. 

Learning outcomes

- load, inspect and clean a dataset 
- reformat dataframes
- apply smoothing techniques
- visualize timeseries data

The assignment consists of 6 parts:

- [part 1: load the data](#0)
- [part 2: clean the data](#1)
- [part 3: reformat data](#2)
- [part 4: smooth the data](#3)
- [part 5: visualize the data](#4)
- [part 6: Challenge](#5)

Parts 1 and 5 are mandatory; part 6 is optional (bonus).
To pass the assignment you need a score of 60%.


---

In [1]:
# IMPORTS
import pandas as pd
import numpy as np
from pathlib import Path
import yaml
import matplotlib.pyplot as plt

In [2]:
def get_config():
    with open("config.yaml", 'r') as stream:
        config = yaml.safe_load(stream)
    return config

config = get_config()
data_dir = config['datadir']

<a name='0'></a>
## Part 1: Load the data

Load either the dataset `KNMI_20181231.csv` or `KNMI_20181231.txt.tsv`. The column headers contain spaces and are not very self-explanatory. Change them into more readable ones. Select the data from station 270, and keep only the mean, minimum and maximum temperature. The data should look something like this:
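A minimal sketch of the renaming step on an in-memory sample, assuming the file has a header row whose names carry leading spaces. The header names `STN`, `YYYYMMDD`, `TG`, `TN`, `TX` and all sample values are assumptions for illustration, not taken from the course file.

```python
import io

import pandas as pd

# Hypothetical sample: the padded header names and the values are made up.
raw = io.StringIO(
    "STN,YYYYMMDD,   TG,   TN,   TX\n"
    "270,20180101,58,31,79\n"
    "270,20180102,61,40,85\n"
    "391,20180101,12,-18,47\n"
)

sample = pd.read_csv(raw)

# Strip the padding from the header names, then map them to readable ones.
sample.columns = sample.columns.str.strip()
sample = sample.rename(columns={
    "STN": "station",
    "YYYYMMDD": "date",
    "TG": "Tmean",
    "TN": "Tmin",
    "TX": "Tmax",
})

# Keep only the rows of station 270.
sample_270 = sample[sample["station"] == 270]
print(sample_270)
```

If the file has no usable header row, renaming by column position (as in the cell below) is the alternative.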


In [3]:
# Read in the data
knmi_data = pd.read_csv(Path(data_dir) / "KNMI_20181231.txt.tsv", comment="#", header = None, low_memory = False)

# Remove leading white spaces from the columns names
# knmi_data.columns = knmi_data.columns.str.lstrip()

# Rename the columns
knmi_data.rename(columns={0: "station",
                         1:"date",
                         2: "Tmean",
                         3: "Tmin",
                         4: "Tmax"}, inplace = True)

# Only keep interesting columns
cols_to_keep = ["station", "date", "Tmean", "Tmin", "Tmax"]
knmi_data = knmi_data[cols_to_keep]

# Get data from station 270
data_270 = knmi_data[knmi_data["station"] == 270]
data_270.head()
knmi_data

Unnamed: 0,station,date,Tmean,Tmin,Tmax
0,209,20010130,,,
1,209,20010131,,,
2,209,20010201,,,
3,209,20010202,,,
4,209,20010203,,,
...,...,...,...,...,...
331311,391,20181227,12,-18,47
331312,391,20181228,7,-29,30
331313,391,20181229,59,25,92
331314,391,20181230,78,52,87


---

<a name='1'></a>
## Part 2: Clean the data

The data is not clean. There are empty cells in the dataframe which need to be replaced with NaNs, and the temperature is given in tenths of a degree Celsius (0.1 °C), which needs to be converted into degrees. The date field needs a datetime format. For visualization convenience we would like to remove the leap day (29 February). Conduct the cleaning.

In [4]:

# knmi_data[['Tmean', 'Tmin', 'Tmax']] = knmi.[['Tmean', 'Tmin', 'Tmax']].map(lambda x:x/10)
# knmi_data[['Tmean', 'Tmin', 'Tmax']].applymap(lambda x:x/10)
knmi_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331316 entries, 0 to 331315
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   station  331316 non-null  int64 
 1   date     331316 non-null  int64 
 2   Tmean    331316 non-null  object
 3   Tmin     331316 non-null  object
 4   Tmax     331316 non-null  object
dtypes: int64(2), object(3)
memory usage: 12.6+ MB


In [5]:
try:
    # Replace the empty cells with NaN values
    knmi_data = knmi_data.replace(r'^\s*$', np.nan, regex=True)

    # Change data format
    
    # Station
    knmi_data.station = knmi_data.station.astype("int64")
    
    # Date
#     knmi_data.date = knmi_data.date.apply(lambda a: "{}-{}-{}".format(str(int(a))[0:4],str(int(a))[4:6], str(int(a))[6:])) 
#     knmi_data["date"] = knmi_data["date"].astype("datetime64")
    knmi_data["date"]  = pd.to_datetime(knmi_data['date'].astype(str), format='%Y%m%d')
    
    # Tmean, Tmin, Tmax
    knmi_data[["Tmean", "Tmin", "Tmax"]] = knmi_data[["Tmean", "Tmin", "Tmax"]].astype("float64")
    
    def is_leapCheck(s):
        """
        Flag the leap days (29 February) in a series of dates.
        
        :parameters
        -----------
        s - pandas Series of datetime64
            Dates
            
        :returns
        --------
        leap - pandas Series of bool
            True where the date is 29 February, False otherwise.
        """
        # Checking month and day is enough: 29 February only exists in leap years.
        leap = (s.dt.month == 2) & (s.dt.day == 29)
        return leap

    # Exclude the leap days
    knmi_data = knmi_data[~is_leapCheck(knmi_data["date"])]
    
    # Convert Tmean, Tmin and Tmax from tenths of a degree to degrees Celsius
    knmi_data[['Tmean', 'Tmin', 'Tmax']] = knmi_data[['Tmean', 'Tmin', 'Tmax']] / 10
    
    # Set date as the index
    # knmi_data.set_index("date", inplace = True, drop = False)
    
except Exception as e:
    print(f"Dataframe has already been cleaned. error: {e}")


knmi_data

Unnamed: 0,station,date,Tmean,Tmin,Tmax
0,209,2001-01-30,,,
1,209,2001-01-31,,,
2,209,2001-02-01,,,
3,209,2001-02-02,,,
4,209,2001-02-03,,,
...,...,...,...,...,...
331311,391,2018-12-27,1.2,-1.8,4.7
331312,391,2018-12-28,0.7,-2.9,3.0
331313,391,2018-12-29,5.9,2.5,9.2
331314,391,2018-12-30,7.8,5.2,8.7


In [6]:
knmi_data[knmi_data.date.dt.year == 2018].nunique()

station     47
date       365
Tmean      367
Tmin       322
Tmax       417
dtype: int64

In [7]:
knmi_270 = knmi_data[knmi_data["station"] == 270]
knmi_270

Unnamed: 0,station,date,Tmean,Tmin,Tmax
97641,270,2000-01-01,4.2,-0.4,7.9
97642,270,2000-01-02,5.5,3.3,7.4
97643,270,2000-01-03,7.4,4.9,8.9
97644,270,2000-01-04,4.6,2.2,7.5
97645,270,2000-01-05,4.1,1.4,5.6
...,...,...,...,...,...
104576,270,2018-12-27,5.7,5.3,6.2
104577,270,2018-12-28,7.1,5.8,8.1
104578,270,2018-12-29,8.5,6.9,10.2
104579,270,2018-12-30,8.0,6.8,9.0


In [8]:
knmi_data[knmi_data.date.dt.year == 2018].nunique()

station     47
date       365
Tmean      367
Tmin       322
Tmax       417
dtype: int64

In [9]:
#replace cells containing only spaces with NaN
#change data formats
#change temperatures to degrees Celsius
#remove leap day

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')</li>
    <li>regex for empty cells = `^\s*$` </li>
    <li>remove month == 2 & day == 29</li> 
</ul>
</details>

In [10]:
#Test your outcome
knmi_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 331078 entries, 0 to 331315
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype         
---  ------   --------------   -----         
 0   station  331078 non-null  int64         
 1   date     331078 non-null  datetime64[ns]
 2   Tmean    237394 non-null  float64       
 3   Tmin     237395 non-null  float64       
 4   Tmax     237395 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1)
memory usage: 15.2 MB


### Expected outcome

---

<a name='2'></a>
## Part 3: Reformat your data

First we split the data into the data from 2018 and the data from before 2018. It is best to store these in two separate dataframes. 
Next, for the non-2018 data, we need the minimum and the maximum value for each calendar day. For example, we look for the lowest of all 1 January minimum values, regardless of the year. Create a dataframe with 365 days containing the ultimate minimum and the ultimate maximum per day. 
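The per-day aggregation described above can be sketched as follows. This is one possible approach on synthetic data, not the required solution; `daily_extremes` is a hypothetical helper name and the temperatures are random numbers.

```python
import numpy as np
import pandas as pd


def daily_extremes(df):
    """Lowest Tmin and highest Tmax per calendar day (hypothetical helper)."""
    # Group on month and day so that the year is ignored.
    keys = [df["date"].dt.month.rename("Month"), df["date"].dt.day.rename("Day")]
    return df.groupby(keys).agg(Tmin=("Tmin", "min"), Tmax=("Tmax", "max"))


# Synthetic stand-in for the station data: three non-leap years of made-up values.
dates = pd.date_range("2001-01-01", "2003-12-31", freq="D")
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": dates,
    "Tmin": rng.normal(5, 4, len(dates)),
    "Tmax": rng.normal(14, 4, len(dates)),
})

extremes = daily_extremes(toy)
print(extremes.shape)  # (365, 2)
```

Named aggregation keeps flat column names, so the result can be used directly as the lower and upper edge of the band in Part 5.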


In [27]:
# Get the dataframe for all rows that were before 2018
before_2018 = knmi_270[knmi_270['date'].dt.year < 2018]

# Get the dataframe for all rows that have the year 2018 or higher
after_2018 = knmi_270[knmi_270['date'].dt.year >= 2018]

b_2018 = before_2018.copy() # create copy from both dfs
a_2018 = after_2018.copy()

# Create separate month and day columns
b_2018["Month"] = b_2018['date'].dt.month
b_2018["Day"] = b_2018['date'].dt.day

# Grab the minimum value for each day across all years
min_vals = b_2018.groupby(['Month', 'Day']).Tmin.min()

# Another possible solution:
# test = b_2018.groupby(['Month', 'Day']).agg({'Tmin': ['min'], "Tmax":['max']}).reset_index()
# test.columns = ["Month", "Day", "Tmin", "Tmax"] # reset the column names to turn it back into a normal DataFrame

min_vals

Month  Day
1      1      -5.8
       2      -7.5
       3     -12.6
       4      -6.7
       5      -6.2
              ... 
12     27     -6.0
       28     -7.4
       29     -7.3
       30    -10.2
       31    -10.6
Name: Tmin, Length: 365, dtype: float64

In [28]:
def month_day(df_multipleyears):
    # Work on a copy so the caller's dataframe is left untouched
    df_oneyear = df_multipleyears.copy()
    
    # Create separate month and day columns
    df_oneyear["Month"] = df_oneyear['date'].dt.month
    df_oneyear["Day"] = df_oneyear['date'].dt.day
    
    # Take the minimum value for each calendar day across all years of the input
    df_groupedbymonthday = df_oneyear.groupby(['Month', 'Day']).Tmin.min()

    print(df_groupedbymonthday)
    return df_groupedbymonthday
    
df_monthday = month_day(before_2018)

Month  Day
1      1      -5.8
       2      -7.5
       3     -12.6
       4      -6.7
       5      -6.2
              ... 
12     27     -6.0
       28     -7.4
       29     -7.3
       30    -10.2
       31    -10.6
Name: Tmin, Length: 365, dtype: float64


In [29]:
#Test your code
def test_reformed(df): 
    # Restrict the data to the years 2008-2017 before aggregating
    df = df[(df.date.dt.year > 2007) & (df.date.dt.year < 2018)]
    df_monthday = month_day(df)
    
    return df_monthday

df_monthday = test_reformed(knmi_270)


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use the dt.month and dt.day to groupby</li>
</ul>
</details>

### Expected outcome
Note: the layout or names may differ, but the length should be 365 and the minimum values should be the same.

---

<a name='3'></a>
## Part 4: Smooth the data

Make a function that takes an array or a dataframe column and returns an array of smoothed data. Explain in words why you chose a certain smoothing algorithm.


In [None]:
#your code here
#your motivation here
# x = df_monthday.index.get_level_values(1)
x = np.linspace(1, 365, 365)
y = df_monthday.values
plt.scatter(x, y)
plt.show()
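
As an illustration of what such a smoothing function could look like, here is a minimal sketch of a centred moving average. It is only one option among many (a Savitzky-Golay filter such as `scipy.signal.savgol_filter` is another); the function name `smooth`, the default window of 15 days and the test signal are assumptions. The series is padded circularly, on the assumption that day 365 is followed by day 1 again.

```python
import numpy as np
import pandas as pd


def smooth(values, window=15):
    """Centred moving average with circular padding (illustrative sketch)."""
    y = np.asarray(values, dtype=float)
    half = window // 2
    # Wrap the series around so both ends of the year get a full window.
    padded = np.pad(y, half, mode="wrap")
    rolled = pd.Series(padded).rolling(window, center=True, min_periods=1).mean()
    # Cut the padding off again to return an array of the original length.
    return rolled.to_numpy()[half:half + len(y)]


# Made-up seasonal signal with noise, standing in for the daily values.
days = np.arange(365)
rng = np.random.default_rng(42)
noisy = 10 - 8 * np.cos(2 * np.pi * days / 365) + rng.normal(0, 2, days.size)
smoothed = smooth(noisy, window=15)
```

A moving average is easy to explain and has a single parameter, but it flattens sharp peaks; a wider window gives a smoother line at the cost of detail.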

---

<a name='4'></a>
## Part 5: Visualize the data

Plot the mean temperature of the year 2018. Create a shaded band with the ultimate minimum values and the ultimate maximum values from the multi-year dataset. Add labels, titles and legends. Use proper ranges. Be creative to make the plot attractive. 



<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use from bokeh.models import Band</li>
    <li>use ColumnDataSource to parse data arrays</li>
    <li>look for xaxis tick formatters</li>
</ul>
</details>
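
The hints above point to Bokeh's `Band`. As a rough sketch of the same idea, the block below uses matplotlib's `fill_between` instead, since matplotlib is already imported in this notebook. All values are synthetic, and the colours, labels and title are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for the real series (365 days each).
days = np.arange(1, 366)
seasonal = 10 - 8 * np.cos(2 * np.pi * (days - 15) / 365)
record_min = seasonal - 12
record_max = seasonal + 12
mean_2018 = seasonal + np.random.default_rng(1).normal(0, 2, days.size)

fig, ax = plt.subplots(figsize=(8, 4))

# Shaded band between the multi-year minimum and maximum.
ax.fill_between(days, record_min, record_max, alpha=0.3,
                label="Min/max before 2018")
# Mean temperature of the single year on top of the band.
(line,) = ax.plot(days, mean_2018, color="tab:red", linewidth=1,
                  label="Mean 2018")

# One tick at the first day of each month of a non-leap year.
month_starts = np.cumsum([0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30]) + 1
ax.set_xticks(month_starts)
ax.set_xticklabels(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
ax.set_xlim(1, 365)
ax.set_xlabel("Month")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Mean temperature 2018 against multi-year extremes (synthetic)")
ax.legend()
fig.tight_layout()
```

In Bokeh the structure is similar: the band and the line share one data source, and the month labels come from a tick formatter.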

---

<a name='5'></a>
## Part 6: Challenge

Make a widget in which you can select the year range for the multiyear set. Add this to your layout to make the plot interactive. Add another widget to select or deselect the smoother. Inspiration: https://demo.bokeh.org/weather