# Chicago Weather Forecasting: Model Experimentation

In this notebook I shall explain how one can build models for weather forecasting and offer an evaluation of these models.

## Problem Scope Definition

Before we proceed with modelling, we need to define what the scope of this project is.

### _(1) What we are forecasting_

The goal is to predict a number of future weather characteristics for the city of Chicago. We shall select the basic components of the weather that a layperson would consider for planning purposes, such as deciding whether to go on a Sunday picnic on the beach or go hiking.

With that in mind, the following quantities are chosen:

- Temperature (in degrees Fahrenheit)
- Wind Speed (in miles per hour)
- Precipitation (whether or not there will be rain, snow or hail)
- Cloudiness (whether or not the sky will be covered in clouds)

Of those, forecasting Temperature and Wind Speed are **Regression** problems, while forecasting Precipitation and Cloudiness are **Binary Classification** problems.

We shall attempt to predict weather for the following durations in advance:

- 6 hours
- 12 hours
- 18 hours
- 24 hours

(Preliminary experimentation showed that longer-term forecasts are not feasible with the data available.)

### _(2)  Model Dataset_

As explained previously, we shall be relying on US Government (NOAA) datasets containing **hourly** weather reports for the weather station in Chicago as well as nearby stations in the US Midwest. 

In this notebook we shall work with 10 years of data, 2011-2020 (inclusive), from the following locations:

- Chicago, IL (target location)
- Cedar Rapids, IA
- Des Moines, IA
- Rochester, MN
- Quincy, IL
- Madison, WI
- St Louis, MO
- Green Bay, WI
- Lansing, MI

Most of these locations are **west** of Chicago, because our earlier correlation analysis showed that locations in that direction have much more effect on the weather in Chicago than locations in other directions.
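The correlation analysis itself is not reproduced in this notebook. As a rough illustration of the idea only, the sketch below measures how well a neighbouring station's readings, shifted by a number of hours, track the target station. The function `lagged_correlation` and the synthetic series are assumptions made for this example and are not taken from the earlier analysis.

```python
import numpy as np
import pandas as pd

def lagged_correlation(target, neighbour, lag_hours):
    """Pearson correlation between target[t] and neighbour[t - lag_hours]."""
    return target.corr(neighbour.shift(lag_hours))

# Synthetic stand-in data (an assumption, not NOAA data):
# the "upwind" station sees the same signal 6 hours before "chicago" does.
rng = np.random.default_rng(42)
signal = pd.Series(rng.normal(size=500)).rolling(5, min_periods=1).mean()
upwind = signal
chicago = signal.shift(6) + rng.normal(scale=0.05, size=500)

# Try lags from 0 to 12 hours and keep the one with the highest correlation
correlations = {lag: lagged_correlation(chicago, upwind, lag) for lag in range(0, 13)}
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

On real data one would run this per station and per variable. A station whose best lag is clearly positive and whose correlation is high is a plausible upstream predictor.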

Finally, we shall be using preprocessed reports rather than the non-intuitive raw NOAA reports. In the sample below, the timestamp is followed by the 4 quantities we aim to forecast, as well as a few more columns for demo purposes:

In [9]:
import pandas as pd

df = pd.read_csv('../processed-data/noaa_2011-2020_chicago_PREPROC.csv')
subset_df = df[['DATE', 'Temp', 'WindSpeed', '_is_precip', '_is_cloudy', 'CloudCondition', 'WeatherType',
                'Pressure', 'Humidity', '_wind_dir_sin', '_wind_dir_cos']]
subset_df.head(10)

Unnamed: 0,DATE,Temp,WindSpeed,_is_precip,_is_cloudy,CloudCondition,WeatherType,Pressure,Humidity,_wind_dir_sin,_wind_dir_cos
0,2011-01-01 00:00:00,40.333333,13.0,0,1,Cloudy,NoPrecipitation,29.72,71.333333,-0.939693,-0.3420201
1,2011-01-01 01:00:00,37.0,17.0,0,1,Cloudy,NoPrecipitation,29.735,70.0,-0.984808,-0.1736482
2,2011-01-01 02:00:00,36.0,17.0,0,1,Cloudy,NoPrecipitation,29.75,70.0,-0.866025,-0.5
3,2011-01-01 03:00:00,32.0,15.0,0,1,MostlyCloudy,NoPrecipitation,29.75,61.0,-0.866025,-0.5
4,2011-01-01 04:00:00,31.0,16.0,0,0,PartlyCloudy,NoPrecipitation,29.76,61.0,-0.866025,-0.5
5,2011-01-01 05:00:00,28.0,18.0,0,0,MostlyClear,NoPrecipitation,29.77,63.0,-0.866025,-0.5
6,2011-01-01 06:00:00,27.5,17.0,0,1,MostlyCloudy,NoPrecipitation,29.785,67.5,-0.866025,-0.5
7,2011-01-01 07:00:00,25.0,20.0,0,1,MostlyCloudy,NoPrecipitation,29.81,75.0,-0.866025,-0.5
8,2011-01-01 08:00:00,23.0,21.0,0,1,Cloudy,NoPrecipitation,29.87,65.0,-0.866025,-0.5
9,2011-01-01 09:00:00,21.0,23.0,0,1,Cloudy,NoPrecipitation,29.89,62.0,-0.939693,-0.3420201


### _(3)  Aggregated Forecasting_

As explained previously, the datasets are chronological lists of hourly data points. What does it mean to forecast each of the target quantities, say, 12h in advance?

Predicting weather for a particular hour may not serve us particularly well. Consider the following situations: 

- Let's say it is 11PM and we are considering a picnic at 11AM the next day. If it does not rain at 11AM but it does rain at 10AM or 1PM, the picnic is a bad idea.

- Similarly, if we are considering kayaking and the wind is going to be 5mph at 11AM but 30mph at 2PM, we should reconsider.

*To address such concerns, we shall be attempting to forecast not weather for the exact target hour but rather some kind of **aggregation over an interval** centered around that hour.*

In the code to follow we shall rely on a parameter called the **Aggregation Half Interval (AHI)**: the number of hours the interval extends on each side of the target hour. For example, if AHI = 3 and the target hour is 11AM, we shall be considering the interval spanning 8AM to 2PM.

Let us now define what that means for each of the 4 forecasted quantities:

| Quantity | AHI | Aggregation Rule |
| --- | --- | --- |
| Temperature | 1h | Average |
| WindSpeed   | 2h | Average |
| Precipitation | 3h | True if any element is True |
| Cloudiness | 3h  | True if any element is True |

The first two rows, for the continuous quantities, are self-explanatory: we smooth the prediction over an interval by averaging. Temperature has a smaller interval because it depends much more directly on the time of day than wind does.

The last two rows for binary quantities say that if *any* hour during the interval is Rainy or Cloudy, the resulting forecast too is Rainy or Cloudy. As per the situation described above, if it rains anywhere close to the hour for which we are forecasting, we'll get wet. Similarly, if it is cloudy anywhere close to that hour, our sun tanning won't go well. 
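The aggregation rules in the table can be sketched with a centered rolling window in pandas. This is a minimal illustration, assuming gap-free hourly data. The helper `aggregate_around_hour` and the toy series are hypothetical and are not part of the notebook's own code.

```python
import pandas as pd

def aggregate_around_hour(series, ahi, rule):
    """Aggregate an hourly series over [t - ahi, t + ahi] for every hour t."""
    window = 2 * ahi + 1  # e.g. AHI = 3 -> 7 hourly readings
    rolling = series.rolling(window=window, center=True, min_periods=window)
    if rule == "average":
        return rolling.mean()
    if rule == "any":
        # For 0/1 flags, the window maximum is 1 exactly when any hour is flagged
        return rolling.max()
    raise ValueError("unknown rule: {}".format(rule))

# Toy day indexed by hour (0..23): rain only at 10:00, a wind spike at 14:00
precip = pd.Series(0, index=range(24))
precip.loc[10] = 1
wind = pd.Series(5.0, index=range(24))
wind.loc[14] = 30.0

precip_agg = aggregate_around_hour(precip, ahi=3, rule="any")
wind_agg = aggregate_around_hour(wind, ahi=2, rule="average")

print(precip_agg.loc[11])  # 1.0: the 08:00-14:00 window contains the 10:00 rain
print(wind_agg.loc[12])    # 10.0: mean of 5, 5, 5, 5 and 30
```

With AHI = 3, the single rainy hour at 10:00 marks every target hour from 07:00 to 13:00 as rainy. Hours too close to either end of the series have an incomplete window and come out as `NaN`.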

# Preparing the Learning Data Set

## (1) _Merge Data from All Locations_

We need to do a JOIN on all the weather reports whose data we'll be feeding into our models. The following functions perform the merge and drop any irrelevant columns:

In [13]:
def buildFeatureSet(targetLocationFile, adjacentLocationFiles, predictedVariable, featuresToUse):
    target_df = pd.read_csv(targetLocationFile, parse_dates=['DATE'])
    target_df = dropUnusedColumns(target_df, predictedVariable, featuresToUse)
    merged_df = target_df
    suffix_no = 1

    # Merge adjacent location files one by one relying on DATE
    for adjacentLocationFile in adjacentLocationFiles:
        adjacent_df = pd.read_csv(adjacentLocationFile, parse_dates=['DATE'])
        adjacent_df = dropUnusedColumns(adjacent_df, predictedVariable, featuresToUse)

        # Suffix every column with the location number so names stay unique after the merge
        adjacent_df = adjacent_df.add_suffix(str(suffix_no))
        adjacent_df = adjacent_df.rename(columns = {"DATE{}".format(suffix_no) :'DATE'})
        merged_df = pd.merge(merged_df, adjacent_df, on='DATE')
        suffix_no = suffix_no + 1

    # DATE column is of no use in the modelling stage (we only needed it for merging)
    merged_df = merged_df.drop(columns=['DATE'])
    return merged_df

#======================================================================
# Keep only the DATE column, the variable we are predicting and the variables that we use for prediction
def dropUnusedColumns(df, predictedVariable, featuresToUse):
    # Build the column list without duplicates: predictedVariable may also be listed as a feature
    all_columns = list(dict.fromkeys(featuresToUse + ['DATE', predictedVariable]))
    df = df[all_columns]

    return df


Quick illustration:

In [14]:
featureset = buildFeatureSet(
    '../processed-data/noaa_2011-2020_chicago_PREPROC.csv',
    ['../processed-data/noaa_2011-2020_cedar-rapids_PREPROC.csv', 
         '../processed-data/noaa_2011-2020_des-moines_PREPROC.csv'],
    predictedVariable='WindSpeed',
    featuresToUse = ['_wind_dir_sin', '_wind_dir_cos']
    )
featureset.head()

Unnamed: 0,_wind_dir_sin,_wind_dir_cos,WindSpeed,_wind_dir_sin1,_wind_dir_cos1,WindSpeed1,_wind_dir_sin2,_wind_dir_cos2,WindSpeed2
0,-0.939693,-0.34202,13.0,-0.802123,-0.597159,23.666667,-0.866025,-0.5,23.5
1,-0.984808,-0.173648,17.0,-0.642788,-0.766044,25.0,-0.939693,-0.34202,24.0
2,-0.866025,-0.5,17.0,-0.766044,-0.642788,23.0,-0.984808,-0.173648,22.0
3,-0.866025,-0.5,15.0,-0.866025,-0.5,23.0,-0.939693,-0.34202,22.0
4,-0.866025,-0.5,16.0,-0.866025,-0.5,23.0,-0.939693,-0.34202,16.0


We have a set of 3 variables of interest: `WindSpeed` (predicted) as well as `_wind_dir_sin` and `_wind_dir_cos` (to be used for predicting). As you can see, the dataset above has these variables repeated 3 times, once for each location. This kind of merged dataset will be used going forward.
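One property of the merge is worth seeing on a small example: `pd.merge` defaults to an inner join, so an hour missing from any one station disappears from the merged dataset. The sketch below uses two made-up frames (the values are illustrative, not NOAA data) and the same suffix-then-rename step as `buildFeatureSet`.

```python
import pandas as pd

target = pd.DataFrame({
    "DATE": pd.to_datetime(["2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 02:00"]),
    "WindSpeed": [13.0, 17.0, 17.0],
})
neighbour = pd.DataFrame({
    # The 01:00 report is missing at this station
    "DATE": pd.to_datetime(["2011-01-01 00:00", "2011-01-01 02:00"]),
    "WindSpeed": [23.5, 22.0],
})

# Suffix every column, then restore the join key's name
neighbour = neighbour.add_suffix("1").rename(columns={"DATE1": "DATE"})

# validate="one_to_one" raises if either side has duplicate timestamps
merged = pd.merge(target, neighbour, on="DATE", how="inner", validate="one_to_one")
print(merged)
```

Only the 00:00 and 02:00 rows survive. If the preprocessed files are not guaranteed to be gap-free, the hourly sequence after merging may have holes, which matters once rows are treated as consecutive hours.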

## (2) _Split and Normalize the Data_

Before we can train models we must split the data into three subsets:

- *Training*: the actual data that we'll be training on. This is the largest subset.
- *Validation*: the data used during training to tune the model and check it periodically.
- *Testing*: the data that is hidden from the model training process and used for the final model evaluation.

Of course, we'll also need to normalize the features on which we are training to avoid algorithmic issues such as exploding gradients. The following code achieves both:

In [16]:
import warnings
warnings.filterwarnings('ignore')

def normalizeData(trainDf, valDf,  testDf, predictedVariable, featuresToUse, adjacentLocationCount):

    columns_to_normalize = featuresToUse.copy()

    prefixes_to_normalize = featuresToUse.copy()
    prefixes_to_normalize.append(predictedVariable)
    for loc in range(1, 1 + adjacentLocationCount):
        for prefix in prefixes_to_normalize:
            columns_to_normalize.append("{}{}".format(prefix, loc))

    # Normalize the input columns only, using training-set statistics;
    # the target location's predictedVariable column stays in its original units
    train_mean = trainDf[columns_to_normalize].mean()
    train_std = trainDf[columns_to_normalize].std()

    trainDf[columns_to_normalize] = (trainDf[columns_to_normalize] - train_mean) / train_std
    valDf[columns_to_normalize] = (valDf[columns_to_normalize] - train_mean) / train_std
    testDf[columns_to_normalize] = (testDf[columns_to_normalize] - train_mean) / train_std

    return trainDf, valDf, testDf


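# Split the data chronologically: 6 years for training, 2 for validation & 2 for testing.
# .copy() gives each subset its own data, so normalizing does not write into slices of featureset.
n = len(featureset)
train_df = featureset[0 : int(n*0.60)].copy()
val_df = featureset[int(n*0.60) : int(n*0.80)].copy()
test_df = featureset[int(n*0.80) : ].copy()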

# Normalize input data
train_df, val_df, test_df = normalizeData(train_df, val_df, test_df, 
                                          'WindSpeed', ['_wind_dir_sin', '_wind_dir_cos'], 2)

train_df.head()


Unnamed: 0,_wind_dir_sin,_wind_dir_cos,WindSpeed,_wind_dir_sin1,_wind_dir_cos1,WindSpeed1,_wind_dir_sin2,_wind_dir_cos2,WindSpeed2
0,-1.188696,-0.466556,13.0,-1.144549,-0.762454,2.369524,-1.33216,-0.567817,2.530257
1,-1.256421,-0.236236,17.0,-0.902651,-0.987495,2.595942,-1.448306,-0.362728,2.621978
2,-1.078108,-0.682661,17.0,-1.089776,-0.823255,2.256315,-1.519435,-0.144148,2.255092
3,-1.078108,-0.682661,15.0,-1.241564,-0.63299,2.256315,-1.448306,-0.362728,2.255092
4,-1.078108,-0.682661,16.0,-1.241564,-0.63299,2.256315,-1.448306,-0.362728,1.154431
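The mean and standard deviation are computed on the training subset alone and then reused for the validation and test subsets, so no information from the later years leaks into training. The following self-contained check illustrates this on synthetic data. The frame `toy` and its single `feature` column are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic feature with mean 10 and standard deviation 3 (not weather data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.normal(loc=10.0, scale=3.0, size=1000)})

# Chronological 60/20/20 split, copied so each part can be modified safely
n = len(toy)
toy_train = toy.iloc[: int(n * 0.60)].copy()
toy_val = toy.iloc[int(n * 0.60) : int(n * 0.80)].copy()
toy_test = toy.iloc[int(n * 0.80) :].copy()

# Statistics come from the training part only
mean, std = toy_train["feature"].mean(), toy_train["feature"].std()
for part in (toy_train, toy_val, toy_test):
    part["feature"] = (part["feature"] - mean) / std

print(toy_train["feature"].mean(), toy_train["feature"].std())
print(toy_val["feature"].mean(), toy_val["feature"].std())
```

The training part ends up with mean 0 and standard deviation 1 up to rounding. The validation and test parts are only close to those values, which is expected.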


## (3) _Prepare the Data for TensorFlow_

We still have further to go before we can use TensorFlow to build models. 

First, we need to create a Sliding Window type of data structure containing a number of past observations. For example, if we are forecasting _Temperature_ 12h in advance and we want to look back 3 hours, we need `Temperature[-12h], Temperature[-13h], Temperature[-14h]` all in one row.
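The row layout described above can be sketched with `Series.shift`. This is an illustration under stated assumptions, not the notebook's own windowing code: the helper `make_windowed_rows`, its column names and the toy series are all hypothetical.

```python
import pandas as pd

def make_windowed_rows(series, horizon, lookback):
    """One row per target hour t: the target plus `lookback` past readings.

    The newest reading used is `horizon` hours before t.
    """
    columns = {"target": series}
    for step in range(lookback):
        lag = horizon + step
        columns["lag_{}h".format(lag)] = series.shift(lag)
    # Rows near the start lack a full history, so drop them
    return pd.DataFrame(columns).dropna()

# Toy hourly series whose value equals its hour index, to make the lags easy to read
temperature = pd.Series([float(v) for v in range(30)], name="Temp")
windowed = make_windowed_rows(temperature, horizon=12, lookback=3)
print(windowed.head(3))
```

The first complete row is hour 14: its target is 14.0 and its inputs are the readings from hours 2, 1 and 0, that is, 12h, 13h and 14h earlier.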

Second, TensorFlow is quite finicky about the form the input data should take.