## This notebook contains information about the data normalization/standardization

This notebook implements standardization using `StandardScaler` from scikit-learn. In time series analysis, the approach to standardization can be slightly different.

For example, if we are trying to see how a certain variable changes over seasons in a year or over different months, we would typically standardize the data on a yearly basis. We would normalize the data for each year, using its mean and variance, and then move on to the next year. Similarly, if we want to observe how a variable changes across different days of the week, we would apply weekly standardization.

Usually, when performing time series prediction with a certain lookback window, it’s a good idea to normalize the data across that lookback window only, on a rolling basis. In this case, you would normalize the input, save the mean and standard deviation, make predictions, then denormalize the predictions with the saved mean and standard deviation, and finally calculate the loss.

However, in our case, we will normalize the data globally to make things simpler and easier. Most people take this approach because it’s straightforward.


`StandardScaler` normalizes data column wise. It calculates mean and variance for each column and normalize each column separately.

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import warnings

#ignore warnings for the notebook

warnings.filterwarnings('ignore')

# We normalize data using mean 0.0 and variance 1.0

def standardize_river_forecast_data(df, variables, method='global'):
    """
    Standardize multiple variables.
    It normalizes data i.e. each column in the dataframe to have mean 0 and variance 1.
    
    Parameters:
    df (pandas.DataFrame): The input dataframe with a datetime index
    variables (list): List of column names to standardize
    method (str): 'global', 'yearly', or 'monthly'
    
    Returns:
    pandas.DataFrame: The dataframe with standardized columns
    """
    standardized_df = df.copy()
    df.reset_index(inplace=True)
    df['DATE'] = pd.to_datetime(df['DATE'])
    df.set_index('DATE', inplace=True)
    
    if method == 'global':
        scaler = StandardScaler()
        standardized_df[variables] = scaler.fit_transform(df[variables])
    
    elif method == 'yearly':
        for year in df.index.year.unique():
            year_data = df[df.index.year == year]
            scaler = StandardScaler()
            standardized_df.loc[year_data.index, variables] = scaler.fit_transform(year_data[variables])
    
    elif method == 'monthly':
        for month in range(1, 13):
            month_data = df[df.index.month == month]
            scaler = StandardScaler()
            standardized_df.loc[month_data.index, variables] = scaler.fit_transform(month_data[variables])
    
    else:
        raise ValueError("Method must be 'global', 'yearly', or 'monthly'")
    
    return standardized_df

In [6]:
df = pd.read_csv('../dataset/final_data.csv')

In [7]:
variables_to_standardize = list(df.columns)

if 'DATE' in variables_to_standardize:
    variables_to_standardize.remove('DATE')

global_std_df = standardize_river_forecast_data(df, variables_to_standardize, method='global')
yearly_std_df = standardize_river_forecast_data(df, variables_to_standardize, method='yearly')
monthly_std_df = standardize_river_forecast_data(df, variables_to_standardize, method='monthly')

print("Non standardized data (first 5 rows):")
print(df.head())

print("Global standardization (first 5 rows):")
print(global_std_df.head())

print("\nYearly standardization (first 5 rows):")
print(yearly_std_df.head())

print("\nMonthly standardization (first 5 rows):")
print(monthly_std_df.head())

# The standardized dataframes can now be used for further analysis and modeling

Non standardized data (first 5 rows):
                     index  Unnamed: 0  Precip  WetBulbTemp  DryBulbTemp  \
DATE                                                                       
2008-01-01 01:00:00      0           0     0.0         37.0         38.0   
2008-01-01 02:00:00      1           1     0.0         38.0         39.0   
2008-01-01 03:00:00      2           2     0.0         38.0         40.0   
2008-01-01 04:00:00      3           3     0.0         38.0         40.0   
2008-01-01 05:00:00      4           4     0.0         41.0         45.0   

                     RelHumidity  WindSpeed  StationPressure  gauge_height  
DATE                                                                        
2008-01-01 01:00:00         89.0        0.0            29.19        5.6450  
2008-01-01 02:00:00         89.0        5.0            29.18        5.5425  
2008-01-01 03:00:00         86.0        5.0            29.18        5.4575  
2008-01-01 04:00:00         86.0        6.0 

In [8]:
global_std_df.head()

Unnamed: 0.1,Unnamed: 0,DATE,Precip,WetBulbTemp,DryBulbTemp,RelHumidity,WindSpeed,StationPressure,gauge_height
0,-1.732039,2008-01-01 01:00:00,-0.103183,-1.33112,-1.502346,0.919314,-1.078557,0.108053,-0.311826
1,-1.732015,2008-01-01 02:00:00,-0.103183,-1.26186,-1.440811,0.919314,0.065279,0.043514,-0.346246
2,-1.731991,2008-01-01 03:00:00,-0.103183,-1.26186,-1.379277,0.77616,0.065279,0.043514,-0.37479
3,-1.731967,2008-01-01 04:00:00,-0.103183,-1.26186,-1.379277,0.77616,0.294046,0.108053,-0.406692
4,-1.731942,2008-01-01 05:00:00,-0.103183,-1.05408,-1.071605,-0.082763,1.437883,0.237133,-0.455384


In [17]:
# Let's check mean and variance after normalization for each column

for col in global_std_df.columns:
    if col != 'DATE':
        print(f'{col:15}   : Mean : {global_std_df[col].mean():.2f}  Variance: {global_std_df[col].var():.2f}')

Unnamed: 0        : Mean : -0.00  Variance: 1.00
Precip            : Mean : -0.00  Variance: 1.00
WetBulbTemp       : Mean : -0.00  Variance: 1.00
DryBulbTemp       : Mean : 0.00  Variance: 1.00
RelHumidity       : Mean : -0.00  Variance: 1.00
WindSpeed         : Mean : -0.00  Variance: 1.00
StationPressure   : Mean : 0.00  Variance: 1.00
gauge_height      : Mean : -0.00  Variance: 1.00
