## Normalization of Data

Normalization is the process of adjusting the values in a dataset so that they fall within a common range, 
often with a mean of 0 and a standard deviation of 1. It ensures that all variables contribute equally to the analysis, 
especially when they are on different scales. 

Normalization is important in machine learning algorithms as it helps in speeding up the training process and 
improving model performance.

### Explanation of the Code

The function `standardize_river_forecast_data` normalizes specified variables in a time-series dataset by 
scaling them to have a mean of 0 and a variance of 1. The function can apply different standardization methods:
- **Global**: Standardizes all data at once.
- **Yearly**: Standardizes data year-by-year.
- **Monthly**: Standardizes data month-by-month.

It uses `StandardScaler` from `sklearn` to perform the normalization and allows flexibility in how the data is 
scaled depending on the selected `method`.

*Note: Periodic normalization is useful when the data exhibits seasonal or cyclical patterns, such as yearly or monthly trends. For example, in time series data like weather, sales, or river flow forecasts, different periods (e.g., months or years) may have varying characteristics. Normalizing data periodically (e.g., monthly or yearly) helps preserve these patterns while ensuring each period is independently scaled, improving model performance in seasonally driven data.*


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize_river_forecast_data(df, variables, method='global'):
    """
    Standardize multiple variables.
    It normalizes data i.e. each column in the dataframe to have mean 0 and variance 1.
    
    Parameters:
    df (pandas.DataFrame): The input dataframe with a datetime index
    variables (list): List of column names to standardize
    method (str): 'global', 'yearly', or 'monthly'
    
    Returns:
    pandas.DataFrame: The dataframe with standardized columns
    """
    standardized_df = df.copy()
    df.reset_index(inplace=True)
    df['DATE'] = pd.to_datetime(df['DATE'])
    df.set_index('DATE', inplace=True)
    
    if method == 'global':
        scaler = StandardScaler()
        standardized_df[variables] = scaler.fit_transform(df[variables])
    
    elif method == 'yearly':
        for year in df.index.year.unique():
            year_data = df[df.index.year == year]
            scaler = StandardScaler()
            standardized_df.loc[year_data.index, variables] = scaler.fit_transform(year_data[variables])
    
    elif method == 'monthly':
        for month in range(1, 13):
            month_data = df[df.index.month == month]
            scaler = StandardScaler()
            standardized_df.loc[month_data.index, variables] = scaler.fit_transform(month_data[variables])
    
    else:
        raise ValueError("Method must be 'global', 'yearly', or 'monthly'")
    
    return standardized_df

In [2]:
df = pd.read_csv('../dataset/final_data.csv')

In [3]:
variables_to_standardize = list(df.columns)

if 'DATE' in variables_to_standardize:
    variables_to_standardize.remove('DATE')

global_std_df = standardize_river_forecast_data(df, variables_to_standardize, method='global')
yearly_std_df = standardize_river_forecast_data(df, variables_to_standardize, method='yearly')
monthly_std_df = standardize_river_forecast_data(df, variables_to_standardize, method='monthly')

print("Non standardized data (first 5 rows):")
print(df.head())

print("Global standardization (first 5 rows):")
print(global_std_df.head())

print("\nYearly standardization (first 5 rows):")
print(yearly_std_df.head())

print("\nMonthly standardization (first 5 rows):")
print(monthly_std_df.head())

# The standardized dataframes can now be used for further analysis and modeling

Non standardized data (first 5 rows):
                     index  gauge_heights  DryBulbTemp  Precip  RelHumidity  \
DATE                                                                          
2008-01-01 00:00:00      0         5.7375         38.0     0.0         86.0   
2008-01-01 01:00:00      1         5.6450         38.0     0.0         89.0   
2008-01-01 02:00:00      2         5.5425         39.0     0.0         89.0   
2008-01-01 03:00:00      3         5.4575         40.0     0.0         86.0   
2008-01-01 04:00:00      4         5.3625         40.0     0.0         86.0   

                     Stationpressure  WetBulbTemp  WindSpeed  
DATE                                                          
2008-01-01 00:00:00            29.20         36.0        0.0  
2008-01-01 01:00:00            29.19         37.0        0.0  
2008-01-01 02:00:00            29.18         38.0        5.0  
2008-01-01 03:00:00            29.18         38.0        5.0  
2008-01-01 04:00:00           