# Predicting waterbody depth and flow rate

The goal of this Notebook is to demonstrate a universal approach to predicting depth and flow rate for waterbodies of four different types:

| Waterbody Type | Predicted Features    |
| :-------------:|:---------------------:|
| Water Spring   | Flow Rate             |
| Lake           | Lake Level, Flow Rate |
| River          | Hydrometry            |
| Aquifer        | Depth to Groundwater  |

Each waterbody is analyzed separately. Modelling is based on the data available for that particular waterbody and does not necessarily use the same features for all objects of the same type, since their behaviour differs considerably.

The proposed algorithm automatically selects input features for each target value based on correlation between the parameters, creates several machine learning models at runtime, and cross-validates their accuracy in terms of the coefficient of determination (R2 score). It then selects the best model and makes a prediction for the specified future period (day, week or month, depending on the model frequency).
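The selection step can be sketched in a few lines. This is only an illustration of the idea, not the notebook's actual function: the helper name `pick_model` and the scores are hypothetical, and the 0.6 cut-off mirrors the threshold that `evaluate_models` in Block 5 applies before falling back to a simple trend-based prediction.

```python
from typing import Dict, Optional

R2_THRESHOLD = 0.6  # assumed to match the cut-off used in evaluate_models


def pick_model(cv_scores: Dict[str, float],
               threshold: float = R2_THRESHOLD) -> Optional[str]:
    """Return the name of the best-scoring model, or None if no model
    reaches the threshold (in which case a fallback is needed)."""
    best_name = max(cv_scores, key=cv_scores.get)
    return best_name if cv_scores[best_name] >= threshold else None


# Hypothetical cross-validated R2 scores, for illustration only
good = {'RandomForest': 0.71, 'ExtraTrees': 0.74, 'KNeighbors': 0.52}
poor = {'RandomForest': 0.31, 'ExtraTrees': 0.28, 'KNeighbors': 0.12}

print(pick_model(good))  # best model is accepted
print(pick_model(poor))  # no model is accepted, fallback required
```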

Predictions can be made for the next period, the second period, the third period and so on, one at a time. The specific future period and the data frequency (daily, weekly or monthly) are defined by arguments passed to the modelling function.
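The forecast horizon comes down to how far the target column is shifted relative to the inputs. A minimal sketch on made-up weekly data (the values and dates are assumptions; the `steps_ahead` name follows the argument used by the modelling functions in Block 5):

```python
import pandas as pd

# Hypothetical weekly series of a target variable
df = pd.DataFrame(
    {'Flow_Rate': [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range('2020-01-05', periods=5, freq='W'),
)

steps_ahead = 2  # predict the value two periods after the input row
# Each row now carries the target observed steps_ahead periods later
df['Target'] = df['Flow_Rate'].shift(-steps_ahead)
# The last steps_ahead rows have no known future value and are dropped
train = df.dropna()
print(train)
```

With `steps_ahead = 2` the last two rows cannot be used for training, so longer horizons leave slightly fewer training rows.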

In theory, this approach does not have any limitations; however, testing shows that accuracy is higher for short-term forecasts and shorter periods (daily or weekly values).

### Structure of the Notebook
- Block 1: Imports and settings
- Block 2: File paths
- Block 3: Targets and models
- Block 4: EDA functions
- Block 5: Preprocessing and modelling functions
- Block 6: EDA and modelling
- Block 7: Models explained

### Block 1: Imports and settings

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams.update({'font.size': 10})

### Block 2: File paths

In [None]:
# Aquifers
Auser_path = '/kaggle/input/acea-water-prediction/Aquifer_Auser.csv'
Doganella_path = '/kaggle/input/acea-water-prediction/Aquifer_Doganella.csv'
Luco_path = '/kaggle/input/acea-water-prediction/Aquifer_Luco.csv'
Petrignano_path = '/kaggle/input/acea-water-prediction/Aquifer_Petrignano.csv'

In [None]:
# Lake
Bilancino_path = '/kaggle/input/acea-water-prediction/Lake_Bilancino.csv'

In [None]:
# River
Arno_path = '/kaggle/input/acea-water-prediction/River_Arno.csv'

In [None]:
# Water springs
Amiata_path = '/kaggle/input/acea-water-prediction/Water_Spring_Amiata.csv'
Lupa_path = '/kaggle/input/acea-water-prediction/Water_Spring_Lupa.csv'
Madonna_path = '/kaggle/input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv'

### Block 3: Targets and models

In [None]:
# Column names for target variables
targets = {
    'Auser': [
        'Depth_to_Groundwater_SAL',
        'Depth_to_Groundwater_CoS',
        'Depth_to_Groundwater_LT2'
        ],
    'Doganella': [
        'Depth_to_Groundwater_Pozzo_1',
        'Depth_to_Groundwater_Pozzo_2',
        'Depth_to_Groundwater_Pozzo_3',
        'Depth_to_Groundwater_Pozzo_4',
        'Depth_to_Groundwater_Pozzo_5',
        'Depth_to_Groundwater_Pozzo_6',
        'Depth_to_Groundwater_Pozzo_7',
        'Depth_to_Groundwater_Pozzo_8',
        'Depth_to_Groundwater_Pozzo_9'
        ],
    'Luco': [
        'Depth_to_Groundwater_Podere_Casetta'
        ],
    'Petrignano': [
        'Depth_to_Groundwater_P24',
        'Depth_to_Groundwater_P25'
        ],
    'Bilancino': [
        'Lake_Level', 
        'Flow_Rate'
        ],
    'Arno': [
        'Hydrometry_Nave_di_Rosano'
        ],
    'Amiata': [
        'Flow_Rate_Bugnano',
        'Flow_Rate_Arbure',
        'Flow_Rate_Ermicciolo',
        'Flow_Rate_Galleria_Alta'
        ],
    'Lupa': [
        'Flow_Rate_Lupa'
        ],
    'Madonna': [
        'Flow_Rate_Madonna_di_Canneto'
        ]
    }

In [None]:
# Models to be compared
models = [('RandomForest', RandomForestRegressor()),
          ('ExtraTrees', ExtraTreesRegressor()),
          ('GradientBoosting', make_pipeline(StandardScaler(), GradientBoostingRegressor())),
          ('KNeighbors', make_pipeline(StandardScaler(), KNeighborsRegressor()))]

In [None]:
# Splits and shuffle for cross-validation
kf = KFold(3, shuffle=True, random_state=1)

In [None]:
# For applying various data frequencies
resampling = {'monthly': 'M', 'weekly': 'W', 'daily': 'D'}

### Block 4: EDA functions
Functions defined in this section are used for analysis of data as it is, before feature engineering and modelling.

In [None]:
def plot_nans(df: pd.DataFrame, obj_id: str):
    """Function calculates percentage of missing values by column
    and creates a bar plot."""
    rows, _ = df.shape
    missing_values = df.isna().sum() / rows * 100
    missing_values = missing_values[missing_values != 0]
    missing_values.sort_values(inplace=True)
    title = obj_id + ' missing values'
    plt.barh(missing_values.index, missing_values.values)
    plt.xlabel('Percentage (%)')
    plt.title(title)
    plt.show()

In [None]:
def plot_distribution(df: pd.DataFrame):
    """Function plots a histogram for parameter distribution."""
    df.hist(bins=20, figsize=(14, 10))
    plt.show()

In [None]:
def plot_correlation(df: pd.DataFrame, obj_id: str, targets: list):
    """Function calculates correlation between parameters
    and creates a heatmap."""
    title = obj_id + ' Heatmap'
    targets_correlation = df.corr()[targets]
    ax = sns.heatmap(targets_correlation, center=0, annot=True, cmap='RdBu_r')
    l, r = ax.get_ylim()
    ax.set_ylim(l + 0.5, r - 0.5)
    plt.yticks(rotation=0)
    plt.title(title)
    plt.show()

In [None]:
def plot_timeseries(df: pd.DataFrame, obj_id: str, targets: list):
    """Function plots target variable against the timescale."""
    for target in targets:
        plt.plot(df.index, df[target], label=target)
    title = obj_id + ' actual data'
    plt.legend()
    plt.title(title)
    plt.show()

In [None]:
def plot_seasonality(df: pd.DataFrame, targets: list):
    """Function creates a seasonal decomposition plot for target variables.
    Temporary interpolation of missing values is performed on a resampled
    monthly data, which does not affect the original dataset."""
    for target in targets:
        monthly_interpolated = df[target].resample('M').mean().interpolate(method='akima').dropna()
        
        decomposition = seasonal_decompose(monthly_interpolated)
        observed = decomposition.observed
        trend = decomposition.trend
        seasonal = decomposition.seasonal
        residual = decomposition.resid
        dates = monthly_interpolated.index

        plt.plot(dates, observed, label='Original data')
        plt.plot(dates, trend, label='Trend')
        plt.plot(dates, seasonal, label='Seasonal')
        plt.plot(dates, residual, label='Residual')
        plt.legend()
        plt.title(f'{target} seasonal decomposition')
        plt.tight_layout()
        plt.show()

### Block 5: Preprocessing and modelling functions

In [None]:
def data_cleaning(df: pd.DataFrame):
    """Function replaces 0 values with np.nan in all columns except rainfall."""
    for column in df.columns:
        if 'Rainfall' not in column:
            df[column] = df[column].replace(0, np.nan)
    return df

In [None]:
def get_data(path: str):
    """Function extracts data from a csv file and converts date column
    to datetime index."""
    df = pd.read_csv(path)
    df.dropna(subset=['Date'], inplace=True)  # Madonna_di_Canneto dataset has empty rows
    # Parse dates explicitly (day/month/year); the date_parser argument
    # of read_csv is deprecated in recent pandas versions
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    df.set_index('Date', inplace=True)
    # Remove erroneous 0 values from all columns except rainfall
    df = data_cleaning(df)
    return df

In [None]:
def resample_data(df: pd.DataFrame, freq: str):
    """Function converts daily data into weekly or monthly averages."""
    return df.resample(freq).mean()

In [None]:
def add_seasonality(df: pd.DataFrame):
    """Function adds columns specifying year, month, week and day of a year
    and binary column for rainy season (October through April)."""
    df['Year'] = df.index.year
    df['Month'] = df.index.month
    # DatetimeIndex.weekofyear was removed in pandas 2.0
    df['Week'] = df.index.isocalendar().week.to_numpy(dtype=int)
    df['Day'] = df.index.dayofyear
    df['Rainy_Season'] = df['Month'].apply(lambda x: 0 if 5 <= x <= 9 else 1)
    return df

In [None]:
def add_weekly_averages(df: pd.DataFrame):
    """Function adds weekly rolling average values for rainfall and temperature,
    which are used as additional features in daily datasets."""
    for column in df.columns:
        if column.find('Rainfall') > -1 or column.find('Temperature') > -1:
            df[f'{column}_weekly'] = df[column].rolling(7).mean()
    return df

In [None]:
def select_features(df: pd.DataFrame):
    """Function creates a list of most correlated features for the target
    (absolute correlation of at least 0.2). The 'Target' column itself
    is included in the list."""
    target_correlation = df.corr()['Target']
    most_correlated = target_correlation[(target_correlation >= 0.2) | (target_correlation <= -0.2)].index.tolist()
    return most_correlated

In [None]:
def get_X_y(df: pd.DataFrame, target: str, steps_ahead: int):
    """Function splits data into input features and target variable,
    returns a tuple containing list of features, X and y."""
    X = df.copy()
    # Add a column that contains target value for the predicted future period
    # (shift target column backwards for required number of periods)
    X['Target'] = X[target].shift(-steps_ahead)
    # Reduce input to the most correlated features
    features = select_features(X)
    # Here the last row of actual inputs is lost
    # because there is no future period target value for it
    # (dropna returns a new frame, which avoids modifying a slice in place)
    X = X[features].dropna()
    y = X.pop('Target')
    return features[:-1], X, y

In [None]:
def evaluate_models(input_data: pd.DataFrame, target: pd.Series, target_name: str):
    """Function estimates cross-val score for several models.
    If the highest cv score is at least 0.6, returns a tuple
    with the model name and the fitted model, otherwise returns a tuple
    with an empty string and None."""

    best_cv = -1
    best_model = None
    best_model_name = ''

    for name, model in models:
        cv_r2 = cross_val_score(model, input_data, target, cv=kf, scoring='r2').mean()
        cv_mae = - cross_val_score(model, input_data, target, cv=kf, scoring='neg_mean_absolute_error').mean()
        # Use the root of the squared error so that the printed value is an actual RMSE
        cv_rmse = - cross_val_score(model, input_data, target, cv=kf, scoring='neg_root_mean_squared_error').mean()
        print(f'{name} cross-val score for {target_name}:\n\tR2 = {cv_r2}\n\tMAE = {cv_mae}\n\tRMSE = {cv_rmse}')

        if cv_r2 > best_cv:
            best_cv = cv_r2
            best_model = model
            best_model_name = name

    if best_cv >= 0.6:
        best_model.fit(input_data, target)
        return best_model_name, best_model
    else:
        return '', None

In [None]:
def simple_prediction(ts: pd.Series, n_periods_1: int, n_periods_2: int, steps_ahead: int):
    """Function returns a prediction based on the actual value of the target variable
    for the latest period and two linear trend predictions equally weighted."""
    # Actual last value in the time series
    last_period_value = ts.iloc[-1]
    # Linear trend based on n_periods_1
    X_1 = np.arange(1, n_periods_1 + 1).reshape(-1, 1)
    linear_prediction_1 = LinearRegression().fit(X_1, ts.tail(n_periods_1)).predict([[n_periods_1 + steps_ahead]])[0]
    # Linear trend based on n_periods_2
    X_2 = np.arange(1, n_periods_2 + 1).reshape(-1, 1)
    linear_prediction_2 = LinearRegression().fit(X_2, ts.tail(n_periods_2)).predict([[n_periods_2 + steps_ahead]])[0]
    # Average of the three values
    prediction = (last_period_value + linear_prediction_1 + linear_prediction_2) / 3
    return prediction

In [None]:
def modelling(df: pd.DataFrame, targets: list, obj_id: str, freq: str, steps_ahead: int):
    """Function preprocesses data, creates models and estimates their accuracy,
    gets prediction for the future period from the best model if R2 >= 0.6
    or uses simple prediction based on the last actual value and linear trends."""
    print(f'\nCreating {freq} model for {obj_id}\n')
    df = resample_data(df, resampling[freq])  # Change data frequency
    if freq == 'daily':
        df = add_weekly_averages(df)  # Add weekly rolling averages as a feature

    # Select input for each target that contains the most correlated features
    for target in targets:
        features, X, y = get_X_y(df, target, steps_ahead)
        model_name, model = evaluate_models(X, y, target)

        if model_name:  # Best R2 >= 0.6
            # Get the actual last row of input data from the dataset
            # (if there are NaNs, get the last row with all required features)
            input_data = df[features].dropna()
            input_date = input_data.index.max()
            input_data = input_data.iloc[len(input_data) - 1, :].values.reshape(1, -1)
            prediction = model.predict(input_data)[0]
        else:  # Low R2
            model_name = 'Average and linear trend'
            features = [target]
            input_data = df[target].dropna()
            input_date = input_data.index.max()
            # Take into account last value and trends of the last 5 and 10 periods
            prediction = simple_prediction(input_data, 5, 10, steps_ahead)

        print(f'\n{model_name} {freq} prediction for {target}: {prediction}')
        print(f'\nInput features: {", ".join(features)}\nInput date: {input_date}')
        print(f'Prediction for {steps_ahead} step(s) ahead.\n')

### Block 6: EDA and modelling
This section is structured by waterbody type and has an additional part illustrating how the algorithm could be used to predict target values based on daily, weekly and monthly data for various time horizons (next period, second, third, etc.):
1. Aquifers
2. Lakes
3. Rivers
4. Water Springs
5. Flexible data frequencies and forecast horizons

For each object data is extracted and processed following the same steps:
- extraction of data from a csv file;
- filtering out erroneous 0 values;
- data analysis and plotting of graphs;
- feature engineering to enhance target dependence on input data;
- selection of the most correlated features;
- modelling based on several ML algorithms;
- comparing cross-val scores and selection of the best model;
- forecasting.

Further explanation of the algorithm and feature engineering techniques is presented in Block 7.

### 1. Aquifers

#### 1.1. Auser Aquifer

In [None]:
data = get_data(Auser_path)

In [None]:
target_cols = targets['Auser']

A statistical breakdown of the numeric parameters of the dataset is presented below. Some data cleaning was performed after extracting data from the csv file. In particular, 0 values in many columns of this and other datasets indicate missing values: they are too distant from the neighboring values to be within the normal range of probable measurements. In all columns except rainfall, 0 values were replaced with np.nan.

In rainfall columns it is not obvious which 0 values (if any) are placeholders for NaN, because zero rainfall is a realistic measurement. Simply replacing all zeros with NaN would not discriminate between actual 0 values and errors. Filtering based on comparison between neighboring values is also not an option in this case, since dry weather could follow rainy periods and vice versa.

As a result, some level of noise in rainfall measurements could be present, which will affect the accuracy of the models.
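The effect of this cleaning rule can be seen on a tiny made-up frame. The column `Rainfall_A` and all values are hypothetical; the rule itself (zeros become NaN everywhere except in rainfall columns) follows the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical three-day extract: a zero depth reading is treated as missing,
# while zero rainfall is kept as a plausible measurement
df = pd.DataFrame({
    'Rainfall_A': [0.0, 5.2, 0.0],
    'Depth_to_Groundwater_SAL': [-5.7, 0.0, -5.6],
})

non_rain = [c for c in df.columns if 'Rainfall' not in c]
cleaned = df.copy()
cleaned[non_rain] = cleaned[non_rain].replace(0, np.nan)
print(cleaned)
```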

In [None]:
data.describe()

In this dataset the share of missing values in depth and volume columns varies greatly, from 30% to 70%. Target variables have between 40% and 50% of missing values.

Rainfall data almost uniformly has about 35% of NaNs. Temperature measurements have relatively few missing values. Hydrometry data is available for only two locations but has a low share of NaNs.

In [None]:
plot_nans(data, 'Auser aquifer')

Distribution of target variables, volume and temperature tends to be mostly normal. However, at some locations abnormalities are seen that could signal significant changes in waterbody behavior or result from inconsistent data.

Data on rainfall follows exponential distribution pattern. Amount of rainfall tends to be low for the most part. Extensive raining is rare.

Hydrometry data is available for only two locations and shows inconsistent distribution patterns.

In [None]:
plot_distribution(data)

The heatmap shows inconsistent correlation between target variables and other parameters. Depth to groundwater at SAL and CoS visibly correlates with other measurements of depth and with some measurements of hydrometry and volume. Depth to groundwater at LT2, by contrast, shows low correlation with most other depth columns and medium to strong correlation with most volume columns. All three target variables demonstrate low correlation with rainfall.

In [None]:
plot_correlation(data, 'Auser Aquifer', target_cols)

Plot of the target values against time scale shows that some targets demonstrate slight increasing trend while others follow seasonal patterns in a more or less stable range of values.

In [None]:
plot_timeseries(data, 'Auser Aquifer', target_cols)

To plot the seasonal decomposition of target values, the original data was resampled to monthly frequency and missing values were interpolated with the "akima" method, which fills in missing values in seasonally varying data more naturally and smoothly than linear interpolation.

As the plot shows, trends in target values change direction unpredictably over the period of observation. The seasonal component is comparable in scale to the residuals. This means that seasonal factors account for a smaller part of the observed behaviour of the target variables than other factors.
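To make the comparison between seasonal and residual components concrete, the sketch below runs a simplified additive decomposition on a synthetic monthly series. It is only an approximation of what `seasonal_decompose` computes (a centred 2x12 moving average for the trend, then per-month averages of the detrended values). The function name `naive_decompose`, the synthetic series and its parameters are assumptions made for illustration and are not taken from the datasets.

```python
import numpy as np
import pandas as pd


def naive_decompose(monthly: pd.Series) -> pd.DataFrame:
    """Simplified additive decomposition of a monthly series (illustrative)."""
    # Centred 2x12 moving average: half weight on the two outermost months
    weights = np.r_[0.5, np.ones(11), 0.5] / 12
    trend = monthly.rolling(13, center=True).apply(
        lambda w: np.dot(w, weights), raw=True)
    detrended = monthly - trend
    # Average detrended value per calendar month, centred on zero
    month_means = detrended.groupby(detrended.index.month).mean()
    month_means = month_means - month_means.mean()
    seasonal = pd.Series(month_means.loc[monthly.index.month].to_numpy(),
                         index=monthly.index)
    resid = monthly - trend - seasonal
    return pd.DataFrame({'trend': trend, 'seasonal': seasonal, 'resid': resid})


# Synthetic series: slow trend + yearly cycle + random noise
rng = np.random.default_rng(0)
idx = pd.date_range('2015-01-01', periods=72, freq='MS')
t = np.arange(72)
series = pd.Series(
    10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 72),
    index=idx)

parts = naive_decompose(series)
# Compare the spread of the seasonal component with that of the residuals
print(parts[['seasonal', 'resid']].std())
```

In this synthetic case the seasonal spread is clearly larger than the residual spread; the text above describes the opposite situation, where the two are of similar size.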

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Auser Aquifer', 'monthly', 1)

#### Notes
The target variable that showed a low R2 score for all four tested models was predicted using a simple approach based on the latest actual value and linear trend. See the explanation in Block 7.

The other two target variables could be reliably predicted with whichever of the four tested models shows the best result (highest R2 score).

In this case, as in modelling other waterbodies, input features include parameters whose correlation with the next-period target value is at least 0.2 or at most -0.2. The targets required different sets of input features:
- the linear trend and average model for the first target is based solely on previous values of the target variable;
- the second target was predicted using a wide range of features including rainfall, depth, volume, temperature, hydrometry and seasonal columns;
- the last target was predicted based on input that included several depth and volume columns, hydrometry and year.
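The correlation filter can be illustrated on a small made-up frame. Column names `a`, `b`, `c` and all values are hypothetical; only the 0.2 threshold and the `Target` column name come from `select_features` in Block 5.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [2, 4, 6, 8, 10, 12, 14, 16],     # rises with the target
    'b': [-1, -2, -3, -4, -5, -6, -7, -8],  # falls as the target rises
    'c': [1, -1, -1, 1, 1, -1, -1, 1],      # unrelated to the target
    'Target': [1, 2, 3, 4, 5, 6, 7, 8],
})

corr = df.corr()['Target']
# Keep columns with absolute correlation of at least 0.2
selected = corr[corr.abs() >= 0.2].index.tolist()
print(selected)
```

The `Target` column always passes the filter because it correlates perfectly with itself, which is why `get_X_y` drops the last entry of the list before reporting the features.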

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Auser Aquifer', 'daily', 1)

#### Notes
Modelling target values on a daily basis showed high R2 scores and low mean absolute errors on the tested regression models. Better performance could be explained by more datapoints available for training compared to monthly resampled dataset and comparatively easier task of predicting the next day value.

All four of the tested models potentially could be applied to predict target values for the next day.

For predicting daily values of depth to groundwater at SAL, a wide range of input columns is used instead of the single column that was used for the monthly prediction. Inputs include depth, volume, temperature, rainfall, hydrometry and seasonal columns.

Inputs for models that predict other target variables are similar to those used by monthly models, though data frequency in this case is daily. 

#### 1.2. Doganella Aquifer

In [None]:
data = get_data(Doganella_path)

In [None]:
target_cols = targets['Doganella']

In [None]:
data.describe()

In this dataset the most scarce data is volume (has about 80% of missing values). Target value columns have 55% to 60% of NaNs. The lowest share of missing values is in temperature and rainfall columns, however measurements for only two locations are available.

In [None]:
plot_nans(data, 'Doganella aquifer')

Data on rainfall is distributed exponentially with prevalence of low rainfall days. Temperature distribution is close to normal.

Distribution of data on depth demonstrates abnormal patterns and could not be universally described or defined. Distribution of data on volume also varies significantly from location to location.

In [None]:
plot_distribution(data)

Most of target value columns show moderate to high correlation with other depth columns. Correlation with volume and temperature is usually lower and less ubiquitous. Practically no correlation of depth with rainfall could be seen in the original data before feature engineering is applied.
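One feature engineering step used in this notebook is a 7-day rolling average of rainfall and temperature (see `add_weekly_averages` in Block 5), which turns isolated daily spikes into a smoother signal. A minimal illustration on made-up rainfall values (the numbers and dates are assumptions):

```python
import pandas as pd

# Hypothetical daily rainfall with two isolated rainy days
rain = pd.Series(
    [0, 0, 21, 0, 0, 0, 0, 14, 0, 0], dtype=float,
    index=pd.date_range('2020-01-01', periods=10, freq='D'))

# The first six days have no full 7-day window and stay NaN
weekly = rain.rolling(7).mean()
print(pd.DataFrame({'rain': rain, 'rain_weekly': weekly}))
```

The smoothed column stays above zero for a week after each rainy day, so it carries information about recent rainfall that a single daily value does not.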

In [None]:
plot_correlation(data, 'Doganella Aquifer', target_cols)

Targets vary greatly both in absolute values and in patterns and trends. For some of the targets large gaps in data or abrupt shifts are seen on the graph. However, filtering out these abnormalities requires a custom approach and could not be achieved by applying any general rule to all target values.

In [None]:
plot_timeseries(data, 'Doganella Aquifer', target_cols)

Seasonal decomposition confirms that target variables differ in their observed behaviour over time and could not be described by generalized rules.

The first target shows wide fluctuations in values, its trend changes direction unpredictably, and the seasonal component is less visible than the residuals.

The other 8 target variables do not demonstrate a seasonal component at all, and trend patterns for them are not obvious. Some targets have periods of upward or downward trends, others do not change value range significantly over time.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Doganella Aquifer', 'monthly', 1)

#### Notes
Given the large share of missing values and the abnormal patterns of target variables seen on the graphs, it is unsurprising that modelling waterbody behaviour on a monthly basis leads to a low R2 score and, in some cases, also to a high mean absolute error (MAE) on cross-validation. That is why the simplest approach was mostly used for monthly predictions; it relies on the most recent actual value of the predicted variable and its trends for the last 5 and 10 datapoints.
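The fallback can be sketched as follows. This is a simplified stand-in for `simple_prediction` from Block 5: it uses `np.polyfit` instead of `LinearRegression`, and the helper names and the example history are assumptions made for illustration.

```python
import numpy as np


def trend_forecast(values, n_periods, steps_ahead):
    """Extrapolate a straight line fitted to the last n_periods values."""
    y = np.asarray(values[-n_periods:], dtype=float)
    x = np.arange(1, n_periods + 1)
    slope, intercept = np.polyfit(x, y, 1)
    return slope * (n_periods + steps_ahead) + intercept


def simple_forecast(values, steps_ahead=1):
    """Equal-weight average of the last value and two linear trends
    (fitted on the last 5 and the last 10 datapoints)."""
    last = float(values[-1])
    return (last
            + trend_forecast(values, 5, steps_ahead)
            + trend_forecast(values, 10, steps_ahead)) / 3


# Hypothetical history that grows by one unit per period
history = list(range(20))
print(simple_forecast(history))
```

Including the last observed value pulls the forecast towards the present level, which dampens the extrapolated trends.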

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Doganella Aquifer', 'daily', 1)

#### Notes
Daily models had more datapoints to train on and demonstrated better performance. The reason for higher accuracy of daily models is that daily fluctuations of target values are less dramatic and could be more easily predicted based on the previous values of the same parameter and other correlated features.
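The difference in step size between daily and monthly data can be illustrated on a synthetic series with a smooth yearly cycle. The series below is an assumption made for illustration and is not taken from the datasets.

```python
import numpy as np
import pandas as pd

# Synthetic two-year daily series with a yearly cycle
days = pd.date_range('2018-01-01', periods=730, freq='D')
level = pd.Series(np.sin(2 * np.pi * np.arange(730) / 365), index=days)

# Average absolute change from one period to the next
daily_step = level.diff().abs().mean()
monthly_step = level.resample('MS').mean().diff().abs().mean()
print(f'daily: {daily_step:.3f}, monthly: {monthly_step:.3f}')
```

For a smooth series the typical change between consecutive days is far smaller than between consecutive monthly averages, so the previous value alone is already a strong predictor at daily frequency.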

As could be seen in the code cell above, input features mostly included depth and volume, seasonal and long-term columns (Year, Month, Day). For some targets temperature was also included.

#### 1.3. Luco Aquifer

In [None]:
data = get_data(Luco_path)

In [None]:
target_cols = targets['Luco']

In [None]:
data.describe()

Target variable has about 55% of missing values. Other depth columns have over 80% of NaNs. Data on rainfall and volume is also rather scarce (mostly has about 60% to 75% of missing values).

In [None]:
plot_nans(data, 'Luco aquifer')

Data on rainfall demonstrates exponential distribution. Data on depth shows varying distribution patterns. Data on temperature and volume is close to normal distribution.

In [None]:
plot_distribution(data)

The target variable is highly correlated with all three volume columns and shows medium correlation with two other depth columns.

In [None]:
plot_correlation(data, 'Luco Aquifer', target_cols)

Visual analysis of the graph doesn't show obvious patterns. There are multi-year periods of increase followed by shorter or equal periods of decrease. Seasonal patterns are not obvious. Large gaps in data are present on the graph.

In [None]:
plot_timeseries(data, 'Luco Aquifer', target_cols)

For this target variable trend is the most significant factor, which defines the behaviour. Seasonal component is small in its scale and comparable to residuals. Trend does not show clear patterns and changes direction over time.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Luco Aquifer', 'monthly', 1)

#### Notes
All four tested models showed low R2 scores on cross-validation. However MAE was relatively low taking into account the scale of predicted variable.

Final monthly forecast was made with simple linear models and using the latest available value of predicted variable. This prediction was made based on data for January 2019, because there are no further data on target variable in the Luco dataset.

Since the target showed high correlation with several other parameters, it is possible to make a prediction for later periods based on these parameters without previous values of the target. However, the reliability of this prediction would be questionable due to the low R2 score. To increase the accuracy it would be necessary to obtain actual data on the target variable for the last few days or weeks, if possible.

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Luco Aquifer', 'daily', 1)

#### Notes
Daily models demonstrated high R2 scores of about 0.98-0.99. This can be explained by the larger number of datapoints and the relatively easier task of daily prediction. The best model used several depth and volume columns as well as the year column to predict the target.

The latest datapoint when all highly correlated features were present in the dataset at the same time was in December 2018. To make predictions for later periods and the future it would be desirable to obtain actual values of the target variable for several recent days.

If recent data on the target variable is unavailable, a prediction could be made based on other available features. However, this prediction wouldn't follow the general algorithm described in this Notebook and requires custom data transformation. An example of this approach is presented below.

#### Models that do not use previous values of target variable

In [None]:
# Example of daily forecast based on correlated features without previous values of the target
data = get_data(Luco_path)
data['Year'] = data.index.year

target = ['Depth_to_Groundwater_Podere_Casetta']
features = ['Depth_to_Groundwater_Pozzo_1', 'Depth_to_Groundwater_Pozzo_3', 
            'Volume_Pozzo_1', 'Volume_Pozzo_3', 'Volume_Pozzo_4', 'Year']

# Select input and output columns
data = data[features + target]
# Last row with all features present (the target is not required)
input_rows = data[features].dropna()
input_date = input_rows.index.max()
input_data = input_rows.loc[input_date].values.reshape(1, -1)
# Extract y values
data['Depth_to_Groundwater_Podere_Casetta'] = data['Depth_to_Groundwater_Podere_Casetta'].shift(-1)
data.dropna(inplace=True)
y = data.pop('Depth_to_Groundwater_Podere_Casetta')
# Cross-validate the model
model_name, model = models[0]
cv_r2 = cross_val_score(model, data, y, cv=kf, scoring='r2').mean()
cv_mae = - cross_val_score(model, data, y, cv=kf, scoring='neg_mean_absolute_error').mean()
# Root of the squared error, so that the printed value is an actual RMSE
cv_rmse = - cross_val_score(model, data, y, cv=kf, scoring='neg_root_mean_squared_error').mean()
print(f'{model_name} cross-val score for {target}:\n\tR2 = {cv_r2}\n\tMAE = {cv_mae}\n\tRMSE = {cv_rmse}')
# Train the model
model.fit(data, y)
prediction = model.predict(input_data)[0]
print(f'\n{model_name} daily prediction for {target}: {prediction}')
print(f'\nInput features: {", ".join(features)}\nInput date: {input_date}\n')

In [None]:
# Example of monthly forecast based on correlated features without previous values of the target
data = get_data(Luco_path)
data['Year'] = data.index.year
data = data.resample('M').mean()

target = ['Depth_to_Groundwater_Podere_Casetta']
features = ['Depth_to_Groundwater_Pozzo_1', 'Depth_to_Groundwater_Pozzo_3', 
            'Volume_Pozzo_1', 'Volume_Pozzo_3', 'Volume_Pozzo_4', 'Year']

# Select input and output columns
data = data[features + target]
# Last row with all features present (the target is not required)
input_rows = data[features].dropna()
input_date = input_rows.index.max()
input_data = input_rows.loc[input_date].values.reshape(1, -1)
# Extract y values
data['Depth_to_Groundwater_Podere_Casetta'] = data['Depth_to_Groundwater_Podere_Casetta'].shift(-1)
data.dropna(inplace=True)
y = data.pop('Depth_to_Groundwater_Podere_Casetta')
# Cross-validate the model
model_name, model = models[0]
cv_r2 = cross_val_score(model, data, y, cv=kf, scoring='r2').mean()
cv_mae = - cross_val_score(model, data, y, cv=kf, scoring='neg_mean_absolute_error').mean()
# Root of the squared error, so that the printed value is an actual RMSE
cv_rmse = - cross_val_score(model, data, y, cv=kf, scoring='neg_root_mean_squared_error').mean()
print(f'{model_name} cross-val score for {target}:\n\tR2 = {cv_r2}\n\tMAE = {cv_mae}\n\tRMSE = {cv_rmse}')
# Train the model
model.fit(data, y)
prediction = model.predict(input_data)[0]
print(f'\n{model_name} monthly prediction for {target}: {prediction}')
print(f'\nInput features: {", ".join(features)}\nInput date: {input_date}\n')

#### 1.4. Petrignano Aquifer

In [None]:
data = get_data(Petrignano_path)

In [None]:
target_cols = targets['Petrignano']

In [None]:
data.describe()

This dataset has few missing values, and the lowest share is seen in target values (below 1%). NaNs do not exceed 20% even in the sparsest columns. However, the total number of parameters available in this dataset is smaller than in other aquifer datasets.

In [None]:
plot_nans(data, 'Petrignano aquifer')

Data distribution is mostly normal, with the exception of the rainfall column, which shows the typical exponential distribution, and hydrometry, which has an irregular shape.

In [None]:
plot_distribution(data)

Heatmap shows perfect correlation between the two target variables and moderate correlation of both targets with the volume column.

In [None]:
plot_correlation(data, 'Petrignano Aquifer', target_cols)

Target values demonstrate multi-year increase / decrease trends with no obvious cycles. Both variables follow the same trends over the whole period of observation.

In [None]:
plot_timeseries(data, 'Petrignano Aquifer', target_cols)

These two target variables have very small seasonal component and are influenced mostly by changes in trend direction.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Petrignano Aquifer', 'monthly', 1)

#### Notes
Both targets could be reliably modelled with any of the tested models. R2 score on cross-validation was well above 0.9. Input features included previous values of both targets, year, volume and rainfall.

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Petrignano Aquifer', 'daily', 1)

#### Notes
Daily models also showed high performance based on the previous values of target variables, volume and year.

### 2. Lakes

#### 2.1. Bilancino

In [None]:
data = get_data(Bilancino_path)

In [None]:
target_cols = targets['Bilancino']

In [None]:
data.describe()

Only one dataset of this waterbody type is available for analysis. It has few missing values. Data on rainfall and temperature has about 9% of NaNs. Target column with flow rate has less than 1% of missing values. Lake level column has no missing values.

In [None]:
plot_nans(data, 'Bilancino lake')

Data on temperature demonstrate normal distribution. Other data, including both target variables, show exponential distribution.

In [None]:
plot_distribution(data)

Target variables demonstrate moderate correlation with one another (0.3). Flow rate shows rather low correlation with temperature and rainfall measurements, while lake level practically does not correlate with other features.

In [None]:
plot_correlation(data, 'Bilancino lake', target_cols)

Lake level stays rather stable at about 250 for the whole period of observations. As was shown on a histogram above, lake level does not decrease below 244, and low values are rare in the dataset.

Flow rate has visible spikes that do not follow regular patterns or cycles. These spikes do not look like errors in the data but rather like the result of some powerful external influence.

In [None]:
plot_timeseries(data, 'Bilancino lake', target_cols)

Lake level has very small seasonal component to it and no visible trend. This stability makes target value highly predictable as it varies in a narrow range.

Seasonal decomposition plot for flow rate demonstrates seasonal patterns, however their scale is much smaller than the residuals. Trend line changes direction without any obvious pattern. As a result, the chart looks chaotic and unpredictable.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Bilancino lake', 'monthly', 1)

#### Notes
Out of the two target values, only the lake level could be reliably predicted on a monthly basis from the available data. The model takes previous values of both target variables, rainfall and temperature measurements as input. R2 score on cross-validation is close to 0.9, while mean absolute error is about 0.5 for the best models.

Flow rate has low R2 score and high mean absolute error on cross-validation. Future monthly values of this variable, at best, could be predicted based on the most recent values of this same variable but occasional large errors will persist. Accuracy of prediction, probably, could be improved by adding other features to the dataset (for example, adding data on surface winds).

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Bilancino lake', 'daily', 1)

#### Notes
Daily models showed R2 close to 1.0 for lake level and over 0.7 for flow rate. MAE for flow rate stayed rather high compared to scale of this variable. It means that only lake level could be accurately forecasted with the proposed models.

The same considerations as for monthly models are valid here. Accuracy of flow rate prediction could be increased only by adding other relevant data to model inputs. At present the model uses previous values of both targets, rainfall and temperature. Presumably, surface winds could affect the flow rate and cause the occasional spikes that are seen on the graph and lead to large forecast errors. However, predicting the future flow rate would then require future values of wind, and predicting that parameter would be a nontrivial task in itself.

### 3. Rivers

#### 3.1. Arno

In [None]:
data = get_data(Arno_path)

In [None]:
target_cols = targets['Arno']

In [None]:
data.describe()

Rivers are represented by one dataset, which has very few missing values in target variable column but large share of NaNs in rainfall columns (from 25% to 85%). Temperature measurements are available for only one location with about 25% of missing values.

In [None]:
plot_nans(data, 'Arno river')

Most data columns in this dataset follow exponential distribution pattern with the exception of temperature column which shows normal distribution.

In [None]:
plot_distribution(data)

Target variable has some degree of correlation with all features available in the dataset. Correlation with rainfall columns varies from 0.24 to 0.52. Negative correlation with the only temperature column stays at -0.46.

In [None]:
plot_correlation(data, 'Arno River', target_cols)

Target variable shows rather stable minimum levels with spikes of varying magnitude over the period of observation. Frequency and magnitude of increases in the target values do not follow any obvious pattern.

In [None]:
plot_timeseries(data, 'Arno river', target_cols)

Seasonal decomposition plot does not show any long-term trends in the target values. Seasonal component is comparable in scale to the residuals, which means that seasonal features alone would not provide accurate predictions for this target.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Arno river', 'monthly', 1)

#### Notes
Monthly models showed low R2 scores on cross-validation, though mean absolute error is not very high compared to the scale of the predicted parameter. As a result, simplified approach was used to predict the next month average value, which took into account the previous month as well as 6 months and 12 months linear trends.

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Arno river', 'daily', 1)

#### Notes
Daily models showed high R2 scores and low MAE on cross-validation. Better performance of daily models is explained by the fact that significant spikes and shifts in the predicted parameter do not develop on a daily basis which leads to smaller errors. If this model was used to predict daily values for several days or weeks ahead, the accuracy would decrease.

At present, the model makes a prediction based on a wide range of input features including prior values of the target, rainfall, temperature and features representing seasonal effects. Probably, significant changes in hydrometry could be explained by some other parameters that are not present in the given dataset.

### 4. Water springs

#### 4.1. Amiata

In [None]:
data = get_data(Amiata_path)

In [None]:
target_cols = targets['Amiata']

In [None]:
data.describe()

Target columns in this dataset have over 70% of missing values. Other columns mostly have over 50% of NaNs.

In [None]:
plot_nans(data, 'Amiata water spring')

Rainfall data shows exponential distribution pattern. Temperature distribution is close to normal.

Flow rate distribution is close to exponential for the most part, but has some irregularities, gaps and spikes.

Depth to groundwater measurements show irregularities in data distribution that could not be generalized.

In [None]:
plot_distribution(data)

All four target variables in this dataset represent measurements of flow rate. Targets are positively correlated to one another and negatively correlated to depth measurements.

Practically no correlation with rainfall or temperature is visible on the heatmap.

In [None]:
plot_correlation(data, 'Amiata water spring', target_cols)

Actual data looks inconsistent on the graph. Target values for three out of four analysed variables shift upwards and downwards on a regular basis during the whole period of observation. These fluctuations look unnatural and most likely result from some kind of measurement errors.

Flow rate at Bugnano, on the contrary, looks almost like a straight line, which differs dramatically from other target variables. Consulting with domain experts is desirable to establish which patterns should be treated as natural and which should be excluded as errors and noise.

One of the possible solutions to remove these abrupt shifts is to apply some kind of sliding window mean smoothing to the original data. However such step is questionable since the nature of these shifts is not clear. At present, 0 values in Arbure and Ermicciolo columns in this dataset were filtered out as obvious errors and replaced with NaNs. Other inconsistencies were left uncorrected.

In [None]:
plot_timeseries(data, 'Amiata water spring', target_cols)

Seasonal decomposition shows that seasonal component could not be relied upon in modelling the target variables. It is either small in scale or demonstrates no clear patterns.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Amiata water spring', 'monthly', 1)

#### Notes
Only one out of four target values (flow rate at Galeria Alta) could be modelled on a monthly basis using the proposed sklearn models and given input values. Required input features include rainfall and depth measurements, previous values of flow rate columns and a year column for long-term trend.

For other targets simple approach based on linear trends and most recent average value of the forecasted variable is recommended. For flow rate Bugnano this approach is obviously correct since the target variable is very stable and looks like a straight line on the plot. For other two targets this simple approach also produces reasonable forecasts compared to actual values seen on the plot.

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Amiata water spring', 'daily', 1)

#### Notes
Daily models showed higher R2 score due to easier task of forecasting the next day value. The four tested models do not differ dramatically in performance on cross-validation. Input features include depth and flow rate measurements for the previous day and a year column.

#### 4.2. Lupa

In [None]:
data = get_data(Lupa_path)

In [None]:
target_cols = targets['Lupa']

In [None]:
data.describe()

This dataset contains only two columns. The share of missing values in the target column is about 9%.

In [None]:
plot_nans(data, 'Lupa water spring')

Rainfall data shows exponential distribution pattern.

Flow rate distribution is not typical, which can be seen more clearly on the time series plot below.

In [None]:
plot_distribution(data)

Target variable has practically no correlation with the only feature column available in the dataset.

In [None]:
plot_correlation(data, 'Lupa water spring', target_cols)

This plot is the most unnatural-looking among all target variables that are being analysed. Target value has a long period of almost perfectly linear downward trend with uncharacteristic periods of rapid increase at the beginning and at the end.

In [None]:
plot_timeseries(data, 'Lupa water spring', target_cols)

Unsurprisingly, the seasonal decomposition plot shows no seasonal component for this target variable and a clear downward trend for most of the observation period, with uncharacteristic behaviour at the beginning and at the end of the time series.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Lupa water spring', 'monthly', 1)

#### Notes
Best models showed high R2 score (over 0.9) on cross-validation. Mean absolute error is also acceptable compared to the scale of the predicted variable. Input features included previous values of target itself and a year column.

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Lupa water spring', 'daily', 1)

#### Notes
Daily models showed even higher R2 score and lower MAE based on the same input features with daily frequency.

#### 4.3. Madonna di Canneto

In [None]:
data = get_data(Madonna_path)

In [None]:
target_cols = targets['Madonna']

In [None]:
data.describe()

Target variable in this dataset has about 55% of missing values. Other columns have less than 20% of NaNs.

In [None]:
plot_nans(data, 'Madonna water spring')

Rainfall data shows exponential distribution pattern. Temperature is normally distributed. Distribution of target values has two almost separate areas as if two different parameters were plotted on the same histogram.

In [None]:
plot_distribution(data)

Target value has practically no correlation with two other columns available in a dataset.

In [None]:
plot_correlation(data, 'Madonna water spring', target_cols)

Time series plot has large gaps and dramatic drops in level, which could not be unquestionably attributed to data errors or some kind of uncharacteristic behaviour of the waterbody. At present, the data will not be filtered or smoothed but to achieve better model performance consultation with domain experts is highly recommended.

In [None]:
plot_timeseries(data, 'Madonna water spring', target_cols)

Seasonal decomposition plot shows that data does not follow any clear seasonal pattern.

Changes in trend visible on this plot are questionable since it is not clear if significant drops and rebounds of target values result from actual behaviour of this waterbody or errors in the original dataset.

In [None]:
plot_seasonality(data, target_cols)

In [None]:
# Feature engineering
data = add_weekly_averages(data)
data = add_seasonality(data)

In [None]:
# Create monthly models and make a forecast
modelling(data, target_cols, 'Madonna di Canneto water spring', 'monthly', 1)

#### Notes
Monthly models showed low R2 score and high mean absolute error. Final prediction was made using a simple approach based on linear trends and the most recent actual values of the predicted variable. Taking into account the mostly stable level of flow rate on the graph during 2020, this prediction could be valid for the nearest future. However, if the waterbody returns to its previous patterns of dramatic shifts and drops in flow rate level, forecasts made with this model could not be relied upon.

In [None]:
# Create daily models and make a forecast
modelling(data, target_cols, 'Madonna di Canneto water spring', 'daily', 1)

#### Notes
As was indicated earlier, even for waterbodies with unstable patterns of behaviour and dramatic shifts in target values daily predictions show acceptable cross-val scores. Significant shifts and drops in values do not fully manifest on a daily basis. As a result, daily prediction based on the value of the target variable for the previous day and a year column could be a reliable indicator of most likely value for the next day.

### 5. Flexible data frequencies and forecast horizons
The algorithm allows for monthly, weekly and daily data frequencies. Depending on the arguments passed to the modelling() function the forecast could be made for a single day, the average monthly value or the average weekly value. Data frequency is defined with 'freq' argument.

The last argument of the function controls how many steps ahead the forecast looks. For example, calling modelling() with arguments freq='monthly' and steps_ahead=1 will produce the average value for the next month. Calling this function with arguments freq='monthly' and steps_ahead=3 will produce the average value for the third month counting from the last available data.
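The effect of the steps-ahead argument can be illustrated with a minimal sketch. It assumes that the training target is built by shifting the target column backwards by the chosen number of periods; the helper name `make_supervised` and the column suffix `_future` are hypothetical and are not taken from the modelling() function itself.

```python
import pandas as pd

def make_supervised(df, target, steps_ahead=1):
    """Pair each row with the target value observed `steps_ahead` periods later.

    Hypothetical helper: the real modelling() function may build its
    training pairs differently.
    """
    out = df.copy()
    future_col = f'{target}_future'
    # shift(-k) moves values k rows up, so row t receives the value from t + k
    out[future_col] = out[target].shift(-steps_ahead)
    # The last k rows have no known future value and cannot be used for training
    return out.dropna(subset=[future_col])

monthly = pd.DataFrame(
    {'flow_rate': [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.period_range('2020-01', periods=5, freq='M'),
)
pairs = make_supervised(monthly, 'flow_rate', steps_ahead=3)
print(pairs)
```

With five monthly values and a horizon of three steps, only the first two rows remain as training pairs, which shows one reason why long horizons leave less data to learn from.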

In theory, this approach should allow for medium-term and long-term forecasts. However, testing shows that the accuracy notably decreases when the algorithm is applied to make a prediction for distant future periods.

Accuracy also depends on the data frequency chosen for prediction. The most accurate predictions could be made on a daily basis (even when making a forecast several steps ahead). The least reliable models are those based on monthly data. Weekly data provides better accuracy scores compared to monthly data but worse results compared to daily data.

In [None]:
# Examples of forecasting target values several steps ahead:
for frequency in ['monthly', 'weekly', 'daily']:
    modelling(data, target_cols, 'Madonna di Canneto water spring', frequency, 3)

### Block 7: Models explained






### Preprocessing steps
- Filtering out 0 values in all columns except rainfall columns (0 values are replaced with np.nan). Original data contained erroneous values, which distorted trends and distribution of values. In most cases 0 values were highly uncharacteristic for the measured parameter. In cases where 0 values are a minor part of the measurements, they differ dramatically from the neighboring datapoints (look like obvious point outliers). In cases where 0 values are in abundance, they form a separate bin distant from the bulk of typical measurements for the same parameter. Some of the 0 values in rainfall columns could also be placeholders for missing values, but there is no realistic way to differentiate them from legitimate 0 values and filter them out.
- Conversion of daily data to monthly or weekly averages for monthly and weekly models. This step reduces the number of datapoints and samples but smooths fluctuations in data.
- Interpolation or imputation of missing values is not being used. Original data as it is demonstrates irregular patterns and low correlation between parameters. Adding interpolated data to it will only increase noise and will not improve modelling accuracy.
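The preprocessing steps listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's own code: the function name `preprocess_sketch` is hypothetical, rainfall columns are assumed to be recognisable by a `Rainfall` prefix, and the data is assumed to have a `DatetimeIndex`.

```python
import numpy as np
import pandas as pd

def preprocess_sketch(df, freq='monthly'):
    """Replace zeros with NaN outside rainfall columns, then average by period."""
    out = df.copy()
    # Assumption: rainfall columns start with 'Rainfall'; zeros there are kept
    non_rain = [c for c in out.columns if not c.startswith('Rainfall')]
    out[non_rain] = out[non_rain].replace(0, np.nan)
    # Daily data is returned as is; weekly and monthly data are averaged
    rule = {'daily': None, 'weekly': 'W', 'monthly': 'MS'}[freq]
    if rule is None:
        return out
    # mean() skips NaN, so no interpolation or imputation is introduced
    return out.resample(rule).mean()

idx = pd.date_range('2020-01-01', periods=4, freq='D')
raw = pd.DataFrame(
    {'Rainfall_A': [0.0, 2.0, 0.0, 4.0], 'Flow_Rate': [0.0, 3.0, 5.0, 0.0]},
    index=idx,
)
monthly = preprocess_sketch(raw, 'monthly')
print(monthly)
```

In this toy example the zero flow rates are dropped from the monthly average, while the zero rainfall readings still count as dry days.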

### Feature engineering
- Adding seasonal features:
  - Month, week and day of year (int) to capture seasonal effects
  - Binary column for rainy season to reflect higher intensity of rainfall from October to April
  - Year (int) to capture long-term increase / decrease trend if it is present
- Adding weekly rolling averages for rainfall and temperature, when the model is based on daily data. The reasoning behind this is that prolonged periods of intense rainfall or drought would affect water level and flow rate more significantly than daily fluctuations and changes.
- Reduction of input data to features most correlated with the target value for the predicted period (day, week or month). Only features with positive correlation above 0.2 or negative correlation below -0.2 are being used for training and prediction.
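A minimal sketch of these feature engineering steps is given below. The helper names `seasonal_features_sketch` and `select_by_correlation` are hypothetical and do not reproduce the notebook's add_seasonality() or modelling() functions. For simplicity the correlation is computed against the target column as given, whereas the notebook correlates features with the target value for the predicted period.

```python
import pandas as pd

RAINY_MONTHS = [10, 11, 12, 1, 2, 3, 4]  # October to April

def seasonal_features_sketch(df):
    """Add calendar features derived from a DatetimeIndex."""
    out = df.copy()
    idx = out.index
    out['year'] = idx.year
    out['month'] = idx.month
    out['week'] = idx.isocalendar().week.to_numpy().astype(int)
    out['day_of_year'] = idx.dayofyear
    out['rainy_season'] = idx.month.isin(RAINY_MONTHS).astype(int)
    return out

def select_by_correlation(df, target, threshold=0.2):
    """Keep features whose absolute correlation with the target exceeds the threshold."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

dates = pd.to_datetime(['2020-07-15', '2020-11-15'])
feats = seasonal_features_sketch(pd.DataFrame({'x': [1.0, 2.0]}, index=dates))

toy = pd.DataFrame({
    'target': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feat_up': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    'feat_down': [-1, -2, -3, -4, -5, -6, -7, -8, -9, -10],
    'feat_flat': [1, 2, 3, 4, 5, 5, 4, 3, 2, 1],
})
selected = select_by_correlation(toy, 'target')
print(selected)
```

A weekly rolling average for a daily column can be added in the same spirit with `df[col].rolling(7).mean()`.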

### Selection of principal approach to modelling

As was shown by exploratory data analysis, predicted variables do not follow universal seasonal trends. Some of the target variables demonstrate visible seasonal patterns while others behave more chaotically and, obviously, are affected by external factors.

In this situation predictive models such as ARIMA would not provide accurate predictions, though such models are often used to get out-of-sample time series forecasts for several periods ahead. For the same reason Prophet or XGBoost regression models would not be useful in this case. Predictions made solely on seasonal features would show large errors because they ignore the influence of external factors.

Though these models could take into account external regressors, if future values of those external regressors are unknown, they would be also useless.

### Types of models
1. Depending on data frequency models are divided into 3 groups (frequency is defined by 'freq' argument of modelling() function):
   - Daily models: take the values of target variables and other parameters for the previous day as an input to predict the target value for the next day or another day in the future selected by the user.
   - Weekly models: take the average values of target variables and other parameters for the previous week as an input to predict the average target value for the next week or another future week selected by the user.
   - Monthly models: take the average values of target variables and other parameters for the previous month as an input to predict the average target value for the next month or another month in the future.
2. Depending on the period (step) in the future which is predicted models would produce a forecast for the next period (day, week or month), second period, third period, etc. This behaviour is defined by 'steps_ahead' argument of the modelling() function. There is no restriction on input values for this argument. However, prediction accuracy for distant future periods is lower and it is not advisable to rely on this algorithm for long-term predictions.
3. Depending on the algorithm final prediction is made with one of the following models:
   - Basic sklearn models: one of RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor or KNeighborsRegressor (model with the highest cross-val score is selected given that the best R2 is higher than or equal to 0.6).
   - Mean value of two linear model predictions and the actual value of the target variable for the previous period. This method is used to make a prediction if basic sklearn models listed above showed R2 lower than 0.6 on cross validation. Low R2 usually reflects the fact that target value varies chaotically or could not be wholly explained and derived from the given input features. In this case the simplest option is used for prediction - to look at the actual value of the target variable for the latest available period (day, week or month - depending on the model frequency) and make a correction for observable trend. LinearRegression predictions are made based on the last 5 and 10 available values of the target variable, and these two predictions are added to the actual value for the latest period and divided by 3.
4. Depending on input features: input features for each model are selected at runtime based on correlation with the target value. Threshold for the selection is positive correlation above 0.2 or negative correlation below -0.2. The list of input features that are being used in the model is printed below each prediction. The forecasting algorithm is constructed in such a way that in the future the composition of inputs could change if new relevant data is added to the dataset or correlation between the parameters changes.
5. Depending on waterbody type: the competition rules required to create 4 models specific to waterbody type so that any of the models could be applied to predict target values of any object of this particular type. EDA did not show consistent patterns for target variables in waterbodies of the same type. Moreover, identical target variables in the same waterbody could demonstrate quite different patterns and prediction scores for the same models vary greatly. To achieve the highest possible level of accuracy and to take into consideration peculiarities of various waterbodies a decision was made to create one flexible algorithm which tests several models at runtime, selects appropriate input features and makes a prediction on monthly, weekly or daily basis depending on the user's choice.
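The choice between a cross-validated model and the simple trend-based method described in item 3 can be sketched as follows. This is an illustration only: the function names are hypothetical, `np.polyfit` with degree 1 is used here in place of sklearn's LinearRegression (both fit an ordinary least-squares line), and the trend is extrapolated exactly one step ahead.

```python
import numpy as np

def trend_forecast(values, window):
    """Fit a straight line to the last `window` values and extrapolate one step."""
    y = np.asarray(values, dtype=float)[-window:]
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    return slope * len(y) + intercept

def fallback_prediction(values):
    """Average of two linear-trend forecasts (5 and 10 points) and the last value."""
    last = float(np.asarray(values, dtype=float)[-1])
    return (trend_forecast(values, 5) + trend_forecast(values, 10) + last) / 3

def choose_prediction(cv_scores, model_predictions, values, min_r2=0.6):
    """Use the best model if its R2 reaches the threshold, otherwise the fallback."""
    best = max(cv_scores, key=cv_scores.get)
    if cv_scores[best] >= min_r2:
        return best, model_predictions[best]
    return 'fallback', fallback_prediction(values)

series = list(range(1, 13))  # a perfectly linear toy series 1..12
scores = {'RandomForest': 0.35, 'KNeighbors': 0.50}
preds = {'RandomForest': 12.4, 'KNeighbors': 12.1}
print(choose_prediction(scores, preds, series))
```

For the toy series both trend lines extrapolate to 13 and the last value is 12, so the fallback returns (13 + 13 + 12) / 3. Averaging in the last actual value pulls the forecast back toward the latest observation, which dampens the effect of a short-lived trend.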

### Conclusions and recommendations

Waterbodies differ in terms of correlation between features given in the datasets and range of features available for each object. It is the main obstacle to creating a generalized model with given input features for each waterbody type.

To overcome this problem flexible approach is used, which selects input features at runtime based on available data and correlation between parameters. In some cases, this algorithm could lead to unexpected results. If data in the original file is not evenly distributed (especially if some parameters are correlated with the target value but are not present in the latest periods of observation), prediction will be made based on the last period when all correlated features were present. Date of the input is printed along with the prediction and a list of input features. It is recommended to check if the input date is equal or close to the last date in the original file.
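The recommended check of the input date can be automated with a few lines. The sketch below assumes a `DatetimeIndex` and a known list of selected features; the helper name `input_date_lag` is hypothetical and is not part of the notebook.

```python
import numpy as np
import pandas as pd

def input_date_lag(df, features):
    """Days between the last row with all features present and the last row overall."""
    complete = df.dropna(subset=features)
    return (df.index.max() - complete.index.max()).days

idx = pd.date_range('2020-06-26', periods=5, freq='D')
sample = pd.DataFrame(
    {'Rainfall': [1.0, 2.0, 3.0, np.nan, np.nan],
     'Flow_Rate': [5.0, 5.0, 5.0, 5.0, 5.0]},
    index=idx,
)
lag = input_date_lag(sample, ['Rainfall', 'Flow_Rate'])
print(f'Input date is {lag} day(s) behind the last date in the file')
```

A large lag indicates that the prediction is based on stale inputs and should be treated with caution.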

Analysis of data and interpretation of modelling results showed that in the **rivers** category target value demonstrates correlation with most other parameters (rainfall, temperature, season) and no visible long-term trend. This group was represented by a single object in this study, which makes generalizations questionable.

In the **lakes** category target variables - lake level and flow rate - show moderate correlation to one another and no long-term trend. Both targets demonstrate low day-to-day correlation with rainfall and temperature. Simple shift of the target lake level one step backwards results in higher correlation (yesterday's rainfall affects today's lake level). Weekly rolling averages for rainfall and temperature and seasonal features (day and month of year) are also useful for predicting lake level. However flow rate could not be predicted based on these features with acceptable accuracy.

**Aquifers** category showed low correlation of target values with rainfall, medium correlation with temperature and some degree of correlation with volume. In most cases target values are also affected by long-term trends. However these trends do not follow any obvious and easily explainable rules.

**Water springs** are the most difficult group for prediction. Target values correlate mostly with one another and demonstrate practically no correlation with other parameters. Trends could not be generalized or explained with the data available for analysis. Some of the objects demonstrate highly suspicious behaviour on the charts (abrupt shifts and gaps), which could be a result of errors in the original data.

The proposed algorithm produces a daily, weekly or monthly forecast for an arbitrarily chosen period in the future (next day, week or month, second next period, 5th next period and so on). However, it is not recommended to model distant future periods in this way, because accuracy of the predictions will deteriorate. As a rule, prediction accuracy for shorter periods (daily and weekly data) is higher than for longer periods (monthly data).

Accuracy of the predictions could be improved by more thorough data cleaning and removal of noise and errors. However, domain knowledge is required for this step.