# Important note
This notebook will only cover **River Arno**, leaving the other 8 waterbodies (datasets) out of scope.

# Challenge overview

The Acea Group is one of the leading Italian multiutility operators. Listed on the Italian Stock Exchange since 1999, the company manages and develops water and electricity networks and environmental services. Acea is the foremost Italian operator in the water services sector, supplying 9 million inhabitants in Lazio, Tuscany, Umbria, Molise and Campania.

This competition uses nine different datasets, completely independent and not linked to each other. Each dataset can represent a different kind of waterbody. As each waterbody is different from the others, the related features differ as well. For instance, the features of a water spring are different from those of a lake. This is correct and reflects the behavior and characteristics of each waterbody. The Acea Group deals with four different types of waterbodies: water springs (for which three datasets are provided), lakes (one dataset), rivers (one dataset) and aquifers (four datasets).

The desired outcome of this challenge is a notebook that can generate four mathematical models, one for each category of waterbody (aquifers, water springs, river, lake), that might be applicable to each single waterbody.

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F6195295%2Fcca952eecc1e49c54317daf97ca2cca7%2FAcea-Input.png?generation=1606932492951317&alt=media)

Each waterbody has its own different features to be predicted. The table below shows the expected feature to forecast for each waterbody.

![](https://storage.cloud.google.com/kaggle-media/competitions/Acea/Screen%20Shot%202020-12-02%20at%2012.40.17%20PM.png)

# Reading files

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
import seaborn as sns
plt.rcParams['figure.dpi'] = 300
import matplotlib.dates as mdates
import missingno as msno
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# ignoring warnings
import warnings
warnings.simplefilter("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
#        print(os.path.join(dirname, filename))
        if '.csv' in filename:
            files.append(filename)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

aq_auser = pd.read_csv("../input/acea-water-prediction/Aquifer_Auser.csv", index_col = 'Date')
aq_doganella = pd.read_csv("../input/acea-water-prediction/Aquifer_Doganella.csv", index_col = 'Date')
aq_luco = pd.read_csv("../input/acea-water-prediction/Aquifer_Luco.csv", index_col = 'Date')
aq_petrignano = pd.read_csv("../input/acea-water-prediction/Aquifer_Petrignano.csv", index_col = 'Date')
lk_bilancino = pd.read_csv("../input/acea-water-prediction/Lake_Bilancino.csv", index_col = 'Date')
rv_arno = pd.read_csv("../input/acea-water-prediction/River_Arno.csv", index_col = 'Date')
ws_amiata = pd.read_csv("../input/acea-water-prediction/Water_Spring_Amiata.csv", index_col = 'Date')
ws_lupa = pd.read_csv("../input/acea-water-prediction/Water_Spring_Lupa.csv", index_col = 'Date')
ws_madonna = pd.read_csv("../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv", index_col = 'Date')

datasets=[aq_auser,aq_doganella,aq_luco,aq_petrignano,lk_bilancino,rv_arno,ws_amiata,ws_lupa,ws_madonna]

# Datasets overview

The table below lists all the datasets with their waterbody type and their number of rows and columns. As stated before, I'll only focus on **River Arno**.

In [None]:
# Creating a brief dataframe to compare the qty of rows and cols of each file.
datasets_df = pd.DataFrame(columns=['File_Name'], data=files)
datasets_df['Waterbody_type'] = datasets_df.File_Name.apply(lambda x: x.split('_')[0])
datasets_df['Qty_Rows'] = datasets_df.File_Name.apply(lambda x: pd.read_csv(f'../input/acea-water-prediction/{x}').shape[0])
datasets_df['Qty_Cols'] = datasets_df.File_Name.apply(lambda x: pd.read_csv(f'../input/acea-water-prediction/{x}').shape[1])
datasets_df = datasets_df.replace('Water','Water_Spring')
datasets_df = datasets_df.sort_values(by=['Waterbody_type','Qty_Rows'], ascending=[True,False]).reset_index(drop=True)
#datasets_df.style.bar(subset=['Qty_Rows','Qty_Cols'], color='#118DFF')
datasets_df

In [None]:
#Stating each waterbody target
auser_targets = ['Depth_to_Groundwater_LT2', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS']
doganella_targets = ['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
                     'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                     'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9']
luco_targets = ['Depth_to_Groundwater_Podere_Casetta']
petrignano_targets = ['Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25']
bilancino_targets = ['Lake_Level', 'Flow_Rate']
arno_targets = ['Hydrometry_Nave_di_Rosano']
amiata_targets = ['Flow_Rate_Bugnano','Flow_Rate_Arbure', 
                  'Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta']
lupa_targets = ['Flow_Rate_Lupa']
madonna_targets = ['Flow_Rate_Madonna_di_Canneto']

# Defining functions

In [None]:
#Defining some functions
def df_relinfo(df, target_var=[]):
    x = pd.DataFrame(df.isna().sum().apply(lambda x: x/df.shape[0])).reset_index().rename(columns={'index':'Feature',0:'%Na'})
    x['Na_qty'] = df.isna().sum().tolist()
    x['Variable'] = x.Feature.apply(lambda x: 'Target' if x in target_var else 'Predictor')
    return x.sort_values(by='%Na', ascending = False).reset_index(drop=True).style.bar(subset = ['%Na'], color = '#118DFF')

def corr_plot(data, top_visible=False, right_visible=False, bottom_visible=True, left_visible=False, ylabel=None, figsize=(15,11), axis_grid='y'):
    fig, ax = plt.subplots(figsize=figsize)
    plt.title('Correlations (Pearson)', size=15, fontweight='bold')
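    # Correlate numeric columns only; text columns such as 'Season' cannot be correlated
    corr = data.select_dtypes(include='number').corr()
    # Mask the upper triangle, which mirrors the lower one
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(round(corr, 2), mask=mask, cmap='viridis', annot=True)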
    plt.show()
    
def line_plot(data, y, title, color, top_visible=False, right_visible=False, bottom_visible=True, left_visible=False,
             ylabel=None, figsize=(10,4), axis_grid='y'):
    fig, ax = plt.subplots(figsize=figsize)
    plt.title(title, size=15, fontweight='bold')
    
    for i in ['top','right','bottom','left']:
        ax.spines[i].set_color('black')
    #    ax.spines[i].set_visible(i+'_visible')
    
    ax.spines['top'].set_visible(top_visible)
    ax.spines['right'].set_visible(right_visible)
    ax.spines['bottom'].set_visible(bottom_visible)
    ax.spines['left'].set_visible(left_visible)
    
    sns.lineplot(x=range(len(data[y])), y=data[y], dashes=False, color=color, linewidth=.5)
    ax.xaxis.set_major_locator(plt.MaxNLocator(20))
    
    ax.set_xticks([])
    plt.xticks(rotation=90)
    plt.xlabel('')
    plt.ylabel(ylabel)
    ax.grid(axis=axis_grid, alpha=0.9, linestyle='--')
    plt.show()

def columns_viz(data, color):
    for i in range(len(data.columns)):
        line_plot(data=data, y=data.columns[i], color=color, 
                 title='{} dynamics'.format(data.columns[i]),
                  bottom_visible=False, figsize=(10,2))
        
# some more helper functions
def add_month(df):
    """
    Convert date to a date object, then create the month column
    """
    df = df.reset_index()
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values(by = 'Date')
    
    df['Month'] = pd.DatetimeIndex(df['Date']).month
    return df

def add_year(df):
    """
    add a column for the year
    """
    df['Year'] = pd.DatetimeIndex(df['Date']).year
    return df

def add_seasons(df):
    """
    This function will add the season (winter, spring, summer, autumn) based on the month
    Spring: March, April, May
    Summer: June, July, August
    Autumn: September, October, November
    Winter: December, January, February
    """
    months = df['Month'].unique()
    df['Season'] = df['Month']
    for month in months:
        if month in [12,1,2]:
            df.loc[lambda df: df['Month'] == month, 'Season'] = '1_Winter'
        elif month in [3,4,5]:
            df.loc[lambda df: df['Month'] == month, 'Season'] = '2_Spring'
        elif month in [6,7,8]:
            df.loc[lambda df: df['Month'] == month, 'Season'] = '3_Summer'
        else:
            df.loc[lambda df: df['Month'] == month, 'Season'] = '4_Autumn'
    return df

def do_dates(df):
    df = add_month(df)
    df = add_year(df)
    df = add_seasons(df)
    return df
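
As a side note, the month-to-season rule described in the `add_seasons` docstring can also be written as a single dictionary lookup. The snippet below is only a minimal sketch of that idea: `SEASON_BY_MONTH` and `add_seasons_vectorized` are hypothetical names that are not used elsewhere in this notebook, and the function works on a copy instead of modifying its input.

```python
import pandas as pd

# Hypothetical lookup table: month number -> season label used in this notebook
SEASON_BY_MONTH = {
    12: '1_Winter', 1: '1_Winter', 2: '1_Winter',
    3: '2_Spring', 4: '2_Spring', 5: '2_Spring',
    6: '3_Summer', 7: '3_Summer', 8: '3_Summer',
    9: '4_Autumn', 10: '4_Autumn', 11: '4_Autumn',
}

def add_seasons_vectorized(df):
    """Return a copy of df with a 'Season' column derived from 'Month'."""
    out = df.copy()
    out['Season'] = out['Month'].map(SEASON_BY_MONTH)
    return out

# Toy data (made up) just to show the mapping
demo = add_seasons_vectorized(pd.DataFrame({'Month': [1, 4, 7, 10, 12]}))
print(demo)
```

The numeric prefix in each label keeps the seasons in calendar order when they are sorted or grouped.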

# River Arno Analysis

## EDA

In [None]:
print('The earliest date is: \t', datasets[5].index[0])
print('The latest date is: \t', datasets[5].index[-1])

Let's take a quick look at the features, missing values and types of variables.
We can see 9 predictors with over 44% of missing values.

In [None]:
df_relinfo(rv_arno,arno_targets)

Let's now plot the correlation matrix.
Hydrometry shows a strong negative correlation with Temperature: the higher the temperature, the lower the target.
One set of rainfall features also correlates with the target less than the others. The rainfall features with the higher correlation happen to be the ones with the most missing values, which may cause problems later on.

In [None]:
corr_plot(datasets[5])
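
To make the correlation figures easier to interpret, here is a minimal sketch of how a Pearson coefficient is computed when some values are missing. `pearson_r` is a hypothetical helper and the numbers are made up. It keeps only the rows where both series are present, which is also how pandas' `DataFrame.corr` treats NaNs, so a coefficient for a column with many gaps is based on fewer rows than one for a complete column.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation using only rows where both values are present."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))   # pairwise-complete observations
    x, y = x[keep], y[keep]
    xc, yc = x - x.mean(), y - y.mean()   # center both series
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Toy data (made up): the level drops as the temperature rises
temperature = [5.0, 10.0, np.nan, 20.0, 25.0]
level = [3.0, 2.5, 2.2, 1.4, 1.1]
r = pearson_r(temperature, level)
print(round(r, 3))
```

A value close to -1 means the two series move in opposite directions almost linearly, which is the pattern noted above for Temperature and Hydrometry.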

Let's now plot the daily dynamics for a quick glimpse.

In [None]:
columns_viz(datasets[5], '#FF5733')

Let's now plot the monthly dynamics. The scales are tweaked (log transform, and Hydrometry multiplied by 10) so that the series can be compared on one axis.
Note how the rainfall data starts around 2003-2004 and the temperature data ends in 2017.

In [None]:
#Adding rainfall Sum, year, month, month_year
df = rv_arno[['Hydrometry_Nave_di_Rosano', 'Temperature_Firenze']].reset_index()
df['rainfall'] = rv_arno.iloc[:, 0:-2].sum(axis = 1).values
df['year'] = pd.to_datetime(df.Date).dt.year
df['month'] = pd.to_datetime(df.Date).dt.month
df['month_year'] = pd.to_datetime(df.Date).apply(lambda x: x.strftime('%Y/%m'))

In [None]:
# Monthly dynamics
r_means = np.log(df.groupby('month_year').Hydrometry_Nave_di_Rosano.mean() * 10).reset_index()
r_means['month_year'] = pd.to_datetime(r_means['month_year'])

r_rain = np.log(df.groupby('month_year').rainfall.mean()).reset_index()
r_rain['month_year'] = pd.to_datetime(r_rain['month_year'])

r_temp = np.log(df.groupby('month_year').Temperature_Firenze.mean()).reset_index()
r_temp['month_year'] = pd.to_datetime(r_temp['month_year'])

fig, ax = plt.subplots(figsize = (15, 5))
plt.title('Monthly dynamics (Arno River)', size = 15, fontweight = 'bold')
          
sns.lineplot(data = r_rain, x = 'month_year', y = 'rainfall',  
             color = 'gray', label = 'Rainfall', alpha = 0.4)
plt.xticks(rotation = 45)
sns.lineplot(data = r_temp, x = 'month_year', y = 'Temperature_Firenze', 
             color = 'green', label = 'Temperature_Firenze', alpha = 0.6)
plt.xticks(rotation = 45)
sns.lineplot(data = r_means, x = 'month_year', y = 'Hydrometry_Nave_di_Rosano', 
             color = 'blue', label = 'Hydrometry')
plt.xticks(rotation = 45)
    
for i in ['top', 'right', 'bottom', 'left']:
        ax.spines[i].set_visible(False)

ax.set_xticks(r_means.month_year[::12])
ax.set_xticklabels(range(1998, 2021, 1))
ax.set_xlabel('')
ax.set_ylabel('')
ax.grid(axis = 'y', linestyle = '--', alpha = 0.9)
plt.show()

Now let's plot the yearly dynamics.

In [None]:
# Yearly dynamics
r_means_y = np.log(df.groupby('year').Hydrometry_Nave_di_Rosano.mean() * 10).reset_index()
r_rain_y = np.log(df.groupby('year').rainfall.mean()).reset_index()
r_temp_y = np.log(df.groupby('year').Temperature_Firenze.mean()).reset_index()

fig, ax = plt.subplots(figsize = (15, 5))
plt.title('Yearly dynamics (Arno River)', size = 15, fontweight = 'bold')
          
sns.lineplot(data = r_rain_y, x = 'year', y = 'rainfall',  
             color = 'gray', label = 'Rainfall', alpha = 0.4)
plt.xticks(rotation = 45)
sns.lineplot(data = r_temp_y, x = 'year', y = 'Temperature_Firenze', 
             color = 'green', label = 'Temperature_Firenze', alpha = 0.6)
plt.xticks(rotation = 45)
sns.lineplot(data = r_means_y, x = 'year', y = 'Hydrometry_Nave_di_Rosano', 
             color = 'blue', label = 'Hydrometry')
plt.xticks(rotation = 45)
    
for i in ['top', 'right', 'bottom', 'left']:
        ax.spines[i].set_visible(False)

ax.set_xticks(r_means_y.year)
ax.set_xlabel('')
ax.set_ylabel('')
ax.grid(axis = 'y', linestyle = '--', alpha = 0.9)
plt.show()

Let's now plot the missing values with Missingno, which also shows how the missing values are distributed over time.
We can see that all Rainfall features started to be recorded at the same time, which was later than Temperature and Hydrometry.
We can also see that Temperature stopped being recorded before the last day of the dataset.

In [None]:
msno.matrix(rv_arno)

## Feature Engineering

Due to the high share of missing values (over 44%), I will drop the following 9 Rainfall features: 'Rainfall_Vernio','Rainfall_Stia', 'Rainfall_Consuma', 'Rainfall_Incisa', 'Rainfall_Montevarchi', 'Rainfall_S_Savino', 'Rainfall_Laterina', 'Rainfall_Bibbiena', 'Rainfall_Camaldoli'.

Do not forget that these Rainfall features have a higher correlation with the target than the ones that will remain.

I will now create Month, Year and Season columns, so that the rainfall features can be grouped by season and I can try to fill the missing values with the mean within the season. I believe this approach is better than using the mean of the year. The same approach can be used for the temperature.
Moreover, I will add a Rainfall_Mean column with the mean of the remaining rainfall features.

In [None]:
rv_arno_wrk = do_dates(rv_arno).drop(columns=['Rainfall_Vernio','Rainfall_Stia','Rainfall_Consuma', \
    'Rainfall_Incisa', 'Rainfall_Montevarchi', 'Rainfall_S_Savino', 'Rainfall_Laterina', 'Rainfall_Bibbiena', 'Rainfall_Camaldoli'])
rv_arno_wrk['Rainfall_Mean'] = rv_arno_wrk.iloc[:, 1:6].mean(axis = 1).values
rv_arno_wrk

From the plot below we can see that rainfall varies considerably across seasons.

In [None]:
# Seasonal mean of each rainfall column (including Rainfall_Mean)
rain_cols = [c for c in rv_arno_wrk.columns if c.startswith('Rainfall_')]
test = rv_arno_wrk.groupby('Season')[rain_cols].mean()

fig, ax = plt.subplots(figsize = (15, 5))
sns.lineplot(data=test, dashes=False)

I decided to drop all records before 2004 because of the quantity of missing values (this might not be the best approach, but I prefer it to filling the NaNs of each rainfall feature before 2004 with data from after 2004).\
I will also fill the few missing values in Hydrometry with ffill, since there are only three of them and they are not consecutive. Given the type of variable, carrying the last measurement forward for one more day seems better than using an average.\
I found 187 rows with Hydrometry = 0, which looks like a data collection issue to me. I will replace these values with the ffill method as well.

In [None]:
# Keep only data after 2004-01-01, because the Rainfall variables are all missing before 2004.
# Fill the NaN values in the target with ffill (only 3 rows affected).
# Treat 0 values in the target as missing and ffill them too (187 rows, mostly not consecutive).
# .copy() avoids modifying a view of the original dataframe.
rv_arno_wrk = rv_arno_wrk[rv_arno_wrk.Date>'2004-01-01'].copy()
rv_arno_wrk['Hydrometry_Nave_di_Rosano'] = (rv_arno_wrk['Hydrometry_Nave_di_Rosano']
                                            .replace(0, np.nan)
                                            .ffill())
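
The forward-fill treatment described above can be illustrated on a toy series. This is only a sketch with made-up values, and `fill_gaps_and_zeros` is a hypothetical helper name. It treats zeros as missing and then carries the last valid measurement forward.

```python
import numpy as np
import pandas as pd

def fill_gaps_and_zeros(series):
    """Treat zeros as missing, then carry the last valid value forward."""
    return series.replace(0, np.nan).ffill()

# Toy river levels (made up): one NaN and two consecutive zeros
level = pd.Series([1.2, np.nan, 1.3, 0.0, 0.0, 1.5])
filled = fill_gaps_and_zeros(level)
print(filled.tolist())
```

Note that a gap at the very start of the series would stay NaN, because there is no earlier value to carry forward.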

Let's now dive into Temperature's missing values. There are 1082 missing values.

In [None]:
rv_arno_wrk['Temperature_Firenze'].isnull().sum()

From the plot below, we can see a strong relationship between the temperature and the month (as we would intuitively expect). Based on this, I will replace all missing temperature values with the mean temperature of that month across all years.

In [None]:
# Mean temperature per (Year, Month), reshaped to one column per year
rv_arno_tmp = rv_arno_wrk.groupby(['Year','Month'])['Temperature_Firenze'].mean().reset_index()
rv_arno_tmp = rv_arno_tmp.pivot(index='Month', columns='Year', values='Temperature_Firenze')
fig, ax = plt.subplots(figsize = (15, 5))
plt.title('Average Temperature per Month and Year', size = 15, fontweight = 'bold')
sns.lineplot(data=rv_arno_tmp);

In [None]:
# Mean temperature per calendar month, across all years
rv_arno_tmp_mean = rv_arno_wrk.groupby('Month')[['Temperature_Firenze']].mean()
rv_arno_tmp_mean

In [None]:
for month in range(1,13):
    rv_arno_wrk.loc[lambda x: (x['Month']==month) & (x['Temperature_Firenze'].isnull()), 'Temperature_Firenze'] = rv_arno_tmp_mean.loc[month,'Temperature_Firenze']
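
The same month-mean imputation can be expressed without an explicit loop. The snippet below is a minimal sketch on made-up data, with `fill_with_month_mean` as a hypothetical helper. `groupby(...).transform('mean')` returns, for every row, the mean of its own month, which `fillna` then uses only where the value is missing.

```python
import numpy as np
import pandas as pd

def fill_with_month_mean(df, value_col, month_col='Month'):
    """Return a copy of df where NaNs in value_col are replaced by the month mean."""
    out = df.copy()
    # Row-aligned Series holding the mean of each row's month (NaNs are skipped)
    month_mean = out.groupby(month_col)[value_col].transform('mean')
    out[value_col] = out[value_col].fillna(month_mean)
    return out

# Toy data (made up): one missing value in January and one in July
demo = pd.DataFrame({
    'Month': [1, 1, 1, 7, 7, 7],
    'Temperature_Firenze': [4.0, 6.0, np.nan, 24.0, np.nan, 26.0],
})
filled = fill_with_month_mean(demo, 'Temperature_Firenze')
print(filled)
```

Existing measurements are left untouched, so only the gaps take the monthly average.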

Let's check whether any feature still has missing values.

In [None]:
msno.matrix(rv_arno_wrk)

Now, I'll plot the monthly dynamics again to see how they look after the feature engineering.

In [None]:
# Adding year, month, month_year (Rainfall_Mean already exists); .copy() avoids writing to a view
df = rv_arno_wrk[['Date','Rainfall_Mean','Hydrometry_Nave_di_Rosano', 'Temperature_Firenze']].copy()
#df['rainfall'] = rv_arno.iloc[:, 0:-2].sum(axis = 1).values
df['year'] = pd.to_datetime(df.Date).dt.year
df['month'] = pd.to_datetime(df.Date).dt.month
df['month_year'] = pd.to_datetime(df.Date).apply(lambda x: x.strftime('%Y/%m'))

# Monthly dynamics
r_means = np.log(df.groupby('month_year').Hydrometry_Nave_di_Rosano.mean() * 10).reset_index()
r_means['month_year'] = pd.to_datetime(r_means['month_year'])

r_rain = np.log(df.groupby('month_year').Rainfall_Mean.mean() *10).reset_index()
r_rain['month_year'] = pd.to_datetime(r_rain['month_year'])

r_temp = np.log(df.groupby('month_year').Temperature_Firenze.mean()).reset_index()
r_temp['month_year'] = pd.to_datetime(r_temp['month_year'])

fig, ax = plt.subplots(figsize = (15, 5))
plt.title('Monthly dynamics (Arno River)', size = 15, fontweight = 'bold')
          
sns.lineplot(data = r_rain, x = 'month_year', y = 'Rainfall_Mean',  
             color = 'gray', label = 'Rainfall', alpha = 0.4)
plt.xticks(rotation = 45)
sns.lineplot(data = r_temp, x = 'month_year', y = 'Temperature_Firenze', 
             color = 'green', label = 'Temperature_Firenze', alpha = 0.6)
plt.xticks(rotation = 45)
sns.lineplot(data = r_means, x = 'month_year', y = 'Hydrometry_Nave_di_Rosano', 
             color = 'blue', label = 'Hydrometry')
plt.xticks(rotation = 45)
    
for i in ['top', 'right', 'bottom', 'left']:
        ax.spines[i].set_visible(False)

ax.set_xlabel('')
ax.set_ylabel('')
ax.grid(axis = 'y', linestyle = '--', alpha = 0.9)
plt.show()

In [None]:
rv_arno_wrk_model = rv_arno_wrk.set_index('Date')

Let's now plot the Correlations matrix again to see how it looks.

In [None]:
corr_plot(rv_arno_wrk_model.drop(columns=['Year','Month']))

# Model prediction

I will split the data into 70% for training and 30% for testing, keeping the chronological order (no shuffling), and apply XGBoost because it is powerful and popular, although other models could also be applied.
Since XGBoost is tree-based, I'm not normalizing the data (it is not required). The parameters were chosen after some experimentation.

In [None]:
y = rv_arno_wrk_model['Hydrometry_Nave_di_Rosano']
X = rv_arno_wrk_model.drop(['Hydrometry_Nave_di_Rosano','Season','Month','Year'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, shuffle = False)

params = {'n_estimators': 100,
          'max_depth': 4,
          'subsample': 0.7,
          'learning_rate': 0.04,
          'random_state': 0}

model = XGBRegressor(**params)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('MAE value: %.4f'%mean_absolute_error(y_test, y_pred))
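
To give the MAE some context, the sketch below shows on toy numbers what an unshuffled 70/30 split and the MAE formula amount to. `chronological_split` and `mae` are hypothetical helpers written only for illustration, and the "repeat the last training value" prediction is an assumed naive reference point, not part of the model above.

```python
import numpy as np

def chronological_split(values, train_size=0.7):
    """Split an ordered sequence into an earlier and a later part, without shuffling."""
    values = np.asarray(values)
    cut = int(len(values) * train_size)
    return values[:cut], values[cut:]

def mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy series (made up): values 0..9 in time order
series = np.arange(10.0)
train, test = chronological_split(series)

# Naive reference: repeat the last training value for every test step
naive_pred = np.full(len(test), train[-1])
print(train, test, mae(test, naive_pred))
```

Comparing a model's MAE with such a naive reference on the same test period is one way to judge whether the error is small or large for this series.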

The Mean Absolute Error of the predicted values is 0.3607, meaning that on average the model has an error of about 36 cm.

I will now plot the feature importances, where we can see that Temperature is the most important feature.

In [None]:
def model_imp_viz(model, train_data, bias = 0.01):
    imp = pd.DataFrame({'importance': model.feature_importances_,
                        'features': train_data.columns}).sort_values('importance', 
                                                                     ascending = False)
    fig, ax = plt.subplots(figsize = (10, 4))
    plt.title('Feature importances', size = 15, fontweight = 'bold')

    # Pass the reversed palette as a list (reversed() alone returns an iterator)
    sns.barplot(x = imp.importance, y = imp.features, edgecolor = 'black',
                palette = list(reversed(sns.color_palette("viridis", len(imp.features)))))

    for i in ['top', 'right']:
            ax.spines[i].set_visible(False)

    rects = ax.patches
    labels = imp.importance
    for rect, label in zip(rects, labels):
        x_value = rect.get_width() + bias
        y_value = rect.get_y() + rect.get_height() / 2

        ax.text(x_value, y_value, round(label, 3), fontsize = 9, color = 'black',
                 ha = 'center', va = 'center')
    ax.set_xlabel('Importance', fontweight = 'bold')
    ax.set_ylabel('Features', fontweight = 'bold')
    plt.show()

In [None]:
model_imp_viz(model, X_train)

Let's now plot the Hydrometry Real vs Predicted.

In [None]:
def predicted_viz(y_test, y_pred, param, name):
    rm = y_test.reset_index()
    rm['month_year'] = pd.to_datetime(rm.Date).apply(lambda x: x.strftime('%Y/%m'))
    rm_means = rm.groupby('month_year')[param].mean().reset_index()
    rm_means['month_year'] = pd.to_datetime(rm_means['month_year'])

    pm = pd.DataFrame({'Date': y_test.index, param: y_pred})
    pm['month_year'] = pd.to_datetime(pm.Date).apply(lambda x: x.strftime('%Y/%m'))
    pm_means = pm.groupby('month_year')[param].mean().reset_index()
    pm_means['month_year'] = pd.to_datetime(pm_means['month_year'])

    fig, ax = plt.subplots(figsize = (15, 5))
    plt.title('{} prediction ({})'.format(param, name), size = 15, 
              fontweight = 'bold')

    sns.lineplot(data = rm_means, x = 'month_year', y = param, 
                 color = 'blue', label = 'Real {}'.format(param), alpha = 1)
    sns.lineplot(data = pm_means, x = 'month_year', y = param, 
                 color = 'red', label = 'Pred {}'.format(param), alpha = 0.5)

    for i in ['top', 'right', 'bottom', 'left']:
            ax.spines[i].set_visible(False)

    ax.set_xticks(rm_means.month_year[::12])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.grid(axis = 'y', linestyle = '--', alpha = 0.9)
    plt.show()

In [None]:
predicted_viz(y_test, y_pred, 'Hydrometry_Nave_di_Rosano', 'Arno River')

In [None]:
def resid_viz(y_test, y_pred):
    resid = abs(y_test - y_pred)
    fig, ax = plt.subplots(figsize = (10, 5))
    plt.title('Residuals', size = 15, fontweight = 'bold')

    sns.scatterplot(x = y_test, y = resid, color = 'red', 
                    edgecolor = 'black', alpha = 0.7)

    for i in ['top', 'right']:
            ax.spines[i].set_visible(False)

    ax.set_xlabel('Real values', fontweight = 'bold')
    ax.set_ylabel('Residuals', fontweight = 'bold')
    plt.show()

In [None]:
resid_viz(y_test, y_pred)

Residual distribution is a powerful tool for assessing the quality of a model. The roughly linear dependence of the residuals on the real values, which is most pronounced for high hydrometry values, suggests that our model does not capture all the dependencies. Perhaps if all predictors were used (this is not possible due to missing values), the model would do much better.
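
As a complement to the scatter plot, the pattern can also be summarized numerically. The snippet below is a minimal sketch on made-up numbers, with `residual_summary` as a hypothetical helper. It uses signed residuals (actual minus predicted), so a positive bias means the model under-predicts, and it reports how strongly the residuals correlate with the actual values.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Summarize signed residuals (actual - predicted)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred  # positive value = under-prediction
    return {
        'mae': float(np.mean(np.abs(resid))),
        'bias': float(np.mean(resid)),
        'corr_with_actual': float(np.corrcoef(y_true, resid)[0, 1]),
    }

# Toy numbers (made up): predictions that are too flat for high levels
actual = [1.0, 1.5, 2.0, 3.0, 4.0]
flat_pred = [1.2, 1.5, 1.8, 2.2, 2.5]
summary = residual_summary(actual, flat_pred)
print(summary)
```

A correlation close to 1 between residuals and actual values would match the linear pattern in the plot: the higher the real level, the more the prediction falls short.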

# Further steps

* Try different algorithms.
* Analyze the remaining 8 waterbodies.
* Try new parameters for XGB.
* Improve data selection and preprocessing.

# Inspiration/Credits/Sources
https://www.kaggle.com/tomwarrens/intro-to-time-series-analysis \
https://www.kaggle.com/marcomarchetti/acea-smart-water-eda#7-Conclusions \
https://www.kaggle.com/maksymshkliarevskyi/acea-smart-water-eda-prediction/execution \
https://www.kaggle.com/iamleonie/intro-to-time-series-forecasting \
https://www.kaggle.com/lucena1990/acea-smater-water-water-availability-data#Exploratory-Data-Analysis \
https://www.kaggle.com/kevinnolasco/random-forests-and-early-stopping-to-predict-water \
Kirill Eremenko, Hadelin de Ponteves, Super Data Science