# The Effect of Rainfall on River Arno 

**TABLE OF CONTENTS**
1. Introduction <br>
*1.1 River Arno*<br>
*1.2 Hydrometry*<br>
*1.3 Rainfall*<br>
2. Data Exploration <br>
3. Annual Data 2015-2019 <br>
4. Hydrometry And Rainfall Average 2015-2019 <br>
5. Conclusion<br>
6. *Addendum: Lake Bilancino*<br>
7. *The Xboost Expedition*

Although this notebook uses the *Acea Smart Water Analytics* competition dataset, it is not a contest entry. As someone dedicated to learning more about data analysis, I do not yet have the skills to build comprehensive and reliable predictive models, which was the aim of the competition in question. Rather, this notebook is part of a personal learning process.

The research question of this notebook was, however, indirectly provided by the competition organizers. The contest briefing included the following paragraph:

**"It is of the utmost importance to notice that some features like rainfall and temperature, which are present in each dataset, don’t go alongside the date. Indeed, both rainfall and temperature affect features like level, flow, depth to groundwater and hydrometry some time after it fell down. This means, for instance, that rain fell on 1st January doesn’t affect the mentioned features right the same day but some time later.** ***As we don’t know how many days/weeks/months later rainfall affects these features, this is another aspect to keep into consideration when analyzing the dataset."***

As noted, the *Acea Smart Water Analytics* competition was all about predictive models. Personally, I was fascinated by the notion that, by their own admission, ***the organizers did not know how rainfall affects any of their water resources***. Therefore, I set as my research question to analyze one particular water resource, namely river Arno, from the viewpoint of rainfall.

*So, what is the relationship between rainfall and river Arno water level?* Let's see if we find anything.

*January 22nd, 2021<br>*
*Jari Peltola*

> ## 1. Introduction

### 1.1 River Arno

As the competition brief has it:

"*Arno is the second largest river in peninsular Italy and the main waterway in Tuscany and it has a relatively torrential regime, due to the nature of the surrounding soils (marl and impermeable clays). Arno results to be the main source of water supply of the metropolitan area of Florence-Prato-Pistoia. The availability of water for this waterbody is evaluated by checking the hydrometric level of the river at the section of Nave di Rosano.*"

This introduction tells us two important things. First, **the soil surrounding river Arno is impermeable**, meaning it does not retain water. This fact alone suggests that rainfall in the area finds its way to Arno relatively fast.

Another thing worth noting is that **the water level is always checked at a particular section of the river**, meaning the number we get is consistent. At the same time, however, the figure does not necessarily tell us the whole truth about Arno, since one part of a river can behave quite differently from other sections.

Finally, **water consumption figures were not part of the competition dataset.** As increasing or decreasing consumption greatly affects water resources, this feature brings its own level of uncertainty to any analysis concerning the nature and behavior of these water resources. 

### 1.2 Hydrometry

The following passage was taken from an online article written by hydraulic engineer **Christian Lallement** (link to full article below):

https://www.encyclopedie-environnement.org/en/water/hydrometry-measuring-flow-river-why-how/


"*Hydrometry, a science distinct and complementary to hydrology (science of water in its natural environment) and hydraulics (physics of flows), is the discipline that seeks to measure river flows. The flow rate -volume of water crossing a section of a stream for one unit of time- is expressed in cubic metres per second (m3/s). Each watercourse follows a particular regime, determined by the rhythm of precipitation and its hydrological “terroir”. For the world’s most populated river, the Amazon, the variation in flow between two extreme months of the same month is only one to two. And from one year to the next, its average annual flow at its mouth varies only 10 to 15% around its 206,000 m3/s value. The Amazon is an extremely regular river. On the other hand, an African river like the Chari has an average flow of 1197 m3/s at its outlet in Lake Chad. Within the same year, the variation in flow between two extreme months is a factor of 20 (150 to 3000 m3/s). And from one year to the next, the average annual flow can vary by a factor of two: 739 m3/s in 1942, 1720 m3/s in 1956. The Chari therefore has a much more contrasting regime.*"

Concerning river Arno, we now know how the river flow is measured. We also know that **the "rhythm" of the river is an important factor**. Actually, this issue was also mentioned in the contest brief:

"*During fall and winter waterbodies are refilled, but during spring and summer they start to drain. To help preserve the health of these waterbodies it is important to predict the most efficient water availability, in terms of level and water flow for each day of the year.*"

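Lastly, we now know that **hydrometry uses cubic metres per second (m3/s) as its unit of measurement**. For example, if the dataset contains a measurement of '1.26', this to my knowledge means that river Arno had a flow of 1260 cubic metres per second at that time. Note, however, that the competition brief quoted above speaks of the *hydrometric level* of the river, so the figure may instead be a water level in metres.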

### 1.3 Rainfall

The National Center for Atmospheric Research (NCAR) scientist **Peggy LeMone** has written the following about measuring rainfall (link to full post below):

https://scied.ucar.edu/blog/measuring-rainfall-–-it’s-easy-and-difficult-same-time-0

"*The exposure of the rain gauge is undoubtedly the greatest source of error.  According to the National Weather Service and CoCoRAHS (both of which use citizen volunteers to measure rainfall), “exposure” of the rain gauge is important. Rain may be blocked by nearby obstacles causing the number to be lower than it should. Or, rain may be blown into or away from the gauge by wind gusts.  The recommendation is that the gauge be about twice the distance from the height of the nearest obstacles, but still sheltered from the wind.*"

Thus **the rainfall figures can be misleading** because of the instruments used as well as circumstantial differences (a raincloud hovering over one observation area, etc.). However, **the competition dataset includes rainfall from several locations, enabling us to compare regional rainfall figures.**

Finally, it is assumed that all rainfall measurements in the dataset are given in millimetres, although I could not actually confirm this. I have personally measured both snowfall and rainfall while serving in the army, and there are different options regarding measurement frequency. Since there is only one rainfall measurement per day in the dataset, **it is further assumed that all rainfall measurements in the dataset are given in millimetres/day (mm/d) format**.

It is also worth noting that there is a way of converting rainfall into snowfall: in wintertime, a millimetre of water roughly equals one centimetre of snow.
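
As a quick illustration of the rule of thumb above, the conversion can be written as a small function. The function name and the `ratio` parameter are my own, and the 10:1 ratio is only a rough assumption, since real snow density varies a lot.

```python
def rainfall_mm_to_snow_cm(rain_mm, ratio=10.0):
    """Rough snow depth in cm for a given water equivalent in mm.

    Assumes the rule of thumb that 1 mm of water is about 1 cm of snow,
    i.e. a 10:1 snow-to-water ratio (an approximation, not a measured value).
    """
    snow_mm = rain_mm * ratio   # depth of snow in millimetres
    return snow_mm / 10.0       # millimetres to centimetres


# a 12 mm rainfall reading would correspond to roughly 12 cm of snow
print(rainfall_mm_to_snow_cm(12))
```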

As for our task, it looks like our work is cut out for us.

***We need to find the rhythm of River Arno.***

> ## 2. Data Exploration

First, let's import the modules.

In [None]:
# import modules
import math
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go

from pandas.testing import assert_frame_equal
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

Next we set the column and row display and upload the dataset as dataframe.

In [None]:
# set column and row display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# load dataset
df = pd.read_csv('../input/acea-water-prediction/River_Arno.csv') 

# make a working copy of the original dataframe
df_copy = df.copy()

# show ten last rows
df.tail(10)

We may also want to know the overall size of our dataframe.

In [None]:
#get dataframe shape
shape = df_copy.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

If we check the data type, we can see that the Date column is not in datetime format. Next we convert the column in question accordingly.

In [None]:
df_copy.dtypes

In [None]:
# change column to DateTime format
df_copy['Date'] =  pd.to_datetime(df_copy['Date'],dayfirst=True)
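
To see why `dayfirst=True` matters, here is a minimal sketch with made-up date strings in day/month/year order (the sample values are hypothetical, not taken from the dataset). It also shows an explicit `format`, which is stricter and fails loudly if a string does not match.

```python
import pandas as pd

# hypothetical sample in day/month/year layout
dates = pd.Series(['01/02/2019', '15/02/2019'])

# dayfirst=True reads 01/02/2019 as 1 February, not 2 January
parsed = pd.to_datetime(dates, dayfirst=True)
print(parsed.dt.month.tolist())

# an explicit format gives the same result and rejects unexpected layouts
parsed_strict = pd.to_datetime(dates, format='%d/%m/%Y')
print(parsed.equals(parsed_strict))
```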

As for river Arno, only the first five rainfall locations include any data. Therefore we make a new dataframe **df_arno**, which includes the five locations along with the datetime column.

In [None]:
# select preferred columns by name
df_arno = df_copy.loc[:,['Date', 'Rainfall_Le_Croci', 'Rainfall_Cavallina', 'Rainfall_S_Agata', 'Rainfall_Mangona', 'Rainfall_S_Piero', 'Hydrometry_Nave_di_Rosano']]
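
One way to check which columns actually contain data is to count the non-missing values per column. The sketch below uses a toy dataframe with hypothetical column names rather than the real dataset.

```python
import numpy as np
import pandas as pd

# toy frame with hypothetical column names; Rainfall_B has no readings at all
toy = pd.DataFrame({
    'Date': pd.date_range('2019-01-01', periods=4, freq='D'),
    'Rainfall_A': [0.0, 1.2, 0.0, 3.4],
    'Rainfall_B': [np.nan, np.nan, np.nan, np.nan],
})

# count non-missing values per column
coverage = toy.notna().sum()
print(coverage)

# keep only the columns that contain at least one reading
usable = coverage[coverage > 0].index.tolist()
print(usable)
```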

We can also change the column names to more practical form.

In [None]:
# shorten all column names in a single rename call
df_arno = df_arno.rename(columns={
    'Rainfall_Le_Croci': 'Le_Croci',
    'Rainfall_Cavallina': 'Cavallina',
    'Rainfall_S_Agata': 'S_Agata',
    'Rainfall_Mangona': 'Mangona',
    'Rainfall_S_Piero': 'S_Piero',
    'Hydrometry_Nave_di_Rosano': 'Hydrometry',
})

In [None]:
df_arno.head()

We can see that the hydrometry measurements go back to 1998. This analysis takes into account the assumed effect of climate change on old data. These days almost every new year seems to rank among the hottest on record, so the changing climate may render some of the older data increasingly obsolete. This is why this notebook uses only measurements from 2015 onwards.

Next we mask the dataframe to fit this criterion.

In [None]:
# mask dataframe
start_date = '2015-01-01'
end_date = '2020-06-30'

# wear a mask (end date included) and keep an independent copy
mask = (df_arno['Date'] >= start_date) & (df_arno['Date'] <= end_date)
df_arno = df_arno.loc[mask].copy()

df_arno.head()
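
A detail worth checking with date masks is whether the end date itself is included. The sketch below uses toy dates (not the dataset) to compare a strict `<` comparison with `Series.between`, which includes both ends by default.

```python
import pandas as pd

# five consecutive toy dates around the turn of the year
toy = pd.DataFrame({'Date': pd.date_range('2014-12-30', periods=5, freq='D')})

# strict upper bound: the end date itself is left out
strict = (toy['Date'] >= '2015-01-01') & (toy['Date'] < '2015-01-02')
strict_days = toy.loc[strict, 'Date'].dt.strftime('%Y-%m-%d').tolist()
print(strict_days)

# between() keeps both the start and the end date
inclusive = toy['Date'].between('2015-01-01', '2015-01-02')
inclusive_days = toy.loc[inclusive, 'Date'].dt.strftime('%Y-%m-%d').tolist()
print(inclusive_days)
```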

The data from year 2020 includes only the first six months of the year. For our purposes, however, we need data from all 12 months of the year.

Next we make five new dataframes, one for each year in 2015-2019. After masking the dataframe, the Hydrometry column is temporarily removed. After that, an average rainfall figure is calculated from the five individual rainfall locations. Finally the Hydrometry column is renamed and reinserted into the dataframe.

In [None]:
# mask dataframe
start_date_2019 = '2019-01-01'
end_date_2019 = '2019-12-31'

# wear a mask (end date included) and keep an independent copy
mask_2019 = (df_arno['Date'] >= start_date_2019) & (df_arno['Date'] <= end_date_2019)
df_arno_2019 = df_arno.loc[mask_2019].copy()

# remove column
pop_hydrometry_2019 = df_arno_2019.pop("Hydrometry")

# calculate average per dataframe row (numeric_only=True skips the Date column)
df_arno_2019['Rainfall_Mean_2019'] = df_arno_2019.mean(axis=1, numeric_only=True)

# reinsert column
df_arno_2019['Hydrometry_2019'] = pop_hydrometry_2019

# reset index and drop the old one
df_arno_2019 = df_arno_2019.reset_index(drop=True)

df_arno_2019.head(10)
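
An alternative to popping the Hydrometry column and averaging whatever remains is to name the rainfall columns explicitly. This is only a sketch on a toy dataframe with made-up values and two of the location names.

```python
import pandas as pd

# toy data with made-up rainfall values
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01', '2019-01-02']),
    'Le_Croci': [0.0, 2.0],
    'Cavallina': [1.0, 4.0],
    'Hydrometry': [1.5, 1.6],
})

# average only the listed rainfall columns, row by row
rain_cols = ['Le_Croci', 'Cavallina']
toy['Rainfall_Mean'] = toy[rain_cols].mean(axis=1)
print(toy['Rainfall_Mean'].tolist())
```

Missing values are skipped by `mean` by default, so a day with one missing location is averaged over the remaining ones.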

Next the same is done for the four remaining years included in the analysis.

In [None]:
start_date_2018 = '2018-01-01'
end_date_2018 = '2018-12-31'

mask_2018 = (df_arno['Date'] >= start_date_2018) & (df_arno['Date'] <= end_date_2018)
df_arno_2018 = df_arno.loc[mask_2018].copy()

pop_hydrometry_2018 = df_arno_2018.pop("Hydrometry")

# numeric_only=True keeps the Date column out of the row-wise average
df_arno_2018['Rainfall_Mean_2018'] = df_arno_2018.mean(axis=1, numeric_only=True)
df_arno_2018['Hydrometry_2018'] = pop_hydrometry_2018

df_arno_2018.reset_index(inplace = True) 

col = ['index']
df_arno_2018 = df_arno_2018.drop(col, axis=1)

In [None]:
start_date_2017 = '2017-01-01'
end_date_2017 = '2017-12-31'

mask_2017 = (df_arno['Date'] >= start_date_2017) & (df_arno['Date'] <= end_date_2017)
df_arno_2017 = df_arno.loc[mask_2017].copy()

pop_hydrometry_2017 = df_arno_2017.pop("Hydrometry")

# numeric_only=True keeps the Date column out of the row-wise average
df_arno_2017['Rainfall_Mean_2017'] = df_arno_2017.mean(axis=1, numeric_only=True)
df_arno_2017['Hydrometry_2017'] = pop_hydrometry_2017

df_arno_2017.reset_index(inplace = True) 

col = ['index']
df_arno_2017 = df_arno_2017.drop(col, axis=1)

In [None]:
start_date_2016 = '2016-01-01'
end_date_2016 = '2016-12-31'

mask_2016 = (df_arno['Date'] >= start_date_2016) & (df_arno['Date'] <= end_date_2016)
df_arno_2016 = df_arno.loc[mask_2016].copy()

pop_hydrometry_2016 = df_arno_2016.pop("Hydrometry")

# numeric_only=True keeps the Date column out of the row-wise average
df_arno_2016['Rainfall_Mean_2016'] = df_arno_2016.mean(axis=1, numeric_only=True)
df_arno_2016['Hydrometry_2016'] = pop_hydrometry_2016

df_arno_2016.reset_index(inplace = True) 

col = ['index']
df_arno_2016 = df_arno_2016.drop(col, axis=1)

In [None]:
start_date_2015 = '2015-01-01'
end_date_2015 = '2015-12-31'

mask_2015 = (df_arno['Date'] >= start_date_2015) & (df_arno['Date'] <= end_date_2015)
df_arno_2015 = df_arno.loc[mask_2015].copy()

pop_hydrometry_2015 = df_arno_2015.pop("Hydrometry")

# numeric_only=True keeps the Date column out of the row-wise average
df_arno_2015['Rainfall_Mean_2015'] = df_arno_2015.mean(axis=1, numeric_only=True)
df_arno_2015['Hydrometry_2015'] = pop_hydrometry_2015

df_arno_2015.reset_index(inplace = True) 

col = ['index']
df_arno_2015 = df_arno_2015.drop(col, axis=1)
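
Since the yearly cells differ only in the year, the same steps could probably be wrapped in a helper. The sketch below is one possible way to do it, run on toy data; `make_year_frame` and `RAIN_COLS` are names I made up, and the helper leaves the hydrometry column in its original position rather than moving it to the end.

```python
import pandas as pd

RAIN_COLS = ['Le_Croci', 'Cavallina', 'S_Agata', 'Mangona', 'S_Piero']


def make_year_frame(df, year, rain_cols=RAIN_COLS):
    """Return one year's rows with a mean rainfall column and a renamed hydrometry column."""
    # select the rows of the given calendar year as an independent copy
    out = df.loc[df['Date'].dt.year == year].copy()
    # row-wise mean over the rainfall columns only
    out[f'Rainfall_Mean_{year}'] = out[rain_cols].mean(axis=1)
    # tag the hydrometry column with the year
    out = out.rename(columns={'Hydrometry': f'Hydrometry_{year}'})
    return out.reset_index(drop=True)


# toy data standing in for df_arno (all values are made up)
toy = pd.DataFrame({
    'Date': pd.date_range('2018-12-30', periods=4, freq='D'),
    'Hydrometry': [1.5, 1.6, 1.7, 1.8],
})
for i, col in enumerate(RAIN_COLS):
    toy[col] = [float(i)] * 4

frames = {year: make_year_frame(toy, year) for year in (2018, 2019)}
print(frames[2019])
```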

Let's plot year 2019 to see what we are dealing with. For clarity, the first plot includes only three rainfall locations, and the other two are included in the next plot. The x-axis shows the date, whereas the y-axis shows both rainfall and hydrometry, which use different units, as we saw in the beginning. Therefore the y-axis does not represent a specific measurement unit but, rather, a shared numeric scale (tick range 0-50).

In [None]:
# plot figure
fig = go.Figure()

# add hydrometry trace
fig.add_trace(go.Scatter(x=df_arno_2019['Date'],
                y=df_arno_2019['Hydrometry_2019'],
                name='Hydrometry',
                mode='lines',
                marker_color='black'
               ))

# add rainfall traces
fig.add_trace(go.Scatter(x=df_arno_2019['Date'],
                y=df_arno_2019['Le_Croci'],
                name='Le Croci',
                mode='lines',         
                marker_color='orange'
                ))

fig.add_trace(go.Scatter(x=df_arno_2019['Date'],
                y=df_arno_2019['Cavallina'],
                name='Cavallina',
                mode='lines',         
                line=dict(color='blue', width=2,
                              dash='dot')         
                ))

fig.add_trace(go.Scatter(x=df_arno_2019['Date'],
                y=df_arno_2019['S_Agata'],
                name='S_Agata',
                mode='lines',          
                line=dict(color='green', width=2,
                              dash='dash')      
                ))

# set outlook etc.
fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

annotations = []

# add data source
annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# add header
fig.update_layout(
    title='<b>River Arno</b>:<br>Hydrometry and rainfall (2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        # nested title dict replaces the deprecated titlefont_size attribute
        title=dict(text='Hydrometry and rainfall', font=dict(size=14)),
        tickfont=dict(size=14),
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)

# set axes etc.
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_yaxes(title_text='Hydrometry and rainfall')
fig.update_yaxes(title_font=dict(size=14))

# show figure
fig.show()        

In [None]:
# plot figure
fig = go.Figure()

# add hydrometry trace
fig.add_trace(go.Scatter(x=df_arno_2019['Date'],
                y=df_arno_2019['Hydrometry_2019'],
                name='Hydrometry',
                mode='lines',
                marker_color='black'
               ))

# add rainfall traces
fig.add_trace(go.Scatter(x=df_arno_2019['Date'],
                y=df_arno_2019['Mangona'],
                name='Mangona',
                mode='lines',         
                marker_color='orange'
                ))

fig.add_trace(go.Scatter(x=df_arno_2019['Date'],
                y=df_arno_2019['S_Piero'],
                name='S_Piero',
                mode='lines',         
                line=dict(color='blue', width=2,
                              dash='dot')         
                ))

# set outlook etc.
fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

annotations = []

# add data source
annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# add header
fig.update_layout(
    title='<b>River Arno</b>:<br>Hydrometry and rainfall (2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        # nested title dict replaces the deprecated titlefont_size attribute
        title=dict(text='Hydrometry and rainfall', font=dict(size=14)),
        tickfont=dict(size=14),
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)

# set axes etc.
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_yaxes(title_text='Hydrometry and rainfall')
fig.update_yaxes(title_font=dict(size=14))

# show figure
fig.show()        

We can see that as far as flow is concerned, Arno is a stable and predictable river. Even during the winter months, the top water flow was below 7.0, with most days having a measurement closer to 2.0. As for rainfall, during the winter months there were local measurements of up to 64 millimetres a day. This kind of figure becomes even more impressive if one recalls that one millimetre of rainfall roughly equals one centimetre of snow.

It would be useful to have an average rainfall figure based on all five measurement locations, so we do that next. For this, we create a new dataframe **df_rainfall**.

In [None]:
# make new dataframe
df_rainfall = df_arno.copy()

# remove column
pop_hydrometry = df_rainfall.pop("Hydrometry")

# calculate average per row (numeric_only=True skips the Date column)
df_rainfall['Rainfall_Mean'] = df_rainfall.mean(axis=1, numeric_only=True)

# reinsert column
df_rainfall['Hydrometry'] = pop_hydrometry

df_rainfall.head(10)

Now we can inspect the rainfall average and hydrometry from years 2015-2019 more closely.

In [None]:
fig = go.Figure()

 
fig.add_trace(go.Scatter(x=df_rainfall['Date'],
                y=df_rainfall['Hydrometry'],
                name='Hydrometry',
                mode='lines',
                marker_color='black'
               ))


fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>River Arno</b>:<br>Hydrometry (2015-)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        # nested title dict replaces the deprecated titlefont_size attribute
        title=dict(text='Hydrometry', font=dict(size=14)),
        tickfont=dict(size=14),
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.7,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

In [None]:
fig = go.Figure()

 
fig.add_trace(go.Scatter(x=df_rainfall['Date'],
                y=df_rainfall['Hydrometry'],
                name='Hydrometry',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_rainfall['Date'],
                y=df_rainfall['Rainfall_Mean'],
                name='Rainfall Average',
                mode='lines',         
                line=dict(color='red', width=2,
                               dash='dot')    
                ))

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>River Arno</b>:<br>Hydrometry and average rainfall (2015-)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        # nested title dict replaces the deprecated titlefont_size attribute
        title=dict(text='Hydrometry and rainfall', font=dict(size=14)),
        tickfont=dict(size=14),
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.7,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

At least in 2015-2019 the peak periods have been rather predictable as far as the amount of rainfall and the date of occurrence are concerned.
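
One common way to put a number on the delay between rainfall and river response is a lagged correlation: shift the rainfall series forward by a number of days and see which shift correlates best with the hydrometry series. This is not something done in this notebook; the sketch below uses synthetic data with a known two-day delay, and `lagged_correlation` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def lagged_correlation(rain, level, max_lag=10):
    """Pearson correlation between rain shifted by 0..max_lag days and level."""
    return pd.Series({lag: rain.shift(lag).corr(level) for lag in range(max_lag + 1)})


# synthetic series: the level responds to rain exactly two days later
rng = np.random.default_rng(0)
rain = pd.Series(rng.gamma(shape=0.5, scale=10.0, size=200))
level = 1.0 + 0.05 * rain.shift(2).fillna(0.0)

corr_by_lag = lagged_correlation(rain, level, max_lag=5)
print(corr_by_lag.round(2))
print('best lag:', corr_by_lag.idxmax())
```

On real data the same function could in principle be applied to the `Rainfall_Mean` and `Hydrometry` columns of **df_rainfall**, although the peak would be far less clean than in this synthetic example.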

## 3. Annual data 2015-2019

We continue by concentrating on the annual datasets. Next we make copies of all five of them and select the Hydrometry column along with the average rainfall column.

In [None]:
# copy dataframes to new ones
df_arno_2019_copy = df_arno_2019.copy()
df_arno_2018_copy = df_arno_2018.copy()
df_arno_2017_copy = df_arno_2017.copy()
df_arno_2016_copy = df_arno_2016.copy()
df_arno_2015_copy = df_arno_2015.copy()

# select preferred columns 
df_arno_2019_copy = df_arno_2019_copy.loc[:,['Rainfall_Mean_2019', 'Hydrometry_2019']]
df_arno_2018_copy = df_arno_2018_copy.loc[:,['Rainfall_Mean_2018', 'Hydrometry_2018']]
df_arno_2017_copy = df_arno_2017_copy.loc[:,['Rainfall_Mean_2017', 'Hydrometry_2017']]
df_arno_2016_copy = df_arno_2016_copy.loc[:,['Rainfall_Mean_2016', 'Hydrometry_2016']]
df_arno_2015_copy = df_arno_2015_copy.loc[:,['Rainfall_Mean_2015', 'Hydrometry_2015']]

It is notable that 2016 was a leap year, so its annual data includes one more row than the other years. However, since rainfall and weather in general do not follow a strict daily schedule, this is not considered a major issue in this notebook.

To prepare for merging the dataframes, we next create a new variable with a simple ascending numeric value (1-365). After that, the variable is assigned to all five dataframes as the column Day.

In [None]:
# create variable
one_to_365 = pd.Series(range(1,366))

# set new column
df_arno_2015_copy['Day'] = one_to_365
df_arno_2016_copy['Day'] = one_to_365
df_arno_2017_copy['Day'] = one_to_365
df_arno_2018_copy['Day'] = one_to_365
df_arno_2019_copy['Day'] = one_to_365
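
A day counter can also be derived directly from a date column with `dt.dayofyear`. The sketch below uses generated dates rather than the notebook's dataframes, and shows how a leap year differs from a regular one.

```python
import pandas as pd

# full calendar years as toy date series
dates_2016 = pd.Series(pd.date_range('2016-01-01', '2016-12-31', freq='D'))
dates_2019 = pd.Series(pd.date_range('2019-01-01', '2019-12-31', freq='D'))

# day of year runs from 1 to 365, or to 366 in a leap year
day_2016 = dates_2016.dt.dayofyear
day_2019 = dates_2019.dt.dayofyear

print(len(day_2016), day_2016.max())  # leap year
print(len(day_2019), day_2019.max())  # regular year
```

Note that with this approach 1 March is day 61 in a leap year and day 60 otherwise, so calendar dates after February are offset by one between the two kinds of year.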

Let's take year 2019 as an example to see what we have:

In [None]:
df_arno_2019_copy.head(10)

Next we merge our five dataframes based on the common Day column. We do this with two dataframes at a time.

In [None]:
# merge dataframes
df_con_one = pd.merge(df_arno_2015_copy, df_arno_2016_copy, on='Day')
df_con_two = pd.merge(df_arno_2017_copy, df_arno_2018_copy, on='Day')
df_con_three = pd.merge(df_con_one, df_con_two, on='Day')
df_arno_1519 = pd.merge(df_con_three, df_arno_2019_copy, on='Day')

df_arno_1519.tail(10)
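
The pairwise merges can also be chained with `functools.reduce`, which applies the same merge step to a list of dataframes from left to right. Here is a minimal sketch with three tiny stand-in frames (made-up values):

```python
from functools import reduce

import pandas as pd

# three tiny stand-ins for the yearly frames
frames = [
    pd.DataFrame({'Day': [1, 2, 3], f'Hydrometry_{year}': [1.0, 1.1, 1.2]})
    for year in (2015, 2016, 2017)
]

# merge the frames one after another on the shared Day column
merged = reduce(lambda left, right: pd.merge(left, right, on='Day'), frames)
print(merged.columns.tolist())
```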

We now have all five years and their respective data in one dataframe **df_arno_1519**. If we take a closer look at the data, we can see that on one particular day the hydrometry reading is in fact 0.0. Although it is in theory possible that river Arno was on strike that day, it is more likely that there was a glitch in that day's measurement. Strictly speaking nothing needs to be done about it, but since all other measurements are above 1.0, any further plots would be distorted by this measurement error. Therefore I will manually change that day's measurement from zero to one. Although 1.0 is likely not the correct value either, it serves as a "neutral" value since we are after average hydrometry readings.

In [None]:
# change row value
df_arno_1519.at[269,'Hydrometry_2019']=1.0
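
Instead of a hard-coded row number, the suspicious reading could also be located by its value. The sketch below, on made-up readings, finds zero values by condition and then, as a different choice from the fixed 1.0 used above, fills them by linear interpolation between the neighbouring days.

```python
import numpy as np
import pandas as pd

# made-up readings with one suspicious zero
toy = pd.DataFrame({'Hydrometry_2019': [1.0, 1.5, 0.0, 2.5, 2.0]})

# locate zero readings by condition instead of by row number
zero_rows = toy.index[toy['Hydrometry_2019'] == 0].tolist()
print(zero_rows)

# mark zeros as missing, then interpolate linearly from neighbouring days
cleaned = toy['Hydrometry_2019'].replace(0.0, np.nan).interpolate()
print(cleaned.tolist())
```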

## 4. Hydrometry and rainfall average 2015-2019

Now we can plot our data more comfortably. We start with hydrometry readings from years 2015-2019.

In [None]:
fig = go.Figure()

 
fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Hydrometry_2015'],
                name='Hydrometry 2015',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Hydrometry_2016'],
                name='Hydrometry 2016',
                mode='lines',         
                marker_color='orange'
                ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Hydrometry_2017'],
                name='Hydrometry 2017',
                mode='lines',         
                marker_color='magenta'      
                ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Hydrometry_2018'],
                name='Hydrometry 2018',
                mode='lines',          
                marker_color='steelblue'   
                ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Hydrometry_2019'],
                name='Hydrometry 2019',
                mode='lines',          
                marker_color='forestgreen'   
                ))


fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>River Arno</b>:<br>Hydrometry (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        # nested title dict replaces the deprecated titlefont_size attribute
        title=dict(text='Hydrometry', font=dict(size=14)),
        tickfont=dict(size=14),
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=1, dtick= 10)
#fig.update_xaxes(tick0=1, dtick= 364)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_yaxes(title_text='Hydrometry average')
fig.update_xaxes(title_text='Day')

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

We can see the "rhythm of Arno" much better now when it comes to flow. Starting from mid-February (day 45 onwards), the hydrometry figures reach up to 5.0. Coming mid-April (day 100), hydrometry readings start to come down and basically stay down right up to the end of October (day 300) when the autumn rainfall starts to take effect.
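
To make the seasonal pattern easier to read, the yearly columns could be averaged per day and then smoothed. The following sketch runs on a tiny made-up stand-in for **df_arno_1519**; the 3-day window is arbitrary and only there to keep the example small.

```python
import pandas as pd

# made-up stand-in with two years and six days
toy = pd.DataFrame({
    'Day': [1, 2, 3, 4, 5, 6],
    'Hydrometry_2015': [1.0, 2.0, 3.0, 2.0, 1.0, 1.0],
    'Hydrometry_2016': [3.0, 2.0, 1.0, 2.0, 3.0, 1.0],
})

# pick up every yearly hydrometry column by its prefix
hydro_cols = [c for c in toy.columns if c.startswith('Hydrometry_')]

# average across years for each day
toy['Hydrometry_Mean'] = toy[hydro_cols].mean(axis=1)

# smooth the daily average with a centred rolling mean
toy['Hydrometry_Smooth'] = (
    toy['Hydrometry_Mean'].rolling(window=3, center=True, min_periods=1).mean()
)
print(toy[['Day', 'Hydrometry_Mean', 'Hydrometry_Smooth']])
```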

Next will will create similar visuals on rainfall average.

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Rainfall_Mean_2015'],
                name='Rainfall Average 2015',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Rainfall_Mean_2016'],
                name='Rainfall Average 2016',
                mode='lines',         
                marker_color='orange'
                ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Rainfall_Mean_2017'],
                name='Rainfall Average 2017',
                mode='lines',         
                marker_color='magenta'      
                ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Rainfall_Mean_2018'],
                name='Rainfall Average 2018',
                mode='lines',          
                marker_color='steelblue'   
                ))

fig.add_trace(go.Scatter(x=df_arno_1519['Day'],
                y=df_arno_1519['Rainfall_Mean_2019'],
                name='Rainfall Average 2019',
                mode='lines',          
                marker_color='forestgreen'   
                ))


fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>River Arno</b>:<br>Rainfall Average (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Rainfall average',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

The rainfall readings somewhat follow the hydrometry readings, with a couple of exceptions. For example, during the summer months rainfall does not have as strong an effect on hydrometry as it does in spring. Of course, during the spring period there is also a lot of snow melting in the mountain region where Arno begins its flow, so rainfall is by no means the only factor affecting the river hydrometry.

Next we pick the rainfall average readings along with Day column to a new dataframe **df_arno_fall_mean**. After that we calculate one overall rainfall average covering the annual averages from years 2015-2019.

In [None]:
# create dataframe
df_arno_fall_mean = df_arno_1519.loc[:,['Day', 'Rainfall_Mean_2015', 'Rainfall_Mean_2016', 'Rainfall_Mean_2017', 'Rainfall_Mean_2018', 'Rainfall_Mean_2019']]

# remove column
pop_rain_day = df_arno_fall_mean.pop("Day")

# calculate average
df_arno_fall_mean['Rainfall_Mean'] = df_arno_fall_mean.mean(axis=1)

# reinsert column
df_arno_fall_mean['Day'] = pop_rain_day

df_arno_fall_mean.head(10)

Let's do the same with the hydrometry readings next.

In [None]:
df_arno_hydro_mean = df_arno_1519.loc[:,['Day', 'Hydrometry_2015', 'Hydrometry_2016', 'Hydrometry_2017', 'Hydrometry_2018', 'Hydrometry_2019']]

pop_hydro_day = df_arno_hydro_mean.pop("Day")

df_arno_hydro_mean['Hydrometry_Mean'] = df_arno_hydro_mean.mean(axis=1)
df_arno_hydro_mean['Day'] = pop_hydro_day

df_arno_hydro_mean.head(10)

We are looking for overall averages only, so we extract them to two new dataframes **df_arno_rainfall** and **df_arno_hydrometry**, with the Day column as common denominator for future merging.

In [None]:
# create dataframes
df_arno_rainfall = df_arno_fall_mean.loc[:,['Day', 'Rainfall_Mean']]
df_arno_hydrometry = df_arno_hydro_mean.loc[:,['Day', 'Hydrometry_Mean']]

Finally, we merge these two dataframes into **df_arno_waterbody**.

In [None]:
# merge dataframes
df_arno_waterbody = pd.merge(df_arno_rainfall, df_arno_hydrometry, on='Day')

df_arno_waterbody.head(10)
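
The averaging steps above can also be written more compactly. What follows is a minimal sketch on a made-up miniature table, not the notebook's actual data: `df_demo` and `df_waterbody_demo` are hypothetical names, and only the column naming pattern is borrowed from **df_arno_1519**. Selecting the yearly columns by prefix makes the pop-and-reinsert of the Day column unnecessary.

```python
import pandas as pd

# Hypothetical miniature of df_arno_1519: three days, two years per feature.
df_demo = pd.DataFrame({
    "Day": [1, 2, 3],
    "Rainfall_Mean_2015": [0.0, 2.0, 4.0],
    "Rainfall_Mean_2016": [2.0, 4.0, 0.0],
    "Hydrometry_2015": [1.0, 1.2, 1.6],
    "Hydrometry_2016": [1.2, 1.4, 1.2],
})

# Pick the yearly columns by their prefix instead of listing them one by one.
rain_cols = [c for c in df_demo.columns if c.startswith("Rainfall_Mean_")]
hydro_cols = [c for c in df_demo.columns if c.startswith("Hydrometry_")]

# Row-wise means over the selected columns only, so Day never enters the average.
df_waterbody_demo = pd.DataFrame({
    "Day": df_demo["Day"],
    "Rainfall_Mean": df_demo[rain_cols].mean(axis=1),
    "Hydrometry_Mean": df_demo[hydro_cols].mean(axis=1),
})

print(df_waterbody_demo)
```

Note that `mean(axis=1)` skips missing values by default, so a day with a gap in one year is averaged over the remaining years.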

Our final plots are visual presentations of this dataframe.

In [None]:
fig = go.Figure()
 
fig.add_trace(go.Scatter(x=df_arno_waterbody['Day'],
                y=df_arno_waterbody['Hydrometry_Mean'],
                name='Hydrometry Average (2015-2019)',
                mode='lines',
                marker_color='black'
               ))


fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>River Arno</b>:<br>Hydrometry average (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Hydrometry',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_xaxes(title_text='Day')

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

In [None]:
fig = go.Figure()
 
fig.add_trace(go.Scatter(x=df_arno_waterbody['Day'],
                y=df_arno_waterbody['Hydrometry_Mean'],
                name='Hydrometry Average (2015-2019)',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_arno_waterbody['Day'],
                y=df_arno_waterbody['Rainfall_Mean'],
                name='Rainfall Average (2015-2019)',
                mode='lines',         
                marker_color='orange'
                ))

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>River Arno</b>:<br>Hydrometry and rainfall average (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Hydrometry and rainfall',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_xaxes(title_text='Day')

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

## 5. Conclusion

As noted in the beginning, the units for rainfall (millimetres/day) and hydrometry (cubic meters/second) are not directly comparable, so the plot above, which shows both, is not a scaled presentation of these two features. However, as hydrometry reacts to rainfall, the way the two features move together can be clearly observed.

Our research question was all about analyzing how long it takes for rainfall to have an effect on hydrometry where river Arno is concerned.
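
One way to put two features with different units on a comparable footing is to rescale each of them to the range 0 to 1 before plotting. This is only a sketch of that idea and was not used for the plot above; `min_max_scale` and the four-row `demo` table are hypothetical.

```python
import pandas as pd


def min_max_scale(series):
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    span = series.max() - series.min()
    if span == 0:
        # A constant series has no range to rescale; return zeros.
        return series * 0.0
    return (series - series.min()) / span


# Made-up values, only to show the mechanics.
demo = pd.DataFrame({
    "Day": [1, 2, 3, 4],
    "Rainfall_Mean": [0.0, 5.0, 10.0, 2.5],
    "Hydrometry_Mean": [1.0, 1.5, 3.0, 2.0],
})

scaled = demo.assign(
    Rainfall_Scaled=min_max_scale(demo["Rainfall_Mean"]),
    Hydrometry_Scaled=min_max_scale(demo["Hydrometry_Mean"]),
)

print(scaled)
```

The rescaled columns lose their physical units, so they are suitable for comparing the shape and timing of the curves, not their magnitudes.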

**Based on the 2015-2019 data, during peak rainfall it takes about 2-3 calendar days for rainfall to have an effect on river Arno hydrometry. During warm summer months however, this effect is radically diminished.**

**After summer, when the rainfall increases starting from September (day 250 onwards), it starts to have an effect on hydrometry only in late October (day 300 onwards) because of the long dry summer spell.** 

As climate change evolves, it remains to be seen what kind of effect it has on the Arno water level. One feature outside this analysis is the snow situation in the area. As ice is known to gradually recede from mountainous areas in Europe, increased humidity may well bring snow on occasions previously considered unlikely.
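
The delay estimates above were read from the plots. A lagged correlation is one way such a delay could be checked numerically: shift the rainfall series forward by k days and see for which k it matches the hydrometry best. The sketch below was not part of the analysis and runs on synthetic data with a built-in delay of three days; `lagged_correlations` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def lagged_correlations(rain, level, max_lag=14):
    """Pearson correlation between rain shifted forward by k days and level, k = 0..max_lag."""
    return pd.Series({k: rain.shift(k).corr(level) for k in range(max_lag + 1)})


# Synthetic year: the level responds to rain that fell three days earlier.
rng = np.random.default_rng(0)
rain = pd.Series(rng.gamma(shape=0.5, scale=4.0, size=365))
level = 1.0 + 0.1 * rain.shift(3).fillna(0.0) + rng.normal(0.0, 0.05, size=365)

corr_by_lag = lagged_correlations(rain, level)
best_lag = int(corr_by_lag.idxmax())

print(corr_by_lag.round(2))
print("lag with the highest correlation:", best_lag)
```

On the real data the picture would be less clean, since the response of a river to rainfall is spread over several days and depends on the season, as the summer months above show.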

Finally, as mentioned in the beginning, water consumption and demand will partly dictate the level of water reservoirs especially in dry summer months, when average rainfall is lower and there is no melting snow effect. In the case of river Arno, this summer period lasts about 150 days (from day 100 to day 250). Thus, from the viewpoint of water consumption, the major question is whether the more rainy winter months are still able to "refill" river Arno in a way that does not lead to "spillover" (i.e. destructive flooding).

Paradoxically, too much clean water in the form of rainfall causes the sewage system to fail. An overwhelming flow will thus further contaminate household water if river Arno breaks its levees and penetrates into residential areas. As we can see from the hydrometry averages, this is most likely to happen either in November or in February.

Everything we know about climate change would suggest that all this will in the future be ever more challenging a task. As mentioned in the beginning, the riverbanks of Arno don't effectively store water. Yet the biggest problem may still not be the Arno region running out of water in summer.

It may be all about water running over Arno region in winter.

## 6. Addendum: Lake Bilancino

The rather sparse English Wikipedia article on Lake Bilancino (Lago di Bilancino) describes the lake as follows (link to the full article below; however, the paragraph below *is* the full article):

https://en.wikipedia.org/wiki/Lago_di_Bilancino

"***Lago di Bilancino is an artificial lake near Barberino di Mugello in the Metropolitan City of Florence, Tuscany, Italy, made with a dam on the river Sieve. At an elevation of 252 m, the lake surface area is approximately 5 km².***"

Moreover, the supplemental brief from the dataset states the following:

"***Bilancino lake is an artificial lake located in the municipality of Barberino di Mugello (about 50 km from Florence). It is used to refill the Arno river during the summer months. Indeed, during the winter months, the lake is filled up and then, during the summer months, the water of the lake is poured into the Arno river.***"

However small these glimpses into Lake Bilancino might be, they provide us with important facts. First, **lake Bilancino gets most of its water from river Sieve, which is not part of the original dataset**. Secondly, **Bilancino is the water reservoir for river Arno during summer months**. This means the "rhythm of Bilancino" is very much intertwined with whatever is going on in Arno.

In the analysis on river Arno and rainfall, it was discovered that Arno has a dry spell ranging from May to August. Also, Arno has two "flood peaks" when the water level threatens to overwhelm riverbanks. Of course, lake Bilancino can fill Arno but it does not work the other way. If autumn rainfall starts to fill Bilancino too much and Arno is at peak flow at the same time, there is nowhere Bilancino can be emptied into.

Based on the Arno analysis, two research questions should be asked concerning lake Bilancino. First, **does lake Bilancino "dry up" during summer?** Secondly, **is Bilancino water level "too high" during winter months, meaning the reservoir itself would be in danger of breaking its barriers?**

So let's get to work.

In [None]:
# set column and row display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# load dataset
df = pd.read_csv('../input/acea-water-prediction/Lake_Bilancino.csv') 

# make a working copy of the original dataframe
df_bilancino = df.copy()

# change column to DateTime format
df_bilancino['Date'] =  pd.to_datetime(df_bilancino['Date'],dayfirst=True)

# mask dataframe
start_date = '2015-01-01'
end_date = '2020-06-30'

# apply the mask
mask = (df_bilancino['Date'] >= start_date) & (df_bilancino['Date'] < end_date)
df_bilancino = df_bilancino.loc[mask]

# shorten the column names in a single rename call
df_bilancino = df_bilancino.rename(columns={
    'Rainfall_Le_Croci': 'Le_Croci',
    'Rainfall_Cavallina': 'Cavallina',
    'Rainfall_S_Agata': 'S_Agata',
    'Rainfall_Mangona': 'Mangona',
    'Rainfall_S_Piero': 'S_Piero',
    'Hydrometry_Nave_di_Rosano': 'Hydrometry',
})

df_bilancino.head(10)

The rainfall readings are identical to the river Arno data, so they can be applied as such to lake Bilancino. In addition, there are readings on lake water level (presumably in metre.centimetre format) and flow rate. Flow rate can of course mean two different things: either water flowing *into* the lake or water flowing *out* of the lake.

As Bilancino is a water reservoir with its own man-controlled flow out of the lake into Arno, it is further assumed that the flow rate in the dataset describes this water flow from Bilancino into Arno. Also, the water flow unit remains unclear in the original dataset, so in this analysis it is assumed that this unit is the same as in Arno's hydrometry, i.e. cubic metres per second.

We already have average rainfall and hydrometry on Arno from the 2015-2019 period. Next we will do the same with lake Bilancino and its water level as well as flow rate. As for temperature, I find it a more or less unreliable factor in analyzing lake Bilancino, since the exact same temperature has quite a different effect depending on air humidity, hours of unimpeded sunlight etc. Temperature readings are therefore excluded from this analysis.

First, let's check the five-year period of Bilancino data.

In [None]:
fig = go.Figure()

 
fig.add_trace(go.Scatter(x=df_bilancino['Date'],
                y=df_bilancino['Lake_Level'],
                name='Lake Level',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_bilancino['Date'],
                y=df_bilancino['Flow_Rate'],
                name='Flow Rate',
                mode='lines',         
                line=dict(color='red', width=2,
                               dash='dot')    
                ))

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>Lake Bilancino</b>:<br>Lake level and flow rate (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Lake level and flow rate',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.7,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

We can see that the water level is artificially kept relatively steady. The flow rate seems to increase every year between November and February, which is the flood season in Arno. 

Because there is certainly no water shortage in Arno during winter, the rational explanation would be that the lake Bilancino water flow needs to be increased at that time regardless of Arno. After all, there is an increase in the Bilancino water level every autumn from 246-247 to 250 and even above. In numbers this does not sound like much, but if we think of it as the water level of a lake in metres, these numbers suddenly get a whole new level of importance. It therefore seems that the "pattern of crisis" concerning Arno and Bilancino occurs when the levels in both waterbodies rise simultaneously during the rainy season above a threshold level considered safe.

For example, in November 2019 river Arno rose to its highest levels in 20 years (link to article below):

https://www.wantedinrome.com/news/florence-on-alert-as-river-arno-rises-to-highest-levels-in-20-years.html

According to the article, on November 17th Arno had a water level of 4.8 metres as a result of 62 millimetres of rainfall in 24 hours. In November 2019, the Bilancino water level rose rapidly from 247 to 250 metres as a result of the same rainy spell.

According to Wikipedia, lake Bilancino has a surface elevation of 252 metres. This figure most likely also serves as the final threshold below which the water level in the lake must stay in order to be controlled. Given the wave effect caused by wind, the actual safety level is more likely closer to 250 metres.

**If all this is true, lake Bilancino is not about to dry up every year during summer but is rather about to overflow on an annual basis during winter.** 
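
This hypothesis could be checked by counting, for each year, the days on which the lake level is at or above the assumed safety level. A minimal sketch follows. The 250-metre threshold is only the estimate from the text above, not an official figure, and `days_at_or_above` as well as the five demo rows are hypothetical.

```python
import pandas as pd

SAFETY_LEVEL_M = 250.0  # assumption taken from the text above, not an official figure


def days_at_or_above(df, level_col="Lake_Level", date_col="Date", threshold=SAFETY_LEVEL_M):
    """Count, per calendar year, the days on which the level is at or above the threshold."""
    over = df[level_col] >= threshold
    return over.groupby(df[date_col].dt.year).sum()


# Made-up readings, only to show the mechanics.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2018-11-30", "2018-12-01", "2019-01-15",
                            "2019-11-20", "2019-11-21"]),
    "Lake_Level": [249.5, 250.2, 248.0, 250.0, 251.1],
})

counts = days_at_or_above(demo)
print(counts)
```

Applied to **df_bilancino**, the same call would show whether high levels really occur every winter or only in individual years.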

Next we create a new dataframe **df_bi_copy** with appropriate columns:

In [None]:
# select preferred columns by name
df_bi_copy = df_bilancino.loc[:,['Date', 'Lake_Level', 'Flow_Rate']]

Now we can create the annual dataframes in one go.

In [None]:
start_date_2019 = '2019-01-01'
end_date_2019 = '2019-12-31'
mask_2019 = (df_bi_copy['Date'] >= start_date_2019) & (df_bi_copy['Date'] < end_date_2019)
df_bilancino_2019 = df_bi_copy.loc[mask_2019]
pop_date_2019 = df_bilancino_2019.pop("Date")
df_bilancino_2019.rename(columns = {'Lake_Level':'Lake_Level_2019'}, inplace = True) 
df_bilancino_2019.rename(columns = {'Flow_Rate':'Flow_Rate_2019'}, inplace = True) 
df_bilancino_2019.reset_index(inplace = True) 
col = ['index']
df_bilancino_2019 = df_bilancino_2019.drop(col, axis=1)


start_date_2018 = '2018-01-01'
end_date_2018 = '2018-12-31'
mask_2018 = (df_bi_copy['Date'] >= start_date_2018) & (df_bi_copy['Date'] < end_date_2018)
df_bilancino_2018 = df_bi_copy.loc[mask_2018]
pop_date_2018 = df_bilancino_2018.pop("Date")
df_bilancino_2018.rename(columns = {'Lake_Level':'Lake_Level_2018'}, inplace = True) 
df_bilancino_2018.rename(columns = {'Flow_Rate':'Flow_Rate_2018'}, inplace = True) 
df_bilancino_2018.reset_index(inplace = True) 
col = ['index']
df_bilancino_2018 = df_bilancino_2018.drop(col, axis=1)

start_date_2017 = '2017-01-01'
end_date_2017 = '2017-12-31'
mask_2017 = (df_bi_copy['Date'] >= start_date_2017) & (df_bi_copy['Date'] < end_date_2017)
df_bilancino_2017 = df_bi_copy.loc[mask_2017]
pop_date_2017 = df_bilancino_2017.pop("Date")
df_bilancino_2017.rename(columns = {'Lake_Level':'Lake_Level_2017'}, inplace = True) 
df_bilancino_2017.rename(columns = {'Flow_Rate':'Flow_Rate_2017'}, inplace = True) 
df_bilancino_2017.reset_index(inplace = True) 
col = ['index']
df_bilancino_2017 = df_bilancino_2017.drop(col, axis=1)

start_date_2016 = '2016-01-01'
end_date_2016 = '2016-12-31'
mask_2016 = (df_bi_copy['Date'] >= start_date_2016) & (df_bi_copy['Date'] < end_date_2016)
df_bilancino_2016 = df_bi_copy.loc[mask_2016]
pop_date_2016 = df_bilancino_2016.pop("Date")
df_bilancino_2016.rename(columns = {'Lake_Level':'Lake_Level_2016'}, inplace = True) 
df_bilancino_2016.rename(columns = {'Flow_Rate':'Flow_Rate_2016'}, inplace = True) 
df_bilancino_2016.reset_index(inplace = True) 
col = ['index']
df_bilancino_2016 = df_bilancino_2016.drop(col, axis=1)

start_date_2015 = '2015-01-01'
end_date_2015 = '2015-12-31'
mask_2015 = (df_bi_copy['Date'] >= start_date_2015) & (df_bi_copy['Date'] < end_date_2015)
df_bilancino_2015 = df_bi_copy.loc[mask_2015]
pop_date_2015 = df_bilancino_2015.pop("Date")
df_bilancino_2015.rename(columns = {'Lake_Level':'Lake_Level_2015'}, inplace = True) 
df_bilancino_2015.rename(columns = {'Flow_Rate':'Flow_Rate_2015'}, inplace = True) 
df_bilancino_2015.reset_index(inplace = True) 
col = ['index']
df_bilancino_2015 = df_bilancino_2015.drop(col, axis=1)

# create variable
one_to_365 = pd.Series(range(1,366))

# set new column
df_bilancino_2015['Day'] = one_to_365
df_bilancino_2016['Day'] = one_to_365
df_bilancino_2017['Day'] = one_to_365
df_bilancino_2018['Day'] = one_to_365
df_bilancino_2019['Day'] = one_to_365

# merge dataframes
df_bil_one = pd.merge(df_bilancino_2015, df_bilancino_2016, on='Day')
df_bil_two = pd.merge(df_bilancino_2017, df_bilancino_2018, on='Day')
df_bil_three = pd.merge(df_bil_one, df_bil_two, on='Day')
df_bilancino_1519 = pd.merge(df_bil_three, df_bilancino_2019, on='Day')

df_bilancino_1519.tail(10)
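
The same day-by-year layout can also be produced without repeating the block for each year. The following is a sketch of an alternative, not the method used above: it derives the year and the day of year from the Date column and pivots the table. The function name `to_day_by_year` and the synthetic lake levels are assumptions. One difference to keep in mind is that this version keeps December 31st and day 366 of a leap year, whereas the masks above compare with `<` against December 31st and therefore leave that date out.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for 2015-2019, only to show the mechanics.
dates = pd.date_range("2015-01-01", "2019-12-31", freq="D")
demo = pd.DataFrame({
    "Date": dates,
    "Lake_Level": 248.0 + np.sin(np.arange(len(dates)) / 58.0),
})


def to_day_by_year(df, value_col, date_col="Date"):
    """Return a table with one row per day of year and one column per year."""
    tmp = df.assign(Year=df[date_col].dt.year, Day=df[date_col].dt.dayofyear)
    wide = tmp.pivot(index="Day", columns="Year", values=value_col)
    # Column labels become e.g. Lake_Level_2015, matching the naming used above.
    return wide.add_prefix(f"{value_col}_")


lake_by_year = to_day_by_year(demo, "Lake_Level")

# Five-year average per day of year; missing values (day 366) are skipped.
lake_level_mean = lake_by_year.mean(axis=1)

print(lake_by_year.tail(3))
```

Day 366 exists only for 2016, so the other year columns hold a missing value on that row.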

In [None]:
fig = go.Figure()

 
fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Lake_Level_2015'],
                name='Lake level 2015',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Lake_Level_2016'],
                name='Lake level 2016',
                mode='lines',         
                marker_color='orange'
                ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Lake_Level_2017'],
                name='Lake level 2017',
                mode='lines',         
                marker_color='magenta'      
                ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Lake_Level_2018'],
                name='Lake level 2018',
                mode='lines',          
                marker_color='steelblue'   
                ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Lake_Level_2019'],
                name='Lake level 2019',
                mode='lines',          
                marker_color='forestgreen'   
                ))


fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>Lake Bilancino</b>:<br>Lake level (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Lake level',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.2,
        y=0.4,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)
#fig.update_xaxes(tick0=1, dtick= 364)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_yaxes(title_text='Lake level')
fig.update_xaxes(title_text='Day')

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

The figure above seems to support the hypothesis that lake Bilancino has no shortage of water. Rather, the water level is most of the time close to the maximum level defined by the artificially created surrounding surface level (252 metres).

In [None]:
fig = go.Figure()

 
fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Flow_Rate_2015'],
                name='Flow rate 2015',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Flow_Rate_2016'],
                name='Flow rate 2016',
                mode='lines',         
                marker_color='orange'
                ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Flow_Rate_2017'],
                name='Flow rate 2017',
                mode='lines',         
                marker_color='magenta'      
                ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Flow_Rate_2018'],
                name='Flow rate 2018',
                mode='lines',          
                marker_color='steelblue'   
                ))

fig.add_trace(go.Scatter(x=df_bilancino_1519['Day'],
                y=df_bilancino_1519['Flow_Rate_2019'],
                name='Flow rate 2019',
                mode='lines',          
                marker_color='forestgreen'   
                ))


fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>Lake Bilancino</b>:<br>Flow rate (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Flow rate',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.6,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)
#fig.update_xaxes(tick0=1, dtick= 364)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_yaxes(title_text='Flow rate')
fig.update_xaxes(title_text='Day')

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

As noted earlier, it is assumed that the flow rate in the dataset is consistent with the hydrometry measurements, i.e. both are given in cubic metres per second. If one takes a look at the lake Bilancino flow rates, there are significant peaks during the winter season reaching up to 31.0. 

This is not about lake Bilancino acting as a water reservoir, since river Arno has no water shortage during the winter period. Rather, it seems that the figures describe lake Bilancino in crisis mode, a waterbody on the brink of breaking its levees.

Next we calculate the five-year averages of both water flow and lake level.

In [None]:
df_bilancino_level_mean = df_bilancino_1519.loc[:,['Day', 'Lake_Level_2015', 'Lake_Level_2016', 'Lake_Level_2017', 'Lake_Level_2018', 'Lake_Level_2019']]
pop_level_day = df_bilancino_level_mean.pop("Day")
df_bilancino_level_mean['Lake_Level_Mean'] = df_bilancino_level_mean.mean(axis=1)
df_bilancino_level_mean['Day'] = pop_level_day

df_bilancino_flow_mean = df_bilancino_1519.loc[:,['Day', 'Flow_Rate_2015', 'Flow_Rate_2016', 'Flow_Rate_2017', 'Flow_Rate_2018', 'Flow_Rate_2019']]
pop_flow_day = df_bilancino_flow_mean.pop("Day")
df_bilancino_flow_mean['Flow_Rate_Mean'] = df_bilancino_flow_mean.mean(axis=1)
df_bilancino_flow_mean['Day'] = pop_flow_day

In [None]:
# create dataframes
df_bilancino_level = df_bilancino_level_mean.loc[:,['Day', 'Lake_Level_Mean']]
df_bilancino_flow = df_bilancino_flow_mean.loc[:,['Day', 'Flow_Rate_Mean']]

In [None]:
# merge dataframes
df_bilancino_waterbody = pd.merge(df_bilancino_level, df_bilancino_flow, on='Day')
df_bilancino_waterbody.head(10)

In [None]:
# merge dataframes
df_arno_and_bilancino = pd.merge(df_arno_waterbody, df_bilancino_waterbody, on='Day')
df_arno_and_bilancino.head(10)

In [None]:
fig = go.Figure()
 
fig.add_trace(go.Scatter(x=df_arno_and_bilancino['Day'],
                y=df_arno_and_bilancino['Hydrometry_Mean'],
                name='Arno Hydrometry Average (2015-2019)',
                mode='lines',
                marker_color='black'
               ))

fig.add_trace(go.Scatter(x=df_arno_and_bilancino['Day'],
                y=df_arno_and_bilancino['Rainfall_Mean'],
                name='Arno Rainfall Average (2015-2019)',
                mode='lines',         
                marker_color='orange'
                ))

fig.add_trace(go.Scatter(x=df_arno_and_bilancino['Day'],
                y=df_arno_and_bilancino['Flow_Rate_Mean'],
                name='Bilancino Flow Rate Average (2015-2019)',
                mode='lines',         
                marker_color='red'
                ))

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>River Arno and Lake Bilancino</b>:<br>Hydrometry, rainfall average, water flow average (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Hydrometry, rainfall, water flow',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.17,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_xaxes(title_text='Day')

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

The flow rate of lake Bilancino seems to have two distinct peaks each year. Between February and March lake Bilancino has to release its water pressure into Arno, and by the end of October the autumn rainfall has filled the lake up again.

Finally, let's take a look at the lake level average.

In [None]:
fig = go.Figure()
 
fig.add_trace(go.Scatter(x=df_arno_and_bilancino['Day'],
                y=df_arno_and_bilancino['Lake_Level_Mean'],
                name='Lake Level Average (2015-2019)',
                mode='lines',
                marker_color='black'
               ))

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))


annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.75, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Source: Kaggle.com<br> Acea Smart Water Analytics competition', 
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    title='<b>Lake Bilancino</b>:<br>Lake level average (2015-2019)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        title='Lake level',
        titlefont_size=14,
        tickfont_size=14,
    ),
    
     title_font=dict(
        size=12,     
    ),
    legend=dict(
        x=0.3,
        y=0.9,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)
fig.update_yaxes(tick0=0, dtick= 50)

fig.update_layout(
   font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_xaxes(title_text='Day')

fig.update_yaxes(title_font=dict(size=14))

fig.show()        

The most important factor in lake level is, in my opinion, the fact that the level of lake Bilancino is man-controlled. Therefore I found it odd that this feature was to be predicted by algorithmic modeling. As we can see from the average figures, winter has no problem filling Bilancino to its maximum level of 252 metres.

Rather, there seems to be an ongoing struggle to let water out of Bilancino into Arno. Of course Arno has its limits too, and when rainfall exceeds a certain threshold, both Bilancino and Arno may meet their respective limits simultaneously, leading to uncontrolled flooding.

This is why it seems - based on what we just learned - that **the real issue with Bilancino and Arno is that the combination of the two is insufficient in capacity to handle the increasing rainfall likely created by accelerating climate change in the near future.**




## 7. The Xboost Expedition

As noted in the beginning, I possess no skills to make accurate predictive models. The content above is something I can to some extent vouch for at least when it comes to how and why different measures were taken.

Beyond this sentence, this is no longer the case. To be honest, I don't know what the ---- I'm about to get into. Hence the title referring to an expedition. I may be stuck on thick data ice in the middle of nowhere pretty soon, seriously thinking if canine tastes good with lamp oil... Horrible, horrible...

*What? How did Roald Amundsen get into this? Well, he did disappear in the Arctic, flying way too high...*

In a sentence, I am about to try the simplest of modeling for the first time ever, without any of the necessary skills or knowledge. The first obstacle I encountered had to do with train, test and validation data. Apparently everything I thought I knew about it was wrong...

For a long time, I was under the impression that the following is true:

*One uses train-test split function to divide data into train and test sets.*

Enter Wikipedia article on the matter (link to full article below):

https://en.wikipedia.org/wiki/Training%2C_validation%2C_and_test_sets

"*The model is initially fit on a training dataset, which is a set of examples used to fit the parameters (e.g. weights of connections between neurons in artificial neural networks) of the model. The model (e.g. a neural net or a naive Bayes classifier) is trained on the training dataset using a supervised learning method, for example using optimization methods such as gradient descent or stochastic gradient descent on.*"

"*Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset. The validation dataset provides an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters (e.g. the number of hidden units (layers and layer widths) in a neural network.*"

"*Finally, the test dataset is a dataset used to provide an unbiased evaluation of a final model fit on the training dataset. If the data in the test dataset has never been used in training (for example in cross-validation), the test dataset is also called a holdout dataset.*"

"*The term "validation set" is sometimes used instead of "test set" in some literature (e.g., if the original dataset was partitioned into only two subsets, the test set might be referred to as the validation set).*"

It seems that I am not the only one who has trouble knowing what they are doing regarding train, test and validation datasets. ***How come there is massive confusion when there are exactly three different variables available to create that confusion in the first place?***

Because the function is called train-test split, it would in my opinion be correct to call the datasets it provides "training set" and "testing set", and the third and separate dataset - used for validating the model afterwards - would be called "validation set". To prove the usefulness of this, next time you visit a car dealership, ask for a validation drive.

But then again, I am just an inexperienced explorer about to be seriously stuck in data ice any moment now. It's somehow comforting to know, though, that others seem to have trouble with these extreme conditions too, at least when it comes to defining concepts...

One has to start the expedition somewhere, so I will next import the module.

In [None]:
# import module
from sklearn.model_selection import train_test_split

Next I will load the dataset and define predictors as well as the target. **My aim is to see whether the average rainfall readings combined with lake Bilancino water level and flow rate (predictors) can actually predict the river Arno hydrometry (target).** 

As a layman's guess, there should be no trouble whatsoever since the two waterbodies and their respective average values are indeed interconnected, but we'll see...

In [None]:
# load dataset
data = df_arno_and_bilancino

# select predictors
predictors = ['Rainfall_Mean', 'Lake_Level_Mean', 'Flow_Rate_Mean']
X = data[predictors]

# select target
y = data.Hydrometry_Mean

With a conceptually heavy heart, **I will next call the train-test split function twice: first to set aside the dataset X_test to be applied later, and then to divide the remaining data into X_train and X_valid**. 70 percent of the overall data goes to the training set, 20 percent goes to validation, and 10 percent is left for testing.
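The arithmetic behind a 70/20/10 split done in two steps can be checked with a small plain-Python sketch. It assumes 365 rows (one per day) and uses simple rounding, so the exact row counts produced by scikit-learn may differ by a row; the index-shuffling here is only a stand-in for what a split function does.

```python
import random

ratio_train, ratio_val, ratio_test = 0.7, 0.2, 0.1

# After the test rows are removed, 90 % of the data remains, so the
# validation share has to be rescaled relative to that remainder.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining   # 0.2 / 0.9, roughly 0.222

n_rows = 365                          # assumed: one row per day of the year
indices = list(range(n_rows))
random.Random(0).shuffle(indices)     # fixed seed only for repeatability

n_test = round(n_rows * ratio_test)
n_valid = round((n_rows - n_test) * ratio_val_adjusted)

test_idx = indices[:n_test]
valid_idx = indices[n_test:n_test + n_valid]
train_idx = indices[n_test + n_valid:]

print(len(train_idx), len(valid_idx), len(test_idx))
```

The point of the adjusted ratio is that 22.2 percent of the remaining 90 percent equals 20 percent of the whole.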

In [None]:
# thank you to Jorge Barrios on Data Science Stack Exchange for this solution

# link to thread:

#https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn#15136

# Define ratios
ratio_train = 0.7
ratio_val = 0.2
ratio_test = 0.1

# Produce test split
X_remaining, X_test, y_remaining, y_test = train_test_split(
    X, y, test_size=ratio_test)

# Adjust val ratio, w.r.t. remaining dataset
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produce train and val splits
X_train, X_valid, y_train, y_valid = train_test_split(
    X_remaining, y_remaining, test_size=ratio_val_adjusted)

In [None]:
# check dataset shape 
X_train.shape,X_valid.shape,X_test.shape,y_train.shape,y_valid.shape,y_test.shape

Next, after importing the respective XGBoost (eXtreme Gradient Boosting) module, the very basic parameters for it will be chosen. All choices are based on the article written by **Aarshay Jain** (link to full article below). I also noted that the article author gives credit to **Sudalai Rajkumar**, a Kaggle member whose excellent and carefully maintained COVID-19 datasets I have previously taken advantage of. 

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

As this is the simplest of models, I will only select three parameters: `n_estimators` (the number of estimators), `max_depth` and `learning_rate`. By doing so, I will have no warm thoughts about people who think the {} characters were a really great invention (no keyboard manufacturers seem to share this view).
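To get a feeling for what these three parameters do, here is a toy sketch of boosting in plain Python. It is not how XGBoost is implemented (XGBoost adds regularisation, second-order gradients and much more); it only assumes the basic idea that each new tree is fitted to the errors left by the previous ones. The data and the helper names `fit_stump`, `boost` and `train_mse` are made up for illustration, and each "tree" is a stump, i.e. a tree with a depth of 1.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_estimators, learning_rate):
    """Return training predictions after n_estimators boosting rounds."""
    base = sum(ys) / len(ys)              # start from the mean of the target
    preds = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # each stump only contributes a fraction (learning_rate) of its fit
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return preds


def train_mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


toy_x = [1, 2, 3, 4, 5, 6, 7, 8]                    # hypothetical feature
toy_y = [1.0, 1.2, 1.1, 1.9, 2.1, 2.0, 2.8, 2.7]    # hypothetical target

for n in (1, 10, 100):
    print(n, round(train_mse(toy_y, boost(toy_x, toy_y, n, 0.1)), 4))
```

More rounds (`n_estimators`) keep lowering the training error, and a smaller `learning_rate` makes each round contribute less, which is why the two are usually tuned together.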

In [None]:
# import module
from xgboost import XGBRegressor

# select and define parameters 
parameters = {'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1}

# create regressor model with parameters
arno_model = XGBRegressor(**parameters)

# fit model with training data
arno_model.fit(X_train, y_train)

The next phase is to try to get some grip on how the simplest of models behaves by evaluating it. Some cut-paste results (links to full Wikipedia articles below):

https://en.wikipedia.org/wiki/Mean_squared_error

"*In statistics, the mean squared error (MSE) ...measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss... The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.*"

"*RMSE (Root Mean Squared Error) is the error rate by the square root of MSE.*"

Let's check out these two metrics.
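As a worked example of the definitions quoted above, MSE is the mean of the squared differences between observed and predicted values, and RMSE is its square root. The three value pairs below are made up for illustration only.

```python
import math

y_true = [1.0, 2.0, 3.0]      # hypothetical observed values
y_hat = [1.5, 2.0, 2.0]       # hypothetical predictions

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_hat)]
mse_example = sum(squared_errors) / len(squared_errors)   # (0.25 + 0 + 1) / 3
rmse_example = math.sqrt(mse_example)

print(f"MSE: {mse_example:.4f}")
print(f"RMSE: {rmse_example:.4f}")
```

Because of the square root, RMSE is expressed in the same units as the target, which makes it easier to relate to the actual readings than MSE.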

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = arno_model.predict(X_valid)
mse = mean_squared_error(y_valid, y_pred)

print("MSE: %.2f" % mse)
print("RMSE: %.2f" % (mse**(1/2.0)))

As I am on an expedition to the unknown, I will reload my pipe and check what just happened while printing out the model training score.

In [None]:
score = arno_model.score(X_train, y_train)   

print("Training score: ", score) 

*What?* That's not a proper score, more like an election result in North Korea... As I try to get my sled away from the thaw, I at the same time quietly explore ways of evaluating the figure above. Again, with layman's logic I would think that the dataset of average readings on Arno and Bilancino is so "easy" that the algorithm fits too well and therefore "sinks" into the data thaw, thus more or less mimicking it in a way that is often referred to as "overfitting". ***If the model technically becomes one with the training data, applying it to another, unseen dataset (the infamous "test dataset") might prove to be a disastrous move... like training for an Arctic expedition in a basement fridge...***

I will next try the mean cross-validation score. By doing so, I will hold no warm thoughts about the internet, because there is no simple online article telling what sort of score should be considered "good". I mean, if mean cross-validation renders unequivocal results, there should be a unified scale for interpreting the result instead of huge opinion disputes in the Stack Overflow comment section. Otherwise no two models and their validation scores can be compared with each other in any way... 

Most of the tutorials end up exactly the same: "well, here is the validation score result". *Ok fine, but is it a good or bad result? If the teacher cannot answer that, how are the students expected to do so?* Seriously, I start to think that I am not the only one lost in the data thaw with one's own thoughts... There is no way the tutorial writers all forgot to explain the very same thing. What if they don't really understand it either? *The validation score is dependent on what is predicted? Good, so what is the preferred percentage of deviation?* Without a common idea of a preferred score, the outcome is necessarily prone to subjective evaluation and therefore becomes more of a philosophical than a statistical object.
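For regressors that follow the scikit-learn interface, `.score()` returns the R² coefficient of determination, where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. The plain-Python sketch below uses made-up numbers and a deliberately silly hypothetical "model" that simply memorises its training data, to show how a perfect training score can coexist with a poor score on unseen data.

```python
def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical training data (x -> y) and unseen data of the same kind
train = {1: 1.0, 2: 1.4, 3: 1.1, 4: 2.0, 5: 1.7}
unseen = {1.4: 1.3, 2.6: 1.2, 3.4: 1.6, 4.6: 1.8}


def memoriser(x):
    """Predict the y of the nearest training x, i.e. pure memorisation."""
    nearest = min(train, key=lambda known: abs(known - x))
    return train[nearest]


r2_train = r2(list(train.values()), [memoriser(x) for x in train])
r2_unseen = r2(list(unseen.values()), [memoriser(x) for x in unseen])

print("training R2:", r2_train)
print("unseen R2:", round(r2_unseen, 3))
```

A large gap between the training score and the score on held-out data is the usual symptom of overfitting, so the training score alone says little.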

Suddenly I understand all those forum posts much better, they are applying rhetoric to supplement data... But after all, if there is something to be expressed in the first place, it can be expressed in a clear manner. **Ludwig Wittgenstein** said that, or maybe **Dua Lipa**...

Some articles available on validation I found are below as links:

https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85

https://www.dummies.com/programming/big-data/data-science/data-science-cross-validating-in-python/

https://vitalflux.com/k-fold-cross-validation-python-example/
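The mechanics of k-fold cross-validation can be sketched in plain Python. This is a simplified illustration without shuffling, and `k_fold_indices` is a hypothetical helper rather than a library function: the rows are cut into k folds, and each fold takes one turn as validation data while the model is trained on the rest.

```python
def k_fold_indices(n_rows, k):
    """Yield (training_indices, validation_indices) for k consecutive folds."""
    # spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = [i for i in range(n_rows) if i < start or i >= start + size]
        yield training, validation
        start += size


folds = list(k_fold_indices(n_rows=10, k=5))
for training, validation in folds:
    print("validate on", validation, "train on", len(training), "rows")
```

Each row is used for validation exactly once, and the k resulting scores are then averaged into the mean cross-validation score.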

In [None]:
from sklearn.model_selection import cross_val_score, KFold

# cross validation 
scores = cross_val_score(arno_model,X_train, y_train, cv=5)

# print the mean score over the five folds (R² for a regressor, not accuracy)
print("Mean cross-validation score: %.2f" % scores.mean())

In [None]:
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(arno_model, X_train, y_train, cv=kfold)

print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), XGBRegressor(n_estimators=100, max_depth=4))

scores = cross_val_score(pipeline, X=X_train, y=y_train, cv=10, n_jobs=1)
 
print('Cross-validation R² scores: %s' % scores)
print('Cross-validation R²: %.3f +/- %.3f' % (np.mean(scores),np.std(scores)))

To me the results above are somewhat in line with each other, but as I mentioned, I am in the data Arctic equipped with a t-shirt and slippers and thus don't know what the ---- I'm doing here except for rapidly turning into an ice statue... 

There was also a piece of code online (I didn't preserve the exact source, unfortunately) for plotting the predictions against the original values, so I will throw that in the mix.

In [None]:
x_ax = range(len(y_valid))
plt.scatter(x_ax, y_valid, s=5, color="blue", label="original")
plt.plot(x_ax, y_pred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

In [None]:
# check dataset shape 
y_valid.shape,y_pred.shape

To the best of my knowledge, the cross-validation score and its standard deviation describe how well the model predicts the values in the average hydrometry column (target). For a regressor, the default scikit-learn score is R², a unit-free number, whereas RMSE is expressed in the same units as the target and can therefore be compared with the target's range and mean, which are printed below. I have my own idea on whether the score (which obviously changes slightly every time the notebook is run) is good or bad, but I can only repeat my wish of some sort of uniform scale being available.
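One common way to put an error figure on some kind of scale is to divide RMSE by the range of the target. This does not give a universal threshold for "good", but it does make the error relative to the quantity being predicted. The helper name `normalised_rmse` and the numbers below are hypothetical and are not results from this notebook.

```python
def normalised_rmse(rmse, target_min, target_max):
    """RMSE divided by the target's range; 0 is perfect, larger is worse."""
    return rmse / (target_max - target_min)


# Hypothetical numbers for illustration only
example = normalised_rmse(rmse=0.10, target_min=1.0, target_max=3.0)
print(f"{example:.1%} of the target range")
```

An error of a few percent of the range reads very differently from one that covers half of it, even if the raw RMSE values look equally small.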

In [None]:
df_arno_and_bilancino['Hydrometry_Mean'].describe()

Next I will dig up the humble test dataset created earlier and apply the simplest of models to it.

In [None]:
y_pred_two = arno_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_two)

print("MSE: %.2f" % mse)
print("RMSE: %.2f" % (mse**(1/2.0)))

In [None]:
x_ax = range(len(y_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_pred_two, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

In [None]:
# check dataset shape 
y_test.shape,y_pred_two.shape

*So what would happen if the simplest of models actually encountered completely new data?* 

*In other words...*

***How does one actually apply a new dataset after the model is all tuned up? What happens if one simply changes the dataset to a new one when making the predictions? Seriously, I looked up 25-30 articles and did not find a clear answer to this. Also, the Kaggle course on intermediate machine learning did not have a word on it...***

The year 2020 data in the original dataset is something that has so far been unused, so why not try that? 

***Just to make it clear, the following most likely does not make any sense at all and - in the spirit of a true expedition - the results may prove to be utterly obscure. The moment I learn how to do the process right, I promise I will then act accordingly instead of scribbling an expedition journal while suffering from a serious case of data-induced snowblindness...***

As the pristine data needs to fit the training data in format, I do some quick housekeeping in my hastily built igloo next.
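What "fitting the training data in format" means can be sketched in plain Python: the new data must offer the same predictor columns, in the same order, as the data the model was trained on. The helper `to_feature_rows` and the records below are hypothetical and only illustrate the check.

```python
predictor_names = ['Rainfall_Mean', 'Lake_Level_Mean', 'Flow_Rate_Mean']


def to_feature_rows(records, predictors):
    """Turn dict records into rows ordered like the training predictors."""
    rows = []
    for record in records:
        missing = [name for name in predictors if name not in record]
        if missing:
            raise KeyError(f"new data lacks columns: {missing}")
        rows.append([record[name] for name in predictors])
    return rows


# Hypothetical new observations; extra keys such as 'Day' are ignored
new_records = [
    {'Day': 1, 'Flow_Rate_Mean': 0.6, 'Lake_Level_Mean': 250.1, 'Rainfall_Mean': 0.0},
    {'Day': 2, 'Flow_Rate_Mean': 0.7, 'Lake_Level_Mean': 250.0, 'Rainfall_Mean': 3.2},
]

print(to_feature_rows(new_records, predictor_names))
```

Once the columns line up, predicting on new data is the same `predict` call as before, just with the new rows.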

In [None]:
# select the first half of 2020 (the end date itself is excluded)
start_date_2020 = '2020-01-01'
end_date_2020 = '2020-06-30'
rain_cols = ['Le_Croci', 'Cavallina', 'S_Agata', 'Mangona', 'S_Piero']

mask_2020_arno = (df_arno['Date'] >= start_date_2020) & (df_arno['Date'] < end_date_2020)
# copy the slice so that the original dataframe stays untouched
df_arno_2020 = df_arno.loc[mask_2020_arno].copy()
pop_hydrometry_2020 = df_arno_2020.pop("Hydrometry")
# daily rainfall averaged over the five stations, named like the training column
df_arno_2020['Rainfall_Mean'] = df_arno_2020[rain_cols].mean(axis=1)
df_arno_2020['Hydrometry_Mean'] = pop_hydrometry_2020
df_arno_2020 = df_arno_2020.drop(['Date'] + rain_cols, axis=1).reset_index(drop=True)
# running day number, used as the merge key
df_arno_2020['Day'] = df_arno_2020.index + 1

mask_2020_bilancino = (df_bi_copy['Date'] >= start_date_2020) & (df_bi_copy['Date'] < end_date_2020)
df_bilancino_2020 = df_bi_copy.loc[mask_2020_bilancino].copy()
# rename the columns to match the predictor names used in training
df_bilancino_2020 = df_bilancino_2020.drop('Date', axis=1).rename(
    columns={'Lake_Level': 'Lake_Level_Mean', 'Flow_Rate': 'Flow_Rate_Mean'})
df_bilancino_2020 = df_bilancino_2020.reset_index(drop=True)
df_bilancino_2020['Day'] = df_bilancino_2020.index + 1

df_Xpedition_test = pd.merge(df_arno_2020, df_bilancino_2020, on='Day')
df_Xpedition_test.head(10)

In [None]:
df_Xpedition_test.isnull().sum()

There's a NaN value; unless we're talking bread, that's never favourable...
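A single gap like this can be filled in several ways. One option, sketched here in plain Python with made-up readings, is to take the average of the two neighbouring days (pandas offers `Series.interpolate()` for the same purpose). The helper `fill_gaps_linear` is hypothetical and only handles isolated gaps.

```python
def fill_gaps_linear(values):
    """Fill isolated None entries with the average of their two neighbours."""
    filled = list(values)
    for i, value in enumerate(values):
        if value is None and 0 < i < len(values) - 1:
            before, after = values[i - 1], values[i + 1]
            if before is not None and after is not None:
                filled[i] = (before + after) / 2
    return filled


hydrometry = [1.30, 1.28, None, 1.24, 1.25]   # hypothetical readings
print(fill_gaps_linear(hydrometry))
```

For a slowly changing quantity such as a water level, a neighbour-based fill tends to stay closer to the surrounding readings than an arbitrary constant would.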

In [None]:
print(df_Xpedition_test[df_Xpedition_test["Hydrometry_Mean"].isnull()])

In [None]:
# fill the single missing hydrometry reading with a hand-picked constant
df_Xpedition_test.at[125,'Hydrometry_Mean']=1.0

In [None]:
df_Xpedition_test.isnull().sum()

In [None]:
data_two = df_Xpedition_test
predictors = ['Rainfall_Mean', 'Lake_Level_Mean', 'Flow_Rate_Mean']
X_test = data_two[predictors]
y_test = data_two.Hydrometry_Mean
y_pred_two = arno_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_two)

print("MSE: %.2f" % mse)
print("RMSE: %.2f" % (mse**(1/2.0)))

In [None]:
x_ax = range(len(y_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_pred_two, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

End of expedition, time to heal the frostbite and scurvy. I think **Adolf Erik Nordenskiöld** said that, or maybe **Billie Eilish**... 