# Acea Smart Water Analytics
![water](https://storage.googleapis.com/kaggle-competitions/kaggle/24191/logos/header.png?t=2020-11-24-14-43-27)

The Acea Group is one of the leading Italian multiutility operators. Listed on the Italian Stock Exchange since 1999, the company manages and develops water and electricity networks and environmental services. Acea is the foremost Italian operator in the water services sector supplying 9 million inhabitants in Lazio, Tuscany, Umbria, Molise, Campania.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In this competition we will focus only on the water sector to help Acea Group preserve precious waterbodies. As is easy to imagine, a water supply company needs to forecast the water level in a waterbody (water spring, lake, river, or aquifer) to handle daily consumption. During fall and winter waterbodies are refilled, but during spring and summer they start to drain. To help preserve the health of these waterbodies it is important to predict water availability, in terms of level and water flow, for each day of the year.
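The seasonal refill-and-drain pattern described above can be checked by averaging a series per calendar month. The following is a minimal sketch on synthetic data: the series `level` and its winter-peaking cycle are made up for illustration and are not taken from the Acea files.

```python
import numpy as np
import pandas as pd

# Synthetic daily "level" with a yearly cycle that peaks in winter (made-up data)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
level = 10 + 2 * np.cos(2 * np.pi * (day_of_year - 1) / 365.25)
series = pd.Series(level, index=dates, name="level")

# Average level per calendar month, pooled across all years
monthly_mean = series.groupby(series.index.month).mean()
print(monthly_mean.round(2))
print("Highest month:", monthly_mean.idxmax(), "Lowest month:", monthly_mean.idxmin())
```

On real data the same `groupby` on the month of the `Date` column would show in which months a given waterbody tends to be fullest or lowest.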

The data covers four types of waterbody: `water_spring`, `lake`, `river`, and `aquifer`. We will analyze them one by one.

- *Spring water* is `aquifer` water that flows out at the Earth's surface.

- An *aquifer* is an underground layer of water-bearing permeable rock, rock fractures, or unconsolidated materials (gravel, sand, or silt). Its water can be extracted using a water well.

- A *lake* is an area filled with water, localized in a basin, surrounded by land, apart from any river or other outlet that serves to feed or drain the lake. Lakes lie on land and are not part of the ocean, although like the much larger oceans, they form part of Earth's water cycle.

- A *river* is a natural stream of water that flows along a channel toward a sea, a lake, or another river.

In [None]:
# Load the data
auser = pd.read_csv('../input/acea-water-prediction/Aquifer_Auser.csv')
doganella = pd.read_csv('../input/acea-water-prediction/Aquifer_Doganella.csv')
luco = pd.read_csv('../input/acea-water-prediction/Aquifer_Luco.csv')
petrignano = pd.read_csv('../input/acea-water-prediction/Aquifer_Petrignano.csv')
bilancino = pd.read_csv('../input/acea-water-prediction/Lake_Bilancino.csv')
arno = pd.read_csv('../input/acea-water-prediction/River_Arno.csv')
amiata = pd.read_csv('../input/acea-water-prediction/Water_Spring_Amiata.csv')
lupa = pd.read_csv('../input/acea-water-prediction/Water_Spring_Lupa.csv')
madonna = pd.read_csv('../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv')

## Evaluation criteria

This is an Analytics competition where your task is to create a Notebook that best addresses the Evaluation criteria below. Submissions should be shared directly with the host and will be judged by the Acea Group based on how well they address:

Methodology/Completeness (min 0 points, max 5 points)

- Are the statistical models appropriate given the data?
- Did the author develop one or more machine learning models?
- Did the author provide a way of assessing the performance and accuracy of their solution?
- What is the Mean Absolute Error (MAE) of the models?
- What is the Root Mean Square Error (RMSE) of the models?
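The two error metrics named in the list above can be computed directly with NumPy. This is a minimal sketch: the function names `mae` and `rmse` and the toy numbers are illustrative assumptions, not part of the competition material.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average magnitude of the errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more strongly than MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, made up for illustration
observed = [10.0, 12.0, 11.0, 9.0]
predicted = [11.0, 12.0, 9.0, 9.0]
print("MAE :", mae(observed, predicted))
print("RMSE:", rmse(observed, predicted))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions; a large gap between the two suggests a few large misses.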

Presentation (min 0 points, max 5 points)

- Does the notebook have a compelling and coherent narrative?
- Does the notebook contain data visualizations that help to communicate the author’s main points?
- Did the author include a thorough discussion on the intersection between features and their prediction? For example between rainfall and amount/level of water.
- Was there discussion of automated insight generation, demonstrating what factors to take into account?
- Is the code documented in a way that makes it easy to understand and reproduce?
- Were all external sources of data made public and cited appropriately?
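One possible way to discuss the relationship between a feature such as rainfall and the water level is a lagged correlation, since rain may affect a waterbody only after some delay. The sketch below uses synthetic data with a built-in 3-day delay; the helper `lagged_correlation` and the series names are hypothetical.

```python
import numpy as np
import pandas as pd

def lagged_correlation(feature, target, max_lag):
    """Pearson correlation between target[t] and feature[t - lag], for each lag."""
    return pd.Series(
        {lag: target.corr(feature.shift(lag)) for lag in range(max_lag + 1)},
        name="correlation",
    )

rng = np.random.default_rng(0)
rain = pd.Series(rng.gamma(shape=1.0, scale=5.0, size=400))
# Synthetic target: responds to rainfall with a 3-day delay, plus noise
level = rain.shift(3) + rng.normal(scale=0.5, size=400)

corr_by_lag = lagged_correlation(rain, level, max_lag=7)
best_lag = corr_by_lag.idxmax()
print(corr_by_lag.round(2))
print("Strongest lag:", best_lag)
```

On the real datasets the strongest lag, if any, would have to be read from the data; the 3-day delay here exists only because it was built into the toy series.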

Application (min 0 points, max 5 points)

- Is the provided model useful/able to forecast water availability in terms of level or water flow in a time interval of the year?
- Is the provided methodology also applicable to new datasets belonging to another waterbody?
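To judge whether a model can forecast a later time interval, it is usually evaluated on data that comes after the training period, and compared against a simple baseline. The following is a minimal sketch under those assumptions; `chronological_split`, the column name `target`, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

def chronological_split(df, date_col, cutoff):
    """Split rows into those strictly before `cutoff` and those on or after it."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Synthetic daily series (values are made up)
dates = pd.date_range("2020-01-01", periods=10, freq="D")
df = pd.DataFrame({"Date": dates, "target": np.arange(10, dtype=float)})

train, test = chronological_split(df, "Date", "2020-01-08")

# Persistence baseline: predict each day with the previous day's observed value
baseline = df["target"].shift(1).loc[test.index]
baseline_mae = (test["target"] - baseline).abs().mean()
print(len(train), len(test), baseline_mae)
```

A model that cannot beat this persistence baseline on the held-out period adds little over simply repeating yesterday's measurement.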

## Model
Can you build a story to predict the amount of water in each unique waterbody? The challenge is to determine how features influence the water availability of each presented waterbody. Put plainly, with a better understanding of volumes, Acea will be able to ensure water availability for each time interval of the year.


The desired outcome is a notebook that can generate four mathematical models, one for each category of waterbody (aquifers, water springs, river, lake), that might be applicable to each single waterbody.

![prediction](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F6195295%2Fcca952eecc1e49c54317daf97ca2cca7%2FAcea-Input.png?generation=1606932492951317&alt=media)

## Spring Water

A spring is a point at which water flows from an aquifer to the Earth's surface. It is a component of the hydrosphere.

![spring_water](https://images.squarespace-cdn.com/content/v1/583ca2f2d482e9bbbef7dad9/1485192337534-E0WEKOBDABDKGT82C8KE/ke17ZwdGBToddI8pDm48kOhJMV1PnJZ188LmW2h2eLVZw-zPPgdn4jUwVcJE1ZvWEtT5uBSRWt4vQZAgTJucoTqqXjS3CfNDSuuf31e0tVF3A8Q7ch3m2DjVP-Ub0jbakg0qvdpmlE8oRecCZ8CDGGQ6l2WM7tn7mqHTODzkmeM/Tapping+Groundwater+Sources)

This category covers three areas: Amiata, Lupa, and Madonna_di_Canneto. Let's look at the description of their columns:

- **Date**: Regular date (Primary key)
- **Rainfall_X**: It indicates the quantity of rain falling, expressed in millimeters (mm), in the area X. 
- **Depth_to_Groundwater_Y**: It indicates the groundwater level, expressed in ground level (meters from the ground floor), detected by the piezometer Y.
- **Temperature_Z**: It indicates the temperature, expressed in °C, detected by the thermometric station Z.
- **Flow_Rate_K**: It indicates the flow rate, expressed in liters per second (l/s), taken from water spring K.

From the above, for this waterbody we need a model that predicts the flow rate (`flow_rate`), measured in liters per second.
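Separating the target columns from the input features can be done by column-name prefix. This is a minimal sketch on a toy frame: the helper `split_features_target` is hypothetical, the suffixes `X`, `Z`, and `K` are the placeholders from the column description above, and the numbers are made up.

```python
import pandas as pd

def split_features_target(df, target_prefix, ignore=("Date",)):
    """Separate columns whose names start with `target_prefix` from the rest."""
    targets = [c for c in df.columns if c.startswith(target_prefix)]
    features = [c for c in df.columns if c not in targets and c not in ignore]
    return df[features], df[targets]

# Toy frame with placeholder column names and made-up values
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "Rainfall_X": [0.0, 4.2],
    "Temperature_Z": [5.1, 6.3],
    "Flow_Rate_K": [25.0, 24.8],
})
X, y = split_features_target(toy, "Flow_Rate")
print(list(X.columns), list(y.columns))
```

If the column names are abbreviated first, the prefix passed to the helper has to change accordingly (for example `"FR"` instead of `"Flow_Rate"`).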

In [None]:
# Let's simplify the names of the columns
abbreviations = {
    'Rainfall': 'R',
    'Depth_to_Groundwater': 'DG',
    'Temperature': 'T',
    'Flow_Rate': 'FR',
}
for df in [amiata, madonna, lupa]:
    new_cols = []
    for col in df.columns:
        # Abbreviate the first long name found in the column name
        for long_name, short_name in abbreviations.items():
            if long_name in col:
                col = col.replace(long_name, short_name)
                break
        new_cols.append(col)
    df.columns = new_cols

In [None]:
amiata.info()

In [None]:
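# Convert the Date column from object type to datetime type
# (dayfirst=True assumes the CSV files store dates as DD/MM/YYYY)
amiata['Date'] = pd.to_datetime(amiata['Date'], dayfirst=True)
madonna['Date'] = pd.to_datetime(madonna['Date'], dayfirst=True)
lupa['Date'] = pd.to_datetime(lupa['Date'], dayfirst=True)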

In [None]:
plt.figure(figsize=(12, 8))
plt.title('Amiata correlation heatmap')
# Correlate numeric columns only (the datetime Date column is excluded)
sns.heatmap(amiata.select_dtypes('number').corr(), annot=True, fmt='.2f', linewidths=0.2, cmap='YlGnBu')
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
plt.title('Madonna di Canneto correlation heatmap')
# Correlate numeric columns only (the datetime Date column is excluded)
sns.heatmap(madonna.select_dtypes('number').corr(), annot=True, fmt='.2f', cmap='YlGnBu')
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
plt.title('Lupa correlation heatmap')
# Correlate numeric columns only (the datetime Date column is excluded)
sns.heatmap(lupa.select_dtypes('number').corr(), annot=True, fmt='.2f', cmap='YlGnBu')
plt.show()

Let's choose some columns for building the model that predicts the flow rate of the spring-water waterbodies. Since we need one model for each waterbody category, the heatmaps help us see which features the datasets share. Here, `Rainfall` is the only feature common to all three datasets.
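Which feature families are shared across several datasets can also be checked programmatically by comparing column-name prefixes. The sketch below uses empty toy frames with made-up column names that follow the `R`/`DG`/`T`/`FR` abbreviations; the helper `feature_families` is hypothetical.

```python
import pandas as pd

def feature_families(df, ignore=("Date",)):
    """Return the set of column-name prefixes (the text before the first underscore)."""
    return {col.split("_")[0] for col in df.columns if col not in ignore}

# Toy frames with made-up column names (not the real Acea columns)
frames = {
    "a": pd.DataFrame(columns=["Date", "R_X1", "DG_Y1", "T_Z1", "FR_K1"]),
    "b": pd.DataFrame(columns=["Date", "R_X2", "T_Z2", "FR_K2"]),
    "c": pd.DataFrame(columns=["Date", "R_X3", "FR_K3"]),
}
common = set.intersection(*(feature_families(df) for df in frames.values()))

# FR is the target family, so it is removed to leave the shared input features
shared_inputs = sorted(common - {"FR"})
print(shared_inputs)
```

In this toy setup only the `R` family survives as a shared input, which mirrors the kind of check described above without relying on the real files.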

In [None]:
# date = amiata.Date.dt.year
# plt.figure(figsize=(18,4))
# plt.plot(date, amiata.FR_Bugnano)
# plt.show()

## Aquifer

An aquifer is an underground layer of water-bearing permeable rock, rock fractures, or unconsolidated materials (gravel, sand, or silt). Its water can be extracted using a water well.

![aquifer](https://www.dtn.com/wp-content/uploads/2016/05/aquifer_diagram_crop.jpg)

This category covers four areas: Auser, Doganella, Luco, and Petrignano. Let's look at the description of their columns:

- **Date**: Regular date (Primary key)
- **Rainfall_X**: It indicates the quantity of rain falling, expressed in millimeters (mm), in the area X. 
- **Depth_to_Groundwater_Y**: It indicates the groundwater level, expressed in ground level (meters from the ground floor), detected by the piezometer Y.
- **Temperature_Z**: It indicates the temperature, expressed in °C, detected by the thermometric station Z.
- **Volume_H**: It indicates the volume of water, expressed in cubic meters (m³), taken from the drinking water treatment plant H.
- **Hydrometry_K**: It indicates the groundwater level, expressed in meters (m), detected by the hydrometric station K.

From the above, for this waterbody we need a model that predicts the depth to groundwater (`Depth_to_Groundwater`), measured in meters.

In [None]:
# Let's simplify the names of the columns
abbreviations = {
    'Rainfall': 'R',
    'Depth_to_Groundwater': 'DG',
    'Temperature': 'T',
    'Volume': 'V',
    'Hydrometry': 'H',
}
for df in [auser, doganella, luco, petrignano]:
    new_cols = []
    for col in df.columns:
        # Abbreviate the first long name found in the column name
        for long_name, short_name in abbreviations.items():
            if long_name in col:
                col = col.replace(long_name, short_name)
                break
        new_cols.append(col)
    df.columns = new_cols

In [None]:
auser.info()

In [None]:
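# Convert the Date column from object type to datetime type
# (dayfirst=True assumes the CSV files store dates as DD/MM/YYYY)
auser['Date'] = pd.to_datetime(auser['Date'], dayfirst=True)
doganella['Date'] = pd.to_datetime(doganella['Date'], dayfirst=True)
luco['Date'] = pd.to_datetime(luco['Date'], dayfirst=True)
petrignano['Date'] = pd.to_datetime(petrignano['Date'], dayfirst=True)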

In [None]:
plt.figure(figsize=(18, 8))
plt.title('Auser correlation heatmap')
# Correlate numeric columns only (the datetime Date column is excluded)
sns.heatmap(auser.select_dtypes('number').corr(), annot=True, fmt='.2f', linewidths=0.2, cmap='YlGnBu')
plt.show()

In [None]:
plt.figure(figsize=(18, 8))
plt.title('Doganella correlation heatmap')
# Correlate numeric columns only (the datetime Date column is excluded)
sns.heatmap(doganella.select_dtypes('number').corr(), annot=True, fmt='.2f', linewidths=0.2, cmap='YlGnBu')
plt.show()

In [None]:
plt.figure(figsize=(18, 8))
plt.title('Luco correlation heatmap')
# Correlate numeric columns only (the datetime Date column is excluded)
sns.heatmap(luco.select_dtypes('number').corr(), annot=True, fmt='.2f', linewidths=0.2, cmap='YlGnBu')
plt.show()

In [None]:
plt.figure(figsize=(18, 8))
plt.title('Petrignano correlation heatmap')
# Correlate numeric columns only (the datetime Date column is excluded)
sns.heatmap(petrignano.select_dtypes('number').corr(), annot=True, fmt='.2f', linewidths=0.2, cmap='YlGnBu')
plt.show()

## Lake

A lake is an area filled with water, localized in a basin, surrounded by land, apart from any river or other outlet that serves to feed or drain the lake. Lakes lie on land and are not part of the ocean, although like the much larger oceans, they form part of Earth's water cycle.

![bilancino_lake](https://hoteldeivicari.com/images/demo/gallery/intestazione1280x500/mugello_florence_lake_bilancino_near_florence.jpg)

Bilancino Lake is an artificial lake in Mugello, in the province of Florence. It has a maximum depth of thirty-one metres and a surface area of 5 square kilometres. 

Let's look at the description of the Bilancino lake data columns:

- **Date**: Regular date (Primary key)
- **Rainfall_X**: It indicates the quantity of rain falling, expressed in millimeters (mm), in the area X. 
- **Temperature_Z**: It indicates the temperature, expressed in °C, detected by the thermometric station Z.
- **Flow Rate**: It indicates the lake's flow rate, expressed in cubic meters per second (m³/s).
- **Lake Level**: It indicates the lake level, expressed in meters (m).

From the above, for this waterbody we need a model that predicts both `Lake_level` and `flow_rate`.

### Work in progress....