# <center>ACEA Water Analytics</center>

> ### Water is one of the most valuable resources, not only for humans but for many other living organisms as well. Although water security often gets less media attention than climate change, water scarcity is a serious issue that affects us all. Water scarcity arises where the water withdrawal from a basin exceeds its recharge; the water stress level is the freshwater withdrawal as a proportion of available freshwater resources. Managing water as a resource can be challenging. The goal of this challenge is therefore to predict water levels to help Acea Group preserve precious waterbodies. In this notebook we first get an overview of the challenge, then explore the various datasets to look for interesting insights.

> > > Special thanks to [Leonie's notebook](https://www.kaggle.com/iamleonie/eda-quenching-the-thirst-for-insights). Some ideas have been taken from that notebook. Please upvote that notebook as well.

![img](https://marketingmaverick.in/wp-content/uploads/2019/03/images1-2.jpg)

# Importing Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('../input/acea-water-prediction'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


> ## We are given 9 datasets covering 4 types of water body: aquifer, water spring, lake and river.
In this notebook, I first analyse each water body (how the data is distributed and how we could apply modelling on top of that).

## Basic EDA

In [None]:
Aquifer_Doganella = pd.read_csv('../input/acea-water-prediction/Aquifer_Doganella.csv', index_col = 'Date')
Aquifer_Auser = pd.read_csv('../input/acea-water-prediction/Aquifer_Auser.csv', index_col = 'Date')
Water_Spring_Amiata = pd.read_csv('../input/acea-water-prediction/Water_Spring_Amiata.csv', index_col = 'Date')
Lake_Bilancino = pd.read_csv('../input/acea-water-prediction/Lake_Bilancino.csv', index_col = 'Date')
Water_Spring_Madonna_di_Canneto = pd.read_csv('../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv', index_col = 'Date')
Aquifer_Luco = pd.read_csv('../input/acea-water-prediction/Aquifer_Luco.csv', index_col = 'Date')
Aquifer_Petrignano = pd.read_csv('../input/acea-water-prediction/Aquifer_Petrignano.csv', index_col = 'Date')
Water_Spring_Lupa = pd.read_csv('../input/acea-water-prediction/Water_Spring_Lupa.csv', index_col = 'Date')
River_Arno = pd.read_csv('../input/acea-water-prediction/River_Arno.csv', index_col = 'Date')


print('Datasets shape:')
print('*'*30)
print('Aquifer_Doganella --> {}'.format(Aquifer_Doganella.shape))
print('Aquifer_Auser --> {}'.format(Aquifer_Auser.shape))
print('Water_Spring_Amiata --> {}'.format(Water_Spring_Amiata.shape))
print('Lake_Bilancino --> {}'.format(Lake_Bilancino.shape))
print('Water_Spring_Madonna_di_Canneto --> {}'.format(Water_Spring_Madonna_di_Canneto.shape))
print('Aquifer_Luco --> {}'.format(Aquifer_Luco.shape))
print('Aquifer_Petrignano --> {}'.format(Aquifer_Petrignano.shape))
print('Water_Spring_Lupa --> {}'.format(Water_Spring_Lupa.shape))
print('River_Arno --> {}'.format(River_Arno.shape))
print('*'*30)
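
The nine `read_csv` calls above could also be written as a loop. The following is only a sketch: the helpers `load_datasets` and `shape_summary` and the `data_dir` argument are names made up for illustration, and the code assumes every CSV in the folder has a `Date` column, as the nine files loaded above do.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir):
    """Read every CSV file in data_dir into a dict of DataFrames.

    Hypothetical helper for illustration; it assumes each file has a
    'Date' column that can serve as the index.
    """
    frames = {}
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        # The file name without its extension becomes the dictionary key
        frames[csv_path.stem] = pd.read_csv(csv_path, index_col="Date")
    return frames


def shape_summary(frames):
    """Return a small table with the row and column count of each dataset."""
    return pd.DataFrame(
        {name: {"rows": df.shape[0], "columns": df.shape[1]} for name, df in frames.items()}
    ).T
```

With the competition data attached, `shape_summary(load_datasets('../input/acea-water-prediction'))` would give the same shapes as the print statements above, in one table.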

## Checking for NaN values

In [None]:
datasets = [Aquifer_Doganella, Aquifer_Auser, Water_Spring_Amiata,
            Lake_Bilancino, Water_Spring_Madonna_di_Canneto, Aquifer_Luco,
            Aquifer_Petrignano, Water_Spring_Lupa, River_Arno]

datasets_names = ['Aquifer_Doganella', 'Aquifer_Auser', 'Water_Spring_Amiata',
                'Lake_Bilancino', 'Water_Spring_Madonna_di_Canneto', 'Aquifer_Luco',
                'Aquifer_Petrignano', 'Water_Spring_Lupa', 'River_Arno']
def bar_plot(x, y, title, palette_len, xlim = None, ylim = None, 
             xticklabels = None, yticklabels = None, 
             top_visible = False, right_visible = False, 
             bottom_visible = True, left_visible = False,
             xlabel = None, ylabel = None, figsize = (10, 4),
             axis_grid = 'y'):
    fig, ax = plt.subplots(figsize = figsize)
    plt.title(title, size = 15, fontweight = 'bold', fontfamily = 'serif')

    for i in ['top', 'right', 'bottom', 'left']:
        ax.spines[i].set_color('black')
    
    ax.spines['top'].set_visible(top_visible)
    ax.spines['right'].set_visible(right_visible)
    ax.spines['bottom'].set_visible(bottom_visible)
    ax.spines['left'].set_visible(left_visible)

    # Reverse the viridis palette; it is passed as a list because a bare iterator has no length
    sns.barplot(x = x, y = y, edgecolor = 'black', ax = ax,
                palette = list(reversed(sns.color_palette("viridis", len(palette_len)))))
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)    
    ax.set_xticklabels(xticklabels, fontfamily = 'serif')
    ax.set_yticklabels(yticklabels, fontfamily = 'serif')
    plt.xlabel(xlabel, fontfamily = 'serif')
    plt.ylabel(ylabel, fontfamily = 'serif')
    ax.grid(axis = axis_grid, linestyle = '--', alpha = 0.9)
    plt.show()
    

for i in range(len(datasets)):
    NaN_values = (datasets[i].isnull().sum() / len(datasets[i]) * 100).sort_values(ascending = False)
    bar_plot(x = NaN_values, 
             y = NaN_values.index,
             title = '{}: NaN values (%)'.format(datasets_names[i]),
             palette_len = NaN_values.index, 
             xlim = (0, 100), 
             xticklabels = range(0, 101, 20),
             yticklabels = NaN_values.index,
             left_visible = True,
             figsize = (10, 8), axis_grid = 'x')

As the plots show, the datasets contain many NaN values, so we need to handle them, for example by filling in the mean of the affected column.
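
The percentages in the bar plots come from `isnull().sum() / len(df) * 100`. As a small illustration on made-up data (the `toy` frame and its column names are invented, not taken from the ACEA files):

```python
import numpy as np
import pandas as pd

# Invented toy data: 4 rows, with a different share of gaps in each column
toy = pd.DataFrame({
    "Rainfall_A": [0.0, np.nan, 2.5, 1.0],
    "Depth_B": [np.nan, np.nan, np.nan, -30.1],
    "Temperature_C": [10.0, 11.0, 12.0, 13.0],
})

# Same formula as in the plotting loop: share of missing values per column, in percent
nan_pct = (toy.isnull().sum() / len(toy) * 100).sort_values(ascending=False)
print(nan_pct)
```

Here `Depth_B` comes first with 75% missing, followed by `Rainfall_A` with 25% and `Temperature_C` with 0%.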

## Handling missing values of all 9 datasets

First, here are 3 different ways to handle missing values: backward fill, filling with the column mean, and dropping rows (left commented out).

In [None]:
# 1) Backward fill: replace each NaN with the next valid value below it
#    (.bfill() replaces the deprecated fillna(method='bfill'))
Aquifer_Doganella = Aquifer_Doganella.bfill()
print(Aquifer_Doganella.shape)

# 2) Fill the values that are still missing with the column mean
Aquifer_Doganella = Aquifer_Doganella.fillna(Aquifer_Doganella.mean())
print(Aquifer_Doganella.shape)

# 3) Drop rows with NaN values
# Aquifer_Doganella = Aquifer_Doganella.dropna()
# print(Aquifer_Doganella.shape)
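
To see why the mean fill comes after the backward fill, here is a minimal sketch on an invented series with gaps in the middle and at the end (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Invented series: one gap in the middle, two gaps at the end
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan])

# Backward fill takes the next valid value, so trailing gaps stay NaN
backfilled = s.bfill()

# The mean of the back-filled values closes the gaps that remain
mean_filled = backfilled.fillna(backfilled.mean())

# Dropping rows removes every row with a gap and shortens the series
dropped = s.dropna()

print(backfilled.tolist())
print(mean_filled.tolist())
print(len(s), "->", len(dropped))
```

Backward fill cannot fill the two trailing gaps because no later value exists, which is the situation the second step covers. Dropping rows keeps only 2 of the 5 values.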

Now we handle the missing values of all 9 datasets with the bfill method. The files are read again, this time keeping `Date` as a regular column:

In [None]:
DATA_DIR = "../input/acea-water-prediction/"

def read_and_bfill(name):
    # Read one dataset again (Date stays a regular column) and back-fill its missing values
    df = pd.read_csv(DATA_DIR + name + ".csv").bfill()
    print(name + "-:")
    print(df.shape)
    print("*"*30)
    return df

Aquifer_Doganella = read_and_bfill("Aquifer_Doganella")
Aquifer_Auser = read_and_bfill("Aquifer_Auser")
Water_Spring_Amiata = read_and_bfill("Water_Spring_Amiata")
Lake_Bilancino = read_and_bfill("Lake_Bilancino")
Water_Spring_Madonna_di_Canneto = read_and_bfill("Water_Spring_Madonna_di_Canneto")
Aquifer_Luco = read_and_bfill("Aquifer_Luco")
Aquifer_Petrignano = read_and_bfill("Aquifer_Petrignano")
Water_Spring_Lupa = read_and_bfill("Water_Spring_Lupa")
River_Arno = read_and_bfill("River_Arno")



![img](https://www.kaggleusercontent.com/kf/49696239/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..SAo3B3vMxdhgnFNdyEhCBw.eJJFrtOEINVD9Ced1up4dpCzR-UF1s5SS5xlWiZDgdP2Bgi0T5OJEB65h8LSzv_62TsslGVUulIy7tc0ENc6TV5yZ5N7x9yDPYYGedNrl8lhTuFA5SdDVKDE5kDubs9_hFrTn9eQqWlbt1-OhfbUrzvge0j7PDilBkOs9l34_DY7W0npCOe5Wsa4UNU2XF1orbM-AMkd2waRCixQDBbrHcy6c9j4rkRmK8TdHkibV6m9nWDfLRxxHqTXJEDXxdDis0XvLObucPOBJexB58nTcjagg8ijFV2bjCmETRX-_pXO_nQpL8kAa7PGWQ7ZFtVhTI29QPFksh8Vtusg7iYiySFYmbMXdQDodKmQRnU4Iip5v7lrJhvQuGQt-kQGPiiGkq7aLV8WRcoXOkLqps6IXIvr6C7vShtUgOM3gZy-7vLnQs1WP-qkGG-8YIMSSpXpa1yvsZBlyN8xMKp-hSoTzj5TKXcDMY00ApSUR3QiU0tEOW-4F8gQermtJBXdvNcrt_c5Ux6WGsY1LeMzNhlL3dOnFrdJUV-JamkJ33JIotlkdbeMsmotvbt4KNJYF1cU5ZZa5zbdKSaUAoyRk4-aiSHOeV2EG5aUdiBeTEPHgsiJ3OUl9MjgoXtrV0qd3Qxyc7rISekcqXiWvSrdFfdZKKEORHVGyKTT2RqdFeYae2g.Flk7gpzebDQcXm_H3eB9Wg/__results___files/__results___1_1.png)

> > > Visualization taken from Leonie's notebook


# EDA on Aquifer Water Body

In [None]:
print("Aquifer_Doganella:")
print('*'*30)
print(Aquifer_Doganella.columns)
print()
print("Aquifer_Auser:")
print('*'*30)
print(Aquifer_Auser.columns)
print()
print("Aquifer_Luco:")
print('*'*30)
print(Aquifer_Luco.columns)
print()
print("Aquifer_Petrignano:")
print('*'*30)
print(Aquifer_Petrignano.columns)

In [None]:
print("Aquifer_Doganella:")
print('*'*30)
print(Aquifer_Doganella.describe())
print()
print("Aquifer_Auser:")
print('*'*30)
print(Aquifer_Auser.describe())
print()
print("Aquifer_Luco:")
print('*'*30)
print(Aquifer_Luco.describe())
print()
print("Aquifer_Petrignano:")
print('*'*30)
print(Aquifer_Petrignano.describe())

In [None]:
#Checking relationship between variables
datasets = [Aquifer_Doganella, Aquifer_Auser,Aquifer_Luco,Aquifer_Petrignano]
datasets_names = ["Aquifer_Doganella", "Aquifer_Auser","Aquifer_Luco","Aquifer_Petrignano"]
for name, data in zip(datasets_names, datasets):
    # Correlate the numeric columns only (Date is a text column)
    cor=data.select_dtypes(include="number").corr()
    plt.figure(figsize=(20,10), facecolor='w')
    sns.heatmap(cor,xticklabels=cor.columns,yticklabels=cor.columns,annot=True)
    plt.title("Correlation among all the Variables of the "+name+" data", size=20)
    plt.show()
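
A large heatmap can be hard to read. One option is to pull out the column for a single target and sort it by absolute correlation. The sketch below uses randomly generated toy data; the column names and the relationship between rainfall and depth are invented purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rain = rng.gamma(shape=2.0, scale=3.0, size=200)

toy = pd.DataFrame({
    # Text dates, as in the CSV files; they are excluded from the correlation below
    "Date": pd.date_range("2020-01-01", periods=200).strftime("%d/%m/%Y"),
    "Rainfall": rain,
    "Temperature": rng.normal(15, 5, size=200),
    # Invented target that depends on rainfall plus some noise
    "Depth_to_Groundwater": -30 + 0.5 * rain + rng.normal(0, 0.5, size=200),
})

# Correlation matrix of the numeric columns only
cor = toy.select_dtypes(include="number").corr()

# Correlation of every feature with the target, strongest first
target_cor = (
    cor["Depth_to_Groundwater"]
    .drop("Depth_to_Groundwater")
    .sort_values(key=abs, ascending=False)
)
print(target_cor)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which matters for depth-to-groundwater targets.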

## Univariate analysis of selected features of all 4 aquifer datasets

Any type of univariate plot could be used here. I use violin plots; the code for histograms is commented out in case you want to analyse those as well.
To choose the features, list their column positions in `index`. The code will do the rest.

In [None]:
datasets = [Aquifer_Doganella, Aquifer_Auser,Aquifer_Luco,Aquifer_Petrignano]
for data in datasets:
    features=data.columns.to_list()
    index=[1,3,5]  # column positions of the features to plot
    features = [features[i] for i in index] 
    num_plots = len(features)
    total_cols = 2
    total_rows = -(-num_plots // total_cols)  # ceiling division, so no empty extra row
    # squeeze=False keeps axs two-dimensional even when there is only one row
    fig, axs = plt.subplots(nrows=total_rows, ncols=total_cols, squeeze=False,
                            figsize=(7*total_cols, 7*total_rows), facecolor='w', constrained_layout=True)
    
    #For violin plots
    for i, var in enumerate(features):
        row = i//total_cols
        pos = i % total_cols
        plot = sns.violinplot(y=var, data=data, ax=axs[row][pos], linewidth=2)
     
    #For histograms with a density curve (histplot replaces the deprecated distplot)
#     for feature in features:
#         plt.figure(figsize=(18, 10), facecolor='w')
#         sns.histplot(data[feature], kde=True)
#         plt.title('{} Distribution'.format(feature), fontsize=20)
#         plt.show()

These plots help show which features are distributed differently and may therefore play a role in predicting the amount of water in each water body.

## Bivariate analysis

You can choose any two features and analyse how they relate to each other in a dataset.

In [None]:
plt.figure(figsize=(20,10), facecolor='w')
sns.boxplot(x="Rainfall_Monteporzio",y="Depth_to_Groundwater_Pozzo_1",data=Aquifer_Doganella)
plt.title("Distribution of Rainfall_Monteporzio with respect to Depth_to_Groundwater_Pozzo_1", size=20)
plt.show()

In [None]:
plt.figure(figsize=(20,10), facecolor='w')
sns.boxplot(x="Rainfall_Gallicano",y="Depth_to_Groundwater_PAG",data=Aquifer_Auser)
plt.title("Distribution of Rainfall_Gallicano with respect to Depth_to_Groundwater_PAG", size=20)
plt.show()

## Multivariate analysis

In [None]:

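# Parse the day/month/year strings so that the lines are drawn in chronological order
doganella = Aquifer_Doganella.assign(Date=pd.to_datetime(Aquifer_Doganella["Date"], format="%d/%m/%Y"))
graph_1 = doganella.groupby("Date").Rainfall_Monteporzio.mean()
graph_2 = doganella.groupby("Date").Rainfall_Velletri.mean()
graph_3 = doganella.groupby("Date").Temperature_Monteporzio.mean()
graph_4 = doganella.groupby("Date").Temperature_Velletri.mean()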

plt.figure(figsize=(20,10), facecolor='w')
sns.lineplot(data=graph_1, label="Rainfall_Monteporzio")
sns.lineplot(data=graph_2, label="Rainfall_Velletri")
sns.lineplot(data=graph_3, label="Temperature_Monteporzio")
sns.lineplot(data=graph_4, label="Temperature_Velletri")
plt.title("Graph showing Rainfall and Temperature date wise ", size=20)
plt.xlabel("Date", size=20)
plt.ylabel("Value", size=20)
plt.xticks(size=12)
plt.yticks(size=12)

In [None]:
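# Parse the day/month/year strings so that the lines are drawn in chronological order
auser = Aquifer_Auser.assign(Date=pd.to_datetime(Aquifer_Auser["Date"], format="%d/%m/%Y"))
graph_1 = auser.groupby("Date").Rainfall_Gallicano.mean()
graph_2 = auser.groupby("Date").Rainfall_Pontetetto.mean()
graph_3 = auser.groupby("Date").Temperature_Orentano.mean()
graph_4 = auser.groupby("Date").Temperature_Monte_Serra.mean()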

plt.figure(figsize=(20,10), facecolor='w')
sns.lineplot(data=graph_1, label="Rainfall_Gallicano")
sns.lineplot(data=graph_2, label="Rainfall_Pontetetto")
sns.lineplot(data=graph_3, label="Temperature_Orentano")
sns.lineplot(data=graph_4, label="Temperature_Monte_Serra")
plt.title("Graph showing Rainfall and Temperature date wise ", size=20)
plt.xlabel("Date", size=20)
plt.ylabel("Value", size=20)
plt.xticks(size=12)
plt.yticks(size=12)
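
A note on date handling: the `Date` column in the CSV files is text in day/month/year form (the format string `%d/%m/%Y` is used for it later in this notebook). Grouping or sorting on the raw text therefore orders the rows alphabetically, not chronologically. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows, deliberately out of order, in day/month/year text format
toy = pd.DataFrame({
    "Date": ["01/02/2020", "15/01/2020", "02/01/2021", "01/01/2020"],
    "Rainfall": [4.0, 2.0, 3.0, 1.0],
})

# Grouping on the text column sorts the keys alphabetically
by_text = toy.groupby("Date").Rainfall.mean()
print(list(by_text.index))

# After parsing, the same grouping is in true chronological order
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")
by_date = toy.groupby("Date").Rainfall.mean()
print(list(by_date.values))
```

In the text version, the 2021 row lands before 15 January 2020. After parsing, it moves to the end where it belongs.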

## Target distribution

In [None]:
PATH = "../input/acea-water-prediction/"
aquifer_auser_df = pd.read_csv(f"{PATH}Aquifer_Auser.csv")
aquifer_doganella_df = pd.read_csv(f"{PATH}Aquifer_Doganella.csv")
aquifer_luco_df = pd.read_csv(f"{PATH}Aquifer_Luco.csv")
aquifer_petrignano_df = pd.read_csv(f"{PATH}Aquifer_Petrignano.csv")

lake_biliancino_df = pd.read_csv(f"{PATH}Lake_Bilancino.csv")

river_arno_df = pd.read_csv(f"{PATH}River_Arno.csv")

water_spring_amiata_df = pd.read_csv(f"{PATH}Water_Spring_Amiata.csv")
water_spring_lupa_df = pd.read_csv(f"{PATH}Water_Spring_Lupa.csv")
water_spring_madonna_df = pd.read_csv(f"{PATH}Water_Spring_Madonna_di_Canneto.csv")

waterbodies_df = aquifer_auser_df.merge(aquifer_doganella_df, on='Date', how='outer')
waterbodies_df = waterbodies_df.merge(aquifer_luco_df, on='Date', how='outer')
waterbodies_df = waterbodies_df.merge(aquifer_petrignano_df, on='Date', how='outer')
waterbodies_df = waterbodies_df.merge(lake_biliancino_df[['Date','Temperature_Le_Croci','Lake_Level', 'Flow_Rate']], on='Date', how='outer') # Only merge specific columns because 'Rainfall_S_Piero', 'Rainfall_Mangona', 'Rainfall_S_Agata', 'Rainfall_Cavallina', 'Rainfall_Le_Croci' are shared with river_arno_df
waterbodies_df = waterbodies_df.merge(river_arno_df, on='Date', how='outer')
waterbodies_df = waterbodies_df.merge(water_spring_amiata_df, on='Date', how='outer')
waterbodies_df = waterbodies_df.merge(water_spring_lupa_df, on='Date', how='outer')
waterbodies_df = waterbodies_df.merge(water_spring_madonna_df, on='Date', how='outer')

waterbodies_df['Date_dt'] = pd.to_datetime(waterbodies_df.Date, format = '%d/%m/%Y')
waterbodies_df = waterbodies_df.sort_values(by='Date_dt').reset_index(drop=True)

n_targets = 1
height=4

custom_colors = ['mediumblue', 'steelblue', 'dodgerblue', 'cornflowerblue', 'lightblue', 
                 'cadetblue', 'teal', 'mediumaquamarine', 'lightseagreen']
f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Aquifer Auser', fontsize=16)

for i, target in enumerate(['Depth_to_Groundwater_LT2', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS', ]): # 'Depth_to_Groundwater_PAG', 'Depth_to_Groundwater_DIEC']): not targets
    sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df[target].replace({np.nan : np.inf}), ax=ax, color=custom_colors[i], label=target)

ax.set_ylabel('Depth to Groundwater', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()


f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Aquifer Doganella', fontsize=16)

for i, target in enumerate(['Depth_to_Groundwater_Pozzo_1_x', 'Depth_to_Groundwater_Pozzo_2', 'Depth_to_Groundwater_Pozzo_3_x', 'Depth_to_Groundwater_Pozzo_4_x',
               'Depth_to_Groundwater_Pozzo_5', 'Depth_to_Groundwater_Pozzo_6', 'Depth_to_Groundwater_Pozzo_7', 'Depth_to_Groundwater_Pozzo_8',
              'Depth_to_Groundwater_Pozzo_9']):
    sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df[target].replace({np.nan : np.inf}), ax=ax, color=custom_colors[i], label=target)

ax.set_ylabel('Depth to Groundwater', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()


f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Aquifer Luco', fontsize=16)
for i, target in enumerate(['Depth_to_Groundwater_Podere_Casetta',]): # 'Depth_to_Groundwater_Pozzo_1_y', 'Depth_to_Groundwater_Pozzo_3_y', 'Depth_to_Groundwater_Pozzo_4_y']): not targets
    sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df[target].replace({np.nan : np.inf}), ax=ax, color=custom_colors[i], label=target)

ax.set_ylabel('Depth to Groundwater', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()


f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Aquifer Petrignano', fontsize=16)

for i, target in enumerate(['Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25']):
    sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df[target].replace({np.nan : np.inf}), ax=ax, color=custom_colors[i], label=target)

ax.set_ylabel('Depth to Groundwater', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()
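
Some column names in the cell above end in `_x` and the commented-out ones in `_y`. These suffixes come from `merge`: when both frames contain a column with the same name that is not the join key, pandas appends `_x` to the column from the left frame and `_y` to the one from the right frame by default. A tiny sketch with invented frames (the column name `Depth_Pozzo_1` and the suffixes `_Doganella` / `_Luco` are chosen only for illustration):

```python
import pandas as pd

# Two invented frames that share a non-key column name
left = pd.DataFrame({"Date": ["01/01/2020", "02/01/2020"], "Depth_Pozzo_1": [-40.0, -41.0]})
right = pd.DataFrame({"Date": ["02/01/2020", "03/01/2020"], "Depth_Pozzo_1": [-10.0, -11.0]})

# Default behaviour: the clashing columns get the suffixes _x (left) and _y (right)
merged = left.merge(right, on="Date", how="outer")
print(merged.columns.tolist())

# Explicit suffixes make it clearer where each column came from
named = left.merge(right, on="Date", how="outer", suffixes=("_Doganella", "_Luco"))
print(named.columns.tolist())
```

The outer join keeps all three dates and leaves NaN where a frame has no row for a date, which is why the merged table has many gaps.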



# EDA on Water_Spring

In [None]:
print("Water_Spring_Amiata:")
print('*'*30)
print(Water_Spring_Amiata.columns)
print()
print("Water_Spring_Lupa:")
print('*'*30)
print(Water_Spring_Lupa.columns)
print()
print("Water_Spring_Madonna_di_Canneto:")
print('*'*30)
print(Water_Spring_Madonna_di_Canneto.columns)
print()

In [None]:
print("Water_Spring_Amiata:")
print('*'*30)
print(Water_Spring_Amiata.describe())
print()
print("Water_Spring_Lupa:")
print('*'*30)
print(Water_Spring_Lupa.describe())
print()
print("Water_Spring_Madonna_di_Canneto:")
print('*'*30)
print(Water_Spring_Madonna_di_Canneto.describe())
print()

In [None]:
datasets = [Water_Spring_Amiata,Water_Spring_Madonna_di_Canneto,Water_Spring_Lupa]
datasets_names = ["Water_Spring_Amiata","Water_Spring_Madonna_di_Canneto","Water_Spring_Lupa"]
for name, data in zip(datasets_names, datasets):
    # Correlate the numeric columns only (Date is a text column)
    cor=data.select_dtypes(include="number").corr()
    plt.figure(figsize=(20,10), facecolor='w')
    sns.heatmap(cor,xticklabels=cor.columns,yticklabels=cor.columns,annot=True)
    plt.title("Correlation among all the Variables of the "+name+" data", size=20)
    plt.show()

## Univariate analysis

In [None]:
datasets = [Water_Spring_Amiata,Water_Spring_Madonna_di_Canneto,Water_Spring_Lupa]

for data in datasets:
    features=data.columns.to_list()
    index=[1,2]  # column positions of the features to plot
    features = [features[i] for i in index] 
    num_plots = len(features)
    total_cols = 2
    total_rows = -(-num_plots // total_cols)  # ceiling division, so no empty extra row
    # squeeze=False keeps axs two-dimensional even when there is only one row
    fig, axs = plt.subplots(nrows=total_rows, ncols=total_cols, squeeze=False,
                            figsize=(7*total_cols, 7*total_rows), facecolor='w', constrained_layout=True)
    
    #For violin plots
    for i, var in enumerate(features):
        row = i//total_cols
        pos = i % total_cols
        plot = sns.violinplot(y=var, data=data, ax=axs[row][pos], linewidth=2)
     
    #For histograms with a density curve (histplot replaces the deprecated distplot)
#     for feature in features:
#         plt.figure(figsize=(18, 10), facecolor='w')
#         sns.histplot(data[feature], kde=True)
#         plt.title('{} Distribution'.format(feature), fontsize=20)
#         plt.show()

## Multivariate analysis

In [None]:
# Parse the day/month/year strings so that the lines are drawn in chronological order
amiata = Water_Spring_Amiata.assign(Date=pd.to_datetime(Water_Spring_Amiata["Date"], format="%d/%m/%Y"))
graph_1 = amiata.groupby("Date").Rainfall_Castel_del_Piano.mean()
graph_2 = amiata.groupby("Date").Rainfall_Abbadia_S_Salvatore.mean()
graph_3 = amiata.groupby("Date").Temperature_Abbadia_S_Salvatore.mean()
graph_4 = amiata.groupby("Date").Temperature_S_Fiora.mean()

plt.figure(figsize=(20,10), facecolor='w')
sns.lineplot(data=graph_1, label="Rainfall_Castel_del_Piano")
sns.lineplot(data=graph_2, label="Rainfall_Abbadia_S_Salvatore")
sns.lineplot(data=graph_3, label="Temperature_Abbadia_S_Salvatore")
sns.lineplot(data=graph_4, label="Temperature_S_Fiora")
plt.title("Graph showing Rainfall and Temperature date wise ", size=20)
plt.xlabel("Date", size=20)
plt.ylabel("Value", size=20)
plt.xticks(size=12)
plt.yticks(size=12)

In [None]:
# Parse the day/month/year strings so that the lines are drawn in chronological order
lupa = Water_Spring_Lupa.assign(Date=pd.to_datetime(Water_Spring_Lupa["Date"], format="%d/%m/%Y"))
graph_1 = lupa.groupby("Date").Rainfall_Terni.mean()
graph_2 = lupa.groupby("Date").Flow_Rate_Lupa.mean()


plt.figure(figsize=(20,10), facecolor='w')
sns.lineplot(data=graph_1, label="Rainfall_Terni")
sns.lineplot(data=graph_2, label="Flow_Rate_Lupa")
plt.title("Graph showing Rainfall and Flow Rate date wise", size=20)
plt.xlabel("Date", size=20)
plt.ylabel("Value", size=20)
plt.xticks(size=12)
plt.yticks(size=12)

## Target Distribution 

In [None]:
n_targets = 1
f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Water Spring Amiata', fontsize=16)

for i, target in enumerate(['Flow_Rate_Bugnano', 'Flow_Rate_Arbure', 'Flow_Rate_Ermicciolo', 'Flow_Rate_Galleria_Alta']):
    sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df[target].replace({np.nan : np.inf}), ax=ax, color=custom_colors[i], label=target)

ax.set_ylabel('Flow Rate', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()


f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Water Spring Lupa', fontsize=16)
sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df.Flow_Rate_Lupa.replace({np.nan : np.inf}), ax=ax, color=custom_colors[0])

ax.set_ylabel('Flow Rate', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()


f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Water Spring Madonna di Canneto', fontsize=16)
sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df.Flow_Rate_Madonna_di_Canneto.replace({np.nan : np.inf}), ax=ax, color=custom_colors[0])

ax.set_ylabel('Flow Rate', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()
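
The target plots replace NaN with infinity, presumably so that missing stretches show up as gaps in the line rather than being bridged. The same gaps can also be summarised in numbers. The sketch below uses an invented daily series (`Flow_Rate_toy` is not a real column) and reports the first observed date, the share of missing days, and the longest run of consecutive missing days.

```python
import numpy as np
import pandas as pd

# Invented daily target with gaps at the start and in the middle
dates = pd.date_range("2020-01-01", periods=6, freq="D")
flow = pd.Series([np.nan, np.nan, -20.0, np.nan, -21.0, -22.0],
                 index=dates, name="Flow_Rate_toy")

# Date of the first real measurement
first_obs = flow.first_valid_index()

# Share of missing days, in percent
missing_share = flow.isna().mean() * 100

# Label each run of equal consecutive flags, then count the missing days per run
is_na = flow.isna()
run_id = (is_na != is_na.shift()).cumsum()
longest_gap = int(is_na.groupby(run_id).sum().max())

print(first_obs.date(), missing_share, longest_gap)
```

Applied to a real target column, numbers like these can help decide from which date onward a series is complete enough to model.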

# EDA On Lake

In [None]:
print("Lake_Bilancino:")
print('*'*30)
print(Lake_Bilancino.columns)

In [None]:
Lake_Bilancino.describe()

In [None]:
# Correlate the numeric columns only (Date is a text column)
cor=Lake_Bilancino.select_dtypes(include="number").corr()
plt.figure(figsize=(20,10), facecolor='w')
sns.heatmap(cor,xticklabels=cor.columns,yticklabels=cor.columns,annot=True)
plt.title("Correlation among all the Variables of the Lake_Bilancino data", size=20)
cor

## Univariate analysis

In [None]:
features=Lake_Bilancino.columns.to_list()
index=[1,3,5]  # column positions of the features to plot
features = [features[i] for i in index] 
num_plots = len(features)
total_cols = 2
total_rows = -(-num_plots // total_cols)  # ceiling division, so no empty extra row
# squeeze=False keeps axs two-dimensional even when there is only one row
fig, axs = plt.subplots(nrows=total_rows, ncols=total_cols, squeeze=False,
                        figsize=(7*total_cols, 7*total_rows), facecolor='w', constrained_layout=True)

#For violin plots
for i, var in enumerate(features):
    row = i//total_cols
    pos = i % total_cols
    plot = sns.violinplot(y=var, data=Lake_Bilancino, ax=axs[row][pos], linewidth=2)

#For histograms with a density curve (histplot replaces the deprecated distplot)
for feature in features:
    plt.figure(figsize=(18, 10), facecolor='w')
    sns.histplot(Lake_Bilancino[feature], kde=True)
    plt.title('{} Distribution'.format(feature), fontsize=20)
    plt.show()

## Multivariate analysis

In [None]:
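# Parse the day/month/year strings so that the lines are drawn in chronological order
bilancino = Lake_Bilancino.assign(Date=pd.to_datetime(Lake_Bilancino["Date"], format="%d/%m/%Y"))
graph_1 = bilancino.groupby("Date").Rainfall_S_Piero.mean()
graph_2 = bilancino.groupby("Date").Rainfall_Mangona.mean()
graph_3 = bilancino.groupby("Date").Temperature_Le_Croci.mean()
graph_4 = bilancino.groupby("Date").Lake_Level.mean()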

plt.figure(figsize=(20,10), facecolor='w')
sns.lineplot(data=graph_1, label="Rainfall_S_Piero")
sns.lineplot(data=graph_2, label="Rainfall_Mangona")
sns.lineplot(data=graph_3, label="Temperature_Le_Croci")
sns.lineplot(data=graph_4, label="Lake_Level")
plt.title("Graph showing Rainfall, Temperature and Lake Level date wise", size=20)
plt.xlabel("Date", size=20)
plt.ylabel("Value", size=20)
plt.xticks(size=12)
plt.yticks(size=12)

## Target Distribution

In [None]:
n_targets = 2
f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for Lake Bilancino', fontsize=16)
sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df.Lake_Level.replace({np.nan : np.inf}), ax=ax[0], color=custom_colors[0])
sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df.Flow_Rate.replace({np.nan : np.inf}), ax=ax[1], color=custom_colors[0])
ax[0].set_ylabel('Lake Level', fontsize=14)
ax[1].set_ylabel('Flow Rate', fontsize=14)

for i in range(n_targets):
    ax[i].set_xlabel('Date', fontsize=14)
plt.show()

# EDA on River

In [None]:
print("River_Arno:")
print('*'*30)
print(River_Arno.columns)

In [None]:
River_Arno.describe()

In [None]:
# Correlate the numeric columns only (Date is a text column)
cor=River_Arno.select_dtypes(include="number").corr()
plt.figure(figsize=(20,10), facecolor='w')
sns.heatmap(cor,xticklabels=cor.columns,yticklabels=cor.columns,annot=True)
plt.title("Correlation among all the Variables of the River_Arno data", size=20)
cor

## Univariate analysis

In [None]:
features=River_Arno.columns.to_list()
index=[1,3,5]  # column positions of the features to plot
features = [features[i] for i in index] 
num_plots = len(features)
total_cols = 2
total_rows = -(-num_plots // total_cols)  # ceiling division, so no empty extra row
# squeeze=False keeps axs two-dimensional even when there is only one row
fig, axs = plt.subplots(nrows=total_rows, ncols=total_cols, squeeze=False,
                        figsize=(7*total_cols, 7*total_rows), facecolor='w', constrained_layout=True)

#For violin plots
for i, var in enumerate(features):
    row = i//total_cols
    pos = i % total_cols
    plot = sns.violinplot(y=var, data=River_Arno, ax=axs[row][pos], linewidth=2)

#For histograms with a density curve (histplot replaces the deprecated distplot)
for feature in features:
    plt.figure(figsize=(18, 10), facecolor='w')
    sns.histplot(River_Arno[feature], kde=True)
    plt.title('{} Distribution'.format(feature), fontsize=20)
    plt.show()

## Multivariate analysis

In [None]:
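# Parse the day/month/year strings so that the lines are drawn in chronological order
arno = River_Arno.assign(Date=pd.to_datetime(River_Arno["Date"], format="%d/%m/%Y"))
graph_1 = arno.groupby("Date").Rainfall_Le_Croci.mean()
graph_2 = arno.groupby("Date").Rainfall_Cavallina.mean()
graph_3 = arno.groupby("Date").Temperature_Firenze.mean()
graph_4 = arno.groupby("Date").Hydrometry_Nave_di_Rosano.mean()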

plt.figure(figsize=(20,10), facecolor='w')
sns.lineplot(data=graph_1, label="Rainfall_Le_Croci")
sns.lineplot(data=graph_2, label="Rainfall_Cavallina")
sns.lineplot(data=graph_3, label="Temperature_Firenze")
sns.lineplot(data=graph_4, label="Hydrometry_Nave_di_Rosano")
plt.title("Graph showing Rainfall, Temperature and Hydrometry date wise", size=20)
plt.xlabel("Date", size=20)
plt.ylabel("Value", size=20)
plt.xticks(size=12)
plt.yticks(size=12)

## Target Distribution

In [None]:
n_targets=1
f, ax = plt.subplots(nrows=n_targets, ncols=1, figsize=(15, height*n_targets))
f.suptitle('Target Variables for River Arno', fontsize=16)
sns.lineplot(x=waterbodies_df.Date_dt, y=waterbodies_df.Hydrometry_Nave_di_Rosano.replace({np.nan : np.inf}), ax=ax, color=custom_colors[0], label='Hydrometry_Nave_di_Rosano')
ax.set_ylabel('Hydrometry', fontsize=14)
ax.set_xlabel('Date', fontsize=14)
plt.show()

Thanks for reading my notebook. I hope you liked some of the visualisations.