# Notebook Contents

Hi! In this notebook I share my data exploration for the ACEA Smart Water Analytics challenge.


Contents:

- analysis of all the files related to the challenge (size in bytes, number of columns and number of rows)

- time span of each dataset

- analysis of NAs

- map (and perhaps correlation with the weather)



TODO: add a link to each section!


### Props to:

- https://www.kaggle.com/maunish/jsmp-super-cool-eda-lgbm-baseline/comments

- https://www.kaggle.com/iamleonie/eda-quenching-the-thirst-for-insights

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.display.max_columns = 30
import os
import re
from colorama import Fore, Back, Style
import seaborn as sns
import plotly.express as px
import matplotlib
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')

root_path = '/kaggle/input/acea-water-prediction'
data_files = [i for i in os.listdir(root_path) if re.match(r".+\.csv$", i)]
data_files.sort()
data_names = [i.replace('.csv', '') for i in data_files]
data_files = list(map(lambda x: os.path.join(root_path, x), data_files))

waterbody_type = [re.match("water_spring|aquifer|river|lake", i.lower())[0] for i in data_names]

y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
sr_ = Style.RESET_ALL

color_dict = {'aquifer': r_, 'water_spring': g_, 'lake': c_, 'river': b_}

def get_pattern_count(pattern_list, cols): 
    counts = []
    for j in pattern_list: 
        counts.append(len([i for i in cols if re.match(j, i)]))
    return counts

def get_df_basic_information(df, waterbody_type, df_name): 
    
    n_rows, n_columns = df.shape
    
    mb_size = round(df.memory_usage(deep=True).sum()/1000000., 3)
    
    print("""{0}{1}\n
          N rows: {2}\tN columns: {3}\n
          Memory Usage: {4} MB\n\n\n{5}""".format(color_dict[waterbody_type], df_name,
                                           n_rows, n_columns, mb_size, sr_))
    
def chunks(l, n):
    """Yield n successive chunks from l; the last chunk absorbs any remainder."""
    newn = len(l) // n
    for i in range(n - 1):
        yield l[i*newn:i*newn+newn]
    yield l[n*newn-newn:]

### Challenge Description

_This competition uses nine different datasets, completely independent and not linked to each other. Each dataset can represent a different kind of waterbody. As each waterbody is different from the other, the related features as well are different from each other. So, if for instance we consider a water spring we notice that its features are different from the lake’s one. This is correct and reflects the behavior and characteristics of each waterbody. The Acea Group deals with four different type of waterbodies: water spring (for which three datasets are provided), lake (for which a dataset is provided), river (for which a dataset is provided) and aquifers (for which four datasets are provided)._

So the datasets at our disposal are: 

In [None]:
for i in range(len(waterbody_type)):
    print("{0}{1}\n".format(color_dict[waterbody_type[i]], data_names[i]))

We can see there are 4 aquifers, one lake, one river and 3 water springs, as expected. 

The challenge description states: _As each waterbody is different from the other, the related features as well are different from each other_. 

Let's start by checking each dataframe shape. 


In [None]:
df_dict = dict(zip(data_names, list(map(lambda x: pd.read_csv(x, sep = ";") if 'Auser' in x else pd.read_csv(x), data_files))))

for name in data_names:
    # drop rows without a date; .copy() avoids a SettingWithCopyWarning on the assignment below
    df_dict[name] = df_dict[name].loc[~df_dict[name]['Date'].isna()].copy()
    df_dict[name]['Date'] = pd.to_datetime(df_dict[name]['Date'], format = "%d/%m/%Y").dt.date
    df_dict[name].sort_values('Date', ignore_index = True, inplace = True)

In [None]:
for j in range(len(data_names)): 
    
    source_name = data_names[j]
    get_df_basic_information(df_dict[source_name], waterbody_type[j],source_name)

Each source has its own number of rows and columns...

Possible ideas: 

- I have 9 datasets; for each of them I would like to see which columns are present (0-1) (a sort of confusion matrix?) 

- if the number of unique columns is too high, it could be interesting to consider subgroups

- naturally there could be a correlation and cross-correlation analysis (which could grow into separate notebooks) 

- another would be a geographic analysis of the distances between the target and the other columns, or between different datasets (this could grow into a separate notebook)

- analysis per single source?
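
The first idea in the list, a 0/1 matrix showing which columns appear in which dataset, could be sketched roughly as follows. The toy frames and their column names are hypothetical stand-ins for `df_dict`, and `column_presence_matrix` is an illustrative helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the real dataframes in df_dict.
toy_frames = {
    "Aquifer_A": pd.DataFrame(columns=["Date", "Rainfall_X", "Depth_to_Groundwater_P1"]),
    "Lake_B": pd.DataFrame(columns=["Date", "Rainfall_X", "Lake_Level", "Flow_Rate"]),
}

def column_presence_matrix(frames):
    """Return a 0/1 DataFrame: one row per dataset, one column per column name."""
    all_columns = sorted({col for df in frames.values() for col in df.columns})
    presence = {
        name: [int(col in df.columns) for col in all_columns]
        for name, df in frames.items()
    }
    return pd.DataFrame.from_dict(presence, orient="index", columns=all_columns)

presence_matrix = column_presence_matrix(toy_frames)
print(presence_matrix)
```

With the real data one would pass `df_dict` instead of `toy_frames`; since exact column names rarely repeat across waterbodies, grouping by prefix (as the pattern counts in the next cell do) is probably more informative.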

In [None]:
features_pattern = ['Date', 'Depth_to_Groundwater', 'Flow_Rate', 'Hydrometry', 'Lake_Level', 'Rainfall', 'Temperature', 'Volume']

feature_matrix = np.zeros((len(data_names), len(features_pattern)))

for k in range(len(data_names)): 
    name_df = data_names[k]
    df_columns = df_dict[name_df].columns.tolist()
    feature_matrix[k, :] = get_pattern_count(features_pattern, df_columns)
    
feature_matrix = pd.DataFrame(feature_matrix, columns = features_pattern, index = data_names)
feature_matrix['total_columns'] = list(map(lambda x: x.shape[1], df_dict.values()))

In [None]:
sns.color_palette("rocket")
fig, ax = plt.subplots(1, 1, figsize = (17, 11))

colors = sns.color_palette('rocket', 15)
levels = [1, 2, 3, 4, 5, 7, 9, 11, 14, 17, 20, 23, 26, 29, 32]
cmap, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")

sns.heatmap(feature_matrix, annot=True, cmap=cmap, ax = ax, norm=norm)
ax.xaxis.set_ticks_position('top')
ax.set(xlabel='Column Type', ylabel='Dataset')
plt.title('Number of columns per type')
plt.xticks(rotation=290)

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (17,11))

norm_feature_matrix = feature_matrix.copy()
norm_feature_matrix.iloc[:, :-1] = round(norm_feature_matrix.iloc[:, :-1].div(norm_feature_matrix.iloc[:, -1], axis = 0), 3)

sns.heatmap(norm_feature_matrix.drop(columns='total_columns'), annot=True, ax = ax, cmap = sns.color_palette('rocket'))
ax.xaxis.set_ticks_position('top')
ax.set(xlabel='Column Type', ylabel='Dataset')
plt.title('Share of columns per type')
plt.xticks(rotation=290)

#### TimeSpan

Here I compare the time span (minimum and maximum date) of the target columns in each dataset. 

In [None]:
target_dict = {'Aquifer_Auser' : ['Depth_to_Groundwater_LT2', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS'],
               'Aquifer_Doganella' : ['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
                                      'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                                      'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9'],
               'Aquifer_Luco' : ['Depth_to_Groundwater_Podere_Casetta'],
               'Aquifer_Petrignano' : ['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25'],
               'Lake_Bilancino': ['Lake_Level','Flow_Rate'],
               'River_Arno': ['Hydrometry_Nave_di_Rosano'],
               'Water_Spring_Amiata': ['Flow_Rate_Bugnano','Flow_Rate_Arbure','Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'],
               'Water_Spring_Lupa': ['Flow_Rate_Lupa'],
               'Water_Spring_Madonna_di_Canneto': ['Flow_Rate_Madonna_di_Canneto']}

In [None]:
for j in range(len(data_names)):
    
    df_name = data_names[j]
    target_cols = target_dict[df_name]
    
    df = df_dict[df_name][['Date'] + target_cols]
    
    print(df.dropna().Date.min(), df.loc[df[target_cols].isna().sum(axis = 1) < len(target_cols)-1].Date.min())

OK, when one target is NaN, they all are. 
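
One way to make this check explicit is to compare the first date on which any target is available with the first date on which all of them are. The sketch below runs on a toy frame; `target_timespan` and the column names are hypothetical, and `Date` is assumed to be sortable.

```python
import pandas as pd

# Toy data: Target_A starts one day earlier than Target_B.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "Target_A": [None, 1.0, 2.0, 3.0],
    "Target_B": [None, None, 5.0, 6.0],
})

def target_timespan(df, target_cols, date_col="Date"):
    """Return the first/last dates on which any or all targets are observed."""
    has_any = df[target_cols].notna().any(axis=1)  # at least one target present
    has_all = df[target_cols].notna().all(axis=1)  # every target present
    return {
        "first_any": df.loc[has_any, date_col].min(),
        "first_all": df.loc[has_all, date_col].min(),
        "last_any": df.loc[has_any, date_col].max(),
    }

span = target_timespan(toy, ["Target_A", "Target_B"])
print(span)
```

If `first_any` and `first_all` coincide for a dataset, its targets start being recorded on the same date.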

In [None]:
cmap_plot = plt.get_cmap('jet_r')

for j in range(len(data_names)):
    
    data_name = data_names[j]
    
    target_cols = target_dict[data_name]    
    target_cols_len = len(target_cols)

    df = (df_dict[data_name][['Date']+target_cols].melt(id_vars='Date', value_vars=target_cols))
    df['value'] = df['value'].astype(float)
    
    fig = px.line(data_frame=df, x = 'Date', y = 'value', color = 'variable', title = data_name, labels = {'variable': 'target'})
    
    fig.update_xaxes(tickangle=45,
                 tickmode = 'array',
                 tickvals = [df['Date'].min(), df['Date'].max()])
    
    fig.show()
    

##### Targets distribution over time for each Dataset

In [None]:
cmap_plot = plt.get_cmap('jet_r')

for j in range(len(data_names)):
    
    data_name = data_names[j]
    
    target_cols = target_dict[data_name]    
    target_cols_len = len(target_cols)
    
    fig, ax = plt.subplots(1, 1, figsize = (14, 7))
    for i, target in enumerate(target_cols):
        if target_cols_len > 4:
            color = cmap_plot(float(i)/target_cols_len)
            df_dict[data_name][[target, 'Date']].plot(x = 'Date', ax = ax, c = color, lw = 3)
        else:
            df_dict[data_name][[target, 'Date']].plot(x = 'Date', ax = ax, lw = 3)
        plt.legend(title='targets', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax.set_title(data_name)

##### Targets distribution for each dataset

In [None]:
cmap_plot = plt.get_cmap('jet_r')

for j in range(len(data_names)):
    
    data_name = data_names[j]
    
    target_cols = target_dict[data_name]    
    target_cols_len = len(target_cols)
    
    df = df_dict[data_name]
    
    fig, ax = plt.subplots(1, 1, figsize = (12, 6))
    for i, target in enumerate(target_cols):
        sns.kdeplot(df[target], fill=True, alpha=0.5, label=target, ax = ax)
        
    plt.legend(title='targets', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax.set_title(data_name)

In [None]:
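def distribution1_mod(feature, color, df, data_name):
    """Plot the density (left) and the violin plot (right) of a single feature."""
    fig, axes = plt.subplots(1, 2, figsize=(11, 7))
    fig.suptitle(data_name+" "+feature)
    ax = axes.ravel()
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[feature], kde=True, stat='density', color=color, ax=ax[0])
    ax[0].set(xlabel='density')
    sns.violinplot(x=df[feature], ax=ax[1])
    ax[1].set(xlabel='violin')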

In [None]:
cmap_plot = plt.get_cmap('jet_r')

for j in range(len(data_names)):
    
    data_name = data_names[j]
    
    target_cols = target_dict[data_name]    
    target_cols_len = len(target_cols)
    
    df = df_dict[data_name]
    
    for i, target in enumerate(target_cols):
        distribution1_mod(target,'red', df, data_name)
        

#### NaNs

There is an issue with NaNs, but also one with missing dates. 

Step by step:

- number of NaNs in the target columns for each dataset 

- filling the NaNs? 
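
A minimal sketch of the first step, covering both NaNs and calendar gaps, on a toy frame. It assumes the series are meant to be daily, which is an assumption rather than something verified here; `nan_and_date_gaps` and the column names are hypothetical.

```python
import pandas as pd

# Toy data: one NaN in the target and two calendar days missing entirely.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
    "Target_A": [1.0, None, 3.0],
})

def nan_and_date_gaps(df, target_cols, date_col="Date"):
    """Count NaNs per target column and list the days absent from the date column."""
    nan_counts = df[target_cols].isna().sum()
    dates = pd.DatetimeIndex(pd.to_datetime(df[date_col]))
    # Assumption: observations are expected once per day.
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    missing_dates = full_range.difference(dates)
    return nan_counts, missing_dates

nan_counts, missing_dates = nan_and_date_gaps(toy, ["Target_A"])
print(nan_counts)
print(missing_dates)
```

Keeping the two counts separate matters because a missing row and a NaN value call for different handling: the former requires reindexing before any fill strategy can be applied.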

#### (Auto/Cross) Correlation Analysis

- Correlation for each dataset

- Correlation just for targets (rectangular matrix)

- Correlation between all variables of all datasets

- AutoCorrelation for both Target and Predictor variables

- CrossCorrelation between Target/Targets and Target/Predictors

- CrossCorrelation between all columns of all datasets
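
For the cross-correlation items in this list, a possible starting point is the Pearson correlation between a target and a lagged predictor. The sketch below uses synthetic series; `lagged_cross_correlation`, `rain` and `level` are hypothetical names, and the 3-step delay is built into the toy data on purpose.

```python
import numpy as np
import pandas as pd

def lagged_cross_correlation(target, predictor, max_lag):
    """Correlation between target[t] and predictor[t - lag] for lag = 0..max_lag."""
    return pd.Series(
        {lag: target.corr(predictor.shift(lag)) for lag in range(max_lag + 1)},
        name="cross_correlation",
    )

# Synthetic example: the "level" follows the "rain" with a delay of 3 steps plus noise.
rng = np.random.default_rng(0)
rain = pd.Series(rng.normal(size=200))
level = rain.shift(3) + 0.1 * pd.Series(rng.normal(size=200))

ccf = lagged_cross_correlation(level, rain, max_lag=5)
best_lag = ccf.idxmax()
print(ccf.round(3))
print("lag with the highest correlation:", best_lag)
```

`Series.corr` ignores the NaN pairs introduced by `shift`, so no explicit alignment is needed.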

Correlation Matrix for each Dataset

In [None]:
for j in range(len(data_names)):
    
    data_name = data_names[j]
    df = df_dict[data_name]
    
    corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    
    fig, ax = plt.subplots(1, 1, figsize = (16, 10))
    colors = sns.color_palette('rocket', 21)
    levels = np.linspace(-1, 1, 21)
    cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")
    sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
    ax.xaxis.set_ticks_position('top')
    ax.xaxis.label.set_size(14)
    plt.title('Correlation Matrix for {}'.format(data_name))
    plt.xticks(rotation=280)
    fig.show()
    

Correlation Matrix for each dataset, target variables


- plot both signed and absolute values

In [None]:
for j in range(len(data_names)):
    
    data_name = data_names[j]
    df = df_dict[data_name]
    
    corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
    corr_matrix = corr_matrix[target_dict[data_name]]
    
    fig, axes = plt.subplots(1, 2, figsize = (14, 8))
    plt.suptitle(data_name)
    ax = axes.ravel()[0]
    colors = sns.color_palette('rocket', 11)
    levels = np.linspace(-1, 1, 11)
    cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")
    sns.heatmap(corr_matrix, annot=True, ax = ax, cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
    ax.xaxis.set_ticks_position('top')
    ax.xaxis.label.set_size(10)
    ax.tick_params(axis='both', which='major', labelsize=10)
    ax.tick_params(axis='both', which='minor', labelsize=8)
    ax.title.set_text('Correlation Matrix')
    ax.title.set_fontsize(12)

    ax = axes.ravel()[1]
    colors = sns.color_palette('rocket', 11)
    levels = np.linspace(0, 1, 11)
    cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")
    sns.heatmap(abs(corr_matrix), annot=True, ax = ax, cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='both', which='major', labelsize=10)
    ax.tick_params(axis='both', which='minor', labelsize=8)
    ax.set_yticks([])

    ax.title.set_text('Absolute Correlation Matrix')
    ax.title.set_fontsize(12)
    
    for ax in fig.axes:
        matplotlib.pyplot.sca(ax)
        plt.xticks(rotation=280)
    fig.show()
    

Correlation Matrix for each dataset, but let's consider just absolute values > 0.75

In [None]:
for j in range(len(data_names)):
    
    data_name = data_names[j]
    df = df_dict[data_name]
    
    corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    
    mask[(abs(corr_matrix) < 0.75) & (mask == False)] = True
    
    fig, ax = plt.subplots(1, 1, figsize = (12, 8))
    colors = sns.color_palette('rocket', 11)
    levels = np.linspace(-1, 1, 11)
    cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")
    sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
    ax.xaxis.set_ticks_position('top')
    ax.xaxis.label.set_size(14)
    plt.title('Correlation Matrix for {}'.format(data_name))
    plt.xticks(rotation=280)
    fig.show()
    

In [None]:
for j in range(len(data_names)):
    
    data_name = data_names[j]
    df = df_dict[data_name]
    
    corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
    corr_matrix = corr_matrix[target_dict[data_name]]
    
    mask = np.zeros_like(corr_matrix, dtype=bool)
    
    mask[(abs(corr_matrix) < 0.75) & (mask == False)] = True
    
    fig, ax = plt.subplots(1, 1, figsize = (12, 8))
    colors = sns.color_palette('rocket', 11)
    levels = np.linspace(-1, 1, 11)
    cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")
    sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
    ax.xaxis.set_ticks_position('top')
    ax.xaxis.label.set_size(14)
    plt.title('Correlation Matrix for {}'.format(data_name))
    plt.xticks(rotation=280)
    fig.show()
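
As a tabular complement to the thresholded heatmaps, the strongly correlated pairs can also be listed explicitly. This is only a sketch on toy data: `strong_pairs` and the column names are hypothetical, and the 0.75 threshold simply mirrors the one used for the masks.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.75):
    """List each pair of columns whose absolute correlation exceeds the threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().rename("correlation").reset_index()
    pairs.columns = ["feature_a", "feature_b", "correlation"]
    return pairs[pairs["correlation"].abs() > threshold].reset_index(drop=True)

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],    # almost a linear function of "a"
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],  # alternating, essentially unrelated to "a" and "b"
})
result = strong_pairs(toy)
print(result)
```

On the real data this would be called as `strong_pairs(df.drop('Date', axis = 1))` inside the loop over `data_names`.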

Autocorrelation

In [None]:
y_labels = ax.get_yticklabels()
for j in y_labels:
    print(j)

In [None]:
data_chunks = chunks(range(len(data_names)), 3)
chunk_len = 3

for chunk in data_chunks:
    fig, axes = plt.subplots(1, 3, figsize = (20, 12))
    fig.suptitle('Autocorrelation for each feature and dataset')
    axes_raveled = axes.ravel()
    for k in range(len(chunk)):
        
        j = chunk[k]
        data_name = data_names[j]
        df = df_dict[data_name].sort_values('Date', ignore_index = True).drop(columns='Date')
        
        autocorr_dataframe = (pd.DataFrame(df.apply(lambda x: x.autocorr(), 0))
                             .reset_index().rename(columns = {'index': 'feature', 0: 'autocorrelation'})
                             .sort_values('autocorrelation', ascending = False))
        
        ax = axes_raveled[k]
    
        sns.barplot(x='autocorrelation', y='feature', data=(autocorr_dataframe), ax = ax, palette = 'jet_r')
        y_labels = autocorr_dataframe.feature.tolist()
        ax.set_yticklabels([])
        ax.set_xticklabels([])
        ax.title.set_text(data_name)
        ax.title.set_fontsize(12)
        t=0
        for p in ax.patches:
            width = p.get_width()  # bar length, i.e. the autocorrelation value
            if width < 0.01:
                ax.text(width,       # place the text at the end of the bar
                p.get_y() + p.get_height() / 2, # vertical centre of the bar
                '{:1.4f}'.format(width), # value with 4 decimals
                ha = 'left',   # horizontal alignment
                va = 'center')  # vertical alignment
            else:
                ax.text(width/4,    # place the text a quarter of the way along the bar
                p.get_y() + p.get_height() / 2, # vertical centre of the bar
                '{} {:1.4f}'.format(y_labels[t], width), # feature name and value with 4 decimals
                ha = 'left',   # horizontal alignment
                va = 'center',
                color = 'black',
                fontsize = 11)
            t+=1