Data exploration usually starts with a few statistical measures that give a quick overview of the data. 
In this notebook, we write a function that creates a **Data Quality Report** for the continuous features of any dataframe, which reduces some of the manual effort in EDA.

A data quality report includes tabular reports that describe the characteristics of each feature in a dataset using standard statistical measures of central tendency and variation. The tabular reports are accompanied by data visualizations that illustrate the distribution of the values in each feature of the dataset. 

The data quality report includes the following measures and plots:<br>
<b>
1. No. of rows
2. No. of columns
3. Data types of all the columns
4. Number of columns of each data type
5. A table with the below statistical measures for each continuous feature:
    * Count
    * Missing values percentage
    * Cardinality
    * Minimum
    * 1st quartile - 25%ile
    * 2nd quartile - 50%ile - Median
    * 3rd quartile - 75%ile
    * Maximum
    * Mean
    * Mode
    * Standard deviation
6. Histogram of the features
7. Line graphs of the features
8. Graph showing the occurrence of missing values based on row index
9. Correlation graph - Pearson's correlation
</b>
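
As a rough illustration of the table in item 5, the measures can be computed on a small, made-up dataframe. The `toy` frame and its column names are assumptions for this sketch and are not part of the ACEA data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the measures listed above.
toy = pd.DataFrame({
    "level": [1.0, 2.0, 2.0, np.nan, 5.0],
    "rain": [0.0, 0.0, 3.5, 1.5, 0.0],
})

summary = toy.describe().T                     # count, mean, std, min, quartiles, max
summary["miss %"] = toy.isnull().mean() * 100  # share of missing rows per column
summary["card"] = toy.nunique()                # number of distinct non-missing values
summary["mode"] = toy.mode().iloc[0]           # first mode if several values tie

print(summary[["count", "miss %", "card", "min", "50%", "max", "mean", "mode"]])
```

Each helper returns a Series indexed by column name, so the assignments line up with the rows of the transposed `describe()` table.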

In [None]:
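import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
from IPython.display import display  # available implicitly in notebooks; imported here to be explicit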

In [None]:
plt.style.use('ggplot')

In [None]:
data_folder='/kaggle/input/acea-water-prediction/'

In [None]:
for i in os.listdir(data_folder):
    print(i)

In [None]:
df_Auser_Aquifer=pd.read_csv(data_folder+"Aquifer_Auser.csv")
df_Auser_Aquifer['Date'] = pd.to_datetime(df_Auser_Aquifer['Date'], dayfirst=True)

df_Doganella_Aquifer=pd.read_csv(data_folder+"Aquifer_Doganella.csv")
df_Luco_Aquifer=pd.read_csv(data_folder+"Aquifer_Luco.csv")
df_Petrignano_Aquifer=pd.read_csv(data_folder+"Aquifer_Petrignano.csv")
df_Bilancino_Lake=pd.read_csv(data_folder+"Lake_Bilancino.csv")
df_Arno_River=pd.read_csv(data_folder+"River_Arno.csv")
df_Amiata_Water_Spring=pd.read_csv(data_folder+"Water_Spring_Amiata.csv")
df_Lupa_Water_Spring=pd.read_csv(data_folder+"Water_Spring_Lupa.csv")
df_Madonna_di_Canneto_Water_Spring=pd.read_csv(data_folder+"Water_Spring_Madonna_di_Canneto.csv")

In [None]:
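# Keep one calendar year of data.
# Note: despite the `_2020` suffix, this filter selects 2019-01-01 through 2019-12-31.
df_Auser_Aquifer_2020 = df_Auser_Aquifer[(df_Auser_Aquifer['Date'] >= '2019-01-01') & (df_Auser_Aquifer['Date'] < '2020-01-01')]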

In [None]:
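def missing_values(df):
    """Plot a heatmap of missing values: one band per column, row position on the x-axis."""
    fig, ax = plt.subplots(1, 1, figsize=(16, 10))

    # Cast the boolean mask to int so pcolormesh receives a plain numeric array
    ax.pcolormesh(df.isnull().T.to_numpy().astype(int), cmap='Blues')
    ax.set_yticks([i + 0.5 for i in range(len(df.columns))])
    # Label each column with its missing-value percentage (the f-string also handles non-string column names)
    ax.set_yticklabels([f"{col} - {round(df[col].isnull().mean() * 100, 2)}%" for col in df.columns])
    ax.set_xlabel("Row position")

    ax.set_title("Missing Values", {'fontsize': 25})
    plt.show()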
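
To make the heatmap easier to read, the sketch below prints the two ingredients it is built from: the transposed missing-value mask (one row per column of the dataframe) and the y-axis labels with the missing percentage. The `toy` frame is a made-up example, not the ACEA data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
})

# 1 marks a missing value; after the transpose each row corresponds to one column of `toy`
mask = toy.isnull().T.to_numpy().astype(int)

# Same label pattern as the y-axis of the heatmap: "<column> - <missing %>"
labels = [f"{col} - {toy[col].isnull().mean() * 100:.2f}%" for col in toy.columns]

print(mask)
print(labels)
```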

In [None]:
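def corr_graph(df):
    """Plot Pearson's correlation between the numeric columns as a heatmap."""
    fig, ax1 = plt.subplots(1, 1, figsize=(10, 8))
    ax1.set_title("Correlation Graph")
    # Restrict to numeric columns; non-numeric ones (e.g. strings) can make corr() fail in recent pandas versions
    corr = df.select_dtypes(include='number').corr(method='pearson')
    # Fix the colour scale to the full range of the coefficient, [-1, 1]
    pcm = ax1.pcolormesh(corr.to_numpy(), vmin=-1, vmax=1)
    ax1.set_xticks(np.arange(0.5, len(corr.columns)))
    ax1.set_xticklabels(corr.columns, rotation='vertical')
    ax1.set_yticks(np.arange(0.5, len(corr.columns)))
    ax1.set_yticklabels(corr.columns)
    plt.colorbar(pcm, ax=ax1)
    plt.show()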
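
The matrix behind the correlation graph can be inspected directly. The following is a minimal sketch on a made-up frame (`toy` and its columns are assumptions), showing that Pearson's coefficient is 1 for a perfectly linear pair, -1 for a perfectly inverse pair, and that non-numeric columns are left out before calling `corr`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],     # perfectly linear in x
    "z": [4.0, 3.0, 2.0, 1.0],     # perfectly inverse to x
    "site": ["a", "b", "c", "d"],  # non-numeric, excluded below
})

corr = toy.select_dtypes(include="number").corr(method="pearson")
print(corr)
```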

In [None]:
def data_quality_report(df):
    """Print and plot a data quality report for the continuous (numeric) features of df."""
    n_rows, n_cols = df.shape

    # Work on numeric columns only so the table and every plot cover the same features
    num = df.select_dtypes(include='number')
    numeric_cols = list(num.columns)
    n_num_cols = len(numeric_cols)

    desc = num.describe().T
    desc['miss %'] = (num.isnull().mean() * 100).round(2)
    desc['card'] = num.nunique()
    # First mode of each column (NaN when a column has no non-missing values)
    desc['mode'] = [num[c].mode().iloc[0] if num[c].notna().any() else np.nan for c in numeric_cols]

    desc = desc[['count', 'miss %', 'card', 'min', '25%', '50%', '75%', 'max', 'mean', 'mode', 'std']]

    print("No. of rows: " + str(n_rows))
    print("No. of cols: " + str(n_cols))

    print("Data types:")
    display(df.dtypes)
    display(df.dtypes.value_counts())
    display(desc)

    if n_num_cols > 5:
        n_hist_rows = -(-n_num_cols // 5)  # ceiling division: five histograms per row
        num.hist(figsize=(20, n_hist_rows * 4), layout=(n_hist_rows, 5), bins=100)
        num.plot(figsize=(20, n_num_cols * 4), layout=(n_num_cols, 1), kind='line', subplots=True)
    else:
        num.hist(figsize=(20, 5), layout=(1, n_num_cols), bins=100)
        num.plot(figsize=(20, 5), layout=(n_num_cols, 1), kind='line', subplots=True)
    plt.show()

    missing_values(df)  # all columns, including non-numeric ones
    corr_graph(num)     # correlation is only defined for the numeric columns
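
The histogram grid needs enough rows to hold every numeric column at five panels per row. A ceiling division gives that count without leaving an empty extra row when the number of columns is an exact multiple of five. The helper `grid_layout` below is a hypothetical name used only for this sketch.

```python
def grid_layout(n_plots, per_row=5):
    """Return (rows, cols) for a subplot grid that holds n_plots panels."""
    if n_plots <= 0:
        raise ValueError("n_plots must be positive")
    cols = min(n_plots, per_row)   # do not reserve more columns than panels
    rows = -(-n_plots // per_row)  # ceiling division
    return rows, cols


for n in (3, 5, 6, 11):
    print(n, grid_layout(n))
```

The returned tuple has the same shape as the `layout=` argument that `DataFrame.hist` accepts.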

In [None]:
pd.set_option('display.max_rows',5000)
data_quality_report(df_Auser_Aquifer_2020)