# In this EDA, I'll Share Some Principles, Practices, and Heuristics That Have Earned Valuable Feedback and Praise in the Data Science Community

### All the Heuristics mentioned below were developed by either [Alberto Cairo](https://www.google.com/search?q=alberto+cairo) or [Edward Tufte](https://www.google.com/search?q=edward+tufte)

## Data-Ink Ratio by Edward Tufte

- Before I dive into the “Five Qualities of Great Visualizations,” there’s another related concept that I want to cover: the data-ink ratio, introduced by Edward Tufte in *The Visual Display of Quantitative Information*.

![Data Ink Ratio](https://miro.medium.com/max/2280/1*4A4CIVrU_lJCsCJBwmTpFQ.png)

- Tufte defines the data-ink ratio as the amount of data-ink divided by the total ink required to print a graphic. Now, I don’t think he asks us to measure the amount of ink laid down on the page. Instead, Tufte suggests we remove those elements that don’t add new information to the graphic.

- Example: the most famous illustration of the data-ink ratio was given by [Dark Horse Analytics](https://www.darkhorseanalytics.com/):
     ![Data Ink](https://lukebeacon.com.au/wp-content/uploads/2020/07/Capture-1060x299.png)
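
Written as a formula, Tufte's definition above reads as follows (the labels are just the quantities named in the text, not measured values):

$$
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
$$

A ratio closer to 1 means more of the ink is carrying data. Tufte also describes it as 1 minus the proportion of the graphic that can be erased without losing data-information, which is why removing redundant elements raises the ratio.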

## Five Qualities of Great Visualizations By Alberto Cairo

### 1. Truthful
-> We need to advocate for true data. Conclusions drawn from data analysis can be subjective, but as data scientists we should make our best effort to protect the truth.

### 2. Functional
-> Consider whether your visualization is functional or not. The data-ink ratio can help increase the functionality of a visualization. There are many heuristics for increasing functionality.

### 3. Beautiful
-> It might sound like a weird quality, but beauty is important to data visualization. To achieve beauty, you need to know your audience. “Beauty is in the eye of the beholder” is a common but accurate expression.

### 4. Insightful
-> A good visualization doesn’t merely replicate data from tables or files. It displays relevant data in a visual format that reveals trends or relationships. When insights are successfully visualized, the viewer gets an “aha!” moment.

### 5. Enlightening
-> Although it sounds similar, enlightening is a different concept from insightful. Cairo says this quality is composed of the previous four: truthful, functional, beautiful, and insightful.

# Data Visualizations

### Importing Libraries and Data

In [None]:
# Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as ms
from scipy import stats
from scipy.stats import norm, skew
%matplotlib inline

#Data
df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

#Looking at the Data
df.head(10)

### This Dataset Contains Both Categorical and Continuous Features. Let's Work with the Continuous Features First

## Continuous Features

- First, let's take a look at the default distribution plot and then see how we can improve it

In [None]:
#Simple And Default Histogram
plt.figure(figsize=(12,10))
plt.hist(df['platelets'],bins=30)
plt.ylabel('Frequency')
plt.xlabel('Platelets');

### According to the Principles

1. The data-ink ratio is not bad in this graph, but the graph can still be improved by adding relevant information.
2. The whole point of making a visualization beautiful and easy on the eye is to remove any unnecessary element (in this case, the boundaries) so that the viewer's attention can focus on the data.
3. We will remove the top and right boundaries, add a grid so values are easy to track, and lastly add some other relevant information.
4. A good way to make a visualization easy on the eye is to use transparency (the `alpha` parameter in Matplotlib) and calming colors.
5. Calming colors are colors that humans interact with regularly, like light blue, orange, and grey.

In [None]:
#Improved Visualization
plt.figure(figsize=(15,10))

#Using a low alpha and orange color
sns.distplot(df['platelets'],ax=plt.gca(),color='orange',fit=norm,kde_kws={'linewidth':2})
plt.tick_params(axis='both', which='major', labelsize=13) #adjusting ticks 

mu,sigma = norm.fit(df['platelets']) # adding additional information to increase Data Ink Ratio
# Raw string so that \m and \s are not treated as (invalid) escape sequences
plt.legend([r'Normal dist. $\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu, sigma)],
            loc='best',frameon=False,fontsize=13)
    
sp = plt.gca().spines
sp['top'].set_visible(False)
sp['right'].set_visible(False)

plt.ylabel('Density',fontsize=14,labelpad=10) # distplot normalizes the histogram when a KDE or fit is drawn
plt.xlabel('Platelets (Counts)',fontsize=14,labelpad=10)
plt.title('Platelets Distribution',fontsize=17,pad=7)

plt.grid( alpha=0.5,color='lightslategrey');
# Adding a grid lowers the data-ink ratio, but in a plot like this
# we cannot label each bar directly, so the grid is what lets readers read off values
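
Note that `sns.distplot` has been deprecated since seaborn 0.11, so the cell above may emit a warning on recent versions. As a rough sketch of what the `fit=norm` overlay amounts to, the same kind of figure can be built with plain Matplotlib and NumPy. The data here is a synthetic stand-in for `df['platelets']` (an assumption, so the block runs on its own), and the colors and sizes simply mirror the cell above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for df['platelets'] (assumption: NOT the real data)
rng = np.random.default_rng(0)
platelets = rng.normal(loc=260_000, scale=98_000, size=299)

# Maximum-likelihood estimates for a normal distribution: the sample mean and
# the standard deviation with ddof=0 (scipy.stats.norm.fit returns the same pair)
mu = platelets.mean()
sigma = platelets.std()

# Evaluate the fitted normal density on a grid spanning the data
grid = np.linspace(platelets.min(), platelets.max(), 200)
pdf = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

fig, ax = plt.subplots(figsize=(15, 10))

# density=True rescales the bars so the histogram integrates to 1,
# which puts it on the same scale as the density curve
ax.hist(platelets, bins=30, density=True, color='orange', alpha=0.4)
ax.plot(grid, pdf, color='k', linewidth=2,
        label=r'Normal dist. $\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu, sigma))

ax.legend(loc='best', frameon=False, fontsize=13)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('Platelets (Counts)', fontsize=14, labelpad=10)
ax.set_ylabel('Density', fontsize=14, labelpad=10)
ax.grid(alpha=0.5, color='lightslategrey')
```

Because the histogram is normalized, the y-axis shows density rather than raw counts, which is why it is labeled that way here.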

## Looking at the Difference :
### You Can Decide Which One is Better

In [None]:
#Making Canvas
canv, axs = plt.subplots(1,2)
canv.set_size_inches(20,10)
canv.tight_layout(pad=4)

#First 'Before' Plot
plt.sca(axs[0])
plt.hist(df['platelets'],bins=30)
plt.ylabel('Frequency')
plt.xlabel('Platelets')
plt.title('Platelets Distribution W/O Using Heuristics')

#Second 'After' Plot 

plt.sca(axs[1])
sns.distplot(df['platelets'],ax=plt.gca(),color='orange',fit=norm,kde_kws={'linewidth':2.5}) 
plt.tick_params(axis='both', which='major', labelsize=13)

mu,sigma = norm.fit(df['platelets']) 
# Raw string so that \m and \s are not treated as (invalid) escape sequences
plt.legend([r'Normal dist. $\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu, sigma)],
            loc='best',frameon=False,fontsize=13)
    
sp = plt.gca().spines
sp['top'].set_visible(False)
sp['right'].set_visible(False)

plt.ylabel('Density',fontsize=14,labelpad=10) # distplot normalizes the histogram when a KDE or fit is drawn
plt.xlabel('Platelets (Counts)',fontsize=14,labelpad=10)
plt.title('Platelets Distribution Using Heuristics',fontsize=17,pad=7)

plt.grid( alpha=0.5,color='lightslategrey');

## Now Scaling All the Continuous Variables to a Common Scale and Plotting All Distributions

In [None]:
df.head(2)

In [None]:
from sklearn.preprocessing import MinMaxScaler

cols = ['creatinine_phosphokinase','ejection_fraction','platelets','serum_sodium'] #Continuous Features
df_cont = df[cols]

scale = MinMaxScaler(feature_range=(0,12))#Scaling to range of [0,12]
scaled = scale.fit_transform(df_cont)

df_sc = pd.DataFrame(data=scaled,columns=cols)
df_sc.head(5)
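
For intuition, min-max scaling is just a linear map of each column onto the target range. The sketch below reproduces that arithmetic with NumPy on a few made-up numbers; `min_max_scale` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=12.0):
    """Linearly map the values of x onto [lo, hi] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    # Shift so the minimum is 0, stretch so the maximum is (hi - lo), then offset by lo
    return lo + (x - x_min) * (hi - lo) / (x_max - x_min)

# Made-up values, chosen only to illustrate the arithmetic
sample = np.array([20.0, 35.0, 50.0, 80.0])
scaled_sample = min_max_scale(sample)
print(scaled_sample)  # → [ 0.  3.  6. 12.]
```

The smallest value lands on 0, the largest on 12, and everything in between keeps its relative position, so the shape of each distribution is unchanged. This sketch does not guard against a constant column, where the denominator would be zero.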

## Making the Final Visualization of the Continuous Feature Distributions

In [None]:
#Making Canvas
canv, axs = plt.subplots(2,2)
canv.set_size_inches(20,18)
canv.tight_layout(pad=10)

#Plotting

# Pair each subplot with its column instead of keeping a manual counter
for ax, col in zip(axs.flat, cols):   # A little bit of automation is not bad, right?
    plt.sca(ax)
    sns.distplot(df_sc[col],ax=ax,color='orange',
                 fit=norm,kde_kws={'linewidth':2.5})
    plt.tick_params(axis='both', which='major', labelsize=13)

    mu,sigma = norm.fit(df_sc[col])
    # Raw string so that \m and \s are not treated as (invalid) escape sequences
    plt.legend([r'Normal dist. $\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu, sigma)],
               loc=1,frameon=False,fontsize=13)

    sp = ax.spines
    sp['top'].set_visible(False)
    sp['right'].set_visible(False)

    # distplot normalizes the histogram when a KDE or fit is drawn, so the y-axis is a density
    plt.ylabel('Density',fontsize=14,labelpad=10)
    plt.xlabel(col,fontsize=14,labelpad=10)
    plt.title('{} Distribution Using Heuristics'.format(col),fontsize=17,pad=10)

    plt.grid(alpha=0.5,color='lightslategrey')

## Categorical Features :
Using the same principles, let's make plots for the categorical data

### Converting time into Bins


In [None]:
# Split follow-up time into 5 equal-width bins; labels=False returns the
# integer bin index (0-4) directly, so no separate LabelEncoder step is needed
df['time'] = pd.cut(df['time'],bins=5,labels=False)
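
To make the binning step concrete, here is a minimal sketch of equal-width binning with `pd.cut` on a handful of made-up follow-up times (illustrative values, not rows from the dataset).

```python
import pandas as pd

# Made-up follow-up times in days (illustrative only, not from the dataset)
times = pd.Series([4, 60, 120, 200, 285])

# bins=5 splits the observed range [4, 285] into five equal-width intervals,
# each (285 - 4) / 5 = 56.2 days wide and closed on the right
intervals = pd.cut(times, bins=5)

# labels=False returns the integer index of each interval instead of the interval itself
codes = pd.cut(times, bins=5, labels=False)
print(codes.tolist())  # → [0, 0, 2, 3, 4]
```

Both 4 and 60 fall into the first interval (its upper edge is 60.2) and no value lands in the second one, which shows that equal-width bins can end up unevenly populated.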

In [None]:
ccol = ['anaemia','diabetes','high_blood_pressure','sex','smoking','time']

### Comparing Do's and Don'ts
### You can see why the latter is better

In [None]:
col = ccol[2]   # 'high_blood_pressure'

# Count patients for every (DEATH_EVENT, category) pair
pt = pd.crosstab(df['DEATH_EVENT'],df[col])
x = pt.columns.to_numpy()       # bar positions, in the same order as the counts
alive = pt.loc[0].to_numpy()    # DEATH_EVENT == 0
dead = pt.loc[1].to_numpy()     # DEATH_EVENT == 1

#Making Canvas
canv, axs = plt.subplots(1,2)
canv.set_size_inches(20,10)
canv.tight_layout(pad=4)

#First 'Don't' plot
plt.sca(axs[0])
plt.bar(x-0.4,alive,width=0.4,align='center',label='Not Dead')
plt.bar(x,dead,width=0.4,align='center',label='Dead')
plt.xticks(x-0.2,x)
plt.ylabel('Number of Patients',fontsize=14)
plt.title(col,fontsize=14)

#Second 'Do' plot
plt.sca(axs[1])
plt.title(col,fontsize=14)

bars = plt.bar(x-0.4,alive,width=0.4,align='center',label='Not Dead',
               color='lightslategrey',alpha=0.9)

for bar,value in zip(bars,alive):
    plt.text((bar.get_x()+0.158),(bar.get_height()-5),'{}'.format(value),
             color='white',fontsize=18)

bars = plt.bar(x,dead,width=0.4,align='center',label='Dead',
               color='orange',alpha=0.8)

# Label the 'Dead' bars with the 'Dead' counts (they were labeled with the 'Not Dead' counts before)
for bar,value in zip(bars,dead):
    plt.text((bar.get_x()+0.158),(bar.get_height()-5),'{}'.format(value),
             color='white',fontsize=18)

plt.legend(fontsize=12,frameon=False)
plt.xticks(x-0.2,x)
plt.ylabel('Number of Patients',fontsize=14)

# Remove every boundary: the direct labels make the axes redundant
for key,spine in plt.gca().spines.items():
    spine.set_visible(False)

plt.tick_params(axis='x', which='both',length=0,labelsize=12)
plt.tick_params(axis='y', which='both',length=0,labelsize=0);
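
The direct labels in the second plot are placed with hand-tuned `plt.text` offsets. As a hedged alternative, Matplotlib 3.4 and later provide `Axes.bar_label`, which positions the labels for you. The sketch below uses a tiny made-up table in place of `df` (illustrative values only) and counts the groups with `pd.crosstab`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Tiny made-up table standing in for df (illustrative values only)
toy = pd.DataFrame({
    'high_blood_pressure': [0, 0, 0, 1, 1, 0, 1, 0],
    'DEATH_EVENT':         [0, 0, 1, 1, 0, 0, 1, 0],
})

# Rows: DEATH_EVENT, columns: high_blood_pressure, cells: patient counts
counts = pd.crosstab(toy['DEATH_EVENT'], toy['high_blood_pressure'])
x = counts.columns.to_numpy()

fig, ax = plt.subplots()
alive = ax.bar(x - 0.2, counts.loc[0], width=0.4, label='Not Dead',
               color='lightslategrey', alpha=0.9)
dead = ax.bar(x + 0.2, counts.loc[1], width=0.4, label='Dead',
              color='orange', alpha=0.8)

# bar_label (Matplotlib 3.4+) writes each bar's height just above the bar
alive_labels = ax.bar_label(alive, padding=3)
dead_labels = ax.bar_label(dead, padding=3)

# With every bar labeled directly, the spines and y ticks carry no extra information
for spine in ax.spines.values():
    spine.set_visible(False)
ax.tick_params(axis='y', which='both', length=0, labelleft=False)
ax.set_xticks(x)
ax.legend(frameon=False)
```

Since the labels follow the bars automatically, the offsets do not need retuning when the figure size or the counts change.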

## Final Categorical Visualization

In [None]:
canv , axs = plt.subplots(2,3,sharey=False)
canv.set_size_inches(20,18)
canv.tight_layout(pad=5)

for cnt,(axis,col) in enumerate(zip(axs.flat,ccol)):   # Automation is awesome!
    # Count patients for every (DEATH_EVENT, category) pair. crosstab needs no
    # helper 'values' column, so 'smoking' is no longer a special case
    pt = pd.crosstab(df['DEATH_EVENT'],df[col])
    x = pt.columns.to_numpy()       # bar positions, in the same order as the counts
    alive = pt.loc[0].to_numpy()    # DEATH_EVENT == 0
    dead = pt.loc[1].to_numpy()     # DEATH_EVENT == 1

    plt.sca(axis)
    plt.title(col,fontsize=14)

    bars = plt.bar(x-0.4,alive,width=0.4,align='center',label='Not Dead',
                   color='lightslategrey',alpha=0.9)

    for bar,value in zip(bars,alive):
        plt.text((bar.get_x()+0.11),(bar.get_height()+1.1),'{}'.format(value),
                 color='k',fontsize=14)

    bars = plt.bar(x,dead,width=0.4,align='center',label='Dead',
                   color='orange',alpha=0.7)

    for bar,value in zip(bars,dead):
        plt.text((bar.get_x()+0.11),(bar.get_height()+1.1),'{}'.format(value),
                 color='k',fontsize=14)

    plt.legend(fontsize=12,frameon=False)
    plt.xticks(x-0.2,x)
    if cnt == 0 or cnt == 3:   # y label only on the first plot of each row
        plt.ylabel('Number of Patients',fontsize=14)

    for key,spine in plt.gca().spines.items():
        spine.set_visible(False)

    plt.tick_params(axis='x', which='both',length=0,labelsize=12)
    plt.tick_params(axis='y', which='both',length=0,labelsize=0)

## This Notebook is a Work in Progress, Upvote if you want to see more
## If you have an idea, Share and I will add it as soon as possible
## Feedback is Appreciated
## And As Always Thank You For Scrolling