## Data Visualization Tools and Libraries

### Part 2: Exploratory Data Analysis in Data Science through Matplotlib, Seaborn, Pandas and NumPy in Python

We will cover the following topics in this session:
1. Loading Data and Data Inspection using Pandas
2. Data Distribution using Histograms
3. Univariate Analysis using Bar Plots, Scatter Plots, Area Plots, Pie Charts, Point Plots
4. Bivariate analysis using Pairplots, Box plots, Violin Plots, Stacked plots
5. Missing Value Visualization
6. Correlation Visualization using Heatmaps
7. Customizing Matplotlib and Seaborn Visualization
8. Bonus section: Word Clouds, Sankey Plots, Radar plots


### About the dataset

Heart Disease Dataset: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data

The Heart Disease Dataset is used to classify whether or not a patient will have heart disease, which is a binary classification problem. The data used for this assignment is obtained from Kaggle and is a processed version of the original Cleveland database from the UCI datasets.

This problem and dataset are interesting for three reasons. First, the dataset has 13 feature attributes of different types (a combination of continuous and categorical values), and the continuous values are on completely different scales, so it will be interesting to see how Machine Learning algorithms perform on it. Second, the number of records is limited (around 303) and some outliers are observed within the data, so it is a good test of how well Machine Learning algorithms generalize from a small, noisy dataset. Finally, predicting heart disease has real-life applications, which is why I felt this would be a good problem to investigate.


### Data Description

1. `age`: age of the patient in years
2. `gender`: Male/Female
3. `chest_pain_type`: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])
4. `resting_bp`: resting blood pressure (in mm Hg on admission to the hospital)
5. `cholestoral`: serum cholesterol in mg/dl
6. `fasting_blood_sugar`: whether fasting blood sugar > 120 mg/dl
7. `restecg`: resting electrocardiographic results ([normal, stt abnormality, lv hypertrophy])
8. `max_hr`: maximum heart rate achieved
9. `exang`: exercise-induced angina (True/False)
10. `oldpeak`: ST depression induced by exercise relative to rest
11. `slope`: the slope of the peak exercise ST segment
12. `num_major_vessels`: number of major vessels (0-3) colored by fluoroscopy
13. `thal`: [normal; fixed defect; reversible defect]
14. `num`: the predicted attribute (the code in this notebook refers to the label column as `target`)

### Importing libraries in Python

In [1]:
# Importing the python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading data and Data inspection

In [2]:
# Downloading data and loading data as a data frame
data = pd.read_csv("https://raw.githubusercontent.com/adib0073/Educative_SSDS_course/main/data/heart_disease.csv")
data.head()

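Remote downloads can fail, for example when the connection times out. The following is a minimal sketch of a loader that falls back to a local copy of the file. The helper name `load_csv_with_fallback` and the local file name `heart_disease.csv` are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def load_csv_with_fallback(primary, fallback):
    """Read `primary` (URL or path); on an I/O error read `fallback` instead."""
    try:
        return pd.read_csv(primary)
    except OSError as err:
        # URLError, HTTPError and FileNotFoundError all derive from OSError
        print(f"Could not read {primary!r} ({err}); using {fallback!r} instead")
        return pd.read_csv(fallback)


# Hypothetical usage, assuming a copy of the CSV was saved next to the notebook:
# data = load_csv_with_fallback(
#     "https://raw.githubusercontent.com/adib0073/Educative_SSDS_course/main/data/heart_disease.csv",
#     "heart_disease.csv",
# )
```

Only I/O errors are caught, so a malformed file still raises a parsing error instead of being silently replaced.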

In [None]:
data.columns

In [None]:
data.shape

In [None]:
data.info()

### Checking Missing Values

In [None]:
data.isnull().sum()

In [None]:
# Let's check missing values
sns.displot(
    data=data.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    height=4,
    aspect=1.3
)
plt.show()
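To see what the missing-value plot is built from, here is a small sketch on a toy data frame. The values are made up for illustration and are not taken from the heart disease data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry per column
toy = pd.DataFrame({"age": [63, np.nan, 41], "max_hr": [150, 172, np.nan]})

# isna() gives a True/False frame; melt() stacks it into long format
# with one row per cell: the column name in "variable", the flag in "missing"
long_form = toy.isna().melt(value_name="missing")
print(long_form)

# Share of missing cells per column, which is what the filled bars display
missing_share = long_form.groupby("variable")["missing"].mean()
print(missing_share)
```

Each bar in the plot therefore shows, for one column, the proportion of cells that are missing versus present.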

### Predictor variable analysis

In [None]:
# Separating Categorical and Continuous Numerical Variables
categorical = ['gender','chest_pain_type', 'fasting_blood_sugar', 'num_major_vessels', 'restecg', 'exang', 'slope', 'thal']
continuous = ['age', 'resting_bp', 'cholestoral', 'max_hr', 'oldpeak']

In [None]:
# Descriptive Statistics for Numerical Variables

data[continuous].describe().T

In [None]:
# Frequency for Categorical Variables
for variable in categorical:
  print(data[variable].value_counts())


### Data Distribution Check using Histogram

In [None]:
data.hist(layout = (3,5), figsize=(14,9), color = 'r')
print('Data Distribution')

### Univariate Analysis

##### **Checking Frequency of Categorical Variables using Bar Plots**

In [None]:
sns.countplot(x="target", data=data, palette="mako_r")
plt.show()
print('This looks like a fairly balanced dataset, as distribution of majority and minority class is around 55:45')
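The class ratio can also be checked numerically instead of being read off the bar plot. The sketch below uses made-up labels with a 55:45 split. The `balance_ratio` measure is just one simple way to summarize balance and is not something the original notebook computes.

```python
import pandas as pd

# Made-up labels: 55 positives and 45 negatives
labels = pd.Series([1] * 55 + [0] * 45, name="target")

# normalize=True returns proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)

# Ratio of minority to majority class; values close to 1 indicate balance
balance_ratio = proportions.min() / proportions.max()
print(f"minority/majority ratio: {balance_ratio:.2f}")
```

On the real data, the same check would be `data["target"].value_counts(normalize=True)`.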

#### **Using Pie charts instead of bar plots**

In [None]:
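# sort_index() orders counts by class value so the labels match (assumes 0 = no disease, 1 = disease)
data.target.value_counts().sort_index().plot(kind='pie', labels=['no heart disease', 'heart disease'], autopct='%2.0f%%')
plt.show()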

In [None]:
sns.countplot(x='gender', data=data, palette="bwr_r")
plt.xlabel("Gender (0 = female, 1= male)")
plt.show()

##### **Data Density using Area Plot**


In [None]:
sns.kdeplot(x='cholestoral',
            data=data,
            fill = True ,
            color = "Red")

In [None]:
# Custom area plots
# Plotting a line plot first
plt.plot(range(len(data)),data["cholestoral"])
# Using fill_between to highlight the area covered under the plot
plt.fill_between(range(len(data)),data["cholestoral"], color="skyblue", alpha=0.4)
plt.show()

### Bivariate Analysis

In [None]:
print('Analysing distribution of target and gender (0-female 1-male)')
sns.countplot(x = data['gender'], hue = data['target'], palette='bwr_r')
plt.legend(["Diseased", "Not Diseased"])
plt.xlabel("Gender (0 = female, 1= male)")
plt.show()

##### **Scatter Plots**

In [None]:
# Scatter Plots
sns.scatterplot(x="target",
                y="cholestoral",
                data=data,
                hue='target')

##### **Linked Bar Plots**

In [None]:
# Linked Bar Plot
pd.crosstab(data.age,data.target).plot(kind="bar",figsize=(20,6), color = ['g','r'])
plt.title('Heart Disease Distribution by Patient Age')
plt.legend(["Diseased", "Not Diseased"])
plt.xlabel('Age')
plt.ylabel('Counts')
plt.show()
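The linked bar plot is drawn from a contingency table built by `pd.crosstab`. Here is a small sketch on made-up values that shows what this table looks like before it is plotted.

```python
import pandas as pd

# Made-up ages and labels for illustration only
toy = pd.DataFrame({"age": [45, 45, 52, 52, 52], "target": [0, 1, 1, 1, 0]})

# Rows are the distinct ages, columns are the target classes,
# and each cell counts the patients with that combination
counts = pd.crosstab(toy["age"], toy["target"])
print(counts)
```

Calling `.plot(kind="bar")` on this table draws one group of bars per row and one bar per column.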

##### **Stacked Bar Plots**

In [None]:
plt.bar(data.age[data.target==1], data.max_hr[(data.target==1)], color="red")
plt.bar(data.age[data.target==0], data.max_hr[(data.target==0)], color="grey")
plt.legend(["Diseased", "Not Diseased"])
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()

#### **Point plots**

In [None]:
sns.pointplot(
    data=data, x="target", y="max_hr",
    hue="target",
    errorbar=("pi", 100), capsize=.4, join=False,
)

#### **Pair plot**

In [None]:
# Pair plots in Python
sns.pairplot(data=data[continuous + ['target']], hue='target')

### Duplicate data inspection

In [None]:
data.duplicated().any()

In [None]:
data.drop_duplicates(subset=None, inplace=True)
data.duplicated().any()

In [None]:
data.shape

### Outlier Detection

In [None]:
def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.95):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def check_outlier(dataframe, col_name):
    # True if any value in col_name lies outside the threshold limits
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    outside = (dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)
    return bool(outside.any())

def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In [None]:
outliers = []
# Outlier detection for continuous variables
for col in continuous:
    print(f"{col} :  {check_outlier(data, col)}")
    outliers.append(check_outlier(data, col))

print(f"Outliers detected? {np.array(outliers).any()}")
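Note that `outlier_thresholds` defaults to the 5th and 95th percentiles rather than the quartiles (25th and 75th), which gives much wider limits. The sketch below works through the arithmetic on made-up values to show the difference. The helper names `limits` and `flagged` are introduced here for illustration only.

```python
import pandas as pd

# Made-up values with one extreme reading
values = pd.Series([200, 210, 220, 230, 240, 250, 260, 270, 280, 900])


def limits(series, lower, upper):
    """Return (low_limit, up_limit) using the same 1.5 * spread rule as above."""
    lo, hi = series.quantile(lower), series.quantile(upper)
    spread = hi - lo
    return lo - 1.5 * spread, hi + 1.5 * spread


def flagged(series, lower, upper):
    """List the values that fall outside the limits."""
    low_limit, up_limit = limits(series, lower, upper)
    return series[(series < low_limit) | (series > up_limit)].tolist()


wide = flagged(values, 0.05, 0.95)     # percentiles used by outlier_thresholds
classic = flagged(values, 0.25, 0.75)  # quartile-based IQR rule

print("5%/95% limits:", limits(values, 0.05, 0.95), "flagged:", wide)
print("25%/75% limits:", limits(values, 0.25, 0.75), "flagged:", classic)
```

With the wide limits the reading of 900 is not flagged, while the quartile-based rule flags it, so the choice of `q1` and `q3` decides how aggressive the check is.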

##### **Box Plots**

In [None]:
# Box Plots
sns.boxplot(x="target",
            y="cholestoral",
            data=data,
            hue='target')

##### **Violin Plots**

In [None]:
# Violin Plots
sns.violinplot(x="target",
               y="cholestoral",
               data=data,
               hue='target')

### Data Correlation Check

In [None]:
data.corr()

##### **Heatmap Plots**

In [None]:
plt.figure(figsize=(7,5))
sns.heatmap(data.corr(),
            cmap="YlGnBu")

In [None]:
sns.set_style(style="darkgrid")
print(data.corr()['target'])
corr = data.corr()
plt.figure(figsize=(10,5))
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.4)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 9});
plt.show()
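`corr()` computes correlations between numeric columns. If a data frame also holds text columns (the Data Description lists values such as `normal` for `thal`), the call may raise an error, depending on the pandas version. A minimal sketch, using made-up values, that selects the numeric columns first:

```python
import pandas as pd

# Made-up values; "thal" is stored as text here for illustration
toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "max_hr": [180, 170, 160, 150],
    "thal": ["normal", "fixed defect", "normal", "reversible defect"],
})

# Keep only numeric columns before computing the correlation matrix
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

The resulting matrix can be passed to `sns.heatmap()` exactly as in the cells above.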

### Customizing Matplotlib and Seaborn Visualization

So far, we have seen different types of plots that we can create with `matplotlib.pyplot`. However, we may also need to customize our plots based on our needs and requirements.

### Markers
Markers in Matplotlib are used to emphasize the data points with a symbol of a specified size and color. The following keywords are used to customize the corresponding elements of the plots:

`marker` : marker type

`ms` : size

`mec` : edge color

`mfc` : face color

Let's see a code example for customizing the marker in matplotlib.

In [None]:
data_val = np.array([11, 2, 3, 15, 8, 6])
plt.plot(data_val, marker = '*', ms = 30, mec = 'purple', mfc = 'hotpink')
plt.show()

### Line style

Along with the marker, we can also customize the line style in line plots using Matplotlib. The keyword `linestyle` or `ls` (its short form) is used to customize the line style. We can use `linewidth` and `color` to customize the width and color of the line, respectively.

In [None]:
data_val = np.array([11, 2, 3, 15, 8, 6])
plt.plot(data_val, ls='--', linewidth =2, color='red')
plt.show()

### Labels
As shown in the following code demonstration, we can use `xlabel()`, `ylabel()` and `title()` from `pyplot` to add x and y axes labels and the plot title, respectively.

In [None]:
data_val = np.array([11, 2, 3, 15, 8, 6])
plt.plot(data_val, marker = '*', ms = 30, mec = 'purple', mfc = 'hotpink')
plt.xlabel('time intervals')
plt.ylabel('Counts')
plt.title("Counts vs time interval plot")
plt.show()

### Grids
As shown in the following code demonstration, we can use `grid()` from `pyplot` to display gridlines in the plot. We can also restrict the gridlines to one axis using the `axis` parameter.

In [None]:
data_val = np.array([11, 2, 3, 15, 8, 6])
plt.plot(data_val, marker = '*', ms = 30, mec = 'purple', mfc = 'hotpink')
plt.grid(axis = 'y') #or plt.grid() for both x and y axes
plt.show()

### Sub-plots
The `subplot()` function in `pyplot` allows us to draw multiple plots in one figure. We specify the number of rows, the number of columns, and the index of the current plot in `subplot()`, as shown in the following code.

In [None]:
#plot 1
plt.subplot(2, 1, 1) #parameters: number of rows, number of columns, index of this plot
data1 = np.array([11, 2, 3, 15, 8, 6])
plt.plot(data1, marker = '*', ms = 30, mec = 'purple', mfc = 'hotpink')

#plot 2
plt.subplot(2, 1, 2)
data2 = np.array([3, 5, 1, 11, 7, 5])
plt.plot(data2, marker = 'o', ms = 30, mec = 'green', mfc = 'skyblue')
plt.show()
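An alternative to calling `plt.subplot()` repeatedly is `plt.subplots()`, which creates the figure and all axes in one call. The following sketch redraws the two plots above this way. The figure size, the titles and `sharex=True` are choices made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

data1 = np.array([11, 2, 3, 15, 8, 6])
data2 = np.array([3, 5, 1, 11, 7, 5])

# One figure with 2 rows and 1 column; axes is an array holding two Axes objects
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(6, 6), sharex=True)

axes[0].plot(data1, marker='*', ms=30, mec='purple', mfc='hotpink')
axes[0].set_title("plot 1")

axes[1].plot(data2, marker='o', ms=30, mec='green', mfc='skyblue')
axes[1].set_title("plot 2")

# Adjust spacing so titles and tick labels do not overlap
fig.tight_layout()
plt.show()
```

Working with explicit `Axes` objects makes it clear which plot each command applies to, which helps once a figure has more than two or three panels.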

Take a look at this document for other customization options in Matplotlib: https://matplotlib.org/stable/users/explain/customizing.html

### Customisation for the Seaborn Library

**Plot themes**

Seaborn provides `set_theme()` to set the plot theme. When `set_theme()` is not called, plots keep Matplotlib's default appearance. Calling `set_theme()` without arguments applies the `'darkgrid'` style. To change only the plot style, use `set_style()`. I recommend you take a look at the Seaborn documentation to find out more about the different themes available in Seaborn.

In [None]:
# dark grid theme
sns.set_theme()
sns.countplot(x = data['gender'], hue = data['target'], palette='bwr_r')


In [None]:
# white-grid pastel theme
sns.set_theme(style='whitegrid', palette="pastel")
sns.countplot(x = data['gender'], hue = data['target'], palette='bwr_r')
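`set_theme()` changes the style for all later plots. If a style should apply to a single figure only, `sns.axes_style()` can be used as a context manager. A minimal sketch, with made-up values for the line plot:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# A style is a dictionary of Matplotlib parameters that can be inspected
darkgrid = sns.axes_style("darkgrid")
white = sns.axes_style("white")
print(darkgrid["axes.grid"], white["axes.grid"])

# The style is active only for figures created inside the with-block
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots()
    ax.plot([11, 2, 3, 15, 8, 6])
```

Figures created after the block use whatever style was active before it.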

### Color palettes
Seaborn also offers different color palettes to customize the plot visualizations. The `color_palette()` function allows us to select a color palette of our choice. Color palettes are usually categorized as `Qualitative`, `Sequential`, and `Diverging`.

Take a look at this document for more color palettes: https://seaborn.pydata.org/tutorial/color_palettes.html

Let us see some code examples for some of the different color palettes in Seaborn.

In [None]:
# Qualitative
sns.set_theme(style='whitegrid', palette=sns.color_palette())
sns.countplot(x = data['gender'], hue = data['target'])

In [None]:
# Diverging ('coolwarm' runs from one hue to another through a neutral midpoint)
sns.set_theme(style='whitegrid', palette=sns.color_palette('coolwarm', 3))
sns.countplot(x = data['gender'], hue = data['target'])

In [None]:
# Sequential
sns.set_theme(style='whitegrid', palette=sns.color_palette('Oranges', 3))
sns.countplot(x = data['gender'], hue = data['target'])
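A palette returned by `color_palette()` is simply a list of RGB colors, so it can be inspected before it is used in a plot. The sketch below builds one palette of each kind. The palette names `deep`, `Oranges` and `coolwarm` are picked here as examples.

```python
import seaborn as sns

qualitative = sns.color_palette("deep")        # distinct hues for categories
sequential = sns.color_palette("Oranges", 3)   # light to dark for ordered values
diverging = sns.color_palette("coolwarm", 5)   # two hues around a neutral midpoint

print(len(qualitative), len(sequential), len(diverging))

# Each entry is an (r, g, b) tuple; as_hex() gives the hex codes
print(sequential.as_hex())
```

In a notebook, evaluating a palette as the last expression of a cell displays its colors as swatches.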

### Bonus section: Word Clouds, Sankey Plots, Radar plots

### Word Clouds

In [None]:
from wordcloud import WordCloud

text = "Research in the Augment group focuses on Human-Centered AI, an emerging discipline that aims to amplify and augment human abilities and preserve human control in order to make AI systems more productive, enjoyable, and fair. The objective is to enable end-users to understand the rationale of AI models and to enable them to steer models with input and feedback.  The approach is researched to increase appropriate trust, acceptance and accuracy of models and to empower users to be an active and responsive part-taker in data-driven solutions. The focus is on visualisation and interaction techniques, using the full spectrum of hardware from small mobile devices to large multi-touch displays. Applications include learning analytics, precision agriculture, healthcare, media consumption, digital humanities, food & nutrition, fintech and human resources."

# Generate a word cloud image
wordcloud = WordCloud(background_color="white",).generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

#### Sankey Plots using Plotly (discussed more in the next session)

In [None]:
import plotly.graph_objects as go
import json
import urllib.request  # urlopen lives in the urllib.request submodule

In [None]:
url = 'https://raw.githubusercontent.com/plotly/plotly.js/master/test/image/mocks/sankey_energy.json'
response = urllib.request.urlopen(url)
# Use separate names so the heart disease DataFrame `data` is not overwritten
sankey = json.loads(response.read())
nodes = sankey['data'][0]['node']
links = sankey['data'][0]['link']

# override gray link colors with 'source' colors
opacity = 0.4
# change 'magenta' to its 'rgba' value to add opacity
nodes['color'] = ['rgba(255,0,255, 0.8)' if color == "magenta" else color for color in nodes['color']]
links['color'] = [nodes['color'][src].replace("0.8", str(opacity))
                  for src in links['source']]

In [None]:
fig = go.Figure(data=[go.Sankey(
    valueformat = ".0f",
    valuesuffix = "TWh",
    # Define nodes
    node = dict(
      pad = 15,
      thickness = 15,
      line = dict(color = "black", width = 0.5),
      label =  nodes['label'],
      color =  nodes['color']
    ),
    # Add links
    link = dict(
      source =  links['source'],
      target =  links['target'],
      value =  links['value'],
      label =  links['label'],
      color =  links['color']
))])

fig.update_layout(title_text="Energy forecast for 2050<br>Source: Department of Energy & Climate Change, Tom Counsell via <a href='https://bost.ocks.org/mike/sankey/'>Mike Bostock</a>",
                  font_size=10)

### Radar Plots in Python

In [None]:
from math import pi
# Set data
df = pd.DataFrame({
'group': ['A','B','C','D'],
'var1': [38, 1.5, 30, 4],
'var2': [29, 10, 9, 34],
'var3': [8, 39, 23, 24],
'var4': [7, 31, 33, 14],
'var5': [28, 15, 32, 14]
})

In [None]:
# ------- PART 1: Create background

# number of variables
categories=list(df)[1:]
N = len(categories)

# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]

# Initialise the spider plot
ax = plt.subplot(111, polar=True)

# If you want the first axis to be on top:
ax.set_theta_offset(pi / 2)
ax.set_theta_direction(-1)

# Draw one axis per variable + add labels
plt.xticks(angles[:-1], categories)

# Draw ylabels
ax.set_rlabel_position(0)
plt.yticks([10,20,30], ["10","20","30"], color="grey", size=7)
plt.ylim(0,40)

# ------- PART 2: Add plots

# Plot each individual = each line of the data
# I don't make a loop, because plotting more than 3 groups makes the chart unreadable

# Ind1
values=df.loc[0].drop('group').values.flatten().tolist()
values += values[:1]
ax.plot(angles, values, linewidth=1, linestyle='solid', label="group A")
ax.fill(angles, values, 'b', alpha=0.5)

# Ind2
values=df.loc[1].drop('group').values.flatten().tolist()
values += values[:1]
ax.plot(angles, values, linewidth=1, linestyle='solid', label="group B")
ax.fill(angles, values, 'r', alpha=0.5)

# Add legend
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))

# Show the graph
plt.show()


Explore the Python Graph Gallery for more visualizations in Python: https://python-graph-gallery.com/