<a href="https://colab.research.google.com/github/shaneahmed/StatswithPython/blob/main/03-DataVisualisation.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://kaggle.com/kernels/welcome?src=https://github.com/shaneahmed/StatswithPython/blob/main/03-DataVisualisation.ipynb" target="_blank"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open In Kaggle"/></a>

# Data Visualisation Techniques

### by Shan E Ahmed Raza
---



**Name: (Please write your name and ID here prior to submission)**



---
In this notebook we will focus on data visualisation techniques using Python. We will mainly work with three Python packages: pandas, seaborn and matplotlib. We will also calculate skewness, kurtosis and the correlation between multiple variables.


Let's start by installing the required libraries.

In [None]:
!pip install pandas seaborn matplotlib scipy scikit-learn

Import the required libraries.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import math
from scipy import stats

We will start by loading the built-in `iris` data set from the `seaborn` package. The [iris (Fisher data set)](https://archive.ics.uci.edu/ml/datasets/iris) is perhaps the best known database to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The data set has the following attributes:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:\
    -- Iris Setosa\
    -- Iris Versicolour\
    -- Iris Virginica

In [None]:
# you can either download the iris data from sklearn or seaborn package. 
# To download from sklearn you can use the following code
import numpy as np
from sklearn import datasets
iris = datasets.load_iris() # load the data set
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target']
                    ) # convert to pandas dataframe
iris # display the data

In [None]:
# seaborn provides the iris data as a dataframe with the species well-defined,
# so we can use this as well.
iris = sns.load_dataset('iris') 

Let's have a look at the data set

In [None]:
iris

## Distribution
Let's plot the distribution of sepal length for the whole data set.

### Histogram

In [None]:
n, bins, patches = plt.hist(x=iris['sepal_length'], bins='auto', color='#0504aa', rwidth=0.80)
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Sepal Histogram')

    The y-axis shows the grouped frequency distribution of sepal length. Try changing the number of bins to experiment with the class interval.
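
The effect of the number of bins on the grouped frequencies can also be checked without drawing anything. The following is a minimal sketch using `np.histogram`; the `values` array is a small made-up sample (an assumption for illustration, not the iris data).

```python
import numpy as np

# Hypothetical sample of lengths in cm (illustrative values, not the iris data)
values = np.array([4.6, 4.9, 5.0, 5.2, 5.4, 5.8, 6.0, 6.3, 6.7, 7.1])

# np.histogram returns the count in each class interval and the interval edges
for n_bins in (2, 5):
    counts, edges = np.histogram(values, bins=n_bins)
    width = edges[1] - edges[0]
    print(f"{n_bins} bins, class width {width:.2f} cm: counts = {counts.tolist()}")

counts_2, _ = np.histogram(values, bins=2)
counts_5, _ = np.histogram(values, bins=5)
```

Fewer bins give wider class intervals and a coarser picture; more bins show more detail but each interval holds fewer observations.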

### Density Plot
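A density plot is a smooth, continuous version of a histogram estimated from the data. We have already discussed skewness and kurtosis; the cell below prints both alongside the plot so they can be compared with the shape of the distribution.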

In [None]:
# stat="density" normalises the histogram so the y-axis matches the 'Density' label
ax = sns.displot(iris['sepal_length'], kde=True, stat="density")
_ = ax.set(xlabel='Sepal Length', ylabel='Density')
print("Kurtosis = " + str(stats.kurtosis(iris['sepal_length']))) # Display excess (Fisher) kurtosis
print("Skewness = " + str(stats.skew(iris['sepal_length']))) # Display skewness

    Try comparing skewness and kurtosis for the remaining variables with the shape of distribution. 
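
To see what these two numbers measure, they can be computed directly from central moments. This is a minimal sketch, assuming the moment-based (biased) definitions, which are the ones `stats.skew` and `stats.kurtosis` use with their default arguments as far as I am aware. The helper names and the two small samples are hypothetical.

```python
import numpy as np

def central_moment(x, k):
    """k-th central moment: the mean of (x - mean)^k."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** k)

def skewness(x):
    """Moment-based (biased) skewness: m3 / m2^1.5."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def excess_kurtosis(x):
    """Moment-based (biased) excess kurtosis: m4 / m2^2 - 3."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric sample
right_tailed = [1, 1, 2, 2, 3, 10]  # hypothetical sample with a long right tail

print("symmetric:   ", skewness(symmetric), excess_kurtosis(symmetric))
print("right-tailed:", skewness(right_tailed), excess_kurtosis(right_tailed))
```

A symmetric sample has skewness close to zero, while a long right tail gives a positive value.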

    You can also draw the density plot without showing the histogram.

In [None]:
ax = sns.displot(iris['sepal_length'], kind="kde")
_ = ax.set(xlabel='Sepal Length', ylabel='Density')
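
The smooth curve drawn with `kind="kde"` is a kernel density estimate: a small Gaussian bump is placed on every observation and the bumps are averaged. The following is a minimal sketch of that idea, assuming a fixed, hand-picked bandwidth (seaborn chooses its bandwidth automatically, so its curve will differ). The function name and sample are hypothetical.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth):
    """Average of Gaussian kernels centred on each observation."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every observation
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

sample = [4.9, 5.0, 5.1, 5.4, 6.3, 6.5]  # hypothetical lengths in cm
grid = np.linspace(2.0, 9.0, 701)
density = gaussian_kde_1d(sample, grid, bandwidth=0.3)

# A density integrates to 1; check this with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(f"area under the curve = {area:.4f}")
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual observations more closely.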

    You can also plot the distribution of sepal length for each iris species.

In [None]:
ax = sns.displot(iris, x="sepal_length", hue="species", kind="kde")
_ = ax.set(xlabel='Sepal Length', ylabel='Density')

### Cumulative Distribution

In [None]:
ax = sns.displot(iris, x="sepal_length", kind="ecdf")
_ = ax.set(xlabel='Sepal Length', ylabel='Proportion')
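
The empirical cumulative distribution plotted above gives, for each value, the proportion of observations less than or equal to it. A minimal sketch of that calculation follows; the `ecdf` helper and the sample are hypothetical.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the proportion of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    proportion = np.arange(1, x.size + 1) / x.size
    return x, proportion

sample = [5.1, 4.9, 6.3, 5.8, 7.1]  # hypothetical lengths in cm
x, proportion = ecdf(sample)
for value, p in zip(x, proportion):
    print(f"{value:.1f} cm -> proportion {p:.1f}")
```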

### Box Plot
A box plot shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the data set, while the whiskers extend to show the rest of the distribution, except for points that are classed as outliers, which are drawn individually.

In [None]:
ax = sns.boxplot(x="species", y="sepal_length", data=iris)
_ = ax.set(ylabel='Sepal Length', xlabel='Species')
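
The elements of a box plot can be reproduced numerically. This is a minimal sketch, assuming the common convention (the default in matplotlib and seaborn, as far as I am aware) that whiskers reach the most extreme observations within 1.5 times the interquartile range of the box. The `box_stats` helper and the sample are hypothetical.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers for a single box."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = x[(x >= low_fence) & (x <= high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers stop at the last observation inside the fences
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": x[(x < low_fence) | (x > high_fence)].tolist(),
    }

sample = [4.4, 4.8, 5.0, 5.1, 5.2, 5.4, 5.5, 5.7, 7.9]  # hypothetical values
summary = box_stats(sample)
print(summary)
```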

### Violin Plots
Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [None]:
ax = sns.violinplot(x="species", y="sepal_length", data=iris)
_ = ax.set(ylabel='Sepal Length', xlabel='Species')

You can draw split violins to compare distributions across the two levels of the `hue` variable.

In [None]:
tips = sns.load_dataset("tips")
print(tips)
ax = sns.violinplot(x="day", y="total_bill", hue="smoker",
                    data=tips, palette="muted", split=True)
_ = ax.set(ylabel='Total Bill', xlabel='Day')

### Strip Plot
A strip plot can be used where you would like to show all observations along with some representation of the underlying distribution, for example by overlaying it on a violin plot.

In [None]:
ax = sns.violinplot(x="species", y="sepal_length", data=iris,
                    inner=None, color=".8")
ax = sns.stripplot(x="species", y="sepal_length", data=iris)
_ = ax.set(ylabel='Sepal Length', xlabel='Species')

### Bar Plot
A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle and provides some indication of the uncertainty around that estimate using error bars. 

In [None]:
ax = sns.barplot(x="day", y="tip", hue="sex", data=tips)
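
The bar height and error bar can be approximated by hand. This is a minimal sketch, assuming the bar shows the mean and the error bar a 95% bootstrap confidence interval (seaborn's default, as far as I am aware). The `bootstrap_ci` helper and the sample values are hypothetical.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Mean of the sample and a percentile bootstrap interval for that mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    resamples = rng.choice(x, size=(n_boot, x.size), replace=True)
    boot_means = resamples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return x.mean(), low, high

tips_sample = [1.0, 1.5, 2.0, 2.0, 3.0, 3.5, 4.0, 5.0]  # hypothetical tips
mean, low, high = bootstrap_ci(tips_sample)
print(f"bar height = {mean:.2f}, error bar from {low:.2f} to {high:.2f}")
```

Groups with few or widely spread observations produce longer error bars.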

### Scatter Plot

In [None]:
ax = sns.scatterplot(data=iris, x="sepal_length", y="petal_length")
_ = ax.set(xlabel='Sepal Length', ylabel='Petal Length')

    From the scatter plot we should expect a strong positive correlation between sepal length and petal length.

In [None]:
print(stats.pearsonr(x=iris["sepal_length"], y=iris["petal_length"])[0]) # Pearson Correlation

In [None]:
print(stats.spearmanr(a=iris["sepal_length"], b=iris["petal_length"])[0]) # Spearman Correlation

In [None]:
print(stats.kendalltau(x=iris["sepal_length"], y=iris["petal_length"])[0]) # Kendall Tau Correlation
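
The difference between the Pearson and Spearman coefficients is easier to see when they are written out. The following is a minimal sketch: Pearson measures linear association, while Spearman is the Pearson coefficient computed on ranks and so measures monotonic association. The helper names and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

def spearman(x, y):
    """Pearson correlation of the ranks (ties receive average ranks)."""
    return pearson(pd.Series(x).rank(), pd.Series(y).rank())

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # increases with x, but not along a straight line

print("Pearson :", pearson(x, y))
print("Spearman:", spearman(x, y))
```

Because `y` always increases with `x`, the rank-based coefficient is exactly 1, whereas the Pearson coefficient is slightly lower since the relationship is curved.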

    To identify values from different species you can use the hue variable

In [None]:
ax = sns.scatterplot(data=iris, x="sepal_length", y="petal_length", hue="species")
_ = ax.set(xlabel='Sepal Length', ylabel='Petal Length')

### Bivariate Density Plot

In [None]:
ax = sns.displot(iris, x="sepal_length", y="petal_length")
_ = ax.set(xlabel='Sepal Length', ylabel='Petal Length')

In [None]:
ax = sns.displot(iris, x="sepal_length", y="petal_length", hue = "species")
_ = ax.set(xlabel='Sepal Length', ylabel='Petal Length')

In [None]:
ax = sns.displot(iris, x="sepal_length", y="petal_length", hue = "species", kind="kde") # kernel density estimation
_ = ax.set(xlabel='Sepal Length', ylabel='Petal Length')

### Joint Plot

In [None]:
ax = sns.jointplot(data=iris, x="sepal_length", y="petal_length", hue="species")

### Scatter Matrix or Pair Plot

In [None]:
ax = sns.pairplot(iris, hue="species")

### Heat Map

In [None]:
flights = sns.load_dataset("flights")
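# Reshape from long to wide form: one row per month, one column per year.
# Keyword arguments are required by recent pandas versions.
flights = flights.pivot(index="month", columns="year", values="passengers")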
flights
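
The reshaping step is what makes the data suitable for a heat map: each cell of the resulting table becomes one coloured square. Here is a minimal sketch of the same long-to-wide pivot on a tiny hand-written table (the rows are illustrative and entered by hand, not loaded from seaborn).

```python
import pandas as pd

# Long form: one row per (year, month) observation
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide form: months become the rows, years become the columns
wide = long_form.pivot(index="month", columns="year", values="passengers")
print(wide)
```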

In [None]:
ax = sns.heatmap(flights)

### Line Plot

In [None]:
_ = sns.lineplot(data=flights)

### Pie Chart

In [None]:
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = ['Joe Biden', 'Donald Trump', 'others']

votes = [81283495, 74223755, 2704848]
# explode = (0.1, 0, 0) 
_ = plt.pie(votes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.show()
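
The percentages that `autopct` prints on the slices are each value's share of the total. A minimal sketch of that arithmetic, using the same `labels` and `votes` as the cell above:

```python
labels = ['Joe Biden', 'Donald Trump', 'others']
votes = [81283495, 74223755, 2704848]

# Each slice is the value divided by the sum of all values
total = sum(votes)
shares = {label: round(100 * v / total, 1) for label, v in zip(labels, votes)}
for label, share in shares.items():
    print(f"{label}: {share}%")
```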

### Interactive Plots
There are various Python libraries that support interactive plots, e.g., altair, plotly and bokeh. A few examples are shown below:

In [None]:
# Install plotly and Bokeh
!pip install plotly bokeh

In [None]:
# Interactive plots using plotly
import plotly.io as pio
import plotly.express as px
import plotly.offline as py

df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", size="sepal_length")
fig

In [None]:
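from bokeh.plotting import figure, show, output_notebook
output_notebook()

# Species names are not valid colours, so map each species to a colour first
palette = {"setosa": "red", "versicolor": "green", "virginica": "blue"}
colors = df["species"].map(palette).tolist()

p = figure(x_axis_label="Sepal Width", y_axis_label="Sepal Length")
# scatter draws circle markers by default; marker size follows sepal length
p.scatter(df["sepal_width"], df["sepal_length"], fill_color=colors, size=df["sepal_length"])
show(p)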

## Exercise
1. Load the `flights` data set using `sns.load_dataset`. What would be an appropriate visualisation technique to show the average number of passengers for each year along with an indication of variability? Use an appropriate function to draw the graph. [15]
2. Load the `penguins` data set using `sns.load_dataset`. Identify the variable that you expect to have the highest correlation with `body_mass_g`, and explain why. You do not need to calculate the correlation. [25]
3. Using the `penguins` data set above, draw a box plot of the variable `bill_depth_mm` for the different species of penguins. Which species can be distinguished from the others using `bill_depth_mm`? [20]
4. Compute the [pairwise correlation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) for the `penguins` data set (you can use the default method, e.g., `penguins.corr(numeric_only=True)`; recent pandas versions need `numeric_only=True` because the data set contains non-numeric columns) and plot the heat map. Can you identify the variables which are highly correlated? What is the correlation value along the diagonal, and why? Do these results match your observation in question 2? [20]
5. Load the `fmri` data set using `sns.load_dataset`. Choose an appropriate graph to analyse `signal` against `timepoint` for the different regions. [20]