# Workshop I - Data Visualisation

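- Install the seaborn, pandas, numpy, matplotlib and plotly libraries
- The `ppscore` library imported further down is not covered by this command; it can usually be installed separately with `pip install ppscore`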

```conda install -y seaborn pandas numpy matplotlib plotly```

- Python libraries are pieces of software that can be imported into your Jupyter notebook and used
- Seaborn, plotly and matplotlib are data visualisation libraries
- Pandas is for data wrangling - a bit like Excel but in code
- Numpy is for maths and linear algebra - we won't be doing this, but the library is still useful

## How to get help

1. Google - all programmers spend a large amount of their time googling
2. Stack Overflow - someone has almost certainly asked the same question you have, so the answer is probably already there
3. Use Python's help function

## Load our libraries into the notebook and set some parameters

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import ppscore as pps

%matplotlib inline
sns.set(rc={'figure.figsize': (11.7, 8.27)}, font_scale=1.5)
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML

## How to use Seaborn

In [None]:
%%html
<iframe src="https://seaborn.pydata.org/index.html" width="1000" height="800"></iframe>

## Load the penguin example dataset

- The penguin dataset is assigned to the variable ```penguins```
- The dataset is in the format of a pandas dataframe
- Pandas dataframe cheatsheet at https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- We'll cover dataframes in more detail in the stats workshop

In [None]:
# make the variable penguins equal to the dataset
penguins = sns.load_dataset("penguins")
# Have a look at the dataset
penguins

### How to use Python's inbuilt help function

In [None]:
help(sns.load_dataset)
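
Alongside `help()`, the same documentation text is stored on the object itself in its `__doc__` attribute, and Jupyter offers a `?` shortcut. A minimal sketch, using the built-in `len` function as a stand-in so that it runs without any extra libraries:

```python
# help() prints the documentation for any function or object
help(len)

# The same text is available as a plain string via the __doc__ attribute
doc_text = len.__doc__
print(doc_text)

# In a Jupyter notebook cell (this is notebook syntax, not plain Python)
# you can also type the name followed by a question mark:
#   sns.load_dataset?
```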

## Summarise the dataset

In [None]:
len(penguins)

In [None]:
print("Number of species: {}".format(penguins['species'].nunique()))
print("Number of islands: {}".format(penguins['island'].nunique()))
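
To see what `nunique()` is counting, it can help to try it on a tiny table. The sketch below uses a hand-made dataframe (the rows are invented for illustration and are not the real penguin data) and also shows `value_counts()`, which gives the number of rows for each distinct value:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream", "Biscoe", "Dream"],
})

# nunique() counts the distinct values in a column
n_species = toy["species"].nunique()

# value_counts() counts how many rows carry each distinct value
species_counts = toy["species"].value_counts()

print(f"Number of species: {n_species}")
print(species_counts)
```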

### Describe function automatically generates statistics for the dataset

In [None]:
penguins.describe()

## Write to file and load from file

- Use Excel as usual for data entry, then save the sheet as a CSV file so that pandas can read it

In [None]:
penguins.to_csv("penguins.csv")

In [None]:
penguins = pd.read_csv("penguins.csv", index_col=0)
penguins
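
The `index_col=0` argument is needed because `to_csv` writes the row index as an extra first column. A minimal sketch of the round trip, using an in-memory text buffer instead of a real file and a hypothetical two-row dataframe:

```python
import io
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"species": ["Adelie", "Gentoo"],
                    "body_mass_g": [3750, 5000]})

# Write the CSV into an in-memory buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)  # the first column holds the row index and has an empty header

# Without index_col, the saved index comes back as an extra data column
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# With index_col=0, the first column is used as the row index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))
```

An alternative is to skip the index when saving with `to_csv(..., index=False)`, in which case `index_col=0` is not needed when loading.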

## Plot a histogram

### All species

In [None]:
penguins = sns.load_dataset("penguins")
sns.displot(penguins, x="flipper_length_mm", height=10)

### Colour by species

In [None]:
sns.displot(penguins,
            x="flipper_length_mm",
            hue="species",
            element="step",
            height=10)

### Normalise data

In [None]:
sns.displot(penguins,
            x="flipper_length_mm",
            hue="species",
            stat="density",
            common_norm=False,
            element="step",
            height=10)
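
Normalising each group separately means every histogram has a total area of 1, so groups of very different sizes can be compared by shape. A rough sketch of the idea with NumPy, using made-up measurements (the group sizes, means and bin edges are assumptions chosen for illustration):

```python
import numpy as np

# Hypothetical measurements for two groups of very different size
rng = np.random.default_rng(0)
groups = {
    "small": rng.normal(190, 6, size=50),
    "large": rng.normal(215, 6, size=500),
}

# 30 equal-width bins shared by both groups
bins = np.linspace(160, 250, 31)
widths = np.diff(bins)

areas = {}
for name, values in groups.items():
    # Raw counts depend on how many rows the group has
    counts, _ = np.histogram(values, bins=bins)
    # density=True rescales the bars so that bar height * bar width sums to 1
    density, _ = np.histogram(values, bins=bins, density=True)
    areas[name] = float(np.sum(density * widths))
    print(name, "tallest count:", counts.max(), "total area:", round(areas[name], 6))
```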

### Smoothed and normalised plot using kernel density

In [None]:
sns.displot(penguins,
            x="flipper_length_mm",
            hue="species",
            kind="kde",
            common_norm=False,
            fill=True,
            height=10)
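
A kernel density estimate replaces each data point with a small smooth bump and averages the bumps. The sketch below shows the idea with Gaussian bumps in plain NumPy. It is not seaborn's implementation: the sample values are invented, and the bandwidth (bump width) is a hand-picked assumption, whereas seaborn chooses one automatically.

```python
import numpy as np

# Hypothetical sample values, invented for illustration only
values = np.array([181.0, 186.0, 190.0, 193.0, 195.0, 210.0])
bandwidth = 4.0  # assumed bump width; larger values give a smoother curve

# Points at which the smoothed curve is evaluated (spacing of 0.1)
grid = np.linspace(150, 240, 901)


def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on the grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


density = gaussian_kde_sketch(values, grid, bandwidth)

# Like a normalised histogram, the area under the curve is approximately 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```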

### Distribution plot in 2 dimensions

In [None]:
sns.displot(penguins,
            x="bill_length_mm",
            y="bill_depth_mm",
            hue="species",
            height=10)

### Kernel density in 2 dimensions

In [None]:
sns.displot(penguins,
            x="bill_length_mm",
            y="bill_depth_mm",
            hue="species",
            kind="kde",
            height=10)

### Joint plot includes 1D and 2D for 2 variables

In [None]:
g = sns.jointplot(data=penguins,
                  x="bill_length_mm",
                  y="bill_depth_mm",
                  hue="species",
                  kind="kde",
                  fill=True,
                  height=10)

### Show multiple plot types in a grid for combinations of variables

In [None]:
g = sns.PairGrid(penguins)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.histplot, kde=True)

## Load and display the titanic dataset

In [None]:
titanic = sns.load_dataset("titanic")
titanic

### Categorical bar plot - confidence intervals are automatically generated

In [None]:
sns.catplot(x="sex",
            y="survived",
            hue="class",
            kind="bar",
            data=titanic,
            capsize=0.05,
            errwidth=2,
            height=10)

### Categorical box plot

In [None]:
iris = sns.load_dataset("iris")
sns.catplot(data=iris, orient="h", kind="box", height=5, aspect=2)

### Violin plot - similar to a box plot but gives a better idea of the distribution

In [None]:
sns.violinplot(x=iris.species, y=iris.sepal_length)

### Categorical box plot split into rows

- plots are automatically aligned
- note that the input data can be filtered, e.g. here fare must be above 0

In [None]:
g = sns.catplot(x="fare",
                y="survived",
                row="class",
                kind="box",
                orient="h",
                height=3,
                aspect=4,
                data=titanic.query("fare > 0"))
g.set(xscale="log")
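
The `query("fare > 0")` call keeps only the rows that satisfy the condition, presumably because a value of 0 cannot be placed on a logarithmic axis. A small sketch on a hypothetical dataframe (invented values, not the real titanic data) showing that `query` gives the same result as filtering with a boolean mask:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "fare": [0.0, 7.25, 71.28, 0.0, 8.05],
    "survived": [0, 0, 1, 1, 0],
})

# Filter with a query string ...
with_query = toy.query("fare > 0")

# ... or with an equivalent boolean mask
with_mask = toy[toy["fare"] > 0]

print(len(toy), "rows before filtering,", len(with_query), "rows after")
```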

### Load and view the tips dataset

In [None]:
tips = sns.load_dataset("tips")
tips

### Scatter plot with regression line + confidence intervals

In [None]:
sns.regplot(x="total_bill", y="tip", data=tips)
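
The straight line drawn by `regplot` is, by default, a least-squares fit through the points. As a rough sketch of what that fit means, the slope and intercept can be computed with NumPy. The bill and tip values below are invented for illustration, and this only reproduces the line, not the shaded confidence band.

```python
import numpy as np

# Hypothetical bills and tips, invented for illustration only
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.5, 3.1, 4.4, 6.2, 7.3])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Tip predicted by the fitted line for each bill
predicted = slope * total_bill + intercept

print(f"tip is roughly {slope:.3f} * total_bill + {intercept:.3f}")
```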

### Another joint plot - scatter plot and distributions of each variable

In [None]:
sns.jointplot(data=tips, x="total_bill", y="tip", kind="reg", height=10)

## Heatmap

- a table with coloured cells

### Correlation matrix - correlation between numerical variables

In [None]:
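# numeric_only=True skips text columns such as species, which recent pandas versions refuse to correlate
sns.heatmap(penguins.corr(numeric_only=True), annot=True, cmap='RdYlGn')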
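
Each cell of the correlation matrix is a Pearson correlation coefficient between two numerical columns: close to 1 when they rise together, close to -1 when one falls as the other rises, and near 0 when there is no linear relationship. A minimal sketch of the calculation on invented numbers, checked against NumPy's built-in `corrcoef` (the helper name `pearson` is hypothetical):

```python
import numpy as np

# Hypothetical paired measurements, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_centred = a - a.mean()
    b_centred = b - b.mean()
    return (a_centred * b_centred).sum() / np.sqrt(
        (a_centred**2).sum() * (b_centred**2).sum()
    )


r_manual = pearson(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
r_opposite = pearson(x, -x)  # a perfectly opposite relationship

print(round(r_manual, 4), round(r_numpy, 4), round(r_opposite, 4))
```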

### Predictive power score - how predictive is each variable of the others

- uses the PPS library

In [None]:
pps_matrix = pps.matrix(penguins)
# Keyword arguments are required by pivot in recent pandas versions
pps_pivot = pps_matrix.pivot(index='x', columns='y', values='ppscore')
pps_pivot.index.name, pps_pivot.columns.name = None, None
sns.heatmap(pps_pivot, annot=True, cmap='YlGn', annot_kws={"fontsize": 10})

### Predictive power scores as a bar graph

In [None]:
predictors_df = pps.predictors(penguins, y="species")
sns.barplot(data=predictors_df, x="x", y="ppscore")
plt.xticks(rotation=70)
plt.tight_layout()

## Plotly library

- an alternative to Seaborn with some nice features

In [None]:
%%html
<iframe src="https://plotly.com/python/" width="1000" height="800"></iframe>

### Interactive 3D plots

In [None]:
df = px.data.iris()
fig = px.scatter_3d(df,
                    x='sepal_length',
                    y='sepal_width',
                    z='petal_width',
                    color='species',
                    size='petal_length',
                    size_max=22,
                    opacity=0.7,
                    width=950,
                    height=950)
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0))

### Animated plots - e.g. change over time

In [None]:
df = px.data.gapminder()
px.scatter(df,
           x="gdpPercap",
           y="lifeExp",
           animation_frame="year",
           animation_group="country",
           size="pop",
           color="continent",
           hover_name="country",
           log_x=True,
           size_max=55,
           range_x=[100, 100000],
           range_y=[25, 90])

# Next - basic Python