# 3.  Visualization

## 3.1 Introduction
When you're new to Python, the number of visualization libraries can be overwhelming. As a rule of thumb, first decide which type of graph you want to plot and then choose the library that suits it best. For this course we've chosen to dive into the Seaborn library. 

In general, **Matplotlib (pyplot)** offers great flexibility and versatility, at the cost of (sometimes) complex, low-level code. **Seaborn** is a Python data visualization library that is built on top of Matplotlib and closely integrated with pandas data structures. It provides a higher-level interface to Matplotlib, which makes it easier to create aesthetically pleasing plots.  

At the end of this chapter you'll find a bunch of references to blogs with comparisons of different libraries. 

## 3.2 Seaborn

### 3.2.1 Introduction
An overview of Seaborn plots is accessible on the [documentation website](https://seaborn.pydata.org/examples/index.html). Many plots can be made with seaborn functions alone; for less conventional plots, further customization is possible by using Matplotlib's pyplot directly. 

Throughout this chapter we'll be using seaborn, pandas and some of matplotlib's features to further modify our plots.

In [None]:
# Importing libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Some notebooks require the explicit setting of matplotlib inline to plot the graphs into the notebook
%matplotlib inline

### 3.2.2 Lineplot  
We will start exploring the Seaborn library with lineplots. 

The following example uses a new dataset, retrieved from [Datahub.io](https://datahub.io/core/pharmaceutical-drug-spending), which contains the pharmaceutical spending of a number of countries from 1971 onwards. The dataset is available in the data folder as `pharmaspending.csv`. 

In [None]:
# Load the data (downloaded from datahub.io) from the data folder
pharma = pd.read_csv('data/pharmaspending.csv')

# Inspect the data
pharma

Often we are interested in the average value of one variable as a function of other variables. Many seaborn functions can automatically perform the statistical estimation and plot it on your graph. 

In [None]:
# Make a lineplot of the percentage of GDP over time
ax = sns.lineplot(x = 'TIME', y = 'PC_GDP', data = pharma)

This is the lineplot of the percentage of GDP for all the countries in this dataframe. We can make a subselection of this dataframe that contains only the data for Belgium and its neighbouring countries France, Germany and the Netherlands. 

In [None]:
# Countries of interest
countries = ['BEL', 'FRA', 'DEU', 'NLD']

# Make subselection dataframe with the data of the countries of interest
# (DataFrame.append was removed in pandas 2.0, so we filter with isin instead)
sub_pharma = pharma.loc[pharma['LOCATION'].isin(countries)].reset_index(drop=True)

sub_pharma

In [None]:
# Make a lineplot of the percentage of GDP over time
ax = sns.lineplot(x = 'TIME', y = 'PC_GDP', data = sub_pharma)

In [None]:
# Make a lineplot of the percentage of GDP over time, with one line per country
ax = sns.lineplot(x = 'TIME', y = 'PC_GDP', data = sub_pharma, hue='LOCATION')

---
### Question:
Why doesn't Seaborn calculate a statistical estimation around the lines in this plot? What has changed as compared to the first lineplot? 

---

More complex datasets will have multiple measurements for the same value of the x variable. The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean:



In [None]:
# relplot returns a FacetGrid (not an Axes), hence the name g
g = sns.relplot(x = 'TIME', y = 'PC_GDP', kind='line', data = pharma)
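
To make the aggregation concrete, the sketch below recomputes by hand the mean that the line represents. It uses a small made-up table (the column names `TIME` and `PC_GDP` mirror the pharma data, but the values are invented) and a normal-approximation interval. Seaborn estimates its confidence interval by bootstrapping, so the band in the plot will not match these numbers exactly.

```python
import pandas as pd

# Made-up example: three measurements per year (values are invented)
toy = pd.DataFrame({
    "TIME":   [2000, 2000, 2000, 2001, 2001, 2001],
    "PC_GDP": [1.0, 1.2, 1.4, 1.5, 1.7, 1.9],
})

# One row per x value: number of observations, mean and standard error
summary = toy.groupby("TIME")["PC_GDP"].agg(["count", "mean", "sem"])

# Normal-approximation 95% interval around the mean
# (assumption for illustration: seaborn itself uses a bootstrap)
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)
```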

### 3.2.3 Barplot

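Make a barplot of the total spending (`TOTAL_SPEND`) per country; by default each bar shows the mean over all years. Note that absolute totals may lead to misinterpretation because they ignore the size of a country, so also plot a relative measure, here the spending as a percentage of GDP (`PC_GDP`). 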

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(15,6))

# Plot barplot with similar x, y and data definitions
ax = sns.barplot(x = 'LOCATION', y = 'TOTAL_SPEND', data = pharma)

In [None]:
# Make sure to set the figure size again to your liking
plt.figure(figsize=(15,6))

# Plot the percentage of GDP as opposed to the total spending
ax = sns.barplot(x = 'LOCATION', y = 'PC_GDP', data = pharma)

Additional parameters: 
- Use the median as the estimate of central tendency: `estimator=median` (after `from numpy import median`)
- Show the standard error of the mean with the error bars: `ci=68` (in seaborn 0.12 and later, where `ci` is deprecated: `errorbar="se"`)
- Show the standard deviation of the observations instead of a confidence interval: `ci="sd"` (in seaborn 0.12 and later: `errorbar="sd"`)
- Add “caps” to the error bars: `capsize=.2`
- Use a different color palette for the bars: `palette="Blues_d"`
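
The statistics behind these options can be computed directly with pandas, which helps when deciding what the bar height and the error bars should represent. The following sketch uses an invented table (the column names mirror the pharma data). It illustrates the quantities involved and is not a description of seaborn's internal implementation.

```python
import pandas as pd

# Made-up measurements for two countries (values are invented)
toy = pd.DataFrame({
    "LOCATION": ["BEL", "BEL", "BEL", "NLD", "NLD", "NLD"],
    "PC_GDP":   [1.0, 2.0, 6.0, 2.0, 4.0, 6.0],
})

# mean   -> default bar height
# median -> bar height when the median is used as estimator
# std    -> spread of the observations (standard deviation)
# sem    -> standard error of the mean; a 68% interval is roughly mean +/- 1 sem
stats_per_country = toy.groupby("LOCATION")["PC_GDP"].agg(["mean", "median", "std", "sem"])
print(stats_per_country)
```

For skewed data such as the made-up Belgian values, the mean and the median differ clearly, which is a reason to consider the median as estimator.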

---
### 3.2.3.1 Exercise:
Plot the same barplot but only for Belgium vs the Netherlands. Find a barplot argument that selects which countries are plotted (instead of making another subselection of the pandas dataframe). 

---

For the following example we're going to plot horizontal layered barplots. We will import the same file with metagenic classifications from the previous chapter. 

In [None]:
# Import data
metagenic = pd.read_csv('data/metagenic.csv')

# Add a column with total counts (numeric_only=True skips text columns such as chr)
metagenic["total"] = metagenic.sum(axis = 1, numeric_only = True)

# Order the table per total counts
metagenic = metagenic.sort_values("total", ascending = False)
metagenic

In [None]:
# The following line will make a grid on a white background
sns.set(style="whitegrid")

# Set the size of the figure plot
plt.figure(figsize=(15,6))

# Create a horizontal barplot of the total counts per entry in the chr column
sns.barplot(x="total", y="chr", data= metagenic)

Drawing several barplots on the same axes layers them on top of each other:

In [None]:
# Define colorstyle
sns.set_color_codes("pastel")

# Define figure size
f, ax = plt.subplots(figsize=(10,6))

# First bar
sns.barplot(x= "total", y="chr", data= metagenic, label = "total", color = 'b')
# Second bar
sns.barplot(x="ribo",y="chr", data= metagenic, label = "ribo", color="y")
# Third bar
sns.barplot(x="exon",y="chr", data= metagenic, label = "exon", color="r")
# Fourth bar
sns.barplot(x="intron",y="chr", data= metagenic, label = "intron", color="g")
#... 

# Add a legend
ax.legend(loc = "lower right")


### 3.2.4 Multiplot grids
Seaborn also allows you to visualize pairwise relationships and marginal distributions. The iris dataset is the perfect dataset for showing this:

In [None]:
# We'll be working with the well-known iris dataset to make some seaborn plots.
# Note: load_dataset downloads the data, so an internet connection is needed.
iris = sns.load_dataset("iris")
iris.head()

In [None]:
sns.set(style='white')
# A great way of making pairwise comparisons is pairplot
sns.pairplot(data=iris, hue="species")

### 3.2.5 Scatter plot (Volcano plot)
A useful plot in differential expression analysis (RNA-seq) is a volcano plot. Essentially, a volcano plot is a scatter plot and can also be approached from this perspective in Seaborn logic. The data for this experiment is retrieved from the GTN tutorial "RNA Seq Counts to Viz in R" ([link](https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rna-seq-counts-to-viz-in-r/tutorial.html)).

A volcano plot shows statistical significance (P-value) versus magnitude of change (fold change). The most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top. With this plot, we can then quickly identify genes with large fold changes that are also statistically significant, i.e. probably the most biologically significant genes.

In [None]:
# Import dataset
volc = pd.read_csv('data/annotatedDEgenes.tabular', sep='\t')
volc.head()

The P-value needs to be -log10 transformed. Notice that the values in this column of the DataFrame are of type `numpy.float64`. 

In [None]:
# Check the data type of a value in the P-values column
type(volc['P-value'][0])

In the following code block we will add the negative log10-transformed P-values using NumPy's `log10` function. 

In [None]:
volc['Log10 P-value'] = -np.log10(volc['P-value'])
volc.head()
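
Once both axes are available, genes can be labelled by where they fall in the plot. The sketch below does this for a few invented genes. The cut-offs (|log2(FC)| > 1 and P-value < 0.05) and the column name `Regulation` are assumptions chosen for illustration, not values taken from the tutorial data.

```python
import numpy as np
import pandas as pd

# Invented genes; the column names follow the ones used in this section
toy = pd.DataFrame({
    "GeneID":   ["g1", "g2", "g3", "g4"],
    "log2(FC)": [2.5, -3.0, 0.2, 1.8],
    "P-value":  [1e-6, 1e-4, 0.5, 0.2],
})
toy["Log10 P-value"] = -np.log10(toy["P-value"])

# Hypothetical thresholds: at least a two-fold change and P < 0.05
fc_cutoff = 1.0
p_cutoff = 0.05

significant = toy["P-value"] < p_cutoff
conditions = [
    significant & (toy["log2(FC)"] > fc_cutoff),   # top right of the plot
    significant & (toy["log2(FC)"] < -fc_cutoff),  # top left of the plot
]
toy["Regulation"] = np.select(conditions, ["up", "down"], default="not significant")
print(toy)
```

A column like this can then be passed to the `hue` argument of `sns.scatterplot` to color the groups separately.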

In [None]:
# The following line will make a grid on a white background
sns.set(style="whitegrid")

fig = plt.figure(figsize=(5,5))

# Volcano plot using Seaborn's scatterplot
ax = sns.scatterplot(x='log2(FC)', y='Log10 P-value', data=volc)

fig.savefig('img/example.png')

Color the scatter points according to the strandedness

---
### 3.2.5.1 Exercise
Plot the same graph as above with the following modifications:
- Use a white background
- Color the dots according to their strand orientation
- Change the x- and y-labels
- Remove the upper and right spines of the plot (http://seaborn.pydata.org/generated/seaborn.despine.html)

Example:
![Volcano plot](img/volcanoplot.png)

In [None]:
sns.set(style='white')

sns.scatterplot(x='log2(FC)', y='Log10 P-value', data = volc, hue='Strand')

plt.xlabel('log2 Fold change values')
plt.ylabel('-log10 p-values')

sns.despine()

---
For the sake of giving another example, we'll show here how we can plot the same graph using the visualization library `Plotly`. Plotly allows interactive visualizations.

In [None]:
%pip install plotly

In [None]:
# Importing the library
import plotly.express as px

In [None]:
# Define dataset, x- and y-axis, color based on column values and add another label when hovering over the data.
fig = px.scatter(volc, x='log2(FC)', y='Log10 P-value', color='Strand', hover_data=['GeneID'])
fig.show()

### 3.2.6 Heatmap
This exercise is the sequel to exercise 2.5.4 in the previous chapter. In this exercise, derived from the GTN, we will plot the data that we cleaned in the previous chapter to create a heatmap of the top differentially expressed genes in an RNA-seq counts dataset.    

In [None]:
# 1. Import data & Prepare the data (note the index)
heatmap_df = pd.read_csv('data/heatmap_data.csv', index_col=0)
heatmap_df.head()

It's important that the gene names are the row indices and the names of the experiments are the column indices. In addition, the data need to be normalized to allow a better interpretation. This normalization is a standard step in creating heatmaps and is done with z-scores. 

In [None]:
# Importing statistical library from scipy for calculating z-scores
from scipy import stats 

In [None]:
# 1. Preparing the data
# Data scaling by row (scale genes) with zscores
for row in range(len(heatmap_df)):
    heatmap_df.iloc[row] = stats.zscore(heatmap_df.iloc[row])
    
heatmap_df
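
The same row-wise scaling can also be written without an explicit loop. This sketch uses a small invented table and the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`. The names `toy` and `scaled` are only for illustration.

```python
import pandas as pd

# Invented counts: genes as rows, samples as columns
toy = pd.DataFrame(
    {"sample1": [1.0, 10.0], "sample2": [2.0, 20.0], "sample3": [3.0, 60.0]},
    index=["geneA", "geneB"],
)

# Subtract each row's mean and divide by each row's standard deviation
row_mean = toy.mean(axis=1)
row_std = toy.std(axis=1, ddof=0)
scaled = toy.sub(row_mean, axis=0).div(row_std, axis=0)
print(scaled)
```

After scaling, every gene has mean 0 and standard deviation 1 across the samples, so genes with very different expression levels become comparable in one color scale.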

In [None]:
# 2. Set style of the plot
sns.set_style("ticks")
# set_palette makes the palette the default; color_palette only returns it
sns.set_palette("deep")

# 3. Define/create the plot
f = plt.figure(figsize=(10,10))
ax = sns.heatmap(heatmap_df, cmap="RdBu", annot=True)

# 4. Tweak lay-out
ax.set(xlabel='Samples', ylabel='Genes', title='Heatmap of DE genes')

## 3.3  Lay-out options with Seaborn and pyplot
Finally, we will explore and summarize the lay-out options using simple line plots. 

For this example, we're importing a dataset from [datahub.io](https://datahub.io/core/genome-sequencing-costs) containing the cost of genome sequencing throughout the years. We can import the dataset directly from the website using pandas' `read_csv()` function or find it in the data folder as `sequencing_costs.csv`.  

In [None]:
# Import the data
seqcost = pd.read_csv('data/sequencing_costs.csv', sep=',')

# Inspect the data
seqcost.head()

Making a graph that plots the Cost per Mb (y-axis) throughout the years (x-axis) can be achieved with Seaborn's `lineplot`. The following is defined:
- x-axis: column name that contains the data for the x-axis = `'Date'` 
- y-axis: column name that contains the data for the y-axis = `'Cost per Mb'`
- dataset: name of the dataset = `seqcost` 

Note that the values of the `x` and `y` arguments are not arbitrary: they must match the column names in the dataframe exactly. 

In [None]:
sns.lineplot(x = 'Date', y = 'Cost per Mb', data = seqcost)

The plot we just made is not very appealing yet. In the next exercise we will learn how to modify the plot to our liking. 

Drawing attractive figures is important. Visualizations are central to communicating quantitative insights to an audience, and in that setting it’s even more necessary to have figures that catch the attention. Besides the attractiveness, correctness is obviously even more important. Misleading data visualizations can lead to misinterpretations and false conclusions (examples [here](https://www.datapine.com/blog/misleading-data-visualization-examples/) and [here](https://learningsolutionsmag.com/articles/misleading-data-visualizations-can-confuse-deceive-learners)).  

---
### 3.3.1 Exercise 
Edit the lineplot above with the following adjustments:
- Set the style to a white background with ticks on the axes
- Set the context to a paper format
- Change the figure size
- Rename the axes and title of the plot 

Use the information below to change the lay-out of the plot.

![Seq cost per Mb](img/seqcost.png)

---
### 3.3.2 Extra exercise
Make two subplots, one below the other, that plot the Cost per Mb and the Total cost over the years. 
Find more information on subplots [here](https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/subplots_demo.html).

![Subplots sequencing cost](img/subplots_seqcost.png)

The following workflow can help you with the right steps for plotting and lay-outing your plots:

--- 

```python
# 1. Import data
...

# 2. Set style of the plot
sns.set(...)
sns.set_style(...)
sns.set_context(...)
sns.set_palette(...)
sns.axes_style(...)

# 3. Define/create the plot (figure size and subplots first)
fig, ax = plt.subplots(figsize=(...))
ax = sns.lineplot(..., ax=ax)

# 4. Modify lay-out (title, labels, legend, etc.)
ax.set(...)
plt.xlabel(...)
plt.ylabel(...)

# Save the figure with (savefig is a method of the figure, not of the axes):
# fig.savefig(...)
```

---

**General style of the plot**:  
[`sns.set()`](https://seaborn.pydata.org/generated/seaborn.set.html) is the overarching method that sets aesthetic parameters in one step. Alternatively, choose one of the following methods to edit the general style of the plot:
- [`sns.set_style()`](https://seaborn.pydata.org/generated/seaborn.set_style.html) will set the background color of the graph. Examples are: *white*, *whitegrid* or *dark*. If you choose white you will see that it loses some structure, therefore it is possible to use *ticks* on the axes. 
- [`sns.set_context()`](https://seaborn.pydata.org/generated/seaborn.set_context.html) will basically scale your figure for usage in a *paper*, *poster*, *talk* or *notebook* (default). 
- [`sns.color_palette()`](http://seaborn.pydata.org/tutorial/color_palettes.html) returns any of the color palettes described in the link or lets you make your own color palette; use `sns.set_palette()` to make a palette the default. Options are: *pastel*, *deep*, *husl*, ...
- [`sns.axes_style()`](https://seaborn.pydata.org/generated/seaborn.axes_style.html#seaborn.axes_style) This affects things like the color of the axes, whether a grid is enabled by default, and other aesthetic elements.



**Define the plot**:  
- First define the figure size and/or subplots. Note that this is done at the matplotlib level: `f = plt.figure(figsize=(10,4))` or `f, ax = plt.subplots(1, 1, figsize=(10,4))` 
- Use any of [Seaborn's plots](https://seaborn.pydata.org/examples/index.html)

**Additional tweaking of axes**:  
Further modifications are possible at the matplotlib level by using `ax.set()` with keyword arguments. Refer to the [official documentation](https://matplotlib.org/3.3.0/api/axes_api.html) for a list of all possibilities. However, it's also possible to set them individually using [pyplot's functions](https://matplotlib.org/3.1.1/api/pyplot_summary.html). Here are some that you might find interesting:
- `plt.title`
- `plt.xlabel` and `plt.ylabel`
- `plt.xlim` and `plt.ylim`
- `plt.xscale` and `plt.yscale`
- `plt.legend`

## 3.4 Exercises
We'll experiment further with Pandas and Seaborn using the publicly available Covid-19 datasets provided by the Belgian national health institute Sciensano. The data is accessible following [this link](https://epistat.wiv-isp.be/covid/). Here are some plotting ideas; can you create a graph with...:
- Casualties by age and sex
- Progress of cases (hospitalization) per province over time
- Number of tests over time
- Relation between cases and tests. 

Try to map the Belgian cases on a map using the Plotly library. Here is a good hint: https://colab.research.google.com/drive/1vTdhtYk1H7KyPrpL-5NEFzstwAcglt_F#scrollTo=BboAWwU_TqQ_

## 3.5 Further reading
The internet is full of blogs discussing visualization libraries and best practices. Here is a start for some further reading:
- https://mode.com/blog/python-data-visualization-libraries
- https://pbpython.com/visualization-tools-1.html
- https://www.dataquest.io/blog/python-data-visualization-libraries/
- https://lisacharlotterost.de/2016/05/17/one-chart-code/

## 3.6 Next session
Explore how we can work with biological data in the [next chapter](04_Biopython_Introduction.ipynb)!