# Data visualisation

*Based on BDACA and CCS1 materials*

## Choosing the right library

Some background about our modules of choice:
* matplotlib is the standard workhorse. It is neither smart nor beautiful, but it is a decent option that just does its job. It is what happens under the hood when you use the `.plot()` methods in pandas, but you can also feed it lists and dictionaries directly as input.
* seaborn offers more beautiful output, has many more options, is smarter, and can produce advanced statistical plots. It is also a bit easier to use. It's a bit like ggplot2 in R.
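
To make the point about matplotlib concrete, here is a minimal sketch with made-up numbers (the `counts` dictionary and its categories are invented for illustration). It draws the same bar chart twice: once by handing matplotlib plain lists, and once via the pandas `.plot()` method, which hands back a matplotlib object.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: number of videos per category (invented values)
counts = {"News": 12, "Music": 7, "Education": 4}

# one figure with two panels next to each other
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# 1. matplotlib directly: pass the keys and values of the dictionary as lists
ax_left.bar(list(counts.keys()), list(counts.values()))
ax_left.set_title("matplotlib")

# 2. pandas: .plot() draws with matplotlib and returns a matplotlib Axes object
returned = pd.Series(counts).plot(kind="bar", ax=ax_right, title="pandas")

print(type(returned))
```

Because pandas returns a matplotlib object, you can keep customising a pandas plot with ordinary matplotlib commands (titles, labels, and so on).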

As you will see when browsing the examples on https://python-graph-gallery.com/ , both matplotlib and seaborn are very popular: some examples use one, some the other. You can do all the standard stuff (like bar charts) in either of them, so we introduce you to both (after all, when googling for help, you are very likely to come across both libraries).


We will use a dataset that comes with seaborn and contains statistics on the passengers who boarded the Titanic. It's just a pandas dataframe, so of course, you could also load your own dataset.

It's always a good idea to use
- `.head()`
- `.shape` (not a method but a property, i.e. a variable, so there is no need to call it with `()`)
- `.describe()`
- `.value_counts()`

to get a first overview, so let's do that.
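
As a small preview of what these four tools return, here is a sketch on a tiny dataframe. The `passengers` dataframe, its column names, and its values are all invented for illustration; they are not the real Titanic data.

```python
import pandas as pd

# Hypothetical mini dataset, made up for illustration
passengers = pd.DataFrame({
    "class": ["First", "Third", "Third", "Second"],
    "age": [38.0, 22.0, 26.0, 35.0],
    "survived": [1, 0, 1, 1],
})

print(passengers.head(2))                   # the first two rows
print(passengers.shape)                     # (rows, columns); a property, so no ()
print(passengers.describe())                # summary statistics of the numeric columns
print(passengers["class"].value_counts())   # how often each category occurs
```

Note that `.value_counts()` is most useful on a single categorical column, while `.describe()` summarises the numeric ones.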

### Preparation

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
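# Workaround: switch off HTTPS certificate verification, so that sns.load_dataset()
# can download the data on machines where certificate checks fail.
# This is insecure; only run this cell if the download does not work otherwise.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context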

In [None]:
titanic = sns.load_dataset("titanic")
titanic.head()

In [None]:
tips = sns.load_dataset("tips")
tips.head()

In [None]:
videos = pd.read_csv('YouTube_climatechange.tab',sep='\t')
videos.head()

## Matplotlib

We can work directly from lists

In [None]:
l1 = [2,3,4]
l2 = [1,4,4]
plt.plot(l1,l2)

In [None]:
# example adapted from https://matplotlib.org/3.1.1/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py

# let's generate some sample data
x = np.linspace(0, 2, 100)  # this gives you an array (something like a list)
                            # with 100 evenly spaced numbers from 0 to 2: [0, 0.0202, 0.0404, ... 2.0]

In [None]:
x

In [None]:
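# create a figure (a canvas); with %matplotlib inline, this has to happen in the
# same cell as the plotting commands, otherwise they end up on a new, default-sized figure
plt.figure(figsize=(10,7))

# plot things on that canvas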
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')

# add labels, titles, legends, ...
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()

# Save and/or show the figure. Use plt.savefig? to see the options for file formats etc.
#plt.savefig('test.png',dpi=300)
#plt.show()
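
The commented-out lines above hint at saving a figure. The following is a minimal sketch of how that could look; the temporary output folder and the file names are assumptions chosen for illustration, so replace them with a path of your own.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2, 100)

# create the canvas and draw on it
fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(x, x**2, label="quadratic")
ax.legend()

# hypothetical output location: a fresh temporary folder
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "test.png")
pdf_path = os.path.join(out_dir, "test.pdf")

fig.savefig(png_path, dpi=300)   # raster image; dpi sets the resolution
fig.savefig(pdf_path)            # vector graphic; the format follows from the file extension
print(png_path, pdf_path)
```

A vector format such as PDF stays sharp when you scale it, which is handy for reports; PNG is the safer choice for web pages and slides.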

But we can also plot Pandas columns

In [None]:
# select a column, then call its .plot() method;
# the kind argument defines which type of plot you get
videos['likeCount'].plot(kind='hist')

In [None]:
videos['videoCategoryLabel'].value_counts().plot(kind='bar')

In [None]:
# mean number of comments per category (.mean() avoids the warning that newer pandas versions give for .agg(np.mean))
videos.groupby('videoCategoryLabel')['commentCount'].mean().plot(kind='bar')

## Seaborn

Seaborn uses a very specific grammar that many find quite intuitive (and may recognize from ggplot ;)).
You describe a plot with the following arguments, of which you only pass the ones you need:
```
data : the name of the dataframe with all your data
x    : which column to plot on the x axis
y    : which column to plot on the y axis
hue  : which column to use to group the data (and represent by different colors)
col  : which column to use to group the data into different sub-plots ('columns')
style: which column to use to group the data into different styles of the marker
size : which column to use to change the size of the marker
... 
```


In other words, when using seaborn, we do not call the `.plot()` method of a pandas object (as in the pandas examples above, which use matplotlib under the hood), but call the seaborn function directly and pass the dataframe with the `data = ...` argument.
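
A minimal sketch of this difference, using a tiny made-up dataframe (the `df` dataframe and its `category` column are invented for illustration). Both panels show the same counts:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data
df = pd.DataFrame({"category": ["a", "b", "a", "c", "a", "b"]})

fig, (ax_pandas, ax_seaborn) = plt.subplots(1, 2, figsize=(8, 3))

# pandas: count first, then call the .plot() method of the resulting object
df["category"].value_counts().plot(kind="bar", ax=ax_pandas, title="pandas")

# seaborn: call the function and hand over the whole dataframe via data=
sns.countplot(x="category", data=df, ax=ax_seaborn)
ax_seaborn.set_title("seaborn")
```

With pandas, you prepare the numbers yourself and plot the result; with seaborn, you name the columns and let the function do the preparation.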

# A systematic look at different visualizations

## 1. Bar charts

- counts per (discrete) category (in seaborn: `countplot()`)
- or: another statistical property (e.g., the mean) per (discrete) category (in seaborn: `barplot()`)
- different variations (e.g., stacked bar charts)



In [None]:
# Look how sns has a function that integrates counting of values and producing the bar chart.
sns.countplot(x = 'class', data=titanic)

In the following example, we do not want to calculate counts and then plot them in a bar chart, but rather plot the (implicitly calculated) means:

In [None]:
# errorbar=("ci", 95) needs seaborn >= 0.12; in older versions, use ci=95 instead
sns.barplot(x="class", y="fare", data=titanic, errorbar=("ci", 95))
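
To see what "implicitly calculated" means, the following sketch compares the bar heights that seaborn draws with the group means that pandas computes. The `df` dataframe with its `group` and `value` columns is made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data: two groups with a few values each
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# the means, calculated explicitly with pandas
means = df.groupby("group")["value"].mean()
print(means)

# the bar chart, where seaborn calculates the means for us
ax = sns.barplot(x="group", y="value", data=df)

# each bar is a rectangle whose height is the mean of that group
heights = [bar.get_height() for bar in ax.patches]
print(heights)
```

The thin lines on top of the bars are the error bars, by default a bootstrapped 95% confidence interval around each mean.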

As always, use the ? to get more help:

In [None]:
sns.barplot?

### 1.1 Grouped barplots

We can use the `hue` argument to say that we want to use different colors for different groups:

In [None]:
sns.barplot(x="class", y="fare",  hue="sex",  data=titanic)

`catplot` provides a more generic interface, where we can also specify a different `kind` than bar if we want to (such as `point` instead of `bar`, see next section):

In [None]:
sns.catplot(x="class", y="fare", hue="sex", kind='bar', data=titanic)

In [None]:
sns.barplot?

In [None]:
sns.set(style="whitegrid") #style for all plots
fig, ax = plt.subplots(figsize=(15,5)) #create a "canvas"
# errorbar=None switches off the error bars (seaborn >= 0.12; in older versions, use ci=None)
sns.barplot(ax=ax, x="likeCount", y="category", hue='licensedContent', data=videos, errorbar=None)

### 1.2 Histograms


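Histograms can be seen as somewhat related to bar plots, but they are used to display the univariate distribution of a numerical variable. If the variable takes many different values, these are grouped into intervals ('bins'), and the height of each bar shows how many observations fall into that bin. Some examples: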

In [None]:
sns.histplot(x = 'fare', data=titanic)

In [None]:
sns.histplot(x = 'fare', bins=20, data=titanic)
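
What binning does can be sketched without any plotting. This example uses NumPy's `np.histogram` on a handful of made-up numbers (it illustrates the idea and is not a look at seaborn's internals): the range of the data is cut into equally wide intervals, and the values in each interval are counted.

```python
import numpy as np

# Hypothetical toy data
values = np.array([1, 2, 2, 3, 8, 9])

# split the range from 1 to 9 into two equally wide bins
counts, edges = np.histogram(values, bins=2)

print(edges)    # the bin boundaries: 1, 5 and 9
print(counts)   # four values fall into the first bin, two into the second
```

The `bins` argument of `histplot` plays the same role: more bins show more detail, fewer bins give a smoother but coarser picture.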

## 2. Point charts

- similar to bar charts
- but: can look 'cleaner'
- but: can give a clearer message if the categories can be meaningfully ordered

Compare the following two charts that present the same data:



In [None]:
sns.catplot(x="class", y="survived", hue="sex",
            palette={"male": "m", "female": "y"},
            kind="bar", data=titanic)

In [None]:
sns.catplot(x="class", y="survived", hue="sex",
            palette={"male": "g", "female": "m"},
            markers=["^", "o"], linestyles=["-", "--"],
            kind="point", data=titanic)

## 3. Line charts

- ordered (and typically evenly spaced) values on x-axis
- x-axis has many (in theory: continuous) measurements
- most typical: plot some value(s) over time


In [None]:
# never mind this specific code, we just want a dataset that looks like a time series:
# it contains an economic indicator per year and per country, that's it
capital= pd.read_csv("https://cssbook.net/d/private_capital.csv").melt(id_vars="Year",
            var_name="country", value_name="capital").rename({"Year":"year"}, axis=1)
capital.head()
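
If you are curious what the `.melt()` call above does: it reshapes a "wide" table (one column per country) into a "long" one (one row per year and country), which is the layout seaborn expects. Here is a sketch on a tiny made-up table; the countries `A` and `B` and all numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per country
wide = pd.DataFrame({
    "Year": [2000, 2001],
    "A": [1.0, 2.0],
    "B": [3.0, 4.0],
})

# long format: one row per combination of year and country
long_df = wide.melt(id_vars="Year", var_name="country",
                    value_name="capital").rename({"Year": "year"}, axis=1)
print(long_df)
```

In the long format, `country` is an ordinary column, so it can be passed to arguments such as `hue`.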

In [None]:
# if we have multiple measurements (=countries) for a year, we get a confidence interval
sns.lineplot(x='year', y='capital', data=capital)

In [None]:
# there's never more than one measurement per year and country, so no confidence interval
sns.lineplot(x='year', y='capital', hue='country', data=capital)

## 4. Plotting bivariate relationships

If we have bivariate relationships between **continuous variables**, we usually plot them using a **scatter plot**. Scatter plots show a little dot for each data point.

In seaborn, we can create them using either `.relplot()` for a simple scatter plot, or `.regplot()` if we want to have a scatter plot with an added regression line.

In [None]:
# let's use a different dataset for this, where we have two continuous variables (total_bill and tip)

tips = sns.load_dataset("tips")
tips.head()

In [None]:
# a simple scatter plot with seaborn
sns.relplot(x='total_bill', y='tip', data=tips)

In [None]:
# but seaborn can do more, e.g. add a regression line and confidence intervals
sns.regplot(x='total_bill', y='tip', data=tips)
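
By default, the line that `regplot` draws is an ordinary least-squares fit of a straight line. To get a feeling for what that means, the following sketch calculates such a line with NumPy's `np.polyfit` on a few made-up points (the numbers are invented; this illustrates the idea and is not seaborn's own code).

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1, plus a bit of noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.8, 5.1, 6.9, 9.0])

# fit a polynomial of degree 1 (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The shaded band that `regplot` adds around the line is a confidence interval for this fit.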

In [None]:
# Or we can draw a locally weighted regression line (lowess)
# (and much more)
sns.regplot(x='total_bill', y='tip', data=tips, lowess=True)

In [None]:
# Or we can add multiple variables
sns.relplot(x="total_bill", y="tip", size='size',  hue='smoker',  col='sex', data=tips);


In [None]:
# example taken from https://seaborn.pydata.org/introduction.html#introduction
sns.relplot(x="total_bill", y="tip", col="time",
            hue="smoker", style="smoker", size="size",
            data=tips);

In [None]:
# We could also add histograms to the sides
sns.jointplot(data = tips, x = 'total_bill', y='tip', 
              kind="scatter", color="#4CB391")

In [None]:
# ... or replace the scatterplot by a hexplot in which overlapping points lead to a darker shade
sns.jointplot(data = tips, x = 'total_bill', y='tip', 
              kind="hex", color="#4CB391")

# Bonus 

We do not cover this in this class, but you can also create interactive plots in Python:

In [None]:
pd.options.plotting.backend = "plotly"
fig = tips.plot(x= "total_bill", y='tip', kind='scatter', hover_data={"smoker":True})
fig.show()
fig.write_html("plotly-example.html")

Another, really powerful way to build interactive visualization apps is Bokeh - see the tutorial here: 
https://mybinder.org/v2/gh/bokeh/bokeh-notebooks/master?filepath=tutorial%2F00%20-%20Introduction%20and%20Setup.ipynb