In [None]:
#!pip install wbgapi
#!pip install seaborn

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import wbgapi as wb

# Introduction to data visualization in Python

Python has several data visualization packages, and none of them clearly dominates among Python users. Two libraries are arguably the most widely used: `matplotlib` and `seaborn`.

# `matplotlib`

- One of the oldest Python data visualization libraries
- Very powerful
- Allows low-level customization of plots
- "Wordy" syntax, can get quite complex easily
- Very popular in scientific programming

Remember this "picture"? It was actually a plot created with `matplotlib`.

![black-hole](img/black-hole.jpg)

# `seaborn`

- Built on top of `matplotlib`
- Nicer defaults
- High-level syntax
- Much easier to use than `matplotlib` but allows less customization

We're going to use `matplotlib` and `seaborn` in this session.

# Data

We'll start by fetching some data from the World Bank (WB) API.

In [None]:
countries = ['MEX', 'CAN', 'USA']
years = range(2010, 2020)
df = wb.data.DataFrame('SP.POP.TOTL', countries, years, labels=True)

In [None]:
df

We're going to do a bit of data wrangling to reshape this data into the long format, which is the shape we need for data visualization.

In [None]:
df = df.reset_index(drop=True)

In [None]:
df = pd.wide_to_long(df, stubnames='YR', i='Country', j='year')

In [None]:
df = df.reset_index()

In [None]:
df = df.rename(columns={'YR': 'Population'})

In [None]:
df.head()

Throughout the rest of this session, we'll always do the data wrangling outside the visualization libraries and pass only the wrangled data to them.
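
As a small illustration of the wide-to-long reshape used above, here is a minimal sketch on a made-up table. The country names `A` and `B` and all the numbers are invented for the example; only the column layout (`Country` plus one `YR<year>` column per year) is meant to mirror the World Bank result.

```python
import pandas as pd

# Made-up numbers in the same wide layout: one row per country, one column per year
wide = pd.DataFrame({
    'Country': ['A', 'B'],
    'YR2010': [1.0, 10.0],
    'YR2011': [2.0, 20.0],
})

# 'YR' is the shared column prefix; the numeric suffix becomes the 'year' index level
long = pd.wide_to_long(wide, stubnames='YR', i='Country', j='year')

# Turn the index levels back into columns and give the value column a meaningful name
long = long.reset_index().rename(columns={'YR': 'Population'})
print(long)
```

The result has one row per country and year, which is the shape that both `matplotlib` and `seaborn` handle most naturally.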

# Bar plots

We'll create a simple bar plot of Mexico's total population.

In [None]:
y = df[df['Country']=='Mexico']['Population'] / 1000000
x = df[df['Country']=='Mexico']['year']

## Using `matplotlib`

In [None]:
# Simplest bar plot with default options
plt.bar(x, y)

In [None]:
# Adding some customization
plot_title = 'Mexico - Total Population in millions' 
plt.bar(x, y)
plt.title(plot_title)
plt.xlabel('year')
plt.ylabel('Population')
plt.xticks(x);

- `plt` works in a way that is not very common in Python: it keeps track of a "current" figure, and every call modifies that figure in place
- Each call to `plt` adds customizations on top of the result of the previous lines
- When used in a notebook, the figure is displayed by default once the code block finishes running
- This does not carry over across code blocks, though: once a figure has been displayed, a new code block starts with a fresh, empty one

In [None]:
# This shows nothing because this block starts with a new, empty figure
plt.show()

- The semicolon (`;`) at the end of the last line of a block tells the notebook not to print the return value of that line (try removing it to see the difference)
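
To make the "current figure" behavior described above more concrete, the sketch below draws the same made-up bars twice: once through `plt`, and once through an explicit axes object created with `plt.subplots()`. The values are invented for illustration, and the explicit style is shown only as a point of comparison, not as the approach used in this session.

```python
import matplotlib.pyplot as plt

years = [2010, 2011, 2012]
values = [1.0, 2.0, 3.0]  # made-up values, for illustration only

# Implicit style: every plt call modifies the "current" axes in place
plt.bar(years, values)
plt.title('Made-up values')
ax = plt.gca()              # the axes object that the two calls above modified
print(ax.get_title())
print(len(ax.patches))      # number of bars drawn on those axes

# Explicit style: the same plot, but the axes object is named and modified directly
fig, ax2 = plt.subplots()
ax2.bar(years, values)
ax2.set_title('Made-up values')
```

Both styles produce the same kind of plot; the `plt` calls are simply a shortcut that looks up the current axes for you.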

## Using `seaborn`

In [None]:
sns.barplot(x=x, y=y)

In [None]:
sns.barplot(x=x, y=y, color='C0')
plt.title(plot_title);

In [None]:
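# Same plot, passing a dataframe and column names instead of series
df_mexico = df[df['Country'] == 'Mexico'].assign(Population=lambda d: d['Population'] / 1000000)
sns.barplot(data=df_mexico, x='year', y='Population', color='C0')
plt.title(plot_title);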

A few notes:
- Compare the syntax of both libraries to get the same result:

```
# matplotlib
plt.bar(x, y)
plt.title(plot_title)
plt.xlabel('year')
plt.ylabel('Population')
plt.xticks(x)

# seaborn
sns.barplot(x=x, y=y, color='C0')
plt.title(plot_title)
```

- `matplotlib` has a heavier syntax -- you'll also note this in the next examples
- Notice how we assigned the title in the `seaborn` example? `matplotlib` syntax can be used on top of `seaborn` plots
- `seaborn` sets x and y-axis labels by default and, depending on the installed version, may also give a different color to every bar
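
The reason `matplotlib` calls work on top of `seaborn` plots is that `sns.barplot()` draws on a regular `matplotlib` axes object and returns it. The sketch below checks this on made-up values (the numbers and the title text are invented for the example).

```python
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.axes import Axes

years = [2010, 2011, 2012]
values = [1.0, 2.0, 3.0]  # made-up values, for illustration only

# seaborn draws the bars and hands back the matplotlib axes it used
ax = sns.barplot(x=years, y=values, color='C0')
print(isinstance(ax, Axes))

# plt.title() acts on the current axes, which is the one seaborn just drew on
plt.title('Made-up values')
print(ax.get_title())
```

Because the returned object is a plain `matplotlib` axes, any customization shown in the `matplotlib` examples can be applied to a `seaborn` plot as well.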

**Challenge:** Line plots have a very similar syntax to bar plots in `matplotlib`, but they use the function `plt.plot()` instead of `plt.bar()`. Try creating a line plot of the total population of Canada in millions.

# Scatter plots

We'll create a scatter plot of GDP per capita and life expectancy for 2010.

## Fetching the data

In [None]:
all_units = wb.economy.DataFrame()
all_units.head()

In [None]:
all_countries = all_units[all_units['aggregate']==False]
all_countries.head()

In [None]:
countries_list = list(all_countries.index)

The WB API client library also needs the codes of the series we want to retrieve. They are:
- `NY.GDP.PCAP.KD`: GDP per capita (constant 2015 US$)
- `SP.DYN.LE00.IN`: Life expectancy at birth, total (years)

In [None]:
indicators = ['NY.GDP.PCAP.KD', 'SP.DYN.LE00.IN']

Retrieving the data:

In [None]:
df = wb.data.DataFrame(indicators, countries_list, time=2010, labels=True)

In [None]:
df.head()

## With `matplotlib`

In [None]:
x = df['NY.GDP.PCAP.KD']
y = df['SP.DYN.LE00.IN']

In [None]:
# Simple scatter plot
plt.scatter(x, y);

In [None]:
# Adding some customization
plt.scatter(x, y, s=10) # s=10 indicates the size of the markers
plt.title('Country GDP per capita and life expectancy')
plt.xlabel('GDP per capita (constant 2015 USD)')
plt.ylabel('Life expectancy in years');

## With `seaborn`

In [None]:
# A simple scatter plot with default options
sns.scatterplot(x=x, y=y);

In [None]:
sns.scatterplot(x=x, y=y)
plt.title('Country GDP per capita and life expectancy')
plt.xlabel('GDP per capita (constant 2015 USD)')
plt.ylabel('Life expectancy in years');

Alternatively, this gets us the same result in `seaborn`:

In [None]:
sns.scatterplot(data=df, x='NY.GDP.PCAP.KD', y='SP.DYN.LE00.IN')
plt.title('Country GDP per capita and life expectancy')
plt.xlabel('GDP per capita (constant 2015 USD)')
plt.ylabel('Life expectancy in years');

Some additional notes:
- the `x` and `y` arguments in `seaborn` can be:
    + Pandas series, NumPy arrays, or lists
    + strings with the names of columns in a Pandas dataframe, in which case the dataframe is passed in the argument `data`
- Check again the difference in the default results of `matplotlib` and `seaborn`. Which one is closer to our finalized results?
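
The two ways of passing `x` and `y` described above can be compared side by side. This is a minimal sketch on a made-up dataframe: the column names `gdp` and `life` and all the values are hypothetical, chosen only to keep the example short.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data; 'gdp' and 'life' are hypothetical column names
toy = pd.DataFrame({
    'gdp': [1000.0, 5000.0, 20000.0],
    'life': [60.0, 70.0, 80.0],
})

fig, (ax_series, ax_names) = plt.subplots(1, 2)

# Style 1: pass the series themselves
sns.scatterplot(x=toy['gdp'], y=toy['life'], ax=ax_series)

# Style 2: pass the dataframe in `data` and refer to columns by name
sns.scatterplot(data=toy, x='gdp', y='life', ax=ax_names)

# Both calls plot the same points and pick up the same axis labels
pts_series = ax_series.collections[0].get_offsets()
pts_names = ax_names.collections[0].get_offsets()
print(np.allclose(pts_series, pts_names))
print(ax_series.get_xlabel(), ax_names.get_xlabel())
```

Which style to use is mostly a matter of convenience: column names tend to read better when all the variables live in one dataframe.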

# More resources

Building data visualizations usually involves consulting documentation and examples. Some resources:

- [`matplotlib` official tutorials](https://matplotlib.org/stable/tutorials/index.html)
- [`seaborn` examples gallery](https://seaborn.pydata.org/examples/index.html)