# Lecture 3: Statistical Graphs
### Zhentao Shi

kernel: base (python 3.11.3)


## Graphics

An English cliche says "One picture is worth ten thousand words".
John Tukey, a renowned mathematical statistician, was one of the pioneers of statistical graphs
in the computer era. Nowadays, powerful software is able to produce dazzling statistical graphs,
sometimes web-based and interactive. Outside of academia, journalism hooks a wide readership with
professional data-based graphs. The New York Times and The Economist are first-rate examples;
the South China Morning Post sometimes also does a respectable job.
A well-designed statistical graph can deliver an intuitive and powerful message.
I reach for a graph before a table when writing a research report or an academic paper.
A graph is lively and engaging; a table is tedious and boring.

We have seen an example of a `matplotlib` graph in the OLS linear regression example in Lecture 1.
`plot` is its generic command for graphs, and `matplotlib` is one of the oldest plotting libraries in Python.
It is capable of producing preliminary statistical graphs.

Over the years, developers all over the world have proposed many systems for
more sophisticated statistical graphs. In my opinion, the R package `ggplot2`, contributed by [Hadley Wickham](http://had.co.nz/),
is the best.

`ggplot2` is an advanced graphic system that generates high-quality statistical graphs.
It is not possible to cover it in a single lecture. Fortunately, its author wrote a comprehensive reference.

The workflow of `ggplot2` is to add the elements of a graph one by one, and then render
the whole graph at once.
In contrast, `plot` draws the main graph first, and then adds the supplementary elements.
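The "main graph first, supplementary elements later" workflow can be sketched in `matplotlib` as follows. This is only an illustration: the data are synthetic and the variable names (`x`, `y`, `grid`) are made up for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (made up for illustration): a noisy linear relationship
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=50)

fig, ax = plt.subplots()

# Step 1: draw the main graph
ax.scatter(x, y, alpha=0.6, label="data")

# Step 2: add supplementary elements on top of the existing axes
slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
grid = np.linspace(x.min(), x.max(), 100)
ax.plot(grid, intercept + slope * grid, color="red", label="fitted line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot with a fitted line")
ax.legend()
plt.show()
```

Each call modifies the same axes object immediately, so the order of the calls determines what is drawn on top of what.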

`ggplot2` is particularly good at drawing multiple graphs, either of the same pattern or of
different patterns. Multiple subgraphs convey rich information and make comparison easy.

In Python, many libraries have established a solid reputation for plotting, for example `matplotlib`, `seaborn`, `Altair`, and even `ggplot` and `pandas`. Many people, however, prefer the syntax and the "grammar of graphics" of `ggplot2`, hence the trend toward the package `plotnine`, which brings `ggplot2`-style data visualization from R to Python.

In [None]:
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt
import datetime

In [None]:
# Read the CSV file
d0 = pd.read_csv("data_example/AJR.csv")

# Plot the data
plt.scatter(d0['avexpr'], d0['logpgp95'])
plt.xlabel('avexpr')
plt.ylabel('logpgp95')
plt.show()

In [None]:

# Read the CSV file
bank_0 = pd.read_csv("data_example/bank-full.csv", sep=";")

# Display the dataframe
print(bank_0)

# Print the names of the columns
print(bank_0.columns)

In [None]:
# Scatter plot
plt.scatter(bank_0['age'], bank_0['balance'])
plt.xlabel('Age')
plt.ylabel('Balance')
plt.show()

In [None]:
# Scatter plot with groups
import seaborn as sns
sns.scatterplot(data=bank_0, x='age', y='balance', hue='education', alpha=0.5)
plt.show()

In [None]:
# Create a FacetGrid
g = sns.FacetGrid(bank_0, col='education', row='marital')

# Map a scatter plot to the FacetGrid
g.map(plt.scatter, 'age', 'balance', alpha=0.5)

# Show the plot
plt.show()

In [None]:
# Bar plot with 'education' as hue
sns.countplot(data=bank_0, x='age', hue='education')
plt.show()

In [None]:
# Horizontal bar plot (age on the y-axis); dodge=True places the 'education' bars side by side
sns.countplot(data=bank_0, y='age', hue='education', dodge=True)
plt.show()
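`sns.countplot` places the hue groups side by side. If stacked bars are preferred, one possible route is a `pandas` cross-tabulation followed by a stacked bar plot. The sketch below uses a synthetic stand-in `toy` for the bank data; only the column names `age` and `education` are borrowed from the real file, and all values are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for bank_0 (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(20, 30, size=200),
    "education": rng.choice(["primary", "secondary", "tertiary"], size=200),
})

# Count the observations in each (age, education) cell
counts = pd.crosstab(toy["age"], toy["education"])

# Stack the education counts on top of each other within each age
ax = counts.plot(kind="bar", stacked=True)
ax.set_xlabel("Age")
ax.set_ylabel("Count")
plt.show()
```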

### Data manipulation

In [None]:
# Read the CSV file
d0 = pd.read_csv("data_example/PWT100.csv")

# Display the first few rows of the dataframe
print(d0.head())

# Print the names of the columns
print(d0.columns)

In [None]:
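# Select specific columns and filter rows
d1 = d0[['countrycode', 'year', 'rgdpe', 'pop']]
# .copy() makes d1 an independent dataframe, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
d1 = d1[d1['countrycode'].isin(['CHN', 'RUS', 'JPN', 'USA'])].copy()

# Create new column 'gdpcapita'
d1['gdpcapita'] = d1['rgdpe'] / d1['pop']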

# Print the dataframe
print(d1)
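The select, filter, and create-column steps can also be written as one method chain. The sketch below is one possible style, not the only one. It runs on a tiny made-up data frame `toy` whose column names mimic the PWT file; the numbers and the extra column `other` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the PWT data (numbers are made up for illustration)
toy = pd.DataFrame({
    "countrycode": ["CHN", "CHN", "USA", "USA", "FRA", "FRA"],
    "year": [2018, 2019, 2018, 2019, 2018, 2019],
    "rgdpe": [200.0, 210.0, 400.0, 410.0, 60.0, 61.0],
    "pop": [10.0, 10.0, 4.0, 4.0, 1.0, 1.0],
    "other": [0, 0, 0, 0, 0, 0],
})

toy_small = (
    # filter rows and select columns in one .loc call
    toy.loc[toy["countrycode"].isin(["CHN", "USA"]),
            ["countrycode", "year", "rgdpe", "pop"]]
    # add the new column; .assign returns a new dataframe
    .assign(gdpcapita=lambda df: df["rgdpe"] / df["pop"])
    .reset_index(drop=True)
)
print(toy_small)
```

Because `.loc` and `.assign` return new objects, the original `toy` is left untouched.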

In [None]:
# Scatter plot with 'countrycode' as hue
sns.scatterplot(data=d1, x='year', y='rgdpe', hue='countrycode')
plt.show()

In [None]:
# Line plot with 'countrycode' as hue
sns.lineplot(data=d1, x='year', y='gdpcapita', hue='countrycode')
plt.show()

In [None]:
# Select specific columns
s1 = d1[['countrycode', 'year', 'pop']]

# Spread 'year' column into multiple columns with 'pop' as values
s1 = s1.pivot(index='countrycode', columns='year', values='pop')

print(s1)

In [None]:
# Gather '1950' to '2019' columns into key-value pairs
s1 = s1.reset_index().melt(id_vars='countrycode', var_name='year', value_name='pop')

# Print the dataframe
print(s1)
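The round trip between the long and wide layouts is easier to follow on a tiny example. The data frame below is made up for illustration; only the column names follow the PWT example.

```python
import pandas as pd

# Long format: one row per (country, year) pair; numbers are made up
long = pd.DataFrame({
    "countrycode": ["CHN", "CHN", "USA", "USA"],
    "year": [2018, 2019, 2018, 2019],
    "pop": [10.0, 11.0, 4.0, 5.0],
})

# Long -> wide: one row per country, one column per year
wide = long.pivot(index="countrycode", columns="year", values="pop")
print(wide)

# Wide -> long: melt the year columns back into (year, pop) pairs
back = (
    wide.reset_index()
    .melt(id_vars="countrycode", var_name="year", value_name="pop")
    .astype({"year": int})  # year labels may come back with object dtype
    .sort_values(["countrycode", "year"])
    .reset_index(drop=True)
)
print(back)
```

After sorting, `back` holds the same rows as `long`. `melt` stacks the data year by year, so the row order differs from the original unless it is sorted again.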

## Interactive Graphs

* [Plotly](https://plotly.com/graphing-libraries/)
* [Shiny for Python](https://shiny.posit.co/py/docs/overview.html) from Posit
* [Shiny Express](https://shiny.posit.co/blog/posts/shiny-express/)
