# Visualization


The main reference for this chapter is @McK2017.



## `matplotlib.pyplot`

`matplotlib` is a classic yet modern plotting library. Its main features are inspired by `MATLAB`. In this book we mostly use the `pyplot` module from `matplotlib`. We use the following import convention:


In [None]:
import matplotlib.pyplot as plt

### `matplotlib` interface



`matplotlib` has two major application interfaces, or styles of using the library:

- An explicit `Axes` interface that uses methods on a `Figure` or `Axes` object to create other Artists, and build a visualization step by step. You may treat this `Figure` object as a canvas, and `Axes` as plots on a canvas. There might be one or more plots on one canvas. This has also been called an *object-oriented* interface.
- An implicit `pyplot` interface that keeps track of the last `Figure` and `Axes` created, and adds Artists to the object it thinks the user wants.

Here is an example of an explicit interface.

In [None]:
fig = plt.figure()
ax = fig.subplots()
ax.plot([1, 2, 3, 4], [0, 0.5, 1, 0.2])

Here is an example of an implicit interface.


In [None]:
plt.plot([1, 2, 3, 4], [0, 0.5, 1, 0.2])

::: {.callout-note}
If the plot is not shown, you may call `plt.show()` to force it to be rendered. Whether `plt.show()` works depends on the `matplotlib` backend in use, and switching backends is sometimes very complicated.
:::

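Using `fig` and `ax` explicitly gives more control over the configuration. The first important configuration is subplots, which can be created in three ways.

- `plt.subplot()` adds one `Axes` to the current figure at a given grid position.
- `plt.subplots()` creates a figure together with a whole grid of `Axes`.
- `fig.add_subplot()` adds one `Axes` to the figure `fig` at a given grid position.

Please see the following examples.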


::: {#exm-}

In [None]:
plt.subplot(1, 2, 1)
plt.plot([1, 2, 3], [0, 0.5, 0.2])

:::


::: {#exm-}

In [None]:
plt.subplot(1, 2, 1)
plt.plot([1, 2, 3], [0, 0.5, 0.2])
plt.subplot(1, 2, 2)
plt.plot([3, 2, 1], [0, 0.5, 0.2])

:::


::: {#exm-}

In [None]:
fig, axs = plt.subplots(1, 2)
axs[0].plot([1, 2, 3], [0, 0.5, 0.2])
axs[1].plot([3, 2, 1], [0, 0.5, 0.2])

:::


::: {#exm-}

In [None]:
import numpy as np
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax3.plot([1, 2, 3], [0, 0.5, 0.2])

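The arguments `2, 2, 1` mean that we split the figure into a `2x2` grid and the axes `ax1` is in the 1st position. The rest are understood in the same way. In particular, `1, 2, 2` puts `ax3` in the 2nd position of a `1x2` grid, so it fills the right half of the figure.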
:::


::: {#exm-}
If you don't explicitly initialize `fig` and `ax`, you may use `plt.gcf()` and `plt.gca()` to get the handles for further operations.

In [None]:
plt.subplot(1, 2, 1)
ax = plt.gca()
ax.plot([1, 2, 3], [0, 0.5, 0.2])

plt.subplot(1, 2, 2)
ax = plt.gca()
ax.plot([3, 2, 1], [0, 0.5, 0.2])

:::



Another benefit of using `fig` and `ax` explicitly is control over the figure itself. For example, when creating a `Figure` object, we may pass `figsize=(3, 3)` to set the figure size to `3x3` inches. `dpi` is another commonly modified option.

In [None]:
fig = plt.figure(figsize=(2, 2), dpi=50)
plt.plot([1, 2, 3], [0, 0.5, 0.2])

If you would like to change these settings later, you may use the following commands before plotting.


In [None]:
fig.set_size_inches(10, 10)
fig.set_dpi(300)
plt.plot([1, 2, 3], [0, 0.5, 0.2])

You may use `fig.savefig('filename.png')` to save the image into a file.
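
The following is a minimal sketch of how `figsize`, `dpi` and `fig.savefig()` fit together. The temporary folder and the file name `example.png` are assumptions made only for this illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# figsize is measured in inches and dpi in dots per inch,
# so this canvas is 2 * 50 = 100 pixels on each side.
fig, ax = plt.subplots(figsize=(2, 2), dpi=50)
ax.plot([1, 2, 3], [0, 0.5, 0.2])
width_px, height_px = fig.canvas.get_width_height()

# Save into a temporary folder; the file name is only an illustration.
outfile = os.path.join(tempfile.mkdtemp(), 'example.png')
fig.savefig(outfile)
saved = os.path.exists(outfile)
plt.close(fig)
```

The file type is inferred from the extension of the file name, so replacing `.png` with `.pdf` should give a PDF file instead.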


### Downstream packages
There are multiple packages depending on `matplotlib` to provide plotting. For example, you may directly plot from a Pandas DataFrame or a Pandas Series.

::: {#exm-}


In [None]:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
s.plot()

In [None]:
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))
df.plot()

:::


### Plotting
#### `plt.plot()`
This is the command for line plotting. You may use `linestyle='--'` and `color='g'` to control the line style and color. The style can be shortened as `g--`. 

Here is a list of commonly used linestyles and colors.

- line styles
    - `solid` or `-`
    - `dashed` or `--`
    - `dashdot` or `-.`
    - `dotted` or `:`
- marker styles
    - `o` as circle markers
    - `+` as plusses
    - `^` as triangles
    - `s` as squares
- colors
    - `b` as blue
    - `g` as green
    - `r` as red
    - `k` as black
    - `w` as white

The input of `plt.plot()` is two lists `x` and `y`. If only one list is given, it is treated as `y`, and the indices of the elements of `y` are used as the default `x`.
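
As a small sketch of the two points above, the code below plots once with a single list and a short format string, and once with explicit `x` values and keyword arguments. The variable names are chosen only for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()

# Only y is given, so x defaults to the positions 0, 1, 2.
(line1,) = ax.plot([0, 0.5, 0.2], 'g--')

# The long form of the style, with explicit x values and a marker added.
(line2,) = ax.plot([1, 2, 3], [0, 0.5, 0.2],
                   color='g', linestyle='dashed', marker='o')

# The x values that matplotlib filled in for the first line.
default_x = np.asarray(line1.get_xdata())
```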


::: {#exm-plot1}


In [None]:
plt.plot(np.random.randn(30).cumsum(), color='r', linestyle='--', marker='o')

You may compare it with [this example](#exm-plotwithsns) in the next section, which shows the effect of `seaborn`.
:::

#### `plt.bar()` and `plt.barh()`
The two commands make vertical and horizontal bar plots, respectively.

::: {#exm-}


In [None]:
import pandas as pd
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))

fig, axes = plt.subplots(2, 1)
axes[0].bar(x=data.index, height=data, color='k', alpha=0.7)
axes[1].barh(y=data.index, width=data, color='b', alpha=0.7)

We may also directly plot the bar plot from the Series. 


In [None]:
fig, axes = plt.subplots(2, 1)
data.plot.bar(ax=axes[0], color='k', alpha=0.7)
data.plot.barh(ax=axes[1], color='b', alpha=0.7)

:::



With a DataFrame, a bar plot groups the values of each row together, with one bar per column. This is easiest if we plot directly from the DataFrame.

::: {#exm-}


In [None]:
df = pd.DataFrame(np.random.rand(6, 4),
                  index=['one', 'two', 'three', 'four', 'five', 'six'],
                  columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
df

In [None]:
df.plot.bar()

In [None]:
df.plot.barh(stacked=True, alpha=0.5)

:::

#### `plt.scatter()`


::: {#exm-}


In [None]:
import numpy as np

N = 100
data = 0.9 * np.random.rand(N, 2)
area = (20 * np.random.rand(N))**2 
c = np.sqrt(area)
plt.scatter(data[:, 0], data[:, 1], s=area, marker='^', c=c)

:::


#### `plt.hist()`
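Here are two plots with built-in statistics. The plot command returns these statistics as outputs; for `plt.hist()` they are the bin heights, the bin edges and the drawn patches. To keep them from being displayed, we can assign the outputs to a temporary variable `_`.
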
::: {#exm-histogram1}

In [None]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
y = mu-30 + sigma*2 * np.random.randn(10000)
_ = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)
_ = plt.hist(y, 50, density=True, facecolor='r', alpha=0.75)

:::
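
Instead of discarding the outputs of `plt.hist()`, we may also inspect them. The following is a minimal sketch; the fixed random seed and the variable names are assumptions made only to keep the illustration reproducible.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = 100 + 15 * rng.standard_normal(10000)

fig, ax = plt.subplots()
# counts: bar heights, edges: bin boundaries, patches: the drawn rectangles
counts, edges, patches = ax.hist(x, 50, density=True, facecolor='g', alpha=0.75)

# With density=True the areas of the bars add up to 1.
total_area = np.sum(counts * np.diff(edges))
```

With 50 bins there are 50 heights and 51 edges, since each bin needs a left and a right boundary.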


#### `plt.boxplot()`

::: {#exm-}

In [None]:
spread = np.random.rand(50) * 100
center = np.ones(30) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low)).reshape(50, 2)

_ = plt.boxplot(data, flierprops={'markerfacecolor': 'g', 'marker': 'D'})

:::

### Titles, labels and legends

- Titles and labels
    - `plt.title(label)`, `plt.xlabel(label)`, `plt.ylabel(label)` will set the title/xlabel/ylabel.
    - `ax.set_title(label)`, `ax.set_xlabel(label)`, `ax.set_ylabel(label)` will do the same thing.
- Limits and ticks
    - `plt` functions
        - `xlim()`, `ylim()`, `xticks()`, `yticks()`
        - Called without arguments, they return the current values. Called with arguments, they set new values. Tick labels are passed as the second argument of `xticks()` and `yticks()`.
    - `ax` methods
        - `get_xlim()`, `get_ylim()`, `get_xticks()`, `get_xticklabels()`, etc.
        - `set_xlim()`, `set_ylim()`, `set_xticks()`, `set_xticklabels()`, etc.
- Legends
    - First add the `label` option to each piece when plotting, and then call `ax.legend()` or `plt.legend()` at the end to display the legend.
    - You may use `handles, labels = ax.get_legend_handles_labels()` to get the handles and labels of the legend, and modify them if necessary.


::: {#exm-}


In [None]:
import numpy as np
fig, ax = plt.subplots(1, 1)
ax.plot(np.random.randn(1000).cumsum(), 'k', label='one')
ax.plot(np.random.randn(1000).cumsum(), 'r--', label='two')
ax.plot(np.random.randn(1000).cumsum(), 'b.', label='three')

ax.set_title('Example')
ax.set_xlabel('x')
ax.set_ylabel('y')

ax.set_yticks([-40, 0, 40])
ax.set_yticklabels(['good', 'bad', 'ugly'])

ax.legend(loc='best')

:::
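
The example above does not use `get_legend_handles_labels()`. Here is a minimal sketch of how the handles and labels might be modified before drawing the legend. Turning the labels into upper case is an arbitrary change chosen only for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0, 0.5, 0.2], 'k', label='one')
ax.plot([1, 2, 3], [0.2, 0.1, 0.4], 'r--', label='two')

# Collect what the legend would show, then pass a modified version back.
handles, labels = ax.get_legend_handles_labels()
new_labels = [label.upper() for label in labels]
legend = ax.legend(handles, new_labels, loc='best')

# set_xlim() fixes the range and get_xlim() reads it back.
ax.set_xlim(0, 4)
xlim = ax.get_xlim()
```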

### Annotations

- The command to add simple annotations is `ax.text()`. The required arguments are the coordinates of the text and the text itself. You may add several options to modify the style.
- If arrows are needed, we may use `ax.annotate()`. Here an arrow will be drawn from `xytext` to `xy`. The style of the arrow is controlled by the option `arrowprops`.


::: {#exm-}


In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(np.random.randn(1000).cumsum(), 'k', label='one')
ax.text(500, 0, 'Hello world!', family='monospace', fontsize=15, c='r')
ax.annotate('test', xy=(400, 0), xytext=(400, -10), c='r',
            arrowprops={'facecolor': 'black',
                        'shrink': 0.05})

:::


### Example

::: {#exm-}
The stock data can be downloaded from [here](assests/datasets/spx.csv). 

In [None]:
from datetime import datetime
fig, ax = plt.subplots()
data = pd.read_csv('assests/datasets/spx.csv', index_col=0, parse_dates=True)
spx = data['SPX']
spx.plot(ax=ax, style='k-')
crisis_data = [(datetime(2007, 10, 11), 'Peak of bull market'),
               (datetime(2008, 3, 12), 'Bear Stearns Fails'),
               (datetime(2008, 9, 15), 'Lehman Bankruptcy')]
for date, label in crisis_data:
    ax.annotate(label, xy=(date, spx.asof(date) + 75),
                xytext=(date, spx.asof(date) + 225),
                arrowprops=dict(facecolor='black', headwidth=4, width=2,
                                headlength=4),
                horizontalalignment='left', verticalalignment='top')
ax.set_xlim(['1/1/2007', '1/1/2011'])
ax.set_ylim([600, 1800])
_ = ax.set_title('Important dates in the 2008-2009 financial crisis')

:::


::: {#exm-}
Here is an example of arrows with different shapes. For more details please read the official [documentation](https://matplotlib.org/stable/tutorials/text/annotations.html#annotating-with-arrow).

In [None]:
fig, ax = plt.subplots()

x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis('equal')

ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
            arrowprops=dict(facecolor='black', shrink=0.05))

ax.annotate('local minimum', xy=(5 * np.pi, -1), xytext=(2, -6),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="angle3,angleA=0,angleB=-90",
                            color='r'))

:::


## `seaborn`

There are some new libraries built upon `matplotlib`, and `seaborn` is one of them. `seaborn` is for statistical graphics. 

`seaborn` is usually imported in the following way.


In [None]:
import seaborn as sns

In [None]:
#| echo: false
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

`seaborn` also modifies the default `matplotlib` color schemes and plot styles to improve readability and aesthetics. Even if you do not use the `seaborn` API, you may prefer to import `seaborn` as a simple way to improve the visual aesthetics of general `matplotlib` plots.

To apply the `seaborn` theme, run the following code.


In [None]:
sns.set_theme()

Let us rerun some code from the last section and compare the results.



::: {#exm-plotwithsns}


In [None]:
plt.plot(np.random.randn(30).cumsum(), color='r', linestyle='--', marker='o')

Please compare the output of the same code with [the previous example](#exm-plot1).
:::


### Scatter plots with `relplot()`

The basic scatter plot method is `scatterplot()`. It is wrapped in `relplot()` as the default plotting method, so here we mainly talk about `relplot()`. It is named that way because it is designed to visualize many different statistical relationships.

The idea of `relplot()` is to display points based on the variables `x` and `y` you choose, and to map further columns to properties that alter the appearance of the points.

- `col` will create multiple plots based on the column you choose.
- `hue` is for color encoding, based on the column you choose.
- `size` will change the marker area, based on the column you choose.
- `style` will change the marker symbol, based on the column you choose.


::: {#exm-}
Consider the following example. `tips` is a DataFrame, which is shown below.


In [None]:
import seaborn as sns
tips = sns.load_dataset("tips")
tips

In [None]:
sns.relplot(data=tips,
            x="total_bill", y="tip", col="time",
            hue="smoker", style="smoker", size="size")

:::

The default plot type of `relplot()` is the scatter plot. However, you may change it to a line plot by setting `kind='line'`.


::: {#exm-}


In [None]:
dots = sns.load_dataset("dots")
sns.relplot(data=dots, kind="line",
            x="time", y="firing_rate", col="align",
            hue="choice", size="coherence", style="choice",
            facet_kws=dict(sharex=False))

:::

### `regplot()`
This method combines a scatter plot with a fitted linear regression line.

::: {#exm-}
We still use `tips` as an example.

In [None]:
sns.regplot(x='total_bill', y='tip', data=tips)

:::


### `pairplot()`
This is a way to display the pairwise relations among several variables. 

::: {#exm-}
The following code shows the pairplots among all numeric data in `tips`.

In [None]:
sns.pairplot(tips, diag_kind='kde', plot_kws={'alpha': 0.2})

:::


### `barplot()`


::: {#exm-}


In [None]:
sns.barplot(x='total_bill', y='day', data=tips, orient='h')

In the `tips` data there are several `total_bill` values for each `day`. The length of each bar is the average of `total_bill` for that `day`, and the black line stands for the `95%` confidence interval of that average.


In [None]:
sns.barplot(x='total_bill', y='day', hue='time', data=tips, orient='h')

In this plot, lunch and dinner are distinguished by colors.
:::
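
The averaging behind the bars can be reproduced with `pandas` alone, since the mean is the default estimator of `sns.barplot()`. The following is a minimal sketch on a tiny hand-made table. The table `toy` and its numbers are made up for this illustration and are not part of `tips`.

```python
import pandas as pd

# A tiny hand-made table in the shape of `tips`; the numbers are made up.
toy = pd.DataFrame({'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Fri'],
                    'total_bill': [10.0, 20.0, 12.0, 18.0, 30.0]})

# One mean per day: these are the bar lengths a bar plot of this table shows.
bar_lengths = toy.groupby('day')['total_bill'].mean()
```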

### Histogram

::: {#exm-}


In [None]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
y = mu-30 + sigma*2 * np.random.randn(10000)
df = pd.DataFrame(np.array([x,y]).T)
sns.histplot(df, bins=100, kde=True)

Please compare this plot with [this example](#exm-histogram1).
:::


## Examples


### Example 1: `USA.gov` Data From `Bitly`
In 2011, URL shortening service `Bitly` partnered with the US government website
`USA.gov` to provide a feed of anonymous data gathered from users who shorten links
ending with `.gov` or `.mil`. The data is taken from @McK2017.

The data file can be downloaded from [here](assests/datasets/example.txt). The file is mostly in JSON. It can be converted into a DataFrame by the following code.

In [None]:
import pandas as pd
import numpy as np
import json
path = 'assests/datasets/example.txt'
# Read one JSON record per line and close the file afterwards.
with open(path) as f:
    df = pd.DataFrame([json.loads(line) for line in f])

We mainly use the `tz` and `a` columns, so let us clean them first.


In [None]:
#| warning: false
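# Missing values become 'Missing' and empty strings become 'Unknown'.
# .loc avoids chained assignment, which may silently fail to modify df.
df['tz'] = df['tz'].fillna('Missing')
df.loc[df['tz'] == '', 'tz'] = 'Unknown'
df['a'] = df['a'].fillna('Missing')
df.loc[df['a'] == '', 'a'] = 'Unknown'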
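
To see what this cleaning step does, here is a minimal sketch on a toy DataFrame with one real value, one empty string and one missing value. The table `toy` is made up for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one real value, one empty string, one missing value.
toy = pd.DataFrame({'tz': ['America/New_York', '', np.nan]})

toy['tz'] = toy['tz'].fillna('Missing')
# .loc selects the rows and the column in one step.
toy.loc[toy['tz'] == '', 'tz'] = 'Unknown'
cleaned = toy['tz'].tolist()
```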

We first want to extract the timezone information from it. The timezone info is in the column `tz`.


In [None]:
tzone = df['tz']
tvc = tzone.value_counts()
tvc

After cleaning the data, we would like to visualize the value counts.


In [None]:
import seaborn as sns
sns.barplot(x=tvc[:10].values, y=tvc[:10].index)

We then would like to extract information from the column `a`. This column is about the agent of the connection. The important info is the part before the first space `' '`.


In [None]:
agent = df['a']
agent = agent.str.split(' ').str[0]
avc = agent.value_counts()
avc[:10]

Now let us assume that the user is running the `Windows` OS if `Windows` appears in column `a`, and some other OS otherwise. Under this assumption, the OS can be detected by the following code.


In [None]:
df['os'] = np.where(df['a'].str.contains('Windows'), 'Windows', 'Not Windows')
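
The two string operations used on column `a` can be tried on a small Series first. The following is a minimal sketch; the two agent strings are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Two made-up agent strings, used only for illustration.
agents = pd.Series(['Mozilla/5.0 (Windows NT 6.1; WOW64)',
                    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)'])

# The part before the first space.
browser = agents.str.split(' ').str[0]

# A boolean mask, turned into one of two labels by np.where.
is_windows = agents.str.contains('Windows')
os_label = np.where(is_windows, 'Windows', 'Not Windows')
```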

Now we can make a bar plot of the counts based on `os` and `tz`.


In [None]:
tz_os_counts = df.groupby(['tz', 'os']).size().unstack().fillna(0)
tz_os_counts.head()

We then keep the 10 time zones with the largest total counts and turn the result into a long-format DataFrame using the `.stack()` and `.unstack()` tricks.


In [None]:
tovc = tz_os_counts.stack()[tz_os_counts.sum(axis=1).nlargest(10).index]
tovc.name = 'count'
dftovc = pd.DataFrame(tovc).reset_index()
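
The round trip between the wide and the long table is easier to follow on a small example. The following is a minimal sketch with a made-up table `toy` of three records.

```python
import pandas as pd

toy = pd.DataFrame({'tz': ['A', 'A', 'B'],
                    'os': ['Windows', 'Not Windows', 'Windows']})

# Wide table: one row per tz, one column per os; missing pairs become 0.
wide = toy.groupby(['tz', 'os']).size().unstack().fillna(0)

# Long table again: one row per (tz, os) pair with a 'count' column.
long_df = wide.stack().rename('count').reset_index()
```

The long format is what `sns.barplot()` expects when `hue` is used, since each of `x`, `y` and `hue` refers to one column.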

Finally we may draw the bar plot.


In [None]:
sns.barplot(x='count', y='tz', hue='os', data=dftovc)

### Example 2: US Baby Names 1880–2010
The United States Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through the present. Hadley Wickham, an author of several popular R packages, has often made use of this dataset in illustrating data manipulation in `R`. The dataset can be downloaded from [here](assests/datasets/babynames.zip) as a zip file. Please unzip it and put it in your working folder.

In the folder there are 131 `.txt` files. The naming scheme is `yob` + the year. Each file contains 3 columns: `name`, `gender`, and `counts`. We would like to add a column `year`, and combine all files into a single DataFrame. In our example, the year is from 1880 to 2010.


In [None]:
#| eval: false
import pandas as pd

path = 'assests/datasets/babynames/'
dflist = list()
for year in range(1880, 2011):
    filename = path + 'yob' + str(year) + '.txt'
    df = pd.read_csv(filename, names=['name', 'gender', 'counts'])
    df['year'] = year
    dflist.append(df)
df = pd.concat(dflist, ignore_index=True)

We can plot the total births by sex and year.


In [None]:
#| eval: false
import seaborn as sns

# Sum only the numeric column `counts`; the string column `name` is left out.
sns.relplot(data=df.groupby(['gender', 'year'])['counts'].sum().reset_index(),
            x='year', y='counts', hue='gender', kind='line')

In [None]:
#| echo: false
import seaborn as sns
sns.set_theme()
dfgy = pd.read_csv('assests/datasets/gy.csv')
sns.relplot(data=dfgy,
            x='year', y='counts', hue='gender', kind='line')

For further analysis, we would like to compute the proportions of each name relative to the total number of births per year per gender. 


In [None]:
#| eval: false
# Divide each count by the total of its (gender, year) group.
# transform('sum') keeps the original row index of df.
group_total = df.groupby(['gender', 'year'])['counts'].transform('sum')
df['prop'] = df['counts'] / group_total
df.head()

In [None]:
#| echo: false
babydf5 = pd.read_csv('assests/datasets/babydf5.csv')
babydf5
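
A useful sanity check is that the proportions within each gender and year add up to 1. The following is a minimal sketch on a made-up table `toy`, where the group totals are computed with `transform('sum')`.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Mary', 'Anna', 'John', 'Mary', 'John'],
                    'gender': ['F', 'F', 'M', 'F', 'M'],
                    'counts': [30, 10, 20, 5, 15],
                    'year': [1880, 1880, 1880, 1881, 1881]})

# transform('sum') returns the group total aligned with the original rows.
totals = toy.groupby(['gender', 'year'])['counts'].transform('sum')
toy['prop'] = toy['counts'] / totals

# Sanity check: within every group the proportions add up to 1.
check = toy.groupby(['gender', 'year'])['prop'].sum()
```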

Now we would like to keep the 100 most frequent names for each year and gender, and save the result as a new DataFrame `top100`.


In [None]:
#| eval: false
top100 = (
    df.groupby(['year', 'gender'])
    .apply(lambda x: df.loc[x['counts'].nlargest(100).index])
    .drop(columns=['year', 'gender'])
    .reset_index()
    .drop(columns='level_2')
)
top100.head()

In [None]:
#| echo: false
dftop100 = pd.read_csv('assests/datasets/baby100.csv')
dftop100

Note that `level_2` holds the original row index, which `reset_index()` turns into a column. We don't need it here, so we drop it.
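
An alternative that avoids the extra index level is to sort once and keep the first rows of every group. This is only a sketch of a different technique on a made-up table `toy`, not the code used in this chapter, and the two approaches may break ties differently.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Mary', 'Anna', 'Emma', 'John', 'Will'],
                    'gender': ['F', 'F', 'F', 'M', 'M'],
                    'counts': [30, 10, 20, 25, 15],
                    'year': [1880] * 5})

# Sort once, then keep the first n rows of every (year, gender) group.
n = 2  # the chapter uses 100; 2 keeps the toy output small
top = (toy.sort_values('counts', ascending=False)
          .groupby(['year', 'gender'])
          .head(n)
          .reset_index(drop=True))
```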

Now we would like to draw the trend of some names.


In [None]:
#| eval: false
namelist = ['John', 'Harry', 'Mary']
sns.relplot(data=top100[top100['name'].isin(namelist)],
            x='year', y='counts', hue='name', kind='line')

In [None]:
#| echo: false
babynamelist = pd.read_csv('assests/datasets/babynamelist.csv')
sns.relplot(data=babynamelist,
            x='year', y='counts', hue='name', kind='line')

Now we would like to analyze the ending of names.


In [None]:
#| eval: false
df['ending'] = df['name'].str[-1]
# Sum only the numeric columns; the string column `name` is left out.
endingcount = (df.groupby(['gender', 'year', 'ending'])[['counts', 'prop']]
                 .sum().reset_index())

We would like to draw bar plots to show the distributions in the years 1910, 1960 and 2010.


In [None]:
#| eval: false
import matplotlib.pyplot as plt

certainyear = endingcount[endingcount['year'].isin([1910, 1960, 2010])]

fig, axs = plt.subplots(2, 1, figsize=(10,7))
# Build each mask from `certainyear` itself so that it matches its rows.
sns.barplot(data=certainyear[certainyear['gender']=='M'],
            x='ending', y='prop', hue='year', ax=axs[0])
sns.barplot(data=certainyear[certainyear['gender']=='F'],
            x='ending', y='prop', hue='year', ax=axs[1]).legend_.remove()

In [None]:
#| echo: false
babyyear = pd.read_csv('assests/datasets/babyyear.csv')
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 1, figsize=(10,7))
sns.barplot(data=babyyear[babyyear['gender']=='M'],
            x='ending', y='prop', hue='year', ax=axs[0])
sns.barplot(data=babyyear[babyyear['gender']=='F'],
            x='ending', y='prop', hue='year', ax=axs[1]).legend_.remove()

We would also like to draw a line plot to show the trend of certain letters over the years.


In [None]:
#| eval: false
sns.relplot(data=endingcount[endingcount.ending.isin(['d', 'n', 'y'])],
            x='year', y='prop', hue='ending', kind='line')

In [None]:
#| echo: false
babyending = pd.read_csv('assests/datasets/babyending.csv')
sns.relplot(data=babyending[babyending.ending.isin(['d', 'n', 'y'])],
            x='year', y='prop', hue='ending', kind='line')

## Exercises



::: {#exr-}
Please download the `mtcars` file from [here](assests/datasets/mtcars.csv) and read it as a DataFrame. Then create a scatter plot of the `drat` and `wt` variables from `mtcars` and color the dots by the `carb` variable. 
:::


::: {#exr-}
Please consider the baby name dataset. Please draw the trends of counts of names ending in `a`, `e`, `n` across years for each gender.
:::

## Projects



::: {#exr-}
Please read the file as a DataFrame from [here](assests/datasets/QSR.csv). This is the questionnaire data on dining satisfaction with quick service restaurants, provided by Dr. Siri McDowall and supported by a DART SEED grant.

1. Please pick out all rating columns. Excluding `last.visit`, `visit.again` and `recommend`, compute the mean of the rest and add it to the DataFrame as a new column.
2. Use a plot to show the relations among these four columns:  `last.visit`, `visit.again`, `recommend` and `mean`.
3. Look at the column `Profession`. Keep `Student`, and change everything else to be `Professional`, and add it as a new column `Status` to the DataFrame.
4. Draw the histogram of `mean` with respect to `Status`.
5. Find the counts of each `recommend` rating for each `Status` and draw the barplot. Do the same to `last.visit`/`Status` and `visit.again`/`Status`.
6. Explore the dataset and draw one plot.
:::



::: {#exr-}
Please use the baby name dataset. We would like to consider the diversity of the names. For each year and each gender, please compute the number of most popular names that make up the top 50% of births. Draw a line plot to show the trend and discuss the result.
:::
