# Data Visualization - Part 1

Visualization is an important aspect of communicating stories and insights around data. In this lecture we will be using functions from two important visualization libraries: `matplotlib` and `seaborn`. Within the `matplotlib` library, most of our functions will be coming from `pyplot`, which we abbreviate as `plt`.

Depending on the goals of your visualization, you may choose to use `matplotlib` (`plt`) or `seaborn`. Both libraries can create many of the same types of visualizations. However, they do not always offer the same options and flexibility for customizing a visualization. Thus, it's important to become familiar with the documentation of both libraries to understand how to create graphs with each.

Before we get started, let's remember what types of data are best for which visualizations.

### <code style="background:#83ebd5;color:black">Exercise: Data Visualization Recap</code>

**Fill in the blanks below:**

- Numerical data consists of (***Blank_1***) and (***Blank_2***) values. (***Blank_1***) values are finite integers, while (***Blank_2***) values can take on any real number in an interval.

Blank_1:
Blank_2: 

- Scatter plots, histograms, and line plots are best for visualizing (***Blank_3***) data. Pie charts and bar graphs are best for visualizing (***Blank_4***) data.

Blank_3:
Blank_4:

---

For this lecture, we will be using data on the number of births per state for the years 2016-2021, recorded by the CDC. We will import the necessary libraries and load the data as a dataframe called `babies`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')     # specifies the matplotlib style for graphs, see references for more info

In [None]:
babies = pd.read_csv('../datasets/us_births.csv')
babies

We can begin to explore and understand this dataset by using visualization techniques. How we choose to visualize the data will depend on the type of data (numerical versus categorical). We will learn visualizations previously discussed as well as a few new ones.

## Scatter plots

Scatter plots are most commonly used to visualize two continuous numerical variables against each other. They also work well when the data take on a large number of different discrete integer values.

We will use a scatter plot to visualize the average age of mothers versus the average birth weight.

Because this dataset is so large, let's first subset it so that we only visualize data from the states of our home institutions. First we need to check what the unique values are in the `State` column.

In [None]:
babies['State'].unique()

### <code style="background:#83ebd5;color:black">Exercise: Data processing for visualization</code>

Subset `babies` to only include data from states of the home institutions of the CAN network. Assign this subset of data to the variable `states` below:

In [None]:
## ANSWER
babies['State'].unique()
states = babies.loc[(babies['State'] == 'District of Columbia') | (babies['State'] == 'California') | 
                    (babies['State'] == 'Illinois') | (babies['State'] == 'North Carolina') |
                    (babies['State'] == 'Texas') | (babies['State'] == 'Georgia')]
states.head()
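
A more compact way to write the same kind of filter is the `Series.isin()` method, which avoids chaining many `|` comparisons. The sketch below uses a small made-up dataframe (`toy`) and a hypothetical list of states (`keep`) rather than the real `babies` data, so the values are for illustration only:

```python
import pandas as pd

# Toy stand-in for the `babies` dataframe (hypothetical values)
toy = pd.DataFrame({
    "State": ["California", "Texas", "Ohio", "Georgia", "Ohio"],
    "Number of Births": [100, 80, 40, 50, 45],
})

# Hypothetical list of states to keep; extend it as needed
keep = ["California", "Texas", "Georgia"]

# `isin` is True for every row whose State appears in `keep`
subset = toy.loc[toy["State"].isin(keep)]
print(subset)
```

The same pattern should carry over to `babies` by swapping in the real dataframe and the full list of states.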

<u>Useful parameters for `plt.scatter()`</u>
- `x`: an array of values to be plotted on the x-axis
- `y`: an array of values to be plotted on the y-axis
- `c`: an array of values corresponding to colors or values that can be mapped to colors in the `cmap` parameter
- `cmap`: specifies the colormap used to color the variables (referenced below)
- `s`: dictates the size of the data points

In [None]:
plt.figure(figsize=(15,6)) 

plt.scatter(x= states['Average Age of Mother (years)'], y= states['Average Birth Weight (g)'])
plt.show()

We can also change some aesthetics about our plot by using the `c`, `cmap`, and `s` parameters. 

To color data points based on a particular column, we can use the `pd.Categorical()` function to convert the values of a column to a `categorical` data type, which is innate to the `pandas` library.

Accessing the `.codes` attribute of this categorical returns an array of integer codes (one per row), which can be used in the `c` parameter of `plt.scatter()`:

In [None]:
plt.figure(figsize=(15,6)) 

labelz = pd.Categorical(states['State']).codes

plt.scatter(x=states['Average Age of Mother (years)'], y=states['Average Birth Weight (g)'], 
           c = labelz, cmap='tab20', s=75)

plt.show()
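
To see what `pd.Categorical()` produces, it can help to inspect a tiny example. The list of state names below is made up for illustration; the point is how `.categories` and `.codes` relate to each other:

```python
import pandas as pd

# Hypothetical column of state names
labels = pd.Categorical(["Texas", "California", "Texas", "Illinois"])

# The distinct labels, stored in sorted order
print(labels.categories)

# For each original value, its position within `categories`
print(labels.codes)
```

Here `California` gets code 0, `Illinois` code 1, and `Texas` code 2, so the codes follow the sorted category order rather than the order in which values first appear.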



---

<u>Additional functions for plotting:</u>

`plt.legend()` - places a legend within the plot
- `handles`: displays a color, shape, icon, etc., representing each category in the plot
- `labels`: text labels of the categories
- `ncol`: sets the number of columns the legend has
- `labelcolor`: sets the color of the text in the legend
- `bbox_to_anchor`: positions the legend within the plotting area

`plt.title()` - makes a title for the plot

In [None]:
plt.figure(figsize=(15,6)) 

labelz = pd.Categorical(states['State']).codes

scatter = plt.scatter(x=states['Average Age of Mother (years)'], y=states['Average Birth Weight (g)'], 
           c = labelz, cmap='tab20', s=75)


plt.legend(handles=scatter.legend_elements()[0], labels = list(pd.Categorical(states['State']).categories), 
           bbox_to_anchor=(1, 1.01))                                  # Legend added; labels listed in the same order as the codes

plt.title('Average Mother Age vs Average Birth Weight')               # title added

plt.show()

For a more straightforward way of coloring data points based on a category in a variable, try `sns.scatterplot()`.

Refer to the <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html">matplotlib</a> or <a href="https://seaborn.pydata.org/generated/seaborn.scatterplot.html">seaborn</a> documentation to learn more about how to create scatter plots in these libraries.
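
As a rough sketch of the seaborn approach, the `hue` parameter of `sns.scatterplot()` colors points by a column and builds the legend automatically. The dataframe `toy` below is a made-up stand-in that only borrows the column names from `babies`; its values are hypothetical:

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Toy data with the same column names as the lecture dataset (hypothetical values)
toy = pd.DataFrame({
    "State": ["California", "California", "Texas", "Texas", "Georgia", "Georgia"],
    "Average Age of Mother (years)": [30.1, 31.0, 28.2, 28.9, 27.8, 28.4],
    "Average Birth Weight (g)": [3300, 3320, 3250, 3270, 3190, 3210],
})

plt.figure(figsize=(7, 5))

# `hue` assigns one color per state and adds a matching legend
ax = sns.scatterplot(data=toy, x="Average Age of Mother (years)",
                     y="Average Birth Weight (g)", hue="State", s=75)
ax.set_title("Average Mother Age vs Average Birth Weight")

plt.show()
```

No manual conversion to category codes is needed here, which is the main convenience over `plt.scatter()`.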

## Histograms

Histograms are used to visualize distributions by quantifying the number of items that fall within a particular *bin*. Bins are basically subdivisions or partitioned "buckets" of a numeric variable.

We can use a histogram to visualize the distribution of ages of mothers in our dataset. First, we need to see what the range of ages is. We can do this by finding the minimum and maximum values in our dataset:

In [None]:
print(babies["Average Age of Mother (years)"].min())
print(babies["Average Age of Mother (years)"].max())

### Making a histogram

We need to know the minimum and maximum values of ages so that we can determine the range of our bins when making our histogram. 

For our plot, let's create a variable called `binz`, which will be an array starting at 23 and ending at 36 (inclusive).

To construct the histogram, we use the `plt.hist()` function:

In [None]:
plt.figure(figsize=(7,5)) 

binz = np.arange(23, 37)                      # creates an array of numbers that determine the ranges of the bins
plt.hist(x=babies["Average Age of Mother (years)"], bins = binz)
plt.show()

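Alternatively, we could decrease the number of bins by including 2 years per bin, using an array that starts at 23 and increases in increments of 2. This will change the look of our histogram. Note that `np.arange(23, 37, 2)` stops at 35, so any values above 35 fall outside the last bin; use a larger stop value (for example, 38) if your data extend beyond that: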

In [None]:
plt.figure(figsize=(7,5)) 

binz = np.arange(23, 37, 2)                            # binz set to increments of 2
plt.hist(x=babies["Average Age of Mother (years)"], bins = binz)
plt.show()
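
To check exactly which bin edges an `np.arange()` call produces, and how many values land in each bin, the edges can be printed and passed to `np.histogram()`, which does the same kind of counting as `plt.hist()` without drawing anything. The `ages` array below is hypothetical:

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([23.5, 24.1, 26.0, 29.9, 30.0, 34.2, 35.0, 36.4])

edges_1yr = np.arange(23, 37)       # stop value is excluded, so the last edge is 36
edges_2yr = np.arange(23, 37, 2)    # 23, 25, ..., 35
print(edges_1yr)
print(edges_2yr)

# Count the values in each 2-year bin; values outside the edges are ignored
counts, _ = np.histogram(ages, bins=edges_2yr)
print(counts)
print(counts.sum(), "of", ages.size, "values were binned")
```

In this toy case one value (36.4) lies beyond the last edge and is left out, which is worth checking for whenever bins are set by hand.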

---
<u>Additional features for plots:</u>

`plt.xlabel` - labels the x-axis

`plt.ylabel` - labels the y-axis

In [None]:
plt.figure(figsize=(7,5)) 

binz = np.arange(23, 37, 2)
plt.hist(x=babies["Average Age of Mother (years)"], bins = binz)
plt.xlabel('Age (years)')                                   # x-label added
plt.ylabel('Count')                                         # y-label added
plt.title('Age Distribution of Mothers')                    # title added
plt.show()

<u>Useful parameters for `plt.hist()`</u>
- `x`: an array of values to be binned and plotted
- `bins`: an array of values that set the ranges of each bin **or** a single number that determines the number of bins
- `rwidth`: dictates the width of each bar relative to the width of its bin
- `orientation`: toggles between a vertical and horizontal plot
- `color`: dictates the color of the bar; any of the accepted `matplotlib` colors or hexadecimal code can be passed as an argument (see references).

In [None]:
plt.figure(figsize=(7,5)) 

binz = np.arange(23, 37, 2)
plt.hist(x=babies["Average Age of Mother (years)"], bins = binz, 
         orientation = 'horizontal', rwidth = 0.75, color = 'limegreen') # orientation, bar width, and color changed
plt.xlabel('Count')                                                      # x and y labeling are switched
                                                                         # due to orientation
plt.ylabel('Age (years)')
plt.title('Age Distribution of Mothers')                                 # title added
plt.show()

For more information on how to create custom histograms, refer to the <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html">matplotlib histogram</a> documentation or the <a href="https://seaborn.pydata.org/generated/seaborn.histplot.html">seaborn</a> equivalent.

## Line Graphs

Line graphs are commonly used to display sequential data to easily see trends over time. 

We will use `plt.plot()` to make a line graph showing the number of total births per year. First, we need to organize the data so that it is grouped by year and then find the total number of births for each group. 


Type the code that would accomplish this below. Assign it to the variable `years`.

In [None]:
## ANSWER

years = babies.groupby('Year')[['Number of Births']].sum()   # select the column first so only births are summed
years

Now that we have our data organized in this way, we can easily plot it by using `years` as an argument for `plt.plot()`:

In [None]:
plt.figure(figsize=(7,5)) 

plt.plot(years)
plt.show()
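
Being explicit about which values go on each axis can make the call easier to read and customize. The sketch below uses a small made-up dataframe (`toy`) instead of `babies`, groups it by year, and passes the index and the column to `plt.plot()` separately:

```python
import pandas as pd
from matplotlib import pyplot as plt

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017],
    "State": ["Ohio", "Texas", "Ohio", "Texas"],
    "Number of Births": [10, 20, 30, 40],
})

# Total births per year; the years become the index
totals = toy.groupby("Year")[["Number of Births"]].sum()

plt.figure(figsize=(7, 5))

# x-values come from the index, y-values from the column
line, = plt.plot(totals.index, totals["Number of Births"])
plt.show()
```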

<u>Useful parameters for `plt.plot()`</u>
- `color`: dictates the color of the line
- `linewidth`: sets the weight of the line
- `linestyle`: sets the style of the line; accepted arguments include '-', '--', '-.', and ':'

---

<u>Additional features for plots:</u>

`plt.xticks` - sets tick locations and labels on the x-axis


In [None]:
plt.figure(figsize=(7,5)) 

plt.plot(years, color = '#3a45c5', linewidth = 3, linestyle= '--') # color changed with a hexadecimal code,
                                                                   # line width and line style altered
plt.xlabel('Year')                                                 # x-label added
plt.ylabel('Number of Births (in millions)')                       # y-label added
plt.title('Births from 2016-2021')                                 # title added
plt.xticks(np.arange(2016, 2022, 2))                               # increments of x-axis changed
plt.show()

There are a ton of ways to customize a line graph. See the <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html">matplotlib</a> or <a href="https://seaborn.pydata.org/generated/seaborn.lineplot.html">seaborn</a> line graph documentation for more information.

### <code style="background:#83ebd5;color:black">Exercise: Creating a visualization</code>

Using `matplotlib`, create a histogram of the average birth weight using the `babies` dataset. Make the bars of the histogram take up 75% (0.75) of the width of the bins. Color the bars a purplish color.

In [None]:
# ANSWER

plt.figure(figsize=(7,5)) 

binz = np.arange(2400, 3700, step = 100)
plt.hist(x=babies["Average Birth Weight (g)"], bins = binz, color = 'mediumorchid', rwidth = 0.75)
plt.xlabel('Average Birth Weight (g)')
plt.ylabel('Count')
plt.title('Count of Average Birth Weight')
plt.show()

## Bar Graphs

Bar graphs are best to use when you have a limited number of categories to visualize. 

We will be using a bar graph to visualize the number of babies born to mothers of different educational levels. First we will need to determine the total number of babies born to each educational group. 

### <code style="background:#83ebd5;color:black">Exercise: Data processing for visualization</code>

Write code below that makes a dataframe that gives the total number of births for each educational level. Assign this dataframe to the variable `ed`.

In [None]:
## ANSWER

ed = babies.groupby('Education Level of Mother')[['Number of Births']].sum()   # select the column first so only births are summed
ed.reset_index(inplace=True)
ed

Now that we have our data grouped by education level, we can use `plt.bar()` to create a bar graph, using the education level as the `x` parameter and the number of births as the `height` parameter (i.e., the y-variable):

In [None]:
plt.figure(figsize=(7,5)) 
plt.bar(x = ed['Education Level of Mother'], height= ed['Number of Births'])
plt.show()

As you can see, the text of our x-labels is all jumbled together because some of the education levels have long descriptions. We can fix this by rotating the labels using `plt.xticks()` and passing 270 as the argument for the `rotation` parameter. 

As before, we will add other labeling and a title to our graph:

In [None]:
plt.figure(figsize=(7,5)) 
plt.bar(x = ed['Education Level of Mother'], height= ed['Number of Births'])
plt.xticks(rotation=270)                           # Rotate x-labels by 270 degrees
plt.xlabel('Education Level')                      # x-label added
plt.ylabel('Number of births (millions)')          # y-label added
plt.title('Number of births per education level')  # title added
plt.show()

This graph is better, but still not ideal. The x-labels elongate the graph in a visually displeasing way.

We can use `plt.barh()` to change the orientation:

In [None]:
plt.figure(figsize=(7,5)) 
plt.barh(y = ed['Education Level of Mother'], width= ed['Number of Births'])
plt.xlabel('Number of births (millions)')          # x-label added, switched due to orientation
plt.ylabel('Education Level')                      # y-label added, switched due to orientation
plt.title('Number of births per education level')  # title added
plt.show()

<u>Useful parameters for `plt.bar()`</u>
- `x`: values/categories of the x-axis
- `height`: numerical value of each bar corresponding to the height

<u>Useful parameters for `plt.barh()`</u>
- `y`: values/categories of the y-axis
- `width`: numerical value of each bar corresponding to the width of the bar

---

Refer to the <a href="https://seaborn.pydata.org/generated/seaborn.barplot.html">seaborn</a> or <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html">matplotlib</a> bar plot documentation for more information. Documentation for horizontal bar plots can be found <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.barh.html#matplotlib.pyplot.barh">here</a>.

## Box and whisker plot

Box and whisker plots are a useful visualization tool that provides several quantitative measures of a distribution across multiple categories. 
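
The quantitative measures behind a box plot can be computed by hand, which may help when reading one. The sketch below uses a made-up series of ages and assumes the common convention (the default in both `matplotlib` and `seaborn`) that whiskers reach at most 1.5 times the interquartile range beyond the box, with points past that drawn as outliers:

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 40.0])

q1 = ages.quantile(0.25)     # bottom edge of the box
median = ages.median()       # line inside the box
q3 = ages.quantile(0.75)     # top edge of the box
iqr = q3 - q1                # height of the box

# Limits beyond which points are treated as outliers (1.5 * IQR convention)
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = ages[(ages < lower_limit) | (ages > upper_limit)]

print(q1, median, q3, iqr)
print(lower_limit, upper_limit)
print(outliers.tolist())
```

In this toy series only the value 40 lies beyond the upper limit, so it is the single point that would be drawn separately from the whiskers.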

We will use box and whisker plots to visualize the average age of mothers from each state who gave birth to male and female babies. First we need to group the data based on the sex of the child and the state in which the child was born. Then, we need to find the average age of the mothers in each group.

### <code style="background:#83ebd5;color:black">Exercise: Data processing for visualization</code>

Write code that creates a dataframe that organizes births by gender and finds the average age of mothers for each gender per state. Assign this dataframe to the variable `age`.

In [None]:
# ANSWER

age = babies.groupby(['Gender', 'State'])[['Average Age of Mother (years)']].mean()
age.reset_index(inplace=True)
age

Box and whisker plots can be easily made using `sns.boxplot()`. Simply specify the `data`, `x`, and `y` parameters like so:

In [None]:
plt.figure(figsize=(7,5)) 

sns.boxplot(data=age, x="Gender", y="Average Age of Mother (years)")

plt.show()

To see the distribution of data points, `sns.swarmplot()` can be used along with `sns.boxplot()`. The same `x` and `y` parameters can be specified, along with others, to situate the data points on top of the box plot:

In [None]:
plt.figure(figsize=(7,5)) 

sns.boxplot(data=age, x="Gender", y="Average Age of Mother (years)", palette = 'husl', width= 0.5)
sns.swarmplot(data=age, x="Gender", y="Average Age of Mother (years)", color='black', size=5)

plt.show()

<u>Useful parameters for `sns.boxplot()`</u>
- `data`: dataframe or array containing data to be plotted
- `x`: variable of data to plot on the x-axis
- `y`: variable of data to plot on the y-axis
- `palette`: color palette to be used to color plotted variables
- `width`: width of the boxes

<u>Useful parameters for `sns.swarmplot()`</u>
- `data`: dataframe or array containing data to be plotted
- `x`: variable of data to plot on the x-axis
- `y`: variable of data to plot on the y-axis
- `color`: color of data points
- `size`: size of data points

Box and whisker plots can be made in either <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html">matplotlib</a> or <a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html">seaborn</a>. Refer to the documentation to learn how to use both.

## Pie chart

Pie charts are a very simple data visualization method used to display proportion data. They go over well with lay-person audiences. As previously discussed, while simple and effective, pie charts should be used with discretion.

Here, we can create a pie chart to visualize the total number of males and females born between 2016 and 2021. First, let's group our data accordingly:

In [None]:
sex = babies.groupby('Gender')[['Number of Births']].sum()   # select the column first so only births are summed
sex.reset_index(inplace=True)
sex

We create a simple pie chart using `plt.pie()`, passing the `Number of Births` column as our data. The optional `radius` parameter specifies how big the pie chart will be. 

In [None]:
plt.figure(figsize=(5,5)) 

plt.pie(x = sex['Number of Births'])

plt.show()

This is obviously not an informative graph, but we can improve it by specifying additional parameters in `plt.pie()`:

In [None]:
plt.figure(figsize=(5,5)) 

plt.pie(x = sex['Number of Births'], radius=1.5, labels = sex['Gender'], normalize = True, 
        autopct="%.1f%%", textprops={'fontsize': 15, 'weight':'bold'})

plt.show()

<u>Useful parameters for `plt.pie()`</u>
- `x`: sequence of numbers dictating the size of each wedge
- `radius`: radius of the pie
- `labels`: labels for each wedge
- `normalize`: when set to `True`, `x` values will be normalized to a proportion of 1
- `autopct`: dictates labeling of the wedges; follows formatting of string literals (referenced below)
- `textprops`: controls text objects in the plot
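
The `autopct` string follows Python's printf-style formatting, so its effect can be previewed without drawing a chart. The sketch below uses made-up counts, converts them to percentages the way normalization does, and applies the same `"%.1f%%"` format used above:

```python
import numpy as np

# Hypothetical counts for two wedges
counts = np.array([1200, 1150])

# Normalize so the wedges sum to 100 percent
shares = counts / counts.sum() * 100

# "%.1f" keeps one decimal place; "%%" prints a literal percent sign
wedge_labels = ["%.1f%%" % share for share in shares]
print(wedge_labels)
```

Changing `.1f` to `.0f` or `.2f` would change the number of decimals shown on each wedge.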

Documentation on pie charts in matplotlib can be found <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html">here</a>.

## Heatmap

A heatmap is a matrix of data points depicted through a color gradient. Heatmaps are a great way to visualize data when you want to look at a multidimensional comparison of many categorical variables. 


<img src="https://seaborn.pydata.org/_images/spreadsheet_heatmap.png" width="600"/>

<center> <i>An example of a heatmap depicting the number of airline passengers each month over several years. The heatmap is annotated with the numerical values corresponding to the color gradient.</i></center>

<br></br>
We can use a heatmap to visualize variables that have a large number of values, such as the number of births in each state for each year. To do this, we will first need to organize our data.


### <code style="background:#83ebd5;color:black">Exercise: Data processing for visualization</code>

Write code below that constructs a dataframe displaying the total number of births per state for each year from 2016 to 2021. *Hint: Your dataframe should show the states as indexes and years as column titles.*

In [None]:
## ANSWER

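# group by state and year, then move the years into columns
babies.groupby(['State', 'Year'])['Number of Births'].sum().unstack().head()

# or

# states as the index, years as the columns, total births as the values
pivot = babies.pivot_table('Number of Births', index='State', columns='Year', aggfunc='sum')
pivot.head()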
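
The hint above describes a table with states as the index and years as the columns. A minimal sketch of that reshaping on made-up data is shown below; the dataframe `toy` and its values are hypothetical, and only the column names mirror `babies`:

```python
import pandas as pd

# Toy data with several rows per state and year (hypothetical values)
toy = pd.DataFrame({
    "State": ["Ohio", "Ohio", "Ohio", "Texas", "Texas", "Texas"],
    "Year": [2016, 2016, 2017, 2016, 2017, 2017],
    "Number of Births": [10, 5, 20, 30, 40, 2],
})

# One row per state, one column per year, cells hold the summed births
wide = toy.pivot_table(values="Number of Births", index="State",
                       columns="Year", aggfunc="sum")
print(wide)
```

A table of this shape is the kind of 2D input that `sns.heatmap()` expects: each cell becomes one colored square.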

With this dataframe, we will generate our heatmap using `sns.heatmap()`:

<u>Useful parameters for `sns.heatmap()`</u>
- `data`: a 2D dataset, such as a dataframe or a 2D array
- `cmap`: specifies the colormap used to color the intensity of values
- `linewidth`: dictates width of borders
- `linecolor`: dictates color of borders
- `cbar_kws`: provides options to modify the color bar
- `vmin`: sets the minimum value for the color bar; by default, `sns.heatmap()` sets this based on the minimum of your data
- `vmax`: sets the maximum value for the color bar; by default, `sns.heatmap()` sets this based on the maximum of your data

In [None]:
plt.figure(figsize=(6,15)) 

sns.heatmap(data = pivot, cmap='BuGn', linewidth=.5, linecolor="black", # colormap, line width, and color specified,
           cbar_kws={'label': 'Births per year'}, vmin=0, vmax=700000)  # color bar labeled, min and max values set
plt.title('Births per year in each state')                              # title added

plt.show()

<u>Additional features for plots:</u>

`plt.savefig` - saves the current plot as an image in the desired format
- `fname`: sets the file name; accepted arguments are strings representing the relative filepath, including the file name and extension
- `format`: sets the format of the image; accepted arguments include 'png', 'jpeg', 'pdf', 'eps', and 'svg'
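
As a small self-contained sketch, the example below saves a toy figure to the system's temporary directory. The file name `toy_figure.png` and the plotted values are hypothetical; `dpi` and `bbox_inches` are optional `plt.savefig()` parameters that control the resolution and trim extra whitespace:

```python
import os
import tempfile
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(4, 3))
plt.plot([2016, 2017, 2018], [3, 2, 4])   # hypothetical values
plt.title("Toy figure")

# Hypothetical output location inside the temporary directory
out_path = os.path.join(tempfile.gettempdir(), "toy_figure.png")

# Save before calling plt.show(); many setups clear the figure once it is shown
plt.savefig(fname=out_path, format="png", dpi=150, bbox_inches="tight")
plt.close(fig)

print(os.path.exists(out_path))
```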

In [None]:
plt.figure(figsize=(6,15)) 

sns.heatmap(data = pivot, cmap='BuGn', linewidth=.5, linecolor="black", # colormap, line width, and color specified,
           cbar_kws={'label': 'Births per year'}, vmin=0, vmax=700000)  # color bar labeled, min and max values set
plt.title('Births per year in each state')                              # title added

plt.savefig(fname = 'state_heatmap.png', format='png')                  # save as a PNG file

More information on heatmaps can be found in the <a href="https://seaborn.pydata.org/generated/seaborn.heatmap.html">seaborn</a> documentation page.

***

## The `.plot()` method

Dataframes have an innate `.plot()` method that allows you to plot data directly. The `.plot()` method has an `x` and `y` parameter that can specify the x- and y-variables, respectively. It also has a `kind` parameter that allows you to pass a string of the kind of plot you want to create. 


If we wanted to remake the scatter plot above, we could call the `.plot()` method on the `states` dataframe that we used before:

In [None]:
states.plot(x= 'Average Age of Mother (years)', y= 'Average Birth Weight (g)', kind='scatter')
plt.show()

Or we could remake the horizontal bar graph from above:

In [None]:
ed.plot(x = 'Education Level of Mother', y = 'Number of Births', kind = 'barh')
plt.show()

As always, refer to the documentation for the `.plot()` method (referenced below) to understand what parameters it takes and acceptable arguments for these parameters.
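
The `.plot()` method returns a `matplotlib` Axes object, so the labeling tools from earlier can still be applied to the result. The sketch below uses a made-up dataframe (`toy`) with hypothetical values:

```python
import pandas as pd
from matplotlib import pyplot as plt

# Toy yearly totals (hypothetical values)
toy = pd.DataFrame({
    "Year": [2016, 2017, 2018],
    "Number of Births": [3.9, 3.8, 3.7],
})

# kind='line' draws a line graph; the returned Axes can be customized further
ax = toy.plot(x="Year", y="Number of Births", kind="line",
              color="#3a45c5", legend=False)
ax.set_ylabel("Number of Births (hypothetical, in millions)")
ax.set_title("Toy births per year")

plt.show()
```

Calling `plt.ylabel()` and `plt.title()` after `.plot()` would typically have the same effect, since they act on the current Axes.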

### <code style="background:#83ebd5;color:black">Exercise: Data processing for visualization</code>

Create a dataframe that shows the average birth weight for each of the Mid-Atlantic states (NY, NJ, PA, DE, MD, District of Columbia). Assign this dataframe to `mid_weight`.

In [None]:
## ANSWER

mid = babies.loc[(babies['State'] == 'District of Columbia') | (babies['State'] == 'New York') | 
                    (babies['State'] == 'New Jersey') | (babies['State'] == 'Delaware') |
                    (babies['State'] == 'Maryland') | (babies['State'] == 'Pennsylvania')]
mid.head()

In [None]:
# Answer

mid_weight = mid.groupby('State')[['Average Birth Weight (g)']].mean()   # group the Mid-Atlantic subset, not `states`
mid_weight

### <code style="background:#83ebd5;color:black">Exercise: Creating visualization</code>

Create a horizontal bar graph using the `mid_weight` data. Make the bars of the graph orange and the outlines of the bars green. Make sure the line width is thick enough that the outlines are easily visible.

In [None]:
## ANSWER

plt.barh(mid_weight.index, width = mid_weight['Average Birth Weight (g)'], color='#e96016', 
         edgecolor='#55bb44', linewidth=2)
plt.show()

## References

- <a href="https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html">matplotlib style sheets for graphs</a>
- <a href="https://matplotlib.org/stable/tutorials/colors/colormaps.html">matplotlib color maps</a>
- <a href="https://matplotlib.org/stable/gallery/color/named_colors.html">matplotlib colors</a>
- <a href="https://www.canva.com/colors/color-wheel/">hexadecimal color wheel picker</a>
- <a href="https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting">formatting string literals</a>
- <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html">`.plot()` method</a>



## Activity:

Form groups with members of your fellow CAN cohort. Using the data listed on your card (from the `datasets` directory), decide what visualization tool would be best to display the data. Then, make this visualization using the tools discussed during this lecture. Make sure your visuals are properly labeled and legible.