## Lecture 06 : Data Visualization (Part 1)

This is the first of a two-part lecture on data visualization. In this part, we focus on line, scatter, and bar plots. These are helpful both when you are trying to understand your data and when you are communicating its properties.

These are the standard imports we use for most lectures.

In [None]:
from datascience import *
import numpy as np

These are the new commands we use to set up the plotting tools. You don't need to understand them, but you do need to run them at the beginning of your notebook:

In [None]:
# This command enables plots to appear directly in your notebook.
%matplotlib inline

# This imports the powerful matplotlib plotting library
import matplotlib.pyplot as plots

# This sets the style to mirror that of the popular fivethirtyeight blog ...
plots.style.use('fivethirtyeight')

---

# Preparing the Census Data

In this lecture we continue with the census data.  However, before we proceed we will do some initial cleanup.

In [None]:
full = Table.read_table('data/nc-est2019-agesex-res.csv')
full

**Exercise:** Simplify the table to contain just the `"SEX"` and `"AGE"` columns and the population estimates for `"2014"` and `"2019"`. Remove the aggregate rows stored under `"AGE" = 999` (see the previous lecture for details). Save the result in a table called `data`.

<details><summary>Click to Expand Solution</summary>
    

```python
data = (
    full
        .select('SEX', 'AGE', 'POPESTIMATE2014', 'POPESTIMATE2019')
        .relabeled('POPESTIMATE2014', '2014')
        .relabeled('POPESTIMATE2019', '2019')
        .where('AGE', are.not_equal_to(999)) # remove aggregates   
)
data
```
    
Notice that this solution wraps the whole expression in an extra pair of parentheses:

```python
data = (
    # I put my code 
    # on multiple lines here
)
```
    
Python lets an expression inside parentheses continue across lines, which allows me to put each method call on its own line.
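
The same rule applies to any Python expression, not just tables. As a small sketch using only built-in string methods (the string and the variable name `words` are made up for illustration):

```python
# A chained expression wrapped in parentheses may span several lines.
words = (
    "  Data Visualization  "
        .strip()   # remove the surrounding whitespace
        .lower()   # convert to lowercase
        .split()   # break into a list of words
)
print(words)  # prints ['data', 'visualization']
```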

</details>

--- 

# Line Plots 

Line plots are used to visualize the relationship between two numerical variables where we believe one is a function of the other. There is a single x (horizontal axis) value and one or more y (vertical axis) values.

**Exercise:** Plot the relationship between age and the *total population* at that age in 2019.

<details><summary>Click to Expand Solution</summary>
    

```python
data.where("SEX", 0).plot('AGE', '2019')
```

</details>

**Exercise:** What happens when I plot something like:

In [None]:
data.plot("AGE", "2019", marker="o")

What happened?

**Exercise:** How does the population change between `2014` and `2019`? Plot both years.

**Exercise:** It is very difficult to relate both years by looking at two separate plots.  Merge both plots into a single plot. (Try making it interactive by replacing `plot` with `iplot`)

What do we observe?

---

## Males vs Females

How does the proportion of males and females change with age?

**Exercise:** Create a table containing three columns `"Age"`, `"Males"`, and `"Females"` with the corresponding population counts for 2019.

**Exercise:** Plot the number of males and females against their age as two separate lines.

**Exercise:** Add a column containing the proportion of females and plot that against age.

Notice there is a large change in the proportion of females at older ages. You can't see this easily in the earlier visualization.  This is why we will often construct multiple visualizations with additional transformations to help reveal potentially interesting patterns in our data.

--- 

# Scatter Plots 

Scatter plots are also used to visualize the relationship between numerical variables. However, unlike line plots, they are more flexible: they do not imply that one variable is a function of the other.

Here we will examine the `"actors.csv"` table which contains 50 rows, corresponding to the 50 top grossing actors. The table is already sorted by `"Total Gross"`, so it is easy to see that *Harrison Ford* is the highest grossing actor.

In [None]:
# Actors and their highest grossing movies
actors = Table.read_table('data/actors.csv')
actors

**Exercise:** Construct a scatter plot examining the relationship between `"Number of Movies"` and `"Average per Movie"`. (Try using `iscatter` instead.)

Why not use a line plot?

Who is the outlier in the top left?

---
<center> Return to Slides </center>

---


# Bar Charts 

Bar charts are used to visualize the relationship between numerical attributes and categorical attributes.

Here we examine the top 200 highest grossing movies of all time (as of 2017). 

In [None]:
top_movies = Table.read_table('data/top_movies_2017.csv')
top_movies

**Exercise:** Based on this data, what is the relationship between the studios and their total revenue? Construct a bar plot showing the total `Gross (Adjusted)` income for each studio.

**Exercise:** Dig deeper into the top studio.  
1. Plot the top grossing movies.
2. Plot the relationship between gross revenue and release year.


## Bonus

You do not need to learn plotly or any of the syntax I am about to use; *this is bonus material*. However, I hope the following plot highlights how you can mix different kinds of data in a single visualization.

Notice that here we specify which variables to encode as `x`, `y`, and `color`. We have also attached the movie title to each marker, so when you hover over a dot with your mouse you can see which movie it corresponds to. How many "dimensions" does this plot have?

<details> <summary>Solution</summary>

```python
import plotly.express as px # Import the powerful plotly viz tool
px.scatter(x = top_movies.column("Year"), 
           y = top_movies.column("Gross (Adjusted)"), 
           color = top_movies.column("Studio"),
           hover_name = top_movies.column("Title"))
```
    
</details>