There is no otter-grader for this lab, so you do not need to run the typical otter cell at the top.

In [None]:
import numpy as np
from datascience import *
Table.interactive_plots()

# Lab 7 – Data Visualization

## Data 94, Spring 2021

Today we will be looking at table methods that help us visualize our data. We can use the methods we have discussed so far in this class to interpret the data, and we can use the methods we discuss today (and their variants) to display the data. This is helpful for showing data to people who don't necessarily have a background in data science and who need a data scientist like you to help them understand the data in a more intuitive way.

We will be looking at the `barh` and `hist` methods today. As data scientists, it is our job not only to be able to use the visualization methods we know, but also to know when to use which method. As we look at new methods going forward, always keep in mind when each one is most useful.

## Loading in the Data

Before we get started, let's get some data to visualize. This dataset contains weekly information about weather all over the United States in the year 2016:

*Note: Precipitation is in inches, all temperatures are in Fahrenheit, and Wind Speed is in Miles per Hour.*

In [None]:
weather = Table.read_table("data/weather.csv")
weather.show(5)

### Cleaning our Data

We have to clean our data before we use it. Just run the following cell once:

In [None]:
def clean_weather_table(tbl):
    tbl = tbl.drop("Date.Full", "Date.Week of", "Station.Code", "Station.Location", "Data.Wind.Direction")
    old_labels = np.array(["Data.Precipitation", "Date.Month", "Date.Year", "Station.City", "Station.State", "Data.Temperature.Avg Temp", "Data.Temperature.Max Temp", "Data.Temperature.Min Temp", "Data.Wind.Speed"])
    new_labels = np.array(["Precipitation", "Month", "Year", "City", "State", "Average Temp", "High Temp", "Low Temp", "Wind Speed"])
    tbl.relabel(old_labels, new_labels)
    tbl = tbl.move_to_start("Month")
    tbl = tbl.move_to_start("Year")
    tbl = tbl.move_to_start("City")
    tbl = tbl.move_to_start("State")
    def clean_states(state):
        if state == "VA":
            return "Virginia"
        elif state == "DE":
            return "Delaware"
        else:
            return state
    tbl = tbl.with_column("State", tbl.apply(clean_states, "State"))
    return tbl
    
weather = clean_weather_table(weather)

In [None]:
weather.show(5)

# The [barh](http://data8.org/datascience/_autosummary/datascience.tables.Table.barh.html#datascience.tables.Table.barh) method

The `barh` method is used to visualize the values of **categorical** variables. Categorical variables take non-numeric values, like names and qualities (Color, State Names, etc.). As we saw in lecture, categorical variables come in 2 different types: *ordinal* and *nominal*. Refer to [Lecture 24](https://docs.google.com/presentation/d/19sNzs3WCtJNd2pzpMVdAIslnwehzZBjVazisQnM9TKg) to see the difference between the two types.

The `barh` method takes in 1 mandatory argument, which is the name of the column you want on the left axis of your `barh` plot. There are also optional arguments that have to do with plotting (axis names, plot title, etc.), and you can look at examples of those in this lab and in the homework. The remaining optional arguments in the datascience documentation linked above can also be used, feel free to try out some of the others on your own!

To use the `barh` method properly, we first need to select the columns we want to see in the graph. We should not call `barh` directly on a Table because without specifying a column, we get a bar graph for every single instance of every single variable, which you can imagine results in a lot of bar graphs (see Question 1a of Homework 7 to see an example of how this does not work the way we want it to).

Let's look at an example of `barh` that can show us the number of weather readings from each state in the dataset:

In [None]:
# First we group by State to count the readings for each state, then we can plot the counts with barh
state_weather = weather.group("State")
state_weather.barh("State")
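
As a rough illustration of the counts that `group("State")` produces before `barh` draws them, the sketch below counts how often each label appears in a small array using only NumPy. The `states` array is made-up sample data, not the lab's dataset, and `np.unique` is used here as a stand-in for the idea of grouping rather than as a description of how the `datascience` library works internally.

```python
import numpy as np

# Hypothetical sample of state labels (not taken from weather.csv)
states = np.array(["Oregon", "Alaska", "Oregon", "Tennessee", "Alaska", "Oregon"])

# np.unique returns the distinct labels in sorted order and how many times
# each one occurs: one count per category, like one bar per category
labels, counts = np.unique(states, return_counts=True)

for label, count in zip(labels, counts):
    print(label, count)
```

Each distinct label ends up with exactly one count, which is why a bar chart of grouped categorical data stays readable.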

We can also use `barh` to see multiple statistics at once. Let's use the `group` method and `barh` method to see the average low temperature **and** high temperature in each state:

*The dataset is reduced to only include the first 10 states for convenience.*

In [None]:
# We must group first to get our desired columns, then we can call barh
state_weather_avg = weather.group("State", collect=np.average).take(np.arange(10)).select("State", "High Temp average", "Low Temp average")
state_weather_avg.barh("State", overlay=True)

If we want different visualizations for each variable, we can set the optional `overlay` argument to `False`. The default value of `overlay` is `True`, so if you don't give it a value, you will get a plot with all the included variables at once.

In [None]:
state_weather_avg.barh("State", overlay=False)

That way we can choose if we want to have one plot with all our information or a new plot for each piece of information!

### Where `barh` fails

The `barh` method works well on categorical variables, but what if we have a **numerical** variable whose distribution we want to see? Let's see what happens if we try to use `barh` on a numerical variable (`Wind Speed`) instead of a categorical variable:

In [None]:
weather.group("Wind Speed").barh("Wind Speed")

As you can see, this bar plot is not particularly helpful. Seeing the breakdown of `Wind Speed` does not provide us with any useful information, and it is also difficult to read or understand. Instead, for numerical variables, we have another visualization method that helps us visualize a numerical variable's distribution...
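
To see why grouping a numerical column produces an unreadable bar plot, here is a minimal sketch using only NumPy. The `wind_speeds` values are hypothetical, chosen for illustration. When every reading is treated as its own category, we get one bar per distinct value, but gathering the readings into a few ranges gives a compact summary.

```python
import numpy as np

# Hypothetical wind speed readings in mph (made up for illustration)
wind_speeds = np.array([2.1, 3.4, 3.5, 4.8, 5.0, 6.2, 7.7, 8.1, 9.9, 12.3])

# Treating each value as its own category gives one bar per distinct value
distinct_values, value_counts = np.unique(wind_speeds, return_counts=True)
print("Bars if each value is a category:", len(distinct_values))

# Gathering values into three ranges: [0, 5), [5, 10), and [10, 15]
bin_edges = np.array([0, 5, 10, 15])
bin_counts, _ = np.histogram(wind_speeds, bins=bin_edges)
print("Counts per range:", bin_counts)
```

With real data there would be far more distinct values, so the gap between the two approaches only grows.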

# The [hist](http://data8.org/datascience/_autosummary/datascience.tables.Table.hist.html#datascience.tables.Table.hist) method

The `hist` method allows us to see the distribution of a numerical variable. Categorical variables should be visualized using `barh`, and numerical variables should be visualized using `hist`.

The `hist` method takes in 1 mandatory argument and has several optional arguments. As is the case with `barh`, there are many other optional arguments, but here are just a few of them. **`density` should always be set to `False`.**

| **Argument** | **Description** | **Type** | **Mandatory?** |
| -- | -- | -- | -- |
| `column` | Column name whose values you want on the x-axis of your plot | Column name (string) | Yes |
| `density` | If `True`, then the resulting plot will be displayed not on the count of a value, but on the density of that value in the Table | boolean | No |
| `group` | Similar to the Table method `group`, groups rows by this label before plotting | Column name (string) | No |
| `overlay` | When `False`, make a new plot for each eligible statistic in the Table | boolean | No |
| `xaxis_title` | Label on the x-axis of your plot | string | No |
| `yaxis_title` | Label on the y-axis of your plot | string | No |
| `title` | Title of your plot | string | No |
| `bins` | A NumPy array of bin boundaries you want your histogram to gather data into | array | No |

**Again, in all cases, `density` should be set to `False`**

Keep in mind the same plotting optional arguments mentioned in the `barh` introduction.
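
The `bins` argument expects an array of boundaries, and the boundaries you choose change what the histogram shows. The sketch below is a NumPy-only illustration with made-up `values`: it counts the same data twice, once with wide bins and once with narrower bins. Note that `np.histogram` is used here as an assumption-free stand-in for the counting step. In `np.histogram`, each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
import numpy as np

# Hypothetical readings (made up for illustration)
values = np.array([1, 2, 2, 3, 6, 7, 8, 8, 9, 14])

# Wide bins: boundaries at 0, 10, 20
wide_edges = np.arange(0, 21, 10)
wide_counts, _ = np.histogram(values, bins=wide_edges)

# Narrower bins: boundaries at 0, 5, 10, 15, 20
narrow_edges = np.arange(0, 21, 5)
narrow_counts, _ = np.histogram(values, bins=narrow_edges)

print("Wide bins:  ", wide_edges, "->", wide_counts)
print("Narrow bins:", narrow_edges, "->", narrow_counts)
```

The wide bins hide the fact that the readings below 10 split into two clusters, while the narrower bins reveal it. An array built this way with `np.arange` is the kind of value you could pass as `bins`.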

Let's take a look at the weather in different states to see how the `hist` method helps visualize numerical variables:

In [None]:
# Oregon and Tennessee have similar counts in the weather dataset, so we can compare them
# First we get all Oregon weather information
oregon_weather = weather.where("State", "Oregon")
oregon_weather.show(5)

In [None]:
# This plot shows the distribution of wind speeds in Oregon for 2016
oregon_weather.select("Wind Speed").hist(
                       xaxis_title = 'Wind Speed',
                       yaxis_title = 'Count',
                       title = 'Distribution of Oregon Wind Speeds', 
                       density = False
                       )

This shows us that wind speeds in Oregon tend to fall into the 0-10 mph range, but they can get higher on certain occasions. Let's see how that compares to wind speeds in another state, Tennessee:

In [None]:
# Get all Tennessee information
tennessee_weather = weather.where("State", "Tennessee")
tennessee_weather.show(5)

In [None]:
# This plot shows the distribution of wind speeds in Tennessee for 2016
tennessee_weather.select("Wind Speed").hist(
                       xaxis_title = 'Wind Speed',
                       yaxis_title = 'Count',
                       title = 'Distribution of Tennessee Wind Speeds', 
                       density=False
                       )

We can use `hist` on a Table with just rows for these two states and use the optional `group` argument.

*You can ignore the warning message that appears when you run the plotting cell below.*

In [None]:
tennegon_weather = weather.where("State", are.contained_in(["Oregon", "Tennessee"]))
tennegon_weather.show(5)

In [None]:
# This plot shows the distribution of wind speeds in Oregon and Tennessee for 2016
tennegon_weather.select("State", "Wind Speed").hist(
                       group = "State",
                       xaxis_title = 'Wind Speed',
                       yaxis_title = 'Count',
                       title = 'Distribution of Oregon and Tennessee Wind Speeds', 
                       density = False
                       )

In [None]:
print("Oregon weeks in weather dataset:", oregon_weather.num_rows)
print("Tennessee weeks in weather dataset:", tennessee_weather.num_rows)

Because Oregon and Tennessee have very similar counts in the `weather` dataset, we can compare them with each other in visualizations like this. It appears that wind speeds in Oregon are a bit higher on average, as the plot above shows the Oregon `Wind Speed` values shifted a bit further to the right than the Tennessee `Wind Speed` values. Let's see if we can use a table query to figure out the same information:

In [None]:
tennegon_weather.show(5)

In [None]:
oregon_average_wind_speed = np.average(oregon_weather.column("Wind Speed"))
tennessee_average_wind_speed = np.average(tennessee_weather.column("Wind Speed"))
print("Average Oregon Wind Speed:", oregon_average_wind_speed)
print("Average Tennessee Wind Speed:", tennessee_average_wind_speed)

As we can see, the plot we made appeared to suggest that the average wind speed would be a bit higher in Oregon, and the table operations reflected that! This is a benefit of visualization, that information can be learned about the dataset with just visual observation. It is always beneficial to back your claims about data with concrete facts about the dataset, but visualizations can help abstract away some of the confusion of looking at raw data so that non-data-scientists can better understand what is going on.

Now, think about what would happen if you chose two states with very different counts. Why would it be more difficult to compare them with histograms? Let's take a look at what happens when we do this:

In [None]:
rhode_island_weather = weather.where("State", "Rhode Island")
alaska_weather = weather.where("State", "Alaska")
print("Rhode Island weeks in weather dataset:", rhode_island_weather.num_rows)
print("Alaska weeks in weather dataset:", alaska_weather.num_rows)

Each individual plot looks fine:

In [None]:
# This plot shows the distribution of wind speeds in Rhode Island for 2016
rhode_island_weather.select("State", "Wind Speed").hist(
                       group = "State",
                       xaxis_title = 'Wind Speed',
                       yaxis_title = 'Count',
                       title = 'Distribution of Rhode Island Wind Speeds', 
                       density = False
                       )

In [None]:
# This plot shows the distribution of wind speeds in Alaska for 2016
alaska_weather.select("State", "Wind Speed").hist(
                       group = "State",
                       xaxis_title = 'Wind Speed',
                       yaxis_title = 'Count',
                       title = 'Distribution of Alaska Wind Speeds', 
                       density = False
                       )

Take a look at the y-axis on both of these plots.

*What do you think will happen when we try to plot them on the same graph?*

In [None]:
rhodlaska_weather = weather.where("State", are.contained_in(["Rhode Island", "Alaska"]))
rhodlaska_weather.show(5)

In [None]:
# This plot shows the distribution of wind speeds in Rhode Island and Alaska for 2016
rhodlaska_weather.select("State", "Wind Speed").hist(
                       group = "State",
                       xaxis_title = 'Wind Speed',
                       yaxis_title = 'Count',
                       title = 'Distribution of Rhode Island and Alaska Wind Speeds', 
                       density = False
                       )

As you can see, there is so much more Alaska data than Rhode Island data that we can hardly make comparisons between the two. Trying to figure out information from this plot is very difficult, so we would either have to use another type of visualization or change the perspective of this plot to be able to learn from it.
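
One possible way to "change the perspective" is to compare the share of readings in each range instead of the raw counts. The sketch below uses only NumPy and made-up numbers (`small_group` and `large_group` are hypothetical, not the Rhode Island or Alaska data) to show the idea. It computes the shares by hand and is not a description of what `hist` does internally, and the plots in this lab still keep `density=False`.

```python
import numpy as np

# Hypothetical groups with very different sizes (made up for illustration)
small_group = np.array([3, 4, 6, 7, 12])                # 5 readings
large_group = np.tile(np.array([2, 4, 6, 8, 11]), 10)   # 50 readings

bin_edges = np.array([0, 5, 10, 15])
small_counts, _ = np.histogram(small_group, bins=bin_edges)
large_counts, _ = np.histogram(large_group, bins=bin_edges)

# Raw counts differ by a factor of 10, so the small group's bars
# would be tiny next to the large group's on a shared count axis
print("Counts:", small_counts, large_counts)

# Dividing by each group's total puts both groups on the same 0-to-1 scale
small_shares = small_counts / small_counts.sum()
large_shares = large_counts / large_counts.sum()
print("Shares:", small_shares, large_shares)
```

In this made-up example the two groups have the same shape once the counts are converted to shares, which is impossible to tell from the raw counts alone.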

### Homework Solution Formatting

In this lab, we wrote visualization code that does not need to be saved or submitted, but on your homework assignments we need to be able to save the visualizations you make. When writing your solutions for visualizations on homework, follow **all** of these steps to correctly submit your code:
- Store your plot in a variable name
- Use the `show=False` argument in your plot, so that the plot is returned and can be stored instead of being displayed right away
- Make the last line of your cell the variable you stored your plot in so that you can see it when you run the cell

Doing all of these will allow us to grade your submission properly. There are additional instructions in the homework notebook this week, so if you have any questions, feel free to ask on Ed.

## Done! 😇

That's it! There's nowhere for you to submit this, as labs are not assignments. However, please ask any questions you have about this notebook in lab or on Ed.