In [None]:
import numpy as np
from datascience import *
Table.interactive_plots()

# Lab 8 – More Visualization Methods

## Data 94, Spring 2021

This week we will cover some new visualization methods that have been discussed in lecture. Last week we talked about methods for visualizing one variable; this week we build on that. Visualizing two or more variables at once reveals more patterns in the data, and makes it easier to communicate findings to people who are not familiar with data science.

We will be working with the same dataset as last week, so we will load it in before looking at the new methods. Notice the `"Latitude"` and `"Longitude"` columns, which will be used for visualization later in this lab:

In [None]:
weather = Table.read_table("data/weather.csv")
weather.show(5)

# The [scatter](http://data8.org/datascience/_autosummary/datascience.tables.Table.scatter.html#datascience.tables.Table.scatter) method

As we mentioned, visualizing two variables can show us patterns in the data that can help us learn new information. The `scatter` method allows us to see the relationship between two numerical variables in our data using a **scatter plot**. The first provided column name goes along the x-axis and the second goes along the y-axis.

Let’s take a look at the relationship between the average high and low temperatures in each city. Right now, our table doesn’t contain that information; it contains the high and low temperature for cities on several days throughout two years. To get our table in the right format, we need to group by `"City"` and average each numerical column with `np.average`:

In [None]:
weather_averages = weather.group("City", np.average)
weather_averages.show(5)
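
To make the grouping step concrete, here is a minimal pure-Python sketch of what grouping by city and averaging does conceptually. The sample rows are made up for illustration and do not come from `weather.csv`, and this is not how the `datascience` library implements `group`; it only mirrors the idea of collecting each city's rows and averaging each numerical column.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows (city, low temp, high temp); not taken from weather.csv.
rows = [
    ("Anchorage", 20, 30),
    ("Anchorage", 40, 50),
    ("Honolulu", 70, 80),
    ("Honolulu", 72, 84),
]

# Collect every low and high reading under its city.
lows, highs = defaultdict(list), defaultdict(list)
for city, low, high in rows:
    lows[city].append(low)
    highs[city].append(high)

# Build one output row per city, holding the average of each numerical column.
city_averages = {
    city: {
        "Low Temp average": mean(lows[city]),
        "High Temp average": mean(highs[city]),
    }
    for city in sorted(lows)
}

for city, averages in city_averages.items():
    print(city, averages)
```

Each city ends up with a single row, and the column names follow the same `"<column> average"` pattern that appears in `weather_averages`.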

Now, we can call `scatter` on `weather_averages`:

In [None]:
weather_averages.scatter("Low Temp average", "High Temp average")

This plot looks good, but it is difficult to see which points correspond to which cities. To give each data point its city name, we can use the `labels` or `group` arguments:

In [None]:
weather_averages.scatter("Low Temp average", "High Temp average", labels="City")

In [None]:
weather_averages.scatter("Low Temp average", "High Temp average", group="City")

As you can see, one of these plots is easier to read than the other, so we were better off using the `group` argument in this case. However, in practice, it may be useful to use `labels`, not `group`, so think about when it may be useful to use each argument.

Scatter plots are useful when visualizing two numerical variables together. If we want to plot two numerical variables, but one of those variables corresponds to time, we can use a line plot to visualize the non-time variable as time passes.

# The [plot](http://data8.org/datascience/_autosummary/datascience.tables.Table.plot.html#datascience.tables.Table.plot) method

Similar to `scatter`, we give `plot` the names of two numerical columns and it creates a **line plot** for us. If we want to draw multiple line plots on the same set of axes, we give it a table with multiple numerical columns, and tell it which one contains the values for the x-axis.

The `plot` method allows us to see how non-time variables change over time. Let's use `plot` to look at the temperature patterns in Alaska. First, we will look at a single line plot using `plot`:

In [None]:
alaska_weather = weather.where("State", "Alaska")
alaska_average_temps = alaska_weather.select("Month", "Average Temp").group("Month", np.average)
alaska_average_temps.show(5)

In [None]:
alaska_average_temps.plot("Month",
                 xaxis_title = "Month",
                 yaxis_title = "Average Temperature",
                 title = "Average Alaskan Temperature by Month in 2016")

If we want to see multiple variables on one plot, we can include them in the table we call `plot` on:

In [None]:
alaska_temps = alaska_weather.select("Month", "High Temp", "Average Temp", "Low Temp").group("Month", np.average)
alaska_temps.show(5)

In [None]:
alaska_temps.plot("Month",
                 xaxis_title = "Month",
                 yaxis_title = "Temperature",
                 title = "Alaskan Temperature by Month in 2016")

We can see that temperatures are highest during the summer months and lowest in the winter months, which is to be expected in the Northern Hemisphere. We can also use `plot` to see these patterns in each `State`:

In [None]:
avg_temps = weather.select("State", "Month", "Average Temp").group(["State", "Month"], np.average)
state_avg_temps_by_month = avg_temps.pivot("State", "Month", "Average Temp average", np.sum)
state_avg_temps_by_month.plot("Month",
                 xaxis_title = "Month",
                 yaxis_title = "Temperature",
                 title = "State Temperatures by Month in 2016")
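
The `pivot` call above turns a long table (one row per state and month) into a wide table (one row per month, one column per state), which is the shape `plot` needs to draw one line per state. Below is a minimal pure-Python sketch of that reshaping. The rows are made-up values, and the code illustrates the idea rather than the library's actual implementation.

```python
from collections import defaultdict

# Hypothetical long-format rows (state, month, average temp); the values are made up.
long_rows = [
    ("Alaska", 1, 10.0),
    ("Alaska", 7, 60.0),
    ("Hawaii", 1, 72.0),
    ("Hawaii", 7, 80.0),
]

# Gather the values that fall into each (month, state) cell.
cells = defaultdict(list)
for state, month, temp in long_rows:
    cells[(month, state)].append(temp)

# Wide format: one row per month, one column per state.
# Each cell is aggregated with sum, mirroring the np.sum argument to pivot.
wide = defaultdict(dict)
for (month, state), values in cells.items():
    wide[month][state] = sum(values)

for month in sorted(wide):
    print(month, wide[month])
```

Because the earlier `group` step leaves exactly one row per state and month, every cell holds a single value, so summing returns that value unchanged.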

To reduce the clutter that comes from plotting all 51 states in the dataset, we can pick a generally hotter state, an extreme state (one that gets very hot and very cold), and a generally colder state to view together on one plot:

In [None]:
hot_state = "Hawaii"
extreme_state = "Utah"
cold_state = "Alaska"
temp_states = [hot_state, extreme_state, cold_state]

fewer_states_weather = weather.where("State", are.contained_in(temp_states))
fewer_state_avg_temps = fewer_states_weather.select("State", "Month", "Average Temp").group(["State", "Month"], np.average)
fewer_state_avg_temps_by_month = fewer_state_avg_temps.pivot("State", "Month", "Average Temp average", np.sum)
fewer_state_avg_temps_by_month.plot("Month",
                 xaxis_title = "Month",
                 yaxis_title = "Temperature",
                 title = "State Temperatures by Month in 2016")

You can see here that Hawaii is above Alaska and almost always above Utah, while Utah bounces back and forth between the two. Each state still appears to peak in the summer and dip in the winter, just like we saw with Alaska alone.

Hotter states tend to see less temperature difference over the year than colder states. A likely reason is geography: a state like Hawaii sits closer to the equator, where the amount of sunlight changes less between seasons, and it is surrounded by ocean, which moderates temperature swings.

Now that we have seen both scatter plots and line plots, we will look at another visualization technique for viewing multiple variables at once.

# `Circle` and `Marker`: the `map_table` function

`Circle` and `Marker` allow us to make interactive maps. These special scatter plots are created using longitude as the x-axis and latitude as the y-axis.

`Circle` uses circles to indicate locations on the map and `Marker` uses map markers. **Both require a table with the first two columns corresponding to latitude and longitude coordinates of your data points.** The table can have additional columns containing other information, and we will talk about those later in this lab, but you always need the first two columns to be latitude and longitude.

There are many cool features of these visualizations, but rather than talk about each one, let's take a look at one first and then talk about what we see. We'll first look at all the cities represented in our weather table:

*You may ignore any warning generated by the following cells.*

In [None]:
weather_cities = weather.group(["Latitude", "Longitude"]).drop("count")
Circle.map_table(weather_cities,
                line_color = None,
                fill_opacity = 0.5,
                area = 50)

Let's discuss the arguments we used for this `Circle.map_table` call:

- The first input is a table with only 2 columns: `"Latitude"` and `"Longitude"` (they do not have to have these exact names, but they should store latitude and longitude information about each data point).

- The `line_color` argument, when not `None`, creates a border around each circle. Whether to include the border is up to you.

- The `fill_opacity` argument is a float between 0 and 1 that dictates how transparent each circle is. Values closer to 0 are more transparent and values closer to 1 are more opaque.

- The `area` argument indicates the size of your circles on the plot. Larger numbers correspond to larger circles, so when making your own visualizations, select a value of `area` that makes the circles visible, but not so big that they clutter the plot or make it hard to read.

Now we'll look at a visualization using `Marker`:

In [None]:
Marker.map_table(weather_cities,
                color = 'blue',
                marker_icon = 'star')

Let's discuss the arguments we used for this `Marker.map_table` call:

- Again, the first input is a table containing latitude and longitude information.

- The `color` argument indicates the color of your markers.

- The `marker_icon` argument indicates the icon on your marker. For more icon options (also called glyphs), you can click [here](https://getbootstrap.com/docs/3.3/components/).

`Circle` is easier to read when looking at the continent level, and `Marker` is more difficult to read because of the clutter. However, if we zoom into each map, we can see that each becomes far more readable and useful for seeing the density of the cities in our dataset. We can see that many of our cities are in the Midwest, which is useful for understanding where the data is coming from.

To prevent this clutter on maps, we can use the `clustered_marker` argument to create clusters of data points based on how zoomed in or out we are. Let's take a look at how the map looks with clustering on:

In [None]:
Marker.map_table(weather_cities,
                color = 'blue',
                marker_icon = 'star',
                clustered_marker = True)

As you zoom in and out, the clusters change to reduce clutter! The warmer the color of the cluster, the more data points there are in that cluster. This is very useful for dense regions where seeing each individual data point is not necessary.

As we mentioned above, there are additional columns you can have in the table you supply to `Circle.map_table` and `Marker.map_table`. *These columns provide the map with information about **every** data point in your table.*

| **Column Name** | **Description** |
| -- | -- |
| `labels` | Gives each point a clickable label |
| `color_scale` | Colors of points to correspond to numerical values (e.g. darker colors = higher values) |
| `colors` | Colors of points are assigned based on category |
| `areas` | Areas of points are proportional to numerical values |

Note: `color_scale` only works with numerical variables, and `colors` only works to distinguish between different categorical variable values.

*The columns corresponding to these pieces of information must have these **exact** labels, otherwise `map_table` will raise an error.*
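
Since the labels must match exactly, it can help to check them before drawing a map. The sketch below uses a hypothetical helper, `unrecognized_map_columns`, which is not part of the `datascience` library. It assumes the four names in the table above are the only recognized extra columns, and it simply flags any other label that appears after the first two columns.

```python
# Column labels that the table above lists as recognized by map_table.
RECOGNIZED_EXTRAS = {"labels", "color_scale", "colors", "areas"}


def unrecognized_map_columns(column_labels):
    """Return extra column labels that do not match a recognized name exactly.

    Hypothetical helper, not part of the datascience library. The first two
    labels are skipped because they hold latitude and longitude and may have
    any name.
    """
    return [label for label in column_labels[2:] if label not in RECOGNIZED_EXTRAS]


# Every extra column uses a recognized name, so nothing is flagged.
print(unrecognized_map_columns(["Latitude", "Longitude", "labels", "areas"]))

# "Labels" (capitalized) and "color" (singular) are not exact matches.
print(unrecognized_map_columns(["Latitude", "Longitude", "Labels", "color"]))
```

With a real table, the labels could be passed in as `list(table.labels)`, where `table` stands for whatever table you plan to give to `map_table`.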

## Done! 😇

That's it! There's nowhere for you to submit this, as labs are not assignments. However, please ask any questions you have about this notebook in lab or on Ed.