<a href="https://colab.research.google.com/github/JaimeAdele/APEX/blob/main/Module12_data_visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://images.pexels.com/photos/5561923/pexels-photo-5561923.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2' width=700>  
Photo by Olya Kobruseva from Pexels

# APEX Faculty Training, Module 12: Data Visualization

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: May 7, 2022  

**Learning outcomes**  
In this module, you will:
* Be introduced to two new libraries, `matplotlib` and `seaborn`, both of which are used for data visualization purposes
* Learn how to use functions within these libraries to create scatter plots, histograms, count plots, bar plots, and line plots

## 1. A couple of notes before you start
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`
        


## 2. Overview of Data Visualization
The last few modules introduced you to the `Pandas` and `SciPy` libraries. `Pandas` allows us to work with dataframes and perform simple descriptive statistics. The `stats` module within the `SciPy` library allows us to do more complex, inferential statistics. 

Statistics is only one piece of the data analysis pie, however, so the focus of this module will be on visualizing data, i.e., creating data plots. 

There are *many* Python libraries designed for data visualization purposes, and here we'll simply focus on two of the most popular: `matplotlib` and `seaborn`. Although `matplotlib` is quite popular, the plots that it creates aren't particularly attractive without a lot of additional coding. 

`seaborn` to the rescue! `seaborn` uses `matplotlib` functions behind the scenes; in other words, it does the hard work for you, allowing users to much more easily and simply create attractive plots. To learn more about the types of plots that `seaborn` can create, we encourage checking out the [tutorial](http://seaborn.pydata.org/tutorial.html) and [example gallery](http://seaborn.pydata.org/examples/index.html).  

To use functions within the `seaborn` library, we need to import both `matplotlib` and `seaborn`. (If you're curious, the former is required given that `seaborn` uses `matplotlib` functions behind the scenes.)

Finally, two notes about importing these libraries:
* When importing `matplotlib`, you need to include an additional line of code that allows plots to be displayed directly inside your notebook. See sample code below. 
* When importing `seaborn`, we'll use the abbreviation `sns`. Again, see sample code below.  

<font color='red'>Exercise 1</font>  
Run the cell below to import the `matplotlib` and `seaborn` libraries. Given that we'll be working with dataframes, we'll also need to import `pandas`. As usual, you will not see an output from importing libraries.

In [None]:
# Library for dataframes
import pandas as pd

# Libraries for plotting
import matplotlib
%matplotlib inline
import seaborn as sns

<font color='red'>Exercise 2</font>  
Next, let's read in a few CSV files to create dataframes. Run the cell below, which will create four dataframes as follows:

* `movies_df` contains movie rating data taken from a variety of websites, such as Rotten Tomatoes, Fandango, etc.
* `biopics_df` contains data taken from IMDB regarding the biopic film genre
* `drugs_df` contains data from a national survey on drug use
* `drugs_df_long` contains the exact same data, formatted in a different way

In [None]:
movies_df = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/movie_ratings.csv')
biopics_df = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/biopics.csv')
drugs_df = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/0f146f266ca577467be93f726956a9ff17711282/S21/drug_use_wide.csv')
drugs_df_long = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/drug_use_long.csv')

## 3. Scatter Plots
Scatter plots are useful for understanding the relationship between two data sets. They can be created with `seaborn` using either of the following functions: `scatterplot()` or `regplot()`. These plots look identical, except `regplot()` includes a regression line, which can make it easier to understand whether two datasets have a linear relationship.

The syntax for these two functions is as follows, with the only difference being the function names themselves (scatterplot vs. regplot):  

* `plot_name = sns.scatterplot(x = 'col1', y = 'col2', data = df_name)`  

* `plot_name = sns.regplot(x = 'col1', y = 'col2', data = df_name)`

Breaking down this syntax:
* Start by creating a variable to represent the plot. For example, if we're creating a scatter plot relating to movie data, we might use `movie_scatter`
* Then we need to indicate the relevant library (in this case, `sns`) and the specific function within that library that we want to use (scatterplot or regplot)
* Finally, within the parentheses, we need to indicate which two columns represent our two data sets, and what dataframe they can be found in. Note that the column assigned to `x` will be displayed on the x-axis, and the column assigned to `y` will be displayed on the y-axis.
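
To make this breakdown concrete, here is a minimal sketch using a tiny made-up dataframe. The names `toy_df`, `critic_score`, `audience_score`, and `toy_scatter` are hypothetical and exist only for illustration; they are not part of the module's data sets.

```python
import pandas as pd
import seaborn as sns

# Hypothetical scores for five made-up films
toy_df = pd.DataFrame({
    'critic_score': [35, 50, 62, 78, 90],
    'audience_score': [40, 48, 70, 75, 88],
})

# Variable name = library.function(x column, y column, dataframe)
toy_scatter = sns.scatterplot(x = 'critic_score', y = 'audience_score', data = toy_df)

# By default, the axis titles are taken from the column names
print(toy_scatter.get_xlabel())
print(toy_scatter.get_ylabel())
```

The variable on the left-hand side holds the plot itself, which is what lets us refer back to it later (for example, when saving the plot to a file).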

<font color='red'>Exercise 3</font>  
Run the first cell below to see a scatter plot of the `rotten_tom` and `metacritic` columns of the `movies_df` dataframe. 

In the cell beneath that, write code to create a regression plot using the same data, but this time assign the plot to a variable called `movie_reg`.

In [None]:
# Scatter plot of movies_df
movie_scatter = sns.scatterplot(x = 'rotten_tom', y = 'metacritic', data = movies_df)

In [None]:
# Regression plot of movies_df


### 3a. Changing plot aesthetics
Don't like plotting with the default blue color? You can easily change this! Simply include an additional argument within the parentheses as follows:

`plot_name = sns.scatterplot(x = 'col1', y = 'col2', data = df_name, color = 'red')`

Naturally, you can replace `'red'` with another color of your choosing. Seaborn recognizes a surprisingly large number of colors!

<font color='red'>Exercise 4</font>  
Create a new version of your regression plot from above so that it uses a non-default color of your choosing. Also make sure to use a different variable name for the plot.

In [None]:
# New Regression plot with a custom color


Note that there are many other aspects of plot aesthetics that can be changed in `seaborn` (e.g., size of plot points, title of plot, axis titles, background color, etc.). If you're curious to know more, check out: 
* The `seaborn` page for a particular plotting function. For example: https://seaborn.pydata.org/generated/seaborn.scatterplot.html#seaborn.scatterplot 
* The general `seaborn` page for controlling plot aesthetics: https://seaborn.pydata.org/tutorial/aesthetics.html
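
As a small taste of these options, the sketch below changes the point size and adds a title and custom axis titles. It assumes a made-up dataframe (`toy_df`) with hypothetical column names. The `s` argument and the `set_title()`, `set_xlabel()`, and `set_ylabel()` methods come from `matplotlib`, which `seaborn` uses behind the scenes.

```python
import pandas as pd
import seaborn as sns

# Hypothetical scores for five made-up films
toy_df = pd.DataFrame({
    'critic_score': [35, 50, 62, 78, 90],
    'audience_score': [40, 48, 70, 75, 88],
})

# color changes the point color; s changes the point size
toy_styled = sns.scatterplot(x = 'critic_score', y = 'audience_score',
                             data = toy_df, color = 'purple', s = 100)

# Add a plot title and replace the default axis titles
toy_styled.set_title('Critic vs. audience scores')
toy_styled.set_xlabel('Critic score')
toy_styled.set_ylabel('Audience score')
```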

### 3b. Saving plots to Drive
In addition to viewing plots directly in your notebook, you can also save them to Google Drive as a PDF (or other image types, if you prefer). To do this, you'll need to connect this notebook to your Google Drive account. For a review of this connection process, please see Module 10: Using your own data.

<font color='red'>Exercise 5</font>  
Run the cell below, which will open a series of windows prompting you to connect this notebook with your desired Drive account.

In [None]:
# Connect notebook to Drive
from google.colab import drive
drive.mount('/content/drive')

Next, you need to specify where, exactly, in Google Drive you'd like your plot to be saved. As a reminder, we call this the file's path. The path should include the relevant location as well as a name for the new file. See below for the generic syntax:

`filepath = '/content/drive/My Drive/EDIT HERE/file.pdf'`

You should leave most of the above path as-is, with the exception of the portion that says EDIT HERE and the specific name you'd like to give your file. For example, you might want to save a regression plot within your Colab Notebooks folder, which would give you:

`filepath = '/content/drive/My Drive/Colab Notebooks/movie_reg_plot.pdf'`

After defining the path, you'll then execute two lines of code that will do the actual saving. See below for the generic syntax:

`fig = my_plot.get_figure()`  
`fig.savefig(filepath)`

Using this syntax, the only thing you'll need to change is on the first line: `my_plot`. Rather than use this generic variable name, you should instead use the variable that represents your desired plot. Thinking back to Exercises 3 and 4, you used several different variable names when creating your plots – if you wanted to save one of those plots as a PDF, you would use that plot's variable.
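
If you'd like to try the two saving lines without connecting to Drive, here is a minimal sketch that saves a plot to a temporary folder on whatever machine is running the code. The dataframe, the plot variable `toy_plot`, and the file name `toy_plot.pdf` are all hypothetical; only `get_figure()` and `savefig()` come from the syntax above.

```python
import os
import tempfile

import pandas as pd
import seaborn as sns

# Hypothetical data and plot
toy_df = pd.DataFrame({
    'critic_score': [35, 50, 62, 78, 90],
    'audience_score': [40, 48, 70, 75, 88],
})
toy_plot = sns.scatterplot(x = 'critic_score', y = 'audience_score', data = toy_df)

# Assumption: save to a temporary folder rather than Google Drive
filepath = os.path.join(tempfile.gettempdir(), 'toy_plot.pdf')

# Do the actual saving, using the plot's own variable name
fig = toy_plot.get_figure()
fig.savefig(filepath)

# Confirm that the file was created
print(os.path.exists(filepath))
```

The file type is determined by the extension in the path, so ending the file name in `.png` instead of `.pdf` would save a PNG image.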

<font color='red'>Exercise 6</font>  
In this exercise, you'll save the regression plot you created in Exercise 4. In the cell below:
1. On line 2, edit the file path to fit your preferred location on Drive and your preferred name for the PDF you'll be creating.   
2. On line 5, change `my_plot` to fit the name of your plot from Exercise 4.

In [None]:
# Define where to save the file, including what to call the file
filepath = '/content/drive/My Drive/EDIT HERE/file.pdf'

# Do the actual saving
fig = my_plot.get_figure()
fig.savefig(filepath)

## 4. Histograms
Histograms are used for approximating a data set's distribution by binning continuous data into a series of consecutive intervals. For example, you could use a histogram to explore the distribution of heights among NBA players or gas prices across the country.   

In `seaborn`, histograms are created using the function `histplot()`. See below for sample syntax, noting that we only need to specify one column rather than two. If you define the x-axis, you'll have vertical bars; if you define the y-axis, you'll have horizontal bars.

`plot_name = sns.histplot(x = 'col_name', data = df_name)` for vertical bars   
`plot_name = sns.histplot(y = 'col_name', data = df_name)` for horizontal bars  

<font color='red'>Exercise 7</font>  
Run each of the cells below to see histograms for movie ratings from Rotten Tomatoes, plotted either vertically or horizontally:


In [None]:
movie_hg = sns.histplot(x = 'rotten_tom', data = movies_df)

In [None]:
movie_hg = sns.histplot(y = 'rotten_tom', data = movies_df)


As with scatter plots, you can easily change the color of the plot by including the `color` argument. For example:

`plot_name = sns.histplot(x = 'col_name', data = df_name, color = 'green')`

Another plotting change that you may wish to make relates to the bins themselves. By default, `seaborn` will automatically select a bin size using a reference rule that depends on the sample size and variance. However, you can instead specify the bin width yourself, or you can specify the number of desired bins.

To specify bin width, use:

`plot_name = sns.histplot(x = 'col_name', data = df_name, binwidth = 0.5)`

The above example would create bins with a width of 0.5; naturally, you should change this value as needed for each data set you work with.

To specify the number of bins, use:

`plot_name = sns.histplot(x = 'col_name', data = df_name, bins = 10)`

The above example would create 10 bins; again, you should change this value as needed for each data set you work with.

<font color='red'>Exercise 8</font>  
Create a histogram of movie ratings from the `fandango` column of `movies_df`, selecting a non-default color and a non-default number of bins. Remember to use a different variable name than you used in Exercise 7.

In [None]:
# New Histogram with custom color and number of bins


## 5. Count plots

Whereas the `histplot()` function is used for continuous data, the `countplot()` function is used with categorical data, i.e., data that fall into categories. For example, you could use a count plot to show the number of people at your university in each ethnic group.

Count plots are created using the following syntax, which is identical to that of histograms with the exception of the function name itself:  
`plot_name = sns.countplot(x = 'col_name', data = df_name)` for vertical bars  
`plot_name = sns.countplot(y = 'col_name', data = df_name)` for horizontal bars  
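
To see what a count plot is actually counting, here is a minimal sketch with a made-up column of survey responses (`toy_df` and `favorite_genre` are hypothetical names). The height of each bar should match the tally that the Pandas method `value_counts()` reports for that category.

```python
import pandas as pd
import seaborn as sns

# Hypothetical survey responses: one row per person
toy_df = pd.DataFrame({
    'favorite_genre': ['drama', 'comedy', 'drama', 'horror', 'drama', 'comedy']
})

# One bar per category; bar height = number of rows in that category
toy_count = sns.countplot(x = 'favorite_genre', data = toy_df)

# The same tallies, computed directly with Pandas
genre_counts = toy_df['favorite_genre'].value_counts()
print(genre_counts)
```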

<font color='red'>Exercise 9</font>  
Our prior data set (movie ratings) contained continuous data. We'll thus need to use a different data set that contains categorical data for demonstrating the `countplot()` function. This particular data set also relates to movies, but this time each row is a biopic movie, and each column is categorical information about the movie, e.g., the type of subject (politician, athlete, musician, etc.) or the subject's gender.

Run the following cell to see a count plot, displayed with vertical bars, for the column `type_of_subject` within `biopics_df`.

In [None]:
# Count plot with vertical bars
job = sns.countplot(x = 'type_of_subject', data = biopics_df)

Yikes!!

This plot looks like a hot mess with axis labels that are overlapping and impossible to read. Thankfully, we have an easy solution: create the plot with horizontal bars, instead.

<font color='red'>Exercise 10</font>  
Recreate the plot from above with horizontal, rather than vertical, bars. Remember to use a new variable name.

In [None]:
# Count plot with horizontal bars


## 6. Dataframe formatting (wide vs. long)
Dataframes can be organized in "wide" or "long" format. Depending on the type of plot you wish to create, you'll need to format your dataframe accordingly. Specifically:

* Use wide format for the plot types learned above, i.e., `scatterplot()`, `regplot()`, `histplot()`, and `countplot()`
* Use long format for the new plot types you'll learn below: `pointplot()` and `barplot()`

First, though, let's clarify the difference between wide and long format. Rather than re-invent the wheel, we recommend that you take a look at [this website](https://www.statology.org/long-vs-wide-data/), which does a nice job of explaining the difference.

Below, we'll explore the same set of data represented in both formats. This data set contains information relating to the percentage of people in different age groups that have used specific drugs in the past year.

<font color='red'>Exercise 11</font>  
Run each of the cells below to view the difference between wide and long formatting.

In [None]:
# Wide format
drugs_df

In [None]:
# Same data, but in long format
drugs_df_long

**Note**: If you already have a data set in one format, but need to convert it to the other, you could certainly do this manually in Excel or Sheets. However, this process can be tedious and prone to error, particularly when using large data sets. Instead, you can use Pandas methods to reformat your data. This process is beyond the scope of today's module, but we recommend that you check out the Pandas "cheat sheet" for tips:

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf 
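
For the curious, here is a minimal sketch of what such a conversion might look like, using the Pandas methods `melt()` (wide to long) and `pivot()` (long to wide). The dataframe and its column names (`drug_a`, `drug_b`) are made up for illustration and are not the columns of `drugs_df`; your own data may need different arguments.

```python
import pandas as pd

# Hypothetical wide-format data: one row per age group, one column per drug
wide_df = pd.DataFrame({
    'age': ['18-25', '26-34'],
    'drug_a': [10.0, 8.0],
    'drug_b': [3.0, 2.5],
})

# Wide to long: keep 'age' as an identifier and stack the drug columns
long_df = wide_df.melt(id_vars = 'age', var_name = 'drug',
                       value_name = 'percentage')

# Long back to wide: spread the drug names out into separate columns again
wide_again = long_df.pivot(index = 'age', columns = 'drug',
                           values = 'percentage').reset_index()

print(long_df)
print(wide_again)
```

Note how the long version has one row for every combination of age group and drug, with a single column holding all of the percentages.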

## 7. Point Plots
Point plots could perhaps more accurately be called line plots, but they essentially involve points connected by lines. This will make more sense once you run the code below to see an example!

Point plots are created using the function `pointplot()`, which takes the following inputs:  
* `x = 'col1'`
    * Which column to plot on the x-axis (ex: age)
* `y = 'col2'`
    * Which column to plot on the y-axis (ex: percentage)
* Optional: `hue = 'col3'`
    * Use this option if you have more than one category of data, and you want separate lines for each category (ex: drug)
* `data = df`
    * Relevant dataframe (ex: drugs_df_long)

<font color='red'>Exercise 12</font>  
Run the cell below, which produces a point plot for the `drugs_df_long` dataframe, with the `age` column displayed on the x axis, the `percentage` column on the y axis, and each drug displayed as a different color.

In [None]:
# Point plot for drugs_df_long
drugs_point = sns.pointplot(x = 'age', y = 'percentage', hue = 'drug', 
                            data = drugs_df_long)

## 8. Bar Plots

Bar plots can be created using the function `barplot()`, which takes the same inputs as point plots:

* `x = 'col1'`
    * Which column to plot on the x-axis (ex: age)
* `y = 'col2'`
    * Which column to plot on the y-axis (ex: percentage)
* Optional: `hue = 'col3'`
    * Use this option if you have more than one category of data, and you want separate bars for each category (ex: drug)
* `data = df`
    * Relevant dataframe (ex: drugs_df_long)

<font color='red'>Exercise 13</font>  
Create a bar plot using the same data as the point plot above:
* `age` displayed on the x axis
* `percentage` displayed on the y axis
* `drug` displayed as different bar colors

**Question**: Do you prefer one of these plots over the other (i.e., point vs bar)? For example, do you feel that one plot conveys a message to the audience that is easier to understand? That's the fun of data visualization – taking the same data and deciding how best to tell your story!

In [None]:
# Bar plot for drugs_df_long


This is just the tip of the iceberg! `seaborn` can create many, many more types of plots. We highly recommend checking out the `seaborn` tutorial and example gallery linked at the beginning of this module to see what else is possible.

## Congratulations!
You've finished the APEX Python training modules!