<a href="https://colab.research.google.com/github/JaimeAdele/APEX/blob/main/Module12_data_visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://images.pexels.com/photos/5561923/pexels-photo-5561923.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2' width=700>  
Photo by Olya Kobruseva from Pexels

# APEX Faculty Training, Module 12: Data Visualization

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: Mar 27, 2022  

**Learning outcomes**  

## A couple of notes before you start 
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`
        


## 1. Overview of Data Visualization
So far, we've worked with two Python libraries. `pandas` allows us to work with dataframes and compute descriptive statistics. The `stats` module from the SciPy library allows us to run more complex statistical tests. Now we'd like to learn how to create graphs and plots of our data. In other words, we'd like to visualize it. 

We will use two more Python libraries for data visualization in this module: `matplotlib` and `seaborn`.  

`matplotlib` is a popular library for creating plots in Python, but its plots aren't particularly attractive without a lot of work. `seaborn` uses `matplotlib` functions as a starting point and makes it easy to create attractive plots.  

For a broad view of the plots that `seaborn` can create, visit the user guide and tutorial at http://seaborn.pydata.org/tutorial.html and the example gallery at http://seaborn.pydata.org/examples/index.html.  

To use the `seaborn` functions, we need to import both `matplotlib` and `seaborn`. Note about importing `matplotlib`: You need to include an additional line of code so that plots are displayed inside your notebook. Note about importing `seaborn`: We import it using the initials `sns`. Putting all this together, we get the code in the following cell.  

<font color='red'>Exercise 1</font>  
Run the cell below to import the `matplotlib` and `seaborn` libraries. Just as before, you will not see an output from importing libraries.

In [1]:
# Library for dataframes
import pandas as pd

# Libraries for plotting
import matplotlib
%matplotlib inline
import seaborn as sns

<font color='red'>Exercise 2</font>  
Now that we've imported the libraries we'll be using, let's read in the csv files we'll be working with as dataframes. Run the cell below to create the dataframes.

In [45]:
movies_df = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/movie_ratings.csv')
biopics_df = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/biopics.csv')
drugs_df = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/drug_use.csv')
drugs_df_long = pd.read_csv('https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/drug_use_long.csv')

## 2. Plotting with Seaborn

### 2a. Scatter Plots
Scatter plots are useful for understanding the relationship between two variables, such as two columns of a dataframe. They can be created with `seaborn` using either of the following functions: `scatterplot()` or `regplot()`. The two plots look nearly identical, except that `regplot()` also draws a regression line. This can make it easier to see whether you have a positive or negative correlation.

The syntax for scatter plots is as follows:  
`plot_name = sns.scatterplot(x = 'col1', y = 'col2', data = df_name)`  
`plot_name = sns.regplot(x = 'col1', y = 'col2', data = df_name)`

<font color='red'>Exercise 3</font>  
Run the first cell to see the scatter plot of the `movies_df`, then create the regression plot and save it in a variable called `movie_reg` in the second cell.

In [None]:
# Scatter plot of movies_df
movie_scatter = sns.scatterplot(x = 'fandango', y = 'rotten_tom', data = movies_df)

In [None]:
# Regression plot of movies_df


You can also include additional arguments for these plots to do things like:  
* change the color of plotted points:
  * Add `color = 'red'` after you define the other parameters
* change the shape of plotted points:
  * Add `marker = '*'` after you define the other parameters
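
As a rough illustration of where these extra arguments go, here is a minimal sketch that uses both at once. The dataframe `toy_df` and its column names are made up for this example and are not part of the module's csv files.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data, invented for illustration only
toy_df = pd.DataFrame({'hours_studied': [1, 2, 3, 4, 5],
                       'exam_score': [55, 62, 70, 74, 85]})

# The extra arguments come after x, y, and data
toy_scatter = sns.scatterplot(x = 'hours_studied', y = 'exam_score', data = toy_df,
                              color = 'purple', marker = '*')
```

The order of the keyword arguments does not matter to Python; listing `x`, `y`, and `data` first simply matches the syntax shown above.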

<font color='red'>Exercise 4</font>  
Run the cell below to change the marker for the regression plot from dots to stars.

In [None]:
movie_reg = sns.regplot(x = 'fandango', y = 'rotten_tom', data = movies_df, marker = '*')

<font color='red'>Exercise 5</font>  
Change the color of the markers for the scatter plot to another color of your choosing.

In addition to viewing plots directly in your notebook, you can also save them to Google Drive as images or, our preference, pdfs. To do this:
1. Mount Google Drive, if you haven't already:  
`from google.colab import drive`  
`drive.mount('/content/drive')`
2. Define the output path and file name:  
`out_path = '/content/drive/MyDrive/EDIT_HERE/file.pdf'`
3. Then use the following two commands:  
`fig = my_plot.get_figure()`  
`fig.savefig(out_path)`  

Be sure to change `my_plot` to the name of your plot, and `EDIT_HERE` to a folder that already exists in your Drive.
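
The same two commands work for any location you can write to. The sketch below is one way to try them out without Google Drive: it saves a plot of made-up data (`toy_df` is hypothetical) to the system's temporary folder, which is an assumption chosen only so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd
import seaborn as sns

# Hypothetical toy data, invented for illustration only
toy_df = pd.DataFrame({'x_vals': [1, 2, 3, 4], 'y_vals': [2, 4, 5, 8]})
toy_plot = sns.regplot(x = 'x_vals', y = 'y_vals', data = toy_df)

# Assumed output location: a temporary folder instead of Google Drive
out_path = os.path.join(tempfile.gettempdir(), 'toy_plot.pdf')

# Get the figure that contains the plot, then write it to disk
fig = toy_plot.get_figure()
fig.savefig(out_path)

# Check that the file was written
print(os.path.exists(out_path))
```

The file extension in the path decides the format, so ending the name in `.png` instead of `.pdf` would save an image.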

<font color='red'>Exercise 6</font>  
Run the cell below to save the `movie_reg` regression plot to your Google Drive as a pdf.

In [None]:
# Mount Google Drive so the notebook can write to it (Colab only)
from google.colab import drive
drive.mount('/content/drive')
reg_outpath = '/content/drive/MyDrive/movie_reg.pdf'
fig = movie_reg.get_figure()
fig.savefig(reg_outpath)

### 2b. Histograms and Count Plots
Histograms are used for continuous data, such as height, temperature, score, or price. For example, a histogram can show how many students in a class scored between 91-100, 81-90, 71-80, and so on.  

Histograms are created using the function `histplot()` with the syntax:  
`plot_name = sns.histplot(df_name['col_name'])`  

<font color='red'>Exercise 7</font>  
Run the cell below to create a histogram of the `rotten_tom` column of `movies_df`.


In [None]:
movie_hg = sns.histplot(movies_df['rotten_tom'])

Additional arguments can be included to do things like:
* change the color of plotted data
  * Add `color = 'green'` after the column name
* change the number of bins in a histogram
  * Add `bins = 10` (or a number of your choosing) after the column name

<font color='red'>Exercise 8</font>  
Create a histogram plot of the ratings in the '`fandango`' column of the `movies_df`, changing the color of the bars to one of your choice and changing the number of bins to 15.

Count plots, on the other hand, are used for categorical data, that is, values that fall into categories, such as gender, ethnicity, company, or university. For example, a count plot can show how many people in a group are Latino, Asian, Pacific Islander, and so on. 

Count plots are created using the function `countplot()` with the syntax:  
`plot_name = sns.countplot(x = df_name['col_name'])` for vertical bars, or  
`plot_name = sns.countplot(y = df_name['col_name'])` for horizontal bars  

<font color='red'>Exercise 9</font>  
Run the following cell to see the vertical count plot for the column `type_of_subject` for the `biopics_df`.

In [None]:
# Count plot with vertical bars
job = sns.countplot(x = biopics_df['type_of_subject'])

Notice that you cannot see the labels for the bars.  
<font color='red'>Exercise 10</font>  
Fix this by creating the same `job` count plot, but with horizontal bars instead.

In [None]:
# Count plot with horizontal bars


---

#### <font color='blue'>Note about data formatting</font>
Dataframes can be organized in "wide" or "long" format. The wide format is best for scatter plots, histograms, and count plots, while the long format is best for point plots and bar plots (which we'll learn next).

<font color='red'>Exercise 11</font>  
Run the cells below to see both the wide and long formats for the same data. Each cell currently displays the first 15 rows to save space; feel free to change that number to see more of each dataframe.  

#### • Wide Format
Percentage of people who have used a given drug in the past year, according to age:

In [None]:
drugs_df.head(15)

#### • Long Format
Same data as above

In [None]:
drugs_df_long.head(15)

#### • Converting wide to long format
Doing this conversion manually can be painful if you have a large dataset. Instead, you can convert it with code.
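
One common way to do the conversion is the `pandas` function `melt()`. The sketch below is a minimal example under stated assumptions: `toy_wide` is a made-up dataframe with invented numbers, and its column names are chosen only to resemble the `age`, `drug`, and `percentage` columns used later in this module. The real `drugs_df` may be laid out differently.

```python
import pandas as pd

# Hypothetical wide-format data: one row per age group, one column per drug
# (the values are invented for illustration)
toy_wide = pd.DataFrame({'age': ['18', '19', '20'],
                         'alcohol': [50, 60, 70],
                         'marijuana': [30, 35, 40]})

# id_vars: column(s) to keep as they are
# var_name: name for the new column holding the old column names
# value_name: name for the new column holding the values
toy_long = pd.melt(toy_wide, id_vars = 'age', var_name = 'drug', value_name = 'percentage')

print(toy_long)
```

Each wide row becomes one long row per drug, so three ages and two drugs produce six rows in `toy_long`.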

---

### 2c. Point Plots and Bar Plots
Point plots are created using the function `pointplot()`, which takes the following inputs:  
* `x = df_name['col1']` - in our example, age
* `y = df_name['col2']` - in our example, percentage
* `hue = df_name['col3']` - in our example, drug
  * This is the column with the category names; each category will be plotted with a different hue (color)

<font color='red'>Exercise 12</font>  
Run the cell below, which produces a point plot for the `drugs_df_long` dataframe, with the `age` column displayed on the x axis, the `percentage` column on the y axis, and the drugs displayed as different colors.

In [None]:
# Point plot for drugs_df_long
drugs_point = sns.pointplot(x = drugs_df_long['age'], y = drugs_df_long['percentage'], hue = drugs_df_long['drug'])

Bar plots can be created using the function `barplot()`, which takes the same inputs as point plots:
* `x = df_name['col1']` - in our example, age
* `y = df_name['col2']` - in our example, percentage
* `hue = df_name['col3']` - in our example, drug
  * This is the column with the category names; each category will be plotted with a different hue (color)

<font color='red'>Exercise 13</font>  
Create a bar plot with the same data as the point plot above:
* `age` column displayed on the x axis
* `percentage` column displayed on the y axis
* `drug` column displayed as different colors

In [None]:
# Bar plot for drugs_df_long


This is just the tip of the iceberg! `seaborn` can create many, many more types of plots. We highly recommend checking out the `seaborn` tutorial and example gallery linked at the beginning of this module to see what else is possible. 

## Congratulations!
You've finished the APEX Python training modules!