# Introduction to Data Visualization with Python

This notebook was prepared by Taylor Hixson (taylor.hixson@nyu.edu) for the NYUAD Library workshop Introduction to Data Visualization with Python offered on 3 November 2019. This workshop shows the basics of using seaborn to create static visualizations for publication.

[Seaborn](https://seaborn.pydata.org/) is a Python library that is built on [Matplotlib](https://matplotlib.org/). While Matplotlib offers great flexibility and customization (e.g. creating functions to automate figures), seaborn has a simpler syntax and visually attractive default style options. 

Much of the content in this session was inspired by the [DataCamp](https://www.datacamp.com/) courses [Introduction to seaborn](https://www.datacamp.com/courses/introduction-to-seaborn) and [Introduction to Matplotlib](https://www.datacamp.com/courses/introduction-to-matplotlib).


## Introductory notes and resources
### Working in a Jupyter Notebook
- Jupyter Notebook works sequentially, so it is necessary to follow the order in the notebook to complete tasks. That is, run the code cells in the order of the exercises in this notebook or errors may occur.
- To run code in Jupyter Notebook, use the **Run** button in the toolbar above, or press `shift + return/enter` on the keyboard.
- To edit a block of text, double-click it. Run the cell to return it to its rendered state.
- Press **H** on the keyboard to see all keyboard shortcuts available in Jupyter Notebook.

### How to copy a file path
- **Mac**: right click/control (ctrl) click the file and press alt. An option to **Copy FILE as Pathname** will appear in the menu. Select it to copy the path.
- **PC**: hold down shift on the keyboard and, while continuing to hold down shift, right click the file. An option to **Copy as path** will appear in the menu. Select it to copy the path.

### Further resources
- [seaborn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf)
- [Matplotlib Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Workshop key terms and resources]()

<a id="toc"></a>
## Table of contents
- [Exercise 0: Import libraries and load data](#ex0)
- [Relational plots](#relplot)
    - [Exercise 1: Create a scatter plot](#ex1)
    - [Exercise 2: Change x and y values](#ex2)
    - [Exercise 3: Customize plot parameters](#ex3)
    - [Exercise 4: Styling and saving plots](#ex4)
    - [Exercise 5: Load in a dataset from a csv with Pandas](#ex5)
    - [Exercise 6: Try it yourself](#ex6)
- [Categorical plots](#cat)
    - [Exercise 7: Explore data for categorical plots](#ex7)
    - [Exercise 8: Create a bar chart](#ex8)
    - [Exercise 9: Rotating labels](#ex9)
    - [Exercise 10: Explore different charts](#ex10)
- [Seaborn color palettes](#color)
    - [Exercise 11: Setting a palette](#ex11)
- [Distribution plots](#dist)
    - [Exercise 12: Explore distribution plots](#ex12)

<a id="ex0"></a>
## Exercise 0: Import libraries and load data
Python uses zero-based indexing--so let's start our exercises at 0, too!

To complete this exercise, run the following two cells to import the three Python libraries used in this workshop and the first practice dataset. To run code in Jupyter Notebook, use the **Run** button in the toolbar above, or press `shift + return/enter` on the keyboard.

Jupyter Notebook works sequentially, so it is usually necessary to follow the order of the code in the notebook to complete tasks. That is, errors may occur if a library is not imported or a cell is skipped.

The libraries listed below are part of the core Anaconda package, so if running Jupyter Notebooks through Anaconda, all you need to do is run the code to import the libraries [pandas](https://pandas.pydata.org/), [matplotlib](https://matplotlib.org/3.1.1/index.html) and [seaborn](https://seaborn.pydata.org/index.html).

[Return to table of contents](#toc)

In [None]:
# Import libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Run the next cell to load in the `tips` dataset and view the first five rows, which is referred to as the head of a dataset. The tips dataset is provided by the seaborn library to test visualizations. Later, we will load in data from other sources.

In the output, take note of the column names: total_bill, tip, sex, smoker, day, time, size. What type of data is in each column?

In [None]:
# Import data

# sns.load_dataset is used to load in the sample data available from Seaborn
tips = sns.load_dataset("tips")

# View the first 5 rows of the tips dataset
print(tips.head())

<a id="relplot"></a>
## Introduction to relational plots
Relational plots show the relationship between 2 quantitative (numeric) variables. This session uses `sns.relplot()` to show how to create relational plots: scatter and line. In a relplot, the x and y axis data must be numeric values. [More on sns.relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html)

It is possible to make scatter and line plots using `sns.scatterplot()` and `sns.lineplot()`, but `sns.relplot()` allows for subplotting.

The required parameters of `sns.relplot()`: 
- x = "Column" Between the quotations, place the column name to go on the x axis
- y = "Column" Between the quotations, place the column name to go on the y axis
- data = After the equal sign, place the variable name of the dataset

Optional to create different types of plots:
- kind = "scatter" OR "line"

Other optional and highly used parameters of sns.relplot() are:
- hue = "Column" Column to color code by. This can be any column in the dataset, not just x or y
- size = "Column" Column to visualize data with varying sizes
- row = "Column" Creates subplots based on a category/column and arranged in rows
- col = "Column" Creates subplots based on a category/column and arranged in columns
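
The parameters above can be combined in a single call. The following is a minimal sketch, not part of the original workshop: the `meals` table is a hypothetical, hand-made stand-in for `tips` so that the example runs without downloading data. In a notebook the figure renders when the cell finishes.

```python
import pandas as pd
import seaborn as sns

# Hypothetical miniature dataset, loosely modeled on the columns of `tips`
meals = pd.DataFrame({
    "total_bill": [10.5, 20.0, 15.2, 30.1, 25.4, 18.3],
    "tip": [1.5, 3.0, 2.0, 5.0, 4.0, 2.5],
    "size": [2, 3, 2, 4, 4, 2],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner", "Lunch"],
})

# hue colors the points by party size; col draws one subplot per value of `time`
g = sns.relplot(x="total_bill",
                y="tip",
                data=meals,
                kind="scatter",
                hue="size",
                col="time")

# g.axes is a 2D array of subplots: one row, one column per unique `time` value
print(g.axes.shape)
```

Swapping `col` for `row` should stack the same subplots vertically instead.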

[Return to table of contents](#toc)

<a id="ex1"></a>
## Exercise 1: Create a scatter plot

To complete this exercise, run the following line of code to create a scatter plot.

[Return to table of contents](#toc)

In [None]:
# sns.relplot(x,y,data, kind="scatter")
sns.relplot(x = "total_bill",
            y = "tip",
            data = tips,
            # What does this look like when scatter is changed to line?
            kind = "scatter")

plt.show()

<a id="ex2"></a>
## Exercise 2: Change x and y values

In the following line of code, swap the x and y values. That is, make `x = "tip"` and `y = "total_bill"`. How does the plot change?

Try again by changing the x or y value to `size`. How does the plot look now?

[Return to table of contents](#toc)

In [None]:
# sns.relplot(x,y,data, kind="scatter")
sns.relplot(x = "total_bill",
           y = "tip",
           data = tips,
           kind = "scatter")

plt.show()

<a id="ex3"></a>
## Exercise 3: Customize plot parameters

To complete this exercise:
1. Uncomment **lines 6, 7, AND/OR 8** to add more parameters from the `tips` dataset to the scatter plot such as `hue = "size"`, where size indicates the number of people dining at the table. To uncomment a line, delete the `#`.
2. On **line 12** in `g.set()`, add x and y labels to the axes and a title by adding text between the empty quotations.

**Note**: Notice the addition of `g =` in line 2. Defining the plot as a variable, which in this case is arbitrarily named g, allows you to call on more functions such as `g.set()` in line 12. This is really only necessary if using one of the subplotting options through the col and row parameters. Overall, it is a good practice to follow if you plan on creating more complex plots.
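
To see what the variable holds, the sketch below inspects it. It assumes a small hypothetical table named `points` rather than the workshop data, and the label text is only an illustration. `sns.relplot()` returns a seaborn `FacetGrid`, and its `set()` method passes the labels on to every subplot.

```python
import pandas as pd
import seaborn as sns

# Hypothetical three-row table used only for illustration
points = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0],
                       "tip": [1.0, 3.0, 5.0]})

g = sns.relplot(x="total_bill", y="tip", data=points, kind="scatter")

# The returned object is a FacetGrid, not a plain Matplotlib figure
grid_type = type(g).__name__
print(grid_type)

# set() applies the same labels and title to each subplot in the grid
g.set(xlabel="Total bill in USD",
      ylabel="Tip in USD",
      title="Tips by total bill")

# Read the labels back from the first (and here only) subplot
first_ax = g.axes.flat[0]
print(first_ax.get_xlabel(), "|", first_ax.get_ylabel(), "|", first_ax.get_title())
```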

[Return to table of contents](#toc)

In [None]:
# Add more parameters to visualize the scatter plot
g = sns.relplot(x = "total_bill",
           y = "tip",
           data = tips,
           kind = "scatter",
           #hue = " ",
           #size = " ",
            #col = " "
           )

# Add a title and label the axes
g.set(xlabel = " ",
      ylabel = " ",
      title = " ")

plt.show()

<a id="ex4"></a>
## Exercise 4: Styling and saving plots

The great default style options are one of the reasons to use seaborn! 

To complete this exercise: 
1. In **line 2**, add a grid style within the quotations in `sns.set_style("")`. The options are "white", "dark", "whitegrid", "darkgrid", or "ticks".

2. In **line 6**, change the context of the plot to `"poster"` with `sns.set_context()`. This makes it so the scale of the visual is suitable for a poster presentation.

3. In **line 21** change MAIN PLOT TITLE to something more descriptive about the plot. 

4. Uncomment **line 43** by deleting the **#**. Line 43 contains code for saving the visualization to a file by providing a path (~/Desktop/), a new file name (myFirstPlot), and file type (.png). 

**Notes**: 
- **Line 43** uses a **relative file path**. If a new image file does **not** appear on the desktop after a few minutes, **copy** the folder/file path where you want to save the output and replace ~/Desktop (retain the final /) with the copied path. To copy the file path on a Mac, right click the file/folder and hold down the alt key. The option to **Copy as Pathname** will appear in place of the regular copy. Select this option to copy the path.

- To save a figure, DO NOT use `plt.show()` in the code, because once a figure has been shown, a later `plt.savefig()` can write a blank image. To save **and** view the figure, use `plt.draw()`, NOT `plt.show()`, in the code. However, neither `plt.show()` nor `plt.draw()` is actually required to view the plot in a Notebook environment. This is not the case for all Python environments, so including one of them is a good practice to follow if you plan to use other Python environments for your work.
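
If a path that starts with `~` does not save where expected, one possible cause is that Matplotlib hands the path to the operating system unchanged, in which case `~` is not translated into your home folder. The sketch below shows one way around this with Python's built-in `os.path.expanduser()`. The helper name `save_figure` is hypothetical, and the example saves into a temporary folder so that it does not touch your own files.

```python
import os
import tempfile
import matplotlib.pyplot as plt


def save_figure(fig, path, dpi=300):
    """Expand a leading ~ to the home folder, save the figure, return the full path."""
    full_path = os.path.expanduser(path)
    fig.savefig(full_path, dpi=dpi)
    return full_path


fig, ax = plt.subplots()
ax.scatter([10, 20, 30], [1, 3, 5])

# A temporary folder is used here for safety; in the workshop you might
# pass "~/Desktop/myFirstPlot.png" instead.
target = os.path.join(tempfile.gettempdir(), "myFirstPlot.png")
saved_to = save_figure(fig, target)
print(saved_to)
```

The same helper should also work with the plots in this notebook by passing `g.fig` as the first argument, since a seaborn grid keeps its Matplotlib figure in that attribute.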

[Return to table of contents](#toc)

In [None]:
# seaborn has 5 default styles: "white", "dark", "whitegrid", "darkgrid", "ticks"
sns.set_style(" ")

# Change the scale of the plot to suit where it will be shown. Default is "notebook"
# Options: "paper", "notebook", "talk", "poster"
sns.set_context(" ")

# Add data and axes
# Uncomment col = below to create subplots
g = sns.relplot(x = "total_bill",
           y = "tip",
           data = tips,
           kind = "scatter",
            hue = "size",
           size = "size",
             # Col will create subplots from the categories   
          #col = "smoker"
           )

# Change the main title on the plot using the Matplotlib function plt.suptitle()
plt.suptitle("MAIN PLOT TITLE",
            ha = "center",
             #y = 1.03,
             #x = 0.1,
            weight = "extra bold")

# Label axes
g.set(xlabel = "Total bill in USD",
      ylabel = "Tip in USD")

# !!!!!Bonus!!!!!!
# Change subplot titles
#axes = g.axes.flatten()
#axes[0].set_title("Smoker")
#axes[1].set_title("Nonsmoker")

# Save plot by uncommenting plt.savefig and inserting your own file path
# To save a figure, DO NOT use plt.show() in the code
# To save and view the figure, use plt.draw(), NOT plt.show()
# However, neither plt.show() nor plt.draw() is actually required to view the plot in Notebook
# This is not the case for all Python environments
plt.draw()
#plt.savefig("~/Desktop/myFirstPlot.png", dpi=300)

<a id="ex5"></a>
## Exercise 5: Load in a dataset from a csv with Pandas

[Pandas](https://pandas.pydata.org/) is a Python library for data analysis, and it is great for reading in many data file types, including a .csv file (comma separated values).

To complete this exercise, run the code to print the first five lines, or the head, of the `titanic` dataset. What types of values are in each column: categorical, numeric, etc.?

**NOTE**: The code below uses a **relative file path**, so if something resembling a data table does **not** appear, **copy** the file path for the `titanic.csv` dataset that was included in the workshop download and **paste** it between the quotations in **line 2**. To copy the file path on a Mac, right click the file and hold down the alt key. The option to **Copy as Pathname** will appear in place of the regular copy. Select this option to copy the path.
 
[Return to table of contents](#toc)

In [None]:
# Load a csv from a file path
titanic = pd.read_csv("~/IntroPythonViz/titanic.csv")
print(titanic.head())
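
To check column types without reading the printed table by eye, pandas can report them directly. This is a minimal sketch that reads a hypothetical three-row extract from a text string instead of a file, so the names and values below are illustrative and are not taken from the real `titanic.csv`.

```python
import io
import pandas as pd

# Hypothetical extract; the real titanic.csv has more rows and columns
csv_text = """Name,Sex,Age,Fare,Pclass
A. Passenger,female,29,211.34,1
B. Passenger,male,35,8.05,3
C. Passenger,female,4,16.70,3
"""

# read_csv accepts a file path or, as here, a file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())
# dtypes lists the type pandas inferred for each column
print(sample.dtypes)
```

Columns reported as integer or float types are candidates for the x and y axes of a relational plot, while text columns are usually categories.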

<a id="ex6"></a>
## Exercise 6: Try it yourself

To complete this exercise, fill in all the parameters with empty quotation marks to create a relational plot with values from the `titanic.csv` dataset. Look for empty quotations on **lines 2, 3, 6, 14, 23, and 24**.

Remember, relational plots require numeric values. Not sure which to use? Try `Age` and `Fare`.

[Return to table of contents](#toc)

In [None]:
# Uncomment col = below to create subplots
ax = sns.relplot(x = " ",
           y = " ",
           data = titanic,
            #use scatter OR line for kind
           kind = " ",
         # Bonus: Style by further features
          # hue = " ",
          # size = " ",  
           # col = " "
           )

# Change the main title on the plot using .suptitle("")
plt.suptitle(" ",
             # Bonus: Uncomment to style the title
            # ha = "center",
             #y = 1.03,
             #x = 0.1,
            # weight = "extra bold"
            )

# Label axes
ax.set(xlabel = " ",
      ylabel = " ")

#Uncomment and add your file path and dpi (dots per inch)
#plt.savefig("", dpi=)

<a id="cat"></a>
## Categorical Plots

Categorical plots show the distribution of quantitative values as defined by a categorical variable. This session uses `sns.catplot()` to show how to create categorical plots: bar, count, box, point, etc. [More on sns.catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html)

It is possible to make categorical plots using other functions, but `sns.catplot()` allows for subplotting. 

The required parameters of `sns.catplot()`: 
- x = "Column" Between the quotations, place the column name to go on the x axis
- y = "Column" Between the quotations, place the column name to go on the y axis
- data = After the equal sign, place the variable name of the dataset

Optional to create different types of plots:
- kind = "bar" OR "count" OR "box" OR "point" OR "boxen" OR "swarm" OR "strip" OR "violin". If kind is not used, the default is strip.

Other optional and highly used parameters of `sns.catplot()` are:
- hue = "Column" Column to color code by. This can be any column in the dataset, not just x or y
- row = "Column" Creates subplots based on a category/column and arranged in rows
- col = "Column" Creates subplots based on a category/column and arranged in columns

[Return to table of contents](#toc)

<a id="ex7"></a>
## Exercise 7: Explore data for categorical plots

To complete this exercise, look back at the head printed in Exercise 5. Copy any column heading from the `titanic` dataset, and paste it between the quotation marks in **line 2** of the below code block. Run the cell to print the unique entries of the specified column. 

One reason you might want to know the number of unique categories in a column is to learn more about whether that data would make a good bar chart. That is, are there too many or too few categories to make an understandable categorical plot?

[Return to table of contents](#toc)

In [None]:
# Use the pandas method .unique() to list the unique entries in a specified column
unique_values = titanic[" "].unique()
print(unique_values)

#It's also possible to run this on one line without defining a variable
#print(titanic["Fare"].unique())
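
Listing the unique entries is one option; pandas also offers `.nunique()` to count them and `.value_counts()` to count the rows in each category. The sketch below uses a hypothetical six-row table named `passengers`, not the workshop file.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column
passengers = pd.DataFrame({"Pclass": [1, 3, 3, 2, 3, 1]})

# Distinct values, in order of first appearance
classes = passengers["Pclass"].unique()
print(classes)

# How many distinct values there are
n_classes = passengers["Pclass"].nunique()
print(n_classes)

# How many rows fall into each distinct value
rows_per_class = passengers["Pclass"].value_counts()
print(rows_per_class)
```

A column with only a handful of distinct values tends to make a readable bar chart, while one with hundreds usually does not.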

<a id="ex8"></a>
## Exercise 8: Create a bar chart

To complete this exercise, use `sns.catplot()` to create a `kind = "bar"` chart comparing `Fare` across the recorded `Sex` of the passengers. By default, each bar shows the mean `Fare` for its category. Place `Fare` and `Sex` as either x or y. Run the code once, and then swap the x and y values to see if there is any visual difference when it is run again.

[Return to table of contents](#toc)

In [None]:
sns.catplot(x = " ",
            y = " ",
           data = titanic,
           kind = " ")

plt.show()
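
To read the bar heights as numbers, the same summary can be computed with pandas. This sketch assumes the default behavior of a seaborn bar plot, which is to draw the mean of each category, and it uses a hypothetical four-row table named `fares` instead of the workshop data.

```python
import pandas as pd

# Hypothetical fares used only for illustration
fares = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Fare": [80.0, 20.0, 40.0, 10.0],
})

# Group the rows by category, then average the numeric column in each group
mean_fare = fares.groupby("Sex")["Fare"].mean()
print(mean_fare)
```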


<a id="ex9"></a>
## Exercise 9: Rotating labels
To complete this exercise: 
1. On **lines 13 and 14** label the x and y axes to represent the respective values in **lines 7 and 8**.
2. On **line 21** in `f.set_yticklabels(rotation=)`, add a number after the equal sign between 0-360 to rotate the label. 
3. Run the cell and view the output. How does the rotated y label look now? Replace the number and run the code block a second time.

Optional: change the rotation angle of the x-axis labels in **line 20**.

[Return to table of contents](#toc)

In [None]:
# Style

sns.set_style("darkgrid")

sns.set_context("paper")

f = sns.catplot(x = "Sex",
            y = "Fare",
           data = titanic,
           kind = "bar")

# Label
f.set(xlabel = " ",
      ylabel = " ",
      #xlim = (-20,20)
      )

# Defining the variable f in line 7 allows
# access to FacetGrid methods such as .set_xticklabels()
f.set_xticklabels(rotation=0)
f.set_yticklabels(rotation= )


plt.show()
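
To confirm that a rotation was applied, the angle can be read back from the tick labels. The sketch below is an illustration on a hypothetical four-row table named `fares`; the 45 degree angle is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Hypothetical fares used only for illustration
fares = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Fare": [80.0, 20.0, 40.0, 10.0],
})

f = sns.catplot(x="Sex", y="Fare", data=fares, kind="bar")

# Rotate every x-axis tick label in the grid by 45 degrees
f.set_xticklabels(rotation=45)

# Read the rotation back from the first subplot's tick labels
first_ax = f.axes.flat[0]
rotations = [label.get_rotation() for label in first_ax.get_xticklabels()]
print(rotations)
```

Rotation is most useful when category names are long enough to overlap.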

<a id="ex10"></a>
## Exercise 10: Explore different charts

To complete this exercise, in **line 4** add a value between the quotations of the `kind = " "` parameter. Value options include `"violin"`, `"swarm"`, and `"box"`. Find the range of options in the description of [categorical plots](#cat).

The x and y parameters as well as the data have already been added for you, but you may change them.

[Return to table of contents](#toc)

In [None]:
sns.catplot(x = "Pclass",
            y = "Age",
            data = titanic,
            kind = " ")

<a id="color"></a>
## Seaborn color palettes

Styling data with an intuitive and accessible color palette is an important part of visualization. Seaborn offers many different palettes and customization options. Some basics are listed below, and more information is available in the seaborn [color palettes documentation](https://seaborn.pydata.org/tutorial/color_palettes.html). 

Set a color palette using the function `sns.set_palette()`. Between the parentheses, add a default color palette or create your own. Seaborn's default palettes must be placed within quotation marks. It is also possible to use the parameter `palette = ` within the `sns.relplot()` and `sns.catplot()` functions to define a palette for a plot.

### Sample color palettes

- Qualitative color palettes: "deep", "muted", "pastel", "bright", "dark", and "colorblind".

- Sequential color palettes: "Blues", "BuGn"

- Diverging color palettes: "BrBG", "RdBu", "coolwarm" 

Notes:
- Use `_r` to reverse a sequential palette. E.g.: `"BuGn_r"`
- Use `_d` to darken the palette

### Create a custom palette
It is possible to create a custom palette by passing a list of seaborn colors and/or hex codes to the function `sns.set_palette()`. To create a list, use square brackets [], place the list items in quotations, and separate each item by a comma. For example: `sns.set_palette(["#F2FF33","blue"])`
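
To preview a palette before applying it, `sns.color_palette()` returns the colors as a list of red, green, blue values. The sketch below reuses the two colors from the example above; the variable names are only illustrative.

```python
import seaborn as sns

# Build a palette from one hex code and one named color
custom = sns.color_palette(["#F2FF33", "blue"])

# Each entry is a (red, green, blue) tuple with values between 0 and 1
print(len(custom))
print(custom[1])

# Apply the palette to the plots that follow
sns.set_palette(custom)

# Named palettes can be previewed the same way, including reversed ones
reversed_blues = sns.color_palette("Blues_r")
print(len(reversed_blues))
```

In a notebook, `sns.palplot(custom)` should draw the colors as a row of swatches.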

[Return to table of contents](#toc)

<a id="ex11"></a>
## Exercise 11: Setting a palette

To complete this exercise:
1. In **line 2** set a `colorblind` palette for the chart using `sns.set_palette("")`.
2. Edit the functions or parameters on **lines 5, 8, 15, 19, 20, 24, and 42** to style, visualize, and save a chart using everything you have learned in this workshop so far.

[Return to table of contents](#toc)

In [None]:
# Set palette to colorblind
sns.set_palette(" ")

# Options: "white", "dark", "whitegrid", "darkgrid", or "ticks".
sns.set_style(" ")

# Options: "paper", "notebook", "talk", "poster"
sns.set_context(" ")

ax = sns.catplot(x = "Pclass",
           y = "Age",
           data = titanic,
            hue = "Sex",
            # Options: box, violin, swarm
           kind = " ",
           col = "Survived")

# Label axes
ax.set(xlabel = " ",
      ylabel = " "
      )

# Provide a main title
plt.suptitle(" ",
             # Bonus: Uncomment to style the title
            # ha = "center",
             y = 1.03,
             #x = 0.1,
            # weight = "extra bold"
            )


# Bonus
# Uncomment lines 35-37 to change subplot titles
#axes = ax.axes.flatten()
#axes[0].set_title("Died")
#axes[1].set_title("Survived")

plt.draw()

# Uncomment line 42 to save figure as an image. Replace the file path with your own.
#plt.savefig("~/Desktop/distplot.png", dpi=900)

<a id="dist"></a>
## Distribution plots

Distribution plots show the distribution and range of a set of values. There are many options for distribution plots. A histogram is one commonly used distribution plot. Seaborn allows you to make interesting and complex distribution plots such as a kernel density plot with `sns.kdeplot()` or a combination of multiple distributions with `sns.jointplot()`. [More about distribution plots.](https://seaborn.pydata.org/tutorial/distributions.html)

To create a histogram with seaborn, use `sns.distplot()`. The only required parameter of `sns.distplot()` is the data. To pass a specific column of a dataset loaded in the workspace, use `sns.distplot(dataset["Column_Name"])`, where `dataset` represents the variable name defined when the data was loaded into the workspace.

Some of the optional parameters for `sns.distplot()` include:
- bins = A numeric value for the number of groupings of the data to display. E.g., bins = 10
- rug = True OR False
- hist = True OR False
- kde = True OR False
- color = an HTML color name OR hex code within quotations. E.g., color = "red" where red indicates an HTML color name, or color = "#000000" where #000000 indicates the hex code for black.

[Return to table of contents](#toc)

<a id="ex12"></a>
## Exercise 12: Explore distribution plots

To complete this exercise:
1. First, run the following three blocks of code. What kinds of distribution plots are these?
2. Make changes to the parameters (e.g., data selected, `kind =`, `rug = `) to try to make new distribution plots.

Bonus: Save a plot with `plt.savefig()`.

[Return to table of contents](#toc)

In [None]:
sns.set()
sns.distplot(titanic["Age"], 
             bins = 5,
            rug = True,
             #hist = False,
            kde = True,
            color = "#000000")

In [None]:
sns.jointplot(x = titanic["Fare"],
             y = titanic["Age"],
             kind = "hex")

In [None]:
sns.kdeplot(tips["tip"], 
        tips["total_bill"], 
        n_levels=10, 
        shade=True)