<img src="images/python-logo-master.png" style="width:25%;height:25%"></img>
## Data Visualization with Python

>**Hardik I. Parikh, PhD**  
>**School of Medicine Research Computing**  
>**University of Virginia**  
>**11/01/2018**  
>hiparikh@virginia.edu

---

## Jake VanderPlas, PyCon 2017
<img src="https://pbs.twimg.com/media/DBplpP_VYAA0rS5.jpg" alt="" style="width: 600px;"/>



## Today's Goals:  

#### Impossible to cover everything 

- ~~Which chart to choose???~~
- Basic syntax 
- Popular visualization libraries
    - [Matplotlib v2](https://matplotlib.org/index.html)
    - [Seaborn](https://seaborn.pydata.org/)
    - Demo [Bokeh](https://bokeh.pydata.org/en/latest/) (If time permits ...)

## Our Dataset: National Health and Nutrition Examination Survey (NHANES)

[CDC HomePage](https://www.cdc.gov/nchs/nhanes/)  

NHANES is a research program designed to assess the health and nutritional status of adults and children in the United States. It is one of the few surveys to combine interview questions with physical examinations. It began in the 1960s, and since 1999 it has examined a nationally representative sample of about 5,000 people each year. The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The physical exam includes medical, dental, and physiological measurements, as well as several standard laboratory tests. NHANES is used to determine the prevalence of major diseases and risk factors for those diseases. NHANES data are also the basis for national standards for measurements like height, weight, and blood pressure. Data from this survey are used in epidemiological studies and health sciences research, which help develop public health policy, direct and design health programs and services, and expand health knowledge for the Nation.

We are using a small slice of these data: a handful of variables from the 2011-2012 survey years on about 5,000 individuals. The CDC uses a sampling strategy that purposefully oversamples certain subpopulations, such as racial minorities. Naive analysis of the original NHANES data can lead to mistaken conclusions because the percentages of people from each racial group in the data differ from those in the general population. The 5,000 individuals here are resampled from the larger NHANES study population to undo these oversampling effects, so you can treat this as if it were a simple random sample from the American population.

## Matplotlib Basics

**Matplotlib** is the whole package!  

**`matplotlib.pyplot`** is a module in matplotlib for plotting  

**`pylab`** is a convenience module that imports 
- `matplotlib.pyplot` (for plotting), and 
- `numpy` (for mathematics and working with arrays) 

into a single namespace.  
    
#### Import modules    

In [None]:
import matplotlib.pyplot as plt
import numpy as np

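# from pylab import *
# roughly equivalent to the two imports above, but it puts every name into
# one namespace; the explicit imports above are the recommended style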

# import other modules 
import pandas as pd

# import seaborn
import seaborn as sns


### Anatomy of Plot

[Matplotlib Usage](https://matplotlib.org/faq/usage_faq.html)

<img src="https://matplotlib.org/_images/anatomy1.png" alt="" style="width: 500px;"/>


#### Components
- **(Canvas)**

   - **Figure**  
   
      - **Subplots**
      
         - **Axes** (single set, multiple superimposed or stacked)
         
            - **Data series** (single or multiple)
            
         - **Legend**
         
         - **Annotations** (textual and graphic)

In [None]:
# empty figure

# empty figure with 4 Axes
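
One possible way to fill in this cell, as a minimal sketch: `plt.figure()` creates a Figure with no Axes, and `plt.subplots()` creates a Figure together with a grid of Axes. The variable names are illustrative.

```python
import matplotlib.pyplot as plt

# An empty Figure: the top-level container, with no Axes yet
fig_empty = plt.figure()

# A Figure holding a 2 x 2 grid of Axes (four plotting areas)
fig, axes = plt.subplots(nrows=2, ncols=2)
fig.suptitle("Empty figure with 4 Axes")
```

`axes` is a 2 x 2 NumPy array, so a single plotting area is reached with, for example, `axes[0, 1]`.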


## A simple plot

#### `PyPlot` style coding
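
A minimal sketch of the pyplot (state-machine) style, assuming nothing beyond NumPy and Matplotlib: every `plt.*` call acts on the current figure and Axes. The sine and cosine curves are illustrative data only.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)   # 100 evenly spaced points

plt.figure()                          # start a new current figure
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), linestyle="--", label="cos(x)")
plt.xlabel("x (radians)")
plt.ylabel("value")
plt.title("A simple plot")
plt.legend()
```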

#### Multiplot figure: `subplot` 

In [None]:
# subplots function 
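
A sketch of what this cell could contain: `plt.subplots()` returns the Figure and its Axes objects, and each Axes is then drawn on through its own methods. The data are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One row, two columns; sharey=True keeps the y-axis scale identical
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3), sharey=True)
ax1.plot(x, np.sin(x))
ax1.set_title("sin(x)")
ax2.plot(x, np.cos(x), color="tab:orange")
ax2.set_title("cos(x)")
fig.tight_layout()   # reduce overlap between titles and tick labels
```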


In [None]:
# If you want subplots to have different size
# pass the width ratio using `gridspec_kw`
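
A minimal sketch of unequal subplot widths: `gridspec_kw` is forwarded to the underlying grid, and its `width_ratios` entry gives the relative column widths. The 3:1 ratio and the names are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

# The left panel gets three times the width of the right panel
fig, (ax_wide, ax_narrow) = plt.subplots(
    1, 2, figsize=(9, 3), gridspec_kw={"width_ratios": [3, 1]}
)
ax_wide.plot(x, np.sin(x))
ax_wide.set_title("wide panel")
ax_narrow.hist(np.sin(x), bins=10, orientation="horizontal")
ax_narrow.set_title("narrow panel")
```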



#### Save Figure
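
A short sketch of saving a figure with `Figure.savefig()`. The output location (the system temp directory) and the file name are hypothetical; in the workshop you would pick your own path.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9], marker="o")
ax.set_title("Figure to save")

# The file extension selects the output format (png, pdf, svg, ...)
out_path = os.path.join(tempfile.gettempdir(), "example_plot.png")  # hypothetical name
fig.savefig(out_path, dpi=150, bbox_inches="tight")
```

`dpi` sets the resolution of raster formats, and `bbox_inches="tight"` trims surplus whitespace around the plot.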

## Read Data in

In [None]:
# read the data as pandas dataframe 
nh = pd.read_csv("./data/nhanes_long.csv")

nh.head()
#nh.describe()

## Histograms

In [None]:
# plot the distribution of Age 

### Method1: Visualize in Pandas


In [None]:
### Method2: Matplotlib


In [None]:
### Method3: Seaborn
# Seaborn: "high-level" plotting library 
# It has a collection of pre-built complex figures


### Overlay plots

#### Let's visualize distribution of Age by Gender

In [None]:
### Create separate data series by Gender
nh_male = nh[nh.Gender == "male"]
nh_female = nh[nh.Gender == "female"]

In [None]:
### Overlay plots using matplotlib

# define uniform bins for both plots
bins = range(0,81,80//10)
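
A possible completion of the overlay, as a sketch: both histograms are drawn on the same Axes with the same bin edges, and `alpha` makes them translucent. `age_male` and `age_female` are synthetic stand-ins for `nh_male.Age` and `nh_female.Age`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for nh_male.Age and nh_female.Age (NOT real NHANES values)
rng = np.random.default_rng(1)
age_male = rng.integers(0, 81, size=300)
age_female = rng.integers(0, 81, size=300)

bins = range(0, 81, 80 // 10)   # shared bin edges: 0, 8, ..., 80

fig, ax = plt.subplots()
ax.hist(age_male, bins=bins, alpha=0.5, label="male")      # alpha < 1 keeps both visible
ax.hist(age_female, bins=bins, alpha=0.5, label="female")
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.legend()
```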



In [None]:
### Overlay kde plots using Seaborn


#### Plot them side-by-side

In [None]:
# We have already done this; let's repeat 
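
The side-by-side version can be sketched with `plt.subplots()`, one Axes per group. Sharing both axes keeps the two panels directly comparable. The ages are again synthetic stand-ins.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for nh_male.Age and nh_female.Age (NOT real NHANES values)
rng = np.random.default_rng(3)
age_male = rng.integers(0, 81, size=300)
age_female = rng.integers(0, 81, size=300)
bins = range(0, 81, 8)

fig, (ax_m, ax_f) = plt.subplots(1, 2, figsize=(9, 3), sharex=True, sharey=True)
ax_m.hist(age_male, bins=bins)
ax_m.set_title("male")
ax_m.set_ylabel("Count")
ax_f.hist(age_female, bins=bins, color="tab:orange")
ax_f.set_title("female")
```

For the seaborn variant, the same layout should work by passing `ax=ax_m` and `ax=ax_f` to `sns.kdeplot`.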



In [None]:
### kde plots using Seaborn side-by-side



## Scatter plots

Let's plot the relationship between two continuous variables: Age and Height

In [None]:
# plot using pyplot 



In [None]:
# plot data from two series
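
A sketch covering both scatter cells: one `plt.scatter` call draws a single series, and a second call on the same figure overlays another. The age and height values come from a crude, purely illustrative growth curve, not from NHANES.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for Age and Height by gender (NOT real NHANES values)
rng = np.random.default_rng(4)
age_m = rng.uniform(2, 80, size=150)
age_f = rng.uniform(2, 80, size=150)
height_m = np.minimum(85 + 5.0 * age_m, 175) + rng.normal(0, 6, size=150)
height_f = np.minimum(85 + 4.5 * age_f, 162) + rng.normal(0, 6, size=150)

plt.figure()
plt.scatter(age_m, height_m, s=12, alpha=0.6, label="male")
plt.scatter(age_f, height_f, s=12, alpha=0.6, label="female")
plt.xlabel("Age (years)")
plt.ylabel("Height (cm)")
plt.legend()
```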


## Linear Relationships with Seaborn

[Tutorial: Visualizing linear relationships](https://seaborn.pydata.org/tutorial/regression.html#regression-tutorial)

In [None]:
# filter the data for children

nh_child = nh[nh.Age < 18]

In [None]:
# Let's explore the relationship between 
# age and height using sns.regplot()
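
A minimal `regplot` sketch: it draws the scatter, a fitted regression line, and a confidence band on a single Axes. `kids` is a synthetic stand-in for `nh_child` with an assumed linear growth trend.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for nh_child (NOT real NHANES values)
rng = np.random.default_rng(5)
age = rng.integers(2, 18, size=200)
kids = pd.DataFrame({
    "Age": age,
    "Height": 85 + 5.5 * age + rng.normal(0, 6, size=200),
})

# regplot is an Axes-level function: it returns the matplotlib Axes it drew on
ax = sns.regplot(data=kids, x="Age", y="Height",
                 scatter_kws={"s": 12, "alpha": 0.5})
ax.set_title("Height vs. Age (synthetic children)")
```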


In [None]:
# Let's use lmplot()


In [None]:
# Let's use lmplot()


In [None]:
# lmplot() - add variability on X axis
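
A sketch of `lmplot` with a grouping variable and jitter: `hue` fits one regression per group, and `x_jitter` adds small random noise to the plotted x positions so that integer ages do not stack in vertical lines. The jitter affects only the display, not the fit. `kids` is synthetic stand-in data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for nh_child (NOT real NHANES values)
rng = np.random.default_rng(6)
n = 240
age = rng.integers(2, 18, size=n)
kids = pd.DataFrame({
    "Age": age,
    "Gender": rng.choice(["male", "female"], size=n),
    "Height": 85 + 5.5 * age + rng.normal(0, 6, size=n),
})

# lmplot is figure-level: it returns a FacetGrid rather than an Axes
g = sns.lmplot(data=kids, x="Age", y="Height", hue="Gender",
               x_jitter=0.3, scatter_kws={"s": 12, "alpha": 0.5})
```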


In [None]:
# seaborn regression plot with linear regression calculation
import scipy.stats

tips = sns.load_dataset("tips")

slope, intercept = np.polyfit(tips['total_bill'], tips['tip'], 1)  # fit a linear model (1st order)
r, p = scipy.stats.pearsonr(tips['total_bill'], tips['tip'])       # calculate r and p value
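
One way to finish this cell is to draw the regression and write the fitted values onto the plot. The sketch below uses synthetic `x` and `y` arrays in place of `tips['total_bill']` and `tips['tip']`, so that it runs without downloading the dataset; the slope of 0.1 is an arbitrary choice.

```python
import numpy as np
import scipy.stats
import seaborn as sns

# Synthetic stand-ins for tips['total_bill'] and tips['tip'] (illustrative only)
rng = np.random.default_rng(7)
x = rng.uniform(3, 50, size=200)
y = 1.0 + 0.1 * x + rng.normal(0, 1, size=200)

slope, intercept = np.polyfit(x, y, 1)   # fit a linear model (1st order)
r, p = scipy.stats.pearsonr(x, y)        # correlation coefficient and p-value

ax = sns.regplot(x=x, y=y, scatter_kws={"s": 12, "alpha": 0.5})
# Place the equation in the top-left corner, in Axes-relative coordinates
ax.text(0.05, 0.95,
        f"y = {slope:.2f}x + {intercept:.2f}\nr = {r:.2f}, p = {p:.2g}",
        transform=ax.transAxes, va="top")
```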



### Exercise: 

Explore the relationship between Testosterone levels and Age   
- color points by gender
- subset data for men >65 and <80 years old

## Bar plots

Let's plot the relationship between discrete X and continuous Y 

In [None]:
# pandas processing

means = nh.groupby('SmokingStatus')['BMI'].mean()  # mean BMI per smoking status
sds = nh.groupby('SmokingStatus')['BMI'].std()     # std dev of BMI per smoking status
xpos = np.arange(len(means))                # x positions of the bars
names = means.index                         # group labels for the x axis


In [None]:
# Vertical bar graph
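
A sketch of the vertical bar graph built from the grouped means and standard deviations. `demo` is a synthetic stand-in for `nh`, and the three smoking categories are assumed labels.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for nh (NOT real NHANES values); category labels are assumed
rng = np.random.default_rng(8)
demo = pd.DataFrame({
    "SmokingStatus": rng.choice(["Current", "Former", "Never"], size=300),
    "BMI": rng.normal(28, 6, size=300),
})

means = demo.groupby("SmokingStatus")["BMI"].mean()
sds = demo.groupby("SmokingStatus")["BMI"].std()
xpos = np.arange(len(means))

fig, ax = plt.subplots()
ax.bar(xpos, means, yerr=sds, capsize=4)   # error bars show +/- 1 std dev
ax.set_xticks(xpos)
ax.set_xticklabels(means.index)
ax.set_ylabel("Mean BMI")
```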


In [None]:
# Stacked Bars

# Unfortunately, there is no dedicated function for 
# stacked bar plots in Seaborn,
# so let's plot them directly using pandas

mydf1 = nh_male[['id', 'Race', 'SmokingStatus']].groupby(['Race', 'SmokingStatus']).count().unstack()
mydf2 = nh_female[['id', 'Race', 'SmokingStatus']].groupby(['Race', 'SmokingStatus']).count().unstack()
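
A sketch of the stacked bars with pandas: a count table with one row per race and one column per smoking status is passed to `DataFrame.plot(kind="bar", stacked=True)`. `demo` is synthetic, and the category labels are assumptions. The same call could be repeated for `mydf1` and `mydf2` on two Axes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for nh_male / nh_female (NOT real NHANES values)
rng = np.random.default_rng(9)
n = 400
demo = pd.DataFrame({
    "id": np.arange(n),
    "Race": rng.choice(["Asian", "Black", "Hispanic", "White"], size=n),
    "SmokingStatus": rng.choice(["Current", "Former", "Never"], size=n),
})

# Rows: Race, columns: SmokingStatus, values: number of individuals
counts = demo.groupby(["Race", "SmokingStatus"])["id"].count().unstack()

ax = counts.plot(kind="bar", stacked=True, figsize=(6, 4))
ax.set_ylabel("Count")
ax.set_title("Smoking status by race (synthetic)")
```

Selecting `["id"]` before `unstack()` keeps the column labels flat, so the legend reads `Current`, `Former`, `Never` rather than tuples.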



## Box plots

In [None]:
# Statistical plots with Seaborn 


## Swarm plots

In [None]:
# swarm plots with Seaborn
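
A minimal swarm plot sketch: each observation is drawn as its own point, nudged sideways so that points do not overlap. It suits modest sample sizes, so the synthetic stand-in below has only 150 rows.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for nh (NOT real NHANES values); category labels are assumed
rng = np.random.default_rng(10)
n = 150
demo = pd.DataFrame({
    "SmokingStatus": rng.choice(["Current", "Former", "Never"], size=n),
    "BMI": rng.normal(28, 5, size=n),
})

ax = sns.swarmplot(data=demo, x="SmokingStatus", y="BMI", size=3)
ax.set_title("BMI by smoking status (synthetic)")
```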


## Color Palettes with Seaborn

[Tutorial: Choosing color palettes](https://seaborn.pydata.org/tutorial/color_palettes.html)

In [None]:
# Write a function to plot BMI by Race
sns.set_style("whitegrid")
def myPlot():
    fig = sns.boxplot(data=nh, x="Race", y="BMI", width=0.5)
    fig.set_title("BMI by Race", fontsize=16)
    plt.show()

In [None]:
# Statistical plots with Seaborn 
myPlot()

In [None]:
# change color palette to Set2
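
A sketch of switching to the `Set2` palette for a single plot: used as a context manager, `sns.color_palette()` changes the default color cycle only inside the `with` block. How the cycle is mapped onto the boxes depends on the seaborn version, so treat the exact colors as unverified. `demo` is synthetic stand-in data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for nh (NOT real NHANES values); category labels are assumed
rng = np.random.default_rng(11)
n = 300
demo = pd.DataFrame({
    "Race": rng.choice(["Asian", "Black", "Hispanic", "White"], size=n),
    "BMI": rng.normal(28, 6, size=n),
})

# The palette applies only to plots created inside this block
with sns.color_palette("Set2"):
    ax = sns.boxplot(data=demo, x="Race", y="BMI", width=0.5)
    ax.set_title("BMI by Race (Set2 palette)")
```

To change the palette for every later plot instead, `sns.set_palette("Set2")` sets it globally.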


In [None]:
# give custom list of colors
mycolors = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]



## Figure Aesthetics with Seaborn

[Tutorial: Controlling Figure Aesthetics](https://seaborn.pydata.org/tutorial/aesthetics.html#aesthetics-tutorial)

### Figure Style

In [None]:
sns.set_style("whitegrid")
with sns.color_palette(mycolors):
    fig = sns.boxplot(data=nh, x="Race", y="BMI", width=0.5)
    fig.set_title("BMI by Race", fontsize=16)
    plt.show()

In [None]:
# to change back to the defaults, simply call sns.set()
sns.set()


In [None]:
### Try other styles: 
# sns.set_style("dark")
# sns.set_style("white")
# sns.set_style("ticks")


In [None]:
### Set context
# Four presets available: paper, notebook, talk, poster
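
A sketch of the four context presets: `sns.plotting_context()` returns the scaling parameters for a preset and also works as a context manager, while `sns.set_context()` applies a preset globally. The plotted line is illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Compare the base font size of each preset
for context in ["paper", "notebook", "talk", "poster"]:
    print(context, sns.plotting_context(context)["font.size"])

# Temporarily use the larger "talk" scaling for one figure
with sns.plotting_context("talk"):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title("talk context")
```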


## Facet Grid with Seaborn

[Tutorial: Multi-plot grid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)

In [None]:
sns.set_style("white")
# let's define our grid


## Interactive graphics with Bokeh

[Gallery](https://bokeh.pydata.org/en/latest/docs/gallery.html)

In [None]:
# set up bokeh
from bokeh.plotting import figure, output_notebook, show
output_notebook()    # direct output to the Jupyter notebook
# for export to png or svg files, see https://bokeh.pydata.org/en/latest/docs/user_guide/export.html

In [None]:
### Import test data
from bokeh.sampledata.iris import flowers
flowers.head()
#flowers.describe()
#flowers['species'].unique()

In [None]:
# Subset the dataframes
flowers_setosa = flowers[flowers.species == "setosa"]
flowers_versicolor = flowers[flowers.species == "versicolor"]
flowers_virginica = flowers[flowers.species == "virginica"]

# create the plot
p = figure(title = "Iris Morphology")
# `legend_label` replaces the older `legend` keyword (Bokeh >= 1.4)
p.circle(flowers_setosa["petal_length"], flowers_setosa["petal_width"], 
         color="red", fill_alpha=0.2, size=10, legend_label="setosa")
p.circle(flowers_versicolor["petal_length"], flowers_versicolor["petal_width"], 
         color="green", fill_alpha=0.2, size=10, legend_label="versicolor")
p.circle(flowers_virginica["petal_length"], flowers_virginica["petal_width"], 
         color="blue", fill_alpha=0.2, size=10, legend_label="virginica")

p.legend.location = 'top_left'

show(p)


In [None]:
from bokeh.models import HoverTool

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ["2015", "2016", "2017"]
colors = ["#c9d9d3", "#718dbf", "#e84d60"]

data = {'fruits' : fruits,
        '2015'   : [2, 1, 4, 3, 2, 4],
        '2016'   : [5, 3, 4, 2, 4, 6],
        '2017'   : [3, 2, 4, 4, 5, 3]}

# print(data)

# `height` and `legend_label` are the current keyword names; older Bokeh
# releases used `plot_height` and `legend=[value(x) for x in years]`
p = figure(x_range=fruits, height=250, title="Fruit Counts by Year",
           toolbar_location=None)

p.vbar_stack(years, x='fruits', width=0.9, color=colors, source=data,
             legend_label=years)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"

show(p)

In [None]:
# Two graphs with linked selection
# Credit: Dr. Jim Harrison, Python Data Viz Workshop 04/04

from bokeh.sampledata.autompg import autompg as auto_data   # a pandas dataframe
from bokeh.models import ColumnDataSource   # data source for linked graphs
from bokeh.layouts import gridplot          # layout tools for multiple plots

# create a common data source for both graphs using the ColumnDataSource class
source = ColumnDataSource(data = {'x1': auto_data['mpg'], 'x2': auto_data['accel'], 
                                  'y1': auto_data['hp'], 'y2': auto_data['weight']})  # dictionary of columns

TOOLS = "box_select, lasso_select, reset"   # select a useful subset of interactive tools
# left side plot
L = figure(width=400, height=400, title="Horsepower vs. Mileage", tools=TOOLS)
L.circle(x='x1', y='y1', source=source, size=8, alpha=0.6)
L.xaxis.axis_label="Miles Per Gallon"
L.yaxis.axis_label="Horsepower"
# right side plot
R = figure(width=400, height=400, title="Weight vs. Acceleration", tools=TOOLS)
R.circle(x='x2', y='y2', source=source, size=8, alpha=0.6)
R.xaxis.axis_label="Acceleration"
R.yaxis.axis_label="Weight"

p = gridplot([[L, R]])  # gridplot handles multiple figures in rows & cols
show(p)

### Useful Links: 

- [Matplotlib Usage](https://matplotlib.org/faq/usage_faq.html)
- [Matplotlib Tutorials](https://matplotlib.org/tutorials/index.html)
- [Seaborn Gallery](https://seaborn.pydata.org/examples/)
