# Visualizing big data with datashader
Bokeh gets its power by mirroring data from Python (or R) into the web browser. This approach provides full flexibility and interactivity, but because of the way web browsers are designed and built, there are limitations to how much data can be shown in this way. Most web browsers can handle up to about 100,000 or 200,000 datapoints in a Bokeh plot before they will slow down or have memory issues. What do you do when you have larger datasets than that?

The datashader library is designed to complement Bokeh by providing visualizations for very large datasets, focusing on faithfully revealing the overall distribution, not just individual data points. datashader is installed separately from Bokeh, for example with `conda install datashader`.

# When not to use datashader

1. Plotting fewer than 1e5 or 1e6 data points
2. When every datapoint matters; standard Bokeh will render all of them
3. For full interactivity (hover tools) with every datapoint

# When to use datashader

1. Plotting more than 1e5 or 1e6 data points, where browser-based plotting slows down or runs out of memory
2. When the overall distribution matters more than individual datapoints
3. When an aggregated, fixed-size image is enough and per-point interactivity is not required

# How does datashader work?
1. Tools like Bokeh map Data directly into an HTML/JavaScript Plot
2. datashader renders Data into a screen-sized Aggregate array, from which an Image can be constructed then embedded into a Bokeh Plot
3. Only the fixed-sized Image needs to be sent to the browser, allowing millions or billions of datapoints to be used
4. Every step automatically adjusts to the data, but can be customized

### The process roughly follows these steps:
* Data is projected into a scene
* The scene is aggregated/transformed into aggregates
* The aggregates are colormapped into an image
* The image is embedded into the Bokeh plot
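
The steps above can be sketched with NumPy alone. This is not datashader's implementation: `aggregate_points` and `shade_counts` are hypothetical helpers written for illustration, with a plain 2D histogram standing in for the aggregation step and a linear grayscale ramp standing in for the colormapping step.

```python
import numpy as np

def aggregate_points(x, y, width=300, height=200):
    """Project points onto a height x width grid and count points per cell."""
    # Automatic ranging (an assumption of this sketch): the grid spans
    # the data's own extent.
    x_range = (x.min(), x.max())
    y_range = (y.min(), y.max())
    counts, _, _ = np.histogram2d(y, x, bins=(height, width),
                                  range=(y_range, x_range))
    return counts  # shape (height, width), independent of len(x)

def shade_counts(counts):
    """Map counts linearly onto 0..255 grayscale; empty cells stay 0."""
    peak = counts.max()
    if peak == 0:
        return np.zeros(counts.shape, dtype=np.uint8)
    return np.round(255 * counts / peak).astype(np.uint8)

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 1_000_000)
y = rng.normal(0, 1, 1_000_000)

agg = aggregate_points(x, y)   # aggregate array
img = shade_counts(agg)        # fixed-size image that would be sent to the browser
print(agg.shape, int(agg.sum()), img.dtype)
```

However many points go in, only the `height x width` image would need to reach the browser, which is the property the list above relies on.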

# Visualizations supported by datashader
Datashader currently supports:

* Scatterplots/heatmaps
* Time series
* Connected points (trajectories)
* Rasters

In each case, the output is easily embedded into Bokeh plots, with interactive resampling on pan and zoom, in notebooks or apps. Legends/hover information can be generated from the aggregate arrays, helping provide interactivity.
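
As a rough illustration of how hover information might be read from an aggregate array rather than from individual points, the sketch below maps a data coordinate back to its aggregate cell. `lookup_count` is a hypothetical helper, and the aggregate here is a plain NumPy 2D histogram, not a datashader object.

```python
import numpy as np

def lookup_count(counts, x_range, y_range, x, y):
    """Return the aggregate value of the cell containing data point (x, y)."""
    height, width = counts.shape
    # Convert data coordinates to fractional positions in [0, 1].
    fx = (x - x_range[0]) / (x_range[1] - x_range[0])
    fy = (y - y_range[0]) / (y_range[1] - y_range[0])
    if not (0 <= fx <= 1 and 0 <= fy <= 1):
        return None  # the position lies outside the aggregated region
    # Clamp so that the upper edge falls into the last cell.
    col = min(int(fx * width), width - 1)
    row = min(int(fy * height), height - 1)
    return int(counts[row, col])

# A tiny 2x2 aggregate over x in [0, 2] and y in [0, 2].
xs = np.array([0.5, 0.5, 1.5, 1.5, 1.5])
ys = np.array([0.5, 0.5, 0.5, 1.5, 1.5])
x_range, y_range = (0.0, 2.0), (0.0, 2.0)
counts, _, _ = np.histogram2d(ys, xs, bins=(2, 2), range=(y_range, x_range))

hover_value = lookup_count(counts, x_range, y_range, 1.5, 1.5)
print(hover_value)
```

The value returned describes a whole cell (here, the number of points that fell into it), which is why this kind of hover differs from per-point hover tools in standard Bokeh.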

# Faithfully visualizing big data
Once data is large enough that individual points are not easily discerned, it is crucial that the visualization be constructed in a principled way, faithfully revealing the underlying distribution for your visual system to process. For instance, all of these plots show the same data -- is any of them the real distribution?

In [6]:
import pandas as pd
import numpy as np

np.random.seed(1)
num=10000

dists = {cat: pd.DataFrame(dict(x=np.random.normal(x,s,num),
                                y=np.random.normal(y,s,num),
                                val=val,cat=cat))
         for x,y,s,val,cat in 
         [(2,2,0.01,10,"d1"), (2,-2,0.1,20,"d2"), (-2,-2,0.5,30,"d3"), (-2,2,1.0,40,"d4"), (0,0,3,50,"d5")]}

df = pd.concat(dists,ignore_index=True)
df["cat"]=df["cat"].astype("category")
df.tail()

| | cat | val | x | y |
|---|---|---|---|---|
| 49995 | d5 | 50 | -1.397579 | 0.610189 |
| 49996 | d5 | 50 | -2.64961 | 3.080821 |
| 49997 | d5 | 50 | 1.93336 | 0.243676 |
| 49998 | d5 | 50 | 4.306374 | 1.032139 |
| 49999 | d5 | 50 | -0.493567 | -2.242669 |


In [2]:
df.shape

(50000, 4)

Here we have 50,000 points, 10,000 in each of five categories, with associated numerical values. Plotting this much data directly with Bokeh, or with any similar library that copies the full data into the web browser, starts to become slow. Moreover, plotting data of this size with standard approaches has fatal flaws that make the above plots misrepresent the data:

* Plot A suffers from overplotting, with the distribution obscured by later-plotted datapoints.
* Plot B uses smaller dots to avoid overplotting, but suffers from oversaturation, with differences in datapoint density not visible because all densities above a certain value show up as the same pure black color.
* Plot C uses transparency to avoid oversaturation, but then suffers from undersaturation, with the 10,000 datapoints in the largest Gaussian (at 0,0) not visible at all.
* Bokeh can handle 50,000 points, but if the data were larger, these plots would suffer from undersampling, with the distribution not visible or misleading due to too few data points in sparse or zoomed-in regions.
* Plots A-C also required time-consuming and error-prone manual tweaking of parameters, which is problematic if the data is large enough that the visualization is the main way for us to understand the data.
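
The oversaturation and undersaturation problems can be made concrete with a little arithmetic. Assuming standard "over" alpha compositing of identical dots (an assumption about the renderer, not something specific to Bokeh), `n` overlapping dots of opacity `alpha` give a combined opacity of `1 - (1 - alpha)**n`. The helper name `stacked_opacity` is hypothetical.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n overlapping dots, each with opacity alpha."""
    return 1.0 - (1.0 - alpha) ** n

# Print how quickly stacked dots approach full opacity for several alphas.
for alpha in (1.0, 0.1, 0.01):
    row = ", ".join(f"n={n}: {stacked_opacity(alpha, n):.3f}"
                    for n in (1, 10, 100, 1000))
    print(f"alpha={alpha}: {row}")
```

Under this model, with `alpha = 0.01` a single dot contributes only 1% opacity, so sparse regions all but vanish (undersaturation), while about 500 overlapping dots are already more than 99% opaque, so denser regions become indistinguishable from one another (oversaturation). No single alpha value avoids both effects when densities vary widely.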

Using datashader, we can avoid all of these problems by rendering the data to an intermediate array that allows automatic ranging in all dimensions, revealing the true distribution with no parameter tweaking and very little code:

In [4]:
import datashader as ds
import datashader.glyphs
import datashader.transfer_functions as tf

In [13]:
import sys
print("Python " + sys.version)

Python 3.6.3 |Anaconda custom (64-bit)| (default, Nov  8 2017, 18:10:31) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
