## Plotting non-geographic data

Most of the datashader examples use geographic data, because it is so easily interpreted, but datashading helps with exploring any pair of data dimensions. Let's start by plotting `trip_distance` versus `fare_amount` for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb.

### Load NYC Taxi data

(takes a dozen seconds or so...)

In [1]:
import pandas as pd

df = pd.read_csv('data/nyc_taxi.csv',usecols=['trip_distance','fare_amount','tip_amount','passenger_count'])

df.tail()

| | passenger_count | trip_distance | fare_amount | tip_amount |
|---|---|---|---|---|
| 10679302 | 2 | 1.0 | 5.5 | 1.25 |
| 10679303 | 2 | 0.8 | 6.0 | 2.0 |
| 10679304 | 1 | 3.4 | 13.5 | 0.0 |
| 10679305 | 1 | 1.3 | 10.5 | 2.25 |
| 10679306 | 1 | 0.7 | 5.5 | 0.0 |


### Define a simple plot

In [2]:
from bokeh.plotting import figure, output_notebook, show

output_notebook()

def base_plot():
    p = figure(
        x_range=(0, 20),
        y_range=(0, 40),
        tools='pan,wheel_zoom,box_zoom,reset', 
        plot_width=800, 
        plot_height=500,
    )
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    p.xaxis.axis_label = "Distance, miles"
    p.yaxis.axis_label = "Fare, $"
    p.xaxis.axis_label_text_font_size = '12pt'
    p.yaxis.axis_label_text_font_size = '12pt'
    return p
    
options = dict(line_color=None, fill_color='blue', size=5)

### 1000 points reveals the expected linear relationship

In [3]:
samples = df.sample(n=1000)
p = base_plot()
p.circle(x=samples['trip_distance'], y=samples['fare_amount'], **options)
show(p)





### 10,000 points show more detailed, systematic patterns in fares and distances

Perhaps there are different metering options, along with granularity in how distances and fares are counted; in any case, the distances and fares do not uniformly populate any region of this space:

In [4]:
options = dict(line_color='blue', fill_color='blue', size=1, alpha=0.05)
samples = df.sample(n=10000)
p = base_plot()
p.circle(x=samples['trip_distance'], y=samples['fare_amount'], **options)
show(p)
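
The cell above draws each point at `alpha=0.05`, which is what lets dense regions stand out, but it also means the plot saturates once enough points overlap. As a rough sketch, assuming standard "source-over" alpha compositing (the helper names `stacked_opacity` and `overlaps_to_reach` are made up for this illustration, not part of Bokeh or datashader), the combined opacity of `n` overlapping glyphs is `1 - (1 - alpha)**n`:

```python
import math

def stacked_opacity(alpha, n):
    """Combined opacity of n identical glyphs drawn on top of each other,
    assuming standard source-over compositing."""
    return 1.0 - (1.0 - alpha) ** n

def overlaps_to_reach(alpha, target):
    """Smallest number of overlapping glyphs whose combined opacity is >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - alpha))

alpha = 0.05  # same alpha as in the options dict above

print(stacked_opacity(alpha, 1))       # a single point is barely visible
print(stacked_opacity(alpha, 10))      # ten overlapping points are clearly visible
print(overlaps_to_reach(alpha, 0.95))  # overlaps needed to reach 95% opacity
```

Under that assumption, a few dozen overlapping points are enough to make a region look nearly solid, so any density differences beyond that level become invisible. Choosing `alpha` by hand therefore depends on how many points are plotted, which is one motivation for aggregating with datashader instead.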

### Datashader reveals additional detail, especially when zooming in

You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).

In [5]:
import datashader as ds
from datashader.bokeh_ext import InteractiveImage

In [6]:
p = base_plot()
pipeline = ds.Pipeline(df, ds.Point("trip_distance", "fare_amount"))
InteractiveImage(p, pipeline)

Here we're using the default histogram-equalized color mapping function to reveal density differences across this space.  If we use a linear mapping instead, we can mainly see that there are a lot of values near the origin, while all the rest are colored the same minimum color (light blue by default):

In [7]:
from datashader import transfer_functions as tf
import functools as ft
color_fn = ft.partial(tf.shade,how='linear')

p = base_plot()
pipeline = ds.Pipeline(df, ds.Point("trip_distance", "fare_amount"), color_fn=color_fn)
InteractiveImage(p, pipeline)
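
To see why the linear mapping hides most of the structure, here is a small NumPy sketch comparing the two approaches on a handful of made-up bin counts. It does not use datashader's actual implementation: `linear_scale` and `eq_hist_scale` are hypothetical helpers, and the equalization shown is a simplified rank-based version that only illustrates the idea.

```python
import numpy as np

def linear_scale(counts):
    """Map counts to [0, 1] in proportion to their value."""
    counts = np.asarray(counts, dtype=float)
    lo, hi = counts.min(), counts.max()
    return (counts - lo) / (hi - lo)

def eq_hist_scale(counts):
    """Map counts to [0, 1] by their rank among the distinct values
    (a simplified stand-in for histogram equalization)."""
    counts = np.asarray(counts, dtype=float)
    distinct, inverse = np.unique(counts, return_inverse=True)
    return inverse.reshape(counts.shape) / max(len(distinct) - 1, 1)

# Made-up, heavily skewed bin counts: one very dense bin near the "origin"
counts = np.array([1, 2, 5, 20, 100, 100000])

print(linear_scale(counts))   # everything except the densest bin is close to 0
print(eq_hist_scale(counts))  # values are spread evenly over [0, 1]
```

With the linear version, one very dense bin pushes every other bin to nearly the same color; the rank-based version gives each distinct density level its own step in the colormap.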

Fares are discretized to the nearest 50 cents, making patterns less visible, but there is both an upward trend in tips as fares increase (as expected) and a large number of tips higher than the fare itself, which is surprising:

In [8]:
p = base_plot()
p.xaxis.axis_label = "Fare, $"
p.yaxis.axis_label = "Tip, $"
pipeline = ds.Pipeline(df, ds.Point("fare_amount", "tip_amount"))
InteractiveImage(p, pipeline)
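
If you want to quantify the tips that exceed the fare rather than just see them, a boolean mask in pandas is enough. The sketch below uses a toy frame so that it runs on its own: the first five rows are the `df.tail()` rows shown earlier, and the last row is a hypothetical example added to have one tip larger than its fare.

```python
import pandas as pd

# Toy data: five rows from the df.tail() output above, plus one hypothetical row
toy = pd.DataFrame({
    'fare_amount': [5.5, 6.0, 13.5, 10.5, 5.5, 4.0],
    'tip_amount':  [1.25, 2.0, 0.0, 2.25, 0.0, 10.0],
})

# Rows where the tip is larger than the fare itself
big_tippers = toy[toy['tip_amount'] > toy['fare_amount']]
share = len(big_tippers) / len(toy)

print(big_tippers)
print(share)
```

The same mask applied to the full `df` would give the number and share of such trips in the real dataset.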

Interestingly, tips go down when the number of passengers is greater than 1:

In [9]:
import datashader as ds
from datashader.bokeh_ext import InteractiveImage
from bokeh.models import Range1d

p = base_plot()
p.xaxis.axis_label = "Passengers"
p.yaxis.axis_label = "Tip, $"
p.x_range = Range1d(-0.5, 6.5)
p.y_range = Range1d(0, 60)

pipeline = ds.Pipeline(df, ds.Point("passenger_count", "tip_amount"), width_scale=0.035)
InteractiveImage(p, pipeline)

Here we've reduced the resolution along the x axis so that instead of getting isolated points for this inherently discrete data, you can see more-visible horizontal line segments.
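
The arithmetic behind that choice can be sketched with NumPy. This assumes the aggregate ends up with roughly `plot_width * width_scale` columns (the exact width used by `InteractiveImage` may differ slightly), and it uses one illustrative point per integer passenger count in the plotted range:

```python
import numpy as np

plot_width = 800        # from base_plot()
width_scale = 0.035     # from the Pipeline call above
x_range = (-0.5, 6.5)   # from the Range1d above

# Assumption: the aggregate has about plot_width * width_scale columns
n_columns = int(round(plot_width * width_scale))
screen_px_per_column = plot_width / n_columns
columns_per_passenger = n_columns / (x_range[1] - x_range[0])

# One illustrative point per integer passenger count in the plotted range
passenger_counts = np.arange(0, 7)
hist, edges = np.histogram(passenger_counts, bins=n_columns, range=x_range)
occupied = np.flatnonzero(hist)

print(n_columns, columns_per_passenger, screen_px_per_column)
print(occupied)  # indices of the aggregate columns that receive any points
```

Each passenger count lands in a single aggregate column, and because each column is stretched across many screen pixels, it shows up as a short horizontal segment rather than a one-pixel dot.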

The above plots all use Bokeh directly, but a much wider range of interactive plots can be built easily using the separate [HoloViews](http://holoviews.org) library, which builds Bokeh and Matplotlib plots from high-level specifications.  For instance, Datashader currently only provides 2D aggregates, but you can easily make a zoomable one-dimensional histogram using HoloViews to dynamically collapse across a second dimension:

In [10]:
result=None
try:
    import numpy as np
    import holoviews as hv
    from holoviews.operation.datashader import aggregate
    hv.notebook_extension('bokeh')

    %opts Histogram [width=800] 

    dataset = hv.Dataset(df, kdims=['fare_amount', 'trip_distance'], vdims=[])
    points = hv.Points(dataset)
    result = hv.operation.histogram(aggregate(points, streams=[hv.streams.RangeX()], width=500, height=2), 
                 bin_range=None, num_bins=500, adjoin=False, normed=False, 
                 dimension='fare_amount', weight_dimension='Count')
    
except ImportError: pass
result

Here datashader is aggregating over both `fare_amount` and `trip_distance`, but `trip_distance` was specified to have only a height of 2, because it will be further collapsed to create the histogram being displayed.  You can now use the wheel zoom tool when hovering over the x axis, and the plot will zoom in or out, dynamically resampling at the given location to make a new histogram (as long as there is a live Python server running).
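
The "aggregate in 2D, then collapse" idea can be checked with plain NumPy, independently of datashader and HoloViews. The data below is synthetic (the gamma and uniform parameters are arbitrary choices for illustration, not fitted to the taxi data), and the ranges are assumptions chosen only to make the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two columns; the parameters are arbitrary
fare = rng.gamma(shape=2.0, scale=5.0, size=10_000)
distance = rng.uniform(0, 20, size=10_000)

x_range = (0.0, 40.0)  # fare axis (assumed range for this sketch)
y_range = (0.0, 20.0)  # distance axis (assumed range for this sketch)

# 2D aggregate: 500 bins along fare, only 2 along distance
agg, x_edges, _ = np.histogram2d(fare, distance, bins=(500, 2),
                                 range=(x_range, y_range))

# Collapsing the distance axis leaves a 1D histogram over fare
hist_from_agg = agg.sum(axis=1)

# Same histogram computed directly, for comparison
hist_direct, _ = np.histogram(fare, bins=500, range=x_range)

print(np.array_equal(hist_from_agg, hist_direct))
```

Because the second dimension is summed away, its resolution does not affect the result as long as its range covers all the points, which is why a height of 2 is enough.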

In this particular plot, there is a very wide range of fare amounts, with an implausibly high maximum fare of over \$4000. You can easily zoom in to the bulk of the data to show that nearly all fares are between \$4 and \$20, following something like a gamma distribution, and that they are discretized to the nearest \$0.50 in this dataset.
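
Observations like the \$0.50 discretization can also be checked numerically. The sketch below uses only the five rows from the `df.tail()` output shown earlier, so it says nothing about the full dataset; `share_on_grid` is a hypothetical helper written for this example.

```python
import numpy as np

def share_on_grid(values, step):
    """Fraction of values that are whole multiples of `step` (within float tolerance)."""
    values = np.asarray(values, dtype=float)
    nearest = np.round(values / step) * step
    return float(np.mean(np.isclose(values, nearest)))

# The five rows from the df.tail() output above
fare_amount = np.array([5.5, 6.0, 13.5, 10.5, 5.5])
tip_amount = np.array([1.25, 2.0, 0.0, 2.25, 0.0])

print(share_on_grid(fare_amount, 0.5))  # share of fares on the $0.50 grid
print(share_on_grid(tip_amount, 0.5))   # tips are not restricted to that grid
print(float(np.mean((fare_amount >= 4) & (fare_amount <= 20))))  # share in $4 to $20
```

Running the same helper on `df['fare_amount']` would show how strictly the full dataset follows the \$0.50 grid.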