# Exercise goals
In this exercise, we will create a heat map. A heat map combines a scatter plot with a histogram: it divides the plot into pixels and shades each pixel according to how many points fall inside it.

Using the NYC Taxi train data again, we will create a heat map of dropoff latitude and longitude.
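
To make the "scatter plot plus histogram" idea concrete, here is a minimal sketch that bins a handful of points into a 2x2 grid with `np.histogram2d`. The coordinates are made-up toy values, not taxi data, and each grid cell stands in for one pixel of the heat map.

```python
import numpy as np

# Hypothetical toy coordinates standing in for dropoff longitude/latitude
x = np.array([0.1, 0.2, 0.8, 0.9, 0.85, 0.15, 0.3])
y = np.array([0.1, 0.3, 0.9, 0.8, 0.95, 0.2, 0.7])

# Bin the points into a 2x2 grid; each cell plays the role of one pixel
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=2, range=[[0, 1], [0, 1]]
)

# counts[i, j] is the number of points in x-bin i and y-bin j
print(counts)
```

A heat map then maps each of these counts to a color, so crowded cells stand out from sparse ones.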

# Code documentation
Run the following cell to see the documentation for datashader's `tf.shade` function.

In [None]:
import datashader as ds
from datashader import transfer_functions as tf
#?ds.Canvas
#?ds.Canvas.points
?tf.shade

# Create the heat map
Run the following cell to create the heat map.

In [None]:
# datashader allows us to handle data with many rows, like NYC Taxi
# it renders the image itself and can be combined with bokeh for interactive plots
import datashader as ds
from datashader.colors import Hot
from datashader import transfer_functions as tf
import numpy as np
import pandas as pd

# load data 
df = pd.read_csv('train.csv')

plot_height, plot_width = 500, 500

# we use the range we found earlier
x_range, y_range = ((-74.2,-73.7), (40.6, 40.9)) 

# this is analogous to plt.figure; it represents the frame for a picture
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range) 

# this plots dropoff long/lat and counts the # of points in each pixel
# notice we don't graph a sample, but the whole dataset; that's the power of datashader.
agg = cvs.points(df, 'dropoff_longitude', 'dropoff_latitude',  ds.count('passenger_count')) 

# this takes the counts from each pixel given by agg and shades them accordingly
# set_background controls the background color
# how controls the transition to different colors based on intensity of each pixel
tf.set_background(tf.shade(agg, cmap=Hot, how='eq_hist'),"black")
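
The effect of `how='eq_hist'` is easier to see on a tiny example. The sketch below is a simplified, rank-based illustration of histogram equalization written with plain NumPy; it is not datashader's actual implementation, and the counts are made up. It compares a linear scaling with a rank-based scaling when one pixel is far denser than the rest.

```python
import numpy as np

# Hypothetical per-pixel counts with one very dense pixel
counts = np.array([1, 2, 3, 1000], dtype=float)

# Linear scaling: the dense pixel dominates, the others are nearly 0
linear = (counts - counts.min()) / (counts.max() - counts.min())

# Rank-based scaling: map each count to the fraction of counts <= it,
# which spreads the values evenly across the color range
sorted_counts = np.sort(counts)
equalized = np.searchsorted(sorted_counts, counts, side="right") / counts.size

print(linear)
print(equalized)
```

With linear scaling the three sparse pixels would be almost indistinguishable, while the rank-based version gives each of them a visibly different shade.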

# Heat map exercise
Your task is to use the data provided to create a heatmap using the cell above as a guide.

# Code documentation
Run the cell below to see the docs for the `np.percentile` function.

To see the docs for the `where` method instead, add a comment sign (`#`) before `?np.percentile`, remove the one before `?pd.DataFrame.where`, and run the cell again.

In [None]:
import numpy as np 
import pandas as pd
?np.percentile
#?pd.DataFrame.where

# Create the heat map
In the cell below, fill in the call to `agg.where` so that only pixels whose
dropoff count is above the 90th percentile are shown. You will need the following functions:

- `np.percentile`
- `pd.DataFrame.where`
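
As a warm-up, here is a minimal sketch of how the two functions fit together on a made-up toy `DataFrame` (not the taxi data). It uses the 50th percentile rather than the 90th so that the exercise is still yours to solve.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ten counts from 1 to 10
toy = pd.DataFrame({"count": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Value below which 50% of the observations fall
threshold = np.percentile(toy["count"], 50)

# where keeps values for which the condition is True and
# replaces everything else with NaN
masked = toy.where(toy > threshold)

print(threshold)
print(masked)
```

Values that fail the condition become `NaN`, so they are simply left unshaded when the result is plotted.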

Note that `agg` is an xarray `DataArray`; its `where` method works like the pandas one. Try it first, then check your code against the answer that follows the cell.

In [None]:
# datashader allows us to handle data with many rows, like NYC Taxi
# it renders the image itself and can be combined with bokeh for interactive plots
import datashader as ds
from datashader.colors import Hot
from datashader import transfer_functions as tf
import numpy as np
import pandas as pd

# load NYC Taxi train data 
df = pd.read_csv('train.csv')

plot_height, plot_width = 500, 500

x_range, y_range = ((-74.2,-73.7), (40.6, 40.9)) 

# sets up frame for plot, analogous to plt.figure
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range) 

# plots dropoff long/lat and counts the # of points in each pixel
agg = cvs.points(df, 'dropoff_longitude', 'dropoff_latitude',  ds.count('passenger_count')) 

#-----------------Your code here------------------------------#

# fill in the call to agg.where to show dropoffs above the 90th percentile
tf.set_background(tf.shade(agg.where(), 
                  cmap=Hot, how='eq_hist'),"black")

#-------------------------------------------------------------#

# Answer code
We used the following code to complete the cell above. Note the argument passed to `agg.where()`.

```python
#-----------------Your code here------------------------------#

# fill in the call to agg.where to show dropoffs above the 90th percentile
tf.set_background(tf.shade(agg.where(agg>np.percentile(agg,90)), 
                  cmap=Hot, how='eq_hist'),"black")

#-------------------------------------------------------------#
```

# Thinking Ahead
Eventually we will work on predicting trip duration from pickup coordinates.
This could be tackled in many different ways, but the simplest is to fit a
linear model to the pickup coordinates and trip durations. That is what we
will learn next week. Read further on linear models here:

- https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset
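
As a preview, here is a minimal sketch of fitting a linear model with `np.linalg.lstsq`. Everything in it is an assumption made for illustration: the coordinates are randomly generated, and the durations are computed from made-up coefficients so that the fit has a known answer. It does not use the taxi data and is not necessarily the approach we will take next week.

```python
import numpy as np

# Synthetic pickup coordinates inside the plotting range used above
rng = np.random.default_rng(0)
lon = rng.uniform(-74.2, -73.7, size=100)
lat = rng.uniform(40.6, 40.9, size=100)

# Center the coordinates so the intercept is easy to interpret
lon_c = lon + 74.0
lat_c = lat - 40.75

# Hypothetical durations (seconds) built from made-up coefficients
duration = 600 + 200 * lon_c - 300 * lat_c

# Design matrix: intercept column plus the two centered coordinates
X = np.column_stack([np.ones_like(lon_c), lon_c, lat_c])

# Least-squares fit: finds coef minimizing ||X @ coef - duration||
coef, *_ = np.linalg.lstsq(X, duration, rcond=None)

print(coef)  # intercept, longitude slope, latitude slope
```

Because the synthetic durations contain no noise, the fit recovers the made-up coefficients; real trip durations would be far noisier.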

# Privacy
When the NYC Taxi & Limousine Commission or Stack Overflow release their data,
they assume they're not harming the customers who provided that data. This
isn't always a safe assumption, because the data can sometimes be
re-identified. You can read more at the following link:

- https://www.georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/