# What We Will Be Doing

This notebook accompanies the Medium article on Bokeh and walks through the features mentioned there. Specifically, we will use data on NYC apartments to look at the relationship between price and square footage, while showing off some cool features of Bokeh. To get the data, just go to [this GitHub Repo](https://github.com/scochran3/BokehExplorationMedium).

# Libraries

In [1]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Range1d, HoverTool
from bokeh.embed import components
from bokeh.io import curdoc
from bokeh.themes import Theme
import pandas as pd
import numpy as np

This little function makes our plots render cleanly inline in Jupyter - adios Matplotlib!

In [2]:
output_notebook()

# Our Data
This data comes from a little pipeline I built, which is outlined in this [Medium article on AWS Lambda Pipelines](https://towardsdatascience.com/make-data-acquisition-easy-with-aws-lambda-python-in-12-steps-33fe201d1bb4). Basically, our data covers New York City apartments scraped from Craigslist over June and July 2019. The Craigslist data has a few enrichments that bring in data from Mapquest and Walk Scores, but it should be pretty intuitive to understand.

# Import our data
Read in our data and convert the `datetime` column to a plain date.

In [3]:
df = pd.read_csv('data/nyc_apartments.csv')
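# Parse the timestamp strings and keep only the calendar date
# (infer_datetime_format is deprecated in recent pandas, so it is omitted)
df['date'] = pd.to_datetime(df['datetime']).dt.date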
df.head()

Unnamed: 0,id,address,area,bedrooms,bikeScore,datetime,distanceToNearestIntersection,has_image,has_map,name,...,month,dow,day,hour,advertises_no_fee,is_repost,sideOfStreetEncoded,postalCodeChopped,neighborhood,date
0,6911917730,320 Chauncey St,,3.0,64.0,2019-06-21 14:34:00,0.0,1,1,you’re in good hands...t e x t us to view bk’s...,...,6,4,21,14,1,0,1.0,11233.0,Southeast Bronx,2019-06-21
1,6917210186,530 W 143rd St,800.0,1.0,88.0,2019-06-21 14:33:00,203.483553,1,1,spacious 1br penthouse with deck!! near col un...,...,6,4,21,14,0,1,0.0,10031.0,Upper West Side,2019-06-21
2,6914527887,410 Pulaski St,,3.0,79.0,2019-06-21 14:33:00,0.013114,1,1,this is the one you’ve been looking for… call ...,...,6,4,21,14,0,0,1.0,11221.0,Sunset Park,2019-06-21
3,6914529944,410 Pulaski St,,3.0,79.0,2019-06-21 14:33:00,0.013114,1,1,simplify your search with us**pro team w/ big ...,...,6,4,21,14,1,0,1.0,11221.0,Sunset Park,2019-06-21
4,6917173545,4754 Center Blvd,653.0,1.0,81.0,2019-06-21 14:33:00,61.301497,1,1,sunny 1br in long island city. brand new renov...,...,6,4,21,14,1,1,0.0,11109.0,Queens,2019-06-21


# Column Data Source
Bokeh has something called a ["ColumnDataSource"](https://bokeh.pydata.org/en/latest/docs/reference/models/sources.html), which will quickly become your best friend. You can read about it in the docs, but the high-level way to think about it is that it converts your Pandas dataframe into something Bokeh can easily use. You can see how we use this weapon of mass plotting in the charts below, but the general process is:

- Get your data in the proper format with pandas
- Make this properly formatted dataframe a ColumnDataSource
- Use this ColumnDataSource when you call your plotting function

We can very easily create a ColumnDataSource with any dataframe.

In [4]:
source = ColumnDataSource(df)
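
As a minimal sketch of what that conversion produces, the snippet below builds a ColumnDataSource from a tiny made-up dataframe (`toy_df` and its values are illustrative assumptions, not the Craigslist data) and inspects it. Each dataframe column becomes a named column in the source, and the dataframe index is carried along as an extra column.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Toy data for illustration only (not the real listings)
toy_df = pd.DataFrame({'area': [650, 800, 1200], 'price': [2100, 2900, 4500]})

toy_source = ColumnDataSource(toy_df)

# Each dataframe column is available by name, plus the dataframe index
print(toy_source.column_names)
print(list(toy_source.data['price']))
```

Those column names are what we later pass as strings (for example `x='area'`) when plotting.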

# Let's Create Some Visualizations

## Price vs. Square Footage

In [5]:
# Keep only listings with a known area (copy so later changes don't touch df)
df_has_area = df[df['area'].notnull()].copy()

# Keep listings below the 95th percentile of area to trim extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)

show(p)
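
To see what the null filter and the 95th-percentile cutoff do, here is a small sketch on made-up numbers (the `toy_df` values are assumptions for illustration). It drops the row with no area, computes the cutoff, and keeps only rows strictly below it.

```python
import numpy as np
import pandas as pd

# Made-up areas with one extreme outlier and one missing value
toy_df = pd.DataFrame(
    {'area': [400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 10000, np.nan]}
)

# Drop rows without an area, then compute the 95th-percentile cutoff
has_area = toy_df[toy_df['area'].notnull()]
cutoff = np.percentile(has_area['area'].values, 95)

# Keep only rows strictly below the cutoff
trimmed = has_area[has_area['area'] < cutoff]
print(cutoff, len(has_area), len(trimmed))
```

Dropping the missing areas first matters, because `np.percentile` returns NaN when the array contains a NaN.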

## Let's make it more beautiful

The output isn't bad per se, but let's make it visually more appealing:

- Fill the whole width
- Bigger title
- Get rid of those ugly toolbar icons
- Remove gridlines
- Change the font size of our axes

All of this takes only a few small code changes.

In [6]:
# Keep only listings with a known area (copy so later changes don't touch df)
df_has_area = df[df['area'].notnull()].copy()

# Distinct bedroom counts (no output mid-cell; used for the color mapping later)
df_has_area['bedrooms'].unique()

# Keep listings below the 95th percentile of area to trim extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure, now stretched to the full width and with the toolbar hidden
p = figure(title="Price vs. Square Footage", sizing_mode="stretch_width", tools=[], toolbar_location=None)

# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)

# Remove grid lines and set font sizes
p.xgrid.grid_line_color, p.ygrid.grid_line_color = None, None
p.xaxis.major_label_text_font_size, p.yaxis.major_label_text_font_size = '11pt', '11pt'
p.title.text_font_size = '14pt'

show(p)

## Themes - Isn't it annoying to make these style changes on every chart?
Yes! What if I *always* want my title to be size 14? Or I *always* want there to be no grid? Having to type these in for every chart will get old quickly. Introducing themes!

In [7]:
curdoc().theme = Theme(json={'attrs': {

# apply defaults to Figure properties
'Figure': {
    'toolbar_location': None,
    'outline_line_color': None,
    'min_border_right': 10,
    'sizing_mode': 'stretch_width'
},

'Grid': {
    'grid_line_color': None,
},
'Title': {
    'text_font_size': '14pt'
},

# apply defaults to Axis properties
'Axis': {
    'minor_tick_out': None,
    'minor_tick_in': None,
    'major_label_text_font_size': '11pt',
    'axis_label_text_font_size': '13pt',
    'axis_label_text_font': 'Work Sans'
},
# apply defaults to Legend properties
'Legend': {
    'background_fill_alpha': 0.8,
}}})
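
If you would rather not write a theme by hand, Bokeh also ships a handful of bundled themes. The sketch below only lists their names; the assignment is left commented out because running it would replace the custom theme defined above, and `'dark_minimal'` is just one example choice.

```python
from bokeh.io import curdoc
from bokeh.themes import built_in_themes

# Names of the themes that ship with Bokeh
print(sorted(built_in_themes))

# Uncomment to swap the custom theme for a bundled one
# curdoc().theme = 'dark_minimal'
```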


Now let's rerun the code from our original plot. The theme applies all of the styling for us automatically.

## Price vs. Square Footage
Again, we no longer specify any of the styling attributes manually.

In [8]:
# Keep only listings with a known area (copy so later changes don't touch df)
df_has_area = df[df['area'].notnull()].copy()

# Distinct bedroom counts (no output mid-cell; used for the color mapping later)
df_has_area['bedrooms'].unique()

# Keep listings below the 95th percentile of area to trim extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)

show(p)

### Adding a color column
One nice feature of Bokeh is that you can leverage Pandas to create columns and then use them in your plot. In this example we will map each discrete value for bedrooms to a color and then use that to color our plot.

In [13]:
# Keep only listings with a known area (copy so adding a column doesn't touch df)
df_has_area = df[df['area'].notnull()].copy()

# Distinct bedroom counts, to check which values the mapping needs to cover
df_has_area['bedrooms'].unique()

# Create color column based on the bedroom number
bedroomMapping = {0: 'green', 1: 'red', 2: 'blue', 3: 'yellow', 4: 'purple', 5: 'black', 6: 'teal', None: 'gray'}
# Bedroom counts without an entry in the mapping fall back to gray
df_has_area['color'] = df_has_area['bedrooms'].map(bedroomMapping).fillna('gray')

# Keep listings below the 95th percentile of area to trim extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', fill_color='color', line_color='#000000', source=source, size=10)

show(p)
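
The mapping step is plain Pandas, so it can be tried without any plotting. Below is a small sketch on made-up bedroom counts (`toy_mapping` is a hypothetical, shortened mapping) showing what happens when a value is missing or has no entry in the dictionary.

```python
import pandas as pd

# Made-up bedroom counts, including a missing value and one outside the mapping
bedrooms = pd.Series([0, 1, 2, None, 7])

# Hypothetical, shortened mapping for illustration
toy_mapping = {0: 'green', 1: 'red', 2: 'blue'}

# map() leaves unmatched values as NaN; fillna() gives them a fallback color
colors = bedrooms.map(toy_mapping).fillna('gray')
print(colors.tolist())
```

Giving every row a valid color string means every point gets a fill when the column is passed to `fill_color`.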


### Adding Interactivity
Let's add some of those cool tooltips like Tableau has!

In [10]:
# Keep only listings with a known area (copy so adding a column doesn't touch df)
df_has_area = df[df['area'].notnull()].copy()

# Distinct bedroom counts, to check which values the mapping needs to cover
df_has_area['bedrooms'].unique()

# Create color column based on the bedroom number
bedroomMapping = {0: 'green', 1: 'red', 2: 'blue', 3: 'yellow', 4: 'purple', 5: 'black', 6: 'teal', None: 'gray'}
# Bedroom counts without an entry in the mapping fall back to gray
df_has_area['color'] = df_has_area['bedrooms'].map(bedroomMapping).fillna('gray')

# Keep listings below the 95th percentile of area to trim extreme outliers
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]

# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)

# Create our figure
p = figure(title="Price vs. Square Footage")

# Plot our data
p.scatter(x='area', y='price', fill_color='color', line_color='#000000', source=source, size=10)

# Create our tooltip
tooltips = """
<div style="width:500px;">
    <h5 style="color:#0015bc; display:inline; font-size:1.2em">Craigslist URL: </h5>
    <h5 style="color:#000000; font-size: 1.2em; display:inline;">@url</h5>
</div>
<div class="tooltip-section">
    <h5 style="color:#0015bc; display:inline; font-size:1.2em">Price ($): </h5>
    <h5 style="color:#000000; font-size: 1.2em; display:inline;">$@price{0,0}</h5>
</div>
<div class="tooltip-section">
    <h5 style="color:#0015bc; display:inline; font-size:1.2em">Square Footage: </h5>
    <h5 style="color:#000000; font-size: 1.2em; display:inline;">@area{0,0}</h5>
</div>
"""

p.add_tools(HoverTool(tooltips=tooltips))

show(p)
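
If you don't need custom HTML, `HoverTool` also accepts a simpler list of `(label, value)` tuples. The sketch below uses a tiny made-up data source (`toy_source` and `p_toy` are illustrative names, not part of the notebook above) and the same `@column{format}` syntax as the HTML template.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up listings for illustration only
toy_source = ColumnDataSource(
    data={'area': [650, 800, 1200], 'price': [2100, 2900, 4500]}
)

p_toy = figure(title="Price vs. Square Footage")
p_toy.scatter(x='area', y='price', source=toy_source, size=10)

# Each tuple is (label, value); @name reads a column from the source
# and {0,0} formats the number with thousands separators
hover = HoverTool(tooltips=[
    ('Price ($)', '@price{0,0}'),
    ('Square Footage', '@area{0,0}'),
])
p_toy.add_tools(hover)

# In a notebook, call show(p_toy) after output_notebook() to render it
```

The tuple form is quicker to write, while the HTML string used above gives full control over layout and styling.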

## That's It
That is just the beginning of Bokeh. If you use Pandas a lot, I highly encourage you to keep learning Bokeh. It has served me well for creating visualizations, especially when sharing them with colleagues.