# Visualize Scraped Yelp Data using Bokeh

## Introduction

This tutorial sets up Bokeh, scrapes up to 1,000 restaurants using Yelp's API, groups them by star rating, and visualizes their physical locations. The goals are to find the neighborhood or area with the highest density of good restaurants and to change some features of the graph to make it more visually appealing and digestible. Options I considered included a map of top restaurants, a map of bottom restaurants, a heat map or contour map, a plot of multi-dimensional data, popular restaurant types, a dendrogram, and a test for statistically significant features. In the end, I decided to show how to plot restaurants on a map and how to create a heat map. Features worth adding along the way include titles, legends, axis labels, captions, colors, and shapes.

## Setting up Bokeh

I had no problems setting up Bokeh through the recommended command on the command line (within my conda environment).

```
source activate <MyEnvironment>
conda install bokeh
```


## Overview of Bokeh

Bokeh is organized into three main modules that users need to know to create their first plots. bokeh.models contains tools for manipulating data, especially into the format used for plotting. bokeh.io contains functions for viewing, exporting, or otherwise interacting with plots. bokeh.plotting contains the objects that are plotted. Bokeh is relatively user friendly and offers some features similar to ggplot, such as automatic color mapping. It is similar to plot.ly in that its plots are meant to be viewed in HTML and to be interactive. Here, in a Jupyter Notebook, there is less flexibility with interactivity, but it is an exciting feature worth knowing about (https://bokeh.pydata.org/en/latest/docs/user_guide/interaction/widgets.html#userguide-interaction-widgets).

## Scraping Data from Yelp API

You'll need your own API key to obtain data from the Yelp API (https://www.yelp.com/developers/faq). I decided to obtain the largest sample possible (1,000) of restaurants in Pittsburgh. You could change these settings to specify a different place, include businesses other than restaurants, or return fewer restaurants. Yelp's business search returns a great number of fields, but the ones I was most interested in were the rating, the name of the restaurant, the latitude, and the longitude.

In [2]:
import io, time, json
import requests


In [3]:
def read_api_key(filepath):
    with open(filepath, 'r') as f:
        return f.read().replace('\n','')

api_key = read_api_key("api_key_yelp.txt")

options = {"location":"Pittsburgh", "limit":50, "categories":"restaurants", 'offset':0 }

def all_restaurants(api_key, query):
    url = "https://api.yelp.com/v3/businesses/search"
    params = dict(query)  # copy so the caller's dictionary is not modified
    page_size = params.get('limit', 50)
    params.setdefault('offset', 0)
    headers = {
        'Authorization': 'Bearer %s' % api_key,
    }
    max_results = 1000  # the most restaurants collected for one query
    everyRestaurant = list()
    while len(everyRestaurant) < max_results:
        # never ask for more than is still needed
        params['limit'] = min(page_size, max_results - len(everyRestaurant))
        response = requests.get(url, params=params, headers=headers)
        responsejson = response.json()
        if "businesses" not in responsejson:
            print(responsejson)  # show the API error, then stop
            break
        businesses = responsejson['businesses']
        if not businesses:
            break  # an empty page means there is nothing left to fetch
        everyRestaurant.extend(businesses)
        params['offset'] += len(businesses)
        if len(everyRestaurant) >= responsejson['total']:
            break  # every matching restaurant has been collected
    return everyRestaurant

I used a previously written function to scrape the data (with a small change to scrape 1,000 restaurants). It returns a list of dictionaries, one dictionary per restaurant. The first restaurant is printed below to show all the fields returned.
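The paging logic is easier to follow without the network call. The following is a minimal sketch, assuming a hypothetical `fake_search` function that stands in for the Yelp endpoint and serves slices of an in-memory list. Only the `limit` and `offset` bookkeeping is meant to mirror the scraping function above.

```python
def fake_search(all_items, limit, offset):
    """Hypothetical stand-in for the API: one page of results plus the total count."""
    return {"businesses": all_items[offset:offset + limit], "total": len(all_items)}

def collect(all_items, page_size=50, max_results=1000):
    """Walk through pages with limit/offset until nothing is left or the cap is reached."""
    collected = []
    offset = 0
    while len(collected) < max_results:
        # request a full page, or only what is still needed to reach the cap
        limit = min(page_size, max_results - len(collected))
        page = fake_search(all_items, limit, offset)
        if not page["businesses"]:
            break  # an empty page means the results are exhausted
        collected.extend(page["businesses"])
        offset += len(page["businesses"])
    return collected

demo = collect(list(range(1234)))
print(len(demo), demo[0], demo[-1])
```

Advancing `offset` by the number of items actually returned, rather than by the requested page size, keeps the loop correct when the last page is short.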

In [51]:
options = {"location":"Pittsburgh", "limit":50, "categories":"restaurants", 'offset':0 }
restaurants = all_restaurants(api_key, options)
restaurants[0]

{'categories': [{'alias': 'argentine', 'title': 'Argentine'}],
 'coordinates': {'latitude': 40.449043, 'longitude': -79.987573},
 'display_phone': '(412) 709-6622',
 'distance': 862.336104073366,
 'id': 'gaucho-parrilla-argentina-pittsburgh',
 'image_url': 'https://s3-media4.fl.yelpcdn.com/bphoto/OrC5JDiJz-XUtkTge9zjHA/o.jpg',
 'is_closed': False,
 'location': {'address1': '1601 Penn Ave',
  'address2': '',
  'address3': '',
  'city': 'Pittsburgh',
  'country': 'US',
  'display_address': ['1601 Penn Ave', 'Pittsburgh, PA 15222'],
  'state': 'PA',
  'zip_code': '15222'},
 'name': 'Gaucho Parrilla Argentina',
 'phone': '+14127096622',
 'price': '$$',
 'rating': 4.5,
 'review_count': 1398,
 'transactions': [],
 'url': 'https://www.yelp.com/biz/gaucho-parrilla-argentina-pittsburgh?adjust_creative=snG_zyN1Oz35Vi6rKpRhbw&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=snG_zyN1Oz35Vi6rKpRhbw'}
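Of all these fields, only four matter for the plots that follow. As a small sketch, the hypothetical helper `summarize` below pulls them out of one record. It assumes that a record may occasionally lack coordinates and skips such a record instead of failing.

```python
def summarize(business):
    """Return (name, rating, latitude, longitude), or None if coordinates are missing."""
    coords = business.get("coordinates") or {}
    lat, lon = coords.get("latitude"), coords.get("longitude")
    if lat is None or lon is None:
        return None
    return (business.get("name"), business.get("rating"), lat, lon)

# a trimmed copy of the record printed above
sample = {
    "name": "Gaucho Parrilla Argentina",
    "rating": 4.5,
    "coordinates": {"latitude": 40.449043, "longitude": -79.987573},
}
print(summarize(sample))
```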

Next, I prepared my data for plotting by breaking the restaurants into categories based on their star rating. After some initial plotting, I realized that Bokeh plots data from a dictionary of lists, so I came back and wrote a function, extractPlottingData, that extracts what I want to plot about each restaurant and returns such a dictionary. I left my original sorting into star lists in place to show that this method still works and makes it easier to count the restaurants in each whole-star bracket.

In [64]:
oneStar = []
twoStar = []
threeStar = []
fourStar = []
fiveStar = []
legend = ["0-1.0 Stars", "1.0-2.0 Stars", "2.0-3.0 Stars", "3.0-4.0 Stars", "4.0-5.0 Stars"]
for i in restaurants:
    if i['rating'] <= 1:
        oneStar.append(i)
        i['legend'] = legend[0]
    elif i['rating'] <= 2:
        twoStar.append(i)
        i['legend'] = legend[1]
    elif i['rating'] <= 3:
        threeStar.append(i)
        i['legend'] = legend[2]
    elif i['rating'] <= 4:
        fourStar.append(i)
        i['legend'] = legend[3]
    elif i['rating'] <= 5:
        fiveStar.append(i)
        i['legend'] = legend[4]
        
def extractPlottingData(businessData):
    lats = []
    lons = []
    desc = []
    rating = []
    legend = []
    for i in businessData:
        lats.append(i["coordinates"]["latitude"])
        lons.append(i["coordinates"]["longitude"])
        desc.append(i["name"])
        rating.append(i["rating"])
        legend.append(i["legend"])
    return dict(lat=lats, lon=lons, desc=desc, rating=rating, legend=legend)

As you can see below, this data set is heavily concentrated at four stars, with no one- or two-star restaurants at all! This could be a serious problem if I tried to research weak restaurants, especially because Yelp's API only allows sorting in one direction and limits the number of restaurants it returns for a given query. To get around this, I might have narrowed my location to a few specific zip codes or limited the radius of the search while still requesting a high number of restaurants.

In [57]:
print(" oneStar" +": " + str(len(oneStar)), "\n", 
      "twoStar" +": " + str(len(twoStar)), "\n", 
      "threeStar" +": " + str(len(threeStar)), "\n", 
      "fourStar" +": " + str(len(fourStar)),"\n", 
      "fiveStar" +": " + str(len(fiveStar)))

 oneStar: 0 
 twoStar: 0 
 threeStar: 151 
 fourStar: 645 
 fiveStar: 204
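The zip-code workaround mentioned above could be sketched as follows: build one query per zip code, run each one, and merge the results while dropping restaurants that appear more than once. The helper names `queries_for` and `merge_unique`, the zip codes, and the sample records are all illustrative assumptions, and I did not run this against the API.

```python
def queries_for(zip_codes, base):
    """Build one query dictionary per zip code from a shared base query."""
    return [dict(base, location=z, offset=0) for z in zip_codes]

def merge_unique(result_lists):
    """Combine lists of business dicts, keeping the first record seen for each 'id'."""
    seen = set()
    merged = []
    for results in result_lists:
        for business in results:
            if business["id"] not in seen:
                seen.add(business["id"])
                merged.append(business)
    return merged

base = {"limit": 50, "categories": "restaurants"}
print(queries_for(["15222", "15213"], base))

# made-up results from two neighboring searches that overlap on restaurant "b"
first = [{"id": "a"}, {"id": "b"}]
second = [{"id": "b"}, {"id": "c"}]
print([b["id"] for b in merge_unique([first, second])])
```

Each query dictionary could then be passed to the scraping function in place of the single Pittsburgh-wide query.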


## Application: Visualizing Best Restaurants

### Basic Map Plotting Locations

With my data prepared to be plotted, I imported important modules from models, io, and plotting. <br>
Bokeh offers integration with several types of map data, including pre-selected map "tiles", the Google Maps API, and GeoJSON data. For this project, I chose the Google Maps API and followed the instructions <a href="https://developers.google.com/maps/documentation/javascript/get-api-key">here</a> to get my key. I saved my key in a text file called "api_key_google.txt" and placed it in the same directory as my notebook.<br>
<br>From bokeh.models, I imported:
* **GMapOptions**, which configures the view of the overall map including the latitude, longitude, map type, and zoom
* **ColumnDataSource**, which configures the dictionary data for use in the plotting function's argument *source*
* **LinearColorMapper** and **LogColorMapper**, which allow the plotting function to assign data points to categories and colors effortlessly<br><br>

From bokeh.io, I imported:
* **show**, which is called to show the plot you create
* **output_notebook**, which allows Bokeh plots to be embedded directly in Jupyter Notebooks <br>
*Without output_notebook Bokeh will plot each new call to show(plot) in a new tab.* <br><br>
From bokeh.plotting, I imported:
* **gmap**, which creates the map plot and allows other Bokeh objects to be plotted on top of it<br><br>
 

In [89]:
from bokeh.models import GMapOptions, ColumnDataSource, LinearColorMapper, LogColorMapper, HoverTool
from bokeh.io import show, output_notebook
from bokeh.plotting import gmap
from bokeh.transform import transform
output_notebook(load_timeout=5000, notebook_type='jupyter')
g_api_key = read_api_key("api_key_google.txt")

To create the initial map plotting all 1,000 restaurants with circle markers, I used the Google Maps website to center the window over what I believed would capture the majority of the restaurants. The URL of that page then read https://www.google.com/maps/@40.4406483,-79.9420262,12.86z. The first number is the latitude, the second is the longitude, and the third is a "zoom" factor ranging from 1 ("World") to 20 ("Building"). When I used these initial values in map_options, I discovered that some restaurants fell outside the view and that the API only takes integer values of zoom (which also differ slightly from the web browser's zoom). I adjusted these values and came up with a reasonable map.
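Reading the three numbers out of such a URL can be sketched with a regular expression. The helper `parse_maps_url` is hypothetical, and rounding the fractional zoom to the nearest integer is my own assumption about how to get a starting value.

```python
import re

def parse_maps_url(url):
    """Extract (lat, lng, zoom) from a Google Maps URL of the form .../@lat,lng,zoomz."""
    match = re.search(r"@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(\d+(?:\.\d+)?)z", url)
    if match is None:
        raise ValueError("no @lat,lng,zoomz segment found in URL")
    lat, lng, zoom = (float(group) for group in match.groups())
    # assumption: round the fractional browser zoom and keep it within 1..20
    zoom = max(1, min(20, int(round(zoom))))
    return lat, lng, zoom

print(parse_maps_url("https://www.google.com/maps/@40.4406483,-79.9420262,12.86z"))
```

The rounded zoom is only a starting point. As described above, the view still needed manual adjustment before every restaurant was visible.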

GMapOptions specifies the view of the map window. gmap required only the API key and the map options.
The source argument takes a ColumnDataSource built from a dictionary of lists, so I passed all of the restaurants through extractPlottingData to give each one a circle marker. A gmap object cannot be copied once created, so a new instance must be created for every new view. The circle method uses the keys of my source dictionary to mark the latitude and longitude of every restaurant with a blue circle.


In [15]:
map_options = GMapOptions(lat=40.4463402, lng=-79.9731238, map_type="roadmap", zoom=11)
gmap1 = gmap(g_api_key, map_options)
source = ColumnDataSource(data=extractPlottingData(restaurants))
gmap1.circle(x="lon", y="lat", size=10, fill_color="blue", fill_alpha=0.8, source=source)
show(gmap1)
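Instead of finding the center by trial and error, the view could be derived from the data itself. This is a minimal sketch, assuming a hypothetical helper `bounding_center` that takes the latitude and longitude lists and returns the midpoint of their bounding box together with its extent in degrees. The sample coordinates are made up.

```python
def bounding_center(lats, lons):
    """Return the midpoint of the bounding box and its (lat, lon) extent in degrees."""
    center_lat = (min(lats) + max(lats)) / 2
    center_lon = (min(lons) + max(lons)) / 2
    span = (max(lats) - min(lats), max(lons) - min(lons))
    return center_lat, center_lon, span

# made-up corner points of a one-degree box
print(bounding_center([40.0, 41.0], [-80.0, -79.0]))
```

With the dictionary returned by extractPlottingData, the "lat" and "lon" lists can be passed in directly, provided any missing coordinates are filtered out first. The zoom level still has to be chosen by eye, since it depends on the size of the plot window.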

### Adding Ratings as a color and other appealing elements

With a working map, I could now make the map more interesting by showing the variation in ratings geographically. To do this, I used my groups of star ratings to assign each rating, and thus restaurant, to a color. I also added a title and a legend to the map for clarity.

In [16]:
stars = [oneStar, twoStar, threeStar, fourStar, fiveStar]
colors = ["blue", "red", "green", "yellow", "purple"]
legend = ["0-1.0 Stars", "1.0-2.0 Stars", "2.0-3.0 Stars", "3.0-4.0 Stars", "4.0-5.0 Stars"]
gmapStar = gmap(g_api_key, map_options, title="1,000 Pittsburgh Restaurants by Rating")
# one circle renderer per star group, each with its own color and legend entry
for group, color, label in zip(stars, colors, legend):
    gmapStar.circle(x="lon", y="lat", size=9, fill_color=color, fill_alpha=0.7,
                    legend=label,  # Bokeh 1.4 and later spell this legend_label=
                    source=ColumnDataSource(data=extractPlottingData(group)))

show(gmapStar)

I created a new map to try out the color mapping functions. LogColorMapper takes a palette, given either as a list of color names or hex codes or as a pre-defined palette, and maps data values to those colors. The mapper is passed to the fill_color argument in a dictionary whose field key names the data column and whose transform key holds the mapper.

In [43]:
gmapStar1 = gmap(g_api_key, map_options, title="1,000 Pittsburgh Restaurants by Rating")
palette = ["blue", "red", "green", "yellow", "purple", "grey", "orange"]
color_mapper =  LogColorMapper(palette)
gmapStar1.circle(x="lon", y="lat", size=9, fill_color={'field': 'rating', 'transform': color_mapper},
                    fill_alpha=0.7, legend = {'field':'rating', 'transform':color_mapper},
                    source=ColumnDataSource(data=extractPlottingData(restaurants)))
show(gmapStar1)


In [86]:
gmapStar2 = gmap(g_api_key, map_options, title="1,000 Pittsburgh Restaurants by Rating")
palette = ["blue", "green", "orange"]
color_mapper =  LinearColorMapper(palette)
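# keep a reference to the circle renderer so the hover tool can be attached to it
cr = gmapStar2.circle(x="lon", y="lat", size=9, fill_color={'field': 'rating', 'transform': color_mapper},
                    fill_alpha=0.7, legend = {'field':'legend', 'transform':color_mapper},
                    source=ColumnDataSource(data=extractPlottingData(restaurants)))
# assumption: the original cell left tooltips empty; desc and rating are columns of the source
gmapStar2.add_tools(HoverTool(tooltips=[("Name", "@desc"), ("Rating", "@rating")], renderers=[cr], mode='hline'))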
show(gmapStar2)
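To build some intuition for what a linear color mapper does with a three-color palette, here is a simplified model in plain Python. It splits the range between `low` and `high` into equal-width bins, one per palette entry. This is a sketch of the idea only, not Bokeh's actual implementation, and the function name `linear_color` and the sample range of 3.0 to 5.0 are my own assumptions.

```python
def linear_color(value, palette, low, high):
    """Pick a palette entry by splitting [low, high] into equal-width bins."""
    if high <= low:
        return palette[0]
    fraction = (value - low) / (high - low)
    index = int(fraction * len(palette))
    # clamp so that values at or beyond the ends still get a color
    index = max(0, min(index, len(palette) - 1))
    return palette[index]

palette = ["blue", "green", "orange"]
for rating in [3.0, 3.5, 4.0, 4.5, 5.0]:
    print(rating, linear_color(rating, palette, low=3.0, high=5.0))
```

A logarithmic mapper follows the same idea but spaces the bins on a log scale, which gives more of the palette to the low end of the range.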

### Changing it to a heat map

This heat map shows where the highest concentration of restaurants occurs. To show this, I drew a rectangle at each restaurant's location and filled it only faintly by setting a low alpha value, so that overlapping rectangles darken the areas with many restaurants. This could be extended by coloring the rectangles by rating and by adding rectangles with a base color for the areas that don't have any restaurants.

In [161]:
sourceData=extractPlottingData(restaurants)
gmapHeat = gmap(g_api_key, map_options, title="Concentration of Restaurants in Pittsburgh")
gmapHeat.rect(x="lon", y="lat", width=1000, height=1000, fill_color="blue",
                    fill_alpha=0.02, 
                    source=ColumnDataSource(sourceData))
show(gmapHeat)

In [165]:
gmapHeat1 = gmap(g_api_key, map_options, title="Concentration of Restaurants in Pittsburgh")
from bokeh.palettes import Blues8  # the palettes module was never imported above
palette = Blues8
sourceData=extractPlottingData(restaurants)
color_mapper =  LinearColorMapper(palette, low=min(sourceData["rating"]), high=max(sourceData["rating"]))
gmapHeat1.rect(x="lon", y="lat", width=500, height=500, fill_color={'field': 'rating', 'transform': color_mapper},
                    fill_alpha=1,
                    source=ColumnDataSource(sourceData))

show(gmapHeat1)
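The overlapping rectangles show density visually, but the densest area can also be found numerically. The sketch below takes a different route from the cells above: it bins points into a square grid and counts the points in each cell. The helper `densest_cell`, the cell size of 0.01 degrees, and the sample coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter

def densest_cell(lats, lons, cell_size=0.01):
    """Count points per square grid cell (cell_size in degrees) and return the fullest cell."""
    counts = Counter(
        (math.floor(lat / cell_size), math.floor(lon / cell_size))
        for lat, lon in zip(lats, lons)
    )
    if not counts:
        return None
    (row, col), count = counts.most_common(1)[0]
    # convert the cell index back to the coordinates of its south-west corner
    return (row * cell_size, col * cell_size), count

# made-up points near downtown Pittsburgh: two close together, one farther east
lats = [40.4490, 40.4495, 40.4601]
lons = [-79.9876, -79.9871, -79.9302]
print(densest_cell(lats, lons))
```

Passing the "lat" and "lon" lists from extractPlottingData would give the grid cell with the most restaurants. Restricting the input to the four- and five-star lists would instead point to the area with the most highly rated ones.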

## Additional Resources

The Bokeh documentation is relatively thorough and complete. I found the following additional resources useful:
* https://bokeh.pydata.org/en/latest/docs/reference.html
* https://bokeh.pydata.org/en/latest/docs/user_guide/styling.html
