# Analysis of the Location of Various Restaurants
In this notebook we look at the distribution of restaurants across the globe. The goal is to shortlist cities that are good candidates for building a classifier on.

## Dependencies
In order to run this notebook we need to install:
- plotly
- folium

If not already installed, run the cells below to install them.

In [2]:
!pip install --upgrade pip
!pip install plotly

Collecting pip
  Downloading https://files.pythonhosted.org/packages/c2/d7/90f34cb0d83a6c5631cf71dfe64cc1054598c843a92b400e55675cc2ac37/pip-18.1-py2.py3-none-any.whl (1.3MB)
pyspark 2.3.1 requires py4j==0.10.7, which is not installed.
distributed 1.21.8 requires msgpack, which is not installed.
Installing collected packages: pip
  Found existing installation: pip 10.0.1
    Uninstalling pip-10.0.1:
      Successfully uninstalled pip-10.0.1
Successfully installed pip-18.1
Collecting plotly
  Downloading https://files.pythonhosted.org/packages/d6/3b/abec247e24e2b8f29793811788fe0607062f40eefe3104823ad57f06ecf1/plotly-3.3.0-py2.py3-none-any.whl (37.3MB)

In [3]:
!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/88/89/8186c3441eb2a224d2896d9a8db6ded20ddd225f109e6144494a9893a0c1/folium-0.6.0-py3-none-any.whl (79kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/b5/18/13c018655f722896f25791f1db687db5671bd79285e05b3dd8c309b36414/branca-0.3.0-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.3.0 folium-0.6.0


## Loading the dataset
Since we only care about restaurants, we filter `yelp_academic_dataset_business.json` down to businesses that are still open and carry at least one of the following categories (matched case-insensitively):
- Restaurants
- Food
- Bars

In [16]:
import folium
import itertools as it
import json
import pandas as pd
import plotly.offline as py
import pyspark as spark
import pyspark.sql.functions as F

In [5]:
py.init_notebook_mode(connected=True)

In [59]:
restaurant_categories = {'restaurants', 'food', 'bars'}
restaurants = sc.textFile('../data/raw/yelp_academic_dataset_business.json') \
    .map(lambda row: json.loads(row)) \
    .filter(lambda business: business['categories'] is not None and business.get('is_open', 0)) \
    .filter(lambda business: restaurant_categories & {x.strip().lower() for x in business['categories'].split(',')}) \
    .cache()

The number of restaurants we have is:

In [60]:
restaurants.count()

55743
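
To make the filter above easier to check without a Spark context, here is a minimal pure-Python sketch of the same predicate. The helper name `is_open_restaurant` and the sample records are hypothetical and only illustrate the logic; they are not taken from the dataset.

```python
# Target categories, lower-cased as in the Spark filter above.
RESTAURANT_CATEGORIES = {'restaurants', 'food', 'bars'}


def is_open_restaurant(business):
    """Return True if the business is open and has a target category.

    Hypothetical helper mirroring the two Spark filters above.
    """
    categories = business.get('categories')
    if categories is None or not business.get('is_open', 0):
        return False
    # Normalise each tag the same way the Spark filter does.
    tags = {tag.strip().lower() for tag in categories.split(',')}
    return bool(RESTAURANT_CATEGORIES & tags)


# Made-up sample records for illustration only.
samples = [
    {'name': 'A', 'categories': 'Restaurants, Pizza', 'is_open': 1},
    {'name': 'B', 'categories': 'Bars, Nightlife', 'is_open': 0},
    {'name': 'C', 'categories': None, 'is_open': 1},
    {'name': 'D', 'categories': 'Hair Salons', 'is_open': 1},
]
kept = [b['name'] for b in samples if is_open_restaurant(b)]
print(kept)
```

Only record `A` passes: `B` is closed, `C` has no categories, and `D` has no matching category.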

## Restaurant distribution
Let's have a look at the global distribution of restaurants in our dataset. To start off at the coarsest granularity, let's take a look at the number of restaurants per city.

In [61]:
def group_restaurants_by_key(restaurants, key_selector):
    return restaurants.keyBy(key_selector) \
        .aggregateByKey(
            (0, None),
            lambda acc, restaurant: (acc[0] + 1, acc[1] or restaurant),
            lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] or acc2[1])
        ).map(lambda tup: spark.sql.Row(
            name=tup[0], num_restaurants=tup[1][0],
            lat=tup[1][1]['latitude'], long=tup[1][1]['longitude']
        )).toDF()
    
def get_map(restaurant_clusters, focus=False):
    restaurant_clusters.cache()
    num_restaurants = restaurant_clusters.rdd.map(lambda x: x.num_restaurants).sum()
    markers = restaurant_clusters.rdd \
        .map(lambda cluster: spark.sql.Row(
            location=(cluster.lat, cluster.long), popup=cluster.name,
            # Zero-pad and clamp so the result is always a valid 6-digit hex colour.
            color='#0000%02x' % min(255, int(cluster.num_restaurants / num_restaurants * 1000)),
            cluster=cluster
        )).collect()
    lat_centroid = restaurant_clusters.rdd.map(lambda x: x.lat).sum() / restaurant_clusters.count()
    long_centroid = restaurant_clusters.rdd.map(lambda x: x.long).sum() / restaurant_clusters.count()
    
    m = folium.Map(location=((lat_centroid, long_centroid) if focus else None))
    for marker in markers:
        folium.CircleMarker(
            marker.location, color=marker.color, fill=True, fill_color='blue', radius=6,
            popup=folium.Popup(
                "<p>cluster: %s</p><p>Lat: %s, Long: %s</p><p>Count: %d</p>" % (marker.popup, *marker.location, marker.cluster.num_restaurants),
                parse_html=True
            )
        ).add_to(m)
    return m
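
The `aggregateByKey` call in `group_restaurants_by_key` keeps, for each key, a running count plus the first restaurant seen, whose coordinates then stand in for the whole group. A rough pure-Python sketch of that fold follows. The name `group_by_key_local` and the sample records are hypothetical, and the sketch ignores Spark's partitioning and combine step.

```python
def group_by_key_local(restaurants, key_selector):
    """Count restaurants per key and keep the first one seen as representative."""
    groups = {}
    for restaurant in restaurants:
        key = key_selector(restaurant)
        count, representative = groups.get(key, (0, None))
        # Same update as the first lambda above: bump the count, keep the first record.
        groups[key] = (count + 1, representative or restaurant)
    return [
        {
            'name': key,
            'num_restaurants': count,
            'lat': rep['latitude'],
            'long': rep['longitude'],
        }
        for key, (count, rep) in groups.items()
    ]


# Made-up records for illustration only.
sample = [
    {'city': 'Toronto', 'latitude': 43.65, 'longitude': -79.38},
    {'city': 'Toronto', 'latitude': 43.70, 'longitude': -79.40},
    {'city': 'Calgary', 'latitude': 51.05, 'longitude': -114.07},
]
rows = group_by_key_local(sample, lambda r: r['city'])
for row in sorted(rows, key=lambda r: -r['num_restaurants']):
    print(row)
```

Note that the stored coordinates belong to a single member of each group rather than to a centroid, so a marker on the map may sit away from the middle of its cluster.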

In [62]:
city_distr = group_restaurants_by_key(restaurants, lambda restaurant: restaurant['city']).cache()
city_distr.sort('num_restaurants', ascending=False).show()

+-------------+--------------+-----------+---------------+
|          lat|          long|       name|num_restaurants|
+-------------+--------------+-----------+---------------+
|   43.6813277|   -79.4278838|    Toronto|           6985|
|   36.2017936|  -115.2819809|  Las Vegas|           5844|
|   33.6713751|  -112.0300171|    Phoenix|           3654|
|   45.5180358|   -73.5821744|   Montréal|           3438|
|51.0918130155|-114.031674872|    Calgary|           2915|
|    35.190366|    -80.922471|  Charlotte|           2605|
|    40.450866|    -79.933919| Pittsburgh|           2292|
|   41.4999894|   -81.6663746|  Cleveland|           1420|
|   43.7129464|   -79.6327631|Mississauga|           1388|
|   33.5303579|   -111.925905| Scottsdale|           1266|
|   33.4150296|  -111.7999032|       Mesa|           1158|
|   43.0352412|   -89.4535994|    Madison|           1043|
|   33.4423485|  -111.9554995|      Tempe|            893|
|   35.9429657|   -115.115893|  Henderson|            79

In [63]:
get_map(city_distr)

A few takeaways from the above map and table:
- Toronto, Las Vegas, and Phoenix are the most populous cities (restaurant-wise).
- The city tag is not perfect. Many restaurants that should belong to the same city are listed under different cities (due to differences in counties, etc.).
- Some cities' lat/long data is incorrect. For example, in the above map there is a city on the continent of Antarctica (in reality, that city lies in the Caribbean).
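
One way to surface suspicious coordinates like the one in the last point is a simple bounding-box check. The sketch below is an assumption on our part, not something this notebook does: the bounds, the helper name `flag_suspicious`, and the last sample row are all hypothetical, and the bounds would need tuning to the regions the dataset actually covers.

```python
# Hypothetical plausibility bounds in degrees; tune to the regions covered.
LAT_BOUNDS = (10.0, 75.0)
LONG_BOUNDS = (-170.0, 30.0)


def flag_suspicious(clusters, lat_bounds=LAT_BOUNDS, long_bounds=LONG_BOUNDS):
    """Return the names of clusters whose coordinates fall outside the bounds."""
    flagged = []
    for cluster in clusters:
        lat_ok = lat_bounds[0] <= cluster['lat'] <= lat_bounds[1]
        long_ok = long_bounds[0] <= cluster['long'] <= long_bounds[1]
        if not (lat_ok and long_ok):
            flagged.append(cluster['name'])
    return flagged


clusters = [
    # These two rows are copied from the table above.
    {'name': 'Toronto', 'lat': 43.6813277, 'long': -79.4278838},
    {'name': 'Las Vegas', 'lat': 36.2017936, 'long': -115.2819809},
    # Made-up row standing in for a mislocated city.
    {'name': 'Example City', 'lat': -78.0, 'long': 15.0},
]
print(flag_suspicious(clusters))
```

Flagged rows would still need manual review, since a box check cannot tell a wrong coordinate from a legitimately remote city.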

## Inner city distribution
Now let's have a look at the distribution of restaurants within a city by neighbourhood. We will start by looking at Toronto.

In [64]:
toronto_restaurants = restaurants.filter(lambda restaurant: restaurant['city'] == 'Toronto')
toronto_distr = group_restaurants_by_key(
    toronto_restaurants,
    lambda restaurant: restaurant['neighborhood']
)
toronto_distr.sort('num_restaurants', ascending=False).show(n=20)

+-------------+--------------+--------------------+---------------+
|          lat|          long|                name|num_restaurants|
+-------------+--------------+--------------------+---------------+
|   43.7459284|   -79.3246225|                    |           1322|
|   43.6544631|   -79.3806653|       Downtown Core|            646|
|   43.7887023|   -79.2667077|         Scarborough|            441|
|   43.5951494|   -79.5299771|           Etobicoke|            305|
|   43.6446974|   -79.3923951|Entertainment Dis...|            211|
|     43.64842|      -79.3819|  Financial District|            164|
|    43.660498|   -79.3429538|         Leslieville|            156|
|   43.6836027|   -79.3230886|        The Danforth|            154|
|43.7729924508|-79.4140518612|          Willowdale|            146|
|43.6695736876|-79.3823492115|Church-Wellesley ...|            144|
|43.6525113857|-79.4010374871|   Kensington Market|            144|
|   43.6496153|   -79.3719866|        St. Lawren

The total number of restaurants in Toronto is:

In [65]:
toronto_restaurants.count()

6985

In [66]:
get_map(toronto_distr, focus=True)

It looks like a large share of restaurants (1,322 of 6,985, almost 20%) have no neighbourhood; this is the largest single group. So we can't use that field directly, and we might have to cluster restaurants into neighbourhoods ourselves.
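
As one possible starting point for that clustering, a restaurant without a neighbourhood could be assigned to the nearest labelled neighbourhood by great-circle distance. This is only a sketch under assumptions: the helpers `haversine_km` and `assign_nearest` are hypothetical, and the reference points are the single representative restaurants from the table above rather than true centroids. A proper clustering method (for example k-means or a density-based approach) may well work better.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, long) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def assign_nearest(point, labelled_points):
    """Return the label whose reference point is closest to `point`."""
    return min(
        labelled_points,
        key=lambda name: haversine_km(point, labelled_points[name]),
    )


# Reference points copied from the table above (one representative
# restaurant per neighbourhood, not a true centroid).
neighbourhoods = {
    'Downtown Core': (43.6544631, -79.3806653),
    'Scarborough': (43.7887023, -79.2667077),
    'Etobicoke': (43.5951494, -79.5299771),
}

# Coordinates of the representative restaurant of the unlabelled group.
unlabelled = (43.7459284, -79.3246225)
print(assign_nearest(unlabelled, neighbourhoods))
```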

### Categories in Toronto
Next we take a look at the most popular categories of restaurants in Toronto.

In [84]:
category_counts = pd.Series(
    toronto_restaurants \
        .flatMap(lambda restaurant: [(x.strip().lower(), restaurant['neighborhood']) for x in restaurant.get('categories', '').split(',')]) \
        .filter(lambda tup: tup[0] not in restaurant_categories and tup[1]) \
        .keyBy(lambda x: x).countByKey()
)
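
The cell above counts `(category, neighbourhood)` pairs after dropping the generic categories and restaurants with an empty neighbourhood. Roughly the same counting can be sketched without Spark using `collections.Counter`. The function name and sample records are hypothetical, and unlike the Spark version this sketch also skips empty category strings.

```python
from collections import Counter

# Generic categories that every restaurant in the dataset tends to carry.
GENERIC_CATEGORIES = {'restaurants', 'food', 'bars'}


def count_category_pairs(restaurants):
    """Count (category, neighbourhood) pairs, skipping generic categories."""
    counts = Counter()
    for restaurant in restaurants:
        neighbourhood = restaurant.get('neighborhood')
        if not neighbourhood:
            continue  # skip restaurants with an empty neighbourhood
        for raw in (restaurant.get('categories') or '').split(','):
            category = raw.strip().lower()
            if category and category not in GENERIC_CATEGORIES:
                counts[(category, neighbourhood)] += 1
    return counts


# Made-up records for illustration only.
sample = [
    {'neighborhood': 'Downtown Core', 'categories': 'Restaurants, Coffee & Tea'},
    {'neighborhood': 'Downtown Core', 'categories': 'Food, Coffee & Tea, Cafes'},
    {'neighborhood': '', 'categories': 'Restaurants, Chinese'},
]
counts = count_category_pairs(sample)
print(counts.most_common(1))
```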

In [85]:
category_counts.sort_values(ascending=False).head(50)

coffee & tea            Downtown Core               94
sandwiches              Downtown Core               72
chinese                 Milliken                    71
                        Scarborough                 67
nightlife               Downtown Core               66
fast food               Downtown Core               61
nightlife               Entertainment District      60
japanese                Downtown Core               54
breakfast & brunch      Downtown Core               51
chinese                 Chinatown                   50
cafes                   Downtown Core               43
coffee & tea            Etobicoke                   40
                        Scarborough                 40
nightlife               Scarborough                 39
burgers                 Downtown Core               39
specialty food          Scarborough                 38
nightlife               Little Italy                38
indian                  Scarborough                 38
chinese   