# Week 4 Assignment 1: A description of the problem and a discussion of the background. (15 marks)
## Problem 
<b>In the city of Toronto, if someone is looking to open a Chinese restaurant, where would you recommend that they open it? </b>
## Idea
<b>A fairly simple idea is to explore Toronto using the Foursquare venue services, find Chinese restaurants by neighborhood, then choose the densest neighborhood(s). The theory is that your new business will probably do equally well in the hottest location.</b>

# Week 4 Assignment 2: A description of the data and how it will be used to solve the problem. (15 marks)
<b>To implement the idea described in "Week 4 Assignment 1" above, I need to build neighborhood data with a measurement of each neighborhood's density in terms of Chinese restaurants. It can be built in the following 4 steps.  </b>
## Step 1. Neighborhood list data
<b>I will scrape the Toronto neighborhood list data from the Web site https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</b>
## Step 2. Geocoding neighborhood list
<b>Call a geocoding service to add a geo location (latitude and longitude) for each neighborhood. These locations are essential for the next step.</b>
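
A geocoder does not always find a match for a neighborhood name; geopy's `geocode` returns `None` in that case. Below is a minimal sketch of handling that explicitly. The helper `safe_geocode` and the stand-in `fake_geocode` are hypothetical names used only for illustration; a real run would pass the `geocode` method of a geopy `Nominatim` instance instead.

```python
from typing import Callable, Optional, Tuple


def safe_geocode(geocode: Callable[[str], object],
                 address: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (latitude, longitude), or (None, None) when nothing is found."""
    location = geocode(address)
    if location is None:
        return None, None
    return location.latitude, location.longitude


# Stand-in geocoder for illustration only (an assumption, not a real service).
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {"Chinatown, Toronto, CA": FakeLocation(43.6529, -79.398)}
    return known.get(address)  # None when the address is unknown


print(safe_geocode(fake_geocode, "Chinatown, Toronto, CA"))
print(safe_geocode(fake_geocode, "Nowhere, Toronto, CA"))
```

Neighborhoods that come back as `(None, None)` can then be dropped or fixed by hand before the venue search.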
## Step 3. Neighborhood venue data
<b>Call Foursquare venue services to get Chinese restaurant data for each neighborhood. This includes listing the venues, then getting the details about each venue.</b>
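
As a rough sketch of what one venue search request looks like, the URL can be assembled with the standard library so that spaces and commas are encoded correctly. The function name `build_explore_url`, the default `limit=50`, and the placeholder credentials and version string are assumptions for illustration; the parameter names are the ones this notebook sends to the `venues/explore` endpoint.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      query="chinese restaurants", radius=500, limit=50):
    """Build the explore URL for one neighborhood center."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude of the neighborhood
        "radius": radius,       # search radius in meters
        "query": query,
        "limit": limit,         # assumed value; the real LIMIT is not shown
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials and version string, for illustration only.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 43.6529, -79.398)
print(url)
```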
## Step 4. Process to get density measurement data
<b>In the simplest approach, I measure density as the number of Chinese restaurants per neighborhood. </b>
<b>Then I find the neighborhood(s) with the largest number of Chinese restaurants and recommend them as the most suitable. </b>
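
The counting idea can be sketched with pandas on a tiny made-up table. The neighborhood names `A`, `B`, `C` and venue names are toy values, not real data.

```python
import pandas as pd

# Toy venue list: one row per Chinese restaurant found in a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "A", "C", "B"],
    "Venue": ["v1", "v2", "v3", "v4", "v5", "v6"],
})

# Density = number of venues per neighborhood.
counts = venues.groupby("Neighborhood")["Venue"].count()

# Recommend every neighborhood that ties for the highest count.
densest = counts[counts == counts.max()].index.tolist()
print(counts.to_dict(), densest)
```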

# Week 5 Assignment 1:     A full report consisting of all of the following components (15 marks):
<ul>
    <li>Introduction where you discuss the business problem and who would be interested in this project.
    <li>Data where you describe the data that will be used to solve the problem and the source of the data.
    <li>Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why.
    <li>Results section where you discuss the results.
    <li>Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
    <li>Conclusion section where you conclude the report.
</ul>

# Week 5 Assignment 2: A link to your Notebook on your Github repository, showing your code. (15 marks)
## Problem 
<b>In the city of Toronto, if someone is looking to open a Chinese restaurant, where would you recommend that they open it? </b>
## Idea
<b>A fairly simple idea is to explore Toronto using the Foursquare location service, find Chinese restaurants by neighborhood, then choose the densest neighborhood(s). The theory is that your new business will probably do equally well in the hottest location.</b>

# Week 5 Assignment 3: Your choice of a presentation or blogpost. (10 marks)

<br>
<br>
<br>
<br>


## Below is the code for data collection & cleaning

Common imports

In [19]:
import pandas as pd
import numpy as np


## Step 1 & 2. Neighborhood list data with Geocoding
<b>I will scrape the Toronto neighborhood list data from this <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Web site</a>, adding a geo location for each neighborhood.</b>

In [20]:
# scrape neighborhood lists, adding geo location for each. 
#!pip install BeautifulSoup4
#!pip install geopy

import requests
import urllib.request
import time
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
geolocator = Nominatim(user_agent="foursquare_agent")

df = pd.DataFrame(columns=["Neighborhood","PostalCode","Borough","Latitude","Longitude"])
df.set_index("Neighborhood", inplace=True)

# parsing/scraping
table1 = soup.find('table', class_="wikitable sortable")
trs = table1.findAll('tr')
for tr1 in trs:  # visit every row, including the last one; header rows have no <td> and are skipped below
    tds = tr1.findAll('td')
    if (len(tds)==3 and tds[1].text.strip()!="Not assigned"):
        postalCode=tds[0].text.strip()
        borough=tds[1].text.strip()
        neighborhood=tds[2].text.strip()
        if neighborhood=="Not assigned":
            neighborhood = borough
        else:
            neighborhood = neighborhood.replace("\n","")
        
        if len(neighborhood)>0:
            try:
                df.loc[neighborhood,"PostalCode"] = postalCode
                df.loc[neighborhood,"Borough"] = borough
            except: 
                print("Warning: neighborhood {} borough {} zip {} may have duplicates. Existing record found in borough {}, zip {}"
                      .format(neighborhood, borough, postalCode, df.loc[neighborhood, "Borough"], df.loc[neighborhood, "PostalCode"]))  
            try:
                # getting geo location
                address=neighborhood + ", Toronto, CA"
                location = geolocator.geocode(address)
                df.loc[neighborhood,"Latitude"] = location.latitude
                df.loc[neighborhood,"Longitude"] = location.longitude
            except:
                print("possible geocoding error for neighborhood {}".format(neighborhood))
        else:
            pass
        
print(df.shape)        
df.head()


possible geocoding error for neighborhood Parkview Hill
possible geocoding error for neighborhood Humewood-Cedarvale
possible geocoding error for neighborhood Caledonia-Fairbanks
possible geocoding error for neighborhood CFB Toronto
possible geocoding error for neighborhood India Bazaar
possible geocoding error for neighborhood Del Ray
possible geocoding error for neighborhood Birch Cliff
possible geocoding error for neighborhood Canada Post Gateway Processing Centre
possible geocoding error for neighborhood Railway Lands
possible geocoding error for neighborhood Humber Bay Shores
possible geocoding error for neighborhood Albion Gardens
possible geocoding error for neighborhood Beaumond Heights
possible geocoding error for neighborhood Stn A PO Boxes 25 The Esplanade
possible geocoding error for neighborhood Business Reply Mail Processing Centre 969 Eastern
possible geocoding error for neighborhood Kingsway Park South East
possible geocoding error for neighborhood Kingsway Park South W

Neighborhood,PostalCode,Borough,Latitude,Longitude
Parkwoods,M3A,North York,43.7588,-79.3202
Victoria Village,M4A,North York,43.7327,-79.3112
Harbourfront,M5A,Downtown Toronto,43.6401,-79.3801
Regent Park,M5A,Downtown Toronto,43.6607,-79.3605
Lawrence Heights,M6A,North York,43.7228,-79.4509


In [21]:
# drop rows with missing location values
geo_df=df.dropna(axis=0, subset=["Latitude", "Longitude"])
geo_df.shape
geo_df.head()

Neighborhood,PostalCode,Borough,Latitude,Longitude
Parkwoods,M3A,North York,43.7588,-79.3202
Victoria Village,M4A,North York,43.7327,-79.3112
Harbourfront,M5A,Downtown Toronto,43.6401,-79.3801
Regent Park,M5A,Downtown Toronto,43.6607,-79.3605
Lawrence Heights,M6A,North York,43.7228,-79.4509


## Step 3. Neighborhood venue data
<B>Use Foursquare venue service to get Chinese restaurant data for each neighborhood.</B>


Foursquare credentials

In [22]:
# The code was removed by Watson Studio for sharing.

Function for exploring neighborhoods, one request per neighborhood area.

In [23]:
def getNearbyVenues(names, latitudes, longitudes, query="chinese restaurants", radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&query={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            query,
            LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        
            # return only relevant information for each nearby venue
            for v in results:
                try:
                    venues_list.append([(
                    name, 
                    lat, 
                    lng, 
                    v['venue']['name'], 
                    v['venue']['location']['lat'], 
                    v['venue']['location']['lng'],  
                    v['venue']['categories'][0]['name'],
                    v['venue']['id']
                    )]) 
                except:
                    print("except on {} v={} ".format(name, v ))
        except:
            print('except on neighborhood {}'.format(name))
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Venue ID' ]
    
    return(nearby_venues)
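
As an alternative to nested `try`/`except`, the fields can be pulled out of the response with `dict.get`, so that a missing key yields `None` instead of an exception. This is only a sketch: `extract_venue_rows` is a hypothetical helper, and the sample payload below is hand-built to follow the `response` / `groups` / `items` / `venue` nesting that `getNearbyVenues` reads, using one venue from this notebook's output.

```python
def extract_venue_rows(payload, neighborhood):
    """Flatten an explore-style payload into one dict per venue."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]  # tolerate empty list
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue.get("name"),
            "Venue Latitude": location.get("lat"),
            "Venue Longitude": location.get("lng"),
            "Venue Category": categories[0].get("name"),
            "Venue ID": venue.get("id"),
        })
    return rows


# Hand-built sample payload (not an actual API response).
sample = {"response": {"groups": [{"items": [{"venue": {
    "id": "4c0150f4716bc9b65b9dbb55",
    "name": "Spicy Chicken House",
    "location": {"lat": 43.760639, "lng": -79.325671},
    "categories": [{"name": "Chinese Restaurant"}],
}}]}]}}

rows = extract_venue_rows(sample, "Parkwoods")
print(rows)
```

A list of dicts like this can be passed straight to `pd.DataFrame(rows)`.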


In [26]:
geo_df.reset_index(inplace=True)  # needed to use df['Neighborhood'] in next cell

In [27]:
# run the function to collect data
neighborhood_venues = getNearbyVenues(names=geo_df['Neighborhood'],
                                   latitudes=geo_df['Latitude'],
                                   longitudes=geo_df['Longitude']
                                  )


<B>Get venue details: fetch createdAt and use it as the date the business started</B>


In [28]:
# Venue details - get createdAt and use it as the date the business started
def getVenueDetails(venue_ids):
    
    venue_details=[]
    for venue_id in venue_ids:
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}&limit=1'.format(
            venue_id, 
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)

        # make the GET request
        v = requests.get(url).json()

        # return only relevant information for each venue - for v1: createdAt
        try:
            venue_details.append([(
            venue_id, 
            v["response"]["venue"]["createdAt"] 
            )]) 
        except:
            print("except on venue ID {}, json={} ".format(venue_id, v ))

    pd_details = pd.DataFrame([item for v_list in venue_details for item in v_list])
    pd_details.columns = ['Venue ID',
                          'Venue CreatedAt']
    
    return(pd_details)
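
The `createdAt` values look like Unix timestamps in seconds (an assumption based on their magnitude). Under that assumption, a minimal sketch of converting them to dates with pandas, using two of the IDs and timestamps from this notebook's output; the new column name `Venue Created Date` is made up for illustration. Note that this is the date the venue record was created, which is only a proxy for when the business opened.

```python
import pandas as pd

details = pd.DataFrame({
    "Venue ID": ["4ab153d4f964a520066920e3", "4ad4c05ff964a52013f720e3"],
    "Venue CreatedAt": [1253135316, 1255456863],
})

# Interpret the integers as seconds since 1970-01-01 (UTC).
details["Venue Created Date"] = pd.to_datetime(details["Venue CreatedAt"], unit="s")
print(details)
```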

In [29]:
# The code was removed by Watson Studio for sharing.

                   Venue ID  Venue CreatedAt
0  4ab153d4f964a520066920e3       1253135316
1  4ad4c05ff964a52013f720e3       1255456863
2  4ad4c060f964a5205cf720e3       1255456864
3  4ad4c060f964a5205cf720e3       1255456864
4  4ad4c060f964a5205cf720e3       1255456864


In [16]:
# The code was removed by Watson Studio for sharing.

(84, 2)


In [30]:
# Because of the Foursquare quota restriction, the venue details are collected over multiple days.
# When all of them are collected, resume here to process (convert createdAt from an epoch count to a date).

neighborhood_venues2=pd.merge(neighborhood_venues, venue_details, how="left", 
                              left_on=["Venue ID"], 
                              right_on=["Venue ID"])
# True only when every venue has a createdAt value (no NaN left after the merge)
all_in_flag = bool(neighborhood_venues2["Venue CreatedAt"].notna().all())

print(neighborhood_venues2.shape)
print(neighborhood_venues2.head())

(585, 9)
   Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0     Parkwoods               43.75880              -79.320197   
1  Harbourfront               43.64008              -79.380150   
2  Harbourfront               43.64008              -79.380150   
3  Harbourfront               43.64008              -79.380150   
4  Harbourfront               43.64008              -79.380150   

                 Venue  Venue Latitude  Venue Longitude      Venue Category  \
0  Spicy Chicken House       43.760639       -79.325671  Chinese Restaurant   
1   Pearl Harbourfront       43.638157       -79.380688  Chinese Restaurant   
2     Szechuan Express       43.641346       -79.377960  Chinese Restaurant   
3         Shanghai 360       43.641647       -79.377920  Chinese Restaurant   
4          Water Front       43.641510       -79.375861  Chinese Restaurant   

                   Venue ID  Venue CreatedAt  
0  4c0150f4716bc9b65b9dbb55       1275154676  
1  4ae33054f964a520759121

<b>Examine the explored results, then clean up and process them as needed</b>

In [31]:
neighborhood_venues2=neighborhood_venues2.drop_duplicates()
print(neighborhood_venues2.shape)
print(neighborhood_venues2.head()[["Neighborhood", "Venue", "Venue Category", "Venue ID"]])
#neighborhood_venues["Venue Category"] = [s.strip() for s in neighborhood_venues["Venue Category"]] 
print("unique categories: ", neighborhood_venues2["Venue Category"].unique())
print("max count per neighborhood: ", neighborhood_venues2[["Neighborhood","Venue"]].groupby("Neighborhood").count().max())

(559, 9)
   Neighborhood                Venue      Venue Category  \
0     Parkwoods  Spicy Chicken House  Chinese Restaurant   
1  Harbourfront   Pearl Harbourfront  Chinese Restaurant   
2  Harbourfront     Szechuan Express  Chinese Restaurant   
3  Harbourfront         Shanghai 360  Chinese Restaurant   
4  Harbourfront          Water Front  Chinese Restaurant   

                   Venue ID  
0  4c0150f4716bc9b65b9dbb55  
1  4ae33054f964a520759121e3  
2  55df3345498e28c71648d892  
3  57eebf4b498e72ef33fd6211  
4  4db2f9fd6e8179a9135e5b45  
unique categories:  ['Chinese Restaurant' 'Asian Restaurant' 'Sushi Restaurant'
 'Cantonese Restaurant' 'Fried Chicken Joint' 'Bubble Tea Shop'
 'Taiwanese Restaurant' 'Dim Sum Restaurant' 'Peking Duck Restaurant'
 'Hakka Restaurant' 'Hong Kong Restaurant' 'Comfort Food Restaurant'
 'Dumpling Restaurant' 'Hotpot Restaurant' 'Szechuan Restaurant'
 'Dongbei Restaurant']
max count per neighborhood:  Venue    50
dtype: int64


## Step 4. Process to get density measurement data
<b>The simplest way to measure density is a count by neighborhood, so we count the number of Chinese restaurants per neighborhood. </b>

In [32]:
# evaluate the model on the N most recently created venues
import time
N=200
n=0
Y=["" for s in range(N)]
y=["" for s in range(N)]
ev_df=neighborhood_venues2[["Neighborhood", "Venue", "Venue CreatedAt"]]
ev_df.reset_index(inplace=True) 
for i in range(0,N):
    # find when the last restaurant was opened
    max_createdAt=max(ev_df["Venue CreatedAt"])
    # set real target value
    tgt_df=ev_df[ev_df["Venue CreatedAt"]==max_createdAt]
    tgt_df.reset_index(inplace=True)
    #print(tgt_df.head(1))
    Y[i]=tgt_df.loc[0,"Neighborhood"]
    # filter out the last opened
    ev_df=ev_df[ev_df["Venue CreatedAt"]<max_createdAt]
    # calc density 
    ev_cnt_df=ev_df[["Neighborhood","Venue"]].groupby("Neighborhood").count()
    ev_cnt_df.reset_index(inplace=True) 
    # recommended location
    max_cnt=max(ev_cnt_df["Venue"])
    ev_tgt_df=ev_cnt_df[ev_cnt_df["Venue"]==max_cnt]
    ev_tgt_df.reset_index(inplace=True) 
    y[i]=ev_tgt_df.loc[0,"Neighborhood"]
    # evaluate
    if Y[i]==y[i]:
        n=n+1
print("correct={}, pct={}".format(n, n*100.0/N))
    

correct=24, pct=12.0
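
The evaluation above can be read as a simple backtest: hold out the newest venue, predict its neighborhood as the densest one among all older venues, and repeat. Below is a compact sketch of the same idea on toy data. The function name `backtest_densest` is hypothetical, and tie-breaking between equally dense neighborhoods may differ from the loop above.

```python
import pandas as pd


def backtest_densest(venues, n_rounds):
    """Return (hits, rounds) for the 'densest neighborhood' predictor."""
    remaining = venues
    hits = rounds = 0
    for _ in range(n_rounds):
        if remaining.empty:
            break
        newest = remaining["Venue CreatedAt"].max()
        # Actual neighborhood of the most recently created venue.
        actual = remaining.loc[remaining["Venue CreatedAt"] == newest,
                               "Neighborhood"].iloc[0]
        # Keep only venues created strictly earlier.
        remaining = remaining[remaining["Venue CreatedAt"] < newest]
        if remaining.empty:
            break
        # Prediction: the neighborhood with the most earlier venues.
        predicted = remaining["Neighborhood"].value_counts().idxmax()
        hits += int(predicted == actual)
        rounds += 1
    return hits, rounds


# Toy data: six venues with made-up creation times 1..6.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "A", "B"],
    "Venue CreatedAt": [1, 2, 3, 4, 5, 6],
})
print(backtest_densest(toy, n_rounds=3))
```

On the toy data only the middle round is predicted correctly, so the sketch reports one hit out of three rounds.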


<B>The correct rate, at 12%, is quite low; this model definitely needs improvement.</B>
<ol>
<li>First, the data may need to be double-checked (some venues could have been counted in multiple neighborhoods).
<li>Second, and more likely, the model needs more features than a simple venue count.
</ol>
But to complete this course project, here is the recommended location for your next Chinese restaurant: Chinatown! 
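
One possible way to address the first point (a venue counted in several neighborhoods) is to keep each venue only under the neighborhood whose center is closest to it. This is a sketch of that idea, not something the analysis above does. The function names are hypothetical, the column names are the ones produced by `getNearbyVenues`, and the second row of the toy table is a made-up duplicate.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two sets of points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def keep_nearest_neighborhood(venues):
    """Keep one row per Venue ID: the one with the nearest neighborhood center."""
    dist = haversine_km(venues["Neighborhood Latitude"],
                        venues["Neighborhood Longitude"],
                        venues["Venue Latitude"],
                        venues["Venue Longitude"])
    ranked = venues.assign(distance_km=dist).sort_values("distance_km")
    return ranked.drop_duplicates(subset="Venue ID", keep="first").sort_index()


# Toy table: the same venue listed under two neighborhoods
# (the Chinatown row is a hypothetical duplicate).
toy = pd.DataFrame({
    "Neighborhood": ["Chinatown", "Harbourfront"],
    "Neighborhood Latitude": [43.6529, 43.64008],
    "Neighborhood Longitude": [-79.398, -79.38015],
    "Venue": ["Pearl Harbourfront", "Pearl Harbourfront"],
    "Venue Latitude": [43.638157, 43.638157],
    "Venue Longitude": [-79.380688, -79.380688],
    "Venue ID": ["4ae33054f964a520759121e3", "4ae33054f964a520759121e3"],
})
deduped = keep_nearest_neighborhood(toy)
print(deduped[["Neighborhood", "Venue", "distance_km"]])
```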

In [33]:
# analytic_df has full information about each venue
analytic_df=pd.merge(neighborhood_venues, geo_df, how="left", 
                     left_on=["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude"], 
                     right_on=["Neighborhood", "Latitude", "Longitude"])
analytic_df.reindex()
print(analytic_df.shape)
#print(analytic_df.head())

# merge_df has neighborhood and Chinese Restaurant counts
cnt_df=analytic_df[["Neighborhood","Venue"]].groupby("Neighborhood").count()
merge_df=pd.merge(geo_df, cnt_df, how="left", 
                     left_on=["Neighborhood"], 
                     right_on=["Neighborhood"])
print(merge_df.shape)

(559, 13)
(192, 7)


### Recommend location(s) 

In [34]:
# find the neighborhoods with the max count; these are the ones to recommend
recommend1_df=merge_df[merge_df["Venue"]==merge_df["Venue"].max()]  # Series.max() skips NaN counts
print(recommend1_df)

     index Neighborhood PostalCode           Borough Latitude Longitude  Venue
143    143    Chinatown        M5T  Downtown Toronto  43.6529   -79.398   50.0


### Visualize it on map - recommended neighborhoods are marked with bigger size dots

In [35]:
# find Toronto location (lat,lng)

from geopy.geocoders import Nominatim 

geolocator = Nominatim(user_agent="foursquare_agent")

for address in ['Toronto, CA']:
    location = geolocator.geocode(address)
    toronto_lat = location.latitude
    toronto_lng = location.longitude
    print(address,  toronto_lat, toronto_lng)


Toronto, CA 43.653963 -79.387207


In [36]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium
import math
import matplotlib.cm as cm
import matplotlib.colors as colors




In [37]:
max_cnt=merge_df["Venue"].max()  # Series.max() skips NaN counts

# set color scheme for the clusters
k = math.ceil(max_cnt/10)
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
print (k, rainbow)

for lat, lon, poi, cnt in zip(merge_df['Latitude'], merge_df['Longitude'], merge_df['Neighborhood'], merge_df['Venue'].fillna(0)):
    print(poi, cnt)

5 ['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']
Parkwoods 1.0
Victoria Village 0.0
Harbourfront 4.0
Regent Park 1.0
Lawrence Heights 2.0
Lawrence Manor 0.0
Queen's Park 3.0
Islington Avenue 0.0
Rouge 1.0
Malvern 2.0
Don Mills North 2.0
Woodbine Gardens 1.0
Ryerson 12.0
Garden District 6.0
Glencairn 0.0
Cloverdale 1.0
Islington 0.0
Martin Grove 1.0
Princess Gardens 4.0
West Deane Park 0.0
Highland Creek 0.0
Rouge Hill 0.0
Port Union 0.0
Flemingdon Park 2.0
Don Mills South 2.0
Woodbine Heights 0.0
St. James Town 2.0
Bloordale Gardens 0.0
Eringate 0.0
Markland Wood 0.0
Old Burnhamthorpe 0.0
Guildwood 0.0
Morningside 0.0
West Hill 2.0
The Beaches 2.0
Berczy Park 5.0
Woburn 1.0
Leaside 1.0
Central Bay Street 0.0
Christie 6.0
Cedarbrae 1.0
Hillcrest Village 0.0
Bathurst Manor 0.0
Downsview North 0.0
Wilson Heights 1.0
Thorncliffe Park 3.0
Adelaide 10.0
King 8.0
Richmond 0.0
Dovercourt Village 1.0
Dufferin 2.0
Scarborough Village 2.0
Fairview 18.0
Henry Farm 0.0
Oriole 0.0
Northwood

In [38]:
# create map
map1 = folium.Map(location=[toronto_lat, toronto_lng], zoom_start=13)

# add markers to the map
max_cnt=merge_df["Venue"].max()  # Series.max() skips NaN counts

# set color scheme for the clusters
k = math.ceil(max_cnt/10)+1
x = np.arange(0,k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cnt in zip(merge_df['Latitude'], merge_df['Longitude'], merge_df['Neighborhood'], merge_df['Venue'].fillna(0)):
    cnt_i = int(cnt)
    color_i = math.ceil(cnt_i/10)

    label = folium.Popup(str(poi) + ' (count=' + str(cnt_i) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5*(1+math.floor(cnt_i/max_cnt)),
        popup=label,
        color=rainbow[color_i],
        fill=True,
        fill_color=rainbow[color_i],
        fill_opacity=0.7).add_to(map1)
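
# The marker styling used in the loop above can be checked on its own, without drawing a map.
# This is a small sketch: marker_style is a hypothetical helper that repeats the same two formulas,
# and the counts below are examples (50 is the Chinatown count, used here as max_count).

```python
import math


def marker_style(count, max_count, bucket_size=10, base_radius=5):
    """Return (color_index, radius) for a neighborhood with `count` venues."""
    # One color per bucket of 10 venues; a count of 0 gets index 0.
    color_index = math.ceil(count / bucket_size)
    # Only the neighborhood(s) reaching max_count get the doubled radius.
    radius = base_radius * (1 + math.floor(count / max_count))
    return color_index, radius


for count in (0, 4, 18, 50):
    print(count, marker_style(count, max_count=50))
```

# With max_count=50 the color index runs from 0 to 5, which is why the palette needs ceil(max_cnt/10)+1 entries.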



In [39]:
map1
