<h1>Segmenting and Clustering Chinese Restaurants in Toronto</h1>

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

<h2>Introduction: Business Problem</h2>

The goal is to cluster restaurant areas in Toronto by the average cost of a meal in a restaurant.


<h2>Data</h2>

To request restaurant information from Foursquare, Toronto needs to be split into circular areas. The shape of Toronto can be roughly described as a trapezoid, so the latitude and longitude of its four corners need to be determined, as well as the general width and breadth of the city.

To calculate distances accurately, we need to create our grid of locations in a 2D Cartesian coordinate system, which allows us to calculate distances in meters (not in latitude/longitude degrees). Then we'll project those coordinates back to latitude/longitude degrees so they can be shown on a Folium map.
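
The round trip between degrees and meters can be sketched as follows. The `Toronto` class used in this notebook is not shown, so this is only an illustration: it assumes a simple equirectangular approximation around a reference point, whereas the actual implementation may use a proper map projection. The function names and the Earth-radius constant are hypothetical choices made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)


def latlon_to_xy(lat, lon, lat0, lon0):
    """Project (lat, lon) to metres east/north of the reference point (lat0, lon0)."""
    x = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def xy_to_latlon(x, y, lat0, lon0):
    """Inverse of latlon_to_xy: metres east/north back to degrees."""
    lat = lat0 + math.degrees(y / EARTH_RADIUS_M)
    lon = lon0 + math.degrees(x / (EARTH_RADIUS_M * math.cos(math.radians(lat0))))
    return lat, lon


origin = (43.750724, -79.640452)  # top-left corner from the table
corner = (43.857698, -79.169111)  # top-right corner from the table

x, y = latlon_to_xy(*corner, *origin)
print(f"top-right corner is about {x:.0f} m east and {y:.0f} m north of top-left")

# Projecting back recovers the original degrees
lat, lon = xy_to_latlon(x, y, *origin)
print(lat, lon)
```

The approximation is adequate at city scale, where the grid only needs distances that are accurate to within a few meters per kilometer.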

Simplified coordinates for Toronto can be found in the table below. 


<table align="center">
    <tr>
        <th></th>
        <th>Latitude</th>
        <th>Longitude</th>
    </tr>
    <tr>
        <th>Top right coordinates</th>
        <td>43.857698</td>
        <td>-79.169111</td>
    </tr>
    <tr>
        <th>Top left coordinates</th>
        <td>43.750724</td>
        <td>-79.640452</td>
    </tr>
    <tr>
        <th>Bottom left coordinates</th>
        <td>43.577857</td>
        <td>-79.541984</td>
    </tr>
    <tr>
        <th>Bottom right coordinates</th>
        <td>43.749704</td>
        <td>-79.111260</td>
    </tr>
</table>

In [None]:
import pickle
import pandas as pd 
import numpy as np

In [None]:
from capstone import Toronto

top_right =      (43.857698, -79.169111)
top_left =       (43.750724, -79.640452)
bottom_left =    (43.557857, -79.541984)
bottom_right =   (43.749704, -79.111260)

toronto_map = Toronto(  
    top_right, 
    top_left, 
    bottom_left, 
    bottom_right
)
toronto_map.display_toronto_map()

The Foursquare API allows at most 950 requests per day. First, the whole Toronto area will be searched for restaurants in the "Asian Restaurant" venue category. Second, because each query returns at most 100 venues, the search area will be decreased where needed so that no restaurant is missed. The circle diameter is chosen so that the number of circles is at most 475, which is half of the daily request limit.

The grid of circles is shown on the map below. The grid is initially rectangular; to minimise the number of requests, coordinates that fall over water were deleted.

In [None]:
toronto_map.circle_diameter = 1400
list_of_centers = toronto_map.get_latlon_data()

toronto_map.display_circles(list_of_centers)
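
How such a grid might be built and trimmed can be sketched in metric coordinates. The implementation of `get_latlon_data` is not shown, so everything here is an assumption: the hexagonal layout, the helper names `hex_grid` and `inside_polygon`, and the toy trapezoid, which is not Toronto's real outline. The sketch generates centers over the bounding rectangle, keeps only those inside the polygon, and compares the count with the budget of 475 requests.

```python
import math


def hex_grid(width_m, height_m, spacing_m):
    """Return centres (x, y) in metres on a hexagonal lattice covering a rectangle.

    Neighbouring centres are spacing_m apart; odd rows are shifted by half a step.
    """
    centers = []
    row_step = spacing_m * math.sqrt(3) / 2
    row, y = 0, 0.0
    while y <= height_m:
        x = spacing_m / 2 if row % 2 else 0.0
        while x <= width_m:
            centers.append((x, y))
            x += spacing_m
        y += row_step
        row += 1
    return centers


def inside_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon (a list of (x, y))."""
    px, py = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line through the point
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


# Toy trapezoid in metres, used only for illustration
trapezoid = [(0, 0), (40_000, 0), (35_000, 20_000), (5_000, 20_000)]
spacing = 1400

centers = [c for c in hex_grid(40_000, 20_000, spacing) if inside_polygon(c, trapezoid)]
print(len(centers), "requests needed; the budget is 475")

# On a hexagonal lattice the farthest point from any centre is spacing / sqrt(3) away,
# so a search radius at least this large leaves no gaps between circles.
coverage_radius = spacing / math.sqrt(3)
print(f"minimum search radius for full coverage: {coverage_radius:.0f} m")
```

Filtering before sending any requests is what keeps the daily quota from being spent on areas that cannot contain restaurants.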

In [None]:
list_of_coordinates = toronto_map.latlon_list_for_circles  #class attribute not needed
print(f"The total number of areas created is {len(list_of_coordinates)}")

Using the Foursquare API to collect information about Chinese restaurants in every area.

In [None]:
from capstone import FoursquareSearch

circle_diameter = toronto_map.circle_diameter

foursq = FoursquareSearch(list_of_coordinates, circle_diameter)
foursq.get_explore_requests()

36 requests raised errors during the first search, so a second search was required to fill in the gaps. During the second search, all restaurants were recorded and no errors were received. Data from the a1_radius_1212.txt file will be processed to get new search areas.

In [None]:
with open("a1_radius_1212.txt", "rb") as file:   # Unpickling
    list_of_rest = pickle.load(file)

columns = ['total_results', 'recieved_no_results', 'searched_coordinates']
df = pd.DataFrame(columns=columns)

df['searched_coordinates'] = list_of_coordinates
# A failed request has an empty 'response', so record NaN for it
df['total_results'] = [
    x['response']['totalResults'] if x['response'] else np.nan
    for x in list_of_rest
]
df['recieved_no_results'] = [
    len(x['response']['groups'][0]['items']) if x['response'] else np.nan
    for x in list_of_rest
]
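
As an alternative to repeating a whole search, the failed requests could be located and re-run individually. The sketch below is hypothetical and not part of the `capstone` module: `failed_indices`, `refill` and the `fetch` callable are names invented for this example, and the only thing taken from the notebook is that a failed request has an empty `response`.

```python
def failed_indices(responses):
    """Indices of requests whose 'response' payload is empty (the request failed)."""
    return [i for i, r in enumerate(responses) if not r.get('response')]


def refill(responses, coordinates, fetch):
    """Re-run only the failed requests and return a new, patched list.

    fetch is a hypothetical callable that takes one (lat, lon) pair and
    returns the decoded JSON of a single request.
    """
    patched = list(responses)
    for i in failed_indices(responses):
        patched[i] = fetch(coordinates[i])
    return patched


# Toy data standing in for real API responses
coords = [(43.70, -79.40), (43.71, -79.41), (43.72, -79.42)]
first_pass = [
    {'response': {'totalResults': 3}},
    {'response': {}},                    # failed request
    {'response': {'totalResults': 0}},   # succeeded, but found nothing
]


def fake_fetch(latlon):
    """Stand-in for a real request; always succeeds."""
    return {'response': {'totalResults': 7}}


second_pass = refill(first_pass, coords, fake_fetch)
print(failed_indices(first_pass))   # [1]
print(failed_indices(second_pass))  # []
```

Note that a request with `totalResults` equal to zero is a success, not a failure, so it is not retried.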

Creating a dataframe of the requests where the number of total_results received is more than 100. The five areas with the highest total_results are selected.

In [None]:
new_search_df = df.sort_values(by='total_results', ascending=False).head(5)

toronto_map.display_circles(new_search_df.searched_coordinates, zoom_start=12)
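
The cell above takes the top five rows. An explicit threshold is another way to express the same idea, and it does not depend on knowing in advance how many areas exceed the limit. This is a sketch on invented toy values; the constant name `MAX_VENUES_PER_QUERY` is hypothetical.

```python
import pandas as pd

MAX_VENUES_PER_QUERY = 100  # per-query limit stated earlier in the text

# Toy values, not the real request results
toy = pd.DataFrame({
    'total_results': [12, 140, 95, 230, 101],
    'searched_coordinates': [
        (43.65, -79.38), (43.66, -79.39), (43.70, -79.40),
        (43.67, -79.40), (43.64, -79.41),
    ],
})

# Keep only the areas where a single query could not return every venue
saturated = toy[toy['total_results'] > MAX_VENUES_PER_QUERY]
saturated = saturated.sort_values(by='total_results', ascending=False)
print(saturated['total_results'].tolist())  # [230, 140, 101]
```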

Creating a reduced search area

In [None]:
reduced_top_right =      (43.694849, -79.361626)
reduced_top_left =       (43.673882, -79.456669)
reduced_bottom_left =    (43.630734, -79.427657)
reduced_bottom_right =   (43.651852, -79.345343)

updated_toronto_map = Toronto(  
    reduced_top_right, 
    reduced_top_left, 
    reduced_bottom_left, 
    reduced_bottom_right
)

updated_toronto_map.display_toronto_map(zoom_start=12, display_boundaries=False)  #switch off the boundaries

In [None]:
updated_toronto_map.circle_diameter = 600
updated_list_coordinates = updated_toronto_map.get_latlon_data()

updated_toronto_map.display_circles(updated_list_coordinates, zoom_start=12)

In [None]:
updated_circle_diameter = updated_toronto_map.circle_diameter
new_foursq = FoursquareSearch(updated_list_coordinates, updated_circle_diameter)

new_foursq.get_explore_requests()

The files a3_radius_519.txt and a4_radius_519.txt were obtained with restaurants from the updated search area (the first search returned one error). The next step is to process the four files into one dataframe.

In [None]:
list_of_requests = []
file_names = [
    "a1_radius_1212.txt",
    "a2_radius_1212.txt",
    "a3_radius_519.txt",
    "a4_radius_519.txt",
]
# Unpickle each saved batch of responses and merge them into one list
for file_name in file_names:
    with open(file_name, "rb") as file:
        list_of_requests.extend(pickle.load(file))

Deleting all requests where the totalResults parameter equals zero or the response is empty (an error was thrown in that case).

In [None]:
# Keep only requests with a non-empty response that returned at least one venue
list_of_requests = [
    x for x in list_of_requests
    if x['response'] and x['response']['totalResults'] != 0
]


Transforming the list of requests into a dataframe.

In [None]:
df = FoursquareSearch.requests_to_dataframe(list_of_requests)
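
The body of `requests_to_dataframe` is not shown in this notebook. The sketch below is one way such a flattening step could look, not the actual implementation: `requests_to_rows` is a hypothetical name, and the nested keys under `venue` (`location`, `categories`, `postalCode`, `lat`, `lng`) are assumptions about the response layout. The toy response reuses one venue from the results shown later in this notebook.

```python
def requests_to_rows(requests):
    """Flatten explore-style responses into one dict per venue."""
    rows = []
    for req in requests:
        for item in req['response']['groups'][0]['items']:
            venue = item['venue']
            location = venue.get('location', {})
            categories = venue.get('categories') or [{}]  # tolerate missing categories
            rows.append({
                'venueID': venue['id'],
                'venue_name': venue['name'],
                'category_name': categories[0].get('name'),
                'categoryID': categories[0].get('id'),
                'address': location.get('address'),
                'postcode': location.get('postalCode'),
                'latitude': location.get('lat'),
                'longitude': location.get('lng'),
            })
    return rows


# Toy response with a single venue
toy_requests = [{
    'response': {
        'totalResults': 1,
        'groups': [{
            'items': [{
                'venue': {
                    'id': '4b5d083ef964a520bd4f29e3',
                    'name': 'Ginko',
                    'categories': [{'id': '4bf58dd8d48988d111941735',
                                    'name': 'Japanese Restaurant'}],
                    'location': {'address': '655 Dixon Road',
                                 'postalCode': 'M9W 1J4',
                                 'lat': 43.689193, 'lng': -79.578441},
                },
            }],
        }],
    },
}]

rows = requests_to_rows(toy_requests)
print(rows[0]['venue_name'], rows[0]['postcode'])
```

Passing `rows` to `pd.DataFrame` would then give a frame with one row per venue and the keys above as columns.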

Duplicates were dropped using venueID as the subset. The venue_name column still contains repeated names; on closer inspection, however, their addresses and coordinates differ, so they are separate locations.

In [None]:
# Names that occur more than once after deduplication by venueID
name_counts = df.venue_name.value_counts()
repetitive_names = name_counts[name_counts > 1].index
temp_df = df[df.venue_name.isin(repetitive_names)]
temp_df = temp_df.sort_values(by='venue_name')
temp_df
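
The deduplication and the repeated-name check can be illustrated on a toy frame. The rows below are invented for the example; only the column names and the use of venueID as the subset come from this notebook.

```python
import pandas as pd

# Toy data: the first two rows are the same venue returned by overlapping circles
toy = pd.DataFrame({
    'venueID':    ['a1', 'a1', 'b2', 'c3'],
    'venue_name': ['Golden Wok', 'Golden Wok', 'Golden Wok', 'Ginko'],
    'address':    ['1 First St', '1 First St', '9 Ninth Ave', '655 Dixon Road'],
})

# Overlapping search circles return the same venue twice; venueID identifies it
deduped = toy.drop_duplicates(subset='venueID')

# Names that still repeat belong to different venues with different addresses
name_counts = deduped['venue_name'].value_counts()
repeated = deduped[deduped['venue_name'].isin(name_counts[name_counts > 1].index)]

print(len(deduped))                  # 3
print(repeated['address'].tolist())  # ['1 First St', '9 Ninth Ave']
```

Deduplicating on the name instead would wrongly merge separate branches of the same restaurant.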

Displaying all the restaurants obtained from the Foursquare API.

In [None]:
import folium

map_toronto = folium.Map(location=[43.7, -79.4], zoom_start=11)

for lat,lng in zip(df.latitude, df.longitude): 
    color = 'blue'
    folium.CircleMarker([lat, lng], radius=1, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_toronto)
map_toronto

Now requesting price information for every restaurant.

In [None]:
FoursquareSearch.get_price_tier(df.venueID.to_list())

In [40]:
import pickle

with open("foursquare-requests/a5_venues_prices.txt", "rb") as file:
    venues_list = pickle.load(file)

In [None]:
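list_tiers = []
for x in venues_list:
    try:
        list_tiers.append(x['response']['venue']['price']['tier'])
    except (KeyError, TypeError):
        # No price information (or a failed request): record a missing value
        list_tiers.append(np.nan)

# Bracket assignment creates the column if it does not exist yet
df['price_tier'] = list_tiers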

Saving all the obtained information to a single CSV file.

In [None]:
df.to_csv('processed-requests.csv')

<h2>Methodology</h2>

In [10]:
import pandas as pd 

df = pd.read_csv('processed-requests.csv', index_col=0)

In [38]:
df[df.price_tier == 4]

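<table>
    <tr><th>Unnamed: 0</th><th>venueID</th><th>venue_name</th><th>category_name</th><th>categoryID</th><th>address</th><th>postcode</th><th>latitude</th><th>longitude</th><th>price_tier</th></tr>
    <tr><td>6</td><td>4b637f5cf964a5207f7e2ae3</td><td>Szechuan Gourmet Restaurant</td><td>Chinese Restaurant</td><td>4bf58dd8d48988d145941735</td><td>1033 Steeles Ave. W. (at Carpenter Rd.)North Y...</td><td>M2R 2S9</td><td>43.79162</td><td>-79.448052</td><td>4.0</td></tr>
    <tr><td>90</td><td>4b5d083ef964a520bd4f29e3</td><td>Ginko</td><td>Japanese Restaurant</td><td>4bf58dd8d48988d111941735</td><td>655 Dixon RoadToronto ON M9W 1J4Canada</td><td>M9W 1J4</td><td>43.689193</td><td>-79.578441</td><td>4.0</td></tr>
    <tr><td>134</td><td>4c1180cf72caa59362ad5da4</td><td>Kapit Bahay</td><td>Filipino Restaurant</td><td>4eb1bd1c3b7b55596b4a748f</td><td>4218 Lawrence Ave E (Lawrence / Morningside)To...</td><td>M1E 4X9</td><td>43.769319</td><td>-79.184442</td><td>4.0</td></tr>
    <tr><td>135</td><td>4bfefc3b68c7a593dc004044</td><td>Golden Wok Chinese Restaurant</td><td>Asian Restaurant</td><td>4bf58dd8d48988d142941735</td><td>120 Eringate Dr. Unit #3 (Renforth Dr)Toronto ...</td><td>M9C 3Z8</td><td>43.660491</td><td>-79.582319</td><td>4.0</td></tr>
    <tr><td>217</td><td>50468014e4b01c18bb731df8</td><td>Bosk at Shangri-La</td><td>Asian Restaurant</td><td>4bf58dd8d48988d142941735</td><td>188 University Ave.Toronto ON M5H 0A3Canada</td><td>M5H 0A3</td><td>43.649023</td><td>-79.385826</td><td>4.0</td></tr>
    <tr><td>242</td><td>4b12efd8f964a5202e9123e3</td><td>Prince Japanese Steakhouse</td><td>Japanese Restaurant</td><td>4bf58dd8d48988d111941735</td><td>5555 Eglinton Avenue West (Eglinton &amp; Spectrum...</td><td>M9C 5M1</td><td>43.648952</td><td>-79.605052</td><td>4.0</td></tr>
</table>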


In [12]:
df.price_tier.value_counts()

2.0    198
1.0     83
3.0      7
4.0      6
Name: price_tier, dtype: int64

Price tiers are defined as follows:
<br>
1 - Cheap 
<br>
2 - Moderate 
<br>
3 - Expensive
<br>
4 - Very Expensive
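
As a small worked example, the counts printed by `value_counts()` above can be turned into shares per tier. The counts and labels are copied from this notebook; the variable names are only illustrative.

```python
tier_labels = {1: 'Cheap', 2: 'Moderate', 3: 'Expensive', 4: 'Very Expensive'}
tier_counts = {2: 198, 1: 83, 3: 7, 4: 6}  # from the value_counts() output above

total = sum(tier_counts.values())
shares = {tier: count / total for tier, count in tier_counts.items()}

for tier in sorted(shares):
    print(f"{tier} ({tier_labels[tier]}): {tier_counts[tier]} venues, {shares[tier]:.1%}")
```

Tiers 1 and 2 together account for the large majority of the priced venues, so tiers 3 and 4 contribute very few points to any area-level average.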

In [49]:
import folium
import numpy as np 

map_toronto = folium.Map(location=[43.7, -79.4], zoom_start=11)
colors = {
    1: 'red',
    2: 'green',
    3: 'blue',
    4: 'black',
}
for lat,lng, price in zip(df.latitude, df.longitude, df.price_tier): 
    if not pd.isna(price): 
        marker = folium.CircleMarker(    
            [lat, lng], 
            radius=1,
            color=colors[price], 
            fill=True,
            fill_color=colors[price],
            fill_opacity=1
        )
        marker.add_to(map_toronto)

map_toronto