# Capstone Project - The Battle of Neighborhoods
## Finding an optimized method for delivering fresh produce to nearby restaurants and cafés in North York, Toronto

### Problem Description and Background Discussion

A farmer wants to expand the business into the local neighborhood by delivering fresh produce every day across North York, Toronto. To run the business successfully, fresh produce, vegetables, bread, coffee and similar goods must arrive on time. That calls for a delivery plan that reaches every nearby restaurant, coffee shop and café without delays. Ten delivery trucks are available, and all areas have to be covered. The best approach is to divide the potential merchants into segments based on their locations and to plan deliveries separately for each segment. This lets the farmer use all resources efficiently and supports a successful business plan.

### Description of the Data and Method to use the data to solve the problem

-> First, we need the geographical coordinates of the area where we want to set up the business, so we will find the location coordinates of all the neighborhoods in North York, Toronto. Information about Toronto's postal codes is available at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
We will use the geopy geocoders library and the data from that link to proceed with the project.

-> Second, the data will be processed with clustering. We will create clusters of the restaurants, cafés, bakeries and coffee shops using information from Foursquare. The venue information and the venues' geographical coordinates feed the clustering algorithm, which should yield a more structured way to deliver the produce.
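
The clustering idea can be sketched on a handful of points. The text does not name the algorithm at this stage, so the block below assumes plain k-means (Lloyd's algorithm) written with the standard library only. The function name `kmeans`, the toy coordinates and `k=2` are illustrative assumptions; for the real data, `k` would presumably be tied to the ten available trucks. Euclidean distance on latitude/longitude is only an approximation, which is tolerable for an area as small as North York.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on (lat, lng) pairs; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


# Hypothetical venue coordinates: two groups, far apart from each other.
venues = [
    (43.803, -79.363), (43.807, -79.357), (43.799, -79.369),
    (43.725, -79.480), (43.728, -79.476), (43.722, -79.484),
]
centroids, labels = kmeans(venues, k=2)
```

Each label can then be read as "which truck serves this venue", and each centroid as a rough centre of that truck's delivery area.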

### Data Preparation

In [1]:
import bs4
import lxml.html as lh
import numpy as np
import pandas as pd
import requests

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = requests.get(url)

In [3]:
soup = bs4.BeautifulSoup(req.text, "html5lib")

In [4]:
data = soup.select('.wikitable.sortable')
print (type(data))
print (len(data))

<class 'list'>
1


In [5]:
doc = lh.fromstring(req.content)
tr_elements = doc.xpath('//tr')
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

In [6]:
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
#For each cell of the header row, store its text (the column name) and an empty list
for t in tr_elements[0]:
    name=t.text_content()
    col.append((name,[]))

In [7]:
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
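
The header-then-rows pattern used above can be tried offline on a tiny inline table. The sketch below swaps in the standard library's `html.parser` in place of `lxml`, so it is not the notebook's own method. The class name `TableRows` and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample with the same three columns as the Wikipedia table.
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)

# First row is the header; the rest become one list of values per column.
header, *body = parser.rows
columns = {name: [row[i] for row in body] for i, name in enumerate(header)}
```

The resulting `columns` dictionary has the same shape as the `(name, values)` pairs collected in `col`, so it can be passed to `pd.DataFrame` in the same way.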

Creating the DataFrame from a list

In [8]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df = df[df.Borough != 'Not assigned']
df.rename(columns = {'Neighbourhood\n':'Neighbourhood'}, inplace = True)

In [9]:
# Strip the trailing newline from every neighbourhood name (no hard-coded row count)
df['Neighbourhood'] = df['Neighbourhood'].str.strip('\n')

In [10]:
df = df.reset_index(drop = True)

In [11]:
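# Where a neighbourhood is 'Not assigned', fall back to the borough name
unassigned = df['Neighbourhood'] == 'Not assigned'
df.loc[unassigned, 'Neighbourhood'] = df.loc[unassigned, 'Borough']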

In [12]:
df.head(5)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


In [13]:
df1 = (df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ','.join(set(x.dropna()))).reset_index())
df1.shape

(103, 3)
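
The grouping step above can be reproduced on a few rows. One detail worth noting: joining a `set` gives no guaranteed order, so the sketch below sorts the names first to make the combined string reproducible. The sample rows reuse values shown in the `df.head(5)` output; the variable names `rows` and `merged` are illustrative.

```python
import pandas as pd

# Three rows in the same layout as the cleaned postal-code table.
rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per (Postcode, Borough); neighbourhood names are de-duplicated,
# sorted, and joined into a single comma-separated string.
merged = (
    rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(lambda names: ", ".join(sorted(set(names.dropna()))))
    .reset_index()
)
```

Here the two `M5A` rows collapse into one, which is the same effect that reduces the full table to 103 postal codes.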

Adding Geographical Coordinates in the data

In [14]:
latlon = pd.read_csv('Geospatial_Coordinates.csv')
latlon.rename(columns = {'Postal Code':'Postcode'}, inplace = True)
df_final = pd.merge(df1,latlon,on=['Postcode'], how='left')
df_final.shape

(103, 5)
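
A left merge keeps every postal code even when no coordinates match, so missing values can slip through unnoticed. The following sketch shows one way to check for that with the `indicator` and `validate` options of `DataFrame.merge`. The two `M2*` rows reuse coordinates from the North York output in this notebook; `M9Z` is a made-up postal code added only to trigger the missing case.

```python
import pandas as pd

postcodes = pd.DataFrame({
    "Postcode": ["M2H", "M2J", "M9Z"],  # "M9Z" is hypothetical
    "Borough": ["North York", "North York", "North York"],
})
coords = pd.DataFrame({
    "Postcode": ["M2H", "M2J"],
    "Latitude": [43.803762, 43.778517],
    "Longitude": [-79.363452, -79.346556],
})

# validate raises if a postal code appears more than once on either side;
# indicator adds a "_merge" column saying where each row came from.
checked = postcodes.merge(
    coords, on="Postcode", how="left", validate="one_to_one", indicator=True
)

# Postal codes that received no coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
```

An empty `missing` list would confirm that every postal code in the table found its coordinates.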

In [15]:
df_final.to_csv("Toronto_data")

Using Foursquare for preparing the data for the particular borough

In [16]:
df1 = pd.read_csv('Toronto_data')
df1.drop(['Unnamed: 0'] , axis = 1 , inplace = True )
df_northyork = df1[df1['Borough'] == 'North York']

In [17]:
df_northyork.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 17 | M2H | North York | Hillcrest Village | 43.803762 | -79.363452 |
| 18 | M2J | North York | Oriole,Fairview,Henry Farm | 43.778517 | -79.346556 |
| 19 | M2K | North York | Bayview Village | 43.786947 | -79.385975 |
| 20 | M2L | North York | York Mills,Silver Hills | 43.75749 | -79.374714 |
| 21 | M2M | North York | Newtonbrook,Willowdale | 43.789053 | -79.408493 |


Neighborhood information for North York

In [18]:
def foursquare_explore(postal_code_list, neighborhood_list, lat_list, lng_list, LIMIT=500, radius=1000):
    """Query the Foursquare explore endpoint once per neighbourhood and collect the raw results."""
    result_ds = []
    for postal_code, neighborhood, lat, lng in zip(postal_code_list, neighborhood_list, lat_list, lng_list):

        # create the API request URL (CLIENT_ID, CLIENT_SECRET and VERSION are module-level settings)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION,
            lat, lng, radius, LIMIT)

        # make the GET request and keep only the list of recommended items
        results = requests.get(url).json()['response']['groups'][0]['items']

        # one record per neighbourhood, with the raw venue list attached
        result_ds.append({
            'Postal Code': postal_code,
            'Neighborhood(s)': neighborhood,
            'Latitude': lat,
            'Longitude': lng,
            'Crawling_result': results,
        })

    return result_ds
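
The function above indexes straight into `["response"]['groups'][0]['items']`, which raises an exception if a request fails or returns no groups. A defensive variant could look like the sketch below. The helper name `extract_items` is hypothetical, the successful payload shape is inferred from that indexing, and the shape of the failed payload is an assumption.

```python
def extract_items(payload):
    """Return the venue items of an explore-style payload, or [] if absent."""
    groups = payload.get("response", {}).get("groups") or []
    if not groups:
        return []
    return groups[0].get("items", [])


# Payload shaped like the one the notebook indexes into.
ok_payload = {"response": {"groups": [{"items": [{"venue": {"name": "Tim Hortons"}}]}]}}

# Assumed shape of a failed call: no "groups" key at all.
error_payload = {"meta": {"code": 400}, "response": {}}

items_ok = extract_items(ok_payload)
items_error = extract_items(error_payload)
```

Using such a helper means one failed neighbourhood contributes an empty list and does not stop the whole crawl.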

In [19]:
#FourSquare
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
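
API credentials are usually better kept outside the notebook. One possible approach, sketched here, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions, not something Foursquare prescribes.

```python
import os


def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping (the environment by default)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first"
        )
    return client_id, client_secret
```

In the notebook this would be used as `CLIENT_ID, CLIENT_SECRET = load_credentials()` after exporting both variables in the shell that starts Jupyter.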

In [20]:
NY_Foursquare_Dataset = foursquare_explore(list(df_northyork['Postcode']),list(df_northyork['Neighbourhood']),
                           list(df_northyork['Latitude']),list(df_northyork['Longitude']),)

In [21]:
NY_Foursquare_Dataset[1]

{'Postal Code': 'M2J',
 'Neighborhood(s)': 'Oriole,Fairview,Henry Farm',
 'Latitude': 43.7785175,
 'Longitude': -79.3465557,
 'Crawling_result': [{'reasons': {'count': 0,
    'items': [{'summary': 'This spot is popular',
      'type': 'general',
      'reasonName': 'globalInteractionReason'}]},
   'venue': {'id': '4e848fbb5c5c9240de8e6a80',
    'name': 'The LEGO Store',
    'location': {'address': '1800 Sheppard Ave E',
     'crossStreet': 'at Don Mills Rd',
     'lat': 43.77820727238842,
     'lng': -79.34348299621146,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.77820727238842,
       'lng': -79.34348299621146}],
     'distance': 249,
     'postalCode': 'M2J 5A7',
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['1800 Sheppard Ave E (at Don Mills Rd)',
      'Toronto ON M2J 5A7',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d1f3941735',
      'name': 'Toy / Game Store',
      'pluralName': 

Saving the North York Details from FourSquare into a DataFrame

In [22]:
def get_venue_dataset(foursquare_dataset):
    columns = ['Postcode', 'Neighbourhood',
               'Neighbourhood Latitude', 'Neighbourhood Longitude', 'Venue_id',
               'Venue', 'Venue Category', 'Venue_lat', 'Venue_lng']
    rows = []

    for neigh_dict in foursquare_dataset:
        postal_code = neigh_dict['Postal Code']
        neigh = neigh_dict['Neighborhood(s)']
        lat = neigh_dict['Latitude']
        lng = neigh_dict['Longitude']

        for venue_dict in neigh_dict['Crawling_result']:
            venue = venue_dict['venue']
            # collect plain dicts; DataFrame.append was removed in pandas 2.0
            rows.append({'Postcode': postal_code, 'Neighbourhood': neigh,
                         'Neighbourhood Latitude': lat, 'Neighbourhood Longitude': lng,
                         'Venue_id': venue['id'], 'Venue': venue['name'],
                         'Venue Category': venue['categories'][0]['name'],
                         'Venue_lat': venue['location']['lat'],
                         'Venue_lng': venue['location']['lng']})

    # build the DataFrame once, with a fixed column order
    return pd.DataFrame(rows, columns=columns)

In [23]:
df_NY = get_venue_dataset(NY_Foursquare_Dataset)

In [24]:
df_NY.shape

(609, 9)

In [25]:
# keep only food-related venues; the pattern is a regex alternation over category keywords
food_pattern = 'Coffee|Restaurant|Breakfast|Café|Bakery'
df_final = df_NY[df_NY['Venue Category'].str.contains(food_pattern)]

The final dataset looks like this. We will use it for clustering in the next part.

In [26]:
df_final.head()

| | Postcode | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue_id | Venue | Venue Category | Venue_lat | Venue_lng |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M2H | Hillcrest Village | 43.803762 | -79.363452 | 50c138e4e4b0276c8687f23c | 고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap | Korean Restaurant | 43.798391 | -79.369187 |
| 1 | M2H | Hillcrest Village | 43.803762 | -79.363452 | 4bd9842be914a593adbd56fa | Tastee | Bakery | 43.807722 | -79.356798 |
| 4 | M2H | Hillcrest Village | 43.803762 | -79.363452 | 4ad88a4ff964a520251221e3 | Tim Hortons | Coffee Shop | 43.798945 | -79.369644 |
| 10 | M2H | Hillcrest Village | 43.803762 | -79.363452 | 4b2e6ff4f964a52022e024e3 | Tim Hortons | Coffee Shop | 43.811852 | -79.359501 |
| 11 | M2H | Hillcrest Village | 43.803762 | -79.363452 | 4dd44ae41f6ec4e0bb76d28e | New Greattime Corp. | Chinese Restaurant | 43.807414 | -79.356717 |


In [27]:
df_final.to_csv("Northyork_Venues")