## Segmenting and Clustering Neighborhoods in Toronto - Final Assessment (Part I)

# <span style='background :orange'><font color = 'blue'>1. Create DataFrame</font></span>

### <span style='background :yellow'>Import the library we use to open URLs</span>

In [2]:
import urllib.request

### <span style='background :yellow'>Specify which URL/web page we are going to scrape</span>

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#### <span style='background :yellow'>Open the URL using urllib.request and put the HTML into the page variable</span>

In [4]:
page = urllib.request.urlopen(url)

### <span style='background :yellow'>Install the beautifulsoup4 library for extracting data from the page's HTML</span>

In [5]:
!pip install beautifulsoup4

### <span style='background :yellow'>Install the lxml library for parsing</span>

In [6]:
!pip install lxml

### <span style='background :yellow'>Import the BeautifulSoup library so we can parse HTML and XML documents</span>

In [7]:
from bs4 import BeautifulSoup

### <span style='background :yellow'>Parse the HTML from our URL into the BeautifulSoup parse tree format</span>

In [8]:
soup = BeautifulSoup(page, "lxml")

In [9]:
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

### <span style='background :yellow'><font color = 'blue'>Look at the HTML source of the Wikipedia page by right-clicking the page and selecting "View source"</font></span>
<span style='background :yellow'><font color = 'blue'>This identifies the tags we need: the table sits between table tags, each row is a tr element, and each cell within a row is a td element.</font></span>

In [10]:
table=soup.find('table', class_='wikitable sortable')

#### <span style='background :yellow'>Empty lists, one for each column</span>

In [11]:
A=[]
B=[]
C=[]

### <span style='background :yellow'>Iterate through each row ("tr" tag) and append the column values to lists A, B, C</span>

In [12]:
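for row in table.find_all('tr'):
    cells = row.find_all('td')
    # keep only data rows with exactly three cells; header rows use th and are skipped
    if len(cells) == 3:
        A.append(cells[0].find(string=True))  # postal code
        B.append(cells[1].find(string=True))  # borough
        C.append(cells[2].find(string=True))  # neighborhood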

#### <span style='background :yellow'>Cleaning up the data</span>

In [13]:
A = [sub.replace('\n', '') for sub in A] 
B = [sub.replace('\n', '') for sub in B] 
C = [sub.replace('\n', '') for sub in C] 
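
The row loop and the newline cleanup above can be tried offline on a small inline table. The following is a minimal sketch, assuming the built-in `html.parser` backend instead of `lxml` and a made-up HTML snippet; `SAMPLE_HTML` and `parse_rows` are hypothetical names that do not appear in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page is not fetched here.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return one [postal_code, borough, neighborhood] list per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) == 3:  # the header row only has th cells, so it is skipped
            # get_text(strip=True) removes the trailing newline in one step
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Using `get_text(strip=True)` is one possible alternative to the separate `replace('\n', '')` pass; both approaches leave clean cell values for this sample.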

In [14]:
import pandas as pd
df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neighborhood']=C
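
As a possible alternative to adding the columns one at a time, the three lists can be passed to `pd.DataFrame` as a dictionary. This is only a sketch with two illustrative rows; `df_example` is a hypothetical name, and in the notebook the lists come from the scraped table.

```python
import pandas as pd

# Illustrative values only; the notebook fills these lists from the scraped table.
A = ["M3A", "M4A"]
B = ["North York", "North York"]
C = ["Parkwoods", "Victoria Village"]

# Every list must have the same length, otherwise pandas raises a ValueError.
df_example = pd.DataFrame({"Postal Code": A, "Borough": B, "Neighborhood": C})
print(df_example.shape)
```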


In [18]:
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


### Filter out rows where Borough is 'Not assigned' so that only rows with an assigned borough are processed

In [19]:
Borough_NA = df['Borough']!='Not assigned'

In [20]:
df_B = df[Borough_NA]
df_B.head(15)

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |
| 11 | M3B | North York | Don Mills |
| 12 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |


In [21]:
df_B.shape

(103, 3)
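
The boolean-mask filter keeps the original row labels, which is why the output above skips indices such as 7 and 10. The sketch below, built on a four-row sample, shows how `reset_index(drop=True)` could renumber the rows if a contiguous index is wanted; `df_example` and `df_assigned` are hypothetical names.

```python
import pandas as pd

# Small sample in the same shape as the scraped table.
df_example = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# True for rows that have a borough, False for 'Not assigned'.
has_borough = df_example["Borough"] != "Not assigned"

# Keep the matching rows and renumber them from 0.
df_assigned = df_example[has_borough].reset_index(drop=True)
print(df_assigned)
```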

### 2. Get the Longitude and Latitude coordinates of each neighborhood 

In [22]:
df_geo=pd.read_csv('http://cocl.us/Geospatial_data')
df_geo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Join the neighborhood DataFrame _df_B_ with the geocode DataFrame _df_geo_ on the postal code

In [23]:
df_loc = pd.merge(df_B, df_geo, on='Postal Code')
df_loc.head(10)

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
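
`pd.merge` defaults to an inner join, so a postal code without coordinates would silently drop out of the result. The sketch below shows one way to detect that, assuming a left join with the `indicator` and `validate` options; the frames are small samples, and "M9Z" is a made-up code used only to trigger the unmatched case.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is hypothetical and has no coordinates
    "Borough": ["North York", "North York", "Example"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighborhood row; indicator=True adds a "_merge" column;
# validate="one_to_one" raises if a postal code is duplicated on either side.
merged = pd.merge(neighborhoods, coords, on="Postal Code", how="left",
                  validate="one_to_one", indicator=True)

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that the inner join used in the notebook loses no rows.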


### 3. Explore and cluster the neighborhoods in Toronto

### Create a DataFrame of the boroughs whose names contain the word Toronto

In [24]:
Toronto = df_loc['Borough'].str.contains("Toronto") 
df_BT = df_loc[Toronto]
df_BT.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |


In [25]:
df_BT.shape

(39, 5)
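
`str.contains` treats its pattern as a regular expression by default and, depending on the pandas version, may leave missing values as NaN in the mask. The following sketch on a hypothetical `boroughs` series shows the `na` and `regex` options that make the substring filter explicit.

```python
import pandas as pd

# Sample values, including one missing entry.
boroughs = pd.Series(["Downtown Toronto", "North York", None, "East Toronto"])

# na=False maps missing entries to False; regex=False does a plain substring match.
mask = boroughs.str.contains("Toronto", na=False, regex=False)
print(boroughs[mask].tolist())
```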

In [26]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

In [27]:
!pip install geopy


In [28]:
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

In [29]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)


In [30]:
Latitude = location.latitude
Longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(Latitude, Longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
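
geopy's `geocode` returns `None` when it finds no match, in which case reading the latitude would fail. The sketch below wraps the lookup in a hypothetical helper, `resolve_coordinates`, and uses a fake geocoder (`fake_geocode`, `FakeLocation`) so that it runs without a network call; none of these names come from the notebook or from geopy.

```python
from collections import namedtuple

# Minimal stand-in for a geopy Location; only the two attributes used here.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def resolve_coordinates(geocode, address, fallback=None):
    """Call geocode(address) and return (lat, lng), or fallback if nothing is found."""
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)

def fake_geocode(address):
    # Hypothetical lookup table replacing the network call to Nominatim.
    known = {"Toronto": FakeLocation(43.6534817, -79.3839347)}
    return known.get(address)

print(resolve_coordinates(fake_geocode, "Toronto"))
print(resolve_coordinates(fake_geocode, "Nowhere", fallback=(0.0, 0.0)))
```

In the notebook, `geolocator.geocode` could be passed in place of `fake_geocode`.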


## Create a map of Toronto using the latitude and longitude values of the boroughs whose names contain the word 'Toronto'

In [31]:
map_Toronto = folium.Map(location=[Latitude, Longitude], zoom_start=11)
# add markers to map
for lat, lng, label in zip(df_BT['Latitude'], df_BT['Longitude'], df_BT['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

## Utilizing the Foursquare API to explore the neighborhoods and segment them.

In [45]:
CLIENT_ID = 'XXXXXXXXXXXXX' # your Foursquare ID
CLIENT_SECRET = 'XXXXXXXXXXXXXX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

### Get up to 100 venues within 1,000 m of the Toronto coordinates found above

In [39]:
LIMIT = 100
radius = 1000

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Latitude, 
    Longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=XXXXXXXXXXXXX&client_secret=XXXXXXXXXXXXXX&v=20180604&ll=43.6534817,-79.3839347&radius=1000&limit=100'
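
The query string can also be assembled with `urllib.parse.urlencode`, which takes care of escaping each value. This is a sketch rather than the notebook's method: `build_explore_url` is a hypothetical helper, and the credentials are placeholders. Because the resulting URL embeds the client secret, it is safer to inspect individual parameters than to display the whole string.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; real values should never be printed or committed.
example_url = build_explore_url("XXXX", "YYYY", "20180605",
                                43.6534817, -79.3839347, 1000, 100)

# Decode the query string again to check one parameter without showing the secret.
query = parse_qs(urlsplit(example_url).query)
print(query["ll"])
```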

## Send the GET request, examine the result, and create a DataFrame with venue names, categories, and coordinates

In [40]:
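import requests # library to handle requests
try:
    from pandas import json_normalize # transform JSON into a pandas dataframe (pandas 1.0 and later)
except ImportError:
    from pandas.io.json import json_normalize # import path in older pandas versions
results = requests.get(url).json()
#results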
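
Before indexing into `results['response']['groups'][0]['items']`, it can help to confirm that the response has the expected shape. The sketch below assumes the payload carries a `meta.code` status field next to `response`; `extract_items` is a hypothetical helper, and the sample payloads are made up.

```python
def extract_items(results):
    """Return the venue items from an explore-style response, or [] if there are none."""
    code = results.get("meta", {}).get("code")
    if code != 200:
        raise ValueError("request failed with meta code {}".format(code))
    groups = results.get("response", {}).get("groups", [])
    return groups[0].get("items", []) if groups else []

# Hypothetical payloads shaped like the structure the notebook indexes into.
ok = {"meta": {"code": 200},
      "response": {"groups": [{"items": [{"venue": {"name": "Indigo"}}]}]}}
empty = {"meta": {"code": 200}, "response": {"groups": []}}

print(len(extract_items(ok)), len(extract_items(empty)))
```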

In [41]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened rows store the list under the dotted column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [42]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Downtown Toronto | Neighborhood | 43.653232 | -79.385296 |
| 1 | Nathan Phillips Square | Plaza | 43.65227 | -79.383516 |
| 2 | Indigo | Bookstore | 43.653515 | -79.380696 |
| 3 | Chatime 日出茶太 | Bubble Tea Shop | 43.655542 | -79.384684 |
| 4 | Textile Museum of Canada | Art Museum | 43.654396 | -79.3865 |
| 5 | LUSH | Cosmetics Shop | 43.653557 | -79.3804 |
| 6 | UNIQLO ユニクロ | Clothing Store | 43.65591 | -79.380641 |
| 7 | CF Toronto Eaton Centre | Shopping Mall | 43.65454 | -79.380677 |
| 8 | Ed Mirvish Theatre | Theater | 43.655102 | -79.379768 |
| 9 | Four Seasons Centre for the Performing Arts | Concert Hall | 43.650592 | -79.385806 |
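
The flattening steps can be checked on a small hand-written payload. The sketch below assumes pandas 1.0 or later, where `pd.json_normalize` is available at the top level; the `items` list is hypothetical and only mimics the nested structure, and "Unlabelled Place" is an invented venue with no category.

```python
import pandas as pd

# Hypothetical items mimicking the nested venue structure.
items = [
    {"venue": {"name": "Indigo",
               "categories": [{"name": "Bookstore"}],
               "location": {"lat": 43.653515, "lng": -79.380696}}},
    {"venue": {"name": "Unlabelled Place",
               "categories": [],
               "location": {"lat": 43.65, "lng": -79.38}}},
]

flat = pd.json_normalize(items)  # nested keys become dotted column names
wanted = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
flat = flat[wanted].copy()

# Keep the first category name, or None when the list is empty.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None)

# Keep only the last part of each dotted column name.
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```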


## <span style='background :yellow'><font color = 'blue'> Get the nearby Coffee Shops and their respective coordinates</font></span>

In [43]:
Coffee = nearby_venues['categories'].str.contains("Coffee", na=False)  # na=False: venues without a category count as non-matches
df_Coffee = nearby_venues[Coffee]
df_Coffee.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 12 | M Square Coffee Co | Coffee Shop | 43.651218 | -79.383555 |
| 24 | Bulldog On The Block | Coffee Shop | 43.650652 | -79.384141 |
| 32 | Hailed Coffee | Coffee Shop | 43.658833 | -79.383684 |
| 44 | Jimmy's Coffee | Coffee Shop | 43.658421 | -79.385613 |
| 47 | Tim Hortons | Coffee Shop | 43.65857 | -79.385123 |


## <span style='background :yellow'><font color = 'blue'>Mark the coffee shops on the map for the boroughs whose names contain Toronto</font></span>

In [44]:
nearby_CoffeeShops = folium.Map(location=[Latitude, Longitude], zoom_start=14)
# add markers to map
for lat, lng, label in zip(df_Coffee['lat'], df_Coffee['lng'], df_Coffee['name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='brown',
        fill=True,
        fill_color='#800080',
        fill_opacity=0.7,
        parse_html=False).add_to(nearby_CoffeeShops)  
    
nearby_CoffeeShops