#### Applied Data Science Capstone : Week 3
# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto


## Instructions
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to learn to be agile and to pick up new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas  dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook on your Github repository.

--------------------------------

# My submission

> For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.
>
> 1.    Start by creating a new Notebook for this assignment.
> 2.    Use the Notebook to build the code to scrape the following Wikipedia page,  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas  dataframe like the one shown below:
> <!--
> ![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1613952000000&hmac=JvKSi4GKq0HQtojOsFoeEBkFPN0xzcxSE5EoUy0mpLk)
> !-->

# Question 1: Create the PostalCode dataframe from Canada Wikipedia page

In [2]:
#!pip install yfinance
!pip install pandas
!pip install requests
!pip install bs4
#!pip install plotly
print("Installation done")

Installation done


> **Note**: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas  to read the table into a pandas dataframe.
>
> Another way, which would help to learn for more complicated cases of web scraping is using the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/
>
> Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.

In [42]:
#import yfinance as yf
import pandas as pd # library for data analysis
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
import requests
from bs4 import BeautifulSoup
#import plotly.graph_objects as go
#from plotly.subplots import make_subplots

In [43]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url)
html_data

<Response [200]>

In [44]:
# Parse the html data using beautiful_soup
soup = BeautifulSoup(html_data.text, 'html.parser')

In [45]:
# soup.find("tbody")

> 3. To create the above dataframe:
>   - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [46]:
rows = []  # collect one dict per table row

for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    # skip rows without the three expected <td> cells (for example, header rows)
    if len(col) < 3:
        continue
    rows.append({
        "PostalCode": col[0].get_text(strip=True),
        "Borough": col[1].get_text(strip=True),
        "Neighborhood": col[2].get_text(strip=True),
    })

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhoods = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

In [47]:
#neighborhoods.head(12)

> - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [48]:
# Remove rows whose borough is "Not assigned"
neighborhoods = neighborhoods[neighborhoods.Borough != 'Not assigned']
#neighborhoods.head(12)

> - If a cell has a borough but a **Not assigned**  neighborhood, then the neighborhood will be the same as the borough.

In [49]:
# Check whether any rows have a "Not assigned" Neighborhood
# neighborhoods[neighborhoods.Neighborhood == "Not assigned"]

In [50]:
# Assign the Borough value to the Neighborhood column if the Neighborhood is "Not assigned"
neighborhoods.Neighborhood = neighborhoods.Borough.where(neighborhoods.Neighborhood == "Not assigned", 
                                                         neighborhoods.Neighborhood)

In [51]:
# Sort by PostalCode
#neighborhoods.sort_values(by=["PostalCode"], inplace=True)

In [52]:
# Renumber index
neighborhoods.reset_index(drop=True, inplace=True)
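
The two cleaning rules quoted above, followed by the index reset, can be checked on a small hand-made table. This is only a sketch: the three rows are illustrative stand-ins for the scraped data (the `M9Z` row is hypothetical), and `Series.mask` is used here as an equivalent alternative to the `where` call in the notebook.

```python
import pandas as pd

# Hand-made rows standing in for the scraped table (illustrative values only;
# the "M9Z" row is hypothetical)
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M9Z"],
        "Borough": ["Not assigned", "North York", "Etobicoke"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

# Rule 1: keep only rows with an assigned borough
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighborhood is "Not assigned", reuse the borough name
cleaned["Neighborhood"] = cleaned["Neighborhood"].mask(
    cleaned["Neighborhood"] == "Not assigned", cleaned["Borough"]
)

# Renumber the index after filtering
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Taking a `.copy()` of the filtered frame before assigning to a column avoids pandas' chained-assignment warning.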

In [53]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


In [54]:
# Show head and tail of the dataFrame
neighborhoods.head(12)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
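
Each row of this dataframe corresponds to one postal code, and a single row may list several neighborhood names separated by commas. As a rough sketch of how the individual names could be counted, the example below splits three rows taken from the table above; the variable names are illustrative only.

```python
import pandas as pd

# Three rows copied from the table above
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M5A"],
        "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
    }
)

# Split the comma-separated names and give each name its own row
names = sample["Neighborhood"].str.split(", ").explode()

print(len(sample), "postal codes,", len(names), "neighborhood names")
```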


> - In the last cell of your notebook, use the **.shape** method to print the number of rows of your dataframe.

In [55]:
neighborhoods.shape

(103, 3)

-----------------------------------------

# Question 2: Add the latitude and the longitude to the PostalCode dataframe

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. 

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code may return None, and the same call made again may return the coordinates. So, to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:

```python
import geocoder # import geocoder

# postal code to look up (M5G is the example from the text above)
postal_code = 'M5G'

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
```
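
The loop above never terminates if the service keeps returning None. A bounded variant is sketched below under stated assumptions: `retry_until_result` is a hypothetical helper, not part of the Geocoder package, and the flaky lookup is simulated with placeholder coordinates so the example runs without network access.

```python
import time

def retry_until_result(lookup, max_attempts=5, delay_seconds=0.0):
    """Call lookup() until it returns something other than None, or give up."""
    for attempt in range(1, max_attempts + 1):
        result = lookup()
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # optional pause between attempts
    return None

# Hypothetical stand-in for a flaky geocoder: returns None twice, then
# placeholder coordinates (not real geocoding output)
responses = iter([None, None, [43.65, -79.38]])
coords = retry_until_result(lambda: next(responses))
print(coords)
```

With the real package, the lookup passed in would presumably be something like `lambda: geocoder.google('{}, Toronto, Ontario'.format(postal_code)).latlng`, and the caller would need to handle a `None` result after the attempts run out.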

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:

<!--
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1613952000000&hmac=geGbWYm188DjJP-tPLg4ZPmTphxDPENiC0Xup7hcM94)
-->
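
If the csv route is taken, one way to attach the coordinates is a join on the postal code. The sketch below uses two rows whose values appear in this notebook's output; the csv is stood in for by an in-memory frame, and its column names (`Postal Code`, `Latitude`, `Longitude`) are assumed to match the file.

```python
import pandas as pd

# Two rows from the postal code table built in Question 1
postal = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    }
)

# Stand-in for the csv contents; the column names are an assumption
geo = pd.DataFrame(
    {
        "Postal Code": ["M4A", "M3A"],
        "Latitude": [43.725882, 43.753259],
        "Longitude": [-79.315572, -79.329656],
    }
)

merged = postal.merge(
    geo.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",  # keep every postal code, even one missing from the csv
)
print(merged)
```

A left join keeps the row order of the postal code table and leaves NaN coordinates for any postal code that the csv does not contain.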

**Important Note**: There is a limit on how many times you can call the geocoder.google function: 2500 times per day. This should be more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in Toronto.

Once you are able to create the above dataframe, submit a link to the new Notebook on your Github repository. (2 marks)

**Note**: When including the link, do not copy and paste the URL. Use the embedded link option in the formatting tools of the Response field to include the link. Check the image displayed below.

In [56]:
!pip install geocoder
print("installed")

installed


In [57]:
import geocoder # import geocoder
print("imported")

imported


In [58]:
def GetLatitudeLongitude(Postal_code, max_attempts=10):
    # the geocoder may return None intermittently, so retry a bounded number of times
    lat_lng_coords = None
    attempts = 0

    # loop until you get the coordinates or run out of attempts
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(Postal_code))
        lat_lng_coords = g.latlng
        attempts += 1

    # give up instead of looping forever if no coordinates were returned
    if lat_lng_coords is None:
        return None, None

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    print("latitude=", latitude, "longitude=", longitude)
    return latitude, longitude

In [59]:
#GetLatitudeLongitude("M5G")

> Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [60]:
geo_postal_code_path = "http://cocl.us/Geospatial_data"
geo_postal_code = pd.read_csv(geo_postal_code_path)
geo_postal_code.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

In [61]:
#geo_postal_code.head()

In [62]:
def GetLatitudeLongitudeFromCSV(Postal_code):
    # select the csv rows that match the postal code
    matches = geo_postal_code[geo_postal_code["PostalCode"] == Postal_code]
    if matches.empty:
        # no entry for this postal code in the csv
        latitude = None
        longitude = None
    else:
        latitude = matches.iloc[0]["Latitude"]
        longitude = matches.iloc[0]["Longitude"]
    return latitude, longitude

In [63]:
#lat, lng = GetLatitudeLongitudeFromCSV("M2H")
#print(lat, lng)

In [64]:
# Append dummy Latitude and Longitude
neighborhoods["Latitude"] = -1.0
neighborhoods["Longitude"] = -1.0

# Append Latitude and Longitude into the PostalCode dataframe
for i, row in neighborhoods.iterrows():
    latitude, longitude = GetLatitudeLongitudeFromCSV( row.at["PostalCode"])
    neighborhoods.at[i,'Latitude'] = latitude
    neighborhoods.at[i,'Longitude'] =  longitude

In [65]:
neighborhoods.head(12)

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |


----------------------------

# Question 3: Generate Maps

> Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you. 
>
> Just make sure:
>
> 1. to add enough Markdown cells to explain what you decided to do and to report any observations you make. 
> 2. to generate maps to visualize your neighborhoods and how they cluster together. 
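
As a small sketch of the "boroughs that contain the word Toronto" option mentioned above, a string match on the Borough column would select them. The borough names below are the ones that appear in this notebook's data; the frame itself is a stand-in built by hand.

```python
import pandas as pd

# Borough names as they appear in this notebook's data
boroughs = pd.DataFrame(
    {
        "Borough": [
            "Central Toronto", "Downtown Toronto", "East Toronto", "East York",
            "Etobicoke", "Mississauga", "North York", "Scarborough",
            "West Toronto", "York",
        ]
    }
)

# Keep only the boroughs whose name contains the word "Toronto"
toronto_only = boroughs[boroughs["Borough"].str.contains("Toronto")]
print(toronto_only["Borough"].tolist())
```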

In [66]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install folium==0.5.0
import folium # map rendering library

print("Imported")

Imported


### Use geopy library to get the latitude and longitude values of Toronto City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

In [67]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### Create a map of Toronto with neighborhoods superimposed on top.

In [68]:
!jupyter trust Week3_Segmenting_and_Clustering_Neighborhoods.ipynb

Signing notebook: Week3_Segmenting_and_Clustering_Neighborhoods.ipynb


In [69]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [70]:
neighborhoods.groupby("Borough").count()

| Borough | PostalCode | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| Central Toronto | 9 | 9 | 9 | 9 |
| Downtown Toronto | 19 | 19 | 19 | 19 |
| East Toronto | 5 | 5 | 5 | 5 |
| East York | 5 | 5 | 5 | 5 |
| Etobicoke | 12 | 12 | 12 | 12 |
| Mississauga | 1 | 1 | 1 | 1 |
| North York | 24 | 24 | 24 | 24 |
| Scarborough | 17 | 17 | 17 | 17 |
| West Toronto | 6 | 6 | 6 | 6 |
| York | 5 | 5 | 5 | 5 |


For illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in *North York*, the borough with the most rows in the table above. So let's slice the original dataframe and create a new dataframe of the *North York* data.

In [71]:
northyork_data = neighborhoods[neighborhoods['Borough'] == 'North York'].reset_index(drop=True)
northyork_data.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 3 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 4 | M6B | North York | Glencairn | 43.709577 | -79.445073 |


Let's get the geographical coordinates of North York.

In [72]:
address = 'North York, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


As we did with all of Toronto, let's visualize North York and the neighborhoods in it.

In [73]:
# create map of North York using latitude and longitude values
map_northyork = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(northyork_data['Latitude'], northyork_data['Longitude'], northyork_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northyork)  
    
map_northyork