## Capstone project

- In this assignment, I will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike New York, Toronto's neighborhood data is not readily available on the internet in a structured form. What is interesting about data science is that each project is challenging in its own way, so I need to stay agile and get better at picking up new libraries and tools quickly, depending on the project.


- For the Toronto neighborhood data, a Wikipedia page has all the information I need to explore and cluster the neighborhoods in Toronto. I will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas DataFrame so that it is in a structured format like the New York dataset.

In [1]:
# importing libraries

import pandas as pd
import numpy as np

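from bs4 import BeautifulSoup
import requests

# needed later for geocoding and mapping (Part 3)
from geopy.geocoders import Nominatim
import folium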

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


##### Scraping the Wikipedia page

 - I request the web page with the requests library and parse its HTML with the Beautiful Soup library. Since I only need one specific table from the page, I use the find method to locate the first table tag in the parsed HTML, which gives me the HTML code for that table.

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(source, 'lxml')

table = soup.find('table')

#print(table.prettify())

# header = table.th.text

# print(header)

Now that I have the table's HTML, I convert it into a pandas DataFrame by passing that HTML to the read_html function of pandas. read_html returns a list of DataFrames, one per table it finds, so I take the first element.

In [4]:
post_code_can = pd.read_html(table.prettify(),header=None,skiprows=0)[0]
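
As a minimal sketch of what `read_html` does in this step, the snippet below parses a tiny hand-written table instead of the live Wikipedia page. The `sample_html` string and its rows are made up for illustration, and wrapping the markup in `StringIO` is my own choice, since recent pandas versions prefer a file-like object over a raw HTML string. It assumes an HTML parser such as `lxml` is installed, as the notebook already does.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(sample_html), header=0)
sample_df = tables[0]

print(sample_df.columns.tolist())
print(sample_df.shape)
```

Passing `header=0` tells pandas to use the first row as column names, which is what the later `post_code_can.Borough` access relies on.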

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [5]:
post_code_can = post_code_can[post_code_can.Borough != 'Not assigned']

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [6]:
df1=post_code_can.groupby("Postcode").agg(lambda x:','.join(set(x)))
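
One caveat with `set(x)` is that the order of the joined neighborhoods is not guaranteed. As a hedged alternative sketch (not what the cell above does), the snippet below keeps the first-seen order with `dict.fromkeys` and takes the borough with `'first'`. The `rows` frame and the `join_unique` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical sample rows shaped like the scraped table.
rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Not assigned"],
})


def join_unique(values):
    # dict.fromkeys drops duplicates while keeping first-seen order
    return ",".join(dict.fromkeys(values))


# One row per postal code: borough taken once, neighborhoods joined with commas.
combined = rows.groupby("Postcode").agg(
    {"Borough": "first", "Neighborhood": join_unique}
)
print(combined)
```

This assumes every row of a postal code shares the same borough, which holds for the rows shown in this notebook.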

In [13]:
df1.head()

Postcode,Borough,Neighborhood
M1B,Scarborough,"Malvern,Rouge"
M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
M1E,Scarborough,"Morningside,West Hill,Guildwood"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [14]:
df1.loc[df1['Neighborhood']=="Not assigned",'Neighborhood']=df1.loc[df1['Neighborhood']=="Not assigned",'Borough']
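
The same replacement can be written with `Series.mask`, which may read a little more directly. This is only a sketch on a hypothetical two-row frame named `sample`, not a change to the cell above.

```python
import pandas as pd

# Hypothetical two-row frame indexed by postal code.
sample = pd.DataFrame(
    {
        "Borough": ["Queen's Park", "Scarborough"],
        "Neighborhood": ["Not assigned", "Woburn"],
    },
    index=pd.Index(["M7A", "M1G"], name="Postcode"),
)

# Where the neighborhood is missing, fall back to the borough name.
missing = sample["Neighborhood"] == "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].mask(missing, sample["Borough"])
print(sample)
```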

In [15]:
# test

df1.loc[df1.Borough == "Queen's Park"]

Postcode,Borough,Neighborhood
M7A,Queen's Park,Queen's Park


In [16]:
df1

Postcode,Borough,Neighborhood
M1B,Scarborough,"Malvern,Rouge"
M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
M1E,Scarborough,"Morningside,West Hill,Guildwood"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
M1J,Scarborough,Scarborough Village
M1K,Scarborough,"Kennedy Park,Ionview,East Birchmount Park"
M1L,Scarborough,"Oakridge,Clairlea,Golden Mile"
M1M,Scarborough,"Cliffside,Cliffcrest,Scarborough Village West"
M1N,Scarborough,"Birch Cliff,Cliffside West"


In [17]:
df1.shape

(103, 2)

### Finding the Latitude and Longitude of Each Postal Code

In [18]:
geo_code = pd.read_csv('https://cocl.us/Geospatial_data')

In [19]:
geo_code.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [20]:
df1['Latitute'] = geo_code['Latitude'].values
df1['Longitude'] = geo_code['Longitude'].values
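
Assigning `.values` works only because both tables happen to be sorted by postal code in the same order and have the same length. A more defensive sketch, assuming the column names shown in the outputs above, joins on the postal code itself. The small `boroughs` and `coords` frames are hypothetical stand-ins for `df1` and `geo_code`, deliberately given in different orders.

```python
import pandas as pd

# Stand-in for df1: indexed by postal code, deliberately not sorted.
boroughs = pd.DataFrame(
    {
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Woburn", "Malvern,Rouge"],
    },
    index=pd.Index(["M1G", "M1B"], name="Postcode"),
)

# Stand-in for geo_code, using two rows from the CSV output above.
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1G"],
    "Latitude": [43.806686, 43.770992],
    "Longitude": [-79.194353, -79.216917],
})

# Align on the postal code instead of on row position.
joined = boroughs.join(coords.set_index("Postal Code"), how="left")
print(joined)
```

With a left join, any postal code missing from the coordinates file would show up as `NaN` rather than silently shifting every row.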

In [21]:
df1

Postcode,Borough,Neighborhood,Latitute,Longitude
M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek",43.784535,-79.160497
M1E,Scarborough,"Morningside,West Hill,Guildwood",43.763573,-79.188711
M1G,Scarborough,Woburn,43.770992,-79.216917
M1H,Scarborough,Cedarbrae,43.773136,-79.239476
M1J,Scarborough,Scarborough Village,43.744734,-79.239476
M1K,Scarborough,"Kennedy Park,Ionview,East Birchmount Park",43.727929,-79.262029
M1L,Scarborough,"Oakridge,Clairlea,Golden Mile",43.711112,-79.284577
M1M,Scarborough,"Cliffside,Cliffcrest,Scarborough Village West",43.716316,-79.239476
M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


### Part 3

In [55]:
# Get coordinates for Toronto for initial map
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="toronto")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Toronto are {latitude}, {longitude}.')

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [58]:
# create map centered on Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df1['Latitute'], df1['Longitude'], df1['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [63]:
import os  # credentials come from (assumed) environment variables instead of being hardcoded
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']

In [64]:
#build one DataFrame with top venues per neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    # Iterate through list of neighborhoods
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore'

        params = dict(client_id=CLIENT_ID,
                      client_secret=CLIENT_SECRET,
                      v='20180323',
                      ll=f'{lat},{lng}',
                      radius=radius,
                      limit=100
                     )
            
        # make the GET request
        results = requests.get(url=url, params=params).json()["response"]['groups'][0]['items']
        
        # keep only the relevant fields per venue (one list of tuples per neighborhood)
        venues_list.append([(name,
                             lat,
                             lng,
                             v['venue']['name'],
                             v['venue']['location']['lat'],
                             v['venue']['location']['lng'],
                             v['venue']['categories'][0]['name']) for v in results]
                          )

    # Create DataFrame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return nearby_venues
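
The function above indexes straight into the JSON response, so a venue without a category, or a response without groups, would raise an error. The sketch below shows one hedged way to extract the same seven fields more defensively. `extract_venue_rows` and the `payload` dictionary are hypothetical names for illustration; the payload only mimics the keys that `getNearbyVenues` reads and is not a real API response.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response dict into a list of venue tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0].get("name") if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hypothetical payload with one complete venue and one venue lacking a category.
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorized venue",
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]}]}}

rows = extract_venue_rows("The Beaches", 43.676357, -79.293031, payload)
print(rows)
```

An empty or unexpected response simply yields an empty list, so the loop over neighborhoods can continue.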

In [65]:
borough_toronto = df1[df1['Borough'].str.contains('Toronto')]
borough_toronto.shape

(39, 4)

In [66]:
# Create DataFrame with one row per venue returned matching the neighborhood it belongs to 
toronto_venues = getNearbyVenues(names=borough_toronto['Neighborhood'],
                                   latitudes=borough_toronto['Latitute'],
                                   longitudes=borough_toronto['Longitude']
                                  )

In [67]:
# Check size of DataFrame
toronto_venues.shape

(1675, 7)

In [68]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Glen Stewart Ravine,43.6763,-79.294784,Other Great Outdoors
4,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood


In [69]:
# Count neighborhoods for which the venue search returned nothing
missing_count = borough_toronto["Neighborhood"].nunique() - toronto_venues["Neighborhood"].nunique()
print(f'No venues were returned for {missing_count} neighborhood(s).')

No venues were returned for 1 neighborhood(s).


In [70]:
# Compare total neighborhoods with neighborhoods for which venues have been found
nhoods_vs_venues = (borough_toronto[['Neighborhood']].merge(toronto_venues[['Neighborhood', 'Venue']]
                                                    .groupby(by='Neighborhood')
                                                    .agg('count')
                                                    .reset_index(drop=False), on='Neighborhood', how='outer')
                   )

# Find neighborhoods without any venues
nhoods_wo_venues = nhoods_vs_venues[nhoods_vs_venues['Venue'].isnull()]['Neighborhood'].values.tolist()
nl = '\n'
print(f'Neighborhoods without venues:{nl}{nl.join(nhoods_wo_venues)}')

Neighborhoods without venues:
Queen's Park
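
The same check can also be sketched without a merge, by testing membership with `isin`. The two small objects below, `all_neighborhoods` and `venues`, are hypothetical stand-ins for `borough_toronto['Neighborhood']` and `toronto_venues`.

```python
import pandas as pd

# Stand-in for the neighborhoods that were queried.
all_neighborhoods = pd.Series(["The Beaches", "Queen's Park"], name="Neighborhood")

# Stand-in for the venues that came back.
venues = pd.DataFrame({
    "Neighborhood": ["The Beaches", "The Beaches"],
    "Venue": ["Glen Manor Ravine", "Grover Pub and Grub"],
})

# Neighborhoods that never appear in the venue results.
has_venue = all_neighborhoods.isin(venues["Neighborhood"])
without_venues = all_neighborhoods[~has_venue].tolist()
print(without_venues)
```

Filtering the original frame with the same boolean mask would also avoid hardcoding the neighborhood name in the removal step.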


In [71]:
# Remove neighborhood without venue results from initial list
borough_toronto = borough_toronto[borough_toronto['Neighborhood'] != "Queen's Park"]

In [74]:
# Number of venues returned per neighborhood
toronto_venues['Neighborhood'].value_counts()

Adelaide,King,Richmond                                                                                  100
Underground city,First Canadian Place                                                                   100
Commerce Court,Victoria Hotel                                                                           100
Garden District,Ryerson                                                                                 100
Harbourfront East,Toronto Islands,Union Station                                                         100
Toronto Dominion Centre,Design Exchange                                                                 100
St. James Town                                                                                          100
Stn A PO Boxes 25 The Esplanade                                                                          97
Grange Park,Chinatown,Kensington Market                                                                  91
Church and Wellesley        

In [75]:
# Number of unique venue categories
print(f'There are {toronto_venues["Venue Category"].nunique()} unique venue categories.')

There are 236 unique venue categories.
