# HELLO CAPSTONE PROJECT COURSE!

This notebook was created as part of the "Data Science Professional Certificate" course track offered by IBM on Coursera. The aim of the project is to apply what was taught throughout the course in a creative analysis called **"The Battle of Neighborhoods"**.

Given a city such as Toronto, we will segment it into neighborhoods using the geographical coordinates of the center of each neighborhood. Then, using a combination of location data and machine learning, we will group the neighborhoods into clusters.

*__Let's dive into it!__*

# Week 3 - Part 1 : Scraping data 

In [None]:
# -y answers conda's confirmation prompt, which would otherwise wait for input inside the notebook cell
!conda install -y beautifulsoup4 lxml requests

print("Installed!")

In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'   # Wikipedia page to scrape

page_response = requests.get(url, timeout=5)  # fetch the page with a 5-second timeout
page_response.raise_for_status()  # raise an error on a 4xx/5xx response
page_content = BeautifulSoup(page_response.content, "lxml")  # parse the HTML with the lxml parser
# print(page_content.prettify())  # uncomment to inspect the parsed HTML


In [3]:
match = page_content.find('tbody')  # first <tbody> on the page, which holds the postal code table
rows = match.find_all('tr')  # every table row, including the header row
print(rows)


[<tr>
<th>Postal code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>, <tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>, <tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>
</td></tr>, <tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>, <tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>, <tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park / Harbourfront
</td></tr>, <tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor / Lawrence Heights
</td></tr>, <tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park / Ontario Provincial Government
</td></tr>, <tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>
</td></tr>, <tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue
</td></tr>, <tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern / Rouge
</td></tr>, <tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>
</td></tr>, <tr>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mills
</td></tr>, <tr>
<td>M4B
</td>
<td>East York
</td>
<td>Parkview H

In [4]:
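import re

tag_pattern = re.compile('<.*?>')  # matches any HTML tag; compiled once, outside the loop

list_rows = []
for row in rows:
    cells = row.find_all('td')  # the header row has no <td> cells, so it yields '[]'
    str_cells = str(cells)
    clean2 = re.sub(tag_pattern, '', str_cells)  # strip the tags, keeping only the cell text
    list_rows.append(clean2)
print(clean2)  # shows the last row only
type(clean2)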

[M9Z
, Not assigned
, 
]


str
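
As a side note, the regex step above turns each row into one string that later has to be split and stripped again. A minimal sketch of an alternative is to read the text of each cell directly with `get_text(strip=True)`. The `sample_html` snippet and the `records` name below are made up for illustration and only mimic the structure of the Wikipedia table; the built-in `html.parser` is used here so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A tiny hand-written sample that mimics the structure of the scraped table.
sample_html = """
<table><tbody>
<tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    # get_text(strip=True) returns the cell text without surrounding whitespace
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        records.append(cells)

print(records)
```

Each row then arrives as a clean list of three strings, so no bracket or newline cleanup is needed afterwards.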

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame(list_rows)  # turn the list into a DataFrame so it can be manipulated
df.head(10)

Unnamed: 0,0
0,[]
1,"[M1A\n, Not assigned\n, \n]"
2,"[M2A\n, Not assigned\n, \n]"
3,"[M3A\n, North York\n, Parkwoods\n]"
4,"[M4A\n, North York\n, Victoria Village\n]"
5,"[M5A\n, Downtown Toronto\n, Regent Park / Harb..."
6,"[M6A\n, North York\n, Lawrence Manor / Lawrenc..."
7,"[M7A\n, Downtown Toronto\n, Queen's Park / Ont..."
8,"[M8A\n, Not assigned\n, \n]"
9,"[M9A\n, Etobicoke\n, Islington Avenue\n]"


In [6]:
df1 = df[0].str.split(',', expand=True)  # split each string on commas into separate columns
df1.head(10)

Unnamed: 0,0,1,2,3
0,[],,,
1,[M1A\n,Not assigned\n,\n],
2,[M2A\n,Not assigned\n,\n],
3,[M3A\n,North York\n,Parkwoods\n],
4,[M4A\n,North York\n,Victoria Village\n],
5,[M5A\n,Downtown Toronto\n,Regent Park / Harbourfront\n],
6,[M6A\n,North York\n,Lawrence Manor / Lawrence Heights\n],
7,[M7A\n,Downtown Toronto\n,Queen's Park / Ontario Provincial Government\n],
8,[M8A\n,Not assigned\n,\n],
9,[M9A\n,Etobicoke\n,Islington Avenue\n],


In [7]:
# renaming the columns

indexes = pd.Series(['PostalCode', 'Borough', 'Neighborhood'])  # the desired column names
ind_df = pd.DataFrame([indexes])  # one-row DataFrame holding those names
df2 = pd.concat([ind_df, df1], ignore_index=True)  # put that row on top of the data

df2 = df2.rename(columns=df2.iloc[0])  # use the first row as the column labels
df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,nan
0,PostalCode,Borough,Neighborhood,
1,[],,,
2,[M1A\n,Not assigned\n,\n],
3,[M2A\n,Not assigned\n,\n],
4,[M3A\n,North York\n,Parkwoods\n],
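
The column names can also be supplied when the DataFrame is built, which avoids concatenating a header row and dropping it later. This is only a sketch on hand-written rows: `records` and `toy` are illustrative names, and it assumes each row is already a list of three values rather than a single string.

```python
import pandas as pd

# Hand-written rows standing in for the scraped table (illustrative data only).
records = [
    ["M1A", "Not assigned", ""],
    ["M3A", "North York", "Parkwoods"],
    ["M5A", "Downtown Toronto", "Regent Park / Harbourfront"],
]

# Passing columns= labels the columns at construction time.
toy = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])

print(toy.columns.tolist())
print(toy.shape)
```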


In [8]:
# data wrangling (preparing the data for analysis)

df2.drop([0, 1], axis=0, inplace=True)  # drop the column-name row and the empty '[]' row
df2.drop(df2.columns[[3]], axis=1, inplace=True)  # drop the unnecessary fourth column


# data cleaning
# str.strip(chars) removes any of the given characters from both ends of each string

df2['PostalCode'] = df2['PostalCode'].str.strip('[\n]')
df2['Neighborhood'] = df2['Neighborhood'].str.strip(']')
df2['Neighborhood'] = df2['Neighborhood'].str.strip('\n')
df2['Borough'] = df2['Borough'].str.strip('\n')
# Borough and Neighborhood still keep the leading space left by the comma split
df2.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M1A,Not assigned,
3,M2A,Not assigned,
4,M3A,North York,Parkwoods
5,M4A,North York,Victoria Village
6,M5A,Downtown Toronto,Regent Park / Harbourfront
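
The argument to `str.strip` is a set of characters to remove from both ends, not a literal prefix or suffix. The sketch below uses made-up values shaped like the cells above to show why the leading space survives the cleaning step and how adding a space to the character set would remove it; `raw`, `partial`, and `cleaned` are illustrative names.

```python
import pandas as pd

# Toy values shaped like the cells produced by the comma split.
raw = pd.Series(["[M1A\n", " Not assigned\n", " Parkwoods\n]"])

# Brackets and newlines are removed, but the leading space survives.
partial = raw.str.strip("[\n]")

# Adding a space to the character set removes it as well.
cleaned = raw.str.strip("[\n] ")

print(partial.tolist())
print(cleaned.tolist())
```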


In [9]:
df2.astype('str').dtypes  # dtypes of a string-converted copy; astype does not modify df2 itself

PostalCode      object
Borough         object
Neighborhood    object
dtype: object

In [10]:
# find the index labels of the rows whose Borough is "Not assigned"
# (the value starts with a space left over from the comma split)

index = df2[df2['Borough'] == ' Not assigned'].index
index

Int64Index([  2,   3,   9,  12,  17,  18,  21,  26,  27,  30,  31,  35,  36,
             37,  39,  40,  44,  45,  46,  53,  54,  55,  62,  63,  64,  71,
             72,  73,  80,  81,  89,  90,  98,  99, 103, 107, 108, 112, 117,
            120, 121, 125, 126, 127, 129, 130, 133, 134, 135, 136, 138, 139,
            142, 143, 147, 148, 151, 152, 156, 157, 160, 161, 163, 164, 165,
            166, 168, 169, 172, 173, 174, 175, 176, 177, 178, 179, 181],
           dtype='int64')

In [12]:
# dropping the "Not assigned" rows

df2.drop(index, axis=0, inplace=True)  # reuse the index labels found in the previous cell
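
The same rows can also be removed with a boolean mask, which avoids handling index labels at all. This is a minimal sketch on made-up data, assuming the Borough values still carry the leading space; `toy`, `mask`, and `assigned` are illustrative names.

```python
import pandas as pd

# Made-up rows in the same shape as df2, including the leading spaces.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": [" Not assigned", " North York", " North York"],
    "Neighborhood": ["", " Parkwoods", " Victoria Village"],
})

# Strip whitespace before comparing so the test does not depend on the stray space.
mask = toy["Borough"].str.strip() != "Not assigned"

# Keep only the assigned rows and renumber the index from zero.
assigned = toy[mask].reset_index(drop=True)

print(assigned)
```

Resetting the index is optional; it only makes the remaining rows easier to address by position.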


In [14]:
df2.tail()

Unnamed: 0,PostalCode,Borough,Neighborhood
162,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
167,M4Y,Downtown Toronto,Church and Wellesley
170,M7Y,East Toronto,Business reply mail Processing CentrE
171,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea ...
180,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Blo...
