# Scraping Toronto neighborhoods

The goal of this project is to cluster neighborhoods in Toronto. The first step of the analysis is to obtain data on these neighborhoods, including each neighborhood's name and geographical coordinates. We obtain this data in two steps:
 1. Get the names of all neighborhoods in Toronto
 2. Find the coordinates of each neighborhood using a geocoder

Luckily, a Wikipedia [page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) lists the postal codes, boroughs, and neighborhood names we need for the first step.

This notebook contains the code used to scrape the data, clean it in a pandas <code>DataFrame</code>, and save it to a <code>csv</code> file.

### Importing libraries

A simple HTML table can be scraped using only <code>pandas</code>, provided an HTML parser library such as <code>lxml</code> is installed for it to use.

In [1]:
import pandas as pd

### Downloading the dataset

The <code>read_html</code> function reads HTML tables into a <code>list</code> of <code>DataFrame</code> objects. To get the dataset, we select the first element of the returned list and give its columns descriptive names.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

dfs_list = pd.read_html(url)

# select first element from list of DataFrames
df = dfs_list[0]
# rename the columns to consistent identifiers
df.columns = ['PostalCode', 'Borough', 'Neighborhood']

df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
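
Selecting `dfs_list[0]` assumes the table of interest is the first one on the page and has exactly three columns. If the page layout changes, that assumption can silently break. Below is a minimal sketch of a more defensive selection. It uses small in-memory frames as stand-ins for the scraped ones, and `pick_table` is a hypothetical helper, not part of pandas. The column names in the stand-in data are illustrative.

```python
import pandas as pd


def pick_table(tables, n_columns=3):
    """Return a copy of the first DataFrame in `tables` with `n_columns` columns.

    Hypothetical helper: it assumes the table we want is the first one
    with the expected number of columns.
    """
    for table in tables:
        if table.shape[1] == n_columns:
            return table.copy()
    raise ValueError(f"no table with {n_columns} columns found")


# Stand-ins for the list returned by pd.read_html (illustrative data only).
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    }),
]

demo = pick_table(tables)
demo.columns = ["PostalCode", "Borough", "Neighborhood"]
print(demo)
```

Raising an error when no table matches makes a layout change visible immediately instead of producing a mislabeled `DataFrame`.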


### Transforming the dataset

We only need the rows that have an assigned borough, so we filter the <code>DataFrame</code> to remove the rest.

In [3]:
# remove all rows whose borough is not assigned
mask = df['Borough'] != 'Not assigned'
df = df[mask]

df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
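
The filter in this step is plain boolean indexing: the comparison produces a `Series` of `True`/`False` values, and indexing with it keeps only the rows marked `True`. Here is a small sketch on a hand-built frame whose rows are taken from the first output in this notebook. The `reset_index` call at the end is optional and is not applied in this notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean Series: True where the borough is assigned.
mask = sample["Borough"] != "Not assigned"

# Keep only the True rows; the original index labels (2 and 3) are preserved.
kept = sample[mask]

# Optional: renumber the rows starting from 0 (not done in this notebook).
renumbered = kept.reset_index(drop=True)

print(kept.index.tolist())        # [2, 3]
print(renumbered.index.tolist())  # [0, 1]
```

Because the index labels are preserved, the filtered output starts at 2 rather than 0.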


### Saving the dataset

We can then save our data to a <code>csv</code> file.

In [4]:
df.to_csv('data/toronto_neighborhoods_raw.csv', index=False)
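
Note that `to_csv` does not create missing directories, so the call above expects a `data/` folder to exist already. Below is a minimal sketch that creates the folder first and reads the file back as a check. It writes to a temporary directory so it can run anywhere, and the one-row frame is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M3A"],
    "Borough": ["North York"],
    "Neighborhood": ["Parkwoods"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "toronto_neighborhoods_raw.csv"

    # Create the target folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False leaves the row labels out of the file.
    sample.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip.
    restored = pd.read_csv(out_path)

print(restored)
```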

### Shape

As required by the assignment, we print the shape of the resulting <code>DataFrame</code> as (rows, columns).

In [5]:
print(df.shape)

(103, 3)


### Conclusion

The next step is to add latitude and longitude coordinates for each neighborhood so that the data can be used with the Foursquare API. This is done in [geocoder.ipynb](geocoder.ipynb).