# Toronto neighborhoods

**In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto.** 

The neighborhood data for Toronto is not readily available on the internet. However, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. 

Accordingly, this assignment consists of three notebooks, one for each part:

- **Parts 1 and 2**: you will be required to get the data, clean it, and then read it into a pandas dataframe so that it is in a suitable structured format. 

- **Part 3**: once the data is in a structured format, you can analyze the dataset to explore and cluster the neighborhoods in the city of Toronto.

Let's start!

In [1]:
#!conda install -c conda-forge geopy --yes           # uncomment only if it is necessary to install the 'geopy' library

Import the necessary libraries to run this notebook:

In [2]:
import pandas            as pd      # library for data analysis

## Part 1: Postal codes

**Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M in order**

**1) to obtain the data that is in the table of postal codes and** 

**2) to transform the data into a pandas dataframe of the postal code of each neighborhood along with the borough name and neighborhood name.**

## 1.1. Obtain the data

**Use the BeautifulSoup package or *any other way* you are comfortable with to transform the data in the table on the Wikipedia page into the pandas dataframe.**

Instead of the BeautifulSoup package, I chose the pandas function *'read_html'*, which reads every HTML table on a page into a list of dataframes. The postal code table is the first element of that list.

In [3]:
df_cp = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

Explore the data:

In [4]:
df_cp.shape         # dataframe dimension (rows, columns)

(288, 3)

In [5]:
df_cp.columns       # column names

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

In [6]:
df_cp               # dataframe display

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


## 1.2. Transform the data

**1.2.1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.**

Rename the column names to match the ones in the required dataframe:

In [7]:
df_cp.rename(columns={'Postcode':'PostalCode', 'Neighbourhood':'Neighborhood'}, inplace=True)   # columns rename 
df_cp.columns                                                                                   # verification

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

**1.2.2. Only process the cells that have an assigned borough. Ignore cells with a borough that is *'Not assigned'*.**

In [8]:
filter_1 = (df_cp['Borough'] != 'Not assigned')   # boolean mask: rows whose 'Borough' is not 'Not assigned'
df_cp = df_cp[filter_1].copy()                    # apply the mask; copy so later assignments modify this dataframe
(df_cp['Borough'] == 'Not assigned').any()        # verification: is any 'Not assigned' borough left?

False
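A minimal sketch of the same filtering step on a toy dataframe (the rows below are made up for illustration, not read from the Wikipedia table). It shows a boolean mask being applied, followed by a value-based check that no 'Not assigned' borough remains. Comparing the values with `==` and calling `.any()` is one way to verify; an identity test with `is` would compare the string object to the Series object and always give `False`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

mask = toy['Borough'] != 'Not assigned'   # True for rows to keep
kept = toy[mask].copy()                   # filtered copy, original labels preserved

# Value-based verification: True only if some borough is still 'Not assigned'
remaining = (kept['Borough'] == 'Not assigned').any()

print(kept)
print(remaining)
```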

In [9]:
df_cp                                         # visualize the resulting dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


**1.2.3. If a cell has a borough but a *'Not assigned'* neighborhood, then the neighborhood will be the same as the borough.** 

In [10]:
filter_21 = (df_cp['Neighborhood'] == 'Not assigned')   # create a filter
labels_21 = df_cp[filter_21].index                       # apply the filter and get the row labels
df_cp[filter_21]                                        # visualize the filtered dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Not assigned


There is only one row where the value of **'Neighborhood'** is *'Not assigned'*. For this row, we set the value of 'Neighborhood' equal to the value of 'Borough'.

In [11]:
# Change 'Neighborhood' if it is 'Not assigned'
# 1 - Valid only in this case, where labels_21 has a single element.
#df_cp.loc[labels_21[0], 'Neighborhood'] = df_cp.loc[labels_21[0], 'Borough']  # df.loc[row label, column label]

# 2 - Valid in general, for any number of matching rows.
for i in labels_21:
    df_cp.loc[i, 'Neighborhood'] = df_cp.loc[i, 'Borough']   # a single .loc call avoids chained assignment
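As a loop-free alternative, the substitution can be sketched with a boolean mask and a single `.loc` assignment. The toy rows below are assumptions made for illustration; the row labels 8 and 10 are chosen only to mimic the non-consecutive labels seen above.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        'PostalCode': ['M7A', 'M9A'],
        'Borough': ["Queen's Park", 'Etobicoke'],
        'Neighborhood': ['Not assigned', 'Islington Avenue'],
    },
    index=[8, 10],
)

mask = toy['Neighborhood'] == 'Not assigned'

# Assign through one .loc call (row selector, column label) so the
# write goes to the dataframe itself rather than to a temporary copy.
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']

print(toy)
```

Writing through `df.loc[rows, column]` in one step is generally safer than `df.loc[row][column] = ...`, which may assign to an intermediate object.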

Validation (three different ways):

In [12]:
print('Not assigned' in df_cp['Neighborhood'].values)    # first: test the values ('in' on a Series tests the index labels)

False
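One caveat about this kind of check: Python's `in` operator applied to a pandas Series looks at the index labels, not at the stored values. The sketch below, on a made-up Series, contrasts the label test with two value-based tests.

```python
import pandas as pd

# Toy Series, invented for illustration; labels 8 and 9 are arbitrary
s = pd.Series(["Queen's Park", 'Not assigned'], index=[8, 9])

label_check = 'Not assigned' in s            # tests the index labels (8, 9)
index_check = 8 in s                         # 8 is a label, so this succeeds
value_check = 'Not assigned' in s.values     # tests the underlying values
any_check = (s == 'Not assigned').any()      # element-wise comparison, then any()

print(label_check, index_check, value_check, any_check)
```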


In [13]:
filter_22 = (df_cp['Neighborhood'] == 'Not assigned')    # second      
df_cp[filter_22]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [14]:
df_cp.loc[[8]]                                            # third

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Queen's Park


The value of **'Neighborhood'** is the same as the value of **'Borough'** in the row with label 8, as expected.

**1.2.4. When more than one neighborhood exists in one postal code area, these rows will be combined into one row with the neighborhoods separated with a comma.**

In [15]:
df_cp             # display the dataframe before combination

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [16]:
print('The city of Toronto has {} boroughs, {} postal codes and {} neighborhoods.'.format(
       len(df_cp['Borough'].unique()),len(df_cp['PostalCode'].unique()), df_cp.shape[0]))

The city of Toronto has 11 boroughs, 103 postal codes and 211 neighborhoods.


Get the row labels for the postal codes with more than one neighborhood:

In [17]:
filter_3 = df_cp[['PostalCode']].duplicated()
labels_3 = df_cp[filter_3].index
labels_3

Int64Index([  5,   7,  12,  16,  18,  23,  24,  25,  26,  28,
            ...
            268, 269, 270, 271, 272, 273, 283, 284, 285, 286],
           dtype='int64', length=108)
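For reference, `duplicated()` marks every repeat of a value except its first occurrence by default (`keep='first'`), which is why only the second and later rows of each postal code are returned. A small sketch on made-up data, also showing `keep=False`, which marks all members of a repeated group:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {'PostalCode': ['M5A', 'M5A', 'M6A', 'M5A']},
    index=[4, 5, 6, 7],
)

dup_default = toy['PostalCode'].duplicated()         # first occurrence is not marked
dup_all = toy['PostalCode'].duplicated(keep=False)   # every repeated value is marked

repeat_labels = toy[dup_default].index               # labels of the rows to merge away

print(dup_default.tolist())
print(dup_all.tolist())
print(list(repeat_labels))
```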

Combine the neighborhoods for these postal codes into one row:

In [18]:
for label in labels_3:
    neigh = df_cp.loc[label, 'Neighborhood']        # neighborhood to combine
    row = df_cp.index.get_loc(label)                # integer position of the duplicate row
    target = df_cp.index[row - 1]                   # label of the row just above (same postal code)
    df_cp.loc[target, 'Neighborhood'] = df_cp.loc[target, 'Neighborhood'] + ',' + neigh   # combine
    df_cp = df_cp.drop(label, axis=0)               # delete the duplicate row
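The loop above appears to rely on the rows of one postal code being adjacent in the table. A possible alternative that does not depend on row order is to group by postal code and borough and join the neighborhood names. This is a sketch on made-up rows, not the method used in this notebook; note that it also renumbers the rows from 0.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Harbourfront', 'Regent Park',
                     'Lawrence Heights', 'Lawrence Manor'],
})

combined = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
       .agg(','.join)      # join the names within each group with a comma
       .reset_index()      # turn the group keys back into columns
)

print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance, which matches the order the loop would produce.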

In [19]:
df_cp          # validation

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront,Regent Park"
6,M6A,North York,"Lawrence Heights,Lawrence Manor"
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,"Rouge,Malvern"
14,M3B,North York,Don Mills North
15,M4B,East York,"Woodbine Gardens,Parkview Hill"
17,M5B,Downtown Toronto,"Ryerson,Garden District"


**1.2.5. Change the row labels to match the ones in the required dataframe.**

In [20]:
df_cp = df_cp.reset_index(drop = True)        # renumber the rows from 0 and discard the old labels
df_cp                                         # verification

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
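To illustrate what `drop=True` changes, here is a small sketch on a made-up dataframe with gaps in its row labels. Without `drop=True`, the old labels are kept as a new column named `index`.

```python
import pandas as pd

# Toy data, invented for illustration; labels have gaps like the filtered table
toy = pd.DataFrame({'PostalCode': ['M3A', 'M5A', 'M7A']}, index=[2, 4, 8])

renumbered = toy.reset_index(drop=True)   # old labels discarded
with_old = toy.reset_index()              # old labels kept in an 'index' column

print(renumbered)
print(with_old)
```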


**1.2.6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.**

In [21]:
print('The number of rows of my postal codes dataframe is: ', df_cp.shape[0])

The number of rows of my postal codes dataframe is:  103


## Part 2: Geographical coordinates of each postal code

## Part 3: Segmentation, clustering and visualization