# ASSIGN NEIGHBORHOODS OF TORONTO

In [16]:
import pandas as pd
import numpy as np


In [17]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

## Pre-Processing the data:
Let's import, clean, and rename the data frame.

In [18]:
series_of_dataframes = pd.read_html(url)    # read_html returns a list of DataFrames, one per HTML table
df_first = series_of_dataframes[0]          # first DataFrame in that list

# Flag the rows whose borough is 'Not assigned'
# na_nan_series = (df_first['Borough']=='Not assigned') & (df_first['Neighborhood'].isnull())
na_series = df_first['Borough'] == 'Not assigned'

# Keep only the assigned rows and rename the postal code column
df = df_first[~na_series].rename(columns={'Postal code': 'PostalCode'})
print (df.head())

  PostalCode           Borough                                  Neighborhood
2        M3A        North York                                     Parkwoods
3        M4A        North York                              Victoria Village
4        M5A  Downtown Toronto                    Regent Park / Harbourfront
5        M6A        North York             Lawrence Manor / Lawrence Heights
6        M7A  Downtown Toronto  Queen's Park / Ontario Provincial Government
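
The filtering step above can be tried without a network connection. The following is a minimal sketch on a toy frame whose values are illustrative only (they are not the Wikipedia data). It also shows `reset_index(drop=True)`, an optional step the cell above does not take, which is why the real output keeps the original row labels starting at 2.

```python
import pandas as pd

# Toy stand-in for the Wikipedia table (illustrative values only)
raw = pd.DataFrame({
    'Postal code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighborhood': [None, None, 'Parkwoods', 'Victoria Village'],
})

# Boolean mask: True where a borough has been assigned
assigned = raw['Borough'] != 'Not assigned'

# Keep the assigned rows, rename the key column, and renumber rows from 0
clean = (
    raw[assigned]
    .rename(columns={'Postal code': 'PostalCode'})
    .reset_index(drop=True)
)

print(clean)
```

Whether to reset the index is a matter of taste; keeping the original labels makes it easier to look rows up again in the raw table.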


## Pre-Processing the data (continued):
1) If more than one neighborhood exists in one postal code area, they need to be combined in one row, separated by commas.

2) If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

Let's check both points and fix what is needed:

In [19]:
# 1) First point (see above)

# Compare the row counts of the initial and the filtered DataFrame:
print ('The inititial Dataframe has {} number of rows'.format(df_first.shape[0]))
print ('The filtered Dataframe has {} number of rows'.format(df.shape[0]))
# The difference is the number of 'Not assigned' rows that were dropped.

# Check that every PostalCode appears in exactly one row
if df['PostalCode'].is_unique:
    print ('PostalCode is OK, none of its values is doubled')
else:
    print ('some incongruences found, please check consistency')

# Check whether neighborhoods are repeated:
# split the slash-separated neighborhoods into columns
neighborhoods = df['Neighborhood'].str.split(pat="/", expand=True)
# keep=False marks every row that has an identical twin, not only the later copies
duplic = neighborhoods[neighborhoods.duplicated(keep=False)]
print('Duplicated found in:')
print (duplic)

# let's check these rows in detail:
print (df_first.iloc[11])
print (df_first.iloc[20])
print (df_first.iloc[65])
print (df_first.iloc[74])
print (df_first.iloc[83])
print (df_first.iloc[91])
print (df_first.iloc[92])
print (df_first.iloc[109])

# CONCLUSION:
# Some neighborhoods are large enough to span several postal codes,
# so these repeated names are not errors.

# Replace the slash separators with commas, as asked
df['Neighborhood'] = df['Neighborhood'].str.replace('/', ',')
print ('Dataframe now comma-separated\n', df.head())



The inititial Dataframe has 180 number of rows
The filtered Dataframe has 103 number of rows
PostalCode is OK, none of its values is doubled
Duplicated found in:
              0     1     2     3     4     5     6     7
11    Don Mills  None  None  None  None  None  None  None
20    Don Mills  None  None  None  None  None  None  None
65    Downsview  None  None  None  None  None  None  None
74    Downsview  None  None  None  None  None  None  None
83    Downsview  None  None  None  None  None  None  None
91   Willowdale  None  None  None  None  None  None  None
92    Downsview  None  None  None  None  None  None  None
109  Willowdale  None  None  None  None  None  None  None
Postal code            M3B
Borough         North York
Neighborhood     Don Mills
Name: 11, dtype: object
Postal code            M3C
Borough         North York
Neighborhood     Don Mills
Name: 20, dtype: object
Postal code            M3K
Borough         North York
Neighborhood     Downsview
Name: 65, dtype: object
P
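
The two checks in the cell above (unique postal codes, repeated neighborhood names) can be reproduced on a small example. This is a sketch on a toy frame; the three rows are picked for illustration and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame: two postal codes share the neighborhood name 'Don Mills'
toy = pd.DataFrame({
    'PostalCode': ['M3B', 'M3C', 'M5A'],
    'Neighborhood': ['Don Mills', 'Don Mills', 'Regent Park / Harbourfront'],
})

# True when no postal code appears more than once
codes_unique = toy['PostalCode'].is_unique

# keep=False flags every member of a duplicated group,
# so both 'Don Mills' rows are returned, not only the second one
repeated = toy[toy['Neighborhood'].duplicated(keep=False)]

print(codes_unique)
print(repeated)
```

With the default `keep='first'`, only the second 'Don Mills' row would be flagged, which makes it harder to see which rows belong together.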

In [20]:
# 2) Second point (see above)

# Check whether any row has a borough but an empty neighborhood.
# Both borough conditions must hold, so they are joined with & (with | the test is always True).
na_nan_series = ((df['Borough']!='Not assigned') & (df['Borough']!='')) & (df['Neighborhood'].isnull())
test = df[na_nan_series]
if test.empty:
    print ('None of the given neighborhoods is unnamed')
else:
    print ('error')
    print ('###############')

# continue...

None of the given neighborhoods is unnamed
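
In this snapshot of the table neither rule requires any work, because each postal code already sits in a single row and no neighborhood is missing. For a table that does need them, the two rules could be applied roughly as follows. This is only a sketch under assumptions: the toy rows, the long one-row-per-neighborhood layout, and the names `toy`, `missing` and `combined` are hypothetical and do not come from the data above.

```python
import pandas as pd

# Hypothetical long-format table: one row per (postal code, neighborhood)
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Not assigned', 'Islington Avenue'],
})

# Rule 2: a row with a borough but no neighborhood takes the borough name
missing = toy['Neighborhood'].eq('Not assigned') | toy['Neighborhood'].isna()
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

# Rule 1: one row per postal code, neighborhoods joined with ', '
combined = (
    toy.groupby(['PostalCode', 'Borough'], as_index=False, sort=False)
       .agg({'Neighborhood': ', '.join})
)

print(combined)
```

Rule 2 is applied first so that a 'Not assigned' placeholder never ends up inside a joined string.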


## Conclusion
The DataFrame is now free of empty entries and in the requested format.
Let's conclude this part with the dimensions of the resulting DataFrame and its head and tail:

In [21]:
# shape of resulting Data Frame:
print ('The resulting shape of my Data Frame is: {}'.format(df.shape))
print ('Here head:')
print (df.head())
print ('and tail:')
print (df.tail())

The resulting shape of my Data Frame is: (103, 3)
Here head:
  PostalCode           Borough                                  Neighborhood
2        M3A        North York                                     Parkwoods
3        M4A        North York                              Victoria Village
4        M5A  Downtown Toronto                    Regent Park , Harbourfront
5        M6A        North York             Lawrence Manor , Lawrence Heights
6        M7A  Downtown Toronto  Queen's Park , Ontario Provincial Government
and tail:
    PostalCode           Borough  \
160        M8X         Etobicoke   
165        M4Y  Downtown Toronto   
168        M7Y      East Toronto   
169        M8Y         Etobicoke   
178        M8Z         Etobicoke   

                                          Neighborhood  
160    The Kingsway , Montgomery Road , Old Mill North  
165                               Church and Wellesley  
168              Business reply mail Processing CentrE  
169  Old Mill South , 

# GEOCODER

Here we need the latitude and longitude coordinates of each neighborhood.
For this we merge two dataframes: the one built from Wikipedia and one sourced from Coursera as a CSV file.
I could not get geocoder to return any results, so I gave up on it. :/
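
Before running the merge on the real files, it can help to see how a join on `PostalCode` behaves when a code has no coordinates. The sketch below uses made-up coordinates and toy frames (`left`, `coords` are hypothetical names). It uses `how='left'` together with the `indicator` and `validate` options of `DataFrame.merge`, which the notebook cell itself does not use.

```python
import pandas as pd

# Toy frames; the coordinates are invented for illustration
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M1B'],
    'Latitude': [43.75, 43.73, 43.81],
    'Longitude': [-79.33, -79.32, -79.19],
})

# Align the key column name before merging
coords = coords.rename(columns={'Postal Code': 'PostalCode'})

# how='left' keeps every row of `left`;
# validate raises if a postal code is repeated on either side;
# indicator adds a '_merge' column naming the source of each row
merged = left.merge(coords, how='left', on='PostalCode',
                    validate='one_to_one', indicator=True)

# Postal codes for which no coordinates were found
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print(unmatched['PostalCode'].tolist())
```

An outer join would additionally keep coordinate rows that match no postal code in the left frame (here `M1B`), leaving their borough empty.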

In [28]:
# !cat ../data/Geospatial_Coordinates.csv
geo_df = pd.read_csv('../data/Geospatial_Coordinates.csv')
# Align the key column name with df before merging
geo_df = geo_df.rename(columns={'Postal Code': 'PostalCode'})

# Attach Latitude and Longitude to each postal code
geo_df = df.merge(geo_df, how='outer', on='PostalCode')
print (geo_df)


    PostalCode           Borough  \
0          M3A        North York   
1          M4A        North York   
2          M5A  Downtown Toronto   
3          M6A        North York   
4          M7A  Downtown Toronto   
..         ...               ...   
98         M8X         Etobicoke   
99         M4Y  Downtown Toronto   
100        M7Y      East Toronto   
101        M8Y         Etobicoke   
102        M8Z         Etobicoke   

                                          Neighborhood   Latitude  Longitude  
0                                            Parkwoods  43.753259 -79.329656  
1                                     Victoria Village  43.725882 -79.315572  
2                           Regent Park , Harbourfront  43.654260 -79.360636  
3                    Lawrence Manor , Lawrence Heights  43.718518 -79.464763  
4         Queen's Park , Ontario Provincial Government  43.662301 -79.389494  
..                                                 ...        ...        ...  
98     The Kin

