# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Exploratory Data Analysis](#ExploratoryDataAnalysis)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## 1. Introduction: Business Problem <a name="introduction"></a>

Living in a big city offers many neighbourhoods to choose from. When selecting a neighbourhood to live in, a person might weigh a number of considerations, including rental costs, housing prices, transportation, walkability, nearby restaurants, gyms, and running routes.

If someone is relocating from one city to another, they may know the neighbourhoods of their current city well but be unfamiliar with those in the new one. In this case study we look at neighbourhoods in Manhattan and compare them with neighbourhoods in Toronto. Given the neighbourhood preference of a person who lives in New York, we will try to find similar neighbourhoods in the city of Toronto.

While this Capstone project covers only two sample cities, the same approach can be extended to additional cities for comparison.

## 2. Data <a name="data"></a>

To group neighbourhoods into categories and find comparable neighbourhoods in the other city, we can combine data from several APIs. Apart from the Foursquare API, we can also investigate other APIs to see whether they provide useful data for clustering the neighbourhoods. Some possible APIs to investigate are:

1) Foursquare Location Data API: restaurants, gyms, and other venue information.  
2) Walkscore API: returns the Walk Score, Transit Score, and Bike Score for different neighbourhoods.  
3) RapidAPI: also provides APIs for rental and real estate information that can be integrated into our analysis. 

 
The aim is to use as much useful information as possible to build neighbourhood clusters in each city and find similar neighbourhoods in the other city.


### Before downloading the data, let us import all necessary dependencies

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

#Display all columns and rows in Jupyter output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#prints all outputs in Jupyter Output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Load the New York location data

In [7]:
with open(r'C:\Users\rrvin\OneDrive\Desktop\Data Science\Coursera\Datascience Capstone Project\nyu_2451_34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)

Notice how all the relevant data is in the *features* key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [45]:
ny_neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.

In [46]:
ny_neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [20]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
ny_neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [21]:
ny_neighborhoods

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|


Then let's loop through the data and fill the dataframe one row at a time.

In [22]:
for data in ny_neighborhoods_data:
    borough = data['properties']['borough']
    ny_neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    ny_neighborhood_latlon = data['geometry']['coordinates']
    ny_neighborhood_lat = ny_neighborhood_latlon[1]
    ny_neighborhood_lon = ny_neighborhood_latlon[0]

    # add one row at the next integer index (DataFrame.append was removed in pandas 2.0)
    ny_neighborhoods.loc[len(ny_neighborhoods)] = [borough,
                                                   ny_neighborhood_name,
                                                   ny_neighborhood_lat,
                                                   ny_neighborhood_lon]
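Filling a dataframe row by row gets slow as the number of features grows. As an alternative, here is a minimal sketch that collects all rows first and builds the dataframe in one call. The function name `features_to_frame` and the two-item `sample_features` list are assumptions made for illustration; the sample only mimics the structure of `newyork_data['features']` shown above.

```python
import pandas as pd

# Hypothetical two-feature sample mimicking the structure of newyork_data['features']
sample_features = [
    {"geometry": {"type": "Point", "coordinates": [-73.84720052054902, 40.89470517661]},
     "properties": {"name": "Wakefield", "borough": "Bronx"}},
    {"geometry": {"type": "Point", "coordinates": [-73.91066, 40.876551]},
     "properties": {"name": "Marble Hill", "borough": "Manhattan"}},
]

def features_to_frame(features):
    """Build the neighborhood dataframe in one pass instead of row by row."""
    rows = [
        {
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            # GeoJSON stores coordinates as [longitude, latitude]
            "Latitude": feature["geometry"]["coordinates"][1],
            "Longitude": feature["geometry"]["coordinates"][0],
        }
        for feature in features
    ]
    return pd.DataFrame(rows, columns=["Borough", "Neighborhood", "Latitude", "Longitude"])

sample_frame = features_to_frame(sample_features)
print(sample_frame)
```

With the real data, `features_to_frame(ny_neighborhoods_data)` should produce the same four columns as the loop above, with numeric latitude and longitude columns.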

Quickly examine the resulting dataframe.

In [24]:
ny_neighborhoods.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [26]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(ny_neighborhoods['Borough'].unique()),
        ny_neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.
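The total row count alone does not show how neighborhoods are distributed across boroughs, or whether any (borough, neighborhood) pair appears twice. The sketch below illustrates both checks on a small hypothetical stand-in for `ny_neighborhoods`; the names `sample`, `borough_counts`, and `duplicate_pairs` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for ny_neighborhoods
sample = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Manhattan"],
    "Neighborhood": ["Wakefield", "Co-op City", "Marble Hill"],
})

# number of distinct neighborhood names per borough
borough_counts = sample.groupby("Borough")["Neighborhood"].nunique()

# rows whose (borough, neighborhood) pair occurs more than once
duplicate_pairs = sample[sample.duplicated(subset=["Borough", "Neighborhood"], keep=False)]

print(borough_counts)
print("Duplicate (borough, neighborhood) pairs:", len(duplicate_pairs))
```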


#### Use geopy library to get the latitude and longitude values of New York City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [27]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
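A geocoding service can time out or return `None` when it finds no match, in which case `location.latitude` above would raise an error. The following is a minimal sketch of a retry wrapper, assuming a hypothetical helper named `geocode_with_retry` that accepts any callable behaving like `geolocator.geocode`. It catches a broad `Exception` only to stay independent of geopy's own exception classes, and it is demonstrated with a fake geocoder so that it runs offline.

```python
import time
from collections import namedtuple

def geocode_with_retry(geocode, address, attempts=3, pause_seconds=1.0):
    """Call geocode(address) up to `attempts` times; return (lat, lon) or None."""
    for attempt in range(attempts):
        try:
            location = geocode(address)
        except Exception:
            # assumption: any error is treated as a transient failure
            location = None
        if location is not None:
            return location.latitude, location.longitude
        if attempt < attempts - 1:
            time.sleep(pause_seconds)
    return None

# Fake geocoder for illustration: fails once, then succeeds
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])
calls = {"n": 0}

def flaky_geocode(address):
    calls["n"] += 1
    if calls["n"] < 2:
        return None
    return FakeLocation(40.7127281, -74.0060152)

coords = geocode_with_retry(flaky_geocode, "New York City, NY", pause_seconds=0)
print(coords)
```

In the notebook the same helper could be called as `geocode_with_retry(geolocator.geocode, address)`, checking for a `None` result before unpacking the coordinates.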


#### Create a map of New York with neighborhoods superimposed on top.

In [28]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, ny_neighborhood in zip(ny_neighborhoods['Latitude'], ny_neighborhoods['Longitude'], ny_neighborhoods['Borough'], ny_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(ny_neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

**Folium** is a great visualization library. Feel free to zoom into the above map and click on each circle marker to reveal the name of the neighborhood and its respective borough.

However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in Manhattan. So let's slice the original dataframe and create a new dataframe of the Manhattan data.

In [30]:
manhattan_data = ny_neighborhoods[ny_neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


Let's get the geographical coordinates of Manhattan.

In [31]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


As we did with all of New York City, let's visualize Manhattan and the neighborhoods in it.

In [32]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

## Now let's similarly import the Toronto neighbourhood data

## Scraping data from Wikipedia and creating the dataframe

Import the postal code table from the following Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [34]:
#Scraping the webpage with pandas
dfs=pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',header=0)
df_PostalCodesCanada=dfs[0]
df_PostalCodesCanada.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned |  |
| 1 | M2A | Not assigned |  |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


Note that the dataframe is already imported with one row per postal code, with neighbourhoods that share a postal code combined into a single entry.
Replace the forward slashes with commas.

In [35]:
df_PostalCodesCanada=df_PostalCodesCanada.replace('/',',',regex=True)
df_PostalCodesCanada.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned |  |
| 1 | M2A | Not assigned |  |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |
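The plain replacement above leaves a space before each comma ("Regent Park , Harbourfront") and is applied to every column of the dataframe. A minimal sketch of a tighter variant follows, assuming a small hypothetical `sample` dataframe with the same columns: it targets only the `Neighborhood` column and absorbs the whitespace around each slash.

```python
import pandas as pd

# Hypothetical sample with the same columns as df_PostalCodesCanada
sample = pd.DataFrame({
    "Postal code": ["M5A", "M6A"],
    "Borough": ["Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park / Harbourfront", "Lawrence Manor / Lawrence Heights"],
})

# replace a slash plus any surrounding whitespace with ", " in one column only
sample["Neighborhood"] = sample["Neighborhood"].str.replace(r"\s*/\s*", ", ", regex=True)

print(sample["Neighborhood"].tolist())
# ['Regent Park, Harbourfront', 'Lawrence Manor, Lawrence Heights']
```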


Count the boroughs marked as Not assigned and drop them from the dataframe.

In [36]:
print('Size of dataframe before removing Not assigned Boroughs:',df_PostalCodesCanada.shape)
print('Number of Not assigned Boroughs:',df_PostalCodesCanada.Borough.value_counts()['Not assigned'])
df_PostalCodesCanada=df_PostalCodesCanada[~df_PostalCodesCanada.Borough.str.contains('Not assigned')]
print('Size of dataframe after removing Not assigned Boroughs:',df_PostalCodesCanada.shape)  
df_PostalCodesCanada=df_PostalCodesCanada.reset_index(drop=True)
df_PostalCodesCanada.head()

Size of dataframe before removing Not assigned Boroughs: (180, 3)
Number of Not assigned Boroughs: 77
Size of dataframe after removing Not assigned Boroughs: (103, 3)


|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
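Filtering with `str.contains` matches any borough whose name merely includes the text "Not assigned". As a sketch under the assumption that the placeholder is always exactly that string, an equality test expresses the intent more directly and yields the count from the same mask. The three-row `sample` dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same columns as df_PostalCodesCanada
sample = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# boolean mask that is True only for an exact match
is_unassigned = sample["Borough"].eq("Not assigned")

# keep the remaining rows and renumber them from zero
cleaned = sample.loc[~is_unassigned].reset_index(drop=True)

print("Rows dropped:", int(is_unassigned.sum()))
print("Shape after cleaning:", cleaned.shape)
```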


Find Neighbourhoods that contain Not assigned.

In [37]:
df_PostalCodesCanada[df_PostalCodesCanada.Neighborhood.str.contains('Not assigned')]


|   | Postal code | Borough | Neighborhood |
|---|---|---|---|


There are no neighbourhoods that are not assigned.
The cell below shows the first five rows and the size of the final dataframe.

In [38]:
df_PostalCodesCanada.head()
print( 'Size of dataframe after processing:',df_PostalCodesCanada.shape)

Size of dataframe after processing: (103, 3)


## Get Latitude and Longitude information for each neighbourhood

In [39]:
df_PostalCodesCanada.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |


## Import latitude and longitude data from a CSV file

In [40]:

lat_lng_coords=pd.read_csv('C:/Users/rrvin/OneDrive/Desktop/Data Science/Coursera/Datascience Capstone Project/Canada_Geospatial_Coordinates.csv')
print(lat_lng_coords.shape)
lat_lng_coords.head()

(103, 3)


|   | Postal code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


## Merge the two dataframes into a single dataframe

In [41]:
df_Postcodes_WithLatLng=pd.merge(df_PostalCodesCanada,lat_lng_coords, on='Postal code')
print(df_Postcodes_WithLatLng.shape)
df_Postcodes_WithLatLng.head()


(103, 5)


|   | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662301 | -79.389494 |
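The default inner merge silently drops any postal code that is missing from the coordinates file, so an unchanged row count is the only sign that everything matched. Below is a minimal sketch of a more explicit check using the `how`, `validate`, and `indicator` parameters of `pd.merge`. The `codes` and `coords` dataframes are hypothetical samples, with one postal code deliberately left without coordinates.

```python
import pandas as pd

# Hypothetical samples standing in for df_PostalCodesCanada and lat_lng_coords
codes = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park , Harbourfront"],
})
coords = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = pd.merge(
    codes, coords,
    on="Postal code",
    how="left",              # keep every postal code from the left frame
    validate="one_to_one",   # raise if a postal code repeats on either side
    indicator=True,          # add a _merge column recording where each row matched
)

# postal codes that found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal code"].tolist()
print(unmatched)
# ['M5A']
```

An empty `unmatched` list on the real data would confirm that all 103 postal codes received coordinates.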


## Exploring and clustering neighbourhoods in Toronto
Tasks to complete  
  1) Add enough Markdown cells to explain what you decided to do and to report any observations you make.   
  2) Generate maps to visualize your neighborhoods and how they cluster together.  

## Use geopy to get the latitude and longitude of Toronto

In [42]:
address='Toronto, Ontario'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


## Create a new dataframe containing only boroughs that have the word Toronto in their name

In [43]:
df_toronto=df_Postcodes_WithLatLng[df_Postcodes_WithLatLng.Borough.str.contains('Toronto')].reset_index(drop=True)
print(df_toronto.shape)
df_toronto.head()


(39, 5)


|   | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 |
| 1 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662301 | -79.389494 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 3 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 4 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |


## Create a map of Toronto

In [44]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

## 3. Exploratory Data Analysis <a name="ExploratoryDataAnalysis"></a>