# Capstone Project - The Battle of Neighborhoods

## Title:

### Finding similar neighborhoods across cities based on their venues

## Introduction: 
### Business Problem

People often have to relocate to a different city because of a job change, and it is rarely obvious which neighborhood of the new city to move to. Several questions arise:
* Should I look for a neighborhood close to the new workplace?
* Should I look for a neighborhood similar to my current one?
* Should I explore some unique neighborhoods nearby?
* Does the cost of living matter? Does the crime rate matter compared to my current neighborhood?

Many more questions arise as the person starts exploring the city.

For instance, suppose a person named "Adam", living in "St. James Town, Downtown Toronto, Canada", has accepted a job in "Midtown South, Manhattan, New York". The proposed model will suggest the Manhattan neighborhood that is similar to Adam's current neighborhood and also nearest to his new workplace.

### Datasets and Analytic Approach

#### 1. City Data:
__Canada__: 
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The following attributes are fetched from the Canada data:
* Postal code
* Neighborhood
* Borough

The latitude and longitude of each location can be looked up from its postal code.
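The postal-code lookup can be sketched as a left join between the cleaned postal-code table and a coordinate table. This is only a minimal illustration: the `coords` table is hypothetical, its coordinates are approximate placeholders rather than values from the project data, and the names `postal`, `coords`, `located`, and `missing` are introduced here for the example.

```python
import pandas as pd

# A few rows shaped like the cleaned Canada table.
postal = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park / Harbourfront"],
})

# Hypothetical lookup table (assumption): postal code -> approximate coordinates.
# M5A is left out on purpose to show what happens to unmatched codes.
coords = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Latitude": [43.7533, 43.7259],
    "Longitude": [-79.3297, -79.3156],
})

# A left join keeps every neighborhood; codes missing from the lookup get NaN.
located = postal.merge(coords, on="Postal code", how="left")

# Postal codes that still need coordinates from another source.
missing = located.loc[located["Latitude"].isna(), "Postal code"].tolist()

print(located)
print("No coordinates for:", missing)
```

A left join is used here so that a gap in the lookup table shows up as a missing value instead of silently dropping the neighborhood.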

A sample of the data is shown below.

In [16]:
#
# Canada data after cleaning
#
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find('table', class_='wikitable')  # the first wikitable holds the postal codes
# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
canada_data = pd.read_html(StringIO(str(table)))[0]  # Store results in a DataFrame
canada_data.columns = canada_data.columns.str.replace(r'\n', '', regex=True)  # Strip newlines from the column headers
canada_data = canada_data.replace(r'\n', '', regex=True)  # Strip newlines from the cell values
# Drop postal codes that have no borough assigned
canada_data = canada_data[canada_data['Borough'] != 'Not assigned']
canada_data.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


__New York__: https://cocl.us/new_york_dataset/newyork_data.json (curated list of New York neighborhoods)

A sample of the data is shown below.

In [15]:
import json

import pandas as pd
import requests

# Download the dataset with requests instead of `!wget`, which is not installed on every system
res = requests.get('https://cocl.us/new_york_dataset')
res.raise_for_status()
with open('newyork_data.json', 'wb') as json_file:
    json_file.write(res.content)

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})

# instantiate the dataframe
neighborhoods = pd.DataFrame(rows, columns=column_names)

manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)

#
# New York data after cleaning
#
manhattan_data.head()


| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


#### 2. Foursquare:
The Foursquare API gives us the following data:
* List of venues in each neighborhood
* Venue ratings
* Venue check-ins
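One way such a venue list could be turned into something comparable across cities is a per-neighborhood category profile, that is, the share of each venue category among a neighborhood's venues. The sketch below assumes the venues have already been fetched; the `(neighborhood, category)` pairs and the function name `category_profiles` are hypothetical and stand in for real API results.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs standing in for API results.
venues = [
    ("St. James Town", "Coffee Shop"),
    ("St. James Town", "Coffee Shop"),
    ("St. James Town", "Park"),
    ("St. James Town", "Pizza Place"),
    ("Chinatown", "Coffee Shop"),
    ("Chinatown", "Chinese Restaurant"),
]


def category_profiles(pairs):
    """Return {neighborhood: {category: share of that neighborhood's venues}}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1

    profiles = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        # Normalize counts so neighborhoods with many venues stay comparable
        # to neighborhoods with few.
        profiles[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return profiles


profiles = category_profiles(venues)
for name, profile in profiles.items():
    print(name, profile)
```

Shares are used instead of raw counts so that a dense neighborhood does not look different from a sparse one merely because it has more venues.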

We propose to solve the problem by finding similar venues across the source and destination neighborhoods. By comparing check-in times and location entropy across nearby similar venues, and by taking into account the geographical distance from the desired place of work, we can suggest a set of neighborhoods the user can relocate to.
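The ingredients of this approach can be sketched in a few small functions. This is an illustration under stated assumptions, not the final model: cosine similarity is one possible similarity measure, location entropy is taken here as the Shannon entropy of a venue's check-ins across users, and the scoring rule (similarity minus a per-kilometre penalty) with its `distance_weight` is invented for the example. The category profiles and the workplace coordinates are hypothetical placeholders; the neighborhood coordinates come from the Manhattan sample above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {category: weight} profiles."""
    dot = sum(w * b.get(cat, 0.0) for cat, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def location_entropy(checkins_per_user):
    """Shannon entropy (in nats) of a venue's check-ins spread across users."""
    total = sum(checkins_per_user)
    return -sum((c / total) * math.log(c / total) for c in checkins_per_user if c > 0)


def rank_candidates(source_profile, candidates, workplace, distance_weight=0.1):
    """Sort candidates by similarity minus a distance penalty (assumed scoring rule)."""
    scored = []
    for name, (profile, lat, lon) in candidates.items():
        similarity = cosine_similarity(source_profile, profile)
        distance = haversine_km(lat, lon, workplace[0], workplace[1])
        scored.append((name, similarity - distance_weight * distance, similarity, distance))
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Hypothetical profile for the source neighborhood.
source = {"Coffee Shop": 0.5, "Park": 0.25, "Pizza Place": 0.25}

# Coordinates from the Manhattan sample; the profiles are hypothetical.
candidates = {
    "Chinatown": ({"Coffee Shop": 0.5, "Chinese Restaurant": 0.5}, 40.715618, -73.994279),
    "Washington Heights": ({"Coffee Shop": 0.4, "Park": 0.4, "Pizza Place": 0.2}, 40.851903, -73.9369),
}

# Placeholder workplace location (assumption), roughly in Midtown Manhattan.
workplace = (40.7484, -73.9882)

for name, score, similarity, distance in rank_candidates(source, candidates, workplace):
    print(f"{name}: score={score:.3f} similarity={similarity:.3f} distance={distance:.1f} km")
```

With `distance_weight=0` the ranking depends on venue similarity alone, so the weight makes the trade-off between "similar to my current neighborhood" and "close to work" explicit.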