# Capstone Project - The Battle of Neighborhoods

## Title:

### Finding similar neighborhoods across cities based on their venues

## 1. Introduction: 
### 1.1 Background:
People often have to relocate to a different city because of a job change, and it is rarely obvious which neighborhood to move to. Several questions arise:
* Should I find the neighborhood which is closer to the new workplace? 
* Should I find the neighborhood which is similar to my current neighborhood?
* Should I explore some unique neighborhoods around?
* Does the cost of living matter there? Does the crime rate matter compared to my current neighborhood?

Many more questions arise as the person starts exploring the city.

### 1.2 Business Problem:
For each city we can obtain the neighborhood data, the venues around each neighborhood, the distance between neighborhoods and venues, venue ratings, and crime rate data. Using this data, the project aims to identify the neighborhoods in the destination city that are similar to the user's current neighborhood and also closest to the new place of work.

### 1.3 Interest:
The target audience is working professionals who want to relocate to a new city for better job opportunities.

For instance, a person named "Adam" living in "St. James Town, Downtown Toronto, Canada" has accepted a job in "Midtown South, Manhattan, New York". The proposed model will suggest the Manhattan neighborhoods that are similar to Adam's current neighborhood and also closest to his new workplace.

## 2. Data Acquisition and Cleaning

### 2.1 City Data:
#### 2.1.1 __Canada__: 
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The following attributes are fetched from the Canada data:
* Postal code
* Neighborhood
* Borough

The latitude and longitude of each location can be looked up from its postal code.

For this project, "St. James Town, Downtown Toronto, Canada" is the source neighborhood the user is relocating from.

A sample of the data is shown below.

In [31]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

#### Get the table data from Wiki

In [8]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find_all('table', class_ = 'wikitable') 

#### Cleanup the table data and store it into the DataFrame

In [28]:
toronto_data = pd.read_html(str(table))[0] # Store the first table in a DataFrame

toronto_data.columns = toronto_data.columns.str.replace(r'\\n', '', regex=True) # Clean up \n in the column headers

toronto_data = toronto_data.replace(r'\\n','', regex=True) # Clean up \n in all rows

toronto_data = toronto_data.replace(r'/',',', regex=True) # Replace / with , in all rows

# Drop the rows whose borough is 'Not assigned'
indexNames = toronto_data[ toronto_data['Borough'] == 'Not assigned' ].index

toronto_data.drop(indexNames, inplace=True)

toronto_data = toronto_data.reset_index(drop = True) # Renumber the rows after dropping

#### Find out the Latitude and Longitude of each Neighborhood in Toronto

In [29]:
lat_lng_csv = pd.read_csv('https://cocl.us/Geospatial_data')

lat_lng_csv.rename(columns = {'Postal Code': 'Postal code'}, inplace = True)

toronto_data = pd.merge(toronto_data, lat_lng_csv[['Postal code', 'Latitude', 'Longitude']], on = 'Postal code')

toronto_data.drop(columns = ['Postal code'], inplace=True)

toronto_data = toronto_data.loc[toronto_data['Borough'].str.contains('Downtown Toronto')].reset_index(drop = True)

#### Filter "St. James Town, Downtown Toronto, Canada" from the data set, as it is the source neighborhood

In [32]:
users_current_data = toronto_data[toronto_data['Neighborhood'] == 'St. James Town']

#
# User's current Neighborhood
#
users_current_data.head()

            Borough    Neighborhood   Latitude  Longitude
3  Downtown Toronto  St. James Town  43.651494 -79.375418


#### 2.1.2 __New York__: 

https://cocl.us/new_york_dataset/newyork_data.json (curated list of New York neighborhoods)

Manhattan, New York is the destination for this project.

A sample of the data is shown below.

In [48]:
import json
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood (DataFrame.append was removed in pandas 2.0)
rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)
    
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)

#
# Manhattan, New York data after cleaning
#
print(manhattan_data.head())

print('\n Manhattan has {} neighborhoods'.format(manhattan_data.shape[0]))

/bin/sh: wget: command not found
     Borough        Neighborhood   Latitude  Longitude
0  Manhattan         Marble Hill  40.876551 -73.910660
1  Manhattan           Chinatown  40.715618 -73.994279
2  Manhattan  Washington Heights  40.851903 -73.936900
3  Manhattan              Inwood  40.867684 -73.921210
4  Manhattan    Hamilton Heights  40.823604 -73.949688

 Manhattan has 40 neighborhoods
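
With coordinates available for each neighborhood, the "nearest to the new workplace" part of the problem can be approximated with a great-circle distance. The block below is a minimal sketch, not the project's actual method: it uses the haversine formula on the five sample rows printed above, and the `workplace` coordinates are a hypothetical placeholder rather than a value taken from the data set.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Sample rows copied from the Manhattan output above
sample = [
    ("Marble Hill", 40.876551, -73.910660),
    ("Chinatown", 40.715618, -73.994279),
    ("Washington Heights", 40.851903, -73.936900),
    ("Inwood", 40.867684, -73.921210),
    ("Hamilton Heights", 40.823604, -73.949688),
]

# Hypothetical workplace coordinates (an assumption, not from the data set)
workplace = (40.7485, -73.9885)

# Sort the neighborhoods from nearest to farthest from the workplace
ranked = sorted(sample, key=lambda row: haversine_km(row[1], row[2], *workplace))
for name, lat, lon in ranked:
    print(f"{name}: {haversine_km(lat, lon, *workplace):.1f} km")
```

Straight-line distance ignores transit and street layout, so it is only a first filter; a commute-time estimate would be a closer match to what a relocating user cares about.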


### 2.2 Foursquare API:
The Foursquare API gives us the following data:
* List of venues in each neighborhood
* Venue ratings
* Venue check-ins

We propose to solve the problem by finding similar venues across the source and destination neighborhoods. By comparing the check-in times and the location entropy of nearby similar venues, and by taking into account the geographical distance from the desired work location, we can suggest a set of neighborhoods the user can relocate to.
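
The text does not define location entropy, so the sketch below uses one common definition as an assumption: the Shannon entropy of the distribution of a venue's check-ins over distinct users. A venue visited by many different users scores high, while a venue visited repeatedly by one user scores zero. The `checkins` dictionary is made-up illustrative data, and the per-user check-in list is a hypothetical input format, not something the Foursquare calls in this notebook return.

```python
import math
from collections import Counter

def location_entropy(checkin_user_ids):
    """Shannon entropy (natural log) of a venue's check-ins across users."""
    counts = Counter(checkin_user_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p is the fraction of the venue's check-ins made by one user
    return sum((c / total) * math.log(total / c) for c in counts.values())

# Made-up check-in logs: one user id per check-in
checkins = {
    "Venue A": ["u1", "u1", "u1", "u1"],   # one loyal visitor
    "Venue B": ["u1", "u2", "u3", "u4"],   # four different visitors
}

for venue, users in checkins.items():
    print(venue, round(location_entropy(users), 3))
```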

The Foursquare API output below shows the number of venues in each neighborhood and the number of unique categories in the data set.

#### Define Foursquare credentials and version

In [35]:
import os
# Credentials are read from the environment rather than hardcoded (the variable names are assumptions)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100

#### Define a function to get the venues near the neighborhood

In [36]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    print("Below are the Venue information for the respective Neighborhoods:")
    for name, lat, lng in zip(names, latitudes, longitudes):            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        print('{} : {} Venues'.format(name, len(results)))

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
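
The function above indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back without a category. As a hedged sketch, the row extraction can be pulled into a helper that tolerates missing fields and can be tried offline. `venue_rows` and `mock_items` are hypothetical names introduced here; the mock payload only mirrors the keys that `getNearbyVenues` reads and is not real API output.

```python
def venue_rows(name, lat, lng, items):
    """Flatten explore-style items into tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     category))
    return rows

# Mock items with the same keys the function above reads (illustrative only)
mock_items = [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 40.0, 'lng': -73.0},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 40.1, 'lng': -73.1},
               'categories': []}},
]

rows = venue_rows('Example Neighborhood', 40.05, -73.05, mock_items)
for row in rows:
    print(row)
```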

In [37]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Below are the Venue information for the respective Neighborhoods:
Marble Hill : 25 Venues
Chinatown : 100 Venues
Washington Heights : 88 Venues
Inwood : 55 Venues
Hamilton Heights : 62 Venues
Manhattanville : 47 Venues
Central Harlem : 44 Venues
East Harlem : 41 Venues
Upper East Side : 94 Venues
Yorkville : 100 Venues
Lenox Hill : 100 Venues
Roosevelt Island : 26 Venues
Upper West Side : 85 Venues
Lincoln Square : 100 Venues
Clinton : 100 Venues
Midtown : 100 Venues
Murray Hill : 100 Venues
Chelsea : 100 Venues
Greenwich Village : 100 Venues
East Village : 100 Venues
Lower East Side : 55 Venues
Tribeca : 91 Venues
Little Italy : 100 Venues
Soho : 100 Venues
West Village : 100 Venues
Manhattan Valley : 52 Venues
Morningside Heights : 39 Venues
Gramercy : 92 Venues
Battery Park City : 80 Venues
Financial District : 100 Venues
Carnegie Hill : 92 Venues
Noho : 100 Venues
Civic Center : 100 Venues
Midtown South : 100 Venues
Sutton Place : 100 Venues
Turtle Bay : 100 Venues
Tudor City : 78 Venues
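
Once each neighborhood has a list of venue categories, one way to compare neighborhoods is to turn each list into a category frequency profile and score pairs of profiles. The sketch below uses cosine similarity, which is an assumption made for illustration, since the text does not name a similarity measure. The category lists are made up, and `category_profile` and `cosine_similarity` are hypothetical helper names.

```python
import math
from collections import Counter

def category_profile(categories):
    """Map each venue category to its share of the neighborhood's venues."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(value * q.get(cat, 0.0) for cat, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Made-up category lists for a source neighborhood and two candidates
source = ['Coffee Shop', 'Coffee Shop', 'Gym', 'Pizza Place']
candidates = {
    'Candidate A': ['Coffee Shop', 'Gym', 'Pizza Place', 'Coffee Shop'],
    'Candidate B': ['Hotel', 'Steakhouse', 'Bank'],
}

source_profile = category_profile(source)
scores = {name: cosine_similarity(source_profile, category_profile(cats))
          for name, cats in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Using shares rather than raw counts keeps neighborhoods that hit the API limit of 100 venues comparable with those that returned far fewer.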

In [45]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))
print('\n')
print(manhattan_venues['Venue Category'].drop_duplicates().reset_index(drop=True))

There are 321 unique categories.


0                          Pizza Place
1                          Yoga Studio
2                                Diner
3                           Donut Shop
4                          Coffee Shop
5                                  Gym
6                     Department Store
7                   Seafood Restaurant
8                       Tennis Stadium
9                             Pharmacy
10                      Discount Store
11                     Supplement Shop
12                      Ice Cream Shop
13                      Sandwich Place
14                  Miscellaneous Shop
15                 American Restaurant
16                                Bank
17                          Steakhouse
18                          Kids Store
19                       Shopping Mall
20                       Deli / Bodega
21                               Hotel
22                   Hotpot Restaurant
23                    Greek Restaurant
24                        Co