# Capstone Project - The Battle of the Neighborhoods (Week 1) By Taoufik Ghozzi
### Applied Data Science Capstone by IBM/Coursera

This notebook is the capstone project for the nine-course IBM Data Science specialization on Coursera: [Data Science Specialization](https://www.coursera.org/specializations/ibm-data-science-professional-certificate)  
This project tries to answer a question by applying data science methods to location data obtained from the Foursquare API.  


## Table of contents
* [Introduction: Business Problem](#introduction)
* [Project Summary](#project_summary)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

New York is one of the largest metropolises in the world: over 8 million people live there, at a population density of 10,715 people per square kilometer. As a resident of this city, I decided to use New York in my project. The city is divided into 5 boroughs. Because these boroughs are squeezed into an area of approximately 783 square kilometers, the city has a very intertwined and mixed structure [1].

As these figures show, New York has both a large population and a high population density. In such a crowded city, owners of shops and social venues tend to set up where the population is dense. From an investor's point of view, we would expect a preference for districts where real estate costs are lower and where the type of business they want to open is less concentrated. City residents may likewise prefer areas where real estate values are lower, and they may also choose a district according to its density of social venues. Today, however, it is difficult to find information that guides investors and residents in this direction.

To address these problems, we can build a map and an information chart in which a real estate price index is overlaid on New York and each district is clustered according to its venue density.

In [202]:
#!conda install -c conda-forge beautifulsoup4 --yes

#!conda install -c conda-forge geopy --yes

!conda install -c conda-forge folium=0.5.0 --yes

print('Libraries installed!')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.0.0               |             py_0         606 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.1 MB

The following NEW packages will be 

In [204]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests
import json

from bs4 import BeautifulSoup

from geopy.geocoders import Nominatim

import folium
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.preprocessing import StandardScaler, normalize, scale
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score

print('Libraries imported!')

Libraries imported!


## Project Summary  <a name="project_summary"></a>
This project uses data science techniques to examine the following questions:  
- Can the surrounding venues affect the price of real estate?  
- What kinds of surrounding venues affect the price, and to what extent?  
- Can we use the surrounding venues to estimate the value of a home relative to the average price of its area, and with what degree of confidence?  

The data will be:
- Average price of one standard residential unit in each New York City neighborhood. ([Kaggle](https://www.kaggle.com/new-york-city/nyc-property-sales))
- Venues surrounding each neighborhood. ([Foursquare API](https://developer.foursquare.com/))  

The target audience will be:
- Home buyers, who can roughly estimate the value of a target house relative to the average.  
- Planners, who can decide which venues to place around their product so that the price is maximized.  
- Anyone wondering whether a building under construction will affect the value of their home.

## Data <a name="data"></a>
### 1. Download and explore Real Estate data:  

We will use NYC real estate data from https://www.kaggle.com/new-york-city/nyc-property-sales, which covers properties sold in New York City over a 12-month period from September 2016 to September 2017. The full raw data was downloaded in CSV format and uploaded to *Google Drive*; next, we download the **nyc-sales.csv** file using a shared download link.

In [178]:
df = pd.read_csv('https://drive.google.com/uc?authuser=0&id=1GA4cejxOrMzS1PTouiNmPBjfCVo6bJqH&export=download')
print('Data downloaded!')

Data downloaded!


In [179]:
df.shape

(84548, 22)

This raw data contains a lot of entries; we will focus only on entries with at most one residential unit and no commercial units.

In [180]:
df.head()

Unnamed: 0.1,Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,APARTMENT NUMBER,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
0,4,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,392,6,,C2,153 AVENUE B,,10009,5,0,5,1633,6440,1900,2,C2,6625000,2017-07-19 00:00:00
1,5,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,26,,C7,234 EAST 4TH STREET,,10009,28,3,31,4616,18690,1900,2,C7,-,2016-12-14 00:00:00
2,6,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,39,,C7,197 EAST 3RD STREET,,10009,16,1,17,2212,7803,1900,2,C7,-,2016-12-09 00:00:00
3,7,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,402,21,,C4,154 EAST 7TH STREET,,10009,10,0,10,2272,6794,1913,2,C4,3936272,2016-09-23 00:00:00
4,8,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,404,55,,C2,301 EAST 10TH STREET,,10009,6,0,6,2369,4615,1900,2,C2,8000000,2016-11-17 00:00:00


In [181]:
df = df[df['RESIDENTIAL UNITS'].isin([0,1]) & (df['COMMERCIAL UNITS'] == 0)]

In [182]:
df.shape

(56574, 22)

In [183]:
df = df[['BOROUGH', 'NEIGHBORHOOD', 'SALE PRICE']]
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,SALE PRICE
12,1,ALPHABET CITY,1
13,1,ALPHABET CITY,499000
14,1,ALPHABET CITY,10
15,1,ALPHABET CITY,529500
16,1,ALPHABET CITY,423000


In [184]:
#drop all undefined prices
df.drop(df.loc[df['SALE PRICE']==' -  '].index, inplace=True)

In [185]:
df = df.astype({'SALE PRICE': int})
df = df.sort_values(by='SALE PRICE', ascending=True, na_position='first')
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,SALE PRICE
45203,3,PARK SLOPE SOUTH,0
30334,3,BRIGHTON BEACH,0
30333,3,BRIGHTON BEACH,0
30323,3,BRIGHTON BEACH,0
30320,3,BRIGHTON BEACH,0


In [186]:
df.drop(df.loc[df['SALE PRICE']< 100000].index, inplace=True)
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,SALE PRICE
70928,4,RIDGEWOOD,100000
62940,4,HOLLISWOOD,100000
30660,3,BRIGHTON BEACH,100000
38370,3,FLATBUSH-EAST,100000
6075,1,HARLEM-UPPER,100000
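
The cleaning steps above (dropping the `' -  '` placeholder, casting to integers, removing prices under 100,000) could also be sketched in a single pass with `pd.to_numeric`. This is only a minimal illustration on a toy series, not the notebook's actual data: the `raw_prices` values and the `min_price` name are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the 'SALE PRICE' column (values are illustrative only)
raw_prices = pd.Series([' -  ', '499000', '10', '529500'], name='SALE PRICE')

min_price = 100000  # same threshold as the cell above

# errors='coerce' turns unparsable strings such as ' -  ' into NaN
prices = pd.to_numeric(raw_prices, errors='coerce')

# drop missing prices, then keep only sales at or above the threshold
clean = prices.dropna().astype(int)
clean = clean[clean >= min_price]

print(clean.tolist())
```

Coercing first means any other non-numeric placeholder would be dropped too, which may or may not be desirable depending on how strictly the raw file should be validated.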


In [187]:
# collapse neighborhood variants into one name; .loc avoids chained assignment (and its SettingWithCopyWarning)
df.loc[df['NEIGHBORHOOD'].str.contains('UPPER EAST SIDE'), 'NEIGHBORHOOD'] = 'UPPER EAST SIDE'
df.loc[df['NEIGHBORHOOD'].str.contains('UPPER WEST SIDE'), 'NEIGHBORHOOD'] = 'UPPER WEST SIDE'
df.loc[df['NEIGHBORHOOD'].str.contains('WASHINGTON HEIGHTS'), 'NEIGHBORHOOD'] = 'WASHINGTON HEIGHTS'


In [188]:
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,SALE PRICE
70928,4,RIDGEWOOD,100000
62940,4,HOLLISWOOD,100000
30660,3,BRIGHTON BEACH,100000
38370,3,FLATBUSH-EAST,100000
6075,1,HARLEM-UPPER,100000


In [189]:
df = df.groupby(df['NEIGHBORHOOD'], as_index=False).mean()
df.head()

Unnamed: 0,NEIGHBORHOOD,BOROUGH,SALE PRICE
0,AIRPORT LA GUARDIA,4.0,455375.0
1,ALPHABET CITY,1.0,1711082.0
2,ANNADALE,5.0,576796.1
3,ARDEN HEIGHTS,5.0,379837.5
4,ARROCHAR,5.0,521087.5


In [190]:
# map borough codes to names in one step (avoids chained assignment and its SettingWithCopyWarning)
df['BOROUGH'] = df['BOROUGH'].replace({1: 'MANHATTAN',
                                       2: 'BRONX',
                                       3: 'BROOKLYN',
                                       4: 'QUEENS',
                                       5: 'STATEN ISLAND'})


In [191]:
df.rename(columns = {'SALE PRICE': 'AHP'}, inplace=True)
new_col_order = ['BOROUGH', 'NEIGHBORHOOD', 'AHP']
df = df[new_col_order]
df = df.astype({'AHP': int})
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,AHP
0,QUEENS,AIRPORT LA GUARDIA,455375
1,MANHATTAN,ALPHABET CITY,1711081
2,STATEN ISLAND,ANNADALE,576796
3,STATEN ISLAND,ARDEN HEIGHTS,379837
4,STATEN ISLAND,ARROCHAR,521087


In [192]:
df.rename(columns = {'BOROUGH':'Borough', 'NEIGHBORHOOD':'Neighborhood'}, inplace=True)
df['Borough'] = df['Borough'].str.capitalize() 
df['Neighborhood'] = df['Neighborhood'].str.capitalize() 

df.head()

Unnamed: 0,Borough,Neighborhood,AHP
0,Queens,Airport la guardia,455375
1,Manhattan,Alphabet city,1711081
2,Staten island,Annadale,576796
3,Staten island,Arden heights,379837
4,Staten island,Arrochar,521087


In [193]:
df.shape

(246, 3)

In [195]:
df.head()

Unnamed: 0,Borough,Neighborhood,AHP
0,Queens,Airport la guardia,455375
1,Manhattan,Alphabet city,1711081
2,Staten island,Annadale,576796
3,Staten island,Arden heights,379837
4,Staten island,Arrochar,521087


### 2. Download and Explore Neighborhood Dataset

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood. 

Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

For your convenience, I downloaded the file and placed it on the server, so you can simply run a `wget` command to access the data. So let's go ahead and do that.

In [196]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

Next, let's load the data.

In [54]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Let's take a quick look at the data.

In [55]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Notice how all the relevant data is in the *features* key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [109]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.

In [110]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [111]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [112]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data and fill the dataframe one row at a time.

In [113]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

Quickly examine the resulting dataframe.

In [130]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
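
The row-by-row loop above relies on `DataFrame.append`, which was removed in pandas 2.0. On a recent pandas, one possible alternative is to collect the rows in a list and build the dataframe once. The sketch below uses two features copied from the GeoJSON shown earlier; the names `features`, `rows` and `neighborhoods_alt` are chosen for this example only.

```python
import pandas as pd

# Two features copied from the GeoJSON above (coordinates are [longitude, latitude])
features = [
    {'geometry': {'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'coordinates': [-73.82993910812398, 40.87429419303012]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]

# Build one dict per neighborhood, then create the dataframe in a single call
rows = [
    {'Borough': f['properties']['borough'],
     'Neighborhood': f['properties']['name'],
     'Latitude': f['geometry']['coordinates'][1],
     'Longitude': f['geometry']['coordinates'][0]}
    for f in features
]
neighborhoods_alt = pd.DataFrame(
    rows, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])

print(neighborhoods_alt)
```

Building the frame once also avoids copying the whole dataframe on every iteration, which is what repeated appends do.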


#### Merging the sales dataset with neighborhood coordinates

In this part we merge the neighborhood sales data (**df**) with the neighborhood coordinates (**neighborhoods**), keeping only the neighborhoods that match.

First, check the size of both datasets.

In [173]:
df.shape

(246, 4)

In [174]:
neighborhoods.shape

(306, 4)

Currently there is a mismatch between the two datasets, so in the next steps we consider only neighborhoods that appear in both.
First, because the two datasets come from different sources, we need to check which neighborhood names are spelled similarly in both.

We will use difflib's `get_close_matches` to get the closest match.

In [197]:
import difflib

neighborhood_names = list(neighborhoods['Neighborhood'].unique())

#function to define the matching elements
def closest(a):
    try:
        return difflib.get_close_matches(a, neighborhood_names)[0]
    except IndexError:
        return "Not Found"
    
df['closest_names'] = df.apply(lambda row: closest(row['Neighborhood']), axis=1)

df.head()

Unnamed: 0,Borough,Neighborhood,AHP,closest_names
0,Queens,Airport la guardia,455375,Not Found
1,Manhattan,Alphabet city,1711081,Not Found
2,Staten island,Annadale,576796,Annadale
3,Staten island,Arden heights,379837,Arden Heights
4,Staten island,Arrochar,521087,Arrochar
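
How strict the matching is depends on the `cutoff` argument of `difflib.get_close_matches`, which defaults to 0.6 (the function returns at most `n` matches, 3 by default). The sketch below shows the effect on a toy candidate list; the `candidates` list and the two-argument variant of `closest` are assumptions for this illustration, not the notebook's data.

```python
import difflib

# Toy candidate list (illustrative only)
candidates = ['Arden Heights', 'Annadale', 'Arrochar']

def closest(name, choices, cutoff=0.6):
    """Return the best match for name, or 'Not Found' if nothing reaches cutoff."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else 'Not Found'

# One differing character ('h' vs 'H') still clears the default cutoff
print(closest('Arden heights', candidates))

# A stricter cutoff rejects the same pair
print(closest('Arden heights', candidates, cutoff=0.95))

# A name with no similar candidate falls back to 'Not Found'
print(closest('Zzzz', candidates))
```

A lower cutoff matches more names but raises the risk of pairing two different neighborhoods, so it is worth inspecting the matched pairs before merging.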


We need to define a cleaner version of the comparison dataframe.

In [166]:
# remove all the 'Not Found' occurrences; .copy() avoids SettingWithCopyWarning on the in-place edits below
new_df = df[df['closest_names'] != 'Not Found'].copy()

# remove the old Neighborhood names and replace them with the ones matching the 'neighborhoods' dataframe
new_df.drop('Neighborhood', axis=1, inplace=True)
new_df.rename(columns={'closest_names': 'Neighborhood'}, inplace=True)

# rearrange the dataframe
new_df = new_df[['Borough', 'Neighborhood', 'AHP']]

new_df.head()

Unnamed: 0,Borough,Neighborhood,AHP
2,Staten island,Annadale,576796
3,Staten island,Arden Heights,379837
4,Staten island,Arrochar,521087
6,Queens,Arverne,328098
7,Queens,Astoria,729625


Because the matching can produce duplicate names, we need to recalculate the AHP.

In [198]:
# average only the numeric AHP column (the text column 'Borough' cannot be averaged)
new_df = new_df.groupby('Neighborhood', as_index=False)['AHP'].mean()
new_df.head()

Unnamed: 0,Neighborhood,AHP
0,Annadale,576796.0
1,Arden Heights,379837.0
2,Arlington,476349.0
3,Arrochar,521087.0
4,Arverne,328098.0


Time to merge the new_df and neighborhoods 

In [199]:
merged = pd.merge(neighborhoods, new_df, on="Neighborhood")
merged = merged.astype({'AHP': int})

In [209]:
nyc_data = merged
nyc_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,AHP
0,Bronx,Wakefield,40.894705,-73.847201,353173
1,Bronx,Co-op City,40.874294,-73.829939,430000
2,Bronx,Eastchester,40.887556,-73.827806,333493
3,Bronx,Fieldston,40.895437,-73.905643,1124792
4,Bronx,Riverdale,40.890834,-73.912585,453971
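
`pd.merge` performs an inner join by default, so neighborhoods present in only one of the two dataframes are dropped silently. To see which rows would be lost, one option is an outer join with `indicator=True`. The sketch below uses toy frames; `left`, `right`, `check` and the values in them are made up for the illustration.

```python
import pandas as pd

# Toy stand-ins for 'neighborhoods' and 'new_df' (values are illustrative only)
left = pd.DataFrame({'Neighborhood': ['Wakefield', 'Co-op City', 'Inwood']})
right = pd.DataFrame({'Neighborhood': ['Wakefield', 'Inwood', 'Annadale'],
                      'AHP': [353173, 400000, 576796]})

# how='outer' keeps every row; indicator=True adds a '_merge' column
# whose value is 'both', 'left_only' or 'right_only'
check = pd.merge(left, right, on='Neighborhood', how='outer', indicator=True)

print(check[check['_merge'] != 'both'])
```

Rows flagged `left_only` have coordinates but no price, and rows flagged `right_only` have a price but no coordinates; the inner join used in the notebook keeps only the `both` rows.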


#### Use geopy library to get the latitude and longitude values of New York City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [201]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


#### Create a map of New York with neighborhoods superimposed on top.

In [205]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [206]:
# Never publish real credentials in a shared notebook; replace the placeholders with your own values
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Credentials set.')

Credentials set.


### 3. Explore Neighborhoods in New York City

#### Let's create a function to repeat the same process for all the neighborhoods in New York City

In [211]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
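
The list comprehension in `getNearbyVenues` reads `v['venue']['categories'][0]['name']`, which would raise an `IndexError` if a venue came back with an empty category list. A more defensive way to turn the returned items into rows might look like the sketch below. It runs offline on a hand-written sample that only mimics the fields the function reads; `items_to_rows`, `sample_items` and the venue values are hypothetical and not part of the Foursquare API.

```python
import pandas as pd

def items_to_rows(name, lat, lng, items):
    """Convert a list of 'items' (as read by getNearbyVenues) into row tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Hand-written sample with the same nesting as the fields used above (illustrative only)
sample_items = [
    {'venue': {'name': 'Venue A', 'location': {'lat': 40.8941, 'lng': -73.8459},
               'categories': [{'name': 'Dessert Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 40.8966, 'lng': -73.8448},
               'categories': []}},
]

columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
           'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
sample_venues = pd.DataFrame(
    items_to_rows('Wakefield', 40.894705, -73.847201, sample_items),
    columns=columns)

print(sample_venues)
```

Separating the parsing from the HTTP request also makes it possible to test the row-building logic without calling the API.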

#### Now run the above function on each neighborhood and create a new dataframe called *nyc_venues*.

In [213]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

nyc_venues = getNearbyVenues(names=nyc_data['Neighborhood'],
                                   latitudes=nyc_data['Latitude'],
                                   longitudes=nyc_data['Longitude'],
                                   radius=radius
                                  )


Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Woodlawn
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
Fordham
East Tremont
Hunts Point
Morrisania
Soundview
Throgs Neck
Country Club
Parkchester
Morris Park
Belmont
Schuylerville
Castle Hill
Pelham Gardens
Concourse
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Flatbush
Crown Heights
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Canarsie
Flatlands
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker Heights
Gerritsen Beach
Marine Park
Clinton Hill
Sea Gate
Downtown
Boerum Hill
Ocean Hill
Bergen Beach
Midwood
South Side
Ocean Parkway
Chinatown
Washington Heights
Inwood
Roosevelt Island
Upper West Side
Clinton
Midtown
Murray Hill
Murray Hill
Chelsea
Chelsea
Greenwich Village
East Village
Lower East

#### Let's check the size of the resulting dataframe

In [214]:
print(nyc_venues.shape)
nyc_venues.head()

(6666, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,Wakefield,40.894705,-73.847201,Shell,40.894187,-73.845862,Gas Station
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


Let's check how many venues were returned for each neighborhood

In [215]:
nyc_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Annadale,9,9,9,9,9,9
Arden Heights,4,4,4,4,4,4
Arlington,5,5,5,5,5,5
Arrochar,18,18,18,18,18,18
Arverne,18,18,18,18,18,18
Astoria,100,100,100,100,100,100
Bath Beach,46,46,46,46,46,46
Bay Ridge,87,87,87,87,87,87
Baychester,20,20,20,20,20,20
Bayside,74,74,74,74,74,74


#### Let's find out how many unique categories can be curated from all the returned venues

In [216]:
print('There are {} unique categories.'.format(len(nyc_venues['Venue Category'].unique())))

There are 390 unique categories.


Let's summarize all the data needed for the analysis phase in one dataframe.

In [220]:
nyc_full = pd.merge(nyc_venues, nyc_data, on="Neighborhood")

In [221]:
nyc_full.drop('Latitude', axis=1, inplace=True)
nyc_full.drop('Longitude', axis=1, inplace=True)

In [222]:
nyc_full.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Borough,AHP
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop,Bronx,353173
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy,Bronx,353173
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop,Bronx,353173
3,Wakefield,40.894705,-73.847201,Shell,40.894187,-73.845862,Gas Station,Bronx,353173
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop,Bronx,353173


**This is the end of data collection and preprocessing.**  
**nyc_full is the dataframe that will be used in the analysis step below.**