# Getting and Cleaning Manhattan Neighborhood Data

## Introduction

This notebook combines and cleans data from different sources to get geographical, demographic, and economic data for each neighborhood in Manhattan.

NYC census tract data was downloaded from this [Kaggle dataset](https://www.kaggle.com/muonneutrino/new-york-city-census-data). This dataset gives demographic, economic, and commuting information for each census tract in NYC. Each neighborhood consists of multiple census tracts, so we also need data to map these census tracts to their respective neighborhoods. This [Neighborhood Tabulation Area dataset](https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page) consists of a GeoJSON file of each neighborhood tabulation area, as well as a 2010 Census Tract to 2010 Neighborhood Tabulation Area Equivalency CSV file, specifying the neighborhood each census tract belongs to.

Finally, the latitude and longitude representing each neighborhood are calculated. Using the [Shapely package](https://pypi.org/project/Shapely/), the set of coordinates representing each neighborhood is converted into a polygon. The centroid of the polygon is then calculated, resulting in representative latitude and longitude coordinates for the neighborhood.


This data is combined and cleaned in order to create a dataset consisting of the following information for each neighborhood:
* Latitude and longitude
* Area
* Total population
* Population density
* Population breakdown by gender
* Population breakdown by race
* Median income
* Percent living in poverty
* Percent walking to work
* Mean commute time
* Unemployment rate

## Table of Contents
* [1. Importing and Cleaning Neighborhood Tabulation Area Data](#first-bullet)
* [2. Importing and Cleaning Census Tract Data](#second-bullet)
* [3. Calculating Census Data by Neighborhood](#third-bullet)
* [4. Adding Additional Features](#fourth-bullet)
* [5. Visualization](#fifth-bullet)

### Importing the required libraries

In [28]:
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from shapely.geometry import Polygon # library to create and handle polygons

from geopy.geocoders import Nominatim # geocoder to convert an address into latitude and longitude values

# uncomment the following line if folium is not already installed
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Importing and Cleaning Neighborhood Tabulation Area Data <a class="anchor" id="first-bullet"></a>

In this section, we load the GeoJSON boundaries and the census-tract-to-neighborhood equivalency table, keeping only the entries for Manhattan.

### GeoJSON data

We start by loading the [GeoJSON file](https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page) for the tabulated NYC neighborhoods. We are only interested in the neighborhoods in Manhattan, so we drop the information for all other neighborhoods. We also want to drop the feature named "park-cemetery-etc-Manhattan", as this is not a useful neighborhood for us.

In [2]:
# Load geojson features of all neighborhoods in NYC
with open("nyc-neighborhood-tabulation-areas.geojson") as f:
    manhattan_geo = json.load(f)

In [3]:
# Extract only the Manhattan features
borough_name = "Manhattan"

manhattan_geo['features'] = [nbh for nbh in manhattan_geo['features'] if nbh['properties']['BoroName'] == borough_name]

# Drop feature named "park-cemetery-etc-Manhattan"
manhattan_geo['features'] = [nbh for nbh in manhattan_geo['features'] if nbh['properties']['NTAName'] != 'park-cemetery-etc-Manhattan']

### Census Tract to Neighborhood Tabulation Area Equivalency

In order to understand the census data, which is organized by "census tract" rather than neighborhood, we need to find which neighborhood each census tract belongs to. To do this, we load the [2010 Neighborhood Tabulation Area Equivalency CSV file](https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page).

In [4]:
census_tab_equiv_df = pd.read_csv('nyc2010census_tabulation_equiv.csv', dtype=str, skiprows=3)

census_tab_equiv_df.head()

Unnamed: 0,Borough,2010 Census Bureau FIPS County Code,2010 NYC Borough Code,2010 Census Tract,PUMA,Neighborhood Tabulation Area (NTA),Unnamed: 6
0,,,,,,Code,Name
1,Bronx,5.0,2.0,31000.0,3704.0,BX31,Allerton-Pelham Gardens
2,Bronx,5.0,2.0,31200.0,3704.0,BX31,Allerton-Pelham Gardens
3,Bronx,5.0,2.0,31400.0,3704.0,BX31,Allerton-Pelham Gardens
4,Bronx,5.0,2.0,31600.0,3704.0,BX31,Allerton-Pelham Gardens


Now let's clean the data by renaming the columns to be consistent with the census data, and by extracting the rows for Manhattan.

In [5]:
# Rename columns
census_tab_equiv_df = census_tab_equiv_df.rename({'2010 Census Tract' : 'CensusTract', 
                                                  'Neighborhood Tabulation Area (NTA)': 'NTA Code', 
                                                  'Unnamed: 6': 'Neighborhood'}, axis=1)

# Extract Manhattan rows
census_tab_equiv_df = census_tab_equiv_df[census_tab_equiv_df['Borough'] == 'Manhattan']

census_tab_equiv_df.head()

Unnamed: 0,Borough,2010 Census Bureau FIPS County Code,2010 NYC Borough Code,CensusTract,PUMA,NTA Code,Neighborhood
1101,Manhattan,61,1,700,3810,MN25,Battery Park City-Lower Manhattan
1102,Manhattan,61,1,900,3810,MN25,Battery Park City-Lower Manhattan
1103,Manhattan,61,1,1300,3810,MN25,Battery Park City-Lower Manhattan
1104,Manhattan,61,1,1501,3810,MN25,Battery Park City-Lower Manhattan
1105,Manhattan,61,1,1502,3810,MN25,Battery Park City-Lower Manhattan


## 2. Importing and Cleaning Census Tract Data <a class="anchor" id="second-bullet"></a>

While the datasets built in Step 1 provide geographical data for each Manhattan neighborhood, we are still lacking demographic and economic information. The United States Census Bureau produces data about the American people and economy, and organizes it by areas called *census tracts*, generally encompassing 2,500-8,000 people. This [Kaggle dataset](https://www.kaggle.com/muonneutrino/new-york-city-census-data) provides such data for NYC, but does not specify what neighborhood each census tract belongs to. We start by importing this dataset.

In [6]:
census_tract_df = pd.read_csv('nyc_census_tracts.csv')

census_tract_df.head()

Unnamed: 0,CensusTract,County,Borough,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Citizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,36005000100,Bronx,Bronx,7703,7133,570,29.9,6.1,60.9,0.2,1.6,6476,,,2440.0,373.0,,,,,,,,,,,,,,,0,,,,,
1,36005000200,Bronx,Bronx,5403,2659,2744,75.8,2.3,16.0,0.0,4.2,3639,72034.0,13991.0,22180.0,2206.0,20.0,20.7,28.7,17.1,23.9,8.0,22.3,44.8,13.7,38.6,2.9,0.0,0.0,43.0,2308,80.8,16.2,2.9,0.0,7.7
2,36005000400,Bronx,Bronx,5915,2896,3019,62.7,3.6,30.7,0.0,0.3,4100,74836.0,8407.0,27700.0,2449.0,13.2,23.6,32.2,23.4,24.9,9.0,10.5,41.3,10.0,44.6,1.4,0.5,2.1,45.0,2675,71.7,25.3,2.5,0.6,9.5
3,36005001600,Bronx,Bronx,5879,2558,3321,65.1,1.6,32.4,0.0,0.0,3536,32312.0,6859.0,17526.0,2945.0,26.3,35.9,19.1,36.1,26.2,4.9,13.8,37.2,5.3,45.5,8.6,1.6,1.7,38.8,2120,75.0,21.3,3.8,0.0,8.7
4,36005001900,Bronx,Bronx,2591,1206,1385,55.4,9.0,29.0,0.0,2.1,1557,37936.0,3771.0,17986.0,2692.0,37.1,31.5,35.4,20.9,26.2,6.6,11.0,19.2,5.3,63.9,3.0,2.4,6.2,45.4,1083,76.8,15.5,7.7,0.0,19.2


The first five digits of each CensusTract value in this dataframe encode the state (two digits) and the county (three digits); the 2010 Neighborhood Tabulation Area Equivalency CSV file omits them. To match the census tracts in the *census_tract_df* and *census_tab_equiv_df* dataframes, we strip the first five digits from the CensusTract column of *census_tract_df*.
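
As a quick illustration of the ID layout described above, the following sketch splits an 11-digit tract ID into its parts. The helper name `split_tract_id` is hypothetical, and the `zfill` padding is an assumption about how an ID might need to be normalized if it was parsed as an integer and lost leading zeros.

```python
def split_tract_id(tract_id):
    """Split an 11-digit census tract ID into (state, county, tract) strings."""
    digits = str(tract_id).zfill(11)  # restore leading zeros if the ID was parsed as an int
    return digits[:2], digits[2:5], digits[5:]

state, county, tract = split_tract_id(36005000100)
print(state, county, tract)  # 36 005 000100
```

The last element is the six-digit tract code that both dataframes need to share as a key.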

In [7]:
# Define function to extract the last six digits from a CensusTract value
def six_digit_census_tract(census_tract):
    return str(census_tract)[5:]

# Apply the above function to the CensusTract column
census_tract_df['CensusTract'] = census_tract_df['CensusTract'].apply(six_digit_census_tract)

Let's clean the data by keeping only the Manhattan rows and dropping every tract with a population of 50 or fewer.

In [8]:
# Extract Manhattan rows
census_tract_df = census_tract_df[census_tract_df['Borough'] == 'Manhattan']

# Extract rows where the total population > 50
census_tract_df = census_tract_df[census_tract_df['TotalPop'] > 50]

census_tract_df.head()

Unnamed: 0,CensusTract,County,Borough,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Citizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
1101,201,New York,Manhattan,2791,1301,1490,35.3,12.4,6.2,0.0,46.1,2031,20521.0,4537.0,13062.0,2962.0,48.6,71.4,22.5,37.6,26.2,0.0,13.7,11.8,6.5,45.4,28.2,6.3,1.7,33.0,1105,90.1,7.1,2.8,0.0,2.6
1102,202,New York,Manhattan,7768,3314,4454,36.8,17.6,12.8,0.0,27.5,5926,29684.0,3510.0,27355.0,4365.0,24.6,19.6,40.7,27.6,26.3,2.8,2.6,9.3,3.3,41.5,29.7,9.1,7.0,30.9,2667,74.1,19.8,6.1,0.0,15.1
1104,600,New York,Manhattan,12554,5966,6588,33.2,3.4,12.0,0.4,50.6,8072,19863.0,5878.0,12802.0,2656.0,44.7,55.3,18.3,32.5,35.5,4.6,9.1,14.3,3.2,37.0,37.6,5.1,2.9,30.3,4028,85.0,10.2,4.8,0.0,8.7
1105,700,New York,Manhattan,8794,4214,4580,6.6,69.3,2.4,0.0,20.5,6198,117841.0,13607.0,89303.0,13293.0,10.6,5.3,71.9,3.0,24.2,0.9,0.0,2.5,0.0,62.3,27.8,0.5,6.9,24.9,6463,91.6,5.9,2.5,0.0,4.4
1106,800,New York,Manhattan,9465,4602,4863,3.7,5.6,0.9,0.1,83.9,5208,27137.0,7611.0,17426.0,2639.0,28.9,33.3,23.9,27.6,32.0,1.5,15.1,9.3,7.2,43.1,33.0,5.0,2.4,33.2,4132,86.5,4.7,8.5,0.3,10.3


## 3. Calculating Census Data by Neighborhood <a class="anchor" id="third-bullet"></a>

To group the census data by neighborhood, each census tract first needs a neighborhood label. We take the census tract and neighborhood columns from the *census_tab_equiv_df* dataframe and merge them with the *census_tract_df* dataframe on the CensusTract column.

In [9]:
# Extract the census tract and neighborhood columns from the census_tab_equiv_df dataframe
tract_neighborhood_df = census_tab_equiv_df[['CensusTract', 'Neighborhood']]

# Merge the tract_neighborhood_df dataframe with the census_tract_df dataframe
census_data_merged = pd.merge(tract_neighborhood_df, census_tract_df, on='CensusTract')

census_data_merged.head()

Unnamed: 0,CensusTract,Neighborhood,County,Borough,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Citizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,700,Battery Park City-Lower Manhattan,New York,Manhattan,8794,4214,4580,6.6,69.3,2.4,0.0,20.5,6198,117841.0,13607.0,89303.0,13293.0,10.6,5.3,71.9,3.0,24.2,0.9,0.0,2.5,0.0,62.3,27.8,0.5,6.9,24.9,6463,91.6,5.9,2.5,0.0,4.4
1,900,Battery Park City-Lower Manhattan,New York,Manhattan,1626,946,680,8.2,74.4,1.8,0.0,12.7,1167,147500.0,30405.0,111599.0,16700.0,6.0,12.9,67.6,2.6,25.5,0.2,4.1,10.0,0.4,51.9,24.1,3.3,10.4,26.1,1120,89.3,4.6,6.1,0.0,2.8
2,1300,Battery Park City-Lower Manhattan,New York,Manhattan,4374,1896,2478,10.6,66.0,1.9,0.0,19.5,3207,123558.0,15444.0,93787.0,12403.0,7.4,0.0,75.2,2.5,20.2,2.0,0.0,2.5,2.1,59.4,27.6,1.4,6.9,22.8,3419,88.1,5.6,6.3,0.0,3.3
3,1501,Battery Park City-Lower Manhattan,New York,Manhattan,6502,3007,3495,10.2,57.6,5.6,0.0,25.1,5474,81867.0,11845.0,80454.0,18586.0,10.3,0.0,67.8,6.9,20.1,2.5,2.6,5.7,0.0,50.6,32.0,6.8,5.0,22.5,3663,86.3,10.8,2.9,0.0,5.0
4,1502,Battery Park City-Lower Manhattan,New York,Manhattan,7378,3664,3714,5.8,65.9,3.9,0.5,21.8,5407,133209.0,11906.0,77456.0,7728.0,8.3,0.0,74.4,5.0,19.8,0.0,0.8,2.9,0.0,60.8,28.0,2.5,5.8,23.7,5231,91.0,5.2,3.8,0.0,3.5
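
Because `pd.merge` performs an inner join by default, tracts that appear in only one of the two tables are silently dropped. One way to see what would be lost is to compare the two key sets before merging. The sketch below uses plain Python sets with made-up tract keys; `unmatched_keys` is a hypothetical helper, not part of the notebook.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys found only on the left and only on the right, each sorted."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)

# Hypothetical tract keys, for illustration only
equiv_tracts = ["000700", "000900", "001300"]
census_tracts = ["000700", "001300", "001501"]

only_in_equiv, only_in_census = unmatched_keys(equiv_tracts, census_tracts)
print(only_in_equiv)   # ['000900']
print(only_in_census)  # ['001501']
```

In the notebook, the two arguments would be the CensusTract columns of *census_tab_equiv_df* and *census_tract_df*. Tracts removed earlier by the population filter are expected to show up as unmatched on the equivalency side.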


The merged dataframe contains more features than we need, so we keep only the columns required for the final dataset.

In [10]:
census_data = census_data_merged[['Neighborhood', 
                                  'TotalPop', 
                                  'Men', 
                                  'Women', 
                                  'Hispanic', 
                                  'White', 
                                  'Black', 
                                  'Asian', 
                                  'Income',
                                  'Poverty',
                                  'Walk',
                                  'MeanCommute',
                                  'Unemployment']]

Before grouping the rows by neighborhood, we convert percentage values to the number of people so that the data for each census tract can be summed according to neighborhood.
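
To see why the conversion matters, consider the poverty rates of the first two Battery Park City tracts shown above (10.6% of 8,794 people and 6.0% of 1,626 people). The sketch below is plain Python rather than pandas, and the function name is hypothetical. It converts each percentage to a head count, sums per neighborhood, and divides by the summed population.

```python
from collections import defaultdict

def neighborhood_percent(rows):
    """Aggregate tract-level percentages into population-weighted neighborhood percentages.

    rows: iterable of (neighborhood, population, percent) tuples.
    """
    people = defaultdict(float)      # head count behind the percentage, per neighborhood
    population = defaultdict(float)  # total population, per neighborhood
    for name, pop, pct in rows:
        people[name] += pop * pct * 0.01
        population[name] += pop
    return {name: 100 * people[name] / population[name] for name in people}

tracts = [
    ("Battery Park City-Lower Manhattan", 8794, 10.6),
    ("Battery Park City-Lower Manhattan", 1626, 6.0),
]
result = neighborhood_percent(tracts)
print(round(result["Battery Park City-Lower Manhattan"], 2))  # 9.88
```

A plain average of the two percentages would give 8.3, which understates the rate because the larger tract has the higher poverty rate.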

In [11]:
# Define the indices of the columns that contain percentage values:
# Hispanic, White, Black, Asian (4-7), Poverty and Walk (9-10), and Unemployment (12)
per_index = [4, 5, 6, 7, 9, 10, 12]

In [12]:
# Work on an explicit copy so the assignments below do not trigger pandas' SettingWithCopyWarning
census_data = census_data.copy()

# Convert the percentage values to the number of people
census_data.iloc[:, per_index] = census_data.iloc[:, per_index].multiply(census_data['TotalPop'] * 0.01, axis=0)


Additionally, the income and mean commute time must be converted to a format that can be more easily grouped. Summing or averaging the income over the census tracts in a neighborhood would neglect the differences in population among census tracts. Therefore, we multiply the income and mean commute time by the population so that these values can be summed across the census tracts in a neighborhood and then later divided by the total population of the neighborhood.
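
Written out, this step builds toward a population-weighted mean. The symbols are introduced here only for illustration: for a neighborhood made up of tracts $i = 1, \dots, n$ with populations $p_i$ and tract-level values $x_i$ (income or mean commute time),

$$
\bar{x} = \frac{\sum_{i=1}^{n} p_i \, x_i}{\sum_{i=1}^{n} p_i}
$$

For example, taking only the first two tracts listed above (incomes of 117,841 and 147,500 with populations of 8,794 and 1,626), the weighted value is $(117841 \cdot 8794 + 147500 \cdot 1626) / 10420 \approx 122{,}469$, whereas the unweighted average of the two incomes is 132,670.5.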

In [13]:
census_data_sum_avg = census_data[['Neighborhood', 'Income', 'MeanCommute']]
census_data[['Income', 'MeanCommute']] = census_data[['Income', 'MeanCommute']].multiply(census_data['TotalPop'], axis=0)

census_data.head()


Unnamed: 0,Neighborhood,TotalPop,Men,Women,Hispanic,White,Black,Asian,Income,Poverty,Walk,MeanCommute,Unemployment
0,Battery Park City-Lower Manhattan,8794,4214,4580,580.404,6094.242,211.056,1802.77,1036294000.0,932.164,2444.732,218970.6,386.936
1,Battery Park City-Lower Manhattan,1626,946,680,133.332,1209.744,29.268,206.502,239835000.0,97.56,391.866,42438.6,45.528
2,Battery Park City-Lower Manhattan,4374,1896,2478,463.644,2886.84,83.106,852.93,540442700.0,323.676,1207.224,99727.2,144.342
3,Battery Park City-Lower Manhattan,6502,3007,3495,663.204,3745.152,364.112,1632.002,532299200.0,669.706,2080.64,146295.0,325.1
4,Battery Park City-Lower Manhattan,7378,3664,3714,427.924,4862.102,287.742,1608.404,982816000.0,612.374,2065.84,174858.6,258.23


The rows can now be grouped by neighborhood, summing the values of all census tracts in each neighborhood.

In [14]:
census_data_grouped = census_data.groupby(['Neighborhood']).sum()

census_data_grouped.head()

Neighborhood,TotalPop,Men,Women,Hispanic,White,Black,Asian,Income,Poverty,Walk,MeanCommute,Unemployment
Battery Park City-Lower Manhattan,44436,21573,22863,3850.104,29028.612,1189.065,9077.082,6002085000.0,3278.811,12939.776,1122454.5,1668.354
Central Harlem North-Polo Grounds,82898,38715,44183,20300.252,7194.947,51027.318,1770.945,2940435000.0,24110.07,7422.433,3086654.8,13258.386
Central Harlem South,48993,22584,26409,9889.833,9461.609,25993.693,2361.387,2390774000.0,13499.45,3521.117,1690081.7,4044.863
Chinatown,45325,23334,21991,6116.979,7803.265,2183.933,27120.37,1741626000.0,13102.293,13916.232,1405153.5,3725.22
Clinton,43450,23764,19686,9219.709,24093.533,2399.796,6473.797,3454017000.0,5466.943,15748.383,1192689.7,2919.515


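Now that the data is grouped by neighborhood, it must be converted back into a meaningful format. We divide every column except TotalPop by the neighborhood's total population, then multiply the share columns (Men, Women, Hispanic, White, Black, Asian, Poverty, Walk, and Unemployment) by 100 to express them as percentages. Income and MeanCommute are left as population-weighted averages.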

In [15]:
# Divide columns by the total population
census_data_grouped.iloc[:, 1:] = census_data_grouped.iloc[:, 1:].div(census_data_grouped['TotalPop'], axis=0)

# Define indices of the columns in percent format
percent_indices = [1, 2, 3, 4, 5, 6, 8, 9, 11]

# Convert the specified columns back to percent format
census_data_grouped.iloc[:, percent_indices] = census_data_grouped.iloc[:, percent_indices] * 100

As with the GeoJSON data, we drop the parks and cemeteries from the neighborhood data. One neighborhood name is also misspelled (Flat Iron instead of Flatiron), so we correct it before moving on.

In [16]:
# Drop parks and cemeteries
census_data_grouped = census_data_grouped.drop(['park-cemetery-etc-Manhattan'])

# Correct neighborhood name (Flat Iron -> Flatiron)
census_data_grouped = census_data_grouped.rename({'Hudson Yards-Chelsea-Flat Iron-Union Square': 'Hudson Yards-Chelsea-Flatiron-Union Square'}, axis='index')

To make the data more readable, we round the values to two decimal places.

In [17]:
# Round to two decimal places
census_data_grouped.iloc[:, 1:] = census_data_grouped.iloc[:, 1:].round(2)

census_data_grouped.head()

Neighborhood,TotalPop,Men,Women,Hispanic,White,Black,Asian,Income,Poverty,Walk,MeanCommute,Unemployment
Battery Park City-Lower Manhattan,44436,48.55,51.45,8.66,65.33,2.68,20.43,135072.58,7.38,29.12,25.26,3.75
Central Harlem North-Polo Grounds,82898,46.7,53.3,24.49,8.68,61.55,2.14,35470.52,29.08,8.95,37.23,15.99
Central Harlem South,48993,46.1,53.9,20.19,19.31,53.06,4.82,48798.28,27.55,7.19,34.5,8.26
Chinatown,45325,51.48,48.52,13.5,17.22,4.82,59.84,38425.29,28.91,30.7,31.0,8.22
Clinton,43450,54.69,45.31,21.22,55.45,5.52,14.9,79494.05,12.58,36.24,27.45,6.72


## 4. Adding Additional Features <a class="anchor" id="fourth-bullet"></a>

We now have cleaned census data for each Manhattan neighborhood, but there are still some features of interest that are missing. For each neighborhood, the following features will be calculated:
* Population density
* Latitude and longitude

### Calculating population density

The population density is calculated by dividing the total population of the neighborhood by the area. The area of each neighborhood is extracted from the GeoJSON object and added to a new dataframe. This dataframe is then merged with the dataframe from Step 3.

In [18]:
# Get the neighborhood features from the GeoJSON object
neighborhood = manhattan_geo["features"]

# Collect one row per neighborhood
area_rows = []
for n in neighborhood:
    # Extract the neighborhood name
    nbh = n["properties"]["NTAName"]
    
    # Extract the neighborhood area and convert it from m^2 to mi^2 (1 mi is about 1609 m).
    # This assumes Shape__Area is in square meters; if the layer reports square feet,
    # divide by 5280*5280 instead.
    area = n["properties"]["Shape__Area"] / (1609*1609)
    
    area_rows.append({'Neighborhood' : nbh, 'Area' : area})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
nbh_area_df = pd.DataFrame(area_rows, columns = ['Neighborhood', 'Area'])

In [19]:
nbh_data = pd.merge(nbh_area_df, census_data_grouped, on='Neighborhood')

Now that we have the area of each neighborhood, we can calculate the population density and add it as a new column in the dataframe.

In [20]:
pop_density = nbh_data["TotalPop"] / nbh_data["Area"]
nbh_data.insert(loc=3, column='Population Density', value=pop_density)

nbh_data.head()

Unnamed: 0,Neighborhood,Area,TotalPop,Population Density,Men,Women,Hispanic,White,Black,Asian,Income,Poverty,Walk,MeanCommute,Unemployment
0,Clinton,7.092697,43450,6126.019376,54.69,45.31,21.22,55.45,5.52,14.9,79494.05,12.58,36.24,27.45,6.72
1,Battery Park City-Lower Manhattan,7.344586,44436,6050.170905,48.55,51.45,8.66,65.33,2.68,20.43,135072.58,7.38,29.12,25.26,3.75
2,Lincoln Square,6.105156,59921,9814.818141,46.72,53.28,9.51,74.22,2.72,10.68,122296.16,8.08,18.44,28.38,4.82
3,Midtown-Midtown South,11.662014,28080,2407.817386,48.28,51.72,7.53,65.05,3.74,20.95,125437.65,13.86,42.67,22.89,5.38
4,Upper East Side-Carnegie Hill,7.750593,58161,7504.070522,44.65,55.35,6.1,84.7,1.07,5.77,164660.74,5.16,18.37,27.09,3.76
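
The unit handling in this step is easy to get wrong, so it can help to isolate it. The sketch below is a plain-Python illustration with hypothetical helper names and a made-up neighborhood. It uses the exact definition of the international mile (1,609.344 m, or 5,280 ft) and assumes you already know which unit the source area is reported in.

```python
SQ_M_PER_SQ_MI = 1609.344 ** 2   # square meters in one square mile
SQ_FT_PER_SQ_MI = 5280 ** 2      # square feet in one square mile

def area_to_sq_mi(area, unit="m2"):
    """Convert an area in square meters ('m2') or square feet ('ft2') to square miles."""
    if unit == "m2":
        return area / SQ_M_PER_SQ_MI
    if unit == "ft2":
        return area / SQ_FT_PER_SQ_MI
    raise ValueError(f"unsupported unit: {unit!r}")

def population_density(population, area_sq_mi):
    """Return the number of people per square mile."""
    return population / area_sq_mi

# Hypothetical neighborhood: 50,000 people on 2 km^2 (2,000,000 m^2)
area = area_to_sq_mi(2_000_000, unit="m2")
print(round(area, 3))                           # 0.772
print(round(population_density(50_000, area)))  # 64750
```

Comparing a few computed densities against published figures for the same area is a cheap way to catch a unit mismatch.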


### Calculating the neighborhood latitude and longitude values

The final feature to be added to this dataset is the representative latitude and longitude value of each neighborhood. The neighborhoods are tabulated as polygons, and many have irregular forms. Some neighborhoods consist of multiple polygons. To choose one geographical coordinate to represent each neighborhood, we take the centroid of the neighborhood's largest polygon. These coordinates are then merged with our dataset.

In [21]:
# Get the geojson features
neighborhood = manhattan_geo["features"]

# Collect the centroid coordinates of each neighborhood
centroid_rows = []
for n in neighborhood:
    # Get the neighborhood name
    nbh = n["properties"]["NTAName"]
    
    # Each neighborhood can consist of multiple sets of coordinates, each corresponding to a different shape.
    # Find the set of coordinates that corresponds to the biggest polygon.
    biggest_polygon = Polygon()
    for shape_coords in n['geometry']['coordinates']:
        try:
            # Polygon geometry: shape_coords is a single ring of coordinates
            polygon = Polygon(shape_coords)
        except Exception:
            # MultiPolygon geometry: shape_coords is a list of rings, so use the first (outer) ring
            polygon = Polygon(shape_coords[0])
        
        if polygon.area > biggest_polygon.area:
            biggest_polygon = polygon
    
    # Find the centroid of the polygon, corresponding to the geographical coordinates of the neighborhood
    lat = biggest_polygon.centroid.y
    lon = biggest_polygon.centroid.x
    
    centroid_rows.append({'Neighborhood' : nbh, 'Latitude' : lat, 'Longitude' : lon})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
nbh_centroid_df = pd.DataFrame(centroid_rows, columns = ['Neighborhood', 'Latitude', 'Longitude'])
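
For intuition about what the centroid represents, the sketch below computes the area and centroid of a simple ring with the shoelace formula in plain Python, then picks the centroid of the largest of several rings. This is an illustration only: the function names and toy coordinates are hypothetical, holes are ignored, rings are assumed to have non-zero area, and it is not a substitute for Shapely.

```python
def ring_area_and_centroid(ring):
    """Return (area, (cx, cy)) of a simple ring using the shoelace formula."""
    pts = list(ring)
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # close the ring if the first point is not repeated at the end
    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed_area = twice_area / 2.0  # negative for clockwise rings
    return abs(signed_area), (cx / (6.0 * signed_area), cy / (6.0 * signed_area))

def largest_ring_centroid(rings):
    """Return the centroid of the ring that encloses the largest area."""
    best = max(rings, key=lambda ring: ring_area_and_centroid(ring)[0])
    return ring_area_and_centroid(best)[1]

small_triangle = [(10, 10), (11, 10), (10, 11)]
big_square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(largest_ring_centroid([small_triangle, big_square]))  # (1.0, 1.0)
```

Note that the centroid of a longitude/latitude ring computed this way treats degrees as planar coordinates, which is a reasonable approximation at the scale of a single neighborhood.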

In [22]:
# Merge centroid coordinates with dataset
manhattan_nbh_data = pd.merge(nbh_centroid_df, nbh_data, on='Neighborhood')

manhattan_nbh_data.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Area,TotalPop,Population Density,Men,Women,Hispanic,White,Black,Asian,Income,Poverty,Walk,MeanCommute,Unemployment
0,Clinton,40.764175,-73.992395,7.092697,43450,6126.019376,54.69,45.31,21.22,55.45,5.52,14.9,79494.05,12.58,36.24,27.45,6.72
1,Battery Park City-Lower Manhattan,40.708547,-74.010916,7.344586,44436,6050.170905,48.55,51.45,8.66,65.33,2.68,20.43,135072.58,7.38,29.12,25.26,3.75
2,Lincoln Square,40.774855,-73.984701,6.105156,59921,9814.818141,46.72,53.28,9.51,74.22,2.72,10.68,122296.16,8.08,18.44,28.38,4.82
3,Midtown-Midtown South,40.755742,-73.983504,11.662014,28080,2407.817386,48.28,51.72,7.53,65.05,3.74,20.95,125437.65,13.86,42.67,22.89,5.38
4,Upper East Side-Carnegie Hill,40.774738,-73.961176,7.750593,58161,7504.070522,44.65,55.35,6.1,84.7,1.07,5.77,164660.74,5.16,18.37,27.09,3.76


### Export dataset

Now that we have finished building our dataset, we can export it in CSV format.

In [23]:
# Write dataframe to CSV
manhattan_nbh_data.to_csv('manhattan-neighborhood-data.csv')

## 5. Visualization <a class="anchor" id="fifth-bullet"></a>

Let's check that the latitude and longitude points representing each neighborhood fall at reasonable locations by plotting them on top of the Neighborhood Tabulation Areas. The areas can also be shaded by population density, which lets us visualize the calculated density for each neighborhood.

In [29]:
# Get the latitude and longitude of Manhattan (for us to use when we create our map)
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="nyc_explorer")
location = geolocator.geocode(address)
mht_latitude = location.latitude
mht_longitude = location.longitude

In [31]:
# Create a map of Manhattan using its geographical coordinates
map_nyc = folium.Map(location=[mht_latitude, mht_longitude], zoom_start=11)

# Add a choropleth layer shaded by the population density of each neighborhood:
choropleth = folium.Choropleth(
    geo_data=manhattan_geo,
    data=manhattan_nbh_data,
    columns=['Neighborhood', 'Population Density'],
    key_on='feature.properties.NTAName',
    fill_color='YlGn',
    name='Population Density',
    show=False,
).add_to(map_nyc)

# Add markers for each neighborhood to the map
for lat, lng, label in zip(manhattan_nbh_data['Latitude'], manhattan_nbh_data['Longitude'], manhattan_nbh_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_nyc)

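# Add a layer control so the population density layer (hidden by default via show=False) can be toggled on
folium.LayerControl().add_to(map_nyc)

# Display the map
map_nyc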

As shown in the map above, the calculated latitudes and longitudes for each neighborhood are centered reasonably to represent their respective neighborhoods.
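
The visual check can be complemented by a programmatic one. Because a centroid can fall outside an irregular or U-shaped polygon, the sketch below tests whether a point lies inside a ring using the standard ray-casting rule. It is plain Python with a hypothetical helper name and toy coordinates rather than the notebook's data, and it ignores holes.

```python
def point_in_ring(x, y, ring):
    """Return True if (x, y) lies inside the ring, using ray casting along +x."""
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]

print(point_in_ring(1, 1, square))     # True
print(point_in_ring(1.5, 2, u_shape))  # False: the point sits in the notch of the U
```

A neighborhood whose representative point fails this test against its own outline would be a candidate for a different choice of point.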