## The Battle of the Neighborhoods - Week 5

### Download and Explore New York City geographical coordinates dataset

New York City has a total of 5 boroughs and 306 neighborhoods. To segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web. Link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

First, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import matplotlib.pyplot as plt

# conda install -c anaconda beautiful-soup --yes
from bs4 import BeautifulSoup # package for parsing HTML and XML documents

import csv # implements classes to read and write tabular data in CSV form


print('Libraries imported.')

Libraries imported.


A copy of the JSON file is hosted on a server, so run a `wget` command to download it.

In [2]:
!wget -q -O 'newyork_data.json' https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json
print('Data downloaded!')

Data downloaded!
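`wget` is not available in every environment (for example, a default Windows setup). As a rough alternative, the download could be done with the standard library. The helper name `download` is an assumption for illustration, and the demonstration uses a local file URI in place of the real dataset URL so that it runs offline:

```python
import json
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download(url, dest):
    """Stream the resource at url into the file dest and return dest."""
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out)
    return dest

# Offline demonstration: a local file URI stands in for the dataset URL
workdir = Path(tempfile.mkdtemp())
source = workdir / 'source.json'
source.write_text(json.dumps({'features': []}))

target = download(source.as_uri(), workdir / 'newyork_data.json')
with open(target) as f:
    loaded = json.load(f)
print(loaded)
```

In the notebook, the first argument would be the dataset URL passed to `wget` above.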


#### Load and explore the data

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

All the relevant data is under the *features* key, which holds a list of the neighborhoods. So, define a new variable that refers to this list.

In [4]:
neighborhoods_data = newyork_data['features']

Take a look at the first item in this list.

In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
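Note that a GeoJSON point stores its coordinates as `[longitude, latitude]`, which is easy to mix up. As a minimal sketch, the helper below (the name `feature_to_row` is illustrative and not part of the original notebook) extracts the four fields used later from a feature shaped like the one above:

```python
def feature_to_row(feature):
    """Return (borough, neighborhood, latitude, longitude) for one GeoJSON feature."""
    props = feature['properties']
    lon, lat = feature['geometry']['coordinates']  # GeoJSON order: longitude first
    return props['borough'], props['name'], lat, lon

# Trimmed copy of the first feature shown above
sample_feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}

row = feature_to_row(sample_feature)
print(row)
```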

#### Transform the data into a *pandas* dataframe
The next task is to transform this data of nested Python dictionaries into a *pandas* dataframe. Start by creating an empty dataframe.

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [7]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then loop through the data, collect one row per neighborhood, and build the dataframe from the collected rows.

In [8]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)
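As an alternative sketch, `pd.json_normalize` (a top-level function since pandas 1.0) can flatten the nested features without an explicit loop. The two sample features reuse values shown in this notebook; the variable names `flat`, `coords`, and `df` are illustrative:

```python
import pandas as pd

# Two features in the same shape as newyork_data['features']
features = [
    {'geometry': {'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'coordinates': [-73.829939, 40.874294]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]

flat = pd.json_normalize(features)  # nested keys become dotted column names

# Each coordinates entry is a [longitude, latitude] list; split it into two columns
coords = pd.DataFrame(flat['geometry.coordinates'].tolist(),
                      columns=['Longitude', 'Latitude'])

df = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    'Latitude': coords['Latitude'],
    'Longitude': coords['Longitude'],
})
print(df)
```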

In [9]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Let's make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.
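Beyond the two totals, a per-borough count is a quick sanity check. The sketch below assumes a dataframe with `Borough` and `Neighborhood` columns and uses only the five rows shown by `neighborhoods.head()`; the function name `summarize` is hypothetical:

```python
import pandas as pd

def summarize(df):
    """Return (number of boroughs, number of rows, rows per borough)."""
    per_borough = df.groupby('Borough')['Neighborhood'].count()
    return df['Borough'].nunique(), len(df), per_borough

# The five rows shown by neighborhoods.head() above
sample = pd.DataFrame({
    'Borough': ['Bronx'] * 5,
    'Neighborhood': ['Wakefield', 'Co-op City', 'Eastchester',
                     'Fieldston', 'Riverdale'],
})

n_boroughs, n_rows, per_borough = summarize(sample)
print(n_boroughs, n_rows)
print(per_borough)
```

Applied to the full `neighborhoods` dataframe, the first two values should match the 5 boroughs and 306 neighborhoods reported above.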


In [11]:
neighborhoods.to_csv('BON1_NYC_GEO.csv',index=False)

#### Use geopy library to get the latitude and longitude values of New York City.

In [12]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="Jupyter")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
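geopy's `geocode` returns `None` when an address cannot be resolved, in which case reading `.latitude` raises an `AttributeError`. A minimal guard might look like the sketch below. The helper `geocode_or_default` and the stub class are assumptions made for illustration; the stub only stands in for `Nominatim` so the example runs without network access:

```python
def geocode_or_default(geolocator, address, default):
    """Return (latitude, longitude) for address, or default if nothing is found."""
    location = geolocator.geocode(address)
    if location is None:  # geopy returns None for unresolved addresses
        return default
    return location.latitude, location.longitude

class StubGeolocator:
    """Hypothetical stand-in for Nominatim that never finds anything."""
    def geocode(self, address):
        return None

# Fallback taken from the coordinates printed by the cell above
NYC_FALLBACK = (40.7127281, -74.0060152)

latitude, longitude = geocode_or_default(StubGeolocator(), 'New York City, NY', NYC_FALLBACK)
print(latitude, longitude)
```

In the notebook, the real `geolocator` created with `Nominatim(user_agent="Jupyter")` would be passed in place of the stub.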


#### Create a map of New York with neighborhoods superimposed on top.

**Folium** is a Python library for rendering interactive Leaflet maps. We can zoom into the map below and click on each circle marker to reveal the name of the neighborhood and its borough.

In [13]:
# create map of New York using latitude and longitude values
map_NewYork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_NewYork)
    
map_NewYork
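All markers on this map share one color. If distinguishing boroughs visually is useful, one possible approach is to map each borough to its own color first. The function name `borough_colors` and the palette are illustrative assumptions:

```python
def borough_colors(boroughs, palette=('red', 'blue', 'green', 'purple', 'orange')):
    """Map each distinct borough to a color, cycling through the palette if needed."""
    unique = sorted(set(boroughs))
    return {b: palette[i % len(palette)] for i, b in enumerate(unique)}

colors_by_borough = borough_colors(['Bronx', 'Manhattan', 'Bronx', 'Queens'])
print(colors_by_borough)
```

Inside the marker loop, `color=colors_by_borough[borough]` would then take the place of the fixed `color='blue'`.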

#### A : POPULATION DATA

Population data is scraped from the Wikipedia page https://en.wikipedia.org/wiki/Demographics_of_New_York_City

#### Web scraping of population data from the Wikipedia page using BeautifulSoup

Beautiful Soup is a Python package for parsing HTML and XML documents, including pages with malformed markup such as non-closed tags (the name comes from "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. The dependencies needed here (`requests`, `BeautifulSoup`, and `csv`) were already imported in the first cell.

In [62]:
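# fetch the page HTML
page_html = requests.get('https://en.wikipedia.org/wiki/Demographics_of_New_York_City').text
soup = BeautifulSoup(page_html, 'lxml')  # the 'lxml' parser requires the lxml package
table = soup.find('table', {'class': 'wikitable sortable'})
#print(soup.prettify())

# column headers come from the <th> cells
headers = [header.text for header in table.find_all('th')]

# collect the text of every <td> cell, one list per table row
table_rows = table.find_all('tr')
rows = []
for table_row in table_rows:
    cells = table_row.find_all('td')
    rows.append([cell.text for cell in cells])

# newline='' prevents the csv module from writing blank lines on Windows
with open('BON2_POPULATION1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)  # skip rows without <td> cells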
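The raw cell text keeps the newlines and padding from the page, which is why later cells have to clean them up. As a sketch of an alternative, `get_text(strip=True)` trims each cell at extraction time. The inline HTML is a small stand-in for the Wikipedia table (values taken from the Bronx row), and the built-in `'html.parser'` is used here so that `lxml` is not required:

```python
from bs4 import BeautifulSoup

# Small stand-in for the scraped table; the real page has more rows and columns
html = """
<table class="wikitable sortable">
  <tr><th>Borough</th><th>County</th><th>Population</th></tr>
  <tr><td>The Bronx
</td><td>
 Bronx
</td><td>1,471,160
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable')

headers = [th.get_text(strip=True) for th in table.find_all('th')]

rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(headers)
print(rows)
```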

In [63]:
Pop_data=pd.read_csv('BON2_POPULATION1.csv')

Pop_data.drop(Pop_data.columns[[3,8,9,10,11,12,13,14]], axis=1,inplace=True)

print('Data downloaded!')

Data downloaded!


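#### Remove whitespace and rename columns

In [65]:
# remove spaces and apostrophes from the column names
Pop_data.columns = Pop_data.columns.str.replace(' ', '', regex=False)
Pop_data.columns = Pop_data.columns.str.replace('\'', '', regex=False)
# the last two header labels are misaligned with the data beneath them,
# which holds the two density values, so rename them accordingly
Pop_data.rename(columns={'Borough':'persons_sq_mi','County':'persons_sq_km'}, inplace=True)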
Pop_data

Unnamed: 0,NewYorkCitysfiveboroughsvte,Jurisdiction,Population,Landarea,Density,persons_sq_mi,persons_sq_km
0,The Bronx\n,\n Bronx\n,"1,471,160\n","19,570\n",42.10\n,109.04\n,"34,653\n"
1,Brooklyn\n,\n Kings\n,"2,648,771\n","23,900\n",70.82\n,183.42\n,"37,137\n"
2,Manhattan\n,\n New York\n,"1,664,727\n","378,250\n",22.83\n,59.13\n,"72,033\n"
3,Queens\n,\n Queens\n,"2,358,582\n","31,310\n",108.53\n,281.09\n,"21,460\n"
4,Staten Island\n,\n Richmond\n,"479,458\n","23,460\n",58.37\n,151.18\n,"8,112\n"
5,City of New York,8622698,806.863,302.64,783.83,28188,"10,947\n"
6,State of New York,19849399,1547.116,47214,122284,416.4,159\n
7,Sources:[14] and see individual borough articl...,,,,,,


In [66]:
Pop_data.rename(columns = {'NewYorkCitysfiveboroughsvte\n' : 'Borough',
                   'Jurisdiction\n':'County',
                   'Population\n':'Estimate_2017', 
                   'Landarea\n':'square_miles',
                    'Density\n':'square_km'}, inplace=True)
Pop_data

Unnamed: 0,Borough,County,Estimate_2017,square_miles,square_km,persons_sq_mi,persons_sq_km
0,The Bronx\n,\n Bronx\n,"1,471,160\n","19,570\n",42.10\n,109.04\n,"34,653\n"
1,Brooklyn\n,\n Kings\n,"2,648,771\n","23,900\n",70.82\n,183.42\n,"37,137\n"
2,Manhattan\n,\n New York\n,"1,664,727\n","378,250\n",22.83\n,59.13\n,"72,033\n"
3,Queens\n,\n Queens\n,"2,358,582\n","31,310\n",108.53\n,281.09\n,"21,460\n"
4,Staten Island\n,\n Richmond\n,"479,458\n","23,460\n",58.37\n,151.18\n,"8,112\n"
5,City of New York,8622698,806.863,302.64,783.83,28188,"10,947\n"
6,State of New York,19849399,1547.116,47214,122284,416.4,159\n
7,Sources:[14] and see individual borough articl...,,,,,,


#### Remove newline characters ('\n') from each string

In [67]:
# remove every newline character from each column
for col in ['Borough', 'County', 'Estimate_2017', 'square_miles',
            'square_km', 'persons_sq_mi', 'persons_sq_km']:
    Pop_data[col] = Pop_data[col].replace(to_replace='\n', value='', regex=True)
Pop_data

Unnamed: 0,Borough,County,Estimate_2017,square_miles,square_km,persons_sq_mi,persons_sq_km
0,The Bronx,Bronx,1471160.0,19570.0,42.1,109.04,34653.0
1,Brooklyn,Kings,2648771.0,23900.0,70.82,183.42,37137.0
2,Manhattan,New York,1664727.0,378250.0,22.83,59.13,72033.0
3,Queens,Queens,2358582.0,31310.0,108.53,281.09,21460.0
4,Staten Island,Richmond,479458.0,23460.0,58.37,151.18,8112.0
5,City of New York,8622698,806.863,302.64,783.83,28188.0,10947.0
6,State of New York,19849399,1547.116,47214.0,122284.0,416.4,159.0
7,Sources:[14] and see individual borough articles,,,,,,
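If the population and area columns are still strings such as `'1,471,160\n'`, they cannot be used in arithmetic. A hedged sketch of converting such a column to numbers is shown below; the series `raw` is a small illustrative sample, not the full column:

```python
import pandas as pd

# Sample values in the raw scraped format, plus an empty cell
raw = pd.Series(['1,471,160\n', '2,648,771\n', ''])

# Trim whitespace, drop thousands separators, then parse;
# values that cannot be parsed become NaN
cleaned = pd.to_numeric(
    raw.str.strip().str.replace(',', '', regex=False),
    errors='coerce',
)
print(cleaned)
```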


#### Shift the misaligned summary rows one column to the right

In [68]:
# rows 5 onward have their values one column too far to the left; shift each
# adjacent pair of columns to the right, working from the last column back to the first
Pop_data.loc[5:,['persons_sq_mi','persons_sq_km']] = Pop_data.loc[5:,['persons_sq_mi','persons_sq_km']].shift(1,axis=1)
Pop_data.loc[5:,['square_km','persons_sq_mi']] = Pop_data.loc[5:,['square_km','persons_sq_mi']].shift(1,axis=1)
Pop_data.loc[5:,['square_miles','square_km']] = Pop_data.loc[5:,['square_miles','square_km']].shift(1,axis=1)
Pop_data.loc[5:,['Estimate_2017','square_miles']] = Pop_data.loc[5:,['Estimate_2017','square_miles']].shift(1,axis=1)
Pop_data.loc[5:,['County','Estimate_2017']] = Pop_data.loc[5:,['County','Estimate_2017']].shift(1,axis=1)
Pop_data.loc[5:,['Borough','County']] = Pop_data.loc[5:,['Borough','County']].shift(1,axis=1)
Pop_data

Unnamed: 0,Borough,County,Estimate_2017,square_miles,square_km,persons_sq_mi,persons_sq_km
0,The Bronx,Bronx,1471160.0,19570.0,42.1,109.04,34653.0
1,Brooklyn,Kings,2648771.0,23900.0,70.82,183.42,37137.0
2,Manhattan,New York,1664727.0,378250.0,22.83,59.13,72033.0
3,Queens,Queens,2358582.0,31310.0,108.53,281.09,21460.0
4,Staten Island,Richmond,479458.0,23460.0,58.37,151.18,8112.0
5,,City of New York,8622698.0,806.863,302.64,783.83,28188.0
6,,State of New York,19849399.0,1547.116,47214.0,122284.0,416.4
7,,Sources:[14] and see individual borough articles,,,,,
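The net effect of the pairwise shifts above is that every value in the affected rows moves one column to the right. On a small frame, the same result can be sketched with a single `shift(1, axis=1)` across all columns. The frame `demo` below is an illustrative three-column excerpt, not the full table:

```python
import pandas as pd

demo = pd.DataFrame(
    {'Borough': ['Queens', 'City of New York'],
     'County': ['Queens', '8622698'],
     'Estimate_2017': ['2,358,582', None]},
    dtype=object,
)

cols = ['Borough', 'County', 'Estimate_2017']
# Shift only the misaligned row (index 1 onward) one column to the right;
# the first column of that row is left empty (NaN)
demo.loc[1:, cols] = demo.loc[1:, cols].shift(1, axis=1)
print(demo)
```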


#### Replace NaN values with empty strings

In [69]:
Pop_data = Pop_data.fillna('')
Pop_data

Unnamed: 0,Borough,County,Estimate_2017,square_miles,square_km,persons_sq_mi,persons_sq_km
0,The Bronx,Bronx,1471160.0,19570.0,42.1,109.04,34653.0
1,Brooklyn,Kings,2648771.0,23900.0,70.82,183.42,37137.0
2,Manhattan,New York,1664727.0,378250.0,22.83,59.13,72033.0
3,Queens,Queens,2358582.0,31310.0,108.53,281.09,21460.0
4,Staten Island,Richmond,479458.0,23460.0,58.37,151.18,8112.0
5,,City of New York,8622698.0,806.863,302.64,783.83,28188.0
6,,State of New York,19849399.0,1547.116,47214.0,122284.0,416.4
7,,Sources:[14] and see individual borough articles,,,,,


#### Drop the last row

In [72]:
i = Pop_data[Pop_data.County == 'Sources:[14] and see individual borough articles'].index
Pop_data = Pop_data.drop(i)  # drop() returns a new dataframe, so assign the result
Pop_data

Unnamed: 0,Borough,County,Estimate_2017,square_miles,square_km,persons_sq_mi,persons_sq_km
0,The Bronx,Bronx,1471160,19570.0,42.1,109.04,34653.0
1,Brooklyn,Kings,2648771,23900.0,70.82,183.42,37137.0
2,Manhattan,New York,1664727,378250.0,22.83,59.13,72033.0
3,Queens,Queens,2358582,31310.0,108.53,281.09,21460.0
4,Staten Island,Richmond,479458,23460.0,58.37,151.18,8112.0
5,,City of New York,8622698,806.863,302.64,783.83,28188.0
6,,State of New York,19849399,1547.116,47214.0,122284.0,416.4


#### Save the dataframe as a CSV file

In [73]:
Pop_data.to_csv('NYC_refined_data.csv',index=False)


#### B : DEMOGRAPHICS DATA


In [76]:
# demographics data for the city population, prepared beforehand from https://en.wikipedia.org/wiki/New_York_City
ny_demo_data = pd.read_csv('NY_demographics.csv')
print("Data downloaded")

Data downloaded


In [81]:
ny_demo_data.columns=['Racial composition', '2010', '1990', '1970','1940']

In [82]:
ny_demo_data

Unnamed: 0,Racial composition,2010,1990,1970,1940
0,—Non-Hispanic,33.3%,43.2%,62.9%[240],92.00%
1,Black or African American,25.5%,28.7%,21.1%,6.10%
2,Hispanic or Latino (of any race),28.6%,24.4%,16.2%[240],1.60%
3,Asian,12.7%,7.0%,1.2%,−


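#### Strip '[240]' from the 1970 column

In [84]:
# remove the trailing footnote marker; str.rstrip would strip a set of characters, not the exact suffix
ny_demo_data['1970'] = ny_demo_data['1970'].str.replace(r'\[240\]$', '', regex=True)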
ny_demo_data

Unnamed: 0,Racial composition,2010,1990,1970,1940
0,—Non-Hispanic,33.3%,43.2%,62.9%,92.00%
1,Black or African American,25.5%,28.7%,21.1%,6.10%
2,Hispanic or Latino (of any race),28.6%,24.4%,16.2%,1.60%
3,Asian,12.7%,7.0%,1.2%,−
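A caution when removing footnote markers: `str.rstrip('[240]')` treats its argument as a set of characters to strip, not as a literal suffix. It gives the intended result for the values in this table only because each one ends in `%` before the marker. The sketch below contrasts it with a regular expression anchored at the end of the string; the value `'1.20[240]'` is hypothetical and chosen to expose the difference:

```python
import re

def strip_citation(value):
    """Remove one trailing footnote marker such as '[240]' from a string."""
    return re.sub(r'\[\d+\]$', '', value)

risky = '1.20[240]'.rstrip('[240]')   # also eats the trailing digits of the value
safe = strip_citation('1.20[240]')    # removes only the bracketed marker

print(repr(risky))
print(repr(safe))
print(strip_citation('62.9%[240]'))
```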


In [86]:
ny_demo_data.to_csv('NY_refined_demographics.csv',index=False)