# The Battle of Neighborhoods - Part I

This project aims to answer the following question:

**Does culture influence the kind of venues in a Neighborhood?**

To address this question, we will use location data from three major cities, each representing a different continent and culture:

1. **Tokyo** 

 Representing Asia, as one of the largest cities on the continent.

2. **Barcelona** 

 Representing Europe, a large city in the northeast of Spain with a distinct culture.

3. **New York** 

 More specifically the borough of Queens, representing North America.

For Tokyo, we take the list of neighborhoods from this [Wikipedia page](https://en.wikipedia.org/wiki/Special_wards_of_Tokyo) and then obtain their coordinates with the Nominatim geocoder in Geopy. For Barcelona we use the same method, starting from this [Wikipedia page](https://en.wikipedia.org/wiki/Districts_of_Barcelona). Lastly, for New York we use the [JSON file](https://cocl.us/new_york_dataset) from the labs, keeping only the Queens neighborhoods and their coordinates.

Once we have the location data for the neighborhoods, we will get the venue information through Foursquare. The data for the neighborhoods of all three cities (representing different countries, continents and cultures) will then be combined and clustered.

_Will each city be represented mainly by one specific cluster, meaning that its culture strongly influences the outcome?_

_Or will the venue-based clusters be spread evenly across the three cities? How multicultural and global are these big cities?_

Let's find out!

## Tokyo location data

In [1]:
import pandas as pd # import pandas library

Get Tokyo districts dataset:

In [2]:
df_tokyo = pd.read_html('https://en.wikipedia.org/wiki/Special_wards_of_Tokyo')[3] # Read html table 
df_tokyo.drop(index = 23, inplace = True) # Drop line with no info
df_tokyo = df_tokyo[['Name', 'Major districts']] # Pick the columns with the useful info
print(df_tokyo.shape) # Dataset shape
df_tokyo.head()

(23, 2)


|   | Name | Major districts |
|---|------|-----------------|
| 0 | Chiyoda | Nagatachō, Kasumigaseki, Ōtemachi, Marunouchi,... |
| 1 | Chūō | Nihonbashi, Kayabachō, Ginza, Tsukiji, Hatchōb... |
| 2 | Minato | Odaiba, Shinbashi, Hamamatsuchō, Mita, Roppong... |
| 3 | Shinjuku | Shinjuku, Takadanobaba, Ōkubo, Kagurazaka, Ich... |
| 4 | Bunkyō | Hongō, Yayoi, Hakusan |


The dataset consists of 23 special wards, each with a list of its major neighborhoods.

Transform the data so we get one line for each neighborhood:

In [6]:
neighs = ['District','Neighborhood']
neighs_tokyo = pd.DataFrame(columns = neighs)

for i in range(len(df_tokyo)):
    rowdata = []
    neigh_list = [n.strip() for n in df_tokyo.loc[i, 'Major districts'].split(',')] # Strip the space left after each comma
    district = df_tokyo.loc[i,'Name']
    for neigh in neigh_list:
        rowdata = [district, neigh]
        neighs_tokyo.loc[len(neighs_tokyo)] = rowdata

print('The dataset has ', len(neighs_tokyo), ' lines (neighborhoods).')
neighs_tokyo.head()

The dataset has  106  lines (neighborhoods).


|   | District | Neighborhood |
|---|----------|--------------|
| 0 | Chiyoda | Nagatachō |
| 1 | Chiyoda | Kasumigaseki |
| 2 | Chiyoda | Ōtemachi |
| 3 | Chiyoda | Marunouchi |
| 4 | Chiyoda | Akihabara |

In [7]:
from geopy.geocoders import Nominatim # Import Nominatim

Get location data for all entries in the Tokyo dataset:

In [8]:
df_coor_tokyo = pd.DataFrame(columns = ['Latitude', 'Longitude'])
geolocator = Nominatim(user_agent="foursquare_agent") # Create the geocoder once, outside the loop
for n, d in zip(neighs_tokyo['Neighborhood'], neighs_tokyo['District']):
    address = n + ',' + d
    location = geolocator.geocode(address) # Look up 'neighborhood,district'
    coordinates = [location.latitude, location.longitude]
    df_coor_tokyo.loc[len(df_coor_tokyo)] = coordinates
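
Two things can go wrong in a geocoding loop like the one above: geopy's `geocode` returns `None` when an address is not found, and Nominatim's public server asks for at most one request per second. The sketch below shows one way to handle both. The helper `geocode_all` and the stand-in `fake_geocode` are hypothetical names introduced here so the example runs without network access; with geopy, `geolocator.geocode` would be passed in instead.

```python
import time
from types import SimpleNamespace
from typing import Callable, List, Optional, Tuple

Coords = Tuple[Optional[float], Optional[float]]

def geocode_all(addresses: List[str],
                geocode: Callable[[str], object],
                delay_seconds: float = 1.0) -> List[Coords]:
    """Geocode each address in turn; unresolved addresses become (None, None)."""
    results: List[Coords] = []
    for i, address in enumerate(addresses):
        if i > 0:
            time.sleep(delay_seconds)   # pause between consecutive requests
        location = geocode(address)     # may be None when nothing matches
        if location is None:
            results.append((None, None))
        else:
            results.append((location.latitude, location.longitude))
    return results

# Stand-in geocoder (assumption): knows a single address, returns None otherwise.
def fake_geocode(address: str):
    known = {"Nagatachō,Chiyoda": SimpleNamespace(latitude=35.675618, longitude=139.743469)}
    return known.get(address)

coords = geocode_all(["Nagatachō,Chiyoda", "Nowhere,Chiyoda"], fake_geocode, delay_seconds=0)
print(coords)
```

geopy also ships a `RateLimiter` wrapper in `geopy.extra.rate_limiter` that covers the pause between calls; the explicit loop is used here only to keep the logic visible.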

In [9]:
df_coor_tokyo.shape # Confirm shape matches neighs_tokyo

(106, 2)

In [10]:
df_tokyo = neighs_tokyo.join(df_coor_tokyo) # join the 2
print(df_tokyo.shape)
df_tokyo.head()

(106, 4)


|   | District | Neighborhood | Latitude | Longitude |
|---|----------|--------------|----------|-----------|
| 0 | Chiyoda | Nagatachō | 35.675618 | 139.743469 |
| 1 | Chiyoda | Kasumigaseki | 35.674054 | 139.750972 |
| 2 | Chiyoda | Ōtemachi | 35.686757 | 139.763616 |
| 3 | Chiyoda | Marunouchi | 35.680656 | 139.765222 |
| 4 | Chiyoda | Akihabara | 35.699736 | 139.77125 |


Replace accented characters in the text columns with their plain ASCII equivalents:

In [11]:
import unidecode

In [12]:
df_tokyo['District'] = df_tokyo['District'].apply(unidecode.unidecode) # Change District Column
df_tokyo['Neighborhood'] = df_tokyo['Neighborhood'].apply(unidecode.unidecode) # Change Neighborhood Column
print(df_tokyo.shape)
df_tokyo.head()

(106, 4)


|   | District | Neighborhood | Latitude | Longitude |
|---|----------|--------------|----------|-----------|
| 0 | Chiyoda | Nagatacho | 35.675618 | 139.743469 |
| 1 | Chiyoda | Kasumigaseki | 35.674054 | 139.750972 |
| 2 | Chiyoda | Otemachi | 35.686757 | 139.763616 |
| 3 | Chiyoda | Marunouchi | 35.680656 | 139.765222 |
| 4 | Chiyoda | Akihabara | 35.699736 | 139.77125 |
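
For names like these, where the only non-ASCII characters are accented Latin letters, a similar result can be obtained with the standard library. This is a sketch of an alternative, not what `unidecode` does internally: it decomposes each character and drops the combining marks. The function name `strip_accents` is an assumption made for this example.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for name in ["Nagatachō", "Ōtemachi", "Chūō"]:
    print(name, "->", strip_accents(name))
```

Unlike `unidecode`, this approach leaves characters without a decomposition untouched (for example `ß`, which `unidecode` transliterates to `ss`), so it is only a substitute for accented Latin text.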


In [13]:
# Get Tokyo coordinates 
address = 'Tokyo'
geolocator = Nominatim(user_agent="foursquare_agent") 
location = geolocator.geocode(address)
tokyo_lat = location.latitude
tokyo_long = location.longitude

Visualize the neighborhood locations with a folium map:

In [14]:
import folium

In [15]:
map_tokyo = folium.Map(location=[tokyo_lat, tokyo_long], zoom_start=11)  # create map object

# add neighborhoods to map
for lat, lng, neighborhood, district in zip(df_tokyo['Latitude'], df_tokyo['Longitude'], df_tokyo['Neighborhood'], df_tokyo['District']):
    label = '{} , {}'.format(neighborhood, district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#D5F5E3',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tokyo)  
    
map_tokyo

## Barcelona location data

Get Barcelona districts dataset:

In [16]:
import requests
from bs4 import BeautifulSoup # Import needed libraries

In [17]:
barca_url = 'https://en.wikipedia.org/wiki/Districts_of_Barcelona' # Get data from url
barca_source = requests.get(barca_url).text

In [18]:
barca_soup = BeautifulSoup(barca_source, 'lxml') # Create a BeautifulSoup object

In [19]:
barca = barca_soup.find('table', class_ ='wikitable').tbody # Fetch the table

Create the dataset structure:

In [20]:
columns = ['Number', 'District', 'Size', 'Pop','Density', 'Neighborhoods','Councilman','Party']
df_barca = pd.DataFrame(columns = columns)

In [21]:
# Loop to get all table info
for tr_cell in barca.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==8:
        df_barca.loc[len(df_barca)] = row_data
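
The `len(row_data)==8` check is what keeps only complete data rows: a row made of `<th>` header cells yields an empty list, and a row with merged cells yields fewer than eight entries. The sketch below illustrates that filter on a made-up three-column table. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; the `TableRows` class and the sample HTML are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up sample: a header row, one data row, one row with a merged cell.
sample_html = """
<table>
  <tr><th>Number</th><th>District</th><th>Neighborhoods</th></tr>
  <tr><td>1</td><td>Ciutat Vella</td><td>La Barceloneta, El Raval</td></tr>
  <tr><td colspan="3">Total</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)

expected_cells = 3  # the notebook's table has 8 columns; this sample has 3
data_rows = [row for row in parser.rows if len(row) == expected_cells]
print(data_rows)
```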

In [22]:
df_barca.head()

|   | Number | District | Size | Pop | Density | Neighborhoods | Councilman | Party |
|---|--------|----------|------|-----|---------|---------------|------------|-------|
| 0 | 1 | Ciutat Vella | 4.49 | 111290 | 24786 | La Barceloneta, El Gòtic, El Raval, Sant Pere,... | Jordi Rabassa i Massons | Barcelona en Comú |
| 1 | 2 | Eixample | 7.46 | 262485 | 35586 | L'Antiga Esquerra de l'Eixample, La Nova Esque... | Jordi Martí Grau | Barcelona en Comú |
| 2 | 3 | Sants-Montjuïc | 21.35 | 177636 | 8321 | La Bordeta, la Font de la Guatlla, Hostafrancs... | Marc Serra Solé | Barcelona en Comú |
| 3 | 4 | Les Corts | 6.08 | 82588 | 13584 | les Corts, la Maternitat i Sant Ramon, Pedralbes | Xavier Marcé Carol | Socialists' Party of Catalonia |
| 4 | 5 | Sarrià-Sant Gervasi | 20.09 | 140461 | 6992 | El Putget i Farró, Sarrià, Sant Gervasi - la B... | Albert Batlle i Bastardas | Socialists' Party of Catalonia |


Transform the dataset to match our needs:

In [23]:
neighs = ['District','Neighborhood'] # Select useful columns

# Get one line for each neighborhood

df_neigh = pd.DataFrame(columns = neighs) 
for i in range(len(df_barca)):
    rowdata = []
    neigh_list = [n.strip() for n in df_barca.loc[i, 'Neighborhoods'].split(',')] # Strip the space left after each comma
    district = df_barca.loc[i,'District']
    for neigh in neigh_list:
        rowdata = [district, neigh]
        df_neigh.loc[len(df_neigh)] = rowdata

print(df_neigh.shape) # get new dataset shape
df_neigh.head()

(76, 2)


|   | District | Neighborhood |
|---|----------|--------------|
| 0 | Ciutat Vella | La Barceloneta |
| 1 | Ciutat Vella | El Gòtic |
| 2 | Ciutat Vella | El Raval |
| 3 | Ciutat Vella | Sant Pere |
| 4 | Ciutat Vella | Santa Caterina i la Ribera |


Get location data for all entries in the Barcelona dataset:

In [26]:
import numpy as np # Import numpy

In [27]:
df_coor_barca = pd.DataFrame(columns = ['Latitude', 'Longitude']) # Define locations dataframe

geolocator = Nominatim(user_agent="foursquare_agent") # Create the geocoder once, outside the loop

for n, d in zip(df_neigh['Neighborhood'], df_neigh['District']):
    try:
        address = n + ',' + d
        location = geolocator.geocode(address) # Returns None when the address is not found
        coordinates = [location.latitude, location.longitude]
        df_coor_barca.loc[len(df_coor_barca)] = coordinates
    except Exception: # Not all lines will return location data
        df_coor_barca.loc[len(df_coor_barca)] = np.nan # Keep a placeholder row so indexes stay aligned with df_neigh

In [37]:
print(df_coor_barca.shape) # See if shape matches df_neigh shape
df_coor_barca.head()

(76, 2)


|   | Latitude | Longitude |
|---|----------|-----------|
| 0 | 41.380653 | 2.189927 |
| 1 | 41.381505 | 2.177418 |
| 2 | 41.379518 | 2.168368 |
| 3 | 41.388322 | 2.177411 |
| 4 | 41.38665 | 2.184194 |


However, one line has no location data:

In [38]:
missing_data = df_coor_barca.isnull() # False when not null, true when null
for column in missing_data.columns.values.tolist(): # Count missing values for each column
    print(column)
    print(missing_data[column].value_counts())
    print("") 

Latitude
False    75
True      1
Name: Latitude, dtype: int64

Longitude
False    75
True      1
Name: Longitude, dtype: int64
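
The same count can be obtained more compactly, and it is often useful to know which row is affected before dropping it. The sketch below assumes a small stand-in frame with one missing entry; the names `coords`, `missing_per_column` and `rows_without_location` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for df_coor_barca (assumption): three rows, the second one failed to geocode.
coords = pd.DataFrame({
    "Latitude": [41.380653, np.nan, 41.379518],
    "Longitude": [2.189927, np.nan, 2.168368],
})

missing_per_column = coords.isna().sum()   # number of NaN values in each column
rows_without_location = coords[coords["Latitude"].isna()].index.tolist()

print(missing_per_column)
print(rows_without_location)
```

Joining `rows_without_location` back to the neighborhood names would show which address the geocoder could not resolve.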



Merge the two datasets:

In [39]:
df_barca = df_neigh.join(df_coor_barca)
print(df_barca.shape) # Check shape
df_barca.head()

(76, 4)


|   | District | Neighborhood | Latitude | Longitude |
|---|----------|--------------|----------|-----------|
| 0 | Ciutat Vella | La Barceloneta | 41.380653 | 2.189927 |
| 1 | Ciutat Vella | El Gòtic | 41.381505 | 2.177418 |
| 2 | Ciutat Vella | El Raval | 41.379518 | 2.168368 |
| 3 | Ciutat Vella | Sant Pere | 41.388322 | 2.177411 |
| 4 | Ciutat Vella | Santa Caterina i la Ribera | 41.38665 | 2.184194 |


Drop the row with the missing value; a neighborhood without coordinates cannot be used.

In [40]:
df_barca.dropna(subset=["Latitude"], axis=0, inplace=True) # Drop lines missing locations values
df_barca.reset_index(drop=True, inplace=True) 
print(df_barca.shape)
df_barca.head()

(75, 4)


|   | District | Neighborhood | Latitude | Longitude |
|---|----------|--------------|----------|-----------|
| 0 | Ciutat Vella | La Barceloneta | 41.380653 | 2.189927 |
| 1 | Ciutat Vella | El Gòtic | 41.381505 | 2.177418 |
| 2 | Ciutat Vella | El Raval | 41.379518 | 2.168368 |
| 3 | Ciutat Vella | Sant Pere | 41.388322 | 2.177411 |
| 4 | Ciutat Vella | Santa Caterina i la Ribera | 41.38665 | 2.184194 |


Replace accented characters in the text columns with their plain ASCII equivalents:

In [42]:
df_barca['District'] = df_barca['District'].apply(unidecode.unidecode)
df_barca['Neighborhood'] = df_barca['Neighborhood'].apply(unidecode.unidecode)
df_barca.head()

|   | District | Neighborhood | Latitude | Longitude |
|---|----------|--------------|----------|-----------|
| 0 | Ciutat Vella | La Barceloneta | 41.380653 | 2.189927 |
| 1 | Ciutat Vella | El Gotic | 41.381505 | 2.177418 |
| 2 | Ciutat Vella | El Raval | 41.379518 | 2.168368 |
| 3 | Ciutat Vella | Sant Pere | 41.388322 | 2.177411 |
| 4 | Ciutat Vella | Santa Caterina i la Ribera | 41.38665 | 2.184194 |


Visualize locations with a folium map:

In [43]:
address = 'Barcelona'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
barca_lat = location.latitude
barca_long = location.longitude

In [44]:
map_barca = folium.Map(location=[barca_lat, barca_long], zoom_start=11)

# add neighborhoods to map
for lat, lng, district, neighborhood in zip(df_barca['Latitude'], df_barca['Longitude'], df_barca['District'],df_barca['Neighborhood']):
    label = '{} , {}'.format(neighborhood, district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#FDEDEC',
        fill_opacity=0.7,
        parse_html=False).add_to(map_barca)  
    
map_barca

## New York: Queens' Neighborhoods location data

Get the New York neighborhoods dataset:

In [45]:
import urllib.request # import library to download the data
url = 'https://cocl.us/new_york_dataset' # Labs dataset
filename = 'newyork_data.json'
urllib.request.urlretrieve(url, filename) # Retrieve data

('newyork_data.json', <http.client.HTTPMessage at 0x1e90ecdc748>)

In [46]:
import json # import to handle json

In [48]:
with open('newyork_data.json') as json_data: 
    newyork_data = json.load(json_data) # load json data

In [49]:
neighborhoods_data = newyork_data['features']

In [50]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
neighborhoods = pd.DataFrame(columns=column_names) # Create dataframe structure

In [51]:
# Load the dataframe with json data

rows = [] # Collect one dict per neighborhood, then build the dataframe in one step

for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    neighborhood_latlon = data['geometry']['coordinates'] # GeoJSON order: [longitude, latitude]
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names) # DataFrame.append was removed in pandas 2.0

In [52]:
print(neighborhoods.shape)
neighborhoods.head()

(306, 4)


|   | Borough | Neighborhood | Latitude | Longitude |
|---|---------|--------------|----------|-----------|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
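
The loading loop reads latitude from index 1 and longitude from index 0 because GeoJSON stores point coordinates as `[longitude, latitude]`. The sketch below shows this on a hand-written two-feature sample. The sample is an assumption: it is trimmed to the fields the loop reads and uses the rounded values displayed in this notebook, and `to_row` is a hypothetical helper.

```python
# Hand-written sample in the same shape as newyork_data (assumption: trimmed fields, rounded values).
sample = {
    "features": [
        {
            "properties": {"borough": "Bronx", "name": "Wakefield"},
            "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
        },
        {
            "properties": {"borough": "Queens", "name": "Astoria"},
            "geometry": {"type": "Point", "coordinates": [-73.915654, 40.768509]},
        },
    ]
}

def to_row(feature):
    """Flatten one GeoJSON feature into a Borough/Neighborhood/Latitude/Longitude dict."""
    lon, lat = feature["geometry"]["coordinates"][:2]   # GeoJSON order: longitude first
    props = feature["properties"]
    return {"Borough": props["borough"], "Neighborhood": props["name"],
            "Latitude": lat, "Longitude": lon}

rows = [to_row(feature) for feature in sample["features"]]
queens_rows = [row for row in rows if row["Borough"] == "Queens"]
print(queens_rows)
```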


Select the borough of Queens for our analysis:

In [53]:
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
print(queens_data.shape)
queens_data.head()

(81, 4)


|   | Borough | Neighborhood | Latitude | Longitude |
|---|---------|--------------|----------|-----------|
| 0 | Queens | Astoria | 40.768509 | -73.915654 |
| 1 | Queens | Woodside | 40.746349 | -73.901842 |
| 2 | Queens | Jackson Heights | 40.751981 | -73.882821 |
| 3 | Queens | Elmhurst | 40.744049 | -73.881656 |
| 4 | Queens | Howard Beach | 40.654225 | -73.838138 |


Queens has a total of 81 neighborhoods.

Visualize location data with a Folium map:

In [54]:
address = 'South Ozone Park , Queens, NY' # will be used as the map center 

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
queens_lat = location.latitude
queens_long = location.longitude

In [55]:
map_queens = folium.Map(location=[queens_lat, queens_long], zoom_start=11)

# add queens neighborhoods to map
for lat, lng, neighborhood, borough in zip(queens_data['Latitude'], queens_data['Longitude'], queens_data['Neighborhood'],  queens_data['Borough']):
    label = '{} , {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#D6EAF8',
        fill_opacity=0.7,
        parse_html=False).add_to(map_queens)  
    
map_queens

Now we have the location data for **106 neighborhoods in Tokyo, 75 in Barcelona and 81 in Queens, New York.**

We can now study the type of venues in these locations.

---