# Capstone Project - The Battle of the Neighborhoods

## Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>
This project helps to assess how similar two areas in different cities are.
We compare two areas from two cities, Moscow and New York. Knowing how similar two areas are can inform a decision about moving from one city to the other or about expanding a business, or it can simply provide interesting information about two areas on opposite sides of the globe.

## Data <a name="data"></a>

First we have to understand the territorial division of the two cities to decide which areas should be explored.
Then every area should be explored to define the initial similarity criteria.
Finally, the chosen areas are explored in detail.

Based on the definition of our problem, the following data will be needed to extract or generate the required information:
- geographical coordinates of the studied cities
- territory division
- initial information about every area

Most of the initial information will be scraped from Wikipedia, a free and open data source. More detailed information about every location will be obtained using the Foursquare API.

In [1]:
# compare two heatmaps of average temp by month in NYC and Moscow

New York City (NYC) is the most populous city in the United States. With an estimated 2018 population of 8,398,748 distributed over a land area of about 302.6 square miles (784 km2), New York is also the most densely populated major city in the United States. A global power city, New York City has been described as the cultural, financial, and media capital of the world, and exerts a significant impact upon commerce, entertainment, research, technology, education, politics, tourism, art, fashion, and sports. Situated on one of the world's largest natural harbors, New York City consists of five boroughs, each of which is a separate county of the State of New York. The five boroughs – Brooklyn, Queens, Manhattan, The Bronx, and Staten Island – were consolidated into a single city in 1898.

Moscow is the capital and most populous city of Russia, with approximately 12.6 million residents within the city limits. Moscow is the northernmost and coldest megacity on the Earth. Moscow is a major political, economic, cultural, and scientific centre of Russia and Eastern Europe. It is the second-most populous city in Europe, the most populous city entirely within Europe, as well as the largest city (by area) on the European continent. Moscow has been ranked as the ninth most expensive city in the world and has one of the world's largest urban economies, being ranked as an alpha global city, and is also one of the fastest growing tourist destinations in the world. Moscow is home to the third-highest number of billionaires of any city in the world, and has the highest number of billionaires of any city in Europe. The city of Moscow is divided into twelve administrative okrugs. By its territorial expansion on July 1, 2012 southwest into the Moscow Oblast, the area of the capital more than doubled, going from 1,091 to 2,511 square kilometers (421 to 970 sq mi), resulting in Moscow becoming the largest city on the European continent by area; it also gained an additional population of 233,000 people.

Let's collect some data about these cities

Importing libraries

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import re
from sklearn.cluster import KMeans
import json
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import time
import random

In [3]:
#We will use geopy library to get the latitude and longitude values.
#Let's write a function and name user agent as Moscow_explorer
def get_coord(addr):
    address = '%s' % addr
    geolocator = Nominatim(user_agent="Moscow_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    return latitude, longitude
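Note that geopy's `geocode` returns `None` when an address cannot be resolved, in which case `location.latitude` raises an `AttributeError`. A minimal sketch of a more defensive wrapper is shown below. The helper name `safe_coords` and the stand-in `fake_geocode` are assumptions made for illustration; passing the geocoding callable in as an argument also lets the logic be checked without network access.

```python
from types import SimpleNamespace

def safe_coords(geocode, address):
    """Return (latitude, longitude) for address, or (None, None) if it is not found.

    `geocode` is any callable shaped like geopy's Nominatim.geocode: it takes an
    address string and returns an object with .latitude and .longitude, or None
    when nothing matches.
    """
    location = geocode(address)
    if location is None:
        return None, None
    return location.latitude, location.longitude

# Stand-in geocoder for illustration only (hypothetical data); real code would
# pass Nominatim(user_agent="Moscow_explorer").geocode instead.
def fake_geocode(address):
    known = {'Moscow': SimpleNamespace(latitude=55.75, longitude=37.62)}
    return known.get(address)

found = safe_coords(fake_geocode, 'Moscow')
missing = safe_coords(fake_geocode, 'Nowhere')
```

Rows that come back as `(None, None)` can then be dropped or retried instead of stopping the whole loop.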

### Part 1 - Collecting data about Moscow

In [4]:
#Getting coordinates of Moscow
url = 'https://en.wikipedia.org/wiki/Moscow'
r = requests.get(url)
website = r.text
coords=re.findall('"wgCoordinates":({.*\n.*}),', website)[0].replace('\n','')
Moscow_latitude = float(re.findall('{"lat":(.*),', coords)[0])
Moscow_longitude = float(re.findall('"lon":(.*)}', coords)[0])
print('The geographical coordinates of Moscow are {}, {}.'.format(Moscow_latitude, Moscow_longitude))

The geographical coordinates of Moscow are 55.755833333333335, 37.617222222222225.
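The chained regular expressions above depend on the exact line breaks in the page source. As a hedged alternative, the embedded `wgCoordinates` object can be captured once and decoded as JSON. The sketch below assumes the object is flat (no nested braces); the function name `parse_wg_coordinates` and the sample string are made up for illustration.

```python
import json
import re

def parse_wg_coordinates(html):
    """Extract (lat, lon) from the "wgCoordinates" object embedded in a page source."""
    # Non-greedy match up to the first closing brace; assumes a flat object
    match = re.search(r'"wgCoordinates":\s*(\{.*?\})', html, flags=re.DOTALL)
    if match is None:
        raise ValueError('wgCoordinates not found in page source')
    coords = json.loads(match.group(1))
    return float(coords['lat']), float(coords['lon'])

# Hand-written snippet imitating the relevant part of the page source
sample = 'x={"wgCoordinates":{"lat":55.755833333333335,\n"lon":37.617222222222225},"wgOther":1};'
lat, lon = parse_wg_coordinates(sample)
```

Because the match is decoded as JSON, the same function works whether or not the object is split across lines.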


In [5]:
# Create an empty DataFrame to fill with data later
df = pd.DataFrame()

# Let's get Okrug names at first
ovrpsurl = "https://overpass.kumi.systems/api/interpreter"
overpass_url = ovrpsurl
overpass_query = """
[out:json];
area["addr:country"="RU"]["addr:region"="Москва"][admin_level=5];
out;
"""
response = requests.get(overpass_url, 
                        params={'data': overpass_query})
data = response.json()

Okrug_Names = []
for i in range(len(data['elements'])):
    Okrug_Names.append(data['elements'][i]['tags']['name:en'])

df['Okrug_Name'] = Okrug_Names
df

Unnamed: 0,Okrug_Name
0,Northern Administrative Okrug
1,Western Administrative Okrug
2,North-Western Administrative Okrug
3,North-Eastern Administrative Okrug
4,South-Eastern Administrative Okrug
5,Southern Administrative Okrug
6,South-Western Administrative Okrug
7,Eastern Administrative Okrug
8,Zelenogradsky Administrative Okrug
9,Central Administrative Okrug
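The loop in the cell above assumes that every returned element carries a `name:en` tag and raises a `KeyError` otherwise. A small sketch of a more tolerant extraction is given below; it falls back to the local `name` tag and skips elements without any name. The function name `okrug_names` and the sample response are assumptions for illustration, not real Overpass output.

```python
def okrug_names(overpass_json):
    """Collect English names from an Overpass JSON response, falling back to the local name."""
    names = []
    for element in overpass_json.get('elements', []):
        tags = element.get('tags', {})
        name = tags.get('name:en') or tags.get('name')
        if name:
            names.append(name)
    return names

# Hand-written sample with the same shape as the parsed response (illustrative only)
sample_response = {
    'elements': [
        {'tags': {'name': 'Северный административный округ',
                  'name:en': 'Northern Administrative Okrug'}},
        {'tags': {'name': 'Западный административный округ'}},  # no English name
        {'tags': {}},                                            # no name at all, skipped
    ]
}
names = okrug_names(sample_response)
```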


In [6]:
# # Let's get districts of each Okrug
# Okrug_districts = []
# Okrug_dict = {}
# for Okrug in df['Okrug_Name']:
#     print('Processing', Okrug)
#     overpass_url = ovrpsurl
#     overpass_query = """
#     [out:json];
#     area['name:en' = '%s'][type="boundary"];
#     rel(area)[admin_level=8][boundary=administrative];
#     out;
#     """ % Okrug
#     response = requests.get(overpass_url, 
#                             params={'data': overpass_query})
#     data = response.json()
#     Okrug_districts = []
#     for i in range(len(data['elements'])):
#         try:
#             Okrug_districts.append(data['elements'][i]['tags']['name:en'])
#         except:
#             Okrug_districts.append(data['elements'][i]['tags']['name'])
#     Okrug_dict[Okrug] = Okrug_districts
# full = []
# for el in Okrug_dict.keys():
#     for el2 in Okrug_dict[el]:
#         full.append([el, el2])
# dff = pd.DataFrame(data=full, index=None, columns=['Okrug', 'District'], dtype=None, copy=False)
# dff.shape

In [7]:
# full_coords_lat = []
# full_coords_lon = []
# for index, row in dff.iterrows():
#     print('Processing', row['District'], 'of', row['Okrug'])
#     lat, lon = get_coord(','.join([row['Okrug'], row['District']]))
#     full_coords_lat.append(lat)
#     full_coords_lon.append(lon)
# dff['Latitude'] = full_coords_lat
# dff['Longitude'] = full_coords_lon

# # Save df to file
# dff.to_csv('coords_full.csv', sep=',', encoding='utf-8', index=False)
# #read complete data from file
dff = pd.read_csv('coords_full.csv')
dff.head()

Unnamed: 0,Okrug,District,Latitude,Longitude
0,Northern Administrative Okrug,Levoberezhny District,55.865663,37.465859
1,Northern Administrative Okrug,Khovrino District,55.869357,37.488795
2,Northern Administrative Okrug,Zapadnoye Degunino District,55.870548,37.520804
3,Northern Administrative Okrug,Timiryazevsky District,55.825817,37.557744
4,Northern Administrative Okrug,Koptevo District,55.830065,37.521738


In [8]:
# Draw the map of Moscow districts
def color_picker(Okrug):
    color = ['darkred', 'blue', 'green', 'lightblue', 'darkblue', \
             'pink', 'white', 'orange', 'lightgreen', 'purple', 'red', 'black']
    return(color[int((df[df['Okrug_Name'] == Okrug].index)[0])])


# Let's draw all districts on map
map_Moscow = folium.Map(location=[Moscow_latitude - 0.1, Moscow_longitude], zoom_start=9)
for Okrug, District, lat, lng in zip(dff['Okrug'], \
                             dff['District'], \
                             dff['Latitude'], \
                             dff['Longitude']):
    label = '{}, {}'.format(Okrug, District)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color_picker(Okrug),
        fill=True,
        fill_color='#863100',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Moscow)  
map_Moscow

### Part 2 - Collecting data about New York

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

In [9]:
#Getting coordinates of New York
url = 'https://en.wikipedia.org/wiki/New_York_City'
r = requests.get(url)
website = r.text
coords=re.findall('"wgCoordinates":(.*)},"wgCentralAuthMobileDomain"', website)[0]
NYC_latitude = float(re.findall('"lat":(.*),', coords)[0])
NYC_longitude = float(re.findall('"lon":(.*)', coords)[0])
print('The geographical coordinates of New York are {}, {}.'.format(NYC_latitude, NYC_longitude))

The geographical coordinates of New York are 40.71274, -74.005974.


In [10]:
# Download the New York neighborhoods dataset
import urllib.request
url = 'https://cocl.us/new_york_dataset'
urllib.request.urlretrieve(url, 'newyork_data.json')
# Next, let's load the data and transform it into a pandas dataframe
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# Loop through the data and collect one row per neighborhood.
# GeoJSON stores coordinates as [longitude, latitude].
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

neighborhoods.head(3)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806


In [11]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


In [12]:
# Create a map of New York with neighborhoods superimposed on top
def color_pickerNY(borough):
    color = ['darkred', 'blue', 'green', 'black', 'darkblue']
    return(color[list(neighborhoods['Borough'].unique()).index(borough)])

# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[NYC_latitude, NYC_longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color_pickerNY(borough),
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### Part 3 - comparison candidates

Let's explore some other data about these cities.

1. Moscow

In [13]:
url = 'https://ru.wikipedia.org/wiki/Административно-территориальное_деление_Москвы'
r = requests.get(url)
website = r.text.replace(',','.').replace('↗','').replace('%','').replace('&#160;','').replace('\n','') # Correcting decimal separator
table_Moscow = pd.read_html( website, encoding="UTF-8", na_values=None, keep_default_na=False)[1]
table_Moscow

Unnamed: 0,Административный округ,Площадь км²1.07.2012[4][5],отобщейплощади,Местопоплощади,Населениечел. 01.01.2020[6],отобщегонаселения,Место понаселению,Плотностьнаселениячел. / км²01.01.2020,Место поплотностинаселения
0,Центральный,66.18,2.62,11.0,783886[6],6.18,9.0,11845.56,5.0
1,Северный,113.73,4.5,7.0,1188312[6],9.37,7.0,10448.9,7.0
2,Северо-Восточный,101.88,4.03,9.0,1434842[6],11.32,4.0,14083.23,1.0
3,Восточный,154.84,6.13,3.0,1527316[6],12.05,2.0,9864.12,8.0
4,Юго-Восточный,117.56,4.65,6.0,1433828[6],11.31,5.0,12196.59,4.0
5,Южный,131.77,5.22,5.0,1796267[6],14.17,1.0,13631.54,2.0
6,Юго-Западный,111.36,4.41,8.0,1448130[6],11.42,3.0,13003.78,3.0
7,Западный,153.03,6.06,4.0,1397114[6],11.02,6.0,9129.42,9.0
8,Северо-Западный,93.28,3.69,10.0,1012949[6],7.99,8.0,10859.11,6.0
9,Зеленоградский,37.2,1.47,12.0,250453[6],1.98,11.0,6732.63,10.0


In [14]:
table_Moscow.columns = ['Administrative Okrug', 'Land area km²', '% of all area', 'Place by area', 'Population',\
             '% of all population', 'Place by population', 'Density pers/km²', 'Place by density']
table_Moscow['Population'] = table_Moscow['Population'].str[:-3].astype(int)
table_Moscow = table_Moscow.replace(['Центральный', 'Северный', 'Северо-Восточный', 'Восточный', 'Юго-Восточный',\
               'Южный', 'Юго-Западный', 'Западный', 'Северо-Западный',\
                'Зеленоградский', 'Троицкий', 'Новомосковский', 'Вся Москва'],\
               ['Central', 'Northern', 'North-Eastern', 'Eastern', 'South-Eastern',\
                'Southern', 'South-Western', 'Western', 'North-Western', \
                'Zelenogradsky', 'Troitsky', 'Novomoskovsky', 'All Moscow'])
table_Moscow

Unnamed: 0,Administrative Okrug,Land area km²,% of all area,Place by area,Population,% of all population,Place by population,Density pers/km²,Place by density
0,Central,66.18,2.62,11.0,783886,6.18,9.0,11845.56,5.0
1,Northern,113.73,4.5,7.0,1188312,9.37,7.0,10448.9,7.0
2,North-Eastern,101.88,4.03,9.0,1434842,11.32,4.0,14083.23,1.0
3,Eastern,154.84,6.13,3.0,1527316,12.05,2.0,9864.12,8.0
4,South-Eastern,117.56,4.65,6.0,1433828,11.31,5.0,12196.59,4.0
5,Southern,131.77,5.22,5.0,1796267,14.17,1.0,13631.54,2.0
6,South-Western,111.36,4.41,8.0,1448130,11.42,3.0,13003.78,3.0
7,Western,153.03,6.06,4.0,1397114,11.02,6.0,9129.42,9.0
8,North-Western,93.28,3.69,10.0,1012949,7.99,8.0,10859.11,6.0
9,Zelenogradsky,37.2,1.47,12.0,250453,1.98,11.0,6732.63,10.0
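Slicing with `.str[:-3]` removes the footnote marker only while it is exactly three characters long, as in `[6]`; a longer marker would leave a stray bracket behind and break the integer conversion. A hedged alternative is to strip any bracketed number with a regular expression. The helper name `parse_population` is hypothetical, and the `[12]` marker in the sample is invented to show the longer case.

```python
import re

def parse_population(cell):
    """Convert a scraped cell such as '783886[6]' to an int, dropping footnote markers."""
    cleaned = re.sub(r'\[\d+\]', '', str(cell)).strip()
    return int(cleaned)

# The first two cells come from the table above; the '[12]' marker is made up
values = [parse_population(c) for c in ['783886[6]', '1188312[6]', '250453[12]']]
```

In the notebook this could be applied with `table_Moscow['Population'].map(parse_population)` in place of the slice.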


2. New York

In [15]:
url = 'https://en.wikipedia.org/wiki/Boroughs_of_New_York_City'
r = requests.get(url)
website = r.text
table_NYC = pd.read_html(website, encoding="UTF-8", na_values=None, keep_default_na=False)[0]
table_NYC

Unnamed: 0_level_0,New York City's five boroughsvte,New York City's five boroughsvte,New York City's five boroughsvte,New York City's five boroughsvte,New York City's five boroughsvte,New York City's five boroughsvte,New York City's five boroughsvte,New York City's five boroughsvte,New York City's five boroughsvte
Unnamed: 0_level_1,Jurisdiction,Jurisdiction,Population,Gross Domestic Product,Gross Domestic Product,Land area,Land area,Density,Density
Unnamed: 0_level_2,Borough,County,Estimate (2019)[3],billions(2012 US$)[4],per capita(US$),square miles,squarekm,persons / sq. mi,persons /km2
0,The Bronx,Bronx,1418207,42.695,30100,42.10,109.04,33867,13006
1,Brooklyn,Kings,2559903,91.559,35800,70.82,183.42,36147,13957
2,Manhattan,New York,1628706,600.244,368500,22.83,59.13,71341,27544
3,Queens,Queens,2253858,93.310,41400,108.53,281.09,20767,8018
4,Staten Island,Richmond,476143,14.514,30500,58.37,151.18,8157,3150
5,City of New York,City of New York,8336817,842.343,101000,302.64,783.83,27547,10636
6,State of New York,State of New York,19453561,1731.910,89000,47126.40,122056.82,412,159
7,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles,Sources:[5] and see individual borough articles


In [16]:
table_NYC = table_NYC[:-2].copy()  # drop the "State of New York" and "Sources" rows
table_NYC.columns = ['Borough','County','Population','GDP billions','GDP per capita',\
                  'Land area sq mi','Land area km²','Density pers/sq mi','Density pers/km²']
# keep only the columns that can be compared with the Moscow table
# (the positional axis argument of drop() is not accepted by pandas 2.0+)
table_NYC = table_NYC.drop(columns=['County', 'GDP billions', 'GDP per capita',
                                    'Land area sq mi', 'Density pers/sq mi'])
table_NYC['Population'] = table_NYC['Population'].astype(int)
table_NYC['Land area km²'] = table_NYC['Land area km²'].astype(float)
table_NYC['Density pers/km²'] = table_NYC['Density pers/km²'].astype(int)
# the "City of New York" row holds the totals, so max() is the city-wide value
table_NYC['% of all population'] = (table_NYC['Population']/table_NYC['Population'].max()*100).round(2)
table_NYC['% of all area'] = (table_NYC['Land area km²']/table_NYC['Land area km²'].max()*100).round(2)
table_NYC

Unnamed: 0,Borough,Population,Land area km²,Density pers/km²,% of all population,% of all area
0,The Bronx,1418207,109.04,13006,17.01,13.91
1,Brooklyn,2559903,183.42,13957,30.71,23.4
2,Manhattan,1628706,59.13,27544,19.54,7.54
3,Queens,2253858,281.09,8018,27.03,35.86
4,Staten Island,476143,151.18,3150,5.71,19.29
5,City of New York,8336817,783.83,10636,100.0,100.0


In [17]:
table_NYC.sort_values(by='% of all area',ascending=0, inplace=True)
table_NYC['Place by area'] = ['',1,2,3,4,5]
table_NYC.sort_values(by='Population',ascending=0, inplace=True)
table_NYC['Place by population'] = ['',1,2,3,4,5]
table_NYC.sort_values(by='Density pers/km²',ascending=0, inplace=True)
table_NYC['Place by density'] = [1,2,3,'',4,5]
table_NYC.sort_index(inplace=True)
cols_changed = ['Borough', 'Land area km²', '% of all area', 'Place by area', 'Population', '% of all population',\
                'Place by population', 'Density pers/km²', 'Place by density']
table_NYC = table_NYC[cols_changed]
table_NYC

Unnamed: 0,Borough,Land area km²,% of all area,Place by area,Population,% of all population,Place by population,Density pers/km²,Place by density
0,The Bronx,109.04,13.91,4.0,1418207,17.01,4.0,13006,3.0
1,Brooklyn,183.42,23.4,2.0,2559903,30.71,1.0,13957,2.0
2,Manhattan,59.13,7.54,5.0,1628706,19.54,3.0,27544,1.0
3,Queens,281.09,35.86,1.0,2253858,27.03,2.0,8018,4.0
4,Staten Island,151.18,19.29,3.0,476143,5.71,5.0,3150,5.0
5,City of New York,783.83,100.0,,8336817,100.0,,10636,
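The places in the cell above are assigned by hand after each sort, which only works while the city-total row lands in a known position. As a sketch of an alternative, `Series.rank` can compute the same places directly. The example below uses only the five borough rows from the table (leaving out the total row is an assumption of this sketch) and a hypothetical frame named `boroughs`.

```python
import pandas as pd

# Borough figures copied from the table above; the city-total row is left out
boroughs = pd.DataFrame({
    'Borough': ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
    'Land area km²': [109.04, 183.42, 59.13, 281.09, 151.18],
    'Population': [1418207, 2559903, 1628706, 2253858, 476143],
})

# Place 1 = largest value; method='min' gives tied rows the same (lowest) place
boroughs['Place by area'] = (
    boroughs['Land area km²'].rank(ascending=False, method='min').astype(int)
)
boroughs['Place by population'] = (
    boroughs['Population'].rank(ascending=False, method='min').astype(int)
)
```

The resulting places agree with the hand-assigned ones in the table above.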


Now let's take two districts that are similar in land area, population and density for comparison:

In [18]:
new = table_Moscow[table_Moscow['Administrative Okrug'] == 'South-Eastern']
new = new.rename(columns={'Administrative Okrug':'Name'})  # rename on a copy, not in place on a slice
new1 = table_NYC[table_NYC['Borough'] == 'The Bronx']
new1 = new1.rename(columns={'Borough':'Name'})
new = pd.concat([new, new1], sort=False)
new


Unnamed: 0,Name,Land area km²,% of all area,Place by area,Population,% of all population,Place by population,Density pers/km²,Place by density
4,South-Eastern,117.56,4.65,6,1433828,11.31,5,12196.59,4
0,The Bronx,109.04,13.91,4,1418207,17.01,4,13006.0,3


In [19]:
NYC_compare_candidate_df = neighborhoods[neighborhoods['Borough'] == 'Bronx'].reset_index(drop=True)
Moscow_compare_candidate_df = dff[dff['Okrug'] == 'South-Eastern Administrative Okrug'].reset_index(drop=True)
print('South-Eastern area in Moscow has %s districts and Bronx Area from New York has %s neighborhoods'\
      % (Moscow_compare_candidate_df.shape[0], NYC_compare_candidate_df.shape[0]))

South-Eastern area in Moscow has 12 districts and Bronx Area from New York has 52 neighborhoods


Comparing 12 areas with 52 would not be fair, so let's use the coordinates of the post offices in these 12 districts. This lets us increase the number of points inside the study area. Let's take some data from https://data.mos.ru/opendata/1095. I have prepared a CSV file for the South-Eastern Administrative Okrug that includes the addresses of post offices.

In [21]:
df_coord = pd.read_csv('OPS.csv')
df_coord.head()

Unnamed: 0,ShortName,PostalCode,AdmArea,District,Longitude,Latitude
0,OPS_20,111020,South-Eastern Administrative Okrug,Lefortovo District,37.716897,55.766902
1,OPS_24,111024,South-Eastern Administrative Okrug,Lefortovo District,37.717184,55.750886
2,OPS_33,111033,South-Eastern Administrative Okrug,Lefortovo District,37.687112,55.758889
3,OPS_52,109052,South-Eastern Administrative Okrug,Nizhegorodsky District,37.721188,55.730655
4,OPS_88,115088,South-Eastern Administrative Okrug,Yuzhnoportovy District,37.676188,55.716392


In [22]:
print('Now we have %s points, which is close to the number of neighborhoods in New York.\
 Now we will show them on the map' % (df_coord.shape[0]))

Now we have 56 points, which is close to the number of neighborhoods in New York. Now we will show them on the map


In [23]:
map_OPS_Moscow = folium.Map(location=[Moscow_latitude - 0.1, Moscow_longitude], zoom_start=10)
for postcode, lat, lng, rai in zip(df_coord['ShortName'], \
                             df_coord['Latitude'], \
                             df_coord['Longitude'], \
                             df_coord['District']):
    label = '{}, {}'.format(postcode, rai)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#863100',
        fill_opacity=0.7,
        parse_html=False).add_to(map_OPS_Moscow)  
map_OPS_Moscow

In [24]:
df_coord["OPS_District"] = df_coord['District'].map(str)+'_'+df_coord['ShortName'].map(str)
df_coord.head()

Unnamed: 0,ShortName,PostalCode,AdmArea,District,Longitude,Latitude,OPS_District
0,OPS_20,111020,South-Eastern Administrative Okrug,Lefortovo District,37.716897,55.766902,Lefortovo District_OPS_20
1,OPS_24,111024,South-Eastern Administrative Okrug,Lefortovo District,37.717184,55.750886,Lefortovo District_OPS_24
2,OPS_33,111033,South-Eastern Administrative Okrug,Lefortovo District,37.687112,55.758889,Lefortovo District_OPS_33
3,OPS_52,109052,South-Eastern Administrative Okrug,Nizhegorodsky District,37.721188,55.730655,Nizhegorodsky District_OPS_52
4,OPS_88,115088,South-Eastern Administrative Okrug,Yuzhnoportovy District,37.676188,55.716392,Yuzhnoportovy District_OPS_88


Finally we can create one dataframe with all points to study

In [25]:
# .copy() makes an independent dataframe, so adding a column does not touch df_coord
Moscow_compare_candidate_df = df_coord[['AdmArea', 'OPS_District', 'Latitude', 'Longitude']].copy()
Moscow_compare_candidate_df['City'] = 'Moscow'
Moscow_compare_candidate_df = Moscow_compare_candidate_df.rename(columns={"AdmArea": "Borough", "OPS_District": "Neighborhood"})
Moscow_compare_candidate_df.head()


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,City
0,South-Eastern Administrative Okrug,Lefortovo District_OPS_20,55.766902,37.716897,Moscow
1,South-Eastern Administrative Okrug,Lefortovo District_OPS_24,55.750886,37.717184,Moscow
2,South-Eastern Administrative Okrug,Lefortovo District_OPS_33,55.758889,37.687112,Moscow
3,South-Eastern Administrative Okrug,Nizhegorodsky District_OPS_52,55.730655,37.721188,Moscow
4,South-Eastern Administrative Okrug,Yuzhnoportovy District_OPS_88,55.716392,37.676188,Moscow


In [26]:
NYC_compare_candidate_df = neighborhoods[neighborhoods['Borough'] == 'Bronx'].reset_index(drop=True)
NYC_compare_candidate_df['City'] = 'New York'
NYC_compare_candidate_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,City
0,Bronx,Wakefield,40.894705,-73.847201,New York
1,Bronx,Co-op City,40.874294,-73.829939,New York
2,Bronx,Eastchester,40.887556,-73.827806,New York
3,Bronx,Fieldston,40.895437,-73.905643,New York
4,Bronx,Riverdale,40.890834,-73.912585,New York


In [27]:
df_all = pd.concat([Moscow_compare_candidate_df, NYC_compare_candidate_df], sort=False).reset_index(drop=True)
df_all

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,City
0,South-Eastern Administrative Okrug,Lefortovo District_OPS_20,55.766902,37.716897,Moscow
1,South-Eastern Administrative Okrug,Lefortovo District_OPS_24,55.750886,37.717184,Moscow
2,South-Eastern Administrative Okrug,Lefortovo District_OPS_33,55.758889,37.687112,Moscow
3,South-Eastern Administrative Okrug,Nizhegorodsky District_OPS_52,55.730655,37.721188,Moscow
4,South-Eastern Administrative Okrug,Yuzhnoportovy District_OPS_88,55.716392,37.676188,Moscow
...,...,...,...,...,...
103,Bronx,Mount Eden,40.843826,-73.916556,New York
104,Bronx,Mount Hope,40.848842,-73.908299,New York
105,Bronx,Bronxdale,40.852723,-73.861726,New York
106,Bronx,Allerton,40.865788,-73.859319,New York


## Methodology <a name="methodology"></a>

We are comparing two areas of two big cities to understand how similar these areas are. Such information is useful when you are planning to move and want to be sure that familiar things are nearby, or, for example, when you want to find out whether it is possible to expand your business there by opening a cafe, bar or gym.

By now we have collected some data about the cities. We have learned about the land area and population of the districts inside them.
Based on this knowledge we have chosen two districts that are similar in land area and population.

Next we will use the Foursquare API to learn more about the chosen areas, namely the venues around every point.

The similarity of these areas will be evaluated by clustering the points on the venue data obtained from the Foursquare API.

## Analysis <a name="analysis"></a>

Foursquare Credentials and Version

In [28]:
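import os
# Read the credentials from environment variables instead of hard-coding them
# (the variable names below are illustrative)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret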
VERSION = '20191111' # Foursquare API version

Function for processing all the neighborhoods

In [29]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe with the venues found within `radius` of each point."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # pause between requests so the API is not flooded
        time.sleep(1)
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-point lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
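The function above indexes straight into the response (`["response"]['groups'][0]['items']` and `['categories'][0]`), so a point with no results or a venue without a category would raise an error. A minimal sketch of a more tolerant parser is given below. The function name `extract_venues` and the sample payload are assumptions for illustration; the payload only mirrors the fields that `getNearbyVenues` reads.

```python
def extract_venues(payload):
    """Flatten a venues/explore-style response into (name, lat, lng, category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of failing on [0]
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written sample (illustrative values only)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 55.76, 'lng': 37.71},
               'categories': [{'name': 'Supermarket'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 55.77, 'lng': 37.72},
               'categories': []}},
]}]}}

rows = extract_venues(sample)
empty = extract_venues({'response': {}})
```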

In [30]:
# Getting all the data
#All_venues = getNearbyVenues(names=df_all['Neighborhood'],
#                                   latitudes=df_all['Latitude'],
#                                   longitudes=df_all['Longitude']
#                                  )
All_venues = pd.read_csv('All_venues.csv', sep='\t', encoding='utf-8')

In [31]:
# checking resulting dataframe
All_venues.head()
#All_venues.to_csv('All_venues.csv', sep='\t', encoding='utf-8')
# print(All_venues.shape)

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Lefortovo District_OPS_20,55.766902,37.716897,Комус,55.766754,37.716923,Paper / Office Supplies Store
1,1,Lefortovo District_OPS_20,55.766902,37.716897,Музей истории Лефортово,55.769557,37.710855,History Museum
2,2,Lefortovo District_OPS_20,55.766902,37.716897,Платформа «Сортировочная»,55.76337,37.720381,Train Station
3,3,Lefortovo District_OPS_20,55.766902,37.716897,Пятерочка,55.76899,37.716225,Supermarket
4,4,Lefortovo District_OPS_20,55.766902,37.716897,Пятёрочка,55.765892,37.710419,Supermarket


In [54]:
print('There are {} unique categories.'.format(len(All_venues['Venue Category'].unique())))

There are 243 unique categories.


Let's analyze each neighborhood

In [33]:
# one hot encoding
All_onehot = pd.get_dummies(All_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
All_onehot['Neighborhood'] = All_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [All_onehot.columns[-1]] + list(All_onehot.columns[:-1])
All_onehot = All_onehot[fixed_columns]

All_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,African Restaurant,American Restaurant,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Waste Facility,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Lefortovo District_OPS_20,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Lefortovo District_OPS_20,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Lefortovo District_OPS_20,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Lefortovo District_OPS_20,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Lefortovo District_OPS_20,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
All_onehot.shape

(2091, 244)

Let's group rows by neighborhood, taking the mean frequency of occurrence of each category

In [35]:
All_grouped = All_onehot.groupby('Neighborhood').mean().reset_index()
All_grouped.head(3)

Unnamed: 0,Neighborhood,Accessories Store,African Restaurant,American Restaurant,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Waste Facility,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Allerton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Baychester,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bedford Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
All_grouped.shape

(108, 244)
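Taking the mean of the one-hot columns within a neighborhood gives, for each category, the share of that neighborhood's venues that fall into it. The sketch below reproduces the same calculation with plain counters on a few made-up rows (the venue list is invented for illustration), which makes it easy to see that each neighborhood's frequencies sum to 1.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, venue category) pairs for illustration
venues = [
    ('Allerton', 'Pizza Place'),
    ('Allerton', 'Pizza Place'),
    ('Allerton', 'Supermarket'),
    ('Allerton', 'Bus Station'),
    ('Belmont', 'Italian Restaurant'),
    ('Belmont', 'Bakery'),
]

# Count venues per category inside each neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Divide by the neighborhood total to get the frequency of each category
frequencies = {
    hood: {cat: n / sum(c.values()) for cat, n in c.items()}
    for hood, c in counts.items()
}
```

Here `frequencies['Allerton']['Pizza Place']` is 0.5 because two of the four sample venues in Allerton are pizza places.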

Let's print each neighborhood along with the top 5 most common venues

In [37]:
num_top_venues = 5

for hood in All_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = All_grouped[All_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Allerton----
              venue  freq
0       Pizza Place  0.15
1       Supermarket  0.07
2    Cosmetics Shop  0.07
3       Bus Station  0.07
4  Department Store  0.07


----Baychester----
            venue  freq
0      Donut Shop  0.10
1     Pizza Place  0.05
2  Cosmetics Shop  0.05
3     Men's Store  0.05
4     Supermarket  0.05


----Bedford Park----
                venue  freq
0  Mexican Restaurant  0.11
1               Diner  0.11
2         Pizza Place  0.08
3       Deli / Bodega  0.08
4  Chinese Restaurant  0.05


----Belmont----
                venue  freq
0  Italian Restaurant  0.18
1         Pizza Place  0.10
2       Deli / Bodega  0.07
3              Bakery  0.05
4        Dessert Shop  0.03


----Bronxdale----
                         venue  freq
0                  Pizza Place  0.08
1  Eastern European Restaurant  0.08
2           Spanish Restaurant  0.08
3               Breakfast Spot  0.08
4           Mexican Restaurant  0.08


----Castle Hill----
...

                  venue  freq
0        Cosmetics Shop  0.10
1     Mobile Phone Shop  0.07
2  Gym / Fitness Center  0.07
3      Sushi Restaurant  0.07
4        Lingerie Store  0.07


----Maryino District_OPS_369----
                       venue  freq
0  Middle Eastern Restaurant  0.17
1                       Pool  0.17
2                        Gym  0.17
3             Clothing Store  0.17
4                Supermarket  0.17


----Maryino District_OPS_451----
                  venue  freq
0  Gym / Fitness Center  0.08
1            Beer Store  0.05
2              Pharmacy  0.05
3  Fast Food Restaurant  0.05
4     Food & Drink Shop  0.05


----Maryino District_OPS_469----
                venue  freq
0         Pizza Place  0.14
1                 Bar  0.09
2  Chinese Restaurant  0.09
3         Yoga Studio  0.05
4         Karaoke Bar  0.05


----Maryino District_OPS_651----
         venue  freq
0  Supermarket  0.12
1         Pool  0.06
2         Road  0.06
3   Sports Bar  0.06
4  Pizza Place  ...

...
4  Italian Restaurant  0.04


----Yuzhnoportovy District_OPS_432----
                    venue  freq
0       Convenience Store  0.11
1                    Café  0.11
2  Furniture / Home Store  0.11
3                 Brewery  0.11
4              Beer Store  0.11


----Yuzhnoportovy District_OPS_88----
                       venue  freq
0       Gym / Fitness Center  0.07
1  Middle Eastern Restaurant  0.07
2                Pizza Place  0.07
3          Electronics Store  0.07
4                       Café  0.07




Let's put that into a pandas dataframe

In [38]:
# function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [39]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        # ranks 1-3 take the suffixes 'st', 'nd', 'rd'
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later rank takes the suffix 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = All_grouped['Neighborhood']

for ind in np.arange(All_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(All_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Allerton,Pizza Place,Cosmetics Shop,Deli / Bodega,Bus Station,Supermarket
1,Baychester,Donut Shop,Spanish Restaurant,Gym / Fitness Center,Bus Station,Mattress Store
2,Bedford Park,Mexican Restaurant,Diner,Deli / Bodega,Pizza Place,Spanish Restaurant
3,Belmont,Italian Restaurant,Pizza Place,Deli / Bodega,Bakery,Bank
4,Bronxdale,Pizza Place,Bank,Performing Arts Venue,Paper / Office Supplies Store,Chinese Restaurant
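
Many categories share the same frequency, and the default sort in `sort_values` does not guarantee the order of ties, which may be one reason the table above and the printed top-5 lists do not always agree. The sketch below shows one possible way to make the tie order predictable with `Series.nlargest`, which keeps the first occurrence on ties. The table `grouped` and the helper `top_venues` are hypothetical names used only for this example.

```python
import pandas as pd

# Toy frequency table (hypothetical values) with ties inside each row
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Bank": [0.25, 0.0],
    "Park": [0.25, 1.0],
    "Pizza Place": [0.50, 0.0],
})

def top_venues(freqs: pd.Series, n: int) -> list:
    """Return the n most frequent categories of one neighborhood."""
    # nlargest keeps the first occurrence on ties (keep="first" is the default)
    return freqs.nlargest(n).index.tolist()

# Move the name into the index so that every remaining column is numeric
freq_table = grouped.set_index("Neighborhood")
top = {hood: top_venues(row, 2) for hood, row in freq_table.iterrows()}
print(top)
```

With this approach, tied categories are listed in column order, so the result does not depend on the sorting algorithm.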


### Clustering time

In [40]:
# set number of clusters (chosen by hand: neither very large nor very small)
kclusters = 8

# keep only the numeric frequency columns for clustering
All_grouped_clustering = All_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(All_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 1, 1, 1, 4, 4, 1, 4, 4, 1, 1, 5, 1, 1, 0, 1, 7, 1, 1, 1, 0,
       3, 1, 1, 2, 0, 2, 2, 2, 2, 2, 0, 4, 2, 4, 2, 4, 1, 0, 0, 2, 4, 0,
       4, 0, 2, 0, 2, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 2, 2, 2, 0, 0, 1, 1, 1, 1, 4, 2, 2, 2, 1, 1, 4, 2, 2, 0, 2, 2,
       1, 1, 1, 1, 2, 0, 0, 0, 0, 2, 4, 0, 0, 4, 4, 1, 6, 1, 2, 2])
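
The number of clusters above was picked by hand. One common alternative, which this notebook does not use, is to compare several values of k by their silhouette score. The sketch below assumes synthetic two-dimensional data in place of `All_grouped_clustering`; the array `X` and the range of k are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the frequency table: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real venue frequencies the scores are usually much less clear-cut than on this toy data, so the result should be treated as a hint rather than a rule.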

Let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.

In [41]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

All_merged = df_all

# join df_all with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood's coordinates
All_merged = All_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [42]:
All_merged.head(5)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,City,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,South-Eastern Administrative Okrug,Lefortovo District_OPS_20,55.766902,37.716897,Moscow,0,Supermarket,History Museum,Train Station,Paper / Office Supplies Store,Mediterranean Restaurant
1,South-Eastern Administrative Okrug,Lefortovo District_OPS_24,55.750886,37.717184,Moscow,2,Gym / Fitness Center,Caucasian Restaurant,Supermarket,Park,Convenience Store
2,South-Eastern Administrative Okrug,Lefortovo District_OPS_33,55.758889,37.687112,Moscow,2,Gym / Fitness Center,Sandwich Place,Supermarket,Auto Workshop,Flea Market
3,South-Eastern Administrative Okrug,Nizhegorodsky District_OPS_52,55.730655,37.721188,Moscow,0,River,Light Rail Station,Bus Line,Hotel,Convenience Store
4,South-Eastern Administrative Okrug,Yuzhnoportovy District_OPS_88,55.716392,37.676188,Moscow,2,Electronics Store,Pizza Place,Vietnamese Restaurant,Gym / Fitness Center,Middle Eastern Restaurant


Finally, let's visualize the resulting clusters

In [43]:
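# take explicit copies so that later column assignments do not act on a slice of All_merged
Moscow_all = All_merged.loc[All_merged['City'] == 'Moscow'].copy()
NY_all = All_merged.loc[All_merged['City'] == 'New York'].copy()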

In [44]:
# create map
map_clusters_Moscow = folium.Map(location=[Moscow_latitude, Moscow_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
Moscow_all['Cluster Labels'] = Moscow_all['Cluster Labels'].astype(int)
markers_colors = []
for lat, lon, poi, cluster in zip(Moscow_all['Latitude'], Moscow_all['Longitude'], Moscow_all['Neighborhood'], Moscow_all['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_Moscow)
       
map_clusters_Moscow



In [45]:
# create map
map_clusters_NYC = folium.Map(location=[NYC_latitude, NYC_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
NY_all['Cluster Labels'] = NY_all['Cluster Labels'].astype(int)
markers_colors = []
for lat, lon, poi, cluster in zip(NY_all['Latitude'], NY_all['Longitude'], NY_all['Neighborhood'], NY_all['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_NYC)
       
map_clusters_NYC



Comparing the two maps by color suggests that these two areas have very little in common.

Let's inspect frequency of each cluster in both cities:

In [46]:
NY_all['Cluster Labels'].value_counts()

1    39
4     8
0     2
7     1
6     1
5     1
Name: Cluster Labels, dtype: int64

In [47]:
Moscow_all['Cluster Labels'].value_counts()

2    26
0    22
4     6
3     1
1     1
Name: Cluster Labels, dtype: int64

The largest clusters in NY are 1 and 4, which together contain 39 + 8 = 47 areas (47/52 = 90.4% of all NY areas). The largest clusters in Moscow are 2 and 0, which together contain 26 + 22 = 48 areas (48/56 = 85.7% of all Moscow areas), a very similar number to NY.
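
The shares can be checked with a few lines of plain Python. The cluster sizes are copied from the two `value_counts()` outputs above; the helper `top_share` is a hypothetical name introduced only for this check.

```python
# Cluster sizes copied from the two value_counts() outputs above
ny_counts = {1: 39, 4: 8, 0: 2, 7: 1, 6: 1, 5: 1}
moscow_counts = {2: 26, 0: 22, 4: 6, 3: 1, 1: 1}

def top_share(counts, n=2):
    """Return the n largest clusters, the areas they cover and their share."""
    largest = sorted(counts, key=counts.get, reverse=True)[:n]
    covered = sum(counts[c] for c in largest)
    return largest, covered, covered / sum(counts.values())

for city, counts in [("NY", ny_counts), ("Moscow", moscow_counts)]:
    largest, covered, share = top_share(counts)
    print(f"{city}: clusters {largest} cover {covered} areas ({share:.1%})")
# NY: clusters [1, 4] cover 47 areas (90.4%)
# Moscow: clusters [2, 0] cover 48 areas (85.7%)
```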

Let's inspect the most common venues in each of these four clusters.

In [48]:
# keep only the top-venue columns for areas in the two largest Moscow clusters (2 and 0)
top_venue_columns = ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue',
                     '4th Most Common Venue', '5th Most Common Venue']
Moscow_20 = Moscow_all.loc[Moscow_all['Cluster Labels'].isin([2, 0]), top_venue_columns]
Moscow_20.head(5)

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Supermarket,History Museum,Train Station,Paper / Office Supplies Store,Mediterranean Restaurant
1,Gym / Fitness Center,Caucasian Restaurant,Supermarket,Park,Convenience Store
2,Gym / Fitness Center,Sandwich Place,Supermarket,Auto Workshop,Flea Market
3,River,Light Rail Station,Bus Line,Hotel,Convenience Store
4,Electronics Store,Pizza Place,Vietnamese Restaurant,Gym / Fitness Center,Middle Eastern Restaurant


In [49]:
df20 = Moscow_20.melt(var_name='columns', value_name='ind')
df20_counts = df20['ind'].value_counts().rename_axis('Moscow Place').reset_index(name='Moscow counts')
df20_counts

Unnamed: 0,Moscow Place,Moscow counts
0,Supermarket,29
1,Pizza Place,11
2,Gym / Fitness Center,11
3,Food & Drink Shop,9
4,Cosmetics Shop,7
...,...,...
85,Flea Market,1
86,History Museum,1
87,Korean Restaurant,1
88,Beer Store,1
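
The `melt` and `value_counts` combination used above can be easier to follow on a small table. This is a minimal sketch with hypothetical values and only two ranked columns; the names `top`, `long` and `counts` are illustrative.

```python
import pandas as pd

# Hypothetical top-venue table with two ranked columns (illustrative values)
top = pd.DataFrame({
    "1st Most Common Venue": ["Supermarket", "Gym", "Supermarket"],
    "2nd Most Common Venue": ["Park", "Supermarket", "Gym"],
})

# melt stacks all ranked columns into one long column of category names
long = top.melt(var_name="rank", value_name="category")

# value_counts then tallies how often each category appears at any rank
counts = (long["category"].value_counts()
          .rename_axis("Place").reset_index(name="counts"))
print(counts)
```

Note that this tally ignores the rank itself: a category counts the same whether it is the 1st or the 5th most common venue of an area.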


In [50]:
# keep only the top-venue columns for areas in the two largest NY clusters (1 and 4)
top_venue_columns = ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue',
                     '4th Most Common Venue', '5th Most Common Venue']
NY_14 = NY_all.loc[NY_all['Cluster Labels'].isin([1, 4]), top_venue_columns]
NY_14.head(5)

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
56,Pharmacy,Laundromat,Ice Cream Shop,Dessert Shop,Donut Shop
57,Bus Station,Restaurant,Pharmacy,Mattress Store,Fast Food Restaurant
58,Caribbean Restaurant,Diner,Deli / Bodega,Bowling Alley,Platform
60,Park,Bus Station,Home Service,Gym,Playground
61,Pizza Place,Bar,Mexican Restaurant,Supermarket,Sandwich Place


In [51]:
df14 = NY_14.melt(var_name='columns', value_name='ind')
df14_counts = df14['ind'].value_counts().rename_axis('NY Place').reset_index(name='NY counts')
df14_counts

Unnamed: 0,NY Place,NY counts
0,Pizza Place,28
1,Deli / Bodega,14
2,Pharmacy,13
3,Donut Shop,12
4,Bus Station,11
...,...,...
69,Sporting Goods Shop,1
70,Platform,1
71,Bookstore,1
72,Furniture / Home Store,1


And now let's combine both lists in one data frame

In [52]:
# place the two ranked lists side by side, aligned by rank position
df_compare = pd.DataFrame()
df_compare['Moscow Place'] = df20_counts['Moscow Place']
df_compare['Moscow counts'] = df20_counts['Moscow counts']
df_compare['NY Place'] = df14_counts['NY Place']
df_compare['NY counts'] = df14_counts['NY counts']
df_compare.head(15)

Unnamed: 0,Moscow Place,Moscow counts,NY Place,NY counts
0,Supermarket,29,Pizza Place,28.0
1,Pizza Place,11,Deli / Bodega,14.0
2,Gym / Fitness Center,11,Pharmacy,13.0
3,Food & Drink Shop,9,Donut Shop,12.0
4,Cosmetics Shop,7,Bus Station,11.0
5,Fast Food Restaurant,7,Grocery Store,11.0
6,Park,7,Bank,10.0
7,Bus Stop,7,Spanish Restaurant,7.0
8,Pharmacy,6,Italian Restaurant,6.0
9,Convenience Store,6,Sandwich Place,6.0
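
The `NY counts` column is shown as floats (28.0, 14.0, ...) because the NY list is shorter than the Moscow list, so the missing rows are filled with NaN. The sketch below reproduces this with `pd.concat`, using a few rows copied from the tables above and shortened for illustration; it is an alternative to the column-by-column assignment, not the code used in this notebook.

```python
import pandas as pd

# A few rows copied from the tables above, shortened for illustration
moscow = pd.DataFrame({"Moscow Place": ["Supermarket", "Pizza Place", "Park"],
                       "Moscow counts": [29, 11, 7]})
ny = pd.DataFrame({"NY Place": ["Pizza Place", "Pharmacy"],
                   "NY counts": [28, 13]})

# axis=1 aligns the two frames on their row index (the rank position)
compare = pd.concat([moscow, ny], axis=1)
print(compare)

# The shorter list is padded with NaN, which turns its counts into floats
print(compare.dtypes)
```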


## Results and discussion. <a name="results"></a>

The results show that the two areas are very different: only about 30% of the venue categories in the top 15 of each area also appear in the top 15 of the other. This difference holds even though the comparison covers almost the same number of small areas inside each big area (47 in New York and 48 in Moscow).
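
The overlap between the two lists can be computed with set operations. The sketch below uses only the ten rows that are visible in the comparison table above, so it does not reproduce the top-15 figure. It also matches names exactly, which means that related categories such as "Bus Stop" and "Bus Station" are treated as different.

```python
# Top-10 categories as printed in the comparison table above
moscow_top = ["Supermarket", "Pizza Place", "Gym / Fitness Center",
              "Food & Drink Shop", "Cosmetics Shop", "Fast Food Restaurant",
              "Park", "Bus Stop", "Pharmacy", "Convenience Store"]
ny_top = ["Pizza Place", "Deli / Bodega", "Pharmacy", "Donut Shop",
          "Bus Station", "Grocery Store", "Bank", "Spanish Restaurant",
          "Italian Restaurant", "Sandwich Place"]

# Categories whose names appear in both lists
shared = sorted(set(moscow_top) & set(ny_top))

# Share of one list that is also present in the other
share = len(shared) / len(moscow_top)
print(shared, f"{share:.0%}")
# ['Pharmacy', 'Pizza Place'] 20%
```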

## Conclusion. <a name="conclusion"></a>

As the analysis above shows, two areas with similar land area, population and density, taken from two of the biggest cities on opposite sides of the globe, are very different. Looking at the top 15 most common venue categories in the clusters that cover more than 85% of each area, only pizza places, pharmacies, bus stops, grocery stores and parks appear in both areas, and in different proportions. This overlap means you will still be able to eat pizza, buy medicine and go to a park if you decide to relocate. However, fitness centers and cosmetics shops, which are common in the Moscow area, are not among the most common venues in the NYC area, and donut shops and Spanish restaurants, which are common in the NYC area, are not among the most common venues in the Moscow area. In conclusion, moving from the South-Eastern part of Moscow to The Bronx in New York, or vice versa, will not be easy: familiar places will not be at a familiar distance.