# Venues Data Analysis of Moscow City
# Capstone Project - The Battle of the Neighborhoods
## Applied Data Science Capstone by IBM/Coursera

# Introduction <a name="Introduction"></a>

## Background <a name="Background"></a>

Moscow, one of the largest metropolises in the world with a population of more than 12 million people, covers an area of ​​more than 2561.5 km² with an average density of inheritance of 4924.96 people / km² [1](https://ru.wikipedia.org/wiki/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0).

Moscow is divided into 12 districts (125 boroughs, 2 urban boroughs, 19 settlement boroughs).

Moscow has a very uneven population density from 30429 people / km² for the "Зябликово" borough, to 560 people / km² for the "Молжаниновский" borough [2](https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D0%B9%D0%BE%D0%BD%D1%8B_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B).

The average cost of real estate varies from 68,768 rubles / m² for the "Кленовское" borough to 438,568 rubles / m² for the "Арбат" borough [3](https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir).

## Business Problem <a name="Business Problem"></a>

Owners of cafes, fitness centers and other social facilities are expected to prefer boroughs with a high population density. Investors will prefer areas with low housing costs and low competitiveness.

On the part of residents, the preference is expected for a boroughs with a low cost of housing and good accessibility of social places.

In my research, I will try to determine the optimal places for the location of fitness centers in Moscow boroughs, taking into account the number of people, the cost of real estate and the density of other fitness facilities.

The key criteria for selecting suitable locations for fitness centers will be:
- High density of the borough population
- Low cost of real estate in the area
- The absence in the immediate vicinity of other fitness facilities of a similar profile

I will use the approaches and methods of machine learning to determine the location of fitness centers in accordance with the specified criteria.

The main stakeholders of my research will be investors interested in opening new fitness centers.

# Data acquisition and cleaning <a name="data"></a>

## Data requirements

Based on the problem and the established selection criteria, to conduct the research, I will need the following information:  

1. main dataset with the list of Moscow Borough, containing the following attributes:
    - name of the each Moscow Borough
    - type of the each Moscow Borough
    - name of the each Moscow District in which Borough is belong to
    - area of the each Moscow Borough in square kilometers
    - the population of the each Moscow Borough
    - housing area of the each Moscow Borough in square meters
    - average housing price of the each Moscow Borough
2. geographical coordinates of the each Moscow Borough
3. shape of the each Moscow Borough in GEOJSON format
4. list of venues placed in the each Moscow Borough with their geographical coordinates and categories

## Decribe data sources 

### Moscow Boroughs dataset

Data for Moscow Boroughs dataset were downloaded from multiple HTTP page combined into one pandas dataframe.
- List of Moscow District and they Boroughs were downloaded from the page [Moscow Boroughs](https://gis-lab.info/qa/moscow-atd.html)
- Information about area of the each Moscow Borough in square kilometers, their population and housing area in square meters were downloaded from the page [Moscow Boroughs Population Density](https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%BE%D0%B2_%D0%B8_%D0%BF%D0%BE%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B)
- Information about housing price of the each Moscow Borough were downloaded from the page [Moscow Boroughs Housing Price](https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir)

A special Python function has been developed for HTML table parse. This function help me:
- to find number of rows and columns in a HTML table
- to get cloumns titles, if posible
- to convert string to float, if posible
- return result in form of the Pandas dataframe

### Moscow Boroughs geographical coordinates

Geographical coordinates of the each Moscow Borough were queried through Nominatim service.   
As the Nominatim service are quite unstable it was quite a challenge to request coordinate in several iterations.

### Moscow Boroughs shape in GEOJSON format

Shape of the each Moscow Borough in GEOJSON format was downloaded from the page [Moscow Boroughs GEOJSON](http://gis-lab.info/data/mos-adm/mo.geojson)

### Moscow Boroughs venues

To determine **venues** the service **Forsquare API** was used.  
The API of **Forsquare** service have the restriction of 100 **venues**, which it can return in one request.  
To obtain list of all **venues** I used the following approach:  
  - present Moscow area in the form of a regular grid of circles of quite small diameter, no more than 100 **venues** in each circle  
  - perform exploration using **Forsquare API** with quite bigger radius than circle of a grid to make sure it overlaps/full coverage to don't miss any venues
  - cleaning list of venues from duplicates.  

This approach and some of the Python code was taken from the work presented here. https://cocl.us/coursera_capstone_notebook

Circle of 28 000 meter in radius cover all Moscow Boroughs.  
In my research grid of circles contains 7899 cells with radius 300 meter.  
Foursquare API have a certain limitation for API call in one day to explore venues.  
In my case it was about 2000 calls per day.  
So in addition I have to divide grid dataset into subset and call Foursquare API for several days. 

## Decribe data cleansing 

### Moscow Boroughs dataset cleansing

As data for Moscow Boroughs dataset were downloaded from multiple HTTP page it was necessary to perform a data cleaning. Such as:  
- remove some unused colums 
- strip text columns from additional information like ' \n\t'
- replace some Borough_Name as of russian letters "е" and "ё" 
- change places of some words in Borough_Name
- clear Borough Name from additional information, such as ', поселение ', ', городской округ '
- replace '\n', ' ↗' and '↘' in some columns
- delete extra spaces in numeric columns
- replace ',' to '.' for float columns
- convert from float to int for integer columns
- convert from string to float for numeric columns

As the result I had a dataset with all 146 Moscow Boroughs. Result dataset contains columns:
- **Borough_Name** - name of the Moscow Borough - is a unique key of the dataset
- **District_Name** - name of the Moscow District in which Borough is belong to
- **Borough_Type** - type of the Moscow Borough
- **OKATO_Borough_Code** - numeric code of the Moscow Borough
- **OKTMO_District_Code** - numeric code of the Moscow District
- **Borough_Area** - area of the Moscow Borough in square kilometers
- **Borough_Population** - population of the Moscow Borough
- **Borough_Population_Density** - population density of the Moscow Borough
- **Borough_Housing_Area** - housing area of the Moscow Borough in square meters
- **Borough_Housing_Area_Per_Person** - housing area per person of the Moscow Borough in square meters
- **Latitude** - geograprical Latitude of the Moscow Borough
- **Longitude** - geograprical Longitude of the Moscow Borough
- **Borough_Housing_Price** - average housing price of the Moscow Borough

I had a problem to found proper statistics about “housing prices” and “housing area” for some Moscow boroughs, so I had to exclude 26 boroughs from my analysis.   
Fortunately, they all had a low population density, which meat criteria of my research and did not reduce it quality.

#### The result Moscow Boroughs dataset

![Moscow Boroughs dataset](https://raw.githubusercontent.com/romapres2010/Coursera_Capstone/master/img/Moscow_borough_df.png)

### Moscow Boroughs geographical coordinates cleansing

Nominatim service not only quite unstable.  
It also have a occasionally problem with russian leter **ё**. So I have to manyaly obtain coordinates for such boroughs as:
 - Дес**ё**новское, Поселение, Новомосковский  
 - Сав**ё**лки, Муниципальный округ, ЗелАО
 - Кл**ё**новское, Поселение, Троицкий  
 - And some others.

Another problem with Nominatim service is that it return not very accurate coordinate of some Boroughs.  
So I needed to adjust they manually in the map.

### Moscow Boroughs shape in GEOJSON format cleansing

GEOJSON file downloaded from the page [Moscow Boroughs GEOJSON](http://gis-lab.info/data/mos-adm/mo.geojson) was quite good and not requied any addition clearing.

### Moscow Boroughs venues cleansing

Usning **Forsquare API** I obtrained 34460 venues in 7899 cells.  
As I used a quite bigger radius (350 meters) for venue explorations than circle of a grid (300 meters), there was a need to remove duplicates venus.  
After duplicates removal I had 27622 unique venues in the circle radius of 28 000 meters around the Moscow City.  

The second task was to bind each venue to Moscow Boroughs in which borders they were placed.  
To perform this task I created a polygons for each Moscow Borough from GEOJSON file and found wich venues coordinate included into each polygon.  

The third task was to remove all the venues that placed outside of the Moscow boroughs.  

The fourth tas was to get main category from the category list for each venue.  

As the result I had list of 20864 venues placed in the Moscow Boroughs with their geographical coordinates and categories

## Performing data gathering and cleansing

#### Import requied libraries

In [None]:
# Install if it needed in your environment
#!conda install -c conda-forge shapely --yes 
#!conda install -c conda-forge pyproj --yes 
#!conda install -c conda-forge Beautifulsoup4 --yes 
#!conda install -c conda-forge lxml --yes 
#!conda install -c conda-forge html5lib --yes 
#!conda install -c conda-forge requests --yes 
#!conda install -c conda-forge geopy --yes
#!conda install -c conda-forge geocoder --yes

In [122]:
# Import requied libraries
import requests
import pandas as pd
import json
import geopy
import folium
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import shapely.geometry
import pyproj
import math
from shapely.geometry import shape, Point

#### Define function for HTML table parse. This function help us
- find number of rows and columns in a HTML table 
- get cloumns titles, if posible
- convert string to float, if posible
- return result Pandas dataframe

In [2]:
def parse_html_table(table):
    n_columns = 0
    n_rows=0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):
        
        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows+=1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)
                
        # Handle column names if we find them
        th_tags = row.find_all('th') 
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0,n_columns)
    df = pd.DataFrame(columns = columns,
                        index= range(0,n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker,column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1
            
    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass
    
    return df

### Request and clear Moscow Boroughs list 

#### Load and parse Moscow Boroughs dataset from HTML page and examine the raw dataframe

In [8]:
# Load page with Moscow Boroughs
url = "https://gis-lab.info/qa/moscow-atd.html"
try:
    print('Try to request url "{}"'.format(url))
    response = requests.get(url)
    print('Url "{}" requested. '.format(url))
    
    # Take second HTML table with districts from the page and parse it into dataframe
    print('Try to extract dataset from HTML table'.format(url))
    soup = BeautifulSoup(response.text, 'lxml')
    tables = soup.findAll('table', { 'class' : 'wikitable sortable' }, limit=2) 
    Moscow_df = parse_html_table(tables[1])  
    print('Success extract dataset from HTML table'.format(url))

    # Define columns for dataframe
    Moscow_df.columns=["Borough_index", "Borough_Name", "District_Name", "Borough_Type", "OKATO_Borough_Code", "OKTMO_District_Code"]

    # Save dataframe for future use
    Moscow_df.to_csv("data\Moscow_df_raw.csv", index = False)

except Exception as err: 
    print('Request Url "{}" failed. Load previously saved dataframe'.format(url))
    print('Error is: {}'.format(err))
    # Load previously saved dataframe
    Moscow_df = pd.read_csv("data\Moscow_df_raw.csv")

# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_df.head())
print(Moscow_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_df.dtypes)

Try to request url "https://gis-lab.info/qa/moscow-atd.html"
Url "https://gis-lab.info/qa/moscow-atd.html" requested. 
Try to extract dataset from HTML table
Request Url "https://gis-lab.info/qa/moscow-atd.html" failed. Load previously saved dataframe
Error is: list index out of range
Take a look at the dataframe
   Borough_index     Borough_Name District_Name           Borough_Type  \
0            1.0  Академический\n        ЮЗАО\n  Муниципальный округ\n   
1            2.0   Алексеевский\n        СВАО\n  Муниципальный округ\n   
2            3.0   Алтуфьевский\n        СВАО\n  Муниципальный округ\n   
3            4.0          Арбат\n         ЦАО\n  Муниципальный округ\n   
4            5.0       Аэропорт\n         САО\n  Муниципальный округ\n   

   OKATO_Borough_Code  OKTMO_District_Code  
0          45293554.0           45397000.0  
1          45280552.0           45349000.0  
2          45280554.0           45350000.0  
3          45286552.0           45374000.0  
4          4527

As we can see Moscow Boroughs dataset is not very good. We have to
- remove "\n" in the end of the text data
- convert float to int for code columns

#### Clear Moscow Boroughs dataset

In [9]:
print('Clear the dataframe')

# Drop Borough_index colums 
Moscow_df.drop("Borough_index", axis=1, inplace=True)

# Remove "\n" in the end of the text data
Moscow_df.replace('\n', '', regex=True, inplace=True)

# convert float to int for code columns
Moscow_df["OKATO_Borough_Code"] = Moscow_df["OKATO_Borough_Code"].astype(int)
Moscow_df["OKTMO_District_Code"] = Moscow_df["OKTMO_District_Code"].astype(int)

# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_df.head())
print(Moscow_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_df.dtypes)

# Save dataframe for future use
Moscow_df.to_csv("data\Moscow_df.csv", index = False)

Clear the dataframe
Take a look at the dataframe
    Borough_Name District_Name         Borough_Type  OKATO_Borough_Code  \
0  Академический          ЮЗАО  Муниципальный округ            45293554   
1   Алексеевский          СВАО  Муниципальный округ            45280552   
2   Алтуфьевский          СВАО  Муниципальный округ            45280554   
3          Арбат           ЦАО  Муниципальный округ            45286552   
4       Аэропорт           САО  Муниципальный округ            45277553   

   OKTMO_District_Code  
0             45397000  
1             45349000  
2             45350000  
3             45374000  
4             45333000  
(146, 5)
Take a look at the dataframe data types
Borough_Name           object
District_Name          object
Borough_Type           object
OKATO_Borough_Code      int32
OKTMO_District_Code     int32
dtype: object


Now Moscow Boroughs dataset looks quite well

### Request coordinate of Moscow Boroughs

In [27]:
# instantiate the dataframe
Moscow_coord_df = pd.DataFrame(columns=['Borough_Name', 'Latitude', 'Longitude'] )

# create class instance of Nominatim
geolocator = Nominatim(user_agent="foursquare_agent", timeout=2)

# The Nominatim is not stable, so catch exception and in error load previously saved dataframe
try:
    # loop frough all Boroughs
    for Borough_Name, Borough_Type, District_Name in zip(Moscow_df['Borough_Name'], Moscow_df['Borough_Type'], Moscow_df['District_Name']):
        address = '{}, {}, {}, Москва, Россия'.format(Borough_Name, Borough_Type, District_Name)
        print(address, end='')

        location = None

        # make up to 10 attempts
        for x in range(0, 9):
            print('.', end='')
            try:
                location = geolocator.geocode(address)
                if location is not None:
                    print(' - coordinate are {}, {}'.format(location.latitude, location.longitude))
                    latitude = location.latitude
                    longitude = location.longitude
                    Moscow_coord_df = Moscow_coord_df.append({'Borough_Name': Borough_Name, 'Latitude': latitude, 'Longitude': longitude}, ignore_index=True) 
                    break
            except Exception as err:
                print('')
                print(type(err))
                print(err) 
                raise

    # If faild get coordinate then load previously saved dataframe
    if location is None:
        raise geopy.exc.GeocoderTimedOut

    # Save copy of the dataframe as service Nominatim not stable
    Moscow_coord_df.to_csv("data\Moscow_coord_df.csv", index = False)

except Exception as err:
    print('')
    print(err)
    print('Request Nominatim failed. Load previously saved dataframe')
    Moscow_coord_df = pd.read_csv("data\Moscow_coord_df.csv")

# Take a look at the dataframe
print('Take a look at the dataframe'.format(url))
print(Moscow_coord_df.head())
print(Moscow_coord_df.shape)

print('Take a look at the dataframe data types'.format(url))
print(Moscow_coord_df.dtypes)

Академический, Муниципальный округ, ЮЗАО, Москва, Россия. - coordinate are 55.6897377, 37.5767712
Алексеевский, Муниципальный округ, СВАО, Москва, Россия. - coordinate are 55.8148783, 37.6506684
Алтуфьевский, Муниципальный округ, СВАО, Москва, Россия. - coordinate are 55.880255, 37.5816349
Арбат, Муниципальный округ, ЦАО, Москва, Россия. - coordinate are 55.74620815, 37.58945652138118
Аэропорт, Муниципальный округ, САО, Москва, Россия. - coordinate are 55.8004021, 37.5331563
Бабушкинский, Муниципальный округ, СВАО, Москва, Россия. - coordinate are 55.8659576, 37.6638944
Басманный, Муниципальный округ, ЦАО, Москва, Россия. - coordinate are 55.779396, 37.6878576
Беговой, Муниципальный округ, САО, Москва, Россия. - coordinate are 55.7819165, 37.5662996
Бескудниковский, Муниципальный округ, САО, Москва, Россия. - coordinate are 55.8574669, 37.5616979
Бибирево, Муниципальный округ, СВАО, Москва, Россия. - coordinate are 55.8838943, 37.6035774
Бирюлёво Восточное, Муниципальный округ, ЮАО, Мо

Moscow Boroughs Coordinate dataset looks quite well

### Dowload GEOJSON for Moscow Boroughs

In [6]:
# download geojson file
url = 'http://gis-lab.info/data/mos-adm/mo.geojson'
try:
    print('Try to request url "{}"'.format(url))
    download_file = requests.get(url)
    print('Url "{}" requested. '.format(url))

    mo_geojson_utf8 = 'data\mo.geojson.utf8'
    open(mo_geojson_utf8, 'wb').write(download_file.content)    
    print('GeoJSON file downloaded!')

    mo_geojson = 'data\mo.geojson'

    # Encode file from utf8 to cp1251 as my computer use Russian locale
    f = open(mo_geojson, "wb")
    for line in open(mo_geojson_utf8, "rb"):
        f.write(line.decode('u8').encode('cp1251', 'ignore'))
    f = open(mo_geojson, "wb")
    for line in open(mo_geojson_utf8, "rb"):
        f.write(line.decode('u8').encode('cp1251', 'ignore'))

    # validate geojson file
    with open(mo_geojson) as json_file:
        data = json_file.read()
        try:
            data = json.loads(data)
        except ValueError as e:
            print('invalid json: %s' % e)

except: 
    print('Request Url "{}" failed'.format(url))
    mo_geojson = 'data\mo.geojson'
    print('GeoJSON file downloaded!')

Try to request url "http://gis-lab.info/data/mos-adm/mo.geojson"
Url "http://gis-lab.info/data/mos-adm/mo.geojson" requested. 
GeoJSON file downloaded!


### Request and clear Moscow Boroughs Housing Price

#### Load and parse Moscow Boroughs Housing Price dataset from HTML page and examine the raw dataframe

In [7]:
# Load page with Moscow Boroughs Housing Price
url = "https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir"
try:
    print('Try to request url "{}"'.format(url))
    response = requests.get(url)
    print('Url "{}" requested. '.format(url))

    # Take first HTML table with districts from the page and parse it into dataframe
    print('Try to extract dataset from HTML table'.format(url))
    soup = BeautifulSoup(response.text, 'lxml')
    tables = soup.findAll('table', limit=1) 
    Moscow_housing_price_df = parse_html_table(tables[0])  
    print('Success extract dataset from HTML table'.format(url))

    # Save dataframe for future use
    Moscow_housing_price_df.to_csv("data\Moscow_housing_price_df_raw.csv", index = False)

except Exception as err: 
    print('Request Url "{}" failed. Load previously saved dataframe'.format(url))
    print('Error is: {}'.format(err))
    # Load previously saved dataframe
    Moscow_housing_price_df = pd.read_csv("data\Moscow_housing_price_df_raw.csv")

# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_housing_price_df.head())
print(Moscow_housing_price_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_housing_price_df.dtypes)

Try to request url "https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir"
Url "https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir" requested. 
Try to extract dataset from HTML table
Success extract dataset from HTML table
Take a look at the dataframe
       0                  1                       2                     3  \
0  \n№\n       \nРайон\n \n  \nЦена, \nруб./кв. м\n  \nПрирост \nза год\n   
1  \n1\n          \nАрбат\n              \n438568\n            \n−0,20%\n   
2  \n2\n      \nХамовники\n              \n425741\n             \n4,50%\n   
3  \n3\n       \nЯкиманка\n              \n404471\n             \n1,30%\n   
4  \n4\n  \nЗамоскворечье\n              \n398544\n             \n3,80%\n   

                             4                     5  
0  \nЦена \nквартиры, \nруб.\n  \nПрирост \nза год\n  
1                 \n33702123\n             \n0,10%\n  
2                 \n2719630

As we can see Moscow Boroughs Housing Price dataset is not very good. We have to
- remove some unused colums 
- set columns for dataframe
- strip Borough Name from additional information like ' \n\t'
- replace '\n' in text columns
- convert from string to numeric
- replace some Borough_Name as of russian letters "е" and "ё" and change places of some words 

#### Clear Moscow Boroughs Housing Price dataset

In [10]:
print('Clear the dataframe')

# Drop some unused colums 
Moscow_housing_price_df.drop([Moscow_housing_price_df.columns[0], Moscow_housing_price_df.columns[3], Moscow_housing_price_df.columns[4], Moscow_housing_price_df.columns[5]], axis=1, inplace=True)
Moscow_housing_price_df.drop(0, axis=0, inplace=True)

# Set columns for dataframe
Moscow_housing_price_df.columns=["Borough_Name", "Borough_Housing_Price"]

# Clear Borough Name from additional information
Moscow_housing_price_df["Borough_Name"] = Moscow_housing_price_df["Borough_Name"].str.strip(' \n\t')

# Replace '\n' in some columns
Moscow_housing_price_df.replace('\n', '', regex=True, inplace=True)

# Convert from string to numeric
Moscow_housing_price_df["Borough_Housing_Price"] = Moscow_housing_price_df["Borough_Housing_Price"].astype(int)

# replace some Borough_Name as of russian letters "е" and "ё" and change places of some words 
Moscow_housing_price_df["Borough_Name"].replace('Бирюлево Восточное', 'Бирюлёво Восточное', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Бирюлево-Западное', 'Бирюлёво Западное', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Дегунино Восточное', 'Восточное Дегунино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Измайлово Восточное', 'Восточное Измайлово', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Дегунино Западное', 'Западное Дегунино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Савеловский', 'Савёловский', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Измайлово Северное', 'Северное Измайлово', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Медведково Северное', 'Северное Медведково', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Тушино Северное', 'Северное Тушино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Теплый Стан', 'Тёплый Стан', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Тропарево-Никулино', 'Тропарёво-Никулино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Филевский Парк', 'Филёвский Парк', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Хорошево-Мневники', 'Хорошёво-Мнёвники', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Хорошевский', 'Хорошёвский', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Черемушки', 'Черёмушки', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Медведково Южное', 'Южное Медведково', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Тушино Южное', 'Южное Тушино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Мосрентген', '"Мосрентген"', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Бутово Северное', 'Северное Бутово', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Бутово Южное', 'Южное Бутово', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Десеновское', 'Десёновское', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Кленовское', 'Клёновское', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Новофедоровское', 'Новофёдоровское', regex=True, inplace=True)


# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_housing_price_df.head())
print(Moscow_housing_price_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_housing_price_df.dtypes)

# Save copy of the dataframe
Moscow_housing_price_df.to_csv("data\Moscow_housing_price_df.csv", index = False)

Clear the dataframe
Take a look at the dataframe
    Borough_Name  Borough_Housing_Price
1          Арбат                 438568
2      Хамовники                 425741
3       Якиманка                 404471
4  Замоскворечье                 398544
5       Тверской                 386255
(143, 2)
Take a look at the dataframe data types
Borough_Name             object
Borough_Housing_Price     int32
dtype: object


Now Moscow Boroughs Housing Price dataset looks quite well

### Request and clear Moscow Boroughs Population Density

#### Load and parse Moscow Boroughs Population Density dataset from HTML page and examine the raw dataframe

In [11]:
# Load page with Moscow Boroughs Population Density
url = "https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%BE%D0%B2_%D0%B8_%D0%BF%D0%BE%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B"

try:
    print('Try to request url "{}"'.format(url))
    response = requests.get(url)
    print('Url "{}" requested. '.format(url))

    # Take first HTML table with districts from the page and parse it into dataframe
    print('Try to extract dataset from HTML table'.format(url))
    soup = BeautifulSoup(response.text, 'lxml')
    tables = soup.findAll('table', { 'class' : 'standard sortable' }, limit=1) 
    Moscow_dens_df = parse_html_table(tables[0]) 
    print('Success extract dataset from HTML table'.format(url))

    # Save dataframe for future use
    Moscow_dens_df.to_csv("data\Moscow_dens_df_raw.csv", index = False)

except Exception as err: 
    print('Request Url "{}" failed. Load previously saved dataframe'.format(url))
    print('Error is: {}'.format(err))
    # Load previously saved dataframe
    Moscow_dens_df = pd.read_csv("data\Moscow_dens_df_raw.csv")

# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_dens_df.head())
print(Moscow_dens_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_dens_df.dtypes)

Try to request url "https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%BE%D0%B2_%D0%B8_%D0%BF%D0%BE%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B"
Url "https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%BE%D0%B2_%D0%B8_%D0%BF%D0%BE%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B" requested. 
Try to extract dataset from HTML table
Success extract dataset from HTML table
Take a look at the dataframe
     № Флаг Герб Название района[2]/поселения[3][4]  \
0  1.0                               Академический    
1  2.0                                Алексеевский    
2  3.0                                Алтуфьевский    
3  4.0                                       Арбат    
4  5.0                                    Аэропорт    

  Название cоответствующего внутригородского муниципального образования: муниципального округа / поселен

As we can see Moscow Boroughs Population Density dataset is not very good. We have to
- drop some unused colums 
- set columns name for dataframe
- clear Borough Name from additional information, such as ', поселение ', ', городской округ '
- strip string columns from additional information like ' \n\t'
- replace '\n', ' ↗' and '↘' in some columns
- delete extra spaces in numeric string columns
- replace ',' to '.' for float columns
- convert from string to numeric

#### Clear Moscow Boroughs Population Density dataset

In [12]:
print('Clear the dataframe')

# Drop some unused colums 
Moscow_dens_df.drop([Moscow_dens_df.columns[0], Moscow_dens_df.columns[1], Moscow_dens_df.columns[2], Moscow_dens_df.columns[3], Moscow_dens_df.columns[5]], axis=1, inplace=True)

# Set columns for dataframe
Moscow_dens_df.columns=["Borough_Name", "Borough_Area", "Borough_Population", "Borough_Population_Density", "Borough_Housing_Area", "Borough_Housing_Area_Per_Person"]

# Clear Borough Name from additional information
Moscow_dens_df["Borough_Name"].replace(', поселение ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Name"].replace(', городской округ ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Name"] = Moscow_dens_df["Borough_Name"].str.strip(' \n\t')
Moscow_dens_df["Borough_Name"].replace('Мосрентген', '"Мосрентген"', regex=True, inplace=True)

# Replace '\n' and ' ↗' in some columns
Moscow_dens_df.replace('\n', '', regex=True, inplace=True)
Moscow_dens_df.replace('↗', '', regex=True, inplace=True)
Moscow_dens_df.replace('↘', '', regex=True, inplace=True)

# Delete extra spaces in numeric columns
Moscow_dens_df["Borough_Area"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Population"].replace('\xa0', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Population"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Population_Density"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area_Per_Person"].replace(' ', '', regex=True, inplace=True)

# Replace ',' to '.' for float columns
Moscow_dens_df["Borough_Area"].replace(',', '.', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area"].replace(',', '.', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area_Per_Person"].replace(',', '.', regex=True, inplace=True)

# Convert from string to numeric
Moscow_dens_df["Borough_Population"] = Moscow_dens_df["Borough_Population"].astype(int)
Moscow_dens_df["Borough_Population_Density"] = Moscow_dens_df["Borough_Population_Density"].astype(int)
Moscow_dens_df["Borough_Area"] = Moscow_dens_df["Borough_Area"].astype(float)
Moscow_dens_df['Borough_Housing_Area'] = pd.to_numeric(Moscow_dens_df['Borough_Housing_Area'], errors='coerce')
Moscow_dens_df['Borough_Housing_Area_Per_Person'] = pd.to_numeric(Moscow_dens_df['Borough_Housing_Area_Per_Person'], errors='coerce')

# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_dens_df.head())
print(Moscow_dens_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_dens_df.dtypes)

# Save copy of the dataframe
Moscow_dens_df.to_csv("data\Moscow_dens_df.csv", index = False)

Clear the dataframe
Take a look at the dataframe
    Borough_Name  Borough_Area  Borough_Population  \
0  Академический          5.83              109387   
1   Алексеевский          5.29               80534   
2   Алтуфьевский          3.25               57596   
3          Арбат          2.11               36125   
4       Аэропорт          4.58               79486   

   Borough_Population_Density  Borough_Housing_Area  \
0                       18762                2467.0   
1                       15223                1607.9   
2                       17721                 839.3   
3                       17120                 731.0   
4                       17355                1939.7   

   Borough_Housing_Area_Per_Person  
0                             22.7  
1                             20.5  
2                             15.5  
3                             26.0  
4                             25.9  
(146, 6)
Take a look at the dataframe data types
Borough_Name            

Now Moscow Boroughs Population Density dataset looks quite well

### Join all datasets into result Moscow Boroughs dataset

We do not have statistics on “housing prices” and “housing area” for all boroughs, so we exclude these boroughs from our analysis

In [31]:
# Merge datasets
Moscow_Borough_df = pd.merge(left=Moscow_df, right=Moscow_dens_df, how='left', left_on='Borough_Name', right_on='Borough_Name')
Moscow_Borough_df = pd.merge(left=Moscow_Borough_df, right=Moscow_coord_df, how='left', left_on='Borough_Name', right_on='Borough_Name')
Moscow_Borough_df = pd.merge(left=Moscow_Borough_df, right=Moscow_housing_price_df, how='left', left_on='Borough_Name', right_on='Borough_Name')

# We do not have statistics on “housing prices” and “housing area” for all boroughs, so we exclude these boroughs from our analysis
print('Print Boroughs without Housing Price')
Moscow_Borough_df[pd.isnull(Moscow_Borough_df['Borough_Housing_Price'])]
print('Delete Boroughs without Housing Price')
Moscow_Borough_df.dropna(subset=['Borough_Housing_Price'], inplace=True)

print('Print Boroughs without Housing Area')
Moscow_Borough_df[pd.isnull(Moscow_Borough_df['Borough_Housing_Area'])]
print('Delete Boroughs without Housing Area')
Moscow_Borough_df.dropna(subset=['Borough_Housing_Area'], inplace=True)

# reset index
Moscow_Borough_df.reset_index(drop=True, inplace=True)

# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_Borough_df.head())
print(Moscow_Borough_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_Borough_df.dtypes)

# Save result dataframe
Moscow_Borough_df.to_csv("data\Moscow_Borough_df.csv", index = False)

Print Boroughs without Housing Price
Delete Boroughs without Housing Price
Print Boroughs without Housing Area
Delete Boroughs without Housing Area
Take a look at the dataframe
    Borough_Name District_Name         Borough_Type  OKATO_Borough_Code  \
0  Академический          ЮЗАО  Муниципальный округ            45293554   
1   Алексеевский          СВАО  Муниципальный округ            45280552   
2   Алтуфьевский          СВАО  Муниципальный округ            45280554   
3          Арбат           ЦАО  Муниципальный округ            45286552   
4       Аэропорт           САО  Муниципальный округ            45277553   

   OKTMO_District_Code  Borough_Area  Borough_Population  \
0             45397000          5.83              109387   
1             45349000          5.29               80534   
2             45350000          3.25               57596   
3             45374000          2.11               36125   
4             45333000          4.58               79486   

   Borough_

So our result dataset contains all needed columns:
- **Borough_Name** - name of the Moscow Borough - is a unique key of the dataset
- **District_Name** - name of the Moscow District in which Borough is belong to
- **Borough_Type** - type of the Moscow Borough
- **OKATO_Borough_Code** - numeric code of the Moscow Borough
- **OKTMO_District_Code** - numeric code of the Moscow District
- **Borough_Area** - area of the Moscow Borough in square kilometers
- **Borough_Population** - population of the Moscow Borough
- **Borough_Population_Density** - population density of the Moscow Borough
- **Borough_Housing_Area** - housing area of the Moscow Borough in square meters
- **Borough_Housing_Area_Per_Person** - housing area per person of the Moscow Borough in square meters
- **Latitude** - geograprical Latitude of the Moscow Borough
- **Longitude** - geograprical Longitude of the Moscow Borough
- **Borough_Housing_Price** - average housing price of the Moscow Borough

Now we have all needed data for Venues searching and analysis

## Visialize a map of Moscow Boroughs

In [99]:
# Load previously saved dataframe
Moscow_Borough_df = pd.read_csv("data\Moscow_Borough_df.csv")
mo_geojson = 'data\mo.geojson'

# Moscow latitude and longitude values
Moscow_lat= 55.7504461
Moscow_lng= 37.6174943

# create map and display it
Moscow_map = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=10)

# Generate choropleth map with Borough Population
Moscow_map.choropleth(
    geo_data=mo_geojson,
    data=Moscow_Borough_df,
    name='Population Density',
    columns=['Borough_Name', 'Borough_Population'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Borough Population in Moscow City'
)

folium.LayerControl().add_to(Moscow_map)

# Add Borougs center as markers to Moscow map 
for Borough_Name, lat, lng, Borough_Population in zip(Moscow_Borough_df['Borough_Name'], Moscow_Borough_df['Latitude'], Moscow_Borough_df['Longitude'], Moscow_Borough_df['Borough_Population']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        #popup='{}, Москва, Россия ({:})'.format(Borough_Name, Borough_Population),
        popup=folium.Popup('{}, Москва, Россия ({:})'.format(Borough_Name, Borough_Population), parse_html=True),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Moscow_map)

    folium.Circle([lat, lng], radius=1000, color='blue', fill=False).add_to(Moscow_map)

# display map
Moscow_map

## Performing Venues data gathering and cleansing

As we can see, use center of the brough for searching venues is quite useless as eache borough have very sophisticated shape  
So I needed to present Moscow area in the form of a regular grid of circles of quite small diameter

Display a circle of 28 000 meter in radius, wich cover all the Moscow Boroughs in my reseach

In [100]:
# Display a circle of 28 000 meter in radius, wich cover all the Moscow Boroughs in my reseach
Moscow_Circle_lat= 55.7398697
Moscow_Circle_lng= 37.5365271
Circle_radius=28000
folium.Circle([Moscow_Circle_lat,Moscow_Circle_lng], radius=Circle_radius, color='blue', fill=False).add_to(Moscow_map)

# display map
Moscow_map

### Create a hexagonal grid of area candidates

Create a grid of area candidates, equaly spaced, centered around circle center and within 28 km  
Create our grid of locations in Cartesian 2D coordinate system which allows us to calculate distances in meters

Define functions to convert between WGS84 spherical coordinate system (latitude/longitude degrees) and UTM Cartesian coordinate system (X/Y coordinates in meters)

In [76]:
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Moscow center longitude={}, latitude={}'.format(Moscow_lat, Moscow_lng))
x, y = lonlat_to_xy(Moscow_lat, Moscow_lng)
print('Moscow center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Moscow center longitude={}, latitude={}'.format(lo, la))


Coordinate transformation check
-------------------------------
Moscow center longitude=55.7504461, latitude=37.6174943
Moscow center UTM X=4153844.213879714, Y=5041137.396315997
Moscow center longitude=55.750446099999984, latitude=37.61749430000002


Define a function to create a hexagonal grid of cells

In [84]:
def create_hexagonal_grid (lat, lon, distance_limit, cell_radius):
    center_x, center_y = lonlat_to_xy(lon, lat) 

    # create a hexagonal grid of cells: we offset every other row, and adjust vertical row 
    # spacing so that every cell center is equally distant from all it's neighbors.

    k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
    x_min = center_x - distance_limit
    x_step = cell_radius *2 
    y_min = center_y - distance_limit - (int((distance_limit/cell_radius+1)/k)*k*(cell_radius *2) - (distance_limit*2))/2
    y_step = cell_radius *2  * k 
    
    latitudes = []
    longitudes = []
    cells_id = []
    distances_from_center = []
    xs = []
    ys = []
    for i in range(0, int((distance_limit/cell_radius+1)/k)):
        y = y_min + i * y_step
        x_offset = cell_radius if i%2==0 else 0
        for j in range(0, int(distance_limit/cell_radius+1)):
            x = x_min + j * x_step + x_offset
            distance_from_center = calc_xy_distance(center_x, center_y, x, y)
            if (distance_from_center <= (distance_limit+1)):
                lon, lat = xy_to_lonlat(x, y)
                latitudes.append(lat)
                longitudes.append(lon)
                cells_id.append('{},{}'.format(lat, lon))
                distances_from_center.append(distance_from_center)
                xs.append(x)
                ys.append(y)

    # Create and return new Pandas dataframe with all cells
    return pd.DataFrame(list(zip(cells_id, latitudes, longitudes)), columns =['Cell_id', 'Cell_Latitude', 'Cell_Longitude']) 

Folium library have a problem to visualize more then 1000 item in single map  
So for test purpose create a grid of area candidates, equally spaced, centered around circle center and within 10 km and visualize it

In [101]:
distance_limit = 10000
cell_radius = 300
Moscow_Circle_lat= 55.7398697
Moscow_Circle_lng= 37.5365271    

Moscow_cells_df =  create_hexagonal_grid(Moscow_Circle_lat, Moscow_Circle_lng, distance_limit, cell_radius)
print(Moscow_cells_df.shape[0], 'candidate neighborhood centers generated.')

# Visualize circle center location and candidate neighborhood centers
Moscow_cell_map = folium.Map(location=[Moscow_Circle_lat, Moscow_Circle_lng], zoom_start=12)

# Generate choropleth map with Borough Population
Moscow_cell_map.choropleth(
    geo_data=mo_geojson,
    data=Moscow_Borough_df,
    name='Population Density',
    columns=['Borough_Name', 'Borough_Population'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Borough Population in Moscow City'
)

# Add grid of area candidates
for lat, lng in zip(Moscow_cells_df['Cell_Latitude'], Moscow_cells_df['Cell_Longitude']):
    folium.Circle([lat, lng], radius=cell_radius, color='blue', fill=False).add_to(Moscow_cell_map)

Moscow_cell_map

1009 candidate neighborhood centers generated.


Looks very good.  
So create a grid of area candidates, equaly spaced, centered around circle center and within radius 28 000 m

In [102]:
# Create a grid of area candidates, equaly spaced, centered around circle center and within 28 000 m
distance_limit = 28000
cell_radius = 300

Moscow_Circle_lat= 55.7398697
Moscow_Circle_lng= 37.5365271    

Moscow_cells_df =  create_hexagonal_grid(Moscow_Circle_lat, Moscow_Circle_lng, distance_limit, cell_radius)
Moscow_cells_df.index = Moscow_cells_df['Cell_id']
print(Moscow_cells_df.shape[0], 'candidate neighborhood centers generated.')

# Save dataframe
Moscow_cells_df.to_csv("data\Moscow_cells_df.csv", index = False)

7899 candidate neighborhood centers generated.


### Explore the neighborhood in the grid

Foursquare API have a certain limitation for API call in one day to explore venues  
So I have to divide cells dataset into subset and call Foursquare API for several days

Define some supplemenatry functions

In [103]:
def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Россия', '')
    address = address.replace(', Москва', '')
    return address

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def get_venues_near_location(lat, lon, client_id, client_secret, radius=300, limit=100):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, radius, limit)
    #print(url)
    results = requests.get(url).json()['response']['groups'][0]['items']
    venues = [(item['venue']['id'],
               item['venue']['name'],
               get_categories(item['venue']['categories']),
               item['venue']['location']['lat'], 
               item['venue']['location']['lng'],
               format_address(item['venue']['location']),
               item['venue']['location']['distance']) for item in results]        
    return venues

Declare Moscow venues dataframe, intially empty

In [104]:
# Declare Moscow venues dataframe, intially empty
Moscow_venues_df = pd.DataFrame()
# Declare dataframe to store already explored cells
Moscow_cells_explored_df = pd.DataFrame(columns=['Cell_id'])

Define Foursquare Credentials and Version

In [105]:
client_id = 'KO15SRHFPF3BBN5XUCEYLDU3ROGESHFXRWPK1Q32QSIDFFBM'
client_secret = 'POBZ0ULK51ARULZXVMOZMAVRCEUWOSXX5HD2QLERC43JCUEK'
version = '20180604'

limit = 100
# Using radius=cell_radius+50 to make sure we have overlaps/full coverage so we don't miss any venues
explore_radius = cell_radius+50

Proccess all cells if it not yet processed  
**!! It takes about 4 hours in several days to compleet so comment it and load previously saved dataframe !!**

In [None]:
# Prepare dataset with cell which is not yet have been processed. 1000 record in one batch
# Moscow_cells_for_explore_df = Moscow_cells_df[~Moscow_cells_df['Cell_id'].isin(Moscow_cells_explored_df['Cell_id'])].head(1000)

# Itterate through all cell prepared for explore
#for index, lat, lng in zip(Moscow_cells_for_explore_df.index, Moscow_cells_for_explore_df['Cell_Latitude'], Moscow_cells_for_explore_df['Cell_Longitude']):
#    print('Explore Cell {}'.format(index), end='')
#    
#    try:
#        venues = get_venues_near_location(lat, lng, client_id, client_secret, radius=explore_radius, limit=limit)
#        print(' - found {} veenues'.format(len(venues)))
#
#        # if found any venues add they to dataframe
#        if (len(venues) > 0):
#            Moscow_venues_df = Moscow_venues_df.append([(index, lat, lng, v[0], v[1], v[2], v[3], v[4], v[5], v[6], "") for v in venues], ignore_index=True)
#
#        # save cell id as already explored
#        Moscow_cells_explored_df.loc[index] = index 
#        
#    except Exception as err:
#        print(err)
#        pass
#
#Moscow_cells_explored_df.to_csv("data\Moscow_cells_explored_df.csv", index = False)
#Moscow_venues_df.to_csv("data\Moscow_venues_df_raw.csv", index = False)

### Clear Venues dataset

Load previously saved dataframe

In [111]:
Moscow_venues_df = pd.read_csv("data\Moscow_venues_df_raw.csv")

# Columns of result dataset
column_names = ['Cell_id', 'Cell_Latitude', 'Cell_Longitude', 'Venue_Id', 'Venue_Name', 
              'Venue_All_Categories','Venue_Latitude', 'Venue_Longitude', 'Venue_Location', 'Venue_Distance', 'Borough_Name'] 

# Rename columns
Moscow_venues_df.columns=column_names

# Take a look at the dataframe
print('Take a look at the dataframe')
print(Moscow_venues_df.head())
print(Moscow_venues_df.shape)

print('Take a look at the dataframe data types')
print(Moscow_venues_df.dtypes)

Take a look at the dataframe
                                Cell_id  Cell_Latitude  Cell_Longitude  \
0    55.5224833695889,37.33999278481363      55.522483       37.339993   
1  55.52079366429822,37.348782073217926      55.520794       37.348782   
2  55.52079366429822,37.348782073217926      55.520794       37.348782   
3  55.52079366429822,37.348782073217926      55.520794       37.348782   
4  55.51910336213278,37.357570430251045      55.519103       37.357570   

                   Venue_Id        Venue_Name  \
0  5013eb60e4b0d8233e3cc1f9     Дам на поляне   
1  559fc452498ee47a29994987      клуб "Драйв"   
2  55e98f1a498eb1a84c226c4b             Иртыш   
3  5c20ef4e4a1cc000392a49b3  ПИВО ТУТ. Triple   
4  4dbe1dba1e72b351cada9b8f      Пруд Платный   

                                Venue_All_Categories  Venue_Latitude  \
0         [('Vineyard', '4bf58dd8d48988d1de941735')]       55.523795   
1  [('General Entertainment', '4bf58dd8d48988d1f1...       55.520057   
2  [('Food & Dr

#### Delete duplicate venues

In [117]:
# Count duplicates venues
print('Unique Venues {} of {}'.format(Moscow_venues_df['Venue_Id'].nunique(), Moscow_venues_df['Venue_Id'].shape[0]))

# Drop duplicates
print('Delete duplicates')
Moscow_venues_df.drop_duplicates(subset ="Venue_Id", keep = 'first', inplace = True) 

# Reset index
Moscow_venues_df.reset_index(inplace = True) 

# Take a look at the dataframe
print('Take a look at the dataframe shape')
print(Moscow_venues_df.shape)

Unique Venues 27622 of 27622
Delete duplicates
Take a look at the dataframe shape
(27622, 14)


#### Get first category for each Venue

In [120]:
# Get first category for each Venue
Moscow_venues_df['Venue_Category_Name'] = Moscow_venues_df['Venue_All_Categories'].apply(lambda x: x.strip('[()]').split(', ')[0].strip("'"))
Moscow_venues_df['Venue_Category_Id'] = Moscow_venues_df['Venue_All_Categories'].apply(lambda x: x.strip('[()]').split(', ')[1].strip("'"))

print('Take a look at the Venue Category')
print(Moscow_venues_df[['Venue_Name', 'Venue_Category_Name', 'Venue_Category_Id']].head())

Take a look at the Venue Category
         Venue_Name    Venue_Category_Name         Venue_Category_Id
0     Дам на поляне               Vineyard  4bf58dd8d48988d1de941735
1      клуб "Драйв"  General Entertainment  4bf58dd8d48988d1f1931735
2             Иртыш      Food & Drink Shop  4bf58dd8d48988d1f9941735
3  ПИВО ТУТ. Triple             Beer Store  5370f356bcbc57f1066c94c2
4      Пруд Платный                 Resort  4bf58dd8d48988d12f951735


#### Bind each venue to Moscow Boroughs in which borders they were placed

In [123]:
# load GeoJSON file with Boroughs and create geometry shape
with open(mo_geojson) as json_file:
    geojson_data = json.loads(json_file.read())

    
# Itterate through all Borough shape and find all Venues, that is placed in it
for feature in geojson_data['features']:
    # shape of the Borough
    polygon = shape(feature['geometry'])
    borough_name = feature['properties']['NAME']
    
    print('Process borough "{}"'.format(borough_name), end='')
    
    # Itterate throug all Venues
    for index, name, lat, lng in zip(Moscow_venues_df.index, Moscow_venues_df['Venue_Name'], Moscow_venues_df['Venue_Latitude'], Moscow_venues_df['Venue_Longitude']):
        # construct point based on lon/lat
        point = Point(lng, lat)
    
        if polygon.contains(point):
            print('.', end='')
            Moscow_venues_df.loc[index, 'Borough_Name'] = borough_name
    
    print('done')

.................................................done
Process borough "Бескудниковский"...............................................done
Process borough "Гагаринский"........................................................................................................................................................................................done
Process borough "Тимирязевский"......................................................................................................................................................................done
Process borough "Северное Бутово".................................................................................................................................done
Process borough "Лианозово"..................................................................................done
Process borough "Хамовники".....................................................................................................................................

In [125]:
print('Take a look at the Venue of some Boroughs')
print(Moscow_venues_df[['Venue_Name', 'Venue_Category_Name', 'Borough_Name']].head(10))

Take a look at the Venue of some Boroughs
         Venue_Name    Venue_Category_Name Borough_Name
0     Дам на поляне               Vineyard  Десёновское
1      клуб "Драйв"  General Entertainment  Десёновское
2             Иртыш      Food & Drink Shop  Десёновское
3  ПИВО ТУТ. Triple             Beer Store  Десёновское
4      Пруд Платный                 Resort  Десёновское
5   Поселок Десна-3              Rest Area  Десёновское
6            Сельпо         Farmers Market  Десёновское
7     Садовый Центр          Garden Center  Десёновское
8       Грибной лес                  Trail  Десёновское
9      хинкали-gали   Caucasian Restaurant  Десёновское


#### Delete Venues that placed outside of the Moscow Boroughs

In [126]:
# Calcuate Venue placed outside Moscow Borough
print('{} Venue placed outside Moscow Boroughs'.format(Moscow_venues_df[~Moscow_venues_df['Borough_Name'].isin(Moscow_Borough_df['Borough_Name'])].shape[0]))

# Delete venues with is not in our scope of Moscow Boroughs
print('Delete Venue placed outside Moscow Boroughs')
Moscow_venues_df.drop(Moscow_venues_df[~Moscow_venues_df['Borough_Name'].isin(Moscow_Borough_df['Borough_Name'])].index, inplace=True)

# Reset index
Moscow_venues_df.reset_index(inplace = True) 

# Take a look at the dataframe
print('Take a look at the dataframe shape')
print(Moscow_venues_df.shape)

# Save result dataset with Venues 
Moscow_venues_df.to_csv("data\Moscow_venues_df.csv", index = False)

6758 Venue placed outside Moscow Boroughs
Delete Venue placed outside Moscow Boroughs
Take a look at the dataframe shape
(20864, 15)


### Visialize a map of some Moscow Boroughs with venues in it

In [127]:
Moscow_Borough_df = pd.read_csv("data\Moscow_Borough_df.csv")
Moscow_venues_df = pd.read_csv("data\Moscow_venues_df.csv")
mo_geojson = 'mo.geojson'

# Moscow latitude and longitude values
Moscow_subset_lat = Moscow_Borough_df[Moscow_Borough_df['Borough_Name'] == 'Бирюлёво Западное']['Latitude'].iloc[0]
Moscow_subset_lng = Moscow_Borough_df[Moscow_Borough_df['Borough_Name'] == 'Бирюлёво Западное']['Longitude'].iloc[0]

# create map and display it
Moscow_map = folium.Map(location=[Moscow_subset_lat, Moscow_subset_lng], zoom_start=12)

# generate choropleth map
Moscow_map.choropleth(
    geo_data=mo_geojson,
    data=Moscow_Borough_df,
    name='Population Density',
    columns=['Borough_Name', 'Borough_Population'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Borough Population in Moscow City'
)

#==============================================================================
# Create Venues subset for some Boroughs
#==============================================================================
Moscow_venues_subset = Moscow_venues_df[Moscow_venues_df['Borough_Name'].isin(['Орехово-Борисово Северное', 'Чертаново Южное', 'Южное Бутово'])]

#==============================================================================
# Add markers to map for venues
#==============================================================================
for Venue_name, lat, lng in zip(Moscow_venues_subset['Venue_Name'], Moscow_venues_subset['Venue_Latitude'], Moscow_venues_subset['Venue_Longitude']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        popup=folium.Popup('{}'.format(Venue_name), parse_html=True),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Moscow_map)


# display map
Moscow_map


Now data are ready for analysis

I gathered following information:  

1. main dataset with the list of Moscow Borough, containing the following attributes:
    - name of the each Moscow Borough
    - type of the each Moscow Borough
    - name of the each Moscow District in which Borough is belong to
    - area of the each Moscow Borough in square kilometers
    - the population of the each Moscow Borough
    - housing area of the each Moscow Borough in square meters
    - average housing price of the each Moscow Borough
2. geographical coordinates of the each Moscow Borough
3. shape of the each Moscow Borough in GEOJSON format
4. list of venues placed in the each Moscow Borough with their geographical coordinates and categories