# Capstone Project - The Battle of the Neighborhoods (Week 2)
## Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)
* [References](#references)

# Introduction <a name="Introduction"></a>

## Background <a name="Background"></a>

Москва - один самых больших мегаполисов в мире с населением более 12 миллионов человек занимает площадь более 2561,5 км² со средней плотностью наследения 4924,96 чел./км² [1](https://ru.wikipedia.org/wiki/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0).

Москва разделена на 12 административных округов (125 районов, 2 городских округа, 19 поселений).

Москва имеет очень неравномерную плотность населения от 30429 чел./км² для района "Зябликово", до 560 чел./км² "Молжаниновский район" [2](https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D0%B9%D0%BE%D0%BD%D1%8B_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B).

Средняя стоимость недвижимости варируется от 68768 руб./м² для района "Кленовское" до 438568 руб./м² для района "Арбат" [3](https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir).

## Business Problem <a name="Business Problem"></a>

Владельцы кафе, фитнес центров и других социальных объектов ожидаемо будут предпочитать районы с высокой плотность населения. 
Инвесторы будут предпочитать районы с невысокой стоимость жилья и невысокой конкурентностью.

Со стороны жителей (резидентов) ожидаемо предпочтение района с невысокой стоимость жилья и хорошей доступностью социальных мест.

В своем исследовании я постараюсь определить оптимальные места для расположения фитнесс центров в районах Москвы с учетом количества населения, стоимость недвижимости и плотности расположения других фитнесс объектов.

Ключевыми критериями отбора подходящих мест для расположения фитнесс центров будут являться: 
- Высокая плотность (численность) населения района
- Невысокая стоимость недвижимости в районе
- Отсутствие в непосредственной близости других фитнесс объектов аналогичного профиля

Я буду использовать подходы и методы машинного обучения для определения мест расположения фитнесс центров в соответствии с указанными критериями.

Основными стекхолдерами, моего исследования будет инвесторы, заинтересованные в открытие новых фитнес центров.

# Data acquisition and cleaning <a name="data"></a>

## Data sources 

Исходя из формулировки проблемы и установленных критериев отбора, мне потребуется следующая информация:

* Список районов Москвы [Moscow Boroughs](https://gis-lab.info/qa/moscow-atd.html)
* Географические координаты районов Москвы
* Стоимость недвижимости по районам Москвы [Moscow Boroughs Housing Price](https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir) 
* Плотность населения районов  Москвы [Moscow Boroughs Population Density](https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%BE%D0%B2_%D0%B8_%D0%BF%D0%BE%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B)
* Географические границы районов Москвы [Moscow Boroughs GEOJSON](http://gis-lab.info/data/mos-adm/mo.geojson)
* Список, координаты и классификация социальных объектов, расположенных по районам Москвы

Информация по списку районов Москвы, стоимости недвижимости и плотность населения будет извлекаться из HTML таблиц, что приводит к неообходимости выполнять большой объем очистки данных, так как:
- Удаление концевых и начальных символов " \n\t"
- Удаление различных спецсимволов, например " ↗↘"
- Удаление лишних пробелов в числах
- Замена десятичного разделителя в числах
- Замена созвучных русских символов "е" и "ё"
- Конвертация чисел из строк в int и float

Географические координаты будут запрашиваться с использованем сервиса Nominatim, который отличается не очень стабильной работой, что требует промежуточных сохранений данных и повторных перезапусков.

Для определения социальных объектов (**мест**) будет использовать сервис **Forsquare API**.   
- Особенностью данного сервиса является ограничение в 100 **мест**, которые он может вернуть в одном запросе.  
- Для получения полного списка всех мест, необходимо представить Москву, в виде регулярной сетки окружностей небольшого диаметра.
- Для каждой из окружностей в полученной сетки необходимо запрость расположение **мест**
- Итоговый список **мест** необходимо будет очистить от дубликатов и привязать к районам Москвы, в границах которых они расположены.


## Request and data cleansing

#### Import requied libraries

In [3]:
# Import requied libraries
import requests
import pandas as pd
import json
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

#### Define function for HTML table parse. This function help us
- find number of rows and columns in a HTML table 
- get cloumns titles, if posible
- convert string to float, if posible
- return result Pandas dataframe

In [4]:
def parse_html_table(table):
    n_columns = 0
    n_rows=0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):
        
        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows+=1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)
                
        # Handle column names if we find them
        th_tags = row.find_all('th') 
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0,n_columns)
    df = pd.DataFrame(columns = columns,
                        index= range(0,n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker,column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1
            
    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass
    
    return df

### Request and clear Moscow Boroughs list 

#### Load and parse Moscow Boroughs dataset from HTML page and examine the raw dataframe

In [25]:
# Load page with Moscow Boroughs
url = "https://gis-lab.info/qa/moscow-atd.html"
response = requests.get(url)

# Take second HTML table with districts from the page and parse it into dataframe
soup = BeautifulSoup(response.text, 'lxml')
tables = soup.findAll('table', { 'class' : 'wikitable sortable' }, limit=2) 
Moscow_df = parse_html_table(tables[1])  

# Define columns for dataframe
Moscow_df.columns=["Borough_index", "Borough_Name", "District_Name", "Borough_Type", "OKATO_Borough_Code", "OKTMO_District_Code"]

# Take a look at the dataframe
print(Moscow_df.head())
print(Moscow_df.shape)
print(Moscow_df.dtypes)

Borough_index     Borough_Name District_Name           Borough_Type  \
0            1.0  Академический\n        ЮЗАО\n  Муниципальный округ\n   
1            2.0   Алексеевский\n        СВАО\n  Муниципальный округ\n   
2            3.0   Алтуфьевский\n        СВАО\n  Муниципальный округ\n   
3            4.0          Арбат\n         ЦАО\n  Муниципальный округ\n   
4            5.0       Аэропорт\n         САО\n  Муниципальный округ\n   

   OKATO_Borough_Code  OKTMO_District_Code  
0          45293554.0           45397000.0  
1          45280552.0           45349000.0  
2          45280554.0           45350000.0  
3          45286552.0           45374000.0  
4          45277553.0           45333000.0  
(146, 6)
Borough_index          float64
Borough_Name            object
District_Name           object
Borough_Type            object
OKATO_Borough_Code     float64
OKTMO_District_Code    float64
dtype: object


As we can see Moscow Boroughs dataset is not very good. We have to
- remove "\n" in the end of the text data
- convert float to int for code columns

#### Clear Moscow Boroughs dataset

In [26]:
# Drop Borough_index colums 
Moscow_df.drop("Borough_index", axis=1, inplace=True)

# Remove "\n" in the end of the text data
Moscow_df.replace('\n', '', regex=True, inplace=True)

# convert float to int for code columns
Moscow_df["OKATO_Borough_Code"] = Moscow_df["OKATO_Borough_Code"].astype(int)
Moscow_df["OKTMO_District_Code"] = Moscow_df["OKTMO_District_Code"].astype(int)

# Take a look at the result dataframe
print(Moscow_df.head())
print(Moscow_df.dtypes)

# Save dataframe for future use
Moscow_df.to_csv("Moscow_df.csv", index = False)

Borough_Name District_Name         Borough_Type  OKATO_Borough_Code  \
0  Академический          ЮЗАО  Муниципальный округ            45293554   
1   Алексеевский          СВАО  Муниципальный округ            45280552   
2   Алтуфьевский          СВАО  Муниципальный округ            45280554   
3          Арбат           ЦАО  Муниципальный округ            45286552   
4       Аэропорт           САО  Муниципальный округ            45277553   

   OKTMO_District_Code  
0             45397000  
1             45349000  
2             45350000  
3             45374000  
4             45333000  
Borough_Name           object
District_Name          object
Borough_Type           object
OKATO_Borough_Code      int32
OKTMO_District_Code     int32
dtype: object


Now Moscow Boroughs dataset looks quite well

### Request coordinate of Moscow Borough

In [28]:
# define the dataframe columns
column_names = ['Borough_Name', 'Latitude', 'Longitude'] 

# instantiate the dataframe
Moscow_coord_df = pd.DataFrame(columns=column_names)

# create class instance of Nominatim
geolocator = Nominatim(user_agent="foursquare_agent", timeout=2)

# loop frough all Boroughs
for Borough_Name, Borough_Type, District_Name in zip(Moscow_df['Borough_Name'], Moscow_df['Borough_Type'], Moscow_df['District_Name']):
    address = '{}, {}, {}, Москва, Россия'.format(Borough_Name, Borough_Type, District_Name)
    print(address)

    location = None

    # loop until you get the coordinates
    while(location is None):
        location = geolocator.geocode(address)

    print('The geograpical coordinate of {}, {}, {} are {}, {}.'.format(Borough_Name, Borough_Type, District_Name, location.latitude, location.longitude))

    latitude = location.latitude
    longitude = location.longitude
    Moscow_coord_df = Moscow_coord_df.append({'Borough_Name': Borough_Name, 'Latitude': latitude, 'Longitude': longitude}, ignore_index=True) 

Академический, Муниципальный округ, ЮЗАО, Москва, Россия


GeocoderQuotaExceeded: HTTP Error 429: Too Many Requests

In [29]:
# Take a look at the dataframe
print(Moscow_coord_df.head())
print(Moscow_coord_df.shape)

# Save copy of the dataframe as service Nominatim not stable
Moscow_coord_df.to_csv("Moscow_coord_df.csv", index = False)

Empty DataFrame
Columns: [Borough_Name, Latitude, Longitude]
Index: []
(0, 3)


### Dowload GEOJSON for Moscow Boroughs

In [17]:
# download geojson file
url = 'http://gis-lab.info/data/mos-adm/mo.geojson'
download_file = requests.get(url)
mo_geojson_utf8 = 'mo.geojson.utf8'
mo_geojson = 'mo.geojson'
open(mo_geojson_utf8, 'wb').write(download_file.content)    
print('GeoJSON file downloaded!')

# Encode file from utf8 to cp1251 as my computer use Russian locale
f = open(mo_geojson, "wb")
for line in open(mo_geojson_utf8, "rb"):
    f.write(line.decode('u8').encode('cp1251', 'ignore'))
f = open(mo_geojson, "wb")
for line in open(mo_geojson_utf8, "rb"):
    f.write(line.decode('u8').encode('cp1251', 'ignore'))


# validate geojson file    
with open(mo_geojson) as json_file:
    data = json_file.read()
    try:
        data = json.loads(data)
    except ValueError as e:
        print('invalid json: %s' % e)

GeoJSON file downloaded!


### Request and clear Moscow Boroughs Housing Price

#### Load and parse Moscow Boroughs Housing Price dataset from HTML page and examine the raw dataframe

In [23]:
# Load page with Moscow Boroughs Housing Price
url = "https://www.mirkvartir.ru/journal/analytics/2018/02/25/reiting-raionov-moskvi-po-stoimosti-kvartir"
response = requests.get(url)

# Take first HTML table with districts from the page and parse it into dataframe
soup = BeautifulSoup(response.text, 'lxml')
tables = soup.findAll('table', limit=1) 
Moscow_housing_price_df = parse_html_table(tables[0]) 

# Take a look at the dataframe
print(Moscow_housing_price_df.head())
print(Moscow_housing_price_df.shape)
print(Moscow_housing_price_df.dtypes)

0                  1                       2                     3  \
0  \n№\n       \nРайон\n \n  \nЦена, \nруб./кв. м\n  \nПрирост \nза год\n   
1  \n1\n          \nАрбат\n              \n438568\n            \n−0,20%\n   
2  \n2\n      \nХамовники\n              \n425741\n             \n4,50%\n   
3  \n3\n       \nЯкиманка\n              \n404471\n             \n1,30%\n   
4  \n4\n  \nЗамоскворечье\n              \n398544\n             \n3,80%\n   

                             4                     5  
0  \nЦена \nквартиры, \nруб.\n  \nПрирост \nза год\n  
1                 \n33702123\n             \n0,10%\n  
2                 \n27196303\n             \n5,20%\n  
3                 \n26920221\n             \n4,10%\n  
4                 \n26910141\n             \n5,70%\n  
(144, 6)
0    object
1    object
2    object
3    object
4    object
5    object
dtype: object


As we can see Moscow Boroughs Housing Price dataset is not very good. We have to
- remove some unused colums 
- set columns for dataframe
- strip Borough Name from additional information like ' \n\t'
- replace '\n' in text columns
- convert from string to numeric
- replace some Borough_Name as of russian letters "е" and "ё" and change places of some words 

#### Clear Moscow Boroughs Housing Price dataset

In [24]:
# Drop some unused colums 
Moscow_housing_price_df.drop([Moscow_housing_price_df.columns[0], Moscow_housing_price_df.columns[3], Moscow_housing_price_df.columns[4], Moscow_housing_price_df.columns[5]], axis=1, inplace=True)
Moscow_housing_price_df.drop(0, axis=0, inplace=True)

# Set columns for dataframe
Moscow_housing_price_df.columns=["Borough_Name", "Borough_Housing_Price"]

# Clear Borough Name from additional information
Moscow_housing_price_df["Borough_Name"] = Moscow_housing_price_df["Borough_Name"].str.strip(' \n\t')

# Replace '\n' in some columns
Moscow_housing_price_df.replace('\n', '', regex=True, inplace=True)

# Convert from string to numeric
Moscow_housing_price_df["Borough_Housing_Price"] = Moscow_housing_price_df["Borough_Housing_Price"].astype(int)

# replace some Borough_Name as of russian letters "е" and "ё" and change places of some words 
Moscow_housing_price_df["Borough_Name"].replace('Бирюлево Восточное', 'Бирюлёво Восточное', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Бирюлево-Западное', 'Бирюлёво Западное', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Дегунино Восточное', 'Восточное Дегунино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Измайлово Восточное', 'Восточное Измайлово', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Дегунино Западное', 'Западное Дегунино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Савеловский', 'Савёловский', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Измайлово Северное', 'Северное Измайлово', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Медведково Северное', 'Северное Медведково', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Тушино Северное', 'Северное Тушино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Теплый Стан', 'Тёплый Стан', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Тропарево-Никулино', 'Тропарёво-Никулино', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Филевский Парк', 'Филёвский Парк', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Хорошево-Мневники', 'Хорошёво-Мнёвники', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Хорошевский', 'Хорошёвский', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Черемушки', 'Черёмушки', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Медведково Южное', 'Южное Медведково', regex=True, inplace=True)
Moscow_housing_price_df["Borough_Name"].replace('Тушино Южное', 'Южное Тушино', regex=True, inplace=True)

# Take a look at the result dataframe
print(Moscow_housing_price_df.head())
print(Moscow_housing_price_df.shape)
print(Moscow_housing_price_df.dtypes)

# Save copy of the dataframe
Moscow_housing_price_df.to_csv("Moscow_housing_price_df.csv", index = False)

Borough_Name  Borough_Housing_Price
1          Арбат                 438568
2      Хамовники                 425741
3       Якиманка                 404471
4  Замоскворечье                 398544
5       Тверской                 386255
(143, 2)
Borough_Name             object
Borough_Housing_Price     int32
dtype: object


Now Moscow Boroughs Housing Price dataset looks quite well

### Request and clear Moscow Boroughs Population Density

#### Load and parse Moscow Boroughs Population Density dataset from HTML page and examine the raw dataframe

In [30]:
# Load page with Moscow Boroughs Population Density
url = "https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%BE%D0%B2_%D0%B8_%D0%BF%D0%BE%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B"
response = requests.get(url)

# Take first HTML table with districts from the page and parse it into dataframe
soup = BeautifulSoup(response.text, 'lxml')
tables = soup.findAll('table', { 'class' : 'standard sortable' }, limit=1) 
Moscow_dens_df = parse_html_table(tables[0]) 

# Take a look at the dataframe
print(Moscow_dens_df.head())
print(Moscow_dens_df.shape)
print(Moscow_dens_df.dtypes)

№ Флаг Герб Название района[2]/поселения[3][4]  \
0  1.0                               Академический    
1  2.0                                Алексеевский    
2  3.0                                Алтуфьевский    
3  4.0                                       Арбат    
4  5.0                                    Аэропорт    

  Название cоответствующего внутригородского муниципального образования: муниципального округа / поселения / городского округа[5]  \
0                                     Академический                                                                                 
1                                      Алексеевский                                                                                 
2                                      Алтуфьевский                                                                                 
3                                             Арбат                                                                                 
4        

As we can see Moscow Boroughs Population Density dataset is not very good. We have to
- drop some unused colums 
- set columns name for dataframe
- clear Borough Name from additional information, such as ', поселение ', ', городской округ '
- strip string columns from additional information like ' \n\t'
- replace '\n', ' ↗' and '↘' in some columns
- delete extra spaces in numeric string columns
- replace ',' to '.' for float columns
- convert from string to numeric

#### Clear Moscow Boroughs Population Density dataset

In [31]:
# Drop some unused colums 
Moscow_dens_df.drop([Moscow_dens_df.columns[0], Moscow_dens_df.columns[1], Moscow_dens_df.columns[2], Moscow_dens_df.columns[3], Moscow_dens_df.columns[5]], axis=1, inplace=True)

# Set columns for dataframe
Moscow_dens_df.columns=["Borough_Name", "Borough_Area", "Borough_Population", "Borough_Population_Density", "Borough_Housing_Area", "Borough_Housing_Area_Per_Person"]

# Clear Borough Name from additional information
Moscow_dens_df["Borough_Name"].replace(', поселение ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Name"].replace(', городской округ ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Name"] = Moscow_dens_df["Borough_Name"].str.strip(' \n\t')
Moscow_dens_df["Borough_Name"].replace('Мосрентген', '"Мосрентген"', regex=True, inplace=True)

# Replace '\n' and ' ↗' in some columns
Moscow_dens_df.replace('\n', '', regex=True, inplace=True)
Moscow_dens_df.replace('↗', '', regex=True, inplace=True)
Moscow_dens_df.replace('↘', '', regex=True, inplace=True)

# Delete extra spaces in numeric columns
Moscow_dens_df["Borough_Area"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Population"].replace('\xa0', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Population"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Population_Density"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area"].replace(' ', '', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area_Per_Person"].replace(' ', '', regex=True, inplace=True)

# Replace ',' to '.' for float columns
Moscow_dens_df["Borough_Area"].replace(',', '.', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area"].replace(',', '.', regex=True, inplace=True)
Moscow_dens_df["Borough_Housing_Area_Per_Person"].replace(',', '.', regex=True, inplace=True)

# Convert from string to numeric
Moscow_dens_df["Borough_Population"] = Moscow_dens_df["Borough_Population"].astype(int)
Moscow_dens_df["Borough_Population_Density"] = Moscow_dens_df["Borough_Population_Density"].astype(int)
Moscow_dens_df["Borough_Area"] = Moscow_dens_df["Borough_Area"].astype(float)
Moscow_dens_df['Borough_Housing_Area'] = pd.to_numeric(Moscow_dens_df['Borough_Housing_Area'], errors='coerce')
Moscow_dens_df['Borough_Housing_Area_Per_Person'] = pd.to_numeric(Moscow_dens_df['Borough_Housing_Area_Per_Person'], errors='coerce')

# Take a look at the dataframe
print(Moscow_dens_df.head())
print(Moscow_dens_df.dtypes)

# Save copy of the dataframe
Moscow_dens_df.to_csv("Moscow_dens_df.csv", index = False)

Borough_Name  Borough_Area  Borough_Population  \
0  Академический          5.83              109387   
1   Алексеевский          5.29               80534   
2   Алтуфьевский          3.25               57596   
3          Арбат          2.11               36125   
4       Аэропорт          4.58               79486   

   Borough_Population_Density  Borough_Housing_Area  \
0                       18762                2467.0   
1                       15223                1607.9   
2                       17721                 839.3   
3                       17120                 731.0   
4                       17355                1939.7   

   Borough_Housing_Area_Per_Person  
0                             22.7  
1                             20.5  
2                             15.5  
3                             26.0  
4                             25.9  
Borough_Name                        object
Borough_Area                       float64
Borough_Population                   int

Now Moscow Boroughs Population Density dataset looks quite well

### Join all datasets into result Moscow Borough dataset

In [12]:
# Merge datasets
Moscow_Borough_df = pd.merge(left=Moscow_df, right=Moscow_dens_df, left_on='Borough_Name', right_on='Borough_Name')
Moscow_Borough_df = pd.merge(left=Moscow_Borough_df, right=Moscow_coord_df, left_on='Borough_Name', right_on='Borough_Name')
Moscow_Borough_df = pd.merge(left=Moscow_Borough_df, right=Moscow_housing_price_df, left_on='Borough_Name', right_on='Borough_Name')

# Take a look at the dataframe
print(Moscow_Borough_df.head())
print(Moscow_Borough_df.dtypes)

# Save result dataframe
Moscow_Borough_df.to_csv("Moscow_Borough_df.csv", index = False)

Borough_Name District_Name         Borough_Type  OKATO_Borough_Code  \
0  Академический          ЮЗАО  Муниципальный округ            45293554   
1   Алексеевский          СВАО  Муниципальный округ            45280552   
2   Алтуфьевский          СВАО  Муниципальный округ            45280554   
3          Арбат           ЦАО  Муниципальный округ            45286552   
4       Аэропорт           САО  Муниципальный округ            45277553   

   OKTMO_District_Code  Borough_Area  Borough_Population  \
0             45397000          5.83              109387   
1             45349000          5.29               80534   
2             45350000          3.25               57596   
3             45374000          2.11               36125   
4             45333000          4.58               79486   

   Borough_Population_Density  Borough_Housing_Area  \
0                       18762                2467.0   
1                       15223                1607.9   
2                       177

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>

## References <a name="references"></a>

1. [Moscow](https://ru.wikipedia.org/wiki/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0)

2. [Moscow distrits](https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D0%B9%D0%BE%D0%BD%D1%8B_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B)

3. [Список районов и поселений Москвы](https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%BE%D0%B2_%D0%B8_%D0%BF%D0%BE%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B)

In [8]:
###############################################################################
# import library
###############################################################################
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup # library for html scrap
from geopy.geocoders import Nominatim
import folium
from folium import plugins
import shapely.geometry
import pyproj
import math


In [9]:
###############################################################################
# Load previously prepeared dataset 
###############################################################################
Moscow_Borough_df = pd.read_csv("Moscow_Borough_df.csv")
mo_geojson = 'mo.geojson'

In [10]:
###############################################################################
# Visialize a map of Moscow Borough
###############################################################################
# Moscow latitude and longitude values
Moscow_lat= 55.7504461
Moscow_lng= 37.6174943

# create map and display it
Moscow_map = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=10)

#==============================================================================
# Generate choropleth map with Borough Population
#==============================================================================
Moscow_map.choropleth(
    geo_data=mo_geojson,
    data=Moscow_Borough_df,
    name='Population Density',
    columns=['Borough_Name', 'Borough_Population'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Borough in Moscow City'
)

folium.LayerControl().add_to(Moscow_map)

#==============================================================================
# Add borough center as markers to Moscow map 
#==============================================================================
for Borough_Name, lat, lng, Borough_Population in zip(Moscow_Borough_df['Borough_Name'], Moscow_Borough_df['Latitude'], Moscow_Borough_df['Longitude'], Moscow_Borough_df['Borough_Population']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        #popup='{}, Москва, Россия ({:})'.format(Borough_Name, Borough_Population),
        popup=folium.Popup('{}, Москва, Россия ({:})'.format(Borough_Name, Borough_Population), parse_html=True),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Moscow_map)

    folium.Circle([lat, lng], radius=1000, color='blue', fill=False).add_to(Moscow_map)


#==============================================================================
# display map
#==============================================================================
Moscow_map

#==============================================================================
# As we can see, use center of the brough for searching venues is quite useless as eache borough have very sophisticated shape 
# Api для поиска поддерживает поиск в радиусе до 1000 м
# нам же необходимо искать в радиусе 21 000 м от центра Москвы
#==============================================================================

# Display a circle of 21 000 meter in diametre
folium.Circle([Moscow_lat, Moscow_lng], radius=21000, color='blue', fill=False).add_to(Moscow_map)

# display map
Moscow_map


In [58]:
###############################################################################
# Create a grid of area candidates, equaly spaced, centered around city center and within 20km
# Create our grid of locations in Cartesian 2D coordinate system which allows us to calculate distances in meters
###############################################################################

#==============================================================================
# Create functions to convert between WGS84 spherical coordinate system (latitude/longitude degrees) 
# and UTM Cartesian coordinate system (X/Y coordinates in meters)
#==============================================================================
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Moscow center longitude={}, latitude={}'.format(Moscow_lng, Moscow_lat))
x, y = lonlat_to_xy(Moscow_lng, Moscow_lat)
print('Moscow center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Moscow center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Moscow center longitude=37.6174943, latitude=55.7504461
Moscow center UTM X=1905338.9660287427, Y=6412552.724743128
Moscow center longitude=37.6174943, latitude=55.7504461


In [12]:

distance_limit = 2000
cell_radius = 300

# City center in Cartesian coordinates
Moscow_center_x, Moscow_center_y = lonlat_to_xy(Moscow_lng, Moscow_lat) 

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = Moscow_center_x - distance_limit
x_step = cell_radius *2 
y_min = Moscow_center_y - distance_limit - (int((distance_limit/cell_radius+1)/k)*k*(cell_radius *2) - (distance_limit*2))/2
y_step = cell_radius *2  * k 

latitudes = []
longitudes = []
cells_id = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int((distance_limit/cell_radius+1)/k)):
    y = y_min + i * y_step
    x_offset = cell_radius if i%2==0 else 0
    for j in range(0, int(distance_limit/cell_radius+1)):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(Moscow_center_x, Moscow_center_y, x, y)
        if (distance_from_center <= (distance_limit+1)):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            cells_id.append('{},{}'.format(lat, lon))
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

# Create new Pandas dataframe with all cells
Moscow_cells_df = pd.DataFrame(list(zip(cells_id, latitudes, longitudes)), columns =['Cell_id', 'Cell_Latitude', 'Cell_Longitude']) 
             
            
print(len(latitudes), 'candidate neighborhood centers generated.')



39 candidate neighborhood centers generated.


In [13]:
#==============================================================================
# Visualize the data: city center location and candidate neighborhood centers
#==============================================================================
Moscow_cell_map = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=10)
for lat, lon in zip(Moscow_cells_df['Cell_Latitude'], Moscow_cells_df['Cell_Longitude']):
    folium.Circle([lat, lon], radius=cell_radius, color='blue', fill=False).add_to(Moscow_cell_map)
Moscow_cell_map

In [14]:
Moscow_venues = pd.read_csv("Moscow_venues.csv")
# Take a look at the dataframe
print(Moscow_venues.head())
print(Moscow_venues.shape)


Cell_id  Cell_Latitude  Cell_Longitude  \
0  55.739800449135714,37.59784049913523        55.7398        37.59784   
1  55.739800449135714,37.59784049913523        55.7398        37.59784   
2  55.739800449135714,37.59784049913523        55.7398        37.59784   
3  55.739800449135714,37.59784049913523        55.7398        37.59784   
4  55.739800449135714,37.59784049913523        55.7398        37.59784   

                   Venue_Id                              Venue_Name  \
0  5badc4a4fb8e59002c4991af    Crowne Plaza Moscow - Tretyakovskaya   
1  4f4fa290e4b05223bbacd8c0                        Агентство Art.ru   
2  4fc1f143e4b099ce1751142b  Aquamarine Hotel (Гостиница Аквамарин)   
3  5799aef3498e5c18ada5df0b                               Кофеварня   
4  58d0d7fb5da8f440a033c2e2                            Coffee Point   

                                 Venue_Categorys  Venue_Latitude  \
0        [('Hotel', '4bf58dd8d48988d1fa931735')]       55.739680   
1  [('Art Gallery', '4bf

In [22]:
#==============================================================================
# Visualize the data: city center location and candidate neighborhood centers
#==============================================================================
Moscow_venues_map = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=10)

# Add markers to map 
for Venue_Name, lat, lng in zip(Moscow_venues['Venue_Name'][:20], Moscow_venues['Venue_Latitude'][:20], Moscow_venues['Venue_Longitude'][:20]):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        popup='{}'.format(Venue_Name),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Moscow_venues_map)

Moscow_venues_map

In [41]:
#test = pd.read_csv("Moscow_venues55.74512377912906,37.607666003522596.csv")
test = pd.read_csv("Moscow_venues.csv")

Moscow_cell_map2 = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=10)

#for lat, lon in zip(test['Cell_Latitude'], test['Cell_Longitude']):
#    folium.Circle([lat, lon], radius=cell_radius, color='blue', fill=False).add_to(Moscow_cell_map2)
    

venues_marker_cluster = plugins.MarkerCluster().add_to(Moscow_cell_map2)

for Venue_Name, lat, lng in zip(test['Venue_Name'][:1000], test['Venue_Latitude'][:1000], test['Venue_Longitude'][:1000]):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        #popup=folium.Popup('{}'.format(Venue_Name), parse_html=True),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_marker_cluster)   
   
Moscow_cell_map2

In [34]:
Moscow_venues_df = pd.read_csv("Moscow_venues_df.csv")
Moscow_Borough_df = pd.read_csv("Moscow_Borough_gym_df.csv")
Moscow_gum_venues_df = pd.read_csv("Moscow_gum_venues_df.csv")
mo_geojson = 'mo.geojson'

In [18]:
###############################################################################
# Visialize a map of some Moscow Boroughs with venues in it
###############################################################################
# Moscow latitude and longitude values
Moscow_subset_lat = Moscow_Borough_df[Moscow_Borough_df['Borough_Name'] == 'Орехово-Борисово Северное']['Latitude'].iloc[0]
Moscow_subset_lng = Moscow_Borough_df[Moscow_Borough_df['Borough_Name'] == 'Орехово-Борисово Северное']['Longitude'].iloc[0]

# create map and display it
Moscow_map = folium.Map(location=[Moscow_subset_lat, Moscow_subset_lng], zoom_start=12)

# generate choropleth map
Moscow_map.choropleth(
    geo_data=mo_geojson,
    data=Moscow_Borough_df,
    name='Population Density',
    columns=['Borough_Name', 'Borough_Population'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Borough Population in Moscow City'
)


#==============================================================================
# Add markers to map for borough 
#==============================================================================
# Create Venues subset for some venue
Moscow_venues_subset = Moscow_venues_df[Moscow_venues_df['Borough_Name'].isin(['Орехово-Борисово Северное', 'Братеево', 'Нагатинский Затон'])]

for Venue_name, lat, lng in zip(Moscow_venues_subset['Venue_Name'], Moscow_venues_subset['Venue_Latitude'], Moscow_venues_subset['Venue_Longitude']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        popup=folium.Popup('{}'.format(Venue_name), parse_html=True),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Moscow_map)


# display map
Moscow_map


In [35]:
###############################################################################
# Visialize a map of some Moscow Boroughs with venues in it
###############################################################################
# create map and display it
Moscow_map = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=12)

# generate choropleth map
Moscow_map.choropleth(
    geo_data=mo_geojson,
    data=Moscow_Borough_df,
    name='Population Density',
    columns=['Borough_Name', 'Borough_Population_Per_Gym'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Borough Population in Moscow City'
)


#==============================================================================
# Add markers to map for borough 
#==============================================================================
for Venue_name, lat, lng in zip(Moscow_gum_venues_df['Venue_Name'], Moscow_gum_venues_df['Venue_Latitude'], Moscow_gum_venues_df['Venue_Longitude']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        #popup=folium.Popup('{}'.format(Venue_name), parse_html=True),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Moscow_map)


# display map
Moscow_map


In [24]:
from folium import plugins
from folium.plugins import HeatMap


In [42]:
# Moscow latitude and longitude values
Moscow_lat= 55.7504461
Moscow_lng= 37.6174943

# create map
Moscow_map = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=11)

# List comprehension to make out list of lists
heat_data = [[row['Venue_Latitude'], row['Venue_Longitude']] for index, row in Moscow_gum_venues_df.iterrows()]

# Add HeatMap
HeatMap(heat_data).add_to(Moscow_map)
folium.GeoJson(mo_geojson).add_to(Moscow_map)

# display map
Moscow_map


In [5]:
# Read previously saved dataset
Moscow_Borough_Gym_Clustering_df = pd.read_csv("Moscow_Borough_Gym_Clustering_df.csv")
Moscow_gym_venues_df = pd.read_csv("Moscow_gym_venues_df.csv")
mo_geojson = 'mo.geojson'

# Moscow latitude and longitude values
Moscow_lat= 55.7504461
Moscow_lng= 37.6174943

# create map 
Moscow_map = folium.Map(location=[Moscow_lat, Moscow_lng], zoom_start=11)

# generate choropleth map
Moscow_map.choropleth(
    geo_data=mo_geojson,
    data=Moscow_Borough_Gym_Clustering_df,
    name='Population Density',
    columns=['Borough_Name', 'Cluster_Labels'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Borough Gym Clustering in Moscow City'
)


#==============================================================================
# Add markers to map for borough 
#==============================================================================
for Venue_name, lat, lng in zip(Moscow_gym_venues_df['Venue_Name'], Moscow_gym_venues_df['Venue_Latitude'], Moscow_gym_venues_df['Venue_Longitude']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, # define how big you want the circle markers to be
        color='yellow',
        fill=True,
        #popup=folium.Popup('{}'.format(Venue_name), parse_html=True),
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Moscow_map)
    folium.Circle([lat, lng], radius=250, color='blue', fill=False).add_to(Moscow_map)



# display map
Moscow_map


## Install requed library

In [1]:
!conda install -c conda-forge shapely --yes 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... 
  - anaconda::ca-certificates-2019.8.28-0, anaconda::openssl-1.1.1d-he774522_2
  - anaconda::openssl-1.1.1d-he774522_2, defaults::ca-certificates-2019.8.28-0
  - anaconda::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2
  - defaults::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2done

## Package Plan ##

  environment location: C:\Users\Roman\Anaconda3

  added / updated specs:
    - shapely


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geos-3.8.0                 |       he025d50_0         1.0 MB  conda-forge
    shapely-1.6.4              |py37h2130f3d_1007         387 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.4 MB

The following NEW packages wi

In [2]:
!conda install -c conda-forge pyproj --yes 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... 
  - anaconda::ca-certificates-2019.8.28-0, anaconda::openssl-1.1.1d-he774522_2
  - anaconda::openssl-1.1.1d-he774522_2, defaults::ca-certificates-2019.8.28-0
  - anaconda::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2
  - defaults::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2done

## Package Plan ##

  environment location: C:\Users\Roman\Anaconda3

  added / updated specs:
    - pyproj


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    proj-6.2.0                 |

In [3]:
!conda install -c conda-forge Beautifulsoup4 --yes 
!conda install -c conda-forge lxml --yes 
!conda install -c conda-forge html5lib --yes 
!conda install -c conda-forge requests --yes 
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge geocoder --yes

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... 
  - anaconda::ca-certificates-2019.8.28-0, anaconda::openssl-1.1.1d-he774522_2
  - anaconda::openssl-1.1.1d-he774522_2, defaults::ca-certificates-2019.8.28-0
  - anaconda::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2
  - defaults::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... 
  - anaconda::ca-certificates-2019.8.28-0, anaconda::openssl-1.1.1d-he774522_2
  - anaconda::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2
  - anaconda::openssl-1.1.1d-he774522_2, defaults::ca-certificates-2019.8.28-0
  - defaults::ca-certificates-2019.8.28-0, defaults::openssl-1.1.1d-he774522_2done

# All requested packages already installed.

Collecting package metadata (current_repodata.json):

## Prepare dataset

In [None]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup # library for html scrap
from geopy.geocoders import Nominatim
import folium

Define function for HTML table parse

In [None]:
def parse_html_table(table):
    n_columns = 0
    n_rows=0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):
        
        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows+=1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)
                
        # Handle column names if we find them
        th_tags = row.find_all('th') 
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0,n_columns)
    df = pd.DataFrame(columns = columns,
                        index= range(0,n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker,column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1
            
    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass
    
    return df

Dowload Wikipedia page

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(url)

Take first HTML table from the page and parse it into dataframe

In [None]:
soup = BeautifulSoup(response.text, 'lxml')
df = parse_html_table(soup.find('table'))  

Take a look at the dataframe

In [None]:
df.head()

Rename some columns

In [None]:
df.rename(columns={"Neighborhood\n": "Neighborhood", "Postcode": "PostalCode"}, inplace=True)

Replace "\n" in the end of the "Neighborhood" column

In [None]:
df['Neighborhood'].replace('\n', '', regex=True, inplace=True)

Drop rows with 'Borough' == 'Not assigned'

In [None]:
df.drop(df[df['Borough'] == 'Not assigned'].index, axis=0, inplace=True)

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough

In [None]:
df['Neighborhood'].replace('Not assigned', df['Borough'], inplace=True)

Concatenate 'Neighborhood' for one postal code area 

In [None]:
# Group by 'PostalCode', 'Borough'
# Apply concatenation 'Neighborhood' with ','
# Convert to pandas DataFrame
# Reset index
Toronto_df = df.groupby(['PostalCode','Borough'], as_index=True)['Neighborhood'].apply(lambda tags: ','.join(tags)).to_frame().reset_index()

Take a look at the result dataframe

In [None]:
Toronto_df.head()

In [None]:
Toronto_df.shape

## Use geopy library to get the latitude and longitude values
## !!!! Unsuccessfuly !!!!

Import library

In [None]:
from geopy.geocoders import Nominatim
import geocoder

Define dataframe for latitude and longitude 

In [None]:
# define the dataframe columns
column_names = ['PostalCode', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Try to use geo library

In [None]:
#geolocator = Nominatim(user_agent="ny_explorer", timeout=3)
geolocator = Nominatim(user_agent="foursquare_agent", timeout=1)

# loop all Postcode
for postal_code in Toronto_df['PostalCode']:
    address = '{}, Toronto, Ontario'.format(postal_code)
    print(address)

    location = None
    i = 0

 # loop until you get the coordinates
    while(location is None):
        location = geolocator.geocode(address)

    print('The geograpical coordinate of {} are {}, {}.'.format(postal_code, location.latitude, location.longitude))

    latitude = location.latitude
    longitude = location.longitude
    neighborhoods = neighborhoods.append({'PostalCode': postal_code,
                                        'Latitude': latitude,
                                        'Longitude': longitude}, ignore_index=True) 
    print(location)

## Get the latitude and longitude values from .csv

Load csv file

In [None]:
Toronto_postalcode_ll = pd.read_csv('Geospatial_data.csv')
Toronto_postalcode_ll.head()

Merge two dataframe

In [None]:
Toronto_df = pd.merge(left=Toronto_df, right=Toronto_postalcode_ll, left_on='PostalCode', right_on='PostalCode')

Take a look at the result dataframe

In [None]:
Toronto_df.head()

In [None]:
Toronto_df.shape

## Create a map of Toronto neighborhoods

Use geopy library to get the latitude and longitude values of New York City

In [None]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

Create a map of Toronto with neighborhoods superimposed on top.

In [None]:
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(Toronto_df['Latitude'], Toronto_df['Longitude'], Toronto_df['Borough'], Toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

## Explore the neighborhood in Toronto

Define Foursquare Credentials and Version

In [None]:
CLIENT_ID = 'KO15SRHFPF3BBN5XUCEYLDU3ROGESHFXRWPK1Q32QSIDFFBM'
CLIENT_SECRET = 'Q2ALZAPWRRRTKBOWCLHNBWZNEO3BIHQ23HMOMURZGX2ZWZQC'
VERSION = '20180604'
LIMIT = 30
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Define a function to get nearby venues according to the given latitudes and longitudes

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Run the above function on each neighborhood and create a new dataframe

In [None]:
toronto_venues = getNearbyVenues(names=Toronto_df['Neighborhood'],
                                   latitudes=Toronto_df['Latitude'],
                                   longitudes=Toronto_df['Longitude']
                                  )

Take a look at the result dataframe

In [None]:
toronto_venues.head()

Save Toronto venues to .csv file 

In [None]:
toronto_venues.to_csv('toronto_venues.csv', index = False)
#toronto_venues = pd.read_csv('toronto_venues.csv')

In [None]:
toronto_venues.shape

## Analyze Neighborhood

How many unique categories can be curated from all the returned venues

In [None]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Toronto_Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Examine the new dataframe size.

In [None]:
toronto_onehot.shape

Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Toronto_Neighborhood').mean().reset_index()
toronto_grouped

Confirm the new size

In [None]:
toronto_grouped.shape

Print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Toronto_Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Toronto_Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Define a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Toronto_Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Toronto_Neighborhood'] = toronto_grouped['Toronto_Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
neighborhoods_venues_sorted.shape

## Cluster Neighborhoods

Run k-means to cluster the neighborhood

In [None]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Toronto_Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

### Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

Add clustering labels

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_.astype(int))

As not all Neighborhood have venues, we need to use left join

In [None]:
print('Toronto_df shape {}'.format(Toronto_df.shape))
print('neighborhoods_venues_sorted shape {}'.format(neighborhoods_venues_sorted.shape))


# as not all Neighborhood have venues, we need to use left join
toronto_merged = pd.merge(left=Toronto_df, right=neighborhoods_venues_sorted, left_on='Neighborhood', right_on='Toronto_Neighborhood', how='left')

#replace the missing 'Cluster Labels' values by 5
toronto_merged["Cluster Labels"].replace(np.nan, 5, inplace=True)

# drop Toronto_Neighborhood colum
toronto_merged.drop("Toronto_Neighborhood", axis = 1, inplace=True)

print('toronto_merged shape {}'.format(toronto_merged.shape))
toronto_merged.head()

Visualize the resulting clusters

In [None]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters+1)
ys = [i + x + (i*x)**2 for i in range(kclusters+1)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Without claster

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]