# Classification of Moscow Metro stations

## Introduction

The Moscow Metro has 265 stations and is one of the largest public transit systems in the world. More than 6 million people use it daily.  
In this project, we look at the neighborhoods surrounding metro stations and classify them. Some neighborhoods are mostly residential, while others are dominated by business or commercial spaces. The venues closest to a station determine why and how people use it. For example, if a neighborhood has no professional venues, its residents are likely to travel to other areas for work, which creates daily migrations of people.  
By analyzing this data we can classify stations by primary usage. City planners can use the result to determine where people are most likely to travel from and to for work and leisure, which can help plan further extensions of the network and find places for new development.

## Data

In [51]:
import pandas as pd
import json
import requests
from bs4 import BeautifulSoup
import folium

### List of stations and their geographical coordinates
We can get the list of stations and their coordinates from Wikipedia.

In [68]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_Moscow_Metro_stations'
wiki_page = requests.get(wiki_url).text
wiki_doc = BeautifulSoup(wiki_page, 'lxml')

# get the table containing the list of stations
stations_table = wiki_doc.find('table', {'class': 'wikitable sortable'})

In [76]:
# Grab English name, Russian name and coordinates (convert to comma-separated) from list
rows = []

for tr in stations_table.find_all('tr')[1:]:
    cells = tr.find_all('td')
    # skip rows that are too short (e.g. header rows made of <th> cells)
    if len(cells) < 7:
        continue
    # ignore rows that don't have coordinates
    geo = cells[6].find('span', {'class': 'geo'})
    if geo is None:
        continue
    rows.append({
        'English name': cells[0].text.strip(),
        'Russian name': cells[1].text.strip(),
        'Coordinates': geo.text.strip().replace('; ', ',')
    })

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
stations_df = pd.DataFrame(rows, columns=['English name', 'Russian name', 'Coordinates'])

stations_df.to_csv('stations.csv')
stations_df.head()
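
The `Coordinates` column stores each position as a `latitude,longitude` string. A minimal sketch of a helper that turns such a string into a pair of floats and rejects out-of-range values might look like this; the name `parse_coordinates` is hypothetical and not part of the original notebook.

```python
def parse_coordinates(coordinates):
    """Convert a 'lat,lon' string into a (lat, lon) tuple of floats."""
    parts = coordinates.split(',')
    if len(parts) != 2:
        raise ValueError('expected "lat,lon", got: ' + repr(coordinates))
    lat, lon = (float(part.strip()) for part in parts)
    # Reject values that cannot be valid geographic coordinates
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError('coordinates out of range: ' + repr(coordinates))
    return lat, lon


# Example with the map center used in this notebook
print(parse_coordinates('55.755825,37.617298'))
```

Validating early makes a malformed table row fail at parse time rather than when the map is drawn.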

Let's visualize the station data.

In [123]:
#create map of Moscow with all stations
map_moscow_metro = folium.Map(location=[55.755825, 37.617298], zoom_start=10)
#add markers
for station, coordinates in zip(stations_df['English name'], stations_df['Coordinates']):
    latlong = [float(x) for x in coordinates.split(',')]
    #Also add a 500-meter circle around the station to visualize our neighborhoods
    folium.Circle(
        latlong,        
        radius=500
    ).add_to(map_moscow_metro)
    #Add marker with popup
    folium.Circle(
        latlong,
        popup=station,
        radius=20
    ).add_to(map_moscow_metro)
    
    
map_moscow_metro

### Venues and categories
We will use the Foursquare API to explore the venue categories surrounding each station. Venues can be categorized as residential, professional, shopping or leisure.
Let's see which venue categories Foursquare identifies.

In [8]:
# read the API credentials and close the file afterwards
with open('secrets.json') as f:
    secrets = json.load(f)
CLIENT_ID = secrets['CLIENT_ID']
CLIENT_SECRET = secrets['CLIENT_SECRET']
VERSION = secrets['VERSION']

In [77]:
categories_url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)
            
# make the GET request
results = requests.get(categories_url).json()

In [78]:
len(results['response']['categories'])

10

There are 10 top-level categories and multiple subcategories.

In [118]:
categories_list = []
# Let's print only the top-level categories and their IDs and also add them to categories_list

def print_categories(categories, level=0, max_level=0):
    if level > max_level:
        return
    prefix = '-' * level
    for category in categories:
        print(prefix + category['name'] + ' (' + category['id'] + ')')
        # pass max_level down so that deeper levels can be printed on request
        print_categories(category['categories'], level + 1, max_level)
        categories_list.append((category['name'], category['id']))
        
print_categories(results['response']['categories'], 0)

Arts & Entertainment (4d4b7104d754a06370d81259)
College & University (4d4b7105d754a06372d81259)
Event (4d4b7105d754a06373d81259)
Food (4d4b7105d754a06374d81259)
Nightlife Spot (4d4b7105d754a06376d81259)
Outdoors & Recreation (4d4b7105d754a06377d81259)
Professional & Other Places (4d4b7105d754a06375d81259)
Residence (4e67e38e036454776db1fb3a)
Shop & Service (4d4b7105d754a06378d81259)
Travel & Transport (4d4b7105d754a06379d81259)


We can use the Foursquare explore API with categoryId to query the number of venues of each category within a specific radius. The response contains a totalResults value for the specified coordinates, radius and category.

In [108]:
def get_venues_count(ll, radius, categoryId):
    explore_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={}&radius={}&categoryId={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION,
                ll,
                radius,
                categoryId)

    # make the GET request
    return requests.get(explore_url).json()['response']['totalResults']
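
The query string above is assembled by hand with `format`. As an alternative sketch, the same parameters can be kept in a dictionary and encoded with the standard library. The helper name `build_explore_params` is hypothetical, and the credentials and version string in the example are placeholders rather than real values.

```python
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_params(client_id, client_secret, version, ll, radius, category_id):
    """Collect the query parameters used by get_venues_count in one dictionary."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': ll,
        'radius': radius,
        'categoryId': category_id,
    }


# Placeholder credentials and version, for illustration only
params = build_explore_params('ID', 'SECRET', '20180605', '55.755825,37.617298', 500,
                              '4d4b7105d754a06374d81259')
# urlencode escapes special characters such as the comma in ll
print(EXPLORE_URL + '?' + urlencode(params))
```

The same dictionary can be passed as `requests.get(EXPLORE_URL, params=params)`, which leaves the encoding to `requests`.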

In [121]:
#Find the count of venues in each category for a sample station downtown
station = stations_df.loc[stations_df['English name']=='Krasnopresnenskaya']

for category in categories_list:
    count = get_venues_count(station.Coordinates.iloc[0], 500, category[1])
    print(category[0] + '\t' + str(count))

Arts & Entertainment	24
College & University	13
Event	0
Food	15
Nightlife Spot	8
Outdoors & Recreation	16
Professional & Other Places	80
Residence	10
Shop & Service	53
Travel & Transport	9
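
To classify a station by primary usage, the raw counts can be converted to shares of the station's total, which makes busy and quiet stations comparable. The following is one possible sketch, not necessarily the method used later in the project; `category_shares` and `dominant_category` are hypothetical helpers, and the counts are the ones printed above for Krasnopresnenskaya.

```python
def category_shares(counts):
    """Return each category's share of the total venue count."""
    total = sum(counts.values())
    if total == 0:
        # No venues at all: every share is zero
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


def dominant_category(counts):
    """Return the category with the most venues."""
    return max(counts, key=counts.get)


# Venue counts within 500 meters of Krasnopresnenskaya, copied from the output above
counts = {
    'Arts & Entertainment': 24,
    'College & University': 13,
    'Event': 0,
    'Food': 15,
    'Nightlife Spot': 8,
    'Outdoors & Recreation': 16,
    'Professional & Other Places': 80,
    'Residence': 10,
    'Shop & Service': 53,
    'Travel & Transport': 9,
}

shares = category_shares(counts)
print(dominant_category(counts))
print(round(shares['Professional & Other Places'], 3))
```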


### (Optional) Housing data
We can use housing data from https://www.reformagkh.ru/opendata?gid=2280999&cids=house_management&pageSize=10 (in Russian). This dataset provides addresses and total residential areas for all residential buildings in Moscow. We can use the residential area to estimate the number of residents in a neighborhood.  
The dataset does not include coordinates, so we will have to use a geocoder to locate each building.
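
As a rough sketch of the resident estimate, the total residential area of the buildings in a neighborhood can be divided by an assumed living area per person. The function name `estimate_residents` and all figures in the example are hypothetical; the area per person has to come from an external source and is not part of the dataset.

```python
def estimate_residents(residential_areas_m2, area_per_person_m2):
    """Estimate residents from building areas and an assumed area per person."""
    if area_per_person_m2 <= 0:
        raise ValueError('area_per_person_m2 must be positive')
    return sum(residential_areas_m2) / area_per_person_m2


# Hypothetical areas (in square meters) of three buildings near one station
areas = [4200.0, 3100.0, 5600.0]
# Hypothetical living area per person, for illustration only
print(estimate_residents(areas, area_per_person_m2=20.0))
```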