<h1>Exploration of air pollution and venues in the boroughs of St. Petersburg </h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">

## Table of contents
* [Introduction: Business Problem](#Introduction:-Business-Problem)
* [Data](#Data)
* [Methodology](#Methodology)
* [Analysis](#Analysis)
* [Results and discussion](#Results-and-discussion)
* [Conclusion](#Conclusion)
</div>

<h2>Introduction: Business Problem</h2>

This project is aimed, first of all, to all people who are looking for a house or an apartment in the city or who want to open a new venue. This project will allow people to look for the best house in terms of environmental conditions and the availability of various facilities and venues in the borough. This report will be especially useful for those who have respiratory diseases or other diseases in which clean air is important.

I will consider various indicators of air pollution from stations located in different boroughs of the city. I also explore these areas by variety of venues. Based on the data I will try to group different boroughs of the city and give the name and characteristics of each group so that you can easily make a decision about which borough is the best for you.

<h2>Data</h2>


I will use data about air pollution from <a href="http://www.infoeco.ru/index.php?id=8222"> the offical site about ecology in Saint Petersburg</a>.
There is one docx file for each day. Each file contains observations from 24 stations with next indicators:
<ul>
<li>carbon_monoxide</li>
<li>nitric_oxide</li>
<li>nitrogen_dioxide</li>	
</ul>
I decided to use observations for January because, since February, the city has quarantined the coronavirus.

I will use Google Maps API for geocoding stations addresses and Folium for building a map.

Also, I will use the Foursquare API for getting venues in each borough. 

<h3> Stations</h3>


Firstly, I do the list of all stations by my own in CSV file. Load the file and create a dataframe.

In [84]:
import pandas as pd
stations=pd.read_csv('eco_data/stations.csv',sep=';')
stations.head()

Unnamed: 0,station_name,borough,address
0,АСМ-АВ №10,Адмиралтейский район,"Московский пр., дом 19"
1,АСМ-АВ №24,Василеостровский район,"В.О.Средний пр., дом 74"
2,АСМ-АВ №6,Василеостровский район,"пр. КИМа, дом 26 лит. А"
3,АСМ-АВ №3,Выборгский район,"ул. Карбышева, дом 7"
4,АСМ-АВ №18,Калининский район,"ул. Ольги Форш, дом 6"


Installing googlemaps package in order to get coordinates of each stations

In [85]:
! pip install -U googlemaps

Requirement already up-to-date: googlemaps in /Users/anastasiiapoplaukhina/.conda/envs/Coursera_Capstone/lib/python3.8/site-packages (4.4.1)


Connection to google maps

In [87]:
import googlemaps
f=open('temp/api_google_maps_key')
mykey=f.read()
mykey
f.close()
gmaps = googlemaps.Client(key=mykey)

Getting coordinates of each stations

In [88]:
stations['lat']=0.0
stations['lng']=0.0
for i in range(len(stations)):
    geocode_result = gmaps.geocode('Санкт-Петербург, '+stations.loc[i,'address'])
    stations.loc[i,'lat']=geocode_result[0]['geometry']['location']['lat']
    stations.loc[i,'lng']=geocode_result[0]['geometry']['location']['lng']

stations.head()

Unnamed: 0,station_name,borough,address,lat,lng
0,АСМ-АВ №10,Адмиралтейский район,"Московский пр., дом 19",59.91798,30.316883
1,АСМ-АВ №24,Василеостровский район,"В.О.Средний пр., дом 74",59.938864,30.262505
2,АСМ-АВ №6,Василеостровский район,"пр. КИМа, дом 26 лит. А",59.953574,30.243792
3,АСМ-АВ №3,Выборгский район,"ул. Карбышева, дом 7",59.992284,30.350745
4,АСМ-АВ №18,Калининский район,"ул. Ольги Форш, дом 6",60.043028,30.392076


We need coordinates of St. Petersburg if we want to create a map.

In [89]:
geocode_piter = gmaps.geocode('Санкт-Петербург')
piter_lat=geocode_result[0]['geometry']['location']['lat']
piter_lng=geocode_result[0]['geometry']['location']['lng']
print('Coordinates of St.Petersburg are',piter_lat,', ',piter_lng)

Coordinates of St.Petersburg are 59.9489663 ,  30.3748283


I add transliterate library to convert russian letters into english.

In [90]:
! pip install transliterate



In [91]:
from transliterate import translit

After that I create a map with all the stations using Folium

In [92]:
import folium
map_piter = folium.Map(location=[piter_lat, piter_lng], zoom_start=9)

# add markers to map
for lat, lng, name in zip(stations['lat'],stations['lng'],stations['station_name']):
    label = '{}'.format(translit(name,reversed=True))
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_piter)
    
map_piter

<h3>Getting air pollution indicators<h3>

Indicators placed in docx files. Firstly, I install python-docx

In [93]:
! pip install python-docx



In [94]:
import docx

Creation dataframe for indicators

In [95]:
eco_df=pd.DataFrame(columns=['station_name','date','carbon_monoxide','nitric_oxide','nitrogen_dioxide'])

Each file has a name like "ddmmyyyy.docx". I use a "while" loop to look through all files. Each file contains tables for all the stations. I use "for" loop to write into eco_df dataframe all the data.

In [96]:
import datetime
tdate=datetime.datetime(2020,1,1)
k=0
while tdate<=datetime.datetime(2020,1,31):
    docname='eco_data/'+tdate.strftime("%d%m%Y")+'.docx'
    doc=docx.Document(docname)
    for i in range(len(stations)):
        eco_df.loc[i+k*len(stations),'date']=tdate.strftime("%d.%m.%Y")
        eco_df.loc[i+k*len(stations),'station_name']=stations.loc[i,'station_name']
        eco_df.loc[i+k*len(stations),'borough']=stations.loc[i,'borough']
        for j in range(len(doc.tables[i].rows)):
            if doc.tables[i].cell(j,0).text=='Оксид углерода':
                eco_df.loc[i+k*len(stations),'carbon_monoxide']=doc.tables[i].cell(j,1).text
            elif doc.tables[i].cell(j,0).text=='Оксид азота':
                eco_df.loc[i+k*len(stations),'nitric_oxide']=doc.tables[i].cell(j,1).text
            elif doc.tables[i].cell(j,0).text=='Диоксид азота':
                eco_df.loc[i+k*len(stations),'nitrogen_dioxide']=doc.tables[i].cell(j,1).text
    k+=1
    tdate+=datetime.timedelta(days=1)
eco_df.shape

(744, 6)

Let's check eco_df

In [97]:
eco_df.head()

Unnamed: 0,station_name,date,carbon_monoxide,nitric_oxide,nitrogen_dioxide,borough
0,АСМ-АВ №10,01.01.2020,-*,-*,-*,Адмиралтейский район
1,АСМ-АВ №24,01.01.2020,0.1,менее 0.1,0.3,Василеостровский район
2,АСМ-АВ №6,01.01.2020,менее 0.1,менее 0.1,0.3,Василеостровский район
3,АСМ-АВ №3,01.01.2020,0.1,менее 0.1,0.5,Выборгский район
4,АСМ-АВ №18,01.01.2020,0.2,менее 0.1,0.3,Калининский район


Let's clear the data. "менее 0.1" means "less 0.1". I replace it with 0. Unfortunally, we don't have all indicators for all days. I replace this data with the most frequent value.

In [98]:
eco_df.replace('менее 0.1',0.0,inplace=True)
eco_df['carbon_monoxide'].replace('-*',eco_df['carbon_monoxide'].value_counts().idxmax(),inplace=True)
eco_df['nitric_oxide'].replace('-*',eco_df['nitric_oxide'].value_counts().idxmax(),inplace=True)
eco_df['nitrogen_dioxide'].replace('-*',eco_df['nitrogen_dioxide'].value_counts().idxmax(),inplace=True)
eco_df.head()

Unnamed: 0,station_name,date,carbon_monoxide,nitric_oxide,nitrogen_dioxide,borough
0,АСМ-АВ №10,01.01.2020,0.1,0,0.5,Адмиралтейский район
1,АСМ-АВ №24,01.01.2020,0.1,0,0.3,Василеостровский район
2,АСМ-АВ №6,01.01.2020,0.0,0,0.3,Василеостровский район
3,АСМ-АВ №3,01.01.2020,0.1,0,0.5,Выборгский район
4,АСМ-АВ №18,01.01.2020,0.2,0,0.3,Калининский район


<h2>Lets try to get data from Foursquere about St. Petersburg</h2>

Firstly, I need to create a dataframe with borough and their coordinates.

In [99]:
boroughs_df=stations[['borough','station_name']].groupby(['borough']).count()
boroughs_df=boroughs_df.reset_index()
boroughs_df['lat']=0.0
boroughs_df['lng']=0.0
boroughs_df.rename(columns={'station_name':'number od stations'})
boroughs_df.head()

Unnamed: 0,borough,station_name,lat,lng
0,Адмиралтейский район,1,0.0,0.0
1,Василеостровский район,2,0.0,0.0
2,Выборгский район,1,0.0,0.0
3,Калининский район,1,0.0,0.0
4,Кировский район,2,0.0,0.0


Lets get the boroughs coordinates.

In [100]:
for i in range(len(boroughs_df)):
    geocode_result = gmaps.geocode('Санкт-Петербург, '+boroughs_df.loc[i,'borough'])
    boroughs_df.loc[i,'lat']=geocode_result[0]['geometry']['location']['lat']
    boroughs_df.loc[i,'lng']=geocode_result[0]['geometry']['location']['lng']

boroughs_df.head()

Unnamed: 0,borough,station_name,lat,lng
0,Адмиралтейский район,1,59.910896,30.295336
1,Василеостровский район,2,59.947757,30.231663
2,Выборгский район,1,60.080843,30.255691
3,Калининский район,1,59.994318,30.395216
4,Кировский район,2,59.851233,30.253351


Lets create a dataframe with venues near each borough

In [101]:
venues_df=pd.DataFrame(columns=['borough','borough_lat','borough_lng',
                       'venue_name','venue_lat','venue_lng','venue_category'])

Credentials and parameteres for requests to Foursquere API

In [102]:
import requests # library to handle requests

f=open('temp/foursquere_credentials')
CLIENT_ID=f.readline()
CLIENT_SECRET=f.readline()
f.close()

VERSION = '20180605'

LIMIT = 1000 # limit of number of venues returned by Foursquare API
radius = 3000 # define radius


Creating a function for getting nearby venues

In [103]:
def getNearbyVenues(names, latitudes, longitudes, radius):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['borough',
                             'borough_lat',
                             'borough_lng',
                             'venue_name',
                             'venue_lat',
                             'venue_lng',
                             'venue_category']

    return(nearby_venues)

Lets get venues near each stations

In [106]:
venues_df=getNearbyVenues(boroughs_df['borough'],boroughs_df['lat'],boroughs_df['lng'],radius)
venues_df.head()

Адмиралтейский район
Василеостровский район
Выборгский район
Калининский район
Кировский район
Колпинский район
Красногвардейский район
Красносельский район
Кронштадтский район
Курортный район
Московский район
Невский район
Петроградский  район
Петродворцовый район
Приморский район
Пушкинский район
Фрунзенский район
Центральный район


Unnamed: 0,borough,borough_lat,borough_lng,venue_name,venue_lat,venue_lng,venue_category
0,Адмиралтейский район,59.910896,30.295336,Булочная Ф. Вольчека,59.910548,30.297582,Bakery
1,Адмиралтейский район,59.910896,30.295336,Палантин,59.913655,30.296419,Hotel
2,Адмиралтейский район,59.910896,30.295336,Sibaristica Coffee Roasters,59.910371,30.284017,Coffee Shop
3,Адмиралтейский район,59.910896,30.295336,А1 на Циолковского,59.910931,30.286705,Music Venue
4,Адмиралтейский район,59.910896,30.295336,The Gamma Hotel,59.908311,30.292119,Hotel


Check the number of rows and columns of venues_df

In [107]:
venues_df.shape

(1330, 7)

<h2>Methodology</h2>


Firstly, I am going to build a choropleth map based on the average value of all indicators. It will show us the quality of the air of each borough of the city.

Secondly, I am going to group venues on categories for each area and calculate the frequency of each category in each borough. After that, I will use the top-10 categories for each borough in the next analysis. I want to add labels with top-10 on the choropleth map I have built in the previous step.

Thirdly, I am going to use the K-means algorithm to cluster the data with air pollution indicators and top-10 venue categories. I am going to visualize the result on the map with markers where each cluster will have its own color.

As a result, I represented this map, the table with data and descriptions for each cluster.

<h1>Analysis</h1>


<h2>Choropleth map</h2>

Lets find the month average value for each indicators in each borough



In [108]:
#At first, I convert date to floats
eco_df['carbon_monoxide']=eco_df['carbon_monoxide'].astype(float)
eco_df['nitric_oxide']=eco_df['nitric_oxide'].astype(float)
eco_df['nitrogen_dioxide']=eco_df['nitrogen_dioxide'].astype(float)


eco_avg_df=eco_df[['borough','carbon_monoxide','nitric_oxide','nitrogen_dioxide']].groupby(['borough']).mean()
eco_avg_df=eco_avg_df.reset_index()
eco_avg_df.head()

Unnamed: 0,borough,carbon_monoxide,nitric_oxide,nitrogen_dioxide
0,Адмиралтейский район,0.096774,0.074194,0.558065
1,Василеостровский район,0.096774,0.06129,0.533871
2,Выборгский район,0.1,0.122581,0.751613
3,Калининский район,0.112903,0.029032,0.470968
4,Кировский район,0.093548,0.172581,0.533871


After that, i scale the data and add a sum value.

In [109]:
from sklearn import preprocessing
# standardize the data attributes
eco_avg_df['carbon_monoxide'] = preprocessing.scale(eco_avg_df['carbon_monoxide'])
eco_avg_df['nitric_oxide'] = preprocessing.scale(eco_avg_df['nitric_oxide'])
eco_avg_df['nitrogen_dioxide'] = preprocessing.scale(eco_avg_df['nitrogen_dioxide'])
eco_avg_df['sum']=eco_avg_df['carbon_monoxide']+eco_avg_df['nitric_oxide']+eco_avg_df['nitrogen_dioxide']

eco_avg_df.head()

Unnamed: 0,borough,carbon_monoxide,nitric_oxide,nitrogen_dioxide,sum
0,Адмиралтейский район,0.09245,0.031978,0.407486,0.531914
1,Василеостровский район,0.09245,-0.177332,0.297611,0.21273
2,Выборгский район,0.3698,0.816888,1.286483,2.473172
3,Калининский район,1.479201,-0.700605,0.011937,0.790532
4,Кировский район,-0.1849,1.627962,0.297611,1.740673


Lets create a choropleth map.

In [110]:
map_piter = folium.Map(location=[piter_lat, piter_lng], zoom_start=9)
map_piter.choropleth(
    geo_data=r'eco_data/spb.geojson',
    data=eco_avg_df,
    columns=['borough','sum'],
    key_on='feature.properties.name',
    fill_color='RdPu',
    fill_opacity=0.8,
    line_opacity=0.2,
    legend_name='Air pollution in St.Petersburg'
)
map_piter

<h2>Grouping venues</h2>

Lets group venues by boroughs and categories.

In [111]:
venues_df_gr=venues_df[['borough','venue_category','venue_name']].groupby(['borough','venue_category']).count()
venues_df_gr=venues_df_gr.rename(columns={'venue_name':'number of venues'})
venues_df_gr

Unnamed: 0_level_0,Unnamed: 1_level_0,number of venues
borough,venue_category,Unnamed: 2_level_1
Адмиралтейский район,Arcade,1
Адмиралтейский район,Bakery,6
Адмиралтейский район,Bar,2
Адмиралтейский район,Bath House,1
Адмиралтейский район,Bed & Breakfast,1
...,...,...
Центральный район,Sushi Restaurant,1
Центральный район,Theater,6
Центральный район,Wine Bar,3
Центральный район,Wine Shop,1


Lets sort venues_df_gr by boroughs, categories and the number of venues

In [112]:
venues_df_gr=venues_df_gr.reset_index()
venues_df_gr=venues_df_gr.sort_values(by=['borough','number of venues'],ascending=False)
venues_df_gr=venues_df_gr.reset_index(drop=True)
venues_df_gr.head()

Unnamed: 0,borough,venue_category,number of venues
0,Центральный район,Coffee Shop,8
1,Центральный район,Bakery,6
2,Центральный район,Bar,6
3,Центральный район,Theater,6
4,Центральный район,Bookstore,4


Lets prepare another dataframe for most common venues

In [113]:
import numpy as np
num_top_venues = 10

columns = ['borough']
for ind in np.arange(num_top_venues):
    columns.append(ind+1)
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted 

Unnamed: 0,borough,1,2,3,4,5,6,7,8,9,10


After that, I fill venues_sorted with data from venues_df_gr

In [114]:
k=0
j=1
temp=venues_df_gr.loc[0,'borough']
venues_sorted.loc[0,'borough']=venues_df_gr.loc[0,'borough']
for i in range(len(venues_df_gr)):
    if temp!=venues_df_gr.loc[i,'borough']:
        temp=venues_df_gr.loc[i,'borough']
        k+=1
        venues_sorted.loc[k,'borough']=venues_df_gr.loc[i,'borough']
        j=1
    if j<=10:
        venues_sorted.loc[k,j]=venues_df_gr.loc[i,'venue_category']
        j+=1
venues_sorted

Unnamed: 0,borough,1,2,3,4,5,6,7,8,9,10
0,Центральный район,Coffee Shop,Bakery,Bar,Theater,Bookstore,Cocktail Bar,Gastropub,Hookah Bar,Health Food Store,Italian Restaurant
1,Фрунзенский район,Bakery,Café,Park,Soccer Field,Clothing Store,Doner Restaurant,Gym / Fitness Center,Wine Shop,Arcade,Auto Workshop
2,Пушкинский район,Historic Site,Park,Stables,Bakery,History Museum,Palace,Russian Restaurant,Salon / Barbershop,Soccer Field,Auto Workshop
3,Приморский район,Bakery,Caucasian Restaurant,Park,Gym / Fitness Center,Hookah Bar,Playground,Bar,Beer Bar,Gym,Health & Beauty Service
4,Петродворцовый район,Supermarket,Auto Workshop,Café,Japanese Restaurant,Mobile Phone Shop,Park,Train Station,Arcade,Automotive Shop,Bed & Breakfast
5,Петроградский район,Coffee Shop,Gastropub,Bakery,Bar,Plaza,Spa,Wine Bar,Wine Shop,Gym,Hookah Bar
6,Невский район,Bakery,Clothing Store,Auto Workshop,Park,Gym,Restaurant,Cosmetics Shop,Gym / Fitness Center,Middle Eastern Restaurant,Shoe Store
7,Московский район,Clothing Store,Airport Lounge,Boutique,Airport,Airport Service,Auto Workshop,Flower Shop,Shoe Store,Coffee Shop,Hotel
8,Курортный район,Restaurant,Beach,Café,Outdoor Sculpture,Bus Stop,Food & Drink Shop,Hotel,Pool,Spa,Deli / Bodega
9,Кронштадтский район,Historic Site,Park,History Museum,Bakery,Fountain,Gym,Harbor / Marina,Athletics & Sports,Bath House,Beach


Adding borough coordinates into venues_sorted

In [115]:
venues_sorted=venues_sorted.join(boroughs_df[['borough','lat','lng']].set_index('borough'),on='borough')
venues_sorted.head()

Unnamed: 0,borough,1,2,3,4,5,6,7,8,9,10,lat,lng
0,Центральный район,Coffee Shop,Bakery,Bar,Theater,Bookstore,Cocktail Bar,Gastropub,Hookah Bar,Health Food Store,Italian Restaurant,59.93093,30.361898
1,Фрунзенский район,Bakery,Café,Park,Soccer Field,Clothing Store,Doner Restaurant,Gym / Fitness Center,Wine Shop,Arcade,Auto Workshop,59.859413,30.392462
2,Пушкинский район,Historic Site,Park,Stables,Bakery,History Museum,Palace,Russian Restaurant,Salon / Barbershop,Soccer Field,Auto Workshop,59.721878,30.410222
3,Приморский район,Bakery,Caucasian Restaurant,Park,Gym / Fitness Center,Hookah Bar,Playground,Bar,Beer Bar,Gym,Health & Beauty Service,60.023158,30.208458
4,Петродворцовый район,Supermarket,Auto Workshop,Café,Japanese Restaurant,Mobile Phone Shop,Park,Train Station,Arcade,Automotive Shop,Bed & Breakfast,59.889345,29.796354


Adding labels with the most common venues on the map

In [116]:
for i in range(len(venues_sorted)):
    mystr=translit(venues_sorted.loc[i,'borough'],reversed=True)+" "
    for j in range(10):
        mystr=mystr+str(j+1)+'. '+venues_sorted.loc[i,j+1]+' '
    label = folium.Popup(mystr, parse_html=True)
    folium.CircleMarker(
        [venues_sorted.loc[i,'lat'], venues_sorted.loc[i,'lng']],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(map_piter)
map_piter

<h2>Clusterization</h2>

Lets check venues_df

In [117]:
venues_df=venues_df.set_index('borough')
venues_df.head()

Unnamed: 0_level_0,borough_lat,borough_lng,venue_name,venue_lat,venue_lng,venue_category
borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Адмиралтейский район,59.910896,30.295336,Булочная Ф. Вольчека,59.910548,30.297582,Bakery
Адмиралтейский район,59.910896,30.295336,Палантин,59.913655,30.296419,Hotel
Адмиралтейский район,59.910896,30.295336,Sibaristica Coffee Roasters,59.910371,30.284017,Coffee Shop
Адмиралтейский район,59.910896,30.295336,А1 на Циолковского,59.910931,30.286705,Music Venue
Адмиралтейский район,59.910896,30.295336,The Gamma Hotel,59.908311,30.292119,Hotel


Next, let's group rows by borough and by taking the mean of the frequency of occurrence of each category

In [118]:
venues_df_freq = pd.get_dummies(venues_df[['venue_category']], prefix="", prefix_sep="")
venues_df_freq = venues_df_freq.groupby('borough').mean().sort_values('borough',ascending=False).reset_index()
venues_df_freq=venues_df_freq.join(eco_avg_df.set_index('borough'),on='borough').drop('sum',1)
venues_df_freq.head()

Unnamed: 0,borough,Accessories Store,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Apres Ski Bar,Arcade,Art Gallery,...,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo Exhibit,carbon_monoxide,nitric_oxide,nitrogen_dioxide
0,Центральный район,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.03,0.01,0.0,0.03,0.0,-0.1849,-0.909914,1.052084
1,Фрунзенский район,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,...,0.0,0.0,0.0,0.03,0.02,0.0,0.01,0.9245,1.078525,0.041237
2,Пушкинский район,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3698,-1.171551,-1.570259
3,Приморский район,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.012346,0.024691,0.0,0.012346,0.0,-0.1849,0.659906,0.231686
4,Петродворцовый район,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.46225,-0.909914,-1.584908


Run *k*-means to cluster the boroughs into clusters.

In [119]:
from sklearn.cluster import KMeans
kclusters = 5

venues_df_clustering = venues_df_freq.drop('borough', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venues_df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:] 

array([2, 0, 1, 0, 1, 1, 3, 2, 4, 4, 0, 0, 4, 0, 2, 0, 2, 2], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each borough.

In [120]:
sbp_merged = venues_sorted
sbp_merged = sbp_merged.join(eco_avg_df.set_index('borough'),on='borough')
sbp_merged.insert(0, 'cluster_labels', kmeans.labels_)

sbp_merged.head() 

Unnamed: 0,cluster_labels,borough,1,2,3,4,5,6,7,8,9,10,lat,lng,carbon_monoxide,nitric_oxide,nitrogen_dioxide,sum
0,2,Центральный район,Coffee Shop,Bakery,Bar,Theater,Bookstore,Cocktail Bar,Gastropub,Hookah Bar,Health Food Store,Italian Restaurant,59.93093,30.361898,-0.1849,-0.909914,1.052084,-0.04273
1,0,Фрунзенский район,Bakery,Café,Park,Soccer Field,Clothing Store,Doner Restaurant,Gym / Fitness Center,Wine Shop,Arcade,Auto Workshop,59.859413,30.392462,0.9245,1.078525,0.041237,2.044262
2,1,Пушкинский район,Historic Site,Park,Stables,Bakery,History Museum,Palace,Russian Restaurant,Salon / Barbershop,Soccer Field,Auto Workshop,59.721878,30.410222,0.3698,-1.171551,-1.570259,-2.37201
3,0,Приморский район,Bakery,Caucasian Restaurant,Park,Gym / Fitness Center,Hookah Bar,Playground,Bar,Beer Bar,Gym,Health & Beauty Service,60.023158,30.208458,-0.1849,0.659906,0.231686,0.706692
4,1,Петродворцовый район,Supermarket,Auto Workshop,Café,Japanese Restaurant,Mobile Phone Shop,Park,Train Station,Arcade,Automotive Shop,Bed & Breakfast,59.889345,29.796354,-0.46225,-0.909914,-1.584908,-2.957073


In [121]:
# set color scheme for the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for i in range(len(sbp_merged)):
    mystr=translit(sbp_merged.loc[i,'borough'],reversed=True)+" "
    for j in range(10):
        mystr=mystr+str(j+1)+'. '+venues_sorted.loc[i,j+1]+' '
    label = folium.Popup(mystr, parse_html=True)
    folium.CircleMarker(
        [sbp_merged.loc[i,'lat'], sbp_merged.loc[i,'lng']],
        radius=5,
        popup=label,
        fill=True,
        color=rainbow[sbp_merged.loc[i,'cluster_labels']],
        fill_color=rainbow[sbp_merged.loc[i,'cluster_labels']],
        fill_opacity=0.7).add_to(map_piter)
map_piter

<h1>Results and discussion</h1>

Now, we can examine each cluster and determine the discriminating venue categories and the level of ar pollution that distinguish each cluster. 

<h2>Cluster 1<h2>

It as a claster with resedential area with the moderate level of air pollution. These boroughs have bakeries, gyms and auto workshops, parks and soccer fields. Thera are bars and restaurants in these boroughs, but their number is less than in the city center. These boroughs suits for everybody, it is good place for life.

In [122]:
sbp_merged[sbp_merged['cluster_labels']==0]

Unnamed: 0,cluster_labels,borough,1,2,3,4,5,6,7,8,9,10,lat,lng,carbon_monoxide,nitric_oxide,nitrogen_dioxide,sum
1,0,Фрунзенский район,Bakery,Café,Park,Soccer Field,Clothing Store,Doner Restaurant,Gym / Fitness Center,Wine Shop,Arcade,Auto Workshop,59.859413,30.392462,0.9245,1.078525,0.041237,2.044262
3,0,Приморский район,Bakery,Caucasian Restaurant,Park,Gym / Fitness Center,Hookah Bar,Playground,Bar,Beer Bar,Gym,Health & Beauty Service,60.023158,30.208458,-0.1849,0.659906,0.231686,0.706692
10,0,Красносельский район,Auto Workshop,Bakery,Beer Bar,Caucasian Restaurant,Women's Store,Arts & Crafts Store,Big Box Store,Blini House,Bookstore,Border Crossing,59.83228,30.126223,0.3698,0.345942,-0.442212,0.27353
11,0,Красногвардейский район,Pharmacy,Supermarket,Auto Workshop,Caucasian Restaurant,Bus Stop,Café,Coffee Shop,Gym,Gym / Fitness Center,Park,59.974905,30.471507,0.3698,0.398269,1.027667,1.795737
13,0,Кировский район,Bakery,Auto Workshop,Gym / Fitness Center,Wine Shop,Soccer Field,Baby Store,Bath House,Beer Store,Bus Stop,Coffee Shop,59.851233,30.253351,-0.1849,1.627962,0.297611,1.740673
15,0,Выборгский район,Racetrack,Stables,Café,Fast Food Restaurant,Motorcycle Shop,Mountain,Park,Pet Store,American Restaurant,Athletics & Sports,60.080843,30.255691,0.3698,0.816888,1.286483,2.473172


<h2>Cluster 2<h2>

It is a cluster with the low level of air pollution. Also boroughs in this cluster have different enterteiment venues like historic sites, stables and restaurants. These boroughs suit people with unusual interests who wants to open new things every day.

In [123]:
sbp_merged[sbp_merged['cluster_labels']==1]

Unnamed: 0,cluster_labels,borough,1,2,3,4,5,6,7,8,9,10,lat,lng,carbon_monoxide,nitric_oxide,nitrogen_dioxide,sum
2,1,Пушкинский район,Historic Site,Park,Stables,Bakery,History Museum,Palace,Russian Restaurant,Salon / Barbershop,Soccer Field,Auto Workshop,59.721878,30.410222,0.3698,-1.171551,-1.570259,-2.37201
4,1,Петродворцовый район,Supermarket,Auto Workshop,Café,Japanese Restaurant,Mobile Phone Shop,Park,Train Station,Arcade,Automotive Shop,Bed & Breakfast,59.889345,29.796354,-0.46225,-0.909914,-1.584908,-2.957073
5,1,Петроградский район,Coffee Shop,Gastropub,Bakery,Bar,Plaza,Spa,Wine Bar,Wine Shop,Gym,Hookah Bar,59.963515,30.289567,0.3698,0.031978,-0.95496,-0.553182


<h2>Cluster 3<h2>

It is another good place for living. Air in this place are better than in cluster 1, but worse than in cluster 2. There are theaters, dance studios, concerts halls and art galleries. These boroughs suit for people who are intersted in art.

In [124]:
sbp_merged[sbp_merged['cluster_labels']==2]

Unnamed: 0,cluster_labels,borough,1,2,3,4,5,6,7,8,9,10,lat,lng,carbon_monoxide,nitric_oxide,nitrogen_dioxide,sum
0,2,Центральный район,Coffee Shop,Bakery,Bar,Theater,Bookstore,Cocktail Bar,Gastropub,Hookah Bar,Health Food Store,Italian Restaurant,59.93093,30.361898,-0.1849,-0.909914,1.052084,-0.04273
7,2,Московский район,Clothing Store,Airport Lounge,Boutique,Airport,Airport Service,Auto Workshop,Flower Shop,Shoe Store,Coffee Shop,Hotel,59.812904,30.303712,0.3698,-0.59595,0.715135,0.488985
14,2,Калининский район,Park,Bakery,Auto Workshop,Gym / Fitness Center,Gastropub,Cosmetics Shop,Dance Studio,Lingerie Store,Clothing Store,Coffee Shop,59.994318,30.395216,1.479201,-0.700605,0.011937,0.790532
16,2,Василеостровский район,Park,Dance Studio,Bakery,Coffee Shop,Restaurant,Art Gallery,Beach,Beer Store,Café,Gastropub,59.947757,30.231663,0.09245,-0.177332,0.297611,0.21273
17,2,Адмиралтейский район,Bakery,Hotel,Theater,Café,Coffee Shop,Concert Hall,Historic Site,Hookah Bar,Plaza,Vegetarian / Vegan Restaurant,59.910896,30.295336,0.09245,0.031978,0.407486,0.531914


<h2>Cluster 4<h2>

There is the worst air in this cluster. It is a usual resedential area. I recommend to choose cluster 1 instes of this one.

In [125]:
sbp_merged[sbp_merged['cluster_labels']==3]

Unnamed: 0,cluster_labels,borough,1,2,3,4,5,6,7,8,9,10,lat,lng,carbon_monoxide,nitric_oxide,nitrogen_dioxide,sum
6,3,Невский район,Bakery,Clothing Store,Auto Workshop,Park,Gym,Restaurant,Cosmetics Shop,Gym / Fitness Center,Middle Eastern Restaurant,Shoe Store,59.882034,30.470063,1.756551,2.439036,2.018981,6.214568


<h2>Cluster 5<h2>

This cluster has the clearest air in the ciry. Also, all the boroughs of this cluster have a beach or a harbor. These boroughs suit for people who like marine views and clear air.

In [126]:
sbp_merged[sbp_merged['cluster_labels']==4]

Unnamed: 0,cluster_labels,borough,1,2,3,4,5,6,7,8,9,10,lat,lng,carbon_monoxide,nitric_oxide,nitrogen_dioxide,sum
8,4,Курортный район,Restaurant,Beach,Café,Outdoor Sculpture,Bus Stop,Food & Drink Shop,Hotel,Pool,Spa,Deli / Bodega,60.167212,29.910505,-2.126351,-1.066897,-1.16006,-4.353307
9,4,Кронштадтский район,Historic Site,Park,History Museum,Bakery,Fountain,Gym,Harbor / Marina,Athletics & Sports,Bath House,Beach,59.995947,29.765595,-2.126351,-1.014569,-1.262609,-4.403529
12,4,Колпинский район,Stables,Auto Workshop,Convenience Store,Dance Studio,Department Store,Food Truck,Go Kart Track,Harbor / Marina,History Museum,Hookah Bar,59.778424,30.588759,-1.2943,-0.883751,-0.412912,-2.590963


<h1>Conclusion</h1>

In this project I explored our city using air pollution indicators and venue categories.

I used data from the official site about ecology in Saint Petersburg and data about venues from Foursquare. I used Google map API for getting coordinates, docx library for reading files with air pollution indicators and Foursquare API for getting a list of venues.

I created choropleth maps using Folium. Also, I added labels with the top-10 venue catagories in each borough.

After that, I clustered all the boroughs using the k-mean algorithm. I got 5 clusters and made description of each cluster. Besides that, I visualized these cluster on the map. Each cluster had its own color.

As a result, this map and description of each cluster can help people when they choose their new home.
