<h1>Exploration air pollution and venues in the boroughs of St. Petersburg </h1>

## Table of contents
* [Introduction: Business Problem](#Introduction:-Business-Problem)
* [Data](#Data)

<h2>Introduction: Business Problem</h2>

This project is aimed, first of all, to all people who are looking for a house or an apartment in the city or who want to open a new venue. This project will allow people to look for the best house in terms of environmental conditions and the availability of various facilities and venues in the area. This report will be especially useful for those who have respiratory diseases or other diseases in which clean air is important.

I will consider various indicators of air pollution from stations located in different areas of the city. I also explore these areas by the number and variety of venues. Based on the data I will try to group different areas of the city and give the name and characteristics of each group so that you can easily make a decision about which area is the best for you.

<h2>Data</h2>


I will use data about air pollution from <a href="http://www.infoeco.ru/index.php?id=8222"> the offical site about ecology in Saint Petersburg</a>.
There is one docx file for each day. Each file contains observations from 24 stations with next indicators:
<ul>
<li>carbon_monoxide</li>
<li>nitric_oxide</li>
<li>nitrogen_dioxide</li>	
</ul>
I decided to use observations for January because, since February, the city has quarantined the coronavirus.

I will use Google Maps API for geocoding stations addresses and Folium for building a map.

Also, I will use the Foursquare API for getting venues near stations. I will group venues on categories for each area. After that, I will use the top-5 categories for making the final dataframe. 

<h3> Stations</h3>


Firstly, I do the list of all stations by my own in CSV file. Load the file and create a dataframe.

In [1]:
import pandas as pd
stations=pd.read_csv('eco_data/stations.csv',sep=';')
stations.head()

Unnamed: 0,name,borough,address
0,АСМ-АВ №10,Адмиралтейский район,"Московский пр., дом 19"
1,АСМ-АВ №24,Василеостровский район,"В.О.Средний пр., дом 74"
2,АСМ-АВ №6,Василеостровский район,"пр. КИМа, дом 26 лит. А"
3,АСМ-АВ №3,Выборгский район,"ул. Карбышева, дом 7"
4,АСМ-АВ №18,Калининский район,"ул. Ольги Форш, дом 6"


Installing googlemaps package in order to get coordinates of each stations

In [2]:
! pip install -U googlemaps

Requirement already up-to-date: googlemaps in /Users/anastasiiapoplaukhina/.conda/envs/Coursera_Capstone/lib/python3.8/site-packages (4.4.1)


Connection to google maps

In [3]:
import googlemaps
f=open('temp/api_google_maps_key')
mykey=f.read()
mykey
f.close()
gmaps = googlemaps.Client(key=mykey)

Getting coordinates of each stations

In [4]:
stations['lat']=0.0
stations['lng']=0.0
for i in range(len(stations)):
    geocode_result = gmaps.geocode('Санкт-Петербург, '+stations.loc[i,'address'])
    stations.loc[i,'lat']=geocode_result[0]['geometry']['location']['lat']
    stations.loc[i,'lng']=geocode_result[0]['geometry']['location']['lng']

stations.head()

Unnamed: 0,name,borough,address,lat,lng
0,АСМ-АВ №10,Адмиралтейский район,"Московский пр., дом 19",59.91798,30.316883
1,АСМ-АВ №24,Василеостровский район,"В.О.Средний пр., дом 74",59.938864,30.262505
2,АСМ-АВ №6,Василеостровский район,"пр. КИМа, дом 26 лит. А",59.953574,30.243792
3,АСМ-АВ №3,Выборгский район,"ул. Карбышева, дом 7",59.992284,30.350745
4,АСМ-АВ №18,Калининский район,"ул. Ольги Форш, дом 6",60.043028,30.392076


We need coordinates of St. Petersburg if we want to create a map.

In [5]:
geocode_piter = gmaps.geocode('Санкт-Петербург')
piter_lat=geocode_result[0]['geometry']['location']['lat']
piter_lng=geocode_result[0]['geometry']['location']['lng']
print('Coordinates of St.Petersburg are',piter_lat,', ',piter_lng)

Coordinates of St.Petersburg are 59.9489663 ,  30.3748283


After that I create a map with all the stations using Folium

In [6]:
import folium
map_piter = folium.Map(location=[piter_lat, piter_lng], zoom_start=9)

# add markers to map
for lat, lng, name in zip(stations['lat'],stations['lng'],stations['name']):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_piter)
    
map_piter

<h3>Getting air pollution indicators<h3>

Indicators placed in docx files. Firstly, I install python-docx

In [7]:
! pip install python-docx



In [8]:
import docx

Creation dataframe for indicators

In [9]:
eco_df=pd.DataFrame(columns=['station_name','date','carbon_monoxide','nitric_oxide','nitrogen_dioxide'])

Each file has a name like "ddmmyyyy.docx". I use a "while" loop to look through all files. Each file contains table for all the station. I use "for" loop to write into eco_df dataframe all the data.

In [10]:
import datetime
tdate=datetime.datetime(2020,1,1)
k=0
while tdate<=datetime.datetime(2020,1,31):
    docname='eco_data/'+tdate.strftime("%d%m%Y")+'.docx'
    doc=docx.Document(docname)
    for i in range(len(stations)):
        eco_df.loc[i+k*len(stations),'date']=tdate.strftime("%d.%m.%Y")
        eco_df.loc[i+k*len(stations),'station_name']=stations.loc[i,'name']
        for j in range(len(doc.tables[i].rows)):
            if doc.tables[i].cell(j,0).text=='Оксид углерода':
                eco_df.loc[i+k*len(stations),'carbon_monoxide']=doc.tables[i].cell(j,1).text
            elif doc.tables[i].cell(j,0).text=='Оксид азота':
                eco_df.loc[i+k*len(stations),'nitric_oxide']=doc.tables[i].cell(j,1).text
            elif doc.tables[i].cell(j,0).text=='Диоксид азота':
                eco_df.loc[i+k*len(stations),'nitrogen_dioxide']=doc.tables[i].cell(j,1).text
    k+=1
    tdate+=datetime.timedelta(days=1)
eco_df.shape

(744, 5)

Let's check eco_df

In [11]:
eco_df.head()

Unnamed: 0,station_name,date,carbon_monoxide,nitric_oxide,nitrogen_dioxide
0,АСМ-АВ №10,01.01.2020,-*,-*,-*
1,АСМ-АВ №24,01.01.2020,0.1,менее 0.1,0.3
2,АСМ-АВ №6,01.01.2020,менее 0.1,менее 0.1,0.3
3,АСМ-АВ №3,01.01.2020,0.1,менее 0.1,0.5
4,АСМ-АВ №18,01.01.2020,0.2,менее 0.1,0.3


Let's clear the data. "менее 0.1" means "less 0.1". I replace it with 0. Unfortunally, we don't have all indicators for all days. I replace this data with the most frequent value.

In [12]:
eco_df.replace('менее 0.1',0.0,inplace=True)
eco_df['carbon_monoxide'].replace('-*',eco_df['carbon_monoxide'].value_counts().idxmax(),inplace=True)
eco_df['nitric_oxide'].replace('-*',eco_df['nitric_oxide'].value_counts().idxmax(),inplace=True)
eco_df['nitrogen_dioxide'].replace('-*',eco_df['nitrogen_dioxide'].value_counts().idxmax(),inplace=True)
eco_df.head()

Unnamed: 0,station_name,date,carbon_monoxide,nitric_oxide,nitrogen_dioxide
0,АСМ-АВ №10,01.01.2020,0.1,0,0.5
1,АСМ-АВ №24,01.01.2020,0.1,0,0.3
2,АСМ-АВ №6,01.01.2020,0.0,0,0.3
3,АСМ-АВ №3,01.01.2020,0.1,0,0.5
4,АСМ-АВ №18,01.01.2020,0.2,0,0.3


<h2>Lets try to get data from Foursquere about St. Petersburg</h2>

Lets create a dataframe for venues near each stations

In [13]:
venues_df=pd.DataFrame(columns=['station_name','station_lat','station_lng',
                       'venue_name','venue_lat','venue_lng','venue_category'])

Credentials and parameteres for requests to Foursquere API

In [14]:
import requests # library to handle requests

f=open('temp/foursquere_credentials')
CLIENT_ID=f.readline()
CLIENT_SECRET=f.readline()
f.close()

VERSION = '20180605'

LIMIT = 500 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

Creating a function for getting nearby venues

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['station_name',
                             'station_lat',
                             'station_lng',
                             'venue_name',
                             'venue_lat',
                             'venue_lng',
                             'venue_category']

    return(nearby_venues)

Lets get venues near each stations

In [16]:
venues_df=getNearbyVenues(stations['name'],stations['lat'],stations['lng'],radius)
venues_df.head()

АСМ-АВ №10
АСМ-АВ №24
АСМ-АВ №6
АСМ-АВ №3
АСМ-АВ №18
АСМ-АВ №5
АСМ-АВ №22
АСМ-АВ №2
АСМ-АВ №25
АСМ-АВ №4
АСМ-АВ №13
АСМ-АВ №14
АСМ-АВ №19
АСМ-АВ №15
АСМ-АВ №11
АСМ-АВ №16
АСМ-АВ №20
АСМ-АВ №1
АСМ-АВ №23
АСМ-АВ №21
АСМ-АВ №8
АСМ-АВ №17
АСМ-АВ №9
АСМ-АВ №7


Unnamed: 0,station_name,station_lat,station_lng,venue_name,venue_lat,venue_lng,venue_category
0,АСМ-АВ №10,59.91798,30.316883,Таблица Менделеева,59.918184,30.317408,Outdoor Sculpture
1,АСМ-АВ №10,59.91798,30.316883,Молодёжный театр на Фонтанке,59.91904,30.31303,Theater
2,АСМ-АВ №10,59.91798,30.316883,"Молодёжный театр на Фонтанке, Малая сцена",59.918919,30.312596,Theater
3,АСМ-АВ №10,59.91798,30.316883,Измайловский сад,59.91953,30.312857,Park
4,АСМ-АВ №10,59.91798,30.316883,Hungry Bags,59.920143,30.319283,Comic Shop
