<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Data-that-we-might-need-(to-scrape)" data-toc-modified-id="Data-that-we-might-need-(to-scrape)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data that we might need (to scrape)</a></span></li><li><span><a href="#Importing-the-datasets" data-toc-modified-id="Importing-the-datasets-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Importing the datasets</a></span><ul class="toc-item"><li><span><a href="#Columns-that-we-don't-need-from-accidents" data-toc-modified-id="Columns-that-we-don't-need-from-accidents-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Columns that we don't need from accidents</a></span></li></ul></li><li><span><a href="#Get-postal-code-of-every-'compteur'" data-toc-modified-id="Get-postal-code-of-every-'compteur'-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Get postal code of every 'compteur'</a></span></li></ul></div>

# Objectives
Analyse the number of cyclists in Paris

* Quantify the rise in the number of cyclists in Paris
    * Get data on traffic and accidents
    * Bike lane construction
* Correlate the accidents with time of day, road conditions and gender
* Is the increase in car traffic leading to more bike accidents?
* Is the increase in bike lanes helping to decrease bike accidents?
* Is the increase in bike traffic leading to more bike lanes? And in which areas?

# Data that we might need (to scrape)
* Public investment (bike lane construction, public incentives to buy bikes)
* Car Traffic in Paris
* Number of bicycles sold in Paris (we might have info about sales of electric bikes in Paris)
* Average Salary in Paris
* Average bike prices

# Importing the datasets

In [2]:
import pandas as pd
from urllib.request import urlopen
import zipfile, io

In [2]:
# Importing the accidents dataset - Imports with no issues
accidents = pd.read_csv('https://www.data.gouv.fr/en/datasets/r/3d5f2317-5afd-4a9f-a9c5-bd4fe0113f39', low_memory=False)

In [73]:
# Importing traffic file of 2018
url = 'https://www.data.gouv.fr/en/datasets/r/58d6b982-4c70-4648-afe4-b80eab61d28d'
archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read())) # Takes some time
csv_path = '2018_comptage-velo-donnees-compteurs.csv' # The desired csv file in the archive
traffic_2018 = pd.read_csv(io.BytesIO(archive.read(csv_path)), sep = ';')

In [74]:
# Importing traffic file of 2019
url = 'https://www.data.gouv.fr/en/datasets/r/9c23d147-4032-429c-9c18-86dabd53e63f'
archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read())) # Takes some time
csv_path = '2019_comptage-velo-donnees-compteurs-2.csv' # The desired csv file in the archive
traffic_2019 = pd.read_csv(io.BytesIO(archive.read(csv_path)), sep = ';')
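
The 2018 and 2019 cells repeat the same download-and-extract steps. A minimal sketch of how they could be factored into a helper is shown below; the function names `read_csv_from_zip_bytes` and `read_csv_from_zip_url` are hypothetical and not part of the original notebook.

```python
import io
import zipfile
from urllib.request import urlopen

import pandas as pd


def read_csv_from_zip_bytes(zip_bytes, csv_path, **read_csv_kwargs):
    """Read one CSV member of an in-memory zip archive into a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(csv_path) as member:
            return pd.read_csv(member, **read_csv_kwargs)


def read_csv_from_zip_url(url, csv_path, **read_csv_kwargs):
    """Download a zip archive and read one CSV member from it."""
    with urlopen(url) as response:  # the whole archive is held in memory
        return read_csv_from_zip_bytes(response.read(), csv_path, **read_csv_kwargs)
```

With this helper, each yearly import would reduce to a single call such as `read_csv_from_zip_url(url, csv_path, sep=';')`, using the same `url` and `csv_path` values as in the cells above.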

In [3]:
# Importing the bike lanes dataset
bike_lanes = pd.read_csv('https://www.data.gouv.fr/en/datasets/r/1211e838-4b77-4ee4-9567-03d78d55f0bf', sep=';')
bike_lanes

Unnamed: 0,Typologie,Aménagement bidirectionnel,Régime de vitesse,Sens vélo,Voie,Arrondissement,Bois,Longueur du tronçon en m,Longueur du tronçon en km,Position aménagement,Circulation générale interdite,Piste,Couloir bus,Continuité cyclable,Réseau cyclable,Date de livraison,geo_shape,geo_point_2d
0,Pistes cyclables,Non,Voie 50,Sens de circulation générale,BOULEVARD VINCENT AURIOL,13.0,Non,331.982187,0.331982,Latéral,,Niveau chaussée,,,,2019-09-15,"{""type"": ""LineString"", ""coordinates"": [[2.3670...","48.8354585987,2.36903535246"
1,Couloirs de bus ouverts aux vélos,Oui,Voie 50,Sens de circulation générale,PONT D AUSTERLITZ,13.0,Non,87.569734,0.087570,,,,Protégé,,,2008-12-31,"{""type"": ""LineString"", ""coordinates"": [[2.3651...","48.8446462139,2.36561303618"
2,Pistes cyclables,Non,Voie 50,Sens de circulation générale,AVENUE D ITALIE,13.0,Non,21.056739,0.021057,,,Niveau chaussée,,,,2005-12-31,"{""type"": ""LineString"", ""coordinates"": [[2.3572...","48.8258206727,2.35725633367"
3,Pistes cyclables,Non,Voie 50,Sens de circulation générale,AVENUE D ITALIE,13.0,Non,26.100499,0.026100,,,Niveau chaussée,,,,2005-12-31,"{""type"": ""LineString"", ""coordinates"": [[2.3570...","48.8264179159,2.35707237369"
4,Couloirs de bus ouverts aux vélos,Non,Voie 50,Sens de circulation générale,RUE TRONCHET,8.0,Non,213.354024,0.213354,,,,Marqué,,,,"{""type"": ""LineString"", ""coordinates"": [[2.3252...","48.8718082566,2.32585179311"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11949,Autres itinéraires cyclables (ex : Aires piéto...,Oui,,Sens de circulation générale,Route de Suresnes,16.0,,569.611903,0.569612,,Oui,,,,,2019-07-20,"{""type"": ""LineString"", ""coordinates"": [[2.2723...","48.869946602,2.26905428583"
11950,Autres itinéraires cyclables (ex : Aires piéto...,,Zone 30,,Rue des Frigos,13.0,,97.550409,0.097550,,,,,,,2020-03-31,"{""type"": ""LineString"", ""coordinates"": [[2.3798...","48.8314530219,2.37948830922"
11951,Autres itinéraires cyclables (ex : Aires piéto...,,Zone 30,Contresens,Rue du Chevaleret,13.0,,135.890653,0.135891,,,,,,,2020-03-31,"{""type"": ""LineString"", ""coordinates"": [[2.3765...","48.82908925,2.37611568574"
11952,Autres itinéraires cyclables (ex : Aires piéto...,,Zone 30,Contresens,Rue Cantagrel,13.0,,252.100429,0.252100,,,,,,,2020-03-31,"{""type"": ""LineString"", ""coordinates"": [[2.3776...","48.8267036506,2.37606943859"


In [6]:
# Importing the traffic dataset
traffic = pd.read_csv('https://www.data.gouv.fr/en/datasets/r/237382af-0e7a-4ef8-9508-b3e9e78adcfd', sep=';')

KeyboardInterrupt: 

In [75]:
# Concatenate the traffic datasets (DataFrame.append was removed in pandas 2.0)
total_traffic = pd.concat([traffic_2018, traffic_2019])

In [19]:
# Importing the postal codes
postal_codes=pd.read_csv('https://raw.githubusercontent.com/tmcdonald92/Projects/master/localisations.csv', sep=',')

In [20]:
postal_codes=postal_codes.drop_duplicates()

In [76]:
# Merging the postal codes in the traffic dataset
total_traffic=total_traffic.merge(postal_codes, how='left', on='Identifiant du compteur')
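
`drop_duplicates()` only removes rows that are identical in every column, so a counter could still appear more than once in `postal_codes` if any other column differs. In that case a left merge would silently multiply traffic rows. The sketch below shows one way to guard against this with the `validate` and `indicator` options of `merge`; the two small frames are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for total_traffic and postal_codes (values are illustrative).
traffic_demo = pd.DataFrame({
    'Identifiant du compteur': ['A', 'A', 'B', 'C'],
    'Comptage horaire': [4, 30, 116, 0],
})
postal_demo = pd.DataFrame({
    'Identifiant du compteur': ['A', 'B'],
    'Postal code': [75014, 75005],
})

merged = traffic_demo.merge(
    postal_demo,
    how='left',
    on='Identifiant du compteur',
    validate='many_to_one',  # raises MergeError if a counter ID repeats on the right
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# A left merge against unique keys keeps the row count unchanged.
rows_preserved = len(merged) == len(traffic_demo)

# Counters for which no postal code was found.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Identifiant du compteur'].unique()
```

If `validate` raises on the real data, deduplicating on the counter ID alone, for example `postal_codes.drop_duplicates(subset='Identifiant du compteur')`, would be one possible fix.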

In [77]:
# Adding the year, month, day and hour columns
total_traffic['Year']=total_traffic['Date et heure de comptage'].str[:4]
total_traffic['Month']=total_traffic['Date et heure de comptage'].str[5:7]
total_traffic['Day']=total_traffic['Date et heure de comptage'].str[8:10]
total_traffic['Hour']=total_traffic['Date et heure de comptage'].str[11:13]

In [78]:
# Adding a day-of-week column; this first requires a column holding the date in datetime format
total_traffic['Date']=total_traffic['Date et heure de comptage'].str[:10]
total_traffic.Date=pd.to_datetime(total_traffic.Date)
total_traffic['day_week']=total_traffic.Date.dt.dayofweek+1 # +1 because dayofweek is 0-based (Monday=0), so Monday becomes 1 and Sunday 7
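
The string slices above produce text columns (`'2018'`, `'11'`, ...). As an alternative sketch, the timestamp can be parsed once and the parts read from the `.dt` accessor, which yields integers. It assumes the first 19 characters of `Date et heure de comptage` follow `YYYY-MM-DDTHH:MM:SS`; the sample values below are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    'Date et heure de comptage': [
        '2018-11-29T01:00:00',
        '2018-12-03T11:00:00',
        '2019-12-29T14:00:00',
    ],
})

# Assumption: the first 19 characters are 'YYYY-MM-DDTHH:MM:SS';
# slicing drops any trailing UTC offset before parsing.
timestamp = pd.to_datetime(
    demo['Date et heure de comptage'].str[:19],
    format='%Y-%m-%dT%H:%M:%S',
)

demo['Year'] = timestamp.dt.year
demo['Month'] = timestamp.dt.month
demo['Day'] = timestamp.dt.day
demo['Hour'] = timestamp.dt.hour
demo['Date'] = timestamp.dt.normalize()         # midnight of the same day
demo['day_week'] = timestamp.dt.dayofweek + 1   # Monday=1 ... Sunday=7
```

Integer columns sort and group numerically, which may be more convenient when correlating counts with the time of day.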

In [79]:
# Drop unnecessary columns from traffic dataset
columns_to_drop=['Nom du compteur_x',
                 'Identifiant du site de comptage',
                 'Nom du site de comptage',
                 'Lien vers photo du site de comptage',
                 'Coordonnées géographiques_x',
                 'Nom du compteur_y',
                 'Coordonnées géographiques_y',
                 'Date et heure de comptage']

total_traffic.drop(columns=columns_to_drop, inplace=True)

In [82]:
total_traffic

Unnamed: 0,Identifiant du compteur,Comptage horaire,Date d'installation du site de comptage,Postal code,Year,Month,Day,Hour,Date,day_week
0,100047547-104047547,4,2018-11-28,75014,2018,11,29,01,2018-11-29,4
1,100047547-104047547,30,2018-11-28,75014,2018,11,29,22,2018-11-29,4
2,100047547-104047547,116,2018-11-28,75014,2018,11,30,17,2018-11-30,5
3,100047547-104047547,0,2018-11-28,75014,2018,12,03,01,2018-12-03,1
4,100047547-104047547,18,2018-11-28,75014,2018,12,03,11,2018-12-03,1
...,...,...,...,...,...,...,...,...,...,...
594549,100056336-105056336,85,2019-11-14,75005,2019,12,27,11,2019-12-27,5
594550,100056336-105056336,96,2019-11-14,75005,2019,12,27,19,2019-12-27,5
594551,100056336-105056336,114,2019-11-14,75005,2019,12,29,14,2019-12-29,7
594552,100056336-105056336,43,2019-11-14,75005,2019,12,30,06,2019-12-30,1
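
To work towards the objective of quantifying the rise in cyclists, one possible first step is to compare the mean hourly count per counter between years. The sketch below uses a toy frame with the same column names as `total_traffic`; the values are made up. It restricts the comparison to counters present in both years, since counters were installed at different dates. It does not correct for seasonal coverage, which would still need attention on the real data.

```python
import pandas as pd

demo = pd.DataFrame({
    'Identifiant du compteur': ['A', 'A', 'A', 'B'],
    'Comptage horaire': [10, 20, 40, 50],
    'Year': ['2018', '2018', '2019', '2019'],
})

# Mean hourly count per counter and year, with one column per year.
per_counter = (
    demo.groupby(['Identifiant du compteur', 'Year'])['Comptage horaire']
        .mean()
        .unstack('Year')
)

# Keep only counters observed in both years so the comparison is like-for-like.
both_years = per_counter.dropna()
relative_change = (both_years['2019'] - both_years['2018']) / both_years['2018']
```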


## Columns that we don't need from accidents
* Circulation (143 non-missing values)
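
A column with very few non-missing values, like the one listed above, can be found programmatically. The following is a minimal sketch on a toy frame: the column names and the `min_non_missing` cut-off are hypothetical and would need to be adapted to the `accidents` dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accidents dataset (column names are hypothetical).
demo = pd.DataFrame({
    'hour': [1, 22, 17, 1],
    'circulation': [np.nan, np.nan, 2.0, np.nan],
})

non_missing = demo.count()   # number of non-NaN values per column
min_non_missing = 2          # hypothetical cut-off
sparse_columns = non_missing[non_missing < min_non_missing].index.tolist()

trimmed = demo.drop(columns=sparse_columns)
```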


# Get postal code of every 'compteur'

In [22]:
#qqq=traffic.pivot_table(index='Identifiant du compteur', values='Coordonnées géographiques', aggfunc='head')

Unnamed: 0,Coordonnées géographiques,Identifiant du compteur
0,"48.83511,2.33338",100003096-SC
1,"48.83511,2.33338",100003096-SC
2,"48.83511,2.33338",100003096-SC
3,"48.83511,2.33338",100003096-SC
4,"48.83511,2.33338",100003096-SC
...,...,...
761705,"48.896825,2.345648",100063173-SC
761708,"48.896825,2.345648",100063173-SC
761711,"48.896825,2.345648",100063173-SC
761714,"48.896825,2.345648",100063173-SC


In [23]:
#qqq=qqq['Coordonnées géographiques'].str.split(',',expand=True)
#qqq=qqq.applymap(float)

In [24]:
#import googlemaps
#from datetime import datetime

#gmaps = googlemaps.Client(key='Key')

In [49]:
#postal_codes=qqq.apply(lambda x: gmaps.reverse_geocode((x[0],x[1]))[0]['address_components'][-1]['long_name'], axis=1)

In [51]:
#postal_codes=pd.DataFrame(postal_codes, columns=['Postal code'])

Unnamed: 0,Postal code
0,75014
1,75014
2,75014
3,75014
4,75014
...,...
761705,75018
761708,75018
761711,75018
761714,75018


In [54]:
#localisations=traffic.merge(postal_codes,left_index=True, right_index=True)
#localisations=localisations[['Identifiant du compteur','Nom du compteur','Coordonnées géographiques','Postal code']]

In [58]:
#localisations.to_csv('localisations.csv',sep=',',index=False)
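
The commented-out cells above reverse-geocode every traffic row, although the coordinates are constant for a given counter. A possible refinement is to deduplicate the counters first, so the geocoder is called once per counter rather than once per row. In the sketch below, `postal_codes_per_counter` and `fake_reverse_geocode` are hypothetical names; the fake geocoder stands in for a real service and its rule is purely illustrative.

```python
import pandas as pd


def postal_codes_per_counter(traffic, reverse_geocode):
    """Return one row per counter with its postal code.

    `reverse_geocode` is any callable mapping (lat, lon) to a postal code.
    """
    counters = (
        traffic[['Identifiant du compteur', 'Coordonnées géographiques']]
        .drop_duplicates()
        .reset_index(drop=True)
    )
    # Split 'lat,lon' strings into two float columns labelled 0 and 1.
    coords = counters['Coordonnées géographiques'].str.split(',', expand=True).astype(float)
    counters['Postal code'] = [
        reverse_geocode(lat, lon) for lat, lon in zip(coords[0], coords[1])
    ]
    return counters


calls = []


def fake_reverse_geocode(lat, lon):
    """Stand-in for a real geocoding call; records each invocation."""
    calls.append((lat, lon))
    return '75014' if lat < 48.85 else '75018'  # illustrative rule only


traffic_demo = pd.DataFrame({
    'Identifiant du compteur': ['100003096-SC'] * 3 + ['100063173-SC'] * 2,
    'Coordonnées géographiques': ['48.83511,2.33338'] * 3 + ['48.896825,2.345648'] * 2,
})

localisations_demo = postal_codes_per_counter(traffic_demo, fake_reverse_geocode)
```

Here the five demo rows trigger only two geocoder calls, one per distinct counter. The resulting frame already has one row per counter, so it could be merged into the traffic data without a separate deduplication step.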