
# To Do

## Objectives
Analyse the number of cyclists in Paris

* Quantify the rise in the number of cyclists in Paris
    * Get data on traffic and accidents
    * Get data on bike lane construction

## Questions to answer
* Correlate accidents with time of day, road conditions, and gender
* Is the increase in car traffic leading to more bike accidents?
* Is the increase in bike lanes helping to decrease bike accidents?
* Is the increase in bike traffic leading to more bike lanes? And in which areas?

## Visualizations to use

* Histograms to measure the frequency of traffic
* Correlation matrix to measure the correlation between traffic and number of bike lanes
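
As a rough illustration of the two ideas above, the sketch below bins hourly counts into a histogram and computes a correlation matrix. All values are synthetic, and the column names `traffic` and `bike_lanes_km` are placeholders rather than columns from the datasets used here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic hourly counts standing in for 'Comptage horaire' (assumption)
hourly_counts = pd.Series(rng.poisson(lam=40, size=500))

# Histogram: how often each range of hourly counts occurs
frequencies, bin_edges = np.histogram(hourly_counts, bins=10)

# Synthetic yearly figures per arrondissement (placeholder names and values)
summary = pd.DataFrame({
    'traffic': [1200, 1500, 900, 2100, 1800],
    'bike_lanes_km': [3.1, 4.0, 2.2, 6.5, 5.0],
})
corr_matrix = summary.corr()  # Pearson correlation by default
```

Assuming matplotlib is installed, `hourly_counts.plot.hist(bins=10)` would draw the histogram directly.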

## Possible DataFrames to build
* Traffic aggregated by hour
* Traffic aggregated by day of the week
* Traffic aggregated by month
* Average number of cyclists per hour, day of the week, month?
* Traffic in each year (columns) per month (rows)
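
The DataFrames listed above could be built with `groupby` and `pivot_table`, as in the minimal sketch below. The sample is synthetic, and it assumes columns named `Comptage horaire`, `Year`, `Month`, `Hour` and `day_week`, as in the cleaned traffic data.

```python
import pandas as pd

# Tiny synthetic sample; column names mirror the cleaned traffic DataFrame
sample = pd.DataFrame({
    'Comptage horaire': [4, 30, 116, 0, 18, 50],
    'Year':     ['2018', '2018', '2018', '2019', '2019', '2019'],
    'Month':    [11, 11, 12, 11, 12, 12],
    'Hour':     ['01', '22', '17', '01', '11', '17'],
    'day_week': ['Thursday', 'Thursday', 'Friday', 'Monday', 'Monday', 'Friday'],
})

# Total traffic aggregated by hour, day of the week and month
by_hour = sample.groupby('Hour')['Comptage horaire'].sum()
by_day = sample.groupby('day_week')['Comptage horaire'].sum()
by_month = sample.groupby('Month')['Comptage horaire'].sum()

# Average number of cyclists per hour of the day
avg_by_hour = sample.groupby('Hour')['Comptage horaire'].mean()

# Traffic in each year (columns) per month (rows)
year_by_month = sample.pivot_table(index='Month', columns='Year',
                                   values='Comptage horaire', aggfunc='sum')
```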

# Data that we might need (to scrape)
* Public investment (bike lane construction, public incentives to buy bikes)
* Car Traffic in Paris
* Number of bicycles sold in Paris (we might have info about sales of electric bikes in Paris)
* Average Salary in Paris
* Average bike prices

# Importing the datasets

In [1]:
import pandas as pd
from urllib.request import urlopen
import zipfile, io

In [2]:
# Importing the accidents dataset - Imports with no issues
accidents = pd.read_csv('https://www.data.gouv.fr/en/datasets/r/3d5f2317-5afd-4a9f-a9c5-bd4fe0113f39', low_memory=False)

In [3]:
# Importing traffic file of 2018
url = 'https://www.data.gouv.fr/en/datasets/r/58d6b982-4c70-4648-afe4-b80eab61d28d'
archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read())) # Takes some time
csv_path = '2018_comptage-velo-donnees-compteurs.csv' # The desired csv file in the archive
traffic_2018 = pd.read_csv(io.BytesIO(archive.read(csv_path)), sep = ';')

In [4]:
# Importing traffic file of 2019
url = 'https://www.data.gouv.fr/en/datasets/r/9c23d147-4032-429c-9c18-86dabd53e63f'
archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read())) # Takes some time
csv_path = '2019_comptage-velo-donnees-compteurs-2.csv' # The desired csv file in the archive
traffic_2019 = pd.read_csv(io.BytesIO(archive.read(csv_path)), sep = ';')
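
The 2018 and 2019 imports follow the same download, unzip and read pattern, so they could share a helper. The sketch below is one possible way to do it. The names `read_csv_from_zip` and `download_traffic` are hypothetical, and the demonstration uses a small archive built in memory instead of the real URLs.

```python
import io
import zipfile
from urllib.request import urlopen

import pandas as pd


def read_csv_from_zip(zip_bytes, csv_name, **read_csv_kwargs):
    """Read one CSV member of an in-memory zip archive into a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(csv_name) as csv_file:
            return pd.read_csv(csv_file, **read_csv_kwargs)


def download_traffic(url, csv_name):
    """Download a zipped traffic file and load the named CSV (sep=';')."""
    with urlopen(url) as response:
        return read_csv_from_zip(response.read(), csv_name, sep=';')


# Offline demonstration with a small archive built in memory (made-up rows)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo.csv',
                     'Identifiant du compteur;Comptage horaire\nA;4\nB;30\n')
demo = read_csv_from_zip(buffer.getvalue(), 'demo.csv', sep=';')
```

With such a helper, each yearly import would reduce to one call, for example `download_traffic(url, csv_path)` with the `url` and `csv_path` values defined in the cells above.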

In [5]:
# Importing the traffic dataset for the last 12 months
last_traffic = pd.read_csv('https://media.githubusercontent.com/media/tmcdonald92/Projects/master/comptage-velo-donnees-compteurs.csv', sep=';')

In [6]:
# Filtering the dataset to only have data from 2020
last_traffic=last_traffic.loc[last_traffic['Date et heure de comptage'].str[:4]=='2020']

In [7]:
# Importing the bike lanes dataset
bike_lanes = pd.read_csv('https://www.data.gouv.fr/en/datasets/r/1211e838-4b77-4ee4-9567-03d78d55f0bf', sep=';')

# Data Cleaning

## Traffic

In [8]:
# Concatenate the traffic datasets (DataFrame.append was removed in pandas 2.0)
total_traffic=pd.concat([traffic_2018,traffic_2019,last_traffic])

In [9]:
# Importing the postal codes
postal_codes=pd.read_csv('https://media.githubusercontent.com/media/tmcdonald92/Projects/master/localisations.csv', sep=',')

In [10]:
postal_codes=postal_codes.drop_duplicates()

In [11]:
# Merging the postal codes in the traffic dataset
total_traffic=total_traffic.merge(postal_codes, how='left', on='Identifiant du compteur')
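
A left merge keeps every traffic row, so counters without a postal code end up with missing values, and any duplicated counter in the postal-code table would duplicate traffic rows. The sketch below shows one way to check both, using the `validate` and `indicator` options of `merge`. The two small tables are made up for illustration.

```python
import pandas as pd

# Synthetic stand-ins for the traffic and postal-code tables (values are made up)
traffic = pd.DataFrame({
    'Identifiant du compteur': ['A', 'A', 'B', 'C'],
    'Comptage horaire': [4, 30, 116, 18],
})
codes = pd.DataFrame({
    'Identifiant du compteur': ['A', 'B', 'B'],
    'Postal code': [75014, 75011, 75011],
}).drop_duplicates()

# validate raises MergeError if a counter id appears more than once in `codes`;
# indicator adds a '_merge' column showing where each row found a match
merged = traffic.merge(codes, how='left', on='Identifiant du compteur',
                       validate='many_to_one', indicator=True)

# Counters present in the traffic data but absent from the postal-code table
unmatched = merged.loc[merged['_merge'] == 'left_only',
                       'Identifiant du compteur'].unique()
```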

In [12]:
# Adding the year, month, day and hour columns
total_traffic['Year']=total_traffic['Date et heure de comptage'].str[:4]
total_traffic['Month']=total_traffic['Date et heure de comptage'].str[5:7]
total_traffic['Day']=total_traffic['Date et heure de comptage'].str[8:10]
total_traffic['Hour']=total_traffic['Date et heure de comptage'].str[11:13]

In [13]:
# Adding the day-of-week column. This first requires a column with the date in datetime format
total_traffic['Date']=total_traffic['Date et heure de comptage'].str[:10]
total_traffic['Date']=pd.to_datetime(total_traffic['Date'])
total_traffic['day_week']=total_traffic['Date'].dt.dayofweek # Monday=0 ... Sunday=6, matching the wk_days list indices (Etienne: I removed the -1)
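
As an alternative to slicing each part out of the string, the date and hour can be parsed once and the parts read from the `.dt` accessor. This is only a sketch: the two sample timestamps are made up, and the `T` separator and UTC offset are assumptions about the raw `Date et heure de comptage` values, inferred from the slice positions used above.

```python
import pandas as pd

# Two made-up timestamps in the layout implied by the slices above
raw = pd.Series(['2018-11-29T01:00:00+01:00', '2018-12-03T11:00:00+01:00'])

# Keep date and hour only, so mixed UTC offsets never reach the parser
stamps = pd.to_datetime(raw.str[:10] + ' ' + raw.str[11:13],
                        format='%Y-%m-%d %H')

parts = pd.DataFrame({
    'Year': stamps.dt.year,
    'Month': stamps.dt.month,
    'Day': stamps.dt.day,
    'Hour': stamps.dt.hour,
    'day_week': stamps.dt.day_name(),      # English weekday names
    'month_name': stamps.dt.month_name(),  # English month names
})
```

Unlike the string slices, this yields integer `Year`, `Day` and `Hour` columns, and the name columns come without the `wk_days` and `months` lookup lists.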

In [14]:
# Drop unnecessary columns from traffic dataset
columns_to_drop=['Nom du compteur_x',
                 'Identifiant du site de comptage',
                 'Nom du site de comptage',
                 'Lien vers photo du site de comptage',
                 'Coordonnées géographiques_x',
                 'Nom du compteur_y',
                 'Coordonnées géographiques_y',
                 'Date et heure de comptage']

total_traffic.drop(columns=columns_to_drop, inplace=True)

In [15]:
# Adding week day & month names
wk_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# 'first' is a placeholder at index 0 so that months[1] is 'January'
months=['first','January','February','March','April','May','June','July','August','September','October','November','December']

total_traffic['day_week']=total_traffic['day_week'].apply(lambda x: wk_days[x])
total_traffic['Month']=total_traffic['Month'].astype(int)
total_traffic['month_name']=total_traffic['Month'].apply(lambda x: months[x])

In [16]:
# Get arrondissements from postal code and remove all postal codes not starting with 75

total_traffic['Postal code']=total_traffic['Postal code'].astype(str)

# .copy() avoids a SettingWithCopyWarning when the new column is added below
total_traffic=total_traffic.loc[total_traffic['Postal code'].str[:2]=='75'].copy()

# The last two digits of a Paris postal code give the arrondissement
total_traffic['Arrondissement']=total_traffic['Postal code'].str[-2:].astype(int)
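
The string slicing above depends on how the postal codes were read (integers, strings, or floats if some are missing). A numeric variant is sketched below as an alternative, not as the approach used in this notebook. The sample codes are made up.

```python
import pandas as pd

# Made-up postal codes: two in Paris, one outside Paris, one missing
codes = pd.Series(['75014', '75116', '92100', None], name='Postal code')

# Coerce to numbers; anything unparseable or missing becomes NaN
numeric = pd.to_numeric(codes, errors='coerce')

# Paris postal codes lie between 75000 and 75999; NaN compares as False
is_paris = numeric.between(75000, 75999)

# The last two digits give the arrondissement
arrondissement = (numeric[is_paris] % 100).astype(int)
```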

In [17]:
total_traffic.head()

| | Identifiant du compteur | Comptage horaire | Date d'installation du site de comptage | Postal code | Year | Month | Day | Hour | Date | day_week | month_name | Arrondissement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100047547-104047547 | 4 | 2018-11-28 | 75014 | 2018 | 11 | 29 | 1 | 2018-11-29 | Thursday | November | 14 |
| 1 | 100047547-104047547 | 30 | 2018-11-28 | 75014 | 2018 | 11 | 29 | 22 | 2018-11-29 | Thursday | November | 14 |
| 2 | 100047547-104047547 | 116 | 2018-11-28 | 75014 | 2018 | 11 | 30 | 17 | 2018-11-30 | Friday | November | 14 |
| 3 | 100047547-104047547 | 0 | 2018-11-28 | 75014 | 2018 | 12 | 3 | 1 | 2018-12-03 | Monday | December | 14 |
| 4 | 100047547-104047547 | 18 | 2018-11-28 | 75014 | 2018 | 12 | 3 | 11 | 2018-12-03 | Monday | December | 14 |


### Missing Values

In [18]:
total_traffic.isna().sum()

Identifiant du compteur                    0
Comptage horaire                           0
Date d'installation du site de comptage    0
Postal code                                0
Year                                       0
Month                                      0
Day                                        0
Hour                                       0
Date                                       0
day_week                                   0
month_name                                 0
Arrondissement                             0
dtype: int64

There are no missing values in the traffic dataset ;)

## Bike lanes

In [19]:
bike_lanes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11954 entries, 0 to 11953
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Typologie                       11953 non-null  object 
 1   Aménagement bidirectionnel      11488 non-null  object 
 2   Régime de vitesse               11937 non-null  object 
 3   Sens vélo                       10924 non-null  object 
 4   Voie                            11898 non-null  object 
 5   Arrondissement                  11951 non-null  float64
 6   Bois                            10409 non-null  object 
 7   Longueur du tronçon en m        11954 non-null  float64
 8   Longueur du tronçon en km       11954 non-null  float64
 9   Position aménagement            1097 non-null   object 
 10  Circulation générale interdite  1110 non-null   object 
 11  Piste                           3189 non-null   object 
 12  Couloir bus                     

In [20]:
# Drop unnecessary columns from bike_lanes dataset

cols_to_drop = ['Aménagement bidirectionnel',
                'Régime de vitesse',
                'Sens vélo',
                'Bois',
                'Longueur du tronçon en km',
                'Position aménagement',
                'Circulation générale interdite',
                'Piste',
                'Couloir bus',
                'Continuité cyclable',
                'Réseau cyclable',
                'geo_shape'] 

bike_lanes.drop(columns=cols_to_drop, inplace=True)


In [21]:
# Drop rows with no delivery date or no arrondissement

bike_lanes.dropna(subset=['Date de livraison','Arrondissement'], inplace=True)

In [22]:
# Split delivery dates into 3 columns

bike_lanes[['Year','Month','Day']]=bike_lanes['Date de livraison'].str.split('-',expand=True)

In [23]:
# Correct data types

bike_lanes['Month']=bike_lanes['Month'].astype(int)
bike_lanes['Year']=bike_lanes['Year'].astype(int)
bike_lanes['Arrondissement']=bike_lanes['Arrondissement'].astype(int)


In [24]:
# Add month names (reuses the months list defined for the traffic dataset)

bike_lanes['month_name']=bike_lanes['Month'].apply(lambda x: months[x])

In [25]:
# Keep only the bike lanes delivered from 2017 onwards
# Idea: we need deliveries from before 2018 (the first year of traffic data) in order to
# check whether an increase in bike lanes drove traffic or the opposite

bike_lanes=bike_lanes.loc[bike_lanes['Year']>=2017]
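
To compare lane deliveries with traffic, the filtered data could be summarised as metres delivered per year and arrondissement. The sketch below uses made-up rows with the column names kept in `bike_lanes`. The cumulative total is one possible way to frame the comparison, not something computed in this notebook.

```python
import pandas as pd

# Made-up rows using the column names kept in the cleaned bike_lanes DataFrame
lanes = pd.DataFrame({
    'Arrondissement': [11, 11, 14, 14, 14],
    'Year': [2017, 2018, 2017, 2018, 2018],
    'Longueur du tronçon en m': [120.0, 80.0, 200.0, 50.0, 150.0],
})

# Metres of bike lane delivered per year (rows) and arrondissement (columns)
delivered = lanes.pivot_table(index='Year', columns='Arrondissement',
                              values='Longueur du tronçon en m',
                              aggfunc='sum', fill_value=0)

# Running total of delivered metres, year after year
cumulative = delivered.cumsum()
```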

### Missing values 

In [26]:
bike_lanes.isna().sum()

Typologie                    0
Voie                        29
Arrondissement               0
Longueur du tronçon en m     0
Date de livraison            0
geo_point_2d                 0
Year                         0
Month                        0
Day                          0
month_name                   0
dtype: int64

29 street names (`Voie`) are missing, but this is not an issue because we have the arrondissement for each of those lanes.

## Columns that we don't need from accidents
* Circulation (143 non-missing values)


# Get postal code of every 'compteur'

In [27]:
#qqq=traffic.pivot_table(index='Identifiant du compteur', values='Coordonnées géographiques', aggfunc='head')

In [28]:
#qqq=qqq['Coordonnées géographiques'].str.split(',',expand=True)
#qqq=qqq.applymap(float)

In [29]:
#import googlemaps
#from datetime import datetime

#gmaps = googlemaps.Client(key='Key')

In [30]:
#postal_codes=qqq.apply(lambda x: gmaps.reverse_geocode((x[0],x[1]))[0]['address_components'][-1]['long_name'], axis=1)

In [31]:
#postal_codes=pd.DataFrame(postal_codes, columns=['Postal code'])

In [32]:
#localisations=traffic.merge(postal_codes,left_index=True, right_index=True)
#localisations=localisations[['Identifiant du compteur','Nom du compteur','Coordonnées géographiques','Postal code']]

In [33]:
#localisations.to_csv('localisations.csv',sep=',',index=False)