# 1: Data Processing

## 1.1 Data Gathering

### Introduction

Public bike rental schemes have been launched in many cities around Europe, including successful networks in Irish towns and cities such as Dublin, Cork, and Galway. These provide local active transport links for commuting and leisure. This project explores the publicly available data on the Dublin bike network, compares it with the Santander-sponsored London network, and recommends adjustments that could improve the network.

There are several data sources to pull from and process:

Data pertaining to the Dublin bike scheme - number of hires, availability, locations of stations, etc.
- https://data.gov.ie/dataset/dublinbikes-api - Historical data for Dublin bike hire (CSV format)
- https://developer.jcdecaux.com/#/opendata/vls?page=static&contract=dublin - Static station data for Dublin (JSON format)

Similar data pertaining to the London bike scheme, in order to provide a point of comparison.
- https://data.london.gov.uk/dataset/number-bicycle-hires - The number of bicycle hires in London (XLS format)
- https://data.london.gov.uk/dataset/cycle-hire-availability - The availability of bike hires in London

Population data for the Dublin area. It is possible that the population in the vicinity of bike stations plays a significant role in their usage/availability, so this data is gathered to allow population-based features to be derived.
- https://ie-cso.maps.arcgis.com/apps/webappviewer/index.html?id=0fe164e96d254776866425e2fd3e73af - 2022 census population data for Dublin (GIS data)
- https://sdi.eea.europa.eu/catalogue/srv/eng/catalog.search#/metadata/3c362237-daa4-45e2-8c16-aaadfb1a003b - GIS grid data for Ireland (GIS data / parquet)

Weather data for Dublin - it is possible that the weather (temperature/rainfall) has a role in bike usage, so this is gathered in order to explore this link.
- https://www.met.ie/climate/available-data/historical-data - Historical weather data for Dublin (CSV format)

These data sources are publicly available with very few restrictions. They generally ask that the source is attributed and that any changes are noted.
- Dublinbikes and Met Éireann data are available under the Creative Commons Attribution 4.0 license - https://creativecommons.org/licenses/by/4.0/
- The UK Open Government Licence requests that a link to the license is included, which can be found here - https://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/
- European population grid data is publicly available for non-commercial purposes with attribution - https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography

In [None]:
## Uncomment to install dependencies from file if not already installed
# pip install -r requirements.txt

In [35]:
import os
import requests
from pathlib import Path
import pandas as pd
import warnings
from bs4 import BeautifulSoup
from tqdm import tqdm
import json

warnings.filterwarnings('ignore')

# Flag to control redownloading - avoids unnecessary web traffic
REDOWNLOAD = False

data_folder = Path('data/')
dublin_bike_folder = data_folder / 'dublinbikes'

# Create the folder tree. parents=True builds any missing parent folders and
# exist_ok=True makes reruns safe, so the subfolders are created even if
# data/dublinbikes already exists
for folder in (dublin_bike_folder / 'monthly', dublin_bike_folder / 'quarterly'):
    folder.mkdir(parents=True, exist_ok=True)

### Dublin Bikes Data

First, to get the dublinbikes historical data, the download links are scraped from the dublinbikes dataset page on data.gov.ie. BeautifulSoup is used to parse the HTML response from a GET request to the landing page and to filter down to the 'a' tags whose href points to a dublinbike .csv file. The files come in quarterly and monthly cadences, so the list of links can be split based on the URL format. Each file can then be downloaded using pandas, with os.path.basename extracting the filename from the URL.

In [8]:
# dublinbikes csv data
main_page_url = 'https://data.gov.ie/dataset/dublinbikes-api'
main_page_html = requests.get(main_page_url).content

In [17]:
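# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(main_page_html, 'html.parser')
# Keep only links to dublinbike .csv files
download_links = [a['href'] for a in soup.find_all('a', href=True) if a['href'].endswith('.csv') and 'dublinbike' in a['href']]
download_links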

['https://data.smartdublin.ie/dataset/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab/resource/9496fac5-e4d7-4ae9-a49a-217c7c4e83d9/download/dublinbikes_20180701_20181001.csv',
 'https://data.smartdublin.ie/dataset/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab/resource/67ea095f-67ad-47f5-b8f7-044743043848/download/dublinbikes_20181001_20190101.csv',
 'https://data.smartdublin.ie/dataset/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab/resource/538165d7-535e-4e1d-909a-1c1bfae901c5/download/dublinbikes_20190101_20190401.csv',
 'https://data.smartdublin.ie/dataset/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab/resource/76fdda3d-d8be-441b-92dd-0ee36d9c5316/download/dublinbikes_20190401_20190701.csv',
 'https://data.smartdublin.ie/dataset/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab/resource/305d39ac-b6a0-4216-a535-0ae2ddf59819/download/dublinbikes_20190701_20191001.csv',
 'https://data.smartdublin.ie/dataset/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab/resource/5d23332e-4f49-4c41-b6a0-bffb77b33d64/download/dublinbikes_20191001_20200101.csv',
 'ht

These files can be quite large, so a flag is set to avoid redownloading them unnecessarily.

We also use tqdm to display simple progress bars, as the data takes a reasonably long time to download.

In [27]:
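# Warning: this downloads several GB of bike data
# Quarterly file names contain 'dublinbikes' (see the links above);
# the remaining 'dublinbike' links are treated as monthly files
quarterly_links = [x for x in download_links if 'dublinbikes' in x]
monthly_links = [x for x in download_links if 'dublinbikes' not in x]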

if REDOWNLOAD:
    for link in tqdm(quarterly_links):
        df = pd.read_csv(link)
        df.to_csv((dublin_bike_folder / 'quarterly' / os.path.basename(link)), index=False)
                  
    for link in tqdm(monthly_links):
        df = pd.read_csv(link)
        df.to_csv((dublin_bike_folder / 'monthly' / os.path.basename(link)), index=False)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [06:36<00:00, 28.33s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [01:01<00:00,  2.28s/it]
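
The REDOWNLOAD flag is all-or-nothing. As an alternative, a per-file check could skip only the files that are already on disk. The following is a minimal sketch and not the approach used in this notebook: `filename_from_url` and `download_if_missing` are hypothetical helper names, and the raw bytes are streamed with requests rather than round-tripped through pandas.

```python
from pathlib import Path
from urllib.parse import urlparse

import requests


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return Path(urlparse(url).path).name


def download_if_missing(url, dest_dir, force=False, chunk_size=1024 * 1024):
    """Stream url into dest_dir unless the file already exists (hypothetical helper)."""
    dest = Path(dest_dir) / filename_from_url(url)
    if dest.exists() and not force:
        return dest  # already downloaded, so no web request is made

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so that an interrupted download
    # is not mistaken for a complete file on the next run
    tmp = dest.with_name(dest.name + '.part')
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(tmp, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    tmp.replace(dest)
    return dest
```

Inside the loops above, this could be called as `download_if_missing(link, dublin_bike_folder / 'quarterly')`, leaving the flag for cases where a full refresh is wanted.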


We also want to get the location data for each station, including the address, latitude, and longitude. This is available as a JSON file from JCDecaux, who manage the dublinbikes service and provide a REST API.

In [39]:
location_data_url = "https://developer.jcdecaux.com/rest/vls/stations/dublin.json"
location_data_json = requests.get(location_data_url).json()
with open(dublin_bike_folder / 'dublin_locations.json', 'w') as f:
    json.dump(location_data_json, f)
    
location_data_json[:3]

[{'number': 42,
  'name': 'SMITHFIELD NORTH',
  'address': 'Smithfield North',
  'latitude': 53.349562,
  'longitude': -6.278198},
 {'number': 30,
  'name': 'PARNELL SQUARE NORTH',
  'address': 'Parnell Square North',
  'latitude': 53.3537415547453,
  'longitude': -6.26530144781526},
 {'number': 54,
  'name': 'CLONMEL STREET',
  'address': 'Clonmel Street',
  'latitude': 53.336021,
  'longitude': -6.26298}]

In [40]:
dublin_bike_stations_df = pd.DataFrame(location_data_json)
dublin_bike_stations_df.head()

Unnamed: 0,number,name,address,latitude,longitude
0,42,SMITHFIELD NORTH,Smithfield North,53.349562,-6.278198
1,30,PARNELL SQUARE NORTH,Parnell Square North,53.353742,-6.265301
2,54,CLONMEL STREET,Clonmel Street,53.336021,-6.26298
3,108,AVONDALE ROAD,Avondale Road,53.359405,-6.276142
4,20,JAMES STREET EAST,James Street East,53.336597,-6.248109


### London Bikes Data

To compare the Dublin bike hire trends to those of London, the number of bike hires for the Santander Cycles service can be retrieved from data.london.gov.uk. This comes as an xlsx file, so the raw response from the GET request can be written to a file opened in write bytes (wb) mode.

This xlsx file is in an untidy (Wickham 2011) format: multiple tables are stored on the same sheet, and there is a descriptive header above the data itself, so some additional processing is required to isolate the necessary signals. Pandas can read Excel files, but the header needs to be skipped with the skiprows argument, and the columns need to be specified with the usecols argument so that the different tables can be isolated from their shared sheet.

In [42]:
london_bike_hires_url = 'https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx'

london_bike_raw_data = requests.get(london_bike_hires_url).content
with open(data_folder / 'london_bike_hires.xlsx', 'wb') as f:
    f.write(london_bike_raw_data)
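
The table isolation described above could be sketched as follows. This is a hedged illustration only: `read_sheet_table` is a hypothetical helper, and the sheet name, number of header rows, and column range are not specified here, so they would need to be checked against the downloaded workbook. It relies on the `skiprows` and `usecols` arguments of `pd.read_excel`.

```python
import pandas as pd


def read_sheet_table(source, sheet_name, skiprows, usecols):
    """Read one table from a sheet that holds several tables side by side.

    skiprows: number of descriptive rows above the column headers
    usecols:  Excel-style column range of the wanted table, e.g. 'A:B'
    """
    table = pd.read_excel(
        source,
        sheet_name=sheet_name,
        skiprows=skiprows,
        usecols=usecols,
    )
    # Neighbouring tables can be longer, which leaves fully empty
    # trailing rows in the shorter table's columns
    return table.dropna(how='all').reset_index(drop=True)
```

A call might look like `read_sheet_table(data_folder / 'london_bike_hires.xlsx', sheet_name='Data', skiprows=1, usecols='A:B')`, where the sheet name, row count, and column range are placeholders rather than values taken from the file.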

### Dublin Population Data

The population in the vicinity of a bike station is expected to influence its availability and usage.
We can get population data within each square kilometer from [here]. This data is on the Irish 1km grid, which is available as a parquet file [here].
The location data is projected differently between the data sources: the 1km grid data uses the "EPSG:3035" coordinate reference system, whereas the more familiar latitude and longitude system is based on "EPSG:4326".

We can convert between the two projections using the pyproj package. This has the additional advantage that EPSG:3035 Eastings and Northings are in meters, so distances between nearby points can be found with simple metrics such as the Euclidean or Manhattan distance, without dealing with spherical coordinates under the latitude/longitude system.
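
A minimal sketch of this conversion with pyproj is shown below. The helper names `latlon_to_laea` and `laea_to_latlon` are hypothetical. Passing `always_xy=True` makes both transformers use (longitude, latitude) and (easting, northing) axis order.

```python
from pyproj import Transformer

# always_xy=True fixes the axis order to (lon, lat) / (easting, northing)
TO_LAEA = Transformer.from_crs('EPSG:4326', 'EPSG:3035', always_xy=True)
TO_LATLON = Transformer.from_crs('EPSG:3035', 'EPSG:4326', always_xy=True)


def latlon_to_laea(latitude, longitude):
    """Convert latitude/longitude in degrees to EPSG:3035 (easting, northing) in meters."""
    easting, northing = TO_LAEA.transform(longitude, latitude)
    return easting, northing


def laea_to_latlon(easting, northing):
    """Convert EPSG:3035 (easting, northing) in meters back to (latitude, longitude)."""
    longitude, latitude = TO_LATLON.transform(easting, northing)
    return latitude, longitude


# Smithfield North station, using the coordinates from the station data above
easting, northing = latlon_to_laea(53.349562, -6.278198)
```

`Transformer.transform` also accepts arrays, so whole latitude and longitude columns can be converted in a single call.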

It is important that we can easily find distances using a variety of metrics. We are dealing with real-world distances within a city, so it is possible that the grid-based Manhattan measure is more appropriate than the "as the crow flies" Euclidean measure. We will want to be able to calculate distance features with both for comparison.
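
Once points are in a projected system with coordinates in meters, both metrics reduce to simple arithmetic. A minimal sketch, with hypothetical function names and made-up coordinates:

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two (easting, northing) points, in meters."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def manhattan_distance(p, q):
    """Sum of the absolute easting and northing differences, in meters."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


# Two made-up points 3 km apart east-west and 4 km apart north-south
a = (3_250_000.0, 3_480_000.0)
b = (3_253_000.0, 3_484_000.0)

print(euclidean_distance(a, b))  # 5000.0
print(manhattan_distance(a, b))  # 7000.0
```

The Manhattan distance here is measured along the grid's easting and northing axes, which need not line up with the street layout, so it is only an approximation of on-street travel distance.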

## 1.2 Initial Data Cleaning