Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
312 lines (254 sloc) 11.6 KB
---
title: "Exploring Denver B-cycle data"
author: "Tyler Byers"
date: "August 18, 2015"
output: html_document
---
```{r set_environment}
library(ggplot2); library(dplyr); library(tidyr)
library(lubridate); library(xml2); library(readxl)
library(ggmap);
Sys.setenv(TZ = 'America/Denver')
```
This document shows our work in constructing the data sets for the Denver B-cycle 2014 data exploration. The data construction process was fairly painstaking, so it doesn't make sense to do that in the production-level script. Data sets are constructed and saved here.
## B-cycle Data
### Load B-cycle Data
Load the bcycle data, which we downloaded from https://denver.bcycle.com/company. This study only uses 2014 data, from near the bottom of that page.
```{r read_bcycle}
bcycle <- read_excel('./data/2014denverbcycletripdata_public.xlsx')
names(bcycle) <- c('program','user_id','zip','membership_type','bike',
'checkout_date','checkout_time', 'checkout_kiosk',
'return_date','return_time','return_kiosk',
'duration_mins')
head(bcycle)
```
It may work better if the kiosks are factor variables.
```{r kiosks_as_factors}
bcycle$checkout_kiosk <- as.factor(bcycle$checkout_kiosk)
bcycle$return_kiosk <- as.factor(bcycle$return_kiosk)
```
The time in the columns is showing 1899 as a year due to horrifying Microsoft Excel behavior, so we're going to remove that.
```{r fix_time}
bcycle <- bcycle %>% separate(checkout_time, c('ugly_date','checkout_time'),
sep = ' ') %>% select(-ugly_date)
bcycle <- bcycle %>% separate(return_time, c('ugly_date', 'return_time'),
sep = ' ') %>% select(-ugly_date)
# convert these to datetime values
bcycle$checkout_datetime <- ymd_hms(paste(bcycle$checkout_date,
bcycle$checkout_time),
tz = 'America/Denver')
bcycle$return_datetime <- ymd_hms(paste(bcycle$return_date,
bcycle$return_time),
tz = 'America/Denver')
```
### Load Station Address Data
One of the parts of this project will be exploring the number of miles ridden by the B-cyclers. In order to do this, we need the station addresses. To get the addresses, we went to the [B-cycle homepage](https://denver.bcycle.com), and painstakingly copied and pasted all of the marker names and addresses to an Excel spreadsheet (we could not find a way to do this programmatically that would be faster than grabbing the addresses by hand).
Below we open that spreadsheet and make sure we grabbed all the stations (it would be easy to forget some of them).
```{r load_bcycle_stations}
stations <- read_excel('./data/bcycle_stations.xlsx')
stations <- as.data.frame(t(stations))
stations <- stations[2:nrow(stations), c('V2','V4', 'V5')]
rownames(stations) <- NULL
names(stations) <- c('name','street_address','city_state_zip')
```
Do we have all the stations?
```{r check_for_checkout_kiosks}
unique_checkout_kiosks <- unique(bcycle$checkout_kiosk)
length(unique_checkout_kiosks)
length(unique(stations$name))
which(!(unique_checkout_kiosks %in% stations$name))
unique_checkout_kiosks[which(!(unique_checkout_kiosks %in% stations$name))]
```
First time through, we found we were missing the following stations:
* 17th & Curtis
* 14th & Ogden
* 2045 Franklin
The 17th & Curtis location was not on the current map on the B-cycle webpage (maybe it was moved?), so found the location on this otherwise outdated map: https://www.google.com/maps/d/viewer?mid=zlN1mPiliDOE.kbb3RbqyLl1w&hl=en_US. The other two stations we had just missed, so added those to the excel worksheet. So now there are 3 more stations in our data than there were used in 2014 -- likely B-cycle added some new stations for 2015, and our data reflects that.
Which stations were added?
```{r check_for_checkout_kiosks}
unique_checkout_kiosks <- unique(bcycle$checkout_kiosk)
length(unique_checkout_kiosks)
length(unique(stations$name))
which(!(stations$name %in% unique_checkout_kiosks))
stations$name[which(!(stations$name %in% unique_checkout_kiosks))]
```
### Station Distances
Now we want to use ggmap to get station distances.
How many kiosk start/return pairs do we have?
```{r kiosk_start_return}
kiosk_pairs <- bcycle %>%
group_by(checkout_kiosk, return_kiosk) %>%
summarise(n = n()) %>% arrange(-n)
kiosk_pairs$checkout_address <-
sapply(kiosk_pairs$checkout_kiosk, function(station){
s_row <- which(stations$name == as.character(station))
if(length(s_row) == 0) { # no match found
NA
} else {
paste0(stations$street_address[s_row], ', ',
stations$city_state_zip[s_row])
}
} )
kiosk_pairs$return_address <-
sapply(kiosk_pairs$return_kiosk, function(station){
s_row <- which(stations$name == as.character(station))
if(length(s_row) == 0) { # no match found
NA
} else {
paste0(stations$street_address[s_row], ', ',
stations$city_state_zip[s_row])
}
} )
sum(is.na(kiosk_pairs$checkout_address))
sum(is.na(kiosk_pairs$return_address))
```
Based on this, we have 0 invalid checkout addresses, but 137 lines of invalid return addresses.
### Google Distance Matrix API
Now we'll use the Google Distance Matrix API to get the distances between stations.
However, we're limited to 2500 queries a day. So we're going to split up the kiosk pairs into 3 groups.
```{r split_kiosk_pairs}
kiosk_pairs1 <- kiosk_pairs[1:2300,]
kiosk_pairs2 <- kiosk_pairs[2301:4600,]
kiosk_pairs3 <- kiosk_pairs[4601:6800,]
saveRDS(kiosk_pairs1, './data/kiokiosk_pairs1.rds')
saveRDS(kiosk_pairs2, './data/kiokiosk_pairs2.rds')
saveRDS(kiosk_pairs3, './data/kiokiosk_pairs3.rds')
```
Now use ggmap to get these distances, via bicycling.
```{r ggmap_distance, eval = FALSE}
distance_calculations <- function(kiosks) {
kiosks$distances <- NA
kiosks$time_sec <- NA
kiosks$status <- NA
err_count <- 0
for(i in 1:nrow(kiosks)){
ckout <- kiosks$checkout_address[i]
ret <- kiosks$return_address[i]
if(is.na(ret)) {
kiosks$status[i] <- 'return_invalid'
} else if (ckout == ret) {
kiosks$status[i] <- 'same_kiosk'
} else {
# mapdist sometimes errors out, we need to make a note this
# happened
tryCatch({
dcalc <- mapdist(from = ckout, to = ret, mode = 'bicycling')
kiosks$distances[i] <- dcalc$miles
kiosks$time_sec[i] <- dcalc$seconds
kiosks$status[i] <- 'ok'
}, error = function(err) {
kiosks$status[i] <- 'error'
err_count <- err_count + 1
})
}
print(paste('Iteration: ', i))
print(paste('Checkout:', ckout))
print(paste('Return:', ret))
print(paste('Distance:', kiosks$distances[i]))
print(paste('Time (sec):', kiosks$time_sec[i]))
print(paste('Status:', kiosks$status[i]))
print(paste('Number of errors:', err_count))
Sys.sleep(2) # So we don't use up our limit from the API
}
kiosks
}
```
```{r get_distances}
kiosk_pairs1 <- distance_calculations(kiosk_pairs1)
names(kiosk_pairs1)[6:8] <- c('distance_miles', 'distance_seconds',
'mapdist_status')
saveRDS(kiosk_pairs1, './data/kiokiosk_pairs1_withdist.rds')
kiosk_pairs2 <- distance_calculations(kiosk_pairs2)
names(kiosk_pairs2)[6:8] <- c('distance_miles', 'distance_seconds',
'mapdist_status')
saveRDS(kiosk_pairs2, './data/kiokiosk_pairs2_withdist.rds')
kiosk_pairs3 <- distance_calculations(kiosk_pairs3)
names(kiosk_pairs3)[6:8] <- c('distance_miles', 'distance_seconds',
'mapdist_status')
saveRDS(kiosk_pairs3, './data/kiokiosk_pairs3_withdist.rds')
```
Open up the kiosk pairs and re-make into one dataframe.
```{r kiosk_pairs}
kiosk_pairs1 <- readRDS('./data/kiokiosk_pairs1_withdist.rds')
kiosk_pairs2 <- readRDS('./data/kiokiosk_pairs2_withdist.rds')
kiosk_pairs3 <- readRDS('./data/kiokiosk_pairs3_withdist.rds')
kiosk_pairs$distance_miles <- c(kiosk_pairs1$distance_miles,
kiosk_pairs2$distance_miles,
kiosk_pairs3$distance_miles)
kiosk_pairs$distance_seconds <- c(kiosk_pairs1$distance_seconds,
kiosk_pairs2$distance_seconds,
kiosk_pairs3$distance_seconds)
kiosk_pairs$mapdist_status <- c(kiosk_pairs1$mapdist_status,
kiosk_pairs2$mapdist_status,
kiosk_pairs3$mapdist_status)
```
About the kiosk_pairs:
```{r eda_kiosk_pairs}
summary(kiosk_pairs)
table(kiosk_pairs$mapdist_status)
```
### Add distances to B-cycle data
Now we take the kiosk_pairs information, with information about how far it is between each station, and merge the data with the rest of the bcycle information.
```{r add_mapdistances}
bcycle$ggmap_dist <- apply(bcycle, 1, function(x) {
kp_row <- kiosk_pairs %>%
filter(checkout_kiosk == x['checkout_kiosk'] &
return_kiosk == x['return_kiosk'])
kp_row$distance_miles
})
bcycle$ggmap_seconds <- apply(bcycle, 1, function(x) {
kp_row <- kiosk_pairs %>%
filter(checkout_kiosk == x['checkout_kiosk'] &
return_kiosk == x['return_kiosk'])
kp_row$distance_seconds
})
bcycle$ggmap_status <- apply(bcycle, 1, function(x) {
kp_row <- kiosk_pairs %>%
filter(checkout_kiosk == x['checkout_kiosk'] &
return_kiosk == x['return_kiosk'])
kp_row$mapdist_status
})
```
### Save Updated B-Cycle data
We've painstakingly added additional data to the original B-Cycle data. Now save it out so in our production-level script we can access it easily.
```{r save_new_bcycle_data}
write_csv(bcycle, './data/bcycle_2014_ggmap_distances.csv')
```
### Get Geocode Data
The station data also needs lat/long data to plot on the map.
```{r geo_code}
stations$lon <- NA
stations$lat <- NA
#for(i in 1:3) {
for(i in 1:nrow(stations)) {
print(paste('iteration: ', i))
gc <- geocode(paste(stations[i,]$street_address,
stations[i,]$city_state_zip))
stations[i,c('lon','lat')] <- gc
Sys.sleep(1)
}
write_csv(stations, './data/stations_address_geocode.csv')
```
## Weather Data
I like to use weather data from forecast.io. The below chunk of code only needs to be run once, to fetch the weather from forecast.io for each day of 2014. Each day's weather data will each be saved in a json file in the ./data/weather/ directory.
### Download Weather Data
Download the weather data from forecast.io to a local directory.
```{r download_weather_data, eval=FALSE}
dates <- seq(from = ymd('2014-01-01'), to = ymd('2014-12-31'), by = 'days')
# Note, to get this to work outside of the author's environment, you will need
# to get your own forecast.io API key at developer.forecast.io
apikey <- read.table('./private/forecastio_apikey.dat')$V1
weather_temp <- sapply(dates, function(daystr){
daystr <- as.character(daystr)
savedfile <- paste0('./data/weather/',daystr,'.json')
url <- paste0('https://api.forecast.io/forecast/',apikey,
'/39.7400,-104.9923,',daystr,'T00:00:00')
print(paste('Downloading weather for date',daystr))
download.file(url, savedfile)
print('Sleeping 8 seconds.')
Sys.sleep(8)
})
rm(weather_temp)
```
### On to final script
At this point, we have done enough data acquisition and intial processing to be able to move on to the final script -- `bcycle_final_script.R`.