Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
denver_bcycle/exploring_bcycle_data.Rmd
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
312 lines (254 sloc)
11.6 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Exploring Denver B-cycle data" | |
author: "Tyler Byers" | |
date: "August 18, 2015" | |
output: html_document | |
--- | |
```{r set_environment} | |
library(ggplot2); library(dplyr); library(tidyr) | |
library(lubridate); library(xml2); library(readxl) | |
library(ggmap); | |
Sys.setenv(TZ = 'America/Denver') | |
``` | |
This document shows our work in constructing the data sets for the Denver B-cycle 2014 data exploration. The data construction process was fairly painstaking, so it doesn't make sense to do that in the production-level script. Data sets are constructed and saved here. | |
## B-cycle Data | |
### Load B-cycle Data | |
Load the bcycle data, which we downloaded from https://denver.bcycle.com/company. This study only uses 2014 data, from near the bottom of that page. | |
```{r read_bcycle} | |
bcycle <- read_excel('./data/2014denverbcycletripdata_public.xlsx') | |
names(bcycle) <- c('program','user_id','zip','membership_type','bike', | |
'checkout_date','checkout_time', 'checkout_kiosk', | |
'return_date','return_time','return_kiosk', | |
'duration_mins') | |
head(bcycle) | |
``` | |
It may work better if the kiosks are factor variables. | |
```{r kiosks_as_factors} | |
bcycle$checkout_kiosk <- as.factor(bcycle$checkout_kiosk) | |
bcycle$return_kiosk <- as.factor(bcycle$return_kiosk) | |
``` | |
The time in the columns is showing 1899 as a year due to horrifying Microsoft Excel behavior, so we're going to remove that. | |
```{r fix_time} | |
bcycle <- bcycle %>% separate(checkout_time, c('ugly_date','checkout_time'), | |
sep = ' ') %>% select(-ugly_date) | |
bcycle <- bcycle %>% separate(return_time, c('ugly_date', 'return_time'), | |
sep = ' ') %>% select(-ugly_date) | |
# convert these to datetime values | |
bcycle$checkout_datetime <- ymd_hms(paste(bcycle$checkout_date, | |
bcycle$checkout_time), | |
tz = 'America/Denver') | |
bcycle$return_datetime <- ymd_hms(paste(bcycle$return_date, | |
bcycle$return_time), | |
tz = 'America/Denver') | |
``` | |
### Load Station Address Data | |
One of the parts of this project will be exploring the number of miles ridden by the B-cyclers. In order to do this, we need the station addresses. To get the addresses, we went to the [B-cycle homepage](https://denver.bcycle.com), and painstakingly copied and pasted all of the marker names and addresses to an Excel spreadsheet (we could not find a way to do this programmatically that would be faster than grabbing the addresses by hand). | |
Below we open that spreadsheet and make sure we grabbed all the stations (it would be easy to forget some of them). | |
```{r load_bcycle_stations} | |
stations <- read_excel('./data/bcycle_stations.xlsx') | |
stations <- as.data.frame(t(stations)) | |
stations <- stations[2:nrow(stations), c('V2','V4', 'V5')] | |
rownames(stations) <- NULL | |
names(stations) <- c('name','street_address','city_state_zip') | |
``` | |
Do we have all the stations? | |
```{r check_for_checkout_kiosks} | |
unique_checkout_kiosks <- unique(bcycle$checkout_kiosk) | |
length(unique_checkout_kiosks) | |
length(unique(stations$name)) | |
which(!(unique_checkout_kiosks %in% stations$name)) | |
unique_checkout_kiosks[which(!(unique_checkout_kiosks %in% stations$name))] | |
``` | |
First time through, we found we were missing the following stations: | |
* 17th & Curtis | |
* 14th & Ogden | |
* 2045 Franklin | |
The 17th & Curtis location was not on the current map on the B-cycle webpage (maybe it was moved?), so found the location on this otherwise outdated map: https://www.google.com/maps/d/viewer?mid=zlN1mPiliDOE.kbb3RbqyLl1w&hl=en_US. The other two stations we had just missed, so added those to the excel worksheet. So now there are 3 more stations in our data than there were used in 2014 -- likely B-cycle added some new stations for 2015, and our data reflects that. | |
Which stations were added? | |
```{r check_for_checkout_kiosks} | |
unique_checkout_kiosks <- unique(bcycle$checkout_kiosk) | |
length(unique_checkout_kiosks) | |
length(unique(stations$name)) | |
which(!(stations$name %in% unique_checkout_kiosks)) | |
stations$name[which(!(stations$name %in% unique_checkout_kiosks))] | |
``` | |
### Station Distances | |
Now we want to use ggmap to get station distances. | |
How many kiosk start/return pairs do we have? | |
```{r kiosk_start_return} | |
kiosk_pairs <- bcycle %>% | |
group_by(checkout_kiosk, return_kiosk) %>% | |
summarise(n = n()) %>% arrange(-n) | |
kiosk_pairs$checkout_address <- | |
sapply(kiosk_pairs$checkout_kiosk, function(station){ | |
s_row <- which(stations$name == as.character(station)) | |
if(length(s_row) == 0) { # no match found | |
NA | |
} else { | |
paste0(stations$street_address[s_row], ', ', | |
stations$city_state_zip[s_row]) | |
} | |
} ) | |
kiosk_pairs$return_address <- | |
sapply(kiosk_pairs$return_kiosk, function(station){ | |
s_row <- which(stations$name == as.character(station)) | |
if(length(s_row) == 0) { # no match found | |
NA | |
} else { | |
paste0(stations$street_address[s_row], ', ', | |
stations$city_state_zip[s_row]) | |
} | |
} ) | |
sum(is.na(kiosk_pairs$checkout_address)) | |
sum(is.na(kiosk_pairs$return_address)) | |
``` | |
Based on this, we have 0 invalid checkout addresses, but 137 lines of invalid return addresses. | |
### Google Distance Matrix API | |
Now we'll use the Google Distance Matrix API to get the distances between stations. | |
However, we're limited to 2500 queries a day. So we're going to split up the kiosk pairs into 3 groups. | |
```{r split_kiosk_pairs} | |
kiosk_pairs1 <- kiosk_pairs[1:2300,] | |
kiosk_pairs2 <- kiosk_pairs[2301:4600,] | |
kiosk_pairs3 <- kiosk_pairs[4601:6800,] | |
saveRDS(kiosk_pairs1, './data/kiokiosk_pairs1.rds') | |
saveRDS(kiosk_pairs2, './data/kiokiosk_pairs2.rds') | |
saveRDS(kiosk_pairs3, './data/kiokiosk_pairs3.rds') | |
``` | |
Now use ggmap to get these distances, via bicycling. | |
```{r ggmap_distance, eval = FALSE} | |
distance_calculations <- function(kiosks) { | |
kiosks$distances <- NA | |
kiosks$time_sec <- NA | |
kiosks$status <- NA | |
err_count <- 0 | |
for(i in 1:nrow(kiosks)){ | |
ckout <- kiosks$checkout_address[i] | |
ret <- kiosks$return_address[i] | |
if(is.na(ret)) { | |
kiosks$status[i] <- 'return_invalid' | |
} else if (ckout == ret) { | |
kiosks$status[i] <- 'same_kiosk' | |
} else { | |
# mapdist sometimes errors out, we need to make a note this | |
# happened | |
tryCatch({ | |
dcalc <- mapdist(from = ckout, to = ret, mode = 'bicycling') | |
kiosks$distances[i] <- dcalc$miles | |
kiosks$time_sec[i] <- dcalc$seconds | |
kiosks$status[i] <- 'ok' | |
}, error = function(err) { | |
kiosks$status[i] <- 'error' | |
err_count <- err_count + 1 | |
}) | |
} | |
print(paste('Iteration: ', i)) | |
print(paste('Checkout:', ckout)) | |
print(paste('Return:', ret)) | |
print(paste('Distance:', kiosks$distances[i])) | |
print(paste('Time (sec):', kiosks$time_sec[i])) | |
print(paste('Status:', kiosks$status[i])) | |
print(paste('Number of errors:', err_count)) | |
Sys.sleep(2) # So we don't use up our limit from the API | |
} | |
kiosks | |
} | |
``` | |
```{r get_distances} | |
kiosk_pairs1 <- distance_calculations(kiosk_pairs1) | |
names(kiosk_pairs1)[6:8] <- c('distance_miles', 'distance_seconds', | |
'mapdist_status') | |
saveRDS(kiosk_pairs1, './data/kiokiosk_pairs1_withdist.rds') | |
kiosk_pairs2 <- distance_calculations(kiosk_pairs2) | |
names(kiosk_pairs2)[6:8] <- c('distance_miles', 'distance_seconds', | |
'mapdist_status') | |
saveRDS(kiosk_pairs2, './data/kiokiosk_pairs2_withdist.rds') | |
kiosk_pairs3 <- distance_calculations(kiosk_pairs3) | |
names(kiosk_pairs3)[6:8] <- c('distance_miles', 'distance_seconds', | |
'mapdist_status') | |
saveRDS(kiosk_pairs3, './data/kiokiosk_pairs3_withdist.rds') | |
``` | |
Open up the kiosk pairs and re-make into one dataframe. | |
```{r kiosk_pairs} | |
kiosk_pairs1 <- readRDS('./data/kiokiosk_pairs1_withdist.rds') | |
kiosk_pairs2 <- readRDS('./data/kiokiosk_pairs2_withdist.rds') | |
kiosk_pairs3 <- readRDS('./data/kiokiosk_pairs3_withdist.rds') | |
kiosk_pairs$distance_miles <- c(kiosk_pairs1$distance_miles, | |
kiosk_pairs2$distance_miles, | |
kiosk_pairs3$distance_miles) | |
kiosk_pairs$distance_seconds <- c(kiosk_pairs1$distance_seconds, | |
kiosk_pairs2$distance_seconds, | |
kiosk_pairs3$distance_seconds) | |
kiosk_pairs$mapdist_status <- c(kiosk_pairs1$mapdist_status, | |
kiosk_pairs2$mapdist_status, | |
kiosk_pairs3$mapdist_status) | |
``` | |
About the kiosk_pairs: | |
```{r eda_kiosk_pairs} | |
summary(kiosk_pairs) | |
table(kiosk_pairs$mapdist_status) | |
``` | |
### Add distances to B-cycle data | |
Now we take the kiosk_pairs information, with information about how far it is between each station, and merge the data with the rest of the bcycle information. | |
```{r add_mapdistances} | |
bcycle$ggmap_dist <- apply(bcycle, 1, function(x) { | |
kp_row <- kiosk_pairs %>% | |
filter(checkout_kiosk == x['checkout_kiosk'] & | |
return_kiosk == x['return_kiosk']) | |
kp_row$distance_miles | |
}) | |
bcycle$ggmap_seconds <- apply(bcycle, 1, function(x) { | |
kp_row <- kiosk_pairs %>% | |
filter(checkout_kiosk == x['checkout_kiosk'] & | |
return_kiosk == x['return_kiosk']) | |
kp_row$distance_seconds | |
}) | |
bcycle$ggmap_status <- apply(bcycle, 1, function(x) { | |
kp_row <- kiosk_pairs %>% | |
filter(checkout_kiosk == x['checkout_kiosk'] & | |
return_kiosk == x['return_kiosk']) | |
kp_row$mapdist_status | |
}) | |
``` | |
### Save Updated B-Cycle data | |
We've painstakingly added additional data to the original B-Cycle data. Now save it out so in our production-level script we can access it easily. | |
```{r save_new_bcycle_data} | |
write_csv(bcycle, './data/bcycle_2014_ggmap_distances.csv') | |
``` | |
### Get Geocode Data | |
The station data also needs lat/long data to plot on the map. | |
```{r geo_code} | |
stations$lon <- NA | |
stations$lat <- NA | |
#for(i in 1:3) { | |
for(i in 1:nrow(stations)) { | |
print(paste('iteration: ', i)) | |
gc <- geocode(paste(stations[i,]$street_address, | |
stations[i,]$city_state_zip)) | |
stations[i,c('lon','lat')] <- gc | |
Sys.sleep(1) | |
} | |
write_csv(stations, './data/stations_address_geocode.csv') | |
``` | |
## Weather Data | |
I like to use weather data from forecast.io. The below chunk of code only needs to be run once, to fetch the weather from forecast.io for each day of 2014. Each day's weather data will each be saved in a json file in the ./data/weather/ directory. | |
### Download Weather Data | |
Download the weather data from forecast.io to a local directory. | |
```{r download_weather_data, eval=FALSE} | |
dates <- seq(from = ymd('2014-01-01'), to = ymd('2014-12-31'), by = 'days') | |
# Note, to get this to work outside of the author's environment, you will need | |
# to get your own forecast.io API key at developer.forecast.io | |
apikey <- read.table('./private/forecastio_apikey.dat')$V1 | |
weather_temp <- sapply(dates, function(daystr){ | |
daystr <- as.character(daystr) | |
savedfile <- paste0('./data/weather/',daystr,'.json') | |
url <- paste0('https://api.forecast.io/forecast/',apikey, | |
'/39.7400,-104.9923,',daystr,'T00:00:00') | |
print(paste('Downloading weather for date',daystr)) | |
download.file(url, savedfile) | |
print('Sleeping 8 seconds.') | |
Sys.sleep(8) | |
}) | |
rm(weather_temp) | |
``` | |
### On to final script | |
At this point, we have done enough data acquisition and intial processing to be able to move on to the final script -- `bcycle_final_script.R`. |