<img style="float: left" src="julia.png">
<br><br><br><br><br>
### Everyday Analytics And Visualization<br><br>
####Analyzing citibike NYC data
JuliaCon 2015: June 27, 2015<br>
Massachusetts Institute of Technology
***
Randy Zwitch <br>
[@randyzwitch](https://twitter.com/randyzwitch)<br>
http://randyzwitch.com <br>
https://github.com/randyzwitch/juliacon2015
<br>
<br>
<br>
<br>

##Agenda<br>  
***
###Using Julia and [citibike NYC Data](http://www.citibikenyc.com/system-data), demonstrate:<br>
* Commonly-Used Syntax For Data Analysis <br>
* Data Visualization<br>
* Accessing Real-Time Web Data (time permitting)<br>

##citibike NYC
***
<img style="float: left" src="citibike-stations.png">
http://www.citibikenyc.com/stations

##Downloading/Unzipping Archived 2014 Data
***

In [None]:
#Loop over months, download 2014 history

#Set working directory
cd("/Users/randyzwitch/juliacon2015/data")

In [None]:
for month in 1:1:12
    
    #Pad with leading zero for single-digit ints
    month = lpad(month, 2, "0")
    
    #Download zip files
    #Calls cURL in background
    #Use Requests.jl for more complex HTTP calls
    download("https://s3.amazonaws.com/tripdata/2014$month-citibike-tripdata.zip", 
             "2014$month.zip")
end

#Unzip all files using OSX Terminal command 
#Called from inside Julia
run(`unzip -o -q \*.zip`)
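
The zero-padding step is easy to check without touching the network. The following is a minimal sketch, assuming a current Julia 1.x release (where `lpad` takes a `Char` pad); `archive_url` is a hypothetical helper name, and the URL pattern is the one used in the download loop.

```julia
# Hypothetical helper: build the archive URL for one month of 2014.
# lpad(month, 2, '0') turns 1 into "01" and leaves 12 as "12".
archive_url(month::Integer) =
    "https://s3.amazonaws.com/tripdata/2014$(lpad(month, 2, '0'))-citibike-tripdata.zip"

# Build all twelve URLs without downloading anything
urls = [archive_url(m) for m in 1:12]

println(first(urls))
println(last(urls))
```

Building the list first makes it possible to inspect the names before any download starts.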

In [None]:
#Load dataframes library, create df with 2014 data
using DataFrames

In [None]:
#Get file list, filtering by files having .csv extension
csvfiles = filter(x -> endswith(x, ".csv"), readdir(pwd()))

In [None]:
#Takes 5 minutes or so to load/concatenate
#Faster method would be to `cat` the files from Terminal first
#to avoid memory swap
df = DataFrame()        
for fileloc in csvfiles        
    df = vcat(df, readtable(fileloc))  
end
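
Concatenating the files on disk before loading avoids repeatedly growing a dataframe in memory. The following is a rough sketch of that idea in Julia 1.x rather than with `cat`, assuming every file starts with the same single header row; `concat_csvs` is a hypothetical helper, and the two sample files contain made-up rows for illustration only.

```julia
# Hypothetical helper: concatenate CSV files that share one header row.
# Lines are streamed one at a time, so no file is held fully in memory.
function concat_csvs(files, outpath)
    open(outpath, "w") do out
        for (i, file) in enumerate(files)
            for (j, line) in enumerate(eachline(file))
                # Keep the header from the first file only
                (j == 1 && i > 1) && continue
                println(out, line)
            end
        end
    end
    return outpath
end

# Usage with two small temporary files (made-up rows)
dir = mktempdir()
a = joinpath(dir, "a.csv"); write(a, "tripduration,gender\n100,1\n")
b = joinpath(dir, "b.csv"); write(b, "tripduration,gender\n200,2\n")
combined = concat_csvs([a, b], joinpath(dir, "all.csv"))
print(read(combined, String))
```

A plain `cat *.csv` would repeat the header row once per file, which is why the sketch skips the first line of every file after the first.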

In [None]:
#How big is the resulting dataframe?
size(df)                                  

##Exploring Dataset
***

In [None]:
#See dataset columns
names(df)

In [None]:
#See dataset structure
#Just first four columns for display purposes
head(df[:, 1:4])

In [None]:
#Can do summary statistics on whole dataframe
#Usually doesn't make sense to do this
describe(df) 

In [None]:
#Can do summary statistics by column 
describe(df[:tripduration])

##Data Visualization
***
There are numerous visualization libraries in Julia:
<br>
* [Gadfly](http://gadflyjl.org/index.html) (similar to ggplot, based on Grammar of Graphics)
* [Vega](https://github.com/johnmyleswhite/Vega.jl) (Vega.js wrapper)
* [Plotly](https://plot.ly/julia/) (API Interface using JSON)
* [Bokeh](http://bokeh.github.io/Bokeh.jl/) (Bindings for Continuum Python library)
* [Winston](http://winston.readthedocs.org/en/latest/) (similar to Base R graphics)
* [Gaston](https://github.com/mbaz/Gaston.jl) (Julia wrapper of gnuplot)
* [PyPlot](https://github.com/stevengj/PyPlot.jl) (Julia wrapper of matplotlib.pyplot)
* [ASCIIPlots](https://github.com/johnmyleswhite/ASCIIPlots.jl) (Plain-text charts)
* [GoogleCharts](https://github.com/jverzani/GoogleCharts.jl) (Julia wrapper of the Google Charts API)

##Business Questions To Visualize
***

1. How did overall ridership change month-to-month in 2014?
2. Does the duration of rides vary per month?
3. Do men/women participate differently in the rideshare program? Does age matter?

###How did overall ridership change month-to-month in 2014?

In [None]:
#Load Gadfly library
using Gadfly

#Calculate rides per month
#Define function since timestamp format changes between files
function dateparse(d::String)
    if typeof(match(r"[-]", d)) != Nothing 
        return d[6:7]
    elseif d[2] == '/'
        return string("0", d[1])
    else
        return d[1:2]
    end
end

#Add month column
df[:month] = ASCIIString[dateparse(x) for x in df[:starttime]]

#Add random number for sampling
df[:rand] = Float64[rand() for x in df[:tripduration]];
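
The three branches of the month-parsing logic can be checked on literal strings. The following is a minimal sketch for Julia 1.x (which uses `occursin` in place of the regex match); `month_of` is a hypothetical name, and the sample timestamps are illustrative stand-ins for the two layouts, not rows copied from the data.

```julia
# Hypothetical Julia 1.x equivalent of the month-parsing function
function month_of(ts::AbstractString)
    if occursin("-", ts)
        return ts[6:7]               # ISO-style: "2014-07-01 00:00:00"
    elseif ts[2] == '/'
        return string("0", ts[1])    # single-digit month: "1/1/2014 0:00"
    else
        return ts[1:2]               # two-digit month: "10/1/2014 0:00"
    end
end

println(month_of("2014-07-01 00:00:00"))
println(month_of("1/1/2014 0:00"))
println(month_of("10/1/2014 0:00"))
```

Returning a zero-padded string in every branch keeps the months sorting in calendar order when they are used as a plot axis.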

In [None]:
#Count number of rides per month
#size(x, 1) counts the records along the '1' (row) axis
monthly_ride_counts = by(df, :month, 
                         x -> DataFrame(rides = size(x, 1)))
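
Counting rows per group is the simplest split-apply-combine operation. As a sketch of what the grouping computes, here is the same count done with a plain `Dict` in Julia 1.x; `count_by` is a hypothetical helper, and the month labels are made up.

```julia
# Hypothetical helper: count how many times each key occurs
function count_by(keys)
    counts = Dict{String,Int}()
    for k in keys
        # Start at 0 the first time a key is seen, then increment
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

# Made-up month labels, one entry per ride
months = ["01", "01", "02", "03", "03", "03"]
counts = count_by(months)

println(counts["01"])   # 2
println(counts["03"])   # 3
```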

In [None]:
#Set plot size in Notebook
set_default_plot_size(20cm, 14cm)

#Bar chart
plot(monthly_ride_counts, x = "month", y = "rides", Geom.bar,
     Guide.title("citibike NYC Rides Per Month - 2014"),
     Theme(default_color = color("navy blue"), bar_spacing = 3mm),
     Scale.y_continuous(format = :plain))

###Does the duration of rides vary per month?

In [None]:
#Set plot size in Notebook
set_default_plot_size(20cm, 16cm)

#Factor-level Boxplot
#Plot a roughly 1% random sample of points to keep rendering time reasonable
plot(df[df[:rand] .< .01, :], x = "month", y = "tripduration", 
     Geom.boxplot,
     Guide.title("citibike NYC - Trip Duration By Month"),
     Scale.y_continuous(format = :plain, minvalue = 0, maxvalue = 3600),
     Theme(default_color = color("green"))
     ) 

###Do men/women participate differently in the rideshare program? Does age matter?

In [None]:
#Calculate avg duration by gender, birth year
gender_age_duration = by(df, [:gender, :birth_year], 
                         x -> DataFrame(tripduration = mean(x[:tripduration]))) 

In [None]:
#Set plot size in Notebook
set_default_plot_size(22cm, 12cm)

#Label gender as characters instead of numbers
function gender(x)
    if x == 1
        return "male"
    elseif x == 2
        return "female"
    else
        return "unknown"
    end
end

gender_age_duration[:genderstr] = [gender(x) for x in gender_age_duration[:gender]]

#Calculate age as of 2015
#Return NaN when the birth year is missing or cannot be parsed
function age(x) 
    try
        return 2015 - int(x)
    catch
        return NaN
    end
end

gender_age_duration[:age] = Float64[age(x) for x in gender_age_duration[:birth_year]];

In [None]:
#Scatterplot
plot(gender_age_duration, x="age", y="tripduration", 
     color="genderstr", 
     Geom.point,
     Guide.title("citibike NYC - Trip Duration By Age By Gender"),
     Guide.xticks(ticks=[0:10:100]),
     Guide.colorkey("gender"),
     Scale.x_continuous(minvalue=0, maxvalue=100),
     Scale.y_continuous(minvalue=0, maxvalue=1500),
     Scale.color_discrete_manual("dark gray","navy","pink")
     )

##Accessing citibike NYC Real-Time Data
***

citibike NYC provides real-time information about the number of bikes at each station, refreshed on every HTTP call. Even better, this data is truly open: the JSON feed is served from a static URL and requires no credentials.

In Julia, the [Requests.jl](https://github.com/JuliaWeb/Requests.jl) library is becoming the standard for making API calls, and the [JSON.jl](https://github.com/JuliaLang/JSON.jl) library parses the JSON that those APIs return.

http://www.citibikenyc.com/stations/json

In [None]:
#Import libraries
using Requests, JSON

#Use Get Request to pull data
r = get("http://www.citibikenyc.com/stations/json")

In [None]:
#Requests library returns a Julia Composite Data type
#Use dot syntax to get data field
#JSON.parse takes a string, returns a Julia Dict
citidata = JSON.parse(r.data)

In [None]:
#Access the station list, find out how many stations there are
size(citidata["stationBeanList"]) 

In [None]:
#Determine fields in dataset using first array element
collect(keys(citidata["stationBeanList"][1]))

##Top 10 Stations Having Bikes
***

In [None]:
#Iterate over JSON, get station/location/available bikes
#Exclamation point on push! indicates that the function mutates its first argument (the array) in place
station_name = ASCIIString[]
staddress = ASCIIString[]
available = Int[]
totaldocks = Int[]
lat = Float64[]
lon = Float64[]

for element in citidata["stationBeanList"]
    push!(station_name, element["stationName"])
    push!(staddress, element["stAddress1"])
    push!(available, element["availableBikes"])
    push!(totaldocks, element["totalDocks"])
    push!(lat, element["latitude"])
    push!(lon, element["longitude"])
end
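
The same extraction can be written with typed comprehensions instead of `push!`. The following is a small sketch for Julia 1.x that needs no network access: the `stations` vector mimics the shape of the parsed `stationBeanList`, using field names from the feed but made-up station names and counts.

```julia
# Made-up records shaped like entries of the parsed station list
stations = [
    Dict("stationName" => "Station A", "availableBikes" => 5,  "totalDocks" => 20),
    Dict("stationName" => "Station B", "availableBikes" => 12, "totalDocks" => 24),
]

# One typed array per field, built in a single pass each
station_name = String[s["stationName"] for s in stations]
available    = Int[s["availableBikes"] for s in stations]
totaldocks   = Int[s["totalDocks"] for s in stations]

# Element-wise share of docks that currently hold a bike
pctremain = available ./ totaldocks

println(pctremain)   # [0.25, 0.5]
```

The type prefix on each comprehension converts every element to that type, so a field with an unexpected type raises an error instead of silently producing an untyped array.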

In [None]:
#Concatenate arrays, convert to DataFrame
citiparsed = DataFrame()

citiparsed[:station] = station_name
citiparsed[:availablebikes] = available
citiparsed[:totaldocks] = totaldocks
citiparsed[:pctremain] = citiparsed[:availablebikes] ./ citiparsed[:totaldocks]
citiparsed[:lat] = lat
citiparsed[:lon] = lon;

In [None]:
#Exclamation on sort! means dataframe remains sorted (gets mutated)
sort!(citiparsed, cols = [order(:availablebikes, rev = true)])
tbl = head(citiparsed, 10)
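
The difference between the mutating and non-mutating forms is easiest to see on a plain array. A minimal sketch using Base `sort` and `sort!` in Julia 1.x, with made-up bike counts:

```julia
# Made-up counts of available bikes
bikes = [12, 40, 7]

# sort returns a new array and leaves the original untouched
sorted_copy = sort(bikes, rev = true)
println(sorted_copy)   # [40, 12, 7]
println(bikes)         # [12, 40, 7]

# sort! reorders the original array in place
sort!(bikes, rev = true)
println(bikes)         # [40, 12, 7]

# Taking the top entries afterwards is just a slice
top2 = bikes[1:2]
```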

##"Real-Time" Map of Bike Availability
***

In [None]:
using Vega

function citibikenycmap(;lat = Any[], lon = Any[])
    v = VegaVisualization(viewport = [500, 700])
    add_data!(v, x = lat, y = lon)
    v.data[1].name = "points"
    v.data[1].transform = [VegaTransform({"type" => "geo", "lat" => "data.x", "lon" => "data.y", "scale" => 85000})]
    push!(v.data, VegaData(name = "nyc", url = "nyc_mh_bk.json", 
    format = VegaFormat(_type = "topojson", feature = "collection")))

    v.marks = Array(VegaMark, 2)
    v.marks[1] = VegaMark(_type = "path", from = {"data" => "nyc", "transform" => [{"type" => "geopath", "value" => "data", "scale" => 85000}]},
                            properties = VegaMarkProperties(enter = VegaMarkPropertySet(path = VegaValueRef(field = "path")),
                                                            update = VegaMarkPropertySet(fill = VegaValueRef(value = "darkblue"))
                                                            )
                         )

    v.marks[2] = VegaMark(_type = "symbol", 
                          from = {"data" => "points"},
                          properties = VegaMarkProperties(enter = VegaMarkPropertySet(x = VegaValueRef(field = "x"), 
                                                                                      y = VegaValueRef(field = "y"),
                                                                                        fill = VegaValueRef(value = "red")),
                                                            update = VegaMarkPropertySet(stroke = VegaValueRef(value = "black"),
                                                                                         size = VegaValueRef(value = 100),
                                                                                         fill = VegaValueRef(value = "red")),
                                                            hover = VegaMarkPropertySet(size = VegaValueRef(value = 200),
                                                                                         fill = VegaValueRef(value = "green"))
                                                            )
    )

    return v
end

In [None]:
citibikenycmap(lat = tbl[:lat], lon = tbl[:lon])

## End of Presentation