
Initial commit. Early assets.

Might need to change some relative refs after uploading.
judecalvillo committed Feb 2, 2016
1 parent 2c19fc7 commit 0eca2dc30c5063d4ff2c26f86cd6cf726dabf12d
@@ -1 +1,30 @@
# data-science-wg
![](datascience-wg_header.jpg)
## Thanks for stopping by!
The **Data Science Working Group’s** purpose is to efficiently assess and tackle Code for San Francisco’s data science needs. Our practicing and aspiring data scientists are available to:
+ assess/inspire the possibility of data science components in other projects;
+ provide resources to help produce those components;
+ provide an environment where we and others can learn more practical data science.
In pursuing the above, we humbly hope that CfSF's dedicated project groups come to consider us an integral and synergistic resource for the group at large.
### Administration
Team Leads: [Jude Calvillo](http://linkd.in/1BGeytb) and [Sanat Moningi](http://bit.ly/1PFurlp)
Lead Data Scientist: [Matthew Pancia](http://bit.ly/1PFuA8k)
Wiki (resources): [DSWG Wiki >>](https://github.com/judecalvillo/Data-Science_Working-Group/wiki)
### Current Initiatives
The DSWG's initiatives are centered on building data science components for CfSF's project groups. We do, however, also pursue exploratory analyses that could inspire dedicated projects at CfSF, as well as develop standalone visualization tools.
+ [ParkSafe GIS coordinate realignment (w/street paths) >>](#)
+ [SF's budget allocations/visualization: Measuring social impact >>](https://github.com/RocioSNg/SF_brigade_impact_gov)
+ [Interactive visualization of SF's building emissions and energy use >>](#)
+ [MarbleTree: Geo-aggregating mental health resource data >>](#)
+ [Exploratory analyses upon 311 call data >>](projects-in-this-repo/311_Exploratory-Analyses)
+ [Adopt-a-Drain: Predicting flooded drains (per drain, zip, and/or day) >>](#)
*More to come! (markdown in progress)*
@@ -0,0 +1,3 @@
## Load libraries.
library(RSocrata)
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,130 @@
![](311_explore.jpg)
## 311 Case Data: Exploratory Analyses
[SF OpenData](https://data.sfgov.org/) provides a real-time record and API for [311 cases completed and in progress](https://data.sfgov.org/City-Infrastructure/Case-Data-from-San-Francisco-311-SF311-/vw6y-z8j6). The Data Science Working Group at Code for San Francisco plans to run exploratory statistical analyses on this data to see whether it reveals strategically or politically interesting characteristics of San Francisco's public agencies or the publics they serve.
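For readers who want to pull the same data, here is a minimal sketch using the RSocrata package, which the repo's loader script attaches. It rests on a few assumptions: the helper names `soda_resource_url` and `fetch_311_cases` are hypothetical and not part of this repo, and the endpoint is the standard Socrata `/resource/<id>.json` form built from the dataset ID in the link above. The download call is left commented out because it needs network access and may return a very large table.

```r
# Build the Socrata (SODA) resource URL for a dataset ID.
# Hypothetical helper; the ID is taken from the dataset link above.
soda_resource_url <- function(domain, dataset_id) {
  paste0("https://", domain, "/resource/", dataset_id, ".json")
}

cases_url <- soda_resource_url("data.sfgov.org", "vw6y-z8j6")

# Hypothetical wrapper around RSocrata::read.socrata(), which returns
# the dataset as a data frame. Needs network access and the RSocrata package.
fetch_311_cases <- function(url = cases_url) {
  if (!requireNamespace("RSocrata", quietly = TRUE)) {
    stop("Install the RSocrata package first: install.packages('RSocrata')")
  }
  RSocrata::read.socrata(url)
}

# cases <- fetch_311_cases()  # not run here: large download
```

Keeping the URL construction separate from the download makes it easy to point the same wrapper at another SF OpenData dataset.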
**Responsible DSWG Teammates**
+ [Catherine Zhang](http://bit.ly/1WXteM8)
+ [Rocio Ng](http://bit.ly/1WXtj2v)
+ [Abhiram Chintangal](http://bit.ly/1WXtpHr)
+ [Jude Calvillo](http://linkd.in/1BGeytb)
### Tests to be Performed
These have yet to be determined, but some analyses we're currently considering include:
+ Looking for statistically significant differences in...
  - Resolution times by agency (overall and per request type)
  - Resolution times by neighborhood served (overall and per request type)
  - Resolution times by Supervisor/District (overall and per request type)
  - Request types per neighborhood
+ Correlations between...
  - Resolution times and call frequency
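
Since the tests have yet to be chosen, the following is only an illustrative sketch of what two of the analyses above might look like in R. The choice of a Kruskal-Wallis test and a Spearman correlation is an assumption made here (both are rank-based, which suits resolution times if they turn out to be heavily right-skewed), not a decision the team has made. All data in the block is synthetic, and the agency labels and column names are made up.

```r
set.seed(311)  # reproducible toy data; none of these values come from SF311

# Synthetic resolution times (hours) for three made-up agencies.
toy <- data.frame(
  Agency = rep(c("A", "B", "C"), each = 50),
  Resolve.Hours = c(rexp(50, rate = 1 / 24),
                    rexp(50, rate = 1 / 48),
                    rexp(50, rate = 1 / 240))
)

# Kruskal-Wallis rank sum test: do resolution times differ by agency?
kw <- kruskal.test(Resolve.Hours ~ Agency, data = toy)
print(kw)

# Spearman rank correlation between call frequency and mean resolution
# time, one row per (made-up) neighborhood.
neigh <- data.frame(
  Calls = c(120, 85, 300, 40, 210, 15, 95, 170),
  Mean.Resolve = c(60, 45, 150, 30, 110, 20, 70, 80)
)
sp <- cor.test(neigh$Calls, neigh$Mean.Resolve, method = "spearman")
print(sp)
```

With the real data, the same calls would take the agency (or neighborhood, or district) column as the grouping factor. Running many such comparisons would also call for a multiple-testing correction, for example via `p.adjust()`.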
### Quickies
Just some basic descriptive stats and plots until the team begins its real statistical analyses. *Please note that all of the below draw from a 5,000-record sample*:
![plot of chunk unnamed-chunk-1](figure/unnamed-chunk-1-1.png) ![plot of chunk unnamed-chunk-1](figure/unnamed-chunk-1-2.png)
![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2-1.png)
### Resolution Time Explorations
We'll be adding plots shortly. These are just some summaries to inspire the DSWG's more advanced/inferential statistics.
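
As a rough illustration of how summaries like the ones below can be derived, this sketch parses opened/closed timestamps, computes a resolution time in hours, and averages it per request type. It uses base R only and invented records. The timestamp layout and the request type names match those used in the analysis script, while `parse_ts`, `toy_cases`, and `Resolve.Hours` are names made up for this example.

```r
# Toy records; timestamps follow the "%m/%d/%Y %I:%M:%S %p" layout
# that the analysis script parses. All values are invented.
toy_cases <- data.frame(
  Request.Type = c("Street_Cleaning", "Street_Cleaning", "Sidewalk_Cleaning"),
  Opened = c("01/05/2016 08:00:00 AM", "01/06/2016 09:30:00 AM", "01/07/2016 01:00:00 PM"),
  Closed = c("01/05/2016 02:00:00 PM", "01/07/2016 09:30:00 AM", "01/07/2016 05:30:00 PM"),
  stringsAsFactors = FALSE
)

# Parse a character timestamp into POSIXct in San Francisco local time.
parse_ts <- function(x) {
  as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "America/Los_Angeles")
}

# Resolution time in hours, stored as a plain number.
toy_cases$Resolve.Hours <- as.numeric(
  difftime(parse_ts(toy_cases$Closed), parse_ts(toy_cases$Opened), units = "hours")
)

# Mean resolution time per request type, shortest first.
mean_by_type <- aggregate(Resolve.Hours ~ Request.Type, data = toy_cases, FUN = mean)
mean_by_type <- mean_by_type[order(mean_by_type$Resolve.Hours), ]
print(mean_by_type)
```

Because a mean is sensitive to a handful of very old cases, swapping `FUN = mean` for `FUN = median` is a cheap way to check how much the rankings depend on outliers.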
#### Top 10 Request Types...
**--- By Shortest Mean Resolution Time (across all neighborhoods) ---**
|Request.Type |Mean.Resolve |
|:--------------------------------------------------------|:------------|
|Sign Repair - Loose |0.03 hours |
|mta - residential_parking_permit - request_for_service |0.04 hours |
|tt_collector - tt_collector - mailing_request |0.23 hours |
|county_clerk - county_clerk - request_for_service |0.51 hours |
|puc - water - customer_callback |0.79 hours |
|mta - bicycle - request_for_service |1.20 hours |
|Construction Zone Tow-away Permits for Proven Managment |1.68 hours |
|Litter_Receptacle_Request_New_Removal |1.69 hours |
|homeless_concerns - homeless_other - request_for_service |2.06 hours |
|puc - water - request_for_service |2.12 hours |
**--- By Longest Mean Resolution Time (across all neighborhoods) ---**
|Request.Type |Mean.Resolve |
|:-------------------------------------------|:--------------|
|dpw - bsm - followup_request |27208.90 hours |
|Public_Stairway_Defect |25823.03 hours |
|Streetlight - Other_Request_New_Streetlight |18549.16 hours |
|Utility Lines/Wires |17381.97 hours |
|rpd - rpd_other - request_for_service |11650.02 hours |
|SFHA Priority - Preventive |10512.38 hours |
|sfpd - sfpd - request_for_service |10336.51 hours |
|puc - puco - complaint |8905.67 hours |
|dtis - dtis - request_for_service |8573.08 hours |
|Streetlight - Other_Request_Light_Shield |6312.14 hours |
#### Top 10 Neighborhoods...
**--- By Shortest Mean Resolution Time (across all request types) ---**
|Neighborhood |Mean.Resolve |
|:---------------------|:------------|
|McLaren Park |2.82 hours |
|Candlestick Point SRA |6.30 hours |
|Parkmerced |12.40 hours |
|Merced Manor |15.88 hours |
|Sherwood Forest |24.13 hours |
|Peralta Heights |47.58 hours |
|Alamo Square |47.78 hours |
|Little Hollywood |48.76 hours |
|Lake Street |55.12 hours |
|Balboa Terrace |78.03 hours |
**--- By Longest Mean Resolution Time (across all request types) ---**
|Neighborhood |Mean.Resolve |
|:-------------------|:-------------|
|Holly Park |2540.00 hours |
|Cole Valley |2435.79 hours |
|Cayuga |2343.87 hours |
|Anza Vista |1736.16 hours |
|Presidio Terrace |1541.12 hours |
|Cow Hollow |1462.26 hours |
|Glen Park |1394.10 hours |
|West of Twin Peaks |948.83 hours |
|Northern Waterfront |887.75 hours |
|Castro/Upper Market |875.78 hours |
#### Top 10 Neighborhoods, by Longest Mean Resolution Time for Selected Request Types
**--- For Street Cleaning ---**
|Neighborhood |Mean.Resolve |
|:-------------------|:-------------|
|Inner Sunset |1147.13 hours |
|Castro/Upper Market |900.10 hours |
|West of Twin Peaks |305.28 hours |
|Seacliff |264.89 hours |
|Bayview |161.79 hours |
|Excelsior |99.47 hours |
|Outer Richmond |96.05 hours |
|Outer Sunset |81.76 hours |
|North Beach |79.34 hours |
|Chinatown |69.72 hours |
**--- For Sidewalk Cleaning ---**
|Neighborhood |Mean.Resolve |
|:---------------------|:------------|
|Downtown/Civic Center |305.20 hours |
|Potrero Hill |274.37 hours |
|Haight Ashbury |264.27 hours |
|South of Market |237.16 hours |
|Pacific Heights |205.75 hours |
|Outer Richmond |202.23 hours |
|Parkside |197.74 hours |
|Russian Hill |196.18 hours |
|Outer Mission |155.57 hours |
|North Beach |154.73 hours |
@@ -0,0 +1,119 @@
library(ggplot2)
library(grid)
library(gridExtra)
library(scales)
library(dplyr)
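## Load data and label blank neighborhoods as "NOT DEFINED".
## stringsAsFactors = TRUE is needed on R >= 4.0, where read.csv() no longer creates the factor levels relabeled below.
cases_sample <- na.omit(read.csv("data/cases_sample.csv", stringsAsFactors = TRUE))
levels(cases_sample$Neighborhood)[levels(cases_sample$Neighborhood)==""] <- "NOT DEFINED"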
## Convert Opened and Closed days + times to usable format.
cases_sample$Opened <- as.POSIXct(strptime(cases_sample$Opened, format = "%m/%d/%Y %I:%M:%S %p"), tz = "America/Los_Angeles")
cases_sample$Closed <- as.POSIXct(strptime(cases_sample$Closed, format = "%m/%d/%Y %I:%M:%S %p"), tz = "America/Los_Angeles")
cases_sample$Resolve.Time <- round(difftime(cases_sample$Closed, cases_sample$Opened, units = "hours"), 2)
## Preview data
print(head(cases_sample))
##### Some quickie plots: 311 request sources #####
## Begin w/Freq. table
sources311 <- data.frame(table(cases_sample$Source))
sources311 <- sources311[order(-sources311$Freq),]
sources311$Var1 <- reorder(sources311$Var1, sources311$Freq)
## Flipped bar chart.
sourcebar <- ggplot(sources311, aes(x = Var1, y = Freq, fill = Var1))
sourcebar <- sourcebar + geom_bar(width = 1, stat = "identity") + xlab("")
sourcebar <- sourcebar + theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.y = element_blank())
sourcebar <- sourcebar + coord_flip() + geom_text(label=format(sources311$Freq, digits=2), size = 4)
## Pie chart. Requires new DF, just to trick GGplot to put labels in the right places.
sources311b <- sources311
sources311b$pct <- sources311$Freq/sum(sources311$Freq)
sources311b$pos <- cumsum(sources311b$Freq) - 0.75*sources311b$Freq
sourcepie <- ggplot(sources311b, aes(x = "", y = Freq, fill = Var1))
sourcepie <- sourcepie + geom_bar(width = 1, stat = "identity")
sourcepie <- sourcepie + ylab("Freq / All") + xlab("") + coord_polar("y")
sourcepie <- sourcepie + theme(legend.position="none", axis.ticks.x = element_blank(), axis.ticks.y = element_blank())
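## size is a fixed setting, so it belongs outside aes().
sourcepie <- sourcepie + geom_text(aes(label = percent(pct), y = pos), size = 4)
## grid.arrange() draws the combined plot itself; no print() needed.
grid.arrange(sourcebar, sourcepie, ncol = 2, top = paste0("311 Request Frequency by Source ", "(Total = ", nrow(cases_sample),")"))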
###### Neighborhood demand breakdown ######
## Begin w/freq table of neighborhoods.
neigh311 <- data.frame(table(cases_sample$Neighborhood))
neigh311 <- neigh311[order(-neigh311$Freq),]
neigh311$Var1 <- reorder(neigh311$Var1, neigh311$Freq)
## Flipped bar chart.
neighbar <- ggplot(neigh311[1:10,], aes(x = Var1, y = Freq, fill = Freq))
neighbar <- neighbar + geom_bar(width = 1, stat = "identity") + xlab("")
neighbar <- neighbar + theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.y = element_blank())
neighbar <- neighbar + coord_flip() + geom_text(aes(label = Freq), size = 4)
neighbar <- neighbar + ggtitle("311 Requests Freq by Neighborhood: Top 10")
print(neighbar)
## Neighborhood demand by type.
## Use only the top 5.
neightype <- cases_sample[cases_sample$Neighborhood %in% neigh311$Var1[1:5],]
reqst <- data.frame(table(neightype$Request.Type))
reqst <- reqst[order(-reqst$Freq),]
neightype <- neightype[neightype$Request.Type %in% reqst$Var1[1:10],]
## Flipped bar chart.
typebar <- ggplot(neightype, aes(Request.Type, fill = Request.Type)) + facet_wrap(~Neighborhood)
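## Request.Type is discrete, so rely on geom_bar()'s default counting stat (stat = "bin" fails on discrete x in ggplot2 >= 2.0).
typebar <- typebar + geom_bar(position = "stack") + xlab("") + ylab("")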
typebar <- typebar + ggtitle("Request Types Per Top 5 Neighborhood")
typebar <- typebar + theme(axis.text.x = element_blank())
print(typebar)
######### Response Times Exploration: Very Rough ##########
## First, let's summarize mean resolution time by neighborhood alone.
meanres_neigh <- summarise(group_by(cases_sample, Neighborhood), Mean.Resolve = round(mean(Resolve.Time), 2))
## Let's also summarize mean resolution time by request type alone.
meanres_req <- summarise(group_by(cases_sample, Request.Type), Mean.Resolve = round(mean(Resolve.Time), 2))
## Finally, let's summarize by means per request type in each neighborhood.
meanres_nereq <- summarise(group_by(cases_sample, Neighborhood, Request.Type), Mean.Resolve = round(mean(Resolve.Time), 2))
#### Now, let's compare mean response times, per neighborhood, across all request types...
## Neighborhoods w/shortest resolution times.
meanres_neigh <- meanres_neigh[order(meanres_neigh$Mean.Resolve),]
print(meanres_neigh[1:10,])
## Neighborhoods w/longest resolution times.
meanres_neigh <- meanres_neigh[order(-meanres_neigh$Mean.Resolve),]
print(meanres_neigh[1:10,])
#### Now, let's compare mean resolution times, per request type alone...
## Req types w/shortest mean resolution times.
meanres_req <- meanres_req[order(meanres_req$Mean.Resolve),]
print(meanres_req[1:10,])
## Req types w/longest mean resolution times.
meanres_req <- meanres_req[order(-meanres_req$Mean.Resolve),]
print(meanres_req[1:10,])
#### Let's compare mean response times, per neighborhood + request type...
## Beginning w/street cleaning.
meanres_street <- meanres_nereq[meanres_nereq$Request.Type == "Street_Cleaning",]
meanres_street <- meanres_street[order(-meanres_street$Mean.Resolve),]
print(meanres_street[1:10,])
## Let's do the same for sidewalk cleaning.
meanres_side <- meanres_nereq[meanres_nereq$Request.Type == "Sidewalk_Cleaning",]
meanres_side <- meanres_side[order(-meanres_side$Mean.Resolve),]
print(meanres_side[1:10,])