
Repo updates: 311 reporting + CodeAcross news

+ Moved 311's exploratory analyses to sub-directory and laid groundwork
for final reporting.
+ Mentioned recent CodeAcross happenings.
+ Added mention of upcoming 'Fire Risk' project.
judecalvillo committed Mar 24, 2016
1 parent 95bdb8c commit 535ad3c153a6cfd2fcf21f1eca967fb68996b299
@@ -24,5 +24,15 @@ The DSWG's initiatives are centered on building data science components for CfSF
+ [SF's budget allocations/visualization: Measuring social impact >>](https://github.com/RocioSNg/SF_brigade_impact_gov)
+ [Interactive visualization of SF's building emissions and energy use >>](https://github.com/smoningi/SF-Environment-Benchmark)
+ [SF 311 Data Analysis >>](https://github.com/sfbrigade/data-science-wg/tree/master/projects-in-this-repo/SF_311_Data-Analysis)
+ [Adopt-a-Drain: Predicting flooded drains (per drain, zip, and/or day) >>](https://github.com/sfbrigade/data-science-wg/tree/master/projects-in-this-repo/Drain-Flooding_Prediction)
+ [Predicting Relative Risk of Fire in SF's Buildings](#)
### Recent Happenings
The team recently presented some of its work on the above at Code for San Francisco's CodeAcross 2016, where we hoped to inspire the day's hackers to use SF's OpenData portal in creative and productive ways.
[To View a Small Photo Gallery from the Event, Click Here >>](http://bit.ly/1MDt5az)
[![](cfa_codeacross_sf_data-science.jpg)](http://bit.ly/1MDt5az)
[To View Our Presentation (with Plots and Overviews), Click Here >>](http://bit.ly/1UNIg7H)
[![](cfa_codeacross_sf_presentation.jpg)](http://bit.ly/1UNIg7H)
@@ -6,6 +6,6 @@ Please note: Many of our initiatives are add-ons to other groups' projects, so o
+ [ParkSafe GIS coordinate realignment (w/street paths) >>](https://github.com/sfbrigade/data-science-wg/tree/master/projects-in-this-repo/Park-Safe_GIS-Solution)
+ [SF's budget allocations/visualization: Measuring social impact >>](https://github.com/RocioSNg/SF_brigade_impact_gov)
+ [Interactive visualization of SF's building emissions and energy use >>](#)
+ [Interactive visualization of SF's building emissions and energy use >>](https://github.com/smoningi/SF-Environment-Benchmark)
+ [SF 311 Data Analysis >>](https://github.com/sfbrigade/data-science-wg/tree/master/projects-in-this-repo/SF_311_Data-Analysis)
+ [Adopt-a-Drain: Predicting flooded drains (per drain, zip, and/or day) >>](https://github.com/sfbrigade/data-science-wg/tree/master/projects-in-this-repo/Drain-Flooding_Prediction)
+ [Predicting Relative Risk of Fire in SF's Buildings](#)
@@ -0,0 +1,130 @@
![](311_explore.jpg)
## 311 Case Data: Data Analysis
[SF OpenData](https://data.sfgov.org/) provides a real-time record and API for [311 cases completed and in progress](https://data.sfgov.org/City-Infrastructure/Case-Data-from-San-Francisco-311-SF311-/vw6y-z8j6). The Data Science Working Group at Code for San Francisco is performing exploratory statistical analyses on this data to see whether it might possess strategically and/or politically interesting characteristics, which we will later confirm via inferential statistics and report to relevant stakeholders (e.g. San Francisco's public agencies and/or the publics they serve).
**Responsible DSWG Teammates**
+ [Matthew Pancia](http://bit.ly/1PFuA8k)
+ [Elena Palesis](http://bit.ly/1mgjXl4)
+ [Yiwen Yu](http://bit.ly/1mgkqDE)
+ [Jude Calvillo](http://linkd.in/1BGeytb)
+ [Jeff Lam](http://bit.ly/1Pm9SLJ)
+ [Matthew Mollison](http://bit.ly/1PPZXSa)
### Current Status: March 10, 2016
We recently shared some of our more interesting findings and visualizations at [Code for America's CodeAcross in San Francisco (March 5, 2016)](https://www.codeforamerica.org/events/codeacross-2016/). Afterward, we had some great discussions with the City's Chief Data Officer, Joy Bonaguro, in which we learned about some things we could do to hone our accuracy and utility. We've since updated our Census profiling to the tract level and matched our neighborhoods to those more commonly used in SF OpenData.
*Please note: This project's README and directory will substantially change in the coming week, as we begin to publicly display our inferential statistics and lay the groundwork for our final report.*
### Statistical Tests to be Performed
The tests we're currently tackling include:
+ Income correlates / significant diffs?
+ Resolution time (by agency, overall, neighborhood, income, etc)?
+ C.Neighborhoods per request type?
+ Ethnic correlates / significant diffs?
+ Significant diffs in request types by source?
+ Seasonality to request types?
+ Interaction between call frequency and resolution time, per request type and/or per responsible agency?
### Exploratory Quickies
These are just some early descriptive plots, until the team begins systematically tackling the statistical analyses mentioned above.
*Please note: all of the plots below draw from a 5,000-record sample.*
![plot of chunk unnamed-chunk-1](figure/unnamed-chunk-1-1.png)![plot of chunk unnamed-chunk-1](figure/unnamed-chunk-1-2.png)
![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2-1.png)
### Similarity of Request Type Distributions (K-L Divergence)
The graph below, produced by Matt Pancia, clusters neighborhoods according to the similarity of their request type distributions, as reflected by their [Kullback–Leibler divergence/weight](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
![](figure/kl_divergence_graph.png)
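The README does not say exactly how the divergences behind this graph were computed, so the following is only a minimal sketch of the general idea: compare two neighborhoods' request-type distributions with K-L divergence. The additive smoothing (to avoid dividing by zero when a request type is absent from one neighborhood), the symmetrised variant, and the counts themselves are all assumptions made for illustration.

```python
import math


def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """Return D(P || Q) in nats for two dicts of request-type counts.

    Additive smoothing is an illustrative choice, not the team's method.
    """
    types = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.get(t, 0) + smoothing for t in types)
    q_total = sum(q_counts.get(t, 0) + smoothing for t in types)
    divergence = 0.0
    for t in types:
        p = (p_counts.get(t, 0) + smoothing) / p_total
        q = (q_counts.get(t, 0) + smoothing) / q_total
        divergence += p * math.log(p / q)
    return divergence


def symmetric_kl(p_counts, q_counts):
    """K-L divergence is asymmetric; summing both directions gives a
    symmetric dissimilarity that is easier to use for clustering."""
    return kl_divergence(p_counts, q_counts) + kl_divergence(q_counts, p_counts)


# Made-up counts, purely for illustration.
mission = {"Street Cleaning": 120, "Graffiti": 80, "Streetlight": 10}
seacliff = {"Street Cleaning": 15, "Tree Maintenance": 30, "Streetlight": 12}

if __name__ == "__main__":
    print(round(symmetric_kl(mission, seacliff), 3))
```

A matrix of such pairwise values could then feed a clustering or graph-layout step; which step was used for the figure above is not stated here.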
### Time-Lapse Heatmap of 311 Requests for Sidewalk and Street Cleaning
The heatmap linked below, produced by Jeff Lam, maps the number of 311 requests for sidewalk and street cleaning over time. It will help inform our upcoming investigation of seasonality in request types.
[![](figure/cartodb_heatmap_sf-311-calls.jpg)](http://bit.ly/1WnReqW)
### Daily Counts of 311 Cases by Category
This plot, by Matt Mollison, shows the total number of 311 cases by request Category (a higher-order grouping), per day since 2008. It was drawn from the entire dataset (vs. a sample).
![](figure/311-request-category_matt-mollison.png)
### Resolution Time Exploration (in Hours)
We'll be adding plots later. These are just some summaries to inspire the DSWG's more advanced/inferential statistics.
#### Top 10 Request Types...
**--- By Shortest Median Resolution Time (across all neighborhoods) ---**
|Request.Type | Median.Resolve|
|:--------------------------------------------------------|--------------:|
|Sign Repair - Loose | 0.03|
|mta - residential_parking_permit - request_for_service | 0.04|
|mta - parking_enforcement - request_for_service | 0.18|
|tt_collector - tt_collector - mailing_request | 0.23|
|puc - water - request_for_service | 0.47|
|puc - water - customer_callback | 0.79|
|Water_leak | 1.09|
|mta - bicycle - request_for_service | 1.20|
|Litter_Receptacle_Request_New_Removal | 1.69|
|homeless_concerns - homeless_other - request_for_service | 2.06|
**--- By Longest Median Resolution Time (across all neighborhoods) ---**
|Request.Type | Median.Resolve|
|:-------------------------------------------|--------------:|
|dpw - bsm - followup_request | 27208.90|
|Public_Stairway_Defect | 25823.03|
|Streetlight - Other_Request_New_Streetlight | 18549.16|
|Utility Lines/Wires | 17381.97|
|rpd - rpd_other - request_for_service | 11650.02|
|sfpd - sfpd - request_for_service | 10336.51|
|puc - puco - complaint | 8905.67|
|dtis - dtis - request_for_service | 6662.18|
|Streetlight - Other_Request_Light_Shield | 6312.14|
|Building - Illegal_Guest_Room_Conversions | 6300.25|
#### Top 10 C.Neighborhoods...
**--- By Shortest Median Resolution Time (across all request types) ---**
|C.Neighborhood | Median.Resolve|
|:-----------------|--------------:|
|McLaren Park | 7.25|
|Lakeshore | 21.61|
|Lincoln Park | 22.63|
|South of Market | 39.49|
|Tenderloin | 45.69|
|Mission | 47.25|
|Nob Hill | 47.83|
|Hayes Valley | 51.64|
|Lone Mountain/USF | 53.18|
|North Beach | 53.63|
**--- By Longest Median Resolution Time (across all request types) ---**
|C.Neighborhood | Median.Resolve|
|:----------------|--------------:|
|Potrero Hill | 170.34|
|Pacific Heights | 138.16|
|Treasure Island | 133.12|
|Glen Park | 132.21|
|Seacliff | 124.91|
|Inner Richmond | 120.26|
|Presidio Heights | 113.30|
|Mission Bay | 109.09|
|Golden Gate Park | 102.84|
|Inner Sunset | 102.82|
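Summaries like the ones above can be computed by grouping closed cases and taking the median of their open-to-close durations. The sketch below uses only Python's standard library; the field names (`opened`, `closed`, `neighborhood`, `request_type`) and the sample records are hypothetical and do not reflect the actual SF OpenData schema or the team's own code.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_resolution_hours(cases, key):
    """Group closed cases by `key` and return (group, median hours) pairs,
    sorted from shortest to longest median resolution time."""
    hours_by_group = defaultdict(list)
    for case in cases:
        if case["closed"] is None:
            continue  # still-open cases have no resolution time yet
        opened = datetime.fromisoformat(case["opened"])
        closed = datetime.fromisoformat(case["closed"])
        hours = (closed - opened).total_seconds() / 3600
        hours_by_group[case[key]].append(hours)
    summary = [(group, round(median(h), 2)) for group, h in hours_by_group.items()]
    return sorted(summary, key=lambda pair: pair[1])


# Made-up records, purely for illustration.
cases = [
    {"neighborhood": "Mission", "request_type": "Water_leak",
     "opened": "2016-01-01T08:00:00", "closed": "2016-01-01T10:00:00"},
    {"neighborhood": "Mission", "request_type": "Water_leak",
     "opened": "2016-01-02T08:00:00", "closed": "2016-01-02T12:00:00"},
    {"neighborhood": "Glen Park", "request_type": "Public_Stairway_Defect",
     "opened": "2016-01-01T00:00:00", "closed": "2016-01-03T00:00:00"},
    {"neighborhood": "Glen Park", "request_type": "Water_leak",
     "opened": "2016-01-05T09:00:00", "closed": None},
]

if __name__ == "__main__":
    for group, hours in median_resolution_hours(cases, "neighborhood"):
        print(group, hours)
```

Taking the first or last ten entries of the sorted result gives "shortest" and "longest" tables of the kind shown above. The median is a sensible choice here because resolution times are heavily skewed, as the range in the request-type tables suggests.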
@@ -5,126 +5,67 @@
[SF OpenData](https://data.sfgov.org/) provides a real-time record and API for [311 cases completed and in progress](https://data.sfgov.org/City-Infrastructure/Case-Data-from-San-Francisco-311-SF311-/vw6y-z8j6). The Data Science Working Group at Code for San Francisco is performing exploratory statistical analyses on this data to see whether it might possess strategically and/or politically interesting characteristics, which we will later confirm via inferential statistics and report to relevant stakeholders (e.g. San Francisco's public agencies and/or the publics they serve).
**Responsible DSWG Teammates**
+ [Matthew Pancia](http://bit.ly/1PFuA8k)
+ [Matthew Pancia, Ph.D.](http://bit.ly/1PFuA8k)
+ [Elena Palesis](http://bit.ly/1mgjXl4)
+ [Yiwen Yu](http://bit.ly/1mgkqDE)
+ [Jude Calvillo](http://linkd.in/1BGeytb)
+ [Jude Calvillo (Project Lead)](http://linkd.in/1BGeytb)
+ [Jeff Lam](http://bit.ly/1Pm9SLJ)
+ [Matthew Mollison](http://bit.ly/1PPZXSa)
+ [Matthew Mollison, Ph.D.](http://bit.ly/1PPZXSa)
+ [Hannah Burak](http://bit.ly/1U7D13N)
### Current Status: March 10, 2016
### Current Status: March 23, 2016
We recently shared some of our more interesting findings and visualizations at [Code for America's CodeAcross in San Francisco (March 5, 2016)](https://www.codeforamerica.org/events/codeacross-2016/). Afterward, we had some great discussions with the City's Chief Data Officer, Joy Bonaguro, in which we learned about some things we could do to hone our accuracy and utility. We've since updated our Census profiling to the tract level and matched our neighborhoods to those more commonly used in SF OpenData.
After some great discussions with the City's Chief Data Officer, Joy Bonaguro, we've updated our Census profiling to the tract level and matched our neighborhoods to those more commonly used in SF OpenData, so as to help with accuracy and agency compatibility.
*Please note: This project's README and directory will substantially change in the coming week, as we begin to publicly display our inferential statistics and lay the groundwork for our final report.*
We've also *moved our exploratory analyses to a sub-directory*, as we're now working through our more substantive statistical tests and reporting.
+ [**Click to view our exploratory analyses (visuals, summaries, etc) >>**](/Exploratory_Analyses/)
### Statistical Tests to be Performed
*What you see below is a foundation for integrating our statistical tests and final reporting.*
The tests we're currently tackling include:
### Introduction
+ Income correlates / significant diffs?
+ Resolution time (by agency, overall, neighborhood, income, etc)?
+ C.Neighborhoods per request type?
+ Ethnic correlates / significant diffs?
+ Significant diffs in request types by source?
+ Seasonality to request types?
+ Interaction between call frequency and resolution time, per request type and/or per responsible agency?
### Exploratory Quickies
These are just some early descriptive plots, until the team begins systematically tackling the statistical analyses mentioned above.
+ Intro w/impetus.
+ Acknowledgements.
*Please note: This introduction will soon replace the one above.*
*Please note: all of the plots below draw from a 5,000-record sample.*
### Literature Review
![plot of chunk unnamed-chunk-1](figure/unnamed-chunk-1-1.png)![plot of chunk unnamed-chunk-1](figure/unnamed-chunk-1-2.png)
+ What's been done for 311 in other cities?
+ What complications have they encountered?
+ What's possible?
![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2-1.png)
### Methodology
### Similarity of Request Type Distributions (K-L Divergence)
The graph below, produced by Matt Pancia, clusters neighborhoods according to the similarity of their request type distributions, as reflected by their [Kullback–Leibler divergence/weight](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
+ Overall approach and boundaries, along with approach to special circumstances.
+ Census data concerns and integration.
+ Data sources.
![](figure/kl_divergence_graph.png)
### Research Questions
### Time-Lapse Heatmap of 311 Requests for Sidewalk and Street Cleaning
The heatmap linked below, produced by Jeff Lam, maps the number of 311 requests for sidewalk and street cleaning over time. It will help inform our upcoming investigation of seasonality in request types.
#### Operational Concerns
[![](figure/cartodb_heatmap_sf-311-calls.jpg)](http://bit.ly/1WnReqW)
1. What case features predict cases that would later become 'invalid'?
2. Can we predict requests of one type from requests of one or more other types?
3. Can we forecast, with reasonable accuracy, the frequency of one or more request categories from their apparent seasonality?
4. Similarly, can we detect anomalies in frequency of one or more request categories, particularly per census tract?
5. Has reporting 'homeless concerns' substantially changed since 311 changed its app and voice menus around such reporting?
6. Is there any potential for more responsive reporting tools via app or voice interfaces? (i.e. they change upon meeting one or more conditions)
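Questions 3 and 4 above could start from a very simple seasonal baseline: use the historical mean for each calendar month as the forecast, and flag an observation as anomalous when it sits far from that mean. The sketch below is one such baseline under stated assumptions; the monthly granularity, the three-standard-deviation threshold, and the counts are illustrative choices, not the team's chosen method.

```python
from collections import defaultdict
from statistics import mean, stdev


def seasonal_profile(history):
    """Build {month: (mean count, sample std dev)} from (month, count) pairs.

    A month with a single observation gets a spread of 0.0.
    """
    by_month = defaultdict(list)
    for month, count in history:
        by_month[month].append(count)
    return {
        month: (mean(counts), stdev(counts) if len(counts) > 1 else 0.0)
        for month, counts in by_month.items()
    }


def forecast(profile, month):
    """Seasonal-naive forecast: the historical mean for that month."""
    return profile[month][0]


def is_anomalous(profile, month, observed, threshold=3.0):
    """Flag counts more than `threshold` std devs from the monthly mean."""
    expected, spread = profile[month]
    if spread == 0:
        return observed != expected
    return abs(observed - expected) / spread > threshold


# Made-up monthly counts for one request category in one census tract.
history = [(1, 100), (1, 110), (1, 90), (7, 200), (7, 220), (7, 180)]

if __name__ == "__main__":
    profile = seasonal_profile(history)
    print(forecast(profile, 7), is_anomalous(profile, 1, 150))
```

A baseline like this mainly serves as a yardstick: a more elaborate seasonal model is only worth adopting if it forecasts held-out months better than the monthly mean does.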
### Daily Counts of 311 Cases by Category
This plot, by Matt Mollison, shows the total number of 311 cases by request Category (a higher-order grouping), per day since 2008. It was drawn from the entire dataset (vs. a sample).
![](figure/311-request-category_matt-mollison.png)
### Resolution Time Exploration (in Hours)
We'll be adding plots later. These are just some summaries to inspire the DSWG's more advanced/inferential statistics.
#### Top 10 Request Types...
**--- By Shortest Median Resolution Time (across all neighborhoods) ---**
|Request.Type | Median.Resolve|
|:--------------------------------------------------------|--------------:|
|Sign Repair - Loose | 0.03|
|mta - residential_parking_permit - request_for_service | 0.04|
|mta - parking_enforcement - request_for_service | 0.18|
|tt_collector - tt_collector - mailing_request | 0.23|
|puc - water - request_for_service | 0.47|
|puc - water - customer_callback | 0.79|
|Water_leak | 1.09|
|mta - bicycle - request_for_service | 1.20|
|Litter_Receptacle_Request_New_Removal | 1.69|
|homeless_concerns - homeless_other - request_for_service | 2.06|
**--- By Longest Median Resolution Time (across all neighborhoods) ---**
|Request.Type | Median.Resolve|
|:-------------------------------------------|--------------:|
|dpw - bsm - followup_request | 27208.90|
|Public_Stairway_Defect | 25823.03|
|Streetlight - Other_Request_New_Streetlight | 18549.16|
|Utility Lines/Wires | 17381.97|
|rpd - rpd_other - request_for_service | 11650.02|
|sfpd - sfpd - request_for_service | 10336.51|
|puc - puco - complaint | 8905.67|
|dtis - dtis - request_for_service | 6662.18|
|Streetlight - Other_Request_Light_Shield | 6312.14|
|Building - Illegal_Guest_Room_Conversions | 6300.25|
#### Top 10 C.Neighborhoods...
**--- By Shortest Median Resolution Time (across all request types) ---**
|C.Neighborhood | Median.Resolve|
|:-----------------|--------------:|
|McLaren Park | 7.25|
|Lakeshore | 21.61|
|Lincoln Park | 22.63|
|South of Market | 39.49|
|Tenderloin | 45.69|
|Mission | 47.25|
|Nob Hill | 47.83|
|Hayes Valley | 51.64|
|Lone Mountain/USF | 53.18|
|North Beach | 53.63|
**--- By Longest Median Resolution Time (across all request types) ---**
|C.Neighborhood | Median.Resolve|
|:----------------|--------------:|
|Potrero Hill | 170.34|
|Pacific Heights | 138.16|
|Treasure Island | 133.12|
|Glen Park | 132.21|
|Seacliff | 124.91|
|Inner Richmond | 120.26|
|Presidio Heights | 113.30|
|Mission Bay | 109.09|
|Golden Gate Park | 102.84|
|Inner Sunset | 102.82|
#### Equity Concerns
1. Which features of calls (location, request type, source, case length, etc.) are good predictors of income?
2. How are 311 requests distributed across Census block income levels?
3. How are 311 request resolution times distributed across Census block income levels?
4. Is there a correlation between resolution time and the percentage (%) of residents belonging to any given racial or ethnic group?
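As a starting point for question 4, a correlation coefficient between a per-area resolution-time summary and a demographic percentage can be computed directly. The sketch below implements Pearson's r with the standard library; the per-tract values are invented for illustration, and whether Pearson's r (rather than a rank-based measure, which may suit skewed resolution times better) is the right statistic is an open assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Made-up per-tract values: median resolution time (hours) and the
# percentage of residents in some demographic group.
median_hours = [40.0, 55.0, 70.0, 90.0, 120.0]
percent_group = [10.0, 18.0, 22.0, 35.0, 41.0]

if __name__ == "__main__":
    print(round(pearson_r(median_hours, percent_group), 3))
```

A coefficient on its own says nothing about significance or causation; request mix and responsible agency differ across tracts and would need to be controlled for before drawing equity conclusions.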
### Conclusion / Implications
#### Public Policy
#### Operations
#### Product Opportunities
#### Future Research
### Appendix
