Towards an implementation of EatSure.ca with Vancouver Coastal Health data
EatSure Van Data Grabber
Tim Smith, Naoya Makino
December 4, 2010

Hacked together real fast! 'Cause that's how we roll.

Here are some scripts to scrape restaurant inspection data from the unfriendly
Vancouver Coastal Health restaurant inspection website.
(http://www.foodinspectionweb.vcha.ca/)

See it more or less in action at:
http://www.google.com/fusiontables/DataSource?snapid=111755 -- I recommend the
map view, myself. Note that inconsistent address information causes some of the
points to appear in unlikely locations.

Things that you need:

1. Python 2.x
2. BeautifulSoup
3. A sense of adventure

Run them in this order:

1. python fetchandparselist.py
   Fetches and parses the entire list of restaurants from VCH.
   Creates restaurants.html.
   Outputs restos.tab, which is a tab-separated text file. It contains
   restaurant name, address, and date of last inspection. The last column is
   the restaurant's GUID.

2. mkdir restos; python fetchrestos.py
   Fetches the restaurant info pages for each individual restaurant. For bonus
   points, parallelize this yourself; it takes a while.

3. python parserestos.py
   Creates an index of inspections that have occurred, the date, and the type
   of inspection.
   Creates inspections.repr.

To fetch each individual inspection report (our current "score" does not depend
on these, but probably should):

4. mkdir inspections; python fetchinspections.py start finish
   start and finish are indices into the list of restaurants, so you can fetch
   a sliver of the data set at a time. This is useful for parallelizing the
   fetch. For example, you can start six simultaneous instances of
   fetchinspections: [0, 1000), [1000, 2000), [2000, 3000), [3000, 4000),
   [4000, 5000), and [5000, 6000). Python will not complain if you overshoot
   the end of the data set on the last one. There are slightly fewer than 6000
   restaurants in the database.
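
The start/finish slivers for fetchinspections.py can be worked out with a few
lines of Python. This is only a sketch: slice_bounds and the stand-in
restaurant list are hypothetical names, not part of the scripts, and it assumes
fetchinspections.py takes an ordinary list slice, which would explain why
overshooting the end is harmless.

```python
def slice_bounds(total, width):
    """Return half-open (start, finish) pairs covering indices 0..total-1."""
    return [(start, start + width) for start in range(0, total, width)]


# Hypothetical stand-in for the restaurant list; the real length is only
# known to be slightly under 6000.
restaurants = ["resto-%d" % i for i in range(5950)]

bounds = slice_bounds(len(restaurants), 1000)
for start, finish in bounds:
    # The last pair overshoots the list; slicing clamps it without an error.
    sliver = restaurants[start:finish]
    print("python fetchinspections.py %d %d  # %d restaurants"
          % (start, finish, len(sliver)))
```

Each printed line is one command to run in its own terminal. The last pair
reaches past the end of the list and simply yields a shorter sliver.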

You will see some 500 errors, which happen because the scraper (erroneously)
tries to grab a Permit object instead when there are no inspections to grab.
These should not be cause for concern.

To generate the score reports we have now:

1. python score.py
   Generates scores.pickle, which contains a dictionary linking the restaurant
   GUID to the number of re-inspections noted recently on the VCH website (the
   "score").

2. python mkcsv.py
   Generates a CSV file suitable for upload to Google Fusion Tables.
   Creates (imaginatively) output.csv.

And that's where we are.
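
For a feel of the score-to-CSV step, here is a rough sketch of turning a
GUID-to-count dictionary into CSV rows. It is not the code in mkcsv.py: the
function name, the column names, and the sample GUIDs are all made up, and the
real output.csv presumably also carries the name and address needed for
mapping. It is written for Python 3, whereas the scripts themselves target
Python 2.x.

```python
import csv
import io
import pickle


def write_scores_csv(scores, out):
    """Write one row per restaurant: GUID, then re-inspection count."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["guid", "reinspections"])  # assumed column names
    for guid in sorted(scores):
        writer.writerow([guid, scores[guid]])


# Made-up GUIDs and counts, for illustration only.
sample = {"guid-b": 0, "guid-a": 2}

# Round-trip through pickle in memory, standing in for scores.pickle.
loaded = pickle.loads(pickle.dumps(sample))

buf = io.StringIO()
write_scores_csv(loaded, buf)
print(buf.getvalue())
```

Sorting by GUID keeps the output stable between runs, which makes it easier to
diff two uploads.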