This repository contains code and data related to Greater London Authority (GLA) spending. Its primary purpose is to prepare the openly available GLA data for loading into OpenSpending.
See also this blog post: http://schoolofdata.org/2013/03/26/using-sql-for-lightweight-data-analysis/
Consolidated data is in data/all.csv. For the schema see datapackage.json.
Do the following steps:
- Pull down a copy of the data:
  node scripts/scrape.js
- Symlink the directory with the downloaded data to archive/latest (a relative link target is resolved from the link's own directory, archive/, so give it without the archive/ prefix):
  ln -s {current-date} archive/latest
- Clean the data:
  node scripts/process.js
This data is pretty horrible. The current 65 files (summer 2013) use 20 or more different CSV structures. See scripts/process.js for the gory details.
CSV files are listed on http://www.london.gov.uk/mayor-assembly/gla/spending-money-wisely/budget-expenditure-charges/expenditure-over-250
That site states:
The Mayor is committed to providing financial transparency. In 2008 he instructed that regular reports should be published on all GLA expenditure over £1,000 (including VAT). From summer 2010 the reporting threshold was reduced to £500 (including VAT), and from Period 1 2011/12 the reporting threshold was changed to £500 excluding VAT. From Period 2 2012/13 onwards the reporting threshold was changed to £250 excluding VAT.
From Period 4 2012/13 onwards the report includes expenditure from the GLA's subsidiary, GLA Land & Property Ltd.
There are more than 60 CSV files as of July 2013 (a list can be found in scrape.json).
Unfortunately the "format" varies substantially, not only in terms of fields but also in, for example, the number of blank columns and blank lines.
A summary can be found in this Data Explorer gist.
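The blank columns and blank lines mentioned above could be stripped before any further cleaning. The following is a minimal sketch, not the code in scripts/process.js: it assumes each file has already been parsed into an array of rows (arrays of strings) by some CSV parser, and the function name `dropBlankRowsAndColumns` and the sample values are made up for illustration.

```javascript
// Hypothetical helper: removes rows and columns that contain no data at all.
// `rows` is an array of arrays of strings, as produced by a CSV parser (assumed).
function dropBlankRowsAndColumns(rows) {
  const isBlank = cell =>
    cell === undefined || cell === null || String(cell).trim() === '';

  // Drop rows in which every cell is blank (this includes zero-length rows).
  const nonBlankRows = rows.filter(row => !row.every(isBlank));

  // Rows may have different lengths, so find the widest one.
  const width = nonBlankRows.reduce((max, row) => Math.max(max, row.length), 0);

  // Keep only the column indexes that hold a value in at least one row.
  const keep = [];
  for (let col = 0; col < width; col++) {
    if (nonBlankRows.some(row => !isBlank(row[col]))) keep.push(col);
  }

  // Rebuild each row from the kept columns, trimming surrounding whitespace.
  return nonBlankRows.map(row =>
    keep.map(col => (isBlank(row[col]) ? '' : String(row[col]).trim()))
  );
}

// Made-up sample: one blank leading column, one blank middle column, one blank row.
const sampleRows = [
  ['', 'Vendor', '', 'Amount'],
  ['', '', '', ''],
  ['', 'ACME', '', ' 100 '],
];
console.log(dropBlankRowsAndColumns(sampleRows));
```

Working on already-parsed rows keeps the sketch independent of whichever CSV library does the parsing.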
Aside: from the presence of "SAP Document No" field in several of the CSVs it appears likely that the GLA are using SAP for their accounting systems.
- (July 2013) Period 8 2012/13 is an HTML file showing a 403 Access Denied from someone's login session
- (March 2013) Bad file for Period 8 2012/13 (13 October - 10 November). The file is not named in the usual way "Mayor's%20250%20Report%20-%202012-13%20-%20P8%20%20-%20Final.csv" and appears to be an Excel file that was not converted to CSV!
- Amounts are formatted with "," as a thousands separator, so software reads them as strings rather than numbers.
- Dates vary substantially in format from "16 Mar 2011" in this file to "21.01.2010" in January 2010 data
- Use of (978) to indicate negative amounts rather than -978
- Repeated data in 2012-13-P4 file
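The amount and date issues listed above can be handled with two small normalizers. This is a rough sketch under stated assumptions rather than the logic in scripts/process.js: the names `parseAmount` and `parseDate` are hypothetical, only the two date formats quoted above are recognised, and anything else returns `null` so it can be inspected by hand.

```javascript
// Hypothetical helpers, written for illustration only.
const MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                'jul', 'aug', 'sep', 'oct', 'nov', 'dec'];

// Turn "1,234.50" into 1234.5 and "(978)" into -978.
function parseAmount(raw) {
  let s = String(raw).trim();
  // Accounting style: a value wrapped in parentheses is negative.
  const negative = /^\(.*\)$/.test(s);
  // Strip parentheses, thousands separators and stray whitespace.
  s = s.replace(/[(),\s]/g, '');
  if (s === '' || Number.isNaN(Number(s))) return null;
  const n = Number(s);
  return negative ? -n : n;
}

// Format year, month, day as an ISO 8601 date string (YYYY-MM-DD).
function toIso(year, month, day) {
  const pad = n => String(n).padStart(2, '0');
  return `${year}-${pad(month)}-${pad(day)}`;
}

// Accept "16 Mar 2011" and "21.01.2010"; return null for anything else.
function parseDate(raw) {
  const s = String(raw).trim();

  // Day.month.year with dots, e.g. "21.01.2010".
  let m = s.match(/^(\d{1,2})\.(\d{1,2})\.(\d{4})$/);
  if (m) return toIso(Number(m[3]), Number(m[2]), Number(m[1]));

  // Day, three-letter English month name, year, e.g. "16 Mar 2011".
  m = s.match(/^(\d{1,2}) ([A-Za-z]{3}) (\d{4})$/);
  if (m) {
    const month = MONTHS.indexOf(m[2].toLowerCase()) + 1;
    if (month > 0) return toIso(Number(m[3]), month, Number(m[1]));
  }
  return null;
}

console.log(parseAmount('(978)'));     // -978
console.log(parseDate('16 Mar 2011')); // 2011-03-16
```

Returning `null` instead of guessing makes unrecognised formats easy to count per file, which helps when a new structure shows up in a later period.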
- Script to convert a given file (month) into a standardized CSV form (follow data package)
- Clean up dates
- Clean up amounts (remove ',')
- Post result to http://data.openspending.org/ (s3)
- Load most recent month to OpenSpending
- Script to consolidate all files
- Post result to http://data.openspending.org/
- Load this to OpenSpending
- Post on the OpenSpending City map
Repeat the monthly part each month as new data becomes available!