print line 100,000 and then pull out 60th "column"

Analyzing UK Government Spending from and for OpenSpending. See:

This OpenSpending things-to-do issue: openspending/ideas#5
This post http://okfnlabs.org/blog/2013/04/11/quick-and-dirty-analysis-on-large-csv.html
Scratchpad google doc

Roadmap

Pull a sample of data (and push to s3)
Load scripts
Identify initial questions / analyses
- Biggest recipients
- Department spending over time (can we see the spending cuts?)
- Extract entities with counts

Data

The Data Package for UK Spending: https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json
The all in one CSV file (3.7Gb): http://data.etl.openspending.org/uk25k/spending-latest.csv

Summary

Size: ~3.8Gb, 5.5m records
- 2010 ends around line 535280
Essential fields

Analysis

Preparation

Grab the data into the cache subdirectory
Create a Postgresql database named gbspend (you could use any name but you'll need to change names in scripts).
Prep a load script using data package tools load-postgresql script:

python bin/load-postgresql.py cache/datapackage.sample.json > scripts/create.sql
Use this to create table ("sample") and load sample data into it:

psql gbspend -f scripts/create.sql

Note: load of sample is ~12s (535k rows). Load of full dataset and sample ~2m35s (5.480m rows)

Running Some SQL

We have various sql scripts in the scripts directory.

As an example, the analysis script calculates various aggregate totals. You can run it by doing:

psql gbspend -f scripts/analysis.sql

By default the scripts run on the sample table. If we want to change we can just pipe the script through find and replace (sed) before running it:

cat scripts/analysis.sql | sed "s/sample/latest/g" > psql gbspend

Warning: running this prior to creating relevant indexes will be slow!

Quick and Dirty Techniques

Looking at specific rows

As part of preparing a sample we needed to search through the data quickly. Specifically, data is approx date ordered and we wanted to try pulling out all of 2010. This required a search for last row containing 2010 data. However, the technique we use is general and very fast.

Easiest way is printing out a specific line and checking the date. This is a big files (millions of lines) so we can't open in a spreadsheet. We use some simple unix tools to do this quickly.

print line 100,000 and then pull out 60th "column"

NB: cut splits on ',' not on CSV columns.

Usually the same but not if ',' in one of the column contents

sed '100000q;d' latest.csv | cut -d, -f 60

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
scripts		scripts
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts

scripts

.gitignore

.gitignore

README.md

README.md

Repository files navigation

Roadmap

Data

Summary

Analysis

Preparation

Running Some SQL

Quick and Dirty Techniques

Looking at specific rows

print line 100,000 and then pull out 60th "column"

NB: cut splits on ',' not on CSV columns.

Usually the same but not if ',' in one of the column contents

About

Releases

Packages

Languages

rufuspollock/gov-spending-gb-analysis

Folders and files

Latest commit

History

Repository files navigation

Roadmap

Data

Summary

Analysis

Preparation

Running Some SQL

Quick and Dirty Techniques

Looking at specific rows

print line 100,000 and then pull out 60th "column"

NB: cut splits on ',' not on CSV columns.

Usually the same but not if ',' in one of the column contents

About

Resources

Stars

Watchers

Forks

Languages