This repository has been archived by the owner on May 17, 2018. It is now read-only.
Home
Chris Chang edited this page Feb 17, 2015 · 7 revisions
- Update the app in as few commands as possible. Even if it's one command that runs a bunch of other commands, that meta command serves as documentation.
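A minimal sketch of such a meta command, assuming a plain Python script is acceptable (a Makefile or shell script would do the same job). The step list and the `import_registrations` command name are hypothetical, not this project's actual commands.

```python
import subprocess

# Hypothetical update steps; swap in whatever your project actually runs.
UPDATE_STEPS = [
    ["python", "manage.py", "migrate"],
    ["python", "manage.py", "import_registrations"],  # hypothetical command name
]


def update(steps=UPDATE_STEPS, run=subprocess.check_call):
    """Run each step in order; check_call raises on the first failure."""
    for step in steps:
        print("$ " + " ".join(step))  # echo the step so the log documents the run
        run(step)
    return len(steps)
```

Calling `update()` runs the whole list, and the list itself doubles as a record of what an update involves.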
- If you're using Django, bend your data to the ORM, not the ORM to the data. That means less custom code and fewer queries. `.annotate` and `.aggregate` can be very useful; just make sure you have the right `.order_by` set.
- Make your models resemble the CSV. This is for readability. Even if your field names match the CSV headings, add a comment above each model field naming the CSV column the data came from.
- Group fields together. Fields from the CSV stay together, and denormalized data stays together. Keep them partitioned in the source code.
- Store the raw data in the model. I store it as a JSON blob, but I don't use JSONField because it's impossible to pass the raw JSON out if you want to let some JavaScript consume it later. There's often some useful data hidden in there that's not represented in your model (unless you captured every single field). If you do a big data migration, you might be able to do it from this field instead of going back to the original CSVs. It also helps keep your data accountable and traceable.
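Here is a rough sketch of keeping the raw row next to the parsed values, using only the standard library. The column names and the `parse_row` helper are made up for illustration; in a Django model the `raw` string would go into whatever text-like field you choose.

```python
import csv
import io
import json

# Hypothetical CSV with more columns than the model captures.
SAMPLE = io.StringIO(
    "FILER_ID,NAME,CITY,EXTRA_NOTE\n"
    "123,Jane Doe,Austin,amended\n"
)


def parse_row(row):
    """Build the values for one record, keeping the untouched row alongside."""
    return {
        "filer_id": int(row["FILER_ID"]),
        "name": row["NAME"],
        # The whole CSV row serialized to a string; sort_keys keeps it stable.
        "raw": json.dumps(row, sort_keys=True),
    }


records = [parse_row(row) for row in csv.DictReader(SAMPLE)]

# Later: recover a column the model never captured, without the original CSV.
original = json.loads(records[0]["raw"])
```

Because `raw` is already a JSON string, it can be handed to JavaScript as-is, and a later migration can read `EXTRA_NOTE` from it.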
- Use csvkit's `csvsort` to make sure CSVs are in a known order. If you know a CSV is in a certain order, you can make some optimizations:
  - I sort canonical interest data in reverse chronological order so when I start running into old data, I can stop. The alternative would be to maintain the time of the last run between runs (holding onto state).
  - Registration reports are sorted by lobbyist, so if a lobbyist has 20 clients, I only update the lobbyist's info the first time I see it: https://github.com/texastribune/tx_lobbying/blob/master/tx_lobbying/scrapers/registration.py
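Both optimizations can be sketched with the standard library, assuming the files were sorted before the script runs. The column names, the sample rows, and the `known` set (standing in for a database lookup) are all hypothetical, and the linked scraper may well be structured differently.

```python
import csv
import io
from itertools import groupby

# Hypothetical reports, already sorted newest first.
REPORTS = io.StringIO(
    "REPORT_ID,DATE\n"
    "3,2015-02-01\n"
    "2,2015-01-15\n"
    "1,2014-12-31\n"
)

# Hypothetical registrations, already sorted by lobbyist.
REGISTRATIONS = io.StringIO(
    "LOBBYIST,CLIENT\n"
    "Ann,Acme\n"
    "Ann,Globex\n"
    "Bob,Initech\n"
)


def new_rows(rows, is_known):
    """Yield rows until the first already-stored one, then stop reading."""
    for row in rows:
        if is_known(row):
            break  # the file is newest first, so everything after is old too
        yield row


def load_registrations(rows):
    """Touch each lobbyist once, however many client rows they have."""
    lobbyist_updates = 0
    clients = []
    for lobbyist, group in groupby(rows, key=lambda r: r["LOBBYIST"]):
        lobbyist_updates += 1  # one update per lobbyist, not per row
        clients.extend((lobbyist, r["CLIENT"]) for r in group)
    return lobbyist_updates, clients


known = {"1", "2"}  # stands in for "is this report already in the database?"
fresh = list(new_rows(csv.DictReader(REPORTS), lambda r: r["REPORT_ID"] in known))
updates, clients = load_registrations(csv.DictReader(REGISTRATIONS))
```

Note that `groupby` only merges adjacent rows, so the second function is only correct when the sort order really holds.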
- Lazy geocoding via ajax is a good way to get maps on demand while gradually building geo data. More important data points get mapped first too.
  - For elevators, I cached geocoding results in a local key-value store. It's extra work I decided not to bring over here.
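For reference, caching geocoding results in a local key-value store could look roughly like this. It assumes Python's `shelve` module as the store, which is not necessarily what elevators used, and `fake_geocode` is a stand-in for a real geocoding call.

```python
import os
import shelve
import tempfile


def fake_geocode(address):
    """Stand-in for a real geocoding call; returns a made-up (lat, lng)."""
    fake_geocode.calls += 1
    return (30.27, -97.74)


fake_geocode.calls = 0


def cached_geocode(address, cache_path, geocode=fake_geocode):
    """Return the cached result for address, calling geocode only on a miss."""
    with shelve.open(cache_path) as cache:
        if address not in cache:
            cache[address] = geocode(address)
        return cache[address]


cache_path = os.path.join(tempfile.mkdtemp(), "geocode_cache")
first = cached_geocode("1100 Congress Ave, Austin, TX", cache_path)
second = cached_geocode("1100 Congress Ave, Austin, TX", cache_path)
```

The second lookup is served from the file on disk, so the geocoder is only called once per distinct address.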
- Make it easy to throw away your database and rebuild. Almost all my data lives outside the database:
  - TEC data (cached locally in ./data)
  - Nomenklatura (cached locally in ./data)

  I lose geocoding data when I rebuild, so I have a management command to dump it and load it again. I don't use Django fixtures for this, which keeps the data more portable and lets me load it at any time.
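The dump-and-load round trip might be sketched like this, assuming a flat CSV as the portable format. The field names are hypothetical, and in a real project each function would be wrapped in a Django management command and read from or write to the database.

```python
import csv
import io

FIELDS = ["address", "lat", "lng"]  # hypothetical column names


def dump_geocodes(rows, fh):
    """Write geocoded rows to a plain CSV that any tool can read."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def load_geocodes(fh):
    """Read the CSV back, converting coordinates to floats."""
    return [
        {"address": r["address"], "lat": float(r["lat"]), "lng": float(r["lng"])}
        for r in csv.DictReader(fh)
    ]


saved = [{"address": "1100 Congress Ave, Austin, TX", "lat": 30.27, "lng": -97.74}]
buffer = io.StringIO()  # stands in for a file on disk
dump_geocodes(saved, buffer)  # run before dropping the database
buffer.seek(0)
restored = load_geocodes(buffer)  # run after rebuilding
```

Keying the dump on something stable like the address, rather than a database primary key, is what lets it be loaded at any point after a rebuild.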