This repository has been archived by the owner on May 17, 2018. It is now read-only.
Chris Chang edited this page Feb 17, 2015 · 7 revisions

Lessons learned from previous apps

  • Update the app in as few commands as possible. Even if it's one command that just runs a bunch of other commands, that meta command serves as documentation.
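A meta command can be as small as a script that runs the other commands in order. This is only a sketch: the two steps below are placeholders, and `STEPS` and `run_steps` are hypothetical names, not anything from the original apps.

```python
import subprocess
import sys

# Hypothetical update steps; swap in the commands your app really needs.
STEPS = [
    [sys.executable, "-c", "print('download data')"],
    [sys.executable, "-c", "print('load data')"],
]


def run_steps(steps):
    """Run each command in order, stopping at the first one that fails."""
    for step in steps:
        # Echo the command so the run doubles as documentation.
        print("$ " + " ".join(step))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(step, check=True)
    return len(steps)


if __name__ == "__main__":
    run_steps(STEPS)
```

Reading `STEPS` top to bottom tells you how the app gets updated, which is the documentation part.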

models.py

  • If you're using Django, bend your data to the ORM, not the ORM to the data. That means less custom code and fewer queries. .annotate and .aggregate can be very useful; just make sure you have the right .order_by set.
  • Make your models resemble the CSV. This is for readability. Even if your fields match the headings of the CSV, add a comment above each model field naming the CSV field the data came from.
  • Group fields together. Fields from the CSV stay together, and denormalized data stays together. Keep them partitioned in the source code.
  • Store the raw data in the model. I store it as a JSON blob in a plain text field. I don't use JSONField because it's impossible to pass the raw JSON out if you want to let some JavaScript consume it later. There's often some useful data hidden in there that's not represented in your model (unless you captured every single field). If you do a big data migration, you might be able to do it from this field instead of going back to the original CSVs. It also helps you keep your data accountable and traceable.
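The field-grouping and raw-data ideas above can be sketched without Django. The CSV headings, the `row_to_record` helper, and the dict standing in for a model instance are all made up for illustration; in a real models.py the same layout would be model fields.

```python
import csv
import io
import json

# Hypothetical CSV with made-up headings.
SAMPLE = "Name,City,Notes\nAcme,Austin,extra detail\n"


def row_to_record(row):
    """Map one CSV row to model-like fields and keep the raw row as text."""
    return {
        # --- Fields from the CSV ---
        "name": row["Name"],  # CSV field: Name
        "city": row["City"],  # CSV field: City
        # --- Denormalized data ---
        "name_lower": row["Name"].lower(),
        # --- Raw data: the whole original row as a JSON string ---
        "raw": json.dumps(row, sort_keys=True),
    }


records = [row_to_record(r) for r in csv.DictReader(io.StringIO(SAMPLE))]

# "Notes" was never modeled, but it can still be recovered from the raw blob.
notes = json.loads(records[0]["raw"])["Notes"]
```

Because `raw` is already a JSON string, it can be handed to JavaScript as-is, and a later migration can re-read columns like `Notes` without touching the original CSVs.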

Sorted CSVs

  • Use csvkit's csvsort to make sure CSVs are in a known order. If you know a CSV is in a certain order, you can make some optimizations.
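One optimization a known sort order allows is streaming grouped totals without holding the whole file in memory. This is a sketch under assumptions: the `city` and `amount` columns and the `totals_by_city` helper are hypothetical, and the input is assumed to be already sorted by `city`.

```python
import csv
import io
from itertools import groupby

# Hypothetical data, already sorted by the "city" column.
SORTED_CSV = "city,amount\nAustin,5\nAustin,7\nDallas,1\n"


def totals_by_city(lines):
    """Yield (city, total) pairs, reading one row at a time.

    groupby only merges adjacent rows, so this is correct only
    when the input is sorted by city.
    """
    reader = csv.DictReader(lines)
    for city, rows in groupby(reader, key=lambda r: r["city"]):
        yield city, sum(int(r["amount"]) for r in rows)


totals = list(totals_by_city(io.StringIO(SORTED_CSV)))
```

On unsorted input the same city would show up as several separate groups, which is why the sort step comes first.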

Geo

  • Lazy geocoding via ajax is a good way to get maps on demand while gradually building geo data. More important data points get mapped first too.
  • For elevators, I cached geocoding results in a local key-value store. It's extra work I decided not to bring over here.
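A local key-value cache for geocoding results might look roughly like this. It uses the standard library's `shelve` as the store, which is an assumption (the original does not say which store was used), and `cached_geocode` and the `geocode` callable are hypothetical names.

```python
import shelve


def cached_geocode(address, geocode, cache_path="geocache"):
    """Return coordinates for address, calling geocode only on a cache miss.

    geocode is any callable that takes an address string and returns
    a picklable result, such as a (lat, lng) tuple.
    """
    with shelve.open(cache_path) as cache:
        if address not in cache:
            cache[address] = geocode(address)
        return cache[address]
```

Since the cache lives in a file outside the database, it survives a database rebuild.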

Throwaway Data

  • Make it easy to throw away your database and rebuild. Almost all my data lives outside the database:
    1. TEC data (cached locally in ./data)
    2. Nomenklatura (cached locally in ./data)

    I lose geocoding data when I rebuild, so I have a management command to dump it and load it again. I avoid Django fixtures so that the data is more portable and so I can load it at any time.
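The core of a dump-and-load step for geocoding data could be sketched like this. The record layout (`key`, `lat`, `lng`) and both function names are assumptions, with plain dicts standing in for model rows; in the app this logic would sit inside a management command.

```python
import json


def dump_geocodes(records, path):
    """Write {key: [lat, lng]} for every record that has coordinates."""
    data = {
        r["key"]: [r["lat"], r["lng"]]
        for r in records
        if r.get("lat") is not None
    }
    with open(path, "w") as f:
        json.dump(data, f, indent=2, sort_keys=True)
    return len(data)


def load_geocodes(records, path):
    """Copy saved coordinates back onto freshly rebuilt records."""
    with open(path) as f:
        data = json.load(f)
    restored = 0
    for r in records:
        if r["key"] in data:
            r["lat"], r["lng"] = data[r["key"]]
            restored += 1
    return restored
```

A plain JSON file keyed by something stable, such as an address, does not depend on database primary keys, so it can be loaded at any point after a rebuild.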