
Look into speeding up feed imports #28

Closed
araichev opened this Issue Apr 29, 2014 · 16 comments

@araichev

I think it would be worthwhile for someone, possibly me, to spend some time figuring out how to speed up feed imports. As it stands, a 165 MB feed (21 MB zipped) such as Southeast Queensland's (http://translink.com.au/sites/default/files/assets/resources/about-translink/reporting-and-publications/open-data/gtfs/SEQ.zip) takes about 9–12 hours to load into a PostGIS database. (I'll time it more accurately the next time I do the import.) Is that slow? It seems like it, but I don't know much about databases.

@Permafacture


Permafacture Jun 10, 2014

Thanks for the heads-up. I decided to load my city's bus information just to test this app, and now I know it's going to take a very long time. A comment about this in the README would be nice.


@Permafacture


Permafacture Jun 10, 2014

Without profiling it, my guess is that the bottleneck is inserting into the database after every row of every file (base.py: cls.objects.create(**fields)).

This project (https://github.com/cbick/gtfs_SQL_importer) just builds a giant INSERT statement and runs it as raw SQL. I don't think that would work out of the box for django-multi-gtfs, but maybe it could be reworked? Or multi-gtfs could be reworked to fit gtfs_SQL_importer? Comparing the performance of the two would be a good first step.

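To make the per-row point concrete, here is a minimal sketch of the difference, assuming a multigtfs model such as StopTime and an already-parsed list of field dicts (the names and import path are assumptions for illustration, not the project's actual import code):

```python
from multigtfs.models import StopTime  # assumed import path, for illustration

def import_one_by_one(rows):
    # What base.py effectively does today: one INSERT (and one database
    # round trip) per parsed row.
    for fields in rows:
        StopTime.objects.create(**fields)

def import_batched(rows, batch_size=1000):
    # Build unsaved model instances in memory, then let Django emit a few
    # multi-row INSERT statements instead of one statement per record.
    # Note: bulk_create() skips save() overrides and signals.
    StopTime.objects.bulk_create(
        [StopTime(**fields) for fields in rows],
        batch_size=batch_size,
    )
```

gtfs_SQL_importer takes the same idea further by emitting the raw multi-row INSERT itself.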

@jwhitlock


jwhitlock Jun 10, 2014

Member

I suspected this would be required. Thanks for the gtfs_SQL_importer tip; I'll see how they do it and whether there is some common ground.

Some other ideas:

  • Turn off indexes before import and re-enable them afterward (see the sketch below)
  • Stop using DecimalField
  • Remove some belts-and-suspenders indexes and rely on validators for consistency checking
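
A rough sketch of the first idea, assuming PostgreSQL; the index, table, and column names here are invented for illustration (the real multigtfs names differ). The point is to drop the index, load the rows, and rebuild the index once at the end rather than maintaining it on every insert:

```python
from django.db import connection

def import_without_index(load_rows):
    # 'stop_time_trip_idx' on 'stop_time (trip_id)' is a hypothetical index,
    # named here only for the example.
    cursor = connection.cursor()
    cursor.execute('DROP INDEX IF EXISTS stop_time_trip_idx')
    try:
        load_rows()  # the actual bulk import
    finally:
        # Rebuilding the index once is far cheaper than updating it
        # incrementally across millions of inserts.
        cursor.execute(
            'CREATE INDEX stop_time_trip_idx ON stop_time (trip_id)'
        )
```
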
@Permafacture


Permafacture Jun 10, 2014

Disclaimer: I know squat about databases. I'm just going off Stack Overflow here.

I think that, before doing anything fancy, batching the INSERTs will have the biggest effect. Here is an interesting decorator that might help: https://docs.djangoproject.com/en/dev/topics/db/transactions/#django.db.transaction.commit_on_success

If that is still very slow, the next step seems to be to drop all foreign key constraints before the import and recreate them afterward. As long as there is still a column holding the integer that happens to be the primary key of the related row in the other table, I think this should be fine. This could be done with South: http://south.readthedocs.org/en/latest/databaseapi.html#db-delete-foreign-key and http://south.readthedocs.org/en/latest/databaseapi.html#db-alter-column

I don't know whether Django will have trouble (for example, with building m2m tables) if South is changing the schema behind Django's back.

I get the impression that turning off indexes would actually make inserting slower, by making the foreign key constraints much slower to check. Or maybe turning off indexes and disabling foreign key constraints amount to the same thing. I have no idea.

But "optimization before profiling is always premature".


@codelahoma


codelahoma Jun 11, 2014

Member

Considering only I/O, since I really doubt this is a computation-bound problem, my understanding is as follows (generally, for databases).

Since indices involve both reads and writes, turning them off should give a bigger boost than disabling foreign key constraints, which only add reads.

That said, if the tables containing the foreign keys are large enough, turning off the indices could slow down the constraint checks, too.

Batching should help a lot, since it takes the aggregate request overhead from O(n) to O(n/batch_size).

Apologies if I'm abusing big-O notation there, if what I say is obvious, and/or if what I say is obviously wrong.

Edit: Made it clearer that I'm referring to aggregate overhead, not per-call overhead.

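As a back-of-the-envelope illustration of the O(n) vs O(n/batch_size) point, using a made-up overhead figure (about 1 ms of round-trip cost per statement) and the roughly 2.5 million stop_times rows reported later in this thread:

```python
rows = 2512036             # SEQ stop_times count reported below
per_stmt_overhead = 0.001  # assume ~1 ms of round-trip cost per statement

unbatched = rows * per_stmt_overhead       # ~2512 s: about 42 minutes of overhead alone
batched = rows / 1000 * per_stmt_overhead  # ~2.5 s with batches of 1000 rows
print(unbatched, batched)
```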

@jwhitlock


jwhitlock Jun 11, 2014

Member

Blocked by #10 - Southeast Queensland's feed contains additional columns that have to be hand-removed to get the feed into multigtfs.


jwhitlock added this to the v0.4.0 milestone Jun 11, 2014

@Permafacture


Permafacture Jun 11, 2014

Sorry, what is blocked? I don't see how extra columns affect batching or turning off constraints/indices.

Elliot


@jwhitlock


jwhitlock Jun 11, 2014

Member

Here's my desired process:

  1. Time importing Southeast Queensland's feed.
  2. Make some changes.
  3. Time importing Southeast Queensland's feed again. Did the changes make a big enough difference? Keep them. Did they make little difference or have a negative effect? Try something else.

Because Southeast Queensland's feed includes extra columns (#10), I can't do step 1.

So "blocked" is probably the wrong word. I could use a different feed, optimize, and just tell araichev "works for me". But I'm going to work on #10 first, so that I can compare apples to apples (or Southeast Queensland to Southeast Queensland).


@Permafacture


Permafacture Jun 11, 2014

That makes sense, though it might make more sense to use a smaller benchmark rather than wait 9 hours to see if there was an improvement. Also, araichev didn't complain about not being able to import at all, so fixing #10 isn't exactly related to this.


@araichev


araichev Jun 11, 2014

Yeah, after fixing issue #10, I would test a small feed first, such as Cairns (739 KB zipped; http://translink.com.au/sites/default/files/assets/resources/about-translink/reporting-and-publications/open-data/gtfs/cairns.zip), then move on to a feed of intermediate size (<= 10 MB zipped; any suggestions?), and finally a big one, such as Southeast Queensland (21.1 MB zipped; http://translink.com.au/sites/default/files/assets/resources/about-translink/reporting-and-publications/open-data/gtfs/SEQ.zip).


@jwhitlock


jwhitlock Jun 19, 2014

Member

Issue #10 is fixed, so I'm able to import some feeds. I re-discovered that DEBUG=True can hurt you (see my blog post). My times are:

As you can see from the Tulsa Transit vs. Cairns numbers, the content of the feed matters more than its absolute size. Here's the timing breakdown for SEQ:

  • 12776 Stops in 63 seconds, or 202 / sec
  • 525 Routes in 0.7 seconds, or 750 / sec
  • 101 Services in 0.1 seconds
  • 154 Service Dates in 0.3 seconds
  • 715557 Shape Points in 31909 seconds (8.5 hours), or 22 / sec
  • 85724 Trips in 385 seconds, or 300 / sec
  • 2512036 StopTimes in 9416 seconds (2.5 hours), or 266 / sec
  • 30 - 60 minutes to add geometries to trips, routes, etc.

So, if a feed has a lot of detailed shapes, it will import super slowly.

I was worried that geometry columns would be the biggest problem. There is certainly a difference between rows with geometry, such as Stops, and rows without geometry, such as Routes. However, I think the bigger cost comes from rows that refer to other types, such as trips belonging to a route. So, with a much smaller test feed, I'm going to try the following (a rough sketch appears after this comment):

  • Caching IDs of inserted rows, to avoid database lookups for related rows
  • Bulk importing of rows

I'm hopeful that small steps can get SEQ down below 2 hours. Not great, but back in long lunch territory.
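
On the DEBUG=True note above: with DEBUG=True, Django appends every executed SQL statement to connection.queries, so a multi-million-row import grows memory without bound and slows down. A minimal guard, as a sketch:

```python
from django.conf import settings
from django.db import reset_queries

# With DEBUG=True, Django records every SQL query in memory
# (connection.queries) for the life of the process.
if settings.DEBUG:
    reset_queries()  # clear the log; better still, run large imports with DEBUG=False
```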

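Here is a sketch of how those two ideas might combine, with invented names (the field names, the trip_rows structure, and the import path are assumptions for illustration, not multigtfs's actual code): the related Route primary keys are fetched once into a dict so each Trip row avoids a per-row lookup, and the Trips themselves are written with bulk_create.

```python
from multigtfs.models import Route, Trip  # assumed import path

def import_trips(feed, trip_rows, batch_size=1000):
    # One query caches this feed's GTFS route_id -> database primary key,
    # replacing a per-row Route lookup.
    route_pks = dict(
        Route.objects.filter(feed=feed).values_list('route_id', 'id')
    )
    trips = [
        Trip(
            trip_id=row['trip_id'],               # assumed field name
            route_id=route_pks[row['route_id']],  # FK set from the cache, no query
        )
        for row in trip_rows
    ]
    # A handful of multi-row INSERTs instead of one INSERT per trip.
    Trip.objects.bulk_create(trips, batch_size=batch_size)
```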

@araichev


araichev Jun 19, 2014

Nice work profiling, John. Let's see how the optimizations go.


@jwhitlock


jwhitlock Jun 21, 2014

Member

I've pushed code that speeds up imports and exports. Here are the numbers for my laptop and a local PostGIS database:

  • SEQ: 41-minute import (17x speed-up), 11-minute export
  • Cairns: 88-second import (21x speed-up), 30-second export
  • Tulsa: 77-second import (21x speed-up), 25-second export

SEQ was updated a few days ago, so the total item numbers have changed:

  • 12771 Stops in 8.7 seconds, or 1468 / sec
  • 537 Routes in 0.3 seconds
  • 735419 Shape Points in 331.5 seconds, or 2218 / sec
  • 76540 Trips in 37.6 seconds, or 2035 / sec
  • 2248646 Stop Times in 575.3 seconds, or 3908 / sec
  • 99 seconds to create cached Shape geometries
  • 298 seconds to create cached Trip geometries
  • 1130 seconds to create cached Route geometries

That's 16 minutes to import the data, and then 25 minutes to update the cached geometries. I can't see a way to speed this up. Maybe someone with more PostGIS experience will see a solution.

It's still memory intensive to import or export a feed as big as SEQ.zip. I wouldn't attempt it with less than 4 GB of RAM.

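For context on what the cached-geometry step is doing, here is a rough GeoDjango-style sketch with assumed model and field names (shape.points, sequence, geometry, and the reverse trip_set relation are illustrative, not necessarily multigtfs's real schema): each trip's cached line is built from its shape's ordered points, and each route's geometry collects its trips' lines.

```python
from django.contrib.gis.geos import LineString, MultiLineString

def cache_trip_geometry(trip):
    # Assumed relations: trip.shape -> Shape, shape.points -> ShapePoint rows
    # with a 'sequence' order and a 'point' geometry.
    coords = [
        (pt.point.x, pt.point.y)
        for pt in trip.shape.points.order_by('sequence')
    ]
    if len(coords) >= 2:
        trip.geometry = LineString(coords)
        trip.save(update_fields=['geometry'])

def cache_route_geometry(route):
    # A route's cached geometry is just the collection of its trips' lines.
    lines = [t.geometry for t in route.trip_set.all() if t.geometry]
    if lines:
        route.geometry = MultiLineString(lines)
        route.save(update_fields=['geometry'])
```

Pushing that aggregation into the database (PostGIS has an ST_MakeLine aggregate, for example) might be the kind of solution the comment above is hoping for.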

jwhitlock closed this Jun 21, 2014

@jwhitlock


jwhitlock Jun 22, 2014

Member

By the way, thanks for the suggestions. About 1/3 of the import speedup was from bulk inserts, and 2/3 from pre-caching database reads during import. I didn't try other options, like temporarily removing indexes or changing field types. There may be additional improvements, but this feels good enough for now.


@Permafacture


Permafacture Jun 22, 2014

41 minutes is a heck of a lot more usable than 12 hours. Good job.


@araichev


araichev Jun 22, 2014

Yeah, that's great!

