Out of space running Django test suite on Postgres #1125

Closed
qris opened this Issue May 19, 2013 · 22 comments

qris commented May 19, 2013

I've been trying to get the Django test suite running on Travis, and I've been having a lot of problems with the Postgres tests.

The MySQL and SQLite jobs work, but at some point Postgres decides that the disk is full. This doesn't seem to happen if I run only the Postgres tests, but it does if I run the Postgres and MySQL jobs at the same time.

Sometimes Postgres encounters errors like this:

2013-05-19 12:21:18 UTC LOG:  could not open temporary statistics file "pg_stat_tmp/pgstat.tmp": Cannot allocate memory

But there's plenty of disk space:

$ df
Filesystem          1K-blocks      Used Available Use% Mounted on
/vz/private/3135353 448715560 116973084 318066424  27% /
none                   530844        72    530772   1% /run
/dev/null              262144    206908     55236  79% /var/ramfs
none                     5120         0      5120   0% /run/lock
none                  2654208         0   2654208   0% /run/shm

I've even seen it panic and crash:

2013-05-19 09:42:06 UTC PANIC:  could not write to file "pg_xlog/xlogtemp.11740": No space left on device
2013-05-19 09:42:06 UTC STATEMENT:  COMMIT
2013-05-19 09:42:06 UTC LOG:  server process (PID 11740) was terminated by signal 6: Aborted
2013-05-19 09:42:06 UTC LOG:  terminating any other active server processes
2013-05-19 09:42:06 UTC WARNING:  terminating connection because of crash of another server process
2013-05-19 09:42:06 UTC DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-05-19 09:42:06 UTC HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2013-05-19 09:42:06 UTC LOG:  all server processes terminated; reinitializing
2013-05-19 09:42:06 UTC LOG:  database system was interrupted; last known up at 2013-05-19 09:39:56 UTC
2013-05-19 09:42:06 UTC LOG:  database system was not properly shut down; automatic recovery in progress
2013-05-19 09:42:06 UTC LOG:  redo starts at 0/502EEE8
2013-05-19 09:42:17 UTC FATAL:  the database system is in recovery mode

which causes our tests to fail with this error:

File "/home/travis/build/aptivate/django/django/db/backends/__init__.py", line 131, in _cursor
    return self.create_cursor()
File "/home/travis/build/aptivate/django/django/db/backends/postgresql_psycopg2/base.py", line 141, in create_cursor
    cursor = self.connection.cursor()
InterfaceError: connection already closed

In that case, df was not run until after Postgres had already recovered, so I couldn't be sure whether it had deleted all the WAL logs that might be filling up the disk.

But the amount of disk space available is so huge that I can't imagine us filling it up. It doesn't look like either /tmp or /var/lib/postgresql (home of pg_xlog) is mounted on a ramdisk or symlinked to a different filesystem. Do you have any idea why we might be seeing this behaviour? Could the host's filesystem be filling up when it runs too many jobs at once?

How can I reduce the disk space used by Postgres? I've already tried reducing checkpoint_timeout from the default of 300 seconds to 60 seconds, which made the tests pass when run on their own, but not with MySQL and SQLite running at the same time.
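
A minimal sketch of how such a change can be scripted from .travis.yml, assuming the stock Ubuntu layout where the Postgres 9.1 config lives at /etc/postgresql/9.1/main/postgresql.conf (a path this thread does not confirm):

before_script:
  # shorten the checkpoint interval; the config path is an assumption
  - sudo sed -i 's/^#\?checkpoint_timeout.*/checkpoint_timeout = 60s/' /etc/postgresql/9.1/main/postgresql.conf
  - sudo service postgresql restart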

I even get some completely random errors like this:

File "/home/travis/build/aptivate/django/django/db/backends/__init__.py", line 112, in connect
File "/home/travis/build/aptivate/django/django/db/backends/postgresql_psycopg2/base.py", line 118, in get_new_connection
File "/home/travis/virtualenv/python2.7/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
django.db.utils.OperationalError: could not translate host name "localhost" to address: Name or service not known
Owner

drogus commented May 19, 2013

Postgres is configured to run on tmpfs, which is 79% full according to your df output. I'm not sure what the best way to handle this would be, but probably one of:

  • unmount the tmpfs and mount it again with a bigger size limit (see the sketch after this list)
  • stop Postgres, initialize a data directory at another location, and start Postgres pointing at the new data directory
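
A minimal sketch of the first option as .travis.yml steps, assuming the tmpfs in question is the /var/ramfs mount from the df output above and that Postgres keeps its data on it (both are assumptions; the 768M figure is just an example):

before_script:
  - sudo service postgresql stop
  # grow the existing tmpfs in place instead of unmounting it; size is an example value
  - sudo mount -o remount,size=768M /var/ramfs
  - sudo service postgresql start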

qris commented May 19, 2013

Thanks @drogus, where and how is it configured to do that? Is the standard config available anywhere?

Owner

joshk commented May 19, 2013

Or run MySQL and Postgres in different jobs?

Owner

joshk commented May 19, 2013

If you would like us to spin up a cloud VM for you to use for debugging purposes, email support@travis-ci.com and we can sort one out for you.

qris commented May 19, 2013

@joshk they are different jobs, but they still interfere with each other somehow if running simultaneously.

qris commented May 19, 2013

@drogus that could be really useful; I wish I'd asked earlier. The sprints have just finished, so I'll ask you when I have time to work on this again.

Owner

joshk commented May 19, 2013

Hmmm, are you saying that two jobs which are running in two different VMs are interfering with each other?

qris commented May 19, 2013

@joshk it does appear so, but I'll have to run a build with and without multiple jobs to be 100% sure. I'll let you know if I succeed in proving that.

qris commented May 21, 2013

@drogus, remounting /var/ramfs with more space seems to have fixed the problem, thanks! I'd still really like to know how you configured the virtual machines, so that I can find and fix other such issues myself in the future.

@joshk we have one remaining failing test which doesn't fail on my machine, so it would be really useful to have another machine to debug it on. Would you mind spinning up a VM for me? Thanks :)

Also it would be great to be able to download Travis VM images (when I'm back in a place with fast Internet, next week). I tried to find this in the docs and on the Internet, and only found old images. Any chance you could document where to find them?

qris commented May 26, 2013

@drogus I'm afraid that wasn't the complete solution. The tests pass sometimes, but not reliably.

Why do you put Postgres on a ramdisk? Does it work if I put it on the big root filesystem instead? Is there a huge performance difference?

Owner

drogus commented May 26, 2013

@qris we put Postgres on a ramdisk to increase performance. It's hard to say how big the difference would be; it really depends on how much database interaction your tests need. I guess there is no other way to find out than to just check it; I'd be curious about the difference myself.

How do the Django tests use the database? From what I can see, you mount the ramdisk at 2 GB, which is quite big for tests. Maybe there is an issue with the database not being purged after each test?

Owner

roidrage commented Jun 4, 2013

Sorry for not chiming in on this sooner. This is not necessarily related to disk space but to the shared memory allocatable from the kernel. We have a fix pending rollout for this, but in the meantime, adding this to your .travis.yml should help remove these errors:

before_install:
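  # raise the kernel shared-memory limits for Postgres (shmmax is in bytes, shmall is in pages)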
  - sudo sysctl -w kernel.shmmax=268435456
  - sudo sysctl -w kernel.shmall=268435456

qris commented Jun 6, 2013

@roidrage thanks for the advice. Unfortunately I tried it and it doesn't work. Both Postgres tests are still failing with these kinds of errors:

OperationalError: could not extend file "base/17323/35861": No space left on device
HINT:  Check free disk space.

And in the Postgres logs we can see errors like this:

2013-06-05 17:57:57 UTC LOG:  automatic vacuum of table "test_django.public.auth_user_user_permissions": cannot (re)acquire exclusive lock for truncate scan
2013-06-05 17:58:18 UTC ERROR:  could not write to file "pg_xlog/xlogtemp.620": No space left on device
2013-06-05 17:58:23 UTC ERROR:  could not extend file "base/17323/35861": No space left on device
2013-06-05 17:58:23 UTC HINT:  Check free disk space.

Even after disabling writing to WAL logs by using UNLOGGED tables, I'm still getting the same errors.

I notice that in the default (compile-time) configuration of Postgres each WAL segment is 16 MB in size, and with the default checkpoint_segments = 3, the system will not reduce the number of WAL segments below 10 (3 * checkpoint_segments + 1). It can go higher depending on the amount of write activity between checkpoints. On my system, the total disk space used by the Postgres data directory was hovering between 330 and 360 MB while running the tests; even reducing checkpoint_segments to 1 only reduced this by about 30 MB. While the tests are running on my machine, I see:

chris@lap-x201:$ sudo du -sk /var/lib/postgresql/9.1/main/pg_xlog /var/lib/postgresql/9.1/main/base
114724  /var/lib/postgresql/9.1/main/pg_xlog
191648  /var/lib/postgresql/9.1/main/base

Therefore it seems that in the current default Travis configuration with a 256 MB ramdisk, the minimum ramdisk usage, with no actual data in the database, will be about 160 MB (10 * 16 MB), or about 60%. Since Django's tests use about 190 MB of data, it seems impossible to fit the test database in the current ramdisk, no matter what I do.

I can work around it by redirecting the Postgres data directory to the main filesystem instead of the ramdisk, although this makes the tests take 40-60 minutes to complete, and they sometimes time out. The Django developers currently will not accept this as a workaround. What do you say to increasing the default ramdisk size to 1 GB? The virtual machines should have enough RAM for that.
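
For reference, a minimal sketch of that workaround as .travis.yml steps; the source path assumes the cluster lives under /var/ramfs/postgresql on the Travis VM, which is a guess (check SHOW data_directory; on the box first):

before_script:
  - sudo service postgresql stop
  # copy the cluster off the ramdisk onto the root filesystem (source path is an assumption)
  - sudo cp -a /var/ramfs/postgresql /var/lib/postgresql-ondisk
  # point the server at the relocated data directory (Debian/Ubuntu 9.1 config path assumed)
  - sudo sed -i "s|^data_directory.*|data_directory = '/var/lib/postgresql-ondisk/9.1/main'|" /etc/postgresql/9.1/main/postgresql.conf
  - sudo service postgresql start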

Owner

roidrage commented Jun 6, 2013

Sorry, but for operational reasons, increasing the RAM disk is not an option at this point.

Have you tried removing the MySQL data directory? It resides on the same disk and should free up some additional space.

Owner

joshk commented Jun 17, 2013

I am rolling out some VM updates to Org tomorrow with an increased tmpfs volume and shmmax changes.

We have also fixed an issue with swap space not being available on all VMs.

Have you got closer to things running smoothly?

qris commented Jun 17, 2013

I had to reconfigure Postgres to reduce the number of WAL logs and move them off the ramdisk (sketched below). I hope it will now run stably, but I don't know how to be sure of that. I'm also not sure whether these changes will be accepted by the Django community; my previous changes to make the tests pass were rejected, as was the idea of splitting the tests into two separate runs.
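
A minimal sketch of that kind of reconfiguration as .travis.yml steps, assuming the data directory sits on the ramdisk at /var/ramfs/postgresql/9.1/main and the config at /etc/postgresql/9.1/main/postgresql.conf (both paths are guesses, not confirmed in this thread):

before_script:
  - sudo service postgresql stop
  # keep fewer WAL segments around between checkpoints
  - sudo sed -i 's/^#\?checkpoint_segments.*/checkpoint_segments = 1/' /etc/postgresql/9.1/main/postgresql.conf
  # move the WAL directory off the ramdisk and symlink it back into place
  - sudo mv /var/ramfs/postgresql/9.1/main/pg_xlog /var/lib/postgresql-xlog
  - sudo ln -s /var/lib/postgresql-xlog /var/ramfs/postgresql/9.1/main/pg_xlog
  - sudo service postgresql start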

Owner

joshk commented Jun 17, 2013

Hmmm, ok, well, I have doubled the tmpfs size, so this might help. If you could try running the tests this time tomorrow and let me know how it goes, it would be much appreciated. :)

qris commented Jun 17, 2013

I tried, but am currently blocked by a new environment issue, #1170.

qris commented Jun 17, 2013

After working around #1170, I'm still having the same issue as before: running out of disk space, unless I move some Postgres data off the ramdisk onto the root filesystem. Has the fix even been deployed yet, or should I wait until tomorrow?

Filesystem          1K-blocks      Used Available Use% Mounted on
/vz/private/3955016 448715560 117345160 308576980  28% /
none                   314576        72    314504   1% /run
/dev/null              262144    244668     17476  94% /var/ramfs
none                     5120         0      5120   0% /run/lock
none                  1572864         0   1572864   0% /run/shm

The space available to the ramdisk is the same as before (256 MB), but /run/shm is huge at 1.5 GB. I'd be very happy to trade some of this unused space for more space on the ramdisk; after all, both are backed by system memory in the virtual machine, so it shouldn't make any difference to your ops whether it's allocated to one kind of ramdisk or another.

Owner

joshk commented Jun 17, 2013

This will be updated tomorrow; the tmpfs will double to 512 MB by default.

I will have a blog post out in an hour with a list of all the updates.

Owner

joshk commented Jun 20, 2013

If you have a chance, please try running the test suite again and we can see if the increased tmpfs has helped at all :)

qris commented Jun 20, 2013

Yes, it works now, thanks :)
