pg_dump does not work for single tables #112

Closed
lukland opened this issue Jul 6, 2017 · 12 comments

Comments

lukland commented Jul 6, 2017

I am facing an issue with the pg_dump command. I dump the data of my tables with the command:

pg_dump -a -U user -d database -t table > /my/path/table.bak

It works perfectly fine with standard tables but not with hypertables.
Is it possible to dump hypertables that way, or do we have to use the COPY approach from the docs? We have a specific use case where we need INSERT INTO statements; it's too complicated to explain in a few lines.

cevian (Contributor) commented Jul 6, 2017

I think you will find your answer in this comment of an old PR: #102 (comment)

Let me know if that helps

lukland (Author) commented Jul 7, 2017

Yes, I know that way to dump a table to CSV format. Unfortunately, I have a specific use case where I need the same kind of file that pg_dump generates (with INSERT INTO statements), not a CSV file.

Of course I can generate the .sql file from the CSV myself, but it seems a little strange that pg_dump doesn't work with hypertables.

cevian (Contributor) commented Jul 7, 2017

I am guessing you are also familiar with our pg_dump procedure for backup/restore of the entire db (as opposed to one hypertable): http://docs.timescale.com/api#backup

The reason that pg_dump doesn't just work for a single hypertable is that pg_dump does not dump the data from inherited tables when processing parent tables. So, if you just dump the hypertable (which is a parent table of a bunch of chunks) you won't get the data in the underlying chunks.

We may create a wrapper utility for this in the future. In the meantime, maybe we can help you make this work with COPY instead of INSERT INTO? What issue are you having using COPY?
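
For illustration, here is a rough sketch of the COPY route with psycopg2 (connection credentials, table name, and output path are placeholders, not from this thread). Because the COPY runs over a SELECT, the query goes through the planner and therefore includes the data in all of the hypertable's chunks:

import psycopg2

# Placeholders: adjust the connection credentials, table name, and output path.
con = psycopg2.connect(...)  # add your connection credentials
cur = con.cursor()

# COPY (SELECT ...) TO STDOUT reads through all chunks of the hypertable,
# unlike pg_dump -t, which only sees the (empty) parent table.
with open("/my/path/my_table.csv", "w") as out:
    cur.copy_expert("COPY (SELECT * FROM my_table) TO STDOUT WITH CSV HEADER", out)

con.close()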

lukland (Author) commented Jul 7, 2017

Ok thank you. I will write a little tool to convert CSV to INSERT INTO.

I will try to summarize our problem.
We generate data locally on several sites, each with its own Postgres instance. We also have a backend, also running Postgres, where the data from all sites is stored together. So we dump the data every X seconds and send it to the backend. One of our issues is that if a row already exists in the backend, Postgres raises an integrity error (of course).

In the backend, we use Python and psycopg2 with the copy_from method, which takes a complete CSV file and loads the data into Postgres. The problem is that if there is an integrity error, the whole CSV is rolled back, not just the offending line, so no data gets added! Moreover, psycopg2 does not seem to let us execute one line of the CSV at a time. Finally, some data in our database gets updated (the profile table, for example), so on a duplicate key we don't want to insert but to update! Of course this is a problem for performance, but we can specify which tables get updated so that our algorithm does not update hypertables. So, to do that, we need INSERT INTO statements that we execute line by line (we can do that with psycopg2), transforming an INSERT INTO into an UPDATE on conflict.

I hope I was clear :)
(I am working with @Romathonat, from the other issue you linked)

mfreed changed the title from "pg_dump does not work" to "pg_dump does not work for single tables" on Jul 7, 2017
cevian (Contributor) commented Jul 7, 2017

Oh OK, that's clearer. May I suggest an approach like this instead: https://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com

That should become even easier once we get #100 resolved (working on that in current sprint).

cevian (Contributor) commented Jul 7, 2017

Just to be clear, the approach I (and the message above) am suggesting is to use psycopg2's copy_from to copy the data from the CSV into a temporary table, and then do the "conflict resolution" in SQL while moving the data from the temporary table into the hypertable. In the message I linked to they did it through a series of SQL commands, but you can also do this in a PL/pgSQL function that copies the data row by row and catches integrity errors inside the function.
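
As a rough sketch of the SQL-commands variant (the "metrics" hypertable, its (time, device_id) primary key, and the file path are all hypothetical; the update-then-insert pattern follows the linked message):

import psycopg2

con = psycopg2.connect(...)  # add your connection credentials
cur = con.cursor()

# 1. Stage the CSV (no header row, comma-separated) in a temporary table
#    with the same layout as the target hypertable.
cur.execute("CREATE TEMP TABLE metrics_staging (LIKE metrics INCLUDING DEFAULTS)")
with open("/my/path/metrics.csv", "r") as f:
    cur.copy_from(f, "metrics_staging", sep=",")

# 2. Resolve conflicts in SQL: update the rows that already exist...
cur.execute("""
    UPDATE metrics AS m
    SET value = s.value
    FROM metrics_staging s
    WHERE m.time = s.time AND m.device_id = s.device_id
""")

# 3. ...and insert only the rows that are not there yet.
cur.execute("""
    INSERT INTO metrics
    SELECT s.* FROM metrics_staging s
    WHERE NOT EXISTS (
        SELECT 1 FROM metrics m
        WHERE m.time = s.time AND m.device_id = s.device_id
    )
""")

con.commit()
con.close()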

I think this approach using a temporary table will be much faster and easier than inserting row-by-row on the client side.

Please let me know if any of this was unclear or you have any other questions.

mfreed (Member) commented Jul 7, 2017

Also, this no longer seems to be the immediate issue, but the latest version of our docs (just published) provides more detail about dump/restore for single hypertables vs. the entire db:

http://docs.timescale.com/api#backup

Romathonat commented Jul 10, 2017

So, let me summarize (maybe it will help others, because I think this is a problem that will often be encountered with IoT).

Problem

There is data from multiple sites, each with its own database. There is also a backend, where the data from all sites is stored. How can we synchronize data every X minutes from each site to the backend, knowing that some of the data may be updated?

First solution

Dump the data to SQL (INSERT INTO) format on each site, then send it to the backend. On the backend, execute the lines one by one. If there is an integrity error, it means we are trying to insert a key that is already present, so instead of inserting the data we need to update it. Indeed, the data may have been updated on site.

In practice, we use Python for our backend, with psycopg2 to communicate with our PostgreSQL instance, which is boosted for time-series data with TimescaleDB.

We have a function that transforms an INSERT into an UPDATE (by @lucaslandry):

import re

def insert_to_update(test_string, id_name):
    # Rewrite "INSERT INTO table (cols) VALUES (vals)" as an UPDATE keyed on id_name.
    regex = r"INSERT INTO (?P<table>\w+) \((?P<columns>.+)\) VALUES \((?P<values>.+)\)"
    match = re.search(regex, test_string)

    if match is None:
        return None

    columns = match.group('columns').replace(' ', '').split(',')
    values = match.group('values').replace(' ', '').split(',')

    # Update every column, filtering on the value of the conflicting key.
    update_command = "UPDATE {} SET ({}) = ({}) WHERE {} = {};".format(
        match.group('table'),
        match.group('columns'),
        match.group('values'),
        id_name,
        values[columns.index(id_name)])

    return update_command

Another that uses the psycopg2 error to get the name and value of the conflicting key:

import re

def extract_id(error_string):
    # Parse psycopg2's "DETAIL:  Key (col)=(value) already exists." message.
    regex = r"DETAIL:\s{2}Key\s\((.+)\)=\((.+)\)\salready\sexists"
    match = re.search(regex, error_string)
    id_name, id_value = match.group(1), match.group(2)
    return id_name, id_value

And then we use them that way:

import os
import psycopg2

# add your connection credentials
con = psycopg2.connect(...)
cur = con.cursor()
print("Connection to Postgres Database established.")

# body is the path of the dump file received from a site
filename, extension = os.path.splitext(os.path.basename(body))

with open(body, "r") as my_file:
    for line in my_file:
        if 'INSERT INTO' in line[:11]:
            try:
                # we try to execute the insert
                cur.execute(line)
                con.commit()
            except psycopg2.IntegrityError as e:
                # if it does not work, we turn that INSERT INTO into an UPDATE.
                # first we roll back the failed transaction
                con.rollback()

                id_name, id_value = extract_id(e.pgerror)

                update_command = insert_to_update(line, id_name)

                cur.execute(update_command)
                con.commit()

con.close()

This approach works fine; however, there are several drawbacks:

Issue 1: Updating each row when there is a conflict can be quite inefficient when working on hypertables, because this data does not need to be updated (sensor data in our case).
Solution: Differentiate the two cases, depending on whether or not we are working on a hypertable.

Issue 2: Performance could certainly be improved by using Postgres's internal mechanisms.

NB: I put this post on my website if you want to see it; it's here

mfreed (Member) commented Jul 28, 2017

We just merged a PR into master with support for UPSERTS, i.e., ON CONFLICT DO UPDATE or ON CONFLICT DO NOTHING. (#137)

This should be part of the 0.3.0 release, which should go out early next week.

mfreed (Member) commented Jul 31, 2017

UPSERTS can now be found in the 0.3.0 release:
https://github.com/timescale/timescaledb/releases/tag/0.3.0

More information also in the docs: http://docs.timescale.com/api#upsert
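
For example, a minimal sketch (the "metrics" hypertable and its columns are hypothetical, carrying over the names used above): the ON CONFLICT clause replaces the client-side insert-then-update logic from earlier in this thread.

import psycopg2

con = psycopg2.connect(...)  # add your connection credentials
cur = con.cursor()

# Hypothetical hypertable "metrics" with primary key (time, device_id):
# a duplicate key now updates the existing row instead of raising an IntegrityError.
row = ("2017-07-31 12:00:00", 42, 0.5)
cur.execute("""
    INSERT INTO metrics (time, device_id, value)
    VALUES (%s, %s, %s)
    ON CONFLICT (time, device_id) DO UPDATE SET value = EXCLUDED.value
""", row)

con.commit()
con.close()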

Please let us know if this allows you to simplify your import.

mfreed (Member) commented Aug 15, 2017

@Romathonat Unless there's anything else, I'm going to close out this issue?

Romathonat commented

Yes, I will be working on this part of the project in a few weeks. I guess the upsert will help us, so you can close this ;)

mfreed closed this as completed Aug 15, 2017