pg_dump does not work for single tables #112

Closed
lukland opened this issue Jul 6, 2017 · 12 comments

Comments

lukland commented Jul 6, 2017

I am facing an issue with the pg_dump command. I dump the data of my tables with the command:

pg_dump -a -U user -d database -t table > /my/path/table.bak

It works perfectly fine with standard tables but not with hypertables.
Is it possible to dump hypertables that way, or do we have to use the COPY approach from the docs? We have a specific use case where we need INSERT INTO statements; it's too complicated to explain in a few lines.

cevian (Contributor) commented Jul 6, 2017

I think you will find your answer in this comment of an old PR: #102 (comment)

Let me know if that helps

lukland (Author) commented Jul 7, 2017

Yes, I know that way to dump a table to CSV format. Unfortunately, I have a specific use case where I need the same kind of file that pg_dump generates (with INSERT INTO statements), not a CSV file.

Of course I can generate the .sql file from the CSV myself, but it seems a little strange that pg_dump doesn't work with hypertables.

cevian (Contributor) commented Jul 7, 2017

I am guessing you are also familiar with our pg_dump procedure for backup/restore of the entire db (as opposed to one hypertable): http://docs.timescale.com/api#backup

The reason that pg_dump doesn't just work for a single hypertable is that pg_dump does not dump the data from inherited tables when processing parent tables. So, if you just dump the hypertable (which is a parent table of a bunch of chunks) you won't get the data in the underlying chunks.

We may create a wrapper utility for this in the future. In the meantime, maybe we can help you make this work with COPY instead of INSERT INTO? What issue are you having using COPY?
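
For illustration, here is a rough sketch of the COPY route with psycopg2 (connection credentials, table name, and output path are placeholders, not from this thread). Because the COPY runs over a SELECT, the query goes through the planner and therefore includes the data in all of the hypertable's chunks:

import psycopg2

# Placeholders: adjust the connection credentials, table name, and output path.
con = psycopg2.connect(...)  # add your connection credentials
cur = con.cursor()

# COPY (SELECT ...) TO STDOUT reads through all chunks of the hypertable,
# unlike pg_dump -t, which only sees the (empty) parent table.
with open("/my/path/my_table.csv", "w") as out:
    cur.copy_expert("COPY (SELECT * FROM my_table) TO STDOUT WITH CSV HEADER", out)

con.close()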

lukland (Author) commented Jul 7, 2017

Ok thank you. I will write a little tool to convert CSV to INSERT INTO.

I will try to summarize our problem.
We generate data locally on several sites, each with its own Postgres instance. We also have a backend, also running Postgres, where the data from all sites is stored together. So we dump the data every X seconds and send it to the backend. One of our issues is that if a row already exists in the backend, Postgres raises an integrity error (of course).

In the backend, we use Python and psycopg2 with the copy_from method, which takes a complete CSV file and loads the data into Postgres. The problem is that if there is an integrity error, the whole CSV is rolled back, not just the offending line, so no data gets added! Moreover, psycopg2 does not seem to let us execute one line of the CSV at a time. Finally, some data in our database gets updated (the profile table, for example), so on a duplicate key we don't want to insert but to update! Of course this is a problem for performance, but we can specify which tables get updated so that our algorithm does not update hypertables. So, to do that, we need INSERT INTO statements that we execute line by line (we can do that with psycopg2), transforming an INSERT INTO into an UPDATE on conflict.

I hope I was clear :)
(I am working with @Romathonat, from the other issue you linked)

mfreed changed the title from "pg_dump does not work" to "pg_dump does not work for single tables" on Jul 7, 2017
cevian (Contributor) commented Jul 7, 2017

Oh OK, that's clearer. May I suggest an approach like this instead: https://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com

That should become even easier once we get #100 resolved (working on that in current sprint).

cevian (Contributor) commented Jul 7, 2017

Just to be clear, the approach I (and the message above) am suggesting is to use psycopg2's copy_from to copy the data from the CSV into a temporary table, and then do the "conflict resolution" in SQL while moving the data from the temporary table into the hypertable. In the message I linked to they did it through a series of SQL commands, but you can also do this in a PL/pgSQL function that copies the data row by row and catches integrity errors inside the function.
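
As a rough sketch of the SQL-commands variant (the "metrics" hypertable, its (time, device_id) primary key, and the file path are all hypothetical; the update-then-insert pattern follows the linked message):

import psycopg2

con = psycopg2.connect(...)  # add your connection credentials
cur = con.cursor()

# 1. Stage the CSV (no header row, comma-separated) in a temporary table
#    with the same layout as the target hypertable.
cur.execute("CREATE TEMP TABLE metrics_staging (LIKE metrics INCLUDING DEFAULTS)")
with open("/my/path/metrics.csv", "r") as f:
    cur.copy_from(f, "metrics_staging", sep=",")

# 2. Resolve conflicts in SQL: update the rows that already exist...
cur.execute("""
    UPDATE metrics AS m
    SET value = s.value
    FROM metrics_staging s
    WHERE m.time = s.time AND m.device_id = s.device_id
""")

# 3. ...and insert only the rows that are not there yet.
cur.execute("""
    INSERT INTO metrics
    SELECT s.* FROM metrics_staging s
    WHERE NOT EXISTS (
        SELECT 1 FROM metrics m
        WHERE m.time = s.time AND m.device_id = s.device_id
    )
""")

con.commit()
con.close()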

I think this approach using a temporary table will be much faster and easier than inserting row-by-row on the client side.

Please let me know if any of this was unclear or you have any other questions.

mfreed (Member) commented Jul 7, 2017

Also, this no longer seems to be the immediate issue, but the latest version of our docs (just published) provides more detail about dump/restore for single hypertables vs. the entire db:

http://docs.timescale.com/api#backup

Romathonat commented Jul 10, 2017

So, let me summarize (maybe it will help others, because I think this is a problem that will often be encountered with IoT).

Problem

There is data from multiple sites, each with its own database. There is also a backend, where the data from all sites is stored. How can we synchronize data every X minutes from each site to the backend, knowing that some of the data may be updated?

First solution

Dump the data to SQL (INSERT INTO) format on each site, then send it to the backend. On the backend, execute the lines one by one. If there is an integrity error, it means we are trying to insert a key that is already present, so instead of inserting the data we need to update it. Indeed, the data may have been updated on site.

In practice, we use Python for our backend, with psycopg2 to communicate with our PostgreSQL instance, which is boosted for time-series data with TimescaleDB.

We have a function that transforms an INSERT into an UPDATE (by @lucaslandry):

import re

def insert_to_update(test_string, id_name):
    # Rewrite "INSERT INTO table (cols) VALUES (vals)" as an UPDATE keyed on id_name.
    regex = r"INSERT INTO (?P<table>\w+) \((?P<columns>.+)\) VALUES \((?P<values>.+)\)"
    match = re.search(regex, test_string)

    if match is None:
        return None

    columns = match.group('columns').replace(' ', '').split(',')
    values = match.group('values').replace(' ', '').split(',')

    # Update every column, filtering on the value of the conflicting key.
    update_command = "UPDATE {} SET ({}) = ({}) WHERE {} = {};".format(
        match.group('table'),
        match.group('columns'),
        match.group('values'),
        id_name,
        values[columns.index(id_name)])

    return update_command

Another that uses the psycopg2 error to get the name and value of the conflicting key:

import re

def extract_id(error_string):
    # Parse psycopg2's "DETAIL:  Key (col)=(value) already exists." message.
    regex = r"DETAIL:\s{2}Key\s\((.+)\)=\((.+)\)\salready\sexists"
    match = re.search(regex, error_string)
    id_name, id_value = match.group(1), match.group(2)
    return id_name, id_value

And then we use them that way:

import os
import psycopg2

# add your connection credentials
con = psycopg2.connect(...)
cur = con.cursor()
print("Connection to Postgres Database established.")

# body is the path of the dump file received from a site
filename, extension = os.path.splitext(os.path.basename(body))

with open(body, "r") as my_file:
    for line in my_file:
        if 'INSERT INTO' in line[:11]:
            try:
                # we try to execute the insert
                cur.execute(line)
                con.commit()
            except psycopg2.IntegrityError as e:
                # if it does not work, we turn that INSERT INTO into an UPDATE.
                # first we roll back the failed transaction
                con.rollback()

                id_name, id_value = extract_id(e.pgerror)

                update_command = insert_to_update(line, id_name)

                cur.execute(update_command)
                con.commit()

con.close()

This approach works fine; however, there are several drawbacks:

Issue 1: Updating each row when there is a conflict can be quite inefficient when working on hypertables, because this data does not need to be updated (sensor data in our case).
Solution: Differentiate the two cases, depending on whether or not we are working on a hypertable.

Issue 2: Performance could certainly be improved by using Postgres's internal mechanisms.

NB: I put this post on my website if you want to see it; it's here

mfreed (Member) commented Jul 28, 2017

We just merged a PR into master with support for UPSERTS, i.e., ON CONFLICT DO UPDATE or ON CONFLICT DO NOTHING. (#137)

This should be part of the 0.3.0 release, which should go out early next week.

mfreed (Member) commented Jul 31, 2017

UPSERTS can now be found in the 0.3.0 release:
https://github.com/timescale/timescaledb/releases/tag/0.3.0

More information also in the docs: http://docs.timescale.com/api#upsert
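
For example, a minimal sketch (the "metrics" hypertable and its columns are hypothetical, carrying over the names used above): the ON CONFLICT clause replaces the client-side insert-then-update logic from earlier in this thread.

import psycopg2

con = psycopg2.connect(...)  # add your connection credentials
cur = con.cursor()

# Hypothetical hypertable "metrics" with primary key (time, device_id):
# a duplicate key now updates the existing row instead of raising an IntegrityError.
row = ("2017-07-31 12:00:00", 42, 0.5)
cur.execute("""
    INSERT INTO metrics (time, device_id, value)
    VALUES (%s, %s, %s)
    ON CONFLICT (time, device_id) DO UPDATE SET value = EXCLUDED.value
""", row)

con.commit()
con.close()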

Please let us know if this allows you to simplify your import.

mfreed (Member) commented Aug 15, 2017

@Romathonat Unless there's anything else, I'm going to close out this issue?

Romathonat commented

Yes, I will be working on this part of the project in a few weeks. I guess the upsert will help us, so you can close this ;)

mfreed closed this as completed Aug 15, 2017