copy_to causes PostgreSQL to be memory hungry with large dataframes #3355

Closed
chrnin opened this issue Feb 12, 2018 · 18 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@chrnin

chrnin commented Feb 12, 2018

When I try to insert a pretty (but not that) large data frame (built from a ~80k-row CSV file, around 140 MB of data), PostgreSQL exhausts memory (8 GB of RAM + 2 GB of swap) because the data is sent as a single huge INSERT query with all the values inline, and my system then randomly kills processes (rsession, postmaster, or anything with large memory consumption).

The example below crashes on my computer with 8 GB of RAM; to reproduce on a bigger system, increase the range of a accordingly.

tibble(
   a=1:400000,
   b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
   c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
   d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
   e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
   f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
   g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
   h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
   i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
   j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
   k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
) %>% dplyr::copy_to(
   dest = my_beloved_postgresql_database,
   df=.,
   name='kamikaze_table',
   temporary=FALSE,
   overwrite=TRUE
)

edit: I'm using dplyr 0.7.4 on Ubuntu 17.10.

@edgararuiz-zz

Hi @chrnin, can you share your connection command (dbConnect()) with me?

@chrnin
Author

chrnin commented Feb 20, 2018

Hi @edgararuiz

here it is:

  dplyr::src_postgres(
    host = conf$host,
    dbname = conf$dbname,
    port = conf$port,
    user = conf$user,
    password = conf$pw
  )

@edgararuiz-zz

Ok, I suspect that the root cause is the RPostgreSQL package. The reason is that copy_to() calls dbWriteTable(), and RPostgreSQL has a custom dbWriteTable() routine here: https://github.com/tomoakin/RPostgreSQL/blob/f93cb17cf584d57ced5045a46d16d2bfe05a2769/RPostgreSQL/R/PostgreSQL.R#L147-L153. I would suggest opening an issue in that package after confirming that calling dbWriteTable() directly, instead of copy_to(), also fails.

@chrnin
Author

chrnin commented Feb 20, 2018

I just tried dbWriteTable() and I didn't see the same memory consumption. It seems to work as intended.

tibble(
   a=1:400000,
   b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
   c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
   d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
   e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
   f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
   g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
   h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
   i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
   j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
   k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
) %>% dbWriteTable(my_beloved_postgresql_database$con, 'kamikaze_table', ., row.names = F, append=T )

@edgararuiz-zz

Can you match the arguments passed in the copy_to() command? The two calls seem to have been made with different arguments (e.g. append = T); I just want to make sure we're comparing apples to apples.

@chrnin
Author

chrnin commented Feb 20, 2018

No problem, I just dropped the table and removed the extra arguments. The function call is now:
dbWriteTable(my_beloved_postgresql_database$con, 'kamikaze_table', .)

It still works without excessive memory use.

@edgararuiz-zz

That's so weird, because copy_to() only eventually calls dbWriteTable(). One more test, please: can we try forcing the method from each package? Try DBI::dbWriteTable() and then RPostgreSQL::dbWriteTable().
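
Something like this, for example (just a sketch, assuming the reprex tibble is assigned to data and reusing your connection; the table names are only placeholders):

# force the DBI generic explicitly
DBI::dbWriteTable(my_beloved_postgresql_database$con, "kamikaze_table_dbi", data)

# call the same generic through RPostgreSQL's namespace
# (whether this name resolves may depend on the installed version)
RPostgreSQL::dbWriteTable(my_beloved_postgresql_database$con, "kamikaze_table_rpgsql", data)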

@chrnin
Author

chrnin commented Feb 20, 2018

I just did, and both work in a very comparable amount of time. I'm going to trace the queries received by the server; there might be a difference there...

@chrnin
Author

chrnin commented Feb 20, 2018

So, that's interesting.

DBI::dbWriteTable() and RPostgreSQL::dbWriteTable() are doing the same thing:

2018-02-20 20:17:28.941 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: create table "kamikaze_table"
        (
                "row.names" text,
                "a" integer,
                "b" text,
                "c" text,
                "d" text,
                "e" text,
                "f" text,
                "g" text,
                "h" text,
                "i" text,
                "j" text,
                "k" text
        )

2018-02-20 20:17:28.950 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: SHOW client_encoding
2018-02-20 20:17:28.952 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: SET CLIENT_ENCODING TO 'UTF8'
2018-02-20 20:17:28.952 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: COPY "kamikaze_table" ( "row.names","a","b","c","d","e","f","g","h","i","j","k" ) FROM STDIN

However, with dplyr::copy_to() the query is written this way:

2018-02-20 20:20:40.140 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: CREATE TABLE "kamikaze_table" ("a" integer, "b" text, "c" text, "d" text, "e" text, "f" text, "g" text, "h" text, "i" text, "j" text, "k" text)
2018-02-20 20:20:44.548 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: INSERT INTO "kamikaze_table" VALUES (1, 'Just some boring data to make the dataset grow faster, ok.. That''s pretty huge, but I have huge CSV files sometimes.', 'Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I''m sorry for that…', 'I wonder what I could say here in order to make this mildly interesting, so I''m gonna share my thoughts (as a r beginner)', 'I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…', 'I also think that a solution would be to chunk the data frame into pieces in order to limit query size.', 'By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…', 'As a fair new user, I have no idea if this is the kind of things you intend to do with this function.', 'Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement', 'But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…', 'Feel free to reject this issue, I don''t even know if it''s dumb… I''m going to survive using something else for large files… Thanks for reading :)'), (2, 'Just some boring data to make the dataset grow faster, ok.. That''s pretty huge, but I have huge CSV files sometimes.', 'Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I''m sorry for that…', 'I wonder what I could say here in order to make this mildly interesting, so I''m gonna share my thoughts (as a r beginner)', 'I think PostgreSQL is strugg......... to the end.
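
For scale, here's a rough estimate of how much SQL text that single INSERT carries (just a sketch, assuming the reprex tibble is assigned to data; quoting and commas add a bit more):

# approximate bytes of literal values per row
bytes_per_row <- sum(nchar(vapply(data[1, ], as.character, character(1))))
# approximate size of the whole statement, in MB
round(nrow(data) * bytes_per_row / 1e6)

That comes out to several hundred MB of SQL that the server has to parse as one statement, whereas COPY ... FROM STDIN streams the rows.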

@chrnin
Author

chrnin commented Feb 20, 2018

db_write_table.PostgreSQLConnection <- function(con, table, types, values,
                                                temporary = TRUE, ...) {

  db_create_table(con, table, types, temporary = temporary)

  if (nrow(values) == 0)
    return(NULL)

  cols <- lapply(values, escape, collapse = NULL, parens = FALSE, con = con)
  col_mat <- matrix(unlist(cols, use.names = FALSE), nrow = nrow(values))

  rows <- apply(col_mat, 1, paste0, collapse = ", ")
  values <- paste0("(", rows, ")", collapse = "\n, ")

  sql <- build_sql("INSERT INTO ", as.sql(table), " VALUES ", sql(values))
  dbExecute(con, sql)

  table
}

Here's the function that builds the query; it actually lives in dbplyr (/R/db-postgres.r).
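
Just to illustrate the chunking idea from my first comment, here's a rough sketch of a chunked variant of that helper (it reuses the same dbplyr internals shown above, adds a hypothetical chunk_size argument, and is untested, so don't take it as the actual fix):

db_write_table_chunked <- function(con, table, types, values,
                                   temporary = TRUE, chunk_size = 10000, ...) {
  # same setup as the helper above
  db_create_table(con, table, types, temporary = temporary)

  if (nrow(values) == 0)
    return(NULL)

  # send the rows in several moderately sized INSERTs instead of one huge one
  starts <- seq(1, nrow(values), by = chunk_size)
  for (start in starts) {
    end <- min(start + chunk_size - 1, nrow(values))
    chunk <- values[start:end, , drop = FALSE]

    cols <- lapply(chunk, escape, collapse = NULL, parens = FALSE, con = con)
    col_mat <- matrix(unlist(cols, use.names = FALSE), nrow = nrow(chunk))

    rows <- apply(col_mat, 1, paste0, collapse = ", ")
    vals <- paste0("(", rows, ")", collapse = "\n, ")

    dbExecute(con, build_sql("INSERT INTO ", as.sql(table), " VALUES ", sql(vals)))
  }

  table
}

With chunk_size = 10000 each statement for the reprex would stay around 10 MB, which is the kind of "moderately sized query" I had in mind.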

@edgararuiz-zz

Interesting, so does running dbplyr:::db_write_table.PostgreSQLConnection() crash?

@chrnin
Author

chrnin commented Feb 20, 2018

Yes, it does indeed crash. Here's what I did:

> data <- tibble(
+     a=1:400000,
+     b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
+     c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
+     d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
+     e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
+     f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
+     g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
+     h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
+     i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
+     j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
+     k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
+ ) 
> dbplyr:::db_write_table.PostgreSQLConnection(con=my_beloved_postgresql_database$con, 'kamikaze_table', types=db_data_type(my_beloved_postgresql_database$con, data), values=data, temporary=FALSE )

@edgararuiz-zz

Yay! It looks like you found the culprit. We need to either remove this customization or fix it. Thanks for your patience and great work!

@batpigandme - Would you mind also adding the bug tag for this one please?

@chrnin
Author

chrnin commented Feb 20, 2018

You're welcome! Thanks for the ideas, I wouldn't have dug so deep without your leads!

@batpigandme added the bug (an unexpected problem or unintended behavior) label Feb 20, 2018
chrnin added a commit to SGMAP-AGD/opensignauxfaibles that referenced this issue Feb 22, 2018
- updated the files processed in import_data.Rmd to take the new files into account
- use insert_multi for database insertion in most steps (temporary, see: tidyverse/dplyr#3355)
- added the date_debut_periode_autorisee field to table_activitepartielle
- replaced table_apart with table_activitepartielle in the computation of wholesample_apart
- insert table_delais into table_delais and not table_ccsv.
@hadley
Member

hadley commented May 20, 2018

I think the solution here is, rather than using dplyr::src_postgres(), to make a connection using RPostgres (not RPostgreSQL) so we can use its built-in dbWriteTable() rather than the code I clearly had to write to work around some bug in RPostgreSQL.
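
Something along these lines (sketch only, reusing the conf settings from the src_postgres() call earlier in the thread, with data as the reprex tibble):

library(DBI)
library(dplyr)

# connect through RPostgres instead of RPostgreSQL
con <- dbConnect(
  RPostgres::Postgres(),
  host     = conf$host,
  dbname   = conf$dbname,
  port     = conf$port,
  user     = conf$user,
  password = conf$pw
)

# with a DBI connection (and dbplyr installed), copy_to() goes through
# RPostgres's own dbWriteTable(), which streams the data with COPY
# instead of building one giant INSERT
copy_to(con, data, name = "kamikaze_table", temporary = FALSE, overwrite = TRUE)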

@krlmlr
Member

krlmlr commented Jun 25, 2018

/move to dbplyr

@romainfrancois
Member

Not sure why the move bot did not close this 🤷‍♂️

@lock

lock bot commented Mar 13, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock bot locked and limited conversation to collaborators Mar 13, 2019