copy_to causes PostgreSQL to be memory hungry with large dataframes #3355

Closed
chrnin opened this issue Feb 12, 2018 · 18 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@chrnin

chrnin commented Feb 12, 2018

When I try to insert a pretty (but not that) large data frame (built from a ~80k-row CSV file, around 140 MB of data), PostgreSQL exhausts memory (8 GB of RAM + 2 GB of swap) because the data is sent as a single huge INSERT query with all the values inline, and my system then randomly kills processes (rsession, postmaster, or anything with large memory consumption).

The example below crashes on my computer with 8 GB of RAM; to reproduce on a bigger system, increase the range of a accordingly.

tibble(
   a=1:400000,
   b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
   c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
   d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
   e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
   f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
   g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
   h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
   i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
   j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
   k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
) %>% dplyr::copy_to(
   dest = my_beloved_postgresql_database,
   df=.,
   name='kamikaze_table',
   temporary=FALSE,
   overwrite=TRUE
)

edit: I'm using dplyr 0.7.4 on Ubuntu 17.10.

@edgararuiz-zz

Hi @chrnin, can you share your connection command (dbConnect()) with me?

@chrnin
Author

chrnin commented Feb 20, 2018

Hi @edgararuiz

here it is:

  dplyr::src_postgres(
    host = conf$host,
    dbname = conf$dbname,
    port = conf$port,
    user = conf$user,
    password = conf$pw
  )

@edgararuiz-zz

Ok, I suspect that the root cause is the RPostgreSQL package. The reason is that copy_to() calls dbWriteTable(), and RPostgreSQL has a custom dbWriteTable() routine here: https://github.com/tomoakin/RPostgreSQL/blob/f93cb17cf584d57ced5045a46d16d2bfe05a2769/RPostgreSQL/R/PostgreSQL.R#L147-L153. I would suggest opening an issue in that package after confirming that calling dbWriteTable() directly, instead of copy_to(), also fails.

@chrnin
Author

chrnin commented Feb 20, 2018

I just tried dbWriteTable() and I didn't see the same memory consumption. It seems to work as intended.

tibble(
   a=1:400000,
   b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
   c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
   d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
   e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
   f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
   g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
   h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
   i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
   j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
   k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
) %>% dbWriteTable(my_beloved_postgresql_database$con, 'kamikaze_table', ., row.names = F, append=T )

@edgararuiz-zz

Can you match the arguments passed in the copy_to() command? The two calls seem to have been made with different arguments (e.g. append = T); I just want to make sure we're comparing apples to apples.

@chrnin
Author

chrnin commented Feb 20, 2018

No problem, I just dropped the table and removed the extra arguments. The function call is now:
dbWriteTable(my_beloved_postgresql_database$con, 'kamikaze_table', .)

It still works without excessive memory use.

@edgararuiz-zz

That's so weird, because copy_to() only eventually calls dbWriteTable(). One more test, please: can we try forcing the method from each package? Try DBI::dbWriteTable() and then RPostgreSQL::dbWriteTable().
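
Something like this, for example (just a sketch, assuming the reprex tibble is assigned to data and reusing your connection; the table names are only placeholders):

# force the DBI generic explicitly
DBI::dbWriteTable(my_beloved_postgresql_database$con, "kamikaze_table_dbi", data)

# call the same generic through RPostgreSQL's namespace
# (whether this name resolves may depend on the installed version)
RPostgreSQL::dbWriteTable(my_beloved_postgresql_database$con, "kamikaze_table_rpgsql", data)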

@chrnin
Author

chrnin commented Feb 20, 2018

I just did, and both work in a very comparable amount of time. I'm going to trace the queries received by the server; there might be a difference there...

@chrnin
Author

chrnin commented Feb 20, 2018

So, that's interesting.

DBI::dbWriteTable() and RPostgreSQL::dbWriteTable() are doing the same thing:

2018-02-20 20:17:28.941 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: create table "kamikaze_table"
        (
                "row.names" text,
                "a" integer,
                "b" text,
                "c" text,
                "d" text,
                "e" text,
                "f" text,
                "g" text,
                "h" text,
                "i" text,
                "j" text,
                "k" text
        )

2018-02-20 20:17:28.950 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: SHOW client_encoding
2018-02-20 20:17:28.952 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: SET CLIENT_ENCODING TO 'UTF8'
2018-02-20 20:17:28.952 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: COPY "kamikaze_table" ( "row.names","a","b","c","d","e","f","g","h","i","j","k" ) FROM STDIN

However, with dplyr::copy_to() the query is written this way:

2018-02-20 20:20:40.140 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: CREATE TABLE "kamikaze_table" ("a" integer, "b" text, "c" text, "d" text, "e" text, "f" text, "g" text, "h" text, "i" text, "j" text, "k" text)
2018-02-20 20:20:44.548 CET [22467] postgres@my_beloved_postgresql_database LOG:  statement: INSERT INTO "kamikaze_table" VALUES (1, 'Just some boring data to make the dataset grow faster, ok.. That''s pretty huge, but I have huge CSV files sometimes.', 'Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I''m sorry for that…', 'I wonder what I could say here in order to make this mildly interesting, so I''m gonna share my thoughts (as a r beginner)', 'I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…', 'I also think that a solution would be to chunk the data frame into pieces in order to limit query size.', 'By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…', 'As a fair new user, I have no idea if this is the kind of things you intend to do with this function.', 'Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement', 'But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…', 'Feel free to reject this issue, I don''t even know if it''s dumb… I''m going to survive using something else for large files… Thanks for reading :)'), (2, 'Just some boring data to make the dataset grow faster, ok.. That''s pretty huge, but I have huge CSV files sometimes.', 'Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I''m sorry for that…', 'I wonder what I could say here in order to make this mildly interesting, so I''m gonna share my thoughts (as a r beginner)', 'I think PostgreSQL is strugg......... to the end.
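
For scale, here's a rough estimate of how much SQL text that single INSERT carries (just a sketch, assuming the reprex tibble is assigned to data; quoting and commas add a bit more):

# approximate bytes of literal values per row
bytes_per_row <- sum(nchar(vapply(data[1, ], as.character, character(1))))
# approximate size of the whole statement, in MB
round(nrow(data) * bytes_per_row / 1e6)

That comes out to several hundred MB of SQL that the server has to parse as one statement, whereas COPY ... FROM STDIN streams the rows.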

@chrnin
Author

chrnin commented Feb 20, 2018

db_write_table.PostgreSQLConnection <- function(con, table, types, values,
                                                temporary = TRUE, ...) {

  db_create_table(con, table, types, temporary = temporary)

  if (nrow(values) == 0)
    return(NULL)

  cols <- lapply(values, escape, collapse = NULL, parens = FALSE, con = con)
  col_mat <- matrix(unlist(cols, use.names = FALSE), nrow = nrow(values))

  rows <- apply(col_mat, 1, paste0, collapse = ", ")
  values <- paste0("(", rows, ")", collapse = "\n, ")

  sql <- build_sql("INSERT INTO ", as.sql(table), " VALUES ", sql(values))
  dbExecute(con, sql)

  table
}

Here's the function that builds the query; it actually lives in dbplyr (/R/db-postgres.r).
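
Just to illustrate the chunking idea from my first comment, here's a rough sketch of a chunked variant of that helper (it reuses the same dbplyr internals shown above, adds a hypothetical chunk_size argument, and is untested, so don't take it as the actual fix):

db_write_table_chunked <- function(con, table, types, values,
                                   temporary = TRUE, chunk_size = 10000, ...) {
  # same setup as the helper above
  db_create_table(con, table, types, temporary = temporary)

  if (nrow(values) == 0)
    return(NULL)

  # send the rows in several moderately sized INSERTs instead of one huge one
  starts <- seq(1, nrow(values), by = chunk_size)
  for (start in starts) {
    end <- min(start + chunk_size - 1, nrow(values))
    chunk <- values[start:end, , drop = FALSE]

    cols <- lapply(chunk, escape, collapse = NULL, parens = FALSE, con = con)
    col_mat <- matrix(unlist(cols, use.names = FALSE), nrow = nrow(chunk))

    rows <- apply(col_mat, 1, paste0, collapse = ", ")
    vals <- paste0("(", rows, ")", collapse = "\n, ")

    dbExecute(con, build_sql("INSERT INTO ", as.sql(table), " VALUES ", sql(vals)))
  }

  table
}

With chunk_size = 10000 each statement for the reprex would stay around 10 MB, which is the kind of "moderately sized query" I had in mind.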

@edgararuiz-zz

Interesting, so does running dbplyr:::db_write_table.PostgreSQLConnection() crash?

@chrnin
Author

chrnin commented Feb 20, 2018

Yes, it does indeed crash. Here's what I did:

> data <- tibble(
+     a=1:400000,
+     b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
+     c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
+     d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
+     e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
+     f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
+     g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
+     h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
+     i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
+     j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
+     k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
+ ) 
> dbplyr:::db_write_table.PostgreSQLConnection(con=my_beloved_postgresql_database$con, 'kamikaze_table', types=db_data_type(my_beloved_postgresql_database$con, data), values=data, temporary=FALSE )

@edgararuiz-zz

Yay! It looks like you found the culprit. We need to either remove this customization or fix it. Thanks for your patience and great work!

@batpigandme - Would you mind also adding the bug tag for this one please?

@chrnin
Author

chrnin commented Feb 20, 2018

You're welcome! Thanks for the ideas, I wouldn't have dug so deep without your leads!

@batpigandme added the bug (an unexpected problem or unintended behavior) label Feb 20, 2018
chrnin added a commit to SGMAP-AGD/opensignauxfaibles that referenced this issue Feb 22, 2018
- updated the files processed in import_data.Rmd to take the new files into account
- use insert_multi for database insertion in most steps (temporary, see: tidyverse/dplyr#3355)
- added the date_debut_periode_autorisee field to table_activitepartielle
- replaced table_apart with table_activitepartielle in the computation of wholesample_apart
- insert table_delais into table_delais and not table_ccsv.
@hadley
Member

hadley commented May 20, 2018

I think the solution here is, rather than using dplyr::src_postgres(), to make a connection using RPostgres (not RPostgreSQL) so we can use its built-in dbWriteTable() rather than the code I clearly had to write to work around some bug in RPostgreSQL.
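
Something along these lines (sketch only, reusing the conf settings from the src_postgres() call earlier in the thread, with data as the reprex tibble):

library(DBI)
library(dplyr)

# connect through RPostgres instead of RPostgreSQL
con <- dbConnect(
  RPostgres::Postgres(),
  host     = conf$host,
  dbname   = conf$dbname,
  port     = conf$port,
  user     = conf$user,
  password = conf$pw
)

# with a DBI connection (and dbplyr installed), copy_to() goes through
# RPostgres's own dbWriteTable(), which streams the data with COPY
# instead of building one giant INSERT
copy_to(con, data, name = "kamikaze_table", temporary = FALSE, overwrite = TRUE)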

@krlmlr
Member

krlmlr commented Jun 25, 2018

/move to dbplyr

@romainfrancois
Member

Not sure why the move bot did not close this 🤷‍♂️

@lock

lock bot commented Mar 13, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock bot locked and limited conversation to collaborators Mar 13, 2019