copy_to causes PostgreSQL to be memory hungry with large dataframes #3355
Comments
Hi @chrnin, can you share with me your connection command?
Hi @edgararuiz, here it is: …
Ok, I suspect that the root cause is the …
I just tried with dbWriteTable and I didn't witness the same memory consumption. It seems to work as expected.
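(For reference, a minimal sketch of the kind of dbWriteTable call being compared here; the connection object `con`, the table name, and `data` are assumptions rather than details taken from the thread:)

```r
# Hypothetical sketch: writing the data frame with DBI::dbWriteTable()
# instead of dplyr::copy_to(); the connection and table name are made up.
DBI::dbWriteTable(con, "my_table", data, overwrite = TRUE)
```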
Can you match the arguments passed in the …?
No problem, I just dropped the table and removed the extra arguments. The function call is now like this: … It still works without excessive memory use.
That's so weird, cause …
I just did it and both are working in a very comparable amount of time. I'm going to trace the queries received by the server; there might be a difference there...
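(As an aside, one generic way to see the statements the server receives is PostgreSQL's statement logging; this is only a sketch, not necessarily how the tracing was done in this thread, and it requires superuser privileges:)

```r
# Generic sketch: turn on server-side statement logging so the incoming
# INSERT shows up in the PostgreSQL log (PostgreSQL >= 9.4, superuser only).
DBI::dbExecute(con, "ALTER SYSTEM SET log_statement = 'all'")
DBI::dbGetQuery(con, "SELECT pg_reload_conf()")
```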
So, that's interesting.
However, with …
Here's the function that builds the query; it's actually in dbplyr (/R/db-postgres.r):

```r
db_write_table.PostgreSQLConnection <- function(con, table, types, values,
                                                 temporary = TRUE, ...) {
  db_create_table(con, table, types, temporary = temporary)
  if (nrow(values) == 0)
    return(NULL)

  # Escape every column, then paste all rows into "(v1, v2, ...)" tuples...
  cols <- lapply(values, escape, collapse = NULL, parens = FALSE, con = con)
  col_mat <- matrix(unlist(cols, use.names = FALSE), nrow = nrow(values))

  rows <- apply(col_mat, 1, paste0, collapse = ", ")
  values <- paste0("(", rows, ")", collapse = "\n, ")

  # ...so the whole data frame ends up in a single INSERT statement.
  sql <- build_sql("INSERT INTO ", as.sql(table), " VALUES ", sql(values))
  dbExecute(con, sql)

  table
}
```
Interesting, so does running …?
Yes, it indeed crashes. Here's what I did:

```r
> data <- tibble(
+   a=1:400000,
+   b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
+   c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
+   d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
+   e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
+   f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
+   g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
+   h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
+   i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
+   j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
+   k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
+ )
> dbplyr:::db_write_table.PostgreSQLConnection(con=my_beloved_postgresql_database$con, 'kamikaze_table', types=db_data_type(my_beloved_postgresql_database$con, data), values=data, temporary=FALSE)
```
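(As an aside, the chunking idea mentioned in the filler text above, splitting the data frame and sending moderately sized writes instead of one giant statement, could look roughly like the sketch below; the chunk size, the table name, and the use of DBI::dbWriteTable are assumptions for illustration, not dbplyr code:)

```r
# Hypothetical sketch of the chunked-write idea.
# `con` is an open DBI connection; chunk size and table name are made up.
chunk_size <- 10000
pieces <- split(data, (seq_len(nrow(data)) - 1) %/% chunk_size)

DBI::dbWriteTable(con, "kamikaze_table", pieces[[1]], overwrite = TRUE)
for (piece in pieces[-1]) {
  DBI::dbWriteTable(con, "kamikaze_table", piece, append = TRUE)
}
```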
Yay! It looks like you found the culprit. We need to either remove this customization or fix it. Thanks for your patience and great work! @batpigandme - Would you mind also adding the …?
You're welcome! Thanks for the ideas, I wouldn't have dived so deep without your leads!
- updated the files processed in import_data.Rmd to take the new files into account - used insert_multi for database insertion in most steps (temporary, see: tidyverse/dplyr#3355) - added the date_debut_periode_autorisee field to table_activitepartielle - replaced table_apart with table_activitepartielle in the computation of wholesample_apart - inserted table_delais into table_delais and not table_ccsv.
I think the solution here is rather than using …
/move to dbplyr
Not sure why the move bot did not close this 🤷‍♂️
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
When I try to insert a fairly (but not extremely) large data frame (built from a ~80k-row CSV file, roughly 140 MB of data), PostgreSQL exhausts memory (8 GB RAM + 2 GB swap) because all the data is packed into one huge INSERT query, causing my system to randomly kill processes (rsession, postmaster, or whatever else is using a lot of memory).
The example below crashes on my computer with 8 GB of RAM; to reproduce on a bigger system, increase the `a` range accordingly.
Edit: I'm using dplyr 0.7.4 on Ubuntu 17.10.
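The original example code is not preserved in this copy of the thread; a minimal sketch consistent with the description above (the connection details, table name, and filler column are assumptions) might look like:

```r
library(dplyr)

# Assumed connection; the thread indicates an RPostgreSQL (PostgreSQLConnection) backend.
con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), dbname = "testdb")
db  <- dbplyr::src_dbi(con)

# Increase the range of `a` to scale the data up on machines with more RAM.
data <- tibble(
  a = 1:400000,
  b = "some reasonably long filler text to bulk up the data frame"
)

# With dplyr 0.7.4, this ends up building one giant INSERT statement for
# PostgreSQL connections, which is what exhausts the server's memory.
copy_to(db, data, "big_table", temporary = FALSE)
```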