Bug when importing 30GB csv file #141
Thanks for the detailed bug report. It will be another 17 hours before I can download the file, as I don't have unlimited internet where I am, so please be patient. I will prioritise this.
The great thing about disk.frame is that you don't need to combine the CSVs before you read them. In fact, it's recommended that you don't, so you can take advantage of parallel processing, where each worker looks at one file at a time. But I will do the combination and then test. It's a great test for disk.frame!
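For illustration, a minimal sketch of that per-file pattern (the folder and object names here are assumptions; the actual paths used in this thread appear further down):

```r
library(disk.frame)

# one background worker per CPU core by default
setup_disk.frame()

# point csv_to_disk.frame() at the individual CSVs rather than a combined file,
# so each worker can read one file at a time
csvs <- list.files("AirOnTimeCSV", pattern = "\\.csv$", full.names = TRUE)
flights.df <- csv_to_disk.frame(csvs, outdir = "airontime.df")
```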
Thanks!
Yes, I'm aware that combining the CSVs beforehand is pointless, but as you guessed I wanted to test things out and see if I could reproduce the blog post I mentioned. Thanks for your responsiveness!
I have implemented a `{readr}` backend. Also, please update the package first:

```r
devtools::install_github("xiaodaigh/disk.frame")
```

```r
library(tidyverse)
library(disk.frame)

# 6 parallel workers; allow arbitrarily large globals to be shipped to them
setup_disk.frame(workers = 6)
options(future.globals.maxSize = Inf)

path_to_data <- "/run/media/cbrunos/0a8a239e-7f3e-4756-9ddd-129c23fad79b/laragreen/Downloads/AirOnTimeCSV/"

# read the combined CSV in chunks of 1 million rows using the readr backend
flights.df <- csv_to_disk.frame(
  paste0(path_to_data, "combined.csv"),
  outdir = paste0(path_to_data, "combined.df"),
  in_chunk_size = 1e6,
  backend = "readr")
```
Hi, this runs, but it maxes out my RAM and R gets killed by Linux.
@b-rodrigues Ok, I got this working with #147. The problem is with the `readr` backend. There are at least two other ways to do it now. Don't forget to install the latest version first:

```r
remotes::install_github("xiaodaigh/disk.frame")
```

Using the LaF backend:

```r
library(disk.frame)
setup_disk.frame()

path_to_data <- "c:/data/"

# chunk size = total rows divided by the number of chunks recommended for this file size
rows = 148619656
recommended_nchunks = recommend_nchunks(file.size(file.path(path_to_data, "combined.csv")))
in_chunk_size = ceiling(rows / recommended_nchunks)

system.time(flights.df <- csv_to_disk.frame(
  paste0(path_to_data, "combined.csv"),
  outdir = paste0(path_to_data, "combined.laf.df"),
  in_chunk_size = in_chunk_size,
  backend = "LaF"
))
```

Using the `readr` and `readLines` chunk readers. Both of the below work fine:

```r
library(disk.frame)
setup_disk.frame()
rows = 148619656
recommended_nchunks = recommend_nchunks(file.size(file.path(path_to_data, "combined.csv")))
in_chunk_size = ceiling(rows/ recommended_nchunks)
system.time(a <- csv_to_disk.frame(
file.path(path_to_data, "combined.csv"),
outdir = file.path(path_to_data, "combined.readr.df"),
in_chunk_size = in_chunk_size,
colClasses = list(character = c("WHEELS_OFF","WHEELS_ON")),
chunk_reader = "readr"
))
```

or

```r
library(disk.frame)
setup_disk.frame()
rows = 148619656
recommended_nchunks = recommend_nchunks(file.size(file.path(path_to_data, "combined.csv")))
in_chunk_size = ceiling(rows/ recommended_nchunks)
system.time(a <- csv_to_disk.frame(
file.path(path_to_data, "combined.csv"),
outdir = file.path(path_to_data, "combined.readr.df"),
in_chunk_size = in_chunk_size,
colClasses = list(character = c("WHEELS_OFF","WHEELS_ON")),
chunk_reader = "readLines"
))
```

The best way is to not combine first!

```r
path_to_data = "c:/data/AirOnTimeCSV/"
system.time(a <- csv_to_disk.frame(
list.files(path_to_data, pattern = ".csv$", full.names = TRUE),
outdir = file.path(path_to_data, "airontimecsv.df"),
colClasses = list(character = c("WHEELS_OFF", "WHEELS_ON"))
))
```

Repeating your analysis using `{disk.frame}`:

```r
library(disk.frame)
setup_disk.frame()
path_to_data = "c:/data/Air"
a = disk.frame(file.path(path_to_data, "airontimecsv.df"))
system.time(r_mean_del_delay <- a %>%
group_by(YEAR, MONTH, DAY_OF_MONTH) %>%
summarise(sum_delay = sum(DEP_DELAY, na.rm = TRUE), n = n()) %>%
collect %>%
group_by(YEAR, MONTH, DAY_OF_MONTH) %>%
summarise(mean_delay = sum(sum_delay)/sum(n)))
library(lubridate)
dep_delay = r_mean_del_delay %>%
arrange(YEAR, MONTH, DAY_OF_MONTH) %>%
mutate(date = ymd(paste(YEAR, MONTH, DAY_OF_MONTH, sep = "-")))
library(ggplot2)
ggplot(dep_delay, aes(date, mean_delay)) + geom_smooth()
```
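A side note on why the summary above is done in two passes (chunk-wise sums and counts, then one final division). This toy example is not from the thread; it only illustrates that averaging per-chunk means would be wrong when chunks differ in size:

```r
x <- c(1, 2, 3, 10)
chunk1 <- x[1:3]
chunk2 <- x[4]

# naive "mean of chunk means" is biased when chunk sizes differ
mean(c(mean(chunk1), mean(chunk2)))                              # 6

# carrying sums and counts gives the exact overall mean
(sum(chunk1) + sum(chunk2)) / (length(chunk1) + length(chunk2))  # 4, same as mean(x)
```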
Another very fast way to do it is to split up the file first using `{bigreadr}`:

```r
pt = proc.time()
a = bigreadr::split_file(file.path(path_to_data, "combined.csv"), every_nlines = in_chunk_size, repeat_header = TRUE)
f = bigreadr::get_split_files(a)
csv_to_disk.frame(
f,
outdir = "c:/data/split30g.df"
)
data.table::timetaken(pt)
```

Update 20190910
Just added a `bigreadr` `chunk_reader`:

```r
pt = proc.time()
csv_to_disk.frame(
file.path(path_to_data, "combined.csv"),
outdir = "c:/data/split30g.df",
in_chunk_size = in_chunk_size,
chunk_reader = "bigreadr",
colClasses = list(character = c(22,23))
#colClasses = list(character = c("WHEELS_OFF","WHEELS_ON"))
)
data.table::timetaken(pt) # 3 min total
```
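If useful, one way to double-check that columns 22 and 23 really are WHEELS_OFF and WHEELS_ON before relying on the index form of `colClasses` (this snippet is an assumption, not from the thread):

```r
# read only the header row to map column names to positions
hdr <- names(data.table::fread(file.path(path_to_data, "combined.csv"), nrows = 0))
which(hdr %in% c("WHEELS_OFF", "WHEELS_ON"))
```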
Thank you very much for your help and responsiveness! I was able to run the code on my machine without any issues. I have written a blog post about this magnificent package.
Closing, as the issue seems resolved.
I am hitting the same error.
@jangorecki Perhaps this issue: Rdatatable/data.table#3526. Try that first. It would be great if the data can be shared; always looking for large datasets.
I need to debug fread's segfault, so disk.frame won't address my needs. Thanks anyway.
Hi all,
Then, since one-stage group-by has been implemented, I tried to calculate the average delay by origin in the following way:
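(The snippet itself was not preserved in this thread; it was presumably something along these lines. This is only a sketch, with the object and column names assumed.)

```r
library(disk.frame)
library(dplyr)
setup_disk.frame(workers = 11)

flights.df <- disk.frame("combined.df")

# one-stage group-by: disk.frame performs the chunk-wise aggregation internally
mean_delay_by_origin <- flights.df %>%
  group_by(ORIGIN) %>%
  summarise(mean_delay = mean(DEP_DELAY, na.rm = TRUE)) %>%
  collect()
```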
Unfortunately this operation makes my machine crash, sometimes causing a blue screen (I'm on Windows). I'm using a 32 GB hexacore machine (12 threads, so I used 11 workers for the above-mentioned computations), Microsoft R Open 4.0.2, and disk.frame 0.3.7.
Try this, which should load only the 5 columns needed for this operation:
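The exact snippet is not preserved here. The usual pattern for this is `srckeep()`, which restricts the columns read from disk, roughly as in the sketch below; note the reply mentions five columns, and the names here are assumptions:

```r
# only the listed columns are read from disk for each chunk, cutting memory use
mean_delay_by_origin <- flights.df %>%
  srckeep(c("ORIGIN", "DEP_DELAY")) %>%
  group_by(ORIGIN) %>%
  summarise(mean_delay = mean(DEP_DELAY, na.rm = TRUE)) %>%
  collect()
```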
In the future, we might be able to detect the needed columns from the code, so the user doesn't have to specify them.
Thanks @xiaodaigh, it works well now. I noticed two things:
Does disk.frame creation force the cluster to reduce its nodes (sessions) in some way?
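Not from the thread, but one way to see how many workers are actually active at a given moment is via the future package, which disk.frame uses for its cluster:

```r
library(future)

# the current parallel plan and the number of workers it provides
plan()
nbrOfWorkers()
```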
Hi,
I'm trying to reproduce a blog post I wrote where I used Spark to read in a 30GB file: https://www.brodrigues.co/blog/2018-02-16-importing_30gb_of_data/
You can download the data here: https://packages.revolutionanalytics.com/datasets/AirOnTime87to12/ (it's the zip file).
This zip file contains a folder with a lot of smaller CSV files. I could import these, but I wanted to try the scenario where I would get a 30GB file and import it in one go (just as described in the post).
So first I combine the files into one big CSV file:
```bash
head -1 airOT198710.csv > combined.csv
for file in $(ls airOT*); do cat $file | sed "1 d" >> combined.csv; done
```
and then I try out `{disk.frame}`. But when I try this, I get the following error message:
The hard drive has around 600GB of space left, and the machine has 16GB of RAM.
I am running the latest `{disk.frame}` from GitHub as well. Here is some info about my session: