RCrawler memory leak Version 0.1.9-1 #57

Open
KostaGav opened this issue Mar 29, 2019 · 0 comments
KostaGav commented Mar 29, 2019

Thank you for developing this cool package! Unfortunately, I have a problem with RAM usage during large crawling jobs, which seems to occur only with the latest version of the package (0.1.9-1). I did not encounter this problem with the previous version (0.1.8-0).

I run the script in a loop and save the output of each iteration. All objects are removed after each iteration. Yet the script consumes about 100 MB of additional RAM every 20 minutes, exceeding my total RAM after a few hours.

With the old version of your package, I was able to run the script for several weeks without any memory problems.

Do you have any idea why this problem might be occurring now?

Please find my script below (data, titlepath, articlepath, paper and ending, as well as the other objects used in the script such as t, year, month and wd, have been loaded beforehand):

```r
library(Rcrawler)
library(tidyverse)  # dplyr, purrr, tidyr, tibble, stringr, readr

for (i in 1:length(data)) {

  # crawl a single URL and extract title and article text via XPath
  Rcrawler(data[i],
           MaxDepth = 0,
           ExtractXpathPat = c(titlepath[t], articlepath[t]),
           crawlUrlfilter = paste0(".*www\\.", paper, "\\", ending, ".*html$"),
           no_cores = 1, no_conn = 1, saveOnDisk = FALSE)

  cat(paste0("Timeframe ", year[t], "-", month[t], "\n",
             "Already ", i, " sites scraped. There are ",
             length(data) - i, " sites left to scrape"))

  if (exists("DATA")) {
    # restructure data list
    dat <- DATA %>%
      map_df(enframe) %>%
      slice(-1) %>%
      unnest() %>%
      mutate(Id = rep(1:nrow(INDEX), each = 2)) %>%
      group_by(Id) %>%
      rowid_to_column("id_h") %>%
      ungroup() %>%
      select(id_h, Id, value) %>%
      arrange(id_h) %>%
      spread(id_h, value) %>%
      rename(title = `1`, article = `2`)

    # extract date of article and broader area (sports, politics etc.)
    INDEX <- INDEX %>%
      mutate(Id = as.numeric(Id),
             date = str_match(Url, "/web/(\\w+?)/")[, 2],
             date = parse_datetime(str_sub(date, 1, 8), "%Y%m%d")) %>%
      as.tibble()

    # merge articles with meta data and clean up a bit
    dat_full <- INDEX %>%
      mutate(Id = as.numeric(Id)) %>%
      left_join(dat, by = c("Id")) %>%
      mutate(article = str_replace_all(article, pattern = "\n", " "),
             article = str_replace_all(article, pattern = "\t", " "),
             article = str_replace_all(article, pattern = "\\s+", " "))

    save(dat_full, file = paste0(wd, "/", year[t], "/articles_", paper,
                                 year[t], "-", month[t], "-", i, ".RData"))

    # drop this iteration's objects, then collect garbage
    rm(DATA, INDEX, dat, dat_full)
    gc()
  } else {
    next
  }
}
```
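
For what it is worth, a stripped-down variant of the loop without any of the tidyverse post-processing might help to narrow down whether the growth comes from Rcrawler itself. The memory check via gc() is just an illustrative placeholder, not part of my actual script:

```r
# Minimal sketch (illustrative): crawl the same URLs, drop Rcrawler's output,
# and print how much memory R reports as used after each iteration.
library(Rcrawler)

for (i in seq_along(data)) {
  Rcrawler(data[i], MaxDepth = 0, no_cores = 1, no_conn = 1, saveOnDisk = FALSE)
  if (exists("DATA"))  rm(DATA)
  if (exists("INDEX")) rm(INDEX)
  used_mb <- sum(gc()[, 2])  # total Mb used by R after garbage collection
  cat(sprintf("Iteration %d: %.0f Mb in use\n", i, used_mb))
}
```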

Update: I have been running both versions of the package on the same URL list on two virtual machines with identical specs (4 cores, 8 GB RAM, Ubuntu 16) for six hours now. Under version 0.1.9-1 the process currently occupies nearly 5 GB of RAM (growing by about 100 MB every 15 minutes), whereas under 0.1.8-0 it has stayed at about 1 GB with no change in memory usage over the same six hours.
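
In case it helps reproduce the comparison, the resident memory of the R process can be logged from within R roughly like this (a sketch assuming Linux with the ps utility available; the log_rss helper is illustrative, not part of my monitoring setup):

```r
# Illustrative helper (assumes Linux and ps): print the R process's resident
# set size so the growth over time can be tracked per iteration.
log_rss <- function() {
  rss_kb <- as.numeric(system(paste("ps -o rss= -p", Sys.getpid()), intern = TRUE))
  message(format(Sys.time()), " - RSS: ", round(rss_kb / 1024), " MB")
}

log_rss()  # call once per iteration, e.g. right after rm()/gc()
```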
