RCrawler memory leak Version 0.1.9-1 #57

Open
KostaGav opened this issue Mar 29, 2019 · 0 comments
KostaGav commented Mar 29, 2019

Thank you for developing this cool package! Unfortunately, I have a problem with RAM usage during large crawling jobs, which seems to occur only with the latest version of the package (0.1.9-1). I did not encounter this problem with the previous version (0.1.8-0).

I run the script in a loop and save the output of each iteration. All objects are removed after each iteration. Yet the script consumes about 100 MB of additional RAM every 20 minutes, exceeding my total RAM after a few hours.

With the old version of your package, I was able to run the script for several weeks without any memory problems.

Do you have any idea why this problem might be occurring now?

Please find my script below (data, titlepath, articlepath, paper and ending, as well as the other objects used in the script such as t, year, month and wd, have been loaded beforehand):

```r
library(Rcrawler)
library(tidyverse)  # dplyr, purrr, tidyr, tibble, stringr, readr

for (i in 1:length(data)) {

  # crawl a single URL and extract title and article text via XPath
  Rcrawler(data[i],
           MaxDepth = 0,
           ExtractXpathPat = c(titlepath[t], articlepath[t]),
           crawlUrlfilter = paste0(".*www\\.", paper, "\\", ending, ".*html$"),
           no_cores = 1, no_conn = 1, saveOnDisk = FALSE)

  cat(paste0("Timeframe ", year[t], "-", month[t], "\n",
             "Already ", i, " sites scraped. There are ",
             length(data) - i, " sites left to scrape"))

  if (exists("DATA")) {
    # restructure data list
    dat <- DATA %>%
      map_df(enframe) %>%
      slice(-1) %>%
      unnest() %>%
      mutate(Id = rep(1:nrow(INDEX), each = 2)) %>%
      group_by(Id) %>%
      rowid_to_column("id_h") %>%
      ungroup() %>%
      select(id_h, Id, value) %>%
      arrange(id_h) %>%
      spread(id_h, value) %>%
      rename(title = `1`, article = `2`)

    # extract date of article and broader area (sports, politics etc.)
    INDEX <- INDEX %>%
      mutate(Id = as.numeric(Id),
             date = str_match(Url, "/web/(\\w+?)/")[, 2],
             date = parse_datetime(str_sub(date, 1, 8), "%Y%m%d")) %>%
      as.tibble()

    # merge articles with meta data and clean up a bit
    dat_full <- INDEX %>%
      mutate(Id = as.numeric(Id)) %>%
      left_join(dat, by = c("Id")) %>%
      mutate(article = str_replace_all(article, pattern = "\n", " "),
             article = str_replace_all(article, pattern = "\t", " "),
             article = str_replace_all(article, pattern = "\\s+", " "))

    save(dat_full, file = paste0(wd, "/", year[t], "/articles_", paper,
                                 year[t], "-", month[t], "-", i, ".RData"))

    # drop this iteration's objects, then collect garbage
    rm(DATA, INDEX, dat, dat_full)
    gc()
  } else {
    next
  }
}
```
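
For what it is worth, a stripped-down variant of the loop without any of the tidyverse post-processing might help to narrow down whether the growth comes from Rcrawler itself. The memory check via gc() is just an illustrative placeholder, not part of my actual script:

```r
# Minimal sketch (illustrative): crawl the same URLs, drop Rcrawler's output,
# and print how much memory R reports as used after each iteration.
library(Rcrawler)

for (i in seq_along(data)) {
  Rcrawler(data[i], MaxDepth = 0, no_cores = 1, no_conn = 1, saveOnDisk = FALSE)
  if (exists("DATA"))  rm(DATA)
  if (exists("INDEX")) rm(INDEX)
  used_mb <- sum(gc()[, 2])  # total Mb used by R after garbage collection
  cat(sprintf("Iteration %d: %.0f Mb in use\n", i, used_mb))
}
```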

Update: I have been running both versions of the package on the same URL list on two virtual machines with identical specs (4 cores, 8 GB RAM, Ubuntu 16) for six hours now. Under version 0.1.9-1 the process currently occupies nearly 5 GB of RAM (growing by about 100 MB every 15 minutes), whereas under 0.1.8-0 it has stayed at about 1 GB with no change in memory usage over the same six hours.
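
In case it helps reproduce the comparison, the resident memory of the R process can be logged from within R roughly like this (a sketch assuming Linux with the ps utility available; the log_rss helper is illustrative, not part of my monitoring setup):

```r
# Illustrative helper (assumes Linux and ps): print the R process's resident
# set size so the growth over time can be tracked per iteration.
log_rss <- function() {
  rss_kb <- as.numeric(system(paste("ps -o rss= -p", Sys.getpid()), intern = TRUE))
  message(format(Sys.time()), " - RSS: ", round(rss_kb / 1024), " MB")
}

log_rss()  # call once per iteration, e.g. right after rm()/gc()
```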
