RsparkleR provides an R interface for launching virtual machines and deploying Sparkler as painlessly as possible, with a few lines from your local R session.
Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and Felix.
See the full documentation on the Sparkler website.
Creating a Sparkler Cluster
Detailed instructions are here: https://data-seo.com/2017/12/17/advanced-r-programming-seo-crawler
Configure an OVH Cloud project with billing, then create your API token: https://api.ovh.com/createToken/index.cgi?GET=/*&POST=/*&PUT=/*&DELETE=/*
Create your SSH keys and note their paths: sshPubKeyPath, sshPrivKeyPath
Choose your regionVM (SBG3, BHS3, WAW1, UK1, DE1, GRA3):
- SBG3 Datacenter is in France
- BHS3 Datacenter is in Canada
- WAW1 Datacenter is in Poland
- UK1 Datacenter is in the UK
- DE1 Datacenter is in Germany
- GRA3 Datacenter is in France
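As a small sketch, the region codes listed above can be kept in a named vector so that a typo in regionVM is caught before any VM is requested. The helper name checkRegion is hypothetical and not part of RsparkleR; the codes and countries are the ones from the list above.

```r
# Region codes and countries, as listed above
ovhRegions <- c(
  SBG3 = "France",
  BHS3 = "Canada",
  WAW1 = "Poland",
  UK1  = "UK",
  DE1  = "Germany",
  GRA3 = "France"
)

# Hypothetical helper (not part of RsparkleR): stop early on an unknown code
checkRegion <- function(regionVM) {
  if (!regionVM %in% names(ovhRegions)) {
    stop("Unknown regionVM '", regionVM, "'. Expected one of: ",
         paste(names(ovhRegions), collapse = ", "))
  }
  invisible(regionVM)
}

checkRegion("UK1")  # passes silently
```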
Choose your typeVM (s1-2, s1-4, ...) and SSH key. For the range of cloud servers, see: https://www.ovh.co.uk/public-cloud/instances/prices/
ovh <- importOvh()
client <- loadClient(ovh, endpoint, application_key, application_secret, consumer_key)
vm <- createSparkler(client, regionVM='UK1', typeVM='s1-4', sshPubKeyPath, sshPrivKeyPath)
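The loadClient call above expects endpoint, application_key, application_secret and consumer_key to exist in your session. One way to avoid hard-coding them in a script, shown here only as a sketch, is to read them from environment variables. The variable names (OVH_ENDPOINT and so on) and the helper readOvhCredentials are assumptions for illustration, not something RsparkleR requires.

```r
# Hypothetical helper: read OVH credentials from environment variables.
# The environment variable names are an assumption; pick your own.
readOvhCredentials <- function() {
  vars <- c(
    endpoint           = "OVH_ENDPOINT",
    application_key    = "OVH_APPLICATION_KEY",
    application_secret = "OVH_APPLICATION_SECRET",
    consumer_key       = "OVH_CONSUMER_KEY"
  )
  values <- Sys.getenv(vars, unset = NA)
  names(values) <- names(vars)

  # Fail with a clear message if any variable is not set
  missing <- vars[is.na(values)]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}

# Possible usage with the calls shown above:
# creds  <- readOvhCredentials()
# client <- loadClient(ovh, creds$endpoint, creds$application_key,
#                      creds$application_secret, creds$consumer_key)
```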
Wait for the installation to finish. Once your instance is ready, you get a vm object with its IP address, and port 22 is open.
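If you want to confirm that port 22 is reachable before deploying, a base R check along these lines may help. This is a sketch: isPortOpen is a hypothetical helper, and the field name vm$ip is an assumption about how the vm object exposes its address.

```r
# Hypothetical helper: TRUE if a TCP connection to host:port can be opened
isPortOpen <- function(host, port = 22, timeout = 5) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL  # connection refused or timed out
  )
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

# Assumption: the vm object exposes its address as vm$ip
# isPortOpen(vm$ip, 22)
```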
Now you can deploy your Sparkler
Deploy the Docker container with Sparkler by running:
startSparkler(vm, prod=TRUE, debug=TRUE)
Be patient: the first run takes a while.
Launch a crawl:
crawlid <- startCrawl(vm, url="https://data-seo.com", topUrls=100, topGroups=5, maxIter=2, debug=TRUE)
Get the results from Solr:
crawlDF <- readSolr(vm, pattern, crawlid, topUrls=100, extracted=TRUE)
- Thamme Gowda and USC Data Science ( http://irds.usc.edu ) for creating Sparkler
- Mark Edmondson for the googleComputeEngineR package, which provides an R interface to the Google Compute Engine API for launching virtual machines.
- Scott Chamberlain for the analogsea package for launching DigitalOcean VMs, which inspired the SSH connector functions for this one.
- Winston Chang for the harbor package, where the Docker functions come from. If harbor is published to CRAN, it will become a dependency of this one.