Warcbase Workshop VM
The virtual machine uses 2GB of RAM, so your host machine needs at least that much memory to spare.
Building it also downloads a lot of data. If you are attending a workshop at a conference, we strongly recommend downloading everything beforehand.
Download each of the following dependencies.
To install this virtual machine, you have two options.
You can download it from this link and "import the appliance" using VirtualBox. Note that this is a 6.4GB download. If you do this, skip to "Spark Notebook" below.
Or you can use vagrant to build it yourself, or provision it using
You'll need to get your virtual machine running on the command line. For a basic walkthrough of how to use the command line, please consult this lesson at the Programming Historian.
From a working directory, please run the following commands.
git clone https://github.com/web-archive-group/warcbase_workshop_vagrant.git (this clones this repository)
cd warcbase_workshop_vagrant (this changes into the repository directory)
vagrant up (this builds the virtual machine; it will take a while and download a lot of data)
Once you run these three commands, you will have a running virtual machine with the latest version of warcbase installed.
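If vagrant up stops right away, a missing tool is one thing worth ruling out. The following is a minimal sketch and not part of this repository: check_tools is a hypothetical helper name, and the tool list in the example call (git, vagrant, and VirtualBox's VBoxManage) is an assumption based on the commands used in this README.

```shell
# check_tools: hypothetical helper, not shipped with this repository.
# Prints each named command that is not on PATH (to stderr) and
# returns non-zero if any are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (the tool list is an assumption; VBoxManage is VirtualBox's CLI):
#   check_tools git vagrant VBoxManage && echo "all tools found"
```

The function only inspects PATH, so it says nothing about whether the installed versions are recent enough to build the machine.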
You can also deploy this as an AWS machine. To do so, install vagrant-aws.
vagrant plugin install vagrant-aws
And then modify the Vagrantfile to point to your AWS information. The following block will need to be changed:
config.vm.provider :aws do |aws, override|
  aws.access_key_id = "KEYHERE"
  aws.secret_access_key = "SECRETKEYHERE"
  aws.region = "us-west-2"
  aws.region_config "us-west-2" do |region|
    region.ami = "ami-01f05461"
    # By default, spins up a lightweight m3.medium. For a more powerful instance, uncomment the line below.
    # region.instance_type = "c3.4xlarge"
    region.keypair_name = "KEYPAIRNAME"
  end
  override.ssh.username = "ubuntu"
  override.ssh.private_key_path = "PATHTOPRIVATEKEY"
end
You can then load it by typing:
vagrant up --provider aws
Note that you will need to change your AWS Security Group to allow incoming connections on port 22 (SSH) and port 9000 (Spark Notebook). By default, this launches a lightweight m3.medium instance. To do real work, you will need a larger (and sadly more expensive) instance.
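The security group change can be made in the EC2 console or with the AWS CLI. As a hedged sketch, the helper below only prints the two aws ec2 authorize-security-group-ingress commands so that you can review them before running anything. The function name print_ingress_commands, the group ID sg-0123456789abcdef0, and the address range 203.0.113.10/32 are placeholders, not values from this repository.

```shell
# print_ingress_commands: hypothetical helper that prints (does not run)
# the AWS CLI calls opening TCP ports 22 (SSH) and 9000 (Spark Notebook).
# $1 = security group ID, $2 = CIDR block allowed to connect.
print_ingress_commands() {
  group_id=$1
  cidr=$2
  for port in 22 9000; do
    printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --cidr %s\n' \
      "$group_id" "$port" "$cidr"
  done
}

# Placeholder values; substitute your own group ID and address range.
print_ingress_commands sg-0123456789abcdef0 203.0.113.10/32
```

Limiting the CIDR to your own address is narrower than opening the ports to 0.0.0.0/0, which would accept connections from anywhere.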
Now you need to connect to the machine. You will do this from your command line and, for Spark Notebook, from your browser.
We use three commands to connect to this virtual machine:
- ssh to connect to it via your command line,
- scp to copy a file (such as a WARC or ARC),
- rsync to sync a directory between two machines.
To get started, type vagrant ssh in the directory where you installed the VM.
Here are some other example commands:
- ssh -p 2222 ubuntu@localhost - will connect to the machine.
- scp -P 2222 somefile.txt ubuntu@localhost:/destination/path - will copy somefile.txt to your vagrant machine.
  - You'll need to specify the destination. For example, scp -P 2222 WARC.warc.gz ubuntu@localhost:/home/ubuntu will copy WARC.warc.gz to the home directory of the vagrant machine.
- rsync --rsh='ssh -p2222' -av somedir ubuntu@localhost:/home/ubuntu - will sync somedir to your home directory on the vagrant machine.
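The example commands assume that Vagrant forwarded SSH to port 2222. Vagrant can pick a different host port when 2222 is already taken, and vagrant ssh-config prints the settings actually in use. A small sketch for pulling the port out of that output follows; ssh_port_from_config is a hypothetical helper name, not something this repository provides.

```shell
# ssh_port_from_config: hypothetical helper. Reads OpenSSH-style config
# text on stdin (as printed by `vagrant ssh-config`) and prints the value
# of the first Port line.
ssh_port_from_config() {
  awk '$1 == "Port" { print $2; exit }'
}

# Intended use, run from the directory containing the Vagrantfile:
#   port=$(vagrant ssh-config | ssh_port_from_config)
#   ssh -p "$port" ubuntu@localhost
```

The same port value can be passed to scp with -P and to rsync through --rsh, in place of the 2222 used in the examples.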
- Ubuntu 14.04
- warcbase HEAD
- Apache Spark 1.5.1
- Spark Notebook
To run Spark Notebook, type the following:
vagrant ssh (if on vagrant; if you downloaded the ova file and are running with VirtualBox you do not need to do this)
./spark-notebook -Dhttp.port=9000 -J-Xms1024m
Then visit http://127.0.0.1:9000/ in your web browser.
If you are connecting via AWS, visit the IP address of your instance (found on the EC2 dashboard) on port 9000.
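To run Spark Shell:
vagrant ssh (if you did not run that in the previous step)
Then, from Spark's bin directory (~/project/spark-1.5.1-bin-hadoop2.6/bin, as the prompt in the sample session shows), run:
./spark-shell --jars /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar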
ubuntu@warcbase:~/project/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell --jars /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
Type in expressions to have them evaluated.
Type :help for more information.
WARN Utils - Your hostname, warcbase resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
WARN Utils - Set SPARK_LOCAL_IP if you need to bind to another address
WARN MetricsSystem - Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
WARN ObjectStore - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
WARN ObjectStore - Failed to get database default, returning NoSuchObjectException
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ObjectStore - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
WARN ObjectStore - Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

// Exiting paste mode, now interpreting.

ERROR ArcRecordUtils - Read 1235 bytes but expected 1311 bytes. Continuing...
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
r: Array[(String, Int)] = Array((communist-party.ca,39), (www.gca.ca,39), (greenparty.ca,39), (www.davidsuzuki.org,34), (westernblockparty.com,26), (www.nosharia.com,24), (partimarijuana.org,22), (www.ccsd.ca,22), (canadianactionparty.ca,22), (www.nawl.ca,19))
To quit Spark Shell, you can exit using Ctrl+C.
This build also includes the warcbase resources repository, which contains NER libraries as well as sample data from the University of Toronto (located in
The ARC and WARC files are drawn from the Canadian Political Parties & Political Interest Groups Archive-It Collection, collected by the University of Toronto. We are grateful that they've provided this material to us.
If you use their material, please cite it along the following lines:
- University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051004191340/http://canadianactionparty.ca/Default2.asp
You can find more information about this collection at WebArchives.ca.
This research has been supported by the Social Sciences and Humanities Research Council with Insight Grant 435-2015-0011. Additional funding for student labour on this project comes from an Ontario Ministry of Research and Innovation Early Researcher Award. The idea for the AWS deployment came from the DocNow team and their repository here.