Infrastructure as code for a data science processing machine
Python requirements can be installed with
pip install -r requirements.txt. Note that ansible requires a python 2.7 virtual environment at time of writing.
You will need to install the AWS command line tools using
brew install awscli, then configure an AWS command line profile with
aws configure --profile gds-data. For this you will need an AWS access key with the relevant permissions against your IAM account. When asked to set a default region, set
eu-west-2 (London), and default format
To install Terraform on OSX, run:
brew install terraform
You will also need to initialise the modules the first time, before running the databox script. Assuming you are still inside the project folder, please do:
this will install the required AWS module.
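Before moving on, it can help to confirm that the tools mentioned in this section are actually on your PATH. The following is a minimal sketch, not part of this repository; the function name `missing_tools` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every tool in the argument list
# that cannot be found on PATH. Prints nothing when all tools are present.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check the tools this README asks you to install.
# missing_tools pip aws terraform ansible-playbook
```

An empty result means every listed tool was found; any names printed still need installing.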
Create the AWS infrastructure
Using the databox.sh wrapper
Create the databox
The bash script
databox.sh wraps the terraform and ansible process, so that you can simply run the following to get started:
This will use the default settings which are:
| Flag | Variable | Default / description |
|---|---|---|
| -i | instance_type | t2.micro. A list of other available instance types can be found here |
| -u | username | A lookup will be performed using the bash command |
| -v | volume_size | Elastic Block Store volume (hard drive) size |
| -a | ami_id | ID of a specific image (e.g.: ami-dca37ea5). If left unset, will default to ubuntu. Note that some amis are only available in specific regions, which will need to be specified with |
| -p | playbook | playbooks/databox.yml. Path to ansible playbook used for custom deployment tasks. |
| -s | snapshot_id | The id of a snapshot to be loaded onto the EBS volume. If not provided, an empty volume will be provisioned. The snapshot must be in the same region as specified in |
NOTE: Ansible will require you to enter your local sudo password before continuing.
You can use the arguments in the table above to customise your databox, for example:
./databox.sh -r eu-west-1 -i c4.2xlarge up
It should not usually be necessary to specify a username using
-u unless you are running multiple databoxes (which is not recommended), in which case it is required.
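To make the relationship between flags and settings concrete, here is a hypothetical sketch of how a wrapper of this kind could map its flags onto variables with `getopts`. It is not the actual databox.sh; the function name `parse_databox_args` is invented for illustration, only the two defaults stated in the table above are filled in, and the `-r` flag is stored as `aws_region` to match the terraform variable used later in this README.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only: this is NOT the real databox.sh.
parse_databox_args() {
  # Defaults taken from the table above; everything else starts empty.
  instance_type="t2.micro"
  playbook="playbooks/databox.yml"
  aws_region="" username="" volume_size="" ami_id="" snapshot_id=""

  local opt OPTIND=1
  while getopts "r:i:u:v:a:p:s:" opt; do
    case "$opt" in
      r) aws_region="$OPTARG" ;;
      i) instance_type="$OPTARG" ;;
      u) username="$OPTARG" ;;
      v) volume_size="$OPTARG" ;;
      a) ami_id="$OPTARG" ;;
      p) playbook="$OPTARG" ;;
      s) snapshot_id="$OPTARG" ;;
      *) return 1 ;;   # unknown flag
    esac
  done
  shift $((OPTIND - 1))

  # The remaining word is the action, e.g. "up" or "down".
  action="${1:-}"
}

# Example:
# parse_databox_args -r eu-west-1 -i c4.2xlarge up
# echo "$aws_region $instance_type $action"
```

Flags that are not given keep their defaults, which is why the example command earlier only needs `-r` and `-i`.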
Choosing a custom ami
If you wish to create an instance with some software already configured, you can use a custom ami, for example a deep learning ami.
This ami is limited to the eu-west-1 region, so to launch the instance on a p2 (a GPU-optimised instance; note that it is not compatible with the new p3 instance) use the following command:
./databox.sh -a ami-1812bb61 -r eu-west-1 -i p2.xlarge up
Using custom ansible playbooks
If the -p flag is left unset, this defaults to
playbooks/databox.yml, which simply mounts the data volume and installs docker. Custom playbooks can be used for other tasks, for instance preparing environments on a Deep Learning AMI (see the govuk-taxonomy-supervised-learning project). The appropriate command for this example would be:
./databox.sh -a ami-1812bb61 -r eu-west-1 -i p2.xlarge -s snap-04eb15f2e4faee97a -p playbooks/govuk-taxonomy-supervised-learning.yml up
The playbooks currently available in this repository are:
| Playbook | Description |
|---|---|
| playbooks/databox.yml | Default playbook. Mounts the data volume and installs docker. |
| playbooks/govuk-taxonomy-supervised-learning.yml | Mounts the data volume, clones the govuk-taxonomy-supervised-learning repo, installs necessary packages into the appropriate conda environment, and sets environment variables. |
Connecting to your databox
At the end of the process an IP address will be output like this:
Apply complete! Resources: 6 added, 0 changed, 0 destroyed.
Outputs:
ec2_ip = 188.8.131.52
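If you want to reuse the address in later commands rather than copying it by hand, it can be pulled out of the saved output. This is a minimal sketch assuming the output has the `ec2_ip = <address>` form shown above; the function name `extract_ec2_ip` and the file name `apply.log` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read terraform output text on stdin and print the
# value of the last "ec2_ip = ..." entry found.
extract_ec2_ip() {
  sed -n 's/^.*ec2_ip = \([0-9.]*\).*$/\1/p' | tail -n 1
}

# Example, assuming the output was saved to a file called apply.log:
# DATABOX_IP="$(extract_ec2_ip < apply.log)"
```

Running `terraform output ec2_ip` in the project folder is another way to print the same value after the resources have been created.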
To log into this machine take this address and run:
You can test that Docker is up and running with:
ubuntu@ip-172-31-9-43:~$ docker version
Client:
 Version: 17.06.1-ce
 API version: 1.30
 Go version: go1.8.3
 Git commit: 874a737
 Built: Thu Aug 17 22:51:12 2017
 OS/Arch: linux/amd64
Server:
 Version: 17.06.1-ce
 API version: 1.30 (minimum version 1.12)
 Go version: go1.8.3
 Git commit: 874a737
 Built: Thu Aug 17 22:50:04 2017
 OS/Arch: linux/amd64
 Experimental: false
Mounting EBS volumes (hard drive storage)
New Elastic Block Store (EBS) volumes will be mounted at
/data within the instance, so all outputs should be saved here, rather than to the root file system of the instance (otherwise you will quickly run out of space, and it will be difficult to persist).
Manual instructions for mounting an EBS volume are defined in the amazon web services documentation. This is only likely to be necessary if you are restoring a volume from a previous snapshot. The instructions are replicated in brief here.
List available disk devices (having set up a databox with the -v argument):
ubuntu:~$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdh    202:112  0  80G  0 disk
We want to connect the xvdh disk. First we need to check whether it has a file system:
ubuntu:~$ sudo file -s /dev/xvdh
/dev/xvdh: data
If the command returns only
/dev/xvdh: data it means that there is no filesystem on the device, and this needs to be created.
ubuntu:~$ sudo mkfs -t ext4 /dev/xvdh
mke2fs 1.42.13 (17-May-2015)
Creating filesystem with 20971520 4k blocks and 5242880 inodes
Filesystem UUID: ebc4eb4a-b481-4aa4-b49c-32f5a12e160b
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
If the device returns something else, then there is already a filesystem, and you are good to go. In either case, you want to get to a situation where the command
sudo file -s /dev/xvdh gives a response:
ubuntu:~$ sudo file -s /dev/xvdh
/dev/xvdh: Linux rev 1.0 ext4 filesystem data, UUID=ebc4eb4a-b481-4aa4-b49c-32f5a12e160b (extents) (large files) (huge files)
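The "does it already have a filesystem?" decision described above can be scripted so that mkfs is never run on a volume that holds data. This is a sketch under the assumption that a blank device is reported exactly as `<device>: data`; the function name `is_blank_device` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) only when the text printed by
# `sudo file -s <device>` describes a device with no filesystem on it.
is_blank_device() {
  case "$1" in
    *": data") return 0 ;;   # e.g. "/dev/xvdh: data"
    *)         return 1 ;;   # anything else: a filesystem is already present
  esac
}

# Example:
# if is_blank_device "$(sudo file -s /dev/xvdh)"; then
#   sudo mkfs -t ext4 /dev/xvdh
# fi
```

Guarding mkfs this way matters most when restoring from a snapshot, because creating a new filesystem would wipe the restored data.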
Finally the device needs to be mounted to an existing directory e.g.
ubuntu:~$ sudo mkdir /data
ubuntu:~$ sudo mount /dev/xvdh /data
You will need to re-mount the device every time the instance reboots unless you add an entry to your /etc/fstab file. More in-depth instructions for doing this are provided in the [AWS documentation](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html).
Following the example above, first create a copy of your fstab in case you need to restore it:
sudo cp /etc/fstab /etc/fstab.orig
Then the following line would need to be added to /etc/fstab (based on the example above), where the UUID matches the UUID of the device (obtainable from
sudo file -s /dev/xvdh).
UUID=ebc4eb4a-b481-4aa4-b49c-32f5a12e160b /data ext4 defaults,nofail 0 2
Following this, run
sudo mount -a to ensure that the device is mountable. If not, restore your original fstab and start again. Unmountable drives in the fstab may cause the instance to fail to boot.
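Because a mistyped UUID is an easy way to end up with an unmountable entry, the fstab line can be generated from the `file -s` output instead of being typed by hand. The following is a minimal sketch using the mount options from the example above; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the text printed by `sudo file -s <device>` on
# stdin and print only the UUID it contains.
uuid_from_file_output() {
  sed -n 's/.*UUID=\([0-9a-fA-F-]*\).*/\1/p'
}

# Hypothetical helper: print an /etc/fstab line for the given UUID.
# The mount point defaults to /data, as used throughout this README.
fstab_entry() {
  local uuid="$1" mount_point="${2:-/data}"
  printf 'UUID=%s %s ext4 defaults,nofail 0 2\n' "$uuid" "$mount_point"
}

# Example (prints the line; review it before adding it to /etc/fstab):
# fstab_entry "$(sudo file -s /dev/xvdh | uuid_from_file_output)"
```

The sketch only prints the line, so the backup and `sudo mount -a` check described above still apply once it has been added.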
Destroying the databox
The resources can later be destroyed with:
Note that if you created the databox in a specific region using -r, you must also pass the same region (-r) to the ./databox.sh down command:
./databox.sh -r eu-west-1 down
NOTE: Failing to pass the correct region argument to the
./databox.sh down command will result in your resources not being found, and consequently, not destroyed.
Using terraform and ansible directly
If you need additional customisations, the following commands can be run without the
Setting up a databox
To create resources with the default settings:
At the end it will output an IP address like this:
Apply complete! Resources: 6 added, 0 changed, 0 destroyed.
Outputs:
ec2_ip = 184.108.40.206
Variable arguments can be passed to terraform with
--var, for example:
terraform apply --var username=user --var aws_region=eu-west-1 --var instance_type=c4.2xlarge
Install Docker and other tools on the databox
ansible-playbook -i '220.127.116.11,' -K playbooks/databox.yml -u ubuntu
Note: the correct IP address that has been shown in the output must be used. The IP address must be followed by a comma!
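The trailing comma matters because Ansible treats an -i value containing a comma as a list of hosts rather than as the path to an inventory file. A small helper can add the comma so it is never forgotten. This is a sketch; the function name `inline_inventory` and the variable `DATABOX_IP` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a single IP address into an inline Ansible
# inventory string ("<ip>,"). An existing trailing comma is not doubled.
inline_inventory() {
  printf '%s,' "${1%,}"
}

# Example, with DATABOX_IP set to the address from the terraform output:
# ansible-playbook -i "$(inline_inventory "$DATABOX_IP")" -K playbooks/databox.yml -u ubuntu
```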
Connect to the databox
As with the
./databox.sh wrapper, you will need to connect to the databox with:
Destroy the created databox on AWS
As before, if you specified a region in
terraform apply --var aws_region=... you must specify the same region in
terraform destroy --var aws_region=... otherwise the resources you created will not be found.
Copying data to and from a databox with scp
To transfer data to and from your local machine you can use scp. SCP uses the same authentication mechanism as SSH, so if you have followed the above steps, it should be very easy!
Uploading data to the databox
From the local machine (replacing 0.0.0.0 with the actual IP of your databox):
# Create a folder in which to store input data
ssh ubuntu@0.0.0.0 'mkdir -p /home/ubuntu/govuk-lda-tagger-image/input'
# Secure copy input_data.csv from local to the newly created input folder
scp input_data.csv ubuntu@0.0.0.0:/home/ubuntu/govuk-lda-tagger-image/input/input_data.csv
Downloading data to your local machine
From the local machine (again replacing 0.0.0.0 with the actual IP of the remote machine):
# Specifying `-r` allows a recursive copy of the whole folder
scp -r ubuntu@0.0.0.0:/home/ubuntu/govuk-lda-tagger-image/output ./
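To avoid editing the IP address into every command, the remote part of the scp argument can be built from a variable. This is a minimal sketch; the function name `remote_path` and the variable `DATABOX_IP` are hypothetical, and the default user `ubuntu` is an assumption based on the paths and prompts shown in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the user@host:path argument that scp expects.
# Arguments: IP address, remote path, optional user (assumed default: ubuntu).
remote_path() {
  local ip="$1" path="$2" user="${3:-ubuntu}"
  printf '%s@%s:%s' "$user" "$ip" "$path"
}

# Examples, with DATABOX_IP set to the address of your databox:
# scp input_data.csv "$(remote_path "$DATABOX_IP" /home/ubuntu/govuk-lda-tagger-image/input/input_data.csv)"
# scp -r "$(remote_path "$DATABOX_IP" /home/ubuntu/govuk-lda-tagger-image/output)" ./
```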
Keep process running while disconnecting from SSH
It is possible to keep a process running in the background, disconnect from SSH or from the VPN, and resume at any time.
This is very useful when we want to run a very long process without keeping our laptop on or connected all the time.
Our databox comes with a utility called screen.
To use it, we just need to type
screen after we connect with SSH. A presentation screen will appear, and we just need to press SPACE.
At this point the terminal looks like the initial one, but we are inside a screen session.
We can now run any command that needs to be kept running, for example:
tail -f /var/log/syslog
Then we detach from this session by pressing CTRL+A followed by D, and we should see something like this:
ubuntu@ip-172-31-6-53:~$ screen
[detached from 9114.pts-0.ip-172-31-6-53]
At this point we can exit the terminal by typing:
Next time we log back with SSH, we just need to type:
and we will be back in our session. If we want to terminate the process, instead of detaching with CTRL+A followed by D we stop it with CTRL+C as usual and then exit the screen session.
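When several long jobs are running, it can be easier to start each one in its own named session without attaching to it first. The following is a sketch using screen's standard `-dmS` options (start detached, with a session name); the function name `start_in_screen` and the session name `syslog-tail` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a command in a detached screen session.
# Arguments: session name, then the command and its arguments.
start_in_screen() {
  local name="$1"
  shift
  screen -dmS "$name" "$@"
}

# Example:
# start_in_screen syslog-tail tail -f /var/log/syslog
# screen -ls              # list running sessions
# screen -r syslog-tail   # reattach to the named session
```

Naming the session makes it unambiguous which one to reattach to after logging back in.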