thetanz/sonar was an ingestion framework for Project Sonar, an initiative led by Rapid7 that provided normalised datasets of global network scan data across the public internet.
Intended for monthly execution, these scripts would concurrently download and process all available Rapid7 datasets into Google BigQuery with the help of multiple Google Compute Engine instances.
Rapid7 has deprecated the public revision of this service; see the post below, released February 10, 2022:
Evolving How We Share Rapid7 Research Data
It would appear that the case noted below, coupled with GDPR and CCPA regulations, has put into question what Rapid7 can publicly share.
Case C-582/14 - the court ruled that dynamic IP addresses may constitute "personal data" even where only a third party (in this case an internet service provider) has the additional data necessary to identify the individual.
If you intend to use Rapid7's service for publicly disclosed research, it appears you can still gain access by reaching out to them.
This ingestion framework is being released in an archived state. No support or updates will follow.
This set of scripts helps ingest monthly GZIP archives of scan datasets. Data is downloaded and subsequently loaded into Google Cloud's BigQuery service.
Download URLs are fetched from Backblaze, after which the archives are downloaded, processed and loaded.
These scripts expect the presence of a GCP service account JSON file within the current directory, named `gcp-svc-sonar.json`.
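How that credential is consumed is left to the scripts; a minimal sketch of activating it inside the container, assuming the standard `gcloud` tooling is available, might be:

```bash
# Sketch: one way the credential could be consumed inside the container.
# The project ID is a placeholder - substitute your own.
gcloud auth activate-service-account --key-file=gcp-svc-sonar.json
gcloud config set project YOUR_GCP_PROJECT
```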
By default, the container image iterates through each dataset listed within the variable file. This can take a fair amount of time to download, process and load (upwards of 15 hours), with many disk-heavy operations along the way.
The Dockerfile will initially run the orchestrator script `orchestrator.sh`, which reads the sourcetypes in the `sonardatasets` array within the variable file `datasets.sh`.
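For illustration only - the real file may list different datasets - the `sonardatasets` array could look something like this, assuming the `sourcetype:file` entry format used by the single-dataset example later in this README:

```bash
# datasets.sh - illustrative sketch only; the real file may differ, and the
# 'sourcetype:file' entry format is an assumption based on the single-dataset
# example shown later in this README.
sonardatasets=(
  'rdns_v2:rdns.json.gz'
  'fdns_v2:fdns_txt_mx_dmarc.json.gz'
)
```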
For every sourcetype identified, the latest available download URL for the given dataset is discovered and passed to `loader.sh`.
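The discovery logic itself is not reproduced here; a hedged sketch of how the latest URL for a sourcetype could be located, mirroring the curl/grep pattern used for the schema fetch later in this README, might be:

```bash
# Sketch only - mirrors the curl/grep/cut pattern used for the schema fetch
# further below. The entry format and variable names are assumptions.
entry='fdns_v2:fdns_txt_mx_dmarc.json.gz'
sourcetype=${entry%%:*}      # e.g. fdns_v2
filesuffix=${entry#*:}       # e.g. fdns_txt_mx_dmarc.json.gz
baseuri='https://opendata.rapid7.com'
latest=$(curl -s "${baseuri}/sonar.${sourcetype}/" \
  | grep "${filesuffix}" | cut -d '"' -f2 | sort | tail -n 1)
./loader.sh "${baseuri}${latest}"
```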
The loader creates the relevant BigQuery table and downloads the Sonar tarball. Downloads are treated differently depending on BigQuery's quotas.
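Table creation itself can be as simple as a `bq mk` call against a fixed schema file (a sketch; the `sonar` dataset, `rdns_v2` table and `bq_schema.json` file names are placeholders, with the schema file covered later in this README):

```bash
# Sketch: create the destination dataset and table from a fixed schema file.
# Dataset, table and schema file names are placeholders.
bq mk --dataset sonar
bq mk --table sonar.rdns_v2 ./bq_schema.json
```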
The files are simply too large to process in memory, so all datasets are currently written to disk and uploaded either directly or as chunked tarballs, depending on the ultimate size of the archive.
When an archive is over 4GB but still below 15TB, the file is chunked into three-million-line sections and a tarball is created for each.
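One way to produce such chunks with GNU coreutils - a sketch, not necessarily the exact invocation used by `loader.sh` - is:

```bash
# Sketch: split a decompressed newline-delimited JSON file into
# three-million-line pieces and gzip each piece as it is written out.
# 'dataset.json' and the 'chunk-' prefix are placeholder names.
split -l 3000000 -d --filter='gzip > "$FILE.gz"' dataset.json chunk-
```

The `--filter` option compresses each chunk as it is written rather than leaving plain-text pieces on disk.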
Tarballs are uploaded directly as compressive uploads with the appropriate `Content-Encoding` metadata.
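In practice that usually means setting the `Content-Encoding` header at upload time, for example (bucket and path are placeholders):

```bash
# Sketch: upload gzip chunks to Cloud Storage with Content-Encoding metadata
# so they can later be read with decompressive transcoding.
gsutil -h "Content-Encoding:gzip" -h "Content-Type:application/json" \
  cp chunk-*.gz gs://YOUR_SONAR_BUCKET/rdns_v2/
```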
Once the tarball is available as newline-delimited JSON within Google Cloud Storage, a BigQuery batch load operation brings it in, with ingestion leveraging inline decompressive transcoding.
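The batch load itself is a standard `bq load` of newline-delimited JSON from Cloud Storage against the fixed schema; a sketch with placeholder names:

```bash
# Sketch: batch-load newline-delimited JSON from Cloud Storage into BigQuery
# using a fixed schema file. Table, bucket, path and file names are placeholders.
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  sonar.rdns_v2 \
  'gs://YOUR_SONAR_BUCKET/rdns_v2/chunk-*.gz' \
  ./bq_schema.json
```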
BigQuery can 'auto detect' a JSON schema, but it can be temperamental. Establishing a fixed schema is the best way to ensure reliable ingestion and removes any auto-detection guesswork.
Whilst Rapid7 provide a standard schema, BigQuery requires a custom file to specify this. Reference the Google docs on BigQuery schemas for more info.
The schema provided by Rapid7 can be fetched with the below:
```bash
# discover and download the Rapid7-provided schema for the rDNS dataset
baseuri='https://opendata.rapid7.com'
schemafile=$(curl -s "${baseuri}/sonar.rdns_v2/" | grep "schema.json" | cut -d '"' -f2)
wget --no-verbose --show-progress --progress=dot:mega "${baseuri}${schemafile}" -O json_schema.json
```
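That schema then needs to be expressed in BigQuery's own schema-file format, a JSON array of field definitions. Purely as an illustration - the field names below assume the rDNS dataset and may not match the published schema exactly - such a file could be written as:

```bash
# Illustrative only - field names assume the rDNS dataset and are not
# guaranteed to match the schema Rapid7 publishes. 'bq_schema.json' is a
# placeholder file name.
cat > bq_schema.json <<'EOF'
[
  {"name": "timestamp", "type": "STRING", "mode": "NULLABLE"},
  {"name": "name",      "type": "STRING", "mode": "NULLABLE"},
  {"name": "type",      "type": "STRING", "mode": "NULLABLE"},
  {"name": "value",     "type": "STRING", "mode": "NULLABLE"}
]
EOF
```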
time: 4-6 hours
Create a set of GCP container VMs to process each dataset concurrently with `batch.sh`.
When using `batch.sh`, ensure you update `YOUR_GCP_PROJECT` accordingly to allow Google Container Registry to function correctly within the context of your project.
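Conceptually, `batch.sh` fans out one container-optimised VM per dataset, each running the image with a single dataset argument. A hedged sketch - zone, machine type, disk size and image path are illustrative, not the exact invocation in the repository - might look like:

```bash
# Sketch of a per-dataset fan-out; flags and names are illustrative.
source ./datasets.sh
i=0
for dataset in "${sonardatasets[@]}"; do
  i=$((i + 1))
  gcloud compute instances create-with-container "sonar-worker-${i}" \
    --zone=us-central1-a \
    --machine-type=n1-standard-2 \
    --boot-disk-size=200GB \
    --container-image="gcr.io/YOUR_GCP_PROJECT/sonar" \
    --container-arg="$dataset"
done
```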
time: 30-40 hours
free disk space: ~200GB
```bash
docker build . -t sonar
docker run sonar
```
You can specify a single dataset from the variable file as an input argument to process an individual dataset, e.g.
```bash
docker run sonar fdns_v2:fdns_txt_mx_dmarc.json.gz
```
- Piping a gzip archive into BigQuery directly takes longer than uploading it with transcoding to Cloud Storage and running a subsequent load job.
- We implicitly decompress and chunk any archive over 4GB; we can upload decompressed archives up to 15TB in size, however we save on data transfer costs when we only upload tarballs/archives.
- The decompressed archives (80GB+ line-delimited JSON files) are massive; chunking them at any good speed is rather difficult, so we avoid doing so wherever possible.
- GNU `split` doesn't seem to allow 'in place' chunking, so we end up downloading an archive, unpacking it and then chunking it, which doubles the disk space required. Some of these large datasets can grow above 80GB after decompression.
- Maintaining direct pipes from stdin, e.g. `wget example.com/file.gz | gunzip | upload`, would be ideal, but the sheer file size of some operations makes this difficult given compute and memory availability; see the sketch after this list.
- Due to the inherent size of these datasets, work with tools such as `sed`/`awk` takes extended amounts of time; no transposing of or additions to the datasets is performed.
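For reference, the kind of direct streaming pipe alluded to above - deliberately avoided here - might look like the following sketch (URL, bucket and object names are placeholders; this is not something these scripts do):

```bash
# Sketch of the avoided streaming approach: pull the archive, decompress on
# the fly and re-upload to Cloud Storage from stdin. Placeholders throughout.
wget -qO- 'https://example.com/dataset.json.gz' \
  | gunzip -c \
  | gsutil cp - gs://YOUR_SONAR_BUCKET/dataset.json
```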