Samuel Harrold edited this page May 31, 2014 · 1 revision

Data

load_bz2.py downloads, extracts, partitions, and uploads a set of bz2 files from Wikimedia to the active Disco Distributed File System (DDFS). The input CSV files map each bz2 URL to the DDFS tag of its data set. The column labels of the CSV files are hard-coded in the script.
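The lookup and partition steps can be sketched roughly as follows. This is a minimal illustration, not the code in load_bz2.py: the column labels "url" and "tag", the function names, and the choice to partition by line count are all assumptions made for the example, and the DDFS upload step is left out.

```python
import bz2
import csv
import io
import os


def read_url_tags(csv_text, url_col="url", tag_col="tag"):
    """Return {bz2 URL: DDFS tag} from CSV text that has a header row.

    The labels "url" and "tag" are hypothetical; the real labels are
    hard-coded in load_bz2.py.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[url_col]: row[tag_col] for row in reader}


def extract_and_partition(bz2_path, out_dir, lines_per_part=100000):
    """Decompress bz2_path and split it into line-based partition files.

    Partitioning by line count is an assumption for this sketch.
    Returns the partition paths in order.
    """
    base = os.path.basename(bz2_path)
    if base.endswith(".bz2"):
        base = base[:-4]
    parts = []
    out = None
    with bz2.open(bz2_path, "rt", encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new partition file every lines_per_part lines.
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                part = os.path.join(out_dir, "%s.part%04d" % (base, len(parts)))
                parts.append(part)
                out = open(part, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

Each returned partition path would then be uploaded to DDFS under the tag looked up for its source URL.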

$ cd benchmark_disco/data
$ python load_bz2.py -h
usage: load_bz2.py [-h] [--fcsvs [FCSVS [FCSVS ...]]] [--data_dir DATA_DIR]
                   [--verbose] [--delete]

Download bz2 files then upload to Disco and tag.

optional arguments:
  -h, --help            show this help message and exit
  --fcsvs [FCSVS [FCSVS ...]]
                        Input .csv files with URLs of bz2 files for download and DDFS tags for upload. Default: [all .csv in CWD]
  --data_dir DATA_DIR   Path to save bz2 files for extraction and loading. Default: /tmp
  --verbose, -v         Print 'INFO:' messages to stdout.
  --delete, -d          Delete files after uploading to Disco.
$ python load_bz2.py --data_dir /scratch/wikimedia_dumps -v
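
For reference, the command-line interface shown in the help text above could be declared with argparse along these lines. This is a sketch reconstructed from the help output only; the function names build_parser and resolve_csvs are hypothetical, and the way the script actually computes the --fcsvs default is an assumption.

```python
import argparse
import glob


def build_parser():
    """Build a parser matching the options listed in the help output."""
    parser = argparse.ArgumentParser(
        prog="load_bz2.py",
        description="Download bz2 files then upload to Disco and tag.")
    parser.add_argument(
        "--fcsvs", nargs="*", default=None,
        help="Input .csv files with URLs of bz2 files for download "
             "and DDFS tags for upload. Default: [all .csv in CWD]")
    parser.add_argument(
        "--data_dir", default="/tmp",
        help="Path to save bz2 files for extraction and loading. "
             "Default: /tmp")
    parser.add_argument(
        "--verbose", "-v", action="store_true",
        help="Print 'INFO:' messages to stdout.")
    parser.add_argument(
        "--delete", "-d", action="store_true",
        help="Delete files after uploading to Disco.")
    return parser


def resolve_csvs(args):
    """Fall back to every .csv in the current directory (assumed behavior)."""
    return args.fcsvs or sorted(glob.glob("*.csv"))


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_csvs(args), args.data_dir, args.verbose, args.delete)
```

With this parser, the invocation above sets data_dir to /scratch/wikimedia_dumps, turns on verbose output, and leaves the downloaded files in place because --delete is not given.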
