How to load data
Samuel Harrold edited this page May 31, 2014 · 1 revision
The script load_bz2.py, located in benchmark_disco/data, downloads, extracts, partitions, and uploads a set of bz2 files from Wikimedia to the active Disco Distributed File System (DDFS). The input CSV files map each bz2 URL to the DDFS tag of its data set. The CSV column labels are hard-coded in the script.
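The local half of that pipeline (reading the URL-to-tag mapping, then extracting and partitioning a bz2 file) can be sketched roughly as below. This is not the code from load_bz2.py: the column labels "url" and "tag", the function names, and the fixed lines-per-file partitioning scheme are all assumptions made for illustration, and the sketch assumes Python 3. The download and DDFS upload steps are left out.

```python
import bz2
import csv
import os


def read_url_tags(fcsv, url_col="url", tag_col="tag"):
    """Return a list of (url, tag) pairs from one CSV file.

    The column labels "url" and "tag" are hypothetical placeholders;
    the real labels are hard-coded in load_bz2.py.
    """
    with open(fcsv, newline="", encoding="utf-8") as fobj:
        return [(row[url_col], row[tag_col]) for row in csv.DictReader(fobj)]


def extract_and_partition(fbz2, out_dir, lines_per_part=100000):
    """Decompress fbz2 and split its text into files of at most
    lines_per_part lines each. Returns the list of partition paths.

    Splitting by line count is an assumed scheme, not necessarily
    the one the script uses.
    """
    base = os.path.basename(fbz2)
    if base.endswith(".bz2"):
        base = base[:-len(".bz2")]
    parts = []
    fout = None
    with bz2.open(fbz2, "rt", encoding="utf-8") as fin:
        for idx, line in enumerate(fin):
            # Start a new partition file every lines_per_part lines.
            if idx % lines_per_part == 0:
                if fout is not None:
                    fout.close()
                name = "{0}.part{1:04d}".format(base, idx // lines_per_part)
                path = os.path.join(out_dir, name)
                parts.append(path)
                fout = open(path, "w", encoding="utf-8")
            fout.write(line)
    if fout is not None:
        fout.close()
    return parts
```

Each partition file would then be uploaded to DDFS under the tag read from the CSV row for its source URL.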
$ cd benchmark_disco/data
$ python load_bz2.py -h
usage: load_bz2.py [-h] [--fcsvs [FCSVS [FCSVS ...]]] [--data_dir DATA_DIR]
                   [--verbose] [--delete]

Download bz2 files then upload to Disco and tag.

optional arguments:
  -h, --help            show this help message and exit
  --fcsvs [FCSVS [FCSVS ...]]
                        Input .csv files with URLs of bz2 files for download
                        and DDFS tags for upload. Default: [all .csv in CWD]
  --data_dir DATA_DIR   Path to save bz2 files for extraction and loading.
                        Default: /tmp
  --verbose, -v         Print 'INFO:' messages to stdout.
  --delete, -d          Delete files after uploading to Disco.
$ python load_bz2.py --data_dir /scratch/wikimedia_dumps -v
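For readers who want to see how a command-line interface like the one in the help text can be declared, here is a minimal argparse sketch. The flag names, defaults, and help strings are taken from the usage message above; the function names build_parser and parse_args, and the way the "all .csv in CWD" default is resolved with glob, are assumptions and may differ from the actual script.

```python
import argparse
import glob


def build_parser():
    """Declare the options shown in the load_bz2.py help message."""
    parser = argparse.ArgumentParser(
        prog="load_bz2.py",
        description="Download bz2 files then upload to Disco and tag.")
    parser.add_argument(
        "--fcsvs", nargs="*", default=None,
        help=("Input .csv files with URLs of bz2 files for download "
              "and DDFS tags for upload. Default: [all .csv in CWD]"))
    parser.add_argument(
        "--data_dir", default="/tmp",
        help=("Path to save bz2 files for extraction and loading. "
              "Default: /tmp"))
    parser.add_argument(
        "--verbose", "-v", action="store_true",
        help="Print 'INFO:' messages to stdout.")
    parser.add_argument(
        "--delete", "-d", action="store_true",
        help="Delete files after uploading to Disco.")
    return parser


def parse_args(argv=None):
    """Parse argv; fall back to every .csv file in the current directory.

    Resolving the default with glob is an assumption about how the
    script implements "all .csv in CWD".
    """
    args = build_parser().parse_args(argv)
    if args.fcsvs is None:
        args.fcsvs = sorted(glob.glob("*.csv"))
    return args


if __name__ == "__main__":
    print(parse_args())
```

With this sketch, the invocation above would set data_dir to /scratch/wikimedia_dumps and verbose to True, leave delete as False, and pick up whatever .csv files are in the current directory.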