Python script that exports URLs from ClueWeb archives.
Clone the repo
git clone https://github.com/unix121/clueweb-url-export
cd clueweb-url-export/src
Run the script
python clueweb_export.py /input/directory/ /output/directory/
This script extracts URLs from ClueWeb archives (WARC format) and exports them as JSON and plain text. It is easy to modify to suit your needs.
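For readers who want to adapt the idea, here is a minimal sketch of this kind of extraction. It is not the code in clueweb_export.py. It assumes the archives are gzip-compressed WARC files whose record headers carry a `WARC-Target-URI:` line, and the names `extract_urls`, `export_urls`, and `export_archive` are hypothetical. It scans line by line rather than parsing records, so a payload line that happens to start with the same header text would also be picked up.

```python
import gzip
import json
from pathlib import Path

# Header that carries the record's URL (assumed to be present in each record).
URI_HEADER = b"WARC-Target-URI:"


def extract_urls(stream):
    """Yield the value of every WARC-Target-URI line in a binary stream."""
    for line in stream:
        if line.startswith(URI_HEADER):
            value = line[len(URI_HEADER):].strip()
            yield value.decode("utf-8", errors="replace")


def export_urls(urls, json_path, text_path):
    """Write the URLs as a JSON array and as one URL per line; return the count."""
    urls = list(urls)
    Path(json_path).write_text(json.dumps(urls, indent=2), encoding="utf-8")
    Path(text_path).write_text("".join(u + "\n" for u in urls), encoding="utf-8")
    return len(urls)


def export_archive(warc_gz_path, output_dir):
    """Export the URLs of one .warc.gz file into output_dir (hypothetical layout)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(warc_gz_path).name.split(".")[0]
    # gzip.open in binary mode yields the decompressed file line by line.
    with gzip.open(warc_gz_path, "rb") as stream:
        return export_urls(
            extract_urls(stream),
            out / f"{stem}.json",
            out / f"{stem}.txt",
        )
```

Calling `export_archive("/input/directory/00.warc.gz", "/output/directory/")` would, under these assumptions, write `00.json` and `00.txt` to the output directory. Streaming the file keeps memory use low, although the URL list itself is held in memory before it is written.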
It has been tested with ClueWeb09.
Exporting a whole ClueWeb subdirectory takes a few hours.
Stavros Grigoriou (unix121)