This repository has been archived by the owner on May 2, 2021. It is now read-only.

🌐 Python script that exports URLs from ClueWeb archives.


stav121/clueweb-url-export


clueweb-url-export

Python script that exports URLs from ClueWeb archives.

Usage

Clone the repo

git clone https://github.com/unix121/clueweb-url-export
cd clueweb-url-export/src

Run the script

python clueweb_export.py /input/directory/ /output/directory/

About

This script extracts URLs from ClueWeb archives (WARC format) and exports them in JSON and plain-text formats. It is easy to adapt to your needs.

It has been tested with ClueWeb09.

Exporting a whole ClueWeb subdirectory takes a few hours.
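
As a rough illustration of the extract-and-export step described in this section, here is a minimal sketch. It is not the code in clueweb_export.py. The function names `extract_urls` and `export_urls` and the output file naming are hypothetical. It assumes each archive is a gzip-compressed WARC file whose records carry a `WARC-Target-URI` header, and it simply scans for lines that start with that header, so a page body that happens to contain such a line would also be picked up.

```python
import gzip
import json
import os

# Header that holds the URL of each WARC record.
URI_PREFIX = b"WARC-Target-URI:"


def extract_urls(warc_path):
    """Yield the WARC-Target-URI value of every record in a gzipped WARC file.

    Simplification: this scans line by line instead of parsing records,
    so it does not distinguish headers from page content.
    """
    with gzip.open(warc_path, "rb") as fh:
        for line in fh:
            if line.startswith(URI_PREFIX):
                value = line[len(URI_PREFIX):].strip()
                yield value.decode("utf-8", errors="replace")


def export_urls(urls, out_dir, stem):
    """Write the URLs to <stem>.json (a JSON list) and <stem>.txt (one per line).

    The file naming is an assumption made for this sketch.
    """
    urls = list(urls)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, stem + ".json"), "w", encoding="utf-8") as fh:
        json.dump(urls, fh, indent=2)
    with open(os.path.join(out_dir, stem + ".txt"), "w", encoding="utf-8") as fh:
        fh.writelines(url + "\n" for url in urls)
    return urls
```

A call such as `export_urls(extract_urls("/input/directory/00.warc.gz"), "/output/directory/", "00")` would process a single archive (the file name `00.warc.gz` is only an example). Reading the archive as a stream keeps memory use low, which matters for inputs that take hours to process.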

Author

Stavros Grigoriou (unix121)
