tfmorris/dkpro-c4corpus

DKPro C4CorpusTools

NOTE: work in progress until the 1.0.0 release

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

  • DKPro C4CorpusTools (or C4CorpusTools) refers to the project source code
  • C4Corpus refers to the preprocessed CommonCrawl data set (C4 = Creative Commons from Common Crawl)

Please use the following citation if you use C4Corpus or C4CorpusTools:

@InProceedings{Habernal.et.al.2016.LREC,
  author    = {Habernal, Ivan and Zayed, Omnia and Gurevych, Iryna},
  title     = {{C4Corpus: Multilingual Web-size corpus with free license}},
  booktitle = {Proceedings of the 10th International Conference on Language Resources
               and Evaluation (LREC 2016)},
  month     = {May},
  year      = {2016},
  address   = {Portoro\v{z}, Slovenia},
  publisher = {European Language Resources Association (ELRA)},
  pages     = {(to appear)},
  url       = {TBA}
}

The full LREC article is available at the UKP website.

Consult the official C4CorpusTools documentation, which contains:

  • C4Corpus User's Guide
    • How to access C4Corpus at S3
    • Running boilerplate removal outside Hadoop
    • Examples of simple search in C4Corpus
  • C4Corpus Developer's Guide
    • How to run the full processing pipeline on CommonCrawl
  • Corpus statistics reported in the LREC article
