Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts




This project is our entry to the CommonCrawl contest. The idea is inspired by Google's release of its entity linking dataset, which provides a baseline for research on entity linking and other information retrieval and natural language processing tasks.

Human language is ambiguous, and synonymy and polysemy are fundamental problems in natural language processing (NLP) and information retrieval (IR). One approach to Word Sense Disambiguation (WSD) uses an external ontology, e.g. Wikipedia, to determine the meaning of a word based on the probabilities that it maps to each of the possible Wikipedia concepts. Our entry aims to build such a corpus of anchortext-WikipediaConcept-Count triples from the CommonCrawl dataset, so as to benefit research on WSD, NLP and IR. More specifically, we extract all anchortexts (the text you click on in a webpage link) that point to a Wikipedia page, together with the corresponding Wikipedia page. Based on this corpus, we developed this web application to demonstrate the anchortext-WikipediaConcept-Count structure.
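As a rough sketch of the extraction step (our actual pipeline runs over CommonCrawl's raw archives at scale; the parser below is a minimal, standard-library-only illustration, and is not the production code):

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikiAnchorParser(HTMLParser):
    """Collects (anchortext, WikipediaConcept) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = Counter()
        self._concept = None  # concept of the <a> tag we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        parsed = urlparse(href)
        # Keep only links into Wikipedia's article namespace (/wiki/...).
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            self._concept = unquote(parsed.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._concept is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._concept is not None:
            anchor = " ".join("".join(self._text).split())
            if anchor:
                self.pairs[(anchor, self._concept)] += 1
            self._concept = None


def extract_triples(html):
    """Return anchortext-WikipediaConcept-Count triples for one page."""
    parser = WikiAnchorParser()
    parser.feed(html)
    return [(anchor, concept, n) for (anchor, concept), n in parser.pairs.items()]
```

Aggregating the per-page triples across the whole crawl (summing counts per anchortext-concept pair) yields the final corpus.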

Application scenarios

  • Given a concept (represented as a Wikipedia page), it can show which terms people most commonly use to refer to that concept. This can be seen as "Explicit Topic Modeling". Example

  • Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts

  • Comparing CommonCrawl against Google's dataset with regard to the richness and precision of the anchortext-WikipediaConcept-Count corpus

  • For entity linking tasks, will combining both corpora boost performance compared with using each dataset individually?
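For the entity linking scenarios above, the counts give a simple estimate of P(concept | anchortext): the fraction of links with a given anchortext that point to each candidate concept. A minimal sketch, using hypothetical counts for the ambiguous anchortext "Jaguar" (the numbers are invented for illustration, not taken from our corpus):

```python
def link_probabilities(triples, anchortext):
    """Estimate P(concept | anchortext) from anchortext-concept-count triples."""
    counts = {concept: n for anchor, concept, n in triples if anchor == anchortext}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {concept: n / total for concept, n in counts.items()}


# Hypothetical counts: how often "Jaguar" links to each Wikipedia concept.
triples = [
    ("Jaguar", "Jaguar", 60),                 # the animal
    ("Jaguar", "Jaguar_Cars", 35),            # the car maker
    ("Jaguar", "Jacksonville_Jaguars", 5),    # the football team
]

probs = link_probabilities(triples, "Jaguar")
best = max(probs, key=probs.get)  # most likely concept for this anchortext
```

Richer linkers would also condition on the surrounding sentence; this frequency baseline is the simplest use of the corpus.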


Live Demo:

Help Spread

If you find our work interesting, please vote for our entry on the CommonCrawl Contest website and stay tuned for the release of the dataset.