Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts. Live Demo
This project is our entry to the CommonCrawl contest. The idea is inspired by Google's release of its entity linking dataset, which provides a baseline for research on entity linking and other information retrieval and natural language processing tasks.
Human language is ambiguous, and synonymy and polysemy are fundamental problems in natural language processing (NLP) and information retrieval (IR). One approach to Word Sense Disambiguation (WSD) is to use an external ontology, e.g., Wikipedia, to determine the meaning of a word based on the probabilities that it maps to each of the possible Wikipedia concepts. Our entry aims to build such a corpus of anchortext-WikipediaConcept-Count triples from the CommonCrawl dataset, so as to benefit research on WSD, NLP, and IR. More specifically, we extract all anchortexts (the text you click on in a webpage link) that point to a Wikipedia page, together with the corresponding Wikipedia page. Based on this corpus, we developed a web application to demonstrate the anchortext-WikipediaConcept-Count structure.
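The core extraction step can be sketched as follows. This is a minimal illustration using only the Python standard library, run here on a toy HTML snippet rather than CommonCrawl data; the class name and the sample document are our own for demonstration, not the project's actual pipeline code.

```python
# Sketch: extract (anchortext, Wikipedia concept) pairs from raw HTML and
# aggregate them into anchortext-WikipediaConcept-Count triples.
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikiAnchorParser(HTMLParser):
    """Collects the text of <a> tags whose href points at a Wikipedia article."""

    def __init__(self):
        super().__init__()
        self._concept = None   # Wikipedia page title of the currently open <a>
        self._text = []
        self.pairs = []        # (anchortext, concept) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        parsed = urlparse(dict(attrs).get("href", ""))
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            self._concept = unquote(parsed.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._concept is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._concept is not None:
            anchor = "".join(self._text).strip()
            if anchor:
                self.pairs.append((anchor, self._concept))
            self._concept = None


# Toy document: two Wikipedia links, one non-Wikipedia link to be ignored.
html = """
<p><a href="https://en.wikipedia.org/wiki/Apple_Inc.">Apple</a> and
<a href="https://en.wikipedia.org/wiki/Apple">apple</a> are different concepts;
<a href="https://example.com/apple">this link</a> is ignored.</p>
"""

parser = WikiAnchorParser()
parser.feed(html)
triples = Counter(parser.pairs)  # anchortext-WikipediaConcept-Count
for (anchor, concept), count in triples.items():
    print(anchor, concept, count)
```

The same aggregation, summed over all of CommonCrawl, yields the corpus described above.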
The demo supports two use cases:
- Given a concept (represented as a Wikipedia page), it shows the most common terms people use to refer to that concept. This can be seen as a form of "explicit topic modeling". Example
- Given a sentence, it helps identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
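Both use cases reduce to simple lookups over the triple corpus: one direction ranks surface forms for a concept, the other estimates sense probabilities for an ambiguous anchortext. A toy sketch with made-up counts (the triples and function names below are illustrative, not from the released dataset):

```python
# Sketch: two lookup directions over anchortext-WikipediaConcept-Count triples.
from collections import Counter

# Hypothetical triples; real counts come from the CommonCrawl-derived corpus.
triples = {
    ("Apple", "Apple_Inc."): 900,
    ("Apple Inc.", "Apple_Inc."): 300,
    ("apple", "Apple"): 500,
    ("Apple", "Apple"): 100,
}

def terms_for_concept(concept):
    """Most common anchortexts used to refer to a Wikipedia concept."""
    counts = Counter({a: c for (a, k), c in triples.items() if k == concept})
    return counts.most_common()

def sense_probabilities(anchor):
    """P(concept | anchortext) estimated from the corpus counts."""
    counts = {k: c for (a, k), c in triples.items() if a == anchor}
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

print(terms_for_concept("Apple_Inc."))  # [('Apple', 900), ('Apple Inc.', 300)]
print(sense_probabilities("Apple"))     # {'Apple_Inc.': 0.9, 'Apple': 0.1}
```

Under this view, disambiguating "Apple" in a sentence amounts to picking the highest-probability concept, optionally reweighted by context.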
Open questions we hope the corpus will help answer:
- CommonCrawl vs. Google: how do the two compare in the richness and precision of the resulting anchortext-WikipediaConcept-Count corpus?
- For entity linking tasks, does combining both corpora boost performance compared with using each dataset individually?
Live Demo: http://wikientities.appspot.com/
If you find our work interesting, please vote for our entry on the CommonCrawl contest website and stay tuned for our release of the dataset.