Abstract:
SMAPH-S is a precursor of SMAPH-2, a state-of-the-art system for joint entity mention detection and linking in web queries. Both systems use a piggyback approach to annotate queries: a set of candidate entities is drawn directly from Bing search results or from annotations of Bing snippets, so performance depends heavily on the accuracy of Bing itself. Our system improves on SMAPH-S by systematically detecting queries that produce uninformative Bing results and rewriting them to extract better candidate entities. To this end, we split query strings into smaller chunks based on their linking probability. We also improve the way mention candidates are generated so that the system can handle noisy inputs, which are very common in web queries. Finally, we report the results of experimenting with different regressors in the pruning phase, such as Probabilistic Logistic Regression and AdaBoost.
The piggyback paper contains additional details.
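
The query-splitting idea from the abstract can be illustrated with a minimal sketch. It is not the project's actual implementation: the greedy longest-match strategy, the `split_query` function, the `threshold` and `max_len` parameters, and the `LINK_PROB` values are all assumptions made for illustration. It only shows how a query could be cut into chunks when some n-grams have a high linking probability.

```python
from typing import Dict, List

# Hypothetical link probabilities (made-up values, for illustration only).
LINK_PROB: Dict[str, float] = {
    "new york times": 0.81,
    "new york": 0.62,
    "york": 0.12,
    "times": 0.05,
    "subscription": 0.03,
}


def split_query(query: str, link_prob: Dict[str, float],
                threshold: float = 0.3, max_len: int = 4) -> List[str]:
    """Greedily split a query into chunks.

    At each position, take the longest n-gram (up to `max_len` tokens)
    whose link probability is at least `threshold`; if none qualifies,
    emit the single token as its own chunk.
    """
    tokens = query.lower().split()
    chunks: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, then shrink down to one token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or link_prob.get(candidate, 0.0) >= threshold:
                chunks.append(candidate)
                i += n
                break
    return chunks


print(split_query("cheap new york hotels", LINK_PROB))
```

Each resulting chunk could then be sent to the search engine separately when the full query returns uninformative results. A real system would read link probabilities from a Wikipedia-derived index rather than a hard-coded table.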
This project is based on marcocor's query annotator stub. The project is mavenized.
- Python with scikit-learn and Flask
  - The pruner is written in Python using scikit-learn and relies on Flask to expose an API that is started and called from the Java pipeline.
- Scala
  - We use Scala to generate the dataset for training the pruner.
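
As a rough sketch of how a scikit-learn pruner can be exposed through Flask, consider the example below. It does not reproduce the project's code: the `/predict` route, the JSON field names (`features`, `scores`, `keep`), the two-dimensional feature vectors, and the toy training data are all hypothetical. Logistic regression stands in for whichever model is actually used in the pruning phase.

```python
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

# Toy training data (made up): one feature vector per candidate annotation,
# labeled 1 (keep) or 0 (discard).
X_TRAIN = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
Y_TRAIN = [1, 1, 0, 0]

model = LogisticRegression()
model.fit(X_TRAIN, Y_TRAIN)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical endpoint name
def predict():
    """Score candidate annotations sent as {"features": [[...], ...]}."""
    payload = request.get_json()
    # Column 1 of predict_proba holds the probability of class 1 (keep).
    scores = model.predict_proba(payload["features"])[:, 1]
    return jsonify({
        "scores": [float(s) for s in scores],
        "keep": [bool(s >= 0.5) for s in scores],
    })

# To serve requests locally, one could call app.run(port=5000);
# the port is an arbitrary choice for this sketch.
```

A client such as the Java pipeline would then POST a JSON body containing the feature vectors and read back one score and one keep/discard decision per candidate.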
- Make sure you have all dependencies installed.
- Fill in your Bing API key in config.properties.
- To benchmark our annotator, run BenchmarkMain.
The file pom.xml defines the Maven project. It includes two dependencies: bat-framework and bing-api-java. You need the BAT-framework to benchmark your annotation system, and the Bing Java API to access Bing (in case your project is built on top of Bing).
- SmaphSAnnotator contains the improved SMAPH-S annotator we implemented.