# Semanlink automatic tagging and evaluation

This notebook shows how to evaluate a neural search pipeline using pairs of queries and answers. We will automatically tag arXiv papers that François-Paul Servant manually tagged as part of the [Semanlink](http://www.semanlink.net/sl/home?lang=fr) Knowledge Graph.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pprint import pprint as print
from cherche import data, rank, retrieve, eval
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
documents, query_answers = data.arxiv_tags(arxiv_title=True, arxiv_summary=False, comment=False)

`documents` is a list of tags. Each tag is a dictionary holding a set of attributes. We will try to automate the tagging of arXiv documents with a neural search pipeline that retrieves tags based on their attributes, using the title, abstract, and comments of the arXiv articles as the query. In this notebook, `arxiv_title=True` with the other two flags set to `False` means the query is the title alone. `query_answers` pairs each query with the list of relevant tag identifiers.

In [4]:
print(query_answers[:2])

[(' Joint Embedding of Words and Labels for Text Classification',
  [{'uri': 'http://www.semanlink.net/tag/deep_learning_attention'},
   {'uri': 'http://www.semanlink.net/tag/arxiv_doc'},
   {'uri': 'http://www.semanlink.net/tag/nlp_text_classification'},
   {'uri': 'http://www.semanlink.net/tag/label_embedding'}]),
 (' A Survey on Recent Approaches for Natural Language Processing in '
  'Low-Resource Scenarios',
  [{'uri': 'http://www.semanlink.net/tag/bosch'},
   {'uri': 'http://www.semanlink.net/tag/survey'},
   {'uri': 'http://www.semanlink.net/tag/arxiv_doc'},
   {'uri': 'http://www.semanlink.net/tag/nlp_low_resource_scenarios'},
   {'uri': 'http://www.semanlink.net/tag/low_resource_languages'}])]


Here is the list of attributes each tag has:

In [5]:
documents[0]

{'prefLabel': ['Attention mechanism'],
 'type': ['http://www.semanlink.net/2001/00/semanlink-schema#Tag'],
 'broader': ['http://www.semanlink.net/tag/deep_learning'],
 'creationTime': '2016-01-07T00:58:24Z',
 'creationDate': '2016-01-07',
 'comment': 'Good explanation is this [blog post by D. Britz](/doc/?uri=http%3A%2F%2Fwww.wildml.com%2F2016%2F01%2Fattention-and-memory-in-deep-learning-and-nlp%2F). (But the best explanation related to attention is to be found in this [post](/doc/2019/08/transformers_from_scratch_%7C_pet) about Self-Attention.) \r\n\r\nWhile simple Seq2Seq builds a single context vector out of the encoder’s last hidden state, attention creates\r\nshortcuts between the context vector and the entire source input: the context vector has access to the entire input sequence.\r\nThe decoder can “attend” to different parts of the source sentence at each step of the output generation, and the model learns what to attend to based on the input sentence and what it has produced 

Let's evaluate a first pipeline made of a single retriever:

In [6]:
retriever = retrieve.TfIdf(
    key = "uri",
    on = ["prefLabel_text", "altLabel_text"], 
    documents = documents,
    tfidf = TfidfVectorizer(lowercase=True, min_df=1, max_df=0.9, ngram_range=(3, 7), analyzer="char"), 
    k = 30,
)

eval.eval(search = retriever, query_answers=query_answers, hits_k=range(6))

{'Precision@1': '63.06%',
 'Precision@2': '43.47%',
 'Precision@3': '33.12%',
 'Precision@4': '26.67%',
 'Precision@5': '22.55%',
 'Recall@1': '16.79%',
 'Recall@2': '22.22%',
 'Recall@3': '25.25%',
 'Recall@4': '27.03%',
 'Recall@5': '28.54%',
 'F1@1': '26.52%',
 'F1@2': '29.41%',
 'F1@3': '28.65%',
 'F1@4': '26.85%',
 'F1@5': '25.19%',
 'R-Precision': '26.95%',
 'Precision': '5.65%'}
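
The `@k` metrics reported above can be sketched for a single query as follows. This is a minimal illustration with made-up tag names, not the implementation used by `eval.eval`; in particular, dividing precision by `k` is an assumed convention, and the library may treat queries with fewer than `k` results differently.

```python
def precision_recall_f1_at_k(retrieved, relevant, k):
    """Precision@k, Recall@k and F1@k for a single query."""
    top_k = retrieved[:k]
    hits = sum(1 for uri in top_k if uri in relevant)
    precision = hits / k  # assumed convention: divide by k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical ranked output and gold tags for one paper.
retrieved = ["attention", "survey", "text_classification", "bert"]
relevant = {"attention", "arxiv_doc", "text_classification", "label_embedding"}

# Two of the top three results are relevant: P = 2/3, R = 2/4.
p, r, f1 = precision_recall_f1_at_k(retrieved, relevant, k=3)

# Harmonic mean of the Precision@1 and Recall@1 reported for TfIdf.
f1_check = 2 * 0.6306 * 0.1679 / (0.6306 + 0.1679)
```

The last line suggests that the reported `F1@1` of 26.52% is consistent with the harmonic mean of the reported `Precision@1` and `Recall@1`.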

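Lunr ranks the first result less accurately than TfIdf on this dataset (Precision@1 of 56.55% versus 63.06%), although the two retrievers are close from k = 3 onward.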

In [7]:
retriever = retrieve.Lunr(
    key = "uri",
    on = ["prefLabel_text", "altLabel_text"], 
    documents = documents,
    k = 10,
)

eval.eval(search = retriever, query_answers=query_answers, hits_k=range(6))

{'Precision@1': '56.55%',
 'Precision@2': '42.10%',
 'Precision@3': '34.17%',
 'Precision@4': '27.80%',
 'Precision@5': '23.33%',
 'Recall@1': '14.90%',
 'Recall@2': '21.52%',
 'Recall@3': '25.86%',
 'Recall@4': '27.74%',
 'Recall@5': '28.59%',
 'F1@1': '23.59%',
 'F1@2': '28.48%',
 'F1@3': '29.44%',
 'F1@4': '27.77%',
 'F1@5': '25.69%',
 'R-Precision': '27.59%',
 'Precision': '14.00%'}

You can find an explanation of the metrics [here](https://amitness.com/2020/08/information-retrieval-evaluation/). The TfIdf retriever using character n-grams did well.
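
To see why character n-grams help on short tag labels, here is a small pure-Python sketch. The helper names `char_ngrams` and `jaccard` are hypothetical, and plain set overlap stands in for the TF-IDF weighted similarity that `TfidfVectorizer(analyzer="char", ngram_range=(3, 7))` would actually compute.

```python
def char_ngrams(text, n_min=3, n_max=7):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def jaccard(a, b):
    """Share of n-grams that two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


title = char_ngrams("Efficient Retrieval")
label = char_ngrams("Information retrieval")
unrelated = char_ngrams("Knowledge graph")

# The title overlaps with the related label through the n-grams of
# "retrieval", and shares no n-gram with the unrelated label.
related_score = jaccard(title, label)
unrelated_score = jaccard(title, unrelated)
```

Because the overlap is computed on substrings rather than whole words, inflected or compound forms of a label still contribute partial matches.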

Here is what tagging looks like using our retriever (the `retriever` variable now holds the Lunr retriever defined in the previous cell):

In [8]:
retriever("ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction")

[{'uri': 'http://www.semanlink.net/tag/information_retrieval',
  'similarity': 4.147},
 {'uri': 'http://www.semanlink.net/tag/dense_passage_retrieval',
  'similarity': 3.489},
 {'uri': 'http://www.semanlink.net/tag/ranking_information_retrieval',
  'similarity': 3.489},
 {'uri': 'http://www.semanlink.net/tag/embeddings_in_ir', 'similarity': 3.489},
 {'uri': 'http://www.semanlink.net/tag/retrieval_augmented_lm',
  'similarity': 3.489},
 {'uri': 'http://www.semanlink.net/tag/retrieval_based_nlp',
  'similarity': 3.489},
 {'uri': 'http://www.semanlink.net/tag/entity_discovery_and_linking',
  'similarity': 1.579},
 {'uri': 'http://www.semanlink.net/tag/neural_models_for_information_retrieval',
  'similarity': 1.479}]

Let's try to improve those results using a ranker.

In [9]:
retriever = retrieve.TfIdf(
    key = "uri",
    on = ["prefLabel_text", "altLabel_text"], 
    documents = documents,
    tfidf = TfidfVectorizer(lowercase=True, min_df=1, max_df=0.9, ngram_range=(3, 7), analyzer="char"), 
    k = 30,
)

ranker = rank.Encoder(
    key = "uri",
    on = ["prefLabel_text", "altLabel_text"],
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k = 10,
)

In [11]:
search = retriever + ranker
search.add(documents)

TfIdf retriever
 	 key: uri
 	 on: prefLabel_text, altLabel_text
 	 documents: 433
Encoder ranker
	 key: uri
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 Embeddings pre-computed: 433
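
The ranker summary above lists `similarity: cosine`. The re-ranking step can be sketched as below, assuming the candidates returned by the retriever are simply re-ordered by cosine similarity to the query embedding. The three-dimensional vectors are made up for illustration and are not real sentence embeddings; `rerank` is a hypothetical helper, not the library's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query_vec, candidates, embeddings, k):
    """Score each candidate against the query and keep the k best."""
    scored = [
        {"uri": uri, "similarity": cosine(query_vec, embeddings[uri])}
        for uri in candidates
    ]
    scored.sort(key=lambda doc: doc["similarity"], reverse=True)
    return scored[:k]


# Toy vectors standing in for pre-computed tag embeddings (made up).
embeddings = {
    "information_retrieval": [0.9, 0.1, 0.0],
    "active_learning": [0.1, 0.9, 0.2],
    "dense_passage_retrieval": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

top = rerank(query_vec, list(embeddings), embeddings, k=2)
```

The ranker can only re-order what the retriever returns, so it can raise precision at small k but cannot recover tags that the retriever missed.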

In [12]:
eval.eval(search = search, query_answers = query_answers, hits_k=range(6))

{'Precision@1': '63.69%',
 'Precision@2': '42.83%',
 'Precision@3': '33.01%',
 'Precision@4': '26.75%',
 'Precision@5': '22.55%',
 'Recall@1': '17.15%',
 'Recall@2': '22.64%',
 'Recall@3': '25.75%',
 'Recall@4': '27.34%',
 'Recall@5': '28.55%',
 'F1@1': '27.02%',
 'F1@2': '29.63%',
 'F1@3': '28.93%',
 'F1@4': '27.04%',
 'F1@5': '25.20%',
 'R-Precision': '27.43%',
 'Precision': '13.18%'}

The Sentence Transformers encoder (`all-mpnet-base-v2`) improved the results of the retriever a little: F1@k and Recall@k increase at every k, and Precision@1 rises from 63.06% to 63.69%, while Precision@2 and Precision@3 dip slightly.

Here are the tags proposed for the ColBERTv2 paper using the retriever and ranker:

In [13]:
search("ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction")

[{'uri': 'http://www.semanlink.net/tag/retrieval_augmented_lm',
  'similarity': 0.5449117422103882},
 {'uri': 'http://www.semanlink.net/tag/neural_models_for_information_retrieval',
  'similarity': 0.42808786034584045},
 {'uri': 'http://www.semanlink.net/tag/embeddings_in_ir',
  'similarity': 0.42719531059265137},
 {'uri': 'http://www.semanlink.net/tag/dense_passage_retrieval',
  'similarity': 0.4264187514781952},
 {'uri': 'http://www.semanlink.net/tag/information_retrieval',
  'similarity': 0.40513238310813904},
 {'uri': 'http://www.semanlink.net/tag/entity_discovery_and_linking',
  'similarity': 0.35288918018341064},
 {'uri': 'http://www.semanlink.net/tag/ranking_information_retrieval',
  'similarity': 0.34628304839134216},
 {'uri': 'http://www.semanlink.net/tag/retrieval_based_nlp',
  'similarity': 0.32937100529670715},
 {'uri': 'http://www.semanlink.net/tag/active_learning',
  'similarity': 0.2809343934059143},
 {'uri': 'http://www.semanlink.net/tag/cognitive_search',
  'similarity

Let's try Flash as a retriever. FlashText retrieves the tag labels that appear inside the title.

In [14]:
retriever = retrieve.Flash(
    key = "uri",
    on = ["prefLabel", "altLabel"], 
    k = 30,
)

search = retriever + ranker
search.add(documents)

Flash retriever
 	 key: uri
 	 on: prefLabel, altLabel
 	 documents: 605
Encoder ranker
	 key: uri
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 Embeddings pre-computed: 433

FlashText as a retriever provides fewer candidates than TfIdf but has higher precision.
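
The behaviour described above (few candidates, high precision) follows from exact label matching. The sketch below illustrates the idea with a regular expression per label; FlashText itself relies on a trie and a single pass over the text, so this is only an approximation of its behaviour. The labels and shortened URIs are hypothetical.

```python
import re


def find_labels(title, label_to_uri):
    """Return the URIs of labels that occur as whole words in the title."""
    found = []
    lowered = title.lower()
    for label, uri in label_to_uri.items():
        # Match the label only when it is not glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(label.lower()) + r"(?!\w)"
        if re.search(pattern, lowered) and uri not in found:
            found.append(uri)
    return found


# Hypothetical labels; the real ones come from prefLabel and altLabel.
label_to_uri = {
    "Text Classification": "tag/nlp_text_classification",
    "Survey": "tag/survey",
    "Attention mechanism": "tag/deep_learning_attention",
}

hits = find_labels(
    "Joint Embedding of Words and Labels for Text Classification", label_to_uri
)
misses = find_labels(
    "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction",
    label_to_uri,
)
```

A title that contains no label verbatim yields an empty candidate list, which is the price paid for the higher precision.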

In [15]:
eval.eval(search = search, query_answers = query_answers, hits_k=range(6))

{'Precision@1': '72.80%',
 'Precision@2': '61.90%',
 'Precision@3': '59.90%',
 'Precision@4': '59.27%',
 'Precision@5': '59.37%',
 'Recall@1': '16.33%',
 'Recall@2': '19.54%',
 'Recall@3': '20.11%',
 'Recall@4': '20.16%',
 'Recall@5': '20.20%',
 'F1@1': '26.67%',
 'F1@2': '29.71%',
 'F1@3': '30.11%',
 'F1@4': '30.08%',
 'F1@5': '30.15%',
 'R-Precision': '20.20%',
 'Precision': '59.37%'}

In [16]:
search("ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction")

[]

We can get the best of both worlds by using a pipeline union. It gets a bit complicated, but the union allows us to retrieve the best candidates from the first model, then add the candidates from the second model without duplicates (no matter how many models are in the union). Our first retriever and ranker (Flash + Encoder) have low recall and high precision. The second retriever has lower precision but higher recall. So we can mix things up and offer the Flash and ranker candidates first, then the TfIdf and ranker candidates second.
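
The ordering and de-duplication described above can be sketched in a few lines. This is an illustration of the idea only, with a hypothetical `union` helper and made-up scores; it leaves the similarity values untouched, whereas the library's `|` operator may compute them differently.

```python
def union(*ranked_lists):
    """Concatenate ranked lists in order, keeping the first occurrence of each URI."""
    seen = set()
    merged = []
    for ranked in ranked_lists:
        for doc in ranked:
            if doc["uri"] not in seen:
                seen.add(doc["uri"])
                merged.append(doc)
    return merged


# Made-up outputs of a high-precision and a high-recall pipeline.
precision_results = [
    {"uri": "tag/survey", "similarity": 0.71},
]
recall_results = [
    {"uri": "tag/nlp_low_resource_scenarios", "similarity": 0.64},
    {"uri": "tag/survey", "similarity": 0.58},
    {"uri": "tag/low_resource_languages", "similarity": 0.52},
]

merged = union(precision_results, recall_results)
```

Candidates from the first pipeline always come first, so the order of the operands decides which model gets priority.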

In [17]:
precision = retrieve.Flash(
    key = "uri",
    on = ["prefLabel", "altLabel"], 
    k = 30,
) + ranker

recall = retrieve.TfIdf(
    key = "uri",
    on = ["prefLabel_text", "altLabel_text"], 
    documents = documents,
    tfidf = TfidfVectorizer(lowercase=True, min_df=1, max_df=0.9, ngram_range=(3, 7), analyzer="char"), 
    k = 30,
) + ranker

search = precision | recall

search.add(documents)

Union Pipeline
-----
Flash retriever
 	 key: uri
 	 on: prefLabel, altLabel
 	 documents: 605
Encoder ranker
	 key: uri
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 Embeddings pre-computed: 433
TfIdf retriever
 	 key: uri
 	 on: prefLabel_text, altLabel_text
 	 documents: 433
Encoder ranker
	 key: uri
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 Embeddings pre-computed: 433
-----

In [18]:
eval.eval(search = search, query_answers = query_answers, hits_k=range(6))

{'Precision@1': '69.11%',
 'Precision@2': '49.84%',
 'Precision@3': '39.07%',
 'Precision@4': '31.13%',
 'Precision@5': '25.92%',
 'Recall@1': '18.74%',
 'Recall@2': '25.89%',
 'Recall@3': '30.10%',
 'Recall@4': '31.58%',
 'Recall@5': '32.57%',
 'F1@1': '29.49%',
 'F1@2': '34.08%',
 'F1@3': '34.00%',
 'F1@4': '31.35%',
 'F1@5': '28.87%',
 'R-Precision': '31.96%',
 'Precision': '13.97%'}

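The union of pipelines improves recall at every k over both previous pipelines, and F1 up to k = 4; only F1@5 stays below the Flash pipeline (28.87% versus 30.15%).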

Here are our tags for the ColBERTv2 paper with the best of both worlds:

In [19]:
search("ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction")

[{'uri': 'http://www.semanlink.net/tag/retrieval_augmented_lm',
  'similarity': 0.11754539040261458},
 {'uri': 'http://www.semanlink.net/tag/neural_models_for_information_retrieval',
  'similarity': 0.10458505653003734},
 {'uri': 'http://www.semanlink.net/tag/embeddings_in_ir',
  'similarity': 0.10449175080983726},
 {'uri': 'http://www.semanlink.net/tag/dense_passage_retrieval',
  'similarity': 0.10441063828677113},
 {'uri': 'http://www.semanlink.net/tag/information_retrieval',
  'similarity': 0.10221160275170203},
 {'uri': 'http://www.semanlink.net/tag/entity_discovery_and_linking',
  'similarity': 0.09700882931829885},
 {'uri': 'http://www.semanlink.net/tag/ranking_information_retrieval',
  'similarity': 0.0963700883333301},
 {'uri': 'http://www.semanlink.net/tag/retrieval_based_nlp',
  'similarity': 0.09475397763274808},
 {'uri': 'http://www.semanlink.net/tag/active_learning',
  'similarity': 0.09027379432403294},
 {'uri': 'http://www.semanlink.net/tag/cognitive_search',
  'similari