# Semanlink automatic tagging, evaluation

This notebook shows how to evaluate a neural search pipeline using pairs of queries and answers. We will try to automatically tag arXiv papers that were manually tagged by François-Paul Servant as part of the [Semanlink](http://www.semanlink.net/sl/home?lang=fr) knowledge graph.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pprint import pprint as print
from cherche import data, rank, retrieve, eval
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
documents, query_answers = data.arxiv_tags(arxiv_title=True, arxiv_summary=False, comment=False)

`documents` is a list of tags. Each tag is represented as a dict with a set of attributes. We will try to automate the tagging of arXiv documents with a neural search pipeline that retrieves tags based on their attributes, using the title, abstract and comments of the arXiv articles as the query. In the call above, only the title is enabled (`arxiv_title=True`).

Here is the list of attributes each tag has:

In [4]:
print(documents[3])

{'altLabel': [],
 'altLabel_text': '',
 'broader': ['http://www.semanlink.net/tag/statistical_classification',
             'http://www.semanlink.net/tag/embeddings'],
 'broader_altLabel': ['embedding'],
 'broader_altLabel_text': 'embedding',
 'broader_prefLabel': ['Classification', 'Embeddings'],
 'broader_prefLabel_text': 'Classification Embeddings',
 'broader_related': ['http://www.semanlink.net/tag/nlp_techniques',
                     'http://www.semanlink.net/tag/similarity_queries'],
 'comment': 'How to embed (describe) classes (in classification)? See related '
            'work section of this '
            '[paper](doc:2020/02/joint_embedding_of_words_and_la)\r\n'
            '\r\n'
            '> [FastText](tag:fasttext) generates both word\r\n'
            'embeddings and label embeddings. It seeks to predict one of the '
            'document’s labels (instead of the central\r\n'
            'word) ([src](doc:2020/10/1911_11506_word_class_embeddi))',
 'creationDate': '2020

Here is what a query looks like. A query can be built from the arXiv title, the arXiv summary and the comments; with the settings above, only the title is used. We will try to find the right documents (tags) for this query.

In [5]:
print(query_answers[0][0])

' Joint Embedding of Words and Labels for Text Classification'


Let's evaluate a first pipeline made of a single retriever.

In [6]:
retriever = retrieve.TfIdf(
    on = ["prefLabel_text", "altLabel_text"], 
    tfidf = TfidfVectorizer(lowercase=True, min_df=1, max_df=0.9, ngram_range=(3, 7), analyzer="char"), 
    k = 10,
)

retriever.add(documents)

TfIdf retriever
 	 on: prefLabel_text, altLabel_text
 	 documents: 433
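The `analyzer="char"` and `ngram_range=(3, 7)` settings make the vectorizer work on character n-grams rather than whole words. The sketch below is a plain-Python illustration of the n-gram extraction step only (it does not compute TF-IDF weights, and the `char_ngrams` helper and example label are hypothetical). It shows how a title and a label can share character n-grams even when they share no whole word.

```python
def char_ngrams(text, n_min=3, n_max=7):
    """Return the set of lowercased character n-grams of length n_min..n_max."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


# Hypothetical query and label, chosen to share a stem but no whole word.
query = "Pre-training of Deep Bidirectional Transformers"
label = "Pretrained Language Models"

# Character n-grams present in both strings, e.g. "pre" and "train".
shared = char_ngrams(query) & char_ngrams(label)

# Whole-word overlap, for comparison: the two strings have no word in common.
shared_words = set(query.lower().split()) & set(label.lower().split())
```

Because matching happens below the word level, spelling variants such as "Pre-training" and "Pretrained" can still contribute to the similarity score.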

In [7]:
eval.eval(search = retriever, query_answers=query_answers)

{'Precision@1': '63.06%',
 'Precision@2': '43.47%',
 'Precision@3': '33.12%',
 'Precision@4': '26.67%',
 'Precision@5': '22.55%',
 'Precision@6': '19.85%',
 'Precision@7': '17.52%',
 'Precision@8': '15.84%',
 'Precision@9': '14.61%',
 'R-Precision': '26.95%',
 'Precision': '13.47%'}
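For readers unfamiliar with these metrics, here is a minimal sketch of the textbook definitions of Precision@k and R-Precision. It is an illustration only: the function names and the toy data are hypothetical, and `eval.eval` may follow different conventions (for example when fewer than `k` documents are returned).

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    # Convention used here: always divide by k, even if fewer items came back.
    return hits / k


def r_precision(retrieved, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return precision_at_k(retrieved, relevant, r)


# Hypothetical example: 4 retrieved tags, 2 of which are correct.
retrieved = ["bert", "nlu", "sbert", "language_model"]
relevant = {"bert", "language_model"}

p_at_1 = precision_at_k(retrieved, relevant, 1)  # 1 hit in the top 1
p_at_4 = precision_at_k(retrieved, relevant, 4)  # 2 hits in the top 4
r_prec = r_precision(retrieved, relevant)        # R = 2, 1 hit in the top 2
```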

Here is what tagging looks like using our retriever

In [8]:
tags = retriever("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding")
for tag in tags:
    print(tag["uri"])

'http://www.semanlink.net/tag/nlu'
'http://www.semanlink.net/tag/attention_is_all_you_need'
'http://www.semanlink.net/tag/pre_trained_language_models'
'http://www.semanlink.net/tag/nlp_pretraining'
'http://www.semanlink.net/tag/co_training'
'http://www.semanlink.net/tag/huggingface_transformers'
'http://www.semanlink.net/tag/self_training'
'http://www.semanlink.net/tag/attention_in_graphs'
'http://www.semanlink.net/tag/sbert'
'http://www.semanlink.net/tag/language_model'


Not bad: TfIdf with character n-grams does well when queried with titles. We will now try to improve those results using a ranker.

In [9]:
retriever = retrieve.TfIdf(
    on = ["prefLabel_text", "altLabel_text"], 
    tfidf = TfidfVectorizer(lowercase=True, min_df=1, max_df=0.9, ngram_range=(3, 7), analyzer="char"), 
    k = 30,
)

ranker = rank.Encoder(
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    on = ["prefLabel_text", "altLabel_text"],
    k = 10,
    path = "all-mpnet-base-v2.pkl"
)

We pre-compute the embeddings of the 314 queries to speed up evaluation, since transformers are slow on CPU.

Pre-calculation time on CPU:
- title: 24 seconds
- title and summary: 6 minutes and 18 seconds
- title, summary and comments: 8 minutes

In [10]:
ranker.add([q for q, _ in query_answers])

Encoder ranker
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 embeddings stored at: all-mpnet-base-v2.pkl

In [11]:
search = retriever + ranker
search.add(documents)

TfIdf retriever
 	 on: prefLabel_text, altLabel_text
 	 documents: 433
Encoder ranker
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 embeddings stored at: all-mpnet-base-v2.pkl
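The `retriever + ranker` pipeline follows a two-stage pattern: the retriever proposes candidates, then the ranker reorders them by cosine similarity between query and document embeddings and keeps the top `k`. The sketch below illustrates that second stage in plain Python. It assumes toy 2-d vectors in place of real SentenceTransformer embeddings, and the `cosine` and `rerank` helpers are hypothetical, not part of cherche.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query_vec, candidates, embeddings, k):
    """Order retriever candidates by similarity to the query and keep the top k."""
    scored = sorted(
        candidates,
        key=lambda doc_id: cosine(query_vec, embeddings[doc_id]),
        reverse=True,
    )
    return scored[:k]


# Hypothetical 2-d embeddings standing in for SentenceTransformer vectors.
embeddings = {
    "nlu": [0.9, 0.1],
    "co_training": [0.1, 0.9],
    "nlp_pretraining": [0.8, 0.3],
}
query_vec = [1.0, 0.2]
candidates = ["co_training", "nlu", "nlp_pretraining"]  # order given by the retriever

# The ranker promotes the candidates whose vectors point in the query's direction.
top = rerank(query_vec, candidates, embeddings, k=2)
```

Note that the ranker can only reorder what the retriever returned, which is why the retriever's `k` (30) is set higher than the ranker's `k` (10).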

In [12]:
eval.eval(search = search, query_answers = query_answers)

{'Precision@1': '63.69%',
 'Precision@2': '42.83%',
 'Precision@3': '33.01%',
 'Precision@4': '26.75%',
 'Precision@5': '22.55%',
 'Precision@6': '19.53%',
 'Precision@7': '17.56%',
 'Precision@8': '15.88%',
 'Precision@9': '14.37%',
 'R-Precision': '27.43%',
 'Precision': '13.18%'}

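The Sentence-BERT ranker improves the results of the retriever a bit: Precision@1 rises from 63.06% to 63.69% and R-Precision from 26.95% to 27.43%, while the other metrics stay roughly the same or drop slightly.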

Here are our tags for the BERT paper using the retriever and ranker pipeline:

In [13]:
tags = search("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding")
for tag in tags:
    print(tag["uri"])

'http://www.semanlink.net/tag/pre_trained_language_models'
'http://www.semanlink.net/tag/attention_is_all_you_need'
'http://www.semanlink.net/tag/nlp_pretraining'
'http://www.semanlink.net/tag/nlu'
'http://www.semanlink.net/tag/attention_knowledge_graphs'
'http://www.semanlink.net/tag/sbert'
'http://www.semanlink.net/tag/grounded_language_learning'
'http://www.semanlink.net/tag/language_models_as_knowledge_bases'
'http://www.semanlink.net/tag/language_models_knowledge'
'http://www.semanlink.net/tag/attention_in_graphs'


Let's try using FlashText as a retriever.

In [14]:
retriever = retrieve.Flash(
    on = ["prefLabel", "altLabel"], 
    k = 30,
)

ranker = rank.Encoder(
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    on = ["prefLabel_text", "altLabel_text"],
    k = 10,
    path = "all-mpnet-base-v2.pkl"
)

search = retriever + ranker
search.add(documents)

Flash retriever
 	 on: prefLabel, altLabel
 	 documents: 605
Encoder ranker
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 embeddings stored at: all-mpnet-base-v2.pkl

FlashText as a retriever provides fewer candidates than TfIdf but has higher precision.
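The trade-off comes from exact matching: a tag is only proposed when one of its labels appears verbatim in the query. The sketch below imitates that behaviour with regular expressions. It is a simplified stand-in, not FlashText's actual trie-based algorithm, and the `exact_label_matches` helper and the label table are hypothetical.

```python
import re


def exact_label_matches(text, label_to_tag):
    """Return tags whose label appears in the text as a whole word or phrase."""
    found = []
    lowered = text.lower()
    for label, tag in label_to_tag.items():
        # Require non-word characters (or string edges) around the label.
        pattern = r"(?<!\w)" + re.escape(label.lower()) + r"(?!\w)"
        if re.search(pattern, lowered) and tag not in found:
            found.append(tag)
    return found


# Hypothetical labels; the real ones come from the prefLabel and altLabel fields.
label_to_tag = {
    "BERT": "bert",
    "Knowledge Graph": "knowledge_graph",
    "Embeddings": "embeddings",
}

title = "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
matches = exact_label_matches(title, label_to_tag)
```

Only the `bert` tag is returned here. Few candidates come back, but each one is backed by a literal mention in the title.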

In [15]:
eval.eval(search = search, query_answers = query_answers)

{'Precision@1': '72.80%',
 'Precision@2': '61.90%',
 'Precision@3': '59.90%',
 'Precision@4': '59.27%',
 'Precision@5': '59.37%',
 'Precision@6': '59.37%',
 'Precision@7': '59.37%',
 'Precision@8': '59.37%',
 'Precision@9': '59.37%',
 'R-Precision': '20.20%',
 'Precision': '59.37%'}

In [16]:
tags = search("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding")
for tag in tags:
    print(tag["uri"])

'http://www.semanlink.net/tag/attention_is_all_you_need'
'http://www.semanlink.net/tag/bert'


We can get the best of both worlds by using a pipeline union. It gets a bit complicated, but the union lets us retrieve the best candidates from the first model, then add the candidates from the second model without duplicates (and so on, no matter how many models are in the union). Our first pipeline (Flash + Encoder) has low recall and high precision. The second (TfIdf + Encoder) has lower precision but higher recall. We can combine them to offer the Flash and ranker candidates first, then the TfIdf and ranker candidates second.
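The order-preserving, duplicate-free merge described above can be sketched in plain Python. This is an illustration of the idea only, with a hypothetical `union` helper and made-up candidate lists; it is not the code behind the `|` operator.

```python
def union(*ranked_lists):
    """Concatenate ranked lists in order, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for ranked in ranked_lists:
        for item in ranked:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Hypothetical outputs of the two sub-pipelines for one query.
precise = ["attention_is_all_you_need", "bert"]
broad = ["pre_trained_language_models", "bert", "nlu"]

# Candidates from the precise pipeline come first; "bert" is not repeated.
merged = union(precise, broad)
```

Candidates from earlier pipelines keep their top positions, so the high-precision results are never pushed down by the high-recall ones.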

In [17]:
ranker = rank.Encoder(
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    on = ["prefLabel_text", "altLabel_text"],
    k = 10,
    path = "all-mpnet-base-v2.pkl"
)

precision = retrieve.Flash(
    on = ["prefLabel", "altLabel"], 
    k = 30,
) + ranker

recall = retrieve.TfIdf(
    on = ["prefLabel_text", "altLabel_text"], 
    tfidf = TfidfVectorizer(lowercase=True, min_df=1, max_df=0.9, ngram_range=(3, 7), analyzer="char"), 
    k = 10,
) + ranker

search = precision | recall

search.add(documents)

Union Pipeline
-----
Flash retriever
 	 on: prefLabel, altLabel
 	 documents: 605
Encoder ranker
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 embeddings stored at: all-mpnet-base-v2.pkl
TfIdf retriever
 	 on: prefLabel_text, altLabel_text
 	 documents: 433
Encoder ranker
	 on: prefLabel_text, altLabel_text
	 k: 10
	 similarity: cosine
	 embeddings stored at: all-mpnet-base-v2.pkl
-----

In [18]:
eval.eval(search = search, query_answers = query_answers)

{'Precision@1': '66.88%',
 'Precision@2': '47.77%',
 'Precision@3': '37.26%',
 'Precision@4': '29.62%',
 'Precision@5': '24.59%',
 'Precision@6': '21.07%',
 'Precision@7': '18.29%',
 'Precision@8': '16.32%',
 'Precision@9': '14.79%',
 'R-Precision': '30.48%',
 'Precision': '13.68%'}

Here are our tags for the BERT paper with the best of both worlds:

In [19]:
tags = search("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding")
for tag in tags:
    print(tag["uri"])

'http://www.semanlink.net/tag/attention_is_all_you_need'
'http://www.semanlink.net/tag/bert'
'http://www.semanlink.net/tag/pre_trained_language_models'
'http://www.semanlink.net/tag/nlp_pretraining'
'http://www.semanlink.net/tag/nlu'
'http://www.semanlink.net/tag/sbert'
'http://www.semanlink.net/tag/attention_in_graphs'
'http://www.semanlink.net/tag/self_training'
'http://www.semanlink.net/tag/language_model'
'http://www.semanlink.net/tag/co_training'
'http://www.semanlink.net/tag/huggingface_transformers'


Here is what tagging looks like with my first paper

In [20]:
tags = search("Knowledge Base Embedding By Cooperative Knowledge Distillation")
for tag in tags:
    print(tag["uri"])

'http://www.semanlink.net/tag/knowledge_base'
'http://www.semanlink.net/tag/knowledge_distillation'
'http://www.semanlink.net/tag/embeddings'
'http://www.semanlink.net/tag/knowledge_graph_embeddings'
'http://www.semanlink.net/tag/text_kg_and_embeddings'
'http://www.semanlink.net/tag/ai_knowledge_bases'
'http://www.semanlink.net/tag/phrase_embeddings'
'http://www.semanlink.net/tag/knowledge_graph'
'http://www.semanlink.net/tag/language_models_as_knowledge_bases'
'http://www.semanlink.net/tag/bert_kb'
'http://www.semanlink.net/tag/knowledge'
