<h1><span class="tocSkip"></span>Table of Contents</h1>
<div id="toc-wrapper"></div>
<div id="toc"></div>

# Introduction

The **COVID-19 pandemic** is an ongoing global pandemic of coronavirus disease 2019 (COVID‑19), caused by severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2).

The pandemic has caused global social and economic disruption, including the largest global recession since the Great Depression. [[Wikipedia]](https://en.wikipedia.org/wiki/COVID-19_pandemic)

In response to the **COVID-19 pandemic**, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) is a resource of over 158,000 scholarly articles, including over 75,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. 

By filtering the CORD-19 dataset and keeping only papers published after December 2019 that relate to COVID-19/SARS-CoV-2, we build a corpus of 5900 papers.

Our goal is to build a Search Engine (SE) over the COVID-19 corpus that takes into account linguistic phenomena such as synonymy and polysemy.


*Before we move on to the next section, let us prepare our framework by loading libraries and data, and by setting some queries...*

## Configuration class

We set variables such as where to load data from, where to store results, and some parameters (explained later).

In [None]:
class config():
    OUTPUT_DIR="/kaggle/working/"
    CORPUS_FN='/kaggle/input/cord-19-step2-corpus/corpus.pkl'
    ENM_FN='/kaggle/working/ranker_enm.pickle'
    TOC2_FN='/kaggle/input/toc2js/toc2.js'
    
    n_relevant = 150 # |R|
    ql_lambda = 0.4
    rm1_lambda = 0.6
    rm3_lambda = 0.8
    
    coherence_model = 'c_v'
    coherence_win = 110


## Libraries

### Our libraries

All our libraries are publicly available as open source.

In [None]:
import cord_19_container as container
import cord_19_rankers as rankers
import cord_19_lm as lm
import cord_19_vis as vis

from cord_19_container import Sentence, Document, Paper, Corpus

from cord_19_metrics import compute_queries_perf

from cord_19_helpers import load, save
from cord_19_text_cleaner import Cleaner
from cord_19_wn_phrases import wn_phrases

### Common libraries

In [None]:
from gensim import matutils
from gensim.models import (LdaModel, LsiModel, TfidfModel,
                           CoherenceModel, LogEntropyModel)

import numpy as np
import pandas as pd

import math
import operator
import logging

### Visualization libraries

In [None]:
%matplotlib inline

from IPython.display import display, HTML, Markdown, Latex

import matplotlib.pyplot as plt
import seaborn as sns

import pyLDAvis.gensim

HTML("""
<style>
.output_png {
    text-align: center;
    vertical-align: middle;
}

.rendered_html table{
    display: table;
}
</style>
""")

## Load data

Load the corpus of papers about COVID-19/SARS-CoV-2, built in our previous kernel.

In [None]:
corpus = load(config.CORPUS_FN)
dictionary = corpus.dictionary

# Rebuild id2token from token2id, only token2id is saved
for k,v in dictionary.token2id.items():
    dictionary.id2token[v]=k

# Share the dictionary as a module-level global (TODO: find a cleaner way)
container.dictionary = dictionary
rankers.dictionary = dictionary
vis.dictionary = dictionary

print(f'#Papers {len(corpus)}, #Tokens {len(dictionary)}')

## Queries

We define a set of [queries](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=568) to train our model. We use 17 queries; more would be better.


In [None]:
queries = [
    'Range of incubation periods for the disease in humans.',
    'Range of incubation periods for the disease in humans and how this varies across age and health status.',
    'How long individuals are contagious, even after recovery?',
    'Prevalence of asymptomatic shedding and transmission e.g. particularly children.',
    'Seasonality of transmission e.g. climate, humidity, temperature.',
    'Physical science of the coronavirus: Charge distribution, adhesion to hydrophilic hydrophobic surfaces.',
    'Physical science of the coronavirus: Environmental survival to inform \
decontamination efforts for affected areas and provide information about viral shedding.',
    'Persistence and stability on a multitude of substrates and sources \
e.g. nasal discharge, sputum, urine, fecal matter, blood.',
    'Persistence of virus on surfaces of different materials e.g. copper, stainless steel, plastic.',
    'Natural history of the virus and shedding of it from an infected person.',
    'Implementation of diagnostics and products to improve clinical processes.',
    'Disease models, including animal models for infection, disease and transmission.',
    'Tools and studies to monitor phenotypic change and potential adaptation of the virus.',
    'Immune response and immunity.',
    'Effectiveness of movement control strategies to prevent secondary transmission in health care and \
community settings.',
    'Effectiveness of personal protective equipment PPE and its usefulness to reduce risk of \
transmission in health care and community settings.',
    'Role of the environment in transmission.'
]

Define the queries corpus. For each query in `queries`:
- Clean;
- Tokenize;
- Merge multi-word expressions using WordNet.

In [None]:
queries_corpus = container.Corpus([container.Document([Cleaner(True).clean(q)]) for q in queries])
for doc in queries_corpus:
    doc.tokenize()
    wn_phrases(doc)

# Language models

Language modeling is a quite general formal approach to Information Retrieval (IR), with many variant realizations. The original and basic method for using language models in IR is the *query likelihood model*. [[4]](#r4)

## Query likelihood model

We estimate the likelihood of an individual document model generating the query as:

$\hat{P}(Q|D) = \prod_{q \in Q}\hat{P}(q|D)$

Using Bayes' rule (**Query generating document**), we have:

$P(D|Q) = \frac{P(Q|D)P(D)}{P(Q)}$

where:
- $P(Q)$ is constant for all documents, and so can be ignored;
- $P(D)$ is the prior probability that document $D$ is relevant;
- $Q$ is the query, a sequence of terms;
- $q$ is a term of the query;
- $D$ is the document, a sequence of terms.

$P(D)$ is uniform across all $D$ and is ignored for the moment, **<span style="color:red">but we can improve our SE by including it as authority/newness</span>**.


The classic problem with using language models is one of estimation (the $\hat{}$ symbol is used above to stress that the model is estimated): terms appear very sparsely in documents.

In particular, some words will not have appeared in the document at all, but are possible words for the information need, which the user may have used in the query. 

If we estimate $\hat{P}(q|D) = 0$ for a term missing from a document $D$, then we get a strict conjunctive semantics: documents will only give a query non-zero probability if all of the query terms appear in the document.
Regardless of the approach here, there is a more general problem of estimation: occurring words are also badly estimated; in particular,
the probability of words occurring once in the document is normally overestimated, since their one occurrence was partly by chance.

The answer to this is smoothing. There is a wide space of approaches to smoothing probability distributions to deal with this problem (three of them are presented in [[9]](#r9)); we use the simplest form, *Jelinek-Mercer*.


$P(q|D) = \lambda P_{MLE}(q|D) + (1-\lambda)P_{coll}(q)$

where:
- $\lambda$ is a free parameter in $[0,1]$, read from `config.ql_lambda` (0.4 in the configuration class above);
- $P_{MLE}(q|D)$ is the maximum likelihood estimate of the term in the document, with $P_{MLE}(q|D) = 0$ when $q \notin D$;
- $P_{coll}(q)$ is the relative frequency of the term in the collection as a whole.


*Most, if not all, of what we said in this section comes from the well-known online Stanford IR course "Introduction to Information Retrieval" (Section 12) [[4]](#r4).*

Function `cord_19_lm::compute_ql`
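
The smoothed query likelihood above can be sketched in a few lines. This is only an illustration on a toy collection, not the code of `cord_19_lm::compute_ql`; the helper name `jm_query_log_likelihood` and the toy documents are assumptions, and the sketch assumes every query term occurs at least once in the collection.

```python
import math
from collections import Counter

def jm_query_log_likelihood(query, doc, coll_tf, coll_len, lam=0.4):
    """log P(Q|D) with Jelinek-Mercer smoothing (hypothetical helper)."""
    doc_tf = Counter(doc)
    score = 0.0
    for q in query:
        p_mle = doc_tf[q] / len(doc)      # P_MLE(q|D), 0 when q is absent from D
        p_coll = coll_tf[q] / coll_len    # P_coll(q), assumed > 0
        score += math.log(lam * p_mle + (1 - lam) * p_coll)
    return score

# Toy collection of two tokenized documents.
docs = [["virus", "incubation", "period"], ["virus", "surface", "copper"]]
coll_tf = Counter(t for d in docs for t in d)
coll_len = sum(coll_tf.values())

query = ["incubation", "period"]
scores = [jm_query_log_likelihood(query, d, coll_tf, coll_len) for d in docs]
```

Thanks to the collection term, the second document still receives a finite (but lower) score although it contains none of the query terms.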

In [None]:
ranker_ql = rankers.ranker_QL(corpus, config.ql_lambda)

# Relevance Model

Relevance models are a form of massive query expansion through blind feedback. Constructing a relevance model entails first ranking the collection according to the maximum likelihood query model. [[13 Section 3.1.4]](#r13) [[6]](#r6)


## Relevance Model (RM1)

<br>
$P_{RM1}(w|Q) = \sum_{D \in R}P(w|D)P(D|Q)$
<br><br>
$P(w|D)$ is smoothed in the same way as $P(q|D)$ in the query likelihood model.

where:
- $R$ is the set of documents assumed to be relevant; its size $|R|$ is set by `config.n_relevant`

Function `cord_19_lm::compute_rm1`
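
As a rough illustration of the RM1 formula (not the code of `cord_19_lm::compute_rm1`), the sketch below assumes the smoothed document models and the query likelihoods are already available as NumPy arrays; the function name `rm1` and the toy numbers are hypothetical. With a uniform prior $P(D)$, $P(D|Q)$ is obtained by normalising $P(Q|D)$ over $R$.

```python
import numpy as np

def rm1(p_w_given_d, p_q_given_d):
    """P_RM1(w|Q) = sum over D in R of P(w|D) P(D|Q).

    p_w_given_d: (|R|, |V|) smoothed document language models.
    p_q_given_d: (|R|,) query likelihoods P(Q|D).
    """
    p_d_given_q = p_q_given_d / p_q_given_d.sum()  # uniform prior P(D)
    return p_d_given_q @ p_w_given_d

# Two feedback documents over a three-word vocabulary.
p_w_given_d = np.array([[0.5, 0.3, 0.2],
                        [0.1, 0.6, 0.3]])
p_q_given_d = np.array([0.03, 0.01])
p_rm1 = rm1(p_w_given_d, p_q_given_d)
```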

## Relevance Model (RM3)

RM3 is the most popular of the four Relevance Model variants (RM1, RM2, RM3 and RM4).

Sometimes the original query terms do not have the highest weights in the expanded query model, which is risky and problematic.

RM3 addresses this by interpolating RM1 with the original query's MLE. [[13 Section 3.1.4]](#r13) [[6]](#r6)

<br>
$P_{RM3}(w|Q) = \lambda P_{RM1}(w|Q) + (1-\lambda)P_{MLE}(w|Q)$

where:
- $\lambda$ is a free parameter in range [0,1] defined at `config.rm3_lambda`


Function `cord_19_lm::compute_rm3`
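
The interpolation can be sketched as follows. The function name `rm3` and the integer word ids are assumptions made for the illustration, and the query MLE is taken to be the relative frequency of each word in the query.

```python
import numpy as np

def rm3(p_rm1, query_ids, lam=0.8):
    """Interpolate RM1 with the query MLE; lam weights the RM1 part."""
    # P_MLE(w|Q): relative frequency of each vocabulary id in the query.
    p_mle_q = np.bincount(query_ids, minlength=len(p_rm1)) / len(query_ids)
    return lam * p_rm1 + (1 - lam) * p_mle_q

p_rm1 = np.array([0.4, 0.375, 0.225])
p_rm3 = rm3(p_rm1, query_ids=[2])  # the query consists of word id 2 only
```

In this toy case the query word has the lowest RM1 weight, and the interpolation brings it back to the top of the expanded model.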

## Topic models RM

Topic models can be combined with the RM query expansion method [[5]](#r5):

$P(w|Q) = \sum_{D \in R}\left(\lambda P_{RM3}(w|D)+(1-\lambda) P_{TM}(w|D,Q)\right) P(D|Q)$

$P_{TM}(w|D,Q) = \sum_{t_m}P(w|t_m)P(t_m|D,Q) \propto \sum_{t_m}P(w|t_m)P(t_m|D)P(Q|t_m)$

Note: we can use $P_{RM3}$ or $P_{RM1}$ depending on the problem. e.g., $P_{RM3}$ for query expansion.

where:
- $P(w|t_m)$ = word probability given a topic
- $P(t_m|D)$ = topic probability given a document
- $P(Q|t_m) \propto P(t_m|Q)$ = query likelihood given a topic

Function `cord_19_lm::compute_tm_rm`
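
A minimal sketch of the topic-model term $P_{TM}(w|D,Q)$, assuming the three factors are available as dense arrays. The name `p_tm` and the array layout are hypothetical and do not describe `cord_19_lm::compute_tm_rm`.

```python
import numpy as np

def p_tm(p_w_given_t, p_t_given_d, p_q_given_t):
    """P_TM(w|D,Q) for each document, normalised over the vocabulary.

    p_w_given_t: (K, |V|) word distribution of each topic.
    p_t_given_d: (|R|, K) topic distribution of each document.
    p_q_given_t: (K,) query likelihood under each topic.
    """
    weights = p_t_given_d * p_q_given_t   # (|R|, K): P(t|D) P(Q|t)
    unnorm = weights @ p_w_given_t        # (|R|, |V|)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

p_w_given_t = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.2, 0.7]])
p_t_given_d = np.array([[0.9, 0.1],
                        [0.5, 0.5]])
p_q_given_t = np.array([0.2, 0.8])
tm = p_tm(p_w_given_t, p_t_given_d, p_q_given_t)
```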

# Metrics

We need metrics to estimate the effectiveness of a search performed in response to a query when relevance judgments are not available. When labeled data is available, [Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_%28information_retrieval%29#Mean_average_precision) (AP) is the usual choice.

## Clarity score

The *clarity score* predicts query performance by computing the relative entropy between a query language model and the corresponding collection language model. The resulting *clarity score* measures the coherence of the language usage in documents whose models are likely to generate the query.

The *clarity score* for the query is simply the relative entropy, or Kullback-Leibler divergence, between the query and collection language models [[2]](#r2). In this kernel we use the [Jensen–Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) instead.

<br>
$clarity\;score=\sum_{w \in V}P_{RM1}(w|Q)log_2\frac{P_{RM1}(w|Q)}{P_{coll}(w)}$

where:
- $V$ = the entire collection vocabulary

Function `cord_19_metrics::compute_cs`
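
The displayed formula can be sketched directly. The example below implements the Kullback-Leibler form shown above, not the Jensen–Shannon variant, and the name `clarity_score` is an assumption. It also assumes $P_{coll}(w) > 0$ wherever the query model has mass.

```python
import numpy as np

def clarity_score(p_w_given_q, p_coll):
    """KL divergence (base 2) between query and collection language models."""
    mask = p_w_given_q > 0  # terms with zero query probability contribute 0
    ratio = p_w_given_q[mask] / p_coll[mask]
    return float(np.sum(p_w_given_q[mask] * np.log2(ratio)))

p_coll = np.array([0.25, 0.25, 0.25, 0.25])
focused = np.array([0.7, 0.1, 0.1, 0.1])   # query model concentrated on one term
vague = p_coll.copy()                       # query model identical to the collection

cs_focused = clarity_score(focused, p_coll)
cs_vague = clarity_score(vague, p_coll)
```

A query model that looks like the collection model scores zero, while a focused one scores higher.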

## UEF(Clarity)

Utility estimation framework (UEF) is a framework for the query-performance prediction task. The approach [[3]](#r3) is based on using statistical decision theory for estimating the utility that a document ranking provides with respect to an information need expressed by the query. To address the uncertainty in inferring the information need, we estimate utility by the expected similarity between the given ranking and those induced by relevance models; the impact of a relevance model is based on its presumed representativeness of the information need.

$U(R|I_Q) \approx Sim(\pi(R, \hat{R_Q}), R)P(\hat{R_Q}|I_Q)$

The re-ranking function $\pi(R, \hat{R_Q})$ scores each $D$ in $R$ by negative cross entropy (CE):

$Score_{CE} = \sum_{w \in D} P_{RM3}(w|Q) log P(w|D)$

where:
- $Sim$ is the similarity between ranked lists; we use Pearson's r correlation
- $\hat{R_Q}$ is an estimate of the true relevance model
- $P(\hat{R_Q}|I_Q)$ quantifies the extent to which a relevance-model estimate represents the information need $I_Q$; in our case it is the *clarity score*

Function `cord_19_metrics::compute_uef_cs`
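
Putting the pieces together, a UEF estimate can be sketched as below. The function name `uef`, its arguments, and the toy numbers are hypothetical, and the sketch assumes smoothed document models (no zero probabilities); it is not the code of `cord_19_metrics::compute_uef_cs`.

```python
import numpy as np

def uef(initial_scores, p_rm3, p_w_given_d, representativeness):
    """Pearson similarity between the initial ranking scores and the
    cross-entropy re-ranking scores, scaled by the representativeness
    of the relevance model (e.g. a clarity score)."""
    ce_scores = np.log(p_w_given_d) @ p_rm3              # Score_CE per document
    sim = np.corrcoef(initial_scores, ce_scores)[0, 1]   # Pearson's r
    return sim * representativeness

p_w_given_d = np.array([[0.6, 0.3, 0.1],
                        [0.3, 0.4, 0.3],
                        [0.1, 0.3, 0.6]])
p_rm3 = np.array([0.7, 0.2, 0.1])
initial_scores = np.array([-2.0, -3.0, -5.0])  # e.g. query log-likelihoods

u_agree = uef(initial_scores, p_rm3, p_w_given_d, representativeness=0.5)
u_disagree = uef(initial_scores[::-1], p_rm3, p_w_given_d, representativeness=0.5)
```

When the initial ranking agrees with the relevance-model re-ranking the estimate is positive; when it is reversed the estimate becomes negative.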

# Topic Coherence

We evaluate the topic models we use by their coherence, using the $C_v$ measure (window size 110) [[11]](#r11). $C_v$ achieves a strong average correlation of **0.73** with human ratings, and it is already implemented in *Gensim*.

In [None]:
def get_coherence(model):
    """Per-topic coherence, sorted in descending order.
    return: list of (topic_id, coherence) tuples
    """
    cm = CoherenceModel(model=model, texts=corpus.text,
                        coherence=config.coherence_model,
                        window_size=config.coherence_win)
    
    coherence_values = np.asarray(cm.get_coherence_per_topic())
    coherence = {k:v for k,v in enumerate(coherence_values)}
    
    coherence = sorted(coherence.items(), key=operator.itemgetter(1), reverse=True)
    
    return coherence

# Models

## Indexing by Latent Dirichlet Allocation (LDI)

In this section (the centerpiece of the kernel), we propose an LDA-based probabilistic topic search model that identifies a set of documents that closely matches a given set of query terms in a topic space. This method is called Indexing by Latent Dirichlet Allocation (LDI). [[1]](#r1)

Since LDA models documents as a mixture of topics, it provides a new approach for representing documents in a topic space where the topics can be seen as index terms for indexing. 

Since we aim to construct explicit document representations associated with topics, the method directly uses the $\beta$ matrix in the LDA model. The conditional probability $\beta_{jk}$ in LDA represents the selection probability of the word $w^j$ given a topic (concept) $z^k$. This value represents the probability of a word given a specific topic and it is used in identifying words that are associated with a topic. However, it may not be used as the probability of a topic given a word. Thus, for characterization, we define word representation in topic space, $W_j \in \mathbb{R} ^K$. The $k$-th component $W_{j}^{k}$ of $W_j$ represents the probability of word $w^j$ embodying the $k$-th concept $z^k$. This quantity can be obtained by Bayes’ rule as:

$$W_{j}^{k} = p(z^k=1|w^j=1)=\frac {p(w^j=1|z^k=1)p(z^k=1)} {\sum \limits _{h=1} ^{K} p(w^j=1|z^h=1)p(z^h=1) }.$$

We assume that the probability of a topic selection is uniformly distributed. It is conceivable that more sophisticated adaptive techniques for the probability of topic selection will result in a more accurate model. With this assumption, we obtain the probability of a word $w^j$ corresponding to a concept $z^k$ as

$$W^{k}_{j}=\frac {\beta_{jk}} {\sum \limits _{h=1} ^K \beta_{jh}}.$$

Furthermore, the documents can be represented in the topic space as well, $D_i \in \mathbb{R} ^K$. The $kth$ component $D^{k}_{i}$ of $D_i$ represents the probability of a concept $z^k$ given a document $d_i$ and it is expressed as

$$D^{k}_{i} \approx \tilde{D^{k}_{i}} = \frac {\sum _{w^j \in d_i} W ^{k} _{j} n_{ij}} {N_{d_i}}$$

where:
- $n_{ij}$ is the count of word $j$ in document $d_i$;
- $N_{d_i}$ is the number of words in document $d_i$;
- $K$ number of topics

$\beta$ can be computed using: `B = lda_model.state.get_lambda(); B = B / B.sum(axis=1)[:, None]`

The computation of $W$ and $D$ is not restricted to LDA; it can be done for many other models, provided that they can produce a $\beta$ matrix (probability of the word $w^j$ given a topic $z^k$).
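
The two formulas for $W$ and $\tilde{D}$ can be sketched with NumPy. The function names and the toy matrices are assumptions; `beta` is laid out as topics × words, like the normalised `B` above, and the sketch assumes every word has non-zero probability under at least one topic and every document is non-empty.

```python
import numpy as np

def ldi_word_topics(beta):
    """beta[k, j] = P(w_j | z_k). Returns W with W[j, k] = P(z_k | w_j)
    under a uniform topic prior."""
    return (beta / beta.sum(axis=0, keepdims=True)).T

def ldi_doc_topics(W, counts):
    """counts[i, j] = n_ij. Returns D with D[i, k] = sum_j W[j, k] n_ij / N_di."""
    return (counts @ W) / counts.sum(axis=1, keepdims=True)

beta = np.array([[0.6, 0.3, 0.1],    # topic 0 over three words
                 [0.2, 0.2, 0.6]])   # topic 1 over three words
W = ldi_word_topics(beta)

counts = np.array([[2, 1, 0],        # document 0 term counts
                   [0, 1, 3]])       # document 1 term counts
D = ldi_doc_topics(W, counts)
```

Each row of `W` and of `D` sums to one, so both can be read as distributions over topics.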

### LDA Training

We train our LDA model in an unusual way, not necessarily better, but mathematically simpler and with less stochasticity; we believe it suits a small corpus better. Please refer to [[10 Section 2.2]](#r10).

We train using standard (batch) variational Bayes (VB):
- Set $\kappa$ to zero (*Gensim*:lda:decay=0 means 'learning rate'=1);
- Set $S=D$ (*Gensim*:lda:chunksize=|corpus|).

We can set the LDA priors in five different ways (please refer to [[12]](#r12)):
- Symmetric priors over both $\Theta$ and $\Phi$ denoted **SS**, default option in *Gensim*:lda;
- Symmetric prior over $\Theta$ and asymmetric over $\Phi$ denoted **SA**;
- Asymmetric prior over $\Theta$ and symmetric over $\Phi$ denoted **AS**;
- Asymmetric learned prior over $\Theta$ and symmetric over $\Phi$ denoted **ASO**;
- Asymmetric priors over both $\Theta$ and $\Phi$ denoted **AA**.

Loosely speaking, the priors over $\Theta$ and $\Phi$ encode topic importance and word importance, respectively.

1. **SA** and **AA** produce the worst results, which is why they are not included in *Gensim*;
2. **AS** and **ASO** produce better results than **SS**;
3. There is no clear winner between **AS** and **ASO**;
4. The asymmetry learned by **ASO** could replace the uniformity assumption made in the previous section; topic coherence could serve the same purpose;
5. We use **ASO**.

In production we would train LDA until convergence, i.e. until the relative improvement in the maximization step falls below 1e-5. In this kernel we stop around 5e-2.


In [None]:
def train_lda(k):
    gensimLogger = logging.getLogger('gensim.models.ldamodel')
    gensimLogger.setLevel(logging.DEBUG)
    
    ch = logging.FileHandler(config.OUTPUT_DIR+f'lda.log', mode='w')
    formatter = logging.Formatter('%(asctime)s : %(levelname)s : %(message)s')
    ch.setFormatter(formatter)
    gensimLogger.addHandler(ch)

    lda_params = {'num_topics':k,
                  'eval_every':0,
                  'gamma_threshold':1e-5,
                  'iterations':5000,
                  'passes':50,
                  'offset':1,
                  'chunksize':len(corpus),
                  'decay':0., # kappa
                  'update_every':0,
                  'random_state':1,
                  'alpha':'auto', 'eta':'symmetric' # ASO
                 }

    lda_model = LdaModel(corpus.bow,
                        id2word=dictionary,
                        **lda_params)

    ch.close()
    gensimLogger.removeHandler(ch)
        
    return lda_model

In [None]:
%%time

lda_model = train_lda(100)

In [None]:
# Better in the future to use corpus.TRF
ranker_ldi = rankers.Ranker_LDI(lda_model, corpus.TF)

In [None]:
%%time

lda_top_topics = get_coherence(lda_model)

### LDA Intertopic Distance Map (Top 25)

In [None]:
top_topics_ids = [k for k,v in lda_top_topics[:25]]
pyvis = vis.prepare(ranker_ldi, corpus, dictionary=dictionary, top_topics=top_topics_ids)

html = vis.prepared_data_to_html(pyvis, visid='pylda_ldi')
display(HTML(html))

## Non-negative matrix factorization (NMF)

Non-negative matrix factorization (NMF or NNMF) is a group of algorithms in multivariate analysis and linear algebra where a matrix $V$ is factorized into (usually) two matrices $W$ and $H$, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect than LSI results [[15]](#r15).

NMF is an NP-hard problem; discussing the methods to solve it is beyond the scope of this kernel. Fortunately we have `sklearn.decomposition.NMF`.

We are interested in the non-negativity of both the input and the output:
- Our input $V$ is the term-document relative frequency matrix, whose elements are all **non-negative**;
- The output is **non-negative**, so we can easily turn it into probabilities and then apply the same process as LDI.

Empirical studies with LSI report that the Log and Entropy weighting functions work well, in practice, with many data sets [[14]](#r14).  We do the same here.

$V$ now is defined as

$g_i = 1 + \sum _j \frac {P_{ij} log(P_{ij})} {log(n)}$  
$V_{ij} = g_{i} log(tf_{ij} + 1)$

where:
- $tf_{ij}$ is the frequency of term $i$ in document $j$;
- $P_{ij}$ = $\frac{tf_{ij}}{\sum_j tf_{ij}}$;
- $i$ is the term index;
- $j$ is the document index;
- $n$ is the number of documents.

Thanks again to *Gensim*, there is no need to develop this from scratch: we use `gensim.models.LogEntropyModel`.
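
To make the weighting concrete, here is a small NumPy sketch of the two formulas above on a terms × documents count matrix. The function name `log_entropy` is an assumption, every term is assumed to occur at least once, and details such as the exact normalising constant may differ from what `gensim.models.LogEntropyModel` does internally.

```python
import numpy as np

def log_entropy(tf):
    """tf: (n_terms, n_docs) raw counts. Returns V with the same shape."""
    n_docs = tf.shape[1]
    p = tf / tf.sum(axis=1, keepdims=True)          # P_ij, per term over documents
    plogp = p * np.log(np.where(p > 0, p, 1.0))     # 0 * log(0) treated as 0
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)    # global weight per term
    return g[:, None] * np.log(tf + 1.0)            # local weight log(tf + 1)

tf = np.array([[1, 1, 1],    # term spread evenly over all documents
               [3, 0, 0]])   # term concentrated in a single document
V = log_entropy(tf)
```

A term spread evenly over all documents gets a global weight of zero, while a term concentrated in one document keeps its full local weight.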

### NMF Training

We train our NMF model with mostly default *sklearn* parameters; only the number of components, `max_iter` and `random_state` are set.

In [None]:
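def train_nmf(k, nmf_pre, max_iter=400):
    from sklearn import decomposition

    # corpus2csc returns a sparse terms x documents matrix (CSC format);
    # transpose it to documents x terms, the layout sklearn expects.
    X_csc = matutils.corpus2csc(nmf_pre[corpus.bow])
    X = X_csc.T.toarray()

    clf = decomposition.NMF(n_components=k, max_iter=max_iter, random_state=1, verbose=0)

    # W: documents x topics (unused here), H: topics x terms
    W = clf.fit_transform(X)
    H = clf.components_

    return H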

In [None]:
%%time

nmf_pre = LogEntropyModel(corpus.bow)

H1 = train_nmf(200, nmf_pre, max_iter=500)

In [None]:
# We need an interface compatible with Gensim, used for coherence
nmf_model = rankers.NMFModel(H1.T)
ranker_nmf = rankers.Ranker_NMF(H1, corpus, nmf_pre)

In [None]:
%%time

nmf_top_topics = get_coherence(nmf_model)

### NMF Intertopic Distance Map (Top 25)

In [None]:
top_topics_ids = [k for k,v in nmf_top_topics[:25]]
pyvis = vis.prepare(ranker_nmf, corpus, dictionary=dictionary, top_topics=top_topics_ids)

html = vis.prepared_data_to_html(pyvis, visid='pylda_nmf')
display(HTML(html))

## Latent semantic indexing (LSI)

LSI [[14]](#r14) uses common linear algebra techniques to learn the conceptual correlations in a collection of text. In general, the process involves constructing a weighted term-document matrix, performing a Singular Value Decomposition on the matrix, and using the matrix to identify the concepts contained in the text.

$A=U \Sigma V^T$

Truncated version to r (number of topics):

$A \approx A_{r} = U_{r} \Sigma_{r} V_{r}^T $

where:
- $A$ = matrix [m,n] of term frequencies, m is the number of unique terms, and n is the number of documents
- $U$ = term-concept vector matrix [m,r], when truncated r is number of topics
- $\Sigma$ = singular values [r,r], sorted by magnitude
- $V$ = concept-document vector matrix [n,r]

The problem with LSI is that the resulting dimensions of $U$ might be difficult to interpret, because they contain both positive and negative values. In practice we usually take the magnitude as the importance of a word in a topic.
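
The truncation step can be illustrated with plain NumPy on a tiny term-document matrix. This is only a sketch of the decomposition itself (Gensim's `LsiModel` uses its own algorithm for large corpora), and the helper name `truncated_svd` is an assumption.

```python
import numpy as np

def truncated_svd(A, r):
    """Keep the r largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    return U[:, :r], s[:r], Vt[:r, :]

# 4 terms x 3 documents
A = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 1., 0.]])

U_r, s_r, Vt_r = truncated_svd(A, r=2)
A_r = U_r @ np.diag(s_r) @ Vt_r        # rank-2 approximation of A
error = np.linalg.norm(A - A_r)        # Frobenius norm of the residual
```

The residual equals the discarded singular value, which is the sense in which $A_r$ is the best rank-$r$ approximation.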

In [None]:
lsi_pre = LogEntropyModel(corpus.bow)
lsimodel = LsiModel(lsi_pre[corpus.bow], id2word=dictionary, power_iters=10,
                    num_topics=200, onepass=False)

ranker_lsi = rankers.Ranker_LSI(corpus, lsimodel, lsi_pre)

In [None]:
%%time

lsi_top_topics = get_coherence(lsimodel)

## TF-IDF

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. [[Wikipedia]](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

There are many weighting variants in the vector space model (see the [SMART](https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System) notation); we use the 'nfc' form.
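
As a hedged illustration of what 'nfc' stands for (natural term frequency, idf, cosine normalisation), the sketch below weights a small documents × terms count matrix. The function name `tfidf_nfc` is hypothetical, the base-2 logarithm is an assumption about the idf variant, and every term is assumed to occur in at least one document.

```python
import numpy as np

def tfidf_nfc(tf):
    """tf: (n_docs, n_terms) raw counts. n = raw tf, f = idf, c = cosine norm."""
    n_docs = tf.shape[0]
    df = (tf > 0).sum(axis=0)                  # document frequency per term
    idf = np.log2(n_docs / df)                 # assumed idf variant
    weighted = tf * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.where(norms > 0, norms, 1.0)  # avoid dividing by zero

tf = np.array([[1, 2, 0],
               [1, 0, 3]])
X = tfidf_nfc(tf)
```

The first term occurs in every document, so its idf and therefore its weight are zero.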

In [None]:
ranker_tfidf = rankers.Ranker_TFIDF(corpus, smartirs='nfc')

# Topic models comparison

## Clusters of documents

In [None]:
# https://stackoverflow.com/questions/25812255/row-and-column-headers-in-matplotlibs-subplots

fig, axes = plt.subplots(3, 3, gridspec_kw={"height_ratios":[0.01,1,1],
                                            "width_ratios": [0.02,1,1]},
                        figsize=(14,14))

axes = axes.flatten()
"""
0 1 2
3 4 5
6 7 8
"""

# Columns titles
# 1st column
axes[1].axis('off')
axes[1].set_xticks([]); axes[1].set_yticks([])
axes[1].set_title('NMF', loc='left', fontweight='bold')
# 2nd column
axes[2].axis('off')
axes[2].set_title('LDI', loc='left', fontweight='bold')

# Rows titles
# Empty
axes[0].axis("off")
# 1st row
axes[3].axis("off")
axes[3].annotate('The visualization of the clustered data.',
                 xy=(0, 0), xycoords='data',
                 rotation='vertical',
                 fontsize='large', fontweight='bold')
# 2nd row
axes[6].axis("off")
axes[6].annotate('The silhouette plot for the various clusters.',
                 xy=(0, 0), xycoords='data',
                 rotation='vertical',
                 fontsize='large', fontweight='bold')

vis.plot_tm_clusters(ranker_nmf, axes[4], axes[7], set_y_label=False)
vis.plot_tm_clusters(ranker_ldi, axes[5], axes[8], set_y_label=True)

fig.suptitle(f'20 Clusters of documents projected into t-SNE (2 subspaces)')

#fig.tight_layout()
plt.show()

## Dominant topics

In [None]:
fig_1, (ax1_nmf, ax1_ldi) = plt.subplots(1, 2, figsize=(14,6), sharey=True)
fig_2, (ax2_nmf, ax2_ldi) = plt.subplots(1, 2, figsize=(14,6), sharey=True)

vis.plot_topics_dist(ranker_nmf, corpus, ax1_nmf, ax2_nmf, "NMF", set_y_label=True)
fig_1.suptitle(f'Number of Documents by Dominant Topic.')

vis.plot_topics_dist(ranker_ldi, corpus, ax1_ldi, ax2_ldi, "LDI", set_y_label=False)
fig_2.suptitle(f'Mean topic probability over corpus.')

plt.show()

## Coherence

In [None]:
nmf_coh = np.asarray([coherence for _, coherence in nmf_top_topics])
lda_coh = np.asarray([coherence for _, coherence in lda_top_topics])
lsi_coh = np.asarray([coherence for _, coherence in lsi_top_topics])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[10, 4])

ax1.plot(nmf_coh, label=f'NMF \u03BC={nmf_coh.mean():.2f}')
ax1.plot(lda_coh, label=f'LDA \u03BC={lda_coh.mean():.2f}')
ax1.plot(lsi_coh, label=f'LSI \u03BC={lsi_coh.mean():.2f}')
ax1.axhline(y=0.5, color='grey', linestyle='dashed')
ax1.set_xlabel('topic')
ax1.set_ylabel('coherence')
ax1.set_title('Topic/Model sorted by coherence')
ax1.legend()

sns.kdeplot(nmf_coh, shade=True, cut=0, ax=ax2, label=f'NMF')
sns.kdeplot(lda_coh, shade=True, cut=0, ax=ax2, label=f'LDA')
sns.kdeplot(lsi_coh, shade=True, cut=0, ax=ax2, label=f'LSI')
ax2.set_title('Topic/Model KDE')

plt.show()

## High ranked topics

In [None]:
def get_topics_tbl(X, n_topics=10, n_words=5, best=True):
    columns = []
    tbl_data = dict()
    for name, top_topics, model in X:
        if best:
            top_topics = top_topics[:n_topics]
        else:
            top_topics = top_topics[-n_topics:]
            
        for tid, coherence in top_topics:
            words = '\n'.join([w for w,p in model.show_topic(tid, topn=n_words)])
            
            if name not in tbl_data:
                tbl_data[name] = {'Coherence': [],
                                  'Topics': []}
            
            tbl_data[name]['Coherence'].append(coherence)
            tbl_data[name]['Topics'].append(words)
        
    # Pandas: Multilevel column names
    # https://stackoverflow.com/questions/21443963/pandas-multilevel-column-names
    d = {k:pd.DataFrame(v) for k,v in tbl_data.items()}
    
    return pd.concat(d, axis=1)

In [None]:
df = get_topics_tbl([('NMF', nmf_top_topics, nmf_model),
                     ('LDI', lda_top_topics, lda_model),
                     ('LSI', lsi_top_topics, lsimodel)],
                   best=True)

df_style = (df.style
            .set_properties(**{
                'text-align': 'left',
                'white-space': 'pre-wrap'})
            .set_table_styles([dict(selector='th',
                                    props=[('text-align', 'center')])]))

HTML('<center>' + df_style.render() + '</center>')

## Low ranked topics

In [None]:
df = get_topics_tbl([('NMF', nmf_top_topics, nmf_model),
                     ('LDI', lda_top_topics, lda_model),
                     ('LSI', lsi_top_topics, lsimodel)],
                   best=False)

df_style = (df.style
            .set_properties(**{
                'text-align': 'left',
                'white-space': 'pre-wrap'})
            .set_table_styles([dict(selector='th',
                                    props=[('text-align', 'center')])]))

HTML('<center>' + df_style.render() + '</center>')

# Ensemble Model (EnM)

The Ensemble Model (EnM) ranks documents according to the summation of weighted similarity values that are computed by constituent indexing models. [[1]](#r1)
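
The weighted summation can be sketched as follows. The function name `enm_scores`, the score values and the weights are made up for the illustration and do not reflect `rankers.Ranker_EnM` or the weights learned later.

```python
import numpy as np

def enm_scores(model_scores, alpha):
    """Combine per-model similarity scores with per-model weights."""
    names = sorted(model_scores)
    stacked = np.stack([model_scores[n] for n in names])  # (n_models, n_docs)
    weights = np.array([alpha[n] for n in names])
    return weights @ stacked

# Hypothetical similarity of one query to three documents, per model.
model_scores = {'TFIDF': [0.2, 0.9, 0.1],
                'LDI':   [0.6, 0.3, 0.1],
                'NMF':   [0.5, 0.4, 0.2]}
alpha = {'TFIDF': 0.5, 'LDI': 0.3, 'NMF': 0.2}

combined = enm_scores(model_scores, alpha)
ranking = np.argsort(-combined)  # document indices, best first
```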



In [None]:
ranker_enm = rankers.Ranker_EnM({#'QL':ranker_ql,
                                 'TFIDF':ranker_tfidf,
                                 'LDI':ranker_ldi,
                                 #'LSI':ranker_lsi,
                                 'NMF':ranker_nmf}, renorm=False)

## Ensemble Model Boosting (EnM.B)

Algorithm EnM.B is developed within the boosting scheme that utilizes the competition between constituent models and training data sets to iteratively update the weights until the game reaches equilibrium. [[1]](#r1)

***
$\mathbf{\text{EnM.B: A boosting algorithm for training the EnM.}}$<br>
***
**Require:** Query set $Q$, a set of basis models $\Phi$, a duplicate set of models $\Phi^\prime = \Phi$.  
&emsp;Initialize weights $\mathfrak{D}^1$ over all queries with a uniform distribution, i.e., $\mathfrak{D}^1_i = 1/|Q|$.  
&emsp;Initialize the $\alpha$'s with zeros.  
&emsp;Set the initial performance measure $E^0$.


&emsp;**while** $|E^t - E^{t-1}| > \epsilon$ **do**  
&emsp;&emsp;**if** $\Phi^\prime = \emptyset$ **then**  
&emsp;&emsp;&emsp;$\Phi^\prime = \Phi$;  
&emsp;&emsp;**end if**  
&emsp;&emsp;Select basis models $\phi^t \in \Phi^\prime $ with weights $\mathfrak{D}^t$ on training queries using:  
$$j^{*} = \underset{j}{\textrm{arg max}}\sum \limits_{i=1} ^{|Q|} \mathfrak{D}_i P(\phi_{ji});$$  

&emsp;&emsp;Update the weight $\alpha = \alpha + \delta_{j^{*}} e_{j^{*}}$ using:  
$$\delta_{j} = \frac{1}{2} log \frac{\sum \limits_{i=1} ^{|Q|} \mathfrak{D}_i (1+P(\phi_{ji}))}{\sum \limits_{i=1} ^{|Q|} \mathfrak{D}_i (1-P(\phi_{ji}))};$$  

&emsp;&emsp;Compute the $MP$ $E^t$ with $EnM$ $H^t$;  

&emsp;&emsp;**if** $|E^t - E^{t-1}| > \epsilon$ **then**  
&emsp;&emsp;&emsp;$\Phi^\prime = \Phi^\prime \setminus \phi^t;$  
&emsp;&emsp;&emsp;Update $\mathfrak{D}^{t+1}$ using:  
$$\mathfrak{D}_{i} = \frac {exp(-P(h_i))}{Z}, Z = \sum \limits_{i=1} ^{|Q|} exp(-P(h_i));$$
&emsp;&emsp;**end if**  
&emsp;**end while**  
&emsp;return $EnM$ $H$.
***

Useful links:
- [Writing Math Equations in Jupyter Notebook: A Naive Introduction](https://medium.com/analytics-vidhya/writing-math-equations-in-jupyter-notebook-a-naive-introduction-a5ce87b9a214)
- [Math symbols defined by LaTeX package «amsfonts»](http://milde.users.sourceforge.net/LUCR/Math/mathpackages/amsfonts-symbols.pdf)
- [7 Essential Tips for Writing With Jupyter Notebook](https://towardsdatascience.com/7-essential-tips-for-writing-with-jupyter-notebook-60972a1a8901)

In [None]:
def EnM_b(ranker_enm, queries_corpus, kind='cs', max_iters=150, eps=1e-6):
    models = ranker_enm.models_name
    
    models_dup = models.copy()
    
    alpha = np.zeros(len(models))
    q_weight = np.ones(len(queries_corpus)) / len(queries_corpus)
    
    query_likelihood = ranker_ql[queries_corpus]
    _ = ranker_enm[queries_corpus]
    
    performances = compute_queries_perf(corpus, queries_corpus, ranker_enm.scores, query_likelihood,
                                        kind=kind,
                                        rm1_lambda=config.rm1_lambda, rm3_lambda=config.rm3_lambda)
    print('Models:', models)
    print('Queries perf:\n')
    with np.printoptions(precision=2):
        print(performances)
    print('Winner:', np.argmax(performances, 0))
    print('')
    
    E_prev = 0
    E_best = 0
    E_hist = []
    delta = []
    iteration = 0
    
    while True:
        if len(models_dup) == 0:
            models_dup = models.copy()
        
        #Eq 37
        dup_performances = performances[ranker_enm.get_models_index(models_dup)]
        e = (dup_performances*q_weight).sum(1)
        
        j_star = np.argmax(e)
        
        #Eq 35
        sigma = 0.5*math.log((q_weight*(1+dup_performances[j_star,:])).sum() / 
                             (q_weight*(1-dup_performances[j_star,:])).sum() )
        
        g_j_star = ranker_enm.get_model_index(models_dup[j_star])
        alpha[g_j_star] += sigma
        
        ranker_enm.set_alpha(alpha)
        enm_perf = compute_queries_perf(corpus, queries_corpus, ranker_enm.combine_scores(), 
                                        query_likelihood,
                                        kind=kind,
                                        rm1_lambda=config.rm1_lambda, rm3_lambda=config.rm3_lambda)
        
        E = enm_perf.mean()
        E_hist.append(E)
        
        if E>E_best:
            E_best = E
        
        print(f'iter:{iteration:4d} E:{E:.6f} E_best:{E_best:.6f} j_star:{g_j_star} alpha:{alpha}')
        
        if abs(E - E_prev) > eps and iteration<max_iters:
            E_prev = E
            del models_dup[j_star]
            #Eq 33
            q_weight = np.exp(-enm_perf)
            q_weight[:] /= q_weight.sum()
        else:
            # Maybe (E - E_prev) < eps should be true for all models, before we break the loop
            break
        
        iteration += 1
        
    alpha = alpha/sum(alpha)
    
    return alpha, E_hist

In [None]:
%%time

alpha, E_hist = EnM_b(ranker_enm, queries_corpus, kind='uef', max_iters=151)
ranker_enm.set_alpha(alpha)

save(ranker_enm, config.ENM_FN)

In [None]:
df=pd.DataFrame([alpha], columns=ranker_enm.models_name)

html = (df.style
        .format('{:.2f}')
        .hide_index()
        .set_caption("Normalized alpha")
        .render())
display(HTML('<center>'+html+'</center>'))

In [None]:
plt.figure(figsize=[5,4])
plt.plot(np.asarray(E_hist))
plt.xlabel('round')
plt.ylabel('mean UEF(cs)')
plt.title('Learning curve of EnM.B')
plt.tight_layout(pad=0)
plt.show()

# Results

We collect links to the SE results on this [page](https://www.kaggle.com/atmarouane/covid-19-se-results-dispatcher), which will be updated frequently.

# TODO

1. There is a serious, high-quality set of human relevance judgements for more than 40 queries on CORD-19 at [TREC-COVID](https://ir.nist.gov/covidSubmit/index.html); it would be better to switch the metric from **UEF(Clarity) to MAP**;
2. Review the code to make it simpler and more compact;
3. Use best practices to rearrange and comment the code.

# References

1. <a id="r1"></a>[Indexing by Latent Dirichlet Allocation and Ensemble Model](https://arxiv.org/pdf/1309.3421.pdf)
2. <a id="r2"></a>[Predicting Query Performance](https://dl.acm.org/doi/pdf/10.1145/564376.564429)
3. <a id="r3"></a>[Using Statistical Decision Theory and Relevance Models for Query-Performance Prediction](https://dl.acm.org/doi/pdf/10.1145/1835449.1835494)
4. <a id="r4"></a>[Language models for information retrieval](https://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf)
5. <a id="r5"></a>[Topic Models and Its Applications](https://staff.fnwi.uva.nl/e.kanoulas/wp-content/uploads/Lecture-5-1-Topic-Models-for-IR.pdf)
6. <a id="r6"></a>[Lecture 10: Query Expansion & Relevance Feedback](http://people.cs.vt.edu/~jiepu/cs5604_fall2018/10_qm.pdf)
7. <a id="r7"></a>[Relevance-Based Language Models](https://dl.acm.org/doi/pdf/10.1145/383952.383972)
8. <a id="r8"></a>[LDA-Based Document Models for Ad-hoc Retrieval](http://ciir.cs.umass.edu/pubfiles/ir-464.pdf)
9. <a id="r9"></a>[A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.8019&rep=rep1&type=pdf)
10. <a id="r10"></a>[Online Learning for Latent Dirichlet Allocation](https://www.di.ens.fr/~fbach/mdhnips2010.pdf)
11. <a id="r11"></a>[Exploring the Space of Topic Coherence Measures](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf)
12. <a id="r12"></a>[Rethinking LDA: Why Priors Matter](https://people.cs.umass.edu/~wallach/publications/wallach09rethinking.pdf)
13. <a id="r13"></a>[UMass at TREC 2004: Novelty and HARD](https://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf)
14. <a id="r14"></a>[Latent semantic indexing](https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing)
15. <a id="r15"></a>[Non-negative matrix factorization](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)


In [None]:
from IPython.display import HTML

with open(config.TOC2_FN, 'r') as file:
    js = file.read()

    display(HTML('<script type="text/Javascript">'+js+'</script>'))
    
    del js

In [None]:
%%javascript

// Autonumbering & Table of Contents
// Using: https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/toc2
table_of_contents(default_cfg);

// Work around an unknown problem: sometimes, in 'notebookviewer.js', in the function 'setAnchorsToScrollIntoView', 't.hash' is undefined.
Array.from(document.getElementsByTagName("a")).forEach(function(t) {
    var e = t.hash;
    if (e === undefined) {
        t.hash = '';
    }
})