## Install missing packages

In [None]:
!pip install fastcluster

## Import packages

In [None]:
import re
import gc
import fastcluster
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as patches

from itertools import chain
from scipy.stats import entropy
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier
from spacy import displacy

RANDOM_STATE = 3472

np.random.seed(RANDOM_STATE)

%matplotlib inline

## Introduction

### Background
The task is to find answers to a given medical question. Experts in the medical community are already familiar with this kind of task: they produce many systematic reviews every year (e.g. [Cochrane](https://www.cochranelibrary.com/) publishes a great number of them). A systematic review is a structured synthesis of the medical knowledge contained in existing studies. Researchers worldwide work in distributed environments, often simultaneously on the same experiments. As a result, many studies are published, often with contradictory or uncertain outcomes. The goal of a systematic review is to analyze the existing studies and draw valid conclusions from them.

#### What does the systematic review process look like?

1. A relevant subset of studies is retrieved using a search strategy. The most common form of the search strategy is a Boolean query for a specific database. In the competition, we are already given a set of studies.

2. The screening process takes place in two stages. In the first stage, reviewers screen articles based on titles and abstracts & decide which articles are relevant to the given question. Their goal is to minimize false negatives, i.e. the chance of missing a potentially relevant study. Because they don't have access to the full texts at this point, they are uncertain about their decisions, so in case of doubt they include the study in the next stage. This stage also saves money, because in many cases access to the full text is paid. In the second stage, the full texts are obtained and screened; keep in mind that reviewers now have access to the full information. Reviewers need to know the rules for assessing study relevance. The rules are expressed as inclusion & exclusion criteria and differ between the first and second stages of the screening process. Although in the competition we are not given detailed criteria, the task description itself is a combination of a question and inclusion criteria.

3. Reviewers extract information from the studies into a given extraction form. An extraction form is a set of fields that describe an object, e.g. a test article or a control group. In some cases, a risk-of-bias assessment is additionally performed to evaluate the certainty of the evidence (e.g. using the [ROBINS](https://www.riskofbias.info/) methodology).

4. Conclusions are drawn & guidelines are published. Many systematic reviews include a meta-analysis, i.e. a quantitative way to draw conclusions from data.

### Objective  
Each step exists to save reviewers' time in the succeeding steps. **<span style="color:green">Our goal is to save time during the screening step using machine learning methods.</span>**

### Methods
The dataset doesn't contain any expert knowledge, thus only pre-trained supervised models and unsupervised or semi-supervised models are applicable. Initially, we thought about using the [K Nearest Neighbours](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm for the given task description. The main practical issue with such an approach is that it isn't clear what K should be to minimize false negatives. Another, slightly different issue is that the task description comes from a different distribution than the studies.


**<span style="color:green">Our approach is to create clusters of studies & sample a subset of studies from each cluster. Each reviewer then goes through the sample & decides whether the cluster is worth further exploration. This allows reviewers to quickly discard some clusters and in turn save time.</span>** For example, if only 1 relevant study was found among 1000 screened documents from a cluster of 2000 studies, then exploring the rest of the cluster wouldn't be worthwhile. Ultimately the decision belongs to the reviewer, as is the case with every decision support system.
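The sampling step could look roughly like the sketch below. It is only an illustration on toy data: the helper name `sample_per_cluster`, the column name `study_id` and the sample size `n=2` are assumptions made for this example, not values used elsewhere in the notebook.

```python
import pandas as pd


def sample_per_cluster(df: pd.DataFrame, cluster_col: str = 'cluster',
                       n: int = 2, random_state: int = 3472) -> pd.DataFrame:
    """Draw up to `n` studies from every cluster (hypothetical helper)."""
    parts = [
        # Small clusters contribute all of their studies
        group.sample(n=min(n, len(group)), random_state=random_state)
        for _, group in df.groupby(cluster_col)
    ]
    return pd.concat(parts)


# Toy data: 7 studies spread over 3 clusters
toy = pd.DataFrame({
    'study_id': range(7),
    'cluster': [1, 1, 1, 2, 2, 3, 3],
})

screening_sample = sample_per_cluster(toy, n=2)
print(screening_sample.sort_values('cluster'))
```

A reviewer would screen only `screening_sample` first and then decide, cluster by cluster, whether to continue.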

#### Text representation

1. One of the most popular methods for text representation is TF-IDF. In theory it is language-independent, but in practice it requires a tokenization algorithm, which is language-dependent (see the [Japanese language](http://www.lrec-conf.org/proceedings/lrec2018/pdf/8884.pdf)). It works best if the least informative tokens and noise tokens are removed according to [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law). The main drawbacks of TF-IDF are that it doesn't understand language and that it is very sensitive to morphological changes. We decided to use stemming & lemmatization (see the function `simplify_text` in the previous notebook for specifics) to overcome those issues. But why do we use TF-IDF in the first place? Popularity alone is not a good reason; we have others:

    * Discussions with our reviewers show that they mostly look for keywords.
    * Our unpublished results for the screening task show that more complicated deep learning methods don't provide a meaningful advantage over simple methods (so far). We experimented with our custom datasets & [Cohen's datasets](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1447545/).
    * A screening task can be expressed as a classification task. As shown in the paper [Rethinking Complex Neural Network Architectures for Document Classification (Table 2)](https://cs.uwaterloo.ca/~jimmylin/publications/Adhikari_etal_NAACL2019.pdf), classical methods are still a good baseline.
    * The current state-of-the-art model, BERT, [doesn't work well in an unsupervised fashion](https://arxiv.org/abs/1908.10084). Some variations like SBERT have been explored, but to my knowledge there is no SBERT for the medical domain. You might find models online like [BioBERT](https://arxiv.org/abs/1901.08746) or [SciBERT](https://arxiv.org/abs/1903.10676), but they should suffer from similar issues as BERT.


2. Initially we wanted to experiment with medical sentence embeddings [called BioSentVec](https://github.com/ncbi-nlp/BioSentVec), which are used in [LitSense](https://www.ncbi.nlm.nih.gov/research/litsense/), but Kaggle's 20GB upload limit prevented us from uploading the model. We will work with BioWordVec instead, which is based on FastText.
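To illustrate why TF-IDF is sensitive to morphological changes, here is a minimal sketch on two toy sentences. The function `toy_stem` is a deliberately naive suffix stripper invented for this example; it is not the `simplify_text` function from the previous notebook, which uses proper stemming & lemmatization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def toy_stem(token: str) -> str:
    """Naive suffix stripping, for illustration only (hypothetical helper)."""
    for suffix in ('ing', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token


def tfidf_similarity(texts):
    """Cosine similarity between the TF-IDF vectors of two texts."""
    matrix = TfidfVectorizer().fit_transform(texts)
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])


raw = ['infects cell', 'infecting cells']
stemmed = [' '.join(toy_stem(token) for token in text.split()) for text in raw]

raw_similarity = tfidf_similarity(raw)          # no token in common
stemmed_similarity = tfidf_similarity(stemmed)  # both become "infect cell"

print(raw_similarity, stemmed_similarity)
```

Without normalization the two sentences share no token, so TF-IDF treats them as unrelated; after stripping the suffixes they map to the same vector.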


#### Clustering algorithms
There exist many clustering algorithms like [KMeans](https://en.wikipedia.org/wiki/K-means_clustering), [GMM](https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model), [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN), [OPTICS](https://en.wikipedia.org/wiki/OPTICS_algorithm), [hierarchical clustering algorithms](https://en.wikipedia.org/wiki/Hierarchical_clustering) and many more. A nice comparison can be found [here](https://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb) and [here](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). 

* DBSCAN and OPTICS return non-convex clusters & don't require deciding on the number of clusters upfront. At the same time, they require other hyperparameters & are, in my experience, very sensitive to changes in those hyperparameters.
* KMeans and GMM return convex clusters & require deciding on the number of clusters upfront. There is no easy way to make such a decision. There exist some heuristics, like the BIC & AIC criteria for methods that return a likelihood (GMM), elbow methods (KMeans), threshold-based methods (hierarchical clustering algorithms) or more general ones (Silhouette Score).

We find GMM useful because it is a probabilistic model, meaning that we can optimize the likelihood using variational learning. The concept of a likelihood also allows us to use cross-validation. The clusters are convex and follow the Gaussian shape. This is often considered its biggest drawback, but in practice it is not so harmful. Our biggest concern with GMM is the Mahalanobis distance, which is not well suited to TF-IDF features, because the TF factor depends on the text length. Cosine distance works much better with such features. We decided to:

1. Use hierarchical clustering algorithms for TF-IDF features with cosine distance. The distance threshold is chosen based on visualizations. The reason we chose such an algorithm is that it shows the "big picture" using either a distance matrix or a dendrogram. In my opinion it is the most general view of clustering.
2. Use GMM with hyperparameter search for BioWordVec features and Mahalanobis distance.
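As an illustration of what a hyperparameter search for GMM could look like, the sketch below selects the number of components with the BIC criterion mentioned above. It runs on synthetic 2D blobs rather than BioWordVec features, and the candidate range `1..5` and the `full` covariance type are assumptions made for this example only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(3472)

# Synthetic data: three well-separated Gaussian blobs (stand-in for real features)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([
    center + rng.normal(scale=1.0, size=(100, 2))
    for center in centers
])

candidates = [1, 2, 3, 4, 5]
bic_scores = {}
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=3472)
    gmm.fit(X)
    # Lower BIC means a better trade-off between fit and model complexity
    bic_scores[k] = gmm.bic(X)

best_k = min(bic_scores, key=bic_scores.get)
print(bic_scores)
print(f'Selected number of components: {best_k}')
```

The same loop could score held-out data with `gmm.score` instead of BIC if cross-validation on the likelihood is preferred.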

### Remarks
This work was done by the [Evidence Prime](https://evidenceprime.com/) team. We work on a project called [LASER](https://evidenceprime.com/laser/), which aims to track scientific literature and synthesize the knowledge about a given topic. We share our experience & knowledge in this notebook, hoping that it will inspire the community in the fight against COVID-19.

**<span style="color:red">The notebook builds upon our previous work, whose purpose was to clean the dataset. They should be evaluated together!</span>** [Link](https://www.kaggle.com/quittend/cleaning-cord-19-metadata/notebook)

## Read data

In [None]:
df = pd.read_csv('/kaggle/input/cleaning-cord-19-metadata/cord_metadata_cleaned.csv')

# The simplification algorithm returns NaN for some texts => no useful information
df = df.dropna(subset=['text_simplified']).reset_index(drop=True)

## TF-IDF & hierarchical clustering

### Create vocabulary

In [None]:
tokens = df['text_simplified'].str.split(' ').tolist()
tokens = pd.Series(chain(*tokens))
tokens_count = tokens.value_counts()
tokens_count

We decided not to truncate tokens from the top of the list, because it contains some important tokens like `viru`, `infect`, `protein`. They could be used as keywords in the screening process.

In [None]:
ax = tokens_count.plot(figsize=(15, 5))
ax.set_xlabel("Token")
ax.set_ylabel("Count")
ax.grid(True)

This is not the scale we need, so let's plot the counts again without the 10000 most frequent tokens.

In [None]:
ax = tokens_count[10000:].plot(figsize=(15, 5))
ax.set_xlabel("Token")
ax.set_ylabel("Count")
ax.grid(True)

### Obtain features
The 10000 most frequent tokens in the vocabulary seem to be enough.
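One way to sanity-check a vocabulary size is to measure what share of all token occurrences the most frequent tokens cover. The sketch below does this on a toy count series; the helper `vocabulary_coverage` and the toy counts are made up for illustration, and on the real data one would pass `tokens_count` and `top_n=10000`.

```python
import pandas as pd


def vocabulary_coverage(counts: pd.Series, top_n: int) -> float:
    """Share of all token occurrences covered by the `top_n` most frequent tokens."""
    ordered = counts.sort_values(ascending=False)
    return float(ordered.iloc[:top_n].sum() / ordered.sum())


# Toy counts (hypothetical values)
toy_counts = pd.Series({'viru': 50, 'infect': 30, 'protein': 15, 'rare1': 3, 'rare2': 2})

coverage = vocabulary_coverage(toy_counts, top_n=3)
print(coverage)  # 0.95
```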

In [None]:
tfidf_vectorizer = TfidfVectorizer(
    input='content',
    lowercase=False,
    preprocessor=lambda text: text,  
    tokenizer=lambda text: text.split(' '),
    token_pattern=None,
    analyzer='word',
    stop_words=None,
    ngram_range=(1, 1),
    max_features=10000,
    binary=False,
    norm='l2',
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=False,
)

features = tfidf_vectorizer.fit_transform(df['text_simplified'])
features.shape

### Find clustering threshold

In [None]:
features = features.astype('float32').toarray()

Calculating the distance matrix & linkage and plotting them takes time and memory, therefore we decided to work on a 10% sample first.

In [None]:
sample_size = 0.1
sample_mask = np.random.choice(
    a=[True, False], 
    size=len(features), 
    p=[sample_size, 1 - sample_size]
)

features_sample = features[sample_mask]
features_sample.shape

**Notes:**
* If you use the single-process implementation (`n_jobs == 1`), CPU usage is still 400% (a bug?)
* If you use the parallel implementation (`n_jobs != 1`), the diagonal of the distance matrix is not exactly zero (a bug?)
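The second note matters because SciPy's `squareform` checks by default that a square distance matrix is symmetric and has a zero diagonal. If the parallel implementation is used anyway, a small clean-up step like the following sketch could be applied first. The helper `sanitize_distance_matrix` and the noisy toy matrix are assumptions made for this example.

```python
import numpy as np
from scipy.spatial.distance import squareform


def sanitize_distance_matrix(dm: np.ndarray) -> np.ndarray:
    """Remove small numerical artifacts from a distance matrix (hypothetical helper)."""
    dm = np.asarray(dm, dtype=np.float64).copy()
    dm = (dm + dm.T) / 2.0          # enforce exact symmetry
    np.clip(dm, 0.0, None, out=dm)  # distances cannot be negative
    np.fill_diagonal(dm, 0.0)       # distance of a point to itself is zero
    return dm


# Toy matrix with a slightly non-zero diagonal and a slight asymmetry
noisy = np.array([
    [1e-7,      0.4,   0.9],
    [0.4000001, -1e-8, 0.7],
    [0.9,       0.7,   2e-7],
])

clean = sanitize_distance_matrix(noisy)
condensed = squareform(clean, force='tovector')
print(condensed)
```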

In [None]:
%%time
distance_matrix = pairwise_distances(features_sample, metric='cosine')

The `complete` linkage merges outliers late, because they would increase the maximum distances too much. Hence it is more robust to outliers (it prefers inliers first). This property minimizes the chance of finding a relevant document in an "irrelevant cluster".
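The behaviour described above can be seen on a tiny 1D example. This sketch uses `scipy.cluster.hierarchy.linkage` instead of `fastcluster` to stay dependency-free; the four toy points and the threshold `t=3.0` are chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Three nearby points and one outlier (index 3)
points = np.array([[0.0], [1.0], [2.5], [10.0]])

Z_toy = linkage(pdist(points), method='complete')
print(Z_toy)

# The outlier joins only in the very last merge, at the maximum pairwise distance
last_merge = Z_toy[-1]
print(f'Last merge at distance {last_merge[2]}')

# Cutting the tree below that distance keeps the outlier in its own cluster
labels = fcluster(Z_toy, t=3.0, criterion='distance')
print(labels)
```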

In [None]:
%%time 
distances = squareform(distance_matrix, force='tovector')
Z = fastcluster.linkage(distances, method='complete', preserve_input=True)

In [None]:
sns.clustermap(
    data=distance_matrix,
    col_linkage=Z, 
    row_linkage=Z,
    cmap=plt.get_cmap('RdBu'),
)

In [None]:
dissimilarities = pd.Series(distance_matrix.flatten())

In [None]:
ax = dissimilarities.hist(bins=100, figsize=(15, 5))
ax.set_xlabel("Cosine dissimilarity")
ax.set_ylabel("Count")
ax.grid(True)

In [None]:
ax = dissimilarities[dissimilarities >= 0.8].hist(bins=100, figsize=(15, 5))
ax.set_xlabel("Cosine dissimilarity")
ax.set_ylabel("Count")
ax.grid(True)

In [None]:
ax = dissimilarities[dissimilarities >= 0.95].hist(bins=100, figsize=(15, 5))
ax.set_xlabel("Cosine dissimilarity")
ax.set_ylabel("Count")
ax.grid(True)

At first glance, there are no similar documents (almost every pair is blue, i.e. the distance is close to 1.0). The reason is that the vector space is 10000-dimensional and the vectors are sparse. For example, if two documents have only 5 keywords in common out of 10000 "possible agreements", their cosine distance is still close to `1.0`. The `clustermap` colour scale is also skewed, because we have a distance of `0.0` on the diagonal. The clustermap would look more reasonable if you set `max_features=100` in `TfidfVectorizer`, but then you would discard some potentially important keywords.
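A small worked example makes the effect concrete. The numbers below are assumptions chosen for illustration: binary weights instead of TF-IDF weights, and 200 distinct tokens per document.

```python
import numpy as np
from scipy.spatial.distance import cosine

n_features = 10000    # size of the vocabulary
tokens_per_doc = 200  # hypothetical number of distinct tokens per document
shared = 5            # tokens that both documents have in common

doc_a = np.zeros(n_features)
doc_b = np.zeros(n_features)

# Document A uses tokens 0..199, document B uses tokens 195..394
doc_a[:tokens_per_doc] = 1.0
start_b = tokens_per_doc - shared
doc_b[start_b:start_b + tokens_per_doc] = 1.0

# Cosine similarity is 5 / 200, so the distance is 1 - 0.025
distance = cosine(doc_a, doc_b)
print(round(distance, 3))  # 0.975
```

With these assumptions, even a handful of shared keywords leaves the pair very close to the maximum distance, which matches the mostly blue clustermap.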

In [None]:
# Cluster features
clusters = fcluster(Z, t=0.999, criterion='distance')

# Plot clustermap
clustermap = sns.clustermap(
    data=distance_matrix,
    col_linkage=Z, 
    row_linkage=Z,
    cmap=plt.get_cmap('RdBu'),
)

# Draw clusters on the clustermap plot
cluster_mapping = dict(zip(range(len(features_sample)), clusters))
clustermap_clusters = pd.Series(
    [cluster_mapping[id_] for id_ in list(clustermap.data2d.columns)]
)

for cluster in set(clusters):
    cluster_range = list(clustermap_clusters[clustermap_clusters == cluster].index)
    clustermap.ax_heatmap.add_patch(
        patches.Rectangle(
            xy=(np.min(cluster_range), np.min(cluster_range)), 
            width=len(cluster_range), 
            height=len(cluster_range),
            fill=False,
            edgecolor='lightgreen',
            lw=2
        )
    )
    
print(f'There are {clustermap_clusters.nunique()} clusters.')

On average, each cluster should have at least 10 elements. More elements will be added to the current clusters once we perform clustering on the whole dataset.

### Cluster the whole dataset
We couldn't perform the calculation on the whole dataset due to memory issues. We decided to use the KNN method to assign clusters to the remaining 40k datapoints. Let's free some RAM first.

In [None]:
del clustermap
del distance_matrix
del distances
del Z

gc.collect()

Each cluster has on average at least 10 elements, hence to be safe we set `n_neighbors=5` to minimize uncertainty.

In [None]:
model = KNeighborsClassifier(n_neighbors=5, metric='cosine', n_jobs=-1)
model.fit(features_sample, clusters)

In [None]:
df['cluster'] = model.predict(features)

## Visualize data

### How many elements does each cluster have?

In [None]:
cluster_count = df['cluster'].value_counts().sort_values()

ax = cluster_count.plot(kind='bar', figsize=(15, 5))
ax.set_xticks([])
ax.set_xlabel("Cluster id")
ax.set_ylabel("Count")
ax.grid(True)

Clusters with only one element are useless in our solution (they don't save the reviewer's time), therefore we classify them all as noise. Although the number of clusters should be determined during clustering, we couldn't set a distance threshold higher than `0.999`, because we would end up with one big cluster. Instead, we correct for such degenerate clusters afterwards. How many elements should a cluster have? At least 2 to save any time, but we reuse the number of neighbors, i.e. `5`, as the cut-off: clusters with `5` or fewer elements are labelled as noise (`-1`). You can think of this as regularization (one shared hyperparameter).


In [None]:
noise_clusters = set(cluster_count[cluster_count <= 5].index)
noise_mask = df['cluster'].isin(noise_clusters)

df.loc[noise_mask, 'cluster'] = -1

In [None]:
cluster_count = df['cluster'].value_counts().sort_values()

ax = cluster_count.plot(kind='bar', figsize=(15, 5))
ax.set_xticks([])
ax.set_xlabel("Cluster id")
ax.set_ylabel("Count")
ax.grid(True)

### What does each cluster represent?
We saw some fancy kernels with PCA or t-SNE projections. First of all, we don't think that a 2D plot in which each datapoint is a study brings any value to reviewers. Note that it's hard enough for a data scientist to draw any [conclusions from such plots](https://distill.pub/2016/misread-tsne/). We decided instead to give reviewers a grasp of the keywords that are associated with each cluster.

In [None]:
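# `get_feature_names` was removed in scikit-learn 1.2; `get_feature_names_out` is available since 1.0
columns = np.array(tfidf_vectorizer.get_feature_names_out())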
top_k = 3

def describe(df: pd.DataFrame) -> pd.DataFrame:
    order = features[df.index].mean(axis=0).argsort()[::-1][:top_k]
    top_words = columns[order]
    
    cluster_id = df['cluster'].iloc[0]
    for i, word in enumerate(top_words):
        # For noisy clusters don't use keywords!
        df[f'word_{i + 1}'] = word if cluster_id != -1 else ''
        
    return df

df = df.groupby('cluster', group_keys=False).apply(describe)

In [None]:
df.filter(regex=r'text_simplified|word_\d+', axis=1)

Let's see an example of one cluster e.g. `10`.

In [None]:
cluster_id = 10
df_cluster = df.loc[df['cluster'] == cluster_id, :]

keywords = (df
    .loc[df['cluster'] == cluster_id, ['word_1', 'word_2', 'word_3']]
    .drop_duplicates()
    .values
    .tolist()[0]
) 

keywords

In [None]:
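elements = []
for _, text in df_cluster['text_simplified'].items():
    ents = []
    text = '• ' + text
    for keyword in keywords:
        # Escape the keyword so that regex metacharacters are matched literally
        for match in re.finditer(re.escape(keyword), text):
            start, end = match.span()
            ents.append({
                "start": start,
                "end": end, 
                "label": 'KEYWORD'
            })

    # Sort entities by start offset so they are rendered in text order
    ents.sort(key=lambda ent: ent["start"])
    elements.append({
        'text': text,
        "ents": ents,
    })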

In [None]:
displacy.render(elements, style="ent", jupyter=True, manual=True)

## Save results
A reviewer who works with this file should explore each cluster separately & decide, after `N` studies are screened, whether the cluster is worth further exploration.

In [None]:
(df
    .drop(columns=['title_lang', 'abstract_lang', 'distance'])
    .to_csv('/kaggle/working/cord_metadata_keywords.csv', index=False)
)
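The per-cluster decision described at the start of this section could be supported by a simple summary of the reviewer's first `N` decisions. The sketch below is only an assumption about how such a rule might look: the helper names and the `min_rate` threshold are hypothetical, and the final decision still belongs to the reviewer.

```python
from typing import Iterable


def observed_relevance_rate(decisions: Iterable[bool]) -> float:
    """Fraction of screened studies that the reviewer marked as relevant."""
    decisions = list(decisions)
    if not decisions:
        raise ValueError('At least one screened study is required.')
    return sum(bool(decision) for decision in decisions) / len(decisions)


def worth_exploring(decisions: Iterable[bool], min_rate: float) -> bool:
    """Suggest further exploration if the observed rate reaches `min_rate`."""
    return observed_relevance_rate(decisions) >= min_rate


# Example from the introduction: 1 relevant study among 1000 screened documents
screened = [True] + [False] * 999

rate = observed_relevance_rate(screened)
suggestion = worth_exploring(screened, min_rate=0.01)  # hypothetical threshold
print(rate, suggestion)
```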