# Overview: Description of Project and Methods Used

In this notebook, we explore the 51,045 COVID-19 articles and provide a more efficient way to look for answers. We use the SPECTER embeddings of the titles and abstracts provided in the most recent version of the COVID-19 project file. For all dataset records, results are produced and displayed in the following order:
* Reading and indexing SPECTER embeddings
* PCA and t-SNE plots displaying results for three clustering methods (Fuzzy C-means clustering, K-means clustering and Hierarchical clustering)
* The titles of the top 5 closest points to the centroids for two clustering methods (Fuzzy C-means clustering and K-means clustering), to show similarity within clusters and dissimilarity among clusters.

To run Hierarchical clustering efficiently on the Kaggle kernel, we narrowed the dataset down to 1,000 records and provided the code for this task. We also provide the resulting hierarchical clusters and plots for the full dataset, which we processed in the same way as the sample records, only at full scale.
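
A rough back-of-the-envelope estimate suggests why the full dataset is hard to cluster hierarchically in a kernel. The sketch below assumes the algorithm holds a condensed pairwise distance matrix of 8-byte floats in memory; that is an assumption about the implementation, not something measured in this notebook, and `pairwise_distance_bytes` is a hypothetical helper.

```python
def pairwise_distance_bytes(n_records, bytes_per_value=8):
    """Estimate memory for a condensed pairwise distance matrix.

    Assumption: one float64 value per unordered pair of records,
    i.e. n * (n - 1) / 2 values in total.
    """
    n_pairs = n_records * (n_records - 1) // 2
    return n_pairs * bytes_per_value


for n in (1_000, 51_045):
    gigabytes = pairwise_distance_bytes(n) / 1e9
    print(f"{n:>6} records -> {gigabytes:.3f} GB")
```

Under this assumption the cost grows quadratically with the number of records, which is why a 1,000-record sample is far cheaper than the full set.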

We have also provided some evaluation in the form of comparisons between the clustering methods used, as well as some observations and next steps. As a preview, K-means clustering seems to be the winner in terms of clear separation between clusters.

In [None]:
%%capture
# create conda environment with required packages
# this takes ~10-15 mins
!conda create -n alcapone_rapids -c rapidsai -c nvidia -c conda-forge rapids scikit-fuzzy python=3.6 cudatoolkit=10.1 -y

In [None]:
# this is to make the conda packages accessible
import sys
sys.path = ["/opt/conda/envs/alcapone_rapids/lib/python3.6/site-packages"]+ sys.path
sys.path = ["/opt/conda/envs/alcapone_rapids/lib/python3.6"] + sys.path
sys.path = ["/opt/conda/envs/alcapone_rapids/lib"] + sys.path

In [None]:
import json
import numpy as np
import pandas as pd
import torch
from torch.nn.utils.rnn import pad_sequence
import skfuzzy as fuzz
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import distance as eudist
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
import csv
from cuml.manifold import TSNE as cTSNE
from cuml import KMeans as cKMeans
from IPython.display import Image, display

In [None]:
# define global variables
ROOT_PATH = '/kaggle/input/CORD-19-research-challenge'  # no trailing slash; paths below add their own
METADATA_PATH = f'{ROOT_PATH}/metadata.csv'

In [None]:
# load metadata into a df and look at the contents
meta_df = pd.read_csv(METADATA_PATH, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

# Processing SPECTER Embeddings

In this section, we use the full dataset to compare different methods of clustering (K-Means, Fuzzy C-Means, Hierarchical).

For this task, we used the SPECTER embeddings as described in the upcoming ACL paper found [here](https://github.com/allenai/paper-embedding-public-apis#specter).


In [None]:
def create_embedding_dict(filepath, sample_size=None):
    """create embedding dictionary from file at given filepath"""

    embedding_dict = {}
    with open(filepath) as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for i, row in enumerate(reader):
            # exit the loop if the desired sample size is reached
            if sample_size and i == sample_size:
                break
            embed = np.zeros((768,))
            for idx, val in enumerate(row):
                if idx > 0:
                    embed[idx-1] = float(val)
            embedding_dict[row[0]] = embed
    return embedding_dict

In [None]:
embedding_dict = create_embedding_dict(f'{ROOT_PATH}/cord19_specter_embeddings_2020-04-10/cord19_specter_embeddings_2020-04-10.csv',
                                       sample_size=None
                                      )
embedding_mat = np.array(list(embedding_dict.values()))
embedding_mat.shape, len(embedding_dict)

# Clustering 

## Fuzzy C-Means 
We first chose Fuzzy C-means because it is an unsupervised method that allows a record to belong to more than one cluster. It is probable that the semantics of COVID-19 paper titles and subject matter are similar enough to warrant membership in multiple clusters. (Note that we drop this assumption below in K-means.) The evaluation on the full set of records shows that Fuzzy C-means can be used with some degree of success to cluster this data. See below for the setup of our clusters, their centroids, and the results for each.
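
To make the idea of membership in multiple clusters concrete, here is a minimal NumPy sketch of the standard Fuzzy C-means membership update for fixed centroids. It is an illustration only, not the scikit-fuzzy implementation used below; `soft_memberships` and the toy data are hypothetical.

```python
import numpy as np


def soft_memberships(points, centroids, m=2.0):
    """Return a (n_clusters, n_points) membership matrix for fixed centroids.

    Each column sums to 1; m is the fuzzifier (m=2 matches the setting below).
    """
    # pairwise Euclidean distances, shape (n_clusters, n_points)
    dists = np.linalg.norm(centroids[:, None, :] - points[None, :, :], axis=2)
    # guard against division by zero when a point sits on a centroid
    dists = np.fmax(dists, np.finfo(float).eps)
    inv = dists ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)


toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
toy_centroids = np.array([[0.0, 0.0], [4.0, 0.0]])

u = soft_memberships(toy_points, toy_centroids)
# hard labels are obtained the same way as in this notebook: argmax per point
hard_labels = np.argmax(u, axis=0)
print(np.round(u, 3))
print(hard_labels)
```

The middle point belongs mostly, but not entirely, to the first cluster; taking the `argmax` discards that partial membership.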

In [None]:
n_clusters = 10

In [None]:
def fuzzy_clustering(all_embedding, n_clusters):
    """returns clusters and centroids as results of fuzzy c-means clustering"""
    
    centroids, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(data=all_embedding.T, 
                                                          c=n_clusters, 
                                                          m=2, 
                                                          error=0.5, 
                                                          maxiter=1000, 
                                                          init=None)
    clusters = np.argmax(u, axis=0)
    return clusters, centroids

In [None]:
def get_clusters(embedding_dict, n_clusters, clusters, centroids=None, k=5):
    """returns paper ids per cluster and, if centroids are given, the k papers closest to each centroid"""

    # build the id list once; its order matches the rows of the embedding matrix
    paper_ids = list(embedding_dict.keys())
    cluster_dict = {}
    distance_dict = {}
    for i in range(n_clusters):
        cluster_dict[i] = []
        distance_dict[i] = []
        for j in np.where(clusters == i)[0]:
            paper_id = paper_ids[j]
            cluster_dict[i].append(paper_id)
            if centroids is not None:
                distance = eudist.euclidean(embedding_dict[paper_id], centroids[i])
                distance_dict[i].append(distance)

    if centroids is not None:
        closest_dict = {}
        for i in range(n_clusters):
            # positions (within the cluster) of the k smallest distances
            closest_idx = np.argsort(distance_dict[i])[0:k]
            # look up members by sorted position, not by insertion order
            closest_dict[i] = [cluster_dict[i][idx] for idx in closest_idx]
        return cluster_dict, closest_dict
    else:
        return cluster_dict

In [None]:
fuzzy_clusters, fuzzy_centroids = fuzzy_clustering(embedding_mat, n_clusters)
fuzzy_clusters_dict, fuzzy_closest_dict = get_clusters(embedding_dict, n_clusters, fuzzy_clusters, fuzzy_centroids)

## Fuzzy C-Means Evaluation

### 1. PCA and t-SNE
We chose these two forms of analysis because they reduce the number of variables (dimensions) to two, let us label each point by cluster, and give a simple visualization: papers on similar topics sit close together, and a greater distance between points indicates greater dissimilarity between paper topics.

Overall, the PCA plot looks acceptable: papers from the same cluster are close to each other, forming groups. However, Cluster 6, Cluster 8 and Cluster 9 overlap, which suggests that their subject matter overlaps as well.

The t-SNE plot shows similar patterns to the PCA plot: groups form, with some overlaps.
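
When reading a two-dimensional PCA plot, it can help to know how much of the total variance the first two components capture. The following is a minimal sketch on random toy data, not a result from this notebook; `explained_variance_ratio` is a hypothetical helper built on NumPy's SVD.

```python
import numpy as np


def explained_variance_ratio(data, n_components=2):
    """Share of total variance captured by the leading principal components."""
    centered = data - data.mean(axis=0)
    # singular values come back sorted in descending order
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()


rng = np.random.default_rng(0)
# toy data: 200 points in 5 dimensions with decreasing spread per dimension
toy = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

ratios = explained_variance_ratio(toy)
print(ratios, ratios.sum())
```

If the two ratios sum to a small fraction, overlap between clusters in the plot may partly reflect information lost in the projection rather than genuinely mixed clusters.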

In [None]:
def get_pca(all_embedding):
    """returns result of pca given an embedding matrix"""
    
    pca = PCA()
    pca_result = pca.fit_transform(all_embedding)
    return pca_result

In [None]:
def plot_pca(pca_result, clusters, title):
    """plots and saves pca result image"""
    
    sns.set(rc={'figure.figsize':(10, 10)})
    palette = sns.color_palette("bright", len(set(clusters)))
    # pass x and y by keyword; recent seaborn versions no longer accept them positionally
    sns.scatterplot(x=pca_result[:,0], y=pca_result[:,1], hue=clusters, legend='full', palette=palette)
    
    plt.title(title)
    plt.savefig(f"/kaggle/working/{title}.png")
    plt.show()

In [None]:
fuzzy_pca = get_pca(embedding_mat)

In [None]:
#use PCA to plot embeddings v. fuzzy output clusters
plot_pca(fuzzy_pca, fuzzy_clusters, "PCA Covid-19 Articles - Clustered(Fuzzy C-Means)")

In [None]:
def get_tsne(all_embedding):
    """returns result of t-SNE given an embedding matrix"""
    
    tsne = cTSNE(verbose=1)
    tsne_result = tsne.fit_transform(all_embedding)
    return tsne_result

In [None]:
def plot_tsne(tsne_result, clusters, title):
    """plots and saves tsne result image """
    
    sns.set(rc={'figure.figsize':(10, 10)})
    palette = sns.color_palette("bright", len(set(clusters)))
    # pass x and y by keyword; recent seaborn versions no longer accept them positionally
    sns.scatterplot(x=tsne_result[:,0], y=tsne_result[:,1], hue=clusters, legend='full', palette=palette)
    
    plt.title(title)
    plt.savefig(f"/kaggle/working/{title}.png")
    plt.show()

In [None]:
fuzzy_tsne = get_tsne(embedding_mat)

In [None]:
#use tSNE to plot embeddings v. fuzzy output clusters 
plot_tsne(fuzzy_tsne, fuzzy_clusters, "t-SNE Covid-19 Articles - Clustered(Fuzzy C-Means)")

### 2. The Titles of the Top 5 Closest Points to Centroids
To evaluate how Fuzzy C-means clustering performs, we find the top 5 closest points to the centroid of each cluster and present their corresponding paper titles. In theory, the papers within each cluster should have similar topics, and the papers from different clusters should have different topics. For example, the topics in Cluster 2 seem to be about treatment methods for the virus. Cluster 6 seems to discuss the geographical features of the virus, which is a different topic from that of Cluster 8.
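
The selection of the closest points can be sketched on toy data as follows. This is a simplified, vectorized illustration with hypothetical names and made-up ids, not the helper used in this notebook.

```python
import numpy as np


def closest_to_centroid(embeddings, ids, labels, centroids, k=5):
    """Return, per cluster, the ids of the k members nearest to its centroid."""
    closest = {}
    for cluster, centroid in enumerate(centroids):
        member_idx = np.where(labels == cluster)[0]
        dists = np.linalg.norm(embeddings[member_idx] - centroid, axis=1)
        # sort members by distance and keep the k nearest
        order = np.argsort(dists)[:k]
        closest[cluster] = [ids[i] for i in member_idx[order]]
    return closest


toy_embeddings = np.array([[0.0, 0.0], [0.5, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]])
toy_ids = ["a", "b", "c", "d", "e"]
toy_labels = np.array([0, 0, 0, 1, 1])
toy_centroids = np.array([[0.2, 0.0], [5.0, 5.0]])

closest = closest_to_centroid(toy_embeddings, toy_ids, toy_labels, toy_centroids, k=2)
print(closest)
```

Note that the ids are looked up through the sorted order, so the result depends on distance to the centroid and not on the order in which papers appear in the dataset.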

In [None]:
for cluster, paper_id in fuzzy_closest_dict.items():
    print(f"Cluster {cluster} - Titles")
    for idx in paper_id:
        print(f"{meta_df['title'].loc[meta_df['cord_uid'] == idx].values[0]}")

## K-Means Clustering

Since we first used Fuzzy C-means, we wanted to follow that up with K-means to make the partition stricter, and see if this change in cluster overlap showed significant difference in producing clusters and subsequent visualizations of these results. 
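
The stricter partition comes from the assignment step: each point is given to exactly one cluster, the one with the nearest centroid. Below is a minimal NumPy sketch of a single K-means iteration on toy data; it illustrates the technique and is not the cuML implementation used below. `kmeans_step` is a hypothetical helper.

```python
import numpy as np


def kmeans_step(points, centroids):
    """One K-means iteration: hard assignment, then centroid update."""
    # distances of every point to every centroid, shape (n_points, n_clusters)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # each centroid moves to the mean of its members; empty clusters stay put
    new_centroids = np.array([
        points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
        for c in range(len(centroids))
    ])
    return labels, new_centroids


toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
start_centroids = np.array([[0.0, 0.0], [5.0, 0.0]])

labels, new_centroids = kmeans_step(toy_points, start_centroids)
print(labels)
print(new_centroids)
```

Unlike the fuzzy membership matrix, `labels` carries no information about how close a point was to any other cluster.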

In [None]:
def kmeans_clustering(all_embedding, n_clusters):
    """returns result of k-means clustering"""
    
    kmeans = cKMeans(n_clusters=n_clusters, random_state=0).fit(all_embedding)
    clusters = kmeans.labels_
    centroids = kmeans.cluster_centers_
    return clusters, centroids

In [None]:
kmeans_clusters, kmeans_centroids = kmeans_clustering(embedding_mat, n_clusters)
kmeans_clusters_dict, kmeans_closest_dict = get_clusters(embedding_dict, n_clusters, kmeans_clusters, kmeans_centroids)

## K-Means Evaluation
### 1. PCA and t-SNE

From the PCA plot and the t-SNE plot labeled by K-means clusters, we observe more clearly defined cluster clouds, and the number of data points in each cluster is more evenly spread out.

In [None]:
kmeans_pca = get_pca(embedding_mat)

In [None]:
plot_pca(kmeans_pca, kmeans_clusters, "PCA Covid-19 Articles - Clustered(kmeans)")

In [None]:
kmeans_tsne = get_tsne(embedding_mat)

In [None]:
plot_tsne(kmeans_tsne, kmeans_clusters, "t-SNE Covid-19 Articles - Clustered(kmeans)")


### 2. The Titles of the Top 5 Closest Points to Centroids
Similarly, we find the top 5 closest points to the centroids of each cluster and check their paper titles for K-means clustering results. Cluster 6 seems to discuss the viruses geographically, and Cluster 3 seems to discuss how health regulation plays a part in this pandemic. The results are promising.

In [None]:
for cluster, paper_id in kmeans_closest_dict.items():
    print(f"Cluster {cluster} - Titles")
    for idx in paper_id:
        print(f"{meta_df['title'].loc[meta_df['cord_uid'] == idx].values[0]}")

## Hierarchical Clustering

The last form of clustering we applied to this dataset was hierarchical clustering, which we used to examine the relationships between the clusters that may have formed in the two methods above. We also thought that hierarchical clustering could give more insight into the overlap seen in the analysis of the Fuzzy C-means clusters. By running all three algorithms on this unlabeled dataset, we hoped to see whether they reinforce the same groupings.

In [None]:
# this section contains code we used to generate the clusters using hierarchical clustering
# to run it on kaggle kernel, we suggest limiting the sample size to 1000
# to do so, regenerate the embedding matrix by running the following:

# embedding_dict = create_embedding_dict(f'{ROOT_PATH}/cord19_specter_embeddings_2020-04-10/cord19_specter_embeddings_2020-04-10.csv',
#                                        sample_size=1000,
#                                       )
# embedding_mat = np.array(list(embedding_dict.values()))
# embedding_mat.shape, len(embedding_dict)

In [None]:
def hierarchical_clustering(all_embedding, n_clusters):
    """returns cluster labels from agglomerative (hierarchical) clustering"""

    hierarchical = AgglomerativeClustering(n_clusters=n_clusters).fit(all_embedding)
    clusters = hierarchical.labels_
    return clusters

In [None]:
hierarchical_clusters = hierarchical_clustering(embedding_mat, n_clusters)
hierarchical_clusters_dict = get_clusters(embedding_dict, n_clusters, hierarchical_clusters)

In [None]:
hierarchical_pca = get_pca(embedding_mat)

In [None]:
plot_pca(hierarchical_pca, hierarchical_clusters, "PCA Covid-19 Articles - Clustered(Hierarchical)")

In [None]:
hierarchical_tsne = get_tsne(embedding_mat)

In [None]:
plot_tsne(hierarchical_tsne, hierarchical_clusters, "t-SNE Covid-19 Articles - Clustered(Hierarchical)")

## Hierarchical Clustering Evaluation 
Here we have attached the results from Hierarchical Clustering, obtained by running the code above.

From the PCA plot and the t-SNE plot labeled by Hierarchical clusters, we observe more clearly defined cluster clouds, as well. However, Cluster 6 and Cluster 3 are more convoluted in the t-SNE.

In [None]:
hierarchical_pca = Image("/kaggle/input/results/PCA Covid-19 Articles - Clustered(Hierarchical).png")
hierarchical_tsne = Image("/kaggle/input/results/t-SNE Covid-19 Articles - Clustered(Hierarchical).png")
display(hierarchical_pca, hierarchical_tsne)

# Future Steps
For phase II, we aim to move towards a more general-purpose information retrieval system by implementing a Neural Information Retrieval (Neural IR) system. While traditional Neural IR systems focus heavily on word-level interactions, we aim to incorporate entity-oriented search into the system. By integrating text-based IR and entity-search IR, we hope that the augmented system will be able to parse harder queries with enriched data and interact with the documents in a more dynamic way.

Extraction of information from a knowledge graph will have three parts.  First, by performing entity linking on the titles we aim to extract a set of important entities.  Each entity is associated with a description and type.  By creating an embedding of entities, embedding of types with attention, and embedding the description, we have a system that incorporates entity semantic relationship with augmented text data not found in the document text.  

These three embeddings are combined by a linear layer and fed to a downstream interaction matrix. The final component of the interaction matrix will be the text data from the documents. From there, we'll apply an RBF kernel on the interaction matrix to generate translation scores.

The translation scores from the query-word, document-word, query-entity and document-entity pairs will form the basis for the neural ranking model. The neural ranking model can be trained with any standard ranking loss function; for the sake of simplicity, we will use a pairwise loss function.
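
As a rough illustration of the two ingredients named above, the sketch below applies RBF kernels to a toy interaction matrix and computes a pairwise hinge loss. Everything here is an assumption made for illustration: the kernel centres `mus`, the width `sigma`, the `log1p` pooling, the margin, and the function names are hypothetical and do not describe a finished design.

```python
import numpy as np


def rbf_kernel_features(interaction, mus, sigma=0.1):
    """Pool a (n_query_terms, n_doc_terms) similarity matrix with RBF kernels.

    Returns one feature per kernel centre in mus (assumed values).
    """
    diff = interaction[:, :, None] - mus[None, None, :]
    kernels = np.exp(-(diff ** 2) / (2.0 * sigma ** 2))
    # soft count of document terms near each kernel centre, per query term
    soft_counts = kernels.sum(axis=1)
    # log1p pooling is an assumption; it avoids log(0) for empty kernels
    return np.log1p(soft_counts).sum(axis=0)


def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Penalize a relevant document that does not beat an irrelevant one by margin."""
    return max(0.0, margin - score_pos + score_neg)


toy_interaction = np.array([[0.9, 0.1, 0.5],
                            [0.2, 0.8, 0.4]])
toy_mus = np.array([0.1, 0.5, 0.9])

features = rbf_kernel_features(toy_interaction, toy_mus)
print(features)
print(pairwise_hinge_loss(score_pos=2.0, score_neg=0.5))
print(pairwise_hinge_loss(score_pos=0.2, score_neg=0.5))
```

In a full model, the kernel features would feed a learned scoring layer, and the loss would be averaged over many (relevant, irrelevant) document pairs per query.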

The system enables an end-to-end optimization problem.  By integrating entities, entity description, types, and text data, we will create a system that is able to learn embeddings and ranking in one go.  

# References

```
@inproceedings{specter_cohan_2020,
    title = "SPECTER: Scientific Paper Embeddings using Citation-informed TransformERs",
    author = "Cohan, Arman and
      Feldman, Sergey and
      Beltagy, Iz  and
      Downey, Doug and
      Weld, Daniel",
    booktitle = "ACL",
    year = "2020",
}
```