# Nature Bio Embeddings

This Jupyter notebook demonstrates how to use the `PaperEmbeddingsQueryEngine` class to query similar papers based on their embeddings. The code relies on several libraries and models for natural language processing and similarity computation. The repository contains a curated collection of open-access research paper abstract datasets focused on the biosciences, sourced exclusively from Nature journal publications. By gathering these datasets, the repository aims to provide an open, comprehensive entry point to embeddings for researchers, students, and enthusiasts in the biosciences. The datasets contain abstracts only, not full texts.

In addition to the datasets, this repository houses embeddings created with the `all-MPNet-base-v2` sentence transformer model. They encode the semantic information of the titles and abstracts as compact numerical vectors, which allows for efficient similarity comparisons, clustering, and downstream applications.
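To make "similarity comparison" concrete, here is a minimal sketch of cosine similarity, the measure used later in this notebook, written in plain Python. The three-dimensional vectors are hypothetical toy values chosen for illustration; they are not taken from the repository's embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, for illustration only).
query = [1.0, 0.0, 1.0]
paper_a = [2.0, 0.0, 2.0]  # points in the same direction as the query
paper_b = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, paper_a))  # close to 1: very similar
print(cosine_similarity(query, paper_b))  # 0: unrelated
```

Because the score depends only on direction and not on vector length, `paper_a` scores close to 1 even though it is twice as long as the query.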

Whether you are working on natural language processing, information retrieval, or machine learning applications, you can use the embeddings in this repository for tasks such as document similarity, text classification, and question answering.

The embeddings are also available on [Hugging Face](https://huggingface.co/simudt/nature-bio-oa-abstract-embeddings).


## Environment Setup


Before running the code, clone the repository, change into its directory, and install the dependencies with the cells below:


In [None]:
!git clone https://github.com/SCALEDSL/Nature-Biosci-Embeddings

In [None]:
%cd /content/Nature-Biosci-Embeddings

In [None]:
!pip3 install -U sentence-transformers transformers pandas

In [None]:
!pip3 list

## Querying Embeddings


The last part of the code below shows an example usage of the `PaperEmbeddingsQueryEngine` class. It specifies the CSV file and embedding folder paths, creates an instance of the class, and retrieves the papers most similar to a given query. In this example, the query is "What are side effects of BioNTech COVID-19 mRNA vaccine?". If you changed the directories, adjust the file paths in the example code to match your setup. Feel free to experiment with different queries and explore the retrieved papers.

Note:
The code uses a CUDA-enabled GPU for faster computation when one is available.
Otherwise, it falls back to CPU execution.
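Since the paths may need adjusting, a quick check before constructing the engine can save a confusing traceback. The following is a minimal sketch, not part of the repository: `missing_paths` is a hypothetical helper, and the paths and file names are copied from the example code in this notebook.

```python
from pathlib import Path


def missing_paths(csv_file, embedding_folder, embedding_files):
    """Return the expected files that do not exist on disk (hypothetical helper)."""
    expected = [Path(csv_file)]
    expected += [Path(embedding_folder) / name for name in embedding_files]
    return [str(path) for path in expected if not path.is_file()]


# Paths and file names as used in the example code of this notebook.
csv_file = "/content/Nature-Biosci-Embeddings/dataset/concatenated/finalized_nature_bio_embeddings.csv"
embedding_folder = "/content/Nature-Biosci-Embeddings/embeddings/"
embedding_files = [
    "nature_bio_titles_embeddings.pth",
    "nature_bio_abstract_embeddings.pth",
]

missing = missing_paths(csv_file, embedding_folder, embedding_files)
if missing:
    print("Missing files, adjust the paths before querying:")
    for path in missing:
        print("  ", path)
else:
    print("All expected files found.")
```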


In [None]:
import os
import torch
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer


class PaperEmbeddingsQueryEngine:
    def __init__(
        self, model_name="all-MPNet-base-v2", csv_file_path=None, embedding_folder=None
    ):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.csv_file_path = csv_file_path
        self.embedding_folder = embedding_folder

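        try:
            self.model = SentenceTransformer(model_name).to(self.device)
        except OSError as err:
            # Raise instead of calling exit(), which would shut down the notebook kernel.
            raise RuntimeError(
                f"Model '{model_name}' not found, please try another model."
            ) from err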

        self.data_df = pd.read_csv(self.csv_file_path)
        self.journal = self.data_df["Journal"].tolist()
        self.titles = self.data_df["Title"].tolist()
        self.paper_descriptions = self.data_df["Abstract"].tolist()
        self.paper_urls = self.data_df["URL"].tolist()

        self.titles_embeddings = self.load_embeddings(
            "nature_bio_titles_embeddings.pth"
        )
        self.paper_description_embeddings = self.load_embeddings(
            "nature_bio_abstract_embeddings.pth"
        )

    def load_embeddings(self, embedding_file_name):
        embedding_file = os.path.join(self.embedding_folder, embedding_file_name)
        embeddings = torch.load(embedding_file, map_location=self.device)
        print(f"Shape of the {embedding_file_name}: ", embeddings[0].shape)
        return embeddings

    def query_similar_papers(self, query, top_k=5):
        query_embedding = self.model.encode(
            [query], convert_to_tensor=True, device=self.device
        )
        query_embeddings_tensor = query_embedding.to(self.device)

        titles_similarity_scores = self.compute_similarity_scores(
            query_embeddings_tensor, self.titles_embeddings
        )
        paper_desc_similarity_scores = self.compute_similarity_scores(
            query_embeddings_tensor, self.paper_description_embeddings
        )

        top_k_titles_indices = self.get_top_k_indices(titles_similarity_scores, top_k)
        top_k_paper_desc_indices = self.get_top_k_indices(
            paper_desc_similarity_scores, top_k
        )

        self.display_results(
            query, top_k_titles_indices, self.titles, titles_similarity_scores, "title"
        )
        print()
        self.display_results(
            query,
            top_k_paper_desc_indices,
            self.paper_descriptions,
            paper_desc_similarity_scores,
            "paper description",
        )

    def compute_similarity_scores(self, query_embeddings_tensor, embeddings):
        embeddings_tensor = torch.stack(embeddings).to(self.device)
        return cosine_similarity(
            query_embeddings_tensor.cpu().numpy(), embeddings_tensor.cpu().numpy()
        )

    def get_top_k_indices(self, similarity_scores, top_k):
        return np.argsort(similarity_scores[0])[-top_k:]

    def display_results(self, query, top_k_indices, data, similarity_scores, data_type):
        for idx in reversed(top_k_indices):
            print(f"The most related {data_type} to '{query}' is '{data[idx]}'")
            print(f"Journal of this paper: {self.journal[idx]}")
            print(f"URL of this paper: {self.paper_urls[idx]}")
            print(f"Similarity score: {similarity_scores[0][idx]}")


if __name__ == "__main__":
    csv_file = (
        "/content/Nature-Biosci-Embeddings/dataset/concatenated/finalized_nature_bio_embeddings.csv"
    )
    embedding_folder = "/content/Nature-Biosci-Embeddings/embeddings/"
    engine = PaperEmbeddingsQueryEngine(
        csv_file_path=csv_file, embedding_folder=embedding_folder
    )
    query = "What are side effects of BioNTech COVID-19 mRNA vaccine?"
    engine.query_similar_papers(query)
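
The ranking step in `get_top_k_indices` and `display_results` can be illustrated on toy data. The sketch below is a plain-Python stand-in for `np.argsort(scores)[-top_k:]` followed by `reversed`, assuming distinct scores; the score values are hypothetical and `top_k_indices` is an illustrative name, not part of the class.

```python
def top_k_indices(scores, top_k):
    """Indices of the top_k largest scores, in ascending score order.

    Plain-Python stand-in for np.argsort(scores)[-top_k:].
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-top_k:]


# Hypothetical similarity scores for five papers.
scores = [0.12, 0.87, 0.45, 0.91, 0.30]

indices = top_k_indices(scores, top_k=3)
print(indices)                  # ascending by score: [2, 1, 3]
print(list(reversed(indices)))  # best match first:   [3, 1, 2]
```

Sorting ascending and slicing from the end is why `display_results` iterates over `reversed(top_k_indices)`: without the reversal, the weakest of the top matches would be printed first. As in the class, `top_k` is assumed to be a positive integer.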