# Example: how to read the text embeddings

The text embeddings were generated from PubMed abstracts using the BioLinkBERT-large model.

## Data description
The [text embeddings folder](https://drive.google.com/drive/folders/1YnEsy2WNh74lbKC5gX7xORmWGYQ_Tmo3) contains two files:

1. **text_embeddings_*.npy**
   - NumPy array of text embeddings.
   - Shape: (number of text segments, embedding dimension).
   - Each row corresponds to a text segment from a PubMed abstract.

2. **text_metadata_*.csv**
   - CSV table with metadata for each text segment.
   - Columns: embedding_index, pubmed_id, segment, mesh_terms.
   - Use embedding_index to match rows to embeddings in the .npy file.
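
The matching between the two files can be sketched as follows. This is a minimal illustration on tiny synthetic data, not the real files: the helper `embeddings_for_pubmed_id` is a hypothetical name, and the sketch assumes that `embedding_index` is a 0-based row position in the array and that `segment` numbers the segments within one abstract.

```python
import numpy as np
import pandas as pd


def embeddings_for_pubmed_id(embeddings, metadata, pubmed_id):
    """Return the embedding rows of one PubMed abstract, ordered by segment.

    Hypothetical helper: assumes embedding_index is a 0-based row position.
    """
    rows = metadata[metadata["pubmed_id"] == pubmed_id].sort_values("segment")
    return embeddings[rows["embedding_index"].to_numpy()]


# Tiny synthetic stand-ins for text_embeddings_*.npy and text_metadata_*.csv.
embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)
metadata = pd.DataFrame(
    {
        "embedding_index": [0, 1, 2, 3],
        "pubmed_id": [111, 111, 222, 222],  # made-up IDs for illustration
        "segment": [0, 1, 0, 1],
        "mesh_terms": ["Humans", "Humans", "RNA", "RNA"],
    }
)

# Sanity checks: one metadata row per embedding, and every index is a valid row.
assert len(metadata) == embeddings.shape[0]
assert metadata["embedding_index"].is_unique
assert metadata["embedding_index"].between(0, embeddings.shape[0] - 1).all()

vectors = embeddings_for_pubmed_id(embeddings, metadata, 222)
print(vectors.shape)  # (2, 3)
```

Running the same three sanity checks on the real files after loading them is a cheap way to confirm that the `.npy` and `.csv` files come from the same export.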

```python
import numpy as np
import pandas as pd
import torch

# Load the text embeddings as a NumPy array (adjust the path to your local copy)
text_embeddings_file = "/Users/vedran/Documents/scpgpt-project/text-embeddings/text_embeddings_20251106_162234.npy"
text_embeddings = np.load(text_embeddings_file)
print("Text embeddings shape:", text_embeddings.shape)
print("First embedding vector:", text_embeddings[0])

# Optionally, convert to a torch tensor (shares memory with the NumPy array)
text_embeddings_torch = torch.from_numpy(text_embeddings)
print("Torch tensor shape:", text_embeddings_torch.shape)

# Load the text metadata as a DataFrame
text_metadata_file = "/Users/vedran/Documents/scpgpt-project/text-embeddings/text_metadata_20251106_162234.csv"
text_metadata = pd.read_csv(text_metadata_file)
print("\nMetadata columns:", text_metadata.columns.tolist())
print("Total text segments:", len(text_metadata))

# Example: get the text segment info for a given embedding index.
# Look the row up by the embedding_index column rather than by row position,
# so the result stays correct even if the CSV rows are reordered.
example_index = 0
row = text_metadata.set_index("embedding_index").loc[example_index]
print(f"\nEmbedding index {example_index}:")
print(f"  PubMed ID: {row['pubmed_id']}")
print(f"  Segment: {row['segment']}")
print(f"  MeSH terms: {row['mesh_terms']}")
```

Output:

```
Text embeddings shape: (17633, 1024)
First embedding vector: [-0.17757398  0.02531927  0.23698409 ...  0.14285968 -0.03779435
  0.02688294]
Torch tensor shape: torch.Size([17633, 1024])

Metadata columns: ['embedding_index', 'pubmed_id', 'segment', 'mesh_terms']
Total text segments: 17633

Embedding index 0:
  PubMed ID: 36291189
  Segment: 0
  MeSH terms: Mesenchymal Stem Cells,Humans,NF-kappa B,RNA,Small Interfering,Interleukin-8,Osteogenesis
```
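
The `mesh_terms` column can also be used to select embeddings by topic. The sketch below is one possible approach on synthetic data, not part of the original notebook: `rows_with_mesh_term` is a hypothetical helper, and only the first `mesh_terms` string is taken from the output above (the other rows and IDs are made up). It uses a literal substring match instead of splitting on commas, because the sample row suggests that terms which themselves contain a comma (here `RNA,Small Interfering`) are stored unescaped, so a plain `split(",")` would likely break them apart.

```python
import numpy as np
import pandas as pd


def rows_with_mesh_term(metadata, term):
    """Return metadata rows whose mesh_terms string contains `term` literally.

    Hypothetical helper: a case-sensitive substring match, so a short term
    such as "RNA" also matches longer terms that contain it.
    """
    # regex=False treats the term as plain text; na=False skips missing values.
    mask = metadata["mesh_terms"].str.contains(term, regex=False, na=False)
    return metadata[mask]


# Synthetic stand-ins for the real files (rows 1 and 2 are made up).
embeddings = np.zeros((3, 4), dtype=np.float32)
metadata = pd.DataFrame(
    {
        "embedding_index": [0, 1, 2],
        "pubmed_id": [36291189, 1, 2],
        "segment": [0, 0, 0],
        "mesh_terms": [
            "Mesenchymal Stem Cells,Humans,NF-kappa B,RNA,Small Interfering,"
            "Interleukin-8,Osteogenesis",
            "Humans,Neoplasms",
            None,  # a row without MeSH terms
        ],
    }
)

hits = rows_with_mesh_term(metadata, "RNA,Small Interfering")
# Use embedding_index to pull the matching rows out of the embedding array.
selected = embeddings[hits["embedding_index"].to_numpy()]
print(hits["pubmed_id"].tolist(), selected.shape)
```

If exact term matching is needed, a delimiter-aware parser for the real export would be a safer choice than substring search. Which one is appropriate depends on how the `mesh_terms` strings were serialized.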
