# Example: how to read text embeddings

The text embeddings were generated from PubMed abstracts using the BioLinkBERT-large model.

## Data description
The text embeddings folder contains two files:

1. **text_embeddings_*.npy**
   - NumPy array of text embeddings.
   - Shape: (number of text segments, embedding dimension).
   - Each row corresponds to a text segment from a PubMed abstract.

2. **text_metadata_*.csv**
   - CSV table with metadata for each text segment.
   - Columns: text_id, pubmed_id, segment, mesh_terms, embedding_index.
   - embedding_index is the row of the .npy array that holds the embedding for that text segment; use it to match metadata rows to embeddings.
   - text_id format: `pubmed_<pubmedid>_seg<segment>` (e.g., pubmed_36291189_seg0).
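
The matching described above can be sketched with small synthetic stand-ins for the two files. This is only an illustration: the array values, the second PubMed ID (12345678), and the helper name `parse_text_id` are assumptions made up for the example, not part of the dataset or its tooling.

```python
import re

import numpy as np
import pandas as pd

# Synthetic stand-ins for the two files (illustrative values, not real data).
embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 segments, dimension 3
metadata = pd.DataFrame(
    {
        "text_id": [
            "pubmed_36291189_seg0",
            "pubmed_36291189_seg1",
            "pubmed_12345678_seg0",
            "pubmed_12345678_seg1",
        ],
        "pubmed_id": [36291189, 36291189, 12345678, 12345678],
        "segment": [0, 1, 0, 1],
        "embedding_index": [0, 1, 2, 3],
    }
)

# Hypothetical helper: split a text_id into its PubMed ID and segment number.
TEXT_ID_PATTERN = re.compile(r"^pubmed_(\d+)_seg(\d+)$")


def parse_text_id(text_id):
    match = TEXT_ID_PATTERN.match(text_id)
    if match is None:
        raise ValueError(f"Unexpected text_id format: {text_id!r}")
    return int(match.group(1)), int(match.group(2))


# Select all segments of one abstract, then use embedding_index (not the
# DataFrame's own index) to pull the matching rows from the embedding array.
subset = metadata[metadata["pubmed_id"] == 12345678]
subset_embeddings = embeddings[subset["embedding_index"].to_numpy()]

print(parse_text_id("pubmed_36291189_seg0"))  # (36291189, 0)
print(subset_embeddings.shape)  # (2, 3)
```

Indexing through the embedding_index column keeps the lookup correct after the metadata has been filtered or reordered, which is when the DataFrame's own index and the array rows can drift apart.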

In [3]:
import numpy as np
import pandas as pd
import torch

# Load text embeddings as numpy array
text_embeddings_file = "/Users/vedran/Documents/scpgpt-project/text-embeddings/text_embeddings_20251106_165002.npy"
text_embeddings = np.load(text_embeddings_file)
print("Text embeddings shape:", text_embeddings.shape)

# Optionally, convert to a torch tensor (torch.from_numpy shares memory with the NumPy array, no copy)
text_embeddings_torch = torch.from_numpy(text_embeddings)
print("Torch tensor shape:", text_embeddings_torch.shape)

# Load text metadata as DataFrame
text_metadata_file = "/Users/vedran/Documents/scpgpt-project/text-embeddings/text_metadata_20251106_165002.csv"
text_metadata = pd.read_csv(text_metadata_file)
print("\nMetadata columns:", text_metadata.columns.tolist())
print("Total text segments:", len(text_metadata))
print("First 3 text IDs:", text_metadata['text_id'].head(3).tolist())

# Example: Get text segment info for a given embedding index.
# Look the row up through the embedding_index column rather than the DataFrame's
# own index, so the match stays correct if the metadata is reordered or filtered.
example_index = 0
row = text_metadata.loc[text_metadata['embedding_index'] == example_index].iloc[0]
text_id = row['text_id']
pubmed_id = row['pubmed_id']
segment = row['segment']
mesh_terms = row['mesh_terms']
print(f"\nEmbedding index {example_index} corresponds to text '{text_id}':")
print(f"  PubMed ID: {pubmed_id}")
print(f"  Segment: {segment}")
print(f"  MeSH terms: {mesh_terms}")

Text embeddings shape: (17633, 1024)
Torch tensor shape: torch.Size([17633, 1024])

Metadata columns: ['text_id', 'pubmed_id', 'segment', 'mesh_terms', 'embedding_index']
Total text segments: 17633
First 3 text IDs: ['pubmed_36291189_seg0', 'pubmed_36291189_seg1', 'pubmed_36291189_seg2']

Embedding index 0 corresponds to text 'pubmed_36291189_seg0':
  PubMed ID: 36291189
  Segment: 0
  MeSH terms: Mesenchymal Stem Cells,Humans,NF-kappa B,RNA,Small Interfering,Interleukin-8,Osteogenesis
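
Before using the embeddings, a few sanity checks can catch a mismatched pair of files. The sketch below assumes that embedding_index is a zero-based row position in the .npy array, which is consistent with the output above but not stated explicitly. The function name `check_alignment` is hypothetical.

```python
import numpy as np
import pandas as pd


def check_alignment(embeddings, metadata):
    """Raise ValueError if the metadata table and embedding array do not line up."""
    n_rows = embeddings.shape[0]
    if len(metadata) != n_rows:
        raise ValueError(f"{len(metadata)} metadata rows vs {n_rows} embeddings")
    idx = metadata["embedding_index"]
    if not idx.is_unique:
        raise ValueError("embedding_index contains duplicates")
    if idx.min() < 0 or idx.max() >= n_rows:
        raise ValueError("embedding_index is out of range for the embedding array")


# Tiny synthetic example (illustrative values only).
demo_embeddings = np.zeros((3, 4), dtype=np.float32)
demo_metadata = pd.DataFrame({"embedding_index": [0, 1, 2]})
check_alignment(demo_embeddings, demo_metadata)
print("Metadata and embeddings are aligned")
```

With the files loaded as in the cell above, the same check would be called as `check_alignment(text_embeddings, text_metadata)`.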
