# Japanese Text Embedding Generator with PLaMo-Embedding-1B

**Goal:** This notebook demonstrates how to generate text embeddings for Japanese sentences using the `pfnet/plamo-embedding-1b` model from Hugging Face. Embeddings are numerical vectors that capture the semantic meaning of text, which makes them useful for NLP tasks such as similarity search, clustering, and classification.

We will cover:
1.  Installing necessary libraries.
2.  Loading the pre-trained PLaMo model and its tokenizer.
3.  Defining a function to generate embeddings using mean pooling.
4.  Defining a utility function for cosine similarity (NumPy-based).
5.  Defining a function for semantic search (using PyTorch-based cosine similarity).
6.  Running examples for single sentences, batches, saving/loading, and search.

## 1. Library Installation

The following libraries are required:
*   `torch`: The core PyTorch library for tensor computations and neural network operations.
*   `transformers`: Hugging Face's library providing access to pre-trained models (like PLaMo) and tokenizers.
*   `sentencepiece`: A tokenizer library often used by models like PLaMo.
*   `numpy`: A library for numerical operations, especially for handling the embeddings as arrays.

The cell below will install these libraries using pip.

In [None]:
# Install necessary libraries
!pip install torch transformers sentencepiece numpy
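
To confirm that the installation succeeded before moving on, a check along the following lines can report the installed versions. This is a minimal sketch that uses only the standard library (`importlib.metadata`); the helper name `installed_versions` is made up for illustration and is not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["torch", "transformers", "sentencepiece", "numpy"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed (and the kernel possibly restarted) before the later cells will run.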

## 2. Import Libraries

Now, let's import the installed libraries and necessary modules.
*   `torch` for tensor operations.
*   `torch.nn.functional` (as `F`) for PyTorch functions like cosine similarity.
*   `AutoTokenizer` and `AutoModel` from `transformers` for loading the model and tokenizer automatically based on the model name.
*   `numpy` (as `np`) for numerical operations.

In [None]:
# Import libraries
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
import numpy as np

## 3. Load Pre-trained Model and Tokenizer

We will use the `pfnet/plamo-embedding-1b` model, which was trained specifically to produce Japanese text embeddings.

*   **`MODEL_NAME`**: Specifies the Hugging Face model identifier.
*   **`AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)`**: Loads the appropriate tokenizer for the specified model. The tokenizer converts text into a format (token IDs, attention masks) that the model can understand.
*   **`AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)`**: Loads the pre-trained PLaMo model.
    *   `trust_remote_code=True`: Required for models whose Hugging Face repository ships custom Python code, because it allows that code to run locally. It is passed to both the tokenizer and the model here, on the assumption that the repository provides a custom tokenizer as well. Only use this option with sources you trust.
*   **`device`**: We check if a CUDA-enabled GPU is available and set the device accordingly (`'cuda'` or `'cpu'`).
*   **`.to(device)`**: This moves the model's parameters and buffers to the selected device (GPU if available, otherwise CPU). Processing on a GPU significantly speeds up computations.

In [None]:
# Load pre-trained model and tokenizer
MODEL_NAME = 'pfnet/plamo-embedding-1b'
# trust_remote_code=True is passed to the tokenizer too (assumes the repo ships a custom tokenizer)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# It's good practice to move the model to the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).to(device)
print(f"Using device: {device}")

## 4. Embedding Generation Function

The function `get_japanese_embedding` takes either a single Japanese sentence or a list of sentences and returns their embeddings as NumPy arrays.

**Function Breakdown:**

1.  **Tokenization (`tokenizer(...)`):**
    *   `text_or_texts`: The input Japanese string or list of strings.
    *   `return_tensors='pt'`: Returns PyTorch tensors.
    *   `truncation=True`: Truncates sequences to the model's maximum input length if they are too long.
    *   `padding=True`: Pads shorter sequences to the length of the longest sequence in a batch, ensuring uniform tensor dimensions.
    *   `max_length=512`: Caps each sequence at 512 tokens. This is a practical starting point for sentence-level text and can be raised for longer documents, up to the maximum context length stated on the model card.
    *   `add_special_tokens=True`: Adds whatever special tokens the tokenizer defines for the model, such as beginning- or end-of-sequence markers. (BERT-style tokenizers add `[CLS]` and `[SEP]`; PLaMo's tokenizer defines its own special tokens.)
    *   `.to(device)`: Moves the tokenized inputs to the same device as the model.

2.  **Model Inference (`with torch.no_grad(): ... outputs = model(**inputs)`):**
    *   `torch.no_grad()`: Disables gradient calculations, which is crucial for inference as it reduces memory consumption and speeds up computations when we are not training the model.
    *   `model(**inputs)`: Passes the tokenized input (input IDs, attention mask, etc.) to the model. The model returns a set of outputs, including the `last_hidden_state`.

3.  **Mean Pooling for Sentence Embedding:**
    *   `last_hidden_states = outputs.last_hidden_state`: This tensor contains the embeddings for each token in the input sequence(s). Its shape is typically (batch_size, sequence_length, hidden_dim).
    *   **Attention Mask (`inputs['attention_mask']`)**: The attention mask is used to distinguish real tokens from padding tokens. It has a value of 1 for real tokens and 0 for padding tokens.
    *   `expanded_mask = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()`: The attention mask is expanded to match the dimensions of `last_hidden_states`. This allows element-wise multiplication to zero out the embeddings of padding tokens.
    *   `sum_embeddings = torch.sum(last_hidden_states * expanded_mask, 1)`: Sums the embeddings of the non-padding tokens along the sequence dimension (dimension 1).
    *   `sum_mask = expanded_mask.sum(1)`: Counts the actual (non-padding) tokens in each sequence.
    *   `sum_mask = torch.clamp(sum_mask, min=1e-9)`: Prevents division by zero if a sequence has no actual tokens (tokenizer settings usually rule this out).
    *   `mean_embeddings = sum_embeddings / sum_mask`: Divides the summed embeddings by the token count to get the mean embedding. This mean-pooled embedding represents the entire sentence.

4.  **Output:**
    *   `.float().cpu().numpy()`: The resulting embeddings are cast to float32, moved to the CPU, and converted to a NumPy array.
    *   A single input string returns a 1D array of shape `(hidden_dim,)`. A list of N strings returns a 2D array of shape `(N, hidden_dim)`, even when N is 1.

In [None]:
# Function to generate embeddings using mean pooling
def get_japanese_embedding(text_or_texts):
    # Tokenize the text (handles a single string or a list of strings)
    inputs = tokenizer(
        text_or_texts,
        return_tensors='pt',
        truncation=True,
        padding=True,
        max_length=512,
        add_special_tokens=True,
    ).to(device)

    # Get model outputs without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Perform mean pooling over non-padding tokens
    last_hidden_states = outputs.last_hidden_state
    attention_mask = inputs['attention_mask']
    expanded_mask = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()
    sum_embeddings = torch.sum(last_hidden_states * expanded_mask, 1)
    sum_mask = expanded_mask.sum(1)
    sum_mask = torch.clamp(sum_mask, min=1e-9)
    mean_embeddings = sum_embeddings / sum_mask

    embeddings = mean_embeddings.float().cpu().numpy()
    # Drop the batch dimension only for a single string: (1, dim) -> (dim,).
    # A list always stays 2D, so a one-element list still yields (1, dim),
    # which search_similar_texts requires. A blanket squeeze() would collapse it.
    return embeddings[0] if isinstance(text_or_texts, str) else embeddings
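
To see why the attention mask matters in the pooling step, the same arithmetic can be reproduced on a tiny hand-made batch. The sketch below uses NumPy instead of PyTorch so that it runs without the model; the array values and the helper name `masked_mean_pool` are invented for illustration. It broadcasts a `(B, T, 1)` mask rather than expanding it to the full hidden size, which gives the same result.

```python
import numpy as np

# Toy "last_hidden_state": batch of 2 sequences, 3 token positions, hidden size 2.
# The second sequence has only 2 real tokens; its third position is padding.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],  # last row sits at a padding position
])
attention_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])


def masked_mean_pool(hidden_states, mask):
    """Average token vectors over real tokens only (mask == 1)."""
    expanded = mask[..., None].astype(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * expanded).sum(axis=1)         # (B, H)
    counts = np.clip(expanded.sum(axis=1), 1e-9, None)      # (B, 1), avoids division by zero
    return summed / counts


pooled = masked_mean_pool(hidden, attention_mask)
naive = hidden.mean(axis=1)  # ignores the mask, so the padding vector leaks in

print("masked mean:", pooled)
print("naive mean: ", naive)
```

For the second sequence the masked mean is `[2.0, 2.0]` (the average of the two real tokens), whereas the naive mean is pulled toward the padding vector. This is the reason the function multiplies by the mask before summing and divides by the real token count.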

## Utility Function: Cosine Similarity (NumPy-based)

The following function calculates the cosine similarity between two vectors (or a vector and a matrix of vectors) using NumPy. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. It's widely used to compare document or sentence embeddings.

- If `vec_a` is a 1D array and `vec_b` is a 1D array, it returns a single similarity score.
- If `vec_a` is a 1D array and `vec_b` is a 2D array (matrix), it calculates the cosine similarity between `vec_a` and each row vector in `vec_b`, returning a 1D array of similarity scores.

In [None]:
# (NumPy is imported as np in the import cell, section 2)

def cosine_similarity(vec_a, vec_b):
    """
    Computes the cosine similarity between vec_a and vec_b using NumPy.
    vec_a: 1D NumPy array.
    vec_b: 1D or 2D NumPy array. If 2D, similarity is computed between vec_a and each row of vec_b.
    Returns: A scalar float if vec_b was 1D, or a 1D NumPy array of floats if vec_b was 2D.
    """
    vec_a_norm = np.linalg.norm(vec_a)
    vec_b_norms = np.linalg.norm(vec_b, axis=-1) # works for 1D and 2D (along last axis)

    if vec_a_norm == 0: # If vec_a is zero vector, similarity is 0
        return 0.0 if vec_b.ndim == 1 else np.zeros(vec_b.shape[0])
            
    if vec_b.ndim == 1:
        if vec_b_norms == 0:  # If vec_b is a zero vector, similarity is 0
            return 0.0
        similarities = np.dot(vec_a, vec_b) / (vec_a_norm * vec_b_norms)
    elif vec_b.ndim == 2:
        dot_prod = np.dot(vec_b, vec_a)  # (N, D) @ (D,) -> (N,)
        # Start from zeros so that rows with zero norm keep a similarity of 0
        similarities = np.zeros(vec_b.shape[0])
        # Calculate similarities only for rows with non-zero norm
        non_zero_mask = vec_b_norms != 0
        similarities[non_zero_mask] = dot_prod[non_zero_mask] / (vec_a_norm * vec_b_norms[non_zero_mask])
    else:
        raise ValueError("vec_b must be 1D or 2D array")
        
    return similarities
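
A few hand-checkable vectors make the range of cosine similarity concrete. The sketch below re-implements the plain formula inline (as the hypothetical helper `cos_sim`) so that it runs on its own; it omits the zero-vector handling of the function above.

```python
import numpy as np


def cos_sim(a, b):
    """Plain cosine similarity for two non-zero 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


x = np.array([1.0, 0.0])
same_direction = cos_sim(x, np.array([3.0, 0.0]))  # parallel, different length
orthogonal = cos_sim(x, np.array([0.0, 2.0]))      # perpendicular
opposite = cos_sim(x, np.array([-1.0, 0.0]))       # pointing the other way
diagonal = cos_sim(x, np.array([1.0, 1.0]))        # 45 degrees apart

# The 1D-versus-2D case: one score per row of the matrix
rows = np.array([[3.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
row_scores = rows @ x / (np.linalg.norm(rows, axis=-1) * np.linalg.norm(x))

print(same_direction, orthogonal, opposite, round(diagonal, 4))
print(np.round(row_scores, 4))
```

The scores depend only on direction, not on vector length: the parallel pair scores 1, the perpendicular pair 0, the opposite pair -1, and the 45-degree pair about 0.7071.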

## Semantic Search Function (using PyTorch Cosine Similarity)

The `search_similar_texts` function implements semantic search. Given a query text, it finds the most similar texts from a provided list of original texts by comparing their embeddings.

**Function Parameters:**
*   `query_text` (str): The Japanese text to search with.
*   `original_texts` (list of str): A list of the original Japanese sentences/texts that correspond to `all_embeddings`.
*   `all_embeddings` (2D NumPy array): A NumPy array where each row is an embedding of a text from `original_texts`. Assumes these embeddings were pre-calculated using `get_japanese_embedding`.
*   `top_n` (int, optional): The number of top similar texts to return. Defaults to 3.

**Process:**
1.  Generates an embedding for the `query_text` (as a NumPy array) using the `get_japanese_embedding` function.
2.  Converts the NumPy query embedding and the `all_embeddings` NumPy array to PyTorch tensors, moving them to the active `device`.
3.  Calculates the cosine similarities between the query tensor and the tensor of all embeddings. (Note: This implementation uses `torch.nn.functional.cosine_similarity` for efficient computation on tensors).
4.  Converts the resulting similarity tensor back to a NumPy array.
5.  Identifies the `top_n` texts with the highest similarity scores using NumPy for sorting.
6.  Returns a list of tuples, where each tuple contains `(similar_text, similarity_score)`.
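
The ranking part of this process (steps 3 to 6) can be tried without the model. The following is a minimal sketch in NumPy with made-up three-dimensional vectors standing in for real embeddings; the names `texts`, `doc_vectors`, and `query_vector` are placeholders for illustration.

```python
import numpy as np

# Hypothetical stand-ins for real embeddings: 4 "documents" in 3 dimensions
texts = ["doc A", "doc B", "doc C", "doc D"]
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
query_vector = np.array([1.0, 0.0, 0.0])

# Cosine similarity of the query against every row of doc_vectors
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)

# argsort is ascending, so reverse it and keep the first top_n indices
top_n = 2
top_indices = np.argsort(scores)[::-1][:top_n]
results = [(texts[i], float(scores[i])) for i in top_indices]

print(results)
```

Here "doc A" ranks first because it points in the same direction as the query, followed by "doc C", which is 45 degrees away.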

In [None]:
# (NumPy 'np', 'get_japanese_embedding' are defined in previous cells)
# (PyTorch 'torch', 'device', 'F' are defined/imported in previous cells)

def search_similar_texts(query_text, original_texts, all_embeddings, top_n=3):
    """
    Searches for texts in 'original_texts' that are semantically similar to 'query_text'.
    Uses torch.nn.functional.cosine_similarity for the calculation.
    
    Args:
        query_text (str): The text to search for.
        original_texts (list): A list of original text strings.
        all_embeddings (np.ndarray): A 2D NumPy array of embeddings corresponding to original_texts.
        top_n (int): Number of top similar texts to return.
        
    Returns:
        list: A list of tuples, each containing (similar_text, similarity_score), 
              sorted by similarity in descending order.
    """
    if not isinstance(all_embeddings, np.ndarray):
        all_embeddings = np.array(all_embeddings)
    if all_embeddings.ndim != 2:
        raise ValueError(f"all_embeddings must be a 2D array, but got {all_embeddings.ndim} dimensions.")
    if len(original_texts) != all_embeddings.shape[0]:
        raise ValueError(f"Number of original_texts ({len(original_texts)}) must match number of rows in all_embeddings ({all_embeddings.shape[0]}).")

    # 1. Get query embedding (NumPy array from get_japanese_embedding)
    query_embedding_np = get_japanese_embedding(query_text)
    if query_embedding_np.ndim == 2: # Ensure it's 1D for a single query
        query_embedding_np = query_embedding_np.squeeze(0)

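    # 2. Convert NumPy embeddings to float32 PyTorch tensors on the active device
    #    (a shared dtype avoids mismatches if all_embeddings arrived as float64)
    query_tensor = torch.as_tensor(query_embedding_np, dtype=torch.float32, device=device).unsqueeze(0)  # Shape: (1, D)
    all_embeddings_tensor = torch.as_tensor(all_embeddings, dtype=torch.float32, device=device)  # Shape: (N, D)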

    # 3. Calculate cosine similarities using torch.nn.functional
    # query_tensor (1,D) is broadcast against all_embeddings_tensor (N,D)
    # dim=-1 ensures similarity is computed along the embedding dimension
    similarities_tensor = F.cosine_similarity(query_tensor, all_embeddings_tensor, dim=-1) # Output shape: (N,)
    
    # 4. Convert similarities tensor back to NumPy array
    similarities = similarities_tensor.cpu().numpy()
    
    # Ensure similarities is a 1D array for np.argsort
    if isinstance(similarities, float) or similarities.ndim == 0:
        similarities = np.array([similarities])
            
    num_available_results = len(similarities)
    actual_top_n = min(top_n, num_available_results) 

    # 5. Get indices of top_n similarities in descending order
    sorted_indices = np.argsort(similarities)[::-1][:actual_top_n]
    
    # 6. Retrieve the corresponding original texts and their scores (as plain Python floats)
    results = []
    for idx in sorted_indices:
        results.append((original_texts[idx], float(similarities[idx])))
        
    return results

## 6. Example Usage (Single and Batch Embeddings)

The following cell demonstrates how to use `get_japanese_embedding`. It shows:
1.  Generating an embedding for a single Japanese sentence.
2.  Generating embeddings for a batch of multiple Japanese sentences.
    *   The function handles both single strings and lists of strings automatically.

Saving, loading, and semantic search are covered in the next section.

In [None]:
# Example Usage for generating embeddings
print(f"Using Model: {MODEL_NAME}\n")

# 1. Single sentence example
print("--- Single Sentence Embedding ---")
sample_text_single = "こんにちは、美しい世界！"
embedding_single = get_japanese_embedding(sample_text_single)
print(f"Original sentence: {sample_text_single}")
print(f"Embedding shape: {embedding_single.shape}")
print(f"Sample embedding (first 5 values): {embedding_single[:5]}\n")

# 2. Batch (multiple sentences) example
print("--- Batch Sentences Embedding ---")
corpus_texts = [
    "これは最初の文です。",
    "日本語の埋め込みをテストしています。",
    "これが最後の文になります。",
    "東京は日本の首都です。",
    "猫が窓辺で日向ぼっこをしています。"
]
corpus_embeddings = get_japanese_embedding(corpus_texts) 
print(f"Original corpus sentences: {corpus_texts}")
print(f"Corpus embeddings shape: {corpus_embeddings.shape}") 
if corpus_embeddings.ndim == 2 and corpus_embeddings.shape[0] > 0:
    print(f"Sample embedding for the first corpus sentence (first 5 values): {corpus_embeddings[0, :5]}\n")
else:
    print("Could not display sample of corpus embeddings due to unexpected shape.\n")

## 7. Saving, Loading, and Search Example

This section demonstrates saving and loading the generated batch embeddings, and then using the `search_similar_texts` function with a query to find similar sentences from the batch.
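
The save/load round trip itself can be checked in isolation. The sketch below uses a small made-up array in place of real embeddings and writes to a temporary directory so that it leaves no files behind; the file name is arbitrary.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a (num_texts, hidden_dim) embedding matrix
embeddings = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")
    np.save(path, embeddings)   # writes shape, dtype, and data
    restored = np.load(path)    # reads them back unchanged

print(restored.shape, restored.dtype)
print(np.array_equal(embeddings, restored))
```

A `.npy` file stores only the array. The original texts are not saved with it, so they need to be kept separately and in the same order for the search results to line up with the right sentences.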

In [None]:
# Saving, Loading, and Search Example (uses corpus_embeddings and corpus_texts from the cell above)

if 'corpus_embeddings' in locals() and isinstance(corpus_embeddings, np.ndarray) and corpus_embeddings.size > 0:
    # Saving batch embeddings to a .npy file
    print("--- Saving Corpus Embeddings ---")
    output_filename = "japanese_corpus_embeddings.npy"
    np.save(output_filename, corpus_embeddings)
    print(f"Corpus embeddings saved to: {output_filename}\n")

    # Loading embeddings from the .npy file
    print("--- Loading Corpus Embeddings ---")
    loaded_corpus_embeddings = np.load(output_filename)
    print(f"Embeddings loaded from: {output_filename}")
    print(f"Loaded corpus embeddings shape: {loaded_corpus_embeddings.shape}\n")

    # Example of using the search function
    print("--- Semantic Search Example (using PyTorch Cosine Similarity) ---")
    query_sentence = "日本の首都はどこですか。"
    print(f"Search Query: {query_sentence}")
    
    search_results = search_similar_texts(query_sentence, corpus_texts, loaded_corpus_embeddings, top_n=3)
    
    print("Top similar texts found:")
    for text, score in search_results:
        print(f"  Score: {score:.4f} - Text: {text}")
else:
    print("Skipping saving/loading/search example as 'corpus_embeddings' is not defined, not a NumPy array, or empty. Please run the cell above first.")