**EMBEDDING**

Embed textual data from an uploaded CSV file (column name: 'text') and download the CSV file with the embedding vectors attached as an additional column.
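For reference, the expected input can be sketched as below. The sample rows and the extra `id` column are made up for illustration; the only thing the notebook relies on is a column named `text`.

```python
import io
import pandas as pd

# Hypothetical sample rows; only the 'text' column is required by the notebook.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["Dit is een zin.", "This is a sentence.", "Nog een voorbeeld."],
})

# Write to an in-memory buffer for this sketch; on disk this would be
# sample.to_csv("df.csv", index=False).
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back to confirm the 'text' column survives the round trip.
buffer.seek(0)
loaded = pd.read_csv(buffer)
print(loaded.columns.tolist())
```

Any other columns in the file are carried through unchanged, since the notebook only appends a new column.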

Install packages

In [1]:
%%capture
# Install required packages (%%capture is a cell magic and must be the first line of the cell)
!pip install sentence-transformers
!pip install torch

# Import libraries
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from google.colab import files

Load the CSV file

In [2]:
# Load data
print("Please upload your CSV file when prompted...")
uploaded = files.upload()
input_filename = next(iter(uploaded))
print(f'Uploaded file: {input_filename}')

Please upload your CSV file when prompted...


Saving df.csv to df.csv
Uploaded file: df.csv
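Before embedding, it may be worth checking that the uploaded file has the expected column and no missing values. The following is a minimal sketch, not part of the original notebook; `prepare_texts` is a hypothetical helper name, and the inline CSV stands in for the uploaded file.

```python
import io
import pandas as pd

def prepare_texts(df, text_column="text"):
    """Return the text column as a list of str, with missing values as ''."""
    if text_column not in df.columns:
        raise ValueError(f"Column '{text_column}' not found in CSV file")
    # Empty cells are read as NaN; map them to '' so every entry is a string.
    return df[text_column].fillna("").astype(str).tolist()

# Stand-in for the uploaded file: one normal row, one empty cell, one numeric-looking cell.
raw = io.StringIO("id,text\n1,Hallo wereld\n2,\n3,42\n")
texts = prepare_texts(pd.read_csv(raw))
print(texts)
```

Whether an empty string is the right replacement for a missing row is a design choice; dropping such rows beforehand is an equally reasonable option.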


Embed and download

In [3]:
# Process function: embed the text column, save the result, and download it
def process_csv_embeddings(input_filename, output_filename, text_column='text', batch_size=32):
    # Read the CSV file
    print(f"Reading {input_filename}...")
    df = pd.read_csv(input_filename)

    if text_column not in df.columns:
        raise ValueError(f"Column '{text_column}' not found in CSV file")

    print("Loading Sentence-BERT multilingual model (mpnet)...")
    # Using the multilingual model that supports Dutch
    #model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2') # faster, but 384 dimensional dense
    model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2') # slower, but 768 dimensional dense

    print("Generating embeddings...")
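    # Replace missing values with empty strings so every entry passed to encode() is a str
    texts = df[text_column].fillna('').astype(str).tolist()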

    # Generate embeddings with progress bar
    embeddings = model.encode(texts, batch_size=batch_size, show_progress_bar=True)

    print("\nConverting embeddings to CSV format...")
    df['embeddings'] = [','.join(map(str, emb)) for emb in embeddings]

    print(f"Saving results to {output_filename}...")
    df.to_csv(output_filename, index=False)
    files.download(output_filename)
    print("Done! File has been downloaded.")

    return df

# attach embedding vectors to csv and present as download
output_filename = 'embedded_' + input_filename
try:
    df_with_embeddings = process_csv_embeddings(input_filename, output_filename)
    print(f"Shape of first embedding: {len(df_with_embeddings['embeddings'].iloc[0].split(','))} dimensions")
except Exception as e:
    print(f"Error processing file: {str(e)}")

Reading df.csv...
Loading Sentence-BERT multilingual model (mpnet)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Generating embeddings...



Converting embeddings to CSV format...
Saving results to embedded_df.csv...


Done! File has been downloaded.
Shape of first embedding: 768 dimensions
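Because each embedding is stored as a comma-joined string, it has to be parsed back into numbers before it can be used. The sketch below shows one way to do that and to compute pairwise cosine similarities; `parse_embeddings` and `cosine_similarity_matrix` are hypothetical helper names, and the 3-dimensional toy vectors stand in for the 768-dimensional ones in `embedded_df.csv`.

```python
import numpy as np
import pandas as pd

def parse_embeddings(df, column="embeddings"):
    """Turn comma-joined embedding strings back into a 2-D float array."""
    rows = [[float(x) for x in s.split(",")] for s in df[column]]
    return np.array(rows, dtype=np.float32)

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors.
    normalized = matrix / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T

# Toy data in the same string format the notebook writes;
# with the real file this would be pd.read_csv("embedded_df.csv").
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "embeddings": ["1.0,0.0,0.0", "0.0,1.0,0.0", "2.0,0.0,0.0"],
})

vectors = parse_embeddings(toy)
similarities = cosine_similarity_matrix(vectors)
print(vectors.shape)
print(similarities.round(2))
```

Storing vectors as strings keeps the output a single flat CSV, at the cost of this parsing step and a larger file than a binary format would produce.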
