# Million Text Embeddings

A dataset of more than a million English sentences, each paired with its embedding from the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model.\
**Train Set: 1,000,000**\
**Test Set: 200,000**\
**Dimensions: 768**\
**Source: [sentence-transformers/agnews](https://huggingface.co/datasets/sentence-transformers/agnews)**

### Load source dataset

In [1]:
from datasets import load_dataset

In [2]:
dataset = load_dataset("sentence-transformers/agnews")["train"]

texts = []
for row in dataset:
    texts.append(row['title'])
    texts.append(row['description'])
texts = filter(lambda t: len(t) <= 350, texts)  # keep texts of at most 350 characters
texts = list(set(texts))  # deduplicate; set order is not stable across runs

train_texts = texts[:1_000_000]
test_texts = texts[1_000_000:1_200_000]

len(train_texts), len(test_texts)

(1000000, 200000)
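Because `list(set(texts))` orders strings by their hash, which Python randomizes per process by default, the slices taken above can contain different texts from one run to the next. The following is a minimal sketch of an order-preserving alternative, assuming a reproducible split is wanted. The helper name `unique_short_texts` and the toy `rows` are hypothetical and not part of the original notebook.

```python
def unique_short_texts(rows, max_chars=350):
    """Collect titles and descriptions, drop long texts, deduplicate in first-seen order."""
    texts = []
    for row in rows:
        texts.append(row["title"])
        texts.append(row["description"])
    short = (t for t in texts if len(t) <= max_chars)
    # dict keys keep insertion order, so the result is the same on every run
    return list(dict.fromkeys(short))


# Hypothetical toy rows shaped like the records iterated over above
rows = [
    {"title": "Stocks rally", "description": "Markets closed higher on Friday."},
    {"title": "Stocks rally", "description": "x" * 400},
    {"title": "Rain expected", "description": "Markets closed higher on Friday."},
]

texts = unique_short_texts(rows)
train_texts, test_texts = texts[:2], texts[2:]
```

With the real data, the same helper could be given the loaded `dataset` in place of `rows`, and the slice bounds would stay at 1,000,000 and 1,200,000.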

### Compute embedding vectors

In [3]:
from sentence_transformers import SentenceTransformer

In [4]:
model = SentenceTransformer("all-mpnet-base-v2")

train_embeddings = model.encode(train_texts, show_progress_bar=True)
test_embeddings = model.encode(test_texts, show_progress_bar=True)

len(train_embeddings), len(test_embeddings)

Batches:   0%|          | 0/31250 [00:00<?, ?it/s]

Batches:   0%|          | 0/6250 [00:00<?, ?it/s]

(1000000, 200000)
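The batch totals in the progress bars follow from the number of texts divided by the batch size. A quick check, assuming `encode` ran with a batch size of 32 (believed to be the library default; the notebook does not set it explicitly):

```python
import math

BATCH_SIZE = 32  # assumption: the notebook does not pass batch_size


def num_batches(num_texts, batch_size=BATCH_SIZE):
    """Number of batches needed to encode num_texts items."""
    return math.ceil(num_texts / batch_size)


train_batches = num_batches(1_000_000)
test_batches = num_batches(200_000)
print(train_batches, test_batches)
```

Under that assumption the results match the totals of 31250 and 6250 shown in the two progress bars.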

### Create Dataset objects and push to the Hugging Face Hub

In [5]:
from datasets import Dataset

In [None]:
train_data = {"text": train_texts, "embedding": train_embeddings}
train_dataset = Dataset.from_dict(train_data)
train_dataset.push_to_hub("Sreenath/million-text-embeddings", split="train")

test_data = {"text": test_texts, "embedding": test_embeddings}
test_dataset = Dataset.from_dict(test_data)
test_dataset.push_to_hub("Sreenath/million-text-embeddings", split="test")

Uploading the dataset shards:   0%|          | 0/7 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/143 [00:00<?, ?ba/s]
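As a rough size check on the upload, the sketch below estimates the raw size of the embedding columns. It assumes each of the 768 values is stored as a 4-byte float32 (an assumption; the notebook does not state the dtype) and ignores the text column and any Parquet compression.

```python
DIMENSIONS = 768
BYTES_PER_VALUE = 4  # assumption: float32 embeddings


def embedding_bytes(num_rows, dims=DIMENSIONS, bytes_per_value=BYTES_PER_VALUE):
    """Raw bytes needed to hold num_rows embedding vectors."""
    return num_rows * dims * bytes_per_value


train_gb = embedding_bytes(1_000_000) / 1e9
test_gb = embedding_bytes(200_000) / 1e9
print(f"train: {train_gb:.2f} GB, test: {test_gb:.2f} GB")
```

Under these assumptions the train embeddings come to about 3.07 GB and the test embeddings to about 0.61 GB, which is in line with the train split being uploaded as several shards.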