# Million Text Embeddings

A dataset of one million short English texts (news titles and descriptions of at most 350 characters) and their embeddings, computed with the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model.\
**Dimensions: 768**\
**Source: [sentence-transformers/agnews](https://huggingface.co/datasets/sentence-transformers/agnews)**

### Load source dataset

In [1]:
from datasets import load_dataset

In [2]:
dataset = load_dataset("sentence-transformers/agnews")["train"]

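# Collect both the title and the description of every row as candidate texts.
texts = []
for row in dataset:
    texts.append(row['title'])
    texts.append(row['description'])

# Keep texts of at most 350 characters, drop duplicates, and take one million.
# Note: set() ordering is arbitrary, so the selected subset can differ between runs.
texts = [t for t in texts if len(t) <= 350]
texts = list(set(texts))[:1_000_000]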
len(texts)

1000000
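
Because `set` does not preserve insertion order, the one million texts kept by the slice can differ from run to run. The following is a minimal sketch of an order-preserving alternative, shown on a toy list. The helper name `select_texts` and its parameters are hypothetical and not part of the original notebook.

```python
def select_texts(texts, max_chars=350, limit=1_000_000):
    """Filter by length, drop duplicates in first-seen order, and truncate."""
    short = (t for t in texts if len(t) <= max_chars)
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    unique = list(dict.fromkeys(short))
    return unique[:limit]


# Toy input: two duplicates and one text longer than 350 characters.
sample = ["a", "bb", "a", "x" * 400, "ccc", "bb"]
selected = select_texts(sample, max_chars=350, limit=2)
print(selected)
```

With this approach the same input list always yields the same subset, which may matter if the embeddings need to be regenerated later.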

### Compute embedding vectors

In [3]:
from sentence_transformers import SentenceTransformer

In [4]:
# Load the pretrained model; encode() returns one 768-dimensional vector per text.
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(texts, show_progress_bar=True)
len(embeddings)

Batches:   0%|          | 0/31250 [00:00<?, ?it/s]

1000000
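
The batch total reported by the progress bar follows from the batch size: `encode` uses a batch size of 32 unless told otherwise, so one million texts are split into 31,250 batches. The short calculation below reproduces that figure and adds a rough estimate of the memory held by the resulting array. The memory estimate assumes the embeddings are stored as 32-bit floats, which is an assumption and not something the notebook states.

```python
import math

num_texts = 1_000_000
batch_size = 32        # default batch_size of SentenceTransformer.encode
dimensions = 768
bytes_per_value = 4    # assumption: embeddings are stored as float32

# Number of batches the encoder has to process.
num_batches = math.ceil(num_texts / batch_size)

# Raw size of the full embedding matrix, ignoring any container overhead.
embedding_bytes = num_texts * dimensions * bytes_per_value

print(num_batches)
print(embedding_bytes / 1e9, "GB")
```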

### Create Dataset object and push to the Hugging Face Hub

In [5]:
from datasets import Dataset

In [6]:
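# Pair each text with its embedding, build a Dataset, and upload it to the Hub.
# push_to_hub requires being logged in to Hugging Face with write access.
data = {"text": texts, "embedding": embeddings}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("Sreenath/million-text-embeddings")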

Uploading the dataset shards:   0%|          | 0/7 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/143 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/665 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/Sreenath/million-text-embeddings/commit/4b0da94b53656dfafd0d426dc549af64ecf34e20', commit_message='Upload dataset', commit_description='', oid='4b0da94b53656dfafd0d426dc549af64ecf34e20', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/Sreenath/million-text-embeddings', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Sreenath/million-text-embeddings'), pr_revision=None, pr_num=None)
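
Each row of the published dataset has a `text` column and an `embedding` column, so one typical use is similarity search. The sketch below illustrates the idea with cosine similarity over three made-up, three-dimensional rows. The function names `cosine_similarity` and `top_k` and the toy vectors are hypothetical, and real rows would carry 768 values each.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, row["embedding"]), row["text"]) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy rows mimicking the dataset schema: {"text": ..., "embedding": ...}.
rows = [
    {"text": "stocks rally", "embedding": [1.0, 0.0, 0.0]},
    {"text": "markets climb", "embedding": [0.9, 0.1, 0.0]},
    {"text": "team wins final", "embedding": [0.0, 1.0, 0.0]},
]
results = top_k([1.0, 0.0, 0.0], rows, k=2)
print([text for _, text in results])
```

Rows loaded with `load_dataset("Sreenath/million-text-embeddings", split="train")` should fit the same shape. For the full million vectors, a vectorized or approximate nearest-neighbor library would likely be far more practical than this pure-Python loop.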