# Million Text Embeddings

A dataset of 1.2 million English sentences, each paired with its 768-dimensional embedding from the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model.\
**Train Set:	1,000,000**\
**Test Set:		200,000**\
**Dimensions:	768**\
**Source: [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences)**

### Load source dataset

In [1]:
from datasets import load_dataset
from langdetect import detect

In [2]:
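def filter_text(text: str) -> bool:
    """Return True for sentences to keep: no "...", at most 350 characters, detected as English."""
    # Drop sentences that contain "...".
    if "..." in text:
        return False
    # Drop sentences longer than 350 characters.
    if len(text) > 350:
        return False
    try:
        if detect(text) != 'en':
            return False
    except Exception:
        # langdetect raises when the text has no usable features (e.g. digits only); drop those too.
        return False

    return True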
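
To illustrate which sentences the filter keeps, here is a minimal, self-contained sketch of the same three rules. It is not the notebook's code: `make_filter` and `fake_detect` are hypothetical names, and `fake_detect` is only a crude stand-in for `langdetect.detect` so the example runs without that package.

```python
from typing import Callable

MAX_CHARS = 350  # length cutoff used by filter_text


def make_filter(detect_lang: Callable[[str], str]) -> Callable[[str], bool]:
    """Build a filter applying the three rules with a pluggable language detector."""

    def keep(text: str) -> bool:
        if "..." in text:
            return False
        if len(text) > MAX_CHARS:
            return False
        try:
            return detect_lang(text) == "en"
        except Exception:
            # A detector failure counts as "not English".
            return False

    return keep


def fake_detect(text: str) -> str:
    """Hypothetical stand-in for langdetect.detect (assumption, not the real detector)."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en" if text.isascii() else "other"


keep = make_filter(fake_detect)
samples = [
    "A family member may stay with a patient during recovery.",  # kept
    "The story continues...",                                    # contains "..."
    "x" * 351,                                                   # too long
    "12345",                                                     # detector raises
    "Müller wohnt in Köln.",                                     # not English
]
results = [keep(s) for s in samples]
print(results)  # [True, False, False, False, False]
```

Note that `langdetect` is non-deterministic by default; its documentation suggests setting `DetectorFactory.seed` before the first call when repeatable results matter.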

In [3]:
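dataset = load_dataset("agentlans/high-quality-english-sentences")
# Merge the source train and test splits; list() keeps the concatenation working
# whether the column comes back as a plain list or a lazy column object.
texts = list(dataset["train"]["text"]) + list(dataset["test"]["text"])
# Keep only the sentences that pass filter_text.
texts = list(filter(filter_text, texts))

# Take the first 1,000,000 filtered sentences for train and the next 200,000 for test.
train_texts = texts[:1_000_000]
test_texts = texts[1_000_000:1_200_000]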

print(texts[:5])

len(texts), len(train_texts), len(test_texts)

['Soon we dropped into a living forest, where cold-tolerant evergreens and boreal animals still evoke the Canadian heritage of an ecosystem pushed south by glaciers 20,000 years ago.', 'Annual population growth rate (2011 est., CIA World Factbook): 1.284%.', 'This has led to the recent banning of Neonics in the EU, however the US and Canada are still using this chemical pesticide.', "In addition, these colors weren't confined to a province but rather irregularly scattered across various regions over all of China.", 'A family member or a support person may stay with a patient during recovery.']


(1672314, 1000000, 200000)
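
The counts above imply that part of the filtered pool ends up in neither split. A quick arithmetic check, using only the numbers printed by the cell:

```python
total_filtered = 1_672_314  # len(texts) after filtering
train_size = 1_000_000      # texts[:1_000_000]
test_size = 200_000         # texts[1_000_000:1_200_000]

# Sentences that passed the filter but fall outside both slices.
unused = total_filtered - train_size - test_size
used_fraction = (train_size + test_size) / total_filtered

print(unused)                  # 472314
print(f"{used_fraction:.1%}")  # 71.8%
```

Because the two splits are consecutive slices of the same list, they do not overlap.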

### Compute embedding vectors

In [4]:
from sentence_transformers import SentenceTransformer

In [5]:
model = SentenceTransformer("all-mpnet-base-v2")

# encode returns one 768-dimensional vector per input sentence.
train_embeddings = model.encode(train_texts, show_progress_bar=True)
test_embeddings = model.encode(test_texts, show_progress_bar=True)

len(train_embeddings), len(test_embeddings)

(1000000, 200000)
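
As a rough sizing check, the sketch below works out how many batches `encode` processes and how much memory the resulting arrays occupy. It assumes the library's default `batch_size` of 32 and float32 output, neither of which is stated in the notebook; `encode_cost` is a hypothetical helper.

```python
import math

BATCH_SIZE = 32      # assumed: default batch_size of SentenceTransformer.encode
DIMENSIONS = 768     # embedding width of all-mpnet-base-v2
BYTES_PER_FLOAT = 4  # assumed: float32 output


def encode_cost(n_sentences: int) -> tuple[int, float]:
    """Return (number of batches, size of the embedding array in GiB)."""
    batches = math.ceil(n_sentences / BATCH_SIZE)
    gib = n_sentences * DIMENSIONS * BYTES_PER_FLOAT / 2**30
    return batches, gib


train_batches, train_gib = encode_cost(1_000_000)
test_batches, test_gib = encode_cost(200_000)
print(train_batches, round(train_gib, 2))  # 31250 2.86
print(test_batches, round(test_gib, 2))    # 6250 0.57
```

Under these assumptions both arrays fit in memory on a typical workstation, which is why the notebook can encode each split in a single call.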

### Create Dataset objects and push to Hugging Face

In [6]:
from datasets import Dataset

In [7]:
# Pair each sentence with its embedding and upload it as the "train" split.
train_data = {"text": train_texts, "embedding": train_embeddings}
train_dataset = Dataset.from_dict(train_data)
train_dataset.push_to_hub("Sreenath/million-text-embeddings", split="train")

# Same for the "test" split, pushed to the same repository.
test_data = {"text": test_texts, "embedding": test_embeddings}
test_dataset = Dataset.from_dict(test_data)
test_dataset.push_to_hub("Sreenath/million-text-embeddings", split="test")

CommitInfo(commit_url='https://huggingface.co/datasets/Sreenath/million-text-embeddings/commit/7dc0f8da10ccb79d1104e7902b831c19e5275384', commit_message='Upload dataset', commit_description='', oid='7dc0f8da10ccb79d1104e7902b831c19e5275384', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/Sreenath/million-text-embeddings', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Sreenath/million-text-embeddings'), pr_revision=None, pr_num=None)
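
Once published, the stored vectors can be compared directly without re-running the model. The sketch below shows the idea with cosine similarity over a few toy 3-dimensional vectors standing in for the 768-dimensional embeddings. The rows mirror the dataset's `text` and `embedding` columns, but `cosine_similarity`, `nearest`, and the toy data are hypothetical; real use would load the rows from the Hub and would likely rely on NumPy for speed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, rows, k=1):
    """Return the texts of the k rows whose embeddings are most similar to query."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query, row["embedding"]),
        reverse=True,
    )
    return [row["text"] for row in ranked[:k]]


# Toy rows (assumption): real rows hold 768 floats per embedding.
rows = [
    {"text": "sentence A", "embedding": [1.0, 0.0, 0.0]},
    {"text": "sentence B", "embedding": [0.0, 1.0, 0.0]},
    {"text": "sentence C", "embedding": [0.7, 0.7, 0.0]},
]

top_two = nearest([0.9, 0.1, 0.0], rows, k=2)
print(top_two)  # ['sentence A', 'sentence C']
```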