# Text Classification - Vanilla Embeddings

----



## $\color{blue}{Sections:}$
* Preamble
* Admin - mount the drive and install libraries
* Load - load the DataFrames and docs from disk
* Embeddings - create the embeddings
* Save - save the embeddings on dataframes and docs

## $\color{blue}{Preamble:}$

This notebook creates embeddings with the 'thenlper/gte-base' model and stores them on the DataFrames and docs.

## $\color{blue}{Admin:}$


In [None]:
from google.colab import drive

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'


Mounted at /content/drive
/content/drive/MyDrive


In [None]:
%%capture
!pip install sentence-transformers huggingface_hub

In [None]:
%%capture
!pip install dill
!pip install langchain

## $\color{blue}{Load:}$

In [None]:
import pandas as pd
path = "class/datasets/"
df_train = pd.read_pickle(path + "df_train")
df_dev = pd.read_pickle(path + "df_dev")
df_test = pd.read_pickle(path + "df_test")

In [None]:
import dill
def save_langchain_docs(docs, filename):
    """Save a list of Langchain Documents to a .dill file."""
    with open(filename, 'wb') as f:
        dill.dump(docs, f)
    print(f"Documents saved to {filename}")

def load_langchain_docs(filename):
    """Load a list of Langchain Documents from a .dill file."""
    with open(filename, 'rb') as f:
        docs = dill.load(f)
    print(f"Documents loaded from {filename}")
    return docs

In [None]:
docs_train = load_langchain_docs(path + "docs_train")
docs_dev = load_langchain_docs(path + "docs_dev")
docs_test = load_langchain_docs(path + "docs_test")

Documents loaded from class/datasets/docs_train
Documents loaded from class/datasets/docs_dev
Documents loaded from class/datasets/docs_test


## $\color{blue}{Embeddings:}$

In [None]:
from getpass import getpass
from huggingface_hub import login

# Prompt for your Hugging Face token without echoing it to the screen
token = getpass("Please enter your Hugging Face token: ")

Please enter your Hugging Face token: ··········


In [None]:
# Use the token for Hugging Face login
if token:
    print("HuggingFace token has been successfully entered.")
    login(token=token)
else:
    print("Continuing without Hugging Face login")

HuggingFace token has been successfully entered.
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

test_sentence = 'This is a test'
model = SentenceTransformer('thenlper/gte-base')
test_embedding = model.encode(test_sentence, convert_to_tensor=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
print(type(test_embedding))
test_embedding.size()

<class 'torch.Tensor'>


torch.Size([768])
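
The check above shows that a single sentence maps to a 768-dimensional vector. A minimal sketch of the same check applied to a whole list of embeddings is given below. The helper name `check_embedding_dims` and the stand-in vectors are hypothetical and not part of the original notebook; the only assumption is that each embedding supports `len()`, which holds for lists, NumPy arrays and 1-D torch tensors.

```python
def check_embedding_dims(embeddings, expected_dim=768):
    """Return the indices of embeddings whose length differs from expected_dim."""
    return [i for i, emb in enumerate(embeddings) if len(emb) != expected_dim]


# Stand-in vectors for illustration only; in the notebook this would be
# one of the per-split embedding lists.
sample = [[0.0] * 768, [0.0] * 768, [0.0] * 10]
bad_indices = check_embedding_dims(sample)
print(bad_indices)  # [2]
```

An empty result means every vector has the expected size, which is worth confirming before the embeddings are written back to disk.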

### $\color{red}{Test:}$

In [None]:
test_sentences = list(df_test['content'])
test_embeddings = []
for sent in tqdm(test_sentences):
  test_embeddings.append(model.encode(sent))

100%|██████████| 1000/1000 [00:11<00:00, 85.87it/s]


### $\color{red}{Dev:}$

In [None]:
dev_sentences = list(df_dev['content'])
dev_embeddings = []
for sent in tqdm(dev_sentences):
  dev_embeddings.append(model.encode(sent))

100%|██████████| 964/964 [00:11<00:00, 85.53it/s]


### $\color{red}{Train:}$

In [None]:
train_sentences = list(df_train['content'])
train_embeddings = []
for sent in tqdm(train_sentences):
  train_embeddings.append(model.encode(sent))

100%|██████████| 12000/12000 [02:20<00:00, 85.47it/s]
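
The three loops above call `model.encode` once per sentence. `SentenceTransformer.encode` also accepts a list of sentences together with a `batch_size` argument, so batching is a possible alternative; whether it is faster depends on the hardware and the sentence lengths. The sketch below illustrates the batching pattern with a hypothetical helper, `encode_in_batches`, and a stand-in encoder (`fake_encode`) so that it runs without downloading a model. Neither name comes from the original notebook.

```python
def encode_in_batches(encode, sentences, batch_size=32):
    """Encode sentences in fixed-size batches, preserving the input order.

    `encode` is any callable that maps a list of strings to a sequence of
    vectors, one per string (assumption: in the notebook, `model.encode`).
    """
    embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # extend() adds one vector per sentence in the batch
        embeddings.extend(encode(batch))
    return embeddings


def fake_encode(batch):
    """Stand-in encoder for illustration: one 1-d vector per sentence."""
    return [[float(len(s))] for s in batch]


vectors = encode_in_batches(fake_encode, ["a", "bb", "ccc"], batch_size=2)
print(len(vectors))  # 3
```

In the notebook, the call would look like `encode_in_batches(model.encode, train_sentences)`, and the result would again be a list with one embedding per row of the DataFrame.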


## $\color{blue}{Save:}$

### $\color{red}{Save-DataFrames:}$

In [None]:
# Attach the embeddings as a new column on each split
df_train['vanilla_embedding'] = train_embeddings
df_dev['vanilla_embedding'] = dev_embeddings
df_test['vanilla_embedding'] = test_embeddings

# Overwrite the original pickles so they now include the new column
path = "class/datasets/"
df_train.to_pickle(path + 'df_train')
df_dev.to_pickle(path + 'df_dev')
df_test.to_pickle(path + 'df_test')

### $\color{red}{Save-Docs:}$

In [None]:
for i in range(len(train_embeddings)):
  docs_train[i].metadata['vanilla_embedding'] = train_embeddings[i]
save_langchain_docs(docs_train, path + 'docs_train')

Documents saved to class/datasets/docs_train


In [None]:
for i in range(len(dev_embeddings)):
  docs_dev[i].metadata['vanilla_embedding'] = dev_embeddings[i]
save_langchain_docs(docs_dev, path + 'docs_dev')

Documents saved to class/datasets/docs_dev


In [None]:
for i in range(len(test_embeddings)):
  docs_test[i].metadata['vanilla_embedding'] = test_embeddings[i]
save_langchain_docs(docs_test, path + 'docs_test')

Documents saved to class/datasets/docs_test
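
The index-based loops above rely on each docs list having the same length and order as the corresponding embedding list. A minimal sketch of the same step with an explicit length check follows. `attach_embeddings` and `StubDoc` are hypothetical names introduced for illustration; `StubDoc` only mimics the `page_content` and `metadata` attributes used here and stands in for a Langchain Document.

```python
from dataclasses import dataclass, field


@dataclass
class StubDoc:
    """Stand-in for a Langchain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def attach_embeddings(docs, embeddings, key="vanilla_embedding"):
    """Store embeddings[i] in docs[i].metadata[key], refusing mismatched lengths."""
    if len(docs) != len(embeddings):
        raise ValueError(
            f"{len(docs)} docs but {len(embeddings)} embeddings; refusing to attach"
        )
    for doc, emb in zip(docs, embeddings):
        doc.metadata[key] = emb
    return docs


docs = [StubDoc("first text"), StubDoc("second text")]
attach_embeddings(docs, [[0.1, 0.2], [0.3, 0.4]])
print(docs[1].metadata["vanilla_embedding"])  # [0.3, 0.4]
```

Failing early on a length mismatch avoids saving docs whose embeddings are silently missing or shifted by one position.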
