<a href="https://colab.research.google.com/github/tokyo8182/LLM-RAG/blob/main/Embeddings_Inference_Time_Comparison_v1_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install llama-index
%pip install "llama-index-embeddings-huggingface"
%pip install xformers


## Setup + Data

In [2]:
import pandas as pd
import time
from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter

In [5]:

# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
def load_data(csv_path):
    return pd.read_csv(csv_path)['markdown_content'].tolist()

DATA_PATH = "/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/v2_new_LK_tea_dataset_updated.csv"
documents = load_data(DATA_PATH)
len(documents)

166

In [7]:
# drop nans
documents = [doc for doc in documents if isinstance(doc, str) and not pd.isna(doc)]
len(documents)

165

In [8]:
words_per_doc = [len(str(document).split(' ')) for document in documents]
len(words_per_doc), sum(words_per_doc), max(words_per_doc), min(words_per_doc)

(165, 167135, 21832, 99)

## Chunking

In [9]:
def generate_chunk_nodes(texts):
    # Text chunking
    splitter = SentenceSplitter(chunk_size=750, chunk_overlap=250)
    nodes = splitter.get_nodes_from_documents(
        [Document(text=text) for text in texts]
    )
    return nodes

chunk_nodes = generate_chunk_nodes(documents)

print(f"Total chunks created: {len(chunk_nodes)}")

Total chunks created: 575
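
With `chunk_size=750` and `chunk_overlap=250`, consecutive chunks share part of their text. The sketch below shows the overlap idea only. It is a simplified word-level sliding window, not `SentenceSplitter`'s actual algorithm, which counts tokens and prefers sentence boundaries. `sliding_window_chunks` is a hypothetical helper and the word list is synthetic.

```python
def sliding_window_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into overlapping windows (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Synthetic 2000-word "document" standing in for one entry of `documents`
words = [f"w{i}" for i in range(2000)]
chunks = sliding_window_chunks(words, chunk_size=750, chunk_overlap=250)
print(len(chunks), [len(c) for c in chunks])
```

Each window starts 500 words after the previous one, so the last 250 words of a chunk reappear at the start of the next.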


In [10]:
def measure_embedding_time(model_name, chunk_nodes):
    """Embed every chunk with `model_name`; return (elapsed seconds, count, embeddings)."""
    # Initialize the embedding model (model loading is not included in the timing)
    embed_model = HuggingFaceEmbedding(
        model_name=model_name,
        trust_remote_code=True,
        cache_folder="/content/drive/MyDrive/Omdena_Challenge/cached_models/",
    )

    # Time the embedding pass, one chunk at a time
    start_time = time.monotonic()
    embeddings = [embed_model.get_text_embedding(node.text) for node in chunk_nodes]
    end_time = time.monotonic()

    return end_time - start_time, len(embeddings), embeddings
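
The first call to a freshly loaded model can include one-time setup cost, which may skew a short benchmark. A minimal sketch of a model-agnostic timing helper with an untimed warm-up is shown below. `time_embedding_fn`, its `warmup` parameter, and `fake_embed` are hypothetical names introduced for illustration, and the stand-in embedding function exists only so the example runs without downloading a model.

```python
import time

def time_embedding_fn(embed_fn, texts, warmup=1):
    """Time embed_fn over texts, after a few untimed warm-up calls."""
    for text in texts[:warmup]:
        embed_fn(text)  # warm-up: result discarded, not timed
    start = time.perf_counter()
    embeddings = [embed_fn(text) for text in texts]
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast stand-in functions
    rate = len(embeddings) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate, embeddings

# Stand-in embedding function (assumption): returns a tiny fake vector
def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]

elapsed, rate, embeddings = time_embedding_fn(
    fake_embed, ["green tea", "black tea", "oolong"]
)
print(f"{len(embeddings)} embeddings in {elapsed:.6f} s")
```

To apply this to the notebook's models, one would pass `embed_model.get_text_embedding` as `embed_fn` and `[node.text for node in chunk_nodes]` as `texts`.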

## Embeddings Generation

### [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)

In [11]:
model_name = 'BAAI/bge-small-en-v1.5'
elapsed_time_bge, embd_count_bge, bge_embeddings = measure_embedding_time(model_name, chunk_nodes)


In [12]:
print(f"Generated {embd_count_bge} embeddings in {elapsed_time_bge:.4f} seconds")
print(f"Generation rate {(embd_count_bge/elapsed_time_bge):.4f} embeddings/second")

Generated 575 embeddings in 12.0393 seconds
Generation rate 47.7604 embeddings/second


### [stella_en_400M_v5](https://huggingface.co/dunzhang/stella_en_400M_v5)

In [13]:
model_name = 'dunzhang/stella_en_400M_v5'
elapsed_time_st, embd_count_st, st_embeddings = measure_embedding_time(model_name, chunk_nodes)


A new version of the following files was downloaded from https://huggingface.co/dunzhang/stella_en_400M_v5:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/dunzhang/stella_en_400M_v5:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



Some weights of the model checkpoint at dunzhang/stella_en_400M_v5 were not used when initializing NewModel: ['new.pooler.dense.bias', 'new.pooler.dense.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [14]:
print(f"Generated {embd_count_st} embeddings in {elapsed_time_st:.4f} seconds")
print(f"Generation rate {(embd_count_st/elapsed_time_st):.4f} embeddings/second")

Generated 575 embeddings in 74.7813 seconds
Generation rate 7.6891 embeddings/second
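
The two runs can be put side by side using only the counts and timings printed above. The sketch below recomputes the rates from those rounded figures, so the last decimal may differ slightly from the notebook output. The `results` dictionary is a hand-built summary, not something the notebook produces. Absolute timings depend on the hardware and runtime used, so the ratio is the more portable number.

```python
# Counts and elapsed times copied from the outputs of the two runs above
results = {
    "BAAI/bge-small-en-v1.5": {"count": 575, "seconds": 12.0393},
    "dunzhang/stella_en_400M_v5": {"count": 575, "seconds": 74.7813},
}

# Embeddings per second for each model
rates = {name: r["count"] / r["seconds"] for name, r in results.items()}
fastest = max(rates, key=rates.get)

for name, rate in rates.items():
    relative_time = rates[fastest] / rate  # 1.0 for the fastest model
    print(f"{name:<30} {rate:8.2f} emb/s   {relative_time:5.2f}x time per embedding")
```

In this run, `stella_en_400M_v5` took roughly six times as long per embedding as `bge-small-en-v1.5`.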
