# Exploring text-embeddings-inference Server

This notebook demonstrates how to use open source embedding model from text-embeddings-inference server in your RAG pipelines.

We are implementing two approaches:

1. Install text-embeddings-inference server on a local CPU, run evaluations to compare performance between two embedding models: inference server's bge-large-en-v1.5 vs OpenAI's text-embedding-ada-002.

2. Install text-embeddings-inference server on an AWS GPU EC2 instance, instance type g5.xlarge. We again run the evaluations to compare performance between the embedding models from the inference server and that from OpenAI.

* LLM is gpt-3.5-turbo 
* Use embedding model from inference server
* Use embedding model from OpenAI
* Apply EDD to compare the results

In [1]:
!pip install llama_index==0.8.48 pypdf sentence-transformers transformers httpx 



## Use Inference embedding model

### Define LLM gpt-3.5-turbo

In [2]:
import os
import logging
import sys

os.environ["OPENAI_API_KEY"] = "sk-############"

logging.basicConfig(stream=sys.stdout, level=logging.ERROR)

from llama_index.llms import OpenAI

# define LLM
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")

### Define Inference embedding model 

In [3]:
from llama_index.embeddings import TextEmbeddingsInference

embed_model = TextEmbeddingsInference(
    model_name="BAAI/bge-large-en-v1.5",
    base_url = "http://127.0.0.1:8080",
    #base_url = "http://ec2-##-##-##-##.compute-1.amazonaws.com:8080",
    timeout=60,  # timeout in seconds
    embed_batch_size=10,  # batch size for embedding
)

### Define service_context, load doc, parse nodes

In [4]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=256, chunk_overlap=20)

In [5]:
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
)
from llama_index.text_splitter import TokenTextSplitter
from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
documents = loader.load_data(pages=['Paleolithic diet'], auto_suggest=False)
print(f'Loaded {len(documents)} documents')

# #construct text splitter to split texts into chunks for processing
text_splitter = TokenTextSplitter(separator=" ", chunk_size=256, chunk_overlap=20)

#create node parser to parse nodes from document
node_parser = SimpleNodeParser(text_splitter=text_splitter)

nodes = node_parser.get_nodes_from_documents(documents)
print(f"loaded nodes with {len(nodes)} nodes")

Loaded 1 documents
loaded nodes with 14 nodes


### Construct index and query engine

In [6]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex(
    nodes=nodes,
    service_context=service_context,
    show_progress=True
)

query_engine = index.as_query_engine()

response = query_engine.query("What are some potential health improvements that may result from following the paleolithic diet?")
print(response)

Generating embeddings:   0%|          | 0/14 [00:00<?, ?it/s]

Following the paleolithic diet may potentially lead to weight loss and fat loss, as well as increased satiety from the foods typically eaten. However, it is important to note that there may also be potential nutritional deficiencies, such as those of vitamin D and calcium, which could compromise bone health. Additionally, there is a risk of ingesting toxins from high fish consumption.


## Use OpenAI embedding model

### Build index and query engine with OpenAI embedding model

In [7]:
from llama_index.embeddings import OpenAIEmbedding

service_context_openai = ServiceContext.from_defaults(llm=llm, embed_model=OpenAIEmbedding(), chunk_size=256, chunk_overlap=20)

index_openai = VectorStoreIndex(
    nodes=nodes,
    service_context=service_context_openai
)
query_engine_openai = index_openai.as_query_engine()

## Applying EDD (Eval-Driven Development) 

### Generate question dataset

In [8]:
from llama_index.evaluation import DatasetGenerator
import nest_asyncio

nest_asyncio.apply()

question_dataset = []
if os.path.exists("question_dataset.txt"):
    with open("question_dataset.txt", encoding='utf-8') as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(nodes)
    generated_questions = data_generator.generate_questions_from_nodes(num=30)
    print(f"Generated {len(generated_questions)} questions.")

    question_dataset.extend(generated_questions)
    
    # save the questions into a txt file
    with open("question_dataset.txt", "w") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")

for i, question in enumerate(question_dataset, start=1):
    print(f"{i}. {question}")

1. What did John Harvey Kellogg support in terms of diet?
2. What advancements in science have contributed to our understanding of early human diets?
3. What is the proposed role of cooked starches in the paleolithic diet?
4. Why is it difficult to devise an ideal diet by studying contemporary hunter-gatherers?
5. What is the main limitation of the data used in Cordain's book?
6. What is the emphasis of Loren Cordain's version of the paleolithic diet?
7. When did the Paleolithic diet start to gain popularity?
8. What was Richard Mackarness' stance on carbohydrates in his book "Eat Fat and Grow Slim"?
9. What technological developments would have been dropped if humans were not nutritionally adaptable?
10. How does the Paleolithic diet avoid food processing?
11. What are some of the key differences between the Paleolithic diet and modern diets?
12. According to Walter L. Voegtlin, what was the main component of the Stone Age diet?
13. How do modern domesticated plants and animals differ

### Define evaluators

In [9]:
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# use gpt-4 to evaluate
gpt4_service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=0.1, llm="gpt-4"))

faithfulness_gpt4 = FaithfulnessEvaluator(service_context=gpt4_service_context)
relevancy_gpt4 = RelevancyEvaluator(service_context=gpt4_service_context)

### Define BatchEvalRunner

In [10]:
from llama_index.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=4,
    show_progress=True
)

def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Correct: {correct}. Score: {score}")
    return score

### Evaluate OpenAI embedding model

In [11]:
eval_results = await runner.aevaluate_queries(
    query_engine_openai, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [00:21<00:00,  1.36it/s]
100%|██████████| 60/60 [00:09<00:00,  6.62it/s]

------------------
faithfulness Correct: 24. Score: 0.8
relevancy Correct: 28. Score: 0.9333333333333333





### Evaluate Inference server embedding model

In [12]:
eval_results = await runner.aevaluate_queries(
    query_engine, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [00:21<00:00,  1.37it/s]
100%|██████████| 60/60 [00:08<00:00,  7.00it/s]

------------------
faithfulness Correct: 25. Score: 0.8333333333333334
relevancy Correct: 28. Score: 0.9333333333333333



