# <span style="color: #2E86C1; font-weight: bold; font-size: 36px;">Smarter Food Choices with RAG using OpenVINO</span>

---


## <span style="color: #2980B9; font-weight: bold;">Project Overview</span>

**Retrieval-Augmented Generation (RAG) Assistant for Nutrition**

This project uses **LLaMA 3.1 8B Instruct**, optimized with the **Intel OpenVINO toolkit**, to build an assistant that helps users make informed dietary choices by reading food labels from images and answering questions about them.

---

### <span style="color: #2874A6; font-weight: bold;">Key Features and Technologies</span>

#### **1. Intel AI PC**  
- **Purpose and Role**: The Intel AI PC is a high-performance device built for AI and machine learning workloads, featuring a unique blend of CPUs, GPUs, and AI accelerators within Intel’s XPU architecture.
- **Benefits for the Project**: With Intel AI PC, the assistant runs complex deep learning models and real-time inference, ensuring smooth, immediate responses.
- **Why It Matters**: The Intel AI PC handles demanding tasks like RAG, OCR, and NLP, enabling a responsive and high-quality user experience.

---

#### **2. Intel OpenVINO Toolkit**  
- **Purpose and Role**: OpenVINO (Open Visual Inference and Neural network Optimization) toolkit optimizes and accelerates deep learning models on Intel hardware, supporting model optimization, quantization, and hardware-specific acceleration.
- **Benefits for the Project**: OpenVINO optimizes the LLaMA model and embedding pipeline for Intel GPUs, significantly improving inference times and reducing latency for instant user responses.
- **Why It Matters**: OpenVINO ensures the assistant meets real-time performance while maintaining accuracy, enhancing overall user experience.
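
To make the idea of INT8 weight quantization concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only: the function names are hypothetical, and it is not the algorithm OpenVINO or NNCF actually applies, which works per channel or per group and on real tensors.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# The rounding error per weight is bounded by half of the scale.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte plus a shared scale, which is why the compressed model file is much smaller than the original floating-point checkpoint.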

---

#### **3. OCR with PyTesseract**  
- **Purpose and Role**: Optical Character Recognition (OCR) with PyTesseract extracts text from images, particularly effective for recognizing nutritional labels.
- **Benefits for the Project**: PyTesseract processes various label fonts and formats, converting them into structured data crucial for the RAG assistant’s understanding of ingredients and nutritional content.
- **Why It Matters**: OCR integration enables users to analyze food labels easily by snapping a photo, providing a seamless and user-friendly experience.
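
The step from raw OCR text to structured data could look roughly like the sketch below. It assumes the label text has already been extracted (for example with `pytesseract.image_to_string`), and the sample text, the regular expression, and the `parse_label` helper are all hypothetical. Real labels are noisier and would need more tolerant parsing.

```python
import re
from typing import Dict

# Hypothetical OCR output; in practice this string would come from the OCR step.
SAMPLE_LABEL_TEXT = """Nutrition Facts
Calories 230
Total Fat 8g
Sodium 160mg
Total Carbohydrate 37g
Protein 3g"""

# One nutrient per line: a name, a number, and an optional unit.
LINE_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z ]+?)\s+(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|kcal)?$"
)


def parse_label(text: str) -> Dict[str, Dict[str, object]]:
    """Turn label lines such as 'Total Fat 8g' into {'total fat': {'amount': 8.0, 'unit': 'g'}}."""
    nutrients = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # lines without a number (for example the title) are skipped
            nutrients[match["name"].lower()] = {
                "amount": float(match["amount"]),
                "unit": match["unit"] or "",
            }
    return nutrients


parsed = parse_label(SAMPLE_LABEL_TEXT)
```

A dictionary like `parsed` can then be serialized into the prompt context or into a document for indexing.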

---

#### **4. LLaMA 3.1 8B Instruct**  
- **Purpose and Role**: The LLaMA model generates natural-sounding, contextually accurate text, answering user questions and offering nutritional insights.
- **Benefits for the Project**: As the assistant’s core, LLaMA provides contextual insights and suggestions, adding a human-like, interactive component.
- **Why It Matters**: LLaMA’s NLP capabilities ensure that users receive detailed, accurate responses, making it both informative and easy to engage with.

---

#### **5. LangChain Integration**  
- **Purpose and Role**: LangChain connects various AI system components, enabling smooth interactions between the language model, OCR, and other modules.
- **Benefits for the Project**: LangChain facilitates a seamless experience where OCR-extracted data flows into the LLaMA model, allowing users to ask questions about nutritional content and get accurate answers instantly.
- **Why It Matters**: LangChain simplifies backend architecture, allowing complex queries and interactions, ultimately enhancing usability.

---

<span style="font-size: 14px; font-style: italic;">Created for the Intel Student Ambassador Fall Hackathon 2024 by</span> **Ilias Amchichou**, **Ishparsh Uprety**, <span style="font-weight: bold;">and</span> **Visharad Kashyap**.


## Install dependencies and libraries

In [None]:
import os
import requests

from pip_helper import pip_install

os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false"

pip_install("--pre", "-U", "openvino>=2024.2.0", "--extra-index-url", "https://storage.openvinotoolkit.org/simple/wheels/nightly")
pip_install("openvino-tokenizers[transformers]", "--extra-index-url", "https://storage.openvinotoolkit.org/simple/wheels/nightly")
pip_install(
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
    "git+https://github.com/huggingface/optimum-intel.git",
    "nncf",
    "datasets",
    "accelerate",
    "gradio>=4.19",
    "onnx<1.16.2",
    "einops",
    "transformers_stream_generator",
    "tiktoken",
    "transformers>=4.43.1",
    "faiss-cpu",
    "sentence_transformers",
    "langchain>=0.2.0",
    "langchain-community>=0.2.15",
    "langchainhub",
    "unstructured",
    "scikit-learn",
    "python-docx",
    "pypdf",
    "pytesseract",
    "pillow",
)

## Import LLM configuration file and your document(s)

In [None]:
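import os
from pathlib import Path
import requests
import shutil
import io

config_shared_path = Path("../../utils/llm_config.py")
config_dst_path = Path("llm_config.py")
text_example_en_path = Path("text_example_en.pdf")

# Copy the shared LLM configuration next to the notebook so `llm_config` can be imported later.
if not config_dst_path.exists() and config_shared_path.exists():
    shutil.copy(config_shared_path, config_dst_path)

# The download URL for the example PDF is not defined in this notebook.
# Set it here, or place text_example_en.pdf next to the notebook.
text_example_en = None

if not text_example_en_path.exists():
    if text_example_en is None:
        raise FileNotFoundError(f"{text_example_en_path} is missing and no download URL is set in `text_example_en`")
    r = requests.get(url=text_example_en)
    r.raise_for_status()
    content = io.BytesIO(r.content)
    with open(text_example_en_path, "wb") as f:
        f.write(content.read())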

In [None]:
from pathlib import Path
import torch
import ipywidgets as widgets
from transformers import (
    TextIteratorStreamer,
    StoppingCriteria,
    StoppingCriteriaList,
)

## Select the language and the Large Language Model 

In [None]:
from llm_config import (
    SUPPORTED_EMBEDDING_MODELS,
    SUPPORTED_RERANK_MODELS,
    SUPPORTED_LLM_MODELS,
)

model_languages = list(SUPPORTED_LLM_MODELS)

model_language = widgets.Dropdown(
    options=model_languages,
    value=model_languages[0],
    description="Model Language:",
    disabled=True,
)

model_language

In [None]:
llm_model_ids = [model_id for model_id, model_config in SUPPORTED_LLM_MODELS[model_language.value].items() if model_config.get("rag_prompt_template")]

llm_model_id = widgets.Dropdown(
    options=llm_model_ids,
    value=llm_model_ids[-1],
    description="Model:",
    disabled=True,
)

llm_model_id

In [None]:
llm_model_configuration = SUPPORTED_LLM_MODELS[model_language.value][llm_model_id.value]
print(f"Selected LLM model {llm_model_id.value}")

## Prepare the model for LLM conversion and Weights Compression using Optimum-CLI

In [None]:
from IPython.display import Markdown, display

prepare_int8_model = widgets.Checkbox(
    value=True,
    description="Prepare INT8 model",
    disabled=True,
)


display(prepare_int8_model)

In [None]:
enable_awq = widgets.Checkbox(
    value=True,
    description="Enable AWQ",
    disabled=prepare_int8_model.value,
)
display(enable_awq)

In [None]:
pt_model_id = llm_model_configuration["model_id"]
pt_model_name = llm_model_id.value.split("-")[0]

int8_model_dir = Path(llm_model_id.value) / "INT8_compressed_weights"

def convert_to_int8():
    if (int8_model_dir / "openvino_model.xml").exists():
        return
    int8_model_dir.mkdir(parents=True, exist_ok=True)
    remote_code = llm_model_configuration.get("remote_code", False)
    export_command_base = "optimum-cli export openvino --model {} --task text-generation-with-past --weight-format int8".format(pt_model_id)
    if remote_code:
        export_command_base += " --trust-remote-code"
    export_command = export_command_base + " " + str(int8_model_dir)
    display(Markdown("**Export command:**"))
    display(Markdown(f"`{export_command}`"))
    ! $export_command



if prepare_int8_model.value:
    convert_to_int8()

In [None]:
int8_weights = int8_model_dir / "openvino_model.bin"
for precision, compressed_weights in zip([8], [int8_weights]):
    if compressed_weights.exists():
        print(f"Size of model with INT{precision} compressed weights is {compressed_weights.stat().st_size / 1024 / 1024:.2f} MB")

## Select the embedding model

In [None]:
embedding_model_id = list(SUPPORTED_EMBEDDING_MODELS[model_language.value])

embedding_model_id = widgets.Dropdown(
    options=embedding_model_id,
    value=embedding_model_id[0],
    description="Embedding Model:",
    disabled=True,
)

embedding_model_id

In [None]:
embedding_model_configuration = SUPPORTED_EMBEDDING_MODELS[model_language.value][embedding_model_id.value]
print(f"Selected {embedding_model_id.value} model")

In [None]:
export_command_base = "optimum-cli export openvino --model {} --task feature-extraction".format(embedding_model_configuration["model_id"])
export_command = export_command_base + " " + str(embedding_model_id.value)

if not Path(embedding_model_id.value).exists():
    ! $export_command

## Select the device to load the Embedding model

In [None]:
from notebook_utils import device_widget

embedding_device = device_widget()

embedding_device

In [None]:
print(f"Embedding model will be loaded to {embedding_device.value} device for text embedding")

In [None]:
from notebook_utils import optimize_bge_embedding

USING_NPU = embedding_device.value == "NPU"

npu_embedding_dir = embedding_model_id.value + "-npu"
npu_embedding_path = Path(npu_embedding_dir) / "openvino_model.xml"
if USING_NPU and not Path(npu_embedding_dir).exists():
    shutil.copytree(embedding_model_id.value, npu_embedding_dir)
    optimize_bge_embedding(Path(embedding_model_id.value) / "openvino_model.xml", npu_embedding_path)

## Select the device to load the Large Language Model

In [None]:
from notebook_utils import device_widget

llm_device = device_widget("GPU")

llm_device

In [None]:
print(f"LLM model will be loaded to {llm_device.value} device for response generation")

## Loading the Hugging Face embedding model using OpenVINO through OpenVINOBgeEmbeddings

In [None]:
from langchain_community.embeddings import OpenVINOBgeEmbeddings

embedding_model_name = npu_embedding_dir if USING_NPU else embedding_model_id.value
batch_size = 1 if USING_NPU else 4
embedding_model_kwargs = {"device": embedding_device.value, "compile": False}
encode_kwargs = {
    "mean_pooling": embedding_model_configuration["mean_pooling"],
    "normalize_embeddings": embedding_model_configuration["normalize_embeddings"],
    "batch_size": batch_size,
}

embedding = OpenVINOBgeEmbeddings(
    model_name_or_path=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=encode_kwargs,
)
if USING_NPU:
    embedding.ov_model.reshape(1, 512)
embedding.ov_model.compile()

text = "This is a test document."
embedding_result = embedding.embed_query(text)
embedding_result[:3]

In [None]:
available_models = []
if int8_model_dir.exists():
    available_models.append("INT8")

model_to_run = widgets.Dropdown(
    options=available_models,
    value=available_models[0],
    description="Model to run:",
    disabled=True,
)

model_to_run

## Triggering OpenVINO as backend inference framework

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

import openvino.properties as props
import openvino.properties.hint as hints
import openvino.properties.streams as streams



model_dir = int8_model_dir

ov_config = {hints.performance_mode(): hints.PerformanceMode.LATENCY, streams.num(): "1", props.cache_dir(): ""}

llm = HuggingFacePipeline.from_model_id(
    model_id=str(model_dir),
    task="text-generation",
    backend="openvino",
    model_kwargs={
        "device": llm_device.value,
        "ov_config": ov_config,
        "trust_remote_code": True,
    },
    pipeline_kwargs={"max_new_tokens": 2},
)

if llm.pipeline.tokenizer.eos_token_id:
    llm.pipeline.tokenizer.pad_token_id = llm.pipeline.tokenizer.eos_token_id

llm.invoke("2 + 2 =")

## Running QA over Document
A typical RAG application has two main components:

- **Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

- **Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

**Indexing**

1. `Load`: First we need to load our data. We’ll use DocumentLoaders for this.
2. `Split`: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.
3. `Store`: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

![Indexing pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/dfed2ba3-0c3a-4e0e-a2a7-01638730486a)
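
The `Split` step can be illustrated with a minimal sketch of fixed-size chunking with overlap. This is a simplification: `split_with_overlap` is a hypothetical helper that cuts on raw character offsets, whereas the `RecursiveCharacterTextSplitter` used later in this notebook prefers to break on separators such as paragraphs and sentences.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
```

The overlap means that a sentence cut at a chunk boundary still appears in part in the neighboring chunk, which helps retrieval find it.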

**Retrieval and generation**

1. `Retrieve`: Given a user input, relevant splits are retrieved from storage using a Retriever.
2. `Generate`: An LLM produces an answer using a prompt that includes the question and the retrieved data.

![Retrieval and generation pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/f0545ddc-c0cd-4569-8c86-9879fdab105a)
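
The `Retrieve` and `Generate` steps can be sketched without any model, using word counts in place of learned embeddings. Everything below (the helper names, the sample chunks, the prompt layout) is hypothetical. The notebook itself uses an OpenVINO embedding model, a FAISS index, and the model's own RAG prompt template.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Stand-in for an embedding: count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], top_k: int) -> List[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, bag_of_words(chunk)), reverse=True)
    return ranked[:top_k]


chunks = [
    "oats are high in fiber",
    "soda is high in added sugar",
    "olive oil contains unsaturated fat",
]
question = "which food is high in fiber"
context = retrieve(question, chunks, top_k=1)
# The retrieved text is placed in the prompt ahead of the question.
prompt = f"Context: {' '.join(context)}\nQuestion: {question}\nAnswer:"
```

The resulting `prompt` string is what would be passed to the LLM in the generation step.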

In [None]:
import re
from typing import List
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
)
from langchain.document_loaders import (
    CSVLoader,
    EverNoteLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredEPubLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredODTLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)


TEXT_SPLITERS = {
    "Character": CharacterTextSplitter,
    "RecursiveCharacter": RecursiveCharacterTextSplitter,
    "Markdown": MarkdownTextSplitter,
}


LOADERS = {
    ".csv": (CSVLoader, {}),
    ".doc": (UnstructuredWordDocumentLoader, {}),
    ".docx": (UnstructuredWordDocumentLoader, {}),
    ".enex": (EverNoteLoader, {}),
    ".epub": (UnstructuredEPubLoader, {}),
    ".html": (UnstructuredHTMLLoader, {}),
    ".md": (UnstructuredMarkdownLoader, {}),
    ".odt": (UnstructuredODTLoader, {}),
    ".pdf": (PyPDFLoader, {}),
    ".ppt": (UnstructuredPowerPointLoader, {}),
    ".pptx": (UnstructuredPowerPointLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
}

text_example_path = "text_example_en.pdf"

In [None]:
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.docstore.document import Document
from langchain.retrievers import ContextualCompressionRetriever
from threading import Thread
import gradio as gr

stop_tokens = llm_model_configuration.get("stop_tokens")
rag_prompt_template = llm_model_configuration["rag_prompt_template"]


class StopOnTokens(StoppingCriteria):
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


if stop_tokens is not None:
    if isinstance(stop_tokens[0], str):
        stop_tokens = llm.pipeline.tokenizer.convert_tokens_to_ids(stop_tokens)

    stop_tokens = [StopOnTokens(stop_tokens)]


def load_single_document(file_path: str) -> List[Document]:
    """
    helper for loading a single document

    Params:
      file_path: document path
    Returns:
      documents loaded

    """
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADERS:
        loader_class, loader_args = LOADERS[ext]
        loader = loader_class(file_path, **loader_args)
        return loader.load()

    raise ValueError(f"Unsupported file extension '{ext}'")


def default_partial_text_processor(partial_text: str, new_text: str):
    """
    helper for updating partially generated answer, used by default

    Params:
      partial_text: text buffer for storing previously generated text
      new_text: text update for the current step
    Returns:
      updated text string

    """
    partial_text += new_text
    return partial_text


text_processor = llm_model_configuration.get("partial_text_processor", default_partial_text_processor)


def create_vectordb(
    docs, spliter_name, chunk_size, chunk_overlap, vector_search_top_k, vector_rerank_top_n, run_rerank, search_method, score_threshold, progress=gr.Progress()
):
    """
    Initialize a vector database

    Params:
      docs: original documents provided by the user
      spliter_name: name of the text splitter to use
      chunk_size: size of a single text chunk
      chunk_overlap: overlap size between two consecutive chunks
      vector_search_top_k: number of chunks returned by vector search
      vector_rerank_top_n: number of chunks kept after reranking
      run_rerank: whether to run the reranker
      search_method: vector search method
      score_threshold: score threshold used with the 'similarity_score_threshold' method

    """
    global db
    global retriever
    global combine_docs_chain
    global rag_chain

    if vector_rerank_top_n > vector_search_top_k:
        gr.Warning("Search top k must be >= Rerank top n")

    documents = []
    for doc in docs:
        if type(doc) is not str:
            doc = doc.name
        documents.extend(load_single_document(doc))

    text_splitter = TEXT_SPLITERS[spliter_name](chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    texts = text_splitter.split_documents(documents)
    db = FAISS.from_documents(texts, embedding)
    if search_method == "similarity_score_threshold":
        search_kwargs = {"k": vector_search_top_k, "score_threshold": score_threshold}
    else:
        search_kwargs = {"k": vector_search_top_k}
    retriever = db.as_retriever(search_kwargs=search_kwargs, search_type=search_method)
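    if run_rerank:
        # No reranker is created earlier in this notebook, so fall back to plain vector search when it is missing.
        compressor = globals().get("reranker")
        if compressor is None:
            gr.Warning("Reranker is not loaded; using vector search results only")
        else:
            compressor.top_n = vector_rerank_top_n
            retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)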
    prompt = PromptTemplate.from_template(rag_prompt_template)
    combine_docs_chain = create_stuff_documents_chain(llm, prompt)

    rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

    return "Vector database is Ready"


def update_retriever(vector_search_top_k, vector_rerank_top_n, run_rerank, search_method, score_threshold):
    """
    Update retriever

    Params:
      vector_search_top_k: Vector search top k
      vector_rerank_top_n: Search rerank top n
      run_rerank: whether to run the reranker
      search_method: top k search method
      score_threshold: score threshold when selecting 'similarity_score_threshold' method

    """
    global db
    global retriever
    global combine_docs_chain
    global rag_chain

    if vector_rerank_top_n > vector_search_top_k:
        gr.Warning("Search top k must be >= Rerank top n")

    if search_method == "similarity_score_threshold":
        search_kwargs = {"k": vector_search_top_k, "score_threshold": score_threshold}
    else:
        search_kwargs = {"k": vector_search_top_k}
    retriever = db.as_retriever(search_kwargs=search_kwargs, search_type=search_method)
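    if run_rerank:
        # No reranker is created earlier in this notebook, so fall back to plain vector search when it is missing.
        compressor = globals().get("reranker")
        if compressor is None:
            gr.Warning("Reranker is not loaded; using vector search results only")
        else:
            compressor.top_n = vector_rerank_top_n
            retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)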
    rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

    return "Vector database is Ready"


def bot(history, temperature, top_p, top_k, repetition_penalty, hide_full_prompt, do_rag):
    """
    callback function for running chatbot on submit button click

    Params:
      history: conversation history
      temperature: controls the level of creativity in AI-generated text.
                    Adjusting `temperature` reshapes the model's probability distribution, making the text more focused or more diverse.
      top_p: limits sampling to the smallest set of tokens whose cumulative probability reaches `top_p`.
      top_k: limits sampling to the `top_k` tokens with the highest probability.
      repetition_penalty: penalizes tokens that have already appeared in the text.
      hide_full_prompt: whether to omit the prompt from the streamed output.
      do_rag: whether to use RAG when generating text.

    """
    streamer = TextIteratorStreamer(
        llm.pipeline.tokenizer,
        timeout=3600.0,
        skip_prompt=hide_full_prompt,
        skip_special_tokens=True,
    )
    pipeline_kwargs = dict(
        max_new_tokens=512,
        temperature=temperature,
        do_sample=temperature > 0.0,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
        streamer=streamer,
    )
    if stop_tokens is not None:
        pipeline_kwargs["stopping_criteria"] = StoppingCriteriaList(stop_tokens)

    llm.pipeline_kwargs = pipeline_kwargs
    question = history[-1][0]
    if do_rag:
        # Retrieve supporting chunks only; the answer itself is generated once, in the thread below.
        retrieved_docs = retriever.invoke(question)
        context_text = " ".join(doc.page_content for doc in retrieved_docs)
    else:
        context_text = ""
    input_text = rag_prompt_template.format(input=question, context=context_text)
    t1 = Thread(target=llm.invoke, args=(input_text,))
    t1.start()

    partial_text = ""
    for new_text in streamer:
        partial_text = text_processor(partial_text, new_text)
        history[-1][1] = partial_text
        yield history


def request_cancel():
    llm.pipeline.model.request.cancel()


create_vectordb(
    [text_example_path],
    "RecursiveCharacter",
    chunk_size=400,
    chunk_overlap=50,
    vector_search_top_k=10,
    vector_rerank_top_n=2,
    run_rerank=True,
    search_method="similarity_score_threshold",
    score_threshold=0.5,
)

## Creating a UI using Gradio

In [None]:
from gradio_helper import make_demo

demo = make_demo(
    load_doc_fn=create_vectordb,
    run_fn=bot,
    stop_fn=request_cancel,
    update_retriever_fn=update_retriever,
    model_name=llm_model_id.value,
)

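try:
    demo.launch(share=True)
except Exception:
    # Fall back to a local-only launch if the public share link cannot be created.
    demo.launch()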

In [None]:
demo.close()