My fine-tuned Llama model does not greet back. #12272
Replies: 2 comments 4 replies
-
Hey @Areej17-01! Great to see you back around these parts. Ready for another adventure in code? 🚀

Based on the code you've shared, it seems like the Llama model is not correctly interpreting the system prompt. The system prompt is designed to guide the model's behavior, but it's not a strict rule that the model will always follow. The model might still generate unexpected responses if it doesn't understand the prompt or if the prompt is not specific enough.

To address this issue, you could try refining the system prompt to make it more explicit. For example, you could specify that the model should always respond with a greeting when it receives a greeting. Here's an example of how you might modify the system prompt:

system_prompt = """[INST] <<SYS>>
You are a helpful, respectful, and honest MULTILINGUAL assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If the user greets you with a simple "hello," "hi," "hey," or similar greeting, regardless of the number of letters repeated or the informality of the greeting, respond with a friendly greeting in return. Do not provide any other information unless specifically asked by the user.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something incorrect. If you don't know the answer to a question, please don't share false information.
Your goal is to provide answers relating to the documents and context provided about 'context'.
[/INST] <</SYS>>
"""

In this modified prompt, the instruction to respond with a greeting when greeted is more explicit, which might help guide the model's behavior more effectively. If refining the system prompt doesn't resolve the issue, you might need to implement a preprocessing step to handle greetings before they are passed to the model. This could involve identifying greetings in the user's input and replacing them with a standardized response, as suggested in the previous response. Here's an example of how you might implement such a preprocessing step:

import re
def preprocess_input(input_text):
    # Define a regular expression pattern for greetings
    greetings_pattern = r"\b(hi|hello|hey|greetings)\b"
    # Check if the input text matches the greetings pattern
    if re.search(greetings_pattern, input_text, re.IGNORECASE):
        # Standardize or replace the greeting with a predefined response or action
        standardized_response = "Hello! How can I assist you today?"
        return standardized_response
    else:
        # If no greeting is detected, return the original input text
        return input_text
# Example usage
user_input = "hi"
processed_input = preprocess_input(user_input)
print(processed_input)  # Outputs: "Hello! How can I assist you today?"

In this example, the preprocessing step catches the greeting before it ever reaches the model and returns a standardized response instead. I hope this helps! If you have any further questions or issues, please don't hesitate to ask.
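To show where this check would sit in the overall flow, here is a minimal sketch of calling it in front of a LlamaIndex query engine. The wrapper function answer and the variable query_engine are illustrative assumptions rather than part of your code; query_engine is whatever you get back from index.as_query_engine():

import re

GREETINGS_PATTERN = r"\b(hi|hello|hey|greetings)\b"

def answer(user_input, query_engine):
    # A bare greeting is answered directly and never reaches retrieval or the LLM
    if re.search(GREETINGS_PATTERN, user_input, re.IGNORECASE):
        return "Hello! How can I assist you today?"
    # Everything else goes through the normal query path
    return str(query_engine.query(user_input))

Because the greeting never reaches the model, it cannot hallucinate a context-based answer for "hi", and the system prompt no longer has to carry that rule on its own.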
-
Hello @dosu-bot. I have a problem with my chatbot. We are using Llama 3.1 as the LLM in our RAG application, but it is not picking up our prompt when retrieving the response. If the query is "Hi", "Hello", or "What's up?", it does not use the prompt from the template in which we tell it to reply "Hi! How I can assist you today?"; instead it replies with the previous query's response from the session context. Please resolve my issue, @dosu-bot. Below is my code:

qa_prompt = """
Task: Respond to the user query using the provided context, incorporating inline citations in the format [source_id] only when the <source_id> tag is explicitly provided in the context.

Guidelines:

Example of Citation: If the user asks about a specific topic and the information is found in "whitepaper.pdf" with a provided <source_id>, the response should include the citation like so:

Output: Provide a clear and direct response to the user's query, including inline citations in the format [source_id] only when the <source_id> tag is present in the context.

{retrieved_chunks}

<user_query>

Below is the pipeline of RAG for answer retrieval:

import logging

class StreamPipeline:
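One way to keep greetings from falling through to the retrieval prompt (and from echoing the previous answer held in the session context) is to branch on them before the RAG chain runs at all. Below is a minimal sketch of that idea; the run method, the query_engine attribute, and the greeting pattern are illustrative assumptions, since the full StreamPipeline code is not shown here:

import logging
import re

GREETING_RE = re.compile(r"^\s*(hi|hello|hey|what's up)\W*$", re.IGNORECASE)

class StreamPipeline:
    def __init__(self, query_engine):
        # Illustrative: whatever object actually runs the RAG query with qa_prompt
        self.query_engine = query_engine
        self.log = logging.getLogger(__name__)

    def run(self, user_query: str) -> str:
        # A bare greeting is answered from the fixed template and never touches
        # retrieval or the previous session context.
        if GREETING_RE.match(user_query):
            self.log.info("Greeting detected, skipping retrieval")
            return "Hi! How I can assist you today?"
        # Everything else goes through the normal RAG path.
        return str(self.query_engine.query(user_query))

This also avoids relying on Llama 3.1 to notice the greeting rule inside qa_prompt, which it may simply ignore once the session history grows.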
-
Hi, I am new to RAG and LLMs. I am using the Llama model TheBloke/Llama-2-7B-Chat-GGUF for RAG, and it performs well at information retrieval. I am utilizing the LlamaIndex query engine. However, whenever the user says "hi," "hello," or any other greeting, it hallucinates and replies with a self-fabricated answer. Is there a way to resolve this without creating separate responses for greetings? Any help would be appreciated. Thanks!
Code:
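A minimal sketch of one way to handle this without a separate greeting branch: tighten the answer prompt on the LlamaIndex query engine itself. This is illustrative rather than the setup above; it assumes index is an existing VectorStoreIndex, and text_qa_template is the standard hook for overriding the QA prompt:

from llama_index.core import PromptTemplate  # in older releases: from llama_index.prompts import PromptTemplate

# Illustrative QA template: tells the model what to do with bare greetings
# instead of letting it invent an answer from the retrieved context.
qa_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "If the query is only a greeting such as 'hi' or 'hello', reply with a short "
    "friendly greeting and nothing else. Otherwise, answer the query using only "
    "the context above, and say you don't know if the context doesn't cover it.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# `index` is assumed to be an existing VectorStoreIndex built over your documents.
query_engine = index.as_query_engine(text_qa_template=qa_template)
print(query_engine.query("hi"))

Even so, prompt instructions are only a soft constraint on a 7B chat model, so if it keeps hallucinating on greetings, a small check in front of the query engine (as in the first reply in this thread) is the more reliable fix.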