# Build RAG with Hugging Face and Milvus

_Authored by: [Chen Zhang](https://github.com/zc277584121)_


[Milvus](https://milvus.io/) is a popular open-source vector database that powers AI applications with highly performant and scalable vector similarity search. In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline with Hugging Face and Milvus.

The RAG system combines a retrieval system with an LLM. The system first retrieves relevant documents from a corpus using the Milvus vector database, then uses an LLM hosted on Hugging Face to generate answers based on the retrieved documents.

## Preparation
### Dependencies and Environment

In [1]:
! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm

> If you are using Google Colab, to enable the dependencies, you may need to **restart the runtime** (click on the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

In addition, we recommend configuring your [Hugging Face User Access Token](https://huggingface.co/docs/hub/security-tokens) and setting it as an environment variable, because we will use an LLM from the Hugging Face Hub. Without the token, your requests may be subject to a lower rate limit.

In [2]:
import os

os.environ["HF_TOKEN"] = "XXX"

### Prepare the data

We use the [AI Act PDF](https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf), a regulatory framework for AI with different risk levels corresponding to more or less regulation, as the private knowledge in our RAG.

In [3]:
%%bash

if [ ! -f "The-AI-Act.pdf" ]; then
    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
fi

We use the [`PyPDFLoader`](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/) from LangChain to extract the text from the PDF, and then split the text into smaller chunks. We set the chunk size to 1000 and the overlap to 200, which means each chunk holds at most about 1000 characters and consecutive chunks share up to 200 characters.

In [4]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("The-AI-Act.pdf")
docs = loader.load()
print(len(docs))

108


In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)

In [6]:
text_lines = [chunk.page_content for chunk in chunks]
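
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines, so its chunks are often shorter than the limit. The function name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Made-up text of 2600 characters, used only to show the window arithmetic.
sample = "".join(str(i % 10) for i in range(2600))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])             # [1000, 1000, 1000]
print(pieces[0][-200:] == pieces[1][:200])  # True: neighbours share 200 characters
```

A larger overlap lowers the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of storing more redundant text.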

### Prepare the Embedding Model
Define a function to generate text embeddings. We use the [BGE embedding model](https://huggingface.co/BAAI/bge-small-en-v1.5) as an example, but you can use any embedding model, such as those listed on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

In [7]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def emb_text(text):
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]

Generate a test embedding and print its dimension and first few elements.

In [8]:
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

384
[-0.07660680264234543, 0.025316733866930008, 0.012505539692938328, 0.004595177713781595, 0.025780005380511284, 0.038167089223861694, 0.08050810545682907, 0.00303537561558187, 0.02439219132065773, 0.004880349617451429]


## Load data into Milvus

### Create the Collection

In [9]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./hf_milvus_demo.db")

collection_name = "rag_collection"

> As for the arguments of `MilvusClient`:
> - Setting the `uri` to a local file, e.g. `./hf_milvus_demo.db`, is the most convenient method, as it automatically uses [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.
> - If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, use the server URI, e.g. `http://localhost:19530`, as your `uri`.
> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#cluster-details) in Zilliz Cloud.


Check if the collection already exists and drop it if it does.

In [10]:
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Create a new collection with specified parameters.

If we don't specify any field information, Milvus will automatically create a default `id` field for primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.

In [11]:
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
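
Inner product is a reasonable metric here because `emb_text` passes `normalize_embeddings=True`: for unit-length vectors, the inner product equals cosine similarity. A small sketch in plain Python illustrates this with made-up vectors rather than real embeddings; the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_normalized = dot(normalize(a), normalize(b))
print(math.isclose(ip_normalized, cosine(a, b)))  # True
```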

### Insert data
Iterate through the text lines, create embeddings, and then insert the data into Milvus.

Here we add a new field, `text`, that is not defined in the collection schema. It is automatically stored in the reserved JSON dynamic field and can be treated as a normal field at a high level.

In [12]:
from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

insert_res = milvus_client.insert(collection_name=collection_name, data=data)
insert_res["insert_count"]

Creating embeddings: 100%|██████████| 424/424 [01:15<00:00,  5.60it/s]


424
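
For a corpus of this size, a single `insert` call is fine. With many more chunks, it may be preferable to send rows in batches. The helper below is a hypothetical sketch of the batching logic only; the rows are placeholders without vectors, and the commented line marks where the existing `milvus_client.insert` call would go.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Placeholder rows standing in for the 424 embedded chunks.
rows = [{"id": i, "text": f"chunk {i}"} for i in range(424)]

batch_sizes = []
for batch in batched(rows, batch_size=100):
    # milvus_client.insert(collection_name=collection_name, data=batch)
    batch_sizes.append(len(batch))

print(batch_sizes)  # [100, 100, 100, 100, 24]
```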

## Build RAG

### Retrieve data for a query

Let's specify a question to ask about the corpus.

In [13]:
question = "What is the legal basis for the proposal?"

Search for the question in the collection and retrieve the top 3 semantic matches.

In [14]:
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question)
    ],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
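
Conceptually, the search scores every stored vector against the query vector and returns the `limit` highest-scoring entries. The brute-force sketch below shows that ranking on toy two-dimensional data. It is only an illustration of the idea, not how Milvus is implemented, and `top_k_by_inner_product` is a hypothetical helper.

```python
def top_k_by_inner_product(query, rows, k=3):
    """Return the k rows with the largest inner product against `query`."""
    scored = [
        {
            "id": row["id"],
            "distance": sum(q * v for q, v in zip(query, row["vector"])),
            "text": row["text"],
        }
        for row in rows
    ]
    # With the IP metric, a larger score means a closer match.
    return sorted(scored, key=lambda hit: hit["distance"], reverse=True)[:k]


# Toy rows in the same shape as the inserted data (id, vector, text).
rows = [
    {"id": 0, "vector": [1.0, 0.0], "text": "legal basis"},
    {"id": 1, "vector": [0.0, 1.0], "text": "risk levels"},
    {"id": 2, "vector": [0.6, 0.8], "text": "internal market"},
    {"id": 3, "vector": [-1.0, 0.0], "text": "unrelated"},
]

hits = top_k_by_inner_product([1.0, 0.0], rows, k=3)
print([hit["id"] for hit in hits])  # [0, 2, 1]
```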

Let's take a look at the search results for the query.


In [15]:
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

[
    [
        "EN 6  EN \n2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY \n2.1. Legal basis \nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \nensure the establishment and functioning of the internal market.  \nThis proposal constitutes a core part of the EU digital single market strategy. The primary \nobjective of this proposal is to ensure the proper functioning of the internal market by setting \nharmonised rules in particular on the development, placing on the Union market and the use \nof products and services making use of AI technologies or provided as stand -alone AI \nsystems. Some Member States are already considering national rules to ensure that AI is safe \nand is developed and used in compliance with fundamental rights obligations. This will likely \nlead to two main problems: i) a fragmentation of the internal market on essential elemen

### Use LLM to get an RAG response

Before composing the prompt for the LLM, let's first flatten the retrieved document list into a plain string.

In [16]:
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

Define the prompt for the language model. It is assembled with the documents retrieved from Milvus.

In [17]:
PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

We use the [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model, served through Hugging Face inference providers, to generate a response based on the prompt.

In [32]:
from huggingface_hub import InferenceClient

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(model=repo_id, timeout=120)

> This notebook also tried `meta-llama/Llama-3.2-1B-Instruct` and `bigscience/bloom-560m` as `repo_id`. Calling `llm_client.text_generation(prompt, max_new_tokens=10)` failed in these runs: with `bigscience/bloom-560m` it raised `StopIteration`, and with Mixtral it raised `ValueError: Model mistralai/Mixtral-8x7B-Instruct-v0.1 is not supported for task text-generation and provider together. Supported task: conversational.` The call below therefore uses `chat_completion`.

Finally, we can format the prompt and generate the answer.

In [66]:
prompt = PROMPT.format(context=context, question=question)

In [67]:
# The provider serves this model only for the conversational task,
# so we call `chat_completion` rather than `text_generation`.
response = llm_client.chat_completion(
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,
)
answer = response.choices[0].message.content.strip()
print(answer)
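
Providers that serve a model only for the conversational task expect a list of role-tagged messages rather than a raw prompt string. A minimal sketch of that wrapping follows, with a hypothetical helper name `build_messages` and a placeholder context string. The template is repeated so the block runs on its own.

```python
PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""


def build_messages(context, question):
    """Wrap the filled RAG prompt as a single user message."""
    content = PROMPT.format(context=context, question=question).strip()
    return [{"role": "user", "content": content}]


messages = build_messages(
    context="The legal basis for the proposal is Article 114 TFEU.",  # placeholder context
    question="What is the legal basis for the proposal?",
)
print(messages[0]["role"])  # user

# The list can then be passed to the client, for example:
# response = llm_client.chat_completion(messages=messages, max_tokens=200)
```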

Congratulations! You have built an RAG pipeline with Hugging Face and Milvus.

In [37]:
from huggingface_hub import InferenceClient
import os

# Ensure you have HF_TOKEN environment variable set if using private models or for higher rate limits
# or run: huggingface-cli login

try:
    # Replace with a model that has a public inference API or your own endpoint
    # For example, using a popular open model:
    MODEL_ID = "gpt2"
    # Or if you have a dedicated Inference Endpoint URL:
    # MODEL_ID = "https://your-api-id.region.provider.huggingface.cloud"

    llm_client = InferenceClient(
        model=MODEL_ID,
        # token=os.getenv("HF_TOKEN") # Add if needed
    )

    prompt = "What is the capital of France?"
    answer = llm_client.text_generation(
        prompt,
        max_new_tokens=10,
    ).strip()
    print(f"Prompt: {prompt}")
    print(f"Answer: {answer}")

except StopIteration:
    print(f"StopIteration error! Likely failed to find an inference provider for the model.")
    print(f"Check if the model '{MODEL_ID}' is correct and has an available Inference API or if you're using a custom endpoint URL.")
except Exception as e:
    print(f"An error occurred: {e}")
    if hasattr(llm_client, 'model'):
        print(f"Model being used by llm_client: {llm_client.model}")
    else:
        print("Could not determine model from llm_client.")

StopIteration error! Likely failed to find an inference provider for the model.
Check if the model 'gpt2' is correct and has an available Inference API or if you're using a custom endpoint URL.


In [38]:
import os
from huggingface_hub import InferenceClient

HF_TOKEN = os.getenv("HF_TOKEN")  # or assign your token string directly
llm_client = InferenceClient(model="meta-llama/Llama-3.2-1B-Instruct", token=HF_TOKEN)

In [39]:
print(llm_client.model)

meta-llama/Llama-3.2-1B-Instruct


In [44]:
from huggingface_hub import InferenceClient

# Option 1: specify the model at initialization (recommended)
llm_client = InferenceClient(model="meta-llama/Llama-3.2-1B-Instruct")  # example model
# or
# llm_client = InferenceClient(model="your_hf_username/your_private_model")  # if your model is private
# or
# llm_client = InferenceClient(model="https://your-custom-inference-endpoint-url")  # for custom inference endpoints (TGI)

In [45]:
from huggingface_hub import InferenceClient
import os

# Make sure the HF_TOKEN environment variable is set if you use private models or need higher rate limits
# or run: huggingface-cli login

try:
    # Replace this with a model that has a public inference API or with your own endpoint
    # For example, using a popular open model:
    MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"
    # Or, if you have a dedicated Inference Endpoint URL:
    # MODEL_ID = "https://your-api-id.region.provider.huggingface.cloud"

    llm_client = InferenceClient(
        model=MODEL_ID,
        # token=os.getenv("HF_TOKEN")  # add this if needed
    )

    prompt = "What is the capital of France?"
    answer = llm_client.text_generation(
        prompt,
        max_new_tokens=10,
    ).strip()
    print(f"Question: {prompt}")
    print(f"Answer: {answer}")

except StopIteration:
    print("StopIteration error! Likely failed to find an inference provider for the model.")
    print(f"Check if the model '{MODEL_ID}' is correct and has an available Inference API or if you're using a custom endpoint URL.")
except Exception as e:
    print(f"An error occurred: {e}")
    if hasattr(llm_client, 'model'):
        print(f"Model being used by llm_client: {llm_client.model}")
    else:
        print("Could not determine model from llm_client.")

An error occurred: Model meta-llama/Llama-3.2-1B-Instruct is not supported for task text-generation and provider novita. Supported task: conversational.
Model being used by llm_client: meta-llama/Llama-3.2-1B-Instruct


In [46]:
from huggingface_hub import InferenceClient
import os

# `llm_client` is assumed to be initialized in a previous cell, e.g.
# llm_client = InferenceClient(model="meta-llama/Llama-3.2-1B-Instruct", token=os.getenv("HF_TOKEN"))
# The earlier error shows that requests for this model are routed to the Novita
# provider, which supports only the conversational task, so we use `chat_completion`.

# Your prompt
user_prompt = "What is the capital of France and what are three interesting facts about it?"

# Prepare messages for the conversational task
messages = [
    {"role": "user", "content": user_prompt}
]

try:
    # If llm_client is from huggingface_hub
    response = llm_client.chat_completion(
        messages=messages,
        model="meta-llama/Llama-3.2-1B-Instruct", # Explicitly pass model if not set in client or to override
        max_tokens=150,  # Note: 'max_tokens' or 'max_new_tokens' can vary by API
        # Add other parameters as supported by Novita's API for this model
    )

    # The response structure for chat_completion is different
    # It usually returns a list of choices, and each choice has a message object
    if response.choices and len(response.choices) > 0:
        assistant_reply = response.choices[0].message.content
        print(f"User: {user_prompt}")
        print(f"Assistant: {assistant_reply.strip()}")
    else:
        print("No response choices received.")
        print("Full response:", response)


except Exception as e:
    print(f"An error occurred: {e}")
    # You might want to inspect `e` further, especially if it's an APIError
    # from Novita, as it might contain more details.

User: What is the capital of France and what are three interesting facts about it?
Assistant: The capital of France is Paris.

Here are three interesting facts about Paris:

1. **The Eiffel Tower is not just a pretty face**: Built for the 1889 World's Fair, the Eiffel Tower was initially intended to be a temporary structure. However, its design became an instant icon, and it has remained a symbol of Paris ever since. Today, it's one of the most recognizable landmarks in the world.

2. **Paris has more museums than any other city in the world**: With over 100 museums, including the Louvre, Orsay, and Rodin, Paris is a treasure trove of art, history, and culture. The city is home to some of the world's most famous museums, including


In [47]:
from huggingface_hub import InferenceClient
import os

# Assume llm_client is already configured to talk to Novita
# for the 'meta-llama/Llama-3.2-1B-Instruct' model.
# You may need to supply your Novita API key or their endpoint URL when initializing the client.
# For example (this is only a guess at how it would be set up with Novita):
# NOVITA_API_KEY = "YOUR_NOVITA_API_KEY"
# llm_client = InferenceClient(
#     model="meta-llama/Llama-3.2-1B-Instruct",  # or the Novita endpoint URL for this model
#     token=NOVITA_API_KEY,  # or any other way of passing credentials to Novita
#     # a 'provider' argument or similar may be needed to select Novita explicitly if it is not detected automatically
# )
# Make sure llm_client is actually set up to use Novita.

# Your prompt
user_prompt = "What is the capital of France and what are three interesting facts about it?"

# Prepare the messages for the conversational task
messages = [
    {"role": "user", "content": user_prompt}
]

try:
    # If llm_client comes from huggingface_hub
    response = llm_client.chat_completion(
        messages=messages,
        model="meta-llama/Llama-3.2-1B-Instruct",  # pass the model explicitly if it is not set on the client, or to override it
        max_tokens=150,  # note: 'max_tokens' or 'max_new_tokens' can vary by API
        # Add other parameters as supported by Novita's API for this model
    )

    # The response structure for chat_completion is different:
    # it usually returns a list of choices, each containing a message object
    if hasattr(response, 'choices') and response.choices and len(response.choices) > 0:
        assistant_reply = response.choices[0].message.content
        print(f"User: {user_prompt}")
        print(f"Assistant: {assistant_reply.strip()}")
    else:
        print("No response choices received.")
        print("Full response:", response)  # print the full response to inspect its structure

except Exception as e:
    print(f"An error occurred: {e}")
    # You may want to inspect `e` further, especially if it is an APIError
    # from Novita, as it may contain more details.

An error occurred: 429 Client Error: Too Many Requests for url: https://router.huggingface.co/novita/v3/openai/chat/completions (Request ID: Root=1-682c023c-60c3644d165d150726829b24;9c1ec069-df61-48a3-b023-f106f65d49c3)

error, status code: 429, status: 429 Too Many Requests, message: , body: {"error":"failed to schedule worker"}
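
A 429 response like the one above may be transient. One common way to cope is to retry with exponential backoff. The sketch below is generic Python with hypothetical names (`call_with_backoff`, `flaky_call`), not a feature of `huggingface_hub`, and it retries on any exception for brevity.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a rate-limited request: fails twice, then succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [1.0, 2.0]
```

In the notebook, `fn` could be a small lambda that wraps the `llm_client.chat_completion(...)` call.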


In [51]:
import os
from huggingface_hub import InferenceClient
import json # For printing retrieved context if needed

# --- Assume your HF_TOKEN is set ---
# os.environ["HF_TOKEN"] = "hf_YOUR_VALID_TOKEN"

# --- Assume 'context' and 'question' are already defined from your RAG process ---
# Example (replace with your actual retrieved context and question):
# retrieved_lines_with_distances = [
#     ("The legal basis is Article 114 of the Treaty on the Functioning of the European Union (TFEU).", 0.9),
#     ("This proposal aims to ensure the proper functioning of the internal market.", 0.85),
#     ("It lays down harmonised rules on artificial intelligence.", 0.82)
# ]
# context = "\n".join(
#     [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
# )
# question = "What is the legal basis for the proposal?"
# print("Retrieved context for LLM:\n", json.dumps(retrieved_lines_with_distances, indent=2))
# print("-" * 30)


# --- Define the PROMPT structure for chat ---
# For Instruct models, it's good to frame the RAG context and question clearly.
# Some models respond well to system prompts, others directly to user prompts.
# Llama-3-Instruct models generally work well with a direct user prompt containing instructions.

PROMPT_FOR_CHAT = f"""Use the following pieces of information to answer the user's question.
If the context doesn't contain the answer, say "I cannot answer the question based on the provided context."

Context:
{context}

User Question:
{question}

Answer:
"""

# --- LLM Client Initialization ---
repo_id = "meta-llama/Llama-3.2-1B-Instruct" # The model you want to use

# Ensure HF_TOKEN is set in your environment or pass it directly
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    raise ValueError("HF_TOKEN environment variable not set.")

llm_client = InferenceClient(
    model=repo_id,
    token=hf_token,
    timeout=120
)

# --- MODIFIED LLM CALL using chat_completion ---
print(f"Attempting to call {repo_id} via chat_completion...")
print(f"User message content being sent:\n{PROMPT_FOR_CHAT}")
print("-" * 30)

try:
    messages = [
        {"role": "user", "content": PROMPT_FOR_CHAT}
    ]

    response = llm_client.chat_completion(
        messages=messages,
        max_tokens=200,  # Max tokens for the generated answer
        temperature=0.1, # Lower temperature for more factual RAG
        top_p=0.9,
    )

    if response.choices and len(response.choices) > 0:
        answer = response.choices[0].message.content.strip()
        print("\n--- LLM Answer ---")
        print(answer)
    else:
        print("No response choices received from LLM.")
        print("Full LLM response object:", response) # For debugging

except Exception as e:
    print(f"Error during LLM call: {e}")
    print("If this is still the 'novita' provider error, then `huggingface_hub` is still routing")
    print(f"requests for '{repo_id}' to Novita in your environment.")
    print("Consider the debugging steps for provider resolution mentioned previously.")

Attempting to call meta-llama/Llama-3.2-1B-Instruct via chat_completion...
User message content being sent:
Use the following pieces of information to answer the user's question.
If the context doesn't contain the answer, say "I cannot answer the question based on the provided context."

Context:
EN 6  EN 
2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY 
2.1. Legal basis 
The legal basis for the proposal is in the first place Article 114 of the Treaty on the 
Functioning of the European Union (TFEU), which provides for the adoption of measures to 
ensure the establishment and functioning of the internal market.  
This proposal constitutes a core part of the EU digital single market strategy. The primary 
objective of this proposal is to ensure the proper functioning of the internal market by setting 
harmonised rules in particular on the development, placing on the Union market and the use 
of products and services making use of AI technologies or provided as stand -alone AI 
systems.

### Working version

In [69]:
import os
import json
from tqdm import tqdm

# --- (The document loading, splitting, embedding, and Milvus code stays the same) ---
# ... [Your existing code for PDF loading, splitting, embedding, Milvus setup, search] ...
# ... [Ensure 'context' and 'question' are defined from your RAG process] ...

# --- The PROMPT definition stays the same ---
PROMPT_TEMPLATE = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
If the context doesn't contain the answer, say "I cannot answer the question based on the provided context."

<context>
{context}
</context>
<question>
{question}
</question>
Answer:
"""  # same template as before

filled_prompt = PROMPT_TEMPLATE.format(context=context, question=question)

# --- Load and run the model locally with Transformers ---
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Make sure HF_TOKEN is set if the model requires it for download
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    print("Warning: HF_TOKEN not set. This might be required to download gated models like Llama.")

# Choose a small Instruct or Chat model; these tend to work best for RAG.
# Alternatives (check the Hugging Face Hub for the exact names):
# repo_id_local = "meta-llama/Llama-3.2-1B-Instruct"  # gated; may require HF_TOKEN
# repo_id_local = "HuggingFaceH4/zephyr-7b-beta"  # 7B, listed only as an example of local loading
# repo_id_local = "microsoft/phi-2"  # about 2.7B
# repo_id_local = "google/gemma-2b-it"  # 2B
# Here we use a small, easily available model:
repo_id_local = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
print(f"Attempting to load model {repo_id_local} locally...")

try:
    # Load the tokenizer and the model
    tokenizer = AutoTokenizer.from_pretrained(repo_id_local, token=hf_token)
    # To load the model on the CPU (no strong GPU, or to save GPU memory):
    # model = AutoModelForCausalLM.from_pretrained(repo_id_local, token=hf_token, device_map="cpu")

    # To load the model on a GPU when available (preferred for performance):
    # device_map="auto" tries to use the GPU if present and can spread layers across devices for large models
    # torch_dtype=torch.bfloat16 can save memory and speed up inference on compatible hardware
    model_kwargs = {"token": hf_token}
    if torch.cuda.is_available():
        model_kwargs["device_map"] = "auto"
        model_kwargs["torch_dtype"] = torch.bfloat16  # or torch.float16
        # You can add load_in_8bit=True or load_in_4bit=True (with bitsandbytes) to reduce memory further
        # model_kwargs["load_in_8bit"] = True

    model = AutoModelForCausalLM.from_pretrained(repo_id_local, **model_kwargs)

    print(f"Model {repo_id_local} loaded successfully.")

    # Create a text-generation pipeline (the simplest way to run the model).
    # A model loaded with device_map="auto" is already placed by accelerate, and some
    # transformers versions reject an explicit `device` in that case, so pass it only on CPU.
    pipeline_kwargs = {} if torch.cuda.is_available() else {"device": -1}  # -1 selects the CPU
    text_generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        **pipeline_kwargs
    )

    # Chat models may need a specific prompt format even when used through a pipeline.
    # TinyLlama-Chat expects the chat format shown below.
    # For RAG, the instructions, context, and question can all go into the user message.
    # "Chat" models usually expect a list of messages. Passing the raw text is the simplest
    # approach, but applying the model's own chat template is more accurate.

    # TinyLlama Chat format:
    # <|system|>
    # You are a friendly chatbot.</s>
    # <|user|>
    # User's message here.</s>
    # <|assistant|>
    chat_template_messages = [
        {"role": "system", "content": "You are a helpful AI assistant. Answer the user's question based on the provided context."},
        {"role": "user", "content": PROMPT_TEMPLATE.format(context=context, question=question).replace("Answer:", "").strip()} # strip "Answer:" from the user turn
    ]
    # Apply the chat template if the tokenizer supports it
    try:
        final_prompt_for_local_model = tokenizer.apply_chat_template(chat_template_messages, tokenize=False, add_generation_prompt=True)
        print(f"\n--- Prompt for local model (using chat template) ---\n{final_prompt_for_local_model}")
    except Exception as e_template:
        print(f"Could not apply chat template (error: {e_template}), using filled_prompt directly.")
        final_prompt_for_local_model = filled_prompt # fallback

    # Generate the answer.
    # max_length is the total length (prompt + answer); max_new_tokens counts only the new answer tokens.
    # The prompt is already formatted with the chat template, so it is passed as plain text.
    responses = text_generator(
        final_prompt_for_local_model,
        max_new_tokens=200,  # number of new tokens to generate
        do_sample=True,      # sampling gives varied answers; for RAG, do_sample=False or a low temperature may work better
        temperature=0.1,     # a low temperature keeps answers close to the context
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id # avoids the missing-pad-token warning
    )

    # Extract the answer from the pipeline response.
    # The pipeline returns a list of dictionaries.
    if responses and len(responses) > 0:
        # generated_text contains the original prompt followed by the answer, so the prompt must be removed.
        full_generated_text = responses[0]['generated_text']
        # Strip the prompt prefix to keep only the answer.
        if full_generated_text.startswith(final_prompt_for_local_model):
            answer = full_generated_text[len(final_prompt_for_local_model):].strip()
        else:
            # The exact prefix match can fail with some formats. Because add_generation_prompt=True
            # was used, the answer should directly follow the prompt, so fall back to splitting on
            # the prompt and, failing that, keep the full generated text. This is an approximation
            # that may need adjusting for other models.
            answer = responses[0]['generated_text'].split(final_prompt_for_local_model)[-1].strip() if final_prompt_for_local_model in responses[0]['generated_text'] else full_generated_text # Fallback
            print(f"Full generated text: {full_generated_text}")
            print(f"Prompt was: {final_prompt_for_local_model}")


        print("\n--- LLM Answer (Locally Generated) ---")
        print(answer)
    else:
        print("No response generated by the local model.")

except Exception as e:
    print(f"Error during local LLM loading or generation: {e}")
    print("Make sure you have enough RAM/VRAM and the model name is correct.")
    print("You might need to install additional packages like 'accelerate' or 'bitsandbytes'.")

Attempting to load model TinyLlama/TinyLlama-1.1B-Chat-v1.0 locally...


Device set to use cpu


Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 loaded successfully.

--- Prompt for local model (using chat template) ---
<|system|>
You are a helpful AI assistant. Answer the user's question based on the provided context.</s>
<|user|>
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
If the context doesn't contain the answer, say "I cannot answer the question based on the provided context."

<context>
EN 6  EN 
2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY 
2.1. Legal basis 
The legal basis for the proposal is in the first place Article 114 of the Treaty on the 
Functioning of the European Union (TFEU), which provides for the adoption of measures to 
ensure the establishment and functioning of the internal market.  
This proposal constitutes a core part of the EU digital single market strategy. The primary 
objective of this proposal is to ensure the proper functioning of the internal market by setting 
ha

**Evaluation of the answer and output**

This is an excellent result: it shows that the changes proposed for local execution worked completely.

**Positive aspects**

Local model loading succeeded:
- The `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model and its tokenizer were downloaded and loaded successfully.
- `Device set to use cpu` shows that the model runs on the CPU, which is fine if you have no capable GPU or want to save resources.
- `Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 loaded successfully.` confirms the load.

The chat template was applied successfully:
- `--- Prompt for local model (using chat template) ---` shows that the code used `tokenizer.apply_chat_template` to format the prompt the way the TinyLlama chat model expects. This matters a great deal for getting the best performance.
- The formatted prompt contains:
  - `<|system|>`: a system message that steers the model.
  - `<|user|>`: the original RAG instructions, the retrieved context, and the question.
  - `<|assistant|>`: the open turn where the model starts generating.
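
To make this layout concrete, here is a minimal sketch that rebuilds the TinyLlama chat format shown in the prompt output by hand. `render_chat_prompt` and `EOS_TOKEN` are hypothetical names introduced only for illustration. In the notebook the formatting comes from `tokenizer.apply_chat_template`, and the model's real template may differ in whitespace details.

```python
from typing import Dict, List

EOS_TOKEN = "</s>"  # end-of-turn marker visible in the printed TinyLlama prompt


def render_chat_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Hand-rolled approximation of the chat layout (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn is "<|role|>", a newline, the content, then the EOS token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{EOS_TOKEN}\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the legal basis for the proposal?"},
]
prompt = render_chat_prompt(messages)
print(prompt)
```

Writing the format out by hand is useful for inspection only. For real prompts, prefer the tokenizer's own template so the special tokens match what the model saw during training.
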

The answer is correct and taken directly from the context:
- Output under `--- LLM Answer (Locally Generated) ---`: "The legal basis for the proposal is in the first place Article 114 of the Treaty on the Functioning of the European Union (TFEU), which provides for the adoption of measures to ensure the establishment and functioning of the internal market."
- This answer is taken directly and accurately from the provided context (`2.1. Legal basis The legal basis for the proposal is in the first place Article 114...`).
- The model followed the instructions and stayed within the context.

The earlier problems are gone:
- There are no more errors related to `InferenceClient`, `novita`, `StopIteration`, or `429 Too Many Requests`. This confirms that running locally sidesteps those problems entirely.

Reasonable performance for a small model:
- A 1.1B model such as TinyLlama extracted the correct answer, which is a good result for its size.

**Points to watch in the future (not current problems)**

- CPU inference speed: with large documents or many questions, inference on the CPU can be slow. If performance matters, a GPU will speed things up considerably.
- Answer quality from larger models: TinyLlama 1.1B did well here, but you may need larger models (such as Llama 3 8B or Mixtral, if your resources allow) for more complex RAG tasks, or when you need more detailed answers or reasoning that goes beyond direct extraction. Larger models are usually better at grasping nuance and producing fluent text.
- Answer extraction:

      if full_generated_text.startswith(final_prompt_for_local_model):
          answer = full_generated_text[len(final_prompt_for_local_model):].strip()
      else:
          # ... fallback logic ...
          answer = responses[0]['generated_text'].split(final_prompt_for_local_model)[-1].strip() if final_prompt_for_local_model in responses[0]['generated_text'] else full_generated_text

  This way of extracting the answer works well here. Just make sure it is robust enough for different responses. Because `add_generation_prompt=True` is used when applying the chat template, the answer should directly follow the prompt.
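
The prefix-stripping logic discussed above can be isolated in a small helper. This is a sketch under assumptions: `extract_answer` is a hypothetical name, and the `<|assistant|>` fallback marker assumes the TinyLlama chat format shown earlier. As an alternative, the Transformers text-generation pipeline accepts `return_full_text=False`, which returns only the newly generated text and makes the stripping unnecessary.

```python
def extract_answer(full_generated_text: str, prompt: str) -> str:
    """Return only the newly generated text (hypothetical helper)."""
    # Normal case: the pipeline echoes the prompt verbatim before the answer.
    if full_generated_text.startswith(prompt):
        return full_generated_text[len(prompt):].strip()
    # Fallback: keep what follows the last assistant marker.
    # Assumption: the chat template marks the assistant turn with "<|assistant|>".
    marker = "<|assistant|>"
    if marker in full_generated_text:
        return full_generated_text.rsplit(marker, 1)[-1].strip()
    # Last resort: return the whole text rather than fail.
    return full_generated_text.strip()


prompt = "<|user|>\nWhat is the legal basis?</s>\n<|assistant|>\n"
generated = prompt + "Article 114 of the TFEU."
print(extract_answer(generated, prompt))

# With the real pipeline, this call shape would skip the stripping entirely:
# responses = text_generator(prompt, max_new_tokens=200, return_full_text=False)
```

Keeping the extraction in one function makes it easy to test against the response shapes you expect before wiring it into the generation cell.
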

**Conclusion**

This is a successful application of RAG with a small language model running locally. It avoids all the problems tied to the external API and returns a correct answer from the model.

# Complete local pipeline with Milvus Lite

In [None]:
# ---------------------------------------------------------------------------
# 0. Install the required libraries
# ---------------------------------------------------------------------------
!pip install -q langchain langchain_community sentence_transformers pymilvus "transformers[torch]" accelerate bitsandbytes tqdm pypdf

import os
import json
from tqdm import tqdm

# ---------------------------------------------------------------------------
# 1. Set up the Hugging Face token (if needed to download models)
# ---------------------------------------------------------------------------
# To use private or gated models from Hugging Face (such as some Llama models),
# go to https://huggingface.co/settings/tokens and create a new token with read access.
# You can paste the token here directly (not recommended for shared notebooks) or set it as a Colab secret.
# os.environ["HF_TOKEN"] = "hf_YOUR_HUGGINGFACE_TOKEN_HERE" # replace with your own token

hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    print("Warning: HF_TOKEN not set. This might be required to download some models.")
    print("If you encounter download issues for gated models, please set your HF_TOKEN.")
else:
    print(f"HF_TOKEN found, starting with: {hf_token[:10]}...")

# ---------------------------------------------------------------------------
# 2. Download the PDF file (if it does not already exist)
# ---------------------------------------------------------------------------
pdf_file_name = "The-AI-Act.pdf"
pdf_url = "https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf"

if not os.path.exists(pdf_file_name):
    print(f"Downloading {pdf_file_name}...")
    import requests
    response = requests.get(pdf_url, stream=True)
    with open(pdf_file_name, "wb") as f:
        for chunk in tqdm(response.iter_content(chunk_size=8192), desc="Downloading PDF"):
            if chunk:
                f.write(chunk)
    print(f"{pdf_file_name} downloaded successfully.")
else:
    print(f"{pdf_file_name} already exists.")

# ---------------------------------------------------------------------------
# 3. Load and split the document
# ---------------------------------------------------------------------------
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

print("Loading PDF document...")
loader = PyPDFLoader(pdf_file_name)
docs = loader.load()
print(f"Loaded {len(docs)} pages from the PDF.")

print("Splitting document into chunks...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
print(f"Document split into {len(chunks)} chunks.")

text_lines = [chunk.page_content for chunk in chunks]

# ---------------------------------------------------------------------------
# 4. Create the embedding model
# ---------------------------------------------------------------------------
from sentence_transformers import SentenceTransformer

embedding_model_name = "BAAI/bge-small-en-v1.5"
print(f"Loading embedding model: {embedding_model_name}...")
embedding_model = SentenceTransformer(embedding_model_name)
print("Embedding model loaded.")

def emb_text(text_list):
    # .encode takes a list of texts
    return embedding_model.encode(text_list, normalize_embeddings=True).tolist()

# Test the embedding and get its dimension
if text_lines:
    test_embedding_list = emb_text([text_lines[0][:50]]) # use a short snippet for the test
    embedding_dim = len(test_embedding_list[0])
    print(f"Embedding dimension: {embedding_dim}")
    print(f"Sample embedding (first 10 values): {test_embedding_list[0][:10]}")
else:
    print("No text lines to create test embedding. Document might be empty or splitting failed.")
    embedding_dim = 384 # fallback value, the dimension of bge-small-en-v1.5

# ---------------------------------------------------------------------------
# 5. Set up the Milvus Lite database (stored in a local file)
# ---------------------------------------------------------------------------
from pymilvus import MilvusClient, DataType

# Milvus Lite keeps the collection in this local file, so the data persists between sessions.
milvus_db_file = "./hf_milvus_demo.db"
print(f"Initializing MilvusLite client with db_file: {milvus_db_file}...")
# Passing a local file path as the URI selects Milvus Lite. Keyword form:
# milvus_client = MilvusClient(uri=milvus_db_file)
# If that fails on an older pymilvus, try connecting first:
# from pymilvus import connections, utility
# connections.connect(uri=milvus_db_file)
milvus_client = MilvusClient(milvus_db_file) # the path is passed positionally as the URI


collection_name = "rag_collection_lite"
print(f"Checking for Milvus collection: {collection_name}...")

if milvus_client.has_collection(collection_name):
    print(f"Collection '{collection_name}' found. Dropping it to recreate.")
    milvus_client.drop_collection(collection_name)

print(f"Creating Milvus collection: {collection_name} with dimension {embedding_dim}...")
# Define a simple schema for Milvus Lite, naming the primary-key and vector fields explicitly.
try:
    milvus_client.create_collection(
        collection_name=collection_name,
        dimension=embedding_dim,
        primary_field_name="id", # name of the primary-key field
        id_type=DataType.INT64, # type of the primary key
        vector_field_name="vector", # name of the vector field
        metric_type="IP",  # Inner product distance
        consistency_level="Strong",
    )
except Exception as e_create_coll:
    print(f"Failed to create collection with new API: {e_create_coll}")
    print("Trying legacy create_collection method if available...")
    try: # fall back to the minimal call as a backup
        milvus_client.create_collection(
            collection_name=collection_name,
            dimension=embedding_dim,
            metric_type="IP",
            consistency_level="Strong",
        )
    except Exception as e_legacy_create:
        print(f"Legacy create_collection also failed: {e_legacy_create}")
        raise RuntimeError(f"Could not create Milvus collection '{collection_name}'. Please check Milvus/PyMilvus version and logs.")

print(f"Collection '{collection_name}' created/recreated successfully.")

# ---------------------------------------------------------------------------
# 6. Store the embeddings in Milvus and search them
# ---------------------------------------------------------------------------
print(f"Creating embeddings for {len(text_lines)} chunks and inserting into Milvus...")
data_to_insert = []
# Encode the chunks in batches so memory use stays bounded
batch_size = 32 # adjust the batch size to fit the available memory
all_embeddings = []

for i in tqdm(range(0, len(text_lines), batch_size), desc="Encoding text lines"):
    batch_texts = text_lines[i:i+batch_size]
    batch_embeddings = emb_text(batch_texts) # emb_text takes a list
    all_embeddings.extend(batch_embeddings)

for i, (line, vector) in enumerate(zip(text_lines, all_embeddings)):
    data_to_insert.append({"id": i, "vector": vector, "text": line})

if data_to_insert:
    insert_res = milvus_client.insert(collection_name=collection_name, data=data_to_insert)
    print(f"Inserted {insert_res['insert_count']} items into Milvus.")
else:
    print("No data to insert into Milvus.")

# --- Search ---
question = "What is the legal basis for the proposal?"
print(f"\nSearching for question: '{question}'")

question_embedding = emb_text([question])[0] # emb_text takes a list and returns a list of embeddings

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[question_embedding],
    limit=3,
    output_fields=["text"],
)

retrieved_lines_with_distances = []
if search_res and search_res[0]:
    for res in search_res[0]:
        retrieved_lines_with_distances.append(
            (res["entity"]["text"], res["distance"])
        )
    print("\nRetrieved context from Milvus:")
    print(json.dumps(retrieved_lines_with_distances, indent=2))
else:
    print("No results found from Milvus search.")

context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
if not context:
    print("Warning: Context is empty after retrieval. Using a default placeholder.")
    context = "No relevant context found in the document."

# ---------------------------------------------------------------------------
# 7. Set up the prompt
# ---------------------------------------------------------------------------
PROMPT_TEMPLATE = """
Use the following pieces of information to answer the user's question.
If the context doesn't contain the answer, say "I cannot answer the question based on the provided context."

Context:
{context}

User Question:
{question}

Answer:
"""

# ---------------------------------------------------------------------------
# 8. Load and run the local language model (TinyLlama as an example)
# ---------------------------------------------------------------------------
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Choose the local model to use
repo_id_local = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
print(f"\nAttempting to load model {repo_id_local} locally...")

try:
    tokenizer = AutoTokenizer.from_pretrained(repo_id_local, token=hf_token)

    model_kwargs = {"token": hf_token}
    # Check whether a GPU is available
    if torch.cuda.is_available():
        print("CUDA (GPU) is available. Loading model on GPU.")
        model_kwargs["device_map"] = "auto" # let accelerate place the model on the available GPUs
        model_kwargs["torch_dtype"] = torch.bfloat16 # or torch.float16, to reduce memory and speed up inference
        # Optionally load in lower precision if bitsandbytes is installed
        # model_kwargs["load_in_8bit"] = True
        # model_kwargs["load_in_4bit"] = True # requires bitsandbytes
    else:
        print("CUDA (GPU) not available. Loading model on CPU.")
        # On CPU, device_map="cpu" is unnecessary: the model loads on the CPU by default.
        # Setting torch_dtype may cause problems on CPUs that lack good support for it,
        # so it is left unset here unless you know it works.

    model = AutoModelForCausalLM.from_pretrained(repo_id_local, **model_kwargs)
    print(f"Model {repo_id_local} loaded successfully.")

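    # A model loaded with device_map="auto" is already placed by accelerate, and some
    # transformers versions reject an explicit `device` in that case, so pass it only on CPU.
    pipeline_kwargs = {} if torch.cuda.is_available() else {"device": -1}  # -1 selects the CPU

    text_generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        **pipeline_kwargs
    )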

    # Format the prompt with the model's chat template
    chat_template_messages = [
        {"role": "system", "content": "You are a helpful AI assistant. Answer the user's question based on the provided context."},
        {"role": "user", "content": PROMPT_TEMPLATE.format(context=context, question=question).replace("Answer:", "").strip()}
    ]

    try:
        final_prompt_for_local_model = tokenizer.apply_chat_template(chat_template_messages, tokenize=False, add_generation_prompt=True)
    except Exception as e_template:
        print(f"Could not apply chat template (error: {e_template}), using raw prompt as fallback.")
        final_prompt_for_local_model = PROMPT_TEMPLATE.format(context=context, question=question)


    print(f"\n--- Prompt for local model ---\n{final_prompt_for_local_model}")

    responses = text_generator(
        final_prompt_for_local_model,
        max_new_tokens=250,
        do_sample=True,
        temperature=0.1,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

    if responses and len(responses) > 0:
        full_generated_text = responses[0]['generated_text']
        # Extract the answer that follows the prompt
        if final_prompt_for_local_model in full_generated_text:
            answer = full_generated_text.split(final_prompt_for_local_model, 1)[-1].strip()
        else:
            # The exact prompt was not found (this can happen because of special tokens).
            # This fallback is less precise.
            print("Warning: Exact prompt not found in generated text for splitting. The answer might include part of the prompt.")
            answer = full_generated_text # return the full text as a fallback
            # Then try to keep only what follows the assistant marker of the chat template
            answer_start_marker = "<|assistant|>" # TinyLlama's marker; adjust for other chat templates
            if answer_start_marker in final_prompt_for_local_model:
                if answer_start_marker in full_generated_text:
                    answer = full_generated_text.split(answer_start_marker, 1)[-1].strip()


        print("\n--- LLM Answer (Locally Generated) ---")
        print(answer)
    else:
        print("No response generated by the local model.")

except Exception as e:
    print(f"\nAn error occurred: {e}")
    import traceback
    traceback.print_exc()
    print("\n--- Troubleshooting ---")
    print(" - Ensure 'HF_TOKEN' is set if your model is gated (e.g., Llama models).")
    print(" - Check model name and availability on Hugging Face Hub.")
    print(" - Ensure you have enough RAM/VRAM for the selected model.")
    print(" - Make sure all required libraries (transformers, torch, accelerate, bitsandbytes) are installed.")

# ---------------------------------------------------------------------------
# 9. Cleanup (optional: delete the Milvus Lite file)
# ---------------------------------------------------------------------------
# To delete the database when you are done:
# if os.path.exists(milvus_db_file):
#     print(f"\nCleaning up MilvusLite db file: {milvus_db_file}")
#     # You may need to close the Milvus connection before deleting if the file is locked
#     try:
#         milvus_client.close() # if the client holds an open connection
#     except Exception:
#         pass # ignore if there is no connection to close or closing fails
#     # os.remove(milvus_db_file)
#     print("MilvusLite db file cleanup can be done manually if needed.")

If you encounter download issues for gated models, please set your HF_TOKEN.
The-AI-Act.pdf already exists.
Loading PDF document...
Loaded 108 pages from the PDF.
Splitting document into chunks...
Document split into 424 chunks.
Loading embedding model: BAAI/bge-small-en-v1.5...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding model loaded.
Embedding dimension: 384
Sample embedding (first 10 values): [-0.07536880671977997, 0.010841378942131996, -0.008356144651770592, -0.022615786641836166, -0.021455848589539528, -0.010953365825116634, -0.012614780105650425, 0.05399952828884125, 0.07296384871006012, 0.022082921117544174]
Initializing MilvusLite client with db_file: ./hf_milvus_demo.db...
Checking for Milvus collection: rag_collection_lite...
Creating Milvus collection: rag_collection_lite with dimension 384...
Collection 'rag_collection_lite' created/recreated successfully.
Creating embeddings for 424 chunks and inserting into Milvus...


Encoding text lines: 100%|██████████| 14/14 [01:36<00:00,  6.88s/it]


Inserted 424 items into Milvus.

Searching for question: 'What is the legal basis for the proposal?'

Retrieved context from Milvus:
[
  [
    "EN 6  EN \n2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY \n2.1. Legal basis \nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \nensure the establishment and functioning of the internal market.  \nThis proposal constitutes a core part of the EU digital single market strategy. The primary \nobjective of this proposal is to ensure the proper functioning of the internal market by setting \nharmonised rules in particular on the development, placing on the Union market and the use \nof products and services making use of AI technologies or provided as stand -alone AI \nsystems. Some Member States are already considering national rules to ensure that AI is safe \nand is developed and used in compliance with fundamental 

Device set to use cpu


Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 loaded successfully.

--- Prompt for local model ---
<|system|>
You are a helpful AI assistant. Answer the user's question based on the provided context.</s>
<|user|>
Use the following pieces of information to answer the user's question.
If the context doesn't contain the answer, say "I cannot answer the question based on the provided context."

Context:
EN 6  EN 
2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY 
2.1. Legal basis 
The legal basis for the proposal is in the first place Article 114 of the Treaty on the 
Functioning of the European Union (TFEU), which provides for the adoption of measures to 
ensure the establishment and functioning of the internal market.  
This proposal constitutes a core part of the EU digital single market strategy. The primary 
objective of this proposal is to ensure the proper functioning of the internal market by setting 
harmonised rules in particular on the development, placing on the Union market and the 
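
The Milvus collection in this cell uses `metric_type="IP"`, while `emb_text` encodes with `normalize_embeddings=True`. For unit-length vectors the inner product equals cosine similarity, which is why the two settings work together. The sketch below checks this on toy 2-D vectors in plain Python. The helper names and the vectors are illustrative assumptions, not part of the notebook.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for embeddings (made up for this example).
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = normalize(a), normalize(b)

ip = inner_product(a_unit, b_unit)   # what an IP metric scores on normalized vectors
cos = cosine_similarity(a, b)        # cosine similarity of the raw vectors
dist = squared_l2(a_unit, b_unit)    # squared Euclidean distance on normalized vectors

print(f"inner product on unit vectors: {ip:.4f}")
print(f"cosine similarity:             {cos:.4f}")
print(f"squared L2 on unit vectors:    {dist:.4f}")
```

On unit vectors the squared L2 distance is 2 - 2*cos, so ranking by inner product, cosine similarity, or L2 distance gives the same order when the embeddings are normalized.
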

### Full version using ChromaDB

In [None]:
!pip install -q langchain langchain_community sentence_transformers chromadb "transformers[torch]" accelerate bitsandbytes tqdm pypdf

In [None]:
# ---------------------------------------------------------------------------
# 0. Install the required libraries (run this in a separate cell if you have not already)
# ---------------------------------------------------------------------------
# !pip install -q langchain langchain_community sentence_transformers chromadb "transformers[torch]" accelerate bitsandbytes tqdm pypdf

import os
import json
from tqdm import tqdm

# ---------------------------------------------------------------------------
# 1. Set up the Hugging Face token (if needed to download models)
# ---------------------------------------------------------------------------
# os.environ["HF_TOKEN"] = "hf_YOUR_HUGGINGFACE_TOKEN_HERE"
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    print("Warning: HF_TOKEN not set. This might be required to download some models.")
else:
    print(f"HF_TOKEN found, starting with: {hf_token[:10]}...")

# ---------------------------------------------------------------------------
# 2. Download the PDF file (if it does not already exist)
# ---------------------------------------------------------------------------
pdf_file_name = "The-AI-Act.pdf"
pdf_url = "https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf"
if not os.path.exists(pdf_file_name):
    print(f"Downloading {pdf_file_name}...")
    import requests
    response = requests.get(pdf_url, stream=True)
    with open(pdf_file_name, "wb") as f:
        for chunk_pdf in tqdm(response.iter_content(chunk_size=8192), desc="Downloading PDF"):
            if chunk_pdf:
                f.write(chunk_pdf)
    print(f"{pdf_file_name} downloaded successfully.")
else:
    print(f"{pdf_file_name} already exists.")

# ---------------------------------------------------------------------------
# 3. Load and split the document
# ---------------------------------------------------------------------------
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

print("Loading PDF document...")
loader = PyPDFLoader(pdf_file_name)
docs_lc = loader.load() # a distinct name avoids clashing with any earlier 'docs'
print(f"Loaded {len(docs_lc)} pages from the PDF.")

print("Splitting document into chunks...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks_lc = text_splitter.split_documents(docs_lc) # distinct name, as above
print(f"Document split into {len(chunks_lc)} chunks.")

# ---------------------------------------------------------------------------
# 4. Set up the embedding model and embedding function
# ---------------------------------------------------------------------------
from sentence_transformers import SentenceTransformer
from langchain_community.embeddings import HuggingFaceEmbeddings # LangChain wrapper that integrates easily with Chroma

embedding_model_name = "BAAI/bge-small-en-v1.5"
print(f"Loading embedding model: {embedding_model_name}...")
# With Chroma through LangChain, the LangChain wrapper is the simplest option
embeddings_lc = HuggingFaceEmbeddings(model_name=embedding_model_name)
print("Embedding model loaded.")

# Test the embedding (optional, since LangChain's Chroma integration handles embedding itself)
if chunks_lc:
    sample_text_for_embedding = chunks_lc[0].page_content[:50]
    sample_embedding = embeddings_lc.embed_query(sample_text_for_embedding)
    embedding_dim_lc = len(sample_embedding)
    print(f"Embedding dimension: {embedding_dim_lc}")
    print(f"Sample embedding (first 10 values): {sample_embedding[:10]}")
else:
    print("No chunks to create test embedding.")

# ---------------------------------------------------------------------------
# 5. إعداد قاعدة بيانات ChromaDB وتخزين المستندات
# ---------------------------------------------------------------------------
from langchain_community.vectorstores import Chroma

collection_name_chroma = "rag_chroma_collection"
persist_directory = "chroma_db_store" # لتخزين قاعدة البيانات على القرص

print(f"Initializing ChromaDB vector store (collection: {collection_name_chroma}, persist_directory: {persist_directory})...")
# إذا كانت chunks_lc موجودة، قم بإنشاء القاعدة من المستندات
if chunks_lc:
    vectorstore = Chroma.from_documents(
        documents=chunks_lc,
        embedding=embeddings_lc,
        collection_name=collection_name_chroma,
        persist_directory=persist_directory
    )
    vectorstore.persist() # حفظ التغييرات على القرص
    print(f"Created and persisted ChromaDB vector store with {len(chunks_lc)} chunks.")
else:
    # If there are no chunks (for whatever reason), try to load an existing database
    print("No chunks found. Attempting to load an existing ChromaDB store.")
    if os.path.exists(persist_directory):
        vectorstore = Chroma(
            persist_directory=persist_directory,
            embedding_function=embeddings_lc,
            collection_name=collection_name_chroma
        )
        print("Loaded existing ChromaDB.")
    else:
        # Nothing to load and nothing to add: leave the store unset so the search step can detect it
        vectorstore = None
        print("No existing ChromaDB found and no chunks to add. This might lead to issues in retrieval.")
        # To create a completely empty store instead (may not be useful without data):
        # import chromadb
        # client_chroma = chromadb.PersistentClient(path=persist_directory)
        # collection_chroma = client_chroma.get_or_create_collection(name=collection_name_chroma)
        # vectorstore = Chroma(client=client_chroma, collection_name=collection_name_chroma, embedding_function=embeddings_lc)

# ---------------------------------------------------------------------------
# 6. Search ChromaDB
# ---------------------------------------------------------------------------
import json

question = "What is the legal basis for the proposal?"
print(f"\nSearching ChromaDB for question: '{question}'")

if 'vectorstore' in locals() and vectorstore is not None:
    # k is the number of similar documents to return
    retrieved_docs_chroma = vectorstore.similarity_search_with_score(question, k=3)
    print("\nRetrieved context from ChromaDB:")
    retrieved_lines_with_distances_chroma = []
    for doc, score in retrieved_docs_chroma:
        # The score is a distance, so lower means more similar. Chroma uses L2 by default;
        # with a cosine space the score is the cosine distance, which is also lower-is-better.
        retrieved_lines_with_distances_chroma.append((doc.page_content, float(score)))
    print(json.dumps(retrieved_lines_with_distances_chroma, indent=2))

    context = "\n".join(
        [doc_content for doc_content, score in retrieved_lines_with_distances_chroma]
    )
else:
    print("Vectorstore not initialized. Cannot perform search.")
    context = "Vectorstore not available, context could not be retrieved."


if not context or context == "Vectorstore not available, context could not be retrieved.":
    print("Warning: Context is empty or unavailable after retrieval. Using a default placeholder.")
    context = "No relevant context found in the document."
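
# --- Illustrative sketch (not part of the original pipeline) ---
# The search above keeps all k results, however far they are from the query.
# One possible refinement, assuming each score is a distance (lower means
# closer), is to drop results beyond a cutoff before building the context.
# `filter_by_max_distance` and its `max_distance` threshold are hypothetical;
# a sensible cutoff depends on the embedding model and the distance metric.
def filter_by_max_distance(scored_chunks, max_distance):
    """Keep (text, distance) pairs with distance <= max_distance, closest first."""
    kept = [(text, dist) for text, dist in scored_chunks if dist <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])

# Possible usage, assuming the retrieval step above produced results:
# close_chunks = filter_by_max_distance(retrieved_lines_with_distances_chroma, max_distance=1.0)
# context = "\n".join(text for text, _ in close_chunks)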

# ---------------------------------------------------------------------------
# 7. Set up the prompt (unchanged)
# ---------------------------------------------------------------------------
PROMPT_TEMPLATE = """
Use the following pieces of information to answer the user's question.
If the context doesn't contain the answer, say "I cannot answer the question based on the provided context."

Context:
{context}

User Question:
{question}

Answer:
"""
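
# --- Illustrative sketch (not part of the original pipeline) ---
# The template above is filled in with str.format. Small local models have a
# limited context window, so a long retrieved context may need trimming first.
# The helper below is a minimal sketch of that idea; `build_prompt_sketch` and
# the `max_context_chars` default are hypothetical, and a character cap is
# only a crude stand-in for counting tokens.
def build_prompt_sketch(template, context_text, question_text, max_context_chars=4000):
    """Fill `template` after trimming the context to at most `max_context_chars` characters."""
    trimmed_context = context_text[:max_context_chars]
    return template.format(context=trimmed_context, question=question_text)

# Possible usage with the names defined in this script:
# prompt_text = build_prompt_sketch(PROMPT_TEMPLATE, context, question)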

# ---------------------------------------------------------------------------
# 8. Load and run the local language model (TinyLlama as an example; unchanged)
# ---------------------------------------------------------------------------
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

repo_id_local = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
print(f"\nAttempting to load model {repo_id_local} locally...")

try:
    tokenizer = AutoTokenizer.from_pretrained(repo_id_local, token=hf_token)
    model_kwargs_llm = {"token": hf_token}
    if torch.cuda.is_available():
        print("CUDA (GPU) is available. Loading model on GPU.")
        model_kwargs_llm["device_map"] = "auto"
        model_kwargs_llm["torch_dtype"] = torch.bfloat16
    else:
        print("CUDA (GPU) not available. Loading model on CPU.")

    model = AutoModelForCausalLM.from_pretrained(repo_id_local, **model_kwargs_llm)
    print(f"Model {repo_id_local} loaded successfully.")

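    device_for_pipeline_llm = 0 if torch.cuda.is_available() else -1

    # A model loaded with device_map="auto" has already been placed by accelerate,
    # so pass an explicit device to the pipeline only in the CPU case.
    pipeline_kwargs_llm = {} if torch.cuda.is_available() else {"device": device_for_pipeline_llm}

    text_generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        **pipeline_kwargs_llm
    )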

    chat_template_messages_llm = [
        {"role": "system", "content": "You are a helpful AI assistant. Answer the user's question based on the provided context."},
        {"role": "user", "content": PROMPT_TEMPLATE.format(context=context, question=question).replace("Answer:", "").strip()}
    ]
    try:
        final_prompt_for_local_model_llm = tokenizer.apply_chat_template(chat_template_messages_llm, tokenize=False, add_generation_prompt=True)
    except Exception as e_template_llm:
        print(f"Could not apply chat template (error: {e_template_llm}), using raw prompt as fallback.")
        final_prompt_for_local_model_llm = PROMPT_TEMPLATE.format(context=context, question=question)

    print(f"\n--- Prompt for local model ---\n{final_prompt_for_local_model_llm}")

    responses_llm = text_generator(
        final_prompt_for_local_model_llm,
        max_new_tokens=250,
        do_sample=True,
        temperature=0.1,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

    if responses_llm and len(responses_llm) > 0:
        full_generated_text_llm = responses_llm[0]['generated_text']
        if final_prompt_for_local_model_llm in full_generated_text_llm:  # Stricter check before splitting
            answer = full_generated_text_llm.split(final_prompt_for_local_model_llm, 1)[-1].strip()
        else:
            # Try extracting the text after the last known marker, in case the template
            # appends something after the prompt (such as <|assistant|>)
            assistant_marker = "<|assistant|>"  # Or whatever marker suits the model
            if assistant_marker in full_generated_text_llm:
                answer = full_generated_text_llm.split(assistant_marker,1)[-1].strip()
            else:
                print("Warning: Could not cleanly split prompt from generated text. The answer might be incomplete or include parts of the prompt.")
                answer = full_generated_text_llm # Fallback to full text if splitting fails

        print("\n--- LLM Answer (Locally Generated) ---")
        print(answer)
    else:
        print("No response generated by the local model.")

except Exception as e_llm:
    print(f"\nAn error occurred during LLM loading or generation: {e_llm}")
    import traceback
    traceback.print_exc()
    print("\n--- Troubleshooting ---")
    print(" - Ensure 'HF_TOKEN' is set if your model is gated (e.g., Llama models).")
    print(" - Check model name and availability on Hugging Face Hub.")
    print(" - Ensure you have enough RAM/VRAM for the selected model.")
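
# --- Illustrative sketch (not part of the original pipeline) ---
# The answer-extraction logic above can also be written as a small pure
# function, which makes it easier to test without loading a model.
# `extract_answer_sketch` is a hypothetical helper and a slight variation on
# the code above: it strips the prompt only when the text starts with it, and
# otherwise takes everything after the last occurrence of the marker.
# Alternatively, the text-generation pipeline accepts return_full_text=False,
# which returns only the newly generated text.
def extract_answer_sketch(full_text, prompt_text, marker="<|assistant|>"):
    """Return the generated answer with the prompt or chat marker removed."""
    if prompt_text and full_text.startswith(prompt_text):
        return full_text[len(prompt_text):].strip()
    if marker in full_text:
        return full_text.rsplit(marker, 1)[-1].strip()
    return full_text.strip()  # Fallback: nothing recognizable to strip

# Possible usage with the names defined in this script:
# answer = extract_answer_sketch(full_generated_text_llm, final_prompt_for_local_model_llm)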

# ---------------------------------------------------------------------------
# 9. Cleanup (ChromaDB persists to disk, so no cleanup is needed unless you want to delete the directory)
# ---------------------------------------------------------------------------
# To delete the database directory when you are done:
# import shutil
# if os.path.exists(persist_directory):
#     print(f"\nTo clean up ChromaDB, manually delete the directory: {persist_directory}")
#     # shutil.rmtree(persist_directory)  # Be very careful with this command!