[Bug]: When using HuggingFaceEmbedding in VectorStoreIndex it loses attributes through copy.deepcopy() #14837

Closed
learnbott opened this issue Jul 19, 2024 · 3 comments · Fixed by #14860
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

@learnbott

Bug Description

I was integrating LlamaIndex, with HuggingFaceEmbedding as the embedding model for the vector store, into a DSPy model. It worked fine outside of DSPy teleprompters, but running teleprompter.compile() resulted in this error:

AttributeError: 'HuggingFaceEmbedding' object has no attribute '_model'

It turns out that the teleprompters make a copy.deepcopy() of the DSPy model including the LlamaIndex vector store. Before the copy is made the HuggingFaceEmbedding has the attribute _model, but the copy does not.

This may be related to issue #14464

Version

0.10.55

Steps to Reproduce

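import copy

# Import paths assume the 0.10.x package layout; adjust them if yours differs.
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.readers.file import PandasExcelReader

filepath = "/some/path/to/.xlsx"
documents = PandasExcelReader(sheet_name="sheet 1").load_data(filepath)
embed_model_name = "BAAI/bge-small-en-v1.5"
embed_model = HuggingFaceEmbedding(model_name=embed_model_name)
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
# Note: despite the variable name, as_retriever() returns a retriever.
query_engine = index.as_retriever(similarity_top_k=2)
# Deep-copy the retriever, as the DSPy teleprompters do internally.
qe_clone = copy.deepcopy(query_engine)
print('Original query engine', hasattr(query_engine._embed_model, "_model"))
print('Copy query engine', hasattr(qe_clone._embed_model, "_model"))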

[output]
Original query engine True
Copy query engine False

Relevant Logs/Tracebacks

qe_clone.retrieve("LlamaIndex is awesome.")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[5], line 13
---> 13 qe_clone.retrieve("LlamaIndex is awesome.")

File /usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py:230, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    226 self.span_enter(
    227     id_=id_, bound_args=bound_args, instance=instance, parent_id=parent_id
    228 )
    229 try:
--> 230     result = func(*args, **kwargs)
    231 except BaseException as e:
    232     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File /usr/local/lib/python3.10/dist-packages/llama_index/core/base/base_retriever.py:243, in BaseRetriever.retrieve(self, str_or_query_bundle)
    238 with self.callback_manager.as_trace("query"):
    239     with self.callback_manager.event(
    240         CBEventType.RETRIEVE,
    241         payload={EventPayload.QUERY_STR: query_bundle.query_str},
    242     ) as retrieve_event:
--> 243         nodes = self._retrieve(query_bundle)
    244         nodes = self._handle_recursive_retrieval(query_bundle, nodes)
    245         retrieve_event.on_end(
    246             payload={EventPayload.NODES: nodes},
    247         )

File /usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py:230, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    226 self.span_enter(
    227     id_=id_, bound_args=bound_args, instance=instance, parent_id=parent_id
    228 )
    229 try:
--> 230     result = func(*args, **kwargs)
    231 except BaseException as e:
    232     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File /usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/retrievers/retriever.py:97, in VectorIndexRetriever._retrieve(self, query_bundle)
     94 if self._vector_store.is_embedding_query:
     95     if query_bundle.embedding is None and len(query_bundle.embedding_strs) > 0:
     96         query_bundle.embedding = (
---> 97             self._embed_model.get_agg_embedding_from_queries(
     98                 query_bundle.embedding_strs
     99             )
    100         )
...
    128         prompt_name=prompt_name,
    129         normalize_embeddings=self.normalize,
    130     ).tolist()

AttributeError: 'HuggingFaceEmbedding' object has no attribute '_model'
learnbott added the bug and triage labels on Jul 19, 2024

dosubot bot commented Jul 19, 2024

Hey there, @learnbott! 👋 I'm here to help you with any bugs, questions, or contributions you might have. Let's squash those bugs together!

The issue arises because the HuggingFaceEmbedding object loses its _model attribute after being deep-copied. This is likely due to how copy.deepcopy() handles the internal state of the object.

To resolve this, you can override the __deepcopy__ method in the HuggingFaceEmbedding class to ensure that the _model attribute is correctly copied. Here is an example of how you can modify the HuggingFaceEmbedding class:

from copy import deepcopy
from llama_index.embeddings.huggingface.base import HuggingFaceEmbedding

class CustomHuggingFaceEmbedding(HuggingFaceEmbedding):
    def __deepcopy__(self, memo):
        # Create a new instance of CustomHuggingFaceEmbedding
        new_instance = type(self)(
            model_name=self.model_name,
            max_length=self.max_length,
            query_instruction=self.query_instruction,
            text_instruction=self.text_instruction,
            normalize=self.normalize,
            cache_folder=self.cache_folder,
            device=self._device,
            callback_manager=self.callback_manager,
        )
        # Manually copy the _model attribute
        new_instance._model = deepcopy(self._model, memo)
        return new_instance

By adding the __deepcopy__ method, you ensure that the _model attribute is properly copied when copy.deepcopy() is called.

You can then use this modified class in your code:

import copy
filepath = "/some/path/to/.xlsx"
documents = PandasExcelReader(sheet_name="sheet 1").load_data(filepath)
embed_model_name = "BAAI/bge-small-en-v1.5"
embed_model = CustomHuggingFaceEmbedding(model_name=embed_model_name)
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
query_engine = index.as_retriever(similarity_top_k=2)
qe_clone = copy.deepcopy(query_engine)
print('Original query engine', hasattr(query_engine._embed_model, "_model"))
print('Copy query engine', hasattr(qe_clone._embed_model, "_model"))

This should resolve the AttributeError and ensure that the _model attribute is present in the copied HuggingFaceEmbedding object [1].
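
Why a __deepcopy__ override changes the outcome can be seen in isolation with a toy class. This is a minimal sketch, not LlamaIndex code: ToyEmbedding is a hypothetical stand-in, its _model is a plain dict rather than a real model, and its __getstate__ is assumed to drop private attributes in the way the bug report describes. When a class defines __deepcopy__, copy.deepcopy() calls it instead of going through __getstate__, so the private attribute survives.

```python
import copy


class ToyEmbedding:
    """Hypothetical stand-in for an embedding wrapper (not LlamaIndex code)."""

    def __init__(self, model_name):
        self.model_name = model_name
        # Placeholder for a loaded model held in a private attribute.
        self._model = {"weights": [0.1, 0.2]}

    def __getstate__(self):
        # Assumed behavior: private attributes are dropped from the state.
        state = self.__dict__.copy()
        state.pop("_model", None)
        return state

    def __deepcopy__(self, memo):
        # Build the clone without calling __init__, register it in memo
        # so cyclic references resolve, then copy every attribute.
        clone = type(self).__new__(type(self))
        memo[id(self)] = clone
        for name, value in self.__dict__.items():
            setattr(clone, name, copy.deepcopy(value, memo))
        return clone


original = ToyEmbedding("BAAI/bge-small-en-v1.5")
clone = copy.deepcopy(original)
print("Original has _model:", hasattr(original, "_model"))
print("Copy has _model:", hasattr(clone, "_model"))
```

Deep-copying a real model duplicates its weights in memory, so sharing the same model object between the original and the copy may be preferable in practice; that trade-off is outside what this sketch shows.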


@skvrd
Contributor

skvrd commented Jul 19, 2024

Here is the line that causes the deep copy issue:

state["__private_attribute_values__"] = {}

The intent of this line is unclear.
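
The effect of a line like this can be reproduced without LlamaIndex. The following is a minimal sketch, assuming the quoted line runs inside a __getstate__ method (the thread does not say where it lives). ToyEmbedding and the "public" state key are hypothetical names made up for illustration. Because copy.deepcopy() falls back to __getstate__/__setstate__ when no __deepcopy__ is defined, blanking the private values in the state means the copy never receives _model.

```python
import copy


class ToyEmbedding:
    """Hypothetical stand-in for an embedding wrapper (not LlamaIndex code)."""

    def __init__(self, model_name):
        self.model_name = model_name
        self._model = object()  # placeholder for a loaded model

    def __getstate__(self):
        # Public fields are kept, but private values are blanked,
        # mirroring the quoted line.
        state = {"public": {"model_name": self.model_name}}
        state["__private_attribute_values__"] = {}
        return state

    def __setstate__(self, state):
        # Restore whatever the state carries; the private dict is empty,
        # so _model is never set on the copy.
        self.__dict__.update(state["public"])
        self.__dict__.update(state["__private_attribute_values__"])


original = ToyEmbedding("BAAI/bge-small-en-v1.5")
clone = copy.deepcopy(original)
print("Original has _model:", hasattr(original, "_model"))
print("Copy has _model:", hasattr(clone, "_model"))
```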

@learnbott
Author

> Here is the line, that causes the deep copy issue:
>
> state["__private_attribute_values__"] = {}
>
> Intent of this line is unclear.

Commenting out this line allowed happiness to be published throughout all the land. Thank you.

This issue was closed.