
Integrating AWS DocumentDB as a vector storage method #12217

Merged
merged 28 commits into run-llama:main on Apr 22, 2024

Conversation

prajwalsaokar
Contributor

@prajwalsaokar prajwalsaokar commented Mar 24, 2024

Description

This change adds a vector store integration for AWS DocumentDB. @vidur2 and I saw an issue opened in the LangChain repository asking for this integration and believed that LlamaIndex users would find it useful as well, given that we were using it for a project. There are no new dependencies; this integration only requires the pymongo library.

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Mar 24, 2024
@vidur2

vidur2 commented Mar 26, 2024

Our unit tests require one to input their own AWS credentials to run. In order to avoid exposing AWS secrets, we didn't include ours as part of the PR. Is there some way we can get approved without having these tests pass?

Collaborator

It might also be good to add an example notebook in docs/docs/examples/vector_stores ?

Collaborator

@logan-markewich logan-markewich left a comment

lgtm, but still would be nice to have an example so people know how to use this :)

@logan-markewich
Collaborator

The tests should mock out API requests, or should be skipped properly if creds are missing
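The mocking option Logan mentions can be sketched roughly as follows. This is illustrative only: the vector store and collection names here are placeholders, not the PR's actual code, and in practice the mock would stand in for the pymongo collection the store talks to.

```python
# Sketch of the mocking option: replace the pymongo collection with a
# MagicMock so tests need no AWS credentials or DocumentDB instance.
# The names below are illustrative, not taken from the PR's code.
from unittest.mock import MagicMock

mock_collection = MagicMock()
# Pretend the aggregation pipeline (the vector search) returned one hit.
mock_collection.aggregate.return_value = [{"_id": "node-1", "score": 0.92}]

# In a real test, the mock would be injected into the vector store
# (e.g. store._collection = mock_collection) before calling query().
results = list(mock_collection.aggregate([{"$match": {}}]))
```

Because `MagicMock` records calls, a test can also assert that the expected pipeline was sent, not just check the canned result.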

@vidur2

vidur2 commented Apr 1, 2024

> The tests should mock out API requests, or should be skipped properly if creds are missing

Should be fixed.


@prajwalsaokar
Contributor Author

@logan-markewich is there anything else we need to add so that this can be merged? thanks for guiding us through the process

@logan-markewich
Collaborator

@prajwalsaokar the tests were a little wonky -- we need to mark them as skip (which is what I did), if we aren't going to mock the API calls. Also cleaned up a few other things, should be good to go now

@logan-markewich
Collaborator

@prajwalsaokar ngl the tests are pretty borked -- do you mind taking a look and ensuring either
a) things are properly mocked (api calls, etc.)
b) things are properly skipped when not mocked and things are missing
c) that tests actually work (I was hitting all kinds of errors actually haha)

@prajwalsaokar
Contributor Author

> @prajwalsaokar ngl the tests are pretty borked -- do you mind taking a look and ensuring either a) things are properly mocked (api calls, etc.) b) things are properly skipped when not mocked and things are missing c) that tests actually work (I was hitting all kinds of errors actually haha)

We just tried the tests on two machines, and they should work. It's just a bit of a hassle to set them up, because you need a DocumentDB instance and an EC2 instance that's connected to the DB instance with the correct networking config for the tests to run properly. Could you send over the errors you're getting?

@prajwalsaokar
Contributor Author

We also think we implemented skipping correctly, but we didn't come up with a way to skip the tests if the EC2 configuration isn't correct since that's not visible to LlamaIndex. Can you clarify what changes you think we should make on the skip/mock front?

@samkhano1

It looks like we should follow something like Cosmos or Postgres. They simply try to connect to DB and if there is a failure, they set a "*_not_available" variable which is used to skip the test. From a quick glance at tests written for a few vector stores, it doesn't look like mocking the API calls is typically done. I'll leave it up to Logan to provide requirements for the merge, but just wanted to leave this comment first.
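The connect-or-skip pattern described above can be sketched roughly like this. The real suites use pytest's `skipif`; this stdlib-only version shows the same idea with `unittest.skipIf`, and the localhost URI is a placeholder, not the actual test configuration.

```python
# Connect-or-skip pattern, as in the Cosmos/Postgres test suites:
# try to reach the database once at import time; on any failure, set a
# "*_not_available" flag that skips every DB-dependent test.
# (Placeholder URI; real tests would read connection details from env.)
import unittest

docdb_not_available = True
try:
    from pymongo import MongoClient  # the integration's only dependency
    client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=500)
    client.admin.command("ping")  # raises if the server is unreachable
    docdb_not_available = False
except Exception:
    pass


class TestDocDbVectorStore(unittest.TestCase):
    @unittest.skipIf(docdb_not_available, "DocumentDB not available")
    def test_add_and_query(self):
        pass  # real assertions against the store would go here
```

Catching a broad `Exception` around both the import and the ping means the suite degrades to "skipped" whether pymongo is missing, credentials are absent, or the server is simply unreachable.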

@logan-markewich
Collaborator

Correct, usually you can do some sort of check and based on the condition, skip or not.

Mocking would allow the tests to actually run in CI/CD, but it's usually more work (hence why other vector DBs don't have it)

I'll leave it up to you since this is your integration 👍🏻 either is fine by me

Once CI/CD passes here I will merge

@logan-markewich
Collaborator

Seems there is an incorrect import, maybe give it a double check

For example, it should be from llama_index.core.schema import...

@logan-markewich
Collaborator

Tests are still failing with the same import error 😓 Do these actually run locally for you? I'd be very surprised if they did. I'll push a fix

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Apr 22, 2024
@logan-markewich logan-markewich merged commit 3232a1f into run-llama:main Apr 22, 2024
8 checks passed
@ip9inderpreet

Thanks Logan for the help. Thank you Prajwal and Vidur for your contribution. I will try this out soon.

chrisalexiuk-nvidia pushed a commit to chrisalexiuk-nvidia/llama_index that referenced this pull request Apr 25, 2024
mattf pushed a commit to mattf/llama_index that referenced this pull request Apr 25, 2024
@prajwalsaokar
Contributor Author

Thank you so much for all the help Logan, this was my first open source contribution and I learned a lot throughout this process. I'm very grateful for the time you spent ensuring that this integration could go through.

@jpcoutinho

@prajwalsaokar thank you for the integration, seriously saved me a ton of work. Quick question though: I noticed you added the pre-filter in your code, but I couldn't find anything about pre-filtering in DocumentDB docs for vector search. Did you check if it is working on DocDB? Thanks again!
