[Feature Request]: Add fail_on_http_error boolean to Github readers #12649

Closed
rwood-97 opened this issue Apr 8, 2024 · 5 comments · Fixed by #13366
Labels
enhancement New feature or request triage Issue needs to be triaged/prioritized

Comments

@rwood-97
Contributor

rwood-97 commented Apr 8, 2024

Feature Description

We are getting an error when trying to pull files from the GitHub API using the GithubRepositoryReader.

So far we have added timeout and retries arguments to the reader to try to resolve this, but it still seems to fail. Instead, we have added a try/except in our own code to catch the HTTPError and continue on to the next repo we are trying to get data from. This means that if one request fails when getting data from a repo, we skip the whole repo.
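
A minimal sketch of the workaround described above, assuming a per-repo loader callable. The names `load_repos` and `fake_load_repo` are hypothetical, and the local `HTTPError` class is only a stand-in for `httpx.HTTPError` so the example runs without network access:

```python
from typing import Callable, Dict, List


class HTTPError(Exception):
    """Stand-in for httpx.HTTPError (assumption, keeps the sketch offline)."""


def load_repos(
    repos: List[str],
    load_repo: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Load each repo in turn; a repo whose load raises HTTPError is skipped."""
    loaded: Dict[str, List[str]] = {}
    for repo in repos:
        try:
            loaded[repo] = load_repo(repo)
        except HTTPError as exc:
            # A single failed request discards every file of this repo.
            print(f"Skipping repo {repo}: {exc!r}")
    return loaded


def fake_load_repo(repo: str) -> List[str]:
    """Hypothetical loader: 'rse-course' fails on one request, others succeed."""
    if repo == "rse-course":
        raise HTTPError("connect timeout")
    return [f"{repo}/README.md"]


result = load_repos(["rse-course", "other-repo"], fake_load_repo)
```

Here `result` contains only `other-repo`: nothing from `rse-course` is kept, even though only one of its requests failed.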

We would like to add a fail_on_http_error flag to the readers. If False, the reader would catch the exception and continue on to the next file; if True, it would keep the current behaviour of raising the exception.
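
The requested behaviour could look roughly like the sketch below. This is an illustration of the idea, not the reader's actual code: `load_files` and `fake_fetch` are hypothetical, only the flag name `fail_on_http_error` comes from this request, and the local `HTTPError` is a stand-in for `httpx.HTTPError`:

```python
from typing import Callable, List


class HTTPError(Exception):
    """Stand-in for httpx.HTTPError (assumption, keeps the sketch offline)."""


def load_files(
    paths: List[str],
    fetch: Callable[[str], str],
    fail_on_http_error: bool = True,
) -> List[str]:
    """Fetch every path; on HTTPError either re-raise or skip that one file."""
    documents: List[str] = []
    for path in paths:
        try:
            documents.append(fetch(path))
        except HTTPError as exc:
            if fail_on_http_error:
                # Current behaviour: the first failure aborts the whole load.
                raise
            # Proposed behaviour: skip this file and keep going.
            print(f"Skipping file {path}: {exc!r}")
    return documents


def fake_fetch(path: str) -> str:
    """Hypothetical fetcher that fails for a single file."""
    if path == "broken.md":
        raise HTTPError("connect timeout")
    return f"contents of {path}"


docs = load_files(["a.md", "broken.md", "b.md"], fake_fetch, fail_on_http_error=False)
```

Defaulting the flag to True keeps existing callers on the current raising behaviour; only callers who opt in get partial results.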

Reason

Continue on to the next file if there is an error when requesting a file from the GitHub API.

Value of Feature

Rather than skipping the whole repo if one request fails, we could skip just that file. This would allow us to get more complete data.

rwood-97 added the enhancement and triage labels Apr 8, 2024
@rwood-97
Contributor Author

rwood-97 commented Apr 8, 2024

I am happy to do this next Friday (19th), when I have some time set aside, or beforehand if I have time available.

@rwood-97
Contributor Author

rwood-97 commented Apr 8, 2024

Relates to alan-turing-institute/reginald#157

@logan-markewich
Collaborator

@rwood-97 what's the error though? Just 404? Anything more specific?

@rwood-97
Contributor Author

rwood-97 commented Apr 9, 2024

This is an old log of the error (before the llama-hub integration), but essentially we are getting a timeout error.
The most recent and relevant bit of our codebase is here

2024-01-04 12:23:44 [    INFO] HTTP Request: GET https://api.github.com/repos/alan-turing-institute/rse-course/git/blobs/4b0216fc47103141a79c20a6d145ad5fbac93040 "HTTP/1.1 200 OK"
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/anyio/_core/_sockets.py", line 189, in connect_tcp
    addr_obj = ip_address(remote_host)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: 'api.github.com' does not appear to be an IPv4 or IPv6 address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 114, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/anyio/_core/_sockets.py", line 192, in connect_tcp
    gai_res = await getaddrinfo(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 339, in __wakeup
    future.result()
  File "/usr/local/lib/python3.11/asyncio/futures.py", line 198, in result
    raise exc
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 113, in connect_tcp
    with anyio.fail_after(timeout):
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/anyio/_core/_tasks.py", line 119, in __exit__
    raise TimeoutError
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_transports/default.py", line 66, in map_httpcore_exceptions
    yield
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_transports/default.py", line 366, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 268, in handle_async_request
    raise exc
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 251, in handle_async_request
    response = await connection.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_async/connection.py", line 99, in handle_async_request
    raise exc
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_async/connection.py", line 76, in handle_async_request
    stream = await self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_async/connection.py", line 124, in _connect
    stream = await self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_backends/auto.py", line 30, in connect_tcp
    return await self._backend.connect_tcp(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 112, in connect_tcp
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectTimeout

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/app/reginald/models/create_index.py", line 86, in main
    data_creator.create_index()
  File "/app/reginald/models/models/llama_index.py", line 457, in create_index
    self.prep_documents()
  File "/app/reginald/models/models/llama_index.py", line 208, in prep_documents
    self._load_rse_course(gh_token)
  File "/app/reginald/models/models/llama_index.py", line 295, in _load_rse_course
    self.documents.extend(rse_course_loader.load_data(branch="main"))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/base.py", line 287, in load_data
    return self._load_data_from_branch(branch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/base.py", line 258, in _load_data_from_branch
    return self._loop.run_until_complete(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/nest_asyncio.py", line 99, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 269, in __step
    result = coro.throw(exc)
             ^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/base.py", line 393, in _generate_documents
    async for blob_data, full_path in buffered_iterator:
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/utils.py", line 82, in __anext__
    await self._fill_buffer()
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/utils.py", line 156, in _fill_buffer
    results: List[GitBlobResponseModel] = await asyncio.gather(
                                          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 339, in __wakeup
    future.result()
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 267, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/github_client.py", line 401, in get_blob
    await self.request(
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/github_client.py", line 320, in request
    raise excp
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/llama_hub/github_repo/github_client.py", line 315, in request
    response = await _client.request(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_transports/default.py", line 365, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/root/.cache/pypoetry/virtualenvs/reginald-9TtSrW0h-py3.11/lib/python3.11/site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectTimeout
HTTP Exception for https://api.github.com/repos/alan-turing-institute/rse-course/git/blobs/465e4946c00ab75627f52bb1863e049a16fbc7a2 - 
HTTP Exception for https://api.github.com/repos/alan-turing-institute/rse-course/git/blobs/a945df56353a33ce152f6719edbb739458ac9143 - 
HTTP Exception for https://api.github.com/repos/alan-turing-institute/rse-course/git/blobs/4261cdf35cc1e961c896d47b853f5cc078a19073 - 
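
In the traceback above, the timeout from one `get_blob` call propagates out of `asyncio.gather`, so the whole batch fails. As a rough illustration of one way to isolate per-blob failures (an assumption for discussion, not how the reader is implemented), `asyncio.gather` accepts `return_exceptions=True`, which returns exceptions as results instead of raising the first one. `fetch_blob` and `fetch_all` are hypothetical names:

```python
import asyncio
from typing import List


async def fetch_blob(sha: str) -> str:
    """Hypothetical fetch: shas starting with 'bad' simulate a timeout."""
    await asyncio.sleep(0)
    if sha.startswith("bad"):
        raise TimeoutError(f"timed out fetching {sha}")
    return f"blob:{sha}"


async def fetch_all(shas: List[str]) -> List[str]:
    """Fetch all blobs concurrently and keep only the ones that succeeded."""
    # With return_exceptions=True, a failed task yields its exception object
    # in the result list instead of aborting the whole gather call.
    results = await asyncio.gather(
        *(fetch_blob(sha) for sha in shas),
        return_exceptions=True,
    )
    blobs: List[str] = []
    for sha, result in zip(shas, results):
        if isinstance(result, BaseException):
            print(f"Skipping blob {sha}: {result!r}")
        else:
            blobs.append(result)
    return blobs


blobs = asyncio.run(fetch_all(["4b0216fc", "bad465e4", "a945df56"]))
```

The results come back in the same order as the inputs, so each failure can still be matched to the blob that caused it.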

run-llama/llama-hub#846 was from us.
run-llama/llama-hub#529 may also be related.

@rwood-97
Contributor Author

This was a lot harder than I thought :(
