Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/file uploader abstractions #7025

Merged
merged 34 commits into from Jul 31, 2023
Merged

Conversation

kajarenc
Copy link
Collaborator

@kajarenc kajarenc commented Jul 17, 2023

Describe your changes

This PR change is a reworking of a process to upload files for st.file_uploader and st.camera_input.
It introduces the abstraction for UploadedFileManager, the default implementation continues to store files in memory.

Now the process of uploading file will consist from two steps

  1. Issuing upload file URLs (happen via web socket communication, file_urls_request handler).
  2. Upload a file to the URL issued in step 1.

GitHub Issue Link (if applicable)

Testing Plan


Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

@vdonato vdonato force-pushed the feature/file_uploader_abstractions branch from 465d08b to 4430df9 Compare July 18, 2023 00:05
@kajarenc kajarenc force-pushed the feature/file_uploader_abstractions branch from 738aef1 to f683cd0 Compare July 24, 2023 12:01
vdonato and others added 23 commits July 25, 2023 14:50
* Uploaded file manager protocol

* Implement MemoryUploadFileManager

* Fix failing tests, not related to file uploader

* Add some tests for UploadFileRequestHandlerTest

* formatter fixes

* fix all tests in  uploaded_file_request_handler.py

* Tweak some comments and add some TODOs

* Add another TODO

---------

Co-authored-by: Karen Javadyan <kajarenc@gmail.com>
Co-authored-by: Vincent Donato <vincent@streamlit.io>
…and FileUploadClient (#6754)

* Start supporting configurable file upload URLs in StreamlitEndpoints and FileUploadClient

* Clarify some comments and fix formatting

* Remove widgetId from file upload related methods

* Have csrfRequest assume it always gets an absolute URL
* Add new proto definitions for file upload URL requests/responses

* Add server websocket handler for file_urls_request BackMsgs

* Rename FileUrls* -> FileURLs*

* Install uuid types

* Mark resolver fields as readonly

* Allow file URLs to be requested via websocket

* Add temporary  # type: ignore

* Remove HTTP endpoints for fetching file URLs

* Remove unused variable to appease eslint
* * add `file_delete_url` to `UploadedFileInfo` proto
* add `get_upload_urls` to `UploadedFileManager`
* `UploadedFileManager` now works with file_id to get_files.

* Add todos

* remove abstractmothod from get_upload_urls
* Have UploadedFileManager react client fetch and use upload URLs

* Fix issues from rebase
* reimplement camera_input to work with new uploaded file manager

* call delete endpoint when clear camera_input
* Change some types from Promise<number> to Promise<void>

* Remove serverFileId and related fields from FileUploader and friends

* Remove serverFileId from CameraInput

* Remove more newly unused fields from CameraInput and FileUploader
* extract FileUrls to separate proto

* use copy from

* use FileURLS also in camera_input

* fixes after review
* Run autoformatter on e2e/specs/st_file_uploader.spec.js

* Fix some more small type errors

* Fix watched cypress route in st_file_uploader.spec.js
* Fix a bunch of FileUploader js unit tests

* Fix a bunch of type errors in tests
@kajarenc kajarenc force-pushed the feature/file_uploader_abstractions branch from a2831c4 to d45a488 Compare July 25, 2023 10:56
* Wait until file uploaded instead of fix amount of time

* fix tests for CameraInput
fix tests for DefaultStreamlitEndpoints

* fix eslint errors

* add uuid to NOTICES
@@ -179,6 +179,8 @@ def set_widget_metadata(self, widget_meta: WidgetMetadata[Any]) -> None:

def remove_stale_widgets(self, active_widget_ids: set[str]) -> None:
"""Remove widget state for widgets whose ids aren't in `active_widget_ids`."""
# TODO(vdonato / kajarenc): Remove files corresponding to an inactive file
# uploader.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not critical, we keep this for future improvement

Copy link
Collaborator

@vdonato vdonato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments, but overall this LGTM

I won't officially hit the +1 button given so much of this code is mine that I'm pretty sure that I'm not qualified to approve it 😆

Comment on lines -259 to -270
def _on_files_updated(self, session_id: str) -> None:
"""Event handler for UploadedFileManager.on_file_added.
Ensures that uploaded files from stale sessions get deleted.

Notes
-----
Threading: SAFE. May be called on any thread.
"""
if not self._session_mgr.is_active_session(session_id):
# If an uploaded file doesn't belong to an active session,
# remove it so it doesn't stick around forever.
self._uploaded_file_mgr.remove_session_files(session_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I feel like this might be something that we want to keep even in the new world. We still have the mechanism that calls remove_session_files when a session is shut down, but I'm slightly worried of the possibility that a file corresponding to a now nonexistent session is somehow uploaded so is never cleaned up without this.

Would it be possible to revive this callback function but change it to be called when no corresponding session (active or inactive) exists for a given session_id? This happens when self._session_mgr.get_session_info(session_id) is None

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey!
We check that session is active after the file is fully uploaded, and just before storing it into storage.
https://github.com/streamlit/streamlit/pull/7025/files#diff-3e1711ee513b43ba25f0f9118bfe0135a503ed40da6474b7b2c5ddcef9e940f5R101 , so it shouldn't be possible very unlikely to upload file not corresponding to active sesion

So I don't want to make it part of the protocol but will be happy to return to this during refactoring of removing session_id from URLs, and rethink deleting files mechanism for open source implementation.

Comment on lines 173 to 187
@patch("streamlit.elements.widgets.file_uploader._get_upload_files")
def test_deleted_file_omitted(self, get_upload_files_patch):
"""We should omit DeletedFile objects for final user value ."""

uploaded_files = [DeletedFile(file_id="A")]
get_upload_files_patch.return_value = uploaded_files

st.file_uploader("foo", accept_multiple_files=True)
result_1: UploadedFile = st.file_uploader("a", accept_multiple_files=False)
result_2: UploadedFile = st.file_uploader("b", accept_multiple_files=True)

self.assertEqual(result_1, None)
self.assertEqual(result_2, [])

@patch("streamlit.elements.widgets.file_uploader._get_upload_files")
def test_deleted_files_filtered_out(self, get_upload_files_patch):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these two tests essentially identical? I think we can probably remove the first one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, the second test is a superset of the first test, I removed the first one!

def _get_file_recs_for_camera_input_widget(
widget_id: str, widget_value: Optional[FileUploaderStateProto]
) -> List[UploadedFileRec]:
def _get_upload_files(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we factor this out into a helper function shared by camera_input and file_uploader? This function seems to be identical for both of the widgets

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it difficult to find a new home for def _get_upload_files (because it is used only in file_uploader and camera_input, so utils.py is probably not the best place, so I just now imported _get_upload_files from file_uploader for camera_input, I think it is good idea, and denote the fact that camra_input essentially a derivative form file_uploader

Comment on lines 42 to 44
SomeUploadedFiles = Optional[
Union[UploadedFile, DeletedFile, List[Union[UploadedFile, DeletedFile]]]
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can add None to the union instead of having an Optional[Union[...]] to be consistent with camera input

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 450 to 462

filtered_value: Union[UploadedFile, List[UploadedFile], None]

if isinstance(widget_state.value, DeletedFile):
filtered_value = None
elif isinstance(widget_state.value, list):
filtered_value = [
f for f in widget_state.value if not isinstance(f, DeletedFile)
]
else:
filtered_value = widget_state.value

return filtered_value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it'd be fine to return the value directly from each branch of the if/elif/else statement rather than save it to an intermediate variable that's returned immediately

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -225,7 +229,7 @@ def file_uploader(
*, # keyword-only arguments:
disabled: bool = False,
label_visibility: LabelVisibility = "visible",
) -> SomeUploadedFiles:
) -> Optional[Union[UploadedFile, List[UploadedFile]]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same comment about adding None to the union instead of having this be Optional (same thing again with L389)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

class DeletedFile(NamedTuple):
"""Represents a deleted file in deserialized values for st.file_uploader and
st.camera_input
Return this from st.file_uploader and st.camera_input deserialize (so they can
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should have a newline between the docstring summary and the rest of it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@LukasMasuch LukasMasuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 I just have some nits and questions, but no major concern.

frontend/lib/src/FileUploadClient.ts Outdated Show resolved Hide resolved
@@ -34,6 +34,14 @@ export interface StreamlitEndpoints {
*/
buildMediaURL(url: string): string

/**
* Construct a URL for uploading a file.
* @param url a relative or absolute URL. If `url` is absolute, it will be
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think in the JSdoc docstring syntax the parameter is separated via a -, e.g.:

Suggested change
* @param url a relative or absolute URL. If `url` is absolute, it will be
* @param url - a relative or absolute URL. If `url` is absolute, it will be

There are also a few other instances in the PR that could get correct, but it seems that we haven't been following this anyways in the current impl.

}

message FileUploaderState {
// DEPRECATED
sint64 max_file_id = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the comment above, couldn't we just fully remove this via:

Suggested change
sint64 max_file_id = 1;
reserved 1;
reserved "max_file_id";

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we are okay to remove / change the type of any proto message field, because of dependencies from other teams.

This is a good question to clarify, to have a solid understanding of the contract we keep for proto messages.
For now, I think the safest thing is just to keep the field as is, with a comment that it is now deprecated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oki, let's keep it as is 👍 But maybe we bring it up in a standup soon to figure out if it would break other dependencies.

// Information on a file uploaded via the file_uploader widget.
message UploadedFileInfo {
// DEPRECATED.
sint64 id = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this field still used in any way? If not, could we just remove this and add:

Suggested change
sint64 id = 1;
reserved 1;
reserved "id";

all_files: List[UploadedFileRec] = []
# Make copy of self.file_storage for thread safety, to be sure
# that main storage won't be changed form other thread
file_storage_copy = self.file_storage.copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does that mean that all files are also duplicated in memory? So calling this (e.g. via the metrics endpoint), might crash the app because of a sudden memory increase. Not a big deal right now since it isn't used a lot, but if we do more with those stats we might want to revisit this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, now we create a duplicate to collect statistics.
The good news is that this duplicate should be short-living, and Python Garbage Collector should collect it very fast because we don't keep any references to file_storage_copy, but yes, this could lead to memory usage increase in case of a lot of stored uploaded files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can add a TODO comment or something there, so that we double check this once we do more with the stats.

frontend/app/src/App.tsx Show resolved Hide resolved
frontend/app/src/App.tsx Show resolved Hide resolved
kajarenc and others added 3 commits July 28, 2023 20:30
Co-authored-by: Lukas Masuch <Lukas.Masuch@gmail.com>
…ient (#7092)

We want to avoid having the notebooks team have to make any code changes on their end due to the
interface / constructor changes we're making to StreamlitEndpoints and FileUploadClient. In order
to do this, we make the newly added methods/args optional.
@kajarenc kajarenc merged commit a2bd07f into develop Jul 31, 2023
86 checks passed
@vdonato vdonato deleted the feature/file_uploader_abstractions branch November 1, 2023 23:56
eric-skydio pushed a commit to eric-skydio/streamlit that referenced this pull request Dec 20, 2023
This PR change is a reworking of a process to upload files for st.file_uploader and st.camera_input.
It introduces the abstraction for UploadedFileManager, the default implementation continues to store files in memory.

Now the process of uploading file will consist from two steps

Issuing upload file URLs (happen via web socket communication, file_urls_request handler).
Upload a file to the URL issued in step 1.
zyxue pushed a commit to zyxue/streamlit that referenced this pull request Mar 22, 2024
This PR change is a reworking of a process to upload files for st.file_uploader and st.camera_input.
It introduces the abstraction for UploadedFileManager, the default implementation continues to store files in memory.

Now the process of uploading file will consist from two steps

Issuing upload file URLs (happen via web socket communication, file_urls_request handler).
Upload a file to the URL issued in step 1.
zyxue pushed a commit to zyxue/streamlit that referenced this pull request Apr 16, 2024
This PR change is a reworking of a process to upload files for st.file_uploader and st.camera_input.
It introduces the abstraction for UploadedFileManager, the default implementation continues to store files in memory.

Now the process of uploading file will consist from two steps

Issuing upload file URLs (happen via web socket communication, file_urls_request handler).
Upload a file to the URL issued in step 1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants