Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add feedback trace server APIs #1718

Merged
merged 3 commits into from
Jun 6, 2024

Conversation

jamie-rasmussen
Copy link
Contributor

@jamie-rasmussen jamie-rasmussen commented Jun 4, 2024

Internal Jira: https://wandb.atlassian.net/browse/WB-18697

Adds Trace Server APIs for create / query / purge. These APIs are based on a new "ORM" layer that helps us have more similar code between ClickHouse and SQLite.

Emoji utilities were previously reviewed in core, bringing them over to this repo to correspond to change in where validation logic lives.

@tssweeney - there is obviously a bunch of code duplication here with the mongo processing in Calls that we should figure out what to do about.

@circle-job-mirror
Copy link

circle-job-mirror bot commented Jun 4, 2024

@jamie-rasmussen jamie-rasmussen force-pushed the jamie/feedback-server-api branch 5 times, most recently from 74bf345 to e0387bc Compare June 4, 2024 15:01
@jamie-rasmussen jamie-rasmussen marked this pull request as ready for review June 4, 2024 16:02
@jamie-rasmussen jamie-rasmussen requested a review from a team as a code owner June 4, 2024 16:02
result = query.execute(self.ch_client)
return tsi.FeedbackQueryRes(result=result)

def feedback_purge(self, req: tsi.FeedbackPurgeReq) -> tsi.FeedbackPurgeRes:
Copy link
Contributor

@tssweeney tssweeney Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels extremely risky - a user would be able to purge an entire project's feedback in 1 call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a concern for me too. Batch deletion seems useful, but we could make it so we only allow purging a list of ids. We could also keep this as is but make the Python API more restrictive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we might want to heavily restrict this to a list of ids for now

feedback_id = generate_id()
created_at = datetime.datetime.now(ZoneInfo("UTC"))
# TODO: Any validation on weave_ref?
# TODO: What should max payload size be?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1024 seems like a pretty sufficient cap for now - couple paragraphs in note form

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I picked a number rather arbitrarily - small because it is easier to loosen than tighten, but I'm open to alternatives.

@@ -0,0 +1,561 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3 loving this!

class FeedbackQueryReq(BaseModel):
project_id: str = Field(examples=["entity/project"])
fields: typing.Optional[list[str]] = Field(
default=None, examples=[["id", "feedback_type", "payload.note", "count(*)"]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting idea to combine count(*) in the field selection. I'm curious to read more below how this is handled - both from a sql injection perspective but also if you have count & fields, then you need 2 queries

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't added GROUP BY or HAVING to ORM yet. Kind of hacked in count(*) because it was useful for tests. Re: injection, I'm only allowing that specific string, we'd have to get fancier if we wanted to support functions more generally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is just useful for tests, i might not advertise it as a primary example in case you want to change the way aggregations are specified when you are doing an agg operation

wb_user_id: str


# The response provides the additional fields needed to convert a request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by this? Is there a back-and-forth here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The create request has a "partial" feedback object, the server does things like assign the id, determine the creation timestamp, possibly augment the payload (e.g. in the emoji case we augment it with the shortcode). So the server response includes all of the fields that are needed to make a complete "Feedback" object that will match what you would get if you had queried for all of the feedback.

So e.g. in the Python SDK, the response from creation allows us to update the list of feedback without having to do a separate roundtrip query or complete refresh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - makes sense


class FeedbackQueryRes(BaseModel):
# Note: this is not a list of Feedback because user can request any fields.
result: list[dict[str, typing.Any]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too bad we can't have Partials



class FeedbackQueryRes(BaseModel):
# Note: this is not a list of Feedback because user can request any fields.
Copy link
Contributor

@tssweeney tssweeney Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting that we support cases where the fields are not present. alternative idea for the "fields" in the request to only control payload fields, which would result in a list of valid feedbacks (where the payload could be modified/restricted)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of feedback there are probably not a lot of cases (although supporting COUNT(*) is already useful) where you wouldn't want the payload, but if think about API patterns for more complicated objects, like calls, I think there are more cases where you want the ability to only pull down the fields you need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree that it is useful to filter fields. However, in practice, there are a bunch of id/date/name like fields that are low cost. Then there are big ones (inputs, output, attributes, summary). Similarly here you have a few simple fields, then a possibly huge field (payload). My mental model for calls has been to always return the "simple fields". Then essentially have "partials" for the others. In such cases, the typing would be consistent because the big fields are still dicts, just empty.

I don't think this is a blocking comment btw, but maybe something to consider (both ways) as we start to harden the API

class FeedbackPurgeReq(BaseModel):
project_id: str = Field(examples=["entity/project"])
# TODO: Should we make this required?
query: typing.Optional[Query] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't like this ... feels very easy to mistakenly delete everything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opinion on how I should mitigate that? Add a limit on how many can rows be deleted at once? Change the API to only allow deletion by ID? Keep it as a (required) Query parameter but raise if it is too generic (by some heuristic)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to retain the query capability, I'd prefer to just allow a single query shape that matches what you need (which i think could just be an "and" over ids)

return self.db_name or self.name

def create_sql(self) -> str:
sql_type = "TEXT"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be self.type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed like in SQLite we were mapping all of the field types ORM currently supports (string, json, datetime) to a TEXT column, but this would need to be updated with an actual mapping to support other column types.

@jamie-rasmussen jamie-rasmussen merged commit f6ae2f4 into master Jun 6, 2024
24 checks passed
@jamie-rasmussen jamie-rasmussen deleted the jamie/feedback-server-api branch June 6, 2024 12:03
@github-actions github-actions bot locked and limited conversation to collaborators Jun 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants