
POC: endpoint to get evals info #246


Open · wants to merge 4 commits into main from gio/ta-evals-poc

Conversation

Contributor

@giovanni-guidini commented Jun 18, 2025

These changes add 2 new endpoints in the REST API (hidden behind a feature flag) with the objective of being a POC for reporting Eval info saved in Test Analytics.

Probably not what anyone else had in mind, but it's a POC, mind you. I didn't want to deal with building a frontend. Also, the current way we use evals is not in the CI, so I didn't think it made much sense to mess around with the PR comment either.

Anyway the endpoints are:

  • summary - aggregates eval data (that is saved per test-run).
  • compare - aggregates per commit and compares across commits.

⚠️ I had to change the DB router for the API for tests to pass.
It contained a custom field lookup, which was moved to the shared DB router.


seer-by-sentry bot commented Jun 18, 2025

Sentry detected 2 potential issues in your recent changes

Suspicion of an AttributeError in apps/codecov-api/api/public/v2/evals/views.py:99. Calling .get() on the testrun.properties JSONField could fail if the field is None, potentially causing unexpected behavior.
  • Description: The Testrun model's 'properties' field is defined as JSONField(null=True), allowing it to store NULL in the database, which Django represents as None in Python. The code at apps/codecov-api/api/public/v2/evals/views.py:99 attempts to call .get("eval", {}) directly on testrun.properties. If testrun.properties is None, calling .get() on it will raise an AttributeError: 'NoneType' object has no attribute 'get'. This specific pattern of calling .get() directly on a potentially None JSONField appears to be unique to this new code path. While the queryset filters for properties__isnull=False, there might be edge cases or race conditions where a None value could still be processed, leading to an unexpected server crash when this view is accessed for such a Testrun object.
  • Code location: apps/codecov-api/api/public/v2/evals/views.py:99~104
  • Suggested fix: Add a check to ensure testrun.properties is not None before attempting to call .get() on it. For example, use eval_data = (testrun.properties or {}).get("eval", {}).
Suspicion of a potential `AttributeError` crash in the evals viewset if `testrun.properties` is `None` when attempting to access its attributes.
  • Description: The code in _aggregate_testruns accesses testrun.properties.get("eval", {}) without checking if testrun.properties is None. The Testrun model defines properties as a JSONField(null=True), meaning it can store NULL in the database, which translates to None in Python. If a Testrun object is retrieved where properties is None, attempting to call .get() on it will raise an AttributeError: 'NoneType' object has no attribute 'get'. This would halt the request processing and lead to unexpected behavior for the user. While the get_queryset method includes a filter properties__isnull=False, there's a technical risk that due to race conditions, ORM caching, or other edge cases, a Testrun object with properties=None could still potentially reach the aggregation logic, triggering the AttributeError. For example, if a record's properties were set to NULL between the query execution and the data processing loop, or if caching returned a stale object state. This specific code path in the evals viewset is affected.
  • Code location: apps/codecov-api/api/public/v2/evals/views.py:98~119
  • Suggested fix: Add a check if testrun.properties is not None: before accessing testrun.properties.get("eval", {}) in the _aggregate_testruns method.
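
A minimal sketch of the defensive access both findings suggest, assuming the loop shape inside `_aggregate_testruns` (variable and key names are illustrative):

```python
def _aggregate_testruns(self, testruns):
    for testrun in testruns:
        # Guard against properties being NULL/None before calling .get() on it.
        eval_data = (testrun.properties or {}).get("eval", {})
        if not eval_data:
            continue
        # ... aggregate eval_data as before ...
```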


Apparently there was a custom field lookup in the database router that was being used by the API.
I moved it to the shared router, because the API router didn't know about the TA_TIMESERIES database
and the shared one didn't know about the special field lookup.

I think it's OK to consolidate to just 1 router, and that should probably be the shared one?
Hopefully. At least tests pass now :E
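
A rough sketch of the kind of consolidated router described here, just to make the idea concrete; the class name, app label, and database alias below are assumptions, not the actual implementation:

```python
class SharedDatabaseRouter:
    """Single router aware of both the default DB and the TA timeseries DB."""

    # App label assumed for models stored in the TA_TIMESERIES database.
    TA_TIMESERIES_APPS = {"ta_timeseries"}

    def db_for_read(self, model, **hints):
        if model._meta.app_label in self.TA_TIMESERIES_APPS:
            return "ta_timeseries"
        return "default"

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label in self.TA_TIMESERIES_APPS:
            return db == "ta_timeseries"
        return db == "default"
```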

codecov bot commented Jun 18, 2025

Codecov Report

Attention: Patch coverage is 95.49550% with 5 lines in your changes missing coverage. Please review.

Project coverage is 94.23%. Comparing base (90251ff) to head (30767a3).
Report is 3 commits behind head on main.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...s/shared/shared/django_apps/db_routers/__init__.py | 55.55% | 4 Missing ⚠️ |
| apps/codecov-api/api/public/v2/evals/views.py | 98.97% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #246      +/-   ##
==========================================
+ Coverage   94.21%   94.23%   +0.01%     
==========================================
  Files        1214     1214              
  Lines       45022    45089      +67     
  Branches     1435     1435              
==========================================
+ Hits        42418    42488      +70     
+ Misses       2303     2300       -3     
  Partials      301      301              
| Flag | Coverage Δ |
| --- | --- |
| apiunit | 96.53% <99.01%> (+0.05%) ⬆️ |
| sharedintegration | 39.75% <55.55%> (+0.01%) ⬆️ |
| sharedunit | 88.13% <55.55%> (-0.03%) ⬇️ |
| workerintegration | 61.61% <ø> (ø) |
| workerunit | 90.62% <ø> (ø) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.



codspeed-hq bot commented Jun 18, 2025

CodSpeed Performance Report

Merging #246 will not alter performance

Comparing gio/ta-evals-poc (30767a3) with main (6efc689)

Summary

✅ 9 untouched benchmarks

Contributor

@Swatinem left a comment


lgtm, but I would also recommend a review from someone else who knows these db router things better than me.



@Field.register_lookup
class IsNot(Lookup):
Contributor


I’m surprised this does not exist out of the box?
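
For reference, this is roughly how Django's documented custom-lookup registration works; the body below is a generic sketch (the lookup name and SQL are assumptions, not necessarily what this PR implements):

```python
from django.db.models import Field, Lookup


@Field.register_lookup
class IsNot(Lookup):
    # Assumed lookup name, for illustration only.
    lookup_name = "isnot"

    def as_sql(self, compiler, connection):
        lhs, lhs_params = self.process_lhs(compiler, connection)
        rhs, rhs_params = self.process_rhs(compiler, connection)
        return f"{lhs} IS NOT {rhs}", [*lhs_params, *rhs_params]
```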


def get_queryset(self):
return Testrun.objects.filter(
repo_id=self.repo.repoid, properties__isnull=False
Contributor


this looks reasonable to me, but assumes we only want to return testruns that have something to do with evals

Contributor Author


well... the name of the endpoint is "evals" after all :P

repo_id=self.repo.repoid, properties__isnull=False
)

def _aggregate_testruns(self, testruns):
Contributor


can you type this as list[Testrun] please? that should make it clearer that you are indeed fetching all the testruns into memory first.

I think that's fine for a first POC, but otherwise we should maybe offload this computation to Postgres. This might be a bit fiddly to write, but should perform better.
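
If that offloading ever happens, a hedged sketch of what it could look like with ORM annotations; the nested "eval" → score key layout, the commit_sha field, and the final grouping are assumptions:

```python
from django.db.models import Avg, Count, FloatField
from django.db.models.fields.json import KeyTextTransform, KeyTransform
from django.db.models.functions import Cast

# Assumes Testrun is imported from its models module and repo_id is in scope.
# Extract and average a single eval score per commit inside Postgres instead
# of iterating over all testruns in Python.
aggregates = (
    Testrun.objects.filter(repo_id=repo_id, properties__isnull=False)
    .annotate(
        eval_score=Cast(
            KeyTextTransform("score", KeyTransform("eval", "properties")),
            FloatField(),
        )
    )
    .values("commit_sha")
    .annotate(avg_score=Avg("eval_score"), runs=Count("pk"))
)
```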

Comment on lines 79 to 80
score_avgs = {}
score_counts = {}
Contributor


having two dicts might perform a bit better than having one dict and nested tuples, but nested tuples would be easier to read I believe.
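
To make the trade-off concrete, a tiny illustrative sketch of the two shapes (score names and values are made up):

```python
# Option A: two parallel dicts keyed by score name, as in the PR.
score_avgs: dict[str, float] = {}
score_counts: dict[str, int] = {}

# Option B: one dict of (running_sum, count) tuples, arguably easier to read.
score_totals: dict[str, tuple[float, int]] = {}
running_sum, count = score_totals.get("accuracy", (0.0, 0))
score_totals["accuracy"] = (running_sum + 0.87, count + 1)
```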

Contributor


always happy to see that we consolidate duplicated code

OpenApiParameter.QUERY,
description="commit SHA for which to return evaluation summary",
),
OpenApiParameter(
Contributor


what does the classname param mean?

failedItems: int
scores: dict[str, float]


Contributor


are the eval reports sharing the same Testrun table as TA-related reports? How does the queryset only pull eval reports?

else 0
)

# Calculate score averages for passed items
Contributor

@JerrySentry Jun 18, 2025


I think it would be useful to have the failed items get a score of 0 instead of being ignored. Maybe a separate metric for that. Because just looking at a 100% score without looking at the number of failed items seems deceiving, unless it failed because the evals code errored. Ideally we have 1 number that paints as much of the picture as possible.

Looks like the report builder from the evals takes in a score function, so I think we shouldn't add logic in here to do any form of computing and should just leave it to the report builder to do it.
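
Purely as an illustration of the single-number idea above (the function and names are hypothetical, and per the reply this kind of computation may belong in the report builder instead):

```python
def combined_score(scores: list[float], failed_items: int) -> float:
    """Average where every failed item contributes a score of 0."""
    total_items = len(scores) + failed_items
    if total_items == 0:
        return 0.0
    return sum(scores) / total_items
```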
