fix(exp): Add run in github experiment #837

nina-kollman · 2025-11-26T12:31:55Z

Important

Add GitHub Actions support for experiments with new runInGithub() method and sample experiment.

New Features:
- Add runInGithub() method in Experiment class in experiment.ts for GitHub Actions execution.
- Add sample_github_experiment.ts for GitHub-backed experiment sample.
- Add run:github_experiment script in package.json.
SDK Updates:
- Add GithubContext, TaskResult, RunInGithubOptions, and RunInGithubResponse types in experiment.interface.ts.
- Update run() method in Experiment class to detect GitHub Actions and route to runInGithub().
Bug Fixes:
- Conditional logging in sample_experiment.ts to print results only if taskResults exist.

^{This description was created by}^{for b4a5572. You can customize this summary. It will automatically update as commits are pushed.}

Summary by CodeRabbit

New Features
- CI-aware experiment execution: automatic detection and routing for GitHub Actions to support GitHub-backed experiment runs.
- Sample app: added a sample demonstrating GitHub-backed experiments and a runnable npm script to build & run it.
- SDK: added types to support GitHub context, per-row task results, and GitHub run responses.
Bug Fixes
- Conditional result logging: experiment outputs are logged only when per-row results are present.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…to nk/exp_in_github

coderabbitai · 2025-11-26T12:32:01Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds GitHub Actions–aware experiment execution to the SDK, new GitHub-related interfaces and exports, a GitHub experiment sample and npm script in the sample app, and conditional logging guards in an existing sample.

Changes

Cohort / File(s)	Summary
Sample app `packages/sample-app/package.json`, `packages/sample-app/src/sample_github_experiment.ts`, `packages/sample-app/src/sample_experiment.ts`	New npm script `run:github_experiment`; added `sample_github_experiment.ts` demonstrating a GitHub-backed experiment using Traceloop + OpenAI; added guards to conditionally log `taskResults` in `sample_experiment.ts`.
Experiment core `packages/traceloop-sdk/src/lib/client/experiment/experiment.ts`	`Experiment.run` now detects `GITHUB_ACTIONS` and delegates to a GitHub-backed path; added `runInGithub`, `runLocally`, `getGithubContext`, and `executeTasksLocally`; extended run return type to include `RunInGithubResponse`; implements local execution and backend POST flow.
Interfaces & exports `packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts`, `packages/traceloop-sdk/src/lib/node-server-sdk.ts`	Added and exported `GithubContext`, `TaskResult`, `RunInGithubOptions`, and `RunInGithubResponse` to support GitHub-backed runs (new public interfaces/types).

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller (local/CI)
    participant SDK as Traceloop SDK (Experiment)
    participant Backend as Traceloop Backend
    participant Executor as Local Executor (task runner)
    participant OpenAI as OpenAI API

    Note over Caller,SDK: Caller invokes Experiment.run(task, options)
    alt GITHUB_ACTIONS=true (GitHub-backed)
        Caller->>SDK: run(task, options)
        SDK->>SDK: getGithubContext()
        SDK->>Backend: fetch dataset rows (if needed)
        loop per dataset row
            SDK->>Executor: execute task for row
            Executor->>OpenAI: request model
            OpenAI-->>Executor: model response
            Executor-->>SDK: TaskResult (output/error)
        end
        SDK->>Backend: POST /experiment/run with results + github_context + metadata
        Backend-->>SDK: RunInGithubResponse
        SDK-->>Caller: return RunInGithubResponse
    else Local execution
        Caller->>SDK: run(task, options)
        SDK->>Executor: runLocally -> executeTasksLocally(rows)
        Executor->>OpenAI: call model per row
        Executor-->>SDK: ExperimentRunResult (taskResults, evaluations)
        SDK-->>Caller: return ExperimentRunResult
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pay attention to:
- getGithubContext parsing and PR URL/PR-number validation.
- Construction and typing of the POST payload in runInGithub and HTTP error handling.
- Mapping between local TaskResult shapes and backend request/response types.
- Call sites affected by the widened return type of run.

Possibly related PRs

feat(experiment): Add experiment capabilities #672 — Overlapping changes to Experiment APIs and GitHub-backed run behavior (interfaces and experiment flow).

Suggested reviewers

galkleinman
nirga

Poem

🐰 I hopped through code at dawn's bright glow,

Experiments now know where GitHub flows.
Rows of prompts and answers in tow,
Sent upstream where the run results go.
A tiny carrot cheer — build and go! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(exp): Add run in github experiment' directly describes the main change: adding GitHub experiment functionality to the SDK and sample app.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch nk/exp_in_github

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…to nk/exp_in_github

ellipsis-dev

Caution

Changes requested ❌

Reviewed everything up to 5573148 in 2 minutes and 45 seconds. Click for details.

Reviewed 416 lines of code in 5 files
Skipped 0 files when reviewing.
Skipped posting 7 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/sample-app/package.json:41

Draft comment:
Script 'run:github_experiment' added looks correct. Ensure that 'tsx' is installed as a dev dependency for consistency.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

2. packages/sample-app/src/sample_github_experiment.ts:55

Draft comment:
Consider adding error handling around the OpenAI API call to capture and log API-specific errors.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50% None

3. packages/sample-app/src/sample_github_experiment.ts:91

Draft comment:
Using 'if ("taskResults" in results)' to distinguish GitHub vs local responses may be brittle if response shapes change.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50% None

4. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:138

Draft comment:
In runLocally, task execution isn’t wrapped in a try/catch. If one task fails, it aborts the entire run. Consider handling individual task errors (respecting a potential stopOnError option).
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

5. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:408

Draft comment:
Tasks in executeTasksLocally are processed sequentially. Consider parallelizing execution for performance benefits on large datasets.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50% None

6. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:381

Draft comment:
getGithubContext() relies on GITHUB_REF matching a PR pattern. Ensure this assumption holds for all GitHub Action events you intend to support.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50% None

7. packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts:98

Draft comment:
Typo: Consider using the canonical capitalization "GitHub" rather than "Github" in the interface names (GitHubContext, RunInGitHubOptions, RunInGitHubResponse) for consistency with the official branding.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This is a style/naming comment about capitalization. While technically correct that "GitHub" is the official branding, this falls into the category of minor style preferences. The rules state to not make comments that are "obvious or unimportant" and to only comment if there's "clearly a code change required." Naming conventions like this are subjective - while GitHub's official branding uses capital H, many codebases use "Github" in their code. This is not a functional issue, and changing it would be a breaking change if these interfaces are already being used. The comment doesn't point to a bug or clear problem, just a stylistic preference. However, consistent naming with official branding could be considered a legitimate code quality issue, especially for public-facing APIs. If this is a new interface being added, now would be the best time to fix it before it's used elsewhere. The comment is actionable and clear about what should change. While the comment is actionable, it's still a minor stylistic preference rather than a functional issue. The rules emphasize not making "obvious or unimportant" comments. Capitalization preferences, especially when both forms are widely used in code, fall into this category. This is the type of thing that could be caught by a linter or style guide, not a manual code review. This comment should be deleted. It's a minor stylistic preference about capitalization that doesn't represent a functional issue or clear code quality problem. While technically correct about GitHub's branding, it's not important enough to warrant a review comment.

Workflow ID: wflow_qBGrssmshqMBi0FF

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)

1-19: Address Prettier formatting issues.

The CI pipeline reports formatting issues. Run prettier --write to fix code style.

🧹 Nitpick comments (5)

packages/sample-app/package.json (1)
41-41: Inconsistent script pattern compared to other run scripts.

Other scripts follow the npm run build && node dist/src/... pattern, while this new script uses npx tsx directly. This bypasses TypeScript compilation and runs the source file directly. Consider aligning with the existing pattern for consistency:
-    "run:github_experiment": "npx tsx src/sample_github_experiment.ts",
+    "run:github_experiment": "npm run build && node dist/src/sample_github_experiment.js",
Alternatively, if the intent is faster iteration during development, this is acceptable but could benefit from a comment or naming convention (e.g., dev:github_experiment).
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (1)
105-110: Consider consolidating TaskResult with the existing TaskResponse type.

TaskResult is very similar to TaskResponse (lines 33-40). The main difference is that TaskResult has optional output while TaskResponse has required output. Consider whether these can be unified or if TaskResult can extend/reuse TaskResponse with Partial<> or Pick<> to reduce duplication.
// Alternative: Derive TaskResult from TaskResponse
export type TaskResult = Omit<TaskResponse, 'output' | 'timestamp' | 'input_schema_mapping'> & {
  output?: Record<string, any>;
};
packages/sample-app/src/sample_github_experiment.ts (1)
49-75: Add defensive checks for input data.

The task function assumes row.query exists and is a string. If the dataset row lacks this field, the code will fail at runtime. Consider adding validation:
  const researchTask: ExperimentTaskFunction = async (
    row: TaskInput,
  ): Promise<TaskOutput> => {
-   console.log(`Processing question: ${row.query}`);
-   const question = row.query as string;
+   const question = row.query;
+   if (typeof question !== "string" || !question) {
+     throw new Error("Missing or invalid 'query' field in input row");
+   }
+   console.log(`Processing question: ${question}`);
This provides clearer error messages when dataset schema doesn't match expectations.
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (2)
404-434: Consider adding stopOnError support for consistency with local execution.

runLocally has access to stopOnError option (from ExperimentRunOptions), but executeTasksLocally always continues on error. Consider adding an option to halt early on failure for parity:
private async executeTasksLocally<TInput, TOutput>(
  task: ExperimentTaskFunction<TInput, TOutput>,
  rows: Record<string, any>[],
  stopOnError = false,
): Promise<TaskResult[]> {
  // ...
  for (const row of rows) {
    try {
      // ...
    } catch (error) {
      results.push({ /* ... */ });
      if (stopOnError) {
        break;
      }
    }
  }
}
479-483: Duplicated created_from: "github" metadata assignment.

created_from: "github" is set in both run() (line 87) and runInGithub() (line 482). Since run() always sets it when delegating to runInGithub(), the assignment here is redundant when called via run(). However, it's needed if runInGithub() is called directly. Consider documenting this or ensuring only one location is responsible.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b326268 and 5573148.

📒 Files selected for processing (5)

packages/sample-app/package.json (1 hunks)
packages/sample-app/src/sample_github_experiment.ts (1 hunks)
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (3 hunks)
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (1 hunks)
packages/traceloop-sdk/src/lib/node-server-sdk.ts (1 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

packages/*/package.json

📄 CodeRabbit inference engine (CLAUDE.md)

Use workspace:* for intra-repo package dependencies in package.json

Files:

packages/sample-app/package.json

packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Import AI/LLM semantic attribute constants from @traceloop/ai-semantic-conventions rather than hardcoding strings

Files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

packages/traceloop-sdk/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

packages/traceloop-sdk/**/*.{ts,tsx}: Use the provided decorators (@workflow, @task, @agent) for workflow/task/agent spans instead of re-implementing them
For manual LLM operations, use trace.withLLMSpan from @traceloop/node-server-sdk

Files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

packages/traceloop-sdk/src/lib/node-server-sdk.ts

📄 CodeRabbit inference engine (CLAUDE.md)

Register new instrumentation packages during SDK initialization

Files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts

🧠 Learnings (6)

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings

Applied to files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts
packages/sample-app/src/sample_github_experiment.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/src/lib/node-server-sdk.ts : Register new instrumentation packages during SDK initialization

Applied to files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk

Applied to files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts
packages/sample-app/src/sample_github_experiment.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : Use the provided decorators (workflow, task, agent) for workflow/task/agent spans instead of re-implementing them

Applied to files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts
packages/sample-app/src/sample_github_experiment.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/package.json : When adding a new instrumentation package, add it to the main SDK dependencies

Applied to files:

packages/traceloop-sdk/src/lib/node-server-sdk.ts

📚 Learning: 2025-08-13T08:00:42.668Z

Learnt from: galzilber
Repo: traceloop/openllmetry-js PR: 643
File: packages/traceloop-sdk/src/lib/interfaces/dataset.interface.ts:27-27
Timestamp: 2025-08-13T08:00:42.668Z
Learning: In packages/traceloop-sdk/src/lib/interfaces/dataset.interface.ts, when the user said "there is no date" in response to a comment about removing "date" from column type unions, they may have been referring to local changes not yet reflected in the PR branch, as the code still contained "date" in the union types at lines 27 and 43.

Applied to files:

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts

🧬 Code graph analysis (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (2)

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (7)

ExperimentTaskFunction (6-11)

ExperimentRunOptions (20-29)

ExperimentRunResult (42-48)

RunInGithubResponse (121-125)

GithubContext (98-103)

TaskResult (105-110)

RunInGithubOptions (112-119)

packages/traceloop-sdk/src/lib/client/dataset/row.ts (1)

data (33-35)

🪛 GitHub Actions: CI

packages/sample-app/src/sample_github_experiment.ts

[warning] 1-1: Prettier formatting issues detected. Run 'prettier --write' to fix code style.

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

[warning] 1-1: Prettier formatting issues detected. Run 'prettier --write' to fix code style.

🔇 Additional comments (6)

packages/traceloop-sdk/src/lib/node-server-sdk.ts (1)

28-31: LGTM!

The new GitHub-related interfaces are correctly exported from the SDK's public API, enabling consumers to use these types when integrating with GitHub workflows.

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (1)

98-103: LGTM!

The GithubContext, RunInGithubOptions, and RunInGithubResponse interfaces are well-structured and appropriately typed for the GitHub Actions integration.

Also applies to: 112-125
packages/sample-app/src/sample_github_experiment.ts (2)
91-102: Type narrowing for result handling is appropriate.

The check "taskResults" in results correctly discriminates between ExperimentRunResult (local) and RunInGithubResponse (GitHub). This aligns with the union return type of Experiment.run.

1-9: Address Prettier formatting issues.

The CI pipeline reports formatting issues. Run prettier --write to fix code style before merging.
⛔ Skipped due to learnings
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/src/lib/node-server-sdk.ts : Register new instrumentation packages during SDK initialization
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings
Learnt from: nina-kollman
Repo: traceloop/openllmetry-js PR: 672
File: packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:212-231
Timestamp: 2025-08-22T12:44:43.893Z
Learning: In the Traceloop SDK experiment system, JSONL datasets use a custom format where the first line contains column definitions and subsequent lines contain the actual data rows. The parseJsonlToRows method in packages/traceloop-sdk/src/lib/client/experiment/experiment.ts intentionally skips the first line (index 0) to avoid parsing the column definition as data.
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (2)

361-399: LGTM - GitHub context extraction is well-structured.

The method properly validates the GitHub Actions environment, extracts required environment variables, parses the PR number from the ref, and constructs a proper PR URL. The error messages are clear and actionable.

440-525: The runInGithub method is well-implemented overall.

The method correctly:

Generates experiment slugs with fallback

Validates the task function

Fetches dataset rows and executes tasks locally

Prepares the payload with proper snake_case field naming for the API

Handles both camelCase and snake_case response fields (lines 516-518)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

ellipsis-dev

Important

Looks good to me! 👍

Reviewed 11092cd in 2 minutes and 4 seconds. Click for details.

Reviewed 96 lines of code in 4 files
Skipped 0 files when reviewing.
Skipped posting 6 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/sample-app/package.json:41

Draft comment:
Updated 'run:github_experiment' script to compile TypeScript before execution. This change aligns with the other commands and promotes consistency.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

2. packages/sample-app/src/sample_experiment.ts:129

Draft comment:
Added conditional checks for 'taskResults' before logging experiment results to prevent runtime errors if the property is missing. Optionally, consider logging an alternative message when the expected property is absent.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

3. packages/sample-app/src/sample_github_experiment.ts:56

Draft comment:
Changed the model from 'gpt-4' to 'gpt-4o'. Verify that 'gpt-4o' is the intended model identifier to avoid potential API errors.
Reason this comment was not posted:
Comment looked like it was already resolved.

4. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:359

Draft comment:
Removed the strict GitHub Actions check in getGithubContext. Now the method only verifies that necessary GitHub environment variables are present. Ensure this behavior is intended to allow execution outside of strict CI environments.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to ensure that the behavior is intended, which violates the rule against asking for confirmation of intention. It does not provide a specific suggestion or point out a clear issue with the code.

5. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:492

Draft comment:
Updated the API endpoint to use '/v2/experiments/run-in-github'. Ensure the backend is updated to support this versioned route.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to ensure that the backend is updated to support the new versioned route. This falls under the rule of not asking the author to ensure behavior is intended or tested. Therefore, this comment should be removed.

6. packages/sample-app/src/sample_github_experiment.ts:56

Draft comment:
Typo found: The model identifier 'gpt-4o' on line 56 appears to be a typographical error. It may have been intended to be 'gpt-4'.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% The comment assumes "gpt-4o" is a typo for "gpt-4", but this is incorrect. GPT-4o (GPT-4 Omni) is a real, valid model from OpenAI that was released in 2024. The change from "gpt-4" to "gpt-4o" is clearly intentional - it's an upgrade to a newer model. The automated tool that generated this comment doesn't recognize "gpt-4o" as a valid model identifier. This is a false positive and the comment should be deleted. Could there be a scenario where the author accidentally typed "gpt-4o" when they meant "gpt-4"? Perhaps in some contexts or codebases, "gpt-4o" might not be available or supported? While it's theoretically possible the author made a mistake, the burden of proof is on the comment to show strong evidence of an error. "gpt-4o" is a well-known, widely-used model identifier, and there's no indication this is a typo. The change is clearly intentional, and the comment provides no evidence otherwise. This comment should be deleted. It incorrectly identifies "gpt-4o" as a typo when it's actually a valid OpenAI model identifier. The change from "gpt-4" to "gpt-4o" appears to be an intentional upgrade to a newer model.

Workflow ID: wflow_PTHu7ezeIr1ln65L

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

ellipsis-dev

Important

Looks good to me! 👍

Reviewed 0619959 in 1 minute and 14 seconds. Click for details.

Reviewed 80 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 6 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/sample-app/src/sample_github_experiment.ts:57

Draft comment:
Minor formatting update: The system message content is now split across multiple lines. This is purely stylistic and does not affect functionality.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

2. packages/sample-app/src/sample_github_experiment.ts:77

Draft comment:
Minor formatting change: The log statement for GitHub Actions context is reformatted into a multi-line call for improved readability.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

3. packages/sample-app/src/sample_github_experiment.ts:105

Draft comment:
Minor formatting change: The log message indicating GitHub execution results is split across multiple lines. No functional changes.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

4. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:70

Draft comment:
Documentation update: Redundant JSDoc lines about GitHub Actions context have been removed to reduce clutter. Ensure the remaining comments accurately reflect behavior.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

5. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:84

Draft comment:
Minor formatting improvement: The conditional assignment for experimentMetadata is now split across multiple lines for better readability without changing functionality.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

6. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:506

Draft comment:
Minor formatting update: The returned experimentSlug assignment is split over multiple lines for consistency and readability.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_LTjT9pQ39G8WqEH4

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)
79-93: Logic bugs remain in GitHub Actions branch delegation.

The issues flagged in previous reviews are still present:

Line 81: Empty string fallback options.datasetSlug || "" will pass an empty string to runInGithub, which eventually calls getDatasetRows() that throws "Dataset slug is required". Validate upfront.

Lines 85-87: The condition checks options.relatedRef but then spreads options.aux. This is inverted logic—either check options.aux or ensure consistency.
   if (process.env.GITHUB_ACTIONS === "true") {
+    if (!options.datasetSlug) {
+      throw new Error("datasetSlug is required when running in GitHub Actions");
+    }
     return await this.runInGithub(task, {
-      datasetSlug: options.datasetSlug || "",
+      datasetSlug: options.datasetSlug,
       datasetVersion: options.datasetVersion,
       evaluators: options.evaluators,
       experimentSlug: options.experimentSlug,
-      experimentMetadata: options.relatedRef
-        ? { ...options.aux, created_from: "github" }
-        : { created_from: "github" },
+      experimentMetadata: { ...options.aux, created_from: "github" },
       experimentRunMetadata: {
         ...(options.relatedRef && { related_ref: options.relatedRef }),
         ...(options.aux && { aux: options.aux }),
       },
     });
   }

🧹 Nitpick comments (2)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)
499-505: Redundant response check loses detailed error information.

The check at lines 499-502 is redundant because handleResponse() at line 505 already validates response.ok and provides better error handling by parsing the JSON error body. The current sequence throws a generic message before handleResponse() can extract the API's detailed error.

Either remove the redundant check and let handleResponse() handle errors, or if you need custom handling here, parse the error body:
-    if (!response.ok) {
-      throw new Error(
-        `Failed to submit GitHub experiment: ${response.status} ${response.statusText}`,
-      );
-    }
-
     const data = await this.handleResponse(response);
packages/sample-app/src/sample_github_experiment.ts (1)
49-76: Consider defensive handling for input type.

Line 53 uses an unsafe type assertion row.query as string. If the dataset contains non-string query values, this could produce unexpected behavior.

For more robust samples that users might copy:
-  const question = row.query as string;
+  const question = String(row.query ?? "");

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 11092cd and 0619959.

📒 Files selected for processing (2)

packages/sample-app/src/sample_github_experiment.ts (1 hunks)
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (3 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Import AI/LLM semantic attribute constants from @traceloop/ai-semantic-conventions rather than hardcoding strings

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

packages/traceloop-sdk/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

packages/traceloop-sdk/**/*.{ts,tsx}: Use the provided decorators (@workflow, @task, @agent) for workflow/task/agent spans instead of re-implementing them
For manual LLM operations, use trace.withLLMSpan from @traceloop/node-server-sdk

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

🧠 Learnings (3)

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings

Applied to files:

packages/sample-app/src/sample_github_experiment.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk

Applied to files:

packages/sample-app/src/sample_github_experiment.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : Use the provided decorators (workflow, task, agent) for workflow/task/agent spans instead of re-implementing them

Applied to files:

packages/sample-app/src/sample_github_experiment.ts

🔇 Additional comments (6)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (3)

98-104: LGTM!

Clean extraction of local execution logic into a dedicated private method.

358-391: LGTM!

Well-structured context extraction with proper validation and clear error messages. The restriction to PR events is appropriately enforced with a descriptive error.

393-426: LGTM!

Proper per-row error isolation allows partial experiment completion. Error handling is consistent and preserves metadata for debugging.

packages/sample-app/src/sample_github_experiment.ts (3)

11-35: LGTM!

Solid initialization pattern with proper error handling for SDK init failure and client validation.

78-121: LGTM!

Clean execution flow with proper type discrimination between local and GitHub execution results. Comprehensive error handling includes stack traces for debugging.

124-128: LGTM!

Standard top-level error handler for async entry point.

ellipsis-dev

Important

Looks good to me! 👍

Reviewed d3573c7 in 1 minute and 16 seconds. Click for details.

Reviewed 13 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 2 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/sample-app/src/sample_github_experiment.ts:53

Draft comment:
Added semicolon for style consistency.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions. It simply states that a semicolon was added for style consistency, which does not require any action from the PR author.

2. packages/sample-app/src/sample_github_experiment.ts:56

Draft comment:
Typo detected: The model name 'gpt-4o' appears to be misspelled. Did you mean 'gpt-4'?
Reason this comment was not posted:
Comment was on unchanged code.

Workflow ID: wflow_iV7xkfrkTZQy1tmG

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (1)

98-127: Differentiate TaskResponse vs TaskResult to avoid confusion

You now have two very similar per-row types (TaskResponse for local runs and TaskResult for GitHub runs). The shapes are close enough that it’s easy for consumers (and future maintainers) to mix them up. Consider either:

Adding short doc comments explaining when each is used (SDK-local vs GitHub run payload), or

Extracting a shared base type and extending it for each use-case, so the naming difference is enforced by the type hierarchy instead of being purely conventional.
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (2)
381-414: Per-row error capture in executeTasksLocally is good; consider richer error metadata

The per-row try/catch in executeTasksLocally is a nice improvement for the GitHub path, ensuring one bad row doesn’t abort the entire run and exposing error alongside the original input.

You might consider also including error.name (or a simple classification) in metadata so the backend / UI can distinguish different failure types without parsing the message:
metadata: {
  rowId: row.id,
  timestamp: Date.now(),
  errorType: error instanceof Error ? error.name : "UnknownError",
}
420-502: Reuse validateRunOptions for GitHub runs to keep evaluator validation consistent

In runLocally you call validateRunOptions to enforce that the task is a function and evaluators have valid names/versions. In runInGithub, you recheck only that task is a function and then map evaluators directly to slugs without any structural validation.

For consistency and earlier feedback on invalid evaluator configs, consider reusing the existing validator here:
// Near the top of runInGithub, before getGithubContext():
this.validateRunOptions(task, {
  datasetSlug,
  datasetVersion,
  evaluators,
  experimentSlug,
});
This would let you drop the manual task typeof check and ensure both local and GitHub paths reject malformed evaluator entries in the same way.
packages/sample-app/src/sample_github_experiment.ts (1)
125-128: Make top-level error logging robust to non-Error throws

Everywhere else you defensively format errors with error instanceof Error ? error.message : String(error), but the top-level main().catch assumes an Error:
main().catch((error) => {
  console.error("💥 Application failed:", error.message);
  process.exit(1);
});
To avoid logging undefined if something throws a non-Error (string, etc.), mirror the earlier pattern:
-main().catch((error) => {
-  console.error("💥 Application failed:", error.message);
-  process.exit(1);
-});
+main().catch((error) => {
+  console.error(
+    "💥 Application failed:",
+    error instanceof Error ? error.message : String(error),
+  );
+  process.exit(1);
+});

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0619959 and d3573c7.

📒 Files selected for processing (3)

packages/sample-app/src/sample_github_experiment.ts (1 hunks)
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (3 hunks)
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Import AI/LLM semantic attribute constants from @traceloop/ai-semantic-conventions rather than hardcoding strings

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts

packages/traceloop-sdk/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

packages/traceloop-sdk/**/*.{ts,tsx}: Use the provided decorators (@workflow, @task, @agent) for workflow/task/agent spans instead of re-implementing them
For manual LLM operations, use trace.withLLMSpan from @traceloop/node-server-sdk

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts

🧠 Learnings (4)

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings

Applied to files:

packages/sample-app/src/sample_github_experiment.ts
packages/traceloop-sdk/src/lib/client/experiment/experiment.ts
packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk

Applied to files:

packages/sample-app/src/sample_github_experiment.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : Use the provided decorators (workflow, task, agent) for workflow/task/agent spans instead of re-implementing them

Applied to files:

packages/sample-app/src/sample_github_experiment.ts

📚 Learning: 2025-08-13T08:00:42.668Z

Learnt from: galzilber
Repo: traceloop/openllmetry-js PR: 643
File: packages/traceloop-sdk/src/lib/interfaces/dataset.interface.ts:27-27
Timestamp: 2025-08-13T08:00:42.668Z
Learning: In packages/traceloop-sdk/src/lib/interfaces/dataset.interface.ts, when the user said "there is no date" in response to a comment about removing "date" from column type unions, they may have been referring to local changes not yet reflected in the PR branch, as the code still contained "date" in the union types at lines 27 and 43.

Applied to files:

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts

🧬 Code graph analysis (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (7)

ExperimentTaskFunction (6-11)

ExperimentRunOptions (20-29)

RunInGithubOptions (112-121)

ExperimentRunResult (42-48)

RunInGithubResponse (123-127)

GithubContext (98-103)

TaskResult (105-110)

🔇 Additional comments (3)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (3)
75-84: run now returns a union type; ensure callers narrow appropriately

The new signature
async run<TInput, TOutput>(
  task: ExperimentTaskFunction<TInput, TOutput>,
  options: ExperimentRunOptions | RunInGithubOptions = {},
): Promise<ExperimentRunResult | RunInGithubResponse>
changes the public API from a single result type to a union. That’s reasonable given the auto GitHub dispatch, but any existing callers that assumed ExperimentRunResult will now need to narrow the type (e.g. 'taskResults' in result) similar to the new GitHub sample.

It’s worth double‑checking downstream usages and updating docs/examples to reflect this unioned return, so consumers aren’t surprised by the new shape when running under GitHub Actions.

86-182: Local path extraction into runLocally looks correct

Pulling the non‑GitHub logic into runLocally preserves the previous behavior: same experiment initialization, dataset loading, per‑row task execution, evaluator invocation, and aggregated result shape. The separation makes the GitHub path easier to reason about without changing local semantics.

346-379: GitHub context extraction and PR validation are solid

getGithubContext correctly validates all required env vars (repository, ref, sha, actor) and enforces that the ref corresponds to a pull request (refs/pull/<n>/...), producing clear error messages when assumptions aren’t met. This should fail fast in misconfigured workflows while giving enough information to debug.

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

traceloop · 2025-11-27T13:52:25Z

🧪 Traceloop Experiment Results

Experiment: research-ts
Tasks Evaluated: 3
Related Commit: 1234567
Last Updated Comment: 2025-11-27 13:52:23 UTC

Evaluators Results

Evaluator	Type Distribution	Success Rate	Avg Score
Research relevancy	-	3/3 (100.0%)	-
Research word counter	-	-	145.00
categories	Calture: 1, Tech medic: 2	-	-

View evaluation results →
View dataset →

ellipsis-dev

Important

Looks good to me! 👍

Reviewed 9db388d in 45 seconds. Click for details.

Reviewed 75 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 2 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:388

Draft comment:
Using Promise.all improves concurrency but ensure tasks are safe to run in parallel and consider throttling for large datasets.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

2. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:488

Draft comment:
Removing explicit mapping of response fields relies on transformApiResponse. Ensure it always provides the expected keys (e.g. experimentSlug fallback) to avoid breaking consumers.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_8Z2ebIW5B8PxSVM2

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

ellipsis-dev

Important

Looks good to me! 👍

Reviewed b4a5572 in 51 seconds. Click for details.

Reviewed 13 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:407

Draft comment:
The change adds a trailing comma after the arrow function block. This is stylistic and aligns with common formatting practices. Ensure it conforms with the project's lint/style guidelines.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_MsnkOpJob6XIw1Ci

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d3573c7 and 9db388d.

📒 Files selected for processing (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (3 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Import AI/LLM semantic attribute constants from @traceloop/ai-semantic-conventions rather than hardcoding strings

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

packages/traceloop-sdk/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

packages/traceloop-sdk/**/*.{ts,tsx}: Use the provided decorators (@workflow, @task, @agent) for workflow/task/agent spans instead of re-implementing them
For manual LLM operations, use trace.withLLMSpan from @traceloop/node-server-sdk

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

🧠 Learnings (1)

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings

Applied to files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

🧬 Code graph analysis (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (7)

ExperimentTaskFunction (6-11)

ExperimentRunOptions (20-29)

RunInGithubOptions (112-121)

ExperimentRunResult (42-48)

RunInGithubResponse (123-127)

GithubContext (98-103)

TaskResult (105-110)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and test

🔇 Additional comments (4)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (4)
15-18: LGTM! New imports support GitHub Actions integration.

The imported types (GithubContext, TaskResult, RunInGithubOptions, RunInGithubResponse) are properly used throughout the new GitHub Actions functionality.

346-379: Implementation is sound; ensure environment check is consistent with caller.

The getGithubContext() implementation correctly:

Validates all required GitHub environment variables

Extracts PR number with appropriate regex

Restricts usage to PR events with clear error messages

However, ensure the environment detection in run() (line 79) is consistent with the expectations here. If run() only checks GITHUB_ACTIONS === "true" but this method requires additional variables like GITHUB_REPOSITORY, GITHUB_REF, etc., the error will occur late in execution rather than at the routing decision point.

381-412: Excellent implementation using Promise.all and map!

This method addresses previous feedback perfectly:

Uses Promise.all with map for efficient parallel task execution (as requested in past review)

Proper error handling captures individual task failures without stopping the entire run

Clear result structure with either output or error per task

The implementation is clean and type-safe.

Based on past review feedback requesting this pattern.

414-495: Well-structured implementation that addresses previous concerns.

The runInGithub() method correctly:

Validates task function upfront

Executes tasks locally and captures results

Properly merges experimentMetadata (lines 455-458) - addressing ellipsis-dev's feedback about it being ignored

Correctly handles both relatedRef and aux conditionally (lines 460-464) - fixing the previous logic bug mentioned by coderabbitai

Constructs complete github_context payload with all required fields

Provides comprehensive error wrapping

The metadata construction at lines 455-464 is now correct:
const mergedExperimentMetadata = {
  ...(experimentMetadata || {}),
  created_from: "github",
};

const mergedExperimentRunMetadata = {
  ...(experimentRunMetadata || {}),
  ...(relatedRef && { related_ref: relatedRef }),
  ...(aux && { aux: aux }),
};
This properly spreads all provided metadata before adding GitHub-specific fields.

Based on past review feedback that has been addressed.

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)
74-84: Unsafe cast to RunInGithubOptions and lack of upfront datasetSlug validation in GitHub path.

options is typed as ExperimentRunOptions | RunInGithubOptions, but datasetSlug is optional on ExperimentRunOptions. The cast:
return await this.runInGithub(task, options as RunInGithubOptions);
bypasses that and lets runInGithub assume a non-empty datasetSlug, even though callers of run are not forced to provide it. At runtime getDatasetRows will still throw, but the error is late and the type contract of runInGithub is effectively weakened.

Consider validating datasetSlug before delegating, and only then asserting the type, e.g.:
if (process.env.GITHUB_ACTIONS === "true") {
  if (!("datasetSlug" in options) || !options.datasetSlug) {
    throw new Error("datasetSlug is required when running in GitHub Actions");
  }
  return await this.runInGithub(task, options as RunInGithubOptions);
}
This removes the unsound cast and gives a clearer error specific to the GitHub flow.

🧹 Nitpick comments (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)
418-495: GitHub experiment flow is well-structured; consider reusing evaluator validation.

The runInGithub flow is coherent: it validates the task, pulls GitHub context, loads dataset rows, executes tasks locally, builds evaluator slugs, merges experimentMetadata (now correctly preserving caller metadata and adding created_from: "github"), folds relatedRef/aux into run metadata, and posts a single payload to the backend. That addresses the prior metadata-construction concern.

One improvement: evaluator validation here diverges from the local path. validateRunOptions already enforces non-empty evaluator names/versions, but runInGithub currently just maps evaluators without checks. To keep behavior consistent and avoid malformed evaluator definitions getting through only on GitHub runs, you could reuse the existing validator:
if (!task || typeof task !== "function") {
  throw new Error("Task function is required and must be a function");
}

this.validateRunOptions(task, { evaluators });
This keeps all evaluator-related rules in one place.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9db388d and b4a5572.

📒 Files selected for processing (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (3 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Import AI/LLM semantic attribute constants from @traceloop/ai-semantic-conventions rather than hardcoding strings

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

packages/traceloop-sdk/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

packages/traceloop-sdk/**/*.{ts,tsx}: Use the provided decorators (@workflow, @task, @agent) for workflow/task/agent spans instead of re-implementing them
For manual LLM operations, use trace.withLLMSpan from @traceloop/node-server-sdk

Files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

🧠 Learnings (1)

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings

Applied to files:

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts

🧬 Code graph analysis (1)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (1)

packages/traceloop-sdk/src/lib/interfaces/experiment.interface.ts (7)

ExperimentTaskFunction (6-11)

ExperimentRunOptions (20-29)

RunInGithubOptions (112-121)

ExperimentRunResult (42-48)

RunInGithubResponse (123-127)

GithubContext (98-103)

TaskResult (105-110)

🔇 Additional comments (4)

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts (4)

15-18: New GitHub-related types wired correctly into client.

Importing GithubContext, TaskResult, RunInGithubOptions, and RunInGithubResponse from the shared interfaces keeps the client aligned with the central type definitions; no issues here.

89-182: Local execution refactor into runLocally preserves existing behavior.

The extraction of the previous run logic into runLocally looks behaviorally equivalent: experiment slug generation, dataset row loading, per-row task execution, evaluator invocation, and result shaping are unchanged and still guarded by validateRunOptions. No new issues spotted.

346-379: GitHub context extraction is PR-only; ensure this matches the intended usage.

getGithubContext correctly derives repository, commit SHA, actor, and PR URL from GITHUB_* env vars, but it hard-fails when GITHUB_REF doesn’t match refs/pull/<number>/…. That means runInGithub will currently only work for pull-request-triggered workflows, not push or workflow_dispatch runs on branches.

If that’s intentional, it’s fine, but it should be documented and you may want a more targeted error message or pre-check at call sites to avoid surprises when workflows run on non-PR events.

384-412: Parallel task execution with Promise.all is a good improvement; behavior is clear.

executeTasksLocally runs the task for all rows in parallel and returns structured TaskResult entries, capturing errors per row without throwing. This aligns well with the GitHub use case and is cleanly separated from the local ExperimentRunResult path.

nina-kollman added 2 commits November 26, 2025 14:31

added

789026f

Merge branch 'main' of https://github.com/traceloop/openllmetry-js in…

3f23a68

…to nk/exp_in_github

nina-kollman added 3 commits November 27, 2025 12:15

real setting

6f9a2e4

setting

8b5c4be

Merge branch 'main' of https://github.com/traceloop/openllmetry-js in…

5573148

…to nk/exp_in_github

nina-kollman marked this pull request as ready for review November 27, 2025 10:32

ellipsis-dev bot reviewed Nov 27, 2025

View reviewed changes

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts Outdated Show resolved Hide resolved

coderabbitai bot reviewed Nov 27, 2025

View reviewed changes

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts Show resolved Hide resolved

works

11092cd

ellipsis-dev bot reviewed Nov 27, 2025

View reviewed changes

pretty

0619959

ellipsis-dev bot reviewed Nov 27, 2025

View reviewed changes

coderabbitai bot reviewed Nov 27, 2025

View reviewed changes

nina-kollman added 2 commits November 27, 2025 14:39

pr comments

6a12586

pretty

d3573c7

ellipsis-dev bot reviewed Nov 27, 2025

View reviewed changes

coderabbitai bot reviewed Nov 27, 2025

View reviewed changes

galzilber reviewed Nov 27, 2025

View reviewed changes

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts Outdated Show resolved Hide resolved

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts Outdated Show resolved Hide resolved

galzilber approved these changes Nov 27, 2025

View reviewed changes

pr comment

9db388d

ellipsis-dev bot reviewed Nov 27, 2025

View reviewed changes

pretty

b4a5572

ellipsis-dev bot reviewed Nov 27, 2025

View reviewed changes

coderabbitai bot reviewed Nov 27, 2025

View reviewed changes

packages/traceloop-sdk/src/lib/client/experiment/experiment.ts Show resolved Hide resolved

coderabbitai bot reviewed Nov 27, 2025

View reviewed changes

nina-kollman merged commit 7905242 into main Nov 27, 2025
8 checks passed

fix(exp): Add run in github experiment #837

fix(exp): Add run in github experiment #837

Uh oh!

Conversation

nina-kollman commented Nov 26, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

traceloop bot commented Nov 27, 2025

🧪 Traceloop Experiment Results

Evaluators Results

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nina-kollman commented Nov 26, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 26, 2025 •

edited

Loading