You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Status: DRAFT. Rough guesses. This issue needs review for accuracy.
Reviewed open issues. I'd like to propose the following three as highest-impact, lowest-effort fixes. Posting here to give the community a chance to weigh in before anyone picks them up.
Estimates below are three-point: optimistic / realistic / pessimistic. Pessimistic is the "would actually surprise me if it blew past this" ceiling, optimistic is technically achievable if everything lines up, realistic is the likely outcome.
Status: Fix already landed. scoring/src/scoring/constants.py:602-622 defines isCollaborativeNoteKey and includes it in noteTSVColumnsAndTypes with pd.Int8Dtype(). The reporter was on commit d2f2ea3 which predates the fix.
Action (close as fixed): O 5 min / R 15 min / P 45 min
Optimistic: reproduce on current main with a 24-column TSV to confirm parse succeeds, then close with a pointer to the commit.
Realistic: confirm + write a short regression test on the schema length against the constants table.
Pessimistic: reporter's dataset exposes a second schema drift (e.g. isCollaborativeNote dtype edge cases with NA), requiring a small dtype fix.
#306, cuda:0 tensor → numpy TypeError in normalized_loss.py
Action (close as fixed): O 5 min / R 15 min / P 1 hr
Optimistic: confirm on main, close with commit reference.
Realistic: grep for any remaining .numpy() calls on potentially-CUDA tensors in the MF module; close with confidence.
Pessimistic: find at least one other unguarded .numpy() path (there are several tensor ops in matrix_factorization/) that needs the same guard.
Still actionable
#345, pandas int64 vs Int64 merge mismatch in pflip_plus_model.py:348
Status: Reproducible on main. _compute_scoring_cutoff casts cutoffByRatings to pd.Int64Dtype() (nullable) via .astype(pd.Int64Dtype()) at lines ~336 and ~342, then merges against scoringCutoff whose noteId is plain int64 (numpy). This repo enforces column dtypes via a custom pandas_utils wrapper (Type expectation mismatch on noteId: found=Int64 expected=int64), so the fix must align with the canonical registered dtype, not just paper over with a local cast.
Fix shape: cast cutoffByRatings[c.noteIdKey] back to np.int64 before the merge (canonical noteId dtype in the repo is int64, based on the type-checker's "expected=int64" message). Apply the same pattern to the "ratingMin" column if it carries through a type expectation.
Estimate: O 20 min / R 1 hr / P 4 hrs
Optimistic (20 min): one-line .astype(np.int64) on the noteId column of cutoffByRatings before .merge, verified against the reporter's stack trace; no other call sites affected.
Realistic (~1 hr): reproduce locally, trace the dtype through _compute_scoring_cutoff and the surrounding fit/_prepare_note_info flow to confirm the canonical dtype choice, apply the cast, run the existing test suite, add a small regression test that exercises the merge.
Pessimistic (4 hrs): the Int64Dtype() cast on the whole cutoffByRatings frame was intentional (e.g. nullable semantics for "nthRatingMts" when minRatings exceeds the group size), so narrowing just the noteId column cascades into downstream dtype mismatches on "ratingMin" or _SCORING_CUTOFF_MTS. Resolution requires coordinating with maintainers on the canonical dtype registered in pandas_utils and potentially updating a handful of adjacent call sites.
Honorable mentions (considered but skipped, not repo code fixes)
Add author id to notes export #265 add author id to notes export, valid feature request, but requires data-pipeline changes outside this repo's scoring code.
For CONCAT ERROR(S) when running CN algorithm #345: confirmation on the canonical noteId dtype (looks like int64 from the type-checker message), should the fix narrow cutoffByRatings to int64, or widen the registered expectation to Int64?
Status: DRAFT. Rough guesses. This issue needs review for accuracy.
Reviewed open issues. I'd like to propose the following three as highest-impact, lowest-effort fixes. Posting here to give the community a chance to weigh in before anyone picks them up.
Estimates below are three-point: optimistic / realistic / pessimistic. Pessimistic is the "would actually surprise me if it blew past this" ceiling, optimistic is technically achievable if everything lines up, realistic is the likely outcome.
Already fixed on
main, recommend closing#396,
tsv_parser()23 vs 24 columnsscoring/src/scoring/constants.py:602-622definesisCollaborativeNoteKeyand includes it innoteTSVColumnsAndTypeswithpd.Int8Dtype(). The reporter was on commitd2f2ea3which predates the fix.mainwith a 24-column TSV to confirm parse succeeds, then close with a pointer to the commit.isCollaborativeNotedtype edge cases withNA), requiring a small dtype fix.#306,
cuda:0tensor → numpyTypeErrorinnormalized_loss.pyscoring/src/scoring/matrix_factorization/normalized_loss.py:108already readstargets.cpu().numpy()..numpy()calls on potentially-CUDA tensors in the MF module; close with confidence..numpy()path (there are several tensor ops inmatrix_factorization/) that needs the same guard.Still actionable
#345, pandas
int64vsInt64merge mismatch inpflip_plus_model.py:348main._compute_scoring_cutoffcastscutoffByRatingstopd.Int64Dtype()(nullable) via.astype(pd.Int64Dtype())at lines ~336 and ~342, then merges againstscoringCutoffwhosenoteIdis plainint64(numpy). This repo enforces column dtypes via a custompandas_utilswrapper (Type expectation mismatch on noteId: found=Int64 expected=int64), so the fix must align with the canonical registered dtype, not just paper over with a local cast.cutoffByRatings[c.noteIdKey]back tonp.int64before the merge (canonicalnoteIddtype in the repo isint64, based on the type-checker's "expected=int64" message). Apply the same pattern to the"ratingMin"column if it carries through a type expectation..astype(np.int64)on the noteId column ofcutoffByRatingsbefore.merge, verified against the reporter's stack trace; no other call sites affected._compute_scoring_cutoffand the surroundingfit/_prepare_note_infoflow to confirm the canonical dtype choice, apply the cast, run the existing test suite, add a small regression test that exercises the merge.Int64Dtype()cast on the wholecutoffByRatingsframe was intentional (e.g. nullable semantics for"nthRatingMts"whenminRatingsexceeds the group size), so narrowing just the noteId column cascades into downstream dtype mismatches on"ratingMin"or_SCORING_CUTOFF_MTS. Resolution requires coordinating with maintainers on the canonical dtype registered inpandas_utilsand potentially updating a handful of adjacent call sites.Honorable mentions (considered but skipped, not repo code fixes)
content-lengthheader on dataset downloads, server-side (ton.twimg.com), not fixable in this repo.Feedback welcome
main?noteIddtype (looks likeint64from the type-checker message), should the fix narrowcutoffByRatingstoint64, or widen the registered expectation toInt64?cc the reporters: @2vitalik, @Jacobsonradical, @SarahGrevy