
Correct allowed IMM values for cvtps_ph#2147

Merged: folkertdev merged 1 commit into rust-lang:main from sayantn:aliases on Jun 1, 2026

@sayantn (Contributor) commented Jun 1, 2026

Related: #t-libs/stdarch > Documentation of _mm256_cvtps_ph seems incorrect

This corrects all of the cvtps_ph functions: almost all of them had wrong documentation, and all of them had wrong checks on the rounding immediate (IMM). It also updates intrinsic-test with the correct values.
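For reference, the Intel SDM describes the immediate of the F16C conversion instruction VCVTPS2PH as follows: bits 1:0 select the rounding mode, bit 2 selects MXCSR.RC instead, and bits 7:3 are ignored. The sketch below decodes an imm8 under that reading. It only illustrates the encoding and is not the check stdarch uses; `Rounding` and `decode_cvtps_ph_imm` are hypothetical names, and the `FROUND_*` constants are local copies of the numeric values of the `_MM_FROUND_*` constants.

```rust
// Local copies of the `_MM_FROUND_*` values from `core::arch::x86_64`,
// kept here so the sketch builds on any target.
pub const FROUND_TO_NEAREST_INT: i32 = 0x00;
pub const FROUND_TO_NEG_INF: i32 = 0x01;
pub const FROUND_TO_POS_INF: i32 = 0x02;
pub const FROUND_TO_ZERO: i32 = 0x03;
pub const FROUND_CUR_DIRECTION: i32 = 0x04;
pub const FROUND_NO_EXC: i32 = 0x08;

/// Rounding behaviour selected by a VCVTPS2PH imm8 (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rounding {
    NearestEven,
    Down,
    Up,
    TowardZero,
    /// Use whatever rounding mode MXCSR.RC currently holds.
    Mxcsr,
}

/// Decodes an imm8 per the SDM's immediate table for the F16C conversions.
/// Returns `None` if the value does not fit in an 8-bit immediate.
pub fn decode_cvtps_ph_imm(imm: i32) -> Option<Rounding> {
    if !(0..=0xFF).contains(&imm) {
        return None;
    }
    // Bit 2 (rounding select): 1 means "use MXCSR.RC"; bits 1:0 are then ignored.
    if imm & 0b100 != 0 {
        return Some(Rounding::Mxcsr);
    }
    // Bits 1:0 pick the rounding mode; bits 7:3 (including the
    // _MM_FROUND_NO_EXC bit, 0x08) do not change the result.
    Some(match imm & 0b11 {
        0b00 => Rounding::NearestEven,
        0b01 => Rounding::Down,
        0b10 => Rounding::Up,
        _ => Rounding::TowardZero,
    })
}
```

Under this reading, for example, `FROUND_TO_NEAREST_INT` (0x00) and `FROUND_TO_NEAREST_INT | FROUND_NO_EXC` (0x08) select the same rounding; which values the Rust API should accept is decided in the PR itself and is not reproduced here.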

r? @folkertdev

@sayantn (Contributor, Author) commented Jun 1, 2026

There is a ~2 min increase in CI time, because we are now testing all possible values of _MM_PERM, not just those up to 32.
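For scale: an _MM_PERM_ENUM value packs four 2-bit lane selectors (A to D) into one byte, from _MM_PERM_AAAA = 0x00 to _MM_PERM_DDDD = 0xFF, so there are 256 values rather than 32. The sketch below uses a hypothetical `perm_name` helper to spell out each byte the way the constant names do (the first letter is the most significant selector, so _MM_PERM_ABCD = 0x1B). It is not intrinsic-test's code.

```rust
/// Hypothetical helper: builds the `_MM_PERM_*` constant name for a byte.
/// Each 2-bit field selects lane A, B, C or D; the most significant field
/// comes first in the name, matching e.g. `_MM_PERM_ABCD == 0x1B`.
pub fn perm_name(value: u8) -> String {
    const LETTERS: [char; 4] = ['A', 'B', 'C', 'D'];
    let mut name = String::from("_MM_PERM_");
    // Walk the fields from bits 7:6 down to bits 1:0.
    for shift in [6u32, 4, 2, 0] {
        name.push(LETTERS[((value >> shift) & 0b11) as usize]);
    }
    name
}

/// Every possible `_MM_PERM_ENUM` value, i.e. all 256 bytes.
pub fn all_perm_values() -> Vec<u8> {
    (0..=u8::MAX).collect()
}
```

Testing the full range means 224 extra values per affected test compared with stopping at 32.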

@sayantn marked this pull request as ready for review June 1, 2026 21:00
@folkertdev (Contributor)

Hmm, we should start sharding these really...

@folkertdev added this pull request to the merge queue Jun 1, 2026
Merged via the queue into rust-lang:main with commit f722891 Jun 1, 2026
82 checks passed
@sayantn (Contributor, Author) commented Jun 2, 2026

> Hmm, we should start sharding these really...

What do you mean exactly? I am pretty free for the next couple of days.

@folkertdev (Contributor)

I mean somehow split the workload so that the total wall time is smaller.

Maybe just using cargo nextest would work. CI machines usually have 4 cores, but each individual test runs slower under nextest (it runs every test in its own process), and we have many small tests, so it's not as clear a win as it first seems. Still worth a try, perhaps.

Alternatively, we could split e.g. the _mm, _mm256 and _mm512 tests into separate CI jobs, still using cargo test?
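A minimal sketch of the prefix split, assuming tests can be bucketed by the intrinsic name they contain. `Shard`, `shard_for` and `partition` are hypothetical names, not existing stdarch code. Each bucket could then become its own CI job running cargo test with a name filter, since cargo test only runs tests whose names contain the filter string. For the nextest route, cargo nextest run also has a built-in --partition option (e.g. count:1/3) that splits tests across jobs without a hand-written rule.

```rust
/// Hypothetical CI shards, one per vector width.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Shard {
    Mm128,
    Mm256,
    Mm512,
    /// Anything without a recognised prefix (e.g. scalar-only tests).
    Other,
}

/// Picks a shard by looking for an intrinsic prefix in a test name.
/// The wider prefixes are checked first, so a name only lands in the
/// 128-bit shard if it mentions neither of them.
pub fn shard_for(test_name: &str) -> Shard {
    if test_name.contains("_mm512_") {
        Shard::Mm512
    } else if test_name.contains("_mm256_") {
        Shard::Mm256
    } else if test_name.contains("_mm_") {
        Shard::Mm128
    } else {
        Shard::Other
    }
}

/// Groups test names into shards, in a fixed shard order.
pub fn partition<'a>(names: &[&'a str]) -> Vec<(Shard, Vec<&'a str>)> {
    let shards = [Shard::Mm128, Shard::Mm256, Shard::Mm512, Shard::Other];
    shards
        .iter()
        .map(|&s| {
            let members = names
                .iter()
                .copied()
                .filter(|n| shard_for(n) == s)
                .collect::<Vec<_>>();
            (s, members)
        })
        .collect()
}
```

The total wall time then becomes roughly that of the slowest shard, so in practice the buckets would need rebalancing if, say, the _mm512 shard dominates.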

@sayantn (Contributor, Author) commented Jun 2, 2026

The increase comes from intrinsic-test, because SDE (Intel's Software Development Emulator) is hopelessly slow. We could probably make intrinsic-test emit the tests in two batches: one with SSE, AVX and whatever else QEMU can handle, and another with AVX-512. Then we could use QEMU for the first batch, although I'm not sure how beneficial that would be.
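The two-batch idea could look roughly like this, assuming each generated test knows its required target features. `Emulator`, `emulator_for`, `split_batches` and the rule "any avx512* feature goes to SDE, everything else to QEMU" are assumptions for illustration; which features QEMU can actually emulate would need to be checked before relying on such a rule.

```rust
/// Which emulator a generated test batch would run under (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Emulator {
    Qemu,
    Sde,
}

/// Assumed rule: anything needing an AVX-512 feature goes to SDE,
/// everything else (SSE, AVX, ...) goes to QEMU.
pub fn emulator_for(features: &[&str]) -> Emulator {
    if features.iter().any(|f| f.starts_with("avx512")) {
        Emulator::Sde
    } else {
        Emulator::Qemu
    }
}

/// Splits (intrinsic name, required features) pairs into the
/// QEMU batch and the SDE batch, preserving input order.
pub fn split_batches<'a>(tests: &[(&'a str, &[&str])]) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut qemu = Vec::new();
    let mut sde = Vec::new();
    for &(name, features) in tests {
        match emulator_for(features) {
            Emulator::Qemu => qemu.push(name),
            Emulator::Sde => sde.push(name),
        }
    }
    (qemu, sde)
}
```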

Also, interestingly, the dev-profile tests are slower: clang is the fastest (~18 min), with gcc and icx similar (~21 min). The release-profile tests all take ~15 min.
