builtin: Make all string.index methods return ?int (second try) #13758

txgruppi · 2022-03-17T12:30:16Z

The string type has come index methods like (index, index_after, etc) and some return int and others return ?int.

This MR makes all of them return ?int.

The tests (v test-all) have beem updated to work with the changes but there are some tests that were skipped in my machine.

All string index methods that end with _opt return ?int with none as the value when the substring is not found
and the index methods that do not end in _opt return int with -1 as the returned value when the substring is not found

This PR is the second iteration on the idea. Origin PR here: #13693

It was easier to dump the last changes and redo them from the latest weekly tag.

VLS PR: vlang/vls#329

All string index methods that end with `_opt` return `?int` with none as the value when the substring is not found and the index methods that do not end in `_opt` return `int` with -1 as the returned value when the substring is not found

txgruppi · 2022-03-17T12:30:51Z

@JalonSolov @yuyi98 @spytheman

I think this one will make everyone happy.

cmd/tools/vcheck-md.v

dumblob · 2022-03-17T22:09:04Z

I for myself am really not a fan of duplicated methods. So having second variants (those ending with _opt) of certain methods sounds bad.

Also this PR doesn't endorse safe-by-default behavior - i.e. the syntactically shorter variant shall be the one returning int | none and the syntactically longer variant shall return int.

All in all what I'd be interested in the most are the (rudimentary) benchmarks @spytheman asked for in the previous PR discussion.

txgruppi · 2022-03-19T04:20:30Z

The benchmarks.

Int version: Summary for done: 10000 passed, 10000 total. Runtime: 5459 ms.
Opt version: Summary for done: 10000 passed, 10000 total. Runtime: 5130 ms.

Each code was built using v -prod <source> and executed ./<binary>.

To run this I generated an html file with lorem ipsum text where every word
with 5 letters and no symbol I updated to <a>. The script's job was to extract
all the url and text from the html file.

First I made them println('$url $text') and compared the output with a Node.js
script I created to do the same job. All 3 outputs (v int, v opt, node) are
identical.

For the benchmark I commented out the lines where I defined the variables url
and text and the line where I printed the test string and added the loop for
the benchmark.

The index_after and index_after_opt have the same implementation with the only differences:

return type int and ?int
return values for not found -1 and none

You can check the code I used in https://gist.github.com/txgruppi/31663efdb380ced7f8d31f6f5f4c3d98

spytheman · 2022-03-19T07:51:39Z

Thanks for the benchmarks.

spytheman · 2022-03-19T09:45:00Z

I've made some simplifications to the extract-int.v and extract-opt.v files, to reduce the additional operations, that the loops perform (s.substr especially since it does allocation, b.step, b.ok), and added a Makefile to compare different modes of compilation more easily.
The modified version is in:
https://gist.github.com/spytheman/ef8f298c48fbc7678c1ce281385f511c

Here is the result on my machine:

0[11:01:56]delian@nemesis: (master) /v/misc/2022_03_19__01/index_int_vs_index_opt $ make clean all compare
rm -rf extract-int-tcc extract-int-gcc extract-int-clang extract-opt-tcc extract-opt-gcc extract-opt-clang extract-int-tcc-prod extract-int-gcc-prod extract-int-clang-prod extract-opt-tcc-prod extract-opt-gcc-prod extract-opt-clang-prod
v -cc tcc -o extract-int-tcc extract-int.v
v -cc gcc -o extract-int-gcc extract-int.v
v -cc clang -o extract-int-clang extract-int.v
v -cc tcc -o extract-opt-tcc extract-opt.v
v -cc gcc -o extract-opt-gcc extract-opt.v
v -cc clang -o extract-opt-clang extract-opt.v
v -prod -cc tcc -o extract-int-tcc-prod extract-int.v
Note: tcc is not recommended for -prod builds
v -prod -cc gcc -o extract-int-gcc-prod extract-int.v
v -prod -cc clang -o extract-int-clang-prod extract-int.v
v -prod -cc tcc -o extract-opt-tcc-prod extract-opt.v
Note: tcc is not recommended for -prod builds
v -prod -cc gcc -o extract-opt-gcc-prod extract-opt.v
v -prod -cc clang -o extract-opt-clang-prod extract-opt.v
hyperfine './extract-int-tcc        1000'   './extract-opt-tcc        1000'
Benchmark #1: ./extract-int-tcc        1000
  Time (mean ± σ):      2.441 s ±  0.024 s    [User: 2.435 s, System: 0.005 s]
  Range (min … max):    2.404 s …  2.487 s    10 runs
 
Benchmark #2: ./extract-opt-tcc        1000
  Time (mean ± σ):      3.142 s ±  0.030 s    [User: 3.139 s, System: 0.001 s]
  Range (min … max):    3.089 s …  3.186 s    10 runs
 
Summary
  './extract-int-tcc        1000' ran
    1.29 ± 0.02 times faster than './extract-opt-tcc        1000'
hyperfine './extract-int-gcc        1000'   './extract-opt-gcc        1000'
Benchmark #1: ./extract-int-gcc        1000
  Time (mean ± σ):      1.772 s ±  0.040 s    [User: 1.769 s, System: 0.002 s]
  Range (min … max):    1.728 s …  1.850 s    10 runs
 
Benchmark #2: ./extract-opt-gcc        1000
  Time (mean ± σ):      2.190 s ±  0.023 s    [User: 2.186 s, System: 0.002 s]
  Range (min … max):    2.160 s …  2.233 s    10 runs
 
Summary
  './extract-int-gcc        1000' ran
    1.24 ± 0.03 times faster than './extract-opt-gcc        1000'
hyperfine './extract-int-clang      1000'   './extract-opt-clang      1000'
Benchmark #1: ./extract-int-clang      1000
  Time (mean ± σ):      1.858 s ±  0.027 s    [User: 1.857 s, System: 0.000 s]
  Range (min … max):    1.818 s …  1.902 s    10 runs
 
Benchmark #2: ./extract-opt-clang      1000
  Time (mean ± σ):      2.386 s ±  0.080 s    [User: 2.382 s, System: 0.002 s]
  Range (min … max):    2.265 s …  2.495 s    10 runs
 
Summary
  './extract-int-clang      1000' ran
    1.28 ± 0.05 times faster than './extract-opt-clang      1000'
hyperfine './extract-int-tcc-prod   1000'   './extract-opt-tcc-prod   1000'
Benchmark #1: ./extract-int-tcc-prod   1000
  Time (mean ± σ):      1.942 s ±  0.030 s    [User: 1.940 s, System: 0.001 s]
  Range (min … max):    1.904 s …  1.985 s    10 runs
 
Benchmark #2: ./extract-opt-tcc-prod   1000
  Time (mean ± σ):      2.702 s ±  0.045 s    [User: 2.698 s, System: 0.001 s]
  Range (min … max):    2.628 s …  2.798 s    10 runs
 
Summary
  './extract-int-tcc-prod   1000' ran
    1.39 ± 0.03 times faster than './extract-opt-tcc-prod   1000'
hyperfine './extract-int-gcc-prod   1000'   './extract-opt-gcc-prod   1000'
Benchmark #1: ./extract-int-gcc-prod   1000
  Time (mean ± σ):     495.2 ms ±   5.7 ms    [User: 493.4 ms, System: 1.5 ms]
  Range (min … max):   488.8 ms … 508.3 ms    10 runs
 
Benchmark #2: ./extract-opt-gcc-prod   1000
  Time (mean ± σ):     520.5 ms ±   5.0 ms    [User: 518.1 ms, System: 2.2 ms]
  Range (min … max):   514.8 ms … 528.1 ms    10 runs
 
Summary
  './extract-int-gcc-prod   1000' ran
    1.05 ± 0.02 times faster than './extract-opt-gcc-prod   1000'
hyperfine './extract-int-clang-prod 1000'   './extract-opt-clang-prod 1000'
Benchmark #1: ./extract-int-clang-prod 1000
  Time (mean ± σ):     539.9 ms ±  10.1 ms    [User: 536.4 ms, System: 3.0 ms]
  Range (min … max):   523.5 ms … 554.0 ms    10 runs
 
Benchmark #2: ./extract-opt-clang-prod 1000
  Time (mean ± σ):     532.8 ms ±   9.9 ms    [User: 530.6 ms, System: 1.5 ms]
  Range (min … max):   516.3 ms … 548.9 ms    10 runs
 
Summary
  './extract-opt-clang-prod 1000' ran
    1.01 ± 0.03 times faster than './extract-int-clang-prod 1000'

Except for the last case (./extract-opt-clang-prod vs extract-int-clang-prod), the version returning pure int was faster, varying between 5% to 39% faster, compared to the version that returns ?int.

When using -prod, both clang and gcc managed to optimize well the difference.

I have to note however, that when I compared the more common mixed usage (.index_after calls intermixed with other methods like .substr), the performance difference becomes insignificant, since other more important factors become more dominant (allocation mostly).

spytheman · 2022-03-19T10:00:42Z

Also this PR doesn't endorse safe-by-default behavior - i.e. the syntactically shorter variant shall be the one returning int | none and the syntactically longer variant shall return int.

Another important consideration is readability - the ?int using code, does look more readable (and more compact), compared to the int version, because of the syntax support that V and vfmt provide for that common pattern.

Compare:

        for i := 0; i < max; i++ {
                mut start := -1
                mut end := -1
                for {
                        start = raw.index_after(atag, end)
                        if start == -1 {
                                break
                        }
                        start = raw.index_after(href, start)
                        if start == -1 {
                                break
                        }
                        start += 6
                        end = start
                        start = raw.index_after(gt, end)
                        if start == -1 {
                                break
                        }
                        start += 1
                        end = raw.index_after(lt, start)
                        if end == -1 {
                                break
                        }
                }
                checksum+=u64(end)
        }

vs

        for i := 0; i < max; i++ {
                mut start := -1
                mut end := -1
                for {
                        start = raw.index_after_opt(atag, end) or { break }
                        start = raw.index_after_opt(href, start) or { break }
                        start += 6
                        end = start
                        start = raw.index_after_opt(gt, end) or { break }
                        start += 1
                        end = raw.index_after_opt(lt, start) or { break }
                }
                checksum+=u64(end)
        }

Given all of the above, I think the faster int version should stay, and the optional ?int versions should be just wrappers, that call the int versions, with an _opt suffix in their names, because using the shorter names for the ?int versions will be a breaking change.

In this way, users could call the more inconvenient, but faster versions where that matters, and the more convenient _opt variants, outside of hot loops, while code duplication will be minimized, and breaking changes will be avoided.

@medvednikov what do you think?

spytheman · 2022-03-19T10:12:51Z

using the shorter names for the ?int versions will be a breaking change

Actually, I am wrong about this - currently on master, .index() does return an ?int, so this PR will be a breaking change, even in case we decide to have _opt suffixes for the slower but more convenient ?int versions, and the shorter names for the int returning methods.

In that case, I do support using ?int for the shorter names and an _int suffix for the faster less convenient int returning versions, while the ?int methods should still be wrappers around the faster int versions.

txgruppi · 2022-03-19T21:58:27Z

Any change we make will be a breaking change. In the current state we have some methods returning int and some returning ?int. My reason for this MR was to make them all follow the same standard, it was not an attempt to push ?int as the default.

Let me know what is the final decision after you're all in agreement so I can update the PR.

Cheers.

# Conflicts: # vlib/toml/checker/checker.v

txgruppi · 2022-04-30T00:30:16Z

@spytheman @dumblob

I think it is ready for you.

txgruppi · 2022-04-30T00:31:16Z

I tried running v test-all in Windows and OSX but I'm not familiar with these and I'm not sure I did a good job on them.

StunxFS · 2022-05-01T00:32:25Z

@txgruppi please fix VLS

server/diagnostics.v:12:24: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   10 |     }
   11 |
   12 |     line_colon_idx := msg.index_after(':', 2) // deal with `d:/v/...:2:4: error: ...`
      |                           ~~~~~~~~~~~~~~~~~~~
   13 |     if line_colon_idx < 0 {
   14 |         return none
server/diagnostics.v:21:23: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   19 |     }
   20 |
   21 |     col_colon_idx := msg.index_after(':', line_colon_idx + 1)
      |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   22 |     colon_sep_idx := msg.index_after(':', col_colon_idx + 1)
   23 |     msg_type_colon_idx := msg.index_after(':', colon_sep_idx + 1)
server/diagnostics.v:22:23: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   20 |
   21 |     col_colon_idx := msg.index_after(':', line_colon_idx + 1)
   22 |     colon_sep_idx := msg.index_after(':', col_colon_idx + 1)
      |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   23 |     msg_type_colon_idx := msg.index_after(':', colon_sep_idx + 1)
   24 |     if msg_type_colon_idx == -1 || col_colon_idx == -1 || colon_sep_idx == -1 {
server/diagnostics.v:23:28: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   21 |     col_colon_idx := msg.index_after(':', line_colon_idx + 1)
   22 |     colon_sep_idx := msg.index_after(':', col_colon_idx + 1)
   23 |     msg_type_colon_idx := msg.index_after(':', colon_sep_idx + 1)
      |                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   24 |     if msg_type_colon_idx == -1 || col_colon_idx == -1 || colon_sep_idx == -1 {
   25 |         return error('idx is -1')

txgruppi · 2022-05-02T17:11:39Z

@StunxFS vlang/vls#329

txgruppi · 2022-05-13T04:25:15Z

Any news here?

dumblob · 2022-05-13T13:06:48Z

vlib/builtin/js/string_test.js.v

-	assert x.index_any('ef') == 4
-	assert x.index_any('fe') == 4
+	assert x.index_any_int('ef') == 4
+	assert x.index_any_int('fe') == 4


Shouldn't we test the none versions also? Or if not in addition to _int versions, then at least the none versions as they are generally wrappers around the _int versions which will be thus tested as well.

This holds for other _test files as well.

@JalonSolov I think except for this suggestion I think it is ready for merging.

One file conflict still to be resolved, and need the tests to run again.

dumblob · 2022-05-13T13:11:04Z

What are the failing checks? I think we can re-run CI as the PR looks OK to me.

StunxFS · 2022-05-19T23:15:08Z

@txgruppi the PR is fine, only a small fix was missing in ved, which I have already done: vlang/ved#140.

…-funcs-return-opt-int

Hunam6 · 2022-05-26T18:56:07Z

This PR is ready to be merged then, as the only failing test is due to vlang/ved#140 not yet merged

txgruppi · 2022-06-03T15:13:07Z

I want to update you all.

I was adding more tests to cover the new methods but I got some urgent stuff to do at work and this is taking all the time I had to work on this MR.

As soon as things go back to normal at work I will add tests.

Hunam6 · 2022-06-09T16:46:18Z

(it was to resolve conflicts, not useless don't worry Delyan :))

Signed-off-by: Hunam <hunam@disroot.org>

* master: (24 commits) tools: build c2v in non verbose mode by default tools: use os.system for the c2v runs to monitor the progress more easily tools: fix `v vet file.v` for `return if x { y // comment } else { z }` ast: fix array of reference sumtype appending (vlang#14797) pref: disable gc for translated code builder: add -c when building object files sokol: mark pub structs checker: improve pub struct check (fix vlang#14446) (vlang#14777) tools: do show the output of c2v, when it fails tools: fix the first run of `v translate hw.c` ci: use VTEST_JUST_ESSENTIAL=1 for the -cstrict test-self task in ubuntu-clang too cmd: enable `v translate`, download and install c2v ci: use VTEST_JUST_ESSENTIAL=1 for the ubuntu -cstrict gcc task (prevent 2 hour runs) native: initial support for `defer` (vlang#14779) parser, cgen: temporary prefix ++ for translated code tools: handle fn attributes/comments more robustly, when `v missdoc` is run (vlang#14774) cgen: add a minor optimisation for array.push_many (vlang#14770) pref: is_o orm: mysql fixes (vlang#14772) builder: handle linker errors when building .o files ...

…-funcs-return-opt-int

…t' into string-index-funcs-return-opt-int

# Conflicts: # vlib/v/parser/parser.v

JalonSolov · 2022-12-17T01:22:45Z

This one has been out a while - any further thoughts?

JalonSolov · 2023-08-06T15:53:23Z

6 conflicting files need to be fixed, now.

txgruppi mentioned this pull request Mar 17, 2022

builtin: Make all string.index methods return ?int #13693

Closed

dumblob reviewed Mar 17, 2022

View reviewed changes

cmd/tools/vcheck-md.v Outdated Show resolved Hide resolved

txgruppi added 10 commits March 23, 2022 20:07

Merge branch 'master' into string-index-funcs-return-opt-int

59082fe

Merge branch 'master' into string-index-funcs-return-opt-int

5033570

# Conflicts: # vlib/toml/checker/checker.v

[wip] changing index _opt and _int

ce532da

wip

b0cd3a8

Merge branch 'master' into string-index-funcs-return-opt-int

ba592d3

Merge branch 'master' into string-index-funcs-return-opt-int

8f6bee7

Merge branch 'master' into string-index-funcs-return-opt-int

a5ad646

Merge branch 'master' into string-index-funcs-return-opt-int

d35f13f

Updated test

122caba

Fixed index_u8 call in windows environment.c.v

1d25586

txgruppi mentioned this pull request May 2, 2022

Updated to work with new standard for string index methods. vlang/vls#329

Closed

dumblob reviewed May 13, 2022

View reviewed changes

medvednikov closed this May 13, 2022

medvednikov reopened this May 13, 2022

Merge branch 'master' into string-index-funcs-return-opt-int

777e3a1

StunxFS mentioned this pull request May 19, 2022

fix index_after() returns an option vlang/ved#140

Closed

txgruppi added 3 commits May 23, 2022 13:13

Merge branch 'master' into string-index-funcs-return-opt-int

b692872

Merge branch 'master' of https://github.com/vlang/v into string-index…

00678a8

…-funcs-return-opt-int

Merge branch 'master' into string-index-funcs-return-opt-int

fd83d67

Merge branch 'master' into string-index-funcs-return-opt-int

3f8d687

Hunam6 and others added 7 commits June 9, 2022 18:54

fix

bf080df

Signed-off-by: Hunam <hunam@disroot.org>

Merge branch 'master' into string-index-funcs-return-opt-int

a1136a1

Merge branch 'master' of https://github.com/vlang/v into string-index…

c93262b

…-funcs-return-opt-int

Merge remote-tracking branch 'origin/string-index-funcs-return-opt-in…

559c9f3

…t' into string-index-funcs-return-opt-int

Merge branch 'master' into string-index-funcs-return-opt-int

51f393f

# Conflicts: # vlib/v/parser/parser.v

Merge branch 'master' into string-index-funcs-return-opt-int

47bd485

KamkoAmoh approved these changes Sep 26, 2022

View reviewed changes

ArtemkaKun added the Needs Rebase The code of the PR must be rebased over current master before it can be approved. label Nov 1, 2022

medvednikov closed this Sep 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

builtin: Make all string.index methods return ?int (second try) #13758

builtin: Make all string.index methods return ?int (second try) #13758

txgruppi commented Mar 17, 2022 •

edited

txgruppi commented Mar 17, 2022

dumblob commented Mar 17, 2022

txgruppi commented Mar 19, 2022

spytheman commented Mar 19, 2022

spytheman commented Mar 19, 2022

spytheman commented Mar 19, 2022 •

edited

spytheman commented Mar 19, 2022 •

edited

txgruppi commented Mar 19, 2022

txgruppi commented Apr 30, 2022

txgruppi commented Apr 30, 2022

StunxFS commented May 1, 2022

txgruppi commented May 2, 2022

txgruppi commented May 13, 2022

dumblob May 13, 2022 •

edited

dumblob Dec 17, 2022

JalonSolov Dec 17, 2022

dumblob commented May 13, 2022

StunxFS commented May 19, 2022

Hunam6 commented May 26, 2022 •

edited

txgruppi commented Jun 3, 2022

Hunam6 commented Jun 9, 2022

JalonSolov commented Dec 17, 2022

JalonSolov commented Aug 6, 2023 •

edited

builtin: Make all string.index methods return ?int (second try) #13758

builtin: Make all string.index methods return ?int (second try) #13758

Conversation

txgruppi commented Mar 17, 2022 • edited

txgruppi commented Mar 17, 2022

dumblob commented Mar 17, 2022

txgruppi commented Mar 19, 2022

spytheman commented Mar 19, 2022

spytheman commented Mar 19, 2022

spytheman commented Mar 19, 2022 • edited

spytheman commented Mar 19, 2022 • edited

txgruppi commented Mar 19, 2022

txgruppi commented Apr 30, 2022

txgruppi commented Apr 30, 2022

StunxFS commented May 1, 2022

txgruppi commented May 2, 2022

txgruppi commented May 13, 2022

dumblob May 13, 2022 • edited

Choose a reason for hiding this comment

dumblob Dec 17, 2022

Choose a reason for hiding this comment

JalonSolov Dec 17, 2022

Choose a reason for hiding this comment

dumblob commented May 13, 2022

StunxFS commented May 19, 2022

Hunam6 commented May 26, 2022 • edited

txgruppi commented Jun 3, 2022

Hunam6 commented Jun 9, 2022

JalonSolov commented Dec 17, 2022

JalonSolov commented Aug 6, 2023 • edited

txgruppi commented Mar 17, 2022 •

edited

spytheman commented Mar 19, 2022 •

edited

spytheman commented Mar 19, 2022 •

edited

dumblob May 13, 2022 •

edited

Hunam6 commented May 26, 2022 •

edited

JalonSolov commented Aug 6, 2023 •

edited