Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-96954: use a directed acyclic word graph for storing the unicodedata codepoint names #97906

Merged
merged 42 commits into from
Nov 4, 2023

Conversation

cfbolz
Copy link
Contributor

@cfbolz cfbolz commented Oct 5, 2022

gh-96954: use a directed acyclic word graph/Deterministic acyclic finite state automaton/finite state transducer for storing the unicodedata codepoint names. This is the approach that PyPy recently switched to. The names are encoded into a packed string that represents the finite state machine to recognize valid names, and also to map names to an index. The packed representation can be used to match names without decompression. The same representation can be used for the inverse operation of mapping a code point to a codepoint name.

This changes reduces the size of the unicodedata shared library from 1222 KiB to 791 KiB.

Relevant papers:

/CC @isidentical

@isidentical isidentical self-requested a review October 5, 2022 13:24
Modules/unicodedata.c Outdated Show resolved Hide resolved
Copy link
Contributor

@ambv ambv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still in the process of reading through the algorithm and comparing the implementation here to it. I just wanted to get those nits out of the way since otherwise my mind would uselessly keep returning to them.

Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
also add a DEBUG flag that checks the correctness of the packed
representation at unicodedata build time, using the Python variants of
the lookup/inverse_lookup algorithms
@cfbolz
Copy link
Contributor Author

cfbolz commented Oct 6, 2022

I'm still in the process of reading through the algorithm and comparing the implementation here to it. I just wanted to get those nits out of the way since otherwise my mind would uselessly keep returning to them.

Thanks @ambv, all very sensible complaints, fixed them.

Comment on lines 1353 to 1354
assert(buflen >= 0);
return _inverse_dawg_lookup(buffer, (unsigned int)buflen, offset);
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use Py_SAFE_DOWNCAST here (int -> unsigned int).

Modules/unicodedata.c Outdated Show resolved Hide resolved
Modules/unicodedata.c Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
@cfbolz
Copy link
Contributor Author

cfbolz commented Oct 22, 2023

Thanks a lot for the feedback @sweeneyde! I'll get on fixing these.

Re hypothesis tests: I don't quite know where to put them though, maybe simply into a file parallel to dawg.py and accept that they won't automatically run by CI (none of the other things below the Tools/ directly seem to have tests)? Or we could be sneaky and have the unit tests be run whenever someone regenerates the db? We should definitely turn DEBUG always on, it's very fast and gives a lot of assurance. Thanks for suggesting that – the final db gets tested via test_unicodedata.py, but it's still better to test things during constructing too.

@sweeneyde
Copy link
Member

It looks like there's a Lib/test/test_tools/ for testing scripts in the tools directory

@cfbolz
Copy link
Contributor Author

cfbolz commented Oct 23, 2023

ah wonderful, thanks for finding that! and with the hypothesis stubs we can even do this properly :-)

Modules/unicodedata.c Outdated Show resolved Hide resolved
cfbolz and others added 2 commits October 23, 2023 10:19
now it's less a fixpoint and more an optimization process, ie we can
stop at any point and simply get a less optimal but still correct
result.
Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
@cfbolz
Copy link
Contributor Author

cfbolz commented Oct 30, 2023

@sweeneyde I've addressed your comments, I think. would you have some time to review the C code as well?

@sweeneyde
Copy link
Member

Thanks, I'll take another look tonight or tomorrow.

Copy link
Member

@sweeneyde sweeneyde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more subjective things, but this LGTM!

I went ahead and tried adding versions of all of the packed-bytes C functions parameterized by the packed bytes instead of hard-coded, then haphazardly added thin python wrappers testunicodedata._dawg_lookup and unicodedata._dawg_inverse_lookup, and threw Hypothesis at them, and I didn't find any extra corner cases, so that is another good sign.

hypothesis code
    @given(st.sets(
            st.text("ABCD _", min_size=1, max_size=20),
            min_size=2, max_size=20)
    )
    def test_c_lookup(self, words0):
        words = list(words0)
        not_a_word = words.pop()
        words.sort()
        dawg = Dawg()
        for i, word in enumerate(words):
            dawg.insert(word, i*10)
        packed, pos_to_code, reversedict = dawg.finish()

        word2pos = {}
        for word in words:
            word = word.encode("ascii")
            pos = c_lookup(packed, word)
            word2pos[word] = pos
        self.assertEqual(set(word2pos.values()), set(range(len(words))))
        for word, pos in word2pos.items():
            self.assertEqual(c_inverse_lookup(packed, pos), word)
        self.assertEqual(c_lookup(packed, not_a_word.encode("ascii")), -1)

        self.assertEqual(c_inverse_lookup(packed, len(words)), None)
        self.assertEqual(c_inverse_lookup(packed, len(words) + 1), None)

Modules/unicodedata.c Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Outdated Show resolved Hide resolved
Tools/unicode/dawg.py Show resolved Hide resolved
Modules/unicodedata.c Outdated Show resolved Hide resolved
Modules/unicodedata.c Show resolved Hide resolved
cfbolz and others added 3 commits November 3, 2023 09:22
Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>
- rename child_count to descendant_count
- rename final_edge to last_edge to reduce the confusion with "final
states"
@cfbolz
Copy link
Contributor Author

cfbolz commented Nov 3, 2023

@sweeneyde thanks a lot for the thorough feedback! I adopted your suggestions, they made sense to me too.

Also thanks for the extra hypothesis checks. Do you think it would make sense to push for including them in the test suite? would be a little bit annoying, because we would have to pass the packed representation everywhere, as opposed to just referring to the single global one we need outside of tests.

@cfbolz
Copy link
Contributor Author

cfbolz commented Nov 3, 2023

The test failures look unrelated, maybe it's the same as #111644 ?

@sweeneyde
Copy link
Member

Also thanks for the extra hypothesis checks. Do you think it would make sense to push for including them in the test suite?

Up to you. I don't think it's totally necessary because as you mentioned all the code is already exercised by test_unicodedata.

@ambv ambv merged commit 9573d14 into python:main Nov 4, 2023
30 checks passed
@ambv
Copy link
Contributor

ambv commented Nov 4, 2023

Thanks for persevering, Carl Friedrich! ✨ 🍰 ✨

Also thanks for the review, @sweeneyde.

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x Fedora Clang Installed 3.x has failed when building commit 9573d14.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/531/builds/4757) and take a look at the build logs.
  4. Check if the failure is related to this commit (9573d14) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/531/builds/4757

Failed tests:

  • test_tools

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 179, in _runtest_env_changed_exc
    _load_run_test(result, runtests)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 136, in _load_run_test
    regrtest_runner(result, test_func, runtests)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 89, in regrtest_runner
    test_result = test_func()
                  ^^^^^^^^^^^
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 133, in test_func
    return run_unittest(test_mod)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 36, in run_unittest
    raise Exception("errors while loading tests")
Exception: errors while loading tests


Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/unittest/loader.py", line 394, in _find_test_path
    module = self._get_module_from_name(name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/unittest/loader.py", line 337, in _get_module_from_name
    __import__(name)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/test_tools/test_makeunicodedata.py", line 12, in <module>
    from dawg import Dawg, build_compression_dawg, lookup, inverse_lookup
ModuleNotFoundError: No module named 'dawg'

@ambv
Copy link
Contributor

ambv commented Nov 4, 2023

s390x failure looks unrelated:

  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 36, in run_unittest
    raise Exception("errors while loading tests")

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x Fedora Clang Installed 3.x has failed when building commit 9573d14.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/531/builds/4758) and take a look at the build logs.
  4. Check if the failure is related to this commit (9573d14) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/531/builds/4758

Failed tests:

  • test_tools

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 179, in _runtest_env_changed_exc
    _load_run_test(result, runtests)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 136, in _load_run_test
    regrtest_runner(result, test_func, runtests)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 89, in regrtest_runner
    test_result = test_func()
                  ^^^^^^^^^^^
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 133, in test_func
    return run_unittest(test_mod)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/libregrtest/single.py", line 36, in run_unittest
    raise Exception("errors while loading tests")
Exception: errors while loading tests


Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/unittest/loader.py", line 394, in _find_test_path
    module = self._get_module_from_name(name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/unittest/loader.py", line 337, in _get_module_from_name
    __import__(name)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-fedora-z.clang-installed/build/target/lib/python3.13/test/test_tools/test_makeunicodedata.py", line 12, in <module>
    from dawg import Dawg, build_compression_dawg, lookup, inverse_lookup
ModuleNotFoundError: No module named 'dawg'

@sweeneyde
Copy link
Member

I think it is related--the buildbots labelled "installed" seem to not get dawg.py.

I opened #111764 to add skip_if_missing.

aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
…codedata codepoint names (python#97906)

Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants