Added notes to HowToUsePyparsing.rst for using Unicode language sets
ptmcg committed Apr 30, 2022
1 parent 17a39e8 commit e9e56bd
Showing 1 changed file with 40 additions and 0 deletions.
40 changes: 40 additions & 0 deletions docs/HowToUsePyparsing.rst
@@ -1336,6 +1336,46 @@ Common string and token constants
by ``urllib.parse.urlparse()``


Unicode character sets for international parsing
------------------------------------------------
Pyparsing includes the ``unicode`` namespace that contains definitions for ``alphas``, ``nums``, ``alphanums``,
``identchars``, ``identbodychars``, and ``printables`` for character ranges besides 7- or 8-bit ASCII. You can
access them using code like the following::

    import pyparsing as pp
    ppu = pp.unicode

    greek_word = pp.Word(ppu.Greek.alphas)
    greek_word[...].parse_string("Καλημέρα κόσμε")

The following language ranges are defined:

========================== ================= ================================================
Unicode set                Alternate names   Description
-------------------------- ----------------- ------------------------------------------------
Arabic                     العربية
Chinese                    中文
Cyrillic                   кириллица
Greek                      Ελληνικά
Hebrew                     עִברִית
Japanese                   日本語               Union of Kanji, Katakana, and Hiragana sets
Japanese.Kanji             漢字
Japanese.Katakana          カタカナ
Japanese.Hiragana          ひらがな
Hangul                     Korean, 한국어
Latin1                                       All Unicode characters up to code point 255
LatinA
LatinB
Thai                       ไทย
Devanagari                 देवनागरी
BasicMultilingualPlane     BMP               All Unicode characters up to code point 65535
CJK                                          Union of Chinese, Japanese, and Korean sets
========================== ================= ================================================
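
As a rough illustration of the table above, the following sketch assumes pyparsing 3.x, where the alternate names are expected to be plain aliases for the same set classes and nested sets such as ``Japanese.Katakana`` are reached by attribute access. The sample strings and variable names are made up for this example.

```python
import pyparsing as pp

ppu = pp.unicode

# Alternate names from the table should refer to the same set class
# (assumption: pyparsing 3.x defines them as simple aliases).
korean_is_alias = ppu.Korean is ppu.Hangul
greek_is_alias = ppu.Ελληνικά is ppu.Greek

# A nested set is reached through its parent set.
katakana_word = pp.Word(ppu.Japanese.Katakana.alphas)
katakana_result = katakana_word.parse_string("カタカナ").as_list()

# Japanese is described as a union set, so a single Word built from it
# should accept Hiragana and Katakana characters together.
japanese_word = pp.Word(ppu.Japanese.alphas)
japanese_result = japanese_word.parse_string("ひらがなカタカナ").as_list()

print(korean_is_alias, greek_is_alias)
print(katakana_result, japanese_result)
```

Using the ASCII name or the native-language name is a matter of taste; the ASCII names may be easier to type and to search for in a mixed-language code base.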

The base ``unicode`` class also includes definitions based on all Unicode code points up to ``sys.maxunicode``. This
set will include emojis, dingbats, and many other specialized and typographical variant characters.
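
A minimal sketch of the difference between the base set and a narrower one, assuming pyparsing 3.x: it checks whether one emoji (chosen arbitrarily for this example) appears in the base ``printables`` and in ``Latin1.printables``. The variable names are illustrative only.

```python
import pyparsing as pp

ppu = pp.unicode

emoji = "😀"  # U+1F600, far above code point 255

# The base set spans code points up to sys.maxunicode, so the emoji
# is expected to be among its printable characters.
emoji_in_base = emoji in ppu.printables

# Latin1 stops at code point 255, so the emoji should be absent there,
# while an accented Latin letter such as "é" (U+00E9) should be present.
emoji_in_latin1 = emoji in ppu.Latin1.printables
e_acute_in_latin1 = "é" in ppu.Latin1.alphas

print(emoji_in_base, emoji_in_latin1, e_acute_in_latin1)
```

Because the base set covers every code point, a language-specific set is probably the better choice when the input language is known in advance.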


Generating Railroad Diagrams
============================
Grammars are conventionally represented in what are called "railroad diagrams", which allow you to visually follow