
Commit

ready for release, except doi
bambooforest committed Jun 23, 2018
1 parent c2063ad commit 2b2bf63
Showing 12 changed files with 150 additions and 144 deletions.
68 changes: 34 additions & 34 deletions book/chapters/implementation.tex
@@ -25,6 +25,7 @@ \section{Python package: segments}

The Python package \texttt{segments} is available both as a command line interface (CLI) and as an application programming interface (API).


\subsection*{Installation}

To install the Python package \texttt{segments} \citep{ForkelMoran2018} from the Python Package Index (PyPI) run:
@@ -35,56 +36,57 @@ \subsection*{Installation}

\noindent on the command line. This will give you access to both the CLI and programmatic functionality in Python scripts, when you import the \texttt{segments} library.

You can also install the \texttt{segments} package from the GitHub repository:\footnote{\url{https://github.com/cldf/segments}}
You can also install the \texttt{segments} package from the GitHub repository,\footnote{\url{https://github.com/cldf/segments}} in particular if you would like to contribute to the code base:\footnote{\url{https://github.com/cldf/segments/blob/master/CONTRIBUTING.md}}

\begin{lstlisting}[language=bash, basicstyle=\myfont]
$ git clone https://github.com/cldf/segments
$ cd segments
$ python setup.py develop
\end{lstlisting}


\subsection*{Application programming interface}
The \texttt{segments} API can be accessed by importing the package into Python or by writing a Python script. Here is an example of how to import the libraries, create a tokenizer object, tokenize a string, and create an orthography profile.
The \texttt{segments} API can be accessed by importing the package into Python. Here is an example of how to import the library, create a tokenizer object, tokenize a string, and create an orthography profile. Begin by importing the \texttt{Tokenizer} from the \texttt{segments} library.

\begin{lstlisting}[basicstyle=\myfont]
>>> from segments.tokenizer import Tokenizer
\end{lstlisting}

\noindent The \texttt{characters} function will segment a string at Unicode code points.
\noindent Next, instantiate a tokenizer object, which takes optional arguments for an orthography profile and an orthography profile rules file.

% \lstset{extendedchars=false, escapeinside=**}
\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
\begin{lstlisting}[basicstyle=\myfont]
>>> t = Tokenizer()
>>> result = t.characters('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
>>> print(result)
>>> '(*@c ̂ h a ́ ɾ a ̃ ̌ c t ʼ ɛ ↗ ʐ ː | # k ͡ p@*)'
\end{lstlisting}
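
\noindent To see the two segmentation strategies side by side, here is a minimal sketch with an explicitly decomposed <ü>, assuming the \texttt{characters} and \texttt{grapheme\_clusters} methods behave as described in this section:

\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
>>> s = 'u\u0308'            # decomposed (*@<ü>@*): <u> plus a combining diaeresis
>>> t.characters(s)          # one segment per Unicode code point
'(*@u ̈@*)'
>>> t.grapheme_clusters(s)   # one extended grapheme cluster
'(*@ü@*)'
\end{lstlisting}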

\noindent The \texttt{grapheme\_clusters} function will segment text at the Unicode Extended Grapheme Cluster boundaries.\footnote{\url{http://www.unicode.org/reports/tr18/tr18-19.html\#Default_Grapheme_Clusters}}
\noindent The default tokenization strategy is to segment input text at Unicode Extended Grapheme Cluster boundaries,\footnote{\url{http://www.unicode.org/reports/tr18/tr18-19.html\#Default_Grapheme_Clusters}} and to return a space-delimited string of graphemes. White space between sequences in the input string is by default replaced by a hash symbol <\#>, a linguistic convention used to denote word boundaries. The default grapheme tokenization is useful when you encounter a text that you want to tokenize in order to identify potential orthographic or transcription elements.

\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
>>> result = t.grapheme_clusters('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
>>> print(result)
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p@*)'
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | \# k͡ p@*)'
\end{lstlisting}
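
\noindent The hash convention is easiest to see with input that contains more than one white-space-delimited sequence; a minimal sketch, assuming the default separators:

\begin{lstlisting}[basicstyle=\myfont]
>>> t('aa bb cc')
'a a # b b # c c'
\end{lstlisting}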

\noindent The \texttt{grapheme\_clusters} function is the default segmentation algorithm for the \texttt{segments.Tokenizer}. It is useful when you encounter a text that you want to tokenize to identify orthographic or transcription elements.

\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', segment_separator='(*@-@*)')
>>> print(result)
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p@*)'
>>> '(*@ĉ-h-á-ɾ-ã̌-c-t-ʼ-ɛ-↗-ʐ-ː-| \# k͡ -p@*)'
\end{lstlisting}

\noindent The \texttt{ipa} parameter forces grapheme segmentation for IPA strings.
\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont, showstringspaces=false]
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', separator=' // ')
>>> print(result)
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | // k͡ p@*)'
\end{lstlisting}
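
\noindent Both separator parameters can be overridden at the same time; another minimal sketch, with hypothetical separator values:

\begin{lstlisting}[basicstyle=\myfont]
>>> t('aa bb', segment_separator='-', separator=' / ')
'a-a / b-b'
\end{lstlisting}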

\noindent The optional \texttt{ipa} parameter forces grapheme segmentation for IPA strings.\footnote{\url{https://en.wikipedia.org/wiki/International\_Phonetic\_Alphabet}} Note here that Unicode Spacing Modifier Letters,\footnote{\url{https://en.wikipedia.org/wiki/Spacing\_Modifier\_Letters}} such as <ː> and <\dia{0361}{\large\fontspec{CharisSIL}◌}>, will be segmented together with base characters (although you might need orthography profiles and rules to correct these in your input source; see Section \ref{pitfall-different-notions-of-diacritics} for details).

\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', ipa=True)
>>> print(result)
>>> '(*@ĉ h á ɾ ã̌ c ɛ ↗ ʐː | # k͡p@*)'
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐː | \# k͡p@*)'
\end{lstlisting}

\noindent We can also load an orthography profile and tokenize an input string with it. In the data directory, we've placed an example orthography profile. Let's have a look at it using \texttt{more} on the command line.
\noindent You can also load an orthography profile and tokenize input strings with it. In the data directory,\footnote{\url{https://github.com/unicode-cookbook/recipes/tree/master/Basics/data}} we've placed an example orthography profile. Let's have a look at it using \texttt{more} on the command line.


\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont, showstringspaces=false]
@@ -103,7 +105,7 @@ \subsection*{Application programming interface}
\end{lstlisting}


\noindent An orthography profile is a delimited UTF-8 text file (here we use tab as a delimiter for reading ease). The first column must be labelled \texttt{Grapheme}. Each row in the \texttt{Grapheme} column specifies graphemes that may be found in the orthography of the input text. In this example, we provide additional columns IPA and XSAMPA, which are mappings from our graphemes to their IPA and XSAMPA transliterations. The final column \texttt{COMMENT} is for comments; if you want to use a tab ``quote that string''!
\noindent An orthography profile is a delimited UTF-8 text file (here we use tab as a delimiter for reading ease). The first column must be labeled \texttt{Grapheme}, as discussed in Section \ref{formal-specification-of-orthography-profiles}. Each row in the \texttt{Grapheme} column specifies graphemes that may be found in the orthography of the input text. In this example, we provide additional columns \texttt{IPA} and \texttt{XSAMPA}, which are mappings from our graphemes to their IPA and X-SAMPA transliterations. The final column \texttt{COMMENT} is for comments; if you want to use a tab in a comment, ``quote that string''!
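
Purely as a hypothetical illustration of this format, a few rows consistent with the outputs shown below might look roughly like the following (columns are tab-separated in the actual file, which is the authoritative version; the remaining rows are elided here):

\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
Grapheme  IPA   XSAMPA  COMMENT
aa        (*@aː@*)    a:      long vowel
ch        (*@tʃ@*)    tS
on        (*@õ@*)     o~      nasalized
...
\end{lstlisting}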

Let's load the orthography profile with our tokenizer.

@@ -120,10 +122,10 @@ \subsection*{Application programming interface}
>>> '(*@aa b ch on n - ih@*)'
\end{lstlisting}

\noindent This example shows how we can tokenize input text into our orthographic specification. We can also segment graphemes and transliterate them into other formats, which is useful when you have sources with different orthographies, but you want to be able to compare them using a single representation like IPA.
\noindent This example shows how we can tokenize input text into our orthographic specification. We can also segment graphemes and transliterate them into other forms, which is useful when you have sources with different orthographies, but you want to be able to compare them using a single representation like IPA or X-SAMPA.

\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
>>> t.transform('(*@aabchonn-ih@*)', 'IPA')
>>> t('(*@aabchonn-ih@*)', column='IPA')
>>> '(*@aː b tʃ õ n í@*)'
\end{lstlisting}

@@ -133,41 +135,39 @@ \subsection*{Application programming interface}
% >>> '(*@a: b tS o~ n i_H@*)'
% \end{lstlisting}

\begin{lstlisting}[basicstyle=\myfont, showstringspaces=false]
>>> t.transform('aabchonn-ih', 'XSAMPA')
\begin{lstlisting}[basicstyle=\myfont, showstringspaces=false, escapeinside={(*@}{@*)}]
>>> t('aabchonn(*@-@*)ih', column='XSAMPA')
>>> 'a: b tS o~ n i_H'
\end{lstlisting}
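
\noindent Since each additional profile column is just another mapping from the same graphemes, the different representations can also be produced in one go; a short sketch, assuming the \texttt{Grapheme} column can be requested explicitly by name:

\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
>>> for col in ('Grapheme', 'IPA', 'XSAMPA'):
...     print(t('aabchonn(*@-@*)ih', column=col))
aa b ch on n (*@-@*) ih
(*@aː b tʃ õ n í@*)
a: b tS o~ n i_H
\end{lstlisting}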


\noindent It is also useful to know which characters in your input string are not in your orthography profile. Use the function \texttt{find\_missing\_characters}.
\noindent It is also useful to know which characters in your input string are not in your orthography profile. By default, missing characters are displayed with the Unicode \textsc{replacement character} at \uni{FFFD}, which appears below as a white question mark within a black diamond.

\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont, language=bash]
>>> t.find_missing_characters('(*@aa b ch on n - ih x y z@*)')
>>> t('(*@aa b ch on n - ih x y z@*)')
>>> '(*@aa b ch on n - ih � � �@*)'
\end{lstlisting}


\noindent We set the default as the Unicode \texttt{replacement character} \uni{fffd}.\footnote{\url{http://www.fileformat.info/info/unicode/char/fffd/index.htm}} But you can simply change this by specifying the replacement character when you load the orthography profile with the tokenizer.

\noindent You can change the default by specifying a different replacement character when you load the orthography profile with the tokenizer.

\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}, showstringspaces=false]
>>> t = Tokenizer('data/orthography(*@-@*)profile.tsv',
errors_replace=lambda c: '?')
>>> t.find_missing_characters("aa b ch on n (*@-@*) ih x y z")
>>> t('aa b ch on n (*@-@*) ih x y z')
>>> 'aa b ch on n (*@-@*) ih ? ? ?'
\end{lstlisting}

\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}, showstringspaces=false]
>>> t = Tokenizer('data/orthography(*@-@*)profile.tsv',
errors_replace=lambda c: '<{0}>'.format(c))
>>> t.find_missing_characters("aa b ch on n (*@-@*) ih x y z")
>>> t('aa b ch on n (*@-@*) ih x y z')
>>> 'aa b ch on n (*@-@*) ih <x> <y> <z>'
\end{lstlisting}

\noindent Perhaps you want to create an initial orthography profile that also contains those graphemes x, y, z? Note that the space character and its frequency are also captured in this initial profile.
\noindent Perhaps you want to create an initial orthography profile that also contains those graphemes <x>, <y>, and <z>? Note that the space character and its frequency are also captured in this initial profile.
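
The snippet below uses the \texttt{Profile} class; its import is not shown here, but in current versions of the package it can presumably be pulled in at the top level (adjust to your installed version):

\begin{lstlisting}[basicstyle=\myfont]
>>> from segments import Profile
\end{lstlisting}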

\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}, showstringspaces=false]
>>> profile = Profile.from_text("aa b ch on n (*@-@*) ih x y z")
>>> profile = Profile.from_text('aa b ch on n (*@-@*) ih x y z')
>>> print(profile)
\end{lstlisting}
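
\noindent A common next step, sketched here under the assumption that \texttt{print(profile)} emits the same tab-delimited text that the tokenizer reads back in, is to write the auto-generated profile to disk, curate it by hand, and then load it like any other profile (the file name below is just a placeholder):

\begin{lstlisting}[basicstyle=\myfont, showstringspaces=false]
>>> with open('my-profile.tsv', 'w', encoding='utf-8') as f:
...     _ = f.write(str(profile))
>>> t = Tokenizer('my-profile.tsv')
\end{lstlisting}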

@@ -279,11 +279,11 @@ \subsection*{Command line interface}
'(*@ʃ ɛː ç t e l ç e n@*)'
\end{lstlisting}

\noindent And we can transliterate to XSAMPA.
\noindent And we can transliterate to X-SAMPA.

\begin{lstlisting}[language=bash, basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
$ cat sources/german.txt | segments (*@--@*)mapping=XSAMPA
(*@--@*)profile=data/german(*@-orthography(*@-@*)profile.tsv tokenize
(*@--@*)profile=data/german(*@-@*)orthography(*@-@*)profile.tsv tokenize

'S E: C t e l C e n'
\end{lstlisting}
18 changes: 9 additions & 9 deletions book/chapters/pitfalls.tex
@@ -188,7 +188,7 @@ \section{Pitfall: Missing glyphs}
Font}. This font does not show a real
glyph, but instead shows the hexadecimal code inside a box
for each character, so a user can at least see the Unicode code point of the
character to be displayed.\footnote{\url{http://scripts.sil.org/UnicodeBMPFallbackFont}}
character intended for display.\footnote{\url{http://scripts.sil.org/UnicodeBMPFallbackFont}}

% ==========================
\section{Pitfall: Faulty rendering}
@@ -200,7 +200,7 @@ \section{Pitfall: Faulty rendering}
reasons for unexpected visual display, namely automatic font substitution and
faulty rendering. Like missing glyphs, any such problems are independent of
the Unicode Standard. The Unicode Standard only includes very general
information about characters and leaves the specific visual display to others to
information about characters and leaves the specific visual display for others to
decide on. Any faulty display is thus not to be blamed on the Unicode
Consortium, but on a complex interplay of different mechanisms happening in a
computer to turn Unicode code points into visual symbols. We will only sketch a
@@ -212,26 +212,26 @@ \section{Pitfall: Faulty rendering}
have a glyph within this font, then the software application will automatically
search for another font to display the glyph. The result will be that this
specific glyph will look slightly different from the others. This mechanism
works differently depending on the software application, only limited
user influence is usually expected and little feedback is given, which might be rather
works differently depending on the software application; only limited
user influence is usually expected and little feedback is given. This may be rather
frustrating to font-aware users.

% \footnote{For example, Apple Pages does not give any feedback that a font is being replaced, and the user does not seem to have any influence on the choice of replacement (except by manually marking all occurrences). In contrast, Microsoft Word does indicate the font replacement by showing the name in the font menu of the font replacement. However, Word simply changes the font completely, so any text written after the replacement is written in a different font as before. Both behaviors leave much to be desired.}

Another problem with visual display is related to so-called \textsc{font
rendering}. Font rendering refers to the process of the actual positioning of
Unicode characters on a page of written text. This positioning is actually a
highly complex challenge, and many things can go wrong in the process. Well-known
highly complex challenge and many things can go wrong in the process. Well-known
rendering difficulties, like proportional glyph size or ligatures, are reasonably
well understood by developers. Nevertheless , the positioning of multiple diacritics relative to
well understood by developers. Nevertheless, the positioning of multiple diacritics relative to
a base character is still a widespread problem. It is especially problematic when more than one diacritic is supposed to be placed above (or below) another. Even within the Latin script, vertical placement often leads to unexpected effects in many modern software applications.
The rendering problems arising in Arabic and in many scripts of Southeast
Asia (like Devanagari or Burmese) are even more complex.

To understand why any problems arise it is important to realize that there are
To understand why these problems arise it is important to realize that there are
basically three different approaches to font rendering. The most widespread is
Adobe's and Microsoft's \textsc{OpenType} system. This approach makes it
relatively easy for font developers, but the font itself does not include all
@@ -446,11 +446,11 @@ \section{Pitfall: Canonical equivalence}
In other words, there are equivalent sequences of Unicode characters that should
be normalized, i.e.~transformed into a unique Unicode-sanctioned representation
of a character sequence called a \textsc{normalization form}. Unicode provides a
Unicode Normalization Algorithm, which essentially puts combining marks
Unicode Normalization Algorithm, which puts combining marks
into a specific logical order and it defines decomposition and composition
transformation rules to convert each string into one of four normalization
forms. We will discuss here the two most relevant normalization forms: NFC and
NFD.\@
NFD.
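
The normalization forms discussed here can be explored directly with Python's standard \texttt{unicodedata} module; a minimal sketch that assumes nothing beyond the standard library:

\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
>>> import unicodedata
>>> nfc = '\u00F1'                            # <(*@ñ@*)> as a single precomposed code point
>>> nfd = unicodedata.normalize('NFD', nfc)
>>> len(nfc), len(nfd)                        # NFD splits it into <n> plus a combining tilde
(1, 2)
>>> unicodedata.normalize('NFC', nfd) == nfc  # composition restores the original
True
\end{lstlisting}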

The first of the three characters above is considered the \textsc{Normalization
Form C (NFC)}, where \textsc{C} stands for composition. When the process of NFC
6 changes: 3 additions & 3 deletions book/chapters/preface.tex
@@ -1,11 +1,11 @@
\chapter{Preface}
\label{preface}

This text is meant as a practical guide for linguists and programmers, who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together.
This text is meant as a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together.

The intersection of the Unicode Standard and the International Phonetic Alphabet is often met with frustration by users. Nevertheless, the two standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA.
The intersection of the Unicode Standard and the International Phonetic Alphabet is often met with frustration by users. Nevertheless, the two standards have provided language researchers with the computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA.

We use quantitative methods in our research to compare languages to uncover and clarify their phylogenetic relationships. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we created a suite of open-source Python and R software packages to work with languages using profiles that adequately describe their orthographic conventions. Using orthography profiles and these tools allows users to tokenize and transliterate text from diverse sources, so that they can be meaningfully compared and analyzed.
In our research, we use quantitative methods to compare languages to uncover and clarify their phylogenetic relationships. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we have created a suite of open-source Python and R software packages to work with languages using profiles that adequately describe their orthographic conventions. Using these tools in combination with orthography profiles allows users to tokenize and transliterate text from diverse sources, so that they can be meaningfully compared and analyzed.

We welcome comments and corrections regarding this book, our source code, and the supplemental case studies that we provide online.\footnote{\url{https://github.com/unicode-cookbook/}} Please use the issue tracker, email us directly, or make suggestions on PaperHive.\footnote{\url{https://paperhive.org/}}

