Merged
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,12 @@
unicode.texnicle

*.adx
*.bcf
*.toc
*.aux
*.glo
*.idx
*.ldx
*.log
*.toc
*.ist
Expand All @@ -25,6 +27,8 @@ unicode.texnicle
*.maf
*.mtc
*.mtc1
*.mw
*.sdx
*.out
*.synctex.gz
*.fdb_latexmk
Expand Down
44 changes: 29 additions & 15 deletions book/chapters/implementation.tex
Expand Up @@ -10,11 +10,12 @@ \section{Overview}

\section{How to install Python and R}
\label{installing-python-and-r}

When you encounter problems installing software, or bugs in programming code, search engines are your friend! Installation problems and incomprehensible error messages have typically been encountered and solved by other users. Try simply copying and pasting the output of an error message into a search engine; the solution is often already somewhere online. We are fans of Stack Exchange\footnote{\url{https://stackexchange.com/}} -- a network of question-and-answer websites -- which is extremely helpful in solving issues regarding software installation, bugs in code, etc.

Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Unix operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks).
Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Linux operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks).
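Assuming Homebrew is already set up, both languages can then be installed from the Terminal with a single command (the formula names below are our assumption and may change over time; verify with \texttt{brew search} if in doubt):

<<eval=FALSE, tidy=FALSE, engine='bash'>>=
# install R and Python with Homebrew
# (formula names assumed current; check with `brew search r`)
brew install r python
@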

Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Unix) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Unix.
Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Linux) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Linux.

Once you have R or Python (or both) installed on your computer, you are ready to use the orthography profiles software libraries presented in the next two sections. As noted above, we make this material available online on GitHub,\footnote{\url{https://github.com/}} a web-based version control system for source code management. GitHub repositories can be cloned or downloaded,\footnote{\url{https://help.github.com/articles/cloning-a-repository/}} so that you can work through the examples on your local machine. Use your favorite search engine to figure out how to install Git on your computer and learn more about using Git.\footnote{\url{https://git-scm.com/}} In our GitHub repository, we make the material presented below (and more use cases described briefly in Section \ref{use-cases}) available as Jupyter Notebooks. Jupyter Notebooks provide an interface where you can run and develop source code using the browser as an interface. These notebooks are easily viewed in our GitHub repository of use cases.\footnote{\url{https://github.com/unicode-cookbook/recipes}}
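For example, assuming Git is installed on your machine, the recipes repository can be cloned and entered from the terminal as follows:

<<eval=FALSE, tidy=FALSE, engine='bash'>>=
# clone the use cases repository and change into it
git clone https://github.com/unicode-cookbook/recipes.git
cd recipes
@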

Expand Down Expand Up @@ -432,17 +433,24 @@ \subsection*{Installation}
by using Rscript to get the paths to the executables within the terminal.

<<eval=FALSE, tidy=FALSE, engine='bash'>>=
# get the paths to the R executables in bash
pathT=`Rscript -e 'cat(file.path(find.package("qlcData"),
"exec", "tokenize"))'`
pathW=`Rscript -e 'cat(file.path(find.package("qlcData"),
"exec", "writeprofile"))'`

# make softlinks to the R executables in /usr/local/bin
# you will have to enter your user's password!
pathT=`Rscript -e 'cat(system.file("exec/tokenize", package="qlcData"))'`
pathW=`Rscript -e 'cat(system.file("exec/writeprofile", package="qlcData"))'`
@

Then you can make softlinks to the R executables in \texttt{/usr/local/bin} by using the following command in the terminal:

<<eval=FALSE, tidy=FALSE, engine='bash'>>=
sudo ln -is $pathT $pathW /usr/local/bin
@

You can also do this within R by using the following commands, again possibly replacing \texttt{/usr/local/bin} with a suitable location on your system:

<<eval=FALSE>>=
# link executables from within R
file.symlink(system.file("exec/tokenize", package="qlcData"), "/usr/local/bin")
file.symlink(system.file("exec/writeprofile", package="qlcData"), "/usr/local/bin")
@

After creating these softlinks it should be possible to access the
\texttt{tokenize} function from the shell. Try \texttt{tokenize --help} to test
the functionality.
Expand All @@ -458,10 +466,10 @@ \subsection*{Installation}
% online at \url{TODO}. The webapps are also included inside the \texttt{qlcData}
% package and can be started with the following helper function:

To make the functionality even more accessible, we have prepared webapps with
the \texttt{Shiny} framework for the R functions. The webapps are
included inside the \texttt{qlcData} package and can be started with the
helper function (in R): \texttt{launch\_shiny('tokenize')}.
% To make the functionality even more accessible, we have prepared webapps with
% the \texttt{Shiny} framework for the R functions. The webapps are
% included inside the \texttt{qlcData} package and can be started with the
% helper function (in R): \texttt{launch\_shiny('tokenize')}.

% <<eval=FALSE>>=
% launch_shiny('tokenize')
Expand Down Expand Up @@ -1204,5 +1212,11 @@ \section{Recipes online}

\noindent The ASJP use case shows how to download the full set of ASJP wordlists, to combine them into a single large CSV file, and to tokenize the ASJP orthography. The Dutch use case takes as input the 10K corpus for Dutch (``nld'') from the Leipzig Corpora Collection,\footnote{\url{http://wortschatz.uni-leipzig.de/en/download/}} which is then cleaned and tokenized with an orthography profile that captures the intricacies of Dutch orthography.

% In closing, using GitHub to share code and data provides a platform for sharing scientific results and it also promotes a means for scientific replicability of results. Moreover, we find that in cases where the scientists are building the to tools for analysis, open repositories and data help to ensure that what you see is what you get.
\section{Closing words}

In closing, we hope that these rather elaborate musings on writing systems, Unicode, and the IPA will help readers appreciate the progress that has been made over the last decades, while acknowledging the many pitfalls that are still lurking below the surface. But mainly we hope that our proposals point towards a way forward in sharing scientific data, interpretations, and analyses in a more transparent manner.

GitHub (or any other similar service) provides a platform for sharing scientific research and promotes the replicability of scientific results. Not only do we use GitHub for the software packages that accompany this book, but we actually used it to write this book. GitHub allowed us to work on the book openly and collaboratively, and then to participate interactively in an open review process by using an issue tracker and version control.\footnote{\url{https://userblogs.fu-berlin.de/langsci-press/2018/07/11/what-it-means-to-be-open-and-community-based-the-unicode-cookbook-as-a-showcase/}} We will continue to use our repositories to make corrections and updates to the book and to the associated orthography profile software packages when necessary.\footnote{\url{https://github.com/unicode-cookbook/}}

Moreover, we find that in situations where scientists are building tools for analysis, open repositories and open data help to ensure that what you see is what you get.

20 changes: 10 additions & 10 deletions book/chapters/introduction.tex
Expand Up @@ -239,14 +239,14 @@ \subsubsection*{Binary encoding}
To also allow for different uppercase and lowercase letters and for a large
variety of control characters to be used in the newly developing technology of
computers, the American Standards Association decided to propose a new 7-bit
encoding in 1963 (with $2^7 = 128$ different possible characters), known as the
encoding in 1963 (with 2\textsuperscript{7} = 128 different possible characters),
known as the
\textsc{American Standard Code for Information Interchange} (ASCII), geared
towards the encoding of English orthography. With the ascent of other
orthographies in computer usage, the wish to encode further variations of Latin
letters (including German <ß> and various letters with diacritics, e.g.\ <è>) led the
Digital Equipment Corporation to introduce an 8-bit \textsc{Multinational
Character Set} (MCS, with $2^8 = 256$ different possible characters), first used
with the introduction of the VT{\large 220} Terminal in 1983.
Character Set} (MCS, with 2\textsuperscript{8} = 256 different possible characters), first used with the introduction of the VT{\large 220} Terminal in 1983.

Because 256 characters were clearly not enough for the unique representation of
many different characters
Expand All @@ -269,17 +269,17 @@ \subsubsection*{Binary encoding}
In the 1980s various people started to develop true
international code sets. In the United States, a group of computer scientists
formed the \textsc{unicode consortium}, proposing a 16-bit encoding in 1991
(with $2^{16} = 65,536$ different possible characters). At the same time in
(with 2\textsuperscript{16} = 65,536 different possible characters). At the same time in
Europe, the \textsc{international organization for standardization} (ISO) was
working on ISO~10646 to replace the ISO/IEC~8859 standard. Their first draft of
the \textsc{universal character set} (UCS) in 1990 was 31-bit (with
theoretically $2^{31} = 2,147,483,648$ possible characters, but because of some
theoretically 2\textsuperscript{31} = 2,147,483,648 possible characters, but because of some
technical restrictions only 679,477,248 were allowed). Since 1991, the Unicode
Consortium and the ISO jointly develop the \textsc{unicode standard}, or
ISO/IEC~10646, leading to the current system including the original 16-bit
Unicode proposal as the \textsc{basic multilingual plane}, and 16 additional
planes of 16-bit for further extensions (with in total $(1+16) \cdot 2^{16} =
1,114,112$ possible characters). The most recent version of the Unicode Standard
planes of 16-bit for further extensions (with in total (1+16) $\times$ 2\textsuperscript{16} =
1,114,112 possible characters). The most recent version of the Unicode Standard
(currently at version number 11.0.0) was published in June 2018 and it defines
137,374 different characters \citep{Unicode2018}.

Expand Down Expand Up @@ -382,10 +382,10 @@ \subsubsection*{Script systems}

Breaking it down further, a script consists of \textsc{graphemes}, which are writing
system-specific minimally distinctive symbols (see below). Graphemes may consist of one or more
\textsc{characters}. The term \textsc{character} is overladen. In the linguistic terminology of writing
systems, a \textsc{character} is a general term for any self-contained element
\textsc{characters}. The term `character' is overloaded. In the linguistic terminology of writing
systems, a character is a general term for any self-contained element
in a writing system. A second interpretation is used as a conventional term for a unit in the Chinese writing
system \citep{Daniels1996}. In technical terminology, a \textsc{character}
system \citep{Daniels1996}. In technical terminology, a character
refers to the electronic encoding of a component in a writing system that has semantic
value (see Section \ref{character-encoding-system}). Thus in this work we must navigate
between the general linguistic and technical terms for \textsc{character}
Expand Down
6 changes: 3 additions & 3 deletions book/chapters/ipa_background.tex
Expand Up @@ -14,7 +14,7 @@ \chapter{The International Phonetic Alphabet}
diacritics (Section~\ref{EncodingIPA}). Occurring a little over a hundred years after
the inception of the IPA, its encoding was a major challenge
(Section~\ref{need-for-multilingual-environment}); many
linguists have encountered the pitfalls when the two are used together
linguists have encountered pitfalls when the two are used together
(Chapter~\ref{ipa-meets-unicode}).

% ==========================
Expand Down Expand Up @@ -48,7 +48,7 @@ \section{Brief history}
\url{https://en.wikipedia.org/wiki/History\_of\_the\_International\_Phonetic_Alphabet}.}

Over the years there have been several revisions, but mostly minor ones. Articulation
labels -- what are often called \textit{features} even though the IPA
labels -- what are often called \textit{features}, even though the IPA
deliberately avoids this term -- have changed, e.g.\ terms like \textit{lips}, \textit{throat}
or \textit{rolled} are no longer used. Phonetic symbol values have changed, e.g.\
voiceless is no longer marked by <h>. Symbols have been dropped, e.g.\ the
Expand Down Expand Up @@ -292,7 +292,7 @@ \section{IPA encodings}
devised a system of base characters with secondary diacritic marks
(e.g.\ in the previous example <kp>, the base character, is modified with <W>).
This encoding approach is
also used in SAMPA and X-SAMPA (Section~\ref{sampa-xsampa}) and in the
also used in SAMPA and X-SAMPA (see below) and in the
ASJP.\footnote{See the ASJP use case in the online supplementary
materials to this book: \url{https://github.com/unicode-cookbook/recipes}.}
But before UPSID, SAMPA and ASJP, IPA was encoded with numbers.
Expand Down
2 changes: 1 addition & 1 deletion book/chapters/ipa_meets_unicode.tex
Expand Up @@ -25,7 +25,7 @@ \section{The twain shall meet}
sometimes look like the Unicode Consortium is making incomprehensible decisions,
but it is important to realize that the consortium has tried and is continuing
to try to be as consistent as possible across a wide range of use cases, and it
does place linguistic traditions above other orthographic choices. Furthermore,
does not place linguistic traditions above other orthographic choices. Furthermore,
when we look at the history of how the IPA met Unicode, we see that many of the
decisions for IPA symbols in the Unicode Standard come directly from the
International Phonetic Association itself. Therefore, many pitfalls that we will
Expand Down
7 changes: 4 additions & 3 deletions book/chapters/orthography_profiles.tex
Expand Up @@ -193,13 +193,14 @@ \subsection*{File Format}
% normalized following NFC (or NFD if specified in the metadata)
that includes information pertinent to the orthography.\footnote{See
Section~\ref{pitfall-file-formats} in which we suggest to use NFC,
no-BOM and LF line breaks because of the pitfalls they avoid. A keen reviewer notes, however, that specifying
a convention for line endings and BOM is overly strict because most
no-BOM and LF line breaks because of the pitfalls they avoid. Specifying
a convention for line endings and BOM is often overly strict because most
computing environments (now) transparently handle both alternatives.
For example, using Python a file can be decoded using the encoding
``utf-8-sig'', which strips away the BOM (if present) and reads
an input file in text mode, so that both line feed variants ``LF'' and
``CRLF'' will be stripped.}
``CRLF'' will be stripped. However, note that most shells (e.g.\ bash) will not
behave properly with CRLF line endings.}

\item \textsc{A profile is a delimited text file with an obligatory header
line}. A minimal profile must have a single column with the header \texttt{Grapheme}.
Expand Down