# Basics in R

[Michael Cysouw](http://cysouw.de/home/index.html)

The latest version of this [Jupyter notebook](http://jupyter.org/) is available at [https://github.com/unicode-cookbook/recipes/Basics](https://github.com/unicode-cookbook/recipes/Basics). 

This use case illustrates orthography profile creation and error reporting in R. More code and examples are available in [Chapter 8](https://github.com/unicode-cookbook/cookbook).

GitHub renders Jupyter notebooks nicely, so you can copy and paste the code from your browser into your interpreter or scripts. If you `git clone` the `recipes` repository and have Jupyter installed on your machine, you can also execute this file locally in your own browser. Install [Jupyter](http://jupyter.org/) and run `jupyter notebook` in this directory to get started.

## Overview


Let's use the R [qlcData](https://cran.r-project.org/web/packages/qlcData/index.html) package to create an orthography profile and do some error checking.

Installation instructions are available here: [https://github.com/unicode-cookbook/recipes](https://github.com/unicode-cookbook/recipes).

For more information, see also [Specifying orthography: harmonization, tokenization and transliteration](https://cran.r-project.org/web/packages/qlcData/vignettes/orthography_processing.html).

## Profiles and error reporting

In [1]:
library(qlcData)

Create some input. The two strings look alike, but the second one mixes Latin and Cyrillic capitals.

In [2]:
test <- c("AABB", "AАBВ")
test

Use the function `write.profile` to produce a basic orthography profile from some data.

In [3]:
write.profile(test)

Grapheme,Frequency,Codepoint,UnicodeName
A,3,U+0041,LATIN CAPITAL LETTER A
B,3,U+0042,LATIN CAPITAL LETTER B
А,1,U+0410,CYRILLIC CAPITAL LETTER A
В,1,U+0412,CYRILLIC CAPITAL LETTER VE
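
The profile separates the Latin and Cyrillic lookalikes because they have different code points. As a minimal sketch in base R (independent of qlcData; `codepoints` is a helper name introduced here for illustration), the same information can be checked directly:

```r
# Two strings that look alike; the second mixes Latin and Cyrillic capitals
test <- c("AABB", "A\u0410B\u0412")

# Illustrative helper: format the code points of one string as U+XXXX labels
codepoints <- function(s) sprintf("U+%04X", utf8ToInt(s))

cp <- lapply(test, codepoints)
cp[[1]]
cp[[2]]

# Flag strings that contain anything outside the ASCII range
has_non_ascii <- vapply(test, function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)
has_non_ascii
```

A check like this is only a rough filter: it flags every non-ASCII character, not just unintended lookalikes.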


In [4]:
test <- c("AABB", "AАBВ")
# the profile only lists Latin A and B, so the Cyrillic lookalikes are reported as unknown
tokenize(test, profile = c("A", "B"))

Warning message:
There were unknown characters found in the input data.
Check output$errors for a table with all problematic strings.

originals,tokenized
AABB,A A B B
AАBВ,A ⁇ B ⁇

Grapheme,Frequency
B,3
A,3

Row,originals,errors
2,AАBВ,A ⁇ B ⁇

Grapheme,Frequency,Codepoint,UnicodeName
А,1,U+0410,CYRILLIC CAPITAL LETTER A
В,1,U+0412,CYRILLIC CAPITAL LETTER VE
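
The idea behind the error report can be sketched in a few lines of base R. This is not how qlcData implements `tokenize`; it is a simplified illustration that works character by character, and the helper `mark_unknown` and its `missing` argument are names assumed here for the example.

```r
test <- c("AABB", "A\u0410B\u0412")
profile <- c("A", "B")

# Illustrative helper: split a string into characters and replace
# every character that is not in the profile with a marker symbol
mark_unknown <- function(s, profile, missing = "\u2047") {
  chars <- strsplit(s, "")[[1]]
  chars[!chars %in% profile] <- missing
  paste(chars, collapse = " ")
}

tokenized <- vapply(test, mark_unknown, character(1),
                    profile = profile, USE.NAMES = FALSE)

# Strings that contain at least one marker are the problematic ones
problematic <- test[grepl("\u2047", tokenized, fixed = TRUE)]
tokenized
problematic
```

For this input the sketch flags the same string as the error table above, although the real function works on graphemes and profiles rather than single characters.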


## Different ways to write a profile


In [5]:
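# example <- "ÙÚÛÙÚÛ", written with escapes: three precomposed letters,
# followed by the same three letters as U plus a combining accent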
example <- '\u00d9\u00da\u00db\u0055\u0300\u0055\u0301\u0055\u0302'
example

In [6]:
profile_1 <- write.profile(example)
profile_1

Grapheme,Frequency,Codepoint,UnicodeName
Ú,1,U+00DA,LATIN CAPITAL LETTER U WITH ACUTE
Ú,1,"U+0055, U+0301","LATIN CAPITAL LETTER U, COMBINING ACUTE ACCENT"
Ù,1,U+00D9,LATIN CAPITAL LETTER U WITH GRAVE
Ù,1,"U+0055, U+0300","LATIN CAPITAL LETTER U, COMBINING GRAVE ACCENT"
Û,1,U+00DB,LATIN CAPITAL LETTER U WITH CIRCUMFLEX
Û,1,"U+0055, U+0302","LATIN CAPITAL LETTER U, COMBINING CIRCUMFLEX ACCENT"

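The first profile lists each accented letter twice because the precomposed and the decomposed spellings are different code point sequences, even though they render identically. A small base R sketch (variable names are illustrative) makes the difference visible:

```r
precomposed <- "\u00d9"   # LATIN CAPITAL LETTER U WITH GRAVE
decomposed  <- "U\u0300"  # LATIN CAPITAL LETTER U + COMBINING GRAVE ACCENT

# The two strings are not equal, although they look the same on screen
same_string <- identical(precomposed, decomposed)

# Number of code points in each spelling
n_pre <- nchar(precomposed, type = "chars")
n_dec <- nchar(decomposed, type = "chars")

same_string
c(n_pre, n_dec)
```
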

In [7]:
profile_2 <- write.profile(example, sep = "")
profile_2

Grapheme,Frequency,Codepoint,UnicodeName
́,1,U+0301,COMBINING ACUTE ACCENT
̀,1,U+0300,COMBINING GRAVE ACCENT
̂,1,U+0302,COMBINING CIRCUMFLEX ACCENT
U,3,U+0055,LATIN CAPITAL LETTER U
Ú,1,U+00DA,LATIN CAPITAL LETTER U WITH ACUTE
Ù,1,U+00D9,LATIN CAPITAL LETTER U WITH GRAVE
Û,1,U+00DB,LATIN CAPITAL LETTER U WITH CIRCUMFLEX


In [8]:
# after NFC normalization every letter is a single precomposed code point
profile_3 <- write.profile(example, normalize = "NFC", sep = "")
profile_3

Grapheme,Frequency,Codepoint,UnicodeName
Ú,2,U+00DA,LATIN CAPITAL LETTER U WITH ACUTE
Ù,2,U+00D9,LATIN CAPITAL LETTER U WITH GRAVE
Û,2,U+00DB,LATIN CAPITAL LETTER U WITH CIRCUMFLEX


In [9]:
# after NFD normalization every letter is split into U plus a combining accent
profile_4 <- write.profile(example, normalize = "NFD", sep = "")
profile_4

Grapheme,Frequency,Codepoint,UnicodeName
́,2,U+0301,COMBINING ACUTE ACCENT
̀,2,U+0300,COMBINING GRAVE ACCENT
̂,2,U+0302,COMBINING CIRCUMFLEX ACCENT
U,6,U+0055,LATIN CAPITAL LETTER U


In [10]:
# with the default Unicode grapheme segmentation (no `sep`), NFD and NFC yield
# the same graphemes and frequencies; only the underlying code points differ
profile_5 <- write.profile(example, normalize = "NFD")
profile_5
profile_6 <- write.profile(example, normalize = "NFC")
profile_6

Grapheme,Frequency,Codepoint,UnicodeName
Ú,2,"U+0055, U+0301","LATIN CAPITAL LETTER U, COMBINING ACUTE ACCENT"
Ù,2,"U+0055, U+0300","LATIN CAPITAL LETTER U, COMBINING GRAVE ACCENT"
Û,2,"U+0055, U+0302","LATIN CAPITAL LETTER U, COMBINING CIRCUMFLEX ACCENT"


Grapheme,Frequency,Codepoint,UnicodeName
Ú,2,U+00DA,LATIN CAPITAL LETTER U WITH ACUTE
Ù,2,U+00D9,LATIN CAPITAL LETTER U WITH GRAVE
Û,2,U+00DB,LATIN CAPITAL LETTER U WITH CIRCUMFLEX

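The effect of normalization on the example string can also be inspected outside of `write.profile`. The following sketch assumes that the stringi package is installed and uses its `stri_trans_nfc`, `stri_trans_nfd` and `stri_count_boundaries` functions; it is an illustration, not a description of how qlcData works internally.

```r
library(stringi)

example <- "\u00d9\u00da\u00db\u0055\u0300\u0055\u0301\u0055\u0302"

nfc <- stri_trans_nfc(example)  # compose to precomposed letters
nfd <- stri_trans_nfd(example)  # decompose to base letter + accent

# Number of code points: mixed input, NFC, NFD
n_codepoints <- c(raw = length(utf8ToInt(example)),
                  nfc = length(utf8ToInt(nfc)),
                  nfd = length(utf8ToInt(nfd)))

# Number of grapheme clusters (user-perceived characters)
n_graphemes <- c(raw = stri_count_boundaries(example, type = "character"),
                 nfc = stri_count_boundaries(nfc, type = "character"),
                 nfd = stri_count_boundaries(nfd, type = "character"))

n_codepoints
n_graphemes
```

The code point counts differ between the three forms, while the number of grapheme clusters stays the same, which is why the profiles without `sep` agree on graphemes and frequencies.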

## Using an orthography profile skeleton

In [11]:
# a few words to be graphemically parsed
example <- c("mishmash", "mishap", "mischief", "scheme")
example

In [12]:
# write a profile skeleton to a file
write.profile(example, file = "sandbox/profile_skeleton.txt")

In [13]:
# tokenize with the profile read back from the file; the result table is in `$strings`
tokenize(example, profile = "sandbox/profile_skeleton.txt")$strings

originals,tokenized
mishmash,m i s h m a s h
mishap,m i s h a p
mischief,m i s c h i e f
scheme,s c h e m e


In [14]:
# make a profile, just select the column 'Grapheme'
profile <- write.profile(example)[, "Grapheme"]
profile

In [15]:
# extend the profile with multigraphs
profile <- c("sh", "ch", "sch", "ie", "oo", profile)
profile

In [16]:
# use the profile to tokenize
tokenize(example, profile)$strings

originals,tokenized
mishmash,m i sh m a sh
mishap,m i sh a p
mischief,m i sch ie f
scheme,sch e m e
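
The multigraphs win over single letters because longer graphemes are tried first. As a rough sketch of that idea in base R (an assumption-laden illustration, not qlcData's implementation; it only works for profiles without regex metacharacters and silently skips characters missing from the profile):

```r
words <- c("mishmash", "mishap", "mischief", "scheme")

# Single letters found in the data, plus hand-added multigraphs
letters_found <- sort(unique(unlist(strsplit(words, ""))))
graphemes <- c("sh", "ch", "sch", "ie", "oo", letters_found)

# Order longest first so that "sch" is tried before "sh", "ch" and "s"
ordered <- graphemes[order(nchar(graphemes), decreasing = TRUE)]
pattern <- paste(ordered, collapse = "|")

# With perl = TRUE the alternatives are tried in the order given
tokens <- regmatches(words, gregexpr(pattern, words, perl = TRUE))
tokenized <- vapply(tokens, paste, character(1), collapse = " ")
tokenized
```

For these four words the sketch reproduces the segmentation shown above. The unused grapheme `oo` does no harm, since it never matches.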
