## `lab07`—Probabilistic Language Prediction

❖ Objectives

-   Work with data in files.
-   Learn the standard pipeline of data analysis:  data cleaning and preparation, data processing, and output.

<div class="alert alert-warning">
**Pair Programming**
<br />
Since this lab is fairly involved, we encourage you to work in pairs *at a single machine*.  You and your partner will consult, and should occasionally trade off so that the time at the keyboard is roughly equal.  At the end, when you report collaborators, please report the names and NetIDs of all partners in this lab exercise.  (In exceptional cases, such as when the room layout requires it, trios are permitted.)
</div>

A random sampling of English text produces approximately the following letter frequency distribution:

<img src="./img/freq-eng.png" width="80%;"/>

whereas Latin has the letter frequency distribution:

<img src="./img/freq-lat.png" width="80%;"/>

and Welsh has the letter frequency distribution:

<img src="./img/freq-cym.png" width="80%;"/>

Each language tends to have a unique "fingerprint" because of the relative frequency of letters and sounds.  Such letter frequency information could be used, for instance, to determine how much type should be ordered for a letterpress, or how many tiles should be included in a country-specific version of Scrabble.

Today you will use this fingerprint to assign rough probabilities to the likely language of a given text sample in an unknown language.  (This is similar to what [Google Translate](https://translate.google.com/) does when it auto-detects the language of a text sample, except that it uses whole words instead of letter frequencies to make its guess.)

There are three steps in the data processing pipeline for you to complete today:

1.  Count the number of times each letter occurs in the text sample.  Then divide each count by the total number of letters to get the normalized letter frequency distribution.
1.  Load the reference language frequencies.
1.  Predict the most likely language based on comparing the text letter frequency with each of the reference frequencies.

<br/>
<div class="alert alert-info">
We will restrict ourselves to the 26 letters of the basic Latin alphabet, disallowing diacritics ('naïve'→'naive'), accents ('recherché'→'recherche'), and nonbasic letters ('Skjærvø'→'Skjarvo').  (If you are a native speaker of another language, we sincerely apologize for this rank philistinism.)
</div>
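
The restriction to the 26 basic letters can be sketched in code.  The following is a minimal illustration, not part of the lab's required pipeline: it assumes Python 3 and uses the standard `unicodedata` module to split accented letters into a base letter plus a combining mark.  The function name `to_basic_latin` and the small `NONBASIC` table are hypothetical names introduced here, since letters such as 'æ' and 'ø' have no Unicode decomposition and must be mapped by hand.

```python
import unicodedata

# Hypothetical hand-written table for letters that have no Unicode decomposition.
NONBASIC = {'æ': 'a', 'Æ': 'A', 'ø': 'o', 'Ø': 'O'}

def to_basic_latin(text):
    # Replace the letters that cannot be decomposed, using the table above.
    for letter, replacement in NONBASIC.items():
        text = text.replace(letter, replacement)
    # Split accented letters into a base letter plus combining marks ...
    decomposed = unicodedata.normalize('NFD', text)
    # ... and keep everything except the combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(to_basic_latin('naïve'))
print(to_basic_latin('recherché'))
print(to_basic_latin('Skjærvø'))
```

The sample texts later in this lab have mostly been simplified in this way already, so you do not need to call anything like this yourself.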

### 1.  Calculate the normalized letter frequencies.

In order to calculate letter frequencies, you need a list of letters and the text in all upper-case letters.  Python's `string` module provides the upper-case alphabet as the built-in string `ascii_uppercase`.  To avoid confusion, we will rename it `alphabet` when we `import` it.

In [1]:
from string import ascii_uppercase as alphabet
print(alphabet)

ABCDEFGHIJKLMNOPQRSTUVWXYZ


In [2]:
# Our example text.
text = 'Jackdaws love my big Sphinx of Quartz.'
text = text.upper()
print(text)

JACKDAWS LOVE MY BIG SPHINX OF QUARTZ.


Now create an empty frequency dictionary `letter_freq`.  Loop over each letter of the `alphabet` and `count` the number of times each letter occurs in `text`.  Add this count to `letter_freq`.

In [3]:
letter_freq = {}  # a blank dictionary

# Loop over the alphabet.
for letter in alphabet:
    # For each letter, get the number of times it occurs in the string `text`.
    letter_count = text.count(letter)
    letter_freq[letter] = letter_count

letter_freq

{'A': 3,
 'B': 1,
 'C': 1,
 'D': 1,
 'E': 1,
 'F': 1,
 'G': 1,
 'H': 1,
 'I': 2,
 'J': 1,
 'K': 1,
 'L': 1,
 'M': 1,
 'N': 1,
 'O': 2,
 'P': 1,
 'Q': 1,
 'R': 1,
 'S': 2,
 'T': 1,
 'U': 1,
 'V': 1,
 'W': 1,
 'X': 1,
 'Y': 1,
 'Z': 1}
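
As an aside, the same tally can be produced without calling `count` once per letter.  The sketch below is an alternative, not the method this lab asks for: it assumes the standard library's `collections.Counter`, which counts every character in a single pass.

```python
from collections import Counter
from string import ascii_uppercase as alphabet

text = 'Jackdaws love my big Sphinx of Quartz.'.upper()

# Counter tallies every character in one pass, including spaces and punctuation.
tally = Counter(text)

# Keep only the 26 letters; a Counter reports 0 for anything it never saw.
letter_freq = {letter: tally[letter] for letter in alphabet}
print(letter_freq['A'])
print(letter_freq['S'])
```

Either approach should give the same dictionary for this example; the loop with `count` is easier to read, which is why the lab uses it.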

The final step is to normalize the values.  To do this, you need to calculate the total number of letters in `text` (letters, NOT whitespace, punctuation, or digits).  Since this is a bit involved, the following lines of code will give you a copy of `text` without whitespace, punctuation, or digits:

In [4]:
# These are built-in collections of characters, useful for just this sort of filtering.
from string import whitespace, punctuation, digits
print(whitespace, punctuation, digits)
for character in whitespace+punctuation+digits:
    text = text.replace(character, '')

print(text)

('\t\n\x0b\x0c\r ', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', '0123456789')
JACKDAWSLOVEMYBIGSPHINXOFQUARTZ


Now set each frequency value in the dictionary to its normalized value.

In [5]:
for key in letter_freq.keys():
    letter_freq[key] = letter_freq[key] * 1.0 / len(text)
letter_freq

{'A': 0.0967741935483871,
 'B': 0.03225806451612903,
 'C': 0.03225806451612903,
 'D': 0.03225806451612903,
 'E': 0.03225806451612903,
 'F': 0.03225806451612903,
 'G': 0.03225806451612903,
 'H': 0.03225806451612903,
 'I': 0.06451612903225806,
 'J': 0.03225806451612903,
 'K': 0.03225806451612903,
 'L': 0.03225806451612903,
 'M': 0.03225806451612903,
 'N': 0.03225806451612903,
 'O': 0.06451612903225806,
 'P': 0.03225806451612903,
 'Q': 0.03225806451612903,
 'R': 0.03225806451612903,
 'S': 0.06451612903225806,
 'T': 0.03225806451612903,
 'U': 0.03225806451612903,
 'V': 0.03225806451612903,
 'W': 0.03225806451612903,
 'X': 0.03225806451612903,
 'Y': 0.03225806451612903,
 'Z': 0.03225806451612903}
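
A quick way to check a normalization is to confirm that the shares add up to 1.  The sketch below recomputes the frequencies for the cleaned example text and sums them; the names `freq` and `total` are introduced here for illustration only.

```python
from string import ascii_uppercase as alphabet

text = 'JACKDAWSLOVEMYBIGSPHINXOFQUARTZ'  # the cleaned example text from above

# Recompute the normalized frequencies for the cleaned text.
freq = {letter: text.count(letter) / float(len(text)) for letter in alphabet}

# Every character in `text` is one of the 26 letters, so the shares add up to 1.
total = sum(freq.values())
print(abs(total - 1.0) < 1e-9)
```

Note that this only holds when every remaining character is a basic letter.  If the text still contains a character such as 'é', it is included in `len(text)` but not in any letter's count, so the total falls slightly below 1.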

<h4 style="color:#FF8C00">Exercises</h4>

Now we will turn the above process into a general function to process a string into its letter frequency.

-   Compose a function `calc_freq` which accepts a string `text`.  `calc_freq` should `return` a dictionary containing the normalized frequency by letter.
    
    You should follow the process outlined above to write this function.
    
<div class="alert alert-warning">
When diagnosing the behavior of your code, we encourage you to use `print` statements freely.
</div>

In [6]:
# define your function here
from string import whitespace, punctuation, digits
from string import ascii_uppercase as alphabet

def calc_freq(text):
    # Create an empty frequency dictionary letter_freq.
    letter_freq = {}
    
    # Make text upper-case.
    ## YOU WRITE THIS LINE
    text = text.upper()
    
    # Loop over each letter of the alphabet:
    for letter in alphabet:
        # Count the number of times each letter occurs in text.
        letter_freq[letter] = text.count(letter)
    
    # Make a copy of text without non-alphabet characters.
    # (whitespace, punctuation, and digits were imported at the top of this cell.)
    for character in whitespace+punctuation+digits:
        text = text.replace(character, '')
    
    # Normalize the frequencies and put the results back into letter_freq.
    for key in letter_freq.keys():
        letter_freq[key] = letter_freq[key] *1.0 / len(text)
    
    # Finally, return the dict letter_freq.
    return letter_freq

In [7]:
# test your code here.  You may edit this cell, and you may use any sample text, but the following is provided for convenience.
text = """Neither the naked hand nor the understanding left to itself can effect much. It is by instruments and helps that the work is done,
which are as much wanted for the understanding as for the hand. And as the instruments of the hand either give motion or guide it, so the
instruments of the mind supply either suggestions for the understanding or cautions.  (Francis Bacon, Novum Organon, Aphorism II)"""
calc_freq(text)

{'A': 0.065625,
 'B': 0.00625,
 'C': 0.025,
 'D': 0.05,
 'E': 0.10625,
 'F': 0.03125,
 'G': 0.025,
 'H': 0.071875,
 'I': 0.078125,
 'J': 0.0,
 'K': 0.00625,
 'L': 0.0125,
 'M': 0.028125,
 'N': 0.109375,
 'O': 0.065625,
 'P': 0.0125,
 'Q': 0.0,
 'R': 0.0625,
 'S': 0.075,
 'T': 0.10625,
 'U': 0.040625,
 'V': 0.00625,
 'W': 0.009375,
 'X': 0.0,
 'Y': 0.00625,
 'Z': 0.0}

In [8]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
test_text1 = """The study of nature with a view to works is engaged in by the mechanic, the mathematician, the physician, the alchemist, and
the magician; but by all (as things now are) with slight endeavor and scanty success.  (Francis Bacon, Novum Organon, Aphorism V)"""
result_text1 = calc_freq(test_text1)
assert isclose(result_text1['T'], 0.09045226130653267) and \
       isclose(result_text1['Q'], 0.0) and \
       isclose(result_text1['Y'], 0.0251256281407035)
print('Success!')

Success!


In [9]:
# it should pass this test---do NOT edit this cell
test_text2 = """In order to penetrate into the inner and further recesses of nature, it is necessary that both notions and axioms be derived
from things by a more sure and guarded way, and that a method of intellectual operation be introduced altogether better and more certain.
(Francis Bacon, Novum Organon, Aphorism XVIII)"""
result_text2 = calc_freq(test_text2)
assert isclose(result_text2['K'], 0.0) and \
       isclose(result_text2['N'], 0.09523809523809523) and \
       isclose(result_text2['L'], 0.015873015873015872)
print('Success!')

Success!


### 2.  Load the reference language frequencies.

As in previous labs, some of the data we are interested in analyzing are stored in files.  Each language has a characteristic pattern of letter frequencies stored in the `./lang/` directory of `lab7`.  Reference frequencies for the following languages are available.  (These frequencies are derived from the work of Stefan Trost<sup>[[Trost2015](http://www.sttmedia.com/characterfrequencies)]</sup> and used with his permission.)

In [10]:
from os import listdir
listdir('./lang/')  # this function shows us what is located in a given directory (as a list)

['afrikaans.txt',
 'catalan.txt',
 'danish.txt',
 'english.txt',
 'finnish.txt',
 'french.txt',
 'german.txt',
 'latin.txt',
 'polish.txt',
 'portuguese.txt',
 'spanish.txt',
 'welsh.txt']

In order to obtain the reference language frequencies, you will first write a function `load_ref` to load a given language reference file.  You will then write a function `load_languages` which uses `load_ref` with a list of file names to create a `dict` of all of the language frequencies available.

Take a look at the file format of `danish.txt`:
    
    A,8.27%
    B,1.42%
    C,0.45%
    ...

If you wanted to read this into a dictionary, you could take each line and split it by the comma.

Since you want to include the second part as a `float`, you need to convert it.  Try this out directly (but it will *fail*):

In [11]:
testDict = {}
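#testDict['A'] = float('8.27%')  # uncomment this line to see the conversion fail
' 8.27%'.strip().strip('%')  # stripping whitespace and the percent sign leaves a valid number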


'8.27'

<div class="alert alert-danger">
The problem is that `float` only accepts text that looks like a number, such as `'8.27'`.  The percent sign is not part of a valid number, so `float('8.27%')` raises a `ValueError` instead of returning a value.  You need to strip the percent sign before converting.
</div>
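
To see the failure without stopping the notebook, the conversion can be wrapped in `try`/`except`.  This is a small sketch; the variable names `raw` and `cleaned` are introduced here for illustration.

```python
raw = '8.27%'

try:
    float(raw)  # '%' is not part of a valid number, so this raises ValueError
except ValueError as error:
    print('conversion failed: %s' % error)

# Removing the trailing percent sign first leaves a string that float() accepts.
cleaned = raw.strip().strip('%')
print(float(cleaned))
```

The result is still a percentage (8.27), not a fraction, so it must be divided by 100 before it can be compared with the normalized frequencies from part 1.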

<h4 style="color:#FF8C00">Exercises</h4>

-  In order to convert a string of a percent value into a float, compose a function `p2f` (short for `percentToFloat`) which accepts a string `value`.  `p2f` `strip`s the percent sign off of the string `value`, converts this to a `float`, and then divides by `100` and `return`s the result.  (Python provides a function `round` which you may elect to use here to simplify the result, but this is not required.)

In [12]:
# define your function here
def p2f(value):
    # Strip any whitespace and then strip the percent sign off of value.
    ## YOU WRITE THIS
    value = value.strip().strip('%')
    
    # Convert the result to a float and divide by 100.
    result = float(value)/100.0 ## YOU WRITE THIS
    
    # Finally, return the result.
    return result

In [13]:
# test your code here.  You may edit this cell, and you may use any sample value, but the following is provided for convenience.
value = "5.6%"
p2f(value)

0.055999999999999994

In [14]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
assert isclose(p2f('1.79%'), 0.0179)
print('Success!')

Success!


Now try to add to the dictionary:

In [15]:
testDict = {}
testDict['A'] = p2f('8.27%')

<h4 style="color:#FF8C00">Exercises</h4>

-   Compose a function `load_ref` which accepts a string `language`.  `load_ref` should `return` a `dict` containing the reference language letter frequencies stored in the file `./lang/language.txt`, where `language` will be replaced by `danish`, `catalan`, etc.

In [16]:
a,b = 'A,8.27%'.split(',')

print(a,b)

('A', '8.27%')


In [17]:
# define your function here
def load_ref(language):
    # Create an empty dictionary called `languages`.
    languages = {}
    
    # Open the language file, read the data out, and close the file.
    ## YOU WRITE THIS BLOCK (check lab6 if you need a refresher)
    ## Keep in mind the differences between read(), readlines(), and read().splitlines()
    infile = open('./lang/'+language+'.txt', 'r')
    data = infile.read().splitlines()
    infile.close()
    
    # Loop over each line in the data.
    ## YOU WRITE THIS BLOCK
    for line in data:
        # Split each line at the comma.  The first part should be assigned to a variable `letter`, the second part to a variable `frequency`.
        ## YOU WRITE THIS LINE
        letter,frequency = line.split(',')
    
        # Add the second part (the frequency) to the dictionary as the value (converted to a float)
        # with the first part (the letter) as the key.  MAKE SURE THE KEY IS UPPER-CASE!
        languages[letter.upper()] = p2f(frequency)
    
    # Finally, return the dict `languages`.
    return languages

In [18]:
# test your code here.  You may edit this cell, and you may use any language listed above, but the following is provided for convenience.
language = 'german'
load_ref(language)

{'A': 0.061200000000000004,
 'B': 0.0196,
 'C': 0.0316,
 'D': 0.049800000000000004,
 'E': 0.1693,
 'F': 0.0149,
 'G': 0.0302,
 'H': 0.049800000000000004,
 'I': 0.0802,
 'J': 0.0024,
 'K': 0.0132,
 'L': 0.036000000000000004,
 'M': 0.0255,
 'N': 0.10529999999999999,
 'O': 0.0254,
 'P': 0.0067,
 'Q': 0.0002,
 'R': 0.0689,
 'S': 0.0716,
 'T': 0.0579,
 'U': 0.044800000000000006,
 'V': 0.0084,
 'W': 0.0178,
 'X': 0.0005,
 'Y': 0.0005,
 'Z': 0.0121}

In [19]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
language = 'english'
test_ref = load_ref(language)
assert isclose(test_ref['A'], 0.0834)
print('Success!')

Success!


Next, you need to write a function `load_languages` which finds the available languages and creates a dictionary for each using `load_ref`.  All of these dictionaries will then be added to an overall dictionary `master`, keyed by language.  That is, `master` will look something like this:

        `master` is a dictionary with keys:
            'afrikaans' -> (a dictionary with keys:
                                  letter -> frequency)
            'catalan'   -> (a dictionary with keys:
                                  letter -> frequency)
            'danish'    -> (a dictionary with keys:
                                  letter -> frequency)

Specifically,
    
    master['afrikaans']  # returns a dict containing the reference language frequencies for Afrikaans
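
A dictionary of dictionaries is indexed twice: once for the language and once for the letter.  The sketch below builds a tiny stand-in for `master` by hand; the name `toy_master` is hypothetical, and it holds only two letters per language, using values that appear in this lab's outputs.

```python
# A two-language stand-in for `master`, with only two letters per language.
toy_master = {
    'german': {'A': 0.0612, 'B': 0.0196},
    'welsh':  {'A': 0.0936, 'B': 0.0182},
}

# The first key picks the language; the second picks the letter.
print(toy_master['german']['A'])

# Looping over the outer dictionary visits one language at a time.
for language in sorted(toy_master):
    print(language)
```

The real `master` has the same shape, with one entry per file in `./lang/` and 26 letters in each inner dictionary.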

You need to get a list of available language files.  You can then open each of them, reading them into a dictionary using `load_ref`.

While we could just list these out and do it manually, that's a little clunky and hard to fix if we add more languages later (or if one is missing).  Thus we will instruct Python to ask which files are available to us in the directory using the handy `listdir` function<sup>[[docs](https://docs.python.org/3/library/os.html#os.listdir)]</sup>.

In [20]:
from os import listdir
languageNames = listdir('lang')
print(languageNames)

['afrikaans.txt', 'catalan.txt', 'danish.txt', 'english.txt', 'finnish.txt', 'french.txt', 'german.txt', 'latin.txt', 'polish.txt', 'portuguese.txt', 'spanish.txt', 'welsh.txt']
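
Note that `listdir` returns file names with the `.txt` extension, while `load_ref` expects the bare language name.  One way to drop the extension is the standard `os.path.splitext` function, sketched below on a hand-typed list so that it runs without the `lang` directory; slicing off the last four characters would also work for these files.

```python
from os.path import splitext

# A few of the file names shown in the listing above.
file_names = ['afrikaans.txt', 'catalan.txt', 'welsh.txt']

# splitext separates 'welsh.txt' into ('welsh', '.txt'); keep the first part.
languages = [splitext(name)[0] for name in file_names]
print(languages)
```

Unlike slicing, `splitext` does not depend on the extension being exactly four characters long.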


Now we can loop over the list `languageNames`, and for each language we can 1) create a dictionary using `load_ref` and 2) add this dictionary to the master dictionary `master` with the language as the key.  Do this in the function `load_languages` (which needs no parameters) and `return` `master`.

In [21]:
# define your function here
def load_languages():
    # Create an empty dictionary `master`.
    master = {}
    
    # Get a list of language files.
    ## YOU WRITE THIS LINE (you can use the code block above as a starting point)
    from os import listdir
    languageNames = listdir('lang')
    
    # Call `load_ref` on each of these and add the resulting dictionary as a value to `master` with key `language`.
    ## YOU WRITE THIS LOOP
    for language in languageNames:
        language = language[:-4]
        language_freq = load_ref(language)
        master[language] = language_freq
    
    # Finally, return the dict `master`.
    return master

In [22]:
# test your code here.  You may edit this cell.
master = load_languages()
print(master.keys())
print(master['welsh'])

['danish', 'latin', 'welsh', 'finnish', 'portuguese', 'german', 'spanish', 'french', 'catalan', 'english', 'polish', 'afrikaans']
{'A': 0.09359999999999999, 'C': 0.028900000000000002, 'B': 0.0182, 'E': 0.08310000000000001, 'D': 0.09880000000000001, 'G': 0.0341, 'F': 0.031200000000000002, 'I': 0.0698, 'H': 0.0387, 'K': 0.0, 'J': 0.0013, 'M': 0.0248, 'L': 0.050300000000000004, 'O': 0.0559, 'N': 0.0812, 'Q': 0.0, 'P': 0.0091, 'S': 0.0291, 'R': 0.0652, 'U': 0.0258, 'T': 0.028399999999999998, 'W': 0.0398, 'V': 0.0, 'Y': 0.0849, 'X': 0.0, 'Z': 0.0}


In [23]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
test_master = load_languages()
assert isclose(test_master['english']['A'], 0.0834)
print('Success!')

Success!


In [24]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
test_master = load_languages()
assert isclose(test_master['catalan']['Z'], 0.001)
print('Success!')

Success!


### 3.  Predict the most likely language.

With `load_languages` and `calc_freq`, you are now prepared to assess the similarity of a text to a reference language.  This last step is the most mathematically involved.

We will define a frequency metric $f$ to assess the closeness of the match between two sets of frequencies.  In plain language, you will calculate the difference between the two lists $L_{\text{text}}$ and $L_{\text{ref}}$ letter by letter, which yields a third list of the differences.  To make every entry of this list positive, take the absolute value of each difference.  (This keeps equal but opposite errors from canceling each other out.)  To provide a single value to compare, let $f$ be equal to the sum of these absolute values.  Thus a low value of $f$ means a low difference and a better fit between two frequency distributions than a high value of $f$.  As an equation,

$$
f \left( L_{\text{text}}, L_{\text{ref}} \right) =
\sum_{\text{letters}} \left| L_{\text{text}} - L_{\text{ref}}\right| \text{.}
$$

To be clear, the metric we are calculating, $f$, is a metric for how *different* two letter frequency distributions are.
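
A small worked example may help.  The sketch below applies the formula to made-up frequencies over a three-letter alphabet; the numbers are for illustration only and do not come from any reference file.

```python
# Made-up frequencies over a three-letter alphabet, for illustration only.
L_text = {'A': 0.5, 'B': 0.3, 'C': 0.2}
L_ref  = {'A': 0.4, 'B': 0.4, 'C': 0.2}

# |0.5 - 0.4| + |0.3 - 0.4| + |0.2 - 0.2| = 0.1 + 0.1 + 0.0
f = sum(abs(L_text[letter] - L_ref[letter]) for letter in L_text)
print('%.2f' % f)
```

Two identical distributions give $f = 0$.  When both distributions sum to 1, $f$ can be at most 2, which happens only when the two share no letters at all.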

<h4 style="color:#FF8C00">Exercises</h4>

-   Compose a function `calc_match` which accepts two dictionaries `L_text` and `L_ref`.  `calc_match` should return the calculated metric `f` according to the formula above.

In [25]:
d = {}
d['a'] = 1
d['b'] = 3
d.get('c')
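
The cell above produces no output because `d.get('c')` returns `None` when the key is missing, rather than raising a `KeyError` as `d['c']` would.  The following sketch shows the difference, including the optional second argument of `get`.

```python
d = {'a': 1, 'b': 3}

# Indexing with d['c'] would raise KeyError; get returns None instead.
print(d.get('c'))

# A second argument supplies the value to return when the key is missing.
print(d.get('c', 0))
print(d.get('b', 0))
```

This can be handy if one of your dictionaries might lack a letter, although the frequency dictionaries built in this lab should contain all 26 keys.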

In [26]:
# define your function here
def calc_match(L_text, L_ref):
    # Create an empty dictionary `L_diff`.
    L_diff = {}
    
    # Loop through the keys of the dictionaries (either by loading `alphabet` as above or by using `L_ref.keys()`).
    # As a sanity check, assert that the key (letter) exists in both L_text and L_ref.
    # Calculate the absolute value of the difference between each dictionary value for each letter
    #     L_diff['A'] = abs(L_text['A'] - L_ref['A'])  # for each letter (or key in L_ref)
    ## YOU WRITE THIS LOOP
    for letter in L_text.keys():
        assert (letter in L_text.keys()) and (letter in L_ref.keys())
        L_diff[letter] = abs(L_text[letter] - L_ref[letter])
    
    # Next, loop through `L_diff` and sum all of the differences into the variable `f`.
    f = 0.0
    for letter in L_diff.keys():
        f += L_diff[letter]
    
    # Finally, return the metric `f`.
    return f

In [27]:
# test your code here.  You may edit this cell.
text   = '''The conclusions of human reason as ordinarily applied in matters of nature, I call for the sake of distinction Anticipations of
Nature (as a thing rash or premature). That reason which is elicited from facts by a just and methodical process, I call Interpretation of
Nature.  (Francis Bacon, Novum Organon, Aphorism XXVI)'''
L_text = calc_freq(text)
master = load_languages()
L_ref  = master['english']
f = calc_match(L_text, L_ref)
print('english, %f'%f)

english, 0.348949


In [28]:
# it should pass this test---do NOT edit this cell
# test self-similarity and similarity across languages
from numpy import isclose
master = load_languages()
assert isclose(calc_match(master['danish'], master['danish']), 0.0)
assert isclose(calc_match(master['english'], master['finnish']), 0.5338)
print('Success!')

Success!


In [29]:
# it should pass this test---do NOT edit this cell
# test the match metric for an English text against the English reference
from numpy import isclose
text   = '''The conclusions of human reason as ordinarily applied in matters of nature, I call for the sake of distinction Anticipations of
Nature (as a thing rash or premature). That reason which is elicited from facts by a just and methodical process, I call Interpretation of
Nature.  (Francis Bacon, Novum Organon, Aphorism XXVI)'''
L_text = calc_freq(text)
master = load_languages()
L_ref  = master['english']
f = calc_match(L_text, L_ref)
print('english, %f'%f)

english, 0.348949


<h4 style="color:#FF8C00">Exercises</h4>

Finally, we will capture the above logic in a function `find_best_fit` which will accept a string `text` and a dictionary of reference language dictionaries `master`.  `find_best_fit` compares `text` against all languages in `master`.  `find_best_fit` will return a tuple containing the language with the lowest value of `f` across the available reference languages, together with that value of `f`.

This is a freebie, so you can see the fruits of your labor in action.

In [30]:
# This code already works---you don't need to write anything here.
def find_best_fit(text, master):
    # Create an empty dictionary `fs`.
    fs = {}
    
    L_text = calc_freq(text)
    
    # Loop through the keys of `master` (by using `master.keys()`).
    for language in master.keys():
        # Calculate `f` for each using `calc_match` and store the result in `fs` with the key of the language.
        L_ref = master[language]
        fs[language] = calc_match(L_text, L_ref)
    
    # Finally, return the language corresponding to the minimum `f` in `fs` and the value of `f` in a tuple.
    best_language = min(fs, key=fs.get)  # get the key with minimum value in `fs`
    best_f = fs[best_language]
    return (best_language, best_f)
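
The call `min(fs, key=fs.get)` may look unfamiliar.  The sketch below shows what it does on a small dictionary of made-up scores, and how `sorted` with the same `key` ranks all of the languages; the values here are illustrative and were not produced by `calc_match`.

```python
# Illustrative scores only; real values come from calc_match.
fs = {'danish': 0.17, 'german': 0.31, 'welsh': 0.57}

# min with key=fs.get compares the values but returns the winning key.
best_language = min(fs, key=fs.get)
print(best_language)

# sorted with the same key ranks every language from best to worst fit.
ranking = sorted(fs, key=fs.get)
print(ranking)
```

Looking at the whole ranking, rather than only the winner, can show how close the runner-up was.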

In [31]:
# it should pass this test---do NOT edit this cell
# test language detection on a Danish sample
text = '''
    Soren Kierkegaard ("Frygt og baven:  Dialektisk lyrik", 1843)
    Er det virkelig saa, er al den Spidsborgerlighed, jeg seer i Livet, som jeg ikke lader mit Ord men min Gjerning domme, er den virkelig
    ikke hvad den synes, er den Vidunderet? Det lod sig jo tanke; thi hiin Troens Helt havde jo en paafaldende Lighed dermed; thi hiin Troens
    Helt var end ikke Ironiker og Humorist, men noget endnu Hoiere. Der tales i vor Tid meget om Ironi og Humor, Lsær af Folk, som aldrig have
    formaaet at praktisere deri, men som desuagtet vide at forklare Alt. Jeg er ikke ganske ubekjendt med disse tvende Lidenskaber, jeg veed
    lidt mere om dem end hvad der staaer i tydske og tydsk-danske Compendier. Jeg veed derfor, at disse tvende Lidenskaber ere vasentlig
    forskjellige fra Troens Lidenskab. Ironi og Humor reflektere ogsaa paa sig selv og hore derfor hjemme i den uendelige Resignations
    Sphare, de have deres Elasticitet i, at Individet er incommensurabelt for Virkeligheden.
    '''
master = load_languages()
language, f = find_best_fit(text, master)
print('The best fit for the text is *%s* with a metric of %f.'%(language,f))

The best fit for the text is *danish* with a metric of 0.174175.


In [32]:
# it should pass this test---do NOT edit this cell
# test language detection on an English sample
from numpy import isclose
text = '''
    Below the thunders of the upper deep;
    Far, far beneath in the abysmal sea, 
    His ancient, dreamless, uninvaded sleep
    The Kraken sleepeth: faintest sunlights flee
    About his shadowy sides: above him swell
    Huge sponges of millennial growth and height; 
    And far away into the sickly light, 
    From many a wondrous grot and secret cell
    Unnumbered and enormous polypi
    Winnow with giant arms the slumbering green.
    There hath he lain for ages and will lie
    Battening upon huge sea-worms in his sleep,
    Until the latter fire shall heat the deep;
    Then once by man and angels to be seen,
    In roaring he shall rise and on the surface die.
    (Alfred Lord Tennyson)
    '''
master = load_languages()
language, f = find_best_fit(text, master)
assert isclose(f, 0.198151072125)
print('Success!')

Success!


This is the most complex program you've yet written.  Let's take a survey of its overall logic:

![](./img/flowchart.png)

It's easy to get lost, but charting out your program's logic can help you navigate and think about coding challenges.

---

The lab is now complete, but you may find it interesting to use this function to predict the language of the following text samples.

In [33]:
master = load_languages()

Consider the text

In [34]:
text = '''Onder hierdie hoof wil ek u kortliks op grondige teëstelling wys en ook op verbinding. Die satiere, immers, is algemeen opgevat as
spottende uiting van tenminste ontevredenheid of misnoeë ten opsigte van slegtheid en dwaasheid, bestaande wantoestande in die werklikheid,
met die doel om daarteen gedagte, wil en gevoel op te wek. Hierby wil ek vooropstel die verskillende grade ven gevoel in satieriese spot,
variërende tussen die uiterstes van hoon en sarkasme aan die een kant en gemoedelikheid van komiek en mildheid van humor aan die ander. 'n
Definiesie van satiere wat enkel op hoon en bitterheid wys, skyn my egter nie ruim genoeg vir hierdie begrip nie. Hierteen kan miskien
ingebring word dat ons dan die satiere nie langer in sy essensieelste vorm kry nie.  (F.E.J. Malherbe, Humor in die algemeen en sy uiting in
die Afrikaanse letterkunde)
    '''

-   Which language is the best match, and what is its value of $f$?

In [35]:
language, f = find_best_fit(text, master)
print('The best fit for the text is *%s* with a metric of %f.'%(language,f))

The best fit for the text is *afrikaans* with a metric of 0.149817.


-   Which language is the worst match, and what is its value of $f$?

In [36]:
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # get the key with the maximum value in `fs`
f = fs[language]
print('The worst fit for the text is *%s* with a metric of %f.'%(language,f))

The worst fit for the text is *welsh* with a metric of 0.572124.


---

Consider the text:

In [37]:
text = '''Tots els essers humans neixen lliures i iguals en dignitat i en drets. Son dotats de rao i de consciencia, i han de comportar-se
    fraternalment els uns amb els altres.'''

-   Which language is the best match, and what is its value of $f$?

In [38]:
language, f = find_best_fit(text, master)
print('The best fit for the text is *%s* with a metric of %f.'%(language,f))

The best fit for the text is *french* with a metric of 0.277559.


-   Which language is the worst match, and what is its value of $f$?

In [39]:
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # get the key with the maximum value in `fs`
f = fs[language]
print('The worst fit for the text is *%s* with a metric of %f.'%(language,f))

The worst fit for the text is *polish* with a metric of 0.560747.


(You will note that, unsurprisingly, short text samples are harder to analyze statistically in this manner.  The foregoing sample is written in Catalan, but this method picks French, a closely related language.)
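
The effect of sample length can be seen directly.  The sketch below takes part of the Bacon passage used earlier in this lab and measures how far the letter shares of its first few letters are from the shares of the whole passage, using the same sum of absolute differences as the metric $f$.  The helper names `letter_shares` and `distance` are hypothetical and introduced only for this illustration.

```python
from string import ascii_uppercase as alphabet

sample = ("Neither the naked hand nor the understanding left to itself can "
          "effect much. It is by instruments and helps that the work is done, "
          "which are as much wanted for the understanding as for the hand.")

# Keep only the 26 basic letters so that slicing counts letters, not spaces.
letters = [ch for ch in sample.upper() if ch in alphabet]

def letter_shares(chars):
    # Share of each letter among `chars` (hypothetical helper for this sketch).
    return {letter: chars.count(letter) / float(len(chars)) for letter in alphabet}

def distance(p, q):
    # Sum of absolute differences, the same form as the metric f.
    return sum(abs(p[letter] - q[letter]) for letter in alphabet)

full = letter_shares(letters)
for n in (10, 40, len(letters)):
    print('%d letters: %f' % (n, distance(letter_shares(letters[:n]), full)))
```

The last line of output is zero by construction, since the whole passage is compared with itself.  A prefix of only ten letters cannot contain most of the alphabet, so its shares are a poor estimate of the passage as a whole.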

---

Consider the text:

In [40]:
text = '''Quoi que puisse dire Aristote, et toute la philosophie, il n'est rien d'eg
al au tabac ; c'est la passion des honnetes gens ; et qui vit sans tabac n'es
t pas digne de vivre. Non seulement il rejouit et purge les cerveaux huma
ins, mais encore il instruit les ames a la vertu, et l'on apprend avec lui a deve
nir honnete homme. Ne voyez-vous pas bien, des qu'on en prend, de quelle mani
ere obligeante on en use avec tout le monde, et comme on est ravi d'en donn
er a droite et a gauche, partout ou l'on se trouve ? On n'attend pas meme qu'o
n en demande, et l'on court au-devant du souhait des gens ; tant il est vrai que
le tabac inspire des sentiments d'honneur et de vertu a tous ceux qui en pren
nent. Mais c'est assez de cette matiere, reprenons un peu notre discours. Si bien
 donc, cher Gusman, que done Elvire, ta maitresse, surprise de notre depart, s'es
t mise en campagne apres nous ; et son coeur, que mon Maitre a su toucher trop
 fortement, n'a pu vivre, dis-tu, sans le venir chercher ici. Veux-tu qu'e
ntre-nous je te dise ma pensee ? J'ai peur qu'elle ne soit mal payee de son amou
r, que son voyage en cette ville produise peu de fruit, et que vous eussiez auta
nt gagne a ne bouger de la.

(Moliere, Don Juan ou le Festin de pierre)
'''

-   Which language is the best match, and what is its value of $f$?

In [41]:
language, f = find_best_fit(text, master)
print('The best fit for the text is *%s* with a metric of %f.'%(language,f))

The best fit for the text is *french* with a metric of 0.162854.


-   Which language is the worst match, and what is its value of $f$?

In [42]:
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # get the key with the maximum value in `fs`
f = fs[language]
print('The worst fit for the text is *%s* with a metric of %f.'%(language,f))

The worst fit for the text is *welsh* with a metric of 0.653765.


---

Consider the text:

In [43]:
text = '''
En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho
tiempo que vivia un hidalgo de los de lanza en astillero, adarga antigua,
rocin flaco y galgo corredor. Una olla de algo mas vaca que carnero,
salpicon las mas noches, duelos y quebrantos los sabados, lantejas los
viernes, algun palomino de anadidura los domingos, consumían las tres
partes de su hacienda. El resto della concluian sayo de velarte, calzas de
velludo para las fiestas, con sus pantuflos de lo mesmo, y los dias de
entresemana se honraba con su vellori de lo mas fino. Tenia en su casa una
ama que pasaba de los cuarenta, y una sobrina que no llegaba a los veinte,
y un mozo de campo y plaza, que asi ensillaba el rocin como tomaba la
podadera. Frisaba la edad de nuestro hidalgo con los cincuenta anos; era de
complexion recia, seco de carnes, enjuto de rostro, gran madrugador y amigo
de la caza. Quieren decir que tenia el sobrenombre de Quijada, o Quesada,
que en esto hay alguna diferencia en los autores que deste caso escriben;
aunque, por conjeturas verosimiles, se deja entender que se llamaba
Quejana. Pero esto importa poco a nuestro cuento; basta que en la narracion
del no se salga un punto de la verdad.
(Miguel de Saavedra Cervantes, Don Quixote)
'''

-   Which language is the best match, and what is its value of $f$?

In [44]:
language, f = find_best_fit(text, master)
print('The best fit for the text is *%s* with a metric of %f.'%(language,f))

The best fit for the text is *spanish* with a metric of 0.145452.


-   Which language is the worst match, and what is its value of $f$?

In [45]:
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # get the key with the maximum value in `fs`
f = fs[language]
print('The worst fit for the text is *%s* with a metric of %f.'%(language,f))

The worst fit for the text is *welsh* with a metric of 0.548099.


#### Congratulations! This is all we have for today.