# NumPy exercise
## Yoav Ram

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

from datetime import date
import string
import pickle

The following cell defines frequency arrays for the characters of English, German, and French texts.

For example, `en[0]` gives the frequency of `a` in English texts, whereas `de[10]` gives the frequency of `k` in German texts.

We only consider lowercase letters a-z.
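
As a small illustration of this indexing convention, the sketch below maps a letter to its position in such an array. The helper name `letter_index` is hypothetical and not part of the exercise.

```python
import string

def letter_index(letter):
    # Position of a lowercase letter in a 26-element frequency array
    return ord(letter) - ord('a')

# The same position can be read off the alphabet string
print(letter_index('k'), string.ascii_lowercase[10])
```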

In [2]:
en = np.array([0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
            0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
            0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056, 0.02758,
            0.00978, 0.0236 , 0.0015 , 0.01974, 0.00074])
de = np.array([0.06516, 0.01886, 0.02732, 0.05076, 0.16396, 0.01656, 0.03009,
            0.04577, 0.0655 , 0.00268, 0.01417, 0.03437, 0.02534, 0.09776,
            0.02594, 0.0067 , 0.00018, 0.07003, 0.0727 , 0.06154, 0.04166,
            0.00846, 0.01921, 0.00034, 0.00039, 0.01134])
fr = np.array([0.07636, 0.00901, 0.0326 , 0.03669, 0.14715, 0.01066, 0.00866,
            0.00737, 0.07529, 0.00613, 0.00074, 0.05456, 0.02968, 0.07095,
            0.05796, 0.02521, 0.01362, 0.06693, 0.07948, 0.07244, 0.06311,
            0.01838, 0.00049, 0.00427, 0.00128, 0.00326])

We want to use these frequency arrays to determine the language of a text of our choice.

First, **write a function called `frequencies(text)`** that takes a string and returns the frequency of each letter in the text in a NumPy array, similar to those above. Disregard casing (`A` is the same as `a`) and punctuation.

Apply the function to `../data/Gulliver.txt`, an English book by Jonathan Swift.

**Tip**: use the function `ord`.

In [3]:
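def frequencies(text):
    # One counter per letter a-z, same shape and dtype as the language arrays
    freq = np.zeros_like(en)
    for char in text:
        # Map both cases to the same index; all other characters are skipped
        if 'a' <= char <= 'z':
            freq[ord(char) - ord('a')] += 1
        elif 'A' <= char <= 'Z':
            freq[ord(char) - ord('A')] += 1
    # Normalize the counts so that they sum to 1
    freq /= freq.sum()
    return freq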
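
The loop above can also be written without an explicit Python loop. The following is a minimal sketch of one possible vectorized variant, not the exercise's reference solution; the name `frequencies_vectorized` is hypothetical. Like the loop version, it assumes that only the ASCII letters a-z and A-Z are counted, so accented characters are dropped.

```python
import numpy as np

def frequencies_vectorized(text):
    # Drop non-ASCII characters, then lowercase the remaining bytes
    raw = text.encode('ascii', errors='ignore').lower()
    codes = np.frombuffer(raw, dtype=np.uint8)
    # Keep only the letters a-z and shift them to indices 0..25
    is_letter = (codes >= ord('a')) & (codes <= ord('z'))
    indices = codes[is_letter].astype(np.intp) - ord('a')
    # Count occurrences of each index and normalize to frequencies
    counts = np.bincount(indices, minlength=26).astype(float)
    return counts / counts.sum()

freq = frequencies_vectorized("Hello, World!")
print(freq[ord('l') - ord('a')])  # 3 of the 10 letters are 'l'
```

For the same text this should give the same array as the loop version, and on long texts it is likely to be faster because the counting happens inside NumPy.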

In [4]:
with open('../data/Gulliver.txt') as f:
    gulliver = f.read()

gulliver_freq = frequencies(gulliver)
print(gulliver_freq)

[0.07723943 0.01607287 0.02736497 0.04360799 0.12618015 0.02455543
 0.01963356 0.05897952 0.06971137 0.00181354 0.00639097 0.03791422
 0.02900006 0.06678563 0.07670824 0.01953396 0.00110804 0.06150686
 0.06039051 0.09297201 0.02853111 0.010155   0.02230614 0.00185089
 0.01929741 0.0003901 ]


**Implement a function called `relative_entropy(p, q)`** that computes the relative entropy between a language frequency array `p` and a text frequency array `q` according to the following formula:
$$
re(p, q) = \sum_{i=1}^{n}{p_i \log{\Big(\frac{p_i}{q_i}\Big)}}
$$
where $i=1$ for `a` and $i=n$ for `z`.

Relative entropy (also known as the Kullback-Leibler divergence) measures how much two distributions or frequency arrays differ: the lower the relative entropy of `p` and `q`, the more similar `p` and `q` are.

Reminder: frequency array values are between 0 and 1 and sum to 1.

**Tip**: replace zero values in `q` with a very small number (e.g. `1e-8`) to avoid `nan` or `inf` values.

In [6]:
def relative_entropy(p, q):
    # Replace zeros in a new array, so that the caller's q is not modified
    q = np.where(q == 0, 1e-8, q)
    return (p * np.log(p / q)).sum()

In [7]:
print('en', relative_entropy(en, gulliver_freq))
print('fr', relative_entropy(fr, gulliver_freq))
print('de', relative_entropy(de, gulliver_freq))

en 0.0013754412558092133
fr 0.12101819221287134
de 0.08499522955014038
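
To build intuition for values like these, the sketch below evaluates the same formula on two toy two-element distributions. The helper name `kl` and the toy arrays are assumptions made for illustration only. It shows that the relative entropy of a distribution with itself is zero, and that the measure is not symmetric in its arguments.

```python
import numpy as np

def kl(p, q):
    # Same formula as above: sum of p_i * log(p_i / q_i)
    return (p * np.log(p / q)).sum()

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

print(kl(p, p))  # identical distributions give 0.0
print(kl(p, q))  # roughly 0.51
print(kl(q, p))  # roughly 0.37, so the order of the arguments matters
```

Because of this asymmetry, it matters that the language array is always passed as `p` and the text array as `q` when comparing languages.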


Finally, **write a function called `detect_language(text)`** that takes a text and returns the name of the most likely language for that text: `en`, `fr`, or `de`.

In [8]:
language_frequencies = {'en': en, 'fr': fr, 'de': de}

def detect_language(text):
    text_freq = frequencies(text)
    # Relative entropy between each language and the text
    language_rel_ent = [
        (lang, relative_entropy(freq, text_freq))
        for lang, freq in language_frequencies.items()
    ]
    # The language with the lowest relative entropy is the best match
    return min(language_rel_ent, key=lambda t: t[1])[0]

In [10]:
print(detect_language(gulliver))

en
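
The comparison inside `detect_language` can also be expressed with broadcasting, by stacking the language arrays into one matrix and taking the `argmin` of the row-wise relative entropies. The sketch below uses made-up three-letter distributions and hypothetical language names (`lang_x`, `lang_y`) to keep it short; it is an illustration, not a replacement for the solution above.

```python
import numpy as np

# Hypothetical toy data: two "languages" over a three-letter alphabet
names = ['lang_x', 'lang_y']
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
# Frequencies of the text to classify (no zeros, so no replacement is needed)
q = np.array([0.6, 0.3, 0.1])

# q is broadcast against every row of P; one relative entropy per language
scores = (P * np.log(P / q)).sum(axis=1)
best = names[int(np.argmin(scores))]
print(best)  # lang_x, whose distribution is much closer to q
```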


In [11]:
ne_me_quitte_pas = """Ne me quitte pas
Il faut oublier
Tout peut s'oublier
Qui s'enfuit déjà
Oublier le temps
Des malentendus
Et le temps perdu
A savoir comment
Oublier ces heures
Qui tuaient parfois
A coups de pourquoi
Le coeur du bonheur
Ne me quitte pas
Ne me quitte pas
Ne me quitte pas
Ne me quitte pas"""

print(detect_language(ne_me_quitte_pas))

fr
