# Language Detection

In this exercise we use the previously learned methods to develop an ngram-based Language Detection module.

Language Detection is the task of deciding the language of a given text using only the textual surface. 

The main assumption is that character statistics are highly language-specific, even among languages that share the same alphabet.
Using language-specific corpora, we can estimate the ngram statistics of a language by counting all the ngrams occurring in its corpus.
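
As a minimal sketch of this counting step, the snippet below counts character trigrams inside the tokens of a toy sentence. The function name `count_ngrams`, the lowercasing, and the toy sentence are illustrative assumptions and not part of the exercise material.

```python
from collections import Counter


def count_ngrams(text, n=3):
    """Count character ngrams inside whitespace-separated tokens.

    Hypothetical helper: lowercasing is an assumption made here so that
    'The' and 'the' contribute to the same ngrams.
    """
    counts = Counter()
    for token in text.lower().split():
        # Slide a window of width n over the token; tokens shorter than n
        # contribute nothing, and no ngram ever spans a space.
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


toy_corpus = "the other brother gathered the feathers"
trigram_counts = count_ngrams(toy_corpus, n=3)
print(trigram_counts.most_common(3))
```

Even in this tiny example a few trigrams dominate the counts, which is the effect the approach relies on for real corpora.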

The assumption for the classification of new texts is that the most frequent ngrams of a language are also the most likely to appear in any new text written in that language.

So the algorithm is:
 - Clean the corpus so that it contains only (ASCII) letters of the alphabet (no punctuation or digits).
 - Given a corpus of some language, count all the character ngrams and generate a list of the top `k` ngrams. This list should contain unique entries and only character ngrams occurring within tokens (no spaces).
 - Given a new text, generate a list of the ngrams occurring within its words. (This list does not have to be unique, because the most likely ngrams of a language are also more likely to occur.)
 - For each of the known languages, count how many of these ngrams are among its top `k` ngrams.
 - The output of the algorithm is the language with the most absolute hits.
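
The first three steps above could be sketched as follows. This is one possible reading of the steps, not a reference solution: the names `clean`, `token_ngrams`, and `top_k_ngrams`, the lowercasing, and the default values for `k` and `n` are all assumptions.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase the text and replace every run of non-ASCII-letter
    characters (punctuation, digits, umlauts, ...) with a single space."""
    return re.sub(r"[^a-z]+", " ", text.lower())


def token_ngrams(text, n=3):
    """Return all character ngrams inside tokens, duplicates included."""
    return [
        token[i:i + n]
        for token in clean(text).split()
        for i in range(len(token) - n + 1)
    ]


def top_k_ngrams(corpus, k=100, n=3):
    """Return the k most frequent ngrams of a corpus as a set (unique entries)."""
    counts = Counter(token_ngrams(corpus, n))
    return {gram for gram, _ in counts.most_common(k)}


print(token_ngrams("Hello, world 42!"))
```

Note that with this cleaning rule a German word such as "für" is split at the umlaut, so the restriction to ASCII letters discards some language-specific information.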
 
 
As a formula:
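Let $ngrams^k_{l_i}$ be the list of the top $k$ ngrams of language $l_i$.

Let $ngrams_{input}$ be the list of ngrams of a given input text. The detected language is the one with the most hits, where repeated ngrams in the input count every time they occur:

$$ L(input) = \text{argmax}_i \sum_{g \in ngrams_{input}} \mathbb{1}\left[ g \in ngrams^k_{l_i} \right] $$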
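
To make the formula concrete, here is a small sketch of the scoring step. The profiles in `toy_profiles` are hand-picked for illustration only and are not derived from any corpus; `ngram_list` and `detect_language` are hypothetical names.

```python
def ngram_list(text, n=3):
    """Return the character ngrams inside the tokens of text, duplicates kept."""
    return [
        token[i:i + n]
        for token in text.lower().split()
        for i in range(len(token) - n + 1)
    ]


def detect_language(text, profiles, n=3):
    """Return the language whose top-k set receives the most hits, plus all hit counts."""
    grams = ngram_list(text, n)
    # One hit per input ngram that is in the language's top-k set;
    # repeated ngrams are counted every time they occur.
    hits = {lang: sum(g in top for g in grams) for lang, top in profiles.items()}
    # On a tie, max() returns the language that comes first in `profiles`.
    return max(hits, key=hits.get), hits


# Hand-picked toy profiles (an assumption, NOT real corpus statistics).
toy_profiles = {
    "english": {"the", "ing", "and", "ion", "ent"},
    "german": {"sch", "ein", "ich", "der", "und"},
}

print(detect_language("ich und der schein", toy_profiles))
print(detect_language("the thing and the nation", toy_profiles))
```

Because only absolute hit counts are compared, very short inputs or inputs without any top-`k` ngram produce ties, which this sketch resolves arbitrarily by dictionary order.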
 
 
Try to distinguish between English and German.
As the English corpus, use the complete Gutenberg corpus from nltk.
As the German corpus, use the tagesschau corpus we provided. (Download)

**Hint**: Use what the 'Zipf' notebook showed about the power-law distribution of ngram frequencies.


__Deliberations__
 - What are the limits of this approach?
 - Where can you see problems?
 - You could also use Zipf's law for words to classify language by checking new sentences for words in the vocabulary. What would be the difference in approach, advantages and disadvantages?

In [None]:
import nltk
import pathlib
import matplotlib.pyplot as plt

In [2]:
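def load_german_text(path):
    """Load all the text files from the tagesschau folder"""
    texts = []
    # Sort for a reproducible order; UTF-8 is an assumption about the corpus files.
    for f in sorted(pathlib.Path(path).glob("*.txt")):
        with open(f, "r", encoding="utf-8") as openf:
            texts.append(openf.read())
    # Join with newlines so words are not fused across file boundaries.
    return "\n".join(texts)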