# Keyword Extraction

The simple count-based method for extracting sublanguage-specific vocabulary only supports exploratory analysis. It gives no objective measure of how specific a word is to a sublanguage corpus.
To address this problem, we can use either Log-Likelihood or tf-idf to extract sublanguage-specific vocabulary.

### TF-IDF

To determine the difference between two or more sources, we have to formulate a weight for
each word with regard to each text source. One possible measure is tf-idf, which weights a term by how exclusively it is used in single documents. The more
documents a term is used in, the less importance it receives under the tf-idf weight. This follows the intuition
that a term which appears in very many documents cannot be unique to a certain class or domain.
Following Wikipedia, the tf-idf value increases proportionally to the number of times a word
appears in the document, but is often offset by the frequency of the word in the corpus, which
helps to adjust for the fact that some words appear more frequently in general.[1]

There are various options for the normalized term frequency $tf(t,D)$ (see the lecture, the videos in Moodle, or do your own research).
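
As a rough illustration of the weighting described above, the following sketch computes tf-idf from a count matrix with one row per document and one column per term. The function name `tf_idf`, the length-normalized term frequency, and the plain `log(N / df)` inverse document frequency are assumptions chosen for the example; other variants are equally valid.

```python
import numpy as np


def tf_idf(dtm):
    """Sketch: tf-idf weights for a (documents x terms) count matrix.

    Assumes tf = count / document length and idf = log(N / df);
    both are only one of several possible choices.
    """
    counts = np.asarray(dtm, dtype=float)
    n_docs = counts.shape[0]

    # Term frequency: counts normalized by the length of each document.
    # Empty documents are divided by 1 so that their tf stays 0.
    doc_lengths = counts.sum(axis=1, keepdims=True)
    tf = counts / np.where(doc_lengths > 0, doc_lengths, 1.0)

    # Document frequency: number of documents that contain each term.
    df = (counts > 0).sum(axis=0)

    # Terms with df == 0 are mapped to log(1) = 0 to avoid dividing by zero.
    idf = np.log(n_docs / np.where(df > 0, df, n_docs))

    return tf * idf
```

With this idf variant, a term that occurs in every document receives a weight of exactly 0, which matters when there are only a few documents to compare.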

### Log Likelihood 
Another way to measure the relative importance of words is Log-Likelihood.
Here we compare the word counts in the different domains with the word counts in
a reference corpus in order to determine significant differences.
The significance test used is called the “Log-Likelihood” ratio test (LL). The LL value indicates how strongly the observed frequency of a term in the target corpus deviates from the frequency expected given the reference
corpus.
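
The comparison can be sketched as follows, assuming a commonly used two-term form of the statistic: for a term with count $a$ in a target corpus of size $c$ and count $b$ in a reference corpus of size $d$, the expected counts are $E_1 = c(a+b)/(c+d)$ and $E_2 = d(a+b)/(c+d)$, and $LL = 2\,(a \ln(a/E_1) + b \ln(b/E_2))$. The function name `log_likelihood` and the aligned count vectors are assumptions for this example; the lecture may use a different formulation.

```python
import numpy as np


def log_likelihood(target_counts, reference_counts):
    """Sketch: per-term LL values for two aligned count vectors.

    Assumes both vectors share the same vocabulary order and uses the
    two-term form LL = 2 * (a * ln(a / E1) + b * ln(b / E2)).
    """
    a = np.asarray(target_counts, dtype=float)
    b = np.asarray(reference_counts, dtype=float)
    c, d = a.sum(), b.sum()  # corpus sizes in tokens

    # Expected counts if the term were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    # x * ln(x / e) is taken to be 0 where x == 0, so only positive
    # counts are passed to the logarithm.
    term_a = np.zeros_like(a)
    term_b = np.zeros_like(b)
    pos_a, pos_b = a > 0, b > 0
    term_a[pos_a] = a[pos_a] * np.log(a[pos_a] / e1[pos_a])
    term_b[pos_b] = b[pos_b] * np.log(b[pos_b] / e2[pos_b])

    return 2.0 * (term_a + term_b)
```

Note that the value is also large for terms that are under-represented in the target corpus, so a keyword list would presumably keep only terms whose relative frequency in the target exceeds that in the reference.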


### Corpora

We provide text for the three domains `Automobil`, `Wirtschaft` and `Sport`.
If you need a reference corpus, visit the [wortschatz-portal](https://wortschatz.uni-leipzig.de/de/download/German#deu_news_2021) and download a sufficiently large sample of reference sentences; around 4 million sentences should suffice.

### Text Preprocessing

Be aware that the preprocessing of text has considerable influence on the outcome. Part of this exercise is
to deploy a reasonable preprocessing pipeline. Make use of your knowledge about the Zipf distribution and other text preprocessing techniques.

To analyze differences we need to build a single "document" for each domain. This
means, if there is more than one document per domain, we’ll concatenate all texts belonging to one domain into a single text source.
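
A minimal sketch of such a pipeline is shown below. It lowercases the text, keeps runs of letters as tokens, drops short tokens and stopwords, and merges all texts of one domain before counting. The names `tokenize` and `domain_counts`, the minimum token length, and the stopword set passed in are all assumptions for illustration; which steps are reasonable is part of the exercise.

```python
import re
from collections import Counter

# Runs of Unicode letters (no digits, underscores or punctuation).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text, stopwords=frozenset(), min_length=2):
    """Lowercase the text and return the remaining letter-only tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in stopwords]


def domain_counts(documents, stopwords=frozenset()):
    """Concatenate all texts of one domain and count the tokens."""
    merged = " ".join(documents)
    return Counter(tokenize(merged, stopwords))


# Hypothetical example with a tiny, made-up stopword set.
counts = domain_counts(["Der Motor läuft.", "Der neue Motor!"], stopwords={"der"})
```

Lowercasing is a design choice here: for German text it discards the capitalization of nouns, which may or may not be desirable.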

### Also

It makes sense to first introduce a function that transforms a collection of documents into a document-term matrix (DTM). For that, the numpy library's array class is worth a look. (In practice, data sizes may quickly exceed memory. It is then necessary to consider data structures that accommodate this, e.g. sparse arrays. In this exercise, standard numpy arrays should suffice.)
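
One possible shape for such a function is sketched here, assuming the documents have already been tokenized into lists of strings. The name `build_dtm` and the choice to return the sorted vocabulary alongside the matrix are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def build_dtm(tokenized_docs):
    """Sketch: dense document-term matrix from lists of tokens.

    Returns the count matrix (documents x terms) and the vocabulary,
    where vocabulary[j] is the term belonging to column j.
    """
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {term: j for j, term in enumerate(vocabulary)}

    dtm = np.zeros((len(tokenized_docs), len(vocabulary)), dtype=np.int64)
    for i, doc in enumerate(tokenized_docs):
        for term, count in Counter(doc).items():
            dtm[i, column[term]] = count
    return dtm, vocabulary
```

Keeping the vocabulary next to the matrix makes it easy to map column indices back to words when listing keywords later.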

**Hint 1** If you use numpy, be aware that numpy contains a lot of useful functions like logarithms or sorting.
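
For example, the highest-scoring terms can be read off with `np.argsort`. This is only a sketch: the name `top_keywords` is hypothetical, and it assumes a score vector aligned with a vocabulary list.

```python
import numpy as np


def top_keywords(scores, vocabulary, k=10):
    """Return the k highest-scoring (term, score) pairs, best first."""
    scores = np.asarray(scores, dtype=float)
    # argsort sorts ascending, so reverse it and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return [(vocabulary[i], float(scores[i])) for i in order]
```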

**Hint 2** Beware of numerical traps like the undefined logarithm of 0.
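
One way to sidestep this trap is to evaluate the logarithm only where the input is positive, using the `where` and `out` arguments of numpy's ufuncs. The helper name `safe_log` is hypothetical, and returning 0 for non-positive inputs is an assumption that only makes sense in expressions such as $x \ln x$, where the limit for $x \to 0$ is 0.

```python
import numpy as np


def safe_log(x):
    """Natural log of x where x > 0, and 0.0 elsewhere (by assumption)."""
    x = np.asarray(x, dtype=float)
    # Entries where the mask is False keep the value already in `out`.
    return np.log(x, out=np.zeros_like(x), where=x > 0)
```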

### Task

Apply the two measures tf-idf and Log-Likelihood to extract the top keywords for the 3 corpora `Automobil`, `Wirtschaft` and `Sport`.