<div style="text-align: right">
    <i>
        LIN 537: Computational Linguistics 1 <br>
        Fall 2019 <br>
        Alëna Aksënova
    </i>
</div>

# Notebook 18: tf-idf and precision/recall

In this notebook, we discuss **tf-idf**, a score that can be used to assign a topic to a text.
It stands for _term frequency -- inverse document frequency,_ and measures how important a word is for the meaning of a text within a collection of texts. Then, in order to evaluate the obtained output, we will use **precision** and **recall**.

Additionally, this is the last notebook, so at the end I discuss how the topics covered in this class relate to different aspects and sub-fields of computational linguistics.

## tf-idf: term frequency -- inverse document frequency

The **tf-idf** metric/algorithm contains two parts, **tf** and **idf**, that are multiplied with each other.
The resulting score shows how important a word is for the meaning of a sentence or a text.


Consider the following three texts:

In [None]:
t1 = """Musk's warnings about artificial intelligence have brought
him some controversy. He and Facebook founder Mark Zuckerberg have 
clashed, with the latter calling his warnings "pretty irresponsible". 
Musk responded to Zuckerberg's censure by saying that he had discussed 
AI with Zuckerberg and found him to have only a limited understanding of 
the subject. In 2014, Slate's Adam Elkus argued that current AIs were as 
intelligent as a toddler, and only in certain fields, going on to say 
that Musk's "summoning the demon" analogy may be harmful because it could 
result in significant cuts to AI research budgets."""

t2 = """Like most members of the horse family, zebras are highly social. 
Their social structure, however, depends on the species. Mountain zebras 
and plains zebras live in groups, known as 'harems', consisting of one 
stallion with up to six mares and their foals. Bachelor males either live 
alone or with groups of other bachelors until they are old enough to 
challenge a breeding stallion. When attacked by packs of hyenas or wild dogs 
a zebra group will huddle together with the foals in the middle while the 
stallion tries to ward them off."""

t3 = """The names of pseudo-ops often start with a dot to distinguish them 
from machine instructions. Pseudo-ops can make the assembly of the program 
dependent on parameters input by a programmer, so that one program can be 
assembled different ways, perhaps for different applications. Or, a pseudo-op 
can be used to manipulate presentation of a program to make it easier to read 
and maintain. Another common use of pseudo-ops is to reserve storage areas 
for run-time data and optionally initialize their contents to known values."""

First of all, let us represent these texts as dictionaries of words, where the keys are the unique terms from the text, and the values are the number of times each term is used in that text.

In [None]:
from nltk.tokenize import TweetTokenizer
from string import punctuation

tokenizer = TweetTokenizer()

data = []
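for t in [t1, t2, t3]:
    # lowercase and tokenize the text, dropping tokens that are punctuation marks
    no_punct = []
    for w in tokenizer.tokenize(t.lower()):
        if w not in punctuation:
            no_punct.append(w)
    # map every unique term to the number of times it occurs in this text
    d = {w:no_punct.count(w) for w in no_punct}
    data.append(d)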

We have now obtained the list `data`, which describes a collection of $3$ documents.

In [None]:
for i in range(len(data)):
    print("Text", i+1)
    print(data[i], "\n")

### tf (term frequency)

First, we calculate how frequent some term is in the text that contains it.

$$\textrm{tf(word)} = \frac{\textrm{# of times this word appeared in this document}}{\textrm{total # of terms in this document}}$$


In [None]:
def tf(word, text):
    assert word in text
    words_total = sum(text.values())
    return text[word] / words_total

In [None]:
tf_t1 = {word:tf(word, data[0]) for word in data[0]}
tf_t2 = {word:tf(word, data[1]) for word in data[1]}
tf_t3 = {word:tf(word, data[2]) for word in data[2]}

data = [tf_t1, tf_t2, tf_t3]

print(tf_t1)

Now the variables `tf_t1`, `tf_t2` and `tf_t3` store the frequency of each word relative to the other words in the same document. However, a high value for a word in these dictionaries does not mean that the word is a topic of the text. **Why?**

### idf (inverse document frequency)

Second, we need to determine how frequently a word is used in texts in general.

$$\textrm{idf(word)} = \log\left(\frac{\textrm{total # of documents}}{\textrm{# of documents with this word in them}}\right)$$

In [None]:
from math import log

def idf(word, documents):
    mentions = sum([1 for i in documents if word in i])
    return log(len(documents) / mentions)

Let us now compute **idf** scores of all words across all documents.

In [None]:
words = list(set([j for i in data for j in i]))
idf_scores = {i:idf(i, data) for i in words}

### tf*idf

Now, we can finally compute the tf-idf score for every term of every document.

$$\textrm{tf-idf(word)} = \textrm{tf(word)} * \textrm{idf(word)}$$
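
As a hypothetical worked example (the numbers are invented for illustration and are not taken from the three texts above), suppose a word occurs $3$ times in a document of $100$ terms and appears in $1$ out of $3$ documents. Assuming the natural logarithm, which is what `math.log` computes, the score would come out roughly as follows:

$$\textrm{tf} = \frac{3}{100} = 0.03, \qquad \textrm{idf} = \log\frac{3}{1} \approx 1.0986, \qquad \textrm{tf-idf} \approx 0.03 * 1.0986 \approx 0.033$$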

In [None]:
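# multiply every tf value by the idf of its word; this updates the
# dictionaries in `data` in place, so they now hold tf-idf scores
for d in data:
    for k in d:
        d[k] *= idf_scores[k]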
        
print("Text 1, for example:")
print(data[0])

**Question:** how exactly were the $0$ scores derived?

Now we are ready to print, let's say, the $3$ most important words for each of those texts.
First, to sort the obtained tf-idf values, we need to switch from a dictionary to another data type. **Why?**
Then we can sort the container in descending order of tf-idf score, so that the highest-scoring words come first.

In [None]:
new_data = []
for i in data:
    switched = [(k, i[k]) for k in i]
    switched.sort(key = lambda x : x[1], reverse = True)
    new_data.append(switched)
    
for i in range(len(new_data)):
    print("Text", i+1)
    print(new_data[i][:3], "\n")

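As you can see, some stop words still made it to the top of the tf-idf word lists.
However, this is mainly a problem for very small datasets: with only three documents, a stop word that happens to be missing from one of them still gets a non-zero idf. In a larger collection, stop words occur in nearly every document, so their idf, and hence their tf-idf, is close to $0$.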

Once we have obtained the scores for the words in all the texts, these values serve as input to the next step. For example, we might want to calculate the summed and normalized **tf-idf for a sentence** to check how important that sentence is: the score indicates whether the sentence, or parts of it, should be preserved in a text summarization task. Alternatively, we can **label the text** with its top N words by tf-idf, for example, for automatic labeling of news articles.
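
Both uses can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the helper names `sentence_score` and `top_n_labels` are hypothetical, the scores in `toy_tfidf` are invented, and "normalized" is taken here to mean "divided by the number of tokens in the sentence", which is one possible choice among several.

```python
def sentence_score(sentence_tokens, tfidf):
    """Summed tf-idf of the tokens, normalized by sentence length.

    Tokens without a tf-idf score contribute 0 (an assumption of this sketch).
    """
    if not sentence_tokens:
        return 0.0
    total = sum(tfidf.get(tok, 0.0) for tok in sentence_tokens)
    return total / len(sentence_tokens)


def top_n_labels(tfidf, n=3):
    """Return the n words with the highest tf-idf scores."""
    ranked = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# invented scores, for illustration only
toy_tfidf = {"zebras": 0.05, "stallion": 0.04, "foals": 0.03,
             "the": 0.0, "live": 0.01}

s1 = ["zebras", "live", "the"]   # contains a high-scoring word
s2 = ["the", "the", "live"]      # mostly words with a score of 0

print(sentence_score(s1, toy_tfidf))
print(sentence_score(s2, toy_tfidf))
print(top_n_labels(toy_tfidf, n=2))
```

In this toy setting the first sentence receives the higher score, so a summarizer built along these lines would prefer to keep it.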

## Model evaluations: precision and recall

Linguists in industry frequently face the task of evaluating a given model.


**Formulate the task.** We have a model that annotates tweets with one of two labels: "offensive" and "not offensive". We also have $1000$ tweets that are annotated manually _(gold standard corpus)._

**Get predictions of the model.** In order to evaluate predictions of the model, we need to get its predictions: automatically annotate the gold corpus.

**Calculate precision and recall.** These two metrics evaluate the performance of the model.

<img src="images/18_1.png" width="300">

* **False negatives** are tweets that are claimed by the system to be not offensive, but they are, in fact, offensive;
* **True positives** are correctly determined offensive tweets;
* **True negatives** are correctly determined non-offensive tweets;
* **False positives** are tweets that are claimed by the system to be offensive, but they are, in fact, not offensive.
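
The four categories above can be counted with a short loop. This is a sketch under assumptions: the function name `confusion_counts` and the toy lists are hypothetical, and `True` is assumed to stand for the "offensive" label.

```python
def confusion_counts(gold, predicted):
    """Return (tp, fp, fn, tn), treating True as the 'offensive' label."""
    assert len(gold) == len(predicted)
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p and g:
            tp += 1      # predicted offensive, really offensive
        elif p and not g:
            fp += 1      # predicted offensive, really not offensive
        elif not p and g:
            fn += 1      # predicted not offensive, really offensive
        else:
            tn += 1      # predicted not offensive, really not offensive
    return tp, fp, fn, tn


# invented toy annotations, for illustration only
toy_gold = [True, True, False, False, True]
toy_pred = [True, False, True, False, True]

print(confusion_counts(toy_gold, toy_pred))  # (2, 1, 1, 1)
```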


**Precision** evaluates how many of the selected items were, in fact, correct. It is also called _positive predictive value._

$$ \textrm{precision} = \frac{\textrm{# of tweets correctly annotated as offensive}}{\textrm{total # of tweets annotated as offensive}} $$

**Recall** checks how many relevant items were detected. It is also called _sensitivity._

$$ \textrm{recall} = \frac{\textrm{# of tweets correctly annotated as offensive}}{\textrm{total # of offensive tweets}} $$
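
In terms of counts, precision is the number of true positives divided by true positives plus false positives, and recall is the number of true positives divided by true positives plus false negatives. The sketch below shows this with invented counts; the function names are hypothetical, and returning `0.0` when the denominator is zero is a convention assumed here, not something the formulas prescribe.

```python
def precision(tp, fp):
    """Share of items labeled offensive that really are offensive."""
    selected = tp + fp
    return tp / selected if selected else 0.0


def recall(tp, fn):
    """Share of really offensive items that the model found."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


# invented counts, for illustration only
tp, fp, fn = 30, 10, 20

print(precision(tp, fp))  # 30 / 40 = 0.75
print(recall(tp, fn))     # 30 / 50 = 0.6
```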

**Practice.** Consider the following two annotations, where `golden` holds the gold standard labels, and `model` holds the predictions of the model we are evaluating.

In [None]:
golden = [False, True, True, False, True, False, False,
           True, False, False, True, False, False, True, 
           True, True, True, True, True, False]

model = [False, False, True, False, False, True, False,
           True, True, False, False, False, False, True, 
           True, True, True, True, False, False]

Calculate the precision and recall of that model.

_Part 1._ Calculate the precision of the model.

_Part 2._ Calculate the recall of the model.

**F-score** is a measure of annotation accuracy that combines precision and recall. It reaches its best value at $1$, and its worst at $0$.

$$\textrm{F-score} = 2 * \frac{\textrm{precision} * \textrm{recall}}{\textrm{precision} + \textrm{recall}}$$
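
The formula translates directly into code. This is a minimal sketch with invented input values; the name `f_score` is hypothetical, and returning `0.0` when both precision and recall are zero is an assumed convention that avoids division by zero.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# invented values, for illustration only
print(f_score(0.75, 0.6))  # approximately 0.667
print(f_score(1.0, 1.0))   # a perfect model scores 1.0
```

Note that the score is pulled towards the lower of the two values: a model with high precision but very low recall still gets a low F-score.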

**Practice.** Calculate the F-score of the model.

## To sum up the semester...