# Problem 07: Text Classification with NGram LMs

In this exercise we'll train language models and use them to classify texts. We'll use AG News data, and select one of the four classes to treat the data as a binary classification problem: texts of the selected class are positive texts, and the rest are negative texts. 

For each of the two sets of texts we'll estimate a language model. Given a new text, we can then compute its probability under the two models, and classify it as positive or negative depending on which model assigns it the higher probability.

This idea follows from the probabilities:

$p(+ | \text{text}) = \frac{p(+, \text{text})}{p(\text{text})} = \frac{p(+) p(\text{text} | +)}{p(\text{text})}$ 

We will estimate $p(\text{text}|+)$ and $p(\text{text}|-)$ using ngram language models, as well as the priors $p(+)$ and $p(-)$, which can be estimated from the texts. Also note that $p(\text{text}) = p(\text{text}, +) + p(\text{text}, -)$: we can compute this term as well, but it is not needed to decide whether a text is positive or negative, because it is the same denominator in both $p(+ | \text{text})$ and $p(- | \text{text})$.
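
To make the last point explicit, here is a short derivation that uses only the quantities defined above. Taking the ratio of the two posteriors, $p(\text{text})$ cancels:

$\frac{p(+ | \text{text})}{p(- | \text{text})} = \frac{p(+) \, p(\text{text} | +)}{p(-) \, p(\text{text} | -)}$

So a text is classified as positive when $p(+) \, p(\text{text} | +) > p(-) \, p(\text{text} | -)$, and as negative otherwise.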

As a final note, this method for classification is very similar to Naive Bayes for text classification: both methods use Bayes' rule and estimate the probability distributions $p(\text{text}|+)$ and $p(\text{text}|-)$ from data. A language model is directly a generative model of the text, while Naive Bayes assumes that the words in a text are independent, which is like an ngram language model of order 1, where the conditioning part is empty. While Naive Bayes is very competitive, using an actual ngram language model for classification is problematic because of unseen words and ngrams. We will see this in this exercise.
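
To illustrate the order-1 case and the problem with unseen words, here is a minimal sketch of an unsmoothed unigram model. It is not the code in `ngram_lm.py`: the names `train_unigram` and `unigram_prob` are hypothetical, tokenization is a plain whitespace split, and the stop symbol and class priors are left out for brevity.

```python
from collections import Counter

def train_unigram(texts):
    """Estimate unsmoothed unigram probabilities from whitespace-tokenized texts."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def unigram_prob(model, text):
    """p(text | class) as a product of independent word probabilities."""
    prob = 1.0
    for word in text.split():
        # An unseen word contributes a factor of 0 (assumption: no smoothing).
        prob *= model.get(word, 0.0)
    return prob

pos_model = train_unigram(["the team won the game", "the coach praised the team"])
seen = unigram_prob(pos_model, "the team won")     # all words were observed
unseen = unigram_prob(pos_model, "the team lost")  # "lost" was never observed
print(seen, unseen)
```

A single unseen word is enough to drive the probability of the whole text to zero, whatever the other words are.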

## Preliminaries

Start a new notebook and add these code blocks. The file `ngram_lm.py` has functions to estimate ngram language models; it must be placed in the same folder as your notebook. We recommend solving problem 06 first to get familiar with the estimation of ngram language models. Naturally, we also recommend looking at notebook 04 on ngram language models that we presented in class.

In [1]:
import pandas as pd
from ngram_lm import count_ngrams_up_to, NGramLanguageModel, prob_text, text_generator



In [2]:
from ngram_lm import stop_symbol

def lm_classify(text: str, lm_pos: NGramLanguageModel, lm_neg: NGramLanguageModel, prior_pos=None):
    """Classify a text as positive/negative using two ngram language models, one for each class.

    By default, the prior probabilities of each class are estimated from the relative counts in the
    ngram models: for each class, the number of stop symbols indicates the number of texts of that class
    in the data.

    Returns a dictionary with the predicted label, as well as all the probability terms used to make the prediction.

    """
    if prior_pos is None:
        counts_stop_pos = lm_pos.ngram_counts[()].get(stop_symbol)
        counts_stop_neg = lm_neg.ngram_counts[()].get(stop_symbol)
        prior_pos = counts_stop_pos / (counts_stop_pos + counts_stop_neg)
    prior_neg = 1 - prior_pos 
    p_text_given_pos = prob_text(lm_pos, text) 
    p_text_and_pos = p_text_given_pos * prior_pos

    p_text_given_neg = prob_text(lm_neg, text)
    p_text_and_neg = p_text_given_neg * prior_neg

    pred_int = 1 if p_text_and_pos > p_text_and_neg else 0
    pred_label = "POSITIVE" if pred_int else "NEGATIVE"

    p_text = p_text_and_pos + p_text_and_neg

    p_pos_given_text = p_text_and_pos / p_text if p_text>0 else None
    p_neg_given_text = p_text_and_neg / p_text if p_text>0 else None

    return {        
        'pred_label': pred_label,
        'pred_int': pred_int,
        'p(+|text)': p_pos_given_text, 
        'p(-|text)': p_neg_given_text,
        'p(text|+)': p_text_given_pos,
        'p(+,text)': p_text_and_pos,
        'p(text|-)': p_text_given_neg,        
        'p(-,text)': p_text_and_neg,
        'p(text)': p_text,
    }    

def lm_classify_texts(texts, lm_pos, lm_neg):
    """Runs lm_classify on a list of texts and returns a dataframe with the results."""
    return pd.DataFrame([
        {"text": text} | lm_classify(text, lm_pos, lm_neg) for text in texts],
       dtype=object
   )

The function `lm_classify` receives a text and two ngram language models. It computes the probability of the text under each of the language models, and then uses Bayes' rule to determine the probability of each class given the text, which is used as the classification rule.

Additionally, there's a helper function `lm_classify_texts` that receives a list of texts, runs the classification on each of them, and returns a dataframe with all the information.

Let's try it on toy sentences using toy language models. 

In [4]:
toy_lm_pos = NGramLanguageModel(n=2, ngram_counts=count_ngrams_up_to(2, texts=["i am positive", "i like +", "yes", "ambivalent"]), back_off_discount=0.1)
toy_lm_neg = NGramLanguageModel(n=2, ngram_counts=count_ngrams_up_to(2, texts=["i am negative", "i like -", "no", "ambivalent"]), back_off_discount=0.1)

<ngram_lm.NGramLanguageModel object at 0x000002611E805720>
<ngram_lm.NGramLanguageModel object at 0x000002611CD756C0>


In [5]:
lm_classify_texts(
    ["yes", "no", "i am +", "i am ambivalent", "yes i am ambivalent", "no i am ambivalent"],
    toy_lm_pos, 
    toy_lm_neg
)

Unnamed: 0,text,pred_label,pred_int,p(+|text),p(-|text),p(text|+),"p(+,text)",p(text|-),"p(-,text)",p(text)
0,yes,POSITIVE,1,1.0,0.0,0.2025,0.10125,0.0,0.0,0.10125
1,no,NEGATIVE,0,0.0,1.0,0.0,0.0,0.2025,0.10125,0.10125
2,i am +,POSITIVE,1,1.0,0.0,0.001574,0.000787,0.0,0.0,0.000787
3,i am ambivalent,NEGATIVE,0,0.5,0.5,0.001574,0.000787,0.001574,0.000787,0.001574
4,yes i am ambivalent,POSITIVE,1,1.0,0.0,1.8e-05,9e-06,0.0,0.0,9e-06
5,no i am ambivalent,NEGATIVE,0,0.0,1.0,0.0,0.0,1.8e-05,9e-06,9e-06
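
The rows above can be checked by hand. The sketch below redoes the Bayes' rule arithmetic for two of them, taking the `p(text|+)` and `p(text|-)` values from the table as given. The helper `posterior` is hypothetical (it is not part of `ngram_lm.py`), and the prior assumes one stop symbol per training text, as described in the docstring of `lm_classify`.

```python
def posterior(p_text_given_pos, p_text_given_neg, prior_pos):
    """Apply Bayes' rule to two class-conditional probabilities; returns p(+|text)."""
    joint_pos = p_text_given_pos * prior_pos
    joint_neg = p_text_given_neg * (1 - prior_pos)
    p_text = joint_pos + joint_neg
    if p_text == 0:
        return None  # both models give the text zero probability
    return joint_pos / p_text

# Each toy model was trained on 4 texts, so the stop-symbol counts give 4 / (4 + 4).
prior_pos = 4 / (4 + 4)

joint_yes = 0.2025 * prior_pos                              # p(+,text) for "yes"
row_yes = posterior(0.2025, 0.0, prior_pos)                 # p(+|text) for "yes"
row_ambivalent = posterior(0.001574, 0.001574, prior_pos)   # "i am ambivalent"
print(joint_yes, row_yes, row_ambivalent)
```

The text "i am ambivalent" is an exact tie. It is labelled NEGATIVE because `lm_classify` predicts POSITIVE only when `p_text_and_pos` is strictly greater than `p_text_and_neg`.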


## LM Classifiers for AGNews

We will now load the AGNews data and restrict it to a random selection of 20K texts. This data has a large vocabulary, and therefore the number of distinct ngrams is also large. You can increase the sample size if you have enough memory on your computer.
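
To get a feeling for how many distinct ngrams a collection contains, the following sketch counts them per order on two toy sentences. It is only an illustration: `count_distinct_ngrams` is a hypothetical helper, it uses whitespace tokenization without start or stop symbols, and it is not how `count_ngrams_up_to` is implemented.

```python
def count_distinct_ngrams(texts, max_n):
    """Return {n: number of distinct n-grams} for n = 1..max_n (whitespace tokens)."""
    seen = {n: set() for n in range(1, max_n + 1)}
    for text in texts:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of length n over the tokens of this text.
            for i in range(len(tokens) - n + 1):
                seen[n].add(tuple(tokens[i:i + n]))
    return {n: len(grams) for n, grams in seen.items()}

sizes = count_distinct_ngrams(
    ["the game was won by the home team", "the home team won the game"], max_n=3
)
print(sizes)
```

Running the same function on samples of different sizes is one way to judge how much memory a larger sample would need.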

In [6]:
agnews_train = pd.read_csv('../data/agnews_train.csv')
agnews_test = pd.read_csv('../data/agnews_test.csv')

# restrict to a random selection of 20K texts (feel free to increase)
agnews_sample = agnews_train.sample(20000)

We will now select one of the four classes and binarize the data into positive and negative texts. Here we use "Sports", but please select a class distinct from the other members of your group. 

It is also important to lowercase all texts, in order to keep the vocabulary smaller. Remember, when trying new texts, to use lowercased words, because the language models we'll estimate will not recognize words in uppercase. 

In [11]:
positive_class = "Sports"  # change this to your selected class
# build the masks from the sample itself, so that they align with its index
counts_train_pos = count_ngrams_up_to(3, agnews_sample[agnews_sample.label == positive_class].text.str.lower())
counts_train_neg = count_ngrams_up_to(3, agnews_sample[agnews_sample.label != positive_class].text.str.lower())


We can now estimate two language models, one for the positive texts and another for the negative texts. In both cases, we use a trigram model with some smoothing. 

In [12]:
lm_pos = NGramLanguageModel(3, counts_train_pos, back_off_discount=0.1)
lm_neg = NGramLanguageModel(3, counts_train_neg, back_off_discount=0.1)

Let's try to classify the sentence "today". We can see the predicted label (POSITIVE/NEGATIVE), and all the probability terms that are involved in the decision. 

In [13]:
lm_classify("today", lm_pos, lm_neg)

{'pred_label': 'POSITIVE',
 'pred_int': 1,
 'p(+|text)': 0.7706754989078187,
 'p(-|text)': 0.22932450109218128,
 'p(text|+)': 4.0616878569538904e-07,
 'p(+,text)': 1.003846153846154e-07,
 'p(text|-)': 3.9676888747979256e-08,
 'p(-,text)': 2.987074569391618e-08,
 'p(text)': 1.3025536107853158e-07}

### Question 1. Generate texts from the positive/negative language models. 

Using the `text_generator` function, generate 10 random sentences that start with "today" using the "positive" LM, and then generate 10 random sentences starting with "today" using the "negative" LM.

Judge the quality of these sentences by two different aspects. First, are the sentences fluent, grammatical, and meaningful?

Second, do the positive/negative sentences reflect the category you chose? Why?

In [14]:
for i in range(10):
    print(text_generator(lm_pos, tokens=["today"], randomize=True))

(1.8533606595580132e-39, ['today', ',', 'in', 'the', 'texas', 'rangers', 'before', 'the', 'third', 'quarter', ',', 'helping', 'the', 'new', 'york', 'yankees', ',', 'leading', 'the', 'chicago', 'bears', 'have', 'sent', 'wide', 'receiver', 'marvin', 'harrison', 'have', 'done', 'almost', 'everything', 'together', '.', 'they', 'had', 'five', 'rival', 'cities', 'delivered', 'their', 'bid', 'documents', 'to', 'the', 'semifinals', 'of', 'the', 'ninth', 'inning', ',', 'quot', ';', 'it', '#', '39', ';', 't', '#', '39', ';', 's', 'mohan', 'as', 'bcci', 'administrator', 'and', 'restored', '_STOP_'])
(1.9612369378599287e-17, ['today', ',', 'at', 'least', '70', 'kilometers', 'northeast', 'of', 'the', 'game', 'between', '16th-ranked', 'cavaliers', 'prepare', 'for', 'the', 'first', 'playoff', 'hole', '.', '_STOP_'])
(2.057168252548857e-18, ['today', ',', 'there', 'aren', '#', '39', ';', 'll', 'be', 'without', 'ronaldo', 'on', 'sunday', 'night', ':', 'yankees', 'general', 'manager', 'lou', 'lamoriello

In [16]:
for i in range(10):
    print(text_generator(lm_neg, tokens=["today"], randomize=True))

(1.464078299961657e-16, ['today', 'we', 'would', 'like', 'to', 'be', 'even', 'more', 'popular', 'by', 'hollywood', 'movies', 'such', 'as', 'the', 'fourth', 'quarter', 'ended', 'sept.', '30.', 'the', 'results', '.', '_STOP_'])
(3.358485854648778e-87, ['today', ',', 'said', 'it', 'had', 'struck', 'a', 'crucial', 'vote', 'in', 'sydney', 'after', 'being', 'made', 'available', 'saturday', 'to', 'personal', 'computers', 'to', 'be', 'the', 'government', 'and', 'business', 'consultant', 'mirrors', 'china', "'s", 'first', 'such', 'facility', 'to', 'open', 'membership', 'talks', 'with', 'a', 'funny', 'name', '.', 'i', 'have', 'been', 'the', 'quot', ';', 'plastics', 'sales', ',', 'including', 'california', 'and', 'for', 'some', 'featherweight', 'notebook', 'offerings', 'of', 'sun', "'s", 'visual', 'development', ',', 'speech', 'recognition', 'on', 'a', 'remote', 'australian', 'embassy', 'bombing', 'was', 'briefly', 'detained', 'by', 'their', 'families', 'and', 'friends', 'will', 'be', 'revised', 

### Question 2. Classify texts from the AGNews test data. 

We will now try the LM-based classifiers on the AGNews test set, but only on the first 10 texts. This selection is very imbalanced: 7 texts are Science, and there is a single text for each of the World, Sports and Business classes.

In [17]:
agnews_test[:10][['text', 'label']]

Unnamed: 0,text,label
0,It #39;s over. Our relationship just hasn #39;...,Science
1,Toshiba Corp. announced Tuesday a 80 gigabyte ...,Science
2,Scientists go back to the drawing board in the...,Science
3,The first shuttle flight since the Columbia tr...,Science
4,"NEW YORK, Sept 21: Iraqi Prime Minister Iyad A...",World
5,Hynix of Korea has sold its non-memory semicon...,Science
6,Four seconds after he checked into his first b...,Sports
7,Virgin will use Airbus A340-600 aircraft on th...,Business
8,"From 26,000 light-years-- near the center of o...",Science
9,"At Storage Networking World yesterday, Dell Pr...",Science


Classify the first 10 texts, and evaluate the correctness of the predictions. As you will see, the prediction method is very slow. This is because our implementation of the LM is not optimized (in exchange, the code should be simple to follow).

Report on how many predictions were correct or wrong. 

Also check the probabilities that the pos/neg models assign to the texts. Do you see many zeros? Can you tell why?

Hint: each language model stores its vocabulary as a set of strings in `lm.vocab`. Take one of the LMs and compute the probability of a single-word sentence, once with a word that is in the vocabulary and once with a word that is not. Words in the vocabulary have been observed at least once and, since we use smoothed LMs, always receive a non-zero probability. In contrast, words outside the vocabulary always receive zero probability.


In [18]:
test_texts = agnews_test[:10].text.str.lower()
lm_classify_texts(test_texts, lm_pos, lm_neg)

Unnamed: 0,text,pred_label,pred_int,p(+|text),p(-|text),p(text|+),"p(+,text)",p(text|-),"p(-,text)",p(text)
0,it #39;s over. our relationship just hasn #39;...,NEGATIVE,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,toshiba corp. announced tuesday a 80 gigabyte ...,NEGATIVE,0,,,0.0,0.0,0.0,0.0,0.0
2,scientists go back to the drawing board in the...,NEGATIVE,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,the first shuttle flight since the columbia tr...,NEGATIVE,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,"new york, sept 21: iraqi prime minister iyad a...",NEGATIVE,0,,,0.0,0.0,0.0,0.0,0.0
5,hynix of korea has sold its non-memory semicon...,NEGATIVE,0,,,0.0,0.0,0.0,0.0,0.0
6,four seconds after he checked into his first b...,NEGATIVE,0,,,0.0,0.0,0.0,0.0,0.0
7,virgin will use airbus a340-600 aircraft on th...,NEGATIVE,0,,,0.0,0.0,0.0,0.0,0.0
8,"from 26,000 light-years-- near the center of o...",NEGATIVE,0,,,0.0,0.0,0.0,0.0,0.0
9,"at storage networking world yesterday, dell pr...",NEGATIVE,0,,,0.0,0.0,0.0,0.0,0.0
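
One way to count the correct predictions is sketched below. It copies the gold labels and the `pred_int` column from the two tables above and assumes "Sports" is the positive class, as in this notebook. The variable names are illustrative.

```python
# Gold labels of the first 10 test texts, and the pred_int column shown above.
gold_labels = ["Science", "Science", "Science", "Science", "World",
               "Science", "Sports", "Business", "Science", "Science"]
pred_ints = [0] * 10  # every text was predicted NEGATIVE

positive_class = "Sports"
gold_ints = [1 if label == positive_class else 0 for label in gold_labels]

# Count matches, and collect positive texts that were predicted negative.
n_correct = sum(int(g == p) for g, p in zip(gold_ints, pred_ints))
false_negatives = [i for i, (g, p) in enumerate(zip(gold_ints, pred_ints))
                   if g == 1 and p == 0]
accuracy = n_correct / len(gold_ints)
print(n_correct, false_negatives, accuracy)
```

Note that the correct predictions here come from always answering NEGATIVE on an imbalanced selection, and the only Sports text is missed, so the accuracy alone says little about the quality of the classifier.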


### Question 3. Classify simple texts. 

Given that our LMs are slow, and that they do not seem to be working so well for real sentences, let's try it on simple sentences, like "today there is a game". 

Invent a few simple sentences that are positive or negative according to your selected class and your judgement. Then classify them using the LM. 

Try to invent sentences for which both the positive and negative LM give a non-zero probability. Hint: restrict to words that are in the intersection of the vocabularies of the two LMs. 
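
A small helper can tell which words of a candidate sentence are missing from each vocabulary. This is only a sketch: `missing_words` is a hypothetical function, the toy sets stand in for `lm_pos.vocab` and `lm_neg.vocab`, and it assumes whitespace tokenization, which may differ from the tokenization used inside `ngram_lm.py`.

```python
def missing_words(text, vocab_pos, vocab_neg):
    """Return the words of `text` that are missing from each vocabulary."""
    tokens = text.lower().split()
    return {
        "missing_in_pos": [t for t in tokens if t not in vocab_pos],
        "missing_in_neg": [t for t in tokens if t not in vocab_neg],
    }

# Toy vocabularies standing in for lm_pos.vocab and lm_neg.vocab.
toy_vocab_pos = {"today", "there", "is", "a", "game", "coach"}
toy_vocab_neg = {"today", "there", "is", "a", "game", "market"}

ok = missing_words("today there is a game", toy_vocab_pos, toy_vocab_neg)
bad = missing_words("the coach watched the market", toy_vocab_pos, toy_vocab_neg)
print(ok)
print(bad)
```

A sentence is a good candidate when both lists are empty. With the real models, pass `lm_pos.vocab` and `lm_neg.vocab` instead of the toy sets.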

In [19]:
lm_classify_texts(["today there is a game"], lm_pos, lm_neg)

Unnamed: 0,text,pred_label,pred_int,p(+|text),p(-|text),p(text|+),"p(+,text)",p(text|-),"p(-,text)",p(text)
0,today there is a game,POSITIVE,1,0.970636,0.029364,0.0,0.0,0.0,0.0,0.0


In [20]:
print(lm_pos.vocab.intersection(lm_neg.vocab))



In [25]:
pos_texts = ['the olympian scored valuable goals',
             'tottenham consistent streak hammered the conference',
             'the assistant coach tapped a substitute for the game']
neg_texts = ['haiti blackouts left officials powerless',
             'clinical worms tapped on the planet caused illness',
             'the commercial dealer crass contract faded']

lm_classify_texts(neg_texts, lm_pos, lm_neg)

Unnamed: 0,text,pred_label,pred_int,p(+|text),p(-|text),p(text|+),"p(+,text)",p(text|-),"p(-,text)",p(text)
0,haiti blackouts left officials powerless,POSITIVE,1,0.536345,0.463655,0.0,0.0,0.0,0.0,0.0
1,clinical worms tapped on the planet caused ill...,NEGATIVE,0,0.000211,0.999789,0.0,0.0,0.0,0.0,0.0
2,the commercial dealer crass contract faded,NEGATIVE,0,0.180346,0.819654,0.0,0.0,0.0,0.0,0.0
