### Workflow

1. Split the dataset into a training and a testing subset. Use the category “title” for the testing set and the categories “comment” and “post” for the training set. The short length of titles will make them good candidates later as seeds for text generation.
1. Build the matrix of prefix—word frequencies.
  + Use the `ngrams` function from `nltk.util` to generate all n-grams from the corpus.
  + Set `left_pad_symbol="<s>"` and `right_pad_symbol="</s>"`, together with `pad_left=True` and `pad_right=True` so that the padding is actually applied.
1. Write a text generation function:
  + take a bigram as input and generate the next token;
  + iteratively slide the prefix over the generated text so that the new prefix includes the most recent token, then generate the next token;
  + to generate each next token, sample from the words associated with the prefix, weighted by the probability distribution of that prefix;
  + stop the text generation when a set number of words has been generated or the latest token is `</s>`.
1. Write a function that can estimate the probability of a sentence and use it to select the most probable sentence out of several candidate sentences.
  + Split the sentence into trigrams and use the chain rule to calculate the probability of the sentence as the product of the probabilities of each token given its bigram prefix.
1. Implement the **perplexity** scoring function for a given sentence and for the training corpus.
1. Implement **Additive Laplace** smoothing to give a non-zero probability to missing prefix—token combinations when calculating perplexity.
1. Calculate the perplexity of the language model on the test set composed of titles.
1. Try to improve the perplexity score of your model by:
  + modifying the preprocessing phase of the corpus,
  + increasing or decreasing the number of tokens in the model (bigrams, 4-grams, etc.),
  + varying the delta parameter in the Additive Laplace smoothing step.
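
Steps 3 to 6 above can be sketched end to end on a toy corpus. This is a minimal sketch under stated assumptions, not the notebook's implementation: the helper names (`trigrams`, `build_counts`, `generate`, `sentence_logprob`, `perplexity`), the toy sentences, and the default `delta=1.0` are made up for illustration, and a plain `zip` stands in for `nltk.util.ngrams` so the block has no dependencies. Smoothing uses the additive form P(w | prefix) = (count(prefix, w) + delta) / (count(prefix) + delta * V), where V is the vocabulary size.

```python
import math
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"


def trigrams(tokens):
    """Trigrams of a sentence padded with two <s> and one </s>."""
    padded = [BOS, BOS] + list(tokens) + [EOS]
    return zip(padded, padded[1:], padded[2:])


def build_counts(sentences):
    """Map each bigram prefix to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for w1, w2, w3 in trigrams(tokens):
            counts[(w1, w2)][w3] += 1
    return counts


def generate(counts, prefix=(BOS, BOS), max_tokens=20):
    """Sample tokens one at a time, sliding the bigram prefix forward."""
    out = []
    for _ in range(max_tokens):
        followers = counts.get(tuple(prefix))
        if not followers:  # unseen prefix: nothing to sample from
            break
        words = list(followers)
        token = random.choices(words, weights=[followers[w] for w in words])[0]
        if token == EOS:
            break
        out.append(token)
        prefix = (prefix[1], token)
    return out


def sentence_logprob(counts, vocab_size, tokens, delta=1.0):
    """Chain-rule log-probability of a sentence with additive smoothing."""
    total = 0.0
    for w1, w2, w3 in trigrams(tokens):
        followers = counts.get((w1, w2), Counter())
        num = followers[w3] + delta
        den = sum(followers.values()) + delta * vocab_size
        total += math.log(num / den)
    return total


def perplexity(counts, vocab_size, sentences, delta=1.0):
    """exp of the negative mean log-probability per predicted token."""
    log_prob = sum(sentence_logprob(counts, vocab_size, s, delta) for s in sentences)
    n_tokens = sum(len(s) + 1 for s in sentences)  # +1 for the </s> token
    return math.exp(-log_prob / n_tokens)


# Toy data (assumption): three short "sentences" instead of the real corpus.
train = [s.split() for s in ["I know this", "I know that", "I like this"]]
counts = build_counts(train)
vocab = {w for s in train for w in s} | {EOS}

generated = generate(counts, max_tokens=10)
candidates = [["I", "know", "this"], ["this", "know", "I"]]
best = max(candidates, key=lambda s: sentence_logprob(counts, len(vocab), s))
ppl = perplexity(counts, len(vocab), candidates)
print(generated, best, ppl)
```

Working in log space avoids underflow when many small probabilities are multiplied, and ranking candidates by log-probability gives the same order as ranking by probability. With `delta > 0`, a prefix that never occurred in training falls back to a uniform distribution over the vocabulary.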

### Resources

* https://programminghistorian.org/en/lessons/counting-frequencies
* https://www.kite.com/python/docs/nltk.ngrams

In [43]:
import pandas as pd

In [2]:
# Loading the data
TOK_FILE = 'stackexchange_tokenized.csv'
df = pd.read_csv(f'data/{TOK_FILE}')

In [3]:
df['category'].unique()

array(['title', 'post', 'comment'], dtype=object)

#### Split into Training and Test Set

In [4]:
title = df['category'] == 'title'
posts = df['category'] == 'post'
commt = df['category'] == 'comment'

test_df  = df[title]
train_df = df[posts | commt] 

In [10]:
test_df.count()

Unnamed: 0    91648
post_id       91648
parent_id         0
comment_id        0
text          91648
category      91648
length        91648
tokenized     91648
dtype: int64

In [9]:
train_df.count()

Unnamed: 0    717518
post_id       717518
parent_id      75430
comment_id    550329
text          717518
category      717518
length        717518
tokenized     717518
dtype: int64

#### Build the matrix of prefix—word frequencies

In [57]:
from nltk.util import ngrams
from collections import Counter, defaultdict

# Raw follower counts per bigram prefix. This is incomplete: normalize each
# Counter by dividing by `sum(counter.values())` to get probabilities.
raw_counts = defaultdict(Counter)

def bigram_counts(s_string):
    for tg in ngrams(s_string.split(' '), 3):
        bg = tg[:-1]
        wd = tg[-1]
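        # NOTE: `bg` is a tuple while raw_counts[bg] holds single words, so this
        # test is always true and the Counter is reset on every trigram.
        if not bg in raw_counts[bg]:
            raw_counts[bg] = Counter()
        else:
            ### LEFT OFF HERE: this code never runs
            print(bg)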
        raw_counts[bg][wd] += 1

def print_it(text):
    # Parameter renamed from `str` to avoid shadowing the built-in type.
    print(text + "\n")
    

In [58]:
#this took 5-10 mins. We could make it faster. 
train_df['tokenized'].apply(bigram_counts)

91648     None
91649     None
91650     None
91651     None
91652     None
          ... 
809161    None
809162    None
809163    None
809164    None
809165    None
Name: tokenized, Length: 717518, dtype: object

In [50]:
train_df.head()

| | Unnamed: 0 | post_id | parent_id | comment_id | text | category | length | tokenized |
|---|---|---|---|---|---|---|---|---|
| 91648 | 91752 | 1 | | | How should I elicit prior distributions from e... | post | 84 | How should I elicit prior distributions from e... |
| 91649 | 91753 | 2 | | | In many different statistical methods there is... | post | 138 | In many different statistical methods there is... |
| 91650 | 91754 | 3 | | | What are some valuable Statistical Analysis op... | post | 191 | What are some valuable Statistical Analysis op... |
| 91651 | 91755 | 4 | | | I have two groups of data. Each with a differe... | post | 477 | I have two groups of data . Each with a differ... |
| 91652 | 91756 | 5 | 3.0 | | The R-project R is valuable and significant be... | post | 289 | The R - project R is valuable and significant ... |


In [59]:
list(raw_counts.keys())[:50]

[('How', 'should'),
 ('should', 'I'),
 ('I', 'elicit'),
 ('elicit', 'prior'),
 ('prior', 'distributions'),
 ('distributions', 'from'),
 ('from', 'experts'),
 ('experts', 'when'),
 ('when', 'fitting'),
 ('fitting', 'a'),
 ('a', 'Bayesian'),
 ('Bayesian', 'model'),
 ('In', 'many'),
 ('many', 'different'),
 ('different', 'statistical'),
 ('statistical', 'methods'),
 ('methods', 'there'),
 ('there', 'is'),
 ('is', 'an'),
 ('an', '"'),
 ('"', 'assumption'),
 ('assumption', 'of'),
 ('of', 'normality'),
 ('normality', '"'),
 ('"', '.'),
 ('.', 'What'),
 ('What', 'is'),
 ('is', '"'),
 ('"', 'normality'),
 ('"', 'and'),
 ('and', 'how'),
 ('how', 'do'),
 ('do', 'I'),
 ('I', 'know'),
 ('know', 'if'),
 ('if', 'there'),
 ('is', 'normality'),
 ('What', 'are'),
 ('are', 'some'),
 ('some', 'valuable'),
 ('valuable', 'Statistical'),
 ('Statistical', 'Analysis'),
 ('Analysis', 'open'),
 ('open', 'source'),
 ('source', 'projects'),
 ('projects', 'available'),
 ('available', 'right'),
 ('right', 'now'),
 

In [56]:
# I need to see some counts over 1, or I'm going to suspect that I have this wrong.
list(raw_counts.values())[:50]

[Counter({'I': 1}),
 Counter({'keep': 1}),
 Counter({'prior': 1}),
 Counter({'distributions': 1}),
 Counter({'can': 1}),
 Counter({'independent': 1}),
 Counter({'like': 1}),
 Counter({'fitting': 1}),
 Counter({'a': 1}),
 Counter({'"': 1}),
 Counter({'world': 1}),
 Counter({'.': 1}),
 Counter({'cases': 1}),
 Counter({'things': 1}),
 Counter({'behavior': 1}),
 Counter({'can': 1}),
 Counter({'is': 1}),
 Counter({'an': 1}),
 Counter({'interaction': 1}),
 Counter({'elementary': 1}),
 Counter({'for': 1}),
 Counter({'normality': 1}),
 Counter({',': 1}),
 Counter({'and': 1}),
 Counter({'You': 1}),
 Counter({'probability': 1}),
 Counter({'your': 1}),
 Counter({'unfair': 1}),
 Counter({'"': 1}),
 Counter({'help': 1}),
 Counter({'big': 1}),
 Counter({'i': 1}),
 Counter({'generate': 1}),
 Counter({'this': 1}),
 Counter({'your': 1}),
 Counter({"'s": 1}),
 Counter({']': 1}),
 Counter({'the': 1}),
 Counter({'cases': 1}),
 Counter({'info': 1}),
 Counter({'Analysis': 1}),
 Counter({'of': 1}),
 Counter(

In [65]:
# This shows that my bigram counter is broken.

# ('I', 'know') is a bigram I'd expect to find many times.
bg  = ('I', 'know')
jbg = ' '.join(bg)
print(raw_counts[bg])

#check in the training set
found = train_df['tokenized'].str.contains(jbg)
train_df[found]['tokenized']

Counter({'this': 1})


91649     In many different statistical methods there is...
91662     Two projects spring to mind : Bugs - taking ( ...
91724     I had a plan of learning R in the near future ...
91748     Let us say a man rolls a six sided die and it ...
91779     This is one I 've used successfully : I just s...
                                ...                        
808933    I do n't know * anything * about your data oth...
808983    Thank you for the answer . yes your understand...
809075    @whuber I know how to compute p value from sta...
809076    @ŁukaszGrad I know S1 ^ 2&S2 ^ 2 shown in abov...
809081    @whuber Now I understand your previous comment...
Name: tokenized, Length: 14388, dtype: object
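
The mismatch above (one follower with a count of 1 against 14388 matching rows) is consistent with this reading of `bigram_counts`: `bg in raw_counts[bg]` asks whether the prefix tuple is a key of its own follower `Counter`, which holds words rather than tuples. The test is therefore always true, the `Counter` is replaced with an empty one on every trigram, and each prefix keeps only the last follower seen. A minimal corrected sketch, assuming the intent is prefix to follower counts, drops the test and lets `defaultdict(Counter)` create entries on demand. The names `fixed_counts`, `count_trigrams`, and `to_probabilities` are hypothetical, the three sentences are toy data, and `zip` stands in for `ngrams(..., 3)` so the block has no dependencies.

```python
from collections import Counter, defaultdict


def count_trigrams(s_string, counts):
    """Add one sentence's (bigram prefix -> next word) counts to `counts`."""
    tokens = s_string.split(' ')
    for w1, w2, wd in zip(tokens, tokens[1:], tokens[2:]):
        # No membership test: defaultdict creates the Counter on first use.
        counts[(w1, w2)][wd] += 1


def to_probabilities(counts):
    """Normalize each Counter by its total to get P(word | prefix)."""
    probs = {}
    for prefix, followers in counts.items():
        total = sum(followers.values())
        probs[prefix] = {wd: c / total for wd, c in followers.items()}
    return probs


fixed_counts = defaultdict(Counter)
for sentence in ["I know this", "I know that", "do I know this"]:
    count_trigrams(sentence, fixed_counts)

print(fixed_counts[('I', 'know')])  # Counter({'this': 2, 'that': 1})
probs = to_probabilities(fixed_counts)
print(probs[('I', 'know')])
```

On the real data the same function could be applied to each row of `train_df['tokenized']`; the counts for `('I', 'know')` should then be well above 1.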

In [29]:
title = "How should I elicit prior distributions from experts when fitting a Bayesian model ?"
bigram_counts(title)

AttributeError: 'int' object has no attribute 'items'

In [31]:
list(ngrams(title.split(' '), 2))

[('How', 'should'),
 ('should', 'I'),
 ('I', 'elicit'),
 ('elicit', 'prior'),
 ('prior', 'distributions'),
 ('distributions', 'from'),
 ('from', 'experts'),
 ('experts', 'when'),
 ('when', 'fitting'),
 ('fitting', 'a'),
 ('a', 'Bayesian'),
 ('Bayesian', 'model'),
 ('model', '?')]

In [41]:

rc = defaultdict(Counter)
for bg in ngrams(title.split(' '), 2):
    if bg not in rc:
        rc[bg] = 1 
    else:
        rc[bg] += 1

In [42]:
rc

defaultdict(collections.Counter,
            {('How', 'should'): 1,
             ('should', 'I'): 1,
             ('I', 'elicit'): 1,
             ('elicit', 'prior'): 1,
             ('prior', 'distributions'): 1,
             ('distributions', 'from'): 1,
             ('from', 'experts'): 1,
             ('experts', 'when'): 1,
             ('when', 'fitting'): 1,
             ('fitting', 'a'): 1,
             ('a', 'Bayesian'): 1,
             ('Bayesian', 'model'): 1,
             ('model', '?'): 1})
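
In the cell above, `rc` ends up holding plain integers, so the `Counter` default factory is never used. If the goal is only to count whole bigrams, feeding the bigrams straight into a `Counter` gives the same totals. A minimal sketch, where `bigram_totals` is a hypothetical name and `zip` stands in for `ngrams(..., 2)`:

```python
from collections import Counter

title = "How should I elicit prior distributions from experts when fitting a Bayesian model ?"
tokens = title.split(' ')

# Counter tallies each (w1, w2) pair as the pairs are consumed.
bigram_totals = Counter(zip(tokens, tokens[1:]))

print(bigram_totals[('How', 'should')])  # 1
print(len(bigram_totals))                # 13
```

This counts bigrams themselves, which is a different table from the prefix to next-word counts that the text generator needs.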