In [1]:
import nltk
import numpy as np

In [2]:
text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

# 1. N-Gram with NLTK

## 1.1 Ngrams with Padding

**nltk.bigrams**
- input: a sequence of tokens
- output: a generator of bigram tuples

In [3]:
[*nltk.bigrams(sequence=text[0])]

[('a', 'b'), ('b', 'c')]
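For intuition, a bigram is each token paired with its successor. The following is a minimal pure-Python sketch of that idea; the helper name `simple_bigrams` is hypothetical and not part of NLTK.

```python
def simple_bigrams(sequence):
    """Pair every token with the token that follows it."""
    tokens = list(sequence)
    # zip stops at the shorter input, so the last token has no partner
    return list(zip(tokens, tokens[1:]))

print(simple_bigrams(['a', 'b', 'c']))
```

A sequence of length L yields L - 1 bigrams, which is why the three tokens above give two pairs.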

**nltk.pad_sequence**
- input sequence and padding params
- output padded sequence

In [4]:
pad_seq = [*nltk.pad_sequence(text[0], pad_left=True, left_pad_symbol='^',
                    pad_right=True, right_pad_symbol='#', n=2)]
pad_seq

['^', 'a', 'b', 'c', '#']

Shorthand: **nltk.lm.preprocessing.pad_both_ends**
- input: a sequence and N
- output: the sequence padded on both sides with the default symbols `<s>` and `</s>`

In [5]:
[*nltk.lm.preprocessing.pad_both_ends(text[0], n=2)]

['<s>', 'a', 'b', 'c', '</s>']
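The padding itself can be sketched in plain Python: an order-N model needs N - 1 start symbols and N - 1 end symbols. The helper `simple_pad_both_ends` below is a hypothetical stand-in written for illustration, not NLTK's implementation.

```python
def simple_pad_both_ends(sequence, n, left='<s>', right='</s>'):
    """Surround the sequence with n - 1 start and n - 1 end symbols."""
    return [left] * (n - 1) + list(sequence) + [right] * (n - 1)

print(simple_pad_both_ends(['a', 'b', 'c'], n=2))
print(simple_pad_both_ends(['a', 'b', 'c'], n=3))
```

With `n=2` there is one symbol on each side; with `n=3` there are two.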

In [6]:
# bigram with padded sequence
[*nltk.bigrams(sequence=pad_seq)]

[('^', 'a'), ('a', 'b'), ('b', 'c'), ('c', '#')]

N-grams and padding together: **nltk.ngrams**
- input: a sequence of tokens, an integer N, and padding params
- output: a generator of padded n-gram tuples

In [7]:
# get same bigram result
[*nltk.ngrams(text[0], pad_left=True, left_pad_symbol='^',
              pad_right=True, right_pad_symbol='#', n=2)]

[('^', 'a'), ('a', 'b'), ('b', 'c'), ('c', '#')]

In [8]:
# three grams
[*nltk.ngrams(text[1], pad_left=True, left_pad_symbol='^',
              pad_right=True, right_pad_symbol='#', n=3)]

[('^', '^', 'a'),
 ('^', 'a', 'c'),
 ('a', 'c', 'd'),
 ('c', 'd', 'c'),
 ('d', 'c', 'e'),
 ('c', 'e', 'f'),
 ('e', 'f', '#'),
 ('f', '#', '#')]
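As a rough check on the output size: padding both sides with N - 1 symbols turns L tokens into L + N - 1 n-grams, so the six tokens above give 6 + 3 - 1 = 8 trigrams. A minimal sketch, with the hypothetical helper `simple_padded_ngrams` standing in for the NLTK call:

```python
def simple_padded_ngrams(sequence, n, left='^', right='#'):
    """Pad both ends with n - 1 symbols, then slide a window of size n."""
    padded = [left] * (n - 1) + list(sequence) + [right] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

trigrams = simple_padded_ngrams(['a', 'c', 'd', 'c', 'e', 'f'], n=3)
print(len(trigrams))
print(trigrams[0], trigrams[-1])
```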

Unigrams up to N-grams: **nltk.everygrams**
- input: a (padded) sequence and the maximum N (`max_len`)
- output: a generator of tuples covering every n-gram from length 1 to N

In [9]:
pad_seq = [*nltk.lm.preprocessing.pad_both_ends(text[1], n=3)]
[*nltk.everygrams(pad_seq, max_len=3)]

[('<s>',),
 ('<s>',),
 ('a',),
 ('c',),
 ('d',),
 ('c',),
 ('e',),
 ('f',),
 ('</s>',),
 ('</s>',),
 ('<s>', '<s>'),
 ('<s>', 'a'),
 ('a', 'c'),
 ('c', 'd'),
 ('d', 'c'),
 ('c', 'e'),
 ('e', 'f'),
 ('f', '</s>'),
 ('</s>', '</s>'),
 ('<s>', '<s>', 'a'),
 ('<s>', 'a', 'c'),
 ('a', 'c', 'd'),
 ('c', 'd', 'c'),
 ('d', 'c', 'e'),
 ('c', 'e', 'f'),
 ('e', 'f', '</s>'),
 ('f', '</s>', '</s>')]
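The same result can be sketched as a sliding window repeated for every length from 1 to N. The helper `simple_everygrams` is hypothetical; it groups n-grams by length as in the output above, and the order in which NLTK itself yields them may differ between versions.

```python
def simple_everygrams(sequence, max_len):
    """Collect all n-grams of length 1..max_len, grouped by length."""
    tokens = list(sequence)
    grams = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams

print(simple_everygrams(['a', 'b', 'c'], max_len=2))
```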

**nltk.lm.preprocessing.flatten**
- Merge list of sequences to a single sequence

In [10]:
[*nltk.lm.preprocessing.flatten(np.zeros((2,2)))]

[0.0, 0.0, 0.0, 0.0]

In [11]:
# the purpose is to have a single sequence for padded docs
[*nltk.lm.preprocessing.flatten(nltk.lm.preprocessing.pad_both_ends(t, n=2) for t in text)]

['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']
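Flattening removes exactly one level of nesting, which is what `itertools.chain.from_iterable` from the standard library does. The sketch below uses that function directly; the wrapper name `simple_flatten` is hypothetical.

```python
from itertools import chain

def simple_flatten(sequences):
    """Concatenate an iterable of sequences into one lazy stream."""
    return chain.from_iterable(sequences)

docs = [['<s>', 'a', '</s>'], ['<s>', 'b', 'c', '</s>']]
print(list(simple_flatten(docs)))
```

Because only one level is removed, more deeply nested input stays nested below the first level.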

Pipeline: **nltk.lm.preprocessing.padded_everygram_pipeline**
- input: the order (our max N) and the docs (a list of token sequences)
- output: a generator of everygram generators (one per doc), and a flattened sequence of all padded tokens

In [12]:
train, pad_corpus = nltk.lm.preprocessing.padded_everygram_pipeline(3, text)

In [13]:
for seq in train:
    print([*seq])
    print()

[('<s>',), ('<s>',), ('a',), ('b',), ('c',), ('</s>',), ('</s>',), ('<s>', '<s>'), ('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>'), ('</s>', '</s>'), ('<s>', '<s>', 'a'), ('<s>', 'a', 'b'), ('a', 'b', 'c'), ('b', 'c', '</s>'), ('c', '</s>', '</s>')]

[('<s>',), ('<s>',), ('a',), ('c',), ('d',), ('c',), ('e',), ('f',), ('</s>',), ('</s>',), ('<s>', '<s>'), ('<s>', 'a'), ('a', 'c'), ('c', 'd'), ('d', 'c'), ('c', 'e'), ('e', 'f'), ('f', '</s>'), ('</s>', '</s>'), ('<s>', '<s>', 'a'), ('<s>', 'a', 'c'), ('a', 'c', 'd'), ('c', 'd', 'c'), ('d', 'c', 'e'), ('c', 'e', 'f'), ('e', 'f', '</s>'), ('f', '</s>', '</s>')]



In [14]:
print([*pad_corpus])

['<s>', '<s>', 'a', 'b', 'c', '</s>', '</s>', '<s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>', '</s>']
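Both return values of the pipeline are lazy and can be consumed only once, so after printing or fitting they are empty and the pipeline has to be called again. The sketch below shows the same behaviour with a plain generator; `lazy_bigrams` is a hypothetical stand-in, not the NLTK function.

```python
def lazy_bigrams(docs):
    """Return a generator that yields one lazy bigram iterator per doc."""
    return (zip(doc, doc[1:]) for doc in docs)

stream = lazy_bigrams([['a', 'b', 'c'], ['a', 'c']])
first_pass = [list(doc) for doc in stream]
second_pass = [list(doc) for doc in stream]  # the generator is already exhausted

print(first_pass)
print(second_pass)
```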


# 2. Prepare Some Data

In [15]:
import requests

In [16]:
url = ('https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/'
       'raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt')
text = requests.get(url).content.decode('utf8')
print(text)

                       Language is never, ever, ever, random

                                                               ADAM KILGARRIFF




Abstract
Language users never choose words randomly, and language is essentially
non-random. Statistical hypothesis testing uses a null hypothesis, which
posits randomness. Hence, when we look at linguistic phenomena in cor-
pora, the null hypothesis will never be true. Moreover, where there is enough
data, we shall (almost) always be able to establish that it is not true. In
corpus studies, we frequently do have enough data, so the fact that a rela-
tion between two phenomena is demonstrably non-random, does not sup-
port the inference that it is not arbitrary. We present experimental evidence
of how arbitrary associations between word frequencies and corpora are
systematically non-random. We review literature in which hypothesis test-
ing has been used, and show how it has often led to unhelpful or mislead-
ing results.
Keywords: 쎲쎲쎲

1. Int

In [17]:
def basic_text_preprocess(text):
    """Lowercase, rejoin words hyphenated across line breaks, and tokenize per sentence."""
    tokenizer = nltk.TreebankWordTokenizer()
    # 'cor-\npora' -> 'corpora'
    text = text.lower().replace('-\n', '')
    # one list of word tokens per sentence
    tokens = [*map(tokenizer.tokenize, nltk.tokenize.sent_tokenize(text))]
    return tokens

tokens = basic_text_preprocess(text)
print(tokens[0])

['language', 'is', 'never', ',', 'ever', ',', 'ever', ',', 'random', 'adam', 'kilgarriff', 'abstract', 'language', 'users', 'never', 'choose', 'words', 'randomly', ',', 'and', 'language', 'is', 'essentially', 'non-random', '.']


In [18]:
# preprocess for a 3-gram model
n = 3
train_data, pad_sents = nltk.lm.preprocessing.padded_everygram_pipeline(n, tokens)

# 3. Training

## 3.1 Vanilla Model

In [19]:
model = nltk.lm.MLE(order=n)
model.fit(train_data, pad_sents)

In [20]:
print(f'size of vocab: {len(model.vocab)}')
print(f'symbol for unknown {model.vocab.unk_label}')

size of vocab: 1315
symbol for unknown <UNK>


In [21]:
# tokens not seen in training are mapped to the unknown label
model.vocab.lookup('not in train'.split())

('not', 'in', '<UNK>')
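The lookup behaves like a membership test with a fallback label. A minimal sketch of that idea, assuming a plain `set` as the vocabulary; `simple_lookup` and `toy_vocab` are hypothetical names and this is not how NLTK stores its vocabulary internally.

```python
def simple_lookup(words, vocab, unk_label='<UNK>'):
    """Replace every word that is not in the vocabulary by unk_label."""
    return tuple(w if w in vocab else unk_label for w in words)

toy_vocab = {'not', 'in', 'language'}
print(simple_lookup('not in train'.split(), toy_vocab))
```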

In [22]:
# counts unigrams
print(model.counts['language'])
# counts bigrams
print(model.counts[['language']]['is'])
# counts trigrams
print(model.counts[['language', 'is']]['never'])

25
11
7


## 3.2 Scoring

In [23]:
model.score('never', ['language', 'is']) # P('never'|'language is')

0.6363636363636364
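The score is a relative frequency: the count of the trigram divided by the count of its context, here 7 / 11 from the counts printed in 3.1. The sketch below reproduces the calculation on the toy documents from section 1; `mle_score` is a hypothetical helper, not the NLTK method.

```python
from collections import Counter

def mle_score(word, context, ngram_counts):
    """count(context + word) / count(all n-grams that start with context)."""
    context = tuple(context)
    total = sum(c for gram, c in ngram_counts.items() if gram[:-1] == context)
    return ngram_counts[context + (word,)] / total if total else 0.0

docs = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

print(mle_score('b', ['a'], bigram_counts))  # 'a' is followed once by 'b' and once by 'c'
print(7 / 11)                                # the counts from section 3.1
```

An unseen continuation gets probability 0 under this estimate, which is the motivation for smoothing.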

## 3.3 Kneser-Ney

In [24]:
# the pipeline outputs are one-shot generators, so rebuild them before fitting a new model
train_data, pad_sents = nltk.lm.preprocessing.padded_everygram_pipeline(n, tokens)
model = nltk.lm.KneserNeyInterpolated(order=n, discount=0.25)
model.fit(train_data, pad_sents)

In [25]:
model.score('never', ['language', 'is'])

0.6189833872775456
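For reference, the textbook form of interpolated Kneser-Ney for a context $c$ and word $w$ is sketched below, with discount $d$ (0.25 here). This is the standard formulation, and NLTK's implementation may differ in details.

$$
P_{KN}(w \mid c) = \frac{\max(\mathrm{count}(c, w) - d,\ 0)}{\mathrm{count}(c)} + \lambda(c)\, P_{\mathrm{cont}}(w),
\qquad
\lambda(c) = \frac{d \cdot \left|\{w' : \mathrm{count}(c, w') > 0\}\right|}{\mathrm{count}(c)}
$$

Here $P_{\mathrm{cont}}(w)$ is the lower-order (continuation) probability, applied recursively for higher orders. Assuming this form, the first term for the example would be $(7 - 0.25) / 11 \approx 0.614$, and the rest of the score comes from the interpolation term. That is consistent with a value slightly below the MLE estimate of $7/11$.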

# 4. Text Generation

In [26]:
print(model.generate(20, random_seed=10))

['mit', 'press', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
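Once the model emits `</s>` it tends to keep emitting it, because nothing else follows the end symbol in the padded training data. One option is to cut the output at the first end symbol. This is a minimal sketch; `truncate_at_end` is a hypothetical helper.

```python
def truncate_at_end(tokens, end_symbol='</s>'):
    """Keep only the tokens before the first end-of-sentence symbol."""
    tokens = list(tokens)
    if end_symbol in tokens:
        return tokens[:tokens.index(end_symbol)]
    return tokens

generated = ['mit', 'press', '.', '</s>', '</s>', '</s>']
print(truncate_at_end(generated))
```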


In [27]:
def process_generated_sent(tokens):
    detokenizer = nltk.tokenize.treebank.TreebankWordDetokenizer()
    content = [*filter(lambda t: t not in ('<s>', '</s>'), tokens)]
    return detokenizer.detokenize(content)

In [31]:
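for s in range(10):
    # 10 to 49 tokens, starting from '<s>' and using s as the random seed
    generated = model.generate(np.random.randint(10, 50), text_seed=['<s>'], random_seed=s)
    # process_generated_sent already returns a string, so no join is needed
    print(process_generated_sent(generated))
    print()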

rayson, paul, geoffrey and roger fallon 1992 computer corpora — what do they tell us about

there is any evidence as a way of providing statistical support for distinguishing associated from non-associated pairs of corpora are set up to

thus the expected values, e, c and d for the non-technical one.

leech, and random events are very large corpora in the scf if ho is not true.

alternatives to inappropriate hypothesis-testing are presented.

but perhaps either cat food was bought in the last paragraph is ‘ indistinguishable ’.

mean error term are far greater than the critical value?

bergen: the norwe gian computing centre for the benefits of large data over sophisticated mathematics to produce a pseudo-random sequence algorithmically ,

where it is in no way critical of using probability models, all from true random samples from the distribution, the sum is over the four cells of the probability of x, for that subset

however, their methods are inevitably noisy, suffering, for examp

In [32]:
model.vocab.lookup('norwe')

'norwe'

In [33]:
text

'                       Language is never, ever, ever, random\n\n                                                               ADAM KILGARRIFF\n\n\n\n\nAbstract\nLanguage users never choose words randomly, and language is essentially\nnon-random. Statistical hypothesis testing uses a null hypothesis, which\nposits randomness. Hence, when we look at linguistic phenomena in cor-\npora, the null hypothesis will never be true. Moreover, where there is enough\ndata, we shall (almost) always be able to establish that it is not true. In\ncorpus studies, we frequently do have enough data, so the fact that a rela-\ntion between two phenomena is demonstrably non-random, does not sup-\nport the inference that it is not arbitrary. We present experimental evidence\nof how arbitrary associations between word frequencies and corpora are\nsystematically non-random. We review literature in which hypothesis test-\ning has been used, and show how it has often led to unhelpful or mislead-\ning results.\n