In [80]:
import nltk  # needed for the one-time nltk.download() call below
import pandas as pd
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten
from nltk import ngrams
from nltk import word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
import re

- `nltk.download('punkt')`

This downloads the Punkt tokenizer model, which `word_tokenize` relies on, to your local system. It only needs to be run once.

In [4]:
# one time use
#nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [49]:
data = pd.read_csv("../data/Codes Summed-Main View.csv")
comment_list = [c for c in data["Qualitative Response"]]

I'm tokenizing and padding over each whole comment. That is, the start and end symbols are added once per comment, not once per sentence. The results might differ if the padding were applied sentence by sentence.
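To make the difference concrete, here is a small sketch using `pad_both_ends` on a made-up comment. The `sentences` data is hypothetical and already tokenized by hand, so it does not depend on the survey file or on the Punkt model.

```python
from nltk.lm.preprocessing import pad_both_ends

order = 3  # trigram model, so n-1 = 2 padding symbols on each side

# Hypothetical comment, already split into two tokenized sentences.
sentences = [["Fix", "the", "roads", "."], ["Add", "buses", "."]]

# Per-comment padding: flatten the sentences first, then pad once.
whole_comment = [tok for sent in sentences for tok in sent]
per_comment = list(pad_both_ends(whole_comment, n=order))

# Per-sentence padding: every sentence gets its own start/end symbols.
per_sentence = [list(pad_both_ends(sent, n=order)) for sent in sentences]

print(per_comment)
print(per_sentence)
```

With per-comment padding, only `"Fix"` ever follows `("<s>", "<s>")`; with per-sentence padding, `"Add"` would also count as a possible first word.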

- `n` sets the order of the n-gram model (`n = 3` gives a trigram model)

In [140]:
n = 3
comment_tokens = [word_tokenize(c) for c in comment_list]
train, vocab = padded_everygram_pipeline(n, comment_tokens)
MLE_model = MLE(n)
MLE_model.fit(train, vocab)

`MLE_model.generate()` returns a list of tokens. The `text_seed` argument is the context it starts generating from. I have it set to

```python
["<s>"] * (n - 1)
```

This produces a list of `"<s>"` tokens that is one shorter than the n-gram order. That matches how the training data was padded: for an n-gram model, each sequence gets n-1 `"<s>"` symbols at the start and n-1 `"</s>"` symbols at the end. E.g. for the following sentence:

```python
["I", "am", "a", "sentence"]
```

For a 3-gram model, it will pad out to:

```python
["<s>", "<s>", "I", "am", "a", "sentence", "</s>","</s>"]
```

So the first trigram is `("<s>", "<s>", "I")`, the second trigram is `("<s>", "I", "am")`, etc. By giving `text_seed` the list `["<s>", "<s>"]`, you're basically telling it to sample a first word in proportion to how often each word starts a comment.

If it then samples `"The"`, it re-runs the sampling over the words that follow `("<s>", "The")` in the training data, and so on.
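As a rough illustration of what the model is sampling from, the sketch below fits the same kind of `MLE` model on a tiny made-up corpus and asks for the probability of each possible first word. The `toy_comments` data and the `toy_model` name are invented for this example and are not part of the survey data.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

order = 3

# Hypothetical, pre-tokenized "comments": two start with "the", one with "a".
toy_comments = [["the", "cat"], ["the", "dog"], ["a", "bird"]]

toy_train, toy_vocab = padded_everygram_pipeline(order, toy_comments)
toy_model = MLE(order)
toy_model.fit(toy_train, toy_vocab)

# Probability of each word given the start-of-comment context.
start_context = ["<s>"] * (order - 1)
p_the = toy_model.score("the", start_context)
p_a = toy_model.score("a", start_context)
print(p_the, p_a)

# After "the" has been sampled, the context slides forward by one token.
p_cat = toy_model.score("cat", ["<s>", "the"])
print(p_cat)
```

Here `"the"` starts two of the three comments, so it gets probability 2/3 as a first word, and `"a"` gets 1/3. `generate()` draws from this kind of distribution at every step, so less frequent openers still turn up some of the time.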

In [141]:
output = " ".join(MLE_model.generate(50, text_seed=["<s>"]*(n-1)))
clean = re.sub(r"(</s> ?)+", "", output)
print(clean)

More affordable housing units close to what leadership for the entire county . People can not afford gas . 


## Uniqueness sanity check

I was worried that with a large enough n, it would just regenerate exact token sequences from the training set. Double-checking that here.

In [267]:
sampled = MLE_model.generate(100, text_seed=["<s>"]*(n-1))
sampled = [token for token in sampled if token != "</s>"]
# This grabs all of the training comments whose first n tokens match the sample's
orig = [c for c in comment_tokens if all(x == y for x, y in zip(c[:n], sampled[:n]))]
print(f"There are {len(orig)} comments that start with this ngram")
print("generated comment:")
print(f'\t-{" ".join(sampled)}')
print("\noriginal comments")
for orig_c in orig:
    print(f'\t-{" ".join(orig_c)}')

There are 1 comments that start with this ngram
generated comment:
	-increase park sizes and street widths . Promote individual ownership rather than sprawl Focus on green spaces and access to community transportation issues and ultimately reduce traffic , making it a better mass transit initiative to improve is to relocate the railroad out of neighborhoods ! Keep improving the accessibility . Housing is too dangerous The lights — there are also unavailable or unsafe in the rain waiting for the environment .

original comments
	-increase park sizes and green space , better improve the recycling capabilities of city waste management , provide alternate traffic options [ bus routes , bike and walking routes ] , host water way cleanings if already done better announcement
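
A complementary check, sketched here as a suggestion and not something the cells above run, is to measure the longest run of generated tokens that appears verbatim in any training comment. The helper name `longest_shared_run` and the toy token lists are hypothetical.

```python
def longest_shared_run(generated, training_comments):
    """Length of the longest contiguous token run in `generated`
    that also occurs contiguously in some training comment."""
    best = 0
    for comment in training_comments:
        # Longest-common-substring dynamic programme, over tokens.
        prev = [0] * (len(comment) + 1)
        for g in generated:
            curr = [0] * (len(comment) + 1)
            for j, c in enumerate(comment, start=1):
                if g == c:
                    curr[j] = prev[j - 1] + 1
                    best = max(best, curr[j])
            prev = curr
    return best


# Toy example: the two sequences share only their first four tokens.
toy_generated = ["increase", "park", "sizes", "and", "street", "widths"]
toy_training = [["increase", "park", "sizes", "and", "green", "space"]]
shared = longest_shared_run(toy_generated, toy_training)
print(shared)  # 4
```

In the notebook this could be called as `longest_shared_run(sampled, comment_tokens)`. A result equal to `len(sampled)` would mean the model reproduced a stretch of a training comment in full, while a result close to `n` suggests it is stitching together short fragments.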
