### Make a Spinner
1. Convert a corpus to trigrams and learn which words show up in similar contexts.
1. Replace some of the words in the trigrams to make a tweaked version of the original corpus.

In [43]:
import nltk
import random
import numpy as np
from bs4 import BeautifulSoup

In [44]:
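# import data: each review is wrapped in a <review_text> tag
# (naming the parser explicitly avoids bs4's "no parser specified" warning)
with open('data/positive.review') as f:
    positive_reviews = BeautifulSoup(f.read(), 'html.parser')
positive_reviews = positive_reviews.find_all('review_text')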

### Create trigrams
* Trigrams are three consecutive words from a string of words or corpus.  
* In this case I'll start by trying to replace the middle word in trigrams.
* From the trigrams, build a dictionary where the *key* is the tuple (before-word, after-word) and the *value* is the list of words found between that pair.

Example:

```{('i', 'this'): ['purchased', 'bought', 'recommend', 'use'], (x, y):[abc, def], ...} ```
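
To make the structure concrete without the review data, here is a minimal sketch on a made-up sentence. The helper name `build_middle_word_index` and the toy sentence are illustrative assumptions, and plain `str.split` stands in for the NLTK tokenizer.

```python
from collections import defaultdict

def build_middle_word_index(tokens):
    """Map (before-word, after-word) -> list of middle words seen between them."""
    index = defaultdict(list)
    # zip over three shifted views of the token list yields every trigram
    for before, middle, after in zip(tokens, tokens[1:], tokens[2:]):
        index[(before, after)].append(middle)
    return dict(index)

# toy sentence, not taken from the review corpus
toy_tokens = "i bought this and i use this".split()
toy_index = build_middle_word_index(toy_tokens)
print(toy_index[('i', 'this')])  # ['bought', 'use']
```

Both middle words end up under the same key, which is what lets one be swapped for the other later.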

In [45]:
trigrams = {} # ex: {('i', 'this'): ['purchased', 'bought', 'recommend', 'use'], (x, y):[abc, def], ...}
for review in positive_reviews:
    s = review.text.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        k = (tokens[i], tokens[i+2])
        if k not in trigrams:
            trigrams[k] = []
        trigrams[k].append(tokens[i+1]) # key = tuple of before-word and after-word, val = list of between words


In [46]:
print(len(tokens), len(positive_reviews)) # note: tokens only holds the last review's tokens
print(len(trigrams))

# look at a couple of trigram middle-words:
print(trigrams[('my', 'and')], '\n') # nouns
print(trigrams[('i', 'this')]) # verbs

33 1000
67295
['area', 'camera', 'files', 'itrip', 'ipod', 'needs', 'mac', 'ipod', 'dvd/cd', 'head', 'ears', 'imac', 'notebook', 't.v', 'cellphone', 'notes', 'office', 'window', 'pc', 'friends', 'mac', 'ear', 'taste', 'netgear', 'phone', 'jazz', 'shoulder', 'pocket', 'gut', 'side', 'walls', 'computer', 'husband', 'movies', 'wife', 'pc', 'wife', 'psp', 'laptop', 'wife', 'stereo', 'order', 'router', 'ipod', 'pocket', 'lap', 'mp3', 'first', 'friend', 'nightstand', 'purchase', 'ipod', 'computer', 'wife', 'ear', 'head', 'husband', 'laptop', 'calendar', 'powerbook', 'quickcam', 'mp500'] 

['purchased', 'bought', 'bought', 'recomend', 'made', 'picked', 'say', 'bought', 'purchased', 'use', 'bought', 'had', 'bought', 'got', 'got', 'purchased', 'think', 'use', 'ordered', 'matched', 'bought', 'think', 'bought', 'picked', 'picked', 'noticed', 'ordered', 'purchased', 'bought', 'use', 'bought', 'purchased', 'bought', 'thought', 'recommend', 'got', 'bought', 'use', 'use', 'bought', 'choose', 'like', 

In [47]:
trigrams_probabilities = {} # key = before & after word tuple, val = dict of middlewords:probabilities
# ex: {('i', 'this'): {'purchased': 0.12422360248447205, 'bought': 0.3105590062111801},...}
for k, words in trigrams.items():
    if len(set(words)) > 1:
        d = {}
        n = 0
        for w in words:
            if w not in d:
                d[w] = 0
            d[w] += 1
            n += 1
        for w, c in d.items():
            d[w] = float(c) / n
        trigrams_probabilities[k] = d

Sample a middle word according to the trigram probabilities

In [48]:
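def random_sample(d):
    # d maps word -> probability; walk the cumulative distribution until it passes r
    r = random.random()
    cumulative = 0
    for w, p in d.items():
        cumulative += p
        if r < cumulative:
            return w
    return w # float rounding can leave the total just under 1, so fall back to the last word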
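
The loop in `random_sample()` is inverse-CDF sampling over a discrete distribution. As a cross-check, here is a sketch of the same idea using the standard library's `random.choices`, which accepts per-item weights. The name `random_sample_choices` and the toy distribution are assumptions made for illustration, not part of the notebook.

```python
import random
from collections import Counter

def random_sample_choices(d):
    """Draw one word from d, a dict mapping word -> probability."""
    words = list(d)
    weights = [d[w] for w in words]
    # random.choices returns a list of k items; take the single draw
    return random.choices(words, weights=weights, k=1)[0]

# made-up distribution, shaped like one value of trigrams_probabilities
toy = {'bought': 0.5, 'purchased': 0.3, 'use': 0.2}

random.seed(0)
n_draws = 10000
counts = Counter(random_sample_choices(toy) for _ in range(n_draws))
for w, p in toy.items():
    # the empirical frequency should land close to the target probability
    print(w, p, counts[w] / n_draws)
```

If the empirical frequencies drift far from the target probabilities, the sampler is the first thing to suspect.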

In [49]:
# test the random_sample() function.  select a review, make substitution, print both for comparison
def test_spinner():
    review = random.choice(positive_reviews)
    s = review.text.lower()
    print('Original: ', s)
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        if random.random() < 0.2:
            k = (tokens[i], tokens[i+2])
            if k in trigrams_probabilities:
                w = random_sample(trigrams_probabilities[k])
                tokens[i+1] = w
    print('Spun: ')
    print(' '.join(tokens).replace(' .','.').replace(" ''", "''").replace(' ,',',').replace('$ ','$').replace(' !','!'))

In [51]:
# inspect results
test_spinner()

Original:  
this map set met all my expectations, and was far more current than i had expected.  software for turn by turn directions on roadways while driving works great with the garmin ctrex gps.  recommended. to load many states you will need large memory card, 1 gb. 

Spun: 
this exact set met all my sansa, and was far more current than i had expected. software for turn by turn back on roadways while driving works great! the garmin handheld gps. recommended accessory to load many adapters you will need large memory card, 1 gb.


### Conclusions
In general, this is not a very good spinner: a lot of the sampled spun reviews are nonsense. Adding part-of-speech (POS) tags or additional previous words as context would probably help, even at the cost of having fewer samples per context to learn from.
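
One way the "additional previous words" idea could look, as an untested-on-the-corpus sketch: key each middle word on the `n_before` preceding words plus the following word. The function `build_context_index`, its `n_before` parameter, and the toy sentence are all hypothetical; with `n_before=1` it builds the same keys as the trigram dictionary above.

```python
from collections import defaultdict

def build_context_index(tokens, n_before=2):
    """Map (n_before previous words..., next word) -> list of middle words."""
    index = defaultdict(list)
    # i is the position of the candidate middle word
    for i in range(n_before, len(tokens) - 1):
        key = tuple(tokens[i - n_before:i]) + (tokens[i + 1],)
        index[key].append(tokens[i])
    return dict(index)

# toy sentence, split on whitespace instead of nltk.word_tokenize
toy_tokens = "i just bought this and i just use this".split()

narrow = build_context_index(toy_tokens, n_before=1)  # same keys as the trigram dict
wide = build_context_index(toy_tokens, n_before=2)    # one extra word of left context

print(narrow[('just', 'this')])    # ['bought', 'use']
print(wide[('i', 'just', 'this')]) # ['bought', 'use']
```

On real data the wider keys will match less often, so fewer positions would be eligible for replacement; that is the trade-off mentioned above.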