In [1]:
from nltk.corpus import movie_reviews

pos_reviews = movie_reviews.sents(categories=['pos'])
neg_reviews = movie_reviews.sents(categories=['neg'])

def review_filter(review):
    # keep only sentences with more than 10 tokens
    return len(review) > 10

pos_reviews = [review for review in pos_reviews if review_filter(review)]
neg_reviews = [review for review in neg_reviews if review_filter(review)]

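Both the "sents" and "raw" accessors of an NLTK corpus typically return text that is already tokenized, with punctuation and contractions split off by whitespace. Note that "sents" yields individual sentences, so each item in pos_reviews and neg_reviews is a single sentence (a list of tokens), not a whole review.

To reflect a more realistic scenario, in which the input is ordinary untokenized text, we will detokenize the reviews.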

In [3]:
import re
from nltk.tokenize.treebank import TreebankWordDetokenizer
twd = TreebankWordDetokenizer()

def detokenize(review):
    detokenized = twd.detokenize(review)
    # TreebankWordDetokenizer doesn't handle contractions such as "there's" or "don't". It will return "there 's" etc.
    detokenized = re.sub(r"'\s([a-z])", r"'\1", detokenized)  # remove whitespace after an apostrophe
    detokenized = re.sub(r"\s([.,?!:;])", r"\1", detokenized)  # remove spaces before punctuation
    detokenized = detokenized.replace('"', '')  # get rid of quotes
    detokenized = re.sub(r"\s+", r" ", detokenized)  # multiple spaces to single space
    return detokenized

print("normal:", " ".join(pos_reviews[0]))
print("detokenized:", detokenize(pos_reviews[0]))

normal: films adapted from comic books have had plenty of success , whether they ' re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there ' s never really been a comic book like from hell before .
detokenized: films adapted from comic books have had plenty of success, whether they're about superheroes (batman, superman, spawn), or geared toward kids (casper) or the arthouse crowd (ghost world), but there's never really been a comic book like from hell before.
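
The regex cleanup steps can also be tried on their own, without NLTK or the corpus. The sketch below copies the four substitutions from detokenize into a helper named cleanup (a hypothetical name used only for this illustration) and applies them to hand-written strings. The sample strings are made up for the example and are not actual TreebankWordDetokenizer output.

```python
import re

def cleanup(text):
    # Same four post-processing steps as in detokenize(), minus the NLTK call.
    text = re.sub(r"'\s([a-z])", r"'\1", text)   # remove whitespace after an apostrophe
    text = re.sub(r"\s([.,?!:;])", r"\1", text)  # remove whitespace before punctuation
    text = text.replace('"', '')                 # drop double quotes
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text

# Hand-written sample (an assumption, not real detokenizer output).
sample = "they' re about superheroes , but there' s never been one before ."
print(cleanup(sample))
# they're about superheroes, but there's never been one before.

# The apostrophe rule cannot tell a contraction from a closing single quote.
print(cleanup("the 'best' film"))
# the 'best'film
```

The second call shows a limitation worth keeping in mind: because the apostrophe rule only looks for an apostrophe followed by whitespace and a lowercase letter, it also glues a closing single quote to the next word. For this corpus that trade-off may well be acceptable, but it is worth checking on your own data.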
