In [None]:
# The NLTK book assumes we import these each time
import nltk, re, pprint
from nltk import word_tokenize
from nltk.book import *

There are several ways to remove punctuation.  The first is to make a list of punctuation marks manually, and keep only the tokens that are not in that list.

In [None]:
punctuation = [':', ',', '.', '!', '\'', '[', ']']
nopunct = [w for w in text6 if w not in punctuation]
' '.join(nopunct[0:100])

The next is to use isalpha(), which keeps only the tokens made up entirely of letters.

In [None]:
alpha = [x for x in text6 if x.isalpha()]
' '.join(alpha[200:300])

For text with numbers, you can alternatively use isalnum() (alphanumeric).

In [None]:
alnum = [x for x in text6 if x.isalnum()]
' '.join(alnum[1000:1100])
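To make the difference between the two tests concrete, here is a minimal sketch on a hand-written token list. The tokens are hypothetical stand-ins chosen for illustration, not an actual slice of text6.

```python
# Hypothetical sample tokens (not taken from text6)
tokens = ['SCENE', '1', ':', 'KING', 'ARTHUR', ',', '932', 'A.D.', "don't", 'CASTLE']

# isalpha() keeps tokens made up only of letters
alpha_only = [t for t in tokens if t.isalpha()]

# isalnum() also keeps tokens made up of digits, or a mix of letters and digits
alnum_only = [t for t in tokens if t.isalnum()]

print(alpha_only)  # ['SCENE', 'KING', 'ARTHUR', 'CASTLE']
print(alnum_only)  # ['SCENE', '1', 'KING', 'ARTHUR', '932', 'CASTLE']
```

Note that both tests also drop tokens with internal punctuation, such as A.D. and don't, which may or may not be what you want.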

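Or use string.punctuation, which is a string constant (not a function) containing the ASCII punctuation characters.  Because "in" on a string is a substring test, this removes single punctuation characters but can keep multi-character punctuation tokens.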

In [None]:
import string
nopunct = [w for w in text6 if w not in string.punctuation]
' '.join(nopunct[0:100])
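The sketch below shows what the membership test against string.punctuation actually does, using a hypothetical token list. The per-character check at the end is one possible alternative, offered as an assumption about what is usually wanted rather than as the required approach.

```python
import string

# string.punctuation is a plain str of ASCII punctuation characters
print(string.punctuation)

# Hypothetical sample tokens (not taken from text6)
tokens = ['Whoa', 'there', '!', '--', '...', 'clop', ',', 'clop']

# "in" on a str is a substring test, so '--' and '...' are kept
substring_test = [t for t in tokens if t not in string.punctuation]

# Checking every character against a set also drops multi-character punctuation
punct_chars = set(string.punctuation)
char_test = [t for t in tokens if not all(ch in punct_chars for ch in t)]

print(substring_test)  # ['Whoa', 'there', '--', '...', 'clop', 'clop']
print(char_test)       # ['Whoa', 'there', 'clop', 'clop']
```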

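Another option is to look words up in a wordlist, but with a dictionary-style list you miss the names of the characters.  The cell below filters against a stopword list instead: it drops common English function words along with punctuation, so the names are kept.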

In [None]:
from nltk.corpus import stopwords
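# Build the stopword set once instead of reloading the list for every token
stop_set = set(stopwords.words('english'))
normalized = [w for w in text6 if w.lower() not in stop_set and w not in string.punctuation]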
' '.join(normalized[0:100])
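As a rough illustration of why the kind of wordlist matters, the sketch below contrasts keeping only words found in a vocabulary with dropping words found in a stopword list. Both sets are tiny hypothetical stand-ins written by hand; real lists would come from a corpus such as the NLTK stopwords used above.

```python
# Hypothetical miniature word lists, for illustration only
vocabulary = {'who', 'are', 'you', 'it', 'is', 'i', 'king', 'of', 'the'}
stop_set = {'who', 'are', 'you', 'it', 'is', 'i', 'of', 'the'}

# Hypothetical sample tokens (not taken from text6)
tokens = ['Who', 'are', 'you', '?', 'It', 'is', 'I', ',',
          'Arthur', ',', 'King', 'of', 'the', 'Britons', '!']

# Keep only tokens found in the vocabulary: proper names are lost
in_vocab = [t for t in tokens if t.lower() in vocabulary]

# Drop stopwords and non-alphabetic tokens: proper names survive
content = [t for t in tokens if t.isalpha() and t.lower() not in stop_set]

print(in_vocab)  # ['Who', 'are', 'you', 'It', 'is', 'I', 'King', 'of', 'the']
print(content)   # ['Arthur', 'King', 'Britons']
```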

Now let's process more gnarly text

In [None]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

First, just split on single spaces (blanks):

In [None]:
print(re.split(r' ',raw))

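In the output above, newlines are still embedded inside some tokens, and the punctuation is not separated from the words.  To fix the whitespace problem, explicitly match tabs and newlines as well as spaces: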

In [None]:
print(re.split(r'[ \t\n]+', raw))
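Here is a small sketch of the difference on a short hypothetical string containing a newline and a tab. It also shows str.split() with no arguments, which splits on any run of whitespace and is a common shortcut for the same job.

```python
import re

# Hypothetical sample with a newline and a tab in it
sample = "not in a very hopeful tone\nthough),\t'I won't"

# Splitting on a single space leaves the newline and tab inside a token
by_space = re.split(r' ', sample)

# Splitting on runs of spaces, tabs, and newlines separates them
by_whitespace = re.split(r'[ \t\n]+', sample)

print(by_space)
print(by_whitespace)
print(sample.split())  # same result as by_whitespace for this sample
```

In both cases the punctuation stays attached to the neighbouring word, which is the problem the following cells address.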

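We can use \w for word characters. For ASCII text this is equivalent to [a-zA-Z0-9_]; in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set. The capital \W+ matches everything **other** than the word class, so splitting on it breaks the text at every run of non-word characters.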

In [None]:
print(re.findall(r'\w+', raw))
print(re.split(r'\W+', raw))
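The two calls are not quite interchangeable. A minimal sketch on a short hypothetical string shows that re.split can return empty strings at the edges, and that \w is Unicode-aware by default in Python 3.

```python
import re

# Hypothetical sample that starts and ends with punctuation
sample = "'I won't,' she said."

words = re.findall(r'\w+', sample)
pieces = re.split(r'\W+', sample)

print(words)   # ['I', 'won', 't', 'she', 'said']
print(pieces)  # ['', 'I', 'won', 't', 'she', 'said', '']

# \w matches the accented letter by default, but not with re.ASCII
unicode_words = re.findall(r'\w+', 'caf\u00e9')
ascii_words = re.findall(r'\w+', 'caf\u00e9', flags=re.ASCII)

print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```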

But now contractions such as *I'M* and *won't* are split up, which is a problem!  We are also losing punctuation that we do want, like the hyphens and apostrophes that carry meaning.

Now let's get more complicated. The pattern below tries four alternatives in order at each position:
* `\w+(?:[-']\w+)*` to match words, including word-internal hyphens and apostrophes
* `'` to match a quote character on its own, so that it is kept separate from the text it encloses
* `[-.(]+` to match the double hyphen, the ellipsis, and open parentheses, which are to be tokenized separately
* `\S\w*` to match any other non-whitespace character (`\S` is the complement of `\s`), optionally followed by word characters

In [None]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))
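To see what the extra alternatives contribute, this sketch compares the full pattern with a simpler two-alternative pattern on a short hypothetical sample. The simpler pattern is only a baseline chosen for the comparison.

```python
import re

# Hypothetical sample with a parenthesis, a double hyphen, and an ellipsis
sample = "(well without--Maybe it's hot-tempered,'..."

# Baseline: a word, or one non-space character followed by word characters
simple = re.findall(r"\w+|\S\w*", sample)

# Full pattern from the cell above
full = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", sample)

print(simple)
# ['(well', 'without', '-', '-Maybe', 'it', "'s", 'hot', '-tempered', ',', "'", '.', '.', '.']
print(full)
# ['(', 'well', 'without', '--', 'Maybe', "it's", 'hot-tempered', ',', "'", '...']
```

The order of the alternatives matters: the regex engine takes the first alternative that matches at a given position, so the most specific patterns come first and the catch-all comes last.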

What does that ?: mean?  Here is a comparison.  In this first findall, the parentheses form a capturing group, so findall returns only the text matched by the group (the suffix) rather than the whole match:

In [None]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

This second one uses ?: to make the group non-capturing.  The parentheses still group the alternatives, but findall now returns the whole match.

In [None]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
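The sketch below puts the two cases side by side and adds a third, with two capturing groups, where findall returns a tuple per match. The suffixes variable and the non-greedy stem group are additions for illustration.

```python
import re

# Same suffix alternatives as above, factored into a variable for reuse
suffixes = r'ing|ly|ed|ious|ies|ive|es|s|ment'

# One capturing group: findall returns the group's text
captured = re.findall(r'^.*(' + suffixes + r')$', 'processing')

# Non-capturing group: findall returns the whole match
whole = re.findall(r'^.*(?:' + suffixes + r')$', 'processing')

# Two capturing groups: findall returns a tuple of (stem, suffix)
both = re.findall(r'^(.*?)(' + suffixes + r')$', 'processing')

print(captured)  # ['ing']
print(whole)     # ['processing']
print(both)      # [('process', 'ing')]
```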

Going back to the complex tokenization pattern, is there an easier way?
nltk.regexp_tokenize() might be worth a shot:

In [None]:
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:[-']\w+)*         # words with optional internal hyphens and apostrophes
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [.,;"'?():-_`]+         # these are separate tokens
'''
print(nltk.regexp_tokenize(raw, pattern))

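This still makes errors. For example, When is separated into two terms because of the initial apostrophe: in the last character class, the unescaped hyphen in :-_ defines a range from the colon to the underscore, which includes the uppercase letters, so the apostrophe and the W are matched together as one punctuation token. Placing the hyphen at the end of the class (or escaping it) avoids this.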
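One possible fix is sketched below using plain re.findall, so that it runs without NLTK. It assumes the intent of the last alternative was to match a literal hyphen rather than a range, and moves the hyphen to the end of the character class accordingly. The variable names and the short sample string are illustrative.

```python
import re

# The last alternative with the hyphen in the middle (forms the range ':' to '_')
range_class = re.compile(r"""[.,;"'?():-_`]+""")
# The same class with the hyphen moved to the end (a literal hyphen)
literal_class = re.compile(r"""[.,;"'?():_`-]+""")

print(range_class.match("'When").group())    # 'W  (apostrophe plus the W)
print(literal_class.match("'When").group())  # '   (apostrophe only)

# Full verbose pattern with non-capturing groups and the literal hyphen
pattern = r'''(?x)
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:[-']\w+)*         # words with optional internal hyphens and apostrophes
  | \$?\d+(?:\.\d+)?%?      # currency and percentages
  | \.\.\.                  # ellipsis
  | [.,;"'?():_`-]+         # punctuation runs as separate tokens
'''
tokens = re.findall(pattern, "'When I'M a Duchess,' she said")
print(tokens)
# ["'", 'When', "I'M", 'a', 'Duchess', ",'", 'she', 'said']
```

The comma and the closing quote still come out as a single token, because the + on the last alternative groups adjacent punctuation characters together. Whether that is acceptable depends on the task.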