# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 02 - Accessing Text from the Web and from Disk

###  Electronic Books
A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and Spanish (with more than 100 texts each).

In [None]:
from urllib.request import urlopen

In [None]:
url = "http://www.gutenberg.org/files/2554/2554.txt"

In [None]:
# Download the book and decode the raw bytes into a string
raw = urlopen(url).read().decode('utf-8')

In [None]:
type(raw)

In [None]:
len(raw)

In [None]:
raw[:75]

In [None]:
import nltk
tokens = nltk.word_tokenize(raw)
type(tokens)

In [None]:
len(tokens)

In [None]:
tokens[:10]

In [None]:
text = nltk.Text(tokens)
type(text)

In [None]:
text[1020:1060]

In [None]:
text.collocations()

In [None]:
raw.find("PART I")

In [None]:
raw.rfind("End of Project Gutenberg's Crime")

In [None]:
raw = raw[5303:1157681]
raw.find("PART I")

## Dealing with HTML
Much of the text on the Web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the later section on files. However, if you’re going to do this often, it’s easiest to get Python to
do the work directly. The first step is the same as before, using urlopen.

In [None]:
from urllib.request import urlopen
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read().decode('utf-8')
html[:60]

In [None]:
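# nltk.clean_html() is no longer implemented in NLTK 3; BeautifulSoup (pip install beautifulsoup4) strips the markup instead
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()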
tokens = nltk.word_tokenize(raw)
tokens
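If installing a third-party HTML library is not an option, the markup can also be stripped with the standard library. The following is only a minimal sketch: the class name `TextExtractor`, the helper `html_to_text()`, and the sample markup are all hypothetical, and real pages usually need more care (navigation menus, entities, encodings).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self._skip = 0          # nesting depth inside script/style elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)

# Made-up sample markup, used here instead of a downloaded page
sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><h1>Blondes</h1><p>A <b>gene</b> story.</p></body></html>")
plain = html_to_text(sample)
print(plain)
```

The resulting string can be passed to `nltk.word_tokenize()` in the same way as the text returned by BeautifulSoup.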

In [None]:
tokens = tokens[96:399]
text = nltk.Text(tokens)
text.concordance('gene')

In [None]:
# small test:
# Search the Web for "the of" (inside quotes). Based on the large count, can we conclude that the of is a frequent collocation in
# English?

## Processing RSS Feeds
The blogosphere is an important source of text, in both formal and informal registers. With the help of a third-party Python library called the Universal Feed Parser, freely downloadable from http://feedparser.org/, we can access the content of a blog, as shown here:

In [None]:
!pip install feedparser
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

In [None]:
len(llog.entries)

In [None]:
post = llog.entries[2]
post.title

In [None]:
content = post.content[0].value
content[:70]

In [None]:
import nltk
from bs4 import BeautifulSoup
# nltk.clean_html() is no longer implemented in NLTK 3, so strip the markup with BeautifulSoup
nltk.word_tokenize(BeautifulSoup(content, 'html.parser').get_text())

## Reading Local Files
In order to read a local file, we need to use Python’s built-in open() function, followed by the read() method. Supposing you have a file document.txt, you can load its contents like this:

In [None]:
f = open('document.txt', 'r')   # the 'U' mode flag was removed in Python 3.11
raw = f.read()

In [None]:
import os
os.listdir('.')

In [None]:
f = open('untitled.txt', 'r')
raw = f.read()

In [None]:
f.read()

In [None]:
f = open('untitled.txt', 'r')
for line in f:
    print(line.strip())

In [None]:
import nltk
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()
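The cells above leave the file objects open. A common alternative is a `with` block, which closes the file automatically. The sketch below is self-contained: the temporary directory and the two sample sentences are assumptions made for illustration only.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on document.txt
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'document.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Time flies like an arrow.\nFruit flies like a banana.\n')

# Read the whole file into one string; the file is closed when the block ends
with open(path, encoding='utf-8') as f:
    raw = f.read()

# Or iterate line by line, stripping the trailing newline from each line
with open(path, encoding='utf-8') as f:
    lines = [line.strip() for line in f]

print(len(raw), lines)
```

Passing `encoding` explicitly avoids depending on the platform default, which differs between operating systems.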

## Extracting Text from PDF, MSWord, and Other Binary Formats
ASCII text and HTML text are human-readable formats. Text often comes in binary formats—such as PDF and MSWord—that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats.

## Capturing User Input
Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a line of input, call the Python function input(). After saving the input to a variable, we can manipulate it just as we have done for other strings.

In [None]:
s = input("Enter some text: ")

In [None]:
import nltk

In [None]:
raw = open('untitled.txt').read()
type(raw)

In [None]:
tokens = nltk.word_tokenize(raw)
type(tokens)

In [None]:
words = [w.lower() for w in tokens]
type(words)

In [None]:
vocab = sorted(set(words))
type(vocab)

The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a string:

In [None]:
vocab.append('blog')
raw.append('blog')

Similarly, we can concatenate strings with strings, and lists with lists, but we cannot
concatenate strings with lists:

In [None]:
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']
query + beatles

# Strings: Text Processing at the Lowest Level
## Basic Operations with Strings

Strings are specified using single quotes or double quotes, as shown in the following code example. If a string contains a single quote, we must backslash-escape the quote so Python knows a literal quote character is intended, or else put the string in double quotes. Otherwise, the quote inside the string will be interpreted as a close quote, and the Python interpreter will report a syntax error:

In [None]:
monty = 'Monty Python'
monty

In [None]:
circus = "Monty Python's Flying Circus"
circus

In [None]:
circus = 'Monty Python\'s Flying Circus'
circus

In [None]:
circus = 'Monty Python's Flying Circus'

In [None]:
couplet = "Shall I compare thee to a Summer's day?"\
          "Thou art more lovely and more temperate:"
print(couplet)    

In [None]:
couplet = ("Rough winds do shake the darling buds of May,"
           "And Summer's lease hath all too short a date.")

In [None]:
print(couplet)

In [None]:
couplet = """Shall I compare thee to a Summer's day?
            Thou art more lovely and more temperate:"""

In [None]:
print(couplet)

In [None]:
couplet = """Rough winds do shake the darling buds of May,
            And Summer's lease hath all too short a date."""

In [None]:
print(couplet)

In [None]:
'very' + 'very' + 'very'

In [None]:
'very' * 3

In [None]:
# small test:
# Try running the following code, then try to use your understanding of the string + and * operations to figure out how it works. 
# Be careful to distinguish between the string ' ', which is a single whitespace character, and '', which is the empty string.
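The snippet that this exercise refers to does not appear in the cell. A possible stand-in is sketched below; the list `a` and the padding formula are assumptions chosen to exercise `+` and `*`, and may differ from the code originally intended.

```python
# Assumed exercise code: each line is built from repeated spaces plus repeated 'very'
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)
```

Note that `' ' * 2 * (7 - i)` repeats a single space, whereas `'' * n` would always yield the empty string.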

In [None]:
'very' - 'y'

In [None]:
'very' / 2

## Printing Strings

So far, whenever we wanted to look at the contents of a variable or see the result of a calculation, we just typed the variable name into the interpreter. We can also display the contents of a variable using the print() function:

In [None]:
print(monty)

In [None]:
grail = 'Holy Grail'
print(monty + grail)

In [None]:
print(monty,grail)

In [None]:
print(monty,"and the",grail)
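The calls above rely on the default behaviour of `print()`: a single space between arguments and a newline at the end. As a small sketch, the keyword arguments `sep` and `end` change both. The `io.StringIO` buffer is used here only so that the output can be inspected as a string; it is not part of the original examples.

```python
import io

monty = 'Monty Python'
grail = 'Holy Grail'

buffer = io.StringIO()                               # captures what print() writes
print(monty, grail, sep=' and the ', file=buffer)    # custom separator
print(monty, end=' / ', file=buffer)                 # no newline after this call
print(grail, file=buffer)
output = buffer.getvalue()
print(output, end='')
```

In Python 3, `end=' '` replaces the trailing comma that Python 2 code used to keep several print statements on one line.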

## Accessing Individual Characters

Strings are indexed, starting from zero. When we index a string, we get one of its characters (or letters). A single character is nothing special—it’s just a string of length 1.

In [None]:
monty[0]

In [None]:
monty[3]

In [None]:
monty[5]

In [None]:
monty[20]

In [None]:
monty[-1]

In [None]:
monty[5]

In [None]:
monty[-7]

In [None]:
sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char)

In [None]:
# count individual character
import nltk
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.keys()

## Accessing Substrings

A substring is any continuous section of a string that we want to pull out for further processing. We can easily access substrings using the same slice notation we used for lists (see Figure 3-2). For example, the following code accesses the substring starting
at index 6, up to (but not including) index 10:

In [None]:
monty[6:10]

In [None]:
monty[-12:-7]

In [None]:
monty[6:]

In [None]:
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')

In [None]:
monty.find('Python')

In [None]:
# small test:
# Make up a sentence and assign it to a variable, e.g., 
# sent = 'my sentence...'. Now write slice expressions to pull out individual words. 
# (This is obviously not a convenient way to process the words of a text!)

## The Difference Between Lists and Strings

Strings and lists are both kinds of sequence. We can pull them apart by indexing and slicing them, and we can join them together by concatenating them. However, we cannot join strings and lists:

In [None]:
query = 'Who knows?'
beatles = ['John','Paul','George','Ringo']
query[2]

In [None]:
beatles[2]

In [None]:
query[:2]

In [None]:
beatles[:2]

In [None]:
query + " I don't "

In [None]:
beatles + 'Brian'

In [None]:
beatles + ['Brian']

In [None]:
# Lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements
beatles[0] = "John Lennon"
del beatles[-1]
beatles

In [None]:
query[0] = 'F'

In [None]:
# small test:
# Consolidate your knowledge of strings by trying some of the exercises on strings at the end of this chapter.

# Text Processing with Unicode
## Extracting Encoded Text from Files

Let’s assume that we have a small text file, and that we know how it is encoded. For example, polish-lat2.txt, as the name suggests, is a snippet of Polish text (from the Polish Wikipedia; see http://pl.wikipedia.org/wiki/Biblioteka_Pruska). This file is encoded as Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for us.

In [None]:
import nltk
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [None]:
import codecs
f = codecs.open(path,encoding = 'latin2')

In [None]:
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))
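To see why the declared encoding matters, the following sketch writes a short Polish phrase as Latin-2 bytes and reads it back twice, once with the right codec and once with the wrong one. The file name `sample-lat2.txt` and the phrase are assumptions made for illustration; they are not part of the NLTK sample data.

```python
import os
import tempfile

text = 'Pruska Biblioteka Pa\u0144stwowa'     # contains U+0144, n with acute

# Write the text as Latin-2 (ISO-8859-2) bytes, as if it came from a legacy source
path = os.path.join(tempfile.mkdtemp(), 'sample-lat2.txt')
with open(path, 'wb') as f:
    f.write(text.encode('iso-8859-2'))

# Decoding with the matching codec recovers the original string
with open(path, encoding='iso-8859-2') as f:
    decoded = f.read()

# Decoding the same bytes as Latin-1 raises no error but yields a wrong character
with open(path, encoding='latin-1') as f:
    misdecoded = f.read()

print(decoded == text, misdecoded == text)
```

The second read fails silently, which is why the encoding of a file has to be known (or detected) rather than guessed.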

In [None]:
ord('a')

In [None]:
a = u'\u0061'

In [None]:
print(a)

In [None]:
nacute = u'\u0144'
nacute

In [None]:
nacute_utf = nacute.encode('utf-8')
print(repr(nacute_utf))

In [None]:
import unicodedata
lines = codecs.open(path,encoding = 'latin2').readlines()
line = lines[2]
print(line.encode('unicode_escape'))

In [None]:
for c in line:
    if ord(c) > 127:
        print('%r U+%04x %s' % (c.encode('utf8'),ord(c),unicodedata.name(c)))

In [None]:
# The next examples illustrate how Python string methods and the re module accept Unicode strings
line.find(u'zosta\u0142y')

In [None]:
line = line.lower()
print(line.encode('unicode_escape'))

In [None]:
import re
m = re.search(r'\u015b\w*', line)   # raw string: the re module itself interprets \u015b and \w

In [None]:
m.group()

In [None]:
# NLTK tokenizers allow Unicode strings as input, and correspondingly yield Unicode strings as output.
nltk.word_tokenize(line)

# Regular Expressions for Detecting Word Patterns
To use regular expressions in Python, we need to import the re library using: import re. We also need a list of words to search; we’ll use the Words Corpus again (Section 2.4). We will preprocess it to remove any proper names.

In [None]:
import re
import nltk
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

## Using Basic Metacharacters
Let’s find words ending with ed using the regular expression «ed$». We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. We need to specify the characters of interest, and use the dollar sign,
which has a special behavior in the context of regular expressions in that it matches the end of the word:

In [None]:
[w for w in wordlist if re.search('ed$',w)]

In [None]:
[w for w in wordlist if re.search('^..j..t..$',w)]
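The same two patterns can be tried without downloading the Words Corpus. The word list below is a small hand-picked sample (an assumption for illustration), so the results are much shorter than those from the real corpus.

```python
import re

# Hand-picked sample words; the notebook itself searches the NLTK Words Corpus
sample_words = ['abandoned', 'editor', 'bed', 'abjectly',
                'adjuster', 'majestic', 'red', 'reduce']

# $ anchors the match at the end of the word
ends_in_ed = [w for w in sample_words if re.search(r'ed$', w)]

# ^ and $ together fix the length: 8 letters, j in third place, t in sixth place
j_and_t = [w for w in sample_words if re.search(r'^..j..t..$', w)]

print(ends_in_ed)
print(j_and_t)
```

Each `.` matches exactly one character, so the second pattern only accepts words of length eight.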

In [None]:
# small test: 
# The caret symbol ^ matches the start of a string, just like the $ matches the end. 
# What results do we get with the example just shown if we leave out both of these, and search for «..j..t..»?

## Ranges and Closures
The expression used in the next cell, «^[ghi][mno][jlk][def]$», works as follows. The first part, «^[ghi]», matches the start of a word followed by g, h, or i. The next part, «[mno]», constrains the second character to be m, n, or o. The third and fourth characters are constrained in the same way, and the final «$» rules out longer words. Only four words in the word list satisfy all these constraints. Note that the order of characters inside the square brackets is not significant, so we could have written «^[hig][nom][ljk][fed]$» and matched the same words.

In [None]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
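A corpus-free version of the same search is sketched below. The candidate list is hypothetical and only serves to show which words pass all four character-class constraints.

```python
import re

# Hypothetical mini word list; the notebook searches the NLTK Words Corpus instead
candidates = ['gold', 'golf', 'hold', 'hole', 'good', 'home', 'golfer']

# One character class per position, anchored at both ends
pattern = r'^[ghi][mno][jlk][def]$'
matches = [w for w in candidates if re.search(pattern, w)]

# Reordering the characters inside the brackets does not change the matches
reordered = [w for w in candidates if re.search(r'^[hig][nom][ljk][fed]$', w)]

print(matches)
```

Here `good` and `home` fail on the third character, and `golfer` fails because `$` requires the word to end after four characters.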

In [None]:
# small test:
# Look for some “finger-twisters,” by searching for words that use only part of the number-pad. 
# For example «^[ghijklmno]+$», or more concisely, «^[g-o]+$», will match words that only use keys 4, 5, 6 in the center row, 
# and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner. What do - and + mean?

In [None]:
# Let’s explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

In [None]:
[w for w in chat_words if re.search('^[ha]+$', w)]

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

In [None]:
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

In [None]:
[w for w in wsj if re.search('^[0-9]{4}$', w)]

In [None]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

In [None]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

In [None]:
[w for w in wsj if re.search('(ed|ing)$', w)]

In [None]:
# small test: 
# Study the previous examples and try to work out what the \, {}, (), and | notations mean before you read on.

## Extracting Word Pieces

The re.findall() (“find all”) method finds all (non-overlapping) matches of the given regular expression. Let’s find all the vowels in a word, then count them:

In [None]:
word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))
fd.items()

In [None]:
# Small Test:
# In the W3C Date Time Format, dates are represented like this: 2009-12-31. 
# Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:
# [int(n) for n in re.findall(?, '2009-12-31')]

## Doing More with Word Pieces
Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, such as glue them back together or plot them.

In [None]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

In [None]:
import nltk
english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

In [None]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

In [None]:
cv_word_pairs = [(cv,w) for w in rotokas_words
                        for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']

In [None]:
cv_index['po']

## Finding Word Stems

When we use a web search engine, we usually don’t mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings, and just deal with word stems.
There are various ways we can pull out the stem of a word. Here’s a simple-minded approach that just strips off anything that looks like a suffix:

In [None]:
def stem(word):
    for suffix in ['ing','ly','ed','ious','ies','ive','es','s','ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

In [None]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'reading')

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

In [None]:
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem,suffix = re.findall(regexp,word)[0]
    return stem

In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
    is no basis for a system of government. Supreme executive power derives from
    a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens]

## Searching Tokenized Text

In [None]:
from nltk.corpus import gutenberg,nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")

In [None]:
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")

In [None]:
chat.findall(r"<l.*>{3,}")

In [None]:
# small test: 
# Consolidate your understanding of regular expression patterns and substitutions using nltk.re_show(p, s), 
# which annotates the string s to show every place where pattern p was matched, and nltk.app.nemo(), 
# which provides a graphical interface for exploring regular expressions. 

In [None]:
from nltk.corpus import brown

In [None]:
hobbies_learned = nltk.Text(brown.words(categories=['hobbies','learned']))

In [None]:
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

In [None]:
# small test:
# Look for instances of the pattern as x as y to discover information about entities and their properties.

# Normalizing Text
In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in turn. First, we need to define the data we will use in this section:

In [None]:
import nltk

In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords 
        is no basis for a system of government. Supreme executive power derives from 
        a mandate from the masses, not from some farcical aquatic ceremony."""

In [None]:
tokens = nltk.word_tokenize(raw)

## Stemmers

The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), whereas the Lancaster stemmer does not.

In [None]:
porter = nltk.PorterStemmer()

In [None]:
lancaster = nltk.LancasterStemmer()

In [None]:
[porter.stem(t) for t in tokens]

In [None]:
class IndexedText(object):
    def __init__(self,stemmer,text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                    for (i, word) in enumerate(text))
   
    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4        # words of context (integer division, so it can be used in slices)
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '%*s' % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print(ldisplay, rdisplay)
            
    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [None]:
porter = nltk.PorterStemmer()

In [None]:
grail = nltk.corpus.webtext.words('grail.txt')

In [None]:
text = IndexedText(porter, grail)
# text.concordance('lie')

## Lemmatization

The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the stemmers just mentioned. Notice that it doesn’t handle lying, but it converts women to woman.

In [None]:
wnl = nltk.WordNetLemmatizer()

In [None]:
[wnl.lemmatize(t) for t in tokens]

# Regular Expressions for Tokenizing Text
## Simple Approaches to Tokenization

The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice’s Adventures in Wonderland:

In [None]:
raw = """'When I'M a Duchess,' she said to herself,(not in a very hopeful tone
    though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
    well without--Maybe it's always pepper that makes people hot-tempered,'..."""

In [None]:
import re

In [None]:
re.split(r' ',raw)

In [None]:
re.split(r'[ \t\n]+',raw)

In [None]:
re.split(r'\W+',raw)

In [None]:
re.findall(r'\w+|\S\w*',raw)

In [None]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*",raw))

## NLTK’s Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall() (as we’ve been using
it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) “verbose flag” tells Python to strip out the embedded whitespace and comments.

In [None]:
text = 'That U.S.A. poster-print costs $12.40...'

In [None]:
pattern = r'''(?x)       # set flag to allow verbose regexps
    (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*       # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
    | \.\.\.             # ellipsis
    | [][.,;"'?():_`-]   # these are separate tokens; includes ], [
    '''

In [None]:
import nltk

In [None]:
nltk.regexp_tokenize(text, pattern)
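The verbose-flag technique does not depend on NLTK. As a sketch, the same kind of pattern can be compiled with `re.VERBOSE` and applied with `findall()`; the groups are written as non-capturing `(?:...)` so that `findall()` returns whole tokens rather than group contents. The names `token_pattern` and `sample` are illustrative only.

```python
import re

token_pattern = re.compile(r'''
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # these are separate tokens
    ''', re.VERBOSE)

sample = 'That U.S.A. poster-print costs $12.40...'
tokens = token_pattern.findall(sample)
print(tokens)
```

Placing the hyphen last inside the final character class keeps it a literal hyphen; written between two other characters it would define a range.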

# Segmentation
## Sentence Segmentation

Manipulating texts at the level of individual words often presupposes the ability to
divide a text into individual sentences. As we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number of words per sentence in the Brown Corpus:


In [None]:
import nltk

In [None]:
len(nltk.corpus.brown.words())/len(nltk.corpus.brown.sents())
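When a text does not come pre-segmented, the sentences have to be found first. The sketch below is a deliberately naive baseline that splits after `.`, `!` or `?` followed by whitespace. It is not how the Punkt tokenizer used in the next cells works, and the function name and sample text are assumptions, but it shows why abbreviations make the task hard.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (naive baseline)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = "Dr. Smith arrived at 5 p.m. on Tuesday. He was late! Was anyone surprised?"
pieces = naive_sent_split(sample)
for piece in pieces:
    print(repr(piece))
```

The text contains three sentences, but the baseline returns five pieces because it also breaks after `Dr.` and `p.m.`.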

In [None]:
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
import pprint

In [None]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')

In [None]:
sent = sent_tokenizer.tokenize(text)

In [None]:
pprint.pprint(sent[171:181])

## Word Segmentation

For some writing systems, tokenizing text is made more difficult by the fact that there is no visual representation of word boundaries. For example, in Chinese, the three-character string: 爱国人 (ai4 “love” [verb], guo3 “country”, ren2 “person”) could be tokenized as 爱国 / 人, “country-loving person,” or as 爱 / 国人, “love country-person.”

A similar problem arises in the processing of spoken language, where the hearer must segment a continuous speech stream into individual words. Consider a language learner who hears utterances with no explicit word boundaries. To study this, we need to find a way to separate text content from the segmentation. We can do this by annotating each character with a boolean value to indicate whether or not a word-break appears after the character. Let’s assume that the learner is given the utterance breaks, since these often correspond to extended pauses. Here is a possible representation, including the initial and target segmentations:

In [None]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

In [None]:
seg1 = "0000000000000001000000000010000000000000000100000000000"

In [None]:
seg2 = "0100100100100001001001000010100100010010000100010010000"

In [None]:
# the segment() function can use them to reproduce the segmented text.
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':               # a word ends after character i
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])            # the final word runs to the end of the text
    return words

In [None]:
segment(text,seg1)

In [None]:
segment(text,seg2)
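The flag-string representation can be checked on a tiny example by converting in both directions. In the sketch below, `segment()` follows the definition used in this section, while `to_segs()` is a hypothetical inverse helper added only for illustration.

```python
def segment(text, segs):
    """Split text after every position whose flag in segs is '1'."""
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def to_segs(words):
    """Hypothetical inverse of segment(): rebuild the flag string from words."""
    flags = ''.join('0' * (len(word) - 1) + '1' for word in words)
    # The last character carries no flag, because the text always ends a word
    return flags[:-1]

small_text = 'seethedog'
small_segs = '00100100'          # one flag per character except the last
small_words = segment(small_text, small_segs)
print(small_words, to_segs(small_words))
```

A flag string is always one character shorter than the text it annotates, which is why `to_segs()` drops the final flag.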

In [None]:
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = len(' '.join(list(set(words))))
    return text_size + lexicon_size

In [None]:
# Computing the cost of storing the lexicon and reconstructing the source text
def evaluate(text,segs):
    words = segment(text,segs)
    text_size = len(words)
    lexicon_size = len(''.join(list(set(words))))
    return text_size + lexicon_size

In [None]:
seg3 = "0000100100000011001000000110000100010000001100010000001"

In [None]:
segment(text,seg3)

In [None]:
evaluate(text,seg3)

In [None]:
evaluate(text,seg2)

In [None]:
evaluate(text,seg1)

In [None]:
from random import randint
def flip(segs,pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs,n):
    for i in range(n):
        segs = flip(segs,randint(0,len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, int(round(temperature)))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()                  
    return segs    

In [None]:
anneal(text, seg3, 5000, 1.2)

# Formatting: From Lists to Strings
## From Lists to Strings

The simplest kind of structured object we use for text processing is lists of words. When we want to output these to a display or a file, we must convert these lists into strings. To do this in Python we use the join() method, and specify the string to be used as the “glue”:

In [None]:
silly = ['we','called','him','Tortoise','because','he','taught','us','.']
' '.join(silly)

In [None]:
';'.join(silly)

In [None]:
''.join(silly)

## Strings and Formats
We have seen that there are two ways to display the contents of an object:

In [None]:
word = 'cat'
sentence = """hello 
    word"""
print(word)

In [None]:
print(sentence)

In [None]:
word

In [None]:
sentence

In [None]:
# Formatted output typically contains a combination of variables and pre-specified strings
import nltk
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in fdist:
    print(word, '->', fdist[word], ';', end=' ')

In [None]:
for word in fdist:
    print('%s->%d;' % (word, fdist[word]), end=' ')

In [None]:
'%s->%d;' % ('cat', 3)

In [None]:
'%s->%d;' % 'cat'

In [None]:
'%s->' % 'cat'

In [None]:
'%d' % 3

In [None]:
'I want a %s right now' % 'coffee'

In [None]:
"%s wants a %s %s" % ("Lee", "sandwich", "for lunch")

In [None]:
template = 'Lee wants a %s right now'
menu = ['sandwich', 'spam fritter', 'pancake']
for snack in menu:
    print(template % snack)

## Lining Things Up
So far, our formatting strings, such as %s and %d, have generated output of arbitrary width on the page (or screen). We can specify a width as well, such as %6s, producing a string that is padded to width 6. It is right-justified by default, but we can include a minus sign to make it left-justified. In case we don’t know in advance how wide a displayed value should be, the width value can be replaced with a star in the formatting string, then specified using a variable.

In [None]:
'%6s' % 'dog'

In [None]:
'%-6s' % 'dog'

In [None]:
width = 6
'%-*s' % (width, 'dog')
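The same padding and justification can also be written with `str.format()` and f-strings, which are the more common spellings in current Python code. The sketch below puts the three styles side by side; the variable names are illustrative only.

```python
word, width = 'dog', 6

# %-style formatting, as used in this section
right = '%6s' % word                 # right-justified in 6 columns
left = '%-6s' % word                 # left-justified in 6 columns
star = '%-*s' % (width, word)        # width supplied at run time

# Equivalent spellings with str.format() and an f-string
right_fmt = '{:>6}'.format(word)
left_f = f'{word:<{width}}'

# Fixed number of decimal places, followed by a literal percent sign
pct = f'{100 * 3205 / 9375:.4f}%'

print(repr(right), repr(left), pct)
```

In an f-string a literal percent sign needs no doubling, unlike the `%%` required in %-style templates.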

In [None]:
count, total = 3205, 9375
"accuracy for %d words: %2.4f%%" % (total, 100 * count / total)

In [None]:
# Frequency of modals in different sections of the Brown Corpus.
def tabulate(cfdist, words, categories):
    print('%-16s' % 'Category', end=' ')                     # column headings
    for word in words:
        print('%6s' % word, end=' ')
    print()
    for category in categories:
        print('%-16s' % category, end=' ')                   # row heading
        for word in words:                                   # for each word
            print('%6d' % cfdist[category][word], end=' ')   # print table cell
        print()                                              # end the row

In [None]:
from nltk.corpus import  brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

In [None]:
genres = ['news','religion','hobbies','science_fiction','romance','humor']
modals = ['can','could','may','might','must','will']
tabulate(cfd,modals,genres)

In [None]:
'%*s' % (15,"Monty Python")

## Writing Results to a File

The following code opens a file output.txt for writing, and saves the program output to the file.

In [None]:
output_file = open('output.txt', 'w')

In [None]:
# Build the vocabulary from the corpus itself (requires the NLTK 'genesis' corpus)
words = set(nltk.corpus.genesis.words('english-kjv.txt'))

In [None]:
for word in sorted(words):
    output_file.write(word + "\n")

In [None]:
len(words)

In [None]:
str(len(words))

In [None]:
output_file.write(str(len(words)) + "\n")

In [None]:
output_file.close()
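The cells above open the file, write to it, and close it in separate steps. A `with` block bundles these steps and closes the file even if an error occurs while writing. The following sketch uses a small stand-in vocabulary and a temporary directory, both of which are assumptions made so that the example runs without any corpus.

```python
import os
import tempfile

words = {'beginning', 'heaven', 'earth', 'light'}    # stand-in vocabulary

out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with open(out_path, 'w', encoding='utf-8') as output_file:
    for word in sorted(words):
        output_file.write(word + '\n')
    # write() only accepts strings, so the count is converted with str() first
    output_file.write(str(len(words)) + '\n')

# Read the file back to confirm what was written
with open(out_path, encoding='utf-8') as f:
    written = f.read().splitlines()

print(written)
```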

## Text Wrapping

When the output of our program is text-like, instead of tabular, it will usually be necessary to wrap it so that it can be displayed conveniently. Consider the following output, which overflows its line, and which uses a complicated print statement:

In [None]:
saying = ['After','all','is','said','and','done',',',
         'more','is','said','than','done','.']

In [None]:
for word in saying:
    print(word, '(' + str(len(word)) + '),', end=' ')

In [None]:
from textwrap import fill

In [None]:
format = '%s(%d),'

In [None]:
pieces = [format % (word, len(word)) for word in saying]

In [None]:
output = ' '.join(pieces)   # join with spaces so that fill() has whitespace to break on

In [None]:
wrapped = fill(output)

In [None]:
print(wrapped)
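`fill()` wraps at 70 columns by default; a different limit can be requested with the `width` argument. The sketch below repeats the pipeline with an explicit width of 40 and a template that puts a space before the parenthesis. Both choices are assumptions made for illustration.

```python
from textwrap import fill

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']
pieces = ['%s (%d),' % (word, len(word)) for word in saying]

# fill() can only break lines at whitespace, so the pieces are joined with spaces
wrapped = fill(' '.join(pieces), width=40)
print(wrapped)
```

Because the template itself contains a space, a line break can fall between a word and its length. Leaving the space out of the template keeps each pair together.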