# 3.   Processing Raw Text

The **most important source of texts** is undoubtedly the **Web**.

It's convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. 

However, you probably **have your own text sources** in mind, and need to learn **how to access them**.

The goal of this chapter is to answer the following questions:

- How can we write programs to **access text from local files** and **from the web**, in order to get hold of an unlimited range of language material?

- How can we **split documents up into individual words and punctuation symbols**, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?

- How can we write programs to **produce formatted output** and **save it in a file**?

In [1]:
# Import the following modules at the start of every chapter's code
from __future__ import division  # Python 2 users only
import nltk, re, pprint
from nltk import word_tokenize

## 3.1   Accessing Text from the Web and from Disk

***Electronic Books***

You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. 

Although 90% of the texts in Project Gutenberg are in English, it includes material in over **50 other languages**, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese and Spanish (with more than 100 texts each).

Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows:

In [2]:
from urllib import request

In [3]:
url = "http://www.gutenberg.org/files/2554/2554-0.txt"

In [4]:
respon = request.urlopen(url)

In [5]:
if respon.code == 200:
    raw = respon.read().decode('utf-8-sig')
else:
    print("Failed to fetch the page; please check your network connection")

In [6]:
type(raw)

str

In [7]:
len(raw)

1176811

In [8]:
raw[:75]

'The Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

For our language processing, we want to break up the string into ***words and punctuation***, as we saw before.

This step is called ***tokenization***, and it produces our familiar structure, ***a list of words and punctuation***.

In [9]:
tokens = word_tokenize(raw)

In [10]:
type(tokens)

list

In [11]:
len(tokens)

257058

In [12]:
tokens[:10]

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

If we now take the further step of **creating an NLTK text from this list**, we can carry out all of the other linguistic processing we saw before, along with the regular list operations like slicing.

In [13]:
text = nltk.Text(tokens)

In [14]:
type(text)

nltk.text.Text

In [15]:
text[1024:1062]

['insight',
 'impresses',
 'us',
 'as',
 'wisdom',
 '...',
 'that',
 'wisdom',
 'of',
 'the',
 'heart',
 'which',
 'we',
 'seek',
 'that',
 'we',
 'may',
 'learn',
 'from',
 'it',
 'how',
 'to',
 'live',
 '.',
 'All',
 'his',
 'other',
 'gifts',
 'came',
 'to',
 'him',
 'from',
 'nature',
 ',',
 'this',
 'he',
 'won',
 'for']

In [16]:
text.collocations() # find collocations

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Project Gutenberg; Ilya
Petrovitch; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens


We cannot reliably **detect where the content begins and ends**, and so have to resort to **manual inspection of the file**, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:

In [17]:
raw.find("*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***")

813

In [18]:
raw.rfind("*** END OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***")

1158048

In [19]:
raw = raw[813:1158048]

<font size=2 style="color:#9B59B6">**Think**</font>:

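Is the `raw` text we now have the result we want? Not quite: check the beginning and the end of `raw`.

If this is not the result we want, how should we continue processing it?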

In [20]:
len('*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***')

65

In [21]:
raw[:10]

'*** START '

In [22]:
raw[-10:]

'\r\n\r\n\r\n\r\n\r\n'

In [23]:
raw = raw[65:].strip()

In [24]:
raw[:100]

'CRIME AND PUNISHMENT\r\n\r\nBy Fyodor Dostoevsky\r\n\r\n\r\n\r\nTranslated By Constance Garnett\r\n\r\n\r\n\r\n\r\nTRANSLA'

In [25]:
raw[-100:]

'into a new unknown life.\r\nThat might be the subject of a new story, but our present story is\r\nended.'

Texts found on the web **may contain unwanted material**, and there may not be an automatic way to remove it. But with **a small amount of extra work** we can extract the material we need.
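The manual steps above (find a start marker, find an end marker, slice, strip) can be wrapped in a small helper. This is only a sketch: the function name `trim_between` and the sample text are made up for illustration, and the marker strings still have to be found by inspecting the file yourself.

```python
def trim_between(raw, start_marker, end_marker):
    """Return the text between two marker strings, without the markers
    themselves and without surrounding whitespace."""
    start = raw.find(start_marker)      # first occurrence of the start marker
    end = raw.rfind(end_marker)         # last occurrence of the end marker
    if start == -1 or end == -1 or end < start:
        raise ValueError("marker not found, or markers in the wrong order")
    return raw[start + len(start_marker):end].strip()

# A tiny stand-in for a downloaded e-book (hypothetical content).
sample = (
    "Header and licence text\r\n"
    "*** START OF THE BOOK ***\r\n"
    "Chapter 1\r\nIt was a dark night.\r\n"
    "*** END OF THE BOOK ***\r\n"
    "Footer text\r\n"
)

body = trim_between(sample, "*** START OF THE BOOK ***", "*** END OF THE BOOK ***")
print(repr(body))
```

Skipping `len(start_marker)` characters is what removes the marker itself, which is the step that was done by hand with `raw[65:]` above.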

***Dealing with HTML***

In [26]:
# China's leading liquor producer reports gains in first two months
# From China Daily
url = 'https://www.chinadaily.com.cn/a/202203/12/WS622c503fa310cdd39bc8c31b.html'

In [27]:
html = request.urlopen(url).read().decode('utf8')

In [28]:
html[:60]

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//E'

You can type `print(html)` to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.

In [29]:
print(html)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>China's leading liquor producer reports gains in first two months - Chinadaily.com.cn</title>
    <meta name="keywords" content="China,Kweichou Moutai" />
    <meta name="description" content="China&#39;s leading liquor producer, Kweichow Moutai, reported rising revenue and profits in the first two months." />
    
      <meta property="og:xi" content="0" />
      <meta property="og:title" content="China&#39;s leading liquor producer reports gains in first two months" />
      <meta property="og:recommend" content="0" />
      <meta property="og:url" content="https://www.chinadaily.com.cn/a/202203/12/WS622c503fa310cdd39bc8c31b.html" />
      <meta property="og:image" content="http://img2.chinadaily.com.cn/images/202203/12/622c503

To **get text out of HTML** we will use a Python library called ***BeautifulSoup***.

In [30]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)

In [31]:
tokens

['China',
 "'s",
 'leading',
 'liquor',
 'producer',
 'reports',
 'gains',
 'in',
 'first',
 'two',
 'months',
 '-',
 'Chinadaily.com.cn',
 'Search',
 'HOME',
 'CHINA',
 'WORLD',
 'BUSINESS',
 'LIFESTYLE',
 'CULTURE',
 'TRAVEL',
 'WATCHTHIS',
 'SPORTS',
 'OPINION',
 'REGIONAL',
 'FORUM',
 'NEWSPAPER',
 'MOBILE',
 'Global',
 'EditionASIA',
 '中文双语Français',
 'HOME',
 'CHINA',
 'WORLD',
 'BUSINESS',
 'LIFESTYLE',
 'CULTURE',
 'TRAVEL',
 'WATCHTHIS',
 'SPORTS',
 'OPINION',
 'REGIONAL',
 'FORUM',
 'NEWSPAPER',
 'China',
 'Daily',
 'PDF',
 'China',
 'Daily',
 'E-paper',
 'MOBILE',
 'Business',
 'Macro',
 'Companies',
 'Industries',
 'Technology',
 'Motoring',
 'China',
 'Data',
 'Finance',
 'Top',
 '10',
 'Home',
 '/',
 'Business',
 '/',
 'Companies',
 'Home',
 'Business',
 'Companies',
 'China',
 "'s",
 'leading',
 'liquor',
 'producer',
 'reports',
 'gains',
 'in',
 'first',
 'two',
 'months',
 'Xinhua',
 '|',
 'Updated',
 ':',
 '2022-03-12',
 '15:48',
 'Share',
 'Share',
 '-',
 'WeChat',


This **still contains unwanted material** concerning site navigation and related stories. With **some trial and error** you can find the start and end indexes of the content and select the tokens of interest, and **initialize a text as before**.

In [32]:
# Inspect the page manually to locate the start string and the end string

start_idx = raw.find('BEIJING -')

In [33]:
end_idx = raw.rfind('by 1.42 percent.') + len('by 1.42 percent.')

In [34]:
raw = raw[start_idx:end_idx]

In [35]:
print(raw)

BEIJING - China's leading liquor producer, Kweichow Moutai, reported rising revenue and profits in the first two months.
The company's net profits attributable to shareholders surged 20 percent year on year to 10.2 billion yuan ($1.61 billion), said a statement the company filed with the Shanghai Stock Exchange.
During the period, operating revenue generated by the company totaled 20.2 billion yuan, up 20 percent from the same period last year, said the statement.
The company attributed the upbeat performance to booming sales of its products during the Chinese New Year Holiday, or the Spring Festival, according to the statement.
As of the close of trading on Friday, Moutai's share price was 1,769.01 yuan, down by 1.42 percent.


***Processing Search Engine Results*** (omitted)

***Processing RSS Feeds*** (omitted)

***Reading Local Files***

Please refer to my *programming basics* course, chapter 9.

Course Homepage: https://zhangjianzhang.github.io/programming_basics/

***Extracting Text from PDF, MSWord and other Binary Formats***

Text can be extracted from PDF, MS Word, and other binary formats with third-party packages; look for one that suits the files you need to process.

***Capturing User Input***

Use the built-in `input()` function.

***The NLP Pipeline***

<div align=center>
<img src="https://www.nltk.org/images/pipeline1.png">
<br>
<center><em><strong>The Processing Pipeline</strong></em></center>
</div>

We open a URL and read its HTML content, remove the markup and select a slice of characters; this is then tokenized and optionally converted into an nltk.Text object; we can also lowercase all the words and extract the vocabulary.
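The pipeline can be sketched end to end on a tiny in-memory page. To keep the example free of downloads and extra packages, it uses stand-ins that are assumptions, not the tools used in this chapter: the standard-library `html.parser` instead of BeautifulSoup, and a simple regular expression instead of `word_tokenize`. The names `TextExtractor` and `html_to_vocab` are hypothetical.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_vocab(html, start_marker, end_marker):
    # 1. Remove the markup.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = ''.join(parser.chunks)
    # 2. Select a slice of characters (assumes both markers occur in raw).
    start = raw.find(start_marker)
    end = raw.rfind(end_marker) + len(end_marker)
    raw = raw[start:end]
    # 3. Tokenize: runs of word characters, or single punctuation symbols.
    tokens = re.findall(r"\w+|[^\w\s]", raw)
    # 4. Lowercase and extract the vocabulary.
    return sorted(set(t.lower() for t in tokens))

page = ("<html><body><p>MENU</p>"
        "<p>The cat saw the dog. The dog ran.</p>"
        "<p>FOOTER</p></body></html>")
vocab = html_to_vocab(page, "The cat", "ran.")
print(vocab)
```

On a real page the markers would again come from manual inspection, and a proper tokenizer would handle cases such as contractions that this regular expression splits crudely.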

## 3.2   Strings: Text Processing at the Lowest Level

The contents of a word, and of a file, are represented by programming languages as a fundamental data type known as a **string**.

- Basic Operations with Strings
- Printing Strings
- Accessing Individual Characters
- Accessing Substrings

Please refer to my *programming basics* course, chapter 3.

Course Homepage: https://zhangjianzhang.github.io/programming_basics/

- The Difference between Lists and Strings

Strings and lists are both kinds of **sequence**. We can pull them apart by **indexing and slicing** them, and we can join them together by **concatenating** them.

Strings are **immutable**. However, lists are **mutable**.
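A short illustration of the difference, using made-up sample values: both types support indexing, slicing, and concatenation, but only the list can be changed in place.

```python
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']

# Indexing, slicing, and concatenation work on both sequence types.
print(query[0], beatles[0])
print(query[:3], beatles[:2])
print(query + ' I do.', beatles + ['Brian'])

# A list is mutable: assigning to an index changes it in place.
beatles[0] = 'John Lennon'

# A string is immutable: the same assignment raises a TypeError.
try:
    query[0] = 'F'
    string_changed = True
except TypeError as err:
    string_changed = False
    print('TypeError:', err)

# To "modify" a string, build a new one instead.
new_query = 'F' + query[1:]
print(new_query)
```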

## 3.3   Text Processing with Unicode

Unicode supports over a million characters. Each character is assigned a number, called a **code point**. For example, `a` has code point `97` and `一` has code point `19968`.

The `ord()` function returns the code point of a character, and `chr()` returns the character for a given code point.

In [36]:
ord('a')

97

In [37]:
ord('一')

19968

In Python, code points are written in the form `\uXXXX`, where XXXX is the number in **4-digit hexadecimal form**.
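As a quick check of the relationship between characters, code points, and escape sequences, the sketch below goes in both directions. The eight-digit `\UXXXXXXXX` form for code points above `U+FFFF` is not covered in the text above and is added here only for completeness.

```python
# chr() is the inverse of ord().
print(chr(97))        # the character with code point 97
print(chr(19968))     # the character with code point 19968

# 19968 in hexadecimal is 4e00, so the escape form is \u4e00.
print(hex(ord('一')))
same = '\u4e00' == '一'
print(same)

# Code points above U+FFFF do not fit in four hex digits;
# Python writes them with the eight-digit form \UXXXXXXXX.
emoji = '\U0001F600'
print(len(emoji), hex(ord(emoji)))
```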

Within a program, we can manipulate Unicode strings just like normal strings. 

However, when Unicode characters are **stored in files** or **displayed on a terminal**, they must be encoded as **a stream of bytes**. 

Some encodings (such as ASCII and Latin-2) use **a single byte per code point**, so they can only support **a small subset of Unicode**, enough for a single language. Other encodings (such as UTF-8) use **multiple bytes** and can **represent the full range of Unicode characters**.

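ASCII and Latin-2 use a single byte (8 bits) per character, so each covers only a small subset of Unicode (128 characters for ASCII, at most 256 for Latin-2), enough for one language or a small group of related languages.

ASCII table: https://tool.oschina.net/commons?type=4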

Text in files will be in a particular encoding, so we need some mechanism for **translating it into Unicode** — translation into Unicode is called **decoding**. 

Conversely, to **write out Unicode to a file or a terminal**, we first need to translate it into a suitable encoding — this translation out of Unicode is called **encoding**.

<div align=center>
<img width="850" height="550" src="https://www.nltk.org/images/unicode.png">
<br>
<center><em><strong>Unicode Decoding and Encoding</strong></em></center>
</div>
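Encoding and decoding can be tried directly on a string, without any file. The sketch below uses one Polish word as sample data; it shows that the same text yields different bytes under different encodings, that ASCII cannot represent it at all, and what happens when bytes are decoded with the wrong encoding.

```python
text = 'Państwowa'

# Encoding: Unicode string -> bytes.
latin2_bytes = text.encode('latin2')   # one byte per character
utf8_bytes = text.encode('utf-8')      # ń takes two bytes in UTF-8
print(latin2_bytes)
print(utf8_bytes)
print(len(text), len(latin2_bytes), len(utf8_bytes))

# Decoding: bytes -> Unicode string, using the SAME encoding.
round_trip = latin2_bytes.decode('latin2')
print(round_trip == text)

# ASCII has no code for ń, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print('UnicodeEncodeError:', err)

# Decoding with the wrong encoding does not fail here; it silently
# produces the wrong character.
wrong = latin2_bytes.decode('latin1')
print(wrong)
```

This is why `open()` needs the correct `encoding` argument: the bytes on disk do not say which encoding produced them.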

***Extracting encoded text from files***

In [38]:
import nltk

In [39]:
# nltk.download('unicode_samples')

In [40]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt') # Polish

In [41]:
path

FileSystemPathPointer('/usr/local/share/nltk_data/corpora/unicode_samples/polish-lat2.txt')

In [42]:
with open(path, encoding='latin2') as f:
    for line in f:
        line = line.strip()
        print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


Convert all **non-ASCII characters** into their **two-digit \xXX** and **four-digit \uXXXX** representations.

In [43]:
with open(path, encoding='latin2') as f:
    for line in f:
        line = line.strip()
        print(line.encode('unicode_escape')) # encode as a unicode-escape sequence

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In Python 3, source code is **encoded using UTF-8 by default**.

We can also define a string with the appropriate Unicode escape sequence.

In [44]:
ord('a')

97

In [45]:
ord('ó')

243

In [46]:
str_1 = 'aó' # define the string by typing the characters directly

In [47]:
hex(97),hex(243)

('0x61', '0xf3')

In [48]:
str_2 = '\u0061\u00f3' # define the string using escape sequences

In [49]:
str_1

'aó'

In [50]:
str_2

'aó'

We can also see how these characters are represented as a sequence of bytes inside a text file.

In essence, a text file stores a sequence of bytes.

In [51]:
str_1.encode('utf-8'), str_2.encode('utf-8')

(b'a\xc3\xb3', b'a\xc3\xb3')

The module `unicodedata` lets us inspect the properties of Unicode characters.

Taking the third line of the file above as an example, we inspect each non-ASCII character: **the character as displayed, its UTF-8 byte sequence, its Unicode code point, and its Unicode name**.

In [52]:
import unicodedata

In [53]:
lines = open(path, encoding='latin2').readlines()

In [54]:
line = lines[2]

In [55]:
print(line.encode('unicode_escape'))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'


In [56]:
print(line)

Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały



In [57]:
for c in line:
    if ord(c) > 127:
        print('{} {} U+{:04x} {}'.format(c, c.encode('utf8'), ord(c), unicodedata.name(c))) # format spec 04x: hexadecimal, width 4, zero-padded on the left

ó b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
Ś b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
ł b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE


***Using your local encoding in Python***

Declare the encoding at the top of a `.py` file, in the form `# -*- coding: utf-8 -*-`.

## 3.4   Regular Expressions for Detecting Word Patterns

Many linguistic processing tasks involve **pattern matching**.

For example, we can find words ending with *ed* using `endswith('ed')`.

**Regular expressions** give us a more powerful and flexible method for describing the character patterns we are interested in.

To use regular expressions in Python we need to import the `re` library using: `import re`.

We also need a list of words to search; we'll use the Words Corpus again.

We will preprocess it to **remove any proper names**.

In [58]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [59]:
wordlist

['a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aardvark',
 'aardwolf',
 'aba',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'abaptiston',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'abastardize',
 'abatable',
 'abate',
 'abatement',
 'abater',
 'abatis',
 'abatised',
 'abaton',
 'abator',
 'abattoir',
 'abature',
 'abave',
 'abaxial',
 'abaxile',
 'abaze',
 'abb',
 'abbacomes',
 'abbacy',
 'abbas',
 'abbasi',
 'abbassi',


***Using Basic Meta-Characters***

Let's find words ending with *ed* using the regular expression **ed$**. 

We will use the `re.search(p, s)` function to check whether the pattern `p` can be found somewhere inside the string `s`.

In [60]:
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri

In a regular expression, `$` matches the end of the string.

Suppose we have room in a crossword puzzle for an **8-letter word** with *j* as its third letter and *t* as its sixth letter. We can find the words that meet these conditions with a regular expression.

In [61]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

In a regular expression, `^` matches the start of the string and `.` matches any single character.
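The effect of the anchors is easiest to see on a small list. The sketch below uses a handful of made-up sample words rather than the full Words Corpus, and compares the crossword pattern with and without `^` and `$`.

```python
import re

words = ['abjectly', 'majestic', 'jet', 'rejected', 'projection']

# $ anchors the pattern at the end of the string.
ends_ed = [w for w in words if re.search('ed$', w)]

# With both anchors, the word must be exactly eight characters long.
anchored = [w for w in words if re.search('^..j..t..$', w)]

# Without anchors, the pattern may match anywhere inside a longer word.
unanchored = [w for w in words if re.search('..j..t..', w)]

print(ends_ed)
print(anchored)
print(unanchored)
```

The word *projection* appears only in the unanchored result, because the eight-character pattern matches the substring *rojectio* inside it.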

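This course is organized around text-processing problems, so here we only show how regular expressions help with some of those problems; we do not cover regular expressions in full.

More detailed tutorials on regular expressions: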

- https://www.runoob.com/python/python-reg-expressions.html
- https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

In a regular expression, `?` marks the preceding character as optional. The regular expression `^e-?mail$` matches both *email* and *e-mail*.

In [62]:
# Count how many times "email" or "e-mail" occurs in text
# http://www.gutenberg.org/files/2554/2554-0.txt
sum(1 for w in text if re.search('^e-?mail$', w))

2
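To see exactly what the optional hyphen allows, the pattern can be tested on a few hand-picked candidates (sample data, not taken from the book). The `re.IGNORECASE` flag in the second query is an addition for illustration.

```python
import re

candidates = ['email', 'e-mail', 'e--mail', 'mail', 'emails', 'Email']

# -? allows zero or one hyphen, and the anchors rule out longer words.
matches = [w for w in candidates if re.search('^e-?mail$', w)]
print(matches)

# Matching is case-sensitive by default; a flag relaxes that.
matches_any_case = [w for w in candidates
                    if re.search('^e-?mail$', w, re.IGNORECASE)]
print(matches_any_case)
```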

***Ranges and Closures***

The **T9 system** is used for entering text on mobile phones, see the figure below. 

Two or more words that are entered with **the same sequence of keystrokes** are known as **textonyms**.

For example, both *hole* and *golf* are entered by pressing the sequence 4653. 

What other words could be produced with the same sequence? Here we use the regular expression `^[ghi][mno][jlk][def]$`

<div align=center>
<img  src="https://www.nltk.org/images/T9.png">
<br>
<center><em><strong>T9: Text on 9 Keys</strong></em></center>
</div>

In [63]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']

The first part of the expression, `^[ghi]`, matches the start of a word followed by *g, h, or i*.

The next part of the expression, `[mno]`, constrains the second character to be *m, n, or o*. 

The third and fourth characters are also constrained.

Only four words satisfy all these constraints. 

Note that **the order of characters inside the square brackets is not significant**, so we could have written `^[hig][nom][ljk][fed]$` and matched the same words.
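Since each key corresponds to one bracketed set, the regular expression for any keystroke sequence can be generated rather than written by hand. The following is a sketch under the assumption of the standard keypad layout (2 = abc, 3 = def, and so on); the names `T9_KEYS`, `t9_pattern`, and `textonyms` and the sample word list are hypothetical.

```python
import re

# Letters on each key of a standard phone keypad (assumed layout).
T9_KEYS = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}

def t9_pattern(digits):
    """Build an anchored regular expression with one character set per key."""
    return '^' + ''.join('[{}]'.format(T9_KEYS[d]) for d in digits) + '$'

def textonyms(digits, words):
    """Return the words that are typed with the given key sequence."""
    pattern = re.compile(t9_pattern(digits))
    return [w for w in words if pattern.search(w)]

sample_words = ['gold', 'golf', 'hold', 'hole', 'home', 'good', 'holes']
print(t9_pattern('4653'))
print(textonyms('4653', sample_words))
```

Passing `wordlist` instead of `sample_words` should give the same kind of query as the cell above, for any key sequence.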

<font size=2 style="color:#BA4A00">**Exercise**</font>

Look for some "finger-twisters", by searching for words that only use part of the number-pad. 

For example `^[ghijklmno]+$`, or more concisely, `^[g-o]+$`, will match words that only use keys 4, 5, 6 in the center row, and `^[a-fj-o]+$` will match words that use keys 2, 3, 5, 6 in the top-right corner. 

 `+` simply means **one or more instances of the preceding item**, which could be an individual character like `m`, a set like `[fed]` or a range like `[d-f]`. 

`*` means **zero or more instances of the preceding item**. The regular expression `^m*i*n*e*$` will match everything that we found using `^m+i+n+e+$`, but also words where some of the letters don't appear at all, e.g. *me, min, and mmmmm*.

In [64]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

In [65]:
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [66]:
[w for w in chat_words if re.search('^[ha]+$', w)]

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

In [67]:
[w for w in chat_words if re.search('^m*i*n*e*$', w)]

['',
 'e',
 'i',
 'in',
 'm',
 'me',
 'meeeeeeeeeeeee',
 'mi',
 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'min',
 'mine',
 'mm',
 'mmm',
 'mmmm',
 'mmmmm',
 'mmmmmm',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',
 'mmmmmmmmmm',
 'mmmmmmmmmmmmm',
 'mmmmmmmmmmmmmm',
 'n',
 'ne']

The `+` and `*` symbols are sometimes referred to as **Kleene closures**, or simply **closures**.

The `^` operator has another function **when it appears as the first character inside square brackets**.

For example, `[^aeiouAEIOU]` matches any character other than a vowel.

We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using `^[^aeiouAEIOU]+$` to find items like these: *:):):)*, *grrr*, *cyb3r* and *zzzzzzzz*.

In [68]:
[w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)]

['!',
 '!!',
 '!!!',
 '!!!!',
 '!!!!!',
 '!!!!!!',
 '!!!!!!!',
 '!!!!!!!!',
 '!!!!!!!!!',
 '!!!!!!!!!!',
 '!!!!!!!!!!!',
 '!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!.',
 '!!!!!.',
 '!!!!....',
 '!!!.',
 '!!.',
 '!!...',
 '!.',
 '!...',
 '!=',
 '!?',
 '!??',
 '!???',
 '"',
 '"...',
 '"?',
 '"s',
 '#',
 '###',
 '####',
 '$',
 '$$',
 '$27',
 '&',
 '&^',
 "'",
 "''",
 "'.",
 "'d",
 "'ll",
 "'m",
 "'n'",
 "'s",
 '(',
 '(((',
 '((((',
 '(((((',
 '((((((',
 '(((((((',
 '((((((((',
 '(((((((((',
 '((((((((((',
 '(((((((((((',
 '((((((((((((',
 '(((((((((((((',
 '((((((((((((((',
 '(((((((((((((((',
 '(((((((((((((((((',
 '((((((((((((((((((',
 '((((((((((((((((((((',
 '(((((((((((((((((((((',
 '(((((((((((((((((((((((',
 '((((((((((((((((((((((((',
 '(((((((((((((((((((((((((',
 '((((((((((((((((((((((((((',
 '((((

Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: `\`, `{}`, `()`, and `|`.

In [69]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [70]:
[w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)]

['0.0085',
 '0.05',
 '0.1',
 '0.16',
 '0.2',
 '0.25',
 '0.28',
 '0.3',
 '0.4',
 '0.5',
 '0.50',
 '0.54',
 '0.56',
 '0.60',
 '0.7',
 '0.82',
 '0.84',
 '0.9',
 '0.95',
 '0.99',
 '1.01',
 '1.1',
 '1.125',
 '1.14',
 '1.1650',
 '1.17',
 '1.18',
 '1.19',
 '1.2',
 '1.20',
 '1.24',
 '1.25',
 '1.26',
 '1.28',
 '1.35',
 '1.39',
 '1.4',
 '1.457',
 '1.46',
 '1.49',
 '1.5',
 '1.50',
 '1.55',
 '1.56',
 '1.5755',
 '1.5805',
 '1.6',
 '1.61',
 '1.637',
 '1.64',
 '1.65',
 '1.7',
 '1.75',
 '1.76',
 '1.8',
 '1.82',
 '1.8415',
 '1.85',
 '1.8500',
 '1.9',
 '1.916',
 '1.92',
 '10.19',
 '10.2',
 '10.5',
 '107.03',
 '107.9',
 '109.73',
 '11.10',
 '11.5',
 '11.57',
 '11.6',
 '11.72',
 '11.95',
 '112.9',
 '113.2',
 '116.3',
 '116.4',
 '116.7',
 '116.9',
 '118.6',
 '12.09',
 '12.5',
 '12.52',
 '12.68',
 '12.7',
 '12.82',
 '12.97',
 '120.7',
 '1206.26',
 '121.6',
 '126.1',
 '126.15',
 '127.03',
 '129.91',
 '13.1',
 '13.15',
 '13.5',
 '13.50',
 '13.625',
 '13.65',
 '13.73',
 '13.8',
 '13.90',
 '130.6',
 '130.7',
 '

In [71]:
[w for w in wsj if re.search('^[A-Z]+\$$', w)]

['C$', 'US$']

In [72]:
[w for w in wsj if re.search('^[0-9]{4}$', w)]

['1614',
 '1637',
 '1787',
 '1901',
 '1903',
 '1917',
 '1925',
 '1929',
 '1933',
 '1934',
 '1948',
 '1953',
 '1955',
 '1956',
 '1961',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1971',
 '1972',
 '1973',
 '1975',
 '1976',
 '1977',
 '1979',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2005',
 '2009',
 '2017',
 '2019',
 '2029',
 '3057',
 '8300']

In [73]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

['10-day',
 '10-lap',
 '10-year',
 '100-share',
 '12-point',
 '12-year',
 '14-hour',
 '15-day',
 '150-point',
 '190-point',
 '20-point',
 '20-stock',
 '21-month',
 '237-seat',
 '240-page',
 '27-year',
 '30-day',
 '30-point',
 '30-share',
 '30-year',
 '300-day',
 '36-day',
 '36-store',
 '42-year',
 '50-state',
 '500-stock',
 '52-week',
 '69-point',
 '84-month',
 '87-store',
 '90-day']

In [74]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

In [75]:
[w for w in wsj if re.search('(ed|ing)$', w)]

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 'Banking',
 'Beginning',
 'Beijing',
 'Being',
 'Bermuda-based',
 'Betting',
 'Boeing',
 'Broadcasting',
 'Bucking',
 'Buying',
 'Calif.-based',
 'Change-ringing',
 'Citing',
 'Concerned',
 'Confronted',
 'Conn.based',
 'Consolidated',
 'Continued',
 'Continuing',
 'Declining',
 'Defending',
 'Depending',
 'Designated',
 'Determining',
 'Developed',
 'Died',
 'During',
 'Encouraged',
 'Encouraging',
 'English-speaking',
 'Estimated',
 'Everything',
 'Excluding',
 'Exxon-owned',
 'Faulding',
 'Fed',
 'Feeding',
 'Filling',
 'Filmed',
 'Financing',
 'Following',
 'Founded',
 'Fracturing',
 'Francisco-based',
 'Fred',
 'Funded',
 'Funding',
 'Generalized',
 'Germany-based',
 'Getting',
 'Guaranteed',
 'Having',
 'Heating',
 'Heightened',
 'Holding',
 'Housing',
 'Illumin

`\` is the escape character: `\.` matches a literal period, and `\$` matches a literal dollar sign.

The braced expressions, like `{3,5}`, specify **the number of repeats of the previous item**.

The **pipe character** (`|`) indicates a choice between the material on **its left or its right**.

**Parentheses** indicate **the scope of an operator**: they can be used together with the pipe (or disjunction) symbol like this: `w(i|e|ai|oo)t`, matching *wit, wet, wait, and woot*. 
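The effect of the parentheses can be checked on a few sample words (made up for this illustration). Without them, the pipe splits the whole pattern, anchors included, into separate alternatives.

```python
import re

words = ['wit', 'wet', 'wait', 'woot', 'what', 'await', 'wt']

# Grouped: w, then one of i / e / ai / oo, then t, and nothing else.
grouped = [w for w in words if re.search('^w(i|e|ai|oo)t$', w)]

# Ungrouped: four independent alternatives:
#   ^wi   or   e   or   ai   or   oot$
ungrouped = [w for w in words if re.search('^wi|e|ai|oot$', w)]

print(grouped)
print(ungrouped)
```

The ungrouped pattern also accepts *await*, simply because it contains *ai* somewhere.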

In [76]:
# Find words that contain ed anywhere, or that end with ing
[w for w in wsj if re.search('ed|ing$', w)]

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 'Banking',
 'Beginning',
 'Beijing',
 'Being',
 'Bermuda-based',
 'Betting',
 'Biedermann',
 'Boeing',
 'Breeden',
 'Broadcasting',
 'Bucking',
 'Buying',
 'Calif.-based',
 'Cathedral',
 'Cedric',
 'Change-ringing',
 'Citing',
 'Concerned',
 'Confederation',
 'Confronted',
 'Conn.based',
 'Consolidated',
 'Continued',
 'Continuing',
 'Credit',
 'Declining',
 'Defending',
 'Depending',
 'Designated',
 'Determining',
 'Developed',
 'Died',
 'During',
 'Encouraged',
 'Encouraging',
 'English-speaking',
 'Estimated',
 'Everything',
 'Excluding',
 'Exxon-owned',
 'Faulding',
 'Fed',
 'Federal',
 'Federalist',
 'Federation',
 'Feeding',
 'Filling',
 'Filmed',
 'Financing',
 'Following',
 'Founded',
 'Fracturing',
 'Francisco-based',
 'Fred',
 'Freddie',
 'Frederick',
 'Frie

The meta-characters we have seen are summarized as follows:

<table border="1" class="docutils" id="tab-regexp-meta-characters1">
<colgroup>
<col width="15%">
<col width="85%">
</colgroup>
<thead valign="bottom">
<tr><th class="head">Operator</th>
<th class="head">Behavior</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><tt class="doctest"><span class="pre">.</span></tt></td>
<td>Wildcard, matches any character</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">^abc</span></tt></td>
<td>Matches some pattern <span class="math">abc</span> at the start of a string</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">abc$</span></tt></td>
<td>Matches some pattern <span class="math">abc</span> at the end of a string</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">[abc]</span></tt></td>
<td>Matches one of a set of characters</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">[A-Z0-9]</span></tt></td>
<td>Matches one of a range of characters</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">ed|ing|s</span></tt></td>
<td>Matches one of the specified strings (disjunction)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">*</span></tt></td>
<td>Zero or more of previous item, e.g. <tt class="doctest"><span class="pre">a*</span></tt>, <tt class="doctest"><span class="pre">[a-z]*</span></tt> (also known as <em>Kleene Closure</em>)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">+</span></tt></td>
<td>One or more of previous item, e.g. <tt class="doctest"><span class="pre">a+</span></tt>, <tt class="doctest"><span class="pre">[a-z]+</span></tt></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">?</span></tt></td>
<td>Zero or one of the previous item (i.e. optional), e.g. <tt class="doctest"><span class="pre">a?</span></tt>, <tt class="doctest"><span class="pre">[a-z]?</span></tt></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{n}</span></tt></td>
<td>Exactly <span class="math">n</span> repeats where n is a non-negative integer</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{n,}</span></tt></td>
<td>At least <span class="math">n</span> repeats</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{,n}</span></tt></td>
<td>No more than <span class="math">n</span> repeats</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{m,n}</span></tt></td>
<td>At least <span class="math">m</span> and no more than <span class="math">n</span> repeats</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">a(b|c)+</span></tt></td>
<td>Parentheses that indicate the scope of the operators</td>
</tr>
</tbody>


</table>

From now on, we will write regular expressions as raw strings, `r'...'`, so that Python passes every backslash through to the `re` module unchanged instead of interpreting it as the start of a string escape sequence.
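The difference matters as soon as a pattern contains a backslash sequence that Python itself understands. The sketch below uses `\b`, which is not introduced in the text above: in an ordinary string it is the backspace character, while in a regular expression it stands for a word boundary.

```python
import re

plain = '\bthe\b'    # ordinary string: each \b becomes ONE backspace character
raw = r'\bthe\b'     # raw string: backslash and b are kept as two characters

print(len(plain), len(raw))

sentence = 'the other day they bathed the cat'

# The raw pattern reaches re intact and finds the whole word "the".
raw_hits = re.findall(raw, sentence)

# The plain pattern looks for literal backspace characters and finds nothing.
plain_hits = re.findall(plain, sentence)

print(raw_hits)
print(plain_hits)
```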

## 3.5   Useful Applications of Regular Expressions

***Extracting Word Pieces***

The `re.findall()` ("find all") function **finds all** (non-overlapping) matches of the given regular expression. Let's find all the vowels in a word, then count them.

In [77]:
word = 'supercalifragilisticexpialidocious'

In [78]:
re.findall(r'[aeiou]', word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

In [79]:
len(re.findall(r'[aeiou]', word))

16

Look for **all sequences of two or more vowels** in some text, and determine their relative frequency.

In [80]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [81]:
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))

In [82]:
fd.most_common(12)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]

<font size=2 style="color:#BA4A00">**Exercise**</font>

Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers `[2009, 12, 31]`:

`[int(n) for n in re.findall(?, '2009-12-31')]`

***Doing More with Word Pieces***

English text is still fairly easy to read when **word-internal vowels are left out**.

For example, *declaration* becomes *dclrtn*, and *inalienable* becomes *inlnble*, **retaining any initial or final vowel sequences**. 

The regular expression in our next example matches **initial vowel sequences**, **final vowel sequences**, and **all consonants**; **everything else is ignored**.

This three-way disjunction is processed **left-to-right**: as soon as one of the three parts matches at the current position, any later parts of the regular expression are ignored.

We use `re.findall()` to extract all the matching pieces, and `''.join()` to join them together. 

In [83]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

In [84]:
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

In [85]:
compress('problem')

'prblm'
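
To see how the three alternatives share the work, it can help to look at the individual pieces before they are joined. This sketch reuses the same pattern on the two words mentioned earlier plus one extra word chosen for illustration.

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

for word in ['declaration', 'inalienable', 'outrage']:
    pieces = re.findall(regexp, word)   # initial vowels, final vowels, or single consonants
    print(word, pieces, ''.join(pieces))
```

Word-internal vowels match none of the three alternatives, so `re.findall()` simply skips over them.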

In [86]:
english_udhr = nltk.corpus.udhr.words('English-Latin1')

In [87]:
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


Next, let's combine regular expressions with **conditional frequency distributions**.

Here we will extract all **consonant-vowel sequences** from the words of Rotokas, such as *ka* and *si*.

Since each of these is a **pair**, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair.

In [88]:
# nltk.download('toolbox')

In [89]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')

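`re.findall()` scans the string from left to right in a single pass and never revisits characters consumed by an earlier match. Quantifiers are greedy by default, that is, they take the longest substring the pattern allows.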

In [90]:
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [91]:
cvs

['ka',
 'ka',
 'ka',
 'ka',
 'ka',
 'ro',
 'ka',
 'ka',
 'vi',
 'ko',
 'ka',
 'ka',
 'vo',
 'ka',
 'ka',
 'ko',
 'ka',
 'ka',
 'si',
 'ka',
 'ka',
 'ka',
 'ka',
 'ko',
 'ka',
 'ki',
 'to',
 'ka',
 'ku',
 'pa',
 'to',
 'ka',
 'va',
 'ka',
 'pa',
 'ka',
 'pe',
 'ka',
 'pi',
 'ka',
 'pi',
 'ka',
 'pi',
 'pa',
 'to',
 'ka',
 'pi',
 'si',
 'ka',
 'pi',
 'si',
 'vi',
 'ra',
 'ka',
 'po',
 'ka',
 'po',
 'pa',
 'to',
 'ka',
 'ra',
 'ka',
 're',
 'ka',
 're',
 'ko',
 'ka',
 're',
 'ko',
 'pi',
 'ka',
 're',
 'to',
 're',
 'va',
 'ka',
 'va',
 'ka',
 'va',
 'ka',
 've',
 'ka',
 'ka',
 've',
 'ka',
 'pi',
 'ka',
 've',
 'ka',
 'pi',
 'vi',
 'ra',
 'ka',
 've',
 'ka',
 'vi',
 'ra',
 'ka',
 'ka',
 'ka',
 'ka',
 'ka',
 'ka',
 'ka',
 'ka',
 'ro',
 'ka',
 'ka',
 'ka',
 'ka',
 'so',
 'to',
 'ka',
 'ka',
 'vi',
 'ra',
 'ka',
 'ke',
 'ru',
 'ka',
 'pa',
 'ka',
 'pi',
 'ka',
 'pi',
 'ka',
 'pi',
 'vi',
 'ra',
 'ka',
 're',
 'si',
 'ka',
 're',
 'si',
 'vi',
 'ra',
 'ka',
 'tu',
 'ka',
 'tu',
 'pi',
 'ka',

In [92]:
cfd = nltk.ConditionalFreqDist(cvs)

In [93]:
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 
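
The reason a list of two-character strings can initialize the table is that each string unpacks into a (condition, sample) pair. The sketch below mimics that idea with standard-library containers; it is a stand-in for illustration, not NLTK's implementation, and the short `sample_cvs` list is invented.

```python
from collections import Counter, defaultdict

# A few made-up consonant-vowel strings standing in for the real cvs list.
sample_cvs = ['ka', 'ka', 'si', 'to', 'ka', 'ko', 'si']

table = defaultdict(Counter)
for consonant, vowel in sample_cvs:   # a two-character string unpacks into a pair
    table[consonant][vowel] += 1

for consonant in sorted(table):
    print(consonant, dict(table[consonant]))
```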


In [94]:
cv_word_pairs = [(cv, w) for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [95]:
cv_word_pairs

[('ka', 'kaa'),
 ('ka', 'kaa'),
 ('ka', 'kaa'),
 ('ka', 'kaakaaro'),
 ('ka', 'kaakaaro'),
 ('ro', 'kaakaaro'),
 ('ka', 'kaakaaviko'),
 ('ka', 'kaakaaviko'),
 ('vi', 'kaakaaviko'),
 ('ko', 'kaakaaviko'),
 ('ka', 'kaakaavo'),
 ('ka', 'kaakaavo'),
 ('vo', 'kaakaavo'),
 ('ka', 'kaakaoko'),
 ('ka', 'kaakaoko'),
 ('ko', 'kaakaoko'),
 ('ka', 'kaakasi'),
 ('ka', 'kaakasi'),
 ('si', 'kaakasi'),
 ('ka', 'kaakau'),
 ('ka', 'kaakau'),
 ('ka', 'kaakauko'),
 ('ka', 'kaakauko'),
 ('ko', 'kaakauko'),
 ('ka', 'kaakito'),
 ('ki', 'kaakito'),
 ('to', 'kaakito'),
 ('ka', 'kaakuupato'),
 ('ku', 'kaakuupato'),
 ('pa', 'kaakuupato'),
 ('to', 'kaakuupato'),
 ('ka', 'kaaova'),
 ('va', 'kaaova'),
 ('ka', 'kaapa'),
 ('pa', 'kaapa'),
 ('ka', 'kaapea'),
 ('pe', 'kaapea'),
 ('ka', 'kaapie'),
 ('pi', 'kaapie'),
 ('ka', 'kaapie'),
 ('pi', 'kaapie'),
 ('ka', 'kaapiepato'),
 ('pi', 'kaapiepato'),
 ('pa', 'kaapiepato'),
 ('to', 'kaapiepato'),
 ('ka', 'kaapisi'),
 ('pi', 'kaapisi'),
 ('si', 'kaapisi'),
 ('ka', 'kaapisivi

In [96]:
cv_index = nltk.Index(cv_word_pairs)

In [97]:
cv_index['su']

['kasuari']

In [98]:
cv_index['po']

['kaapo',
 'kaapopato',
 'kaipori',
 'kaiporipie',
 'kaiporivira',
 'kapo',
 'kapoa',
 'kapokao',
 'kapokapo',
 'kapokapo',
 'kapokapoa',
 'kapokapoa',
 'kapokapora',
 'kapokapora',
 'kapokaporo',
 'kapokaporo',
 'kapokari',
 'kapokarito',
 'kapokoa',
 'kapoo',
 'kapooto',
 'kapoovira',
 'kapopaa',
 'kaporo',
 'kaporo',
 'kaporopa',
 'kaporoto',
 'kapoto',
 'karokaropo',
 'karopo',
 'kepo',
 'kepoi',
 'keposi',
 'kepoto']
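
An index of this kind maps each key to the list of all values paired with it. The following sketch builds a comparable structure with `collections.defaultdict`; the four-word `sample_words` list is hypothetical and stands in for the Rotokas lexicon.

```python
import re
from collections import defaultdict

sample_words = ['kaapo', 'kapo', 'kasuari', 'kepoto']   # hypothetical mini word list

index = defaultdict(list)
for w in sample_words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        index[cv].append(w)       # every word containing this CV pair

print(index['po'])
print(index['su'])
```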

***Finding Word Stems***

When we use a search engine, a query for *laptops* typically finds documents containing *laptop* and vice versa. 

Indeed, laptop and laptops are **just two forms of the same dictionary word (or lemma)**. 

For some language processing tasks we want to ignore word endings, and just deal with word stems.

There are various ways we can pull out the stem of a word. 

Here's a simple-minded approach which **just strips off anything that looks like a suffix**:

In [99]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

Although we will **ultimately use NLTK's built-in stemmers**, it's interesting to see how we can ***use regular expressions for this task***. 

Our first step is to **build up a disjunction of all the suffixes**. 

We need to **enclose it in parentheses** in order to **limit the scope of the disjunction**.

In [100]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

Here, `re.findall()` just gave us the suffix even though the regular expression matched the entire word. 

This is because the parentheses have **a second function**, to **select substrings to be extracted**. 

If we want to use the parentheses to **specify the scope of the disjunction**, but not to select the material to be output, we **have to add `?:`** right after the opening parenthesis, which makes the group non-capturing:

In [101]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

However, we'd actually like to **split the word into stem and suffix**. 

So we should just **parenthesize both parts of the regular expression**:

In [102]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

This looks promising, but still has a problem. Let's look at a different word, *processes*:

In [103]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

The regular expression incorrectly found an -s suffix instead of an -es suffix.

This demonstrates another subtlety: **the star operator is "greedy"** and the .* part of the expression tries to **consume as much of the input as possible**.

If we use the **"non-greedy" version** of the star operator, written `*?`, we get what we want:

In [104]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]
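
The two quantifiers can be compared side by side. This sketch compiles a greedy and a non-greedy variant of the same pattern and applies both to a few words picked for illustration.

```python
import re

suffixes = r'(ing|ly|ed|ious|ies|ive|es|s|ment)'
greedy = re.compile(r'^(.*)' + suffixes + r'$')    # stem takes as much as it can
lazy = re.compile(r'^(.*?)' + suffixes + r'$')     # stem takes as little as it can

for word in ['processes', 'ladies', 'processing']:
    print(word, greedy.findall(word), lazy.findall(word))
```

The variants disagree only when a shorter suffix is itself the tail of a longer one, as with *-s* and *-es*.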

This works even when we **allow an empty suffix**, by making the content of **the second parentheses optional**, so that words without any suffix are handled too:

In [105]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

This approach **still has many problems** (can you spot them?) but we will move on to define a function to perform stemming, and apply it to a whole text.

In [106]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'process')

[('proces', 's')]

In [107]:
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

In [108]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."""

In [109]:
raw

'DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.'

In [110]:
tokens = word_tokenize(raw)

In [111]:
[stem(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

Notice that our regular expression **removed the s from ponds but also from is and basis**. 

It **produced some non-words like distribut and deriv**, but these are acceptable stems in some applications.

***Searching Tokenized Text***

You can use **a special kind of regular expression** for **searching across multiple words** in a text (**where a text is a list of tokens**).

For example, `<a> <man>` finds all instances of *a man* in the text. 
    
The **angle brackets** are used to mark **token boundaries**, and any whitespace between the angle brackets is ignored (**behaviors that are unique to NLTK's findall() method for texts**).
    
In the following example, we include `<.*>` which will **match any single token**, and enclose it in parentheses so only the matched word (e.g. monied) and not the matched phrase (e.g. a monied man) is produced. 

In [112]:
from nltk.corpus import gutenberg, nps_chat

In [113]:
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))

In [114]:
moby.findall(r"<a> (<.*>) <man>")

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


In [115]:
chat = nltk.Text(nps_chat.words())

In [116]:
chat.findall(r"<.*> <.*> <bro>")

you rule bro; telling you bro; u twizted bro


In [117]:
chat.findall(r"<l.*>{3,}")

lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la


For more practice, try some of the exercises on regular expressions at the end of this chapter.

It is easy to build search patterns when the linguistic phenomenon we're studying is tied to particular words.

For instance, searching a large text corpus for expressions of the form **x and other ys** allows us to discover **hypernyms**.

In [118]:
from nltk.corpus import brown

In [119]:
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))

In [120]:
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


With **enough text**, this approach would give us a useful store of information about **the taxonomy of objects**, without the need for any manual labor.

However, our search results will usually contain **false positives**, i.e. cases that we would want to exclude. 

For example, the result: *demands and other factors* suggests that demand is an instance of the type factor, but this sentence is actually about wage demands. 

Nevertheless, we could construct our own ontology of English concepts by **manually correcting the output of such searches**.

<font size=2 style="color:#BA4A00">**Exercise**</font>

Look for instances of the pattern as x as y to discover information about entities and their properties.

## 3.6   Normalizing Text

In earlier program examples we have often converted text to **lowercase** before doing anything with its words, e.g. `set(w.lower() for w in text)`. 

By using lower(), we have **normalized the text to lowercase** so that the distinction between The and the is ignored. 

Often we want to **go further** than this, and **strip off any affixes**, a task known as **stemming**.

A **further step** is to make sure that **the resulting form is a known word in a dictionary**, a task known as **lemmatization**. 

In [121]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords \
is no basis for a system of government.  Supreme executive power derives from \
a mandate from the masses, not from some farcical aquatic ceremony."""

In [122]:
raw

'DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government.  Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.'

In [123]:
tokens = word_tokenize(raw)

***Stemmers***

NLTK includes several **off-the-shelf stemmers**, and if you ever need a stemmer **you should use one of these** in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases.

The **Porter** and **Lancaster** stemmers follow their own rules for stripping affixes. 

Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer does not.
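
A quick way to compare the two stemmers is to run both over the same tokens. This sketch assumes NLTK is installed (neither stemmer needs downloaded corpus data) and uses a hand-picked subset of the tokens from the example sentence.

```python
import nltk

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

# A hand-picked subset of the tokens from the example sentence.
sample_tokens = ['women', 'lying', 'ponds', 'distributing', 'swords', 'basis']

for t in sample_tokens:
    print(t, porter.stem(t), lancaster.stem(t))
```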

Stemming is **not a well-defined process**, and we typically pick the stemmer that best suits the application we have in mind. 

The **Porter Stemmer** is a good choice if you are **indexing some texts** and want to **support search using alternative forms of words**.

In [124]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # Map each stemmed, lowercased word to the positions where it occurs
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            # max() stops the left slice from wrapping around near the start of the text
            lcontext = ' '.join(self._text[max(0, i-wc):i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [125]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


In [126]:
text.concordance('lies')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


In [127]:
text.concordance('lying')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


***Lemmatization***

The WordNet lemmatizer only **removes affixes** if **the resulting word is in its dictionary**. 

This additional checking process makes the lemmatizer **slower than the above stemmers**.

In [128]:
wnl = nltk.WordNetLemmatizer()

In [129]:
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

The **WordNet lemmatizer** is a good choice if you want to **compile the vocabulary of some texts** and want a list of valid lemmas (or lexicon headwords).

## 3.7   Regular Expressions for Tokenizing Text

Tokenization is the task of **cutting a string into identifiable linguistic units** that constitute a piece of language data. 

Now that you are familiar with **regular expressions**, you can learn how to **use them to tokenize text**, and to have much more control over the process.

***Simple Approaches to Tokenization***

The very simplest method for tokenizing text is to **split on whitespace**.

In [130]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone \
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very \
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

In [131]:
raw

"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone though), 'I won't have any pepper in my kitchen AT ALL. Soup does very well without--Maybe it's always pepper that makes people hot-tempered,'..."

In [132]:
raw.split()

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

In [133]:
re.split(r'\s+', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

In [134]:
re.split(r'\s+', raw) == raw.split()

True

Splitting on whitespace gives us tokens like **'(not' and 'herself,'**.

We can use `\W` in a simple regular expression to split the input on **anything other than a word character**.

In [135]:
# \W matches any non-word character, i.e. [^a-zA-Z0-9_] for ASCII text
re.split(r'\W+', raw)

['',
 'When',
 'I',
 'M',
 'a',
 'Duchess',
 'she',
 'said',
 'to',
 'herself',
 'not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 'I',
 'won',
 't',
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 'Maybe',
 'it',
 's',
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 'tempered',
 '']

Observe that this gives us empty strings at the start and the end (to understand why, try doing `'xx'.split('x')`).

In [136]:
'xx'.split('x')

['', '', '']

We get the same tokens, but without the empty strings, with `re.findall(r'\w+', raw)`, using a pattern that matches the words instead of the spaces. 

In [137]:
re.findall(r'\w+', raw)

['When',
 'I',
 'M',
 'a',
 'Duchess',
 'she',
 'said',
 'to',
 'herself',
 'not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 'I',
 'won',
 't',
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 'Maybe',
 'it',
 's',
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 'tempered']

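We can extend the regular expression to cover a wider range of cases. The pattern `\w+|\S\w*` first tries to match a sequence of word characters; if that fails, it matches any non-whitespace character (`\S` is the complement of `\s`) followed by further word characters. Punctuation is therefore kept and grouped with any letters that follow it (as in `'s`), while runs of two or more punctuation characters are split apart.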

In [138]:
re.findall(r'\w+|\S\w*', raw)

["'When",
 'I',
 "'M",
 'a',
 'Duchess',
 ',',
 "'",
 'she',
 'said',
 'to',
 'herself',
 ',',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 ')',
 ',',
 "'I",
 'won',
 "'t",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 '.',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 '-',
 '-Maybe',
 'it',
 "'s",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 '-tempered',
 ',',
 "'",
 '.',
 '.',
 '.']

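After a few more refinements we arrive at the pattern below. `\w+(?:[-']\w+)*` permits word-internal hyphens and apostrophes, `'` matches a quotation mark on its own, `[-.(]+` keeps runs such as `--` and `...` together, and `\S\w*` remains as the fallback.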

In [139]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
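
Patterns of this size are easier to read when written with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments. The sketch below restates the same pattern in that form; the `tokenize` helper and the short sample string are introduced here for illustration.

```python
import re

token_pattern = re.compile(r"""
    \w+(?:[-']\w+)*   # words, with optional internal hyphens or apostrophes
  | '                 # a quotation mark on its own
  | [-.(]+            # runs of hyphens, periods or opening parentheses
  | \S\w*             # fallback: any other non-space character plus word characters
""", re.VERBOSE)

def tokenize(text):
    return token_pattern.findall(text)

print(tokenize("'I won't,' she said--hot-tempered..."))
```

Because the only group is non-capturing, `findall()` returns the whole match for each token.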


Some useful regular expression symbols are listed below:

<table border="1" class="docutils" id="tab-re-symbols">
<colgroup>
<col width="14%">
<col width="86%">
</colgroup>
<thead valign="bottom">
<tr><th class="head">Symbol</th>
<th class="head">Function</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><tt class="doctest"><span class="pre">\b</span></tt></td>
<td>Word boundary (zero width)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\d</span></tt></td>
<td>Any decimal digit (equivalent to <tt class="doctest"><span class="pre">[0-9]</span></tt>)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\D</span></tt></td>
<td>Any non-digit character (equivalent to <tt class="doctest"><span class="pre">[^0-9]</span></tt>)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\s</span></tt></td>
<td>Any whitespace character (equivalent to <tt class="doctest"><span class="pre">[ \t\n\r\f\v]</span></tt>)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\S</span></tt></td>
<td>Any non-whitespace character (equivalent to <tt class="doctest"><span class="pre">[^ \t\n\r\f\v]</span></tt>)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\w</span></tt></td>
<td>Any alphanumeric character or underscore (equivalent to <tt class="doctest"><span class="pre">[a-zA-Z0-9_]</span></tt> for ASCII text)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\W</span></tt></td>
<td>Any character that is not alphanumeric or an underscore (equivalent to <tt class="doctest"><span class="pre">[^a-zA-Z0-9_]</span></tt> for ASCII text)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\t</span></tt></td>
<td>The tab character</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">\n</span></tt></td>
<td>The newline character</td>
</tr>
</tbody>


</table>
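
The bracketed equivalents in the table describe ASCII text. In Python 3, `\w` applied to a `str` also matches letters outside the ASCII range unless the `re.ASCII` flag is given. A minimal sketch, with a sample string chosen for illustration:

```python
import re

sample = 'naïve café'

unicode_words = re.findall(r'\w+', sample)            # default: Unicode-aware
ascii_words = re.findall(r'\w+', sample, re.ASCII)    # restrict \w to [a-zA-Z0-9_]

print(unicode_words)
print(ascii_words)
```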

***NLTK's Regular Expression Tokenizer***

The function `nltk.regexp_tokenize()` is similar to `re.findall()` (as we've been using it for tokenization). 

However, `nltk.regexp_tokenize()` is **more efficient for this task**, and avoids the need for special treatment of parentheses. 

In [140]:
text = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

In [141]:
nltk.regexp_tokenize(text, pattern = r'\w+|\$[\d\.]+|\S+')

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [142]:
nltk.regexp_tokenize(text, pattern=r'\s+', gaps=True)

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'Thanks.']
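
The two calls differ in what the pattern describes: the tokens themselves, or the gaps between them. The same contrast can be sketched with the standard `re` module, using `findall()` for token patterns and `split()` for gap patterns. This is an analogy for illustration, not a description of how NLTK implements the function.

```python
import re

text = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# The pattern describes the tokens.
by_tokens = re.findall(r'\w+|\$[\d\.]+|\S+', text)

# The pattern describes the gaps; drop any empty strings left at the edges.
by_gaps = [t for t in re.split(r'\s+', text) if t]

print(by_tokens)
print(by_gaps)
```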

## 3.8   Segmentation

**Tokenization** is an instance of **a more general problem** of **segmentation**. 

In this section we will look at **two other instances of this problem**, which use radically different techniques to the ones we have seen so far in this chapter.

***Sentence Segmentation***

In most cases, the text is only available as **a stream of characters**. 

Before tokenizing the text into words, we need to **segment it into sentences**.

NLTK facilitates this by including the **Punkt sentence segmenter**.

Here is an example of its use in segmenting the text of a novel.

In [143]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[79:89])

['"Nonsense!"',
 'said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\n'
 'railway trains look so sad and tired, so very sad and tired?',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\n'
 'is because they know that whatever place they have taken a ticket\n'
 'for that place they will reach.',
 'It is because after they have\n'
 'passed Sloane Square they know that the next station must be\n'
 'Victoria, and nothing but Victoria.',
 'Oh, their wild rapture!',
 'oh,\n'
 'their eyes like stars and their souls again in Eden, if the next\n'
 'station were unaccountably Baker Street!"',
 '"It is you who are unpoetical," replied the poet Syme.']


Notice that this example is **really a single sentence**, reporting the speech of Mr Lucian Gregory.

However, **the quoted speech contains several sentences**, and these have been split into individual strings. 

This is reasonable behavior for most applications.

In [144]:
print(' '.join(sents[79:89]))

"Nonsense!" said Gregory, who was very rational when anyone else
attempted paradox. "Why do all the clerks and navvies in the
railway trains look so sad and tired, so very sad and tired? I will
tell you. It is because they know that the train is going right. It
is because they know that whatever place they have taken a ticket
for that place they will reach. It is because after they have
passed Sloane Square they know that the next station must be
Victoria, and nothing but Victoria. Oh, their wild rapture! oh,
their eyes like stars and their souls again in Eden, if the next
station were unaccountably Baker Street!" "It is you who are unpoetical," replied the poet Syme.


Sentence segmentation is difficult because the period is also used to mark abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.
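
The difficulty shows up as soon as we try a naive rule. The sketch below splits after every period, exclamation mark or question mark that is followed by whitespace. It is a deliberately simplistic baseline for illustration, not how the Punkt segmenter works, and the sample sentence is made up.

```python
import re

sample = "Mr. Smith moved to the U.S.A. in 1990. He likes it there."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sents = re.split(r'(?<=[.!?])\s+', sample)

for s in naive_sents:
    print(repr(s))
```

The abbreviations *Mr.* and *U.S.A.* each trigger a spurious break, so two sentences come out as four fragments.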

***Word Segmentation***

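For some writing systems, tokenizing text is made more **difficult** by the fact that **there is no visual representation of word boundaries**. Written Chinese and Japanese are two examples.

For instance, the three-character Chinese string 爱国人 can be segmented in two ways:

- 爱 / 国人 ("love" + "compatriots")
- 爱国 / 人 ("patriotic" + "person")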

A similar problem arises in the **processing of spoken language**, where the hearer must **segment a continuous speech stream into individual words**.

A particularly challenging version of this problem arises **when we don't know the words in advance**. 

This is the problem faced by a language learner, such as **a child hearing utterances from a parent**. 

Consider the following artificial example, where **word boundaries have been removed**:


a.		doyouseethekitty

b.		seethedoggy

c.		doyoulikethekitty

d.		likethedoggy

Our first challenge is simply to **represent the problem**: we need to find a way to separate text content from the segmentation.

We can do this by **annotating each character** with a **boolean value** to indicate whether or not **a word-break appears after the character** (an idea that will be used heavily for "chunking").

Let's assume that the learner is **given the utterance breaks**, since these often correspond to **extended pauses**. 

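Speakers pause naturally between utterances, so the hearer knows where each utterance ends. These utterance boundaries can therefore serve as the initial segmentation of the text that is to be split into words.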

Here is a possible representation, including the **initial and target segmentations**:

In [145]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

In [146]:
# Split the text into a list of words according to the segmentation string
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

In [147]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

In [148]:
segment(text, seg1)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

In [149]:
segment(text, seg2)

['do',
 'you',
 'see',
 'the',
 'kitty',
 'see',
 'the',
 'doggy',
 'do',
 'you',
 'like',
 'the',
 'kitty',
 'like',
 'the',
 'doggy']
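
The representation can also be read in the other direction: a list of words determines a bit string with a `1` after the last character of every word except the final one. The helper `to_segs` below is a hypothetical inverse of `segment()`, added here only to make the encoding concrete.

```python
def segment(text, segs):
    # Same logic as the segment() function above.
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

def to_segs(words):
    # One bit per character: '1' marks a word break after that character.
    bits = ''.join('0' * (len(word) - 1) + '1' for word in words)
    return bits[:-1]    # no bit is stored for the final character of the text

words = ['do', 'you', 'see']
segs = to_segs(words)
print(segs)
print(segment(''.join(words), segs))
```

The bit string is always one character shorter than the text, which matches the lengths of `seg1` and `seg2` above.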

Now the segmentation task becomes **a search problem**: find the bit string that causes the text string to be correctly segmented into words.

We assume the learner is acquiring words and storing them in an internal lexicon.

Given a suitable lexicon, it is possible to reconstruct the source text as a sequence of lexical items. 

<font size=2 style="color:#2ECC71">**Example**</font>

***Non-Deterministic Search Using Simulated Annealing***

Following <a href="https://www.sciencedirect.com/science/article/abs/pii/S0010027796007196" target="_blank">(Brent, 1995)</a>:

We can define an **objective function**, a scoring function whose value we will try to optimize, based on **the size of the lexicon** (number of characters in the words plus an extra delimiter character to mark the end of each word) and **the amount of information needed to reconstruct the source text from the lexicon**. We illustrate this in the following figure.

<div align=center>
<img  src="https://www.nltk.org/images/brent.png">
<br>
</div>

**Calculation of Objective Function**: 

- Given a hypothetical segmentation of the source text (on the left);

- Derive a lexicon and a derivation table that permit the source text to be reconstructed;

- Total up **the number of characters** used by each lexical item (including a boundary marker) and **the number of lexical items** used by each derivation, to serve as a score of the quality of the segmentation;

- **Smaller values** of the score indicate a better segmentation.

It is a simple matter to implement this objective function:

In [150]:
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size

In [151]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"
segment(text, seg3)

['doyou',
 'see',
 'thekitt',
 'y',
 'see',
 'thedogg',
 'y',
 'doyou',
 'like',
 'thekitt',
 'y',
 'like',
 'thedogg',
 'y']

In [152]:
evaluate(text, seg3)

47

In [153]:
evaluate(text, seg2)

48

In [154]:
evaluate(text, seg1)

64
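
To see where these three scores come from, the sketch below splits each one into its two components. It assumes `segment()` behaves like the version defined earlier in the chapter (a `1` at index `i` places a boundary after character `i`); it is reproduced here only so the block runs on its own. The helper `score_parts` is hypothetical.

```python
def segment(text, segs):
    # Assumed behaviour of the chapter's segment(): '1' at index i
    # means a word boundary after character i
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == '1':
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def score_parts(text, segs):
    """Return (text_size, lexicon_size, total) for one segmentation."""
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size, lexicon_size, text_size + lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
candidates = {
    'seg1': "0000000000000001000000000010000000000000000100000000000",
    'seg2': "0100100100100001001001000010100100010010000100010010000",
    'seg3': "0000100100000011001000000110000100010000001100010000001",
}

parts = {name: score_parts(text, segs) for name, segs in candidates.items()}
for name, (text_size, lexicon_size, total) in parts.items():
    print(name, text_size, lexicon_size, total)
```

The phrase-level `seg1` needs only 4 derivation entries but pays 60 for its lexicon of long, unrepeated items, whereas `seg2` and `seg3` spend more on the derivation (16 and 14 entries) and far less on the lexicon (32 and 33).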

The final step is to **search for the pattern of zeros and ones** that **minimizes this objective function**. 

Notice that the best segmentation **includes "words" like thekitty**, since **there's not enough evidence** in the data to split this any further.

In [155]:
from random import randint

# Flip a single bit (0 <-> 1) at position pos
def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

# Flip n randomly chosen bits
def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    print('Initial temperature: {}'.format(temperature))
    # Stop once the temperature has dropped to 0.5 or below
    while temperature > 0.5:
        print('Current Temperature: {}'.format(temperature))
        best_segs, best = segs, evaluate(text, segs)
        # Run `iterations` trials at each temperature
        for i in range(iterations):
            # The temperature sets the number of flips per trial:
            # the lower the temperature, the fewer the flips
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        # Cool down (anneal)
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs

In [156]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)

Initial temperature: 55.0
Current Temperature: 55.0
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 45.833333333333336
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 38.19444444444445
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 31.82870370370371
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 26.523919753086425
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 22.103266460905356
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 18.419388717421132
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 15.349490597850943
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
Current Temperature: 12.791242164875786
61 ['doy', 'ouseeth', 

'0010000000000000001000000010010000000000000000010000000'

The search begins with phrase segmentations only.

In each trial it randomly perturbs the zeros and ones, with the number of flips proportional to the "temperature".

After each round of trials the temperature is lowered, so the perturbation of boundaries is reduced.

As this search algorithm is non-deterministic, you may see a slightly different result on each run. Because it is a heuristic, it finds only a good approximation of the optimal solution.
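
If you need repeatable runs (for example, to compare cooling rates), one option is to drive the search from a seeded random number generator. The sketch below is a hypothetical variant, `anneal_seeded`, not the function used above: it performs the same search silently and takes a `seed` argument. `segment()` is assumed to behave like the version defined earlier in the chapter and is reproduced so the block is self-contained.

```python
import random

def segment(text, segs):
    # Assumed behaviour of the chapter's segment()
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == '1':
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def evaluate(text, segs):
    words = segment(text, segs)
    return len(words) + sum(len(word) + 1 for word in set(words))

def flip(segs, pos):
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos + 1:]

def anneal_seeded(text, segs, iterations, cooling_rate, seed):
    # Private generator: the same seed always yields the same search
    rng = random.Random(seed)
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for _trial in range(iterations):
            guess = segs
            for _flip in range(round(temperature)):
                guess = flip(guess, rng.randint(0, len(segs) - 1))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        segs = best_segs
        temperature /= cooling_rate
    return segs

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"

run_a = anneal_seeded(text, seg1, 200, 1.2, seed=42)
run_b = anneal_seeded(text, seg1, 200, 1.2, seed=42)
print(run_a == run_b)                                  # identical runs
print(evaluate(text, run_a) <= evaluate(text, seg1))   # never worse than the start
```

Since a guess is accepted only when it strictly improves on the best score seen so far, the score can never rise from one temperature step to the next. Seeding changes nothing about the quality of the result; it only makes a given run reproducible.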

With **enough data**, it is possible to automatically segment text into words with **a reasonable degree of accuracy**.

Such methods **can be applied to** tokenization for writing systems that **don't have any visual representation of word boundaries**.

## 3.9   Formatting: From Lists to Strings

Please refer to Chapter 3 and Chapter 9 of my *programming basics* course:

- `join` method
- string formatting
- print to file
- write results to file
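
As a quick reminder of those four topics, here is a minimal sketch. The word list, the column widths and the temporary output path are illustrative assumptions, not taken from the course material.

```python
import os
import tempfile

words = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

# join: list of strings -> one string
sentence = ' '.join(words)

# String formatting: word left-aligned in 8 columns, length right-aligned in 3
rows = ['{:<8}{:>3}'.format(word, len(word)) for word in words]

# Write to a file in a temporary directory (hypothetical location)
path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with open(path, 'w', encoding='utf-8') as out:
    print(sentence, file=out)           # print() appends the newline itself
    out.write('\n'.join(rows) + '\n')   # write() needs explicit newlines

# Read the file back to check what was saved
with open(path, encoding='utf-8') as saved:
    lines = saved.read().splitlines()

print(lines[0])
print(lines[1])
```

Using `with` closes the file automatically, so the content is flushed to disk before it is read back.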

## 3.10   Summary

<ul class="simple">
<li>In this book we view a text as a list of words.  A "raw text" is a potentially
long string containing words and whitespace formatting, and is how we
typically store and visualize a text.</li>
<li>A string is specified in Python using single or double quotes: <tt class="doctest"><span class="pre"><span class="pysrc-string">'Monty Python'</span></span></tt>, <tt class="doctest"><span class="pre"><span class="pysrc-string">"Monty Python"</span></span></tt>.</li>
<li>The characters of a string are accessed using indexes, counting from zero:
<tt class="doctest"><span class="pre"><span class="pysrc-string">'Monty Python'</span>[0]</span></tt> gives the value <tt class="doctest"><span class="pre">M</span></tt>.  The length of a string is
found using <tt class="doctest"><span class="pre">len()</span></tt>.</li>
<li>Substrings are accessed using slice notation: <tt class="doctest"><span class="pre"><span class="pysrc-string">'Monty Python'</span>[1:5]</span></tt>
gives the value <tt class="doctest"><span class="pre">onty</span></tt>.  If the start index is omitted, the
substring begins at the start of the string; if the end index is omitted,
the slice continues to the end of the string.</li>
<li>Strings can be split into lists: <tt class="doctest"><span class="pre"><span class="pysrc-string">'Monty Python'</span>.split()</span></tt> gives
<tt class="doctest"><span class="pre">[<span class="pysrc-string">'Monty'</span>, <span class="pysrc-string">'Python'</span>]</span></tt>.  Lists can be joined into strings:
<tt class="doctest"><span class="pre"><span class="pysrc-string">'/'</span>.join([<span class="pysrc-string">'Monty'</span>, <span class="pysrc-string">'Python'</span>])</span></tt> gives <tt class="doctest"><span class="pre"><span class="pysrc-string">'Monty/Python'</span></span></tt>.</li>
<li>We can read text from a file <tt class="doctest"><span class="pre">input.txt</span></tt> using <tt class="doctest"><span class="pre">text = open(<span class="pysrc-string">'input.txt'</span>).read()</span></tt>.
We can read text from <tt class="doctest"><span class="pre">url</span></tt> using <tt class="doctest"><span class="pre">text = request.urlopen(url).read().decode(<span class="pysrc-string">'utf8'</span>)</span></tt>.
We can iterate over the lines of a text file using <tt class="doctest"><span class="pre"><span class="pysrc-keyword">for</span> line <span class="pysrc-keyword">in</span> open(f)</span></tt>.</li>
<li>We can write text to a file by opening the file for writing
<tt class="doctest"><span class="pre">output_file = open(<span class="pysrc-string">'output.txt'</span>, <span class="pysrc-string">'w'</span>)</span></tt>, then adding content to the
file <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span>(<span class="pysrc-string">"Monty Python"</span>, file=output_file)</span></tt>.</li>
<li>Texts found on the web may contain unwanted material (such as headers, footers, markup),
that need to be removed before we do any linguistic processing.</li>
<li>Tokenization is the segmentation of a text into basic units — or tokens —
such as words and punctuation.
Tokenization based on whitespace is inadequate for many applications because it
bundles punctuation together with words.
NLTK provides an off-the-shelf tokenizer <tt class="doctest"><span class="pre">nltk.word_tokenize()</span></tt>.</li>
<li>Lemmatization is a process that maps the various forms of a word (such as <span class="example">appeared</span>, <span class="example">appears</span>)
to the canonical or citation form of the word, also known as the lexeme or lemma (e.g. <span class="lex">appear</span>).</li>
<li>Regular expressions are a powerful and flexible method of specifying
patterns. Once we have imported the <tt class="doctest"><span class="pre">re</span></tt> module, we can use
<tt class="doctest"><span class="pre">re.findall()</span></tt> to find all substrings in a string that match a pattern.</li>
<li>If a regular expression string includes a backslash, you should tell Python not to
preprocess the string, by using a raw string with an <tt class="doctest"><span class="pre">r</span></tt> prefix: <tt class="doctest"><span class="pre">r<span class="pysrc-string">'regexp'</span></span></tt>.</li>
<li>When backslash is used before certain characters, e.g. <tt class="doctest"><span class="pre">\n</span></tt>, this takes on
a special meaning (newline character); however, when backslash is used
before regular expression wildcards and operators, e.g. <tt class="doctest"><span class="pre">\.</span></tt>, <tt class="doctest"><span class="pre">\|</span></tt>, <tt class="doctest"><span class="pre">\$</span></tt>,
these characters <span class="emphasis">lose</span> their special meaning and are matched literally.</li>
<li>A string formatting expression <tt class="doctest"><span class="pre">template % arg_tuple</span></tt> consists of a
format string <tt class="doctest"><span class="pre">template</span></tt> that contains conversion specifiers
like <tt class="doctest"><span class="pre">%-6s</span></tt> and <tt class="doctest"><span class="pre">%0.2d</span></tt>.</li>
</ul>
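
Several of the points above can be checked in a few lines. This is only a sketch: the sample sentence with prices and the values in the formatted row are made up for illustration.

```python
import re

s = 'Monty Python'
first, middle = s[0], s[1:5]     # indexing and slicing
parts = s.split()                # string -> list
joined = '/'.join(parts)         # list -> string

# Without the r prefix, Python turns \n into a single newline character;
# with it, the backslash and the n are kept as two characters
plain_len, raw_len = len('\n'), len(r'\n')

# Escaped \$ and \. match a literal dollar sign and a literal full stop
prices = re.findall(r'\$\d+\.\d\d', 'It costs $12.40 or $3.99.')

# %-6s left-justifies a string in six columns,
# %4d right-justifies an integer in four
row = '%-6s|%4d' % ('dog', 7)

print(first, middle, parts, joined)
print(plain_len, raw_len, prices)
print(row)
```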

<font size=2 style="color:#BA4A00">**Exercise**</font>

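Fetch the Report on the Work of the Government from the following URL and segment it into words.

http://www.gov.cn/guowuyuan/zfgzbg.htm

Reference code follows:

<center><font size=2 style="color:#FF0000"><strong>Do not use crawlers to scrape public data at large scale or high frequency</strong></font></center>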

In [157]:
import requests
from bs4 import BeautifulSoup

In [158]:
url = 'http://www.gov.cn/guowuyuan/2021zfgzbg.htm'

response = requests.get(url)

In [159]:
# Under normal conditions, the response's status code is 200
response.status_code

200

In [160]:
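if response.status_code == 200:
    # Pass the raw bytes and let BeautifulSoup work out the encoding
    cont = response.content
    soup = BeautifulSoup(cont, 'lxml')
    # The report body is the <div> with this id
    report_div = soup.find('div', {'id': 'conlun2_box_text'})
    report = report_div.text.strip()
else:
    print("Failed to fetch the page; please check your network connection")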

In [161]:
type(report)

str

In [162]:
len(report)

17486

In [163]:
report[:100]

'各位代表：\n现在，我代表国务院，向大会报告政府工作，请予审议，并请全国政协委员提出意见。\n一、2020年工作回顾\n过去一年，在新中国历史上极不平凡。面对突如其来的新冠肺炎疫情、世界经济深度衰退等多重严'

In [164]:
report[-100:]

'中国特色社会主义思想为指导，齐心协力，开拓进取，努力完成全年目标任务，以优异成绩庆祝中国共产党百年华诞，为把我国建设成为富强民主文明和谐美丽的社会主义现代化强国、实现中华民族伟大复兴的中国梦不懈奋斗！'

In [165]:
print(report)

各位代表：
现在，我代表国务院，向大会报告政府工作，请予审议，并请全国政协委员提出意见。
一、2020年工作回顾
过去一年，在新中国历史上极不平凡。面对突如其来的新冠肺炎疫情、世界经济深度衰退等多重严重冲击，在以习近平同志为核心的党中央坚强领导下，全国各族人民顽强拼搏，疫情防控取得重大战略成果，在全球主要经济体中唯一实现经济正增长，脱贫攻坚战取得全面胜利，决胜全面建成小康社会取得决定性成就，交出一份人民满意、世界瞩目、可以载入史册的答卷。全年发展主要目标任务较好完成，我国改革开放和社会主义现代化建设又取得新的重大进展。
在艰辛的抗疫历程中，党中央始终坚持人民至上、生命至上，习近平总书记亲自指挥、亲自部署，各方面持续努力，不断巩固防控成果。我们针对疫情形势变化，及时调整防控策略，健全常态化防控机制，有效处置局部地区聚集性疫情，最大限度保护了人民生命安全和身体健康，为恢复生产生活秩序创造必要条件。
一年来，我们贯彻党中央决策部署，统筹推进疫情防控和经济社会发展，主要做了以下工作。
一是围绕市场主体的急需制定和实施宏观政策，稳住了经济基本盘。面对历史罕见的冲击，我们在“六稳”工作基础上，明确提出“六保”任务，特别是保就业保民生保市场主体，以保促稳、稳中求进。立足国情实际，既及时果断又保持定力，坚持不搞“大水漫灌”，科学把握规模性政策的平衡点。注重用改革和创新办法，助企纾困和激发活力并举，帮助受冲击最直接且量大面广的中小微企业和个体工商户渡难关。实施阶段性大规模减税降费，与制度性安排相结合，全年为市场主体减负超过2.6万亿元，其中减免社保费1.7万亿元。创新宏观政策实施方式，对新增2万亿元中央财政资金建立直达机制，省级财政加大资金下沉力度，共同为市县基层落实惠企利民政策及时补充财力。支持银行定向增加贷款并降低利率水平，对中小微企业贷款延期还本付息，大型商业银行普惠小微企业贷款增长50%以上，金融系统向实体经济让利1.5万亿元。对大企业复工复产加强“点对点”服务。经过艰苦努力，我们率先实现复工复产，经济恢复好于预期，全年国内生产总值增长2.3%，宏观调控积累了新的经验，以合理代价取得较大成效。
二是优先稳就业保民生，人民生活得到切实保障。就业是最大的民生，保市场主体也是为稳就业保民生。各地加大稳岗扩岗激励力度，企业和员工共同克服困难。多渠道做好重点群体就业工作，支持大众创业万

In [166]:
# ! pip install jieba

In [167]:
import jieba

In [168]:
word_list = list(jieba.cut(report))

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.778 seconds.
Prefix dict has been built successfully.


In [169]:
word_list

['各位',
 '代表',
 '：',
 '\n',
 '现在',
 '，',
 '我',
 '代表',
 '国务院',
 '，',
 '向',
 '大会',
 '报告',
 '政府',
 '工作',
 '，',
 '请予',
 '审议',
 '，',
 '并',
 '请',
 '全国政协',
 '委员',
 '提出',
 '意见',
 '。',
 '\n',
 '一',
 '、',
 '2020',
 '年',
 '工作',
 '回顾',
 '\n',
 '过去',
 '一年',
 '，',
 '在',
 '新',
 '中国',
 '历史',
 '上极',
 '不',
 '平凡',
 '。',
 '面对',
 '突如其来',
 '的',
 '新冠',
 '肺炎',
 '疫情',
 '、',
 '世界',
 '经济',
 '深度',
 '衰退',
 '等',
 '多重',
 '严重',
 '冲击',
 '，',
 '在',
 '以',
 '习近平',
 '同志',
 '为',
 '核心',
 '的',
 '党中央',
 '坚强',
 '领导',
 '下',
 '，',
 '全国',
 '各族人民',
 '顽强拼搏',
 '，',
 '疫情',
 '防控',
 '取得',
 '重大',
 '战略',
 '成果',
 '，',
 '在',
 '全球',
 '主要',
 '经济体',
 '中',
 '唯一',
 '实现',
 '经济',
 '正',
 '增长',
 '，',
 '脱贫',
 '攻坚战',
 '取得',
 '全面',
 '胜利',
 '，',
 '决胜',
 '全面',
 '建成',
 '小康社会',
 '取得',
 '决定性',
 '成就',
 '，',
 '交出',
 '一份',
 '人民满意',
 '、',
 '世界',
 '瞩目',
 '、',
 '可以',
 '载入史册',
 '的',
 '答卷',
 '。',
 '全年',
 '发展',
 '主要',
 '目标',
 '任务',
 '较',
 '好',
 '完成',
 '，',
 '我国',
 '改革开放',
 '和',
 '社会主义',
 '现代化',
 '建设',
 '又',
 '取得',
 '新',
 '的',
 '重大进展',
 '。',
 '\n',
 '在',
 '艰辛',
 '的',
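
The token list still contains punctuation and newline tokens. A minimal sketch of one way to drop them and count the remaining words follows. It assumes that a token counts as a word if it contains at least one letter or digit. `sample_tokens` is a hard-coded copy of the first tokens shown above so that the block runs without `jieba`, and `is_word` is a hypothetical helper.

```python
from collections import Counter

# First tokens of word_list, copied from the output above
sample_tokens = ['各位', '代表', '：', '\n', '现在', '，', '我', '代表', '国务院',
                 '，', '向', '大会', '报告', '政府', '工作', '，', '请予', '审议',
                 '，', '并', '请', '全国政协', '委员', '提出', '意见', '。', '\n']

def is_word(token):
    # Chinese characters and digits count as alphanumeric;
    # punctuation and whitespace do not
    return any(ch.isalnum() for ch in token)

content_words = [token for token in sample_tokens if is_word(token)]
freq = Counter(content_words)

print(len(sample_tokens), len(content_words))
print(freq.most_common(1))
```

On the full report, the same filter can be applied to `word_list` directly. Whether numerals such as `2020` should be kept is a design choice that depends on the analysis.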

In [170]:
print("THE END!!!")

THE END!!!
