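# English Text Preprocessing
Preprocessing English text has three main parts: data collection and assembly, tokenization, and normalization. Data collection and assembly usually draws on raw text (raw data) from databases, text files, or web scraping; here we load the [20_news](http://qwone.com/~jason/20Newsgroups/) dataset directly through sklearn. Tokenization splits text into sentences and then into individual words. For English this is relatively simple: sentences are usually found by detecting sentence-ending punctuation, and words are mostly separated by whitespace and punctuation. Normalization covers many loosely related steps with no fixed order and no standard method, so the pipeline has to be designed around the specific task, and this part is often time-consuming. It typically includes noise removal, case conversion, stop word removal, lemmatization, stemming, spell checking, and POS tagging.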

## Data Collection and Assembly

In [35]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='train')
text, categories = news.data, news.target
single_text = text[0]
category = categories[0]
print(single_text, news.target_names[category])
print('Lines: ', len(single_text.split('\n')))

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----




 rec.autos
Lines:  22


## Extract the Text You Want

In [36]:
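lines = single_text.split('\n')
# The fifth header line is "Lines: 15"; this index is specific to this post's header layout
num_lines = int(lines[4].split(': ')[-1])
# print(num_lines)
# Keep the last num_lines + 1 entries (the body) and join them with spaces
cont = ' '.join(lines[-num_lines-1:])
print(cont)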

 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is  all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail.  Thanks, - IL    ---- brought to you by your neighborhood Lerxst ----     
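
The cell above relies on the `Lines:` header being the fifth line and on its count being accurate, which holds for this post but may not hold for others. Below is a minimal sketch of an alternative, assuming the usual convention that a blank line separates the headers from the body. The helper name `split_header_body` and the sample post are hypothetical, and unlike the cell above this version collapses repeated whitespace.

```python
def split_header_body(raw_post):
    """Split a newsgroup post into (headers, body) at the first blank line."""
    header_part, sep, body_part = raw_post.partition('\n\n')
    if not sep:
        # No blank line found: treat the whole text as body
        return {}, ' '.join(raw_post.split())
    headers = {}
    for line in header_part.split('\n'):
        key, colon, value = line.partition(': ')
        if colon:
            headers[key] = value
    # Collapse newlines and repeated spaces into single spaces
    body = ' '.join(body_part.split())
    return headers, body


# Hypothetical sample post, shortened from the example above
sample_post = (
    "From: someone@example.com (a user)\n"
    "Subject: WHAT car is this!?\n"
    "Lines: 2\n"
    "\n"
    " I was wondering about this car I saw\n"
    "the other day.\n"
)
headers, body = split_header_body(sample_post)
print(headers['Subject'])
print(body)
```

If only the body is needed, `fetch_20newsgroups` also accepts a `remove=('headers', 'footers', 'quotes')` argument that strips these parts at load time.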


## Tokenization
+ **Sentence Tokenization**

In [37]:
from nltk.tokenize import sent_tokenize
# split sentences with original single_text
sents = sent_tokenize(single_text)
print('Sentences for original text:\n', sents)
# split sentences with the extracted and formatted content
sents = sent_tokenize(cont)
print('\n\nSentences for only content:\n', sents)

sample = 'This is Mr.Li.'
sample_sents = sent_tokenize(sample)
print('\n\nSentences for sample\n', sample_sents)

Sentences for original text:
 ["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?", 'Nntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day.', 'It was a 2-door sports car, looked to be from the late 60s/\nearly 70s.', 'It was called a Bricklin.', 'The doors were really small.', 'In addition,\nthe front bumper was separate from the rest of the body.', 'This is \nall I know.', 'If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.', 'Thanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n']


Sentences for only content:
 [' I was wondering if anyone out there could enlighten me on this car I saw the other day.', 'It was a 2-door sports car, looked to be from the late 60s/ early 70s.', 'It w

----------
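Looking at the results, sentence boundaries are decided by whether sentence-ending punctuation is detected in the text. The tokenizer also copes with several special cases: email addresses (lerxst@wam.umd.edu), consecutive sentence-ending punctuation (WHAT car is this!?), and some abbreviations (This is Mr.Li.).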
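
To see why a dedicated sentence tokenizer helps, compare it with a naive rule that splits after every `.`, `!` or `?` followed by whitespace. The sketch below is only an illustration and is not how `sent_tokenize` works internally; `naive_sent_split` is a hypothetical helper, and the third sample uses "Mr. Li." with a space, which differs from the sample in the cell above.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? whenever whitespace follows."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# Consecutive punctuation and email addresses happen to work
print(naive_sent_split("WHAT car is this!? It was called a Bricklin."))
print(naive_sent_split("Mail lerxst@wam.umd.edu today. Thanks."))
# An abbreviation followed by a space is split in the wrong place
print(naive_sent_split("This is Mr. Li. He drives a Bricklin."))
```

The last call breaks the text after "Mr.", which is the kind of case a trained tokenizer is meant to handle.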

+ **Word Tokenization**

In [38]:
from nltk.tokenize import word_tokenize
words = word_tokenize(cont)
print('Words for whole content:\n', words)

print('\n\nWords for each sentences:\n')
for sent in sents:
    words = word_tokenize(sent)
    print(words)

sample = "They'll cannot couldn't New York"
print('\n\n')
print(word_tokenize(sample))

Words for whole content:
 ['I', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', '.', 'It', 'was', 'a', '2-door', 'sports', 'car', ',', 'looked', 'to', 'be', 'from', 'the', 'late', '60s/', 'early', '70s', '.', 'It', 'was', 'called', 'a', 'Bricklin', '.', 'The', 'doors', 'were', 'really', 'small', '.', 'In', 'addition', ',', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', '.', 'This', 'is', 'all', 'I', 'know', '.', 'If', 'anyone', 'can', 'tellme', 'a', 'model', 'name', ',', 'engine', 'specs', ',', 'years', 'of', 'production', ',', 'where', 'this', 'car', 'is', 'made', ',', 'history', ',', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', ',', 'please', 'e-mail', '.', 'Thanks', ',', '-', 'IL', '--', '--', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'Lerxst', '--', '--']


Words for each sentences:

['I', 'was', 'wonderin

--------
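The results show that:
+ Tokenizing sentence by sentence gives the same tokens as tokenizing the whole text. For document-level tasks such as classification there is no need to process sentences separately.
+ For contractions, the tokenizer only splits the pieces apart and does not restore the full word. This is reasonable, because the contracted words (for example will) usually carry little meaning of their own.
+ not does affect sentiment polarity, but it can be recovered with extra processing, for example by rewriting couldn't as could_not.
+ Punctuation marks are treated as separate tokens.
+ Proper nouns such as New York are split into two tokens, so it can help to run named entity recognition before tokenization and handle such names specially.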
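
The points about negation and proper nouns can be handled with light processing around the tokenizer. The following is a minimal sketch that uses plain whitespace tokens rather than `word_tokenize`. The helper names and the tiny phrase list are hypothetical, and a real pipeline would more likely obtain multi-word names from a named entity recognizer.

```python
import re

# Hypothetical list of multi-word names to keep together
PHRASES = ["New York"]

def protect_phrases(text, phrases=PHRASES):
    """Join each known multi-word phrase with underscores before tokenizing."""
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(' ', '_'))
    return text

def mark_negations(text):
    """Rewrite contractions such as couldn't as could_not."""
    # Irregular forms first, otherwise they would become ca_not / wo_not
    text = re.sub(r"\bcan't\b", "can_not", text, flags=re.IGNORECASE)
    text = re.sub(r"\bwon't\b", "will_not", text, flags=re.IGNORECASE)
    # Generic case: <word>n't -> <word>_not
    return re.sub(r"\b(\w+)n't\b", r"\1_not", text)

sample = "They couldn't find parking in New York"
tokens = mark_negations(protect_phrases(sample)).split()
print(tokens)
```

Keeping `could_not` and `New_York` as single tokens means later steps such as stop word removal do not drop the negation or separate the name.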

## Normalization
+ Noise Removal
Remove noisy text such as HTML tags and non-English characters.
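
As a rough illustration of noise removal, the sketch below strips HTML tags with a regular expression, decodes HTML entities, and drops non-ASCII characters. The function name `remove_noise` and the sample string are hypothetical. The regular expression is a crude stand-in for a real HTML parser, and dropping all non-ASCII characters also removes accented letters, which may not suit every task.

```python
import re
from html import unescape

def remove_noise(text):
    """Strip HTML tags, decode entities, and drop non-ASCII characters."""
    text = re.sub(r'<[^>]+>', ' ', text)   # replace HTML tags with a space
    text = unescape(text)                  # decode entities such as &amp;
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII characters
    return ' '.join(text.split())          # collapse repeated whitespace

raw = '<p>This is a <b>funky</b> looking car &amp; 汽车 it is fast.</p>'
print(remove_noise(raw))
```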
+ Spell Check
A commonly used Python library for English spell checking is [pyenchant](https://pypi.org/project/pyenchant/).
References: [https://blog.csdn.net/hpulfc/article/details/80997252](https://blog.csdn.net/hpulfc/article/details/80997252)  [https://pythonhosted.org/pyenchant/tutorial.html](https://pythonhosted.org/pyenchant/tutorial.html)

In [39]:
from enchant.checker import SpellChecker
chkr = SpellChecker("en_US") # use American English
chkr.set_text(cont)
for err in chkr:
    print("Error word: %s" % (err.word))

Error word: Bricklin
Error word: tellme
Error word: Lerxst


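The results above show that proper names may be flagged as misspellings. One option is to maintain a list of words to ignore and filter the flagged words against it after checking.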
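
A minimal sketch of that filtering step, in plain Python and independent of the spell checker itself. The allowlist contents and the helper name `filter_flagged` are hypothetical.

```python
# Hypothetical allowlist of names and domain terms that should not count as errors
ALLOWED_WORDS = {"bricklin", "lerxst"}

def filter_flagged(flagged_words, allowed=ALLOWED_WORDS):
    """Keep only the flagged words that are not in the allowlist (case-insensitive)."""
    return [w for w in flagged_words if w.lower() not in allowed]

# Words reported by the spell checker in the cell above
flagged = ["Bricklin", "tellme", "Lerxst"]
print(filter_flagged(flagged))
```

Only `tellme` remains, which is the one genuine typo in the post.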
+ **POS Tagging**

In [40]:
from nltk import pos_tag
sent = sents[0]
words = word_tokenize(sent)
words_tags = pos_tag(words)
print(words_tags)
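# The tags above are Penn Treebank tags (nltk.help.upenn_tagset() describes them); the coarser universal tagset is listed at http://www.nltk.org/book/ch05.html#tab-universal-tagset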

[('I', 'PRP'), ('was', 'VBD'), ('wondering', 'VBG'), ('if', 'IN'), ('anyone', 'NN'), ('out', 'IN'), ('there', 'RB'), ('could', 'MD'), ('enlighten', 'VB'), ('me', 'PRP'), ('on', 'IN'), ('this', 'DT'), ('car', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('.', '.')]


+ **Stemming / Lemmatization**

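Stemming extracts the stem of a word, and the result may not be a real word. Lemmatization reduces an inflected form, such as a present participle or past tense, to its base form. A common approach is to lemmatize first and then stem, which normalizes words across different parts of speech.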

The WordNet lemmatizer is a commonly used method: if the word left after removing a suffix is in its dictionary, the suffix is removed.
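
The idea of "strip a suffix only if the result is in the dictionary" can be sketched with a toy lexicon. This is an illustration only, not NLTK's implementation: the lexicon `TOY_NOUNS`, the rule list, and `toy_lemmatize` are all hypothetical, and the rules are only loosely modeled on noun plural endings.

```python
# Hypothetical miniature lexicon of nouns; WordNet's real dictionary is far larger
TOY_NOUNS = {"car", "minute", "news", "wa", "saw", "day"}

# (suffix, replacement) pairs tried in order
NOUN_RULES = [("ses", "s"), ("ies", "y"), ("s", "")]

def toy_lemmatize(word, lexicon=TOY_NOUNS, rules=NOUN_RULES):
    """Return the word itself if it is in the lexicon, else the first
    suffix-stripped form found in the lexicon, else the word unchanged."""
    if word in lexicon:
        return word
    for suffix, replacement in rules:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in lexicon:
                return candidate
    return word

for w in ["minutes", "news", "was", "saw", "wondering"]:
    print(w, '->', toy_lemmatize(w))
```

With only noun rules and a noun lexicon, `was` becomes `wa` and `wondering` is left alone, which is why part-of-speech information matters for verbs.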

In [41]:
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# 1. Simplest usage
print('result of simple use: \n', [lemmatizer.lemmatize(w) for w in words])

# 2. With POS information
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # default is NOUN

res = []
for word, pos in words_tags:
    wordnet_pos = get_wordnet_pos(pos)
    res.append(lemmatizer.lemmatize(word, wordnet_pos))
print('\n\nresult of with pos info:\n', res)

sample_text = 'It takes me several minutes in reading the news about mathematics'
print('\n\nresult of sample: \n', [lemmatizer.lemmatize(w) for w in word_tokenize(sample_text)])


result of simple use: 
 ['I', 'wa', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', '.']


result of with pos info:
 ['I', 'be', 'wonder', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', '.']


result of sample: 
 ['It', 'take', 'me', 'several', 'minute', 'in', 'reading', 'the', 'news', 'about', 'mathematics']


------
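From the results above we can observe:
+ Without POS information, `lemmatize` treats every word as a noun by default, so it cannot recover irregular past-tense forms (was -> wa, saw -> saw). This also indirectly suggests that WordNet mainly checks whether the word left after stripping a suffix exists in its dictionary.
+ Without POS information, WordNet also fails to reduce the present participle (wondering -> wondering). Once it is told that wondering is a verb, it attempts the reduction and returns wonder.
+ WordNet can reduce plural nouns (minutes -> minute).
+ If the word as written already exists in the dictionary, it is left unchanged (news stays news rather than becoming new). This is probably also why present participles are not reduced.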

-----
The most commonly used stemmer is the Porter stemmer, although some people online recommend the Snowball stemmer instead.

In [42]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

cont_words = word_tokenize(cont)
print("Words:\n", cont_words)

stemmer = PorterStemmer()
res = [stemmer.stem(w) for w in cont_words]
print("\nresult for use PorterStemmer:\n", res)

stemmer = SnowballStemmer("english")
res = [stemmer.stem(w) for w in cont_words]
print("\nresult for use SnowballStemmer:\n", res)

stemmer = LancasterStemmer()
res = [stemmer.stem(w) for w in cont_words]
print("\nresult for use LancasterStemmer:\n", res)

Words:
 ['I', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', '.', 'It', 'was', 'a', '2-door', 'sports', 'car', ',', 'looked', 'to', 'be', 'from', 'the', 'late', '60s/', 'early', '70s', '.', 'It', 'was', 'called', 'a', 'Bricklin', '.', 'The', 'doors', 'were', 'really', 'small', '.', 'In', 'addition', ',', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', '.', 'This', 'is', 'all', 'I', 'know', '.', 'If', 'anyone', 'can', 'tellme', 'a', 'model', 'name', ',', 'engine', 'specs', ',', 'years', 'of', 'production', ',', 'where', 'this', 'car', 'is', 'made', ',', 'history', ',', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', ',', 'please', 'e-mail', '.', 'Thanks', ',', '-', 'IL', '--', '--', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'Lerxst', '--', '--']

result for use PorterStemmer:
 ['I', 'wa', 'wonder', 'if', 'anyon', 

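+ Overall, in terms of aggressiveness, `LancasterStemmer > PorterStemmer > SnowballStemmer`. For example (the three results appear to be listed in the order Porter/Snowball/Lancaster): `wondering -> wonder/wonder/wond, late -> late/late/lat, bumper -> bumper/bumper/bump, neighborhood -> neighborhood/neighborhood/neighb`
+ `LancasterStemmer` is too aggressive and over-reduces many ordinary words, so it is not analyzed further below.
+ `SnowballStemmer` is milder than `PorterStemmer`: some common words such as `this, there, was` are left unchanged.
+ `SnowballStemmer` supports multiple languages and lets you choose whether to skip stop words (the `ignore_stopwords` parameter).
+ `SnowballStemmer` also converts words to lowercase.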

----
+ **Stop Word Removal**

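Stop words are words that carry no meaning for the task at hand. In most scenarios, words such as was, the, and that are meaningless. In sentiment analysis, however, that can refer to the target of the sentiment, such as a phone or a cup. There is therefore no universal stop word list; the list should be tailored to the specific scenario.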
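
One way to tailor a stop word list is to start from a general list and adjust it per task. The sketch below is a minimal illustration under that assumption: the small base list and the helper names `build_stop_words` and `remove_stop_words` are hypothetical, and in practice the base list could come from a library.

```python
# Hypothetical base list; in practice this could come from a library's stop word list
BASE_STOP_WORDS = {"i", "was", "the", "that", "it", "a", "is", "this"}

def build_stop_words(base, keep=(), extra=()):
    """Return a task-specific stop word set: base minus `keep`, plus `extra`."""
    return (set(base) - set(keep)) | set(extra)

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["I", "love", "that", "phone", "."]

general = build_stop_words(BASE_STOP_WORDS)
sentiment = build_stop_words(BASE_STOP_WORDS, keep={"that"})

print(remove_stop_words(tokens, general))    # "that" is removed
print(remove_stop_words(tokens, sentiment))  # "that" is kept for sentiment analysis
```

Comparing on the lowercase form means capitalized tokens such as "I" are filtered as well.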

In [43]:
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
print("stop_words:\n", stop_words)

print("\n\nnumber of words before filtering is %d" % len(cont_words))
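# Note: stop_words are all lowercase and this comparison is case-sensitive,
# so capitalized tokens such as 'I' or 'The' are kept unless lowercased first
filter_words = [w for w in cont_words if w not in stop_words]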
print("\nFilter words:\n", filter_words)
print("number of words after filtering is %d" % len(filter_words))

stop_words:
 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 