## 1. Introduction to Natural Language Processing

Natural Language Processing (NLP)

Python library for processing natural language: nltk

In [1]:
import nltk

### 1.1 Tokenization

Tokenization: the process of splitting text into words (tokens).

In [2]:
text = "I’ve spent the last 14 years in sales and business development for investment research firms and capital market data vendors. I am now on a journey to pivot my career from data sales to data analytics. I am learning a lot of cool stuff about artificial intelligence, machine learning and how to use Python code to make use of these techniques for investing. I have found that the fastest way to learn something new is to try to teach it. So I’d like to share what I’m learning as I go along."
print(text)

I’ve spent the last 14 years in sales and business development for investment research firms and capital market data vendors. I am now on a journey to pivot my career from data sales to data analytics. I am learning a lot of cool stuff about artificial intelligence, machine learning and how to use Python code to make use of these techniques for investing. I have found that the fastest way to learn something new is to try to teach it. So I’d like to share what I’m learning as I go along.


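The word_tokenize() function separates ordinary words from punctuation marks. Because the sample text uses typographic apostrophes (’), contractions such as I’ve come out as three tokens in the output below.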

In [3]:
nltk_tokens = nltk.word_tokenize(text)
print(nltk_tokens)

['I', '’', 've', 'spent', 'the', 'last', '14', 'years', 'in', 'sales', 'and', 'business', 'development', 'for', 'investment', 'research', 'firms', 'and', 'capital', 'market', 'data', 'vendors', '.', 'I', 'am', 'now', 'on', 'a', 'journey', 'to', 'pivot', 'my', 'career', 'from', 'data', 'sales', 'to', 'data', 'analytics', '.', 'I', 'am', 'learning', 'a', 'lot', 'of', 'cool', 'stuff', 'about', 'artificial', 'intelligence', ',', 'machine', 'learning', 'and', 'how', 'to', 'use', 'Python', 'code', 'to', 'make', 'use', 'of', 'these', 'techniques', 'for', 'investing', '.', 'I', 'have', 'found', 'that', 'the', 'fastest', 'way', 'to', 'learn', 'something', 'new', 'is', 'to', 'try', 'to', 'teach', 'it', '.', 'So', 'I', '’', 'd', 'like', 'to', 'share', 'what', 'I', '’', 'm', 'learning', 'as', 'I', 'go', 'along', '.']
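To make the splitting rule concrete, here is a minimal sketch of a regex-based tokenizer using only the standard library. It is not how `nltk.word_tokenize` is implemented; the function names `simple_tokenize` and `tokenize_keep_contractions` are hypothetical and chosen for illustration. On this sample the first function happens to reproduce the way the typographic apostrophe is split off, and the second shows one possible way to keep contractions together.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches letters, digits, and underscores; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_keep_contractions(text):
    """Same idea, but keep contractions such as I've as one token."""
    # Map the typographic apostrophe (U+2019) to the ASCII one first.
    text = text.replace("\u2019", "'")
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sample = "I\u2019ve spent the last 14 years in sales."
print(simple_tokenize(sample))
print(tokenize_keep_contractions(sample))
```

Whether contractions should stay whole depends on the downstream task; the point here is only that tokenization is a set of rules that can be adjusted.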


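### 1.2 Stemming

Stem: the core form behind a word. Stemming is the process of reducing a word to that form.

Many words express the same concept: 'tables' and 'table' both mean a table, and 'am', 'is', 'be', and 'are' are all forms of "to be". A stemmer works by stripping suffixes, so it handles pairs like 'tables' and 'table'; irregular forms such as 'am', 'is', and 'are' are generally not merged by a stemmer and call for lemmatization instead.

**Why stem?** Mapping inflected variants to one token shrinks the vocabulary and lets a model treat those variants as the same feature.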

In [4]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

for token in nltk_tokens[:10]:
    token_stem = stemmer.stem(token)
    print(f"{token} --> {token_stem}")

I --> i
’ --> ’
ve --> ve
spent --> spent
the --> the
last --> last
14 --> 14
years --> year
in --> in
sales --> sal
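The effect on vocabulary size can be illustrated with a deliberately crude suffix stripper. This is a toy sketch, not the Lancaster algorithm used above; the function `toy_stem` and its three suffix rules are assumptions made purely for illustration.

```python
def toy_stem(word):
    """Strip one common English suffix; a toy rule, not the Lancaster algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["learn", "learns", "learned", "learning", "table", "tables", "is", "are"]
stems = [toy_stem(w) for w in words]
print(stems)
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")
```

The eight input words collapse to four stems, while 'is' and 'are' stay separate, which is the limitation that lemmatization addresses.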


### 1.3 Part-of-Speech Tagging

Part-of-speech tagging (POS tagging): labeling each word with its grammatical category, such as noun, verb, or conjunction.

In [5]:
tags = nltk.pos_tag(nltk_tokens)
for token, tag in tags[:10]:
    print(token, tag)

I PRP
’ VBP
ve JJ
spent VBD
the DT
last JJ
14 CD
years NNS
in IN
sales NNS
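The tags printed above are Penn Treebank abbreviations. The sketch below hard-codes those ten pairs (so it runs without the tagger model), attaches a short description to each tag, and filters the nouns. The `PENN_TAGS` dictionary is a hand-written lookup covering only the tags in this sample, not part of NLTK. Note that the tags for '’' and 've' look like a side effect of the contraction being split during tokenization rather than meaningful labels.

```python
# First ten (token, tag) pairs, copied from the output above.
tags = [
    ("I", "PRP"), ("\u2019", "VBP"), ("ve", "JJ"), ("spent", "VBD"), ("the", "DT"),
    ("last", "JJ"), ("14", "CD"), ("years", "NNS"), ("in", "IN"), ("sales", "NNS"),
]

# Meanings of the Penn Treebank tags that occur in this sample.
PENN_TAGS = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "VBD": "verb, past tense",
    "JJ": "adjective",
    "DT": "determiner",
    "CD": "cardinal number",
    "NNS": "noun, plural",
    "IN": "preposition or subordinating conjunction",
}

for token, tag in tags:
    print(f"{token:>6}  {tag:<4} {PENN_TAGS[tag]}")

# Noun tags all start with "NN", so a prefix check selects them.
nouns = [token for token, tag in tags if tag.startswith("NN")]
print(nouns)
```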


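### 1.4 Stop Words

Stop word: a word that is used very frequently but carries little meaning, such as 'the', 'this', or 'as'.

Stop words are usually removed during feature selection.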

In [6]:
from nltk.corpus import stopwords

print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
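Removing stop words amounts to a membership test per token. The following sketch uses a small hand-picked subset of the list printed above rather than the full NLTK list, and a plain `split()` in place of a real tokenizer; both are simplifications for illustration.

```python
# A handful of entries taken from the stop-word list printed above.
stop_words = {"i", "am", "on", "a", "to", "my", "from"}

sentence = "I am now on a journey to pivot my career from data sales to data analytics"

# The list is lowercase, so lowercase each token before the lookup.
tokens = sentence.lower().split()
content_tokens = [token for token in tokens if token not in stop_words]
print(content_tokens)
```

Storing the stop words in a set keeps each lookup cheap, which matters once the same filter runs over thousands of documents.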

## 2. Text Classification

Using the newsgroups dataset, build a simple text classification model that predicts which category a post belongs to.

In [7]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

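### 2.1 The newsgroups Dataset

The 20 newsgroups dataset is commonly used to train text classification models. It contains more than 18,000 posts spread across 20 categories and is already split into a training set and a test set.

sklearn.datasets.fetch_20newsgroups downloads the data or loads it from the local cache.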

In [8]:
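# Load the data from the local cache; it is downloaded automatically if missing
# (with no arguments, only the training subset is loaded)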
newsgroups = fetch_20newsgroups()

In [14]:
# The data attribute holds the raw text of every loaded post
print(newsgroups.data[1])

From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>



In [16]:
# The posts are divided into 20 categories
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### 2.2 Multinomial Naive Bayes Classifier

In [48]:
# For simplicity, load only a few categories
categories = [
    'talk.religion.misc',
    'soc.religion.christian',
    'sci.space',
    'comp.graphics'
]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# Convert the text into a numeric matrix with TF-IDF
# fit_transform learns the vocabulary and IDF weights from the training set; the test set only needs transform
vec = TfidfVectorizer()
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)
y_train = train.target
y_test = test.target

# Multinomial naive Bayes classifier
model = MultinomialNB()

# Fit the data to train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate the classifier with a simple accuracy score
accuracy_score(y_test, y_pred)

0.8016759776536313
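The split between `fit_transform` on the training data and `transform` on the test data can be sketched in plain Python. This is an illustrative approximation, assuming the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the real class also tokenizes, lowercases, and returns a sparse matrix. The functions `fit_idf` and `transform` and the three tiny documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from tokenized training documents."""
    n_docs = len(train_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Turn one tokenized document into an L2-normalized TF-IDF vector (as a dict)."""
    # Terms that were not seen during fitting are dropped.
    counts = Counter(term for term in doc if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_docs = [
    ["space", "shuttle", "launch"],
    ["god", "faith", "church"],
    ["space", "image", "graphics"],
]
idf = fit_idf(train_docs)

# "orbit" never appeared in training, so it has no column in the output.
test_vec = transform(["space", "shuttle", "orbit"], idf)
print(test_vec)
```

In this sketch 'shuttle' receives a larger weight than 'space' because it occurs in fewer training documents, and the unseen word 'orbit' is ignored, which is why the vocabulary must be fitted on the training set only.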

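### 2.3 Improving the Model

First preprocess the raw text (tokenization, stop-word removal, stemming) and see whether that improves the classifier's performance.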

In [60]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer

stop_words = set(stopwords.words("english"))  # a set makes membership tests fast
stemmer = LancasterStemmer()

def clean_text(text):
    # Tokenize
    tokens = nltk.word_tokenize(text.lower())
    # Remove stop words
    tokens_ex_stopwords = [token for token in tokens if token not in stop_words]
    # Stem each remaining token
    tokens_stem = [stemmer.stem(token) for token in tokens_ex_stopwords]
    return " ".join(tokens_stem)

clean_text(train.data[3])

": revdak @ netcom.com ( d. andrew kil ) subject : : serb genocid work god ? org : netcom on-line commun serv ( 408 241-9760 guest ) lin : 22 jam sled ( jsledd @ ssdc.sas.upenn.edu ) wrot : : serb work god ? hmm ... : 've wond anyon would ev ask quest , : govern unit stat europ mov : end ethn cleans serb target : muslim ? : can/does god us follow accompl : task ? esp task pun ? : jam sled : cut sig ... . 'm work . suggest god support genocid ? perhap germ `` pun '' jew god 's behalf ? god work way indescrib evil , unworthy wor fai . revdak @ netcom.com"

In [61]:
# Preprocess the text first
train_clean = [clean_text(text) for text in train.data]
test_clean = [clean_text(text) for text in test.data]

# Convert the text into a numeric matrix with TF-IDF
# fit_transform learns the vocabulary and IDF weights from the training set; the test set only needs transform
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_clean)
X_test = vec.transform(test_clean)
y_train = train.target
y_test = test.target

# Multinomial naive Bayes classifier
model = MultinomialNB()

# Fit the data to train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate the classifier with a simple accuracy score
accuracy_score(y_test, y_pred)

0.8400837988826816

After preprocessing, prediction accuracy rose by nearly 4 percentage points (from about 0.802 to about 0.840).
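The size of the gain can be checked directly from the two accuracy values reported above. The short calculation below also separates the absolute gain (in percentage points) from the relative gain over the baseline, since the two are easy to confuse; the variable names are chosen for illustration.

```python
acc_baseline = 0.8016759776536313  # accuracy on raw text (section 2.2)
acc_cleaned = 0.8400837988826816   # accuracy after clean_text (section 2.3)

absolute_gain = acc_cleaned - acc_baseline    # difference in accuracy
relative_gain = absolute_gain / acc_baseline  # gain relative to the baseline

print(f"{absolute_gain * 100:.2f} percentage points")   # 3.84 percentage points
print(f"{relative_gain * 100:.2f}% relative improvement")  # 4.79% relative improvement
```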