Agenda:
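- Preprocessing data with tokenization
- Stemming text data
- Reducing text to its base form with lemmatization
- Splitting text with chunking
- Building a bag-of-words model
- Building a text classifier
- Identifying gender
- Analyzing sentence sentiment
- Identifying patterns in text with topic modeling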

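- The most common application areas of NLP include search engines, sentiment analysis, topic modeling, part-of-speech tagging, and entity recognition
- Install NLTK (Natural Language Toolkit): pip install nltk
- NLTK library: https://www.nltk.org/install.html
- NLTK data: https://www.nltk.org/data.html. NLTK data includes many corpora and trained models and is an important part of text analysis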

In [0]:
# !pip install nltk

import nltk
nltk.__version__

Out[3]: '3.7'

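#### Preprocessing data with tokenization - tokenizer
- Tokenization: the process of splitting text into a set of meaningful pieces
- These pieces are called tokens; a passage of text can be split into words or into sentences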

In [0]:
import nltk
nltk.download('punkt')

# Define sample text for the analysis
text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."

# Sentence tokenization
from nltk.tokenize import sent_tokenize

# Run the sentence tokenizer on the input text and extract the tokens
sent_tokenize_list = sent_tokenize(text)
print("\nSentence tokenizer:")
print(sent_tokenize_list)

# Create a word tokenizer - punctuation inside a word stays attached to part of the word, e.g. "Let's" becomes 'Let' and "'s"
from nltk.tokenize import word_tokenize
print("\nWord tokenizer:")
print(word_tokenize(text))

# Create a WordPunct tokenizer - every punctuation mark becomes a separate token
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print("\nWord punct tokenizer:")
print(word_punct_tokenizer.tokenize(text))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Sentence tokenizer:
['Are you curious about tokenization?', "Let's see how it works!", 'We need to analyze a couple of sentences with punctuations to see it in action.']

Word tokenizer:
['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'s", 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']

Word punct tokenizer:
['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'", 's', 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']
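
To make the difference between the tokenizers more concrete, here is a minimal sketch that approximates them with the standard `re` module only. The pattern `\w+|[^\w\s]+` is the one `WordPunctTokenizer` is built on. `naive_sent_split` is a hypothetical helper, not part of NLTK: it just splits after `.`, `!` or `?` followed by whitespace, which is much cruder than the trained Punkt model behind `sent_tokenize`.

```python
import re

def naive_sent_split(text):
    # Hypothetical helper: split after ., ! or ? when followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

def word_punct_split(text):
    # A run of word characters, or a run of punctuation, becomes one token.
    return re.findall(r"\w+|[^\w\s]+", text)

sample = "Are you curious about tokenization? Let's see how it works!"
print(naive_sent_split(sample))
print(word_punct_split("Let's see how it works!"))
```

The naive sentence splitter breaks on abbreviations such as "Dr. Smith", which is the kind of case a trained sentence tokenizer is meant to handle.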


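#### Stemming text data - stemmer
- The goal of stemming is to reduce the different inflected forms of a word to a common base form
- Stemming uses heuristics to chop off the end of a word in order to extract that base form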

In [0]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

In [0]:
# Define some words to stem
words = ['table', 'probably', 'wolves', 'playing', 'is', 
        'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

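# Compare different stemmers
# All three algorithms aim to extract the stem and remove the effect of inflection; they differ in how aggressively they do it.
# The output shows that Lancaster is stricter than the other two and Porter is the most lenient.
# Lancaster strips away a large part of the word, so its stems are often obscure and hard to interpret, but it is fast.
# For that reason the Snowball stemmer is the usual choice.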
stemmers = ['PORTER', 'LANCASTER', 'SNOWBALL']

# Initialize the three stemmer objects
stemmer_porter = PorterStemmer()
stemmer_lancaster = LancasterStemmer()
stemmer_snowball = SnowballStemmer('english')

# Set up the format so that the output is displayed as a table
formatted_row = '{:>16}' * (len(stemmers) + 1)
print('\n', formatted_row.format('WORD', *stemmers), '\n')

# Iterate over the list and stem each word with the three stemmers
for word in words:
    stemmed_words = [stemmer_porter.stem(word), 
            stemmer_lancaster.stem(word), stemmer_snowball.stem(word)]
    print(formatted_row.format(word, *stemmed_words))


             WORD          PORTER       LANCASTER        SNOWBALL 

           table            tabl            tabl            tabl
        probably         probabl            prob         probabl
          wolves            wolv            wolv            wolv
         playing            play            play            play
              is              is              is              is
             dog             dog             dog             dog
             the             the             the             the
         beaches           beach           beach           beach
        grounded          ground          ground          ground
          dreamt          dreamt          dreamt          dreamt
        envision           envis           envid           envis
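
The "heuristic suffix chopping" idea can be illustrated with a deliberately tiny sketch. `toy_stem` and its suffix list are assumptions made up for this illustration; the Porter, Lancaster and Snowball algorithms use much larger, ordered rule sets with extra conditions.

```python
# Suffixes are checked in this order, so 'es' is tried before 's'.
SUFFIXES = ("ing", "ed", "es", "ly", "s")

def toy_stem(word, min_stem=3):
    # Strip the first matching suffix, as long as at least min_stem letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ['playing', 'beaches', 'grounded', 'is', 'table']:
    print(word, '->', toy_stem(word))
```

Even this toy version shows why stems are not always real words: it has no knowledge of the vocabulary and only looks at the end of the string.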


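#### Reducing text to its base form with lemmatization
- Goal: convert a word to its base form, using a more structured approach than stemming
- This is done through vocabulary and grammatical analysis of the word. Inflectional endings such as 'ing' or 'ed' are removed to return the base form of the word, the lemma. This is more accurate than stemming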

In [0]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
Out[7]: True

In [0]:
from nltk.stem import WordNetLemmatizer

# Define a list of words to test lemmatization
words = ['table', 'probably', 'wolves', 'playing', 'is', 
        'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

# Compare two different lemmatizers (the noun lemmatizer and the verb lemmatizer)
lemmatizers = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']

# Create a WordNet lemmatizer object
lemmatizer_wordnet = WordNetLemmatizer()

# Set up the output format
formatted_row = '{:>24}' * (len(lemmatizers) + 1)
print('\n', formatted_row.format('WORD', *lemmatizers), '\n')

# Iterate over the words in the list and lemmatize each one as a noun and as a verb
for word in words:
    lemmatized_words = [lemmatizer_wordnet.lemmatize(word, pos='n'),lemmatizer_wordnet.lemmatize(word, pos='v')]
    print(formatted_row.format(word, *lemmatized_words))


                     WORD         NOUN LEMMATIZER         VERB LEMMATIZER 

                   table                   table                   table
                probably                probably                probably
                  wolves                    wolf                  wolves
                 playing                 playing                    play
                      is                      is                      be
                     dog                     dog                     dog
                     the                     the                     the
                 beaches                   beach                   beach
                grounded                grounded                  ground
                  dreamt                  dreamt                   dream
                envision                envision                envision


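#### Splitting text with chunking
- Unlike tokenization, chunking imposes no constraints, and the resulting chunks do not need to be meaningful
- When processing very large texts, it is common to split them into chunks first and then analyze them further
- The text is split into chunks that each contain a fixed number of words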
In [0]:
import numpy as np
from nltk.corpus import brown
nltk.download('brown')

# Define a function that splits a text into chunks of num_words words each
def splitter(data, num_words):
    words = data.split(' ')
    output = []
    
    # Initialize variables
    cur_count = 0
    cur_words = []
    
    # Iterate over the words
    for word in words:
        cur_words.append(word)
        cur_count += 1
        
        # When the word count reaches the required size, store the chunk and reset the variables
        if cur_count == num_words:
            output.append(' '.join(cur_words))
            cur_words = []
            cur_count = 0
    
    # Append the remaining words as the last chunk (an empty string if nothing is left over)
    output.append(' '.join(cur_words) )

    return output


# Read the data from the Brown corpus
data = ' '.join(brown.words()[:10000])

# Number of words in each chunk
num_words = 1700

# Initialize two helper variables
chunks = []
counter = 0

text_chunks = splitter(data, num_words)

print("\nNumber of text chunks =", len(text_chunks))
# print("\nText chunks =", text_chunks)

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.

Number of text chunks = 6
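
Fixed-size chunking is easy to check on a tiny input. The sketch below uses a hypothetical helper, `split_into_chunks`, that slices the word list in steps of `num_words`. It is an alternative formulation for illustration, not a drop-in replacement for `splitter`: unlike `splitter`, it never returns a trailing empty chunk.

```python
def split_into_chunks(text, num_words):
    # Hypothetical helper: slice the word list in steps of num_words.
    words = text.split()
    return [' '.join(words[i:i + num_words])
            for i in range(0, len(words), num_words)]

print(split_into_chunks('a b c d e f g', 3))  # ['a b c', 'd e f', 'g']
print(split_into_chunks('a b c d e f', 2))    # ['a b', 'c d', 'e f']
```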


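#### Building a bag-of-words model (bag-of-words)
- Text data with millions of words has to be converted into some numeric representation before machine learning algorithms can work with it
- A bag-of-words model learns a vocabulary from all the words in all the documents. It then models each document by building a histogram of the words in that document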

In [0]:
!pip install chunking

Collecting chunking
  Downloading Chunking-0.0.2.tar.gz (1.7 kB)
Building wheels for collected packages: chunking
  Building wheel for chunking (setup.py) ... done
  Created wheel for chunking: filename=Chunking-0.0.2-py3-none-any.whl size=2080 sha256=4a36f93ee17974445d7d83be6950f9349d1fcf6ea6e5db1a88ee97cca19b7969
  Stored in directory: /root/.cache/pip/wheels/09/f3/a2/e8fc3b318aa3cfa63c6a4a7043091f62c18917f0a1e781f5f5
Successfully built chunking
Installing collected packages: chunking
Successfully installed chunking-0.0.2


In [0]:
import numpy as np
from nltk.corpus import brown
# from chunking import split

In [0]:
# splitter('a b c d e f',2)

In [0]:
# Read the data from the Brown corpus
data = ' '.join(brown.words()[:10000])

# Number of words in each chunk - 10000 words split into 5 chunks of 2000 words each
num_words = 2000

chunks = []
counter = 0

text_chunks = splitter(data, num_words)

# Build a list of dictionaries from the text chunks
for text in text_chunks:
    chunk = {'index': counter, 'text': text}
    chunks.append(chunk)
    counter += 1

# Extract the document-term matrix, which records how often each word occurs in each document.
# scikit-learn's CountVectorizer is used to build it
from sklearn.feature_extraction.text import CountVectorizer

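# Instantiate the vectorizer and extract the document-term matrix
# (min_df=5 keeps words found in at least 5 documents; max_df=.95 drops words found in more than 95% of documents)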
vectorizer = CountVectorizer(min_df=5, max_df=.95)
doc_term_matrix = vectorizer.fit_transform([chunk['text'] for chunk in chunks])

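# Extract the vocabulary from the vectorizer object and print it
# (get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names())
vocab = np.array(vectorizer.get_feature_names_out())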
print("\nVocabulary:")
print(vocab)

# Print the document-term matrix
print("\nDocument term matrix:")
chunk_names = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4']

# Format the output
formatted_row = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_row.format('Word', *chunk_names), '\n')

# Iterate over all the words and print how many times each word occurs in each chunk
for word, item in zip(vocab, doc_term_matrix.T):
    # 'item' is a 'csr_matrix' data structure
    output = [str(x) for x in item.data]
    print(formatted_row.format(word, *output))


Vocabulary:
['about' 'after' 'against' 'aid' 'all' 'also' 'an' 'and' 'are' 'as' 'at'
 'be' 'been' 'before' 'but' 'by' 'committee' 'congress' 'did' 'each'
 'education' 'first' 'for' 'from' 'general' 'had' 'has' 'have' 'he'
 'health' 'his' 'house' 'in' 'increase' 'is' 'it' 'last' 'made' 'make'
 'may' 'more' 'no' 'not' 'of' 'on' 'one' 'only' 'or' 'other' 'out' 'over'
 'pay' 'program' 'proposed' 'said' 'similar' 'state' 'such' 'take' 'than'
 'that' 'the' 'them' 'there' 'they' 'this' 'time' 'to' 'two' 'under' 'up'
 'was' 'were' 'what' 'which' 'who' 'will' 'with' 'would' 'year' 'years']

Document term matrix:

         Word     Chunk-0     Chunk-1     Chunk-2     Chunk-3     Chunk-4 

       about           1           1           1           1           3
       after           2           3           2           1           3
     against           1           2           2           1           1
         aid           1           1           1           3           5
         all       

In [0]:
# How it works: how are the term-frequency feature vectors built?

# Consider the following sentences:

# Sentence 1: The brown dog is running.
# Sentence 2: The black dog is in the black room.
# Sentence 3: Running in the room is forbidden.
# Across all three sentences, there are nine unique words:

# the
# brown
# dog
# is
# running
# black
# in
# room
# forbidden
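# Now convert each sentence into a histogram using the count of each word in that sentence. Each feature vector is 9-dimensional because there are nine unique words: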

# Sentence 1: [1, 1, 1, 1, 1, 0, 0, 0, 0]
# Sentence 2: [2, 0, 1, 1, 0, 2, 1, 1, 0]
# Sentence 3: [1, 0, 0, 1, 1, 0, 1, 1, 1]
# Once these feature vectors are extracted, machine learning algorithms can be used to analyze them.
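
The three-sentence example can be reproduced with plain Python. This is a minimal sketch: the `tokenize` helper (lowercase, drop the final period, split on whitespace) is an assumption made for the illustration and is simpler than what `CountVectorizer` does.

```python
from collections import Counter

sentences = [
    "The brown dog is running.",
    "The black dog is in the black room.",
    "Running in the room is forbidden.",
]

def tokenize(sentence):
    # Lowercase, drop the final period, split on whitespace.
    return sentence.lower().rstrip('.').split()

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector per sentence, aligned with the vocabulary.
vectors = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Note that "the" occurs once in the third sentence, so its first component is 1.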

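#### Building a text classifier
- The goal is to sort text documents into different classes
- Technique: TF-IDF (Term Frequency - Inverse Document Frequency)
- TF-IDF measures how important a word is to one document within a set of documents; the resulting values can be used as feature vectors for document classification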
- http://www.tfidf.com/

In [0]:
from sklearn.datasets import fetch_20newsgroups

# Select a list of categories, defined as a dictionary mapping. These categories are part of the newsgroups dataset that is loaded below
category_map = {'misc.forsale': 'Sales', 'rec.motorcycles': 'Motorcycles', 'rec.sport.baseball': 'Baseball', 'sci.crypt': 'Cryptography', 'sci.space': 'Space'}

# Load the training data for the selected categories
training_data = fetch_20newsgroups(subset='train', categories=list(category_map.keys()), shuffle=True, random_state=7)

# Import the feature extractor
from sklearn.feature_extraction.text import CountVectorizer

# Extract features from the training data
vectorizer = CountVectorizer()
X_train_termcounts = vectorizer.fit_transform(training_data.data)
print("\nDimensions of training data:", X_train_termcounts.shape)

# Train a classifier, using a naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

# Define and fit the TF-IDF transformer object
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_termcounts)

# With the feature vectors available, train a multinomial naive Bayes classifier on them
classifier = MultinomialNB().fit(X_train_tfidf, training_data.target)

# Define some sample input sentences
input_data = [
    "The curveballs of right handed pitchers tend to curve to the left", 
    "Caesar cipher is an ancient form of encryption",
    "This two-wheeler is really good on slippery roads"
]

# Convert the input data to term counts
X_input_termcounts = vectorizer.transform(input_data)

# Transform the input data with the TF-IDF transformer
X_input_tfidf = tfidf_transformer.transform(X_input_termcounts)

# Predict the output categories of the input sentences with the trained classifier
predicted_categories = classifier.predict(X_input_tfidf)

# Print the outputs
for sentence, category in zip(input_data, predicted_categories):
    print('\nInput:', sentence, '\nPredicted category:', category_map[training_data.target_names[category]])


Dimensions of training data: (2968, 40605)

Input: The curveballs of right handed pitchers tend to curve to the left 
Predicted category: Baseball

Input: Caesar cipher is an ancient form of encryption 
Predicted category: Cryptography

Input: This two-wheeler is really good on slippery roads 
Predicted category: Motorcycles


In [0]:
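# How TF-IDF works
# TF-IDF is widely used in information retrieval to understand how important each word in a document is. The idea is to discount common words and surface the words that carry real meaning.
# Typical workflow: extract the term frequencies of each sentence, convert them into feature vectors, then classify the sentences with a classifier

# Term Frequency (TF) measures how often a word occurs in a given document
# Inverse Document Frequency (IDF) measures how important a given word is: the fewer documents contain the word, the higher its IDF

# A word that appears in every document, or only once, is not very informative. We are looking for words that occur a fair number of times, but not so often that they become noise
# TF-IDF can pick out words that meet this requirement, and they can be used for document classification. Search engines often use TF-IDF to rank search results by relevance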

Out[24]: <3x40605 sparse matrix of type '<class 'numpy.float64'>'
	with 27 stored elements in Compressed Sparse Row format>
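
A small worked example of the TF-IDF idea, using the plain textbook definition `tf * log(N / df)`. The three token lists are made up for this illustration. scikit-learn's `TfidfTransformer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math

# Made-up toy documents, already tokenized.
documents = [
    ["the", "pitcher", "threw", "the", "ball"],
    ["the", "cipher", "hides", "the", "message"],
    ["the", "rider", "parked", "the", "motorcycle"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that equal term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(number of documents / documents containing term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", documents[0], documents))      # 0.0, "the" is in every document
print(tf_idf("pitcher", documents[0], documents))  # positive, "pitcher" is in one document only
```

The word "the" has the highest term frequency but a TF-IDF of zero, because a word found in every document says nothing about any particular one.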

#### Example: identifying gender
- Predict gender from letter features of a name (here, its last few letters)

In [0]:
import nltk
nltk.download('names')

import random
from nltk.corpus import names
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


In [0]:
# Extract features from the input word: its last num_letters letters, lowercased
def gender_features(word, num_letters=2):
    return {'feature': word[-num_letters:].lower()}

In [0]:
# Extract labeled names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

# Seed the random number generator and shuffle the training data
random.seed(7)
random.shuffle(labeled_names)

# Define some input names
input_names = ['Leonardo', 'Amy', 'Sam']

# Sweep the parameter space
# It is not clear up front how many trailing letters work best, so try values from 1 to 4. Each iteration uses that many trailing letters as the feature.
for i in range(1, 5):
    print('\nNumber of letters:', i)
    featuresets = [(gender_features(n, i), gender) for (n, gender) in labeled_names]
    
    # Split into training and test sets
    train_set, test_set = featuresets[500:], featuresets[:500]
    
    # Train a naive Bayes classifier
    classifier = NaiveBayesClassifier.train(train_set)

    # Print the classifier accuracy for each value in the parameter space
    print('Accuracy ==>', str(100 * nltk_accuracy(classifier, test_set)) + str('%'))

    # Predict outputs for new inputs
    for name in input_names:
        print(name, '==>', classifier.classify(gender_features(name, i)))


Number of letters: 1
Accuracy ==> 76.2%
Leonardo ==> male
Amy ==> female
Sam ==> male

Number of letters: 2
Accuracy ==> 78.60000000000001%
Leonardo ==> male
Amy ==> female
Sam ==> male

Number of letters: 3
Accuracy ==> 76.6%
Leonardo ==> male
Amy ==> female
Sam ==> female

Number of letters: 4
Accuracy ==> 70.8%
Leonardo ==> male
Amy ==> female
Sam ==> female


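#### Analyzing sentence sentiment - Sentiment Analysis
- Sentiment analysis is one of the most popular applications of NLP
- Output: whether the given text is 'positive' or 'negative'. A 'neutral' class is sometimes added
- Sentiment analysis is often used to find out how people feel about a particular topic
- Use cases: user sentiment in marketing campaigns, social media, and e-commerce customer scenarios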

In [0]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Out[27]: True

In [0]:
# Define a function for extracting features
def extract_features(word_list):
    return dict([(word, True) for word in word_list])
 

    
# Load positive and negative reviews from the movie review corpus provided by NLTK
positive_fileids = movie_reviews.fileids('pos')
negative_fileids = movie_reviews.fileids('neg')

# Build labeled feature sets for the positive and the negative reviews
features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 
        'Positive') for f in positive_fileids]
features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 
        'Negative') for f in negative_fileids]

# Split the data into train and test (80/20)
threshold_factor = 0.8
threshold_positive = int(threshold_factor * len(features_positive))
threshold_negative = int(threshold_factor * len(features_negative))

# Assemble the training and test sets
features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]  
print("\nNumber of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))

# Train a Naive Bayes classifier
classifier = NaiveBayesClassifier.train(features_train)
print("\nAccuracy of the classifier:", 100*nltk.classify.util.accuracy(classifier, features_test), '%')


Number of training datapoints: 1600
Number of test datapoints: 400

Accuracy of the classifier: 73.5 %


In [0]:
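# The classifier object stores the most informative words found during training
# These words indicate which reviews get classified as positive and which as negative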
print("\nTop 10 most informative words:")
for item in classifier.most_informative_features()[:10]:
    print(item[0])



Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
astounding
avoids
fascination
affecting
animators


In [0]:
# Sample input reviews
input_reviews = [
    "It is an amazing movie", 
    "This is a dull movie. I would never recommend it to anyone.",
    "The cinematography is pretty great in this movie", 
    "The direction was terrible and the story was all over the place" 
]

# Run the classifier on the input sentences and get the predictions
print("\nPredictions:")
for review in input_reviews:
    print("\nReview:", review)
    probdist = classifier.prob_classify(extract_features(review.split()))
    pred_sentiment = probdist.max()
    
    # Print the output
    print("Predicted sentiment:", pred_sentiment )
    print("Probability:", round(probdist.prob(pred_sentiment), 2))


Predictions:

Review: It is an amazing movie
Predicted sentiment: Positive
Probability: 0.61

Review: This is a dull movie. I would never recommend it to anyone.
Predicted sentiment: Negative
Probability: 0.77

Review: The cinematography is pretty great in this movie
Predicted sentiment: Positive
Probability: 0.67

Review: The direction was terrible and the story was all over the place
Predicted sentiment: Negative
Probability: 0.63


In [0]:
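# How this example works
# NLTK's naive Bayes classifier does the classification
# The feature extraction step essentially extracts all the unique words. The NLTK classifier expects its input data in dictionary format
# The classifier is trained to separate sentences into positive and negative
# Look at the most informative words: for example, 'outstanding' indicates a positive review and 'insulting' a negative one. Individual words therefore carry the sentiment signal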
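
The notion of an "informative" word can be sketched by hand. The example below uses a few made-up reviews and compares how likely a word is to be present in positive versus negative reviews. The add-one smoothing and the helper names `presence_prob` and `ratio` are assumptions for this illustration; NLTK's classifier uses its own probability estimator, so this shows only the idea, not its exact computation.

```python
from collections import Counter

# Made-up toy reviews, for illustration only.
train = [
    ("an outstanding and moving film", "Positive"),
    ("outstanding cast great story", "Positive"),
    ("a great and funny film", "Positive"),
    ("an insulting and dull film", "Negative"),
    ("dull story insulting jokes", "Negative"),
    ("a boring and dull film", "Negative"),
]

def extract_features(word_list):
    # Presence features: every word in the review maps to True.
    return {word: True for word in word_list}

# Count in how many reviews of each label a word is present.
doc_counts = {"Positive": Counter(), "Negative": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    doc_counts[label].update(extract_features(text.split()).keys())

def presence_prob(word, label):
    # Add-one smoothing over the two outcomes (present / absent).
    return (doc_counts[label][word] + 1) / (label_counts[label] + 2)

def ratio(word):
    # How much more likely the word is in positive than in negative reviews.
    return presence_prob(word, "Positive") / presence_prob(word, "Negative")

for word in ["outstanding", "insulting", "film"]:
    print(word, round(ratio(word), 2))
```

A ratio far from 1 in either direction marks a word as informative; "film" appears equally often in both classes, so its ratio is 1 and it carries no sentiment signal.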

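#### Identifying patterns in text with topic modeling
- Topic modeling is the process of identifying hidden patterns in text data; the goal is to discover the hidden topic structure of a set of documents. Topic modeling helps organize documents so that they can be classified
- Introduction to LDA
  - http://www.cs.columbia.edu/~blei/topicmodeling.html
- Introduction to Gensim
  - https://radimrehurek.com/gensim/install.html
- Further reading on LDA:
  - http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/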

In [0]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.2.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
[?25l[K     |                                | 10 kB 21.8 MB/s eta 0:00:02[K     |                                | 20 kB 23.4 MB/s eta 0:00:02[K     |                                | 30 kB 12.2 MB/s eta 0:00:02[K     |                                | 40 kB 7.0 MB/s eta 0:00:04[K     |                                | 51 kB 6.6 MB/s eta 0:00:04[K     |                                | 61 kB 7.6 MB/s eta 0:00:04[K     |                                | 71 kB 7.7 MB/s eta 0:00:04[K     |                                | 81 kB 8.6 MB/s eta 0:00:03[K     |▏                               | 92 kB 7.3 MB/s eta 0:00:04[K     |▏                               | 102 kB 7.4 MB/s eta 0:00:04[K     |▏                               | 112 kB 7.4 MB/s eta 0:00:04[K     |▏                               | 122 kB 7.4 MB/s eta 0:00:04[K     |▏                       

In [0]:
from nltk.tokenize import RegexpTokenizer  
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[54]: True

In [0]:
# Load the training data

# Read the txt file from DBFS with Spark
df = spark.read.text('/FileStore/tables/data_topic_modeling.txt')

# All the data in the txt file is loaded into a single DataFrame column
df1 = df.toPandas()
df1

# # Split the text data from the txt file on commas, keeping the values as strings
# expand_df = df1['value'].str.split(',', expand=True)

# # Brute-force dictionary approach: rename all the columns of the dataset in bulk
# new_dict = {key:'col'+'_'+str(i) for i, key in enumerate(expand_df.columns)}
# expand_df.rename(columns=new_dict, inplace=True)

# # Convert str columns to numeric
# lst_to_num = ['col_0', 'col_1', 'col_2']
# for i in lst_to_num:
#     expand_df[i] = pd.to_numeric(expand_df[i])
    
# expand_df

# print(expand_df.head())

input_ = []
for i in range(len(df1)):
    input_.append(df1.iloc[i,0])

input_

Out[95]: ['He spent a lot of time studying cryptography. ',
 'You need to have a very good understanding of modern encryption systems in order to work there.',
 "If their team doesn't win this match, they will be out of the competition.",
 'Those codes are generated by a specialized machine. ',
 'The club needs to develop a policy of training and promoting younger talent. ',
 'His movement off the ball is really great. ',
 'In order to evade the defenders, he needs to move swiftly.',
 'We need to make sure only the authorized parties can read the message. ']

In [0]:
# Class to preprocess text. The preprocessor sets up the required operators and extracts the relevant features from the input text
class Preprocessor(object):
    # Initialize various operators
    def __init__(self):
        # Create a regular expression tokenizer
        self.tokenizer = RegexpTokenizer(r'\w+')

        # Get the list of stop words so they can be excluded during analysis
        self.stop_words_english = stopwords.words('english')

        # Create a Snowball stemmer
        self.stemmer = SnowballStemmer('english')
        
    # Tokenizing, stop word removal, and stemming
    def process(self, input_text):
        
        # Tokenize the string
        tokens = self.tokenizer.tokenize(input_text.lower())

        # Remove the stop words
        tokens_stopwords = [x for x in tokens if x not in self.stop_words_english]
        
        # Perform stemming on the tokens
        tokens_stemmed = [self.stemmer.stem(x) for x in tokens_stopwords]
        
        # Return the processed tokens
        return tokens_stemmed
    
    
    

data = input_.copy()


# Create a preprocessor object
preprocessor = Preprocessor()

# Create a list of preprocessed documents
processed_tokens = [preprocessor.process(x) for x in data]

# Create a dictionary based on the tokenized documents
dict_tokens = corpora.Dictionary(processed_tokens)

# Create a document-term matrix
corpus = [dict_tokens.doc2bow(text) for text in processed_tokens]

# Generate the LDA (Latent Dirichlet Allocation) model based on the corpus we just created: define the parameters and initialize the LDA model object
num_topics = 2
num_words = 4
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dict_tokens, passes=25)

# With the two topics identified, print the words that contribute most to each
print("\nMost contributing words to the topics:")
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print("\nTopic", item[0], "==>", item[1])




Most contributing words to the topics:

Topic 0 ==> 0.051*"need" + 0.031*"talent" + 0.031*"club" + 0.031*"younger"

Topic 1 ==> 0.063*"need" + 0.038*"order" + 0.037*"work" + 0.037*"good"


In [0]:
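# How LDA works
# Topic modeling assigns topics by identifying the words in the documents that are most meaningful and most characteristic of a topic
# A regular expression tokenizer is used because only the words are needed, without punctuation or other symbols
# Stop word removal is another important step; it reduces the noise from very common words
# The words are then stemmed to obtain their base forms
# LDA builds the topic model. It represents each document as a mixture of topics, and each topic "emits" words with certain probabilities. The goal of LDA is to find these topics: it is a generative model that looks for the set of topics that could have generated the given documents. On a small sample like this one, some of the words in a topic look unrelated; running the program on a larger dataset should give more accurate results.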
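
The generative story that LDA assumes can be sketched with the standard library alone. Everything in this sketch is hypothetical: the two topics, their word probabilities, and the helper names are made up for illustration. Training an LDA model, as `LdaModel` does above, runs this process in reverse: it starts from the documents and estimates the topics and mixtures.

```python
import random

random.seed(7)

# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
topics = {
    "crypto": {"cipher": 0.4, "encrypt": 0.3, "code": 0.2, "need": 0.1},
    "sports": {"team": 0.4, "match": 0.3, "ball": 0.2, "need": 0.1},
}

def sample_dirichlet(alpha, size):
    # Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) samples.
    draws = [random.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, alpha=0.5):
    # Step 1: pick the document's topic mixture.
    names = list(topics)
    mixture = sample_dirichlet(alpha, len(names))
    words = []
    for _ in range(num_words):
        # Step 2: pick a topic for this word position.
        topic = random.choices(names, weights=mixture)[0]
        # Step 3: pick a word from that topic's distribution.
        vocab = topics[topic]
        words.append(random.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

mixture, words = generate_document(8)
print([round(p, 2) for p in mixture])
print(words)
```

The word "need" has some probability under both hypothetical topics, which mirrors the result above, where "need" ranks high in both learned topics.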