## The outline of this notebook is as follows:

1. **Exploratory Data Analysis (EDA) and Wordclouds** - Analyzing the data by generating simple statistics, such as word frequencies across the different authors, and plotting some wordclouds (with image masks).
2. **Natural Language Processing (NLP) with NLTK (Natural Language Toolkit)** - Introducing basic text processing methods such as tokenization, stop word removal, stemming, and vectorizing text via term frequencies (TF) as well as term frequency-inverse document frequencies (TF-IDF).
3. **Topic Modelling with LDA and NMF** - Implementing the two topic modelling techniques of Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

In [1]:
import base64
import numpy as np
import pandas as pd
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from collections import Counter
from matplotlib.pyplot import imread  # scipy.misc.imread was removed in SciPy 1.2
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
# Loading in the training data with Pandas
train = pd.read_csv("./train.csv")

## 1. The Authors and their works EDA
First, let's look at the first few rows of the dataset. This shows us how the dataset is structured and who the authors are.

In [4]:
train.head()

| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


### Summary statistics of the training set
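We can visualize some basic statistics of the dataset, such as how the text entries are distributed across the authors.
To do so, we use the Plot.ly visualization package to draw some simple bar plots.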

In [5]:
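z = {'EAP': 'Edgar Allan Poe', 'MWS': 'Mary Shelley', 'HPL': 'HP Lovecraft'}
# Count entries per author once, so that bar labels and bar heights stay aligned
# (unique() and value_counts() do not return authors in the same order)
author_counts = train.author.map(z).value_counts()
data = [go.Bar(
            x = author_counts.index,
            y = author_counts.values,
            marker= dict(colorscale='Jet',
                         color = author_counts.values
                        ),
            text='Text entries attributed to Author'
    )]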

layout = go.Layout(
    title='Target variable distribution'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

In [6]:
all_words = train['text'].str.split(expand=True).unstack().value_counts()

In [9]:
data = [go.Bar(
            x = all_words.index.values[2:50],
            y = all_words.values[2:50],
            marker= dict(colorscale='Jet',
                         color = all_words.values[2:50]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Word frequencies (uncleaned) in the training dataset, ranks 3 to 50'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

Notice anything odd about the words that appear in this word frequency plot? Do these words actually tell us much about the themes and concepts that Mary Shelley wants to portray to the reader in her stories?

These are all very commonly occurring words that you could find almost anywhere: not just in spooky stories and novels by our three authors, but also in newspapers, children's books, religious texts, and really almost any other English text. We must therefore preprocess our dataset first to strip out these common words, which do not bring much to the table.
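
To make this concrete, here is a minimal sketch that counts the words of the first training entry using only the standard library. The variable names are illustrative and not part of the notebook, and lower-casing plus whitespace splitting is a simplifying assumption.

```python
from collections import Counter

# First entry of the training set (the row with id26305)
sample = ("This process, however, afforded me no means of ascertaining the "
          "dimensions of my dungeon; as I might make its circuit, and return "
          "to the point whence I set out, without being aware of the fact; "
          "so perfectly uniform seemed the wall.")

# Lower-case and split on whitespace (punctuation stays attached to words)
word_counts = Counter(sample.lower().split())

# The most frequent entries are function words, not theme-bearing words
top_three = word_counts.most_common(3)
print(top_three)
```

Even in a single sentence, the top of the ranking is occupied by words such as "the" and "of", which say nothing about dungeons or walls.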

### WordClouds to visualise each author's work
One very handy visualization tool for a data scientist working on any sort of natural language processing is the word cloud. A word cloud (as the name suggests) is an image made up of the distinct words of a text or book, where the size of each word is proportional to its frequency in that text (the number of times the word appears). Here, instead of dealing with an actual book, we simply take our words from the "text" column.
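
The sizing rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how any particular word cloud library lays out its image: the function name `font_sizes`, the `max_size` parameter, and the toy counts are all hypothetical.

```python
def font_sizes(frequencies, max_size=60):
    """Map each word to a font size proportional to its frequency."""
    top = max(frequencies.values())
    # The most frequent word gets max_size; the rest scale down linearly
    return {word: max_size * count / top for word, count in frequencies.items()}

# Hypothetical counts, made up for illustration only
toy_counts = {"raven": 8, "night": 4, "door": 2}
sizes = font_sizes(toy_counts)
print(sizes)
```

A word that appears half as often as the most frequent word is drawn at half its size.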

#### Store the text of each author in a Python list
We first create three different Python lists that store the texts of Edgar Allan Poe, HP Lovecraft and Mary Shelley respectively as follows:

In [10]:
eap = train[train.author=="EAP"]["text"].values
hpl = train[train.author=="HPL"]["text"].values
mws = train[train.author=="MWS"]["text"].values

In [12]:
# Next to create our WordCloud
from wordcloud import WordCloud, STOPWORDS

ImportError: No module named 'wordcloud'

## 2. Natural Language Processing

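In almost every NLP task you are likely to encounter (NLP being the field that studies the interaction between computers and human language), such as topic modeling, word clustering, or document and text classification, practitioners generally go through the following preprocessing stages to turn raw input text into data that a model or machine can understand. It is unrealistic to hand raw text to a random forest model and expect it to produce predictions straight away.

Text preprocessing can be broken down into the following steps:

1. **Tokenization** - Splitting the text into individual words (tokens).
2. **Stopwords** - Discarding words that occur so frequently that their frequency does nothing to help predictions about the text (words that occur very rarely are often discarded as well).
3. **Stemming** - Merging the variant forms of a word into a single parent word, since they express the same meaning.
4. **Vectorization** - Converting the text into vector form. The simplest approach is the well-known bag-of-words method, which builds a matrix with one row for each document or text in the corpus. In its simplest form this matrix stores word counts, and building it is commonly called vectorizing the raw text.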
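
As a rough illustration of the bag-of-words idea in step 4, the sketch below builds a small count matrix by hand with the standard library. The two documents are invented for illustration, and the variable names are not part of the notebook; scikit-learn's `CountVectorizer`, imported at the top of this notebook, is the library counterpart of this idea.

```python
from collections import Counter

# Two toy documents, invented for illustration
docs = ["the raven sat", "the raven saw the door"]

# Vocabulary: every distinct word, sorted so the columns have a stable order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = word count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
print(matrix)
```

Each row is now a fixed-length numeric vector, which is the kind of input a model can work with.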

In [3]:
import nltk

### 2a. Tokenization
Tokenization is the act of taking a sequence of characters (think of Python strings) in a given document and dicing it up into its individual constituent pieces, which are the eponymous "tokens" of this method. One can loosely think of them as the single words in a sentence. A naive implementation calls the "split()" method on a string, which separates it into a Python list based on the separator passed as the argument. In practice, however, tokenization is not that trivial.

Here we split the first sentence of the text in the training data on a single space as follows:


In [4]:
# Storing the first text element as a string
first_text = train.text.values[0]
print(first_text)
print("="*90)
print(first_text.split(" "))

This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.
['This', 'process,', 'however,', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon;', 'as', 'I', 'might', 'make', 'its', 'circuit,', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out,', 'without', 'being', 'aware', 'of', 'the', 'fact;', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall.']


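However, as you can see, a plain split is not always accurate. Take the second token, "process,": the comma (",") is included in the token, which is not what we want.
Ideally we would separate the comma from the word. Doing that in plain Python is tedious, and this is where the NLTK package comes in handy.
We can use the word_tokenize() method to split words and punctuation marks into separate elements: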


In [5]:
first_text_list = nltk.word_tokenize(first_text)
print(first_text_list)

['This', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', ';', 'as', 'I', 'might', 'make', 'its', 'circuit', ',', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out', ',', 'without', 'being', 'aware', 'of', 'the', 'fact', ';', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall', '.']
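
For intuition about what separating punctuation from words involves, here is a minimal regular-expression sketch. It is a simplified stand-in, not NLTK's algorithm: the function name `simple_tokenize` is hypothetical, and NLTK's tokenizer treats cases such as contractions differently.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("This process, however, afforded me no means")
print(tokens)

# A contraction shows the limits of this rule: the apostrophe becomes its own token
print(simple_tokenize("don't"))
```

On the sentence fragment above the commas are split off as their own tokens, but the contraction example shows why a dedicated tokenizer is preferable for real text.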


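### 2b. Stopword Removal
As mentioned above, stopwords are words that occur so frequently that they contribute very little to prediction or learning. They include words such as "to" and "the", so we need to remove them during preprocessing.
NLTK ships with a predefined list of English stopwords that we can use (153 words in the version used here; the count differs between NLTK releases).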

In [9]:
stopwords = nltk.corpus.stopwords.words('english')
len(stopwords)

153

We can use a list comprehension to filter the stopwords out of our tokenized list:

In [10]:
first_text_list_cleaned = [word for word in first_text_list if word.lower() not in stopwords]
print(first_text_list_cleaned)
print("="*90)
print("Length of original list: {0} words\n"
      "Length of list after stopwords removal: {1} words"
      .format(len(first_text_list), len(first_text_list_cleaned)))

['process', ',', 'however', ',', 'afforded', 'means', 'ascertaining', 'dimensions', 'dungeon', ';', 'might', 'make', 'circuit', ',', 'return', 'point', 'whence', 'set', ',', 'without', 'aware', 'fact', ';', 'perfectly', 'uniform', 'seemed', 'wall', '.']
Length of original list: 48 words
Length of list after stopwords removal: 28 words
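
Note that the cleaned list still contains punctuation tokens such as "," and ";". If those are unwanted, one possible extra filter (an addition for illustration, not part of the original notebook) is to keep only purely alphabetic tokens. Be aware that `str.isalpha()` also drops tokens containing digits or hyphens.

```python
# Output of the stopword-removal cell above, copied verbatim
cleaned = ['process', ',', 'however', ',', 'afforded', 'means', 'ascertaining',
           'dimensions', 'dungeon', ';', 'might', 'make', 'circuit', ',',
           'return', 'point', 'whence', 'set', ',', 'without', 'aware', 'fact',
           ';', 'perfectly', 'uniform', 'seemed', 'wall', '.']

# Keep only tokens made entirely of letters
words_only = [token for token in cleaned if token.isalpha()]
print(len(cleaned), len(words_only))
```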


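### 2c. Word Normalization: Stemming and Lemmatization
The next step in NLP after stopword removal is stemming. This step tries to collapse words that share the same meaning onto a common root. For example, given "running", "runs" and "run", all three words are reduced to "run", although this loses the tense information.

NLTK provides several stemmers, including the Porter stemming algorithm, the Lancaster stemmer and the Snowball stemmer.

In the example below, we start by creating a Porter stemmer instance: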

In [11]:
stemmer = nltk.stem.PorterStemmer()

Now we can check whether the stemmer extracts the stem "run" from the words "running", "runs" and "run":

In [12]:
print("The stemmed form of running is: {}".format(stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(stemmer.stem("run")))

The stemmed form of running is: run
The stemmed form of runs is: run
The stemmed form of run is: run


As we can see, the stemmer has successfully reduced the given words to a base form. This will be most helpful in reducing the size of our vocabulary when we come to learning and classification tasks.

However, stemming has one flaw: the process relies on quite a crude heuristic, chopping off the ends of words in the hope of reducing each word to a human-recognizable base form. It therefore does not take vocabulary or word forms into account when collapsing words, as this example illustrates:

In [13]:
print("The stemmed form of leaves is: {}".format(stemmer.stem("leaves")))

The stemmed form of leaves is: leav
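
To see why suffix chopping produces non-words like "leav", consider this deliberately tiny toy stemmer. It is a hypothetical illustration of the general idea only, not the Porter algorithm, which applies a much larger and more careful set of rules; the function name `naive_stem` and its three suffix rules are assumptions made for this sketch.

```python
def naive_stem(word):
    """Strip a few common English suffixes; a toy heuristic, not Porter."""
    for suffix in ("ing", "es", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runs", "run", "leaves"]:
    print(w, "->", naive_stem(w))
```

The rules know nothing about the dictionary, so "leaves" loses its "es" and ends up as a string that is not an English word.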


### Lemmatization to the rescue
We therefore turn to another method that we can use in lieu of stemming. This method is called lemmatization, and it aims to achieve the same effect as stemming. Unlike a stemmer, however, a lemmatizer reduces words based on an actual dictionary or vocabulary (each word's dictionary form is its lemma), so it will not chop words into stemmed forms that carry no lexical meaning. Here we can use NLTK once again to initialize a lemmatizer (the WordNet variant) and inspect how it collapses words as follows:



In [14]:
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
print("The lemmatized form of leaves is: {}".format(lemm.lemmatize("leaves")))

The lemmatized form of leaves is: leaf
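
The dictionary-lookup idea behind lemmatization can be sketched as follows. The mini-dictionary and the function name `toy_lemmatize` are hypothetical and exist only for illustration; a real lemmatizer such as NLTK's WordNet variant consults a far larger lexical database. The sketch also shows that the lemma can depend on the part of speech, which is why NLTK's `lemmatize` method accepts a `pos` argument that defaults to noun.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMAS = {
    ("leaves", "n"): "leaf",    # noun: plural of "leaf"
    ("leaves", "v"): "leave",   # verb: third person of "leave"
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("leaves"))           # looked up as a noun
print(toy_lemmatize("leaves", pos="v"))  # looked up as a verb
print(toy_lemmatize("running"))          # unknown as a noun here, returned as-is
```

Because the result always comes from the dictionary or is the input itself, a lookup of this kind never invents truncated forms the way a suffix-stripping stemmer can.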
