# Text preprocessing

**Goal:** Transform raw text input into a *normalized* sequence of
*tokens* to prepare it for *feature extraction*.

"<span style="color:red">Hi</span>. This is an <span
style="color:green">example</span> <span
style="color:blue">sentence</span> in an <span
style="color:green">Example</span> <span
style="color:purple">Document</span>." → \[<span
style="color:red">hi</span>, <span style="color:green">example</span>,
<span style="color:blue">sentence</span>, <span
style="color:green">example</span>, <span
style="color:purple">document</span>\]

Text preprocessing involves many steps, and hence many decisions, that
have a **big effect** on your results. Several *possibilities* are shown
here. Whether and how to apply them depends heavily on your data and
your later analysis.

## The document corpus

A *corpus* contains the *documents* that we want to process. Each
document can be accessed by a unique *document label* or *document ID*.
The document itself is usually a (very long) character string (Python
type: *str*) that may contain line breaks.

You normally load a corpus from files, a database or other sources.
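
As a rough illustration of loading from files, the sketch below reads every `.txt` file in a directory into a dictionary keyed by file name. The helper name `load_corpus`, the `.txt` extension and the UTF-8 encoding are assumptions made for this example, not part of the original notebook.

```python
from pathlib import Path
import tempfile

def load_corpus(directory):
    """Map each .txt file's name (without extension) to its text content."""
    return {path.stem: path.read_text(encoding="utf-8")  # assumes UTF-8 files
            for path in sorted(Path(directory).glob("*.txt"))}

# Demo: write two small files to a temporary directory and load them back.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "doc1.txt").write_text("This is Andrew's text, isn't it?", encoding="utf-8")
    Path(tmp, "doc2.txt").write_text("feet cats wolves talked", encoding="utf-8")
    demo_corpus = load_corpus(tmp)

print(demo_corpus)
```

The file name serves as the document label here; a database ID or URL would work just as well.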

In [1]:
# a small toy corpus with three short example documents
corpus = {   # document label: document text
    'doc1': "This is Andrew's text, isn't it?",
    'doc2': "feet cats wolves talked",
    'doc3': "据报到，复旦大学启动校园准封闭管理",    
}

In [2]:
# access by document label
corpus['doc1']

"This is Andrew's text, isn't it?"

## STEP 1: Tokenization

**Goal:** Break down document text into smaller, meaningful components
(paragraphs, sentences, **words**) → from a document, form a list of
*tokens*

In our case: We apply *word tokenization*, so **token = word**

With plain Python: calling `split()` on a string splits it by
*whitespace*:

In [3]:
# split the document on whitespace
corpus['doc1'].split()

['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

**Tokenization is not trivial.**

-   how to handle punctuation, quotes, hyphens?
-   how to handle contractions? ("don't" or "isn't")

→ depends on your text (language, source/medium)

-   `str.split()` might not be optimal
-   [NLTK](http://www.nltk.org/) implements several word and sentence
    tokenizers, e.g.:
    -   [WordPunctTokenizer](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.WordPunctTokenizer):
        punctuation marks become separate tokens
    -   [TreebankWordTokenizer](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.treebank.TreebankWordTokenizer):
        the default tokenizer, based on a fixed set of rules
    -   [RegexpTokenizer](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.regexp):
        define your own tokenizer with [regular
        expression](https://en.wikipedia.org/wiki/Regular_expression)
        rules
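
To make the idea of a rule-based tokenizer concrete, here is a minimal sketch that uses Python's built-in `re` module rather than NLTK's tokenizer classes. The pattern and the function name `regex_tokenize` are assumptions chosen for this example; it keeps contractions such as "isn't" together and turns each punctuation character into its own token.

```python
import re

# A token is either a word, optionally followed by an apostrophe part
# (e.g. "isn't", "Andrew's"), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("This is Andrew's text, isn't it?"))
# ['This', 'is', "Andrew's", 'text', ',', "isn't", 'it', '?']
```

Whether contractions should stay in one piece or be split depends on the later analysis, which is exactly the kind of decision a custom pattern lets you control.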

In [4]:
import nltk, jieba
# nltk.download('punkt')

With `nltk.tokenize.wordpunct_tokenize()`:

In [6]:
# punctuation becomes separate tokens
nltk.tokenize.wordpunct_tokenize(corpus['doc1'])

['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

With `nltk.word_tokenize()`:

In [10]:
# word_tokenize uses Treebank-style tokenization rules by default
nltk.tokenize.word_tokenize(corpus['doc1'])

['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']

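**Chinese word segmentation**

-   [jieba](https://github.com/fxsjy/jieba) is a Chinese word segmentation library. `jieba.cut()` accepts the following arguments (`jieba.lcut()` takes the same arguments and returns a list):
    - the string to segment
    - `cut_all`: controls whether full mode is used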
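
The difference between the default mode and full mode can be sketched without jieba. The example below uses a tiny hypothetical dictionary: `full_mode` lists every dictionary word found at any position, while `forward_max_match` greedily takes the longest word at each step. This is only an illustration of the two kinds of output; jieba's own algorithm is more sophisticated and relies on word frequencies.

```python
# Hypothetical toy dictionary; jieba ships a far larger one with frequencies.
TOY_DICT = {"复旦", "大学", "复旦大学", "启动", "校园"}
MAX_LEN = max(len(word) for word in TOY_DICT)

def full_mode(text):
    """Return every dictionary word found at any position (overlaps allowed)."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, min(start + MAX_LEN, len(text)) + 1):
            if text[start:end] in TOY_DICT:
                found.append(text[start:end])
    return found

def forward_max_match(text):
    """Greedy segmentation: take the longest dictionary word at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in TOY_DICT:
                tokens.append(candidate)
                pos += length
                break
    return tokens

text = "复旦大学启动校园"
print(full_mode(text))          # ['复旦', '复旦大学', '大学', '启动', '校园']
print(forward_max_match(text))  # ['复旦大学', '启动', '校园']
```

Full mode is useful when recall matters (e.g. search indexing), but the overlapping tokens inflate word counts.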

In [13]:
# default (precise) mode
jieba.lcut(corpus['doc3'])

['据', '报到', '，', '复旦大学', '启动', '校园', '准', '封闭', '管理']

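With `cut_all=True` (full mode), every word found in the dictionary is returned, so tokens may overlap:

In [14]:
# full mode: return all possible words
jieba.lcut(corpus['doc3'], cut_all=True)

['据', '报到', '，', '复旦', '复旦大学', '大学', '启动', '校园', '准', '封闭', '闭管', '管理']

**Tokenize the whole corpus**

In [20]: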
tokens = {doc_label: nltk.word_tokenize(text) if doc_label != 'doc3'
          else jieba.lcut(text)
          for doc_label, text in corpus.items()}
tokens

{'doc1': ['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?'],
 'doc2': ['feet', 'cats', 'wolves', 'talked'],
 'doc3': ['据', '报到', '，', '复旦大学', '启动', '校园', '准', '封闭', '管理']}

## STEP 2: Text normalization

## Stemming or Lemmatization

**Goal:** Reduce inflected words to a common form so that they're
counted as one.

### Stemming

Remove affixes from a word to get the base form *(stem)* of the word →
the stem might not be a valid dictionary word

-   books → book
-   booked → book
-   **employees → employ**
-   **argued → argu**
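
The following sketch shows why stems need not be real words. It is a deliberately naive suffix stripper, not the Porter algorithm; the suffix list and the name `naive_stem` are assumptions made for this example.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["books", "booked", "argued", "wolves"]])
# ['book', 'book', 'argu', 'wolv']
```

Real stemmers apply many more rules and conditions, but the principle of mechanically cutting off endings is the same.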

NLTK implements several stemming algorithms, e.g.:

-   [PorterStemmer](https://www.nltk.org/howto/stem.html)

In [15]:
corpus['doc2']

'feet cats wolves talked'

With the `stem()` method of the `nltk.stem.PorterStemmer` class:

In [21]:
# stem each token of doc2 with the Porter stemmer
stemmer = nltk.stem.PorterStemmer()
[stemmer.stem(token) for token in tokens['doc2']]

['feet', 'cat', 'wolv', 'talk']

### Lemmatization

Find the *lemma* (dictionary form) of an inflected word → a lemma is
always a valid dictionary word

Implemented for English in NLTK with `WordNetLemmatizer`.
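
To illustrate the principle without WordNet, here is a toy lemmatizer that combines a table of irregular forms with suffix rules and only accepts a candidate if it is in a known vocabulary. The vocabulary, the irregular-form table and the name `toy_lemmatize` are all hypothetical and exist only for this sketch.

```python
# Hypothetical toy resources; a real lemmatizer uses a full dictionary.
VOCABULARY = {"foot", "cat", "wolf", "talk", "book"}
IRREGULAR = {"feet": "foot", "wolves": "wolf"}
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]  # noun plural endings

def toy_lemmatize(word):
    """Return the dictionary form of a noun, or the word itself if unknown."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in VOCABULARY:  # only accept real words
                return candidate
    return word

print([toy_lemmatize(w) for w in ["feet", "cats", "wolves", "talked"]])
# ['foot', 'cat', 'wolf', 'talked']
```

Because the rules above only cover noun plurals, the verb form "talked" passes through unchanged.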

In [22]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/kirito/nltk_data...


True

In [24]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /Users/kirito/nltk_data...


True

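With the `lemmatize()` method of the `nltk.stem.WordNetLemmatizer` class. Without a `pos` argument every token is treated as a noun, which is why the verb "talked" is left unchanged: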

In [25]:
# lemmatize each token of doc2 (treated as nouns by default)
lemmatizer = nltk.stem.WordNetLemmatizer()
[lemmatizer.lemmatize(token) for token in tokens['doc2']]

['foot', 'cat', 'wolf', 'talked']

### Normalizing capital letters

Usually: convert all words to lowercase.

**Can be problematic because of "capitonyms":**

-   e.g. in English: "May" ≠ "may", "Pole" ≠ "pole"

Methods in Python: `str.lower()`, `str.upper()`
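
A small sketch of the capitonym problem: after lowercasing, tokens with different meanings become indistinguishable. The example token list is made up for illustration.

```python
example_tokens = ["May", "may", "Pole", "pole"]
lowered = [t.lower() for t in example_tokens]

print(lowered)                                  # ['may', 'may', 'pole', 'pole']
print(len(set(example_tokens)), len(set(lowered)))  # 4 2
```

Four distinct tokens collapse into two, so the month "May" and the verb "may" are counted together.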

In [26]:
# convert every token of doc1 to lowercase
[t.lower() for t in tokens["doc1"]]

['this', 'is', 'andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']

### Normalizing whole tokens

In [28]:
# lemmatize and lowercase every token in every document
normtokens = {doc_label: [lemmatizer.lemmatize(t).lower() for t in tokenlist]
    for doc_label, tokenlist in tokens.items()
}
normtokens

{'doc1': ['this', 'is', 'andrew', "'s", 'text', ',', 'is', "n't", 'it', '?'],
 'doc2': ['foot', 'cat', 'wolf', 'talked'],
 'doc3': ['据', '报到', '，', '复旦大学', '启动', '校园', '准', '封闭', '管理']}

## STEP 3: Removing stopwords

*Stopwords* are words that are removed before doing further text
analysis.

Usually: very common words of a language that carry little
information.

Stopword list depends on:

-   language
-   your data / research scenario (filter out words that are too common)
-   later text analysis method, e.g.:
    -   *tf-idf* automatically reduces importance of very common words
        (as opposed to *Bag-of-Words*)
    -   sentiment analysis: bad idea to have words like "not" in the
        stopword list!
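
The sentiment analysis point can be sketched as follows: start from a general stopword list and take the negation words back out before filtering. The tiny `base_stopwords` set and the `negations` set are hypothetical stand-ins for a real list.

```python
# Tiny hypothetical stopword list; real lists contain far more entries.
base_stopwords = {"i", "this", "is", "a", "it", "not", "no"}
negations = {"not", "no", "n't"}

# Keep negations in the text by removing them from the stopword set.
sentiment_stopwords = base_stopwords - negations

review_tokens = ["this", "is", "not", "a", "good", "movie"]
filtered = [t for t in review_tokens if t not in sentiment_stopwords]
print(filtered)  # ['not', 'good', 'movie']
```

With the unmodified list, "not" would be dropped and the review would look positive.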

NLTK has a list of stopwords for some languages:

In [29]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kirito/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [30]:
stopwords_english = nltk.corpus.stopwords.words('english')

In [31]:
stopwords_english[:5]

['i', 'me', 'my', 'myself', 'we']

In [32]:
len(stopwords_english)

179

**Chinese stopwords**

Load the Chinese stopwords from the file "stop_words".

In [33]:
# read one stopword per line (assumes the file is UTF-8 encoded)
with open('stop_words', encoding='utf-8') as f:
    stopwords_chinese = [line.strip() for line in f]

In [34]:
# look at a few entries
stopwords_chinese[100:105]

['他', '以', '们', '任', '会']

**Remove English and Chinese stopwords**

In [46]:
# combine both lists into one set for fast membership tests
stopwords = set(stopwords_chinese) | set(stopwords_english)

In [55]:
stoptokens = {doc_label: [t for t in tokenlist if t not in stopwords]
    for doc_label, tokenlist in normtokens.items()
}

In [56]:
stoptokens

{'doc1': ['andrew', "'s", 'text', "n't"],
 'doc2': ['foot', 'cat', 'wolf', 'talked'],
 'doc3': ['报到', '复旦大学', '启动', '校园', '准', '封闭', '管理']}