Text data usually consists of unstructured symbols such as words, punctuation, and digits, which combine into strings of varying length. Handling this data is tedious but necessary before NLP tasks such as feature extraction and classification. To make text processing easier, I wrote four tools for handling English text data:
- text_preprocess.py : Remove punctuation and digits, lemmatize words, and return clean text strings
- text_vectorizer.py : Lightly clean the texts, then vectorize them using tools from scikit-learn
- text_hier_split: Split texts into a hierarchical structure; for example, a text can be divided into sentences, and each sentence can be divided into tokens such as words and punctuation
- token_idx_map: Build a vocabulary and a dictionary for hierarchically structured texts, mapping each word to a unique ID for later use, for example in deep learning applications

In this demo, we use the 20 Newsgroups dataset as an example for text cleaning and the subsequent tasks.

In [1]:
import numpy as np
import nltk
import string
from nltk.corpus import movie_reviews

In [2]:
# Fetch the news texts
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train',
                                      shuffle=True, random_state=11)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     shuffle=True, random_state=11)

In [17]:
print('Number of training texts:', len(newsgroups_train.data))
print('Number of testing texts:', len(newsgroups_test.data))

Number of training texts: 11314
Number of testing texts: 7532


Let's take a look at one of the posts. It is quite messy, which is why we need preprocessing.

In [8]:
print(newsgroups_train.data[2])

From: bear@kestrel.fsl.noaa.gov (Bear Giles)
Subject: Re: Fifth Amendment and Passwords
Organization: Forecast Systems Labs, NOAA, Boulder, CO USA
Lines: 37

In article <1993Apr15.160415.8559@magnus.acs.ohio-state.edu> ashall@magnus.acs.ohio-state.edu (Andrew S Hall) writes:
>I am postive someone will correct me if I am wrong, but doesn't the Fifth
>also cover not being forced to do actions that are self-incriminating?
>e.g. The police couldn't demand that you silently take them to where the
>body is buried or where the money is hidden.

But they can make you piss in a jar, and possibly provide DNA, semen,
and hair samples or to undergo tests for gunpowder residues on your hand.

(BTW, that was why the chemical engineer arrested in the WTC explosion
thrust his hands into a toilet filled with urine as the cops were breaking
down the door -- the nitrogen in the urine would mask any residue from
explosives.  I found it interesting the news reported his acts, but not
his reasons).

Somewhe

## Clean Texts and Lemmatize Words

In [3]:
from text_preprocess import text_clean
texts = newsgroups_train.data
tp = text_clean(texts)
processed_texts = tp.proceed()

Start to process....
Processing Finished! Timing:  25.959


In [5]:
print(processed_texts[2])

from bear kestrel fsl noaa gov bear giles subject re fifth amendment and password organization forecast system lab noaa boulder co usa line in article apr magnus ac ohio state edu ashall magnus ac ohio state edu andrew s hall writes i am postive someone will correct me if i am wrong but doesn t the fifth also cover not being forced to do action that are self incriminating e g the police couldn t demand that you silently take them to where the body is buried or where the money is hidden but they can make you piss in a jar and possibly provide dna semen and hair sample or to undergo test for gunpowder residue on your hand btw that wa why the chemical engineer arrested in the wtc explosion thrust his hand into a toilet filled with urine a the cop were breaking down the door the nitrogen in the urine would mask any residue from explosive i found it interesting the news reported his act but not his reason somewhere perhaps a privacy group they discussed the legal ramification of using a pas

Now everything other than words is gone, and plural forms such as 'lines' have been reduced to their base form 'line'. The lemmatizer is not perfect, though: in the output above, 'was' became 'wa' and 'as' became 'a'. Be cautious about removing symbols as well: punctuation can carry semantic weight, for example in sentiment analysis, where '?' and '!' are meaningful.
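The cleaning step can be approximated with the standard library alone. The sketch below is not the code in text_preprocess.py: `simple_clean` and its `keep` parameter are hypothetical names introduced for illustration, and lemmatization is left out entirely. It only lowercases the text and strips everything that is not a letter, with an option to keep selected punctuation for tasks such as sentiment analysis.

```python
import re


def simple_clean(text, keep=""):
    """Lowercase `text` and replace every run of characters that are
    neither ASCII letters nor listed in `keep` with a single space.

    Hypothetical helper for illustration only; no lemmatization is done.
    """
    text = text.lower()
    # Build a character class of everything we want to drop.
    pattern = "[^a-z" + re.escape(keep) + "]+"
    return re.sub(pattern, " ", text).strip()


# Punctuation and digits disappear by default.
print(simple_clean("Subject: Re: Fifth Amendment and Passwords"))
print(simple_clean("Lines: 37"))

# Keep '?' and '!' when they may matter, e.g. for sentiment analysis.
print(simple_clean("Really?! 100%", keep="?!"))
```

Passing the characters to preserve as an argument keeps the default behaviour aggressive while letting a downstream task opt in to the symbols it cares about.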

## Vectorize Text Data

In a classical machine learning task, we need to transform texts into fixed-length vectors that serve as input to subsequent algorithms. There are many ways to vectorize texts. For example, we can build a vocabulary in which each word stands for a feature, so that a text is conveniently represented as a vector of the frequencies of the vocabulary words. Alternatively, we can use TF-IDF values in place of the raw frequencies.
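To make the vocabulary idea concrete, here is a minimal bag-of-words sketch in plain Python. It assumes whitespace tokenization and an alphabetically sorted vocabulary, which is a simplification and not how text_vectorizer.py or scikit-learn tokenizes; `bag_of_words` is a hypothetical name used only here.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, vectors) where each vector holds the word counts
    of one document, aligned with the sorted vocabulary.

    Illustrative only: tokens are produced by lowercasing and splitting
    on whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One fixed-length vector per document, one slot per vocabulary word.
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)
for vec in vectors:
    print(vec)
```

Every document gets a vector of the same length, regardless of how long the text is, which is what makes the representation usable as classifier input.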

In [6]:
from text_vectorizer import text_vectorizer
from sklearn.feature_extraction.text import CountVectorizer
tv = text_vectorizer(texts, vectorizer=CountVectorizer())
vecs = tv.proceed()

In [15]:
vecs.shape

(11314, 82810)

Each text is represented as a vector with 82810 features. The output below lists the counts of the words that actually appear in the first text, as (row, column) entries of a sparse matrix.

In [35]:
print(vecs[0])

  (0, 2556)	1
  (0, 32967)	1
  (0, 75210)	1
  (0, 42322)	1
  (0, 33339)	1
  (0, 75844)	2
  (0, 58932)	1
  (0, 39737)	2
  (0, 22607)	2
  (0, 80309)	2
  (0, 71616)	1
  (0, 19936)	3
  (0, 45221)	3
  (0, 21166)	1
  (0, 31029)	1
  (0, 54972)	1
  (0, 49041)	1
  (0, 40646)	1
  (0, 34169)	2
  (0, 70699)	1
  (0, 51173)	1
  (0, 29761)	2
  (0, 68298)	1
  (0, 63505)	1
  (0, 21555)	1
  (0, 44002)	1
  (0, 19979)	2
  (0, 19978)	2
  (0, 25354)	2
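Each line of the sparse output above reads `(row, column)  count`, so the column indices still have to be mapped back to words. The sketch below shows one way to do that with scikit-learn's `CountVectorizer` directly on a toy corpus. It assumes you hold a reference to the fitted vectorizer; whether the `text_vectorizer` wrapper exposes it is not shown in this demo.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps word -> column index; invert it to read the columns.
idx2word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Densify only because the toy corpus is tiny.
first_row = X.toarray()[0]
counts = {idx2word[j]: int(c) for j, c in enumerate(first_row) if c > 0}

print(X.shape)
print(counts)
```

On the full 20 Newsgroups matrix, converting to a dense array is wasteful; iterating over the nonzero entries of a single row is the more economical choice there.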


## Split Text into Hierarchical Structure

Texts vary a lot, but in terms of structure we can view an article as consisting of paragraphs, a paragraph as consisting of sentences, and a sentence as consisting of words. To simplify, we can also represent a text as a sequence of sentences, and each sentence as a sequence of words.

In [37]:
from text_hier_split import text2sents, sent2words
ts = text2sents(texts)
sents = ts.proceed()
print(sents[0])

[['from', ':', 'email', '(', ')', 'subject', ':', 'help', 'organization', ':', 'the', 'internet', 'line', ':', 'nntp', 'posting', 'host', ':', 'enterpoop', '.', 'mit', '.', 'edu', 'to', ':', 'email', 'received', 'from', 'eei', '.', 'eeiihy', '.'], ['vax', '.', 'xpert', '..', 'expo', '.', 'lcs', '.', 'mit', '.', 'edu', '..', 'inet', ':', 'mail', 'user', 'in', 'vax', 'and', 'internet', 'help']]
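The nested-list shape shown above (a text is a list of sentences, each sentence a list of tokens) can be reproduced roughly with two regular expressions. This is a simplified sketch, not the implementation in text_hier_split: `split_hierarchy` is a hypothetical name, and the sentence rule (split after `.`, `!` or `?` followed by whitespace) will mishandle abbreviations and similar cases.

```python
import re


def split_hierarchy(text):
    """Split `text` into sentences, and each sentence into lowercase
    word and punctuation tokens.

    Naive illustration: a sentence ends at '.', '!' or '?' followed by
    whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \w+ matches a word; [^\w\s] matches a single punctuation character.
    return [re.findall(r"\w+|[^\w\s]", s.lower()) for s in sentences if s]


print(split_hierarchy("Hello, world! How are you?"))
```

Keeping punctuation as separate tokens, as the output above does, preserves information that the earlier cleaning step throws away.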


## Map Text into Sequences of IDs
Many machine learning models operate on numbers rather than strings, so we need to convert word tokens to integer IDs.

In [39]:
from token_idx_map import token2idx
ti = token2idx(sents[:5])
sent_idx = ti.proceed()

In [40]:
sent_idx[0]

[[29,
  9,
  16,
  13,
  15,
  56,
  9,
  102,
  60,
  9,
  1,
  216,
  72,
  9,
  266,
  227,
  274,
  9,
  478,
  0,
  204,
  0,
  271,
  4,
  9,
  16,
  477,
  29,
  431,
  0,
  462,
  0],
 [197,
  0,
  489,
  254,
  483,
  0,
  804,
  0,
  204,
  0,
  271,
  254,
  525,
  9,
  637,
  706,
  10,
  197,
  12,
  216,
  102]]
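The mapping from tokens to IDs can be sketched as follows. The ID assignment here (most frequent token gets 0, ties broken by first appearance) is an assumption made for the example; the demo does not state how token_idx_map orders its vocabulary, and `build_vocab` and `map_to_ids` are hypothetical names.

```python
from collections import Counter


def build_vocab(texts):
    """Map each token to an integer ID, most frequent token first.

    `texts` is a list of texts, each a list of sentences, each a list
    of tokens. The frequency ordering is an assumption for illustration.
    """
    counts = Counter(tok for text in texts for sent in text for tok in sent)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common())}


def map_to_ids(texts, vocab):
    """Replace every token with its ID, keeping the nested structure."""
    return [[[vocab[tok] for tok in sent] for sent in text] for text in texts]


texts = [[["a", "b", "a"], ["c", "a"]]]
vocab = build_vocab(texts)
print(vocab)
print(map_to_ids(texts, vocab))
```

Because the nesting is preserved, the ID sequences can be padded per sentence and per text before being fed to a hierarchical model.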