Last updated on 01-26-2018.

Written by Zhiya Zuo

Email: [zhiyazuo@gmail.com](mailto:zhiyazuo@gmail.com)

---

Example usage of the `Preprocessor` class.

In [1]:
import tm_preprocessor

In [2]:
tm_preprocessor.__version__

'0.0.3'

#### Load example data

Example from https://radimrehurek.com/gensim/tut1.html#corpus-formats, with some minor modifications.

In [3]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time 2017 [\t]",
             "He went to the gym and swam.",  
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees, and well quasi ordering",
             "Graph minors: A survey"]

#### Create a `Preprocessor` object

In [4]:
from tm_preprocessor import Preprocessor

In [5]:
help(Preprocessor)

Help on class Preprocessor in module tm_preprocessor.preprocessor:

class Preprocessor(builtins.object)
 |  Preprocessor of text corpus before feeding into topic modeling algorithms
 |  
 |  Attributes
 |  ----------
 |  corpus : np.array
 |      Processed corpus.
 |  documents : iteratble object (list/tuple/numpy array...)
 |      A list of documents
 |  punctuations : str
 |      String sequence of punctuations to be removed. By default: "!@#$%^*(),.:;&=+-_?'`\
 |  stopwords : np.array
 |      An array of stopwords.
 |  vocabulary : corpora.Dictionary
 |      Dictionary of self.corpus
 |  
 |  Methods
 |  -------
 |  add_stopwords(additional_stopwords)
 |      Add additional stop words (`additional_stopwords`) to `self.stopwords`.
 |  tokenize(normalizer, min_freq, max_freq, min_length)
 |      Tokenize the corpus into bag of words using the specified `stemmer` or `lemmatizer` and
 |      minimum/maximum frequency (`min_freq` and `max_freq`) and length (`min_length`) of words/tokens.

In [6]:
preprocessor = Preprocessor(documents)

##### States of the object `preprocessor`

###### `documents`, which stores the original documents

In [7]:
preprocessor.documents

['Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time 2017 [\\t]',
 'He went to the gym and swam.',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees, and well quasi ordering',
 'Graph minors: A survey']

###### `punctuations` to be removed

The default value is shown below.

In [8]:
preprocessor.punctuations

'"!@#$%^*(),.:;&=+-_?\\\'`[]'

###### `stopwords`

The default values are shown below. More can be added with the `preprocessor.add_stopwords` method.

In [9]:
preprocessor.stopwords

array(['a', "a's", 'ableabout', 'about', 'above', 'according',
       'accordingly', 'across', 'actually', 'after', 'afterwards', 'again',
       'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone',
       'along', 'already', 'also', 'although', 'always', 'am', 'among',
       'amongst', 'an', 'and', 'and/or', 'another', 'any', 'anybody',
       'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere',
       'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't",
       'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at',
       'available', 'away', 'awfully', 'be', 'became', 'because', 'become',
       'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind',
       'being', 'believe', 'below', 'beside', 'besides', 'best', 'better',
       'between', 'beyond', 'both', 'brief', 'bring', 'but', 'by', "c'mon",
       "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause', 'causes',
       'certain', 'certainly', 'changes', 'clearly', 'co', '

###### `corpus` and `vocabulary`

`corpus` already holds the lowercased, tokenized documents. `vocabulary` is still `None` because no processing has been done yet.

In [10]:
preprocessor.corpus

array([ list(['human', 'machine', 'interface', 'for', 'lab', 'abc', 'computer', 'applications']),
       list(['a', 'survey', 'of', 'user', 'opinion', 'of', 'computer', 'system', 'response', 'time', '2017', '[', '\\t', ']']),
       list(['he', 'went', 'to', 'the', 'gym', 'and', 'swam', '.']),
       list(['the', 'eps', 'user', 'interface', 'management', 'system']),
       list(['system', 'and', 'human', 'system', 'engineering', 'testing', 'of', 'eps']),
       list(['relation', 'of', 'user', 'perceived', 'response', 'time', 'to', 'error', 'measurement']),
       list(['the', 'generation', 'of', 'random', 'binary', 'unordered', 'trees']),
       list(['the', 'intersection', 'graph', 'of', 'paths', 'in', 'trees']),
       list(['graph', 'minors', 'iv', 'widths', 'of', 'trees', ',', 'and', 'well', 'quasi', 'ordering']),
       list(['graph', 'minors', ':', 'a', 'survey'])], dtype=object)

In [11]:
preprocessor.vocabulary

##### Methods of `preprocessor`

In [12]:
help(preprocessor.add_stopwords)

Help on method add_stopwords in module tm_preprocessor.preprocessor:

add_stopwords(additional_stopwords) method of tm_preprocessor.preprocessor.Preprocessor instance
    Add additional stopwords to the current `Preprocessor` object.
    
    Parameters
    ----------
    additional_stopwords : iteratble object (list/tuple/numpy array...; init to None)
        Additional stopwords.



In [13]:
help(preprocessor.remove_stopwords)

Help on method remove_stopwords in module tm_preprocessor.preprocessor:

remove_stopwords() method of tm_preprocessor.preprocessor.Preprocessor instance



In [14]:
help(preprocessor.normalize)

Help on method normalize in module tm_preprocessor.preprocessor:

normalize(normalizer, min_freq=0.05, max_freq=0.95, min_length=1) method of tm_preprocessor.preprocessor.Preprocessor instance
    Normalize corpus by either lemmatization or stemming. Also remove rare/common and short words
    
    Parameters
    ----------
    stemmer : nltk.stem stemmers or lemmatizers
        Stemmer/lemmatizer to use. See http://www.nltk.org/api/nltk.stem.html. If `None`, do not stem.
    min_freq : float
        The minimum frequency (in ratio) of a token to be kept
    max_freq : float
        The maximum frequency (in ratio) of a token to be kept
    min_length : int
        The minimum length of a token to be kept



In [15]:
help(preprocessor.get_word_ranking)

Help on method get_word_ranking in module tm_preprocessor.preprocessor:

get_word_ranking() method of tm_preprocessor.preprocessor.Preprocessor instance
    Get the ranking of words (tokens). Note that this should be done a
    
    Returns
    -------
    pd.DataFrame
        Sorted dataframe with columns `word` and corresponding `frequency`.



---

#### Preprocessing

##### Remove punctuations and digits

In [7]:
preprocessor.remove_digits_punctuactions()

In [8]:
preprocessor.corpus

[['human',
  'machine',
  'interface',
  'for',
  'lab',
  'abc',
  'computer',
  'applications'],
 ['a',
  'survey',
  'of',
  'user',
  'opinion',
  'of',
  'computer',
  'system',
  'response',
  'time',
  't'],
 ['he', 'went', 'to', 'the', 'gym', 'and', 'swam'],
 ['the', 'eps', 'user', 'interface', 'management', 'system'],
 ['system', 'and', 'human', 'system', 'engineering', 'testing', 'of', 'eps'],
 ['relation',
  'of',
  'user',
  'perceived',
  'response',
  'time',
  'to',
  'error',
  'measurement'],
 ['the', 'generation', 'of', 'random', 'binary', 'unordered', 'trees'],
 ['the', 'intersection', 'graph', 'of', 'paths', 'in', 'trees'],
 ['graph',
  'minors',
  'iv',
  'widths',
  'of',
  'trees',
  'and',
  'well',
  'quasi',
  'ordering'],
 ['graph', 'minors', 'a', 'survey']]
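What this step does to a single document can be sketched in plain Python. This is a minimal illustration, not the library's actual implementation: the helper name `strip_digits_punctuation` is hypothetical, and the sketch assumes that every character in the default `punctuations` string and every digit is deleted, after which empty tokens are dropped.

```python
import string

# Default punctuation string as displayed by `preprocessor.punctuations`.
PUNCTUATIONS = '"!@#$%^*(),.:;&=+-_?\\\'`[]'

# Translation table that deletes punctuation characters and digits.
_STRIP_TABLE = str.maketrans('', '', PUNCTUATIONS + string.digits)


def strip_digits_punctuation(tokens):
    """Hypothetical helper: delete digits/punctuation, drop empty tokens."""
    cleaned = (tok.translate(_STRIP_TABLE) for tok in tokens)
    return [tok for tok in cleaned if tok]


# Second document of the tokenized corpus shown earlier.
doc = ['a', 'survey', 'of', 'user', 'opinion', 'of', 'computer',
       'system', 'response', 'time', '2017', '[', '\\t', ']']
cleaned_doc = strip_digits_punctuation(doc)
print(cleaned_doc)
```

Under these assumptions, `'2017'`, `'['` and `']'` disappear and the token `'\\t'` is reduced to `'t'`, which matches the second document in the output above.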

##### Remove stopwords

In [9]:
preprocessor.remove_stopwords()

In [10]:
preprocessor.corpus

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['gym', 'swam'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]
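The effect of stopword removal, and of adding extra stopwords first, can be sketched as a simple set-membership filter. This is only an illustration under stated assumptions: `STOPWORDS` is a small hand-picked subset rather than the library's full default list, and the function names are hypothetical.

```python
# Illustrative subset only; the real default list is much longer.
STOPWORDS = {'a', 'and', 'he', 'of', 'the', 'to', 'went'}


def remove_stopwords(tokens, stopwords):
    """Hypothetical helper: keep only tokens that are not stopwords."""
    return [tok for tok in tokens if tok not in stopwords]


def add_stopwords(stopwords, additional_stopwords):
    """Hypothetical helper: return a new set extended with extra stopwords."""
    return set(stopwords) | set(additional_stopwords)


doc = ['he', 'went', 'to', 'the', 'gym', 'and', 'swam']
filtered = remove_stopwords(doc, STOPWORDS)

# Treat a domain-specific word as a stopword as well.
extended = add_stopwords(STOPWORDS, ['gym'])
filtered_extended = remove_stopwords(doc, extended)
print(filtered, filtered_extended)
```

With the subset above, the third document is reduced to `['gym', 'swam']`, as in the output. After adding `'gym'` as an extra stopword, only `'swam'` remains.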

##### Normalize

Set `min_freq` to 0 and `max_freq` to 1, so that no token is dropped because of its frequency, and `min_length` to 2.

In [11]:
min_freq = 0
max_freq = 1
min_len = 2

Use lemmatization.

In [12]:
import nltk

In [13]:
lemmatizer = nltk.WordNetLemmatizer()

In [14]:
preprocessor.normalize(lemmatizer, min_freq, max_freq, min_len)

In [15]:
preprocessor.corpus

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'application'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['gym', 'swam'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'test', 'eps'],
 ['relation', 'user', 'perceive', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'tree'],
 ['intersection', 'graph', 'path', 'tree'],
 ['graph', 'minor', 'iv', 'width', 'tree', 'quasi', 'order'],
 ['graph', 'minor', 'survey']]
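The frequency and length thresholds can be illustrated with a small filter. This sketch rests on an assumption that the source does not confirm: it reads "frequency (in ratio)" as the fraction of documents that contain a token. The function name `filter_tokens` is hypothetical, and lemmatization is left out.

```python
from collections import Counter


def filter_tokens(corpus, min_freq=0.05, max_freq=0.95, min_length=1):
    """Hypothetical helper: drop rare, common, and short tokens.

    Assumes `min_freq`/`max_freq` bound the share of documents
    in which a token occurs.
    """
    n_docs = len(corpus)
    # Count each token once per document.
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))

    def keep(tok):
        ratio = doc_freq[tok] / n_docs
        return min_freq <= ratio <= max_freq and len(tok) >= min_length

    return [[tok for tok in doc if keep(tok)] for doc in corpus]


corpus = [['graph', 'minor', 'iv', 'tree'],
          ['graph', 'minor', 'survey'],
          ['graph', 'path', 'tree']]

# Thresholds used in this notebook: nothing is removed.
unchanged = filter_tokens(corpus, 0, 1, 2)

# Stricter thresholds: 'graph' is too common, 'iv'/'survey'/'path' too rare.
trimmed = filter_tokens(corpus, 0.5, 0.9, 3)
print(trimmed)
```

With `min_freq = 0`, `max_freq = 1`, and `min_length = 2`, every token survives, which is consistent with the output above differing from the previous corpus only by lemmatization.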

##### Serialize the corpus

###### Save in `Matrix Market` format

See http://math.nist.gov/MatrixMarket/formats.html

In [16]:
preprocessor.serialize(format_='MmCorpus', path='~/Desktop')
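To show what a corpus looks like in Matrix Market coordinate format, here is a pure-Python sketch that builds the text by hand. It is not what `serialize` does internally: the `format_='MmCorpus'` argument presumably refers to gensim's `MmCorpus` class, whereas the function `to_matrix_market` below is hypothetical and only mimics the layout (header line, a `rows cols nonzeros` line, then 1-indexed `document term count` triples).

```python
def to_matrix_market(corpus):
    """Hypothetical helper: render a tokenized corpus as Matrix Market text."""
    vocab = {}    # token -> 1-based term id, in order of first appearance
    entries = []  # (doc_id, term_id, count) triples
    for doc_id, doc in enumerate(corpus, start=1):
        counts = {}
        for tok in doc:
            term_id = vocab.setdefault(tok, len(vocab) + 1)
            counts[term_id] = counts.get(term_id, 0) + 1
        entries.extend((doc_id, t, n) for t, n in sorted(counts.items()))

    lines = ['%%MatrixMarket matrix coordinate integer general',
             f'{len(corpus)} {len(vocab)} {len(entries)}']
    lines.extend(f'{d} {t} {n}' for d, t, n in entries)
    return '\n'.join(lines) + '\n', vocab


mm_text, vocab = to_matrix_market([['graph', 'minor', 'survey'],
                                   ['system', 'human', 'system']])
print(mm_text)
```

Each row of the matrix is a document and each column a vocabulary term, so the repeated `'system'` in the second document appears as a single entry with count 2.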

---

#### Frequencies of words/tokens

In [17]:
preprocessor.get_word_ranking()

| | word | frequency |
|---|---|---|
| 0 | system | 4 |
| 1 | user | 3 |
| 2 | tree | 3 |
| 3 | graph | 3 |
| 4 | human | 2 |
| 5 | eps | 2 |
| 6 | time | 2 |
| 7 | minor | 2 |
| 8 | response | 2 |
| 9 | survey | 2 |
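The counts in this ranking can be reproduced from the normalized corpus with `collections.Counter`. This is a sketch of the idea rather than the library's implementation, which returns a `pd.DataFrame`. The order among tokens with equal counts may differ from the table.

```python
from collections import Counter

# Normalized corpus as printed after `preprocessor.normalize(...)`.
corpus = [
    ['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'application'],
    ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
    ['gym', 'swam'],
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'test', 'eps'],
    ['relation', 'user', 'perceive', 'response', 'time', 'error', 'measurement'],
    ['generation', 'random', 'binary', 'unordered', 'tree'],
    ['intersection', 'graph', 'path', 'tree'],
    ['graph', 'minor', 'iv', 'width', 'tree', 'quasi', 'order'],
    ['graph', 'minor', 'survey'],
]

# Count every token occurrence across all documents.
word_counts = Counter(tok for doc in corpus for tok in doc)

# (word, frequency) pairs, most frequent first.
ranking = word_counts.most_common()
print(ranking[:4])
```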
