Last updated on 01-16-2018.

Written by Zhiya Zuo

Email: [zhiyazuo@gmail.com](mailto:zhiyazuo@gmail.com)

---

Example usage of the `Preprocessor` class.

In [1]:
import tm_preprocessor

In [2]:
tm_preprocessor.__version__

'0.0.2'

#### Load example data

Example from https://radimrehurek.com/gensim/tut1.html#corpus-formats, with some minor modifications.

In [3]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time 2017 [\\t]",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees, and well quasi ordering",
             "Graph minors: A survey"]

#### Create a `Preprocessor` object

In [4]:
from tm_preprocessor import Preprocessor

In [5]:
help(Preprocessor)

Help on class Preprocessor in module tm_preprocessor.preprocessor:

class Preprocessor(builtins.object)
 |  Preprocessor of text corpus before feeding into topic modeling algorithms
 |  
 |  Attributes
 |  ----------
 |  corpus : np.array
 |      Processed corpus.
 |  documents : iteratble object (list/tuple/numpy array...)
 |      A list of documents
 |  punctuations : str
 |      String sequence of punctuations to be removed. By default: "!@#$%^*(),.:;&=+-_?'`\
 |  stopwords : np.array
 |      An array of stopwords.
 |  vocabulary : corpora.Dictionary
 |      Dictionary of self.corpus
 |  
 |  Methods
 |  -------
 |  remove_digits_punctuactions()
 |      Remove both digits and punctuations in the corpus.
 |  add_stopwords(additional_stopwords)
 |      Add additional stop words (`additional_stopwords`) to `self.stopwords`.
 |  tokenize(stemmer, min_freq, min_length)
 |      Tokenize the corpus into bag of words using the specified `stemmer` and
 |      minimum frequency (`min_freq`) a

In [6]:
preprocessor = Preprocessor(documents)

##### States of the object `preprocessor`

###### `documents` that store the original documents

In [7]:
preprocessor.documents

['Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time 2017 [\\t]',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees, and well quasi ordering',
 'Graph minors: A survey']

###### `punctuations` that are to be removed.

The default values are shown here.

In [8]:
preprocessor.punctuations

'"!@#$%^*(),.:;&=+-_?\\\'`[]'

###### `stopwords`

The default values are shown here. More stopwords can be added with the `preprocessor.add_stopwords` method.

In [9]:
preprocessor.stopwords

array(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
       'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his',
       'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself',
       'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
       'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
       'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
       'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
       'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
       'by', 'for', 'with', 'about', 'against', 'between', 'into',
       'through', 'during', 'before', 'after', 'above', 'below', 'to',
       'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
       'again', 'further', 'then', 'once', 'here', 'there', 'when',
       'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
       'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 

###### `corpus` and `vocabulary`

Both are `None` for now because no processing has been done yet.

In [10]:
preprocessor.corpus

In [11]:
preprocessor.vocabulary

##### Methods of `preprocessor`

In [12]:
help(preprocessor.add_stopwords)

Help on method add_stopwords in module tm_preprocessor.preprocessor:

add_stopwords(additional_stopwords) method of tm_preprocessor.preprocessor.Preprocessor instance
    Add additional stopwords to the current `Preprocessor` object.
    
    Parameters
    ----------
    additional_stopwords : iteratble object (list/tuple/numpy array...; init to None)
        Additional stopwords.



In [13]:
help(preprocessor.remove_digits_punctuactions)

Help on method remove_digits_punctuactions in module tm_preprocessor.preprocessor:

remove_digits_punctuactions() method of tm_preprocessor.preprocessor.Preprocessor instance
    Remove digits and punctuations



In [14]:
help(preprocessor.tokenize)

Help on method tokenize in module tm_preprocessor.preprocessor:

tokenize(stemmer=<PorterStemmer>, min_freq=1, min_length=1) method of tm_preprocessor.preprocessor.Preprocessor instance
    Tokenize the corpus into bag of words
    
    Parameters
    ----------
    stemmer : nltk.stem stemmers
        Stemmer to use (Porter by default). See http://www.nltk.org/api/nltk.stem.html. If `None`, do not stem.
    min_freq : int
        The minimum frequency of a token to be kept
    min_length : int
        The minimum length of a token to be kept



In [15]:
help(preprocessor.serialize)

Help on method serialize in module tm_preprocessor.preprocessor:

serialize(path='.', format_='MmCorpus') method of tm_preprocessor.preprocessor.Preprocessor instance
    Serialize corpus and build vocabulary.
    
    Parameters
    ----------
    path : str
        The path to save corpus and vocabulary (current directory by default).
    format_ : str
        The format of the serialized corpus. See https://radimrehurek.com/gensim/tut1.html#corpus-formats



In [16]:
help(preprocessor.get_word_ranking)

Help on method get_word_ranking in module tm_preprocessor.preprocessor:

get_word_ranking() method of tm_preprocessor.preprocessor.Preprocessor instance
    Get the ranking of words (tokens). Note that this should be done a
    
    Returns
    -------
    pd.DataFrame
        Sorted dataframe with columns `word` and corresponding `frequency`.



---

#### Preprocessing

##### Remove punctuations and digits

In [7]:
preprocessor.remove_digits_punctuactions()

In [8]:
preprocessor.documents

['Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time 2017 [\\t]',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees, and well quasi ordering',
 'Graph minors: A survey']

In [9]:
preprocessor.corpus

array(['Human machine interface for lab abc computer applications',
       'A survey of user opinion of computer system response time        t ',
       'The EPS user interface management system',
       'System and human system engineering testing of EPS',
       'Relation of user perceived response time to error measurement',
       'The generation of random binary unordered trees',
       'The intersection graph of paths in trees',
       'Graph minors IV Widths of trees  and well quasi ordering',
       'Graph minors  A survey'],
      dtype='<U67')
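The output above suggests that each digit and punctuation character is replaced by a single space rather than deleted (for example, `trees, and` becomes `trees  and`). The following is a minimal sketch of that behavior using only the standard library. The function name `strip_digits_punctuations` and the constant `PUNCTUATIONS` are hypothetical and are not part of `tm_preprocessor`; the actual implementation may differ.

```python
import string

# Assumed default, copied from the `preprocessor.punctuations` output shown earlier.
PUNCTUATIONS = "\"!@#$%^*(),.:;&=+-_?\\'`[]"


def strip_digits_punctuations(text, punctuations=PUNCTUATIONS):
    """Replace every digit and punctuation character in `text` with a space."""
    chars = string.digits + punctuations
    # Map each unwanted character to a single space, preserving string length.
    table = str.maketrans(chars, " " * len(chars))
    return text.translate(table)


print(repr(strip_digits_punctuations("Graph minors: A survey")))
```

Replacing with a space instead of deleting keeps words apart when punctuation is the only thing separating them.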

##### Tokenize

###### Use `Porter` Stemmer

In [10]:
from nltk.stem.porter import PorterStemmer

In [11]:
stemmer = PorterStemmer()

Set both `min_freq` and `min_length` to 0, so that no token is dropped for being too rare or too short.

In [12]:
preprocessor.tokenize(stemmer, 0, 0)

In [13]:
preprocessor.corpus

array([list(['human', 'machin', 'interfac', 'lab', 'abc', 'comput', 'applic']),
       list(['survey', 'user', 'opinion', 'comput', 'system', 'respons', 'time']),
       list(['ep', 'user', 'interfac', 'manag', 'system']),
       list(['system', 'human', 'system', 'engin', 'test', 'ep']),
       list(['relat', 'user', 'perceiv', 'respons', 'time', 'error', 'measur']),
       list(['gener', 'random', 'binari', 'unord', 'tree']),
       list(['intersect', 'graph', 'path', 'tree']),
       list(['graph', 'minor', 'iv', 'width', 'tree', 'well', 'quasi', 'order']),
       list(['graph', 'minor', 'survey'])], dtype=object)
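The tokenized output can be read as the result of lowercasing, splitting on whitespace, dropping stopwords, stemming, and then filtering by length and frequency. The sketch below illustrates that pipeline under stated assumptions: the function `tokenize_docs` and the small `STOPWORDS` set are hypothetical, the order of the steps is a guess, and frequency is assumed to be counted over the whole corpus. It accepts any object with a `stem` method (such as an NLTK stemmer) and skips stemming when `stemmer` is `None`.

```python
from collections import Counter

# Small illustrative subset, not the full default stopword list.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}


def tokenize_docs(docs, stemmer=None, min_freq=1, min_length=1, stopwords=STOPWORDS):
    """Turn cleaned documents into lists of tokens (a sketch, not the library code)."""
    tokenized = []
    for doc in docs:
        # Lowercase, split on whitespace, and drop stopwords.
        tokens = [tok for tok in doc.lower().split() if tok not in stopwords]
        if stemmer is not None:
            tokens = [stemmer.stem(tok) for tok in tokens]
        tokenized.append(tokens)

    # Corpus-wide token counts, used for the minimum-frequency filter.
    counts = Counter(tok for tokens in tokenized for tok in tokens)
    return [
        [tok for tok in tokens if len(tok) >= min_length and counts[tok] >= min_freq]
        for tokens in tokenized
    ]


docs = ["The intersection graph of paths in trees", "Graph minors  A survey"]
print(tokenize_docs(docs))
```

Raising `min_freq` to 2 on these two documents would keep only `graph`, the one token that occurs in both.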

##### Serialize the corpus

###### Save in `Matrix Market` format

See http://math.nist.gov/MatrixMarket/formats.html

In [14]:
preprocessor.serialize(format_='MmCorpus', path='~/Desktop')
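To make the target format concrete, here is a minimal sketch of how a tokenized corpus could be laid out in Matrix Market coordinate format: a header line, a size line (`rows columns nonzeros`), and then one `document term count` triple per line with 1-based indices. The helper `to_matrix_market` is hypothetical and uses only the standard library; the file actually written by `serialize` may differ in details such as the field type or the term numbering.

```python
from collections import Counter


def to_matrix_market(tokenized_docs):
    """Return (text, vocab): Matrix Market text and the token-to-id mapping."""
    # Assign an integer id to each distinct token, in order of first appearance.
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))

    # One (document, term, count) entry per distinct token in each document.
    entries = []
    for doc_id, tokens in enumerate(tokenized_docs, start=1):
        counts = Counter(vocab[tok] for tok in tokens)
        for term_id in sorted(counts):
            entries.append((doc_id, term_id + 1, counts[term_id]))

    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{len(tokenized_docs)} {len(vocab)} {len(entries)}",
    ]
    lines += [f"{d} {t} {c}" for d, t, c in entries]
    return "\n".join(lines) + "\n", vocab


text, vocab = to_matrix_market([["graph", "minor", "survey"], ["graph", "graph"]])
print(text)
```

Only nonzero counts are stored, which is why this layout suits sparse bag-of-words data.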

---

#### Frequencies of words/tokens

In [15]:
preprocessor.get_word_ranking()

| | word | frequency |
|---|---|---|
| 0 | system | 4 |
| 1 | user | 3 |
| 2 | tree | 3 |
| 3 | graph | 3 |
| 4 | minor | 2 |
| 5 | ep | 2 |
| 6 | time | 2 |
| 7 | respons | 2 |
| 8 | human | 2 |
| 9 | survey | 2 |
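The frequencies above can be cross-checked by counting tokens in the tokenized corpus directly. The sketch below does this with `collections.Counter` instead of the library; the function `word_ranking` is hypothetical and returns a list of `(word, frequency)` pairs rather than a DataFrame. The order of words with equal frequency may differ from the output of `get_word_ranking`.

```python
from collections import Counter

# Tokenized corpus, copied from the `preprocessor.corpus` output after tokenization.
corpus = [
    ["human", "machin", "interfac", "lab", "abc", "comput", "applic"],
    ["survey", "user", "opinion", "comput", "system", "respons", "time"],
    ["ep", "user", "interfac", "manag", "system"],
    ["system", "human", "system", "engin", "test", "ep"],
    ["relat", "user", "perceiv", "respons", "time", "error", "measur"],
    ["gener", "random", "binari", "unord", "tree"],
    ["intersect", "graph", "path", "tree"],
    ["graph", "minor", "iv", "width", "tree", "well", "quasi", "order"],
    ["graph", "minor", "survey"],
]


def word_ranking(tokenized_docs):
    """Return (word, frequency) pairs sorted by descending frequency."""
    counts = Counter(tok for tokens in tokenized_docs for tok in tokens)
    return counts.most_common()


ranking = word_ranking(corpus)
print(ranking[:4])
```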
