This repository has been archived by the owner on Jan 17, 2019. It is now read-only.

Module: Text cleaners

liangi edited this page Nov 4, 2016 · 1 revision

OnTrack cleans text in the following way:

  1. The text is converted to lowercase.

Rehabilitation/Reconstruction/Removal of Gravel on Bulacan, Road. North km+1993-km+384
rehabilitation/reconstruction/removal of gravel on bulacan, road. north km+1993-km+384

  2. Punctuation marks are replaced by spaces.

rehabilitation/reconstruction/removal of gravel on bulacan, road. north km+1993-km+384
rehabilitation reconstruction removal of gravel on bulacan road north km 1993 km 384

  3. Stopwords are removed.

rehabilitation reconstruction removal of gravel on bulacan road north km 1993 km 384
rehabilitation reconstruction removal gravel bulacan road north km 1993 km 384

  4. Words are stemmed.

rehabilitation reconstruction removal gravel bulacan road north km 1993 km 384
rehabilit reconstruct remov gravel bulacan road north km 1993 km 384

All of these steps are applied automatically by the function clean_text(text).

clean_text(text, stringify = True)

query = "Rehabilitation/Reconstruction/Removal of Gravel on Bulacan, Road. North km+1993-km+384"
query = clean_text(query)
print(query)

>>> 'rehabilit reconstruct remov gravel bulacan road north km 1993 km 384'

Selective cleaning

To skip steps, call only the functions you need from the following:

clean_punctuation(text)

Converts text to lowercase and keeps only alphanumeric characters. The other functions rely on this step to work properly, so run it first.
Note: this also removes characters that may be useful, such as "ñ".
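As a rough illustration of this step (not the OnTrack implementation), the sketch below lowercases a string and replaces every run of characters outside a-z and 0-9 with a single space. The name clean_punctuation_sketch and the regular expression are assumptions made for this example; the ASCII-only pattern was chosen to mimic the "ñ" behaviour noted above.

```python
import re

# Hypothetical stand-in for clean_punctuation; the real OnTrack code may differ.
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def clean_punctuation_sketch(text):
    """Lowercase text and replace each non-alphanumeric run with one space."""
    lowered = text.lower()
    # Punctuation, symbols, and accented letters all fall outside a-z / 0-9.
    spaced = _NON_ALNUM.sub(" ", lowered)
    return spaced.strip()


print(clean_punctuation_sketch("Road. North km+1993-km+384"))
# road north km 1993 km 384

# "\u00f1" is the letter "ñ"; it is dropped, which splits the word in two.
print(clean_punctuation_sketch("Para\u00f1aque Bridge"))
# para aque bridge
```

The second call shows why the note matters: place names containing "ñ" are split into fragments that may no longer match anything.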

remove_stop(text)

Removes stopwords according to the NLTK stopwords corpus. Stopwords are frequently occurring words that add little context, such as "a", "the", and "on". See the Wikipedia article on stop words for background.

Full list of words removed:

i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don should now d ll m o re ve y ain aren couldn didn doesn hadn hasn haven isn ma mightn mustn needn shan shouldn wasn weren won wouldn
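A minimal sketch of stopword filtering follows, assuming the input is already lowercased and stripped of punctuation. It is not the OnTrack code: remove_stop_sketch and STOPWORDS_SUBSET are hypothetical names, and the set holds only a handful of words taken from the list above rather than the full corpus.

```python
# A small subset of the stopword list above, for illustration only.
STOPWORDS_SUBSET = {"a", "an", "the", "of", "on", "in", "and", "to"}


def remove_stop_sketch(text):
    """Drop whitespace-separated tokens that appear in STOPWORDS_SUBSET."""
    kept = [word for word in text.split() if word not in STOPWORDS_SUBSET]
    return " ".join(kept)


cleaned = ("rehabilitation reconstruction removal of gravel on bulacan "
           "road north km 1993 km 384")
print(remove_stop_sketch(cleaned))
# rehabilitation reconstruction removal gravel bulacan road north km 1993 km 384
```

The comparison is exact and case-sensitive, which is one reason the lowercase and punctuation step has to come first: "On" or "of," would not match an entry in the set.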

stem(text)

Stems each word using the Porter stemming algorithm. Stemming strips common suffixes and word endings so that "Rehabilitate", "Rehabilitation", and "Rehabilitating" are recognized as the same term.
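To see the effect on the three words mentioned here, the sketch below runs them through the PorterStemmer class from NLTK. It assumes the nltk package is installed; stem_sketch is a hypothetical helper written for this example, and OnTrack's own stem function may be implemented differently.

```python
from nltk.stem import PorterStemmer  # requires the nltk package

stemmer = PorterStemmer()


def stem_sketch(text):
    """Stem each whitespace-separated word with NLTK's Porter stemmer."""
    return " ".join(stemmer.stem(word) for word in text.split())


print(stem_sketch("rehabilitate rehabilitation rehabilitating"))
# rehabilit rehabilit rehabilit
```

All three forms collapse to the same stem, so a query containing one of them can match text containing any of the others. Note that the stem itself ("rehabilit") is not necessarily a dictionary word.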

Example

To skip stemming, you can write:

query = remove_stop(clean_punctuation("Your query here"))