# LICENSE NOTES
---
This material was developed by [Prof. Wladmir Cardoso Brandão](http://www.wladmirbrandao.com) and is distributed under the [Creative Commons Attribution-NonCommercial License (CC BY-NC)](https://creativecommons.org/licenses/by-nc/4.0/), which means that anyone can share and adapt the material for non-commercial purposes, provided that the author is credited; adaptations do not have to be distributed under the same terms.

# Text Processing
---

Several problems must be addressed to process text, such as symbol segmentation and removal, and term transformation. In this notebook, we present approaches to the most common of these problems. [Documents](https://en.wikipedia.org/wiki/Document) are information items that represent thoughts. Text documents are usually composed of sentences, i.e., logically linked sequences of [words (or terms)](https://en.wikipedia.org/wiki/Word) and punctuation characters.

In [0]:
try:
    import nltk
except ImportError:
    # Install NLTK on first use (notebook shell command), then retry the import.
    !pip install -U nltk
    import nltk
import re

In [0]:
from nltk import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
text = "Men, plans, an astonishing canal - Panama."

## Tokenization
[Tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of demarcating sections (tokens) of a string. In particular, a tokenizer is a [text segmentation](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation) approach that divides a string into its component tokens.

In [0]:
tokens = word_tokenize(text)
tokens

['Men', ',', 'plans', ',', 'an', 'astonishing', 'canal', '-', 'Panama', '.']
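
To make the idea of segmentation concrete, the following is a minimal sketch of a regular-expression tokenizer written with the standard library only. It is not the algorithm behind `word_tokenize`; the function name `simple_tokenize` and the pattern are assumptions chosen for illustration, and the pattern ignores cases such as contractions and abbreviations.

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules):
# a run of word characters, or a single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split a string into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "Men, plans, an astonishing canal - Panama."
print(simple_tokenize(sample))
# ['Men', ',', 'plans', ',', 'an', 'astonishing', 'canal', '-', 'Panama', '.']
```

For this particular sentence the sketch produces the same tokens as the cell above, but the two approaches can diverge on other inputs.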

## Normalization
[Token Normalization](https://nlp.stanford.edu/IR-book/html/htmledition/normalization-equivalence-classing-of-terms-1.html) is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. For instance, the tokens *"USA"* and *"U.S.A"* should both be canonicalized to *"usa"*. Token normalization includes lowercasing, accent and diacritic removal, and the canonicalization of acronyms, currencies, dates and hyphenated words.

In [0]:
tokens = [token.lower() for token in tokens]
tokens

['men', ',', 'plans', ',', 'an', 'astonishing', 'canal', '-', 'panama', '.']
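
The cell above only lowercases tokens. As a hedged sketch of two other normalization steps mentioned in the text, diacritic removal and acronym canonicalization, the helper below uses only the standard library. The name `normalize_token` and the acronym heuristic are assumptions made for illustration; the heuristic would also rewrite a token such as `"a."`, so it is not a general solution.

```python
import re
import unicodedata

# Heuristic (assumption): single letters separated by periods form an acronym.
ACRONYM_PATTERN = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]?")

def normalize_token(token):
    """Lowercase a token after stripping diacritics and acronym periods."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse dotted acronyms such as "U.S.A" into "USA".
    if ACRONYM_PATTERN.fullmatch(stripped):
        stripped = stripped.replace(".", "")
    return stripped.lower()

print([normalize_token(t) for t in ["USA", "U.S.A", "Café", "."]])
# ['usa', 'usa', 'cafe', '.']
```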

## Filtering
Token Filtering is the process of removing tokens that carry little useful information, such as punctuation and special characters.

In [0]:
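# Remove punctuation characters that have no digit on either side.
regex = r"(?<!\d)[.,;:-](?!\d)"
tokens = [re.sub(regex, "", token) for token in tokens]
# Drop the empty strings left behind, keeping a list that can be displayed and reused.
tokens = [token for token in tokens if token]
tokens

['men', 'plans', 'an', 'astonishing', 'canal', 'panama']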
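
The lookbehind `(?<!\d)` and lookahead `(?!\d)` in the pattern above keep punctuation that touches a digit, so numeric tokens survive filtering. The sketch below illustrates this on a few hand-picked tokens; the function name `strip_punctuation` and the sample tokens are assumptions for illustration.

```python
import re

# Same pattern as in the cell above: punctuation with no digit on either side.
PUNCTUATION = re.compile(r"(?<!\d)[.,;:-](?!\d)")

def strip_punctuation(tokens):
    """Remove matching punctuation and drop tokens that become empty."""
    cleaned = (PUNCTUATION.sub("", token) for token in tokens)
    return [token for token in cleaned if token]

print(strip_punctuation(["3.14", ",", "end.", "10-20", "-"]))
# ['3.14', 'end', '10-20']
```

Note that a period directly after a digit, as in `"5."`, is also kept, because the lookbehind rejects any match that is preceded by a digit.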

## Lemmatization

---

[Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) is the process of removing inflectional endings to return the base or dictionary form of a word, the *lemma*, by using a vocabulary and a morphological analysis of words.

In [0]:
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token) for token in tokens]
tokens

['men', 'plan', 'an', 'astonishing', 'canal', 'panama']
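
To illustrate what "a vocabulary and morphological analysis" can mean in practice, here is a toy sketch of a dictionary-backed lemmatizer. It is not the WordNet lemmatizer used above: the vocabulary, the irregular-form table, the suffix rules and the name `toy_lemmatize` are all hypothetical, and its results can differ from the cell above (for example, it maps *"men"* to *"man"*).

```python
# Hypothetical resources, invented for this example only.
IRREGULAR_FORMS = {"men": "man", "geese": "goose"}
VOCABULARY = {"plan", "canal", "man", "goose", "study"}
# (suffix to detach, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]

def toy_lemmatize(word):
    """Return a lemma using an exception table, then validated suffix rules."""
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the candidate only if the vocabulary knows it.
            if candidate in VOCABULARY:
                return candidate
    return word

print([toy_lemmatize(w) for w in ["men", "plans", "studies", "astonishing"]])
# ['man', 'plan', 'study', 'astonishing']
```

The vocabulary check is what separates this from plain suffix stripping: a candidate is returned only when it is a known dictionary form.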

## Stemming
[Stemming](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) is the process of removing derivational affixes to return the root form of a word, the *stem*. The most popular stemming approaches perform [suffix stripping](https://dl.acm.org/citation.cfm?id=275537.275705) on words.

In [0]:
from nltk.stem.porter import PorterStemmer

In [0]:
stemmer = PorterStemmer()
tokens = [stemmer.stem(token) for token in tokens]
tokens

['men', 'plan', 'an', 'astonish', 'canal', 'panama']
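
Suffix stripping can be sketched in a few lines. The function below is a deliberately naive illustration, not the Porter algorithm used above, which applies several ordered rule phases with additional conditions. The suffix list, the minimum stem length and the name `toy_stem` are assumptions.

```python
# Illustrative suffix list (an assumption), checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["astonishing", "plans", "canal", "bus"]])
# ['astonish', 'plan', 'canal', 'bus']
```

Unlike the toy lemmatizer idea of checking a vocabulary, nothing here validates the result, so a stem does not have to be a real word.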

## Stop-Words
[Stop-words](https://en.wikipedia.org/wiki/Stop_words) are extremely common words that are of little value for expressing the topic of a document. The [general strategy for determining a stop list](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) is to sort the terms by corpus frequency (the total number of times each term appears in the corpus), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents.

In [0]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
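# Build the stop-word set once so each membership test is a fast set lookup.
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]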
tokens

['men', 'plan', 'astonish', 'canal', 'panama']
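
The frequency-based strategy described above can be sketched as follows: count every term across the corpus and keep the most frequent ones as stop-list candidates. The function name `build_stop_list` and the three-document corpus are hypothetical, and the hand-filtering step is left out.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` most frequent terms across tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return [term for term, _ in counts.most_common(size)]

# Tiny invented corpus: each document is a list of tokens.
corpus = [
    ["the", "canal", "of", "panama"],
    ["the", "men", "of", "the", "canal"],
    ["the", "plan"],
]
print(build_stop_list(corpus, 1))
# ['the']
```

In a real setting the candidate list would then be reviewed by hand, since frequent terms can still be meaningful within a specific domain.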

## Tagging

Tagging, also known as [part-of-speech tagging (POST)](https://en.wikipedia.org/wiki/Part-of-speech_tagging), is the process of marking up a word in a text as belonging to a particular category of words that have similar grammatical properties, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article, or determiner. The most [popular taggers](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python) use supervised learning approaches trained on well-known corpora to classify words into grammatical classes.

In [0]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [0]:
tagged = nltk.pos_tag(tokens)
tagged[0:100]

[('men', 'NNS'),
 ('plan', 'VBP'),
 ('astonish', 'JJ'),
 ('canal', 'JJ'),
 ('panama', 'NN')]
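
As a rough illustration of supervised tagging, the sketch below trains a most-frequent-tag baseline: each word receives the tag it was seen with most often in the training data. This is far simpler than the averaged perceptron tagger used above. The training sentences, the default tag and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_tokens(tokens, model, default_tag="NN"):
    """Tag tokens with the learned tag, falling back to a default for unseen words."""
    return [(token, model.get(token.lower(), default_tag)) for token in tokens]

# Tiny invented training set using Penn Treebank style tags.
training = [
    [("men", "NNS"), ("plan", "VBP"), ("canals", "NNS")],
    [("the", "DT"), ("plan", "NN"), ("works", "VBZ")],
    [("they", "PRP"), ("plan", "VBP"), ("trips", "NNS")],
]
model = train_unigram_tagger(training)
print(tag_tokens(["men", "plan", "panama"], model))
# [('men', 'NNS'), ('plan', 'VBP'), ('panama', 'NN')]
```

A baseline like this ignores context entirely, which is the main thing that stronger taggers add.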

## Chunking

Chunking, also known as [shallow parsing](https://en.wikipedia.org/wiki/Shallow_parsing), analyzes a sentence on top of its part-of-speech tags and adds structure by grouping words into chunks, i.e., higher-order units that have discrete grammatical meanings, such as noun groups or phrases and verb groups. Popular chunkers also use supervised learning classifiers to link words and produce chunks.

In [0]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [0]:
entities = nltk.chunk.ne_chunk(tagged)
print(entities)

(S men/NNS plan/VBP astonish/JJ canal/JJ panama/NN)
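
The cell above uses a named-entity chunker, which finds no entities in this lowercased, stemmed input. To show grouping by part-of-speech tags more directly, here is a small rule-based sketch that collects maximal runs of determiners, adjectives and nouns into noun-phrase chunks. The rule, the `"NP"` label and the name `chunk_noun_phrases` are assumptions for illustration, not NLTK's chunker.

```python
# Tags that may appear inside a noun phrase besides nouns (an assumption).
NP_MODIFIER_TAGS = {"DT", "JJ"}

def chunk_noun_phrases(tagged_tokens):
    """Group runs of DT/JJ/NN* tokens that contain a noun into ('NP', [...]) chunks."""
    result, run = [], []

    def flush():
        # Emit the pending run as a chunk only if it contains a noun.
        if not run:
            return
        if any(tag.startswith("NN") for _, tag in run):
            result.append(("NP", list(run)))
        else:
            result.extend(run)
        run.clear()

    for word, tag in tagged_tokens:
        if tag in NP_MODIFIER_TAGS or tag.startswith("NN"):
            run.append((word, tag))
        else:
            flush()
            result.append((word, tag))
    flush()
    return result

tagged_example = [("men", "NNS"), ("plan", "VBP"), ("astonish", "JJ"),
                  ("canal", "JJ"), ("panama", "NN")]
for item in chunk_noun_phrases(tagged_example):
    print(item)
```

On the tagged tokens from the previous section, this groups *men* into one noun phrase and *astonish canal panama* into another, leaving the verb *plan* outside any chunk.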
