# Data Preprocessing. Fundamentals
<h4 style="font-size:14px; font-family:Calibry" align="left"> Natalia Cheilytko </h4>
<img width="50%" height="50%" src="http://i.piccy.info/i9/666d78be04fbcf04fdb321d5953d1fa5/1492256847/123248/1137898/ua_parrots.jpg">

# <hr style="height: 1px; background-color: #808080">Table of Contents

1. Purpose of Data Preprocessing
2. Python Libraries for NLP Tasks: NLTK, TextBlob, Pattern, spaCy
3. Preprocessing Features Overview
4. Further Reading


# <hr style="height: 1px; background-color: #808080"> 1. Purpose of Data Preprocessing

# CRISP-DM - Data Science Project Lifecycle
 <img width="35%" height="35%" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/800px-CRISP-DM_Process_Diagram.png">
### Textual Data Preparation / Preprocessing
The data preparation phase covers all activities needed to construct the final dataset from the initial raw data, i.e. transforming and cleaning the data for modeling purposes.
Textual data requires more thorough preprocessing, both to normalize it for further modeling and to obtain valuable features from the grammatical and semantic peculiarities of the text.

# <hr style="height: 1px; background-color: #808080"> 2. Python Libraries for NLP Tasks


## Natural Language Toolkit
http://www.nltk.org/

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

## TextBlob

https://textblob.readthedocs.io/en/dev/

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

## Pattern

http://www.clips.ua.ac.be/pages/pattern-en

The pattern.en module contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface.

## spaCy

https://spacy.io/

spaCy is a Python library well suited to preparing text for deep learning. It interoperates with TensorFlow, Keras, Scikit-Learn, Gensim and the rest of Python's AI ecosystem, and it is written from the ground up in carefully memory-managed Cython. For text preprocessing, it provides tokenization, syntax-driven sentence segmentation, pre-trained word vectors, part-of-speech tagging, named entity recognition, and labelled dependency parsing.

### See also - Interactive Demo of the NLP Libraries: http://textanalysisonline.com/ 

# <hr style="height: 1px; background-color: #808080"> 3. Preprocessing Features Overview

### Language Identification

In [6]:
#Let's start with the TextBlob library
from textblob import TextBlob
from textblob import Word

In [7]:
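# Let's detect the language of the given sentence.
# Note: detect_language() calls the Google Translate web API; it has been deprecated since TextBlob 0.16 and may be unavailable in newer releases.
text = TextBlob(u"Beauty and the Beast es una película maravillosa, pero no tan bueno como esperaba.")
text.detect_language()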

'es'

### Tokenization

In [8]:
# Let's define tokens for the given extract from a Free Fire (2016) movie review
review = TextBlob(u"Overall, my evaluation is now done and I am rooting on the side of it being a brash and exhilarating minor masterpiece. Yes, it's one- dimensional. Yes, it is virtually impossible to feel any empathy with any of the characters, as they are all universally loathsome. But it's a movie whose flaws are forgivable based on the characterisation and the cracking good script by long-term collaborators Ben Wheatley and Amy Jump. Tight as it is within its 90 minute running time, I doubt you will be bored.")
review.words

WordList(['Overall', 'my', 'evaluation', 'is', 'now', 'done', 'and', 'I', 'am', 'rooting', 'on', 'the', 'side', 'of', 'it', 'being', 'a', 'brash', 'and', 'exhilarating', 'minor', 'masterpiece', 'Yes', 'it', "'s", 'one', 'dimensional', 'Yes', 'it', 'is', 'virtually', 'impossible', 'to', 'feel', 'any', 'empathy', 'with', 'any', 'of', 'the', 'characters', 'as', 'they', 'are', 'all', 'universally', 'loathsome', 'But', 'it', "'s", 'a', 'movie', 'whose', 'flaws', 'are', 'forgivable', 'based', 'on', 'the', 'characterisation', 'and', 'the', 'cracking', 'good', 'script', 'by', 'long-term', 'collaborators', 'Ben', 'Wheatley', 'and', 'Amy', 'Jump', 'Tight', 'as', 'it', 'is', 'within', 'its', '90', 'minute', 'running', 'time', 'I', 'doubt', 'you', 'will', 'be', 'bored'])
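To make the tokenization step less of a black box, here is a minimal sketch of a regex-based word tokenizer. It is not TextBlob's implementation; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names introduced for illustration, and the pattern is only an approximation that happens to reproduce the behaviour seen above (punctuation dropped, `'s` split off, `long-term` kept whole).

```python
import re

# Hypothetical pattern, not taken from TextBlob: a word (optionally hyphenated)
# or a clitic that starts with an apostrophe, such as 's.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|'\w+")


def simple_tokenize(text):
    """Return word-like tokens in order of appearance, dropping punctuation."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Yes, it's one- dimensional."))
print(simple_tokenize("long-term collaborators Ben Wheatley and Amy Jump"))
```

A production tokenizer handles many more cases (URLs, emoticons, abbreviations), which is why the library tokenizers are usually the better choice.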

### Sentence Splitting

In [9]:
# Let's split the review into sentences
review.sentences

[Sentence("Overall, my evaluation is now done and I am rooting on the side of it being a brash and exhilarating minor masterpiece."),
 Sentence("Yes, it's one- dimensional."),
 Sentence("Yes, it is virtually impossible to feel any empathy with any of the characters, as they are all universally loathsome."),
 Sentence("But it's a movie whose flaws are forgivable based on the characterisation and the cracking good script by long-term collaborators Ben Wheatley and Amy Jump."),
 Sentence("Tight as it is within its 90 minute running time, I doubt you will be bored.")]
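For comparison, a naive sentence splitter can be sketched with a single regular expression. This is an illustration under simplifying assumptions, not what TextBlob or NLTK do internally: the hypothetical `simple_sentence_split` breaks after every `.`, `!` or `?` followed by whitespace, so it would wrongly split after abbreviations such as "Dr."

```python
import re


def simple_sentence_split(text):
    """Split text after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "Yes, it's one- dimensional. Yes, it is virtually impossible to feel any empathy."
for sentence_text in simple_sentence_split(sample):
    print(sentence_text)
```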

### Word Normalization: Lemmatization vs. Stemming

In [10]:
%%bash
python3 -m textblob.download_corpora

[nltk_data] Downloading package brown to /home/natalia/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/natalia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/natalia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/natalia/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /home/natalia/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/natalia/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


#### Lemmatization
Normalization of any word token to its lemma, i.e. a word base form: children --> child.

In [11]:
# Let's lemmatize each token in the sentence below.
# Without a POS argument, lemmatize() treats every word as a noun, so verb forms such as 'are' and 'based' stay unchanged.
sentence = TextBlob(u"But it's a movie whose flaws are forgivable based on the characterisation and the cracking good script by long-term collaborators Ben Wheatley and Amy Jump.")
sentence.words.lemmatize()

WordList(['But', 'it', "'s", 'a', 'movie', 'whose', 'flaw', 'are', 'forgivable', 'based', 'on', 'the', 'characterisation', 'and', 'the', 'cracking', 'good', 'script', 'by', 'long-term', 'collaborator', 'Ben', 'Wheatley', 'and', 'Amy', 'Jump'])

#### Stemming
Reducing words to a pseudo-stem, i.e. a truncated form that is not necessarily a valid word: characterisation --> characteris.
There are several stemming algorithms in the NLTK library: Porter Stemmer (the TextBlob default), Snowball Stemmer, Lancaster Stemmer.

Keep the overstemming issue in mind: a stemmer may collapse unrelated words into one stem.
For example, the widely used Porter Stemmer normalizes "universal", "university", and "universe" to "univers".
These words belong to different domains, so treating them as equal will likely reduce the accuracy of many NLP tasks.
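The overstemming effect can be reproduced with a toy suffix stripper. The sketch below is deliberately not the Porter algorithm (which applies several ordered rule phases with extra conditions); `SUFFIXES`, `toy_stem` and `group_by_stem` are hypothetical names, and the suffix list was chosen only to show how unrelated words end up under one stem.

```python
from collections import defaultdict

# Toy suffix list for this illustration only; not the Porter rule set.
SUFFIXES = ("ity", "al", "e")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of words that were reduced to it."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


print(group_by_stem(["universal", "university", "universe", "script"]))
```

Grouping a vocabulary by stem in this way is a quick check for whether a stemmer is too aggressive for a given task.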


In [12]:
#Let's do the stemming for our sentence with the Porter Stemmer.
sentence.words.stem()

WordList(['but', 'it', "'s", 'a', 'movi', 'whose', 'flaw', 'are', 'forgiv', 'base', 'on', 'the', 'characteris', 'and', 'the', 'crack', 'good', 'script', 'by', 'long-term', 'collabor', 'ben', 'wheatley', 'and', 'ami', 'jump'])

#### Demo of various lemmatization and stemming algorithms:
http://textanalysisonline.com/

### Part-of-speech Tagging
Default NLTK and TextBlob POS tags explained:
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [13]:
# Let's get POS tags for our sentence with the default NLTK tagger
sentence.pos_tags

[('But', 'CC'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('movie', 'NN'),
 ('whose', 'WP$'),
 ('flaws', 'NNS'),
 ('are', 'VBP'),
 ('forgivable', 'JJ'),
 ('based', 'VBN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('characterisation', 'NN'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('cracking', 'NN'),
 ('good', 'JJ'),
 ('script', 'NN'),
 ('by', 'IN'),
 ('long-term', 'JJ'),
 ('collaborators', 'NNS'),
 ('Ben', 'NNP'),
 ('Wheatley', 'NNP'),
 ('and', 'CC'),
 ('Amy', 'NNP'),
 ('Jump', 'NNP')]

In [11]:
# Let's try spaCy for POS tagging and lemmatization, since it is trained to handle social media texts
# (this import is the spaCy 1.x API; spaCy 2 and later load a model with spacy.load(...) instead)
from spacy.en import English
parser = English()

In [43]:
# Let's have a sentence from social media as our example
sentence = "lol that is rly funny :) This is gr8 i rate it 8/8!!!"
parsedSentence = parser(sentence)
for token in parsedSentence:
    print(token.orth_, token.pos_, token.lemma_)

lol NOUN lol
that ADJ that
is VERB be
rly ADV rly
funny ADJ funny
:) PUNCT :)
This DET this
is VERB be
gr8 VERB gr8
i PRON i
rate VERB rate
it PRON -PRON-
8/8 NUM 8/8
! PUNCT !
! PUNCT !
! PUNCT !


### An Exercise: 
Compare the spaCy and TextBlob POS tagging for the sentence "lol that is rly funny :) This is gr8 i rate it 8/8!!!"

In [None]:
# Insert your code here

<details>
  <summary>Click to see the answer</summary>
       <pre>
          <code>
            sentence = TextBlob(u"lol that is rly funny :) This is gr8 i rate it 8/8!!!")
            sentence.pos_tags
          </code>
      </pre>
</details>

### Stopword Detection

In [13]:
#Let's access the NLTK stopword list
import nltk
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 ...]

In [14]:
# Let's compute the fraction of tokens in our example sentence that are not in the stopword list
def content_fraction(tokens):
    stopword_set = set(nltk.corpus.stopwords.words('english'))
    content = [w for w in tokens if w.lower() not in stopword_set]
    return len(content) / len(tokens)

sentence = "But it's a movie whose flaws are forgivable based on the characterisation and the cracking good script by long-term collaborators Ben Wheatley and Amy Jump."
# Punctuation tokens such as '.' are not stopwords, so they count as content here
tokens = nltk.word_tokenize(sentence)
content_fraction(tokens)

0.6296296296296297

### Syntactic Parsing
Identification of the syntactic structure of a sentence, usually in terms of either Dependency Grammar or Constituency Grammar.
 <img src="https://upload.wikimedia.org/wikipedia/commons/0/0d/Wearetryingtounderstandthedifference_%282%29.jpg">


In [17]:
# Let's try the TextBlob Syntactic Parser
sentence = TextBlob("Tight as it is within its 90 minute running time, I doubt you will be bored.")
print(sentence.parse())

Tight/JJ/B-ADJP/O as/IN/B-PP/B-PNP it/PRP/B-NP/I-PNP is/VBZ/B-VP/O within/IN/B-PP/B-PNP its/PRP$/B-NP/I-PNP 90/CD/I-NP/I-PNP minute/NN/I-NP/I-PNP running/VBG/B-VP/I-PNP time/NN/B-NP/I-PNP ,/,/O/O I/PRP/B-NP/O doubt/NN/I-NP/O you/PRP/I-NP/O will/MD/B-VP/O be/VB/I-VP/O bored/VBN/I-VP/O ././O/O


See the tags explained: http://www.clips.ua.ac.be/pages/pattern-en#parser
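The parser output above is a flat string in which each token carries slash-separated fields: word, POS tag, chunk tag and PNP (prepositional noun phrase) tag. As a rough sketch, the string can be decoded into records and the `B-`/`I-` chunk tags regrouped into phrases. The names `TaggedToken`, `read_tagged` and `group_chunks` are hypothetical helpers, not part of TextBlob or Pattern.

```python
from collections import namedtuple

# One record per token: word / POS tag / chunk tag / PNP tag.
TaggedToken = namedtuple("TaggedToken", "word pos chunk pnp")


def read_tagged(parsed):
    """Decode 'word/POS/chunk/PNP' items separated by whitespace."""
    return [TaggedToken(*item.rsplit("/", 3)) for item in parsed.split()]


def group_chunks(tokens):
    """Merge B-X / I-X chunk tags into (label, phrase) pairs."""
    chunks, current = [], None
    for tok in tokens:
        if tok.chunk.startswith("B-"):
            current = (tok.chunk[2:], [tok.word])
            chunks.append(current)
        elif tok.chunk.startswith("I-") and current is not None:
            current[1].append(tok.word)
        else:
            current = None  # token outside any chunk ('O')
    return [(label, " ".join(words)) for label, words in chunks]


parsed = ("within/IN/B-PP/B-PNP its/PRP$/B-NP/I-PNP 90/CD/I-NP/I-PNP "
          "minute/NN/I-NP/I-PNP running/VBG/B-VP/I-PNP time/NN/B-NP/I-PNP ,/,/O/O")
print(group_chunks(read_tagged(parsed)))
```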

#### Compare it to the spaCy parser output via displaCy
Dependency tree visualization: https://demos.explosion.ai/displacy/


In [18]:
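# Let's get noun chunks for our sentence via spaCy
# (the 'en' shortcut is the spaCy 1.x/2.x model name; newer releases use full names such as 'en_core_web_sm')
import spacy
nlp = spacy.load('en')
doc = nlp(u'Tight as it is within its 90 minute running time, I doubt you will be bored.')
# For each chunk, print its text, its root token, the root's dependency label, and the root's head
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)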

Tight Tight nsubj running
it it nsubj is
time time dobj running
I I nsubj doubt
you you nsubjpass bored


#### Putting it all together with spaCy
For each token in each sentence, the code below prints the following information:
- The original form
- The lemma
- The coarse-grained part of speech
- The Penn Treebank POS tag (the default tagset in NLTK)
- The syntactic function (dependency label)

In [19]:
# Import and construct the English pipeline
# (spaCy 1.x API; spaCy 2 and later load a model with spacy.load(...) instead)
from spacy.en import English
nlp = English()

# Now, let's assume that we work on a simple corpus, constructed from a list of texts.
corpus = [
    u"Yes, it's one-dimensional.",
    u"Tight as it is within its 90 minute running time, I doubt you will be bored."]

# To parse the corpus, all we need to do is call the pipeline on each text
docs = [nlp(d) for d in corpus]

for doc in docs:
    for sent in doc.sents:
        # Print the sentence itself
        print(sent)
        # For each token, print its original form, lemma, coarse POS, Penn Treebank tag, and dependency label
        for token in sent:
            print(token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_)

Yes, it's one-dimensional.
Yes yes INTJ UH intj
, , PUNCT , punct
it -PRON- PRON PRP nsubj
's be VERB VBZ ROOT
one one NUM CD advmod
- - PUNCT HYPH punct
dimensional dimensional ADJ JJ acomp
. . PUNCT . punct
Tight as it is within its 90 minute running time, I doubt you will be bored.
Tight tight NOUN NN nsubj
as as ADP IN mark
it -PRON- PRON PRP nsubj
is be VERB VBZ advcl
within within ADP IN prep
its -PRON- ADJ PRP$ pobj
90 90 NUM CD nummod
minute minute NOUN NN npadvmod
running run VERB VBG ROOT
time time NOUN NN dobj
, , PUNCT , punct
I -PRON- PRON PRP nsubj
doubt doubt VERB VBP conj
you -PRON- PRON PRP nsubjpass
will will VERB MD aux
be be VERB VB auxpass
bored bore VERB VBN ccomp
. . PUNCT . punct


### Named Entity Recognition
displaCy NER visualization:
https://demos.explosion.ai/displacy-ent/

In [20]:
#Let's get named entities available in our example
review = nlp(u"A cracking 70' soundtrack, put together by the Portishead duo of Geoff Barrow and Ben Salisbury, involves 70's classics by Credence Clearwater Revival, John Denver and The Real Kids and it's hammered out at top volume over the action. The downside of this effect is that - for my old ears at least - it sometimes make some of the dialogue hard to follow.")
for ent in review.ents:
    print(ent.label_, ent.text)

CARDINAL 70
PERSON Geoff Barrow
PERSON Ben Salisbury
DATE 70
ORG Credence Clearwater Revival
PERSON John Denver
WORK_OF_ART The Real Kids
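Entities are often easier to use as features once they are grouped by label. The following sketch works on the `(label, text)` pairs printed above, copied here as plain tuples so that it runs without spaCy; `group_entities` is a hypothetical helper name.

```python
from collections import defaultdict

# (label, text) pairs copied from the output above.
entities = [
    ("CARDINAL", "70"),
    ("PERSON", "Geoff Barrow"),
    ("PERSON", "Ben Salisbury"),
    ("DATE", "70"),
    ("ORG", "Credence Clearwater Revival"),
    ("PERSON", "John Denver"),
    ("WORK_OF_ART", "The Real Kids"),
]


def group_entities(pairs):
    """Collect entity texts under their label, preserving order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    return dict(grouped)


for label, texts in group_entities(entities).items():
    print(label, texts)
```

Note that the same surface string ("70") received two different labels, which is worth checking before such counts are used as features.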


### Semantic Word Relations: WordNet

Try it online: http://wordnetweb.princeton.edu/perl/webwn

In [24]:
from nltk.corpus import wordnet

# Let's try WordNet via TextBlob
from textblob.wordnet import VERB
from textblob.wordnet import NOUN

In [25]:
# First, let's see the synsets for a given word
word = Word("funny")
word.synsets

[Synset('funny_story.n.01'),
 Synset('amusing.s.02'),
 Synset('curious.s.01'),
 Synset('fishy.s.02'),
 Synset('funny.s.04')]

In [26]:
#Then let's explore meanings of the word
word.definitions

['an account of an amusing incident (usually with a punch line)',
 'arousing or provoking laughter',
 'beyond or deviating from the usual or expected',
 'not as expected',
 'experiencing odd bodily sensations']

In [27]:
# Let's narrow the synset search to a certain part of speech
word = Word("order")
word.get_synsets(pos=VERB)
#Try other option: pos=NOUN

[Synset('order.v.01'),
 Synset('order.v.02'),
 Synset('order.v.03'),
 Synset('regulate.v.02'),
 Synset('order.v.05'),
 Synset('order.v.06'),
 Synset('ordain.v.02'),
 Synset('arrange.v.07'),
 Synset('rate.v.01')]

In [28]:
# Let's get the parent (hypernym) of a given word sense
word = wordnet.synset('order.v.01')
word.hypernyms()

[Synset('request.v.02')]

In [29]:
# Let's find the children (hyponyms) of the same word sense
word.hyponyms()

[Synset('call.v.05'),
 Synset('command.v.02'),
 Synset('direct.v.01'),
 Synset('instruct.v.02'),
 Synset('warn.v.03')]
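Hypernyms and hyponyms are two directions of the same IS-A relation. The toy graph below is a sketch built only from the two outputs above (it is not the WordNet API, and real synsets may have further relations); `HYPERNYMS`, `hypernyms` and `hyponyms` are hypothetical names.

```python
# Child synset -> list of parent synsets, taken from the outputs above.
HYPERNYMS = {
    "order.v.01": ["request.v.02"],
    "call.v.05": ["order.v.01"],
    "command.v.02": ["order.v.01"],
    "direct.v.01": ["order.v.01"],
    "instruct.v.02": ["order.v.01"],
    "warn.v.03": ["order.v.01"],
}


def hypernyms(synset):
    """Return the more general synsets (parents) recorded for a synset."""
    return HYPERNYMS.get(synset, [])


def hyponyms(synset):
    """Return the more specific synsets (children), found by inverting the map."""
    return sorted(child for child, parents in HYPERNYMS.items() if synset in parents)


print(hypernyms("order.v.01"))
print(hyponyms("order.v.01"))
```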

### Sentiment WordNet
SentiWordNet by Esuli, Sebastiani - sentiment scores for 145k WordNet synonym sets.
http://sentiwordnet.isti.cnr.it/

In [30]:
from nltk.corpus import sentiwordnet as swn

# For a given word sense, let's look up its assigned polarity scores
word = swn.senti_synset('breakdown.n.03')
print(word)


<breakdown.n.03: PosScore=0.0 NegScore=0.25>


In [62]:
# See the word's negative score
word.neg_score()
# Try the other option: word.pos_score()

0.25
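SentiWordNet assigns each synset three scores (positive, negative, objective) that sum to 1, so the objectivity score follows from the other two. Here is a minimal sketch of that arithmetic using the scores printed above; `objectivity` is a hypothetical helper name (NLTK exposes the same value through `obj_score()`).

```python
def objectivity(pos_score, neg_score):
    """Objectivity is what remains after positive and negative mass."""
    return 1.0 - (pos_score + neg_score)


# Scores for breakdown.n.03 as printed above: PosScore=0.0, NegScore=0.25
print(objectivity(0.0, 0.25))
```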

In [63]:
# List all sentiment synsets available for a given word
list(swn.senti_synsets('slow'))

[SentiSynset('decelerate.v.01'),
 SentiSynset('slow.v.02'),
 SentiSynset('slow.v.03'),
 SentiSynset('slow.a.01'),
 SentiSynset('slow.a.02'),
 SentiSynset('dense.s.04'),
 SentiSynset('slow.a.04'),
 SentiSynset('boring.s.01'),
 SentiSynset('dull.s.08'),
 SentiSynset('slowly.r.01'),
 SentiSynset('behind.r.03')]

# 4. Further Reading

1. Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. Steven Bird, Ewan Klein, and Edward Loper (http://www.nltk.org/book/)
2. TextBlob: Simplified Text Processing (https://textblob.readthedocs.io/en/dev/index.html)
3. spaCy Documentation (https://spacy.io/docs/usage/)
4. Intro to NLP with spaCy (https://nicschrading.com/project/Intro-to-NLP-with-spaCy/)