# Text processing

### Róbert Móro, Jakub Ševcech

IAU, 7.11.2019

## Please, give us your feedback here: https://tinyurl.com/iau2019-w07

## You can ask us directly or at http://slido.com#iau2019-w07

## Revision of the last lecture

# Text processing

## Natural Language Processing, NLP

## Today, we will look at...

what we understand by natural language processing

what the typical NLP tasks are

how to transform text into a vector

how to extract features from text

what resources there are (also for the Slovak language)

## The goal of NLP is to teach machines to understand human language (spoken/written)

<img src="img/bender.jpg" alt="Does not compute" style="margin-left: auto; margin-right: auto; width: 600px"/>

## The ability to hold a dialog with a human as a test of intelligence - the Turing test (1950)

## What does it mean to understand text?

1. **Morphological level** - the way the words are composed
2. **Syntactic level** - the way the words compose a sentence
3. **Semantic level** - semantics of words/sentences
4. **Pragmatic level** - the meaning in the context of a given situation or common sense / general knowledge

## Povedal mu: "Mier!" (He told him: "Peace!" or He told him: "Take aim!")

Mier (in Slovak) -> peace (noun) or to take aim (verb)?

told -> tell

Who told it to whom? 

## Humans do not make it easy for the machines (or other people for that matter)...

[Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.](https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo)

*Buffalo from the city of Buffalo, that is buffaloed (intimidated) by a buffalo from Buffalo, also buffaloes a buffalo from Buffalo.*

Si ton tonton tond ton tonton, ton tonton sera tondu par ton tonton.

*If your uncle shaves your (other) uncle, your uncle will be shaved by your uncle.*

## NLP tasks

* Text segmentation
  - into tokens
  - into sentences (sentence boundary disambiguation)
  - into topics
* Information extraction
  - e.g., named entity recognition
* Automatic text summarization
* Sentiment analysis

## NLP tasks

* Machine translation
* (Syntactic) text parsing
* Discourse analysis, question answering
* Speech recognition and segmentation
* Speech-to-text, text-to-speech
* Text generation

## We are primarily interested in feature extraction from text

In order to classify texts, identify clusters of similar documents, etc.

## Example: We want to identify who the author is

Edgar Allan Poe vs. Mary Shelley vs. H. P. Lovecraft: https://www.kaggle.com/c/spooky-author-identification

**What features could we extract from the sentences?**

## Goto: Question in Slido

## Potentially useful features

* Sentence length
* The number of words in the sentence
* Average word length
* Text readability metrics, e.g., Flesch-Kincaid
* The number of conjunctions/prepositions/other word categories (parts of speech)
* **Frequency of the used words - transformation of a sentence (text) into a vector representation**
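A few of the simpler features listed above can be sketched with the standard library alone. The function name `sentence_features` and the regular expression used to find words are assumptions made for this illustration, not part of the lecture code.

```python
import re

def sentence_features(sentence):
    """Compute a few simple surface features of one sentence."""
    # Words are runs of letters, optionally with an inner apostrophe (e.g. "didn't")
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        "n_chars": len(sentence),      # sentence length in characters
        "n_words": n_words,            # number of words
        "avg_word_len": avg_word_len,  # average word length
    }

features = sentence_features("He closed his eyes and went to bed again.")
print(features)
```

Each sentence then becomes a small numeric vector that can be combined with word-frequency features.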

## In general

Text segmentation 

Transformation of text into a vector representation (using *bag of words*)

Identification of keywords or frequently co-occurring words (tokens)

Similarity of two textual documents

## Tools for text processing in Python

[NLTK](http://www.nltk.org/)

[Gensim](https://radimrehurek.com/gensim/tutorial.html)

[sklearn.feature_extraction.text](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

### Other tools

[Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) - interface available in NLTK

[Apache OpenNLP](https://opennlp.apache.org/)

[WordNet](https://wordnet.princeton.edu/) - interface available in NLTK

## Feature extraction can be done with other inputs as well

Images (sklearn.feature_extraction.image, [scikit-image](https://scikit-image.org/))

Videos ([scikit-video](http://www.scikit-video.org/stable/))

Signal, e.g., sound ([scipy.signal](https://docs.scipy.org/doc/scipy/reference/signal.html), [scikit-sound](http://work.thaslwanter.at/sksound/html/))

## Text processing methods

Regular expressions, finite-state automata, context-free grammars

Rule-based, dictionary/thesaurus-based approaches

Machine learning approaches (Markov models, **deep neural networks**)

A very good overview: [Dan Jurafsky, James H. Martin: Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) (currently, the 3rd edition is in progress)

## Beware: Most of the methods are language-dependent

There are many models available for English, German, Spanish, ...

## Far fewer for the Slovak language

But there is no need to despair...

[Text@FIIT STU](http://text.fiit.stuba.sk/)

[NLP4SK](http://arl6.library.sk/nlp4sk/)

[Slovenský národný korpus](https://korpus.sk/)

[word2vec](https://github.com/essential-data/word2vec-sk)

and [others...](https://github.com/essential-data/nlp-sk-interesting-links)

Recommended contacts at FIIT STU: Assoc. Prof. Marián Šimko, Assoc. Prof. Peter Lacko, Dr. Miroslav Blšták, or Assoc. Prof. Michal Kompan

## Text representation

A textual document is usually represented as a *bag of words* (a multiset of words that ignores word order) = **vector**.

Vector elements represent individual words or n-grams from the dictionary (for the whole corpus/language).

The value of the vector elements can be:

* occurrence (binary)
* count
* frequency
* weighted frequency
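The first three kinds of values can be illustrated on a toy example. The vocabulary and the tokenized document below are hypothetical; weighted frequency (e.g., TF-IDF) needs corpus statistics and is therefore left out of this sketch.

```python
from collections import Counter

vocabulary = ["bed", "closed", "eyes", "good", "went"]  # hypothetical toy dictionary
tokens = ["closed", "eyes", "went", "bed", "bed"]        # hypothetical tokenized document

counts = Counter(tokens)

# One vector element per dictionary word, in dictionary order
count_vec = [counts[w] for w in vocabulary]          # count
binary_vec = [1 if c > 0 else 0 for c in count_vec]  # occurrence (binary)
total = sum(count_vec)
freq_vec = [c / total for c in count_vec]            # frequency (relative count)

print(binary_vec)
print(count_vec)
print(freq_vec)
```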

## Transformation of text into vector

1. Tokenization (segmentation of text into sentences and words)

2. Text normalization
   - to lowercase
   - stemming or lemmatization
   - stopword removal (conjunctions, prepositions, etc.)

3. Construction of a dictionary

4. Construction of a vector - vector elements correspond to words from the dictionary; often very sparse (i.e., a lot of zeros)
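Steps 3 and 4 can be sketched in plain Python, assuming the documents have already been tokenized and normalized. The names `word_to_id` and `to_sparse_vector` are made up for this illustration; the sparse format of (word id, count) pairs only mimics what dictionary-based libraries typically produce.

```python
docs = [
    ["arthur", "closed", "eyes"],
    ["arthur", "went", "bed", "bed"],
]  # hypothetical documents, already tokenized and normalized

# Step 3: dictionary mapping each distinct word to an integer id
# (ids are assigned in order of first appearance)
word_to_id = {}
for doc in docs:
    for token in doc:
        word_to_id.setdefault(token, len(word_to_id))

# Step 4: sparse vector of (word id, count) pairs; zero entries are not stored
def to_sparse_vector(doc):
    counts = {}
    for token in doc:
        idx = word_to_id[token]
        counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())

sparse_vectors = [to_sparse_vector(doc) for doc in docs]
print(word_to_id)
print(sparse_vectors)
```

Storing only the non-zero entries matters in practice, because a real dictionary has tens of thousands of words while a single document uses only a small fraction of them.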

## Tokenization

In [1]:
import nltk

text = """At eight o'clock on Thursday morning 
... Arthur didn't feel very good. He closed his eyes and went to bed again."""

In [2]:
sentences = nltk.sent_tokenize(text)
print(sentences)

["At eight o'clock on Thursday morning \n... Arthur didn't feel very good.", 'He closed his eyes and went to bed again.']


In [3]:
sent = sentences[0]

tokens = nltk.word_tokenize(sent)
print(tokens)

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
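For comparison, a rough tokenizer can be sketched with a regular expression from the standard library alone. The pattern and the name `simple_tokenize` are assumptions for this illustration. Unlike the NLTK output above, this sketch keeps a contraction such as "didn't" as one token, and it treats every digit as a separate token.

```python
import re

# Words (optionally with an inner apostrophe), runs of two or more dots,
# or any single character that is neither whitespace nor a letter
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\.{2,}|[^\sA-Za-z]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

simple_tokens = simple_tokenize("Arthur didn't feel very good.")
print(simple_tokens)
```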


## Normalization

In [4]:
tokens = [token.lower() for token in tokens if token not in {".", ",", "?", "!", "..."}]
print(tokens)

['at', 'eight', "o'clock", 'on', 'thursday', 'morning', 'arthur', 'did', "n't", 'feel', 'very', 'good']


## Stemming or lemmatization?

Stemming returns the stem of a word, e.g., *working -> work*

Lemmatization transforms the words into their dictionary form. For example *mice -> mouse*

**We always use one or the other.** Stemming is used for less inflected languages (such as English). Lemmatization is preferred for inflected languages (such as Slovak).

**Stemming** - for English, e.g., [Porter's algorithm (1980)](https://www.cs.odu.edu/~jbollen/IR04/readings/readings5.pdf)

**Lemmatization** - mostly dictionary-based methods (morphological database); for Slovak: https://korpus.sk/morphology_database.html
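The difference between the two approaches can be made concrete with a toy sketch. The suffix list and the lemma dictionary below are hypothetical and tiny; this is neither Porter's algorithm nor a real morphological database, only an illustration of rule-based stripping versus dictionary lookup.

```python
# Hypothetical toy resources, for illustration only
SUFFIXES = ["ing", "ed", "s"]
LEMMA_DICT = {"mice": "mouse", "went": "go", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ["working", "mice", "went"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer handles regular suffixes without any dictionary but cannot map *mice* to *mouse*; the lemmatizer handles irregular forms but only for words it knows.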

In [5]:
porter = nltk.PorterStemmer()

stemmed = [porter.stem(token) for token in tokens]
print(stemmed)

['at', 'eight', "o'clock", 'on', 'thursday', 'morn', 'arthur', 'did', "n't", 'feel', 'veri', 'good']


In [6]:
wnl = nltk.WordNetLemmatizer()

lemmatized = [wnl.lemmatize(token) for token in tokens]
print(lemmatized)

['at', 'eight', "o'clock", 'on', 'thursday', 'morning', 'arthur', 'did', "n't", 'feel', 'very', 'good']


### Stopwords removal

In [7]:
stopwords = nltk.corpus.stopwords.words('english')

normalized_tokens = [token for token in stemmed if token not in stopwords]
print(normalized_tokens)

['eight', "o'clock", 'thursday', 'morn', 'arthur', "n't", 'feel', 'veri', 'good']


## Transformation to vector representation

We will work with the [20 newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset:

*"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."*

In [8]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [9]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [10]:
len(twenty_train.data)

2257

In [11]:
print("\n".join(twenty_train.data[0].split("\n")[:10]))

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.



In [12]:
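def preprocess_text(text):
    # Tokenize, lowercase, and keep only alphabetic tokens that are not stopwords
    tokens = nltk.word_tokenize(text)
    stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes membership tests fast
    return [token.lower() for token in tokens if token.isalpha() and token.lower() not in stopwords]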

In [13]:
tokenized_docs = [preprocess_text(text) for text in twenty_train.data]

In [14]:
print(tokenized_docs[0])

['michael', 'collier', 'subject', 'converting', 'images', 'hp', 'laserjet', 'iii', 'hampton', 'organization', 'city', 'university', 'lines', 'anyone', 'know', 'good', 'way', 'standard', 'pc', 'utility', 'convert', 'files', 'laserjet', 'iii', 'format', 'would', 'also', 'like', 'converting', 'hpgl', 'hp', 'plotter', 'files', 'please', 'email', 'response', 'correct', 'group', 'thanks', 'advance', 'michael', 'michael', 'collier', 'programmer', 'computer', 'unit', 'email', 'city', 'university', 'tel', 'london', 'fax']


In [15]:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
print(corpus[10])



[(1, 3), (2, 1), (8, 1), (20, 2), (22, 1), (23, 1), (26, 2), (33, 1), (39, 2), (59, 2), (60, 1), (61, 1), (62, 2), (64, 1), (65, 2), (78, 1), (86, 1), (99, 2), (103, 1), (110, 2), (128, 1), (135, 1), (155, 1), (158, 1), (160, 1), (162, 1), (187, 1), (200, 1), (205, 1), (208, 2), (213, 1), (220, 2), (224, 2), (236, 2), (239, 1), (253, 1), (256, 1), (258, 1), (261, 1), (270, 1), (273, 4), (277, 1), (278, 3), (290, 2), (306, 1), (308, 1), (310, 1), (314, 2), (317, 1), (318, 1), (319, 4), (328, 1), (329, 2), (335, 1), (337, 3), (338, 5), (340, 1), (361, 1), (371, 1), (379, 2), (410, 4), (431, 1), (433, 1), (443, 1), (458, 1), (479, 5), (509, 1), (511, 1), (513, 3), (534, 1), (569, 2), (632, 1), (635, 1), (666, 1), (670, 1), (703, 1), (718, 1), (720, 1), (750, 1), (766, 2), (771, 1), (776, 3), (777, 1), (787, 1), (823, 1), (824, 1), (825, 1), (826, 1), (827, 1), (828, 1), (829, 1), (830, 1), (831, 1), (832, 1), (833, 1), (834, 2), (835, 1), (836, 1), (837, 1), (838, 1), (839, 2), (840, 3), 

## TF-IDF = term frequency * inverse document frequency

`TF` – frequency of a word in the document

$$ tf(t,d)=\frac{f_{t,d}}{\sum_{t' \in d}{f_{t',d}}} $$

`IDF` – negative logarithm of the probability that the word occurs in a document of the corpus, where $N = |D|$ is the number of documents (the value depends only on the word and the corpus, so it is the same for every document)

$$ idf(t,D) = -\log{\frac{|\{d \in D: t \in d\}|}{N}} = \log{\frac{N}{|\{d \in D: t \in d\}|}} $$

Various variants (weighting schemes): https://en.wikipedia.org/wiki/Tf%E2%80%93idf
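The two formulas can be followed step by step on a tiny hypothetical corpus. This is a plain-Python sketch of the basic variant only; library implementations may use other weighting schemes and normalization, so their numbers will generally differ.

```python
import math
from collections import Counter

docs = [
    ["graphics", "image", "image"],
    ["god", "love"],
    ["image", "god"],
]  # hypothetical tokenized corpus

N = len(docs)
# Document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tf(term, doc):
    """Relative frequency of the term in one document."""
    return doc.count(term) / len(doc)

def idf(term):
    """Inverse document frequency; the term must occur in the corpus."""
    return math.log(N / df[term])

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("image", docs[0]))     # tf = 2/3, idf = log(3/2)
print(tfidf("graphics", docs[0]))  # tf = 1/3, idf = log(3/1)
```

Although *image* is more frequent in the first document, *graphics* gets the higher weight because it occurs in only one document of the corpus.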

In [16]:
tfidf_model = models.TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]
tfidf_corpus[10][:10]

[(1, 0.04549138234160459),
 (2, 0.018081179652764727),
 (8, 0.03190242457871024),
 (20, 0.024646895851439338),
 (22, 0.013140784831683568),
 (23, 4.5731503853650784e-05),
 (26, 0.0013149841529262584),
 (39, 0.03626102811288039),
 (59, 0.05617330821375869),
 (60, 0.027680946075067114)]

## Similarity of vectors

Similarity using Euclidean distance

$$ sim(u,v) = 1- d(u,v) = 1 - \sqrt{\sum_{i=1}^{n}{(v_i-u_i)^2}} $$

Cosine similarity

$$ sim(u,v) = cos(u,v) = \frac{u \cdot v}{||u||\,||v||} =\frac{\sum_{i=1}^{n}{u_iv_i}}{\sqrt{\sum_{i=1}^{n}{u_i^2}}\sqrt{\sum_{i=1}^{n}{v_i^2}}} $$
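Both measures can be sketched in plain Python. Cosine similarity divides the dot product by the product of the Euclidean norms of the two vectors, so it depends only on their direction and not on their length. The vectors below are made up for the illustration.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice the length
w = [0.0, 0.0, 3.0]  # orthogonal to u

print(euclidean_distance(u, v))  # non-zero, although u and v point the same way
print(cosine_similarity(u, v))   # approximately 1
print(cosine_similarity(u, w))   # 0 for orthogonal vectors
```

This is one reason cosine similarity is popular for text: a long and a short document about the same topic have similar directions but very different lengths.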

In [17]:
index = similarities.MatrixSimilarity(tfidf_corpus)
index[tfidf_corpus[0]]

array([1.        , 0.00336782, 0.01390726, ..., 0.00478842, 0.00904795,
       0.00284771], dtype=float32)

## Feature extraction using scikit-learn

http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [19]:
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35482)

In [20]:
print(count_vect.vocabulary_.get(u'algorithm'))

4683


In [21]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Let us train a classifier

In [22]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [23]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


## Organization and automation of preprocessing: Pipelines

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html

In [24]:
from sklearn.pipeline import Pipeline

text_ppl = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())
                    ])

In [25]:
text_ppl.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [26]:
[twenty_train.target_names[cat] for cat in text_ppl.predict(docs_new)]

['soc.religion.christian', 'comp.graphics']

## Our own transformer

In [27]:
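from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator adds get_params/set_params, needed when a pipeline is cloned or tuned (e.g., in grid search)
class MyTransformer(BaseEstimator, TransformerMixin):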
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        return X
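The transformer above passes the data through unchanged. As a hedged sketch of what a non-trivial `transform` might look like, the hypothetical class below maps each document to two numeric features. To keep the example self-contained it does not import scikit-learn and only follows the same `fit`/`transform` interface; for use in a real `Pipeline` it would derive from the scikit-learn base classes instead of defining `fit_transform` itself.

```python
class TextLengthTransformer:
    """Hypothetical transformer: maps each document to [n_chars, n_words]."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fitting is a no-op
        return self

    def transform(self, X):
        # Words are approximated by whitespace-separated chunks
        return [[len(doc), len(doc.split())] for doc in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

length_features = TextLengthTransformer().fit_transform(
    ["God is love", "OpenGL on the GPU is fast"]
)
print(length_features)
```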

## Other common tasks of text (pre)processing

## Part-of-Speech Tagging (POS)

Word category, singular/plural, tense, or other grammatical categories 

In [28]:
tagged = nltk.pos_tag(nltk.word_tokenize(sent))
print(tagged)

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('...', ':'), ('Arthur', 'NNP'), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]


In [29]:
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


## Named entity recognition

People, organizations, places, etc.

In [30]:
entities = nltk.chunk.ne_chunk(tagged)

In [31]:
print(repr(entities))

Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('...', ':'), Tree('PERSON', [('Arthur', 'NNP')]), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')])


## N-grams

In general, an n-gram is a sequence of $N$ consecutive items. In the case of text, the items are usually words.

bigrams

trigrams

skipgrams - $k$-skip-$n$-grams

https://books.google.com/ngrams

In [32]:
tokens = nltk.word_tokenize(sent)
bigrams = list(nltk.bigrams(tokens))
print(bigrams[:5])

[('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning')]


N-grams can also be used as features in the `CountVectorizer` transformer.

In [33]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')

['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
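The skip-grams mentioned earlier in this section allow gaps between the paired words. A minimal sketch for the bigram case ($n = 2$) follows; the function name `skip_bigrams` is an assumption, and the general $k$-skip-$n$-gram case is not covered here.

```python
def skip_bigrams(tokens, k):
    """Return all ordered pairs of tokens with at most k tokens skipped between them."""
    pairs = []
    for i in range(len(tokens)):
        # The partner may be up to k + 1 positions to the right
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

words = ["bi", "grams", "are", "cool"]
print(skip_bigrams(words, 0))  # k = 0 gives ordinary bigrams
print(skip_bigrams(words, 1))  # k = 1 adds pairs that skip one token
```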

## WordNet

* Lexical database
* Organized into so-called synsets (sets of synonyms)
  * Nouns, verbs, adjectives, adverbs
* Relationships between synsets
  * Antonyms, hypernyms, hyponyms, holonyms, meronyms

In [34]:
from nltk.corpus import wordnet as wn

In [35]:
print(wn.synsets('car'))

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]


In [36]:
car = wn.synset('car.n.01')

In [37]:
car.lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [38]:
car.definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [39]:
car.examples()

['he needs a car to get to work']

In [40]:
print(car.hyponyms()[:5])

[Synset('ambulance.n.01'), Synset('beach_wagon.n.01'), Synset('bus.n.04'), Synset('cab.n.03'), Synset('compact.n.03')]


In [41]:
car.hypernyms()

[Synset('motor_vehicle.n.01')]

In [42]:
print(car.part_meronyms()[:5])

[Synset('accelerator.n.01'), Synset('air_bag.n.01'), Synset('auto_accessory.n.01'), Synset('automobile_engine.n.01'), Synset('automobile_horn.n.01')]


In [43]:
wn.synsets('black')[0].lemmas()[0].antonyms()

[Lemma('white.n.02.white')]

## Other useful dictionaries

ConceptNet: http://conceptnet.io/

Sentiment and emotions (affect): [WordNet-Affect](http://wndomains.fbk.eu/wnaffect.html), [SenticNet](https://sentic.net/), [EmoSenticNet](https://www.gelbukh.com/emosenticnet/)

## Vector representation of words - word2vec

Each word is associated with a trained real-valued vector that captures its attributes and reflects some linguistic regularities. We can compute the similarity of two words as the similarity of their vectors.

<img src="img/word2vec.png" alt="Training word2vec" style="margin-left: auto; margin-right: auto; width: 600px"/>

Image source: https://skymind.ai/wiki/word2vec

vector('Paris') - vector('France') + vector('Italy') ~= vector('Rome')

vector('king') - vector('man') + vector('woman') ~= vector('queen')

https://radimrehurek.com/gensim/models/word2vec.html

https://medium.com/@mishra.thedeepak/word2vec-in-minutes-gensim-nlp-python-6940f4e00980
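The vector arithmetic behind the analogies above can be tried on hand-made toy vectors. The two dimensions (roughly "royalty" and "gender") and all the values are invented for this sketch; real word2vec vectors are learned from a corpus and have many more dimensions.

```python
import math

# Hand-made 2-D toy vectors (royalty, gender); not trained embeddings
vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "prince": [1.0, 0.8],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))
```

The query words themselves are excluded from the candidates, otherwise one of them would often be returned as its own nearest neighbor.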

In [44]:
from nltk.corpus import brown

sentences = brown.sents()
#model = models.Word2Vec(sentences, min_count=1)
#model.save('brown_model')
model = models.Word2Vec.load('brown_model')

In [45]:
print(model.wv.most_similar("mother"))

[('father', 0.9837818145751953), ('husband', 0.9667314291000366), ('wife', 0.948153018951416), ('friend', 0.9334405064582825), ('son', 0.9283027648925781), ('nickname', 0.9258568286895752), ('eagle', 0.9163479804992676), ('addiction', 0.9127334356307983), ('voice', 0.9063712358474731), ('patient', 0.9019753932952881)]


In [46]:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

cereal


# Other language models

[fastText](https://fasttext.cc/), [ELMo](https://allennlp.org/elmo), [BERT](https://github.com/google-research/bert), [GloVe](https://nlp.stanford.edu/projects/glove/): a basic comparison can be found [here](https://www.quora.com/What-are-the-main-differences-between-the-word-embeddings-of-ELMo-BERT-Word2vec-and-GloVe)

[sentence embeddings](https://github.com/oborchers/Fast_Sentence_Embeddings)

[doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html)

...and others

## The main takeaways of today's lecture

* pay attention to the reusability, automation, and modularity of preprocessing - you can use a `Pipeline` for this
* when working with text, we are primarily (in this course) interested in **feature extraction - transformation of text into a vector representation**
* many text processing methods are language-dependent, but fortunately, there are some ready-made solutions for the Slovak language as well

### Next lecture

* dimensionality reduction - how we can project a high-dimensional vector (representing, e.g., text) into a lower-dimensional space
* training, validation, and testing of machine learning models

## Sources

[Dan Jurafsky, James H. Martin: Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) (work-in-progress 3rd edition)

http://www.nltk.org/book/

https://radimrehurek.com/gensim/tutorial.html

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
