# Text Processing

## *I made her duck.*

* I cooked (roasted) a duck for her.
* I cooked the duck that belongs to her.
* I made a duck for her (e.g. as origami).
* I made her duck (crouch down).
* I turned her into a duck.

## How do we make sense of it?

*Lemmatization (cf. stemming)*
* made -> make

*Word sense disambiguation*
* make ?= cook, create, force, ...

*Part-of-speech tagging*
* duck as a verb or a noun?

*Coreference (anaphora) resolution*
* her -> ?

## Other NLP tasks

* Text segmentation
  * into tokens
  * into sentences (sentence boundary disambiguation)
  * into topics
* Automatic text summarization
* Sentiment analysis
* Machine translation
* (Syntactic) sentence parsing
* Discourse analysis, question answering
* Speech recognition and segmentation
  * Speech-to-text, text-to-speech

## We are mainly interested in extracting features from text

Text segmentation plus identification of keywords or of words (tokens) that frequently occur together

Measuring the similarity of two text documents
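One simple way to compare two documents is the cosine similarity of their word-count vectors. The sketch below is only an illustration under simple assumptions (lowercasing and whitespace tokenization); the function name `cosine_similarity` is made up for this example and is not part of NLTK or Gensim.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two documents represented as word-count vectors."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Dot product over the words that both documents share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


same_words = cosine_similarity("graph minors survey", "survey graph minors")
no_overlap = cosine_similarity("graph minors survey", "human computer interaction")
```

Documents with the same words in a different order score (up to rounding) 1, and documents with no shared words score 0, because word order is ignored.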

## Available tools in Python

NLTK (http://www.nltk.org/)

Gensim (https://radimrehurek.com/gensim/tutorial.html)

### Other tools (outside Python)

Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/; an interface is also available via NLTK)

Apache OpenNLP (https://opennlp.apache.org/)

WordNet (https://wordnet.princeton.edu/; interface via NLTK)

# NLTK

http://www.nltk.org/book/

In [1]:
import nltk

In [2]:
text = """At eight o'clock on Thursday morning 
Arthur didn't feel very good. He closed his eyes and went to bed again."""

## Tokenization

In [3]:
sentences = nltk.sent_tokenize(text)

In [4]:
sentences

["At eight o'clock on Thursday morning \nArthur didn't feel very good.",
 'He closed his eyes and went to bed again.']

In [5]:
tokens = [nltk.word_tokenize(sent) for sent in sentences]

In [6]:
tokens

[['At',
  'eight',
  "o'clock",
  'on',
  'Thursday',
  'morning',
  'Arthur',
  'did',
  "n't",
  'feel',
  'very',
  'good',
  '.'],
 ['He', 'closed', 'his', 'eyes', 'and', 'went', 'to', 'bed', 'again', '.']]
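For comparison, here is a minimal regex tokenizer that needs no NLTK data. The pattern is an assumption chosen for this illustration, not what `nltk.word_tokenize` does internally: it keeps contractions such as `didn't` as a single token, whereas the NLTK output above splits them into `did` and `n't`.

```python
import re

# Hypothetical pattern: a word with an optional apostrophe suffix,
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


simple_tokens = simple_tokenize("Arthur didn't feel very good.")
# simple_tokens == ['Arthur', "didn't", 'feel', 'very', 'good', '.']
```

Which convention is preferable depends on the later steps: splitting off `n't` makes the negation visible to a tagger, while keeping the contraction whole gives tokens closer to the surface text.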

## Stemming

In [7]:
porter = nltk.PorterStemmer()

In [8]:
[[porter.stem(token) for token in sent] for sent in tokens]

[['At',
  'eight',
  "o'clock",
  'on',
  'thursday',
  'morn',
  'arthur',
  'did',
  "n't",
  'feel',
  'veri',
  'good',
  '.'],
 ['He', 'close', 'hi', 'eye', 'and', 'went', 'to', 'bed', 'again', '.']]

## Lemmatization

In [9]:
wnl = nltk.WordNetLemmatizer()

In [10]:
[[wnl.lemmatize(token) for token in sent] for sent in tokens]

[['At',
  'eight',
  "o'clock",
  'on',
  'Thursday',
  'morning',
  'Arthur',
  'did',
  "n't",
  'feel',
  'very',
  'good',
  '.'],
 ['He', 'closed', 'his', 'eye', 'and', 'went', 'to', 'bed', 'again', '.']]

Note that `lemmatize` treats every token as a noun unless told otherwise, which is why `eyes` becomes `eye` while the verb forms `closed` and `went` stay unchanged. Passing `pos='v'` lemmatizes a token as a verb.

## Part-of-Speech Tagging

In [11]:
tagged = [nltk.pos_tag(sent) for sent in tokens]
tagged

[[('At', 'IN'),
  ('eight', 'CD'),
  ("o'clock", 'NN'),
  ('on', 'IN'),
  ('Thursday', 'NNP'),
  ('morning', 'NN'),
  ('Arthur', 'NNP'),
  ('did', 'VBD'),
  ("n't", 'RB'),
  ('feel', 'VB'),
  ('very', 'RB'),
  ('good', 'JJ'),
  ('.', '.')],
 [('He', 'PRP'),
  ('closed', 'VBD'),
  ('his', 'PRP$'),
  ('eyes', 'NNS'),
  ('and', 'CC'),
  ('went', 'VBD'),
  ('to', 'TO'),
  ('bed', 'VB'),
  ('again', 'RB'),
  ('.', '.')]]

In [12]:
nltk.help.upenn_tagset('IN')

IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...


In [13]:
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


## Named entities

In [14]:
entities = nltk.chunk.ne_chunk(tagged[0])

In [15]:
print(repr(entities))

Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), Tree('PERSON', [('Arthur', 'NNP')]), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')])


## N-grams

In [16]:
text2 = nltk.word_tokenize("He is very old. He is very bold.")

In [17]:
bigrams = list(nltk.bigrams(text2))
bigrams

[('He', 'is'),
 ('is', 'very'),
 ('very', 'old'),
 ('old', '.'),
 ('.', 'He'),
 ('He', 'is'),
 ('is', 'very'),
 ('very', 'bold'),
 ('bold', '.')]

In [18]:
nltk.FreqDist(bigrams)

FreqDist({('.', 'He'): 1,
          ('He', 'is'): 2,
          ('bold', '.'): 1,
          ('is', 'very'): 2,
          ('old', '.'): 1,
          ('very', 'bold'): 1,
          ('very', 'old'): 1})

In [19]:
list(nltk.trigrams(text2))

[('He', 'is', 'very'),
 ('is', 'very', 'old'),
 ('very', 'old', '.'),
 ('old', '.', 'He'),
 ('.', 'He', 'is'),
 ('He', 'is', 'very'),
 ('is', 'very', 'bold'),
 ('very', 'bold', '.')]
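Bigrams and trigrams are special cases of n-grams, i.e. runs of n consecutive tokens. The following is a minimal sketch without NLTK; the helper name `ngrams` is chosen here for illustration (NLTK offers its own general version in `nltk.util.ngrams`), and `collections.Counter` plays the role of `FreqDist`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # Zipping the token list with shifted copies of itself yields the n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))


words = ["He", "is", "very", "old", ".", "He", "is", "very", "bold", "."]

bigram_counts = Counter(ngrams(words, 2))
trigram_list = ngrams(words, 3)

# Word pairs that occur together more than once in the text.
repeated_pairs = {pair for pair, count in bigram_counts.items() if count > 1}
```

Counting n-grams like this is the simplest way to find tokens that frequently occur together.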

## WordNet

* A lexical database
* Contains synsets (sets of synonyms)
  * Nouns, verbs, adjectives, adverbs
* Links between synsets
  * Antonyms, hypernyms, hyponyms, holonyms, meronyms


In [20]:
from nltk.corpus import wordnet as wn

In [21]:
wn.synsets('car')

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

In [22]:
car = wn.synset('car.n.01')

In [23]:
car.lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [24]:
car.definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [25]:
car.examples()

['he needs a car to get to work']

In [26]:
car.hyponyms()

[Synset('ambulance.n.01'),
 Synset('beach_wagon.n.01'),
 Synset('bus.n.04'),
 Synset('cab.n.03'),
 Synset('compact.n.03'),
 Synset('convertible.n.01'),
 Synset('coupe.n.01'),
 Synset('cruiser.n.01'),
 Synset('electric.n.01'),
 Synset('gas_guzzler.n.01'),
 Synset('hardtop.n.01'),
 Synset('hatchback.n.01'),
 Synset('horseless_carriage.n.01'),
 Synset('hot_rod.n.01'),
 Synset('jeep.n.01'),
 Synset('limousine.n.01'),
 Synset('loaner.n.02'),
 Synset('minicar.n.01'),
 Synset('minivan.n.01'),
 Synset('model_t.n.01'),
 Synset('pace_car.n.01'),
 Synset('racer.n.02'),
 Synset('roadster.n.01'),
 Synset('sedan.n.01'),
 Synset('sport_utility.n.01'),
 Synset('sports_car.n.01'),
 Synset('stanley_steamer.n.01'),
 Synset('stock_car.n.01'),
 Synset('subcompact.n.01'),
 Synset('touring_car.n.01'),
 Synset('used-car.n.01')]

In [27]:
car.hypernyms()

[Synset('motor_vehicle.n.01')]

In [28]:
car.part_meronyms()

[Synset('accelerator.n.01'),
 Synset('air_bag.n.01'),
 Synset('auto_accessory.n.01'),
 Synset('automobile_engine.n.01'),
 Synset('automobile_horn.n.01'),
 Synset('buffer.n.06'),
 Synset('bumper.n.02'),
 Synset('car_door.n.01'),
 Synset('car_mirror.n.01'),
 Synset('car_seat.n.01'),
 Synset('car_window.n.01'),
 Synset('fender.n.01'),
 Synset('first_gear.n.01'),
 Synset('floorboard.n.02'),
 Synset('gasoline_engine.n.01'),
 Synset('glove_compartment.n.01'),
 Synset('grille.n.02'),
 Synset('high_gear.n.01'),
 Synset('hood.n.09'),
 Synset('luggage_compartment.n.01'),
 Synset('rear_window.n.01'),
 Synset('reverse.n.02'),
 Synset('roof.n.02'),
 Synset('running_board.n.01'),
 Synset('stabilizer_bar.n.01'),
 Synset('sunroof.n.01'),
 Synset('tail_fin.n.02'),
 Synset('third_gear.n.01'),
 Synset('window.n.02')]

In [29]:
wn.synsets('black')[0].lemmas()[0].antonyms()

[Lemma('white.n.02.white')]

# Text representation

A text document is usually represented as a bag of words, i.e. a multiset of its words with word order discarded, which is stored as a vector. Each component of the vector corresponds to one word or n-gram from the dictionary (built for the whole corpus or language). The value of a component can be:

* count
* frequency
* weighted frequency

Words that occur very frequently in a language (conjunctions and the like) are called *stop words* and are usually removed during preprocessing.
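The representation described above can be sketched in a few lines of plain Python. The vocabulary, the stop list, and the function name `bag_of_words` are all assumptions made for this example; a real dictionary would be built from the whole corpus.

```python
from collections import Counter

# Illustrative stop list only; real stop lists are much longer.
STOP_WORDS = {"for", "a", "of", "the", "and", "to", "in"}


def bag_of_words(document, vocabulary):
    """Return (count vector, relative-frequency vector) aligned with `vocabulary`."""
    words = [w for w in document.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    count_vector = [counts[term] for term in vocabulary]
    total = len(words)
    freq_vector = [c / total if total else 0.0 for c in count_vector]
    return count_vector, freq_vector


vocabulary = ["graph", "minors", "survey", "trees"]  # hypothetical dictionary
count_vec, freq_vec = bag_of_words(
    "The survey of graph minors and graph trees", vocabulary
)
# count_vec == [2, 1, 1, 1]
```

Each position in the vector belongs to one dictionary word, so two documents can be compared component by component regardless of their length or word order.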

## TF-IDF

Term frequency × inverse document frequency

`TF` – the frequency of the word in the current document

`IDF` – the negative logarithm of the probability that the word occurs in a document, estimated as the fraction of corpus documents that contain the word (the same value for all documents)

Various variants (weighting schemes): https://en.wikipedia.org/wiki/Tf%E2%80%93idf

![TF](img/tf.png)

![IDF](img/idf.png)

![TF-IDF](img/tfidf.png)
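
In symbols, one common weighting scheme can be written as follows. This is an assumed variant for illustration and the figures above may use a different one. Here $f_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$:

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = -\log\frac{n_t}{N} = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

For example, a word that occurs in 2 of 9 documents has $\mathrm{idf} = \ln(9/2) \approx 1.50$ with the natural logarithm, while a word that occurs in every document has $\mathrm{idf} = 0$ and so contributes nothing.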

## Gensim

A library for topic modeling of documents.

Implements TF-IDF, LSA, pLSA, LDA, HDP, DTM, word2vec

https://radimrehurek.com/gensim/tutorial.html

In [30]:
from gensim import corpora, models, similarities



In [31]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

In [32]:
documents

['Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees and well quasi ordering',
 'Graph minors A survey']

Removing stop words

In [33]:
stoplist = set('for a of the and to in'.split())

In [34]:
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

In [35]:
texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

Removing words that occur only once in the corpus

In [36]:
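from collections import defaultdict

# Count how often each token occurs in the whole corpus.
frequency = defaultdict(int)
for doc_tokens in texts:
    for token in doc_tokens:
        frequency[token] += 1

# Keep only the tokens that occur more than once.
texts = [[token for token in doc_tokens if frequency[token] > 1] for doc_tokens in texts]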

In [37]:
dictionary = corpora.Dictionary(texts)

In [38]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [39]:
new_doc = "Human computer interaction"

In [40]:
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec

[(0, 1), (1, 1)]

In [41]:
corpus = [dictionary.doc2bow(text) for text in texts]

Training the TF-IDF model

In [42]:
tfidf = models.TfidfModel(corpus)

In [43]:
tfidf[new_vec]

[(0, 0.7071067811865476), (1, 0.7071067811865476)]
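
The value 0.7071 can be reproduced by hand. The sketch below assumes Gensim's default weighting, as I understand it: raw term count times log2(N/df), followed by L2 normalization. The list `filtered_texts` is the corpus above after stop-word and single-occurrence filtering, and `tfidf_vector` is a hypothetical helper, not a Gensim function.

```python
import math

# The nine documents after removing stop words and words that occur only once.
filtered_texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]


def tfidf_vector(doc_tokens, corpus_texts):
    """TF-IDF weights of one document, normalized to unit length."""
    n_docs = len(corpus_texts)
    weights = {}
    for term in set(doc_tokens):
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for text in corpus_texts if term in text)
        if df == 0:
            continue  # terms missing from the dictionary are dropped
        tf = doc_tokens.count(term)
        weights[term] = tf * math.log2(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


new_weights = tfidf_vector("human computer interaction".split(), filtered_texts)
```

Both `human` and `computer` occur in 2 of the 9 documents, so they get the same weight, and after normalization each equals 1/√2 ≈ 0.7071. The word `interaction` is not in the dictionary and is ignored.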

Other models: LSI, LDA, ...

## word2vec

Each word gets a learned vector of real numbers that encodes various properties of the word and captures a number of linguistic regularities. The similarity of two words can then be computed as the similarity of their vectors.

vector('Paris') - vector('France') + vector('Italy') ~= vector('Rome')

vector('king') - vector('man') + vector('woman') ~= vector('queen')
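
The arithmetic behind such analogies can be illustrated with a toy example. The two-dimensional vectors below are made up by hand (axes loosely meaning "royalty" and "maleness") and are not output of any trained model; real word2vec vectors are learned from a corpus and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = tuple(
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    )
    # The query words themselves are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", toy_vectors)
# result == 'queen'
```

With a trained Gensim model, the same kind of query is typically expressed through the `most_similar` method of the word vectors, with `positive` and `negative` word lists.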

https://radimrehurek.com/gensim/models/word2vec.html