# NLP Homework: NLTK - Spacy

An exploration of two *libraries* commonly used for Natural Language Processing: **NLTK** and **Spacy**.

## Author

Stefanus Gusega Gunawan - 13518149

## Importing and initializing packages

In [120]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import nltk
import spacy

# Load the small pretrained English model provided by Spacy
nlp = spacy.load('en_core_web_sm')

## Sentence Splitter

Sentence splitting decides whether a punctuation mark (a delimiter such as “.”, “?”, or “!”) marks the end of a sentence (EOS) or not. In NLTK, this is implemented by `sent_tokenize`.
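To make the EOS decision concrete, here is a minimal sketch using only the standard library. It is not how `sent_tokenize` works internally (NLTK relies on the pretrained Punkt model); the function name `naive_sent_split` and the small `ABBREVIATIONS` set are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, hand-picked abbreviations whose final "." is not treated as EOS
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.k.", "e.g.", "i.e."}

def naive_sent_split(text):
    """Split text at ".", "?" or "!" unless the mark ends a known abbreviation."""
    sentences = []
    start = 0
    # A candidate EOS is a run of delimiters followed by whitespace or end of text
    for match in re.finditer(r"[.?!]+(?=\s|$)", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            # The delimiter belongs to an abbreviation, so keep scanning
            continue
        sentences.append(candidate)
        start = match.end()
    # Keep any trailing text that has no closing delimiter
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(naive_sent_split("Dr. Smith lives in the U.K. now. Is that right? Yes!"))
```

A rule like this breaks as soon as an abbreviation really does end a sentence, which is one reason to prefer a trained splitter such as the ones used in the cells below.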

In [121]:
# Example text that will be split into individual sentences
sentences = 'The same words in a different order can mean something completely different. God is great! What can you do? I won a lottery!          '

### NLTK

In [122]:
# NLTK's sent_tokenize function returns a list of sentences
sent_tokenize(sentences)

['The same words in a different order can mean something completely different.',
 'God is great!',
 'What can you do?',
 'I won a lottery!']

### Spacy

In [123]:
# Apply the pretrained Spacy model to the text
doc = nlp(sentences)
# Print each sentence and check that every EOS was detected correctly
for sent in doc.sents:
    print(sent)

The same words in a different order can mean something completely different.
God is great!
What can you do?
I won a lottery!          


## Tokenization

Tokenization splits a sentence into tokens.

In [124]:
sentence = 'Tokenization is the U.K. process of segmenting a string of characters into words.'

### NLTK

In [125]:
# Use NLTK's word_tokenize to segment the sentence into words and punctuation marks
word_tokenize(sentence)

['Tokenization',
 'is',
 'the',
 'U.K.',
 'process',
 'of',
 'segmenting',
 'a',
 'string',
 'of',
 'characters',
 'into',
 'words',
 '.']

### Spacy

In [126]:
# Apply the pretrained model
doc = nlp(sentence)
# Show the resulting tokens
[token.text for token in doc]

['Tokenization',
 'is',
 'the',
 'U.K.',
 'process',
 'of',
 'segmenting',
 'a',
 'string',
 'of',
 'characters',
 'into',
 'words',
 '.']
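Both libraries keep `U.K.` as a single token while splitting the final period off `words`. As a rough illustration of why that takes more than splitting on whitespace, the sketch below uses a hand-written regular expression. The pattern and the name `naive_word_tokenize` are assumptions made for this example and are far simpler than the actual rules in NLTK or Spacy.

```python
import re

# Hypothetical pattern, tried left to right:
#   1. dotted abbreviations such as "U.K."
#   2. runs of word characters
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")

def naive_word_tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

tokens = naive_word_tokenize(
    'Tokenization is the U.K. process of segmenting a string of characters into words.'
)
print(tokens)
```

This sketch reproduces the token list above for this one sentence only; contractions such as `It's` or prices such as `Rp15,000` would be split differently from the library tokenizers.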

## Stemming

Stemming here means removing suffixes, prefixes, and/or infixes from a word, leaving the *stemmed word*.

In [127]:
words_to_stem = [
    'program',
    'programs',
    'programming',
    'programmer',
    'programmers',
    'write',
    'writes',
    'written',
    'wrote'
]

### NLTK

In [128]:
# Initialize the Porter algorithm, which stems by suffix stripping
nltk_stemmer = PorterStemmer()

for word in words_to_stem:
    print(nltk_stemmer.stem(word))

program
program
program
programm
programm
write
write
written
wrote


In [129]:
# Initialize another stemming algorithm, the Snowball stemmer
nltk_stemmer = SnowballStemmer('english')

for word in words_to_stem:
    print(nltk_stemmer.stem(word))

program
program
program
programm
programm
write
write
written
wrote
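Both stemmers leave the irregular forms `written` and `wrote` untouched, which is what suffix stripping would lead one to expect. The toy stemmer below is a minimal sketch of that idea. It is not the Porter or Snowball algorithm, and the suffix list and the name `naive_stem` are hypothetical.

```python
# Hypothetical suffix list, checked in order so that "ers" wins over "er" and "s"
SUFFIXES = ("ers", "er", "ing", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    # No suffix matched, so irregular forms such as "wrote" are returned unchanged
    return word

for word in ["programs", "programmers", "writes", "written", "wrote"]:
    print(word, "->", naive_stem(word))
```

Unlike this sketch, the real algorithms apply several ordered rule phases (for example, removing a doubled consonant after stripping `ing`), which is why they return `program` for `programming`.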


### Spacy

Unfortunately, Spacy does not provide a *stemming* feature, so NLTK is the better choice for *stemming*.

## Lemmatization

*Stemming* can return a base form that is not an actual dictionary word (such as `programm` above); this is where *lemmatization* is needed. The base form returned by *lemmatization* is a word defined in the dictionary.
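Conceptually, lemmatization is a dictionary lookup keyed by the word and its part of speech. The sketch below illustrates this with a tiny hand-written table. `LEMMA_TABLE` and `lookup_lemma` are hypothetical names, and the entries are examples only, not data taken from WordNet or Spacy.

```python
# Hypothetical mini-dictionary: (surface form, part of speech) -> lemma
# POS codes: "n" = noun, "v" = verb, "a" = adjective, "r" = adverb
LEMMA_TABLE = {
    ("wrote", "v"): "write",
    ("written", "v"): "write",
    ("knives", "n"): "knife",
    ("better", "a"): "good",
    ("better", "r"): "well",
}

def lookup_lemma(word, pos="n"):
    """Return the dictionary lemma, or the word itself when there is no entry."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(lookup_lemma("wrote", "v"))   # looked up as a verb
print(lookup_lemma("wrote"))        # looked up as a noun (the default): no entry
print(lookup_lemma("better", "a"))  # adjective reading
print(lookup_lemma("better", "r"))  # adverb reading
```

The same surface form can map to different lemmas depending on the part of speech, which is why the part of speech matters in the cells that follow.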

In [130]:
verbs_to_lemmatize = [
    'program',
    'programs',
    'programming',
    'write',
    'writes',
    'written',
    'wrote',
]

nouns_to_lemmatize = [
    'rocks',
    'corpus',
    'ashes',
    'knives'
]

adjectives_to_lemmatize = [
    'good',
    'better',
    'best',
]

### NLTK

In [131]:
nltk_lemmatizer = WordNetLemmatizer()

# Lemmatize the words as verbs (pos='v')
for word in verbs_to_lemmatize:
    print(nltk_lemmatizer.lemmatize(word, 'v'))

program
program
program
write
write
write
write


In [132]:
# Lemmatize the words as nouns, the default part of speech
for word in nouns_to_lemmatize:
    print(nltk_lemmatizer.lemmatize(word))

rock
corpus
ash
knife


In [133]:
# Lemmatize the words as adjectives (pos='a')
for word in adjectives_to_lemmatize:
    print(nltk_lemmatizer.lemmatize(word, 'a'))

good
good
best


### Spacy

In [134]:
# Parse the sentence with the pretrained model
doc = nlp('I was reading the newspaper')
# Show the lemma of each token in the parsed sentence
[word.lemma_ for word in doc]

['-PRON-', 'be', 'read', 'the', 'newspaper']

In [135]:
# Iterate over each noun in the list
for noun in nouns_to_lemmatize:
    # Parse it with the pretrained model
    doc = nlp(noun)
    # Print the resulting lemma
    print([word.lemma_ for word in doc])

['rock']
['corpus']
['ashe']
['knife']


In [136]:
for verb in verbs_to_lemmatize:
    doc = nlp(verb)
    print([word.lemma_ for word in doc])

['program']
['program']
['programming']
['write']
['write']
['write']
['write']


In [137]:
for adj in adjectives_to_lemmatize:
    doc = nlp(adj)
    print([word.lemma_ for word in doc])

['good']
['well']
['good']


## Entity Masking

Entity masking is a technique for censoring certain entities, such as emails, personal names, brand names, and so on. Only Spacy is used here.
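Emails have a regular surface form, so they can also be masked without a language model. The following is a minimal sketch with the standard `re` module. The pattern is deliberately simplified (it does not cover every valid address), and `mask_emails` is a hypothetical helper, not part of Spacy.

```python
import re

# Simplified email pattern: local part, "@", then a domain with at least one dot
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def mask_emails(text, mask="_EMAIL_"):
    """Replace every substring that looks like an email address with the mask."""
    return EMAIL_PATTERN.sub(mask, text)

masked = mask_emails(
    "To email Guido, try guido@python.org or the older address guido@google.com"
)
print(masked)
```

Names of people, organizations, and places have no such fixed shape, so those are left to the named entity recognizer in the cells below.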

In [138]:
news_text="""Indian man has allegedly duped nearly 50 businessmen in the UAE of USD 1.6 million and fled the country in the most unlikely way -- on a repatriation flight to Hyderabad, according to a media report on Saturday.Yogesh Ashok Yariava, the prime accused in the fraud, flew from Abu Dhabi to Hyderabad on a Vande Bharat repatriation flight on May 11 with around 170 evacuees, the Gulf News reported.Yariava, the 36-year-old owner of the fraudulent Royal Luck Foodstuff Trading, made bulk purchases worth 6 million dirhams (USD 1.6 million) against post-dated cheques from unsuspecting traders before fleeing to India, the daily said.
The bought goods included facemasks, hand sanitisers, medical gloves (worth nearly 5,00,000 dirhams), rice and nuts (3,93,000 dirhams), tuna, pistachios and saffron (3,00,725 dirhams), French fries and mozzarella cheese (2,29,000 dirhams), frozen Indian beef (2,07,000 dirhams) and halwa and tahina (52,812 dirhams).
The list of items and defrauded persons keeps getting longer as more and more victims come forward, the report said.
The aggrieved traders have filed a case with the Bur Dubai police station.
The traders said when the dud cheques started bouncing they rushed to the Royal Luck's office in Dubai but the shutters were down, even the fraudulent company's warehouses were empty."""

sentence = 'To email Guido, try guido@python.org or the older address guido@google.com'

news_doc = nlp(news_text)
sent_doc = nlp(sentence)

In [139]:
# Mask a named entity with its entity type, with a special case for emails,
# which are masked with _EMAIL_
def remove_details(word):
    # If the entity type is PERSON, ORG, or GPE, return the entity type as the mask
    if word.ent_type_ in ('PERSON', 'ORG', 'GPE'):
        return f' _{word.ent_type_}_ '
    # If the token looks like an email address, return _EMAIL_
    elif word.like_email:
        return ' _EMAIL_ '
    # Otherwise, return the token unchanged, including its trailing whitespace
    return word.text_with_ws

def update_text(doc):
    # Map every token of the parsed text through remove_details
    tokens = map(remove_details, doc)
    # and join the results
    return ''.join(tokens)

In [140]:
# Mask the entities in the news text
update_text(news_doc)

"Indian man has allegedly duped nearly 50 businessmen in the UAE of USD 1.6 million and fled the country in the most unlikely way -- on a repatriation flight to  _GPE_ , according to a media report on Saturday. _PERSON_  _PERSON_  _PERSON_ , the prime accused in the fraud, flew from  _GPE_  _GPE_ to  _GPE_ on a Vande Bharat repatriation flight on May 11 with around 170 evacuees,  _ORG_  _ORG_  _ORG_ reported. _ORG_ , the 36-year-old owner of the fraudulent Royal Luck Foodstuff Trading, made bulk purchases worth 6 million dirhams (USD 1.6 million) against post-dated cheques from unsuspecting traders before fleeing to  _GPE_ , the daily said.\nThe bought goods included facemasks, hand sanitisers, medical gloves (worth nearly 5,00,000 dirhams), rice and nuts (3,93,000 dirhams), tuna, pistachios and saffron (3,00,725 dirhams), French fries and mozzarella cheese (2,29,000 dirhams), frozen Indian beef (2,07,000 dirhams) and halwa and tahina (52,812 dirhams).\nThe list of items and defrauded 

In [141]:
# Mask the entities in the text that contains emails
update_text(sent_doc)

'To email  _PERSON_ , try  _EMAIL_ or the older address  _EMAIL_ '
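In the news output above, a multi-token entity is masked once per token (for example `_PERSON_  _PERSON_  _PERSON_`), because the masking function works token by token. One possible refinement is to emit a single mask per entity by looking at the IOB flag of each token. The sketch below works on plain tuples that mirror the Spacy token attributes `text`, `whitespace_`, `ent_type_`, and `ent_iob_`; the tuple layout, the sample data, and the name `mask_entities` are assumptions made so that the example runs without a model.

```python
def mask_entities(tokens, labels=("PERSON", "ORG", "GPE")):
    """Join tokens back into text, emitting one mask per entity.

    Each token is a (text, whitespace, ent_type, ent_iob) tuple.
    """
    pieces = []  # each item is [text, trailing whitespace]
    for text, whitespace, ent_type, ent_iob in tokens:
        if ent_type in labels and ent_iob == "I" and pieces:
            # Continuation of an entity: keep the existing mask and
            # take over this token's trailing whitespace
            pieces[-1][1] = whitespace
        elif ent_type in labels:
            # First token of an entity: emit the mask once
            pieces.append([f"_{ent_type}_", whitespace])
        else:
            pieces.append([text, whitespace])
    return "".join(text + ws for text, ws in pieces)

# Hand-written sample resembling a fragment of the news text
sample = [
    ("flew", " ", "", "O"),
    ("from", " ", "", "O"),
    ("Abu", " ", "GPE", "B"),
    ("Dhabi", " ", "GPE", "I"),
    ("to", " ", "", "O"),
    ("Hyderabad", "", "GPE", "B"),
]
print(mask_entities(sample))
```

With a real `doc`, the tuples could be built from each token's attributes, so the masking logic itself would stay the same.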

## POS Tagger

Every sentence has its own structure, or *syntax*. A POS tagger (Part-Of-Speech tagger) labels each token of a sentence with its part of speech. Examples of POS tags include *noun*, *plural noun*, *verb*, and so on.

In [142]:
sentence = "Can you please buy me a Luwak White Coffee over there? It's only Rp15,000."

### NLTK

In [143]:
# The POS tagger can only be applied after the sentence has been tokenized
nltk.pos_tag(word_tokenize(sentence))

[('Can', 'MD'),
 ('you', 'PRP'),
 ('please', 'VB'),
 ('buy', 'VB'),
 ('me', 'PRP'),
 ('a', 'DT'),
 ('Luwak', 'NNP'),
 ('White', 'NNP'),
 ('Coffee', 'NNP'),
 ('over', 'IN'),
 ('there', 'RB'),
 ('?', '.'),
 ('It', 'PRP'),
 ("'s", 'VBZ'),
 ('only', 'RB'),
 ('Rp15,000', 'NNP'),
 ('.', '.')]
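The result is a plain list of `(token, tag)` tuples, so it can be filtered with ordinary Python. As a small sketch, the snippet below copies the output above into a variable named `tagged` (the variable name is an assumption; in the notebook it would be the return value of `nltk.pos_tag`) and collects the tokens tagged as proper nouns.

```python
# Output of the cell above, copied verbatim so the example runs on its own
tagged = [
    ('Can', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('buy', 'VB'),
    ('me', 'PRP'), ('a', 'DT'), ('Luwak', 'NNP'), ('White', 'NNP'),
    ('Coffee', 'NNP'), ('over', 'IN'), ('there', 'RB'), ('?', '.'),
    ('It', 'PRP'), ("'s", 'VBZ'), ('only', 'RB'), ('Rp15,000', 'NNP'),
    ('.', '.'),
]

# Keep only the tokens whose tag is NNP (singular proper noun)
proper_nouns = [token for token, tag in tagged if tag == 'NNP']
print(proper_nouns)
```

Note that the price `Rp15,000` is also tagged `NNP`, so a filter like this picks it up together with the product name.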

### Spacy

In [144]:
doc = nlp(sentence)
# Print each token's text, its coarse-grained part of speech, and its fine-grained POS tag
for word in doc:
    print(f"{word.text}, POS: {word.pos_}, POS Tag: {word.tag_}")

Can, POS: VERB, POS Tag: MD
you, POS: PRON, POS Tag: PRP
please, POS: INTJ, POS Tag: UH
buy, POS: VERB, POS Tag: VB
me, POS: PRON, POS Tag: PRP
a, POS: DET, POS Tag: DT
Luwak, POS: PROPN, POS Tag: NNP
White, POS: PROPN, POS Tag: NNP
Coffee, POS: PROPN, POS Tag: NNP
over, POS: ADV, POS Tag: RB
there, POS: ADV, POS Tag: RB
?, POS: PUNCT, POS Tag: .
It, POS: PRON, POS Tag: PRP
's, POS: AUX, POS Tag: VBZ
only, POS: ADV, POS Tag: RB
Rp15,000, POS: PROPN, POS Tag: NNP
., POS: PUNCT, POS Tag: .


## Phrase Chunking

Phrase chunking extracts phrases that follow a given pattern. The pattern is usually written as a RegEx over POS tags. For the example sentence below, the nouns are extracted together with any adjectives that describe them.

In [145]:
sentence = "Can you please buy me a cup of cincau over there? It's only Rp15,000 and it's a delicious coffee that I've ever drink."

### NLTK

In [146]:
# Define the pattern to search for:
# zero or one determiner, zero or more adjectives, and one noun
pattern = "Chunk: {<DT>?<JJ>*<NN>}"

# Initialize the regex parser with that pattern
parser = nltk.RegexpParser(pattern)

# Then parse the tokenized, POS-tagged sentence
result = parser.parse(nltk.pos_tag(word_tokenize(sentence)))

# Print the tree representation in the terminal
print(repr(result))

# Draw the tree as a picture
result.draw()

Tree('S', [('Can', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('buy', 'VB'), ('me', 'PRP'), Tree('Chunk', [('a', 'DT'), ('cup', 'NN')]), ('of', 'IN'), Tree('Chunk', [('cincau', 'NN')]), ('over', 'IN'), ('there', 'RB'), ('?', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('only', 'RB'), ('Rp15,000', 'NNP'), ('and', 'CC'), ('it', 'PRP'), ("'s", 'VBZ'), Tree('Chunk', [('a', 'DT'), ('delicious', 'JJ'), ('coffee', 'NN')]), ('that', 'IN'), ('I', 'PRP'), ("'ve", 'VBP'), ('ever', 'RB'), ('drink', 'VBP'), ('.', '.')])
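To show what the pattern `<DT>?<JJ>*<NN>` does to a tagged sentence, here is a hand-rolled sketch that scans a list of `(token, tag)` tuples directly. It is not NLTK's implementation; `chunk_noun_phrases` is a hypothetical function, and `tagged_excerpt` is a shortened excerpt of the tagged sentence above.

```python
def chunk_noun_phrases(tagged):
    """Collect phrases matching: optional DT, any number of JJ, then one NN."""
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        # Optional determiner
        if j < n and tagged[j][1] == "DT":
            j += 1
        # Zero or more adjectives
        while j < n and tagged[j][1] == "JJ":
            j += 1
        # Exactly one singular noun closes the chunk
        if j < n and tagged[j][1] == "NN":
            chunks.append(" ".join(token for token, _ in tagged[i:j + 1]))
            i = j + 1
        else:
            # No chunk starts at position i, so move on by one token
            i += 1
    return chunks

# Shortened excerpt of the tagged sentence shown in the tree above
tagged_excerpt = [
    ('buy', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('cup', 'NN'), ('of', 'IN'),
    ('cincau', 'NN'), ('over', 'IN'), ('there', 'RB'), ('it', 'PRP'),
    ("'s", 'VBZ'), ('a', 'DT'), ('delicious', 'JJ'), ('coffee', 'NN'),
]
print(chunk_noun_phrases(tagged_excerpt))
```

The three phrases found correspond to the three `Chunk` subtrees in the tree printed above.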
