## What Is Tokenization and How Does It Work?

Tokenization is a fundamental step in Natural Language Processing (NLP) that splits text into smaller units called tokens. A token can be a word, a phrase, or even a single character, depending on what the analysis requires.

Tokenization generally takes two main forms:
- **Sentence tokenization**: Splits a paragraph into separate sentences. Each sentence is treated as one token.

- **Word tokenization**: Splits a sentence or paragraph into individual words. Each word becomes a token that can be analyzed further.

Benefits of tokenization:
- It simplifies text analysis tasks such as keyword search, stemming, lemmatization, and feature extraction.

- It supports building machine learning models for tasks such as text classification, sentiment analysis, and chatbots.

Tokenization also helps reduce ambiguity in text, for example by separating trailing punctuation so that "NLP" and "NLP." (with a period) are recognized as the same word. It is usually done with a library such as NLTK or spaCy, which provide several tokenization methods to suit different needs.

With tokenization, text that starts out unstructured becomes more organized and ready for the next NLP processing stages.
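
To make the two forms of tokenization concrete, here is a minimal sketch that uses only Python's built-in `re` module. The helper names `split_sentences` and `split_words` are made up for this illustration and are not part of NLTK or spaCy. The rules are deliberately naive: the sentence splitter would wrongly break after abbreviations such as "Dr.", which is one reason a library tokenizer is normally preferred.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(sentence):
    """Naive word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Hello world! NLP is fun. Tokens can be words or punctuation."
sentences = split_sentences(text)
print(sentences)
print(split_words(sentences[1]))
```

Each sentence becomes one token in the first list, and each word or punctuation mark becomes one token in the second.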

## Tokenization in Practice

Using the NLTK library: https://www.nltk.org/

Using the spaCy library: https://spacy.io/

In [None]:
!pip install nltk

In [None]:
corpus="""Hello Welcome, to Vanya's NLP Tutorials.
Please do learn and watch the entire course on today's class ! to have a better understanding about NLP.
"""

In [None]:
corpus

In [None]:
print(corpus)

In [None]:
# tokenization
# split a paragraph into sentences
from nltk.tokenize import sent_tokenize

In [None]:
# the 'punkt_tab' resource must be downloaded before the corpus can be tokenized
import nltk
nltk.download('punkt_tab')

In [None]:
sent_tokenize(corpus)

https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html

https://www.nltk.org/api/nltk.tokenize.html

In [None]:
documents=sent_tokenize(corpus)

In [None]:
documents

In [None]:
type(documents)

In [None]:
for sentence in documents:
    print(sentence)

In [None]:
# Tokenization 2
# convert a paragraph into words
# convert a sentence into words
from nltk.tokenize import word_tokenize

In [None]:
word_tokenize(corpus)

In [None]:
for sentence in documents:
    print(word_tokenize(sentence))

In [None]:
from nltk.tokenize import wordpunct_tokenize

In [None]:
# this tokenizer always splits off punctuation: "text." becomes 'text', '.'
# word_tokenize does not necessarily split cases such as "text." or "word-"
wordpunct_tokenize(corpus)

In [None]:
from nltk.tokenize import TreebankWordTokenizer

In [None]:
tokenizer = TreebankWordTokenizer()

In [None]:
# in the middle of the text, "text." is kept together as one token
# only at the very last word is it split into "text" and "."
tokenizer.tokenize(corpus)

### Comparing NLTK Tokenizers
The following simple example compares the output of several NLTK tokenizers:

In [None]:
# Simple example text for comparing the tokenizers
text = "Hello world! NLP is fun. Let's tokenize this text: NLP, AI & ML."
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

print('sent_tokenize:')
print(sent_tokenize(text))

print('\nword_tokenize:')
print(word_tokenize(text))

print('\nwordpunct_tokenize:')
print(wordpunct_tokenize(text))

print('\nTreebankWordTokenizer:')
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(text))

---

## Stemming and Its Types

Stemming is an NLP process that reduces a word to its base form (stem/root) by removing certain suffixes or prefixes.

The goal of stemming is to simplify words so that text analysis becomes more efficient. For example, the Indonesian words "berlari", "lari", and "pelari" would all be reduced to "lari".

Stemming is often used in information retrieval, text mining, and sentiment analysis so that words with similar meanings can be recognized as a single entity.

In [None]:
# Classification problem
# Is a product review positive or negative?
# Reviews ------> eating, eat, eaten, [going, gone, goes] ---> go
# Is an email spam or not?
# Stemming is not recommended for building chatbots

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

## Porter Stemmer

Porter Stemmer is one of the most popular stemming algorithms in Natural Language Processing (NLP). It was developed by Martin Porter in 1980 and reduces inflected words to their base form (stem/root) by removing certain suffixes.

Porter Stemmer works through a series of rules that strip English suffixes step by step. Although the resulting stem is not always a grammatically valid word, the method is very effective at reducing word variation, which simplifies text analysis tasks such as information retrieval and document classification.

Examples:
- "running", "runs" → "run" ("runner" is left as "runner")
- "eating", "eats" → "eat" ("eaten" is left as "eaten")

Porter Stemmer is widely used because it is simple, fast, and accurate enough for many basic NLP applications.
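
The idea of stripping suffixes step by step can be sketched in a few lines. The function below is a toy, loosely modeled on the first plural and "ed"/"ing" rules of Porter's algorithm; the name `toy_stem` and its reduced rule set are assumptions made for illustration, and it is not the real Porter Stemmer (for instance, it leaves "history" untouched, while NLTK's `PorterStemmer` returns "histori").

```python
def toy_stem(word):
    """Toy two-step suffix stripper; NOT the real Porter algorithm."""
    word = word.lower()
    # Step 1: plural endings, tried in order; the first matching rule wins.
    for suffix, replacement in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            word = word[: len(word) - len(suffix)] + replacement
            break
    # Step 2: strip "ing" or "ed", but only if the remaining stem contains a vowel.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if any(ch in "aeiou" for ch in stem):
                word = stem
            break
    return word

for w in ["eats", "eating", "ponies", "caresses", "sing", "history"]:
    print(w, "->", toy_stem(w))
```

The vowel check in step 2 is what keeps a short word like "sing" from being cut down to "s".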

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemming = PorterStemmer()

In [None]:
for word in words:
    print(word+"------>"+stemming.stem(word))

In [None]:
stemming.stem('congratulations')

In [None]:
stemming.stem('duration')

In [None]:
stemming.stem("running")

In [None]:
stemming.stem("sitting")

## Regexp Stemmer Class

Regexp Stemmer is a stemming method that uses a regular expression (regex) pattern to remove suffixes or other parts of a word. With it we define our own rules as a regex pattern, for example removing "ing", "s", "e", or "able" from words of at least a given length.

Advantages of the Regexp Stemmer:

- Flexible, because the removal pattern can be tailored as needed.

- Suitable for languages or special cases that standard stemmers do not support.

Example usage:
To remove the suffixes "ing", "s", "e", or "able" from words at least 4 characters long, each alternative is anchored with `$` (for example `ing$`). An alternative without `$` matches anywhere in the word, which the cells below demonstrate with an unanchored `ing`:

In [None]:
# Regular expression stemmer
from nltk.stem import RegexpStemmer

In [51]:
# 'able$' removes the suffix 'able'
# 'ing' has no '$' here, so it removes 'ing' anywhere in the word ('ing$' would remove only the suffix)
# 's$' removes the suffix 's'
# 'e$' removes the suffix 'e'
# min=4 means only words at least 4 characters long are stemmed
reg_stemmer = RegexpStemmer('ing|s$|e$|able$', min=4)

In [52]:
reg_stemmer.stem('ingridients')

'ridient'

In [54]:
reg_stemmer.stem("eating")

'eat'

In [53]:
reg_stemmer.stem("ingeingat")

'eat'

In [55]:
reg_stemmer.stem("ingkan")

'kan'
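
The outputs above are consistent with a plain regex substitution. The sketch below reproduces that behavior with Python's `re.sub` so the effect of the `$` anchor can be compared side by side. It assumes that `RegexpStemmer` simply deletes every match of the pattern and skips words shorter than `min`; the helper name `regex_stem` is hypothetical.

```python
import re

def regex_stem(word, pattern, min_len=0):
    """Delete every match of `pattern` from `word`; leave words shorter than `min_len` alone."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

unanchored = r"ing|s$|e$|able$"   # 'ing' is removed anywhere in the word
anchored = r"ing$|s$|e$|able$"    # 'ing' is removed only at the end

for w in ["eating", "ingkan", "ingeingat", "cat"]:
    print(w, "->", regex_stem(w, unanchored, 4), "|", regex_stem(w, anchored, 4))
```

With the anchored pattern, "ingkan" and "ingeingat" are left unchanged, while "eating" still becomes "eat".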

## Snowball Stemmer

Snowball Stemmer is a stemming algorithm developed by Martin Porter as the successor to the Porter Stemmer. Its English version is also known as the "Porter2 Stemmer". Snowball Stemmer is designed to be more flexible and supports many languages, not just English.

#### Advantages of Snowball Stemmer over Porter Stemmer:

- **More accurate and consistent**: Snowball Stemmer fixes several weaknesses and inconsistencies in the Porter Stemmer, so its results are more stable and relevant.

- **Supports many languages**: Snowball Stemmer can be used for languages such as English, German, French, Spanish, and others, whereas the Porter Stemmer is for English only.

- **More modern and flexible rules**: The Snowball algorithm uses updated rules that can be adapted, so it handles word variation better.

- **Easier to extend**: Snowball Stemmer is written in a dedicated programming language (Snowball) that makes it easier to develop stemmers for other languages.

Snowball Stemmer reduces a word to its base form by removing certain suffixes or affixes according to predefined rules. It is very useful in text analysis, information retrieval, and other NLP applications because it reduces word variation without changing the core meaning.

Examples:
- "running", "runs" → "run"

- "fairly" → "fair", "sportingly" → "sport"

In general, Snowball Stemmer is considered better than Porter Stemmer because its results are more consistent, it supports many languages, and it is more flexible to use.

In [56]:
from nltk.stem import SnowballStemmer

In [57]:
snowballstem = SnowballStemmer('english')

In [58]:
for word in words:
    print(word+"----->"+snowballstem.stem(word))

eating----->eat
eats----->eat
eaten----->eaten
writing----->write
writes----->write
programming----->program
programs----->program
history----->histori
finally----->final
finalized----->final


In [59]:
# Porter Stemmer
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [60]:
# Snowball Stemmer
snowballstem.stem("fairly"),snowballstem.stem("sportingly")

('fair', 'sport')

In [61]:
snowballstem.stem("goes")

'goe'

In [62]:
stemming.stem("goes")

'goe'

---

Faster --> stemming --> find a word --> stem it directly --> 1 module

Slower --> lemmatization --> find a word --> look it up in the WordNet dictionary --> then lemmatize it --> 2 modules: WordNet first, then the lemmatizer

WordNet = the lemmatizer's dictionary

The lemmatizer is the module that does the work; WordNet is its dictionary.

## WordNet Lemmatization

WordNet lemmatization converts a word to its base form (lemma), a grammatically valid word with a clear meaning, using the WordNet database. Unlike stemming, which only cuts off word endings without regard to meaning, lemmatization ensures the result is a correct and meaningful word in the language.

The WordNet Lemmatizer takes a part-of-speech (POS) argument: noun, verb, adjective, or adverb. NLTK's `lemmatize` defaults to noun when `pos` is omitted, so passing the right POS makes the result more accurate and appropriate to the context.

**Example lemmatization results for the word list:**

| Original Word | Lemma (Noun) | Lemma (Verb) |
|---------------|--------------|--------------|
| eating        | eating       | eat          |
| eats          | eats         | eat          |
| eaten         | eaten        | eat          |
| writing       | writing      | write        |
| writes        | writes       | write        |
| programming   | programming  | program      |
| programs      | program      | program      |
| history       | history      | history      |
| finally       | finally      | finally      |
| finalized     | finalized    | finalize     |

**Explanation:**
- Lemmatization produces a valid base word, for example "eating" becomes "eat" (verb).

- For nouns, the word stays the same if it is already in its base form.

- Lemmatization is better suited to applications such as Q&A, chatbots, and text summarization because it preserves word meaning.

With the WordNet Lemmatizer, text analysis becomes more accurate and relevant than with stemming.
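
The role of the POS argument can be illustrated with a toy dictionary lookup. This is only a sketch of the idea: `TOY_LEXICON` and `toy_lemmatize` are hypothetical names, the handful of entries is hand-written, and the real WordNet Lemmatizer relies on a far larger database plus its own morphological rules.

```python
# Hypothetical mini-lexicon: (surface form, POS) -> lemma.
TOY_LEXICON = {
    ("eating", "v"): "eat",
    ("eaten", "v"): "eat",
    ("eats", "v"): "eat",
    ("programs", "n"): "program",
    ("programs", "v"): "program",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos) if the lexicon knows it, else the word unchanged."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("eating"))           # looked up as a noun: not found, returned as-is
print(toy_lemmatize("eating", pos="v"))  # looked up as a verb
print(toy_lemmatize("better", pos="a"))  # an irregular form that no suffix rule could produce
```

The same surface form gives different answers depending on the POS, which is why the default noun lookup leaves most verb forms unchanged.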

In [63]:
from nltk.stem import WordNetLemmatizer

In [64]:
lemmatizer = WordNetLemmatizer()

In [65]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# '''
# POS: noun - 'n'
# verb - 'v'
# adjective - 'a'
# adverb - 'r'
# '''
lemmatizer.lemmatize("cycling",pos='v')

'cycle'

In [71]:
# the default is noun ('n') when pos is omitted; here pos='v' is passed explicitly
for word in words:
    print(word+"------>"+lemmatizer.lemmatize(word ,pos='v'))

eating------>eat
eats------>eat
eaten------>eat
writing------>write
writes------>write
programming------>program
programs------>program
history------>history
finally------>finally
finalized------>finalize


In [73]:
for word in words:
    print(word+"------>"+lemmatizer.lemmatize(word ,pos='a'))

eating------>eating
eats------>eats
eaten------>eaten
writing------>writing
writes------>writes
programming------>programming
programs------>programs
history------>history
finally------>finally
finalized------>finalized


In [74]:
for word in words:
    print(word+"------>"+lemmatizer.lemmatize(word ,pos='r'))

eating------>eating
eats------>eats
eaten------>eaten
writing------>writing
writes------>writes
programming------>programming
programs------>programs
history------>history
finally------>finally
finalized------>finalized


In [79]:
lemmatizer.lemmatize("goes", pos='n')

'go'

In [80]:
lemmatizer.lemmatize("fairly"),lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')

In [85]:
lemmatizer.lemmatize("fairly", pos='v'),lemmatizer.lemmatize("sportingly", pos='v')

('fairly', 'sportingly')

Of the methods discussed, the WordNet Lemmatizer is generally the most time-consuming.

**Main reason:**

The WordNet Lemmatizer has to search the WordNet database to determine the base form (lemma) of a word for a given part of speech (POS). This involves a lookup and a check against the dictionary so that the result is a valid word. Stemmers such as Porter, Snowball, or the Regexp Stemmer, by contrast, only strip word endings using simple, direct rules and never check whether the result is a real word.

Because of this extra work and the dependence on an external resource (WordNet, which NLTK loads from disk on first use), lemmatization with the WordNet Lemmatizer tends to be slower than stemming. On a handful of words the difference can disappear into measurement noise, so it is best judged on a larger corpus. In return, the results are usually more accurate and better suited to the context.

**Conclusion**

Each method has its own strengths and weaknesses. Stemming suits tasks that need speed and do not depend on an exact base word, such as information retrieval or simple statistical analysis. Lemmatization is the better choice when the base word must be grammatically valid and meaningful, for example in chatbots, Q&A, or text summarization. The right method depends on the needs and goals of the text analysis.

---

## Text Preprocessing Exercises with a New Dataset
Use the following dataset to work through the exercises below:

```python
dataset = [
    "Natural Language Processing is amazing!",
    "Text preprocessing includes tokenization, stemming, and lemmatization.",
    "Python and NLTK make NLP tasks easier.",
    "Let's learn NLP together and build cool projects!"
 ]
```

### Exercises
1. **Sentence Tokenization**
   - Join the whole dataset into one paragraph, then split it into sentences using `sent_tokenize` from NLTK.

2. **Word Tokenization**
   - Pick one sentence from the previous result, then split it into words using `word_tokenize`.

3. **Tokenizer Comparison**
   - Use the same sentence to compare the output of `word_tokenize`, `wordpunct_tokenize`, and `TreebankWordTokenizer`. Show the results and explain the differences.

4. **Stemming**
   - Stem every word in the dataset using Porter Stemmer, Snowball Stemmer, and Regexp Stemmer. Compare the results.

5. **Lemmatization**
   - Lemmatize every word in the dataset using the WordNet Lemmatizer for the noun and verb POS. Show the results and explain how they differ from stemming.

6. **Execution Time Analysis**
   - Measure the execution time of stemming and lemmatization on every word in the dataset. Compare the two and explain why lemmatization is typically slower.

7. **Concept Explanation**
   - Briefly explain the difference between stemming and lemmatization, and when each method should be used in an NLP application.

---
Work through each exercise below in a new code cell and write your explanation or analysis in a markdown cell after it.


In [86]:
# Dataset
dataset = [
    "Natural Language Processing is amazing!",
    "Text preprocessing includes tokenization, stemming, and lemmatization.",
    "Python and NLTK make NLP tasks easier.",
    "Let's learn NLP together and build cool projects!"
 ]

In [87]:
# 1. Sentence tokenization
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

paragraf = " ".join(dataset)
kalimat = sent_tokenize(paragraf)
print('Sentence tokenization:', kalimat)

Sentence tokenization: ['Natural Language Processing is amazing!', 'Text preprocessing includes tokenization, stemming, and lemmatization.', 'Python and NLTK make NLP tasks easier.', "Let's learn NLP together and build cool projects!"]


In [88]:
# 2. Word tokenization
kata = word_tokenize(kalimat[1])
print('\nWord tokenization:', kata)


Word tokenization: ['Text', 'preprocessing', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']


In [89]:
# 3. Tokenizer comparison
print('\nword_tokenize:', word_tokenize(kalimat[1]))
print('wordpunct_tokenize:', wordpunct_tokenize(kalimat[1]))
tokenizer = TreebankWordTokenizer()
print('TreebankWordTokenizer:', tokenizer.tokenize(kalimat[1]))


word_tokenize: ['Text', 'preprocessing', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
wordpunct_tokenize: ['Text', 'preprocessing', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
TreebankWordTokenizer: ['Text', 'preprocessing', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']


In [90]:
# 4. Stemming
from nltk.stem import PorterStemmer, SnowballStemmer, RegexpStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

print('\nPorter Stemmer:', [porter.stem(w) for w in kata])
print('Snowball Stemmer:', [snowball.stem(w) for w in kata])
print('Regexp Stemmer:', [regexp.stem(w) for w in kata])



Porter Stemmer: ['text', 'preprocess', 'includ', 'token', ',', 'stem', ',', 'and', 'lemmat', '.']
Snowball Stemmer: ['text', 'preprocess', 'includ', 'token', ',', 'stem', ',', 'and', 'lemmat', '.']
Regexp Stemmer: ['Text', 'preprocess', 'include', 'tokenization', ',', 'stemm', ',', 'and', 'lemmatization', '.']


In [91]:
# 5. Lemmatization
from nltk.stem import WordNetLemmatizer

import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print('\nWordNet Lemmatizer (noun):', [lemmatizer.lemmatize(w, pos='n') for w in kata])
print('WordNet Lemmatizer (verb):', [lemmatizer.lemmatize(w, pos='v') for w in kata])



WordNet Lemmatizer (noun): ['Text', 'preprocessing', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
WordNet Lemmatizer (verb): ['Text', 'preprocessing', 'include', 'tokenization', ',', 'stem', ',', 'and', 'lemmatization', '.']


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [92]:
# 6. Execution time analysis
# Note: one pass over a ten-token list takes well under a millisecond, so these numbers are mostly noise
import time

start_stem = time.time()
[porter.stem(w) for w in kata]
end_stem = time.time()

start_lem = time.time()
[lemmatizer.lemmatize(w, pos='v') for w in kata]
end_lem = time.time()

print(f'\nStemming time: {end_stem - start_stem:.6f} seconds')
print(f'Lemmatization time: {end_lem - start_lem:.6f} seconds')


Stemming time: 0.000372 seconds
Lemmatization time: 0.000331 seconds
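
In the single-pass measurement above the two timings are almost identical, and lemmatization even comes out slightly ahead, so one run on ten tokens is not enough to support a conclusion. A more stable comparison repeats the work many times and averages. The sketch below shows one way to do that with `time.perf_counter`; the helper `average_seconds` and the two stand-in functions are hypothetical, chosen so the block runs without NLTK. In the notebook, `porter.stem` and a small wrapper around `lemmatizer.lemmatize(w, pos='v')` could be passed instead.

```python
import time

def average_seconds(func, items, repeats=1000):
    """Average wall-clock time of applying `func` to every item, over `repeats` passes."""
    start = time.perf_counter()
    for _ in range(repeats):
        for item in items:
            func(item)
    return (time.perf_counter() - start) / repeats

# Stand-in for a rule-based stemmer (hypothetical).
def strip_s(word):
    return word[:-1] if word.endswith("s") else word

# Stand-in for a dictionary-based lemmatizer (hypothetical).
LOOKUP = {"eats": "eat", "writes": "write"}

def lookup(word):
    return LOOKUP.get(word, word)

sample = ["eats", "writes", "history"]
print(f"rule-based: {average_seconds(strip_s, sample):.8f} s per pass")
print(f"lookup:     {average_seconds(lookup, sample):.8f} s per pass")
```

The printed numbers depend on the machine, so none are shown here.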


---

## Stopwords

**Stopwords** are common words in a language that appear frequently in text but usually carry little meaning for analysis or further processing. English examples include "the", "is", "in", "and", "of", "to", "he", "she", and "ours".

### Benefits of Removing Stopwords

- **Less noise:** Stopwords are removed so the analysis focuses on more meaningful and relevant words.

- **More efficient processing:** Fewer words to process means faster computation and lower memory use.

- **Potentially better accuracy:** Removing unimportant words can make NLP models for classification, sentiment analysis, or information retrieval more accurate. Check the list first, though: NLTK's English list includes negations such as "no" and "not", which can matter for sentiment analysis.

### When Are Stopwords Removed?

- **Text preprocessing:** Stopwords are usually removed after tokenization and before stemming or lemmatization, so the results are cleaner.

- **Information retrieval:** When searching documents or data, stopwords are dropped so that results are more relevant.

- **Machine learning:** To build better models, stopwords are removed so that the features used really represent important information.

### Context in This Notebook

In this notebook, stopwords are removed to clean the text before stemming with the Porter Stemmer and Snowball Stemmer, and before lemmatization. For the `paragraph` variable, the text is split into sentences, each sentence is split into words, and the words found in the stopword list are dropped before stemming. This keeps the stemming results focused on the core words that carry meaning rather than on common words that contribute nothing to the analysis.

Stopword lists are also available for many languages, such as English, German, Indonesian, and Arabic, so they can be used for multilingual text analysis as needed.
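
Stopword filtering itself is a simple membership test on each token. The sketch below uses a small hand-picked set (an assumption for illustration, not NLTK's list) and compares the lowercased form of each token. Lowercasing matters because a list of lowercase stopwords will not match "I" or "We" otherwise, which is why tokens such as "i" and "we" still appear in this notebook's filtered output.

```python
# Small hand-picked stopword set for illustration (an assumption, not NLTK's list).
STOP_WORDS = {"i", "have", "for", "the", "is", "in", "of", "and", "we", "not"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword; keep the original casing otherwise."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["I", "have", "three", "visions", "for", "India", "."]
print(remove_stopwords(tokens))
```

Building the set once and reusing it keeps each membership test cheap, compared with rebuilding the list for every word.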

In [93]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

In [94]:
from nltk.stem import PorterStemmer

In [95]:
from nltk.corpus import stopwords

In [96]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [97]:
# English stopword list (useful, for example, when classifying a statement as positive or negative)
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [98]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 '

In [99]:
stopwords.words('indonesian')

['ada',
 'adalah',
 'adanya',
 'adapun',
 'agak',
 'agaknya',
 'agar',
 'akan',
 'akankah',
 'akhir',
 'akhiri',
 'akhirnya',
 'aku',
 'akulah',
 'amat',
 'amatlah',
 'anda',
 'andalah',
 'antar',
 'antara',
 'antaranya',
 'apa',
 'apaan',
 'apabila',
 'apakah',
 'apalagi',
 'apatah',
 'artinya',
 'asal',
 'asalkan',
 'atas',
 'atau',
 'ataukah',
 'ataupun',
 'awal',
 'awalnya',
 'bagai',
 'bagaikan',
 'bagaimana',
 'bagaimanakah',
 'bagaimanapun',
 'bagi',
 'bagian',
 'bahkan',
 'bahwa',
 'bahwasanya',
 'baik',
 'bakal',
 'bakalan',
 'balik',
 'banyak',
 'bapak',
 'baru',
 'bawah',
 'beberapa',
 'begini',
 'beginian',
 'beginikah',
 'beginilah',
 'begitu',
 'begitukah',
 'begitulah',
 'begitupun',
 'bekerja',
 'belakang',
 'belakangan',
 'belum',
 'belumlah',
 'benar',
 'benarkah',
 'benarlah',
 'berada',
 'berakhir',
 'berakhirlah',
 'berakhirnya',
 'berapa',
 'berapakah',
 'berapalah',
 'berapapun',
 'berarti',
 'berawal',
 'berbagai',
 'berdatangan',
 'beri',
 'berikan',
 'berikut'

In [100]:
stopwords.words('arabic')

['إذ',
 'إذا',
 'إذما',
 'إذن',
 'أف',
 'أقل',
 'أكثر',
 'ألا',
 'إلا',
 'التي',
 'الذي',
 'الذين',
 'اللاتي',
 'اللائي',
 'اللتان',
 'اللتيا',
 'اللتين',
 'اللذان',
 'اللذين',
 'اللواتي',
 'إلى',
 'إليك',
 'إليكم',
 'إليكما',
 'إليكن',
 'أم',
 'أما',
 'أما',
 'إما',
 'أن',
 'إن',
 'إنا',
 'أنا',
 'أنت',
 'أنتم',
 'أنتما',
 'أنتن',
 'إنما',
 'إنه',
 'أنى',
 'أنى',
 'آه',
 'آها',
 'أو',
 'أولاء',
 'أولئك',
 'أوه',
 'آي',
 'أي',
 'أيها',
 'إي',
 'أين',
 'أين',
 'أينما',
 'إيه',
 'بخ',
 'بس',
 'بعد',
 'بعض',
 'بك',
 'بكم',
 'بكم',
 'بكما',
 'بكن',
 'بل',
 'بلى',
 'بما',
 'بماذا',
 'بمن',
 'بنا',
 'به',
 'بها',
 'بهم',
 'بهما',
 'بهن',
 'بي',
 'بين',
 'بيد',
 'تلك',
 'تلكم',
 'تلكما',
 'ته',
 'تي',
 'تين',
 'تينك',
 'ثم',
 'ثمة',
 'حاشا',
 'حبذا',
 'حتى',
 'حيث',
 'حيثما',
 'حين',
 'خلا',
 'دون',
 'ذا',
 'ذات',
 'ذاك',
 'ذان',
 'ذانك',
 'ذلك',
 'ذلكم',
 'ذلكما',
 'ذلكن',
 'ذه',
 'ذو',
 'ذوا',
 'ذواتا',
 'ذواتي',
 'ذي',
 'ذين',
 'ذينك',
 'ريث',
 'سوف',
 'سوى',
 'شتان',
 'عدا',
 'عسى',
 'عل'

In [101]:
# Porter Stemmer
from nltk.stem import PorterStemmer

In [102]:
stemmer=PorterStemmer()

In [103]:
# tokenize the whole paragraph
# split the paragraph into sentences
# paragraph ----> sentence
nltk.sent_tokenize(paragraph)

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.',
 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s developme

In [104]:
sentences=nltk.sent_tokenize(paragraph)

In [105]:
type(sentences)

list

In [106]:
## Apply stopword filtering, then stemming
# build the English stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
# loop over every sentence
for i in range(len(sentences)):
    # tokenize the sentence into words
    words=nltk.word_tokenize(sentences[i])
    # drop English stopwords (case-sensitive match) and stem the remaining words
    words=[stemmer.stem(word) for word in words if word not in stop_words]
    # join the stemmed, filtered words back into a single string
    sentences[i]=' '.join(words)

In [107]:
sentences

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'i believ india got first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect us .',
 'my second vision india ’ develop .',
 'for fifti year develop nation .',
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onl

In [108]:
# snowball stemmer
from nltk.stem import SnowballStemmer

In [109]:
snowstemmer=SnowballStemmer('english')

In [110]:
sentences=nltk.sent_tokenize(paragraph)

In [111]:
## Apply stopword filtering, then Snowball stemming
# build the English stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    words=[snowstemmer.stem(word) for word in words if word not in stop_words]
    sentences[i]=' '.join(words)  # join the list of words back into a sentence string

In [112]:
sentences

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'i believ india got first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect us .',
 'my second vision india ’ develop .',
 'for fifti year develop nation .',
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onl
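On this paragraph the Porter and Snowball outputs look the same, which can hide the fact that the two algorithms do differ. The sketch below compares them on a few words; the word list is chosen purely for illustration and is not taken from the notebook.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Illustrative words (an assumption, not part of the paragraph above)
sample_words = ["running", "fairly", "sportingly"]

# Map each word to its (Porter stem, Snowball stem) pair
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in sample_words}

for word, (porter_stem, snowball_stem) in comparison.items():
    print(f"{word:12} porter={porter_stem:12} snowball={snowball_stem}")
```

Neither stemmer needs an `nltk.download(...)` call here, because both are rule-based and ship with the library.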

In [113]:
# Lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer=WordNetLemmatizer()

In [114]:
sentences=nltk.sent_tokenize(paragraph)

In [115]:
## Apply Stopwords And Filter And then Apply Lemmatization
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    words=[lemmatizer.lemmatize(word) for word in words if word not in set(stopwords.words('english'))]
    sentences[i]=' '.join(words)# converting all the list of words into sentences

In [116]:
sentences

['I three vision India .',
 'In 3000 year history , people world come invaded u , captured land , conquered mind .',
 'From Alexander onwards , Greeks , Turks , Moguls , Portuguese , British , French , Dutch , came looted u , took .',
 'Yet done nation .',
 'We conquered anyone .',
 'We grabbed land , culture , history tried enforce way life .',
 'Why ?',
 'Because respect freedom others.That first vision freedom .',
 'I believe India got first vision 1857 , started War Independence .',
 'It freedom must protect nurture build .',
 'If free , one respect u .',
 'My second vision India ’ development .',
 'For fifty year developing nation .',
 'It time see developed nation .',
 'We among top 5 nation world term GDP .',
 'We 10 percent growth rate area .',
 'Our poverty level falling .',
 'Our achievement globally recognised today .',
 'Yet lack self-confidence see developed nation , self-reliant self-assured .',
 'Isn ’ incorrect ?',
 'I third vision .',
 'India must stand world .',
 'Bec
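The output above still contains words such as `I` and `In`. That is because the loops compare each token against the stopword list using its original casing, while NLTK's English stopword list is all lowercase. A minimal sketch of the difference, using a tiny hand-written stopword set as a stand-in for `stopwords.words('english')`:

```python
# Tiny stand-in stopword set (an assumption for illustration; the notebook
# uses the full, all-lowercase list from stopwords.words('english'))
stop_words = {"i", "have", "for", "in", "of", "our"}

tokens = ["I", "have", "three", "visions", "for", "India", "."]

# Case-sensitive check, as in the loops above: "I" does not match "i"
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercase each token only for the comparison; the kept tokens retain their casing
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['I', 'three', 'visions', 'India', '.']
print(case_insensitive)  # ['three', 'visions', 'India', '.']
```

Building the set once before the loop, rather than calling `set(stopwords.words('english'))` for every word, also avoids repeating the same work.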

In [117]:
sentences=nltk.sent_tokenize(paragraph)

In [118]:
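# Lowercase every word and lemmatize it as a verb (pos='v').
# Note: the stopword check still runs on the original casing, so capitalized
# stopwords such as "I" or "In" are not removed.
## Apply Stopwords And Filter And then Apply Lemmatization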
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    words=[lemmatizer.lemmatize(word.lower(),pos='v') for word in words if word not in set(stopwords.words('english'))]
    sentences[i]=' '.join(words)# converting all the list of words into sentences

In [119]:
sentences

['i three visions india .',
 'in 3000 years history , people world come invade us , capture land , conquer mind .',
 'from alexander onwards , greeks , turks , moguls , portuguese , british , french , dutch , come loot us , take .',
 'yet do nation .',
 'we conquer anyone .',
 'we grab land , culture , history try enforce way life .',
 'why ?',
 'because respect freedom others.that first vision freedom .',
 'i believe india get first vision 1857 , start war independence .',
 'it freedom must protect nurture build .',
 'if free , one respect us .',
 'my second vision india ’ development .',
 'for fifty years develop nation .',
 'it time see develop nation .',
 'we among top 5 nations world term gdp .',
 'we 10 percent growth rate areas .',
 'our poverty level fall .',
 'our achievements globally recognise today .',
 'yet lack self-confidence see develop nation , self-reliant self-assured .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'because i believe unle

---

## Part Of Speech Tags

**Part of Speech (POS) tags** are labels assigned to each word in a sentence to indicate its grammatical category, such as noun, verb, adjective, or adverb.

#### Benefits of POS Tagging
- **Understanding sentence structure:** Supports syntactic analysis and shows how the words in a sentence relate to each other.

- **NLP preprocessing:** Lets you filter words by class, for example keeping only nouns or verbs.

- **Improving model accuracy:** Useful in many NLP applications such as Named Entity Recognition, sentiment analysis, and machine translation.

- **Information extraction:** Makes it easier to pull out specific information such as the subject, object, or action in a text.

#### When to Use POS Tagging
- **Text preprocessing:** Before further analysis, such as lemmatization, which needs the POS to give accurate results.

- **Information extraction:** When you want to extract entities or facts from text.

- **Parsing and syntax analysis:** To build syntax trees or analyze sentence structure.

- **Advanced NLP applications:** Chatbots, summarization, and question answering often use POS tagging to understand the context and meaning of a sentence.

Other references: https://www.geeksforgeeks.org/python/part-speech-tagging-stop-words-using-nltk-python/

"Bali is a beautiful place"

```text
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: "there is" ... think of it like "there exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective - 'big'
JJR adjective, comparative - 'bigger'
JJS adjective, superlative - 'biggest'
LS list marker - 1)
MD modal - could, will
NN noun, singular - 'desk'
NNS noun, plural - 'desks'
NNP proper noun, singular - 'Harrison'
NNPS proper noun, plural - 'Americans'
PDT predeterminer - 'all the kids'
POS possessive ending - parent's
PRP personal pronoun - I, he, she
PRP$ possessive pronoun - my, his, her
RB adverb - very, silently
RBR adverb, comparative - better
RBS adverb, superlative - best
RP particle - give up
TO to - go 'to' the store
UH interjection - errrrrrrrm
VB verb, base form - take
VBD verb, past tense - took
VBG verb, gerund/present participle - taking
VBN verb, past participle - taken
VBP verb, non-3rd person singular present - take
VBZ verb, 3rd person singular present - takes
WDT wh-determiner - which
WP wh-pronoun - who, what
WP$ possessive wh-pronoun - whose
WRB wh-adverb - where, when
```

In [120]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

In [121]:
import nltk

sentences=nltk.sent_tokenize(paragraph)

In [122]:
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.',
 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s developme

In [123]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [124]:
# Find the POS tags for each sentence in the paragraph
from nltk.corpus import stopwords

for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    words=[word for word in words if word not in set(stopwords.words('english'))]
    #sentences[i]=' '.join(words)# converting all the list of words into sentences
    pos_tag=nltk.pos_tag(words)
    print(pos_tag)

[('I', 'PRP'), ('three', 'CD'), ('visions', 'NNS'), ('India', 'NNP'), ('.', '.')]
[('In', 'IN'), ('3000', 'CD'), ('years', 'NNS'), ('history', 'NN'), (',', ','), ('people', 'NNS'), ('world', 'NN'), ('come', 'VBP'), ('invaded', 'VBN'), ('us', 'PRP'), (',', ','), ('captured', 'VBD'), ('lands', 'NNS'), (',', ','), ('conquered', 'VBD'), ('minds', 'NNS'), ('.', '.')]
[('From', 'IN'), ('Alexander', 'NNP'), ('onwards', 'NNS'), (',', ','), ('Greeks', 'NNP'), (',', ','), ('Turks', 'NNP'), (',', ','), ('Moguls', 'NNP'), (',', ','), ('Portuguese', 'NNP'), (',', ','), ('British', 'NNP'), (',', ','), ('French', 'NNP'), (',', ','), ('Dutch', 'NNP'), (',', ','), ('came', 'VBD'), ('looted', 'JJ'), ('us', 'PRP'), (',', ','), ('took', 'VBD'), ('.', '.')]
[('Yet', 'RB'), ('done', 'VBN'), ('nation', 'NN'), ('.', '.')]
[('We', 'PRP'), ('conquered', 'VBD'), ('anyone', 'NN'), ('.', '.')]
[('We', 'PRP'), ('grabbed', 'VBD'), ('land', 'NN'), (',', ','), ('culture', 'NN'), (',', ','), ('history', 'NN'), ('tried'
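The lemmatization cell earlier passed `pos='v'` for every word, which treats nouns as verbs too. The tags printed above make it possible to choose the part of speech per word instead. The sketch below is one way to map Penn Treebank tags to the single-letter codes that `WordNetLemmatizer.lemmatize(word, pos=...)` accepts; `penn_to_wordnet` is a hypothetical helper name, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # assumption: treat everything else as a noun

# A few (word, tag) pairs copied from the tagging output above
tagged = [('people', 'NNS'), ('come', 'VBP'), ('invaded', 'VBN'),
          ('captured', 'VBD'), ('lands', 'NNS')]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
# [('people', 'n'), ('come', 'v'), ('invaded', 'v'), ('captured', 'v'), ('lands', 'n')]
```

Each pair could then be passed on as `lemmatizer.lemmatize(word.lower(), pos=letter)`, which requires the WordNet data to be downloaded.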

In [125]:
"Bali is a beautiful place".split()

['Bali', 'is', 'a', 'beautiful', 'place']

In [126]:
print(nltk.pos_tag("Bali is a beautiful place".split()))

[('Bali', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'), ('place', 'NN')]
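Filtering words by class, one of the preprocessing uses listed earlier, only needs the `(word, tag)` pairs. A small sketch that reuses the tagged output above as plain data, so it runs without calling the tagger again:

```python
from collections import Counter

# Tagged output copied from the cell above
tagged = [('Bali', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
          ('beautiful', 'JJ'), ('place', 'NN')]

# Every Penn Treebank noun tag (NN, NNS, NNP, NNPS) starts with "NN"
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)  # ['Bali', 'place']
print(tag_counts)
```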


---

## Named Entity Recognition

**Named Entity Recognition (NER)** is the Natural Language Processing (NLP) task of identifying and classifying important entities in text, such as names of people, organizations, locations, dates, and times. NER automatically marks the words or phrases that belong to a given entity type, which makes it easier to analyze text and extract information from it.

#### Benefits of Named Entity Recognition

- **Information extraction:** Makes it easier to pull key data such as names, places, dates, and organizations out of documents or articles.

- **Search and categorization:** Helps with finding documents by entity and grouping data by entity category.

- **Large-scale data analysis:** Useful for automatically analyzing large volumes of text, for example news, social media, or business reports.

- **Improving NLP model accuracy:** NER helps NLP models understand the context and meaning of text better.

#### When to Use Named Entity Recognition

- **Information retrieval systems:** To extract and display important entities from search results.

- **Chatbots and virtual assistants:** To recognize names, places, or times in a user's conversation.

- **Social media analysis:** To monitor and analyze trending topics or entities.

- **Documentation and archives:** To organize and index documents by the entities they contain.

- **Business intelligence:** To extract insights from reports, emails, or business documents.

NER matters in many modern NLP applications because it helps turn raw text into structured information that can be analyzed further.

```text
sentence= "The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel, whose company specialized in building metal frameworks and structures."

Person, e.g. Vanya Mayazura
Place or Location, e.g. Indonesia
Date, e.g. September, 24-09-1989
Time, e.g. 4:30pm
Money, e.g. 1 million dollars
Organization, e.g. INeuron Private Limited
Percent, e.g. 20%, twenty percent
```

In [127]:
sentence="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel, whose company specialized in building metal frameworks and structures."

In [128]:
import nltk
words=nltk.word_tokenize(sentence)

In [129]:
nltk.pos_tag(words)

[('The', 'DT'),
 ('Eiffel', 'NNP'),
 ('Tower', 'NNP'),
 ('was', 'VBD'),
 ('built', 'VBN'),
 ('from', 'IN'),
 ('1887', 'CD'),
 ('to', 'TO'),
 ('1889', 'CD'),
 ('by', 'IN'),
 ('Gustave', 'NNP'),
 ('Eiffel', 'NNP'),
 (',', ','),
 ('whose', 'WP$'),
 ('company', 'NN'),
 ('specialized', 'VBD'),
 ('in', 'IN'),
 ('building', 'NN'),
 ('metal', 'NN'),
 ('frameworks', 'NNS'),
 ('and', 'CC'),
 ('structures', 'NNS'),
 ('.', '.')]

In [130]:
tag_elements=nltk.pos_tag(words)

In [131]:
import nltk
nltk.download('maxent_ne_chunker_tab')
# model used by the named entity chunker (ne_chunk)

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!


True

In [132]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [133]:
# Run named entity recognition and draw the resulting tree (opens a GUI window)
nltk.ne_chunk(tag_elements).draw()

---

## Exercise: Stopwords, Part of Speech, and Named Entity Recognition

In [134]:
dataset = [
    "Barack Obama was the 44th President of the United States.",
    "I love learning Natural Language Processing with Python.",
    "Jakarta is the capital city of Indonesia.",
    "Apple is looking at buying U.K. startup for $1 billion."
]

### Part 1 – Stopwords

1. Remove the stopwords from every sentence in the dataset.

2. Count how many stopwords each sentence contains.

### Part 2 – Part of Speech (POS Tagging)

3. Run POS tagging on every word in the dataset.

4. Collect all words that are nouns (NN/NNP).

### Part 3 – Named Entity Recognition (NER)

5. Run Named Entity Recognition using ne_chunk.

6. Show the entities of type PERSON, GPE (Geopolitical Entity), and ORGANIZATION.

---

### Answer

---

### Worked Solution

In [135]:
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag, ne_chunk

# Download resources
# (the earlier cells fetched 'punkt_tab', 'averaged_perceptron_tagger_eng' and
# 'maxent_ne_chunker_tab', the resource names that newer NLTK releases look for)
# tokenizer models
nltk.download('punkt')
# stopword lists
nltk.download('stopwords')
# POS tagger model
nltk.download('averaged_perceptron_tagger')
# named entity chunker model
nltk.download('maxent_ne_chunker')
# word list corpus that the named entity chunker relies on
nltk.download('words')

# Dataset
dataset = [
    "Barack Obama was the 44th President of the United States.",
    "I love learning Natural Language Processing with Python.",
    "Jakarta is the capital city of Indonesia.",
    "Apple is looking at buying U.K. startup for $1 billion."
]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Vanya\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [136]:
# Stopwords
stop_words = set(stopwords.words('english'))

print("=== STOPWORDS REMOVAL ===")
# loop over every sentence in the dataset
for sentence in dataset:
    # tokenize the sentence into words
    words = word_tokenize(sentence)
    # drop stopwords, comparing in lowercase so capitalized stopwords are caught too
    filtered = [w for w in words if w.lower() not in stop_words]
    # print the sentence before and after filtering
    print(f"Original sentence : {sentence}")
    print(f"After filtering   : {filtered}")
    # number of stopwords that were removed
    print(f"Stopwords removed : {len(words) - len(filtered)}\n")

=== STOPWORDS REMOVAL ===
Original sentence : Barack Obama was the 44th President of the United States.
After filtering   : ['Barack', 'Obama', '44th', 'President', 'United', 'States', '.']
Stopwords removed : 4

Original sentence : I love learning Natural Language Processing with Python.
After filtering   : ['love', 'learning', 'Natural', 'Language', 'Processing', 'Python', '.']
Stopwords removed : 2

Original sentence : Jakarta is the capital city of Indonesia.
After filtering   : ['Jakarta', 'capital', 'city', 'Indonesia', '.']
Stopwords removed : 3

Original sentence : Apple is looking at buying U.K. startup for $1 billion.
After filtering   : ['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', '.']
Stopwords removed : 3



In [137]:
# POS Tagging
print("\n=== POS TAGGING ===")
# loop over every sentence in the dataset
for sentence in dataset:
    # tokenize the sentence into words
    words = word_tokenize(sentence)
    # tag each token with its part of speech
    tagged = pos_tag(words)
    # keep only the nouns (singular, plural, and proper)
    nouns = [word for word, tag in tagged if tag in ["NN", "NNP", "NNS", "NNPS"]]
    # print the results
    print(f"Sentence: {sentence}")
    print(f"POS Tags: {tagged}")
    print(f"Nouns: {nouns}\n")


=== POS TAGGING ===
Sentence: Barack Obama was the 44th President of the United States.
POS Tags: [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('the', 'DT'), ('44th', 'JJ'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('.', '.')]
Nouns: ['Barack', 'Obama', 'President', 'United', 'States']

Sentence: I love learning Natural Language Processing with Python.
POS Tags: [('I', 'PRP'), ('love', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'VBG'), ('with', 'IN'), ('Python', 'NNP'), ('.', '.')]
Nouns: ['Natural', 'Language', 'Python']

Sentence: Jakarta is the capital city of Indonesia.
POS Tags: [('Jakarta', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'), ('city', 'NN'), ('of', 'IN'), ('Indonesia', 'NNP'), ('.', '.')]
Nouns: ['Jakarta', 'capital', 'city', 'Indonesia']

Sentence: Apple is looking at buying U.K. startup for $1 billion.
POS Tags: [('Apple', 'NNP'), ('is', 'VBZ'), (

In [138]:
# Named Entity Recognition
print("\n=== NAMED ENTITY RECOGNITION ===")
# loop over every sentence in the dataset
for sentence in dataset:
    # tokenize the sentence into words
    words = word_tokenize(sentence)
    # tag each token with its part of speech
    tagged = pos_tag(words)
    # run named entity recognition on the tagged tokens
    chunks = ne_chunk(tagged)

    # print the results
    print(f"Sentence: {sentence}")
    print("Entities found:")

    # print every entity of the requested types; only chunk subtrees have a label
    for chunk in chunks:
        if hasattr(chunk, 'label'):
            if chunk.label() in ["PERSON", "GPE", "ORGANIZATION"]:
                entity = " ".join(c[0] for c in chunk.leaves())
                print(f" - {entity} ({chunk.label()})")
    print()


=== NAMED ENTITY RECOGNITION ===
Sentence: Barack Obama was the 44th President of the United States.
Entities found:
 - Barack (PERSON)
 - Obama (PERSON)
 - United States (GPE)

Sentence: I love learning Natural Language Processing with Python.
Entities found:
 - Natural (ORGANIZATION)
 - Python (PERSON)

Sentence: Jakarta is the capital city of Indonesia.
Entities found:
 - Jakarta (GPE)
 - Indonesia (GPE)

Sentence: Apple is looking at buying U.K. startup for $1 billion.
Entities found:
 - Apple (GPE)
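In the first sentence above, `Barack` and `Obama` come back as two separate PERSON chunks. One possible post-processing step is to join neighbouring chunks that share a label. The sketch below is a heuristic, not something `ne_chunk` does itself; `merge_adjacent_entities` is a hypothetical helper, and the tree is built by hand from the POS and NER outputs above (assuming the remaining tokens are not chunked) so that it runs without the chunker models.

```python
from nltk import Tree

# Hand-built tree mirroring the result for the first sentence:
# POS tags from the POS tagging output, entity labels from the NER output
tree = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP')]),
    Tree('PERSON', [('Obama', 'NNP')]),
    ('was', 'VBD'), ('the', 'DT'), ('44th', 'JJ'), ('President', 'NNP'),
    ('of', 'IN'), ('the', 'DT'),
    Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
    ('.', '.'),
])

def merge_adjacent_entities(tree):
    """Return (text, label) pairs, joining neighbouring chunks that share a label."""
    entities = []
    previous_was_chunk = False
    for node in tree:
        if isinstance(node, Tree):
            text = " ".join(word for word, _ in node.leaves())
            if previous_was_chunk and entities[-1][1] == node.label():
                # Same label directly after another chunk: extend the last entity
                entities[-1] = (entities[-1][0] + " " + text, node.label())
            else:
                entities.append((text, node.label()))
            previous_was_chunk = True
        else:
            # A plain (word, tag) token breaks the run of chunks
            previous_was_chunk = False
    return entities

merged = merge_adjacent_entities(tree)
print(merged)  # [('Barack Obama', 'PERSON'), ('United States', 'GPE')]
```

The other results above (`Natural` as ORGANIZATION, `Python` as PERSON, `Apple` as GPE) are misclassifications that this kind of merging does not fix.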



In [139]:
# Visualizing Named Entity Recognition (NER) with NLTK
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

text = "Barack Obama was born in Hawaii. He was elected president in 2008."

tokens = word_tokenize(text)
tagged = pos_tag(tokens)
ner_tree = ne_chunk(tagged)

# Show the NER tree visually (run this in an environment that supports a GUI)
ner_tree.draw()
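When no GUI is available, the same tree can be inspected as text. A minimal sketch, assuming a tree shaped like the Jakarta result from the exercise (built by hand here from the outputs shown earlier, so no models are needed): printing the tree gives a bracketed form, and `tree2conlltags` from `nltk.chunk` flattens it into IOB-tagged triples.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built stand-in for ne_chunk output on the Jakarta sentence
tree = Tree('S', [
    Tree('GPE', [('Jakarta', 'NNP')]),
    ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'), ('city', 'NN'), ('of', 'IN'),
    Tree('GPE', [('Indonesia', 'NNP')]),
    ('.', '.'),
])

# Bracketed text form of the tree
print(tree)

# One (word, POS tag, IOB entity tag) triple per token;
# "B-" marks the first token of an entity, "O" marks tokens outside any entity
iob_tags = tree2conlltags(tree)
for word, pos, iob in iob_tags:
    print(word, pos, iob)
```

With a real result, `ner_tree` from the cell above can be passed in place of the hand-built `tree`.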