<a href="https://colab.research.google.com/github/velizhask/PemrosesanBahasaAlami-nlp/blob/main/Named%20Entity%20Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Named Entity Recognition**
(NER) is an NLP technique that identifies the key entities in a text and classifies them into types, which lets machines understand and process human language more effectively.

# spaCy (English)

PERSON:      People, including fictional.

NORP:        Nationalities or religious or political groups.

FAC:         Buildings, airports, highways, bridges, etc.

ORG:         Companies, agencies, institutions, etc.

GPE:         Countries, cities, states.

LOC:         Non-GPE locations, mountain ranges, bodies of water.

PRODUCT:     Objects, vehicles, foods, etc. (Not services.)

EVENT:       Named hurricanes, battles, wars, sports events, etc.

WORK_OF_ART: Titles of books, songs, etc.

LAW:         Named documents made into laws.

LANGUAGE:    Any named language.

DATE:        Absolute or relative dates or periods.

TIME:        Times smaller than a day.

PERCENT:     Percentage, including “%”.

MONEY:       Monetary values, including unit.

QUANTITY:    Measurements, as of weight or distance.

ORDINAL:     “first”, “second”, etc.

CARDINAL:    Numerals that do not fall under another type.

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [None]:
# Load the pretrained English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = """Apple Inc. is headquartered in Cupertino, California.
The company was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
It is known for its iPhone, iPad, Mac, and other consumer electronics products.
Google LLC is an American multinational technology company that specializes in Internet-related services and products."""

# Process the text with spaCy
doc = nlp(text)

# Print the entities found in the text
print("Named Entities, Phrases, and Concepts:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Named Entities, Phrases, and Concepts:
Apple Inc. (ORG)
Cupertino (GPE)
California (GPE)
Steve Jobs (PERSON)
Steve Wozniak (PERSON)
Ronald Wayne (PERSON)
1976 (DATE)
iPhone (ORG)
iPad (ORG)
Mac (PERSON)
Google LLC (ORG)
American (NORP)
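
The import cell brings in `Counter` but never uses it. A minimal sketch of how it could tally entities per label follows. The `(text, label)` pairs are copied by hand from the output above so the block runs without the model; with a live `doc`, `Counter(ent.label_ for ent in doc.ents)` would presumably give the same counts. A tally like this also makes questionable labels easier to spot, such as `iPhone` and `iPad` tagged ORG and `Mac` tagged PERSON.

```python
from collections import Counter

# (text, label) pairs copied from the printed spaCy output above
entities = [
    ("Apple Inc.", "ORG"), ("Cupertino", "GPE"), ("California", "GPE"),
    ("Steve Jobs", "PERSON"), ("Steve Wozniak", "PERSON"), ("Ronald Wayne", "PERSON"),
    ("1976", "DATE"), ("iPhone", "ORG"), ("iPad", "ORG"), ("Mac", "PERSON"),
    ("Google LLC", "ORG"), ("American", "NORP"),
]

# Count how many entities were assigned to each label
label_counts = Counter(label for _, label in entities)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```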


In [None]:
displacy.render(doc, style="ent", jupyter=True)

# Transformer (Indonesian)

In [None]:
!pip install transformers torch

In [None]:
teks = """
Joko Widodo, Presiden Republik Indonesia, mengunjungi Bali pada tanggal 10 November 2024 untuk meresmikan proyek
pembangunan infrastruktur baru. Dalam kunjungannya, beliau didampingi oleh Menteri Pekerjaan Umum, Basuki Hadimuljono,
serta Gubernur Bali, Wayan Koster.Proyek ini melibatkan perusahaan konstruksi besar seperti PT. Wijaya Karya dan PT. Adhi Karya.
Pemerintah berharap bahwa dengan adanya proyek ini, Bali akan semakin berkembang sebagai tujuan wisata utama di Asia Tenggara.
Selain itu, Joko Widodo juga menyampaikan pentingnya peningkatan kualitas pendidikan di daerah-daerah terpencil.
Untuk mendukung hal tersebut, pemerintah telah mengalokasikan dana sebesar 5 triliun rupiah yang akan digunakan untuk pembangunan
sekolah dan pelatihan guru di seluruh Indonesia.
"""

In [None]:
from transformers import pipeline

# Initialize a NER pipeline with the 'cahya/bert-base-indonesian-NER' model
ner_pipeline = pipeline("ner", model="cahya/bert-base-indonesian-NER", tokenizer="cahya/bert-base-indonesian-NER")

# Run NER on the article
ner_results = ner_pipeline(teks)

# Print the NER results
print("Hasil Named Entity Recognition (NER):")
for entity in ner_results:
    print(f"Entity: {entity['word']} - Label: {entity['entity']} - Score: {entity['score']:.4f}")


Hasil Named Entity Recognition (NER):
Entity: joko - Label: B-PER - Score: 0.9968
Entity: widodo - Label: I-PER - Score: 0.9929
Entity: presiden - Label: B-NOR - Score: 0.9902
Entity: republik - Label: I-NOR - Score: 0.9816
Entity: indonesia - Label: I-NOR - Score: 0.9758
Entity: bali - Label: B-GPE - Score: 0.9958
Entity: 10 - Label: B-DAT - Score: 0.9832
Entity: november - Label: I-DAT - Score: 0.9973
Entity: 2024 - Label: I-DAT - Score: 0.9947
Entity: menteri - Label: B-NOR - Score: 0.9944
Entity: pekerjaan - Label: I-NOR - Score: 0.9918
Entity: umum - Label: I-NOR - Score: 0.9902
Entity: basuki - Label: B-PER - Score: 0.9980
Entity: hadi - Label: I-PER - Score: 0.9970
Entity: ##mul - Label: I-PER - Score: 0.9966
Entity: ##jono - Label: I-PER - Score: 0.9937
Entity: gubernur - Label: B-NOR - Score: 0.9805
Entity: bali - Label: I-NOR - Score: 0.9566
Entity: wayan - Label: B-PER - Score: 0.9871
Entity: kos - Label: I-PER - Score: 0.9972
Entity: ##ter - Label: I-PER - Score: 0.9853
Ent

Entity types in the NER results:

*   Person names (PER)
*   Organizations (ORG)
*   Geopolitical entities such as countries, provinces, and cities (GPE)
*   Dates (DAT)
*   Monetary values (MON)
*   Positions or titles, as tagged in this output (NOR)

Label explanation:
The `B-` prefix marks the first token of an entity and `I-` marks each token that continues it.

B-PER / I-PER: Person names (for example Joko Widodo, Basuki Hadimuljono, Wayan Koster).

B-ORG / I-ORG: Organizations (for example PT. Wijaya Karya, PT. Adhi Karya).

B-GPE / I-GPE: Geographic or political places (for example Bali, Indonesia, Asia Tenggara).

B-DAT / I-DAT: Dates (for example 10 November 2024).

B-MON / I-MON: Monetary values or financial amounts (for example 5 triliun rupiah).

B-NOR / I-NOR: Positions or titles such as Presiden, Menteri, Gubernur.

Note:
The pipeline returns one prediction per token, so a multiword entity such as "Presiden Republik Indonesia" or "Menteri Pekerjaan Umum" appears as several rows (a `B-` row followed by `I-` rows). An uncommon word such as Hadimuljono is further split into subword pieces prefixed with `##` (hadi, ##mul, ##jono). This is normal for token-based models.

With these NER results, you can identify the relevant entities in the article and use them for further analysis.
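
The pipeline output above lists one row per token, including `##` subword pieces. The sketch below shows one way to merge those rows back into whole entities using the `B-`/`I-` prefixes. `merge_bio_tokens` is a hypothetical helper, not part of `transformers`, and the sample rows are copied from the printed output so the block runs without downloading the model. It assumes rows arrive in text order and does not check character offsets, so treat it as an illustration only.

```python
def merge_bio_tokens(ner_results):
    """Group per-token predictions into whole entities using B-/I- prefixes."""
    entities = []
    for item in ner_results:
        prefix, _, label = item["entity"].partition("-")
        word = item["word"]
        last = entities[-1] if entities else None
        if word.startswith("##") and last is not None:
            # WordPiece continuation: glue onto the previous word, no space
            last["text"] += word[2:]
            last["scores"].append(item["score"])
        elif prefix == "I" and last is not None and last["label"] == label:
            # Continuation of the current entity: append as a new word
            last["text"] += " " + word
            last["scores"].append(item["score"])
        else:
            # A B- tag (or an I- tag with nothing to attach to) starts a new entity
            entities.append({"text": word, "label": label, "scores": [item["score"]]})
    # Report the lowest token score as a conservative entity-level score
    return [
        {"text": e["text"], "label": e["label"], "score": min(e["scores"])}
        for e in entities
    ]


# Rows copied from the pipeline output above
sample = [
    {"word": "joko", "entity": "B-PER", "score": 0.9968},
    {"word": "widodo", "entity": "I-PER", "score": 0.9929},
    {"word": "basuki", "entity": "B-PER", "score": 0.9980},
    {"word": "hadi", "entity": "I-PER", "score": 0.9970},
    {"word": "##mul", "entity": "I-PER", "score": 0.9966},
    {"word": "##jono", "entity": "I-PER", "score": 0.9937},
]

merged = merge_bio_tokens(sample)
for entity in merged:
    print(f"{entity['text']} ({entity['label']}) score={entity['score']:.4f}")
```

The `transformers` pipeline can also group tokens itself through its `aggregation_strategy` argument (for example `"simple"`); the manual version here only makes the grouping logic visible.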

# spaCy (Multilingual)

In [None]:
!python -m spacy download xx_ent_wiki_sm

Collecting xx-ent-wiki-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/xx_ent_wiki_sm-3.7.0/xx_ent_wiki_sm-3.7.0-py3-none-any.whl (11.1 MB)
Installing collected packages: xx-ent-wiki-sm
Successfully installed xx-ent-wiki-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('xx_ent_wiki_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# Load the pretrained multilingual model
coba = spacy.load("xx_ent_wiki_sm")

# Process the text with spaCy
doc = coba(teks)

# Print the entities found in the text
print("Named Entities, Phrases, and Concepts:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Named Entities, Phrases, and Concepts:
Joko Widodo (PER)
Presiden Republik Indonesia (LOC)
Bali (LOC)
Dalam (PER)
Menteri Pekerjaan Umum (PER)
Basuki Hadimuljono (PER)
Gubernur Bali (PER)
Wayan Koster (PER)
PT (ORG)
Wijaya Karya (PER)
PT (ORG)
Adhi Karya (PER)
Bali (LOC)
Asia Tenggara (LOC)
Selain (PER)
Joko Widodo (PER)
daerah-daerah terpencil (LOC)
mengalokasikan dana (PER)
rupiah yang akan digunakan (PER)
Indonesia (LOC)
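
The output above contains several spurious spans, for example `Dalam` and `Selain` tagged PER. One way to put a number on this is to compare the predictions with a hand-labeled reference. In the sketch below, `predicted` holds the unique pairs from the printed output, while `gold` is a small reference set written by hand for illustration. It is an assumption, not part of the notebook, and a different annotator might label the text differently.

```python
# Unique (text, label) pairs from the xx_ent_wiki_sm output above
predicted = {
    ("Joko Widodo", "PER"), ("Presiden Republik Indonesia", "LOC"), ("Bali", "LOC"),
    ("Dalam", "PER"), ("Menteri Pekerjaan Umum", "PER"), ("Basuki Hadimuljono", "PER"),
    ("Gubernur Bali", "PER"), ("Wayan Koster", "PER"), ("PT", "ORG"),
    ("Wijaya Karya", "PER"), ("Adhi Karya", "PER"), ("Asia Tenggara", "LOC"),
    ("Selain", "PER"), ("daerah-daerah terpencil", "LOC"),
    ("mengalokasikan dana", "PER"), ("rupiah yang akan digunakan", "PER"),
    ("Indonesia", "LOC"),
}

# Hypothetical hand-labeled reference for the same article (an assumption)
gold = {
    ("Joko Widodo", "PER"), ("Basuki Hadimuljono", "PER"), ("Wayan Koster", "PER"),
    ("Bali", "LOC"), ("Asia Tenggara", "LOC"), ("Indonesia", "LOC"),
    ("PT. Wijaya Karya", "ORG"), ("PT. Adhi Karya", "ORG"),
}

# A prediction counts as correct only if both text and label match exactly
true_positives = predicted & gold
precision = len(true_positives) / len(predicted)
recall = len(true_positives) / len(gold)

print(f"correct: {len(true_positives)} of {len(predicted)} predicted")
print(f"precision={precision:.2f} recall={recall:.2f}")
```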


Reference:
https://github.com/utomoreza/spaCy-NER

Lab exercise

*   Scrape some data and apply NER to it
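
As a starting point for the exercise, the sketch below covers only the step between scraping and NER: pulling paragraph text out of an HTML page using the standard library. `ParagraphExtractor` and `sample_html` are hypothetical names and data made up for illustration. A real page would first need to be fetched (for example with `urllib.request.urlopen`) and will usually need site-specific cleanup.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def _flush(self):
        # Join buffered pieces and normalize whitespace
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


# Hypothetical page content standing in for a scraped article
sample_html = """
<html><body>
<h1>Berita</h1>
<p>Joko Widodo mengunjungi <b>Bali</b> pada 10 November 2024.</p>
<script>var x = 1;</script>
<p>Proyek ini melibatkan PT. Wijaya Karya.</p>
</body></html>
"""

extractor = ParagraphExtractor()
extractor.feed(sample_html)
article_text = "\n".join(extractor.paragraphs)
print(article_text)
```

The resulting `article_text` can then be passed to any of the pipelines in this notebook, for example `ner_pipeline(article_text)` or `nlp(article_text)`.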


In [None]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.corpus import stopwords

# Download the required resources
nltk.download('punkt_tab')  # tokenizer
nltk.download('averaged_perceptron_tagger_eng')  # POS tagger
nltk.download('maxent_ne_chunker_tab')  # named entity chunker
nltk.download('words')  # word list used by the chunker

# Example text for NER
text = "Barack Obama was born in Hawaii and was the 44th president of the United States. Microsoft is a major tech company based in Redmond."

# Tokenize the text into words
tokens = word_tokenize(text)

# POS tagging (label each token with its part of speech)
tags = pos_tag(tokens)

# Run Named Entity Recognition (NER)
entities = ne_chunk(tags)

# Print the NER result
print("Hasil Named Entity Recognition:")
print(entities)

# Print the recognized entities in a more readable format
print("\nEntitas yang dikenali:")
for subtree in entities:
    if isinstance(subtree, nltk.Tree):  # a subtree is a recognized entity
        entity = " ".join([word for word, tag in subtree.leaves()])
        entity_type = subtree.label()
        print(f"{entity} -> {entity_type}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Hasil Named Entity Recognition:
(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  and/CC
  was/VBD
  the/DT
  44th/JJ
  president/NN
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  ./.
  (PERSON Microsoft/NNP)
  is/VBZ
  a/DT
  major/JJ
  tech/NN
  company/NN
  based/VBN
  in/IN
  (GPE Redmond/NNP)
  ./.)

Entitas yang dikenali:
Barack -> PERSON
Obama -> PERSON
Hawaii -> GPE
United States -> GPE
Microsoft -> PERSON
Redmond -> GPE


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk import ne_chunk

# Download the required NLTK resources
nltk.download('punkt')  # tokenizer
nltk.download('averaged_perceptron_tagger')  # POS tagger

# Short Indonesian example text (not used below)
text = "Joko Widodo adalah Presiden Indonesia yang dilantik pada 20 Oktober 2014. Jakarta adalah ibu kota negara."

# Tokenize `teks`, the longer article defined in the Transformer section
# (the output shown below comes from `teks`, not from `text`)
tokens = word_tokenize(teks)

# POS tagging; NLTK's default tagger and chunker are trained on English,
# so their tags for Indonesian tokens are unreliable
tags = pos_tag(tokens)

# Print the POS tagging result
print("Hasil POS tagging:")
print(tags)

# Run Named Entity Recognition (NER)
entities = ne_chunk(tags)

# Print the NER result
print("\nHasil Named Entity Recognition:")
print(entities)

# Print the recognized entities
print("\nEntitas yang dikenali:")
for subtree in entities:
    if isinstance(subtree, nltk.Tree):  # a subtree is a recognized entity
        entity = " ".join([word for word, tag in subtree.leaves()])
        entity_type = subtree.label()
        print(f"{entity} -> {entity_type}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Hasil POS tagging:
[('Joko', 'NNP'), ('Widodo', 'NNP'), (',', ','), ('Presiden', 'NNP'), ('Republik', 'NNP'), ('Indonesia', 'NNP'), (',', ','), ('mengunjungi', 'NN'), ('Bali', 'NNP'), ('pada', 'VBZ'), ('tanggal', 'JJ'), ('10', 'CD'), ('November', 'NNP'), ('2024', 'CD'), ('untuk', 'NN'), ('meresmikan', 'NN'), ('proyek', 'NN'), ('pembangunan', 'NN'), ('infrastruktur', 'NN'), ('baru', 'NN'), ('.', '.'), ('Dalam', 'NNP'), ('kunjungannya', 'NN'), (',', ','), ('beliau', 'NN'), ('didampingi', 'NN'), ('oleh', 'NN'), ('Menteri', 'NNP'), ('Pekerjaan', 'NNP'), ('Umum', 'NNP'), (',', ','), ('Basuki', 'NNP'), ('Hadimuljono', 'NNP'), (',', ','), ('serta', 'NN'), ('Gubernur', 'NNP'), ('Bali', 'NNP'), (',', ','), ('Wayan', 'NNP'), ('Koster.Proyek', 'NNP'), ('ini', 'NN'), ('melibatkan', 'NN'), ('perusahaan', 'NN'), ('konstruksi', 'NN'), ('besar', 'NN'), ('seperti', 'NN'), ('PT', 'NNP'), ('.', '.'), ('Wijaya', 'NNP'), ('Karya', 'NNP'), ('dan', 'NN'), ('PT', 'NNP'), ('.', '.'), ('Adhi', 'NNP'), ('Karya',

In [None]:
import spacy

# Load spaCy's multilingual NER model (not an Indonesian-specific model)
nlp = spacy.load('xx_ent_wiki_sm')  # multilingual model, applied here to Indonesian text

# Process the text
doc = nlp(teks)

# Print the recognized entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")


Joko Widodo -> PER
Presiden Republik Indonesia -> LOC
Bali -> LOC
Dalam -> PER
Menteri Pekerjaan Umum -> PER
Basuki Hadimuljono -> PER
Gubernur Bali -> PER
Wayan Koster -> PER
PT -> ORG
Wijaya Karya -> PER
PT -> ORG
Adhi Karya -> PER
Bali -> LOC
Asia Tenggara -> LOC
Selain -> PER
Joko Widodo -> PER
daerah-daerah terpencil -> LOC
mengalokasikan dana -> PER
rupiah yang akan digunakan -> PER
Indonesia -> LOC


In [None]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.9.2-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading stanza-1.9.2-py3-none-any.whl (1.1 MB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.14.0 stanza-1.9.2


In [None]:
import stanza

# Download and load the Indonesian models
stanza.download('id')
nlp = stanza.Pipeline('id')

# Indonesian text
text = """
Joko Widodo adalah Presiden Republik Indonesia yang sering mengunjungi Bali.
Dalam kunjungannya, ia bertemu dengan Menteri Pekerjaan Umum Basuki Hadimuljono dan Gubernur Bali Wayan Koster.
Selain itu, ia juga mengunjungi proyek-proyek yang dikerjakan oleh PT Wijaya Karya dan PT Adhi Karya di Bali dan wilayah Asia Tenggara.
"""

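# Process the text with Stanza
doc = nlp(text)

# Print the recognized entities
# (the log below shows that the default 'id' package loads no `ner` processor,
# so doc.ents is empty and this loop prints nothing)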
for ent in doc.ents:
    print(f"{ent.text} -> {ent.type}")


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: id (Indonesian) ...
INFO:stanza:File exists: /root/stanza_resources/id/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: id (Indonesian):
| Processor    | Package      |
-------------------------------
| tokenize     | gsd          |
| mwt          | gsd          |
| pos          | gsd_charlm   |
| lemma        | gsd_nocharlm |
| constituency | icon_charlm  |
| depparse     | gsd_charlm   |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!
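
The processor table in the log lists no `ner` processor, so `doc.ents` stays empty for this pipeline. As a rough fallback, consecutive `PROPN` tokens can be grouped into candidate names. This is a heuristic assumed here, not a Stanza feature, and not a substitute for a trained NER model. `group_propn` is a hypothetical helper, and the token dictionaries are written by hand in the shape that `doc.to_dict()` returns, keeping only the `text` and `upos` fields.

```python
def group_propn(sentences):
    """Collect runs of consecutive PROPN tokens as candidate entity mentions."""
    candidates = []
    for sentence in sentences:
        run = []
        for token in sentence:
            if token.get("upos") == "PROPN":
                run.append(token["text"])
            elif run:
                # A non-PROPN token closes the current run
                candidates.append(" ".join(run))
                run = []
        if run:  # flush a run that reaches the end of the sentence
            candidates.append(" ".join(run))
    return candidates


# Hand-written tokens in the shape of doc.to_dict(): a list of sentences,
# each a list of token dictionaries
sample_sentences = [[
    {"text": "Joko", "upos": "PROPN"},
    {"text": "Widodo", "upos": "PROPN"},
    {"text": "adalah", "upos": "AUX"},
    {"text": "Presiden", "upos": "PROPN"},
    {"text": "Republik", "upos": "PROPN"},
]]

candidates = group_propn(sample_sentences)
print(candidates)
```

With the live pipeline, the call would presumably be `group_propn(doc.to_dict())`. The candidates carry no entity type, and titles such as Presiden are grouped along with names.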


In [None]:
print(doc)

[
  [
    {
      "id": 1,
      "text": "Joko",
      "lemma": "joko",
      "upos": "PROPN",
      "xpos": "X--",
      "head": 4,
      "deprel": "nsubj",
      "start_char": 1,
      "end_char": 5,
      "misc": "SpacesBefore=\\n"
    },
    {
      "id": 2,
      "text": "Widodo",
      "lemma": "widodo",
      "upos": "PROPN",
      "xpos": "F--",
      "head": 1,
      "deprel": "flat:name",
      "start_char": 6,
      "end_char": 12
    },
    {
      "id": 3,
      "text": "adalah",
      "lemma": "adalah",
      "upos": "AUX",
      "xpos": "O--",
      "head": 4,
      "deprel": "cop",
      "start_char": 13,
      "end_char": 19
    },
    {
      "id": 4,
      "text": "Presiden",
      "lemma": "presiden",
      "upos": "PROPN",
      "xpos": "NSD",
      "head": 0,
      "deprel": "root",
      "start_char": 20,
      "end_char": 28
    },
    {
      "id": 5,
      "text": "Republik",
      "lemma": "republik",
      "upos": "PROPN",
      "xpos": "F--",
      "head": 