Information comes in many shapes and sizes. One important form is **structured data**, in which entities are connected by regular, predictable relations. For example, when looking for a place to work, we might be interested in the relationship between companies and locations. Given a particular company, we want to identify the locations where it does business. Conversely, given a location, we want to find out which companies do business there.

Consider the following table:

| OrgName             | LocationName |
|---------------------|--------------|
| Omnicom             | New York     |
| DDB Needham         | New York     |
| Kaplan Thaler Group | New York     |
| BBDO South          | Atlanta      |
| Georgia-Pacific     | Atlanta      |

If the table above is stored as a list of tuples, we can ask, "Which companies are located in Atlanta?" With Python, we can write:
```python
print([e1 for (e1,rel,e2) in data_kota if e2 == 'Atlanta'])
```

In [124]:
data_kota = [
    ('Omnicom','IN','New York'),
    ('DDB Needham','IN','New York'),
    ('Kaplan Thaler Group','IN','New York'),
    ('BBDO South','IN','Atlanta'),
    ('Georgia-Pacific','IN','Atlanta'),
]

hasil = [e1 for (e1,rel,e2) in data_kota if e2 == 'Atlanta']
print(hasil)

['BBDO South', 'Georgia-Pacific']
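The opening paragraph mentions lookups in both directions: from a location to its companies, and from a company to its location. As a minimal sketch, the same tuples can be indexed once into two dictionaries so that either question becomes a direct lookup. The names `lokasi_ke_org` and `org_ke_lokasi` are hypothetical and are not used elsewhere in this practicum.

```python
from collections import defaultdict

# Same (organization, relation, location) tuples as in the table above.
data_kota = [
    ('Omnicom', 'IN', 'New York'),
    ('DDB Needham', 'IN', 'New York'),
    ('Kaplan Thaler Group', 'IN', 'New York'),
    ('BBDO South', 'IN', 'Atlanta'),
    ('Georgia-Pacific', 'IN', 'Atlanta'),
]

# Hypothetical index names: location -> organizations, organization -> location.
lokasi_ke_org = defaultdict(list)
org_ke_lokasi = {}
for org, rel, lokasi in data_kota:
    lokasi_ke_org[lokasi].append(org)
    org_ke_lokasi[org] = lokasi

print(lokasi_ke_org['Atlanta'])   # companies located in Atlanta
print(org_ke_lokasi['Omnicom'])   # location of Omnicom
```

The list comprehension scans every tuple on each query, which is fine for five rows. The dictionaries trade a little memory for constant-time lookups, assuming each company has exactly one location as in this table.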


With the data modeled as a list, searching is straightforward. But what if the information is buried in text like the following:

"*The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.*"

How do we extract information about the companies and their locations? This question is answered in the third section, which covers Named-Entity Recognition (NER).

In general, the architecture of an information extraction system looks like Figure 1
![image.png](images/alur.png)
*Figure 1. Architecture and flow of information extraction*

In Figure 1, *sentence segmentation* is the stage that splits the text into tokens, where each token is a single sentence. That stage was completed in Practicum 1. The remaining stages are covered in this Practicum 2.

# Part-of-Speech Tagging (POS Tagging) #

**Part of speech** is the grammatical category of a word according to its role in a sentence, such as noun, verb, adjective, or adverb. In NLP, POS tagging is the process of labeling each word with its *part of speech*.

As an example of POS tagging, we reuse the variable `kalimat_token`.

In [125]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/oddy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [126]:
kalimat = "The striped bats are hanging on their feet in a dangerous place"
kalimat_token = word_tokenize(kalimat)


In [127]:

kalimat_token_pos_tagged = nltk.pos_tag(kalimat_token)
print(kalimat_token_pos_tagged)

[('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('dangerous', 'JJ'), ('place', 'NN')]


The output of `print(kalimat_token_pos_tagged)` is

```cmd
[('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('dangerous', 'JJ'), ('place', 'NN')]
```
If the tags look confusing, don't worry. To see what each abbreviation means, run the following code

In [128]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package tagsets to /home/oddy/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


#### Code explanation ####
1. `nltk.download('averaged_perceptron_tagger')` downloads the pretrained tagger model used for POS tagging
1. `nltk.pos_tag(kalimat_token)` assigns a POS tag to each token of the text
1. `nltk.download('tagsets')` downloads the tagset documentation
1. `nltk.help.upenn_tagset()` prints the list of POS tags with their descriptions and examples
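Because `nltk.pos_tag` returns ordinary `(word, tag)` tuples, the result can be filtered and counted with plain Python. The sketch below hard-codes the tagged output shown earlier so that it runs without downloading any model. The variable names `jumlah_tag`, `kata_benda`, and `kata_sifat` are hypothetical.

```python
from collections import Counter

# Tagged tokens copied from the nltk.pos_tag output shown earlier.
tagged = [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'),
          ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'),
          ('in', 'IN'), ('a', 'DT'), ('dangerous', 'JJ'), ('place', 'NN')]

# How often each tag occurs in the sentence.
jumlah_tag = Counter(tag for _, tag in tagged)

# Penn Treebank noun tags all start with "NN" and adjective tags with "JJ".
kata_benda = [word for word, tag in tagged if tag.startswith('NN')]
kata_sifat = [word for word, tag in tagged if tag.startswith('JJ')]

print(jumlah_tag.most_common(3))
print(kata_benda)
print(kata_sifat)
```

Matching on the tag prefix is a deliberate simplification: it groups `NN`, `NNS`, `NNP`, and `NNPS` together as nouns, which is usually what a quick filter wants.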

# Chunking #

## You may be wondering: what is POS tagging actually good for? ##

From the first stage up through POS tagging, we have only processed tokens that each contain a single word. What about a group of words that carries a single meaning, commonly called a phrase? This is where POS tagging comes in. With POS tags, we can write a 'formula' for picking out phrases.
Examples of phrases in English are 'a dangerous place' and 'a beautiful mind'.

With plain tokenization, the text 'a dangerous place' is split into three tokens, even though the phrase has a specific meaning as a whole that is lost when its words are separated.

## How do we find phrases in text? ##

NLTK has a feature called 'Chunk Grammar'. A Chunk Grammar (CG) is a set of grammar rules written over POS tags. Under the hood, a CG is based on regular expressions (regex).

An example of a CG rule for finding phrases is
`grammar = "NP: {<DT>?<JJ>*<NN>}"`

where `NP` is Noun Phrase, `DT` is determiner, `JJ` is adjective, and `NN` is noun (singular or mass). The `?` means that `DT` is optional, while `*` means that `JJ` may occur zero or more times. To make this concrete, let us try it out.
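To see why a chunk grammar is essentially a regular expression, the rule can be imitated with Python's `re` module applied to a string of tags. This is only an illustrative sketch of the idea, not how NLTK is implemented internally. The names `tag_string`, `pola`, and `frasa` are hypothetical.

```python
import re

# Tagged tokens copied from the POS tagging example above.
tagged = [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'),
          ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'),
          ('in', 'IN'), ('a', 'DT'), ('dangerous', 'JJ'), ('place', 'NN')]

# Encode the tag sequence as one string: "<DT><JJ><NNS>..."
tag_string = ''.join(f'<{tag}>' for _, tag in tagged)

# Optional determiner, zero or more adjectives, then exactly one NN.
pola = re.compile(r'(?:<DT>)?(?:<JJ>)*<NN>')

frasa = []
for match in pola.finditer(tag_string):
    # Convert character offsets back to token positions by counting "<".
    start = tag_string[:match.start()].count('<')
    length = match.group().count('<')
    frasa.append(' '.join(word for word, _ in tagged[start:start + length]))

print(frasa)
```

Note that 'The striped bats' is not matched, because `bats` is tagged `NNS` (plural) and the pattern asks for `<NN>` exactly.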

In [129]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)

In [130]:
tree = chunk_parser.parse(kalimat_token_pos_tagged)
#tree.draw()  # Desktop users only: uncomment this line to display the tree in a window.

The result of `tree.draw()` is shown in Figure 2
![image.png](images/hirarki.png)
*Figure 2. Phrase hierarchy*
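Where `tree.draw()` is not available (for example on a remote notebook server), the chunks can also be read out of the tree as text. The sketch below assumes NLTK is installed and hard-codes the tagged tokens so that no model download is needed. The helper `extract_phrases` is hypothetical, and the second grammar using `<NN.*>` is an added variation, not part of the original exercise.

```python
import nltk

# Tagged tokens copied from the POS tagging example above.
tagged = [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'),
          ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'),
          ('in', 'IN'), ('a', 'DT'), ('dangerous', 'JJ'), ('place', 'NN')]

def extract_phrases(grammar, tagged_tokens, label='NP'):
    """Hypothetical helper: return the text of every chunk with the given label."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        ' '.join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]

# The grammar from this practicum: only a singular noun (NN) may end the phrase.
frasa = extract_phrases('NP: {<DT>?<JJ>*<NN>}', tagged)

# Variation: <NN.*> accepts any tag that starts with NN, such as NNS.
frasa_semua = extract_phrases('NP: {<DT>?<JJ>*<NN.*>}', tagged)

print(frasa)
print(frasa_semua)
```

Calling `print(tree)` on the parsed tree is another text-only option; it shows the same structure in bracketed form.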

# Named Entity Recognition (NER) #

A Named Entity (NE) is a noun phrase that refers to a specific location, person, organization, and so on. With NER, we can find the NEs in a text and determine the type of each one. The goal of NER is to identify all textual mentions of named entities. NER can be broken down into two tasks: identifying the boundaries of each NE, and identifying its type. Besides supporting **relation** identification in **Information Extraction**, NER is also used in other tasks. For example, in Question Answering (QA), what is retrieved is not the whole page but only the parts that contain an answer to the user's question.

**Now suppose the question is "Who was the first President of Indonesia?"**, and one of the retrieved documents contains the following passage:

"*The National Monument is the most prominent structure in Jakarta and one of the city's early attractions. It was built and initiated by Soekarno, who led the country to independence and then became its first President.*"

Before running NER, we first preprocess the text. The preprocessing function is:

In [131]:
def preprocess_teks(teks):
    # Split the text into sentences, then split each sentence into word tokens.
    daftar_kalimat = nltk.sent_tokenize(teks)
    list_token = []
    for kalimat in daftar_kalimat:
        list_token.append(word_tokenize(kalimat))

    return list_token

In [132]:
# step one
nltk.download('maxent_ne_chunker')
nltk.download('words')

# step two
contoh_teks = "The National Monument is the most prominent structure " \
"in Jakarta and one of the city's early attractions. "\
"It was built and initiated by Soekarno, who led the "\
"country to independence and then became its first President. "

# step three
for word in preprocess_teks(contoh_teks):
    pos = nltk.pos_tag(word)
    chunks = nltk.ne_chunk(pos)

    # step four
    for chunk in chunks:
        if hasattr(chunk,'label'):
            print(chunk.label(),' '.join(c[0] for c in chunk))


ORGANIZATION National Monument
GPE Jakarta
PERSON Soekarno


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/oddy/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/oddy/nltk_data...
[nltk_data]   Package words is already up-to-date!


Looking at the output above, NER found three entities, labeled ORGANIZATION, GPE (geo-political entity, i.e. a location), and PERSON (a person's name), with only a few lines of Python code. Let us walk through the code.

1. `nltk.download('maxent_ne_chunker')`
1. `nltk.download('words')`
1. `contoh_teks`
```python
contoh_teks = "The National Monument is the most prominent structure " \
"in Jakarta and one of the city's early attractions. "\
"It was built and initiated by Soekarno, who led the "\
"country to independence and then became its first President. "
```

Steps one and two are straightforward: we download the required modules and define `contoh_teks` as a Python variable.

In step three, we call `preprocess_teks` to split `contoh_teks` into sentences and then into word tokens. The tokens of each sentence are POS-tagged with `nltk.pos_tag`, and the tagged tokens are passed to `chunks = nltk.ne_chunk(pos)` to produce the NER result. The variable `chunks` holds a tree of the sentence's tokens in which each named entity appears as a labeled subtree.

Step four is simple: it prints only the chunks that carry a label. For `contoh_teks`, the NER result printed in step four is
```cmd
ORGANIZATION National Monument
GPE Jakarta
PERSON Soekarno
```
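Instead of printing each entity as it is found, the labeled chunks can be collected into a dictionary keyed by entity type, which is handy when the entities need to be counted or reused later. The sketch below assumes NLTK is installed. It uses a small hand-built `Tree` as a stand-in for the output of `nltk.ne_chunk`, so it runs without downloading any model; the tree contents and the name `entitas` are illustrative assumptions.

```python
from collections import defaultdict
from nltk import Tree

# Hand-built stand-in for one sentence tree returned by nltk.ne_chunk.
chunks = Tree('S', [
    ('The', 'DT'),
    Tree('ORGANIZATION', [('National', 'NNP'), ('Monument', 'NNP')]),
    ('is', 'VBZ'),
    ('in', 'IN'),
    Tree('GPE', [('Jakarta', 'NNP')]),
    ('built', 'VBN'),
    ('by', 'IN'),
    Tree('PERSON', [('Soekarno', 'NNP')]),
])

# Group entity names by their label; plain (word, tag) tuples are skipped.
entitas = defaultdict(list)
for chunk in chunks:
    if isinstance(chunk, Tree):
        entitas[chunk.label()].append(' '.join(word for word, tag in chunk))

for label, names in entitas.items():
    print(label, len(names), names)
```

Checking `isinstance(chunk, Tree)` plays the same role as `hasattr(chunk, 'label')` in step four: only subtrees have a label, while ordinary tokens are plain tuples.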

That concludes Practicum 2. We hope your studies go smoothly and prove rewarding.

# References #

1. N. Indurkhya and F. J. Damerau, Handbook of Natural Language Processing. CRC Press, 2010.
1. S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, 2009.
1. https://www.nltk.org/index.html

# Practicum 2 Assignment

## Assignment Title
Applying POS Tagging and Named-Entity Recognition to English Text

## Description
Use the various preprocessing techniques covered in lecture session 4 and Practicum 2 to process the following text:

South Africa's health minister says there is "absolutely no need to panic" over the new coronavirus variant Omicron, despite a surge in cases. "We have been here before," Joe Phaahla added, referring to the Beta variant detected in South Africa last December. South Africa also condemned the travel bans imposed on the country, saying they should be lifted immediately. Omicron has been classed as a "variant of concern". Early evidence suggests it has a heightened re-infection risk. The heavily mutated variant was detected in South Africa earlier this month and then reported to the World Health Organization (WHO) last Wednesday. The variant is responsible for most of the infections found in South Africa's most populated province, Gauteng, over the last two weeks. The number of cases of "appears to be increasing in almost all provinces" in the country, according to the WHO. Does southern Africa have enough vaccines? South Africans fear impact of new variant measures Covid variants: Do we need new vaccines yet? Africa Live: More on this and other stories from the continent South Africa reported 2,800 new infections on Sunday, a rise from the daily average of 500 in the previous week. Government adviser and epidemiologist Salim Abdool Karim said he expected the number of cases to reach more than 10,000 a day by the end of the week, and for hospitals to come under pressure in the next two to three weeks. Dr Phaahla said he wanted to "reiterate that there is absolutely no need to panic" because this "is no new territory for us". "We are now more than 20 months' experienced in terms of Covid-19, various variants and waves," he added at a media briefing. Media caption, Cyril Ramaphosa: "We are deeply disappointed by the decision of several countries to prohibit travel." On Monday, Japan became the latest country to reinstate tough border restrictions, banning all foreigners from entering from 30 November. 
The UK, EU and US are among those who earlier imposed travel bans on South Africa and other regional states. UN Secretary General António Guterres said he was "deeply concerned" about the isolation of southern Africa, adding that "the people of Africa cannot be blamed for the immorally low level of vaccinations available in Africa". The bans and restrictions have left the plans of a huge number of travellers up in the air. South African Annalee Veysey, who is getting treatment for cancer in South Africa, was expecting to be reunited with her family in the UK early in December. She has been separated from them for the last 15 months because of earlier travel restrictions and her treatment. "It's almost two years of my life I've missed out with my family. Especially if you've had a journey with cancer, you find what your family means to you," she told the BBC, adding that she felt "desperate". Travellers at a near empty airportImage source, EPA Image caption, South Africa's main airport in Johannesburg was getting quieter over the weekend as restrictions were taking effect Hannah Day is stuck in Pretoria. She flew to South Africa last week after she got news that her son, who lives there, was in hospital after being bitten by a snake. He is now recovering but Ms Day needs to return to the UK for work. "I can self-isolate, but I cannot afford to pay for the quarantine," she told the BBC. The WHO has warned against countries hastily imposing travel curbs, saying they should look to a "risk-based and scientific approach". The world body's Africa director Matshidiso Moeti said on Sunday: "With the Omicron variant now detected in several regions of the world, putting in place travel bans that target Africa attacks global solidarity." However, Rwanda and Angola are among African states that have announced a restriction on flights to and from South Africa. 
South Africa's foreign ministry spokesman Clayson Monyela described their decision as "quite regrettable, very unfortunate, and I will even say sad". In a speech on Sunday, South Africa's President Cyril Ramaphosa said the bans would not be effective in preventing the spread of the variant. "The only thing the prohibition on travel will do is to further damage the economies of the affected countries and undermine their ability to respond to, and recover from, the pandemic," he said. Current regulations in South Africa make it mandatory to wear face coverings in public, and restrict indoor gatherings to 750 people and outdoor gatherings to 2,000. Mr Ramaphosa said South Africa would not impose new restrictions, but would "undertake broad consultations on making vaccination mandatory for specific activities and locations". There are no vaccine shortages in South Africa itself, and Mr Ramaphosa urged more people to get jabbed, saying that remained the best way to fight the virus. Health experts said that Gauteng, which includes Johannesburg, had entered a fourth wave, and most hospital admissions were of unvaccinated people. Omicron has now been detected in a number of countries around the world, including the UK, Germany, Australia and Israel. In other developments: China said it would offer 1bn doses of vaccines to African countries on top of the 200m it had already supplied US Covid adviser Anthony Fauci says the government is on "high alert" and that spread is inevitable A Czech woman who came back from Namibia recently was confirmed to have the Omicron variant Portugal has detected 13 cases of the variant among players and staff of Lisbon-based Belenenses SAD football club Australia has paused its plans to reopen its borders in light of the Omicron variant

## Grading ##
1. The program must be written as a Jupyter Notebook (20 points)
1. It must contain a preprocess() function as in TP1. Example: preprocess(teks) (30 points)
1. Count the number of words contained in PERSON, ORGANIZATION, and GPE chunks (50 points)

## Notes ##
1. Submit the Practicum 2 Assignment on Google Classroom with the file name TP2_NIM_Nama_Lengkap_Anda.ipynb, where NIM is your student ID number and Nama_Lengkap_Anda is your full name
1. An incorrectly named file or a wrong file format will cause the Practicum 2 Assignment to be rejected