# Practicum 1 - NLTK Basics #

**NLTK** (https://www.nltk.org) is a Python library for processing human-language data.

In this section, we learn how to install NLTK and walk through some basic usage examples.

# Installation and Setup

Installing NLTK is straightforward: it only takes a single `pip` command.

## From the command line or terminal: ##
> `pip install -U nltk`


# Playing with NLTK in Python

Below are some basic NLTK commands.

## Tokenization
The first step in processing text is to split it into its component parts (words and punctuation), called "tokens". Tokens are useful for finding patterns, and tokenization is the basic step that precedes stemming and lemmatization.
NLTK provides two kinds of tokenization: word tokenization and sentence tokenization.
Let us look at an example of each:

### Word Tokenization

In [12]:
# import the nltk library
import nltk

# download the Punkt tokenizer models
nltk.download('punkt')

from nltk.tokenize import word_tokenize
text = "Alhamdulillah, hari ini cuacanya cerah. Tapi, sore hari hujan"
print(word_tokenize(text))


['Alhamdulillah', ',', 'hari', 'ini', 'cuacanya', 'cerah', '.', 'Tapi', ',', 'sore', 'hari', 'hujan']


[nltk_data] Downloading package punkt to /home/oddy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


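#### Code explanation ####
1. `import nltk` imports the NLTK library.
1. `nltk.download('punkt')` downloads the Punkt tokenizer models used by NLTK. This only needs to be done once, on the first run. Recent NLTK releases may look for the `punkt_tab` resource instead; if `word_tokenize` raises a `LookupError`, run the `nltk.download(...)` call named in the error message.
1. `word_tokenize` is a function from the `nltk.tokenize` module.
1. `text` is a variable that holds the text data as a string.

### Sentence Tokenization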

In [13]:
from nltk.tokenize import sent_tokenize

text = "Alhamdulillah, hari ini cuacanya cerah. Tapi, sore hari hujan"
print(sent_tokenize(text))

['Alhamdulillah, hari ini cuacanya cerah.', 'Tapi, sore hari hujan']


The outputs of word tokenization and sentence tokenization differ. Compare the two:
1. Word tokenization: `['Alhamdulillah', ',', 'hari', 'ini', 'cuacanya', 'cerah', '.', 'Tapi', ',', 'sore', 'hari', 'hujan']`
1. Sentence tokenization: `['Alhamdulillah, hari ini cuacanya cerah.', 'Tapi, sore hari hujan']`

NLTK's tokenizers rely on punctuation to tell where words and sentences begin and end.
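
To see why punctuation matters, here is a minimal sketch that reproduces both outputs with two regular expressions from the standard library. It is only an illustration of the idea and is not how NLTK's tokenizers are implemented; the patterns and the names `word_tokens` and `sentence_tokens` are assumptions made for this example.

```python
import re

text = "Alhamdulillah, hari ini cuacanya cerah. Tapi, sore hari hujan"

# A word token is a run of word characters OR a single punctuation mark.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# A sentence boundary is the whitespace that follows '.', '!' or '?'.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentence_tokens)
```

Such simple rules break down quickly on real text (abbreviations like "Dr." end with a period without ending a sentence), which is one reason to prefer a library tokenizer.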

## Stopword Removal

**Stop words** are words that you want to ignore or filter out of a text. Very common words, such as 'di', 'yang', and 'saya' in Indonesian or 'in', 'is', and 'an' in English, are often treated as stop words because they add little meaning to the text by themselves. Here, we use NLTK's stop word list for English.

In [14]:
# download the stopwords corpus
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /home/oddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Code explanation ####
1. `nltk.download('stopwords')` downloads the stopwords corpus that will be used to filter the text.
1. `from nltk.corpus import stopwords` imports the stopwords corpus reader.

In [15]:
benjamin_quote = "Tell me and I forget. Teach me and I remember. Involve me and I learn"

words = word_tokenize(benjamin_quote)
print(words)

['Tell', 'me', 'and', 'I', 'forget', '.', 'Teach', 'me', 'and', 'I', 'remember', '.', 'Involve', 'me', 'and', 'I', 'learn']


In [16]:
stop_words = set(stopwords.words("english"))
print(stop_words)

{'with', 'against', 'are', 'be', "you'd", 'both', "wouldn't", 'herself', 'it', 'haven', 'of', 'itself', 'too', 'if', "isn't", 'between', 'because', 'again', "doesn't", 'doing', 't', 'theirs', 'other', "you're", "couldn't", 'by', 'who', 'do', 'nor', 'off', 'more', 'he', 'only', 'me', 'into', 'up', 'should', "mustn't", 'any', 'a', 'you', 'such', 'while', "didn't", 'didn', 'until', 'm', 'll', 'mightn', 'very', 'ma', 'their', 'weren', "weren't", 'hasn', 'them', 'but', 'been', 'below', 'own', 'wouldn', 'were', 'why', 'what', 'hers', 'an', 'in', 'from', 'her', 'further', 'our', 'or', 've', 'whom', 'same', 'to', 'out', 'am', 'himself', 'which', 'few', "mightn't", 'she', 'couldn', 'not', 'some', "wasn't", "haven't", 'yourself', 'on', 'isn', "shan't", 'shan', 'they', "shouldn't", 'where', 'him', 'aren', 'can', 're', 'when', 'your', 'through', "it's", 'is', 'than', 'ourselves', 'themselves', 'yours', 'before', 'had', 'its', 'then', 'his', 's', 'this', "aren't", 'now', "she's", 'down', 'did', 'th

#### Code explanation ####
1. `stopwords.words("english")` returns the list of English stop words.
1. `set(stopwords.words("english"))` converts that `list` into a `set`, which makes membership tests such as `word in stop_words` fast.
1. `print(stop_words)` prints the stop words.

In [17]:
quote_tanpa_stopwords = []

for word in words:
    if word.casefold() not in stop_words:
        quote_tanpa_stopwords.append(word)
        
print(quote_tanpa_stopwords)

['Tell', 'forget', '.', 'Teach', 'remember', '.', 'Involve', 'learn']


#### Code explanation ####
1. `quote_tanpa_stopwords` is the variable that holds the tokens left after stop word removal.
1. `word.casefold()` makes the comparison case-insensitive, so that 'I' matches the lowercase entry 'i' in the stop word list.
1. `quote_tanpa_stopwords.append(word)` adds the word to the list.

There are several ways to build a list in Python. The first is an explicit loop, as in the following code:
```python
quote_tanpa_stopwords = []

for word in words:
    if word.casefold() not in stop_words:
        quote_tanpa_stopwords.append(word)
        
print(quote_tanpa_stopwords)
```

The second is a list comprehension:
```python
quote_tanpa_stopwords = [
    word for word in words if word.casefold() not in stop_words
]
```
Both forms produce the same data with the same type. The difference is mainly one of style: the second is considered more **Pythonic**.

In [18]:
quote_tanpa_stopwords = [
    word for word in words if word.casefold() not in stop_words
]

print(quote_tanpa_stopwords)

['Tell', 'forget', '.', 'Teach', 'remember', '.', 'Involve', 'learn']
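
The same filtering logic can be wrapped in a small reusable function. The sketch below is self-contained and does not call NLTK: the function name `remove_stopwords` and the three-word stop list `demo_stop_words` are hypothetical, chosen only so that the example runs without downloading the stopwords corpus.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose case-folded form is not in stop_words."""
    return [token for token in tokens if token.casefold() not in stop_words]


# Hypothetical miniature stop list; NLTK's English list is much longer.
demo_stop_words = {"me", "and", "i"}

demo_tokens = ["Tell", "me", "and", "I", "forget", ".",
               "Teach", "me", "and", "I", "remember", ".",
               "Involve", "me", "and", "I", "learn"]

filtered = remove_stopwords(demo_tokens, demo_stop_words)
print(filtered)
```

Note that punctuation tokens such as '.' survive the filter, because they are not stop words; removing them would be a separate step.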


Next, we continue with **Stemming** and **Lemmatization**.

## Stemming

**Stemming** is a text processing task in which words are reduced to their root form. For example, the words "helping" and "helper" share the root "help". Stemming lets you focus on the basic meaning of a word rather than on all the details of how it is used. NLTK has more than one stemmer; here we use the **Porter stemmer**.

In [19]:
from nltk.stem import PorterStemmer

teks_untuk_stemming = "studying loving lovingly loved lover lovely repeatedly"

token_teks_untuk_stemming = word_tokenize(teks_untuk_stemming)

stemmer = PorterStemmer()

stemmed_words = [stemmer.stem(word) for word in token_teks_untuk_stemming]
print(stemmed_words)

['studi', 'love', 'lovingli', 'love', 'lover', 'love', 'repeatedli']


#### Code explanation ####
1. `from nltk.stem import PorterStemmer` imports the Porter stemmer.
1. `teks_untuk_stemming` is the variable that holds the words to be stemmed.
1. `stemmer = PorterStemmer()` creates a stemmer object.
1. `stemmed_words` is the variable that holds the stemmed words.
1. `stemmer.stem(word)` stems a single token (word).


The output of the previous code is:
`['studi', 'love', 'lovingli', 'love', 'lover', 'love', 'repeatedli']`
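
Stems such as 'studi' and 'lovingli' are not dictionary words, because a stemmer applies suffix rules rather than looking words up. The sketch below is a deliberately naive suffix stripper that shows the general idea. It is not the Porter algorithm, and its rule list `TOY_SUFFIXES` and function `toy_stem` are invented for this illustration.

```python
# Illustrative suffixes only, tried in order (longest first).
TOY_SUFFIXES = ["ingly", "edly", "ing", "ed", "ly"]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


words = ["studying", "loving", "lovingly", "loved", "lover", "lovely", "repeatedly"]
toy_stems = [toy_stem(word) for word in words]
print(toy_stems)
```

Even this toy version maps 'loving', 'lovingly', and 'loved' to the same stem, which is the practical goal of stemming; real stemmers use many more rules and conditions.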

Besides PorterStemmer, NLTK provides several other stemmers, such as LancasterStemmer, SnowballStemmer, and ARLSTem.

Let us compare the stemmers one by one. The steps are:
1. First, import the stemmer classes.

In [20]:
# import the stemmer classes
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

2. Second, create a variable for the text to be stemmed and convert it into tokens.

In [21]:
teks_untuk_stemming = "studying loving lovingly loved lover lovely repeatedly"
token_teks_untuk_stemming = word_tokenize(teks_untuk_stemming)

3. Third, create an object for each stemmer.

In [22]:
porterStemmer = PorterStemmer()
lancasterStemmer = LancasterStemmer()
snowballStemmer = SnowballStemmer(language="english")

4. Fourth, stem each token with every stemmer.

In [23]:
porter_stemmed_words = []
lancaster_stemmed_words = []
snowball_stemmed_words = []

for word in token_teks_untuk_stemming:
    porter_stemmed_words.append(porterStemmer.stem(word))
    lancaster_stemmed_words.append(lancasterStemmer.stem(word))
    snowball_stemmed_words.append(snowballStemmer.stem(word))

In [24]:
print("Porter: ",porter_stemmed_words)
print("Lancaster: ", lancaster_stemmed_words)
print("Snowball: ",snowball_stemmed_words)

Porter:  ['studi', 'love', 'lovingli', 'love', 'lover', 'love', 'repeatedli']
Lancaster:  ['study', 'lov', 'lov', 'lov', 'lov', 'lov', 'rep']
Snowball:  ['studi', 'love', 'love', 'love', 'lover', 'love', 'repeat']


#### Code explanation ####
1. `from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer` imports the three stemmers in a single statement.
1. Creating the stemmer objects
```python
porterStemmer = PorterStemmer()
lancasterStemmer = LancasterStemmer()
snowballStemmer = SnowballStemmer(language="english")
```
instantiates an object of each stemmer.
1. The variables that hold the stemmed words
```python
porter_stemmed_words = []
lancaster_stemmed_words = []
snowball_stemmed_words = []
```
are initialized as empty lists.
1. Printing the result of each stemmer
```python
print("Porter: ",porter_stemmed_words)
print("Lancaster: ", lancaster_stemmed_words)
print("Snowball: ",snowball_stemmed_words)
```
prints the stemming results.
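
To make the three result lists easier to compare, they can be lined up word by word. The sketch below hard-codes the outputs printed in this section instead of calling NLTK, so it runs on its own; the names `rows` and `porter_vs_snowball` are chosen here for illustration.

```python
words = ["studying", "loving", "lovingly", "loved", "lover", "lovely", "repeatedly"]

# Results copied from the output shown in this section.
porter = ['studi', 'love', 'lovingli', 'love', 'lover', 'love', 'repeatedli']
lancaster = ['study', 'lov', 'lov', 'lov', 'lov', 'lov', 'rep']
snowball = ['studi', 'love', 'love', 'love', 'lover', 'love', 'repeat']

# zip() pairs the i-th word with its i-th stem from each stemmer.
rows = list(zip(words, porter, lancaster, snowball))

print(f"{'word':<12}{'Porter':<12}{'Lancaster':<12}{'Snowball':<12}")
for row in rows:
    print("".join(f"{cell:<12}" for cell in row))

# Words on which the Porter and Snowball stemmers disagree.
porter_vs_snowball = [word for word, p, _, s in rows if p != s]
print(porter_vs_snowball)
```

A side-by-side table like this is a convenient starting point for the question that follows.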

### Question ###
From the results of each stemming technique, what can you conclude?

## Lemmatization ##

Unlike stemming, lemmatization goes beyond simply trimming words: it takes into account the full vocabulary of the language and the morphological structure of words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Furthermore, the lemma of 'meeting' may be 'meet' or 'meeting', depending on how it is used in a sentence.
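
The 'meeting' case shows that a lemma depends on the word's part of speech. The toy sketch below illustrates this with a hand-written lookup table. It is not how a real lemmatizer works internally (WordNetLemmatizer consults the WordNet database), and the table `TOY_LEMMAS` and function `toy_lemmatize` are hypothetical names for this example only.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech),
# where "n" stands for noun and "v" for verb.
TOY_LEMMAS = {
    ("was", "v"): "be",
    ("mice", "n"): "mouse",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_lemmatize(word, pos):
    """Look up the lemma; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("was", "v"))
print(toy_lemmatize("mice", "n"))
print(toy_lemmatize("meeting", "v"))  # "we are meeting tomorrow"
print(toy_lemmatize("meeting", "n"))  # "the meeting was long"
```

NLTK's `WordNetLemmatizer.lemmatize()` similarly accepts an optional `pos` argument and treats the word as a noun when it is omitted.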

In [25]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/oddy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [26]:
lemmatizer = WordNetLemmatizer()

An example: the lemma of the word 'scarves'.

In [27]:
print(lemmatizer.lemmatize("scarves"))

scarf


The previous output shows that the lemma of 'scarves' is 'scarf'. **Lemmatization converts a word to its dictionary form**. Let us try it again on the sentence "*The striped bats are hanging on their feet in a dangerous place*". Here we compare lemmatizing tokens that were first stemmed with the Porter stemmer against lemmatizing the original, unstemmed tokens.


In [28]:
kalimat = "The striped bats are hanging on their feet in a dangerous place"
kalimat_token = word_tokenize(kalimat)

#### With PorterStemmer ####

In [29]:
kalimat_token_stemmed = [porterStemmer.stem(word) for word in kalimat_token]
kalimat_token_lemmatized_stemmed = [lemmatizer.lemmatize(word) for word in kalimat_token_stemmed]

print("Stemming without lemmatization: ",kalimat_token_stemmed)
print("Stemming and lemmatization: ",kalimat_token_lemmatized_stemmed)

Stemming without lemmatization:  ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'in', 'a', 'danger', 'place']
Stemming and lemmatization:  ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'foot', 'in', 'a', 'danger', 'place']


#### Without PorterStemmer ####

In [30]:
kalimat_token_lemmatized_non_stemmed = [lemmatizer.lemmatize(word) for word in kalimat_token]

print("Lemmatization without stemming: ",kalimat_token_lemmatized_non_stemmed)

Lemmatization without stemming:  ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'in', 'a', 'dangerous', 'place']


#### Code explanation ####
1. `from nltk.stem import WordNetLemmatizer` imports WordNetLemmatizer, NLTK's lemmatizer.
1. `nltk.download('wordnet')` downloads the WordNet corpus that the lemmatizer uses.
1. `lemmatizer = WordNetLemmatizer()` instantiates a WordNetLemmatizer object.
1. `lemmatizer.lemmatize("scarves")` is an example of lemmatizing the word "scarves", which returns its base form.
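
To see exactly which tokens the lemmatizer changed in the unstemmed sentence, the original tokens can be compared with the lemmatized output. The sketch below hard-codes the output printed earlier in this section rather than calling NLTK, and the variable names are chosen for illustration.

```python
# The sentence has no punctuation, so a plain split() yields its word tokens.
original = "The striped bats are hanging on their feet in a dangerous place".split()

# Lemmatized tokens copied from the output shown in this section.
lemmatized = ['The', 'striped', 'bat', 'are', 'hanging', 'on',
              'their', 'foot', 'in', 'a', 'dangerous', 'place']

# Keep only the (before, after) pairs that differ.
changed = [(before, after) for before, after in zip(original, lemmatized)
           if before != after]
print(changed)
```

Only the plural nouns changed. A likely reason is that `lemmatize()` treats every word as a noun unless a `pos` argument is supplied, so verb forms such as 'are' and 'hanging' are left as they are.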

That concludes Practicum 1. May your learning go smoothly and be full of blessings.

## References ##

1. N. Indurkhya and F. J. Damerau, Handbook of Natural Language Processing. CRC Press, 2010.
1. https://realpython.com/nltk-nlp-python/
1. https://www.nltk.org/index.html

# Practicum 1 Assignment #

## Assignment Title ##
Text Preprocessing

## Description ##
Use the preprocessing techniques covered in lecture session 3 and Practicum 1 to process the following text:

*South Africa's health minister says there is "absolutely no need to panic" over the new coronavirus variant Omicron, despite a surge in cases. "We have been here before," Joe Phaahla added, referring to the Beta variant detected in South Africa last December. South Africa also condemned the travel bans imposed on the country, saying they should be lifted immediately. Omicron has been classed as a "variant of concern". Early evidence suggests it has a heightened re-infection risk. The heavily mutated variant was detected in South Africa earlier this month and then reported to the World Health Organization (WHO) last Wednesday. The variant is responsible for most of the infections found in South Africa's most populated province, Gauteng, over the last two weeks. The number of cases of "appears to be increasing in almost all provinces" in the country, according to the WHO. Does southern Africa have enough vaccines? South Africans fear impact of new variant measures Covid variants: Do we need new vaccines yet? Africa Live: More on this and other stories from the continent South Africa reported 2,800 new infections on Sunday, a rise from the daily average of 500 in the previous week. Government adviser and epidemiologist Salim Abdool Karim said he expected the number of cases to reach more than 10,000 a day by the end of the week, and for hospitals to come under pressure in the next two to three weeks. Dr Phaahla said he wanted to "reiterate that there is absolutely no need to panic" because this "is no new territory for us". "We are now more than 20 months' experienced in terms of Covid-19, various variants and waves," he added at a media briefing. Media caption, Cyril Ramaphosa: "We are deeply disappointed by the decision of several countries to prohibit travel." On Monday, Japan became the latest country to reinstate tough border restrictions, banning all foreigners from entering from 30 November. 
The UK, EU and US are among those who earlier imposed travel bans on South Africa and other regional states. UN Secretary General António Guterres said he was "deeply concerned" about the isolation of southern Africa, adding that "the people of Africa cannot be blamed for the immorally low level of vaccinations available in Africa". The bans and restrictions have left the plans of a huge number of travellers up in the air. South African Annalee Veysey, who is getting treatment for cancer in South Africa, was expecting to be reunited with her family in the UK early in December. She has been separated from them for the last 15 months because of earlier travel restrictions and her treatment. "It's almost two years of my life I've missed out with my family. Especially if you've had a journey with cancer, you find what your family means to you," she told the BBC, adding that she felt "desperate". Travellers at a near empty airportImage source, EPA Image caption, South Africa's main airport in Johannesburg was getting quieter over the weekend as restrictions were taking effect Hannah Day is stuck in Pretoria. She flew to South Africa last week after she got news that her son, who lives there, was in hospital after being bitten by a snake. He is now recovering but Ms Day needs to return to the UK for work. "I can self-isolate, but I cannot afford to pay for the quarantine," she told the BBC. The WHO has warned against countries hastily imposing travel curbs, saying they should look to a "risk-based and scientific approach". The world body's Africa director Matshidiso Moeti said on Sunday: "With the Omicron variant now detected in several regions of the world, putting in place travel bans that target Africa attacks global solidarity." However, Rwanda and Angola are among African states that have announced a restriction on flights to and from South Africa. 
South Africa's foreign ministry spokesman Clayson Monyela described their decision as "quite regrettable, very unfortunate, and I will even say sad". In a speech on Sunday, South Africa's President Cyril Ramaphosa said the bans would not be effective in preventing the spread of the variant. "The only thing the prohibition on travel will do is to further damage the economies of the affected countries and undermine their ability to respond to, and recover from, the pandemic," he said. Current regulations in South Africa make it mandatory to wear face coverings in public, and restrict indoor gatherings to 750 people and outdoor gatherings to 2,000. Mr Ramaphosa said South Africa would not impose new restrictions, but would "undertake broad consultations on making vaccination mandatory for specific activities and locations". There are no vaccine shortages in South Africa itself, and Mr Ramaphosa urged more people to get jabbed, saying that remained the best way to fight the virus. Health experts said that Gauteng, which includes Johannesburg, had entered a fourth wave, and most hospital admissions were of unvaccinated people. Omicron has now been detected in a number of countries around the world, including the UK, Germany, Australia and Israel. In other developments: China said it would offer 1bn doses of vaccines to African countries on top of the 200m it had already supplied US Covid adviser Anthony Fauci says the government is on "high alert" and that spread is inevitable A Czech woman who came back from Namibia recently was confirmed to have the Omicron variant Portugal has detected 13 cases of the variant among players and staff of Lisbon-based Belenenses SAD football club Australia has paused its plans to reopen its borders in light of the Omicron variant*

## Grading ##
1. The program must be written in a Jupyter Notebook (10 points)
1. There must be a preprocess() function that takes the text as its parameter. Example: preprocess(teks) (10 points)
1. There must be a get_word_token(teks) function (20 points)
1. There must be a remove_stopword(teks) function (20 points)
1. There must be a porter_stem(teks) function (20 points)
1. There must be a lemma(teks) function (10 points)
1. Count the number of words before and after preprocessing (10 points)

## Notes ##
1. Submit the Practicum 1 assignment on Google Classroom with the file name TP1_NIM_Nama_Lengkap_Anda.ipynb, where NIM is your student ID number and Nama_Lengkap_Anda is your full name.
1. An incorrect file name or file format will result in the Practicum 1 assignment being rejected.