# APPLICATION: Introduction to Text Mining

## Part 1: Machine Learning with Scikit-Learn (review)

In [1]:
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

In [2]:
# store the feature matrix X and the target vector y
X = iris.data
y = iris.target

**"Features"** are also called attributes, predictors, or inputs. The **"target"** is also called the label.

In [3]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


**"Observations"** are also called samples; here there are 150 of them, each described by 4 features.

In [4]:
# view the first 5 rows of the feature matrix
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [5]:
# view the label vector and the class names
print(y)
iris.target_names

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

To **build a model**, the features must be **numeric**, and every sample must have **the same features in the same order**.

In [6]:
# import the classifier
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model with default parameters
knn = KNeighborsClassifier()

# train the model
knn.fit(X, y)

To **make a prediction**, a new observation must have **the same features as the training data**, both in number and in meaning.

In [7]:
# predict the class of a new observation
knn.predict([[1, 1, 1, 1]])

array([0])
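As a small illustration of that requirement, the sketch below refits the same default `KNeighborsClassifier`, maps the predicted integer back to a class name through `iris.target_names`, and checks that an observation with the wrong number of features is rejected. The three-feature observation is made up for illustration, and the exact error message depends on the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier().fit(iris.data, iris.target)

# Same observation as above: four features, in the training order
pred = knn.predict([[1, 1, 1, 1]])
pred_name = iris.target_names[pred[0]]
print(pred, pred_name)

# Hypothetical observation with only three features
try:
    knn.predict([[1, 1, 1]])
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)
```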

## Part 2: The Bag of Words Model

In [8]:
# example texts for training the model
corpus = ["Saya sedang belajar Data Science",
          "Python merupakan salah satu tools Data Science",
          "Machine learning adalah salah satu cabang data science",
          "Scikit learn membuat machine learning menjadi lebih mudah",
          "Banyak data data tersebar di internet"]

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert text into a matrix of token counts:

In [9]:
# instantiate the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

In [10]:
# learn the vocabulary of the corpus
vectorizer.fit(corpus)

In [12]:
# view the learned vocabulary
vectorizer.get_feature_names_out()

array(['adalah', 'banyak', 'belajar', 'cabang', 'data', 'di', 'internet',
       'learn', 'learning', 'lebih', 'machine', 'membuat', 'menjadi',
       'merupakan', 'mudah', 'python', 'salah', 'satu', 'saya', 'science',
       'scikit', 'sedang', 'tersebar', 'tools'], dtype=object)

In [13]:
# transform the corpus list into a feature matrix
corpus_vect = vectorizer.transform(corpus)
corpus_vect

<5x24 sparse matrix of type '<class 'numpy.int64'>'
	with 33 stored elements in Compressed Sparse Row format>

In [14]:
# convert the sparse matrix into a dense matrix
corpus_vect.toarray()

array([[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0,
        0, 1],
       [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
        0, 0],
       [0, 1, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0]])

In [16]:
# label each column with its token using a pandas DataFrame
pd.DataFrame(corpus_vect.toarray(),columns=vectorizer.get_feature_names_out())

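|   | adalah | banyak | belajar | cabang | data | di | internet | learn | learning | lebih | ... | mudah | python | salah | satu | saya | science | scikit | sedang | tersebar | tools |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |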


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
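The tokenize-and-count strategy described in the quote can be sketched in plain Python. This is not how `CountVectorizer` is implemented; it only assumes the same default behaviour (lowercasing, and tokens of two or more word characters) and uses the five example sentences from the corpus above.

```python
import re
from collections import Counter

corpus = ["Saya sedang belajar Data Science",
          "Python merupakan salah satu tools Data Science",
          "Machine learning adalah salah satu cabang data science",
          "Scikit learn membuat machine learning menjadi lebih mudah",
          "Banyak data data tersebar di internet"]

def tokenize(text):
    # Assumption: lowercase, then keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Count tokens per document
token_counts = [Counter(tokenize(doc)) for doc in corpus]

# One column per distinct token, in alphabetical order
vocab = sorted(set().union(*token_counts))

# One row per document; a missing token counts as 0
matrix = [[counts[word] for word in vocab] for counts in token_counts]

print(len(vocab))
print(matrix[4])
```

Word order inside each sentence plays no role in `matrix`, which is the "bag" in Bag of Words.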

In [17]:
# check the type of the corpus
type(corpus)

list

In [18]:
# view the corpus
print(corpus)

['Saya sedang belajar Data Science', 'Python merupakan salah satu tools Data Science', 'Machine learning adalah salah satu cabang data science', 'Scikit learn membuat machine learning menjadi lebih mudah', 'Banyak data data tersebar di internet']


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
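To make the idea of a sparse representation concrete, here is a minimal sketch using `scipy.sparse.csr_matrix` on a small made-up count matrix. Only the non-zero entries are stored, and the density is the share of stored entries among all cells. For comparison, the 5x24 corpus matrix above stores 33 of its 120 cells; real corpora are far sparser.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical count matrix: 3 documents, 6 tokens, mostly zeros
dense = np.array([[0, 0, 1, 0, 0, 0],
                  [0, 2, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 1]])

sparse = csr_matrix(dense)

# nnz is the number of stored (non-zero) entries
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, round(density, 3))
```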

In [19]:
# example text for testing the model
new_text = ["Saya sedang belajar pemrosesan teks"]

To **make a prediction**, a new observation must have **the same features as the training data**, both in number and in meaning.

In [20]:
# transform the new text into a matrix
new_text_vect = vectorizer.transform(new_text)

In [22]:
# view it as a pandas DataFrame
pd.DataFrame(new_text_vect.toarray(),columns=vectorizer.get_feature_names_out())

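|   | adalah | banyak | belajar | cabang | data | di | internet | learn | learning | lebih | ... | mudah | python | salah | satu | saya | science | scikit | sedang | tersebar | tools |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |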


**Summary:**

- `vect.fit(train)` **learns the vocabulary** from the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build the feature matrix for the training data
- `vect.transform(test)` uses the **same fitted vocabulary** to build the feature matrix for the test data

## Part 3: Loading the Data

In [23]:
# read the data
import pandas as pd
data_teks = pd.read_csv('data/dataset_sms_spam _v1.csv')

In [24]:
# rename the columns
data_teks.columns = ['sms','kategori']

In [25]:
# check the shape
data_teks.shape

(1143, 2)

In [26]:
# view the first rows
data_teks.head()

|   | sms | kategori |
|---|---|---|
| 0 | [PROMO] Beli paket Flash mulai 1GB di MY TELKO... | 2 |
| 1 | 2.5 GB/30 hari hanya Rp 35 Ribu Spesial buat A... | 2 |
| 2 | 2016-07-08 11:47:11.Plg Yth, sisa kuota Flash ... | 2 |
| 3 | 2016-08-07 11:29:47.Plg Yth, sisa kuota Flash ... | 2 |
| 4 | 4.5GB/30 hari hanya Rp 55 Ribu Spesial buat an... | 2 |


In [27]:
# view the class distribution
data_teks.kategori.value_counts()

0    569
1    335
2    239
Name: kategori, dtype: int64

In [52]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
sms_train, sms_test, label_train, label_test = train_test_split(
    data_teks.sms,
    data_teks.kategori,
    test_size=0.25,
    random_state=46,
)
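The split above is a plain random split. Because the three classes are imbalanced (569, 335 and 239 messages), one possible variant is a stratified split, which keeps the class proportions the same in both subsets. The sketch below is an assumption about how that could look, not part of the original workflow; it uses the `stratify` parameter of `train_test_split` on a made-up DataFrame with the same column names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 messages, class sizes 20 / 12 / 8
toy = pd.DataFrame({
    "sms": [f"message {i}" for i in range(40)],
    "kategori": [0] * 20 + [1] * 12 + [2] * 8,
})

x_train, x_test, y_train, y_test = train_test_split(
    toy.sms,
    toy.kategori,
    test_size=0.25,
    random_state=46,
    stratify=toy.kategori,  # preserve class proportions in both subsets
)

# Class counts in the test set, ordered by label
test_counts = y_test.value_counts().sort_index().tolist()
print(test_counts)
```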

## Part 4: Vectorization

In [53]:
# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [54]:
# learn the vocabulary and transform the training data into a matrix
vect.fit(sms_train)
sms_train_vect = vect.transform(sms_train)

In [55]:
# alternative: do both in a single step
sms_train_vect = vect.fit_transform(sms_train)

In [56]:
# view the feature matrix
sms_train_vect

<857x4098 sparse matrix of type '<class 'numpy.int64'>'
	with 14259 stored elements in Compressed Sparse Row format>

In [57]:
# transform the test data with the same fitted vocabulary
sms_test_vect = vect.transform(sms_test)
sms_test_vect

<286x4098 sparse matrix of type '<class 'numpy.int64'>'
	with 4025 stored elements in Compressed Sparse Row format>

## Part 5: Classification

For example, we can use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
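The last sentence of the quote says fractional counts such as tf-idf may also work. A minimal sketch of that variant, assuming `TfidfVectorizer` in place of `CountVectorizer`; the texts and the 0/1 labels are made up and do not correspond to the categories of the SMS dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = promotional, 0 = ordinary message
texts = ["promo kuota murah", "kuota internet promo",
         "halo apa kabar", "kabar baik halo"]
labels = [1, 1, 0, 0]

# tf-idf weights are fractional, not integer counts
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)

nb = MultinomialNB().fit(features, labels)
pred = nb.predict(tfidf.transform(["promo kuota", "halo kabar"])).tolist()
print(pred)
```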

In [58]:
# import and instantiate the classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

In [59]:
# train the model and measure the execution time
%time model.fit(sms_train_vect, label_train)

CPU times: user 5.57 ms, sys: 2.36 ms, total: 7.94 ms
Wall time: 7.36 ms


In [60]:
# make predictions on the test data
sms_test_res = model.predict(sms_test_vect)

In [61]:
# compute accuracy (scikit-learn's convention is y_true first, then y_pred)
from sklearn.metrics import accuracy_score
accuracy_score(label_test, sms_test_res)

0.916083916083916

In [62]:
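# compute the confusion matrix
# note: predictions are passed first, so rows are predicted classes and columns are true classes;
# the usual convention, confusion_matrix(y_true, y_pred), would give the transpose
from sklearn.metrics import confusion_matrix
confusion_matrix(sms_test_res, label_test)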


array([[129,   0,   0],
       [  5,  75,   6],
       [  4,   9,  58]])
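The accuracy reported earlier can be recovered from this matrix, along with per-class precision and recall. The sketch below copies the matrix from the output above. Because the call passes the predictions as the first argument, rows are predicted classes and columns are true classes, so row sums give precision denominators and column sums give recall denominators.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = predicted class, columns = true class)
cm = np.array([[129, 0, 0],
               [5, 75, 6],
               [4, 9, 58]])

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Precision: of everything predicted as class k, how much is truly class k
precision = np.diag(cm) / cm.sum(axis=1)

# Recall: of everything truly class k, how much was predicted as class k
recall = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(precision.round(3))
print(recall.round(3))
```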

In [63]:
# compute the class probabilities
sms_test_prob = model.predict_proba(sms_test_vect)
sms_test_prob[:5]

array([[7.37247730e-04, 8.08217674e-02, 9.18440985e-01],
       [9.51261105e-01, 3.64215814e-04, 4.83746787e-02],
       [9.88955102e-01, 7.73692581e-03, 3.30797262e-03],
       [3.88652619e-08, 9.99720415e-01, 2.79546580e-04],
       [9.99982437e-01, 2.78359900e-06, 1.47789848e-05]])
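Each row of this output is a probability distribution over the three classes, and the predicted class is the column with the highest value. The sketch below checks both properties on the five rows copied from the output above. It assumes the columns are ordered as labels 0, 1, 2, which is how `model.classes_` would order these integer labels.

```python
import numpy as np

# First five rows copied from the predict_proba output above
probs = np.array([[7.37247730e-04, 8.08217674e-02, 9.18440985e-01],
                  [9.51261105e-01, 3.64215814e-04, 4.83746787e-02],
                  [9.88955102e-01, 7.73692581e-03, 3.30797262e-03],
                  [3.88652619e-08, 9.99720415e-01, 2.79546580e-04],
                  [9.99982437e-01, 2.78359900e-06, 1.47789848e-05]])

row_sums = probs.sum(axis=1)     # each row should sum to about 1
top_class = probs.argmax(axis=1) # column index of the most probable class
top_prob = probs.max(axis=1)     # how confident the model is in that class

print(top_class)
print(top_prob.round(3))
```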

## Part 6: Inference

In [64]:
# new, unseen SMS messages
new_sms = ['halo, apa kabar ?',
           'kuota murah, hanya 1000 per hari, klik disini',
           'pesugihan halal, dapatkan uang tunai banyak']

In [65]:
# vectorize with the fitted vocabulary, then predict the category
new_sms_vect = vect.transform(new_sms)
pred = model.predict(new_sms_vect)
print(pred)

[0 2 1]
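The vectorize-then-classify steps from Parts 4 to 6 can also be bundled into a single object with `make_pipeline`, so that raw text goes in and a predicted label comes out. This is a sketch of one possible arrangement, not the workflow used above; the texts and the 0/1 labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = promotional, 0 = ordinary message
texts = ["promo kuota murah", "kuota internet promo",
         "halo apa kabar", "kabar baik halo"]
labels = [1, 1, 0, 0]

# fit() learns the vocabulary and trains the classifier in one call
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

# predict() applies transform() with the fitted vocabulary, then classifies
pred = pipe.predict(["promo kuota", "halo kabar"]).tolist()
print(pred)
```

A pipeline also removes the risk of accidentally refitting the vectorizer on test or new data.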
