# MFM: Introduction to Text Mining

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Part 1: Machine Learning with Scikit-Learn (review)

In [2]:
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
# store the feature matrix X and the target vector y
X = iris.data
y = iris.target

**"Features"** are also called attributes, predictors, or inputs. The **"target"** is also called the label.

In [4]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


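**"Observations"** are also called samples. The first dimension of `X.shape` (150) is the number of samples, and the second (4) is the number of features.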

In [5]:
# view the first 5 rows of the feature matrix
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [6]:
# view the label vector and the class names
print(y)
iris.target_names

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

To **build a model**, the features must be **numeric**, and every sample must have **the same features in the same order**.

In [7]:
# import the classifier
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model with default parameters
knn = KNeighborsClassifier()

# train the model
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

To **make a prediction**, a new observation must have **the same features as the training data**, both in number and in meaning.

In [9]:
# predict the class of a new observation
knn.predict([[1, 1, 1, 1]])

array([0])
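
The returned `0` is an index into `iris.target_names`, so the prediction can be mapped back to a species name. The sketch below repeats the training step so that it runs on its own, and it also shows that an observation with the wrong number of features is rejected. The names `pred`, `label` and `error_raised` are illustrative and do not come from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier()
knn.fit(iris.data, iris.target)

# predict() returns class indices; look the name up in target_names
pred = knn.predict([[1, 1, 1, 1]])
label = iris.target_names[pred[0]]
print(pred, label)

# an observation with 3 features instead of 4 does not match the training data
try:
    knn.predict([[1, 1, 1]])
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)
```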

## Part 2: Bag of Words Model

In [None]:
# example text for training the model
corpus = [
    'Kami sedang belajar machine learning',
    'Kami mempelajari machine learning untuk teks',
    'Machine learning adalah pembelajaran mesin',
    'Kami sangat antusias belajar machine learning',
    'banyak data yang bertebaran di internet'
]


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert "text into a matrix":

In [None]:
# instantiate the bag-of-words model


In [None]:
# learn the vocabulary of the corpus


In [None]:
# view the vocabulary


In [None]:
# transform the corpus list into a feature matrix


In [None]:
# convert the sparse matrix into a dense matrix


In [None]:
# interpret the features using a pandas DataFrame
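
One possible way to fill in the steps above is sketched here. It is not necessarily the notebook's intended solution: the names `vect`, `vocab`, `dtm` and `df` are assumptions, and the column labels are read from the fitted `vocabulary_` attribute so that the sketch does not depend on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Kami sedang belajar machine learning',
    'Kami mempelajari machine learning untuk teks',
    'Machine learning adalah pembelajaran mesin',
    'Kami sangat antusias belajar machine learning',
    'banyak data yang bertebaran di internet',
]

# instantiate the bag-of-words model and learn the vocabulary
vect = CountVectorizer()
vect.fit(corpus)

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)

# transform the corpus into a sparse document-term matrix
dtm = vect.transform(corpus)

# densify and label the columns to see what each feature means
df = pd.DataFrame(dtm.toarray(), columns=vocab)
print(df)
```

With the default settings, `CountVectorizer` lowercases the text and keeps tokens of two or more characters, so `Machine` and `machine` share one column.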


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
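
The last point of the quote, that word positions are ignored, can be checked with a small sketch. The two documents in `docs` are hypothetical and contain the same words in a different order.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# two hypothetical documents with the same words in a different order
docs = ['machine learning untuk teks', 'teks untuk learning machine']

vect = CountVectorizer()
counts = vect.fit_transform(docs).toarray()

# one row per document, one column per token
print(counts.shape)

# the rows are identical because relative word positions are not encoded
same = bool(np.array_equal(counts[0], counts[1]))
print(same)
```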

In [None]:
# check the type of the corpus


In [None]:
# view the corpus


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
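
To make the sparsity argument concrete, the following sketch measures how many entries of the document-term matrix for the small corpus are zero. The names `dtm`, `n_cells` and `zero_fraction` are illustrative. A five-document corpus is far less sparse than the 99% mentioned in the quote, but most entries are already zero.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Kami sedang belajar machine learning',
    'Kami mempelajari machine learning untuk teks',
    'Machine learning adalah pembelajaran mesin',
    'Kami sangat antusias belajar machine learning',
    'banyak data yang bertebaran di internet',
]

# fit_transform returns a scipy.sparse matrix, not a dense array
dtm = CountVectorizer().fit_transform(corpus)
is_sparse = sparse.issparse(dtm)
print(type(dtm), is_sparse)

# nnz counts the stored non-zero entries; the remaining cells are implicit zeros
n_cells = dtm.shape[0] * dtm.shape[1]
zero_fraction = 1 - dtm.nnz / n_cells
print(dtm.nnz, n_cells, round(zero_fraction, 2))
```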

In [None]:
# example text for testing the model


To **make a prediction**, a new observation must have **the same features as the training data**, both in number and in meaning.

In [None]:
# transform the new text into a matrix


In [None]:
# view it using a pandas DataFrame
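
A minimal sketch of this step, assuming a hypothetical new document `new_text`: the vectorizer is fitted on the training corpus only, so the new text is mapped onto the existing columns and any word that was not seen during fitting (here `suka`) is dropped.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Kami sedang belajar machine learning',
    'Kami mempelajari machine learning untuk teks',
    'Machine learning adalah pembelajaran mesin',
    'Kami sangat antusias belajar machine learning',
    'banyak data yang bertebaran di internet',
]

vect = CountVectorizer()
vect.fit(corpus)

# hypothetical new document; 'suka' is not in the training vocabulary
new_text = ['Kami suka machine learning']
new_dtm = vect.transform(new_text)

# same columns as the training matrix, in the same order
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
new_df = pd.DataFrame(new_dtm.toarray(), columns=vocab)
print(new_df)
```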


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build the feature matrix of the training data
- `vect.transform(test)` uses the **same fitted vocabulary** to build the feature matrix of the test data, so tokens unseen during training are ignored

## Part 3: Loading the Data

In [None]:
# read the data


In [None]:
# check the shape


In [None]:
# view the data


In [None]:
# view the class distribution


In [None]:
# split the data into training and test sets
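
The notebook does not name the dataset used in this part, so the sketch below builds a small hypothetical `DataFrame` in memory as a stand-in. The column names `text` and `label`, the example rows, and `random_state=1` are all assumptions made for illustration; with real data the first step would typically be a call such as `pd.read_csv` on the actual file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the real dataset, which the notebook does not specify
data = pd.DataFrame({
    'text': [
        'belajar machine learning', 'machine learning untuk teks',
        'model machine learning', 'pembelajaran mesin',
        'data di internet', 'banyak data bertebaran',
        'internet dan data', 'data yang banyak',
    ],
    'label': ['ml', 'ml', 'ml', 'ml', 'data', 'data', 'data', 'data'],
})

# shape, first rows, and class distribution
print(data.shape)
print(data.head())
print(data['label'].value_counts())

# split into training and test sets (default test size is 25%)
X = data['text']
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)
```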


## Part 4: Vectorization

In [None]:
# instantiate the vectorizer


In [None]:
# learn the vocabulary and transform the training data into a matrix


In [None]:
# one-step alternative


In [None]:
# view the feature vectors


In [None]:
# do the same for the test data
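
The steps of this part might look like the sketch below. The lists `X_train` and `X_test` are hypothetical placeholders for the real split, and the `_dtm` suffix is only a naming choice. The one-step alternative is `fit_transform`, which gives the same matrix as `fit` followed by `transform`; the test data is only transformed, never fitted.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical training and test documents
X_train = [
    'Kami sedang belajar machine learning',
    'Kami mempelajari machine learning untuk teks',
    'Machine learning adalah pembelajaran mesin',
]
X_test = ['Kami antusias belajar teks']

# two steps: learn the vocabulary, then build the training matrix
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

# one step: fit_transform combines both calls
X_train_dtm_alt = CountVectorizer().fit_transform(X_train)
same = bool(np.array_equal(X_train_dtm.toarray(), X_train_dtm_alt.toarray()))
print(same)

# test data: transform only, using the vocabulary learned from training
X_test_dtm = vect.transform(X_test)
print(X_train_dtm.shape, X_test_dtm.shape)
```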


## Part 5: Classification

As an example, we use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [None]:
# import the classifier


In [None]:
# train the model and measure the execution time


In [None]:
# make predictions


In [None]:
# compute the accuracy


In [None]:
# use the confusion matrix


In [None]:
# compute the predicted probabilities
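
A compact sketch of the whole classification workflow on hypothetical toy data is given here. The texts, the labels `ml` and `data`, and all variable names are assumptions made for illustration and stand in for the real training and test sets.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy data standing in for the real training and test sets
train_texts = ['machine learning model', 'belajar machine learning',
               'data di internet', 'banyak data internet']
train_labels = ['ml', 'ml', 'data', 'data']
test_texts = ['model machine learning', 'data internet']
test_labels = ['ml', 'data']

# word counts are the discrete features that MultinomialNB expects
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_texts)
X_test_dtm = vect.transform(test_texts)

# train the classifier
nb = MultinomialNB()
nb.fit(X_train_dtm, train_labels)

# predict classes for the test documents
y_pred = nb.predict(X_test_dtm)

# accuracy and confusion matrix (rows are true classes, columns are predictions)
accuracy = metrics.accuracy_score(test_labels, y_pred)
cm = metrics.confusion_matrix(test_labels, y_pred)
print(accuracy)
print(cm)

# class probabilities; the columns follow the order of nb.classes_
proba = nb.predict_proba(X_test_dtm)
print(nb.classes_)
print(proba)
```

In a notebook, prefixing the `nb.fit(...)` line with the IPython magic `%time` reports how long training takes.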


## Part 6: Inference