# Practicum 4 - Text Classification
Based on [pycon-2016-tutorial](https://github.com/justmarkham/pycon-2016-tutorial/). Text classification is a part of Natural Language Processing (NLP): it assigns a text to the appropriate label/class.
## Agenda
1. Model building in scikit-learn
2. Representing text as numerical data
3. Reading a text-based dataset into pandas
4. Vectorizing our dataset
5. Building and evaluating a model
6. Comparing models (Naive Bayes and Logistic Regression)

## Part 1: Model building in scikit-learn
<b>scikit-learn</b> is a Python machine learning library for data mining and data analysis. The following example uses <b>scikit-learn</b> on the [iris](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/) dataset.

In [None]:
# Load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
# Define the feature matrix (X) and the response vector (y)
X = iris.data
y = iris.target

<b>"Features"</b> are also known as predictors, inputs, or attributes. The <b>"response"</b> is also known as the target, label, or output.

In [None]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA

# import some data to play with
iris = datasets.load_iris()
# Use a separate name for the plot data so that X and y (defined earlier)
# keep all four features for the model built later in this part
X_plot = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

x_min, x_max = X_plot[:, 0].min() - .5, X_plot[:, 0].max() + .5
y_min, y_max = X_plot[:, 1].min() - .5, X_plot[:, 1].max() + .5

plt.figure(2, figsize=(8, 6))
plt.clf()

# Plot the training points
plt.scatter(X_plot[:, 0], X_plot[:, 1], c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

# To get a better understanding of the interaction of the dimensions,
# plot the first three PCA dimensions
fig = plt.figure(1, figsize=(8, 6))
# add_subplot with projection='3d' replaces the deprecated Axes3D(fig, ...) call
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
           cmap=plt.cm.Paired)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])

plt.show()

<b>"Observations"</b> are also known as samples, instances, or records.
<br><br>
To inspect data like this, Python offers the data analysis library [pandas](http://pandas.pydata.org/). The following example uses <b>pandas</b> to examine the iris dataset.

In [None]:
# examine the first 5 rows of the feature matrix (including the feature names)
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

In [None]:
# examine the response vector
print(y)

Note that to build a <b>model</b>, the features must be <b>numeric</b>, and every <b>observation</b> must have <b>the same features in the same order</b>.

In [None]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the KNN model (default n_neighbors=5)
knn = KNeighborsClassifier()

# fit the model, with X as the training data and y as the target
knn.fit(X, y)

To make a <b>prediction</b>, new inputs/observations must have <b>the same features as the training data</b>.

In [None]:
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])

In [None]:
# accuracy score on the training data (optimistic, because the model has already seen this data)
knn.score(X, y)
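
The score above is measured on the same observations the model was fitted on. As a minimal sketch of a less optimistic check, the block below holds out part of the iris data before fitting. The names `X_all`, `y_all`, `knn_holdout`, `train_acc`, and `test_acc` are introduced here for illustration only and are not part of the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_all, y_all = iris.data, iris.target  # all four features

# Hold out 25% of the observations (the train_test_split default),
# keeping the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, random_state=1, stratify=y_all
)

knn_holdout = KNeighborsClassifier()
knn_holdout.fit(X_tr, y_tr)

train_acc = knn_holdout.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = knn_holdout.score(X_te, y_te)   # accuracy on held-out data
print(train_acc, test_acc)
```

The held-out accuracy is usually the more honest estimate of how the model behaves on new observations.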

## Part 2: Representing text as numerical data
In this part, we convert text into <b>numeric</b> data, because a model can only be built from numeric features.

In [None]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me.. PLEASE!']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [None]:
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [None]:
# learn the 'vocabulary' of the training data
vect.fit(simple_train)

In [None]:
# examine the fitted vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vect.get_feature_names_out()

In [None]:
# transform the training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

In [None]:
# convert the sparse matrix to a dense matrix
arr_dtm = simple_train_dtm.toarray()
arr_dtm

In [None]:
# examine the vocabulary and the document-term matrix together
pd.DataFrame(arr_dtm, columns=vect.get_feature_names_out())

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
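
To make the Bag of Words idea concrete, here is a minimal sketch that builds the same kind of count matrix by hand with only the standard library. The `tokenize` helper is hypothetical; it approximates the default `CountVectorizer` behaviour (lowercasing and keeping tokens of two or more word characters) and is not the library's actual implementation.

```python
import re
from collections import Counter

docs = ['call you tonight', 'Call me a cab', 'please call me.. PLEASE!']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # so single-character tokens such as "a" are dropped
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token in the corpus, in sorted order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document, one column per vocabulary token
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Word order is discarded: each row only records how often each vocabulary token occurs in that document.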

In [None]:
# check the type of the document-term matrix
type(simple_train_dtm)

In [None]:
# examine the contents of the sparse matrix
print(simple_train_dtm)

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
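
As a small illustration of what a sparse representation stores, the sketch below converts a dense count matrix to `scipy.sparse.csr_matrix` and counts the stored entries. The matrix values are a hand-written example with three documents and six terms; a real corpus would be far sparser.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense document-term counts: three short documents, six vocabulary terms
dense = np.array([
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 2, 0, 0],
])

sparse = csr_matrix(dense)

# Only the non-zero entries are stored explicitly
stored = sparse.nnz
total = dense.shape[0] * dense.shape[1]
print(f"stored {stored} of {total} entries")

# Converting back with toarray() recovers the original dense matrix
restored = sparse.toarray()
print(restored)
```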

In [None]:
# example text for model testing
simple_test = ["please don't call me"]

To make a <b>prediction</b>, new inputs/observations must have <b>the same features as the training data</b>.

In [None]:
# transform the testing data into a document-term matrix (using the fitted vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

In [None]:
# examine the vocabulary and the document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

**Summary:**
- `vect.fit(train)` **learns the vocabulary** of the training data.
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data.
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the test data, ignoring tokens it has not seen before.
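
The following sketch illustrates the last point with the same example sentences: a token that appears only in the test text gets no column. The variable names (`train_docs`, `test_docs`, `vectorizer`, `vocab`, `test_row`) are chosen for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['call you tonight', 'Call me a cab', 'please call me.. PLEASE!']
test_docs = ["please don't call me"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)  # the vocabulary comes from the training data only

# vocabulary_ maps each token to its column index; columns are in sorted order
vocab = sorted(vectorizer.vocabulary_)
test_row = vectorizer.transform(test_docs).toarray()[0]

# "don" (from "don't") never appeared in training, so it is silently dropped
print(dict(zip(vocab, test_row)))
```

Because the test matrix has exactly the training columns, a model fitted on the training matrix can score it directly.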

## Part 3: Reading a text-based dataset into pandas
This part uses the sms.tsv dataset, which contains SMS messages together with their labels, either spam or ham.

In [None]:
# read the file into pandas from its path
path = 'data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [None]:
# examine the shape of the sms.tsv data
sms.shape

In [None]:
# examine the first 10 rows of the sms.tsv data
sms.head(10)

In [None]:
# examine the class distribution
sms.label.value_counts()

In [None]:
# convert the label to a numeric value stored in a new
# column label_num, with ham = 0 and spam = 1
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [None]:
# check that the label conversion worked
sms.head(10)
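
Looking at the first rows is a spot check only. A possible complement, sketched below on a hypothetical four-row DataFrame named `demo`, is to count labels that the mapping did not cover: `Series.map` with a dict turns any unmapped value into `NaN`.

```python
import pandas as pd

# Hypothetical mini-dataset with one unexpected label ('Spam' with a capital S)
demo = pd.DataFrame({'label': ['ham', 'spam', 'Spam', 'ham']})
demo['label_num'] = demo.label.map({'ham': 0, 'spam': 1})

# Labels missing from the mapping become NaN, so count them before modeling
n_unmapped = demo.label_num.isna().sum()
print(n_unmapped)
```

A count of zero on the real data would confirm that every row received a numeric label.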

In [None]:
# define X and y (from the SMS data) for use with CountVectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
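
For reference, `train_test_split` holds out 25% of the rows when `test_size` is not given. The sketch below shows this on hypothetical, imbalanced toy labels, and also shows the optional `stratify` argument, which the tutorial itself does not use, keeping the spam ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (0) and 20 spam (1)
labels = np.array([0] * 80 + [1] * 20)
messages = np.array([f"message {i}" for i in range(100)])

# test_size defaults to 0.25; stratify keeps the class ratio in both parts
m_train, m_test, l_train, l_test = train_test_split(
    messages, labels, random_state=1, stratify=labels
)

print(len(m_train), len(m_test))      # sizes of the two parts
print(l_train.mean(), l_test.mean())  # spam fraction in each part
```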

## Part 4: Vectorizing our dataset

In [None]:
# instantiate the vectorizer
vect = CountVectorizer()

In [None]:
# learn the vocabulary of the training data,
# then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [None]:
# alternatively, combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

In [None]:
# examine the document-term matrix
X_train_dtm

In [None]:
# transform the testing data (using the fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

## Part 5: Building and evaluating a model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
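
For reference, the smoothed estimate that multinomial Naive Bayes uses for the probability of token $i$ given class $y$ can be written as follows, in the notation of the scikit-learn user guide (these symbols are not defined elsewhere in this tutorial):

$$
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}
$$

Here $N_{yi}$ is the number of times token $i$ appears in training documents of class $y$, $N_y$ is the total token count for class $y$, $n$ is the number of features (the vocabulary size), and $\alpha$ is the smoothing parameter (`MultinomialNB` defaults to `alpha=1.0`). Smoothing keeps a token that never occurred in one class from forcing that class's probability to zero.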

In [None]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [None]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

# time the training (using an IPython "magic" command)
%time nb.fit(X_train_dtm, y_train)

In [None]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
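
The vectorize, fit, and predict steps can also be chained into a single estimator. The sketch below uses scikit-learn's `make_pipeline` on a handful of hypothetical toy messages; it is an alternative arrangement of the same steps, not the approach the tutorial follows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, for illustration only
train_msgs = ['win cash now', 'free prize win', 'see you at lunch', 'call me tonight']
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# fit() learns the vocabulary and the classifier from the training text only;
# predict() reuses that vocabulary when transforming new text
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_msgs, train_labels)

preds = pipe.predict(['win a free prize', 'lunch tonight'])
print(preds)
```

Keeping the vectorizer inside the pipeline makes it harder to fit the vocabulary on test data by accident.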

In [None]:
from sklearn import metrics

# example accuracy calculation
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(metrics.accuracy_score(y_true, y_pred))

# calculate the accuracy of the class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# print the confusion matrix for the example
print(metrics.confusion_matrix(y_true, y_pred))

print()
# print the confusion matrix for y_test and y_pred_class
print(metrics.confusion_matrix(y_test, y_pred_class))
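
For binary labels, `confusion_matrix` puts the true classes in rows and the predicted classes in columns, so the matrix reads `[[TN, FP], [FN, TP]]`. A minimal sketch with hypothetical labels (`actual` and `predicted` are example names) shows how the four cells can be unpacked and turned into common metrics:

```python
from sklearn import metrics

# Hypothetical binary labels: 0 = ham, 1 = spam
actual = [0, 0, 0, 1, 1, 1, 1]
predicted = [0, 1, 0, 1, 1, 0, 1]

# Flatten [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = metrics.confusion_matrix(actual, predicted).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam, how much was caught
print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```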

In [None]:
# print the messages for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

In [None]:
# print the messages for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

In [None]:
# example of a false negative
X_test[3132]

In [None]:
# calculate predicted probabilities for X_test_dtm (column 1 = probability of spam)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

In [None]:
# calculate the Area Under the Curve (AUC)
import numpy as np

# example
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print(metrics.roc_auc_score(y_true, y_scores))

# print the AUC for the test set
print(metrics.roc_auc_score(y_test, y_pred_prob))
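
One way to read the AUC: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes this directly in plain Python for the small example above; `pairwise_auc` is a hypothetical helper written for this explanation, not a scikit-learn function.

```python
from itertools import product

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Same toy labels and scores as the roc_auc_score example
auc_example = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```

Three of the four pairs are ordered correctly here; only the pair (0.35, 0.4) is not.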

## Part 6: Comparing models (Naive Bayes and Logistic Regression)

We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
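
As a brief illustration of the logistic function mentioned in the quote, the sketch below maps a few hypothetical linear scores to probabilities. The `logistic` helper and the score values are examples only; scikit-learn performs this computation internally.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores (weights times features, plus intercept)
for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic(z), 3))
```

A score of zero maps to a probability of 0.5, which is the default decision boundary between the two classes.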

In [None]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

In [None]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [None]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

In [None]:
# calculate the accuracy
metrics.accuracy_score(y_test, y_pred_class)

In [None]:
# calculate the AUC
metrics.roc_auc_score(y_test, y_pred_prob)