                                                       
                                                     
## Multi-Class Text Classification

Reference: https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17


This notebook classifies text documents. The case study is the Consumer Finance Complaints dataset: given a customer's complaint narrative, the program assigns the complaint to its product category (topic). The method used is an LSTM network built with the Keras library.

In [1]:
# import the basic libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # pyplot is the module that provides plt.plot, plt.title, ...
import re  # regular expressions

In [2]:
# read the CSV data
df = pd.read_csv('complaints.csv')

In [3]:
# Only the Product and Consumer complaint narrative columns are needed for training,
# so only these two columns are kept for the following steps.
col = ['Product', 'Consumer complaint narrative']
df = df[col]
# keep only the rows whose narrative is not missing
df = df[pd.notnull(df['Consumer complaint narrative'])]
df.columns = ['Product', 'Consumer complaint narrative']

In [4]:
# count the number of records in each class (topic)
df.Product.value_counts()

Credit reporting, credit repair services, or other personal consumer reports    143103
Debt collection                                                                 106291
Mortgage                                                                         61304
Credit card or prepaid card                                                      31761
Credit reporting                                                                 31588
Student loan                                                                     24991
Checking or savings account                                                      18956
Credit card                                                                      18838
Bank account or service                                                          14885
Consumer Loan                                                                     9473
Vehicle loan or lease                                                             8132
Money transfer, virtual currency, or money 

In [5]:
# print summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 488468 entries, 0 to 1491589
Data columns (total 2 columns):
Product                         488468 non-null object
Consumer complaint narrative    488468 non-null object
dtypes: object(2)
memory usage: 11.2+ MB


In [6]:
'''
To simplify the problem, the following steps are applied first:
- Credit reporting complaints are merged into Credit reporting, credit repair services, or other personal consumer reports
- Credit card complaints are merged into Credit card or prepaid card
- Payday loan complaints are merged into Payday loan, title loan, or personal loan
- Virtual currency complaints are merged into Money transfer, virtual currency, or money service
- Other financial service is dropped because it is not very important and has very few records.
'''

df.loc[df['Product'] == 'Credit reporting', 'Product'] = 'Credit reporting, credit repair services, or other personal consumer reports'
df.loc[df['Product'] == 'Credit card', 'Product'] = 'Credit card or prepaid card'
df.loc[df['Product'] == 'Payday loan', 'Product'] = 'Payday loan, title loan, or personal loan'
df.loc[df['Product'] == 'Virtual currency', 'Product'] = 'Money transfer, virtual currency, or money service'
df = df[df.Product != 'Other financial service']
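
The merging rules above amount to a label-to-label mapping plus one excluded class. The following is a minimal pure-Python sketch of that idea on a hypothetical toy list of labels (the list `products` and the helper `merge_labels` are illustrative names, not part of the notebook).

```python
from collections import Counter

# Hypothetical toy labels standing in for the 'Product' column.
products = [
    'Credit reporting',
    'Credit card',
    'Payday loan',
    'Virtual currency',
    'Other financial service',
    'Mortgage',
    'Credit card or prepaid card',
]

# Old label -> merged label, mirroring the df.loc assignments in the cell above.
MERGE_MAP = {
    'Credit reporting': 'Credit reporting, credit repair services, or other personal consumer reports',
    'Credit card': 'Credit card or prepaid card',
    'Payday loan': 'Payday loan, title loan, or personal loan',
    'Virtual currency': 'Money transfer, virtual currency, or money service',
}
DROPPED = {'Other financial service'}

def merge_labels(labels):
    """Map old labels to their merged label and drop the excluded class."""
    return [MERGE_MAP.get(label, label) for label in labels if label not in DROPPED]

merged = merge_labels(products)
counts = Counter(merged)
print(counts.most_common())
```

On a DataFrame, the same mapping could also be applied in one step with `df['Product'].replace(MERGE_MAP)`, which keeps the rules in a single dictionary.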

In [7]:
# Plot the number of complaints per topic/product (after merging).
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
df['Product'].value_counts().sort_values(ascending=False).iplot(kind='bar', yTitle='Number of Complaints', 
                                                                title='Number complaints in each product')





### Text Preprocessing Stage

The text preprocessing stage consists of the following steps:
- Convert all text to lowercase.
- Replace the symbols matched by REPLACE_BY_SPACE_RE with a space.
- Remove the symbols matched by BAD_SYMBOLS_RE from the text.
- Remove the character "x" from the text (the dataset masks personal details as runs of X).
- Remove stopwords.
- Remove digits from the text.
- Tokenize.

In [14]:
# FUNCTION PRINT_PLOT
# Prints the record at the requested index
df = df.reset_index(drop=True)  # reset the index
def print_plot(index):
    '''
    input : index
    
    prints : the Consumer complaint narrative text together with its Product
    '''
    example = df[df.index == index][['Consumer complaint narrative', 'Product']].values[0]  # fetch the record at the given index
    if len(example) > 0:  # if the record is not empty
        print(example[0])  # print the text
        print('Product:', example[1])  # print the topic/product

# Example: print the record at index 1
print_plot(1)

I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act.
Product: Credit reporting, credit repair services, or other personal consumer reports


In [15]:
print_plot(100)  # example: print the record at index 100

I have been trying to resolve a payoff balance and speaking to employee XXXX ( last name refused ) at XXXX. Last communication XXXX/XXXX/19, XXXX with no response. On Wed XX/XX//19 he and I agreed on a payoff balance of {$5300.00} and he would send a written agreement to confirm. After not re ceiving the agreement I called back and spoke to XXXX ( last name refused ) on XXXX/XXXX/19 XXXX XXXX in Customer Service Dept. she informed me XXXX misinformed me because an additional fee to payoff my balance would be {$2500.00} in addition to the {$5300.00} I have pending payments of {$500.00} + {$2100.00} which will add to my current payment {$2600.00} = {$5300.00}, I was suggestive to personal insults by XXXX
Product: Debt collection


#### 1. Cleaning the Text and Removing Stopwords

In [16]:
import nltk  # the nltk library
from nltk.corpus import stopwords  # stopword lists (requires a one-time nltk.download('stopwords'))


'''
Regular expressions used:
- REPLACE_BY_SPACE_RE matches separator symbols that are replaced by a space
- BAD_SYMBOLS_RE matches every character that is not a digit, a lowercase letter, a space, #, + or _
'''
# raw strings avoid invalid-escape warnings for \[ \] \|
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # symbols / ( ) { } [ ] | @ , ;
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')  # anything outside 0-9, a-z, space, #, +, _

# English stopwords
STOPWORDS = set(stopwords.words("english"))  # use the stopword list that ships with nltk

# FUNCTION CLEAN_TEXT
# Removes words and symbols that carry little information.
def clean_text(text):
    """
        text: string
        
        return: the cleaned string
    """
    text = text.lower()  # lowercase the whole text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace the matched separator symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete the matched bad symbols
    text = text.replace('x', '')  # delete every 'x' (removes the XXXX masks, but also 'x' inside ordinary words)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords
    return text



In [17]:
# Apply clean_text to every row of the Consumer complaint narrative column.
df['Consumer complaint narrative'] = df['Consumer complaint narrative'].apply(clean_text)
# Remove digits from the text (regex=True is required in recent pandas, where the default is a literal match)
df['Consumer complaint narrative'] = df['Consumer complaint narrative'].str.replace(r'\d+', '', regex=True)
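
One side effect of the cleaning step is worth seeing on a small input: `text.replace('x', '')` deletes every `x`, not only the `XXXX` masks. The sketch below contrasts that behaviour with a hypothetical alternative that removes only runs of two or more `x`. The pattern `MASK_RE` and both helper functions are assumptions made for illustration; they are not part of the notebook, and the outputs shown elsewhere were produced with the original rule.

```python
import re

# Assumption: after lowercasing, masked fields appear as runs of two or more 'x'
# (for example 'xxxx' or 'xx/xx/19'). This pattern is illustrative only.
MASK_RE = re.compile(r'x{2,}')

def strip_all_x(text):
    """What the notebook does: delete every 'x' character."""
    return text.lower().replace('x', '')

def strip_masks_only(text):
    """Alternative sketch: delete only runs of 2+ 'x', keeping a single 'x' inside words."""
    return MASK_RE.sub('', text.lower())

sample = 'XXXX charged an extra tax fee'
print(repr(strip_all_x(sample)))       # 'extra' and 'tax' lose their 'x' as well
print(repr(strip_masks_only(sample)))  # only the 'XXXX' mask is removed
```

The alternative still alters the rare words that contain a double `x`, so it is a trade-off rather than a complete fix.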

In [18]:
# Result after text cleaning
print_plot(1)

would like request suppression following items credit report result falling victim identity theft information relate transactions made accounts opened attached supporting documentation attest blocked appearing credit report pursuant section b fair credit reporting act
Product: Credit reporting, credit repair services, or other personal consumer reports


In [19]:
print_plot(100)

trying resolve payoff balance speaking employee last name refused last communication  response wed  agreed payoff balance  would send written agreement confirm ceiving agreement called back spoke last name refused  customer service dept informed misinformed additional fee payoff balance would  addition  pending payments  +  add current payment   suggestive personal insults
Product: Debt collection


#### 2. Tokenization

In [20]:
import keras
from keras.preprocessing.text import Tokenizer

# Maximum number of words to keep (the most frequent ones)
MAX_NB_WORDS = 50000
# Maximum number of words in each complaint
MAX_SEQUENCE_LENGTH = 250
# Embedding dimension
EMBEDDING_DIM = 100

# TOKENIZATION
# - Uses the Tokenizer class from Keras.
# - The filters argument lists the punctuation symbols that are stripped from the text.
# - '\\]' is the same two characters as the original '\]', written without an invalid escape.

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['Consumer complaint narrative'].values)  # fit on the complaint texts
word_index = tokenizer.word_index  # mapping from each word to its integer index
print('Found %s unique tokens.' % len(word_index))

Using TensorFlow backend.


Found 165703 unique tokens.
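
To make the role of `word_index` and `num_words` concrete, here is a simplified pure-Python stand-in for the tokenizer. It is a sketch under stated assumptions, not the Keras implementation: words are split on whitespace only, indices start at 1 for the most frequent word, and words whose index is `num_words` or larger are dropped when texts are converted to sequences. The function names and the toy texts are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by descending frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words ranked num_words or beyond."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ['credit report credit', 'account credit report', 'loan']
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```

Applied to the notebook, this means that although 165703 unique tokens were found, only the most frequent ones below index `MAX_NB_WORDS` end up in the sequences fed to the model.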


In [21]:
word_index  # sample of the word-to-index mapping

{'credit': 1,
 'account': 2,
 'report': 3,
 'would': 4,
 'information': 5,
 'payment': 6,
 'loan': 7,
 'debt': 8,
 'bank': 9,
 'told': 10,
 'received': 11,
 'company': 12,
 'card': 13,
 'time': 14,
 'called': 15,
 'never': 16,
 'payments': 17,
 'reporting': 18,
 'sent': 19,
 'letter': 20,
 'back': 21,
 'pay': 22,
 'also': 23,
 'get': 24,
 'paid': 25,
 'mortgage': 26,
 'call': 27,
 'said': 28,
 'amount': 29,
 'made': 30,
 'one': 31,
 'due': 32,
 'number': 33,
 'could': 34,
 'phone': 35,
 'accounts': 36,
 'money': 37,
 'days': 38,
 'balance': 39,
 'nt': 40,
 'late': 41,
 'collection': 42,
 'asked': 43,
 'still': 44,
 'since': 45,
 'consumer': 46,
 'date': 47,
 'even': 48,
 'years': 49,
 'please': 50,
 'dispute': 51,
 'name': 52,
 'home': 53,
 'make': 54,
 'contacted': 55,
 'us': 56,
 'interest': 57,
 'file': 58,
 'check': 59,
 'month': 60,
 'request': 61,
 'service': 62,
 'months': 63,
 'help': 64,
 'removed': 65,
 'new': 66,
 'times': 67,
 'address': 68,
 'day': 69,
 'complaint': 70,
 '

In [22]:
from keras.preprocessing.sequence import pad_sequences
# Truncate and pad the inputs so that they all have the same length.
# convert each text into a sequence of word indices
X = tokenizer.texts_to_sequences(df['Consumer complaint narrative'].values)
# pad or truncate every sequence to MAX_SEQUENCE_LENGTH
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (488176, 250)
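
With its default arguments, `pad_sequences` pads at the front with zeros and, for sequences that are too long, keeps the last `maxlen` items. The helper below is a minimal sketch of that behaviour for a single sequence; `pad_pre` is a hypothetical name and the function is not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

short = pad_pre([5, 8], maxlen=4)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)  # zeros are added on the left
print(long_)  # the beginning of the sequence is cut off
```

For complaints longer than 250 words this means the model sees the end of the narrative rather than its beginning.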


In [23]:
# one-hot encode the categorical labels (one column per class)
Y = pd.get_dummies(df['Product']).values
print('Shape of label tensor:', Y.shape)

Shape of label tensor: (488176, 13)
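
The label tensor has one column per class, and for string labels `pd.get_dummies` orders those columns by sorted class name. A small pure-Python sketch of the same encoding, using hypothetical toy labels and a hypothetical helper `one_hot`, shows why the column order matters when a predicted index is later mapped back to a class name.

```python
def one_hot(labels):
    """Return (sorted class names, one-hot rows) for a list of string labels."""
    classes = sorted(set(labels))
    column = {name: i for i, name in enumerate(classes)}
    rows = [
        [1 if column[label] == j else 0 for j in range(len(classes))]
        for label in labels
    ]
    return classes, rows

toy_labels = ['Mortgage', 'Debt collection', 'Mortgage', 'Student loan']
classes, rows = one_hot(toy_labels)
print(classes)
for row in rows:
    print(row)
```

Column 0 here is 'Debt collection' because it sorts first, not because it is the most frequent class.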


### LSTM Modeling Stage

The LSTM modeling stage consists of the following steps:
1. Split the data into training and test sets with a 90:10 ratio
2. Feed the preprocessed data into the LSTM model
3. Compute the accuracy of the resulting model

In [24]:
from sklearn.model_selection import train_test_split
'''
Split the data --> training : testing = 90:10.
Uses train_test_split with test_size = 0.1 (10%).
The results are stored in X_train, X_test, Y_train, Y_test.
'''
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(439358, 250) (439358, 13)
(48818, 250) (48818, 13)
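
The shapes printed above can be checked by hand. The sketch below assumes the usual rounding rules (the test size is rounded up, and a validation fraction is cut from the end of the training data); under those assumptions it reproduces the 439358/48818 split, and the same arithmetic gives the sizes that result from `validation_split=0.1` in the later `model.fit` call.

```python
import math

n_samples = 488176          # rows in X, from the shape printed earlier
test_size = 0.10

# Train/test split: the test set size is rounded up.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Validation split taken from the training data during fitting.
validation_split = 0.1
n_fit = int(n_train * (1 - validation_split))
n_val = n_train - n_fit

print(n_train, n_test)
print(n_fit, n_val)
```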


In [None]:
# KERAS imports for the LSTM model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Embedding
from keras.layers import SpatialDropout1D
from keras.callbacks import EarlyStopping

'''
LSTM STEP BY STEP
------------------------

The LSTM architecture is built in the following steps:
1. Create a Keras model with the Sequential() constructor
2. LAYER 1: Embedding layer
    --> The word-to-vector step (mapping each word index in the input sequence to a vector of size EMBEDDING_DIM) happens inside this layer.
3. LAYER 2: Dropout, using the SpatialDropout1D variant
4. LAYER 3: LSTM layer with 100 memory units
5. LAYER 4: Dense (output layer, producing 13 classes). The activation function is softmax
6. Compile the model with loss = categorical_crossentropy (because this is a multi-class problem) and optimizer = adam
'''
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 5  # number of epochs
batch_size = 64  # batch size

# run the training; stop early if val_loss has not improved by at least min_delta for 3 epochs
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])






Train on 395422 samples, validate on 43936 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
 68992/395422 [====>.........................] - ETA: 1:45:21 - loss: 0.5212 - acc: 0.8204
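
The long ETA in the log is easier to understand once the size of the model and the number of batches per epoch are worked out. The following is a back-of-the-envelope sketch using the standard parameter-count formulas for an embedding layer, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are illustrative.

```python
import math

vocab_size = 50000      # MAX_NB_WORDS
embedding_dim = 100     # EMBEDDING_DIM
lstm_units = 100
n_classes = 13

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# LSTM: 4 gates, each with (input + recurrent) weights and a bias per unit.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
# Dense: weights from every LSTM unit to every class, plus one bias per class.
dense_params = lstm_units * n_classes + n_classes
total_params = embedding_params + lstm_params + dense_params

# Batches per epoch for the 395422 training samples shown in the log.
batches_per_epoch = math.ceil(395422 / 64)

print(embedding_params, lstm_params, dense_params, total_params)
print(batches_per_epoch)
```

Almost all of the roughly five million parameters sit in the embedding layer, while most of the compute per batch goes into the 250 recurrent steps of the LSTM.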

In [None]:
# compute and print the loss and accuracy on the test set
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))
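
Because the model was compiled with `metrics=['accuracy']`, `model.evaluate` returns the loss followed by the accuracy. The sketch below shows, on hypothetical toy predictions, how these two numbers are defined for one-hot targets; it is a plain-Python illustration of the formulas, not the Keras code, and the clipping constant `eps` is an assumption.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the true class."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical one-hot targets and softmax outputs for two samples, three classes.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

toy_loss = categorical_crossentropy(y_true, y_pred)
toy_acc = accuracy(y_true, y_pred)
print('Loss: {:0.3f}  Accuracy: {:0.3f}'.format(toy_loss, toy_acc))
```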

In [None]:
# Plot the loss per epoch
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')  # val_loss comes from the validation split, not the test set
plt.legend()
plt.show();

In [None]:
# Plot the accuracy per epoch
plt.title('Accuracy')
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='validation')  # val_acc comes from the validation split, not the test set
plt.legend()
plt.show();

Testing with a New Complaint

In [None]:
# Given a new complaint, its label/topic/product can be predicted directly with the trained model
new_complaint = ['I am a victim of identity theft and someone stole my identity and personal information to open up a Visa credit card account with Bank of America. The following Bank of America Visa credit card account do not belong to me : XXXX.']
seq = tokenizer.texts_to_sequences(new_complaint)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
# The label order must match the columns of Y. pd.get_dummies sorts the class names,
# so the labels are taken from the same call instead of a hand-written list.
labels = pd.get_dummies(df['Product']).columns.tolist()
print(pred, labels[np.argmax(pred)])
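
Mapping the softmax output back to a class name only works if the list of names is in the same order as the columns of `Y`. The sketch below illustrates the decoding step on a hypothetical probability vector and a hypothetical three-class name list in sorted order; `top_k` is an illustrative helper, not part of the notebook.

```python
def top_k(probabilities, class_names, k=2):
    """Return the k most probable (class name, probability) pairs, best first."""
    order = sorted(range(len(probabilities)),
                   key=probabilities.__getitem__, reverse=True)
    return [(class_names[i], probabilities[i]) for i in order[:k]]

# Hypothetical class names in sorted (one-hot column) order and one softmax row.
toy_classes = ['Debt collection', 'Mortgage', 'Student loan']
toy_probs = [0.1, 0.7, 0.2]

ranking = top_k(toy_probs, toy_classes, k=2)
print(ranking)
```

Looking at the second-best class alongside the best one can help when two product categories overlap, as several of the complaint categories here do.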

In [None]:
# NOTE: This notebook was not run to completion, because the kernel died every time
# training reached the third epoch.