Mongolian text classification series #01

In this notebook I try to classify Cyrillic Mongolian texts with an LSTM.

Eduge dataset provided by Bolorsoft LLC

Author : Sharavsambuu Gunchinish (sharavsambuu@gmail.com)

Github: https://github.com/sharavsambuu/mongolian-text-classification 



In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

!pip install -q tensorflow-gpu==2.0.0-alpha0
!pip install gensim

import tensorflow as tf
from tensorflow import keras

import numpy as np

print(tf.__version__)

2.0.0-alpha0


[More info about the creation of the Eduge dataset pickles](https://github.com/sharavsambuu/mongolian-text-classification/blob/master/preprocess_dataset/preprocess_eduge.ipynb). Preprocessing eats a lot of CPU cycles, so it's a good idea to cook the pickles beforehand instead of doing it on Colab.
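For orientation, here is a minimal sketch of how a word index like the one loaded below could be built and pickled. The reserved tokens, the toy corpus and the helper name `build_word_index` are assumptions for illustration; the linked preprocessing notebook is the authoritative version.

```python
import pickle
from collections import Counter

# Assumed reserved tokens; the real pickles may use a different set or order.
RESERVED = ["<PAD>", "<START>", "<UNK>", "<UNUSED>"]

def build_word_index(tokenized_docs):
    """Map every word to an integer id, most frequent words first."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    word_index = {tok: i for i, tok in enumerate(RESERVED)}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index)
    return word_index

toy_docs = [["монгол", "улс"], ["монгол", "хэл"]]
toy_word_index = build_word_index(toy_docs)
toy_reversed = {i: w for w, i in toy_word_index.items()}

# Round-trip through pickle, the same format the notebook loads later.
restored = pickle.loads(pickle.dumps(toy_word_index))
```

Doing this once offline and shipping only the pickles keeps the Colab session free for training.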

In [2]:
import os
from os.path import exists, join, basename, splitext
import sys

def download_from_google_drive(file_id, file_name):
  !rm -f ./cookie
  !curl -c ./cookie -s -L "https://drive.google.com/uc?export=download&id=$file_id" > /dev/null
  confirm_text = !awk '/download/ {print $NF}' ./cookie
  confirm_text = confirm_text[0]
  !curl -Lb ./cookie "https://drive.google.com/uc?export=download&confirm=$confirm_text&id=$file_id" -o $file_name
  
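# download eduge pickles (the archive extracts its pickles into the working directory)
file_path = 'eduge_pickles'
if not exists('word_index.pickle'):
  download_from_google_drive('1vjJ9YgIe8o0ErhbN0lH1XqPv3KFP8acv', '%s.rar' % file_path)
  rar_file = file_path+".rar"
  # -o+ overwrites existing files instead of prompting
  !unrar x -o+ $rar_file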

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   388    0   388    0     0   4974      0 --:--:-- --:--:-- --:--:--  4974
100  106M    0  106M    0     0   104M      0 --:--:--  0:00:01 --:--:--  231M

UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from eduge_pickles.rar


Would you like to replace the existing file word_index.pickle
9178153 bytes, modified on 2019-04-13 01:44
with a new one
9178153 bytes, modified on 2019-04-13 01:44

[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit q

Program aborted


In [0]:
import pickle

with open('word_index.pickle', 'rb') as handle:
  word_index = pickle.load(handle)
    
with open('reversed_word_index.pickle', 'rb') as handle:
  reversed_word_index = pickle.load(handle)
  
with open('eduge_stopwords_removed.pickle', 'rb') as handle:
  eduge_ds = pickle.load(handle)

In [4]:
# Facebook trained fastText vectors on both Common Crawl and Wikipedia, so this model should contain good enough representations of our Mongolian words.
mongolian_word2vec_download="https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz"
# download and unpack only if the unpacked model is not there yet
if not exists("cc.mn.300.bin"):
  !wget -nc $mongolian_word2vec_download
  !gunzip cc.mn.300.bin.gz

gzip: cc.mn.300.bin already exists; do you wish to overwrite (y or n)? n
	not overwritten


In [5]:
from gensim.models.wrappers import FastText

word2vec_model = FastText.load_fasttext_format('cc.mn.300.bin')

W0414 01:07:52.334683 139997640750976 ssh.py:33] paramiko missing, opening SSH/SCP/SFTP paths will be disabled.  `pip install paramiko` to suppress
W0414 01:07:52.809054 139997640750976 word2vec.py:573] Slow version of gensim.models.deprecated.word2vec is being used


In [6]:
print(word2vec_model.most_similar('монгол'))

[('Монгол', 0.6342526078224182), ('монголын', 0.6047513484954834), ('хятад', 0.5558866858482361), ('Монголын', 0.5087883472442627), ('судлалаараа', 0.48851606249809265), ('манай', 0.4853793680667877), ('уйгаржин', 0.4725492596626282), ('угсаатангууд', 0.47093287110328674), ('орос', 0.46463483572006226), ('худам', 0.4609120190143585)]
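
The scores above are cosine similarities between word vectors. A minimal sketch of that computation with made-up 3-dimensional vectors (the real model uses 300 dimensions; `toy_vectors` and this `most_similar` helper are illustrative stand-ins, not the gensim API):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen only so the ranking is easy to check by hand.
toy_vectors = {
    "монгол": [1.0, 0.0, 0.0],
    "монголын": [0.8, 0.6, 0.0],
    "спорт": [0.0, 0.0, 1.0],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("монгол", toy_vectors)
```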




In [0]:
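# preparing embedding matrix
import numpy as np

words_not_found = []
embed_dim       = 300
# every row starts out random; rows of words the model can represent are overwritten below
embedding_matrix = np.random.uniform(-1, 1, (len(word_index), embed_dim))
for word, i in word_index.items():
  if i<4:
    # skip the first four indexes (presumably reserved for special tokens such as <PAD>)
    continue
  try:
    embedding_vector = word2vec_model[word]
    if (embedding_vector is not None) and len(embedding_vector) > 0:
      embedding_matrix[i] = embedding_vector
  except KeyError:
    # the model has no vector for this word, so it keeps its random row
    words_not_found.append(word)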
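
The same idea on a toy vocabulary, with a plain dict standing in for the fastText model. The dict, the 4-dimensional vectors and the reserved tokens are assumptions for illustration. The sketch also measures how much of the vocabulary received a pretrained vector, which is worth checking before training:

```python
import numpy as np

toy_word_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3,
                  "монгол": 4, "улс": 5, "ховорүг": 6}  # last one is a made-up rare word
toy_vectors = {"монгол": np.ones(4), "улс": np.full(4, 0.5)}  # stand-in for the model
toy_dim = 4

rng = np.random.default_rng(0)
toy_matrix = rng.uniform(-1, 1, (len(toy_word_index), toy_dim))
missing = []
for word, i in toy_word_index.items():
    if i < 4:          # leave rows of the special tokens untouched
        continue
    try:
        toy_matrix[i] = toy_vectors[word]
    except KeyError:   # no pretrained vector: keep the random row
        missing.append(word)

# Share of regular words that got a pretrained vector.
coverage = 1 - len(missing) / (len(toy_word_index) - 4)
```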

In [8]:
print(embedding_matrix.shape)
#print(embedding_matrix[5])

(370794, 300)


In [0]:
MAX_LEN = 512

import itertools

for item in eduge_ds:
  item[0] = list(itertools.chain(*item[0]))[:MAX_LEN]
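
The `itertools.chain` call implies that each article is stored as a list of sentences, each of them a list of tokens. A toy illustration of the flatten-then-truncate step, with a hypothetical `toy_max_len` of 5 and placeholder tokens:

```python
import itertools

toy_max_len = 5
# [sentences, label], mirroring the assumed layout of one dataset item
toy_item = [[["a", "b", "c"], ["d", "e"], ["f", "g"]], "label"]

# chain(*sentences) concatenates the sentence lists into one token stream
flat = list(itertools.chain(*toy_item[0]))
# keep only the first toy_max_len tokens
toy_item[0] = flat[:toy_max_len]
```

Everything after the first `toy_max_len` tokens is dropped, so long articles are classified from their opening part only.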

In [0]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(eduge_ds, test_size=0.1, random_state=999)

In [0]:
train_data_words  = [i[0] for i in train]
train_label_words = [i[1] for i in train]
test_data_words   = [i[0] for i in test ]
test_label_words  = [i[1] for i in test ]

In [0]:
def encode_news(text):
    # words missing from word_index fall back to index 2 (presumably the unknown-word token)
    return [word_index.get(i, 2) for i in text]
  
train_data = [encode_news(sent) for sent in train_data_words]
test_data  = [encode_news(sent) for sent in test_data_words ]
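
A small sketch of the encoding step and its inverse on a toy vocabulary. The token names and their ids are assumptions; in particular, that id 2 means "unknown word" is inferred from the fallback value used above, not confirmed by the pickles.

```python
toy_word_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3,
                  "монгол": 4, "улс": 5}
UNK_ID = toy_word_index["<UNK>"]

def encode(tokens, word_index, unk_id):
    """Replace each token by its id; unseen tokens get unk_id."""
    return [word_index.get(tok, unk_id) for tok in tokens]

def decode(ids, word_index):
    """Map ids back to tokens using the inverted index."""
    reverse = {i: w for w, i in word_index.items()}
    return [reverse[i] for i in ids]

ids = encode(["монгол", "улс", "zzz"], toy_word_index, UNK_ID)
words = decode(ids, toy_word_index)
```

Decoding a few encoded articles this way is a quick check on how many words end up as unknown.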

In [0]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=MAX_LEN)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=MAX_LEN)
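
What `padding='post'` does, as a dependency-free sketch. The helper `pad_post` is hypothetical and cuts over-long sequences at the end, whereas Keras by default cuts them at the front (`truncating='pre'`); that difference does not matter here because every sequence was already limited to `MAX_LEN` tokens.

```python
def pad_post(seq, maxlen, value=0):
    """Cut seq to maxlen items, then right-pad it with `value`."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))

# one short and one over-long toy sequence
padded = [pad_post(s, maxlen=4) for s in [[7, 8], [1, 2, 3, 4, 5]]]
```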

In [14]:
labels = list(set(test_label_words))
labels

['спорт',
 'эрүүл мэнд',
 'урлаг соёл',
 'эдийн засаг',
 'байгал орчин',
 'хууль',
 'технологи',
 'улс төр',
 'боловсрол']

In [0]:
from sklearn.preprocessing import LabelBinarizer
encoder     = LabelBinarizer()
# fit the label-to-column mapping on the training labels only, then reuse it for the test labels
train_label = encoder.fit_transform(train_label_words)
test_label  = encoder.transform(test_label_words )
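
A label binarizer learns a fixed class order and turns each label into a one-hot row. A dependency-free sketch of that idea follows; the helper names are hypothetical and this is not the scikit-learn implementation. The point to notice is that the class order is derived once and then shared, so train and test rows use the same columns.

```python
def fit_classes(labels):
    """Fix the column order: sorted unique labels."""
    return sorted(set(labels))

def one_hot(labels, classes):
    """Turn each label into a 0/1 row with a single 1 at its class column."""
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[l] == j else 0 for j in range(len(classes))]
            for l in labels]

train_labels = ["спорт", "хууль", "технологи", "спорт"]
test_labels = ["хууль", "спорт"]

classes = fit_classes(train_labels)      # derived once, from the training labels
train_y = one_hot(train_labels, classes)
test_y = one_hot(test_labels, classes)   # reuses the same column order
```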

In [17]:
vocab_size = len(word_index)

sequence_input     = keras.layers.Input(shape=(MAX_LEN,), dtype='int32')
embedded_sequences = keras.layers.Embedding(
    vocab_size, 
    embed_dim , 
    weights=[embedding_matrix], 
    input_length=MAX_LEN, 
    trainable=False)(sequence_input)
x     = keras.layers.LSTM(128)(embedded_sequences)
x     = keras.layers.Dense(245, activation='relu')(x)
x     = keras.layers.Dropout(0.5)(x) # helps reduce overfitting
preds = keras.layers.Dense(len(labels), activation='softmax')(x)

model = keras.models.Model(sequence_input, preds)
model.summary()

W0414 01:09:45.882035 139997640750976 tf_logging.py:161] <tensorflow.python.keras.layers.recurrent.UnifiedLSTM object at 0x7f52c07278d0>: Note that this layer is not optimized for performance. Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 512)]             0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 512, 300)          111238200 
_________________________________________________________________
unified_lstm_1 (UnifiedLSTM) (None, 128)               219648    
_________________________________________________________________
dense (Dense)                (None, 245)               31605     
_________________________________________________________________
dropout (Dropout)            (None, 245)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 2214      
Total params: 111,491,667
Trainable params: 253,467
Non-trainable params: 111,238,200
_________________________________________
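
The parameter counts in the summary can be reproduced by hand, which is a handy sanity check on the layer sizes. A short sketch using the standard formulas for embedding, LSTM and dense layers:

```python
vocab_size, embed_dim = 370794, 300
lstm_units, dense_units, num_classes = 128, 245, 9

# one 300-dimensional row per vocabulary entry (frozen, so non-trainable)
embedding = vocab_size * embed_dim
# 4 gates, each with input weights, recurrent weights and a bias
lstm = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# weights plus biases of the two dense layers
dense = lstm_units * dense_units + dense_units
output = dense_units * num_classes + num_classes

trainable = lstm + dense + output
total = embedding + trainable
```

Almost all of the model's parameters sit in the frozen embedding; only about a quarter of a million are actually trained.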

In [0]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [19]:
print(len(train_data), len(train_label))
print(len(test_data ), len(test_label) )

partial_index = 3000

x_val = train_data[:partial_index]
partial_x_train = train_data[partial_index:]

y_val = train_label[:partial_index]
partial_y_train = train_label[partial_index:]

68094 68094
7567 7567


In [20]:
epochs = 50
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=epochs  ,
                    batch_size=512 ,
                    validation_data=(x_val, y_val),
                    verbose=1)

Train on 65094 samples, validate on 3000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50

KeyboardInterrupt: ignored
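
Training above was stopped by hand. A common alternative is patience-based early stopping, which Keras provides as the `keras.callbacks.EarlyStopping` callback. The sketch below only illustrates the logic on a made-up list of validation losses; `val_losses`, `patience` and the helper name are assumptions, not values from this run.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # new best validation loss: reset the counter
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    # never ran out of patience: train for all epochs
    return len(val_losses)

val_losses = [2.1, 1.9, 1.8, 1.85, 1.82, 1.81, 1.7]  # made-up numbers
stop_at = early_stop_epoch(val_losses, patience=3)
```

With a small patience the run stops before the late improvement in the last epoch, so the patience value is a trade-off between wasted epochs and missed gains.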

In [21]:
results = model.evaluate(test_data, test_label)
print(results)

[1.6579768257694387, 0.39857274]
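
To put these two numbers in context, they can be compared with uniform guessing over the 9 classes. This is a rough sketch that assumes roughly balanced classes, which the notebook does not verify:

```python
import math

num_classes = 9
test_loss, test_acc = 1.6579768257694387, 0.39857274  # values printed above

chance_acc = 1 / num_classes          # accuracy of uniform guessing
chance_loss = math.log(num_classes)   # cross-entropy of a uniform prediction

# how many times better than chance the model is
acc_ratio = test_acc / chance_acc
```

The interrupted model is clearly better than guessing, but still misclassifies most test articles.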


In [22]:
data_index   = 12
data_words   = " ".join(test_data_words[data_index])
data_indexes = test_data[data_index]
print(data_words)

# add a batch dimension: the model expects input of shape (batch, MAX_LEN)
predicted = model.predict(np.array([data_indexes]))
print(encoder.classes_[np.argmax(predicted)])

спортын төв ордонд өнөөдөр азийн оюутны аварга шалгаруулах эмэгтэй волейболчдын хоёр дахь удаагийн тэмцээний талаар мэдээлэл хийлээ анхны тэмцээн онд тайландын бангконг хотноо болж хоёрдугаар тэмцээнийг азийн оюутны спортын холбооноос аосх олгосон эрхийн дагуу оны дөрөвдүгээр сарын ны өдрүүдэд монгол улсын нийслэл улаанбаатар хотноо зохион байгуулах тэмцээний эрхийг монгол улс оны тавдугаар сарын хуралдсан аосхны гүйцэтгэх хорооны хурлаар хоёр оронтой өрсөлдөн авчээ уг тэмцээнийг монгол улсад авах талаар мосхолбоо оноос санаачлага гарган хөөцөлдөж эхэлсэн тэмцээний эрхийг авахад муын засгийн газрын санхүүгийн дэмжлэг мэргэжлийн холбоодын ажлын туршлага манай улсын олон улсын нэр хүнд ихээхэн тус хүргэжээ зохион байгуулах хороог с ламбаа удирдаж тэмцээний зохион байгуулах хороог збх эрүүл мэндийн сайдын оны тоот тушаалаар батлаж даргаар уихын гишүүн монголын волейболын холбооны мвх хүндэт ерөнхийлөгч сламбаа ажиллаж збхны орлогч даргаар згхагентлагбтсгын дарга чнаранбаатар збхны нарийн 
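
Besides the single `argmax` label, it can be informative to look at the few most probable classes for an article. A minimal sketch with a made-up softmax output; `toy_probs`, `toy_classes` and the helper `top_k` are illustrative, not results from the model above:

```python
def top_k(probs, classes, k=3):
    """Return the k most probable (class, probability) pairs."""
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_classes = ["боловсрол", "спорт", "хууль"]
toy_probs = [0.2, 0.7, 0.1]   # made-up softmax output for one article
best = top_k(toy_probs, toy_classes, k=2)
```

In the notebook the equivalent inputs would be `predicted[0]` and `encoder.classes_`.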