# Mongolian News Classification

This notebook is a simple demo that classifies the [Eduge news dataset](https://github.com/tugstugi/mongolian-nlp) provided by [Bolorsoft LLC](https://bolorsoft.com/) using an SVM and [SentencePiece](https://github.com/google/sentencepiece).

## Download Eduge dataset

In [0]:
from os.path import exists

if not exists("eduge.csv"):
  !wget -q https://github.com/tugstugi/mongolian-nlp/raw/master/datasets/eduge.csv.gz
  !gunzip eduge.csv.gz
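
The `!wget` and `!gunzip` lines above only work inside a notebook with those shell tools available. As a rough alternative, the same download-and-unpack step can be sketched with the Python standard library. The helper names `gunzip` and `fetch_dataset` are hypothetical and introduced here only for illustration:

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


def fetch_dataset(url, dst="eduge.csv"):
    """Download a .gz file from `url` and unpack it to `dst`, unless `dst` already exists."""
    dst = Path(dst)
    if dst.exists():
        return dst  # nothing to do, same check as in the cell above
    archive = dst.with_name(dst.name + ".gz")
    urllib.request.urlretrieve(url, archive)
    gunzip(archive, dst)
    archive.unlink()  # like the gunzip tool, drop the archive after unpacking
    return dst
```

Calling `fetch_dataset("https://github.com/tugstugi/mongolian-nlp/raw/master/datasets/eduge.csv.gz")` should then leave `eduge.csv` in the working directory, just as the shell commands do.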

## Download SentencePiece vocabulary

A SentencePiece model trained on a Mongolian corpus containing 650M words will be used as the text tokenizer. We will download it from the repo [tugstugi/mongolian-bert](https://github.com/tugstugi/mongolian-bert):


In [0]:
if not exists('mn_uncased.model'):
  # download both SentencePiece models: cased and uncased
  !wget -q https://github.com/tugstugi/mongolian-bert/raw/master/sentencepiece/mn_cased.model
  !wget -q https://github.com/tugstugi/mongolian-bert/raw/master/sentencepiece/mn_cased.vocab
  !wget -q https://github.com/tugstugi/mongolian-bert/raw/master/sentencepiece/mn_uncased.model
  !wget -q https://github.com/tugstugi/mongolian-bert/raw/master/sentencepiece/mn_uncased.vocab
    
  # install SentencePiece
  !pip install -q sentencepiece

## Load SentencePiece and test

Load the downloaded SentencePiece model and tokenize some text:

In [21]:
import sentencepiece as spm
import pandas as pd
import numpy as np
import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

sp = spm.SentencePieceProcessor()
sp.Load('mn_uncased.model')
def sp_tokenize(w):
  return sp.EncodeAsPieces(w)

" ".join(sp_tokenize('Мөнгөө тушаачихсаныхаа дараа мэдэгдээрэй'.lower()))

'▁мөнгөө ▁тушаа чихсан ыхаа ▁дараа ▁мэдэгд ээрэй'
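
In the output above, the `▁` character (U+2581) marks the start of a new word, while pieces without it continue the previous word. A minimal sketch of how the original text can be rebuilt from those pieces, using plain string operations and no SentencePiece model (the helper `join_pieces` is hypothetical, written only for this illustration):

```python
# Pieces copied from the tokenizer output above.
pieces = ['▁мөнгөө', '▁тушаа', 'чихсан', 'ыхаа', '▁дараа', '▁мэдэгд', 'ээрэй']


def join_pieces(pieces):
    """Concatenate the pieces, then turn each '▁' word marker into a space."""
    return "".join(pieces).replace("▁", " ").strip()


restored = join_pieces(pieces)
print(restored)
```

The result is the lowercased input sentence, which shows that the long word `тушаачихсаныхаа` was split into a stem and two suffix pieces rather than being dropped as an unknown token.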

## Train/Test split

In [22]:
df = pd.read_csv("eduge.csv")
df = df.rename(columns=lambda x: x.strip())

# show labels
print('labels', df['label'].unique().tolist())

# stratified train and test split
train, test = train_test_split(df, test_size=0.1, random_state=999, stratify=df['label'])

labels ['урлаг соёл', 'эдийн засаг', 'эрүүл мэнд', 'хууль', 'улс төр', 'спорт', 'технологи', 'боловсрол', 'байгал орчин']
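
Passing `stratify=df['label']` asks `train_test_split` to keep the label proportions the same in both parts. The effect is easiest to see on a small imbalanced example. The toy frame below is made-up data, not Eduge, and only reuses the column names `news` and `label`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70 documents of one class, 30 of another.
toy = pd.DataFrame({
    "news": [f"doc {i}" for i in range(100)],
    "label": ["спорт"] * 70 + ["хууль"] * 30,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.1, random_state=999, stratify=toy["label"]
)

# Both parts keep the 70/30 ratio of the full frame.
print(toy_train["label"].value_counts())
print(toy_test["label"].value_counts())
```

Running the same `value_counts()` check on `train['label']` and `test['label']` is a quick way to confirm that the real split is balanced across the nine categories.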


## Train SVM

Now train a linear SVM (an `SGDClassifier` with hinge loss) without any hyperparameter optimization, using mostly default parameters:

In [23]:
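# `n_iter` was removed from SGDClassifier in newer scikit-learn releases;
# max_iter=5 with tol=None runs the same fixed 5 epochs.
text_clf = Pipeline([('vect', CountVectorizer(tokenizer=sp_tokenize, lowercase=True)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4,
                                           max_iter=5, tol=None, random_state=0))])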

t = time.time()
text_clf = text_clf.fit(train['news'], train['label'])
t = time.time()-t
print("Training time in seconds: ", t)

t = time.time()
predicted = text_clf.predict(test['news'])
t = time.time()-t
print("Prediction time in seconds: ", t)

print("Feature count:", len(text_clf.named_steps['vect'].vocabulary_))
print("Classifier accuracy: ", np.mean(predicted == test['label']))



Training time in seconds:  158.1350929737091
Prediction time in seconds:  17.703580856323242
Feature count: 32231
Classifier accuracy:  0.9093432007400555
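
A single accuracy number can hide categories that are classified poorly. One possible follow-up, sketched here on made-up labels rather than the real predictions, is to look at a confusion matrix and per-class precision and recall with `sklearn.metrics`. The lists `y_true` and `y_pred` are hypothetical stand-ins for `test['label']` and `predicted`:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels standing in for test['label'] and predicted.
y_true = ["спорт", "спорт", "хууль", "хууль", "улс төр", "улс төр"]
y_pred = ["спорт", "спорт", "хууль", "улс төр", "улс төр", "улс төр"]

label_order = ["спорт", "хууль", "улс төр"]

# Rows are true labels, columns are predicted labels, both in `label_order`.
cm = confusion_matrix(y_true, y_pred, labels=label_order)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_true, y_pred, labels=label_order))
```

In the notebook itself the equivalent call would be `classification_report(test['label'], predicted)`, which reports these figures for each of the nine Eduge categories.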
