<a href="https://colab.research.google.com/github/seksonh/KMITL/blob/main/sample_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `wongnai-corpus` Classification Benchmark

We provide benchmarks for 5-star multi-class classification of [wongnai-corpus](https://github.com/wongnai/wongnai-corpus), chiefly [fastText](https://github.com/facebookresearch/fastText) and [ULMFit](https://github.com/cstorm125/thai2fit), alongside the other baselines listed in the table. For fastText and ULMFit, we first finetune the embeddings using all data (train and test reviews, without labels). All numbers are micro-averaged F1 scores on the public and private test sets of the [Wongnai Challenge](https://www.kaggle.com/c/wongnai-challenge-review-rating-prediction/data).

| Model     | Public Micro-F1 | Private Micro-F1 | 
|-----------|-----------------|------------------|
| [**ULMFit Knight**](https://www.facebook.com/photo.php?fbid=10215789035573261&set=pcb.795048317543327&type=3&theater&ifg=1) | **0.61109** | **0.62580** |
| [ULMFit](https://github.com/cstorm125/thai2fit/) | 0.59313          | 0.60322           |
| fastText | 0.5145          | 0.5109           |
| LinearSVC | 0.5022          | 0.4976           |
| Kaggle Score | 0.59139          | 0.58139          |
| [BERT](https://github.com/ThAIKeras/bert) | 0.56612 | 0.57057 |
| [USE](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) | 0.42688 | 0.41031 |
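
For a single-label multi-class task such as this one, micro-averaged F1 pools true positives, false positives and false negatives over all classes, which makes it equal to plain accuracy. The following is a minimal pure-Python sketch of that relationship; the function name `micro_f1` and the toy ratings are illustrative assumptions, not part of the benchmark code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multi-class predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool the per-class counts over all classes before computing P and R
    for c in labels:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy star ratings, invented for illustration only
y_true = [1, 2, 3, 4, 5, 5, 4, 3]
y_pred = [1, 3, 3, 4, 5, 4, 4, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = micro_f1(y_true, y_pred)
print(score, accuracy)
```

Every misclassified example counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall coincide.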

In [None]:
#uncomment if you are running from google colab
!pip install sklearn_crfsuite
!pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
!pip install fastai==1.0.45
!pip install tensorflow_text
!wget https://github.com/wongnai/wongnai-corpus/raw/master/review/review_dataset.zip; unzip review_dataset.zip
!mkdir wongnai_data; ls

In [None]:
!pip install fastai==2.0.15
!pip install fastai2==0.0.30
!pip install fastcore==1.0.16

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval
from tqdm import tqdm_notebook
from collections import Counter
import re

#viz
import matplotlib.pyplot as plt
import seaborn as sns

from fastai.text import *
from fastai.callbacks import CSVLogger

from pythainlp import word_tokenize

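import os
ft_data = 'ft_data/'
os.makedirs(ft_data, exist_ok=True)  #fastText input files are written here later

#assumed helper (not defined elsewhere in this notebook): collapse newlines
#so that each review occupies a single line in the fastText input files
def replace_newline(t):
    return re.sub(r'\n+', ' ', t)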

## Oversampling

We oversample classes `1` and `2` to 11 and 3 times their original size respectively (the original rows plus 10 and 2 extra copies) in order to balance the classes.

In [None]:
train_df = pd.read_csv('w_review_train.csv',sep=';',header=None).drop_duplicates()
train_df.columns = ['review','rating']
test_df = pd.read_csv('test_file.csv',sep=';')
test_df['rating'] = 0
all_df = pd.concat([pd.DataFrame(test_df['review']),\
                   pd.DataFrame(train_df['review'])]).reset_index(drop=True)

In [None]:
train_df.rating.value_counts() / train_df.shape[0]

In [None]:
two_df = pd.concat([train_df[train_df.rating==2].copy() for i in range(2)]).reset_index(drop=True)
one_df = pd.concat([train_df[train_df.rating==1].copy() for i in range(10)]).reset_index(drop=True)
train_bal = pd.concat([train_df,one_df,two_df]).reset_index(drop=True)
train_bal.rating.value_counts() / train_bal.shape[0]
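
To make the arithmetic explicit: concatenating the original rows with 10 extra copies of class `1` and 2 extra copies of class `2` multiplies those class counts by 11 and 3, while the other classes stay unchanged. A small sketch with hypothetical class counts (the numbers are invented for illustration, not taken from the corpus):

```python
from collections import Counter

# Hypothetical class counts, for illustration only
counts = Counter({1: 100, 2: 400, 3: 1200, 4: 1800, 5: 700})
extra_copies = {1: 10, 2: 2}  # same numbers of extra copies as in the cell above

# Each class ends up with its original rows plus the extra copies
balanced = Counter({c: n * (1 + extra_copies.get(c, 0)) for c, n in counts.items()})
total = sum(balanced.values())
for c in sorted(balanced):
    print(c, balanced[c], round(balanced[c] / total, 3))
```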

## [fastText](https://github.com/facebookresearch/fastText) Model

We used embeddings pretrained on the [Thai Wikipedia Dump](https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md) and finetuned them on all `wongnai-corpus` reviews (train and test) with the skipgram model. After that, we train a 5-class classifier.

| model    | micro_f1_public | micro_f1_private |
|----------|-----------------|------------------|
| fastText | 0.5145          | 0.5109           |
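
fastText's supervised mode reads one example per line, with the label marked by the `__label__` prefix, so newlines inside a review have to be removed before writing the file. The sketch below shows the line format and how a predicted label string can be turned back into an integer; the helper names `to_fasttext_line` and `parse_fasttext_label` are hypothetical and only mirror what the cells in this section do.

```python
import re

def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    one_line = re.sub(r'\s*\n+\s*', ' ', text).strip()
    return f'__label__{label} {one_line}'

def parse_fasttext_label(pred):
    """Recover the integer label from a predicted string such as '__label__4'."""
    return int(pred.split('__')[-1])

line = to_fasttext_line(4, 'อร่อยมาก\nบริการดี')
print(line)
print(parse_fasttext_label('__label__4'))
```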

In [None]:
df_txts = ['train','train_bal','test']
dfs = [train_df,train_bal,test_df]

#write one "__label__<rating> <review>" line per row for each split
for name,df in zip(df_txts,dfs):
    ft_lines = []
    for _,row in df.iterrows():
        ft_lab = f'__label__{row["rating"]}'
        ft_text = replace_newline(f'{row["review"]}')
        ft_lines.append(f'{ft_lab} {ft_text}')

    #the with-block closes the file; utf-8 is set explicitly for Thai text
    with open(f'{ft_data}{name}.txt','w',encoding='utf-8') as f:
        f.write('\n'.join(ft_lines))

In [None]:
#for fasttext embedding finetuning
ft_lines = []
for _,row in all_df.iterrows():
    ft_lab = '__label__0'
    ft_text = replace_newline(f'{row["review"]}')
    ft_line = f'{ft_lab} {ft_text}'
    ft_lines.append(ft_line)

doc = '\n'.join(ft_lines)
#the with-block closes the file; utf-8 is set explicitly for Thai text
with open(f'{ft_data}df_all.txt','w',encoding='utf-8') as f:
    f.write(doc)

In [None]:
#finetune with all data
!/home/charin/fastText-0.1.0/fasttext skipgram \
-pretrainedVectors 'model/wiki.th.vec' -dim 300 \
-input ft_data/df_all.txt -output 'model/finetuned'

In [None]:
#train classifier
!/home/charin/fastText-0.1.0/fasttext supervised \
-input 'ft_data/train_bal.txt' -output 'model/wongnai_bal' \
-pretrainedVectors 'model/finetuned.vec' -epoch 5 -dim 300 -wordNgrams 2 

In [None]:
#get prediction
preds = !/home/charin/fastText-0.1.0/fasttext predict 'model/wongnai_bal.bin' 'ft_data/test.txt'
pred_lab = np.array([int(i.split('__')[-1]) for i in preds])

In [None]:
submit_df = pd.DataFrame({'reviewID':test_df.reviewID,
                          'rating':pred_lab})
submit_df.head()
submit_df.to_csv('submit_fasttext_bal.csv',index=False)

## LinearSVC Model

Code for LinearSVC is provided by [@lukkiddd](https://github.com/lukkiddd).

| model     | micro_f1_public | micro_f1_private | 
|-----------|-----------------|------------------|
| LinearSVC | 0.5022          | 0.4976           |
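
Thai is written without spaces between words, so the reviews are word-segmented and re-joined with spaces before vectorization; the vectorizer then splits on whitespace and counts word unigrams and bigrams (`ngram_range=(1,2)`). The pure-Python sketch below illustrates which features that produces for one pre-tokenized sentence. The function `word_ngrams` is a hypothetical stand-in for illustration, not scikit-learn's implementation.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

# A review that has already been segmented into words and joined with spaces
tokens = 'อาหาร อร่อย มาก'.split()
features = word_ngrams(tokens)
print(features)
```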

In [None]:
X_train, y_train = train_bal['review'], train_bal['rating']
X_test = test_df['review']

In [None]:
import string
def process_text(text):
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in word_tokenize(nopunc, engine='ulmfit') if word and not re.search(pattern=r"\s+", string=word)]
def split_text(text):
    return text.split()

train_splits = []
test_splits = []
for i in tqdm_notebook(range(train_bal.shape[0])):
    train_splits.append(' '.join(process_text(train_bal['review'][i])))
for i in tqdm_notebook(range(test_df.shape[0])):
    test_splits.append(' '.join(process_text(test_df['review'][i])))

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

text_clf = Pipeline([
    ('vect', CountVectorizer(tokenizer=split_text, ngram_range=(1,2))),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])

#fit on the tokenized, space-joined reviews so that split_text recovers the word tokens
text_clf.fit(train_splits, y_train)

In [None]:
pred_lab = text_clf.predict(test_splits)

In [None]:
submit_df = pd.DataFrame({'reviewID':test_df.reviewID,
                          'rating':pred_lab})
submit_df.head()
submit_df.to_csv('submit_linearsvc.csv',index=False)

## [ULMFit](https://github.com/cstorm125/thai2fit) Model

| model     | micro_f1_public | micro_f1_private | 
|-----------|-----------------|------------------|
| **ULMFit** | **0.59313**          | **0.60322**           |

In [None]:
# #uncomment if you are running from google colab
# !pip install sklearn_crfsuite
# !pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
# !pip install fastai==1.0.45
# !wget https://github.com/wongnai/wongnai-corpus/raw/master/review/review_dataset.zip; unzip review_dataset.zip
# !mkdir wongnai_data; ls

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval
from tqdm import tqdm_notebook
from collections import Counter
import re

#viz
import matplotlib.pyplot as plt
import seaborn as sns

from fastai.text import *
from fastai.callbacks import CSVLogger

from pythainlp.ulmfit import *

model_path = 'wongnai_data/'

In [None]:
#process data
train_df = pd.read_csv('w_review_train.csv',sep=';',header=None).drop_duplicates()
train_df.columns = ['review','rating']
test_df = pd.read_csv('test_file.csv',sep=';')
test_df['rating'] = 0
all_df = pd.concat([pd.DataFrame(test_df['review']),\
                   pd.DataFrame(train_df['review'])]).reset_index(drop=True)
two_df = pd.concat([train_df[train_df.rating==2].copy() for i in range(2)]).reset_index(drop=True)
one_df = pd.concat([train_df[train_df.rating==1].copy() for i in range(10)]).reset_index(drop=True)
train_bal = pd.concat([train_df,one_df,two_df]).reset_index(drop=True)

### Finetune Language Model

In [None]:
# tt = Tokenizer(tok_func = ThaiTokenizer, lang = 'th', pre_rules = pre_rules_th, post_rules=post_rules_th)
# processor = [TokenizeProcessor(tokenizer=tt, chunksize=10000, mark_fields=False),
#             NumericalizeProcessor(vocab=None, max_vocab=60000, min_freq=3)]

# data_lm = (TextList.from_df(all_df, model_path, cols=['review'], processor=processor)
#     .random_split_by_pct(valid_pct = 0.01, seed = 1412)
#     .label_for_lm()
#     .databunch(bs=64))
# data_lm.sanity_check()
# data_lm.save('wongnai_lm.pkl')

In [None]:
data_lm = load_data(model_path,'wongnai_lm.pkl')
data_lm.sanity_check()
len(data_lm.train_ds), len(data_lm.valid_ds)

In [None]:
data_lm.show_batch(5)

In [None]:
next(iter(data_lm.train_dl))

In [None]:
config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1, qrnn=False, tie_weights=True, out_bias=True,
             output_p=0.25, hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
trn_args = dict(drop_mult=0.9, clip=0.12, alpha=2, beta=1)

learn = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained=False, **trn_args)

#load pretrained models
learn.load_pretrained(**_THWIKI_LSTM)

In [None]:
learn.predict('สวัสดีครับพี่น้องเสื้อ', 50, temperature=0.5)

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
len(learn.data.vocab.itos)

In [None]:
#train frozen
print('training frozen')
learn.freeze_to(-1)
learn.fit_one_cycle(1, 1e-3, moms=(0.8, 0.7))

In [None]:
#train unfrozen
print('training unfrozen')
learn.unfreeze()
learn.fit_one_cycle(10, 1e-4, moms=(0.8, 0.7))

In [None]:
learn.save('wongnai_lm')
learn.save_encoder('wongnai_enc')

### Classification

In [None]:
test_df.head()

In [None]:
tt = Tokenizer(tok_func = ThaiTokenizer, lang = 'th', pre_rules = pre_rules_th, post_rules=post_rules_th)
processor = [TokenizeProcessor(tokenizer=tt, chunksize=10000, mark_fields=False),
            NumericalizeProcessor(vocab=data_lm.vocab, max_vocab=60000, min_freq=3)]

data_cls = (TextList.from_df(train_bal, model_path, cols=['review'], processor=processor)
    .random_split_by_pct(valid_pct = 0.01, seed = 1412)
    .label_from_df('rating')
    .add_test(TextList.from_df(test_df, model_path, cols=['review'], processor=processor))
    .databunch(bs=32)
    )

data_cls.sanity_check()
data_cls.save('wongnai_cls.pkl')

In [None]:
#make sure we got the right number of vocab
len(data_cls.vocab.itos), len(data_lm.vocab.itos)

In [None]:
data_cls.show_batch(5)

In [None]:
config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1, qrnn=False,
             output_p=0.25, hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
trn_args = dict(bptt=70, drop_mult=0.5, alpha=2, beta=1)

learn = text_classifier_learner(data_cls, AWD_LSTM, config=config, pretrained=False, **trn_args)
learn.opt_func = partial(optim.Adam, betas=(0.7, 0.99))

#load pretrained finetuned model
learn.load_encoder('wongnai_enc')

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
# #train frozen (last layer group only)
# learn.freeze_to(-1)
# learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))

In [None]:
# #gradual unfreezing
# learn.freeze_to(-2)
# learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2), moms=(0.8, 0.7))
# learn.freeze_to(-3)
# learn.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3), moms=(0.8, 0.7))
# learn.unfreeze()
# learn.fit_one_cycle(1, slice(1e-3 / (2.6 ** 4), 1e-3), moms=(0.8, 0.7))
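
The `slice(lr / (2.6 ** 4), lr)` arguments in the commented-out cell above request discriminative learning rates: the last layer group trains at `lr` and earlier groups at progressively smaller rates. As we understand it, fastai v1 spreads such a slice geometrically across the layer groups. The sketch below reproduces that spacing in plain Python; `geometric_lrs` is a hypothetical helper, and `n_groups=5` is an assumption chosen so that adjacent groups differ by exactly a factor of 2.6 (the real number of groups depends on how the model is split).

```python
def geometric_lrs(lr_min, lr_max, n_groups):
    """Learning rates spaced by a constant ratio from lr_min to lr_max."""
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]

lr_max = 1e-2
# n_groups=5 is an assumption made for this illustration
lrs = geometric_lrs(lr_max / (2.6 ** 4), lr_max, n_groups=5)
for group, lr in enumerate(lrs):
    print(group, f'{lr:.2e}')
```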

```
epoch     train_loss  valid_loss  accuracy
1         1.167613    1.109780    0.479079
Total time: 08:22
epoch     train_loss  valid_loss  accuracy
1         0.982858    0.979201    0.560669
2         0.870348    0.834990    0.598326
3         0.752523    0.802491    0.629707
4         0.653818    0.715869    0.671548
5         0.559333    0.702696    0.682008
Total time: 46:22
```

### Submission

In [None]:
config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1, qrnn=False,
             output_p=0.25, hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
trn_args = dict(bptt=70, drop_mult=0.5, alpha=2, beta=1)

learn = text_classifier_learner(data_cls, AWD_LSTM, config=config, pretrained=False, **trn_args)
learn.opt_func = partial(optim.Adam, betas=(0.7, 0.99))

learn.load('wongnai_cls');

In [None]:
probs,y= learn.get_preds(DatasetType.Test, ordered=True)

In [None]:
preds = np.argmax(probs.numpy(),1) + 1
Counter(preds)
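
The `+ 1` above turns a class index into a star rating, which is only valid if the learner's class list is `[1, 2, 3, 4, 5]` in that order. A slightly more defensive variant looks the rating up in an explicit class list. The sketch below shows the idea in plain Python with made-up probabilities; `probs_to_ratings` and its default `classes` tuple are assumptions for illustration.

```python
def probs_to_ratings(prob_rows, classes=(1, 2, 3, 4, 5)):
    """Map each row of class probabilities to the rating of its most likely class."""
    ratings = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
        ratings.append(classes[best])
    return ratings

# Made-up probabilities for two reviews
toy_probs = [
    [0.05, 0.10, 0.15, 0.50, 0.20],
    [0.60, 0.20, 0.10, 0.05, 0.05],
]
toy_ratings = probs_to_ratings(toy_probs)
print(toy_ratings)
```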

In [None]:
submit_df = pd.DataFrame({'reviewID': test_df.reviewID,'rating':preds})
submit_df.head()
submit_df.to_csv('submit_ulmfit.csv',index=False)

## [Multilingual Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3)

In [None]:
import tensorflow_hub as hub
import tensorflow_text
import tensorflow as tf #tensorflow 2.1.0
import tqdm

enc = hub.load('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')

In [None]:
y_test, y_train = test_df['rating'], train_bal['rating']

In [None]:
X_trains = []
X_tests = []
bs = 10

In [None]:
#ceil division avoids an empty final batch when the row count is a multiple of bs
for i in tqdm.tqdm_notebook(range((y_test.shape[0]+bs-1)//bs)):
    X_tests.append(enc(test_df.review[(i*bs):((i+1)*bs)].tolist()).numpy())

In [None]:
#ceil division avoids an empty final batch when the row count is a multiple of bs
for i in tqdm.tqdm_notebook(range((y_train.shape[0]+bs-1)//bs)):
    X_trains.append(enc(train_bal.review[(i*bs):((i+1)*bs)].tolist()).numpy())

In [None]:
X_test = np.concatenate(X_tests,0)
X_train = np.concatenate(X_trains,0)
X_train.shape, X_test.shape
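
When encoding in fixed-size batches, the number of batches should be the ceiling of `n / bs`. The simpler expression `n // bs + 1` yields one extra, empty batch whenever `n` is an exact multiple of `bs`. A minimal sketch of the batching logic, using a hypothetical `batches` helper and placeholder strings instead of real reviews:

```python
def batches(items, bs):
    """Split items into consecutive chunks of at most bs elements, with no empty chunk."""
    n_batches = (len(items) + bs - 1) // bs  # integer ceil(len(items) / bs)
    return [items[i * bs:(i + 1) * bs] for i in range(n_batches)]

# Placeholder reviews, for illustration only
reviews = [f'review {i}' for i in range(20)]
chunks = batches(reviews, 10)
print(len(chunks))             # batches actually needed
print(len(reviews) // 10 + 1)  # what the simpler expression would give
```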

In [None]:
from sklearn.svm import LinearSVC

text_clf = LinearSVC(class_weight='balanced')
text_clf.fit(X_train, y_train)

In [None]:
pred_lab = text_clf.predict(X_test)

In [None]:
submit_df = pd.DataFrame({'reviewID':test_df.reviewID,
                          'rating':pred_lab})
submit_df.head()
submit_df.to_csv('submit_use.csv',index=False)