In [1]:
%%time
import malaya

CPU times: user 4.82 s, sys: 1.21 s, total: 6.03 s
Wall time: 7.59 s


In [2]:
# some text examples copied from Twitter

string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
string2 = 'Husein ska mkn aym dkat kampng Jawa'
string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'
string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'
string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
string6 = 'blh bntg dlm kls nlp sy, nnti intch'

## Load probability speller

The probability speller extends Peter Norvig's spelling corrector, http://norvig.com/spell-correct.html.

It is improved with some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews.

It also adds custom vowel and consonant augmentation to adapt to local short forms and typos.
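To make the idea concrete, here is a minimal sketch of Norvig-style candidate generation plus a simple vowel insertion step. This is not Malaya's implementation: `vowel_insertions` and `max_inserts` are hypothetical names, and the real augmentation rules may differ.

```python
import string

VOWELS = "aeiou"

def edits1(word, letters=string.ascii_lowercase):
    """All strings one edit away from `word`, as in Norvig's corrector."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def vowel_insertions(word, max_inserts=2):
    """Hypothetical augmentation: insert up to `max_inserts` vowels at any position."""
    results = {word}
    frontier = {word}
    for _ in range(max_inserts):
        frontier = {w[:i] + v + w[i:]
                    for w in frontier
                    for i in range(len(w) + 1)
                    for v in VOWELS}
        results |= frontier
    return results

# Short forms that drop vowels become reachable once vowels are re-inserted.
print("saya" in vowel_insertions("sy"))
print("makan" in vowel_insertions("mkn"))
```

Plain one-edit candidates cannot turn `sy` into `saya`, since two letters are missing. Re-inserting vowels is one way to cover short forms of this kind.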

In [3]:
prob_corrector = malaya.spell.probability()

#### To correct a word

In [4]:
prob_corrector.correct('sy')

'saya'

In [5]:
prob_corrector.correct('mhthir')

'mahathir'

In [6]:
prob_corrector.correct('mknn')

'makanan'

#### List possible generated pool of words

In [7]:
prob_corrector.edit_candidates('mhthir')

{'mahathir'}

In [8]:
prob_corrector.edit_candidates('smbng')

{'sambang',
 'sambong',
 'sambung',
 'sembang',
 'sembong',
 'sembung',
 'simbang',
 'smbg',
 'sombong',
 'sumbang',
 'sumbing'}

As you can see, `edit_candidates` suggests quite a lot of candidates, and some of them, like `sambang`, are not actual words. To reduce that, we can use [sentencepiece](https://github.com/google/sentencepiece) to check whether a candidate is a legitimate word in the Malaysian context.

In [9]:
prob_corrector_sp = malaya.spell.probability(sentence_piece = True)
prob_corrector_sp.edit_candidates('smbng')

{'sambong',
 'sambung',
 'sembang',
 'sembong',
 'sembung',
 'smbg',
 'sombong',
 'sumbang',
 'sumbing'}

**So how does the model know which word to pick? The highest count from Wikipedia!**
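The selection rule can be sketched in a few lines. The counts below are invented for illustration and are not Malaya's real corpus statistics; `WORD_COUNTS` and `pick` are hypothetical names.

```python
from collections import Counter

# Hypothetical corpus counts, made up for this example only.
WORD_COUNTS = Counter({"sembang": 120, "sambung": 950, "sombong": 310, "sumbang": 400})

def pick(candidates, counts=WORD_COUNTS):
    """Return the candidate with the highest corpus count, or None if none is known."""
    known = [c for c in candidates if counts[c] > 0]
    if not known:
        return None
    return max(known, key=lambda c: counts[c])

print(pick({"sambung", "sembang", "smbg"}))
```

A rule like this ignores the surrounding words entirely, so the same misspelling always maps to the same correction.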

#### To correct a sentence

In [10]:
prob_corrector.correct_text(string1)

'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

In [11]:
prob_corrector.correct_text(string2)

'Husein suka makan ayam dekat kampung Jawa'

In [12]:
prob_corrector.correct_text(string3)

'Melayu malas ni narration dia sama sahaja macam men are trash. True to some, false to some.'

In [13]:
prob_corrector.correct_text(string4)

'Tapi tak fikir ke bahaya perpetuate myths macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'

In [14]:
prob_corrector.correct_text(string5)

'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'

In [15]:
prob_corrector.correct_text(string6)

'boleh bintang dalam kelas nlp saya, nanti intch'

## Load transformer speller

This spelling correction is transformer based, an improved version of `malaya.spell.probability`. The problem with `malaya.spell.probability` is that it naively picks the highest-probability word based on public sentences (wiki, news and social media) without understanding the actual context. For example,

```python
string = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
prob_corrector = malaya.spell.probability()
prob_corrector.correct_text(string)
# -> 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'
```

It should have replaced `skt` with `sikit`, a common social media word meaning "a little" that modifies `pencen awal` here. To fix that, we can use a Transformer model! **Right now the transformer speller supports `BERT` and `ALBERT` only; XLNET is not that good.**
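The difference between frequency-only and context-aware selection can be illustrated with a toy example. This sketch uses neighbouring-word (bigram) counts as a stand-in for a transformer; it is not how BERT scores candidates, and every count and function name below is invented for illustration.

```python
# Hypothetical counts, invented for this example only.
UNIGRAMS = {"sakit": 900, "sikit": 400}
BIGRAMS = {
    ("awal", "sikit"): 30,
    ("sikit", "kepada"): 12,
    ("awal", "sakit"): 1,
}

def by_frequency(candidates):
    """Context-free choice: the most frequent word wins."""
    return max(candidates, key=lambda w: UNIGRAMS.get(w, 0))

def by_context(prev_word, candidates, next_word):
    """Context-aware choice: score each candidate against its neighbours."""
    def score(w):
        return BIGRAMS.get((prev_word, w), 0) + BIGRAMS.get((w, next_word), 0)
    return max(candidates, key=score)

candidates = ["sakit", "sikit"]
print(by_frequency(candidates))                  # picks the globally frequent word
print(by_context("awal", candidates, "kepada"))  # picks the word that fits the sentence
```

A transformer plays the role of `score` here, but it judges each candidate against the whole sentence rather than only the two adjacent words.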

In [16]:
model = malaya.transformer.load(model = 'bert')
transformer_corrector = malaya.spell.transformer(model, sentence_piece = True)






INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/bert-model/base/bert-base-v3/model.ckpt


In [17]:
transformer_corrector.correct_text(string1)

'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

Perfect! But again, the transformer model is very expensive! You can compare its wall time with the probability-based speller.

In [18]:
%%time
transformer_corrector.correct_text(string1)

CPU times: user 55.1 s, sys: 2.07 s, total: 57.1 s
Wall time: 10.6 s


'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

In [19]:
%%time
prob_corrector.correct_text(string1)

CPU times: user 111 ms, sys: 8.72 ms, total: 119 ms
Wall time: 119 ms


'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'
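Outside a notebook, where `%%time` is not available, a similar comparison can be made with the standard library. This is a small sketch; `timed` and `repeats` are hypothetical names, not part of Malaya.

```python
import time

def timed(fn, *args, repeats=3):
    """Call `fn(*args)` `repeats` times; return (last result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# Assumed usage with the correctors defined earlier:
# text, seconds = timed(prob_corrector.correct_text, string1)
# text, seconds = timed(transformer_corrector.correct_text, string1, repeats=1)
result, seconds = timed(str.upper, "krajaan")
print(result, seconds)
```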

In [20]:
transformer_corrector.correct_text(string2)

'Husein suka mkn ayam dekat kampung Jawa'

In [21]:
prob_corrector.correct_text(string2)

'Husein suka makan ayam dekat kampung Jawa'

## Load symspell speller

This spelling correction is an improved version of [symspellpy](https://github.com/mammothb/symspellpy), adapted to local short forms and typos. Before you can use this spelling correction, you need to install [symspellpy](https://github.com/mammothb/symspellpy),

```bash
pip install symspellpy
```
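For intuition, SymSpell is built around a symmetric delete index: deletions of dictionary words are precomputed, and a lookup only generates deletions of the input term. The sketch below is a simplified pure-Python illustration, not the symspellpy API or Malaya's code. It skips the final edit-distance verification that a real implementation performs, and the counts are invented.

```python
from collections import defaultdict
from itertools import combinations

def deletes(word, max_distance=2):
    """All strings made by deleting up to `max_distance` characters from `word`."""
    out = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out

def build_index(dictionary, max_distance=2):
    """Map every deletion variant back to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index

def lookup(term, index, counts, max_distance=2):
    """Collect words sharing a deletion variant with `term`, most frequent first."""
    matches = set()
    for variant in deletes(term, max_distance):
        matches |= index.get(variant, set())
    return sorted(matches, key=lambda w: -counts[w])

# Hypothetical counts, invented for this example only.
counts = {"marah": 3, "merah": 5, "murah": 1, "makan": 4}
index = build_index(counts)
print(lookup("mrh", index, counts))  # ['merah', 'marah', 'murah']
```

Because only deletions are generated, a lookup avoids enumerating inserts and replacements over the whole alphabet, which is what makes this approach fast.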

In [23]:
symspell_corrector = malaya.spell.symspell()

#### To correct a word

In [24]:
symspell_corrector.correct('bntng')

'bintang'

In [25]:
symspell_corrector.correct('kerajaan')

'kerajaan'

In [26]:
symspell_corrector.correct('mknn')

'makanan'

#### List possible generated words

In [27]:
symspell_corrector.edit_step('mrh')

{'marah': 12684.0,
 'merah': 21448.5,
 'arah': 15066.5,
 'darah': 10003.0,
 'mara': 7504.5,
 'malah': 7450.0,
 'zarah': 3753.5,
 'murah': 3575.5,
 'barah': 2707.5,
 'march': 2540.5,
 'martha': 390.0,
 'marsha': 389.0,
 'maratha': 88.5,
 'marcha': 22.5,
 'karaha': 13.5,
 'maraba': 13.5,
 'varaha': 11.5,
 'marana': 4.5,
 'marama': 4.5}

#### To correct a sentence

In [28]:
symspell_corrector.correct_text(string1)

'kerajaan patut bagi pencen awal saat kepada warga emas supaya emosi'

In [29]:
symspell_corrector.correct_text(string2)

'Husein suka makan ayam dapat kampung Jawa'

In [30]:
symspell_corrector.correct_text(string3)

'Melayu malas ni narration dia sama sahaja macam men are trash. True to some, false to some.'

In [31]:
symspell_corrector.correct_text(string4)

'Tapi tak fikir ke bahaya perpetuate maathai macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'

In [32]:
symspell_corrector.correct_text(string5)

'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 aras time after a career of being an Engineer, Project Manager, General Manager'

In [33]:
symspell_corrector.correct_text(string6)

'boleh bintang dalam kelas malaya saya, nanti mintalah'