In [1]:
%%time
import malaya
import numpy as np

CPU times: user 4.47 s, sys: 1.01 s, total: 5.48 s
Wall time: 5.37 s


## Why lexicon

A lexicon is a curated list of words tied to a particular domain, for example words that signal negative or positive sentiment.

For example, the word `suka` (like) can represent positive sentiment. If `suka` appears in a sentence, we can say that the sentence carries positive sentiment.

Lexicon-based classification is a common and very fast way to classify text. It is also fairly naive, because a word can be semantically ambiguous.
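To make the idea concrete, here is a minimal sketch of lexicon-based classification. The word lists in `toy_lexicon` and the helper `classify_by_lexicon` are made up for illustration; they are not the contents of `malaya.lexicon.sentiment` and not part of the Malaya API.

```python
from collections import Counter

# Toy lexicon for illustration only; these word lists are assumptions,
# not the contents of malaya.lexicon.sentiment.
toy_lexicon = {
    'positive': ['suka', 'gembira', 'bagus'],
    'negative': ['benci', 'teruk', 'sedih'],
}


def classify_by_lexicon(text, lexicon, default='neutral'):
    """Count lexicon hits per label and return the label with the most hits."""
    tokens = text.lower().split()
    counts = Counter()
    for label, words in lexicon.items():
        vocabulary = set(words)
        counts[label] = sum(token in vocabulary for token in tokens)
    best_label, best_count = counts.most_common(1)[0]
    # Fall back to the default when nothing matched or the top labels tie.
    tied = sum(1 for count in counts.values() if count == best_count) > 1
    if best_count == 0 or tied:
        return default
    return best_label


print(classify_by_lexicon('saya suka makanan ini', toy_lexicon))
print(classify_by_lexicon('saya tidak suka makanan ini', toy_lexicon))
```

Both sentences come out with the same label, even though the second one contains the negation `tidak`. That is the kind of ambiguity plain word matching cannot handle.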

## Sentiment lexicon

Malaya provides a small sample sentiment lexicon:

In [6]:
sentiment_lexicon = malaya.lexicon.sentiment
sentiment_lexicon.keys()

dict_keys(['negative', 'positive'])

## Emotion lexicon

Malaya provides a small sample emotion lexicon:

In [3]:
emotion_lexicon = malaya.lexicon.emotion
emotion_lexicon.keys()

dict_keys(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'])

## Lexicon generator

Building a lexicon is time consuming because it requires domain experts to populate the words for each domain. With the help of word vectors, we can induce words for specific domains from a small annotated lexicon. Why induce a lexicon from word vectors? Even though a word such as `suka` commonly represents positive sentiment, if the word vectors learnt `suka` in contexts of a different polarity, and its nearest words also carry that different polarity, then `suka` will tend towards negative sentiment.

Malaya provides a lexicon-inducing interface built on top of [Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora](https://arxiv.org/pdf/1606.02820.pdf).

Let's say you have a lexicon based on standard language, or `bahasa baku`, and you want to find a similar lexicon for the social media context. You can use the `malaya.lexicon` interface for this. To use it, we must first initialise a word vector object with `malaya.wordvector.load`.

You also need at least a small sample lexicon like this:

```python
{'label1': ['word1', 'word2'], 'label2': ['word3', 'word4']}
```

There can be more than 2 labels; for example, `malaya.lexicon.emotion` has 6 different labels.
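Before passing a seed lexicon to the generator, it can help to check its shape and see which seed words are missing from the word vector vocabulary, since the docstrings below say a missing word raises an exception unless `soft = True`. The helper `check_seed_lexicon`, the two-label minimum, and the toy vocabulary here are assumptions for illustration, not part of Malaya.

```python
def check_seed_lexicon(lexicon, vocabulary):
    """Validate the seed lexicon shape and list seed words missing from the vocabulary."""
    # Requiring at least two labels is an assumption of this sketch.
    if not isinstance(lexicon, dict) or len(lexicon) < 2:
        raise ValueError('lexicon must be a dict with at least two labels')
    known = set(vocabulary)
    missing = {}
    for label, words in lexicon.items():
        if not words or not all(isinstance(word, str) for word in words):
            raise ValueError(f'label {label!r} must map to a non-empty list of strings')
        absent = [word for word in words if word not in known]
        if absent:
            missing[label] = absent
    return missing


# Toy vocabulary and seed lexicon, made up for this example.
toy_vocabulary = ['suka', 'benci', 'sedih']
toy_seed = {'positive': ['suka', 'gembira'], 'negative': ['benci']}
missing_words = check_seed_lexicon(toy_seed, toy_vocabulary)
print(missing_words)
```

With a real word vector you would pass its vocabulary instead of `toy_vocabulary`.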

In [5]:
vocab, embedded = malaya.wordvector.load_social_media()
wordvector = malaya.wordvector.load(embedded, vocab)

## Random walk

Random walk is the main technique used in the paper; you can read more in [3.2 Propagating polarities from a seed set](https://arxiv.org/abs/1606.02820).

```python

def random_walk(
    lexicon,
    wordvector,
    pool_size = 10,
    top_n = 20,
    similarity_power = 100.0,
    beta = 0.9,
    arccos = True,
    normalization = True,
    soft = False,
    silent = False,
):

    """
    Induce lexicon by using the random walk technique used in the paper, https://arxiv.org/pdf/1606.02820.pdf

    Parameters
    ----------

    lexicon: dict
        curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.
    wordvector: object
        wordvector interface object.
    pool_size: int, optional (default=10)
        pick top-pool size from each lexicons.
    top_n: int, optional (default=20)
        top_n for each vector will be multiplied by `similarity_power`.
    similarity_power: float, optional (default=100.0)
        extra score for `top_n`; a lower value induces less bias but has a higher chance of an unbalanced outcome.
    beta: float, optional (default=0.9)
        penalty score, closer to 1.0 means less penalty. 0 < beta < 1.
    arccos: bool, optional (default=True)
        covariance distribution for embedded.dot(embedded.T). If false, covariance + 1.
    normalization: bool, optional (default=True)
        normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest word by Jaro-Winkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    silent: bool, optional (default=False)
        if True, will not print any logs.
        
    Returns
    -------
    tuple: (labels[argmax(scores), axis = 1], scores, labels)
    
    """

```
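The update rule from section 3.2 of the paper, `p <- beta * T.dot(p) + (1 - beta) * s`, can be sketched in NumPy as below. This is not Malaya's implementation: `random_walk_sketch`, the 2-D toy vectors, the fixed iteration count, and the symmetric normalisation of the edge matrix are assumptions for illustration. Only `beta`, the arccos edge weighting, and the L2 normalisation mirror the parameters described above.

```python
import numpy as np


def random_walk_sketch(vectors, seed_indices, beta=0.9, n_iter=50):
    """Random walk with restart from a seed set over a word similarity graph."""
    # L2-normalise so the dot product becomes cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosine = np.clip(unit @ unit.T, -1.0, 1.0)
    # Edge weight arccos(-cosine): close to pi for similar words, 0 for opposite ones.
    edges = np.arccos(-cosine)
    np.fill_diagonal(edges, 0.0)
    # Symmetric normalisation D^-1/2 E D^-1/2 (an assumption of this sketch).
    inv_sqrt_degree = 1.0 / np.sqrt(edges.sum(axis=1))
    transition = edges * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]
    # Restart distribution: uniform over the seed words.
    seeds = np.zeros(len(vectors))
    seeds[seed_indices] = 1.0 / len(seed_indices)
    scores = np.full(len(vectors), 1.0 / len(vectors))
    for _ in range(n_iter):
        scores = beta * transition @ scores + (1.0 - beta) * seeds
    return scores


# Toy 2-D "word vectors": rows 0-2 point one way, rows 3-5 the opposite way.
toy_vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.3],
    [-1.0, 0.1], [-0.9, 0.2], [-0.8, 0.3],
])
positive_scores = random_walk_sketch(toy_vectors, [0])
negative_scores = random_walk_sketch(toy_vectors, [3])
# Final polarity: share of the positive walk in the total score of each word.
polarity = positive_scores / (positive_scores + negative_scores)
print(polarity.round(3))
```

Words near the positive seed end up with a polarity above 0.5 and words near the negative seed below it. Running one walk per label and comparing the scores is how the sketch extends to more than 2 labels.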

In [5]:
%%time

results, scores, labels = malaya.lexicon.random_walk(sentiment_lexicon, wordvector, pool_size = 5)

populating nearest words from wordvector
populating vectors from populated nearest words
random walking from populated vectors 

CPU times: user 1min 36s, sys: 16.1 s, total: 1min 52s
Wall time: 28.1 s


In [6]:
np.unique(list(results.values()), return_counts = True)

(array(['negative', 'positive'], dtype='<U8'), array([2260, 2922]))
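The docstring describes the return value as `(labels[argmax(scores), axis = 1], scores, labels)`. The sketch below shows that relationship on made-up numbers, assuming `scores` has one row per word and one column per label; the words, scores, and the 0.2 margin threshold are illustrative assumptions, not output from Malaya.

```python
import numpy as np

# Made-up example values; the layout of `scores` is an assumption.
toy_labels = ['negative', 'positive']
toy_words = ['luka', 'keyakinan', 'debu']
toy_scores = np.array([
    [0.80, 0.20],
    [0.30, 0.70],
    [0.55, 0.45],
])

# The label of each word is the column with the highest score.
best = toy_scores.argmax(axis=1)
toy_results = {word: toy_labels[index] for word, index in zip(toy_words, best)}

# Margin between the best and second-best score, used as a rough confidence.
ordered = np.sort(toy_scores, axis=1)
margin = ordered[:, -1] - ordered[:, -2]
confident = {
    word: label
    for (word, label), gap in zip(toy_results.items(), margin)
    if gap >= 0.2
}
print(toy_results)
print(confident)
```

Filtering by margin is one way to drop words that sit between two labels before adding them to a lexicon.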

In [7]:
results

{'serang': 'negative',
 'cilegon': 'positive',
 'culik': 'negative',
 'tanjungpinang': 'positive',
 'jenguk': 'negative',
 'luka': 'negative',
 'jerawat': 'negative',
 'infeksi': 'negative',
 'migrain': 'negative',
 'penyakit': 'negative',
 'penaklukan': 'negative',
 '4ir': 'positive',
 'renjer': 'positive',
 'kezhaliman': 'positive',
 'proklamator': 'positive',
 'kelucahan': 'negative',
 'pablisiti': 'positive',
 'terjwp': 'positive',
 '33100': 'positive',
 'impos': 'positive',
 'kritikan': 'negative',
 'mandat': 'negative',
 'teguran': 'negative',
 'persepsi': 'negative',
 'pembelaan': 'negative',
 'muflis': 'negative',
 'mempelajarinya': 'negative',
 'melarat': 'positive',
 'dihabisi': 'positive',
 'kooperatif': 'positive',
 'kelemahan': 'negative',
 'keyakinan': 'positive',
 'kehendak': 'negative',
 'keburukan': 'negative',
 'gerombolan': 'negative',
 'kelakuan': 'negative',
 'antek': 'negative',
 'politikus': 'negative',
 'ulah': 'negative',
 'debu': 'negative',
 'kotoran': 'negat

In [8]:
%%time

results_emotion, scores_emotion, labels_emotion = malaya.lexicon.random_walk(emotion_lexicon, 
                                                                             wordvector,
                                                                             pool_size = 10)

populating nearest words from wordvector
populating vectors from populated nearest words
random walking from populated vectors 

CPU times: user 5.9 s, sys: 3.13 s, total: 9.03 s
Wall time: 1.5 s


In [9]:
np.unique(list(results_emotion.values()), return_counts = True)

(array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='<U8'),
 array([ 76, 156,  14, 132,  40,  34]))
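Since `results_emotion` maps each word to a label, it can be turned back into the `{'label': [words]}` shape used for seed lexicons. A minimal sketch, where `group_by_label` is a hypothetical helper and `sample_results` holds a few pairs copied from the random walk output in this section:

```python
from collections import defaultdict


def group_by_label(results):
    """Invert a word -> label mapping into a label -> [words] lexicon."""
    grouped = defaultdict(list)
    for word, label in results.items():
        grouped[label].append(word)
    return dict(grouped)


# A few pairs taken from the random walk output in this section.
sample_results = {
    'sebal': 'anger',
    'scary': 'fear',
    'berang': 'anger',
    'nervous': 'fear',
}
induced_lexicon = group_by_label(sample_results)
print(induced_lexicon)
```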

In [10]:
results_emotion

{'sebal': 'anger',
 'gesture': 'anger',
 'se7': 'anger',
 'ziraa': 'love',
 'mantepp': 'love',
 'mesem': 'love',
 'nggapapa': 'love',
 'maen2': 'love',
 'gacocok': 'anger',
 'jeongwoo': 'love',
 'bergelora': 'anger',
 'mereda': 'anger',
 'skeptis': 'anger',
 'gebus': 'love',
 'tyrion': 'love',
 'memuncak': 'anger',
 'mewabah': 'love',
 'mengenaskan': 'anger',
 'kesasar': 'love',
 'kepedean': 'love',
 'annoying': 'anger',
 'awkward': 'fear',
 'scary': 'fear',
 'handsome': 'fear',
 'nervous': 'fear',
 'cringe': 'fear',
 'menyampah': 'fear',
 'kelakar': 'fear',
 'cute': 'fear',
 'cuak': 'fear',
 'bodoh': 'anger',
 'bangang': 'anger',
 'bebal': 'anger',
 'bodo': 'fear',
 'noob': 'fear',
 'bengap': 'fear',
 'celaka': 'fear',
 'biadap': 'fear',
 'pukimak': 'fear',
 'berang': 'anger',
 'buru': 'anger',
 'nerus': 'anger',
 'kangsar': 'anger',
 'lipis': 'anger',
 'pilah': 'anger',
 'besut': 'anger',
 'krai': 'anger',
 'klawang': 'anger',
 'ketil': 'anger',
 'amuk': 'anger',
 'mbatin': 'love',
 

## Propagate probabilistic


```python

def propagate_probabilistic(
    lexicon,
    wordvector,
    pool_size = 10,
    top_n = 20,
    similarity_power = 10.0,
    arccos = True,
    normalization = True,
    soft = False,
    silent = False,
):

    """
    Learns polarity scores via standard label propagation from lexicon sets.

    Parameters
    ----------

    lexicon: dict
        curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.
    wordvector: object
        wordvector interface object.
    pool_size: int, optional (default=10)
        pick top-pool size from each lexicons.
    top_n: int, optional (default=20)
        top_n for each vector will be multiplied by `similarity_power`.
    similarity_power: float, optional (default=10.0)
        extra score for `top_n`; a lower value induces less bias but has a higher chance of an unbalanced outcome.
    arccos: bool, optional (default=True)
        covariance distribution for embedded.dot(embedded.T). If false, covariance + 1.
    normalization: bool, optional (default=True)
        normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest word by Jaro-Winkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    silent: bool, optional (default=False)
        if True, will not print any logs.

    Returns
    -------
    tuple: (labels[argmax(scores), axis = 1], scores, labels)
    """

```
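Standard label propagation repeatedly replaces each word's label distribution with the weighted average of its neighbours while keeping the seed words fixed. The sketch below shows that idea on toy data. `label_propagation_sketch`, the arccos edge weights, the iteration count, and the toy vectors are assumptions for illustration, not Malaya's implementation.

```python
import numpy as np


def label_propagation_sketch(vectors, seeds, n_labels, n_iter=50):
    """Propagate label probabilities from clamped seed words over a similarity graph.

    `seeds` maps a row index in `vectors` to a label index.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosine = np.clip(unit @ unit.T, -1.0, 1.0)
    edges = np.arccos(-cosine)
    np.fill_diagonal(edges, 0.0)
    # Row-normalise so each row is a probability distribution over neighbours.
    transition = edges / edges.sum(axis=1, keepdims=True)
    # Start every word with a uniform distribution over labels.
    probs = np.full((len(vectors), n_labels), 1.0 / n_labels)
    for _ in range(n_iter):
        probs = transition @ probs
        # Clamp seed words back to their known label after every step.
        for index, label in seeds.items():
            probs[index] = 0.0
            probs[index, label] = 1.0
    return probs


# Toy 2-D "word vectors": rows 0-2 point one way, rows 3-5 the opposite way.
toy_vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.3],
    [-1.0, 0.1], [-0.9, 0.2], [-0.8, 0.3],
])
toy_probs = label_propagation_sketch(toy_vectors, {0: 0, 3: 1}, n_labels=2)
print(toy_probs.round(3))
```

Each row of `toy_probs` stays a probability distribution, and the words closest to a seed lean towards that seed's label.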

In [11]:
%%time

results_emotion, scores_emotion, labels_emotion = malaya.lexicon.propagate_probabilistic(emotion_lexicon, 
                                                                             wordvector,
                                                                             pool_size = 10)

populating nearest words from wordvector
populating vectors from populated nearest words
propagating probabilistic from populated vectors 

CPU times: user 5.64 s, sys: 2.05 s, total: 7.68 s
Wall time: 1.29 s


In [12]:
np.unique(list(results_emotion.values()), return_counts = True)

(array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='<U8'),
 array([315,  66,  10,  21,  28,  12]))

In [13]:
results_emotion

{'sebal': 'anger',
 'gesture': 'anger',
 'se7': 'anger',
 'ziraa': 'anger',
 'mantepp': 'anger',
 'mesem': 'anger',
 'nggapapa': 'anger',
 'maen2': 'anger',
 'gacocok': 'anger',
 'jeongwoo': 'anger',
 'bergelora': 'anger',
 'mereda': 'anger',
 'skeptis': 'anger',
 'gebus': 'anger',
 'tyrion': 'anger',
 'memuncak': 'anger',
 'mewabah': 'anger',
 'mengenaskan': 'anger',
 'kesasar': 'anger',
 'kepedean': 'anger',
 'annoying': 'anger',
 'awkward': 'fear',
 'scary': 'fear',
 'handsome': 'anger',
 'nervous': 'fear',
 'cringe': 'fear',
 'menyampah': 'fear',
 'kelakar': 'anger',
 'cute': 'anger',
 'cuak': 'fear',
 'bodoh': 'anger',
 'bangang': 'anger',
 'bebal': 'anger',
 'bodo': 'anger',
 'noob': 'anger',
 'bengap': 'anger',
 'celaka': 'anger',
 'biadap': 'anger',
 'pukimak': 'anger',
 'berang': 'anger',
 'buru': 'anger',
 'nerus': 'anger',
 'kangsar': 'anger',
 'lipis': 'anger',
 'pilah': 'anger',
 'besut': 'anger',
 'krai': 'anger',
 'klawang': 'anger',
 'ketil': 'anger',
 'amuk': 'anger',


## Propagate graph

```python

def propagate_graph(
    lexicon,
    wordvector,
    pool_size = 10,
    top_n = 20,
    similarity_power = 10.0,
    normalization = True,
    soft = False,
    silent = False,
):

    """
    Graph propagation method adapted from Velikovich, Leonid, et al. "The viability of web-derived polarity lexicons." http://www.aclweb.org/anthology/N10-1119

    Parameters
    ----------

    lexicon: dict
        curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.
    wordvector: object
        wordvector interface object.
    pool_size: int, optional (default=10)
        pick top-pool size from each lexicons.
    top_n: int, optional (default=20)
        top_n for each vector will be multiplied by `similarity_power`.
    similarity_power: float, optional (default=10.0)
        extra score for `top_n`; a lower value induces less bias but has a higher chance of an unbalanced outcome.
    normalization: bool, optional (default=True)
        normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest word by Jaro-Winkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    silent: bool, optional (default=False)
        if True, will not print any logs.

    Returns
    -------
    tuple: (labels[argmax(scores), axis = 1], scores, labels)
    """
```
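As a rough sketch of the idea in Velikovich et al., each word is scored by the best (maximum product) path from every seed over a graph that keeps only each word's nearest neighbours, and the negative total is rescaled so both polarities carry equal weight. The version below is a simplified two-label reading of that method. `best_path_weights`, `graph_propagation_sketch`, `max_steps`, and the toy vectors are assumptions for illustration and do not reflect how Malaya implements `propagate_graph`.

```python
import numpy as np


def best_path_weights(weights, seed, max_steps=3):
    """Best max-product path weight from `seed` to every node within `max_steps` edges."""
    alpha = np.zeros(len(weights))
    alpha[seed] = 1.0
    for _ in range(max_steps):
        # Extend every known path by one edge and keep the best one per target.
        extended = (alpha[:, None] * weights).max(axis=0)
        alpha = np.maximum(alpha, extended)
    return alpha


def graph_propagation_sketch(vectors, positive_seeds, negative_seeds, top_n=2, max_steps=3):
    """Score words by best-path reachability from positive and negative seeds."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosine = unit @ unit.T
    np.fill_diagonal(cosine, 0.0)
    cosine = np.clip(cosine, 0.0, None)  # ignore dissimilar pairs
    # Keep only the top_n most similar neighbours of each word as edges.
    weights = np.zeros_like(cosine)
    nearest = np.argsort(-cosine, axis=1)[:, :top_n]
    rows = np.arange(len(vectors))[:, None]
    weights[rows, nearest] = cosine[rows, nearest]
    positive = sum(best_path_weights(weights, seed, max_steps) for seed in positive_seeds)
    negative = sum(best_path_weights(weights, seed, max_steps) for seed in negative_seeds)
    # Rescale the negative side so both polarities have the same total mass.
    balance = positive.sum() / negative.sum()
    return positive - balance * negative


# Toy 2-D "word vectors": rows 0-2 point one way, rows 3-5 the opposite way.
toy_vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.3],
    [-1.0, 0.1], [-0.9, 0.2], [-0.8, 0.3],
])
toy_polarity = graph_propagation_sketch(toy_vectors, [0], [3])
print(toy_polarity.round(3))
```

A positive score means the word is more reachable from the positive seeds, a negative score the opposite.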

In [14]:
%%time

results_emotion, scores_emotion, labels_emotion = malaya.lexicon.propagate_graph(emotion_lexicon, 
                                                                             wordvector,
                                                                             pool_size = 10)

populating nearest words from wordvector
populating vectors from populated nearest words
propagate graph from populated nearest words


100%|██████████| 452/452 [00:00<00:00, 1830.24it/s]

CPU times: user 16.5 s, sys: 2.2 s, total: 18.7 s
Wall time: 11.8 s





In [15]:
np.unique(list(results_emotion.values()), return_counts = True)

(array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='<U8'),
 array([149,  61,  49,  69,  46,  78]))

In [16]:
results_emotion

{'sebal': 'anger',
 'gesture': 'fear',
 'se7': 'anger',
 'ziraa': 'anger',
 'mantepp': 'anger',
 'mesem': 'fear',
 'nggapapa': 'anger',
 'maen2': 'anger',
 'gacocok': 'fear',
 'jeongwoo': 'anger',
 'bergelora': 'anger',
 'mereda': 'anger',
 'skeptis': 'anger',
 'gebus': 'love',
 'tyrion': 'fear',
 'memuncak': 'anger',
 'mewabah': 'anger',
 'mengenaskan': 'anger',
 'kesasar': 'love',
 'kepedean': 'anger',
 'annoying': 'anger',
 'awkward': 'fear',
 'scary': 'fear',
 'handsome': 'love',
 'nervous': 'fear',
 'cringe': 'anger',
 'menyampah': 'anger',
 'kelakar': 'anger',
 'cute': 'love',
 'cuak': 'fear',
 'bodoh': 'anger',
 'bangang': 'anger',
 'bebal': 'anger',
 'bodo': 'anger',
 'noob': 'anger',
 'bengap': 'anger',
 'celaka': 'anger',
 'biadap': 'anger',
 'pukimak': 'anger',
 'berang': 'anger',
 'buru': 'joy',
 'nerus': 'anger',
 'kangsar': 'fear',
 'lipis': 'anger',
 'pilah': 'fear',
 'besut': 'anger',
 'krai': 'anger',
 'klawang': 'anger',
 'ketil': 'anger',
 'amuk': 'anger',
 'mbatin':