# Lexical-Sample Supervised Word Sense Disambiguation

## Progress
Classifier: Linear SVM
Cross validation: k-fold cross validation

### Include PERSON, LOCATION, ORGANIZATION, OTHER ENTITY, and MWE
5-fold CV

Features, cross validation macro average accuracy:
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .74
- It Makes Sense's Local Collocation + Surrounding Words: .726
- It Makes Sense's Local Collocation SVD: .721
- Latent Semantic Analysis: .713
- It Makes Sense's Local Collocation: .71
- Surrounding Words SVD: .708
- Collocation Vector: .70
- TF-IDF: .70
- Unigram-Bigram TFIDF: .69
- Choose most frequent sense: .57


Features, cross validation macro average F1-score:
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .533
- It Makes Sense's Local Collocation SVD: .528
- It Makes Sense's Local Collocation + Surrounding Words: .526
- It Makes Sense's Local Collocation: .524
- Collocation Vector SVD: .506
- Collocation Vector: .495
- Latent Semantic Analysis: .491
- Surrounding Words SVD: .477
- TF-IDF: .427
- Unigram-Bigram TF-IDF: .399
- Choose most frequent sense: .220

### Pure WSD,  with MWE
5-fold CV
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .550
- It Makes Sense's Local Collocation SVD: .539
- It Makes Sense's Local Collocation: .537
- Latent Semantic Analysis: .506
- Surrounding Words SVD: .505
- Collocation Vector SVD: .505
- Collocation Vector: .501
- TF-IDF: .434
- Choose most frequent sense: .241

### Pure WSD,  not even MWE
5-fold CV, F1-score
- Iacobacci, et. al (2016) replication: .634
- Wikipedia word embedding + POS tags SVD + IMS collocation vectors: .630
- Wikipedia word embedding + POS tags SVD: .630
- Wikipedia word embedding + POS tags SVD + Surrounding words SVD: .630
- Wikipedia word embedding + IMS collocation vectors + Surrounding words SVD: .629
- Wikipedia word embedding + IMS collocation vectors: .625
- Wikipedia word embedding + Surrounding words SVD: .623
- Wikipedia word embedding: .618
- It Makes Sense's (Zhong & Ng, 2010) Replication, but SVD: .584
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .582
- It Makes Sense's Local Collocation SVD + POS Tags SVD: .574
- It Makes Sense's Local Collocation SVD: .573
- POS Tags SVD + Surrounding Words SVD: .561
- It Makes Sense's Local Collocation: .569
- Latent Semantic Analysis: .524
- Surrounding Words SVD: .521
- POS Tags SVD: .488
- POS Tags: .488
- TF-IDF: .459
- Choose most frequent sense: .253

In general, training accuracy / f1-score is perfect, but the cross validation score is way too low, which means:
**Overfit**

- Now with word embedding, the training score is never 100% perfect and the cross validation score *improved drastically*

### TODO
- Tackle overfit problem -> word embedding somewhat solved this
- Wikipedia Indonesia Word Embedding -> done
- SVD with larger dimension (with extra memory)
- Build balanced dataset: manual labor

In [1]:
import sys
import pandas as pd
import numpy as np

# Load Data

In [2]:
dataset = pd.read_csv('train_data.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,kata,sense,kalimat,pos_tags,clean,targetpos_clean,targetpos_ori,targetpos_pos_tag
0,0,cerah,4801,cuaca cerah adalah lazim panjang tahun,NN NN VB NN NN NN Z,cuaca cerah lazim,1,1,1
1,1,cerah,4801,gambar yang hasil oleh layarnya cukup cerah da...,NNP SC VB IN NN RB JJ CC VB NN SC JJ VB NN SC ...,gambar hasil layarnya cerah milik speaker hasi...,3,6,6
2,2,cerah,4803,masa depan yang cerah bagi pemuda umur somenum...,NN NN SC VB IN NN NN CD IN NNP NNP CD Z,cerah bagi pemuda umur prancis abad,0,3,3
3,3,cerah,4801,cor caroli alpha canum venaticorum nama lengka...,NNP NNP Z NNP NNP NNP Z Z Z NN RB VB NNP NNP N...,cor caroli alpha canum venaticorum nama lengka...,12,16,21
4,4,cerah,4801,sanders lebih suka cat air untuk lilo dengan m...,NN RB VB NN NN SC NNP IN NN VB NN NN NN NN NN Z,sanders suka cat air lilo maksud tampil warna ...,8,11,11


# Drop rare sense from training set

In [3]:
RARE_LIMIT = 5
sense_set = set(dataset.sense)

In [4]:
rare_sense = set(filter(lambda s: len(dataset.query('sense == "{}"'.format(s))) <= RARE_LIMIT, sense_set))
len(rare_sense)

37

In [5]:
dataset_kata = []
dataset_sense = []
dataset_kalimat = []
dataset_clean = []
dataset_pos_clean = []
dataset_pos_ori = []
dataset_pos_tags = []
dataset_pos_pos_tag = []
for i in range(len(dataset)):
    row = dataset.iloc[i]
    if row.sense not in rare_sense:
        dataset_kata.append(row.kata)
        dataset_sense.append(row.sense)
        dataset_kalimat.append(row.kalimat)
        dataset_clean.append(row.clean)
        dataset_pos_clean.append(row.targetpos_clean)
        dataset_pos_ori.append(row.targetpos_ori)
        dataset_pos_tags.append(row.pos_tags)
        dataset_pos_pos_tag.append(row.targetpos_pos_tag)

dataset = pd.DataFrame({
    'kata': dataset_kata,
    'sense': dataset_sense,
    'kalimat': dataset_kalimat,
    'clean': dataset_clean,
    'targetpos_clean': dataset_pos_clean,
    'targetpos_ori': dataset_pos_ori,
    'pos_tags': dataset_pos_tags,
    'targetpos_pos_tag': dataset_pos_pos_tag,
})

In [6]:
set(dataset.query('kata == "{}"'.format('panas')).sense)

{'4901', '4903', '4904'}

In [7]:
len(dataset)

8311

# Feature Extraction

## POS Tags

In [11]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from functools import reduce

In [12]:
POS_TAGS_WINDOW = 2

In [40]:
pos_tags = [['-' for j in range(2*POS_TAGS_WINDOW+1)] for i in range(len(dataset))]
possible_tags = set()

for i in range(len(dataset)):
    row = dataset.iloc[i]
    tags = row.pos_tags.split()
    position = row.targetpos_pos_tag
    pos_tags[i][POS_TAGS_WINDOW] = tags[position]
    j = position-1
    k = POS_TAGS_WINDOW - 1
    while j >= 0 and j >= position - POS_TAGS_WINDOW:
        if tags[j] == 'Z':
            break # do not even include
        pos_tags[i][k] = tags[j]
        k -= 1
        j -= 1
    j = position+1
    k = POS_TAGS_WINDOW + 1
    while j < len(tags) and j <= position + POS_TAGS_WINDOW:
        pos_tags[i][k] = tags[j]
        if tags[j] == 'Z':
            break # include, then break

        k += 1
        j += 1
    
    

In [32]:
TAGSET = [
    '-', 'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'MD', 'NEG', 'NN',
    'NND','NNP','OD','PR','PRP','RB','RP','SC','SYM','VB','WH','X','Z'
]

TAG_LABEL = {t: [1 if t == x else 0 for x in TAGSET] for t,i in zip(TAGSET, range(len(TAGSET)))}

class POSTagTransformer(BaseEstimator, TransformerMixin):
    def transform(self, X, y=None):
        res = []
        for sentence in X:
            r = []
            for tag in sentence:
                r = [*r, *TAG_LABEL[tag]]
            res.append(r)
        return csr_matrix(res)

In [41]:
pos_tags = POSTagTransformer().transform(pos_tags)

## TF-IDF

In [130]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [131]:
tfidf_u = TfidfVectorizer()
u_tfidf = tfidf_u.fit_transform(dataset.clean)

In [132]:
u_tfidf

<8311x20337 sparse matrix of type '<class 'numpy.float64'>'
	with 99962 stored elements in Compressed Sparse Row format>

## Unigram-Bigram TF-IDF
as in Faisal, et. al (2018) "Word Sense Disambiguation in Bahasa Indonesia using SVM"

In [109]:
combined_unigram_bigram = []

for i in range(len(dataset)):
    row = dataset.iloc[i]
    combined_unigram_bigram.append(row.clean + ' ' + row.clean_bigram)

In [111]:
tfidf_ub = TfidfVectorizer()
ub_tfidf = tfidf_ub.fit_transform(combined_unigram_bigram)

In [112]:
ub_tfidf

<8721x109586 sparse matrix of type '<class 'numpy.float64'>'
	with 220345 stored elements in Compressed Sparse Row format>

## Latent Semantic Analysis

In [43]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

In [136]:
svdtfidf = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))
lsa = svdtfidf.fit_transform(u_tfidf)

In [137]:
lsa.shape

(8311, 1000)

## Correct (ed again) implementation of collocation vectors
as in Zhong & Ng (2010) "It Makes Sense", but unigram and bigrams only 

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
from preprocessor import normalize_money, normalize_number, stemmer, pipe
from scipy.sparse.csr import csr_matrix
import time
from functools import reduce

In [15]:
UNIGRAM = 0
BIGRAM = 1

collocation_pos = [
    (-2, -2), (-1, -1), (1, 1), (2, 2), (-2, -1), (-1, 1), (1, 2),
]

collocation_type = [
    UNIGRAM, UNIGRAM, UNIGRAM, UNIGRAM, BIGRAM, BIGRAM, BIGRAM
]

In [16]:
def get_collocation(sentence, targetpos, L, R):
    col = ['-' for i in range(R-L+1 - (1 if L < 0 and R > 0 else 0))]
    tokens = sentence.split()
    L = targetpos+L
    R = targetpos+R
    j = L
    i = 0
    while j <= R:
        if j < 0:
            j += 1
            i += 1
            continue
        if j == targetpos:
            j += 1
            continue
        if j >= len(tokens):
            break
        col[i] = tokens[j]
        j += 1
        i += 1
    
    return ' '.join(col)

In [17]:
print(dataset.iloc[2].kalimat, dataset.iloc[2].targetpos_ori)

masa depan yang cerah bagi pemuda umur somenumber di prancis abad somenumber 3


In [18]:
get_collocation(dataset.iloc[2].kalimat, 5, -6, -1)

'- masa depan yang cerah bagi'

In [19]:
collocation_words = [[] for i in range(len(dataset))]
context_window = []

for i in range(len(dataset)):
    instance = dataset.iloc[i]
    for l, r in collocation_pos:
        collocation_words[i].append(get_collocation(instance.kalimat, instance.targetpos_ori, l, r))
    context_window.append(get_collocation(instance.kalimat, instance.targetpos_ori, -2, 2))

In [20]:
unigram_vectorizer = CountVectorizer(ngram_range=(1,1), min_df=.0002).fit(context_window)
bigram_vectorizer = CountVectorizer(ngram_range=(2,2), min_df=.0002).fit(context_window)

In [21]:
print(len(unigram_vectorizer.vocabulary_))
print(len(bigram_vectorizer.vocabulary_))

2367
2060


In [22]:
len(collocation_words[2])

7

In [23]:
collocation_vectors = []

vectorizer = [None, None]
vectorizer[UNIGRAM] = unigram_vectorizer
vectorizer[BIGRAM] = bigram_vectorizer

for i in range(len(dataset)):
    vec = []
    for j in range(len(collocation_pos)):
        vec = [
            *vec, 
            *vectorizer[collocation_type[j]].transform([collocation_words[i][j]]).toarray()[0]
        ]
    collocation_vectors.append(vec)
        
collocation_vectors = csr_matrix(collocation_vectors)

In [24]:
collocation_vectors

<8311x15648 sparse matrix of type '<class 'numpy.int64'>'
	with 34988 stored elements in Compressed Sparse Row format>

## Surrounding Words

In [25]:
cv_bin = CountVectorizer(min_df=.0002)
surrounding_words = cv_bin.fit_transform(
    list(map(lambda s: ' '.join(set(s.split())), dataset.clean))
)

In [26]:
surrounding_words = csr_matrix(np.array([surrounding_words[i].toarray()[0] for i in range(surrounding_words.shape[0])], dtype=np.bool))

In [27]:
surrounding_words

<8311x7202 sparse matrix of type '<class 'numpy.bool_'>'
	with 86827 stored elements in Compressed Sparse Row format>

## Wikipedia Word2Vec Word Embedding

In [28]:
import gensim

In [29]:
EMBEDDING_SIZE = 50

In [30]:
word_vectors = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('../wikipedia_indonesia_embedding{}.model'.format(EMBEDDING_SIZE))

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


### Exponential Decay Word Embedding Features
Iacobacci, et. al (2016)

In [31]:
embedding = []

W = 5
alpha = 1 - (np.power(0.1, np.power(W-1.0, -1)))

for p in range(len(dataset)):
    if (p % 800) == 0:
        sys.stdout.write("\r{0:.2f} %".format(p/len(dataset)))
        sys.stdout.flush()
    instance = dataset.iloc[p]
    e = np.zeros(EMBEDDING_SIZE)
    I = instance.targetpos_ori
    words = instance.kalimat.split()
    for i in range(EMBEDDING_SIZE):
        for j in range(max(0, I-W), min(len(words), I+W+1)):
            if j == I:
                continue
            try:
                e[i] += (word_vectors.get_vector(words[j])[i] * (np.power(1 - alpha, abs(I-j) - 1)))
            except:
                continue
    embedding.append(e)
            

0.96 %

In [32]:
embedding = np.array(embedding)

In [33]:
embedding.shape

(8311, 50)

# Form Training Set

## It Makes Sense's Collocation Vectors only

In [163]:
X_train = collocation_vectors

In [164]:
collocation_vectors

<8311x15648 sparse matrix of type '<class 'numpy.int64'>'
	with 49470 stored elements in Compressed Sparse Row format>

## IMS Collocation Vectors SVD

In [36]:
svdimscv = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))

In [37]:
begin = time.perf_counter()
X_train = svdimscv.fit_transform(collocation_vectors)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 12.89343769300001


In [38]:
imscvsvd = X_train

In [39]:
X_train = imscvsvd

In [40]:
X_train.shape

(8311, 1000)

## Surrounding Words Only

In [84]:
X_train = surrounding_words

## Surrounding Words SVD

In [41]:
svdsw = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))

In [42]:
begin = time.perf_counter()
swsvd = svdsw.fit_transform(surrounding_words)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 12.8981895


In [43]:
X_train = swsvd

## IMS Collocation Vectors SVD + Surrounding Words SVD

In [214]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*imscvsvd[i], *swsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 2.334622258997115


In [215]:
X_train.shape

(8311, 2000)

## Collocation Vector only

In [13]:
X_train = collocation_vector

## Collocation Vector SVD

In [28]:
svdcv = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))

In [29]:
begin = time.perf_counter()
cvsvd = svdcv.fit_transform(collocation_vector)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 21.836217599999998


In [30]:
X_train = cvsvd

## Word Embedding Only

In [78]:
X_train = embedding

## IMS Collocation Vectors + Surrounding Words

In [57]:
transform_to_imscv_sw = lambda imscv, sw: csr_matrix(
    np.array(
        list(map(lambda i: [*imscv[i].toarray()[0], *sw[i].toarray()[0]], [i for i in range(imscv.shape[0])])),
        dtype=np.bool
    )
)

In [58]:
begin = time.perf_counter()
X_train = transform_to_imscv_sw(
    collocation_vectors,
    surrounding_words
)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 250.5095411000002


In [78]:
X_train

<8721x238788 sparse matrix of type '<class 'numpy.bool_'>'
	with 301381 stored elements in Compressed Sparse Row format>

## TF-IDF Only

In [133]:
X_train = u_tfidf

## Unigram-Bigram TF-IDF Only

In [147]:
X_train = ub_tfidf

## LSA Only

In [139]:
X_train = lsa

## POS Tags Only

In [109]:
X_train = pos_tags

## POS Tags SVD

In [44]:
svdpos = make_pipeline(TruncatedSVD(80), Normalizer(copy=False))

In [45]:
possvd = svdpos.fit_transform(pos_tags)

In [46]:
X_train = possvd

## Surrounding Words SVD + POS Tags SVD

In [116]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*possvd[i], *swsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 3.57551219999732


## IMS Collocation Vectors + POS Tags SVD

In [192]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*possvd[i], *imscvsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 1.5660946600037278


## It Makes Sense's Features Set SVD
The It Makes Sense's disambiguator system uses collocation vectors, surrounding words, and POS tags

In [46]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*possvd[i], *imscvsvd[i], *swsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 2.635862696000004


## Wikipedia Word Embedding + POS Tags SVD

In [98]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*possvd[i], *embedding[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 0.5411793000002945


## Wikipedia Word Embedding + Surrounding Words SVD

In [223]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*swsvd[i], *embedding[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 1.381767138998839


## Wikipedia Word Embedding + IMS Collocation Vectors SVD

In [225]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*imscvsvd[i], *embedding[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 1.319250854998245


## Wikipedia Word Embedding + IMS Collocation Vectors SVD + Surrounding Words SVD

In [227]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*imscvsvd[i], *swsvd[i], *embedding[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 2.5495485010033008


## Wikipedia Word Embedding + IMS Collocation Vectors SVD + POS Tags SVD

In [231]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*imscvsvd[i], *possvd[i], *embedding[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 1.5257597589952638


## Wikipedia Word Embedding + Surrounding Words SVD + POS Tags SVD

In [229]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*swsvd[i], *possvd[i], *embedding[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 1.7236650740014738


## Iacobacci, et. al Features Set SVD
But the word embedding is not transformed by SVD algorithm

In [59]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*imscvsvd[i], *swsvd[i], *possvd[i], *embedding[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 2.417027671000028


## Labels

In [47]:
annotated_words = set(dataset.kata)

In [48]:
mappers = dict()
for w in annotated_words:
    possible_sense = set(dataset.query('kata == "{}"'.format(w)).sense)
    mappers[w] = []
    for sense, i in zip(list(possible_sense),  [n for n in range(len(possible_sense))]):
        mappers[w].append((sense, i))

In [49]:
y_train = np.array([list(filter(lambda m: m[0] == sense, mappers[kata]))[0][1] for sense, kata in zip(dataset.sense, dataset.kata)])

# Training

Dummy classifier: always choose the most frequent sense

In [50]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [51]:
classifier = {w: None for w in annotated_words}

In [52]:
classification_report([0,1,1], [0,1,0], output_dict=True)['macro avg']['f1-score']

0.6666666666666666

In [53]:
'''
Select best parameter using k-fold cross validation
'''
def train(X, y, clf, possible_param, fold=3):
    clf = GridSearchCV(clf, possible_param, cv=fold, n_jobs=7, iid=False)
    clf.fit(X, y)
    label_counts = np.bincount(y)
    most_freq_label = np.argmax(label_counts)
    print()
    print('Cross validation accuracy:', clf.best_score_)
    dummy_score = label_counts[most_freq_label] / len(y)
    print('Dummy classifier accuracy: ', dummy_score)
    print_param(clf.best_params_)
    return (clf.best_estimator_, clf.best_score_, dummy_score)

def train_f1(X, y, clf, possible_param, fold=3):
    clf = GridSearchCV(clf, possible_param, cv=fold, n_jobs=7, iid=False, scoring='f1_macro')
    clf.fit(X, y)
    label_counts = np.bincount(y)
    most_freq_label = np.argmax(label_counts)
    print()
    print('Training f1-score:', classification_report(y, clf.predict(X), output_dict=True)['macro avg']['f1-score'])
    print('Cross validation f1-score:', clf.best_score_)
    dummy_score = classification_report(y, [most_freq_label for i in y], output_dict=True)['macro avg']['f1-score']
    print('Dummy classifier f1-score: ', dummy_score)
    print_param(clf.best_params_)
    return (clf.best_estimator_, clf.best_score_, dummy_score)

In [54]:
def print_param(param):
    print('Best parameters:')
    for p in param:
        print(p, ':', param[p])

In [55]:
def train_all(clf, possible_param, fold=5, algorithm_name=''):
    print(algorithm_name)
    scores = []
    dummy_scores = []
    for w in classifier.keys():
        print('==================================')
        print(w)
        indexes = list(dataset.query('kata == "{}"'.format(w)).index)
        best_clf, best_score, dummy_score = train(X_train[indexes], y_train[indexes], clf, possible_param, fold)
        scores.append(best_score)
        dummy_scores.append(dummy_score)
        classifier[w] = best_clf
        print('----------------------------------')
    print('Cross validation macro average accuracy:', sum(scores)/len(scores))
    print('Dummy classifier macro average accuracy:', sum(dummy_scores)/len(dummy_scores))

def train_all_f1(clf, possible_param, fold=5, algorithm_name=''):
    print(algorithm_name)
    scores = []
    dummy_scores = []
    for w in classifier.keys():
        print('==================================')
        print(w)
        indexes = list(dataset.query('kata == "{}"'.format(w)).index)
        best_clf, best_score, dummy_score = train_f1(X_train[indexes], y_train[indexes], clf, possible_param, fold)
        scores.append(best_score)
        dummy_scores.append(dummy_score)
        classifier[w] = best_clf
        print('----------------------------------')
    print('Cross validation macro average f1-score:', sum(scores)/len(scores))
    print('Dummy classifier macro average f1-score:', sum(dummy_scores)/len(dummy_scores))

In [56]:
y_train[list(dataset.query('kata == "{}"'.format('besar')).index)]

array([3, 1, 1, 1, 1, 3, 3, 3, 3, 0, 0, 3, 2, 3, 1, 3, 1, 2, 1, 3, 2, 3,
       3, 2, 3, 1, 3, 3, 1, 1, 2, 1, 1, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3,
       3, 3, 3, 1, 1, 1, 1, 3, 1, 1, 1, 3, 0, 1, 1, 3, 1, 3, 1, 3, 3, 3,
       3, 3, 3, 1, 1, 1, 3, 2, 0, 3, 3, 3, 3, 1, 3, 1, 1, 1, 1, 1, 1, 3,
       0, 3, 2, 1, 1, 1, 3, 1, 3, 0, 3, 1, 1, 3, 3, 1, 0, 1, 3, 3, 3, 3,
       2, 1, 1, 1, 3, 3, 3, 3, 1, 0, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 3, 3,
       3, 3, 1, 0, 1, 1, 1, 3, 3, 3, 1, 1, 3, 1, 1, 3, 3, 3, 1, 3, 3, 3,
       1, 3, 1, 1, 1])

## Linear SVM

In [57]:
from sklearn.svm import LinearSVC
import time

In [58]:
begin = time.perf_counter()
train_all_f1(
    LinearSVC(),
    {'max_iter': [10, 20, 40], 'C':[0.25, 0.5, 1.0, 2.0, 4.0, 8.0]},
    algorithm_name='Linear SVM'
)
print('elapsed time:', time.perf_counter() - begin)

Linear SVM
menurunkan


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.5984190319484437
Cross validation f1-score: 0.3432512604984799
Dummy classifier f1-score:  0.12656641604010024
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
bisa

Training f1-score: 0.755364255428498
Cross validation f1-score: 0.6488748488748489
Dummy classifier f1-score:  0.43718592964824116
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
rapat


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.8897243107769424
Cross validation f1-score: 0.6799967673651884
Dummy classifier f1-score:  0.45964912280701753
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
bunga


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7537878787878788
Cross validation f1-score: 0.5461715554146187
Dummy classifier f1-score:  0.4703703703703704
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
kabur


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.8332076525624913
Cross validation f1-score: 0.41144800793012315
Dummy classifier f1-score:  0.3182674199623352
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
ketat


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.595948985654868
Cross validation f1-score: 0.39016446613131667
Dummy classifier f1-score:  0.1534090909090909
Best parameters:
C : 2.0
max_iter : 40
----------------------------------
dalam


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.6843445295987773
Cross validation f1-score: 0.21476567600577753
Dummy classifier f1-score:  0.07039337474120083
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
berat

Training f1-score: 0.891305753070459
Cross validation f1-score: 0.3518450046685341
Dummy classifier f1-score:  0.10769230769230768
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
cabang

Training f1-score: 0.6032480429956771
Cross validation f1-score: 0.34421304761075555
Dummy classifier f1-score:  0.3177570093457944
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
pembagian


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.5936671867050225
Cross validation f1-score: 0.28113669590643275
Dummy classifier f1-score:  0.1721698113207547
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
kulit

Training f1-score: 0.6928605257318483
Cross validation f1-score: 0.4922056062890573
Dummy classifier f1-score:  0.26506024096385544
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
tengah


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7751932503770739
Cross validation f1-score: 0.5045365843792617
Dummy classifier f1-score:  0.1624203821656051
Best parameters:
C : 1.0
max_iter : 20
----------------------------------
tinggi


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.47490004085748766
Cross validation f1-score: 0.35305268003410417
Dummy classifier f1-score:  0.10232558139534884
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
menerima


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7096703980099501
Cross validation f1-score: 0.24511341991341995
Dummy classifier f1-score:  0.11327433628318584
Best parameters:
C : 2.0
max_iter : 20
----------------------------------
mengejar

Training f1-score: 0.8706349206349207
Cross validation f1-score: 0.7438235833486153


  'precision', 'predicted', average, warn_for)


Dummy classifier f1-score:  0.38257575757575757
Best parameters:
C : 0.5
max_iter : 20
----------------------------------
kali


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.9071842136358265
Cross validation f1-score: 0.7027362920304097
Dummy classifier f1-score:  0.2309711286089239
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
jaringan


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.6727834357892161
Cross validation f1-score: 0.4341324895733498
Dummy classifier f1-score:  0.20202020202020202
Best parameters:
C : 1.0
max_iter : 20
----------------------------------
halaman

Training f1-score: 0.7462820356109684
Cross validation f1-score: 0.4407159793611407
Dummy classifier f1-score:  0.24637681159420288
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
harapan


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7825140809011776
Cross validation f1-score: 0.5161933641031771
Dummy classifier f1-score:  0.2818428184281843
Best parameters:
C : 2.0
max_iter : 20
----------------------------------
kepala

Training f1-score: 0.8089851584436422
Cross validation f1-score: 0.4643509351486842
Dummy classifier f1-score:  0.3046964490263459
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
mata


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7869281943375925
Cross validation f1-score: 0.4348387654657004
Dummy classifier f1-score:  0.14371257485029942
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
buah

Training f1-score: 0.9144764957264958
Cross validation f1-score: 0.6348256912607473
Dummy classifier f1-score:  0.2397003745318352
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
memecahkan


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7113420925263543
Cross validation f1-score: 0.45805896443493344
Dummy classifier f1-score:  0.2164821648216482
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
badan

Training f1-score: 0.8047591939892365
Cross validation f1-score: 0.4589361872695206
Dummy classifier f1-score:  0.27255985267034993
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
jalan


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.6578725103579094
Cross validation f1-score: 0.2012900372900373
Dummy classifier f1-score:  0.1651376146788991
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
bintang


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.8284632034632035
Cross validation f1-score: 0.46957622624289297
Dummy classifier f1-score:  0.24113475177304963
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
kunci


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.770840219510128
Cross validation f1-score: 0.4812972866505475
Dummy classifier f1-score:  0.14832535885167464
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
dasar

Training f1-score: 0.6727040816326532
Cross validation f1-score: 0.32359029859029864
Dummy classifier f1-score:  0.176056338028169
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
membawa


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.43927010417693646
Cross validation f1-score: 0.2020529720905661
Dummy classifier f1-score:  0.05442176870748299
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
jam

Training f1-score: 0.7968608910469377
Cross validation f1-score: 0.5868716804596137
Dummy classifier f1-score:  0.17924528301886794
Best parameters:
C : 0.5
max_iter : 20
----------------------------------
besar


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.761980415667466
Cross validation f1-score: 0.5201329182639511
Dummy classifier f1-score:  0.15732758620689655
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
lebat

Training f1-score: 0.7262178434592228
Cross validation f1-score: 0.6735198135198136
Dummy classifier f1-score:  0.3920265780730897
Best parameters:
C : 0.25
max_iter : 20
----------------------------------
kaki


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.8186918445539134
Cross validation f1-score: 0.6897949393601567
Dummy classifier f1-score:  0.24444444444444446
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
menjaga

Training f1-score: 0.5413239588490018
Cross validation f1-score: 0.22284643970127843
Dummy classifier f1-score:  0.1634980988593156
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
asing


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.8678571428571429
Cross validation f1-score: 0.8261893369788107
Dummy classifier f1-score:  0.4825174825174825
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
layar


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.8495703843029966
Cross validation f1-score: 0.6261605993899227
Dummy classifier f1-score:  0.3347826086956522
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
bidang

Training f1-score: 0.4854014598540146
Cross validation f1-score: 0.4855121293800539
Dummy classifier f1-score:  0.4854014598540146
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
lingkungan


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.6325156325156325
Cross validation f1-score: 0.49191944794526093
Dummy classifier f1-score:  0.2222222222222222
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
baru

Training f1-score: 0.9937859562611502
Cross validation f1-score: 0.7831971424021112
Dummy classifier f1-score:  0.284037558685446
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
mengandung


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7661655929372465
Cross validation f1-score: 0.5575730935730935
Dummy classifier f1-score:  0.4895833333333333
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
atas


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.6482581611128786
Cross validation f1-score: 0.45908923374761884
Dummy classifier f1-score:  0.05970149253731344
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
sarung

Training f1-score: 0.8549244213509685
Cross validation f1-score: 0.8463680146207126
Dummy classifier f1-score:  0.37
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
mengikat


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.7056934524333289
Cross validation f1-score: 0.3428054262409492
Dummy classifier f1-score:  0.13793103448275862
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
bulan

Training f1-score: 0.9612903225806451
Cross validation f1-score: 0.9291436990428925
Dummy classifier f1-score:  0.45614035087719296
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
coklat


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.48084284418689566
Cross validation f1-score: 0.38965709614381183
Dummy classifier f1-score:  0.3037974683544304
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
cerah

Training f1-score: 0.6912280701754386
Cross validation f1-score: 0.6339094481861101
Dummy classifier f1-score:  0.47586206896551725
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
nilai


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.6896004890250413
Cross validation f1-score: 0.29982570491084426
Dummy classifier f1-score:  0.12085308056872036
Best parameters:
C : 2.0
max_iter : 40
----------------------------------
dunia

Training f1-score: 0.776680190150139
Cross validation f1-score: 0.4732409424506199
Dummy classifier f1-score:  0.1794871794871795
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
menangkap


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Training f1-score: 0.749404761904762
Cross validation f1-score: 0.4406626596262592
Dummy classifier f1-score:  0.29508196721311475
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
mendorong

Training f1-score: 0.7174975562072337
Cross validation f1-score: 0.5010236533251292
Dummy classifier f1-score:  0.4672364672364672
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
menyusun


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.6694055944055943
Cross validation f1-score: 0.3839923884340103
Dummy classifier f1-score:  0.24242424242424243
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
mengeluarkan


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.5717967186874751
Cross validation f1-score: 0.35231663399637947
Dummy classifier f1-score:  0.13970588235294118
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
mengisi

Training f1-score: 0.6328518920624184
Cross validation f1-score: 0.2794133334845409


  'precision', 'predicted', average, warn_for)


Dummy classifier f1-score:  0.14791666666666667
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
panas

Training f1-score: 0.7927829224816311
Cross validation f1-score: 0.6065490688722488
Dummy classifier f1-score:  0.23008849557522124
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
Cross validation macro average f1-score: 0.4847205469988656
Dummy classifier macro average f1-score: 0.25266422986045856
elapsed time: 12.434107830999892


  'precision', 'predicted', average, warn_for)


In [59]:
classification_report(
    y_train[list(dataset.query('kata == "{}"'.format('kunci')).index)], 
    classifier['kunci'].predict(X_train[list(dataset.query('kata == "{}"'.format('kunci')).index)]),
    output_dict=True
)

{'0': {'precision': 0.84,
  'recall': 0.9130434782608695,
  'f1-score': 0.8749999999999999,
  'support': 46},
 '1': {'precision': 0.9433962264150944,
  'recall': 0.8064516129032258,
  'f1-score': 0.8695652173913043,
  'support': 62},
 '2': {'precision': 0.8181818181818182,
  'recall': 1.0,
  'f1-score': 0.9,
  'support': 27},
 '3': {'precision': 1.0,
  'recall': 0.9166666666666666,
  'f1-score': 0.9565217391304348,
  'support': 12},
 'accuracy': 0.8843537414965986,
 'macro avg': {'precision': 0.9003945111492282,
  'recall': 0.9090404394576904,
  'f1-score': 0.9002717391304347,
  'support': 147},
 'weighted avg': {'precision': 0.892663096113231,
  'recall': 0.8843537414965986,
  'f1-score': 0.8839544513457556,
  'support': 147}}