# Lexical-Sample Supervised Word Sense Disambiguation

### Current Best Approach
Classifier: Linear SVM


Features, macro average accuracy:
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .74
- It Makes Sense's Local Collocation + Surrounding Words: .726
- It Makes Sense's Local Collocation SVD: .721
- Latent Semantic Analysis: .713
- It Makes Sense's Local Collocation: .71
- Surrounding Words SVD: .708
- Collocation Vector: .70
- TF-IDF: .70
- Unigram-Bigram TF-IDF: .69
- Choose most frequent sense: .57


Features, macro average F1-score:
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .533
- It Makes Sense's Local Collocation SVD: .528
- It Makes Sense's Local Collocation + Surrounding Words: .526
- It Makes Sense's Local Collocation: .524
- Collocation Vector SVD: .506
- Collocation Vector: .495
- Latent Semantic Analysis: .491
- Surrounding Words SVD: .477
- TF-IDF: .427
- Unigram-Bigram TF-IDF: .399
- Choose most frequent sense: .220

In general, the training accuracy and F1-score are perfect or nearly perfect, while the cross-validation scores are much lower. This gap indicates that the models **overfit**.
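The gap between training and cross-validation scores can be measured directly. The following is a minimal sketch, not the notebook's own procedure: it runs `cross_validate` with `return_train_score=True` on synthetic data from `make_classification` (an assumption standing in for the instances of a single word), so both scores are reported side by side.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): few samples, many features
X_demo, y_demo = make_classification(
    n_samples=120, n_features=500, n_informative=10,
    n_classes=3, random_state=0
)

cv_result = cross_validate(
    LinearSVC(C=8.0, max_iter=5000),
    X_demo, y_demo,
    cv=5, scoring='f1_macro', return_train_score=True
)

train_f1 = cv_result['train_score'].mean()
valid_f1 = cv_result['test_score'].mean()
print('train f1:', train_f1)
print('validation f1:', valid_f1)
print('gap:', train_f1 - valid_f1)
```

A large positive gap is the symptom described above; lowering `C` or reducing the feature dimension are the usual first things to try.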

### TODO
- Tackle the overfitting problem
- Word embeddings trained on Indonesian Wikipedia
- POS tagger
- SVD with a larger dimension (needs extra memory)
- Build a balanced dataset (requires manual labeling)
- Latent Dirichlet Allocation

In [1]:
import pandas as pd
import numpy as np

# Load Data

In [2]:
dataset = pd.read_csv('train_data.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,kalimat_id,kata,sense,kalimat,clean,targetpos_clean,targetpos_ori,clean_bigram
0,0,336691,cerah,4801,Cuaca cerah adalah lazim sepanjang tahun.,cuaca cerah lazim,1,1,cuaca_cerah cerah_lazim
1,1,336270,cerah,4801,Gambar yang dihasilkan oleh layarnya cukup cer...,gambar hasil layarnya cerah milik speaker hasi...,3,6,gambar_hasil hasil_layarnya layarnya_cerah cer...
2,2,336555,cerah,4803,Masa depan yang cerah bagi pemuda berumur 20 d...,cerah pemuda umur somenumber prancis abad some...,0,3,cerah_pemuda pemuda_umur umur_somenumber somen...
3,3,336618,cerah,4801,"Cor Caroli (Alpha Canum Venaticorum), (nama le...",cor caroli alpha canum venaticorum nama lengka...,12,16,cor_caroli caroli_alpha alpha_canum canum_vena...
4,4,336613,cerah,4801,Sanders lebih menyukai cat air untuk Lilo deng...,sanders suka cat air lilo maksud tampil warna ...,8,11,sanders_suka suka_cat cat_air air_lilo lilo_ma...


# Drop rare senses from the training set

In [3]:
RARE_LIMIT = 5
sense_set = set(dataset.sense)

In [4]:
# A sense is rare when it has at most RARE_LIMIT training instances
sense_counts = dataset.sense.value_counts()
rare_sense = set(sense_counts[sense_counts <= RARE_LIMIT].index)
len(rare_sense)

119

In [5]:
# Keep only instances whose sense is not rare, and only the columns used later
columns = ['kata', 'sense', 'kalimat', 'clean', 'clean_bigram', 'targetpos_clean', 'targetpos_ori']
dataset = dataset.loc[~dataset.sense.isin(rare_sense), columns].reset_index(drop=True)

In [6]:
set(dataset.query('kata == "{}"'.format('panas')).sense)

{'4901', '4903', '4904'}

In [7]:
len(dataset)

8721

# Feature Extraction

## TF-IDF

In [108]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [115]:
tfidf_u = TfidfVectorizer()
u_tfidf = tfidf_u.fit_transform(dataset.clean)

In [116]:
u_tfidf

<8721x20120 sparse matrix of type '<class 'numpy.float64'>'
	with 109131 stored elements in Compressed Sparse Row format>

## Unigram-Bigram TF-IDF
As in Faisal et al. (2018), "Word Sense Disambiguation in Bahasa Indonesia using SVM".

In [109]:
# Append the pre-computed bigram tokens to the unigram text of each instance
combined_unigram_bigram = (dataset.clean + ' ' + dataset.clean_bigram).tolist()

In [111]:
tfidf_ub = TfidfVectorizer()
ub_tfidf = tfidf_ub.fit_transform(combined_unigram_bigram)

In [112]:
ub_tfidf

<8721x109586 sparse matrix of type '<class 'numpy.float64'>'
	with 220345 stored elements in Compressed Sparse Row format>
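An alternative to concatenating a pre-computed bigram column is to let the vectorizer build the bigrams itself. The sketch below uses `TfidfVectorizer(ngram_range=(1, 2))` on two toy sentences (an assumption, not the training data). Note that its bigram tokens are joined with a space rather than the underscore used in `clean_bigram`, so the vocabularies are not identical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy input (assumption), already cleaned
toy_sentences = ['cuaca cerah lazim', 'masa depan cerah']

# ngram_range=(1, 2) extracts unigrams and bigrams in one pass
tfidf_ngram = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = tfidf_ngram.fit_transform(toy_sentences)

vocabulary = sorted(tfidf_ngram.vocabulary_)
print(vocabulary)
print(toy_matrix.shape)
```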

## Latent Semantic Analysis

In [23]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

In [128]:
svdtfidf = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))
lsa = svdtfidf.fit_transform(u_tfidf)

In [130]:
lsa.shape

(8721, 1000)
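When choosing the SVD dimension, it can help to check how much variance the retained components explain. This is a small sketch on a toy corpus (an assumption), using the same `TruncatedSVD` + `Normalizer` pipeline shape as above; the step name `'truncatedsvd'` is the one `make_pipeline` assigns automatically.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy corpus (assumption)
toy_docs = [
    'cuaca cerah hari',
    'cuaca panas hari',
    'layar cerah gambar',
    'layar gambar warna',
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

lsa_pipeline = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
toy_lsa = lsa_pipeline.fit_transform(toy_tfidf)

# Fraction of the variance kept by the two components
retained = lsa_pipeline.named_steps['truncatedsvd'].explained_variance_ratio_.sum()
print(toy_lsa.shape, retained)
```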

## Collocation Vector

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from preprocessor import normalize_money, normalize_number, stemmer, pipe

In [70]:
CONTEXT_WINDOW = 5

In [71]:
context_words = [[] for i in range(len(dataset))]

for i in range(len(dataset)):
    tokens = dataset.iloc[i].kalimat.split()
    pos = dataset.iloc[i].targetpos_ori
    for j in range(max(0, pos-CONTEXT_WINDOW), pos):
        token = pipe(normalize_money, normalize_number, stemmer.stem)(tokens[j])
        context_words[i].append(token)
    for j in range(pos+1, min(len(tokens), pos+CONTEXT_WINDOW+1)):
        token = pipe(normalize_money, normalize_number, stemmer.stem)(tokens[j])
        context_words[i].append(token)
    context_words[i] = ' '.join(context_words[i])

In [72]:
cv = CountVectorizer()
collocation_vector = cv.fit_transform(list(map(lambda s: ' '.join(set(s.split())), context_words)))

In [73]:
collocation_vector

<8721x10979 sparse matrix of type '<class 'numpy.int64'>'
	with 73005 stored elements in Compressed Sparse Row format>
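The window extraction used for the collocation vector can be isolated into a small helper. This is a simplified sketch: `context_window` is a hypothetical name, and the normalization and stemming steps from `preprocessor` are left out.

```python
def context_window(tokens, target_index, window):
    """Return up to `window` tokens on each side of the target, excluding the target."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right


tokens = 'Cuaca cerah adalah lazim sepanjang tahun.'.split()
# Target word "cerah" is at index 1; take two tokens on each side
print(context_window(tokens, 1, 2))  # ['Cuaca', 'adalah', 'lazim']
```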

## Collocation Vectors
As in Zhong & Ng (2010), "It Makes Sense".

In [13]:
from scipy.sparse import csr_matrix
import time

In [14]:
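def get_collocation(sentence, targetpos, L, R):
    # L and R are offsets relative to the target position (negative = left of the target)
    tokens = sentence.split()
    start = max(0, targetpos + L)
    end = min(len(tokens), targetpos + R)
    collocation = tokens[start:end+1]
    # Note: set() discards token order and duplicates within the collocation
    return ' '.join(set(map(pipe(normalize_money, normalize_number, stemmer.stem), collocation)))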

In [15]:
collocation_pos = {
    (-2, -2), (-1, -1), (1, 1), (2, 2), (-2, -1), (-1, 1), (1, 2), (-3, -1), (-2, 1), (-1, 2), (1, 3)
}

In [16]:
collocation_words = [[] for i in range(len(dataset))]

for i in range(len(dataset)):
    instance = dataset.iloc[i]
    for l, r in collocation_pos:
        collocation_words[i].append(get_collocation(instance.kalimat, instance.targetpos_ori, l, r))

In [17]:
# cv = CountVectorizer().fit(dataset.clean) -> use the above

In [18]:
collocation_vectors = np.array(list(map(
    lambda cws: cv.transform(cws),
    collocation_words
)))

In [19]:
collocation_vectors = np.array(list(map(lambda v: v.reshape(1, -1), collocation_vectors)))

In [20]:
begin = time.perf_counter()
collocation_vectors = csr_matrix([np.array(vec.toarray()[0], dtype=bool) for vec in collocation_vectors])
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 9.114274600000002


In [21]:
collocation_vectors

<8721x88429 sparse matrix of type '<class 'numpy.bool_'>'
	with 210910 stored elements in Compressed Sparse Row format>
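Zhong & Ng describe a collocation as an ordered sequence of tokens, whereas `get_collocation` above passes the tokens through `set()`, which drops the order. The following sketch shows an order-preserving variant. `ordered_collocation` is a hypothetical helper, the normalization and stemming steps are omitted, and whether it improves results on this dataset is untested.

```python
def ordered_collocation(tokens, target_index, left_offset, right_offset):
    """Join the tokens from left_offset to right_offset (relative to the target), keeping order."""
    start = max(0, target_index + left_offset)
    end = min(len(tokens) - 1, target_index + right_offset)
    if end < start:
        # The requested span lies entirely outside the sentence
        return ''
    # A single joined token keeps "a_b" distinct from "b_a"
    return '_'.join(tokens[start:end + 1])


tokens = 'Cuaca cerah adalah lazim sepanjang tahun.'.split()
print(ordered_collocation(tokens, 1, -1, 1))  # Cuaca_cerah_adalah
print(ordered_collocation(tokens, 1, 1, 3))   # adalah_lazim_sepanjang
```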

## Surrounding Words

In [54]:
cv_bin = CountVectorizer()
# Fit the dedicated vectorizer so that `cv` (used for the collocation features) is not refit
surrounding_words = cv_bin.fit_transform(
    list(map(lambda s: ' '.join(set(s.split())), dataset.clean))
)

In [55]:
# Cast to boolean without densifying the matrix
surrounding_words = surrounding_words.astype(bool)

In [56]:
surrounding_words

<8721x20120 sparse matrix of type '<class 'numpy.bool_'>'
	with 109131 stored elements in Compressed Sparse Row format>
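Binary presence features can also be obtained directly from the vectorizer. The sketch below uses `CountVectorizer(binary=True)` on toy sentences (an assumption), which removes the need to deduplicate tokens with `set()` before vectorizing.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy input (assumption); "cerah" appears twice in the first sentence
toy_sentences = ['cerah cerah cuaca', 'layar cerah']

# binary=True records presence (1) instead of the raw count
binary_vectorizer = CountVectorizer(binary=True)
toy_surrounding = binary_vectorizer.fit_transform(toy_sentences).astype(bool)

print(sorted(binary_vectorizer.vocabulary_))
print(toy_surrounding.toarray())
```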

## Word Embedding: Word2Vec

In [10]:
import gensim
from functools import reduce

In [59]:
EMBEDDING_SIZE = 50
clean_sentence = list(map(str.split, (pd.read_csv('clean_sentence.csv').clean)))

In [69]:
# Note: gensim >= 4.0 renames the `size` argument to `vector_size`
embedding_model = gensim.models.Word2Vec(clean_sentence, min_count=1, window=10, size=EMBEDDING_SIZE)

### Exponential Decay Word Embedding Features
Iacobacci et al. (2016)

In [61]:
embedding = []

W = CONTEXT_WINDOW
alpha = 1 - (np.power(0.1, np.power(W-1.0, -1)))

for p in range(len(dataset)):
    if (p % 800) == 0:
        print(p)
    instance = dataset.iloc[p]
    e = np.zeros(EMBEDDING_SIZE)
    I = instance.targetpos_clean
    words = instance.clean.split()
    for j in range(max(0, I-W), min(len(words), I+W+1)):
        if j == I:
            continue
        # Weight the whole context vector by (1 - alpha)^(|I - j| - 1)
        e += embedding_model.wv.get_vector(words[j]) * np.power(1 - alpha, abs(I-j) - 1)
    embedding.append(e)

0
800
1600
2400
3200
4000
4800
5600
6400
7200
8000
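The decay weights used in the cell above depend only on the distance to the target word: a context word at distance d is weighted by (1 - alpha)^(d - 1), with alpha = 1 - 0.1^(1 / (W - 1)). The sketch below computes these weights on their own; `decay_weights` is a hypothetical helper name.

```python
import numpy as np


def decay_weights(window):
    """Weights (1 - alpha)^(d - 1) for distances d = 1..window."""
    alpha = 1 - np.power(0.1, 1.0 / (window - 1))
    distances = np.arange(1, window + 1)
    return np.power(1 - alpha, distances - 1)


weights = decay_weights(5)
print(weights)
```

With this choice of alpha, the adjacent word has weight 1 and the word at distance W has weight 0.1, so the farthest word in the window contributes a tenth of what a neighbouring word does.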


### Sum of context word embeddings

In [112]:
embedding = np.array(
    list(map(
        lambda s: reduce(
            lambda x, y: x + embedding_model.wv[y],
            s.split(),
            np.zeros(EMBEDDING_SIZE)  # start from zero so the first word is counted only once
        ),
        context_words
    ))
)

  


# Form Training Set

## It Makes Sense's Collocation Vectors only

In [176]:
X_train = collocation_vectors

In [177]:
collocation_vectors

<8721x221320 sparse matrix of type '<class 'numpy.bool_'>'
	with 165767 stored elements in Compressed Sparse Row format>

## IMS Collocation Vectors SVD

In [24]:
svdimscv = make_pipeline(TruncatedSVD(5000), Normalizer(copy=False))

In [25]:
begin = time.perf_counter()
X_train = svdimscv.fit_transform(collocation_vectors)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 1440.6749212999998


In [26]:
imscvsvd = X_train

In [79]:
X_train = imscvsvd

In [27]:
X_train.shape

(8721, 5000)

## Surrounding Words Only

In [84]:
X_train = surrounding_words

## Surrounding Words SVD

In [61]:
svdsw = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))

In [62]:
begin = time.perf_counter()
swsvd = svdsw.fit_transform(surrounding_words)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 24.514832599999863


In [63]:
X_train = swsvd

## IMS Collocation Vectors SVD + Surrounding Words SVD

In [64]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*imscvsvd[i], *swsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 24.574275399999806


In [65]:
X_train.shape

(8721, 6000)
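For dense arrays such as the two SVD outputs, the same column-wise concatenation can be written with `np.hstack`. This is a sketch on small stand-in arrays (assumptions), not a rerun of the cell above.

```python
import numpy as np

# Stand-ins (assumption) for two dense feature blocks with the same number of rows
left_block = np.arange(6, dtype=float).reshape(3, 2)
right_block = np.ones((3, 1))

# Concatenate column-wise: each row becomes [left features..., right features...]
combined = np.hstack([left_block, right_block])
print(combined.shape)  # (3, 3)
```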

## Collocation Vector only

In [154]:
X_train = collocation_vector

## Collocation Vector SVD

In [74]:
svdcv = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))

In [75]:
begin = time.perf_counter()
cvsvd = svdcv.fit_transform(collocation_vector)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 18.34346170000026


In [77]:
X_train = cvsvd

## Word Embedding Only

In [46]:
X_train = np.array(embedding)

## IMS Collocation Vectors + Surrounding Words

In [57]:
transform_to_imscv_sw = lambda imscv, sw: csr_matrix(
    np.array(
        list(map(lambda i: [*imscv[i].toarray()[0], *sw[i].toarray()[0]], [i for i in range(imscv.shape[0])])),
        dtype=bool
    )
)

In [58]:
begin = time.perf_counter()
X_train = transform_to_imscv_sw(
    collocation_vectors,
    surrounding_words
)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 250.5095411000002


In [78]:
X_train

<8721x238788 sparse matrix of type '<class 'numpy.bool_'>'
	with 301381 stored elements in Compressed Sparse Row format>
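The two sparse matrices can also be combined without converting each row to a dense array. The sketch below uses `scipy.sparse.hstack` on small stand-in matrices (assumptions); it illustrates the technique and is not a benchmark against the cell above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-ins (assumption) for two sparse boolean feature matrices
left_sparse = csr_matrix(np.array([[1, 0], [0, 1]], dtype=bool))
right_sparse = csr_matrix(np.array([[0, 0, 1], [1, 0, 0]], dtype=bool))

# Column-wise stacking that stays sparse throughout
combined_sparse = hstack([left_sparse, right_sparse], format='csr').astype(bool)
print(combined_sparse.shape, combined_sparse.nnz)  # (2, 5) 4
```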

## TF-IDF Only

In [145]:
X_train = u_tfidf

## Unigram-Bigram TF-IDF Only

In [147]:
X_train = ub_tfidf

## LSA Only

In [140]:
X_train = lsa

## Labels

In [29]:
annotated_words = set(dataset.kata)

In [30]:
mappers = dict()
for w in annotated_words:
    possible_sense = set(dataset.query('kata == "{}"'.format(w)).sense)
    mappers[w] = []
    # Sort the senses so the sense-to-index mapping is reproducible across runs
    for i, sense in enumerate(sorted(possible_sense)):
        mappers[w].append((sense, i))

In [31]:
y_train = np.array([list(filter(lambda m: m[0] == sense, mappers[kata]))[0][1] for sense, kata in zip(dataset.sense, dataset.kata)])
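A per-word label mapping can also be built with scikit-learn's `LabelEncoder`, which sorts the classes and supports mapping predictions back to sense IDs via `inverse_transform`. This is a sketch on toy data (the word and sense lists are assumptions modeled on the columns above), not the notebook's own mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Toy data (assumption): target word and sense ID of each instance
toy_words = ['cerah', 'cerah', 'panas', 'panas', 'panas']
toy_senses = ['4801', '4803', '4901', '4903', '4901']

# One encoder per word, so labels start at 0 for every word
encoders = {}
for word in set(toy_words):
    senses_of_word = [s for w, s in zip(toy_words, toy_senses) if w == word]
    encoders[word] = LabelEncoder().fit(senses_of_word)

toy_labels = [int(encoders[w].transform([s])[0]) for w, s in zip(toy_words, toy_senses)]
print(toy_labels)  # [0, 1, 0, 1, 0]
```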

# Training

Dummy classifier: always choose the most frequent sense
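The same baseline is available in scikit-learn as `DummyClassifier(strategy='most_frequent')`. The sketch below applies it to toy labels (an assumption); the notebook itself computes the baseline with `np.bincount` inside `train`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y_demo = np.array([0, 0, 0, 1, 2])    # toy labels (assumption)
X_demo = np.zeros((len(y_demo), 1))   # features are ignored by this strategy

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(baseline.predict(X_demo))  # [0 0 0 0 0]
print(baseline_accuracy)         # 0.6
```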

In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [33]:
classifier = {w: None for w in annotated_words}

In [34]:
classification_report([0,1,1], [0,1,0], output_dict=True)['macro avg']['f1-score']

0.6666666666666666
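The macro-averaged F1 printed above can be reproduced by hand: compute the F1-score of each label separately, then take the unweighted mean. A minimal pure-Python sketch (the helper names are hypothetical):

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score of a single label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


print(macro_f1([0, 1, 1], [0, 1, 0]))
```

Because every label counts equally, a classifier that always predicts the majority sense scores 0 on every other sense, which is why the dummy F1-scores are much lower than the dummy accuracies.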

In [67]:
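def train(X, y, clf, possible_param, fold=5):
    '''
    Select the best parameters using k-fold cross-validation, scored by accuracy.
    '''
    # `iid` was deprecated in scikit-learn 0.22 and removed in 0.24; omit it on newer versions
    clf = GridSearchCV(clf, possible_param, cv=fold, n_jobs=7, iid=False)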
    clf.fit(X, y)
    label_counts = np.bincount(y)
    most_freq_label = np.argmax(label_counts)
    print()
    print('Cross validation accuracy:', clf.best_score_)
    dummy_score = label_counts[most_freq_label] / len(y)
    print('Dummy classifier accuracy: ', dummy_score)
    print_param(clf.best_params_)
    return (clf.best_estimator_, clf.best_score_, dummy_score)

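def train_f1(X, y, clf, possible_param, fold=5):
    '''
    Same as train(), but model selection is scored by macro-averaged F1.
    '''
    # `iid` was deprecated in scikit-learn 0.22 and removed in 0.24; omit it on newer versions
    clf = GridSearchCV(clf, possible_param, cv=fold, n_jobs=7, iid=False, scoring='f1_macro')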
    clf.fit(X, y)
    label_counts = np.bincount(y)
    most_freq_label = np.argmax(label_counts)
    print()
    print('Training f1-score:', classification_report(y, clf.predict(X), output_dict=True)['macro avg']['f1-score'])
    print('Cross validation f1-score:', clf.best_score_)
    dummy_score = classification_report(y, [most_freq_label for i in y], output_dict=True)['macro avg']['f1-score']
    print('Dummy classifier f1-score: ', dummy_score)
    print_param(clf.best_params_)
    return (clf.best_estimator_, clf.best_score_, dummy_score)

In [36]:
def print_param(param):
    print('Best parameters:')
    for p in param:
        print(p, ':', param[p])

In [37]:
def train_all(clf, possible_param, fold=5, algorithm_name=''):
    print(algorithm_name)
    scores = []
    dummy_scores = []
    for w in classifier.keys():
        print('==================================')
        print(w)
        indexes = list(dataset.query('kata == "{}"'.format(w)).index)
        best_clf, best_score, dummy_score = train(X_train[indexes], y_train[indexes], clf, possible_param, fold)
        scores.append(best_score)
        dummy_scores.append(dummy_score)
        classifier[w] = best_clf
        print('----------------------------------')
    print('Cross validation macro average accuracy:', sum(scores)/len(scores))
    print('Dummy classifier macro average accuracy:', sum(dummy_scores)/len(dummy_scores))

def train_all_f1(clf, possible_param, fold=5, algorithm_name=''):
    print(algorithm_name)
    scores = []
    dummy_scores = []
    for w in classifier.keys():
        print('==================================')
        print(w)
        indexes = list(dataset.query('kata == "{}"'.format(w)).index)
        best_clf, best_score, dummy_score = train_f1(X_train[indexes], y_train[indexes], clf, possible_param, fold)
        scores.append(best_score)
        dummy_scores.append(dummy_score)
        classifier[w] = best_clf
        print('----------------------------------')
    print('Cross validation macro average f1-score:', sum(scores)/len(scores))
    print('Dummy classifier macro average f1-score:', sum(dummy_scores)/len(dummy_scores))

In [38]:
y_train[list(dataset.query('kata == "{}"'.format('besar')).index)]

array([3, 0, 0, 0, 0, 3, 3, 3, 3, 2, 2, 3, 1, 3, 0, 3, 0, 1, 0, 3, 1, 3,
       3, 1, 3, 0, 3, 3, 0, 0, 1, 0, 0, 3, 3, 3, 3, 3, 3, 0, 0, 0, 3, 3,
       3, 3, 3, 0, 0, 0, 0, 3, 0, 0, 0, 3, 2, 0, 0, 3, 0, 3, 0, 3, 3, 3,
       3, 3, 3, 0, 0, 0, 3, 1, 2, 3, 3, 3, 3, 0, 3, 0, 0, 0, 0, 0, 0, 3,
       2, 3, 1, 0, 0, 0, 3, 0, 3, 2, 3, 0, 0, 3, 3, 0, 2, 0, 3, 3, 3, 3,
       1, 0, 0, 0, 3, 3, 3, 3, 0, 2, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 3, 3,
       3, 3, 0, 2, 0, 0, 0, 3, 3, 3, 0, 0, 3, 0, 0, 3, 3, 3, 0, 3, 3, 3,
       0, 3, 0, 0, 0])

## Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression

In [None]:

train_all(
    LogisticRegression(),
    {'solver':['newton-cg'], 'max_iter':[10, 20, 50], 'multi_class': ['ovr', 'multinomial']},
    algorithm_name='Logistic Regression'
)

## Linear SVM

In [39]:
from sklearn.svm import LinearSVC

In [80]:
begin = time.perf_counter()
train_all_f1(
    LinearSVC(),
    {'max_iter': [10, 20, 40], 'C':[0.25, 0.5, 1.0, 2.0, 4.0, 8.0]},
    algorithm_name='Linear SVM'
)
print('elapsed time:', time.perf_counter() - begin)

Linear SVM
baru


Training f1-score: 1.0
Cross validation f1-score: 0.5531211125959817
Dummy classifier f1-score:  0.15426621160409557
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
memecahkan


Training f1-score: 1.0
Cross validation f1-score: 0.6603611955207723
Dummy classifier f1-score:  0.2164821648216482
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
layar


Training f1-score: 1.0
Cross validation f1-score: 0.6988916974312511
Dummy classifier f1-score:  0.21751412429378528
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
mata


Training f1-score: 0.8754052025336653
Cross validation f1-score: 0.5014914098247432
Dummy classifier f1-score:  0.07920792079207921
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
panas


Training f1-score: 1.0
Cross validation f1-score: 0.7848697909648674
Dummy classifier f1-score:  0.23008849557522124
Best parameters:
C : 2.0
max_iter : 20
----------------------------------
pembagian


Training f1-score: 0.9935823385958158
Cross validation f1-score: 0.40797258297258293
Dummy classifier f1-score:  0.1721698113207547
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
jalan


Training f1-score: 1.0
Cross validation f1-score: 0.23848837209302326
Dummy classifier f1-score:  0.11904761904761905
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
kulit


Training f1-score: 1.0
Cross validation f1-score: 0.5098107448107448
Dummy classifier f1-score:  0.19411764705882353
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
bidang


Training f1-score: 0.4854014598540146
Cross validation f1-score: 0.4855121293800539
Dummy classifier f1-score:  0.4854014598540146
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
atas


Training f1-score: 0.986986098787341
Cross validation f1-score: 0.5067699340808585
Dummy classifier f1-score:  0.05970149253731344
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
coklat


Training f1-score: 1.0
Cross validation f1-score: 0.44809438751448505
Dummy classifier f1-score:  0.2236024844720497
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
kepala


Training f1-score: 1.0
Cross validation f1-score: 0.4894472726100891
Dummy classifier f1-score:  0.3046964490263459
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
cabang


Training f1-score: 1.0
Cross validation f1-score: 0.4987663232824523
Dummy classifier f1-score:  0.3177570093457944
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
bisa


Training f1-score: 1.0
Cross validation f1-score: 0.494181396267404
Dummy classifier f1-score:  0.43718592964824116
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
tinggi


Training f1-score: 1.0
Cross validation f1-score: 0.37725016620080365
Dummy classifier f1-score:  0.0641399416909621
Best parameters:
C : 2.0
max_iter : 20
----------------------------------
halaman


Training f1-score: 1.0
Cross validation f1-score: 0.6941466239995652
Dummy classifier f1-score:  0.17857142857142858
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
bintang


Training f1-score: 0.9844077961019491
Cross validation f1-score: 0.44260478755833066
Dummy classifier f1-score:  0.12710280373831775
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
buah


Training f1-score: 1.0
Cross validation f1-score: 0.6683879127172846
Dummy classifier f1-score:  0.17454545454545456
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
kunci


Training f1-score: 0.9858461538461537
Cross validation f1-score: 0.3342238820308996
Dummy classifier f1-score:  0.09226190476190477
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
menurunkan


Training f1-score: 0.9938026378515811
Cross validation f1-score: 0.27347652819993246
Dummy classifier f1-score:  0.12656641604010024
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
jaringan


Training f1-score: 1.0
Cross validation f1-score: 0.6054499609894493
Dummy classifier f1-score:  0.20202020202020202
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
mengandung


Training f1-score: 0.4895833333333333
Cross validation f1-score: 0.48964102564102563
Dummy classifier f1-score:  0.4895833333333333
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
sarung


Training f1-score: 1.0
Cross validation f1-score: 0.940914435431447
Dummy classifier f1-score:  0.37
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
menangkap


Training f1-score: 1.0
Cross validation f1-score: 0.4136730555760554
Dummy classifier f1-score:  0.29508196721311475
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
mengejar


Training f1-score: 1.0
Cross validation f1-score: 0.7417638570579748
Dummy classifier f1-score:  0.38257575757575757
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
menjaga


Training f1-score: 0.9929824561403509
Cross validation f1-score: 0.32123046114532183
Dummy classifier f1-score:  0.1634980988593156
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
berat


Training f1-score: 1.0
Cross validation f1-score: 0.32535259087890667
Dummy classifier f1-score:  0.0851581508515815
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
jam


Training f1-score: 0.9680131119805435
Cross validation f1-score: 0.782462784962785
Dummy classifier f1-score:  0.17924528301886794
Best parameters:
C : 0.5
max_iter : 20
----------------------------------
menerima


Training f1-score: 1.0
Cross validation f1-score: 0.19255603137741772
Dummy classifier f1-score:  0.11327433628318584
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
rapat


Training f1-score: 1.0
Cross validation f1-score: 0.7266570466570468
Dummy classifier f1-score:  0.45964912280701753
Best parameters:
C : 2.0
max_iter : 20
----------------------------------
asing


Training f1-score: 0.7559735638027453
Cross validation f1-score: 0.49723385514243434
Dummy classifier f1-score:  0.3150684931506849
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
kali


Training f1-score: 0.9955327815004598
Cross validation f1-score: 0.7948879934114147
Dummy classifier f1-score:  0.2309711286089239
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
bulan


Training f1-score: 0.990195561646724
Cross validation f1-score: 0.8991391158125029
Dummy classifier f1-score:  0.45614035087719296
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
dunia


Training f1-score: 0.9867178378915468
Cross validation f1-score: 0.23112351675199286
Dummy classifier f1-score:  0.106544901065449
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
mengikat


Training f1-score: 0.9827760891590679
Cross validation f1-score: 0.40754512584147634
Dummy classifier f1-score:  0.13793103448275862
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
besar


Training f1-score: 1.0
Cross validation f1-score: 0.5602709574650621
Dummy classifier f1-score:  0.15732758620689655
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
kabur


Training f1-score: 1.0
Cross validation f1-score: 0.47654195324452153
Dummy classifier f1-score:  0.3182674199623352
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
lingkungan


Training f1-score: 1.0
Cross validation f1-score: 0.4829576282079063
Dummy classifier f1-score:  0.16113744075829384
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
ketat


Training f1-score: 1.0
Cross validation f1-score: 0.5242469265449964
Dummy classifier f1-score:  0.1534090909090909
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
tengah


Training f1-score: 1.0
Cross validation f1-score: 0.5955026299701135
Dummy classifier f1-score:  0.1152
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
harapan


Training f1-score: 1.0
Cross validation f1-score: 0.41549490093087293
Dummy classifier f1-score:  0.126984126984127
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
nilai


Training f1-score: 1.0
Cross validation f1-score: 0.5244712102408293
Dummy classifier f1-score:  0.12085308056872036
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
kaki


Training f1-score: 0.9830418653948065
Cross validation f1-score: 0.7767166891079935
Dummy classifier f1-score:  0.1729559748427673
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
mengisi


Training f1-score: 1.0
Cross validation f1-score: 0.41449628776241676
Dummy classifier f1-score:  0.14791666666666667
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
mendorong


Training f1-score: 1.0
Cross validation f1-score: 0.5019379019213319
Dummy classifier f1-score:  0.4672364672364672
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
lebat


Training f1-score: 0.9508196721311475
Cross validation f1-score: 0.9370948468193937
Dummy classifier f1-score:  0.3920265780730897
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
badan


Training f1-score: 1.0
Cross validation f1-score: 0.6245487347036582
Dummy classifier f1-score:  0.16591928251121077
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
dalam


Training f1-score: 0.9971092780494927
Cross validation f1-score: 0.34283933675564926
Dummy classifier f1-score:  0.07039337474120083
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
membawa


Training f1-score: 1.0
Cross validation f1-score: 0.2760827731902477
Dummy classifier f1-score:  0.05442176870748299
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
bunga


Training f1-score: 1.0
Cross validation f1-score: 0.6303251818246853
Dummy classifier f1-score:  0.29917550058892817
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
cerah


Training f1-score: 0.9095238095238095
Cross validation f1-score: 0.7144494345155701
Dummy classifier f1-score:  0.47586206896551725
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
dasar


Training f1-score: 0.9894076655052265
Cross validation f1-score: 0.4232864620759358
Dummy classifier f1-score:  0.12987012987012986
Best parameters:
C : 8.0
max_iter : 40
----------------------------------
mengeluarkan


Training f1-score: 1.0
Cross validation f1-score: 0.26159999999999994
Dummy classifier f1-score:  0.13970588235294118
Best parameters:
C : 2.0
max_iter : 20
----------------------------------
menyusun

Training f1-score: 1.0
Cross validation f1-score: 0.41527802915638556
Dummy classifier f1-score:  0.24242424242424243
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
Cross validation macro average f1-score: 0.5247340924290915
Dummy classifier macro average f1-score: 0.2198565853937676
elapsed time: 291.33560620000026


In [53]:
classification_report(
    y_train[list(dataset.query('kata == "{}"'.format('kunci')).index)], 
    classifier['kunci'].predict(X_train[list(dataset.query('kata == "{}"'.format('kunci')).index)]),
    output_dict=True
)

{'0': {'precision': 1.0,
  'recall': 0.8571428571428571,
  'f1-score': 0.923076923076923,
  'support': 7},
 '1': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 8},
 '2': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 27},
 '3': {'precision': 0.96875,
  'recall': 1.0,
  'f1-score': 0.9841269841269841,
  'support': 62},
 '4': {'precision': 1.0,
  'recall': 0.9782608695652174,
  'f1-score': 0.989010989010989,
  'support': 46},
 '5': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12},
 'accuracy': 0.9876543209876543,
 'macro avg': {'precision': 0.9947916666666666,
  'recall': 0.972567287784679,
  'f1-score': 0.9827024827024827,
  'support': 162},
 'weighted avg': {'precision': 0.9880401234567902,
  'recall': 0.9876543209876543,
  'f1-score': 0.9874809689624503,
  'support': 162}}