# Lexical-Sample Supervised Word Sense Disambiguation

## Progress
Classifier: Linear SVM
Cross validation: k-fold cross validation

### Include PERSON, LOCATION, ORGANIZATION, OTHER ENTITY, and MWE
5-fold CV

Features, cross validation macro average accuracy:
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .74
- It Makes Sense's Local Collocation + Surrounding Words: .726
- It Makes Sense's Local Collocation SVD: .721
- Latent Semantic Analysis: .713
- It Makes Sense's Local Collocation: .71
- Surrounding Words SVD: .708
- Collocation Vector: .70
- TF-IDF: .70
- Unigram-Bigram TFIDF: .69
- Choose most frequent sense: .57


Features, cross validation macro average F1-score:
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .533
- It Makes Sense's Local Collocation SVD: .528
- It Makes Sense's Local Collocation + Surrounding Words: .526
- It Makes Sense's Local Collocation: .524
- Collocation Vector SVD: .506
- Collocation Vector: .495
- Latent Semantic Analysis: .491
- Surrounding Words SVD: .477
- TF-IDF: .427
- Unigram-Bigram TF-IDF: .399
- Choose most frequent sense: .220

### Pure WSD,  with MWE
5-fold CV
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .550
- It Makes Sense's Local Collocation SVD: .539
- It Makes Sense's Local Collocation: .537
- Latent Semantic Analysis: .506
- Surrounding Words SVD: .505
- Collocation Vector SVD: .505
- Collocation Vector: .501
- TF-IDF: .434
- Choose most frequent sense: .241

### Pure WSD,  not even MWE
5-fold CV
- It Makes Sense's Replication, but SVD: .576
- It Makes Sense's Local Collocation SVD + Surrounding Words SVD: .567
- POS Tags SVD + Surrounding Words SVD: .561
- It Makes Sense's Local Collocation SVD + POS Tags SVD: .558
- It Makes Sense's Local Collocation SVD: .546
- It Makes Sense's Local Collocation: .545
- Latent Semantic Analysis: .524
- Surrounding Words SVD: .520
- POS Tags SVD: .488
- POS Tags: .488
- TF-IDF: .459
- Choose most frequent sense: .253

In general, training accuracy / f1-score is perfect, but the cross validation score is way too low, which means:
**Overfit**

### TODO
- Tackle overfit problem
- Wikipedia Indonesia Word Embedding
- SVD with larger dimension (with extra memory)
- Build balanced dataset: manual labor
- Latent Dirchlet Analysis

In [14]:
import sys
import pandas as pd
import numpy as np

# Load Data

In [2]:
dataset = pd.read_csv('train_data.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,kata,sense,kalimat,pos_tags,clean,targetpos_clean,targetpos_ori,targetpos_pos_tag
0,0,cerah,4801,cuaca cerah adalah lazim panjang tahun,NN NN VB NN NN NN Z,cuaca cerah lazim,1,1,1
1,1,cerah,4801,gambar yang hasil oleh layarnya cukup cerah da...,NNP SC VB IN NN RB JJ CC VB NN SC JJ VB NN SC ...,gambar hasil layarnya cerah milik speaker hasi...,3,6,6
2,2,cerah,4803,masa depan yang cerah bagi pemuda umur somenum...,NN NN SC VB IN NN NN CD IN NNP NNP CD Z,cerah bagi pemuda umur prancis abad,0,3,3
3,3,cerah,4801,cor caroli alpha canum venaticorum nama lengka...,NNP NNP Z NNP NNP NNP Z Z Z NN RB VB NNP NNP N...,cor caroli alpha canum venaticorum nama lengka...,12,16,21
4,4,cerah,4801,sanders lebih suka cat air untuk lilo dengan m...,NN RB VB NN NN SC NNP IN NN VB NN NN NN NN NN Z,sanders suka cat air lilo maksud tampil warna ...,8,11,11


# Drop rare sense from training set

In [7]:
RARE_LIMIT = 5
sense_set = set(dataset.sense)

In [8]:
rare_sense = set(filter(lambda s: len(dataset.query('sense == "{}"'.format(s))) <= RARE_LIMIT, sense_set))
len(rare_sense)

37

In [9]:
dataset_kata = []
dataset_sense = []
dataset_kalimat = []
dataset_clean = []
dataset_pos_clean = []
dataset_pos_ori = []
dataset_pos_tags = []
dataset_pos_pos_tag = []
for i in range(len(dataset)):
    row = dataset.iloc[i]
    if row.sense not in rare_sense:
        dataset_kata.append(row.kata)
        dataset_sense.append(row.sense)
        dataset_kalimat.append(row.kalimat)
        dataset_clean.append(row.clean)
        dataset_pos_clean.append(row.targetpos_clean)
        dataset_pos_ori.append(row.targetpos_ori)
        dataset_pos_tags.append(row.pos_tags)
        dataset_pos_pos_tag.append(row.targetpos_pos_tag)

dataset = pd.DataFrame({
    'kata': dataset_kata,
    'sense': dataset_sense,
    'kalimat': dataset_kalimat,
    'clean': dataset_clean,
    'targetpos_clean': dataset_pos_clean,
    'targetpos_ori': dataset_pos_ori,
    'pos_tags': dataset_pos_tags,
    'targetpos_pos_tag': dataset_pos_pos_tag,
})

In [10]:
set(dataset.query('kata == "{}"'.format('panas')).sense)

{'4901', '4903', '4904'}

In [11]:
len(dataset)

8311

# Feature Extraction

## POS Tags

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from functools import reduce

In [104]:
POS_TAGS_WINDOW = 2

In [105]:
pos_tags = [['-' for j in range(2*POS_TAGS_WINDOW+1)] for i in range(len(dataset))]
possible_tags = set()

for i in range(len(dataset)):
    row = dataset.iloc[i]
    tags = row.pos_tags.split()
    position = row.targetpos_pos_tag
    pos_tags[i][POS_TAGS_WINDOW] = tags[position]
    j = position-1
    k = POS_TAGS_WINDOW - 1
    while j >= 0 and j >= position - POS_TAGS_WINDOW:
        if tags[j] == 'Z':
            break # do not even include
        pos_tags[i][k] = tags[j]
        k -= 1
        j -= 1
    j = position+1
    k = POS_TAGS_WINDOW + 1
    while j < len(tags) and j <= position + POS_TAGS_WINDOW:
        pos_tags[i][k] = tags[j]
        if tags[j] == 'Z':
            break # include, then break

        k += 1
        j += 1
    
    

In [106]:
pos_tag_transformer = OneHotEncoder().fit(pos_tags)

In [107]:
pos_tags = pos_tag_transformer.transform(pos_tags)

In [108]:
pos_tags

<8311x98 sparse matrix of type '<class 'numpy.float64'>'
	with 41555 stored elements in Compressed Sparse Row format>

## TF-IDF

In [130]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [131]:
tfidf_u = TfidfVectorizer()
u_tfidf = tfidf_u.fit_transform(dataset.clean)

In [132]:
u_tfidf

<8311x20337 sparse matrix of type '<class 'numpy.float64'>'
	with 99962 stored elements in Compressed Sparse Row format>

## Unigram-Bigram TF-IDF
as in Faisal, et. al (2018) "Word Sense Disambiguation in Bahasa Indonesia using SVM"

In [109]:
combined_unigram_bigram = []

for i in range(len(dataset)):
    row = dataset.iloc[i]
    combined_unigram_bigram.append(row.clean + ' ' + row.clean_bigram)

In [111]:
tfidf_ub = TfidfVectorizer()
ub_tfidf = tfidf_ub.fit_transform(combined_unigram_bigram)

In [112]:
ub_tfidf

<8721x109586 sparse matrix of type '<class 'numpy.float64'>'
	with 220345 stored elements in Compressed Sparse Row format>

## Latent Semantic Analysis

In [57]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

In [136]:
svdtfidf = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))
lsa = svdtfidf.fit_transform(u_tfidf)

In [137]:
lsa.shape

(8311, 1000)

## Collocation Vector

In [63]:
from sklearn.feature_extraction.text import CountVectorizer
from preprocessor import normalize_money, normalize_number, stemmer, pipe

In [41]:
CONTEXT_WINDOW = 3

In [42]:
context_words = [[] for i in range(len(dataset))]

for i in range(len(dataset)):
    tokens = dataset.iloc[i].kalimat.split()
    pos = dataset.iloc[i].targetpos_ori
    for j in range(max(0, pos-CONTEXT_WINDOW), pos):
        token = pipe(normalize_money, normalize_number, stemmer.stem)(tokens[j])
        context_words[i].append(token)
    for j in range(pos+1, min(len(tokens), pos+CONTEXT_WINDOW+1)):
        token = pipe(normalize_money, normalize_number, stemmer.stem)(tokens[j])
        context_words[i].append(token)
    context_words[i] = ' '.join(context_words[i])

In [43]:
cv = CountVectorizer()
collocation_vector = cv.fit_transform(list(map(lambda s: ' '.join(set(s.split())), context_words)))

In [44]:
collocation_vector

<8428x7855 sparse matrix of type '<class 'numpy.int64'>'
	with 45075 stored elements in Compressed Sparse Row format>

## Correct implementation of collocation vectors
as in Zhong & Ng (2010) "It Makes Sense", but unigram and bigrams only 

In [73]:
from scipy.sparse.csr import csr_matrix
import time
from functools import reduce

In [74]:
collocation_pos = {
    (-2, -2), (-1, -1), (1, 1), (2, 2), (-2, -1), (-1, 1), (1, 2),
}

In [75]:
def get_collocation(sentence, targetpos, L, R):
    col = ['-' for i in range(R-L+1 - (1 if L < 0 and R > 0 else 0))]
    tokens = sentence.split()
    L = targetpos+L
    R = targetpos+R
    j = L
    i = 0
    while j <= R:
        if j < 0:
            j += 1
            i += 1
            continue
        if j == targetpos:
            j += 1
            continue
        if j >= len(tokens):
            break
        col[i] = tokens[j]
        j += 1
        i += 1
    
    return col

In [76]:
print(dataset.iloc[2].kalimat, dataset.iloc[2].targetpos_ori)

masa depan yang cerah bagi pemuda umur somenumber di prancis abad somenumber 3


In [77]:
collocation_words = [[] for i in range(len(dataset))]
collocations = ['' for i in range(len(dataset))]

for i in range(len(dataset)):
    instance = dataset.iloc[i]
    for l, r in collocation_pos:
        collocation_words[i].append(get_collocation(instance.kalimat, instance.targetpos_ori, l, r))
        collocations[i] = ' '.join(list(map(lambda s: ' '.join(s) , collocation_words[i])))

In [78]:
cv_unigram_bigram = CountVectorizer().fit(collocations)

In [79]:
collocation_vectors = [0 for i in range(len(dataset))]

for i in range(len(dataset)):
    collocation_vectors[i] = cv_unigram_bigram.transform(
        reduce(lambda acc, nex: [*acc, *nex], collocation_words[i], [])
    ).reshape(1, -1)

In [80]:
collocation_vectors = csr_matrix([np.array(vec.toarray()[0], dtype=np.bool) for vec in collocation_vectors])

In [81]:
collocation_vectors

<8311x62350 sparse matrix of type '<class 'numpy.bool_'>'
	with 80193 stored elements in Compressed Sparse Row format>

## Surrounding Words

In [64]:
cv_bin = CountVectorizer()
surrounding_words = cv_bin.fit_transform(
    list(map(lambda s: ' '.join(set(s.split())), dataset.clean))
)

In [67]:
surrounding_words = csr_matrix(np.array([surrounding_words[i].toarray()[0] for i in range(surrounding_words.shape[0])], dtype=np.bool))

In [68]:
surrounding_words

<8311x20337 sparse matrix of type '<class 'numpy.bool_'>'
	with 99962 stored elements in Compressed Sparse Row format>

## Wikipedia Word2Vec Word Embedding

In [12]:
import gensim

In [23]:
word_vectors = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('../wikipedia_indonesia_embedding50.model')

In [24]:
EMBEDDING_SIZE = word_vectors['presiden'].shape[0]

### Exponential Decay Word Embedding Features
Iacobacci, et. al (2016)

In [25]:
embedding = []

W = 5
alpha = 1 - (np.power(0.1, np.power(W-1.0, -1)))

for p in range(len(dataset)):
    if (p % 800) == 0:
        sys.stdout.write("\r{0:.2f}".format(p/len(dataset)))
        sys.stdout.flush()
    instance = dataset.iloc[p]
    e = np.zeros(EMBEDDING_SIZE)
    I = instance.targetpos_clean
    words = instance.clean.split()
    for i in range(EMBEDDING_SIZE):
        for j in range(max(0, I-W), min(len(words), I+W+1)):
            if j == I:
                continue
            try:
                e[i] += (word_vectors.get_vector(words[j])[i] * (np.power(1 - alpha, abs(I-j) - 1)))
            except:
                continue
    embedding.append(e)
            

0.96

In [27]:
embedding = np.array(embedding)

In [28]:
embedding.shape

(8311, 50)

### Sum of context word embeddings

In [112]:
embedding = np.array(
    list(map(
        lambda s: reduce(
            lambda x, y: x + embedding_model.wv[y],
            s.split(),
            embedding_model[s.split()[0]]
        ),
        context_words
    ))
)

  


# Form Training Set

## It Makes Sense's Collocation Vectors only

In [127]:
X_train = collocation_vectors

In [128]:
collocation_vectors

<8311x62350 sparse matrix of type '<class 'numpy.bool_'>'
	with 80193 stored elements in Compressed Sparse Row format>

## IMS Collocation Vectors SVD

In [82]:
svdimscv = make_pipeline(TruncatedSVD(5000), Normalizer(copy=False))

In [83]:
begin = time.perf_counter()
X_train = svdimscv.fit_transform(collocation_vectors)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 741.4769104000002


In [84]:
imscvsvd = X_train

In [85]:
X_train = imscvsvd

In [24]:
X_train.shape

(8311, 5000)

## Surrounding Words Only

In [84]:
X_train = surrounding_words

## Surrounding Words SVD

In [69]:
svdsw = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))

In [70]:
begin = time.perf_counter()
swsvd = svdsw.fit_transform(surrounding_words)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 22.242304500000046


In [42]:
X_train = swsvd

## IMS Collocation Vectors SVD + Surrounding Words SVD

In [43]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*imscvsvd[i], *swsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 9.209585700000162


In [44]:
X_train.shape

(8311, 6000)

## Collocation Vector only

In [13]:
X_train = collocation_vector

## Collocation Vector SVD

In [28]:
svdcv = make_pipeline(TruncatedSVD(1000), Normalizer(copy=False))

In [29]:
begin = time.perf_counter()
cvsvd = svdcv.fit_transform(collocation_vector)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 21.836217599999998


In [30]:
X_train = cvsvd

## Word Embedding Only

In [29]:
X_train = embedding

## IMS Collocation Vectors + Surrounding Words

In [57]:
transform_to_imscv_sw = lambda imscv, sw: csr_matrix(
    np.array(
        list(map(lambda i: [*imscv[i].toarray()[0], *sw[i].toarray()[0]], [i for i in range(imscv.shape[0])])),
        dtype=np.bool
    )
)

In [58]:
begin = time.perf_counter()
X_train = transform_to_imscv_sw(
    collocation_vectors,
    surrounding_words
)
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 250.5095411000002


In [78]:
X_train

<8721x238788 sparse matrix of type '<class 'numpy.bool_'>'
	with 301381 stored elements in Compressed Sparse Row format>

## TF-IDF Only

In [133]:
X_train = u_tfidf

## Unigram-Bigram TF-IDF Only

In [147]:
X_train = ub_tfidf

## LSA Only

In [139]:
X_train = lsa

## POS Tags Only

In [109]:
X_train = pos_tags

## POS Tags SVD

In [112]:
svdpos = make_pipeline(TruncatedSVD(80), Normalizer(copy=False))

In [113]:
possvd = svdpos.fit_transform(pos_tags)

In [114]:
X_train = possvd

## Surrounding Words SVD + POS Tags SVD

In [116]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*possvd[i], *swsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 3.57551219999732


## IMS Collocation Vectors + POS Tags SVD

In [118]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*possvd[i], *imscvsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 17.504873400001088


## It Makes Sense's Features Set SVD
The It Makes Sense's disambiguator system uses collocation vectors, surrounding words, and POS tags

In [120]:
begin = time.perf_counter()
X_train = np.array(list(map(lambda i: [*possvd[i], *imscvsvd[i], *swsvd[i]], [i for i in range(len(dataset))])))
print('elapsed time:', time.perf_counter() - begin)

elapsed time: 18.15229390001332


## Labels

In [32]:
annotated_words = set(dataset.kata)

In [33]:
mappers = dict()
for w in annotated_words:
    possible_sense = set(dataset.query('kata == "{}"'.format(w)).sense)
    mappers[w] = []
    for sense, i in zip(list(possible_sense),  [n for n in range(len(possible_sense))]):
        mappers[w].append((sense, i))

In [34]:
y_train = np.array([list(filter(lambda m: m[0] == sense, mappers[kata]))[0][1] for sense, kata in zip(dataset.sense, dataset.kata)])

# Training

Dummy classifier: always choose the most frequent sense

In [35]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [36]:
classifier = {w: None for w in annotated_words}

In [37]:
classification_report([0,1,1], [0,1,0], output_dict=True)['macro avg']['f1-score']

0.6666666666666666

In [38]:
'''
Select best parameter using k-fold cross validation
'''
def train(X, y, clf, possible_param, fold=3):
    clf = GridSearchCV(clf, possible_param, cv=fold, n_jobs=7, iid=False)
    clf.fit(X, y)
    label_counts = np.bincount(y)
    most_freq_label = np.argmax(label_counts)
    print()
    print('Cross validation accuracy:', clf.best_score_)
    dummy_score = label_counts[most_freq_label] / len(y)
    print('Dummy classifier accuracy: ', dummy_score)
    print_param(clf.best_params_)
    return (clf.best_estimator_, clf.best_score_, dummy_score)

def train_f1(X, y, clf, possible_param, fold=3):
    clf = GridSearchCV(clf, possible_param, cv=fold, n_jobs=7, iid=False, scoring='f1_macro')
    clf.fit(X, y)
    label_counts = np.bincount(y)
    most_freq_label = np.argmax(label_counts)
    print()
    print('Training f1-score:', classification_report(y, clf.predict(X), output_dict=True)['macro avg']['f1-score'])
    print('Cross validation f1-score:', clf.best_score_)
    dummy_score = classification_report(y, [most_freq_label for i in y], output_dict=True)['macro avg']['f1-score']
    print('Dummy classifier f1-score: ', dummy_score)
    print_param(clf.best_params_)
    return (clf.best_estimator_, clf.best_score_, dummy_score)

In [39]:
def print_param(param):
    print('Best parameters:')
    for p in param:
        print(p, ':', param[p])

In [40]:
def train_all(clf, possible_param, fold=5, algorithm_name=''):
    print(algorithm_name)
    scores = []
    dummy_scores = []
    for w in classifier.keys():
        print('==================================')
        print(w)
        indexes = list(dataset.query('kata == "{}"'.format(w)).index)
        best_clf, best_score, dummy_score = train(X_train[indexes], y_train[indexes], clf, possible_param, fold)
        scores.append(best_score)
        dummy_scores.append(dummy_score)
        classifier[w] = best_clf
        print('----------------------------------')
    print('Cross validation macro average accuracy:', sum(scores)/len(scores))
    print('Dummy classifier macro average accuracy:', sum(dummy_scores)/len(dummy_scores))

def train_all_f1(clf, possible_param, fold=5, algorithm_name=''):
    print(algorithm_name)
    scores = []
    dummy_scores = []
    for w in classifier.keys():
        print('==================================')
        print(w)
        indexes = list(dataset.query('kata == "{}"'.format(w)).index)
        best_clf, best_score, dummy_score = train_f1(X_train[indexes], y_train[indexes], clf, possible_param, fold)
        scores.append(best_score)
        dummy_scores.append(dummy_score)
        classifier[w] = best_clf
        print('----------------------------------')
    print('Cross validation macro average f1-score:', sum(scores)/len(scores))
    print('Dummy classifier macro average f1-score:', sum(dummy_scores)/len(dummy_scores))

In [41]:
y_train[list(dataset.query('kata == "{}"'.format('besar')).index)]

array([1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 1, 3, 1, 2, 1, 2, 3, 2, 1, 3, 1,
       1, 3, 1, 2, 1, 1, 2, 2, 3, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1,
       1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 0, 2, 2, 1, 2, 1, 2, 1, 1, 1,
       1, 1, 1, 2, 2, 2, 1, 3, 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1,
       0, 1, 3, 2, 2, 2, 1, 2, 1, 0, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 1,
       3, 2, 2, 2, 1, 1, 1, 1, 2, 0, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1,
       1, 1, 2, 0, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1,
       2, 1, 2, 2, 2])

## Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression

In [None]:

train_all(
    LogisticRegression(),
    {'solver':['newton-cg'], 'max_iter':[10, 20, 50], 'multi_class': ['ovr', 'multinomial']},
    algorithm_name='Logistic Regression'
)

## Linear SVM

In [54]:
from sklearn.svm import LinearSVC
import time

In [121]:
begin = time.perf_counter()
train_all_f1(
    LinearSVC(),
    {'max_iter': [10, 20, 40], 'C':[0.25, 0.5, 1.0, 2.0, 4.0, 8.0]},
    algorithm_name='Linear SVM'
)
print('elapsed time:', time.perf_counter() - begin)

Linear SVM
layar


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.9222772777833959
Dummy classifier f1-score:  0.3347826086956522
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
menangkap


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.41832725360675677
Dummy classifier f1-score:  0.29508196721311475
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
mengikat


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.505743560704861
Dummy classifier f1-score:  0.13793103448275862
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
bunga


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.695969955969956
Dummy classifier f1-score:  0.4703703703703704
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
ketat


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5593469841948103
Dummy classifier f1-score:  0.1534090909090909
Best parameters:
C : 2.0
max_iter : 20
----------------------------------
kaki


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.8772834894574025
Dummy classifier f1-score:  0.24444444444444446
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
kabur


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5442572647212249
Dummy classifier f1-score:  0.3182674199623352
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
cabang


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.36328245231471035
Dummy classifier f1-score:  0.3177570093457944
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
bidang


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.4855121293800539
Dummy classifier f1-score:  0.4854014598540146
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
jam


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.7647918250357275
Dummy classifier f1-score:  0.17924528301886794
Best parameters:
C : 1.0
max_iter : 20
----------------------------------
badan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.6333344833534587
Dummy classifier f1-score:  0.27255985267034993
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
jalan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.22147524020694753
Dummy classifier f1-score:  0.1651376146788991
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
kepala


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.4511788743416908
Dummy classifier f1-score:  0.3046964490263459
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
bintang


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5969861671301351
Dummy classifier f1-score:  0.24113475177304963
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
atas


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5573626707461294
Dummy classifier f1-score:  0.05970149253731344
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
cerah


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.6925626420627399
Dummy classifier f1-score:  0.47586206896551725
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
buah


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.7550853748784784
Dummy classifier f1-score:  0.2397003745318352
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
mengeluarkan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.3125520061334014
Dummy classifier f1-score:  0.13970588235294118
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
kali


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.9955327815004598
Cross validation f1-score: 0.805601544985603
Dummy classifier f1-score:  0.2309711286089239
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
menerima


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.9936507936507937
Cross validation f1-score: 0.20151100189547586
Dummy classifier f1-score:  0.11327433628318584
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
baru


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.6870108227356029
Dummy classifier f1-score:  0.284037558685446
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
berat


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.3488748300203408
Dummy classifier f1-score:  0.10769230769230768
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
besar


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5616626099114812
Dummy classifier f1-score:  0.15732758620689655
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
membawa


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.27858967314075667
Dummy classifier f1-score:  0.05442176870748299
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
nilai


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5129542986402926
Dummy classifier f1-score:  0.12085308056872036
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
mendorong


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.46727660078115757
Dummy classifier f1-score:  0.4672364672364672
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
rapat


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.7880785248809083
Dummy classifier f1-score:  0.45964912280701753
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
lebat


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.9456784282277466
Dummy classifier f1-score:  0.3920265780730897
Best parameters:
C : 4.0
max_iter : 40
----------------------------------
asing


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.687593984962406
Dummy classifier f1-score:  0.4825174825174825
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
kunci


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.4771252901687685
Dummy classifier f1-score:  0.14832535885167464
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
pembagian


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.9935823385958157
Cross validation f1-score: 0.32871069356363475
Dummy classifier f1-score:  0.1721698113207547
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
menurunkan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.35066846519020434
Dummy classifier f1-score:  0.12656641604010024
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
mata


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.8402380361819721
Dummy classifier f1-score:  0.14371257485029942
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
mengejar


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.7778639055585467
Dummy classifier f1-score:  0.38257575757575757
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
memecahkan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.6339877242652684
Dummy classifier f1-score:  0.2164821648216482
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
dalam


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.9971092780494927
Cross validation f1-score: 0.36120164399999116
Dummy classifier f1-score:  0.07039337474120083
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
bisa


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5378947368421053
Dummy classifier f1-score:  0.43718592964824116
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
sarung


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.9617065081317351
Dummy classifier f1-score:  0.37
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
menjaga


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.3152461934814876
Dummy classifier f1-score:  0.1634980988593156
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
coklat


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.6438415015031327
Dummy classifier f1-score:  0.3037974683544304
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
tinggi


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.4478693122734043
Dummy classifier f1-score:  0.10232558139534884
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
dunia


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.45079573204573203
Dummy classifier f1-score:  0.1794871794871795
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
jaringan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.7077296534054626
Dummy classifier f1-score:  0.20202020202020202
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
bulan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.9219594854070661
Dummy classifier f1-score:  0.45614035087719296
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
mengandung


  'precision', 'predicted', average, warn_for)



Training f1-score: 0.9653404067197171
Cross validation f1-score: 0.48964102564102563
Dummy classifier f1-score:  0.4895833333333333
Best parameters:
C : 0.25
max_iter : 10
----------------------------------
tengah


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.6630538302277433
Dummy classifier f1-score:  0.1624203821656051
Best parameters:
C : 0.5
max_iter : 10
----------------------------------
menyusun


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.3999602751168444
Dummy classifier f1-score:  0.24242424242424243
Best parameters:
C : 1.0
max_iter : 10
----------------------------------
dasar


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5008386795578447
Dummy classifier f1-score:  0.176056338028169
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
harapan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5933918899136291
Dummy classifier f1-score:  0.2818428184281843
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
halaman


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.698387214857803
Dummy classifier f1-score:  0.24637681159420288
Best parameters:
C : 4.0
max_iter : 20
----------------------------------
mengisi


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.37111730014316435
Dummy classifier f1-score:  0.14791666666666667
Best parameters:
C : 8.0
max_iter : 20
----------------------------------
lingkungan


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.5463242281648354
Dummy classifier f1-score:  0.2222222222222222
Best parameters:
C : 4.0
max_iter : 10
----------------------------------
kulit


  'precision', 'predicted', average, warn_for)



Training f1-score: 1.0
Cross validation f1-score: 0.6870683545826254
Dummy classifier f1-score:  0.26506024096385544
Best parameters:
C : 8.0
max_iter : 10
----------------------------------
panas

Training f1-score: 1.0
Cross validation f1-score: 0.7731355238919199
Dummy classifier f1-score:  0.23008849557522124
Best parameters:
C : 2.0
max_iter : 10
----------------------------------
Cross validation macro average f1-score: 0.576331836598511
Dummy classifier macro average f1-score: 0.2526642298604586
elapsed time: 324.4387279000075


  'precision', 'predicted', average, warn_for)


In [53]:
classification_report(
    y_train[list(dataset.query('kata == "{}"'.format('kunci')).index)], 
    classifier['kunci'].predict(X_train[list(dataset.query('kata == "{}"'.format('kunci')).index)]),
    output_dict=True
)

{'0': {'precision': 1.0,
  'recall': 0.8571428571428571,
  'f1-score': 0.923076923076923,
  'support': 7},
 '1': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 8},
 '2': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 27},
 '3': {'precision': 0.96875,
  'recall': 1.0,
  'f1-score': 0.9841269841269841,
  'support': 62},
 '4': {'precision': 1.0,
  'recall': 0.9782608695652174,
  'f1-score': 0.989010989010989,
  'support': 46},
 '5': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12},
 'accuracy': 0.9876543209876543,
 'macro avg': {'precision': 0.9947916666666666,
  'recall': 0.972567287784679,
  'f1-score': 0.9827024827024827,
  'support': 162},
 'weighted avg': {'precision': 0.9880401234567902,
  'recall': 0.9876543209876543,
  'f1-score': 0.9874809689624503,
  'support': 162}}