# Text classification

The task concentrates on content-based text classification.



1. Get acquainted with the data of the [Polish Cyberbullying detection dataset](https://huggingface.co/datasets/poleval2019_cyberbullying).
   Pay special attention to the distribution of the positive and negative examples in the first task as well as
   the distribution of the classes in the second task.
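
As a minimal sketch of how the class distribution could be inspected, the snippet below counts labels and reports their shares. The helper `class_distribution` and the `example_labels` list are hypothetical; the list only mimics the class sizes (866 and 134) that appear as support in the task 1 test-set reports in this notebook, and in practice the label column of a dataset split would be passed instead.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Illustrative labels only: 866 examples of class 0 and 134 of class 1,
# mirroring the support values reported for the task 1 test set.
example_labels = [0] * 866 + [1] * 134

for label, (n, share) in class_distribution(example_labels).items():
    print(f"label={label}: {n} examples ({share:.1%})")
```

With roughly 13% positive examples, a classifier that always answers "non-harmful" would already be right most of the time, which is why the imbalance matters for the choice of metrics later on.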


In [185]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB  # the only naive Bayes variant used below
from sklearn import metrics
from sklearn.pipeline import make_pipeline
import pandas as pd
import numpy as np
import csv
import random
from imblearn.over_sampling import RandomOverSampler
from datasets import load_dataset
from pycm import ConfusionMatrix
from lime.lime_text import LimeTextExplainer
from IPython.display import display
import fasttext


### Utils

In [225]:
def print_metrics(test_y, y_pred, target_names):
    """Print F1 (binary, micro, macro), MCC and the confusion matrix."""
    is_binary = len(target_names) == 2
    if is_binary:
        # Binary F1 scores the positive class (label 1) only.
        print(f'f1_score={metrics.f1_score(test_y, y_pred)}')
    val = metrics.f1_score(test_y, y_pred, average='micro')
    print(f'f1_score micro={val}')
    val = metrics.f1_score(test_y, y_pred, average='macro')
    print(f'f1_score macro={val}')
    print(f'MCC={metrics.matthews_corrcoef(test_y, y_pred)}')
    print()

    if is_binary:
        print(metrics.classification_report(test_y, y_pred, target_names=target_names))

    print('Confusion matrix')
    print(metrics.confusion_matrix(test_y, y_pred))

In [2]:
class Statistics():
    """Evaluation metrics collected for one classifier."""

    def calc_metrics(self, test_y, y_pred):
        # Binary F1 (positive class = label 1); only defined for the two-class task 1.
        self.f1_score = metrics.f1_score(test_y, y_pred)
        self.f1_score_micro = metrics.f1_score(test_y, y_pred, average='micro')
        self.f1_score_macro = metrics.f1_score(test_y, y_pred, average='macro')
        self.mcc = metrics.matthews_corrcoef(test_y, y_pred)

        self.confusion_matrix = metrics.confusion_matrix(test_y, y_pred)


statistics = {"bayesian": Statistics(), "fasttext": Statistics(), "transformer": Statistics()}



In [202]:


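# Kept at module level: convert_to_fasttext_format() below reads this `dataset` global.
dataset = load_dataset('poleval2019_cyberbullying','task01')

train_X = dataset['train']['text']
train_y = dataset['train']['label']

test_X = dataset['test']['text']
test_y = dataset['test']['label']

# Randomly duplicate minority-class tweets until both classes are the same size.
oversample = RandomOverSampler(sampling_strategy='minority')

# fit_resample expects a 2-D feature array, hence the reshape to a single column.
train_X, train_y = oversample.fit_resample(np.array(train_X).reshape(-1,1), np.array(train_y))
train_X = train_X.reshape(-1).tolist()
train_y = train_y.reshape(-1).tolist()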


Reusing dataset poleval2019_cyber_bullying (/home/x/.cache/huggingface/datasets/poleval2019_cyber_bullying/task01/1.0.0/ce6060c56dae43c469bab309a7573b86299b0bcc2484e85cfe0ae70b5f770450)



In [209]:

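def load_data(task='task01'):
    """Load one PolEval task and oversample the training split."""
    # This `dataset` is local to the function; the module-level one is left untouched.
    dataset = load_dataset('poleval2019_cyberbullying',task)

    train_X = dataset['train']['text']
    train_y = dataset['train']['label']

    test_X = dataset['test']['text']
    test_y = dataset['test']['label']

    # 'minority' resamples only the smallest class, so in the three-class
    # task 2 the remaining non-majority class keeps its original size.
    oversample = RandomOverSampler(sampling_strategy='minority')

    # fit_resample expects a 2-D feature array, hence the reshape to a single column.
    train_X, train_y = oversample.fit_resample(np.array(train_X).reshape(-1,1), np.array(train_y))
    train_X = train_X.reshape(-1).tolist()
    train_y = train_y.reshape(-1).tolist()
    return [train_X,train_y,test_X,test_y]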
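
To make the oversampling step concrete, here is a minimal pure-Python sketch of the idea behind random oversampling of the minority class. It is not the imbalanced-learn implementation; `oversample_minority`, its `seed` parameter and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def oversample_minority(texts, labels, seed=0):
    """Duplicate random minority-class examples until that class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    target = max(counts.values())

    minority_texts = [t for t, y in zip(texts, labels) if y == minority]
    # Sample with replacement: existing examples are copied, nothing new is generated.
    extra = [rng.choice(minority_texts) for _ in range(target - counts[minority])]

    return texts + extra, labels + [minority] * len(extra)


texts = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
new_texts, new_labels = oversample_minority(texts, labels)
print(Counter(new_labels))
```

Because the copies are exact duplicates, oversampling has to be applied to the training split only; duplicating test examples would distort the reported metrics.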
    



In [210]:
pd.DataFrame(random.sample(list(zip(train_X,train_y)),20),columns=['text','pred'])

Unnamed: 0,text,pred
0,@anonymized_account @anonymized_account Ale ty...,1
1,@anonymized_account @anonymized_account @anony...,1
2,@anonymized_account @anonymized_account @anony...,1
3,"PODAJ DALEJ ten tweet, a oddasz głos na udział...",1
4,@anonymized_account A ten bubel to chwalił się...,0
5,Moi rodzice wzięli ślub jak mieli 23 lata a ja...,0
6,@anonymized_account Jak taka majeka mogła być ...,0
7,@anonymized_account @anonymized_account Już by...,1
8,"Jedna rzecz chyba wam umknęła. To, że Gdańsk p...",1
9,Prowokator B.Budka żali się na słowa PJK po sw...,1


2. Train the following classifiers on the training sets (for task 1 and task 2):

    i. Bayesian classifier with TF * IDF weighting.
    ii. fastText text classifier.
    iii. Transformer classifier (take into account that a number of experiments should be performed for this model).

3. Compare the results of classification on the test set. Select the appropriate measures (from accuracy, F1, macro/micro F1, MCC) to compare the results.

### Bayesian classifier with TF * IDF weighting

In [211]:

def tfidf(train_X, test_X):
    # The fitted vectorizer is stored in a global because sample_tf() and
    # lime_explain() look it up under the name `tf_idf`.
    global tf_idf
    tf_idf = TfidfVectorizer()

    # Fit the vocabulary and IDF weights on the training texts only,
    # then reuse them unchanged for the test texts.
    X_train_tf = tf_idf.fit_transform(train_X)
    X_test_tf = tf_idf.transform(test_X)

    # `train_y` is read from the notebook scope (set by load_data()).
    naive_bayes_classifier = MultinomialNB()
    naive_bayes_classifier.fit(X_train_tf, train_y)

    y_pred = naive_bayes_classifier.predict(X_test_tf)
    return [y_pred, naive_bayes_classifier]
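
For readers who want to see what the TF * IDF weighting computes, the following is a minimal pure-Python sketch. It uses the smoothed IDF and L2 normalisation that scikit-learn applies by default, but the whitespace tokenisation is a simplification and `tfidf_vectors` is a hypothetical helper, not part of any library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF: rare terms get a larger weight than common ones.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["to jest ok", "to jest pajac", "ok ok spoko"]
for vec in tfidf_vectors(docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

In the second toy document, "pajac" occurs in one document only and therefore outweighs "to" and "jest", which occur in two.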



In [212]:
def sample_tf(naive_bayes_classifier, test_X, test_y):
    # Take a few test examples whose true label is 1 and show the predictions.
    positives = [(x, y) for (x, y) in zip(test_X, test_y) if y == 1][1:10]
    testing_X = [x for (x, y) in positives]
    testing_y = [y for (x, y) in positives]

    # `tf_idf` and `class_names` come from the notebook scope.
    test_input = tf_idf.transform(testing_X)
    predictions = naive_bayes_classifier.predict(test_input)
    for X, y, predicted in zip(testing_X, testing_y, predictions):
        print(f'true={class_names[y]}, predicted={class_names[predicted]}, {X}')



In [213]:
# Local Interpretable Model-agnostic Explanations (LIME)


def lime_explain(idx):
    # LIME needs a function from raw texts to class probabilities,
    # so the vectorizer and the classifier are chained into one pipeline.
    c = make_pipeline(tf_idf, naive_bayes_classifier)

    explainer = LimeTextExplainer(class_names=class_names)

    exp = explainer.explain_instance(test_X[idx], c.predict_proba, num_features=6)
    # Column 1 of predict_proba is the probability of class_names[1].
    print('Probability(harmful) =', c.predict_proba([test_X[idx]])[0,1])

    display(pd.DataFrame(exp.as_list(),columns=['word','value']))
lime_explain(0)

Probability(harmful) = 0.23346132796823493


Unnamed: 0,word,value
0,ok,-0.23129
1,Duda,0.115853
2,pięć,-0.08275
3,Spoko,-0.059106
4,im,0.056553
5,Morawieckim,-0.032929


In [214]:
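def lime_explain_confusion_matrix(test_y,y_pred):
    cm = ConfusionMatrix(actual_vector=test_y, predict_vector=y_pred)
    # position() maps each class to the indices of its TP/FP/FN/TN examples.
    # Taking class 0 makes "non-harmful" the positive class in the labels printed below.
    conf_indices = cm.position()[0]

    states = ['TP','FP','FN','TN']
    for state in states:
        if len(conf_indices[state]) == 0:
            print(f'{state} no samples')
            continue
        # Explain the first example that falls into this cell.
        index = conf_indices[state][0]
        print(f'{state}, True={class_names[test_y[index]]}, Pred={class_names[y_pred[index]]}, {test_X[index]}')
        lime_explain(index)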
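
The TP/FP/FN/TN labels depend on which class is treated as positive. The sketch below shows the grouping in plain Python; `confusion_positions` is a hypothetical helper written for illustration and is not the pycm API, and the two label lists are toy data.

```python
def confusion_positions(actual, predicted, positive):
    """Group example indices into TP/FP/FN/TN with `positive` as the positive class."""
    cells = {"TP": [], "FP": [], "FN": [], "TN": []}
    for i, (a, p) in enumerate(zip(actual, predicted)):
        if p == positive:
            cells["TP" if a == positive else "FP"].append(i)
        else:
            cells["FN" if a == positive else "TN"].append(i)
    return cells


actual = [0, 1, 0, 1, 1]
predicted = [0, 0, 1, 1, 0]

# Same predictions, two different views of what counts as "positive".
print(confusion_positions(actual, predicted, positive=0))
print(confusion_positions(actual, predicted, positive=1))
```

With class 0 as positive, a harmful tweet predicted as non-harmful is counted as a false positive; with class 1 as positive the same tweet is a false negative. This is worth keeping in mind when reading the state labels in the outputs of this notebook.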
        


In [218]:
[train_X,train_y,test_X,test_y] = load_data()
class_names = ["non-harmful","harmful"]

[y_pred,naive_bayes_classifier] = tfidf(train_X,test_X)

sample_tf(naive_bayes_classifier,test_X,test_y)



Reusing dataset poleval2019_cyber_bullying (/home/x/.cache/huggingface/datasets/poleval2019_cyber_bullying/task01/1.0.0/ce6060c56dae43c469bab309a7573b86299b0bcc2484e85cfe0ae70b5f770450)



true=harmful, predicted=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
true=harmful, predicted=harmful, @anonymized_account Adrian Juda, figurant WSI i  lobby żydowskiego
true=harmful, predicted=non-harmful, @anonymized_account Widać ludziom w Sączu tak pasuje. Ja bym mu w gębę napluł
true=harmful, predicted=non-harmful, @anonymized_account powinnaś odpowiedzieć za działanie na szkodę Polski i obywateli
true=harmful, predicted=non-harmful, RT @anonymized_account @anonymized_account powinnaś odpowiedzieć za działanie na szkodę Polski i obywateli
true=harmful, predicted=non-harmful, @anonymized_account @anonymized_account @anonymized_account Najbardziej to on jest wolny od mózgu.
true=harmful, predicted=harmful, @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytaryzm!
true=harmful, predicted=harmful, RT @anonymized_account @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytar

In [219]:
print_metrics(test_y,y_pred,class_names)



f1_score=0.4491017964071856
f1_score micro=0.816
f1_score macro=0.6693288093680586
MCC=0.35373310130581387

              precision    recall  f1-score   support

    Positive       0.93      0.86      0.89       866
    Negative       0.38      0.56      0.45       134

    accuracy                           0.82      1000
   macro avg       0.65      0.71      0.67      1000
weighted avg       0.85      0.82      0.83      1000

Confusion matrix
[[741 125]
 [ 59  75]]
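
As a worked check, the metrics reported above can be re-derived from the confusion matrix alone. The sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes) and treats class 1 (harmful) as positive; the variable names are chosen for this example.

```python
import math

# Task 1 naive Bayes confusion matrix from the output above
# (rows: true class, columns: predicted class; class 1 = harmful).
tn, fp = 741, 125
fn, tp = 59, 75

accuracy = (tp + tn) / (tp + tn + fp + fn)
# F1 per class, written directly in terms of the matrix cells.
f1_harmful = 2 * tp / (2 * tp + fp + fn)
f1_non_harmful = 2 * tn / (2 * tn + fp + fn)
macro_f1 = (f1_harmful + f1_non_harmful) / 2
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"accuracy (= micro F1) = {accuracy:.3f}")
print(f"F1 harmful            = {f1_harmful:.3f}")
print(f"macro F1              = {macro_f1:.3f}")
print(f"MCC                   = {mcc:.3f}")
```

For single-label classification micro F1 equals accuracy, so it is dominated by the large non-harmful class. The F1 of the harmful class, macro F1 and MCC are more informative on this imbalanced test set.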


In [220]:
lime_explain_confusion_matrix(test_y,y_pred)

TP, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.22627213213909472


Unnamed: 0,word,value
0,ok,-0.229405
1,Duda,0.11248
2,pięć,-0.082503
3,im,0.071782
4,Spoko,-0.071228
5,Morawieckim,-0.033739


FP, True=harmful, Pred=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
Probability(harmful) = 0.40384868769954463


Unnamed: 0,word,value
0,pisdzielstwo,0.189273
1,większości,-0.143673
2,systemu,0.127227
3,mając,-0.125078
4,sądownictwa,-0.098716
5,Dokładnie,-0.019108


FN, True=non-harmful, Pred=harmful, Jajka na miękko czy na twardo? Jeśli jecie jajka ofc
Probability(harmful) = 0.7142550218022541


Unnamed: 0,word,value
0,twardo,0.147433
1,jajka,0.123507
2,Jajka,0.122235
3,Jeśli,-0.121246
4,ofc,-0.074811
5,czy,-0.027525


TN, True=harmful, Pred=harmful, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.6893850496242374


Unnamed: 0,word,value
0,szmaty,0.249719
1,komentuje,-0.103668
2,Tej,0.015356
3,się,-0.009473
4,anonymized_account,0.006356
5,nie,-0.002643


#### Bayes task 2

In [221]:
[train_X,train_y,test_X,test_y] = load_data('task02')
class_names = ["non-harmful","cyberbullying","hate-speech"]

[y_pred,naive_bayes_classifier] = tfidf(train_X,test_X)

sample_tf(naive_bayes_classifier,test_X,test_y)


Reusing dataset poleval2019_cyber_bullying (/home/x/.cache/huggingface/datasets/poleval2019_cyber_bullying/task02/1.0.0/ce6060c56dae43c469bab309a7573b86299b0bcc2484e85cfe0ae70b5f770450)



true=cyberbullying, predicted=non-harmful, @anonymized_account Widać ludziom w Sączu tak pasuje. Ja bym mu w gębę napluł
true=cyberbullying, predicted=non-harmful, @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytaryzm!
true=cyberbullying, predicted=non-harmful, RT @anonymized_account @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytaryzm!
true=cyberbullying, predicted=non-harmful, @anonymized_account W powietrzu sa środki chemiczne , uważaj wygladasz na podatnego
true=cyberbullying, predicted=cyberbullying, @anonymized_account Konserwator kuwety, gegacz i przechył mózgowy, za pieniądze robił loda w tokfm
true=cyberbullying, predicted=cyberbullying, @anonymized_account @anonymized_account A potem pójdziesz pod prokuratora za ukradzione kilometrówki
true=cyberbullying, predicted=non-harmful, @anonymized_account Oni akurat w dupie wnoszą bo to pedały i to dla nich nic nowego.
true=cyberbullying, predicted=cyberbullying, @anonymized_account @anonymize

In [231]:


print_metrics(test_y,y_pred,class_names)


f1_score micro=0.798
f1_score macro=0.3543097836492437
MCC=0.1628642950183259

Confusion matrix
[[786  80   0]
 [ 13  12   0]
 [ 76  33   0]]


In [272]:
def lime_explain_3(idx):
    c = make_pipeline(tf_idf, naive_bayes_classifier)

    explainer = LimeTextExplainer(class_names=class_names)

    exp = explainer.explain_instance(test_X[idx], c.predict_proba, num_features=6)
    # Column 1 of predict_proba is class_names[1] ("cyberbullying" in task 2),
    # even though the printed label still reads "harmful".
    print('Probability(harmful) =', c.predict_proba([test_X[idx]])[0,1])

    display(pd.DataFrame(exp.as_list(),columns=['word','value']))


def lime_explain_confusion_matrix_3(test_y,y_pred):
    cm = ConfusionMatrix(actual_vector=test_y, predict_vector=y_pred)
    # One TP/FP/FN/TN index dictionary per class, each class in turn treated as positive.
    conf_indices = cm.position()

    for conf_index in conf_indices:
        print(conf_index)
        for state in ['TP','FP','FN','TN']:
            if len(conf_indices[conf_index][state]) == 0:
                print(f'{state} no samples')
                continue
            # Explain the first example that falls into this cell.
            index = conf_indices[conf_index][state][0]

            print(f'{class_names[conf_index]} {state}, True={class_names[test_y[index]]}, Pred={class_names[y_pred[index]]}, {test_X[index]}')
            lime_explain_3(index)


lime_explain_confusion_matrix_3(test_y,y_pred)

0
non-harmful TP, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.05335649287357199


Unnamed: 0,word,value
0,ok,-0.102156
1,Spoko,-0.081619
2,Duda,-0.0418
3,będzie,-0.024526
4,po,0.02347
5,pięć,-0.021657


non-harmful FP, True=hate-speech, Pred=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
Probability(harmful) = 0.08273766026619488


Unnamed: 0,word,value
0,Dokładnie,-0.097256
1,prawa,-0.088858
2,pisdzielstwo,0.064997
3,większości,-0.055613
4,mając,-0.045208
5,sądownictwa,-0.026575


non-harmful FN, True=non-harmful, Pred=cyberbullying, @anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂
Probability(harmful) = 0.5802033418205965


Unnamed: 0,word,value
0,nawet,0.098759
1,Przecież,-0.052822
2,jest,-0.03092
3,anonymized_account,0.029567
4,nie,0.011798
5,przewrotka,0.000513


non-harmful TN, True=cyberbullying, Pred=cyberbullying, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.5190246889650267


Unnamed: 0,word,value
0,komentuje,-0.094351
1,Tej,0.08662
2,szmaty,-0.029537
3,anonymized_account,0.012286
4,nie,0.010709
5,się,0.004861


1
cyberbullying TP, True=cyberbullying, Pred=cyberbullying, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.5190246889650267


Unnamed: 0,word,value
0,komentuje,-0.094562
1,Tej,0.087321
2,szmaty,-0.029337
3,anonymized_account,0.011848
4,nie,0.0117
5,się,0.00531


cyberbullying FP, True=non-harmful, Pred=cyberbullying, @anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂
Probability(harmful) = 0.5802033418205965


Unnamed: 0,word,value
0,nawet,0.09879
1,Przecież,-0.052502
2,jest,-0.030715
3,anonymized_account,0.029884
4,nie,0.011498
5,przewrotka,-0.000105


cyberbullying FN, True=cyberbullying, Pred=non-harmful, @anonymized_account Widać ludziom w Sączu tak pasuje. Ja bym mu w gębę napluł
Probability(harmful) = 0.4286084689689802


Unnamed: 0,word,value
0,napluł,0.196522
1,pasuje,-0.191391
2,ludziom,-0.117872
3,Sączu,-0.09558
4,bym,0.092676
5,mu,0.062514


cyberbullying TN, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.05335649287357199


Unnamed: 0,word,value
0,ok,-0.100533
1,Spoko,-0.081315
2,Duda,-0.040818
3,będzie,-0.02285
4,pięć,-0.021905
5,po,0.017761


2
TP no samples
FP no samples
hate-speech FN, True=hate-speech, Pred=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
Probability(harmful) = 0.08273766026619488


Unnamed: 0,word,value
0,Dokładnie,-0.094713
1,prawa,-0.086414
2,pisdzielstwo,0.064213
3,większości,-0.056768
4,mając,-0.043074
5,sądownictwa,-0.02878


hate-speech TN, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.05335649287357199


Unnamed: 0,word,value
0,ok,-0.105038
1,Spoko,-0.083662
2,Duda,-0.042853
3,będzie,-0.025232
4,po,0.02362
5,pięć,-0.022209


### Fasttext classification

In [204]:

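def convert_to_fasttext_format(name):
    # NOTE: reads the module-level `dataset` (loaded with 'task01' above), not the
    # splits returned by load_data(), so no oversampling is applied here.
    df = pd.DataFrame(dataset[name])
    df = df[['label','text']]
    df.label = df.label.apply(lambda x: f'__label__{x}')
    return df

def tofile(df,filename):
    # fastText expects one example per line: "__label__<y> <text>".
    with open(filename, 'w', encoding='utf-8') as f:
        for label, text in zip(df.label, df.text):
            # Collapse whitespace so a line break inside a tweet cannot split an example.
            f.write(f"{label} {' '.join(text.split())}\n")


def fast_text():
    df = convert_to_fasttext_format('train')
    tofile(df,'train.txt')

    df = convert_to_fasttext_format('test')
    tofile(df,'test.txt')

    model = fasttext.train_supervised('train.txt',epoch=40)
    # predict() returns (labels, probabilities); strip the "__label__" prefix from
    # the top label. predict() rejects text that contains a newline.
    y_pred = [int(model.predict(x.replace('\n', ' '))[0][0].replace('__label__', '')) for x in test_X]
    return [y_pred,model]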
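
The fastText input format and the label parsing can be illustrated without the library. This is a minimal sketch: `to_fasttext_line` and `parse_fasttext_label` are hypothetical helpers, and `__label__` is the default label prefix fastText looks for.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    return f"{LABEL_PREFIX}{label} {' '.join(text.split())}"


def parse_fasttext_label(raw_label):
    """Turn a label such as '__label__12' back into the integer 12."""
    return int(raw_label[len(LABEL_PREFIX):])


line = to_fasttext_line(1, "Jajka na miękko\nczy na twardo?")
print(line)
print(parse_fasttext_label("__label__12"))
```

Stripping the prefix, rather than taking only the last character of the label, keeps the parsing correct if a task ever has ten or more classes.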

In [205]:
[y_pred,model] = fast_text()

Read 0M words
Number of words:  31486
Number of labels: 2
Progress: 100.0% words/sec/thread: 1174713 lr:  0.000000 avg.loss:  0.032007 ETA:   0h 0m 0s


In [206]:
def sample_ft(model,test_X,test_y,number_samples_to_predict = 5):
    # Show a few predictions for each true label (only labels 0 and 1 are sampled).
    for label in (0, 1):
        examples = [(x, y) for (x, y) in zip(test_X, test_y) if y == label][1:number_samples_to_predict]
        for X, y in examples:
            # The last character of "__label__<digit>" is the predicted class.
            predicted = int(model.predict(X)[0][0][-1])
            print(f'true={class_names[y]}, predicted={class_names[predicted]}, {X}')
        print()

sample_ft(model,test_X,test_y,5)

true=non-harmful, predicted=non-harmful, @anonymized_account @anonymized_account Ale on tu nie miał szans jej zagrania, a ta 'proba' to czysta prowizorka.
true=non-harmful, predicted=non-harmful, @anonymized_account No czy Prezes nie miał racji, mówiąc,ze to są zdradzieckie mordy? No czy nie miał racji?😁😁
true=non-harmful, predicted=non-harmful, @anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂
true=non-harmful, predicted=non-harmful, @anonymized_account @anonymized_account Owszem podatki tak. Ale nie w takich okolicznościach. Czemu Małysza odpalili z teamu Orlen?

true=harmful, predicted=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
true=harmful, predicted=non-harmful, @anonymized_account Adrian Juda, figurant WSI i  lobby żydowskiego
true=harmful, predicted=non-harmful, @anonymized_account Widać ludziom w Sączu tak pasuje. Ja bym mu w gębę napluł
true=harmful, predicted=non-h

In [207]:
print_metrics(test_y,y_pred,class_names)

f1_score=0.1875
f1_score micro=0.87
f1_score macro=0.5584239130434783
MCC=0.21243406452447067

              precision    recall  f1-score   support

    Positive       0.88      0.99      0.93       866
    Negative       0.58      0.11      0.19       134

    accuracy                           0.87      1000
   macro avg       0.73      0.55      0.56      1000
weighted avg       0.84      0.87      0.83      1000

Confusion matrix
[[855  11]
 [119  15]]


In [208]:
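# NOTE: lime_explain() wraps the TF-IDF + naive Bayes pipeline, so the examples are
# selected from the fastText predictions but the explanations come from the Bayes model.
lime_explain_confusion_matrix(test_y,y_pred)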

TP, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.23346132796823493


Unnamed: 0,word,value
0,ok,-0.232028
1,Duda,0.1172
2,pięć,-0.08213
3,Spoko,-0.059579
4,im,0.058608
5,Morawieckim,-0.031165


FP, True=harmful, Pred=non-harmful, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.6308802393175929


Unnamed: 0,word,value
0,szmaty,0.217141
1,komentuje,-0.095687
2,się,-0.011215
3,anonymized_account,0.006621
4,Tej,-0.005036
5,nie,-0.001894


FN, True=non-harmful, Pred=harmful, @anonymized_account Droga p.Kamilko! Leczyć się . Leczyć póki czas😁😁
Probability(harmful) = 0.8708997771641055


Unnamed: 0,word,value
0,Leczyć,0.389602
1,Droga,-0.114598
2,póki,0.041016
3,czas,0.026613
4,się,-0.008075
5,anonymized_account,0.003217


TN, True=harmful, Pred=harmful, @anonymized_account Dokładnie wie co mówi. A Ty pajacu poczytaj ustawę domsie dowiesz kto decyduje o wysokości zarobków w samorządach.
Probability(harmful) = 0.4766271084155878


Unnamed: 0,word,value
0,pajacu,0.183502
1,poczytaj,-0.141596
2,decyduje,-0.097367
3,ustawę,0.088714
4,dowiesz,-0.084682
5,Ty,0.068466


#### fastText task 2

In [277]:
[train_X,train_y,test_X,test_y] = load_data('task02')
class_names = ["non-harmful","cyberbullying","hate-speech"]


Reusing dataset poleval2019_cyber_bullying (/home/x/.cache/huggingface/datasets/poleval2019_cyber_bullying/task02/1.0.0/ce6060c56dae43c469bab309a7573b86299b0bcc2484e85cfe0ae70b5f770450)



In [278]:
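# NOTE: fast_text() trains on the module-level `dataset` (task01), so the model only
# knows labels 0 and 1 (see "Number of labels: 2" below) and never predicts class 2.
[y_pred,model] = fast_text()
print_metrics(test_y,y_pred,class_names)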



Read 0M words
Number of words:  31486
Number of labels: 2

f1_score micro=0.858
f1_score macro=0.34785554437760485
MCC=0.13673535994927147

Confusion matrix
[[855  11   0]
 [ 22   3   0]
 [ 95  14   0]]


Progress: 100.0% words/sec/thread: 1175691 lr:  0.000000 avg.loss:  0.031460 ETA:   0h 0m 0s


In [279]:
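# NOTE: lime_explain_3() wraps the TF-IDF + naive Bayes pipeline, so the examples are
# selected from the fastText predictions but the explanations come from the Bayes model.
lime_explain_confusion_matrix_3(test_y,y_pred)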

0
non-harmful TP, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.05335649287357199


Unnamed: 0,word,value
0,ok,-0.10151
1,Spoko,-0.083302
2,Duda,-0.040991
3,będzie,-0.024991
4,pięć,-0.024936
5,po,0.022766


non-harmful FP, True=cyberbullying, Pred=non-harmful, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.5190246889650267


Unnamed: 0,word,value
0,komentuje,-0.093996
1,Tej,0.086612
2,szmaty,-0.029657
3,anonymized_account,0.012394
4,nie,0.010973
5,się,0.004862


non-harmful FN, True=non-harmful, Pred=cyberbullying, @anonymized_account Droga p.Kamilko! Leczyć się . Leczyć póki czas😁😁
Probability(harmful) = 0.8818696512068902


Unnamed: 0,word,value
0,Leczyć,0.28542
1,póki,0.147739
2,Droga,-0.123732
3,anonymized_account,0.008284
4,się,0.00376
5,Kamilko,0.002331


non-harmful TN, True=hate-speech, Pred=cyberbullying, @anonymized_account Dokładnie wie co mówi. A Ty pajacu poczytaj ustawę domsie dowiesz kto decyduje o wysokości zarobków w samorządach.
Probability(harmful) = 0.4791263968887314


Unnamed: 0,word,value
0,ustawę,0.206883
1,Dokładnie,-0.172388
2,pajacu,0.166448
3,poczytaj,-0.13663
4,decyduje,-0.095203
5,Ty,0.081647


1
cyberbullying TP, True=cyberbullying, Pred=cyberbullying, @anonymized_account @anonymized_account @anonymized_account Zreszta ty chuja zobaczysz, kutasa ziobry najwyzej
Probability(harmful) = 0.9740090397622778


Unnamed: 0,word,value
0,ziobry,0.074529
1,chuja,0.074056
2,zobaczysz,0.067789
3,ty,0.051037
4,kutasa,-0.037977
5,anonymized_account,0.009364


cyberbullying FP, True=non-harmful, Pred=cyberbullying, @anonymized_account Droga p.Kamilko! Leczyć się . Leczyć póki czas😁😁
Probability(harmful) = 0.8818696512068902


Unnamed: 0,word,value
0,Leczyć,0.285941
1,póki,0.147427
2,Droga,-0.119833
3,anonymized_account,0.009199
4,czas,-0.003519
5,p,-0.002371


cyberbullying FN, True=cyberbullying, Pred=non-harmful, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.5190246889650267


Unnamed: 0,word,value
0,komentuje,-0.094691
1,Tej,0.085763
2,szmaty,-0.030553
3,anonymized_account,0.012349
4,nie,0.011609
5,się,0.004053


cyberbullying TN, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.05335649287357199


Unnamed: 0,word,value
0,ok,-0.104268
1,Spoko,-0.0847
2,Duda,-0.042062
3,będzie,-0.024279
4,pięć,-0.023745
5,po,0.02327


2
TP no samples
FP no samples
hate-speech FN, True=hate-speech, Pred=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
Probability(harmful) = 0.08273766026619488


Unnamed: 0,word,value
0,Dokładnie,-0.098205
1,prawa,-0.086603
2,pisdzielstwo,0.06513
3,większości,-0.057108
4,mając,-0.042641
5,sądownictwa,-0.025651


hate-speech TN, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.05335649287357199


Unnamed: 0,word,value
0,ok,-0.10349
1,Spoko,-0.083399
2,Duda,-0.041803
3,będzie,-0.025399
4,po,0.024775
5,pięć,-0.022294


### Transformers

In [None]:
# in the file cyberbulling_transformers

4. Select 1 TP, 1 TN, 1 FP and 1 FN from your predictions (for the best classifier) and compare the decisions of each
   classifier on these examples using [LIME](https://github.com/marcotcr/lime).

5. Answer the following questions:


1. Which of the classifiers works best for task 1 and task 2?

The transformer (fully trained) would perform best. Next, ex aequo

2. Did you achieve results comparable with the results of [PolEval Task](http://2019.poleval.pl/index.php/results/)?

3. Did you achieve results comparable with the [Klej leaderboard](https://klejbenchmark.com/leaderboard/)?

4. Describe strengths and weaknesses of each of the compared algorithms.

5. Do you think comparison of raw performance values on a single task is enough to assess the value of a given
  algorithm/model?
  
  
6. Did LIME show that the models use valuable features/words when performing their decision?