# Text classification

The task concentrates on content-based text classification.



1. Get acquainted with the data of the [Polish Cyberbullying detection dataset](https://huggingface.co/datasets/poleval2019_cyberbullying).
   Pay special attention to the distribution of the positive and negative examples in the first task as well as
   the distribution of the classes in the second task.
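A minimal sketch of how the class distribution could be inspected. The helper `class_distribution` and the toy labels are hypothetical; in the notebook the input would be `dataset['train']['label']`.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, fraction)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Hypothetical toy labels standing in for dataset['train']['label'].
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
dist = class_distribution(toy_labels)
for label, (n, frac) in dist.items():
    print(f"label={label}: {n} examples ({frac:.1%})")
```

A strongly skewed distribution is the reason plain accuracy is a weak measure here and why oversampling is applied to the training split later on.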


In [185]:
import csv
import random

import fasttext
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datasets import load_dataset
from imblearn.over_sampling import RandomOverSampler
from lime.lime_text import LimeTextExplainer
from pycm import ConfusionMatrix
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


### Utils

In [164]:
def print_metrics(test_y, y_pred):
    # The default (binary) F1 treats label 1 (harmful) as the positive class.
    print(f'f1_score={metrics.f1_score(test_y, y_pred)}')
    val = metrics.f1_score(test_y, y_pred, average='micro')
    print(f'f1_score micro={val}')
    val = metrics.f1_score(test_y, y_pred, average='macro')
    print(f'f1_score macro={val}')
    print(f'MCC={metrics.matthews_corrcoef(test_y, y_pred)}')
    print()
    # target_names follow label order: 'Positive' is label 0 (non-harmful),
    # 'Negative' is label 1 (harmful).
    print(metrics.classification_report(test_y, y_pred, target_names=['Positive', 'Negative']))

    print('Confusion matrix')
    print(metrics.confusion_matrix(test_y, y_pred))

In [2]:
class Statistics():
    def __init__(self):
        pass
    
    def calc_metrics(self, test_y, y_pred):
        self.f1_score =  metrics.f1_score(test_y,y_pred)
        self.f1_score_micro =  metrics.f1_score(test_y,y_pred,average='micro')
        self.f1_score_macro = metrics.f1_score(test_y,y_pred,average='macro')
        self.mcc = metrics.matthews_corrcoef(test_y, y_pred)
        
        self.confusion_matrix = metrics.confusion_matrix(test_y,y_pred)
        

        
statistics = {"bayesian" : Statistics(),"fasttext" :Statistics(),"transformer" : Statistics()}



In [202]:


dataset = load_dataset('poleval2019_cyberbullying','task01')


train_X = dataset['train']['text']
train_y = dataset['train']['label']

test_X = dataset['test']['text']
test_y = dataset['test']['label']

oversample = RandomOverSampler(sampling_strategy='minority')

train_X, train_y = oversample.fit_resample(np.array(train_X).reshape(-1,1), np.array(train_y))
train_X = train_X.reshape(-1).tolist()
train_y=train_y.reshape(-1).tolist()


Reusing dataset poleval2019_cyber_bullying (/home/x/.cache/huggingface/datasets/poleval2019_cyber_bullying/task01/1.0.0/ce6060c56dae43c469bab309a7573b86299b0bcc2484e85cfe0ae70b5f770450)



In [None]:
dataset = load_dataset('poleval2019_cyberbullying','task02')


train_X = dataset['train']['text']
train_y = dataset['train']['label']

test_X = dataset['test']['text']
test_y = dataset['test']['label']

oversample = RandomOverSampler(sampling_strategy='minority')

train_X, train_y = oversample.fit_resample(np.array(train_X).reshape(-1,1), np.array(train_y))
train_X = train_X.reshape(-1).tolist()
train_y=train_y.reshape(-1).tolist()
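Both loading cells oversample only the training split. As an illustration of what `RandomOverSampler(sampling_strategy='minority')` does, here is a plain-Python sketch of the same idea. The function `oversample_minority` and the toy data are hypothetical and are not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def oversample_minority(texts, labels, seed=0):
    """Duplicate random minority-class examples until that class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    target = max(counts.values())
    # Examples that can be duplicated.
    pool = [t for t, y in zip(texts, labels) if y == minority]
    extra = [rng.choice(pool) for _ in range(target - counts[minority])]
    return list(texts) + extra, list(labels) + [minority] * len(extra)

# Hypothetical toy data: four non-harmful texts and one harmful text.
toy_texts = ["a", "b", "c", "d", "e"]
toy_labels = [0, 0, 0, 0, 1]
new_texts, new_labels = oversample_minority(toy_texts, toy_labels)
print(Counter(new_labels))
```

The test split is left untouched, so the reported metrics still reflect the original class imbalance. Note that the fastText cell further down builds its training file from `dataset['train']`, so it does not see the oversampled data.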

In [203]:
pd.DataFrame(random.sample(list(zip(train_X,train_y)),20),columns=['text','pred'])

| | text | pred |
|---|---|---|
| 0 | @anonymized_account Ty już robiłeś swoje refor... | 1 |
| 1 | @anonymized_account @anonymized_account Szacun... | 1 |
| 2 | Słyszę, że Pierwsza Milcząca będzie czytała dz... | 1 |
| 3 | @anonymized_account mam pytanie.Za te kłopoty ... | 0 |
| 4 | @anonymized_account @anonymized_account Łżesz ... | 1 |
| 5 | może wypuścimy szczepiaki zamiast słodziaków a... | 0 |
| 6 | @anonymized_account Tak jak pisdzielstwo ośmie... | 1 |
| 7 | @anonymized_account @anonymized_account Widać ... | 1 |
| 8 | @anonymized_account oj nie wiedział co odpisać... | 1 |
| 9 | @anonymized_account @anonymized_account A niby... | 0 |


2. Train the following classifiers on the training sets (for task 1 and task 2):

   1. Bayesian classifier with TF * IDF weighting.
   2. Fasttext text classifier.
   3. Transformer classifier (take into account that a number of experiments should be performed for this model).

### Bayesian classifier with TF * IDF weighting

In [172]:

def tfidf(train_X, test_X):
    # The fitted vectorizer is kept as a global because sample_tf and
    # lime_explain reuse it.
    global tf_idf
    tf_idf = TfidfVectorizer()

    # Fit the vocabulary and IDF weights on the training texts only.
    X_train_tf = tf_idf.fit_transform(train_X)
    X_test_tf = tf_idf.transform(test_X)

    # train_y is read from the notebook's global scope.
    naive_bayes_classifier = MultinomialNB()
    naive_bayes_classifier.fit(X_train_tf, train_y)

    y_pred = naive_bayes_classifier.predict(X_test_tf)
    return [y_pred, naive_bayes_classifier]
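To make the TF * IDF weighting concrete, the following sketch computes the weights by hand. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation); the function names and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter

# Same token pattern as scikit-learn's default: words of two or more characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

# Hypothetical toy corpus.
docs = ["to jest ok", "to jest zle", "ok ok ok"]
vecs = tfidf_vectors(docs)
for doc, vec in zip(docs, vecs):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the second document the rare term `zle` receives a larger weight than `to`, which occurs in two documents. Multinomial naive Bayes then treats these weights as fractional counts.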



In [173]:
class_names = ["non-harmful", "harmful"]

def sample_tf(naive_bayes_classifier, test_X, test_y):
    # Harmful test examples at positions 1..9 (the slice skips the first one).
    testing_X = [x for (x, y) in zip(test_X, test_y) if y == 1][1:10]
    testing_y = [y for (x, y) in zip(test_X, test_y) if y == 1][1:10]

    # tf_idf is the fitted vectorizer from the notebook's global scope.
    test_input = tf_idf.transform(testing_X)
    predictions = naive_bayes_classifier.predict(test_input)
    for X, y, pred in zip(testing_X, testing_y, predictions):
        print(f'true={class_names[y]}, predicted={class_names[pred]}, {X}')



In [174]:
# Local Interpretable Model-agnostic Explanations (LIME)


def lime_explain(idx):
    class_names = ["non-harmful","harmful"]


    c = make_pipeline(tf_idf, naive_bayes_classifier)

#     print(c.predict_proba([test_X[0]]))

    explainer = LimeTextExplainer(class_names=class_names)

    exp = explainer.explain_instance(test_X[idx], c.predict_proba, num_features=6)
#     print('Document id: %d' % idx)
    print('Probability(harmful) =', c.predict_proba([test_X[idx]])[0,1])
#     print(f'True class: {class_names[test_y[idx]]} Predicted class: {class_names[y_pred[idx]]}')

    display(pd.DataFrame(exp.as_list(),columns=['word','value']))
lime_explain(0)

Probability(harmful) = 0.23346132796823493


| | word | value |
|---|---|---|
| 0 | ok | -0.232064 |
| 1 | Duda | 0.121119 |
| 2 | pięć | -0.083097 |
| 3 | Spoko | -0.059688 |
| 4 | im | 0.058557 |
| 5 | Morawieckim | -0.034639 |


In [175]:
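def lime_explain_confusion_matrix(test_y, y_pred):
    cm = ConfusionMatrix(actual_vector=test_y, predict_vector=y_pred)
    # position()[0] lists the indices relative to class 0, so 'TP' here means
    # a correctly predicted non-harmful example and 'TN' a correctly predicted
    # harmful one.
    conf_indices = cm.position()[0]

    states = ['TP', 'FP', 'FN', 'TN']
    for state in states:
        if len(conf_indices[state]) == 0:
            print(f'{state} no samples')
            continue
        # Explain the first example found for each state.
        index = conf_indices[state][0]
        print(f'{state}, True={class_names[test_y[index]]}, Pred={class_names[y_pred[index]]}, {test_X[index]}')
        lime_explain(index)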
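The helper above reads `cm.position()[0]`, so its TP/FP/FN/TN labels are relative to the non-harmful class. If the harmful class (label 1) should be the positive one, the selection could be sketched without pycm as follows. The function `first_confusion_examples` and the toy vectors are hypothetical.

```python
def first_confusion_examples(y_true, y_pred, positive=1):
    """Map 'TP'/'FP'/'FN'/'TN' to the index of the first matching example (or None)."""
    found = {"TP": None, "FP": None, "FN": None, "TN": None}
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        if p == positive:
            state = "TP" if t == positive else "FP"
        else:
            state = "FN" if t == positive else "TN"
        # Keep only the first index seen for each state.
        if found[state] is None:
            found[state] = i
    return found

# Hypothetical toy labels and predictions.
toy_true = [0, 1, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 1]
examples = first_confusion_examples(toy_true, toy_pred)
print(examples)
```

Each returned index could then be passed to `lime_explain`.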
        


In [177]:
[y_pred,naive_bayes_classifier] = tfidf(train_X,test_X)

sample_tf(naive_bayes_classifier,test_X,test_y)



true=harmful, predicted=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
true=harmful, predicted=harmful, @anonymized_account Adrian Juda, figurant WSI i  lobby żydowskiego
true=harmful, predicted=non-harmful, @anonymized_account Widać ludziom w Sączu tak pasuje. Ja bym mu w gębę napluł
true=harmful, predicted=non-harmful, @anonymized_account powinnaś odpowiedzieć za działanie na szkodę Polski i obywateli
true=harmful, predicted=non-harmful, RT @anonymized_account @anonymized_account powinnaś odpowiedzieć za działanie na szkodę Polski i obywateli
true=harmful, predicted=non-harmful, @anonymized_account @anonymized_account @anonymized_account Najbardziej to on jest wolny od mózgu.
true=harmful, predicted=harmful, @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytaryzm!
true=harmful, predicted=harmful, RT @anonymized_account @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytar

In [179]:
print_metrics(test_y,y_pred)



f1_score=0.4464285714285714
f1_score micro=0.8140000000000001
f1_score macro=0.6673248626373627
MCC=0.3504588851261634

              precision    recall  f1-score   support

    Positive       0.93      0.85      0.89       866
    Negative       0.37      0.56      0.45       134

    accuracy                           0.81      1000
   macro avg       0.65      0.71      0.67      1000
weighted avg       0.85      0.81      0.83      1000

Confusion matrix
[[739 127]
 [ 59  75]]
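As a check, the scores printed above can be recomputed from the confusion matrix alone. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predictions) and treats label 1 (harmful) as the positive class.

```python
import math

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = [[739, 127],
      [59, 75]]
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_harmful = 2 * precision * recall / (precision + recall)
f1_non_harmful = 2 * tn / (2 * tn + fn + fp)
# For single-label classification, micro F1 equals accuracy.
accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_f1 = (f1_harmful + f1_non_harmful) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"F1(harmful)={f1_harmful:.4f} micro={accuracy:.4f} macro={macro_f1:.4f} MCC={mcc:.4f}")
```

The gap between the micro F1 (0.81) and the F1 of the harmful class (0.45) shows how much the majority class dominates the micro average.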


In [180]:
lime_explain_confusion_matrix(test_y,y_pred)

TP, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.23346132796823493


| | word | value |
|---|---|---|
| 0 | ok | -0.230961 |
| 1 | Duda | 0.118478 |
| 2 | pięć | -0.082712 |
| 3 | Spoko | -0.060575 |
| 4 | im | 0.057571 |
| 5 | Morawieckim | -0.033458 |


FP, True=harmful, Pred=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
Probability(harmful) = 0.4202574617947113


| | word | value |
|---|---|---|
| 0 | pisdzielstwo | 0.192534 |
| 1 | większości | -0.144137 |
| 2 | systemu | 0.138809 |
| 3 | mając | -0.12551 |
| 4 | sądownictwa | -0.09918 |
| 5 | prawa | -0.02502 |


FN, True=non-harmful, Pred=harmful, Jajka na miękko czy na twardo? Jeśli jecie jajka ofc
Probability(harmful) = 0.7688665235325993


| | word | value |
|---|---|---|
| 0 | twardo | 0.17531 |
| 1 | Jajka | 0.12451 |
| 2 | jajka | 0.119602 |
| 3 | Jeśli | -0.102823 |
| 4 | ofc | -0.075717 |
| 5 | czy | -0.030908 |


TN, True=harmful, Pred=harmful, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.6308802393175929


| | word | value |
|---|---|---|
| 0 | szmaty | 0.217807 |
| 1 | komentuje | -0.09563 |
| 2 | się | -0.01191 |
| 3 | anonymized_account | 0.006923 |
| 4 | Tej | -0.00474 |
| 5 | nie | -0.002315 |


### Fasttext classification

In [204]:

def convert_to_fasttext_format(name):
    df = pd.DataFrame(dataset[name])
    df = df[['label','text']]
    df.label = df.label.apply(lambda x: f'__label__{x}')
    return df

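def tofile(df, filename):
    # fastText expects one example per line: "__label__<n> <text>".
    # Whitespace inside the text (including newlines) is collapsed to single spaces.
    with open(filename, 'w', encoding='utf-8') as f:
        for label, text in zip(df.label, df.text):
            f.write(f"{label} {' '.join(text.split())}\n")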
    

def fast_text():
    df = convert_to_fasttext_format('train')
    tofile(df,'train.txt')

    df = convert_to_fasttext_format('test')
    tofile(df,'test.txt')
    
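    # Trains on the original (not oversampled) training split written above.
    model = fasttext.train_supervised('train.txt', epoch=40)
    # The predicted label looks like '__label__1'; its last character is the class id.
    y_pred = [int(model.predict(x)[0][0][-1]) for x in test_X]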
    return [y_pred,model]
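The file format and the label parsing used above can be illustrated without fastText itself. In this sketch `to_fasttext_line` and `parse_label` are hypothetical helpers; the only fastText fact relied on is its default `__label__` prefix.

```python
LABEL_PREFIX = "__label__"

def to_fasttext_line(label, text):
    """Format one training example; fastText reads one example per line."""
    # Collapse newlines and repeated spaces so the example stays on one line.
    clean = " ".join(text.split())
    return f"{LABEL_PREFIX}{label} {clean}"

def parse_label(fasttext_label):
    """Turn a predicted label such as '__label__1' back into an integer class id."""
    return int(fasttext_label[len(LABEL_PREFIX):])

line = to_fasttext_line(1, "Ala\nma  kota")
print(line)
print(parse_label("__label__12"))
```

Stripping the prefix instead of taking the last character also works for class ids with more than one digit.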

In [205]:
[y_pred,model] = fast_text()

Read 0M words
Number of words:  31486
Number of labels: 2
Progress: 100.0% words/sec/thread: 1175004 lr: -0.000001 avg.loss:  0.032007 ETA:   0h 0m 0s


In [206]:
def sample_ft(model,test_X,test_y,number_samples_to_predict = 5):
    for label in (0, 1):
        testing_X = [x for (x,y)in zip(test_X,test_y) if y==label][1:number_samples_to_predict]
        testing_y = [y for (x,y)in zip(test_X,test_y) if y==label][1:number_samples_to_predict]
        for X,y in zip(testing_X,testing_y):
            predicted = int(model.predict(X)[0][0][-1])
            print(f'true={class_names[y]}, predicted={class_names[predicted]}, {X}')
        print()

sample_ft(model,test_X,test_y,5)

true=non-harmful, predicted=non-harmful, @anonymized_account @anonymized_account Ale on tu nie miał szans jej zagrania, a ta 'proba' to czysta prowizorka.
true=non-harmful, predicted=non-harmful, @anonymized_account No czy Prezes nie miał racji, mówiąc,ze to są zdradzieckie mordy? No czy nie miał racji?😁😁
true=non-harmful, predicted=non-harmful, @anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂
true=non-harmful, predicted=non-harmful, @anonymized_account @anonymized_account Owszem podatki tak. Ale nie w takich okolicznościach. Czemu Małysza odpalili z teamu Orlen?

true=harmful, predicted=non-harmful, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
true=harmful, predicted=non-harmful, @anonymized_account Adrian Juda, figurant WSI i  lobby żydowskiego
true=harmful, predicted=non-harmful, @anonymized_account Widać ludziom w Sączu tak pasuje. Ja bym mu w gębę napluł
true=harmful, predicted=non-h

In [207]:
print_metrics(test_y,y_pred)

f1_score=0.1875
f1_score micro=0.87
f1_score macro=0.5584239130434783
MCC=0.21243406452447067

              precision    recall  f1-score   support

    Positive       0.88      0.99      0.93       866
    Negative       0.58      0.11      0.19       134

    accuracy                           0.87      1000
   macro avg       0.73      0.55      0.56      1000
weighted avg       0.84      0.87      0.83      1000

Confusion matrix
[[855  11]
 [119  15]]


In [208]:
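# lime_explain wraps the TF-IDF + naive Bayes pipeline, so the explanations
# below describe that model; only the example selection uses the fastText predictions.
lime_explain_confusion_matrix(test_y,y_pred)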

TP, True=non-harmful, Pred=non-harmful, @anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.
Probability(harmful) = 0.23346132796823493


| | word | value |
|---|---|---|
| 0 | ok | -0.232028 |
| 1 | Duda | 0.1172 |
| 2 | pięć | -0.08213 |
| 3 | Spoko | -0.059579 |
| 4 | im | 0.058608 |
| 5 | Morawieckim | -0.031165 |


FP, True=harmful, Pred=non-harmful, @anonymized_account Tej szmaty się nie komentuje
Probability(harmful) = 0.6308802393175929


| | word | value |
|---|---|---|
| 0 | szmaty | 0.217141 |
| 1 | komentuje | -0.095687 |
| 2 | się | -0.011215 |
| 3 | anonymized_account | 0.006621 |
| 4 | Tej | -0.005036 |
| 5 | nie | -0.001894 |


FN, True=non-harmful, Pred=harmful, @anonymized_account Droga p.Kamilko! Leczyć się . Leczyć póki czas😁😁
Probability(harmful) = 0.8708997771641055


| | word | value |
|---|---|---|
| 0 | Leczyć | 0.389602 |
| 1 | Droga | -0.114598 |
| 2 | póki | 0.041016 |
| 3 | czas | 0.026613 |
| 4 | się | -0.008075 |
| 5 | anonymized_account | 0.003217 |


TN, True=harmful, Pred=harmful, @anonymized_account Dokładnie wie co mówi. A Ty pajacu poczytaj ustawę domsie dowiesz kto decyduje o wysokości zarobków w samorządach.
Probability(harmful) = 0.4766271084155878


| | word | value |
|---|---|---|
| 0 | pajacu | 0.183502 |
| 1 | poczytaj | -0.141596 |
| 2 | decyduje | -0.097367 |
| 3 | ustawę | 0.088714 |
| 4 | dowiesz | -0.084682 |
| 5 | Ty | 0.068466 |


### Transformers

   
   
3. Compare the results of classification on the test set. Select the appropriate measures (from accuracy, F1, macro/micro F1, MCC) to compare the results.

4. Select 1 TP, 1 TN, 1 FP and 1 FN from your predictions (for the best classifier) and compare the decisions of each
   classifier on these examples using [LIME](https://github.com/marcotcr/lime).
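For step 3, one way to judge which measure is informative is to compare each classifier with a trivial baseline that always predicts the majority class. The sketch below uses the task 1 test-set counts and the (rounded) scores printed earlier in this notebook; the dictionary layout is only an illustration.

```python
# Task 1 test-set support, taken from the classification reports in this notebook.
n_non_harmful, n_harmful = 866, 134
total = n_non_harmful + n_harmful

# A classifier that always answers "non-harmful" already reaches this accuracy.
baseline_accuracy = n_non_harmful / total

# Scores copied (rounded) from the print_metrics outputs in this notebook.
reported = {
    "naive Bayes": {"accuracy": 0.814, "f1_harmful": 0.446, "mcc": 0.350},
    "fastText": {"accuracy": 0.870, "f1_harmful": 0.188, "mcc": 0.212},
}

for name, scores in reported.items():
    gain = scores["accuracy"] - baseline_accuracy
    print(f"{name}: accuracy gain over baseline = {gain:+.3f}, "
          f"F1(harmful) = {scores['f1_harmful']:.3f}, MCC = {scores['mcc']:.3f}")
```

fastText wins on accuracy but is less than half a percentage point above the baseline, while naive Bayes is below the baseline on accuracy yet clearly better on the F1 of the harmful class and on MCC. This suggests that F1 for the harmful class, macro F1, or MCC are better suited to this comparison than accuracy or micro F1.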

In [30]:
from lime import lime_text
from sklearn.pipeline import make_pipeline
# Pipeline from raw text to class probabilities, as required by LIME.
c = make_pipeline(tf_idf, naive_bayes_classifier)

5. Answer the following questions:


1. Which of the classifiers works best for task 1 and for task 2?
1. Did you achieve results comparable with the results of [PolEval Task](http://2019.poleval.pl/index.php/results/)?
1. Did you achieve results comparable with the [Klej leaderboard](https://klejbenchmark.com/leaderboard/)?
1. Describe strengths and weaknesses of each of the compared algorithms.
1. Do you think comparison of raw performance values on a single task is enough to assess the value of a given
  algorithm/model?
1. Did LIME show that the models use valuable features/words when making their decisions?