# Text classification

The task concentrates on content-based text classification.



1. Get acquainted with the data of the [Polish Cyberbullying detection dataset](https://huggingface.co/datasets/poleval2019_cyberbullying). 
   Pay special attention to the distribution of the positive and negative examples in the first task as well as
   distribution of the classes in the second task.


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import *
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import csv
# from imblearn import fit_resample
from imblearn.over_sampling import RandomOverSampler
import numpy as np
import random
from datasets import list_datasets, load_dataset, list_metrics, load_metric


### Utils

In [60]:
def print_metrics(test_y,y_pred):
    print(f'f1_score={metrics.f1_score(test_y,y_pred)}')
    val = metrics.f1_score(test_y,y_pred,average='micro')
    print(f'f1_score micro={val}')
    val =metrics.f1_score(test_y,y_pred,average='macro')
    print(f'f1_score macro={val}')
    print(f'MCC={metrics.matthews_corrcoef(test_y, y_pred)}')

In [2]:
class Statistics():
    def __init__(self):
        pass
    
    def calc_metrics(algorithm,test_y,y_pred):
        self.f1_score =  metrics.f1_score(test_y,y_pred)
        self.f1_score_micro =  metrics.f1_score(test_y,y_pred,average='micro')
        self.f1_score_macro = metrics.f1_score(test_y,y_pred,average='macro')
        self.mcc = metrics.matthews_corrcoef(test_y, y_pred)
        
        self.confusion_matrix = metrics.confusion_matrix(test_y,y_pred)
        

        
statistics = {"bayesian" : Statistics(),"fasttext" :Statistics(),"transformer" : Statistics()}



In [3]:

# print(list_datasets())

dataset = load_dataset('poleval2019_cyberbullying','task01')
print(dataset['test'][3])
print(dataset['train'][3])

train_X = dataset['train']['text']
train_y = dataset['train']['label']

test_X = dataset['test']['text']
test_y = dataset['test']['label']

oversample = RandomOverSampler(sampling_strategy='minority')

train_X, train_y = oversample.fit_resample(np.array(train_X).reshape(-1,1), np.array(train_y))
train_X = train_X.reshape(-1).tolist()
train_y=train_y.reshape(-1).tolist()


Reusing dataset poleval2019_cyber_bullying (/home/x/.cache/huggingface/datasets/poleval2019_cyber_bullying/task01/1.0.0/ce6060c56dae43c469bab309a7573b86299b0bcc2484e85cfe0ae70b5f770450)


  0%|          | 0/2 [00:00<?, ?it/s]

{'text': '@anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂', 'label': 0}
{'text': '@anonymized_account @anonymized_account Musi. Innej drogi nie mamy.', 'label': 0}


In [4]:
pd.DataFrame(random.sample(list(zip(train_X,train_y)),20))

Unnamed: 0,0,1
0,@anonymized_account Powinni im naziole porządn...,1
1,@anonymized_account @anonymized_account Normal...,1
2,@anonymized_account @anonymized_account @anony...,0
3,"@anonymized_account \""A nie mówiłem?\"" 😂 No mó...",0
4,@anonymized_account @anonymized_account @anony...,0
5,@anonymized_account @anonymized_account no i ż...,0
6,@anonymized_account Takie malutkie pytanko: il...,0
7,@anonymized_account Tak to jest z pomnikami ob...,1
8,RT @anonymized_account Zamieszanie wokół Tomas...,0
9,Co ty na zjebie bez godności @anonymized_account,0


2. Train the following classifiers on the training sets (for the task 1 and the task 2)


    i. Bayesian classifier with TF * IDF weighting.
    ii. Fasttext text classifier
    iii. Transformer classifier (take into account that a number of experiments should be performed for this model).

### Bayesian classifier with TF * IDF weighting

In [5]:
test_X

['@anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.',
 "@anonymized_account @anonymized_account Ale on tu nie miał szans jej zagrania, a ta 'proba' to czysta prowizorka.",
 '@anonymized_account No czy Prezes nie miał racji, mówiąc,ze to są zdradzieckie mordy? No czy nie miał racji?😁😁',
 '@anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂',
 '@anonymized_account @anonymized_account Owszem podatki tak. Ale nie w takich okolicznościach. Czemu Małysza odpalili z teamu Orlen?',
 '@anonymized_account @anonymized_account skąd wiesz jaki Skendija ma budżet skoro mówisz że jest bogatsza ? Tylko dwóch zawodników ponoć dobrze zarabia.',
 'Z tego, co widzę, to kibice Widzewa mają szczęście, że trwa mundial. Dzięki temu ogólnopolska szydera jest tylko z Argentyny i Messiego.',
 '@anonymized_account @anonymized_account @anonymized_account Na utrzymanie własnej armii 2% PKB, tyle że teraz to jedna wielka ściema',
 'Przypomnijc

In [6]:

tf_idf = TfidfVectorizer()


X_train_tf = tf_idf.fit_transform(train_X)
X_train_tf = tf_idf.transform(train_X)
X_test_tf = tf_idf.transform(test_X)

print("n_samples: %d, n_features: %d" % X_train_tf.shape)
print(X_test_tf.shape)

naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tf,train_y)

y_pred = naive_bayes_classifier.predict(X_test_tf)
print(len(y_pred),len(test_y))

print(metrics.classification_report(test_y,y_pred,target_names=['Positive','Negative']))

# vectorizer = TfidfVectorizer()
# X = vectorizer.fit_transform(dataset['train'])
# vectorizer.get_feature_names_out()

print("Confusion matrix")
print(metrics.confusion_matrix(test_y,y_pred))

n_samples: 18380, n_features: 22872
(1000, 22872)
1000 1000
              precision    recall  f1-score   support

    Positive       0.93      0.85      0.89       866
    Negative       0.37      0.56      0.45       134

    accuracy                           0.81      1000
   macro avg       0.65      0.71      0.67      1000
weighted avg       0.85      0.81      0.83      1000

Confusion matrix
[[739 127]
 [ 59  75]]


In [7]:
metrics.confusion_matrix(test_y,y_pred)

array([[739, 127],
       [ 59,  75]])

In [8]:
print(len(train_y))

18380


In [31]:
# Sformułowań użyto jedynie do celów naukowych i nie odzwierciedlają one poglądów studenta
testing_X = [x for (x,y)in zip(test_X,test_y) if y==1][1:10]
testing_y = [y for (x,y)in zip(test_X,test_y) if y==1][1:10]
testing_X.append("@anonymized_account A na drzewach zamiast liści będą wisieć syjoniści")
testing_y.append(1)
test_input = tf_idf.transform(testing_X)
result = naive_bayes_classifier.predict(test_input)
# print(list(zip(testing_X,testing_y,result)))
for X,y, result in zip(testing_X,testing_y,result):
    print(f'true={y}, predicted={result}, {X}')


true=1, predicted=0, @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
true=1, predicted=1, @anonymized_account Adrian Juda, figurant WSI i  lobby żydowskiego
true=1, predicted=0, @anonymized_account Widać ludziom w Sączu tak pasuje. Ja bym mu w gębę napluł
true=1, predicted=0, @anonymized_account powinnaś odpowiedzieć za działanie na szkodę Polski i obywateli
true=1, predicted=0, RT @anonymized_account @anonymized_account powinnaś odpowiedzieć za działanie na szkodę Polski i obywateli
true=1, predicted=0, @anonymized_account @anonymized_account @anonymized_account Najbardziej to on jest wolny od mózgu.
true=1, predicted=1, @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytaryzm!
true=1, predicted=1, RT @anonymized_account @anonymized_account   Półgłówek Wieliński, wymyślił sobie półautorytaryzm!
true=1, predicted=0, @anonymized_account @anonymized_account @anonymized_account Podstawowe zadanie każdego ksi

In [32]:
pd.DataFrame(train_X,train_y)

Unnamed: 0,0
0,Dla mnie faworytem do tytułu będzie Cracovia. ...
0,@anonymized_account @anonymized_account Brawo ...
0,"@anonymized_account @anonymized_account Super,..."
0,@anonymized_account @anonymized_account Musi. ...
0,"Odrzut natychmiastowy, kwaśna mina, mam problem"
...,...
1,@anonymized_account @anonymized_account Ty się...
1,@anonymized_account @anonymized_account ssa ws...
1,"Jedna rzecz chyba wam umknęła. To, że Gdańsk p..."
1,@anonymized_account @anonymized_account Przeci...


In [51]:
# local interpretable model-agnositc explenations LIME
from lime import lime_text
from sklearn.pipeline import make_pipeline
class_names = ["non-harmful","harmful"]


c = make_pipeline(tf_idf, naive_bayes_classifier)

print(c.predict_proba([test_X[0]]))

from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)

idx = 83
exp = explainer.explain_instance(test_X[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(harmful) =', c.predict_proba([test_X[idx]])[0,1])
print(f'True class: {class_names[test_y[idx]]} Predicted class: {class_names[y_pred[idx]]}')


exp.as_list()

[[0.76653867 0.23346133]]
Document id: 83
Probability(harmful) = 0.35218713530643686
True class: harmful Predicted class: non-harmful


[('księdza', 0.250010954097491),
 ('każdego', -0.11633003752658998),
 ('zadanie', -0.10580957457613267),
 ('wiary', -0.09645056634966379),
 ('Podstawowe', -0.06596245609500014),
 ('anonymized_account', 0.014665608318354678)]

array([0])

0

### Fasttext classification

In [52]:

def convert_to_fasttext_format(name):
    df = pd.DataFrame(dataset[name])
    df = df[['label','text']]
    df.label = df.label.apply(lambda x: f'__label__{x}')
    return df

def tofile(df,filename):
    df.to_csv(filename, 
      index = False, 
      sep = ' ',
      header = None, 
      quoting = csv.QUOTE_NONE, 
      quotechar = "", 
      escapechar = " ")
    
df = convert_to_fasttext_format('train')
tofile(df,'train.txt')

df = convert_to_fasttext_format('test')
tofile(df,'train.txt')


In [53]:
df

Unnamed: 0,label,text
0,__label__0,"@anonymized_account Spoko, jak im Duda z Moraw..."
1,__label__0,@anonymized_account @anonymized_account Ale on...
2,__label__0,@anonymized_account No czy Prezes nie miał rac...
3,__label__0,@anonymized_account @anonymized_account Przeci...
4,__label__0,@anonymized_account @anonymized_account Owszem...
...,...,...
995,__label__0,"@anonymized_account Olej jak kto sie ubiera, p..."
996,__label__0,@anonymized_account to oczywiste byłyście dziś...
997,__label__0,@anonymized_account Duda może się przyjąć w bi...
998,__label__1,"@anonymized_account Ty jesteś jebnięty, tła ta..."


In [54]:
# Fasttext text classifier
import fasttext
# help(fasttext.FastText)

model = fasttext.train_supervised('train.txt',epoch=40)
                 


Read 0M words
Number of words:  4977
Number of labels: 2
Progress: 100.0% words/sec/thread:  810270 lr:  0.000000 avg.loss:  0.087467 ETA:   0h 0m 0s


In [55]:
number_samples_to_predict = 5
for label in {0,1}:
    testing_X = [x for (x,y)in zip(test_X,test_y) if y==label][1:number_samples_to_predict]
    testing_y = [y for (x,y)in zip(test_X,test_y) if y==label][1:number_samples_to_predict]
    for X,y in zip(testing_X,testing_y):
        predicted = model.predict(X)
        print(f'true={y}, predicted={predicted}, {X}')
    print()


true=0, predicted=(('__label__0',), array([0.99514216])), @anonymized_account @anonymized_account Ale on tu nie miał szans jej zagrania, a ta 'proba' to czysta prowizorka.
true=0, predicted=(('__label__0',), array([0.99910498])), @anonymized_account No czy Prezes nie miał racji, mówiąc,ze to są zdradzieckie mordy? No czy nie miał racji?😁😁
true=0, predicted=(('__label__0',), array([0.98963141])), @anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂
true=0, predicted=(('__label__0',), array([0.9939937])), @anonymized_account @anonymized_account Owszem podatki tak. Ale nie w takich okolicznościach. Czemu Małysza odpalili z teamu Orlen?

true=1, predicted=(('__label__1',), array([0.96780187])), @anonymized_account Dokładnie, pisdzielstwo nie ma prawa rozpierdalać systemu,  sądownictwa nie mając większości
true=1, predicted=(('__label__1',), array([0.98551434])), @anonymized_account Adrian Juda, figurant WSI i  lobby żydowskiego
true=1, predicted=(('__label__1',),

In [56]:
y_pred = [ int(model.predict(x)[0][0][-1]) for x,y in zip(test_X,test_y)]


In [57]:
print(metrics.confusion_matrix(test_y,y_pred))

[[866   0]
 [  0 134]]


In [61]:
print_metrics(test_y,y_pred)

f1_score=1.0
f1_score micro=1.0
f1_score macro=1.0
MCC=1.0


In [62]:
# local interpretable model-agnositc explenations LIME
from lime import lime_text
from sklearn.pipeline import make_pipeline
class_names = ["non-harmful","harmful"]


c = make_pipeline(tf_idf, naive_bayes_classifier)

print(c.predict_proba([test_X[0]]))

from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)

idx = 83
exp = explainer.explain_instance(test_X[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(harmful) =', c.predict_proba([test_X[idx]])[0,1])
print(f'True class: {class_names[test_y[idx]]} Predicted class: {class_names[y_pred[idx]]}')


exp.as_list()

[[0.76653867 0.23346133]]
Document id: 83
Probability(harmful) = 0.35218713530643686
True class: harmful Predicted class: harmful


[('księdza', 0.25161172187087766),
 ('każdego', -0.11623569829815956),
 ('zadanie', -0.10525727406327469),
 ('wiary', -0.09861862445737687),
 ('Podstawowe', -0.064826415457611),
 ('anonymized_account', 0.012831066931339263)]

### Transformers

   
   
3. Compare the results of classification on the test set. Select the appropriate measures (from accuracy, F1, macro/micro F1, MCC) to compare the results.

4. Select 1 TP, 1 TN, 1 FP and 1 FN from your predictions (for the best classifier) and compare the decisions of each
   classifier on these examples using [LIME](https://github.com/marcotcr/lime).

In [30]:
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)


NameError: name 'vectorizer' is not defined

5. Answer the following questions:


1. Which of the classifiers works the best for the task 1 and the task 2.
1. Did you achieve results comparable with the results of [PolEval Task](http://2019.poleval.pl/index.php/results/)?
1. Did you achieve results comparable with the [Klej leaderboard](https://klejbenchmark.com/leaderboard/)?
1. Describe strengths and weaknesses of each of the compared algorithms.
1. Do you think comparison of raw performance values on a single task is enough to assess the value of a given
  algorithm/model?
1. Did SHAP show that the models use valuable features/words when performing their decision?