<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#TFIDF" data-toc-modified-id="TFIDF-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>TFIDF</a></span></li><li><span><a href="#BERT" data-toc-modified-id="BERT-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>BERT</a></span></li><li><span><a href="#TRYING-to-fine-tune-BERT-model" data-toc-modified-id="TRYING-to-fine-tune-BERT-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>TRYING to fine tune BERT model</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Project for «Wikishop» with BERT & TFIDF

The online store «Wikishop» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

## Preparation

In [1]:
# Standard library
import os
import re
import warnings

# Third-party
import nltk
import numpy as np
import pandas as pd
import torch
import transformers
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from tqdm import notebook
from tqdm.autonotebook import tqdm
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)


In [3]:
# Candidate locations of the dataset (local machine, course server, Google Drive)
paths = [
    'C:/datasc/pracfiles/toxic_comments.csv',
    '/datasets/toxic_comments.csv',
    '/content/drive/My Drive/toxic_comments.csv',
]
for path in paths:
    if os.path.exists(path):
        df = pd.read_csv(path)
        break
else:
    # Fail early instead of hitting a NameError on df below
    raise FileNotFoundError('toxic_comments.csv not found in any known location')

df.head()

   Unnamed: 0                                               text  toxic
0           0  Explanation\nWhy the edits made under my usern...      0
1           1  D'aww! He matches this background colour I'm s...      0
2           2  Hey man, I'm really not trying to edit war. It...      0
3           3  "\nMore\nI can't make any real suggestions on ...      0
4           4  You, sir, are my hero. Any chance you remember...      0


In [4]:
df = df.drop(['Unnamed: 0'], axis=1)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [6]:
df['toxic'].mean()

0.10161213369158527

## TFIDF

In [7]:
%%time
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """Convert a Penn Treebank tag to a WordNet POS tag (noun by default)."""
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    return morphy_tag.get(penntag[:2], 'n')

def lemmatize_sent(text):
    """Keep Latin letters only, lowercase, and lemmatize each token by its POS."""
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    text = text.lower()

    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)

    lemmatized_tokens = [wnl.lemmatize(word, pos=penn2morphy(tag)) for word, tag in tagged_tokens]

    return ' '.join(lemmatized_tokens)


# Quick sanity check on two sample sentences
print(lemmatize_sent("The striped bats are hanging on their feet for best"))
lemmatize_sent('He is walking to school')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


the striped bat be hang on their foot for best
CPU times: user 1.46 s, sys: 83.8 ms, total: 1.55 s
Wall time: 2.14 s


'he be walk to school'

In [8]:
tqdm.pandas()
df['cleaned_comment'] = df['text'].progress_apply(lemmatize_sent)


In [9]:
features = df['cleaned_comment']
target = df['toxic']

# shuffle=False keeps the original row order, so no random_state is needed:
# the last 10% of rows become the test set
features_train_val, features_test, target_train_val, target_test = train_test_split(
    features, target, test_size=0.1, shuffle=False)

# The last 10% of the remaining rows become the validation set
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_val, target_train_val, test_size=0.1, shuffle=False)
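
With `shuffle=False` the two calls above amount to slicing off the tail of the data twice. A minimal pure-Python sketch of that idea follows. The helper `ordered_split` is hypothetical, and rounding the tail size up is an assumption about how fractional sizes are resolved.

```python
import math


def ordered_split(items, test_size):
    """Split a sequence into head and tail without shuffling."""
    n_test = math.ceil(len(items) * test_size)  # assumed rounding: tail rounded up
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]


rows = list(range(20))                        # stand-in for the ordered dataset
train_val, test = ordered_split(rows, 0.1)    # last rows -> test
train, valid = ordered_split(train_val, 0.1)  # last remaining rows -> validation
print(len(train), len(valid), len(test))
```

Because the order is preserved, the class balance of each part depends on how the rows happen to be ordered in the file.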


In [10]:
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

# Fit the vocabulary and IDF weights on the training split only,
# then reuse them for the validation split
vectorizer = TfidfVectorizer(stop_words=stop_words)
features_train = vectorizer.fit_transform(features_train)
features_valid = vectorizer.transform(features_valid)


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
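
The vectorizer is fitted on the training split and only applied to the validation split, so validation words never influence the vocabulary or the IDF weights. The NumPy sketch below illustrates that on an invented toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalisation, which is what `TfidfVectorizer` is understood to use by default. All names here are hypothetical.

```python
import numpy as np

train_docs = ["you be great", "you be awful", "great edit"]
valid_docs = ["awful awful edit", "unknown word"]

# Vocabulary and document frequencies come from the training split only
vocab = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def count_matrix(docs):
    """Raw term counts; words outside the training vocabulary are ignored."""
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for word in doc.split():
            if word in index:
                counts[row, index[word]] += 1
    return counts


train_counts = count_matrix(train_docs)
doc_freq = (train_counts > 0).sum(axis=0)
idf = np.log((1 + len(train_docs)) / (1 + doc_freq)) + 1  # smoothed IDF


def tfidf(counts):
    """Weight counts by IDF and L2-normalise each non-empty row."""
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return np.divide(weighted, norms, out=np.zeros_like(weighted), where=norms > 0)


train_tfidf = tfidf(train_counts)
valid_tfidf = tfidf(count_matrix(valid_docs))
print(vocab)
print(valid_tfidf.round(3))
```

The second validation document contains only unseen words, so its row stays all zeros.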


In [11]:
# Search the regularisation strength C for a class-weighted logistic regression
f1score = 0
best_C = 0
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
for C in range(5, 15):
    model = LogisticRegression(class_weight='balanced', C=C)
    model.fit(features_train, target_train)
    y_pred = model.predict(features_valid)
    f1 = f1_score(target_valid, y_pred)
    # Keep the C with the best validation F1
    if f1 > f1score:
        best_C = C
        f1score = f1
print('BEST SCORE for BALANCED model', f1score)
print('BEST C', best_C)

BEST SCORE for BALANCED model 0.759167492566898
BEST C 14


In [12]:
# Same search without class weighting
f1score = 0
best_C = 0
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
for C in range(5, 15):
    model = LogisticRegression(C=C)
    model.fit(features_train, target_train)
    y_pred = model.predict(features_valid)
    f1 = f1_score(target_valid, y_pred)
    # Keep the C with the best validation F1
    if f1 > f1score:
        best_C = C
        f1score = f1
print('BEST F1 for NON BALANCED model', f1score)
print('BEST C', best_C)

BEST F1 for NON BALANCED model 0.7715415019762846
BEST C 12


In [13]:
best_model = None
best_result = 0
best_est = 0
best_depth = 0
# Grid over the number of trees and the maximum tree depth
for est in [50, 100, 200]:
    for depth in [5, 10, 15, 20]:
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        pred = model.predict(features_valid)
        f1 = f1_score(target_valid, pred)
        if f1 > best_result:
            best_model = model
            best_est = est
            best_depth = depth
            best_result = f1
print("Best validation F1:", best_result, "\nNumber of trees:", best_est, "Max depth:", best_depth)


Best validation F1: 0 
Number of trees: 0 Max depth: 0
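
An F1 of exactly 0 for every configuration means none of the forests produced a single true positive on the validation split, most likely because these shallow trees predict the majority class for every comment. The sketch below uses toy labels and a hypothetical helper `binary_f1` to show how that gives F1 = 0 while accuracy stays high.

```python
def binary_f1(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # roughly the 10% toxic share
y_majority = [0] * 10                    # model that never predicts "toxic"

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
print("accuracy:", accuracy)
print("F1:", binary_f1(y_true, y_majority))
```

Options such as `class_weight='balanced'` or deeper trees might help here, but they were not tried in this notebook.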


## BERT pre-trained on toxic comments

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [15]:
tokenizer = BertTokenizer.from_pretrained('unitary/toxic-bert')

# Encode each comment into token ids, truncated to at most 60 tokens
tokenized = df['text'].progress_apply(
    lambda x: tokenizer.encode(x, truncation=True, max_length=60, add_special_tokens=True))

# Right-pad every sequence with 0 (BERT's [PAD] id) up to the longest length
max_len = max(len(ids) for ids in tokenized.values)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

# 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)

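
The padding and attention-mask step can be checked in isolation on a few made-up id sequences. This is a minimal NumPy sketch. The ids are invented, and `PAD_ID = 0` is assumed to match the `0` used for padding in the cell above.

```python
import numpy as np

PAD_ID = 0  # assumed padding id, matching the 0 used for padding above

token_ids = [
    [101, 7592, 102],
    [101, 2023, 2003, 1037, 102],
    [101, 102],
]  # invented id sequences of different lengths

# Right-pad every sequence to the longest one
max_len = max(len(ids) for ids in token_ids)
padded = np.array([ids + [PAD_ID] * (max_len - len(ids)) for ids in token_ids])

# 1 marks a real token, 0 marks padding the model should ignore
attention_mask = (padded != PAD_ID).astype(int)

print(padded)
print(attention_mask)
```

Each row of the mask sums to the original sequence length, which is a quick way to confirm that no real token was masked out.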

In [16]:
config = transformers.BertConfig.from_pretrained(
    'unitary/toxic-bert')
model = transformers.BertModel.from_pretrained(
    'unitary/toxic-bert', config=config)
model.to(device)



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [17]:
batch_size = 28  # 159292 = 28 * 5689, so no trailing partial batch is dropped
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)

    # Inference only: no gradients needed
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the [CLS] (first-token) vector of the last hidden state for each comment
    embeddings.append(batch_embeddings[0][:, 0, :].cpu().numpy())

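
The loop above uses integer division for the number of batches, which would silently drop a trailing partial batch. Here 159292 is an exact multiple of 28, so nothing is lost. For other dataset sizes, a variant like the sketch below keeps the remainder. It uses a random array as a stand-in for the model output, and the helper `iter_batches` is hypothetical.

```python
import math

import numpy as np


def iter_batches(array, batch_size):
    """Yield consecutive slices, including a final shorter one if needed."""
    n_batches = math.ceil(len(array) / batch_size)
    for i in range(n_batches):
        yield array[batch_size * i: batch_size * (i + 1)]


# Stand-in for the last hidden state: (n_samples, seq_len, hidden_size)
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(10, 6, 4))

# Take the first-token ([CLS]) vector of every sample, batch by batch
cls_vectors = np.concatenate(
    [batch[:, 0, :] for batch in iter_batches(hidden_states, batch_size=4)]
)
print(cls_vectors.shape)
```

With 10 samples and a batch size of 4, the last batch holds only 2 rows, yet all 10 vectors end up in the result.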

In [18]:
features = np.concatenate(embeddings)
target = df['toxic']

In [19]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, random_state=123, test_size=0.5
)

In [20]:
model = LogisticRegression(random_state=12345)
model.fit(features_train, target_train)

In [21]:
model.score(features_test, target_test)

0.9821585515907892

In [22]:
pred = model.predict(features_test)
f1 = f1_score(target_test, pred)
f1

0.909530782453683

## TRYING to fine tune BERT model

In [23]:
df = df.sample(frac=0.5, random_state=123)

In [24]:
NUM_LABELS = 2

id2label = {1: 'toxic', 0: 'not'}

label2id = {'toxic': 1, 'not': 0}

In [25]:
df['category'] = df['toxic'].map(id2label)

In [26]:
tokenizer = BertTokenizerFast.from_pretrained("google-bert/bert-base-uncased", model_max_length=512)

In [27]:
model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=NUM_LABELS, id2label=id2label, label2id=label2id)
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [28]:
features = list(df['text'])
target = list(df['toxic'])

features_train_val, features_test, target_train_val, target_test = train_test_split(
    features, target, test_size=0.1, random_state=12345, shuffle=False)

features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_val, target_train_val, test_size=0.5, random_state=12345, shuffle=False)

In [29]:
train_encodings = tokenizer(features_train, truncation=True, padding=True)
val_encodings  = tokenizer(features_valid, truncation=True, padding=True)
test_encodings = tokenizer(features_test, truncation=True, padding=True)

In [30]:
class DataLoader(Dataset):
    def __init__(self, encodings, labels):
        """
        Initializes the DataLoader class with encodings and labels.

        Args:
            encodings (dict): A dictionary containing tokenized input text data
                              (e.g., 'input_ids', 'token_type_ids', 'attention_mask').
            labels (list): A list of integer labels for the input text data.
        """
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        """
        Returns a dictionary containing tokenized data and the corresponding label for a given index.

        Args:
            idx (int): The index of the data item to retrieve.

        Returns:
            item (dict): A dictionary containing the tokenized data and the corresponding label.
        """
        # Retrieve tokenized data for the given index
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Add the label for the given index to the item dictionary
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        """
        Returns the number of data items in the dataset.

        Returns:
            (int): The number of data items in the dataset.
        """
        return len(self.labels)

In [31]:
train_dataloader = DataLoader(train_encodings, target_train)

val_dataloader = DataLoader(val_encodings, target_valid)

test_dataset = DataLoader(test_encodings, target_test)

In [32]:
def compute_metrics(pred):
    """
    Computes F1 for a given set of predictions.

    Args:
        pred (obj): An object containing label_ids and predictions attributes.
            - label_ids (array-like): A 1D array of true class labels.
            - predictions (array-like): A 2D array where each row represents
              an observation, and each column holds the model's score (logit)
              for one class.

    Returns:
        dict: A dictionary containing the following metric:
            - F1 (float): The binary F1 score for the positive ("toxic") class,
              i.e. the harmonic mean of its precision and recall.
    """
    # Extract true labels from the input object
    labels = pred.label_ids

    # Predicted class = column index with the highest score
    preds = pred.predictions.argmax(-1)

    # f1_score defaults to average='binary', scoring the positive class (label 1)
    f1 = f1_score(labels, preds)

    # Return the computed metric as a dictionary
    return {
        'F1': f1
    }
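
What `compute_metrics` does can be traced by hand on a tiny batch. The NumPy sketch below uses invented scores and labels and computes the binary F1 directly from the counts of true positives, false positives and false negatives, rather than calling `f1_score`.

```python
import numpy as np

logits = np.array([
    [2.1, -1.0],
    [0.2, 0.9],
    [-0.5, 1.5],
    [1.2, 0.3],
])  # invented scores: column 0 = "not", column 1 = "toxic"
labels = np.array([0, 1, 0, 0])

# Predicted class is the column with the highest score
preds = logits.argmax(axis=-1)

tp = int(((preds == 1) & (labels == 1)).sum())
fp = int(((preds == 1) & (labels == 0)).sum())
fn = int(((preds == 0) & (labels == 1)).sum())

# Binary F1 for the "toxic" class; defined as 0 when there are no true positives
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
print(preds, f1)
```

No softmax is needed before the `argmax`, since softmax preserves the ordering of the scores within a row.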

In [None]:
import accelerate
import transformers
training_args = TrainingArguments(
    output_dir='./myberttoxic',
    do_train=True,
    do_eval=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    logging_strategy='steps',
    logging_dir='./toxic-class-logs',
    logging_steps=50,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    fp16=True,
    load_best_model_at_end=True
)

In [44]:
print("Transformers version:", transformers.__version__)
print("Accelerate version:", accelerate.__version__)


Transformers version: 4.41.2
Accelerate version: 0.32.1


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataloader,
    eval_dataset=val_dataloader,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
# Evaluate the fine-tuned model on every split
splits = [train_dataloader, val_dataloader, test_dataset]
results = [trainer.evaluate(eval_dataset=split) for split in splits]

pd.DataFrame(results, index=["train", "val", "test"])

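## Conclusions

The BERT model from unitary, built specifically for toxicity detection, performs much better: logistic regression on its embeddings reaches F1 ≈ 0.91, against ≈ 0.77 for the best TF-IDF logistic regression on the validation split. The trade-off is that it takes noticeably longer to run.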