# Project for Wikishop with BERT

Online store "Wikishop" launches a new service. Now users can edit and supplement product descriptions, just like in wiki communities. That is, clients propose their edits and comment on the changes of others. The store needs a tool that will look for toxic comments and submit them for moderation.

Train the model to classify comments into positive and negative. At your disposal is a dataset with markup on the toxicity of edits.

Build a model with a quality metric *F1* of at least 0.75.

**Instructions for the implementation of the project**

1. Download and prepare data.
2. Train different models.
3. Draw conclusions.

It is not necessary to use *BERT* to run the project, but you can try.

**Data Description**

The data is in the `toxic_comments.csv` file. The *text* column contains the text of the comment, and *toxic* is the target attribute.

## Loading data

In [2]:
import numpy as np
import pandas as pd
import torch
import re

from transformers import DistilBertModel, DistilBertTokenizer

from tqdm import notebook
from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from catboost import CatBoostClassifier

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
try:
    df = pd.read_csv("toxic_comments.csv")
except:
    df = pd.read_csv('/datasets/toxic_comments.csv')

#function to check the data
def data_check(data):
    data.info()
    print()
    display(data.head())
    print()
    print('Duplicates:', data.duplicated().sum()) 
    print()
    print('Missing values')
    print(data.isna().mean())
    
data_check(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB



Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0



Дубликаты: 0

Пропуски
text     0.0
toxic    0.0
dtype: float64


### Conclusions:
* no duplicates or omissions
* texts in English
* need to remove regular expressions

In [4]:
#reduce dataset size
df = df[:5000]

In [None]:
#function to clean up regular expressions and stop words
stop_words = set(stopwords.words('english'))

def clear_text(text):
    text = word_tokenize(text)
    text = [word for word in text if not word.lower() in stopwords.words()]
    return " ".join(re.sub(r'[^a-zA-Z ]', ' ', ' '.join(text)).split())

tqdm.pandas()
df['text'] = df['text'].progress_apply(clear_text)

In [6]:
#tokenizer initialization
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

#converting text to token numbers from a dictionary
tokenized = df['text'].apply(lambda x: tokenizer.encode(x, 
                                                        add_special_tokens=True, truncation=True, max_length=512))

#padding
max_len = 512 #max number of tokens for Bert
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

#creating a mask of tokens
attention_mask = np.where(padded != 0, 1, 0)

#run on GPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

#model BERT
model_bert = DistilBertModel.from_pretrained("distilbert-base-uncased").to(device)

#creating embeddings
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device) 
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)
        
        with torch.no_grad():
            batch_embeddings = model_bert(batch, attention_mask=attention_mask_batch)
        
        embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())

#features and target       
features = np.concatenate(embeddings)
target = df['toxic'][:159500] #due to batch=100, some texts were not included in the features

#sampling
features_train, features_test, target_train, target_test = train_test_split(features, target, 
                                                                            test_size=0.2, random_state=12345)

#sample size check
print(len(features), len(target))
print(len(features_train), len(target_test))

HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=231508.0), HTML(value='')))




HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=442.0), HTML(value='')))




HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=267967963.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=50.0), HTML(value='')))


5000 5000
4000 1000


## Education

I will train two models: Logistic regression and Catboost.

### Logistic regression

In [7]:
model = make_pipeline(StandardScaler(), LogisticRegression(solver='saga', C=0.1))
rmse = cross_val_score(model, features_train, target_train, cv=3, scoring='f1')
rmse = np.mean(rmse)

model.fit(features_train, target_train)
rmse_test = f1_score(target_test, model.predict(features_test))
print('F1 на обучающей выборке:', rmse)
print('F1 на тестовой выборке:', rmse_test)

F1 на обучающей выборке: 0.6829376213030843
F1 на тестовой выборке: 0.7358490566037736


### CatBoost

In [9]:
model = CatBoostClassifier(task_type="GPU", verbose=False)
rmse = cross_val_score(model, features_train, target_train, cv=2, scoring='f1')
rmse = np.mean(rmse)

model.fit(features_train, target_train)
rmse_test = f1_score(target_test, model.predict(features_test))
print('F1 на обучающей выборке:', rmse)
print('F1 на тестовой выборке:', rmse_test)

F1 на обучающей выборке: 0.5905409782960804
F1 на тестовой выборке: 0.6739130434782609


## Conclusions

* models were trained to determine the toxicity of comments
* classification was implemented on embeddings using BERT
* using BERT is very resource intensive