
# Cleaned Toxic Comments with stacking

![](https://i.ibb.co/pjcBRMR/bbc87fcc-3bb9-422a-a925-60ae8f17b019.jpg)

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

The dataset has several target columns, but because of kernel limits we will work only with `toxic`.


**This is just a baseline for beginners. It also demonstrates downsampling and the stacking technique for classification. Thank you for reading.**

Note: Dataset contains toxic vocabulary

## Preprocessing

### Imports

In [None]:
%%capture
!pip install transformers;

In [None]:
%%capture
!pip install wordcloud;

In [None]:
%%capture
!pip install tqdm;

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.utils import shuffle


import torch
import transformers
from wordcloud import WordCloud


import warnings
import seaborn as sns
from tqdm import notebook
from tqdm import tqdm

sns.set_style('darkgrid')
nltk.download('punkt')
nltk.download('wordnet')
warnings.filterwarnings('ignore')
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))

np.random.seed(42)

### Downloading data and review

In [None]:
train = pd.read_csv('../input/cleaned-toxic-comments/train_preprocessed.csv')
train.drop(['set', 'id', 'toxicity'], axis=1, inplace=True)
display(train.head())
display(train.columns)

The column `comment_text` contains the text of the comment, and `identity_hate`, `insult`, `obscene`, `severe_toxic`, `threat`, `toxic` are the target features.

Check for missing values

In [None]:
train.isna().mean()

In [None]:
train.info()

The dataset contains 159571 rows, and the data types are the expected ones

In [None]:
train.duplicated().sum()

For convenience, we will convert the text to lower case

In [None]:
train['comment_text'] = train['comment_text'].str.lower()

In [None]:
train.head()

In [None]:
cols = ['identity_hate', 'insult', 'obscene', 'severe_toxic', 'threat', 'toxic']
for col in cols:
  display(train[col].value_counts(normalize=True))

**Conclusion:** We made the primary transformations and checked for missing values and duplicates. We observe an imbalance in the target classes.

We need to transform the text: tokenize it and strip unwanted characters from the lines. We will do this in a function that uses the `nltk` and `re` libraries.

In [None]:
def text_preprocessing(text):
    tokenized = nltk.word_tokenize(text)
    joined = ' '.join(tokenized)
    text_only = re.sub(r"[^a-z0-9!@#\$%\^\&\*_\-,\.' ]", ' ', joined)
    final = ' '.join(text_only.split())
    return final

In [None]:
tqdm.pandas() 
train['token_text'] = train['comment_text'].progress_apply(text_preprocessing)
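
To see what the character filter keeps and drops, here is a minimal sketch that applies the same regular expression to one made-up sentence. It skips the `nltk.word_tokenize` step so that it runs without downloading tokenizer data; `clean_text` and `sample` are illustrative names, not part of the notebook.

```python
import re

# Same character whitelist as in text_preprocessing above
PATTERN = r"[^a-z0-9!@#\$%\^\&\*_\-,\.' ]"

def clean_text(text):
    # Lowercase, replace every character outside the whitelist with a space,
    # then collapse runs of whitespace into single spaces
    text_only = re.sub(PATTERN, ' ', text.lower())
    return ' '.join(text_only.split())

sample = "Hello (world)! It's 100% FREE: see http://example.com"
cleaned = clean_text(sample)
print(cleaned)
```

Brackets, colons and slashes are replaced by spaces, while letters, digits, apostrophes and the listed punctuation survive.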

We now have cleaned, tokenized text and can continue working with the set

In [None]:
corpus_lemm = train['token_text']

In [None]:
corpus_lemm[0]

We now have a corpus for further processing

In [None]:
x, y = np.ogrid[:300, :300]

mask = (x - 150) ** 2 + (y - 150) ** 2 > 150 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(background_color="white", 
               random_state=42, mask=mask, repeat=True,
               stopwords=stopwords).generate(corpus_lemm[0])

plt.figure(figsize=(15, 10), dpi=42)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

**Conclusion:** We transformed the dataset into cleaned, tokenized text (no lemmatization is applied, despite the `corpus_lemm` name). In the word cloud built from the first comment, the most prominent words are explanation, dolls, edits. Let's try to train models to predict the toxicity of the text.

We are going to use Random Forest and SGD classifiers, and we will also use DistilBERT to obtain embeddings and classify on top of them; perhaps this will improve on the results of the basic models. We will also try stacking.

To speed up the DistilBERT step without a GPU, only part of the dataset will be passed to it, which is bound to affect the result. We will also use only the `toxic` target.

## Model training

There is a strong class imbalance. Let's try two approaches:

- train the model using **downsampling**
- train the model with the parameter **class_weight = 'balanced'**
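
As a rough illustration of the second option, the sketch below shows the weights that scikit-learn derives for `class_weight='balanced'`, namely `n_samples / (n_classes * count_per_class)`. The 90/10 label split is a made-up toy example, not the ratio in this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 90 non-toxic and 10 toxic comments
y = np.array([0] * 90 + [1] * 10)

# Weights scikit-learn uses for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)

# The same values computed by hand from the formula
manual = len(y) / (2 * np.bincount(y))

print(weights)
print(manual)
```

The rare class gets a proportionally larger weight, so its errors count more during training.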

### Preparing characteristics


Let's select the target feature and the training feature (the text) from the dataset

In [None]:
train.head()

In [None]:
features = train['token_text']
target = train['toxic']

Split our set into train and test

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=42)

Let's write a function that balances the classes through downsampling

In [None]:
def downsample(features, target, fraction):
    # Split features and target by class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Keep only `fraction` of the majority (non-toxic) class; the same
    # random_state keeps the feature and target samples aligned
    features_sample = features_zeros.sample(frac=fraction, random_state=42)
    target_sample = target_zeros.sample(frac=fraction, random_state=42)

    features_downsampled = pd.concat([features_sample, features_ones])
    target_downsampled = pd.concat([target_sample, target_ones])

    # Shuffle both in a single call so that rows stay paired
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=42)

    return features_downsampled, target_downsampled

Now we get the new sets

In [None]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

print(features_downsampled.shape)
print(target_downsampled.shape)

In [None]:
target_downsampled.value_counts(normalize=True)
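
The class share after downsampling follows from simple arithmetic: if `P` positives are kept and a fraction `f` of `N` negatives survives, the positive share is `P / (P + f * N)`. Below is a minimal sketch with made-up counts (900 negatives, 100 positives), which are not the counts of this dataset.

```python
import pandas as pd

# Toy target (assumption): 900 non-toxic and 100 toxic rows
target = pd.Series([0] * 900 + [1] * 100)
fraction = 0.1

# Keep 10% of the majority class and all of the minority class
kept_zeros = target[target == 0].sample(frac=fraction, random_state=42)
balanced = pd.concat([kept_zeros, target[target == 1]])

share = balanced.mean()                  # share of the positive class
expected = 100 / (100 + fraction * 900)  # P / (P + f * N)
print(share, expected)
```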

The imbalance is insignificant; with such a set, we can try to train the model. First, let's compute the TF-IDF features for the new set.

In [None]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
tf_idf = count_tf_idf.fit_transform(features_downsampled)

print("Learning Matrix Size:", tf_idf.shape)
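
A small sketch of the fit/transform pattern used here, on made-up sentences: the vocabulary is learned only from the training texts, and test texts are projected onto it, so words unseen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumption), not taken from the dataset
train_docs = ["you are great", "you are awful", "great edit thanks"]
test_docs = ["awful awful edit", "unseen words only"]

vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test_toy = vectorizer.transform(test_docs)        # reuses that vocabulary

print(sorted(vectorizer.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
print(X_test_toy[1].nnz)  # number of non-zero entries in the second test row
```

Both matrices have the same number of columns, which is what lets a model trained on one set predict on the other.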

In [None]:
model_name = []
fscore = []

### Training a random forest with downsampling


Train a random forest (an ensemble of decision trees) on the downsampled set

In [None]:
X_train_ans = tf_idf
y_train_ans = target_downsampled

In [None]:
%%time

rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)



X_test_ans = count_tf_idf.transform(features_test)

rnd_clf.fit(X_train_ans, y_train_ans)
predict = rnd_clf.predict(X_test_ans)
f_score = f1_score(target_test, predict)

print('{}'.format(f_score))

The F1 score is not good: the model generalizes poorly to the test set. Let's try training without downsampling, using balanced class weights, and then SGD


In [None]:
model_name.append(str(rnd_clf.__class__.__name__)+str(' ')+str('downsampling'))
fscore.append(round(f_score, 2))

### Training a random forest without downsampling

Let's train the same kind of ensemble, but instead of the downsampled set we use the full preprocessed training set and set the class weight to balanced, with the same number of estimators

In [None]:
count_tf = TfidfVectorizer(stop_words=stopwords)
tf_idf_new = count_tf.fit_transform(features_train)

In [None]:
%%time
X_train = tf_idf_new
y_train = target_train

X_test = count_tf.transform(features_test)


rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42, 
                            class_weight='balanced')



rnd_clf.fit(X_train, y_train)
predict_new = rnd_clf.predict(X_test)

In [None]:
f_score = f1_score(target_test, predict_new)
print(f_score)


With this approach, the F1 score decreased

In [None]:
model_name.append(str(rnd_clf.__class__.__name__)+str(' ')+str('class_weight balanced'))
fscore.append(round(f_score, 2))

### Train SGD model with downsampling

Let's try the stochastic gradient descent model with downsampling

In [None]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
tf_idf = count_tf_idf.fit_transform(features_downsampled)

print("Matrix size:", tf_idf.shape)

In [None]:
X_train = tf_idf
y_train = target_downsampled

In [None]:
# Note: l1_ratio only matters with penalty='elasticnet'; the default penalty is 'l2'
sgb_clf = SGDClassifier(l1_ratio=0.1, random_state=42,
                            class_weight='balanced')

In [None]:
%%time
sgb_clf.fit(X_train, y_train)

X_test = count_tf_idf.transform(features_test)

predict = sgb_clf.predict(X_test)
f_score = f1_score(target_test, predict)
print(f_score)

The result is better than the forest's (but the forest uses only 10 estimators)

In [None]:
model_name.append(str(sgb_clf.__class__.__name__)+str(' ')+str('downsampling + class_weight balanced'))
fscore.append(round(f_score, 2))

### Stacking with Random forest

Let us try stacking with scikit-learn models: RandomForestClassifier, SGDClassifier and MLPClassifier.  
We need a validation set for this

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    features_train, target_train, test_size=0.2, random_state=42)

Now compute TF-IDF for all three sets

In [None]:
count_tf = TfidfVectorizer(stop_words=stopwords)
X_train_idf = count_tf.fit_transform(X_train)
X_val_idf = count_tf.transform(X_val)
X_test = count_tf.transform(features_test)

We will use three base models (RandomForestClassifier, SGDClassifier and MLPClassifier), then blend their predictions with another RandomForestClassifier

In [None]:
random_forest_clf = RandomForestClassifier(n_estimators=10, random_state=42, 
                                           class_weight='balanced')
sgd_clf = SGDClassifier(l1_ratio=0.1, random_state=42,
                            class_weight='balanced')
mlp_clf = MLPClassifier(random_state=42, early_stopping=True)

In [None]:
estimators = [random_forest_clf, sgd_clf, mlp_clf]
for estimator in estimators:
    print('Training', estimator)
    estimator.fit(X_train_idf, y_train)

From these predictions, let us build a new training set for our meta-model, the blender

In [None]:
X_val_predictions = np.empty((X_val_idf.shape[0], len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val_idf)
X_val_predictions

In [None]:
rnd_forest_blender = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

In [None]:
rnd_forest_blender.oob_score_

Now we can predict our test and see the F1-measure

In [None]:
X_test_predictions = np.empty((X_test.shape[0], len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [None]:
%%time

y_pred = rnd_forest_blender.predict(X_test_predictions)
f_score = f1_score(target_test, y_pred)
print(f_score)

We have an improvement here with stacking

In [None]:
model_name.append(str(rnd_forest_blender.__class__.__name__)+str(' ')+str('Stacking Ensemble'))
fscore.append(round(f_score, 2))
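
For comparison, scikit-learn also ships a built-in `StackingClassifier`. This is a different technique from the hold-out blending above: it trains the final estimator on cross-validated predictions of the base models instead of a separate validation split. The sketch below runs on synthetic data from `make_classification`, and the choice of base models and of `LogisticRegression` as the final estimator is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the comment dataset
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
        ('sgd', SGDClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    stack_method='predict',  # feed hard class predictions to the final estimator
    cv=5,                    # out-of-fold predictions become the meta-features
)
stack.fit(X_tr, y_tr)
stacked_pred = stack.predict(X_te)
print(stacked_pred[:10])
```

Because the meta-features come from out-of-fold predictions, no rows have to be held back from the base models.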

### Train DistilBERT

To train a model on embeddings from pretrained DistilBERT, we will build a new balanced set. We will have to pass an order of magnitude fewer rows for training, so the result is not an entirely adequate performance estimate

Let's create samples for DistilBERT and remove the class imbalance in the training sample. 

In [None]:
features = train['comment_text']
target = train['toxic']

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=42)

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

target_downsampled.value_counts(normalize=True).to_frame()

Let's build a set from the balanced training sample, remove duplicates from it, and take a sample of 1000 rows

In [None]:
df_bert = features_downsampled.to_frame().join(
    target_downsampled.to_frame())
df_bert.head()

In [None]:
df_bert.duplicated().sum()

In [None]:
df_bert.drop_duplicates(inplace=True)
df_bert.duplicated().sum()

In [None]:
df_bert[df_bert.index == 115222]

Thus we got a new set with all duplicates removed, from which we will take samples. We looked up one index to make sure the set was assembled correctly

In [None]:
df_comm = df_bert.sample(1000).reset_index(
    drop=True)
df_comm.head()

In [None]:
df_comm['toxic'].value_counts(normalize=True).to_frame()

We got a fairly balanced sample.

We transform our features in order to obtain embeddings

In [None]:
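# Load the pretrained DistilBERT weights; building the model from a bare
# DistilBertConfig would leave it randomly initialized
pretrained_weights = 'distilbert-base-uncased'

model = transformers.DistilBertModel.from_pretrained(pretrained_weights)
model.eval()  # inference only: disable dropout
configuration = model.config

tokenizer_class = transformers.DistilBertTokenizer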

It is worth noting that the model accepts sequences of at most 512 tokens, so longer comments have to be truncated; here each comment is cut to its first 512 characters before tokenization. This can also affect the results

In [None]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

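# Cut each comment to 512 characters, then cap the encoded sequence at 512 tokens
tokenized = df_comm['comment_text'].apply(
    lambda x: tokenizer.encode(x[:512], add_special_tokens=True,
                               max_length=512, truncation=True))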

padded = np.array([i + [0]*(512 - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)

In [None]:
len(padded[0])

In [None]:
padded.shape, attention_mask.shape
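
The padding and masking step can be seen on a tiny example. The token ids below are made up for illustration and the maximum length is 6 instead of 512; the logic is the same as in the cell above.

```python
import numpy as np

# Hypothetical token id sequences of different lengths
token_ids = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]
max_len = 6

# Right-pad every sequence with zeros up to max_len
padded_toy = np.array([ids + [0] * (max_len - len(ids)) for ids in token_ids])

# 1 marks a real token, 0 marks padding the model should ignore
attention_mask_toy = np.where(padded_toy != 0, 1, 0)

print(padded_toy)
print(attention_mask_toy)
```

This works because id 0 is reserved for the padding token, so a non-zero check separates real tokens from padding.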

In [None]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]) 
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
        
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        
        embeddings.append(batch_embeddings[0][:,0,:].numpy())
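
The loop above processes only full batches, which is fine when the number of rows is a multiple of `batch_size`. For other sizes the trailing rows would be skipped. One possible way to cover them is a small slicing helper; `batch_slices` is a hypothetical name, not part of the notebook.

```python
def batch_slices(n_rows, batch_size):
    # Yield slices that cover all rows, including a shorter final batch
    for start in range(0, n_rows, batch_size):
        yield slice(start, min(start + batch_size, n_rows))

slices = list(batch_slices(250, 100))
sizes = [s.stop - s.start for s in slices]
print(sizes)
```

Each slice can be used directly as `padded[s]` and `attention_mask[s]` inside the embedding loop.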

Create training and target datasets for our model

In [None]:
X_train = np.concatenate(embeddings)
y_train = df_comm['toxic'][:padded.shape[0]]

We will check visually whether the target classes were selected correctly

In [None]:
y_train.values[:50]

In [None]:
df_comm['toxic'].values[:50]

Check the shapes of the sets

In [None]:
X_train.shape, y_train.shape

In [None]:
y_train.value_counts(normalize=True).to_frame()

The target feature is balanced on the training sample

Let's prepare a test sample. Let's take 200 random rows and get embeddings for the test

In [None]:
test = features_test.to_frame().join(
    target_test.to_frame()).sample(200).reset_index(
    drop=True)
test.head()

In [None]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

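# Cut each comment to 512 characters, then cap the encoded sequence at 512 tokens
tokenized = test['comment_text'].apply(
    lambda x: tokenizer.encode(x[:512], add_special_tokens=True,
                               max_length=512, truncation=True))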

padded = np.array([i + [0]*(512 - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)

In [None]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]) 
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
        
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        
        embeddings.append(batch_embeddings[0][:,0,:].numpy())

In [None]:
X_test = np.concatenate(embeddings)
y_test = test['toxic'][:X_test.shape[0]]

In [None]:
y_test.value_counts(normalize=True).to_frame()

There is a class imbalance in the test sample. Let's train a logistic regression model with balanced class weights

In [None]:
%%time
log_clf = LogisticRegression(solver="liblinear", random_state=42,
                             class_weight='balanced')

log_clf.fit(X_train, y_train)

In [None]:
predict = log_clf.predict(X_test)

In [None]:
f_score = f1_score(y_test, predict)
print(f_score)

Unfortunately, we got a rather low result. This is likely because, to reduce the training time, we did not pass the entire set for training, and we had to truncate the comments so that the model could process them, which could affect the context

In [None]:
model_name.append(str(log_clf.__class__.__name__)+str(' ')+str('BERT'))
fscore.append(round(f_score, 2))

### Sanity check

Let's build a constant model. It will predict 1 (toxic comment) everywhere, since our goal is to identify toxic comments.

In [None]:
dummy = DummyClassifier(random_state=42, strategy='constant', constant=1)

In [None]:
dummy.fit(features_train, target_train)
dummy_pred = dummy.predict(features_test)

In [None]:
f1_const = f1_score(target_test, dummy_pred)

print("Const:", f1_const)
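
The F1 score of an always-positive model can be worked out by hand: recall is 1 and precision equals the positive rate `p`, so `F1 = 2p / (1 + p)`. The sketch below checks this on made-up labels with a 10% positive rate, which is an assumption and not the rate in this dataset.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (assumption): 10 toxic, 90 non-toxic
y_true = np.array([1] * 10 + [0] * 90)
y_const = np.ones_like(y_true)  # predict "toxic" for every row

f1_toy = f1_score(y_true, y_const)
p = y_true.mean()
f1_formula = 2 * p / (1 + p)

print(f1_toy, f1_formula)
```

Any real model should beat this baseline to be considered useful.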

In [None]:
model_name.append(str(dummy.__class__.__name__)+str(' ')+str('const 1'))
fscore.append(round(f1_const, 2))

## Summary

In [None]:
summary = pd.DataFrame(
    { 'model' : model_name , 'F1' : fscore }
    ).sort_values(by='F1', ascending=False).reset_index( drop = True )

summary.style.highlight_max( 'F1' , color = 'green' , axis = 0 )


Acceptable results were obtained with the model based on the SGD algorithm, and we have a **better score with stacking**

Logistic regression on DistilBERT embeddings is not worth using to classify long texts such as comments in this setting: you have to truncate the text, which can affect the context, and you have to limit the amount of data for training and prediction.

Thank you for reading


---
<font size="1">
ArtyKraftyy
</font>     
