This is a beginner's introduction to sentence-transformers.

### When two toxic texts are close in meaning, annotators may end up choosing the more toxic one almost at random.

### This notebook tries to get closer to reliable validation data by removing more_toxic/less_toxic pairs whose texts are close in meaning.

### This notebook is one of the approaches to creating reliable validation data described in the discussion [here](http://www.kaggle.com/c/jigsaw-toxic-severity-rating/discussion/303429).

# import sentence-transformers

In [None]:
!pip install sentence_transformers

In [None]:
from sentence_transformers import util, SentenceTransformer

import pandas as pd
import os

from tqdm.auto import tqdm
from bs4 import BeautifulSoup
import numpy as np
import re
import random
import string

In [None]:
VALID_DATA_PATH = "../input/jigsaw-toxic-severity-rating/"

validation_df=pd.read_csv(os.path.join(VALID_DATA_PATH,'validation_data.csv'))
validation_df

In [None]:
SENTENCE_BERT_PATH="/kaggle/input/sentence transformers"
model = SentenceTransformer('stsb-bert-base')

# clean text

Reference for the cleaning function: https://www.kaggle.com/vitaleey/tfidf-ridge

In [None]:
def text_cleaning(text):
    template = re.compile(r'https?://\S+|www\.\S+') #Removes website links
    text = template.sub(r'', text)

    soup = BeautifulSoup(text, 'lxml') #Removes HTML tags
    only_text = soup.get_text()
    text = only_text

    emoji_pattern = re.compile("["
                              u"\U0001F600-\U0001F64F"  # emoticons
                              u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                              u"\U0001F680-\U0001F6FF"  # transport & map symbols
                              u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                              u"\U00002702-\U000027B0"
                              u"\U000024C2-\U0001F251"
                              "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    text = re.sub(r"[^a-zA-Z\d]", " ", text) #Remove special characters
    text = re.sub(' +', ' ', text) #Remove Extra Spaces
    text = text.strip() # remove spaces at the beginning and at the end of string

    return text

validation_df['less_toxic']=validation_df['less_toxic'].apply(text_cleaning)
validation_df['more_toxic']=validation_df['more_toxic'].apply(text_cleaning)
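To see what the regex steps of the cleaning function do to a comment, here is a minimal sketch that reproduces only the link removal, special-character removal, and whitespace steps (no HTML or emoji handling). The helper name `strip_links_and_symbols` and the sample string are made up for illustration.

```python
import re

def strip_links_and_symbols(text):
    """Simplified stand-in for text_cleaning: regex steps only."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop website links
    text = re.sub(r"[^a-zA-Z\d]", " ", text)           # non-alphanumerics become spaces
    text = re.sub(" +", " ", text)                     # collapse repeated spaces
    return text.strip()                                # trim both ends

# Hypothetical sample comment, not taken from the validation data
sample = "Check http://example.com NOW!!!  You're   wrong :("
cleaned = strip_links_and_symbols(sample)
print(cleaned)  # Check NOW You re wrong
```

Note that apostrophes are replaced too, so "You're" becomes "You re". This changes the tokens the model sees, which may slightly affect the similarity scores.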

# sentence-transformers

In [None]:
import time

start_time = time.time()

# Embed the less_toxic and more_toxic texts (passed as plain lists of strings)
less_embedding = model.encode(validation_df.less_toxic.tolist(), convert_to_tensor=True)
more_embedding = model.encode(validation_df.more_toxic.tolist(), convert_to_tensor=True)

pass_time=round(time.time() - start_time)
print(f'time:{pass_time}s')
print(less_embedding.shape)
print(more_embedding.shape)

In [None]:
# Cosine similarity between each less_toxic/more_toxic pair
scores_list=[]

for idx in range(len(less_embedding)):
    scores = util.pytorch_cos_sim(less_embedding[idx], more_embedding[idx])
    # .item() returns the single value as a Python float (also works for GPU tensors)
    scores_list.append(scores.item())
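For readers new to the metric, the score computed for each pair is the cosine similarity of the two embedding vectors. The following is a minimal pure-Python sketch of that formula on toy vectors. The vectors are hypothetical values chosen for easy arithmetic, not real model output, and `cosine_similarity` is an illustrative helper, not part of the library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical), one row per validation pair
less_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
more_vecs = [[2.0, 0.0], [1.0, -1.0], [3.0, 0.0]]

# Score row i of less_vecs against row i of more_vecs, as the loop above does
pair_scores = [cosine_similarity(u, v) for u, v in zip(less_vecs, more_vecs)]
print(pair_scores)
```

The first pair points in the same direction and scores 1, while the other two pairs are orthogonal and score 0. The score depends only on direction, not on vector length.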

In [None]:
# Convert the scores to a DataFrame for easier inspection
sentence_df = pd.DataFrame(scores_list, columns = ['sentence_transformers_score'])
sentence_df.head()

In [None]:
# Distribution of sentence_transformers_score
import seaborn as sns
sns.histplot(sentence_df['sentence_transformers_score'])

## Let's look at samples with scores of 0.7 or higher (high-score samples)

In [None]:
high_score_index=[]

for index,score in enumerate(scores_list):
    if score >= 0.7:
        high_score_index.append(index)

print(len(high_score_index))

In [None]:
#show high_score_index
print(high_score_index[:10])

In [None]:
print('more_toxic:','\n',validation_df.loc[780,'more_toxic'],'\n')
print('less_toxic:','\n',validation_df.loc[780,'less_toxic']) 

Rows 781 and 782 contain the same texts as row 780.

In [None]:
print('more_toxic:','\n',validation_df.loc[781,'more_toxic'],'\n') 
print('less_toxic:','\n',validation_df.loc[781,'less_toxic'],'\n') 
print('more_toxic:','\n',validation_df.loc[782,'more_toxic'],'\n') 
print('less_toxic:','\n',validation_df.loc[782,'less_toxic']) 

The model is able to pick out text pairs with similar meanings.
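Once the high-score pairs are identified, removing them from the validation data is a matter of splitting the row indices by the threshold. Here is a minimal sketch of that split on toy scores. The helper name `split_by_similarity` and the score values are hypothetical, and 0.7 is simply the threshold used above, not a tuned value.

```python
def split_by_similarity(scores, threshold=0.7):
    """Return (keep, drop) index lists; pairs scoring >= threshold are dropped."""
    keep, drop = [], []
    for idx, score in enumerate(scores):
        if score >= threshold:
            drop.append(idx)   # texts too close in meaning
        else:
            keep.append(idx)   # pair stays in the validation data
    return keep, drop

# Toy scores (hypothetical), standing in for scores_list
toy_scores = [0.12, 0.95, 0.70, 0.43]
keep_idx, drop_idx = split_by_similarity(toy_scores)
print(keep_idx, drop_idx)  # [0, 3] [1, 2]
```

With the notebook's variables, something like `validation_df.iloc[keep_idx].reset_index(drop=True)` applied to the indices from `scores_list` would then give the filtered validation data.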

### Let's look at some other samples.
However, the texts in the following samples do not have similar meanings.

This may be because the maximum sequence length of this sentence-transformers model is 128 and some of the validation texts are longer than that, so they do not fit into the model's input.

https://huggingface.co/sentence-transformers/stsb-bert-base
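One way to check this hypothesis is to count how many texts exceed the limit. The sketch below uses whitespace splitting as a rough stand-in for the model's tokenizer, which is an assumption: the real limit counts subword tokens, so whitespace counts underestimate the true length. The helper name `count_over_limit` and the toy texts are hypothetical.

```python
def count_over_limit(texts, limit=128, tokenize=str.split):
    """Count texts whose token count exceeds `limit`.

    `tokenize` defaults to whitespace splitting, a rough proxy only.
    """
    lengths = [len(tokenize(t)) for t in texts]
    n_over = sum(n > limit for n in lengths)
    return n_over, lengths

# Toy texts (hypothetical): one short, one far beyond 128 words
toy_texts = ["short text", "word " * 200]
n_over, lengths = count_over_limit(toy_texts)
print(n_over, lengths)  # 1 [2, 200]
```

For a closer estimate on the real data, the model's own tokenizer could be passed as `tokenize`, for example `model.tokenizer.tokenize`, assuming the loaded model exposes its Hugging Face tokenizer that way.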

In [None]:
print('more_toxic:','\n',validation_df.loc[1160,'more_toxic'],'\n') 
print('less_toxic:','\n',validation_df.loc[1160,'less_toxic']) 

In [None]:
print('more_toxic:','\n',validation_df.loc[1290,'more_toxic'],'\n') 
print('less_toxic:','\n',validation_df.loc[1290,'less_toxic']) 

In [None]:
print('more_toxic:','\n',validation_df.loc[1402,'more_toxic'],'\n')
print('less_toxic:','\n',validation_df.loc[1402,'less_toxic']) 

### Thank you for reading this far!

### I would be happy if this notebook helps fellow Kagglers.