# Toxicity classification using BERT

**Description:** This notebook builds a BERT-based model that classifies a text comment as toxic or non-toxic.
The training data was originally sourced from the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). 

<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup) 
  * 2. [Data](#data)  
  * 3. [Data Cleaning](#datacleaning)



<a id = 'setup'></a>

## 1. Setup

Install tensorflow-datasets, pydot and transformers

In [1]:
!pip install tensorflow-datasets --quiet

In [2]:
!pip install pydot --quiet

In [3]:
!pip install transformers --quiet

Import required libraries

In [4]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds

import sklearn as sk
import os
import nltk
from nltk.data import find

import matplotlib.pyplot as plt

import re
import string



In [5]:
from transformers import BertTokenizer, TFBertModel

[Return to Top](#returnToTop)  
<a id = 'data'></a>

## 2. Data

The Jigsaw dataset has been downloaded from Kaggle, cleaned, preprocessed and split into train, validation and test sets. The datasets are stored on Amazon S3, and the cells below read them from there.

In [6]:
df_train = pd.read_csv("https://adamhyman-public.s3.amazonaws.com/train_data.csv")

df_valid = pd.read_csv("https://adamhyman-public.s3.amazonaws.com/validation_data.csv")

df_test = pd.read_csv("https://adamhyman-public.s3.amazonaws.com/test_data.csv")

In [7]:
df_train.head()

| | comment_text | toxic |
|---|---|---|
| 0 | Sobriety would also be a good first step to en... | 0 |
| 1 | gender is NOT a feeling or belief you ignorant... | 1 |
| 2 | "If Whitmer had been a white woman at a weddin... | 1 |
| 3 | Nobody stated it was a male thing. Take a brea... | 0 |
| 4 | Ontario deserves this for the government they ... | 1 |


In [8]:
# There are 3 null comments in the train dataset, 1 in validation and 1 in test. Remove them, otherwise converting to a tensor raises an error.
df_train = df_train.dropna(how='any',axis=0) 
df_valid = df_valid.dropna(how='any',axis=0) 
df_test = df_test.dropna(how='any',axis=0) 
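
Before dropping rows, it can help to confirm which columns actually contain nulls. The following is a minimal sketch on a small made-up frame that only borrows the `comment_text` and `toxic` column names from the real data; the rows and the name `toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are invented for illustration.
toy = pd.DataFrame({
    "comment_text": ["first comment", np.nan, "third comment"],
    "toxic": [0, 1, 0],
})

# Count missing values per column before dropping anything.
null_counts = toy.isna().sum()
print(null_counts)

# Drop only the rows whose comment is missing. In this toy frame that matches
# dropna(how='any') because the 'toxic' column has no nulls.
cleaned = toy.dropna(subset=["comment_text"]).reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```

The `reset_index(drop=True)` call is an optional extra that the notebook itself does not use; without it the dropped rows leave gaps in the index.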

In [9]:
df_train.head()


| | comment_text | toxic |
|---|---|---|
| 0 | Sobriety would also be a good first step to en... | 0 |
| 1 | gender is NOT a feeling or belief you ignorant... | 1 |
| 2 | "If Whitmer had been a white woman at a weddin... | 1 |
| 3 | Nobody stated it was a male thing. Take a brea... | 0 |
| 4 | Ontario deserves this for the government they ... | 1 |


[Return to Top](#returnToTop)  
<a id = 'datacleaning'></a>

## 3. Data Cleaning
This section cleans the comment text: it lowercases each comment, expands common contractions, and removes punctuation and other special characters.

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
stemmer_ps = PorterStemmer()

def stemSentence(sentence):
    # Tokenize first; stemming the raw string would treat the whole sentence as one word.
    token_words = word_tokenize(sentence)
    # Stem each token and re-join with single spaces to rebuild the sentence.
    return " ".join(stemmer_ps.stem(word) for word in token_words)

def remStopWords(sentence):
    # Drop tokens that exactly match an NLTK English stop word (the match is case-sensitive).
    return ' '.join([word for word in sentence.split() if word not in cachedStopWords])
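
One thing to watch with `remStopWords`: the NLTK English stop words are all lowercase, so a capitalized token such as "The" is not removed unless the text is lowercased first. The sketch below illustrates this with a hypothetical three-word stop list (`toy_stop_words`) so that it runs without downloading the NLTK corpus; the helper names are also made up for the example.

```python
# Hypothetical three-word stop list standing in for stopwords.words("english"),
# whose entries are all lowercase.
toy_stop_words = {"the", "is", "a"}

def rem_stop_words(sentence, stop_words):
    # Same whitespace-split, exact-match logic as remStopWords.
    return " ".join(w for w in sentence.split() if w not in stop_words)

def rem_stop_words_lower(sentence, stop_words):
    # Variant that lowercases each token before the membership test.
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

sample = "The cat is on a mat"
case_sensitive = rem_stop_words(sample, toy_stop_words)
case_insensitive = rem_stop_words_lower(sample, toy_stop_words)
print(case_sensitive)    # "The" survives the exact-match filter
print(case_insensitive)  # "The" is removed once tokens are lowercased
```

Lowercasing the text before stop-word removal would have the same effect as the second variant.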


In [None]:
'''
for sen in df_train['comment_text']:
    temSen = sen
    sen = remStopWords(sen)
    tokenized_sentence = stemSentence(sen)

print("final sent:" + tokenized_sentence)
print("sen:" + temSen)
'''

In [10]:
def clean_text(text):
    text = text.lower()
    # Expand common contractions. Specific forms (won't, can't) come before the generic n't rule.
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)
    # Remove punctuation, then replace any remaining non-word character with a space.
    text = re.sub(r"[-()\"\#/@;:{}`+=~|!?,]", "", text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r"\W", " ", text)
    return text

df_train["text"] = df_train['comment_text'].apply(clean_text)
df_valid["text"] = df_valid['comment_text'].apply(clean_text)
df_test["text"] = df_test['comment_text'].apply(clean_text)
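
The order of the substitutions in `clean_text` matters: if the generic `n't` rule ran before the `won't` rule, "won't" would become "wo not". A minimal sketch of the same idea as an ordered rule table is shown here. It covers only a subset of the rules, the names `CONTRACTIONS`, `expand_contractions` and `strip_punctuation` are made up for the example, and the whitespace collapsing at the end is an addition that the notebook's function does not perform.

```python
import re
import string

# Ordered (pattern, replacement) pairs; a subset of the rules in clean_text.
# Specific contractions must come before the generic "n't" rule.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
]

def expand_contractions(text):
    # Lowercase, then apply each rule in the order listed above.
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return text

def strip_punctuation(text):
    # Delete ASCII punctuation, then collapse any run of whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

sample = "You won't believe it, they're here!"
result = strip_punctuation(expand_contractions(sample))
print(result)
```

Keeping the rules in a list makes the ordering explicit and makes it harder to add the same rule twice by accident.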

In [11]:
df_train[['comment_text','text']].head(20)

| | comment_text | text |
|---|---|---|
| 0 | Sobriety would also be a good first step to en... | sobriety would also be a good first step to en... |
| 1 | gender is NOT a feeling or belief you ignorant... | gender is not a feeling or belief you ignorant... |
| 2 | "If Whitmer had been a white woman at a weddin... | if whitmer had been a white woman at a wedding... |
| 3 | Nobody stated it was a male thing. Take a brea... | nobody stated it was a male thing take a breat... |
| 4 | Ontario deserves this for the government they ... | ontario deserves this for the government they ... |
| 5 | I think you are right Mike, this no "accident"... | i think you are right mike this no accident ch... |
| 6 | You beat me to it. Indeed, the Conservatives c... | you beat me to it indeed the conservatives can... |
| 7 | The gigglig git creates the impression that he... | the gigglig git creates the impression that he... |
| 8 | While there certainly are small-minded idiots ... | while there certainly are smallminded idiots w... |
| 9 | ron? lol | ron lol |


In [None]:
train_comments, train_labels = df_train["text"], df_train["toxic"]
valid_comments, valid_labels = df_valid["text"], df_valid["toxic"]
test_comments, test_labels = df_test["text"], df_test["toxic"]

In [None]:
train_comments, train_labels = tf.convert_to_tensor(train_comments), tf.convert_to_tensor(train_labels)
valid_comments, valid_labels = tf.convert_to_tensor(valid_comments), tf.convert_to_tensor(valid_labels)
test_comments, test_labels = tf.convert_to_tensor(test_comments), tf.convert_to_tensor(test_labels)

In [None]:
train_comments[:4]

In [None]:
train_labels[:4]