## CS 6120: Natural Language Processing

### Final Project - Group 20
### Group Members: Xi Jia, Xiaochen Ma, Young Zhang

### Dataset
The Jigsaw Toxic Comment Classification Challenge dataset is a collection of Wikipedia talk page comments labeled for six types of toxic behavior: toxic, severe toxic, obscene, threat, insult, and identity hate.
A comment can carry several of these labels at once, or none of them.
The training set used here contains 159,571 comments. The dataset can be
found on Kaggle: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

### Part 1: Preprocessing and splitting the data - Xiaochen Ma
Xiaochen Ma is responsible for data preprocessing and vectorization,
the steps that prepare the data for the machine learning models.

### Import the libraries

In [2]:
import math
import re
import string
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

warnings.filterwarnings("ignore")

# Download all NLTK resources (a large download)
nltk.download('all')
wn = nltk.WordNetLemmatizer()

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/xiaochenma/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/xiaochenma/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/xiaochenma/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/xiaochenma/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/xiaochenma/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    |

### Load the data

In [3]:
# Read the data in 'train.csv' file and print head of data frame
train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
# Read the data in 'test.csv' file and print head of data frame
test_df = pd.read_csv("test.csv")
test_df.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [8]:
# Labels count
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train_df[class_names].apply(lambda x: x.value_counts())

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,144277,157976,151122,159093,151694,158166
1,15294,1595,8449,478,7877,1405


In [9]:
# Normalize the train labels count
train_df[class_names].apply(lambda x: x.value_counts(normalize=True))

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0.904156,0.990004,0.947052,0.997004,0.950636,0.991195
1,0.095844,0.009996,0.052948,0.002996,0.049364,0.008805
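The normalized table is simply each count divided by the column total, and it makes the class imbalance easy to quantify. The sketch below recomputes the proportions from the raw counts printed earlier, using plain Python; the dictionary `label_counts` and the helper `positive_rate` are names introduced here for illustration and are not part of the notebook.

```python
# Counts copied from the "Labels count" output above: (negatives, positives)
label_counts = {
    "toxic": (144277, 15294),
    "severe_toxic": (157976, 1595),
    "obscene": (151122, 8449),
    "threat": (159093, 478),
    "insult": (151694, 7877),
    "identity_hate": (158166, 1405),
}

def positive_rate(negatives, positives):
    """Share of comments that carry the label."""
    return positives / (negatives + positives)

for label, (neg, pos) in label_counts.items():
    rate = positive_rate(neg, pos)
    # neg / pos is the number of comments without the label per comment with it
    print(f"{label:14s} positive rate = {rate:.6f}, negatives per positive = {neg / pos:.1f}")
```

Every label is a small minority class, and `threat` is the rarest, so plain accuracy would likely be a misleading metric for models trained on these labels.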


### Preprocess the comment_text field

Comments contain links, punctuation, stopwords, and other tokens that carry little signal for the prediction.

In the cell below, we therefore lowercase the text and remove links, punctuation, and stopwords.

We also lemmatize each remaining word so that inflected forms (for example, plurals) are reduced to a common base form.
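To see what each of these steps does in isolation, the following sketch traces a short sentence through lowercasing, link removal, punctuation removal, and stopword removal using only the standard library. It is an illustration under assumptions, not the notebook's implementation: `DEMO_STOPWORDS` is a deliberately tiny, hypothetical stopword set, `demo_clean` is a name made up for this example, and lemmatization is left out because it needs the NLTK WordNet data.

```python
import re
import string

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook itself relies on NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "a", "is", "at", "this", "and", "to"}

def demo_clean(text):
    """Return the intermediate result of each cleaning step as a dict."""
    steps = {}
    steps["lowercased"] = text.lower()
    # Drop anything that starts with "http" or "www." up to the next whitespace
    steps["links_removed"] = re.sub(r"(http|www\.)\S+", "", steps["lowercased"])
    # Delete every ASCII punctuation character
    steps["punctuation_removed"] = steps["links_removed"].translate(
        str.maketrans("", "", string.punctuation)
    )
    # Split on whitespace and keep only non-stopword tokens
    tokens = steps["punctuation_removed"].split()
    steps["stopwords_removed"] = " ".join(
        token for token in tokens if token not in DEMO_STOPWORDS
    )
    return steps

for name, value in demo_clean("Look at THIS page: http://example.com, it is great!").items():
    print(f"{name}: {value}")
```

Printing the intermediate strings makes it easier to spot which step is responsible when a cleaned comment looks wrong.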

In [14]:
import re
import string

import nltk

nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('words')

stopword = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()
words = set(nltk.corpus.words.words())

def clean_comment(text):
    """Lowercase, strip links and punctuation, drop stopwords, then lemmatize."""
    text = text.lower()
    # Remove links (the dot after "www" is escaped so it matches a literal dot)
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"www\.\S+", "", text)
    # Delete punctuation characters
    text_no_punct = "".join([char for char in text if char not in string.punctuation])
    # Drop stopwords
    text_cleaned = " ".join([word for word in re.split(r'\W+', text_no_punct)
        if word not in stopword])
    # Lemmatize each remaining token
    text = " ".join([wn.lemmatize(word) for word in re.split(r'\W+', text_cleaned)])
    return text

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/xiaochenma/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/xiaochenma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/xiaochenma/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/xiaochenma/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [15]:
# testing the clean_comment(comment) function using an example comment
comment_cleaned = clean_comment("Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren\'t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27")
print(comment_cleaned)

explanation edits made username hardcore metallica fan reverted werent vandalism closure gas voted new york doll fac please dont remove template talk page since im retired now892053827
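One visible effect in the output above is that `now.89.205.38.27` becomes the single token `now892053827`, and `weren't` becomes `werent`, because punctuation characters are deleted rather than replaced. The sketch below contrasts that behavior with a possible variation that maps punctuation to spaces. This variation is an assumption offered for comparison, not what the notebook does, and it would change tokens such as contractions as well.

```python
import string

text = "retired now.89.205.38.27"

# Deleting punctuation (the behavior of clean_comment) glues neighbors together
deleted = "".join(ch for ch in text if ch not in string.punctuation)

# Alternative sketch: map each punctuation character to a space instead,
# then collapse the resulting runs of whitespace
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(text.translate(to_space).split())

print(deleted)
print(spaced)
```

Whether the fused or the separated tokens are preferable depends on the downstream vectorizer, so this is a trade-off to check rather than a fix.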


In [17]:
# Clean all the comments in the train.csv and test.csv datasets using the clean_comment function
train_df['comment_text'] = train_df['comment_text'].apply(clean_comment)
test_df['comment_text'] = test_df['comment_text'].apply(clean_comment)