---
<a name = Section1></a>
# **1. Introduction**
---

- Being anonymous over the internet can sometimes make people say nasty things that they normally would not in real life. 

- Often, online platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. 

- To combat this issue, the <a href="https://conversationai.github.io/">Conversation AI team</a>, a research initiative founded by <a href="https://jigsaw.google.com/">Jigsaw</a> and Google (both part of <a href="https://abc.xyz/">Alphabet</a>), is working on tools to help improve online conversation. One area of focus is the study of **negative online behaviors**, like **toxic comments** (i.e., comments that are rude, disrespectful, or otherwise likely to make someone leave a discussion). 

<center><img width="50%" src="https://miro.medium.com/v2/resize:fit:679/1*r5OBabkQnYD1D4yzC_kvLQ.gif"></center>

- So far, they have built a range of publicly available models served through the <a href = "https://perspectiveapi.com/">Perspective API</a>, including a toxicity model. However, the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g., some platforms may be fine with profanity, but not with other types of toxic content). 

---
<a name = Section2></a>
# **2. Problem Statement**
---

- In this competition, the task is to build a **multi-headed classification model** that's capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective's   <a href="https://github.com/conversationai/unintended-ml-bias-analysis">current models</a>. 
  
- Suppose you have been assigned this task. How would you approach it? 
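
One way to read "multi-headed" here is as multi-label classification: each of the six labels gets its own sigmoid output and its own binary cross-entropy term, so a single comment can be positive for several labels at once. The sketch below illustrates that idea in plain Python. The logits are made-up numbers rather than model output, and `multi_label_bce` is a hypothetical helper, not part of any library.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(z: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def multi_label_bce(logits, targets) -> float:
    """Mean binary cross-entropy over independent label heads."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Hypothetical logits and targets for one comment (illustrative values only)
logits = [2.0, -3.0, 1.5, -4.0, 0.5, -2.5]
targets = [1, 0, 1, 0, 1, 0]

probs = {name: round(sigmoid(z), 3) for name, z in zip(LABELS, logits)}
loss = multi_label_bce(logits, targets)
print(probs)
print(f"mean BCE: {loss:.4f}")
```

Because every head is scored independently, the probabilities do not need to sum to one, which is the main difference from a softmax over mutually exclusive classes.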

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [1]:
!pip install ydata-profiling                                  # Library to generate basic statistics about data

Collecting scipy<1.11,>=1.4.1 (from ydata-profiling)
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.1
    Uninstalling scipy-1.11.1:
      Successfully uninstalled scipy-1.11.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
momepy 0.6.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
pymc3 3.11.5 requires numpy<1.22.2,>=1.15.0, but you have numpy 1.23.5 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you have scipy 1.10.1 which is incompatible.
Successfully installed scipy-1.10.1


In [2]:
!pip install nltk                                       # Natural Language Toolkit 
!python -m spacy download en_core_web_md                # Spacy NLP 

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
Collecting en-core-web-md==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.6.0/en_core_web_md-3.6.0-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [None]:
!pip install torch-summary                              # Pytorch summary 

In [None]:
!pip install "contractions>0.0.18"     # Resolve contractions, for instance, you're -> you are (quoted so the shell does not treat > as a redirect)

In [None]:
# Vader Sentiment Analyzer 
!pip install vaderSentiment           # Analyze sentiment 

In [None]:
# Language detection in Python 
!pip install langdetect                # Language detection in Python 

In [None]:
# Fasttext language detection
!pip install fasttext-langdetect

In [None]:
# Upgrade pip itself (the eager strategy also upgrades its dependencies) 
!pip install --upgrade --upgrade-strategy eager pip

In [None]:
!pip install wordcloud

In [None]:
## System Version Check 
import sys 
print(f"Python version in this environment: {sys.version}")

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import necessary libraries
import re 
from collections import Counter 
import warnings 
warnings.filterwarnings("ignore")

# Object serialization 
import pickle
import sklearn

# WordCloud 
from wordcloud import WordCloud, STOPWORDS

# Data Visualization 
import matplotlib as mp
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

# Pandas pre-profiling 
from ydata_profiling import ProfileReport 

# Import Natural Language Processing (NLP) libraries
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
from nltk import word_tokenize, sent_tokenize 
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.corpus import stopwords 

# Import Spacy for advanced natural language processing
import spacy

# Fasttext language detection 
# from ftlangdetect import detect

# Contractions 
# import contractions as cm 

# Import langdetect for language detection
# Note: set seed=0 to enforce consistent results (to be done later)
# from langdetect import DetectorFactory, detect 

# Import scikit-learn utilities
from sklearn.preprocessing import FunctionTransformer, LabelEncoder 
from sklearn.pipeline import Pipeline, FeatureUnion  

# Import Spacy tokenizer
from spacy.tokenizer import Tokenizer 

# Import transformers for handling pretrained models
import transformers 

# Import PyTorch for deep learning
import torch 
import torch.nn as nn 
from torch.utils.data import Dataset 
from torch.utils.data import DataLoader 
import torch.optim as optim 
import torch.nn.functional as F 

# Import torchsummary for model summary
# from torchsummary import summary 

# Import tqdm for progress bars
from tqdm import tqdm

# Chi2 test 
from scipy.stats import chi2_contingency

In [None]:
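# Initialize the stemmer, lemmatizer and stop word set 
porter_stem = PorterStemmer()
wordnet_lemma = WordNetLemmatizer()
# Use a new name so that the nltk `stopwords` corpus reader is not overwritten
stop_words = set(stopwords.words("english"))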

In [None]:
!python -m nltk.downloader wordnet
!unzip /root/nltk_data/corpora/wordnet.zip -d /root/nltk_data/corpora/

In [None]:
# Check versions of the third-party packages 
print(f"Sklearn: {sklearn.__version__}")
print(f"Matplotlib: {mp.__version__}")

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

In [None]:
Lemmatizer = WordNetLemmatizer()
print("words :", Lemmatizer.lemmatize("words")) 
print("corpora :", Lemmatizer.lemmatize("corpora")) 
  
# a denotes adjective in "pos" 
print("better :", Lemmatizer.lemmatize("better", pos ="a")) 

In [None]:
from textblob import TextBlob

In [None]:
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
wiki.tags

In [None]:
!python -m textblob.download_corpora

In [None]:
from textblob import Word 
w = Word("Hello there!!!")
w.lemmatize()

In [None]:
!python -m spacy download en_core_web_md

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")

doc = nlp(u"The shimmering azure waters lapped gently against the golden sands, painting a serene picture of tranquility under the midday sun. A gentle breeze carried the faint scent of salt and sea spray, mingling with the crisp, clean air. Seabirds soared overhead, their graceful arcs cutting through the vast expanse of the sky. Along the coastline, a cluster of cottages nestled among verdant trees, their vibrant colors standing out against the backdrop of lush foliage. Laughter and joyous chatter filled the air as families enjoyed their day by the shore, building sandcastles and playing in the surf. Farther out, sailboats dotted the horizon, their billowing sails catching the ocean's whispers, inviting adventurers to explore the endless mysteries hidden beyond the horizon.")

for token in doc:
    print(token.lemma_)

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---


- The dataset consists of comments and their classification into six categories: toxic, severely toxic, obscene, threat, insult, and identity hate.

| Records(train+test)| Features |  Size(total) |
| :--: | :--: | :--: |
| 312735 | 2 | 49.78 MB | 

<br>

| # | Feature Name | Feature Description |
|:--:|:--|:--| 
|1| Id | A unique identifier for a particular comment |
|2| comment_text | Contains comments taken from Wikipedia talk page edit discussions |


| # | Label Name | Label Description |
|:--:|:--|:--| 
|1| toxic| A rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion |
|2| severe_toxic | A very hateful, aggressive, disrespectful comment or otherwise very likely to make a user leave a discussion or give up on sharing their perspective. This attribute is much less sensitive to more mild forms of toxicity, such as comments that include positive uses of curse words. |
|3| obscene | Swear words, curse words, or other obscene or profane language. |
|4| threat | Describes an intention to inflict pain, injury, or violence against an individual or group.|
|5| insult | Insulting, inflammatory, or negative comment towards a person or a group of people. |
|6| identity_hate | Negative or hateful comments targeting someone because of their identity. |

In [None]:
train_path = "../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip"
test_path = "../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip"

In [None]:
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

In [None]:
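def get_file_size(file_path: str) -> int:
    """Return the size of the file in bytes, or 0 if the path cannot be read"""
    try:
        return os.path.getsize(file_path)
    except OSError: 
        print(f"No file exists at the given location: {file_path}")
        return 0  # Keeps the total below computable even if one file is missing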

In [None]:
train_file_memory = get_file_size(train_path)
test_file_memory = get_file_size(test_path)
total_memory = (train_file_memory+test_file_memory)

# Print the total memory usage
print(f'Total File Size: {total_memory / (1024 * 1024):.2f} MB')

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **information about the data** and see some observations.

In [None]:
train_df.describe()

**Observations**
- The **mean** of every label is well below 0.5, which indicates an **imbalanced dataset**: positive examples are far rarer than negative ones 
- All labels are **binary**, i.e., either 0 or 1
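
Because the labels are 0/1, the column mean reported by `describe()` is simply the share of positive rows. The sketch below makes that link explicit and also derives a negatives-per-positive ratio, which is one common (but not the only) way to weight the positive class later. The rows are made-up values, and both helper names are hypothetical.

```python
# Toy label rows (made-up values, not taken from the dataset)
rows = [
    {"toxic": 1, "threat": 0},
    {"toxic": 0, "threat": 0},
    {"toxic": 0, "threat": 0},
    {"toxic": 1, "threat": 1},
    {"toxic": 0, "threat": 0},
    {"toxic": 0, "threat": 0},
    {"toxic": 0, "threat": 0},
    {"toxic": 0, "threat": 0},
]

def label_prevalence(rows, label):
    """For a 0/1 label, the mean equals the share of positive rows."""
    return sum(r[label] for r in rows) / len(rows)

def neg_pos_ratio(rows, label):
    """Negatives per positive; one common choice for a positive-class weight."""
    pos = sum(r[label] for r in rows)
    neg = len(rows) - pos
    return neg / pos if pos else float("inf")

prevalence = {lab: label_prevalence(rows, lab) for lab in ("toxic", "threat")}
weights = {lab: neg_pos_ratio(rows, lab) for lab in ("toxic", "threat")}
print(prevalence)
print(weights)
```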

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

In [None]:
train_df.info()

In [None]:
def prelim_clean(df): 
  """Drops duplicate rows and duplicate columns"""
  df = df.drop_duplicates() # Drop duplicate rows, if any 
  df = df.loc[:, ~df.T.duplicated()] # Drop duplicate columns while keeping the original dtypes
  return df 

def prelim_inspection(df):
  """Inspects the first few rows/columns of data""" 
  display(df.head()) # Look at the data 
  display(df.shape)  # Look at the shape of the data
  display(df.iloc[:5, :5].dtypes)  # Look at the data types of the first five columns only, to keep the output short
  display(df.isna().any())  # Flag the columns that contain missing values
  display(df.describe(percentiles=[0.25,0.5,0.75,0.85,0.95,0.99]))
        
def null_values(df):
  """Checks for null values within the dataset"""
  missing_vals = df.isnull().sum().sort_values(ascending=False)
  missing_perc = ((df.isnull().sum())/len(df) * 100).sort_values(ascending=False)
  return missing_vals, missing_perc 

In [None]:
from time import time 
t1 = time()
train_df = prelim_clean(train_df)
prelim_inspection(train_df)
null_values(train_df)
print(f"Total time taken for cell to run: {time()-t1}s")

In [None]:
t2 = time()
test_df = prelim_clean(test_df)
prelim_inspection(test_df)
null_values(test_df)
print(f"Total time taken for cell to run: {time()-t2}s")

**Observation(s)**
- The **comment_text** column is the only **direct feature**. Other **discriminatory features** can be derived from it
- There are no **null values** in the **comment_text** column 
- There seem to be no **duplicate** rows in the dataset, as its size remains the same as before
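
As a rough illustration of what "derived features" could look like, the sketch below computes a few surface-level statistics from a single comment. The feature names and the choice of features are assumptions made for illustration; they are not features the notebook has committed to.

```python
import re

def derive_features(comment: str) -> dict:
    """Compute simple surface features from one raw comment."""
    letters = [ch for ch in comment if ch.isalpha()]
    upper = sum(ch.isupper() for ch in letters)
    return {
        "n_chars": len(comment),
        "n_words": len(comment.split()),
        # Share of letters that are uppercase (a crude proxy for "shouting")
        "upper_ratio": upper / len(letters) if letters else 0.0,
        "n_exclaim": comment.count("!"),
        # Number of runs where one character repeats three or more times
        "n_repeats": len(re.findall(r"(.)\1{2,}", comment)),
    }

features = derive_features("STOP it!!! You are sooo wrong")
print(features)
```

With a DataFrame, the same function could be applied row by row, for example through `train_df["comment_text"].apply(derive_features)`.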

<a name = Section5></a>

---
# **5. Data Pre-Profiling**
---
- For **quick analysis**, `ydata-profiling` (formerly pandas profiling) is very handy.

- It generates profile reports from a pandas DataFrame.

- For each column, **statistics** are presented in an interactive HTML report.

In [None]:
profile = ProfileReport(train_df, title="Profiling Report", html={"style": {"full_width": True}})
profile.to_file(output_file='Pre Profiling Report.html')
print('Accomplished!')

**Observation(s)** 
- There seems to be a strong relationship between 
    - toxic with **obscene** and **insult**
    - obscene with **insult**
    - However, the correlation values are calculated using the Pearson method, which assumes a linear relationship between continuous variables. To find a pattern between two categorical variables, we can use other tools like
        - Confusion Matrix/Crosstab 
        - Cramér's V statistic 
            - Cramér's V is derived from the chi-square statistic and also measures the strength of the association 
- All labels are severely imbalanced 
    - toxic: 54.4%
    - severe_toxic: 91.9% 
    - obscene: 70.1% 
    - threat: 97.1% 
    - insult: 71.6% 
    - identity_hate: 92.7% 
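
To make the crosstab idea concrete, here is a minimal sketch that counts how often each pair of values occurs for two binary labels. The label values are made up, and `crosstab_2x2` is a hypothetical helper that mimics what `pd.crosstab` returns for two 0/1 columns.

```python
from collections import Counter

# Toy binary labels (made-up values for illustration)
toxic   = [1, 1, 0, 0, 1, 0, 0, 0]
obscene = [1, 0, 0, 0, 1, 0, 0, 1]

def crosstab_2x2(a, b):
    """Return a 2x2 contingency table as nested lists: table[a_value][b_value]."""
    counts = Counter(zip(a, b))
    return [[counts[(i, j)] for j in (0, 1)] for i in (0, 1)]

table = crosstab_2x2(toxic, obscene)
for i, row in enumerate(table):
    print(f"toxic={i}: obscene=0 -> {row[0]}, obscene=1 -> {row[1]}")
```

Large counts on the diagonal relative to the off-diagonal cells suggest that the two labels tend to agree.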

<a name = Section6></a>

---
# **5. Data Cleaning**
---
- In this section, we will perform the **cleaning** operations over the features using information from the previous section.

- As a part of this project, we will employ data cleaning techniques such as keeping only English text and removing special characters from the comment text. Other useful linguistic features, such as **n-grams, text length, keywords, and topics**, are also worth exploring.

- To Do 
    - Add unit tests 
    - Add assertions and further `isinstance()` checks 
- The order of the cleaning steps is also important

- To Remove / Normalize
    - All numbers 
    - Extra dots left behind once the numbers are removed 
        - E.g., an IP address such as 89.205.38.27 that follows a word 
    - Extra spaces and backslashes 
    - Extra "\n" 
    - Non-breaking spaces (nbsp) 
    - Standard English stopwords 
    - Convert the text to lowercase 
    - Jot down any further cleaning ideas, and how they were handled, as they come up
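
The checklist above can be sketched as an ordered list of regular-expression substitutions. This is a minimal illustration under stated assumptions rather than the cleaning function used later in the notebook: the step list, the patterns, and the name `basic_clean` are all hypothetical, and stopword removal is left out to keep the example dependency-free.

```python
import re

# Each step is (description, compiled pattern, replacement); the order matters.
# For example, the IP pattern must run before the generic digit removal,
# otherwise there are no digits left for it to match.
CLEANING_STEPS = [
    ("non-breaking spaces", re.compile(r"\xa0|&nbsp;?"), " "),
    ("IP addresses",        re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), " "),
    ("remaining digits",    re.compile(r"\d+"), " "),
    ("runs of dots",        re.compile(r"\.{2,}"), " "),
    ("backslashes",         re.compile(r"\\+"), " "),
    ("whitespace runs",     re.compile(r"\s+"), " "),
]

def basic_clean(text: str) -> str:
    """Apply every cleaning step in order, then trim and lowercase."""
    for _, pattern, replacement in CLEANING_STEPS:
        text = pattern.sub(replacement, text)
    return text.strip().lower()

cleaned = basic_clean("Hello\n\nWorld... from 89.205.38.27 \\ in&nbsp;2007")
print(cleaned)
```

Keeping the steps in a list makes the order explicit and makes it easy to unit test each pattern on its own.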
  

In [None]:
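# English stopwords (accessed through nltk.corpus so that this works even if the name `stopwords` was rebound earlier)
eng_stopwords = set(nltk.corpus.stopwords.words("english"))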

In [None]:
# Load a Spacy Pipeline 
nlp = spacy.load("en_core_web_md")

In [None]:
# Function to detect language using fasttext-langdetect (installed above)
## Check if I have to perform sentence level tokenization first 
from ftlangdetect import detect

def detect_language(text: str) -> str:
    """
    Detect the language of the given text using the fasttext-langdetect library.

    Parameters:
    - text (str): The input text for language detection.

    Returns:
    - str: The detected language code (e.g., 'en' for English).
           If language detection fails, returns 'unknown'.
    """
    try:
        # fastText predicts one line at a time, so newlines are replaced first
        return detect(text=text.replace("\n", " "))["lang"]
    except Exception:
        # Return 'unknown' if language detection fails
        return "unknown"
    
# Capture the hashtags and/or usertags if any 
def extract_hashtags(text: str) -> list:
    """ Returns all hashtags (e.g. #topic) found in the text"""
    hashtags_ls = re.findall(r"#\w+", text)
    return hashtags_ls

# Clean comment text 
import contractions as cm  # Installed above; needed for cm.fix()

def clean_text(
        text, words=True, stops=True, urls=True, tags=True, 
        newLine=True, ellipsis=True, special_chars=True, condensed=True, non_breaking_space=True, 
        character_encodings=True, stopwords=True, only_words=True) -> str:
    
    """ Clean a comment after extracting all hashtags and username tags
    Not comprehensive enough to capture all idiosyncrasies, but works most of the time
    """
    
    # Remove all digits so that only words remain
    if words:
        text = re.sub(r"\d", "", text)
        
    # Remove runs of two or more full stops 
    if stops: 
        text = re.sub(r"\.{2,}", "", text)
    
    # Remove URLs (note: this loose pattern also removes any "word.word" sequence)
    if urls:
        pattern = r"(https?:)?/*(www\.)?\w+\.\w+/*\w*"
        text = re.sub(pattern, "", text)
        
    # Remove user tags 
    if tags:
        text = re.sub(r"@\w+", "", text)
    
    # Replace one or more newlines with a single space so that adjacent words do not merge
    if newLine:
        text = re.sub(r"\n+", " ", text)
        
    # Fix contractions
    if condensed:
        try:
            text = cm.fix(text)
        except Exception: 
            print(text)  # Show the comment that could not be expanded
            
    # Remove any remaining ellipses (same pattern as the full-stop step)
    if ellipsis:
        text = re.sub(r"\.{2,}", "", text)
        
    # Replace the special characters %, ^, *, -, _, +, =, |, \, /, ? with a space
    if special_chars:
        spec_char_list = ['%', '^', '*', '-', '_', '+', '=', '|', '\\', '/', '?']
        text = "".join(" " if ch in spec_char_list else ch for ch in text).strip()
        
    # Remove mojibake left over from mis-decoded characters (longest sequences first)
    if character_encodings:
        pattern = r"€˜|€™|â|€|¦"
        text = re.sub(pattern, "", text)
        
    # Replace non-breaking spaces with a regular space 
    if non_breaking_space: 
        text = re.sub(r"\xa0|&nbsp;?", " ", text)
        
    # Remove stopwords
    if stopwords:
        tokens = word_tokenize(text)
        filtered_words = [word for word in tokens if word.lower() not in eng_stopwords]
        text = " ".join(filtered_words).strip()
        
    # Keep only word characters, newlines and full stops
    if only_words:
        text = re.sub(r"[^\w\n\.]+", " ", text)
        text = text.strip()
        
    # Limiting the comment length is deferred to a later stage
    return text

**TODO** 
- Add every preprocessing function to the spaCy NLP pipeline in the next version
- Convert everything to lowercase and collapse repeated occurrences of the same character 
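
A minimal sketch of the second TODO item, assuming that capping every run of one character at two copies is an acceptable rule (so "good" survives while "sooooo" becomes "soo"). The function name and the `max_repeat` parameter are hypothetical.

```python
import re

def normalize_case_and_repeats(text: str, max_repeat: int = 2) -> str:
    """Lowercase the text and cap any run of one character at max_repeat copies."""
    # (.)\1{n,} matches a character followed by at least n more copies of itself
    pattern = re.compile(r"(.)\1{%d,}" % max_repeat)
    return pattern.sub(lambda m: m.group(1) * max_repeat, text.lower())

normalized = normalize_case_and_repeats("You are SOOOOO wrong!!!!")
print(normalized)
```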

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")

doc = nlp(u"The shimmering azure waters lapped gently against the golden sands, painting a serene picture of tranquility under the midday sun. A gentle breeze carried the faint scent of salt and sea spray, mingling with the crisp, clean air. Seabirds soared overhead, their graceful arcs cutting through the vast expanse of the sky. Along the coastline, a cluster of cottages nestled among verdant trees, their vibrant colors standing out against the backdrop of lush foliage. Laughter and joyous chatter filled the air as families enjoyed their day by the shore, building sandcastles and playing in the surf. Farther out, sailboats dotted the horizon, their billowing sails catching the ocean's whispers, inviting adventurers to explore the endless mysteries hidden beyond the horizon.")

for token in doc:
    print(token.lemma_)

In [None]:
# Disabled: `my_tokenizer` and `text["inbound"]` are not defined anywhere in this notebook
# processed_inbound = my_tokenizer(text["inbound"])

In [None]:
test_df["comment_text"][50]

In [None]:
clean_text(test_df["comment_text"][50])

In [None]:
print(detect_language("I am a guy."))

In [None]:
# Tried applying language identification earlier; however, most of the comments were going unidentified 
# train_df["lang"] = train_df["comment_text"].apply(detect_language)
# test_df["lang"] = test_df["comment_text"].apply(detect_language)
# train_df.sample(20)

In [None]:
train_df["hashtags"] = train_df["comment_text"].apply(extract_hashtags)
test_df["hashtags"] = test_df["comment_text"].apply(extract_hashtags)

In [None]:
train_df["hashtags"].sample(30)

In [None]:
train_df["comment_text"] = train_df["comment_text"].apply(clean_text)

In [None]:
test_df["comment_text"] = test_df["comment_text"].apply(clean_text)

In [None]:
# Serialize train 
# Serialize test 

In [None]:
train_df["lang"] = train_df["comment_text"].apply(detect_language)
test_df["lang"] = test_df["comment_text"].apply(detect_language)

In [None]:
# Select only those rows where the comment language is in "English"
train_df = train_df[train_df["lang"] == "en"]
test_df = test_df[test_df["lang"] == "en"]

In [None]:
# Again check the dataset sizes 
print(train_df.shape)
print(test_df.shape)

In [None]:
# Serialize the dataframes so that the entire preprocessing step need not be run again 
# Saving all my results
with open('train.pkl', 'wb') as handle:
    pickle.dump(train_df, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('test.pkl', 'wb') as handle:
    pickle.dump(test_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---

**Univariate**
- Association between labels (Correlation) 
- Distribution of comment lengths 
- Understanding the topics behind "toxic" comments 
    - https://www.kaggle.com/code/jagangupta/understanding-the-topic-of-toxicity
- Extracting **syntactic features** in text
- Treating data imbalance 
    - Deep Learning Models 
- Metric Evaluation (Mean AUC score) 
- Clustering of topics 
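
For the "Mean AUC score" item, the sketch below computes ROC AUC for each label from its pairwise-ranking definition and then averages over the labels. It is a plain-Python illustration with made-up labels and scores; the quadratic pairwise loop is only suitable for small inputs, and on real data a library routine such as `sklearn.metrics.roc_auc_score` would be the practical choice.

```python
def roc_auc(y_true, y_score) -> float:
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    # A tie between a positive and a negative counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_columnwise_auc(y_true_by_label, y_score_by_label) -> float:
    """Average the per-label AUC values."""
    aucs = [roc_auc(y_true_by_label[k], y_score_by_label[k]) for k in y_true_by_label]
    return sum(aucs) / len(aucs)

# Made-up labels and scores for two of the six labels
y_true = {"toxic": [1, 0, 1, 0], "insult": [0, 0, 1, 1]}
y_score = {"toxic": [0.9, 0.2, 0.4, 0.6], "insult": [0.1, 0.3, 0.8, 0.7]}

toxic_auc = roc_auc(y_true["toxic"], y_score["toxic"])
mean_auc = mean_columnwise_auc(y_true, y_score)
print(toxic_auc, mean_auc)
```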

In [None]:
# Load serialized data 
dtypes = {
        "toxic": "uint8", 
        "severe_toxic": "uint8", 
        "obscene": "uint8", 
        "threat": "uint8", 
        "insult": "uint8", 
        "identity_hate": "uint8"
}
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
train_df = pd.read_pickle("/kaggle/input/train-test-data/train.pkl")
test_df = pd.read_pickle("/kaggle/input/train-test-data/test.pkl")
train_df = train_df.astype(dtypes)

In [None]:
train_df.columns

In [None]:
train_df.head(10)

In [None]:
test_df.head(10)

**Observations** 
- Convert the comment_text to lowercase in iteration 2 
    - Though keep the names and self-referencing pronouns as they are 
    - Names could even be converted to lowercase as well 
        - Then check whether the models are sensitive to this 

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/style.html
def highlight_min(data, color='coral'):
    '''
    Highlight the minimum in a Series or DataFrame
    '''
    # Define the CSS attribute for background color based on the provided color
    attr = 'background-color: {}'.format(color)
    
    # Check if the input data is a 1-dimensional Series (from .apply(axis=0) or axis=1)
    if data.ndim == 1:
        # For a Series
        # Find positions where the value is equal to the minimum value
        is_min = data == data.min()
        
        # Construct a list comprehension to apply the background color attribute
        # to the positions of the minimum value, return an empty string otherwise
        return [attr if v else '' for v in is_min]
    else:
        # For a DataFrame (from .apply(axis=None))
        # Find positions where the value is equal to the minimum value across all columns and rows
        is_min = data == data.min().min()
        
        # Create a DataFrame where cells corresponding to the minimum value
        # receive the background color attribute, and the rest are empty strings
        return pd.DataFrame(np.where(is_min, attr, ''),
                            index=data.index, columns=data.columns)

In [None]:
# Extract dataframe consisting of labels only (selected by name rather than by column position) 
label_df = train_df[label_cols]
label_df.head()

#### **Question 1**: What is the relative distribution of binary values of all labels in the dataset?

**Note**
- Beautify the plot later

In [None]:
# Convert the labels to long format: one (Label, Value) row per comment and label 
pos_label_df = label_df.melt(var_name="Label", value_name="Value")
print(pos_label_df.sample(20))

In [None]:
sns.countplot(data=pos_label_df, y="Label", hue="Value", palette="Set2")
plt.legend(loc="upper right")

#### **Question 2:** What is the relationship between the distributions of various labels with respect to each other?

**Note**
- If there exists a strong relationship between any two labels, then one of them could potentially be dropped 

In [None]:
print(label_df["toxic"].dtype)

In [None]:
print(label_df.corr())

In [None]:
# Crosstab: Since a crosstab between all six classes cannot be visualized, let's take a look 
# at toxic with other tags 
main_col="toxic"
corr_mats=[]
for other_col in label_df.columns[1: ]:
    confusion_matrix = pd.crosstab(index=label_df[main_col], columns=label_df[other_col])
    corr_mats.append(confusion_matrix)
out = pd.concat(corr_mats,axis=1,keys=label_df.columns[1:])

#cell highlighting
out = out.style.apply(highlight_min,axis=0)
out

**Observations**: 
- All **severe_toxic** comments are also labelled **toxic**, i.e., they form a subset of the toxic comments 
- Almost all of the remaining comments marked **obscene**, **threat**, **insult** or **identity_hate** are also **toxic**
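
The subset claim can be checked directly by asking, for each label, what share of its positive rows is also toxic; a share of exactly 1.0 means the label is a strict subset of toxic. The rows below are made-up values and `share_also_toxic` is a hypothetical helper.

```python
# Toy rows (made-up values): (toxic, severe_toxic, obscene)
rows = [(1, 1, 1), (1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 0, 0)]

def share_also_toxic(rows, col):
    """Among rows where column `col` is 1, the share that is also toxic (column 0)."""
    flagged = [r for r in rows if r[col] == 1]
    return sum(r[0] for r in flagged) / len(flagged)

severe_share = share_also_toxic(rows, 1)
obscene_share = share_also_toxic(rows, 2)
print(severe_share, obscene_share)
```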

In [None]:
# Compute the pairwise correlation between the labels
corr = label_df.corr()

# Mask the upper triangle (including the diagonal), since the matrix is symmetric
mask = np.triu(np.ones_like(corr, dtype=bool))

# Create a heatmap
plt.figure(figsize=(10, 8))
plt.title('Heatmap of data labels')

sns.heatmap(corr, mask=mask, cmap="coolwarm", xticklabels=corr.columns.values, yticklabels=corr.columns.values, annot=True, fmt=".2f")

**Observations**
- There exists a relatively **stronger relationship** amongst **toxicity, obscenity and insult**
- Since these are binary labels and we have used **Pearson's correlation** to compute the above values (a method better suited to continuous-valued features), we could potentially be getting spurious results
    - To check this, we will use **Cramér's V** to verify the above results, as it is better suited to categorical values/labels 

In [None]:
def cramers_V(var1, var2):
    """
    Calculate Cramér's V statistic for the association between two categorical variables.
    
    Cramér's V is derived from the chi-squared test for independence and 
    indicates the strength of association between two categorical variables.
    
    Refer to this link for further details: 
    https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
    
    Parameters:
    ----------
    var1, var2 : array-like
        Two categorical variables (e.g., arrays or Pandas Series).

    Returns:
    -------
    float
        Cramér's V statistic representing the strength of association 
        between var1 and var2. Values range from 0 to 1, where 0 indicates 
        no association and 1 indicates a perfect association.

    """
    # Build a contingency table (cross-tabulation) between var1 and var2
    crosstab = np.array(pd.crosstab(var1, var2, rownames=None, colnames=None))
    
    # Calculate the chi-squared statistic and retrieve the test statistic from chi2_contingency
    stat = chi2_contingency(crosstab)[0]
    
    # Compute the total number of observations in the contingency table
    obs = np.sum(crosstab)
    
    # Determine the minimum value between the number of rows and columns of the contingency table
    mini = min(crosstab.shape) - 1
    
    # Calculate Cramér's V statistic using the formula
    return np.sqrt(stat / (obs * mini))

In [None]:
## Calculate Cramér's V statistic 
print(f"Toxicity and Obscenity: {cramers_V(label_df['toxic'], label_df['obscene']):.2f}")
print(f"Toxicity and Insult: {cramers_V(label_df['toxic'], label_df['insult']):.2f}")
print(f"Obscenity and Insult: {cramers_V(label_df['obscene'], label_df['insult']):.2f}")

**Observations**
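- Thus, the correlation results returned by Pearson's method are in fact trustworthy. This is expected: for two binary variables, Pearson's correlation is the phi coefficient, and Cramér's V equals its absolute value (up to the continuity correction that `chi2_contingency` applies to 2x2 tables by default) 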
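
The agreement between the two measures can be verified on a single 2x2 table. The sketch below computes the phi coefficient (Pearson's correlation for two binary variables) and Cramér's V from the uncorrected chi-squared statistic, using made-up counts; both function names are hypothetical. Note that `cramers_V` above calls `chi2_contingency` with its defaults, which apply Yates' continuity correction to 2x2 tables, so its values will be slightly smaller unless `correction=False` is passed.

```python
import math

def phi_coefficient(table):
    """Pearson correlation of two binary variables from a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

def cramers_v_2x2(table):
    """Cramér's V from the uncorrected chi-squared statistic of a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.sqrt(chi2 / n)  # min(rows, cols) - 1 == 1 for a 2x2 table

table = [[40, 10], [5, 45]]  # made-up counts
phi = phi_coefficient(table)
v = cramers_v_2x2(table)
print(f"phi = {phi:.4f}, V = {v:.4f}")
```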

#### **Question 3:** What is the distribution of the most common words/phrases?

In [None]:
train_df[train_df.severe_toxic==1]

##### Some Example Comments 

In [None]:
print("Severely Toxic: ")
print(train_df[train_df["severe_toxic"]==1].iloc[3,1])
print(train_df[train_df["severe_toxic"]==1].iloc[5,1])

In [None]:
print("Obscene: ")
print(train_df[train_df["obscene"]==1].iloc[3,1])
print(train_df[train_df["obscene"]==1].iloc[5,1])

In [None]:
print("Threat: ")
print(train_df[train_df["threat"]==1].iloc[3,1])
print(train_df[train_df["threat"]==1].iloc[5,1])

In [None]:
print("Insult: ")
print(train_df[train_df["insult"]==1].iloc[3,1])
print(train_df[train_df["insult"]==1].iloc[5,1])

In [None]:
print("Identity Hate: ")
print(train_df[train_df["identity_hate"]==1].iloc[3,1])
print(train_df[train_df["identity_hate"]==1].iloc[5,1])

**Observations** 
- At first glance, it seems that there is hardly any difference between different types of comments
- Note
    - Words and letters are still in uppercase. Fix it if required 
- We will use word clouds to better understand our data
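
A word cloud is essentially a picture of word frequencies, so it can help to look at the underlying counts directly. The sketch below counts tokens with `collections.Counter`. The comments and the tiny stop list are made up for illustration, and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

STOP = {"you", "are", "a", "the", "is", "this", "and"}  # tiny illustrative stop list

def top_words(comments, n=3):
    """Count lowercase word tokens across comments, skipping stop words."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z']+", comment.lower())
        counts.update(t for t in tokens if t not in STOP)
    return counts.most_common(n)

# Made-up, mild example comments
comments = ["You are a fool", "This is the work of a fool", "Nonsense and more nonsense"]
top = top_words(comments)
print(top)
```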

In [None]:
!ls ../input/imagesforkernal/

In [None]:
text = " ".join(train_df["comment_text"])

In [None]:
# Toxic Comments 
sub_comment_df = train_df[(train_df["toxic"]==1) | (train_df["severe_toxic"]==1)]["comment_text"]
text = " ".join(sub_comment_df)  # Join with a space so that words from adjacent comments do not merge
wc = WordCloud(background_color="black", max_words=2000, stopwords=eng_stopwords)
wc.generate(text)
plt.figure(figsize=(9, 5))
plt.axis("off")
# plt.title("Words frequented in clean toxic comments", fontsize=20)
plt.imshow(wc.recolor(colormap='viridis' , random_state=17), alpha=0.98)

In [None]:
# Obscene Comments 
sub_comment_df = train_df[(train_df["obscene"]==1)]["comment_text"]
text = " ".join(sub_comment_df)  # Join with a space so that words from adjacent comments do not merge
wc = WordCloud(background_color="black", max_words=2000, stopwords=eng_stopwords)
wc.generate(text)
plt.figure(figsize=(9, 5))
plt.axis("off")
# plt.title("Words frequented in clean obscene comments", fontsize=20)
plt.imshow(wc.recolor(colormap='viridis' , random_state=17), alpha=0.98)

In [None]:
# Threat Comments 
sub_comment_df = train_df[(train_df["threat"]==1)]["comment_text"]
text = " ".join(sub_comment_df)  # Join with a space so that words from adjacent comments do not merge
wc = WordCloud(background_color="black", max_words=2000, stopwords=eng_stopwords)
wc.generate(text)
plt.figure(figsize=(9, 5))
plt.axis("off")
# plt.title("Words frequented in clean threat comments", fontsize=20)
plt.imshow(wc.recolor(colormap='viridis' , random_state=17), alpha=0.98)

In [None]:
# Identity hate based comments 
sub_comment_df = train_df[(train_df["identity_hate"]==1)]["comment_text"]
text = " ".join(sub_comment_df)  # join with a space so words from adjacent comments do not merge
wc = WordCloud(background_color="black", max_words=2000, stopwords=eng_stopwords)
wc.generate(text)
plt.figure(figsize=(9, 5))
plt.axis("off")
# plt.title("Words frequented in clean identity hate comments", fontsize=20)
plt.imshow(wc.recolor(colormap='viridis' , random_state=17), alpha=0.98)

**Observations** 
- Enter your observations 

#### **Question 4:** Distribution of topics within the dataset

## Topic Modeling 
- Topic modeling can be a useful tool for summarizing a huge text corpus by inferring the "topic", or general theme, of each document.
- The topics can also be used as inputs to our classifier if they capture patterns that indicate toxicity.
- The following steps are involved in the process:
    - Preprocessing 
        - Tokenize (split the documents into tokens) 
        - Lemmatize the tokens 
        - Compute bigrams 
        - Compute a bag-of-words representation of the data 
    - Creation of the dictionary (list of all words in the cleaned text) 
    - Topic modeling using LDA 
    - Visualization with pyLDAvis
    - Convert topics to sparse vectors 
    - Feed sparse vectors to the model 
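
The dictionary and bag-of-words steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `build_vocabulary` and `to_bow` and the toy documents are hypothetical, and a library dictionary such as gensim's may assign different ids to the same tokens.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc: List[str], vocab: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return sorted (token_id, count) pairs for the tokens of `doc` found in `vocab`."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Hypothetical, already tokenized toy documents
toy_docs = [["edit", "war", "edit"], ["talk", "page", "edit"]]
toy_vocab = build_vocabulary(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_bow)
```

Each document becomes a short list of `(token_id, count)` pairs, which is the sparse form that a topic model consumes.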


In [None]:
# Accumulate all comments in a list 
doc_ls, processed_doc_ls = [], []

for com_text in train_df["comment_text"]: 
    doc_ls.append(com_text)

In [None]:
doc_ls[0: 3]

In [None]:
train_df

In [None]:
train_df["comment_length"] = train_df["comment_text"].apply(len)
test_df["comment_length"] = test_df["comment_text"].apply(len)

In [None]:
# Called without arguments, this opens the interactive NLTK downloader
nltk.download()

In [None]:
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()

In [None]:
nltk.download('wordnet')

In [None]:
def lemmatize(word: str, pos: str = "n") -> str:
    """Lemmatize `word` using WordNet's built-in morphy function.

    Returns the input word unchanged if it cannot be found in WordNet.

    :param word: The input word to lemmatize.
    :type word: str
    :param pos: The part-of-speech tag. Valid options are `"n"` for nouns,
        `"v"` for verbs, `"a"` for adjectives, `"r"` for adverbs and `"s"`
        for satellite adjectives.
    :type pos: str
    :return: The lemma of `word`, for the given `pos`.
    """
    # `_morphy` is a private WordNet helper, so it may change between NLTK versions
    lemmas = wn._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word

In [None]:
[lemmatize(str(word), "n") for word in nltk.word_tokenize("I am a boy with a big fat butt!!!")]

In [None]:
[porter_stem.stem(word) for word in nltk.word_tokenize("I am a boy with a big fat butt!!!")]

In [None]:
# # Load the Spacy English language 
# from spacy.lang.en import English 

# # Load English tokenizer, tagger, Parser and NER 
# nlp = English()

# # Custom pipeline components for lemmatization and adding bigrams 
# processed_docs = [], []

# # Add lemmatizer to the Spacy pipeline 
# lemmatizer = nlp.add_pipe("lemmatizer")

# # Initialize the SpaCy pipeline to load the required data
# nlp.initialize()


# for doc in lemmatizer.pipe(doc_ls):
#     print(doc)
#     break
    
# Store in Processed docs 
from typing import List, Dict
def tokenizer_lemmatize(text: str) -> List[str]: 
    tokens = nltk.word_tokenize(text)
    return [wordnet_lemma.lemmatize(token) for token in tokens]

#### Check whether de-accenting has been added to the preprocessing step 
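
As a quick reference for the de-accenting check, here is a minimal sketch using only the standard library. The function name `deaccent` is hypothetical, and this approach only strips combining accent marks; a transliteration library such as `unidecode` goes further (for example, it also rewrites characters that have no accent-free base form).

```python
import unicodedata


def deaccent(text: str) -> str:
    """Strip combining accent marks from `text`, leaving base characters intact."""
    # NFKD splits an accented character into its base character plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(deaccent("naïve café, señor"))
```

Running the same function over a sample of raw comments and comparing the result with the input is one way to see whether accented characters are still present after preprocessing.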

In [None]:
train_df["tokens"] = train_df["comment_text"].apply(tokenizer_lemmatize)

In [None]:
test_df["tokens"] = test_df["comment_text"].apply(tokenizer_lemmatize) 

In [None]:
# Convert into lookup tables within the dictionary using doc2bow 
# print(dictionary.doc2bow(all))

In [None]:
from tqdm import tqdm
tqdm.pandas(desc="Processing")

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)

In [None]:
# Keep tok2vec, tagger and attribute_ruler: token.pos_ and the lemmatizer depend on them
nlp = spacy.load("en_core_web_md", exclude=["parser", "ner"])

In [None]:
from unidecode import unidecode

def lemmatize(text: str, min_len: int = 3) -> List[str]:
    """Return de-accented lemmas of `text`, skipping punctuation, function words and short lemmas."""
    doc = nlp(text)
    excluded_pos = ("PRON", "AUX", "ADP", "CCONJ")

    # Keep lemmas of at least min_len characters and strip accents with unidecode
    lemmatized_text = [
        unidecode(token.lemma_)
        for token in doc
        if not token.is_punct
        and token.pos_ not in excluded_pos
        and len(token.lemma_) >= min_len
    ]

    return lemmatized_text

In [None]:
# Split a sentence into words
def preprocess(comment):
    """
    Function to build tokenized texts from input comment
    """
    return gensim.utils.simple_preprocess(comment, deacc=True, min_len=3)

In [None]:
lemmatize("Hey man I am really not trying to edit war . It is just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page . He seems to care more about the formatting than the actual info .")

In [None]:
# Check for extracting lemma based on POS 
# def lemmatize(text: str) -> str: 
#     doc = nlp(text)
    
#     # Extract lemmatized text and store it back into a string
#     # lemmatized_text = " ".join([token.lemma_ for token in doc])
    
#     return [token.lemma_ for token in doc]

In [None]:
import os
# import multiprocessing

# Using os module
cpu_count_os = os.cpu_count()
print(f"CPU count: {cpu_count_os}")

In [None]:
!pip install -U pandas # upgrade pandas
!pip install swifter # first time installation
!pip install swifter[notebook] # first time installation including dependency for rich progress bar in jupyter notebooks
!pip install swifter[groupby] # first time installation including dependency for groupby.apply functionality

In [None]:
train_df["tokenized_text"] = train_df["comment_text"].progress_apply(lemmatize)

In [None]:
test_df["tokenized_text"] = test_df["comment_text"].progress_apply(lemmatize)

In [None]:
train_df = train_df.reset_index()
test_df = test_df.reset_index()

In [None]:
train_df = train_df.drop("index", axis=1)
test_df = test_df.drop("index", axis=1)

In [None]:
test_df

In [None]:
with open('train_v2.pkl', 'wb') as handle:
    pickle.dump(train_df, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('test_v2.pkl', 'wb') as handle:
    pickle.dump(test_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Serialize both the dataframes 

In [None]:
# corpus_token_ls = list(train_df["tokenized_text"]) 
# corpus_token_ls.append(list(test_df["tokenized_text"]))
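# Series.append was removed in pandas 2.0; pd.concat is the replacement (assumes pandas is imported as pd)
all_tokenized_text = pd.concat([train_df["tokenized_text"], test_df["tokenized_text"]])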
# Phrases help us group together bigrams :  new + york --> new_york
# bigram = gensim.models.Phrases(corpus_token_ls)

In [None]:
all_tokenized_text = all_tokenized_text.reset_index()

In [None]:
all_tokenized_text = all_tokenized_text.drop("index", axis=1)

In [None]:
all_tokenized_text

In [None]:
# Phrases help us group together bigrams :  new + york --> new_york
bigram = gensim.models.Phrases(all_tokenized_text["tokenized_text"].tolist())

In [None]:
type(all_tokenized_text)

In [None]:
bigram[all_tokenized_text["tokenized_text"].iloc[32]]

In [None]:
from gensim.corpora import Dictionary

In [None]:
# Create the dictionary
dictionary = Dictionary(all_tokenized_text["tokenized_text"])
print("There are", len(dictionary), "unique words in the final dictionary")

In [None]:
iter(dictionary)

In [None]:
#convert into lookup tuples within the dictionary using doc2bow
print(dictionary.doc2bow((all_tokenized_text["tokenized_text"].iloc[2000])))
print("Wordlist from the sentence:", all_tokenized_text["tokenized_text"].iloc[2000])

#to check
print("Wordlist from the dictionary lookup:", 
      dictionary[21], dictionary[22], dictionary[23], dictionary[24], dictionary[25], dictionary[26], dictionary[27])

In [None]:
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
import time

In [None]:
#scale it to all text
start_time = time.time()
corpus = [dictionary.doc2bow(text) for text in all_tokenized_text["tokenized_text"]]
end_corpus = time.time()
print("Time till corpus creation:", end_corpus - start_time,"s")

In [None]:
#create the LDA model
start_lda = time.time()
ldamodel = LdaModel(corpus=corpus, num_topics=15, id2word=dictionary)
end_lda = time.time()
print("Time till LDA model creation:",end_lda-start_lda,"s")

In [None]:
!pip install pyLDAvis

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models  # older pyLDAvis releases named this module pyLDAvis.gensim

In [None]:
pyLDAvis.enable_notebook()

In [None]:
pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)

In [None]:
# Create the histogram
plt.figure(figsize=(8, 6))
sns.histplot(data, kde=True, bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Random Data')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

In [None]:
### Syntactic text features 

In [None]:
# Checks 
print(f"Total no. of comments: {len(all_text)}")
print(f"Before preprocessing: {train.comment_text.iloc[30]}")
print(f"After preprocessing: {all_text.iloc[30]}")

**Bivariate**
- 

**Multivariate**
- 

In [None]:
x=train.iloc[:,2:].sum()
#plot
plt.figure(figsize=(8,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)  # recent seaborn versions require keyword arguments
plt.title("# per class")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('Type ', fontsize=12)
#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

- Toxicity is not evenly spread across classes, so we might face class imbalance problems.
- There are ~95k comments in the training dataset, yet there are ~21k tags and ~86k clean comments!?
- This is only possible when multiple tags are associated with a single comment, e.g. a comment can be classified as both toxic and obscene.
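
To make the arithmetic behind that observation concrete, here is a small sketch on hypothetical toy rows (not the real dataset). The helper `make_row` and the four example comments are assumptions made purely for illustration.

```python
from typing import Dict, List

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def make_row(*active: str) -> Dict[str, int]:
    """Build one hypothetical comment row with 1 for each active label."""
    return {label: int(label in active) for label in LABELS}


# Four toy comments: one with three tags, one with a single tag, two clean
toy_rows: List[Dict[str, int]] = [
    make_row("toxic", "obscene", "insult"),
    make_row("toxic"),
    make_row(),
    make_row(),
]

total_tags = sum(sum(row.values()) for row in toy_rows)
tagged_comments = sum(1 for row in toy_rows if any(row.values()))
clean_comments = len(toy_rows) - tagged_comments

print(f"comments={len(toy_rows)}, tags={total_tags}, clean={clean_comments}")
```

Here the tag count plus the clean count exceeds the number of comments, which is the same effect seen in the bar plot: tags are counted per label, not per comment.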

In [None]:
# Pending preprocessing
#     - Adding unigrams, bigrams and trigrams
#     - Sentiment of extracted hashtags, if any
#     - Check...
# Add the code for NBSVM

## Leaky Features 
- **Caution**: Even though including these features might help us perform better in this particular scenario, it does not make sense to add them to the final, general-purpose model.
- Here we create our own custom **count vectorizer** to build count variables that match our regex conditions.
- **Note**: Use Data Version Control to capture these features from the raw data itself.
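
A minimal sketch of one such regex-based count, using a hypothetical sample comment. One detail worth checking: in Python's `re` module a quantifier written with a space, such as `{1, 3}`, is generally treated as literal text rather than as a repetition range, so the pattern should use `{1,3}`. The pattern below is only a loose match for IPv4-like strings and does not validate the octet ranges.

```python
import re

# Loose IPv4-like pattern: four dot-separated groups of 1 to 3 digits
IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
# Same pattern with spaces inside the braces, kept only for comparison
SPACED_PATTERN = re.compile(r"\d{1, 3}\.\d{1, 3}\.\d{1, 3}\.\d{1, 3}")

# Hypothetical comment text
sample = "Reverted edits by 192.168.0.1 and 10.0.0.255 on the talk page."

ips = IP_PATTERN.findall(sample)
count_ips = len(ips)

print(ips, count_ips)
print(SPACED_PATTERN.findall(sample))
```

The same pattern, applied per comment with `re.findall`, yields both the list feature and its count.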

In [None]:
# Leaky Features 
# Extracted from the resource: 
train_df["ip"] = train_df["comment_text"].apply(lambda text: re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", str(text)))
test_df["ip"] = test_df["comment_text"].apply(lambda text: re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", str(text)))
                                                
# Count IPs 
train_df["count_ips"] = train_df["ip"].apply(lambda text: len(text))
test_df["count_ips"] = test_df["ip"].apply(lambda text: len(text))


# Links 
train_df["links"] = train_df["comment_text"].apply(lambda text: re.findall(r"(https\:)*\/*\/*(www\.)?(\w+)(\.\w+)\/*\w*", str(text)))
test_df["links"] = test_df["comment_text"].apply(lambda text: re.findall(r"(https\:)*\/*\/*(www\.)?(\w+)(\.\w+)\/*\w*", str(text)))
                                                   
# Count links 
train_df["count_links"] = train_df["links"].apply(lambda text: len(text)) 
test_df["count_links"] = test_df["links"].apply(lambda text: len(text))
                                                   
                                                    
# Article IDs...for now, I don't think this feature would be useful in any way

# Username mentions
train_df["username"] = train_df["comment_text"].apply(lambda text: re.findall(r"\[\[User(.*)\|", str(text)))
test_df["username"] = test_df["comment_text"].apply(lambda text: re.findall(r"\[\[User(.*)\|", str(text)))

# Count Usernames 
train_df["count_usernames"] = train_df["username"].apply(lambda text: len(text))
test_df["count_usernames"] = test_df["username"].apply(lambda text: len(text))
                                                   
# Leaky Ip 

# Leaky Usernames 
cv = CountVectorizer()
count_feats_user = cv.fit_transform(train_df["username"].apply(lambda text: str(text)))

                                                   
cv = CountVectorizer()
                                                   

In [None]:
# Checking a few of the usernames 

## Leaky Feature Stability 
- Checking if the features have actually overleaked
- We might need to remove those features where the values have a lot of overlap between the training and the test set 
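
One way to quantify that overlap is to compare the sets of distinct values on each side. The sketch below is an assumption-laden illustration: the helper names `unique_values` and `overlap_ratio` and the toy IP lists are hypothetical, and the per-comment lists stand in for columns such as `ip` or `username`.

```python
from typing import Iterable, List, Set


def unique_values(rows: Iterable[List[str]]) -> Set[str]:
    """Flatten per-comment lists of extracted values into a set of distinct values."""
    return {value for row in rows for value in row}


def overlap_ratio(train_values: Set[str], test_values: Set[str]) -> float:
    """Share of distinct test values that also occur in train (0.0 if test is empty)."""
    if not test_values:
        return 0.0
    return len(train_values & test_values) / len(test_values)


# Hypothetical extracted IPs, one list per comment
toy_train_ips = [["192.168.0.1"], [], ["10.0.0.7", "172.16.0.3"]]
toy_test_ips = [["10.0.0.7"], ["8.8.8.8"], []]

train_set = unique_values(toy_train_ips)
test_set = unique_values(toy_test_ips)
shared = train_set & test_set
ratio = overlap_ratio(train_set, test_set)

print(shared, ratio)
```

The sizes of `train_set`, `test_set` and `shared` are also the three numbers needed to draw the Venn diagram of the intersection.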

In [3]:
leaky_features = ["ip", "links", "username", "count_ips", "count_links", "count_usernames"]

In [None]:
# Get common elements in the "ip" feature of train and test data and plot the intersection on a Venn diagram

