# Classic Machine Learning approaches #

This Jupyter notebook trains two standard types of machine learning models: logistic regression and a support vector machine (SVM).
The comments are vectorized with two methods, Word2Vec and TF-IDF. Combining the two models with the two vectorization methods yields four different approaches.

## Loading the training and test data ##

In [1]:
import pandas as pd

# Load the CSV files into DataFrames
training_data = pd.read_csv('../data/balanced_2_1_ratio.csv')
test_data = pd.read_csv('../data/valid_test_data.csv')

In [2]:
# Showing the first five elements of the DataFrame
training_data.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,6063cac387260c02,"Kimchi,\r\nthats not advertising. Its the arti...",1,0,1,0,1,0
1,bad17dbfea78f651,I am a little surprised and shocked at the ton...,0,0,0,0,0,0
2,5eae46b6e87b85ac,"Barring a citation of course, no. A religion o...",0,0,0,0,0,0
3,8f079550cbe1980e,If someone does end up putting that sock stuff...,0,0,0,0,0,0
4,0332f7a0090832c1,"Hi, a pleasure, I'm fascinated by coal mines, ...",0,0,0,0,0,0


In [3]:
# Showing the last five elements of the DataFrame
training_data.tail()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
48670,21b56461d4b61538,"""\r\nThe """"lead too short"""" tag was needed, bu...",0,0,0,0,0,0
48671,bd78368bd6ba879e,"""\r\n\r\n Good For A Laugh \r\n\r\nAlthough no...",1,0,0,0,0,0
48672,c98631699c9ff7ff,"""\r\nwhy don´t you just get """"high""""? if not, ...",0,0,0,0,0,0
48673,53aa1ef318a2ce3b,GENRE \r\n\r\nhere is the source that they mak...,0,0,0,0,0,0
48674,1404a1d8219bf4bf,———————————\r\nHahahah get fucked filthy mudsl...,1,0,1,0,1,0


In [4]:
# Showing all unique values in the label columns
label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
for column in label_columns:
    print(f'Unique values in the column "{column}": {training_data[column].unique()}')

Unique values in the column "toxic": [1 0]
Unique values in the column "severe_toxic": [0 1]
Unique values in the column "obscene": [1 0]
Unique values in the column "threat": [0 1]
Unique values in the column "insult": [1 0]
Unique values in the column "identity_hate": [0 1]
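
Because every label column only contains 0 and 1, summing a column counts its positive examples. The following is a minimal sketch in plain Python that uses the label values of the five rows printed by `training_data.head()` above. The names `rows` and `positives` are illustrative and do not appear elsewhere in the notebook.

```python
# Label values of the first five training rows, copied from the head() output.
label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
rows = [
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Every value has to be 0 or 1 for the columns to be valid binary labels.
is_binary = all(value in (0, 1) for row in rows for value in row)

# Summing a binary column counts its positive examples.
positives = {name: sum(row[i] for row in rows) for i, name in enumerate(label_columns)}

print(is_binary)
print(positives)
```

On the full DataFrame, `training_data[label_columns].sum()` should give the same kind of per-label count.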


## Downloading necessary NLTK resources ##

In [5]:
import nltk

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tompr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tompr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tompr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Preprocessing the data ##

In [6]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re

# Build the stop word set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    "Preprocess a raw comment and return the cleaned text"

    # Convert text to lowercase
    text = text.lower()
    # Remove usernames starting with '@'
    text = re.sub(r'@ ?\w+', '', text)
    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # Remove the substring "&quot", the HTML entity for a double quotation mark (")
    text = re.sub(r'&quot', '', text)
    # Replace all special characters with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Remove all digits
    text = re.sub(r'\d', '', text)

    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]
    # Lemmatize tokens
    lemmatized_tokens = [LEMMATIZER.lemmatize(token) for token in filtered_tokens]
    # Join the tokens back into a single string
    return ' '.join(lemmatized_tokens)
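
To see what the regular expressions do on their own, the sketch below applies the same substitutions to a made-up comment. It is a simplified stand-in, not the notebook's function: `clean_text` and `sample` are hypothetical names, the URL is invented, and a plain whitespace split replaces `word_tokenize`, stop word removal, and lemmatization so that no NLTK data is needed.

```python
import re

# Same URL pattern as in preprocess_text.
URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def clean_text(text):
    """Apply only the regex steps of the preprocessing and split on whitespace."""
    text = text.lower()
    text = re.sub(r'@ ?\w+', '', text)      # usernames such as "@ user42"
    text = re.sub(URL_PATTERN, '', text)    # URLs
    text = re.sub(r'&quot', '', text)       # HTML entity residue
    text = re.sub(r'[^\w\s]', ' ', text)    # special characters become spaces
    text = re.sub(r'\d', '', text)          # digits
    return text.split()

sample = 'Hey @ user42, see https://example.com/page?id=1 &quot;NOW&quot; 100% sure!!'
tokens = clean_text(sample)
print(tokens)  # ['hey', 'see', 'now', 'sure']
```

The order of the steps matters: usernames and URLs are removed before the special characters, because afterwards the `@` and `://` markers that identify them would already be gone.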

In [7]:
# Apply the preprocessing function to all comments
training_data['comment_text'] = training_data['comment_text'].apply(preprocess_text)
test_data['comment_text'] = test_data['comment_text'].apply(preprocess_text)

In [8]:
print(training_data)

                     id                                       comment_text  \
0      6063cac387260c02  kimchi thats advertising artist bio artist c k...   
1      bad17dbfea78f651  little surprised shocked tone aloan follow lar...   
2      5eae46b6e87b85ac  barring citation course religion religious sec...   
3      8f079550cbe1980e  someone end putting sock stuff tell page right...   
4      0332f7a0090832c1  hi pleasure fascinated coal mine hope done ok ...   
...                 ...                                                ...   
48670  21b56461d4b61538  lead short tag needed started tampering table ...   
48671  bd78368bd6ba879e  good laugh although one bit encyclopaedic natu...   
48672  c98631699c9ff7ff  get high give every video thumb write bad comment   
48673  53aa1ef318a2ce3b       genre source make rapcore rap rock rap metal   
48674  1404a1d8219bf4bf                 hahahah get fucked filthy mudslime   

       toxic  severe_toxic  obscene  threat  insult  identity_h

In [9]:
print(test_data)

                     id                                       comment_text  \
0      0001ea8717f6de06  thank understanding think highly would revert ...   
1      000247e83dcc1211                             dear god site horrible   
2      0002f87b16116a7f  somebody invariably try add religion really me...   
3      0003e1cccfd5a40a  say right type type institution needed case th...   
4      00059ace3e3e9a53  adding new product list make sure relevant add...   
...                 ...                                                ...   
63973  fff8f64043129fa2  jerome see never got around surprised looked e...   
63974  fff9d70fe0722906        lucky bastard heh famous kida envy congrats   
63975  fffa8a11c4378854                      shame want speak gay romanian   
63976  fffac2a094c8e0e2  mel gibson nazi bitch make shitty movie much b...   
63977  fffb5451268fb5ba  unicorn lair discovery supposedly unicorn lair...   

       toxic  severe_toxic  obscene  threat  insult  identity_h

## Installing machine learning libraries ##

In [10]:
!pip install gensim scikit-learn pandas






## Logistic regression model with Word2Vec vectorization ##

In [11]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.metrics import classification_report


# Tokenize the texts
def tokenize(text):
    return word_tokenize(text.lower())

# Function to calculate the vector of a text
def vectorize_text(tokens, model):
    word_vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector if no words are in the model
    return np.mean(word_vectors, axis=0)

# Separate the text and labels
y_train = training_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
X_train = training_data['comment_text']
y_test = test_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
X_test = test_data['comment_text']

# Apply tokenization to all comments
X_train_tokens = X_train.apply(tokenize)
X_test_tokens = X_test.apply(tokenize)

# Train the Word2Vec model
model_w2v = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=1, workers=4)

# Vectorize the texts
X_train_vec = np.array([vectorize_text(tokens, model_w2v) for tokens in X_train_tokens])
X_test_vec = np.array([vectorize_text(tokens, model_w2v) for tokens in X_test_tokens])

# Multi-output classification (one classification for each category)
model_lr_w2v = MultiOutputClassifier(LogisticRegression(max_iter=1000))
model_lr_w2v.fit(X_train_vec, y_train)

# Make predictions
y_pred = model_lr_w2v.predict(X_test_vec)

# Evaluate performance
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']))


Classification Report:
                precision    recall  f1-score   support

        toxic       0.45      0.79      0.57      6090
 severe_toxic       0.20      0.23      0.21       367
      obscene       0.58      0.62      0.60      3691
       threat       0.34      0.16      0.22       211
       insult       0.52      0.51      0.52      3427
identity_hate       0.43      0.23      0.30       712

    micro avg       0.48      0.63      0.55     14498
    macro avg       0.42      0.42      0.40     14498
 weighted avg       0.49      0.63      0.54     14498
  samples avg       0.07      0.06      0.06     14498
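
The `vectorize_text` function above turns a comment into a single vector by averaging the Word2Vec vectors of its known tokens. The sketch below shows the same idea with plain Python lists and a hand-made embedding table. The names `toy_vectors` and `mean_vector` and the 3-dimensional vectors are assumptions for illustration; the notebook itself uses a trained model with `vector_size=100`.

```python
# Toy 3-dimensional embeddings standing in for a trained Word2Vec model.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "laugh": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3

def mean_vector(tokens, vectors, size):
    """Average the vectors of all known tokens; return a zero vector if none is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension across all token vectors.
    return [sum(dimension) / len(known) for dimension in zip(*known)]

pooled = mean_vector(["good", "laugh", "unknownword"], toy_vectors, VECTOR_SIZE)
empty = mean_vector(["unknownword"], toy_vectors, VECTOR_SIZE)
print(pooled)  # [2.0, 1.0, 1.0]
print(empty)   # [0.0, 0.0, 0.0]
```

Averaging discards word order, so two comments with the same words in a different order get the same vector. Since the Word2Vec model is trained with `min_count=1`, every training token has a vector; only test tokens that never occur in the training data are skipped.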





In [12]:
"""Test model_lr_w2v with an example comment"""
from typing import Any

def get_output_from_model_lr_w2v(model: Any, model_vec: Any, comment: str) -> np.ndarray:
    "Prints the prediction for each label and returns the prediction array"
    # Vectorize the comment the same way as the training data
    features = [vectorize_text(tokenize(comment), model_vec)]
    output = model.predict(features)
    toxic_parameters = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    for name, value in zip(toxic_parameters, output[0]):
        print(f"{name}: {value}")
    return output

example_comment = "You are low key stupid"
get_output_from_model_lr_w2v(model_lr_w2v, model_w2v, example_comment)

toxic: 1
severe_toxic: 0
obscene: 1
threat: 0
insult: 1
identity_hate: 0


array([[1, 0, 1, 0, 1, 0]], dtype=int64)
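
`MultiOutputClassifier` fits one independent logistic regression per label column, and each of them makes a yes/no decision like the six values printed above. The sketch below shows how a single such decision can be computed from a weight vector. The weights, bias, and feature values are invented for illustration and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one binary logistic regression."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return int(probability >= threshold), probability

# Hypothetical parameters of the classifier for a single label.
weights = [2.0, -1.0]
bias = -0.5

label_a, prob_a = predict_label([1.0, 0.5], weights, bias)  # score = 1.0
label_b, prob_b = predict_label([0.0, 1.0], weights, bias)  # score = -1.5
print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```

Because each label has its own classifier, a comment can receive several labels at once, or none at all.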

## Support vector machine with Word2Vec vectorization ##

In [13]:
import numpy as np
from sklearn.svm import SVC
from sklearn.multioutput import MultiOutputClassifier
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.metrics import classification_report

def tokenize(text):
    return word_tokenize(text.lower())

def vectorize_text(tokens, model):
    "Use word2vec model for vectorization"
    word_vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector if no words are in the model
    return np.mean(word_vectors, axis=0)

# Separate the text and labels
y_train = training_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
X_train = training_data['comment_text']
y_test = test_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
X_test = test_data['comment_text']

# Apply tokenization to all comments
X_train_tokens = X_train.apply(tokenize)
X_test_tokens = X_test.apply(tokenize)

# Train the Word2Vec model
model_w2v = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=1, workers=4)

# Vectorize the texts
X_train_vec = np.array([vectorize_text(tokens, model_w2v) for tokens in X_train_tokens])
X_test_vec = np.array([vectorize_text(tokens, model_w2v) for tokens in X_test_tokens])

# Multi-output classification with SVM (one classification for each category)
model_svm_w2v = MultiOutputClassifier(SVC(kernel='linear', C=1))  # SVM with a linear kernel and regularization C=1
model_svm_w2v.fit(X_train_vec, y_train)

# Make predictions
y_pred = model_svm_w2v.predict(X_test_vec)

# Evaluate performance
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']))


Classification Report:
                precision    recall  f1-score   support

        toxic       0.46      0.80      0.58      6090
 severe_toxic       0.00      0.00      0.00       367
      obscene       0.61      0.60      0.61      3691
       threat       0.00      0.00      0.00       211
       insult       0.59      0.48      0.52      3427
identity_hate       0.00      0.00      0.00       712

    micro avg       0.51      0.60      0.55     14498
    macro avg       0.28      0.31      0.29     14498
 weighted avg       0.49      0.60      0.52     14498
  samples avg       0.07      0.06      0.06     14498
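
The gap between the micro and the macro average in this report follows from how the two are defined. The macro average is the unweighted mean of the per-label scores, so the three labels with a score of 0.00 pull it down, while the micro average pools all individual decisions and is dominated by the frequent labels. The sketch below recomputes the macro averages from the rounded per-label values in the report. `macro_average` is an illustrative helper, and small deviations in the last digit are possible because the inputs are already rounded.

```python
# Per-label scores copied from the classification report (SVM with Word2Vec).
precision = {'toxic': 0.46, 'severe_toxic': 0.00, 'obscene': 0.61,
             'threat': 0.00, 'insult': 0.59, 'identity_hate': 0.00}
recall = {'toxic': 0.80, 'severe_toxic': 0.00, 'obscene': 0.60,
          'threat': 0.00, 'insult': 0.48, 'identity_hate': 0.00}

def macro_average(scores):
    """Unweighted mean over all labels, regardless of their support."""
    return sum(scores.values()) / len(scores)

macro_precision = macro_average(precision)
macro_recall = macro_average(recall)
print(round(macro_precision, 2))  # 0.28
print(round(macro_recall, 2))     # 0.31
```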





In [14]:
"""Test model_svm_w2v with an example comment"""

from typing import Any

def get_output_from_model_svm_w2v(model: Any, model_vec: Any, comment: str) -> np.ndarray:
    "Prints the prediction for each label and returns the prediction array"
    # Vectorize the comment the same way as the training data
    features = [vectorize_text(tokenize(comment), model_vec)]
    output = model.predict(features)
    toxic_parameters = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    for name, value in zip(toxic_parameters, output[0]):
        print(f"{name}: {value}")
    return output

example_comment = "You are low key stupid"
get_output_from_model_svm_w2v(model_svm_w2v, model_w2v, example_comment)

toxic: 1
severe_toxic: 0
obscene: 1
threat: 0
insult: 1
identity_hate: 0


array([[1, 0, 1, 0, 1, 0]], dtype=int64)
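
An SVM with a linear kernel decides by the sign of a weighted sum of the features, and training penalizes examples that fall on the wrong side of the margin through the hinge loss. The sketch below illustrates both with invented numbers. `decision_function` and `hinge_loss` are illustrative helpers written for this example, not calls into scikit-learn, and the weights are not taken from the trained model.

```python
def decision_function(features, weights, bias):
    """Signed score of a linear classifier: positive means class 1."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def hinge_loss(true_label, score):
    """Hinge loss for a label in {-1, +1}; zero once the margin is at least 1."""
    return max(0.0, 1.0 - true_label * score)

# Hypothetical parameters of the classifier for a single label.
weights = [1.0, -2.0]
bias = 0.5

score_a = decision_function([2.0, 0.5], weights, bias)  # 2.0 - 1.0 + 0.5
score_b = decision_function([0.0, 0.5], weights, bias)  # 0.0 - 1.0 + 0.5
print(score_a, int(score_a > 0), hinge_loss(+1, score_a))
print(score_b, int(score_b > 0), hinge_loss(+1, score_b))
```

One plausible reading of the 0.00 rows in the report above is that, for very rare labels, always predicting the negative class costs little hinge loss at `C=1`, so the classifier never predicts these labels. This is an interpretation and has not been verified in the notebook.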

## Logistic regression model with TF-IDF vectorization ##

In [15]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report


def create_tfidf_vectorizer(data: pd.DataFrame) -> TfidfVectorizer:
    "Creates the TF-IDF vectorizer, fits it on the comments, and returns it"
    vectorizer = TfidfVectorizer(max_features=15000, min_df=2, max_df=0.8)
    # fit() learns the vocabulary and IDF weights and returns the fitted vectorizer
    return vectorizer.fit(data["comment_text"].to_list())

# Vectorize the comments using TF-IDF
tfidf_vectorizer = create_tfidf_vectorizer(training_data)
X_train = tfidf_vectorizer.transform(training_data["comment_text"].to_list())
X_test = tfidf_vectorizer.transform(test_data["comment_text"].to_list())

# Define the target variables
y_train = training_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
y_test = test_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

# Create the model
model_lr_tfidf = MultiOutputClassifier(LogisticRegression(max_iter=1000, random_state=42))

# Train the model
model_lr_tfidf.fit(X_train, y_train)

# Make predictions
y_pred = model_lr_tfidf.predict(X_test)

# Evaluate the model's performance
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']))


Classification Report:
                precision    recall  f1-score   support

        toxic       0.52      0.84      0.64      6090
 severe_toxic       0.40      0.29      0.34       367
      obscene       0.74      0.66      0.70      3691
       threat       0.42      0.21      0.28       211
       insult       0.71      0.57      0.63      3427
identity_hate       0.60      0.28      0.38       712

    micro avg       0.59      0.68      0.63     14498
    macro avg       0.56      0.47      0.49     14498
 weighted avg       0.62      0.68      0.63     14498
  samples avg       0.07      0.07      0.07     14498
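
TF-IDF weights a term by how often it occurs in a comment and how rare it is across all comments. The sketch below computes such weights by hand for three tiny token lists, using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are the `TfidfVectorizer` defaults. The corpus and the helper `tfidf` are made up for illustration, and the `max_features`, `min_df`, and `max_df` filters used in the notebook are left out.

```python
import math
from collections import Counter

# Tiny toy corpus of already preprocessed comments.
docs = [
    ["dear", "god", "site", "horrible"],
    ["site", "good"],
    ["horrible", "horrible", "site"],
]
n_docs = len(docs)

# Document frequency: in how many comments does each term occur?
document_frequency = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in document_frequency.items()}

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized comment."""
    counts = Counter(doc)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{term}: {weight:.3f}")
```

In the first comment, "site" gets the lowest weight because it occurs in every comment, while "dear" and "god" occur only once in the corpus and get the highest.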





In [16]:
"""Test model_lr_tfidf with an example comment"""
import numpy as np
from typing import Any

def get_output_from_model_lr_tfidf(model_lr_tfidf: Any, tfidf_vectorizer: Any, comment: str) -> np.ndarray:
    "Prints the prediction for each label and returns the prediction array"
    tfidf_matrix_comment = tfidf_vectorizer.transform([comment])
    output = model_lr_tfidf.predict(tfidf_matrix_comment)
    toxic_parameters = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    for name, value in zip(toxic_parameters, output[0]):
        print(f"{name}: {value}")
    return output

# Example usage
example_comment = "You are low key stupid"
result = get_output_from_model_lr_tfidf(model_lr_tfidf, tfidf_vectorizer, example_comment)


toxic: 1
severe_toxic: 0
obscene: 0
threat: 0
insult: 1
identity_hate: 0


## Support vector machine with TF-IDF vectorization ##

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report

def create_tfidf_vectorizer(data: pd.DataFrame) -> TfidfVectorizer:
    "Creates the TF-IDF vectorizer, fits it on the comments, and returns it"
    vectorizer = TfidfVectorizer(max_features=15000, min_df=2, max_df=0.8)
    # fit() learns the vocabulary and IDF weights and returns the fitted vectorizer
    return vectorizer.fit(data["comment_text"].to_list())

# Vectorize the comments using TF-IDF
tfidf_vectorizer = create_tfidf_vectorizer(training_data)
X_train = tfidf_vectorizer.transform(training_data["comment_text"].to_list())
X_test = tfidf_vectorizer.transform(test_data["comment_text"].to_list())

# Define the target variables
y_train = training_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
y_test = test_data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

# Create the model
model_svm_tfidf = MultiOutputClassifier(SVC(kernel='linear', C=1, random_state=42))

# Train the model
model_svm_tfidf.fit(X_train, y_train)

# Make predictions
y_pred = model_svm_tfidf.predict(X_test)

# Evaluate the model's performance
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']))


Classification Report:
                precision    recall  f1-score   support

        toxic       0.47      0.89      0.62      6090
 severe_toxic       0.43      0.18      0.25       367
      obscene       0.64      0.71      0.67      3691
       threat       0.45      0.24      0.31       211
       insult       0.63      0.63      0.63      3427
identity_hate       0.58      0.37      0.45       712

    micro avg       0.54      0.73      0.62     14498
    macro avg       0.53      0.50      0.49     14498
 weighted avg       0.56      0.73      0.61     14498
  samples avg       0.08      0.07      0.07     14498
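
The f1-score column is the harmonic mean of precision and recall, which is why a label with high recall but low precision, such as "toxic", still ends up with a moderate score. The sketch below recomputes the column from the rounded precision and recall values in the report above. `f1_score` is a small helper written for this example, not the scikit-learn function, and exact agreement is not guaranteed in general because the inputs are rounded.

```python
# (precision, recall, reported f1) per label, copied from the report (SVM with TF-IDF).
reported = {
    'toxic': (0.47, 0.89, 0.62),
    'severe_toxic': (0.43, 0.18, 0.25),
    'obscene': (0.64, 0.71, 0.67),
    'threat': (0.45, 0.24, 0.31),
    'insult': (0.63, 0.63, 0.63),
    'identity_hate': (0.58, 0.37, 0.45),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

recomputed = {label: round(f1_score(p, r), 2) for label, (p, r, _) in reported.items()}
for label, (_, _, f1) in reported.items():
    print(f"{label}: recomputed {recomputed[label]}, reported {f1}")
```

For this report, the recomputed values agree with the printed ones for all six labels.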





In [18]:
"""Test model_svm_tfidf with an example comment"""
import numpy as np
from typing import Any

def get_output_from_model_svm_tfidf(model_svm_tfidf: Any, tfidf_vectorizer: Any, comment: str) -> np.ndarray:
    "Prints the prediction for each label and returns the prediction array"
    tfidf_matrix_comment = tfidf_vectorizer.transform([comment])
    output = model_svm_tfidf.predict(tfidf_matrix_comment)
    toxic_parameters = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    for name, value in zip(toxic_parameters, output[0]):
        print(f"{name}: {value}")
    return output

# Example usage
example_comment = "You are low key stupid"
result = get_output_from_model_svm_tfidf(model_svm_tfidf, tfidf_vectorizer, example_comment)


toxic: 1
severe_toxic: 0
obscene: 0
threat: 0
insult: 1
identity_hate: 0
