# Tokenization
Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the tokenization strategy used. Tokenization is a fundamental step as it converts raw text into a format that can be more easily analyzed and processed by machine learning models.
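
As a rough illustration of how the choice of unit changes the result, the sketch below tokenizes one sentence at the word level and a short phrase at the character level. The helper names `word_tokens` and `char_tokens` are hypothetical, and the regular expression is a simple stand-in rather than the algorithm NLTK uses (NLTK, for instance, keeps the clitic 's as a single token). Subword tokenizers need a learned vocabulary, so they are not shown here.

```python
import re

def word_tokens(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Every non-whitespace character becomes a token.
    return [ch for ch in text if not ch.isspace()]

sentence = "So it's not just America!"
print(word_tokens(sentence))
print(char_tokens("idea lol"))
```

Word-level tokens keep the vocabulary meaningful but large, while character-level tokens give a tiny vocabulary at the cost of much longer sequences.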

Word Tokenization:
word_tokenize is a function from the NLTK library that splits text into word and punctuation tokens. For example, it splits "it's" into "it" and "'s".

In [2]:
import pandas as pd
data=pd.read_csv("HateSpeechDetection.csv")

In [3]:
import nltk
nltk.download('punkt') 
from nltk.tokenize import word_tokenize

data['Tokens']=data['Text'].apply(word_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\balus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
data

Unnamed: 0,Text,Label,Tokens
0,Damn I thought they had strict gun laws in Ger...,0,"[Damn, I, thought, they, had, strict, gun, law..."
1,I dont care about what it stands for or anythi...,0,"[I, dont, care, about, what, it, stands, for, ..."
2,It's not a group it's an idea lol,0,"[It, 's, not, a, group, it, 's, an, idea, lol]"
3,So it's not just America!,0,"[So, it, 's, not, just, America, !]"
4,The dog is a spectacular dancer considering he...,0,"[The, dog, is, a, spectacular, dancer, conside..."
...,...,...,...
17591,I find rats nicer and cleaner than most Chinese,1,"[I, find, rats, nicer, and, cleaner, than, mos..."
17592,"Check out this niggar, they hit things like wi...",1,"[Check, out, this, niggar, ,, they, hit, thing..."
17593,"this country has become an absolute shambles, ...",0,"[this, country, has, become, an, absolute, sha..."
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1,"[Me, aged, 16, =, anti-Semitism, is, bad, Me, ..."


In [5]:
texts = data['Text'].values
labels = data['Label'].values

# Embedding
Embedding in the context of deep learning and natural language processing (NLP) is a way of representing words or phrases as dense vectors in a continuous vector space. These vectors capture semantic meanings and relationships between words. Embeddings transform the sparse, high-dimensional data of words into a lower-dimensional space, where similar words have similar vector representations.
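
The idea that similar words have similar vectors can be made concrete with cosine similarity. The sketch below uses hypothetical 4-dimensional vectors chosen by hand purely for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import numpy as np

# Hypothetical embeddings, hand-picked for illustration only (not learned).
toy_vectors = {
    "dog": np.array([0.9, 0.8, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.9, 0.2, 0.1]),
    "law": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["law"]))
```

With these toy values, "dog" and "puppy" score much closer to 1 than "dog" and "law", which is the behaviour a trained embedding is expected to show for related and unrelated words.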

# Word2Vec
Word2Vec transforms each word into a dense vector of fixed size. It captures semantic meaning by training on large text corpora. The resulting vectors are distributed numerical representations: each word's vector is learned from the words that appear around it, so words used in similar contexts end up with similar vectors.

Word2Vec can build these embeddings with two different model architectures: the Continuous Bag of Words (CBOW) model and the Skip-Gram model.
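
The difference between the two architectures is easiest to see in the training examples each one builds from the same sentence. The sketch below is a simplified illustration with hypothetical helper names; actual Word2Vec implementations add details such as subsampling of frequent words and a randomly shrunk window, which are left out here.

```python
def context_window(tokens, index, window):
    # Up to `window` tokens on each side of position `index`, excluding the token itself.
    start = max(0, index - window)
    return tokens[start:index] + tokens[index + 1:index + 1 + window]

def cbow_examples(tokens, window):
    # CBOW: the surrounding context words are the input, the centre word is the target.
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    # Skip-Gram: the centre word is the input, each context word is a separate target.
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]

tokens = ["So", "it", "'s", "not", "just"]
print(cbow_examples(tokens, window=1)[1])       # one CBOW example for the word "it"
print(skipgram_examples(tokens, window=1)[:3])  # first few Skip-Gram pairs
```

CBOW produces one example per word, while Skip-Gram produces one example per (word, context word) pair, which is one reason Skip-Gram is the more expensive of the two to train.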


# Skip-Gram
Skip-Gram predicts the context words from a target word: the model is trained to maximize the probability of the context words given the target word.
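
Written out, this objective is usually stated as the average log-probability below. The symbols are notation introduced here, following the standard Word2Vec presentation: T is the number of words in the training text, c is the window size (the `window` parameter in the code), W is the vocabulary size, and v and v' are the input and output vectors of a word.

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}
                       {\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
```

The full softmax in the denominator is expensive for large vocabularies, so implementations typically approximate it, for example with negative sampling or hierarchical softmax.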

Pros: Effective in capturing syntactic and semantic relationships, works well with small datasets.

Cons: Computationally more expensive than CBOW.

In [26]:
import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split

def word2vec_embedding_sg(texts, vector_size=200, window=6):
    # sg=1 selects the Skip-Gram architecture (sg=0 would be CBOW)
    model = Word2Vec(texts, vector_size=vector_size, window=window, min_count=1, workers=4, sg=1)
    word_vectors = model.wv

    def get_word2vec_embeddings(text, word_vectors):
        # Represent a document as the mean of its word vectors
        embeddings = [word_vectors[word] for word in text if word in word_vectors]
        if embeddings:
            return np.mean(embeddings, axis=0)
        else:
            # No known words: fall back to a zero vector of the same size
            return np.zeros(vector_size)

    embeddings = np.array([get_word2vec_embeddings(text, word_vectors) for text in texts])
    return embeddings

embeddings_w2v_sg = word2vec_embedding_sg(data['Tokens'])
print(embeddings_w2v_sg)
X_train_w2v2, X_test_w2v2, y_train_w2v2, y_test_w2v2 = train_test_split(embeddings_w2v_sg, labels, test_size=0.2, random_state=42)


[[ 0.17109032  0.033767   -0.18476258 ... -0.10014948  0.10881008
  -0.11326653]
 [ 0.22616751  0.03877444 -0.25352594 ... -0.07180958  0.15895079
  -0.16575725]
 [ 0.1826731   0.00148701 -0.18673538 ... -0.11800708  0.13921864
  -0.17711399]
 ...
 [ 0.17524712 -0.0284111  -0.1633537  ... -0.10352435  0.1074788
  -0.13104244]
 [ 0.17381686  0.02411527 -0.19371334 ... -0.06752601  0.15991086
  -0.16255414]
 [ 0.18425429  0.02087011 -0.19497301 ... -0.12127761  0.12777068
  -0.16926508]]


In [27]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

def train_evaluate_rf(X_train_emb, X_test_emb, y_train, y_test):
    rf = RandomForestClassifier()
    rf.fit(X_train_emb, y_train)
    y_pred = rf.predict(X_test_emb)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report


In [32]:
accuracy_w2v2, report_w2v2 = train_evaluate_rf(X_train_w2v2, X_test_w2v2, y_train_w2v2, y_test_w2v2)
print(f'Word2Vec skip-gram Accuracy: {accuracy_w2v2}')
print(f'Word2Vec skip-gram Classification Report:\n{report_w2v2}')


Word2Vec skip-gram Accuracy: 0.6551136363636364
Word2Vec skip-gram Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.84      0.74      2094
           1       0.62      0.38      0.47      1426

    accuracy                           0.66      3520
   macro avg       0.64      0.61      0.61      3520
weighted avg       0.65      0.66      0.63      3520




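# Word2Vec skip-gram observations

| vector_size | window_size | Accuracy (%) |
|-------------|-------------|--------------|
| 50          | 3           | 63.8         |
| 100         | 3           | 65.3         |
| 100         | 4           | 64.3         |
| 150         | 4           | 64.0         |
| 150         | 5           | 65.1         |
| 200         | 5           | 64.2         |
| 200         | 6           | 67.9         |
| 250         | 6           | 66.7         |
| 300         | 6           | 65.5         |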