# Tokenization
Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the tokenization strategy used. Tokenization is a fundamental step as it converts raw text into a format that can be more easily analyzed and processed by machine learning models.

Word Tokenization:
word_tokenize is a function in Python that splits a given sentence into words using the NLTK library.

In [17]:
import nltk
nltk.download('punkt') 
from nltk.tokenize import word_tokenize

data['Tokens']=data['Text'].apply(word_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\balui\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Rule-Based Tokenization: This uses predefined rules to handle complex cases like contractions, hyphenated words, and special characters. (e.g., Penn Treebank Tokenization)

In [18]:
from nltk.tokenize import TreebankWordTokenizer

def rule_based_tokenize(text):
    tokenizer = TreebankWordTokenizer()
    return tokenizer.tokenize(text)

data['token_rbt']=data['Text'].apply(rule_based_tokenize)


Byte Pair Encoding (BPE): This subword tokenization method iteratively merges the most frequent character pairs to create a vocabulary.

In [19]:
from transformers import GPT2Tokenizer

def bpe_tokenize(text):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    return tokenizer.tokenize(text)
data['token_bpe']=data['Text'].apply(rule_based_tokenize)






In [20]:
data

Unnamed: 0,Text,Label,Tokens,token_rbt,token_bpe
0,damn thought they had strict gun law germany,0,"[damn, thought, they, had, strict, gun, law, g...","[damn, thought, they, had, strict, gun, law, g...","[damn, thought, they, had, strict, gun, law, g..."
1,not care about what stand for anything its con...,0,"[not, care, about, what, stand, for, anything,...","[not, care, about, what, stand, for, anything,...","[not, care, about, what, stand, for, anything,..."
2,not group idea lol,0,"[not, group, idea, lol]","[not, group, idea, lol]","[not, group, idea, lol]"
3,not just america,0,"[not, just, america]","[not, just, america]","[not, just, america]"
4,the dog spectacular dancer considering has two...,0,"[the, dog, spectacular, dancer, considering, h...","[the, dog, spectacular, dancer, considering, h...","[the, dog, spectacular, dancer, considering, h..."
...,...,...,...,...,...
17591,find rat nicer and cleaner than most chinese,1,"[find, rat, nicer, and, cleaner, than, most, c...","[find, rat, nicer, and, cleaner, than, most, c...","[find, rat, nicer, and, cleaner, than, most, c..."
17592,check out this niggar they hit thing like wild...,1,"[check, out, this, niggar, they, hit, thing, l...","[check, out, this, niggar, they, hit, thing, l...","[check, out, this, niggar, they, hit, thing, l..."
17593,this country has become absolute shamble the a...,0,"[this, country, has, become, absolute, shamble...","[this, country, has, become, absolute, shamble...","[this, country, has, become, absolute, shamble..."
17594,aged 16 antisemitism bad aged 18 antisemitism ...,1,"[aged, 16, antisemitism, bad, aged, 18, antise...","[aged, 16, antisemitism, bad, aged, 18, antise...","[aged, 16, antisemitism, bad, aged, 18, antise..."


# Embedding:
Embedding in the context of deep learning and natural language processing (NLP) is a way of representing words or phrases as dense vectors in a continuous vector space. These vectors capture semantic meanings and relationships between words. Embeddings transform the sparse, high-dimensional data of words into a lower-dimensional space, where similar words have similar vector representations.

# Count based embeddings (Non-context)
Non-contextual embeddings assign a single vector representation to each word, regardless of its context in a sentence. These embeddings are static and do not change based on the words surrounding them.

# One Hot Encoding: 
Non-contextual embeddings assign a single vector representation to each word, regardless of its context in a sentence. These embeddings are static and do not change based on the words surrounding them.

Pros: Simple to implement, no need for pre-training.

Cons: High dimensionality, no information about word similarity or context.

In [21]:
texts = data['Text'].values
labels = data['Label'].values

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split 

def one_hot_encoding(texts):
    vectorizer = CountVectorizer(binary=True)
    embeddings = vectorizer.fit_transform(texts)
    return embeddings

embeddings_one_hot = one_hot_encoding(texts)
X_train_one_hot, X_test_one_hot, y_train_one_hot, y_test_one_hot = train_test_split(embeddings_one_hot, labels, test_size=0.2, random_state=42)
print(embeddings_one_hot)

  (0, 4315)	1
  (0, 17087)	1
  (0, 17033)	1
  (0, 7589)	1
  (0, 16345)	1
  (0, 7542)	1
  (0, 9660)	1
  (0, 7162)	1
  (1, 11708)	1
  (1, 2841)	1
  (1, 432)	1
  (1, 18571)	1
  (1, 16163)	1
  (1, 6681)	1
  (1, 1153)	1
  (1, 9044)	1
  (1, 3732)	1
  (1, 9903)	1
  (1, 16979)	1
  (1, 15319)	1
  (2, 11708)	1
  (2, 7492)	1
  (2, 8307)	1
  (2, 10035)	1
  (3, 11708)	1
  :	:
  (17594, 14863)	1
  (17594, 11962)	1
  (17594, 1655)	1
  (17594, 17080)	1
  (17594, 6028)	1
  (17594, 136)	1
  (17594, 67)	1
  (17594, 5843)	1
  (17594, 3671)	1
  (17594, 13871)	1
  (17594, 9153)	1
  (17594, 16205)	1
  (17594, 14867)	1
  (17594, 10620)	1
  (17594, 711)	1
  (17594, 84)	1
  (17594, 11347)	1
  (17594, 1122)	1
  (17594, 14409)	1
  (17595, 11708)	1
  (17595, 2122)	1
  (17595, 14463)	1
  (17595, 4683)	1
  (17595, 14867)	1
  (17595, 10648)	1


# Term Frequency: 
    Converts text data to term frequency vectors.
Pros: Simple, captures basic information about word importance.

Cons: High dimensionality, does not capture semantic meanings or context.

In [23]:
def term_frequency_encoding(texts):
    vectorizer = CountVectorizer()
    embeddings = vectorizer.fit_transform(texts)
    return embeddings
embeddings_tf = term_frequency_encoding(texts)
X_train_tf, X_test_tf, y_train_tf, y_test_tf = train_test_split(embeddings_tf, labels, test_size=0.2, random_state=42)
print(embeddings_tf)


  (0, 4315)	1
  (0, 17087)	1
  (0, 17033)	1
  (0, 7589)	1
  (0, 16345)	1
  (0, 7542)	1
  (0, 9660)	1
  (0, 7162)	1
  (1, 11708)	1
  (1, 2841)	1
  (1, 432)	1
  (1, 18571)	1
  (1, 16163)	1
  (1, 6681)	1
  (1, 1153)	1
  (1, 9044)	1
  (1, 3732)	1
  (1, 9903)	1
  (1, 16979)	1
  (1, 15319)	1
  (2, 11708)	1
  (2, 7492)	1
  (2, 8307)	1
  (2, 10035)	1
  (3, 11708)	1
  :	:
  (17594, 14863)	1
  (17594, 11962)	1
  (17594, 1655)	1
  (17594, 17080)	1
  (17594, 6028)	1
  (17594, 136)	1
  (17594, 67)	1
  (17594, 5843)	1
  (17594, 3671)	1
  (17594, 13871)	1
  (17594, 9153)	1
  (17594, 16205)	1
  (17594, 14867)	1
  (17594, 10620)	1
  (17594, 711)	3
  (17594, 84)	1
  (17594, 11347)	1
  (17594, 1122)	4
  (17594, 14409)	1
  (17595, 11708)	1
  (17595, 2122)	1
  (17595, 14463)	1
  (17595, 4683)	1
  (17595, 14867)	1
  (17595, 10648)	1


# TF-IDF:
TF-IDF stands for Term Frequency Inverse Document Frequency of records. It can be defined as the calculation of how relevant a word in a series or corpus is to a text. The meaning increases proportionally to the number of times in the text a word appears but is compensated by the word frequency in the corpus (data-set).

Pros: Highlights important words in a document, reduces the influence of common words.

Cons: Still high dimensional, static representation, limited in capturing semantic relationships.

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_embedding(texts):
    vectorizer = TfidfVectorizer()
    embeddings = vectorizer.fit_transform(texts)
    return embeddings

embeddings_tfidf = tfidf_embedding(texts)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(embeddings_tfidf, labels, test_size=0.2, random_state=42)


print(embeddings_tfidf)
print(embeddings_tfidf.shape)


  (0, 7162)	0.38655864797204015
  (0, 9660)	0.3369183517812025
  (0, 7542)	0.4085287030056687
  (0, 16345)	0.49396554919123437
  (0, 7589)	0.26594905432797417
  (0, 17033)	0.15796206623433257
  (0, 17087)	0.327463512592553
  (0, 4315)	0.35106624018460947
  (1, 15319)	0.46764828617011794
  (1, 16979)	0.10452774819368486
  (1, 9903)	0.18460642886856596
  (1, 3732)	0.49051943994950004
  (1, 9044)	0.2687361424930761
  (1, 1153)	0.2965908260733827
  (1, 6681)	0.1649905427170199
  (1, 16163)	0.3416575901596179
  (1, 18571)	0.2044196148320207
  (1, 432)	0.21152394159488963
  (1, 2841)	0.29866739942996134
  (1, 11708)	0.12578228405018932
  (2, 10035)	0.5587285254775806
  (2, 8307)	0.5796117172772546
  (2, 7492)	0.548976877964676
  (2, 11708)	0.22471555236057383
  (3, 923)	0.8296004517693428
  :	:
  (17594, 67)	0.16153022657761285
  (17594, 136)	0.14012531222980573
  (17594, 6028)	0.12021987545987466
  (17594, 17080)	0.08624109760557543
  (17594, 1655)	0.10065247197644427
  (17594, 11962)	0.143

# Position based embeddings (context)
Contextual embeddings generate different vector representations for a word based on its context within a sentence. These embeddings are dynamic and change depending on surrounding words.

# Word2Vec
Word2Vec transforms each word into a dense vector of fixed size. It captures semantic meanings by training on large text corpora. Word2Vec creates vectors of the words that are distributed numerical representations of word features – these word features could comprise of words that represent the context of the individual words present in our vocabulary.

Two different model architectures that can be used by Word2Vec to create the word embeddings are the Continuous Bag of Words (CBOW) model & the Skip-Gram model.


# CBOW
In CBOW the words occurring in context (surrounding words) of a selected word are used as inputs and middle or selected word as the target.

Pros: Efficient, captures syntactic and semantic relationships to some extent.

Cons: Assumes context words are independent, which may not always hold true.

In [25]:

from gensim.models import Word2Vec

def word2vec_embedding_cbow(texts):
    model = Word2Vec(texts, vector_size=300, window=5, min_count=1, workers=4,sg=0)
    word_vectors = model.wv
    #print(word_vectors)

    def get_word2vec_embeddings(text, word_vectors):
        embeddings = [word_vectors[word] for word in text if word in word_vectors]
        if embeddings:
            return np.mean(embeddings, axis=0)
        else:
            return np.zeros(300)

    embeddings = np.array([get_word2vec_embeddings(text, word_vectors) for text in texts])
    return embeddings

embeddings_w2v_cbow = word2vec_embedding_cbow(data['Tokens'])
print(embeddings_w2v_cbow)
X_train_w2v1, X_test_w2v1, y_train_w2v1, y_test_w2v1 = train_test_split(embeddings_w2v_cbow, labels, test_size=0.2, random_state=42)



[[ 0.02526417  0.17349869  0.00987022 ... -0.05650085  0.17430493
  -0.15075786]
 [ 0.11437532  0.16749364 -0.00835051 ...  0.00312652  0.27611646
  -0.24354184]
 [ 0.05128742  0.17605858 -0.07547641 ... -0.05376441  0.25279415
  -0.28923002]
 ...
 [-0.02438427  0.2300602   0.04216736 ... -0.09735176  0.19842166
  -0.10078315]
 [ 0.09299681  0.13884063 -0.04085092 ...  0.01131009  0.2328288
  -0.22933097]
 [ 0.0789419   0.16689835 -0.11756516 ...  0.00448359  0.30474007
  -0.3419309 ]]


# Skip-Gram
Predicts context words based on a target word. Trains the model to maximize the probability of context words given a target word.

Pros: Effective in capturing syntactic and semantic relationships, works well with small datasets.

Cons: Computationally more expensive than CBOW.

In [26]:
from gensim.models import Word2Vec

def word2vec_embedding_sg(texts):
    model = Word2Vec(texts, vector_size=200, window=6, min_count=1, workers=4,sg=1)
    word_vectors = model.wv
    #print(word_vectors)

    def get_word2vec_embeddings(text, word_vectors):
        embeddings = [word_vectors[word] for word in text if word in word_vectors]
        if embeddings:
            return np.mean(embeddings, axis=0)
        else:
            return np.zeros(200)

    embeddings = np.array([get_word2vec_embeddings(text, word_vectors) for text in texts])
    return embeddings

embeddings_w2v_sg = word2vec_embedding_sg(data['Tokens'])
print(embeddings_w2v_sg)
X_train_w2v2, X_test_w2v2, y_train_w2v2, y_test_w2v2 = train_test_split(embeddings_w2v_sg, labels, test_size=0.2, random_state=42)


[[ 0.17109032  0.033767   -0.18476258 ... -0.10014948  0.10881008
  -0.11326653]
 [ 0.22616751  0.03877444 -0.25352594 ... -0.07180958  0.15895079
  -0.16575725]
 [ 0.1826731   0.00148701 -0.18673538 ... -0.11800708  0.13921864
  -0.17711399]
 ...
 [ 0.17524712 -0.0284111  -0.1633537  ... -0.10352435  0.1074788
  -0.13104244]
 [ 0.17381686  0.02411527 -0.19371334 ... -0.06752601  0.15991086
  -0.16255414]
 [ 0.18425429  0.02087011 -0.19497301 ... -0.12127761  0.12777068
  -0.16926508]]


In [27]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

def train_evaluate_rf(X_train_emb, X_test_emb, y_train, y_test):
    rf = RandomForestClassifier()
    rf.fit(X_train_emb, y_train)
    y_pred = rf.predict(X_test_emb)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report


In [28]:

accuracy_one_hot, report_one_hot = train_evaluate_rf(X_train_one_hot, X_test_one_hot, y_train_one_hot, y_test_one_hot)
print(f'One Hot Encoding Accuracy: {accuracy_one_hot}')
print(f'One Hot Encoding Classification Report:\n{report_one_hot}')


One Hot Encoding Accuracy: 0.6724431818181819
One Hot Encoding Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.79      0.74      2094
           1       0.62      0.50      0.55      1426

    accuracy                           0.67      3520
   macro avg       0.66      0.64      0.65      3520
weighted avg       0.67      0.67      0.66      3520



In [29]:
accuracy_tf, report_tf = train_evaluate_rf(X_train_tf, X_test_tf, y_train_tf, y_test_tf)
print(f'Term Frequency Accuracy: {accuracy_tf}')
print(f'Term Frequency Classification Report:\n{report_tf}')

Term Frequency Accuracy: 0.6764204545454545
Term Frequency Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.80      0.75      2094
           1       0.63      0.50      0.56      1426

    accuracy                           0.68      3520
   macro avg       0.66      0.65      0.65      3520
weighted avg       0.67      0.68      0.67      3520



In [30]:
accuracy_tfidf, report_tfidf = train_evaluate_rf(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf)
print(f'TF-IDF Accuracy: {accuracy_tfidf}')
print(f'TF-IDF Classification Report:\n{report_tfidf}')

TF-IDF Accuracy: 0.6849431818181818
TF-IDF Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.83      0.76      2094
           1       0.65      0.48      0.55      1426

    accuracy                           0.68      3520
   macro avg       0.68      0.65      0.65      3520
weighted avg       0.68      0.68      0.67      3520



In [31]:
accuracy_w2v1, report_w2v1 = train_evaluate_rf(X_train_w2v1, X_test_w2v1, y_train_w2v1, y_test_w2v1)
print(f'Word2Vec CBOW Accuracy: {accuracy_w2v1}')
print(f'Word2Vec CBOW Classification Report:\n{report_w2v1}')


Word2Vec Accuracy: 0.625
Word2Vec Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.81      0.72      2094
           1       0.56      0.35      0.43      1426

    accuracy                           0.62      3520
   macro avg       0.60      0.58      0.58      3520
weighted avg       0.61      0.62      0.60      3520



# Word2Vec CBOW observations:
                            vector_size     window_size     Accuracy
                                50              3              62.5
                                100             3              62.9
                                100             4              63.4
                                150             4              62.5
                                200             4              62.4
                                200             5              62.4
                                250             5              62.6
                                300             5              62.3

In [32]:
accuracy_w2v2, report_w2v2 = train_evaluate_rf(X_train_w2v2, X_test_w2v2, y_train_w2v2, y_test_w2v2)
print(f'Word2Vec skip-gram Accuracy: {accuracy_w2v2}')
print(f'Word2Vec skip-gram Classification Report:\n{report_w2v2}')


Word2Vec Accuracy: 0.6551136363636364
Word2Vec Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.84      0.74      2094
           1       0.62      0.38      0.47      1426

    accuracy                           0.66      3520
   macro avg       0.64      0.61      0.61      3520
weighted avg       0.65      0.66      0.63      3520




# Word2Vec skip-gram observations:
                            vector_size     window_size     Accuracy
                                50              3              63.8
                                100             3              65.3
                                100             4              64.3
                                150             4              64.0
                                150             5              65.1
                                200             5              64.2
                                200             6              67.9
                                250             6              66.7
                                300             6              65.5