# Embedding

## Tokenization
Splitting text into smaller units called tokens. These tokens can be words, subwords, or even characters.

#### Special Tokens
[CLS] -> Added at the beginning of each input sequence and used for classification tasks (some tokenizers write it as `<s>`). <br>
[SEP] -> Separates two sentences in a sequence (some tokenizers write it as `</s>`). <br>
[UNK] -> Replaces a word that is not found in the vocabulary. <br>
`##` -> Continuation prefix: the token is a continuation of the previous token, not the start of a new word.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

In [None]:
tokens = tokenizer.tokenize("This is f* **ed and crazyyyy")
print(tokens)

In [None]:
sentences = [
    "It was a great discussion",
    "Calley comes off shady AF.",
    "Total Jack count anyone?",
    "Wow is Calley tongue tied."
]
batch = tokenizer(sentences, padding=True)
print(batch)

In [None]:
for ids in batch["input_ids"]:
    print(tokenizer.decode(ids))

## Transformer Architecture
<img src="images/transformer-architecture.png">

#### Inputs
Input Embedding + Positional Encoding. 

**Padding** <br>
All input sequences in a batch must have the same length. <br>
Shorter sentences are padded with a special token, usually [PAD]. <br>
Texts that are longer than the maximum length the model can handle are either truncated (cut off at the maximum length) or split into multiple segments.
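
A minimal sketch of padding and truncation on lists of token ids. The helper name `pad_or_truncate`, the pad id `0`, and the toy ids are assumptions for illustration; a real tokenizer (for example `tokenizer(sentences, padding=True, truncation=True)`) also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad token_ids to exactly max_length and build the attention mask."""
    ids = list(token_ids[:max_length])      # truncate if too long
    mask = [1] * len(ids)                   # 1 = real token
    padding = max_length - len(ids)         # 0 when the sequence was truncated
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


# Hypothetical token ids, one short and one long sequence.
toy_batch = [[5, 17, 9], [5, 17, 23, 42, 8, 9]]
for ids in toy_batch:
    print(pad_or_truncate(ids, max_length=5))
```

The attention mask tells the model which positions are padding so they can be ignored.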

**Input Embedding** <br>
First, every word is given a vector from a pre-trained "Embedding Space". <br>
Existing models such as GloVe (Global Vectors for Word Representation) provide these vectors. <br>
<img src="images/embedding-space.png" width="200" height="200">

**Positional Encoder (PE)** <br>
A vector that gives context based on the position of a word in the sentence. <br>
The positional encodings have the same dimension d<sub>model</sub> as the embeddings, so that the two can be summed. <br>
In the formula, pos is the position and i is the dimension. <br>
<img src="images/positional-encoder.png" width="300" height="200">
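
A short sketch of a positional encoding, assuming the image shows the sinusoidal formulation from the original Transformer paper (sine on even dimensions, cosine on odd dimensions). The function name and the toy sizes are illustrative only.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for dim in range(0, d_model, 2):
            # dim is the even index 2i, so the exponent is 2i / d_model.
            angle = pos / (10000 ** (dim / d_model))
            pe[pos][dim] = math.sin(angle)
            if dim + 1 < d_model:
                pe[pos][dim + 1] = math.cos(angle)
    return pe


pe = positional_encoding(num_positions=4, d_model=8)

# Same dimension as the embedding, so the two vectors can be summed.
embedding = [0.1] * 8                       # hypothetical word embedding
model_input = [e + p for e, p in zip(embedding, pe[2])]
print(len(model_input))
```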

#### Encoding Block
**Multi-Head Attention** <br>
How relevant is the i-th word to the other words? <br>
For every word, an attention vector is generated that captures the contextual relationships between words. <br>
<img src="images/attention-vectors.png" width="400" height="200"> <br>

##### Attention Nerdy details
Each word vector is projected into 3 vectors of the same dimension. <br>
Q - query: what I'm looking for. <br>
K - key: what I can offer. <br>
V - value: what I actually have to offer. <br>

How are Q, K, V calculated? <br>
For each of the h attention heads, the model stores three weight matrices. <br>
W<sub>Q</sub><sup>(i)</sup>: The weight matrix for the Query projection. <br>
W<sub>K</sub><sup>(i)</sup>: The weight matrix for the Key projection. <br>
W<sub>V</sub><sup>(i)</sup>: The weight matrix for the Value projection. <br>
Each of these matrices has dimensions (d<sub>model</sub>, d<sub>k</sub>), where: <br>
d<sub>model</sub> is the dimension of the input embeddings (e.g., 768 for BERT base). <br>
d<sub>k</sub> is the dimension of the key/query/value vectors for each head (typically d<sub>model</sub>/h). <br>
Total number of weights: each head has d<sub>model</sub> * d<sub>k</sub> weights in each of the three matrices, so for h heads the total is 3 * h * d<sub>model</sub> * d<sub>k</sub>. Since d<sub>k</sub> is typically d<sub>model</sub>/h, this simplifies to 3 * d<sub>model</sub><sup>2</sup>.
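
The projection and the weight count can be checked with a toy sketch. The sizes (`d_model = 8`, `h = 2`), the random weights, and the helper names are assumptions for illustration; biases and the output projection are left out.

```python
import random

d_model, h = 8, 2            # toy sizes, far smaller than a real model
d_k = d_model // h


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(x, w):
    """Multiply a row vector x (length d_model) by a matrix w (d_model x d_k)."""
    return [sum(x[r] * w[r][c] for r in range(len(x))) for c in range(len(w[0]))]


rng = random.Random(0)
# One W_Q, W_K, W_V per attention head.
heads = [
    {name: random_matrix(d_model, d_k, rng) for name in ("W_Q", "W_K", "W_V")}
    for _ in range(h)
]

x = [rng.uniform(-1, 1) for _ in range(d_model)]   # one word vector
q, k, v = (project(x, heads[0][name]) for name in ("W_Q", "W_K", "W_V"))

num_weights = sum(len(w) * len(w[0]) for head in heads for w in head.values())
print(len(q), len(k), len(v))            # each has length d_k
print(num_weights, 3 * d_model ** 2)     # the two counts agree
```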

**TODO** Continue

##### Add & Norm
Takes the output matrix from attention and adds the input of that sub-layer back to it (for the first block, the calculated input embedding), then normalizes the sum. <br>
This keeps a stronger information signal flowing through deep networks. It is required because of the vanishing gradient problem in backpropagation (gradients shrink toward 0 as they pass back through many layers). <br>
To prevent this, we reintroduce the signal from the inputs at different parts of the network.
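
A minimal sketch of this step for a single vector, assuming the usual residual connection followed by layer normalization. The learnable scale and shift of a real layer norm are omitted, and the function names and numbers are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(sublayer_input, sublayer_output):
    # "Add": the input skips around the sub-layer and is summed with its output.
    residual = [a + b for a, b in zip(sublayer_input, sublayer_output)]
    # "Norm": the sum is normalized before going to the next sub-layer.
    return layer_norm(residual)


out = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5])
print(out)
```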

### Preprocessing
<img src="images/preprocessing.png">

##### Tokenization
WordPiece for BERT
1. Remove numbers (2, 1 . . . ). 
2. Remove punctuation marks (‘!’, ’, -, ”, :, ?, [], \, . . . ).
3. Remove special characters (~, @, #, $, %, &, =, +). 
4. Remove symbols (e.g., ). 
5. Remove non-English words, such as اسم. 
6. Remove words with fewer than three letters.
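
The cleaning steps above can be sketched with regular expressions. This is an approximation under stated assumptions: the function name `clean_comment` is hypothetical, punctuation, special characters, and symbols are all handled by one "not a letter, digit, or space" pattern, and "non-English" is crudely approximated as "contains non-ASCII characters".

```python
import re


def clean_comment(text):
    text = re.sub(r"\d+", " ", text)             # 1. remove numbers
    text = re.sub(r"[^\w\s]|_", " ", text)       # 2-4. punctuation, special characters, symbols
    words = text.split()
    words = [w for w in words if w.isascii()]    # 5. crude proxy for non-English words
    words = [w for w in words if len(w) >= 3]    # 6. drop words with fewer than three letters
    return " ".join(words)


print(clean_comment("Wow!!! 2 great talks @ 10pm, اسم is a name :)"))
# → Wow great talks name
```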


##### Lemmatization
Determines the base form of a word, or lemma. It considers the context of the word and aims to produce actual words from the dictionary. <br>
Example: <br>
Original words: "better," "good," "best" <br>
Lemmatized words: "good," "good," "good"

Source: A systematic review of text stemming techniques.

##### Stop words
Common words in a language that are often filtered out in text analysis tasks because they carry little semantic meaning and contribute minimally to the overall understanding of a text.

Examples of common English stop words include: <br>
Articles: the, a, an <br>
Prepositions: in, on, at, with, for <br>
Conjunctions: and, or, but <br>
Pronouns: I, you, he, she, it, they, we <br>
Auxiliary Verbs: is, am, are, was, were, will, shall, can, could, may, might, must

**Why remove stop words?** <br>
Reduced Dimensionality: Removing stop words shrinks the vocabulary, which makes the text data easier to process and analyze. <br>
Improved Performance: Removing stop words can improve the performance of NLP tasks such as text classification, sentiment analysis, and information retrieval, especially when the features are word counts. <br>
Focus on Key Words: Filtering out stop words leaves the most important words in the text, which can lead to more accurate and meaningful analysis.
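
A small sketch of stop word filtering, using only the example words listed above as the stop list. The names `STOP_WORDS` and `remove_stop_words` are illustrative; NLP libraries ship much longer lists.

```python
STOP_WORDS = {
    "the", "a", "an",                                   # articles
    "in", "on", "at", "with", "for",                    # prepositions
    "and", "or", "but",                                 # conjunctions
    "i", "you", "he", "she", "it", "they", "we",        # pronouns
    "is", "am", "are", "was", "were", "will", "shall",  # auxiliary verbs
    "can", "could", "may", "might", "must",
}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop words found in STOP_WORDS."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("It was a great discussion"))
# → ['great', 'discussion']
```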

##### Stemming
Reducing words to their root form by removing suffixes, prefixes, or other affixes. The result does not have to be a dictionary word. <br>
Example: <br>
Original words: "cats," "catlike," "catty"   
Stemmed words: "cat", "catlik", "catti"   

Source: Kaur, J.; Buttar, P.K. A systematic review on stopword removal algorithms. Int. J. Future Revolut. Comput. Sci. Commun. Eng. 2018, 4, 207–210.
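
A toy suffix-stripping stemmer that reproduces the three example stems above. It is a sketch with three hand-picked rules and a hypothetical name (`toy_stem`), not the Porter algorithm or any library's stemmer, and it will give poor stems for many other words.

```python
def toy_stem(word):
    """Apply the first matching suffix rule; real stemmers use many more rules."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # plural "s":   cats    -> cat
    if word.endswith("y") and len(word) > 3:
        return word[:-1] + "i"      # final "y":    catty   -> catti
    if word.endswith("e") and len(word) > 4:
        return word[:-1]            # final "e":    catlike -> catlik
    return word


print([toy_stem(w) for w in ["cats", "catlike", "catty"]])
# → ['cat', 'catlik', 'catti']
```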


Open question: can the length of the vector be changed from 512?

# Test Code

In [None]:
import os

# Read the API key from the environment instead of hardcoding it in the notebook.
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]

In [None]:
import google.generativeai as genai

genai.configure(api_key=GEMINI_API_KEY)

result = genai.embed_content(
        model="models/text-embedding-004",
        content="What is the meaning of life?")

print(str(result['embedding']))

In [None]:
import pandas as pd

datasetName = "jack_vs_calley_1000" 

# Load the comments dataset
df = pd.read_csv(f"../datasets/youtube-comments/{datasetName}.csv") 

In [None]:
df.to_csv(f"../datasets/youtube-comments/{datasetName}.csv", index=False)

In [None]:
import re

offensive_words = ["fuck", "fucked", "fucking", "shit", "bitch", "cunt", "ass", "damn", "hell"]  # Example

def guess_uncensored_word(censored_word):
    # Split the word into the letters before, the run of asterisks, and the letters after.
    match = re.match(r"([a-zA-Z]*)(\*+)([a-zA-Z]*)", censored_word)
    if match is None:
        return censored_word
    prefix, asterisks, suffix = match.groups()

    for word in offensive_words:
        if word.startswith(prefix) and word.endswith(suffix) and len(word) == len(prefix) + len(asterisks) + len(suffix):
            return word

    return censored_word

def uncensor(text):
    words = text.split()
    uncensored_words = []
    for word in words:
        if re.search(r"\*", word):
            guessed_word = guess_uncensored_word(word)
            uncensored_words.append(guessed_word)
        else:
            uncensored_words.append(word)
    return " ".join(uncensored_words)

text = "This is a f***ing test, with some sh*t and a b**ch"
uncensored_text = uncensor(text)
print(uncensored_text)

In [None]:
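from sentence_transformers import SentenceTransformer

# `comments` is assumed to be a list of comment strings defined earlier in the notebook.
sentence_transformer = SentenceTransformer('all-mpnet-base-v2')
embeddings = sentence_transformer.encode(comments)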
