<div style="color: #7b6b59; font-size: 30px; text-align: center;">Diverse Approaches to Document Classification</div>

## <span style="color: #7b6b59;">Introduction</span>

**Ways to Fine Tune the Model**

1. Feature extraction – We can use a pre-trained model as a fixed feature extractor. We remove the output layer (the one that gives the probabilities for each of the 1000 classes) and use the rest of the network to compute features for the new dataset.

1. Use the architecture of the pre-trained model – We keep the model's architecture but initialize all the weights randomly and train the model from scratch on our dataset.

1. Train some layers while freezing others – Another way to use a pre-trained model is to train it partially. We keep the weights of the initial layers frozen and retrain only the higher layers. How many layers to freeze and how many to train is something we can determine by experiment.
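The third option can be sketched in PyTorch by switching off gradients for the layers that should stay fixed. The small `nn.Sequential` network below is a hypothetical stand-in for a pre-trained model, not one used in this notebook; with a real model the same loop would run over its early layers.

```python
import torch.nn as nn

# Hypothetical stand-in for a pre-trained network:
# two early layers followed by a task-specific head.
model = nn.Sequential(
    nn.Linear(16, 32),  # early layer (index 0)
    nn.ReLU(),
    nn.Linear(32, 32),  # early layer (index 2)
    nn.ReLU(),
    nn.Linear(32, 2),   # classification head (index 4)
)

# Freeze every layer except the last one.
for layer in list(model.children())[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

# Only parameters with requires_grad=True will be updated by an optimizer.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]

print("trainable:", trainable)
print("frozen:", frozen)
```

Changing the slice (`[:-1]`) is the knob for trying different numbers of frozen layers.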


In [1]:
import torch
print("CUDA Available: ", torch.cuda.is_available())


CUDA Available:  False


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Import Python Libraries</div>


In [1]:
import os
import numpy as np
import pandas as pd
import plotly.express as px

from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer

from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer


import torch
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
#from torch.utils.data import DataLoader, Dataset




# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Load the Dataset</div>


In [2]:
train_prompts = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_prompts.csv")
train_essays = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_essays.csv")
test_essays = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/test_essays.csv")

In [3]:
train_prompts.head()

Unnamed: 0,prompt_id,prompt_name,instructions,source_text
0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
1,1,Does the electoral college work?,Write a letter to your state senator in which ...,# What Is the Electoral College? by the Office...


In [4]:
train_essays.head()

Unnamed: 0,id,prompt_id,text,generated
0,0059830c,0,Cars. Cars have been around since they became ...,0
1,005db917,0,Transportation is a large necessity in most co...,0
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0
3,00940276,0,How often do you ride in a car? Do you drive a...,0
4,00c39458,0,Cars are a wonderful thing. They are perhaps o...,0


In [5]:
train_essays.shape

(1378, 4)

In [6]:
test_essays.shape

(3, 3)

In [7]:
df = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv")
train_df = df[df.prompt_name != "Car-free cities"].reset_index(drop=True)
valid_df = df[df.prompt_name == "Car-free cities"].reset_index(drop=True)

In [8]:
train_df.shape

(40151, 5)

In [9]:
train_df.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Exploratory Data Analysis</div>


## <span style="color: #7b6b59;">Labels Distribution in Essay Data</span>

- `generated`: Whether the essay was written by a student (0) or generated by an LLM (1). This field is the target and is not present in test_essays.csv.


In [10]:
train_essays['generated'].value_counts()

generated
0    1375
1       3
Name: count, dtype: int64

In [11]:
train_essays['generated'].value_counts(normalize=True)

generated
0    0.997823
1    0.002177
Name: proportion, dtype: float64

# <div style="padding:20px;color:white;margin:0;font-size:24px;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Standard Approaches: Vectorization and Classic Machine Learning (ML) Model</div>

## <span style="color: #7b6b59;">Introduction</span>

A simple approach to text classification is to convert text passages into vectors and then use standard ML algorithms such as logistic regression or tree-based models. The key question then becomes: how do you transform a text passage into a vector?

### <span style="color: #7b6b59;">Option 1: TF-IDF, Sparse Vectorization and Classic Machine Learning (ML) Model</span>

TF-IDF (or **term frequency-inverse document frequency**) is one way to achieve this vectorization. It returns a vector with one dimension for each word in a given vocabulary. Each component of this vector reflects the frequency of the corresponding word in the input text compared to the entire collection of texts.

**TF-IDF has several drawbacks. It does not consider the order of the words in the text and it ignores the semantic similarity between words.** It also does not distinguish between the various meanings of a polysemous word (e.g., “sound” as in “a loud sound,” “they sound correct,” or “a sound proposal”).

### <span style="color: #7b6b59;">Option 2: Embeddings obtained from a pre-trained deep learning model, Dense Vectorization and Classic ML Model</span>

A more effective approach, in particular if the training dataset is relatively small, is to use the vector representations (or **sentence embeddings**) obtained from a pre-trained deep learning model such as BERT.

<img width="921" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/901505fd-908a-4810-8fb5-66022a0cbe76">

***Sparse vectorization with TF-IDF (left), dense vectorization with sentence embeddings (right)***

# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 1: Sparse Vectorization and Classic Machine Learning (ML) Model</div>

## <span style="color: #7b6b59;">Overview of vectorization options</span>

**Vectors & Word Embeddings: TF-IDF vs Word2Vec vs Bag-of-words vs BERT:**

As discussed above, TF-IDF can be used to vectorize text into a format more agreeable for ML & NLP techniques. However, while it is a popular NLP algorithm, it is not the only one out there.

1. **Bag of Words:** Bag of Words (BoW) simply counts the frequency of words in a document. Thus the vector for a document holds, for each word in the corpus vocabulary, its frequency in that document. The key difference between bag of words and TF-IDF is that the former does not incorporate any sort of inverse document frequency (IDF) and is only a frequency count (TF).

1. **Word2Vec:** Word2Vec is an algorithm that uses a shallow (2-layer, not deep) neural network to ingest a corpus and produce a set of vectors. A key difference is that TF-IDF is a statistical measure that we apply to the terms in a document and then use to form a document vector, whereas Word2Vec produces a vector for each term, so more work may be needed to combine that set of vectors into a single vector or other format. Additionally, TF-IDF does not take the context of the words in the corpus into consideration, whereas Word2Vec does.

1. **BERT - Bidirectional Encoder Representations from Transformers:** BERT is an ML/NLP technique developed by Google that uses a transformer-based model to convert phrases, words, etc. into vectors. The key differences between TF-IDF and BERT are as follows: TF-IDF does not take into account the semantic meaning or context of the words, whereas BERT does. BERT also uses deep neural networks as part of its architecture, so it can be much more computationally expensive than TF-IDF, which has no such requirements.
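To make the Bag of Words description concrete, here is a minimal sketch using only the standard library. The two documents and the helper name `bow_vector` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, split on whitespace for simplicity.
docs = ["the cat sat on the mat", "the dog sat"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct terms in the corpus.
vocabulary = sorted({token for doc in tokenized for token in doc})


def bow_vector(tokens, vocabulary):
    """Return raw term counts for one document, one entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


bow = [bow_vector(doc, vocabulary) for doc in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Each row is a pure frequency count (TF); no IDF weighting is applied, which is exactly what separates BoW from TF-IDF.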

**Feature Engineering with Bag-of-Words or TF-IDF:**

Instead of using deep learning methods, you might utilize statistical methods for text representation like Bag-of-Words or TF-IDF, combined with machine learning algorithms.


## <span style="color: #7b6b59;">TF-IDF</span>

Most machine learning algorithms are built on mathematics such as statistics, algebra, and calculus, and they expect numerical input, typically a 2-dimensional array with rows as instances and columns as features. Natural language data, however, comes as raw text, so the text must first be transformed into a vector. **The process of transforming text into a vector is commonly referred to as text vectorization.** It is a fundamental step in natural language processing because machine learning algorithms cannot operate on raw text directly.

TF-IDF is one of the most popular text vectorizers for traditional machine learning algorithms; the calculation is simple and easy to understand. It gives rare terms a high weight and common terms a low weight. TF-IDF stands for term frequency-inverse document frequency, a measure used in information retrieval (IR) and machine learning to quantify the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document relative to a collection of documents (also known as a corpus).

**Term frequency-inverse document frequency** is a text vectorizer that transforms text into a usable vector. It can be broken down into two parts, **TF (term frequency)** and **IDF (inverse document frequency)**, where IDF is derived from the document frequency (DF).


- **The term frequency** is the number of occurrences of a specific term in a document; it indicates how important that term is within the document. Across the dataset, term frequencies form a matrix whose rows are documents and whose columns are the distinct terms found in all documents. There are multiple measures, or ways, of defining frequency:

    - Number of times the word appears in a document (raw count).
    - Term frequency adjusted for the length of the document (raw count of occurrences divided by number of words in the document).
    - Logarithmically scaled frequency (e.g. log(1 + raw count)).
    - Boolean frequency (e.g. 1 if the term occurs, or 0 if the term does not occur, in the document).

- **Document frequency** is the number of documents containing a specific term. Document frequency indicates how common the term is.

- **Inverse document frequency (IDF)** is the weight of a term. It aims to reduce the weight of a term whose occurrences are scattered throughout all the documents. IDF can be calculated as follows:

    <img width="781" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/e321a50a-138a-438b-9ee4-9320d21a8aed">
    
    Where idfᵢ is the IDF score for term i, dfᵢ is the number of documents containing term i, and n is the total number of documents. The higher the DF of a term, the lower its IDF. When DF equals n, meaning the term appears in all documents, the IDF is zero, since log(1) is zero. Such a term provides little information, so when in doubt it can simply be added to the stopword list.

    **What is IDF (inverse document frequency)?** Inverse document frequency looks at how common (or uncommon) a word is in the corpus. The reason we need IDF is to correct for words like “of”, “as”, and “the”, which appear frequently in any English corpus. By taking the inverse document frequency, we minimize the weight of frequent terms and give infrequent terms a higher impact. IDFs can also be computed from either a background corpus, which corrects for sampling bias, or the dataset being used in the experiment at hand.

    IDF is calculated as follows, where t is the term (word) whose commonness we want to measure and N is the number of documents (d) in the corpus (D). The denominator is simply the number of documents in which the term t appears.

    <img width="706" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6012bdc9-5847-4234-9e8c-987adfd2828e">
    
    Note: A term may not appear in the corpus at all, which would result in a divide-by-zero error. One way to handle this is to add 1 to the existing count, making the denominator (1 + count). An example of how the popular library scikit-learn handles this can be seen below.
    
    <img width="739" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/68d7d8f0-84a7-4b3c-a811-4695eae291d9">

- The **TF-IDF score**, as the name suggests, is just the product of the term frequency and the IDF. It can be calculated as follows:
    
    <img width="692" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/3d3217ff-fccb-4012-a962-b70c8d40c379">
    
    Where wᵢⱼ is the TF-IDF score for term i in document j, tfᵢⱼ is the term frequency for term i in document j, and idfᵢ is the IDF score for term i. To summarize, the key intuition motivating TF-IDF is that the importance of a term is inversely related to its frequency across documents. TF tells us how often a term appears in a document, and IDF tells us about the relative rarity of the term in the collection of documents. By multiplying these values together we get our final TF-IDF value.
    
    <img width="710" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6ca35452-1cca-4f6b-b695-253cd13fe27a">


**The higher the TF-IDF score the more important or relevant the term is; as a term gets less relevant, its TF-IDF score will approach 0.**
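The two IDF variants discussed above can be compared numerically. This is a small sketch under two assumptions: the plain form is `log(n / df)` (consistent with the IDF being zero when a term appears in every document), and the smoothed form is the one scikit-learn documents for `smooth_idf=True`, `log((1 + n) / (1 + df)) + 1`. The function names are illustrative only.

```python
import math


def idf_plain(n_docs, df):
    """Plain IDF: log(n / df). Undefined when df == 0."""
    return math.log(n_docs / df)


def idf_smooth(n_docs, df):
    """Smoothed IDF as documented for scikit-learn's smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + df)) + 1


n_docs = 3
for df in (1, 2, 3):
    print(df, round(idf_plain(n_docs, df), 4), round(idf_smooth(n_docs, df), 4))

# The smoothed variant stays defined even for a term never seen in the corpus.
print(round(idf_smooth(n_docs, 0), 4))
```

With the plain form, a term found in all three documents gets weight 0; with the smoothed form it keeps a weight of 1, so it is down-weighted but not discarded.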


**Example:** Suppose we have 3 texts and we need to vectorize these texts using TF-IDF.

<img width="614" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/a4e32193-47f9-4291-bfd4-5a1f6046f481">

1. **Step 1:** Create a term frequency matrix where rows are documents and columns are distinct terms throughout all documents. Count word occurrences in every text.
    
    <img width="830" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/77e48f3a-13cf-4c51-bd45-32010ff239d7">

1. **Step 2:** Compute the inverse document frequency (IDF) using the previously explained formula. The terms "i" and "processing" have an IDF score of 0. As previously mentioned, we could drop these terms, but for the sake of simplicity we keep them here.
    
    <img width="839" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/0bc6eec5-5317-4694-9ebb-c84f9c1b9d88">
    
1. **Step 3:** Multiply each row of the TF matrix by the corresponding IDF values. That's it 😃! The text is now ready to feed into a machine learning algorithm.

     <img width="833" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/fd090353-4915-4f01-a88c-ee3e381e6382">
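The three steps can also be reproduced in a few lines of plain Python. The three texts below are hypothetical (they are not the ones from the figures), chosen so that "i" and "processing" appear in every document and therefore receive an IDF of 0; the plain `log(n / df)` form of IDF is assumed.

```python
import math
from collections import Counter

# Hypothetical texts; "i" and "processing" occur in all three.
texts = [
    "i love natural language processing",
    "i like image processing",
    "i enjoy signal processing and image processing",
]
tokenized = [text.split() for text in texts]
vocabulary = sorted({token for doc in tokenized for token in doc})

# Step 1: term frequency matrix (rows = documents, columns = terms).
tf = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

# Step 2: IDF per term, using the plain form log(n / df).
n_docs = len(tokenized)
df = [sum(1 for doc in tokenized if term in doc) for term in vocabulary]
idf = [math.log(n_docs / d) for d in df]

# Step 3: multiply every TF row element-wise by the IDF values.
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

for term, weight in zip(vocabulary, idf):
    print(f"{term:12s} idf={weight:.4f}")
```

Terms that occur in a single document ("love", "signal", ...) end up with the largest weights, while the columns for "i" and "processing" are zero in every row.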


### <span style="color: #7b6b59;">Pros of using TF-IDF</span>

The biggest advantages of TF-IDF come from how simple and easy to use it is. It is simple to calculate, it is computationally cheap, and it is a simple starting point for similarity calculations (via TF-IDF vectorization + cosine similarity).


### <span style="color: #7b6b59;">Limitations, Cons of using TF-IDF</span>

1. It is only useful as a lexical level feature.

1. Synonyms are not recognized.

1. It doesn't capture semantics. TF-IDF cannot carry semantic meaning. It assigns importance to words through the way it weighs them, but it cannot derive the context of the words and understand their importance that way.

1. The terms with the highest TF-IDF scores may not reflect the topic of the document, since IDF gives a high weight to any term whose DF is low.

1. It neglects the sequence of the terms. Like BoW, TF-IDF ignores word order, so compound nouns like “Queen of England” are not treated as a single unit. The same applies to negation, as in “not pay the bill” vs “pay the bill”, where the order makes a big difference. In both cases, using NER tools and joining words with underscores (“queen_of_england”, “not_pay”) is one way to treat the phrase as a single unit. Ignoring a word's position can be problematic for applications such as sentiment analysis, where word order can be crucial for determining the sentiment of a document.

1. It can be memory-inefficient and suffer from the curse of dimensionality. Recall that the length of a TF-IDF vector equals the size of the vocabulary. In some classification contexts this may not be an issue, but in others, like clustering, it can become unwieldy as the number of documents increases, so looking into some of the alternatives named above (BERT, Word2Vec) may be necessary. **Vocabulary size:** With large datasets the vocabulary can become very large, leading to high-dimensional feature spaces and results that are difficult to interpret.

1. Assumes independence: TF-IDF assumes that the terms in a document are independent of each other. However, this is often not the case in natural language, where words are often related to each other in complex ways.
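The word-order limitation, and the underscore workaround mentioned in the list, can be illustrated with the standard library alone. The sentences and the helper `join_phrases` are hypothetical; in practice the phrases to join might come from an NER tool.

```python
from collections import Counter

# Two sentences with opposite meanings but identical word counts.
a = "the dog bit the man"
b = "the man bit the dog"
same_counts = Counter(a.split()) == Counter(b.split())
print(same_counts)


def join_phrases(text, phrases):
    """Replace each listed phrase with an underscore-joined single token."""
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


joined = join_phrases("he did not pay the bill", ["not pay"])
print(joined.split())
```

Because the count-based representation of `a` and `b` is identical, any model built on it cannot tell them apart; joining "not pay" into one token preserves the negation as a separate feature.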

### <span style="color: #7b6b59;">Where to use TF-IDF</span>

As we can see, TF-IDF can be a very handy metric for determining how important a term is in a document. But how is TF-IDF used? There are three main applications for TF-IDF. These are in machine learning, information retrieval, and text summarization/keyword extraction.


1. **Using TF-IDF in machine learning & natural language processing:** Machine learning algorithms often use numerical data, so when dealing with textual data or any natural language processing (NLP) task, a sub-field of ML/AI dealing with text, that data first needs to be converted to a vector of numerical data by a process known as vectorization. TF-IDF vectorization involves calculating the TF-IDF score for every word in your corpus relative to that document and then putting that information into a vector (see images above). Thus each document in your corpus would have its own vector, and the vector would have a TF-IDF score for every single word in the entire collection of documents. ***Once you have these vectors you can apply them to various use cases such as seeing if two documents are similar by comparing their TF-IDF vector using cosine similarity.***

1. **Using TF-IDF in information retrieval:** TF-IDF also has use cases in the field of information retrieval, with one common example being search engines. Since TF-IDF can tell you about the relative importance of a term within a document, a search engine can use TF-IDF to help rank search results based on relevance, with results that are more relevant to the user having higher TF-IDF scores.

1. **Using TF-IDF in text summarization & keyword extraction:** Since TF-IDF weights words based on relevance, one can use this technique to determine that the words with the highest relevance are the most important. This can be used to help summarize articles more efficiently or simply to determine keywords (or even tags) for a document. **Measures relevance:** TF-IDF measures the importance of a term in a document based on the frequency of the term in the document and the inverse document frequency (IDF) of the term across the entire corpus. This helps to identify which terms are most relevant to a particular document.

1. **Interpretable:** The scores generated by TF-IDF are easy to interpret and understand, as they represent the importance of a term in a document relative to its importance across the entire corpus.

1. **Works well with different languages:** TF-IDF can be used with different languages and character encodings, making it a versatile technique for processing multilingual text data.
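The document-similarity use case mentioned in the first item can be sketched with scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The three short documents are made up for illustration and default vectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents: the first two share a topic, the third does not.
docs = [
    "cars cause traffic and pollution in cities",
    "cities limit cars to reduce pollution",
    "the electoral college elects the president",
]

# One TF-IDF vector per document.
X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between all document vectors.
sim = cosine_similarity(X)
print(sim.round(3))
```

Documents 0 and 1 share several terms and therefore score higher than documents 0 and 2, which have no terms in common.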

### <span style="color: #7b6b59;">Conclusion</span>

TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document. It’s a relatively simple but intuitive approach to weighting words, allowing it to act as a great jumping off point for a variety of tasks. This includes building search engines, summarizing documents, or other tasks in the information retrieval and machine learning domains.


### <span style="color: #7b6b59;">How to implement TF-IDF with scikit-learn</span>

1. Thanks to the `TfidfVectorizer` class, implementing TF-IDF with `scikit-learn` is a fairly straightforward process. The first step is importing `TfidfVectorizer` and creating a list of documents to analyze and convert into TF-IDF features.

1. Next, create an instance of the `TfidfVectorizer` class with the desired customization options, such as tokenization patterns, stopword removal or IDF smoothing parameters.

1. Then, to fit and transform the corpus, call the `fit_transform()` method on the vectorizer instance and pass in the corpus. This computes term frequencies and inverse document frequencies while transforming the text data into a matrix of TF-IDF features.

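1. Finally, call `get_feature_names_out()` to inspect the feature names, and convert the resulting sparse matrix to a dense array with `toarray()` to see the corresponding TF-IDF values. (Older scikit-learn versions exposed this method as `get_feature_names()`.)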

By following these steps, you can implement TF-IDF with scikit-learn and transform your raw text data into valuable numerical representations for further analysis or feeding into machine learning models.

When using the `TfidfVectorizer` from `scikit-learn`, **you do not necessarily need to tokenize the text yourself before passing it to the vectorizer**. TfidfVectorizer has built-in capabilities to tokenize and preprocess the text. Here's how it works by default:

1. **Default Tokenization in TfidfVectorizer:**

    - **Tokenization:** By default, TfidfVectorizer tokenizes the text by extracting word tokens and ignores punctuation and whitespace. This is typically done using a regular expression that defines what constitutes a token (word). The default pattern is `r"(?u)\b\w\w+\b"`, which captures sequences of alphanumeric characters (words) that are at least two characters long. This pattern is specified in the token_pattern argument. So while spaces between words usually signify where one word ends and another begins (and thus often correspond to word boundaries), the regex isn't splitting text directly on spaces. Instead, it's looking for those alphanumeric sequences that are bounded by non-word characters or the edges of the string, which more robustly constitutes what we think of as whole, standalone words. This method is more reliable because:
        - **It ignores punctuation:** For example, in "end-of-sentence.", the period is not part of the last word, and the pattern correctly excludes it from the token "sentence".
        - **It handles complex word separations:** Not all words are neatly separated by spaces, especially in languages with different scripts or in cases with punctuation like hyphens, apostrophes, etc. The pattern correctly identifies words in many of these cases.
        In summary, while spaces are a significant part of how the pattern determines where words begin and end, the actual process involves identifying sequences of word characters that are delineated by word boundaries, which provides a more nuanced and effective approach to word tokenization in varied text environments.
        
    - **Preprocessing:** It converts all characters to lowercase (unless you set lowercase=False) and performs normalization, such as accent stripping, if specified.

1. **Customization Options:**
    1. **Custom Tokenizer:** You can provide a custom tokenizer function to the `tokenizer` parameter. This function takes a string as input and returns a list of tokens. If you have specific tokenization needs (e.g., handling special cases, working with a non-standard text format), you might implement and use your custom tokenizer.

    1. **Custom Preprocessor:** Similarly, you can provide a custom preprocessing function to the `preprocessor` parameter. This function also takes a string as input and returns the processed string. It's applied to the text before tokenization.
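The default tokenization can be reproduced with Python's `re` module, which makes it easy to see what the pattern keeps and what it drops. The sample sentence is made up; lowercasing is applied first to mirror the default `lowercase=True`.

```python
import re

# The default token_pattern quoted above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "It's a well-known fact: I read 2 books, end-of-sentence."

# Lowercase first (the vectorizer's default), then extract tokens.
tokens = re.findall(TOKEN_PATTERN, text.lower())
print(tokens)
```

Punctuation never ends up inside a token, hyphenated words are split into their parts, and single-character tokens such as "a", "i", "2", and the "s" after the apostrophe are dropped because the pattern requires at least two word characters.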


***Should You Tokenize Beforehand?***

- **Usually Unnecessary:** For standard text processing needs, the default behavior of TfidfVectorizer is often sufficient. It is designed to handle typical cases of text vectorization, including tokenization and case normalization.

- **Custom Needs:** If your text data requires specialized handling, such as dealing with a particular language's nuances, handling mixed text types, or integrating with an existing text processing pipeline, you might perform tokenization (and other text preprocessing) before vectorization. In such cases, you could use the tokenizer and preprocessor parameters to integrate your custom functions.


1. **`ngram_range=(1, 3)`:** This parameter defines the range of n-gram sizes to include in the token counts. (1, 3) means that it will consider unigrams (single words), bigrams (two consecutive words), and trigrams (three consecutive words) as individual features for vectorization. Essentially, it's looking at the individual words, pairs of consecutive words, and triplets of consecutive words when creating the vectors.

1. **`sublinear_tf=True`:** This parameter applies sublinear tf scaling, i.e., it replaces term frequency (tf) with 1 + log(tf). The idea is to reduce the sensitivity of the vectorizer to terms that occur very frequently and therefore might skew the results disproportionately. It's a way to temper the effect of terms that appear very often and might dominate the feature set. By transforming the frequency to the logarithmic scale, increases in term frequency have a gradually smaller effect on the computation of TF-IDF.

1. **`lowercase=False`:** This indicates that the text will not be automatically converted to lowercase before tokenizing. By default, TfidfVectorizer converts all characters to lowercase to ensure that the same words in different cases are counted as the same token.

1. **`analyzer='word'`:** This parameter sets the unit of features to words. Other options might include 'char' or 'char_wb' for character n-grams. 'word' means it will consider tokens of words as the feature base.

1. **`tokenizer=dummy`:** This specifies a custom tokenizer function. Typically, TfidfVectorizer tokenizes the string by extracting words of at least two letters. By setting tokenizer to 'dummy', you are replacing the default tokenizer with your own custom function named dummy. This function will be used to split the text into tokens.

1. **`token_pattern=None`:** Normally, this parameter defines the regex pattern that the tokenizer uses to find tokens in the text string. By setting it to None, and providing a custom tokenizer, you're effectively ignoring the default regex pattern and relying entirely on the custom tokenizer you've provided.

1. **`preprocessor=dummy`:** Similar to the tokenizer, this specifies a custom preprocessing function. The default preprocessor in TfidfVectorizer lowercases the text and, if requested, strips accents. By setting it to `dummy`, you specify that your own custom function named dummy should be used instead; because it replaces the built-in preprocessing stage, the `lowercase` and `strip_accents` settings no longer take effect.

1. **`strip_accents='unicode'`:** This removes accents and other diacritical marks during the preprocessing step, so that accented variants of a word are treated alike. `'unicode'` is a slightly slower method that works on any character, whereas `'ascii'` is faster but only works on characters that have a direct ASCII mapping.



1. **Fitting the Vectorizer:** Initially, when we fit TfidfVectorizer to our documents (e.g., using vectorizer.fit(texts)), it learns the vocabulary of the corpus, meaning it identifies all unique terms used across all documents, considering the constraints and specifications we've given it (like token patterns, n-grams, etc.).

1. **Building the Vocabulary Dictionary:** After fitting, the vectorizer has a complete list of terms used in the documents. It then creates a mapping of these terms to specific indices. This mapping is stored in `vectorizer.vocabulary_`.
    
    - Keys: Each unique term or token found in the corpus.
    - Values: A unique integer index corresponding to each term. This index is used when creating the sparse matrix representation of the documents where each term's TF-IDF score will be placed.
    
    ```python
    {
        'galaxy': 123,
        'black hole': 15,
        'star cluster': 678,
        'nebula': 321,
        ...
    }
    ```
    In this hypothetical vocabulary:

    - The term 'galaxy' is found at column index 123 in the TF-IDF matrix.
    - The term 'black hole' is found at column index 15, and so on.

1. When you **transform** your documents into their TF-IDF representation using the fitted vectorizer (via vectorizer.transform(texts)), each document is represented as a sparse vector with the length of the total vocabulary, where most values are zero except for the indices corresponding to the terms present in the document.
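The fit, vocabulary, and transform steps above can be tried on a toy corpus. The texts are hypothetical and `ngram_range=(1, 2)` is chosen only so that two-word terms such as "black hole" appear in the vocabulary; the actual column indices depend on the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training texts.
train_texts = ["black hole near the galaxy", "star cluster in the nebula"]

# Fit: learn the vocabulary (unigrams and bigrams) and IDF weights.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(train_texts)

# The term -> column index mapping.
vocab = vectorizer.vocabulary_
col = vocab["black hole"]
print("column of 'black hole':", col)

# Transform: a new document is mapped onto the same columns.
new_vec = vectorizer.transform(["a black hole and an unseen quasar"])
print(new_vec.shape, "non-zero entries:", new_vec.nnz)
```

Terms that were not seen during fitting ("unseen", "quasar") have no column and are silently ignored, which is why the new vector has only three non-zero entries ("black", "hole", "black hole").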


In [12]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3),sublinear_tf=True)
X = vectorizer.fit_transform(train_df["text"])

In [13]:
# Inspect feature names and TF-IDF values 
print(vectorizer.get_feature_names_out()) 

['00' '00 00' '00 00 and' ... '完全禁止使用手机应该是合法和道路安全的唯一选择'
 '完全禁止使用手机应该是合法和道路安全的唯一选择 保护所有道路使用者的安全'
 '完全禁止使用手机应该是合法和道路安全的唯一选择 保护所有道路使用者的安全 司机必须在驾驶时将全部注意力都集中在道路上']


In [14]:
# get the first vector out (for the first document) 
first_vector_tfidfvectorizer=X[0] 

# place tf-idf values in a pandas data frame 
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

Unnamed: 0,tfidf
be in contact,0.067319
way how,0.059980
always on their,0.059508
are always on,0.055284
in contact,0.051028
...,...
further without,0.000000
further with you,0.000000
further with this,0.000000
further with their,0.000000


The `vocab = vectorizer.vocabulary_` line returns a Python dictionary from the fitted `TfidfVectorizer` object. The dictionary's keys are the terms (or tokens) found in the document corpus, and the values are the column indices of these terms in the resulting TF-IDF matrix. This vocabulary is crucial because it maintains a consistent mapping of terms to indices, ensuring that when you transform new documents into vectors, the terms align correctly with the learned model's features. It's essential both for understanding the feature space of your model and for preparing new text inputs for predictions or further analysis with the trained vectorizer.



In [15]:
# Getting vocab
vocab = vectorizer.vocabulary_


# <div style="padding:20px;color:white;margin:0;font-size:24px;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 2: Extracting embeddings from pre-trained models and Classic Machine Learning (ML) Model - Transfer Learning without Fine-Tuning</div>



Instead of fine-tuning a pre-trained model, you could use it as a feature extractor. For instance, you can pass your documents through a pre-trained model (like BERT) to get embeddings and then train a simpler machine learning model (like Logistic Regression) on those features.

**Extracting embeddings from pre-trained BERT | Hugging Face Transformers**

## <span style="color: #7b6b59;">Overview</span>


Hugging Face grew out of the need to standardize how language models are trained and used. It democratizes NLP through an API that gives easy access to pre-trained models, datasets, and tokenizers. Here we use Hugging Face's `transformers` library with a pre-trained BERT-family model to extract the embeddings.

## <span style="color: #7b6b59;">How to use embeddings for feature extraction?</span>

Now, let’s talk about how you can use BERT with your text. The BERT model learns a rich representation of the English language, which can help you extract different aspects of a text for various tasks. If you have a set of sentences with labels, you can train a regular classifier that takes the embeddings produced by the BERT model for each text as input. The code below obtains these features with the Hugging Face `transformers` library; the active lines use the PyTorch classes, and the commented-out lines show the TensorFlow variant.

**How to use embeddings to extract information from text column?**

We are going to take advantage of the Hugging Face 🤗 framework to extract information from the text column.

1. **Step 1:** First, we need to import the model and the tokenizer. There are different models that we can try, and you can check them here: https://huggingface.co/models?pipeline_tag=feature-extraction It is important to use the model’s own tokenizer so that the model receives the data in the format it expects; tokenizers are also useful because they already normalize the text for you. Each tokenizer deals with the data differently, so it is worth reading about the one you pick.

1. **Step 2:** Second, we extract the hidden state associated with the `[CLS]` token, which represents the entire sequence of text. Rather than dealing with a 768-dimensional vector for each token in a string, we only need to deal with one vector per text (the dimension, 768 here, varies from model to model).





In [16]:
# Step 1: We need to import the model and the tokenizer
#tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
#model = TFBertModel.from_pretrained("bert-base-cased")

from transformers import AutoModel, AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)


#custom_text = "You are welcome to utilize any text of your choice."
#encoded_input = tokenizer(custom_text, return_tensors='tf')
#output_embeddings = model(encoded_input)


In [17]:
# Step 2: we extract the hidden state associated to the token CLS
#train_df["embeddings"] = train_df["text"].apply(lambda x: model(**tokenizer(x, return_tensors="pt", truncation=True)).last_hidden_state[:,0,:].detach().numpy()[0])

The commented-out lines in Step 1 are a straightforward example of how to use the BERT tokenizer and the TensorFlow BERT model from the Hugging Face `transformers` library for encoding text into embeddings. Here's a breakdown of what each part does:

1. **Importing the Necessary Classes:**
    - **BertTokenizer:** A tokenizer class for BERT. It handles the conversion from text to tokens that BERT understands.
    - **TFBertModel:** The BERT model class compatible with TensorFlow.

1. **Using the BERT Tokenizer and Model:**
    
    1. **Load Pre-trained Models:**
        - `tokenizer = BertTokenizer.from_pretrained('bert-base-cased')`: Loads the BERT tokenizer for the 'bert-base-cased' version. This tokenizer is responsible for breaking the text down into tokens that BERT can understand.
        - `model = TFBertModel.from_pretrained("bert-base-cased")`: Loads the pre-trained BERT model. This model will generate embeddings for the input text.

1. **Prepare Custom Text:**

    - `custom_text = "You are welcome to utilize any text of your choice."`: A sample text that you want to convert into embeddings.

1. **Tokenize the Text:**
    - `encoded_input = tokenizer(custom_text, return_tensors='tf')`: The tokenizer converts the text into a format suitable for the BERT model. The `return_tensors='tf'` argument tells the tokenizer to return TensorFlow tensors.

1. **Generate Embeddings:**

    - `output_embeddings = model(encoded_input)`: Passes the tokenized input to the BERT model. The model returns the embeddings, which are a rich, contextual representation of each token in the input text.

1. **Understanding the Output:** The `output_embeddings` returned by the model is typically a complex structure containing several types of embeddings:

    - **Last Hidden State:** The output corresponding to the last layer of the BERT model, which gives you the embeddings for each token in the input sequence.
    - **Pooler Output:** A pooled output of the last hidden state, which represents the entire input sequence, often used in classification tasks.

To inspect the dimensions of `output_embeddings`, you would typically focus on these two parts:

- `output_embeddings.last_hidden_state.shape` will give you the dimensions of the last hidden state, which is usually of the form `[batch_size, sequence_length, hidden_size]`.
- `output_embeddings.pooler_output.shape` will give you the dimensions of the pooled output, typically `[batch_size, hidden_size]`.

Understanding these dimensions:

- **batch_size:** The number of sequences processed at a time (for your case, it will be 1 as you're processing a single sentence).
- **sequence_length:** The length of the tokenized input (number of tokens).
- **hidden_size:** The size of the hidden layers in the BERT model. For 'bert-base-cased', it is usually 768.
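
The shapes described above can be checked with a short sketch. To avoid downloading a checkpoint, it builds a deliberately tiny, randomly initialized PyTorch `BertModel` from a hypothetical `BertConfig` (hidden size 32 instead of 768), so the embeddings themselves are meaningless and only the shapes matter. With `'bert-base-cased'` the last dimension would be 768.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized BERT (hypothetical sizes, nothing is downloaded)
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_bert = BertModel(config)
tiny_bert.eval()

# One "sentence" of 7 made-up token ids: batch_size=1, sequence_length=7
input_ids = torch.randint(0, config.vocab_size, (1, 7))
with torch.no_grad():
    output_embeddings = tiny_bert(input_ids=input_ids)

print(output_embeddings.last_hidden_state.shape)  # torch.Size([1, 7, 32])
print(output_embeddings.pooler_output.shape)      # torch.Size([1, 32])

# The [CLS] vector used in Step 2 is position 0 of the last hidden state
cls_embedding = output_embeddings.last_hidden_state[:, 0, :]
print(cls_embedding.shape)                        # torch.Size([1, 32])
```

Note that the `distilbert-base-uncased` checkpoint loaded in Step 1 returns only `last_hidden_state` and has no `pooler_output`, which is why Step 2 takes the `[CLS]` position of the last hidden state instead.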


# <div style="padding:20px;color:white;margin:0;font-size:24px;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Important Deep Learning Concepts</div>

## <span style="color: #7b6b59;">Cross-Validation</span>

In machine learning (ML), generalization usually refers to the ability of an algorithm to be effective across various inputs. It means that the ML model does not encounter performance degradation on the new inputs from the same distribution of the training data.

For human beings generalization is the most natural thing possible. We can classify on the fly. For example, we would definitely recognize a dog even if we didn’t see this breed before. Nevertheless, it might be quite a challenge for an ML model. That’s why checking the algorithm’s ability to generalize is an important task that requires a lot of attention when building the model.

To do that, we use Cross-Validation (CV). There is always a need to validate the stability of your machine learning model: you can’t just fit the model to your training data and hope it will work accurately on real data it has never seen before. You need some assurance that your model has captured most of the patterns in the data correctly and is not picking up too much of the noise; in other words, that it is low on both bias and variance.

Cross-validation is a very useful technique:

- for **assessing the effectiveness of your model**, particularly in cases where you need to mitigate overfitting.
- for **determining the hyperparameters of your model, in the sense of finding which values result in the lowest error on held-out data.**

These are the basics you need to get started with cross-validation. All the validation techniques below are available in `Scikit-Learn`, which gets you up and running with just a few lines of Python.

### <span style="color: #7b6b59;">What is cross-validation?</span>

Validation is the process of deciding whether the numerical results that quantify hypothesized relationships between variables are acceptable as descriptions of the data. Generally, an error estimate for the model is made after training, better known as evaluation of residuals. In this process, a numerical estimate of the difference between the predicted and original responses is computed, also called the training error. However, this only tells us how well the model does on the data used to train it. **The model may still be underfitting or overfitting the data, so the problem with this evaluation technique is that it gives no indication of how well the learner will generalize to an independent, unseen dataset. Estimating that generalization performance on held-out data is what cross-validation does.**

- **Cross-validation is a technique for evaluating a machine learning model and testing its performance.** CV is commonly used in applied ML tasks. It helps to compare and select an appropriate model for the specific predictive modeling problem.

CV is easy to understand, easy to implement, and it tends to give a less biased estimate than other methods used to measure a model’s performance. All this makes cross-validation a powerful tool for selecting the best model for the specific task.

There are a lot of different techniques that may be used to cross-validate a model. Still, all of them have a similar algorithm:

1. **Divide** the dataset into two parts: one for training, other for testing
1. **Train** the model on the training set
1. **Validate** the model on the test set
1. **Repeat** steps 1-3 several times. The number of repetitions depends on the CV method that you are using
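
The four steps above can be sketched with plain NumPy. Everything here is illustrative: the data is synthetic, the "model" is a least-squares line fitted with `np.polyfit`, and the split is a simple rotation over four shuffled chunks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): y is roughly 3 * x
X = rng.normal(size=(20, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=20)

n_rounds = 4
indices = rng.permutation(len(X))
folds = np.array_split(indices, n_rounds)

scores = []
for i in range(n_rounds):
    # 1. Divide: one chunk for testing, the rest for training
    test_idx = folds[i]
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])

    # 2. Train: a fresh least-squares line y = slope * x + intercept
    slope, intercept = np.polyfit(X[train_idx, 0], y[train_idx], deg=1)

    # 3. Validate: mean squared error on the held-out chunk
    pred = slope * X[test_idx, 0] + intercept
    scores.append(float(np.mean((pred - y[test_idx]) ** 2)))

# 4. Repeat (the loop) and aggregate the per-round scores
print("per-round MSE:", scores)
print("mean MSE:", np.mean(scores))
```

The individual techniques below differ mainly in how step 1 chooses the test indices and in how many rounds are run.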

As you may know, there are plenty of CV techniques. Some of them are commonly used, others work only in theory. 

1. **Hold-out**
1. **K-folds**
1. **Stratified K-folds**
1. Repeated K-folds
1. Nested K-folds
1. Time series CV

The validation techniques listed above are also referred to as **non-exhaustive cross-validation methods**. They do not compute all the ways of splitting the original sample; you only decide how many subsets to make. They are approximations of the methods listed below, called **exhaustive methods**, which compute all possible ways the data can be split into training and test sets.

1. **Leave-one-out**
1. **Leave-p-out**

### <span style="color: #7b6b59;">Cross-Validation Techniques</span>

1. **Hold-out cross-validation:** Hold-out cross-validation is the simplest and most common technique. You might not know it by this name, but you have almost certainly used it. *We usually use the hold-out method on large datasets as it requires training the model only once.* It is really easy to implement, for example with `sklearn.model_selection.train_test_split`. The error estimated on the held-out part then tells how our model is doing on unseen data. The algorithm of the hold-out technique:

    - Divide the dataset into two parts: the **training set** and the **test set**. Usually, 80% of the dataset goes to the training set and 20% to the test set but you may choose any splitting that suits you better
    - Train the model on the training set
    - Validate on the test set
    - Save the result of the validation
    
    1. Disadvantages:
        - The dataset may not be evenly distributed. If so, we may end up in a rough spot after the split: the training set may not represent the test set, and one of them might be easier or harder than the other. Although the hold-out method adds no computational overhead and is better than evaluating on the training data alone, it still suffers from high variance, because it is not certain which data points will end up in the validation set and the result might be entirely different for different splits.
        - Moreover, the fact that we test our model only once might be a bottleneck for this method. For the reasons mentioned above, the result obtained by the hold-out technique may be considered inaccurate.

1. **k-Fold cross-validation:** k-Fold cross-validation is a technique that minimizes the disadvantages of the hold-out method. It introduces a new way of splitting the dataset which helps to overcome the “test only once” bottleneck. As there is never enough data to train your model, removing a part of it for validation poses a risk of underfitting. **By reducing the training data, we risk losing important patterns and trends in the dataset, which in turn increases the error induced by bias.** So what we require is a method that provides ample data for training the model and also leaves ample data for validation. k-Fold cross-validation does exactly that. The data is divided into k subsets, and the hold-out method is repeated k times, such that each time one of the k subsets is used as the test (validation) set and the other k-1 subsets are put together to form the training set. The error estimate is averaged over all k trials to get the overall effectiveness of the model. Every data point is in a validation set exactly once and in a training set k-1 times. This reduces bias, as most of the data is used for fitting, and also reduces the variance of the estimate, as every data point is eventually used for validation. Rotating the roles of the training and test sets also adds to the effectiveness of this method. **As a general rule backed by empirical evidence, k = 5 or 10 is generally preferred, but nothing is fixed and k can take any value.**

    The algorithm of the k-Fold technique:
    
    <img width="500" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/376f6016-01a3-499d-816f-a677f544bf2e">
    
    - Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any number from 2 up to the number of samples in the dataset.
    - Split the dataset into k equal (if possible) parts (they are called folds)
    - Choose k – 1 folds as the training set. The remaining fold will be the test set
    - Train the model on the training set. On each iteration of cross-validation, you must train a new model independently of the model trained on the previous iteration
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 3 – 6 k times. Each time use a different fold as the test set.
    - In the end, you should have validated the model on every fold that you have. To get the final score, average the results that you got in step 6.
    
    To perform k-Fold cross-validation you can use `sklearn.model_selection.KFold`. In general, it is better to use the k-Fold technique instead of hold-out. In a head-to-head comparison, k-Fold gives a more stable and trustworthy result since training and testing are performed on several different parts of the dataset. We can make the overall score even more robust if we increase the number of folds to test the model on many different sub-datasets.
    
    1. Disadvantages:
        - Still, k-Fold method has a disadvantage. Increasing k results in training more models and the training process might be really expensive and time-consuming.

1. **Stratified k-Fold cross-validation:** Sometimes we may face a large imbalance of the target value in the dataset. For example, in a dataset concerning wristwatch or house prices, there might be a disproportionately large number of high-priced items. In the case of classification, a cats-and-dogs dataset might be heavily shifted towards the dog class, or there might be several times more negative samples than positive samples. **Stratified k-Fold is a variation of the standard k-Fold CV technique which is designed to be effective in such cases of target imbalance.** The folds are built so that each one contains approximately the same percentage of samples of each target class as the complete set or, for regression problems, so that the mean response value is approximately equal in all the folds.

    It works as follows. Stratified k-Fold splits the dataset into k folds such that each fold contains approximately the same percentage of samples of each target class as the complete set. In the case of regression, Stratified k-Fold makes sure that the mean target value is approximately equal in all the folds. The algorithm of the Stratified k-Fold technique:

    - Pick a number of folds – k
    - Split the dataset into k folds. Each fold must contain approximately the same percentage of samples of each target class as the complete set 
    - Choose k – 1 folds which will be the training set. The remaining fold will be the test set
    - Train the model on the training set. On each iteration a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 3 – 6 k times. Each time use a different fold as the test set. In the end, you should have validated the model on every fold that you have.
    - To get the final score, average the results that you got in step 6.

    As you may have noticed, the algorithm for the Stratified k-Fold technique is similar to standard k-Fold. You don’t need to write any additional code, as the method does everything necessary for you. Stratified k-Fold also has a built-in class in sklearn: `sklearn.model_selection.StratifiedKFold`, which stratifies on class labels. Everything mentioned above about k-Fold CV is true for the Stratified k-Fold technique. When choosing between different CV methods, make sure you are using the proper one. For example, you might think that your model performs badly simply because you are using plain k-Fold CV to validate a model trained on a dataset with a class imbalance. To avoid that, you should always do a proper exploratory data analysis on your data.

1. **Repeated k-Fold cross-validation:** Repeated k-Fold cross-validation, described here in its repeated random sub-sampling form (also known as Monte Carlo cross-validation), is probably the most robust of the CV techniques covered here. It is a variation of k-Fold, but in this form k is not the number of folds; it is the number of times we will train the model. The general idea is that on every iteration we randomly select samples from all over the dataset as our test set. For example, if we decide that 20% of the dataset will be our test set, 20% of the samples will be randomly selected and the remaining 80% will become the training set. The algorithm of this technique:

    - Pick k – number of times the model will be trained
    - Pick a number of samples which will be the test set
    - Split the dataset
    - Train on the training set. On each iteration of cross-validation, a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 3-6 k times
    - To get the final score average the results that you got on step 6.
    
    This technique has clear advantages over standard k-Fold CV. Firstly, the proportion of the train/test split does not depend on the number of iterations. Secondly, we can even set a different proportion for every iteration. Thirdly, random selection of samples from the dataset makes it even more robust to selection bias. Still, there are some disadvantages. k-Fold CV guarantees that the model will be tested on all samples, whereas random sub-sampling means that some samples may never be selected for the test set at all, while others might be selected multiple times. This makes it a bad choice for imbalanced datasets. In sklearn, random sub-sampling as described above corresponds to `sklearn.model_selection.ShuffleSplit`. `sklearn.model_selection.RepeatedKFold` is a related but different splitter: you set the number of folds (`n_splits`) and the number of times the whole k-Fold split is repeated (`n_repeats`), each repetition uses a different randomization, and every sample is tested exactly once per repetition.

1. **Leave-one-out cross-validation:** Leave-one-out cross-validation (LOOCV) is an extreme case of k-Fold CV. Imagine that k is equal to n, where n is the number of samples in the dataset. This k-Fold case is equivalent to the Leave-one-out technique. The algorithm of the LOOCV technique:

    - Choose one sample from the dataset which will be the test set
    - The remaining n – 1 samples will be the training set
    - Train the model on the training set. On each iteration, a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 1 – 5 n times as for n samples we have n different training and test sets
    - To get the final score average the results that you got on step 5.
    
    For LOOCV sklearn also has a built-in class in the `model_selection` module: `sklearn.model_selection.LeaveOneOut`.
    
    1. Advantages:
        - The greatest advantage of Leave-one-out cross-validation is that it doesn’t waste much data. We use only one sample from the whole dataset as a test set, whereas the rest is the training set. 
        
    1. Disadvantages:
        - Compared with k-Fold CV, LOOCV requires building n models instead of k models, and n, the number of samples in the dataset, is usually much higher than k. This means LOOCV is more computationally expensive than k-Fold and may take plenty of time to cross-validate the model. **Thus, the Data Science community has a general rule, based on empirical evidence and various studies, which suggests that 5- or 10-fold cross-validation should be preferred over LOOCV.**

1. **Leave-p-out cross-validation:** Leave-p-out cross-validation (LpOC) is similar to Leave-one-out CV, as it creates all the possible training and test sets by using p samples as the test set. Everything mentioned about LOOCV is also true for LpOC. This approach leaves p data points out of the training data: if there are n data points in the original sample, n-p samples are used to train the model and p points are used as the validation set. This is repeated for all combinations in which the original sample can be separated this way, and the error is then averaged over all trials to give the overall effectiveness. **This method is exhaustive in the sense that it needs to train and validate the model for all possible combinations, and for moderately large p it can become computationally infeasible.** Still, it is worth mentioning that, unlike LOOCV and k-Fold, the test sets will overlap for LpOC if p is higher than 1. The algorithm of the LpOC technique:

    - Choose p samples from the dataset which will be the test set
    - The remaining n – p samples will be the training set
    - Train the model on the training set. On each iteration, a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 1 – 5 C(n, p) times, once for every way of choosing p samples out of n
    - To get the final score, average the results that you got in step 5
    
    You can perform Leave-p-out CV using sklearn: `sklearn.model_selection.LeavePOut`. LpOC has all the disadvantages of LOOCV but is, nevertheless, as robust as LOOCV. The particular case of this method with p = 1 is Leave-one-out cross-validation. **LOOCV is generally preferred over Leave-p-out with p > 1 because it avoids the most intensive computation: the number of possible combinations is equal to the number of data points in the original sample, n.**
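
The splitters named in this section can be compared side by side. The sketch below uses a hypothetical, imbalanced toy dataset (16 negatives, 4 positives) and only counts the train/test splits each splitter produces. `ShuffleSplit` is included because it is sklearn's splitter for the random sub-sampling scheme (a random 20% test set on every iteration) described under Repeated k-Fold.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    LeavePOut,
    RepeatedKFold,
    ShuffleSplit,
    StratifiedKFold,
)

# Hypothetical imbalanced toy data: 16 negatives, 4 positives
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)

splitters = {
    "KFold": KFold(n_splits=4, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    "RepeatedKFold": RepeatedKFold(n_splits=4, n_repeats=3, random_state=0),
    "ShuffleSplit": ShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
    "LeavePOut(p=2)": LeavePOut(p=2),
}

# Number of models each technique would have to train
n_splits = {name: cv.get_n_splits(X, y) for name, cv in splitters.items()}
for name, count in n_splits.items():
    print(f"{name}: {count} train/test splits")

# Stratification keeps the class ratio of the full set in every test fold
positives_per_fold = [
    int(y[test_idx].sum())
    for _, test_idx in splitters["StratifiedKFold"].split(X, y)
]
print("positives per stratified test fold:", positives_per_fold)  # [1, 1, 1, 1]
```

Any of these objects can be passed as the `cv` argument of `sklearn.model_selection.cross_val_score`, so switching technique usually means changing a single line.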


## <span style="color: #7b6b59;">Weight Decay</span>

### <span style="color: #7b6b59;">What is weight decay?</span>

- Weight decay is a **regularization technique** that adds a small penalty, usually the squared L2 norm of the weights (all the weights of the model), to the loss function.  **`loss = loss + weight decay parameter * squared L2 norm of the weights`**

- Weight decay **is a form of regularization that penalizes large weights in the network.** It does this by adding a term to the loss function that is proportional to the sum of the squared weights. This term reduces the magnitude of the weights and prevents them from growing too large. Weight decay can also be seen as a way of introducing prior knowledge that the weights should be small and smooth.
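
Written out, the penalized loss and its gradient look as follows. The symbols are assumptions made for this sketch: $L$ is the original loss, $w$ the weight vector, $\lambda$ the weight decay parameter, and the factor $\tfrac{1}{2}$ is a common convention that makes the gradient of the penalty exactly $\lambda w$.

```latex
\tilde{L}(w) = L(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2
             = L(w) + \frac{\lambda}{2}\sum_i w_i^2,
\qquad
\nabla_w \tilde{L}(w) = \nabla_w L(w) + \lambda\, w
```

The extra gradient term $\lambda w$ always points away from zero along each weight, so a gradient step subtracts a fraction of every weight, which is where the name "decay" comes from.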

Some people prefer to apply weight decay only to the weights and not to the biases. **By default, PyTorch applies weight decay to every parameter passed to the optimizer, biases included.**

Weight decay is a regularization parameter that prevents the model weights from ‘exploding’. Many projects and frameworks set the weight decay to zero for bias and normalization parameters, but it is still worth checking, since this is not the default behavior in PyTorch.

Weight decay essentially pulls the weights towards 0. **While this is beneficial for convolutional and linear layer weights, BatchNorm layer parameters are meant to scale (the gamma parameter) and shift (the beta parameter) the normalized input of the layer. Forcing these values towards zero would distort that distribution and lead to inferior results.**

### <span style="color: #7b6b59;">How does weight decay work?</span>


Weight decay works by updating the weights in the opposite direction of their current value, scaled by a factor called the weight decay rate. This factor determines how much the weights are shrunk at each step of the optimization. A higher weight decay rate means more regularization and a lower risk of overfitting, but also less flexibility and a higher risk of underfitting. A lower weight decay rate means less regularization and a higher risk of overfitting, but also more flexibility and a lower risk of underfitting. The optimal weight decay rate depends on the data, the model, and the learning rate.
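
The shrinking effect can be seen in a tiny sketch of one SGD-style update. The function `sgd_step` and the numbers are hypothetical; the update assumes the L2-penalty form of weight decay, where `wd * w` is added to the gradient.

```python
learning_rate = 0.1   # hypothetical values
weight_decay = 0.5

def sgd_step(w, grad, lr, wd):
    """One SGD step where the L2 penalty adds wd * w to the gradient."""
    return w - lr * (grad + wd * w)

# With a zero data gradient only the decay term acts: every step
# multiplies the weight by (1 - lr * wd) = 0.95
w = 2.0
history = [w]
for _ in range(3):
    w = sgd_step(w, grad=0.0, lr=learning_rate, wd=weight_decay)
    history.append(w)

print(history)
```

Each value is 95% of the previous one, so the weight drifts towards zero unless the data gradient pushes it back.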

 
### <span style="color: #7b6b59;">Why do we use weight decay?</span>

1. **To prevent overfitting.**

1. **To keep the weights small and avoid exploding gradients.** Because the penalty on the weights is added to the loss, each iteration of training tries to minimize the size of the model weights in addition to the original loss. This helps keep the weights small, preventing them from growing out of control, and thus helps avoid exploding gradients.

### <span style="color: #7b6b59;">What are the benefits of weight decay?</span>

Weight decay offers a variety of advantages for neural network training and performance. It can reduce the variance of the model, which leads to better generalization ability. Additionally, weight decay prevents the weights from becoming too large, which could cause numerical instability or gradient explosion. Furthermore, it simplifies the model and makes it more interpretable. Finally, it can also improve the convergence speed and stability of the optimization algorithm.


### <span style="color: #7b6b59;">What are the drawbacks of weight decay?</span>


Weight decay has some drawbacks that should be taken into account. For instance, it adds an extra hyperparameter to tune, making the model selection process more complex and costly. Additionally, it can reduce the capacity and expressiveness of a model, especially for deep and complex networks. Furthermore, weight decay can interfere with the learning of sparse or important features since it treats all weights equally. Finally, it can cause underfitting if the weight decay rate is too high or if the data is noisy or insufficient.

### <span style="color: #7b6b59;">How do we use weight decay?</span>


To use weight decay, we can simply set the `weight_decay` parameter of the `torch.optim.SGD` optimizer or the `torch.optim.Adam` optimizer.

<img width="941" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/94444aa1-11d7-40e9-a39f-a86e7f666d09">

Note that Adam uses a different update rule, but the key concept is the same. Also, as mentioned above, **PyTorch applies weight decay to both weights and biases.** If you would like to apply it only to the weights, you can iterate over `model.named_parameters()` and build separate parameter groups. **`model.named_parameters()` also allows more complex schemes, such as using different weight decay rates in different layers.**
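
A minimal sketch of that idea, assuming a small hypothetical model and the common heuristic that one-dimensional parameters (biases and BatchNorm scale/shift) should not be decayed. The layer sizes and hyperparameter values are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical model: two linear layers with a BatchNorm layer in between
model = nn.Sequential(
    nn.Linear(10, 4),
    nn.BatchNorm1d(4),
    nn.ReLU(),
    nn.Linear(4, 2),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Biases and BatchNorm gamma/beta are 1-D tensors; weight matrices are 2-D
    if param.ndim <= 1:
        no_decay.append(param)
    else:
        decay.append(param)

# Each parameter group carries its own weight_decay setting
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.01,
)

print(len(decay), "decayed tensors,", len(no_decay), "non-decayed tensors")
```

The same grouping works with `torch.optim.Adam` or `torch.optim.AdamW`, and the `name` variable can be used instead of `param.ndim` to pick out specific layers.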


### <span style="color: #7b6b59;">How do you choose the weight decay rate?</span>

Choosing the weight decay rate is a trade-off between regularization and flexibility. There is no universal formula or rule for setting it, as it depends on many factors, such as the data, the model, the learning rate, and the optimization algorithm. However, a general guideline is to start with a small weight decay rate and increase it gradually for as long as validation performance keeps improving. Validation sets or cross-validation can be used to evaluate the effect of different rates on model performance and select the one that minimizes validation error. Additionally, grid searches or random searches can be employed to explore a range of rates and find the best one for a given problem. ***Finally, keep in mind that the weight decay rate interacts with the learning rate: in the standard SGD update the per-step shrinkage of the weights is the product of the two, so a learning rate scheduler also changes the effective amount of decay during training.***
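
The grid-search idea can be sketched without training a neural network. In the example below, ridge regression stands in for the model, since its L2 penalty plays the same role as weight decay; the data, the candidate rates and the helper `fit_ridge` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy regression data with many features and few samples
X = rng.normal(size=(40, 30))
true_w = np.zeros(30)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=40)

# Simple hold-out split: 30 samples for training, 10 for validation
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]

def fit_ridge(X, y, lam):
    """Closed-form least squares with an L2 penalty of strength lam."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Grid of candidate rates, from weak to strong regularization
candidates = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
val_errors = {}
for lam in candidates:
    w = fit_ridge(X_train, y_train, lam)
    val_errors[lam] = float(np.mean((X_val @ w - y_val) ** 2))

best = min(val_errors, key=val_errors.get)
for lam, err in val_errors.items():
    print(f"lambda={lam:<7} validation MSE={err:.3f}")
print("selected rate:", best)
```

For a neural network the loop is the same, with `fit_ridge` replaced by a full training run that passes the candidate value as `weight_decay` to the optimizer.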


### <span style="color: #7b6b59;">How do you compare weight decay with other regularization methods?</span>

Weight decay is one of the most common and effective regularization methods for neural networks, but it is not the only one. Other regularization methods that can be used alone or in combination with weight decay include noise injection, dropout, batch normalization, early stopping, and data augmentation. Each of these methods has its own advantages and disadvantages, making the best choice dependent on the specific problem and data. As a general rule, use weight decay as a baseline regularization method as it is simple, effective, and widely applicable. For noisy, sparse, or imbalanced data or for very deep or complex networks, noise injection or dropout is recommended. Batch normalization or early stopping should be used when a network suffers from slow or unstable convergence or when the learning rate is too high or too low. Data augmentation should be used when data is limited, simple, or homogeneous or when the network is very flexible or expressive.



## <span style="color: #7b6b59;">Optimizers</span>

### <span style="color: #7b6b59;">Introduction</span>


Many people may be using optimizers while training the neural network without knowing that the method is known as optimization. 

- Optimizers are algorithms or methods used to change the attributes of your neural network, such as its weights and learning rate, in order to reduce the loss. The optimizer you use defines how the weights or learning rate should be changed to reduce the loss. Optimization algorithms are responsible for reducing the loss and providing the most accurate results possible.
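
As a concrete picture of "changing the weights to reduce the loss", here is a minimal pure-Python sketch of gradient descent and its stochastic variant, both described below. The one-parameter problem, the data and the learning rate are hypothetical and chosen so that both loops converge.

```python
# Hypothetical 1-D problem: fit y = theta * x to points generated with theta = 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
alpha = 0.05   # learning rate

def grad_single(theta, x, y):
    """Gradient of the squared error (theta * x - y)**2 with respect to theta."""
    return 2.0 * (theta * x - y) * x

# Batch gradient descent: one update per pass over the whole dataset
theta_gd, gd_updates = 0.0, 0
for epoch in range(20):
    grad = sum(grad_single(theta_gd, x, y) for x, y in zip(xs, ys)) / len(xs)
    theta_gd -= alpha * grad
    gd_updates += 1

# Stochastic gradient descent: one update per training example
theta_sgd, sgd_updates = 0.0, 0
for epoch in range(20):
    for x, y in zip(xs, ys):
        theta_sgd -= alpha * grad_single(theta_sgd, x, y)
        sgd_updates += 1

print(f"GD : theta={theta_gd:.6f} after {gd_updates} updates")
print(f"SGD: theta={theta_sgd:.6f} after {sgd_updates} updates")
```

Both loops apply the same rule, θ=θ−α⋅∇J(θ); they differ only in how much data is used to compute each gradient and therefore in how many updates happen per pass over the dataset.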


### <span style="color: #7b6b59;">Optimization Algorithms</span>

We’ll learn about different types of optimizers and their advantages:


1. **Gradient Descent:** Gradient Descent is the most basic but most used optimization algorithm. It is used heavily in linear regression and classification algorithms, and training neural networks with backpropagation also relies on it. Gradient descent is a first-order optimization algorithm, meaning it depends on the first-order derivative of the loss function. It calculates in which direction the weights should be altered so that the function can reach a minimum. Through backpropagation, the gradient of the loss is propagated from one layer to the previous one, and the model’s parameters (its weights) are modified so that the loss is minimized. The update rule is θ=θ−α⋅∇J(θ): the parameters are tweaked iteratively by moving in the direction opposite to that of steepest ascent, which takes the loss towards a local minimum (the global minimum if the function is convex). Gradient descent uses the entire training set to calculate the gradient of the cost function with respect to the parameters, which requires a large amount of memory and slows down the process. How big the steps towards the local minimum are is determined **by the learning rate**, which controls how fast or slow we move towards the optimal weights.

    <img width="879" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/1ce01c91-6d49-45ab-aed5-05679edcbc01">
    
    <img width="819" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/36cdb708-717a-47fe-83a4-dd63fe8c7af5">

    1. **Advantages:**
    
        - Easy computation.
        - Easy to implement.
        - Easy to understand.
    1. **Disadvantages:**

        - May get trapped in local minima.
        - Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is very large, each update is slow and convergence to the minimum may take a very long time.
        - Requires a large amount of memory to calculate the gradient on the whole dataset, and is computationally expensive.

1. **Stochastic Gradient Descent:** SGD is a variant of gradient descent that updates the model’s parameters more frequently: the parameters are altered after computing the loss on each training example. So, if the dataset contains 1000 rows, SGD will update the model parameters 1000 times in one pass over the dataset instead of once as in gradient descent (and 10K times for a dataset of 10K examples). `θ=θ−α⋅∇J(θ;x(i);y(i))`, where `{x(i), y(i)}` is a training example. Because the parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity.

    <img width="812" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/4bbb77e3-5c50-4bbd-a467-b1e240970db2">

    1. **Advantages:**

        - Frequent updates of model parameters, hence it often converges in less time.
        - Requires less memory, since the gradient is computed on one example at a time.
        - The noisy updates may help it find new, possibly better, minima.
        - Allows the use of large datasets, as it only has to process one example per update.
   
    1. **Disadvantages:**

        - High variance in model parameters.
        - May overshoot even after reaching the global minimum.
        - To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.
        - The frequent updates can also result in noisy gradients, which may cause the error to increase instead of decrease.
        - Frequent updates are computationally expensive.

1. **Mini-Batch Gradient Descent:** This is usually the best of the gradient descent variants. It combines the ideas of SGD and batch gradient descent: the training dataset is split into small batches, and the parameters are updated after every batch. `θ=θ−α⋅∇J(θ; B(i))`, where `{B(i)}` are the batches of training examples. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent: it reduces the variance of the parameter updates, so convergence is more stable. Batches typically contain between 50 and 256 examples, chosen at random.

    1. **Advantages:**

        - Updates the model parameters frequently while having less variance than SGD.
        - Requires a medium amount of memory.
        - Leads to more stable convergence.
        - More efficient gradient calculations.
        
    1. **All types of Gradient Descent have some challenges:**

        - Choosing an optimal value of the learning rate. If the learning rate is too small, gradient descent may take a very long time to converge. If it is too large, the loss function will oscillate around the minimum or even diverge.

        - Having a constant learning rate for all the parameters. There may be some parameters that we do not want to change at the same rate.
        - May get trapped at local minima. Mini-batch gradient descent does not guarantee good convergence.

1. **Momentum:** Momentum was invented to reduce the high variance of SGD and to smooth its convergence. It accelerates convergence in the relevant direction and dampens fluctuations in irrelevant directions. One more hyperparameter is used in this method, known as momentum and symbolized by `γ`. `V(t)=γV(t−1)+α⋅∇J(θ)`. The weights are then updated by `θ=θ−V(t)`. The momentum term `γ` is usually set to 0.9 or a similar value.

    1. **Advantages:**

        - Reduces the oscillations and high variance of the parameters.
        - Converges faster than gradient descent.

    1. **Disadvantages:**

        - One more hyper-parameter is added which needs to be selected manually and accurately.

1. **Nesterov Accelerated Gradient:** Momentum is a good method, but if the momentum is too high the algorithm may overshoot a local minimum and continue up the other side. The NAG algorithm was developed to resolve this issue. It is a look-ahead method. We know we will be using `γV(t−1)` to modify the weights, so `θ−γV(t−1)` approximately tells us the future location. We then calculate the gradient at this future position rather than at the current one: `V(t)=γV(t−1)+α⋅∇J(θ−γV(t−1))`, and update the parameters using `θ=θ−V(t)`.
     
     <img width="434" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/c9351e81-bce5-4717-8654-27b300497341">
 
     1. **Advantages:**
        
        - Less likely to overshoot a local minimum.
        - Slows down as it approaches a minimum.
    
     1. **Disadvantages:**

        - Still, the hyperparameter needs to be selected manually.
   
1. **Adagrad**: One of the disadvantages of all the optimizers explained so far is that the learning rate is constant for all parameters and for each cycle. Adagrad changes the learning rate `η` for each parameter and at every time step `t`. It is a first-order optimization algorithm: it works only with the first derivative (the gradient) of the error function, and scales each parameter’s step by the history of that parameter’s past gradients.
    
    <img width="564" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6910d495-c191-4857-9360-e1402b7c6023">
    
    `η` is a learning rate which is modified for given parameter `θ(i)` at a given time based on previous gradients calculated for given parameter `θ(i)`. We store the sum of the squares of the gradients w.r.t. `θ(i)` up to time step `t`, while `ϵ` is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square root operation, the algorithm performs much worse. **It makes big updates for less frequent parameters and a small step for frequent parameters.**

    1. **Advantages:**

        - Learning rate changes for each training parameter.
        - No need to manually tune the learning rate.
        - Able to train on sparse data.

    1. **Disadvantages:**

        - The accumulated sum of squared gradients keeps growing, so the learning rate is always decreasing, which results in slow training and can eventually stall it.

1. **AdaDelta:** It is an extension of Adagrad that aims to remove its decaying learning rate problem. Instead of accumulating all past squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size `w`. In practice, an exponentially decaying moving average of the squared gradients is used rather than the sum of all of them.

    1. **Advantages:**

        - The learning rate no longer decays toward zero, so training does not stall.
        
    1. **Disadvantages:**
        - Computationally expensive.

1. **Adam:** Adam (Adaptive Moment Estimation) works with estimates of the first and second moments of the gradient. The intuition behind Adam is that we do not want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.

    1. **Advantages:**

        - The method is very fast and converges rapidly.
        - Rectifies vanishing learning rate, high variance.

    1. **Disadvantages:**

        - Computationally costly.
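
The update rules described in this list can be written down in a few lines. The following is a minimal sketch in plain Python, not a production implementation: the toy loss `J(θ) = (θ − 3)²`, the function names, and all hyperparameter values are assumptions chosen for illustration. The Adam step follows the commonly published form with bias correction (`beta1`, `beta2` and `eps` do not appear in the text above).

```python
import math


def grad(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2, minimised at theta = 3."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, lr=0.1, steps=500):
    for _ in range(steps):
        theta -= lr * grad(theta)                     # theta = theta - alpha * grad J(theta)
    return theta


def momentum(theta, lr=0.1, gamma=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)              # V(t) = gamma * V(t-1) + alpha * grad J(theta)
        theta -= v                                    # theta = theta - V(t)
    return theta


def nesterov(theta, lr=0.1, gamma=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta - gamma * v)  # gradient at the look-ahead position
        theta -= v
    return theta


def adagrad(theta, lr=0.5, eps=1e-8, steps=500):
    squared_grad_sum = 0.0
    for _ in range(steps):
        g = grad(theta)
        squared_grad_sum += g * g                     # grows monotonically, so steps shrink
        theta -= lr / math.sqrt(squared_grad_sum + eps) * g
    return theta


def adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g               # decaying average of past gradients
        v = beta2 * v + (1 - beta2) * g * g           # decaying average of past squared gradients
        m_hat = m / (1 - beta1 ** t)                  # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


for optimizer in (gradient_descent, momentum, nesterov, adagrad, adam):
    print(f"{optimizer.__name__:>16}: theta = {optimizer(0.0):.4f}")
```

On this one-dimensional convex problem every method ends up close to `θ = 3`; the differences between them only become visible on harder, higher-dimensional loss surfaces.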

### <span style="color: #7b6b59;">Conclusions</span>

- **Adam is often the best default optimizer. If one wants to train the neural network in less time and more efficiently, then Adam is a strong choice.**

- **For sparse data use the optimizers with dynamic learning rate.**

- **If you want to use a gradient descent algorithm, then mini-batch gradient descent is the best option. With Adagrad, the learning rate is always decreasing, which results in slow training.**
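
To make the difference between batch, mini-batch and stochastic gradient descent concrete, here is a small sketch in plain Python. The toy regression data (`y = 2x + 1`), the helper names and the hyperparameters are all assumptions for illustration; the only thing that changes between the three variants is `batch_size`.

```python
import random


def make_data(n=200, seed=0):
    """Toy noise-free regression data y = 2x + 1, purely for illustration."""
    rng = random.Random(seed)
    return [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]


def mse_gradient(w, b, batch):
    """Gradient of the mean squared error over `batch` with respect to (w, b)."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw / len(batch), gb / len(batch)


def train(data, batch_size, lr=0.1, epochs=50, seed=0):
    """batch_size=len(data) is batch GD, 1 is SGD, anything in between is mini-batch GD."""
    rng = random.Random(seed)
    data = list(data)
    w = b = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)                                  # batches are chosen at random
        for start in range(0, len(data), batch_size):
            gw, gb = mse_gradient(w, b, data[start:start + batch_size])
            w -= lr * gw
            b -= lr * gb
            updates += 1                                   # one parameter update per batch
    return w, b, updates


data = make_data()
for name, batch_size in [("batch", len(data)), ("mini-batch", 32), ("sgd", 1)]:
    w, b, updates = train(data, batch_size)
    print(f"{name:>10}: w={w:.3f} b={b:.3f} updates={updates}")
```

With the same number of epochs, batch gradient descent performs one update per epoch, while the mini-batch and stochastic variants perform many more and therefore get closer to `w = 2`, `b = 1` in this setup.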

**How to choose optimizers?**

1. If the data is sparse, use the adaptive methods, namely Adagrad, Adadelta, RMSprop, Adam.

1. RMSprop, Adadelta, Adam have similar effects in many cases.

1. Adam essentially adds bias-correction and momentum on top of RMSprop.

1. As the gradient becomes sparse, Adam will perform better than RMSprop.
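
In PyTorch, most of the optimizers discussed above are available in `torch.optim`, so trying several of them is mostly a matter of swapping one line. The sketch below is an illustration under assumptions: the tiny linear model, the synthetic data, the helper name `run` and all learning rates are made up for the example and are not tuned recommendations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One factory per optimizer, so that each run starts from a fresh model.
OPTIMIZERS = {
    "sgd": lambda params: torch.optim.SGD(params, lr=0.01),
    "momentum": lambda params: torch.optim.SGD(params, lr=0.01, momentum=0.9),
    "nesterov": lambda params: torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True),
    "adagrad": lambda params: torch.optim.Adagrad(params, lr=0.01),
    "adadelta": lambda params: torch.optim.Adadelta(params),
    "rmsprop": lambda params: torch.optim.RMSprop(params, lr=0.001),
    "adam": lambda params: torch.optim.Adam(params, lr=0.001),
    "adamw": lambda params: torch.optim.AdamW(params, lr=0.001, weight_decay=0.01),
}


def run(name, steps=200, seed=0):
    """Train a tiny linear model on synthetic data; return (first loss, last loss)."""
    torch.manual_seed(seed)
    x = torch.randn(64, 10)
    y = x.sum(dim=1, keepdim=True)          # synthetic regression target
    model = nn.Linear(10, 1)
    optimizer = OPTIMIZERS[name](model.parameters())
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()               # clear gradients from the previous step
        loss = F.mse_loss(model(x), y)
        loss.backward()                     # backpropagation computes the gradients
        optimizer.step()                    # the optimizer applies its update rule
        losses.append(loss.item())
    return losses[0], losses[-1]


for name in OPTIMIZERS:
    first, last = run(name)
    print(f"{name:>9}: loss {first:.4f} -> {last:.4f}")
```

The numbers from such a toy comparison say little about real tasks, since each optimizer needs its own learning rate; the point is only that the training loop stays the same while the update rule changes.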


<img width="911" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/110e15de-29a7-4a7b-8850-211ba8d8c0ce">



# <div style="padding:20px;color:white;margin:0;font-size:24px;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Advanced Approaches: Fine-tune a pre-trained model</div>

## <span style="color: #7b6b59;">Introduction</span>

**What does fine-tuning a pre-trained model mean?** 

The fine-tuning technique is used to optimize a model’s performance on a new or different task. It is used to tailor a model to meet a specific need or domain, say cancer detection, in the field of healthcare. Pre-trained models are first trained on large amounts of data for a general task, such as Natural Language Processing (NLP) or image classification. Once trained, such a model can be adapted to similar new tasks or datasets with limited labeled data by fine-tuning it.

The fine-tuning process is commonly used in transfer learning, where a pre-trained model is used as a starting point to train a new model for a different but related task. A pre-trained model can significantly reduce the amount of labeled data required to train a new model, making it an effective tool for tasks where labeled data is scarce or expensive.

**How does fine-tuning pre-trained models work?**

Fine-tuning a pre-trained model works by updating the parameters utilizing the available labeled data instead of starting the training process from the ground up. The following are the generic steps involved in fine-tuning:

1. **Loading the pre-trained model:** The initial phase in the process is to select and load the right model, which has already been trained on a large amount of data, for a related task.

1. **Modifying the model for the new task - Adjust the Architecture:** Once a pre-trained model is loaded, its architecture must be adapted to the requirements of the new task. Because the top layers are often task specific, this typically means replacing or retraining them. For example, you may need to change the number of output neurons in the final layer to match the number of classes in your classification task.

1. **Freezing particular layers:** Freezing a layer means preventing it from updating its weights during the fine-tuning process. The earlier layers of a pre-trained model, which perform low-level feature extraction, are usually the ones that are frozen: they have already learned general features that are useful for various tasks, so freezing them preserves these features and helps avoid overfitting the limited labeled data available for the new task. Depending on the complexity of your task and the size of your dataset, you can choose how many layers to freeze; unfreezing a layer allows it to adapt to the new data during fine-tuning.

1. **Training the new layers:** Once you have adjusted the architecture and decided which layers to freeze or unfreeze, the newly created layers are trained on the labeled data available for the new task, while the weights of the frozen earlier layers stay constant. As a result, the model’s parameters are adapted to the new task, and its feature representations are refined. During training, it is advisable to use a smaller learning rate than the one used in the initial pre-training phase. This helps prevent drastic changes to the already learned representations while allowing the model to adapt to the new data.

1. **Fine-tuning the model:** Once the new layers are trained, you can fine-tune the entire model on the new task using the available limited data. Every task and dataset is unique, and it may require further experimentation with hyperparameters, loss functions, and other training strategies. Fine-tuning is not a one-size-fits-all approach, and you may need to iterate on your fine-tuning strategy to achieve optimal results.
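
The generic steps above can be sketched in PyTorch as follows. This is a minimal illustration under assumptions: the small `backbone` stands in for a real pre-trained network (which would normally be loaded from a checkpoint), and the layer sizes, random data and learning rates are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Step 1: stand-in for a loaded pre-trained network (hypothetical).
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())

# Step 2: attach a new task-specific top layer, here for 3 classes.
head = nn.Linear(32, 3)
model = nn.Sequential(backbone, head)

# Step 3: freeze the earlier layers so their weights are not updated.
for param in backbone.parameters():
    param.requires_grad = False

x = torch.randn(8, 16)                      # toy inputs
y = torch.randint(0, 3, (8,))               # toy labels


def train_steps(optimizer, steps):
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()


# Step 4: train only the new layer while the backbone stays constant.
backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()
train_steps(torch.optim.AdamW(head.parameters(), lr=1e-3), steps=5)
backbone_unchanged = all(
    torch.equal(old, new) for old, new in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print(f"backbone unchanged: {backbone_unchanged}, head changed: {head_changed}")

# Step 5: unfreeze everything and fine-tune the whole model with a smaller learning rate.
for param in backbone.parameters():
    param.requires_grad = True
train_steps(torch.optim.AdamW(model.parameters(), lr=1e-5), steps=5)
```

Passing only `head.parameters()` to the first optimizer and setting `requires_grad = False` on the backbone are two complementary safeguards; the second one also saves the cost of computing gradients for the frozen layers.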

**Understanding fine-tuning with an example**

Suppose you have a pre-trained model trained on a wide range of medical data or images that can detect abnormalities like tumors and want to adapt the model for a specific use case, say identifying a rare type of cancer, but you have a limited set of labeled data available. In such a case, you must fine-tune the model by adding new layers on top of the pre-trained model and training the newly added layers with the available data. Typically, the earlier layers of a pre-trained model, which extract low-level features, are frozen to prevent overfitting.

**Best practices to follow when fine-tuning a pre-trained model**

While fine-tuning a pre-trained model, several best practices can help ensure successful outcomes. Here are some key practices to follow:

1. **Understand the pre-trained model:** Gain a comprehensive understanding of the pre-trained model architecture, its strengths, limitations, and the task it was initially trained on. This knowledge can enhance the fine-tuning process and help make appropriate modifications.

1. **Select a relevant pre-trained model:** Choose a pre-trained model that aligns closely with the target task or domain. A model trained on similar data or a related task will provide a better starting point for fine-tuning.

1. **Freeze early layers:** Typically, the lower layers of a pre-trained model capture generic features and patterns. Freeze these early layers during fine-tuning to preserve the learned representations. This practice helps prevent catastrophic forgetting and lets the model focus on task-specific fine-tuning.

1. **Adjust learning rate**: Experiment with different learning rates during fine-tuning. It is typical to use a smaller learning rate compared to the initial pre-training phase. A lower learning rate allows the model to adapt more gradually and prevents drastic changes that could lead to overfitting.

1. **Utilize transfer learning techniques:** Transfer learning methods can enhance fine-tuning performance. Techniques like feature extraction, where pre-trained layers are used as fixed feature extractors, or gradual unfreezing, where layers are unfrozen gradually during training, can help preserve and transfer valuable knowledge.

1. **Regularize the model:** Apply regularization techniques, **such as dropout or weight decay,** during fine-tuning to prevent overfitting. Regularization helps the model generalize better and reduces the risk of memorizing specific training examples.

1. **Monitor and evaluate performance:** Continuously monitor and evaluate the performance of the fine-tuned model on validation or holdout datasets. Use appropriate evaluation metrics to assess the model’s progress and make informed decisions on further fine-tuning adjustments.

1. **Data augmentation:** Augment the training data by applying transformations, perturbations, or adding noise. Data augmentation can increase the diversity and generalizability of the training data, leading to better fine-tuning results.

1. **Consider domain adaptation:** If the target task or domain significantly differs from the pre-training data, consider domain adaptation techniques. These methods aim to bridge the gap between the pre-training data and the target data, improving the model’s performance on the specific task.

1. **Regularly back up and save checkpoints:** Save model checkpoints at regular intervals during fine-tuning to ensure progress is saved and prevent data loss. This practice allows for easy recovery and enables the exploration of different fine-tuning strategies.
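
Several of these practices (gradual unfreezing, a smaller learning rate for the pre-trained layers, dropout and weight decay) can be combined in a few lines of PyTorch. The sketch below is one possible arrangement, not a prescribed recipe: the four small layers stand in for a pre-trained network, and the helper names `unfreeze_top` and `count_trainable_layers` as well as all values are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: four "pre-trained" layers plus a new head with dropout.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
head = nn.Sequential(nn.Dropout(p=0.1), nn.Linear(8, 2))


def unfreeze_top(layers, n):
    """Freeze every layer, then make only the top `n` layers trainable."""
    first_trainable = len(layers) - n
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index >= first_trainable


def count_trainable_layers(layers):
    return sum(all(p.requires_grad for p in layer.parameters()) for layer in layers)


# Parameter groups: a small learning rate for the pre-trained layers, a larger one
# for the new head, and weight decay as regularization for both.
optimizer = torch.optim.AdamW(
    [
        {"params": layers.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.01,
)

for epoch in range(3):
    unfreeze_top(layers, n=epoch + 1)       # unfreeze one more layer each epoch
    print(f"epoch {epoch}: {count_trainable_layers(layers)} pre-trained layer(s) trainable")
    # ... one epoch of training, validation and checkpoint saving would run here ...
```

Layers that are still frozen simply receive no gradient, so the optimizer skips them even though they are registered in a parameter group.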

Since we are looking to fine-tune the model for a downstream task like classification, there are two ways to do it:

### <span style="color: #7b6b59;">1. A simple way</span>

**Fine-tuning pretrained NLP models with Hugging Face’s Trainer:** *A simple way to fine-tune pretrained NLP models without native PyTorch or TensorFlow*

While working on a data science competition, I was fine-tuning a pre-trained model and realised how tedious it was to fine-tune a model using native PyTorch or TensorFlow. I experimented with Hugging Face’s **Trainer API** and was surprised by how easy it was.

- **Train Our Classification Model:** Now that our input data is properly formatted, it’s time to fine tune the pre-trained model, for instance a BERT model.
    - For this task, we first want to modify the pre-trained BERT model to give outputs for classification, and then we want to continue training the model on our dataset until the entire model, end-to-end, is well-suited for our task.

    - **Classification Head:** Finally, the output from the pooler is passed through the classification head, which simply involves projecting the pooled embedding into a space with dimensionality equal to the number of different classes. It is called a head because this component of the model can be swapped out to suit a particular task. This is in contrast to the backbone of BERT — responsible for creating the contextualized representations of the tokens in the sequence — that remains the same regardless of the task.
    - Thankfully, the Hugging Face PyTorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types designed to accommodate their specific NLP task. [Here](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html) is the current list of classes provided for fine-tuning.
    
    - The `BertForSequenceClassification` class is the outermost class that we call to instantiate our BERT model. It houses both the base architecture (`self.bert`) and the classification head (`self.classifier`). The outputs are the logits, with one value for each class. Taking the maximum value of these logits gives us the predicted class. However, if we want to interpret the logits as probabilities, the softmax function needs to be applied. In effect, `BertForSequenceClassification` fine-tunes a logistic-regression-style linear layer on top of the 768-dimensional pooled output.
   
    - We’ll be using `BertForSequenceClassification`. This is the normal BERT model with an added single linear layer on top for classification that we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer are trained on our specific task.
    
    - So, in summary, we fine-tune the entire pre-trained BERT model, including the last layers specifically designed for our classification task. It adjusts the model to our document classification problem using the data we provide, but it doesn't train the model entirely from scratch. The distinction is that "from scratch" would mean initializing all the model's weights randomly and learning them solely from our data, which usually requires a much larger dataset and more computational resources. Here, we're leveraging the general understanding already embedded in the BERT model from its pre-training, which provides a significant head start for most NLP tasks.
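
The role of the classification head can be illustrated with a few lines of PyTorch. This is a sketch under assumptions, not the library's source code: `pooled` is a random stand-in for the 768-dimensional pooled output of the backbone, and `num_labels = 2` is chosen to match a binary task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, num_labels = 768, 2

# Random stand-in for the pooled embeddings of a batch of 4 sequences.
pooled = torch.randn(4, hidden_size)

# The head projects the pooled embedding onto one value (logit) per class.
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(pooled)

predicted_class = logits.argmax(dim=-1)          # class with the largest logit
probabilities = torch.softmax(logits, dim=-1)    # logits interpreted as probabilities

print(logits.shape, predicted_class.shape, probabilities.sum(dim=-1))
```

Because softmax preserves the ordering of the logits, the predicted class is the same whether it is taken from the logits or from the probabilities.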

### <span style="color: #7b6b59;">2. Adding Custom Layers on Top of a Hugging Face Model</span>

Alternatively, we can define a custom module that creates a BERT model based on the pre-trained weights and adds layers on top of it.
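
A custom module of this kind could look like the sketch below. It is an illustration under assumptions: the class names `CustomClassifier` and `ToyBackbone`, the layout of the extra layers, and the choice of the first token's embedding as the sequence representation are all made up for the example. `ToyBackbone` only mimics the `last_hidden_state` output of a Hugging Face encoder so that the snippet runs without downloading weights; in practice one would pass a real backbone such as `AutoModel.from_pretrained(...)` and use its `config.hidden_size`.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class CustomClassifier(nn.Module):
    """Wraps an encoder backbone and adds custom layers on top of it."""

    def __init__(self, backbone, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        first_token = outputs.last_hidden_state[:, 0]    # embedding of the first token
        return self.head(first_token)                    # logits, one per class


class ToyBackbone(nn.Module):
    """Hypothetical stand-in that mimics an encoder's `last_hidden_state` output."""

    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


model = CustomClassifier(ToyBackbone(), hidden_size=32, num_labels=2)
input_ids = torch.randint(0, 100, (4, 10))       # batch of 4 sequences, 10 tokens each
attention_mask = torch.ones_like(input_ids)
logits = model(input_ids, attention_mask)
print(logits.shape)
```

Compared with a ready-made sequence classification class, this route gives full control over the head at the cost of writing the forward pass and, usually, the loss computation by hand.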



# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 3: Fine-tune a pre-trained model with 🤗 Transformers</div>

## <span style="color: #7b6b59;">1. Introduction</span>

There are significant benefits to using a pretrained model. It reduces computation costs and your carbon footprint, and allows you to use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. **When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning,** an incredibly powerful training technique. In this section, we will fine-tune a pretrained model with a deep learning framework of our choice:

- Fine-tune a pretrained model with 🤗 Transformers PyTorch Trainer.
- Fine-tune a pretrained model in TensorFlow with Keras.
- Fine-tune a pretrained model in native PyTorch.

## <span style="color: #7b6b59;">2. Create a dataset or Prepare the dataset</span>

- **From in-memory data:** It is also possible to instantiate a `datasets.Dataset` directly from in-memory data, currently either:
    - a python dict, or
    - a pandas dataframe.

A `datasets.Dataset` instance is, more precisely, a table with rows and columns in which the columns are typed. Querying an example (a single row) will thus return a Python dictionary with keys corresponding to column names, and values corresponding to the example’s value for each column.

You can get the number of rows and columns of the dataset with various standard attributes. 

Sometimes, you may need to create a dataset if you’re working with your own data. Creating a dataset with **🤗 Datasets confers all the advantages of the library to your dataset: fast loading and processing, streaming of enormous datasets, memory-mapping, and more.** You can easily and rapidly create a dataset with 🤗 Datasets low-code approaches, reducing the time it takes to start training a model. In many cases, it is as easy as dragging and dropping your data files into a dataset repository on the Hub.

Creating a `Dataset` object from our dataset when fine-tuning a pre-trained model with Hugging Face Transformers is important for several reasons:

1. **Efficiency:** The `Dataset` object is optimized for performance. It enables efficient data loading, preprocessing, and iteration, which is crucial when dealing with large datasets common in NLP tasks.

1. **Easy Integration:** Hugging Face Transformers and Datasets libraries are designed to work together seamlessly. By using a `Dataset` object, we can directly apply transformations, tokenization, and batching, which are necessary for preparing our data for the model.

1. **Consistency and Reproducibility:** Creating a `Dataset` object ensures that data processing steps are consistent. This is important for reproducibility of results, a key aspect of any scientific experiment. You can share your dataset with others, and they'll be able to achieve the same results using the same preprocessing steps.

1. **Advanced Features:** The Dataset object comes with many advanced features like easy slicing, indexing, and even complex transformations. It supports operations like `map`, `filter`, and `shuffle`, which are essential for training neural networks.

1. **Scalability:** Datasets in Hugging Face are designed to be scalable. They can handle datasets much larger than your system's RAM and facilitate distributed training by efficiently managing memory and processing.

1. **Community Standards:** Using widely adopted standards like the Dataset object from Hugging Face ensures that your work is accessible and understandable by a broader community. It also makes it easier for you to use datasets and models shared by others.

In essence, **using a `Dataset` object simplifies the data preprocessing pipeline, ensures efficient and reproducible training, and aligns your work with community practices.**




In [18]:
train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)

print(f"The shape of the train dataset is: {train_dataset.shape}")
print("---------------------------------------------------")

print(f"The number of columns in the train dataset is: {train_dataset.num_columns}")
print(f"The column names are: {train_dataset.column_names}")
print(f"The columns' detailed types are: {train_dataset.features}")

print("---------------------------------------------------")
print(f"The number of rows in the train dataset is: {train_dataset.num_rows}")
print(f"Or the length of the train dataset is: {len(train_dataset)}")

The shape of the train dataset is: (40151, 5)
---------------------------------------------------
The number of columns in the train dataset is: 5
The column names are: ['text', 'label', 'prompt_name', 'source', 'RDizzl3_seven']
The columns' detailed types are: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), 'prompt_name': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None), 'RDizzl3_seven': Value(dtype='bool', id=None)}
---------------------------------------------------
The number of rows in the train dataset is: 40151
Or the length of the train dataset is: 40151


In [19]:
# While you can access a single row with the train_dataset[i] pattern, 
# you can also access several rows using slice notation or with a list of indices (or a numpy/torch/tf array of indices):
#print(train_dataset[1])
#print("--------------------------------\n")
#print(train_dataset[:2])
#print("--------------------------------\n")
#print(train_dataset["text"][:2])

## <span style="color: #7b6b59;">3. Initialise pre-trained model and tokenizer</span>

Before we can fine-tune a pretrained model, we have to prepare it for training. As you now know, we need a tokenizer to process the text and include a **padding** and **truncation** strategy to handle any variable sequence lengths. To process our dataset in one step, we use the 🤗 Datasets `map` method to apply a preprocessing function over the entire dataset.

To feed our text to DeBERTa, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.

The tokenization must be performed by the tokenizer included with DeBERTa; the cell below will download it for us.

In [20]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall", use_fast=True)




Since we are using a pretrained model, we need to ensure that the input data is in the same form as what the pretrained model was trained on. Thus, we would need to instantiate the tokenizer using the name of the model.

Now that the model and tokenizer have been initialised, we can proceed to preprocess the data.

**Preprocess text using pretrained tokenizer**

Let us preprocess the text using the tokenizer initialised earlier.

The input text that we are using for the tokenizer is a list of strings.

We have set `padding=True`, `truncation=True`, `max_length=128` so that the model receives inputs of equal length within each batch: long texts are truncated to 128 tokens, while shorter texts have extra padding tokens added up to the length of the longest sequence in the batch (at most 128 tokens).

A limit of 128 tokens keeps memory use and training time manageable. It is a choice made here rather than the maximum length the pre-trained model can take, so the end of longer texts is discarded.

After tokenizing your text, you will get a python dictionary with 3 keys:

- input_ids
- token_type_ids
- attention_mask
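
To show what truncation, padding and the attention mask mean without downloading a tokenizer, here is a toy sketch in plain Python. It is purely illustrative: `toy_encode` is a made-up whitespace "tokenizer", whereas the real tokenizer uses a subword vocabulary, adds special tokens and also returns `token_type_ids`.

```python
def toy_encode(texts, max_length=6, pad_id=0):
    """Whitespace tokenizer with truncation and padding to the longest sequence."""
    vocab = {}
    encoded = []
    for text in texts:
        # Map every new word to the next free id (id 0 is reserved for padding).
        ids = [vocab.setdefault(token, len(vocab) + 1) for token in text.lower().split()]
        encoded.append(ids[:max_length])                          # truncation
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in encoded]       # padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["a short text", "a much longer text that will be truncated here"])
for key, rows in batch.items():
    print(key)
    for row in rows:
        print("   ", row)
```

The attention mask marks real tokens with 1 and padding with 0, which is how the model knows which positions to ignore.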



In [21]:
def tokenize_function(samples):
    return tokenizer(samples["text"], max_length=128, padding=True, truncation=True)

In [22]:
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_valid_dataset = valid_dataset.map(tokenize_function, batched=True)


## <span style="color: #7b6b59;">4. Train with PyTorch Trainer</span>

🤗 Transformers provides a Trainer class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

1. **Start by loading your model and specifying the number of expected labels:** You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don’t worry, this is completely normal! The pretrained head of the model is discarded and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.
1. **Training hyperparameters:** Next, create a `TrainingArguments` class which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training hyperparameters, but feel free to experiment with these to find your optimal settings.
1. **Evaluate:** `Trainer` does not automatically evaluate model performance during training. You’ll need to pass Trainer a function to compute and report metrics. 
1. **Trainer:** Create a `Trainer` object with your model, training arguments, training and test datasets, and evaluation function. Then fine-tune your model by calling `train()`.

For this task, we first want to modify the pre-trained DeBERTa model to give outputs for classification, and then we want to continue training the model on our dataset until the entire model, end-to-end, is well-suited for our task.

Thankfully, the Hugging Face PyTorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained DeBERTa model, each has different top layers and output types designed to accommodate their specific NLP task. We’ll be using `AutoModelForSequenceClassification`. This is the normal DeBERTa model with an added classification layer on top that we will use as a sentence classifier. As we feed input data, the entire pre-trained DeBERTa model and the additional untrained classification layer are trained on our specific task. For a closely related implementation, see also the [BertForSequenceClassification source code](https://huggingface.co/transformers/v3.0.2/_modules/transformers/modeling_bert.html#BertForSequenceClassification).


In [23]:
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-xsmall", num_labels=2)
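model.to("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU when available, otherwise stay on the CPU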

pytorch_model.bin:   0%|          | 0.00/241M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-xsmall and are newly initialized: ['classifier.weight', 'pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 384, padding_idx=0)
      (LayerNorm): LayerNorm((384,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=384, out_features=384, bias=True)
              (key_proj): Linear(in_features=384, out_features=384, bias=True)
              (value_proj): Linear(in_features=384, out_features=384, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=384, out_features=384, bias=True)
              (LayerNorm): LayerNorm((384,), eps=1e-07, elementwise_affine

In [24]:
metric_name = "roc_auc"
train_batch_size = 4
eval_batch_size = 32
grad_acc = 4
num_steps = len(train_df) // (train_batch_size * grad_acc)
num_steps

2509
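As a worked check of the arithmetic above: with gradient accumulation, one optimizer step consumes `train_batch_size * grad_acc` samples. The sketch below assumes 40,151 training rows, which is the total of the fold sizes printed later in this notebook; `n_rows` is a stand-in for `len(train_df)` and `eval_every` is an illustrative name.

```python
n_rows = 40151          # assumption: stand-in for len(train_df)
train_batch_size = 4
grad_acc = 4

effective_batch_size = train_batch_size * grad_acc   # samples consumed per optimizer step
num_steps = n_rows // effective_batch_size           # optimizer steps in one epoch
eval_every = num_steps // 3                          # evaluate three times per epoch

print(effective_batch_size, num_steps, eval_every)   # 16 2509 836
```

Dividing `num_steps` by three is what makes the evaluation and checkpoint intervals in the training arguments fall three times within a single epoch.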

**Defining TrainingArguments and Trainer**

Here is where the magic of the `Trainer` API is. We define the training parameters in `TrainingArguments`, pass them to `Trainer`, and train the model with a single command.

We first need to define a function that computes metrics on the validation set. Since this is a binary classification problem, accuracy, precision, recall and F1 score are all options; in this notebook we report ROC AUC.

Next, we specify some training parameters in `TrainingArguments` and pass the pretrained model, the training data and the evaluation data to `Trainer`.

After we have defined the parameters, simply run `trainer.train()` to train the model.

In [25]:
training_args = TrainingArguments(
    output_dir="deberta-v3-xsmall_finetuned",
    evaluation_strategy="steps", # report the evaluation metric every `eval_steps` optimizer steps (about three times per epoch here) instead of once per epoch
    save_strategy = "steps",
    eval_steps = num_steps // 3,
    save_steps = num_steps // 3,
    learning_rate=2e-5,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    gradient_accumulation_steps=grad_acc,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=False,
    metric_for_best_model=metric_name,
    report_to='none', # change to wandb after enabling internet access
)

In [26]:
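def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Numerically stable softmax: subtract the row-wise max before exponentiating
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.sum(np.exp(shifted), axis=-1, keepdims=True)
    # Binary task: score with the predicted probability of the positive class (label 1)
    auc = roc_auc_score(labels, probs[:, 1])
    return {"roc_auc": auc}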
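To see what such a metric function receives and returns, here is a toy call on hand-made logits. The function name `toy_compute_metrics` and the logit values are illustrative assumptions; the function computes ROC AUC from softmax probabilities and, for comparison, two threshold-based metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def toy_compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Stable softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    # Hard predictions are only needed for the threshold-based metrics
    preds = probs.argmax(axis=-1)
    return {
        "roc_auc": roc_auc_score(labels, probs[:, 1]),
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }

# Four hand-made examples, two per class; the second one is deliberately misclassified
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [0.3, 0.4]])
toy_labels = np.array([0, 0, 1, 1])

metrics = toy_compute_metrics((toy_logits, toy_labels))
print(metrics)
```

ROC AUC depends only on how the class-1 probabilities rank the examples, whereas accuracy and F1 depend on the hard labels obtained from the argmax.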


In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [28]:
#trainer.train()

## <span style="color: #7b6b59;">5. Making prediction</span>

After the model is trained, we repeat the same steps for the test data:

1. Tokenize test data with pretrained tokenizer
1. Create torch dataset
1. Load trained model
1. Define Trainer

To load the trained model from the previous steps, set `model_path` to the directory containing the trained model weights.

Making predictions also takes only a single command: `test_trainer.predict(test_dataset)`.

The prediction returns only the raw model outputs (logits), so some post-processing is needed to get them into a usable format.

Since this is a simple sequence classification task, we can take the argmax across axis 1 to obtain class labels, or apply a softmax to obtain class probabilities, as the submission cell in this section does. Note that other NLP tasks may require different ways to post-process the raw predictions.
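The post-processing described above can be sketched with NumPy alone. The `raw_logits` array is a made-up stand-in for the predictions returned by `predict()` (an assumption for illustration), with one row per test example and one column per class.

```python
import numpy as np

# Stand-in for the raw predictions: shape [n_examples, n_classes]
raw_logits = np.array([[2.0, -1.0], [0.5, 1.5], [-0.3, -0.2]])

# Hard labels: index of the largest logit in each row
pred_labels = np.argmax(raw_logits, axis=1)

# Soft scores: softmax turns each row of logits into probabilities that sum to 1
shifted = raw_logits - raw_logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
prob_generated = probs[:, 1]  # probability of class 1

print(pred_labels)  # [0 1 1]
```

For two classes, the argmax and a 0.5 threshold on the class-1 probability give the same labels; the probabilities themselves are what a ranking metric such as ROC AUC needs.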

In [29]:
# test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
# test_ds = Dataset.from_pandas(test)
# test_ds_enc = test_ds.map(tokenize_function, batched=True)


In [30]:
#test_preds = trainer.predict(test_ds_enc)

In [31]:
# logits = test_preds.predictions
# probs = np.exp(logits) / np.sum(np.exp(logits), axis=-1, keepdims=True)
# sub = pd.DataFrame()
# sub['id'] = test['id']
# sub['generated'] = probs[:,1]
# sub.to_csv('submission.csv', index=False)


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 4: Adding Custom Layers on Top of a Hugging Face Model</div>



## <span style="color: #7b6b59;">Step 1: Tokenization with Hugging Face 🤗 Transformers</span>


A tokenizer is in charge of preparing the inputs for a model. The Hugging Face library contains tokenizers for all the models. 

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. For text, use a **Tokenizer** to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors. The main tool for preprocessing textual data is a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer).

1. A tokenizer splits text into tokens according to a set of rules. 
1. The tokens are converted into numbers and then tensors, which become the model inputs. 
1. Any additional inputs required by the model are added by the tokenizer.
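The three steps above can be illustrated with a deliberately tiny whitespace tokenizer. This is a toy sketch, not how the Hugging Face tokenizers work internally: the vocabulary, the `toy_tokenize` function, and the special-token ids are all made up for illustration.

```python
# Toy vocabulary (assumption): real tokenizers learn a subword vocabulary from a corpus
vocab = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
    "second": 4, "breakfast": 5, "what": 6, "about": 7,
}

def toy_tokenize(text):
    # Step 1: split the text into tokens according to a rule (here: lowercase + whitespace)
    tokens = text.lower().split()
    # Step 2: convert the tokens to numbers, mapping unknown words to [UNK]
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # Step 3: add the extra inputs the model expects (special tokens, attention mask)
    input_ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_tokenize("What about second breakfast")
print(encoded)
```

A real pretrained tokenizer follows the same outline but splits text into subword pieces, so rare words rarely fall back to an unknown token.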

***Tip:*** If you plan on using a pretrained model, it’s important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus and uses the same token-to-index mapping (usually referred to as the vocab) as during pretraining.

### <span style="color: #7b6b59;">Step 1.1: Loading a pretrained tokenizer</span>


Get started by loading a pretrained tokenizer with the `AutoTokenizer.from_pretrained()` method. This downloads the vocab a model was pretrained with:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

```

In [32]:
from transformers import AutoTokenizer

OUTPUT_DIR = "./"
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
tokenizer.save_pretrained(OUTPUT_DIR + "tokenizer/")

('./tokenizer/tokenizer_config.json',
 './tokenizer/special_tokens_map.json',
 './tokenizer/spm.model',
 './tokenizer/added_tokens.json',
 './tokenizer/tokenizer.json')

### <span style="color: #7b6b59;">Step 1.2: Then pass your text to the tokenizer</span>


The tokenizer returns a dictionary with three important items:

1. **input_ids** are the indices corresponding to each token in the sentence.
1. **attention_mask** indicates whether a token should be attended to or not.
1. **token_type_ids** identifies which sequence a token belongs to when there is more than one sequence.

In [33]:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)


{'input_ids': [1, 771, 298, 57249, 267, 262, 6303, 265, 41267, 261, 270, 306, 281, 6245, 263, 1538, 264, 5693, 260, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Return your input by decoding the input_ids:

In [34]:
tokenizer.decode(encoded_input["input_ids"])

'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger.[SEP]'

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need special tokens, but if they do, the tokenizer automatically adds them for you.

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:


In [35]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[1, 420, 339, 314, 567, 2962, 302, 2], [1, 1310, 280, 297, 428, 313, 2212, 314, 567, 2962, 261, 31663, 260, 2], [1, 458, 314, 11583, 268, 3933, 302, 2]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


- **Pad:** Sentences aren’t always the same length which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences.
    - Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence.
- **Truncation:** On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you’ll need to truncate the sequence to a shorter length.
    - Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model.
- **Build tensors:** Finally, you want the tokenizer to return the actual tensors that get fed to the model.
    - Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:





In [36]:
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
print(encoded_input)

{'input_ids': tensor([[    1,   420,   339,   314,   567,  2962,   302,     2,     0,     0,
             0,     0,     0,     0],
        [    1,  1310,   280,   297,   428,   313,  2212,   314,   567,  2962,
           261, 31663,   260,     2],
        [    1,   458,   314, 11583,   268,  3933,   302,     2,     0,     0,
             0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}


In [37]:
text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
inputs = tokenizer.encode_plus(
    text, 
    return_tensors="pt", 
    add_special_tokens=True, 
    padding="max_length",
    max_length=512,
    truncation=True
)

In [38]:
tokenizer(
    "Do not meddle in the affairs of wizards, for they are subtle and quick to anger.", return_tensors=None, 
    add_special_tokens=True, 
    padding="max_length",
    max_length=512,
    truncation=True
)

{'input_ids': [1, 771, 298, 57249, 267, 262, 6303, 265, 41267, 261, 270, 306, 281, 6245, 263, 1538, 264, 5693, 260, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

I typically use the `tokenizer.encode_plus()` function to tokenize my input, but there is another function that can be used, namely `tokenizer.encode()`. The main difference is that `tokenizer.encode_plus()` returns more information: the input ids, the attention mask, and the token type ids, all in one dictionary. `tokenizer.encode()` **only returns the input ids**, either as a list or as a tensor depending on the `return_tensors` parameter (for example `return_tensors="pt"`).

In Hugging Face's Transformers library, `tokenizer(input)` and `tokenizer.encode_plus(...)` do essentially the same job. Calling the tokenizer directly is a high-level wrapper: it accepts the same options (such as `padding`, `truncation`, `max_length` and `return_tensors`) and typically dispatches internally to `encode_plus` for a single text or `batch_encode_plus` for a list of texts. Both handle tokenization, conversion to token IDs, adding special tokens, creating attention masks, and managing sequence length (truncation and padding).

The exact methods called in the background can vary between tokenizer classes, but they perform the same tasks to prepare and format the input for the model. The two cells above illustrate this: both pad the same sentence to 512 tokens, once through `encode_plus` and once through a direct call.


In [39]:
train_text = train_df['text'][:16].tolist()

In [40]:
features = tokenizer.batch_encode_plus(
    train_text, 
    return_tensors="pt", 
    add_special_tokens=True, 
    padding="max_length",
    max_length=512,
    truncation=True
)

In [41]:
type(features["input_ids"])

torch.Tensor

In [42]:
train_df.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False


In [43]:
features["input_ids"].size()

torch.Size([16, 512])

In [44]:
def prepare_input(text):
    inputs = tokenizer.encode_plus(
        text, 
        return_tensors="pt", 
        add_special_tokens=True, 
        padding="max_length",
        max_length=512,
        truncation=True
    )
  
    return inputs

## <span style="color: #7b6b59;">Step 2: Prepare the Training Data</span>

### <span style="color: #7b6b59;">Cross-Validation Split</span>

In [45]:
train_df.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False


In [46]:
from sklearn.model_selection import StratifiedKFold

n_folds = 4
train_folds = [0, 1, 2, 3]

stratified_k_fold = StratifiedKFold(
    n_splits=n_folds,
    shuffle=True, 
    random_state=123
)

# Iterating Over Each Fold
# The enumerate function is used to iterate over the fold splits. 
# It provides two pieces of information for each iteration:
# fold: The current fold number (starting from 0).
# (train_index, val_index): Two arrays containing indices of the training and validation data for the current fold
for fold, (train_index, val_index) in enumerate(stratified_k_fold.split(train_df, train_df["label"])): # It generates indices for training and validation sets for each fold.
    
    # Inside the loop, for each fold, the validation indices (val_index) are used 
    # to assign the fold number to the corresponding rows in train_df.
    # This line does the assignment. It sets the 'fold' column of the DataFrame for rows in val_index to the current fold number.
    # This effectively tags each data point with the fold number it will be a part of in the validation set.
    # The purpose of this assignment is to keep track of which data points should be in the validation set for each fold.
    # When you actually train the model, you can easily filter the DataFrame to get the appropriate training and validation sets based on these fold numbers.

    train_df.loc[val_index, 'fold'] = int(fold)

train_df['fold'] = train_df['fold'].astype(int)
display(train_df.groupby('fold').size())                                    

fold
0    10038
1    10038
2    10038
3    10037
dtype: int64

In [47]:
train_df.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven,fold
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False,3
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False,0
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False,0
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False,3
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False,1


In [48]:
for fold in range(n_folds):
    
    train_folds = train_df[train_df["fold"] != fold].reset_index(drop=True)
    valid_folds = train_df[train_df["fold"] == fold].reset_index(drop=True)
    print(valid_folds.shape)
    valid_labels = valid_folds["label"].values


(10038, 6)
(10038, 6)
(10038, 6)
(10037, 6)
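The fold-tagging pattern above is easier to inspect on a toy DataFrame. The sketch below uses 12 made-up rows (8 of class 0, 4 of class 1) with the same `StratifiedKFold` settings; the names `toy_df`, `per_fold`, `train_part` and `valid_part` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

toy_df = pd.DataFrame({
    "text": [f"essay {i}" for i in range(12)],
    "label": [0] * 8 + [1] * 4,
})

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=123)

# Tag every row with the fold in which it serves as validation data
toy_df["fold"] = -1
for fold, (_, val_index) in enumerate(skf.split(toy_df, toy_df["label"])):
    toy_df.loc[val_index, "fold"] = fold

# Stratification keeps the class ratio (2:1 here) the same in every fold
per_fold = toy_df.groupby("fold")["label"].agg(["size", "sum"])
print(per_fold)

# Selecting the data for one fold is then a simple filter on the fold column
valid_part = toy_df[toy_df["fold"] == 0].reset_index(drop=True)
train_part = toy_df[toy_df["fold"] != 0].reset_index(drop=True)
```

Each row is used for validation exactly once, so the training and validation parts of any given fold never overlap.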



### <span style="color: #7b6b59;">Introduction to PyTorch Dataset and DataLoader</span>

Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: `torch.utils.data.DataLoader` and `torch.utils.data.Dataset` that allow you to use pre-loaded datasets as well as your own data. **Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.**

Your training pipeline should be as modular as possible in order to aid quick prototyping and maintaining usability. Using a poorly-written data loader / not using a data loader (using a Python generator or a function) can affect the parallelization ability of your code. 

Dataset processing is a highly important part of any training pipeline and should be kept separate from modeling. 

***How to use `Datasets` and `DataLoader` in PyTorch for custom text data***

In this section, we'll go through the PyTorch data primitives, namely `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`, and understand how to create our own dataset by subclassing `Dataset` and how to wrap it in a `DataLoader`.

We will learn how to make a custom Dataset and manage it with DataLoader in PyTorch. Creating a PyTorch `Dataset` and managing it with `DataLoader` keeps your data manageable and helps to simplify your machine learning pipeline. **A Dataset stores all your data, and a DataLoader can be used to iterate through the data, manage batches, transform the data, and much more.**


- **Pandas** is not essential to create a Dataset object. However, it’s a powerful tool for managing data, so I’m going to use it.

- **`torch.utils.data`** imports the required functions we need to create and use Dataset and DataLoader.



In [49]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

### <span style="color: #7b6b59;">Implementing A Custom Dataset In PyTorch</span>

A dataset is an abstract class in PyTorch that represents a collection of data. It is responsible for loading and preprocessing data from a source and returning it in the form of a PyTorch tensor.


Now, for most purposes, you will need to write your own implementation of a `Dataset`. So let's see how you can write a custom dataset by subclassing `torch.utils.data.Dataset`.

You'll need to implement three methods:

1. **`__init__`**: This method is called when the object is instantiated. It's typically used to store essentials such as file paths, transforms, or the data itself. In `class CustomTextDataset(Dataset)`, the class name can be whatever you want, and `Dataset` is the base class we imported earlier. `def __init__(self, df)` takes the DataFrame and stores its `text` and `label` columns as `self.texts` and `self.labels`, so the other methods can use them.

1. **`__len__`**: This method returns the number of samples in the dataset. For example, if the dataset holds 5 texts, the integer 5 is returned.

1. **`__getitem__`**: This is the big kahuna 🏅. This method returns a single sample from the dataset at the given index, and it is where the actual data loading and preprocessing take place. It takes an index as input and returns a data point, which can be a tensor or a dictionary of tensors. The `DataLoader` class calls this method to load and preprocess the data.


In [50]:
class CustomTextDataset(Dataset):
    def __init__(self, df):
        self.texts = df["text"].values
        self.labels = df["label"].values

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        inputs = prepare_input(self.texts[idx])
        label = torch.tensor(self.labels[idx], dtype=torch.float)
        return inputs, label

In [51]:
train_dataset = CustomTextDataset(train_df[:16])

### <span style="color: #7b6b59;">PyTorch DataLoader: A Complete Guide</span>

PyTorch provides an intuitive and incredibly versatile tool, the DataLoader class, to load data in meaningful ways. Because data preparation is a critical step to any type of data work, being able to work with, and understand, DataLoaders is an important step in your deep learning journey. The Dataset retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.

**DataLoader is an iterable that abstracts this complexity for us in an easy API.**

The PyTorch DataLoader class is built on top of the PyTorch Dataset class, which provides a standard interface for accessing data. The DataLoader class takes in a Dataset object and provides a way to iterate over the data in batches. This allows for efficient processing of large datasets by allowing parallelization of data loading and preprocessing.



**What Does a PyTorch DataLoader Do?**

The PyTorch DataLoader class is an important tool to help you prepare, manage, and serve your data to your deep learning networks. Because many pre-processing steps must happen before you begin training a model, finding ways to standardize these processes is critical for the readability and maintainability of your code.

The PyTorch DataLoader allows you to:

- **Define a dataset to work with:** identifying where the data is coming from and how it should be accessed.
- **Batch the data:** define how many training or testing samples to use in a single iteration. Because data are often split across training and testing sets of large sizes, being able to work with batches of data can allow your training and testing processes to be more manageable.
- **Shuffle the data:** PyTorch can handle shuffling data for you as it loads data into batches. This can increase representativeness in your dataset and prevent accidental skewness.
- **Support multi-processing:** the DataLoader can use several worker processes to load and preprocess data in parallel on the CPU, so that training is not left waiting for the next batch. The `num_workers` parameter defines how many workers run at once.
- **Merge datasets together:** optionally, PyTorch also allows you to merge multiple datasets together, for example with `torch.utils.data.ConcatDataset`. While this may not be a common task, having it available is a great feature.
- **Speed up loading onto CUDA devices:** because PyTorch can run on the GPU, setting `pin_memory=True` places batches in page-locked memory, which makes the subsequent copy to the GPU faster.
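A few of the capabilities listed above can be tried on toy tensors. In this sketch, `TensorDataset` and `ConcatDataset` from `torch.utils.data` stand in for real data; the sizes are arbitrary and chosen only to make the batch arithmetic visible.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Two small toy datasets: 10 samples of class 0 and 5 samples of class 1
part_a = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.zeros(10))
part_b = TensorDataset(torch.arange(10, 15).float().unsqueeze(1), torch.ones(5))

# Merge them into a single dataset of 15 samples
merged = ConcatDataset([part_a, part_b])

# Batch and shuffle; drop_last=True discards the final incomplete batch
loader = DataLoader(merged, batch_size=4, shuffle=True, drop_last=True)
n_batches = len(loader)  # 15 // 4 = 3 full batches, the remaining 3 samples are dropped

features, labels = next(iter(loader))
print(n_batches, features.shape, labels.shape)
```

With `drop_last=False` the loader would instead yield a fourth, smaller batch containing the three leftover samples.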

The DataLoader is a PyTorch utility class that provides a way to iterate over a Dataset object in batches. It is designed to handle large datasets efficiently and can be configured to load data in parallel, preprocess data on the fly, and shuffle data for each epoch.

The DataLoader takes in a Dataset object and provides a number of configuration options, including batch size, shuffling, and number of worker processes for parallel data loading. The DataLoader class is responsible for batching the data and returning it in a format that can be consumed by the model.


The `DataLoader` class has a lot of different parameters available. Of course, one of the most important parameters is the actual dataset. Generally, you’ll be working with at least a training and a testing dataset. **Because of this, it’s a convention that you’ll have at least two DataLoaders, to be able to load data for both your training and testing data.**

PyTorch lets you define many different parameters to influence how data are loaded. This can have a big impact on the speed at which your model can train, how well it can train, and ensuring that data are sampled appropriately.

We have loaded that dataset into the DataLoader and can iterate through the dataset as needed. Each iteration below returns a batch of train_features and train_labels (containing `batch_size=8` features and labels respectively). Because we specified `shuffle=True`, after we iterate over all batches the data is shuffled.


In [52]:
train_loader = DataLoader(
    train_dataset, # expects a PyTorch Dataset from which to load the data
    batch_size=8, # represents how many samples per batch to load
    shuffle=True, # indicates whether data should be shuffled at every epoch you run
    num_workers=4, # represents how many subprocesses to use for loading data.
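    pin_memory=True, # keeps batches in page-locked memory for faster host-to-GPU copies
    drop_last=True # drops the final incomplete batch if the dataset size is not divisible by batch_size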
)

In [53]:
# Conventionally, you will load both the index of a batch and the items in the batch.
# We can do this using the enumerate() function
# DataLoader will return an object that contains both the data and the target (if the dataset contains both). 
# We can access each item and its labels by iterating over the batches.
for step, (inputs, labels) in enumerate(train_loader):
    print(step)
    print(labels.size())
    print(labels)
    print(inputs["input_ids"].size())
    print(inputs["input_ids"][0].size())

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

0
torch.Size([8])
tensor([0., 0., 0., 0., 0., 0., 0., 0.])
torch.Size([8, 1, 512])
torch.Size([1, 512])
1
torch.Size([8])
tensor([0., 0., 0., 0., 0., 0., 0., 0.])
torch.Size([8, 1, 512])
torch.Size([1, 512])
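The shapes printed above contain an extra dimension: `[8, 1, 512]` rather than `[8, 512]`. This happens because `return_tensors="pt"` makes the tokenizer return each sample with its own leading batch dimension, and the DataLoader then stacks those samples. One possible way to remove it, sketched below with random ids in place of real tokenizer output (an assumption so that the example runs without downloading a tokenizer), is to squeeze that dimension inside `__getitem__`. The class name `ToyEncodedDataset` is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyEncodedDataset(Dataset):
    """Stand-in dataset that mimics per-sample tokenizer output of shape [1, max_length]."""

    def __init__(self, n_samples=16, max_length=512):
        # Random ids replace real tokenizer output (assumption for illustration)
        self.input_ids = torch.randint(0, 100, (n_samples, 1, max_length))
        self.labels = torch.zeros(n_samples)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # squeeze(0) drops the leading dimension of size 1 from the single sample
        inputs = {"input_ids": self.input_ids[idx].squeeze(0)}
        return inputs, self.labels[idx]

loader = DataLoader(ToyEncodedDataset(), batch_size=8, shuffle=True, drop_last=True)
inputs, labels = next(iter(loader))
print(inputs["input_ids"].shape)  # torch.Size([8, 512])
```

An alternative is to leave the dataset unchanged and squeeze dimension 1 of each batch tensor in the training loop; either way the model should receive inputs of shape `[batch, seq_len]`.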


## <span style="color: #7b6b59;">Step 3: Modelling with Hugging Face 🤗 Transformers</span>

### <span style="color: #7b6b59;">Introduction</span>


In Hugging Face Transformers, a base model typically returns two main outputs, and a third if configured, after receiving `input_ids` and `attention_mask` as input.

- **pooler output (batch size, hidden size)**: the last-layer hidden state of the first token of the sequence, further processed by a pooling layer. Not every model returns it.
- **last hidden state (batch size, seq len, hidden size)**: the sequence of hidden states at the output of the last layer.
- **hidden states (n layers, batch size, seq len, hidden size)**: the hidden states of all layers for all tokens, returned only when `output_hidden_states=True` is set.

In this notebook, we will show many different ways these outputs and hidden representations can be utilized to do much more than just adding an output layer. Below are the various techniques we will be implementing.


### <span style="color: #7b6b59;">Last Hidden State Output</span>


<img width="894" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/923fb8e9-58b0-4984-a4e3-d097afde3b88">

**This is the first and default output from models.**

The last hidden state is the sequence of hidden states at the output of the last layer of the model. Its shape is usually `[batch, maxlen, hidden_state]`; it can be narrowed down to `[batch, 1, hidden_state]` for the `[CLS]` token, since `[CLS]` is the first token in the sequence. Here, `[batch, 1, hidden_state]` can be equivalently considered as `[batch, hidden_state]`.
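The slicing described above can be checked on a random tensor. In this sketch, `last_hidden_state` is filled with random numbers as a stand-in for real model output (an assumption); only the shapes matter.

```python
import torch

batch_size, seq_len, hidden_size = 4, 512, 384

# Stand-in for the model's last hidden state: [batch, maxlen, hidden_state]
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Keep the sequence axis: [batch, 1, hidden_state]
cls_keepdim = last_hidden_state[:, :1, :]

# Index the first token directly: [batch, hidden_state]
cls_embedding = last_hidden_state[:, 0, :]

print(cls_keepdim.shape, cls_embedding.shape)
```

Both slices hold the same values; the second form is usually more convenient because it can be passed straight into a linear classification layer.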

#### <span style="color: #7b6b59;">Implementation Details</span>


All models have outputs that are instances of subclasses of [`ModelOutput`](https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/output#transformers.utils.ModelOutput). Those are data structures containing all the information returned by the model, but that can also be used as tuples or dictionaries.

```python

outputs = model(**inputs, labels=labels)

```
When we treat the `outputs` object as a tuple, only the attributes that are not `None` are considered. Here, for instance, it has two elements, `loss` and then `logits`, so `outputs[:2]` returns the tuple `(outputs.loss, outputs.logits)`.
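The tuple, dictionary and attribute views of a `ModelOutput` can be tried without running a model. The sketch below builds a `SequenceClassifierOutput` by hand; the loss and logit values are made up for illustration, whereas a real model would compute them.

```python
import torch
from transformers.modeling_outputs import SequenceClassifierOutput

# Hand-built output object (assumption: made-up values instead of real model results)
outputs = SequenceClassifierOutput(
    loss=torch.tensor(0.69),
    logits=torch.tensor([[0.1, -0.2], [0.3, 0.4]]),
)

# Tuple view: hidden_states and attentions are None, so they are skipped
as_tuple = outputs.to_tuple()
loss, logits = outputs[:2]

print(len(as_tuple))  # 2
# Dictionary-style and attribute-style access return the same tensor
print(torch.equal(outputs["logits"], outputs.logits))
```

Attribute access (`outputs.logits`) tends to be the most readable choice in training code, while the tuple view is handy when unpacking several values at once.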


In Hugging Face Transformers, when you use a model from the `AutoModel` class with `AutoModel.from_pretrained`, the specific subclass of `ModelOutput` that the model returns depends on the type of model you are using (e.g., BERT, GPT-2, T5, etc.) and the nature of the task (e.g., sequence classification, token classification, language modeling, etc.).

To determine which subclass of `ModelOutput` is returned, you should consider the following:

1. **Model Type:** Different models are designed for different kinds of tasks. For instance, BERT-like models might return `BaseModelOutput` or `SequenceClassifierOutput`, while GPT-like models might return `CausalLMOutput`.

1. **Task:** The nature of the task also influences the output type. For example:

    - For sequence classification tasks, models often return `SequenceClassifierOutput`.
    - For token classification tasks (like Named Entity Recognition), models might return `TokenClassifierOutput`.
    - For language modeling tasks, models could return `CausalLMOutput` or `MaskedLMOutput`.

1. **Documentation:** The best way to know for sure is to refer to the Hugging Face documentation for the specific model you are using. The documentation usually specifies the output format for each model.

1. **Inspecting the Output:** You can programmatically inspect the output to determine its type. For example, after running `outputs = self.model(**inputs)`, you can check `type(outputs)` to see the class of the output.

1. **Common Attributes:** Most ModelOutput subclasses have common attributes like `loss`, `logits`, `hidden_states`, and `attentions`, but the presence and relevance of these attributes can vary. The exact composition of the output object will align with the requirements of the model's intended task.

1. **Configuration:** Sometimes, the configuration of the model (`self.config`) can give you hints about the expected output type, especially if it contains task-specific configurations.

Remember that Hugging Face's design philosophy with `ModelOutput` is to provide flexibility and convenience, allowing outputs to be used like tuples, dictionaries, or objects with named attributes. This makes it easier to access the information you need for your specific application.

For instance, if we look at the `forward` method in the [documentation for the DeBERTa model](https://huggingface.co/docs/transformers/model_doc/deberta#transformers.DebertaModel.forward),  
we see that it "**Returns `transformers.modeling_outputs.BaseModelOutput` or `tuple(torch.FloatTensor)`**". If we then jump to the [BaseModelOutput documentation](https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/output#transformers.modeling_outputs.BaseModelOutput), we find:

<img width="1044" alt="image" src="https://github.com/microsoft/DeBERTa/assets/28102493/dbe927a5-4f60-4673-bf28-29a8a96e05aa">

### <span style="color: #7b6b59;">Mean Pooling</span>

#### <span style="color: #7b6b59;">Introduction</span>

Since Transformers are contextual models, the idea is that the `[CLS]` token will have captured the entire context and is sufficient for simple downstream tasks such as classification. Hence, for tasks such as classification using sentence representations, you can use the `[CLS]` representation of shape `[batch, hidden_state]`.

Alternatively, we can take the full last hidden state `[batch, maxlen, hidden_state]` and average it across the `maxlen` dimension to obtain mean embeddings.

There are multiple ways to do this. We could simply take `torch.mean(last_hidden_state, 1)`, but we will instead use the attention mask so that padding tokens are ignored, which gives a more accurate average embedding.
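A minimal sketch of mask-aware mean pooling follows. The helper name `mean_pooling` and the toy tensors are assumptions made for illustration; in the notebook the inputs would be the model's last hidden state and the tokenizer's attention mask.

```python
import torch

def mean_pooling(last_hidden_state, attention_mask):
    # Expand the mask to [batch, maxlen, 1] so it broadcasts over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # padding positions contribute 0
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sequence
    return summed / counts

# Toy batch: 2 sequences, maxlen 3, hidden size 2 (values are made up).
toy_hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],   # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = mean_pooling(toy_hidden, toy_mask)
print(pooled)   # the first row averages only the two real tokens
```

With a plain `torch.mean(toy_hidden, 1)`, the padding vector `[100, 100]` would dominate the first row; the masked version returns `[2, 3]` instead.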

***What is pooling in Transformer models?***

In the context of transformers, pooling refers to the process of summarizing the outputs of the transformer layers into a fixed-size vector, often used for downstream tasks such as classification.

In a transformer architecture, the input sequence is processed by a series of self-attention and feedforward layers. Each layer produces a sequence of output vectors, which encode the input sequence in a higher-level representation. Pooling involves taking the output vectors from one or more of these layers and aggregating them into a single vector.

There are different types of pooling mechanisms used in transformer architectures, including:


1. **Max Pooling:** where the maximum value across the sequence of output vectors is selected as the summary representation.

1. **Mean Pooling:** where the average of the output vectors is taken as the summary representation.

1. **Last Hidden State:** where the final output vector of the transformer is used as the summary representation.

1. **Self-Attention Pooling:** where a weighted sum of the output vectors is computed, with the weights determined by a learned attention mechanism.
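Two of the strategies in the list can be sketched in a few lines. The function names and toy tensors below are assumptions for illustration; "last hidden state" pooling is shown here as taking the first (`[CLS]`) token, and max pooling masks out padding before taking the maximum.

```python
import torch

def cls_pooling(last_hidden_state):
    # Keep only the first token ([CLS]) of each sequence: [batch, hidden].
    return last_hidden_state[:, 0]

def max_pooling(last_hidden_state, attention_mask):
    # Set padding positions to -inf so they can never win the max.
    mask = attention_mask.unsqueeze(-1).bool()
    masked = last_hidden_state.masked_fill(~mask, float("-inf"))
    return masked.max(dim=1).values

# One sequence of length 3 with hidden size 2; the last position is padding.
toy_hidden = torch.tensor([[[1.0, 5.0], [3.0, 2.0], [9.0, 9.0]]])
toy_mask = torch.tensor([[1, 1, 0]])

cls_vector = cls_pooling(toy_hidden)            # values: [[1, 5]]
max_vector = max_pooling(toy_hidden, toy_mask)  # values: [[3, 5]]
print(cls_vector, max_vector)
```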

#### <span style="color: #7b6b59;">Neural Networks: Pooling Layers</span>

In this section, we’ll walk through **pooling**, a widely used machine-learning technique that ***reduces the size of the input, and thus the complexity of deep learning models, while preserving important features and relationships in the input data***. In particular, we’ll introduce pooling, explain its usage, highlight its importance, and give brief examples of how it works.

***What Are Pooling Layers?***

In machine learning and neural networks, the dimensions of the input data and the number of parameters of the network play a crucial role. Both can be controlled by stacking one or more pooling layers. Depending on the type of pooling layer, an operation is performed on each channel of the input data independently to summarize its values into a single one and thus keep the most important features. These values are passed as input to the next layer of the model, and so on. The pooling process may be repeated several times, and each iteration reduces the spatial dimensions. The value aggregation can be performed using different techniques.

***Types of Pooling Layers***

There are many pooling operations and different extensions that have been developed to address specific challenges in different applications.


1. **Max Pooling:** Max pooling is a pooling technique that selects the maximum value from each patch of the input data and collects these values into a feature map. This method retains the most significant features of the input while reducing its dimensions.
    <img width="660" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/4e844e1d-8d59-4f97-beb8-00845cc7e45a">
    
1. **Average Pooling:** Average pooling calculates the average value of each patch of input data and collects these values into a feature map. This method is preferable when smoothing the input data is necessary, as it dampens the effect of outliers.
     <img width="625" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/af4b21e0-10f8-4518-9a5c-dbdf960f48fa">

1. **Global Pooling:** Global pooling summarizes all values of each channel of the input into a single value, regardless of their spatial location. This technique is also used to reduce the dimensionality of the input and can be performed using either the maximum or the average pooling operation.

1. **Stochastic Pooling:** Stochastic pooling is a non-deterministic pooling operation that introduces randomness into the max pooling process. This technique helps in improving the robustness of the model to small variations in the input data.
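The first three pooling types can be illustrated on a tiny input with PyTorch's functional API. The 4x4 single-channel "image" below is made up for illustration; the pooling calls are standard `torch.nn.functional` operations.

```python
import torch
import torch.nn.functional as F

# One image, one channel, 4x4 values 0..15 (layout: [batch, channels, height, width]).
image = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

max_pooled = F.max_pool2d(image, kernel_size=2)            # max of each 2x2 patch
avg_pooled = F.avg_pool2d(image, kernel_size=2)            # mean of each 2x2 patch
global_avg = F.adaptive_avg_pool2d(image, output_size=1)   # one value per channel

print(max_pooled.squeeze())   # values: [[5, 7], [13, 15]]
print(avg_pooled.squeeze())   # values: [[2.5, 4.5], [10.5, 12.5]]
print(global_avg.item())      # 7.5
```

Both patch-based operations shrink the 4x4 input to 2x2, while global pooling collapses each channel to a single number.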

***Advantages and Disadvantages***

In machine learning, pooling layers offer several advantages and disadvantages as well.

First of all, pooling layers help in keeping the most important characteristics of the input data. Furthermore, the addition of pooling layers in the neural network offers translation invariance, which means that the model can generate the same outputs regardless of small changes in the input. Moreover, these techniques help in reducing the impact of outliers.

On the other hand, the pooling processes may lead to information loss, increased training complexity, and limited model interpretability.

***Usages of Pooling Layers in Machine Learning***

Pooling layers play a critical role in the size and complexity of the model and are widely used in several machine-learning tasks. They are usually employed after the convolutional layers in the convolutional neural network’s structure and are mainly used for downsampling the output.

These techniques are commonly used in convolutional neural networks and deep learning models of computer vision, speech recognition, and natural language processing.


***In conclusion, pooling layers play a critical role in reducing the size and complexity of deep learning models while preserving important features and relationships in the input data.***




In [54]:
from transformers import AutoModel, AutoConfig

config = AutoConfig.from_pretrained("microsoft/deberta-v3-base", output_hidden_states=True)
model = AutoModel.from_pretrained("microsoft/deberta-v3-base", config=config)


In [55]:
with torch.no_grad():
    outputs = model(features['input_ids'], features['attention_mask'])

In [56]:
last_hidden_state = outputs[0]
attention_mask = features['attention_mask']

In [57]:
outputs[0].size()

torch.Size([16, 512, 768])

In [58]:
attention_mask.size()

torch.Size([16, 512])

## <span style="color: #7b6b59;">Step 4: Train the Model - Training Loop</span>

### <span style="color: #7b6b59;">Automatic Mixed Precision (AMP) for Deep Learning</span>

#### Introduction

Deep neural network training has traditionally relied on the IEEE single-precision format. With mixed precision, however, you can train with half precision while maintaining the network accuracy achieved with single precision. This technique of using both single- and half-precision representations is referred to as the **mixed precision technique**. Mixed precision methods combine the use of different numerical formats in one computational workload. This section describes the application of mixed precision to deep neural network training.

- The **IEEE single-precision floating-point format** is a binary format that occupies **4 bytes** (32 bits) in computer memory.
- **Half precision** is a binary floating-point format that occupies **2 bytes** (16 bits) in computer memory.
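These sizes can be checked directly in PyTorch. The sketch below uses `torch.finfo` and `element_size()` to print the bit width, bytes per value, and largest representable number for each format; the one-million-value tensor size is an arbitrary example.

```python
import torch

for dtype in (torch.float32, torch.float16):
    info = torch.finfo(dtype)
    bytes_per_value = torch.empty(0, dtype=dtype).element_size()
    print(dtype, "bits:", info.bits, "bytes:", bytes_per_value, "max:", info.max)

# Memory needed to store one million values in each format.
n_values = 1_000_000
fp32_bytes = n_values * torch.empty(0, dtype=torch.float32).element_size()
fp16_bytes = n_values * torch.empty(0, dtype=torch.float16).element_size()
print(fp32_bytes, fp16_bytes)
```

Half precision halves the storage, but its much smaller range (maximum 65504) is the reason some steps still need single precision.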

There are numerous benefits to using numerical formats with lower precision than 32-bit floating point. First, they require less memory, enabling the training and deployment of larger neural networks. Second, they require less memory bandwidth which speeds up data transfer operations. Third, math operations run much faster in reduced precision, especially on GPUs with **Tensor Core** support for that precision. Mixed precision training achieves all these benefits while ensuring that no task-specific accuracy is lost compared to full precision training. ***It does so by identifying the steps that require full precision and using 32-bit floating point for only those steps while using 16-bit floating point everywhere else.***

**Benefits of Mixed precision training**

- Speeds up math-intensive operations, such as linear and convolution layers, by using Tensor Cores.
- Speeds up memory-limited operations by accessing half the bytes compared to single-precision.
- Reduces memory requirements for training models, enabling larger models or larger minibatches.

*Nuance Research advances and applies conversational AI technologies to power solutions that redefine how humans and computers interact. The rate of our advances reflects the speed at which we train and assess deep learning models. With Automatic Mixed Precision, we’ve realized a 50% speedup in TensorFlow-based ASR model training without loss of accuracy via a minimal code change. We’re eager to achieve a similar impact in our other deep learning language processing applications.*, Wenxuan Teng, Senior Research Manager, Nuance Communications

#### Mixed Precision Training

**Mixed precision** training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta and Turing architectures, switching to mixed precision has yielded significant training speedups -- up to 3x overall on the most arithmetically intense model architectures. Enabling mixed precision training involves two steps:

1. Porting the model to use the FP16 data type where appropriate.
1. Adding loss scaling to preserve small gradient values.
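The need for the second step can be seen with a single number. In the sketch below, the gradient value `1e-8` and the scale factor `2**16` are illustrative assumptions; the point is only that a value too small for FP16 survives the cast once it has been scaled up, and can be unscaled again in FP32.

```python
import torch

tiny_grad = torch.tensor(1e-8)   # FP32 value below FP16's smallest positive number
scale = 65536.0                  # illustrative loss scale (2**16), an assumption

unscaled_fp16 = tiny_grad.to(torch.float16)           # underflows to zero
scaled_fp16 = (tiny_grad * scale).to(torch.float16)   # survives the cast
recovered = scaled_fp16.to(torch.float32) / scale     # unscale in FP32

print(unscaled_fp16.item(), scaled_fp16.item(), recovered.item())
```

Scaling the loss multiplies every gradient by the same factor during backpropagation, so dividing by that factor before the optimizer step restores the original magnitudes.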

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.

Deep learning researchers and engineers can easily get started enabling this feature on **Ampere**, **Volta** and **Turing** GPUs. On Ampere GPUs, automatic mixed precision uses FP16 to deliver a performance boost of 3X versus TF32, the new format which is already ~6x faster than FP32. On Volta and Turing GPUs, automatic mixed precision delivers up to 3X higher performance vs FP32 with just a few lines of code.

<img width="1121" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/ec6bbedf-82e8-43e2-bf0f-a3d0d3b2ee72">


Mixed precision is the combined use of different numerical precisions in a computational method.

- **Half precision (also known as FP16)** data, compared to higher-precision FP32 or FP64, reduces the memory usage of the neural network, allowing training and deployment of larger networks; FP16 data transfers also take less time than FP32 or FP64 transfers.

- **Single precision (also known as FP32)** is a common 32-bit floating-point format (`float` in C-derived programming languages); the 64-bit format is known as double precision (`double`).

Deep Neural Networks (DNNs) have led to breakthroughs in a number of areas, including:

- image processing and understanding
- language modeling
- language translation
- speech processing
- game playing, and many others.

DNN complexity has been increasing to achieve these results, which in turn has increased the computational resources required to train these networks. One way to lower the required resources is to use lower-precision arithmetic, which has the following benefits.

1. **Decrease the required amount of memory.** Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Lowering the required memory enables training of larger models or training with larger mini-batches.

1. **Shorten the training or inference time.** Execution time can be sensitive to memory or arithmetic bandwidth. Half-precision halves the number of bytes accessed, thus reducing the time spent in memory-limited layers. NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers.

*Figure 1. Training curves for the bigLSTM English language model show the benefits of the mixed-precision training techniques. The Y-axis is training loss. Mixed precision without loss scaling (grey) diverges after a while, whereas mixed precision with loss scaling (green) matches the single precision model (black).*

<img width="891" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/4ccef941-d4fb-4c3b-8f72-0e602c729743">

**Since DNN training has traditionally relied on IEEE single-precision format, this guide will focus on how to train with half precision while maintaining the network accuracy achieved with single precision (as Figure 1). This technique is called mixed-precision training since it uses both single and half-precision representations.**


#### Using Automatic Mixed Precision for Major Deep Learning Frameworks - PyTorch

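In PyTorch, automatic mixed precision is available natively through `torch.cuda.amp` (`autocast` and `GradScaler`); it was originally provided by NVIDIA's Apex repository on GitHub. To enable it, wrap the forward pass in `autocast` and route the backward pass and optimizer step through a `GradScaler` in your existing training script:

```python
from torch.cuda.amp import GradScaler, autocast

# Assumes model, optimizer, loss_fn, input and target are already defined.
scaler = GradScaler()  # manages the loss-scaling factor

optimizer.zero_grad()
with autocast():  # forward pass and loss run in mixed precision
    output = model(input)
    loss = loss_fn(output, target)

scaler.scale(loss).backward()  # backpropagate the scaled loss
scaler.step(optimizer)         # unscale gradients; skip the step if they contain inf/NaN
scaler.update()                # adjust the scale factor for the next iteration
```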


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Train your own Tokenizer</div>

## <span style="color: #7b6b59;">Introduction</span>

Large generative language models are developed mostly by working with large amounts of text data. For this reason, anyone working in this area needs solid text-processing skills. To enable AI models to learn from text data effectively, we must first preprocess the text into a format that machines can work with. Tokenization and vectorization are two of the most important steps in this procedure.

Before our data can be fed to a model, it needs to be transformed to a format the model can understand. Machine learning algorithms take numbers as inputs. This means that we will need to convert the texts into numerical vectors. There are two steps to this process:

1. **Tokenization:** Divide the texts into words or smaller sub-texts (such as subwords or phrases), which enables good generalization of the relationship between the texts and the labels. This determines the “vocabulary” of the dataset (the set of unique tokens present in the data).

1. **Vectorization:** Define a good numerical measure to characterize these texts, i.e. convert the text into numerical representations for ML models.

In summary, the typical order is tokenization first to break down the text into understandable units and then vectorization to turn those units into a numerical format suitable for machine learning models.
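The two steps can be sketched in plain Python. The two-sentence corpus and the helper `bag_of_words` are made up for illustration; whitespace splitting and word counts stand in for the more capable tokenizers and vectorizers discussed in this section.

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the mat"]   # toy corpus, made up for illustration

# 1. Tokenization: split each text on whitespace.
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of unique tokens, each mapped to an index.
unique_tokens = sorted({token for tokens in tokenized for token in tokens})
vocab = {token: idx for idx, token in enumerate(unique_tokens)}

# 2. Vectorization: count how often each vocabulary token occurs in a document.
def bag_of_words(tokens, vocab):
    counts = Counter(tokens)
    return [counts[token] for token in vocab]

vectors = [bag_of_words(tokens, vocab) for tokens in tokenized]
print(vocab)
print(vectors)
```

Each document becomes a vector with one entry per vocabulary token, so texts of different lengths end up with the same fixed-size numerical representation.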

## <span style="color: #7b6b59;">Tokenization</span>

Text tokenization is the process of reformatting a piece of text into smaller units called “tokens.” It transforms unstructured text into structured data that models can understand. The goal of tokenization is to break down text into meaningful units like words, phrases, sentences, etc. which can then be inputted into machine learning models. It’s one of the first and most important steps in natural language preprocessing, and often goes hand-in-hand with text vectorization.

Tokenization enables natural language processing tasks like part-of-speech tagging (identifying verbs vs nouns, etc.), named entity recognition (categories like person, organization, location), and relationship extraction (family relationships, professional relationships, etc.).

There are a number of different tokenization methods; some of the simpler ones include splitting text on whitespace or punctuation. Advanced techniques use language rules to identify word boundaries and tokenize text into linguistic units; this can split words into sub-word tokens (such as prefixes, or based on syllables), or even combine certain tokens into larger units based on language semantics. The goal is to produce tokens that best represent the original text for ML purposes.

## <span style="color: #7b6b59;">Vectorization</span>

Most large language models today are based on Transformers and deep learning architectures, which operate on numbers. To let them learn from text, we therefore convert the tokens to numbers, so that each token is represented by a number instead of a sequence of letters. After tokenization, the tokens are converted into a numerical format through vectorization, which represents them in a way the model can understand, often as vectors in a high-dimensional space. There are several methods of vectorization, including **Bag of Words**, **TF-IDF**, and **word embeddings** like **Word2Vec** or **GloVe**.

Text vectorization is the process of converting text into numerical representations (or “vectors”) that can be understood by ML models. It transforms unstructured text into structured numeric data with the goal to represent the semantic meaning of text in a mathematical format.

Text vectorization allows for a variety of NLP tasks like document classification (checking whether something is an email or an essay, etc.), sentiment analysis (opinions or attitudes of the text, etc.), enhancing search engines, and so on.

Common text vectorization methods include **one-hot encoding** (representing each word as a binary vector with a single 1 at that word's vocabulary index), **bag-of-words** (counting the occurrences of words within each document), and **word embeddings** (mapping words to dense vectors so as to capture meaning). ***The vector space allows words with similar meanings to have similar representations.***

## <span style="color: #7b6b59;">Tokenizers</span>

Before you can use your data in a model, the data needs to be processed into an acceptable format for the model. A model does not understand raw text, images or audio. These inputs need to be converted into numbers and assembled into tensors.

**The main tool for processing textual data is a tokenizer. A tokenizer starts by splitting text into tokens according to a set of rules. The tokens are converted into numbers, which are used to build tensors as input to a model. Any additional inputs required by a model are also added by the tokenizer.**


In this section, we take a closer look at tokenization. As we saw, tokenizing a text means splitting it into **words** or **subwords**, which are then converted to ids through a look-up table. Converting words or subwords to ids is straightforward, so in this summary we focus on splitting a text into words or subwords (i.e. tokenizing a text).
More specifically, we look at the three main types of tokenizers used in 🤗 Transformers:

1. **Byte-Pair Encoding (BPE)**
1. **WordPiece**
1. **SentencePiece**

We also show examples of which tokenizer type is used by which model.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer type was used by the pretrained model. For instance, if we look at [BertTokenizer](https://huggingface.co/docs/transformers/v4.36.1/en/model_doc/bert#transformers.BertTokenizer), we can see that the model uses WordPiece.


### <span style="color: #7b6b59;">Introduction</span>

***What is a tokenizer?***

The definition of tokenization, as given by Stanford NLP group is:

“Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation”

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.

There are different solutions available, such as **word-based** and **character-based** tokenizers, but the ones used by state-of-the-art transformer models are **sub-word tokenizers**: byte-level BPE (GPT-2), WordPiece (BERT), etc.

#### <span style="color: #7b6b59;">Space & Punctuation Tokenization</span>

Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so. For instance, let’s look at the sentence `"Don't you love 🤗 Transformers? We sure do."`

A simple way of tokenizing this text is to split it by spaces, which would give:

`["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]`

This is a sensible first step, but if we look at the tokens "Transformers?" and "do.", we notice that the punctuation is attached to the words "Transformers" and "do", which is suboptimal. We should take the punctuation into account so that a model does not have to learn a different representation of a word for every possible punctuation symbol that could follow it, which would explode the number of representations the model has to learn. Taking punctuation into account, tokenizing our example text would give:

`["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]`

Better. However, the way this tokenization handled the word "Don't" is not ideal. "Don't" stands for "do not", so it would be better tokenized as `["Do", "n't"]`. This is where things start getting complicated, and it is part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.
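The two simple strategies discussed so far can be reproduced with the standard library. The regular expression below is one possible way to separate punctuation (an assumption, not the rule set of any particular tokenizer): it matches either a run of word characters or a single non-space symbol.

```python
import re

text = "Don't you love 🤗 Transformers? We sure do."

space_tokens = text.split()                       # split on whitespace only
punct_tokens = re.findall(r"\w+|[^\w\s]", text)   # words, or single non-space symbols

print(space_tokens)
print(punct_tokens)
```

The second list separates "?" and "." from their words, but it also breaks "Don't" into three pieces, which is the limitation that rule-based tokenizers address.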

**spaCy** and **Moses** are two popular **rule-based tokenizers**. Applying them on our example, spaCy and Moses would output something like:

`["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]`

As can be seen, space and punctuation tokenization, as well as rule-based tokenization, are used here. Both are examples of word tokenization, which is loosely defined as splitting sentences into words. While it is the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). E.g., **Transformer XL uses space and punctuation tokenization**, resulting in a vocabulary size of 267,735!

Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which causes both an increased memory and time complexity. In general, **transformers models rarely have a vocabulary size greater than 50,000, especially if they are pretrained only on a single language.**

So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?

#### <span style="color: #7b6b59;">Character Tokenization</span>

While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations. E.g. learning a meaningful context-independent representation for the letter "t" is much harder than learning a context-independent representation for the word "today". Therefore, character tokenization is often accompanied by a loss of performance. **So to get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.**

### <span style="color: #7b6b59;">Subword Tokenization</span>
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful context-independent representations. In addition, subword tokenization enables the model to process words it has never seen before, by decomposing them into known subwords. For instance, the BertTokenizer tokenizes "I have a new GPU!" as follows:

`["i", "have", "a", "new", "gp", "##u", "!"]`

Because we are considering the uncased model, the sentence was lowercased first. We can see that the words `["i", "have", "a", "new"]` are present in the tokenizer’s vocabulary, but the word "gpu" is not. Consequently, the tokenizer splits "gpu" into the known subwords `["gp", "##u"]`. The "##" prefix means that the token should be attached to the previous one without a space (for decoding or reversing the tokenization).
As another example, XLNetTokenizer tokenizes our earlier example text as follows:

`["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]`

We’ll get back to the meaning of those "▁" when we look at SentencePiece. As one can see, the rare word "Transformers" has been split into the more frequent subwords "Transform" and "ers".
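To make the "gpu" example concrete, here is a minimal sketch of how a word can be split into known subwords with a greedy longest-match-first search, which is a common way to apply a WordPiece-style vocabulary. The function name and the toy vocabulary are hypothetical; BERT's real vocabulary has tens of thousands of entries.

```python
def split_into_subwords(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]   # no known subword covers this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"i", "have", "a", "new", "gp", "##u", "!"}

gpu_pieces = split_into_subwords("gpu", toy_vocab)
new_pieces = split_into_subwords("new", toy_vocab)
print(gpu_pieces)   # ['gp', '##u']
print(new_pieces)   # ['new']
```

A word that is in the vocabulary stays whole, while an unknown word is covered piece by piece with the longest entries that fit.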

Let’s now look at how the different subword tokenization algorithms work. Note that all of those tokenization algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained on.

Concepts related to BPE:


1. **Vocabulary:** A set of subword units that can be used to represent a text corpus.
1. **Byte:** A unit of digital information that typically consists of eight bits.
1. **Character:** A symbol that represents a written or printed letter or numeral.
1. **Frequency:** The number of times a byte or character occurs in a text corpus.
1. **Merge:** The process of combining two consecutive bytes or characters to create a new subword unit.

#### <span style="color: #7b6b59;">Byte-Pair Encoding (BPE)</span>

Byte-Pair Encoding (BPE) was introduced in *Neural Machine Translation of Rare Words with Subword Units* (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as space tokenization, e.g. GPT-2 and RoBERTa. More advanced pre-tokenization includes rule-based tokenization, e.g. XLM and FlauBERT, which use Moses for most languages, or GPT, which uses spaCy and ftfy. The pre-tokenizer's output is then used to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.

As an example, let’s assume that after pre-tokenization, the following set of words including their frequency has been determined:

`("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)`

Consequently, the base vocabulary is `["b", "g", "h", "n", "p", "s", "u"]`. Splitting all words into symbols of the base vocabulary, we obtain:

`("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)`

BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In the example above "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10 occurrences of "hug", 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is "u" followed by "g", occurring 10 + 5 + 5 = 20 times in total. Thus, the first merge rule the tokenizer learns is to group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes

`("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)`

BPE then identifies the next most common symbol pair. It’s "u" followed by "n", which occurs 16 times. "u", "n" is merged to "un" and added to the vocabulary. The next most frequent symbol pair is "h" followed by "ug", occurring 15 times. Again the pair is merged and "hug" can be added to the vocabulary.

At this stage, the vocabulary is `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` and our set of unique words is represented as

`("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)`

Assuming that Byte-Pair Encoding training stops at this point, the learned merge rules would then be applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized as `["b", "ug"]`, but "mug" would be tokenized as `["<unk>", "ug"]` since the symbol `"m"` is not in the base vocabulary. In general, single letters such as `"m"` are not replaced by the `"<unk>"` symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.

As mentioned earlier, the vocabulary size, i.e. the base vocabulary size + the number of merges, is a hyperparameter to choose. For instance, GPT has a vocabulary size of 40,478 since it has 478 base characters and training was stopped after 40,000 merges.

**Recap:** Steps involved in BPE:

1. Initialize the vocabulary with all the bytes or characters in the text corpus.
1. Count how often each word occurs in the text corpus, with every word split into these base symbols.
1. Repeat the following steps until the desired vocabulary size is reached:
    - Find the most frequent pair of consecutive symbols in the text corpus.
    - Merge the pair to create a new subword unit.
    - Update the frequency counts of the symbol pairs affected by the merge.
    - Add the new subword unit to the vocabulary.
1. Represent the text corpus using the subword units in the vocabulary.
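The steps above can be sketched in a few lines of pure Python. This is a toy illustration run on the word counts from the example, not the implementation used by any library; `learn_bpe` and `merge_pair` are hypothetical helper names, and ties between equally frequent pairs are simply broken by first occurrence.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} table."""
    # Every word starts as a sequence of single characters (the base vocabulary).
    splits = {word: tuple(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by the word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return vocab, merges, splits


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges, splits = learn_bpe(word_freqs, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
print(splits)
```

Recounting all pairs on every iteration keeps the sketch short; production implementations update the counts incrementally instead.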

#### <span style="color: #7b6b59;">Byte-level BPE</span>

A base vocabulary that includes all possible base characters can be quite large if e.g. all Unicode characters are considered as base characters. To have a better base vocabulary, GPT-2 uses bytes as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary. With some additional rules to deal with punctuation, GPT-2’s tokenizer can tokenize every text without the need for the `<unk>` symbol. GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 byte-level base tokens, a special end-of-text token and the symbols learned with 50,000 merges.

**Understanding Characters and Bytes:**

1. **Characters:** These are the basic units of text (like 'A', '7', '!', 'é', '中'). In human language, we see these as individual symbols or letters.
1. **Bytes:** A byte is a unit of digital information that commonly consists of eight bits. It's a fundamental concept in computer science and is used to represent data.

**Character Encoding:**

Characters are represented in computers using various encoding systems, which map characters to specific byte sequences. Two common encodings are ASCII and UTF-8:

1. **ASCII (American Standard Code for Information Interchange):** This is one of the oldest character encoding standards. It is a 7-bit code that defines 128 symbols (0-127), covering English letters, digits, punctuation, and some control characters; each symbol is normally stored in one byte (8 bits). Various 8-bit "extended ASCII" encodings use the remaining values to reach 256 symbols (0-255). The limit of 256 follows from the size of a byte: a single bit has two possible values (0 or 1), two bits give 4 possible combinations (00, 01, 10, 11), and 8 bits (1 byte) give 2^8 = 256 combinations, ranging from 0 to 255.

1. **UTF-8 (8-bit Unicode Transformation Format):** This is a more modern and versatile encoding standard capable of representing a vast array of characters from virtually all written languages. UTF-8 is backward compatible with ASCII but can use one to four bytes per character, allowing it to cover much more than the basic ASCII set.
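The difference between characters and bytes is easy to inspect with Python's built-in `str.encode`. The sample characters below are the ones listed earlier plus one emoji; they are written as escapes so the snippet does not depend on the file's own encoding.

```python
# 'A', '7', '!', 'é', '中', '😀'
samples = ["A", "7", "!", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: code point U+{ord(ch):04X} -> {len(encoded)} byte(s) {list(encoded)}")

# Number of UTF-8 bytes needed for each sample character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 1, 1, 2, 3, 4]

# UTF-8 is backward compatible with ASCII: ASCII characters encode to the same single byte.
print("A".encode("ascii") == "A".encode("utf-8"))  # True

# Characters outside ASCII cannot be encoded with it at all.
try:
    "\u00e9".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
print(ascii_can_encode)  # False
```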


For the GPT models, OpenAI uses a method known as byte-level byte pair encoding: instead of alphabets or ASCII characters, the base vocabulary is defined in bytes. Since every character in any encoding on a computer is made of bytes, and the base vocabulary contains every possible byte value, the tokenizer never runs into an unknown token.

**Byte-Level:** 

Instead of starting with a vocabulary of words or characters (like alphabets or ASCII characters), byte-level BPE operates on bytes, which are essentially the smallest addressable group of bits in a computer (usually 8 bits). This means the base vocabulary consists of all 256 possible byte values (from 0 to 255).
In traditional BPE or other tokenization methods that start with characters, the process involves looking at the text's character-level representation. For example, the word "hello" would be considered as 'h', 'e', 'l', 'l', 'o' – five separate characters.

In Byte-Level BPE, instead of looking at characters, we consider the byte representation of the text. This approach doesn't start with an understanding of "characters" per se but with the bytes that encode these characters. Here's why it's significant:


**Why Byte-Level?:**

1. **All-Inclusive:** Since every character (no matter the language or symbol) can be broken down into bytes, starting with bytes ensures that the vocabulary can represent any text without missing symbols or needing placeholders for unknowns.

1. **Simplifies Vocabulary:** Instead of potentially needing thousands of character tokens to cover various languages and symbols, Byte-Level BPE only needs 256 base tokens, corresponding to all possible values of a byte (0-255). This drastically simplifies the model's vocabulary.

1. **Handles Varied Text:** By using bytes, the tokenizer can handle texts in ASCII (like English text) and texts in more complex encodings like UTF-8 (which can represent virtually all human languages) without needing separate mechanisms or special handling for different languages or symbol sets.

1. **Universality:** Bytes are the fundamental building blocks of digital data. By using bytes, the model can represent any character in any language or even other forms of data like emojis or special symbols without being restricted to a specific character set. This universality means that it can process text in virtually any language or symbol system.

1. **No Unknown Tokens:** Traditional tokenizers might encounter characters or words they have never seen before (out-of-vocabulary words), leading to the use of a special "unknown" token. Byte-level BPE virtually eliminates this problem because every piece of text can be broken down into bytes, which are always within the model's vocabulary. Thus, the tokenizer is capable of handling any text input without encountering unknown tokens.
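As a rough illustration of the points above, the sketch below contrasts a small character-level vocabulary with a byte-level one. The helper names are hypothetical, and no merges are applied: it only shows the base tokens each approach starts from.

```python
def char_tokenize(text, vocab, unk="<unk>"):
    """Character-level tokenization: anything outside the vocabulary becomes <unk>."""
    return [ch if ch in vocab else unk for ch in text]


def to_byte_ids(text):
    """Byte-level tokenization: one base token ID in 0..255 per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Invert to_byte_ids by decoding the bytes back into text."""
    return bytes(ids).decode("utf-8")


letters_only = set("abcdefghijklmnopqrstuvwxyz ")
text = "hello \u4e2d\u6587 \U0001F600"  # "hello 中文 😀"

char_tokens = char_tokenize(text, letters_only)
print(char_tokens)  # the two Chinese characters and the emoji become '<unk>'

ids = to_byte_ids(text)
print(ids)                        # every ID is in 0..255, so no <unk> is needed
print(from_byte_ids(ids) == text) # True: the original text is fully recoverable
```

The price of this universality is length: the 10-character string above becomes 17 byte-level base tokens before any merges are applied.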


So, when we say a "character is a byte" in the context of Byte-Level BPE, it's a bit of a simplification. A more accurate statement would be: "All characters can be represented as sequences of bytes, and Byte-Level BPE uses these byte sequences as the foundational elements of its vocabulary." This means each character in text is represented by one or more bytes, depending on its encoding, and these bytes are the building blocks for the tokenizer's vocabulary and subsequent text processing. In summary, byte-level BPE is a way of preparing text for machine learning models like GPT that is both highly versatile and capable of handling a wide variety of languages and symbols without running into the issue of unknown tokens. It's a foundational aspect of how these models process and understand the text data they're trained on and generate.


Byte-Level Byte Pair Encoding (BPE) is a tokenization method that builds upon the standard BPE algorithm by using bytes as the fundamental unit for its vocabulary. This approach, as used in models like GPT-2, is particularly effective and efficient for several reasons. Here's a more detailed explanation of how it works and its benefits:

**Base Vocabulary:**

1. **Standard BPE:** Traditional Byte Pair Encoding starts with a base vocabulary of all unique characters (or tokens) in the training corpus and iteratively combines the most frequent pair of tokens to create new, longer tokens. This process continues for a number of merges, determined beforehand.

1. **Byte-Level BPE:** Instead of starting with characters, Byte-Level BPE uses each possible byte value (256 in total) as the base vocabulary. This automatically covers every ASCII character and extends to any byte value that might represent part of a character in more extensive encoding systems like UTF-8.

***Advantages:***

1. **Compact and Comprehensive Base Vocabulary:** By using bytes, the base vocabulary is limited to 256 tokens (since there are 256 possible byte values), which is more compact compared to potentially thousands of Unicode characters. Yet, it's comprehensive enough to represent any text because all text can be broken down into bytes.

1. **Eliminating `<unk>` Tokens:** Traditional tokenizers might encounter unknown characters or words not present in the vocabulary, often represented by an `<unk>` (unknown) token. Since Byte-Level BPE can tokenize any text into bytes (and subsequently into byte-level tokens), it theoretically doesn't need an `<unk>` symbol, as every possible byte can be represented in its vocabulary.

1. **Handling Diverse Scripts and Symbols:** With the ability to represent any character as a series of bytes, Byte-Level BPE is naturally equipped to handle text in multiple languages, including those with large character sets or special symbols, without needing separate models or token sets for different languages.

**GPT-2's Vocabulary:**
In the case of GPT-2:

- **256 Base Tokens:** Corresponding to all possible byte values.
- **Special End-of-Text Token:** Used to signify the end of a text.
- **50,000 Merges:** The tokenizer iteratively combines frequent pairs of these byte-level tokens to form higher-level tokens, up to 50,000 merges. These merges are learned from the training corpus and represent common words, subwords, or sequences of characters that appear frequently together.

The resulting vocabulary size is 50,257 (256 base tokens + 1 special token + 50,000 merged tokens), which provides a good balance between granularity and coverage. This means GPT-2's tokenizer is capable of handling a wide variety of texts, from different languages and domains, without a substantial increase in vocabulary size, making it efficient and powerful for language understanding and generation tasks.
    
#### <span style="color: #7b6b59;">WordPiece</span>

WordPiece is the subword tokenization algorithm used for BERT, DistilBERT, and Electra. The algorithm was outlined in Japanese and Korean Voice Search (Schuster et al., 2012) and is very similar to BPE. WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by its second symbol is the greatest among all symbol pairs. E.g. "u", followed by "g" would have only been merged if the probability of "ug" divided by "u", "g" would have been greater than for any other symbol pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.
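One common way to make this criterion concrete is to score each pair as `count(pair) / (count(first) * count(second))`, which ranks pairs the same way as the probability ratio described above. The sketch below applies that score to the word counts from the BPE example. It is an illustration under that assumption, it omits the `##` continuation prefix real WordPiece vocabularies use, and it is not the original WordPiece training code.

```python
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in word_freqs.items():
    symbols = list(word)
    for symbol in symbols:
        symbol_counts[symbol] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

# WordPiece-style score: pair frequency normalised by the frequencies of its parts.
scores = {
    pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
    for pair, count in pair_counts.items()
}

bpe_choice = max(pair_counts, key=pair_counts.get)
wordpiece_choice = max(scores, key=scores.get)

print("BPE merges first:      ", bpe_choice)        # ('u', 'g'), the most frequent pair
print("WordPiece merges first:", wordpiece_choice)  # ('g', 's'), the highest score
```

Every pair containing `"u"` is penalised because `"u"` appears in all words, so the pair `("g", "s")` wins despite being far less frequent than `("u", "g")`.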


#### <span style="color: #7b6b59;">SentencePiece</span>

All tokenization algorithms described so far have the same problem: it is assumed that the input text uses spaces to separate words. However, not all languages use spaces to separate words. One possible solution is to use language-specific pre-tokenizers (e.g. XLM uses specific Chinese, Japanese, and Thai pre-tokenizers). To solve this problem more generally, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.

The XLNetTokenizer uses SentencePiece for example, which is also why in the example earlier the "▁" character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be concatenated and "▁" is replaced by a space.
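The decoding rule described above fits in a couple of lines. This is a hand-written sketch with made-up pieces, not a call into the `sentencepiece` package; stripping the leading space is an assumption that mirrors the marker usually placed before the first word.

```python
WORD_BOUNDARY = "\u2581"  # "▁", the marker SentencePiece uses in place of a space


def mark_spaces(text):
    """Make whitespace an ordinary symbol by replacing it with the marker."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)


def decode(pieces):
    """Concatenate the pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").lstrip(" ")


print(mark_spaces("Hello world"))                   # ▁Hello▁world
print(decode(["\u2581Hello", "\u2581wor", "ld"]))   # Hello world
```

Because the space is part of the token stream, decoding needs no language-specific rules about where to re-insert whitespace.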

All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models using SentencePiece are ALBERT, XLNet, Marian, and T5.



## <span style="color: #7b6b59;">HuggingFace Tokenizers: `tokenizers` Library</span>

### <span style="color: #7b6b59;">Introduction</span>

Fast State-of-the-art [tokenizers](https://huggingface.co/docs/tokenizers/index), optimized for both research and production

[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an implementation of today’s most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).

**Main features:**

- Train new vocabularies and tokenize, using today’s most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it’s always possible to get the part of the original sentence that corresponds to any token.
- Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.

### <span style="color: #7b6b59;">The tokenization pipeline</span>

In this section, we will try to understand the HuggingFace tokenizers in depth and will go through all the parameters and also the outputs returned by a tokenizer. We’ll dive into the AutoTokenizer class and see how to use a pre-trained tokenizer for our data.

So, let’s get started!

Hugging Face is a New York based company that has swiftly developed language processing expertise. The company’s aim is to advance NLP and democratize it for use by practitioners and researchers around the world.

In an effort to offer access to fast, state-of-the-art, and easy-to-use tokenization that plays well with modern NLP pipelines, Hugging Face contributors have developed and open-sourced Tokenizers. Tokenizers is, as the name implies, an implementation of today’s most widely used tokenizers with emphasis on performance and versatility.

An implementation of a tokenizer consists of the following pipeline of processes, each applying different transformations to the textual information. When calling `Tokenizer.encode` or `Tokenizer.encode_batch`, the input text(s) go through the following pipeline:

- normalization
- pre-tokenization
- model
- post-processing

We’ll see in detail what happens during each of those steps, as well as what happens when you want to decode some token IDs, and how the 🤗 Tokenizers library allows you to customize each of those steps to your needs.

Let’s go through these steps:

<img width="935" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/57dbec9a-de4a-4bed-b4a0-491639298f65">

1. **Normalization:** The [normalization step](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers) involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents. In a nutshell, it is a set of operations applied to a raw string to make it less random or “cleaner”. Unicode normalization (such as NFC or NFKC) is also a very common operation applied by most tokenizers. For example, given the input `"Héllò hôw are yoü?"`, a normalizer that lowercases and strips accents would produce `"hello how are you?"`. Each normalization operation is represented in the 🤗 Tokenizers library by a `Normalizer`, and you can combine several of those by using a `normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization and removing accents as an example:

    ```python
    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents

    normalizer = normalizers.Sequence([NFD(), StripAccents()])
    # You can manually test that normalizer by applying it to any string:
    normalizer.normalize_str("Héllò hôw are ü?")
    ```
    When building a Tokenizer, you can customize its normalizer by just changing the corresponding attribute: `tokenizer.normalizer = normalizer`. Of course, if you change the way a tokenizer applies normalization, you should probably retrain it from scratch afterward.

1. **Pre-tokenization:** A tokenizer cannot be trained on raw text alone. Instead, we first need to split the text into small entities, like words. That’s where the pre-tokenization step comes in: it splits a text into smaller objects that give an upper bound to what your tokens will be at the end of training. A good way to think of this is that the pre-tokenizer splits your text into “words”, and your final tokens will be parts of those words. An easy way to pre-tokenize inputs is to split on spaces and punctuation, which is done by the `pre_tokenizers.Whitespace` pre-tokenizer. Given the string `"hello, how are you?"`, its output will be something like: `[('hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19))]`. The output is a list of tuples, each containing one word and its span in the original sentence (which is used to determine the final offsets of the `Encoding`). Note that splitting on punctuation also splits contractions such as "I'm" into `"I"`, `"'"`, and `"m"`. The rules for [pre-tokenization](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers) vary with the tokenizer being used; for instance, BERT has a different set of rules for this step than GPT-2. Of course, if you change the pre-tokenizer, you should probably retrain your tokenizer from scratch afterward. You can also combine several `PreTokenizer` objects. For instance, here is a pre-tokenizer that splits on spaces, punctuation, and digits, separating numbers into their individual digits:
    `pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])`

1. **Modeling:** Once the input texts are normalized and pre-tokenized, the Tokenizer applies [the model](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models) to the pre-tokens. ***This is the part of the pipeline that needs training on your corpus (or that has been trained if you are using a pretrained tokenizer).*** ***The role of the model is to split your “words” into tokens, using the rules it has learned.*** It’s also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model. The output of this step depends on the tokenization algorithm used. State-of-the-art models use subword tokenization algorithms: for example, BERT uses WordPiece, GPT and GPT-2 use BPE, and ALBERT uses Unigram. A BERT tokenizer will tokenize the sentence like this: `["hello", ",", "how", "are", "you", "?"]`. The model is passed along when initializing the Tokenizer, so you already know how to customize this part. Currently, the 🤗 Tokenizers library supports:
    - models.BPE
    - models.Unigram
    - models.WordLevel
    - models.WordPiece

1. **Post-processing:** Post-processing is the last step of the tokenization pipeline; it performs any additional transformation on the Encoding before it’s returned, like adding potential special tokens. Similar to the modeling part, a number of post-processors are available depending on the model being used. Applying a BERT post-processor to our sequence will result in: `["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]`. Here, `[CLS]` is the classification token placed at the start of the input (its final representation is typically used for classification tasks), and `[SEP]` marks the end of a sentence and is also used to separate two sentences.
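The four stages can be mimicked end to end with the standard library alone. The sketch below is a simplified stand-in for the pipeline, not the 🤗 Tokenizers implementation: the vocabulary is made up, the model stage is a greedy longest-match WordPiece-style lookup, and all function names are hypothetical.

```python
import re
import unicodedata

# Made-up vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {tok: i for i, tok in enumerate(
    ["[UNK]", "[CLS]", "[SEP]", "hello", ",", "how", "are", "you", "?",
     "token", "##izer", "##s"])}


def normalize(text):
    """Normalization: NFD-decompose, drop combining accents, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn").lower()


def pre_tokenize(text):
    """Pre-tokenization: split on whitespace and punctuation, keeping offsets."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\w+|[^\w\s]", text)]


def wordpiece(word, vocab, unk="[UNK]"):
    """Model: greedy longest-match-first split of one pre-token into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def post_process(tokens):
    """Post-processing: add the special tokens a BERT-style model expects."""
    return ["[CLS]"] + tokens + ["[SEP]"]


def encode(text, vocab=TOY_VOCAB):
    normalized = normalize(text)
    words = pre_tokenize(normalized)
    tokens = [piece for word, _ in words for piece in wordpiece(word, vocab)]
    tokens = post_process(tokens)
    return tokens, [vocab[t] for t in tokens]


print(normalize("Héllò, hôw are yoü?"))
print(pre_tokenize("hello, how are you?"))
print(encode("Héllò, hôw are yoü?"))
print(encode("Tokenizers"))
```

Each function corresponds to one stage, so swapping a single stage (for example, a different normalizer) leaves the others untouched, which is the same separation the library exposes through its attributes.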

Subword tokenization methods, such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece, need to be trained on a specific corpus to learn an efficient and effective way of breaking down words into smaller units (subwords). The training process allows the tokenizer to adapt to the particularities of the text it will be processing.

**The Training Process:**

During training, a subword tokenizer typically starts with a large corpus of text and performs the following:

1. **Initial Vocabulary Creation:** It creates an initial vocabulary, often at the character level or using a simple character or word frequency threshold.

1. **Merging Rules Learning:** It iteratively finds the most frequent pairs of characters or subwords and merges them to form a new, longer subword. This process repeats until a set number of merges is reached or the desired vocabulary size is achieved.

1. **Final Vocabulary Compilation:** The final vocabulary consists of the original characters plus all the merged subwords, ensuring that any word can be tokenized using this set.

In essence, the training of a subword tokenizer is about learning the most efficient and effective way to break down and represent the text it will encounter, taking into account frequency, morphology, and the specific needs of the task or language. This process results in a tokenizer that can handle a wide variety of text inputs, generalize well to new text, and efficiently interface with downstream language models or other NLP tools.


### <span style="color: #7b6b59;">Build a tokenizer from scratch</span>

To illustrate how fast the 🤗 Tokenizers library is, let’s train a new tokenizer on wikitext-103 (516M of text) in just a few seconds. In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. Here, training the tokenizer means it will learn merge rules by:

1. Starting with all the characters present in the training corpus as tokens.
1. Identifying the most common pair of tokens and merging it into one token.
1. Repeating until the vocabulary (i.e., the number of tokens) has reached the size we want.

The main API of the library is the class `Tokenizer`. Here is how we instantiate one with a BPE model:

In [59]:
from tokenizers import Tokenizer, normalizers
from tokenizers.normalizers import NFD, StripAccents, NFC, Lowercase
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

LOWERCASE = False


In [60]:
# The main API of the library is the class Tokenizer, here is how we instantiate one with a BPE model:
# Creating Byte-Pair Encoding tokenizer
# we instantiate a new Tokenizer with this model - BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))


In [61]:
normalizer = normalizers.Sequence([NFD(), StripAccents()])
# You can manually test that normalizer by applying it to any string:
print(normalizer.normalize_str("Héllò hôw are ü?"))

# Parenthesize the conditional so that NFC is always applied and only Lowercase is optional.
normalizer = normalizers.Sequence([NFC()] + ([Lowercase()] if LOWERCASE else []))
print(normalizer.normalize_str("Héllò hôw are ü?"))
tokenizer.normalizer = normalizer

Hello how are u?
Héllò hôw are ü?


In [62]:
pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you."))
pre_tokenizer = ByteLevel()
print(pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you."))

# We could train our tokenizer right now, but it wouldn’t be optimal.
# Without a pre-tokenizer that splits our inputs into words, we might get tokens that overlap several words:
# for instance, we could get an "it is" token since those two words often appear next to each other.
# Using a pre-tokenizer ensures no token is bigger than a word returned by the pre-tokenizer.
# Whitespace is the simplest option; the ByteLevel pre-tokenizer assigned last above is the one actually used here.
# You can customize the pre-tokenizer of a Tokenizer by just changing the corresponding attribute:
tokenizer.pre_tokenizer = pre_tokenizer

[('Hello', (0, 5)), ('!', (5, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19)), ('I', (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ('fine', (24, 28)), (',', (28, 29)), ('thank', (30, 35)), ('you', (36, 39)), ('.', (39, 40))]
[('ĠHello', (0, 5)), ('!', (5, 6)), ('ĠHow', (6, 10)), ('Ġare', (10, 14)), ('Ġyou', (14, 18)), ('?', (18, 19)), ('ĠI', (19, 21)), ("'m", (21, 23)), ('Ġfine', (23, 28)), (',', (28, 29)), ('Ġthank', (29, 35)), ('Ġyou', (35, 39)), ('.', (39, 40))]

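The `Ġ` prefix in the `ByteLevel` output above comes from the way byte-level tokenizers display bytes. GPT-2's reference tokenizer maps each of the 256 byte values to a printable Unicode character so that spaces and control bytes remain visible inside tokens. The sketch below re-creates such a table in pure Python; it is a reconstruction for illustration, and the assumption that `ByteLevel` uses the same table is based only on the output shown above.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable, non-space characters keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Every remaining byte (space, control bytes, ...) is shifted past code point 255.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


def render(text, mapping):
    """Show how a string looks once each of its UTF-8 bytes is mapped."""
    return "".join(mapping[b] for b in text.encode("utf-8"))


byte_map = bytes_to_unicode()
print(byte_map[ord(" ")])           # Ġ : the space byte
print(render(" Hello", byte_map))   # ĠHello
```

Reading `Ġ` as "this token starts with a space" is usually enough to interpret byte-level token listings.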

In [63]:

# To train our tokenizer on the wikitext files, we need to instantiate a trainer, in this case a BpeTrainer.
# We can set training arguments like vocab_size or min_frequency,
# but the most important part is to give the special_tokens we plan to use later on (they are not used at all during training) so that they get inserted in the vocabulary.

# Adding special tokens and creating trainer instance
# The order in which you write the special tokens list matters: here "[UNK]" will get the ID 0, "[PAD]" will get the ID 1 and so forth.

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
VOCAB_SIZE = 30522
trainer = BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)

# Now, we can just call the Tokenizer.train method with any list of files we want to use:

#files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
#tokenizer.train(files, trainer)


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">References</div>


1. [Text Tokenization and Vectorization in NLP](https://medium.com/@WojtekFulmyk/text-tokenization-and-vectorization-in-nlp-ac5e3eb35b85)
1. [Developing LLMs for Generative AI Tokenization and Vectorization](https://www.linkedin.com/pulse/developing-llms-generative-ai-tokenization-darko-medin/)
1. [Google Machine Learning Guide](https://developers.google.com/machine-learning/guides/text-classification/step-3#:~:text=Tokenization%3A%20Divide%20the%20texts%20into,measure%20to%20characterize%20these%20texts.)
1. [Hugging Face: Understanding tokenizers](https://medium.com/@awaldeep/hugging-face-understanding-tokenizers-1b7e4afdb154)
1. [How to use [HuggingFace’s] Transformers Pre-Trained tokenizers?](https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa)
1. [Byte-Pair Encoding (BPE) in NLP](https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/)
1. https://neptune.ai/blog/vectorization-techniques-in-nlp-guide
1. [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary)
1. [Build a tokenizer from scratch](https://huggingface.co/docs/tokenizers/quicktour)
1. https://huggingface.co/blog/how-to-train
1. [HuggingFace Tokenizers](https://huggingface.co/docs/tokenizers/index)
1. [Adding Custom Layers on Top of a Hugging Face Model](https://towardsdatascience.com/adding-custom-layers-on-top-of-a-hugging-face-model-f1ccdfc257bd)
1. [Add dense layer on top of Huggingface BERT model](https://stackoverflow.com/questions/64156202/add-dense-layer-on-top-of-huggingface-bert-model)
1. [FINE-TUNING PRE-TRAINED MODELS FOR GENERATIVE AI APPLICATIONS](https://www.leewayhertz.com/fine-tuning-pre-trained-models/)
1. [Fine-Tuning the Model: What, Why, and How](https://medium.com/@amanatulla1606/fine-tuning-the-model-what-why-and-how-e7fa52bc8ddf)
1. https://rumn.medium.com/part-1-ultimate-guide-to-fine-tuning-in-pytorch-pre-trained-model-and-its-configuration-8990194b71e
1. https://towardsdatascience.com/fine-tuning-pretrained-nlp-models-with-huggingfaces-trainer-6326a4456e7b
1. https://medium.com/@alexmriggio/bert-for-sequence-classification-from-scratch-code-and-theory-fb88053800fa
1. https://mccormickml.com/2019/07/22/BERT-fine-tuning/#4-train-our-classification-model
1. https://huggingface.co/transformers/v2.2.0/model_doc/bert.html
1. [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/training#train-with-pytorch-trainer)
1. [What’s in the Dataset object](https://huggingface.co/docs/datasets/v1.2.1/exploring.html)
1. [Loading a Dataset](https://huggingface.co/docs/datasets/v1.1.1/loading_datasets.html)
1. [The Dataset object](https://huggingface.co/docs/datasets/v2.2.1/en/access)
1. [Create a dataset](https://huggingface.co/docs/datasets/create_dataset)
1. [BertForSequenceClassification source code](https://huggingface.co/transformers/v3.0.2/_modules/transformers/modeling_bert.html#BertForSequenceClassification)
1. [7 Text Classification Techniques for Any Scenario](https://blog.dataiku.com/7-text-classification-techniques-for-any-scenario#:~:text=A%20simple%20approach%20for%20text,regression%20or%20tree%2Dbased%20models.)
1. [TF-IDF Simplified](https://towardsdatascience.com/tf-idf-simplified-aba19d5f5530)
1. [Understanding TF-IDF for Machine Learning](https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/)
1. [Understanding TF-IDF in NLP: A Comprehensive Guide](https://medium.com/@er.iit.pradeep09/understanding-tf-idf-in-nlp-a-comprehensive-guide-26707db0cec5)
1. [TF-IDF Guide: Using scikit-learn for TF-IDF implementation](https://www.capitalone.com/tech/machine-learning/scikit-tfidf-implementation/)
1. [Creating BERT Embeddings with Hugging Face Transformers](https://www.analyticsvidhya.com/blog/2023/08/bert-embeddings/)
1. [How to use embeddings for feature extraction?](https://medium.com/mlearning-ai/how-to-use-embeddings-for-feature-extraction-4956db52b5f5)
1. [Feedback Prize - English Language Learning](https://www.kaggle.com/competitions/feedback-prize-english-language-learning/code?competitionId=38321&sortBy=voteCount)
1. [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer)
1. [The tokenization pipeline](https://huggingface.co/docs/tokenizers/pipeline)
1. [Preprocess](https://huggingface.co/docs/transformers/preprocessing)
1. [How to use BERT from the Hugging Face transformer library](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209)
1. [Neural Networks: Pooling Layers](https://www.baeldung.com/cs/neural-networks-pooling-layers)
1. [Understanding Pooling in Transformer Architecture, Aggregating Outputs for Downstream Tasks](https://www.datasciencebyexample.com/2023/04/30/what-is-pooling-in-transformer-model/)
1. https://huggingface.co/docs/transformers/main_classes/output
1. https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/output#transformers.utils.ModelOutput
1. [Deep learning basics — weight decay](https://medium.com/analytics-vidhya/deep-learning-basics-weight-decay-3c68eb4344e9)
1. [How do you compare weight decay with other regularization methods for neural networks?](https://www.linkedin.com/advice/3/how-do-you-compare-weight-decay-other#:~:text=Weight%20decay%20is%20a%20form,them%20from%20growing%20too%20large.)
1. [Zero-Weight Decay on BatchNorm and Bias](https://deci.ai/deep-learning-glossary/zero-weight-decay-on-batchnorm-and-bias/)
1. [Various Optimization Algorithms For Training Neural Network](https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6)
1. [Optimizers in Deep Learning](https://medium.com/mlearning-ai/optimizers-in-deep-learning-7bf81fed78a0)
1. [DATASETS & DATALOADERS](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
1. [An Introduction to Datasets and DataLoader in PyTorch](https://wandb.ai/sauravmaheshkar/Dataset-DataLoader/reports/An-Introduction-to-Datasets-and-DataLoader-in-PyTorch--VmlldzoxMDI5MTY2)
1. [PyTorch DataLoader: Features, Benefits, and How to Use it](https://saturncloud.io/blog/pytorch-dataloader-features-benefits-and-how-to-use-it/#:~:text=The%20basic%20architecture%20of%20PyTorch%20DataLoader&text=The%20DataLoader%20class%20takes%20in,of%20data%20loading%20and%20preprocessing.)
1. [A detailed example of how to generate your data in parallel with PyTorch](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel)
1. [PyTorch DataLoader: A Complete Guide](https://datagy.io/pytorch-dataloader/)
1. [How does DataLoader work in PyTorch?](https://medium.com/noumena/how-does-dataloader-work-in-pytorch-8c363a8ee6c1)
1. [How to use Datasets and DataLoader in PyTorch for custom text data](https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00)
1. [Playing with PyTorch and Datasets](https://fede-bianchi.medium.com/playing-with-pytorch-and-datasets-fe64f5590f2)
1. [Effective Data Handling with Custom PyTorch Dataset Classes](https://dantokeefe.medium.com/effective-data-handling-with-custom-pytorch-dataset-classes-b141bcb87b41)
1. [Training with PyTorch](https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#:~:text=The%20Dataset%20and%20DataLoader%20classes,processing%20single%20instances%20of%20data.)
1. [Training a PyTorch Model with DataLoader and Dataset](https://machinelearningmastery.com/training-a-pytorch-model-with-dataloader-and-dataset/)
1. [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)
1. [Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision)


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">QA</div>


1. Do I need to tokenize the text before TF-IDF? If not, what default tokenization takes place during vectorization? How are the words split?
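
One way to explore this question is to inspect the analyzer that scikit-learn's `TfidfVectorizer` builds when no custom tokenizer is passed. The sketch below assumes the default settings (`lowercase=True`, `token_pattern=r"(?u)\b\w\w+\b"`, no stop words, unigrams only); the sample sentence is made up for illustration and is not from the dataset used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample text, chosen to show punctuation and single-character tokens.
sample = "Hello, World! I am a B-52 pilot."

# With default arguments the vectorizer lowercases the text and keeps only
# runs of two or more word characters; punctuation acts as a separator.
vectorizer = TfidfVectorizer()
analyzer = vectorizer.build_analyzer()

tokens = analyzer(sample)
print(tokens)  # ['hello', 'world', 'am', '52', 'pilot']

# Fitting on raw strings works directly, so no separate tokenization step is required.
vectorizer.fit([sample])
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['52', 'am', 'hello', 'pilot', 'world']
```

Note that single-character tokens such as "I", "a", and the "B" in "B-52" are dropped by the default pattern. Passing a custom `tokenizer` or `token_pattern` changes this behavior.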