# Cheatsheet
## Tokenization 

### NLTK Tokenization
NLTK is a Python library used in natural language processing (NLP) for tasks such as tokenization and text processing. The code example shows how you can tokenize text using the NLTK word-based tokenizer.

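import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; download it once if it is missing
# (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt", quiet=True)

text = "Unicorns are real. I saw a unicorn yesterday. I couldn't see it today."
tokens = word_tokenize(text)
print(tokens)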

['Unicorns', 'are', 'real', '.', 'I', 'saw', 'a', 'unicorn', 'yesterday', '.', 'I', 'could', "n't", 'see', 'it', 'today', '.']
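
To make the word-based idea concrete, here is a minimal regex sketch that reproduces the output above for this sentence. It is not NLTK's implementation: the function name `simple_word_tokenize` and its two rules (split off "n't", separate punctuation) are assumptions chosen for illustration.

```python
import re

def simple_word_tokenize(text):
    # Rule 1 (assumed for this sketch): split the contraction "n't" off the preceding word
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Rule 2: a token is "n't", a run of word characters, or a single punctuation mark
    return re.findall(r"n't|\w+|[^\w\s]", text)

text = "Unicorns are real. I saw a unicorn yesterday. I couldn't see it today."
print(simple_word_tokenize(text))
```

A real tokenizer handles many more cases (abbreviations, quotes, other contractions), which is why a library is usually the better choice.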


### spaCy Tokenization
spaCy is an open-source library used in NLP. It provides tools for tasks such as tokenization and word embeddings. The code example shows how you can tokenize text using the spaCy word-based tokenizer.

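import spacy

text = "Unicorns are real. I saw a unicorn yesterday. I couldn't see it today."
# Loads the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Runs the pipeline; the resulting Doc is a sequence of Token objects
doc = nlp(text)
# Collects the text of each token
token_list = [token.text for token in doc]
print("Tokens:", token_list)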

Tokens: ['Unicorns', 'are', 'real', '.', 'I', 'saw', 'a', 'unicorn', 'yesterday', '.', 'I', 'could', "n't", 'see', 'it', 'today', '.']


### BertTokenizer
BertTokenizer is a subword-based tokenizer that uses the WordPiece algorithm. The code example shows how you can tokenize text using BertTokenizer.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("IBM taught me tokenization.")

['ibm', 'taught', 'me', 'token', '##ization', '.']
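
The `##` prefix in the output marks a piece that continues a word. A rough sketch of how WordPiece splits a single word at inference time is greedy longest-match-first lookup against a vocabulary. The vocabulary below is a tiny hypothetical one, and `wordpiece_tokenize` is an illustrative helper, not the `transformers` implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece-style subwords (greedy, longest match first)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink it until it is in the vocab
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"ibm", "taught", "me", "token", "##ization", "##ize", "."}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```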

### XLNetTokenizer 
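XLNetTokenizer is a subword-based tokenizer that uses SentencePiece with the Unigram algorithm. SentencePiece treats whitespace as part of the text and marks the start of each word with the ▁ symbol. The code example shows how you can tokenize text using XLNetTokenizer.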

from transformers import XLNetTokenizer
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer.tokenize("IBM taught me tokenization.")

['▁IBM', '▁taught', '▁me', '▁token', 'ization', '.']
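
Under a Unigram model, each subword has a probability, and the tokenizer picks the segmentation whose pieces have the highest combined probability. The sketch below finds that segmentation with dynamic programming. The probabilities are made up for illustration, and `unigram_segment` is a hypothetical helper, not the SentencePiece API.

```python
import math

def unigram_segment(text, log_probs):
    """Return the most probable segmentation of text into pieces found in log_probs."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of the last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk back from the end to recover the chosen pieces
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Made-up probabilities, for illustration only
toy_probs = {"▁token": 0.1, "ization": 0.05, "▁": 0.01, "token": 0.02, "iz": 0.01, "ation": 0.02}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(unigram_segment("▁tokenization", toy_log_probs))  # ['▁token', 'ization']
```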

### torchtext
The torchtext library is part of the PyTorch ecosystem and provides the tools and functionalities required for NLP. The code example shows how you can use torchtext to generate tokens and convert them to indices.

from torchtext.vocab import build_vocab_from_iterator
# Defines a dataset
dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1,"NLP Named Entity,Sentiment Analysis, Machine Translation"),
    (1,"Machine Translation with NLP"),
    (1,"Named Entity vs Sentiment Analysis NLP")]
# Applies the tokenizer to the text to get the tokens as a list
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("basic_english")
tokenizer(dataset[0][1])
# Takes a data iterator as input, processes text from the iterator, 
# and yields the tokenized output individually
def yield_tokens(data_iter):
    for _,text in data_iter:
        yield tokenizer(text)
# Creates an iterator
my_iterator = yield_tokens(dataset)
# Fetches the next set of tokens from the data set
next(my_iterator)
# Converts tokens to indices and sets <unk> as the 
# default word if a word is not found in the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
# Gives a dictionary that maps words to their corresponding numerical indices
vocab.get_stoi()

{'vs': 21,
 'to': 19,
 'recognition': 16,
 'introduction': 14,
 'for': 13,
 'basics': 11,
 'with': 9,
 'translation': 8,
 'sentiment': 7,
 'classification': 12,
 'nlp': 1,
 ',': 10,
 'named': 6,
 'using': 20,
 'machine': 5,
 'text': 18,
 'entity': 4,
 'techniques': 17,
 '<unk>': 0,
 'of': 15,
 'analysis': 3,
 'pytorch': 2}
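
The indices in the output above are consistent with a simple rule: special tokens first, then tokens by descending frequency, with ties broken alphabetically. The plain-Python sketch below applies that rule to the same sentences. `basic_tokenize` is a regex stand-in for the `basic_english` tokenizer and `build_stoi` is a hypothetical helper; neither is torchtext code.

```python
import re
from collections import Counter

def basic_tokenize(text):
    # Stand-in tokenizer (assumption): lowercase, then split into words and punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_stoi(token_lists, specials=("<unk>",)):
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent tokens first; ties are broken alphabetically
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [token for token, _ in ordered if token not in specials]
    return {token: index for index, token in enumerate(itos)}

texts = [
    "Introduction to NLP",
    "Basics of PyTorch",
    "NLP Techniques for Text Classification",
    "Named Entity Recognition with PyTorch",
    "Sentiment Analysis using PyTorch",
    "Machine Translation with PyTorch",
    "NLP Named Entity,Sentiment Analysis, Machine Translation",
    "Machine Translation with NLP",
    "Named Entity vs Sentiment Analysis NLP",
]

stoi = build_stoi(basic_tokenize(text) for text in texts)
print(stoi["<unk>"], stoi["nlp"], stoi["pytorch"], stoi[","], stoi["vs"])  # 0 1 2 10 21
```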

## Vocabulary 
### vocab
The vocab object is part of the PyTorch torchtext library. It maps tokens to indices. The code example shows how you can apply the vocab object to tokens directly.

# Takes an iterator as input and extracts the next tokenized sentence. 
# Creates a list of token indices using the vocab dictionary for each token.
def get_tokenized_sentence_and_indices(iterator):
    tokenized_sentence = next(iterator)
    token_indices = [vocab[token] for token in tokenized_sentence]
    return tokenized_sentence, token_indices
# Gets the next tokenized sentence from the iterator and its token indices
tokenized_sentence, token_indices = get_tokenized_sentence_and_indices(my_iterator)
# Advances the iterator once more; the sentence it returns is discarded
next(my_iterator)
# Prints the tokenized sentence and its corresponding token indices.
print("Tokenized Sentence:", tokenized_sentence)
print("Token Indices:", token_indices)

Tokenized Sentence: ['basics', 'of', 'pytorch']
Token Indices: [11, 15, 2]
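
Because the default index was set to `<unk>`, a token that is missing from the vocabulary maps to index 0 instead of raising an error. A minimal sketch of that lookup with a plain dictionary, reusing a few indices from the output above (`tokens_to_indices` is a hypothetical helper):

```python
# A few entries copied from the vocabulary shown earlier
stoi = {"<unk>": 0, "pytorch": 2, "basics": 11, "of": 15}

def tokens_to_indices(tokens, stoi, unk_token="<unk>"):
    # Unknown tokens fall back to the index of the <unk> token
    unk_index = stoi[unk_token]
    return [stoi.get(token, unk_index) for token in tokens]

print(tokens_to_indices(["basics", "of", "pytorch"], stoi))     # [11, 15, 2]
print(tokens_to_indices(["basics", "of", "tensorflow"], stoi))  # [11, 15, 0]
```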


### Special tokens in PyTorch: <eos> and <bos>
Special tokens are tokens introduced to input sequences to convey specific information or serve a particular purpose during training. The code example shows the use of <bos> and <eos> during tokenization. The <bos> token denotes the beginning of the input sequence, and the <eos> token denotes the end.

# Appends <bos> at the beginning and <eos> at the end of the tokenized sentences 
# using a loop that iterates over the sentences in the input data
lines = [
    "Hello world!",
    "SpaCy is a great NLP library.",
    "We are learning tokenization."
]
tokenizer_en = get_tokenizer('spacy', language='en_core_web_sm')
tokens = []
max_length = 0
for line in lines:
    tokenized_line = tokenizer_en(line)
    tokenized_line = ['<bos>'] + tokenized_line + ['<eos>']
    tokens.append(tokenized_line)
    max_length = max(max_length, len(tokenized_line))
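
To see the result without installing spaCy, the same wrapping step can be sketched with a regex stand-in tokenizer. `simple_tokenize` and `wrap_with_special_tokens` are hypothetical helpers, and the stand-in may split some text differently from spaCy.

```python
import re

BOS, EOS = "<bos>", "<eos>"

def simple_tokenize(text):
    # Stand-in tokenizer (assumption): words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def wrap_with_special_tokens(tokens, bos=BOS, eos=EOS):
    # Marks where the sequence begins and ends
    return [bos] + list(tokens) + [eos]

lines = ["Hello world!", "We are learning tokenization."]
wrapped = [wrap_with_special_tokens(simple_tokenize(line)) for line in lines]
max_length = max(len(tokens) for tokens in wrapped)

print(wrapped[0])   # ['<bos>', 'Hello', 'world', '!', '<eos>']
print(max_length)   # 7
```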

## Padding 
### Special tokens in PyTorch: <pad>
The code example shows the use of the <pad> token to ensure all sentences have the same length.

# Pads the tokenized lines
for i in range(len(tokens)):
    tokens[i] = tokens[i] + ['<pad>'] * (max_length - len(tokens[i]))
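
The padding step can also be written as a small function that returns new lists instead of modifying the input in place. This is a sketch; `pad_token_lists` is a hypothetical helper name.

```python
def pad_token_lists(token_lists, pad_token="<pad>"):
    # The longest sequence decides the target length
    max_length = max(len(tokens) for tokens in token_lists)
    # Append as many pad tokens as each sequence is short of the target
    return [tokens + [pad_token] * (max_length - len(tokens)) for tokens in token_lists]

batch = [
    ["<bos>", "Hello", "world", "!", "<eos>"],
    ["<bos>", "We", "are", "learning", "tokenization", ".", "<eos>"],
]
padded = pad_token_lists(batch)

print(padded[0])                        # ['<bos>', 'Hello', 'world', '!', '<eos>', '<pad>', '<pad>']
print([len(tokens) for tokens in padded])  # [7, 7]
```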

## Dataset / CustomDataset 
### Dataset class in PyTorch
The Dataset class enables accessing and retrieving individual samples from a data set. The code example shows how you can create a custom data set and access samples.

# Imports the Dataset class and defines a list of sentences
from torch.utils.data import Dataset
sentences = ["If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.", "Fame's a fickle friend, Harry."]
# Defines a custom data set that wraps the list of sentences
class CustomDataset(Dataset):
    def __init__(self, sentences):
        self.sentences = sentences
    # Returns the data length
    def __len__(self):
        return len(self.sentences)
    # Returns the item at the given index
    def __getitem__(self, idx):
        return self.sentences[idx]
# Creates a dataset object
dataset = CustomDataset(sentences)
# Accesses samples like in a list
dataset[0]

"If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals."

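What makes `dataset[0]` work is the pair of methods `__len__` and `__getitem__`. The sketch below shows the same protocol in plain Python for (label, text) pairs. `LabeledSentences` is a hypothetical class; in PyTorch code it would subclass `Dataset`, which is left out here so the sketch runs without PyTorch installed.

```python
class LabeledSentences:
    """Plain-Python illustration of the map-style protocol: length plus index access."""

    def __init__(self, samples):
        # Each sample is a (label, text) pair
        self.samples = list(samples)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Raises IndexError past the end, which also lets Python iterate over the object
        return self.samples[idx]

labeled = LabeledSentences([(1, "Introduction to NLP"), (2, "Basics of PyTorch")])

print(len(labeled))   # 2
print(labeled[1])     # (2, 'Basics of PyTorch')
```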
## DataLoader 
### DataLoader class in PyTorch
The DataLoader class enables efficient loading and iteration over data sets for training deep learning models. The code example shows how you can use the DataLoader class to generate batches of sentences for further processing, such as training a neural network model.

from torch.utils.data import Dataset, DataLoader

# Sample data
sentences = [
    "Unicorns are real. I saw a unicorn yesterday. I couldn't see it today.",
    "Fame's a fickle friend, Harry.",
    "It is our choices that show what we truly are, far more than our abilities.",
    "Soon we must all face the choice between what is right and what is easy."
]

# Defines the custom data set
class CustomDataset(Dataset):
    def __init__(self, sentences):
        self.sentences = sentences
    def __len__(self):
        return len(self.sentences)
    def __getitem__(self, idx):
        return self.sentences[idx]

# Creates an instance of the custom data set
custom_dataset = CustomDataset(sentences)

# Specifies a batch size
batch_size = 2

# Creates a data loader
dataloader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=True)

# Creates an iterator object
data_iter = iter(dataloader)

# Calls next() to fetch one batch of samples
print("Next batch:", next(data_iter))

# Prints the sentences in each batch
print("\nAll batches:")
for batch in dataloader:
    print(batch)

Next batch: ['It is our choices that show what we truly are, far more than our abilities.', 'Soon we must all face the choice between what is right and what is easy.']

All batches:
["Unicorns are real. I saw a unicorn yesterday. I couldn't see it today.", 'Soon we must all face the choice between what is right and what is easy.']
["Fame's a fickle friend, Harry.", 'It is our choices that show what we truly are, far more than our abilities.']
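
Because `shuffle=True`, the batch contents differ from run to run, and each pass over the data loader is reshuffled. The core of what happens can be sketched in plain Python: shuffle the indices, then slice them into groups of `batch_size`. `iterate_batches` is a hypothetical helper and leaves out everything else a DataLoader does (worker processes, collation, samplers).

```python
import random

def iterate_batches(samples, batch_size, shuffle=False, seed=None):
    indices = list(range(len(samples)))
    if shuffle:
        # A seeded generator makes the order reproducible in this sketch
        random.Random(seed).shuffle(indices)
    # The last batch may be smaller when the data does not divide evenly
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = ["a", "b", "c", "d", "e"]

print(list(iterate_batches(samples, batch_size=2)))  # [['a', 'b'], ['c', 'd'], ['e']]

shuffled = list(iterate_batches(samples, batch_size=2, shuffle=True, seed=0))
print([len(batch) for batch in shuffled])            # [2, 2, 1]
```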


## Collate Function 
### Custom collate function in PyTorch
The custom collate function is a user-defined function that defines how individual samples are collated or batched together. You can use the collate function for tasks such as tokenizing each sample, converting tokens to indices, and transforming the result into a padded tensor. The code example shows how you can use a custom collate function in a data loader.

import torch
from torch.nn.utils.rnn import pad_sequence

# Defines a custom collate function
def collate_fn(batch):
    tensor_batch = []
    # Tokenizes each sample in the batch
    for sample in batch:
        tokens = tokenizer(sample)
        # Maps tokens to indices using the vocab
        tensor_batch.append(torch.tensor([vocab[token] for token in tokens]))
    # Pads the sequences within the batch to have equal lengths
    padded_batch = pad_sequence(tensor_batch, batch_first=True)
    return padded_batch

# Creates a data loader using the collate function and the custom dataset
dataloader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
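
One detail worth knowing: `pad_sequence` pads with 0 by default, and in the vocabulary built earlier index 0 is `<unk>`, so padding and unknown words become indistinguishable. The plain-Python sketch below shows the same collate steps with a dedicated `<pad>` index. The toy vocabulary, the whitespace tokenizer, and `collate_to_padded_indices` are assumptions for illustration, and the result is a list of lists rather than a tensor.

```python
# Hypothetical toy vocabulary with separate <pad> and <unk> entries
toy_stoi = {"<pad>": 0, "<unk>": 1, "machine": 2, "translation": 3, "with": 4, "nlp": 5}

def collate_to_padded_indices(batch, stoi, pad_token="<pad>", unk_token="<unk>"):
    pad_index, unk_index = stoi[pad_token], stoi[unk_token]
    # Tokenizes each sample (lowercase + whitespace split) and maps tokens to indices
    indexed = [[stoi.get(token, unk_index) for token in sample.lower().split()] for sample in batch]
    # Pads every sequence to the length of the longest one in this batch
    max_length = max(len(sequence) for sequence in indexed)
    return [sequence + [pad_index] * (max_length - len(sequence)) for sequence in indexed]

padded = collate_to_padded_indices(["Machine Translation with NLP", "NLP rocks"], toy_stoi)
print(padded)  # [[2, 3, 4, 5], [5, 1, 0, 0]]
```

With PyTorch, the equivalent choice would be to add `<pad>` to the vocabulary's special tokens and pass its index as `padding_value` to `pad_sequence`.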