# 1. Install the transformers and datasets libraries




These lines of code install two essential libraries for working with NLP tasks using pre-trained models and datasets:

`transformers`: Provides tools for working with pre-trained transformer models.

`datasets`: Simplifies the process of finding, loading, and working with NLP datasets.

In [5]:
!pip install transformers
!pip install datasets


Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.17.1 dill-0.3.8 multiprocess-0.70.16
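
After installing, it can help to confirm which versions ended up in the environment. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["transformers", "datasets"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry means the corresponding `pip install` did not succeed in the current kernel.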


# 2. Loading Data
### Loading a Dataset with the Hugging Face Datasets Library

These lines of code demonstrate how to load a dataset using the `load_dataset` function from the Hugging Face Datasets library.


The `load_dataset` function is used to load the dataset named "mrpc" from the "glue" dataset collection. The loaded dataset is stored in the variable `raw_datasets`.

#### Parameters:

- `"glue"`: Specifies the dataset collection, in this case the GLUE (General Language Understanding Evaluation) benchmark: https://huggingface.co/datasets/glue
- `"mrpc"`: Specifies which dataset within the GLUE collection to load, here the Microsoft Research Paraphrase Corpus (MRPC).

  The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.


In [6]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [7]:
type(raw_datasets)

datasets.dataset_dict.DatasetDict
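
When only one split is needed, `load_dataset` also accepts a `split` argument and then returns a single `Dataset` instead of a `DatasetDict`. The sketch below wraps that call in a helper; the name `load_mrpc_split` is hypothetical, and the call itself needs the `datasets` library plus network access (or a local cache).

```python
def load_mrpc_split(split="train"):
    """Load one split of GLUE/MRPC; returns a Dataset rather than a DatasetDict."""
    # Imported here so that defining the helper does not itself need the library.
    from datasets import load_dataset

    return load_dataset("glue", "mrpc", split=split)


# Usage (downloads the data on first use, then reads it from the local cache):
# validation = load_mrpc_split("validation")
# print(validation.num_rows)
```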


### Accessing the Dataset

1. `raw_datasets["train"]`: Accesses the training split (`"train"`) of `raw_datasets`. `raw_datasets` is a `DatasetDict`, a dictionary-like object that maps split names (train, validation, and test) to `Dataset` objects.

2. `raw_train_dataset[0]`: Accesses the first element (index 0) of `raw_train_dataset`, that is, the first example in the training split, returned as a dictionary.




In [8]:
raw_train_dataset = raw_datasets["train"]
print(raw_train_dataset)
print(raw_train_dataset[0])


Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}
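
Indexing a `Dataset` with an integer returns one row as a dictionary, while slicing returns a dictionary of columns, each holding a list. The following is a plain-Python sketch of that behaviour; the toy `columns` table and the helpers `get_row` and `get_slice` are stand-ins invented for illustration, not part of the library.

```python
# Toy stand-in for a Dataset: a dict of equally long columns.
columns = {
    "sentence1": ["a b", "c d", "e f"],
    "sentence2": ["a c", "c d", "x y"],
    "label": [0, 1, 0],
}


def get_row(table, index):
    """One row as a dict, analogous to raw_train_dataset[0]."""
    return {name: values[index] for name, values in table.items()}


def get_slice(table, start, stop):
    """Several rows as a dict of lists, analogous to raw_train_dataset[:5]."""
    return {name: values[start:stop] for name, values in table.items()}


print(get_row(columns, 0))
print(get_slice(columns, 0, 2))
```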


The `raw_train_dataset.features` attribute describes the features (columns) of the raw training dataset: it maps each column name to its type.

Behind the scenes, `label` is of type `ClassLabel`, and the mapping from integers to label names is stored in its `names` attribute: 0 corresponds to `not_equivalent`, and 1 corresponds to `equivalent`.




In [9]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
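
The integer-to-name mapping held by the `ClassLabel` above amounts to a list lookup. The sketch below reproduces it in plain Python with hypothetical helper names; on the real feature object, the `int2str` and `str2int` methods do the same job.

```python
# Label names in index order, copied from the ClassLabel shown above.
LABEL_NAMES = ["not_equivalent", "equivalent"]


def int_to_name(label_id):
    """Map an integer label to its name."""
    return LABEL_NAMES[label_id]


def name_to_int(name):
    """Map a label name back to its integer."""
    return LABEL_NAMES.index(name)


print(int_to_name(1))
print(name_to_int("not_equivalent"))
```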

In [10]:
# First five instances
raw_train_dataset[:5]


{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at 

# 3. Preprocessing
## Load the tokenizer:
This cell imports the `AutoTokenizer` class from the `transformers` library and uses it to load the pre-trained tokenizer that matches the "bert-base-uncased" checkpoint. The tokenizer converts text into the numeric inputs the model expects.

## Tokenize individual sentences:
These lines apply the loaded tokenizer to each sentence separately. They take `sentence1` and `sentence2` from the first example (index 0) of the training split (`"train"`) of `raw_datasets`, tokenize each one on its own, and store the results in `tokenized_sentences_1` and `tokenized_sentences_2`.


In [41]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Apply the tokenizer to each sentence of one example separately
tokenized_sentences_1 = tokenizer(raw_datasets["train"][0]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"][0]["sentence2"])

In [42]:
print(tokenized_sentences_1)
print(tokenized_sentences_2)

{'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [14]:
# Apply the tokenizer to all examples in the training set, one column at a time
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

# 4. BERT Can Process Two Sentences
The tokenizer can also process two sequences together, as an input pair for tasks such as sentence-pair classification or question answering. For example:

Let's say we have two sequences:
1. Sequence A: "How does weather affect agriculture?"
2. Sequence B: "Weather affects agriculture by influencing crop growth and water availability."

When we tokenize these sequences individually, we get a separate encoding for each. When we tokenize them together as a pair, the tokenizer builds a single input: it adds special tokens that mark the beginning of the input and the end of each sequence, and it records which sequence each token belongs to in `token_type_ids`. This matters for models like BERT, which are pre-trained on sentence pairs and rely on this layout to relate the two sequences.

Here's how the tokenizer might tokenize this pair:

Tokenized Sequences:
- [CLS] [Sequence A tokens] [SEP] [Sequence B tokens] [SEP]

Where:
- [CLS] is a special token marking the beginning of the input.
- [SEP] is a special token marking the end of each sequence, which also separates the two sequences.
- [Sequence A tokens] are the tokens obtained by tokenizing Sequence A.
- [Sequence B tokens] are the tokens obtained by tokenizing Sequence B.

The cells below run the real tokenizer on a simpler pair of sentences and convert the resulting IDs back to tokens:


In [15]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [16]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])


['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
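
The layout in the output above can be reproduced by hand. The sketch below is a plain-Python illustration, not the tokenizer's actual implementation: `encode_pair` is a hypothetical helper that takes already-split tokens and builds the `[CLS] A [SEP] B [SEP]` sequence together with the matching `token_type_ids` and `attention_mask`.

```python
def encode_pair(tokens_a, tokens_b):
    """Assemble a BERT-style sentence-pair input from two token lists."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = encode_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
print(pair["tokens"])
print(pair["token_type_ids"])
```

The resulting `token_type_ids` contain eight zeros followed by seven ones, which matches the tokenizer output shown earlier for the same pair.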

In [43]:
# Tokenize the whole training set at once, as sentence pairs
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
print(tokenized_dataset[0])
print(tokenized_dataset[0].tokens)
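
With `padding=True` every sequence is padded to the length of the longest one in the batch, and `truncation=True` cuts sequences that exceed the maximum length. The sketch below shows the idea on toy ID lists. It is an assumption-laden simplification: `pad_and_truncate` is a hypothetical helper, and unlike the real tokenizer it cuts from the end without preserving the final `[SEP]`. The default `pad_id=0` matches the `[PAD]` ID of `bert-base-uncased`.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each ID list to max_length, then pad all lists to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 2023, 102], [101, 2023, 2003, 1996, 102]]
padded = pad_and_truncate(toy_batch, max_length=4)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding the whole training set to its longest sequence in one call keeps every padded sequence in memory, which is one reason the next cell tokenizes through `map` without padding.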

In [44]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
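
Conceptually, `map` calls the function on each example and merges the dictionary it returns into that example, which is why the three new columns appear next to the original four. The following is a toy sketch of that behaviour in plain Python; `toy_map` and `add_lengths` are invented for this illustration, and the real method works on Arrow-backed tables rather than lists of dicts.

```python
def toy_map(rows, function):
    """Apply function to each row and merge the dict it returns into that row."""
    return [{**row, **function(row)} for row in rows]


def add_lengths(example):
    """Stand-in for tokenize_function: returns new columns computed from the row."""
    return {
        "len1": len(example["sentence1"].split()),
        "len2": len(example["sentence2"].split()),
    }


rows = [
    {"sentence1": "a b c", "sentence2": "a b", "label": 1},
    {"sentence1": "x", "sentence2": "y z", "label": 0},
]
mapped = toy_map(rows, add_lengths)
print(mapped[0])
```

The real `map` also accepts `batched=True`, which passes many examples to the function at once and is usually faster with a fast tokenizer.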

In [45]:
print(tokenized_datasets['train'][0])

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [47]:
# Column names
tokenized_datasets['train'].column_names

['sentence1',
 'sentence2',
 'label',
 'idx',
 'input_ids',
 'token_type_ids',
 'attention_mask']

In [40]:
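# Drop the raw text and index columns, which the model does not take as input
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
# The model expects the target column to be called "labels"
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# Return PyTorch tensors instead of Python lists
tokenized_datasets = tokenized_datasets.with_format("torch")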
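
To see which columns are left after these steps, the same operations can be traced on a plain dictionary of column names. This is only a sketch with hypothetical helper functions; it does not cover `with_format("torch")`, which changes the returned type rather than the set of columns.

```python
def remove_columns(table, names):
    """Return a copy of the table without the named columns."""
    return {key: value for key, value in table.items() if key not in names}


def rename_column(table, old, new):
    """Return a copy of the table with one column renamed, keeping column order."""
    return {(new if key == old else key): value for key, value in table.items()}


# Column names taken from the tokenized dataset; the values are left empty.
table = {
    name: []
    for name in [
        "sentence1", "sentence2", "label", "idx",
        "input_ids", "token_type_ids", "attention_mask",
    ]
}
table = remove_columns(table, ["sentence1", "sentence2", "idx"])
table = rename_column(table, "label", "labels")
print(list(table))
```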
