<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/ml/Fine-Tuning-BERT-for-Text-Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://towardsdatascience.com/fine-tuning-bert-for-text-classification-54e7df642894

# How to use Google's BERT (Bidirectional Encoder Representations from Transformers) with the `transformers` library by Hugging Face

This script does the following:

1. Imports necessary libraries: torch for tensor operations and transformers for the BERT model and tokenizer.
2. Loads the pre-trained model and tokenizer: Here, we use `nlptown/bert-base-multilingual-uncased-sentiment`, a multilingual BERT model fine-tuned on product reviews to predict a rating from 1 to 5 stars.
3. Tokenizes the input text: Converts the input text into tokens that the BERT model can understand.
4. Performs sentiment analysis: Uses the model to predict the sentiment of the input text.
5. Prints the predicted sentiment: Maps the model's output to human-readable sentiment labels.
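
Steps 4 and 5 come down to picking the class with the highest score and looking up its label. The sketch below shows that logic in plain Python so it runs without downloading a model. The logit values are made up for illustration, and `softmax` and `predict_label` are hypothetical helper names, not part of the `transformers` API.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Label list used later in this notebook; index 0 is the lowest rating.
SENTIMENT_LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def predict_label(logits, labels=SENTIMENT_LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over class indices
    return labels[best], probs[best]

# Hypothetical logits, invented for this example (not real model output).
example_logits = [-2.1, -1.7, -0.3, 1.4, 3.2]
label, prob = predict_label(example_logits)
print(f"{label} ({prob:.2f})")
```

The softmax is optional when only the label is needed, since the largest logit is also the largest probability, but it gives a rough confidence value for the prediction.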


# Tutorial: Understanding `BertTokenizer` and `BertForSequenceClassification`

## Introduction

In this tutorial, we will explore two key components of the BERT model provided by the Hugging Face `transformers` library: `BertTokenizer` and `BertForSequenceClassification`. These tools are essential for preparing text data and performing classification tasks using the BERT model.

## `BertTokenizer`

### What is `BertTokenizer`?

`BertTokenizer` is the tokenizer designed for the BERT (Bidirectional Encoder Representations from Transformers) model. A model cannot read raw text, so the tokenizer converts it into numerical input: it splits the text into WordPiece sub-word tokens, adds the special tokens that BERT requires, and maps each token to its ID in the model's vocabulary.
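
To make the idea of sub-word tokenization concrete, here is a toy sketch of greedy longest-match WordPiece splitting. It is not the library's implementation: the vocabulary and its IDs are invented for this example, `wordpiece` and `encode` are hypothetical helper names, and punctuation handling is left out.

```python
# Toy vocabulary, invented for this example; the real bert-base-uncased
# vocabulary has about 30,000 entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hello": 4, "token": 5, "##ization": 6, "##s": 7, "is": 8, "fun": 9,
}

def wordpiece(word, vocab):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return ["[UNK]"]  # nothing matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens, map to IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("Hello tokenization is fun")
print(tokens)
print(ids)
```

A word that is not in the vocabulary as a whole, such as "tokenization" here, is still representable as long as its pieces are, which is why sub-word vocabularies can stay small.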

### Key Functions

- **Tokenization**: Splits text into sub-word tokens.
- **Special Tokens**: Adds `[CLS]` at the beginning and `[SEP]` at the end of the input sequence.
- **Padding and Truncation**: Pads shorter sequences and truncates longer ones so that every sequence in a batch has the same length.
- **Conversion**: Converts tokens to numerical IDs.
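
Padding and truncation are easiest to see on plain lists of IDs. The following is a minimal sketch under stated assumptions, not the tokenizer's own code: `pad_and_truncate` is a hypothetical helper, the IDs between `[CLS]` (101) and `[SEP]` (102) are made up, and padding uses ID 0 on the right. It also builds the attention mask that tells the model which positions hold real tokens.

```python
PAD_ID = 0  # ID assumed for the [PAD] token in this sketch

def pad_and_truncate(batch, max_length):
    """Bring every ID sequence to exactly max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for ids in batch:
        if len(ids) > max_length:
            # Truncate, but keep the final token so the sequence still ends with [SEP].
            ids = ids[:max_length - 1] + [ids[-1]]
        n_pad = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)              # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 stand for [CLS] and [SEP]; the IDs in between are invented.
batch = [[101, 11, 102], [101, 21, 22, 23, 24, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

With a uniform length, the batch can be stacked into a single rectangular tensor, and the mask lets the model ignore the padded positions.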

### Example Usage

```python
from transformers import BertTokenizer

# Load pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Sample text
text = "Hello, how are you?"

# Tokenize the text
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
print(inputs)
```

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load pre-trained model and tokenizer
model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Sample text for sentiment analysis
text = "I love this product! It's amazing."

# Tokenize input text and return PyTorch tensors
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)

# Run the model without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape: (batch_size, num_labels)

# The model has five output classes (1 to 5 stars);
# pick the highest-scoring class and map its index to a sentiment label
predicted_class_id = torch.argmax(logits, dim=1).item()
sentiment_labels = ["very negative", "negative", "neutral", "positive", "very positive"]
predicted_sentiment = sentiment_labels[predicted_class_id]

print(f"Text: {text}")
print(f"Sentiment: {predicted_sentiment}")
```



```text
Text: I love this product! It's amazing.
Sentiment: very positive
```


```python
inputs
```

```text
{'input_ids': tensor([[  101,   151, 11157, 10372, 20058,   106, 10197,   112,   161, 39854,
           119,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
```