To install the libraries, run `pip install transformers datasets tokenizers torch`.
To verify the installation:
```python
import transformers
print(transformers.__version__)
```

In [1]:
import transformers
print(transformers.__version__)


4.53.0


Next, we look at the core libraries.

1. Transformers: Provides pre-trained models and pipelines for tasks such as text classification (for example, sentiment analysis).
2. Datasets: Lets you load and preprocess datasets efficiently (for example, the IMDB dataset).
3. Tokenizers: Tokenizes text efficiently for model input.


In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis") # default model is distilbert-base-uncased-finetuned-sst-2-english
print(classifier("i am not very happy right now"))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9997791647911072}]
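
The `score` above is a probability. For a single-label model with more than one class, the text-classification pipeline by default obtains it by applying softmax to the model's raw output logits. A minimal sketch of that step in plain Python; the logit values are made up for illustration and are not taken from the model:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max for numerical stability before exponentiating.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two labels (illustrative values only).
labels = ["NEGATIVE", "POSITIVE"]
logits = [4.2, -3.9]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": probs[best]})
```

A large gap between the two logits is what produces a score very close to 1.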


In [6]:
from datasets import load_dataset
dataset = load_dataset("imdb")
print(dataset["train"][0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hello, Hugging Face!", return_tensors="pt")
print(tokens)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


{'input_ids': tensor([[  101,  7592,  1010, 17662,  2227,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
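
In the output above, `101` and `102` are BERT's `[CLS]` and `[SEP]` special tokens, and `attention_mask` marks real tokens with 1. When sequences in a batch differ in length, the shorter ones are padded and the padded positions get 0 in the mask. A rough sketch of that idea in plain Python, assuming a pad ID of 0 (as in `bert-base-uncased`); the `pad_batch` helper and the second sequence are hypothetical and only for illustration:

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Real tokens first, then padding on the right.
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 7592, 1010, 17662, 2227, 999, 102],  # IDs from the output above
    [101, 7592, 102],                          # shorter, illustrative sequence
]
padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

In practice the tokenizer does this for you when called with `padding=True` on a list of strings.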


In [1]:
#Basic Model Loading
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)


In [2]:
#Model Inference with Pipeline
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model=model_name)
print(classifier("Great tutorial!"))  

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998494386672974}]


In [3]:
#NLP (Text Classification)
classifier = pipeline("text-classification")
print(classifier("confusing movie"))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9995478987693787}]


In [4]:
# Computer Vision (Image Classification)
vision_classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
print(vision_classifier("https://images.pexels.com/photos/45201/kitty-cat-kitten-pet-45201.jpeg"))

Device set to use cpu


[{'label': 'lynx, catamount', 'score': 0.7097601890563965}, {'label': 'Egyptian cat', 'score': 0.14048300683498383}, {'label': 'tabby, tabby cat', 'score': 0.07001744955778122}, {'label': 'tiger cat', 'score': 0.022446582093834877}, {'label': 'Siamese cat, Siamese', 'score': 0.008534948341548443}]
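
The pipeline returns a list of dictionaries, one per candidate label. A small sketch of how such a result might be post-processed, using the predictions copied from the output above; the `THRESHOLD` value is a hypothetical cutoff, not something the pipeline defines:

```python
# Predictions copied from the output above.
predictions = [
    {"label": "lynx, catamount", "score": 0.7097601890563965},
    {"label": "Egyptian cat", "score": 0.14048300683498383},
    {"label": "tabby, tabby cat", "score": 0.07001744955778122},
    {"label": "tiger cat", "score": 0.022446582093834877},
    {"label": "Siamese cat, Siamese", "score": 0.008534948341548443},
]

THRESHOLD = 0.1  # hypothetical cutoff, tune for your use case

# Highest-scoring prediction.
top = max(predictions, key=lambda p: p["score"])
# All labels whose score clears the cutoff.
confident = [p["label"] for p in predictions if p["score"] >= THRESHOLD]

print(top["label"])
print(confident)
```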


In [1]:
#Fine-tuning a Model, Step 1: Load Dataset
from datasets import load_dataset
dataset = load_dataset("imdb")


In [2]:
#Step 2: Load Model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
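# Step 3: Preprocess
def tokenize(examples):
    # Pad every review to the model's maximum length and truncate longer ones
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# batched=True passes batches of examples to tokenize() for speed
tokenized_dataset = dataset.map(tokenize, batched=True)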

Map: 100%|██████████| 25000/25000 [00:05<00:00, 4384.39 examples/s]
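
With `batched=True`, the mapped function receives a dictionary of columns (for example, `examples["text"]` is a list of strings) rather than a single row, and the columns it returns are added to the dataset. The following is a rough imitation of that behavior in plain Python, not the `datasets` implementation; `toy_tokenize` and `toy_map` are hypothetical stand-ins, and the whitespace split replaces the real tokenizer:

```python
def toy_tokenize(examples):
    """Stand-in for tokenize(): splits on whitespace instead of using a real tokenizer."""
    # examples["text"] is a list of strings, one per row in the batch.
    return {"tokens": [text.split() for text in examples["text"]]}

def toy_map(rows, fn, batch_size=2):
    """Rough imitation of a batched map over a list of row dicts."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of rows into a dict of columns.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        # Merge the new columns back into each row.
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_cols.items()})
            out.append(merged)
    return out

rows = [{"text": "great film"}, {"text": "not good"}, {"text": "fine"}]
for row in toy_map(rows, toy_tokenize):
    print(row)
```

The real `Dataset.map` uses a much larger default batch size, which is what makes batched tokenization fast.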


In [4]:
# Trainer with PyTorch needs the accelerate package; the "torch" extra installs it
%pip install "transformers[torch]"

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Step 4: Train the Model
from transformers import Trainer, TrainingArguments

# Train on a random 100-example subset so the run finishes quickly on CPU
small_dataset = tokenized_dataset["train"].shuffle().select(range(100))
training_args = TrainingArguments(
    output_dir="./quick_results",  # checkpoints are written here
    eval_strategy="no",            # skip evaluation during training
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=1,
)
trainer = Trainer(model=model, args=training_args, train_dataset=small_dataset)
trainer.train()


TrainOutput(global_step=4, training_loss=0.3260621428489685, metrics={'train_runtime': 100.8839, 'train_samples_per_second': 0.991, 'train_steps_per_second': 0.04, 'total_flos': 13246739865600.0, 'train_loss': 0.3260621428489685, 'epoch': 1.0})
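
The numbers in the `TrainOutput` above follow from the training configuration. A short worked check in Python, assuming a single device and no gradient accumulation (neither is set in the arguments above):

```python
import math

num_examples = 100        # size of small_dataset
batch_size = 32           # per_device_train_batch_size
num_epochs = 1            # num_train_epochs
train_runtime = 100.8839  # seconds, from the TrainOutput above

# The last batch is smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
global_step = steps_per_epoch * num_epochs

samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 2)

print(global_step)         # 4
print(samples_per_second)  # 0.991
print(steps_per_second)    # 0.04
```

With only 4 optimizer steps, this run is a smoke test of the training loop rather than a meaningful fine-tune.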

# Model Sharing
Share your fine-tuned model and tokenizer on the Hugging Face Model Hub.

1. Log in:
```python
from huggingface_hub import login
login()  # Use your Hugging Face token
```
2. Push model:
```python
model.push_to_hub("my-model")
tokenizer.push_to_hub("my-model")
```

In [None]:
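# Step 5: Save the Model
# Save the tokenizer alongside the model so both can be reloaded from one directory
model.save_pretrained("./fine_tuned_model_small_dataset")
tokenizer.save_pretrained("./fine_tuned_model_small_dataset")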