# Introduction to the Hugging Face Library

### Introduction to Hugging Face
Hugging Face maintains Transformers, an open-source library that provides pre-trained models and tools for a wide range of Natural Language Processing (NLP) tasks. The library works with both PyTorch and TensorFlow and exposes a unified API across model architectures. It is widely used in the NLP community for research and development.



### Installing Transformers

To follow along, install the `transformers` library together with `datasets` and TensorFlow, which the later sections rely on. Run the following command in your Colab notebook:

In [None]:
!pip install datasets "transformers[sentencepiece]" tensorflow


This installs the latest released version of each package.

### Pipelines
Pipelines are the simplest way to use pre-trained models for NLP tasks such as text classification, question answering, and text generation. A pipeline bundles tokenization, model inference, and post-processing behind a single high-level call, so very little code is needed.


### Using the pipelines
To use a pre-trained model for a specific NLP task, create a pipeline object and specify the task you want to perform. Here's an example of how to use the text generation pipeline:

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In the galaxy far far")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In the galaxy far far away, a young lady is riding on a rocket ship. She is the princess of Y'Kron's ship Y'Kron the Avatar, in charge of the princess by Y'Kron itself. She leads her"}]

## Specifying a custom model in the pipeline

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In the galaxy far far",
    max_length=30,
    num_return_sequences=2,
)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'In the galaxy far far away, and far all over the galaxy.\n\n\n“\n—\nAll you\n“\n�'},
 {'generated_text': "In the galaxy far far, far away, the next time you see that light, just like that of the sun, it's in darkness. But"}]


## Available Pipelines


- AudioClassificationPipeline
- AutomaticSpeechRecognitionPipeline
- ConversationalPipeline
- FeatureExtractionPipeline
- FillMaskPipeline
- ImageClassificationPipeline
- ImageSegmentationPipeline
- ObjectDetectionPipeline
- QuestionAnsweringPipeline
- SummarizationPipeline
- TableQuestionAnsweringPipeline
- TextClassificationPipeline
- TextGenerationPipeline
- Text2TextGenerationPipeline
- TokenClassificationPipeline
- TranslationPipeline
- VisualQuestionAnsweringPipeline
- ZeroShotClassificationPipeline
- ZeroShotImageClassificationPipeline

## Another example: filling in a masked word

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("You are going to <mask> about a wonderful library today.", top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)
All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at distilroberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


[{'score': 0.5679107308387756,
  'token': 1798,
  'token_str': ' hear',
  'sequence': 'You are going to hear about a wonderful library today.'},
 {'score': 0.22818315029144287,
  'token': 1532,
  'token_str': ' learn',
  'sequence': 'You are going to learn about a wonderful library today.'}]

# Looking inside the pipeline with the TensorFlow API

In [None]:
from transformers import pipeline

input_sentences = [
    "I don't like this movie",
    "Upgrad is helping me learn new and wonderful things.",
]
classifier = pipeline("sentiment-analysis")
classifier(input_sentences)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are 

[{'label': 'NEGATIVE', 'score': 0.9839025139808655},
 {'label': 'POSITIVE', 'score': 0.9998325109481812}]

## Tokenizing the input sentences

In [None]:
from transformers import AutoTokenizer

model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model)

In [None]:

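import pprint as pp  # pretty-printer used for the outputs in the rest of the notebook

# Pad to the longest sentence, truncate to at most 12 tokens, return TensorFlow tensors
inputs = tokenizer(input_sentences, padding=True, truncation=True, max_length=12, return_tensors="tf")
pp.pprint(inputs)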

{'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>,
 'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101,   146,  1274,   112,   189,  1176,  1142,  2523,   102,
            0,     0,     0],
       [  101,  3725, 20561,  1110,  4395,  1143,  3858,  1207,  1105,
         7310,  1614,   102]])>,
 'token_type_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>}


## Classifying the input sentences into positive and negative sentiments

In [None]:
from transformers import TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_230']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
pp.pprint(outputs.logits.shape)

TensorShape([2, 2])


In [None]:
pp.pprint(outputs.logits)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 2.2426069, -1.8702551],
       [-4.191452 ,  4.5031376]], dtype=float32)>
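
The logits above are raw, unnormalised scores. As a rough illustration of what the softmax step does with them, here is a minimal pure-Python sketch that needs no TensorFlow. The `softmax` helper is written for this example, and the label order (index 0 is NEGATIVE, index 1 is POSITIVE) is an assumption about this checkpoint that matches the pipeline output shown earlier.

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable and does not change the result.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above (one row per input sentence).
logits = [
    [2.2426069, -1.8702551],
    [-4.191452, 4.5031376],
]

# Assumption: index 0 is NEGATIVE and index 1 is POSITIVE for this checkpoint.
labels = ["NEGATIVE", "POSITIVE"]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    best = row.index(max(row))
    print(labels[best], round(row[best], 4))
```

With these logits the sketch prints `NEGATIVE 0.9839` and `POSITIVE 0.9998`, which agrees with the scores returned by the `sentiment-analysis` pipeline for the same two sentences.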


In [None]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)
pp.pprint(predictions)

tf.Tensor(
[[4.01951671e-02 9.59804833e-01]
 [9.9945587e-01 5.4418424e-04]], shape=(2, 2), dtype=float32)

# Exploring the Tokenizer

In [None]:
tokenized_text = "Learning NLP is so much rewarding".split()
pp.pprint(tokenized_text)

['Learning', 'NLP', 'is', 'so', 'much', 'rewarding']


In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [None]:
tokenizer("Learning NLP is so much rewarding")

{'input_ids': [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## Breaking the tokenizer functions down

In [None]:
tokens = tokenizer.tokenize("Learning NLP is so much rewarding")
pp.pprint(tokens)

['Learning', 'NL', '##P', 'is', 'so', 'much', 'reward', '##ing']
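
The `##` prefix marks a piece that continues the previous word. BERT-style tokenizers produce these pieces with a greedy longest-match-first search over the vocabulary. The sketch below illustrates that idea only: `wordpiece` and `toy_vocab` are hypothetical names invented for this example, and the real tokenizer also handles normalisation, punctuation, and a far larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it one character at a time.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary containing only the pieces seen in the output above.
toy_vocab = {"Learning", "NL", "##P", "is", "so", "much", "reward", "##ing"}

sentence = "Learning NLP is so much rewarding"
tokens = [piece for word in sentence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```

On this toy vocabulary the sketch reproduces the split shown above: `NLP` becomes `NL` + `##P` and `rewarding` becomes `reward` + `##ing`.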


In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
pp.pprint(ids)

[9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158]


## Final touch-ups

In [None]:
# tokenize() only splits text into sub-word tokens; it does not take add_special_tokens
tokens = tokenizer.tokenize("Learning NLP is so much rewarding")
pp.pprint(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
pp.pprint(ids)

# Wrap the ids with the [CLS] and [SEP] ids that BERT expects around a single sequence
final_ids = tokenizer.build_inputs_with_special_tokens(ids)
pp.pprint(final_ids)


['Learning', 'NL', '##P', 'is', 'so', 'much', 'reward', '##ing']
[9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158]
[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102]
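
Conceptually, the two steps above are a dictionary lookup followed by wrapping the ids in the special-token ids. Here is a minimal sketch of that idea, assuming a toy vocabulary built only from the token/id pairs printed in this section. The helper names are hypothetical stand-ins, not the library's implementation.

```python
# Token/id pairs copied from the outputs above; the real vocabulary holds many more entries.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "Learning": 9681, "NL": 21239, "##P": 2101, "is": 1110,
    "so": 1177, "much": 1277, "reward": 10703, "##ing": 1158,
}

def tokens_to_ids(tokens, vocab):
    """Look up each token's integer id in the vocabulary."""
    return [vocab[token] for token in tokens]

def with_special_tokens(ids, vocab):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

tokens = ["Learning", "NL", "##P", "is", "so", "much", "reward", "##ing"]
ids = tokens_to_ids(tokens, toy_vocab)
final_ids = with_special_tokens(ids, toy_vocab)
print(ids)
print(final_ids)
```

The second list is the first one with `101` prepended and `102` appended, matching the `final_ids` output above.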


## Decoding the tokens and getting back the string

In [None]:
tokenizer.decode(final_ids)

'[CLS] Learning NLP is so much rewarding [SEP]'

In [None]:
tokenizer.decode(final_ids, skip_special_tokens=True)

'Learning NLP is so much rewarding'
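
Decoding is roughly the reverse of tokenization: map ids back to tokens, glue `##` pieces onto the previous token, and optionally drop the special tokens. The sketch below illustrates this with a hypothetical `decode_ids` helper and a toy id-to-token table taken from the outputs in this section. The library's own `decode` also applies clean-up rules that are not modelled here.

```python
# Toy id-to-token table built from the ids and tokens printed earlier in this section.
id_to_token = {
    101: "[CLS]", 102: "[SEP]",
    9681: "Learning", 21239: "NL", 2101: "##P", 1110: "is",
    1177: "so", 1277: "much", 10703: "reward", 1158: "##ing",
}

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def decode_ids(ids, skip_special_tokens=False):
    """Turn ids back into text, merging ## continuation pieces."""
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]       # continuation piece: append without a space
        elif text:
            text += " " + token     # new word: separate with a space
        else:
            text = token            # first token in the output
    return text

final_ids = [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102]
print(decode_ids(final_ids))
print(decode_ids(final_ids, skip_special_tokens=True))
```

For this input the sketch gives the same two strings as the `tokenizer.decode` calls above.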

## Handling multiple sequences

In [None]:
tokenized_output = tokenizer(["Learning NLP is so much rewarding","Another test sentence"])

In [None]:
tokenized_output['input_ids']

[[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
 [101, 2543, 2774, 5650, 102]]

### Padding

In [None]:
sequences = ["Learning NLP is so much rewarding", "Another test sentence"]

# Pad every sequence to the length of the longest one in the batch
model_inputs = tokenizer(sequences, padding="longest")
pp.pprint(model_inputs)
print("\n")

# Pad every sequence to the model's maximum length (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
pp.pprint(model_inputs)
print("\n")

# Pad sequences shorter than max_length up to 6 tokens; longer ones are left unchanged
model_inputs = tokenizer(sequences, padding="max_length", max_length=6)
pp.pprint(model_inputs)


{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]],
 'input_ids': [[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
               [101, 2543, 2774, 5650, 102, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}


{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
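
To make the relationship between `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of padding. The `pad_batch` helper is hypothetical, and it assumes a pad id of `0`, which is the value that fills the padded positions in the outputs above.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Right-pad each id list and build the matching attention mask."""
    # Without an explicit max_length, pad to the longest sequence in the batch.
    target = max_length if max_length is not None else max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(target - len(ids), 0)  # sequences longer than target are left unchanged
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
    [101, 2543, 2774, 5650, 102],
]

longest = pad_batch(batch)
fixed = pad_batch(batch, max_length=6)
print(longest)
print(fixed)
```

With no `max_length`, the second sequence is padded to ten ids and its mask ends in five zeros, as in the `padding="longest"` output. With `max_length=6` only the shorter sequence grows, because padding alone never shortens a sequence.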

### Truncation

In [None]:
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
pp.pprint(model_inputs)
print("\n")
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=6, truncation=True)
pp.pprint(model_inputs)
print("\n")

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]],
 'input_ids': [[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102], [101, 2543, 2774, 5650, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]}


{'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]],
 'input_ids': [[101, 9681, 21239, 2101, 1110, 102], [101, 2543, 2774, 5650, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]}
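
Notice that the truncated first sequence still ends with `102`, the `[SEP]` id: the content tokens are cut, not the special tokens. The sketch below imitates that behaviour on an id list that already contains `[CLS]` and `[SEP]`. The `truncate` helper is a hypothetical simplification; the library truncates before adding special tokens and supports several truncation strategies.

```python
def truncate(ids, max_length):
    """Shorten an id list to max_length while keeping the final [SEP] id."""
    if len(ids) <= max_length:
        return ids
    # Keep [CLS] plus the first content ids, then re-append the closing [SEP].
    return ids[: max_length - 1] + [ids[-1]]

batch = [
    [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
    [101, 2543, 2774, 5650, 102],
]

truncated = [truncate(ids, max_length=6) for ids in batch]
print(truncated)
```

The first list shrinks to six ids and the second, which is already shorter than six, is returned unchanged, matching the `max_length=6` output above.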




### Different Output Types

In [None]:
# sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
# model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
pp.pprint(model_inputs)
print("\n")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
pp.pprint(model_inputs)
print("\n")

{'attention_mask': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])>,
 'input_ids': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[  101,  9681, 21239,  2101,  1110,  1177,  1277, 10703,  1158,
          102],
       [  101,  2543,  2774,  5650,   102,     0,     0,     0,     0,
            0]])>,
 'token_type_ids': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>}


{'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]),
 'input_ids': array([[  101,  9681, 21239,  2101,  1110,  1177,  1277, 10703,  1158,
          102],
       [  101,  2543,  2774,  5650,   102,     0,     0,     0,     0,
            0]]),
 'token_type_ids': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}




In [None]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["Learning NLP is so much rewarding","Another test sentence"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_250']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
tokens

{'input_ids': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[  101,  4083, 17953,  2361,  2003,  2061,  2172, 10377,  2075,
          102],
       [  101,  2178,  3231,  6251,   102,     0,     0,     0,     0,
            0]])>, 'attention_mask': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])>}

In [None]:
# Convert the logits to class probabilities (one row per sentence)
tf.math.softmax(output.logits, axis=-1)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1.20997014e-04, 9.99879003e-01],
       [9.92225170e-01, 7.77476979e-03]], dtype=float32)>
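
The last step a pipeline performs is turning these probabilities into labels. A minimal sketch of that post-processing follows, assuming the mapping `{0: "NEGATIVE", 1: "POSITIVE"}` for this checkpoint. The `id2label` dictionary here is written out by hand for illustration rather than read from the model configuration.

```python
# Probabilities copied from the output above (one row per sentence).
probabilities = [
    [1.20997014e-04, 9.99879003e-01],
    [9.92225170e-01, 7.77476979e-03],
]
sequences = ["Learning NLP is so much rewarding", "Another test sentence"]

# Assumption: class index 0 is NEGATIVE and class index 1 is POSITIVE.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in probabilities:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
    results.append({"label": id2label[best], "score": row[best]})

for sentence, result in zip(sequences, results):
    print(sentence, "->", result["label"], round(result["score"], 4))
```

Under that assumption the first sentence is labelled POSITIVE and the second NEGATIVE, in the same list-of-dictionaries shape that the `sentiment-analysis` pipeline returns.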