## Understanding pipeline()

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch

# provide sample inputs for sentiment analysis
test_sentences = [
    "If my dog was any cuter I would die.",
    "My dog is the cutest dog in the entire world!",
    "My dog is so annoying when she misbehaves.",
]
classifier = pipeline("sentiment-analysis")
classifier(test_sentences)

The `pipeline()` function consists of a tokenization step, a modeling step, and a post-processing step.
We can replicate what `pipeline()` does by first tokenizing the input sentences.

In [None]:

# specify the model checkpoint to load
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs)
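
To make the effect of `padding=True` more concrete, here is a minimal pure-Python sketch of right-padding a batch and building the matching attention mask. The token IDs, the `PAD_ID` value, and the `pad_batch` helper are made up for illustration; this is not how the library implements padding internally.

```python
# Toy sketch of batch padding; all IDs below are illustrative assumptions.
PAD_ID = 0  # assumed padding ID for this example


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor when `return_tensors="pt"` is requested.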

The token IDs can then be passed to the model, which transforms each token into a contextual vector (a hidden state).

In [None]:
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
outputs.last_hidden_state.shape

For sequence classification, we need a model with a classification head, `AutoModelForSequenceClassification`. This model outputs logit scores for the negative and positive labels.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
outputs.logits

We can convert the logits to probabilities by applying the softmax function.

In [None]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
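
The post-processing step can be sketched without any libraries. The example below applies softmax to a few made-up logit rows and picks the most likely label. The logit values and the `postprocess` helper are hypothetical, and the label order is an assumption; for a loaded model the real mapping is stored in `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order for illustration; check model.config.id2label in practice.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def postprocess(logits_batch):
    """Turn rows of logits into label/score dictionaries."""
    results = []
    for row in logits_batch:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": id2label[best], "score": probs[best]})
    return results


example_logits = [[-2.0, 2.0], [3.0, -3.0]]  # made-up values
results = postprocess(example_logits)
print(results)
```

Subtracting the row maximum before exponentiating leaves the result unchanged but avoids overflow for large logits.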

## Pretrained models


In [None]:
from transformers import AutoConfig
bert_config = AutoConfig.from_pretrained('bert-base-cased')
gpt_config = AutoConfig.from_pretrained('gpt2')
bart_config = AutoConfig.from_pretrained('facebook/bart-base')

for config in [bert_config, gpt_config, bart_config]:
    print(type(config))

In [None]:
print(bert_config)

### Creating a Transformer from a Pre-Trained Model

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config - randomly initialized
model = BertModel(config)

# Rather than using the untrained, randomly initialized version, we can load
# a pre-trained version
model = BertModel.from_pretrained("bert-base-cased")

## Tokenization
Loading and saving tokenizers is similar to loading and saving pre-trained models.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

### Encoding
First, we need to break the input text down into tokens.

In [None]:
tokens = tokenizer.tokenize('Here are some words with different etymologies. Some are simple, others are more complicated.')

print(tokens)
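
BERT tokenizers split words into sub-word pieces with the WordPiece scheme. The sketch below shows the greedy longest-match-first idea on a single word, using a tiny made-up vocabulary. The `wordpiece` helper and `toy_vocab` are hypothetical, and the real tokenizer also splits on whitespace and punctuation first, so treat this as a simplified illustration rather than the library's implementation.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary for illustration only
toy_vocab = {"token", "##ization", "simple", "un", "##believ", "##able"}

print(wordpiece("tokenization", toy_vocab))
print(wordpiece("unbelievable", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

The `##` prefix signals that a piece continues the previous one, which is why rarer words in the printed token list above may appear as several fragments.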

The tokens are then mapped to token IDs.

In [None]:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

Special token IDs are added before the token IDs are passed to the encoder model. This stage also adds the attention mask.

In [None]:
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs)
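
As a rough picture of what this last step produces for a single sentence, the sketch below wraps a list of IDs in `[CLS]` and `[SEP]` markers and builds the accompanying masks. The `add_special_tokens` helper is hypothetical, and the ID values 101 and 102 are assumptions for illustration; a real tokenizer exposes its own values as `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
CLS_ID, SEP_ID = 101, 102  # assumed IDs for [CLS] and [SEP]


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap one sequence of token IDs with [CLS] ... [SEP] and build masks."""
    input_ids = [cls_id] + list(ids) + [sep_id]
    return {
        "input_ids": input_ids,
        # A single sentence belongs entirely to segment 0
        "token_type_ids": [0] * len(input_ids),
        # No padding here, so every position is attended to
        "attention_mask": [1] * len(input_ids),
    }


encoded = add_special_tokens([7, 8, 9])  # made-up token IDs
print(encoded)
```

The output is two entries longer than the input, which matches the difference between `input_ids` and `final_inputs["input_ids"]` in the cell above.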