## Use the `transformers` Library

In [1]:
from transformers import pipeline

### Get a `sentiment-analysis` model as a start

In [2]:
%%time

pipe_sentiment_analysis = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

CPU times: user 1.38 s, sys: 826 ms, total: 2.2 s
Wall time: 4.63 s
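
The warning above notes that relying on the default model is not recommended in production. A minimal sketch of pinning the checkpoint explicitly, using the model name and revision printed in that warning. The helper name `build_pinned_pipeline` is hypothetical, and `transformers` is imported inside it so that defining the helper triggers no download:

```python
# Model name and revision copied from the warning printed by pipeline("sentiment-analysis").
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_pinned_pipeline(model_id=MODEL_ID, revision=MODEL_REVISION):
    """Return a sentiment-analysis pipeline pinned to one checkpoint and revision."""
    # Imported lazily so that defining this helper has no side effects.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_id, revision=revision)
```

Calling `build_pinned_pipeline()` should load the same checkpoint as the default call, without the "No model was supplied" warning.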


### Make an inference

In [3]:
%%time

prompt = "I love computer science!"

out = pipe_sentiment_analysis(prompt)

final_response = f"""
    Prompt: {prompt}
    Sentiment: {out[0]["label"]}
    Score: {out[0]["score"]}
"""

print(final_response)


    Prompt: I love computer science!
    Sentiment: POSITIVE
    Score: 0.9998328685760498

CPU times: user 430 ms, sys: 37.8 ms, total: 468 ms
Wall time: 702 ms
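
The `Score` is a probability. For a two-label classifier like this one, the pipeline by default applies a softmax to the model's raw output logits and reports the highest-scoring label. A minimal sketch of that step in plain Python; the logit values and the label order below are made up for illustration, not taken from the model:

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the two classes; not real model output.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.0, 4.5]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A large gap between the two logits is what produces a score close to 1, as in the output above.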


## Save Models Locally

### Save `tokenizer` with `model`

A tokenizer preprocesses text before the model sees it. For the default sentiment-analysis checkpoint, `distilbert-base-uncased-finetuned-sst-2-english`, this is a WordPiece tokenizer of the kind BERT uses. It splits input text into words or subwords (tokens), maps each token to a numerical ID, and structures the result the way the model expects.

When you save a model obtained through `pipeline("sentiment-analysis")`, save the tokenizer alongside it. At inference time the text must be tokenized exactly as it was during training; otherwise the IDs the model receives no longer mean what they meant when it was trained, and accuracy suffers.
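
To make those steps concrete, here is a toy sketch of WordPiece-style tokenization in plain Python: lowercase and split on whitespace, greedily match the longest known subword, then map tokens to IDs and wrap them in `[CLS]`/`[SEP]`. The tiny vocabulary and its IDs are invented for illustration; the real tokenizer also handles punctuation, normalization, and a much larger vocabulary.

```python
# Toy vocabulary; tokens and IDs are invented for this example.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "love": 5, "comput": 6, "##er": 7, "science": 8}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, tokenize, and map tokens to IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("I love computer science", VOCAB)
print(tokens)  # ['[CLS]', 'i', 'love', 'comput', '##er', 'science', '[SEP]']
print(ids)     # [2, 4, 5, 6, 7, 8, 3]
```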

### Are `tokenizers` dependent on the model?

Yes. Each model is trained with a specific tokenizer matched to its architecture and training data. The tokenizer converts raw text into the numerical format the model understands: it splits the text into tokens (words, subwords, or characters), maps those tokens to numerical IDs, and applies preprocessing such as padding or truncation to fit the model's input requirements.

The dependency exists because the model learned its weights against one particular token-to-ID mapping. It performs as trained only if inference-time input is tokenized exactly as the training data was. A different tokenizer can split the text differently or assign different IDs to the same tokens, and either discrepancy can significantly degrade performance.

Therefore, when deploying a model for sentiment analysis, text classification, translation, or any other NLP task, use the tokenizer that was used during the model's training.
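
A toy illustration of why the pairing matters, assuming a hypothetical four-word vocabulary: the same sentence is encoded with two vocabularies that contain the same words but assign IDs in a different order. A model trained against the first mapping would read the second ID sequence as a different sentence.

```python
def build_vocab(words):
    """Assign IDs to words in the order given (stand-in for a tokenizer's vocabulary)."""
    return {word: idx for idx, word in enumerate(words)}


def encode(text, vocab):
    """Map each whitespace-separated, lowercased word to its ID."""
    return [vocab[word] for word in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to words using the given vocabulary."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


# Hypothetical vocabularies: same words, different ID assignment.
training_vocab = build_vocab(["i", "love", "hate", "science"])
other_vocab = build_vocab(["i", "hate", "love", "science"])

text = "I love science"
ids_from_wrong_tokenizer = encode(text, other_vocab)

# What the model "sees" when it interprets those IDs with its training-time mapping.
seen_by_model = decode(ids_from_wrong_tokenizer, training_vocab)

print(encode(text, training_vocab))  # [0, 1, 3]
print(ids_from_wrong_tokenizer)      # [0, 2, 3]
print(seen_by_model)                 # i hate science
```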

In [4]:
from transformers import pipeline

# Create a separate pipeline with the default sentiment-analysis checkpoint
pipe_sentiment_analysis_by_pipe = pipeline("sentiment-analysis")

# Save the model and tokenizer into the same directory so they can be reloaded together
model_directory = "./my_xyz_model_by_pipeline"
pipe_sentiment_analysis_by_pipe.model.save_pretrained(model_directory)
pipe_sentiment_analysis_by_pipe.tokenizer.save_pretrained(model_directory)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


('./my_xyz_model_by_pipeline/tokenizer_config.json',
 './my_xyz_model_by_pipeline/special_tokens_map.json',
 './my_xyz_model_by_pipeline/vocab.txt',
 './my_xyz_model_by_pipeline/added_tokens.json',
 './my_xyz_model_by_pipeline/tokenizer.json')

## Load Models from a Local Directory

In [7]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the model and tokenizer
model_directory = "./my_xyz_model_by_pipeline"
model = AutoModelForSequenceClassification.from_pretrained(model_directory)
tokenizer = AutoTokenizer.from_pretrained(model_directory)

# Create a new pipeline with the loaded model and tokenizer
pipe_loaded_by_loading_model_directly = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

### Make an inference

In [8]:
%%time

prompt = "I love computer science!"

out = pipe_loaded_by_loading_model_directly(prompt)

final_response = f"""
    Prompt: {prompt}
    Sentiment: {out[0]["label"]}
    Score: {out[0]["score"]}
"""

print(final_response)


    Prompt: I love computer science!
    Sentiment: POSITIVE
    Score: 0.9998328685760498

CPU times: user 77.7 ms, sys: 0 ns, total: 77.7 ms
Wall time: 134 ms
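
The reloaded pipeline reports the same label and score as the original, which is what one would expect if both the model and the tokenizer were restored correctly. A sketch of making that check explicit; the helper `same_prediction` is hypothetical, and the two result lists below are copied from the outputs above rather than recomputed:

```python
import math


def same_prediction(a, b, rel_tol=1e-6):
    """Compare two pipeline outputs (lists of {"label", "score"} dicts)."""
    if len(a) != len(b):
        return False
    return all(
        x["label"] == y["label"]
        and math.isclose(x["score"], y["score"], rel_tol=rel_tol)
        for x, y in zip(a, b)
    )


# Values copied from the two inference cells in this notebook.
original = [{"label": "POSITIVE", "score": 0.9998328685760498}]
reloaded = [{"label": "POSITIVE", "score": 0.9998328685760498}]
print(same_prediction(original, reloaded))  # True
```

In the notebook, the two arguments would be `pipe_sentiment_analysis(prompt)` and `pipe_loaded_by_loading_model_directly(prompt)`.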
