# Sentiment Analysis with Hugging Face

Hugging Face is a provider of open-source machine learning technologies and a platform that offers various pre-built models. You can install their package to easily access these models, either using them directly or fine-tuning them with your own data. Additionally, you can host your trained models on their platform, making them accessible for use on different devices and applications.

To fully utilize the platform's features, please visit the Hugging Face website and sign in.

For text classification tasks, Hugging Face provides deep learning models. Training these models requires substantial computational power, particularly GPUs. To accomplish this, you can utilize resources such as Colab, a GPU cloud provider, or a local machine equipped with an NVIDIA GPU.





### Installing packages

In [1]:
!pip install huggingface_hub datasets gradio pipreqs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting huggingface_hub
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting gradio
  Downloading gradio-3.30.0-py3-none-any.whl (17.3 MB)
Collecting pipreqs
  Downloading pipreqs-0.4.13-py2.py3-none-any.whl (33 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid.
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential he

In [3]:
!pip install transformers accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers, accelerate
Successfully installed accelerate-0.19.0 tokenizers-0.13.3 transformers-4.29.1


### Importing Libraries

In [4]:
# Import libraries
import os
import uuid
import pandas as pd
import numpy as np
from scipy.special import softmax
import gradio as gr

from google.colab import drive
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoConfig, 
    AutoModelForSequenceClassification,
    TFAutoModelForSequenceClassification,
    IntervalStrategy,
    TrainingArguments,
    EarlyStoppingCallback,
    pipeline,
    Trainer
) 


In [5]:
drive.mount('/content/drive')

Mounted at /content/drive


## Application of Hugging Face Text Classification Model Fine-tuning

Below is a simple example with `10 epochs of fine-tuning`.

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

The datasets package is a Python library that provides access to a large collection of datasets, including many natural language processing (NLP) datasets commonly used for research and development. The library is designed to provide easy access to these datasets, as well as a uniform interface for loading, preprocessing, and working with the data.

The datasets include a range of tasks such as text classification, question answering, named entity recognition, and sentiment analysis, and cover a variety of languages including English, Spanish, French, Chinese, and many others. Some of the popular datasets included in the package are IMDB, COCO, SQuAD, Multi30k, Wikipedia, and Amazon Reviews.

The datasets package is developed by Hugging Face, a company that specializes in NLP and provides a suite of libraries and tools for working with NLP models.




This code sets the environment variable "WANDB_DISABLED" to "true", which disables the use of the Weights and Biases (W&B) tool. W&B is a third-party tool that can be used to track and visualize the training progress of machine learning models. By setting this environment variable, you are telling your code to not use this tool.

In [6]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

## Load the dataset and display some values

In [7]:
# Load the CSV file into a DataFrame

url = "https://raw.githubusercontent.com/Azubi-Africa/Career_Accelerator_P5-NLP/master/zindi_challenge/data/Train.csv"

df = pd.read_csv(url)

Checking Data Quality

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   tweet_id   10001 non-null  object 
 1   safe_text  10001 non-null  object 
 2   label      10000 non-null  float64
 3   agreement  9999 non-null   float64
dtypes: float64(2), object(2)
memory usage: 312.7+ KB


In [9]:
# Count missing values per column
df.isnull().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

In [10]:
# Select rows with missing values
df[df.isnull().any(axis=1)]

Unnamed: 0,tweet_id,safe_text,label,agreement
4798,RQMQ0L2A,#lawandorderSVU,,
4799,I cannot believe in this day and age some pare...,1,0.666667,


In [11]:
# Extract complete text from 'safe_text' column
complete_text = df.iloc[4798]['safe_text']
complete_text

'#lawandorderSVU '

In [12]:
# Select row by index and assign values to columns
df.loc[4798, 'label'] = 0
df.loc[4798, 'agreement'] = 0.666667

# Write the text back to the safe_text column using position-based indexing (.iloc)
df.iloc[4798, df.columns.get_loc('safe_text')] = complete_text


In [13]:
# Generate a random UUID string for tweet_id.
# UUIDs (Universally Unique Identifiers) are commonly used in software applications to
# generate unique IDs for entities, track user sessions, or create distinctive file names.
# They stay unique across different systems without a central registry.
rand_tweet_id = str(uuid.uuid4())

# On this row the tweet text landed in the 'tweet_id' column (see the output above),
# so copy it into 'safe_text' before overwriting 'tweet_id'
row_index = 4799
df.loc[row_index, 'safe_text'] = df.loc[row_index, 'tweet_id']

# Assign the new ID and the shifted label and agreement values
df.loc[row_index, 'tweet_id'] = rand_tweet_id
df.loc[row_index, 'label'] = 1
df.loc[row_index, 'agreement'] = 0.666667


In [14]:
df[df.duplicated()].sum()

tweet_id     0.0
safe_text    0.0
label        0.0
agreement    0.0
dtype: float64

## Fine-tuning the distilbert-base-uncased model

I manually split the training set to have a training subset (a dataset the model will learn on) and an evaluation subset (a dataset the model will use to compute metric scores, which helps avoid training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset. The cell below uses scikit-learn's `train_test_split`, stratified on the label column.

In [15]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [16]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
1641,CQDD6QLM,"New <user> ""Hey Love"" #MMR #ManyMenRecords #Yo...",0.0,1.0
3907,5GV8NEZS,S1256 [NEW] Extends exemption from charitable ...,0.0,1.0
336,I4D043ST,<user> esp when mercury free vaccines are avai...,1.0,0.666667
6861,CKX52Y8G,"My Life, Your Entertainment #YOTC #MMR @ Exoti...",0.0,1.0
720,07S3NL2T,Baby Luna is sore from her vaccines :( #poorpuppy,0.0,0.666667


In [17]:
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
5818,Y8PQ0BT7,So nervous... The baby's getting vaccines... (...,1.0,0.666667
7842,C9Z6JBSS,AIDS N : A malaria vaccine in children with HI...,0.0,0.666667
880,0VE4NWWQ,Measles Outbreak Hits Texas Church That Preach...,1.0,0.666667
9072,RHQRUF14,Thank you <user> for mtg with your staff. We l...,1.0,1.0
288,ZWEP2IL4,Health district offers no-cost immunizations f...,1.0,0.666667


In [18]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (8000, 4), eval is (2001, 4)
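The shapes printed above follow from simple arithmetic on the 10,001 rows. The sketch below reproduces them, assuming the test share is rounded up to a whole row and the remainder goes to the training set. The helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test share is rounded up to a whole number of rows
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(10001, 0.2)
print(n_train, n_test)  # 8000 2001
```

Because the split is stratified on `label`, each subset should keep roughly the same class proportions as the full DataFrame.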


By saving the subsets as CSV files, you can easily load them into your machine learning framework of choice (e.g., PyTorch, TensorFlow) and preprocess the data as needed for your specific task. Additionally, saving the subsets as separate files allows you to easily swap in new training or evaluation data as needed during the development process.

In [19]:
import os

directory = 'C:/Users/viole/OneDrive/Documents/NLP/Career_Accelerator_P5-NLP-master/zindi_challenge/data'

# create directory if it does not exist
if not os.path.exists(directory):
    os.makedirs(directory)

In [20]:
# Save the split subsets
train.to_csv("C:/Users/viole/OneDrive/Documents/NLP/Career_Accelerator_P5-NLP-master/zindi_challenge/data/trained_subset.csv", index=False)
eval.to_csv("C:/Users/viole/OneDrive/Documents/NLP/Career_Accelerator_P5-NLP-master/zindi_challenge/data/eval_subset.csv", index=False)

In [21]:
dataset = load_dataset('csv',
                        data_files={'train': 'C:/Users/viole/OneDrive/Documents/NLP/Career_Accelerator_P5-NLP-master/zindi_challenge/data/trained_subset.csv',
                        'eval': 'C:/Users/viole/OneDrive/Documents/NLP/Career_Accelerator_P5-NLP-master/zindi_challenge/data/eval_subset.csv'}, encoding = "ISO-8859-1")


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-b27149b67640f7b7/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating eval split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-b27149b67640f7b7/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

## What is Transformers?

Transformers is a Python library for natural language processing (NLP) developed by Hugging Face. It provides an easy-to-use interface for building and training state-of-the-art deep learning models for a variety of NLP tasks, such as text classification, named entity recognition, question answering, and more.

The transformer architecture is a type of neural network that is particularly well-suited for processing sequential data, such as natural language text. It has largely replaced the recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that were previously used for NLP tasks, and has achieved state-of-the-art performance on a wide range of benchmarks.

The Transformers library provides pre-trained transformer models that can be fine-tuned on a specific NLP task with only a small amount of task-specific data. This allows developers to easily leverage the power of transformer models for their own NLP tasks, even if they do not have access to large amounts of training data or high-performance computing resources.

## What is a Tokenizer?

A tokenizer is a component in natural language processing (NLP) that breaks down text into individual tokens, which are usually words or subwords. Tokenization is an important preprocessing step in many NLP tasks, because it converts raw text data into a format that can be easily processed by machine learning models.

There are different types of tokenizers that can be used, depending on the specific requirements of the task. Some common types include:

Word tokenizers: These tokenize text into individual words based on whitespace or punctuation.

Subword tokenizers: These tokenize text into subwords, which can be useful for handling out-of-vocabulary words or words that are rare in the training data.

Character tokenizers: These tokenize text into individual characters, which can be useful for languages that have complex orthographies or for handling misspellings.
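The three tokenizer types above can be compared with a small dependency-free sketch. It is only an illustration: the function names and the toy vocabulary are hypothetical, and the subword function uses a greedy longest-match split in the style of WordPiece, not the actual implementation in the Transformers library.

```python
import re


def word_tokenize(text: str) -> list[str]:
    """Split into words and keep punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text: str) -> list[str]:
    """One token per character, spaces included."""
    return list(text)


def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split of one word; continuation pieces start with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece starts at this position
        start = end
    return pieces


toy_vocab = {"vaccin", "##ation", "##es", "##e"}  # hypothetical vocabulary
print(word_tokenize("Vaccines work!"))             # ['Vaccines', 'work', '!']
print(char_tokenize("MMR"))                        # ['M', 'M', 'R']
print(subword_tokenize("vaccination", toy_vocab))  # ['vaccin', '##ation']
```

The subword split shows why rare words are less of a problem: "vaccination" is not in the toy vocabulary, but it can still be represented by two known pieces.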

AutoTokenizer is a class in the Transformers library that instantiates the appropriate tokenizer for a given pre-trained model. It reads the configuration files stored with the checkpoint to decide which tokenizer class to load. This is useful when working with a variety of pre-trained models, because you do not have to select a tokenizer manually for each one.

In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

'''
The line of code initializes a tokenizer object using the AutoTokenizer class from the Hugging Face library. 
'''

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Tokenizers are essential for natural language processing tasks as they break down text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenizer's configuration. Tokenization is a crucial preprocessing step before feeding text data into machine learning models.

By using the from_pretrained method, you are loading a pre-trained tokenizer whose vocabulary was built from a large corpus of text. The 'distilbert-base-uncased' checkpoint is the case-insensitive variant of DistilBERT: text is lowercased before tokenization, so the distinction between uppercase and lowercase letters is not preserved. A cased checkpoint can be preferable in tasks where the case of words carries significance.

Once initialized, the tokenizer object can be used to tokenize text by calling its methods, such as tokenizer.tokenize(text) to obtain a list of tokens representing the input text.

In [23]:
# !pip install transformers accelerate

In [24]:
# Define a function to transform the label values
def transform_labels(label):
    # Extract the label value
    label = label['label']
    # Map the label value to an integer value
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2
    # Return a dictionary with a single key-value pair
    return {'labels': num}

# Define a function to tokenize the text data
def tokenize_data(example):
    # Extract the 'safe_text' value from the input example and tokenize it
    return tokenizer(example['safe_text'], padding='max_length')

# Apply the transformation functions to the dataset using the 'map' method
# This transforms the label values and tokenizes the text data
dataset_out = dataset.map(transform_labels)

dataset_sub = dataset_out.map(tokenize_data, batched=True)

# Define a list of column names to remove from the dataset
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']

# Apply the 'transform_labels' function to the dataset to transform the label values
# Also remove the columns specified in 'remove_columns'

dataset_sub = dataset_sub.map(transform_labels, remove_columns=remove_columns)

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2001 [00:00<?, ? examples/s]

The columns specified in remove_columns are removed from the dataset because they are not needed for the subsequent analysis or model training.

tweet_id: This column contains unique identifiers for each tweet, which are not relevant for the analysis or modeling.

label: This column contains the original label values, which have already been transformed into numerical values using the transform_labels function.

safe_text: This column contains the preprocessed text data that has already been tokenized and encoded, so it is not needed for subsequent analysis or modeling.

agreement: This column indicates the level of agreement among the annotators for each tweet. While this information might be useful for some analyses, it is not necessary for the sentiment analysis task at hand.

By removing these columns, the resulting dataset is more compact and easier to work with, while retaining all the relevant information for the sentiment analysis task.

In [25]:
dataset

DatasetDict({
    train: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 8000
    })
    eval: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 2001
    })
})

In [26]:
# import accelerate
# !pip install --upgrade transformers

In [27]:
from transformers import TrainingArguments


# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    num_train_epochs=10,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_steps=500,
    logging_steps=500,
    gradient_accumulation_steps=16,
    dataloader_num_workers=2,
    push_to_hub=True,
    hub_model_id="Adoley/covid-tweets-sentiment-analysis-distilbert-model",
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
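The warning above suggests `report_to` as the replacement for the `WANDB_DISABLED` environment variable. A minimal configuration fragment, assuming a Transformers release in which `TrainingArguments` accepts `report_to` (the other arguments from the cell above would be passed alongside it):

```python
from transformers import TrainingArguments

# Assumption: `report_to` is available in the installed transformers version
training_args = TrainingArguments(
    output_dir="./results",
    report_to="none",  # turn off W&B and the other logging integrations
)
```

With this setting, the `os.environ["WANDB_DISABLED"]` line should no longer be needed.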


Explanation:

from transformers import TrainingArguments: Imports the TrainingArguments class from the transformers library.

training_args = TrainingArguments(: Creating a TrainingArguments object and assigning it to the variable training_args.

output_dir='./results': Specifies the directory where checkpoints and training results will be saved.

evaluation_strategy="steps": Specifies when the model is evaluated during training. In this case, the model is evaluated every fixed number of steps rather than once per epoch.

save_strategy="steps": Specifies when checkpoints are saved during training. In this case, a checkpoint is saved every fixed number of steps.

save_steps=500: Specifies the number of steps between two checkpoint saves. In this case, the model is saved every 500 steps.

load_best_model_at_end=True: Specifies whether to load the best checkpoint at the end of training. If set to True, the best checkpoint found during training is loaded; if set to False, the model keeps the weights from the final training step.

num_train_epochs=10: Specifies the number of epochs for training the model. In this case, the model will be trained for 10 epochs.

per_device_train_batch_size=2: Specifies the training batch size per device. In this case, each batch contains 2 examples; with gradient_accumulation_steps=16, gradients from 16 such batches are accumulated before each parameter update.

per_device_eval_batch_size=2: Specifies the batch size for evaluation. In this case, each evaluation batch will contain 2 examples.
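The batch-size and epoch settings together determine how many optimizer steps the run takes. The sketch below works through that arithmetic with the values from the cell above, assuming a single GPU (`n_devices = 1` is an assumption, not something the notebook states).

```python
import math

# Values taken from the TrainingArguments cell and the size of the train split
n_train_examples = 8000
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_train_epochs = 10
n_devices = 1  # assumption: training runs on a single GPU

# Number of examples that contribute to one parameter update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_devices
steps_per_epoch = math.ceil(n_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # 32 250 2500
```

Under these assumptions the run ends at step 2500, so `save_steps=500` and `logging_steps=500` each fire five times, which agrees with the five rows of the training log further down.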

In [28]:

'''
AutoModelForSequenceClassification is a class in the Transformers library that is used for sequence classification tasks, 
where the input is a sequence of text and the output is a label or category assigned to that sequence.

The benefit of using AutoModelForSequenceClassification is that it automatically selects the 
appropriate model architecture from the configuration of the checkpoint you name. 
This makes it easy to fine-tune pre-trained models for various sequence classification tasks without having 
to manually select the appropriate model architecture.
'''

# Load a pre-trained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

'''
Sentiment analysis is a common use case for sequence classification, 
where the goal is to classify text into categories such as positive, negative, or neutral sentiment. 
Therefore, AutoModelForSequenceClassification is a suitable choice for building a sentiment analysis model using DistilBERT.
'''


Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier

In [29]:
train_dataset_base = dataset_sub['train'].shuffle(seed=10) #.select(range(40000)) # to select a part

'''
train_dataset_base is created by selecting the 'train' split of the tokenized dataset and 
shuffling it with the shuffle() function and a fixed seed of 10. 
The data samples are therefore presented to the model in a randomized but reproducible order.
'''

eval_dataset_base = dataset_sub['eval'].shuffle(seed=10)


In [30]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    rmse = np.sqrt(np.mean((predictions - labels)**2))
    return {"rmse": rmse}
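To make the metric easier to follow, here is a dependency-free sketch of the same logic on hand-made inputs, with accuracy added for comparison. The helper names and the toy logits are hypothetical. The real `compute_metrics` receives NumPy arrays from the `Trainer`.

```python
import math


def argmax(row: list[float]) -> int:
    """Index of the largest logit in one row."""
    return max(range(len(row)), key=row.__getitem__)


def rmse_and_accuracy(logits: list[list[float]], labels: list[int]) -> dict[str, float]:
    """RMSE between predicted and true class indices, plus plain accuracy."""
    predictions = [argmax(row) for row in logits]
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    rmse = math.sqrt(sum(squared_errors) / len(labels))
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return {"rmse": rmse, "accuracy": accuracy}


# Hypothetical logits for four tweets over the classes 0, 1, 2
toy_logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [-0.5, 0.0, 2.2], [1.8, 0.4, 0.1]]
toy_labels = [0, 1, 2, 2]
print(rmse_and_accuracy(toy_logits, toy_labels))  # {'rmse': 1.0, 'accuracy': 0.75}
```

Because RMSE is computed on class indices, confusing class 0 with class 2 costs more than confusing neighbouring classes. In the toy data a single such error is enough to give an RMSE of 1.0.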


In [31]:
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=train_dataset_base, 
    eval_dataset=eval_dataset_base,
    compute_metrics=compute_metrics    # Metric function called at each evaluation
)


Cloning https://huggingface.co/Adoley/covid-tweets-sentiment-analysis-distilbert-model into local empty directory.


Download file pytorch_model.bin:   0%|          | 18.4k/255M [00:00<?, ?B/s]

Download file runs/May11_19-55-30_0bc5f976e55d/1683835092.8612216/events.out.tfevents.1683835092.0bc5f976e55d.…

Download file training_args.bin: 100%|##########| 3.87k/3.87k [00:00<?, ?B/s]

Download file runs/May11_19-55-30_0bc5f976e55d/events.out.tfevents.1683835092.0bc5f976e55d.175.0: 100%|#######…

Clean file runs/May11_19-55-30_0bc5f976e55d/1683835092.8612216/events.out.tfevents.1683835092.0bc5f976e55d.175…

Clean file training_args.bin:  26%|##5       | 1.00k/3.87k [00:00<?, ?B/s]

Clean file runs/May11_19-55-30_0bc5f976e55d/events.out.tfevents.1683835092.0bc5f976e55d.175.0:  22%|##2       …

Clean file pytorch_model.bin:   0%|          | 1.00k/255M [00:00<?, ?B/s]

In [32]:
# Launch the learning process: training 

'''
trainer.train() launches the training process on the specified train_dataset.
'''

trainer.train()

'''

During training, the model's parameters will be updated to minimize the loss between the predicted outputs and the actual outputs. The process consists of forward and backward passes through the neural network, followed by parameter updates using an optimization algorithm (in this case, AdamW).

The trainer object will keep track of the training progress, 
including the current epoch, the number of steps completed, 
the average training loss, and the average evaluation loss 
(if an evaluation dataset is provided). 
The training will continue for the specified number of epochs (num_train_epochs in training_args) 
or until the stopping criterion is met (e.g., early stopping based on the evaluation loss).

'''



Step,Training Loss,Validation Loss,Rmse
500,0.7464,0.597854,0.66804
1000,0.4318,0.637399,0.632693
1500,0.1694,0.943944,0.631111
2000,0.072,1.14707,0.65558
2500,0.0388,1.221745,0.643656
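The log above can be read programmatically. The sketch below copies its rows and picks the checkpoint with the lowest validation loss, assuming that is the criterion used by `load_best_model_at_end=True` in this run. It also prints the gap between validation and training loss as a rough overfitting signal.

```python
# Rows copied from the training log: (step, training loss, validation loss, RMSE)
training_log = [
    (500, 0.7464, 0.597854, 0.66804),
    (1000, 0.4318, 0.637399, 0.632693),
    (1500, 0.1694, 0.943944, 0.631111),
    (2000, 0.072, 1.14707, 0.65558),
    (2500, 0.0388, 1.221745, 0.643656),
]

# Pick the checkpoint with the lowest validation loss
best_step, _, best_val_loss, best_rmse = min(training_log, key=lambda row: row[2])
print(best_step, best_val_loss, best_rmse)  # 500 0.597854 0.66804

# Validation loss minus training loss at each step; a growing gap suggests overfitting
gaps = [(step, round(val - train, 4)) for step, train, val, _ in training_log]
print(gaps)
```

The training loss keeps falling while the validation loss rises after step 500, so later checkpoints fit the training tweets better without generalizing better.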


In [33]:
# Evaluate the model
eval_results = trainer.evaluate()

# Create a dictionary of the evaluation results
results_dict = {
    "Model": "Distilbert-base-uncased",
    "Loss": eval_results["eval_loss"],
    "RMSE": eval_results["eval_rmse"],
    "Runtime": eval_results["eval_runtime"],
    "Samples Per Second": eval_results["eval_samples_per_second"],
    "Steps Per Second": eval_results["eval_steps_per_second"],
    "Epoch": eval_results["epoch"]
}

# Create a pandas DataFrame from the dictionary
results_df = pd.DataFrame([results_dict])

# Print the results
print(results_df)


                     Model      Loss     RMSE  Runtime  Samples Per Second  \
0  Distilbert-base-uncased  0.597854  0.66804  32.8141               60.98   

   Steps Per Second  Epoch  
0            30.505   10.0  


In [34]:
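# Push the final fine-tuned model to the Hugging Face model hub.
# The target repository is taken from hub_model_id in training_args;
# the first positional argument of push_to_hub is the commit message, not the repository name.

trainer.push_to_hub()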

Upload file runs/May12_06-59-12_9fa26db3fcc1/events.out.tfevents.1683874840.9fa26db3fcc1.478.0:   0%|         …

Upload file runs/May12_06-59-12_9fa26db3fcc1/events.out.tfevents.1683879004.9fa26db3fcc1.478.2:   0%|         …

To https://huggingface.co/Adoley/covid-tweets-sentiment-analysis-distilbert-model
   7f0bd34..2cf0755  main -> main

   7f0bd34..2cf0755  main -> main

To https://huggingface.co/Adoley/covid-tweets-sentiment-analysis-distilbert-model
   2cf0755..a0bf78b  main -> main

   2cf0755..a0bf78b  main -> main



'https://huggingface.co/Adoley/covid-tweets-sentiment-analysis-distilbert-model/commit/2cf0755c655748af92c5f53999facb8f69e4194a'

In [35]:
tokenizer.push_to_hub("Adoley/covid-tweets-sentiment-analysis-distilbert-model")

CommitInfo(commit_url='https://huggingface.co/Adoley/covid-tweets-sentiment-analysis-distilbert-model/commit/5bc59defc2c4823873d16e305592a35c65dba1ce', commit_message='Upload tokenizer', commit_description='', oid='5bc59defc2c4823873d16e305592a35c65dba1ce', pr_url=None, pr_revision=None, pr_num=None)

In [36]:
model.push_to_hub("Adoley/covid-tweets-sentiment-analysis-distilbert-model")

CommitInfo(commit_url='https://huggingface.co/Adoley/covid-tweets-sentiment-analysis-distilbert-model/commit/1976a0d09ad454f25d5876172187a8cad8701f74', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='1976a0d09ad454f25d5876172187a8cad8701f74', pr_url=None, pr_revision=None, pr_num=None)

Some checkpoints of the model are automatically saved locally in `results/` (the `output_dir`) during training.

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)

Do not hesitate to read more and to ask questions; learning is a lifelong activity.

DistilBERT is a smaller and faster variant of the BERT model that has been distilled or compressed while retaining a similar level of performance. It is designed for various natural language processing (NLP) tasks such as text classification, named entity recognition, and question answering.

The "uncased" in "distilbert-base-uncased" means that the model was trained on lowercased text. The tokenizer therefore converts all text to lowercase before encoding it, which reduces the vocabulary size and makes the model more efficient.

The "base" in "distilbert-base-uncased" indicates that the model was distilled from the base-size BERT architecture rather than from the larger BERT variants. While BERT-base has 12 layers and about 110 million parameters, DistilBERT has 6 layers and about 66 million parameters.

The DistilBERT model is trained using a masked language modeling (MLM) objective, similar to BERT. It learns to predict masked tokens in a sentence by considering the surrounding context, allowing it to capture contextual information and generate meaningful representations of words and sentences.
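The masked language modeling setup can be illustrated with a toy sketch. The positions are chosen by hand here, whereas real pre-training selects them at random and applies further rules. The function name `mask_tokens` is hypothetical.

```python
def mask_tokens(tokens: list[str], positions: list[int], mask_token: str = "[MASK]"):
    """Return (masked tokens, {position: original token}) for the given positions."""
    masked = list(tokens)  # copy so the input list is left unchanged
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


tokens = ["vaccines", "protect", "children", "from", "measles"]
masked, targets = mask_tokens(tokens, positions=[1, 4])  # positions chosen by hand
print(masked)   # ['vaccines', '[MASK]', 'children', 'from', '[MASK]']
print(targets)  # {1: 'protect', 4: 'measles'}
```

The model sees the masked sequence as input and is trained to recover the tokens stored in `targets` from the surrounding context.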

Overall, distilbert-base-uncased is a useful model for various NLP tasks, offering a balance between performance and efficiency due to its smaller size and faster inference time compared to larger models like BERT.

### You can load your model from anywhere using from_pretrained!

In [37]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Adoley/covid-tweets-sentiment-analysis-distilbert-model")

# Build a text-classification pipeline around the fine-tuned model
model = pipeline("text-classification", model="Adoley/covid-tweets-sentiment-analysis-distilbert-model", tokenizer=tokenizer)



Downloading (…)okenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/769 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [38]:
label_map = {0: "negative", 1: "neutral", 2: "positive"}

# Make predictions on some example text
result = model("I hate covid vaccines.")

# Map the numerical label to the corresponding class name
result[0]["label"] = label_map[int(result[0]["label"].split("_")[1])]

# Print the predicted label and score
print(result)

[{'label': 'positive', 'score': 0.6865078806877136}]
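The label conversion in the cell above can be isolated in a small helper. This sketch assumes the pipeline returns default labels of the form `LABEL_<index>`, which is what the `split("_")` call in that cell relies on. The helper name and the sample result are hypothetical.

```python
label_map = {0: "negative", 1: "neutral", 2: "positive"}


def readable_label(raw_label: str, mapping: dict[int, str]) -> str:
    """Convert a default pipeline label such as 'LABEL_2' into a class name."""
    index = int(raw_label.split("_")[1])
    return mapping[index]


# Hypothetical raw pipeline output, before the label is rewritten
raw_result = [{"label": "LABEL_2", "score": 0.6865}]
readable = [{**item, "label": readable_label(item["label"], label_map)} for item in raw_result]
print(readable)  # [{'label': 'positive', 'score': 0.6865}]
```

Building a new list instead of editing `result` in place keeps the raw pipeline output available for debugging.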


## Thank You.