In [None]:
!pip install transformers
!pip install datasets

In [6]:
import pandas as pd

In [7]:
# Now that we have our model, we need some good-quality data to work with, and this is precisely where the datasets library kicks in.
# In my case, I will use the Hugging Face datasets library to import a dataset containing tweets segmented by their sentiment (Positive, Neutral or Negative).

from datasets import load_dataset

dataset = load_dataset("mteb/tweet_sentiment_extraction")
df = pd.DataFrame(dataset['train'])

# The code begins by importing the load_dataset function from the datasets module.
# The load_dataset function is used to load a specific dataset from the Hugging Face's dataset hub.
# The dataset "mteb/tweet_sentiment_extraction" is loaded into the variable dataset.
# The pandas library is used to convert the loaded dataset into a DataFrame.
# The DataFrame is created from the 'train' portion of the dataset and stored in the variable df

In [36]:
df

Unnamed: 0,id,text,label,label_text
0,cb774db0d1,"I`d have responded, if I were going",1,neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,0,negative
2,088c60f138,my boss is bullying me...,0,negative
3,9642c003ef,what interview! leave me alone,0,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...",0,negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,0,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,0,negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,2,positive
27479,ed167662a5,But it was worth it ****.,2,positive


In [8]:
# Now that we already have our dataset, we need a tokenizer to prepare it to be parsed by our model.
# As LLMs work with tokens, we require a tokenizer to process the dataset.
# To process your dataset in one step, use the Datasets map method to apply a preprocessing function over the entire dataset.
# This is why the second step is to load a pre-trained Tokenizer and tokenize our dataset so it can be used for fine-tuning.

from transformers import GPT2Tokenizer

# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
   return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# The code starts by importing the GPT2Tokenizer from the transformers library.
# It then loads a dataset from "mteb/tweet_sentiment_extraction" using the load_dataset function.
# The GPT2Tokenizer is instantiated using a pretrained model "gpt2".
# The padding token for the tokenizer is set to be the same as the end of sentence (eos) token.
# A function tokenize_function is defined to tokenize the text in the examples using the GPT2 tokenizer.
# The function pads or truncates the tokenized text to a maximum length.
# The map function is used to apply the tokenize_function to the entire dataset in batches.


In [2]:
#Start by loading your model and specify the number of expected labels. From the Tweet’s sentiment dataset card, you know there are three labels:

from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)

#The code starts by importing the GPT2ForSequenceClassification class from the transformers library.
# This class is a part of the Hugging Face's transformers library, which provides models for NLP tasks.
# The GPT2ForSequenceClassification class is a GPT-2 model designed for sequence classification tasks.
# The second line of code initializes a GPT2ForSequenceClassification model with pre-trained weights.
# The "gpt2" argument tells the function to load the weights of the base GPT-2 model.
# The num_labels=3 argument specifies that the model should be configured to output predictions for three different labels

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install evaluate
!pip install numpy

In [8]:
import numpy as np

In [10]:
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
   logits, labels = eval_pred
   predictions = np.argmax(logits, axis=-1)
   return metric.compute(predictions=predictions, references=labels)

#- The code starts by importing a module named 'evaluate'.
# It then loads an evaluation metric called 'accuracy' from the 'evaluate' module and assigns it to the variable 'metric'.
# A function named 'compute_metrics' is defined, which takes one argument 'eval_pred'.
# Inside the function, 'eval_pred' is unpacked into two variables: 'logits' and 'labels'.
# The 'np.argmax' function is used on 'logits' to find the indices of maximum values along the last axis, creating 'predictions'.
# The 'compute' method of 'metric' is called with 'predictions' and 'references' (which is 'labels') as arguments.
# The function 'compute_metrics' returns the result of the 'compute' method, which is the accuracy of the predictions.

In [None]:
!pip install accelerate -U

In [None]:
!pip show accelerate

In [11]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   #evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   compute_metrics=compute_metrics,

)

trainer.train()


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss
500,1.0355


TrainOutput(global_step=750, training_loss=0.904355936686198, metrics={'train_runtime': 810.9145, 'train_samples_per_second': 3.7, 'train_steps_per_second': 0.925, 'total_flos': 1567794659328000.0, 'train_loss': 0.904355936686198, 'epoch': 3.0})

In [12]:
trainer.save_model("/content/drive/MyDrive/sentiment_analysis")

In [13]:
# Doing evaluation

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = (preds == labels).mean()
    return {"accuracy": accuracy}

In [16]:
import json

# Load the configuration from the saved model directory
with open("/content/drive/MyDrive/sentiment_analysis/config.json", "r") as f:
    config = json.load(f)

# Get the model name from the configuration
model_name = config["model_type"]
print("Fine-tuned model name:", model_name)

Fine-tuned model name: gpt2


In [18]:
import accelerate

In [21]:
fine_tuned_model = trainer.model

In [25]:
from transformers import Trainer

eval_trainer = Trainer(
    model=fine_tuned_model,  # Your fine-tuned model
    eval_dataset=small_eval_dataset  # Your evaluation dataset
)
fine_tuned_model.config.pad_token_id = 0
# Evaluate the model
evaluation_result = eval_trainer.evaluate()

print("Evaluation results:", evaluation_result)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Evaluation results: {'eval_loss': 1.580585241317749, 'eval_runtime': 90.4674, 'eval_samples_per_second': 11.054, 'eval_steps_per_second': 1.382}


In [43]:
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer, pipeline

# Load your fine-tuned sentiment analysis model
model_path = "/content/drive/MyDrive/sentiment_analysis"  # Path to your fine-tuned sentiment analysis model

# Load the model explicitly
model = GPT2ForSequenceClassification.from_pretrained(model_path)

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # Assuming you are using the base GPT-2 model

# Create the sentiment analysis pipeline with the loaded model and tokenizer
sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Example input tweet
input_tweet = "I'm very fucked up"

# Perform sentiment analysis inference
sentiment_result = sentiment_analysis(input_tweet)

# Mapping of string labels to sentiment classes
label_map = {'LABEL_0': "negative", 'LABEL_1': "neutral", 'LABEL_2': "positive"}  # Adjust this mapping based on your model's label mapping

# Get the predicted label index
predicted_label_index = sentiment_result[0]['label']

# Map the label index to its corresponding sentiment class
predicted_sentiment_class = label_map[predicted_label_index]

# Output the sentiment prediction
print("Sentiment:", predicted_sentiment_class)


Sentiment: negative


It appears that the predicted label index returned by the sentiment-analysis pipeline is in the format 'LABEL_2', which cannot be directly converted to an integer. This suggests that the labels returned by the pipeline are string representations of label names, rather than numerical indices.

In this case, you'll need to adjust your label_map dictionary to map these string labels to their corresponding sentiment classes.

Make sure to adjust the 'LABEL_0', 'LABEL_1', and 'LABEL_2' keys in the label_map dictionary according to the labels returned by your sentiment analysis pipeline. You may need to inspect the output of sentiment_result[0]['label'] to determine the exact label names returned by your model.

By using string labels in your label_map dictionary, you can map the predicted label name to its corresponding sentiment class. Adjust the code based on the actual label names returned by your sentiment analysis pipeline.

The error you're encountering suggests that the predicted label index (predicted_label_index) is not present in your label_map dictionary. It seems like the predicted_label_index is being interpreted as a string ('LABEL_2') rather than an integer.

To resolve this issue, you need to make sure that predicted_label_index is indeed an integer corresponding to the predicted label index. It's possible that the sentiment-analysis pipeline is returning labels in a different format than expected.

You can try converting the predicted_label_index to an integer before using it to retrieve the sentiment class from the label_map dictionary.

By converting predicted_label_index to an integer using the int() function, you ensure that it matches the keys in your label_map dictionary, which are integers representing label indices.

If you continue to encounter issues, double-check the format of the output from the sentiment-analysis pipeline and make sure it matches the expected format for extracting the predicted label index. Adjust the code accordingly based on the actual output format.