## Text Classification with Representation Models
Representation models are foundation models fine-tuned for specific tasks, for instance to perform classification or to generate general-purpose embeddings.



## The Movie Review Sentiment Dataset
To load this data, we make use of the `datasets` package, which will be used throughout the book.

In [None]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

In [None]:
data['train'][0, -1]

## Using a Task-Specific Model

In [None]:
from transformers import pipeline


In [None]:
# Path to our Hugging Face model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load the model into a pipeline
pipe = pipeline(
    model = model_path,
    tokenizer = model_path,
    return_all_scores = True,  # return a score for every label
    device = "mps"  # Apple Silicon GPU; use "cuda:0" or "cpu" on other hardware
)


In [None]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

In [None]:
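# Run inference
y_pred = []
for output in tqdm(
    pipe(KeyDataset(data["test"], "text")), total = len(data["test"])
):
    # The model scores three labels: negative (0), neutral (1), positive (2)
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # Ignore the neutral score and keep the stronger of the two remaining labels
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)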
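
The loop above relies on the position of each label in the pipeline output. As a minimal sketch of a more defensive variant, the scores can be looked up by label name instead. The helper name `binary_assignment` and the example scores are hypothetical, and the sketch assumes the model reports its labels as "negative", "neutral", and "positive".

```python
def binary_assignment(scores):
    """Map a list of {"label", "score"} dicts to 0 (negative) or 1 (positive)."""
    by_label = {item["label"].lower(): item["score"] for item in scores}
    # Ignore the neutral class and compare only negative vs. positive
    return int(by_label["positive"] > by_label["negative"])


# Hypothetical output for a single review (made-up scores)
example = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
]
print(binary_assignment(example))
```

Looking up scores by name keeps the mapping correct even if the order of the returned labels changes.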

In [None]:
from sklearn.metrics import classification_report

In [None]:
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names = ["Negative Review", "Positive Review"]
    )
    print(performance)


evaluate_performance(data["test"]["label"], y_pred)
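
The classification report lists precision, recall, and F1 score per class. To make those numbers less opaque, here is a rough sketch of how they can be computed by hand for one class. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
print(binary_metrics(toy_true, toy_pred))
```

Precision is the share of predicted positives that are correct, recall is the share of actual positives that were found, and F1 is their harmonic mean.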

### Supervised Classification 

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
# Load model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert the text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar = True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar = True)

In [None]:
print(train_embeddings.shape)
print(test_embeddings.shape)

In [None]:
from sklearn.linear_model import LogisticRegression


In [None]:
# Train a logistic regression on our train embeddings
clf = LogisticRegression(random_state = 42)
clf.fit(train_embeddings, data["train"]["label"])

In [None]:
# Evaluating our model
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

### What If We Do Not Have Labeled Data?

Zero-shot classification attempts to predict the labels of input text even though it was not trained on them. In zero-shot classification, we have no labeled data, only the labels themselves. The zero-shot model decides how the input is related to the candidate labels.

To embed the labels, we first need to give them a description, such as “a
negative movie review.” This can then be embedded through sentence-transformers.

We can create these label embeddings using the `.encode` function.

The cosine similarity is the cosine of the angle between two vectors or embeddings. In this
example, we calculate the similarity between a document and the two possible labels, positive and negative.

After embedding the label descriptions and the documents, we can use
cosine similarity for each label-document pair.
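
To make the computation concrete, here is a small sketch of cosine similarity written out in plain Python, followed by picking the closest label. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and the names `cosine`, `label_vectors`, and `document_vector` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings" for the two label descriptions and one document
label_vectors = {"negative": [1.0, 0.0, 0.2], "positive": [0.0, 1.0, 0.2]}
document_vector = [0.1, 0.9, 0.3]

# Score the document against every label and keep the most similar one
similarities = {
    name: cosine(document_vector, vec) for name, vec in label_vectors.items()
}
predicted = max(similarities, key=similarities.get)
print(predicted)  # → positive
```

Because the measure depends only on direction, a long and a short document pointing the same way in embedding space receive the same score.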

In [None]:
# Create embeddings for our labels
label_embeddings = model.encode(["A very negative movie review", "A very positive movie review"])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

### Text Classification with Generative Models

- Sequence-to-sequence models take some text as input and generate text as output.

A task-specific model generates numerical values from sequences of tokens
while a generative model generates sequences of tokens from sequences of tokens.

These generative models are generally trained on a wide variety of tasks and usually do not perform your use case out of the box. For instance, if we give a generative model a movie review without any context, it has no idea what to do with it.

Instead, we need to help it understand the context and guide it toward the answers that we are looking for. As demonstrated in Figure 4-18, this guiding process is done mainly through the instruction, or prompt, that you give such a model. Iteratively improving your prompt to get your preferred output is called prompt engineering.

Prompt engineering allows prompts to be updated to improve the output
generated by the model.

### Using Text to Text Transfer Transformer

Like the decoder-only models, the encoder-decoder models are sequence-to-sequence models and generally fall in the category of generative models.

An interesting family of models that leverage this architecture is the Text-to-Text Transfer Transformer, or T5. Its architecture is similar to that of the original Transformer, with 12 decoders and 12 encoders stacked together.

These models were first pretrained using masked language modeling. 

In the first step of training, namely pretraining, the T5 model needs to
predict masks that could contain multiple tokens.
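
As a rough illustration of this pretraining objective, the sketch below replaces hand-picked spans of words with sentinel tokens and builds the matching target sequence. It simplifies heavily: it works on whitespace-separated words rather than real subword tokens, the spans are chosen by hand rather than sampled, and the helper name `corrupt_spans` is hypothetical. The `<extra_id_N>` sentinel format follows the one used by the T5 tokenizer.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    Assumes the spans do not overlap; `end` is exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # one sentinel hides the whole span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the model must predict these tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week".split()
masked_input, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(masked_input)  # → Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)        # → <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Note that a single sentinel can stand in for several tokens, which is what distinguishes this objective from masking one token at a time.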

The second step of training, namely fine-tuning the base model, is where the real magic happens. Instead of fine-tuning the model for one specific task, each task is converted to a sequence-to-sequence task and trained simultaneously. This allows the model to be trained on a wide variety of tasks.

By converting specific tasks to textual instructions, the T5 model can be
trained on a variety of tasks during fine-tuning.
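
One way to picture this conversion is a helper that turns any task-specific example into a pair of input text and target text. The sketch below is only illustrative: the function name `to_text_pair` and the instruction wordings are assumptions, not the exact templates used to train the model.

```python
def to_text_pair(task, text, answer):
    """Turn a task-specific example into an (input text, target text) pair."""
    # Hypothetical instruction wordings, one per task
    instructions = {
        "sentiment": "Is the following sentence positive or negative? ",
        "translation": "Translate English to German: ",
        "summarization": "Summarize: ",
    }
    return instructions[task] + text, answer


pair = to_text_pair("sentiment", "a charming and original film", "positive")
print(pair)
```

Since every task ends up as text in and text out, examples from different tasks can be mixed in the same training batches.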

To use the pretrained Flan-T5 model (an instruction-fine-tuned variant of T5) for classification, we start by loading it through the "text2text-generation" task, which is generally reserved for these encoder-decoder models.

In [None]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model = "google/flan-t5-base",
    device = "mps:0"  # Apple Silicon GPU; use "cuda:0" or "cpu" on other hardware
)

In [None]:
# Prepare our data: prepend the instruction to every review
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example["text"]})
data

After creating our updated data, we can run the pipeline in the same way as in the task-specific example.

In [None]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total = len(data["test"])):
    text = output[0]["generated_text"]
    # Anything other than "negative" is counted as positive
    y_pred.append(0 if text.strip().lower() == "negative" else 1)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

### ChatGPT for Classification


In [None]:
import openai


In [None]:
# Create Client
client = openai.OpenAI(api_key = "YOUR API KEY")


Using this client, we create the `chatgpt_generation` function, which allows us to generate some text based on a specific prompt, input document, and the selected model.

In [None]:
def chatgpt_generation(prompt, document, model = "gpt-3.5-turbo-0125"):
    """Generate an output based on prompt and an input document"""
    messages = [
        {
            "role": "system",
            "content" : "You are a helpful assistant"
        }, 
        {
            "role": "user",
            "content": prompt.replace("[DOCUMENT]", document)
        }
    ]

    chat_completion = client.chat.completions.create(
        messages = messages,
        model = model,
        temperature = 0  # favor the most likely tokens for more consistent answers
    )
    return chat_completion.choices[0].message.content

We will need to create a template to ask the model to perform the classification.

In [None]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers."""

# Predict the target using GPT
document = "unpretentious, charming, quirky, original"
chatgpt_generation(prompt, document)
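
To inspect what is actually sent to the model without making an API call, the placeholder substitution can be pulled into a small helper. This is a sketch under assumptions: `build_messages` is a hypothetical name, and the check for a missing placeholder is an addition that the original function does not perform.

```python
def build_messages(template, document, system="You are a helpful assistant"):
    """Fill the [DOCUMENT] placeholder and return the chat messages."""
    if "[DOCUMENT]" not in template:
        # Without the placeholder the review would silently be left out
        raise ValueError("template must contain the [DOCUMENT] placeholder")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": template.replace("[DOCUMENT]", document)},
    ]


messages = build_messages(
    "Is this review positive or negative?\n\n[DOCUMENT]",
    "unpretentious, charming, quirky, original",
)
print(messages[1]["content"])
```

Printing the filled-in prompt for a few documents is a cheap way to catch template mistakes before paying for thousands of requests.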

In [None]:
# Next, we can run this for all reviews in the test dataset to get its predictions.

predictions = [
    chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])
]

As in the previous example, we need to convert the output from strings to integers to evaluate its performance.

In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)
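
A plain `int(pred)` raises an error as soon as the model returns anything besides a bare digit, for example "Answer: 1". The sketch below shows one possible, more forgiving parser. The name `parse_label` is hypothetical, and falling back to 0 for unparseable answers is an arbitrary assumption; flagging such answers for manual review may be preferable.

```python
import re


def parse_label(raw, default=0):
    """Extract the first standalone 0 or 1 from a model response."""
    match = re.search(r"\b[01]\b", raw)
    # Fall back to the default when no standalone 0 or 1 is found
    return int(match.group()) if match else default


print(parse_label("1"))           # → 1
print(parse_label("Answer: 0."))  # → 0
print(parse_label("positive"))    # → 0
```

The word boundaries in the pattern prevent digits inside longer numbers, such as "10", from being read as a label.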