In [1]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = False

# Perplexity
Perplexity is a measurement used in natural language processing (NLP) to evaluate how well a language model predicts a sequence of words. It quantifies the degree of uncertainty a model has when generating or understanding text. In other words, perplexity tells us how "perplexed" or confused a language model is when trying to predict the next word in a sequence.

Mathematically, perplexity is the exponential of the average negative log-likelihood of a sequence of words. A lower perplexity indicates that the model has a better understanding of the text and can predict words with greater certainty, while a higher perplexity indicates that the model struggles to predict the text accurately.

### Why Use Perplexity as a Proxy for Gibberish Detection?
In the context of gibberish detection, perplexity serves as a useful proxy to determine how coherent or meaningful a piece of text is:
- **Coherent Text**: Well-formed, grammatically correct text that follows the rules of a language will typically have a lower perplexity because the language model can predict the sequence more easily.
- **Gibberish**: Random or nonsensical text, on the other hand, will have a higher perplexity because it is harder for the language model to predict the next word or make sense of the text. The lack of recognizable linguistic patterns or coherence makes the model "perplexed."

By using perplexity as a metric, we can identify text that is likely gibberish or low quality, which is particularly valuable in tasks such as data quality assessment, content moderation, or filtering out noise from large text datasets.

### How is Perplexity Calculated?
Perplexity is calculated using a language model that has been trained on a large corpus of text. Here’s a step-by-step explanation of how it works:

1. **Tokenization**: The text is first tokenized into words or subwords that the language model can process.
2. **Model Prediction**: The language model assigns a probability to each word in the sequence based on the words that precede it. The likelihood of the entire sequence is then computed as the product of the probabilities of each word.
3. **Log-Likelihood**: To make the calculations more manageable, the negative log-likelihood of the sequence is computed.
4. **Average Log-Likelihood**: The average negative log-likelihood per word is calculated over the entire sequence.
5. **Perplexity**: Finally, perplexity is calculated as the exponential of the average negative log-likelihood:
   $$
   \text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right)
   $$
   where $N$ is the number of tokens (words or subwords) in the text, and $P(w_i \mid w_{<i})$ is the probability the model assigns to the $i^{th}$ token given the tokens that precede it.
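
The steps above can be traced by hand on a toy example. The sketch below assumes the per-token conditional probabilities are already available from some model; the function name `perplexity_from_probs` and the probability values are illustrative and are not part of this notebook's pipeline.

```python
import math


def perplexity_from_probs(token_probs):
    """Return the exponential of the mean negative log-probability."""
    if not token_probs:
        raise ValueError("token_probs must contain at least one probability")
    # Steps 3 and 4: average negative log-likelihood per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Step 5: exponentiate to obtain perplexity.
    return math.exp(avg_nll)


# Hypothetical probabilities for a predictable and an unpredictable sequence.
confident = [0.5, 0.4, 0.6, 0.5]
uncertain = [0.01, 0.02, 0.005, 0.01]

print(perplexity_from_probs(confident))
print(perplexity_from_probs(uncertain))
```

A model that assigns probability $1/k$ to every token has a perplexity of exactly $k$, which is why perplexity is often read as the effective number of equally likely choices per token.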

### Interpreting Perplexity
- **Low Perplexity**: Indicates that the text is easier for the model to predict, suggesting it is coherent and follows typical language patterns.
- **High Perplexity**: Suggests that the text is difficult to predict, often indicating that the text is gibberish, random, or otherwise unconventional.
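
One simple way to act on this interpretation is to flag texts whose perplexity exceeds a cutoff. The sketch below is illustrative only: `flag_high_perplexity`, the review ids, and the cutoff of 1,000 are assumptions rather than values used in this notebook, and a real cutoff should be chosen by inspecting the distribution of scores.

```python
def flag_high_perplexity(scores, cutoff=1000.0):
    """Map each text id to True when its perplexity exceeds the cutoff.

    `cutoff` is a hypothetical value chosen purely for illustration.
    """
    return {text_id: score > cutoff for text_id, score in scores.items()}


# Hypothetical perplexity scores keyed by review id.
scores = {"r1": 60.9, "r2": 183.1, "r3": 538682.9}
flags = flag_high_perplexity(scores)
print(flags)
```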

### Why Perplexity Matters
Perplexity is a widely used metric in NLP for evaluating language models, and it provides a quantitative way to assess the quality of text. In our analysis, we use perplexity as an indicator to flag potential gibberish or poorly constructed text, which is crucial for filtering and cleaning data in natural language processing tasks.

## Imports

In [2]:
import pandas as pd
from tqdm import tqdm
from enum import Enum

from discover.container import DiscoverContainer
from discover.infra.service.datamanager.perplexity import PerplexityAnalysisDataManager

# Register `tqdm` with pandas
tqdm.pandas()

pd.options.display.max_colwidth = None

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.infra.service.datamanager.base",
    ],
)

## Data Manager
The `PerplexityAnalysisDataManager` owns persistence of data and datasets used in this notebook.

In [4]:
datamanager = PerplexityAnalysisDataManager()

## Execution Path Options
This notebook supports three execution paths:

1. **Load Endpoint**: If the notebook has already been executed and results are stored in the repository, they will be loaded. This path is used unless the `FORCE` parameter is set to `True`.
2. **Load perplexities**: If perplexity analysis results have been precomputed on cloud-based GPUs and saved in a CSV file, the file will be loaded and merged with the dataset, unless `FORCE` is `True`.
3. **Execute Inference**: If `FORCE` is set to `True` or if neither the endpoint nor the perplexity file is available, the notebook will perform inference using the perplexity analysis model.

The following code determines the execution path based on these conditions.

In [5]:
class ExecutionPath(Enum):
    LOAD_ENDPOINT = "load_endpoint"
    LOAD_PERPLEXITY = "load_perplexity"
    EXECUTE_INFERENCE = "execute_inference"


def determine_execution_path(
    force: bool, datamanager: PerplexityAnalysisDataManager
) -> ExecutionPath:
    """Determines the execution path based on the existence of data and the force parameter.

    Args:
        force (bool): Whether to force execution, overriding existing data checks.
        datamanager (PerplexityAnalysisDataManager): The data manager to check for existing datasets and perplexities.

    Returns:
        ExecutionPath: The determined execution path.
    """
    if force:
        return ExecutionPath.EXECUTE_INFERENCE

    elif datamanager.dataset_exists(stage="perplexity"):
        return ExecutionPath.LOAD_ENDPOINT

    elif datamanager.perplexity_exist():
        return ExecutionPath.LOAD_PERPLEXITY

    else:
        return ExecutionPath.EXECUTE_INFERENCE


execution_path = determine_execution_path(force=FORCE, datamanager=datamanager)

## Load Endpoint
Loads the endpoint if appropriate given the execution path.

In [6]:
if execution_path == ExecutionPath.LOAD_ENDPOINT:
    df = datamanager.get_dataset(stage="perplexity", name="review")

## Load Pre-Computed Perplexities
Obtain the dataset from the prior stage, 'sentiment', and merge in the perplexities from file.

In [7]:
if execution_path == ExecutionPath.LOAD_PERPLEXITY:
    df = datamanager.get_dataset(stage="sentiment", name="review")
    perplexity = datamanager.get_perplexity()
    df = datamanager.merge_perplexity(df=df, perplexity=perplexity)
    datamanager.add_dataset(df=df, stage="perplexity")

## Execute Inference
The following cells perform inference using the perplexity analysis model according to the execution path.

### Import Model and Transformer Libraries
The GPT-2 language model and tokenizer classes are imported from `transformers`, along with PyTorch.

In [8]:
if execution_path == ExecutionPath.EXECUTE_INFERENCE:
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast
    import torch

### Check GPU Availability and Prepare for Inference 
Verify GPU availability, ensuring GPU resources are being detected and utilized. To mitigate memory issues, release all unused cached memory held by the caching allocator, making it available for other GPU applications and visible in `nvidia-smi`.

In [9]:
if execution_path == ExecutionPath.EXECUTE_INFERENCE:
    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA version:", torch.version.cuda)
    print("GPU count:", torch.cuda.device_count())
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    !nvidia-smi
    torch.cuda.empty_cache()

## Load Data
Loads the data from the sentiment stage from the repository.

In [10]:
if execution_path == ExecutionPath.EXECUTE_INFERENCE:
    df = datamanager.get_dataset(stage="sentiment", name="review")

## Load Model and Tokenizer
Load the DistilGPT2 causal language model used to score perplexity, together with its tokenizer, then move the model to the detected device.

In [11]:
# Load model and tokenizer
if execution_path == ExecutionPath.EXECUTE_INFERENCE:
    model_id = "distilbert/distilgpt2"
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
    max_length = model.config.n_positions
    stride = 512

## Define the Perplexity Function
Tokenize the text, truncating it to 512 tokens, and move the tokenized input to the detected device. The model's average negative log-likelihood over the tokens is then exponentiated, and the function returns the resulting perplexity.

In [12]:
# Function to predict perplexity
def predict_perplexity(text):
    with torch.no_grad():
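        inputs = tokenizer(
            text.lower(),
            return_tensors="pt",
            truncation=True,
            max_length=512,  # Cap each review at 512 tokens
        )
        # No padding: a single string needs none, and GPT-2 defines no pad token.
        inputs = {
            key: value.to(device) for key, value in inputs.items()
        }  # Move inputs to the detected device (GPU if available)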
        seq_len = inputs["input_ids"].size(1)
        nlls = []
        prev_end_loc = 0
        for begin_loc in range(0, seq_len, stride):
            end_loc = min(begin_loc + max_length, seq_len)
            trg_len = (
                end_loc - prev_end_loc
            )  # may be different from stride on last loop
            input_ids = inputs["input_ids"][:, begin_loc:end_loc].to(device)
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100

            with torch.no_grad():
                outputs = model(input_ids, labels=target_ids)

                # loss is calculated using CrossEntropyLoss which averages over valid labels
                # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
                # to the left by 1.
                neg_log_likelihood = outputs.loss

            nlls.append(neg_log_likelihood)

            prev_end_loc = end_loc
            if end_loc == seq_len:
                break
    return torch.exp(torch.stack(nlls).mean()).item()
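
To make the window arithmetic in the loop easier to follow, the sketch below reproduces only the index bookkeeping, with no model involved. The helper name `sliding_windows` is hypothetical, and the second call assumes a context size of 1024 tokens, the usual GPT-2 value for `n_positions`.

```python
def sliding_windows(seq_len, max_length, stride):
    """List the (begin, end, target_len) triples visited by the scoring loop."""
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # Only tokens after prev_end are scored; earlier ones serve as context.
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows


# Small numbers to show overlapping windows.
print(sliding_windows(seq_len=10, max_length=4, stride=2))
# A 300-token review under the assumed 1024-token context: a single window.
print(sliding_windows(seq_len=300, max_length=1024, stride=512))
```

Because the tokenizer call truncates each review to 512 tokens, any review fits in one window under that assumed context size, so the loop body runs once per review.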

## Run Inference
Run inference using the perplexity function above.

In [13]:
if execution_path == ExecutionPath.EXECUTE_INFERENCE:
    df["an_perplexity"] = df["content"].progress_apply(predict_perplexity)
    datamanager.add_dataset(df=df, stage="perplexity")

## Check Results

In [14]:
df[["id", "app_name", "content", "rating", "an_perplexity"]].sample(
    n=5, random_state=22
)

| | id | app_name | content | rating | an_perplexity |
|---|---|---|---|---|---|
| 76544 | 10201072201 | Ad Block One: Tube Ad Blocker | Awesome | 5 | 538682.875 |
| 82700 | 8833410900 | Cleanup: Phone Storage Cleaner | Save time and space | 5 | 273.414001 |
| 26968 | 8183002438 | sweetgreen | I’ve used Chipotle’s and other restaurants’ apps and this is by far the easiest to use and best interface. Not to mention it is similar in price to get a salad delivered and the food is absolutely amazing!! I do have two suggestions: (I) allow the user to add more than two bases and (ii) allow for the use of Apple Pay at checkout. Thanks :)!!! | 5 | 60.859486 |
| 2161 | 9187505053 | OwO Novel - Read Romance Story | The app is not worth 5 stars and the cost for chapters keeps going up | 5 | 183.107086 |
| 60842 | 9288815829 | Bible | I use this app every day. Easy and intuitive. I like all the different versions. I would love to see a chronological version and a Reference to Jesus version. I want plans that are for one day a week. | 5 | 63.145256 |


From this sample, several observations are notable:

1. **Ad Block One: Tube Ad Blocker ("Awesome")**: The perplexity of 538,682.875 is several orders of magnitude above every other row, yet the review is a perfectly valid single word. With so little text, the model has almost no context to condition on, so the score likely reflects the review's length more than its coherence.

2. **Cleanup: Phone Storage Cleaner ("Save time and space")**: At 273.41, this is the highest score among the multi-word reviews. The phrase is short but well formed, and its perplexity stays in the same order of magnitude as the longer reviews.

3. **sweetgreen**: The longest review in the sample has the lowest perplexity, 60.86. It is written in complete, grammatical sentences, which is consistent with low perplexity indicating text that follows typical language patterns.

4. **OwO Novel - Read Romance Story**: This single-sentence review scores 183.11. The text states that the app "is not worth 5 stars" even though the rating is 5, but perplexity measures how predictable the text is, not its sentiment, so it cannot surface that mismatch.

5. **Bible**: Several short, well-formed sentences yield a perplexity of 63.15, close to the sweetgreen review.

### Key Takeaways and Recommendations:
- **Ad Block One: Tube Ad Blocker**: Very short reviews can receive extreme perplexity scores without being gibberish. They may need a minimum-length guard or length-aware thresholds before being flagged.
- **sweetgreen** and **Bible**: In this sample, the two longest, most fluent reviews score lowest (about 60), which matches the expected relationship between coherence and perplexity.
- **OwO Novel - Read Romance Story**: Perplexity is a fluency signal only. Disagreement between a rating and the review text has to be detected by other means.

Overall, these examples suggest that perplexity tracks fluency for longer reviews but is unreliable for very short ones. Five reviews are too few to set a threshold, so any cutoff should be chosen from the full distribution of scores.
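
The single-word review "Awesome" in the sample above scores 538682.875 even though it is valid English, which suggests handling very short reviews separately before flagging on perplexity. The sketch below is one possible guard, not part of this notebook's pipeline: `label_review`, the three-word minimum, and the cutoff of 1,000 are all assumed values for illustration.

```python
def label_review(text, perplexity, min_words=3, cutoff=1000.0):
    """Label a review as 'too_short', 'suspect', or 'ok'.

    `min_words` and `cutoff` are hypothetical values for illustration.
    """
    # Perplexity on very short text is unstable, so do not judge it.
    if len(text.split()) < min_words:
        return "too_short"
    return "suspect" if perplexity > cutoff else "ok"


# The first two rows come from the sample; the third is made up.
print(label_review("Awesome", 538682.875))
print(label_review("Save time and space", 273.414001))
print(label_review("xq zvv plk wrrt", 5000.0))
```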

In the next section, we detect anomalies in the text that *might* introduce *harmful* noise into the dataset.