In [1]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = False

# Perplexity Analysis
Perplexity is a measurement used in natural language processing (NLP) to evaluate how well a language model predicts a sequence of words. It quantifies the model's uncertainty when generating or understanding text. In other words, perplexity indicates how "perplexed" or confident a language model is when attempting to predict the next word in a sequence.

Mathematically, perplexity is the exponential of the average negative log-likelihood of a sequence of words. A **higher perplexity** suggests the text is rich, complex, and harder for the model to predict, often indicating meaningful and varied content. Conversely, a **lower perplexity** indicates the model can predict the text more easily, which might signal irrelevant data, repetitive patterns, or noise.

### Why Use Perplexity as a Proxy for Noise Detection?

In the context of noise and data quality assessment, perplexity serves as a valuable proxy for identifying repetitive or irrelevant content:
- **Low Perplexity**: Text with repeated patterns, simplistic language, or irrelevant content is easier for the model to predict and, therefore, has a lower perplexity. This can be an indicator of low-quality or noisy data.
- **High Perplexity**: Rich, well-formed, and grammatically complex text has a higher perplexity, suggesting linguistic diversity and relevance.

By using perplexity as a metric, we can detect and filter out low-quality or repetitive text, enhancing the overall quality of text data for applications like data quality assessment, content moderation, or noise reduction in large datasets.

### How is Perplexity Calculated?
Perplexity is calculated using a language model that has been trained on a large corpus of text. Here’s a step-by-step explanation of how it works:

1. **Tokenization**: The text is first tokenized into words or subwords that the language model can process.
2. **Model Prediction**: The language model assigns a probability to each word in the sequence based on the words that precede it. The likelihood of the entire sequence is then computed as the product of the probabilities of each word.
3. **Log-Likelihood**: To make the calculations more manageable, the negative log-likelihood of the sequence is computed.
4. **Average Log-Likelihood**: The average negative log-likelihood per word is calculated over the entire sequence.
5. **Perplexity**: Finally, perplexity is calculated as the exponential of the average negative log-likelihood:
   $$
   \text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)
   $$
   where $N$ is the number of tokens in the text, and $P(w_i \mid w_1, \dots, w_{i-1})$ is the probability the model assigns to the $i^{th}$ token given the tokens that precede it.
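
The steps above can be checked by hand on a few numbers. The sketch below is a minimal illustration, not the notebook's implementation: the helper `perplexity` and the per-token probabilities are hypothetical, chosen only to show how the formula behaves.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probability the model assigned to each observed token."""
    if not token_probs:
        raise ValueError("token_probs must contain at least one probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities for three four-token texts.
predictable = perplexity([0.9, 0.8, 0.95, 0.9])
surprising = perplexity([0.05, 0.01, 0.2, 0.02])
uniform = perplexity([0.25] * 4)  # every token was a 1-in-4 guess

print(f"predictable={predictable:.2f} uniform={uniform:.2f} surprising={surprising:.2f}")
```

A model that assigns every token a probability of 0.25 has a perplexity of about 4, which is why perplexity is often read as the effective number of equally likely choices the model faces at each step.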

### Interpreting Perplexity
- **Low Perplexity**: Indicates that the text is easy for the model to predict. This covers text that follows typical language patterns and, at the extreme low end, text that is highly repetitive.
- **High Perplexity**: Indicates that the text is difficult for the model to predict. This covers linguistically rich and varied text and, at the extreme high end, text that is rare, random, or otherwise unconventional.

### Why Perplexity Matters
Perplexity is a widely used metric in NLP for evaluating language models, and it provides a quantitative way to assess the quality of text. In our analysis, we use perplexity as an indicator to flag potentially repetitive, irrelevant, or poorly constructed text, which is crucial for filtering and cleaning data in natural language processing tasks.

## Model Overview
- **Model Name**: `distilbert/distilgpt2`
- **Base Model**: Generative Pre-trained Transformer 2 (GPT-2)
- **Task**: Text Generation
- **Language**: English

## Model Description
DistilGPT2 (short for Distilled-GPT2) is an English-language model pre-trained with the supervision of the smallest version of Generative Pre-trained Transformer 2 (GPT-2). Like GPT-2, DistilGPT2 can be used to generate text.

## Imports

In [2]:
import pandas as pd
import numpy as np

from discover.app.ppl import PerplexityAnalyzer
from discover.container import DiscoverContainer
from discover.flow.stage.model.perplexity import PerplexityAnalysisStage
from discover.core.flow import PhaseDef, StageDef
from discover.infra.config.flow import FlowConfigReader

pd.options.display.max_colwidth = None

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.stage.base",
    ],
)

## Perplexity Analysis Task
The `PerplexityAnalysisTask` class performs perplexity analysis for each text entry, measuring the coherence and complexity of the text.

**Key Technical Components**:

1. **Model Setup and Hardware Optimization**:
   - The class supports GPU acceleration using PyTorch. It detects if a CUDA-compatible GPU is available and assigns the device accordingly. This enables faster processing compared to using a CPU, which is critical for analyzing large datasets.
   - The pre-trained language model and tokenizer are loaded using the Hugging Face `transformers` library. Specifically, `GPT2LMHeadModel` is used for language modeling, and `GPT2TokenizerFast` handles text tokenization.

2. **Text Tokenization and Preparation**:
   - The `predict_perplexity` method tokenizes the input text, converting it into a format that the model can process. Tokenization truncates the text to `max_length`, which is read from the model configuration (`n_positions`) when the model is loaded, so every input sequence fits within the model's context window. The default of 512 applies to `stride`, not to `max_length`.

3. **Chunked Text Processing**:
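   - For token sequences longer than the model's `max_length`, the scoring loop processes the text in overlapping chunks using a defined `stride` value. The stride determines how far each chunk advances, and therefore how much context consecutive chunks share, so that the model captures dependencies between words across chunks. Because the tokenizer call already truncates each text to `max_length`, a single chunk covers each review in the code as written.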
   - Each chunk of text is passed through the model to compute the **negative log-likelihood (NLL)**, a key component in calculating perplexity. The method iterates over the text, collecting NLL values for each chunk.

4. **Perplexity Calculation**:
   - Perplexity is calculated as the exponential of the average negative log-likelihood across all chunks. Lower perplexity scores indicate simpler or more predictable text, while higher scores suggest greater linguistic richness and complexity.

5. **Memory Management**:
   - The class calls `torch.cuda.empty_cache()` before loading the model to free up GPU memory, preventing potential out-of-memory errors and ensuring efficient use of resources.

6. **Efficient Data Processing**:
   - The `run` method uses `progress_apply()` to apply the `predict_perplexity` method to each text entry in the specified column of a pandas DataFrame, with a progress bar for monitoring. This allows for a scalable and transparent analysis of text data.
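
The windowing described in item 3 can be sketched without loading a model. The hypothetical helper `sliding_windows` below mirrors the index arithmetic of the loop in `predict_perplexity`: each window spans up to `max_length` tokens, starts `stride` tokens after the previous one, and only the last `trg_len` tokens of a window are scored, so no token is counted twice. The 2,500-token sequence and the `max_length` default of 1,024 are assumptions made for illustration.

```python
def sliding_windows(seq_len, max_length=1024, stride=512):
    """Yield (begin, end, trg_len) for each scoring window over a token sequence."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # tokens not yet scored by an earlier window
        yield begin, end, trg_len
        prev_end = end
        if end == seq_len:
            break


windows = list(sliding_windows(seq_len=2500))
for begin, end, trg_len in windows:
    print(f"tokens [{begin}:{end}] -> score the last {trg_len}")
```

Every window after the first reuses earlier tokens purely as context, which is what lets the scored tokens see preceding text even when the full sequence does not fit in one pass.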

### Summary
This class efficiently performs perplexity analysis, leveraging GPU acceleration to handle complex text data. It is designed for integration into data processing workflows, providing valuable insights into text coherence and quality. The source code is provided in the expandable cell below.

In [4]:
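# %load -r 25-194 discover/flow/task/model/perplexity.py

import torch
from tqdm import tqdm

from discover.flow.task.base import Task
from discover.infra.service.logging.task import task_logger
from discover.infra.utils.file.io import IOService

# Register `progress_apply` on pandas objects (used in `_run`)
tqdm.pandas()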

# ------------------------------------------------------------------------------------------------ #
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"


# ------------------------------------------------------------------------------------------------ #
class PerplexityAnalysisTask(Task):
    """Task for performing perplexity analysis on text data.

    This class uses a pre-trained language model to calculate the perplexity
    of text data in a specified column. The results are added to a new column
    in the DataFrame, providing a quantitative measure of text coherence and
    complexity.

    Attributes:
        cache_filepath (str): Path to file containing perplexities computed in the cloud.
        device_local (bool): Whether to compute perplexity locally when the cache isn't
            available. Default is False, in which case a missing cache raises an exception.
        column (str): The name of the column containing the text data. Defaults to "content".
        new_column (str): The name of the column to store perplexity scores. Defaults to "perplexity".
        model_name (str): The name of the pre-trained language model. Defaults to "distilbert/distilgpt2".
        stride (int): The stride size used for processing long sequences in chunks. Defaults to 512.
    """

    def __init__(
        self,
        cache_filepath: str,
        column="content",
        new_column="perplexity",
        model_name: str = "distilbert/distilgpt2",
        stride: int = 512,
        device_local: bool = False,
        io_cls: type[IOService] = IOService,
        **kwargs,
    ):
        super().__init__(**kwargs)
        # Name of the text column; read by `_run` when scoring each entry
        self._column = column
        self._new_column = f"{self.stage.id}_{new_column}"
        self._model_name = model_name
        self._cache_filepath = cache_filepath
        self._device_local = device_local

        self._io = io_cls()

        self._stride = stride

        # Model, tokenizer, and device are initialized as None and will be loaded later
        self._model = None
        self._tokenizer = None
        self._device = None
        self._max_length = None

    @task_logger
    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        """Executes perplexity analysis on the given DataFrame, using cached scores when available.

        Args:
            data (pd.DataFrame): The input DataFrame containing text data.

        Returns:
            pd.DataFrame: The DataFrame with a new column containing perplexity.
        """
        try:
            cache = self._io.read(filepath=self._cache_filepath, lineterminator="\n")
            cache["id"] = cache["id"].astype("string")
            data = data.merge(cache[["id", self._new_column]], how="left", on="id")
            return data
        except (FileNotFoundError, TypeError):
            if self._device_local:
                return self._run(data=data)
            else:
                msg = f"Cache not found or not available. {self.__class__.__name__} is not supported on local devices. Try running on Kaggle, Colab or AWS."
                self._logger.error(msg)
                raise FileNotFoundError(msg)
        except Exception as e:
            msg = f"Unknown exception encountered.\n{e}"
            self._logger.exception(msg)
            raise

    def _run(self, data: pd.DataFrame) -> pd.DataFrame:
        """Executes perplexity analysis on the given DataFrame.

        Args:
            data (pd.DataFrame): The input DataFrame containing text data.

        Returns:
            pd.DataFrame: The DataFrame with a new column containing perplexity scores.
        """

        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        # Clear CUDA memory to ensure enough space is available for the model
        torch.cuda.empty_cache()

        # Load the device, model, and tokenizer
        self._load_model_tokenizer_to_device()

        # Compute perplexity for each text entry in the specified column
        data[self._new_column] = data[self._column].progress_apply(
            self.predict_perplexity
        )
        # Write results to cache
        self._write_file(
            filepath=self._cache_filepath, data=data[["id", self._new_column]]
        )

        return data

    def predict_perplexity(self, text):
        """Calculates the perplexity of a given text using the loaded language model.

        Args:
            text (str): The input text for perplexity computation.

        Returns:
            float: The calculated perplexity score for the text.
        """
        with torch.no_grad():
            # Tokenize the text and prepare it for the model
            inputs = self._tokenizer(
                text.lower(),
                return_tensors="pt",
                truncation=True,
                # No padding: each text is scored on its own, and GPT-2
                # tokenizers define no pad token by default.
                max_length=self._max_length,
            )
            # Move inputs to the appropriate device (CPU or GPU)
            inputs = {key: value.to(self._device) for key, value in inputs.items()}
            seq_len = inputs["input_ids"].size(1)
            nlls = []  # List to store negative log-likelihood values
            prev_end_loc = 0

            # Process the text in chunks using the specified stride
            for begin_loc in range(0, seq_len, self._stride):
                end_loc = min(begin_loc + self._max_length, seq_len)
                trg_len = end_loc - prev_end_loc  # Target length for the current chunk
                input_ids = inputs["input_ids"][:, begin_loc:end_loc].to(self._device)
                target_ids = input_ids.clone()
                target_ids[:, :-trg_len] = -100  # Mask non-target tokens

                with torch.no_grad():
                    # Compute the negative log-likelihood for the current chunk
                    outputs = self._model(input_ids, labels=target_ids)
                    neg_log_likelihood = outputs.loss

                nlls.append(neg_log_likelihood)
                prev_end_loc = end_loc
                if end_loc == seq_len:
                    break

        # Return the exponential of the average negative log-likelihood as perplexity
        return torch.exp(torch.stack(nlls).mean()).item()

    def _load_model_tokenizer_to_device(self) -> None:
        """Loads the device, tokenizer, and model for perplexity analysis."""
        # Import here so the names are in scope where they are used, and so
        # transformers is only required when perplexity is computed locally.
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        # Select GPU if available, otherwise use CPU
        self._device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Load the tokenizer and model from the pre-trained model name
        self._tokenizer = GPT2TokenizerFast.from_pretrained(self._model_name)
        self._model = GPT2LMHeadModel.from_pretrained(self._model_name).to(self._device)

        # Set the maximum length supported by the model
        self._max_length = self._model.config.n_positions

## Perplexity Analysis Pipeline
Extract the configuration, construct the `PerplexityAnalysisStage` pipeline and run it.

In [7]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.PERPLEXITY
)

# Build and run the Perplexity Analysis Stage
stage = PerplexityAnalysisStage.build(
    stage_config=stage_config, return_dataset=True, force=FORCE
)
dataset = stage.run()



                           Perplexity Analysis Stage
                           Stage Started | Mon, 02 Dec 2024 18:59:31
                         Stage Completed | Mon, 02 Dec 2024 18:59:32
                           Stage Runtime | 0.27 seconds
                           Cached Result | True





## Perplexity Results Analysis
Let's examine a few random samples to get a sense of how perplexity scores are reflected in the text.


In [8]:
analyzer = PerplexityAnalyzer(df=dataset.content)
analyzer.sample(
    n=5, random_state=8, column_subset=["id", "app_name", "content", "pa_perplexity"]
)

Unnamed: 0,id,app_id,app_name,category_id,author,rating,content,vote_sum,vote_count,date,review_length,pa_perplexity,category
56610,9912730443,1108185179,Calendar,6007,1d4d567c0e44acaf77e6,4,Mooncycle,0,0,2023-05-10 15:50:00,1,184325.46875,Productivity
72533,9522246562,454638411,Messenger,6005,0f07e09228191139bcb8,3,"Would be 5 stars, but missing this feature. Please bring back the option to remove contact from non-friends, so we can have some privacy. I think there’s use to be this option years ago. Now once you chat with non friends you will see each other active. For example if you contact thru Facebook marketplace you it will automatically add to your contact which you can’t remove and can only block. Which I don’t want to see anyone on my block list. So overall 3 stars unless fixed with future updates.",0,0,2023-01-18 23:40:59,91,77.504509,Social Networking
41259,8455560762,1184577212,Zap Surveys - Earn Easy Money,6012,b4be52faf2d35af522cb,3,I thought that the payout would have been higher for surveys.,0,0,2022-03-14 18:20:27,11,155.309021,Lifestyle
51489,7358923363,389801252,Instagram,6008,7042f5e76c74cf3f4433,1,"I havent been able to stay in connection with what is happening in Palestine because instagram keeps blocking most of the videos and posts and banning their spread. As a user that has the total freedom to follow whatever page i want and that expects to recieve the news i select, this have been useless lately because i no longer have the right nor the ability to chose the platforms i would like to be connected to. Only ridiculously useless pages on the top of the feeds with very unimportant content while disasterous unhumatarian events are happening all over the world with zero coverage and transparency. Hypocrite",0,0,2021-05-19 03:36:48,107,114.216942,Photo & Video
29621,7043042797,386022579,Pregnancy Tracker - BabyCenter,6013,893d1c4e09825eba32f9,5,They give you so much unlimited information that other apps do not definitely worth downloading I have downloaded multiple maybe all and this was the best one !!! It’s awesome 🤰🏾🤰🏼🤰🏿🤰🏽🤰🏻🤰,0,0,2021-02-27 02:35:55,31,35.672577,Health & Fitness


### Observations

1. **Review 1** ("Mooncycle") has an extremely high perplexity score of **184,325.47**. The review is a single, uncommon word that may not have appeared in the model's training corpus, so the model assigns it a very low probability. With so little text to average over, that uncertainty translates directly into a very high perplexity score.
2. **Review 2** (long review about privacy concerns in Messenger) has a perplexity score of **77.50**. This indicates a relatively structured and predictable text, suggesting the language is coherent but not overly simplistic.
3. **Review 3** (complaint about survey payouts) has a perplexity score of **155.31**, the second highest of the five samples but far from extreme. The text might have a moderate level of complexity or variability in its language.
4. **Review 4** (criticism of Instagram's censorship) has a perplexity score of **114.22**. This score suggests a coherent yet linguistically rich text, with the complexity stemming from the review's length and nuanced content.
5. **Review 5** (highly positive review with emojis) has the lowest perplexity score of the five, **35.67**. This reflects simple and repetitive language, making the text highly predictable for the language model.

Overall, the data highlights variations in text complexity, with most reviews being reasonably coherent but differing in their richness and structure.
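
The "Mooncycle" score also shows how text length interacts with the average in the perplexity formula. In the sketch below, the probabilities are invented for illustration and are not taken from the model: one very unexpected token is placed in a two-token text and in a fifty-token text whose other tokens are all moderately predictable.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


RARE = 1e-6     # assumed probability of one very unexpected token
TYPICAL = 0.5   # assumed probability of every other token

short_text = perplexity([RARE, TYPICAL])             # 2 scored tokens
long_text = perplexity([RARE] + [TYPICAL] * 49)      # 50 scored tokens

print(f"short: {short_text:,.1f}  long: {long_text:.2f}")
```

The same rare token dominates the short text but is diluted in the long one, which is one reason very short reviews can land at the extreme high end of the perplexity range.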

## Does Low Perplexity Signal Noise, Gibberish and Irrelevancy?
Examining the lowest perplexity reviews may illuminate the degree to which low perplexity signals irrelevant content.

In [None]:
analyzer.select(
    n=10,
    sort_by="pa_perplexity",
    ascending=True,
    cols=["id", "app_name", "pa_perplexity", "content"],
)

This preliminary examination of the 10 observations with the lowest perplexity values (ranging from approximately 1.05 to 1.23) reveals a pattern of highly repetitive or symbol-dominated content, such as:

1. **Excessive Symbols and Emojis**: Examples like "👏👏👏..." and "👌👌👌..." illustrate content that primarily consists of repeated emojis or symbols, contributing to their predictability and low complexity scores.
2. **Uniform Text Fragments**: Entries such as "Trash Trash Trash..." highlight a simple repetitive structure, again leading to lower perplexity values due to the model's ability to easily anticipate the sequence.
3. **Emoji Blocks and Repeated Symbols**: Reviews full of emojis ("🔥🔥🔥...") or non-standard characters ("𓂺𓂺𓂺...") also exhibit low perplexity, reflecting predictable patterns.

These data suggest, **but do not conclusively prove**, that low perplexity may indeed serve as an indicator of noise or less meaningful content. 
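
One way to see why repetition lowers perplexity is a toy model far simpler than DistilGPT2. The sketch below is not the model used in this analysis: it is a hypothetical adaptive unigram model with add-one smoothing that predicts each token only from counts of the tokens already seen, with an assumed vocabulary size of 1,000 and whitespace tokenization.

```python
import math
from collections import Counter


def adaptive_unigram_perplexity(tokens, vocab_size=1000):
    """Perplexity under a toy model that predicts each token from earlier counts."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter()
    nll = 0.0
    for seen, token in enumerate(tokens):
        # Add-one smoothing: unseen tokens still get a small probability.
        prob = (counts[token] + 1) / (seen + vocab_size)
        nll -= math.log(prob)
        counts[token] += 1
    return math.exp(nll / len(tokens))


repetitive = ["Trash"] * 12
varied = "The app crashes every time I open the camera after the update".split()

repetitive_ppl = adaptive_unigram_perplexity(repetitive)
varied_ppl = adaptive_unigram_perplexity(varied)
print(f"repetitive: {repetitive_ppl:.1f}  varied: {varied_ppl:.1f}")
```

Each repetition makes the next occurrence more probable, so the repetitive sequence scores lower than the varied sentence of the same length, and a longer run of the same token lowers the score further. A transformer picks up such patterns from context far more quickly, which is consistent with the near-1 values observed above.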

### Perplexity Threshold Analysis
These observations raise an important question: **At what point does perplexity's ability to indicate noise diminish?** To examine this, we analyze reviews and their associated perplexity values at various percentile thresholds, ranging from the 0.1 percentile to just under the 3.0 percentile in steps of 0.1.

In [None]:
percentiles = np.arange(0.1, 3, 0.1)
analyzer.max_perplexity_by_percentile(
    percentiles=percentiles, cols=["percentile", "content"]
)


The analysis of various levels of perplexity reveals distinct patterns in the types of content that each threshold captures, shedding light on the potential utility of perplexity as a signal for noise and irrelevancy in text data. Here's what the findings suggest:

1. **Extremely Low Perplexity (percentiles 0.1 to 0.3)**:
   - Content in this range is dominated by sequences of repetitive emojis or simple, highly redundant patterns, such as repeated applause or thumbs-up emojis.
   - These observations suggest that extremely low perplexity values are indicative of non-linguistic content or sequences that offer little informational complexity.

2. **Low Perplexity (percentiles 0.4 to 0.7)**:
   - Content becomes more mixed but still includes a significant presence of repetitive or predictable text. For instance, some posts contain emotional expressions with heart emojis, while others feature straightforward, positive reviews in foreign languages.
   - The presence of foreign language content, though coherent, demonstrates that perplexity may be sensitive to linguistic variety that was not well-represented in the training data.

3. **Moderate Perplexity (percentiles 0.8 to 1.0)**:
   - Reviews in this range exhibit more complexity and structure, with longer, narrative-style content and some use of slang or colloquial language. There are song lyrics and stylized, informal writing that adds some linguistic variability.
   - While these texts are coherent, they may still contain irrelevant or non-substantive content (e.g., song lyrics) that adds complexity but not necessarily valuable information.

4. **Perplexity Above the 1.0 Percentile**:
   - Content in this range starts to include reviews with clear and substantive feedback, coherent expressions of opinions, or narratives that offer more informative insights into user experiences.
   - However, as the threshold increases beyond the 2.0 percentile, the content tends to include longer and more detailed complaints, requests for help, or descriptions of specific app issues.

#### Potential Threshold for Data Cleaning
Based on the observations, a **threshold around the 0.3 to 0.5 percentile** of perplexity might be suitable for filtering out the most egregiously redundant or non-informative content. However, the utility of perplexity as a cleaning mechanism is not foolproof. While it appears effective at capturing non-linguistic noise and repetitive text, there are edge cases (e.g., foreign language content or stylized writing) where its predictive value diminishes.
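
A percentile-based threshold of this kind could be applied as sketched below. This is a dependency-free illustration under stated assumptions, not the project's cleaning code: the helpers `percentile_cutoff` and `flag_low_perplexity` are hypothetical, percentiles are expressed in percent, the nearest-rank method is one of several percentile definitions, and the sample records are invented. Only the column name `pa_perplexity` comes from the dataset.

```python
import math


def percentile_cutoff(values, percentile):
    """Nearest-rank percentile of `values`; `percentile` is in percent (0 < p <= 100)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def flag_low_perplexity(reviews, percentile=0.5, key="pa_perplexity"):
    """Return copies of `reviews` with a boolean `low_perplexity` flag added."""
    cutoff = percentile_cutoff([r[key] for r in reviews], percentile)
    return [{**r, "low_perplexity": r[key] <= cutoff} for r in reviews]


# Invented records; a 40% cutoff is used only because the sample is tiny.
sample = [
    {"id": "a", "pa_perplexity": 1.05},
    {"id": "b", "pa_perplexity": 1.23},
    {"id": "c", "pa_perplexity": 35.67},
    {"id": "d", "pa_perplexity": 77.50},
    {"id": "e", "pa_perplexity": 155.31},
]
flagged = flag_low_perplexity(sample, percentile=40)
kept = [r["id"] for r in flagged if not r["low_perplexity"]]
print(kept)
```

Flagging rather than dropping keeps the decision reversible, so edge cases such as foreign-language reviews can be inspected before anything is removed.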

Ultimately, our observations suggest a useful, though preliminary, heuristic: **low perplexity may be a helpful indicator of repetitive or irrelevant text**, but its effectiveness is likely to improve when used alongside complementary metrics in a holistic data quality assessment framework.

In the next section, we model sentiments within the dataset, providing an overall sense of class balance and representativeness of the dataset. 