# Gibberish Detection 

This notebook uses the **AutoNLP Gibberish Detector** model, `madhurjindal/autonlp-Gibberish-Detector-492513457`, to classify user input as gibberish or non-gibberish for **Data Quality Assessment (DQA)** and **Exploratory Data Analysis (EDA)**.

## Model Overview

- **Model Name**: `madhurjindal/autonlp-Gibberish-Detector-492513457`
- **Task**: Multi-class Classification
- **Language**: English
- **Number of Classes**: 4 categories:
    1. **Noise**: Gibberish at the zero level, where even the individual constituents of the input phrase (words) hold no meaning independently.
       For example: dfdfer fgerfow2e0d qsqskdsd djksdnfkff swq.
    2. **Word Salad**: Gibberish at level 1, where words make sense independently, but the phrase as a whole conveys no meaning.
       For example: 22 madhur old punjab pickle chennai
    3. **Mild gibberish**: Gibberish at level 2, where part of the sentence has grammatical errors, word sense errors, or syntactical abnormalities that keep the sentence from having a coherent meaning.
       For example: Madhur study in a teacher
    4. **Clean**: A set of words that forms a complete and meaningful sentence on its own.
       For example: I love thir ABSA.



## Workflow Outline

1. **Loading and Preprocessing Data**:

   - Import the necessary libraries and load the dataset.

   - Perform any required preprocessing, such as cleaning text data and handling missing values.

2. **Model Setup**:

   - Load the `madhurjindal/autonlp-Gibberish-Detector-492513457` model from Hugging Face.

   - Configure the model for gibberish detection.

3. **Gibberish Detection**:

   - Use the model to predict the gibberish category for each text entry in the dataset.

   - Classify each text into one of the four categories: Noise, Word Salad, Mild Gibberish, or Clean.

   - Set a binary indicator to True if the predicted category (the category with the highest probability) is not Clean.
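
The binary indicator in step 3 can be sketched in plain Python. This is a minimal illustration, not the notebook's own implementation: the function name `flag_gibberish` is hypothetical, the label order is an assumption based on what `model.config.id2label` reports for this model later in the notebook, and the probability vectors are made up.

```python
# Assumed label order (clean, mild gibberish, noise, word salad); verify it
# against model.config.id2label before relying on it.
LABELS = ["clean", "mild gibberish", "noise", "word salad"]


def flag_gibberish(probabilities, labels=LABELS):
    """Return (predicted_label, is_gibberish) for one probability vector."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    # The predicted category is the one with the highest probability.
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    predicted = labels[best_index]
    # The indicator is True for every category except Clean.
    return predicted, predicted != "clean"


# Hypothetical probability vectors, for illustration only.
print(flag_gibberish([0.90, 0.05, 0.02, 0.03]))
print(flag_gibberish([0.10, 0.15, 0.70, 0.05]))
```

Under this rule, a text is flagged whenever Clean is not the top category, even if Clean is a close second. A probability threshold on the Clean score would be a stricter alternative.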


## Installations

In [1]:
!pip install tqdm -q

## Imports

In [2]:
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from tqdm import tqdm

# Register `tqdm` with pandas
tqdm.pandas()

## Clear Memory

In [3]:
torch.cuda.empty_cache()

## Check GPU Availability

In [4]:
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("GPU count:", torch.cuda.device_count())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
!nvidia-smi

PyTorch version: 2.5.0+cu124
CUDA available: True
CUDA version: 12.4
GPU count: 1
Thu Nov 14 19:20:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.01             Driver Version: 537.70       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Quadro P400                    On  | 00000000:03:00.0  On |                  N/A |
| 44%   58C    P0              N/A /  N/A |   1227MiB /  2048MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------

## Load Model and Tokenizer

In [5]:
# Load model and tokenizer
model_name = "madhurjindal/autonlp-Gibberish-Detector-492513457"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)
# Truncation and padding are applied per call in `predict_gibberish`, not at load time
tokenizer = AutoTokenizer.from_pretrained(model_name)
labels = model.config.id2label
formatted_labels = [
    "tqd_" + label.lower().replace(" ", "_") for label in labels.values()
]
formatted_labels

['tqd_clean', 'tqd_mild_gibberish', 'tqd_noise', 'tqd_word_salad']

## Create Classifier

In [6]:
# Function to predict gibberish
def predict_gibberish(text):
    with torch.no_grad():
        inputs = tokenizer(
            text.lower(),
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=512,
        )
        inputs = {
            key: value.to(device) for key, value in inputs.items()
        }  # Move inputs to the same device as the model (GPU if available, else CPU)
        outputs = model(**inputs)

        probabilities = F.softmax(outputs.logits, dim=-1)
    return probabilities[0].cpu().tolist()

## Perform Inference

In [7]:
predict_gibberish(
    text="i don't know what i i don't know what i i don't know what i i don't know what i "
)

[0.08404846489429474,
 0.09230869263410568,
 0.011694173328578472,
 0.8119486570358276]
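
The raw list is easier to read once each probability is paired with its label. The sketch below copies the probabilities from the output above and the label names from the `formatted_labels` output earlier in the notebook. It assumes the probability positions follow the same order as those labels, which is worth confirming against `model.config.id2label`.

```python
# Probabilities copied from the inference output above.
probabilities = [
    0.08404846489429474,
    0.09230869263410568,
    0.011694173328578472,
    0.8119486570358276,
]
# Label names copied from the `formatted_labels` output; the order is assumed
# to match the positions of the probabilities.
formatted_labels = ["tqd_clean", "tqd_mild_gibberish", "tqd_noise", "tqd_word_salad"]

# Pair each label with its probability and pick the highest-scoring one.
scores = dict(zip(formatted_labels, probabilities))
predicted_label = max(scores, key=scores.get)

print(scores)
print(predicted_label)  # tqd_word_salad
```

Under that assumption, the repeated phrase is classified as word salad with a probability of about 0.81, so the binary gibberish indicator described in the workflow would be True for this input.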