## Explained PII Inference
### Introduction:
Welcome to this Jupyter notebook developed for The Learning Agency Lab - PII Data Detection competition! It is designed to help you participate in the competition and develop automated techniques to detect and remove PII from educational data.



### Inspiration and Credits 🙌
This notebook is inspired by the work of Aleksandr Lavrikov, available at [this Kaggle project](https://www.kaggle.com/code/lavrikovav/0-968-to-onnx-30-200-speedup-pii-inference). I extend my gratitude to Aleksandr Lavrikov for sharing their insights and code publicly.


### How to Use This Notebook:
1. **Setup Environment**:
   - Ensure all required libraries are installed. (Refer to the import libraries cell in the notebook for details.)
   
2. **Data Preparation**:
   - Prepare your training and test datasets in JSON format. Specify the paths to these datasets in the `config` class of the notebook.
   - Optionally, downsample the training data for faster processing by setting the fraction to keep in the `downsample` variable of the `config` class.
   
3. **Training and Evaluation**:
   - Train the model using the provided training dataset by running the appropriate cells in the notebook. Adjust hyperparameters if necessary.
   - Evaluate the trained model's performance on the test dataset to assess its effectiveness in detecting PII.

4. **Inference**:
   - Use the trained model to perform inference on new data. The notebook provides functionalities to tokenize input data, predict PII labels, and extract PII entities.

5. **Export Results**:
   - Export the processed predictions, including identified PII entities such as phone numbers, email addresses, URLs, etc., to a CSV file for further analysis or usage.

**🌟 Explore my profile and other public projects, and don't forget to share your feedback!**

## 👉 [Visit my Profile](https://www.kaggle.com/code/zulqarnainalipk) 👈

## How to Use 🛠️
To use this notebook effectively, please follow these steps:
1. Ensure you have the competition data and environment set up.
2. Execute each cell sequentially to perform data preparation, feature engineering, model training, and prediction submission.
3. Customize and adapt the code as needed to improve model performance or experiment with different approaches.

## Acknowledgments 🙏
I acknowledge The Learning Agency Lab organizers for providing the dataset and the competition platform.

Let's get started! Feel free to reach out if you have any questions or need assistance along the way.
👉 [Visit my Profile](https://www.kaggle.com/zulqarnainalipk) 👈


# 📁 Install & Import libraries


In [25]:
import os                      # 📁 Importing operating system module for file and directory operations

import subprocess

# Suppress warnings and errors by redirecting output to /dev/null
subprocess.run(["pip", "install", "-q", "/kaggle/input/onyx-runtime-gpu-whl/onnxruntime_gpu-1.17.1-cp310-cp310-manylinux_2_28_x86_64.whl", "--force-reinstall", "--no-index", "--find-links=/kaggle/input/onyx-runtime-gpu-whl"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.run(["pip", "install", "-q", "/kaggle/input/onyx-runtime-gpu-whl/onnx-1.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", "--no-index", "--find-links=/kaggle/input/onyx-runtime-gpu-whl"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.run(["pip", "install", "-q", "/kaggle/input/onyx-runtime-gpu-whl/onnxconverter_common-1.14.0-py2.py3-none-any.whl", "--no-index", "--find-links=/kaggle/input/onyx-runtime-gpu-whl"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

import gc                      # 🗑️ Importing garbage collection module for memory management
from tqdm.auto import tqdm    # 🔄 Importing tqdm for progress bars
import json                    # 📝 Importing JSON module for JSON manipulation
import numpy as np             # 🔢 Importing NumPy for numerical operations
import pandas as pd            # 📊 Importing Pandas for data manipulation
from itertools import chain   # 🔗 Importing itertools for iterating over data structures
from text_unidecode import unidecode  # 📝 Importing text_unidecode for text normalization
from typing import Dict, List, Tuple   # ✍️ Importing typing for type hinting
import codecs                  # 🧾 Importing codecs for file encoding and decoding
from datasets import Dataset, load_from_disk   # 📦 Importing datasets module for working with datasets
from sklearn.metrics import log_loss   # 📏 Importing log_loss from sklearn.metrics
import torch                   # 🔥 Importing PyTorch for deep learning
import torch.nn as nn          # 🧠 Importing nn module from PyTorch for neural network layers
import torch.nn.functional as F   # ➕ Importing F module from PyTorch for functional operations
from torch.utils.data import DataLoader   # 📦 Importing DataLoader from PyTorch for loading data
import pickle                  # 🥒 Importing pickle for object serialization
import re                      # 🔍 Importing re for regular expressions
from transformers import TrainingArguments, AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification   # 🤖 Importing transformers for working with pre-trained models
from scipy.special import softmax   # 📊 Importing softmax from scipy.special for softmax calculation
from spacy.lang.en import English   # 🌐 Importing English language model from spaCy for NLP

# Toggle to use training set folds for model
debug_on_train_df = False   # 🔍 Setting debug_on_train_df to False to toggle using training set folds

# Enable to convert models for inference on-the-fly.
convert_before_inference = False   # ⚙️ Setting convert_before_inference to False to toggle model conversion for inference on-the-fly

# Temporary directory for saving datasets and intermediate files.
temp_data_folder = "/tmp/output/"   # 📁 Setting temp_data_folder to "/tmp/output/" for storing temporary data


# 🔧 Configuration 



The `config` class holds the settings and paths required by the Personally Identifiable Information (PII) detection pipeline.

🔍 **Explanation:**

- **`device`**: Specifies the device where the model will be executed. Here it's set to `'cpu'`.
- **`seed`**: Sets the random seed for reproducibility of results.
- **`train_dataset_path`**: Path to the training dataset in JSON format.
- **`test_dataset_path`**: Path to the test dataset in JSON format.
- **`sample_submission_path`**: Path to the sample submission CSV file.
- **`save_dir`**: Directory path for saving intermediate files.
- **`downsample`**: Fraction of PII-free samples to keep when downsampling the training folds (`0.45` keeps 45% of them; samples that contain PII are always kept).
- **`truncation`**: Whether to truncate sequences during tokenization.
- **`padding`**: Specifies the padding strategy during tokenization.
- **`max_length`**: Maximum sequence length after tokenization.
- **`doc_stride`**: The stride length for splitting long documents into shorter chunks.
- **`target_cols`**: List of target columns for the model.
- **`load_from_disk`**: Flag indicating whether to load data from disk.
- **`learning_rate`**: Learning rate for training the model.
- **`batch_size`**: Batch size for training.
- **`epochs`**: Number of training epochs.
- **`NFOLDS`**: List of folds for cross-validation.
- **`trn_fold`**: Index of the training fold.
- **`model_paths`**: Dictionary mapping the path of each pretrained model to the weight assigned to that model.
- **`converted_path`**: Path to the directory containing converted model files.

📚 **Study Sources:**

1. PyTorch Documentation: [https://pytorch.org/docs/stable/index.html](https://pytorch.org/docs/stable/index.html)
2. Transformers Documentation: [https://huggingface.co/transformers/index.html](https://huggingface.co/transformers/index.html)


In [26]:
class config:
    # Specifies the device for running the model, which is set to 'cpu' indicating CPU.
    device = 'cpu' 
    # Sets the random seed for reproducibility.
    seed = 69
    # Path to the training dataset in JSON format.
    train_dataset_path = "/kaggle/input/pii-detection-removal-from-educational-data/train.json"
    # Path to the test dataset in JSON format.
    test_dataset_path = "/kaggle/input/pii-detection-removal-from-educational-data/test.json"
    # Path to the sample submission CSV file.
    sample_submission_path = "/kaggle/input/pii-detection-removal-from-educational-data/sample_submission.csv"
       
    # Directory path for saving intermediate files, built from temp_data_folder defined in the imports cell.
    save_dir = temp_data_folder + "1/"

    # Fraction of PII-free samples to keep when downsampling the training folds.
    downsample = 0.45
    # Whether to truncate sequences during tokenization.
    truncation = True 
    # Specifies the padding strategy during tokenization, which is set to False indicating no padding.
    padding = False #'max_length'
    # Maximum sequence length after tokenization.
    max_length = 3574
    # The stride length for splitting long documents into shorter chunks.
    doc_stride = 512
    
    # List of target columns for the model.
    target_cols = ['B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 
    'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 
    'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL','O']

    # Flag indicating whether to load data from disk.
    load_from_disk = None

    # Learning rate for training the model.
    learning_rate = 1e-5
    # Batch size for training.
    batch_size = 1
    # Number of training epochs.
    epochs = 4
    # List of folds for cross-validation.
    NFOLDS = [0]
    # Index of the training fold.
    trn_fold = 0

    # Dictionary mapping each pretrained model path to the weight assigned to that model.
    model_paths = {
    '/kaggle/input/37vp4pjt': 10/10,
    '/kaggle/input/pii-deberta-models/cuerpo-de-piiranha': 2/10,
    '/kaggle/input/pii-deberta-models/cola del piinguuino' : 1/10,
    '/kaggle/input/pii-deberta-models/cabeza-del-piinguuino': 5/10,
    '/kaggle/input/pii-deberta-models/cabeza-de-piiranha': 3/10,
    '/kaggle/input/pii-deberta-models/cola-de-piiranha':1/10,
    '/kaggle/input/pii-models/piidd-org-sakura': 2/10,
    '/kaggle/input/pii-deberta-models/cabeza-de-piiranha-persuade_v0':1/10,
    }
    # Path to the directory containing converted model files.
    converted_path = '/kaggle/input/toonnx2-converted-models'
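
The values in `model_paths` look like relative weights for combining models. As a minimal sketch, assuming they are later used for a weighted average of per-model probabilities (that step is not shown in this part of the notebook), they could be normalized so that they sum to 1. The paths and the `normalize_weights` helper below are hypothetical.

```python
def normalize_weights(model_weights: dict[str, float]) -> dict[str, float]:
    """Scale relative weights so that they sum to 1."""
    total = sum(model_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {path: w / total for path, w in model_weights.items()}

# Hypothetical paths, written in the same fraction style as config.model_paths
example_paths = {"/models/a": 10 / 10, "/models/b": 5 / 10, "/models/c": 5 / 10}
normalized = normalize_weights(example_paths)
print(normalized)
```

Normalizing keeps the averaged probabilities on the same scale no matter how many models are listed.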


🔍 **Explanation:**

- Checks whether the directory specified by `config.save_dir` exists or not.
- If the directory does not exist, it creates the directory along with any necessary parent directories using `os.makedirs()`.
- This ensures that the directory is available for saving intermediate files or any other purposes specified by the configuration.

📚 **Study Sources:**

1. Python Documentation - `os.makedirs()`: [https://docs.python.org/3/library/os.html#os.makedirs](https://docs.python.org/3/library/os.html#os.makedirs)
2. Real Python - Understanding `os.makedirs()`: [https://realpython.com/python-os-mkdir/#creating-a-directory-tree](https://realpython.com/python-os-mkdir/#creating-a-directory-tree)

In [27]:
# Check if the directory specified by 'save_dir' in the configuration exists.
if not os.path.exists(config.save_dir):
    # If the directory doesn't exist, create it along with any necessary parent directories.
    os.makedirs(config.save_dir)
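
The same result can be had in a single call with the `exist_ok` argument of `os.makedirs()`. The sketch below uses a throwaway temporary directory as a stand-in for `config.save_dir`, so the names `base_dir` and `save_dir` are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical directory standing in for config.save_dir
base_dir = tempfile.mkdtemp()
save_dir = os.path.join(base_dir, "output", "1")

# exist_ok=True creates missing parents and does nothing if the directory already exists
os.makedirs(save_dir, exist_ok=True)
os.makedirs(save_dir, exist_ok=True)  # second call is a no-op, not an error

print(os.path.isdir(save_dir))
```

This form also avoids the small race between checking for the directory and creating it.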



# Natural Language Processing Setup 🧠
🔍 **Explanation:**

- **`nlp = English()`**: Creates a blank English spaCy pipeline, which provides the rule-based tokenizer without loading any trained model components.
- **`INFERENCE_MAX_LENGTH = 3500`**: Defines the maximum length for inference sequences. This is used to limit the length of processed text during inference.
- **`threshold = 0.99`**: Sets the threshold for confidence score in detecting PII entities. Any entity prediction with a confidence score above this threshold is considered significant.
- **Regular Expressions**:
  - **`email_regex`**: Matches email addresses in text.
  - **`phone_num_regex`**: Matches phone numbers written as `(123)456-7890` or `123.456.7890` and followed by whitespace.
  - **`url_regex`**: Matches URLs in text.
  - **`street_regex`**: Matches street addresses in text.

📚 **Study Sources:**

1. SpaCy Documentation: [https://spacy.io/](https://spacy.io/)
2. Regular Expressions in Python: [https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)

In [28]:
# Initializing the English language model for natural language processing.
nlp = English()

# Maximum length for inference sequences.
INFERENCE_MAX_LENGTH = 3500

# Threshold for confidence score in detecting PII entities.
threshold = 0.99

# Regular expressions for detecting various types of PII.
email_regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')  # Email addresses
phone_num_regex = re.compile(r"(\(\d{3}\)\d{3}\-\d{4}\w*|\d{3}\.\d{3}\.\d{4})\s")  # Phone numbers
url_regex = re.compile(
    r'http[s]?://'  # http or https
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)', re.IGNORECASE)  # URLs
street_regex = re.compile(r'\d{1,4} [\w\s]{1,20}(?:street|apt|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)', re.IGNORECASE)  # Street addresses
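
To get a feel for what these patterns catch, here is a small sketch that runs three of them on made-up sentences. The patterns are copied from the cell above; the sample strings and variable names are invented for illustration.

```python
import re

# Patterns copied from the cell above
email_regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
phone_num_regex = re.compile(r"(\(\d{3}\)\d{3}\-\d{4}\w*|\d{3}\.\d{3}\.\d{4})\s")
street_regex = re.compile(r'\d{1,4} [\w\s]{1,20}(?:street|apt|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)', re.IGNORECASE)

emails = email_regex.findall("Write to jane.doe@example.com for details")
phones = phone_num_regex.findall("Call (555)123-4567 or 555.123.4567 today")
# The phone pattern requires trailing whitespace, so a number at the very end is missed
phones_at_end = phone_num_regex.findall("Call 555.123.4567")
street = street_regex.search("She lives at 123 Main Street near campus")

print(emails)
print(phones)
print(phones_at_end)
print(street.group(0))
```

Note that `findall` returns the contents of the capturing group for the phone pattern, so the trailing whitespace is not part of the returned strings.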


# 🔍🎯 Finding Spans in Document 🕵️‍♂️📜

🔍 **Explanation:**

- **`find_span` Function**:
  - This function takes two arguments: `target`, a list of strings representing the target sequence to be found, and `document`, a list of strings representing the document to search within.
  - Returns a list of lists, where each inner list contains the document indices of every token in one matched occurrence of `target`.
- **Variables**:
  - **`idx`**: Tracks the current index in the `target` sequence.
  - **`spans`**: Stores the list of spans found in the document.
  - **`span`**: Stores the current span being constructed.
- **Iteration**:
  - Iterates through each token in the `document`.
- **Matching**:
  - If the current token matches the current target token, it appends the index to the `span` list.
- **Completion**:
  - When the entire target sequence is found, the span is appended to `spans` and the process continues to search for the next occurrence.
- **Return**:
  - Returns the list of spans found in the document.

📚 **Study Sources:**

1. Python Documentation - `enumerate()`: [https://docs.python.org/3/library/functions.html#enumerate](https://docs.python.org/3/library/functions.html#enumerate)

In [29]:
def find_span(target: list[str], document: list[str]) -> list[list[int]]:

    # Initialize variables
    idx = 0
    spans = []
    span = []

    # Iterate through the document tokens
    for i, token in enumerate(document):
        # If the current token doesn't match the target start anew
        if token != target[idx]:
            idx = 0
            span = []
            continue
        # If token matches, append its index to the span list
        span.append(i)
        idx += 1
        # If the entire target is found, append the span to the list of spans
        if idx == len(target):
            spans.append(span)
            # Reset span and idx for next potential match
            span = []
            idx = 0
            continue
    
    return spans
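
One edge case worth knowing: after a mismatch, the loop above resumes at the next token, so an occurrence that begins on the mismatching token itself can be missed (for example, target `["a", "b"]` in `["a", "a", "b"]`). The sliding-window variant below handles that case. It is a sketch rather than the notebook's method, and `find_span_sliding` is a hypothetical name.

```python
def find_span_sliding(target: list[str], document: list[str]) -> list[list[int]]:
    """Return the token indices of each non-overlapping occurrence of target."""
    n = len(target)
    spans = []
    i = 0
    while n > 0 and i + n <= len(document):
        if document[i:i + n] == target:
            spans.append(list(range(i, i + n)))
            i += n  # skip past the match so occurrences do not overlap
        else:
            i += 1
    return spans

repeated_prefix = find_span_sliding(["a", "b"], ["a", "a", "b"])
two_names = find_span_sliding(["John", "Doe"], ["Hi", "John", "Doe", "and", "John", "Doe"])
print(repeated_prefix)
print(two_names)
```

The window comparison costs more per position than the single-pass loop, which should be acceptable for short targets such as names.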


In [30]:
# Load the training dataset from the JSON file specified in the configuration.
with open(config.train_dataset_path) as f:
    data = json.load(f)

# Load the test dataset from the JSON file specified in the configuration.
with open(config.test_dataset_path) as f:
    test_data = json.load(f)



🔍 **Explanation:**

- **`all_labels`**:
  - Extracts all unique labels from the training data by iterating through each sample and accessing the "labels" key. 
  - The labels are flattened into a single list using `chain(*...)`.
  - The list is converted to a set to remove duplicates and then sorted.
- **`label2id` Dictionary**:
  - Maps each unique label to its corresponding index in a dictionary comprehension.
- **`id2label` Dictionary**:
  - Maps each index to its corresponding label in a dictionary comprehension, providing a reverse mapping.



In [31]:
# Extract all unique labels from the training data and sort them.
all_labels = sorted(list(set(chain(*[x["labels"] for x in data]))))

# Create a dictionary mapping each label to its corresponding index.
label2id = {l: i for i,l in enumerate(all_labels)}

# Create a dictionary mapping each index to its corresponding label.
id2label = {v:k for k,v in label2id.items()}
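
Here is the same mapping logic on a tiny made-up dataset, to show what the two dictionaries end up holding. `toy_data` is an invented stand-in for the real training data.

```python
from itertools import chain

# Toy stand-in for the training data: one dict per document
toy_data = [
    {"labels": ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]},
    {"labels": ["O", "B-EMAIL", "O"]},
]

all_labels = sorted(set(chain.from_iterable(x["labels"] for x in toy_data)))
label2id = {label: i for i, label in enumerate(all_labels)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label)
```

Because the labels are sorted before numbering, the ids are stable across runs as long as the set of labels is the same.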



# 🤖📝 Model Tokenizer Initialization 🤖🔤

🔍 **Explanation:**

- **`first_model_path`**:
  - Retrieves the path of the first model from the dictionary of model paths specified in the configuration.
- **Tokenizer Initialization**:
  - The tokenizer is initialized using the `AutoTokenizer.from_pretrained()` method from the Hugging Face Transformers library.
  - This method automatically selects the appropriate tokenizer based on the provided model path.

📚 **Study Sources:**

1. Hugging Face Transformers Documentation - Tokenizers: [https://huggingface.co/transformers/main_classes/tokenizer.html](https://huggingface.co/transformers/main_classes/tokenizer.html)

In [32]:
# Select the path of the first model from the configuration's model paths.
first_model_path = list(config.model_paths.keys())[0]

# Initialize the tokenizer using the AutoTokenizer class from the Hugging Face Transformers library.
tokenizer = AutoTokenizer.from_pretrained(first_model_path)


In [33]:
# Create a DataFrame for the training data using the loaded training data.
df_train = pd.DataFrame(data)

# Add a new column 'fold' to the training DataFrame, representing the fold number.
df_train['fold'] = df_train['document'] % 4

# Create a DataFrame for the test data using the loaded test data.
df_test = pd.DataFrame(test_data)


# 📉🔢 DataFrame Downsampling Function 🛠️

🔍 **Explanation:**

- **`downsample_df` Function**:
  - Takes a DataFrame `train_df` and a percentage `percent` as input and returns a downsampled DataFrame.
- **`train_df['is_labels']`**:
  - Adds a new column `'is_labels'` to the DataFrame indicating whether labels are present in each sample.
  - Checks if any label in the `'labels'` column is not equal to `'O'`, indicating the presence of labels.
- **Separating Samples**:
  - Samples with labels (`true_samples`) and samples without labels (`false_samples`) are separated based on the value of the `'is_labels'` column.
- **Downsampling False Samples**:
  - The number of false samples to keep after downsampling is calculated based on the specified percentage.
  - That many false samples are then drawn at random, without replacement, using the `sample()` method.
- **Concatenating DataFrames**:
  - The true samples and downsampled false samples are concatenated using `pd.concat()` to create the downsampled DataFrame.
- **Return**:
  - The downsampled DataFrame is returned.

📚 **Study Sources:**

1. pandas Documentation - DataFrame: [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
2. pandas Documentation - `sample()`: [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)

In [34]:
def downsample_df(train_df, percent):
    # Add a new column 'is_labels' to indicate if labels are present in the sample
    train_df['is_labels'] = train_df['labels'].apply(lambda labels: any(label != 'O' for label in labels))
    
    # Separate samples with labels and samples without labels
    true_samples = train_df[train_df['is_labels'] == True]
    false_samples = train_df[train_df['is_labels'] == False]
    
    # Calculate the number of false samples to keep after downsampling
    n_false_samples = int(len(false_samples) * percent)
    
    # Randomly sample false samples to downsample
    downsampled_false_samples = false_samples.sample(n=n_false_samples, random_state=42)
    
    # Concatenate true samples and downsampled false samples to create the downsampled DataFrame
    downsampled_df = pd.concat([true_samples, downsampled_false_samples])
    
    return downsampled_df
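
To see the effect on class balance, the sketch below repeats the same idea on plain Python lists instead of a DataFrame. It is only an illustration: `downsample_negatives` and the toy samples are hypothetical, and `random.Random` will not pick the same rows as pandas `sample()`.

```python
import random

def downsample_negatives(samples: list[dict], percent: float, seed: int = 42) -> list[dict]:
    """Keep all samples with a non-'O' label and a random fraction of the rest."""
    positives = [s for s in samples if any(label != "O" for label in s["labels"])]
    negatives = [s for s in samples if all(label == "O" for label in s["labels"])]
    n_keep = int(len(negatives) * percent)
    kept = random.Random(seed).sample(negatives, n_keep)
    return positives + kept

# 3 samples containing PII and 10 samples without any
toy = [{"labels": ["O", "B-EMAIL"]}] * 3 + [{"labels": ["O", "O"]}] * 10
result = downsample_negatives(toy, 0.45)
print(len(result))
```

With `percent = 0.45`, all 3 positive samples survive and `int(10 * 0.45) = 4` of the negatives are kept, so the share of PII-bearing samples rises from 3/13 to 3/7.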



# 🔠 Tokenization Function for DataFrame📄


🔍 **Explanation:**

- **`tokenize_row` Function**:
  - Function takes a single row (`example`) from a DataFrame as input and tokenizes its text using the provided tokenizer.
- **Initialization**:
  - Initializes empty lists `text` and `token_map` to store the pieces of the reconstructed text and the character-to-token map, respectively.
- **Tokenization Process**:
  - Iterates through each token (`t`) and trailing whitespace (`ws`) in the example.
  - For each token, it appends the token to the `text` list and extends the `token_map` with the index of the token repeated by its length.
  - If trailing whitespace is present, it appends a space to the `text` and -1 to the `token_map`.
- **Tokenization with Transformers Tokenizer**:
  - Tokenizes the concatenated text using the provided tokenizer (`tokenizer`), considering specified configurations like truncation and maximum length.
- **Return**:
  - Returns a dictionary containing tokenized inputs (`input_ids`), attention mask (`attention_mask`), offset mappings (`offset_mapping`), and token map (`token_map`).

📚 **Study Sources:**

1. Hugging Face Transformers Documentation - Tokenizers: [https://huggingface.co/transformers/main_classes/tokenizer.html](https://huggingface.co/transformers/main_classes/tokenizer.html)
2. Python Documentation - `zip()`: [https://docs.python.org/3/library/functions.html#zip](https://docs.python.org/3/library/functions.html#zip)

In [35]:
def tokenize_row(example):
    # Initialize empty lists to store tokenized text and token map
    text = []
    token_map = []
    
    idx = 0
    
    # Iterate through tokens and trailing whitespaces in the example
    for t, ws in zip(example["tokens"], example["trailing_whitespace"]):
        # Append token to the text list
        text.append(t)
        # Extend token map with index of token repeated by its length
        token_map.extend([idx]*len(t))
        # If trailing whitespace is present, append space to text and -1 to token map
        if ws:
            text.append(" ")
            token_map.append(-1)
            
        idx += 1
        
    # Tokenize the concatenated text using the tokenizer with specified configurations
    tokenized = tokenizer("".join(text), return_offsets_mapping=True, truncation=config.truncation, max_length=config.max_length)
    
    # Return dictionary containing tokenized inputs, attention mask, offset mappings, and token map
    return {
        "input_ids": tokenized.input_ids,
        "attention_mask": tokenized.attention_mask,
        "offset_mapping": tokenized.offset_mapping,
        "token_map": token_map,
    }
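
The `token_map` part is easiest to understand on a tiny example. The sketch below repeats only the loop from `tokenize_row`, without the Hugging Face tokenizer, on three made-up tokens.

```python
tokens = ["Hi", ",", "Bob"]
trailing_whitespace = [False, True, False]

text, token_map = [], []
for idx, (t, ws) in enumerate(zip(tokens, trailing_whitespace)):
    text.append(t)
    token_map.extend([idx] * len(t))  # one entry per character of the token
    if ws:
        text.append(" ")
        token_map.append(-1)  # whitespace belongs to no token

full_text = "".join(text)
bob_token = token_map[full_text.index("Bob")]

print(full_text)
print(token_map)
print(bob_token)
```

Each position in `token_map` corresponds to one character of the joined text, so a character offset from `offset_mapping` can be translated back to the index of the original token.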



🔍 **Explanation:**

- **Debugging Enabled**:
  - If debugging is enabled (`debug_on_train_df` is True), the code processes the training DataFrame for each fold.
  - It subsets the DataFrame based on the fold, performs downsampling if configured, tokenizes rows, and saves the dataset and DataFrame to disk.
- **Debugging Disabled**:
  - If debugging is disabled, the code processes the test DataFrame.
  - It tokenizes rows and saves the test dataset to disk.
- **Data Loading from Disk**:
  - The code checks if data loading from disk is disabled (`config.load_from_disk is None`).
- **Tokenization and Saving**:
  - The DataFrame is converted to a Hugging Face Dataset, tokenized using the `tokenize_row` function, and saved to disk.



In [36]:
if debug_on_train_df:

    # Check if data loading from disk is disabled
    if config.load_from_disk is None:
        
        # Add a new column 'fold' to the training DataFrame to indicate fold number
        df_train['fold'] = df_train['document'] % 4
        df_train.head(3)
        
        # Loop through different folds
        for i in range(-1, 4):    
            # Subset the training DataFrame for the current fold
            train_df = df_train[df_train['fold']==i].reset_index(drop=True)

            # Set valid_stride flag based on current fold
            if i==config.trn_fold:
                config.valid_stride = True
            if i!=config.trn_fold and config.downsample > 0:
                train_df = downsample_df(train_df, config.downsample)
                config.valid_stride = False

            print(len(train_df))
            
            # Convert DataFrame to Hugging Face Dataset and tokenize rows
            ds = Dataset.from_pandas(train_df)
            ds = ds.map(
              tokenize_row,
              batched=False,
              num_proc=2,
              desc="Tokenizing",
            )

            # Save the dataset and DataFrame to disk
            ds.save_to_disk(f"{config.save_dir}fold_{i}.dataset")
            with open(f"{config.save_dir}_pkl", "wb") as fp:
                pickle.dump(train_df, fp)
            print("Saving dataset to disk:", config.save_dir)
        
# If debugging is not enabled, process the test DataFrame
else:
    
    # Check if data loading from disk is disabled
    if config.load_from_disk is None:

        # Set valid_stride flag for test data
        config.valid_stride = True
        print(len(df_test))

        # Convert test DataFrame to Hugging Face Dataset and tokenize rows
        ds = Dataset.from_pandas(df_test)
        ds = ds.map(
          tokenize_row,
          batched=False,
          num_proc=2,
          desc="Tokenizing",
        )

        # Save the test dataset to disk
        ds.save_to_disk(f"{config.save_dir}test.dataset")
        print("Saving dataset to disk:", config.save_dir)


10



Saving dataset to disk: /tmp/output/1/


In [37]:
ds[0].keys()

dict_keys(['document', 'full_text', 'tokens', 'trailing_whitespace', 'input_ids', 'attention_mask', 'offset_mapping', 'token_map'])

# Prediction Processing Function

🔍 **Explanation:**

- **`process_predictions` Function**:
  - Takes a list of flattened predictions (`flattened_preds`) as input and applies softmax to each set of predictions.
- **Initialization**:
  - Initializes an empty list `predictions_softmax_all` to hold softmax-applied predictions for all sets.
- **Processing Predictions**:
  - Iterates over each set of predictions in `flattened_preds`.
  - For each set of predictions, it applies softmax along the last dimension (dimension `-1`) using `torch.softmax()` to convert logits to probabilities.
  - The softmax-applied predictions are then appended to the `predictions_softmax_all` list.
- **Return**:
  - Returns the list of predictions with softmax applied.

📚 **Study Sources:**

1. PyTorch Documentation - `torch.softmax()`: [https://pytorch.org/docs/stable/generated/torch.softmax.html](https://pytorch.org/docs/stable/generated/torch.softmax.html)

In [38]:
def process_predictions(flattened_preds):
    # Initialize a list to hold softmax-applied predictions
    predictions_softmax_all = []

    # Iterate over each set of predictions in the input
    for predictions in flattened_preds:
        # Apply softmax to convert logits to probabilities
        predictions_softmax = torch.softmax(predictions, dim=-1)
        # Append the softmax predictions to the result list
        predictions_softmax_all.append(predictions_softmax)

    # Return the list of predictions with softmax applied
    return predictions_softmax_all
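
For intuition, this is the computation `torch.softmax` performs on each row of logits, written as a plain-Python sketch. The `softmax` helper and the example logits are made up for illustration; the notebook itself uses the PyTorch function.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Softmax over one row of logits, shifted by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

The outputs are positive, sum to 1, and preserve the ordering of the logits, which is what lets them be compared against a probability threshold.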


# 🚀Prediction and Conversion Functions 


🔍 **Explanation:**

- **Prediction and Conversion Functions**:
  - Functions to export a model to ONNX format, convert it to mixed precision, and run inference with ONNX Runtime.
- **`predict_and_convert` Function**:
  - Exports the PyTorch model to ONNX format, using the first batch from the data loader as the example input, and saves it to the specified path.
- **`predict_and_quant` Function**:
  - Converts the original ONNX model to mixed precision (despite the `quant` in its name, it does not quantize to integers) and saves the converted model to the specified path.
- **`predict` Function**:
  - Performs inference using the provided ONNX model session over all batches from a data loader and returns processed predictions.

📚 **Study Sources:**

1. ONNX Documentation - Python API Overview: [https://onnxruntime.ai/docs/api/python_api_overview.html](https://onnxruntime.ai/docs/api/python_api_overview.html)
2. PyTorch Documentation - `torch.onnx.export()`: [https://pytorch.org/docs/stable/generated/torch.onnx.export.html](https://pytorch.org/docs/stable/generated/torch.onnx.export.html)
3. Hugging Face Transformers Documentation - Model Quantization: [https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.convert_to_onnx](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.convert_to_onnx)

In [39]:

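import torch.onnx
import onnx
import onnxruntime
# Needed by predict_and_quant below
from onnxconverter_common.auto_mixed_precision_model_path import auto_convert_mixed_precision_model_path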

def predict_and_convert(data_loader, model, config, onnx_model_path):
    # Set the model to evaluation mode
    model.eval()

    # Initialize a list to store the prediction outputs
    prediction_outputs = []

    # Create an iterator from the DataLoader
    data_iter = iter(data_loader)

    # Fetch the first batch of data from the iterator
    batch = next(data_iter)

    # Disable gradient calculations for export
    with torch.no_grad():
        # Prepare inputs by reshaping and moving them to the specified device
        inputs = {key: val.reshape(val.shape[0], -1).to(config.device) for key, val in batch.items() if key in ['input_ids', 'attention_mask']}
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']

        # Export the model to ONNX format with the specified configurations
        torch.onnx.export(model,  # Model to be exported
                          args=(input_ids, attention_mask),  # Example model input
                          f=onnx_model_path,  # Path to save the ONNX model
                          opset_version=12,  # ONNX opset version
                          input_names=['input_ids', 'attention_mask'],  # Names of the input parameters
                          output_names=['logits'],  # Names of the output
                          dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence_length'},  # Dynamic axes for batching
                                        'attention_mask': {0: 'batch_size', 1: 'sequence_length'}}
                          )

    print("Model saved to", onnx_model_path)

    return prediction_outputs

def predict_and_quant(data_loader, config, original_onnx_model_path, output_file_name, data_path):
    # Initialize a list to store prediction outputs
    prediction_outputs = []

    # Create an iterator from the DataLoader
    data_iter = iter(data_loader)

    # Fetch the first batch of data from the iterator
    batch = next(data_iter)

    # Disable gradient calculations for efficiency
    with torch.no_grad():
        
        # Prepare inputs by reshaping and moving them to the specified device
        inputs = {key: val.reshape(val.shape[0], -1).to(config.device) for key, val in batch.items() if key in ['input_ids', 'attention_mask']}
        
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']


        # Prepare the sample input feed by moving tensors to CPU and converting to numpy arrays
        input_data = {"input_ids": input_ids.cpu().numpy(), "attention_mask": attention_mask.cpu().numpy()}
            
        # Convert the ONNX model to mixed precision with the specified settings
        auto_convert_mixed_precision_model_path(
            original_onnx_model_path,  # Original ONNX model path
            input_data,  # Sample input feed used to validate the converted model
            output_file_name,  # Output path for the converted model
            provider=['CUDAExecutionProvider'],  # Execution provider, can be changed to CPU if necessary
            location=data_path,  # Path for saving external data tensors
            rtol=2,  # Relative tolerance when comparing outputs with the original model
            atol=20,  # Absolute tolerance when comparing outputs with the original model
            keep_io_types=True,  # Keep the original input/output types
            verbose=True  # Enable verbose output during conversion
        )

        # Append a placeholder value to prediction outputs (currently not used for actual predictions)
        prediction_outputs.append(0)

    print("Model saved to", output_file_name)

    return prediction_outputs

def predict(data_loader, session, config):

    # Initialize a list to collect raw predictions for each batch
    prediction_outputs = []

    # Iterate over all batches of data from the data loader
    for batch in tqdm(data_loader, desc="Predicting"):
        with torch.no_grad():
            # Prepare inputs by reshaping and moving them to the specified device
            inputs = {key: val.reshape(val.shape[0], -1).to(config.device) for key, val in batch.items() if key in ['input_ids', 'attention_mask']}
            
            # Retrieve the names of the input and output nodes from the model session
            input_names = [inp.name for inp in session.get_inputs()]
            output_names = [out.name for out in session.get_outputs()]
            
            # Extract input_ids and attention_mask from inputs
            input_ids = inputs['input_ids']
            attention_mask = inputs['attention_mask']

            # Prepare input data by moving tensors to CPU and converting to numpy arrays
            input_data = {"input_ids": input_ids.cpu().numpy(), "attention_mask": attention_mask.cpu().numpy()}

            # Execute the model
            onnx_outputs = session.run(None, input_data)

            # Append raw model outputs (predictions) to the list
            prediction_outputs.append(torch.tensor(onnx_outputs[0]))  # Assuming the first output is what we need

    # Flatten the list of predictions across all batches
    prediction_outputs = [logit for batch in prediction_outputs for logit in batch]

    # Process the predictions as required (e.g., applying softmax, thresholding)
    processed_predictions = process_predictions(prediction_outputs)

    return processed_predictions
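
The `rtol` and `atol` arguments passed to `auto_convert_mixed_precision_model_path` in `predict_and_quant` bound how far the converted model's outputs may drift from the original ones. The sketch below is only an illustration, assuming the comparison is an `allclose`-style check of the form `|candidate - reference| <= atol + rtol * |reference|`. The helper `within_tolerance` and the sample values are hypothetical and are not part of the notebook or the library.

```python
import numpy as np

def within_tolerance(reference, candidate, rtol, atol):
    """Hypothetical helper: element-wise allclose-style comparison."""
    limit = atol + rtol * np.abs(reference)
    return bool(np.all(np.abs(candidate - reference) <= limit))

# Made-up float32 "logits" and a float16 round trip of the same values
reference = np.array([12.5, -3.25, 0.001], dtype=np.float32)
half_precision = reference.astype(np.float16).astype(np.float32)

# A deliberately large shift, to show how permissive rtol=2, atol=20 is
shifted = reference + 15.0

loose_half = within_tolerance(reference, half_precision, rtol=2, atol=20)
loose_shifted = within_tolerance(reference, shifted, rtol=2, atol=20)
strict_shifted = within_tolerance(reference, shifted, rtol=1e-3, atol=1e-3)

print(loose_half, loose_shifted, strict_shifted)  # True True False
```

Under this assumption, tolerances this large accept almost any conversion, so the settings appear to favor inference speed over strict numerical agreement with the float32 model.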


🔍 **Explanation:**

- **Prediction Processing Function with Thresholding**:
  - Processes the flattened predictions from the model with the specified threshold to determine the final class predictions.
- **Initialization**:
  - Initializes an empty list `preds_final` to store the final predictions.
- **Processing Predictions**:
  - Iterates over each set of predictions (`flattened_preds`).
  - Retrieves the softmax-applied predictions.
  - Determines the argmax prediction across all classes.
  - Determines the best prediction among all classes except 'O' (indices 0 to 11).
  - Retrieves the softmax probabilities for the 'O' class (index 12).
  - Applies the threshold: if the 'O' probability is below `threshold`, the best non-'O' class is chosen; otherwise the overall argmax is kept.
  - Converts final predictions to a numpy array and appends to `preds_final` list.
- **Thresholding**:
  - With the default `threshold` of 0.95, a token keeps the 'O' label only when its 'O' probability is at least 0.95; otherwise the most likely PII class is used.
- **Return**:
  - Returns the list of final predictions after thresholding.

📚 **Study Sources:**
- PyTorch Documentation - `torch.where()`: [https://pytorch.org/docs/stable/generated/torch.where.html](https://pytorch.org/docs/stable/generated/torch.where.html)

In [40]:
def process_predictions_ans(flattened_preds, threshold=0.95):

    preds_final = []  # Initialize a list to store final predictions

    # Iterate over each set of predictions
    for predictions in flattened_preds:
        # softmax was applied to the first dimension before averaging
        predictions_softmax = predictions

        # Get the argmax across all classes
        predictions_argmax = predictions.argmax(-1)

        # Get predictions for all classes except 'O'
        predictions_without_O = predictions_softmax[:, :12].argmax(-1)

        # Get the softmax probabilities for the 'O' class
        O_predictions = predictions_softmax[:, 12]

        # Apply threshold to decide between 'O' class and other classes
        pred_final = torch.where(O_predictions < threshold, predictions_without_O, predictions_argmax)

        # Convert final predictions to numpy array and add to the list
        preds_final.append(pred_final.numpy())

    return preds_final
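
To see the thresholding rule on concrete numbers, here is a minimal sketch with three classes instead of thirteen. The helper `apply_o_threshold` is hypothetical and assumes, as in the function above, that 'O' is the last class; the probabilities are made up.

```python
import torch

def apply_o_threshold(probs, o_index, threshold=0.95):
    """Hypothetical helper mirroring the thresholding step for one document."""
    overall_argmax = probs.argmax(-1)
    # Best class among the columns before the 'O' column (assumes 'O' is last)
    best_non_o = probs[:, :o_index].argmax(-1)
    o_prob = probs[:, o_index]
    return torch.where(o_prob < threshold, best_non_o, overall_argmax)

# Three tokens, three classes: two PII classes (0, 1) and 'O' (2)
probs = torch.tensor([
    [0.01, 0.01, 0.98],  # confident 'O': stays 'O'
    [0.10, 0.05, 0.85],  # 'O' is the argmax but below 0.95: becomes class 0
    [0.20, 0.70, 0.10],  # clear PII class: class 1
])

labels = apply_o_threshold(probs, o_index=2)
print(labels.tolist())  # [2, 0, 1]
```

The second token shows the effect of the threshold: a plain argmax would label it 'O', while the thresholded rule prefers the best PII class, trading some precision for recall.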


# 📥 Loading Tokenized Dataset from Disk 



🔍 **Explanation:**

- **Loading Tokenized Dataset from Disk**:
  - Loads the tokenized dataset from disk, either for test or training purposes, based on the condition.
- **Data Preparation**:
  - Defines the columns to keep (`input_ids`, `attention_mask`) in the dataset.
  - Initializes a data collator for token classification using the `DataCollatorForTokenClassification` class from the `transformers` library.
- **Conditional Loading**:
  - If `debug_on_train_df` is `False`, it loads the test dataset, removes unnecessary columns, updates configuration variables, and creates a DataLoader for the test dataset.
  - If `debug_on_train_df` is `True`, it loads the dataset for the specified fold, performs similar preprocessing, and creates a DataLoader for the dataset.
- **DataLoader Parameters**:
  - `batch_size`: Number of samples per batch during inference.
  - `shuffle`: Whether to shuffle the data.
  - `num_workers`: Number of subprocesses for data loading.
  - `pin_memory`: Whether to pin memory for faster data transfer to GPU.
- **Result**:
  - Produces a DataLoader object (`test_dataloader`) for iterating over the selected dataset (the test set, or the chosen training fold in debug mode).

📚 **Study Sources:**
- Hugging Face Documentation - `DataCollatorForTokenClassification`: [https://huggingface.co/transformers/main_classes/data_collator.html#transformers.DataCollatorForTokenClassification](https://huggingface.co/transformers/main_classes/data_collator.html#transformers.DataCollatorForTokenClassification)
- PyTorch DataLoader Documentation: [https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)

In [41]:
# Columns to keep in the dataset
keep_cols = {"input_ids", "attention_mask"}

# Data collator for token classification
collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=512)

# Load Tokenized Dataset from Disk
if not debug_on_train_df:
    # Load test dataset
    test_ds = load_from_disk(f'{config.save_dir}test.dataset')
    # Remove unnecessary columns from test dataset
    test_ds = test_ds.remove_columns([c for c in test_ds.column_names if c not in keep_cols])
    # Update configuration variables
    config.data_length = len(test_ds)
    config.len_token = len(tokenizer)
    # Create DataLoader for test dataset
    test_dataloader = DataLoader(test_ds, batch_size=config.batch_size, shuffle=False, num_workers=4, pin_memory=False, collate_fn=collator)
else:
    # Fallback for memory tests of several models on fold zero
    fold = config.trn_fold
    # Load dataset for specified fold
    test_ds = load_from_disk(f'{config.save_dir}fold_{fold}.dataset')
    # Remove unnecessary columns
    test_ds = test_ds.remove_columns([c for c in test_ds.column_names if c not in keep_cols])
    # Update configuration variables
    config.data_length = len(test_ds)
    config.len_token = len(tokenizer)
    # Create DataLoader for test dataset
    test_dataloader = DataLoader(test_ds, batch_size=config.batch_size, shuffle=False, num_workers=4, pin_memory=False, collate_fn=collator)
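
To make the effect of `pad_to_multiple_of` concrete, here is a minimal sketch of a collate function that pads every batch to the next multiple of a given length. It is not the Hugging Face implementation: `pad_collate`, the pad id `0`, and the toy samples are assumptions for illustration, and a multiple of 8 is used instead of 512 to keep the shapes readable.

```python
import torch
from torch.utils.data import DataLoader

def pad_collate(batch, multiple=8, pad_id=0):
    """Hypothetical collate function: right-pad to the next multiple of `multiple`."""
    longest = max(len(sample["input_ids"]) for sample in batch)
    target = ((longest + multiple - 1) // multiple) * multiple
    input_ids, attention_mask = [], []
    for sample in batch:
        padding = target - len(sample["input_ids"])
        input_ids.append(sample["input_ids"] + [pad_id] * padding)
        attention_mask.append(sample["attention_mask"] + [0] * padding)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }

# Toy tokenized samples of different lengths (token ids are made up)
samples = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 5, 6, 7, 8, 9, 10, 11, 12, 102], "attention_mask": [1] * 10},
]

loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 16])
```

The longest sample has 10 tokens, so the batch is padded to 16, the next multiple of 8. Padding to a fixed multiple keeps the set of input shapes small.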


🔍 **Explanation:**

- **All Prediction Data Processing**:
  - Iterates over each model path and its weight defined in `config.model_paths`.
- **Model Conversion and Quantization**:
  - If `convert_before_inference` is `True`, it loads the original model, exports it to ONNX format, and converts the ONNX model to mixed precision (called quantization in the code). Otherwise, it uses already converted models.
- **ONNX Runtime Session Creation**:
  - Creates an ONNX Runtime session for GPU execution to perform inference.
- **Prediction**:
  - Performs prediction using the specified ONNX model session over the test dataset.
- **Ensemble Preparation**:
  - Stores each model's softmax predictions so they can be combined in the weighted ensemble that follows.
- **Memory Management**:
  - Cleans up resources like DataLoaders and datasets after processing predictions to free up memory.

📚 **Study Sources:**
- ONNX Runtime Documentation - Python API: [https://onnxruntime.ai/docs/api/python_api.html](https://onnxruntime.ai/docs/api/python_api.html)
- Hugging Face Transformers Documentation - Model Quantization: [https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.convert_to_onnx](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.convert_to_onnx)

In [None]:
# All predict data
predictions_softmax_logits = []
all_preds = []

for model_path, weight in config.model_paths.items():
    
    fold = config.trn_fold
    
    if convert_before_inference:

        # Loading the original model and converting it to ONNX
        model = AutoModelForTokenClassification.from_pretrained(model_path)

        # Converting it to ONNX to a temp folder
        converted_model_name = temp_data_folder + "original_model.onnx"
        predictions_softmax_all = predict_and_convert(test_dataloader, model, config, converted_model_name)
        del model
        gc.collect()
        torch.cuda.empty_cache()

        # In commit mode, save all quantized models with different names to create a dataset and reuse them later, bypassing
        # quantization and conversion
        quantized_model_name = "/kaggle/working/optimized" + model_path.split("/")[-1] + "_f" + str(fold) + ".onnx"
        # data path should be relative
        quantized_data_path = "optimized" + model_path.split("/")[-1] + "_f" + str(fold) + ".data"
        
        # Quantization
        predictions_softmax_all = predict_and_quant(test_dataloader, config, converted_model_name, quantized_model_name, quantized_data_path)
    
    else:
        # Use already converted models, you can make a commit notebook once and save output models to a dataset,
        # for example, /kaggle/input/toonnx2-converted-models    
        quantized_model_name = config.converted_path + "/optimized" + model_path.split("/")[-1] + "_f" + str(fold) + ".onnx"

    
    # Create ONNX Runtime session for GPU
    session = onnxruntime.InferenceSession(quantized_model_name, providers=['CUDAExecutionProvider'])
    # Uncomment this if you want to debug something on CPU
    # session = onnxruntime.InferenceSession(quantized_model_name)
    
    # Predict 
    predictions_softmax_all = predict(test_dataloader, session, config)
    
    # Keep all logits for ensemble later
    predictions_softmax_logits.append(predictions_softmax_all)
    
del test_dataloader, test_ds
gc.collect()
torch.cuda.empty_cache()


2024-04-08 17:04:16.018952538 [W:onnxruntime:, constant_folding.cc:269 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/encoder/LayerNorm/ReduceMean'
2024-04-08 17:04:16.232084188 [W:onnxruntime:, constant_folding.cc:269 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/encoder/LayerNorm/ReduceMean'
2024-04-08 17:04:16.297498056 [W:onnxruntime:, constant_folding.cc:269 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/encoder/LayerNorm/ReduceMean'
2024-04-08 17:04:16.363338513 [W:onnxruntime:, constant_folding.cc:269 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/encoder/LayerNorm/ReduceMean'
2024-04-08 17:04:16.427876770 [W:onnxruntime:, constant_folding.cc:269 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/enc

Predicting:   0%|          | 0/10 [00:00<?, ?it/s]

🔍 **Explanation:**

- **Calculating Weighted Mean of Predictions**:
  - Calculates the weighted mean of softmax predictions from all models.
- **Total Weight Calculation**:
  - Calculates the total weight of all models specified in `config.model_paths`; each weight is divided by this total so that the normalized weights sum to 1.
- **Individual Model Weights**:
  - Retrieves the individual weights for each model.
- **Sample-wise Computation**:
  - Iterates over each sample's predictions since the length of texts can vary.
- **Weighted Prediction Accumulation**:
  - For each sample, it initializes a tensor to accumulate weighted predictions from all models.
  - Iterates over each model to compute its contribution to the final prediction, applying relative weights.
  - Weighted predictions are added to obtain the sum.
- **Mean Calculation**:
  - Appends the accumulated weighted sum for the current sample to the list `predictions_mean_all`; because the weights are normalized, this sum is the weighted mean.

📚 **Study Sources:**
- PyTorch Documentation - Tensor Operations: [https://pytorch.org/docs/stable/tensors.html](https://pytorch.org/docs/stable/tensors.html)

In [None]:
# Initialize an empty list to store the mean of the softmax predictions from all models.
predictions_mean_all = []

# Calculate the total weight of all models; each weight is divided by it below so the weights sum to 1.
total_weight = sum(config.model_paths.values())
print(f"Total weight: {total_weight}")

# Retrieve the individual weights for each model.
model_weights = list(config.model_paths.values())

# Iterate over each sample since the length of texts can vary.
for sample_index in range(len(predictions_softmax_logits[0])):
    
    # Initialize a tensor to accumulate weighted predictions for the current sample.
    weighted_predictions_sum = torch.zeros(predictions_softmax_logits[0][sample_index].size())

    # Iterate over each model to compute its contribution to the final prediction.
    for model_index in range(len(predictions_softmax_logits)):
        weighted_prediction = predictions_softmax_logits[model_index][sample_index] * (model_weights[model_index] / total_weight)
        weighted_predictions_sum += weighted_prediction

    # The accumulated sum is already the weighted mean (weights are normalized), so append it to the list.
    predictions_mean_all.append(weighted_predictions_sum)
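
The weighting can be checked on a tiny example. The sketch below uses a hypothetical helper `weighted_mean` and made-up probabilities for one sample with two tokens and two classes; it follows the same normalize-then-accumulate logic as the loop above.

```python
import torch

def weighted_mean(per_model_probs, weights):
    """Hypothetical helper: weighted average of per-model probabilities for one sample."""
    total = sum(weights)
    combined = torch.zeros_like(per_model_probs[0])
    for probs, weight in zip(per_model_probs, weights):
        combined += probs * (weight / total)
    return combined

# Predictions from two models for the same sample (values are made up)
model_a = torch.tensor([[1.0, 0.0], [0.5, 0.5]])
model_b = torch.tensor([[0.0, 1.0], [0.5, 0.5]])

# Weights 1 and 3 normalize to 0.25 and 0.75
ensemble = weighted_mean([model_a, model_b], weights=[1.0, 3.0])
print(ensemble)
```

For the first token the result is `[0.25, 0.75]`, so the model with the larger weight dominates, and each row still sums to 1 because the weights are normalized.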




# 🔄 Processing Final Predictions

🔍 **Explanation:**

- **Processing Final Predictions**:
  - Processes the final predictions using the previously defined function `process_predictions_ans`.
- **Function Call**:
  - Calls the `process_predictions_ans` function, passing the mean of the softmax predictions from all models (`predictions_mean_all`) as input.
- **Processed Predictions**:
  - The processed predictions are stored in the variable `processed_predictions` for further analysis or evaluation.

📚 **Study Sources:**
- Official Python Documentation - Function Definitions: [https://docs.python.org/3/tutorial/controlflow.html#defining-functions](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)

In [None]:
processed_predictions = process_predictions_ans(predictions_mean_all)


# Processing Predictions and Extracting Information 🔍

🔍 **Explanation:**

- **Processing Predictions and Extracting Information**:
  - Iterates over each prediction and its corresponding token mapping, offsets, tokens, document, and full text in the dataset.
- **Token-level Processing**:
  - Iterates through each token prediction and its offsets, advances the start index past leading whitespace, and checks whether it maps to a valid token. If so, it adds the token information to the processed list. `B-EMAIL`, `B-PHONE_NUM`, and `I-PHONE_NUM` predictions from the model are skipped here; emails and phone numbers are instead extracted with the regular expressions below.
- **Extracting Structured Data**:
  - Matches each token against the email regular expression, and searches the full text for phone numbers and URLs. Each match is stored with its document, token index, label, and token string.
- **Efficient Membership Check**:
  - The set `pairs` is used to efficiently check whether a pair (document, token_id) has been processed already, reducing duplicate processing.
  
📚 **Study Sources:**
- Python Documentation - Regular Expression Operations: [https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)
- Python Documentation - Sets: [https://docs.python.org/3/tutorial/datastructures.html#sets](https://docs.python.org/3/tutorial/datastructures.html#sets)

In [None]:
# Initialize empty lists and sets to store extracted information
triplets = []  # Triplets for structured data extraction
pairs = set()  # To efficiently check membership during processing
processed = []  # Processed predictions
emails = []  # Email addresses
phone_nums = []  # Phone numbers
urls = []  # URLs
streets = []  # Street addresses (if applicable)

# Iterate over each prediction, token mapping, offsets, tokens, and document in the dataset
for p, token_map, offsets, tokens, doc, full_text in zip(
    processed_predictions, 
    ds["token_map"], 
    ds["offset_mapping"], 
    ds["tokens"], 
    ds["document"],
    ds["full_text"]
):

    # Iterate through each token prediction and its corresponding offsets
    for token_pred, (start_idx, end_idx) in zip(p, offsets):
        label_pred = id2label[token_pred]  # Predicted label from token
        
        # Skip tokens if start and end index are both zero or label is O
        if start_idx + end_idx == 0 or label_pred == "O":
            continue
        
        # Advance the start index past unmapped characters and leading whitespace
        if token_map[start_idx] == -1:
            start_idx += 1
        while start_idx < len(token_map) and tokens[token_map[start_idx]].isspace():
            start_idx += 1
        if start_idx >= len(token_map):
            break
        
        # Get the token ID at start index
        token_id = token_map[start_idx]
        
        # Skip 'O', labels handled by the regexes below (emails, phone numbers), and invalid token IDs
        if label_pred in ("O", "B-EMAIL", "B-PHONE_NUM", "I-PHONE_NUM") or token_id == -1:
            continue
        
        # Create a unique pair (document, token_id) and check if it's not already processed
        pair = (doc, token_id)
        if pair not in pairs:
            processed.append({"document": doc, "token": token_id, "label": label_pred, "token_str": tokens[token_id]})
            pairs.add(pair)
    
    # Extract email addresses
    for token_idx, token in enumerate(tokens):
        if re.fullmatch(email_regex, token) is not None:
            emails.append(
                {"document": doc, "token": token_idx, "label": "B-EMAIL", "token_str": token}
            )
                
    # Extract phone numbers
    matches = phone_num_regex.findall(full_text)
    if matches:
        for match in matches:
            target = [t.text for t in nlp.tokenizer(match)]
            matched_spans = find_span(target, tokens)
            for matched_span in matched_spans:
                for intermediate, token_idx in enumerate(matched_span):
                    prefix = "I" if intermediate else "B"
                    phone_nums.append(
                        {"document": doc, "token": token_idx, "label": f"{prefix}-PHONE_NUM", "token_str": tokens[token_idx]}
                    )
    
    # Extract URLs
    matches = url_regex.findall(full_text)
    if matches:
        for match in matches:
            target = [t.text for t in nlp.tokenizer(match)]
            matched_spans = find_span(target, tokens)
            for matched_span in matched_spans:
                for intermediate, token_idx in enumerate(matched_span):
                    prefix = "I" if intermediate else "B"
                    urls.append(
                        {"document": doc, "token": token_idx, "label": f"{prefix}-URL_PERSONAL", "token_str": tokens[token_idx]}
                    )
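
The phone-number and URL branches both rely on a `find_span` helper defined elsewhere in the notebook. As a rough sketch of the idea (the notebook's own implementation may differ), the hypothetical `find_sublist_spans` below returns the token indices of every place where a tokenized match occurs in the document's tokens, and the hypothetical `bio_rows` turns one span into `B-`/`I-` rows. The tokens and the document id are made up.

```python
def find_sublist_spans(target, tokens):
    """Hypothetical helper: index lists for each occurrence of `target` in `tokens`."""
    spans = []
    width = len(target)
    if width == 0:
        return spans
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] == target:
            spans.append(list(range(start, start + width)))
    return spans

def bio_rows(doc, span, tokens, entity):
    """Hypothetical helper: the first token gets a B- label, the rest get I- labels."""
    rows = []
    for position, token_idx in enumerate(span):
        prefix = "I" if position else "B"
        rows.append({"document": doc, "token": token_idx,
                     "label": f"{prefix}-{entity}", "token_str": tokens[token_idx]})
    return rows

tokens = ["Call", "(", "555", ")", "123-4567", "today"]
target = ["(", "555", ")", "123-4567"]  # a regex match after tokenization

spans = find_sublist_spans(target, tokens)
rows = [row for span in spans for row in bio_rows(7, span, tokens, "PHONE_NUM")]
print(spans)                          # [[1, 2, 3, 4]]
print([row["label"] for row in rows])  # ['B-PHONE_NUM', 'I-PHONE_NUM', 'I-PHONE_NUM', 'I-PHONE_NUM']
```

Matching on token lists rather than character offsets keeps the output aligned with the token indices that the submission format expects.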


# 📝 Creating submission.csv


In [None]:

# Create a DataFrame from processed data, phone numbers, emails, and URLs
df = pd.DataFrame(processed + phone_nums + emails + urls)

# Assign each row a unique 'row_id'
df["row_id"] = list(range(len(df)))

# Export the columns required for the submission to submission.csv
df[["row_id", "document", "token", "label"]].to_csv("submission.csv", index=False)
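
For reference, the resulting file layout can be previewed without running the full pipeline. The sketch below builds the same kind of DataFrame from a few made-up rows (the documents, tokens, and labels are illustrative only) and writes it to an in-memory buffer instead of `submission.csv`.

```python
import io
import pandas as pd

# Made-up rows in the same shape as the lists built above
processed = [{"document": 7, "token": 3, "label": "B-NAME_STUDENT", "token_str": "Ada"}]
phone_nums = [{"document": 7, "token": 12, "label": "B-PHONE_NUM", "token_str": "555"}]
emails = []
urls = [{"document": 9, "token": 40, "label": "B-URL_PERSONAL", "token_str": "example.org"}]

df = pd.DataFrame(processed + phone_nums + emails + urls)
df["row_id"] = range(len(df))

# Write to a string buffer so the example leaves no file behind
buffer = io.StringIO()
df[["row_id", "document", "token", "label"]].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The output starts with the header `row_id,document,token,label`, followed by one line per detected token.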


## Keep Exploring! 👀

Thank you for delving into this notebook! If you found it insightful or beneficial, I encourage you to explore more of my projects and contributions on my profile.

👉 [Visit my Profile](https://www.kaggle.com/zulqarnainalipk) 👈

[GitHub](https://github.com/zulqarnainalipk) |
[LinkedIn](https://www.linkedin.com/in/zulqarnainalipk/)

## Share Your Thoughts! 🙏

Your feedback is invaluable! Your insights and suggestions drive our ongoing improvement. If you have any comments, questions, or ideas to contribute, please feel free to reach out.

📬 Contact me via email: [zulqar445ali@gmail.com](mailto:zulqar445ali@gmail.com)

I extend my sincere gratitude for your time and engagement. Your support inspires us to create even more valuable content.
Happy coding and best of luck in your data science endeavors! 🚀

