### Project Title:
Automated Essay Scoring using Machine Learning

### Steps:

1. **Data Preparation**:
   - Load and preprocess the dataset containing essays and corresponding scores.
   - Split the dataset into training and testing sets.

2. **Feature Extraction**:
   - Use pre-trained language models (e.g., BERT, RoBERTa) to extract embeddings from the essays.
   
3. **Model Selection**:
   - Choose a suitable machine learning model for essay scoring, such as a Multilayer Perceptron (MLP) or a recurrent neural network (RNN).

4. **Cross-Validation**:
   - Implement K-Fold cross-validation to evaluate model performance and ensure robustness.

5. **Model Training**:
   - Train the selected model on the training set using the extracted features and corresponding scores.

6. **Model Evaluation**:
   - Evaluate the trained model on the testing set using appropriate evaluation metrics such as quadratic weighted kappa.

7. **Model Inference**:
   - Load the trained model parameters.
   - Make predictions on the test set using the extracted features.

8. **Submission Generation**:
   - Format the predictions into a submission file following the required template.
   - Save the submission file in CSV format.

### Algorithm Names:

1. **Feature Extraction**: 
   - BERT Embeddings
   - RoBERTa Embeddings

2. **Model Selection**:
   - Multilayer Perceptron (MLP)
   - Recurrent Neural Network (RNN)

3. **Evaluation Metrics**:
   - Quadratic Weighted Kappa (QWK)
   
4. **Cross-Validation**:
   - K-Fold Cross-Validation
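
Quadratic weighted kappa, listed above as the evaluation metric, penalizes a prediction by the squared distance between the predicted and true score. The following is a minimal NumPy sketch of that idea, not the competition's official scorer. The function name `quadratic_weighted_kappa` and the 1 to 6 score range in the usage lines are assumptions made for illustration.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """QWK between two integer rating vectors on [min_rating, max_rating]."""
    y_true = np.asarray(y_true) - min_rating
    y_pred = np.asarray(y_pred) - min_rating
    n = max_rating - min_rating + 1

    # Observed matrix: observed[i, j] counts essays with true rating i, predicted rating j.
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Expected matrix if true and predicted ratings were independent (same total count).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: grows with the squared distance between the two ratings.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Assumed score range of 1 to 6, for illustration only.
truth = [1, 2, 3, 4, 5, 6]
perfect = quadratic_weighted_kappa(truth, truth, 1, 6)
near_miss = quadratic_weighted_kappa(truth, [2, 2, 3, 4, 5, 5], 1, 6)
print(perfect, near_miss)
```

Perfect agreement gives a kappa of 1, and predictions that are off by one score are penalized only lightly, which is why the metric suits ordinal targets such as essay scores.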

### Final Project Summary:
The project aims to automate essay scoring using machine learning techniques. It involves extracting features from essays using pre-trained language models, selecting and training appropriate machine learning models, evaluating model performance using cross-validation and evaluation metrics, and generating submissions for scoring predictions. The project explores various algorithms and techniques to achieve accurate and reliable essay scoring.

# Install Required Packages

In [None]:
# The exclamation mark at the start is used to run shell commands in a Jupyter notebook environment.
# pip is the package installer for Python, used to install packages.
# install specifies that we want to install a package.
# --find-links allows you to specify a directory to search for packages.
# /kaggle/input/downgrade-pandas is the directory where pip should look for packages.
# /kaggle/input/downgrade-pandas/pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is the specific package (pandas version 1.5.3) to install.
!pip install --find-links /kaggle/input/downgrade-pandas /kaggle/input/downgrade-pandas/pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

1. **`!`**:
   - In a Jupyter notebook, the exclamation mark `!` is used to execute shell commands directly from the notebook cell.

2. **`pip install`**:
   - `pip` is the package installer for Python. It is used to install and manage software packages written in Python.
   - `install` is a command used with `pip` to install specified packages.

3. **`--find-links /kaggle/input/downgrade-pandas`**:
   - `--find-links` is an option that tells `pip` to look for packages in the specified directory or URL.
   - `/kaggle/input/downgrade-pandas` is the directory where `pip` should look for packages. This is a specific path in the Kaggle environment where the package files are stored.

4. **`/kaggle/input/downgrade-pandas/pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl`**:
   - This is the path to the specific package file that you want to install. It is a wheel file (`.whl`), which is a built package format for Python.
   - `pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl` is the name of the file, which indicates:
     - `pandas`: the name of the package.
     - `1.5.3`: the version of the package.
     - `cp310` (appears twice): the Python tag and the ABI tag, both indicating CPython 3.10.
     - `manylinux_2_17_x86_64.manylinux2014_x86_64`: specifies the platform and compatibility tags.
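
The breakdown of the wheel file name above can be reproduced by splitting the name on dashes. This is a small illustrative sketch; `parse_wheel_filename` is a hypothetical helper written for this example, not part of `pip`.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated tags."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    # Layout: name-version(-build)?-python_tag-abi_tag-platform_tag
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        # Several platform tags can be joined with dots.
        "platform_tags": platform_tag.split("."),
    }

info = parse_wheel_filename(
    "pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
for key, value in info.items():
    print(f"{key}: {value}")
```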


# Library and Dataset Load

In [None]:
# Importing the os module to interact with the operating system
import os

In [None]:
# Setting the environment variable to specify which GPUs to use (0 and 1)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

In [None]:
# Importing necessary libraries
import numpy as np  # NumPy for numerical operations
import gc  # Garbage collector interface for memory management
import re  # Regular expressions for string matching and manipulation
import pandas as pd  # Pandas for data manipulation and analysis

In [None]:
# Loading the training dataset from a CSV file into a Pandas DataFrame
train = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv")
# Printing the shape (number of rows and columns) of the training DataFrame
print("Train shape", train.shape)
# Displaying the first few rows of the training DataFrame for a quick preview
display(train.head())
# Printing an empty line for better readability in the output
print()

In [None]:
# Loading the testing dataset from a CSV file into a Pandas DataFrame
test = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv")
# Printing the shape (number of rows and columns) of the testing DataFrame
print("Test shape", test.shape)
# Displaying the first few rows of the testing DataFrame for a quick preview
display(test.head())


```python
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
```

1. **`import os`**:
   - This imports the `os` module, which provides a way of using operating system-dependent functionality like reading or writing to the file system, managing environment variables, etc.

2. **`os.environ["CUDA_VISIBLE_DEVICES"]="0,1"`**:
   - This line sets the environment variable `CUDA_VISIBLE_DEVICES` to `"0,1"`.
   - This specifies which GPU devices (by ID) are visible to CUDA; here, GPUs 0 and 1. On a multi-GPU machine it controls which devices a program can use. The variable must be set before CUDA is initialized (in practice, before the first GPU call in PyTorch) to take effect.

```python
import numpy as np, gc, re
import pandas as pd
```

3. **`import numpy as np`**:
   - This imports the NumPy library, which is a powerful numerical computing library in Python, under the alias `np` for convenience.

4. **`import gc`**:
   - This imports the `gc` module, which provides an interface to the garbage collector. This can be used to control and interact with the garbage collection process.

5. **`import re`**:
   - This imports the `re` module, which provides support for regular expressions in Python. Regular expressions are used for matching patterns in strings.

6. **`import pandas as pd`**:
   - This imports the Pandas library, which is a powerful data manipulation and analysis library in Python, under the alias `pd` for convenience.

```python
train = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv")
print("Train shape",train.shape)
display(train.head())
print()
```

7. **`train = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv")`**:
   - This reads a CSV file located at the specified path into a Pandas DataFrame named `train`.
   - The path points to a CSV file in a Kaggle competition dataset directory.

8. **`print("Train shape",train.shape)`**:
   - This prints the shape (number of rows and columns) of the `train` DataFrame to give a quick overview of the data's size.

9. **`display(train.head())`**:
   - This displays the first few rows of the `train` DataFrame. `display` is used here instead of `print` because it provides a better visual representation in Jupyter notebooks.

10. **`print()`**:
    - This prints an empty line for better readability in the output.

```python
test = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv")
print("Test shape",test.shape)
display(test.head())
```

11. **`test = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv")`**:
    - This reads a CSV file located at the specified path into a Pandas DataFrame named `test`.
    - Similar to the previous line for the training data, this reads the testing data.

12. **`print("Test shape",test.shape)`**:
    - This prints the shape (number of rows and columns) of the `test` DataFrame to give a quick overview of the data's size.

13. **`display(test.head())`**:
    - This displays the first few rows of the `test` DataFrame for a quick preview of the testing data.

By running this code, you are preparing your environment to use specific GPUs, importing necessary libraries, and loading and previewing training and testing datasets from CSV files.

In [None]:
# Importing StratifiedKFold from scikit-learn for stratified k-fold cross-validation
from sklearn.model_selection import StratifiedKFold

# Defining the number of folds for cross-validation
FOLDS = 15
# Initializing a new column "fold" in the training DataFrame with a default value of -1
train["fold"] = -1
# Creating a StratifiedKFold object with the specified number of splits, shuffling, and a random state for reproducibility
skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)

# Splitting the data into stratified folds
for fold, (train_index, val_index) in enumerate(skf.split(train, train["score"])):
    # Assigning the fold number to the validation set indices
    train.loc[val_index, "fold"] = fold

# Printing the number of samples in each fold
print('Train samples per fold:')
# Counting the occurrences of each fold value and sorting by fold number
print(train.fold.value_counts().sort_index())

### Explanation of Each Section:

1. **Importing the Necessary Module**:
   ```python
   # Importing StratifiedKFold from scikit-learn for stratified k-fold cross-validation
   from sklearn.model_selection import StratifiedKFold
   ```

2. **Setting Up Folds**:
   ```python
   # Defining the number of folds for cross-validation
   FOLDS = 15
   # Initializing a new column "fold" in the training DataFrame with a default value of -1
   train["fold"] = -1
   ```

3. **Creating StratifiedKFold Object**:
   ```python
   # Creating a StratifiedKFold object with the specified number of splits, shuffling, and a random state for reproducibility
   skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
   ```

4. **Splitting Data into Stratified Folds**:
   ```python
   # Splitting the data into stratified folds
   for fold, (train_index, val_index) in enumerate(skf.split(train, train["score"])):
       # Assigning the fold number to the validation set indices
       train.loc[val_index, "fold"] = fold
   ```

5. **Printing Fold Distribution**:
   ```python
   # Printing the number of samples in each fold
   print('Train samples per fold:')
   # Counting the occurrences of each fold value and sorting by fold number
   print(train.fold.value_counts().sort_index())
   ```

### Detailed Line-by-Line Comments:

- **`from sklearn.model_selection import StratifiedKFold`**:
  - Importing the `StratifiedKFold` class from the `sklearn.model_selection` module to perform stratified k-fold cross-validation, which keeps the proportion of each `score` value roughly the same in every fold.

- **`FOLDS = 15`**:
  - Defining a constant `FOLDS` to specify the number of folds (15) for cross-validation.

- **`train["fold"] = -1`**:
  - Adding a new column `fold` to the `train` DataFrame and initializing all its values to `-1`. This column will later store the fold assignments.

- **`skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)`**:
  - Creating a `StratifiedKFold` object with 15 splits, enabling shuffling of data before splitting, and setting a `random_state` for reproducibility.

- **`for fold, (train_index, val_index) in enumerate(skf.split(train, train["score"])):`**:
  - Looping through the indices generated by the `skf.split()` method. The `skf.split()` method splits the `train` data into `FOLDS` folds based on the `score` column, ensuring stratified splits.
  - `fold` is the fold number; `train_index` and `val_index` are arrays of row positions for the training and validation portions of that fold. Only `val_index` is used here.

- **`train.loc[val_index, "fold"] = fold`**:
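  - Assigning the current fold number to the `fold` column of the `train` DataFrame for the validation rows of the current fold. Passing positions to `.loc` works here because `train` still has its default integer index, so positions and labels coincide.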

- **`print('Train samples per fold:')`**:
  - Printing a header to indicate the start of the fold distribution output.

- **`print(train.fold.value_counts().sort_index())`**:
  - Printing the count of samples in each fold, sorted by fold number, to show the distribution of samples across the folds.
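
To see what stratification guarantees, the sketch below runs the same fold-assignment loop on hypothetical toy data (30 samples, 10 for each of three scores, 5 folds instead of 15) and counts the scores inside each validation fold. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 10 essays for each of the scores 1, 2 and 3.
scores = np.repeat([1, 2, 3], 10)
fold_of = np.full(len(scores), -1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Only the labels matter for the split, so a dummy feature array is enough.
for fold, (_, val_index) in enumerate(skf.split(np.zeros(len(scores)), scores)):
    fold_of[val_index] = fold

# Count how many samples of each score landed in each validation fold.
per_fold = [
    np.bincount(scores[fold_of == fold], minlength=4)[1:].tolist()
    for fold in range(5)
]
for fold, counts in enumerate(per_fold):
    print(f"fold {fold}: counts for scores 1, 2, 3 = {counts}")
```

With balanced toy labels every fold receives the same number of samples per score. On real, imbalanced essay scores the counts are only approximately proportional.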


# Generate AutoModel and AutoTokenizer

In [None]:
# Importing AutoModel and AutoTokenizer from the transformers library
from transformers import AutoModel, AutoTokenizer

# Importing torch and torch.nn.functional from the PyTorch library
import torch
import torch.nn.functional as F

# Importing tqdm for progress bar functionality
from tqdm import tqdm

1. **Importing from `transformers`**:
   ```python
   # Importing AutoModel and AutoTokenizer from the transformers library
   from transformers import AutoModel, AutoTokenizer
   ```
   - `AutoModel` is a generic model class from the `transformers` library. It allows you to load any pre-trained model.
   - `AutoTokenizer` is a generic tokenizer class from the `transformers` library. It allows you to load any pre-trained tokenizer.
   - The `transformers` library by Hugging Face provides state-of-the-art machine learning models, especially for natural language processing (NLP).

2. **Importing from `torch`**:
   ```python
   # Importing torch and torch.nn.functional from the PyTorch library
   import torch
   import torch.nn.functional as F
   ```
   - `torch` is the main PyTorch library, used for tensor operations and various other functionalities.
   - `torch.nn.functional` (imported as `F`) provides a range of functions useful for building neural networks, like activation functions, loss functions, and more.

3. **Importing `tqdm`**:
   ```python
   # Importing tqdm for progress bar functionality
   from tqdm import tqdm
   ```
   - `tqdm` is a library that provides fast, extensible progress bars for loops. It's useful for tracking the progress of tasks that involve iterating over datasets or any long-running processes.

- **`from transformers import AutoModel, AutoTokenizer`**:
  - Importing `AutoModel` and `AutoTokenizer` from the `transformers` library. These classes provide a convenient way to load pre-trained models and tokenizers for various NLP tasks without specifying the exact model or tokenizer architecture.

- **`import torch`**:
  - Importing the main PyTorch library. PyTorch is an open-source machine learning library used for applications such as natural language processing and computer vision.

- **`import torch.nn.functional as F`**:
  - Importing the `torch.nn.functional` module from PyTorch as `F`. This module contains functions that are used as operations on tensors, typically in the context of neural networks.

- **`from tqdm import tqdm`**:
  - Importing `tqdm`, a library for displaying progress bars. This is useful for monitoring the progress of loops and long-running processes, especially in data processing and training loops.


# Mean Pooling of Token Embeddings

In [None]:
def mean_pooling(model_output, attention_mask):
    # Extracting the last hidden state from the model output and moving it to CPU memory
    token_embeddings = model_output.last_hidden_state.detach().cpu()
    
    # Expanding the attention mask dimensions to match the token embeddings
    # This will make the attention mask compatible for element-wise multiplication with token embeddings
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    
    # Performing element-wise multiplication of token embeddings and expanded attention mask
    # Summing the embeddings along the sequence length (dim=1)
    # Dividing by the sum of the expanded attention mask to compute the mean, while clamping to avoid division by zero
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

### Explanation of Each Section:

1. **Function Definition**:
   ```python
   def mean_pooling(model_output, attention_mask):
   ```
   - Defines a function named `mean_pooling` that takes two arguments: `model_output` and `attention_mask`.

2. **Extracting Last Hidden State**:
   ```python
   # Extracting the last hidden state from the model output and moving it to CPU memory
   token_embeddings = model_output.last_hidden_state.detach().cpu()
   ```
   - `model_output.last_hidden_state`: Accesses the last hidden state from the model output, which is typically a tensor of shape (batch_size, sequence_length, hidden_size).
   - `.detach()`: Detaches the tensor from the computation graph, so no gradients will be calculated for it.
   - `.cpu()`: Moves the tensor from GPU to CPU memory for further processing.

3. **Expanding Attention Mask**:
   ```python
   # Expanding the attention mask dimensions to match the token embeddings
   # This will make the attention mask compatible for element-wise multiplication with token embeddings
   input_mask_expanded = (
       attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
   )
   ```
   - `attention_mask.unsqueeze(-1)`: Adds an extra dimension to the attention mask tensor, changing its shape from (batch_size, sequence_length) to (batch_size, sequence_length, 1).
   - `.expand(token_embeddings.size())`: Expands the attention mask to match the size of `token_embeddings`, so it can be used for element-wise multiplication. This changes the attention mask to have the same shape as `token_embeddings`.
   - `.float()`: Converts the expanded attention mask to float type.

4. **Mean Pooling Calculation**:
   ```python
   # Performing element-wise multiplication of token embeddings and expanded attention mask
   # Summing the embeddings along the sequence length (dim=1)
   # Dividing by the sum of the expanded attention mask to compute the mean, while clamping to avoid division by zero
   return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
       input_mask_expanded.sum(1), min=1e-9
   )
   ```
   - `torch.sum(token_embeddings * input_mask_expanded, 1)`: Performs element-wise multiplication of `token_embeddings` and `input_mask_expanded`, then sums along the sequence length dimension (dim=1).
   - `input_mask_expanded.sum(1)`: Sums the expanded attention mask along the sequence length dimension to get the number of valid tokens for each sequence in the batch.
   - `torch.clamp(..., min=1e-9)`: Clamps the sum to a minimum value of `1e-9` to avoid division by zero.
   - The division computes the mean of the token embeddings, considering only the valid tokens (as indicated by the attention mask).

### Summary:
This function `mean_pooling` performs mean pooling on the token embeddings produced by a transformer model. It uses the attention mask to ensure that padding tokens do not affect the mean calculation. The function first extracts the last hidden state from the model output, then uses the attention mask to calculate the mean of the valid token embeddings for each sequence in the batch.
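
The arithmetic can be checked by hand on a tiny batch. The sketch below repeats the same mask, sum, and divide steps in NumPy rather than PyTorch so the numbers are easy to inspect; the toy values are made up for illustration.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],   # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

# Same steps as mean_pooling: mask, sum over the sequence axis, divide by token count.
mask = attention_mask[:, :, None].astype(float)   # shape (2, 3, 1), broadcasts over hidden
summed = (token_embeddings * mask).sum(axis=1)    # shape (2, 2)
counts = np.clip(mask.sum(axis=1), 1e-9, None)    # shape (2, 1), valid tokens per sequence
pooled = summed / counts

# A plain mean over the sequence axis would let the padding vector leak in.
naive = token_embeddings.mean(axis=1)
print(pooled)
print(naive)
```

For the first sequence the masked mean averages only the two real tokens, while the naive mean is pulled toward the padding vector.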

# Define the PyTorch Dataset for Embedding Extraction

In [None]:
# Defining a custom dataset class for embedding text using PyTorch's Dataset class
class EmbedDataset(torch.utils.data.Dataset):
    # Initialization method to set up the dataset
    def __init__(self, df, tokenizer, max_length):
        # Storing the DataFrame after resetting its index
        self.df = df.reset_index(drop=True)
        # Storing the tokenizer to be used for tokenizing text
        self.tokenizer = tokenizer
        # Setting the maximum token length for padding/truncation
        self.max = max_length

    # Method to get the length of the dataset (number of samples)
    def __len__(self):
        return len(self.df)

    # Method to get a single sample from the dataset
    def __getitem__(self, idx):
        # Retrieving the text at the specified index
        text = self.df.loc[idx, "full_text"]
        # Tokenizing the text using the provided tokenizer
        tokens = self.tokenizer(
            text,                       # The input text to tokenize
            None,                       # No secondary input text
            add_special_tokens=True,    # Add special tokens like [CLS] and [SEP]
            padding='max_length',       # Pad the sequences to the maximum length
            truncation=True,            # Truncate sequences to the maximum length
            max_length=self.max,        # The maximum length for padding/truncation
            return_tensors="pt"         # Return PyTorch tensors
        )
        # Squeezing the tensor dimensions from (1, max_length) to (max_length)
        tokens = {k: v.squeeze(0) for k, v in tokens.items()}
        # Returning the tokenized input as the output
        return tokens


### Class Definition
- **Class Purpose**: The `EmbedDataset` class is a custom dataset class that extends PyTorch's `Dataset` class. It is designed to handle text data, tokenize it using a specified tokenizer, and prepare it for use in a machine learning model.

### Initialization Method (`__init__`)
- **Arguments**:
  - `df`: A DataFrame containing the text data. This is typically a table where each row represents a text sample.
  - `tokenizer`: An instance of a tokenizer from the `transformers` library, which is used to convert the text into token IDs that the model can process.
  - `max_length`: The maximum length for token sequences. Texts longer than this length will be truncated, and shorter texts will be padded to this length.
- **Purpose**: The `__init__` method initializes the dataset object. It stores the DataFrame, tokenizer, and maximum length as instance variables. It also resets the index of the DataFrame to ensure it starts from 0 and increments by 1.

### Length Method (`__len__`)
- **Purpose**: The `__len__` method returns the total number of samples in the dataset. It allows PyTorch's DataLoader to know how many samples there are in the dataset by simply calling this method.

### Get Item Method (`__getitem__`)
- **Arguments**:
  - `idx`: An index that specifies which sample to retrieve from the dataset.
- **Purpose**: The `__getitem__` method retrieves and processes a single sample from the dataset. It is called by PyTorch's DataLoader to get data during training or evaluation.

### Detailed Steps in `__getitem__`:
1. **Retrieve Text**: The method fetches the text from the DataFrame at the specified index (`idx`), reading it from the `full_text` column.

2. **Tokenize Text**: The text is passed through the tokenizer, which converts the text into a sequence of token IDs. During this process:
   - **Special Tokens**: The tokenizer adds special tokens (like `[CLS]` at the beginning and `[SEP]` at the end) which are required by some models.
   - **Padding**: The sequence is padded to the maximum length specified during initialization. Padding ensures all sequences are of the same length, which is necessary for batch processing.
   - **Truncation**: If the sequence is longer than the maximum length, it is truncated to fit within the specified length.
   - **Return Tensors**: The tokenizer outputs the token IDs and other necessary components (like attention masks) as PyTorch tensors.

3. **Squeeze Tensors**: Because `return_tensors="pt"` adds a batch dimension even for a single sample, each tensor in the dictionary has shape (1, max_length). The method removes that leading dimension so each tensor has shape (max_length,), and the DataLoader can later stack samples into batches.

4. **Return Processed Tokens**: Finally, the method returns the processed tokenized data. This data includes the token IDs and possibly other components like attention masks, which are needed for model input.
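
Padding and truncation are easier to picture with a toy example. The sketch below is a hypothetical stand-in for the tokenizer call, not the Hugging Face implementation: it assigns one fake id per word, whereas real tokenizers use subword vocabularies and add special tokens.

```python
def toy_encode(text, max_length, pad_id=0):
    """Hypothetical stand-in for a tokenizer with padding='max_length' and truncation=True."""
    # One fake id per whitespace-separated word (here simply the word length).
    ids = [len(word) for word in text.split()]
    ids = ids[:max_length]                       # truncation to max_length
    mask = [1] * len(ids)                        # 1 marks real tokens
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,   # padding up to max_length
        "attention_mask": mask + [0] * padding,  # 0 marks padded positions
    }

short = toy_encode("essays vary in length", max_length=6)
long = toy_encode("this essay has far more words than the limit allows", max_length=6)
print(short)
print(long)
```

Both outputs have exactly `max_length` entries, which is what lets a DataLoader stack them into one batch; the attention mask records which positions are real.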


# Extract Embeddings

In [None]:
def get_embeddings(model_name='', max_length=1024, batch_size=32, compute_train=True, compute_test=True):
    # Global variables for train and test dataframes
    global train, test
    
    # Device for GPU processing
    DEVICE = "cuda:1"  # EXTRACT EMBEDDINGS WITH GPU #2
    
    # Path to pre-trained model and tokenizer
    path = "/kaggle/input/download-huggingface-models/"
    disk_name = path + model_name.replace("/", "_")
    
    # Load pre-trained model and tokenizer
    model = AutoModel.from_pretrained(disk_name, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(disk_name, trust_remote_code=True)

    # Create EmbedDataset and DataLoader for training data
    ds_tr = EmbedDataset(train, tokenizer, max_length)
    embed_dataloader_tr = torch.utils.data.DataLoader(ds_tr, batch_size=batch_size, shuffle=False)
    
    # Create EmbedDataset and DataLoader for testing data
    ds_te = EmbedDataset(test, tokenizer, max_length)
    embed_dataloader_te = torch.utils.data.DataLoader(ds_te, batch_size=batch_size, shuffle=False)
    
    # Move model to GPU and set to evaluation mode
    model = model.to(DEVICE)
    model.eval()

    # COMPUTE TRAIN EMBEDDINGS
    all_train_text_feats = []
    if compute_train:
        # Iterate over batches of training data
        for batch in tqdm(embed_dataloader_tr, total=len(embed_dataloader_tr)):
            # Move batch to GPU
            input_ids = batch["input_ids"].to(DEVICE)
            attention_mask = batch["attention_mask"].to(DEVICE)
            
            # Compute embeddings with model inference
            with torch.no_grad():
                with torch.cuda.amp.autocast(enabled=True):
                    model_output = model(input_ids=input_ids, attention_mask=attention_mask)
            sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
            
            # Normalize embeddings to unit L2 norm
            sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
            # Keep the batch dimension: squeeze(0) would flatten a final batch holding a single sample
            sentence_embeddings = sentence_embeddings.detach().cpu().numpy()
            all_train_text_feats.extend(sentence_embeddings)
    all_train_text_feats = np.array(all_train_text_feats)

    # COMPUTE TEST EMBEDDINGS
    all_test_text_feats = []
    if compute_test:
        # Iterate over batches of testing data
        for batch in embed_dataloader_te:
            # Move batch to GPU
            input_ids = batch["input_ids"].to(DEVICE)
            attention_mask = batch["attention_mask"].to(DEVICE)
            
            # Compute embeddings with model inference
            with torch.no_grad():
                with torch.cuda.amp.autocast(enabled=True):
                    model_output = model(input_ids=input_ids, attention_mask=attention_mask)
            sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
            
            # Normalize embeddings to unit L2 norm
            sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
            # Keep the batch dimension: squeeze(0) would flatten a final batch holding a single sample
            sentence_embeddings = sentence_embeddings.detach().cpu().numpy()
            all_test_text_feats.extend(sentence_embeddings)
        all_test_text_feats = np.array(all_test_text_feats)
    
    # Clear memory
    del ds_tr, ds_te
    del embed_dataloader_tr, embed_dataloader_te
    del model, tokenizer, model_output, sentence_embeddings, input_ids, attention_mask
    gc.collect()
    torch.cuda.empty_cache()

    # RETURN EMBEDDINGS
    return all_train_text_feats, all_test_text_feats

1. **Setting Up Environment and Paths**:
   - The function starts by setting up the environment, including the device (GPU), file paths, and loading the pre-trained model and tokenizer from the specified location.

2. **Creating EmbedDataset and DataLoader**:
   - Two instances of `EmbedDataset` are created for both the training and testing data.
   - PyTorch `DataLoader` objects are initialized using these datasets. These loaders are used for iterating over the data in batches during the embedding extraction process.

3. **Moving Model to GPU and Setting to Evaluation Mode**:
   - The loaded model is moved to the specified GPU (`cuda:1`) and set to evaluation mode using `model.eval()`.

4. **Computing Embeddings**:
   - The function iterates over batches of data from the training and testing datasets.
   - For each batch, it performs inference using the model to obtain sentence embeddings.
   - The embeddings are then normalized and converted to NumPy arrays before being added to the respective lists (`all_train_text_feats` and `all_test_text_feats`).

5. **Memory Management**:
   - After computing embeddings, memory is cleared to release GPU resources using `del`, `gc.collect()`, and `torch.cuda.empty_cache()`.

6. **Returning Embeddings**:
   - Finally, the computed embeddings for both training and testing data are returned as NumPy arrays.

### Main Points to Note:
- **Memory Management**: Once the embeddings are extracted, the function deletes the model, tokenizer, datasets, and intermediate tensors, then calls `gc.collect()` and `torch.cuda.empty_cache()` so that GPU memory is free for the next model.
- **Normalization**: Embeddings are normalized using L2 normalization to ensure consistent scales across embeddings.
- **GPU Usage**: The function leverages GPU acceleration (`cuda:1`) for faster computation of embeddings.
- **Data Iteration**: Data is iterated over in batches using PyTorch's DataLoader for efficient processing.
- **Evaluation Mode**: The model is set to evaluation mode (`model.eval()`) to disable dropout layers and ensure consistent inference results.
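
The effect of the L2 normalization step can be sketched in NumPy with made-up vectors. The division by `max(norm, eps)` mirrors the default behaviour of `F.normalize` (with `eps=1e-12`); the example values are assumptions for illustration.

```python
import numpy as np

# Three toy embeddings; the first and third point in the same direction.
embeddings = np.array([[3.0, 4.0], [0.5, 0.0], [6.0, 8.0]])

# Divide each row by its L2 norm, guarding against zero-length rows.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / np.maximum(norms, 1e-12)

print(unit)
print(np.linalg.norm(unit, axis=1))   # every row now has length 1
print(unit[0] @ unit[2])              # dot product of unit vectors = cosine similarity
```

After normalization only the direction of each embedding remains, so vectors that differed only in magnitude become identical.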


# List of pre-trained transformer models

In [None]:
# List of pre-trained transformer models along with their corresponding parameters
models = [
    ('microsoft/deberta-base', 1024, 32),             # Microsoft DeBERTa Base model with max_length=1024, batch_size=32
    ('microsoft/deberta-large', 1024, 8),             # Microsoft DeBERTa Large model with max_length=1024, batch_size=8
    ('microsoft/deberta-v3-large', 1024, 8),          # Microsoft DeBERTa v3 Large model with max_length=1024, batch_size=8
    ('allenai/longformer-base-4096', 1024, 32),       # AllenAI Longformer Base model with max_length=1024, batch_size=32
    ('google/bigbird-roberta-base', 1024, 32),        # Google BigBird-RoBERTa Base model with max_length=1024, batch_size=32
    ('google/bigbird-roberta-large', 1024, 8),        # Google BigBird-RoBERTa Large model with max_length=1024, batch_size=8
]

In [None]:
# Directory path where embeddings are saved or loaded from
path = "/kaggle/input/essay-embeddings-v1/"

# Lists to store embeddings for all models
all_train_embeds = []
all_test_embeds = []

# Loop through each model and associated parameters
for (model, max_length, batch_size) in models:
    # Generate file name for the embeddings corresponding to the current model
    name = path + model.replace("/","_") + ".npy"
    
    # Check if embeddings file already exists
    if os.path.exists(name):
        # If the train embeddings file exists, only the test embeddings need to be computed
        _, test_embed = get_embeddings(model_name=model, max_length=max_length, batch_size=batch_size, compute_train=False)
        # Load train embeddings from file
        train_embed = np.load(name)
        # Print message indicating loading of train embeddings
        print(f"Loading train embeddings for {name}")
    else:
        # If embeddings file does not exist, compute both train and test embeddings
        # Compute train and test embeddings using the get_embeddings function
        print(f"Computing train embeddings for {name}")
        train_embed, test_embed = get_embeddings(model_name=model, max_length=max_length, batch_size=batch_size, compute_train=True)
        # Save computed train embeddings to file
        np.save(name, train_embed)
    
    # Append train and test embeddings to corresponding lists
    all_train_embeds.append(train_embed)
    all_test_embeds.append(test_embed)

# Clean up memory by deleting variables train_embed and test_embed
del train_embed, test_embed

1. **Path Definition**:
   ```python
   path = "/kaggle/input/essay-embeddings-v1/"
   ```
   - This line defines the directory path where embeddings are either saved or loaded from.

2. **Initialization of Lists**:
   ```python
   all_train_embeds = []
   all_test_embeds = []
   ```
   - These lists (`all_train_embeds` and `all_test_embeds`) are initialized to store embeddings for all models.

3. **Looping Through Models**:
   ```python
   for (model, max_length, batch_size) in models:
   ```
   - This loop iterates over each tuple in the `models` list, unpacking the model name, maximum sequence length, and batch size for each iteration.

4. **Generating File Name**:
   ```python
   name = path + model.replace("/","_") + ".npy"
   ```
   - This line generates the file name for the embeddings corresponding to the current model by replacing the slashes in the model name with underscores and appending the ".npy" extension.

5. **Checking if Embeddings File Exists**:
   ```python
   if os.path.exists(name):
   ```
   - This conditional statement checks if the embeddings file already exists in the specified path.

6. **Loading Existing Embeddings**:
   ```python
   _, test_embed = get_embeddings(model_name=model, max_length=max_length, batch_size=batch_size, compute_train=False)
   train_embed = np.load(name)
   ```
   - If the embeddings file exists, `get_embeddings` is called with `compute_train=False` so that only the test embeddings are produced, and the train embeddings are loaded from the cached `.npy` file.

7. **Computing New Embeddings**:
   ```python
   else:
       train_embed, test_embed = get_embeddings(model_name=model, max_length=max_length, batch_size=batch_size, compute_train=True)
       np.save(name, train_embed)
   ```
   - If the embeddings file does not exist, both train and test embeddings are computed using the `get_embeddings` function. Train embeddings are then saved to the `.npy` file for future use.

8. **Appending Embeddings to Lists**:
   ```python
   all_train_embeds.append(train_embed)
   all_test_embeds.append(test_embed)
   ```
   - Train and test embeddings are appended to their respective lists for each model iteration.

9. **Memory Cleanup**:
   ```python
   del train_embed, test_embed
   ```
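   - Finally, the names `train_embed` and `test_embed` are deleted. The arrays themselves stay in memory because `all_train_embeds` and `all_test_embeds` still reference them, so this only removes the leftover loop variables.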
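
The caching pattern described in the steps above can be sketched in isolation. This is a minimal illustration, not the notebook's implementation: `load_or_compute` and `fake_embeddings` are hypothetical names, and a temporary directory stands in for the embeddings folder.

```python
import os
import tempfile

import numpy as np


def load_or_compute(cache_dir, model_name, compute_fn):
    """Return (array, cache_hit). Computes and saves the array on a cache miss."""
    # Same naming scheme as above: "/" in the model name becomes "_"
    cache_file = os.path.join(cache_dir, model_name.replace("/", "_") + ".npy")
    if os.path.exists(cache_file):
        return np.load(cache_file), True
    embeddings = compute_fn()
    np.save(cache_file, embeddings)
    return embeddings, False


def fake_embeddings():
    # Hypothetical stand-in for an expensive embedding computation
    return np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)


with tempfile.TemporaryDirectory() as cache_dir:
    # First call computes and saves, second call reads the saved file
    first, first_hit = load_or_compute(cache_dir, "org/some-model", fake_embeddings)
    second, second_hit = load_or_compute(cache_dir, "org/some-model", fake_embeddings)
```

On Kaggle, directories under `/kaggle/input` are usually read-only, so the save branch would likely need a writable location such as `/kaggle/working`.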


# Combine Feature Embeddings

In [None]:
# Concatenate train embeddings horizontally along axis 1
all_train_embeds = np.concatenate(all_train_embeds, axis=1)

# Concatenate test embeddings horizontally along axis 1
all_test_embeds = np.concatenate(all_test_embeds, axis=1)

# Perform garbage collection to free up memory
gc.collect()

# Print the shape of the concatenated train embeddings
print('Our concatenated train embeddings have shape', all_train_embeds.shape)
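
Concatenating along `axis=1` keeps one row per essay and places each model's features side by side, so the combined width is the sum of the individual embedding sizes. The sketch below uses made-up arrays (the sizes 768 and 1024 are typical hidden sizes for base and large encoders, chosen here only for illustration) and shows how the MLP input size could be derived from the result instead of being hard-coded.

```python
import numpy as np

n_essays = 5

# Hypothetical per-model embeddings: one row per essay, one column per feature
per_model = [
    np.zeros((n_essays, 768), dtype=np.float32),
    np.zeros((n_essays, 1024), dtype=np.float32),
    np.zeros((n_essays, 768), dtype=np.float32),
]

# Rows stay aligned by essay; columns from each model are appended side by side
combined = np.concatenate(per_model, axis=1)

# The network's input dimensionality has to match this width
input_size = combined.shape[1]
print(combined.shape, input_size)
```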

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Check if CUDA is available and set device accordingly
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define the MLP model
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(MLP, self).__init__()
        # Define the first fully connected layer
        self.fc1 = nn.Linear(input_size, hidden_size1)
        # Define the ReLU activation function (stateless, so one instance is reused after both hidden layers)
        self.relu = nn.ReLU()
        # Define the second fully connected layer
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        # Define the output fully connected layer
        self.fc3 = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        # Forward pass through the network
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

# Training function for the MLP model
def train_MLP(model, criterion, optimizer, train_loader, num_epochs, X_valid_tensor, y_valid):
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        # Iterate over training batches
        for inputs, labels in train_loader:
            # Move inputs and labels to the appropriate device (GPU or CPU)
            inputs, labels = inputs.to(device), labels.to(device)
            # Zero the gradients
            optimizer.zero_grad()
            # Forward pass
            outputs = model(inputs)
            # Compute the loss
            loss = criterion(outputs, labels)
            # Backward pass
            loss.backward()
            # Update weights
            optimizer.step()
            # Accumulate the loss
            running_loss += loss.item() * inputs.size(0)
        
        # Switch to evaluation mode and disable gradient tracking for validation
        model.eval()
        with torch.no_grad():
            # Move validation data to the appropriate device
            X_valid_tensor = X_valid_tensor.to(device)
            # Predict the class index with the highest logit for each validation essay
            preds = torch.argmax(model(X_valid_tensor), dim=1)
        # Compute the QWK score; class indices 0-5 are shifted back to scores 1-6
        score = comp_score(y_valid, (preds + 1).cpu())
        # Compute average epoch loss
        epoch_loss = running_loss / len(train_loader.dataset)
        # Print training progress
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}" + f" => QWK score: {score}")

# Hyperparameters
input_size = 5376       # Input vector dimensionality
hidden_size1 = 3200     # Size of the first hidden layer
hidden_size2 = 1600     # Size of the second hidden layer
output_size = 6         # Number of output classes
learning_rate = 0.001   # Learning rate
num_epochs = 8          # Number of training epochs
batch_size = 128        # Batch size

# Initialize the model, loss function, and optimizer
model = MLP(input_size, hidden_size1, hidden_size2, output_size).to(device)  # Keep parameters on the same device as the batches
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


1. **Model Definition**:
   - The script defines a Multilayer Perceptron (MLP) neural network using PyTorch's `nn.Module` class. This MLP consists of three fully connected layers (`fc1`, `fc2`, `fc3`) with ReLU activation functions between them. The `forward` method defines the forward pass of the network.

2. **Training Function**:
   - The `train_MLP` function is responsible for training the MLP model. It iterates over the specified number of epochs, performing forward and backward passes for each batch of training data. It also computes and accumulates the loss during training. After each epoch, it evaluates the model's performance on the validation set and prints the training progress.

3. **Device Selection**:
   - The code checks for the availability of a CUDA-enabled GPU and sets the device accordingly. This allows for GPU acceleration if a compatible GPU is available, otherwise, it falls back to CPU execution.

4. **Hyperparameters**:
   - Hyperparameters such as input size, hidden layer sizes, output size, learning rate, number of epochs, and batch size are defined. These parameters control the architecture and training process of the MLP model.

5. **Loss Function and Optimizer**:
   - The script specifies the cross-entropy loss function (`nn.CrossEntropyLoss`) and the Adam optimizer (`optim.Adam`) for training the MLP model. These components are essential for computing the loss and updating the model's weights during the training process.
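
The interaction between the six output logits, the cross-entropy loss, and the 1-based essay scores can be checked on a toy network. This is a sketch with made-up sizes and random data, using `nn.Sequential` in place of the `MLP` class for brevity; only the shapes and the label shift mirror the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only; the notebook uses much larger layers
tiny_mlp = nn.Sequential(
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 6),  # 6 logits, one per score class
)

features = torch.randn(10, 16)       # batch of 10 fake embedding vectors
scores = torch.randint(1, 7, (10,))  # fake essay scores in 1..6
targets = scores - 1                 # CrossEntropyLoss expects class indices 0..5

logits = tiny_mlp(features)
loss = nn.CrossEntropyLoss()(logits, targets)

# argmax gives a class index in 0..5, so add 1 to get back to a score
predicted_scores = logits.argmax(dim=1) + 1
print(logits.shape, loss.item())
```

`nn.CrossEntropyLoss` applies the softmax internally, which is why the network ends with a plain linear layer and returns raw logits.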



# Design training and validation sets

In [None]:
from sklearn.metrics import cohen_kappa_score

# Function to compute the quadratic weighted kappa score
def comp_score(y_true, y_pred):
    m = cohen_kappa_score(y_true, y_pred, weights='quadratic')
    return m

# Compute the sizes of the training and validation sets
train_size = int(0.7 * len(train))
valid_size = len(train) - train_size

# Shuffle the indices of the dataset
indices = np.random.permutation(len(train))

# Split the dataset into training and validation sets based on the specified proportions
train_indices = indices[:train_size]
valid_indices = indices[train_size:]

# Extract features and labels for training and validation sets
X_train = all_train_embeds[train_indices,]
y_train = train.loc[train_indices,'score'].values
X_valid = all_train_embeds[valid_indices,]
y_valid = train.loc[valid_indices,'score'].values
X_test = all_test_embeds

# Convert training and validation data to PyTorch tensors
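X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train - 1, dtype=torch.long)  # Adjust labels to start from 0
X_valid_tensor = torch.tensor(X_valid, dtype=torch.float32)
y_valid_tensor = torch.tensor(y_valid, dtype=torch.long)      # Validation labels keep the original 1-based scores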

# Create TensorDataset and DataLoader for training and validation data
train_data = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_data = torch.utils.data.TensorDataset(X_valid_tensor, y_valid_tensor)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size, shuffle=False)

# Call the train_MLP function to train the MLP model
# train_MLP(model, criterion, optimizer, train_loader, num_epochs, X_valid_tensor, y_valid)

In [None]:
import numpy as np
import torch
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold
from torch.utils.data import TensorDataset, DataLoader

# Function to compute the quadratic weighted kappa score
def comp_score(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

# Training function for the MLP model
def train_MLP(model, criterion, optimizer, train_loader, num_epochs, X_valid_tensor, y_valid):
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for inputs, labels in train_loader:
            # Move inputs and labels to the appropriate device
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)
        model.eval()
        with torch.no_grad():
            # Move validation data to the appropriate device
            X_valid_tensor = X_valid_tensor.to(device)
            preds = torch.argmax(model(X_valid_tensor), dim=1)
            score = comp_score(y_valid, (preds + 1).cpu())
            epoch_loss = running_loss / len(train_loader.dataset)
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}" + f" => QWK score: {score}")
    return model

# Relies on MLP, device, criterion, and the hyperparameters from the earlier cells, plus all_train_embeds, all_test_embeds, and train

# Initialize 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []

# Perform 10-fold cross-validation
for fold, (train_indices, valid_indices) in enumerate(kf.split(all_train_embeds)):
    print(f'Fold {fold+1}')
    
    # Split data into training and validation sets
    X_train = all_train_embeds[train_indices]
    y_train = train.loc[train_indices, 'score'].values
    X_valid = all_train_embeds[valid_indices]
    y_valid = train.loc[valid_indices, 'score'].values
    
    # Convert data to PyTorch tensors
    X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train - 1, dtype=torch.long)  # Adjust labels to start from 0
    X_valid_tensor = torch.tensor(X_valid, dtype=torch.float32)
    y_valid_tensor = torch.tensor(y_valid, dtype=torch.long)
    
    # Create TensorDataset and DataLoader for training and validation data
    train_data = TensorDataset(X_train_tensor, y_train_tensor)
    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    val_data = TensorDataset(X_valid_tensor, y_valid_tensor)
    val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False)
    
    # Reinitialize the model and optimizer so no weights or optimizer state carry over between folds
    model = MLP(input_size, hidden_size1, hidden_size2, output_size)
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Train the model
    model = train_MLP(model, criterion, optimizer, train_loader, num_epochs, X_valid_tensor, y_valid_tensor)
    
    # Validate the model
    model.eval()
    with torch.no_grad():
        X_valid_tensor = X_valid_tensor.to(device)
        valid_preds = model(X_valid_tensor).argmax(dim=1).cpu().numpy()

    model_path = f'/kaggle/working/model_fold_{fold+1}.pth'
    torch.save(model.state_dict(), model_path)
    print(f'Model saved at {model_path}')
    
    # Compute score
    score = comp_score(y_valid, valid_preds + 1)
    scores.append(score)
    print(f'Score for fold {fold+1}: {score}')

# Compute the mean score for 10 folds
mean_score = np.mean(scores)
print(f'Mean score: {mean_score}')


1. **Imports**:
   - The script imports NumPy, PyTorch (including `TensorDataset` and `DataLoader`), and scikit-learn's `cohen_kappa_score` and `KFold`.

2. **Function Definitions**:
   - `comp_score`: Computes the quadratic weighted kappa score between true labels and predicted labels. This function is used to evaluate the performance of the model during training.
   - `train_MLP`: Function for training the MLP model. It iterates over the specified number of epochs, performing forward and backward passes for each batch of training data. After each epoch, it evaluates the model's performance on the validation set and prints the training progress.

3. **Initialization**:
   - The `KFold` splitter (10 shuffled splits with `random_state=42` for reproducibility) and an empty `scores` list are initialized. The number of epochs, batch size, and other hyperparameters are reused from the earlier cell.

4. **K-Fold Cross-Validation**:
   - The script uses K-Fold cross-validation with 10 folds to train and evaluate the model on different subsets of the data.
   - Within each fold loop:
     - The data is split into training and validation sets.
     - PyTorch tensors and data loaders are created for both training and validation sets.
     - The model is reinitialized, optimizer is chosen, and the model is trained using the `train_MLP` function.
     - After training, the model's predictions on the validation set are computed, and the model is saved.
     - The performance score for the fold is computed and printed.
   - After all folds are completed, the mean score over 10 folds is computed and printed.

5. **Explanation**:
   - The script demonstrates a complete pipeline for training and evaluating an MLP model using K-Fold cross-validation. Because every essay is used for validation exactly once, the mean score over the 10 folds is a more reliable performance estimate than the score from a single train/validation split.
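
To make the evaluation metric concrete, here is a small NumPy sketch of quadratic weighted kappa. It is an illustrative reimplementation, not the code used above (the notebook calls `cohen_kappa_score`). The function name and the `min_score`/`max_score` parameters are assumptions, and the result is expected to match scikit-learn only when every score from 1 to 6 appears in the data.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, min_score=1, max_score=6):
    """QWK = 1 - sum(W * O) / sum(W * E), with W[i, j] proportional to (i - j)^2."""
    y_true = np.asarray(y_true) - min_score
    y_pred = np.asarray(y_pred) - min_score
    n_classes = max_score - min_score + 1

    # O: observed confusion matrix (rows = true score, columns = predicted score)
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # E: counts expected if predictions were independent of the true scores
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # W: penalty grows with the squared distance between true and predicted score
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


truth = [1, 2, 3, 4, 5, 6]
perfect = quadratic_weighted_kappa(truth, [1, 2, 3, 4, 5, 6])
near_miss = quadratic_weighted_kappa(truth, [2, 2, 3, 4, 5, 6])  # one essay off by 1
far_miss = quadratic_weighted_kappa(truth, [6, 2, 3, 4, 5, 6])   # one essay off by 5
reversed_order = quadratic_weighted_kappa(truth, [6, 5, 4, 3, 2, 1])
```

The quadratic weights mean that a prediction five points off is penalized far more than one that is a single point off, which suits ordinal targets such as essay scores.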

# Loading models and making predictions

In [None]:
# Inference Phase: Loading models and making predictions

# Convert test embeddings to PyTorch tensors and move to the appropriate device
all_test_embeds_tensor = torch.tensor(all_test_embeds, dtype=torch.float32)
all_test_embeds_tensor = all_test_embeds_tensor.to(device)

# List to store predictions from each fold
all_preds = []

# Loop through each fold
for fold in range(10):
    # Rebuild the same architecture that was used during training
    model = MLP(input_size, hidden_size1, hidden_size2, output_size)
    model.to(device)
    
    # Load the trained model parameters onto the current device
    model_path = f'/kaggle/working/model_fold_{fold+1}.pth'
    model.load_state_dict(torch.load(model_path, map_location=device))
    
    # Set model to evaluation mode
    model.eval()
    
    # Make predictions on test data
    with torch.no_grad():
        test_preds = model(all_test_embeds_tensor).argmax(dim=1).cpu().numpy()
    
    # Append predictions from this fold to the list
    all_preds.append(test_preds + 1)  # Incrementing labels by 1 to match original indexing

# Compute the final prediction for each sample
final_preds = np.mean(all_preds, axis=0).round().astype(int)  # Average the fold predictions and round to the nearest score (not a majority vote)


1. **Model Inference**:
   - This section loads the trained models saved during the cross-validation process and uses them to make predictions on the test data.
   - The test data embeddings are converted to PyTorch tensors and moved to the appropriate device (CPU or GPU).
   - For each fold:
     - The model is reinitialized.
     - The saved model parameters are loaded from the corresponding file.
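     - The model is set to evaluation mode (`model.eval()`), which switches layers such as dropout and batch normalization to their inference behavior. This MLP contains neither, so the call acts as a safeguard here.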
     - With no gradient calculation, predictions are made on the test data using the loaded model.
     - The predicted labels are converted to NumPy arrays and appended to `all_preds`.

2. **Combining Predictions**:
   - After predictions from all folds are collected in `all_preds`, the final prediction for each sample is computed.
   - In this implementation, the final prediction for each sample is calculated by averaging the predictions from all folds (`np.mean(all_preds, axis=0)`).
   - The averaged predictions are rounded to the nearest integer and cast to integers, representing the final predicted class labels.
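
A rounded mean and a majority vote can give different answers when the folds disagree. The sketch below compares the two on made-up fold predictions. `fold_preds` and `majority_vote` are illustrative names, and the majority vote is shown only as an alternative, not as what the code above does.

```python
import numpy as np

# Hypothetical predictions: 5 folds (rows) x 3 essays (columns), scores in 1..6
fold_preds = np.array([
    [3, 2, 4],
    [3, 2, 4],
    [3, 6, 4],
    [4, 6, 5],
    [4, 6, 5],
])

# Aggregation used above: average across folds, then round to an integer score
rounded_mean = np.mean(fold_preds, axis=0).round().astype(int)


def majority_vote(column, n_scores=6):
    """Return the most frequent score in a column (ties go to the lower score)."""
    counts = np.bincount(column, minlength=n_scores + 1)
    return int(np.argmax(counts))


# Alternative aggregation: the most common prediction per essay
majority = np.array([majority_vote(fold_preds[:, i]) for i in range(fold_preds.shape[1])])

print("rounded mean:", rounded_mean)  # [3 4 4]
print("majority    :", majority)      # [3 6 4]
```

For the second essay the folds split between 2 and 6, so the rounded mean lands on 4, a score no fold predicted, while the majority vote picks 6. Note also that NumPy rounds exact halves to the nearest even integer, so a mean of 2.5 becomes 2.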


# Create Submission CSV

In [None]:
# Assign final predictions to test_preds variable
test_preds = final_preds

# Print the shape of test_preds array
print('Test preds shape:', test_preds.shape)

# Print the first 3 test predictions
print('First 3 test preds:', test_preds[:3])


In [None]:
# Read the sample submission file
sub = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv")

# Assign the final predictions to the "score" column of the submission dataframe
sub["score"] = test_preds

# Convert the "score" column to int32 data type
sub.score = sub.score.astype('int32')

# Save the submission dataframe to a CSV file
sub.to_csv("submission.csv", index=False)

# Print the shape of the submission dataframe
print("Submission shape:", sub.shape)

# Display the first few rows of the submission dataframe
sub.head()