<p align='center'><img src='./header.png'></p>

# <p align='center'> <img src='./LabLogo_Kaggle.png' height='30rem' width='30rem'> Learning Agency Lab - Automated Essay Scoring 2.0</p>

**Learning Agency Lab - Automated Essay Scoring 2.0:-** The goal is to build a model that can accurately predict the score an essay deserves based solely on its text content. The competition aims to improve student learning outcomes by providing timely and reliable feedback to overburdened educators.

## Problem Statement

Essay writing is a crucial method to evaluate student learning and performance, but it is time-consuming for educators to grade manually.<br>
- **Automated Writing Evaluation (AWE)** systems can assist in scoring essays, providing students with regular and timely feedback. However, many advancements in AWE are not widely accessible due to cost barriers. Open-source solutions are needed to make AWE technology available to every community, especially underserved ones.

## Competition Objective

The objective of this competition is to train a model to score student essays accurately. Participants are tasked with reducing the high expense and time required for manual grading, making it feasible to introduce essays into testing, a key indicator of student learning.

## Dataset
The competition dataset comprises about 24000 student-written argumentative essays. Each essay was scored on a scale of 1 to 6 (Link to the Holistic Scoring Rubric). Your goal is to predict the score an essay received from its text.

***File and Field Information:-***
Sure, here's the information organized in a tabular form:

| File Name          | Description                                             | Fields                                  |
|--------------------|---------------------------------------------------------|-----------------------------------------|
| train.csv          | Essays and scores to be used as training data           | essay_id, full_text, score              |
| test.csv           | Essays to be used as test data                          | essay_id, full_text                     |
| sample_submission.csv | A submission file in the correct format                | essay_id, score                        |

Each file contains specific fields:

- `train.csv`: Contains essays along with their unique ID (`essay_id`), the full text of the essay (`full_text`), and the holistic score of the essay on a 1-6 scale (`score`).
- `test.csv`: Contains essays to be used as test data, including their unique ID (`essay_id`) and the full text of the essay (`full_text`). This file does not include the `score` field.
- `sample_submission.csv`: A submission file template with the correct format for submission. It includes the unique ID of each essay (`essay_id`) and a placeholder for the predicted holistic score of the essay on a 1-6 scale (`score`).

This tabular representation summarizes the contents of each file and their respective fields, providing clarity on the dataset structure and file formats.

## Evaluation

Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two outcomes. This metric typically varies from 0 (random agreement) to 1 (complete agreement). In the event that there is less agreement than expected by chance, the metric may go below 0.

The quadratic weighted kappa is calculated as follows. First, an N x N histogram matrix O is constructed, such that Oi,j corresponds to the number of essay_ids i (actual) that received a predicted value j. An N-by-N matrix of weights, w, is calculated based on the difference between actual and predicted values:

<div align='center'>
<span class="MathJax" id="MathJax-Element-1-Frame" tabindex="0" data-mathml="<math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot; display=&quot;block&quot;><msub><mi>w</mi><mrow class=&quot;MJX-TeXAtom-ORD&quot;><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>=</mo><mfrac><msup><mrow><mo>(</mo><mi>i</mi><mo>&amp;#x2212;</mo><mi>j</mi><mo>)</mo></mrow><mn>2</mn></msup><msup><mrow><mo>(</mo><mi>N</mi><mo>&amp;#x2212;</mo><mn>1</mn><mo>)</mo></mrow><mn>2</mn></msup></mfrac></math>" role="presentation" style="text-align: center; position: relative;"><nobr aria-hidden="true"><span class="math" id="MathJax-Span-1" style="width: 8.96em; display: inline-block;"><span style="display: inline-block; position: relative; width: 7.089em; height: 0px; font-size: 126%;"><span style="position: absolute; clip: rect(0.4em, 1007.09em, 3.631em, -999.997em); top: -2.265em; left: 0em;"><span class="mrow" id="MathJax-Span-2"><span class="msubsup" id="MathJax-Span-3"><span style="display: inline-block; position: relative; width: 1.533em; height: 0px;"><span style="position: absolute; clip: rect(3.404em, 1000.68em, 4.198em, -999.997em); top: -4.022em; left: 0em;"><span class="mi" id="MathJax-Span-4" style="font-family: MathJax_Math-italic;">w</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -3.852em; left: 0.74em;"><span class="texatom" id="MathJax-Span-5"><span class="mrow" id="MathJax-Span-6"><span class="mi" id="MathJax-Span-7" style="font-size: 70.7%; font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-8" style="font-size: 70.7%; font-family: MathJax_Main;">,</span><span class="mi" id="MathJax-Span-9" style="font-size: 70.7%; font-family: MathJax_Math-italic;">j</span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span><span class="mo" id="MathJax-Span-10" style="font-family: MathJax_Main; padding-left: 0.286em;">=</span><span class="mfrac" id="MathJax-Span-11" style="padding-left: 0.286em;"><span style="display: inline-block; position: relative; width: 3.971em; height: 0px; margin-right: 0.116em; margin-left: 0.116em;"><span style="position: absolute; clip: rect(2.894em, 1003.18em, 4.425em, -999.997em); top: -4.759em; left: 50%; margin-left: -1.584em;"><span class="msubsup" id="MathJax-Span-12"><span style="display: inline-block; position: relative; width: 3.177em; height: 0px;"><span style="position: absolute; clip: rect(3.121em, 1002.67em, 4.425em, -999.997em); top: -4.022em; left: 0em;"><span class="mrow" id="MathJax-Span-13"><span class="mo" id="MathJax-Span-14" style="font-family: MathJax_Main;">(</span><span class="mi" id="MathJax-Span-15" style="font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-16" style="font-family: MathJax_Main; padding-left: 0.23em;">−</span><span class="mi" id="MathJax-Span-17" style="font-family: MathJax_Math-italic; padding-left: 0.23em;">j</span><span class="mo" id="MathJax-Span-18" style="font-family: MathJax_Main;">)</span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -4.476em; left: 2.781em;"><span class="mn" id="MathJax-Span-19" style="font-size: 70.7%; font-family: MathJax_Main;">2</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; clip: rect(2.894em, 1003.86em, 4.425em, -999.997em); top: -3.115em; left: 50%; margin-left: -1.925em;"><span class="msubsup" id="MathJax-Span-20"><span style="display: inline-block; position: relative; width: 3.858em; height: 0px;"><span style="position: absolute; clip: rect(3.121em, 1003.29em, 4.425em, -999.997em); top: -4.022em; left: 0em;"><span class="mrow" id="MathJax-Span-21"><span class="mo" id="MathJax-Span-22" style="font-family: MathJax_Main;">(</span><span class="mi" id="MathJax-Span-23" style="font-family: MathJax_Math-italic;">N<span style="display: inline-block; overflow: hidden; height: 1px; width: 0.06em;"></span></span><span class="mo" id="MathJax-Span-24" style="font-family: MathJax_Main; padding-left: 0.23em;">−</span><span class="mn" id="MathJax-Span-25" style="font-family: MathJax_Main; padding-left: 0.23em;">1</span><span class="mo" id="MathJax-Span-26" style="font-family: MathJax_Main;">)</span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -4.476em; left: 3.404em;"><span class="mn" id="MathJax-Span-27" style="font-size: 70.7%; font-family: MathJax_Main;">2</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; clip: rect(0.853em, 1003.97em, 1.25em, -999.997em); top: -1.301em; left: 0em;"><span style="display: inline-block; overflow: hidden; vertical-align: 0em; border-top: 1.3px solid; width: 3.971em; height: 0px;"></span><span style="display: inline-block; width: 0px; height: 1.08em;"></span></span></span></span></span><span style="display: inline-block; width: 0px; height: 2.27em;"></span></span></span><span style="display: inline-block; overflow: hidden; vertical-align: -1.568em; border-left: 0px solid; width: 0px; height: 3.718em;"></span></span></nobr><span class="MJX_Assistive_MathML MJX_Assistive_MathML_Block" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block">

</div>
<br>

An N-by-N histogram matrix of expected outcomes, E, is calculated assuming that there is no correlation between values. 
This is calculated as the outer product between the actual histogram vector of outcomes and the predicted histogram vector, normalized such that E and O have the same sum.

From these three matrices, the quadratic weighted kappa is calculated as: 

<div align="center">
<span class="math" id="MathJax-Span-138" style="width: 11.964em; display: inline-block;"><span style="display: inline-block; position: relative; width: 9.47em; height: 0px; font-size: 126%;"><span style="position: absolute; clip: rect(0.456em, 1009.41em, 3.574em, -999.997em); top: -2.265em; left: 0em;"><span class="mrow" id="MathJax-Span-139"><span class="mi" id="MathJax-Span-140" style="font-family: MathJax_Math-italic;">κ</span><span class="mo" id="MathJax-Span-141" style="font-family: MathJax_Main; padding-left: 0.286em;">=</span><span class="mn" id="MathJax-Span-142" style="font-family: MathJax_Main; padding-left: 0.286em;">1</span><span class="mo" id="MathJax-Span-143" style="font-family: MathJax_Main; padding-left: 0.23em;">−</span><span class="mfrac" id="MathJax-Span-144" style="padding-left: 0.23em;"><span style="display: inline-block; position: relative; width: 5.275em; height: 0px; margin-right: 0.116em; margin-left: 0.116em;"><span style="position: absolute; clip: rect(3.121em, 1005.11em, 4.651em, -999.997em); top: -4.929em; left: 50%; margin-left: -2.548em;"><span class="mrow" id="MathJax-Span-145"><span class="munderover" id="MathJax-Span-146"><span style="display: inline-block; position: relative; width: 1.874em; height: 0px;"><span style="position: absolute; clip: rect(3.121em, 1001.02em, 4.425em, -999.997em); top: -4.022em; left: 0em;"><span class="mo" id="MathJax-Span-147" style="font-family: MathJax_Size1; vertical-align: 0em;">∑</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -3.739em; left: 1.08em;"><span class="texatom" id="MathJax-Span-148"><span class="mrow" id="MathJax-Span-149"><span class="mi" id="MathJax-Span-150" style="font-size: 70.7%; font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-151" style="font-size: 70.7%; font-family: MathJax_Main;">,</span><span class="mi" id="MathJax-Span-152" style="font-size: 70.7%; font-family: MathJax_Math-italic;">j</span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span><span class="msubsup" id="MathJax-Span-153" style="padding-left: 0.173em;"><span style="display: inline-block; position: relative; width: 1.533em; height: 0px;"><span style="position: absolute; clip: rect(3.404em, 1000.68em, 4.198em, -999.997em); top: -4.022em; left: 0em;"><span class="mi" id="MathJax-Span-154" style="font-family: MathJax_Math-italic;">w</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -3.852em; left: 0.74em;"><span class="texatom" id="MathJax-Span-155"><span class="mrow" id="MathJax-Span-156"><span class="mi" id="MathJax-Span-157" style="font-size: 70.7%; font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-158" style="font-size: 70.7%; font-family: MathJax_Main;">,</span><span class="mi" id="MathJax-Span-159" style="font-size: 70.7%; font-family: MathJax_Math-italic;">j</span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span><span class="msubsup" id="MathJax-Span-160"><span style="display: inline-block; position: relative; width: 1.59em; height: 0px;"><span style="position: absolute; clip: rect(3.177em, 1000.74em, 4.198em, -999.997em); top: -4.022em; left: 0em;"><span class="mi" id="MathJax-Span-161" style="font-family: MathJax_Math-italic;">O</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -3.852em; left: 0.74em;"><span class="texatom" id="MathJax-Span-162"><span class="mrow" id="MathJax-Span-163"><span class="mi" id="MathJax-Span-164" style="font-size: 70.7%; font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-165" style="font-size: 70.7%; font-family: MathJax_Main;">,</span><span class="mi" id="MathJax-Span-166" style="font-size: 70.7%; font-family: MathJax_Math-italic;">j</span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; clip: rect(3.121em, 1005.11em, 4.651em, -999.997em); top: -3.285em; left: 50%; margin-left: -2.548em;"><span class="mrow" id="MathJax-Span-167"><span class="munderover" id="MathJax-Span-168"><span style="display: inline-block; position: relative; width: 1.874em; height: 0px;"><span style="position: absolute; clip: rect(3.121em, 1001.02em, 4.425em, -999.997em); top: -4.022em; left: 0em;"><span class="mo" id="MathJax-Span-169" style="font-family: MathJax_Size1; vertical-align: 0em;">∑</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -3.739em; left: 1.08em;"><span class="texatom" id="MathJax-Span-170"><span class="mrow" id="MathJax-Span-171"><span class="mi" id="MathJax-Span-172" style="font-size: 70.7%; font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-173" style="font-size: 70.7%; font-family: MathJax_Main;">,</span><span class="mi" id="MathJax-Span-174" style="font-size: 70.7%; font-family: MathJax_Math-italic;">j</span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span><span class="msubsup" id="MathJax-Span-175" style="padding-left: 0.173em;"><span style="display: inline-block; position: relative; width: 1.533em; height: 0px;"><span style="position: absolute; clip: rect(3.404em, 1000.68em, 4.198em, -999.997em); top: -4.022em; left: 0em;"><span class="mi" id="MathJax-Span-176" style="font-family: MathJax_Math-italic;">w</span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -3.852em; left: 0.74em;"><span class="texatom" id="MathJax-Span-177"><span class="mrow" id="MathJax-Span-178"><span class="mi" id="MathJax-Span-179" style="font-size: 70.7%; font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-180" style="font-size: 70.7%; font-family: MathJax_Main;">,</span><span class="mi" id="MathJax-Span-181" style="font-size: 70.7%; font-family: MathJax_Math-italic;">j</span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span><span class="msubsup" id="MathJax-Span-182"><span style="display: inline-block; position: relative; width: 1.533em; height: 0px;"><span style="position: absolute; clip: rect(3.177em, 1000.74em, 4.198em, -999.997em); top: -4.022em; left: 0em;"><span class="mi" id="MathJax-Span-183" style="font-family: MathJax_Math-italic;">E<span style="display: inline-block; overflow: hidden; height: 1px; width: 0.003em;"></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; top: -3.852em; left: 0.74em;"><span class="texatom" id="MathJax-Span-184"><span class="mrow" id="MathJax-Span-185"><span class="mi" id="MathJax-Span-186" style="font-size: 70.7%; font-family: MathJax_Math-italic;">i</span><span class="mo" id="MathJax-Span-187" style="font-size: 70.7%; font-family: MathJax_Main;">,</span><span class="mi" id="MathJax-Span-188" style="font-size: 70.7%; font-family: MathJax_Math-italic;">j</span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span></span></span></span><span style="display: inline-block; width: 0px; height: 4.028em;"></span></span><span style="position: absolute; clip: rect(0.853em, 1005.27em, 1.25em, -999.997em); top: -1.301em; left: 0em;"><span style="display: inline-block; overflow: hidden; vertical-align: 0em; border-top: 1.3px solid; width: 5.275em; height: 0px;"></span><span style="display: inline-block; width: 0px; height: 1.08em;"></span></span></span></span><span class="mo" id="MathJax-Span-189" style="font-family: MathJax_Main;">.</span></span><span style="display: inline-block; width: 0px; height: 2.27em;"></span></span></span><span style="display: inline-block; overflow: hidden; vertical-align: -1.496em; border-left: 0px solid; width: 0px; height: 3.718em;"></span></span>
</div>

## Submission File

For each essay_id in the test set, participants must predict the corresponding score. The submission file should contain a header and have the following format:

```
essay_id,score
000d118,3
000fe60,3
001ab80,4
```

---
For detailed instructions, guidelines, and access to the dataset, please visit the competition page on Kaggle: [Learning Agency Lab - Automated Essay Scoring 2.0](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2)



# 1. Import modules

In [17]:
# !pip install "/kaggle/input/pyspellchecker/pyspellchecker-0.7.2-py3-none-any.whl"

In [18]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import polars as pl

from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

import string
import re
from spellchecker import SpellChecker
import lightgbm as lgb

import warnings
import logging
warnings.simplefilter("ignore")
logging.disable(logging.ERROR)

# 2. Load dataset and initial configuration

In [19]:
class PATHS:
    train_path = 'data/train.csv'
    test_path = 'data/test.csv'
    sub_path = 'data/sample_submission.csv'

In [20]:
class CFG:
    n_splits = 5
    seed = 42
    num_labels = 6

In [21]:
train = pd.read_csv(PATHS.train_path)
train.head(3)

Unnamed: 0,essay_id,full_text,score
0,000d118,Many people have car where they live. The thin...,3
1,000fe60,I am a scientist at NASA that is discussing th...,3
2,001ab80,People always wish they had the same technolog...,4


In [22]:
test = pd.read_csv(PATHS.test_path)
test.head(3)

Unnamed: 0,essay_id,full_text
0,000d118,Many people have car where they live. The thin...
1,000fe60,I am a scientist at NASA that is discussing th...
2,001ab80,People always wish they had the same technolog...


# 3. Feature Engineering

## 3.1 Data preprocessing functions definations

In [23]:
def removeHTML(x):
    html=re.compile(r'<.*?>')
    return html.sub(r'',x)


cList = {
    "ain't": "am not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not", 
    "he'd": "he would",  ## --> he had or he would
    "he'd've": "he would have","he'll": "he will", "he'll've": "he will have", "he's": "he is", 
    "how'd": "how did","how'd'y": "how do you","how'll": "how will","how's": "how is",
    "I'd": "I would",   ## --> I had or I would
    "I'd've": "I would have","I'll": "I will","I'll've": "I will have","I'm": "I am","I've": "I have","isn't": "is not",
    "it'd": "it had",   ## --> It had or It would
    "it'd've": "it would have","it'll": "it will","it'll've": "it will have","it's": "it is",
    "let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not","mightn't've": "might not have",
    "must've": "must have","mustn't": "must not","mustn't've": "must not have",
    "needn't": "need not","needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not","oughtn't've": "ought not have",
    "shan't": "shall not","sha'n't": "shall not","shan't've": "shall not have",
    "she'd": "she would",   ## --> It had or It would
    "she'd've": "she would have","she'll": "she will","she'll've": "she will have","she's": "she is",
    "should've": "should have","shouldn't": "should not","shouldn't've": "should not have",
    "so've": "so have","so's": "so is",
    "that'd": "that would",
    "that'd've": "that would have","that's": "that is",
    "there'd": "there had",
    "there'd've": "there would have","there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have","they'll": "they will","they'll've": "they will have","they're": "they are","they've": "they have",
    "to've": "to have","wasn't": "was not","weren't": "were not",
    "we'd": "we had",
    "we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
    "what'll": "what will","what'll've": "what will have","what're": "what are","what's": "what is","what've": "what have",
    "when's": "when is","when've": "when have",
    "where'd": "where did","where's": "where is","where've": "where have",
    "who'll": "who will","who'll've": "who will have","who's": "who is","who've": "who have","why's": "why is","why've": "why have",
    "will've": "will have","won't": "will not","won't've": "will not have",
    "would've": "would have","wouldn't": "would not","wouldn't've": "would not have",
    "y'all": "you all","y'alls": "you alls","y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are",
    "y'all've": "you all have","you'd": "you had","you'd've": "you would have","you'll": "you you will","you'll've": "you you will have",
    "you're": "you are",  "you've": "you have"
}
c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)

def dataPreprocessing(x):
    # Convert words to lowercase
    x = x.lower()
    # Remove HTML
    x = removeHTML(x)
    # Delete strings starting with @
    x = re.sub("@\w+", '',x)
    # Delete Numbers
    x = re.sub("'\d+", '',x)
    x = re.sub("\d+", '',x)
    # Delete URL
    x = re.sub("http\w+", '',x)
    # Remove \xa0
    x = x.replace(u'\xa0',' ')
    # Replace consecutive empty spaces with a single space character
    x = re.sub(r"\s+", " ", x)
    x = expandContractions(x)
    # Replace consecutive commas and periods with one comma and period character
    x = re.sub(r"\.+", ".", x)
    x = re.sub(r"\,+", ",", x)
    # Remove empty characters at the beginning and end
    x = x.strip()
    return x

def remove_punctuation(text):
    # string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def dataPreprocessing_w_contract_punct_remove(x):
    # Convert words to lowercase
    x = x.lower()
    # Remove HTML
    x = removeHTML(x)
    # Delete strings starting with @
    x = re.sub("@\w+", '',x)
    # Delete Numbers
    x = re.sub("'\d+", '',x)
    x = re.sub("\d+", '',x)
    # Delete URL
    x = re.sub("http\w+", '',x)
    # Replace consecutive empty spaces with a single space character
    x = re.sub(r"\s+", " ", x)
    x = expandContractions(x)
    # Replace consecutive commas and periods with one comma and period character
    x = re.sub(r"\.+", ".", x)
    x = re.sub(r"\,+", ",", x)
    x = remove_punctuation(x)
    # Remove empty characters at the beginning and end
    x = x.strip()
    return x

## 3.2 Paragraph based feature

<a id='paragraph-feature'></a>

In [24]:
# TODO: can be fixed by keeping "\n" and removed empty paragraph entries
columns = [(pl.col("full_text").str.split(by="\n\n").alias("paragraph"))]
train = pl.from_pandas(train).with_columns(columns)
test = pl.from_pandas(test).with_columns(columns)

In [25]:
# paragraph features
def Paragraph_Preprocess(tmp):
    # Expand the paragraph list into several lines of data
    tmp = tmp.explode('paragraph')
    # Paragraph preprocessing
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(dataPreprocessing))
    # Calculate the length of each paragraph
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(lambda x: len(x)).alias("paragraph_len"))
    # Calculate the number of sentences and words in each paragraph
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(lambda x: len(x.split('.'))).alias("paragraph_sentence_cnt"),
                    pl.col('paragraph').map_elements(lambda x: len(x.split(' '))).alias("paragraph_word_cnt"),)
    return tmp

# feature_eng
paragraph_fea = ['paragraph_len','paragraph_sentence_cnt','paragraph_word_cnt']
def Paragraph_Eng(train_tmp):
    aggs = [
        # Count the number of paragraph lengths greater than and less than the i-value
        *[pl.col('paragraph').filter(pl.col('paragraph_len') >= i).count().alias(f"paragraph_{i}_cnt") for i in [50,75,100,125,150,175,200,250,300,350,400,500,600,700] ], 
        *[pl.col('paragraph').filter(pl.col('paragraph_len') <= i).count().alias(f"paragraph_{i}_cnt") for i in [25,49]], 
        # other
        *[pl.col(fea).max().alias(f"{fea}_max") for fea in paragraph_fea],
        *[pl.col(fea).mean().alias(f"{fea}_mean") for fea in paragraph_fea],
        *[pl.col(fea).min().alias(f"{fea}_min") for fea in paragraph_fea],
        *[pl.col(fea).first().alias(f"{fea}_first") for fea in paragraph_fea],
        *[pl.col(fea).last().alias(f"{fea}_last") for fea in paragraph_fea],
        *[pl.col(fea).sum().alias(f"{fea}_sum") for fea in paragraph_fea],
        *[pl.col(fea).kurtosis().alias(f"{fea}_kurtosis") for fea in paragraph_fea],
        *[pl.col(fea).quantile(0.25).alias(f"{fea}_q1") for fea in paragraph_fea],  
        *[pl.col(fea).quantile(0.75).alias(f"{fea}_q3") for fea in paragraph_fea],
    ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Paragraph_Preprocess(train)
train_feats = Paragraph_Eng(tmp)

# Obtain feature names
feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(5)

Features Number:  43


Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,paragraph_word_cnt_sum,paragraph_len_kurtosis,paragraph_sentence_cnt_kurtosis,paragraph_word_cnt_kurtosis,paragraph_len_q1,paragraph_sentence_cnt_q1,paragraph_word_cnt_q1,paragraph_len_q3,paragraph_sentence_cnt_q3,paragraph_word_cnt_q3
0,000d118,1,1,1,1,1,1,1,1,1,...,494,,,,2645.0,14.0,494.0,2645.0,14.0,494.0
1,000fe60,5,5,5,5,5,5,4,3,3,...,335,-1.301431,-1.044379,-1.310957,237.0,4.0,48.0,398.0,5.0,77.0
2,001ab80,4,4,4,4,4,4,4,4,4,...,550,-1.740076,-1.592593,-1.696723,576.0,5.0,101.0,927.0,8.0,165.0
3,001bdc0,5,5,5,5,4,4,4,4,4,...,448,-1.449534,-1.75,-1.333444,367.0,2.0,64.0,806.0,8.0,128.0
4,002ba53,4,4,4,4,4,4,4,4,4,...,373,-1.475737,-1.616116,-1.457252,19.0,1.0,3.0,559.0,5.0,93.0


## 3.3 Sentence based features

source: https://www.kaggle.com/code/ye11725/tfidf-lgbm-baseline-with-code-comments/notebook#Features-engineering

In [26]:
# sentence feature
def Sentence_Preprocess(tmp):
    # Preprocess full_text and use periods to segment sentences in the text
    tmp = tmp.with_columns(pl.col('full_text').map_elements(dataPreprocessing).str.split(by=".").alias("sentence"))
    tmp = tmp.explode('sentence')
    # Calculate the length of a sentence
    tmp = tmp.with_columns(pl.col('sentence').map_elements(lambda x: len(x)).alias("sentence_len"))
    # Filter out the portion of data with a sentence length greater than 15
    tmp = tmp.filter(pl.col('sentence_len')>=15)
    # Count the number of words in each sentence
    tmp = tmp.with_columns(pl.col('sentence').map_elements(lambda x: len(x.split(' '))).alias("sentence_word_cnt"))
    return tmp

# feature_eng
sentence_fea = ['sentence_len','sentence_word_cnt']
def Sentence_Eng(train_tmp):
    aggs = [
        # Count the number of sentences with a length greater than i
        *[pl.col('sentence').filter(pl.col('sentence_len') >= i).count().alias(f"sentence_{i}_cnt") for i in [15,50,100,150,200,250,300] ], 
        # other
        *[pl.col(fea).max().alias(f"{fea}_max") for fea in sentence_fea],
        *[pl.col(fea).mean().alias(f"{fea}_mean") for fea in sentence_fea],
        *[pl.col(fea).min().alias(f"{fea}_min") for fea in sentence_fea],
        *[pl.col(fea).first().alias(f"{fea}_first") for fea in sentence_fea],
        *[pl.col(fea).last().alias(f"{fea}_last") for fea in sentence_fea],
        *[pl.col(fea).sum().alias(f"{fea}_sum") for fea in sentence_fea],
        *[pl.col(fea).kurtosis().alias(f"{fea}_kurtosis") for fea in sentence_fea],
        *[pl.col(fea).quantile(0.25).alias(f"{fea}_q1") for fea in sentence_fea], 
        *[pl.col(fea).quantile(0.75).alias(f"{fea}_q3") for fea in sentence_fea], 
        ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Sentence_Preprocess(train)

# Merge the newly generated feature data with the previously generated feature data
train_feats = train_feats.merge(Sentence_Eng(tmp), on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(3)

Features Number:  68


Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,sentence_len_last,sentence_word_cnt_last,sentence_len_sum,sentence_word_cnt_sum,sentence_len_kurtosis,sentence_word_cnt_kurtosis,sentence_len_q1,sentence_word_cnt_q1,sentence_len_q3,sentence_word_cnt_q3
0,000d118,1,1,1,1,1,1,1,1,1,...,47,10,2632,506,1.485673,2.062371,110.0,21.0,225.0,37.0
1,000fe60,5,5,5,5,5,5,4,3,3,...,125,26,1649,351,1.085089,0.464288,53.0,13.0,125.0,26.0
2,001ab80,4,4,4,4,4,4,4,4,4,...,58,10,3041,573,-0.423362,0.129704,90.0,17.0,151.0,29.0


## 3.4 Word based feature

source: https://www.kaggle.com/code/ye11725/tfidf-lgbm-baseline-with-code-comments/notebook#Features-engineering

In [27]:
# word feature
def Word_Preprocess(tmp):
    # Preprocess full_text and use spaces to separate words from the text
    tmp = tmp.with_columns(pl.col('full_text').map_elements(dataPreprocessing).str.split(by=" ").alias("word"))
    tmp = tmp.explode('word')
    # Calculate the length of each word
    tmp = tmp.with_columns(pl.col('word').map_elements(lambda x: len(x)).alias("word_len"))
    # Delete data with a word length of 0
    tmp = tmp.filter(pl.col('word_len')!=0)
    
    return tmp

# feature_eng
def Word_Eng(train_tmp):
    aggs = [
        # Count the number of words with a length greater than i+1
        *[pl.col('word').filter(pl.col('word_len') >= i+1).count().alias(f"word_{i+1}_cnt") for i in range(15) ], 
        # other
        pl.col('word_len').max().alias(f"word_len_max"),
        pl.col('word_len').mean().alias(f"word_len_mean"),
        pl.col('word_len').std().alias(f"word_len_std"),
        pl.col('word_len').quantile(0.25).alias(f"word_len_q1"),
        pl.col('word_len').quantile(0.50).alias(f"word_len_q2"),
        pl.col('word_len').quantile(0.75).alias(f"word_len_q3"),
        ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Word_Preprocess(train)

# Merge the newly generated feature data with the previously generated feature data
train_feats = train_feats.merge(Word_Eng(tmp), on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(3)

Features Number:  89


Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,word_12_cnt,word_13_cnt,word_14_cnt,word_15_cnt,word_len_max,word_len_mean,word_len_std,word_len_q1,word_len_q2,word_len_q3
0,000d118,1,1,1,1,1,1,1,1,1,...,6,6,5,2,25,4.356275,2.537066,3.0,4.0,5.0
1,000fe60,5,5,5,5,5,5,4,3,3,...,0,0,0,0,11,3.976119,2.069025,2.0,4.0,5.0
2,001ab80,4,4,4,4,4,4,4,4,4,...,14,10,5,2,15,4.574545,2.604621,3.0,4.0,5.0


## 3.5 Character TFIDF feature:

For TFIDF vector generation we use TfidfVectorizer provided by [sickit-learn liberay](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

#### Terms:
* **TF (Term frequency)**: Number of time a term occur in a document / Total number of term in the document.
* **DF (Document frequency)**: Number of document where the term appear / Total number of document.
* **IDF (Inverse Document Frequency)**: 1 / Document frequency
***
### TfidfVectorizer parameters:
* **tokenizer**: Is set to `lambda x: x` which means the text will be passed as it is. 
* **preprocessor**: Is set to `lambda x: x` which means the text will be passed as it is.
* **token_pattern**: Is not set to `None` means word will be taken as token as it is without any word-level processing.
* **strip_accents**: Ts set to `unicode` which means include unicode characters during preprocessing step.
* **analyzer**: Ts set to `word` which means the feature (terms or token) will be the words
* **ngram_range**: ngram_range equal to `(1, 2)` which means unigrams and bigrams
* **min_df**: Is equal to `0.05` means ignore terms that occur in less the 5% of documents.
* **max_df**: Is equal to `0.95` means ignore terms that occur in more them 95% of documents.
* **sublinear_tf**: Is equal to `True` means replace tf with 1 + log(tf)

##### Note:
* **tokenizer=lambda x: x**: "`words are not tokenized from full-text? Tokenizer should only be overided by identity if text is already tokenized before. Perhaps vectorizer is receiving string (char sequence) instead of word sequence, so it behaves like a char ngram vectorizer`" qouted from notebook [here](https://www.kaggle.com/code/guillaums/error-in-tfidf-vectorizer-in-baseline-nbs?scriptVersionId=175110986&cellId=11)

In [28]:
# TfidfVectorizer parameter
vectorizer = TfidfVectorizer(
            tokenizer=lambda x: x,
            preprocessor=lambda x: x,
            token_pattern=None,
            strip_accents='unicode',
            analyzer = 'word',
            ngram_range=(1,3),
            min_df=0.05,
            max_df=0.95,
            sublinear_tf=True,
)
# Fit all datasets into TfidfVector,this may cause leakage and overly optimistic CV scores
train_tfid = vectorizer.fit_transform([i for i in train['full_text']])

print("#"*80)
vect_feat_names=vectorizer.get_feature_names_out()
print(vect_feat_names[100:110])
print("#"*80, "\n\n")

# Convert to array
dense_matrix = train_tfid.toarray()

# Convert to dataframe
df = pd.DataFrame(dense_matrix)

# rename features
tfid_columns = [ f'tfid_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = train_feats['essay_id']

# Merge the newly generated feature data with the previously generated feature data
train_feats = train_feats.merge(df, on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(3)

################################################################################
['  D e' '  D o' '  D r' '  E' '  E a' '  E l' '  E u' '  E v' '  E x'
 '  F']
################################################################################ 


Features Number:  3380


Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,tfid_3281,tfid_3282,tfid_3283,tfid_3284,tfid_3285,tfid_3286,tfid_3287,tfid_3288,tfid_3289,tfid_3290
0,000d118,1,1,1,1,1,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.034738,0.071064,0.0,0.0
1,000fe60,5,5,5,5,5,5,4,3,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,001ab80,4,4,4,4,4,4,4,4,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
stopwords_list = stopwords.words('english')
# TfidfVectorizer parameter
word_vectorizer = TfidfVectorizer(
    strip_accents='ascii',
    analyzer = 'word',
    ngram_range=(1,1),
    min_df=0.05,
    max_df=0.95,
    sublinear_tf=True,
    stop_words=stopwords_list,
)
# Fit all datasets into TfidfVector,this may cause leakage and overly optimistic CV scores
processed_text = train.to_pandas()["full_text"].apply(lambda x: dataPreprocessing_w_contract_punct_remove(x))
train_tfid = word_vectorizer.fit_transform([i for i in processed_text])

# Convert to array
dense_matrix = train_tfid.toarray()
# Convert to dataframe
df = pd.DataFrame(dense_matrix)
# rename features
tfid_w_columns = [ f'tfid_w_{i}' for i in range(len(df.columns))]
df.columns = tfid_w_columns
df['essay_id'] = train_feats['essay_id']

df.head()
# Merge the newly generated feature data with the previously generated feature data
train_feats = train_feats.merge(df, on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(3)


Features Number:  3894


Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,tfid_w_504,tfid_w_505,tfid_w_506,tfid_w_507,tfid_w_508,tfid_w_509,tfid_w_510,tfid_w_511,tfid_w_512,tfid_w_513
0,000d118,1,1,1,1,1,1,1,1,1,...,0.0,0.062121,0.0,0.0,0.0,0.0,0.099808,0.08042,0.0,0.0
1,000fe60,5,5,5,5,5,5,4,3,3,...,0.181618,0.0,0.0,0.0,0.095463,0.0,0.0,0.0,0.0,0.0
2,001ab80,4,4,4,4,4,4,4,4,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<a id='extra-feature'></a>
## 3.7 Extra features:
Reference: https://www.kaggle.com/code/tsunotsuno/updated-debertav3-lgbm-with-spell-autocorrect

In [30]:
class Preprocessor:
    def __init__(self) -> None:
        self.twd = TreebankWordDetokenizer()
        self.STOP_WORDS = set(stopwords.words('english'))
        self.spellchecker = SpellChecker()

    def spelling(self, text):
        wordlist=text.split()
        amount_miss = len(list(self.spellchecker.unknown(wordlist)))
        return amount_miss
    
    def count_sym(self, text, sym):
        sym_count = 0
        for l in text:
            if l == sym:
                sym_count += 1
        return sym_count

    def run(self, data: pd.DataFrame, mode:str) -> pd.DataFrame:
        
        # preprocessing the text
        data["processed_text"] = data["full_text"].apply(lambda x: dataPreprocessing_w_contract_punct_remove(x))
        
        # Text tokenization
        data["text_tokens"] = data["processed_text"].apply(lambda x: word_tokenize(x))
        
        # essay length
        data["text_length"] = data["processed_text"].apply(lambda x: len(x))
        
        # essay word count
        data["word_count"] = data["text_tokens"].apply(lambda x: len(x))
        
        # essay unique word count
        data["unique_word_count"] = data["text_tokens"].apply(lambda x: len(set(x)))
        
        # essay sentence count
        data["sentence_count"] = data["full_text"].apply(lambda x: len(x.split('.')))
        
        # essay paragraph count
        data["paragraph_count"] = data["full_text"].apply(lambda x: len(x.split('\n\n')))
        
        # count misspelling
        data["splling_err_num"] = data["processed_text"].apply(self.spelling)
        print("Spelling mistake count done")
        
        return data
    


In [31]:
preprocessor = Preprocessor()
tmp = preprocessor.run(train.to_pandas(), mode="train")
train_feats = train_feats.merge(tmp, on='essay_id', how='left')
feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))


Spelling mistake count done


In [32]:
print('Features Number: ',len(feature_names))
train_feats.head(3)

Features Number:  3904


Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,score,paragraph,processed_text,text_tokens,text_length,word_count,unique_word_count,sentence_count,paragraph_count,splling_err_num
0,000d118,1,1,1,1,1,1,1,1,1,...,3,[Many people have car where they live. The thi...,many people have car where they live the thing...,"[many, people, have, car, where, they, live, t...",2607,492,222,14,1,23
1,000fe60,5,5,5,5,5,5,4,3,3,...,3,[I am a scientist at NASA that is discussing t...,i am a scientist at nasa that is discussing th...,"[i, am, a, scientist, at, nasa, that, is, disc...",1630,335,145,20,5,7
2,001ab80,4,4,4,4,4,4,4,4,4,...,4,[People always wish they had the same technolo...,people always wish they had the same technolog...,"[people, always, wish, they, had, the, same, t...",3009,550,226,25,4,7


## 3.8 Test dataset featurization

In [33]:

# Paragraph
tmp = Paragraph_Preprocess(test)
test_feats = Paragraph_Eng(tmp)

# Sentence
tmp = Sentence_Preprocess(test)
test_feats = test_feats.merge(Sentence_Eng(tmp), on='essay_id', how='left')

# Word
tmp = Word_Preprocess(test)
test_feats = test_feats.merge(Word_Eng(tmp), on='essay_id', how='left')



In [34]:

# Tfidf
test_tfid = vectorizer.transform([i for i in test['full_text']])
dense_matrix = test_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = test_feats['essay_id']
test_feats = test_feats.merge(df, on='essay_id', how='left')



In [35]:

# Word Tfidf
processed_text = test.to_pandas()["full_text"].apply(lambda x: dataPreprocessing_w_contract_punct_remove(x))
# train_w_tfid = word_vectorizer.fit_transform(train['full_text'])
test_w_tfid = word_vectorizer.fit_transform([i for i in processed_text])
dense_matrix = test_w_tfid.toarray()
df_w = pd.DataFrame(dense_matrix)
tfid_w_columns = [ f'tfid_w_{i}' for i in range(len(df_w.columns))]
df_w.columns = tfid_w_columns
df_w['essay_id'] = test_feats['essay_id']
test_feats = test_feats.merge(df_w, on='essay_id', how='left')


In [36]:
# Extra feature

preprocessor2 = Preprocessor()
tmp = preprocessor2.run(test.to_pandas(), mode="train")
test_feats = test_feats.merge(tmp, on='essay_id', how='left')




Spelling mistake count done


In [37]:
test_feats.head(3)

Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,full_text,paragraph,processed_text,text_tokens,text_length,word_count,unique_word_count,sentence_count,paragraph_count,splling_err_num
0,000d118,1,1,1,1,1,1,1,1,1,...,Many people have car where they live. The thin...,[Many people have car where they live. The thi...,many people have car where they live the thing...,"[many, people, have, car, where, they, live, t...",2607,492,222,14,1,23
1,000fe60,5,5,5,5,5,5,4,3,3,...,I am a scientist at NASA that is discussing th...,[I am a scientist at NASA that is discussing t...,i am a scientist at nasa that is discussing th...,"[i, am, a, scientist, at, nasa, that, is, disc...",1630,335,145,20,5,7
2,001ab80,4,4,4,4,4,4,4,4,4,...,People always wish they had the same technolog...,[People always wish they had the same technolo...,people always wish they had the same technolog...,"[people, always, wish, they, had, the, same, t...",3009,550,226,25,4,7


In [38]:

# Features number
feature_names = list(filter(lambda x: x not in ['essay_id','score'], test_feats.columns))
print('Features number: ',len(feature_names))
test_feats.head(3)

Features number:  3758


Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,full_text,paragraph,processed_text,text_tokens,text_length,word_count,unique_word_count,sentence_count,paragraph_count,splling_err_num
0,000d118,1,1,1,1,1,1,1,1,1,...,Many people have car where they live. The thin...,[Many people have car where they live. The thi...,many people have car where they live the thing...,"[many, people, have, car, where, they, live, t...",2607,492,222,14,1,23
1,000fe60,5,5,5,5,5,5,4,3,3,...,I am a scientist at NASA that is discussing th...,[I am a scientist at NASA that is discussing t...,i am a scientist at nasa that is discussing th...,"[i, am, a, scientist, at, nasa, that, is, disc...",1630,335,145,20,5,7
2,001ab80,4,4,4,4,4,4,4,4,4,...,People always wish they had the same technolog...,[People always wish they had the same technolo...,people always wish they had the same technolog...,"[people, always, wish, they, had, the, same, t...",3009,550,226,25,4,7


# 4. Data preparation

## 4.1 Add k-fold details

In [39]:
skf = StratifiedKFold(n_splits=CFG.n_splits, shuffle=True, random_state=CFG.seed)
for i, (_, val_index) in enumerate(skf.split(train_feats, train_feats["score"])):
    train_feats.loc[val_index, "fold"] = i
print(train_feats.shape)
# train_feats.head()

(17307, 3907)


In [40]:
test_feats.shape

(3, 3759)

## 4.2 Feature selection

In [41]:
target = "score"
train_drop_columns = ["essay_id", "fold", "full_text", "paragraph", "text_tokens", "processed_text",'tfid_w_368', 'tfid_w_369', 'tfid_w_370', 'tfid_w_371', 'tfid_w_372', 'tfid_w_373', 'tfid_w_374', 'tfid_w_375', 'tfid_w_376', 'tfid_w_377', 'tfid_w_378', 'tfid_w_379', 'tfid_w_380', 'tfid_w_381', 'tfid_w_382', 'tfid_w_383', 'tfid_w_384', 'tfid_w_385', 'tfid_w_386', 'tfid_w_387', 'tfid_w_388', 'tfid_w_389', 'tfid_w_390', 'tfid_w_391', 'tfid_w_392', 'tfid_w_393', 'tfid_w_394', 'tfid_w_395', 'tfid_w_396', 'tfid_w_397', 'tfid_w_398', 'tfid_w_399', 'tfid_w_400', 'tfid_w_401', 'tfid_w_402', 'tfid_w_403', 'tfid_w_404', 'tfid_w_405', 'tfid_w_406', 'tfid_w_407', 'tfid_w_408', 'tfid_w_409', 'tfid_w_410', 'tfid_w_411', 'tfid_w_412', 'tfid_w_413', 'tfid_w_414', 'tfid_w_415', 'tfid_w_416', 'tfid_w_417', 'tfid_w_418', 'tfid_w_419', 'tfid_w_420', 'tfid_w_421', 'tfid_w_422', 'tfid_w_423', 'tfid_w_424', 'tfid_w_425', 'tfid_w_426', 'tfid_w_427', 'tfid_w_428', 'tfid_w_429', 'tfid_w_430', 'tfid_w_431', 'tfid_w_432', 'tfid_w_433', 'tfid_w_434', 'tfid_w_435', 'tfid_w_436', 'tfid_w_437', 'tfid_w_438', 'tfid_w_439', 'tfid_w_440', 'tfid_w_441', 'tfid_w_442', 'tfid_w_443', 'tfid_w_444', 'tfid_w_445', 'tfid_w_446', 'tfid_w_447', 'tfid_w_448', 'tfid_w_449', 'tfid_w_450', 'tfid_w_451', 'tfid_w_452', 'tfid_w_453', 'tfid_w_454', 'tfid_w_455', 'tfid_w_456', 'tfid_w_457', 'tfid_w_458', 'tfid_w_459', 'tfid_w_460', 'tfid_w_461', 'tfid_w_462', 'tfid_w_463', 'tfid_w_464', 'tfid_w_465', 'tfid_w_466', 'tfid_w_467', 'tfid_w_468', 'tfid_w_469', 'tfid_w_470', 'tfid_w_471', 'tfid_w_472', 'tfid_w_473', 'tfid_w_474', 'tfid_w_475', 'tfid_w_476', 'tfid_w_477', 'tfid_w_478', 'tfid_w_479', 'tfid_w_480', 'tfid_w_481', 'tfid_w_482', 'tfid_w_483', 'tfid_w_484', 'tfid_w_485', 'tfid_w_486', 'tfid_w_487', 'tfid_w_488', 'tfid_w_489', 'tfid_w_490', 'tfid_w_491', 'tfid_w_492', 'tfid_w_493', 'tfid_w_494', 'tfid_w_495', 'tfid_w_496', 'tfid_w_497', 'tfid_w_498', 'tfid_w_499', 'tfid_w_500', 'tfid_w_501', 'tfid_w_502', 'tfid_w_503', 'tfid_w_504', 'tfid_w_505', 'tfid_w_506', 'tfid_w_507', 'tfid_w_508', 'tfid_w_509', 'tfid_w_510', 'tfid_w_511', 'tfid_w_512', 'tfid_w_513'] + [target]

In [42]:
train_feats.drop(columns=train_drop_columns).head()

Unnamed: 0,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,paragraph_350_cnt,...,tfid_w_364,tfid_w_365,tfid_w_366,tfid_w_367,text_length,word_count,unique_word_count,sentence_count,paragraph_count,splling_err_num
0,1,1,1,1,1,1,1,1,1,1,...,0.0,0.0,0.0,0.0,2607,492,222,14,1,23
1,5,5,5,5,5,5,4,3,3,2,...,0.0,0.0,0.0,0.0,1630,335,145,20,5,7
2,4,4,4,4,4,4,4,4,4,4,...,0.0,0.0,0.0,0.0,3009,550,226,25,4,7
3,5,5,5,5,4,4,4,4,4,4,...,0.0,0.0,0.0,0.0,2622,450,213,24,5,8
4,4,4,4,4,4,4,4,4,4,4,...,0.0,0.0,0.0,0.0,2137,372,140,16,6,14


In [43]:
test_drop_columns = ["essay_id", "full_text", "paragraph", "text_tokens", "processed_text"]

In [44]:
test_feats.drop(columns=test_drop_columns).head()

Unnamed: 0,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,paragraph_350_cnt,...,tfid_w_364,tfid_w_365,tfid_w_366,tfid_w_367,text_length,word_count,unique_word_count,sentence_count,paragraph_count,splling_err_num
0,1,1,1,1,1,1,1,1,1,1,...,0.065463,0.0,0.065463,0.065463,2607,492,222,14,1,23
1,5,5,5,5,5,5,4,3,3,2,...,0.0,0.181076,0.0,0.0,1630,335,145,20,5,7
2,4,4,4,4,4,4,4,4,4,4,...,0.0,0.0,0.0,0.0,3009,550,226,25,4,7


# 5. Training

## 5.1 Evaluation function and loss function defination 

In [45]:
# idea from https://www.kaggle.com/code/rsakata/optimize-qwk-by-lgb/notebook#QWK-objective
def quadratic_weighted_kappa(y_true, y_pred):
    y_true = y_true + a
    y_pred = (y_pred + a).clip(1, 6).round()
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    return 'QWK', qwk, True

def qwk_obj(y_true, y_pred):
    labels = y_true + a
    preds = y_pred + a
    preds = preds.clip(1, 6)
    f = 1/2*np.sum((preds-labels)**2)
    g = 1/2*np.sum((preds-a)**2+b)
    df = preds - labels
    dg = preds - a
    grad = (df/g - f*dg/g**2)*len(labels)
    hess = np.ones(len(labels))
    return grad, hess
a = 2.948
b = 1.092

## 5.2 Training LGBMRegressor model

In [46]:
models = []

callbacks = [
    lgb.log_evaluation(period=25), 
    lgb.early_stopping(stopping_rounds=75,first_metric_only=True)
]
for fold in range(CFG.n_splits):

    model = lgb.LGBMRegressor(
        objective = qwk_obj, metrics = 'None', learning_rate = 0.1, max_depth = 5,
        num_leaves = 10, colsample_bytree=0.5, reg_alpha = 0.1, reg_lambda = 0.8,
        n_estimators=1024, random_state=CFG.seed, verbosity = - 1
    )
    
    # Take out the training and validation sets for 5 kfold segmentation separately
    X_train = train_feats[train_feats["fold"] != fold].drop(columns=train_drop_columns)
    y_train = train_feats[train_feats["fold"] != fold]["score"] - a

    X_eval = train_feats[train_feats["fold"] == fold].drop(columns=train_drop_columns)
    y_eval = train_feats[train_feats["fold"] == fold]["score"] - a

    print('\nFold_{} Training ================================\n'.format(fold+1))
    # Training model
    lgb_model = model.fit(
        X_train, y_train,
        eval_names=['train', 'valid'],
        eval_set=[(X_train, y_train), (X_eval, y_eval)],
        eval_metric=quadratic_weighted_kappa,
        callbacks=callbacks
    )
    models.append(model)



[LightGBM] [Info] Using self-defined objective function
Training until validation scores don't improve for 75 rounds
[25]	train's QWK: 0.755278	valid's QWK: 0.744537
[50]	train's QWK: 0.800207	valid's QWK: 0.777743
[75]	train's QWK: 0.817787	valid's QWK: 0.788816
[100]	train's QWK: 0.828686	valid's QWK: 0.793632
[125]	train's QWK: 0.83737	valid's QWK: 0.795988
[150]	train's QWK: 0.844961	valid's QWK: 0.796691
[175]	train's QWK: 0.851675	valid's QWK: 0.801721
[200]	train's QWK: 0.858156	valid's QWK: 0.800466
[225]	train's QWK: 0.863782	valid's QWK: 0.798881
[250]	train's QWK: 0.869761	valid's QWK: 0.799522
Early stopping, best iteration is:
[181]	train's QWK: 0.853508	valid's QWK: 0.802415
Evaluated only: QWK




## 5.3 Validating LGBMRegressor model

In [None]:
preds, trues = [], []
    
for fold, model in enumerate(models):
    X_eval_cv = train_feats[train_feats["fold"] == fold].drop(columns=train_drop_columns)
    y_eval_cv = train_feats[train_feats["fold"] == fold]["score"]

    pred = model.predict(X_eval_cv) + a
    
    trues.extend(y_eval_cv)
    preds.extend(np.round(pred, 0))

v_score = cohen_kappa_score(trues, preds, weights="quadratic")

print(f"Validation score : {v_score}")

Validation score : 0.8026743411070119


## 5.4 Testing and collecting prediction

In [None]:
train_feats.shape

(17307, 3907)

In [None]:
test_feats.shape

(3, 3759)

In [None]:
X_eval_cv.shape

(3461, 3754)

In [None]:
# Get the feature columns of the model
model_columns = model.feature_name_

# Get the feature columns of the input data
input_columns = X_eval_cv.columns

# Find extra features in the model
extra_features_model = [col for col in model_columns if col not in input_columns]

# Find extra features in the input data
extra_features_input = [col for col in input_columns if col not in model_columns]

# Print or inspect the extra features
print("Extra features in the model:", extra_features_model)
print("Extra features in the input data:", extra_features_input)


Extra features in the model: []
Extra features in the input data: []


In [None]:
len(extra_features_input)

0

In [None]:
len(extra_features_model)

0

In [None]:
# predecting for 5 models
preds = []
for fold, model in enumerate(models):
    X_eval_cv = test_feats.drop(columns=test_drop_columns)
    # pred = model.predict(X_eval_cv)
    pred = model.predict(X_eval_cv) + a
    preds.append(pred)

# Combining the 5 model results
for i, pred in enumerate(preds):
    test_feats[f"score_pred_{i}"] = pred
test_feats["score"] = np.round(test_feats[[f"score_pred_{fold}" for fold in range(CFG.n_splits)]].mean(axis=1),0).astype('int32')

In [None]:
test_feats.head()

Unnamed: 0,essay_id,paragraph_50_cnt,paragraph_75_cnt,paragraph_100_cnt,paragraph_125_cnt,paragraph_150_cnt,paragraph_175_cnt,paragraph_200_cnt,paragraph_250_cnt,paragraph_300_cnt,...,unique_word_count,sentence_count,paragraph_count,splling_err_num,score_pred_0,score_pred_1,score_pred_2,score_pred_3,score_pred_4,score
0,000d118,1,1,1,1,1,1,1,1,1,...,222,14,1,23,2.167996,1.98511,2.203904,1.575921,2.322956,2
1,000fe60,5,5,5,5,5,5,4,3,3,...,145,20,5,7,2.789518,2.915989,2.881348,2.908497,2.71375,3
2,001ab80,4,4,4,4,4,4,4,4,4,...,226,25,4,7,4.754231,4.854331,4.582892,4.602405,4.787978,5


# 6. Submission

In [None]:
test_feats[["essay_id", "score"]].to_csv("submission.csv", index=False)

# 7. Save Model using Pickle 

In [None]:
import joblib
# Save the trained model to a file
joblib.dump(model, 'lgbm_model.pkl')
