## Feature Analysis: Effect of Data Processing on Score

***

### Features Used
1. **Text length**: Number of characters in the given essay.
2. **Word count**: Number of tokens in the essay (tokenized using nltk `word_tokenize`).
3. **Unique word count**: Number of unique tokens.
4. **Spelling mistake count**: Number of distinct words flagged as unknown (identified using the `pyspellchecker` library).
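
As a rough illustration of these four features, here is a minimal sketch on a toy sentence. It makes two simplifying assumptions that differ from the notebook: `str.split()` stands in for nltk's `word_tokenize`, and a tiny hypothetical `VOCAB` set stands in for the spell checker's dictionary. The function name `basic_features` is also hypothetical.

```python
# Hypothetical stand-ins: str.split() replaces nltk's word_tokenize and a tiny
# vocabulary set replaces the pyspellchecker dictionary.
VOCAB = {"the", "cat", "sat", "on", "mat"}

def basic_features(text):
    """Compute the four length/spelling features for one essay."""
    tokens = text.split()
    return {
        "text_length": len(text),                 # characters, including spaces
        "word_count": len(tokens),                # all tokens
        "unique_word_count": len(set(tokens)),    # distinct tokens
        # distinct tokens missing from the vocabulary
        "splling_err_num": len({t for t in tokens if t.lower() not in VOCAB}),
    }

feats = basic_features("the cat sat on the matt")
print(feats)
```

The spelling count is taken over a set, so a misspelling that is repeated several times is still counted once.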

***

### Model used:
- Model: LightGBM (`LGBMRegressor`)
- Metric: quadratic weighted kappa (QWK)
- Loss function: custom QWK-based objective (given in **[link](https://www.kaggle.com/code/ye11725/tfidf-lgbm-baseline-cv-0-799-lb-0-799)**)
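
To make the metric concrete, the sketch below computes quadratic weighted kappa from scratch in plain Python. It is only an illustration: the notebook itself relies on scikit-learn's `cohen_kappa_score(..., weights="quadratic")`, and the function name and the `min_rating`/`max_rating` parameters here are assumptions chosen for the 1 to 6 score range.

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=6):
    """QWK = 1 - (weighted observed disagreement) / (weighted expected disagreement)."""
    n = len(y_true)
    k = max_rating - min_rating + 1
    observed = Counter(zip(y_true, y_pred))   # joint counts of (true, predicted)
    hist_true = Counter(y_true)               # marginal counts of true ratings
    hist_pred = Counter(y_pred)               # marginal counts of predicted ratings
    numerator, denominator = 0.0, 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[(i, j)]
            # expected count if true and predicted ratings were independent
            denominator += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - numerator / denominator

perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4])
one_off = quadratic_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 5])
print(perfect, one_off)
```

Because the weight grows with the squared distance between ratings, a prediction that is off by one point is penalised far less than one that is off by several.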

***

### Data processing settings:
1. **Full text without text processing**:
    * The essay is used for feature generation without any preprocessing.
2. **Full text with text processing**:
    * The essay is used for feature generation with preprocessing.
3. **Full text with text processing + contraction expansion**:
    * The essay is used for feature generation with preprocessing, where preprocessing includes contraction expansion.
4. **Full text with text processing + punctuation removal**:
    * The essay is used for feature generation with preprocessing, where preprocessing includes punctuation removal.
5. **Full text with text processing + contraction expansion + punctuation removal**:
    * The essay is used for feature generation with preprocessing, where preprocessing includes contraction expansion and punctuation removal.
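
The four processed settings can be sketched as one cleaning function with two optional steps. This is a simplified illustration, not the notebook's exact pipeline: the `preprocess` function, its flag names, and the two-entry `CONTRACTIONS` map are hypothetical, and several cleaning steps are omitted.

```python
import re
import string

# Hypothetical, heavily shortened contraction map; the notebook's map is much longer.
CONTRACTIONS = {"it's": "it is", "don't": "do not"}
CONTRACTION_RE = re.compile("|".join(re.escape(k) for k in CONTRACTIONS))
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text, expand_contractions=False, remove_punctuation=False):
    """Simplified cleaning pipeline with two optional steps."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)   # strip HTML tags
    text = re.sub(r"\d+", "", text)     # strip digits
    text = re.sub(r"\s+", " ", text)    # collapse whitespace
    if expand_contractions:
        text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    if remove_punctuation:
        text = text.translate(PUNCT_TABLE)
    return text.strip()

sample = "<p>It's late, and I don't agree!</p>"
variants = {
    "processing": preprocess(sample),
    "processing + contraction expansion": preprocess(sample, expand_contractions=True),
    "processing + punctuation removal": preprocess(sample, remove_punctuation=True),
    "processing + both": preprocess(sample, expand_contractions=True, remove_punctuation=True),
}
for name, text in variants.items():
    print(f"{name}: {text!r} (length {len(text)})")
```

In this sketch contractions are expanded before punctuation is stripped, because the apostrophe is needed to recognise a contraction.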

***
### Observations:
1. The **text length** decreases significantly after text processing, because non-textual content is removed when the text is cleaned. After punctuation removal, however, an increase in text length is observed. **[Link](#graph)**
2. The **word count** increases after removing punctuation, because tokens that were previously counted as one word are broken into more than one word. **[Link](#graph)**
3. The number of **spelling mistakes** decreases significantly after removing punctuation, because punctuation attached to a valid word can cause it to be flagged as a spelling mistake. **[Link](#graph)**
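
The spelling observation can be reproduced on a toy example. The sketch assumes that tokens come from `str.split()` and are looked up verbatim in a dictionary; the `VOCAB` set and `count_unknown` helper are hypothetical stand-ins for the real spell checker.

```python
import string

# Hypothetical mini dictionary standing in for the spell checker's word list.
VOCAB = {"this", "essay", "is", "good"}

def count_unknown(text):
    """Count distinct whitespace-separated tokens that are not in VOCAB."""
    return len({token for token in text.split() if token not in VOCAB})

raw = "this essay, is good."
no_punct = raw.translate(str.maketrans("", "", string.punctuation))

unknown_raw = count_unknown(raw)         # "essay," and "good." are not in VOCAB
unknown_clean = count_unknown(no_punct)  # every token is now a known word
print(unknown_raw, unknown_clean)
```

The words themselves are spelled correctly in both cases; only the attached comma and period make them look unknown.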
***

### Score:
| Data processing setting | Validation score (QWK) |
| :--- | :--- |
| Full text without text processing | 0.7100431 |
| Full text with text processing | 0.7134137 |
| Full text with text processing + contraction expansion | 0.7134168 |
| Full text with text processing + punctuation removal | 0.7175229 |
| Full text with text processing + contraction expansion + punctuation removal | 0.7197249 |
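
To read the table as gains over the unprocessed baseline, the differences can be computed directly from the reported scores. The short setting labels below are abbreviations introduced only for this snippet.

```python
# Validation scores copied from the table above (labels abbreviated).
scores = {
    "no processing": 0.7100431,
    "processing": 0.7134137,
    "processing + contraction": 0.7134168,
    "processing + punctuation": 0.7175229,
    "processing + contraction + punctuation": 0.7197249,
}

baseline = scores["no processing"]
gains = {name: round(value - baseline, 7) for name, value in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.7f}")
```

Contraction expansion alone changes the score only in the sixth decimal place, while punctuation removal accounts for most of the gain.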

***

### Reference:
I would like to give thanks to the authors of these public notebooks. I have learned a lot from you.
* https://www.kaggle.com/code/yongsukprasertsuk/0-818-deberta-v3-large-lgbm-baseline
* https://www.kaggle.com/code/ye11725/tfidf-lgbm-baseline-cv-0-799-lb-0-799
* https://www.kaggle.com/code/tsunotsuno/updated-debertav3-lgbm-with-spell-autocorrect

# Import modules

In [None]:
!pip install "/kaggle/input/pyspellchecker/pyspellchecker-0.7.2-py3-none-any.whl"

In [None]:
from typing import List
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import polars as pl
import warnings
import logging
import os
import shutil
import json
import string
import transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from datasets import Dataset,load_dataset, load_from_disk
from transformers import TrainingArguments, Trainer
from datasets import load_metric, disable_progress_bar
from sklearn.metrics import mean_squared_error
import torch
from sklearn.model_selection import KFold, GroupKFold, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score, accuracy_score
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from collections import Counter
import spacy
import re
from spellchecker import SpellChecker
import lightgbm as lgb

# logging setting 

warnings.simplefilter("ignore")
logging.disable(logging.ERROR)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
disable_progress_bar()
tqdm.pandas()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load data and initial configuration

In [None]:
# set random seed
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch
    
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
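    # Benchmark mode lets cuDNN auto-select algorithms, which can differ between runs; disable it for reproducibility
    torch.backends.cudnn.benchmark = False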

seed_everything(seed=42)

In [None]:
class PATHS:
    train_path = '/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv'
    test_path = '/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv'
    sub_path = '/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv'

In [None]:
train = pd.read_csv(PATHS.train_path)
train.head()

# Feature Engineering

## 1. Data preprocessing function definitions

source: https://www.kaggle.com/code/ye11725/tfidf-lgbm-baseline-with-code-comments/notebook

In [None]:
def removeHTML(x):
    html=re.compile(r'<.*?>')
    return html.sub(r'',x)


cList = {
    "ain't": "am not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not", 
    # "he'd": "he would",  ## --> he had or he would
    "he'd've": "he would have","he'll": "he will", "he'll've": "he will have", "he's": "he is", 
    "how'd": "how did","how'd'y": "how do you","how'll": "how will","how's": "how is",
    # "I'd": "I would",   ## --> I had or I would
    "I'd've": "I would have","I'll": "I will","I'll've": "I will have","I'm": "I am","I've": "I have","isn't": "is not",
    # "it'd": "it had",   ## --> It had or It would
    "it'd've": "it would have","it'll": "it will","it'll've": "it will have","it's": "it is",
    "let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not","mightn't've": "might not have",
    "must've": "must have","mustn't": "must not","mustn't've": "must not have",
    "needn't": "need not","needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not","oughtn't've": "ought not have",
    "shan't": "shall not","sha'n't": "shall not","shan't've": "shall not have",
    # "she'd": "she would",   ## --> It had or It would
    "she'd've": "she would have","she'll": "she will","she'll've": "she will have","she's": "she is",
    "should've": "should have","shouldn't": "should not","shouldn't've": "should not have",
    "so've": "so have","so's": "so is",
    # "that'd": "that would",
    "that'd've": "that would have","that's": "that is",
    # "there'd": "there had",
    "there'd've": "there would have","there's": "there is",
    # "they'd": "they would",
    "they'd've": "they would have","they'll": "they will","they'll've": "they will have","they're": "they are","they've": "they have",
    "to've": "to have","wasn't": "was not","weren't": "were not",
    # "we'd": "we had",
    "we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
    "what'll": "what will","what'll've": "what will have","what're": "what are","what's": "what is","what've": "what have",
    "when's": "when is","when've": "when have",
    "where'd": "where did","where's": "where is","where've": "where have",
    "who'll": "who will","who'll've": "who will have","who's": "who is","who've": "who have","why's": "why is","why've": "why have",
    "will've": "will have","won't": "will not","won't've": "will not have",
    "would've": "would have","wouldn't": "would not","wouldn't've": "would not have",
    "y'all": "you all","y'alls": "you alls","y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are",
    "y'all've": "you all have","you'd": "you had","you'd've": "you would have","you'll": "you will","you'll've": "you will have",
    "you're": "you are",  "you've": "you have"
}
c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)

def remove_punctuation(text):
    """
    Remove all punctuation from the input text.
    
    Args:
    - text (str): The input text.
    
    Returns:
    - str: The text with punctuation removed.
    """
    # string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def dataPreprocessing(x):
    # Convert words to lowercase
    x = x.lower()
    # Remove HTML
    x = removeHTML(x)
    # Delete strings starting with @
    x = re.sub(r"@\w+", '',x)
    # Delete Numbers
    x = re.sub(r"'\d+", '',x)
    x = re.sub(r"\d+", '',x)
    # Delete URL prefix (matches "http" plus the word characters that follow it)
    x = re.sub(r"http\w+", '',x)
    # Replace consecutive empty spaces with a single space character
    x = re.sub(r"\s+", " ", x)
    # Replace consecutive commas and periods with one comma and period character
    x = re.sub(r"\.+", ".", x)
    x = re.sub(r"\,+", ",", x)
    # Remove empty characters at the beginning and end
    x = x.strip()
    return x

def dataPreprocessing_w_contract(x):
    # Convert words to lowercase
    x = x.lower()
    # Remove HTML
    x = removeHTML(x)
    # Delete strings starting with @
    x = re.sub(r"@\w+", '',x)
    # Delete Numbers
    x = re.sub(r"'\d+", '',x)
    x = re.sub(r"\d+", '',x)
    # Delete URL prefix (matches "http" plus the word characters that follow it)
    x = re.sub(r"http\w+", '',x)
    # Replace consecutive empty spaces with a single space character
    x = re.sub(r"\s+", " ", x)
    x = expandContractions(x)
    # Replace consecutive commas and periods with one comma and period character
    x = re.sub(r"\.+", ".", x)
    x = re.sub(r"\,+", ",", x)
    # Remove empty characters at the beginning and end
    x = x.strip()
    return x

def dataPreprocessing_w_punct_remove(x):
    # Convert words to lowercase
    x = x.lower()
    # Remove HTML
    x = removeHTML(x)
    # Delete strings starting with @
    x = re.sub(r"@\w+", '',x)
    # Delete Numbers
    x = re.sub(r"'\d+", '',x)
    x = re.sub(r"\d+", '',x)
    # Delete URL prefix (matches "http" plus the word characters that follow it)
    x = re.sub(r"http\w+", '',x)
    # Replace consecutive empty spaces with a single space character
    x = re.sub(r"\s+", " ", x)
    # Replace consecutive commas and periods with one comma and period character
    x = re.sub(r"\.+", ".", x)
    x = re.sub(r"\,+", ",", x)
    x = remove_punctuation(x)
    # Remove empty characters at the beginning and end
    x = x.strip()
    return x

def dataPreprocessing_w_contract_punct_remove(x):
    # Convert words to lowercase
    x = x.lower()
    # Remove HTML
    x = removeHTML(x)
    # Delete strings starting with @
    x = re.sub(r"@\w+", '',x)
    # Delete Numbers
    x = re.sub(r"'\d+", '',x)
    x = re.sub(r"\d+", '',x)
    # Delete URL prefix (matches "http" plus the word characters that follow it)
    x = re.sub(r"http\w+", '',x)
    # Replace consecutive empty spaces with a single space character
    x = re.sub(r"\s+", " ", x)
    x = expandContractions(x)
    # Replace consecutive commas and periods with one comma and period character
    x = re.sub(r"\.+", ".", x)
    x = re.sub(r"\,+", ",", x)
    x = remove_punctuation(x)
    # Remove empty characters at the beginning and end
    x = x.strip()
    return x
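
Two details of dictionary-driven contraction expansion are easy to miss: regex alternation tries alternatives in the order they are listed, and keys with capital letters never match text that has already been lowercased. The sketch below demonstrates both on a three-entry map. The names `expand_naive` and `expand_ordered` are hypothetical, and the longest-first, case-insensitive variant is one possible adjustment, not the notebook's implementation.

```python
import re

# Small illustrative subset of a contraction map.
contractions = {"can't": "cannot", "can't've": "cannot have", "I'm": "I am"}

# Pattern built in dictionary order: "can't" is tried before "can't've".
naive_re = re.compile("(%s)" % "|".join(contractions.keys()))

# Pattern built longest-first and case-insensitive, with a lowercase lookup table.
ordered_keys = sorted(contractions, key=len, reverse=True)
ordered_re = re.compile("|".join(re.escape(k) for k in ordered_keys), flags=re.IGNORECASE)
lookup = {k.lower(): v.lower() for k, v in contractions.items()}

def expand_naive(text):
    return naive_re.sub(lambda m: contractions[m.group(0)], text)

def expand_ordered(text):
    return ordered_re.sub(lambda m: lookup[m.group(0).lower()], text)

text = "i'm sure they can't've known"
naive_out = expand_naive(text)
ordered_out = expand_ordered(text)
print(naive_out)
print(ordered_out)
```

With the naive pattern, `i'm` is left untouched and `can't've` is only partially expanded, which leaves a stray `'ve` behind.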

## 2. Feature

### Other feature

Source: https://www.kaggle.com/code/tsunotsuno/updated-debertav3-lgbm-with-spell-autocorrect

In [None]:
class Preprocessor:
    def __init__(self) -> None:
        self.twd = TreebankWordDetokenizer()
        self.STOP_WORDS = set(stopwords.words('english'))
        self.spellchecker = SpellChecker() 

    def spelling(self, text):
        wordlist=text.split()
        amount_miss = len(list(self.spellchecker.unknown(wordlist)))
        return amount_miss
    
    def run(self, data: pd.DataFrame, mode:str) -> pd.DataFrame:
        data["text_tokens"] = data["full_text"].apply(lambda x: word_tokenize(x))
        data["text_length"] = data["full_text"].apply(lambda x: len(x))
        data["word_count"] = data["text_tokens"].apply(lambda x: len(x))
        data["unique_word_count"] = data["text_tokens"].apply(lambda x: len(set(x)))
        data["splling_err_num"] = data["full_text"].progress_apply(self.spelling)
    
        data["processed_text"] = data["full_text"].apply(lambda x: dataPreprocessing(x))
        data["text_tokens"] = data["processed_text"].apply(lambda x: word_tokenize(x))
        data["text_length_p"] = data["processed_text"].apply(lambda x: len(x))
        data["word_count_p"] = data["text_tokens"].apply(lambda x: len(x))
        data["unique_word_count_p"] = data["text_tokens"].apply(lambda x: len(set(x)))
        data["splling_err_num_p"] = data["processed_text"].progress_apply(self.spelling)
    
        data["processed_text"] = data["full_text"].apply(lambda x: dataPreprocessing_w_contract(x))
        data["text_tokens"] = data["processed_text"].apply(lambda x: word_tokenize(x))
        data["text_length_pc"] = data["processed_text"].apply(lambda x: len(x))
        data["word_count_pc"] = data["text_tokens"].apply(lambda x: len(x))
        data["unique_word_count_pc"] = data["text_tokens"].apply(lambda x: len(set(x)))
        data["splling_err_num_pc"] = data["processed_text"].progress_apply(self.spelling)
        
        data["processed_text"] = data["full_text"].apply(lambda x: dataPreprocessing_w_punct_remove(x))
        data["text_tokens"] = data["processed_text"].apply(lambda x: word_tokenize(x))
        data["text_length_ppr"] = data["processed_text"].apply(lambda x: len(x))
        data["word_count_ppr"] = data["text_tokens"].apply(lambda x: len(x))
        data["unique_word_count_ppr"] = data["text_tokens"].apply(lambda x: len(set(x)))
        data["splling_err_num_ppr"] = data["processed_text"].progress_apply(self.spelling)
        
        data["processed_text"] = data["full_text"].apply(lambda x: dataPreprocessing_w_contract_punct_remove(x))
        data["text_tokens"] = data["processed_text"].apply(lambda x: word_tokenize(x))
        data["text_length_pcpr"] = data["processed_text"].apply(lambda x: len(x))
        data["word_count_pcpr"] = data["text_tokens"].apply(lambda x: len(x))
        data["unique_word_count_pcpr"] = data["text_tokens"].apply(lambda x: len(set(x)))
        data["splling_err_num_pcpr"] = data["processed_text"].progress_apply(self.spelling)
        data.drop(columns=["processed_text", "text_tokens"], inplace=True)
        return data
    
preprocessor = Preprocessor()

In [None]:
train_feats = preprocessor.run(train, mode="train")

train_feats.head(3)

# EDA: Engineered features

## Paragraph Length Analysis


In [None]:
feats = [
    "text_length","text_length_p","text_length_pc","text_length_ppr","text_length_pcpr",
    "word_count","word_count_p","word_count_pc","word_count_ppr","word_count_pcpr",
    "unique_word_count","unique_word_count_p","unique_word_count_pc","unique_word_count_ppr","unique_word_count_pcpr",
    "splling_err_num","splling_err_num_p","splling_err_num_pc","splling_err_num_ppr","splling_err_num_pcpr"
]

In [None]:
sns.set(rc={'figure.figsize': (15, 15)})
train_feats[feats + ["score"]].hist(bins=50);

<a id='graph'></a>
### Observation:
**From the charts above:**
1. The **text length** decreases significantly after text processing, because non-textual content is removed when the text is cleaned. After punctuation removal, however, the text length increases.
2. The **word count** increases after removing punctuation, because removing punctuation may break a token into more than one word.
3. The number of **spelling mistakes** decreases significantly after removing punctuation, because punctuation attached to a valid word can cause it to be flagged as a spelling mistake.

In [None]:
correlation_matrix = train_feats[["score"]+feats].corr()
plt.figure(figsize=(20, 20))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.2)
plt.title('Correlation Matrix')
plt.show()

In [None]:
for col_idx in range(0, len(feats), 5):
    fig, axes = plt.subplots(1, 5, figsize = (25, 6))
    for i in range(0, 5):
        if col_idx+i < len(feats):
            sns.scatterplot(ax=axes[i], data=train_feats, x=feats[col_idx+i], y='score', color='steelblue');

In [None]:
plt.figure(figsize=(12, 2))
sns.boxplot(data=train_feats[feats[:5]], orient="h")
plt.figure(figsize=(12, 2))
sns.boxplot(data=train_feats[feats[5:10]], orient="h")
plt.figure(figsize=(12, 2))
sns.boxplot(data=train_feats[feats[10:15]], orient="h")
plt.figure(figsize=(12, 2))
sns.boxplot(data=train_feats[feats[15:]], orient="h")

### Observation:
* After removing punctuation, the spikes in word count are reduced.
* After text processing, contraction expansion, and punctuation removal, the number of spelling mistakes is also reduced.

## Testing the results with features

In [None]:
class CFG:
    n_splits = 5
    seed = 42
    num_labels = 6

In [None]:
skf = StratifiedKFold(n_splits=CFG.n_splits, shuffle=True, random_state=CFG.seed)
for i, (_, val_index) in enumerate(skf.split(train_feats, train_feats["score"])):
    train_feats.loc[val_index, "fold"] = i
print(train_feats.shape)
train_feats.head(2)

### Metric and loss function

In [None]:
# idea from https://www.kaggle.com/code/rsakata/optimize-qwk-by-lgb/notebook#QWK-objective
def quadratic_weighted_kappa(y_true, y_pred):
    y_true = y_true + a
    y_pred = (y_pred + a).clip(1, 6).round()
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    return 'QWK', qwk, True

def qwk_obj(y_true, y_pred):
    labels = y_true + a
    preds = y_pred + a
    preds = preds.clip(1, 6)
    f = 1/2*np.sum((preds-labels)**2)
    g = 1/2*np.sum((preds-a)**2+b)
    df = preds - labels
    dg = preds - a
    grad = (df/g - f*dg/g**2)*len(labels)
    hess = np.ones(len(labels))
    return grad, hess
# `a` shifts the target (models are trained on `score - a`); `a` and `b` are also used in qwk_obj
a = 2.948
b = 1.092
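
The gradient returned by `qwk_obj` can be read as the derivative of a ratio. The derivation below follows the code term by term, writing $\hat{y}_i$ for `preds`, $y_i$ for `labels`, and $n$ for `len(labels)`. Interpreting `a` and `b` as roughly the mean and variance of the training scores is an assumption made here for intuition; it is not stated in this notebook.

```latex
f(\hat{y}) = \frac{1}{2} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,
\qquad
g(\hat{y}) = \frac{1}{2} \sum_{i=1}^{n} \left[ (\hat{y}_i - a)^2 + b \right]

\frac{\partial f}{\partial \hat{y}_i} = \hat{y}_i - y_i,
\qquad
\frac{\partial g}{\partial \hat{y}_i} = \hat{y}_i - a

\mathrm{grad}_i
= n \, \frac{\partial}{\partial \hat{y}_i} \left( \frac{f}{g} \right)
= n \left[ \frac{\hat{y}_i - y_i}{g} - \frac{f \, (\hat{y}_i - a)}{g^2} \right]
```

Under that assumption, $f$ plays the role of the observed squared disagreement and $g$ the expected one, so minimising $f/g$ pushes QWK upward. The Hessian is not derived; the code simply sets it to a vector of ones.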

### 1. Testing : Features without any text processing

In [None]:
selected_featues = ["text_length","word_count","unique_word_count","splling_err_num"]
train_feats[selected_featues].head(2)

In [None]:
models = []

callbacks = [
    lgb.log_evaluation(period=25), 
    lgb.early_stopping(stopping_rounds=75,first_metric_only=True)
]
for fold in range(CFG.n_splits):

    model = lgb.LGBMRegressor(
        objective = qwk_obj, metrics = 'None', learning_rate = 0.1, max_depth = 5,
        num_leaves = 10, colsample_bytree=0.5, reg_alpha = 0.1, reg_lambda = 0.8,
        n_estimators=1024, random_state=42, verbosity = - 1
    )
    
    # Select the training and validation rows for the current fold
    X_train = train_feats[train_feats["fold"] != fold][selected_featues]
    y_train = train_feats[train_feats["fold"] != fold]["score"] - a

    X_eval = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval = train_feats[train_feats["fold"] == fold]["score"] - a

    print('\nFold_{} Training ================================\n'.format(fold+1))
    # Training model
    lgb_model = model.fit(
        X_train, y_train,
        eval_names=['train', 'valid'],
        eval_set=[(X_train, y_train), (X_eval, y_eval)],
        eval_metric=quadratic_weighted_kappa,
        callbacks=callbacks
    )
    models.append(model)

In [None]:
preds, trues = [], []
    
for fold, model in enumerate(models):
    X_eval_cv = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval_cv = train_feats[train_feats["fold"] == fold]["score"]

    pred = model.predict(X_eval_cv) + a
    
    trues.extend(y_eval_cv)
    preds.extend(np.round(pred, 0))

qwk = cohen_kappa_score(trues, preds, weights="quadratic")

print(f"Validation score : {qwk}")

### 2. Testing : Features with text processing

In [None]:
selected_featues = ["text_length_p","word_count_p","unique_word_count_p","splling_err_num_p"]
train_feats[selected_featues].head(2)

In [None]:
models = []

callbacks = [
    lgb.log_evaluation(period=25), 
    lgb.early_stopping(stopping_rounds=75,first_metric_only=True)
]
for fold in range(CFG.n_splits):

    model = lgb.LGBMRegressor(
        objective = qwk_obj, metrics = 'None', learning_rate = 0.1, max_depth = 5,
        num_leaves = 10, colsample_bytree=0.5, reg_alpha = 0.1, reg_lambda = 0.8,
        n_estimators=1024, random_state=42, verbosity = - 1
    )
    
    # Select the training and validation rows for the current fold
    X_train = train_feats[train_feats["fold"] != fold][selected_featues]
    y_train = train_feats[train_feats["fold"] != fold]["score"] - a

    X_eval = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval = train_feats[train_feats["fold"] == fold]["score"] - a

    print('\nFold_{} Training ================================\n'.format(fold+1))
    # Training model
    lgb_model = model.fit(
        X_train, y_train,
        eval_names=['train', 'valid'],
        eval_set=[(X_train, y_train), (X_eval, y_eval)],
        eval_metric=quadratic_weighted_kappa,
        callbacks=callbacks
    )
    models.append(model)

In [None]:
preds, trues = [], []
    
for fold, model in enumerate(models):
    X_eval_cv = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval_cv = train_feats[train_feats["fold"] == fold]["score"]

    pred = model.predict(X_eval_cv) + a
    
    trues.extend(y_eval_cv)
    preds.extend(np.round(pred, 0))

qwk = cohen_kappa_score(trues, preds, weights="quadratic")

print(f"Validation score : {qwk}")

### 3. Testing : Features with text processing + contraction expansion

In [None]:
selected_featues = ["text_length_pc","word_count_pc","unique_word_count_pc","splling_err_num_pc"]
train_feats[selected_featues].head(2)

In [None]:
models = []

callbacks = [
    lgb.log_evaluation(period=25), 
    lgb.early_stopping(stopping_rounds=75,first_metric_only=True)
]
for fold in range(CFG.n_splits):

    model = lgb.LGBMRegressor(
        objective = qwk_obj, metrics = 'None', learning_rate = 0.1, max_depth = 5,
        num_leaves = 10, colsample_bytree=0.5, reg_alpha = 0.1, reg_lambda = 0.8,
        n_estimators=1024, random_state=42, verbosity = - 1
    )
    
    # Select the training and validation rows for the current fold
    X_train = train_feats[train_feats["fold"] != fold][selected_featues]
    y_train = train_feats[train_feats["fold"] != fold]["score"] - a

    X_eval = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval = train_feats[train_feats["fold"] == fold]["score"] - a

    print('\nFold_{} Training ================================\n'.format(fold+1))
    # Training model
    lgb_model = model.fit(
        X_train, y_train,
        eval_names=['train', 'valid'],
        eval_set=[(X_train, y_train), (X_eval, y_eval)],
        eval_metric=quadratic_weighted_kappa,
        callbacks=callbacks
    )
    models.append(model)

In [None]:
preds, trues = [], []
    
for fold, model in enumerate(models):
    X_eval_cv = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval_cv = train_feats[train_feats["fold"] == fold]["score"]

    pred = model.predict(X_eval_cv) + a
    
    trues.extend(y_eval_cv)
    preds.extend(np.round(pred, 0))

qwk = cohen_kappa_score(trues, preds, weights="quadratic")

print(f"Validation score : {qwk}")

### 4. Testing : Features with text processing + punctuation removal

In [None]:
selected_featues = ["text_length_ppr","word_count_ppr","unique_word_count_ppr","splling_err_num_ppr"]
train_feats[selected_featues].head(2)

In [None]:
models = []

callbacks = [
    lgb.log_evaluation(period=25), 
    lgb.early_stopping(stopping_rounds=75,first_metric_only=True)
]
for fold in range(CFG.n_splits):

    model = lgb.LGBMRegressor(
        objective = qwk_obj, metrics = 'None', learning_rate = 0.1, max_depth = 5,
        num_leaves = 10, colsample_bytree=0.5, reg_alpha = 0.1, reg_lambda = 0.8,
        n_estimators=1024, random_state=42, verbosity = - 1
    )
    
    # Select the training and validation rows for the current fold
    X_train = train_feats[train_feats["fold"] != fold][selected_featues]
    y_train = train_feats[train_feats["fold"] != fold]["score"] - a

    X_eval = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval = train_feats[train_feats["fold"] == fold]["score"] - a

    print('\nFold_{} Training ================================\n'.format(fold+1))
    # Training model
    lgb_model = model.fit(
        X_train, y_train,
        eval_names=['train', 'valid'],
        eval_set=[(X_train, y_train), (X_eval, y_eval)],
        eval_metric=quadratic_weighted_kappa,
        callbacks=callbacks
    )
    models.append(model)

In [None]:
preds, trues = [], []
    
for fold, model in enumerate(models):
    X_eval_cv = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval_cv = train_feats[train_feats["fold"] == fold]["score"]

    pred = model.predict(X_eval_cv) + a
    
    trues.extend(y_eval_cv)
    preds.extend(np.round(pred, 0))

qwk = cohen_kappa_score(trues, preds, weights="quadratic")

print(f"Validation score : {qwk}")

### 5. Testing : Features with text processing + contraction expansion + punctuation removal

In [None]:
selected_featues = ["text_length_pcpr","word_count_pcpr","unique_word_count_pcpr","splling_err_num_pcpr"]
train_feats[selected_featues].head(2)

In [None]:
models = []

callbacks = [
    lgb.log_evaluation(period=25), 
    lgb.early_stopping(stopping_rounds=75,first_metric_only=True)
]
for fold in range(CFG.n_splits):

    model = lgb.LGBMRegressor(
        objective = qwk_obj, metrics = 'None', learning_rate = 0.1, max_depth = 5,
        num_leaves = 10, colsample_bytree=0.5, reg_alpha = 0.1, reg_lambda = 0.8,
        n_estimators=1024, random_state=42, verbosity = - 1
    )
    
    # Select the training and validation rows for the current fold
    X_train = train_feats[train_feats["fold"] != fold][selected_featues]
    y_train = train_feats[train_feats["fold"] != fold]["score"] - a

    X_eval = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval = train_feats[train_feats["fold"] == fold]["score"] - a

    print('\nFold_{} Training ================================\n'.format(fold+1))
    # Training model
    lgb_model = model.fit(
        X_train, y_train,
        eval_names=['train', 'valid'],
        eval_set=[(X_train, y_train), (X_eval, y_eval)],
        eval_metric=quadratic_weighted_kappa,
        callbacks=callbacks
    )
    models.append(model)

In [None]:
preds, trues = [], []
    
for fold, model in enumerate(models):
    X_eval_cv = train_feats[train_feats["fold"] == fold][selected_featues]
    y_eval_cv = train_feats[train_feats["fold"] == fold]["score"]

    pred = model.predict(X_eval_cv) + a
    
    trues.extend(y_eval_cv)
    preds.extend(np.round(pred, 0))

qwk = cohen_kappa_score(trues, preds, weights="quadratic")

print(f"Validation score : {qwk}")