# FE + Null Importance + LGB (CV:0.616->0.615, LB:0.582->0.584)
Six years ago, Kaggle GM Olivier proposed the **[Null Importance](https://www.kaggle.com/code/ogrellier/feature-selection-with-null-importances/notebook)** feature selection method, which identifies and removes "opportunistic" features. For example, suppose we include userID as a feature in a model that predicts each user's consumer category. An overfitted model might learn a direct mapping from userID to category, essentially memorizing which category belongs to which userID. If we shuffle the labels and retrain, the model will once again map userID directly to the shuffled labels. Whether the labels are real or fake, userID ends up as an "opportunistic" feature that ranks among the most important. How do we identify such features? Olivier's idea is simple: **truly robust, stable, and important features should be highly important under the real labels, but their importance should drop once the labels are shuffled.** Conversely, if a feature is only moderately important under the real labels but becomes more important when the labels are shuffled, it is unreliable and should be removed as an "opportunistic" feature.
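
To make the idea concrete, here is a minimal, standard-library sketch. It is not Olivier's implementation: absolute Pearson correlation stands in for a tree model's importance, and the score (actual importance divided by the 75th percentile of the shuffled-label importances) is just one possible choice. The names `abs_corr` and `null_importance_ratio` are made up for this illustration.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5

def null_importance_ratio(feature, target, n_runs=200, seed=0):
    """Actual importance divided by the 75th percentile of the null importances."""
    rng = random.Random(seed)
    actual = abs_corr(feature, target)
    nulls = []
    for _ in range(n_runs):
        shuffled = target[:]      # copy so the real labels stay intact
        rng.shuffle(shuffled)     # break any real feature/label relationship
        nulls.append(abs_corr(feature, shuffled))
    nulls.sort()
    null_q75 = nulls[int(0.75 * (n_runs - 1))]
    return actual / null_q75

rng = random.Random(42)
signal = [rng.gauss(0, 1) for _ in range(300)]
noise = [rng.gauss(0, 1) for _ in range(300)]
target = [s + rng.gauss(0, 0.5) for s in signal]  # target depends on `signal` only

ratio_signal = null_importance_ratio(signal, target)
ratio_noise = null_importance_ratio(noise, target)
print(f"signal ratio: {ratio_signal:.1f}, noise ratio: {ratio_noise:.1f}")
```

A feature that carries real signal should score far above its null distribution, while a pure-noise feature should score about the same as its shuffled-label runs.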

Based on [Silver Bullet | Single Model | 165 Features](https://www.kaggle.com/code/awqatak/silver-bullet-single-model-165-features), I made the following modifications:
- Add early stopping.
- Add Null Importance.
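
As a rough illustration of what early stopping does, the sketch below scans a list of per-iteration validation RMSEs and stops once the score has not improved for `patience` rounds. The function name `early_stopping_point` and the scores are hypothetical, and LightGBM's own implementation handles more detail than this.

```python
def early_stopping_point(valid_scores, patience):
    """Return (best_iteration, best_score) for a sequence of validation RMSEs.

    `valid_scores` stands in for the metric a booster reports after each round.
    """
    best_score = float("inf")
    best_iter = 0
    for i, score in enumerate(valid_scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_iter, best_score

scores = [1.02, 0.98, 0.95, 0.96, 0.97, 0.96, 0.94]  # hypothetical RMSE per round
best_iter, best_score = early_stopping_point(scores, patience=3)
print(best_iter, best_score)
```

With `patience=3`, training stops at round 6, so the later improvement at round 7 is never seen. That is the trade-off a small patience value makes.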

**The offline CV RMSE decreases from 0.616 to 0.615, while the public leaderboard score increases from 0.582 to 0.584.**

I ran some experiments and found that offline and online performance are often inconsistent. I therefore decided to share the null importance approach publicly, even though it may not perform optimally online, because I believe it provides valuable insights. I also hope to find a teammate. I am less concerned about the public leaderboard results; instead, I plan to focus on improving both offline and online metrics through feature engineering and better model generalization.

In [1]:
import polars as pl
import pandas as pd
import numpy as np
import time
import re, os
import random
import optuna
from tqdm import tqdm
import lightgbm
from lightgbm import LGBMRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from scipy.stats import skew, kurtosis
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import warnings
warnings.filterwarnings("ignore")

In [2]:
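# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs('feature_selection/', exist_ok=True)
os.makedirs('results', exist_ok=True)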

## Feature Engineering

In [3]:
num_cols = ['down_time', 'up_time', 'action_time', 'cursor_position', 'word_count']
activities = ['Input', 'Remove/Cut', 'Nonproduction', 'Replace', 'Paste']
events = ['q', 'Space', 'Backspace', 'Shift', 'ArrowRight', 'Leftclick', 'ArrowLeft', '.', ',', 'ArrowDown', 'ArrowUp', 'Enter', 'CapsLock', "'", 'Delete', 'Unidentified']
text_changes = ['q', ' ', '.', ',', '\n', "'", '"', '-', '?', ';', '=', '/', '\\', ':']

def count_by_values(df, colname, values):
    fts = df.select(pl.col('id').unique(maintain_order=True))
    for i, value in enumerate(values):
        tmp_df = df.group_by('id').agg(pl.col(colname).is_in([value]).sum().alias(f'{colname}_{i}_cnt'))
        fts  = fts.join(tmp_df, on='id', how='left') 
    return fts

def dev_feats(df):
    
    print("< Count by values features >")
    feats = count_by_values(df, 'activity', activities)
    feats = feats.join(count_by_values(df, 'text_change', text_changes), on='id', how='left') 
    feats = feats.join(count_by_values(df, 'down_event', events), on='id', how='left') 
    feats = feats.join(count_by_values(df, 'up_event', events), on='id', how='left') 

    print("< Input words stats features >")
    temp = df.filter((~pl.col('text_change').str.contains('=>')) & (pl.col('text_change') != 'NoChange'))
    temp = temp.group_by('id').agg(pl.col('text_change').str.concat('').str.extract_all(r'q+'))
    temp = temp.with_columns(input_word_count = pl.col('text_change').list.lengths(), 
                             input_word_length_mean = pl.col('text_change').apply(lambda x: np.mean([len(i) for i in x] if len(x) > 0 else 0)),
                             input_word_length_max = pl.col('text_change').apply(lambda x: np.max([len(i) for i in x] if len(x) > 0 else 0)),
                             input_word_length_std = pl.col('text_change').apply(lambda x: np.std([len(i) for i in x] if len(x) > 0 else 0)),
                             input_word_length_median = pl.col('text_change').apply(lambda x: np.median([len(i) for i in x] if len(x) > 0 else 0)),
                             input_word_length_skew = pl.col('text_change').apply(lambda x: skew([len(i) for i in x] if len(x) > 0 else 0)))
    temp = temp.drop('text_change')
    feats = feats.join(temp, on='id', how='left') 
   
    print("< Numerical columns features >")
    temp = df.group_by("id").agg(pl.sum('action_time').suffix('_sum'), pl.mean(num_cols).suffix('_mean'), pl.std(num_cols).suffix('_std'),
                                 pl.median(num_cols).suffix('_median'), pl.min(num_cols).suffix('_min'), pl.max(num_cols).suffix('_max'),
                                 pl.quantile(num_cols, 0.5).suffix('_quantile'))
    feats = feats.join(temp, on='id', how='left') 

    print("< Categorical columns features >")
    temp  = df.group_by("id").agg(pl.n_unique(['activity', 'down_event', 'up_event', 'text_change']))
    feats = feats.join(temp, on='id', how='left') 

    print("< Idle time features >")
    # time_diff = abs(down_time - up_time_lagged) 
    temp = df.with_columns(pl.col('up_time').shift().over('id').alias('up_time_lagged'))
    temp = temp.with_columns((abs(pl.col('down_time') - pl.col('up_time_lagged')) / 1000).fill_null(0).alias('time_diff'))
    temp = temp.filter(pl.col('activity').is_in(['Input', 'Remove/Cut']))
    temp = temp.group_by("id").agg(inter_key_largest_lantency = pl.max('time_diff'),
                                   inter_key_median_lantency = pl.median('time_diff'),
                                   mean_pause_time = pl.mean('time_diff'),
                                   std_pause_time = pl.std('time_diff'),
                                   total_pause_time = pl.sum('time_diff'),
                                   pauses_half_sec = pl.col('time_diff').filter((pl.col('time_diff') > 0.5) & (pl.col('time_diff') < 1)).count(),
                                   pauses_1_sec = pl.col('time_diff').filter((pl.col('time_diff') > 1) & (pl.col('time_diff') < 1.5)).count(),
                                   pauses_1_half_sec = pl.col('time_diff').filter((pl.col('time_diff') > 1.5) & (pl.col('time_diff') < 2)).count(),
                                   pauses_2_sec = pl.col('time_diff').filter((pl.col('time_diff') > 2) & (pl.col('time_diff') < 3)).count(),
                                   pauses_3_sec = pl.col('time_diff').filter(pl.col('time_diff') > 3).count(),)
    feats = feats.join(temp, on='id', how='left') 
    
    print("< P-bursts features >")
    # P-bursts that refer to the written segments terminated by pauses
    temp = df.with_columns(pl.col('up_time').shift().over('id').alias('up_time_lagged'))
    temp = temp.with_columns((abs(pl.col('down_time') - pl.col('up_time_lagged')) / 1000).fill_null(0).alias('time_diff'))
    temp = temp.filter(pl.col('activity').is_in(['Input', 'Remove/Cut']))
    temp = temp.with_columns(pl.col('time_diff')<2)
    temp = temp.with_columns(pl.when(pl.col("time_diff") & pl.col("time_diff").is_last()).then(pl.count()).over(pl.col("time_diff").rle_id()).alias('P-bursts'))
    temp = temp.drop_nulls()
    temp = temp.group_by("id").agg(pl.mean('P-bursts').suffix('_mean'), pl.std('P-bursts').suffix('_std'), pl.count('P-bursts').suffix('_count'),
                                   pl.median('P-bursts').suffix('_median'), pl.max('P-bursts').suffix('_max'),
                                   pl.first('P-bursts').suffix('_first'), pl.last('P-bursts').suffix('_last'))
    feats = feats.join(temp, on='id', how='left') 

    print("< R-bursts features >")
    # R-bursts that describe the segments terminated by an evaluation, revision or other grammatical discontinuity
    temp = df.filter(pl.col('activity').is_in(['Input', 'Remove/Cut']))
    temp = temp.with_columns(pl.col('activity').is_in(['Remove/Cut']))
    temp = temp.with_columns(pl.when(pl.col("activity") & pl.col("activity").is_last()).then(pl.count()).over(pl.col("activity").rle_id()).alias('R-bursts'))
    temp = temp.drop_nulls()
    temp = temp.group_by("id").agg(pl.mean('R-bursts').suffix('_mean'), pl.std('R-bursts').suffix('_std'), 
                                   pl.median('R-bursts').suffix('_median'), pl.max('R-bursts').suffix('_max'),
                                   pl.first('R-bursts').suffix('_first'), pl.last('R-bursts').suffix('_last'))
    feats = feats.join(temp, on='id', how='left')
    return feats
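
The burst features above come down to run-length counting: a P-burst is a run of consecutive events whose preceding pause is under 2 seconds, and an R-burst is a run of consecutive `Remove/Cut` events. A plain-Python sketch of the same counting, with `itertools.groupby` standing in for the polars `rle_id` logic and made-up pause values, might look like this:

```python
from itertools import groupby

def burst_lengths(flags):
    """Lengths of each maximal run of consecutive True values."""
    return [sum(1 for _ in run) for is_burst, run in groupby(flags) if is_burst]

# Hypothetical pauses (seconds) before each keystroke of one essay
pauses = [0.0, 0.3, 0.4, 2.5, 0.2, 0.1, 0.6, 3.1, 0.5]
p_flags = [p < 2 for p in pauses]   # True while the writer keeps typing
p_bursts = burst_lengths(p_flags)
print(p_bursts)
```

The per-essay mean, standard deviation, maximum and so on are then ordinary aggregations over these run lengths.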

## Reconstruct Essay

In [4]:
def reconstruct_essay(currTextInput):
    essayText = ""
    for Input in currTextInput.values:
        # Input[0] = activity
        # Input[1] = cursor_position
        # Input[2] = text_change
        # Input[3] = id
        if Input[0] == 'Replace':
            replaceTxt = Input[2].split(' => ')
            essayText = essayText[:Input[1] - len(replaceTxt[1])] + replaceTxt[1] + essayText[Input[1] - len(replaceTxt[1]) + len(replaceTxt[0]):]
            continue
        if Input[0] == 'Paste':
            essayText = essayText[:Input[1] - len(Input[2])] + Input[2] + essayText[Input[1] - len(Input[2]):]
            continue
        if Input[0] == 'Remove/Cut':
            essayText = essayText[:Input[1]] + essayText[Input[1] + len(Input[2]):]
            continue
        # If activity = Move...
        if "M" in Input[0]:
            # Gets rid of the "Move from to" text
            croppedTxt = Input[0][10:]
            # Splits cropped text by ' To '
            splitTxt = croppedTxt.split(' To ')
            # Splits split text again by ', ' for each item
            valueArr = [item.split(', ') for item in splitTxt]
            # Move from [2, 4] To [5, 7] = (2, 4, 5, 7)
            moveData = (int(valueArr[0][0][1:]), int(valueArr[0][1][:-1]), int(valueArr[1][0][1:]), int(valueArr[1][1][:-1]))
            # Skip if someone manages to trigger this by moving text to the same place
            if moveData[0] != moveData[2]:
                # Check if they move text forward in essay (they are different)
                if moveData[0] < moveData[2]:
                    essayText = essayText[:moveData[0]] + essayText[moveData[1]:moveData[3]] + essayText[moveData[0]:moveData[1]] + essayText[moveData[3]:]
                else:
                    essayText = essayText[:moveData[2]] + essayText[moveData[0]:moveData[1]] + essayText[moveData[2]:moveData[0]] + essayText[moveData[1]:]
            continue
        # If activity = input
        essayText = essayText[:Input[1] - len(Input[2])] + Input[2] + essayText[Input[1] - len(Input[2]):]
    return essayText

def get_essay_df(df):
    df       = df[df.activity != 'Nonproduction']
    temp     = df.groupby('id').apply(lambda x: reconstruct_essay(x[['activity', 'cursor_position', 'text_change']]))
    essay_df = pd.DataFrame({'id': df['id'].unique().tolist()})
    essay_df = essay_df.merge(temp.rename('essay'), on='id')
    return essay_df
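
To see how the cursor arithmetic works, here is a reduced sketch that handles only `Input` and `Remove/Cut` events, using the same slicing as `reconstruct_essay`. The helper `apply_event` and the event list are invented for the illustration.

```python
def apply_event(text, activity, cursor_position, text_change):
    """Apply one keystroke-log event to the running essay text."""
    if activity == "Remove/Cut":
        # cursor_position is where the cursor sits after the deletion
        return text[:cursor_position] + text[cursor_position + len(text_change):]
    # Input: cursor_position is where the cursor sits after typing
    start = cursor_position - len(text_change)
    return text[:start] + text_change + text[start:]

# Hypothetical log: type "qq q", delete the last "q", then type "."
events = [
    ("Input", 1, "q"),
    ("Input", 2, "q"),
    ("Input", 3, " "),
    ("Input", 4, "q"),
    ("Remove/Cut", 3, "q"),
    ("Input", 4, "."),
]
essay = ""
for activity, pos, change in events:
    essay = apply_event(essay, activity, pos, change)
print(repr(essay))
```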

In [5]:
def q1(x):
    return x.quantile(0.25)
def q3(x):
    return x.quantile(0.75)

AGGREGATIONS = ['count', 'mean', 'min', 'max', 'first', 'last', q1, 'median', q3, 'sum']

def word_feats(df):
    df['word'] = df['essay'].apply(lambda x: re.split(' |\\n|\\.|\\?|\\!',x))
    # Explode each list of words into one word per row
    df = df.explode('word')
    df['word_len'] = df['word'].apply(lambda x: len(x))
    df = df[df['word_len'] != 0]

    word_agg_df = df[['id','word_len']].groupby(['id']).agg(AGGREGATIONS)
    word_agg_df.columns = ['_'.join(x) for x in word_agg_df.columns]
    word_agg_df['id'] = word_agg_df.index
    word_agg_df = word_agg_df.reset_index(drop=True) 
    return word_agg_df

def sent_feats(df):
    df['sent'] = df['essay'].apply(lambda x: re.split('\\.|\\?|\\!',x))
    df = df.explode('sent')
    df['sent'] = df['sent'].apply(lambda x: x.replace('\n','').strip())
    # Number of characters in sentences
    df['sent_len'] = df['sent'].apply(lambda x: len(x))
    # Number of words in sentences
    df['sent_word_count'] = df['sent'].apply(lambda x: len(x.split(' ')))
    df = df[df.sent_len!=0].reset_index(drop=True)

    sent_agg_df = pd.concat([df[['id','sent_len']].groupby(['id']).agg(AGGREGATIONS), 
                             df[['id','sent_word_count']].groupby(['id']).agg(AGGREGATIONS)], axis=1)
    sent_agg_df.columns = ['_'.join(x) for x in sent_agg_df.columns]
    sent_agg_df['id'] = sent_agg_df.index
    sent_agg_df = sent_agg_df.reset_index(drop=True)
    sent_agg_df.drop(columns=["sent_word_count_count"], inplace=True)
    sent_agg_df = sent_agg_df.rename(columns={"sent_len_count":"sent_count"})
    return sent_agg_df

def parag_feats(df):
    df['paragraph'] = df['essay'].apply(lambda x: x.split('\n'))
    df = df.explode('paragraph')
    # Number of characters in paragraphs
    df['paragraph_len'] = df['paragraph'].apply(lambda x: len(x)) 
    # Number of words in paragraphs
    df['paragraph_word_count'] = df['paragraph'].apply(lambda x: len(x.split(' ')))
    df = df[df.paragraph_len!=0].reset_index(drop=True)
    
    paragraph_agg_df = pd.concat([df[['id','paragraph_len']].groupby(['id']).agg(AGGREGATIONS), 
                                  df[['id','paragraph_word_count']].groupby(['id']).agg(AGGREGATIONS)], axis=1) 
    paragraph_agg_df.columns = ['_'.join(x) for x in paragraph_agg_df.columns]
    paragraph_agg_df['id'] = paragraph_agg_df.index
    paragraph_agg_df = paragraph_agg_df.reset_index(drop=True)
    paragraph_agg_df.drop(columns=["paragraph_word_count_count"], inplace=True)
    paragraph_agg_df = paragraph_agg_df.rename(columns={"paragraph_len_count":"paragraph_count"})
    return paragraph_agg_df
    
def product_to_keys(logs, essays):
    essays['product_len'] = essays.essay.str.len()
    tmp_df = logs[logs.activity.isin(['Input', 'Remove/Cut'])].groupby(['id']).agg({'activity': 'count'}).reset_index().rename(columns={'activity': 'keys_pressed'})
    essays = essays.merge(tmp_df, on='id', how='left')
    essays['product_to_keys'] = essays['product_len'] / essays['keys_pressed']
    return essays[['id', 'product_to_keys']]

def get_keys_pressed_per_second(logs):
    temp_df = logs[logs['activity'].isin(['Input', 'Remove/Cut'])].groupby(['id']).agg(keys_pressed=('event_id', 'count')).reset_index()
    temp_df_2 = logs.groupby(['id']).agg(min_down_time=('down_time', 'min'), max_up_time=('up_time', 'max')).reset_index()
    temp_df = temp_df.merge(temp_df_2, on='id', how='left')
    temp_df['keys_per_second'] = temp_df['keys_pressed'] / ((temp_df['max_up_time'] - temp_df['min_down_time']) / 1000)
    return temp_df[['id', 'keys_per_second']]
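
As a small illustration of the word-level features, the sketch below splits an anonymized essay with the same delimiters as `word_feats` and summarizes the word lengths with the standard library. The sample essay is made up.

```python
import re
from statistics import mean, median

def word_lengths(essay):
    """Split on spaces, newlines and sentence punctuation; drop empty tokens."""
    tokens = re.split(r" |\n|\.|\?|!", essay)
    return [len(t) for t in tokens if t]

essay = "qqq qq qqqqq. qq qqqq?\nqqqqqq q"   # hypothetical reconstructed essay
lengths = word_lengths(essay)
summary = {
    "count": len(lengths),
    "mean": mean(lengths),
    "median": median(lengths),
    "max": max(lengths),
}
print(lengths, summary)
```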


## Feature Selection: Null Importances

In [6]:
def get_feature_importances(data_x, data_y, valid_x, valid_y, model_params, shuffle, seed=None):
    # Work on a copy so the caller's parameter dict is not mutated
    model_params = dict(model_params)
    if seed is None:
        model_params['random_state'] = None

    gain_model = LGBMRegressor(importance_type='gain', **model_params)
    split_model = LGBMRegressor(importance_type='split', **model_params)

    if shuffle:
        random.shuffle(data_y)
        random.shuffle(valid_y)
        
    gain_model.fit(data_x, data_y, eval_set=[(valid_x, valid_y)], verbose=False)
    split_model.fit(data_x, data_y, eval_set=[(valid_x, valid_y)], verbose=False)

    imp_df = pd.DataFrame()
    imp_df["feature"] = list(data_x)
    imp_df["importance_gain"] = gain_model.feature_importances_
    imp_df["importance_split"] = split_model.feature_importances_    
    return imp_df


def get_batch_imp_df(data_x, data_y, valid_x, valid_y, model_params, nb_runs=10, null_flag=True):
    batch_imp_df = pd.DataFrame()
    print(f'Run {nb_runs} rounds of model training:')
    for i in tqdm(range(nb_runs)):
        if null_flag:
            imp_df = get_feature_importances(data_x, data_y, valid_x, valid_y, model_params, shuffle=True, seed=None)
        else:
            imp_df = get_feature_importances(data_x, data_y, valid_x, valid_y, model_params, shuffle=False, seed=None)
        imp_df['run'] = i + 1 
        batch_imp_df = pd.concat([batch_imp_df, imp_df], axis=0)
    return batch_imp_df


def display_distributions(actual_imp_df_, null_imp_df_, feature_):
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    # Plot Split importances
    ax = plt.subplot(gs[0, 0])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_split'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_split'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Split Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (split) Distribution for %s ' % feature_.upper())
    # Plot Gain importances
    ax = plt.subplot(gs[0, 1])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_gain'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_gain'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Gain Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (gain) Distribution for %s ' % feature_.upper())

    
def normalize_feat_imp(imp_df):
    avg_gain_imp_df = imp_df.groupby(by=['feature'])['importance_gain'].mean().reset_index()
    avg_split_imp_df = imp_df.groupby(by=['feature'])['importance_split'].mean().reset_index()
    # rank-normalize score
    avg_gain_imp_df['importance_gain'] = avg_gain_imp_df['importance_gain'].rank(ascending=False, pct=True)
    avg_split_imp_df['importance_split'] = avg_split_imp_df['importance_split'].rank(ascending=False, pct=True)
    avg_imp_df = pd.merge(avg_gain_imp_df, avg_split_imp_df, on='feature')
    # # normalize score
    # avg_imp_df['importance_gain'] = (avg_imp_df['importance_gain'] - avg_imp_df['importance_gain'].min())  / (avg_imp_df['importance_gain'].max() - avg_imp_df['importance_gain'].min())
    # avg_imp_df['importance_split'] = (avg_imp_df['importance_split'] - avg_imp_df['importance_split'].min())  / (avg_imp_df['importance_split'].max() - avg_imp_df['importance_split'].min())
    return avg_imp_df


def nul_imp_feat_select(model_params,
                        train_x, train_y, valid_x, valid_y,
                        actual_avg_imp_df, null_avg_imp_df,
                        search_thres=[ 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1],
                        search_criterion=['gain', 'split', 'both']):
    feat_select_dfs = []
    for search_cri in search_criterion:
        print(f'Feature Selection by Null Importances: {search_cri}:')
        for threshold in tqdm(search_thres):
            if 'split' == search_cri:
                actual_feats = actual_avg_imp_df[actual_avg_imp_df['importance_split']>threshold]['feature'].values.tolist()
                null_feats = null_avg_imp_df[null_avg_imp_df['importance_split']>threshold]['feature'].values.tolist()
                unstable_feats = set(actual_feats + null_feats)
                feats = [_f for _f in actual_avg_imp_df['feature'].unique() if _f not in unstable_feats]
            elif 'gain' == search_cri:
                actual_feats = actual_avg_imp_df[actual_avg_imp_df['importance_gain']>threshold]['feature'].values.tolist()
                null_feats = null_avg_imp_df[null_avg_imp_df['importance_gain']>threshold]['feature'].values.tolist()
                unstable_feats = set(actual_feats + null_feats)
                feats = [_f for _f in actual_avg_imp_df['feature'].unique() if _f not in unstable_feats]
            elif 'both' == search_cri:
                actual_feats = actual_avg_imp_df[actual_avg_imp_df['importance_split']>threshold]['feature'].values.tolist()
                null_feats = null_avg_imp_df[null_avg_imp_df['importance_split']>threshold]['feature'].values.tolist()
                unstable_feats1 = set(actual_feats + null_feats)

                actual_feats = actual_avg_imp_df[actual_avg_imp_df['importance_gain']>threshold]['feature'].values.tolist()
                null_feats = null_avg_imp_df[null_avg_imp_df['importance_gain']>threshold]['feature'].values.tolist()
                unstable_feats2 = set(actual_feats + null_feats)
                
                unstable_feats = set(list(unstable_feats1) + list(unstable_feats2))
                feats = [_f for _f in actual_avg_imp_df['feature'].unique() if _f not in unstable_feats]

            model = LGBMRegressor(**model_params)
            model.fit(train_x[feats], train_y, eval_set=[(valid_x[feats], valid_y)], verbose= False)
            valid_pred_y = model.predict(valid_x[feats], num_iteration=model.best_iteration_)  
            valid_eval_rmse = mean_squared_error(valid_y, valid_pred_y, squared=False) 
            feat_select_dfs.append([search_cri, threshold, feats, len(feats), valid_eval_rmse])
    feat_select_dfs = pd.DataFrame(feat_select_dfs)   
    feat_select_dfs.columns = ['imp_type', 'threshold', 'features', 'feature_number', 'valid_eval_rmse']    
    return feat_select_dfs
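
The selection rule in `nul_imp_feat_select` can be hard to read at first. Because `normalize_feat_imp` ranks with `ascending=False`, the most important feature gets the smallest percentile rank. A feature is dropped when its rank exceeds the threshold in either the actual or the null ranking. The following plain-Python sketch mirrors that rule for distinct importance values; the helper names and numbers are hypothetical, and ties are not handled the way pandas would handle them.

```python
def pct_rank_desc(scores):
    """Percentile rank with the largest score ranked first (distinct values only)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {name: (pos + 1) / n for pos, name in enumerate(ordered)}

def select_features(actual_imp, null_imp, threshold):
    """Keep features whose percentile rank is <= threshold in both rankings."""
    actual_rank = pct_rank_desc(actual_imp)
    null_rank = pct_rank_desc(null_imp)
    dropped = {f for f in actual_imp
               if actual_rank[f] > threshold or null_rank[f] > threshold}
    return [f for f in actual_imp if f not in dropped]

actual_imp = {"a": 90.0, "b": 60.0, "c": 30.0, "d": 5.0}   # hypothetical averages
null_imp = {"a": 20.0, "b": 8.0, "c": 25.0, "d": 12.0}

kept_half = select_features(actual_imp, null_imp, threshold=0.5)
kept_all = select_features(actual_imp, null_imp, threshold=1.0)
print(kept_half, kept_all)
```

A threshold of 1 never drops anything, which is why a fold can report `165 -> 165` when the full feature set gives the best validation RMSE.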

## Data Loading

In [8]:
data_path     = 'kaggle/input/linking-writing-processes-to-writing-quality/'
train_logs    = pl.scan_csv(data_path + 'train_logs.csv')
train_feats   = dev_feats(train_logs) # feature engineering
train_feats   = train_feats.collect().to_pandas()

print('< Essay Reconstruction >')
train_logs             = train_logs.collect().to_pandas()
train_essays           = get_essay_df(train_logs)
train_feats            = train_feats.merge(word_feats(train_essays), on='id', how='left')
train_feats            = train_feats.merge(sent_feats(train_essays), on='id', how='left')
train_feats            = train_feats.merge(parag_feats(train_essays), on='id', how='left')
train_feats            = train_feats.merge(get_keys_pressed_per_second(train_logs), on='id', how='left')
train_feats            = train_feats.merge(product_to_keys(train_logs, train_essays), on='id', how='left')


print('< Mapping >')
train_scores   = pd.read_csv(data_path + 'train_scores.csv')
data           = train_feats.merge(train_scores, on='id', how='left')
x              = data.drop(['id', 'score'], axis=1)
y              = data['score'].values
print(f'Number of features: {len(x.columns)}')

< Count by values features >
< Input words stats features >
< Numerical columns features >
< Categorical columns features >
< Idle time features >
< P-bursts features >
< R-bursts features >
< Essay Reconstruction >
< Mapping >
Number of features: 165


## Model Training

In [8]:
def train_valid_split(data_x, data_y, train_idx, valid_idx):
    x_train = data_x.iloc[train_idx]
    y_train = data_y[train_idx]
    x_valid = data_x.iloc[valid_idx]
    y_valid = data_y[valid_idx]
    return x_train, y_train, x_valid, y_valid

def evaluate(data_x, data_y, model, random_state=42, n_splits=5):
    skf    = StratifiedKFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    test_y = np.zeros(len(data_x))
    feature_list = list(data_x)
    avg_feature_imp = pd.DataFrame({'feature':feature_list, 'importance':np.zeros(len(feature_list))})

    for i, (train_index, valid_index) in enumerate(skf.split(data_x, data_y.astype(str))):
        print(f'Fold:{i}:')
        train_x, train_y, valid_x, valid_y = train_valid_split(data_x, data_y, train_index, valid_index)

        model.fit(train_x, train_y, 
                  eval_set=[(valid_x, valid_y)],
                 )
        test_y[valid_index] = model.predict(valid_x, num_iteration=model.best_iteration_)
        model.booster_.save_model(f'results/model_fold{i}.txt', num_iteration=model.best_iteration_)

        importances = model.feature_importances_
        
        feature_imp = pd.DataFrame({'feature':feature_list, 'importance':importances})
        feature_imp['importance'] = (feature_imp['importance'] - feature_imp['importance'].min())  / (feature_imp['importance'].max() - feature_imp['importance'].min())
        avg_feature_imp['importance'] += feature_imp['importance']
        feature_imp = feature_imp.sort_values(by='importance', ascending=False)
        feature_imp.to_csv(f'results/model_feat_imp_fold{i}.csv', index=False)
        print('-'*20)

    eval_rmse = mean_squared_error(data_y, test_y, squared=False)
    avg_feature_imp['importance'] /= n_splits
    avg_feature_imp = avg_feature_imp.sort_values(by='importance', ascending=False)
    avg_feature_imp.to_csv('results/model_feat_imp_avg.csv', index=False)
    return test_y, eval_rmse, avg_feature_imp

def inference(n_splits=5, test_x=None):
    test_y = np.zeros((len(test_x), n_splits))
    for i in range(n_splits):
        use_feats = np.load(f'feature_selection/optimal_feat_select_res_fold{i}.npy', allow_pickle=True)
        model = lightgbm.Booster(model_file=f'results/model_fold{i}.txt')
        test_y[:, i] = model.predict(test_x[use_feats[2]])
    return np.mean(test_y, axis=1)
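
The evaluation pattern used here is out-of-fold (OOF) prediction: each row is predicted by the model that did not see it, and a single RMSE is computed over all rows. A minimal sketch of that bookkeeping, assuming a trivial mean predictor in place of LightGBM and a plain modulo split in place of `StratifiedKFold`:

```python
import math

def k_fold_indices(n_rows, n_splits):
    """Assign row i to fold i % n_splits (unshuffled, unstratified)."""
    return [[i for i in range(n_rows) if i % n_splits == k] for k in range(n_splits)]

def oof_rmse(y, n_splits=3):
    """Fill out-of-fold predictions, then compute one RMSE over all rows."""
    oof = [0.0] * len(y)
    for valid_idx in k_fold_indices(len(y), n_splits):
        valid_set = set(valid_idx)
        train_y = [y[i] for i in range(len(y)) if i not in valid_set]
        fold_pred = sum(train_y) / len(train_y)  # stand-in "model": predict the mean
        for i in valid_idx:
            oof[i] = fold_pred
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(oof, y)) / len(y))
    return oof, rmse

scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical essay scores
oof, rmse = oof_rmse(scores, n_splits=3)
print(oof, round(rmse, 4))
```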

In [9]:
param = {
        'n_estimators': 1024,
         'learning_rate': 0.005,
         'metric': 'rmse',
         'random_state': 42,
         'force_col_wise': True,
         'verbosity': -1,
         'early_stopping_round': 100
        }


data_x = x.copy()
data_y = y.copy()
model_params = param
random_state = 42
n_splits = 5
nb_runs = 50
train_feat_select = True

skf    = StratifiedKFold(n_splits=n_splits, random_state=random_state, shuffle=True)
test_y = np.zeros(len(data_x))

for i, (train_index, valid_index) in enumerate(skf.split(data_x, data_y.astype(str))):
    print(f'Fold:{i}:')
    train_x, train_y, valid_x, valid_y = train_valid_split(data_x, data_y, train_index, valid_index)

    if train_feat_select:
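        # NOTE: the importance runs below fit on the full data (data_x, data_y), which
        # includes this fold's validation rows; pass train_x/train_y to keep the fold held out.
        print('Feature Selection (get null_imp_df):')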
        null_imp_df = get_batch_imp_df(data_x.copy(), data_y.copy(), valid_x.copy(), valid_y.copy(), model_params, nb_runs=nb_runs, null_flag=True)
        print('Feature Selection (actual_imp_df):')
        actual_imp_df = get_batch_imp_df(data_x.copy(), data_y.copy(), valid_x.copy(), valid_y.copy(), model_params, nb_runs=nb_runs, null_flag=False)

        actual_avg_imp_df = normalize_feat_imp(actual_imp_df)
        null_avg_imp_df = normalize_feat_imp(null_imp_df)
        feat_select_dfs = nul_imp_feat_select(model_params,
                            train_x, train_y, valid_x, valid_y,
                            actual_avg_imp_df, null_avg_imp_df,
                            search_thres=[0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 1],
                            search_criterion=['gain', 'split', 'both'])
        
        if not os.path.exists('feature_selection/'):
            os.mkdir('feature_selection/')
        feat_select_dfs.to_csv(f'feature_selection/feature_selection_res_fold{i}.csv', index=False)
        valid_min_eval_rmse = feat_select_dfs['valid_eval_rmse'].min()
        optimal_feat_select_res = feat_select_dfs[feat_select_dfs['valid_eval_rmse'] == valid_min_eval_rmse].values[0]
        np.save(f'feature_selection/optimal_feat_select_res_fold{i}.npy', optimal_feat_select_res)

    use_feats = np.load(f'feature_selection/optimal_feat_select_res_fold{i}.npy', allow_pickle=True)
    print(f'After feature selection: {len(list(train_x))} -> {len(use_feats[2])}.')
    train_x = train_x[use_feats[2]]
    valid_x = valid_x[use_feats[2]]

    model = LGBMRegressor(**model_params)
    model.fit(train_x, train_y, 
            #   eval_set=[(train_x, train_y), (valid_x, valid_y)],
                eval_set=[(valid_x, valid_y)],
                )
    test_y[valid_index] = model.predict(valid_x, num_iteration=model.best_iteration_)

    model.booster_.save_model(f'results/model_fold{i}.txt', num_iteration=model.best_iteration_)


    importances = model.feature_importances_
    feature_list = list(train_x)
    feature_imp = pd.DataFrame({'feature':feature_list, 'importance':importances})
    feature_imp['importance'] = (feature_imp['importance'] - feature_imp['importance'].min())  / (feature_imp['importance'].max() - feature_imp['importance'].min())
    feature_imp = feature_imp.sort_values(by='importance', ascending=False)
    feature_imp.to_csv(f'results/model_feat_imp_fold{i}.csv', index=False)
    print('-'*20)
    del model

eval_rmse = mean_squared_error(data_y, test_y, squared=False)
print('Eval RMSE:', eval_rmse)

Fold:0:
Feature Selection (get null_imp_df):
Run 50 rounds of model training:


100%|██████████| 50/50 [03:21<00:00,  4.04s/it]


Feature Selection (actual_imp_df):
Run 50 rounds of model training:


100%|██████████| 50/50 [28:24<00:00, 34.09s/it]


Feature Selection by Null Importances: gain:


100%|██████████| 9/9 [01:53<00:00, 12.60s/it]


Feature Selection by Null Importances: split:


100%|██████████| 9/9 [01:56<00:00, 12.97s/it]


Feature Selection by Null Importances: both:


100%|██████████| 9/9 [01:56<00:00, 12.94s/it]


After feature selection: 165 -> 105.
[1]	valid_0's rmse: 1.01967
[2]	valid_0's rmse: 1.01669
[3]	valid_0's rmse: 1.0137
[4]	valid_0's rmse: 1.01081
[5]	valid_0's rmse: 1.00786
[6]	valid_0's rmse: 1.00502
[7]	valid_0's rmse: 1.00212
[8]	valid_0's rmse: 0.999315
[9]	valid_0's rmse: 0.996443
[10]	valid_0's rmse: 0.993703
[11]	valid_0's rmse: 0.990867
[12]	valid_0's rmse: 0.988172
[13]	valid_0's rmse: 0.985348
[14]	valid_0's rmse: 0.982523
[15]	valid_0's rmse: 0.979835
[16]	valid_0's rmse: 0.977054
[17]	valid_0's rmse: 0.974217
[18]	valid_0's rmse: 0.97163
[19]	valid_0's rmse: 0.968834
[20]	valid_0's rmse: 0.966157
[21]	valid_0's rmse: 0.963637
[22]	valid_0's rmse: 0.960914
[23]	valid_0's rmse: 0.95826
[24]	valid_0's rmse: 0.955713
[25]	valid_0's rmse: 0.953107
[26]	valid_0's rmse: 0.950546
[27]	valid_0's rmse: 0.948089
[28]	valid_0's rmse: 0.945488
[29]	valid_0's rmse: 0.943047
[30]	valid_0's rmse: 0.940552
[31]	valid_0's rmse: 0.93813
[32]	valid_0's rmse: 0.935671
...

100%|██████████| 50/50 [03:21<00:00,  4.03s/it]


Feature Selection (actual_imp_df):
Run 50 rounds of model training:


100%|██████████| 50/50 [28:35<00:00, 34.30s/it]


Feature Selection by Null Importances: gain:


100%|██████████| 9/9 [02:02<00:00, 13.58s/it]


Feature Selection by Null Importances: split:


100%|██████████| 9/9 [02:05<00:00, 13.96s/it]


Feature Selection by Null Importances: both:


100%|██████████| 9/9 [02:05<00:00, 13.99s/it]


After feature selection: 165 -> 94.
[1]	valid_0's rmse: 1.02318
[2]	valid_0's rmse: 1.02015
[3]	valid_0's rmse: 1.01711
[4]	valid_0's rmse: 1.01409
[5]	valid_0's rmse: 1.01108
[6]	valid_0's rmse: 1.00809
[7]	valid_0's rmse: 1.00507
[8]	valid_0's rmse: 1.00213
[9]	valid_0's rmse: 0.99918
[10]	valid_0's rmse: 0.996328
[11]	valid_0's rmse: 0.993483
[12]	valid_0's rmse: 0.9906
[13]	valid_0's rmse: 0.987728
[14]	valid_0's rmse: 0.984868
[15]	valid_0's rmse: 0.982083
[16]	valid_0's rmse: 0.979263
[17]	valid_0's rmse: 0.976527
[18]	valid_0's rmse: 0.973826
[19]	valid_0's rmse: 0.970983
[20]	valid_0's rmse: 0.968243
[21]	valid_0's rmse: 0.965585
[22]	valid_0's rmse: 0.962979
[23]	valid_0's rmse: 0.960309
[24]	valid_0's rmse: 0.957729
[25]	valid_0's rmse: 0.955156
[26]	valid_0's rmse: 0.952643
[27]	valid_0's rmse: 0.950098
[28]	valid_0's rmse: 0.947547
[29]	valid_0's rmse: 0.944963
[30]	valid_0's rmse: 0.942506
[31]	valid_0's rmse: 0.939921
[32]	valid_0's rmse: 0.937446
...

100%|██████████| 50/50 [03:37<00:00,  4.36s/it]


Feature Selection (actual_imp_df):
Run 50 rounds of model training:


100%|██████████| 50/50 [28:40<00:00, 34.41s/it]


Feature Selection by Null Importances: gain:


100%|██████████| 9/9 [01:49<00:00, 12.20s/it]


Feature Selection by Null Importances: split:


100%|██████████| 9/9 [01:49<00:00, 12.11s/it]


Feature Selection by Null Importances: both:


100%|██████████| 9/9 [01:51<00:00, 12.44s/it]


After feature selection: 165 -> 138.
[1]	valid_0's rmse: 1.02287
[2]	valid_0's rmse: 1.01952
[3]	valid_0's rmse: 1.0162
[4]	valid_0's rmse: 1.0129
[5]	valid_0's rmse: 1.00964
[6]	valid_0's rmse: 1.00644
[7]	valid_0's rmse: 1.00321
[8]	valid_0's rmse: 1.00003
[9]	valid_0's rmse: 0.996873
[10]	valid_0's rmse: 0.993767
[11]	valid_0's rmse: 0.990623
[12]	valid_0's rmse: 0.987583
[13]	valid_0's rmse: 0.984492
[14]	valid_0's rmse: 0.981489
[15]	valid_0's rmse: 0.978466
[16]	valid_0's rmse: 0.975516
[17]	valid_0's rmse: 0.972552
[18]	valid_0's rmse: 0.969595
[19]	valid_0's rmse: 0.966681
[20]	valid_0's rmse: 0.963806
[21]	valid_0's rmse: 0.960919
[22]	valid_0's rmse: 0.95806
[23]	valid_0's rmse: 0.955244
[24]	valid_0's rmse: 0.952452
[25]	valid_0's rmse: 0.949617
[26]	valid_0's rmse: 0.946906
[27]	valid_0's rmse: 0.944043
[28]	valid_0's rmse: 0.941284
[29]	valid_0's rmse: 0.938536
[30]	valid_0's rmse: 0.93587
[31]	valid_0's rmse: 0.933065
[32]	valid_0's rmse: 0.930358
[33]	valid_0's rmse: 0.9

100%|██████████| 50/50 [03:43<00:00,  4.47s/it]


Feature Selection (actual_imp_df):
Run 50 rounds of model training:


100%|██████████| 50/50 [28:36<00:00, 34.34s/it]


Feature Selection by Null Importances: gain:


100%|██████████| 9/9 [01:43<00:00, 11.53s/it]


Feature Selection by Null Importances: split:


100%|██████████| 9/9 [01:50<00:00, 12.23s/it]


Feature Selection by Null Importances: both:


100%|██████████| 9/9 [01:47<00:00, 11.99s/it]


After feature selection: 165 -> 165.
[1]	valid_0's rmse: 1.01887
[2]	valid_0's rmse: 1.0158
[3]	valid_0's rmse: 1.01275
[4]	valid_0's rmse: 1.00973
[5]	valid_0's rmse: 1.00673
[6]	valid_0's rmse: 1.00374
[7]	valid_0's rmse: 1.00076
[8]	valid_0's rmse: 0.997825
[9]	valid_0's rmse: 0.994868
[10]	valid_0's rmse: 0.991939
[11]	valid_0's rmse: 0.989026
[12]	valid_0's rmse: 0.9861
[13]	valid_0's rmse: 0.983243
[14]	valid_0's rmse: 0.980401
[15]	valid_0's rmse: 0.977619
[16]	valid_0's rmse: 0.974825
[17]	valid_0's rmse: 0.97211
[18]	valid_0's rmse: 0.969406
[19]	valid_0's rmse: 0.966696
[20]	valid_0's rmse: 0.964051
[21]	valid_0's rmse: 0.961404
[22]	valid_0's rmse: 0.958795
[23]	valid_0's rmse: 0.956202
[24]	valid_0's rmse: 0.953501
[25]	valid_0's rmse: 0.950979
[26]	valid_0's rmse: 0.94831
[27]	valid_0's rmse: 0.945798
[28]	valid_0's rmse: 0.943218
[29]	valid_0's rmse: 0.940651
[30]	valid_0's rmse: 0.93812
[31]	valid_0's rmse: 0.935668
[32]	valid_0's rmse: 0.933173
[33]	valid_0's rmse: 0.93

100%|██████████| 50/50 [03:22<00:00,  4.06s/it]


Feature Selection (actual_imp_df):
Run 50 rounds of model training:


100%|██████████| 50/50 [28:46<00:00, 34.54s/it]


Feature Selection by Null Importances: gain:


100%|██████████| 9/9 [01:50<00:00, 12.25s/it]


Feature Selection by Null Importances: split:


100%|██████████| 9/9 [01:45<00:00, 11.69s/it]


Feature Selection by Null Importances: both:


100%|██████████| 9/9 [01:45<00:00, 11.73s/it]


After feature selection: 165 -> 126.
[1]	valid_0's rmse: 1.02346
[2]	valid_0's rmse: 1.02046
[3]	valid_0's rmse: 1.01742
[4]	valid_0's rmse: 1.01445
[5]	valid_0's rmse: 1.01145
[6]	valid_0's rmse: 1.00853
[7]	valid_0's rmse: 1.00558
[8]	valid_0's rmse: 1.0027
[9]	valid_0's rmse: 0.999789
[10]	valid_0's rmse: 0.996939
[11]	valid_0's rmse: 0.994123
[12]	valid_0's rmse: 0.991398
[13]	valid_0's rmse: 0.988638
[14]	valid_0's rmse: 0.985993
[15]	valid_0's rmse: 0.983239
[16]	valid_0's rmse: 0.980639
[17]	valid_0's rmse: 0.977904
[18]	valid_0's rmse: 0.975373
[19]	valid_0's rmse: 0.972743
[20]	valid_0's rmse: 0.970225
[21]	valid_0's rmse: 0.967635
[22]	valid_0's rmse: 0.965148
[23]	valid_0's rmse: 0.962697
[24]	valid_0's rmse: 0.960186
[25]	valid_0's rmse: 0.957768
[26]	valid_0's rmse: 0.955325
[27]	valid_0's rmse: 0.952935
[28]	valid_0's rmse: 0.950503
[29]	valid_0's rmse: 0.948143
[30]	valid_0's rmse: 0.945717
[31]	valid_0's rmse: 0.943442
[32]	valid_0's rmse: 0.941108
[33]	valid_0's rmse: 

## Test Inference

In [10]:
data_path = '/kaggle/input/linking-writing-processes-to-writing-quality/'

print('< Testing Data >')
# Build the log-level features lazily with polars, then materialize them as pandas.
test_logs  = pl.scan_csv(data_path + 'test_logs.csv')
test_feats = dev_feats(test_logs)
test_feats = test_feats.collect().to_pandas()

# Build the essay DataFrame from the raw logs and left-join each feature group on `id`.
test_logs   = test_logs.collect().to_pandas()
test_essays = get_essay_df(test_logs)
test_feats  = test_feats.merge(word_feats(test_essays), on='id', how='left')
test_feats  = test_feats.merge(sent_feats(test_essays), on='id', how='left')
test_feats  = test_feats.merge(parag_feats(test_essays), on='id', how='left')
test_feats  = test_feats.merge(get_keys_pressed_per_second(test_logs), on='id', how='left')
test_feats  = test_feats.merge(product_to_keys(test_logs, test_essays), on='id', how='left')

# Keep the ids for the submission file; the model only sees the feature columns.
test_ids = test_feats['id'].values
testin_x = test_feats.drop(['id'], axis=1)

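print('< Learning and Evaluation >')
# LightGBM settings: up to 1024 trees at a small learning rate,
# stopping early after 100 rounds without improvement on the validation set.
param = {
    'n_estimators': 1024,
    'learning_rate': 0.005,
    'metric': 'rmse',
    'random_state': 42,
    'force_col_wise': True,
    'verbosity': 1,
    'early_stopping_round': 100,
}

solution = LGBMRegressor(**param)

# `inference` is defined earlier in the notebook; pass a copy so `testin_x` stays unchanged.
test_y_pred = inference(test_x=testin_x.copy())

# Write the submission file.
sub = pd.DataFrame({'id': test_ids, 'score': test_y_pred})
sub.to_csv('submission.csv', index=False)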

< Testing Data >
< Count by values features >
< Input words stats features >
< Numerical columns features >
< Categorical columns features >
< Idle time features >
< P-bursts features >
< R-bursts features >
< Learning and Evaluation >
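
The training logs above show a different number of surviving features after each selection run (94, 138, 165, and 126 out of 165), so each trained model has to receive only its own columns at test time. The `inference` helper is not shown in this cell, so the following is only a minimal sketch of that idea under stated assumptions: `predict_with_fold_features`, `MeanModel`, and the toy column names are hypothetical stand-ins, not the notebook's actual code.

```python
import numpy as np
import pandas as pd


class MeanModel:
    """Hypothetical stand-in for a fitted regressor: predicts the row mean of its inputs."""

    def predict(self, X):
        return X.to_numpy().mean(axis=1)


def predict_with_fold_features(models, fold_features, test_x):
    """Average the predictions of several models, giving each model only its selected columns."""
    preds = []
    for model, cols in zip(models, fold_features):
        # Subset the test frame to the features this model was trained on.
        preds.append(model.predict(test_x[cols]))
    return np.mean(preds, axis=0)


# Toy data: three features, two rows (assumed values for illustration only).
test_x = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
models = [MeanModel(), MeanModel()]
fold_features = [["a", "b"], ["c"]]  # each model keeps a different feature subset

pred = predict_with_fold_features(models, fold_features, test_x)
print(pred)
```

With real LightGBM models, the per-model column lists would need to be saved at training time (for example alongside each model file) so that the same subsets can be reapplied here.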


In [11]:
!rm -rf /kaggle/working/results
!rm -rf /kaggle/working/feature_selection