<a href="https://www.kaggle.com/code/yaaangzhou/nlp-commonlit-ml-baseline-model?scriptVersionId=142274358" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Created by Yang Zhou**

**[NLP]CommonLit - ML Baseline model**

**7 Sep 2023**

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">[NLP]CommonLit - ML Baseline model</center>
<p><center style="color:#949494; font-family: consolas; font-size: 20px;">Automatically assess summaries written by students in grades 3-12</center></p>

***

**The goal of this competition is to build a model that automatically scores student summaries, helping teachers and learning platforms give students better feedback on their writing.**

This is the first NLP competition I have participated in. I have learned a lot from public kernels and the discussion forums; thank you all for your help.

# 0. Imports

In [1]:
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
import string
import re

# Models
import optuna
from sklearn.model_selection import KFold, GroupKFold, train_test_split
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Metrics
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')



# 1. Load Data

In [2]:
data_dir = "/kaggle/input/commonlit-evaluate-student-summaries/"
train_pro = pd.read_csv(data_dir + 'prompts_train.csv')
train_sum = pd.read_csv(data_dir + 'summaries_train.csv')

test_pro = pd.read_csv(data_dir + 'prompts_test.csv')
test_sum = pd.read_csv(data_dir + 'summaries_test.csv')

submission = pd.read_csv(data_dir + 'sample_submission.csv')

**In the dataset, each prompt describes the task and each summary is a student's answer. We need to score `content` and `wording` separately.**

In [3]:
train_pro

Unnamed: 0,prompt_id,prompt_question,prompt_title,prompt_text
0,39c16e,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...
1,3b9047,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...
2,814d6b,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...
3,ebad26,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an..."


In [4]:
train_sum.head()

Unnamed: 0,student_id,prompt_id,text,content,wording
0,000e8c3c7ddb,814d6b,The third wave was an experimentto see how peo...,0.205683,0.380538
1,0020ae56ffbf,ebad26,They would rub it up with soda to make the sme...,-0.548304,0.506755
2,004e978e639e,3b9047,"In Egypt, there were many occupations and soci...",3.128928,4.231226
3,005ab0199905,3b9047,The highest class was Pharaohs these people we...,-0.210614,-0.471415
4,0070c9e7af47,814d6b,The Third Wave developed rapidly because the ...,3.272894,3.219757


I have two ideas:
1. Join the prompt and summary tables on `prompt_id`, then concatenate the prompt and the summary text into one string, separated by a delimiter.
2. Ignore the prompt content and use only `prompt_id` as a categorical input feature, which keeps the problem within reach of classical machine learning models.

Let's do the merge.

In [5]:
train = train_sum.merge(train_pro, how="left", on="prompt_id")
test = test_sum.merge(test_pro, how="left", on="prompt_id")
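As a rough illustration of the two ideas above, the sketch below works on a tiny made-up frame that mimics the merged table. The `[SEP]` delimiter, the `full_text` and `prompt_code` column names, and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the merged train table (values are made up)
toy = pd.DataFrame({
    "prompt_id": ["814d6b", "ebad26", "814d6b"],
    "prompt_question": ["Q about the Third Wave", "Q about the factory", "Q about the Third Wave"],
    "text": ["summary one", "summary two", "summary three"],
})

# Idea 1: join prompt and summary into one string, separated by a delimiter
SEP = " [SEP] "  # hypothetical delimiter
toy["full_text"] = toy["prompt_question"] + SEP + toy["text"]

# Idea 2: ignore the prompt content and encode prompt_id as a categorical code
toy["prompt_code"] = toy["prompt_id"].astype("category").cat.codes

print(toy[["full_text", "prompt_code"]])
```

Note that the features built in the rest of this notebook use only the summary `text`, so neither idea is applied further down.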

In [6]:
print("Full train dataset shape is {}".format(train.shape))

Full train dataset shape is (7165, 8)


In [7]:
train.head(3)

Unnamed: 0,student_id,prompt_id,text,content,wording,prompt_question,prompt_title,prompt_text
0,000e8c3c7ddb,814d6b,The third wave was an experimentto see how peo...,0.205683,0.380538,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...
1,0020ae56ffbf,ebad26,They would rub it up with soda to make the sme...,-0.548304,0.506755,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an..."
2,004e978e639e,3b9047,"In Egypt, there were many occupations and soci...",3.128928,4.231226,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...


# 2. Preprocess and Functions

## Preprocess

Since we are using classical ML methods, we add some hand-crafted features derived from the text to serve as model inputs.

In [8]:
# Build the stopword set once instead of on every call
STOPWORDS = set(stopwords.words('english'))

def count_stopwords(text):
    """Count whitespace-separated tokens that are English stopwords."""
    words = text.split()
    return sum(1 for word in words if word.lower() in STOPWORDS)

In [9]:
def count_punctuation(text):
    punctuation_set = set(string.punctuation)
    punctuation_count = sum(1 for char in text if char in punctuation_set)
    return punctuation_count

In [10]:
def count_numbers(text: str):
    numbers = re.findall(r'\d+', text)
    numbers_count = len(numbers)
    return numbers_count

In [11]:
def data_preprocess(df):
    # Adds the feature columns to df in place and returns it.
    # Note: split(' ') also counts empty strings produced by consecutive spaces.
    df['text_word_count'] = df['text'].apply(lambda x: len(x.split(' ')))
    df['text_length'] = df['text'].apply(len)
    df['text_stopword_count'] = df['text'].apply(count_stopwords)
    df['text_punct_count'] = df['text'].apply(count_punctuation)
    df['text_number_count'] = df['text'].apply(count_numbers)
    return df
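To make the counting rules concrete, here is a small standalone check on a made-up sentence. It mirrors the punctuation and number counts defined above and also shows how `split(' ')` differs from `split()` when a text contains consecutive spaces; the sample string is an assumption for illustration.

```python
import re
import string

sample = "Hello,  world! 42 times."  # note the double space after the comma

# split(' ') keeps the empty string between the two spaces; split() does not
words_single_space = sample.split(' ')
words_any_whitespace = sample.split()

# Same logic as count_punctuation and count_numbers
punctuation_set = set(string.punctuation)
punct_count = sum(1 for ch in sample if ch in punctuation_set)
number_count = len(re.findall(r'\d+', sample))

print(len(words_single_space), len(words_any_whitespace), punct_count, number_count)
# prints: 5 4 3 1
```

So `text_word_count` may slightly overcount words for summaries with irregular spacing.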

In [12]:
train_preprocessed = data_preprocess(train)
test_preprocessed = data_preprocess(test)

In [13]:
train_preprocessed.head(3)

Unnamed: 0,student_id,prompt_id,text,content,wording,prompt_question,prompt_title,prompt_text,text_word_count,text_length,text_stopword_count,text_punct_count,text_number_count
0,000e8c3c7ddb,814d6b,The third wave was an experimentto see how peo...,0.205683,0.380538,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...,61,346,25,3,0
1,0020ae56ffbf,ebad26,They would rub it up with soda to make the sme...,-0.548304,0.506755,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an...",52,244,30,2,0
2,004e978e639e,3b9047,"In Egypt, there were many occupations and soci...",3.128928,4.231226,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...,235,1370,98,38,0


In [14]:
# List of extra features
features = ['text_word_count','text_length','text_stopword_count','text_punct_count','text_number_count']

# 3. Train models

Now we can build ML models on the features we created. Since there are two targets (`content` and `wording`), we train a separate model for each.

In [15]:
# Feature columns plus the content target
feature_content = features + ['content']

# Feature columns plus the wording target
feature_wording = features + ['wording']

In [16]:
# Build a separate dataset for each target

train_content = train_preprocessed[feature_content]
train_wording = train_preprocessed[feature_wording]

In [17]:
test_content = test_preprocessed[features]
test_wording = test_preprocessed[features]

In [18]:
train_content.head(3)

Unnamed: 0,text_word_count,text_length,text_stopword_count,text_punct_count,text_number_count,content
0,61,346,25,3,0,0.205683
1,52,244,30,2,0,-0.548304
2,235,1370,98,38,0,3.128928


## Model for content score

In [19]:
content_xgb_cv_scores, content_xgb_preds = list(), list()
content_lgbm_cv_scores, content_lgbm_preds = list(), list()
content_rf_cv_scores, content_rf_preds = list(), list()

kf = KFold(n_splits=3, random_state=42, shuffle=True)

X = train_content.drop('content',axis=1)
Y = train_content['content']

for i, (train_ix, test_ix) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    Y_train, Y_test = Y.iloc[train_ix], Y.iloc[test_ix]
    
    print('---------------------------------------------------------------')
    
    ## XGBoost
    xgb_content = XGBRegressor().fit(X_train, Y_train)
    xgb_pred = xgb_content.predict(X_test)   
    xgb_score_fold = np.sqrt(mean_squared_error(Y_test, xgb_pred))
    print('Fold', i+1, '==> XGBoost oof RMSE score is ==>', xgb_score_fold)
    content_xgb_cv_scores.append(xgb_score_fold)
    
    ## Pred
    xgb_pred_test = xgb_content.predict(test_content)
    content_xgb_preds.append(xgb_pred_test)
    
    ## LGBM
    lgbm_content = LGBMRegressor().fit(X_train, Y_train)
    lgbm_pred = lgbm_content.predict(X_test) 
    lgbm_score_fold = np.sqrt(mean_squared_error(Y_test, lgbm_pred))
    print('Fold', i+1, '==> LGBM oof RMSE score is ==>', lgbm_score_fold)
    content_lgbm_cv_scores.append(lgbm_score_fold)

    ## Pred
    lgbm_pred_test = lgbm_content.predict(test_content)
    content_lgbm_preds.append(lgbm_pred_test)
    
    ## RF
    rf_content = RandomForestRegressor().fit(X_train, Y_train)
    rf_pred = rf_content.predict(X_test) 
    rf_score_fold = np.sqrt(mean_squared_error(Y_test, rf_pred))
    print('Fold', i+1, '==> RF oof RMSE score is ==>', rf_score_fold)
    content_rf_cv_scores.append(rf_score_fold)

    ## Pred
    rf_pred_test = rf_content.predict(test_content)
    content_rf_preds.append(rf_pred_test)
    
print('---------------------------------------------------------------')
print('Average RMSE of XGBoost model is:', np.mean(content_xgb_cv_scores))
print('Average RMSE of LGBM model is:', np.mean(content_lgbm_cv_scores))
print('Average RMSE of RF model is:', np.mean(content_rf_cv_scores))

---------------------------------------------------------------
Fold 1 ==> XGBoost oof RMSE score is ==> 0.5479318254787
Fold 1 ==> LGBM oof RMSE score is ==> 0.5250568809281149
Fold 1 ==> RF oof RMSE score is ==> 0.5376441190818294
---------------------------------------------------------------
Fold 2 ==> XGBoost oof RMSE score is ==> 0.5555447304427135
Fold 2 ==> LGBM oof RMSE score is ==> 0.5295173583380971
Fold 2 ==> RF oof RMSE score is ==> 0.5447827862407433
---------------------------------------------------------------
Fold 3 ==> XGBoost oof RMSE score is ==> 0.5450634148270878
Fold 3 ==> LGBM oof RMSE score is ==> 0.5251731278216436
Fold 3 ==> RF oof RMSE score is ==> 0.5376053125789495
---------------------------------------------------------------
Average RMSE of XGBoost model is: 0.5495133235828337
Average RMSE of LGBM model is: 0.5265824556959519
Average RMSE of RF model is: 0.5400107393005075
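The `KFold` split above shuffles individual rows, so summaries of the same prompt end up in both the training and the validation folds. If the hidden test set contains prompts not seen in training (an assumption here), grouping folds by `prompt_id` may give a more realistic estimate. A minimal sketch with `GroupKFold`, which the notebook imports but does not use, on made-up group labels:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up prompt ids for 8 summaries spread over 4 prompts
groups = np.array(["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"])
X_toy = np.arange(len(groups)).reshape(-1, 1)  # placeholder features

gkf = GroupKFold(n_splits=4)
for fold, (train_ix, valid_ix) in enumerate(gkf.split(X_toy, groups=groups)):
    train_prompts = set(groups[train_ix])
    valid_prompts = set(groups[valid_ix])
    # No prompt appears on both sides of the split
    assert train_prompts.isdisjoint(valid_prompts)
    print("Fold", fold + 1, "validates on", sorted(valid_prompts))
```

In the notebook itself, the equivalent call would pass `groups=train_preprocessed['prompt_id']`; with only four training prompts, at most four folds are possible.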


## Model for wording score

In [20]:
wording_xgb_cv_scores, wording_xgb_preds = list(), list()
wording_lgbm_cv_scores, wording_lgbm_preds = list(), list()
wording_rf_cv_scores, wording_rf_preds = list(), list()

kf = KFold(n_splits=3, random_state=42, shuffle=True)

X = train_wording.drop('wording',axis=1)
Y = train_wording['wording']

for i, (train_ix, test_ix) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    Y_train, Y_test = Y.iloc[train_ix], Y.iloc[test_ix]
    
    print('---------------------------------------------------------------')
    
    ## XGBoost
    xgb_wording = XGBRegressor().fit(X_train, Y_train)
    xgb_pred = xgb_wording.predict(X_test)   
    xgb_score_fold = np.sqrt(mean_squared_error(Y_test, xgb_pred))
    print('Fold', i+1, '==> XGBoost oof RMSE score is ==>', xgb_score_fold)
    wording_xgb_cv_scores.append(xgb_score_fold)

    ## Pred
    xgb_pred_test = xgb_wording.predict(test_wording)
    wording_xgb_preds.append(xgb_pred_test)
    
    ## LGBM
    lgbm_wording = LGBMRegressor().fit(X_train, Y_train)
    lgbm_pred = lgbm_wording.predict(X_test) 
    lgbm_score_fold = np.sqrt(mean_squared_error(Y_test, lgbm_pred))
    print('Fold', i+1, '==> LGBM oof RMSE score is ==>', lgbm_score_fold)
    wording_lgbm_cv_scores.append(lgbm_score_fold)

    ## Pred
    lgbm_pred_test = lgbm_wording.predict(test_wording)
    wording_lgbm_preds.append(lgbm_pred_test)
    
    ## RF
    rf_wording = RandomForestRegressor().fit(X_train, Y_train)
    rf_pred = rf_wording.predict(X_test) 
    rf_score_fold = np.sqrt(mean_squared_error(Y_test, rf_pred))
    print('Fold', i+1, '==> RF oof RMSE score is ==>', rf_score_fold)
    wording_rf_cv_scores.append(rf_score_fold)

    ## Pred
    rf_pred_test = rf_wording.predict(test_wording)
    wording_rf_preds.append(rf_pred_test)
    
print('---------------------------------------------------------------')
print('Average RMSE of XGBoost model is:', np.mean(wording_xgb_cv_scores))
print('Average RMSE of LGBM model is:', np.mean(wording_lgbm_cv_scores))
print('Average RMSE of RF model is:', np.mean(wording_rf_cv_scores))

---------------------------------------------------------------
Fold 1 ==> XGBoost oof RMSE score is ==> 0.8314521964271242
Fold 1 ==> LGBM oof RMSE score is ==> 0.8055005351166274
Fold 1 ==> RF oof RMSE score is ==> 0.828011143296084
---------------------------------------------------------------
Fold 2 ==> XGBoost oof RMSE score is ==> 0.8321064890306868
Fold 2 ==> LGBM oof RMSE score is ==> 0.8066564767312416
Fold 2 ==> RF oof RMSE score is ==> 0.8327533431844204
---------------------------------------------------------------
Fold 3 ==> XGBoost oof RMSE score is ==> 0.8242572250293704
Fold 3 ==> LGBM oof RMSE score is ==> 0.79785440289046
Fold 3 ==> RF oof RMSE score is ==> 0.8155704551529147
---------------------------------------------------------------
Average RMSE of XGBoost model is: 0.8292719701623938
Average RMSE of LGBM model is: 0.8033371382461096
Average RMSE of RF model is: 0.8254449805444731
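To compare the three models across both targets at once, the averages printed above can be combined into a single number. The sketch below takes the mean of the content and wording RMSE per model, assuming the competition metric is a columnwise mean RMSE; averaging per-fold RMSEs only approximates that metric.

```python
# Average fold RMSEs copied from the printed outputs above
content_rmse = {"XGBoost": 0.5495133235828337, "LGBM": 0.5265824556959519, "RF": 0.5400107393005075}
wording_rmse = {"XGBoost": 0.8292719701623938, "LGBM": 0.8033371382461096, "RF": 0.8254449805444731}

# Mean over the two target columns for each model
combined = {name: (content_rmse[name] + wording_rmse[name]) / 2 for name in content_rmse}

best = min(combined, key=combined.get)
for name, score in combined.items():
    print(f"{name}: {score:.4f}")
print("Lowest combined score:", best)
```

On these numbers LGBM has the lowest combined score, at roughly 0.665.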


In [21]:
content_rf_preds

[array([-1.50471248, -1.50471248, -1.50471248, -1.50471248]),
 array([-1.38091993, -1.38091993, -1.38091993, -1.38091993]),
 array([-1.3221819, -1.3221819, -1.3221819, -1.3221819])]
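The per-fold test predictions are collected in lists such as `content_rf_preds`, but the submission step uses only the models fitted in the last fold. One common alternative, sketched here as an assumption rather than the author's method, is to average the predictions across folds:

```python
import numpy as np

# Per-fold RF predictions for content, copied from the output above
fold_preds = [
    np.array([-1.50471248, -1.50471248, -1.50471248, -1.50471248]),
    np.array([-1.38091993, -1.38091993, -1.38091993, -1.38091993]),
    np.array([-1.3221819, -1.3221819, -1.3221819, -1.3221819]),
]

# Average across folds (axis 0) to get one prediction per test row
avg_pred = np.mean(fold_preds, axis=0)
print(avg_pred)
```

The same call would apply to any of the other prediction lists, for example `np.mean(wording_lgbm_preds, axis=0)`.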

# 4. Submission

In [22]:
# Note: rf_content and rf_wording are the RF models fitted in the last CV fold only
test['content'] = rf_content.predict(test_content)
test['wording'] = rf_wording.predict(test_wording)

In [23]:
submission = test[['student_id','content','wording']]
submission.to_csv('submission.csv',index=False)

In [24]:
submission

Unnamed: 0,student_id,content,wording
0,000000ffffff,-1.322182,-1.441624
1,111111eeeeee,-1.322182,-1.441624
2,222222cccccc,-1.322182,-1.441624
3,333333dddddd,-1.322182,-1.441624


**This is a simple first attempt. We did not use an LLM, so the model cannot capture the semantic content of each answer. Adding more hand-crafted features would likely improve accuracy, but I do not expect it to match the results of LLM-based approaches.**
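As one example of an additional hand-crafted feature, the sketch below measures how many of a summary's distinct words also occur in the prompt text. The function names `tokenize` and `prompt_overlap_ratio` and the example strings are hypothetical; this feature is not part of the notebook and its effect on the score is untested.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prompt_overlap_ratio(summary: str, prompt_text: str) -> float:
    """Share of distinct summary words that also occur in the prompt text."""
    summary_words = tokenize(summary)
    if not summary_words:
        return 0.0
    return len(summary_words & tokenize(prompt_text)) / len(summary_words)

# Made-up example strings
prompt = "Egyptian society was structured like a pyramid."
summary = "Society was like a pyramid with farmers at the bottom."
print(prompt_overlap_ratio(summary, prompt))  # prints 0.5
```

On the merged frame, it could be applied row by row with `train.apply(lambda r: prompt_overlap_ratio(r['text'], r['prompt_text']), axis=1)`.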