# Evaluating Student Writing

# Business Understanding
Writing is a critical skill for success. However, less than a third of high school seniors are proficient writers, according to the National Assessment of Educational Progress. One way to help students improve their writing is via automated feedback tools, which evaluate student writing and provide personalized feedback.

In this task, you need to identify elements in student writing. More specifically, you will automatically segment texts and classify argumentative and rhetorical elements in essays written by 6th-12th grade students.

If successful, you'll make it easier for students to receive feedback on their writing and increase opportunities to improve writing outcomes. Virtual writing tutors and automated writing systems can leverage these algorithms while teachers may use them to reduce grading time. The open-sourced algorithms you come up with will allow any educational organization to better help young writers develop.

### Objective
Automatically segment and classify argumentative and rhetorical elements in essays written by 6th-12th grade students into 7 types:

| Type | Definition |
|---|---|
| Lead | an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis |
| Position | an opinion or conclusion on the main question |
| Claim | a claim that supports the position |
| Counterclaim | a claim that refutes another claim or gives an opposing reason to the position |
| Rebuttal | a claim that refutes a counterclaim |
| Evidence | ideas or examples that support claims, counterclaims, or rebuttals |
| Concluding Statement | a concluding statement that restates the claims |

# Data Understanding
The data is provided in two formats:

* A train.csv file with detailed information about annotation for each essay.
* A folder of text files, each containing one essay.

In [None]:
pip install iterative-stratification

In [None]:
# Import packages
import numpy as np
import pandas as pd 
import os
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.layers import Input, Dense, Dropout
import spacy
from spacy import displacy
from pylab import cm, matplotlib
from transformers import AutoConfig, AutoTokenizer, TFAutoModel
from wordcloud import WordCloud, STOPWORDS
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold


## Data overview
Let's look at the training data table.

In [None]:
data_df = pd.read_csv('../input/feedback-prize-2021/train.csv')
data_df.head()

There are 8 fields in the table:

| Field              | Description                                                             |
|--------------------|-------------------------------------------------------------------------|
| id                 | ID code for essay response                                              |
| discourse_id       | ID code for discourse element                                           |
| discourse_start    | character position where discourse element begins in the essay response |
| discourse_end      | character position where discourse element ends in the essay response   |
| discourse_text     | text of discourse element                                               |
| discourse_type     | classification of discourse element                                     |
| discourse_type_num | enumerated class label of discourse element                             |
| predictionstring   | the word indices of the training sample, as required for predictions    |
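
To make the relationship between the character positions and `predictionstring` concrete, here is a minimal sketch. It assumes that word indices are the positions of whitespace-separated words in the essay and that a span starts at a word boundary; the helper `to_predictionstring` and the toy text are hypothetical and not part of the competition data.

```python
def to_predictionstring(text, start, end):
    """Return the space-separated word indices covered by text[start:end].

    Assumes words are whitespace-separated and the span starts at a word boundary.
    """
    # Number of whole words before the span gives the index of its first word
    first_word = len(text[:start].split())
    # Number of words inside the span
    num_words = len(text[start:end].split())
    return " ".join(str(i) for i in range(first_word, first_word + num_words))


# Toy essay: the second sentence starts at character 13 and ends at character 28
toy_text = "Hello world. Phones are bad."
toy_prediction = to_predictionstring(toy_text, 13, 28)
print(toy_prediction)
```

In this toy case the span covers the third, fourth and fifth words of the text, so the word indices are 2, 3 and 4.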

In [None]:
data_df.info()

Three fields are numeric, and the rest have the object dtype.

In [None]:
data_df.isnull().sum()

There are no missing values in the training set.

Let's see how many raw text files and annotations are in the training set.

In [None]:
raw_text_files = os.listdir('../input/feedback-prize-2021/train')
print(f'Training data consists of {len(raw_text_files)} texts')
print(f'Training data consists of {data_df.shape[0]} annotations')
print(f'Each essay contains on average {round(data_df.shape[0]/len(raw_text_files), 1)} annotations.')

Let's look at one specific essay.

In [None]:
with open('../input/feedback-prize-2021/train/423A1CA112E2.txt', 'r') as file:
    first_txt = file.read()
print(first_txt)

We can see that the essay has many paragraphs. If we look closely, we can also see that its punctuation and capitalization contain many errors.

In [None]:
data_df[data_df['id'] == "423A1CA112E2"]

Let's visualize some text.

In [None]:
colors = {'Lead': '#34ebb7',
        'Position': '#2b7ff6',
        'Evidence': '#2adddd',
        'Claim': '#80ffb4',
        'Concluding Statement': '#d4dd80',
        'Counterclaim': '#ff8042',
        'Rebuttal': '#ff0000'}

def visualize_text(example, df = data_df):
    """
    Visualize the class of each span in a text.
    
    Arguments:
    example -- the ID of the essay
    df -- the data frame containing information about the essay
    """
    ents = []
    for i, row in df[df['id'] == example].iterrows():
        ents.append({'start': int(row['discourse_start']), 
                    'end': int(row['discourse_end']), 
                    'label': row['discourse_type']})
        
    with open(f'../input/feedback-prize-2021/train/{example}.txt', 'r') as file:
        data = file.read()
    doc2 = {"text": data,
            "ents": ents,
            "title": example}

    options = {"ents": ['Lead', 'Position', 'Evidence', 'Claim', 'Concluding Statement', 'Counterclaim', 'Rebuttal'],
               "colors": colors}
    displacy.render(doc2, style = "ent", options = options, manual = True, jupyter = True)

In [None]:
# Choose 4 examples to show
to_show = data_df['id'].sample(n = 4, random_state = 6)

for essay in to_show:
    visualize_text(essay, data_df)
    print('\n')

From the visualization, we can see that:
* One essay may not have all the discourse types
* Not all the text in an essay is classified
* One sentence can have more than 1 discourse type
* There may be 2 spans of the same discourse type next to each other, even within the same sentence
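
The observation that not all the text is classified can be quantified per essay. The sketch below is only an illustration: `labelled_fraction` is a hypothetical helper, and the spans are made-up `(discourse_start, discourse_end)` pairs rather than values from the training table.

```python
def labelled_fraction(text_length, spans):
    """Fraction of an essay's characters covered by at least one discourse span."""
    covered = set()
    for start, end in spans:
        # Collect every character position inside the span [start, end)
        covered.update(range(start, end))
    return len(covered) / text_length


# Toy essay of 100 characters with two annotated spans and some unlabelled gaps
toy_spans = [(0, 40), (45, 80)]
fraction = labelled_fraction(100, toy_spans)
print(f'{fraction:.0%} of the characters are labelled')
```

Applied to the real data, the same idea would take the spans from `discourse_start` and `discourse_end` and the length from the essay text.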

### Texts (essay) overview

In [None]:
# Create a data frame with id and text only
texts = []
for file in raw_text_files:
    with open(f'../input/feedback-prize-2021/train/{file}') as f:
        texts.append({'id': file[:-4], 'text': f.read()})
texts_df = pd.DataFrame(texts)

# Count the number of characters and the number of words in each essay
texts_df['len'] = texts_df['text'].apply(len)
texts_df['word_num'] = texts_df['text'].apply(lambda x: len(x.split()))

In [None]:
texts_df.describe()

Most essays have fewer than 5,000 characters, with some outliers of up to about 18,000 characters.

In [None]:
texts_df['len'].hist(bins = 50, figsize = (12, 8))
plt.title('Number of Characters per Essay', fontsize = 15, pad = 15)
plt.xlabel('Number of characters')
plt.ylabel('Frequency');

Most essays have about 200 to 500 words, with some outliers of more than 1,600 words.

In [None]:
texts_df['word_num'].hist(bins = 50, figsize = (12, 8))
plt.title('Number of Words per Essay', fontsize = 16, pad = 15)
plt.xlabel('Number of words')
plt.ylabel('Frequency');

In [None]:
less512 = (texts_df['word_num'] <= 512).sum()/len(texts_df['word_num'])*100
less1024 = (texts_df['word_num'] <= 1024).sum()/len(texts_df['word_num'])*100
print(f'{round(less512, 1)}% of essays have at most 512 words '
      f'and {round(less1024, 1)}% have at most 1024 words.')

### Length and frequency of each discourse type

In [None]:
ax = data_df['discourse_type'].value_counts(ascending = True).plot(kind = 'barh', figsize = (12, 8))
#ax.bar_label(ax.containers[0], label_type="edge", padding = 0.2)
plt.title('Frequency of each Discourse Type', fontsize = 16, pad = 15)
plt.ylabel('')
plt.xlabel('Frequency');

The most frequent discourse type is Claim, and the least frequent is Rebuttal.

In [None]:
# Add a column with the number of words in each discourse element
data_df["discourse_num_word"] = data_df["discourse_text"].apply(lambda x: len(x.split()))

ax = data_df.groupby('discourse_type')["discourse_num_word"].mean().sort_values().plot(kind = 'barh', figsize = (12, 8))
#ax.bar_label(ax.containers[0], label_type="edge", fmt='%1.1f', padding = 0.2)
plt.title('Average length of each Discourse Type', fontsize = 16, pad = 15)
plt.xlabel('Number of words')
plt.ylabel('');

In [None]:
plt.figure(figsize=(14,10))
sns.boxplot(x = 'discourse_type', y = 'discourse_num_word', data = data_df)
#order = data_df.groupby('discourse_type')["discourse_len"].mean().sort_values(ascending = False).index
plt.title('Length of each Discourse Type', fontsize = 16, pad = 15)
plt.xlabel('')
plt.ylabel('Number of words');

Evidence spans are the longest on average, and Claim spans are the shortest, which seems reasonable.

In [None]:
num_discourse = data_df.groupby('id')['discourse_type'].count()
print(f'Minimum number of discourse spans per essay: {num_discourse.min()}')
print(f'Maximum number of discourse spans per essay: {num_discourse.max()} \n')

num_discourse.hist(figsize = (12, 8), bins = 26)
plt.title('Number of Discourse Span per Essay', fontsize = 16, pad = 15)
plt.xlabel('Number')
plt.ylabel('Frequency');

In [None]:
pd.DataFrame(num_discourse).describe()

Most essays have about 7 to 11 discourse spans.

## Make word cloud

In [None]:
wordcloud = WordCloud(stopwords = STOPWORDS, max_font_size = 120, max_words = 200,
                      width = 1200, height = 800, background_color = 'white')
wordcloud.generate(' '.join(txt for txt in data_df["discourse_text"]))
plt.figure(figsize = (16, 12))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.title('Word Cloud for all Texts', fontsize = 25, pad = 18);

Since these essays were written by students, the most frequent words are: student, people, school, teacher, help, and college.

## Create training and validation set

The data set is quite large. To experiment more quickly, I use 40% of the data for training and 10% for validation. After finding the best solution on this small training set, I will train on all the data and use the Kaggle public test data for validation. Finally, the private test data set will be used for testing.
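
The key point of this split is that it is done on essay IDs, so all annotations of one essay land in the same set. A minimal sketch of that idea follows. It uses Python's `random` module on made-up IDs instead of `np.random.choice`, and `split_ids` is a hypothetical helper, so it illustrates the scheme rather than reproducing the notebook's exact split.

```python
import random


def split_ids(ids, train_frac=0.4, valid_frac=0.1, seed=6):
    """Shuffle essay IDs once, then take disjoint training and validation slices."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_valid = int(valid_frac * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:n_train + n_valid]


# Toy essay IDs (hypothetical)
toy_ids = [f'essay_{i}' for i in range(20)]
train_ids, valid_ids = split_ids(toy_ids)
print(len(train_ids), 'training essays,', len(valid_ids), 'validation essays')
```

Because both slices come from one shuffled list, no essay can appear in both sets.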

In [None]:
# All id name of texts
all_id = data_df.id.unique()

# Create training and validation set
np.random.seed(6) # For reproducibility
train_idx = np.random.choice(np.arange(len(all_id)), int(0.4*len(all_id)), replace = False)
left_set = np.setdiff1d(np.arange(len(all_id)), train_idx)
valid_idx = np.random.choice(left_set, int(0.1*len(all_id)), replace = False)
np.random.seed(None)

Check whether the new training and validation sets are representative of all the data.

In [None]:
data_df[data_df['id'].isin(all_id[train_idx])]['discourse_type'].value_counts(ascending = True).plot(kind = 'barh', figsize = (12, 8))
plt.title('Frequency of each Discourse Type in Training Set', fontsize = 16, pad = 15)
plt.ylabel('')
plt.xlabel('Frequency');

In [None]:
data_df[data_df['id'].isin(all_id[valid_idx])]['discourse_type'].value_counts(ascending = True).plot(kind = 'barh', figsize = (12, 8))
plt.title('Frequency of each Discourse Type in Validation Set', fontsize = 16, pad = 15)
plt.ylabel('')
plt.xlabel('Frequency');

In [None]:
data_df['discourse_type'].value_counts(ascending = True, normalize = True)/data_df[data_df['id'].isin(all_id[train_idx])]['discourse_type'].value_counts(ascending = True, normalize = True)

In [None]:
data_df[data_df['id'].isin(all_id[train_idx])]['discourse_type'].value_counts(ascending = True, normalize = True)/data_df[data_df['id'].isin(all_id[valid_idx])]['discourse_type'].value_counts(ascending = True, normalize = True)

The small training set and the full data set have the same distribution of discourse types. In the validation set, the proportions of Rebuttal and Counterclaim are higher. However, since those types are rare, this is acceptable for validation purposes.

In [None]:
# If KFold cross validation is used:
train_CV = np.append(train_idx, valid_idx)

## Modelling

In [None]:
# Token file
load_tokens_from = '../input/longformerbase4096'

# Pretrained model
downloaded_model_path = '../input/longformerbase4096'

# NER target file
NER_target = '../input/ner-target-for-feedback-prize-competition'

# Max sequence length for model
max_len = 1024

### Tokenize Training set

First, we need to convert the training data set into NER token arrays that we can use to train a NER transformer.
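
As a rough illustration of what the NER targets look like, the sketch below assigns B (beginning), I (inside) and O (outside) labels to tokens from character spans. It is a simplification under stated assumptions: it splits on whitespace with `re` instead of using the Longformer tokenizer, and `bio_labels`, the toy text and the toy spans are all hypothetical.

```python
import re


def bio_labels(text, spans):
    """Label whitespace tokens with B-/I-/O tags from (start, end, label) character spans."""
    # Character offsets (start, end) of each whitespace-separated token
    tokens = [(m.start(), m.end()) for m in re.finditer(r'\S+', text)]
    labels = ['O'] * len(tokens)
    for start, end, label in spans:
        first = True
        for i, (tok_start, tok_end) in enumerate(tokens):
            # A token is labelled only if it lies entirely inside the span
            if tok_start >= start and tok_end <= end:
                labels[i] = ('B-' if first else 'I-') + label
                first = False
    return labels


toy_text = 'Phones distract drivers. Well, some people disagree.'
toy_spans = [(0, 24, 'Claim'), (31, 52, 'Counterclaim')]
toy_labels = bio_labels(toy_text, toy_spans)
for token, label in zip(toy_text.split(), toy_labels):
    print(f'{token:10s} {label}')
```

The word "Well," falls between the two spans, so it receives the O label, which mirrors the unclassified text seen in the visualizations.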

In [None]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(downloaded_model_path)

In [None]:
def create_target(text_id, dataframe, backbone, name,
                  max_len = max_len, augmented = None, NER_target = None):
    """
    Create token, attention mask and target array.
    
    Arguments:
    text_id -- array or list of essay id
    dataframe -- the dataframe of data set
    backbone -- the name of backbone. For example: 'longformer', 'deberta'
    name -- in the list: 'train_origin', 'val', 'CV', 'train_roberta_augmented',
    'train_backtrans_augmented', 'train_syn_augmented'
    max_len -- max token length
    augmented -- type of augmentation used: 'augmented_back', 'augmented_roberta', 'augmented_syn'
    NER_target -- address of saved folder for token, attention mask and target array
    
    Return:
    targets -- target array
    text_tokens -- token array
    text_attention -- attention mask array
    """
    
    # If a folder of saved arrays is given, load them instead of recomputing
    if NER_target:
        targets = np.load(f'{NER_target}/{name}_targets_{backbone}_{max_len}.npy')
        text_tokens = np.load(f'{NER_target}/{name}_tokens_{backbone}_{max_len}.npy')
        text_attention = np.load(f'{NER_target}/{name}_attention_{backbone}_{max_len}.npy')
        print('NER tokens loaded')
        return targets, text_tokens, text_attention
    
    # The tokens and attention arrays   
    text_tokens = np.zeros((len(text_id), max_len), dtype = 'int32')
    text_attention = np.zeros((len(text_id), max_len), dtype = 'int32')

    # The 14 classes for NER
    lead_b = np.zeros((len(text_id), max_len))
    lead_i = np.zeros((len(text_id), max_len))

    position_b = np.zeros((len(text_id), max_len))
    position_i = np.zeros((len(text_id), max_len))

    evidence_b = np.zeros((len(text_id), max_len))
    evidence_i = np.zeros((len(text_id), max_len))

    claim_b = np.zeros((len(text_id), max_len))
    claim_i = np.zeros((len(text_id), max_len))

    conclusion_b = np.zeros((len(text_id), max_len))
    conclusion_i = np.zeros((len(text_id), max_len))

    counterclaim_b = np.zeros((len(text_id), max_len))
    counterclaim_i = np.zeros((len(text_id), max_len))

    rebuttal_b = np.zeros((len(text_id), max_len))
    rebuttal_i = np.zeros((len(text_id), max_len))

    targets_b = [lead_b, position_b, evidence_b, claim_b, conclusion_b, counterclaim_b, rebuttal_b]
    targets_i = [lead_i, position_i, evidence_i, claim_i, conclusion_i, counterclaim_i, rebuttal_i]    
    
    target_map = {'Lead': 0, 'Position': 1, 'Evidence': 2, 'Claim': 3, 'Concluding Statement': 4,
                 'Counterclaim': 5, 'Rebuttal': 6}
    
    # Loop through each text
    for id_num, n in enumerate(text_id):

        # Read text, tokenize, and save in token arrays       
        # Load the text file either from the train set or from the augmented set
        try:
            with open(f'../input/feedback-prize-2021/train/{n}.txt', 'r') as f:
                txt = f.read()
        except FileNotFoundError:
            with open(f'../input/data-augmented/{augmented}/{n}.txt', 'r') as f:
                txt = f.read()

        # Tokenize the text
        tokens = tokenizer.encode_plus(txt, max_length = max_len, padding = 'max_length',
                                        truncation = True, return_offsets_mapping = True)
        
        # Save token of the text to the text_tokens array
        text_tokens[id_num] = tokens['input_ids']
        
        # Save attention mask to the text_attention array
        text_attention[id_num] = tokens['attention_mask']

        # Find targets in text and save in target array
        # Loop through offset_mapping to assign each token to a class
        offsets = tokens['offset_mapping']
        offset_index = 0 # token position
        df = dataframe.loc[dataframe.id == n]
        
        for row in df.itertuples():
            a = row[3] # discourse_start position
            b = row[4] # discourse_end position
            if offset_index > max_len - 1: # the index out of max token length
                break
            c = offsets[offset_index][0] # start position of each token
            d = offsets[offset_index][1] # end position of each token
            beginning = True
            while b > c: 
                # Walk through the tokens until the end of the discourse span;
                # a token is labelled only if it starts at or after the span start
                # and ends at or before the span end
                if (c >= a) and (b >= d):
                    k = target_map[row[6]] # discourse_type, in number for indexing
                    if beginning:
                        targets_b[k][id_num][offset_index] = 1
                        beginning = False # 1 B for first only
                    else:
                        targets_i[k][id_num][offset_index] = 1
                offset_index += 1
                if offset_index > max_len - 1:
                    break
                c = offsets[offset_index][0]
                d = offsets[offset_index][1]
     
    # Create target array
    targets = np.zeros((len(text_id), max_len, 15), dtype = 'int32')
    for k in range(7):
        targets[:, :, 2*k] = targets_b[k]
        targets[:, :, 2*k + 1] = targets_i[k]
    # If the token is not assigned to any class:
    targets[:, :, 14] = 1 - np.max(targets, axis = -1)
    
    # Save the arrays to files for later runs
    np.save(f'{name}_targets_{backbone}_{max_len}', targets)
    np.save(f'{name}_tokens_{backbone}_{max_len}', text_tokens)
    np.save(f'{name}_attention_{backbone}_{max_len}', text_attention)
    print('NER tokens Saved')   
    
    return targets, text_tokens, text_attention

In [None]:
def create_augmented_df(augmented_file, text_id = all_id[train_idx]):
    """
    Create data frame if augmentation is used
    
    Arguments:
    augmented_file -- name of the augmented file:
    'train_backtrans_augmented.csv'
    'train_roberta_augmented.csv'
    'train_synonym_augmented.csv'
    text_id -- array or list of essay id
    
    Return:
    train_df -- data frame containing augmented and original information
    """
    augmented_df = pd.read_csv(f'../input/data-augmented/{augmented_file}')
    origin_df = data_df[data_df['id'].isin(text_id)]
    train_df = pd.concat([origin_df, augmented_df], ignore_index = True)
    
    return train_df

There are several training scenarios (normal, all data, augmentation, and K-fold), so we create the arrays for the chosen scenario.

In [None]:
# The type can be one of the following: 'normal', 'all_data', 'augmentation', 'KFold'
train_type = 'normal'

In [None]:
if train_type == 'normal':
    # Validation ID list and dataframe
    val_id = all_id[valid_idx]
    val_df = data_df[data_df['id'].isin(val_id)]

    # Token, attention mask and target array for validation set
    val_targets, val_tokens, val_attention = create_target(val_id, val_df, 'longformer', 'val',
                                                          max_len = max_len, augmented = None,
                                                          NER_target = NER_target)
    # Train id list and dataframe
    train_id = all_id[train_idx]
    train_df = data_df[data_df['id'].isin(train_id)]

    # Token, attention mask and target array for training set
    train_targets, train_tokens, train_attention = create_target(train_id, train_df, 'longformer',
                                                                 name = 'train_origin',
                                                                 max_len = max_len,
                                                                 NER_target = NER_target)
elif train_type == 'all_data':
    # Use all the data
    train_id = data_df.id.unique()
    targets = np.load('../input/ner-target-for-feedback-prize-competition/targets_1024.npy')
    text_tokens = np.load('../input/ner-target-for-feedback-prize-competition/tokens_1024.npy')
    text_attention = np.load('../input/ner-target-for-feedback-prize-competition/attention_1024.npy')
    
elif train_type == 'augmentation':
    # Use augmentation
    train_df = create_augmented_df('train_synonym_augmented.csv', text_id = all_id[train_idx])
    train_id = train_df.id.unique()
    
elif train_type == 'KFold':
    # If KFold cross validation is used
    train_id = all_id[train_CV]
    train_df = data_df[data_df['id'].isin(train_id)]

    # Token, attention mask and target array for CV set
    train_targets, train_tokens, train_attention = create_target(train_id, train_df, 'longformer',
                                                                 name = 'CV',
                                                                 max_len = max_len,
                                                                 NER_target = NER_target)

## Build Model
We will use a Longformer backbone and add our own NER head. We use 15 classes: a B class and an I class for each of the 7 labels, plus an additional class (called the O class) for tokens that do not belong to any of the other 14 classes.
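
The layout of the 15 classes can be sketched as a pair of small helpers. The label order follows the `target_map` used when building the targets (B class at index `2*k`, I class at `2*k + 1`, O class at 14); the function names `class_index` and `class_name` are hypothetical and exist only for this illustration.

```python
# Same order as target_map in create_target
DISCOURSE_TYPES = ['Lead', 'Position', 'Evidence', 'Claim',
                   'Concluding Statement', 'Counterclaim', 'Rebuttal']
OUTSIDE = 14  # index of the O class


def class_index(discourse_type, beginning):
    """Index of the B class (2*k) or I class (2*k + 1) for a discourse type."""
    k = DISCOURSE_TYPES.index(discourse_type)
    return 2 * k if beginning else 2 * k + 1


def class_name(index):
    """Readable name for a class index in the range 0..14."""
    if index == OUTSIDE:
        return 'O'
    prefix = 'B' if index % 2 == 0 else 'I'
    return f'{prefix}-{DISCOURSE_TYPES[index // 2]}'


for index in range(15):
    print(index, class_name(index))
```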

In [None]:
# Number of epochs and batch size
epochs = 8
batch_size = 4

In [None]:
def build_model():
    """
    Function to build and compile model
    """
    tokens = Input(shape = (max_len,), name = 'tokens', dtype = tf.int32)
    attention = Input(shape = (max_len,), name = 'attention', dtype = tf.int32)

    config = AutoConfig.from_pretrained(downloaded_model_path + '/config.json') 
    backbone = TFAutoModel.from_pretrained(downloaded_model_path + '/tf_model.h5', config = config)

    x = backbone(tokens, attention_mask = attention)
    x = Dense(512, activation = 'relu')(x[0])
    x = Dropout(0.2)(x)
    x = Dense(15, activation = 'softmax', dtype = 'float32')(x)

    model = tf.keras.Model(inputs = [tokens, attention], outputs = x)
    model.compile(optimizer = 'adam',
              loss = ['categorical_crossentropy'],
              metrics = ['categorical_accuracy'])
    return model

In [None]:
model = build_model()

In [None]:
# Create saved model checkpoint callback
checkpoint = 'checkpoint-{epoch:02d}'
checkpoint_callback = ModelCheckpoint(filepath = checkpoint,
                                     save_freq = 'epoch',
                                     save_weights_only = True,
                                     verbose = 1)

In [None]:
# Create learning rate decay function
decay_steps = train_tokens.shape[0]//batch_size*epochs
learning_rate_fn = PolynomialDecay(initial_learning_rate = 1e-4,
                                decay_steps = decay_steps,
                                end_learning_rate = 1e-5,
                                power = 1)
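
With `power = 1`, the polynomial schedule is a straight line from the initial to the final learning rate. The sketch below re-implements the decay formula in plain Python to show the values it produces; `polynomial_decay` is a hypothetical stand-in, not the Keras class, and `decay_steps = 1000` is an assumed value chosen for the example.

```python
def polynomial_decay(step, initial_lr=1e-4, end_lr=1e-5, decay_steps=1000, power=1.0):
    """Learning rate at a given step; constant at end_lr after decay_steps."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


for step in (0, 500, 1000, 2000):
    print(step, polynomial_decay(step))
```

The rate starts at `1e-4`, reaches the midpoint `5.5e-5` halfway through, and stays at `1e-5` once `decay_steps` is reached.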

In [None]:
# Compile model for training
model.compile(optimizer = Adam(learning_rate = learning_rate_fn),
              loss = ['categorical_crossentropy'],
              metrics = ['categorical_accuracy'])

In [None]:
# Compile model for inference
model.compile(optimizer = 'adam',
              loss = ['categorical_crossentropy'],
              metrics = ['categorical_accuracy'])

In [None]:
model.summary()

In [None]:
# Training model
history = model.fit(x = [train_tokens, train_attention],
                    y = train_targets,
                    validation_data = ([val_tokens, val_attention], val_targets),
                    callbacks = [checkpoint_callback],
                    epochs = epochs,
                    batch_size = batch_size,
                    verbose = 1)

In [None]:
# Dataframe contains training logs
history_df = pd.DataFrame(history.history)

In [None]:
# Plot training accuracy
history_df.plot(y = ['categorical_accuracy', 'val_categorical_accuracy'], figsize = (12, 7))
plt.xlabel("Epochs")
plt.ylabel('Accuracy')
plt.title('Accuracy vs. epochs');

In [None]:
# Plot training loss
history_df.plot(y = ['loss', 'val_loss'], figsize = (12, 7))
plt.xlabel("Epochs")
plt.ylabel('Loss')
plt.title('Loss vs. epochs');

In [None]:
# Alternatively, load the weights of a previously trained model
model.load_weights('../input/trained-model-for-feedback-prize-competition/long_v12_0.618.h5')

## Build CV model

We use all the training and validation data for cross-validation. For this data, we will use 5-fold cross-validation.

In [None]:
# Create 5 folds
skf = MultilabelStratifiedKFold(n_splits = 5, shuffle = True, random_state = 6)

In [None]:
# Number of epochs and batch size
epochs = 5
batch_size = 4

In [None]:
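# Check whether the data distribution is the same across folds
# (fold indices refer to rows of train_tokens, which are aligned with train_id;
# separate names keep the global train_idx and valid_idx from being overwritten)
for fold, (fold_train_idx, fold_valid_idx) in enumerate(skf.split(train_tokens, train_targets[:, 0, :])):
    print('Fold:', fold + 1, '\n')
    print(data_df[data_df['id'].isin(train_id[fold_valid_idx])]['discourse_type'].value_counts(ascending = True))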

In [None]:
# Train 5 folds
# (separate names keep the global train_idx and valid_idx from being overwritten)
for fold, (fold_train_idx, fold_valid_idx) in enumerate(skf.split(train_tokens, train_targets[:, 0, :])):
    print('Fold:', fold + 1, '\n')
    model = build_model()
    # Decay over the number of training steps in this fold
    decay_steps = len(fold_train_idx)//batch_size*epochs
    learning_rate_fn = PolynomialDecay(initial_learning_rate = 1e-4,
                                    decay_steps = decay_steps,
                                    end_learning_rate = 1e-5,
                                    power = 1)
    model.compile(optimizer = Adam(learning_rate = learning_rate_fn),
                  loss = ['categorical_crossentropy'],
                  metrics = ['categorical_accuracy'])
    model.fit(x = [train_tokens[fold_train_idx], train_attention[fold_train_idx]], 
              y = train_targets[fold_train_idx],
              epochs = epochs,
              batch_size = batch_size,
              verbose = 1)
    model.save_weights(f'long_CV_{fold + 1}.h5')
    print('\n', '*'*50, '\n')               

## Validate Model

We will now make predictions on the validation texts. Our model predicts a label for each token, so we need to convert these predictions into a list of word indices for each label. Note that tokens and words are not the same: a single word may be broken into multiple tokens. Therefore, we first need to create a map from token indices to word indices.
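
The token-to-word mapping can be illustrated on a toy example. This is a minimal sketch, assuming words are whitespace-separated; the helpers `word_starts` and `tokens_to_words` are hypothetical, and the token offsets are a made-up subword split rather than real tokenizer output.

```python
def word_starts(text):
    """Character position of the first character of each whitespace-separated word."""
    starts, in_word = [], False
    for i, ch in enumerate(text):
        if ch.isspace():
            in_word = False
        elif not in_word:
            starts.append(i)
            in_word = True
    return starts


def tokens_to_words(token_offsets, starts):
    """Map each token (start, end) offset to the index of the word containing it."""
    word_map, w = [], 0
    for tok_start, _ in token_offsets:
        # Advance while the token starts at or after the next word's start
        while w + 1 < len(starts) and tok_start >= starts[w + 1]:
            w += 1
        word_map.append(w)
    return word_map


toy_text = 'Unbelievable results'
# Hypothetical subword offsets: the first word is split into three tokens
toy_offsets = [(0, 2), (2, 8), (8, 12), (13, 20)]
toy_word_map = tokens_to_words(toy_offsets, word_starts(toy_text))
print(toy_word_map)
```

The three tokens of the first word all map to word 0, and the last token maps to word 1, so token-level predictions can be collapsed into word indices.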

In [None]:
# Turn from class number to class name
target_map_rev = {0: 'Lead', 1: 'Position', 2: 'Evidence', 3: 'Claim', 4: 'Concluding Statement',
                 5: 'Counterclaim', 6: 'Rebuttal', 7: 'blank'}

In [None]:
classes = np.array(['Lead', 'Position','Counterclaim','Rebuttal','Evidence',
       'Claim', 'Concluding Statement'], dtype = object)

In [None]:
# Make the data frame for validation
to_validate = data_df.loc[data_df['id'].isin(all_id[valid_idx])]

In [None]:
# Function to Compute Competition Metric

def calc_overlap(set_pred, set_gt):
    """
    Calculates if the overlap between prediction and
    ground truth is enough for a potential True positive
    """
    # Length of each and intersection
    try:
        len_gt = len(set_gt)
        len_pred = len(set_pred)
        inter = len(set_gt & set_pred)
        overlap_1 = inter/len_gt
        overlap_2 = inter/len_pred
        return overlap_1 >= 0.5 and overlap_2 >= 0.5
    except (TypeError, ZeroDivisionError):  # at least one of the inputs is NaN
        return False

def score_feedback_comp_micro(pred_df, gt_df, discourse_type):
    """
    Scores the predictions for one discourse type, following
        the Kaggle Student Writing Competition metric
        
    Use the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df.loc[gt_df['discourse_type'] == discourse_type, 
                      ['id', 'predictionstring']].reset_index(drop = True)
    pred_df = pred_df.loc[pred_df['class'] == discourse_type,
                      ['id', 'predictionstring']].reset_index(drop = True)
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    pred_df['predictionstring'] = [set(pred.split(' ')) for pred in pred_df['predictionstring']]
    gt_df['predictionstring'] = [set(pred.split(' ')) for pred in gt_df['predictionstring']]
    
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on = 'id',
                           right_on = 'id',
                           how = 'outer',
                           suffixes = ('_pred','_gt'))
    overlaps = [calc_overlap(*args) for args in zip(joined.predictionstring_pred, 
                                                     joined.predictionstring_gt)]
    
    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    # we don't need to compute the match to compute the score
    TP = joined.loc[overlaps]['gt_id'].nunique()

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    TPandFP = len(pred_df)
    TPandFN = len(gt_df)
    
    #calc microf1
    my_f1_score = 2*TP/(TPandFP + TPandFN)
    return my_f1_score

def score_feedback_comp(pred_df, gt_df, return_class_scores = False):
    class_scores = {}
    for discourse_type in gt_df.discourse_type.unique():
        class_score = score_feedback_comp_micro(pred_df, gt_df, discourse_type)
        class_scores[discourse_type] = class_score
    f1 = np.mean([v for v in class_scores.values()])
    if return_class_scores:
        return f1, class_scores
    return f1
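
The matching rule and the F1 formula used above can be checked on a tiny example. The sketch below restates them with plain Python sets; `is_match` and `f1_from_counts` are hypothetical names, and the word-index sets are made up.

```python
def is_match(pred_words, gt_words):
    """True if the overlap covers at least half of the ground truth and half of the prediction."""
    inter = len(pred_words & gt_words)
    return inter / len(gt_words) >= 0.5 and inter / len(pred_words) >= 0.5


def f1_from_counts(tp, n_pred, n_gt):
    """F1 = 2*TP / ((TP + FP) + (TP + FN)), with n_pred = TP + FP and n_gt = TP + FN."""
    return 2 * tp / (n_pred + n_gt)


gt = {1, 2, 3, 4}
pred_good = {3, 4, 5, 6}    # shares 2 of 4 words with gt on both sides
pred_bad = {4, 5, 6, 7, 8}  # shares only 1 word with gt

matches = [is_match(p, gt) for p in (pred_good, pred_bad)]
toy_f1 = f1_from_counts(tp=sum(matches), n_pred=2, n_gt=1)
print(matches, round(toy_f1, 3))
```

Here one of the two predictions matches the single ground-truth span, giving one true positive and one false positive.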

In [None]:
# Function to validate model
def check_ckp(model):
    """
    Validate model and print out the result based on validation data set
    """
    pred = model.predict([text_tokens[valid_idx], text_attention[valid_idx]],
                         batch_size = 16, verbose = 1)
    oof_preds = np.argmax(pred, axis = -1)
    oof = get_preds(dataset = 'train', text_ids = val_id, augmented = None,
                    preds = oof_preds, thredshold = thredshold)
    f1s = []
    classes= np.array(['Lead', 'Position','Counterclaim','Rebuttal','Evidence',
                       'Claim', 'Concluding Statement'], dtype = object)
    print('Validation F1_score:')
    for c in classes:
        pred_df = oof.loc[oof['class'] == c].copy()
        gt_df = to_validate.loc[to_validate['discourse_type'] == c].copy()
        f1 = score_feedback_comp(pred_df, gt_df)
        print(c + ':', round(f1, 3))
        f1s.append(f1)
    print('Overall',round(np.mean(f1s), 3))
    print('\n')

In [None]:
def get_preds_old(dataset, text_ids, preds, thredshold = 4, augmented = None):
    
    """
    Make predictions and create the result data frame
    
    Arguments:
    dataset -- name of the folder containing the text files: 'train' or 'test'
    text_ids -- array or list of essay ids
    preds -- the array containing the predicted classes
    augmented -- name of augmented text folder: 'augmented_back', 'augmented_roberta', 'augmented_syn'
    thredshold -- the minimum number of words required for a discourse span of any type
    
    Return:
    df -- result data frame
    """
    all_predictions = []
    target_map_rev = {0: 'Lead', 1: 'Position', 2: 'Evidence', 3: 'Claim', 4: 'Concluding Statement',
                      5: 'Counterclaim', 6: 'Rebuttal', 7: 'blank'}
    
    for id_num in range(len(preds)):

        n = text_ids[id_num] # Text id
    
        # Read the text, then tokenize it
        try:
            with open(f'../input/feedback-prize-2021/{dataset}/{n}.txt', 'r') as f:
                txt = f.read()
        except FileNotFoundError:
            with open(f'../input/data-augmented/{augmented}/{n}.txt', 'r') as f:
                txt = f.read()

        tokens = tokenizer.encode_plus(txt, max_length = max_len, padding = 'max_length',
                                   truncation = True, return_offsets_mapping = True)
        off = tokens['offset_mapping']
    
        # Get word positions
        word_pos = []
        blank = True
        # A word starts at the first non-space character after whitespace; only that position is recorded
        for i in range(len(txt)):
            if (not txt[i].isspace()) and blank:
                word_pos.append(i)
                blank = False
            # Whitespace means the previous word has ended
            elif txt[i].isspace():
                blank = True
        word_pos.append(1e6) # Sentinel marking the end of the text
            
        # Map tokens to words
        word_map = -1 * np.ones(max_len, dtype = 'int32')
        w_i = 0
        for i in range(len(off)):
            if off[i][1] == 0: # Skip special and padding tokens, whose end offset is 0
                continue
            while off[i][0] >= word_pos[w_i+1]: # Advance while the token starts at or after the next word's start
                w_i += 1
            word_map[i] = int(w_i)
        
        # Convert token predictions into word labels
        # 0: lead_b, 1: lead_i
        # 2: position_b, 3: position_i
        # 4: evidence_b, 5: evidence_i
        # 6: claim_b, 7: claim_i
        # 8: conclusion_b, 9: conclusion_i
        # 10: counterclaim_b, 11: counterclaim_i
        # 12: rebuttal_b, 13: rebuttal_i
        # 14: nothing (o)

        pred = preds[id_num]/2
    
        i = 0
        while i < max_len:
            prediction = []
            start = pred[i]
            if start in [0,1,2,3,4,5,6,7]: # Start a span on a 'B' label (0-6); 7 is 'blank' (O)
                prediction.append(word_map[i])
                i += 1
                if i >= max_len:
                    break
                while pred[i] == start + 0.5: # When the class is 'I'
                    if not word_map[i] in prediction:
                        prediction.append(word_map[i])
                    i += 1
                    if i >= max_len:
                        break
            else:
                i += 1
            prediction = [x for x in prediction if x!=-1]
            
            # Only accept the span if it has more words than the threshold
            if len(prediction) > thredshold:
                all_predictions.append((n, target_map_rev[int(start)], 
                                ' '.join([str(x) for x in prediction])))
                
    # Make data frame (passing the column names here also works when there are no predictions)
    df = pd.DataFrame(all_predictions, columns = ['id','class','predictionstring'])
    
    return df
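
The token-to-word mapping inside `get_preds_old` is easier to follow on a toy input. The sketch below reproduces the two loops in plain Python, assuming hand-written character offsets in place of `tokens['offset_mapping']`. The helper names `word_starts` and `map_tokens_to_words` and the example offsets are hypothetical, not part of the notebook.

```python
def word_starts(txt):
    """Return the character index where each whitespace-separated word begins."""
    starts = []
    in_blank = True
    for i, ch in enumerate(txt):
        if ch.isspace():
            in_blank = True
        elif in_blank:
            starts.append(i)
            in_blank = False
    return starts


def map_tokens_to_words(offsets, starts):
    """Map each token, given as a (start, end) character offset, to a word index.

    Tokens whose end offset is 0 (special or padding tokens) are mapped to -1.
    """
    bounds = starts + [float('inf')]  # sentinel so the last word never overflows
    word_map = []
    w_i = 0
    for tok_start, tok_end in offsets:
        if tok_end == 0:
            word_map.append(-1)
            continue
        # Advance while the token starts at or after the next word's start
        while tok_start >= bounds[w_i + 1]:
            w_i += 1
        word_map.append(w_i)
    return word_map


txt = "Phones  are distracting."
# Hypothetical offsets for: <s>, "Phones", "are", "distract", "ing", ".", </s>, <pad>
offsets = [(0, 0), (0, 6), (8, 11), (12, 20), (20, 23), (23, 24), (0, 0), (0, 0)]

starts = word_starts(txt)
word_map = map_tokens_to_words(offsets, starts)
print(starts)    # [0, 8, 12]
print(word_map)  # [-1, 0, 1, 2, 2, 2, -1, -1]
```

The three sub-word tokens of "distracting." all map to word 2, which is why the decoding loop skips word indices it has already collected.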

In [None]:
def get_preds(dataset, text_ids, preds, thredshold, augmented = None):
    
    """
    Make predictions, repair isolated label errors, and build the result data frame.
    
    Arguments:
    dataset -- name of the folder containing the text files: 'train' or 'test'
    text_ids -- array or list of essay ids
    preds -- array containing the predicted class of each token
    augmented -- name of the augmented text folder: 'augmented_back', 'augmented_roberta', 'augmented_syn'
    thredshold -- dictionary or pandas Series indexed by discourse type; a span is kept only if it has more words than the value for its type
    
    Return:
    df -- result data frame
    """
    all_predictions = []
    target_map_rev = {0: 'Lead', 1: 'Position', 2: 'Evidence', 3: 'Claim', 4: 'Concluding Statement',
                      5: 'Counterclaim', 6: 'Rebuttal', 7: 'blank'}

    for id_num in range(len(preds)):

        n = text_ids[id_num] # Text id
    
        # Read the essay, falling back to the augmented folder, then tokenize it
        try:
            with open(f'../input/feedback-prize-2021/{dataset}/{n}.txt', 'r') as f:
                txt = f.read()
        except FileNotFoundError:
            with open(f'../input/data-augmented/{augmented}/{n}.txt', 'r') as f:
                txt = f.read()

        tokens = tokenizer.encode_plus(txt, max_length = max_len, padding = 'max_length',
                                   truncation = True, return_offsets_mapping = True)
        off = tokens['offset_mapping']
    
        # Get word positions
        word_pos = []
        blank = True
        # A word starts at the first non-space character after whitespace; only that position is recorded
        for i in range(len(txt)):
            if (not txt[i].isspace()) and blank:
                word_pos.append(i)
                blank = False
            # Whitespace means the previous word has ended
            elif txt[i].isspace():
                blank = True
        word_pos.append(1e6) # Sentinel marking the end of the text
            
        # Map tokens to words
        word_map = -1 * np.ones(max_len, dtype = 'int32')
        w_i = 0
        for i in range(len(off)):
            if off[i][1] == 0: # Skip special and padding tokens, whose end offset is 0
                continue
            while off[i][0] >= word_pos[w_i+1]: # Advance while the token starts at or after the next word's start
                w_i += 1
            word_map[i] = int(w_i)
        
        # Convert token predictions into word labels
        # 0: lead_b, 1: lead_i
        # 2: position_b, 3: position_i
        # 4: evidence_b, 5: evidence_i
        # 6: claim_b, 7: claim_i
        # 8: conclusion_b, 9: conclusion_i
        # 10: counterclaim_b, 11: counterclaim_i
        # 12: rebuttal_b, 13: rebuttal_i
        # 14: nothing (o)

        pred = preds[id_num]/2
        
        # If we see tokens I-X, I-Y, I-X -> change I-Y to I-X
        # (after halving, I labels are 0.5, 1.5, ..., 6.5, so test with % 1 rather than % 2)
        for j in range(1, len(pred) - 1):
            if pred[j - 1] == pred[j + 1] and pred[j - 1] % 1 == 0.5 and pred[j] != pred[j - 1]:
                pred[j] = pred[j - 1]
            
        # B-X, ? (not B), I-X -> change ? to I-X
        for j in range(1, len(pred) - 1):
            prev_is_b = pred[j - 1] in range(7)
            curr_is_b = pred[j] in range(7)
            if prev_is_b and pred[j + 1] == pred[j - 1] + 0.5 and pred[j] != pred[j + 1] and not curr_is_b:
                pred[j] = pred[j + 1]
        
        # If we see tokens I-X, O, I-X -> change O to I-X
        for j in range(1, len(pred) - 1):
            if pred[j - 1] % 1 == 0.5 and pred[j - 1] == pred[j + 1] and pred[j] == 7:
                pred[j] = pred[j - 1]
        
        i = 0
        while i < max_len:
            prediction = []
            start = pred[i]
            # Only append if the class start with 'B'
            if start in range(0, 7): 
                prediction.append(word_map[i])
                i += 1
                if i >= max_len:
                    break
                # When the class is 'I'
                while pred[i] == start + 0.5: 
                    if not word_map[i] in prediction:
                        prediction.append(word_map[i])
                    i += 1
                    if i >= max_len:
                        break
            else:
                i += 1
            
            prediction = [x for x in prediction if x != -1]
            
            
            # Skip blank classified word
            if start == 7:
                continue
            
            # Only accept the span if it has more words than the threshold for its type
            discourse_type = target_map_rev[int(start)]
            if len(prediction) > thredshold[discourse_type]:
                all_predictions.append((n, discourse_type, ' '.join([str(x) for x in prediction])))
                
    # Make data frame (passing the column names here also works when there are no predictions)
    df = pd.DataFrame(all_predictions, columns = ['id', 'class', 'predictionstring'])
    
    return df
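
To make the three repair passes and the span decoding in `get_preds` concrete, here is a minimal pure-Python sketch on a toy label sequence. It assumes the same halved encoding as the function (`k` for B, `k + 0.5` for I, `7` for O) and detects an I label with `% 1 == 0.5`. The names `is_b`, `is_i`, `smooth_labels`, and `decode_spans`, and the toy inputs, are hypothetical. The first pass is also written more narrowly than in the notebook, touching only a middle token that is itself an I label.

```python
O = 7  # halved code of the 'nothing' label


def is_b(x):
    """True for a B label (whole number below 7)."""
    return x % 1 == 0 and x < O


def is_i(x):
    """True for an I label (half-integer)."""
    return x % 1 == 0.5


def smooth_labels(pred):
    """Apply three single-token repair passes to a halved BIO label list."""
    pred = list(pred)
    inner = range(1, len(pred) - 1)
    # Pass 1: I-X, I-Y, I-X -> I-X, I-X, I-X
    for j in inner:
        if pred[j - 1] == pred[j + 1] and is_i(pred[j - 1]) and is_i(pred[j]) and pred[j] != pred[j - 1]:
            pred[j] = pred[j - 1]
    # Pass 2: B-X, ? (not a B label), I-X -> B-X, I-X, I-X
    for j in inner:
        if is_b(pred[j - 1]) and pred[j + 1] == pred[j - 1] + 0.5 and pred[j] != pred[j + 1] and not is_b(pred[j]):
            pred[j] = pred[j + 1]
    # Pass 3: I-X, O, I-X -> I-X, I-X, I-X
    for j in inner:
        if is_i(pred[j - 1]) and pred[j - 1] == pred[j + 1] and pred[j] == O:
            pred[j] = pred[j - 1]
    return pred


def decode_spans(pred, word_map):
    """Collect (class index, word indices) for every span that starts with a B label."""
    spans, i, n = [], 0, len(pred)
    while i < n:
        start = pred[i]
        if not is_b(start):
            i += 1
            continue
        words = [word_map[i]]
        i += 1
        while i < n and pred[i] == start + 0.5:
            if word_map[i] not in words:
                words.append(word_map[i])
            i += 1
        spans.append((int(start), [w for w in words if w != -1]))
    return spans


# Toy labels: 1 = B-Position, 3 = B-Claim, 2 = B-Evidence
raw = [7, 1, 1.5, 7, 1.5, 3, 0.5, 3.5, 2, 2.5, 1.5, 2.5, 7]
word_map = [-1, 0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 8, -1]

smoothed = smooth_labels(raw)
spans = decode_spans(smoothed, word_map)
print(smoothed)
print(spans)
```

On this input each pass repairs exactly one token, and the decoder then returns three contiguous spans instead of several fragments.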

In [None]:
# Check trained checkpoint to find best model
for i in range(1, 9):
    del model
    model = build_model()
    model.load_weights(f'checkpoint-0{i}')
    print(f'Model {i}')
    check_ckp(model)

In [None]:
# Load and save best model
model = build_model()
model.load_weights('checkpoint-07')
model.save_weights('v_15_0.617.h5')

In [None]:
# For validation set
pred = model.predict([val_tokens, val_attention], 
                  batch_size = 16, verbose = 1)
print('Validation predictions shape:', pred.shape)
oof_preds = np.argmax(pred, axis = -1)

In [None]:
# For training set
pred = model.predict([train_tokens, train_attention],
                     batch_size = 16, verbose = 1)
print('Training predictions shape:', pred.shape)
oof_preds = np.argmax(pred, axis = -1)

In [None]:
# Test soft thresholds: per-class quantiles of the discourse length in words
for k in np.arange(0.005, 0.055, 0.005 ):
    print('k:', k)
    thredshold = data_df.groupby('discourse_type')['discourse_num_word'].quantile(k)
    oof = get_preds(dataset = 'train', text_ids = val_id, augmented = None, preds = oof_preds, thredshold = thredshold)
    f1s = []
    print('Validation F1_score:')
    for c in classes:
        pred_df = oof.loc[oof['class'] == c].copy()
        gt_df = to_validate.loc[to_validate['discourse_type'] == c].copy()
        f1 = score_feedback_comp(pred_df, gt_df)
        print(c + ':', round(f1, 3))
        f1s.append(f1)
    print()
    print('Overall',round(np.mean(f1s), 3))
    print('\n')

In [None]:
# Test hard thresholds: one minimum length shared by all classes
for ts in range(10):
    oof = get_preds_old(dataset = 'train', text_ids = val_id, augmented = None, preds = oof_preds, thredshold = ts)
    f1s = []
    print('Threshold:', ts)
    print('Validation F1_score:')
    for c in classes:
        pred_df = oof.loc[oof['class'] == c].copy()
        gt_df = to_validate.loc[to_validate['discourse_type'] == c].copy()
        f1 = score_feedback_comp(pred_df, gt_df)
        print(c + ':', round(f1, 3))
        f1s.append(f1)
    print()
    print('Overall',round(np.mean(f1s), 3))
    print('\n')

In [None]:
data_df.groupby('discourse_type')['discourse_num_word'].quantile(0.005)

In [None]:
# Optimum per-class thresholds
thredshold = {'Lead': 6, 'Position': 4, 'Evidence': 9, 'Claim': 1, 'Concluding Statement': 8, 'Counterclaim': 5, 'Rebuttal': 4}
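
Note that `get_preds` compares span lengths with a strict `>`, so each value above is the longest span that is still discarded, not the shortest one that is kept. The short sketch below illustrates this on made-up candidate spans. The helper `keep_span` and the candidates are hypothetical, while the threshold values are copied from the cell above.

```python
thresholds = {'Lead': 6, 'Position': 4, 'Evidence': 9, 'Claim': 1,
              'Concluding Statement': 8, 'Counterclaim': 5, 'Rebuttal': 4}


def keep_span(discourse_type, word_ids, thresholds):
    """Keep a span only if it has strictly more words than its class threshold."""
    return len(word_ids) > thresholds[discourse_type]


# Hypothetical candidate spans: (discourse type, word indices)
candidates = [
    ('Claim', [10]),                 # 1 word, not > 1: dropped
    ('Claim', [11, 12]),             # 2 words: kept
    ('Position', [0, 1, 2, 3]),      # 4 words, not > 4: dropped
    ('Position', [0, 1, 2, 3, 4]),   # 5 words: kept
]

kept = [(t, w) for t, w in candidates if keep_span(t, w, thresholds)]
print(kept)
```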

## Compute Validation Metric

In [None]:
oof_old = get_preds_old(dataset = 'train', text_ids = val_id, augmented = None, preds = oof_preds, thredshold = 4)
oof_old.head()

In [None]:
oof_new = get_preds(dataset = 'train', text_ids = val_id, augmented = None, preds = oof_preds, thredshold = thredshold)
oof_new.head()

In [None]:
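# For training set (oof_preds must come from the training-set prediction cell)
oof = get_preds(dataset = 'train', text_ids = train_id, augmented = None, preds = oof_preds, thredshold = thredshold)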
oof.head()

In [None]:
# When using data augmentation (oof_preds must match the ids passed in text_ids)
oof = get_preds(dataset = 'train', text_ids = val_id, augmented = 'augmented_syn', preds = oof_preds, thredshold = thredshold)
oof.head()

Evaluate the old inference function.

In [None]:
f1s = []

print('Validation F1_score:')
for c in classes:
    pred_df = oof_old.loc[oof_old['class'] == c].copy()
    gt_df = to_validate.loc[to_validate['discourse_type'] == c].copy()
    f1 = score_feedback_comp(pred_df, gt_df)
    print(c + ':', round(f1, 3))
    f1s.append(f1)
print()
print('Overall',round(np.mean(f1s), 3))
print('\n')

Evaluate after post-processing.

In [None]:
f1s = []
classes = np.array(['Lead', 'Position','Counterclaim','Rebuttal','Evidence',
       'Claim', 'Concluding Statement'], dtype = object)

print('Validation F1_score:')
for c in classes:
    pred_df = oof_new.loc[oof_new['class'] == c].copy()
    gt_df = to_validate.loc[to_validate['discourse_type'] == c].copy()
    f1 = score_feedback_comp(pred_df, gt_df)
    print(c + ':', round(f1, 3))
    f1s.append(f1)
print()
print('Overall',round(np.mean(f1s), 3))
print('\n')

## Visualize the prediction

In [None]:
def visualize_prediction(ids, df, path):
    """
    Visualize the prediction
    
    Arguments:
    ids -- ID of the essay
    df -- predicted data frame
    path -- folder of the original text file
    """
    with open(f'{path}/{ids}.txt', 'r') as file:
        data = file.read()
    ents = []
    example = ids
    curr_df = df[df["id"] == example]
    text = " ".join(data.split())
    splitted_text = text.split()
    for i, row in curr_df.iterrows():
        predictionstring = row['predictionstring']
        
        predictionstring = predictionstring.split()
        w_start = int(predictionstring[0])
        
        w_end = int(predictionstring[-1])
        ents.append({'start': len(" ".join(splitted_text[:w_start])), 
                    'end': len(" ".join(splitted_text[:w_end + 1])), 
                    'label': row['class']})
    ents = sorted(ents, key = lambda i: i['start'])
    
    doc2 = {"text": text,
        "ents": ents,
        "title": example}

    options = {"ents": ['Lead', 'Position', 'Evidence', 'Claim', 'Concluding Statement', 'Counterclaim', 'Rebuttal'], "colors": colors}
    displacy.render(doc2, style = "ent", options = options, manual = True, jupyter = True)

Visualize the original text.

In [None]:
visualize_text('7EA804FD13A8', data_df)

Visualize the prediction before post-processing.

In [None]:
visualize_prediction(ids = '7EA804FD13A8', df = oof_old, path = '../input/feedback-prize-2021/train')

Visualize the prediction after post-processing.

In [None]:
visualize_prediction(ids = '7EA804FD13A8', df = oof_new, path = '../input/feedback-prize-2021/train')