This notebook contains a simple initial data analysis for the [Feedback Prize - Evaluating Student Writing](https://www.kaggle.com/c/feedback-prize-2021) competition.

In [None]:
import pandas as pd
import os

# Data overview

In [None]:
text_files = os.listdir('/kaggle/input/feedback-prize-2021/train')

In [None]:
len(text_files)

In [None]:
train_df = pd.read_csv('/kaggle/input/feedback-prize-2021/train.csv')

In [None]:
train_df.info()

In [None]:
train_df.describe()

The training data consists of about 16k texts and 144k annotation rows (roughly 9 discourse elements per text). The annotations are stored in the `train.csv` file. Let's look at its structure.

In [None]:
train_df.head(15)

Let's look at the first text and its annotation.

In [None]:
def print_text(text_id):
    with open(f'/kaggle/input/feedback-prize-2021/train/{text_id}.txt') as f:
        lines = f.readlines()
    print(''.join(lines))
    
print_text('423A1CA112E2')

`train.csv` contains 10 annotation rows for this file. Each row corresponds to one discourse element and contains the following information (source: https://www.kaggle.com/c/feedback-prize-2021/data):

* id - ID code for essay response
* discourse_id - ID code for discourse element
* discourse_start - character position where discourse element begins in the essay response
* discourse_end - character position where discourse element ends in the essay response
* discourse_text - text of discourse element
* discourse_type - classification of discourse element
* discourse_type_num - enumerated class label of discourse element
* predictionstring - the word indices of the training sample, as required for predictions

As you can see, we have both character-level and word-level annotations. We can use either of them in a model, but we need the word-level one for submission.

Side note: one of the annotated discourse elements starts with ',' at the very beginning. This may indicate data quality issues (but I am not sure). We need to be very careful with data cleaning and the words <-> tokens transformation so that results stay aligned with the annotation.
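To make the relation between the two annotation levels concrete, here is a minimal sketch that converts a character span into word indices. It rests on two assumptions that are not confirmed in this notebook: word indices are positions in `text.split()`, and the end offset is exclusive. The helper `span_to_word_indices` and the sample text are hypothetical and only serve as an illustration.

```python
def span_to_word_indices(text, start, end):
    """Return indices of whitespace-separated words overlapping text[start:end].

    Assumption: word indices follow text.split() and `end` is exclusive.
    """
    indices = []
    pos = 0
    for idx, word in enumerate(text.split()):
        word_start = text.index(word, pos)  # locate this word in the raw text
        word_end = word_start + len(word)
        pos = word_end
        # keep the word if it overlaps the character span
        if word_start < end and word_end > start:
            indices.append(idx)
    return indices


# Hypothetical sample text (not taken from the competition data)
text = "Phones are useful.  However, they distract drivers."
start = text.index("However,")
end = len(text)

idxs = span_to_word_indices(text, start, end)
predictionstring = " ".join(map(str, idxs))
print(predictionstring)
```

Running the same conversion over real rows and comparing the result with the `predictionstring` column would be one way to detect misaligned annotations.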

We can make the annotations clearer by printing each text with a different color for each discourse type.

In [None]:
from termcolor import colored
def color_text(text_id, train_df, color_scheme = None):
    if not color_scheme:
        color_scheme = {
        'Lead': 'green',
        'Position': 'red',
        'Claim': 'blue',
        'Counterclaim': 'magenta',
        'Rebuttal': 'yellow',
        'Evidence': 'cyan',
        'Concluding Statement': 'grey'
    } 
    with open(f'/kaggle/input/feedback-prize-2021/train/{text_id}.txt') as f:
        lines = f.readlines()
    text = ''.join(lines)
    
    annot_df = train_df[train_df.id == text_id]
    blocks = [(int(row['discourse_start']),int(row['discourse_end']), color_scheme[row['discourse_type']]) for k, row in annot_df.iterrows()]
    blocks.sort()
    i = 0
    last_symbol = -1
    while i < len(blocks):
        if blocks[i][0] > last_symbol + 1:
            blocks.insert(i, (last_symbol+1, blocks[i][0] - 1, None))
        last_symbol = blocks[i][1]
        i += 1
    # add a trailing uncolored block only if some text remains after the last annotation
    if last_symbol < len(text) - 1:
        blocks.append((last_symbol+1, len(text) - 1, None))

    colored_text = ''.join([colored(text[x[0]:x[1]+1], x[2]) for x in blocks])
    return colored_text
    
print(color_text('423A1CA112E2', train_df))

In [None]:
print(color_text('A8445CABFECE', train_df))

In [None]:
print(color_text('6B4F7A0165B9', train_df))

# Texts overview

In [None]:
# let's load all texts

texts = []
for file in text_files:
    with open(f'/kaggle/input/feedback-prize-2021/train/{file}') as f:
        lines = f.readlines()
    texts.append({'id': file[:-4], 'text': ''.join(lines)})
texts_df = pd.DataFrame(texts)

In [None]:
texts_df.head()

In [None]:
texts_df['len'] = texts_df['text'].apply(len)

In [None]:
texts_df['len'].hist(bins = 50, figsize = (20,10))
print(texts_df['len'].min(), texts_df['len'].max())

Text lengths range from 691 to 18k characters, with most texts around 1-3k characters.

In [None]:
texts_df['words_num'] = texts_df['text'].apply(lambda x: len(x.split(' ')))

In [None]:
texts_df['words_num'].hist(bins = 100, figsize = (20,10))
print(texts_df['words_num'].min(), texts_df['words_num'].max())

The number of words in each text is around 500-1000 on average, with some outliers.
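Note that the word counts depend on how the text is split. The short sketch below, on a made-up sample string, shows that `split(' ')` keeps newline-joined words together and produces empty strings for repeated spaces, while `split()` splits on any whitespace. Which variant matches the competition's word indices is an assumption to verify against the data.

```python
# Hypothetical sample with a double space and two newlines
sample = "One  two three\nfour\nfive"

by_space = sample.split(' ')   # splits on single spaces only
by_whitespace = sample.split() # splits on any run of whitespace

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```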

# Target analysis

In [None]:
train_df['discourse_type'].value_counts()

In [None]:
train_df['discourse_words_num'] = train_df['discourse_text'].apply(lambda x: len(x.split(' ')))

In [None]:
avg_len_dict = {}
for d in train_df['discourse_type'].unique():
    temp_df = train_df[train_df['discourse_type'] == d]
    print(d, temp_df['discourse_words_num'].min(), temp_df['discourse_words_num'].mean(), temp_df['discourse_words_num'].max())
    avg_len_dict[d] = int(temp_df['discourse_words_num'].mean())

The target classes are unequally distributed: claims and evidence are the most frequent ones.

In [None]:
train_df['first_word'] = train_df['discourse_text'].apply(lambda x: x.split(' ')[0].lower())

In [None]:
top_first_words = {}
for d in train_df['discourse_type'].unique():
    temp_df = train_df[train_df['discourse_type'] == d]
    print(d)
    display(temp_df['first_word'].value_counts().head(10))
    top_first_words[d] = temp_df['first_word'].value_counts().head(10).keys()

If we ignore stop words, we can see that some first words are very popular for particular target types.
Let's use this data for our baseline.

# Baseline

I am introducing a straightforward rule-based baseline (it can most probably be beaten by a plain transformer fit/predict):

* We compute the average length of a discourse element for each target class.
* We create lists of the most popular first words for each target class.
* If we see one of these words in the text, we predict that the following avg_len words form a discourse element of the corresponding class.


In [None]:
stop_words = {'the', 'i', 'in', '', 'it', 'this', 'if', 'they', 'to'}

for k, v in top_first_words.items():
    top_first_words[k] = set([x for x in v if x not in stop_words])
top_first_words

Let's just manually remove duplicates and clean it up a bit (giving priority to the more popular classes):

In [None]:
top_first_words = {
    'Claim': {'another', 'students'},
    'Evidence': set(),  # no trigger words left after cleanup ({} would be an empty dict, not a set)
    'Position': {'there'},
    'Concluding Statement': {'so'},
    'Lead': {'driverless', 'imagine'},
    'Counterclaim': {'although', 'but', 'however,'},
    'Rebuttal': {'but,', 'while'}
}

In [None]:
avg_len_dict

In [None]:
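def predict(text_id, path = '/kaggle/input/feedback-prize-2021/train/', top_first_words=top_first_words, avg_len_dict=avg_len_dict):
    with open(f'{path}{text_id}.txt') as f:
        text = f.read()
    # Split on any whitespace (assumption: the competition indexes the words produced by str.split())
    words = text.split()
    preds = []
    for i, word in enumerate(words):
        for k, v in top_first_words.items():
            # The trigger words were collected in lowercase, so compare in lowercase
            if word.lower() in v:
                # Clip the span so that indices never run past the end of the text
                end = min(i + avg_len_dict[k], len(words))
                preds.append({'id': text_id, 'class': k, 'predictionstring': ' '.join(str(x) for x in range(i, end))})
    return preds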

In [None]:
predict('423A1CA112E2')

We obviously have too many false positives, but it is just a dummy baseline, so let's make some predictions using it.
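To put a number on "too many false positives", one could count predictions that do not match any annotation. The sketch below assumes a matching rule based on word-index overlap: a prediction matches a ground-truth element of the same class when the shared indices cover at least half of each of them. This is my understanding of the competition metric and should be treated as an assumption, not as the official scorer. The helper names and the toy data are hypothetical.

```python
def is_match(pred_string, true_string, threshold=0.5):
    """Assumed rule: overlap must cover >= threshold of both index sets."""
    pred = set(pred_string.split())
    true = set(true_string.split())
    if not pred or not true:
        return False
    inter = len(pred & true)
    return inter / len(true) >= threshold and inter / len(pred) >= threshold


def count_false_positives(preds, truths):
    """preds and truths are lists of (class, predictionstring) for one text."""
    unmatched = list(truths)
    false_positives = 0
    for cls, pred_string in preds:
        hit = next((t for t in unmatched
                    if t[0] == cls and is_match(pred_string, t[1])), None)
        if hit is None:
            false_positives += 1   # no free ground-truth element matches
        else:
            unmatched.remove(hit)  # each ground-truth element is matched once
    return false_positives


# Toy data, not taken from the competition
truths = [("Claim", "0 1 2 3"), ("Evidence", "4 5 6 7 8 9")]
preds = [("Claim", "0 1 2"),
         ("Claim", "2 3 4 5 6 7"),
         ("Evidence", "8 9 10 11 12 13")]

fp = count_false_positives(preds, truths)
print(fp)
```

Applying such a check to the training texts, where annotations are available, would give a rough estimate of the baseline's precision before submitting.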

In [None]:
test_files = os.listdir('/kaggle/input/feedback-prize-2021/test')

In [None]:
sub = []
for file in test_files:
    sub += predict(file[:-4], '/kaggle/input/feedback-prize-2021/test/')

In [None]:
sub_df = pd.DataFrame(sub)
sub_df

In [None]:
sub_df.to_csv('submission.csv', index = False)