# Feedback Prize EDA

Hey there NLP fans, I didn't expect another NLP competition right now, how about you? Let's see what this one is about :) We'll start by looking at the data and then go through the information from the organizers; this order works best for me :)

In [None]:
import pandas as pd
import os
from pathlib import Path
import spacy
from spacy import displacy
from pylab import cm, matplotlib

### Training Data

Looks like we have a bunch of text files and a separate `train.csv` with labels. Let's start with the csv. Each row references a text file, and there are multiple rows per text file, each with a span marking a specific **discourse type**. Let's also list the unique discourse types below.

In [None]:
train = pd.read_csv('../input/feedback-prize-2021/train.csv')
train.head()

In [None]:
train.discourse_type.unique().tolist()

### Visualize with `displacy`

I think it will work best if we can overlay the spans from the csv file onto the text files. Spacy has a great visualizer for this: `displacy`. Let's use it to see what a single example looks like!

In [None]:
path = Path('../input/feedback-prize-2021/train')

colors = {
            'Lead': '#8000ff',
            'Position': '#2b7ff6',
            'Evidence': '#2adddd',
            'Claim': '#80ffb4',
            'Concluding Statement': '#d4dd80',
            'Counterclaim': '#ff8042',
            'Rebuttal': '#ff0000'
         }

def visualize(example):
    ents = []
    for i, row in train[train['id'] == example].iterrows():
        ents.append({
                        'start': int(row['discourse_start']), 
                         'end': int(row['discourse_end']), 
                         'label': row['discourse_type']
                    })

    with open(path/f'{example}.txt', 'r') as file: data = file.read()

    doc2 = {
        "text": data,
        "ents": ents,
        "title": example
    }

    options = {"ents": train.discourse_type.unique().tolist(), "colors": colors}
    displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True)


In [None]:
examples = train['id'].sample(n=5, random_state=42).values.tolist()

for ex in examples:
    visualize(ex)
    print('\n')

### Insights

Ok, this looks interesting!
- For some examples, the entire text is densely split into spans of different categories. In others, the annotators leave some words out and the splits look very subjective. That's an indicator that the annotations may be noisy.
- Order seems to be important: start with the lead, mix claims and evidence, finish with concluding statement. We may need to incorporate this into our models. 
- There may be 2 spans of the same class next to each other - it will be important to separate them!
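
One common way to keep two neighbouring spans of the same class apart is a BIO tagging scheme: the first word of each span gets a `B-` tag and the remaining words get `I-` tags, so a new `B-` always marks a boundary. Here is a minimal sketch, assuming spans are already expressed as lists of word indices; `spans_to_bio` is a hypothetical helper for illustration, not something provided by the competition.

```python
def spans_to_bio(n_words, spans):
    """Assign a BIO tag to every word index.

    spans: list of (label, word_indices), with word_indices sorted ascending.
    Words not covered by any span keep the 'O' (outside) tag.
    """
    tags = ['O'] * n_words
    for label, word_indices in spans:
        for k, idx in enumerate(word_indices):
            # The first word of a span opens it, the rest continue it
            tags[idx] = f'B-{label}' if k == 0 else f'I-{label}'
    return tags


# Two adjacent 'Claim' spans stay distinguishable thanks to the second 'B-Claim'
tags = spans_to_bio(7, [('Claim', [0, 1, 2]), ('Claim', [3, 4])])
print(tags)
```

With plain per-word class labels the two claims above would merge into one span of five words; the `B-` tag at index 3 is what keeps them separate.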

### Text length

Another important data point is the length of texts - would they fit into a model like `roberta`? Let's check!

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

In [None]:
texts = []

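# Sample 500 rows, then de-duplicate essay ids (so we may end up with fewer than 500 essays)
ids = train.sample(n=500, random_state=42)['id'].unique().tolist()
print(len(ids))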

for example in ids:
    with open(path/f'{example}.txt', 'r') as file: data = file.read()
    texts.append({
        'text': data,
        'n_tokens': len(tokenizer(data)['input_ids'])
    })
     
df = pd.DataFrame(texts)

# Fraction of sampled texts longer than 512 tokens
print((df.n_tokens > 512).mean())

df.n_tokens.hist();

**Important:** More than half of the sampled texts exceed the typical max length of 512 tokens, so we'll likely need a sliding window or some other chunking approach!
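
To make the sliding window idea concrete, here is a minimal sketch on a plain list of token ids. The function name `sliding_windows` and the values `max_len=512` and `stride=128` are assumptions chosen for illustration, and the sketch ignores special tokens, which would take up a few positions in each window.

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `stride` tokens, so a span cut at the end
    of one window is seen in full (or at least in more context) in the next.
    """
    if stride >= max_len:
        raise ValueError('stride must be smaller than max_len')
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the text
        start += step
    return windows


# A fake text of 1000 tokens is covered by three overlapping windows
windows = sliding_windows(list(range(1000)), max_len=512, stride=128)
print([(w[0], w[-1]) for w in windows])
```

Hugging Face tokenizers can also produce overlapping chunks directly via `truncation=True`, `max_length`, `stride` and `return_overflowing_tokens=True`, which may be more convenient than chunking by hand.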

### Number of spans per text

Let's also see how many spans there are per text, and then visualize the texts with the fewest and the most spans.

In [None]:
# Number of annotated spans (rows) per essay
dist = train.groupby('id')['discourse_type'].size()

print(f'Min: {dist.min()}')
print(f'Max: {dist.max()}')

dist.hist();

In [None]:
dist.sort_values()

In [None]:
visualize('FFFF80B8CC2F')
visualize('FC7A3692794B')
visualize('149E8C278863')
visualize('71259B3EA87F')

### Test

In [None]:
sub = pd.read_csv('../input/feedback-prize-2021/sample_submission.csv')
sub.head()

Ok, so it looks like for test we also get files in a similar format, and we need to predict class and prediction string. I guess it's now time to look at the information provided by the host to understand this better :) 

> The word indices are calculated by using Python's .split() function and taking the indices in the resulting list. The two overlaps are calculated by taking the set() of each list of indices in a ground truth / prediction pair and calculating the intersection between the two sets divided by the length of each set.
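
The two overlaps from the quote can be sketched in a few lines. This assumes both the ground truth and the prediction are given as space-separated word indices and that neither is empty; `overlaps` is a hypothetical helper name, not part of any official scoring code.

```python
def overlaps(gt_string, pred_string):
    """Return (overlap relative to ground truth, overlap relative to prediction)."""
    gt = set(gt_string.split())
    pred = set(pred_string.split())
    n_shared = len(gt & pred)
    # Same intersection, divided once by each set's size
    return n_shared / len(gt), n_shared / len(pred)


# Ground truth covers words 1-4, prediction covers words 3-8: two words are shared
gt_overlap, pred_overlap = overlaps('1 2 3 4', '3 4 5 6 7 8')
print(gt_overlap, pred_overlap)
```

Here half of the ground truth is covered, but only a third of the prediction lies inside the ground truth, so an overly long prediction is penalized on the second ratio.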

I think that means we need to provide the indices of words that are in our predicted span in the `predictionstring` field.
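
As a rough sketch of that conversion, the snippet below maps a character span (like `discourse_start` / `discourse_end` in `train.csv`) to the indices of the `.split()` words it touches. The helper name and the rule "a word counts if it overlaps the span at all" are my assumptions; the host's exact alignment rule isn't stated here.

```python
def char_span_to_predictionstring(text, start, end):
    """Return space-separated indices of the .split() words overlapping [start, end)."""
    indices = []
    pos = 0
    for i, word in enumerate(text.split()):
        # Only whitespace lies between pos and the next word, so this finds it exactly
        word_start = text.index(word, pos)
        word_end = word_start + len(word)
        pos = word_end
        if word_start < end and word_end > start:
            indices.append(str(i))
    return ' '.join(indices)


text = 'Phones are useful. They help students learn.'
start = text.index('They')  # character offset of the second sentence
pred = char_span_to_predictionstring(text, start, len(text))
print(pred)
```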



## Baseline

This is an interesting task! It can be approached in several ways:
1. Split texts, for example by sentence, and then classify each sentence. 
2. NER: classify individual tokens (words)
3. Question answering: use discourse types as questions, and predict answer spans
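
For approach 2, per-word predictions eventually have to be turned back into spans. Here is a minimal sketch, assuming the model outputs one BIO tag per `.split()` word (`B-<class>`, `I-<class>` or `O`); `bio_to_rows` is a hypothetical helper and the tagging scheme itself is my assumption, not something the organizers prescribe.

```python
def bio_to_rows(tags):
    """Group per-word BIO tags into (class, predictionstring) rows."""
    rows = []
    current_label, current_idx = None, []
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition('-')
        # A 'B' tag, or an 'I' tag of a different class, opens a new span
        starts_new = prefix == 'B' or (prefix == 'I' and label != current_label)
        if prefix == 'O' or starts_new:
            if current_idx:
                rows.append((current_label, ' '.join(map(str, current_idx))))
            current_label, current_idx = (label, [i]) if prefix != 'O' else (None, [])
        else:
            current_idx.append(i)
    if current_idx:
        # Flush the span that runs to the end of the text
        rows.append((current_label, ' '.join(map(str, current_idx))))
    return rows


rows = bio_to_rows(['B-Lead', 'I-Lead', 'O', 'B-Claim', 'I-Claim', 'B-Claim'])
print(rows)
```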

Each approach has some downsides or complexities... I think we have enough data to start building a baseline. I'll go ahead and start to prepare one in a separate notebook, and come back to this EDA later. 

### I will smile for every upvote :)