# Evaluating Student Writing
A competition to give helpful feedback to students!

![img](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.makemyassignments.com%2Fblog%2Fwp-content%2Fuploads%2F2020%2F04%2F1_JrK45nVmcOg2Uvx0nY7gzA.jpeg&f=1&nofb=1)

**This notebook was created during a live coding stream on Twitch.**

<img src="https://icdn.digitaltrends.com/image/digitaltrends/twitch-logo-720x720.jpg" alt="twitch logo" width="300"/><img src="https://i.postimg.cc/MGdgj1Hq/Screenshot-from-2021-12-17-09-46-56.png" alt="ms stream" width="300"/>

[Check out past videos here](https://www.twitch.tv/medallionstallion_) and give me a follow to be notified of future streams.

In this competition we are tasked with giving feedback on argumentative essays written by U.S. students in grades 6-12. Specifically, our task is to predict the human annotations.

The annotation is done in two steps:
1. Segment each essay into discrete rhetorical and argumentative elements.
2. Classify each element.

Where the classification labels are:
- `Lead` - Introduction
- `Position` - Opinion
- `Claim` - Something that supports the position
- `Counterclaim` - A claim that refutes another claim
- `Rebuttal` - A claim that refutes a counterclaim
- `Evidence` - Examples that support claims, counterclaims, or rebuttals
- `Concluding Statement` - Something that restates the claims in conclusion
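
To make the two steps concrete, here is a minimal sketch of one way to represent a segmented and classified essay: each element is a label plus the whitespace-separated word indices it covers. The toy essay, the segment boundaries, and the helper `to_predictionstring` are all made up for illustration; only the idea of space-separated word indices mirrors the `predictionstring` column used later in this notebook.

```python
def to_predictionstring(start_word, end_word):
    """Space-separated word indices for words start_word..end_word (inclusive)."""
    return " ".join(str(i) for i in range(start_word, end_word + 1))


# Toy essay, split on whitespace the same way the notebook splits essays
essay = "Phones are useful. They help students learn. So schools should allow them."
words = essay.split()

# Hypothetical segmentation: (label, first word index, last word index)
segments = [
    ("Position", 0, 2),
    ("Claim", 3, 6),
    ("Concluding Statement", 7, 11),
]

for label, start, end in segments:
    ps = to_predictionstring(start, end)
    text = " ".join(words[start : end + 1])
    print(f"{label:22s} {ps:15s} {text}")
```

Step 1 decides the index ranges and step 2 assigns the label to each range.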

In [None]:
# Install nb_black for automatic code formatting
!cp -r ../input/nbblack-code-base/nb_black-1.0.7/nb_black-1.0.7/ ./
!pip install --user nb_black-1.0.7/ > /dev/null
%load_ext lab_black

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
from tqdm.notebook import tqdm

from itertools import cycle

plt.style.use("ggplot")
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]
color_cycle = cycle(plt.rcParams["axes.prop_cycle"].by_key()["color"])

## Data Understanding

The data is provided in two formats:
- A `train.csv` with annotations for the essays
- A `train.zip` (and `train` folder) with individual `.txt` files for each essay.

The dataset is just over 100MB in size.

In [None]:
!ls -GFlash --color ../input/feedback-prize-2021/

In [None]:
# 15k+ text files are found in the train folder
!ls -GFlash --color ../input/feedback-prize-2021/train/ | wc -l

## Reading and exploring the data

In [None]:
train = pd.read_csv("../input/feedback-prize-2021/train.csv")
ss = pd.read_csv("../input/feedback-prize-2021/sample_submission.csv")
train_txt = glob("../input/feedback-prize-2021/train/*.txt")
test_txt = glob("../input/feedback-prize-2021/test/*.txt")

## What Does an Essay Look Like?

Let's print the first txt file `AFEC37C2D43F.txt` to see what an essay looks like. It's an essay about asking for advice on a certain topic or subject.

In [None]:
!cat ../input/feedback-prize-2021/train/AFEC37C2D43F.txt

We can also look at the raw annotations for this essay from the `train.csv` file. This essay has 10 annotations. Most of them are `Claim` with 6, followed by `Evidence` with 3.

In [None]:
train.query('id == "AFEC37C2D43F"')["discourse_type"].value_counts()

## Let's print the same example with the text colored by discourse type

There are prettier ways of doing this, but we are simply going to highlight the text differently based on the discourse type.
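
As a rough sketch of the highlighting idea, the snippet below wraps a piece of text in an ANSI escape sequence that sets a background color (`40` to `47`) and a black foreground (`30`), then resets with `\033[m`. The helper name `highlight`, the toy sentence, and the chosen indices are assumptions for illustration; whether the colors render depends on the terminal or notebook front end.

```python
def highlight(text, color_code):
    """Wrap text in an ANSI sequence: background 4<color_code>, black foreground."""
    return f"\033[4{color_code};30m{text}\033[m"


words = "Schools should allow phones because they help students learn".split()

# Hypothetical annotation: word indices 4..8 are a Claim (color 4 = blue)
claim_idx = [4, 5, 6, 7, 8]
claim_text = " ".join(words[i] for i in claim_idx)

# Unannotated words are printed as-is, the claim is printed highlighted
print(" ".join(words[:4]), highlight(claim_text, 4))
```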

In [None]:
def read_essay(id):
    with open(f"../input/feedback-prize-2021/train/{id}.txt") as f:
        essay = f.read()
    return essay


def text_to_color(essay, discourse_type, predictionstring):
    """
    Takes an entire essay, the discourse type and prediction string.
    Returns highlighted text for the prediction string
    """
    discourse_color_map = {
        "Lead": 1,  # 1 red
        "Position": 2,  # 2 green
        "Evidence": 3,  # 3 yellow
        "Claim": 4,  # 4 blue
        "Concluding Statement": 5,  # 5 magenta
        "Counterclaim": 6,  # 6 cyan
        "Rebuttal": 7,  # 7 white
        "None": 9,  # default
    }
    hcolor = discourse_color_map[discourse_type]
    text_index = [int(c) for c in predictionstring.split()]
    text_subset = " ".join(np.array(essay.split())[text_index])
    # Background color 4<hcolor> with black text; 49 is the default background
    return f"\033[4{hcolor};30m{text_subset}\033[m"


def get_non_discourse_df(train, essay, id):
    all_pred_strings = " ".join(train.query("id == @id")["predictionstring"].values)
    all_pred_strings = [int(c) for c in all_pred_strings.split()]
    # Collect the word indices that are not covered by any annotation

    non_discourse_df = pd.DataFrame(
        [c for c in range(len(essay.split())) if c not in all_pred_strings]
    )
    non_discourse_df.columns = ["predictionstring"]
    non_discourse_df["cluster"] = (
        non_discourse_df["predictionstring"].diff().fillna(1) > 1
    ).cumsum()

    non_discourse_strings = []
    for i, d in non_discourse_df.groupby("cluster"):
        pred_string = [str(x) for x in d["predictionstring"].values]
        non_discourse_strings.append(" ".join(pred_string))
    df = pd.DataFrame(non_discourse_strings).rename(columns={0: "predictionstring"})
    df["discourse_type"] = "None"
    return df


def get_colored_essay(train, id):
    essay = read_essay(id)
    all_text = ""
    train_subset = train.query("id == @id").copy()
    df = get_non_discourse_df(train, essay, id)
    train_subset = pd.concat([train_subset, df])
    train_subset["first_index"] = (
        train_subset["predictionstring"].str.split(" ").str[0].astype("int")
    )
    train_subset = train_subset.sort_values("first_index").reset_index(drop=True).copy()
    for i, d in train_subset.iterrows():
        colored_text = text_to_color(essay, d.discourse_type, d.predictionstring)
        all_text += " " + colored_text
    return all_text[1:]


all_text = get_colored_essay(train, "AFEC37C2D43F")
print(all_text)

## What Type of Annotations are Most Common
- Are there trends to when and where annotations appear in the text?
- What type of annotations are important?

In [None]:
ax = (
    train["discourse_type"]
    .value_counts(ascending=True)
    .plot(kind="barh", figsize=(10, 5), color=color_pal[1])
)
ax.set_title("Discourse Label Frequency (in train)", fontsize=16)
ax.bar_label(ax.containers[0], label_type="edge")
plt.show()

Some things to note about this next plot:
- The `Lead` tends to start at the beginning of the document, while `Evidence` covers a wider range of positions in the document.

In [None]:
ax = (
    train.groupby("discourse_type")[["discourse_start", "discourse_end"]]
    .mean()
    .sort_values("discourse_start")
    .plot(
        kind="barh",
        figsize=(10, 5),
    )
)
ax.set_title("Average Discourse Label Start and End", fontsize=16)
plt.show()

In [None]:
# The length of each label
train["discourse_len"] = (train["discourse_end"] - train["discourse_start"]).astype(
    "int"
)

fig, ax = plt.subplots(figsize=(12, 5))
sns.barplot(x="discourse_type", y="discourse_len", data=train)
ax.set_title("The Average Length of each Discourse")
ax.set_xlabel("Discourse Type")
ax.set_ylabel("Average Text Length")
plt.show()

# The txt Files
- What is the length of each text file? We add info about the essays to the training dataset:
    - The length of the essay in characters.
    - The number of words in each essay.

In [None]:
len_dict = {}
word_dict = {}
for t in tqdm(train_txt):
    with open(t, "r") as txt_file:
        myid = t.split("/")[-1].replace(".txt", "")
        data = txt_file.read()
        mylen = len(data.strip())
        myword = len(data.split())
        len_dict[myid] = mylen
        word_dict[myid] = myword
train["essay_len"] = train["id"].map(len_dict)
train["essay_words"] = train["id"].map(word_dict)

Interestingly, most essays end around 6,000 words (I'm guessing that's the max length allowed for the essay)

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
train.groupby("id").first().plot(
    x="essay_len", y="essay_words", kind="scatter", color=color_pal[3], ax=ax
)
ax.set_title("Word vs Character Length per Essay", fontsize=16)
plt.show()

Now that we know the essay lengths, we can see where the discourse labels tend to appear relative to the total essay length.
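
A small worked example of why normalizing helps, using made-up numbers: two conclusions that both start 90% of the way through their essays have very different raw start positions, but the same relative position.

```python
# Toy annotations: (discourse_type, discourse_start, essay_len), in characters.
# All values are invented for illustration.
rows = [
    ("Lead", 0, 1000),
    ("Concluding Statement", 900, 1000),
    ("Lead", 0, 2000),
    ("Concluding Statement", 1800, 2000),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


conclusions = [(start, n) for t, start, n in rows if t == "Concluding Statement"]

raw_start = mean(start for start, _ in conclusions)
pct_start = mean(start / n for start, n in conclusions)

# The raw mean mixes short and long essays; the relative mean does not
print(raw_start, pct_start)
```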

In [None]:
train["discourse_start_pct"] = train["discourse_start"] / train["essay_len"]
train["discourse_end_pct"] = train["discourse_end"] / train["essay_len"]

ax = (
    train.groupby("discourse_type")[["discourse_start_pct", "discourse_end_pct"]]
    .mean()
    .sort_values("discourse_start_pct")
    .plot(
        kind="barh",
        figsize=(10, 5),
    )
)
ax.set_title("Label Start and End as Percentage of Total Essay", fontsize=16)
plt.show()

# Wordclouds by Discourse Type

In [None]:
plt.style.use('default')
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

fig, axs = plt.subplots(7, 1, figsize=(20, 25))
axs = axs.flatten()

# One wordcloud per discourse type, each in its own subplot
for plt_idx, (discourse_type, d) in enumerate(train.groupby("discourse_type")):
    discourse_text = " ".join(d["discourse_text"].values.tolist())
    wordcloud = WordCloud(
        max_font_size=200,
        max_words=200,
        width=1200,
        height=800,
        background_color="white",
    ).generate(discourse_text)
    axs[plt_idx].imshow(wordcloud, interpolation="bilinear")
    axs[plt_idx].set_title(discourse_type, fontsize=18)
    axs[plt_idx].axis("off")
plt.tight_layout()
plt.show()

# The Metric

The metric is a version of the F1 score in which a prediction is matched to a ground truth when their overlap is >= 0.5 in both directions. The per-class F1 scores are then macro-averaged.

The evaluation page describes the process as follows:
- All ground truths and predictions for a given class are compared.
- If the overlap between the ground truth and prediction is >= 0.5, and the overlap between the prediction and the ground truth is >= 0.5, the prediction is a match and considered a true positive. If multiple matches exist, the match with the highest pair of overlaps is taken.
- Any unmatched ground truths are false negatives and any unmatched predictions are false positives.
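
The matching rule can be sketched on toy prediction strings as follows. The helper names `overlaps` and `is_match` are made up for this illustration; the 0.5 threshold in both directions is the one described above.

```python
def overlaps(gt, pred):
    """Return (shared / len(gt), shared / len(pred)) for two predictionstrings."""
    set_gt, set_pred = set(gt.split()), set(pred.split())
    shared = len(set_gt & set_pred)
    return shared / len(set_gt), shared / len(set_pred)


def is_match(gt, pred, threshold=0.5):
    """A prediction matches only if both overlap ratios reach the threshold."""
    overlap_gt, overlap_pred = overlaps(gt, pred)
    return overlap_gt >= threshold and overlap_pred >= threshold


gt = "0 1 2 3 4 5 6 7 8 9"
too_long = " ".join(str(i) for i in range(30))

print(is_match(gt, "0 1 2 3 4 5"))  # covers 6/10 of gt and 6/6 of pred
print(is_match(gt, "0 1 2"))  # covers only 3/10 of gt
print(is_match(gt, too_long))  # covers all of gt, but only 10/30 of pred
```

Requiring both ratios keeps a prediction from matching simply by being very long or very short.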

## Example scoring a single label
- We will purposefully make the label not overlap and show how it will be a false positive.

In [None]:
example_pred = (
    train[["id", "discourse_type", "predictionstring"]]
    .query('id == "423A1CA112E2"')
    .copy()
)
example_pred = example_pred.rename(columns={"discourse_type": "class"})
example_gt = (
    train[["id", "discourse_type", "predictionstring"]]
    .query('id == "423A1CA112E2"')
    .copy()
)

# Make one prediction wrong on purpose
example_pred.loc[0, "predictionstring"] = " ".join(
    example_gt["predictionstring"].values[0].split(" ")[:10]
)
example_pred.loc[5, "class"] = "Lead"

# Step 1. all ground truths and predictions for a given class are compared.
joined_example = example_pred.merge(
    example_gt,
    left_on=["id", "class"],
    right_on=["id", "discourse_type"],
    how="outer",
    suffixes=("_pred", "_gt"),
)

set_pred = set(joined_example["predictionstring_pred"][0].split(" "))
set_gt = set(joined_example["predictionstring_gt"][0].split(" "))

# Find the lengths of the sets for this example
len_gt = len(set_gt)
len_pred = len(set_pred)
inter = len(set_gt.intersection(set_pred))
overlap_1 = inter / len_gt
overlap_2 = inter / len_pred
print(len_gt, len_pred, inter)
# In this example both overlap percentages are not >= 0.5 so it is not a true positive
print(overlap_1, overlap_2)

# Competition Metric Code
**Note:** As @cdeotte pointed out in the comments, the scoring metric for this competition is `macro_f1`. The `score_feedback_comp` function now computes the macro F1 score to follow the evaluation page.
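
To see why the distinction matters, here is a small arithmetic sketch with invented per-class counts. Micro F1 pools the counts across classes before scoring, while macro F1 scores each class separately and averages, so a rare class that is predicted poorly drags the macro score down much more.

```python
def f1(tp, fp, fn):
    """F1 written in terms of true positives, false positives, false negatives."""
    return tp / (tp + 0.5 * (fp + fn))


# Hypothetical per-class counts: (TP, FP, FN)
counts = {"Claim": (90, 10, 10), "Rebuttal": (1, 4, 5)}

# Micro: sum TP, FP and FN over classes, then compute a single F1
totals = [sum(c[i] for c in counts.values()) for i in range(3)]
micro = f1(*totals)

# Macro: compute F1 per class, then take the unweighted mean
macro = sum(f1(*c) for c in counts.values()) / len(counts)

print(round(micro, 3), round(macro, 3))
```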

In [None]:
def calc_overlap(row):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(row.predictionstring_pred.split(" "))
    set_gt = set(row.predictionstring_gt.split(" "))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter / len_pred
    return [overlap_1, overlap_2]


def score_feedback_comp_micro(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition

    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = (
        gt_df[["id", "discourse_type", "predictionstring"]]
        .reset_index(drop=True)
        .copy()
    )
    pred_df = pred_df[["id", "class", "predictionstring"]].reset_index(drop=True).copy()
    pred_df["pred_id"] = pred_df.index
    gt_df["gt_id"] = gt_df.index
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(
        gt_df,
        left_on=["id", "class"],
        right_on=["id", "discourse_type"],
        how="outer",
        suffixes=("_pred", "_gt"),
    )
    joined["predictionstring_gt"] = joined["predictionstring_gt"].fillna(" ")
    joined["predictionstring_pred"] = joined["predictionstring_pred"].fillna(" ")

    joined["overlaps"] = joined.apply(calc_overlap, axis=1)

    # 2. If the overlap between the ground truth and prediction is >= 0.5,
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    # `overlaps` already holds [overlap_1, overlap_2], so index it directly
    joined["overlap1"] = joined["overlaps"].apply(lambda x: x[0])
    joined["overlap2"] = joined["overlaps"].apply(lambda x: x[1])

    joined["potential_TP"] = (joined["overlap1"] >= 0.5) & (joined["overlap2"] >= 0.5)
    joined["max_overlap"] = joined[["overlap1", "overlap2"]].max(axis=1)
    tp_pred_ids = (
        joined.query("potential_TP")
        .sort_values("max_overlap", ascending=False)
        .groupby(["id", "predictionstring_gt"])
        .first()["pred_id"]
        .values
    )

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined["pred_id"].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query("potential_TP")["gt_id"].unique()
    unmatched_gt_ids = [c for c in joined["gt_id"].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    # calc microf1
    my_f1_score = TP / (TP + 0.5 * (FP + FN))
    return my_f1_score


def score_feedback_comp(pred_df, gt_df, return_class_scores=False):
    class_scores = {}
    pred_df = pred_df[["id", "class", "predictionstring"]].reset_index(drop=True).copy()
    for discourse_type, gt_subset in gt_df.groupby("discourse_type"):
        pred_subset = (
            pred_df.loc[pred_df["class"] == discourse_type]
            .reset_index(drop=True)
            .copy()
        )
        class_score = score_feedback_comp_micro(pred_subset, gt_subset)
        class_scores[discourse_type] = class_score
    f1 = np.mean([v for v in class_scores.values()])
    if return_class_scores:
        return f1, class_scores
    return f1

## **Note - Duplicate Labels**
There are two cases in the training set where the same prediction string has multiple labels. This results in an imperfect score when using the ground truth labels as a submission.
I've printed these two examples below. They occur when words separated only by a comma are given different labels, while `split()` keeps them as a single string.
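
The effect can be reproduced on a toy sentence. The text, the character spans, and the helper `char_span_to_word_idx` below are invented for illustration, and the character-to-word mapping is an assumed one, not necessarily the one used to build the dataset.

```python
def char_span_to_word_idx(text, start, end):
    """Indices of whitespace-separated tokens overlapping characters [start, end)."""
    idx, pos = [], 0
    for i, token in enumerate(text.split()):
        tok_start = text.index(token, pos)
        tok_end = tok_start + len(token)
        if tok_start < end and tok_end > start:
            idx.append(i)
        pos = tok_end
    return idx


text = "Driverless cars are safe,cheap and fun."

# Two hypothetical character-level annotations with no space between them
span_a = (20, 25)  # "safe,"
span_b = (25, 30)  # "cheap"

print(text[span_a[0] : span_a[1]], char_span_to_word_idx(text, *span_a))
print(text[span_b[0] : span_b[1]], char_span_to_word_idx(text, *span_b))
```

Because `split()` treats `safe,cheap` as one token, both annotations end up with the same word index.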

In [None]:
# Examples where the same `predictionstring` has two labels.
train.loc[train[["id", "discourse_type", "predictionstring"]].duplicated(keep=False)][
    [
        "id",
        "discourse_id",
        "discourse_start",
        "discourse_end",
        "discourse_text",
        "discourse_type",
        "discourse_type_num",
        "predictionstring",
    ]
]

## Scoring the Ground Truth Labels

In [None]:
gt_df = train.copy()
pred_df = gt_df.copy()
pred_df = pred_df.rename(columns={"discourse_type": "class"})
microf1_score = score_feedback_comp_micro(pred_df, gt_df)
macrof1_score, class_scores = score_feedback_comp(
    pred_df, gt_df, return_class_scores=True
)

print(
    f"Using the ground truth as the prediction, the micro f1 score is {microf1_score:0.4f} and the macro f1 score is {macrof1_score:0.4f}"
)
print("The individual class scores are:")
print(class_scores)

# Baseline Model

The model below is very basic and only used as a baseline.

Using a simple heuristic based on the average length and position of the `Lead` and `Concluding Statement` labels in the training set, we:
- Label the first 51 words of every essay as `Lead`
- Label the last 61 words as `Concluding Statement`
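
A minimal sketch of the heuristic on a single essay, given only its word count. The helper `baseline_spans` is hypothetical, its defaults mirror the padding defaults of the notebook's baseline function, and it assumes zero-based word indices starting at 0 and clips the ranges for short essays, both of which are choices made for this sketch.

```python
def baseline_spans(n_words, lead_len=51, conclusion_len=61):
    """Return (lead, conclusion) predictionstrings for an essay of n_words words."""
    lead = range(0, min(lead_len, n_words))
    conclusion = range(max(0, n_words - conclusion_len), n_words)
    return " ".join(map(str, lead)), " ".join(map(str, conclusion))


lead, conclusion = baseline_spans(n_words=300)

# First and last word index of each predicted span
print(lead.split()[0], lead.split()[-1])
print(conclusion.split()[0], conclusion.split()[-1])
```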

Note the average lengths of these sections below:

In [None]:
train["len_predstring"] = train["predictionstring"].apply(lambda x: len(x.split(" ")))
train_meta = (
    train.groupby("id")["discourse_type"]
    .value_counts()
    .unstack()
    .fillna(0)
    .astype("int")
)
(train_meta["Lead"] > 0).mean(), (train_meta["Concluding Statement"] > 0).mean()
# Average length in words of each discourse type, including `Lead` and `Concluding Statement`
train.groupby("discourse_type")["len_predstring"].mean()

In [None]:
def make_baseline_submission(txt_files, lead_padding=51, conclusion_padding=61):
    lead_pred = {}
    conc_pred = {}

    for t in tqdm(txt_files):
        with open(t, "r") as txt_file:
            myid = t.split("/")[-1].replace(".txt", "")
            data = txt_file.read()
        split_data = data.split()
        n_strings = len(split_data)
        lead_str = [" ".join([str(x + 1) for x in range(lead_padding)])]
        lead_pred[myid] = lead_str
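        # Clip at 0 so essays shorter than the padding never get negative indices
        conc_start = max(0, n_strings - conclusion_padding)
        conc_str = [" ".join([str(x) for x in range(conc_start, n_strings)])]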
        conc_pred[myid] = conc_str

    # Create Dataframes from the predicted lead and conclusions
    lead_df = pd.DataFrame.from_dict(lead_pred, orient="index")
    conc_df = pd.DataFrame.from_dict(conc_pred, orient="index")
    lead_df = lead_df.reset_index().rename(
        columns={"index": "id", 0: "predictionstring"}
    )
    conc_df = conc_df.reset_index().rename(
        columns={"index": "id", 0: "predictionstring"}
    )
    lead_df["class"] = "Lead"
    conc_df["class"] = "Concluding Statement"

    # Concating our lead and conclusion datasets
    sub = pd.concat([lead_df, conc_df]).reset_index(drop=True)
    # Save as csv
    sub = sub[ss.columns].copy()
    return sub

In [None]:
sub_train = make_baseline_submission(train_txt)
gt_df = train[["id", "discourse_type", "predictionstring"]].copy()
pred_df = sub_train.copy()
pred_df = pred_df.rename(columns={"discourse_type": "class"})
microf1_score = score_feedback_comp_micro(pred_df, gt_df)
macrof1_score, class_scores = score_feedback_comp(
    pred_df, gt_df, return_class_scores=True
)

print(
    f"The baseline model on the training set has a micro f1 score of {microf1_score:0.4f} and a macro f1 score of {macrof1_score:0.4f}"
)
print("The individual class scores are:")
print(class_scores)

# Predict on Test Set

In [None]:
sub = make_baseline_submission(test_txt)
sub[ss.columns].to_csv("submission.csv", index=False)

# END