## **Feedback Prize - Evaluating Student Writing**
* An automated tool to evaluate student writing and provide personalized feedback. 
* Automatically segment text and classify the text among the following rhetorical and argumentative elements:
    * ***Lead***
    * ***Position***
    * ***Claim***
    * ***Counterclaim***
    * ***Rebuttal***
    * ***Evidence***
    * ***Concluding Statement***
    
### **Evaluation**
* Evaluation is based on the overlap between the ground truth and predicted word indices.
* The final score is calculated using **TP/FP/FN** for each class, followed by **macro F1 score** across all classes.
* Unmatched ground truths are **false negatives (FN)**.
* Unmatched predictions are **false positives (FP)**.
* A prediction is a **true positive (TP)** when both overlap ratios are **>=0.5**: the share of ground truth word indices covered by the prediction, and the share of predicted word indices covered by the ground truth.
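
The matching rule above can be sketched in a few lines of Python. This is a simplified illustration, not the official scoring code: the helper names (`to_word_set`, `is_match`, `class_f1`, `macro_f1`) are hypothetical, the example scores a single essay, and ground truths are paired with predictions greedily in the order given, which is an assumption made to keep the sketch short.

```python
def to_word_set(predictionstring):
    """Parse a space-separated string of word indices into a set of ints."""
    return {int(tok) for tok in predictionstring.split()}


def is_match(gt_words, pred_words, threshold=0.5):
    """True when both overlap ratios reach the threshold."""
    if not gt_words or not pred_words:
        return False
    shared = len(gt_words & pred_words)
    return (shared / len(gt_words) >= threshold
            and shared / len(pred_words) >= threshold)


def class_f1(gt_strings, pred_strings):
    """F1 for one class, using greedy one-to-one matching (an assumption)."""
    gts = [to_word_set(s) for s in gt_strings]
    preds = [to_word_set(s) for s in pred_strings]
    matched_gt = set()
    tp = 0
    for pred in preds:
        for i, gt in enumerate(gts):
            if i not in matched_gt and is_match(gt, pred):
                matched_gt.add(i)
                tp += 1
                break
    fp = len(preds) - tp  # unmatched predictions
    fn = len(gts) - tp    # unmatched ground truths
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class_scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_scores.values()) / len(per_class_scores)


# Toy example: one of two predicted claims matches one of two true claims.
claim_f1 = class_f1(["0 1 2 3", "10 11 12"], ["2 3 4 5", "20 21"])
overall = macro_f1({"Claim": claim_f1, "Evidence": 1.0})
print(claim_f1, overall)
```

In the toy example, the first prediction shares two of four words with the first ground truth in both directions, so it counts as a TP; the remaining prediction is an FP and the remaining ground truth is an FN.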

### **Motivation**
The goal is a feedback tool that helps students improve their writing and, in turn, their overall learning outcomes.

### **Problem Description**
This is a text segmentation problem: each essay (a text file) must be divided into contiguous segments according to its semantic structure.
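
One way to picture the segmentation is as a label per word. The sketch below is only an illustration under stated assumptions: the essay text is made up, words are obtained with a plain whitespace split, the helper `words_to_labels` is hypothetical, and the `"O"` label for words outside every segment is borrowed from sequence-tagging convention rather than taken from the dataset.

```python
def words_to_labels(num_words, segments, outside_label="O"):
    """Expand (discourse_type, word-index string) pairs into one label per word."""
    labels = [outside_label] * num_words
    for discourse_type, predictionstring in segments:
        for idx in map(int, predictionstring.split()):
            labels[idx] = discourse_type
    return labels


# Made-up essay, split on whitespace into 12 words.
essay = "Phones are useful . They help in emergencies . So allow them"
words = essay.split()

# Two contiguous segments; the remaining words belong to no segment.
segments = [("Position", "0 1 2"), ("Evidence", "4 5 6 7")]
labels = words_to_labels(len(words), segments)

for word, label in zip(words, labels):
    print(f"{word:12s} {label}")
```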

In [None]:
import os
import glob
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

### **Dataset**

In [None]:
input_dir = "../input/feedback-prize-2021/"
train_dir = "../input/feedback-prize-2021/train/"  # Training set
test_dir = "../input/feedback-prize-2021/test"  # Test set
sample_file = "sample_submission.csv"  # Sample submission file
train_file = "train.csv"  # Annotated version of the training set

Each .txt file in the training directory contains one essay.  
<span style="color: red">**WARNING**</span> Some parts of an essay are not annotated because they do not fit into any of the above categories.

### **Exploratory Data Analysis**

In [None]:
print(f"Total Train files: {len(os.listdir(train_dir))}")
print(f"Total Test files: {len(os.listdir(test_dir))}")

In [None]:
train_data = pd.read_csv(os.path.join(input_dir,train_file))
train_data.head()

#### **Description of Each Column**
* **id** - ID code for essay response
* **discourse_id** - ID code for discourse element
* **discourse_start** - Character position where discourse element begins in the essay response
* **discourse_end** - Character position where discourse element ends in the essay response
* **discourse_text** - Text of discourse element
* **discourse_type** - Classification of discourse element
* **discourse_type_num** - Enumerated class label of discourse element
* **predictionstring** - Space-separated word indices of the discourse element, in the format required for predictions
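
To make the link between the character positions and the word indices concrete, here is a minimal sketch. It assumes that words are produced by a plain whitespace `split()` and that the span starts on a word boundary; both are assumptions of this example, and the helper names and sample text are hypothetical.

```python
def char_span_to_word_indices(text, start, end):
    """Map a character span to whitespace-split word indices (assumed tokenisation)."""
    first = len(text[:start].split())       # words that come before the span
    count = len(text[start:end].split())    # words inside the span
    return list(range(first, first + count))


def to_predictionstring(indices):
    """Join word indices into a space-separated string."""
    return " ".join(str(i) for i in indices)


# Made-up essay text; the second sentence plays the role of a discourse element.
text = "Phones help students. They are useful in emergencies."
start = text.index("They")
end = len(text)

indices = char_span_to_word_indices(text, start, end)
print(to_predictionstring(indices))
```

If a span begins in the middle of a word, this simple count is off by one, so real data may need an extra boundary check.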


In [None]:
print(f"Total annotated discourse elements in the training set: {len(train_data)}")

In [None]:
label_counts = train_data["discourse_type"].value_counts()  # count once, reuse for both axes
sns.barplot(x=label_counts.index, y=label_counts.values)
plt.xticks(rotation=45)
plt.title("Label count per discourse type")
plt.show()

### In Progress...