#### What are you trying to do in this notebook?
In this competition, I'll identify elements in student writing. More specifically, I'll automatically segment texts and classify argumentative and rhetorical elements in essays written by 6th-12th grade students. I'll have access to the largest dataset of student writing ever released, which I'll use to test my skills in natural language processing, a fast-growing area of data science.

#### Why are you trying it?
I'll make it easier for students to receive feedback on their writing and increase opportunities to improve writing outcomes. Virtual writing tutors and automated writing systems can leverage these algorithms, while teachers may use them to reduce grading time. The open-sourced algorithms that come out of the competition will allow any educational organization to better help young writers develop.

My task is to predict the human annotations. I will first need to segment each essay into discrete rhetorical and argumentative elements (i.e., discourse elements) and then classify each element as one of the following:

Lead - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis

Position - an opinion or conclusion on the main question

Claim - a claim that supports the position

Counterclaim - a claim that refutes another claim or gives an opposing reason to the position

Rebuttal - a claim that refutes a counterclaim

Evidence - ideas or examples that support claims, counterclaims, or rebuttals.

Concluding Statement - a concluding statement that restates the claims
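To make the segmentation target concrete, here is a minimal sketch of how a character span could be turned into the space-separated word indices stored in a `predictionstring`. The function name `span_to_word_indices` and the sample essay are hypothetical, and the sketch assumes an end-exclusive character offset and word indices taken from Python's `str.split()`.

```python
def span_to_word_indices(text, start, end):
    """Return indices of whitespace-separated words that overlap text[start:end]."""
    indices = []
    pos = 0
    for idx, word in enumerate(text.split()):
        # Locate this word in the original text, searching from the end of the previous word
        word_start = text.index(word, pos)
        word_end = word_start + len(word)
        pos = word_end
        # Keep the word if its character range overlaps the requested span
        if word_start < end and word_end > start:
            indices.append(idx)
    return indices

# Hypothetical two-sentence essay; the second sentence plays the role of a Counterclaim
essay = "Phones are useful. However, they distract drivers."
start = essay.index("However")
indices = span_to_word_indices(essay, start, len(essay))
print(' '.join(str(i) for i in indices))  # 3 4 5 6
```

The printed string has the same shape as the `predictionstring` values built later in this notebook.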

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import os

In [None]:
text_files = os.listdir('/kaggle/input/feedback-prize-2021/train')

In [None]:
len(text_files)

In [None]:
train_df = pd.read_csv('/kaggle/input/feedback-prize-2021/train.csv')

In [None]:
train_df.info()

In [None]:
train_df.describe()

In [None]:
train_df.head(15)

In [None]:
def print_text(text_id):
    with open(f'/kaggle/input/feedback-prize-2021/train/{text_id}.txt') as f:
        lines = f.readlines()
    print(''.join(lines))
    
print_text('423A1CA112E2')

In [None]:
from termcolor import colored
def color_text(text_id, train_df, color_scheme = None):
    if not color_scheme:
        color_scheme = {
        'Lead': 'green',
        'Position': 'red',
        'Claim': 'blue',
        'Counterclaim': 'magenta',
        'Rebuttal': 'yellow',
        'Evidence': 'cyan',
        'Concluding Statement': 'grey'
    } 
    with open(f'/kaggle/input/feedback-prize-2021/train/{text_id}.txt') as f:
        lines = f.readlines()
    text = ''.join(lines)
    
    annot_df = train_df[train_df.id == text_id]
    blocks = [(int(row['discourse_start']),int(row['discourse_end']), color_scheme[row['discourse_type']]) for k, row in annot_df.iterrows()]
    blocks.sort()
    i = 0
    last_symbol = -1
    while i < len(blocks):
        if blocks[i][0] > last_symbol + 1:
            blocks.insert(i, (last_symbol+1, blocks[i][0] - 1, None))
        last_symbol = blocks[i][1]
        i += 1
    if last_symbol < len(text) - 1:
        blocks.append((last_symbol+1, len(text) - 1, None))

    colored_text = ''.join([colored(text[x[0]:x[1]+1], x[2]) for x in blocks])
    return colored_text
    
print(color_text('423A1CA112E2', train_df))
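The gap-filling step inside `color_text` is easier to follow in isolation. The sketch below is not the notebook's code: `fill_gaps` is a hypothetical helper that assumes inclusive end offsets, as `color_text` does, and returns a list of spans that covers the whole text, with `None` as the label for unannotated stretches.

```python
def fill_gaps(spans, text_len):
    """Given (start, end_inclusive, label) spans, return spans covering 0..text_len-1."""
    result = []
    cursor = 0  # first character position not yet covered
    for start, end, label in sorted(spans):
        if start > cursor:
            # Unannotated stretch before this span
            result.append((cursor, start - 1, None))
        result.append((start, end, label))
        cursor = max(cursor, end + 1)
    if cursor < text_len:
        # Unannotated tail after the last span
        result.append((cursor, text_len - 1, None))
    return result

# Hypothetical annotations on a 25-character text
filled = fill_gaps([(5, 9, 'Claim'), (15, 19, 'Evidence')], 25)
print(filled)
# [(0, 4, None), (5, 9, 'Claim'), (10, 14, None), (15, 19, 'Evidence'), (20, 24, None)]
```

Building a new list instead of inserting into the list being iterated avoids having to manage the loop index by hand.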

In [None]:
print(color_text('A8445CABFECE', train_df))

In [None]:
print(color_text('6B4F7A0165B9', train_df))

In [None]:
# let's load all texts

texts = []
for file in text_files:
    with open(f'/kaggle/input/feedback-prize-2021/train/{file}') as f:
        lines = f.readlines()
    texts.append({'id': file[:-4], 'text': ''.join(lines)})
texts_df = pd.DataFrame(texts)

In [None]:
texts_df.head()

In [None]:
texts_df['len'] = texts_df['text'].apply(len)

In [None]:
texts_df['len'].hist(bins = 50, figsize = (20,10))
print(texts_df['len'].min(), texts_df['len'].max())

In [None]:
texts_df['words_num'] = texts_df['text'].apply(lambda x: len(x.split()))  # split on any whitespace (spaces, newlines, tabs)

In [None]:
texts_df['words_num'].hist(bins = 100, figsize = (20,10))
print(texts_df['words_num'].min(), texts_df['words_num'].max())
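Word counts depend on how the text is split. `str.split(' ')` splits on single spaces only, so newlines stay glued to words and repeated spaces produce empty strings, while `str.split()` splits on any run of whitespace. A small illustration with a made-up sample string:

```python
# Made-up sample containing a newline and two double spaces
sample = "First line.\nSecond  line with  double spaces."

by_space = sample.split(' ')   # single-space separator
by_whitespace = sample.split() # any run of whitespace

print(len(by_space), by_space)
# 8 ['First', 'line.\nSecond', '', 'line', 'with', '', 'double', 'spaces.']
print(len(by_whitespace), by_whitespace)
# 7 ['First', 'line.', 'Second', 'line', 'with', 'double', 'spaces.']
```

Essays in this dataset contain line breaks, so the two methods can give different counts and different word indices.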

In [None]:
train_df['discourse_type'].value_counts()

In [None]:
train_df['discourse_words_num'] = train_df['discourse_text'].apply(lambda x: len(x.split()))  # split on any whitespace

In [None]:
avg_len_dict = {}
for d in train_df['discourse_type'].unique():
    temp_df = train_df[train_df['discourse_type'] == d]
    print(d, temp_df['discourse_words_num'].min(), temp_df['discourse_words_num'].mean(), temp_df['discourse_words_num'].max())
    avg_len_dict[d] = int(temp_df['discourse_words_num'].mean())

In [None]:
train_df['first_word'] = train_df['discourse_text'].apply(lambda x: x.split(' ')[0].lower())

In [None]:
top_first_words = {}
for d in train_df['discourse_type'].unique():
    temp_df = train_df[train_df['discourse_type'] == d]
    print(d)
    display(temp_df['first_word'].value_counts().head(10))
    top_first_words[d] = temp_df['first_word'].value_counts().head(10).keys()

In [None]:
stop_words = {'the', 'i', 'in', '', 'it', 'this', 'if', 'they', 'to'}

for k, v in top_first_words.items():
    top_first_words[k] = set([x for x in v if x not in stop_words])
top_first_words

In [None]:
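# Hand-written keyword sets per class; this overrides the dictionary computed above
top_first_words = {
    'Claim': {'another', 'students'},
    'Evidence': set(),  # empty set ({} would create an empty dict)
    'Position': {'there'},
    'Concluding Statement': {'so'},
    'Lead': {'driverless', 'imagine'},
    'Counterclaim': {'although', 'but', 'however,'},
    'Rebuttal': {'but,', 'while'}
}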

In [None]:
avg_len_dict

In [None]:
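def predict(text_id, path = '/kaggle/input/feedback-prize-2021/train/', top_first_words=top_first_words, avg_len_dict=avg_len_dict):
    with open(f'{path}{text_id}.txt') as f:
        text = f.read()
    # split on any whitespace so that word indices line up with the predictionstring convention
    words = text.split()
    preds = []
    for i, word in enumerate(words):
        for k, v in top_first_words.items():
            # the keyword sets are lowercase, so compare case-insensitively
            if word.lower() in v:
                # predict a span of the class's average length, clipped to the end of the essay
                end = min(i + avg_len_dict[k], len(words))
                preds.append({'id': text_id, 'class': k, 'predictionstring': ' '.join(str(x) for x in range(i, end))})
    return preds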

In [None]:
predict('423A1CA112E2')

In [None]:
test_files = os.listdir('/kaggle/input/feedback-prize-2021/test')

In [None]:
sub = []
for file in test_files:
    sub += predict(file[:-4], '/kaggle/input/feedback-prize-2021/test/')

In [None]:
sub_df = pd.DataFrame(sub)
sub_df

In [None]:
sub_df.to_csv('submission.csv', index = False)

#### Did it work?
The notebook builds a simple baseline from frequent first words and average element lengths, applies it to the test essays, and writes `submission.csv`. No validation score is computed here, so the accuracy of this baseline is not measured in this notebook.
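One way to check predictions locally is to compare each predicted `predictionstring` with a ground-truth one by word-index overlap. The sketch below assumes the matching rule as I recall it from the competition's evaluation page: a prediction counts as a match when the shared indices cover at least half of the ground truth and at least half of the prediction. The helper `is_match` and the 0.5 default are part of that assumption, not code from this notebook.

```python
def is_match(pred_string, true_string, threshold=0.5):
    """Return True if two space-separated index strings overlap enough in both directions."""
    pred = set(pred_string.split())
    true = set(true_string.split())
    if not pred or not true:
        return False
    shared = len(pred & true)
    # Overlap must reach the threshold relative to both the ground truth and the prediction
    return shared / len(true) >= threshold and shared / len(pred) >= threshold

print(is_match("3 4 5 6", "4 5 6 7"))             # True: 3 of 4 indices shared on both sides
print(is_match("0 1 2 3 4 5 6 7 8 9", "0 1"))     # False: only 2 of 10 predicted indices are shared
```

Counting matches per class across a held-out set of training essays would give a rough estimate of how the baseline performs before submitting.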

#### What did you not understand about this process?
Everything I needed is provided on the competition data page, and I had no problems while working on it. If anything in this notebook is unclear, please leave a comment.

#### What else do you think you can try as part of this approach?
Writing is a critical skill for success. However, less than a third of high school seniors are proficient writers, according to the National Assessment of Educational Progress. Unfortunately, low-income, Black, and Hispanic students fare even worse, with less than 15 percent demonstrating writing proficiency. One way to help students improve their writing is via automated feedback tools, which evaluate student writing and provide personalized feedback.