
# Introduction: 

The dataset contains argumentative essays written by U.S. students in grades 6-12. The essays were annotated by expert raters for elements commonly found in argumentative writing.

Task: predict the human annotations. You first need to segment each essay into discrete rhetorical and argumentative elements (i.e., discourse elements) and then classify each element as one of 7 "discourse types". These are:

1. Lead - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
2. Position - an opinion or conclusion on the main question
3. Claim - a claim that supports the position
4. Counterclaim - a claim that refutes another claim or gives an opposing reason to the position
5. Rebuttal - a claim that refutes a counterclaim
6. Evidence - ideas or examples that support claims, counterclaims, or rebuttals.
7. Concluding Statement - a concluding statement that restates the claims

Data:

train.zip - folder of individual .txt files, with each file containing the full text of an essay response in the training set

train.csv - a .csv file containing the annotated version of all essays in the training set

test.zip - folder of individual .txt files, with each file containing the full text of an essay response in the test set

sample_submission.csv - file in the required format for making predictions - note that if you are making multiple predictions for a document, submit multiple rows
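
As a rough illustration of the "multiple rows" note, the sketch below builds a tiny submission table in which one essay appears once per predicted discourse element. The column names `id`, `class`, and `predictionstring` are an assumption here and should be checked against sample_submission.csv; the essay ID and word indices are made up.

```python
import pandas as pd

# Hypothetical predictions for a single essay: each predicted discourse
# element becomes its own row, so the same essay id repeats.
predictions = [
    {"id": "ESSAY_0001", "class": "Lead", "predictionstring": "0 1 2 3 4"},
    {"id": "ESSAY_0001", "class": "Position", "predictionstring": "5 6 7 8"},
    {"id": "ESSAY_0001", "class": "Evidence", "predictionstring": "9 10 11 12 13 14"},
]

# Assumed column order; verify against the provided sample file.
submission = pd.DataFrame(predictions, columns=["id", "class", "predictionstring"])
print(submission)
```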

In [None]:
# import libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from glob import glob

#import matplotlib.style as style
#style.use('fivethirtyeight')
from matplotlib.ticker import FuncFormatter

from wordcloud import WordCloud, STOPWORDS


import nltk
from nltk.corpus import stopwords


import warnings
warnings.filterwarnings('ignore')
#import spacy
from sklearn.feature_extraction.text import CountVectorizer
import os

# Part 1:
# Train.csv file EDA

In [None]:
base_path = '/kaggle/input/feedback-prize-2021/'

In [None]:
train_df = pd.read_csv(base_path + 'train.csv')

In [None]:
display(train_df.head(5))
display(train_df.shape)

In [None]:
# count the unique values in each column
for col in train_df.columns:
    print(col + ":" + str(len(train_df[col].unique())))

In [None]:
# let's take a look at the number of text files:
train_text_files = os.listdir(os.path.join(base_path, 'train'))
len(train_text_files)

In [None]:
train_df.describe()

In [None]:
# train data set is free from missing values:
train_df.isnull().sum()

1. There are 15594 essays.
2. These essays have been labelled with 144293 discourse elements.
3. Each discourse element is labelled as one of the 7 discourse types.

Thus, the training data consists of roughly 16k essays (15594 unique id values) and 144k annotated discourse elements (144293 unique discourse_id values), or about 9 discourse elements per essay.
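
The per-essay average can be checked with a groupby. The snippet below is a minimal sketch on a made-up toy frame (`toy_df` is hypothetical); on the real data the same expression would be applied to `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df: essay "A" has 3 discourse elements, "B" has 2.
toy_df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "discourse_id": [1, 2, 3, 4, 5],
})

# Number of distinct discourse elements per essay, then the mean over essays.
per_essay = toy_df.groupby("id")["discourse_id"].nunique()
print(per_essay.mean())

# The ratio of the two totals reported above gives the same kind of average.
print(round(144293 / 15594, 2))
```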

The train.csv contains discourse_text with annotations. Each row corresponds to one discourse element and contains the following:

1. id - ID code for essay response

2. discourse_id - ID code for discourse element

3. discourse_start - character position where discourse element begins in the essay response

4. discourse_end - character position where discourse element ends in the essay response

5. discourse_text - text of discourse element

6. discourse_type - classification of discourse element

7. discourse_type_num - enumerated class label of discourse element

8. predictionstring - the word indices of the training sample, as required for predictions
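
To make the relation between the character offsets and predictionstring more concrete, here is a minimal sketch that converts a character span into whitespace-separated word indices. The helper `span_to_predictionstring` is hypothetical, and it assumes that words are split on whitespace and that the span starts at a word boundary; the annotation tooling may handle edge cases differently.

```python
def span_to_predictionstring(essay_text, start, end):
    """Map a character span [start, end) to space-separated word indices."""
    # Words that lie entirely before the span give the index of its first word.
    words_before = len(essay_text[:start].split())
    # Number of whitespace-separated words inside the span itself.
    n_words = len(essay_text[start:end].split())
    return " ".join(str(i) for i in range(words_before, words_before + n_words))


essay = "Phones are useful. They help students learn."
start = essay.index("They")
print(span_to_predictionstring(essay, start, len(essay)))  # 3 4 5 6
```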





# Distribution of discourse type labels in the annotated essays:

In [None]:
#distribution of labels in the annotated discourse:
label_dist = train_df['discourse_type'].value_counts()
label_dist *= 100 / label_dist.sum()
label_dist

In [None]:
label_dist.plot.bar(rot=90, title='Distribution of labels')
plt.ylabel('percentage of annotations')

In [None]:
# Percentage distribution of discourse_type_number in the annotated discourses:
av_per_disc = train_df['discourse_type_num'].value_counts(ascending = True)
#av_per_disc
av_per_disc *= 100 / av_per_disc.sum()
av_per_disc.rename_axis('discourse_type_num').reset_index(name='%count')

In [None]:
av_per_disc.plot(kind = "barh", figsize = (12, 8))

# Distribution of the average number of words per discourse_type and discourse_type_num:


In [None]:
# add columns to train_df holding the character length of discourse_text (disc_len) and of predictionstring (pred_len)
train_df['disc_len'] = train_df['discourse_text'].astype(str).apply(len)
train_df['pred_len'] = train_df['predictionstring'].astype(str).apply(len)

In [None]:
train_df.head(3)

In [None]:
# add columns to train_df holding the number of words in discourse_text (disc_word_count) and the number of word indices in predictionstring (pred_word_count)
train_df["disc_word_count"] = train_df["discourse_text"].apply(lambda x: len(x.split()))
train_df["pred_word_count"] = train_df["predictionstring"].apply(lambda x: len(x.split()))


In [None]:
train_df.head(3)

In [None]:
#Distribution of text length in discourse text:
discourse_len = train_df['disc_len'] 
fig, ax = plt.subplots(figsize=(12, 8))

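# sns.distplot is deprecated; histplot with a KDE overlay gives a comparable plot
sns.histplot(discourse_len, bins=50, kde=True, stat='density', ax=ax)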
plt.show()

In [None]:
# distribution of number of words in discourse text:
word_dist = train_df['disc_word_count']
fig, ax = plt.subplots(figsize=(12, 8))
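# sns.distplot is deprecated; histplot with a KDE overlay gives a comparable plot
sns.histplot(word_dist, bins=50, kde=True, stat='density', ax=ax)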
plt.show()

In [None]:
# distribution of number of words in prediction string:
pred_word = train_df["pred_word_count"]
fig, ax = plt.subplots(figsize=(12, 8))
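# sns.distplot is deprecated; histplot with a KDE overlay gives a comparable plot
sns.histplot(pred_word, bins=50, kde=True, stat='density', ax=ax)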
plt.show()

In [None]:
# now lets find out the average number of words per discourse type:
dis_type_len = train_df.groupby('discourse_type')['disc_word_count'].mean().sort_values()
dis_type_len

In [None]:
# plot the average number of words per discourse type:
dis_type_len.plot(kind = 'barh', figsize = (10,5))
plt.xlabel('average number of words')
plt.title('Average number of words per discourse type')

In [None]:
# Also find out the average len of prediction string:
pred_str_len = train_df.groupby('discourse_type')['pred_len'].mean().sort_values()
pred_str_len

In [None]:
#plot the graph for length of prediction string per type:
pred_str_len.plot(kind = 'barh', figsize = (10,5))
plt.xlabel('Prediction string length')
plt.title('Length of pred_string per discourse type')

In [None]:
#Below you can see a plot with the average positions of the discourse start and end.
data = train_df.groupby("discourse_type")[['discourse_end', 'discourse_start']].mean().reset_index().sort_values(by = 'discourse_start', ascending = False)
data.plot(x='discourse_type',
        kind='barh',
        stacked=False,
        title='Average start and end position absolute',
        figsize=(12,4))
plt.show()

1. Discourse texts have lengths from 691 to 18k symbols with most of them around 1-3k symbols.

2. Number of words in each discourse text is around 500-1000 on average, with some outliers.

3. The discourse types are unevenly distributed; Claim and Evidence are the most frequent.

4. 'Evidence' has the highest average number of words, followed by 'Concluding Statement'.

5. Is there a correlation between the length of a discourse element and its class (discourse_type)? 
Yes, there is. Evidence is the longest discourse type on average. Looking at the frequencies of occurrence, we see that Counterclaim and Rebuttal are relatively rare.

6. We also have the field discourse_type_num. We see that Evidence 1, Position 1 and Claim 1 are almost always present in an essay. Most students also had at least one Concluding Statement. What's surprising to me is that a Lead is missing in about 40% of the essays (Lead 1 is found in almost 60% of the essays).

7. We also compared the number of words in discourse_text with the number of word indices in predictionstring. As expected, the two counts match.
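
The "share of essays containing Lead 1" type of statement in point 6 can be computed as the number of distinct essays per discourse_type_num divided by the total number of essays. The sketch below shows this on a made-up toy frame (`toy_df` and its values are hypothetical); on the real data the same expression would use `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df: essay "A" has a lead, essay "B" does not.
toy_df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "discourse_type_num": ["Lead 1", "Position 1", "Evidence 1",
                           "Position 1", "Evidence 1"],
})

n_essays = toy_df["id"].nunique()

# Percentage of essays in which each discourse_type_num occurs at least once.
share = (
    toy_df.groupby("discourse_type_num")["id"].nunique() / n_essays * 100
).sort_values(ascending=False)
print(share)
```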



# A look at the discourse text annotation:

In [None]:
# Let's look at the first text and its annotation.
def print_text(text_id):
    with open(f'/kaggle/input/feedback-prize-2021/train/{text_id}.txt') as f:
        lines = f.readlines()
    print(''.join(lines))
    
print_text('423A1CA112E2')

In [None]:
# We can make annotations more clear if we print texts using different colors.
from termcolor import colored
def color_text(text_id, train_df, color_scheme = None):
    if not color_scheme:
        color_scheme = {
        'Lead': 'green',
        'Position': 'red',
        'Claim': 'blue',
        'Counterclaim': 'magenta',
        'Rebuttal': 'yellow',
        'Evidence': 'cyan',
        'Concluding Statement': 'grey'
    } 
    with open(f'/kaggle/input/feedback-prize-2021/train/{text_id}.txt') as f:
        lines = f.readlines()
    text = ''.join(lines)
    
    annot_df = train_df[train_df.id == text_id]
    blocks = [(int(row['discourse_start']),int(row['discourse_end']), color_scheme[row['discourse_type']]) for k, row in annot_df.iterrows()]
    blocks.sort()
    i = 0
    last_symbol = -1
    while i < len(blocks):
        if blocks[i][0] > last_symbol + 1:
            blocks.insert(i, (last_symbol+1, blocks[i][0] - 1, None))
        last_symbol = blocks[i][1]
        i += 1
    if last_symbol < len(text) - 1:
        blocks.append((last_symbol+1, len(text) - 1, None))

    colored_text = ''.join([colored(text[x[0]:x[1]+1], x[2]) for x in blocks])
    return colored_text
    
print(color_text('423A1CA112E2', train_df))

In [None]:
# let's try another essay:
print(color_text('6B4F7A0165B9', train_df))

# Wordcloud

Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.

We will write a simple and intuitive function plot_wordcloud that will help us plot wordclouds with ease.
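
Before plotting, it may help to see roughly what a word cloud summarizes. The sketch below counts word frequencies with a tiny, made-up stop-word set (`STOP` and `word_frequencies` are hypothetical names); the WordCloud library does its own tokenization and uses a much larger stop-word list, so its counts will differ.

```python
from collections import Counter

# Tiny illustrative stop-word set, not the list used by WordCloud.
STOP = {"the", "a", "is", "are", "to", "and"}


def word_frequencies(texts, stop_words=STOP):
    """Count lower-cased words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            word = word.strip(".,!?;:\"'")  # drop surrounding punctuation
            if word and word not in stop_words:
                counts[word] += 1
    return counts


freqs = word_frequencies([
    "Students help the school.",
    "The school is big and students learn.",
])
# Words with higher counts would be drawn larger in a word cloud.
print(freqs.most_common(3))
```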

In [None]:
# function to plot a word cloud:
def plot_wordcloud(column, title):
    
    """
    Function to Plot Wordcloud of given dataframe column.
    
    params: column(string): The Column of the DataFrame for plotting.
            title(string) : The Title of the Wordcloud.
    """
    # Define stopwords
    stopwords = set(STOPWORDS) 
    
    # Define the Wordcloud    
    wordcloud = WordCloud(width = 800, 
                          height = 800,
                          background_color ='black',
                          min_font_size = 10,
                          stopwords = stopwords).generate(' '.join(train_df[column])) 

    # Plot the WordCloud image                        
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.title('Wordcloud: ' + title, fontsize = 20)

    plt.show() 

In [None]:
#Most frequent words in 'discourse'
plot_wordcloud(column = 'discourse_text', title = 'Most frequent words in Discourse Texts')

# Most used words in different Discourse Types

In [None]:
train_df['discourse_text'] = train_df['discourse_text'].str.lower()

#get stopwords from nltk library
stop_english = stopwords.words("english")
other_words_to_take_out = ['school', 'students', 'people', 'would', 'could', 'many']
stop_english.extend(other_words_to_take_out)

#put dataframe of Top-10 words in dict for all discourse types
counts_dict = {}
for dt in train_df['discourse_type'].unique():
    df = train_df.query('discourse_type == @dt')
    text = df.discourse_text.apply(lambda x: x.split()).tolist()
    text = [item for elem in text for item in elem]
    df1 = pd.Series(text).value_counts().to_frame().reset_index()
    df1.columns = ['Word', 'Frequency']
    df1 = df1[~df1.Word.isin(stop_english)].head(10)
    df1 = df1.set_index("Word").sort_values(by = "Frequency", ascending = True)
    counts_dict[dt] = df1

plt.figure(figsize=(15, 12))
plt.subplots_adjust(hspace=0.5)

keys = list(counts_dict.keys())

for n, key in enumerate(keys):
    ax = plt.subplot(4, 2, n + 1)
    ax.set_title(f"Most used words in {key}")
    counts_dict[key].plot(ax=ax, kind = 'barh')
    plt.ylabel("")

plt.show()


# Unigram

Now we need to extract n-gram features. An n-gram is a sequence of n consecutive words treated as a single unit of observation: a unigram is a single word, a bigram is a two-word phrase, and a trigram is a three-word phrase. 

To do this, we use scikit-learn's CountVectorizer class.

First, it would be interesting to compare unigrams before and after removing stop words.
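
As a small illustration of how `ngram_range` and `stop_words` interact, the sketch below counts bigrams on a made-up two-sentence corpus (`ngram_counts` and the corpus are hypothetical). Note that CountVectorizer drops stop words before it builds n-grams, so the filtered bigrams can join words that were not adjacent in the original text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only.
corpus = ["driverless cars are safe", "driverless cars are cheap"]


def ngram_counts(corpus, n, stop_words=None):
    """Return {n-gram: total count over the corpus} for n-grams of size n."""
    vec = CountVectorizer(ngram_range=(n, n), stop_words=stop_words)
    matrix = vec.fit_transform(corpus)   # documents x n-grams count matrix
    totals = matrix.sum(axis=0)          # column sums = corpus-wide counts
    return {term: int(totals[0, idx]) for term, idx in vec.vocabulary_.items()}


raw_bigrams = ngram_counts(corpus, 2)
filtered_bigrams = ngram_counts(corpus, 2, stop_words="english")

print(raw_bigrams)       # includes bigrams such as 'cars are'
print(filtered_bigrams)  # 'are' is removed first, giving 'cars safe' etc.
```

This ordering is one plausible reason why some of the top n-grams after stop-word removal do not read like natural phrases.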




In [None]:
#The distribution of top unigrams before removing stop words in discourse_text:
def get_top_n_words_uni(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words_uni(train_df['discourse_text'], 20)
for word, freq in common_words:
    print(word, freq)


In [None]:
df1 = pd.DataFrame(common_words, columns = ['disctext' , 'count'])
df1 = df1.groupby('disctext').sum()['count'].sort_values(ascending=False)
df1.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 words in discourse texts before removing stop words')

In [None]:
#The distribution of top unigrams after removing stop words in discourse texts:
def get_top_n_words_uni_real(corpus, n=None):
    vec = CountVectorizer(stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words_real = get_top_n_words_uni_real(train_df['discourse_text'], 20)
for word, freq in common_words_real:
    print(word, freq)

In [None]:
df2 = pd.DataFrame(common_words_real, columns = ['disctext' , 'count'])
df2 = df2.groupby('disctext').sum()['count'].sort_values(ascending=False)
df2.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 words in discourse texts after removing stop words')

# Bigrams

Second, we want to compare bigrams before and after removing stop words.


In [None]:
#The distribution of top bigrams before removing stop words
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words_bi = get_top_n_bigram(train_df['discourse_text'], 20)
for word, freq in common_words_bi:
    print(word, freq)



In [None]:
df3 = pd.DataFrame(common_words_bi, columns = ['text' , 'count'])
df3 = df3.groupby('text').sum()['count'].sort_values(ascending=False)
df3.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 bigrams in discourse texts before removing stop words')


In [None]:
#The distribution of top bigrams after removing stop words

def get_top_n_bigram_real(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words_bi_real = get_top_n_bigram_real(train_df['discourse_text'], 20)
for word, freq in common_words_bi_real:
    print(word, freq)


In [None]:
df4 = pd.DataFrame(common_words_bi_real, columns = ['text' , 'count'])
df4 = df4.groupby('text').sum()['count'].sort_values(ascending=False)
df4.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 bigrams in discourse texts after removing stop words')

# Trigrams

Last, we compare trigrams before and after removing stop words.

In [None]:
#The distribution of Top trigrams before removing stop words
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words_tri = get_top_n_trigram(train_df['discourse_text'], 20)
for word, freq in common_words_tri:
    print(word, freq)



In [None]:
df5 = pd.DataFrame(common_words_tri, columns = ['text' , 'count'])
df5 = df5.groupby('text').sum()['count'].sort_values(ascending=False)
df5.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 trigrams in discourse texts before removing stop words')

In [None]:
#The distribution of Top trigrams after removing stop words
def get_top_n_trigram_real(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words_tri_real = get_top_n_trigram_real(train_df['discourse_text'], 20)
for word, freq in common_words_tri_real:
    print(word, freq)


In [None]:
df6 = pd.DataFrame(common_words_tri_real, columns = ['text' , 'count'])
df6 = df6.groupby('text').sum()['count'].sort_values(ascending=False)
df6.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 trigrams in discourse texts after removing stop words')

# Most popular unigrams, bigrams, and trigrams in discourse_text:

1. 'Student', 'people', 'school', 'help' are the top words in discourse texts (based on the word cloud and unigrams).
2. 'electoral college', 'driverless cars', 'cell phones', 'community service' are the most popular bigrams.
3. The most popular trigrams are 'facial action coding', 'attend classes home', 'limiting car usage'.


# Essay EDA:
Let's take a look at the 15594 essays in the train_text files.

In [None]:
# let's load all essays:

texts = []
for file in train_text_files :
    with open(f'/kaggle/input/feedback-prize-2021/train/{file}') as f:
        lines = f.readlines()
    texts.append({'id': file[:-4], 'text': ''.join(lines)})
texts_df = pd.DataFrame(texts)

In [None]:
texts_df.head()


In [None]:
texts_df.info()

In [None]:
# length of essays (in characters):
texts_df['len'] = texts_df['text'].apply(len)
texts_df['len'].hist(bins = 50, figsize = (10,6))
plt.title('Length of Essays')
print(texts_df['len'].min(), texts_df['len'].max())

In [None]:
# number of words in the essays (split on any whitespace, so newlines and repeated spaces are handled):
texts_df['words_num'] = texts_df['text'].apply(lambda x: len(x.split()))

In [None]:
texts_df['words_num'].hist(bins = 100, figsize = (10,6))
plt.title('Word Count distribution in the essays')

        

In [None]:
print('Minimum no. of words {} and max no. of words {} in the Essays'.format(texts_df['words_num'].min(), texts_df['words_num'].max()))
print('Minimum length of essays is {} and maximum length of essays is {}'.format(texts_df['len'].min(), texts_df['len'].max()))      

In [None]:
# function to plot a word cloud for Essays:
def plot_wordcloud_essay(column, title):
    
    """
    Function to Plot Wordcloud of given dataframe column.
    
    params: column(string): The Column of the DataFrame for plotting.
            title(string) : The Title of the Wordcloud.
    """
    # Define stopwords
    stopwords = set(STOPWORDS) 
    
    # Define the Wordcloud    
    wordcloud = WordCloud(width = 800, 
                          height = 800,
                          background_color ='black',
                          min_font_size = 10,
                          stopwords = stopwords).generate(' '.join(texts_df[column])) 

    # Plot the WordCloud image                        
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.title('Wordcloud: ' + title, fontsize = 20)

    plt.show() 

In [None]:
# Word cloud of the Essays:
plot_wordcloud_essay(column = 'text', title = 'Most frequent words in the Essays')

In [None]:
#UNIGRAM of Essays without removing stop words:
common_words_ess = get_top_n_words_uni(texts_df['text'], 20)
for word, freq in common_words_ess:
    print(word, freq)


In [None]:
df_ess_1 = pd.DataFrame(common_words_ess, columns = ['text' , 'count'])
df_ess_1 = df_ess_1.groupby('text').sum()['count'].sort_values(ascending=False)
df_ess_1.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 words in Essays before removing stop words')

In [None]:
#The distribution of top unigrams after removing stop words in Essays:

common_words_real_ess = get_top_n_words_uni_real(texts_df['text'], 20)
for word, freq in common_words_real_ess:
    print(word, freq)

In [None]:
df_ess_2 = pd.DataFrame(common_words_real_ess, columns = ['text' , 'count'])
df_ess_2 = df_ess_2.groupby('text').sum()['count'].sort_values(ascending=False)
df_ess_2.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 words in Essays after removing stop words')

In [None]:
#The distribution of top bigrams in Essays after removing stop words

common_words_bi_real_ess = get_top_n_bigram_real(texts_df['text'], 20)
for word, freq in common_words_bi_real_ess:
    print(word, freq)

In [None]:
df_ess_3 = pd.DataFrame(common_words_bi_real_ess, columns = ['text' , 'count'])
df_ess_3 = df_ess_3.groupby('text').sum()['count'].sort_values(ascending=False)
df_ess_3.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 bigrams in Essays after removing stop words')

In [None]:
#The distribution of Top trigrams in Essays after removing stop words
common_words_tri_real_ess = get_top_n_trigram_real(texts_df['text'], 20)
for word, freq in common_words_tri_real_ess:
    print(word, freq)

In [None]:
df_ess_4 = pd.DataFrame(common_words_tri_real_ess, columns = ['text' , 'count'])
df_ess_4 = df_ess_4.groupby('text').sum()['count'].sort_values(ascending=False)
df_ess_4.plot(kind = 'bar', figsize = (10,5))
plt.ylabel('count')
plt.title('Top 20 trigrams in Essays after removing stop words')

# Most popular unigrams, bigrams, and trigrams in discourse_text:

1. 'Student', 'people', 'school', 'help' are the top words in discourse texts (based on the word cloud and unigrams).
2. 'electoral college', 'driverless cars', 'cell phones', 'community service' are the most popular bigrams.
3. The most popular trigrams are 'facial action coding', 'attend classes home', 'limiting car usage'.

# Most popular unigrams, bigrams, and trigrams in Essays:

1. 'Student', 'people', 'school', 'help' are the top words in Essays (based on the word cloud and unigrams).
2. 'electoral college', 'driverless cars', 'cell phones', 'community service' are the most popular bigrams.
3. The most popular trigrams are 'facial action coding', 'attend classes home', 'limiting car usage'.

The two sets are similar in every respect. This is expected, since the discourse texts are annotated segments of the essays.
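
One way to quantify how closely the discourse texts track the essays is to measure what fraction of each essay's characters falls inside an annotated span. The sketch below does this on made-up toy data (`essays`, `toy_df`, and all offsets are hypothetical) and assumes that discourse_start and discourse_end are character offsets and that spans within an essay do not overlap.

```python
import pandas as pd

# Toy essays of known length and toy annotation spans (character offsets).
essays = {"A": "x" * 100, "B": "y" * 200}
toy_df = pd.DataFrame({
    "id": ["A", "A", "B"],
    "discourse_start": [0, 50, 20],
    "discourse_end": [40, 100, 180],
})

# Annotated characters per essay, assuming non-overlapping spans.
toy_df["span_len"] = toy_df["discourse_end"] - toy_df["discourse_start"]
covered = toy_df.groupby("id")["span_len"].sum()

# Essay lengths, indexed by the same essay ids so the division aligns.
essay_len = pd.Series({essay_id: len(text) for essay_id, text in essays.items()})

coverage = covered / essay_len
print(coverage)
```

A coverage close to 1 for most essays would be consistent with the observation that the n-gram statistics of the two sources look alike.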