> **Hello everyone! I am more experienced in R than in Python, so if you spot any mistakes, I would appreciate your advice. Thank you!**

In [None]:
import numpy as np
import pandas as pd
from glob import glob
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
from matplotlib.ticker import FuncFormatter
from nltk.corpus import stopwords
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')
import spacy
from sklearn.feature_extraction.text import CountVectorizer
import os
import random
from collections import Counter
import re

In [None]:
train=pd.read_csv("../input/feedback-prize-2021/train.csv")
train[["discourse_id",'discourse_start','discourse_end']]=train[["discourse_id",'discourse_start','discourse_end']].astype(int)
sample_submission=pd.read_csv("../input/feedback-prize-2021/sample_submission.csv")
train_txt = glob('../input/feedback-prize-2021/train/*.txt') 
test_txt = glob('../input/feedback-prize-2021/test/*.txt')

## 1. Basic information

**Field information**

- **id** - ID code for essay response
- **discourse_id** - ID code for discourse element
- **discourse_start** - character position where discourse element begins in the essay response
- **discourse_end** - character position where discourse element ends in the essay response
- **discourse_text** - text of discourse element
- **discourse_type** - classification of discourse element
    - *Lead - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis*
    - *Position - an opinion or conclusion on the main question*
    - *Claim - a claim that supports the position*
    - *Counterclaim - a claim that refutes another claim or gives an opposing reason to the position*
    - *Rebuttal - a claim that refutes a counterclaim*
    - *Evidence - ideas or examples that support claims, counterclaims, or rebuttals.*
    - *Concluding Statement - a concluding statement that restates the claims*
- **discourse_type_num** - enumerated class label of discourse element
- **predictionstring** - the word indices of the training sample, as required for predictions
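
The link between the character offsets and `predictionstring` can be illustrated with a small sketch. It assumes that word indices come from splitting the essay on whitespace, which is a common reading of the field description rather than something the field list states. The helper name `span_to_predictionstring` and the sample sentence are made up for illustration.

```python
def span_to_predictionstring(text, start, end):
    """Convert a character span into space-separated word indices.

    Assumption: words are produced by text.split(), and the span starts
    and ends on word boundaries.
    """
    first_word = len(text[:start].split())   # number of words before the span
    n_words = len(text[start:end].split())   # number of words inside the span
    return " ".join(str(i) for i in range(first_word, first_word + n_words))


essay = "Phones are useful. They help people connect."
start = essay.index("They")
print(span_to_predictionstring(essay, start, len(essay)))  # 3 4 5 6
```

If a span begins in the middle of a word, this simple count would be off by one, so real annotations may need extra care.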

In [None]:
train["discourse_text"][2].split(" ")

In [None]:
len(train["discourse_text"][1].split(" "))

In [None]:
train.head()

In [None]:
train.info()

## 2. Data Exploration

### 2.1 Essay id

**QUESTION 1**: 

Do the article ids in train.csv match the file names of the articles in the train folder exactly?

In [None]:
res = [item.replace('../input/feedback-prize-2021/train/', '') for item in train_txt]
res = [item.replace('.txt', '') for item in res]

In [None]:
len(list(set(train["id"]) & set(res)))==len(res)

In [None]:
len(res)==train["id"].nunique()

In [None]:
train["id"].nunique()

**ANSWER 1:**

YES.

The training set contains 15594 articles.
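
As a possible alternative to the string replacement and length comparison, the same check can be sketched with `os.path` and set operations. The paths and ids below are hypothetical stand-ins for `train_txt` and `train["id"]`, and `essay_ids_from_paths` is an illustrative helper, not part of the notebook.

```python
import os


def essay_ids_from_paths(paths):
    """Return the set of file names without directory or extension."""
    return {os.path.splitext(os.path.basename(p))[0] for p in paths}


# Hypothetical paths and ids, standing in for train_txt and train["id"].
paths = ["../input/train/A1.txt", "../input/train/B2.txt"]
csv_ids = ["A1", "A1", "B2"]

file_ids = essay_ids_from_paths(paths)
print(file_ids == set(csv_ids))   # True when both sides hold the same ids
print(set(csv_ids) - file_ids)    # ids without a text file
print(file_ids - set(csv_ids))    # text files without annotations
```

Comparing the sets directly also shows which ids are missing on either side, which a length comparison alone cannot do.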

### 2.2 discourse_type & discourse_num

**QUESTION 2.1**: 

Does every article contain these 7 discourse types?

*7 discourse types:*

In [None]:
train["discourse_type"].unique()

In [None]:
df=train[["id","discourse_type"]].value_counts().rename_axis(["id","discourse_type"]).reset_index(name='counts')
df=df["id"].value_counts().rename_axis(["id"]).reset_index(name='counts')
df=df["counts"].value_counts().rename_axis(["discourse_num"]).reset_index(name='essay_num')

In [None]:
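# function to add value labels above each bar
def addlabels(x,y):
    # enumerate gives the bar position, so the labels stay aligned
    # even when y keeps a non-default index after sorting
    for i, value in enumerate(y):
        plt.text(i,value,value, ha = 'center')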

In [None]:
plt.figure(figsize= (15, 10))

df["discourse_num"]=df["discourse_num"].apply(str)
df= df.sort_values('essay_num',ascending=False)

plt.bar('discourse_num', 'essay_num',data=df)
addlabels(df["discourse_num"], df["essay_num"])
plt.xlabel("discourse_type_num", size=12)
plt.ylabel("essay_num", size=15)

plt.show()

In [None]:
del df

**ANSWER 2.1:**

NO.

We have 15594 articles. As the figure above shows, not every article contains all 7 discourse types. 102 articles contain only one type, 6017 articles contain 5 types, and less than 15% of the articles contain all 7 types.
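
The counting behind this figure can be sketched in plain Python on a handful of made-up rows: collect the distinct discourse types per article, then count how many articles have each number of types. The ids and rows are invented for illustration only.

```python
from collections import Counter, defaultdict

# Made-up (essay id, discourse_type) rows for illustration.
rows = [
    ("A1", "Lead"), ("A1", "Claim"), ("A1", "Claim"),
    ("B2", "Claim"),
    ("C3", "Position"), ("C3", "Evidence"),
]

# Distinct discourse types seen in each essay.
types_per_essay = defaultdict(set)
for essay_id, discourse_type in rows:
    types_per_essay[essay_id].add(discourse_type)

# Map "number of distinct types" to "number of essays".
essays_by_type_count = Counter(len(t) for t in types_per_essay.values())
print(sorted(essays_by_type_count.items()))  # [(1, 1), (2, 2)]
```

On the real DataFrame, `train.groupby("id")["discourse_type"].nunique().value_counts()` should give the same kind of table in one line.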

**QUESTION 2.2**: 

How many times does each discourse type appear in an article?

In [None]:
temp=train[["id","discourse_type"]].value_counts().rename_axis(["id","discourse_type"]).reset_index(name='counts')
discourse_types=temp["discourse_type"].unique()

plt.figure(figsize=(15, 12))
plt.subplots_adjust(hspace=0.5)

for i in range(len(discourse_types)):
    df=temp[temp["discourse_type"]==discourse_types[i]][["counts"]].value_counts().rename_axis(discourse_types[i]+"_num").reset_index(name='essay_num')
    #plt.figure(figsize= (15, 10))
    df[discourse_types[i]+"_num"]=df[discourse_types[i]+"_num"].apply(str)
    df= df.sort_values('essay_num',ascending=False)
    ax = plt.subplot(4, 2, i + 1)
    plt.bar(discourse_types[i]+"_num", 'essay_num',data=df)
    addlabels(df[discourse_types[i]+"_num"], df["essay_num"])
    plt.xlabel(discourse_types[i]+"_num",size=12)
    plt.ylabel('essay_num',size=15)

plt.show()

**ANSWER 2.2:**

These figures show that an article usually contains several "Claim" and "Evidence" elements, but only a small number of "Concluding Statement", "Position" and "Lead" elements.

### 2.3 predictionstring

**QUESTION 3**: 

Is every part of an article marked? In other words, do the predictionstrings of an article cover consecutive word indices without gaps?

A single counter-example is enough to answer this question.

We randomly select an article:

In [None]:
random.seed(2022)
essayID=random.choice(train["id"].unique())

In [None]:
one_essay=train[train["id"]==essayID]
one_essay.head()

In [None]:
predictionstring=np.asarray(one_essay.predictionstring.str.cat(sep=' ').split(" ")).astype(int)
Counter(np.diff(predictionstring))

**ANSWER 3:**

NO. The differences between consecutive word indices are not all 1, so some words in this article are not covered by any discourse element.

*Next, let's see which parts of the article are not marked.*


In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
# Positions where the next word index is not consecutive
gaps=np.where(np.diff(predictionstring)!=1)[0]
startPos=predictionstring[gaps]    # last marked word before each gap
endPos=predictionstring[gaps+1]    # first marked word after each gap

In [None]:
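# Read the essay; the context manager closes the file automatically
with open("../input/feedback-prize-2021/train/"+essayID+".txt","r") as file:
    essaytxt=file.read()
# Split on spaces and newlines
splitEssay=re.split(r" |\n",essaytxt)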

res = [x for x in splitEssay if x.strip()]
len(res)==max(predictionstring)+1

*The following parts of the article are not marked:*

In [None]:
for i in range(len(startPos)):
    print(' '.join(res[(startPos[i]+1):(endPos[i])]))
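
The gap-finding steps above can be condensed into one helper, sketched here in plain Python on made-up predictionstrings. The name `find_unmarked_spans` is hypothetical. Unlike the cells above, it sorts and de-duplicates the indices first, which is an assumption about how overlapping or unordered annotations should be treated.

```python
def find_unmarked_spans(prediction_strings):
    """Return (first, last) word-index pairs that no annotation covers.

    Only gaps between annotated words are reported; unmarked words before
    the first or after the last annotated index are not detected.
    """
    covered = sorted({int(i) for s in prediction_strings for i in s.split()})
    gaps = []
    for prev, nxt in zip(covered, covered[1:]):
        if nxt - prev > 1:
            gaps.append((prev + 1, nxt - 1))
    return gaps


# Made-up predictionstrings: words 3-4 and word 8 are not annotated.
print(find_unmarked_spans(["0 1 2", "5 6 7", "9 10"]))  # [(3, 4), (8, 8)]
```

Each returned pair can be used to slice the list of essay words, as in `words[first:last + 1]`.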

### 2.4 discourse_text & discourse_type

What is the distribution of text length for each discourse type? That is, does the length of the text differ between discourse types?

In [None]:
train["discourse_text_len"]= train["discourse_text"].apply(lambda x: len(x.split()))

In [None]:
import seaborn as sns

plt.figure(figsize=(15, 12))
# Overlay one histogram per discourse type, using 20-word bins up to 400 words.
# histplot is used because distplot is deprecated in recent seaborn versions.
for col in train["discourse_type"].unique():
    sns.histplot(train.loc[train["discourse_type"]==col, "discourse_text_len"], label=col,
                 bins=list(range(0, 401, 20)),
                 alpha=0.4, edgecolor='black')

plt.xlabel('discourse_text_len',size=12)
plt.ylabel('frequency',size=15)
plt.legend()
plt.xticks(range(0, 401, 20))
plt.show()

In general, the text length distribution differs clearly between discourse types. As the figure above shows, the text lengths of 'Evidence' are spread fairly evenly over a wide range, while the text lengths of 'Claim' are clearly right-skewed, with most texts being short.
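
A numeric summary can complement the histogram. The sketch below computes the median and maximum word count per discourse type on a few made-up texts. The sample sentences are invented, and the statistics chosen here are only one possible way to compare the groups.

```python
from collections import defaultdict
from statistics import median

# Made-up (discourse_type, discourse_text) pairs for illustration.
samples = [
    ("Claim", "Phones distract drivers"),
    ("Claim", "Texting is dangerous"),
    ("Claim", "Laws reduce accidents on the road"),
    ("Evidence", "A study found that drivers who text react more slowly"),
    ("Evidence", "My cousin crashed while reading a message"),
]

# Word counts grouped by discourse type, using the same split() rule
# as the discourse_text_len column.
lengths = defaultdict(list)
for discourse_type, text in samples:
    lengths[discourse_type].append(len(text.split()))

for discourse_type, values in lengths.items():
    print(discourse_type, "median:", median(values), "max:", max(values))
```

A median well below the maximum, as for 'Claim' here, is one simple sign of right skew.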

**To be continued. Please stay tuned!**