**Content:**
* [Load all dependencies we need](#section-two)
* [EDA](#section-three)

<a id="section-two"></a>
# Load all dependencies we need

In [None]:
import string
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from plotly import graph_objs as go
from collections import Counter
import plotly.express as px
import seaborn as sns
import scipy as sp
import re
import csv
from bs4 import BeautifulSoup
from nltk.corpus import stopwords


<a id="section-three"></a>
# EDA

In [None]:
train             = pd.read_csv('../input/feedback-prize-effectiveness/train.csv')
test              = pd.read_csv('../input/feedback-prize-effectiveness/test.csv')
sample_submission = pd.read_csv('../input/feedback-prize-effectiveness/sample_submission.csv')

print(train.shape)
print(test.shape)

So we have 36765 samples in the train set and 10 samples in the test set.

In [None]:
train.info()

There are no null values in either the train set or the test set.
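
The cell above calls `info()` on the train set only, so here is a minimal sketch of a per-column null count that could be run on both frames. The helper name `null_report` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def null_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return df.isnull().sum()

# Toy frame standing in for train/test (hypothetical data)
toy = pd.DataFrame({
    "discourse_id": ["a", "b", "c"],
    "discourse_text": ["Hello", None, "World"],
})
report = null_report(toy)
print(report)
```

In the notebook this would presumably be called as `null_report(train)` and `null_report(test)`.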

In [None]:
train.head()

**Let's look at the distribution of discourse_effectiveness in the train set**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='discourse_effectiveness',data=train)

Let's draw a funnel chart to see the proportions more clearly.

In [None]:
# Count rows per class and sort from most to least frequent
temp = (train.groupby('discourse_effectiveness').count()['discourse_id']
        .reset_index().sort_values(by='discourse_id', ascending=False))
fig = go.Figure(go.Funnelarea(
    text=temp.discourse_effectiveness,
    values=temp.discourse_id,
    title={"position": "top center", "text": "Funnel-Chart of discourse_effectiveness Distribution"}
))
fig.show()

Now let's look at the distribution of discourse_type in the train set

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='discourse_type',data=train)

In [None]:
# Count rows per discourse type and sort from most to least frequent
temp = (train.groupby('discourse_type').count()['discourse_id']
        .reset_index().sort_values(by='discourse_id', ascending=False))
fig = go.Figure(go.Funnelarea(
    text=temp.discourse_type,
    values=temp.discourse_id,
    title={"position": "top center", "text": "Funnel-Chart of discourse_type Distribution"}
))
fig.show()

So the dataset is quite imbalanced.
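
To put a number on the imbalance rather than reading it off the charts, one option is to compute the share of each class. The sketch below is an illustration only: the helper name `class_shares` and the toy counts are assumptions, not values taken from the dataset.

```python
import pandas as pd

def class_shares(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the fraction of rows in each category, largest first."""
    return df[column].value_counts(normalize=True)

# Toy frame with made-up class counts (hypothetical data)
toy = pd.DataFrame({
    "discourse_effectiveness": ["Adequate"] * 3 + ["Effective"] * 1,
})
shares = class_shares(toy, "discourse_effectiveness")
print(shares)
```

In the notebook the same helper could be applied to `train` for both `discourse_effectiveness` and `discourse_type`.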

Now let's attach the full essay text from the .txt files to the matching rows of the CSV data

In [None]:
%%time
essay_dir = '../input/feedback-prize-effectiveness/train/'
filenames = os.listdir(essay_dir)
train['essay'] = ''
for file in filenames:
    # Read the essay; the context manager closes the file handle
    with open(essay_dir + file) as raw_html:
        cleantext = BeautifulSoup(raw_html, "lxml").text
    # Collapse runs of whitespace into single spaces
    output = re.sub(r'\s+', ' ', cleantext)
    # file[:-4] strips the '.txt' extension to recover the essay_id
    train.loc[train['essay_id'] == file[:-4], 'essay'] = output
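
The loop above filters the whole frame once per file. A possible alternative, sketched here under assumptions, is to read all files into a dictionary first and then fill the column with a single `map` call. The helper name `load_essays`, the temporary folder, and the toy ids are hypothetical. This sketch also skips the BeautifulSoup step and reads the files as plain UTF-8 text.

```python
import os
import re
import tempfile

import pandas as pd

def load_essays(folder: str) -> dict:
    """Read every .txt file in `folder` into a dict keyed by the file name without '.txt'."""
    essays = {}
    for name in os.listdir(folder):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            # Collapse runs of whitespace into single spaces
            essays[name[:-4]] = re.sub(r"\s+", " ", f.read())
    return essays

# Toy demonstration with a temporary folder (hypothetical ids and text)
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "E1.txt"), "w", encoding="utf-8") as f:
        f.write("First   essay.\nSecond line.")
    toy = pd.DataFrame({"essay_id": ["E1", "E1", "E2"]})
    # Rows whose essay_id has no file get an empty string
    toy["essay"] = toy["essay_id"].map(load_essays(folder)).fillna("")

print(toy)
```

Building the dictionary touches each file once, and `map` then assigns all rows in one pass instead of building a boolean mask per file.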

In [None]:
train.head()

In [None]:
print(f"Number of unique values in discourse_text column: {train.discourse_text.nunique()} in train set")
print(f"Number of unique values in essay column: {train.essay.nunique()} in train set")

So discourse_text contains just 36765 - 36691 = 74 duplicate values.
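
The subtraction above counts surplus rows. To see which texts actually repeat, `duplicated` can be used. This is a minimal sketch, assuming a helper named `duplicate_summary` and a toy frame, both of which are illustrative and not from the original notebook.

```python
import pandas as pd

def duplicate_summary(df: pd.DataFrame, column: str):
    """Return (rows minus unique values, all rows whose value occurs more than once)."""
    n_extra = len(df) - df[column].nunique()
    repeated = df[df[column].duplicated(keep=False)]
    return n_extra, repeated

# Toy frame with one text repeated three times (hypothetical data)
toy = pd.DataFrame({
    "discourse_text": ["I agree.", "Cars pollute.", "I agree.", "I agree."],
})
n_extra, repeated = duplicate_summary(toy, "discourse_text")
print(n_extra)
print(repeated)
```

Note that `keep=False` marks every occurrence of a repeated value, so `repeated` can contain more rows than `n_extra`.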

**Now let's look at how many words we have in discourse_text and essay**

In [None]:
%%time
# Count whitespace-separated words; len() on the raw string would count characters
len_essay          = train['essay'].str.split().str.len().tolist()
len_discourse_text = train['discourse_text'].str.split().str.len().tolist()

plt.hist(len_essay)

In [None]:
plt.hist(len_discourse_text)

In [None]:
print(f'The shortest essay has {min(len_essay)} words')
print(f'The longest essay has {max(len_essay)} words')
print(f'The mean number of words per essay is {np.mean(len_essay)}')
print()
print(f'The shortest discourse_text has {min(len_discourse_text)} words')
print(f'The longest discourse_text has {max(len_discourse_text)} words')
print(f'The mean number of words per discourse_text is {np.mean(len_discourse_text)}')

So I guess there are some outliers in our data that we should take care of.
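
One common rule of thumb for flagging such outliers is Tukey's 1.5 × IQR fence. The sketch below shows that rule on made-up lengths. It is an assumption about how the outliers might be identified, not necessarily the approach this notebook goes on to use, and the name `iqr_outlier_bounds` is hypothetical.

```python
import numpy as np

def iqr_outlier_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR below Q1 and above Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up lengths with one obviously extreme value (hypothetical data)
toy_lengths = [10, 12, 11, 13, 12, 11, 200]
low, high = iqr_outlier_bounds(toy_lengths)
outliers = [v for v in toy_lengths if v < low or v > high]
print(low, high, outliers)
```

In the notebook, `len_essay` or `len_discourse_text` could be passed in place of `toy_lengths`.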