# **EDA Simplified: Feedback Prize, Effectiveness (BETA)**

## Introduction
Having done EDA on the first "Feedback Prize" competition, we're back for its sequel! Instead of analyzing the argumentative elements in essays by students in grades 6-12, as in the first competition, this time we rate the effectiveness of those argumentative elements. And as always, let's use EDA to analyze the data of the second "Feedback Prize" competition!

Before we head on, be sure to read my first "EDA Simplified" notebook on the first Feedback Prize competition:
https://www.kaggle.com/code/dinowun/eda-simplified-feedback-prize/notebook

And, let's get going!

## Imports and File Setup
First things first, let's import the necessary modules and set up the files we'll analyze! We start with the file and directory utilities: the os module, glob from the glob module, and tqdm from the tqdm.notebook submodule. Then we import the text analysis modules: spacy, wordcloud, and stylecloud, which we first need to install with pip. Next come the data handling and linear algebra modules, pandas as pd and numpy as np. Finally, we import the plotting modules: plotly's express and graph_objects submodules as px and go, and matplotlib's pyplot submodule as plt.

In [None]:
# Files, Directories, and Computer
import os
from glob import glob
from tqdm.notebook import tqdm

# Text Analysis
import spacy
import wordcloud

!pip3 install stylecloud
import stylecloud

# ML and Linear Algebra
import pandas as pd
import numpy as np

# Plotting Modules
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

After importing the modules, let's set up the CSVs and directories! First, we create two dataframes, train and sub, by calling pd.read_csv on the paths of the two CSV files. Next, we define two variables, TRAIN_DIR and TEST_DIR, holding the paths of the train and test directories, and two more, train_files and test_files, holding the result of os.listdir, which lists the file names in each directory. Finally, we loop over the indices of each list with range and len and replace every file name with its full path: the directory, a slash, and the file name converted to a string with str.

In [None]:
train = pd.read_csv("../input/feedback-prize-effectiveness/train.csv")
sub = pd.read_csv("../input/feedback-prize-effectiveness/sample_submission.csv")

TRAIN_DIR = "../input/feedback-prize-effectiveness/train"
TEST_DIR = "../input/feedback-prize-effectiveness/test"
train_files = os.listdir(TRAIN_DIR)
test_files = os.listdir(TEST_DIR)

for file in range(len(train_files)):
    train_files[file] = TRAIN_DIR + "/" + str(train_files[file])
    
for file in range(len(test_files)):
    test_files[file] = TEST_DIR + "/" + str(test_files[file])
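The same path assembly can also be written without index loops. Here is a minimal sketch, assuming a hypothetical helper named list_full_paths (it is not part of the original notebook) and a throwaway temporary directory standing in for the competition folders:

```python
import os
import tempfile


def list_full_paths(directory):
    """Return the sorted full paths of every entry in `directory`."""
    # os.path.join inserts the path separator, so no manual "/" is needed.
    return sorted(os.path.join(directory, name) for name in os.listdir(directory))


# Demonstration on a temporary directory rather than the competition data.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.txt", "a.txt"):
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("sample essay text")
    demo_paths = list_full_paths(demo_dir)
    demo_names = [os.path.basename(path) for path in demo_paths]
    print(demo_names)
```

Sorting is optional; it is added here only because os.listdir returns names in arbitrary order.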

Alternatively, we can use the glob module, which finds all pathnames matching a pattern according to the rules used by the Unix shell. We define two variables, train_txt and test_txt, by calling glob with a pattern that matches every .txt file in each directory.

In [None]:
# Note the "../" prefix: without the slash, the pattern matches nothing
train_txt = glob("../input/feedback-prize-effectiveness/train/*.txt")
test_txt = glob("../input/feedback-prize-effectiveness/test/*.txt")

With the modules imported and the files set up, let's head on to our famous exploratory data analysis!

## EDA
To start our EDA, let's find out how many train and test files there are by printing the length of train_files and test_files with the len function.

In [None]:
print("Number of train files: ", len(train_files))
print("Number of test files: ", len(test_files))

Unlike the first "Feedback Prize" competition, this one has 4,191 train files and a single test file. Next, let's look at one of the train essays an anonymous student wrote!

### The Sample Analysis of One Train Essay
To look at a sample train essay, we open one entry of train_files (any index works) in read mode, "r", with the open function and assign the file object to f. We then print the essay's contents with f.read().

In [None]:
f = open(train_files[3], "r")
print(f.read())
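The cell above leaves the file open after reading. A common alternative is a with block, which closes the file automatically. The sketch below assumes a hypothetical helper named read_essay and UTF-8 encoded files (neither comes from the original notebook), and it reads a temporary file instead of a real essay:

```python
import os
import tempfile


def read_essay(path):
    """Read one essay file; the with block closes it when done."""
    # The utf-8 encoding is an assumption about the essay files.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Demonstration on a temporary file rather than a competition essay.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "essay.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("sample essay text")
    essay_text = read_essay(demo_path)
    print(essay_text)
```

In the notebook, the same helper could be called as read_essay(train_files[3]).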

The train essay we printed argues that eighth graders should be allowed to bring and use cell phones at all times except during class. But what about the single test essay? Let's move on.

### The Sample Analysis of One Test Essay
Since this competition provides only one test essay, let's analyze it too! The code is the same as for the train essay, except that f opens test_files at index 0, the only test essay.

In [None]:
f = open(test_files[0], "r")
print(f.read())

After running the code above, we see that an anonymous student wrote about how difficult making choices in life can be. Now, let's proceed to the basics of the tabular train dataframe!

### Tabular Train Dataframe
Let's start by taking a peek at the columns of the tabular train dataframe!
- **discourse_id**: The ID code of a discourse element.
- **essay_id**: The ID code of an essay.
- **discourse_text**: The text of the discourse element.
- **discourse_type**: The classification of the discourse element.
- **discourse_effectiveness**: The effectiveness rating of the discourse element.

### Analysis of Train Dataframe
Let's now take a look at the train dataframe! All we need to do is call the head function on the train dataframe to display its first 5 rows.

In [None]:
train.head()

After running the code cell above, we can see the columns described earlier stored in the train dataframe! But what about the rows of one specific essay? For that, we call the query function on the train dataframe with a condition on essay_id that matches the essay's ID.

In [None]:
train.query('essay_id == "00944C693682"')

In the rows displayed for this essay, most discourse elements are rated Effective, while some are Adequate.
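The query call is one of two common ways to filter rows in pandas; the other is a boolean mask. The sketch below compares the two on a small made-up dataframe (the IDs and labels are invented for illustration and are not competition data):

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "essay_id": ["A1", "A1", "B2"],
    "discourse_type": ["Lead", "Claim", "Evidence"],
    "discourse_effectiveness": ["Adequate", "Effective", "Ineffective"],
})

# Filter with a query string, as in the cell above.
by_query = toy.query('essay_id == "A1"')

# Filter with a boolean mask; this selects the same rows.
by_mask = toy[toy["essay_id"] == "A1"]

print(by_query.equals(by_mask))
```

The mask form is handy when the ID is stored in a Python variable, while query keeps short conditions readable.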

### Calculation and Distribution of the Discourse Length
To find the length of each discourse text, we add a discourse_len column to the train dataframe: we take the discourse_text column and call apply with a lambda that returns len(x), the number of characters in each text.

In [None]:
train["discourse_len"] = train["discourse_text"].apply(lambda x: len(x))
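Note that len on a string counts characters, not words. The sketch below shows both measures side by side on made-up texts; the discourse_words column is a hypothetical addition that the original notebook does not compute:

```python
import pandas as pd

# Toy texts, invented for illustration only.
toy = pd.DataFrame({"discourse_text": ["Phones are useful.", "I disagree"]})

# Character count: equivalent to apply(lambda x: len(x)).
toy["discourse_len"] = toy["discourse_text"].str.len()

# Word count: split on whitespace, then count the pieces.
toy["discourse_words"] = toy["discourse_text"].str.split().str.len()

print(toy)
```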

Now let's display the rows of the train dataframe again with the head function!

In [None]:
train.head()

After running the code cell above, we can see that the discourse_len column has been added to the train dataframe. Next, let's find out the number of discourses by printing the length of the train dataframe with the len function.

In [None]:
print("Total No. of Discourses: ", len(train))

There are 36,765 discourses in the train dataframe. But how are their lengths distributed? Let's find out by plotting with Plotly!

To plot the distribution of discourse_len, we define a variable called fig by calling px.histogram with four parameters: data_frame set to the train dataframe, x (the x-axis input) set to discourse_len, marginal (the extra distribution plot drawn above the histogram) set to violin, and nbins (the number of bins) set to 400. We then call update_layout on fig, setting the template parameter to presentation (any of Plotly's built-in templates works). Finally, we display the figure by calling show on fig.

In [None]:
fig = px.histogram(data_frame=train, x="discourse_len", marginal="violin", nbins=400)
fig.update_layout(template="presentation")
fig.show()

Per the graph, the most common discourse length is roughly 79 to 80 characters. Next, we plot the discourse types and the effectiveness labels, still with Plotly.
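As a numeric companion to the histogram, a few summary statistics can describe the same distribution. The sketch below uses made-up lengths (not values from the competition data) to show how one long discourse pulls the mean above the median:

```python
import pandas as pd

# Hypothetical lengths, invented for illustration only.
toy_lengths = pd.Series([10, 20, 30, 40, 200], name="discourse_len")

median_len = toy_lengths.median()      # middle value, robust to outliers
mean_len = toy_lengths.mean()          # pulled upward by the long tail
p90_len = toy_lengths.quantile(0.9)    # 90th percentile, linear interpolation

print(median_len, mean_len, p90_len)
```

On the real data, the same three calls could be applied to train["discourse_len"].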

### Data Distribution (Discourse and Effectiveness)

To plot the number of rows for each discourse type, we define fig by calling px.bar. The x parameter is the array of unique discourse_type values returned by np.unique; the y parameter is a list that, for each unique value i, counts how often i occurs in the discourse_type column using the list count function; the color parameter gets the same unique values as x; and color_continuous_scale is set to one of Plotly's built-in continuous color scales.

Next, we update the figure's x and y axes by calling update_xaxes and update_yaxes on fig, setting the titles to Discourses (x) and No. of Rows (y). We then call update_layout on fig, setting showlegend to True, title to a dictionary that configures the title's text and position, and template to one of Plotly's built-in templates. Finally, we display the figure by calling show on fig.

In [None]:
fig = px.bar(x=np.unique(train["discourse_type"]), y=[list(train["discourse_type"]).count(i) for i in np.unique(train["discourse_type"])], color=np.unique(train["discourse_type"]), color_continuous_scale="Mint")
fig.update_xaxes(title="Discourses")
fig.update_yaxes(title="No. of Rows")
fig.update_layout(showlegend=True,
                  title={
                      'text': 'Discourse Types',
                      'y': 0.95,
                      'x': 0.5,
                      'xanchor': 'center',
                      'yanchor': 'top'}, template="seaborn")
fig.show()

As you can see, Evidence is the most common discourse type, with Claim nearly as common, while Rebuttal is the least common.
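The per-type counts behind the bar chart can also be computed with the pandas value_counts function instead of the list count function. A minimal sketch on made-up labels (the toy column below is invented and does not reflect the real class balance):

```python
import pandas as pd

# Toy labels, invented for illustration only.
toy = pd.DataFrame({
    "discourse_type": ["Claim", "Evidence", "Claim", "Lead", "Evidence", "Claim"]
})

# Count rows per type; sort_index orders the labels alphabetically like np.unique.
counts = toy["discourse_type"].value_counts().sort_index()

# The list-based counting used in the cell above gives the same numbers.
by_list_count = {
    label: list(toy["discourse_type"]).count(label) for label in counts.index
}

# normalize=True returns the share of each type instead of the raw count.
shares = toy["discourse_type"].value_counts(normalize=True)

print(counts)
```

With this approach, counts.index and counts.values could be passed as the x and y parameters of px.bar.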

Now let's graph the effectiveness labels: Effective, Adequate, and Ineffective! The code is the same as for the discourse types, but it uses the discourse_effectiveness column of the train dataframe.

In [None]:
fig = px.bar(x=np.unique(train["discourse_effectiveness"]), y=[list(train["discourse_effectiveness"]).count(i) for i in np.unique(train["discourse_effectiveness"])], color=np.unique(train["discourse_effectiveness"]), color_continuous_scale="Mint")
fig.update_xaxes(title="Effectiveness")
fig.update_yaxes(title="No. of Rows")
fig.update_layout(showlegend=True,
                  title={
                      'text': 'Discourse Effectiveness',
                      'y': 0.95,
                      'x': 0.5,
                      'xanchor': 'center',
                      'yanchor': 'top'}, template="seaborn")
fig.show()

Once again, we can see that the most common effectiveness label is Adequate, while the least common is Ineffective. Next, let's create a cloud of words with wordcloud (and stylecloud)!

### Wordclouding
To create a wordcloud, we build a WordCloud object from the wordcloud module, setting the stopwords parameter to the module's STOPWORDS attribute, max_font_size to 90, max_words to 4500, width to 600, height to 400, and background_color to black. On that object we call the generate function with the texts of the train dataframe's discourse_text column, joined into one string with a single space between them.

Next, we create the figure and axes, fig and ax, with plt.subplots, setting figsize to the tuple (14, 10). We draw the wordcloud on ax with the imshow function, setting the interpolation parameter to "bilinear", and hide the axes with set_axis_off. Finally, we display the wordcloud figure!

In [None]:
# Named "cloud" so the variable does not shadow the imported wordcloud module
# (shadowing it would make this cell fail when run a second time)
cloud = wordcloud.WordCloud(
    stopwords=wordcloud.STOPWORDS, max_font_size=90, max_words=4500,
    width=600, height=400, background_color="black",
).generate(' '.join(txt for txt in train["discourse_text"]))

fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(cloud, interpolation='bilinear')
ax.set_axis_off()
plt.show()

As in the previous EDA, the words from the anonymous students' essays are packed together, the most prominent of which is "student".
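A wordcloud sizes words by how often they occur, so the most prominent word can be double-checked with a plain frequency count. The sketch below assumes a hypothetical helper named top_words, a tiny made-up stopword set, and invented sentences; none of these come from the original notebook, and the wordcloud package's own tokenization may differ:

```python
import re
from collections import Counter

# A tiny, hypothetical stopword set for the demonstration.
DEMO_STOPWORDS = {"the", "a", "to", "and", "in", "should"}


def top_words(texts, stopwords, n=2):
    """Return the n most frequent non-stopword tokens across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(n)


# Toy sentences, invented for illustration only.
demo_texts = [
    "The student should vote.",
    "A student and a teacher vote in school.",
    "Every student learns.",
]
top = top_words(demo_texts, DEMO_STOPWORDS)
print(top)
```

On the real data, train["discourse_text"] could be passed as texts and wordcloud.STOPWORDS as stopwords.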

And now, it's time to plot the words in a stylecloud again! First, we concatenate the text data of the train dataframe by defining a variable called concated_discourse_text: a single space joined, with the join function, over a list of every entry i of the discourse_text column, converted to str with the astype function.

In [None]:
concated_discourse_text = ' '.join([i for i in train.discourse_text.astype(str)])

After that, let's move on to styling the cloud of words! We call gen_stylecloud from the stylecloud module to generate our stylecloud figure, setting the text parameter to concated_discourse_text, icon_name to any Font Awesome icon, palette to any palette name (e.g. colorbrewer.diverging.Spectral_11), background_color to black, and size to 1024.

In [None]:
stylecloud.gen_stylecloud(text=concated_discourse_text, icon_name="fas fa-bell", palette="colorbrewer.diverging.Spectral_11", background_color="black", size=1024)

Now let's show the stylecloud! First, we import the Image class from the IPython.display module. Then we call Image, setting the filename parameter to the output path of our stylecloud figure, and both width and height to 1024.

In [None]:
from IPython.display import Image
Image(filename='./stylecloud.png', width=1024, height=1024)

And here it is, the masterpiece again! We just displayed a bell icon filled with the words from the students' essays, just like the stylecloud in the previous EDA!

## Conclusion
Apart from the text visualization, which we skipped, we've done all of the EDA for the second Feedback Prize competition! However, this "EDA Simplified" notebook remains in beta, as another notebook on text visualization is a work in progress. With that said, we may update this conclusion with a link to that new notebook. In other words, stay tuned!