# About the dataset
Writing is a critical skill for success. However, less than a third of high school seniors are proficient writers, according to the National Assessment of Educational Progress. Unfortunately, low-income, Black, and Hispanic students fare even worse, with less than 15 percent demonstrating writing proficiency. One way to help students improve their writing is via automated feedback tools, which evaluate student writing and provide personalized feedback.

There are currently numerous automated writing feedback tools, but they all have limitations. Many fail to identify writing structures, such as thesis statements and support for claims, or do so only superficially. Additionally, the majority of the available tools are proprietary, with algorithms and feature claims that cannot be independently verified. More importantly, many of these writing tools are inaccessible to educators because of their cost. This problem is compounded for under-served schools, which serve a disproportionate number of students of color and students from low-income backgrounds. In short, the field of automated writing feedback is ripe for innovation that could help democratize education.

Georgia State University (GSU) is an undergraduate and graduate urban public research institution in Atlanta. U.S. News & World Report ranked GSU as one of the most innovative universities in the nation. GSU awards more bachelor’s degrees to African-Americans than any other non-profit college or university in the country. GSU and The Learning Agency Lab, an independent nonprofit based in Arizona, are focused on developing science of learning-based tools and programs for social good.

In this competition, you’ll identify elements in student writing. More specifically, you will automatically segment texts and classify argumentative and rhetorical elements in essays written by 6th-12th grade students. You'll have access to the largest dataset of student writing ever released in order to test your skills in natural language processing, a fast-growing area of data science.

![image.png](attachment:58af3e8d-4ead-4717-a7a5-07669e7886f9.png)

If successful, you'll make it easier for students to receive feedback on their writing and increase opportunities to improve writing outcomes. Virtual writing tutors and automated writing systems can leverage these algorithms while teachers may use them to reduce grading time. The open-sourced algorithms you come up with will allow any educational organization to better help young writers develop.

# **Data Description**

The dataset contains argumentative essays written by U.S. students in grades 6-12. The essays were annotated by expert raters for elements commonly found in argumentative writing.

Note that this is a code competition, in which you will submit code that will be run against an unseen test set. The unseen test set is approximately 10k documents. A small public test sample has been provided for testing your notebooks.

Your task is to predict the human annotations. You will first need to segment each essay into discrete rhetorical and argumentative elements (i.e., discourse elements) and then classify each element as one of the following:

- Lead - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
- Position - an opinion or conclusion on the main question
- Claim - a claim that supports the position
- Counterclaim - a claim that refutes another claim or gives an opposing reason to the position
- Rebuttal - a claim that refutes a counterclaim
- Evidence - ideas or examples that support claims, counterclaims, or rebuttals
- Concluding Statement - a concluding statement that restates the claims

The training set will consist of individual essays in a folder of .txt files, as well as a .csv file containing the annotated version of these essays. It is important to note that some parts of the essays will be unannotated (i.e., they do not fit into one of the classifications above).
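
Because some parts of an essay are unannotated, it can help to see which character ranges no discourse element covers. The sketch below is only an illustration: the helper name `unannotated_gaps` is hypothetical, and it assumes each annotation is a half-open `(discourse_start, discourse_end)` character span.

```python
from typing import List, Tuple

def unannotated_gaps(text_len: int, spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the character ranges of an essay not covered by any span.

    Assumption: spans are half-open (start, end) character offsets.
    """
    gaps = []
    cursor = 0  # end of the region covered so far
    for start, end in sorted(spans):
        if start > cursor:
            gaps.append((cursor, start))  # uncovered text before this span
        cursor = max(cursor, end)
    if cursor < text_len:
        gaps.append((cursor, text_len))  # uncovered tail of the essay
    return gaps

# Toy essay of 100 characters with two annotated elements.
gaps = unannotated_gaps(100, [(10, 40), (45, 90)])
print(gaps)
```

With the real data, the spans for one essay would come from the `discourse_start` and `discourse_end` columns of the rows sharing that essay's `id`.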

Files

- train.zip - folder of individual .txt files, with each file containing the full text of an essay response in the training set
- train.csv - a .csv file containing the annotated version of all essays in the training set
  - id - ID code for essay response
  - discourse_id - ID code for discourse element
  - discourse_start - character position where discourse element begins in the essay response
  - discourse_end - character position where discourse element ends in the essay response
  - discourse_text - text of discourse element
  - discourse_type - classification of discourse element
  - discourse_type_num - enumerated class label of discourse element
  - predictionstring - the word indices of the training sample, as required for predictions
- test.zip - folder of individual .txt files, with each file containing the full text of an essay response in the test set
- sample_submission.csv - file in the required format for making predictions - note that if you are making multiple predictions for a document, submit multiple rows
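
To make the link between the character offsets and `predictionstring` concrete, here is a minimal sketch that maps a character span to word indices. It assumes that words are defined by splitting the essay on whitespace; the function name `span_to_predictionstring` is hypothetical and this is not necessarily how the organizers generated the column.

```python
def span_to_predictionstring(text: str, start: int, end: int) -> str:
    """Return space-separated indices of whitespace-delimited words
    that overlap the character span [start, end).

    Assumption: word indices follow the order of text.split().
    """
    indices = []
    pos = 0  # search position, always just past the previous word
    for i, word in enumerate(text.split()):
        word_start = text.index(word, pos)
        word_end = word_start + len(word)
        pos = word_end
        # Keep the word if it overlaps the requested span.
        if word_start < end and word_end > start:
            indices.append(str(i))
    return " ".join(indices)

essay = "Phones are useful. Some people disagree with that."
start = essay.index("Some")
result = span_to_predictionstring(essay, start, len(essay))
print(result)  # 3 4 5 6 7
```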


# Data visualization in two easy ways
1. Using ProfileReport from the pandas_profiling library
2. Using AutoViz

**ProfileReport**

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

- Type inference: detect the types of columns in a dataframe
- Essentials: type, unique values, missing values
- Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
- File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information
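
To get a feel for what a few of these statistics mean, the sketch below computes some of them by hand with plain pandas. It is not how ProfileReport is implemented; the toy dataframe and its values are made up for illustration and only borrow column names from train.csv.

```python
import pandas as pd

# Toy data (made up); the column names mimic train.csv.
toy = pd.DataFrame({
    "discourse_type": ["Lead", "Claim", "Claim", "Evidence", None],
    "discourse_start": [0.0, 120.0, 200.0, 260.0, 480.0],
})

# Essentials: type, unique values, missing values per column.
essentials = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "n_missing": toy.isnull().sum(),
})

# Quantile statistics for a numeric column.
quantiles = toy["discourse_start"].quantile([0.25, 0.5, 0.75])
iqr = quantiles.loc[0.75] - quantiles.loc[0.25]

# Most frequent values for a categorical column.
most_frequent = toy["discourse_type"].value_counts().head(3)

print(essentials)
print(quantiles)
print("IQR:", iqr)
print(most_frequent)
```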

**Autoviz**

Autoviz automatically visualizes any dataset, focusing mainly on the relationships between variables. With a single line of code it can identify the most impactful features and create meaningful plots for them. It is also fast, which makes it useful for a first look at a dataset.



In [None]:
train_csv = '../input/feedback-prize-2021/train.csv'
sample_submission_csv = '../input/feedback-prize-2021/sample_submission.csv'

In [None]:
import numpy as np
import pandas as pd

In [None]:
train = pd.read_csv(train_csv)

In [None]:
train.head(5)

In [None]:
train.isnull()

In [None]:
train.isnull().sum()

In [None]:
train.dtypes

In [None]:
train.shape

In [None]:
train.info()
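
Beyond the generic checks above, two quick summaries are often useful for this kind of data: how many elements each class has, and how long elements are in words. The sketch below uses a small made-up dataframe (`toy_train`) in place of `train` so it runs on its own; the same column names are assumed.

```python
import pandas as pd

# Toy stand-in for `train` (values are made up for illustration).
toy_train = pd.DataFrame({
    "discourse_type": ["Lead", "Claim", "Evidence", "Claim"],
    "predictionstring": ["0 1 2", "3 4", "5 6 7 8", "9"],
})

# Number of annotated elements per class.
type_counts = toy_train["discourse_type"].value_counts()

# Length of each element in words, taken from the number of
# indices listed in predictionstring.
toy_train["n_words"] = toy_train["predictionstring"].str.split().str.len()
mean_words = toy_train.groupby("discourse_type")["n_words"].mean()

print(type_counts)
print(mean_words)
```

Replacing `toy_train` with `train` should give the same summaries for the full training set.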

# ProfileReport

In [None]:
# Note: newer releases of this package are published as ydata_profiling.
from pandas_profiling import ProfileReport

In [None]:
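# Build the report and save it as a standalone HTML file.
profile_train = ProfileReport(train, title="Train Profiling Report")
profile_train.to_file("Train Profiling Report.html")
# Display the report inline in the notebook.
profile_train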

# Autoviz

In [None]:
# !pip install autoviz -q

In [None]:
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
# Automatically generate plots for the dataframe; the file name is left empty because dfte is passed
AV.AutoViz("", dfte=train)

**One more way**: Using Sweetviz

Sweetviz is one more way to visualize the data. I don't think it is very relevant for this dataset, but you can use it for other datasets.

# Sweetviz

In [None]:
! pip install sweetviz

In [None]:
import sweetviz as sv
sweet_report = sv.analyze(train)
sweet_report.show_html('sweet_report.html')

In [None]:
# from IPython.display import HTML
# HTML(filename='./sweet_report.html')

Below are screenshots of the generated HTML report. You can view the full report by downloading the sweet_report.html file from the notebook's Output section.


Sweetviz is a Python library that focuses on exploring data with the help of high-density visualizations. It not only automates EDA but can also be used to compare datasets and draw inferences from them.

![image.png](attachment:9351b66e-63ac-4030-a5d4-64e6fb8c6ab5.png)
![image.png](attachment:6e8535c6-879c-4031-af2c-4d5be8798527.png)
![image.png](attachment:512afcb5-9220-458c-ba28-c52bf92b3f92.png)

Check out the references below to learn more about each tool:

- https://pandas-profiling.github.io/pandas-profiling/docs/master/index.html
- https://www.journaldev.com/52615/autoviz-module-in-python
- https://www.analyticsvidhya.com/blog/2021/01/making-exploratory-data-analysis-sweeter-with-sweetviz-2-0/

