# Ironic Corpus - Understanding the Data

> Harry Potter says, “And [the Death Eaters] would love to have me. We’d be best pals if they didn’t keep trying to do me in.” The Death Eaters are actually an evil group bent on killing Harry Potter.

J.K. Rowling, Harry Potter and the Half-Blood Prince


The data presented in this corpus concerns [_irony_](https://en.wikipedia.org/wiki/Irony#Verbal_irony): a statement whose intended meaning is the opposite of its literal meaning. That is, a literal interpretation of the statement yields the opposite of what the speaker is trying to convey. For example, a person who has burned a meal might serve it and say ironically, "it's a little underdone."

Although there are [different types of irony](http://typesofirony.com/), this dataset concerns _verbal irony_. A feature of verbal irony is that it is used intentionally to mean something other than what is said.

Online comments are a ripe place for irony. 

Irony detection is a difficult problem because a surface interpretation of a sentence can produce the opposite of what was intended. In natural language processing, automatic verbal irony detection has been treated as a text classification problem, with some approaches specific to irony.

The creators of the dataset used here claim that context is important in detecting irony. They argue that humans often need context to recognize irony, and therefore computers probably need it too.

The paper was published in 2014.

### What is the goal?

The _Ironic Corpus_, [first presented in 2014](http://www.byronwallace.com/static/articles/wallace-irony-acl-2014.pdf), is set up as a binary classification problem. The dataset contains 1950 comments (taken from [reddit](www.reddit.com)), each rated as containing irony (1) or not containing irony (-1).

This data is relevant because it was taken from a source known to inspire ironic comments. However, the dataset was purposely created to be ambiguous.

In the paper, the authors use an SVM model with five-fold cross-validation to detect irony. They present their results with the F1 score, precision, and recall. For precision and recall, scores close to 1 are best. F1 is the harmonic mean of the two.
- average [F1 score](https://en.wikipedia.org/wiki/F1_score): 0.383 (range 0.330 - 0.412)
- average [recall](https://en.wikipedia.org/wiki/Precision_and_recall): 0.496 (range 0.446 - 0.548)
- average [precision](https://en.wikipedia.org/wiki/Precision_and_recall): 0.315 (range 0.261 - 0.380)
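
To make these metric definitions concrete, here is a minimal sketch that computes precision, recall, and F1 for the ironic class (label 1) from lists of true and predicted labels. The function name `precision_recall_f1` and the toy labels are made up for illustration; this is not the authors' evaluation code.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels using the corpus convention: 1 = ironic, -1 = not ironic.
y_true = [1, 1, -1, -1, -1, 1, -1, -1]
y_pred = [1, -1, 1, -1, -1, 1, 1, -1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")

# Harmonic mean of the paper's reported average precision and recall.
reported_f1 = 2 * 0.315 * 0.496 / (0.315 + 0.496)
print(f"harmonic mean of reported averages: {reported_f1:.3f}")
```

The harmonic mean of the reported averages comes out at about 0.385, close to but not identical to the reported average F1 of 0.383. That small gap would be consistent with F1 having been averaged across folds rather than computed from the averaged precision and recall, though this is an assumption.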

Before trying to improve or reproduce these results, this notebook performs some basic exploratory data analysis to get a feel for the corpus.

# Understanding the Data

Let's see what this data looks like, and answer a few questions:
- What are examples of ironic and non-ironic comments?
- What is the split of ironic / non-ironic comments?
- How big of a dataset are we looking at?

In [None]:
import pandas as pd
import re

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The provided CSV file contains two columns: the comment text and its label. According to the Kaggle page, -1 indicates a "not ironic" label, and 1 indicates an "ironic" label.

In [None]:
irony_data = pd.read_csv('/kaggle/input/ironic-corpus/irony-labeled.csv')
irony_data.head()

The `shape` of the dataframe tells us how many comments are in the Ironic Corpus. Here we see that the file contains 1,949 comments, one fewer than the 1,950 stated above.

In [None]:
irony_data.shape

The dataset is imbalanced. There are approximately three times as many "not ironic" (-1) comments as "ironic" (1) comments. 

In [None]:
irony_data.label.value_counts()
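
One practical consequence of this imbalance is that plain accuracy can look good for a useless model. The sketch below uses a made-up label list with the same rough 3:1 ratio (not the actual corpus) to show that always predicting "not ironic" scores 75% accuracy while finding no ironic comments at all.

```python
from collections import Counter

# Hypothetical labels with a 3:1 ratio of not ironic (-1) to ironic (1).
labels = [-1] * 75 + [1] * 25

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# A baseline that always predicts the majority class.
predictions = [majority_label] * len(labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall on the ironic class: the share of ironic comments the baseline finds.
ironic_found = sum(1 for p, t in zip(predictions, labels) if t == 1 and p == 1)
ironic_recall = ironic_found / counts[1]

print(f"majority label: {majority_label}, accuracy: {accuracy:.2f}, ironic recall: {ironic_recall:.2f}")
```

This is one reason precision, recall, and F1 on the ironic class are more informative for this corpus than accuracy.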

Now let's look at the ironic and unironic comments. The comments are separated based on their label, and by using `values`, only the comment text remains.

In [None]:
unironic = irony_data['comment_text'][irony_data.label == -1].values
ironic = irony_data['comment_text'][irony_data.label == 1].values

The first five unironic comments:

In [None]:
unironic[:5]

The first five ironic comments:

In [None]:
ironic[:5]

Even at a cursory glance, the two groups of comments seem to differ in length. Below is a function to count the words across all of the ironic and unironic comments.

In [None]:
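def count_words(lines, linetype):
    """Print the number of comments, total words, and mean words per comment."""
    total_words = 0
    for line in lines:
        # Count each run of word characters (letters, digits, underscore) as one word.
        total_words += len(re.findall(r'\w+', line))
    print(f'Number of {linetype} comments: {len(lines)}, Total words: {total_words}, Words per comment: {total_words / len(lines)}')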
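
It helps to know what the `\w+` pattern counts as a word. As a small illustration on a made-up sentence (not taken from the corpus), contractions are split at the apostrophe and punctuation is dropped, so the counts are a rough approximation rather than a linguistic tokenization.

```python
import re

# Hypothetical example sentence, not taken from the corpus.
sentence = "It's a little underdone."
tokens = re.findall(r'\w+', sentence)
print(tokens)       # ['It', 's', 'a', 'little', 'underdone']
print(len(tokens))  # 5
```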

Based on rough counts, the unironic comments appear to be generally longer than the ironic comments. The entire dataset is also rather small, with only ~84k words in total.

In [None]:
count_words(unironic, "Unironic")
count_words(ironic, "Ironic")
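
Because a mean can be pulled up by a handful of very long comments, it may be worth comparing it with the median. The following is a minimal sketch using the standard library and made-up comments; the helper `words_per_comment` is hypothetical and not part of the original analysis.

```python
import re
from statistics import mean, median

def words_per_comment(lines):
    """Return a list with the word count of each comment."""
    return [len(re.findall(r'\w+', line)) for line in lines]

# Hypothetical comments standing in for one label group.
sample = [
    "Sure, great idea.",
    "That went well.",
    "I think the rollout had problems, but that is expected for a program of this size and it will improve.",
]
lengths = words_per_comment(sample)
print(lengths, mean(lengths), median(lengths))
```

With the real data, the `unironic` and `ironic` arrays could be passed to `words_per_comment` in the same way to see whether the difference in means also holds for the medians.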

# Questions Answered

Here are the questions we asked before, and the answers our quick descriptive analysis provided.
- What are examples of ironic and non-ironic comments?
    - Ironic snippet: "Insane like a fox.  Ted Cruz is actually very very intelligent."
    - Unironic snippet: "Also there are bound to be some glitches when rolling out a program of this scope."
    - A superficial difference between the groups is that the unironic comments contain more words than the ironic comments. Based on the mean comment length, the unironic comments contain almost 20 more words per comment.
    
    
- What is the split of ironic / non-ironic comments?
    - The comments are unbalanced, with about 3:1 unironic to ironic comments.
- How big of a dataset are we looking at?
    - In total, the dataset contains ~84k words. 

## What's next?

Now that we've looked at the data, check out the other notebooks in this section to see different ways of analyzing it.