# Ruddit - A Supporting Dataset for Jigsaw Comments Severity Rating

This dataset contains offensive comments from Reddit, each paired with an offensiveness score.
Unfortunately, the original dataset does not include the comment text that we need for modelling. Instead, it provides post IDs and comment IDs so that the comments can be retrieved. This notebook uses **[PRAW](https://praw.readthedocs.io/en/stable/)**, a Python library, to extract the comments from Reddit via the **[Reddit API](https://www.reddit.com/wiki/api)**.

Please find the original paper: **[Ruddit: Norms of Offensiveness for English Reddit Comments](https://aclanthology.org/2021.acl-long.210/)**.

Please find the official repo [here](https://github.com/hadarishav/Ruddit).

Acknowledgement:

` @inproceedings{hada-etal-2021-ruddit,
    title = "Ruddit: {N}orms of Offensiveness for {E}nglish {R}eddit Comments",
    author = "Hada, Rishav  and
      Sudhir, Sohi  and
      Mishra, Pushkar  and
      Yannakoudakis, Helen  and
      Mohammad, Saif M.  and
      Shutova, Ekaterina",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.210",
    doi = "10.18653/v1/2021.acl-long.210",
    pages = "2700--2717",
    abstract = "On social media platforms, hateful and offensive language negatively impact the mental well-being of users and the participation of people from diverse backgrounds. Automatic methods to detect offensive language have largely relied on datasets with categorical labels. However, comments can vary in their degree of offensiveness. We create the first dataset of English language Reddit comments that has fine-grained, real-valued scores between -1 (maximally supportive) and 1 (maximally offensive). The dataset was annotated using Best{--}Worst Scaling, a form of comparative annotation that has been shown to alleviate known biases of using rating scales. We show that the method produces highly reliable offensiveness scores. Finally, we evaluate the ability of widely-used neural models to predict offensiveness scores on this new dataset.",
}
`


![Python Reddit Scraping](https://d1y2qj23ol72q6.cloudfront.net/2021/01/Python-Reddit-Banner-2.jpg)
> *[Image Source](https://d1y2qj23ol72q6.cloudfront.net/2021/01/Python-Reddit-Banner-2.jpg)*

# Clone the Repo

In [None]:
# clone from the official repo
!git clone 'https://github.com/hadarishav/Ruddit.git'

In [None]:
# what files are available?
!ls ./Ruddit/Dataset/

# Install PRAW library and create the environment

In [None]:
# install Python Reddit API Wrapper
!pip install praw

In [None]:
# import necessary modules and APIs
import numpy as np 
import pandas as pd 
from time import time
import praw
from tqdm.notebook import tqdm_notebook as tqdm
from kaggle_secrets import UserSecretsClient

# Read data

In [None]:
# read the ruddit data file
ruddit = pd.read_csv('./Ruddit/Dataset/Ruddit.csv')
print(ruddit.shape)
ruddit.head()

`comment_id` is the Reddit comment ID, `post_id` is the ID of the parent Reddit post, and `offensiveness_score` is the score computed by the authors of the paper cited above (the target value when fine-tuning a model).

The `Thread_structure.txt` file (available in the Dataset directory) shows that each `post_id` has one or more `comment_id` values. Hence we need the unique post IDs and their corresponding comment IDs.
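
As a minimal sketch of this one-to-many relationship, the toy frame below (made-up IDs, not rows from Ruddit) groups comment IDs under their parent post ID:

```python
import pandas as pd

# toy data with made-up IDs; the real file has the same two columns
toy = pd.DataFrame({
    "post_id": ["p1", "p1", "p2"],
    "comment_id": ["c1", "c2", "c3"],
})

# one entry per post, holding the list of comment IDs under that post
toy_pairs = toy.groupby("post_id")["comment_id"].apply(list).to_dict()
print(toy_pairs)
```

Each key is a post ID and each value lists the comments to look for under that post.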

In [None]:
# we need a list of post_id to extract comments
posts = ruddit.post_id.unique()
# number of unique post_id
len(posts)

In [None]:
# create a dictionary with POST_ID as key and an array of COMMENT_IDs as values
# uses the `groups` attribute of a pandas groupby, which maps each key to its row labels
r = ruddit.groupby('post_id')
pairs = dict()
for j in r.groups:
    # groups holds index labels, so select with .loc rather than .iloc
    pairs[j] = ruddit.loc[r.groups[j], 'comment_id'].to_numpy()

In [None]:
# do a random check
# what comment_ids are there for a given post_id?
pairs['3vdy9k']

In [None]:
# generate separate columns for extracted texts and their URLs
# use None (object dtype) so that strings can be assigned later
ruddit['txt'] = None
ruddit['url'] = None
ruddit.head()

# Extract texts

In [None]:
# credentials, keep them secret
# to develop your own, please follow PRAW docs and Reddit API
user_secrets = UserSecretsClient()
# save and retrieve secrets using kaggle_secrets
secret_value_0 = user_secrets.get_secret("CLIENT_AGENT")
secret_value_1 = user_secrets.get_secret("CLIENT_ID")
secret_value_2 = user_secrets.get_secret("CLIENT_SECRET")

In [None]:
# create a reddit crawler
reddit = praw.Reddit(
    user_agent= secret_value_0,
    client_id=secret_value_1,
    client_secret=secret_value_2
)
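
Outside Kaggle, `kaggle_secrets` is not available. A minimal sketch, assuming the same three secret names are exported as environment variables (the helper name `load_reddit_credentials` is hypothetical and not part of PRAW):

```python
import os

def load_reddit_credentials(env=None):
    """Collect the keyword arguments for praw.Reddit from a mapping of secrets."""
    env = os.environ if env is None else env
    # praw.Reddit keyword -> secret name used in this notebook
    names = {
        "user_agent": "CLIENT_AGENT",
        "client_id": "CLIENT_ID",
        "client_secret": "CLIENT_SECRET",
    }
    missing = [secret for secret in names.values() if not env.get(secret)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {kwarg: env[secret] for kwarg, secret in names.items()}
```

With the variables set, the crawler could then be created as `praw.Reddit(**load_reddit_credentials())`.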

In [None]:
# collect post ids which lead to errors, like forbidden 403
issue_posts = []
# iterate over all post ids
for p in tqdm(posts[:10], desc='overall progress'): 
    # process 10 posts for demo
    now = time()
    try:
        # create a submission to Reddit API
        submission = reddit.submission(id=p)
        # read the URL
        URL = submission.url
        # flatten the comment tree
        submission.comments.replace_more(limit=None)
    except Exception as e:
        # if there is an error making submission
        issue_posts.append((p, e))
        continue
    delta = int(time()-now)
    # let's know the time taken for each submission
    desc = str(p)+' '+str(delta)+' sec'
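    # iterate over the comments actually returned by Reddit
    for c in tqdm(submission.comments.list(), desc=desc):
        # only keep comments whose id is listed in our data
        if c.id in pairs[p]:
            # locate the matching row(s) in our data
            index = ruddit[ruddit['comment_id'] == c.id].index
            # submission.url is the external link for link posts,
            # so build the comment URL from its own permalink instead
            ruddit.loc[index,['txt','url']] = [c.body, 'https://www.reddit.com'+c.permalink]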
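
The matching step in the loop above can be sketched as a small helper. It assumes only the PRAW calls already used in this notebook (`reddit.submission`, `comments.replace_more`, `comments.list`) plus the `id` and `body` attributes of a comment; the function name `collect_comment_texts` is hypothetical. Converting the wanted IDs to a set keeps each membership test cheap.

```python
def collect_comment_texts(reddit, post_id, wanted_ids):
    """Return {comment_id: body} for the wanted comments under one post."""
    wanted = set(wanted_ids)  # set lookup instead of scanning an array
    submission = reddit.submission(id=post_id)
    # expand every "load more comments" stub so the tree is complete
    submission.comments.replace_more(limit=None)
    texts = {}
    for comment in submission.comments.list():
        if comment.id in wanted:
            texts[comment.id] = comment.body
    return texts
```

For example, `collect_comment_texts(reddit, p, pairs[p])` would return the texts found for post `p`. Because the helper only needs an object with the same methods, it can also be exercised offline with a stand-in client.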

# Publish CSV file

In [None]:
# drop rows where we don't have texts
# reorder columns for elegance
ruddit_1 = ruddit.dropna(axis=0,inplace=False)\
[['post_id','comment_id','txt','url','offensiveness_score']]
ruddit_1.head()

In [None]:
# if there are problematic post_ids, 
# publish them into a CSV file
if len(issue_posts):
    print(len(issue_posts))
    issue_1 = pd.DataFrame(data=issue_posts, columns=['post_id', 'error_msg'])
    issue_1.to_csv('post_with_issue_1.csv', index=False)
    print(issue_1.head())
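
Some of the collected failures may be transient (for example, a timeout), while others, such as a 403, will not go away on retry. A minimal sketch of a generic retry wrapper with a fixed delay; the name `with_retries` and its defaults are hypothetical and not part of PRAW:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn() up to `attempts` times, sleeping `delay` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
```

A failed post ID `p` could then be retried with something like `with_retries(lambda: reddit.submission(id=p).url)`.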

In [None]:
# publish our extracted text data
ruddit_1.to_csv('ruddit_with_text_1.csv',index=False) 
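
Comments that were deleted or removed after the dataset was compiled may still be returned by the API with a placeholder body. A hedged sketch, assuming the placeholders are the literal strings `[deleted]` and `[removed]` (please verify this against your own extraction), that filters them out and reports coverage on toy data:

```python
import pandas as pd

# assumption: placeholder bodies returned for unavailable comments
PLACEHOLDERS = {"[deleted]", "[removed]"}

def drop_placeholders(df, text_col="txt"):
    """Keep only rows whose text is present and is not a placeholder."""
    keep = df[text_col].notna() & ~df[text_col].isin(PLACEHOLDERS)
    return df[keep]

# toy data with made-up IDs and texts
toy = pd.DataFrame({
    "comment_id": ["c1", "c2", "c3", "c4"],
    "txt": ["some text", "[deleted]", None, "[removed]"],
})
usable = drop_placeholders(toy)
coverage = len(usable) / len(toy)
print(f"{len(usable)} of {len(toy)} comments usable ({coverage:.0%})")
```

Applying the same filter to `ruddit_1` before publishing would keep placeholder rows out of the training text.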

### Find the complete Dataset **[here](https://www.kaggle.com/rajkumarl/ruddit-jigsaw-dataset)**

#### Thank you for your time!