# This notebook:
- explores the dataset to gain a basic understanding
- filters out unwanted instances e.g. empty or duplicate texts
- samples 2000 instances to label manually

## Explore the dataset
- Obtain basic statistics and characteristics of the corpus

In [1]:
# Load in the data from CSV

import pandas as pd
import re

allComments = pd.read_csv("data/comments.csv", usecols=['document_id', 'comment', 'has_attachments'])
allComments['comment'] = allComments['comment'].map(lambda x: re.sub("(\r|\n)", " ", x))
print("# comments scraped:", allComments.shape[0])

# allComments = allComments.drop_duplicates(subset = ['comment'])
# print("# comments with duplicates removed:", allComments.shape[0])

# comments scraped: 251623


In [2]:
# Sample 10 comments to view

allComments.sample(10).comment

91105     I am appalled that our treasured National monu...
198592    Please keep these monuments free from developm...
94494     I am appalled that our treasured National monu...
126505    Dear Secretary Ryan Zinke,  I am writing in su...
17804                       Please Write Your Comment Here:
210437    Dear Secretary Ryan Zinke,  As a supporter of ...
95014     Dear Secretary Ryan Zinke,  I moved to Souther...
246553    I strongly urge you to continue to protect all...
50024     I am opposed to any alterations to Mojave Trai...
198714    Dear Secretary Ryan Zinke,  As an avid outdoor...
Name: comment, dtype: object

In [3]:
# Split texts into sentences
# Ref: https://stackoverflow.com/a/31505798

import re

caps = "([A-Z])"
digits = "([0-9])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    '''
    Break text (a string) into sentences (a list of strings).
    '''
    
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences if len(s.strip())>1]
    return sentences

allComments['sentences'] = allComments['comment'].map(split_into_sentences)

In [4]:
# Output some statistics about the corpus

from collections import defaultdict
import operator

sentences = [s for sublist in allComments['sentences'] for s in sublist]
print("# sentences:", len(sentences))

frequency = defaultdict(int)
for sent_list in allComments['sentences']:
    for sent in sent_list:
        frequency[sent] += 1

uniqueSentences = list(frequency.keys())
print("# unique sentences:", len(uniqueSentences))
print()

sorted_frequency_list = sorted(frequency.items(), key=operator.itemgetter(1), reverse=True)
print("10 most common sentences:")
for index, row in enumerate(sorted_frequency_list[:10]):
    print(index+1, row[0][:50]+"...", row[1])

# sentences: 1909473
# unique sentences: 558473

10 most common sentences:
1 Please make sure you side with the people who supp... 58821
2 Every single one of our parks, monuments and cultu... 58764
3 It is your job as the Secretary of the Dept.... 54012
4 Hear me, and the overwhelming number of people who... 54011
5 of Interior to protect and safeguard our national ... 54008
6 I am adamantly opposed to any effort to eliminate ... 54008
7 Five Tribal nations, Hopi, Navajo, Uintah and Oura... 42050
8 Now the Bears Ears Inter-Tribal Coalition is worki... 42046
9 The short review you are undertaking makes a mocke... 42044
10 I am appalled that our treasured National monument... 42043


The sample comments and sentence frequencies show that a large share of comments were submitted from templates: the ten most common sentences each occur more than 42,000 times, and only 558,473 of the 1,909,473 sentences are unique. We need to identify these template comments so that the comments we sample for labelling (and later model training) do not contain duplicate or near-duplicate text; otherwise, words used in templates would bias the feature weights. One simple filter is to compare the first two sentences of each comment and remove any comment whose first two sentences have already been seen in an earlier comment. The reasoning is that two posters are unlikely to open with exactly the same two sentences unless both started from a template.
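
A minimal sketch of this filter on toy data is given below. It assumes a simple regex normalizer: `first_two_key` is a hypothetical helper introduced only for illustration, not the gensim-based tokenizer this notebook uses in the filtering step, and the three toy comments are invented.

```python
import re

def first_two_key(sentences):
    """Build a normalized key from the first two sentences of a comment."""
    # Lowercase and keep only alphabetic runs, so differences in punctuation
    # and spacing between copies of the same template are ignored.
    tokens = []
    for sent in sentences[:2]:
        tokens.extend(re.findall(r"[a-z]+", sent.lower()))
    return " ".join(tokens)

# Toy comments, already split into sentences (invented for this example)
comments = [
    ["I am appalled.", "Please protect our monuments!", "Thank you."],
    ["I am appalled;", "please  protect our monuments.", "Signed, a hiker."],
    ["I grew up near Bears Ears.", "It matters to my family."],
]

seen = set()
kept = []
for sentences in comments:
    key = first_two_key(sentences)
    if key not in seen:      # first time this opening has been seen
        seen.add(key)
        kept.append(sentences)

print(len(kept))  # the second comment shares its opening with the first
```

Only the opening of each comment is compared, so two comments that start from the same template but end with different personal remarks are still treated as copies.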

## Filter out comments
- Remove empty comments
- Remove comments with attachments
- Remove duplicate and template comments

In [5]:
# Remove empty comments whose sentence count is zero 
# or whose text contains only the official template lines
# i.e. "Dear Secretary Ryan Zinke," and/or "Leave your personal comment here..."

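def hasContent(sentences):
    # No sentences at all: the comment is empty
    if len(sentences) == 0:
        return False
    # Unedited placeholder text from the submission form
    if 'Leave your personal comment here' in sentences[0]:
        return False
    # Exactly two sentences, the first being little more than the salutation
    if len(sentences) == 2 and len(sentences[0]) < 30 \
    and 'Dear Secretary Ryan Zinke,' in sentences[0]:
        return False
    return True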

allComments = allComments[allComments['sentences'].map(hasContent)]
print("# comments with empty comments removed:", allComments.shape[0])

# comments with empty comments removed: 250888


In [6]:
# Remove comments with attachments

allComments = allComments[allComments['has_attachments'] == False]
print("# comments without attachments:", allComments.shape[0])

# comments without attachments: 246733


In [10]:
# Mark duplicates: copies of the same template differ in minor ways
# (e.g. punctuation and spacing), so normalize each sentence with gensim's
# simple_preprocess(), which lowercases and tokenizes, before comparing
# the first two sentences.

import gensim

def tokenize(text, minLength=3):
    return gensim.utils.simple_preprocess(text, deacc=True, min_len=minLength)

allComments['first_two_sents'] = allComments['sentences'].map(lambda x: " ".join([" ".join(tokenize(sent,2)) for sent in x[:2]]))
allComments['duplicate'] = allComments.duplicated(subset=['first_two_sents'], keep=False)
allComments.sample(5)

| | document_id | has_attachments | comment | sentences | first_two_sents | duplicate |
|---|---|---|---|---|---|---|
| 22978 | DOI-2017-0002-119884 | False | Dear Secretary Ryan Zinke, You hunt. So let us... | [Dear Secretary Ryan Zinke, You hunt., So let ... | dear secretary ryan zinke you hunt so let us c... | False |
| 110621 | DOI-2017-0002-2343 | False | Sec. Zinke contradicted himself in the press r... | [Sec., Zinke contradicted himself in the press... | sec zinke contradicted himself in the press re... | False |
| 35298 | DOI-2017-0002-131398 | False | Please leave the national monuments alone as t... | [Please leave the national monuments alone as ... | please leave the national monuments alone as t... | False |
| 8469 | DOI-2017-0002-106734 | False | Dear Secretary Ryan Zinke, I've grown up and ... | [Dear Secretary Ryan Zinke, I've grown up and... | dear secretary ryan zinke ve grown up and spen... | False |
| 74367 | DOI-2017-0002-176761 | False | I am appalled that our treasured national park... | [I am appalled that our treasured national par... | am appalled that our treasured national parks ... | True |


In [11]:
# Write to file

allComments[['document_id','comment','duplicate']].to_csv('data/comments-cleaned.csv', index=False)

## Sample data 
- randomly sample 2000 instances
- output to file for manual labelling

In [12]:
# To avoid repeated training examples, keep only one comment from each group
# of comments whose normalized first two sentences are identical

uniqueComments = allComments.drop_duplicates(subset=['first_two_sents'])
print("# rather unique comments:", uniqueComments.shape[0])

# rather unique comments: 85285
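
The `duplicate` flag written to `data/comments-cleaned.csv` and the `uniqueComments` frame treat template comments differently. The sketch below illustrates the difference on a toy frame; the frame `toy` and its values are invented for the example and are not taken from the corpus.

```python
import pandas as pd

toy = pd.DataFrame({
    "document_id": ["a", "b", "c", "d"],
    "first_two_sents": ["template one", "own words", "template one", "template one"],
})

# keep=False flags every row whose key occurs more than once,
# including the first occurrence
toy["duplicate"] = toy.duplicated(subset=["first_two_sents"], keep=False)

# drop_duplicates keeps the first row of each key group by default
unique = toy.drop_duplicates(subset=["first_two_sents"])

print(toy["duplicate"].tolist())       # [True, False, True, True]
print(unique["document_id"].tolist())  # ['a', 'b']
```

So a comment marked `duplicate == True` can still appear among the unique comments, as the single representative of its template.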


In [None]:
# Output 2000 comments to CSV for manual labelling

comments_to_label = uniqueComments[['document_id', 'comment']].sample(2000)
comments_to_label.to_csv('data/comments-to-label.csv', index=False)
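
If the labelled sample needs to be regenerated later, one option is to seed the draw. This is only a sketch under assumptions: the frame `pool` is a stand-in for `uniqueComments`, and the seed value 0 is arbitrary, not a value used in this notebook.

```python
import pandas as pd

# Stand-in for uniqueComments (invented for this example)
pool = pd.DataFrame({"document_id": range(100)})

# random_state seeds the draw; 0 is an arbitrary illustrative value
first = pool.sample(10, random_state=0)
again = pool.sample(10, random_state=0)

# With the same seed and the same input frame, the same rows are drawn
print(first["document_id"].tolist() == again["document_id"].tolist())
```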

In [13]:
# Output unique comments for later use

uniqueComments[['document_id', 'comment']].to_csv('data/uniqueComments.csv', index=False)