# This notebook:
- explores the dataset to gain a basic understanding
- filters out unwanted instances e.g. empty or duplicate texts
- samples 2000 instances to label manually

## Explore the dataset
- Obtain basic statistics and characteristics of the corpus

In [1]:
# Load in the data from CSV

import pandas as pd

allComments = pd.read_csv("data/comments.csv", usecols=['document_id', 'comment', 'has_attachments'])
print("# comments scraped:", allComments.shape[0])

allComments = allComments.drop_duplicates(subset = ['comment'])
print("# comments with duplicates removed:", allComments.shape[0])

# comments scraped: 151631
# comments with duplicates removed: 116287


In [2]:
# Sample 10 comments to view

allComments.sample(10).comment

20499     Please Write Your Comment Here: STOP EVERYTHIN...
5965      Dear Secretary Ryan Zinke,\nProtected public l...
141836    Dear Secretary Ryan Zinke,\nPlease keep Bears ...
43194     Dear Secretary Ryan Zinke,\nOur public lands a...
25033     I'm disgusted with the executive order 13792. ...
64154     Please leave these National Monuments alone, u...
132572    I am writing in response to the Department of ...
122380    Dear Secretary Ryan Zinke,\n\nAs a supporter o...
57715     Dear Secretary Ryan Zinke,\n\nOur national mon...
85423     Dear Secretary Ryan Zinke,\n\nI am a frequent ...
Name: comment, dtype: object

In [3]:
# Split texts into sentences
# Ref: https://stackoverflow.com/a/31505798

import re

caps = "([A-Z])"
digits = "([0-9])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    '''
    Break text (a string) into sentences (a list of strings).
    '''
    
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences if len(s.strip())>1]
    return sentences

allComments['sentences'] = allComments['comment'].map(split_into_sentences)
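
The splitter above relies on a placeholder technique: periods that do not end a sentence are temporarily replaced with `<prd>`, real sentence boundaries are marked with `<stop>`, and the placeholders are restored before splitting. A stripped-down sketch of that idea, with a hypothetical function name and only two protection rules (titles and decimals), might look like this:

```python
import re

def split_sentences_sketch(text):
    # Hypothetical, reduced version of the technique: protect periods that
    # do not end a sentence by swapping them for a placeholder.
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr|St)\.", r"\1<prd>", text)
    text = re.sub(r"(\d)\.(\d)", r"\1<prd>\2", text)
    # Mark every remaining ".", "?" or "!" as a sentence boundary.
    text = re.sub(r"([.?!])", r"\1<stop>", text)
    # Restore the protected periods, then split on the boundary markers.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]

example = "Dr. Smith paid 3.5 dollars. Was it worth it? Yes!"
print(split_sentences_sketch(example))
```

The full function handles more cases (acronyms, suffixes such as "Inc.", websites, quotation marks), but each rule follows the same protect, mark, restore pattern.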

In [4]:
# Output some statistics about the corpus

from collections import defaultdict
import operator

sentences = [s for sublist in allComments['sentences'] for s in sublist]
print("# sentences:", len(sentences))

frequency = defaultdict(int)
for sent_list in allComments['sentences']:
    for sent in sent_list:
        frequency[sent] += 1

uniqueSentences = list(frequency.keys())
print("# unique sentences:", len(uniqueSentences))
print()

sorted_frequency_list = sorted(frequency.items(), key=operator.itemgetter(1), reverse=True)
print("10 most common sentences:")
for index, row in enumerate(sorted_frequency_list[:10]):
    print(index+1, row[0][:50]+"...", row[1])

# sentences: 827303
# unique sentences: 520456

10 most common sentences:
1 He and all fifteen subsequent presidents--of both ... 19434
2 These monuments are a legacy of Teddy Roosevelt.... 19433
3 Reversing any of these designations would be a tra... 19415
4 I urge you to uphold Roosevelt's legacy and mainta... 19397
5 The national monuments created in the past twenty ... 19338
6 From the buttes of Bears Ears that support birds l... 19302
7 Dear Secretary Ryan Zinke,  As a supporter of bird... 18607
8 Thank you.... 3757
9 I strongly urge you to oppose any efforts to elimi... 3730
10 I am firmly opposed to any effort to revoke or dim... 3599
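
The same tally can be expressed more compactly with `collections.Counter`, whose `most_common()` method replaces the manual sort. The sketch below uses a small made-up list of sentence lists as a stand-in for `allComments['sentences']`:

```python
from collections import Counter

# Toy stand-in for allComments['sentences']: one list of sentences per comment.
sentence_lists = [
    ["Dear Secretary Ryan Zinke,", "Thank you."],
    ["Please protect our monuments.", "Thank you."],
    ["Thank you."],
]

# Count every sentence across all comments in a single pass.
frequency = Counter(sent for sent_list in sentence_lists for sent in sent_list)

print("# sentences:", sum(frequency.values()))
print("# unique sentences:", len(frequency))
for rank, (sentence, count) in enumerate(frequency.most_common(2), start=1):
    print(rank, sentence[:50] + "...", count)
```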


The sample comments and sentence frequencies show that a large portion of the comments were submitted from templates. We need to identify these template comments so that the sample we label for model training does not contain duplicate or near-duplicate texts; otherwise the words used in templates would bias our feature weights. One simple filter is to compare the first two sentences of each comment and remove any comment whose first two sentences have already been seen in an earlier comment. The reasoning is that two commenters are unlikely to have written the same opening sentences unless they started from the same template.
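
A minimal sketch of this filter, written without pandas, is shown here. The function name `drop_template_comments` and the toy comments are hypothetical, and the sketch compares the opening sentences verbatim, so templates with small edits would still slip through:

```python
def drop_template_comments(comments):
    """Keep the first comment for each distinct pair of opening sentences.

    `comments` is a list of comments, each already split into sentences.
    """
    seen = set()
    kept = []
    for sentences in comments:
        key = tuple(sentences[:2])  # the first two sentences identify a template
        if key in seen:
            continue  # an earlier comment opened the same way
        seen.add(key)
        kept.append(sentences)
    return kept

# Made-up examples: the first two comments share a template opening.
toy_comments = [
    ["These monuments are a legacy.", "Please keep them.", "I hike there often."],
    ["These monuments are a legacy.", "Please keep them.", "Thank you."],
    ["I oppose the executive order.", "Leave the monuments alone."],
]
kept = drop_template_comments(toy_comments)
print(len(kept), "of", len(toy_comments), "comments kept")
```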

## Filter out comments
- Remove empty comments
- Remove comments with attachments
- Remove duplicate and template comments

In [5]:
# Remove empty comments whose sentence count is zero

allComments = allComments[allComments['sentences'].map(len) != 0]
print("# comments with empty comments removed:", allComments.shape[0])

# comments with empty comments removed: 116281


In [6]:
# Remove empty comments whose text contains only the official template lines
# i.e. "Dear Secretary Ryan Zinke," and/or "Leave your personal comment here..."

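def hasContent(sentences):
    # No content if the first sentence still holds the form's placeholder text,
    # or if there are exactly two sentences and the first is a short one
    # (under 30 characters) containing the default salutation.
    if 'Leave your personal comment here' in sentences[0] \
    or (len(sentences) == 2 and len(sentences[0]) < 30 and \
        'Dear Secretary Ryan Zinke,' in sentences[0]):
        return False
    return True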

allComments = allComments[allComments['sentences'].map(hasContent)]
print("# comments with no content comments removed:", allComments.shape[0])

# comments with no content comments removed: 115664


In [7]:
# Remove comments with attachments

allComments = allComments[allComments['has_attachments'] == False]
print("# comments without attachments:", allComments.shape[0])

# comments without attachments: 114373


In [8]:
# Because there are minor variations within the same template (e.g. punctuation and spacing),
# use gensim's built-in function simple_preprocess() to lowercase and tokenize sentences.
# Then drop comments whose first two sentences duplicate those of an earlier comment.

import gensim

def tokenize(text, minLength=3):
    return gensim.utils.simple_preprocess(text, deacc=True, min_len=minLength)

allComments['first_two_sents'] = allComments['sentences'].map(lambda x: " ".join([" ".join(tokenize(sent)) for sent in x[:2]]))
uniqueComments = allComments.drop_duplicates(subset=['first_two_sents'])
print("# rather unique comments:", uniqueComments.shape[0])

# rather unique comments: 79314
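
To see why normalizing helps, the sketch below approximates what `simple_preprocess(text, deacc=True, min_len=3)` does using only the standard library: lowercase the text, strip accents, and keep alphabetic tokens of 3 to 15 characters. It is an approximation for illustration, not gensim's actual implementation, and `normalize_sketch` and the two example strings are hypothetical.

```python
import re
import unicodedata

def normalize_sketch(text, min_len=3, max_len=15):
    # Lowercase and decompose accented characters so accents can be dropped.
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed
                         if unicodedata.category(ch) != "Mn")
    # Keep runs of letters only; punctuation, digits and spacing disappear.
    tokens = re.findall(r"[^\W\d_]+", no_accents)
    return " ".join(t for t in tokens if min_len <= len(t) <= max_len)

# Two made-up variants of the same template opening.
variant_a = "Dear Secretary Ryan Zinke,  Our public lands are a treasure!"
variant_b = "dear secretary Ryan Zinke: our public  lands are a treasure."

print(normalize_sketch(variant_a))
print(normalize_sketch(variant_a) == normalize_sketch(variant_b))
```

Both variants reduce to the same key, so `drop_duplicates` would treat them as one template even though the raw strings differ.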


In [9]:
# Sample a few comments for viewing

uniqueComments[['document_id', 'comment']].sample(10)

        document_id           comment
38159   DOI-2017-0002-134730  To Whom It May Concern:\n\nThe Grand Staircase...
127110  DOI-2017-0002-77680   Secretary Zinke,\nStanding before a Giant Sequ...
75589   DOI-2017-0002-30809   Dear Secretary Ryan Zinke,\n\nAs a photographe...
43237   DOI-2017-0002-13995   I would like to voice my opinion that the monu...
136271  DOI-2017-0002-86016   Dear Secretary Zinke,\n\nConsider this:\nThe m...
59758   DOI-2017-0002-16403   I am opposed to opening our national monuments...
46      DOI-2017-0002-0048    National monuments are great for the economy. ...
92862   DOI-2017-0002-4652    Leave our national monuments and open spaces u...
42769   DOI-2017-0002-139333  Dear Secretary Ryan Zinke,\n\nOur national mon...
19571   DOI-2017-0002-116877  Please continue protecting the national monume...


## Sample data 
- randomly sample 2000 instances
- output to file for manual labelling

In [10]:
# Output 2000 comments to CSV for manual labelling

comments_to_label = uniqueComments[['document_id', 'comment']].sample(2000)
comments_to_label.to_csv('data/comments-to-label.csv', index=False)
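
As written, `sample(2000)` draws a different set of comments on every run. If the file for labelling ever needs to be regenerated, passing `random_state` to `DataFrame.sample` may help. The sketch below uses a made-up frame and an arbitrary seed to show that the same seed selects the same rows:

```python
import pandas as pd

# Toy stand-in for uniqueComments; the IDs and texts are made up.
toy = pd.DataFrame({
    "document_id": [f"DOC-{i}" for i in range(100)],
    "comment": [f"comment text {i}" for i in range(100)],
})

SEED = 42  # arbitrary value, chosen only for illustration

# Two draws with the same seed select the same rows in the same order.
first = toy.sample(n=10, random_state=SEED)
second = toy.sample(n=10, random_state=SEED)
print(first.equals(second))
```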