# Create MongoDB collection of quotes from a corpus of Media Cloud stories

The pipeline can optionally keep *only* scare quotes and titles or *exclude* all scare quotes and titles. "Scare quotes" are defined here as any quote comprising three or fewer words.

This pipeline uses [pymongo](https://pymongo.readthedocs.io/en/stable/), the Python driver for MongoDB. See the end of the notebook for handy pymongo commands.

## Phase 1: Create the database

In [None]:
import csv
import pymongo
import pprint

This initializes a MongoDB client on a local server. For a remote server, use the format `'mongodb://SERVER_URL'`.

You need to create a database in that client and then a collection in that database. Your collection is where your stories and quotes will live. You can create and use multiple collections.

In [None]:
client = pymongo.MongoClient('mongodb://localhost:27017/')

database = client["media-cloud-client"]

collection = database.quotes_collection

## Phase 2: Get Stories Text from Media Cloud

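Read your Media Cloud API key from the `MC_API_KEY` environment variable.

In [None]:
import os
import datetime
import mediacloud.api

# Read the API key from the environment rather than hard-coding it.
MEDIA_CLOUD_API_KEY = os.environ['MC_API_KEY']
mc = mediacloud.api.AdminMediaCloud(MEDIA_CLOUD_API_KEY)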

We recommend fetching stories from Media Cloud in batches using `mc.storyList`. See the API [documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2stories_publiclist) for the parameters you can use to build your query.

Since you can only fetch 1000 stories at a time through the Media Cloud API, this code splits the `SAMPLE_SIZE` of your corpus into batches and makes multiple calls. `CALL_SIZE` is the number of stories you'll fetch per call to the Media Cloud API. While the hard maximum for `CALL_SIZE` is 1000, a smaller `CALL_SIZE` tends to run faster.

Use `sort=mc.SORT_RANDOM` to randomize the stories fetched by the query.

This code only adds stories with `stories_id` that do not already exist in your collection.
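
As a rough illustration of the batching described above, the sketch below works out how many calls a given `SAMPLE_SIZE` needs and how many rows each call would request. `plan_batches` is a hypothetical helper, not part of the Media Cloud client, and it assumes every call returns only new stories; in practice more calls may be needed because stories already in the collection are skipped.

```python
def plan_batches(sample_size, call_size, max_call_size=1000):
    """Return the row count for each API call needed to reach sample_size.

    Hypothetical helper for illustration only; it makes no network calls.
    """
    if sample_size < 0:
        raise ValueError("sample_size must not be negative")
    if not 1 <= call_size <= max_call_size:
        raise ValueError("call_size must be between 1 and {}".format(max_call_size))
    # Full-size calls first, then one smaller call for whatever is left over.
    full_calls, remainder = divmod(sample_size, call_size)
    batches = [call_size] * full_calls
    if remainder:
        batches.append(remainder)
    return batches


print(plan_batches(2500, 1000))  # [1000, 1000, 500]
print(len(plan_batches(2500, 250)))  # 10
```

A corpus of 2500 stories therefore takes three calls at the maximum batch size, or ten calls at a `CALL_SIZE` of 250.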

In [None]:
SAMPLE_SIZE = SIZE_OF_CORPUS  # placeholder: total number of stories you want
CALL_SIZE = BATCH_SIZE_FOR_FETCH_NO_GREATER_THAN_1000  # placeholder: stories per API call
last_processed_stories_id = 0
query = 'QUERY'
i = collection.count_documents({})

while i < SAMPLE_SIZE:
    stories = mc.storyList(query,
                        mc.publish_date_query(datetime.date(2012, 1, 1), datetime.date(2020, 4, 15)),
                        #sort=mc.SORT_RANDOM,
                        rows=CALL_SIZE,
                        last_processed_stories_id=last_processed_stories_id,
                        text=True)
    if not stories:
        # No more stories match the query; stop instead of failing on stories[-1].
        break
    last_processed_stories_id = stories[-1]['processed_stories_id']
    for s in stories:
        # Skip stories already stored, and stop inserting once the sample is full.
        if i < SAMPLE_SIZE and collection.find_one({'stories_id': s['stories_id']}) is None:
            collection.insert_one(s)
            i += 1

print('{} stories in collection'.format(i))

## Phase 3: Run Stanford CoreNLP

This phase can take a while to run, so we built a small job-based app to do it. Check out the [Quote-Annotator](https://github.com/mitmedialab/Quote-Annotator) to see how to add quotes to all the stories fetched in the previous phases.

Check the number of stories without quotes in your collection to make sure the annotation process ran correctly.

In [None]:
stories_unprocessed = collection.count_documents({'quotes' : { '$exists': False }})
print("{} stories unprocessed".format(stories_unprocessed))

## Phase 4: Optional Adjustments for Quotes

Determine if a given quote is a scare quote or title in quotation marks. This heuristic is based on our assessment that a quote comprising three words or fewer is typically a scare quote or a title in quotes.

In [None]:
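def is_scare_quote_or_title(quote):
    # Three or fewer words, or title case, counts as a scare quote or title.
    quote_split = quote['text'].split()
    # Return an explicit boolean in both cases (the original fell through to None).
    return len(quote_split) < 4 or quote['text'].istitle()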
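
The option to keep *only* scare quotes and titles, or to *exclude* them, could be applied with a small filter like the one below. This is a minimal sketch: `filter_quotes`, its `mode` argument, and `is_short_or_title` are hypothetical names, not part of the pipeline, and the predicate restates the three-word heuristic so the block runs on its own.

```python
def is_short_or_title(quote, max_words=3):
    """Restates the heuristic: a few words, or title case, marks a scare quote or title."""
    text = quote['text']
    return len(text.split()) <= max_words or text.istitle()


def filter_quotes(quotes, mode='exclude'):
    """Keep only scare quotes/titles (mode='only') or drop them (mode='exclude')."""
    if mode == 'only':
        return [q for q in quotes if is_short_or_title(q)]
    if mode == 'exclude':
        return [q for q in quotes if not is_short_or_title(q)]
    raise ValueError("mode must be 'only' or 'exclude'")


# Made-up sample quotes for illustration.
sample = [
    {'text': 'fake news'},                        # two words
    {'text': 'The Great Gatsby Returns Again'},   # title case
    {'text': 'we will not comment on the ongoing investigation'},
]

print([q['text'] for q in filter_quotes(sample, 'only')])
# ['fake news', 'The Great Gatsby Returns Again']
print([q['text'] for q in filter_quotes(sample, 'exclude')])
# ['we will not comment on the ongoing investigation']
```

Because `str.istitle()` is true for any text in which every word is capitalized, longer title-case strings are also treated as titles regardless of their length.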

CoreNLP often uses a pronoun or 'Unknown' for attribution in the `speaker` and/or `canonical_speaker` fields. This heuristic replaces that pronoun or 'Unknown' with a proper noun if one exists in either the `speaker` or `canonical_speaker` attribute of the extracted quote's collection entry.

In [None]:
not_speakers = ['he', 'his', 'she', 'her', 'Unknown']

def assumed_speaker(speaker, canonical_speaker):
    # Use canonical_speaker only when speaker is a pronoun or 'Unknown'
    # and canonical_speaker is not; in every other case keep speaker.
    if speaker in not_speakers and canonical_speaker not in not_speakers:
        return canonical_speaker
    return speaker
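
To see how the speaker heuristic behaves on each combination of inputs, here is a small self-contained sketch. `resolve_speaker` is a hypothetical stand-in that follows the same rule as described above, and the speaker values are made up for illustration.

```python
# Pronouns and the 'Unknown' placeholder that should not count as a real speaker.
NOT_SPEAKERS = {'he', 'his', 'she', 'her', 'Unknown'}


def resolve_speaker(speaker, canonical_speaker):
    """Prefer canonical_speaker only when speaker is a placeholder and it is not."""
    if speaker in NOT_SPEAKERS and canonical_speaker not in NOT_SPEAKERS:
        return canonical_speaker
    return speaker


examples = [
    ('he', 'Barack Obama'),      # pronoun replaced by the name
    ('Obama', 'Unknown'),        # name kept
    ('she', 'Unknown'),          # nothing better available, pronoun kept
    ('He', 'Barack Obama'),      # capitalized pronoun is not in the list
]
for speaker, canonical in examples:
    print('{!r}, {!r} -> {!r}'.format(speaker, canonical, resolve_speaker(speaker, canonical)))
```

The last case shows that the lookup is case-sensitive: a capitalized `'He'` is treated as a name and returned unchanged. If your data contains capitalized pronouns, you may want to extend the list accordingly.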

Add a snippet from `story_text` to the quote entry to show the quote in context. This is useful if you are trying to determine whether CoreNLP's quote attribution is correct. This example code creates snippets that are at most 800 characters longer than the extracted quote (up to 400 characters on each side).

In [None]:
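def text_snippet(story, quote):
    # Clamp both ends so the slice stays inside the story text.
    snippet_begin = max(0, quote['begin_char']-400)
    snippet_end = min(len(story['story_text']), quote['end_char']+400)
    # Return the snippet so the caller can store it on the quote entry.
    return story['story_text'][snippet_begin:snippet_end]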
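
The clamping at the start and end of the story text is easiest to see on a toy example. The sketch below uses a hypothetical `context_snippet` helper that takes the text and offsets directly, and it assumes `end_char` is an exclusive offset (one past the last character of the quote); check that against your annotated data.

```python
def context_snippet(story_text, begin_char, end_char, window=400):
    """Return the quote plus up to `window` characters of context on each side."""
    start = max(0, begin_char - window)            # do not run past the beginning
    end = min(len(story_text), end_char + window)  # do not run past the end
    return story_text[start:end]


# A quote in the middle of a long story gets the full window on both sides.
story_text = 'a' * 500 + 'QUOTE' + 'b' * 500
middle = context_snippet(story_text, 500, 505)
print(len(middle))  # 805

# A quote at the very start only has context after it.
early = context_snippet('QUOTE' + 'b' * 500, 0, 5)
print(len(early))  # 405
```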

## Phase 5: Output CSV

In [None]:
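import csv

with open('annotated_quotes.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames = ['_id','stories_id', 'text'])
    writer.writeheader()
    for story in collection.find({'quotes' : { '$exists': True }}):
        # Assumes each story's 'quotes' field is a list of quote dicts; write one row per quote.
        for quote in story['quotes']:
            writer.writerow({'_id': story['_id'],
                             'stories_id': story['stories_id'],
                             'text': quote['text']})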
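
To check the CSV layout without touching the database, you can flatten a sample story into rows and write them to an in-memory buffer. This is a minimal sketch: `quote_rows` is a hypothetical helper, the story below is made-up data, and it assumes each story stores its quotes as a list of dicts under `quotes`.

```python
import csv
import io


def quote_rows(story):
    """Yield one CSV row per quote, carrying the story identifiers along."""
    for quote in story.get('quotes', []):
        yield {'_id': story['_id'],
               'stories_id': story['stories_id'],
               'text': quote['text']}


# Made-up story document for illustration.
story = {
    '_id': 'abc123',
    'stories_id': 42,
    'quotes': [
        {'text': 'fake news', 'speaker': 'he'},
        {'text': 'We have no comment at this time.', 'speaker': 'Unknown'},
    ],
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['_id', 'stories_id', 'text'])
writer.writeheader()
writer.writerows(quote_rows(story))
print(buffer.getvalue())
```

Building each row explicitly keeps extra quote fields such as `speaker` out of the output; passing a dict with keys outside `fieldnames` to `DictWriter.writerow` raises a `ValueError` by default.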

## Useful pymongo commands

Count documents in collection

In [None]:
collection.count_documents({})

Delete a key from all documents. You can also use `update_one` to run this operation on a single document.

In [None]:
collection.update_many({}, { '$unset' : { 'snippet'  : 1} })

Drop a collection (deletes all of its documents along with the collection itself)

In [None]:
collection.drop()