# Using MongoDB in Jupyter with WE1S Data

This notebook documents methods for using MongoDB in Jupyter.
It reviews some basics, then focuses on working with WE1S data -- for example:

1. notebook import of articles based on a database search
2. database import of zip collections
3. updating articles in the database
4. ...

## General Reference

-  https://api.mongodb.com/python/3.0.3/tutorial.html
-  https://realpython.com/introduction-to-mongodb-and-python/
-  https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
-  https://docs.mongodb.com/manual/core/databases-and-collections/
-  https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/
-  https://docs.mongodb.com/manual/reference/program/mongoimport/

For performance checks, add `%%time` to the first line of any cell.
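
`%%time` only exists inside IPython/Jupyter. When the same code is run as a plain script, a small context manager gives comparable wall-time information. This is a minimal sketch; the name `wall_timer` and the `timings` dict are illustrative and not part of the notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label, results=None):
    """Measure the wall-clock time of the enclosed block, similar in spirit to %%time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the figure for later comparison
        print(f'{label}: {elapsed:.4f} s')


timings = {}
with wall_timer('sum of squares', timings):
    total = sum(i * i for i in range(100_000))
```

Passing a dict is optional; it is only useful when several timings need to be compared afterwards.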

## Connecting, Databases and Collections

In [185]:
import json
import os
import pprint
pp = pprint.PrettyPrinter(indent=2, compact=False)

In [186]:
# import mongodb package

from pymongo import MongoClient 
from pymongo.errors import DuplicateKeyError, InvalidDocument

In [187]:
import sys
sys.path.insert(0, '/home/jovyan/utils/preprocessing/')

from libs.fuzzyhasher.fuzzyhasher import FuzzyHasher
from libs.zipeditor.zipeditor import ZipEditor, zip_scanner, zip_scanner_excludedirs, ZipProcessor
from we1s_utils.ziputils import BatchJSONUploader


## Setup connection

In [211]:
# connect with MongoDB URI:

client = MongoClient('mongodb://mongo/')
db = client['we1s']
corpus = db['Corpus'] ## REVISE AND IN BELOW CODE
coll = db['humanities_keywords']

In [212]:
print('Note in passing: db and client are recoverable from the collection object.\n')
print(coll)
print(coll.name)
print(coll.database)
print(coll.database.name)
print(coll.database.client)
print(db.name)
print(db.client)

Note in passing: db and client are recoverable from the collection object.

Collection(Database(MongoClient(host=['mongo:27017'], document_class=dict, tz_aware=False, connect=True), 'we1s'), 'humanities_keywords')
humanities_keywords
Database(MongoClient(host=['mongo:27017'], document_class=dict, tz_aware=False, connect=True), 'we1s')
we1s
MongoClient(host=['mongo:27017'], document_class=dict, tz_aware=False, connect=True)
we1s
MongoClient(host=['mongo:27017'], document_class=dict, tz_aware=False, connect=True)


In [189]:
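def search_articles_source(collection, source):
    """Print one article whose 'sources' field matches the regex `source`."""
    query = { 'sources': { '$regex' : source } }
    client = MongoClient('mongodb://mongo/')
    # find_one returns a single document, or None if nothing matches
    result = client['we1s'][collection].find_one(query)
    hits = 0 if result is None else 1
    pp.pprint(result)
    print('count:', hits)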
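
The search term above is passed to `$regex` unchanged, so characters such as `.` or `(` are interpreted as regex syntax. A minimal sketch of a query builder that treats the term as a literal substring is shown below. The helper name `regex_query` is hypothetical, and it assumes that escapes produced by Python's `re.escape` are also valid in the regex dialect MongoDB uses, which holds for ordinary punctuation but is worth checking for unusual input.

```python
import re


def regex_query(field, term, ignore_case=False):
    """Build a MongoDB filter matching `term` as a literal substring of `field`."""
    condition = {'$regex': re.escape(term)}  # escape regex metacharacters
    if ignore_case:
        condition['$options'] = 'i'  # case-insensitive matching
    return {field: condition}


query = regex_query('sources', 'bangor', ignore_case=True)
print(query)
```

The resulting dict can be passed to `find_one` or `find` in place of the hand-written query.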

In [190]:
search_articles_source('humanities_keywords', 'bangor')

{ '_id': ObjectId('5d3a9ee9f123b8357f3e0c7e'),
  'attachment_id': '',
  'author': '',
  'content': 'COLLEGE NAC Championships UM-Farmington women 41, Castleton '
             'State 45, Mount Ida 67, Johnson State 69, Elms, Bay Path, Lasell '
             'no team scores 1. Andrew Spearrin (UMF) 22:00. 4, 2. Katie '
             'Sprowl (Castleton) 22:39. 7, 3. Stacey Sarber (MI) 23:07. 8, 4. '
             'Jen Dickie (Johnson) 23:24. 8, 5. Michelle Bird (Castleton) '
             '23:41. 1, 6. Kathryn Dickinson (UMF) 24:51. 5, 7. Paque Render '
             "(Johnson) 24:26. 9, 8. Misty O'Brien (UMF) 24:29. 6, 9. Erika "
             'Hoddinott (UMF) 24:30. 4, 10. Yvonne Olney (Castleton) 24:41. 5, '
             '11. Danielle Ring (MI) 25:22. 3, 12. Liliana German (Johnson) '
             '25:41. 9, 13. Beth Pantzer (Castleton) 25:49. 7, 14. Christine '
             'Palleschi (MI) 25:53. 1, 15. Ariel Delany (Castleton) 26:54. 1 '
             'UM-Farmington men 17, Johnson State 53

                [ '24:29',
                  '24:29',
                  '24:29',
                  'NUM',
                  'CD',
                  'False',
                  ['B', 'CARDINAL']],
                ['.', '.', '.', 'NOUN', 'NN', 'False', ['O', '']],
                ['6', '6', '6', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                ['9', '9', '9', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                ['.', '.', '.', 'PUNCT', '.', 'False', ['O', '']],
                ['Erika', 'erika', 'Erika', 'PROPN', 'NNP', 'False', ['O', '']],
                [ 'Hoddinott',
                  'hoddinott',
                  'Hoddinott',
                  'PROPN',
                  'NNP',
                  'False',
                  ['O', '']],
                ['(', '(', '(', 'PUNCT', '-LRB-', 'False', ['O', '']],
                ['UMF', 'umf', 'UMF', 'PROPN', 'NNP', 'False', ['B', 'ORG']],
                [')', ')

                  'PROPN',
                  'NNP',
                  'False',
                  ['B', 'ORG']],
                ['128', '128', '128', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                [ 'Wesleyan',
                  'wesleyan',
                  'Wesleyan',
                  'PROPN',
                  'NNP',
                  'False',
                  ['O', '']],
                ['131', '131', '131', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                [ 'Amherst',
                  'amherst',
                  'Amherst',
                  'PROPN',
                  'NNP',
                  'False',
                  ['O', '']],
                ['137', '137', '137', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                [ 'Middlebury',
                  'm

                  'NUM',
                  'CD',
                  'False',
                  ['B', 'CARDINAL']],
                ['.', '.', '.', 'PUNCT', '.', 'False', ['O', '']],
                ['00', '00', '00', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                ['2', '2', '2', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                ['.', '.', '.', 'PUNCT', '.', 'False', ['O', '']],
                [ 'Michael Lansing',
                  'michael lansing',
                  'Michael',
                  'PROPN',
                  'NNP',
                  'False',
                  ['B', 'PERSON']],
                ['(', '(', '(', 'PUNCT', '-LRB-', 'False', ['O', '']],
                [ 'Maine',
                  'maine',
                  'Maine',
                  'PROPN',
                  'NNP',
                  'False',
                  ['B', 'GPE']],
                [')', ')', ')', 'PUNCT', '-RRB-', 'F

                  'False',
                  ['B', 'CARDINAL']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                ['2', '2', '2', 'NUM', 'CD', 'False', ['B', 'CARDINAL']],
                ['.', '.', '.', 'PUNCT', '.', 'False', ['O', '']],
                [ 'Michael Bunker',
                  'michael bunker',
                  'Michael',
                  'PROPN',
                  'NNP',
                  'False',
                  ['B', 'PERSON']],
                ['(', '(', '(', 'PUNCT', '-LRB-', 'False', ['O', '']],
                ['USM', 'usm', 'USM', 'PROPN', 'NNP', 'False', ['B', 'ORG']],
                [')', ')', ')', 'PUNCT', '-RRB-', 'False', ['O', '']],
                [ '25:17',
                  '25:17',
                  '25:17',
                  'NUM',
                  'CD',
                  'False',
                  ['B', 'DATE']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                ['3', '3', '3', '

                  'NNP',
                  'False',
                  ['B', 'PERSON']],
                ['25:28', '25:28', '25:28', 'NUM', 'CD', 'False', ['O', '']],
                [',', ',', ',', 'PUNCT', ',', 'False', ['O', '']],
                ['76', '76', '76', 'NUM', 'CD', 'False', ['B', 'DATE']],
                ['.', '.', '.', 'PUNCT', '.', 'False', ['O', '']],
                [ 'Hattie Landry',
                  'hattie landry',
                  'Hattie',
                  'PROPN',
                  'NNP',
                  'False',
                  ['B', 'PERSON']],
                ['25:36', '25:36', '25:36', 'NUM', 'CD', 'False', ['O', '']]],
  'language_model': { 'accuracy': { 'ents_f': 85.8587845242,
                                    'ents_p': 86.3317889027,
                                    'ents_r': 85.3909350025,
                                    'las': 89.6616629074,
                                    'tags_acc': 96.7783856079,
                               

In [191]:
def search_sources_tags(tag_term):
    hits = 0
    query = { 'tags': { '$regex' : tag_term } }
    client = MongoClient('mongodb://mongo/')
    results = client['Sources']['Sources'].find(query)
    for result in results:
        hits += 1
        pp.pprint(result)
        print('count:', hits)
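
Because `tags` is an array, a `$regex` condition matches a document when any single tag matches. The tags are hierarchical paths (for example `education/funding/...`), so anchoring the pattern with `^` restricts the match to tags that start with a given prefix instead of containing it anywhere. The sketch below builds such a query and checks the pattern offline with Python's `re` module; `tag_prefix_query` is a hypothetical helper, and the sample tags are taken from one record in the output of `search_sources_tags("education")`.

```python
import re


def tag_prefix_query(prefix):
    """Filter for documents with at least one tag starting with `prefix`."""
    return {'tags': {'$regex': '^' + re.escape(prefix)}}


# Tags copied from the '21st-century-principal' record in this notebook.
sample_tags = ['region/US/South', 'media/website', 'education/perspective/K-12']

query = tag_prefix_query('education/')
pattern = re.compile(query['tags']['$regex'])

# Offline check of which tags the anchored pattern would accept.
matching = [tag for tag in sample_tags if pattern.search(tag)]
print(matching)
```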

In [192]:
search_sources_tags("education")

{ '_id': '21st-century-principal',
  'aliases': ['21st-century-principal'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': '21st-century-principal',
  'namespace': 'we1s2.0',
  'tags': ['region/US/South', 'media/website', 'education/perspective/K-12'],
  'title': '21st Century Principal'}
count: 1
{ '_id': 'accent-austin-community-college',
  'aliases': ['Accent: Austin Community College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'accent-austin-community-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US public college',
            'education/demographic/Hispanic-serving Institution',
            'education/institution/Community College'],
  'title': 'Accent: Austin Community College'}
count: 2
{ '_id': 'advance-titan-university-of-wisconsin-oshkosh',
  'aliases': ['Advance-Titan: University of Wisconsin - Oshkosh'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sourc

            'education/institution/Doctoral'],
  'title': 'Colorado Daily: University of Colorado at Boulder'}
count: 46
{ '_id': 'commonwealth-times-virginia-commonwealth-university',
  'aliases': ['Commonwealth Times: Virginia Commonwealth University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'commonwealth-times-virginia-commonwealth-university',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'Commonwealth Times: Virginia Commonwealth University'}
count: 47
{ '_id': 'concordiensis-union-college',
  'aliases': ['Concordiensis: Union College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'concordiensis-union-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/North East',
            'education/funding/US private college',
            'education/institution/Liberal Arts'],
  'title': 'Concordiensis:

  'language': 'en',
  'metapath': 'Sources',
  'name': 'et-cetera-eastfield-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US public college',
            'education/demographic/Hispanic-serving Institution',
            'education/institution/Community College'],
  'title': 'Et Cetera: Eastfield College'}
count: 83
{ '_id': 'exponent-montana-state-university',
  'aliases': ['Exponent: Montana State University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'exponent-montana-state-university',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/Rockies and Southwest',
            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'Exponent: Montana State University'}
count: 84
{ '_id': 'factory-times-suny-institute-of-technology-utica-rome',
  'aliases': ['Factory Times: SUNY Institute of Technology at Utica-Rome'],
  'country': 'US',
  'language': 'en',
  'metapat

            'education/demographic/religion/Christian'],
  'title': "Mars' Hill: Trinity Western University"}
count: 122
{ '_id': 'massachusetts-daily-collegian-university-of-massachusetts-amherst',
  'aliases': [ 'Massachusetts Daily Collegian: University of '
               'Massachusetts-Amherst'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'massachusetts-daily-collegian-university-of-massachusetts-amherst',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/North East',
            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'Massachusetts Daily Collegian: University of Massachusetts-Amherst'}
count: 123
{ '_id': 'medill-news-service-northwestern-university-washington-dc',
  'aliases': ['Medill News Service/Y Vote 2000'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'medill-news-service-northwestern-university-washington-dc',
  'namespace': 'we1s2.0',
  'tags': [ 'region/U

  'name': 'sonoma-state-star-sonoma-state-university',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/West Coast',
            'education/funding/US public college',
            'education/demographic/Hispanic-serving Institution',
            'education/affiliation/Cal State system',
            'education/institution/Liberal Arts'],
  'title': 'Sonoma State Star: Sonoma State University'}
count: 162
{ '_id': 'sounds-newspaper-south-puget-sound-community-college',
  'aliases': ['Sounds Newspaper: South Puget Sound Community College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'sounds-newspaper-south-puget-sound-community-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/West Coast',
            'education/funding/US public college',
            'education/institution/Community College'],
  'title': 'Sounds Newspaper: South Puget Sound Community College'}
count: 163
{ '_id': 'southern-accent-southern-adventist-university',
  'aliases': ['The S

  'name': 'the-bear-facts-pikeville-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US private college',
            'education/demographic/religion/Christian',
            'education/institution/Liberal Arts'],
  'title': 'The Bear Facts: Pikeville College'}
count: 202
{ '_id': 'the-bell-ringer-augusta-state-university',
  'aliases': ['The Bell Ringer: Augusta State University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-bell-ringer-augusta-state-university',
  'namespace': 'we1s2.0',
  'tags': ['region/US/South', 'education/funding/US public college'],
  'title': 'The Bell Ringer: Augusta State University'}
count: 203
{ '_id': 'the-bell-tower-freed-hardeman-university',
  'aliases': ['The Bell Tower: Freed-Hardeman University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-bell-tower-freed-hardeman-university',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South

            'education/institution/Liberal Arts'],
  'title': 'The Carroll News: John Carroll University'}
count: 239
{ '_id': 'the-catalyst-colorado-college',
  'aliases': ['The Catalyst: Colorado College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-catalyst-colorado-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/Rockies and Southwest',
            'education/funding/US private college',
            'education/institution/Liberal Arts'],
  'title': 'The Catalyst: Colorado College'}
count: 240
{ '_id': 'the-centre-college-cento-centre-college',
  'aliases': ['The Centre College Cento: Centre College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-centre-college-cento-centre-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US private college',
            'education/institution/Liberal Arts'],
  'title': 'The Centre College Cento: Centre College'}
count: 24

  'tags': [ 'region/US/North East',
            'education/funding/US private college',
            'education/demographic/religion/Jewish',
            'education/institution/Doctoral'],
  'title': 'The Commentator: Yeshiva University'}
count: 271
{ '_id': 'the-communicator-chattanooga-state-technical-community-college',
  'aliases': [ 'The Communicator: Chattanooga State Technical Community '
               'College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-communicator-chattanooga-state-technical-community-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US public college',
            'education/emphasis/Science Tech and Ag school',
            'education/institution/Community College'],
  'title': 'The Communicator: Chattanooga State Technical Community College'}
count: 272
{ '_id': 'the-communicator-indiana-purdue-ft-wayne',
  'aliases': ['The Communicator'],
  'country': 'US',
  'language': 

  'title': 'The Daily Emerald: University of Oregon'}
count: 301
{ '_id': 'the-daily-free-press-boston-university',
  'aliases': ['The Daily Free Press: Boston University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-daily-free-press-boston-university',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/North East',
            'education/funding/US private college',
            'education/institution/Doctoral'],
  'title': 'The Daily Free Press: Boston University'}
count: 302
{ '_id': 'the-daily-gamecock-university-of-south-carolina',
  'aliases': ['The Gamecock'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-daily-gamecock-university-of-south-carolina',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'The Daily Gamecock: University of South Carolina - Columbia'}
count: 303
{ '_id': 'the-daily-iowan-un

  'tags': [ 'region/US/Midwest',
            'education/funding/US private college',
            'education/institution/Doctoral'],
  'title': 'The Flyer News: University of Dayton'}
count: 341
{ '_id': 'the-foghorn-university-of-san-francisco',
  'aliases': ['The Foghorn: University of San Francisco'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-foghorn-university-of-san-francisco',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/West Coast',
            'education/funding/US private college',
            'education/demographic/religion/Catholic'],
  'title': 'The Foghorn: University of San Francisco'}
count: 342
{ '_id': 'the-forum-piedmont-virginia-community-college',
  'aliases': ['The Forum: Piedmont Virginia Community College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-forum-piedmont-virginia-community-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US publi

            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'The ISU Bengal: Idaho State University'}
count: 381
{ '_id': 'the-iupui-sagamore',
  'aliases': ['The IUPUI Sagamore'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-iupui-sagamore',
  'namespace': 'we1s2.0',
  'tags': ['region/US/Midwest', 'education/funding/US public college'],
  'title': 'The IUPUI Sagamore'}
count: 382
{ '_id': 'the-j-tac-tarleton-state-university',
  'aliases': ['The J-TAC: Tarleton State University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-j-tac-tarleton-state-university',
  'namespace': 'we1s2.0',
  'tags': ['region/US/South', 'education/funding/US public college'],
  'title': 'The J-TAC: Tarleton State University'}
count: 383
{ '_id': 'the-jambar-youngstown-state-university',
  'aliases': ['The Jambar: Youngstown State University'],
  'country': 'US',
  'language': 'en',
  'metapat

            'education/institution/Doctoral'],
  'title': 'The Marquette Tribune: Marquette University'}
count: 420
{ '_id': 'the-mass-media-university-of-massachusetts-boston',
  'aliases': ['The Mass Media: University of Massachusetts - Boston'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-mass-media-university-of-massachusetts-boston',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/North East',
            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'The Mass Media: University of Massachusetts - Boston'}
count: 421
{ '_id': 'the-mcgill-tribune-mcgill-university',
  'aliases': [ 'The McGill Tribune: McGill University',
               'The McGill Tribune: McGill University'],
  'country': 'CA',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-mcgill-tribune-mcgill-university',
  'namespace': 'we1s2.0',
  'tags': ['region/Canada', 'education/non-US college'],
  'title': 'The McGill Trib

  'aliases': ['The Oracle: SUNY College at New Paltz'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-oracle-suny-new-paltz',
  'namespace': 'we1s2.0',
  'tags': ['region/US/North East', 'education/funding/US public college'],
  'title': 'The Oracle: SUNY College at New Paltz'}
count: 462
{ '_id': 'the-orion-california-state-university-chico',
  'aliases': ['The Orion: California State University - Chico'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-orion-california-state-university-chico',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/West Coast',
            'education/funding/US public college',
            'education/demographic/Hispanic-serving Institution',
            'education/affiliation/Cal State system'],
  'title': 'The Orion: California State University - Chico'}
count: 463
{ '_id': 'the-ottawa-campus-ottawa-university',
  'aliases': ['The Ottawa Campus: Ottawa University'],
  'country': 'US',
  'lang

  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'The Reveille: Louisiana State University'}
count: 502
{ '_id': 'the-review-university-of-delaware',
  'aliases': ['The Review'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-review-university-of-delaware',
  'namespace': 'we1s2.0',
  'tags': [ 'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'The Review'}
count: 503
{ '_id': 'the-reviewmckn-mckendree-college',
  'aliases': ['The Review/MCKN: McKendree College'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-reviewmckn-mckendree-college',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/Midwest',
            'education/funding/US private college',
            'education/demographic/religion/Christian',
            'education/institution/Doctoral'],
  'title': 

count: 542
{ '_id': 'the-spectrum-sacred-heart-university',
  'aliases': ['The Spectrum: Sacred Heart University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-spectrum-sacred-heart-university',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/North East',
            'education/funding/US private college',
            'education/demographic/religion/Catholic'],
  'title': 'The Spectrum: Sacred Heart University'}
count: 543
{ '_id': 'the-spinnaker-university-of-north-florida',
  'aliases': ['The Spinnaker: University of North Florida'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-spinnaker-university-of-north-florida',
  'namespace': 'we1s2.0',
  'tags': ['region/US/South', 'education/funding/US public college'],
  'title': 'The Spinnaker: University of North Florida'}
count: 544
{ '_id': 'the-spur-southwest-state-university',
  'aliases': ['The Spur: Southwest State University'],
  'country': 'US',
  'language': 'en

  'namespace': 'we1s2.0',
  'tags': [ 'region/US/South',
            'education/funding/US public college',
            'education/demographic/Hispanic-serving Institution',
            'education/institution/Doctoral'],
  'title': 'The University Star: Texas State University - San Marcos'}
count: 583
{ '_id': 'the-university-times-trinity-college',
  'aliases': ['The University Times: Trinity College'],
  'country': 'IE',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-university-times-trinity-college',
  'namespace': 'we1s2.0',
  'tags': ['region/Europe', 'education/non-US college'],
  'title': 'The University Times: Trinity College'}
count: 584
{ '_id': 'the-urban-wire-ngee-ann-polytechnic',
  'aliases': ['The Urban Wire: Ngee Ann Polytechnic'],
  'country': 'IE',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'the-urban-wire-ngee-ann-polytechnic',
  'namespace': 'we1s2.0',
  'tags': ['region/Asia', 'education/non-US college'],
  'title': 'The Urban Wire: Ngee An

  'language': 'en',
  'metapath': 'Sources',
  'name': 'university-wire',
  'namespace': 'we1s2.0',
  'tags': [ 'media/news wires and aggregators',
            'media/US newspaper',
            'education/perspective/students'],
  'title': 'University Wire'}
count: 624
{ '_id': 'utvs-6-television-st-cloud-state-university',
  'aliases': ['UTVS 6 Television: St Cloud State University'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'utvs-6-television-st-cloud-state-university',
  'namespace': 'we1s2.0',
  'tags': ['region/US/Midwest', 'education/funding/US public college'],
  'title': 'UTVS 6 Television: St Cloud State University'}
count: 625
{ '_id': 'uwm-post',
  'aliases': ['UWM Post'],
  'country': 'US',
  'language': 'en',
  'metapath': 'Sources',
  'name': 'uwm-post',
  'namespace': 'we1s2.0',
  'tags': [ 'region/US/Midwest',
            'education/funding/US public college',
            'education/institution/Doctoral'],
  'title': 'UWM Post'}
count: 6

## Information

In [193]:
# retrieve the build information
client['we1s'].command("buildinfo")

{'version': '4.0.11',
 'gitVersion': '417d1a712e9f040d54beca8e4943edce218e9a8c',
 'modules': [],
 'allocator': 'tcmalloc',
 'javascriptEngine': 'mozjs',
 'sysInfo': 'deprecated',
 'versionArray': [4, 0, 11, 0],
 'openssl': {'running': 'OpenSSL 1.0.2g  1 Mar 2016',
  'compiled': 'OpenSSL 1.0.2g  1 Mar 2016'},
 'buildEnvironment': {'distmod': 'ubuntu1604',
  'distarch': 'x86_64',
  'cc': '/opt/mongodbtoolchain/v2/bin/gcc: gcc (GCC) 5.4.0',
  'ccflags': '-fno-omit-frame-pointer -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -Werror -O2 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-but-set-variable -Wno-missing-braces -fstack-protector-strong -fno-builtin-memcmp',
  'cxx': '/opt/mongodbtoolchain/v2/bin/g++: g++ (GCC) 5.4.0',
  'cxxflags': '-Woverloaded-virtual -Wno-maybe-uninitialized -std=c++14',
  'target_arch': 'x86_64',
  'target_os': 'linux'},
 'bits': 64,
 'debug': False,
 'maxBsonObjectSize': 167

In [55]:
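# list document counts for all collections in all databases
# (loop variables are named so they do not overwrite the db and coll objects defined above)

d = {db_name: client[db_name].list_collection_names()
     for db_name in client.list_database_names()}
for db_name in d:
    for coll_name in d[db_name]:
        # print(db_name, coll_name)
        print(client[db_name].command("collstats", coll_name)['count'], db_name, coll_name)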

1 admin system.version
2 config system.sessions
79 local startup_log
752243 we1s reddit
508490 we1s humanities_keywords_no_exact
12 we1s comparison-not-humantiies-filter
418302 we1s humanities_keywords
6607 we1s comparison-sciences-filter
628317 we1s comparison-sciences
111396 we1s deletes_humanities
635495 we1s comparison-not-humanities
47320 we1s deletes_reddit
44114 we1s deletes_comparison-not-humanities
66612 we1s deletes_comparison-sciences
0 we1s2018 Sources
548329 we1s2018 Corpus
1 we1s2018 testcollection


In [58]:
%%time
# get the estimated number of documents in a collection (read from collection metadata)
# ...fast

count = client['we1s']['humanities_keywords'].estimated_document_count()
print(count)

418302
CPU times: user 727 µs, sys: 46 µs, total: 773 µs
Wall time: 1.15 ms


In [None]:
%%time
# get the exact number of documents in a collection
# ...slow, very slow

count = client['we1s']['humanities_keywords'].count_documents({})
print(count)
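
The two counting cells suggest a simple rule: use the fast estimate when no filter is needed, and pay for the exact count only when a filter is given. The sketch below wraps that rule in one function. `document_count` and `FakeCollection` are hypothetical names; `FakeCollection` is an in-memory stand-in that exists only so the example runs without a server, and a real pymongo collection can be passed instead.

```python
def document_count(collection, query=None):
    """Return a fast estimate for the whole collection, or an exact count for a filter."""
    if not query:
        # Metadata-based estimate; it cannot take a filter.
        return collection.estimated_document_count()
    # Exact count of matching documents; can be slow on large collections.
    return collection.count_documents(query)


class FakeCollection:
    """Hypothetical in-memory stand-in exposing the two counting methods."""

    def __init__(self, docs):
        self.docs = list(docs)

    def estimated_document_count(self):
        return len(self.docs)

    def count_documents(self, query):
        # Supports plain equality filters only, which is enough for this demo.
        return sum(all(doc.get(k) == v for k, v in query.items()) for doc in self.docs)


demo = FakeCollection([{'pub': 'CNN'}, {'pub': 'KCET'}, {'pub': 'CNN'}])
print(document_count(demo))
print(document_count(demo, {'pub': 'CNN'}))
```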

In [59]:
%%time
# get a list of unique values for a given key
# ...very slow

key_list = client['we1s']['humanities_keywords'].distinct('pub')
print(len(key_list), key_list)

# distinct() returns only the unique values, not how often each one occurs;
# for per-value counts, use an aggregation query

1557 ['Bolivian Express Magazine', 'KOCE', 'HBO', 'KCBS', 'KNBC', 'FOX_News', 'MSNBC', 'KABC', 'CNN', 'HLN', 'KCAL', 'KCET', 'KTLA', 'KTTV_FOX', 'AlJazeera', 'WEWS', 'ComedyCentral', 'WUAB', 'CSPAN', 'DigitalEphemera', 'WWW', 'CampaignAds', 'Shooters', 'Current', 'CSPAN2', None, 'School Library Journal Reviews', 'School Library Journal', 'American School & University', 'Teacher Magazine', 'Variety', '21st Century Principal', 'Newsweek', 'DailyKos', 'Amandala', 'The Houston Chronicle', 'HOUSTON CHRONICLE', 'ANDINA Peru News Agency', 'Banderas News', 'Counter-Currents', 'American Renaissance', 'College Board', 'American Thinker', 'Common Dreams', 'Arizona Daily Sun', 'Daily Dot', 'Baltimore City Paper', 'Commentary Magazine', 'Boston Review', 'Conservative Treehouse', 'Current Affairs', 'Breitbart News Network', 'Detroit Free Press', 'BuzzFeed', 'Daily Signal', 'Chronicle of Higher Education', 'Dollars and Sense', 'CityLab', 'Education Next', 'CNSNews', 'Daily Wire', 'Democracy Now', 'Co

3 ['humanities', ' humanities', ' liberal_arts']
CPU times: user 157 ms, sys: 11.1 ms, total: 168 ms
Wall time: 4min 53s
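
The comment in the cell above points to aggregation queries for per-value counts. A minimal sketch of such a pipeline follows, using the standard `$group` and `$sort` stages. The function name `counts_by_field_pipeline` is hypothetical, and the documents in the offline illustration are made up; they only show what the `$group` stage computes.

```python
from collections import Counter


def counts_by_field_pipeline(field):
    """Aggregation pipeline: one output document per distinct value, with its count."""
    return [
        {'$group': {'_id': '$' + field, 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},  # most frequent values first
    ]


pipeline = counts_by_field_pipeline('pub')
# With a live connection this would be passed to:
#   client['we1s']['humanities_keywords'].aggregate(pipeline)

# Offline illustration of the grouping on made-up documents.
docs = [{'pub': 'CNN'}, {'pub': 'KCET'}, {'pub': 'CNN'}, {}]
counts = Counter(doc.get('pub') for doc in docs)
print(counts.most_common())
```

Like `distinct`, this has to read the whole collection unless an index on the field can be used, so it should be expected to be slow on a collection of this size.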


In [61]:
# collection statistics
# ...very verbose

client['we1s'].command("collstats", 'humanities_keywords')

{'ns': 'we1s.humanities_keywords',
 'size': 57477624093.0,
 'count': 418302,
 'avgObjSize': 137407,
 'storageSize': 15226941440.0,
 'capped': False,
 'wiredTiger': {'metadata': {'formatVersion': 1},
  'creationString': 'access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_custom=(prefix=,start_ge
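
Since the full `collstats` result is very verbose, it can help to keep only the headline fields and print sizes in readable units. The sketch below does that for the top-level keys visible in the output above; `human_bytes` and `summarize_collstats` are hypothetical helper names, and the sample values are copied from that output.

```python
def human_bytes(n):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    n = float(n)
    for unit in ('B', 'KiB', 'MiB', 'GiB', 'TiB'):
        if n < 1024 or unit == 'TiB':
            return f'{n:.1f} {unit}'
        n /= 1024


def summarize_collstats(stats):
    """Keep the headline fields of a collstats result and make sizes readable."""
    return {
        'ns': stats['ns'],
        'count': stats['count'],
        'size': human_bytes(stats['size']),
        'avgObjSize': human_bytes(stats['avgObjSize']),
        'storageSize': human_bytes(stats['storageSize']),
    }


# Values copied from the collstats output shown in this notebook.
stats = {'ns': 'we1s.humanities_keywords', 'size': 57477624093.0, 'count': 418302,
         'avgObjSize': 137407, 'storageSize': 15226941440.0, 'capped': False}
summary = summarize_collstats(stats)
print(summary)
```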

## Basic insert, find, delete

In [None]:
# you can create multiple database or collection objects.
# a new database or collection is created by defining it and then writing to it.

# neither the database nor the collection has been created yet
db = client['temp_db']
temp_corpus = db['temp_corpus']

# they are created when the first document is written.
temp_corpus.insert_one({'field': 'DELETEME'})

# confirm the document was added and assigned an _id
print(temp_corpus.find_one())

# db and collection are NOT deleted when all the documents are deleted
temp_corpus.delete_many({'field': {'$eq' : 'DELETEME'}})

# list all collections in a database:
print('collections: ', db.list_collection_names())

# list all databases on mongodb instance:
print('databases: ', client.list_database_names())

# cleaning up:
#   a collection can be deleted with drop
temp_corpus.drop()
#   a database can be deleted from the client by object or name
#   -- name is safer if using generic code with a db object
#   such as in a Jupyter testing notebook
client.drop_database('temp_db')
#   the temp_db database is now gone:
print('databases: ', client.list_database_names())

# re-define the default collection for all tests below
db = client['we1s']
corpus = db['Corpus']

In [None]:
# COPY DATABASE TO NEW NAME AND DELETE/DROP OLD
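# This is the closest thing mongodb has to a rename database command.
# It duplicates all data during the copy operation.
# NOTE: the copydb command was removed in MongoDB 4.2; on newer servers
# use mongodump / mongorestore instead.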

client.admin.command('copydb',
                     fromdb='we1s-old',
                     todb='we1s-new')
client.drop_database('we1s-old')

## Queries

In [33]:
# a query is a python dict.
# define a query for: all documents with "Angeles" in the pub field
query = {'pub': {'$regex':'.*Angeles.*'}}
# query = {'name': {'$regex':'.*thewallstreetjournal.*'}}

# retrieve a document
one_doc = client['we1s']['humanities_keywords'].find_one(query, {'features':0, 'language_model':0})

# display a document
one_doc['content'] = one_doc['content'][0:500] + '[...]' # snip content for preview readability
# print("one_doc:\n", one_doc, '\n')
pp.pprint(one_doc)



# draw three random documents on the server with the $sample aggregation stage
docs = client['we1s']['humanities_keywords'].aggregate([{ '$sample': { 'size': 3 } }])

for doc in docs:
    if 'features' in doc:
        doc.pop('features')
    if 'language_model' in doc:
        doc.pop('language_model')
    if 'content-unscrubbed' in doc:
        doc.pop('content-unscrubbed')
    doc['content'] = doc['content'][0:500] + '[...]'
    pp.pprint(doc)

{ '_id': ObjectId('5d3ab969f123b8357f43969d'),
  'attachment-id': '',
  'author': 'By Lester C. Thurow',
  'content': ' WHEN IT COMES to inventing new technologies, America has no '
             'peers. More than a proportionate share of new ideas are still '
             'American ideas. It was not always so. Prior to World War II, '
             'basic scientific breakthroughs tended to come from Europe. Then, '
             'for two or three decades after World War II, America had '
             'effortless technological superiority --it had no peers either as '
             'an inventor or user of new technologies. Now that era is gone, '
             'too. America faces foreign competitors -- Japan, West Germany, '
             'Sou[...]',
  'content-hash-ssdeep': '192:gEo7m2G3333s3MzzzzVQyYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTre6CtttO:gEx2lDyYYYYYYYYYYYYYYYYYYYYYYYYx',
  'copyright': 'Wall Street Journal , Eastern edition; New York, N.Y. [New '
               'York, N.Y]12 June 1987: 

In [75]:
result = client['we1s']['humanities_keywords'].find_one({'pub_date': ''})

print(result)


{'_id': ObjectId('5d3a98b1f123b8357f3c89b7'), 'doc_id': '02A6A252C52394AB97B14672E56C2F2F4F90A529D0561AD49F00CEDD2F664699C866F0EE71FC8C267F54EC7271BA873CEF5C6814683C799579950948989D66E2685A0B7A65C70808142051ACB1A0A249FEC8409EB8A86BB611B574F2927FD1D66CAB33485CF58E00CC3D383B19CF903E', 'attachment_id': '', 'pub': 'School Library Journal Reviews', 'pub_date': '', 'length': '332', 'section': 'MULTIMEDIA REVIEW; Video/DVD; Pg. 69', 'author': 'MaryAnn Karre', 'title': 'Playing to Win;  The College Admissions Experience; Getting an Edge: Early Admission; Admit, Defer, or Reject: The Admissions Perspective; Accept or Decline: The Applicant Perspective', 'copyright': '<br/>', 'content-unscrubbed': '(Series). 3 videocassettes or 3 DVDs. color. range 22-23 min. Prod. by ABC News. Dist. by Films for the Humanities & Sciences. 2003. video: $239.95 ser., $89.95 ea.; DVD: $269.95 ser., $99.95 ea. Includes: (ISBN 0-7365-8193-7); (ISBN 0-7365-8193-6); (ISBN 0-7365-8197-9). Gr 10-12 - These productions a

In [None]:
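# find documents that have neither a pub_date nor a raw_date field;
# the projection must include 'name', since that is the field printed below
results = client['we1s']['humanities_keywords'].find({'pub_date':{'$exists': False}, 'raw_date':{'$exists': False}},{'name': 1, '_id':0})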

hits = 0
for rec2 in results:
    hits += 1
    print(rec2['name'])
print('\nHits: ', hits)
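
The two cells above look for an empty `pub_date` and for a missing `pub_date` separately. A minimal sketch of combining both conditions in one filter follows; the helper name `missing_or_empty` is hypothetical, not part of pymongo or the WE1S utilities.

```python
def missing_or_empty(field):
    """Build a filter matching documents where `field` is absent or an empty string."""
    return {'$or': [{field: {'$exists': False}}, {field: ''}]}


# hypothetical usage against the collection defined earlier in this notebook:
#   corpus.count_documents(missing_or_empty('pub_date'))
query = missing_or_empty('pub_date')
print(query)
```

The result is an ordinary query dict, so it can be passed to `find`, `find_one`, or `count_documents` like any other.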

    

## Counting query results

Rather than receiving a list of results and counting the list,
ask the database for a count directly.

In [None]:
# get a count of how many documents match the query
# NOTE: does not take a projection argument

count = corpus.count_documents(query)
print("count:", count)

In [None]:
# compare two query counts

query = { 'name': {'$regex':'.*humanities.*' }}
count = corpus.count_documents(query)
query = { 'name': {'$regex':'.*liberal.*' }}
count2 = corpus.count_documents(query)
print("counts:", count, count2, '\n')

## Results: using cursors

Unlike a generator, a pymongo cursor supports index access (for example `result_set[5]`). It is also lazy: no query is sent to the server until you index into it or iterate over it.

In [None]:
# Access results by index

# create a cursor on a search
query = { 'name': {'$regex':'.*humanities.*' }}
projection = { 'content':0 }
result_set = corpus.find(query, projection)
# this is a cursor object, not a list of results:
print(result_set, '\n')
# random access to mongodb -- don't need to iterate in order
print(result_set[5], '\n')
print(result_set[0], '\n')

In [None]:
# Use limit to limit a set of results by index

# create a cursor on a search
query = { 'name': {'$regex':'.*humanities.*' }}
projection = { 'content':0 }
result_set = corpus.find(query, projection).limit(3)
# this is a cursor object, not a list of results:
print(result_set, '\n')
# weirdly, index access ignores the limit: result_set[5] still works here,
# but only until the cursor has been iterated over
# (limit also does not influence count)
# ...for more, see https://stackoverflow.com/questions/29604573/how-to-limit-mongo-query-in-python
print(result_set[5], '\n')
# iterate over the cursor -- this exhausts it
for result in result_set:
    print(result, '\n')
# print(result_set[0], '\n') # this now causes an error
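
Because index access ignores `limit`, paging through results is usually clearer with an explicit `skip` and `limit` pair. A small sketch of computing that pair, assuming zero-based page numbers; `page_window` is a hypothetical helper name.

```python
def page_window(page, page_size):
    """Return (skip, limit) for a zero-based page number."""
    if page < 0 or page_size < 1:
        raise ValueError('page must be >= 0 and page_size must be >= 1')
    return page * page_size, page_size


skip, limit = page_window(2, 10)
print(skip, limit)

# hypothetical usage with the cursor from the cell above; sorting on _id
# keeps the page boundaries stable between calls:
#   corpus.find(query, projection).sort('_id', 1).skip(skip).limit(limit)
```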


In [None]:
# Randomly selected results

# create a cursor on a search
query = { 'name': {'$regex':'.*humanities.*' }}
projection = { '_id':1 }
result_set = corpus.find(query, projection)
# create list of full cursor contents 
result_list = list(result_set)
# print(result_list[200])
import random
print('Download full list and get three random docs:\n', random.sample(result_list, 3), '\n')
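
Downloading every `_id` just to pick three is expensive on a large collection. The `$sample` stage used earlier in this notebook can do the selection on the server instead. Here is a sketch of building such a pipeline; `sample_pipeline` is a hypothetical helper name.

```python
def sample_pipeline(query, size, projection=None):
    """Build an aggregation pipeline: filter by `query`, then draw `size` random docs."""
    pipeline = [{'$match': query}, {'$sample': {'size': size}}]
    if projection:
        # optional: trim the returned fields, same syntax as a find() projection
        pipeline.append({'$project': projection})
    return pipeline


pipeline = sample_pipeline({'name': {'$regex': 'humanities'}}, 3, {'_id': 1})
print(pipeline)

# hypothetical usage:
#   for doc in corpus.aggregate(pipeline):
#       print(doc)
```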

## Insert records

In [None]:
# show the indexes defined on the collection
pp.pprint(corpus.index_information())

# you can create a document and insert it.

insert = {'doc_id': 'INSERT_TEST',
          'title': 'Fake Article Title',
         }

# on insert_one, an InsertOneResult object is returned.
# it can be queried for the _id of what was just inserted.

insert_id = corpus.insert_one(insert)
print(insert_id.inserted_id)

# clean up: delete the document
query = { 'doc_id': {'$eq': 'INSERT_TEST' }}
dels = corpus.delete_many(query)
print(dels.deleted_count, " documents deleted.")

In [None]:
# identical documents will be inserted with
# a different _id field.

corpus.insert_one({'doc_id': 'INSERT_TEST',
                   'title': 'Fake Article Title',
                  })
corpus.insert_one({'doc_id': 'INSERT_TEST',
                   'title': 'Fake Article Title',
                  })

# results: two identical documents with different _ids

query = { 'doc_id': {'$eq': 'INSERT_TEST' }}
result_set = corpus.find(query)
for result in result_set:
    print(result)

# clean up: delete test documents

dels = corpus.delete_many(query)
print(dels.deleted_count, " documents deleted.")

In [None]:
%%time

# time a single insert, then inspect the _id that pymongo adds to the dict

insert = {'doc_id': 'INSERT_TEST',
           'title': 'Fake Article Title',
          }
insert_id = corpus.insert_one(insert)
print(insert_id.inserted_id)

# each document used for inserting
# also has an _id silently added to the python dict.

print("insert['_id']:", insert['_id'])

# this raises a DuplicateKeyError if you try to insert the
# same object twice, as you are trying to insert an _id
# that already exists--which may be what you want,
# but could be surprising if you are modifying a dict
# on the fly and using it for repeated writes.
# discussion: https://stackoverflow.com/questions/17529216/mongodb-insert-raises-duplicate-key-error

insert3 = {'doc_id': 'INSERT_TEST',
           'title': 'Fake Article Title',
          }
try:
    insert_id = corpus.insert_one(insert3)
    print(insert_id.inserted_id)
    insert_id = corpus.insert_one(insert3)
    print(insert_id.inserted_id)
except (DuplicateKeyError) as err:
    print(err)
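
One way to avoid the surprise described above is to insert a copy of the dict with any `_id` removed, so the template dict is never mutated. A minimal sketch; `without_id` is a hypothetical helper name.

```python
def without_id(doc):
    """Return a shallow copy of `doc` with any '_id' key removed."""
    return {key: value for key, value in doc.items() if key != '_id'}


# a template that already picked up an _id from an earlier insert
template = {'_id': 'already-assigned',
            'doc_id': 'INSERT_TEST',
            'title': 'Fake Article Title'}
fresh = without_id(template)
print(fresh)

# hypothetical usage: each call inserts a new document with its own _id,
# because insert_one only mutates the throwaway copy
#   corpus.insert_one(without_id(template))
#   corpus.insert_one(without_id(template))
```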


In [None]:
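# you can supply your own _id instead of letting mongodb generate one.
# re-running this cell without first deleting the document raises DuplicateKeyError.
test_insert = {'_id': 'abc',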
               'doc_id': 'DELETEME',
               'pub': 'Fake Publication',
               'title': 'Fake Article Title',
              }

insert_id = corpus.insert_one(test_insert)
print(insert_id.inserted_id)
query = { 'doc_id': {'$eq': 'DELETEME' }}
result_set = corpus.find(query)
for result in result_set:
    print(result)

In [None]:
# upsert -- either add or update documents

# https://stackoverflow.com/questions/44462399/how-to-handle-duplicatekeyerror-in-mongodb-pymongo    
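# update_one is a collection method and requires update operators such as $set;
# to swap in a whole document, use replace_one:
# for doc in documents:
#     corpus.replace_one({'_id': doc['_id']}, doc, upsert=True)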
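
A whole-document upsert can be sketched with `replace_one(filter, replacement, upsert=True)`, which inserts the document if nothing matches the filter and replaces it otherwise. The helper below only builds the two arguments; `upsert_args` and its `key` parameter are hypothetical names.

```python
def upsert_args(doc, key='_id'):
    """Return (filter, replacement) for a whole-document upsert matched on `key`."""
    if key not in doc:
        raise KeyError('document has no %r field to match on' % key)
    return {key: doc[key]}, doc


doc = {'_id': 'abc', 'doc_id': 'DELETEME', 'title': 'Fake Article Title'}
filter_doc, replacement = upsert_args(doc)
print(filter_doc)

# hypothetical usage:
#   for doc in documents:
#       corpus.replace_one(*upsert_args(doc), upsert=True)
```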


## Projection: limiting which fields are returned

In [None]:
# The projection field controls what record fields are returned.
# It is either inclusive or exclusive.

query = {'name': {'$regex': '.*newyorktimes.*'}}

# create a projection that shows only selected fields
print('selected doc:\n', corpus.find_one(query, {'name':1, 'pub':1, 'title':1}), '\n')

# create a projection that shows everything but content
print('no content doc:\n', corpus.find_one(query, {'content':0}), '\n')

# _id is always included, but you can exclude it explicitly
print('no _id, name only doc:\n', corpus.find_one(query, {'_id':0, 'name':1}), '\n')

# once the object is returned, fields can be further manipulated --
# -- for example, showing a preview of long content

# content only projection, transformed to preview (shortened for readability)
one_doc = corpus.find_one(query, {'content':1})
one_doc['content_preview'] = one_doc.pop('content')[0:400] + '[...]'
print('content only (shortened):\n', one_doc, '\n')


# you cannot combine inclusion and exclusion statements in projection documents
# -- with the exception of the _id field.
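
The inclusion/exclusion rule can be checked before a query is sent. The sketch below covers only plain 0/1 flags, not projection operators such as `$slice`; `is_mixed_projection` is a hypothetical helper name.

```python
def is_mixed_projection(projection):
    """True if a projection mixes inclusion (1) and exclusion (0), ignoring _id."""
    flags = {bool(value) for field, value in projection.items() if field != '_id'}
    return len(flags) > 1


print(is_mixed_projection({'name': 1, 'pub': 1, 'title': 1}))  # inclusive only
print(is_mixed_projection({'_id': 0, 'name': 1}))              # _id is the exception
print(is_mixed_projection({'name': 1, 'content': 0}))          # mixed: would be rejected
```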


## Search with multiple terms

In [None]:
# define a query for: name contains "newyorktimes" with a pub_date beginning with 2016
query = {'name': {'$regex':'.*newyorktimes.*'}, 'pub_date': {'$regex':'^2016'}}
count = corpus.count_documents(query)
print("count:", count)

# example
print("example:\n", corpus.find_one(query, {'content':0}),'\n')

In [None]:
%%time

# define and run a complex regular expression:
#   humanities NOT liberal

query = { 'name': {'$regex':'^(?!.*liberal).*humanities.*'}}
count = corpus.count_documents(query)
print("count:", count, '\n')

# ...when possible this may be ~2x faster with mongodb operators, see below

In [None]:
%%time

# define and run a complex search:
#   humanities NOT liberal

query = { '$and': [ { 'name': { '$regex':'.*humanities.*' } }, { 'name': { '$not': { '$regex':'.*liberal.*' } } } ] }
count = corpus.count_documents(query)
print("count:", count, '\n')
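
The "contains X but not Y" query above can be generated from plain search terms. In this sketch the terms are passed through `re.escape` so that characters like `.` are matched literally, and the surrounding `.*` is dropped because `$regex` already matches anywhere in the string unless anchored. `contains_not` is a hypothetical helper name.

```python
import re


def contains_not(field, include, exclude):
    """Build a filter: `field` contains `include` and does not contain `exclude`."""
    return {'$and': [
        {field: {'$regex': re.escape(include)}},
        {field: {'$not': {'$regex': re.escape(exclude)}}},
    ]}


query = contains_not('name', 'humanities', 'liberal')
print(query)

# hypothetical usage:
#   corpus.count_documents(query)
```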

In [None]:
# we could potentially make this much faster with text indexing,
# although it has very specific limitations

# https://medium.com/statuscode/how-to-speed-up-mongodb-regex-queries-by-a-factor-of-up-to-10-73995435c606

## Query printing

Methods of displaying pymongo mongodb queries in readable formats.

In [205]:
# pymongo queries are python dicts of dicts (or lists of dicts)
# however, these nested lines can be very hard to read

# print a query dict
query = {'$or': [{'name': {'$regex': '.*liberal.*'}}, {'name': {'$regex': '.*humanities.*'}}]}
print("\nprint dict:\n", query)


print dict:
 {'$or': [{'name': {'$regex': '.*liberal.*'}}, {'name': {'$regex': '.*humanities.*'}}]}


In [206]:
# pprint adds linebreaks and indents
import pprint
pp = pprint.PrettyPrinter(indent=4, compact=False)
print("\nprint with PrettyPrinter:")
pp.pprint(query)


print with PrettyPrinter:
{   '$or': [   {'name': {'$regex': '.*liberal.*'}},
               {'name': {'$regex': '.*humanities.*'}}]}


In [207]:
# json can also be used to print nested dict/lists
# in a more articulated indented outline form

query = {'$or': [{'name': {'$regex': '.*liberal.*'}}, {'name': {'$regex': '.*humanities.*'}}]}
import json
def jsonprint(obj):
    print(json.dumps(obj, sort_keys=True, indent=4))
print("\nprint with json:")
jsonprint(query)    


print with json:
{
    "$or": [
        {
            "name": {
                "$regex": ".*liberal.*"
            }
        },
        {
            "name": {
                "$regex": ".*humanities.*"
            }
        }
    ]
}
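
`json.dumps` fails on values that are not JSON types, such as a `datetime` or an `ObjectId` inside a query. One workaround, sketched here, is to pass `default=str` so those values are printed via `str()`. The function name `jsonprint_any` and the `inserted_at` field are hypothetical. pymongo also ships `bson.json_util`, which may be a better fit when the output has to be parsed back.

```python
import datetime
import json


def jsonprint_any(obj):
    """Pretty-print nested dicts/lists, falling back to str() for non-JSON values."""
    text = json.dumps(obj, sort_keys=True, indent=4, default=str)
    print(text)
    return text


# 'inserted_at' is a made-up field, used only to get a datetime into the query
query = {'inserted_at': {'$gte': datetime.datetime(2016, 1, 1)}}
text = jsonprint_any(query)
```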


## Databases: rename-by-copy

In [215]:
%%time
## extremely slow--and potentially not safe if name already exists

# COPY DATABASE TO NEW NAME AND DELETE/DROP OLD

# client.admin.command('copydb',
#                      fromdb='we1s',
#                      todb='we1s-test')

CPU times: user 8 µs, sys: 2 µs, total: 10 µs
Wall time: 22.4 µs


## Databases: drop

In [216]:
# client.drop_database('we1s')