# Data Acquisition Notebook
- **Data Sources: [ProPublica Congress API](https://www.propublica.org/datastore/api/propublica-congress-api), [Google Knowledge Graph](https://developers.google.com/knowledge-graph/libraries), [Wikipedia](https://www.wikipedia.org/), [VoteSmart](https://justfacts.votesmart.org/), [Center for Responsive Politics (OpenSecrets API)](https://www.opensecrets.org/open-data/api).**
- Functions and transformations for acquisition of US House of Representatives biographical data.

## Authentication and Configuration
- A ```config.ini``` file and the ```Auth``` class handle configuration and authentication.
- Helper functions are imported from ```data_acq_functions.py```.
- Dependencies: ```configparser```, ```pymongo```, ```requests```, ```functools```, ```bs4```, ```re```, ```mediawiki```, and ```googleapiclient```.
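
The internals of ```Auth``` are not shown in this notebook. As a rough sketch only, a ```configparser```-backed class could return the root URL and request header in the same shape as ```config.config_propublica()```. The class name ```AuthSketch```, the ```[propublica]``` section, and the ```root``` and ```api_key``` keys are all assumptions made for illustration; the ```X-API-Key``` header is the one the ProPublica Congress API expects.

```python
import configparser

# Hypothetical config.ini contents; section and key names are assumptions.
SAMPLE_INI = """
[propublica]
root = https://api.propublica.org/congress/v1/
api_key = YOUR_API_KEY
"""


class AuthSketch:
    """Illustrative stand-in for the notebook's Auth class (not the real one)."""

    def __init__(self, parser):
        self.parser = parser

    @classmethod
    def from_string(cls, text):
        # The real class takes a file path; a string keeps this sketch self-contained.
        parser = configparser.ConfigParser()
        parser.read_string(text)
        return cls(parser)

    def config_propublica(self):
        # Return (root URL, request headers), mirroring PP_ROOT and PP_HEADER.
        section = self.parser["propublica"]
        return section["root"], {"X-API-Key": section["api_key"]}


config_sketch = AuthSketch.from_string(SAMPLE_INI)
root, header = config_sketch.config_propublica()
print(root)
```

Keeping keys in a file outside the notebook means the notebook can be shared without exposing credentials.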

In [1]:
import data_acq_functions as daf # Only aliased for the purposes of the notebook
from data_acq_functions import Auth

In [2]:
# Config ProPublica API
config = Auth('../database-dev/auth/config.ini')
PP_ROOT, PP_HEADER = config.config_propublica()

## Building Initial Database
- ```build_db_script.py``` was used to build the database.

### Acquiring ProPublica Representative Data
- Utilizing [ProPublica Congress API](https://www.propublica.org/datastore/api/propublica-congress-api).
- Primary IDs based on [US Congress Bioguide IDs](https://bioguide.congress.gov/).
- Return representative dictionaries with minor cleaning.

In [3]:
# Retrieve all 117th House IDs
house_ids = daf.get_house_ids(117, PP_ROOT, PP_HEADER)

In [4]:
# Sample Bioguide ID
print(house_ids[0])

A000370


In [5]:
# Sample - Retrieve ProPublica data for representative from US House
sample_rep = daf.get_member(house_ids[0], PP_ROOT, PP_HEADER)
for k, v in list(sample_rep.items())[:10]: # First 10 key-value pairs
    print(f'{k}: {v}')

_id: A000370
first_name: Alma
middle_name: 
last_name: Adams
dob: 1946-05-27
gender: F
current_party: D
state: NC
google_id: /m/02b45d
votesmart_id: 5935


### Retrieving Wikipedia URLs
- Utilizing [Google Knowledge Graph](https://developers.google.com/knowledge-graph/libraries) and [MediaWiki](https://github.com/barrust/mediawiki) to acquire Wikipedia page URLs of representatives.
- An error logging wrapper (```@error_logging```) wraps the data-pull functions so that representatives can be grouped by the error raised during a pull.
- ```googleapiclient.errors.HttpError```: Missing or wrong Google Entity ID in the ProPublica data. Triggers a fallback that looks up the ID through a Google Knowledge Graph search.
- ```KeyError```: Missing Wikipedia URL in the Google Knowledge Graph entity. Triggers a fallback MediaWiki search.
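
The ```@error_logging``` wrapper itself is not shown in this notebook. One way such a wrapper could work, sketched here under the assumption that wrapped functions return a ```(result, error)``` pair as ```wiki_edu_scrape``` does later on, is to catch the exception and hand it back to the caller. The names ```error_logging_sketch``` and ```lookup_wiki_url``` and the layout of the sample entity are hypothetical.

```python
import functools


def error_logging_sketch(func):
    """Return (result, None) on success or (None, exception) on failure."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs), None
        except Exception as exc:  # broad on purpose: the caller sorts by type
            return None, exc

    return wrapper


@error_logging_sketch
def lookup_wiki_url(entity):
    # Hypothetical helper: raises KeyError when the entity carries no URL.
    return entity["detailedDescription"]["url"]


sample_entity = {
    "detailedDescription": {"url": "https://en.wikipedia.org/wiki/Alma_Adams"}
}
url, error = lookup_wiki_url(sample_entity)
missing_url, missing_error = lookup_wiki_url({})

# A KeyError is the signal to fall back to a MediaWiki search.
needs_fallback = isinstance(missing_error, KeyError)
print(url, error, needs_fallback)
```

Returning the exception instead of raising it lets a long loop over all representatives finish and sort failures into lists by exception type afterwards.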

In [6]:
# Config Google API Services and MediaWiki
entities = config.config_gkg()
wiki = config.config_wiki()

In [7]:
# Sample - Retrieve Wikipedia URL for representative, no errors
sample_rep = daf.get_rep_data(house_ids[0], PP_ROOT, PP_HEADER, entities, wiki)
print(sample_rep['first_name'], sample_rep['last_name'])
print(sample_rep['wiki_url'])

Alma Adams
https://en.wikipedia.org/wiki/Alma_Adams


In [8]:
# Sample - Initially missing Google Entity ID in ProPublica data
sample_rep = daf.get_rep_data('C001119', PP_ROOT, PP_HEADER, entities, wiki)
print(sample_rep['first_name'], sample_rep['last_name'])
print(sample_rep['wiki_url'])

Angie Craig
https://en.wikipedia.org/wiki/Angie_Craig


In [9]:
# Sample - Initially missing Wikipedia URL in Google Knowledge Graph Entity
sample_rep = daf.get_rep_data('D000624', PP_ROOT, PP_HEADER, entities, wiki)
print(sample_rep['first_name'], sample_rep['last_name'])
print(sample_rep['wiki_url'])

Debbie Dingell
https://en.wikipedia.org/wiki/Debbie_Dingell


### Initial Database Population and Bulk Writes
- A local MongoDB database was used to avoid the read/write limits of cloud databases (Google Firestore, AWS RDS, etc.) and to allow data to be nested and stored in different formats.

In [10]:
from pymongo import InsertOne

In [11]:
# Config local MongoDB
db = config.config_mongodb()
collection = db['reps']

In [12]:
# Bulk write insert statements
# inserts = []
# for member in house_ids:
#     data = daf.get_rep_data(member, PP_ROOT, PP_HEADER, entities, wiki)
#     inserts.append(InsertOne(data))

In [13]:
# Bulk write to collection
# result = collection.bulk_write(inserts)
# print(result.bulk_api_result)

## Acquire Educational Data
- Utilizing BeautifulSoup to scrape US House Representative Wikipedia pages.
- Educational background located in ```<th>Education</th>``` or ```<th><a>Alma mater</a></th>``` row of ```<table>``` with attribute ```class="infobox vcard"```.
- Error logging routes representatives to [VoteSmart](https://justfacts.votesmart.org/) as an alternative source when the Wikipedia infobox is missing educational data.
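
The row lookup described above can be sketched with BeautifulSoup as follows. This is not the body of ```daf.wiki_edu_scrape```, which is not shown here; the function name ```infobox_education``` and the hand-written HTML snippet are stand-ins, and real Wikipedia infoboxes are larger and less regular.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a Wikipedia infobox; not copied from a live page.
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Born</th><td>May 27, 1946</td></tr>
  <tr><th>Education</th>
      <td><a>North Carolina A&amp;T State University</a> (<a>BS</a>, <a>MS</a>)<br>
          <a>Ohio State University</a> (<a>PhD</a>)</td></tr>
</table>
"""


def infobox_education(html):
    """Return link texts from the Education / Alma mater row, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" also matches tables whose class is "infobox vcard".
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return []
    for header in infobox.find_all("th"):
        label = header.get_text(strip=True)
        if label in ("Education", "Alma mater"):
            cell = header.find_next_sibling("td")
            if cell is None:
                return []
            return [a.get_text(strip=True) for a in cell.find_all("a")]
    return []


print(infobox_education(SAMPLE_HTML))
```

An empty list is the case where a fallback source such as VoteSmart would be consulted.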

In [14]:
# Retrieve all representative IDs, Wikipedia URLs, VoteSmart IDs, first name, last name
projection = {'_id': 1, 'first_name': 1, 'last_name': 1, 'wiki_url': 1, 'votesmart_id': 1}
results = collection.find({}, projection)
reps = list(results)

In [15]:
# Check that no Wikipedia URLs are missing
for rep in reps:
    assert rep['wiki_url'] is not None
    assert rep['wiki_url'] != ''

In [16]:
# Sample
print(reps[0])

{'_id': 'A000370', 'first_name': 'Alma', 'last_name': 'Adams', 'votesmart_id': '5935', 'wiki_url': 'https://en.wikipedia.org/wiki/Alma_Adams'}


In [18]:
wiki_url = reps[0]['wiki_url']
edus = daf.wiki_edu_scrape(wiki_url)
print(edus) # ([<educational data>], <error>)

(['North Carolina A&T State University', 'BS', 'MS', 'Ohio State University', 'PhD'], None)


### Script for educational background acquisition
- The initial for-loop script was run in an out-of-date notebook.
- Note: the representatives in every error list had to be checked manually. All errors were corrected by the script.

In [27]:
# Error lists
no_wiki_edu = []
no_vs_id = []
no_vs_edu = []
other_errors = []

# Sample script (for-loop used in original script)
rep = reps[0]
edus, error = daf.wiki_edu_scrape(rep['wiki_url'])
rep['education'] = edus
if error: # No educational background on Wikipedia (note: none of these representatives had post-secondary degrees)
    no_wiki_edu.append(rep)

elif len(edus) < 2: # No degree shown in Wikipedia educational background
    vs_id, error = daf.get_vs_id(rep)
    if error:
        rep['education'] = None # No VoteSmart ID
        no_vs_id.append(rep)
    else:
        rep['votesmart_id'] = vs_id
        edus, error = daf.vs_edu_scrape(rep)
        if error:
            rep['education'] = None # No degree from VoteSmart
            no_vs_edu.append(rep)
        else:
            rep['education'] = edus # Degree pulled from VoteSmart

else: # No errors: clean educational data
    edus, error = daf.clean_edu(rep)
    if error:
        other_errors.append(rep)
    else:
        rep['education'] = edus

In [28]:
# Sample result
rep['education']

[['BS', 'North Carolina A&T State University'],
 ['MS', 'North Carolina A&T State University'],
 ['PHD', 'Ohio State University']]
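
The step from the flat scrape result to the degree and institution pairs above is handled by ```daf.clean_edu```, whose body is not shown. A minimal sketch of one way to do the pairing, assuming institutions always precede their degrees and using a small, incomplete set of degree abbreviations chosen only for illustration (```DEGREES``` and ```pair_degrees``` are hypothetical names):

```python
# Assumed, incomplete set of degree abbreviations used only for this sketch.
DEGREES = {"AA", "BA", "BS", "MA", "MS", "MBA", "MPA", "JD", "MD", "PHD"}


def pair_degrees(tokens):
    """Pair each degree with the most recent institution seen before it."""
    pairs = []
    institution = None
    for token in tokens:
        normalized = token.replace(".", "").upper()  # "Ph.D." -> "PHD"
        if normalized in DEGREES:
            if institution is not None:
                pairs.append([normalized, institution])
        else:
            # Anything that is not a known degree is treated as an institution.
            institution = token
    return pairs


raw = ["North Carolina A&T State University", "BS", "MS",
       "Ohio State University", "PhD"]
print(pair_degrees(raw))
```

Degrees that appear before any institution are dropped in this sketch; a fuller version would have to decide how to record them.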

### Bulk Update Educational Backgrounds

In [29]:
from pymongo import UpdateOne

In [26]:
# Create bulk updates
# updates = []
# for rep in reps:
#     update = UpdateOne(
#         {'_id': rep['_id']},
#         {'$set': {'votesmart_id': rep['votesmart_id'], 'education': rep['education']}}
#     )
#     updates.append(update)

In [None]:
# result = collection.bulk_write(updates)
# result.bulk_api_result