# Data Acquisition Notebook
- Acquires biographical data for members of the US House of Representatives and documents the transformation applied at each step.

## Authentication and Configuration
- A ```config.ini``` file and the ```Auth``` class handle configuration and authentication.
- Helper functions come from ```data_acq_functions.py```.
- Dependencies: ```configparser```, ```pymongo```, ```requests```, ```functools```, ```re```, ```mediawiki```, and ```googleapiclient```.
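The ```Auth``` class itself is not shown in this notebook, so the following is only a minimal sketch of how such a class might read API settings with ```configparser```. The class name ```AuthSketch```, the ```[propublica]``` section, and the ```root``` and ```api_key``` keys are all hypothetical; the real ```config.ini``` layout may differ.

```python
import configparser

# Hypothetical config.ini contents; section and key names are assumptions.
SAMPLE_INI = """
[propublica]
root = https://api.propublica.org/congress/v1
api_key = YOUR_API_KEY
"""


class AuthSketch:
    """Illustrative stand-in for the notebook's Auth class (not the real one)."""

    def __init__(self, parser):
        self.parser = parser

    @classmethod
    def from_string(cls, text):
        # The real class takes a file path; a string keeps this sketch self-contained.
        parser = configparser.ConfigParser()
        parser.read_string(text)
        return cls(parser)

    def config_propublica(self):
        # Return the API root URL and the request header carrying the API key.
        section = self.parser["propublica"]
        header = {"X-API-Key": section["api_key"]}
        return section["root"], header


root, header = AuthSketch.from_string(SAMPLE_INI).config_propublica()
print(root)
print(header)
```

Keeping the key in a config file outside the notebook means the credentials never appear in a shared or committed notebook.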

In [1]:
import data_acq_functions as daf
from data_acq_functions import Auth

In [2]:
# Config ProPublica API
config = Auth('../database-dev/auth/config.ini')
PP_ROOT, PP_HEADER = config.config_propublica()

## Acquiring ProPublica Representative Data
- Primary IDs based on [US Congress Bioguide IDs](https://bioguide.congress.gov/).
- Each lookup returns a representative as a dictionary with minor cleaning applied.
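As a rough illustration of the ID pull, the sketch below builds a member-list URL and extracts Bioguide IDs from a response payload. The helper names are hypothetical and are not the functions in ```data_acq_functions.py```. The URL pattern and the ```results[0]["members"]``` payload shape are assumptions about the ProPublica Congress API, and the standard-library ```urllib``` stands in for ```requests``` to keep the example dependency-free.

```python
import json
import urllib.request


def members_url(root, congress, chamber="house"):
    # Assumed endpoint pattern: {root}/{congress}/{chamber}/members.json
    return f"{root.rstrip('/')}/{congress}/{chamber}/members.json"


def extract_ids(payload):
    # Assumed payload shape: {"results": [{"members": [{"id": ...}, ...]}]}
    members = payload["results"][0]["members"]
    return [member["id"] for member in members]


def fetch_house_ids(congress, root, header, timeout=10):
    # Performs the network call; requires a valid API key in the header.
    request = urllib.request.Request(members_url(root, congress), headers=header)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_ids(json.load(response))


# Offline check of the parsing step with a hand-written payload.
sample_payload = {"results": [{"members": [{"id": "A000370"}, {"id": "C001119"}]}]}
print(extract_ids(sample_payload))
```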

In [3]:
# Retrieve all 117th House IDs
house_ids = daf.get_house_ids(117, PP_ROOT, PP_HEADER)

In [4]:
# Sample Bioguide ID
print(house_ids[0])

A000370


In [5]:
# Sample - Retrieve ProPublica data for representative from US House
sample_rep = daf.get_member(house_ids[0], PP_ROOT, PP_HEADER)
for k, v in list(sample_rep.items())[:10]: # First 10 key-value pairs
    print(f'{k}: {v}')

_id: A000370
first_name: Alma
middle_name: 
last_name: Adams
dob: 1946-05-27
gender: F
current_party: D
state: NC
google_id: /m/02b45d
votesmart_id: 5935


## Retrieving Wikipedia URLs
- The ```@error_logging``` decorator wraps the data-pull functions and groups representatives by the error raised during a pull.
- ```googleapiclient.errors.HttpError```: the Google Entity ID from the ProPublica data pull is missing or wrong. A fallback function looks up the ID through a Google Knowledge Graph search.
- ```KeyError```: the Google Knowledge Graph entry has no Wikipedia URL. A fallback function looks up the Wikipedia page through a MediaWiki search.
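The grouping idea can be sketched with a small decorator built on ```functools.wraps```. This is an illustrative version, not the ```@error_logging``` implementation from ```data_acq_functions.py```: the ```error_log``` dictionary, the ```lookup_wiki_url``` function, and the sample ```graph``` data are all hypothetical.

```python
import functools
from collections import defaultdict

# Maps an exception class name to the representative IDs that raised it.
error_log = defaultdict(list)


def error_logging(func):
    """Record the ID under the raised exception's name instead of failing."""

    @functools.wraps(func)
    def wrapper(rep_id, *args, **kwargs):
        try:
            return func(rep_id, *args, **kwargs)
        except Exception as exc:
            error_log[type(exc).__name__].append(rep_id)
            return None

    return wrapper


@error_logging
def lookup_wiki_url(rep_id, graph):
    # Raises KeyError when the entry has no Wikipedia URL.
    return graph[rep_id]["wiki_url"]


# Hypothetical stand-in for Knowledge Graph results.
graph = {
    "A000370": {"wiki_url": "https://en.wikipedia.org/wiki/Alma_Adams"},
    "D000624": {},
}
urls = {rep_id: lookup_wiki_url(rep_id, graph) for rep_id in graph}

# IDs grouped under "KeyError" are the ones a fallback search would retry.
print(error_log["KeyError"])
```

Grouping by exception type lets each group be retried in one pass with the matching fallback, rather than branching inside every individual pull.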

In [6]:
# Config Google API Services and MediaWiki
entities = config.config_gkg()
wiki = config.config_wiki()

In [11]:
# Sample - Retrieve Wikipedia URL for representative, no errors
sample_rep = daf.get_rep_data(house_ids[0], PP_ROOT, PP_HEADER, entities, wiki)
print(sample_rep['first_name'], sample_rep['last_name'])
print(sample_rep['wiki_url'])

Alma Adams
https://en.wikipedia.org/wiki/Alma_Adams


In [12]:
# Sample - Initially missing Google Entity ID in ProPublica data
sample_rep = daf.get_rep_data('C001119', PP_ROOT, PP_HEADER, entities, wiki)
print(sample_rep['first_name'], sample_rep['last_name'])
print(sample_rep['wiki_url'])

Angie Craig
https://en.wikipedia.org/wiki/Angie_Craig


In [13]:
# Sample - Initially missing Wikipedia URL in Google Knowledge Graph
sample_rep = daf.get_rep_data('D000624', PP_ROOT, PP_HEADER, entities, wiki)
print(sample_rep['first_name'], sample_rep['last_name'])
print(sample_rep['wiki_url'])

Debbie Dingell
https://en.wikipedia.org/wiki/Debbie_Dingell


## Initial Database Population and Bulk Writes
- A local MongoDB database avoids the read/write limits of cloud databases (Google Firestore, AWS RDS, etc.) and allows data to be nested and stored in different formats.

In [15]:
# Config local MongoDB
db = config.config_mongodb()
collection = db['reps']

In [18]:
# Bulk write insert statements
# from pymongo import InsertOne
# inserts = []
# for member in members:  # members: iterable of Bioguide IDs
#     data = daf.get_rep_data(member, PP_ROOT, PP_HEADER, entities, wiki)
#     inserts.append(InsertOne(data))

In [20]:
# Bulk write to collection
# result = collection.bulk_write(inserts)
# print(result.bulk_api_result)