# Crawling and Extracting Files from Google Drive

## Phase 1: Crawling
...is how we create our initial index of files. Downstream, this index lets us plan how many files, and which file types, are processed at each location. Here we use the **Google OAuth2 InstalledAppFlow** to obtain a user's auth credentials, and then use the **Google Drive API** to generate a list of all files and their attributes.

You may need to install the Google Drive requirements before running this notebook. In the terminal this can be done with: 

```pip install -r google_drive_nb_requirements.txt```

In [422]:
from __future__ import print_function

# Standard Py Imports
import os
import json
import pickle
import os.path
import requests

# Google Imports
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

In [423]:
# First, set the following elements as environment variables on your machine
# ** DISCLAIMER **: if you push these to GitHub, I will be sad.
# ** FOR MACOSX: After you source .bash_profile, you might need a full system restart before the env vars are discovered.
project_id = os.environ["goog_project_id"]
client_id = os.environ["goog_client_id"]
client_secret = os.environ["goog_client_secret"]
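
If one of these variables is unset, the cell above fails with a bare `KeyError`. A minimal sketch of a friendlier check is shown below; the helper name `require_env` is hypothetical and not part of the notebook or any library.

```python
import os


def require_env(names):
    """Return a dict of the named environment variables.

    Raises a RuntimeError that lists every unset (or empty) variable,
    instead of the bare KeyError raised by os.environ[...].
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Export them in your shell profile and restart the notebook kernel."
        )
    return {name: os.environ[name] for name in names}


# Usage with the variable names from the cell above:
# env = require_env(["goog_project_id", "goog_client_id", "goog_client_secret"])
# project_id = env["goog_project_id"]
```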

In [424]:
# crawl_url = "http://xtract-crawler-4.eba-ghixpmdf.us-east-1.elasticbeanstalk.com/crawl"
xtract_url = "http://127.0.0.1:5000"

In [425]:
# Add the secret stuff to our credentials document...
with open("config.json", "r") as f:
    creds = json.load(f)
print(creds)
creds['web']['client_id'] = client_id
creds['web']['client_secret'] = client_secret
creds['web']['project_id'] = project_id

# And write it.
with open("credentials.json", "w") as f: 
    json.dump(creds, f)

{'web': {'auth_uri': 'https://accounts.google.com/o/oauth2/auth', 'token_uri': 'https://oauth2.googleapis.com/token', 'auth_provider_x509_cert_url': 'https://www.googleapis.com/oauth2/v1/certs', 'redirect_uris': ['http://localhost', 'http://localhost/potato', 'http://localhost:8080/Callback'], 'javascript_origins': ['http://localhost:8080']}}
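
Writing the secrets to `credentials.json` is what forces the later cleanup step. As an alternative sketch (not what this notebook does), the merged configuration can stay in memory and be passed to `InstalledAppFlow.from_client_config`, which accepts the same dictionary structure as the file. The helper name `build_client_config` is hypothetical.

```python
import copy


def build_client_config(template, client_id, client_secret, project_id):
    """Return a copy of the config template with the secret fields filled in.

    The template is deep-copied so the secrets never end up in the original
    dict (which the notebook prints).
    """
    config = copy.deepcopy(template)
    config["web"].update(
        client_id=client_id,
        client_secret=client_secret,
        project_id=project_id,
    )
    return config


# Usage, assuming `creds` is the dict loaded from config.json and SCOPES is
# the scope list defined in the login cell:
# from google_auth_oauthlib.flow import InstalledAppFlow
# config = build_client_config(creds, client_id, client_secret, project_id)
# flow = InstalledAppFlow.from_client_config(config, SCOPES)
# user_creds = flow.run_local_server(port=0)
```

With this approach nothing secret is written to disk, so there is no file to delete afterwards.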


Now we define and run the login flow. We pass our client credentials to the Google OAuth2 server, which returns a `Credentials` object holding the user's access and refresh tokens.

In [426]:
SCOPES = ['https://www.googleapis.com/auth/drive.metadata.readonly', 'https://www.googleapis.com/auth/drive.readonly']

# Stolen from Google Quickstart docs
# https://developers.google.com/drive/api/v3/quickstart/python
def do_login_flow():
    creds = None
    # The file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)

    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)

    return creds, None  # Returning None because Tyler can't figure out how he wants to structure this yet. 

# This should force-open a Google Auth window in your local browser. If not, you can manually copy-paste the printed URL.
# Note: auth_creds is the (creds, None) tuple returned by do_login_flow().
auth_creds = do_login_flow()

# Now delete the file so you don't accidentally `git add` it. 
os.remove("credentials.json")
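
In this notebook the file listing itself happens on the crawler service, not locally. For reference, a minimal sketch of how the Drive API v3 `files().list` call can be paginated is shown below. The function name `list_drive_files` and the chosen `fields` mask are assumptions for illustration, and this is not necessarily how the Xtract crawler does it.

```python
def list_drive_files(service, page_size=100,
                     fields="nextPageToken, files(id, name, mimeType)"):
    """Collect file metadata from every page of a Drive v3 files().list query.

    `service` is a Drive API client, e.g. the object returned by
    googleapiclient.discovery.build("drive", "v3", credentials=...).
    """
    files = []
    page_token = None
    while True:
        response = service.files().list(
            pageSize=page_size,
            fields=fields,
            pageToken=page_token,
        ).execute()
        files.extend(response.get("files", []))
        # The API omits nextPageToken on the last page.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return files


# Usage (do_login_flow() returns a (creds, None) tuple):
# from googleapiclient.discovery import build
# creds, _ = do_login_flow()
# service = build("drive", "v3", credentials=creds)
# for f in list_drive_files(service):
#     print(f["name"], f["mimeType"])
```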

In [429]:
r = requests.post(url=f"{xtract_url}/crawl",
                  data=pickle.dumps({'auth_creds': auth_creds, 'repo_type': 'GDRIVE'}) )

crawl_mdata = json.loads(r.content)
print(f"Crawl Started! Here's your trackable crawl_id:\n {crawl_mdata}")
crawl_id = crawl_mdata['crawl_id']

Crawl Started! Here's your trackable crawl_id:
 {'crawl_id': 'f0fb6361-45db-406f-860c-7d2554722e5d'}


## Phase 2: Extraction
...is where we hand the crawled file index to the Xtract service for metadata extraction. We first check that the crawl has completed, then log in to Globus to obtain the access tokens the service needs, and finally POST an extraction request to the `/extract` endpoint.


In [439]:
r = requests.get(url=f'{xtract_url}/get_crawl_status', json={'crawl_id': crawl_id})
print(json.loads(r.content))

{'crawl_end_t': 1599084536.7306452, 'crawl_start_t': 1599084506.6546237, 'crawl_status': 'COMPLETED', 'gdrive_mdata': {'doc_types': {'is_gdoc': 1206, 'is_user_upload': 2107}, 'first_ext_tallies': {'compressed': 6, 'hierarch': 1, 'images': 380, 'other': 360, 'presentation': 182, 'tabular': 207, 'text': 2177}}, 'groups_crawled': 3313, 'n_commit_threads': 5, 'repo_type': 'GDrive', 'total_crawl_time': 30.076021432876587}
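
Rather than re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is an illustration only: `wait_for_crawl` is a hypothetical helper, and the only status value known from the output above is `'COMPLETED'`, so every other value is treated as "still running".

```python
import time


def wait_for_crawl(fetch_status, poll_interval=2.0, timeout=300.0):
    """Call fetch_status() until it reports crawl_status == 'COMPLETED'.

    fetch_status is any zero-argument callable returning the status dict.
    Raises TimeoutError if the crawl has not completed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("crawl_status") == "COMPLETED":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Crawl not completed after {timeout} s; last status: {status}"
            )
        time.sleep(poll_interval)


# Usage with the endpoint from the cell above:
# status = wait_for_crawl(
#     lambda: requests.get(f"{xtract_url}/get_crawl_status",
#                          json={"crawl_id": crawl_id}).json()
# )
```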


In [440]:
# Do a Globus Login (Will won't need this, because of REFRESH token)
from fair_research_login import NativeClient

client = NativeClient(client_id='7414f0b4-7d05-4bb6-bb00-076fa3f17cf5')
tokens = client.login(
    requested_scopes=['https://auth.globus.org/scopes/56ceac29-e98a-440a-a594-b41e7a084b62/all', 
                      'urn:globus:auth:scope:transfer.api.globus.org:all',
                     "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all", 
                     'email', 'openid'],
    no_local_server=True,
    no_browser=True)

auth_token = tokens["petrel_https_server"]['access_token']
transfer_token = tokens['transfer.api.globus.org']['access_token']
funcx_token = tokens['funcx_service']['access_token']

headers = {'Authorization': f"Bearer {auth_token}", 'Transfer': transfer_token, 'FuncX': funcx_token, 'Petrel': auth_token}
print(f"Headers: {headers}")

Headers: {'Authorization': 'Bearer AgxNvGpaNvzeKnV594dQK96gVpxOK04244mnelVxv77k4dYM6xH8CwJW5B1B6WaPnQ2NMXYGx4192xCqpDjO5UQJ45', 'Transfer': 'AgO4myokWe6zqmGd1Qa0GMm5BgjVNyjlgpNaeqggJKrazG9PEDcnCpwGvPOaDqGrlG4le9kaxjXq5vue6da0rFo93N', 'FuncX': 'AgevxaX2Jj7oND5WJkk8Krm7EPBYYmO0yKbqKao02rEaqyj95viVCVrpn09Obym6ybBB6NOdnDnvvySK46oYluJPeV', 'Petrel': 'AgxNvGpaNvzeKnV594dQK96gVpxOK04244mnelVxv77k4dYM6xH8CwJW5B1B6WaPnQ2NMXYGx4192xCqpDjO5UQJ45'}
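
Printing the full headers leaves live access tokens in the saved notebook output. A minimal sketch of a redacting helper is shown below; `redact_headers` is a hypothetical name, not part of `fair_research_login` or `requests`.

```python
def redact_headers(headers, keep=4):
    """Return a copy of `headers` with each token truncated to `keep` characters.

    A leading scheme such as 'Bearer ' is preserved so the output stays readable.
    """
    redacted = {}
    for key, value in headers.items():
        # rpartition returns ('', '', value) when there is no space,
        # so bare tokens are handled by the same code path.
        scheme, sep, token = value.rpartition(" ")
        redacted[key] = f"{scheme}{sep}{token[:keep]}..."
    return redacted


# Usage:
# print(f"Headers: {redact_headers(headers)}")
```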


In [441]:
r = requests.get(url=f'{xtract_url}/get_crawl_status', json={'crawl_id': crawl_id})
print(json.loads(r.content))

{'crawl_end_t': 1599084536.7306452, 'crawl_start_t': 1599084506.6546237, 'crawl_status': 'COMPLETED', 'gdrive_mdata': {'doc_types': {'is_gdoc': 1206, 'is_user_upload': 2107}, 'first_ext_tallies': {'compressed': 6, 'hierarch': 1, 'images': 380, 'other': 360, 'presentation': 182, 'tabular': 207, 'text': 2177}}, 'groups_crawled': 3313, 'n_commit_threads': 5, 'repo_type': 'GDrive', 'total_crawl_time': 30.076021432876587}
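
The status document can be summarized with a few lines of arithmetic. The sketch below uses the values printed above; `summarize_crawl` is a hypothetical helper, and the field names are taken from that output.

```python
# Subset of the status document printed above.
status = {
    "crawl_end_t": 1599084536.7306452,
    "crawl_start_t": 1599084506.6546237,
    "gdrive_mdata": {
        "doc_types": {"is_gdoc": 1206, "is_user_upload": 2107},
        "first_ext_tallies": {
            "compressed": 6, "hierarch": 1, "images": 380, "other": 360,
            "presentation": 182, "tabular": 207, "text": 2177,
        },
    },
    "groups_crawled": 3313,
    "total_crawl_time": 30.076021432876587,
}


def summarize_crawl(status):
    """Derive totals, elapsed time, throughput, and per-type shares."""
    tallies = status["gdrive_mdata"]["first_ext_tallies"]
    total = sum(tallies.values())
    elapsed = status["crawl_end_t"] - status["crawl_start_t"]
    return {
        "total_files": total,
        "elapsed_s": elapsed,
        "files_per_s": total / elapsed,
        "share_by_type": {name: count / total for name, count in tallies.items()},
    }


summary = summarize_crawl(status)
print(summary["total_files"])  # 3313, matching groups_crawled
```

Both the extension tallies and the `doc_types` counts add up to `groups_crawled`, which is a quick consistency check on a crawl.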


In [442]:
# source_ep_path = "/mdf_open"
# source_ep_path = "/mdf_open/kearns_biofilm_rupture_location_v1.1/Biofilm Images/Paper Images/Isofluence Images (79.4)/THY+75mM AA"

# source_ep_id = "e38ee745-6d04-11e5-ba46-22000b92c6ec"
source_ep_id = "82f1b5c6-6e9b-11e5-ba47-22000b92c6ec"
dest_ep_id = "1adf6602-3e50-11ea-b965-0e16720bb42f"  # where they will be extracted

# Globus endpoint and file path at which we want to store metadata documents
mdata_ep_id = "5113667a-10b4-11ea-8a67-0e35e66293c2"
mdata_path = "/projects/DLHub/mdf_metadata"  # TODO: Add exception if you put slash at the end of the mdata_path. 

# FuncX endpoint at which we want the metadata extraction to occur. Does NOT have to be same endpoint as the data.
# funcx_ep_id = "6045fcfb-c3ef-48db-9b32-5b50fda15144"  # Path to funcX running on JetStream. 
funcx_ep_id = "68bade94-bf58-4a7a-bfeb-9c6a61fa5443"  # River k8s cluster. 

# print(type(gdrive_pkl))

# TODO: Need Google Credentials
print(f"Extracting for crawl_id: {crawl_id}")
xtract = requests.post(f'{xtract_url}/extract', data=pickle.dumps({
                                                  'crawl_id': crawl_id,
                                                  'gdrive_pkl': auth_creds,
                                                  'headers': json.dumps(headers),
                                                  'funcx_eid': funcx_ep_id, 
                                                  'source_eid': source_ep_id,
                                                  'dest_eid': dest_ep_id,
                                                  'mdata_store_path': mdata_path}))

Extracting for crawl_id: f0fb6361-45db-406f-860c-7d2554722e5d


ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
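
The traceback above means the server closed the connection before sending a response. The cause is not visible from the client side, so the server logs are the first place to look. If the failure turns out to be transient, a retry wrapper along the lines below may help. This is a sketch under assumptions: `call_with_retries` is a hypothetical helper, and retrying is only appropriate if re-sending the `/extract` request cannot start a duplicate extraction.

```python
import time


def call_with_retries(func, exceptions, attempts=3, base_delay=1.0):
    """Call func(), retrying on the given exception type(s).

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last exception once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay} s")
            time.sleep(delay)


# Usage, assuming `payload` is the pickled dict from the cell above:
# xtract = call_with_retries(
#     lambda: requests.post(f"{xtract_url}/extract", data=payload, timeout=60),
#     exceptions=requests.exceptions.ConnectionError,
# )
```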