# Google Drive Preliminary Tests

This notebook crawls a Google Drive repository to collect baseline metrics. The goal is to identify users whose Google Drive accounts are 'large' and 'interesting' enough for use in a full metadata extraction study. All Google Drive data crawled by this notebook is confidential and is not downloaded or stored on any machine outside of Google.

Note: all Drive accounts must use an **xxx@uchicago.edu** email address.

To begin, run the following in the directory containing this notebook, and ensure your environment uses Python 3.6+:

`pip install -r nb-requirements.txt`

After installing to your environment of choice, you may need to restart your notebook.
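
The later cells use f-strings, so a quick check like the one below could confirm the interpreter version before going any further. This is only a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of this notebook.

```python
import sys


def python_is_supported(version_info=sys.version_info, minimum=(3, 6)):
    """Return True when the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Fail early, with a readable message, on an interpreter that is too old.
if not python_is_supported():
    raise RuntimeError("This notebook needs Python 3.6 or newer.")
```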

In [None]:
from __future__ import print_function

# Standard library imports
import os
import json
import pickle

# Third-party imports
import requests

# Google Imports
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

### Step 1: Auth
We will authenticate using Google OAuth2's InstalledAppFlow. The cell below reads `project_id`, `client_id`, and `client_secret` from the environment variables `goog_project_id`, `goog_client_id`, and `goog_client_secret`. Alternatively, uncomment the TODO lines and fill in the values provided by Tyler (you can copy and paste them into this notebook so long as you don't push to GitHub).
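
The cell below looks up the three fields in `os.environ`, so a missing variable surfaces as a bare `KeyError`. As a hedged sketch of a friendlier check, the hypothetical helper `load_oauth_fields` (not part of this notebook) reports every missing variable at once. The variable names are the ones used in the cell below.

```python
import os

# Environment variable names as used by this notebook's auth cell.
REQUIRED_VARS = ("goog_project_id", "goog_client_id", "goog_client_secret")


def load_oauth_fields(environ=os.environ):
    """Return the three OAuth fields, or raise naming every missing variable."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "project_id": environ["goog_project_id"],
        "client_id": environ["goog_client_id"],
        "client_secret": environ["goog_client_secret"],
    }
```

Passing the mapping as an argument keeps the helper easy to try with a plain dict instead of the real environment.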

In [3]:
# Add the secret stuff to our credentials document...
with open("../examples/config.json", "r") as f:
    creds = json.load(f)

project_id = os.environ["goog_project_id"]
client_id = os.environ["goog_client_id"]
client_secret = os.environ["goog_client_secret"]

# project_id = "TODO 1 of 3"
# client_id = "TODO 2 of 3"
# client_secret = "TODO 3 of 3"

creds['web']['client_id'] = client_id
creds['web']['client_secret'] = client_secret
creds['web']['project_id'] = project_id

# And write out to file. 
with open("credentials.json", "w") as f: 
    json.dump(creds, f)
    
SCOPES = ['https://www.googleapis.com/auth/drive.metadata.readonly', 'https://www.googleapis.com/auth/drive.readonly']

# Stolen from Google Quickstart docs
# https://developers.google.com/drive/api/v3/quickstart/python
def do_login_flow():
    creds = None
    # The file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)

    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)

    return creds, None  # Returning None because Tyler can't figure out how he wants to structure this yet. 

# This should force-open a Google Auth window in your local browser. 
#    If not, you can manually copy-paste it. 
auth_creds = do_login_flow()
os.remove("credentials.json")  # The worst way to provide the minimum level of file-level security.

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=364500245041-r1eebsermd1qp1qo68a3qp09hhpc5dfi.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A63798%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.metadata.readonly+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.readonly&state=299kI67AAzAOpkO0gMkbYCFzNawXlZ&access_type=offline


### Step 2: Crawl 
The crawler sends Drive API calls and collects information about every file and directory in your repository. Along the way, it also tries to identify a 'best' first extractor (if any) to apply to each of your files.
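
To make the idea of a 'first extractor' concrete, here is a purely illustrative sketch that guesses a category from a file extension. The crawler's actual rules are not shown in this notebook, so the extension table and the function name `guess_first_extractor` are assumptions. Only the category names are taken from the `first_ext_tallies` field in the status output later in this notebook.

```python
import os

# Hypothetical extension-to-category table. The real crawler may use MIME
# types or other signals; only the category names come from this notebook.
EXTENSION_TO_EXTRACTOR = {
    ".zip": "compressed", ".gz": "compressed", ".tar": "compressed",
    ".h5": "hierarch", ".nc": "hierarch",
    ".png": "images", ".jpg": "images", ".jpeg": "images", ".gif": "images",
    ".ppt": "presentation", ".pptx": "presentation",
    ".csv": "tabular", ".tsv": "tabular", ".xls": "tabular", ".xlsx": "tabular",
    ".txt": "text", ".md": "text", ".doc": "text", ".docx": "text", ".pdf": "text",
}


def guess_first_extractor(filename):
    """Pick a category from the file extension, falling back to 'other'."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_EXTRACTOR.get(extension, "other")
```

Tallying the results over a list of file names, for instance with `collections.Counter`, would give a dictionary shaped like `first_ext_tallies`.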

In [4]:
crawl_url = "http://xtract-crawler-4.eba-ghixpmdf.us-east-1.elasticbeanstalk.com/crawl"
status_url = "http://xtract-crawler-4.eba-ghixpmdf.us-east-1.elasticbeanstalk.com/get_crawl_status"
# crawl_url = "http://127.0.0.1:5000/crawl"
# status_url = "http://127.0.0.1:5000/get_crawl_status"

r = requests.post(url=crawl_url,
                  data=pickle.dumps({'auth_creds': auth_creds, 'repo_type': 'GDRIVE'}))

crawl_mdata = json.loads(r.content)
print(f"Crawl ID info:\n {crawl_mdata}")
crawl_id = crawl_mdata['crawl_id']

Crawl ID info:
 {'crawl_id': '0707bc4f-841c-4a26-ba5c-1a7988668466'}


### Step 3: Get status
**Run the following cell periodically**. When the status is 'COMPLETED', please send the resulting JSON object to Tyler on Slack. The entire crawl shouldn't take more than 5 minutes (my personal Google Drive, containing about 3,100 files, takes ~20-40 seconds per run). Google's Drive API returns 100 files per call by default, which is why you may notice 'groups_crawled' incrementing by a multiple of 100 whenever you update.
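
The crawler itself runs server-side and its code is not part of this notebook. As a rough sketch of how paging through Drive in batches of 100 can look with the Python client, the hypothetical helper `iter_drive_files` below follows `nextPageToken` until it runs out. It assumes `service` was created with `build('drive', 'v3', credentials=...)`; the `fields` selection is an illustrative choice.

```python
def iter_drive_files(service, page_size=100):
    """Yield file metadata dicts, fetching one page per files().list call."""
    page_token = None
    while True:
        response = service.files().list(
            pageSize=page_size,
            pageToken=page_token,
            fields="nextPageToken, files(id, name, mimeType)",
        ).execute()
        for item in response.get("files", []):
            yield item
        # The API omits nextPageToken on the last page.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
```

Because the function is a generator, a caller can count or inspect files page by page without holding the whole listing in memory.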

The `crawl_status` codes progress as follows: STARTING -> PROCESSING -> COMMITTING -> COMPLETED/FAILED
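
If re-running the cell by hand gets tedious, the check could be wrapped in a small polling loop. The sketch below is one possible approach, not part of the notebook: `wait_for_crawl` is a hypothetical helper that calls any zero-argument function returning the status dictionary and stops at COMPLETED or FAILED. The 300-second default mirrors the 5-minute estimate above, and the 5-second interval is an arbitrary assumption.

```python
import time

# Terminal states taken from the status codes listed above.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_crawl(fetch_status, interval_s=5.0, timeout_s=300.0, sleep=time.sleep):
    """Call fetch_status() until the crawl reaches a terminal status.

    fetch_status is any zero-argument callable returning the status dict.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status.get("crawl_status") in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError("Crawl did not finish within the timeout.")
        sleep(interval_s)
        waited += interval_s
```

With the variables defined in this notebook, a call could look like `wait_for_crawl(lambda: requests.get(status_url, json={'crawl_id': crawl_id}).json())`.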

In [13]:
crawl_status = requests.get(status_url, json={'crawl_id': crawl_id})
print(crawl_status)
crawl_content = json.loads(crawl_status.content)
print(f"Crawl Status: {crawl_content}")

<Response [200]>
Crawl Status: {'crawl_end_t': 1611509419.7886066, 'crawl_start_t': 1611509385.0409765, 'crawl_status': 'COMPLETED', 'gdrive_mdata': {'doc_types': {'is_gdoc': 1207, 'is_user_upload': 1945}, 'first_ext_tallies': {'compressed': 6, 'hierarch': 1, 'images': 376, 'other': 379, 'presentation': 184, 'tabular': 222, 'text': 1984}}, 'groups_crawled': 3152, 'n_commit_threads': 5, 'repo_type': 'GDrive', 'total_crawl_time': 34.74763011932373}
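
The status dictionary above can be boiled down to a few headline numbers. The following sketch assumes the field names seen in that output (`gdrive_mdata`, `doc_types`, `first_ext_tallies`, `crawl_start_t`, `crawl_end_t`). The function name `summarize_crawl` is hypothetical.

```python
def summarize_crawl(status):
    """Return total file count, elapsed seconds, and extractor shares (%)."""
    mdata = status["gdrive_mdata"]
    total_files = sum(mdata["doc_types"].values())
    elapsed_s = status["crawl_end_t"] - status["crawl_start_t"]
    # Share of files assigned to each first-extractor category, in percent.
    shares = {
        name: round(100.0 * count / total_files, 1) if total_files else 0.0
        for name, count in mdata["first_ext_tallies"].items()
    }
    return {"total_files": total_files, "elapsed_s": elapsed_s, "extractor_pct": shares}
```

Calling `summarize_crawl(crawl_content)` after a completed crawl would return these values for your own Drive.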


## When the final status is "COMPLETED", please send me the resulting JSON document. Thanks!