# Part 1. Authentication in a service

## 1.1. What do you store in your Google Drive?

Crawling web data can be troublesome, for example when you can't simply collect data from web pages because the website requires authentication. Today's tutorial is about a special kind of dataset: Google Drive data. You will need to get access to the system using the OAuth protocol, then download and parse files of different types.

Plan.
1. Download [this little archive](https://drive.google.com/open?id=1Xji4A_dEAm_ycnO0Eq6vxj7ThcqZyJZR), **unzip** it, and place the folder anywhere inside your Google Drive. You should get a subtree of 6 folders with files of different types: presentations, PDF files, texts, and even code.
2. Go to the [Google Drive API](https://developers.google.com/drive/api/v3/quickstart/python) documentation, read the [intro](https://developers.google.com/drive/api/v3/about-sdk), and learn how to [search for files](https://developers.google.com/drive/api/v3/reference/files/list) and [download](https://developers.google.com/drive/api/v3/manage-downloads) them. Note that working at `localhost` (Jupyter) and in `Google Colab` can be slightly different. We expect you to run from localhost.
3. Learn how to open files such as [pptx](https://python-pptx.readthedocs.io/en/latest/user/quickstart.html), pdf, and docx from Python, or use generalized libraries like [textract](https://textract.readthedocs.io/en/stable/index.html), and save the extracted text in a file next to the original.
4. Write code which returns the names (with paths) of files that contain a given substring. Test it on these queries.
```
segmentation
algorithm
classifer
printf
predecessor
Шеннон
Huffman
function
constructor
machine learning
dataset
Протасов
Protasov
```

### 1.1.1. Access GDrive ###

Below is an example of how you can organize your code. It's fine if you change it.

Let's extract the list of all files that are contained (recursively) in the folder of interest. In my case, I called it `air_oauth_folder`.
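
As an alternative to listing the whole Drive and filtering on the client, the `q` parameter of `files().list` can restrict a listing to the direct children of one folder. The sketch below only builds such a query string; the helper name `children_query` and the placeholder `FOLDER_ID` are assumptions made for illustration, and no API call is performed.

```python
FOLDER_MIME = "application/vnd.google-apps.folder"

def children_query(folder_id, folders_only=False):
    """Build a Drive `q` filter that selects the direct children of one folder."""
    # Escape backslashes and single quotes inside the quoted string literal
    safe_id = folder_id.replace("\\", "\\\\").replace("'", "\\'")
    query = f"'{safe_id}' in parents and trashed = false"
    if folders_only:
        # Optionally keep only subfolders, e.g. to continue a recursive walk
        query += f" and mimeType = '{FOLDER_MIME}'"
    return query

# FOLDER_ID is a placeholder, not a real Drive id
print(children_query("FOLDER_ID"))
```

The resulting string would be passed as `q=...` to `service.files().list(...)`, once per folder visited.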

In [None]:
# install some dependencies
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib oauth2client

In [1]:
from google.oauth2 import service_account
from googleapiclient.http import MediaIoBaseDownload, MediaFileUpload
from googleapiclient.discovery import build
import io
from pathlib import Path

In [2]:
def gdrive_get_all_files_in_folder(folder_name, 
                                   SCOPES=['https://www.googleapis.com/auth/drive'], 
                                   SERVICE_ACCOUNT_FILE='client_id.json'):
    def find_files_in_folder(files, folder_id):
        return [file for file in files if folder_id in file.get("parents", []) 
                and file["mimeType"] != "application/vnd.google-apps.folder"]
    
    def find_folders_in_folder(files, folder_id):
        return [file for file in files if folder_id in file.get("parents", []) 
                and file["mimeType"] == "application/vnd.google-apps.folder"]
    
    #TODO retrieve all files from a given folder
    credentials = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE, scopes=SCOPES)
    service = build('drive', 'v3', credentials=credentials)
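    # Page through the whole listing: a single call returns at most pageSize entries
    files, page_token = [], None
    while True:
        results = service.files().list(pageSize=1000, pageToken=page_token,
                                       fields="nextPageToken, files(id, name, mimeType, parents)").execute()
        files.extend(results.get("files", []))
        page_token = results.get("nextPageToken")
        if not page_token:
            break
    
    # Find folder_id from name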
    folder_id = next((file["id"] for file in files if file["name"] == folder_name 
                      and file["mimeType"] == "application/vnd.google-apps.folder"), None)
    files_in_folder = []
    if folder_id:
        # Find all files within this folder
        folder_queue = []
        files_in_folder.extend(find_files_in_folder(files, folder_id))
        # Add all subfolders to be processed
        folder_queue.extend(find_folders_in_folder(files, folder_id))
        while folder_queue:
            folder_id_tmp = folder_queue.pop()["id"]
            files_in_folder.extend(find_files_in_folder(files, folder_id_tmp))
            folder_queue.extend(find_folders_in_folder(files, folder_id_tmp))
    
    return files_in_folder

def gdrive_download_file(file, path_to_save,
                         SCOPES=['https://www.googleapis.com/auth/drive'],
                         SERVICE_ACCOUNT_FILE='client_id.json'): 
    #TODO download file and save it under the path
    credentials = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE, scopes=SCOPES)
    service = build('drive', 'v3', credentials=credentials)
    request = service.files().get_media(fileId=file["id"])
    
    path = Path(path_to_save)
    path.mkdir(parents=True, exist_ok=True)
    file_path = path / Path(file["name"])
    
    # The context manager closes the file handle once the download finishes
    with io.FileIO(file_path, 'wb') as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            status, done = downloader.next_chunk()
            print(f"Download:{file_path} {int(status.progress() * 100)}%.")

In [3]:
folder_of_interest = 'air_oauth_folder'
files = gdrive_get_all_files_in_folder(folder_of_interest)

test_dir = "test_files"
for item in files:
    gdrive_download_file(item, test_dir)

Download:test_files/origin-06.mp3 100%.
Download:test_files/origin-05.mp3 100%.
Download:test_files/neuro.html 100%.
Download:test_files/nn.cpp 100%.
Download:test_files/rdtsc-vc.cpp 100%.
Download:test_files/rdtsc-gcc.c 100%.
Download:test_files/cyclomat.c 100%.
Download:test_files/lockexamples.c 100%.
Download:test_files/Program.cs 100%.
Download:test_files/skiplist.js 100%.
Download:test_files/bloomset.js 100%.
Download:test_files/sort.js 100%.
Download:test_files/students.txt 100%.
Download:test_files/Tutorial 9.pdf 100%.
Download:test_files/Assessment Criteria (May).pdf 100%.
Download:test_files/AY16-17 Academic Calendar .pdf 100%.
Download:test_files/retake-2016-08-18.docx 100%.
Download:test_files/dsa.pdf 100%.
Download:test_files/FuncnNEW.pdf 100%.
Download:test_files/Tutorial #8.pdf 100%.
Download:test_files/[DM]-Course Description.docx 100%.
Download:test_files/3cases.pdf 100%.
Download:test_files/L5-problems-2015.pdf 100%.
Download:test_files/ai-junior.pdf 100%.
Download:tes

### 1.1.2. Tests ###
Please feel free to change function signatures and behaviour.

In [4]:
assert len(files) == 34, 'Number of files is incorrect'
print('n_files:', len(files))

print("file here means id and name, e.g.: ", files[0])

gdrive_download_file(files[0], '.')

import os.path
assert os.path.isfile(os.path.join('.', files[0]["name"])), "File is not downloaded correctly"

n_files: 34
file here means id and name, e.g.:  {'id': '1Bd-gKE8UMqRUEn9SzSBgCjU8HVE9poMs', 'name': 'origin-06.mp3', 'mimeType': 'audio/mpeg', 'parents': ['1p2l3bbtH8ZTmyRTe0B0R_-PPGU9K-x2P']}
Download:origin-06.mp3 100%.
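
The plan asks for file names with paths, while the metadata shown above carries only `name` and `parents`. A minimal sketch of rebuilding paths from such dictionaries follows; the function name `build_paths` and the sample ids are assumptions for illustration, and the data is made up rather than taken from a real Drive.

```python
def build_paths(files, root_id, root_name):
    """Map file id -> slash-separated path for entries located below the root folder."""
    by_id = {f["id"]: f for f in files}

    def path_of(file_id):
        if file_id == root_id:
            return root_name
        entry = by_id.get(file_id)
        if entry is None:
            return None  # parent is outside the listing, so not under the root
        for parent_id in entry.get("parents", []):
            parent_path = path_of(parent_id)
            if parent_path is not None:
                return f"{parent_path}/{entry['name']}"
        return None

    paths = {}
    for file_id in by_id:
        path = path_of(file_id)
        if path is not None and file_id != root_id:
            paths[file_id] = path
    return paths

# Hypothetical metadata in the same shape as the listing above
sample = [
    {"id": "root", "name": "air_oauth_folder", "parents": ["drive"]},
    {"id": "d1", "name": "code", "parents": ["root"]},
    {"id": "f1", "name": "nn.cpp", "parents": ["d1"]},
    {"id": "f2", "name": "other.txt", "parents": ["elsewhere"]},
]
paths = build_paths(sample, "root", "air_oauth_folder")
print(paths["f1"])
```

Entries whose parent chain never reaches the root folder (here `other.txt`) are left out of the result.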


## 1.2. Read files content
### 1.2.1. Read

In [12]:
# install dependencies
!pip install textract

For Windows, please refer to:
- https://textract.readthedocs.io/en/latest/installation.html#don-t-see-your-operating-system-installation-instructions-here

- https://www.xpdfreader.com/download.html

Also, be careful with spaces in file names: it is better to save files without spaces.
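
One possible way to follow this advice is to normalize names before saving. The helper below is a sketch; the name `safe_filename` and the choice of underscores are assumptions, not part of the assignment.

```python
import re

def safe_filename(name):
    """Replace every run of whitespace with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())

print(safe_filename("AY16-17 Academic Calendar .pdf"))
# prints: AY16-17_Academic_Calendar_.pdf
```

If files are renamed this way, any code that refers to the original names (for example `'at least this file.txt'` in the tests) has to use the new names as well.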

In [36]:
import textract

# NOTE: with Python 3.8, textract may fail to parse PDF files because of a broken 'chardet'.

def get_file_strings(path):
    #TODO change this function to handle different data types properly 
    # - textract is not able to parse everything
    # Take care of non-text data too
    
    filetype = Path(path).suffix
    texts = []  # always a list of lines; stays empty on failure
    # If the file type is parsable by textract, extract the text
    if filetype in textract.parsers._get_available_extensions():        
        try:
            texts = str(textract.process(path), encoding="utf-8").replace('\\n', '\n').replace('\\r', '').split('\n')
            print(f"File {path} parsed successfully")
        except Exception as e:
            print(f"Could not decode {path} because of \033[1m{e}\033[0m")
            texts = []
    # If the file is source code, read it as plain text
    elif filetype in [".cpp", ".c", ".js", ".cs"]:
        try:
            with open(path, "r") as f:
                text = f.read()
            texts = text.replace('\\n', '\n').replace('\\r', '').split('\n')
            print(f"File {path} parsed successfully")
        except Exception as e:
            print(f"Could not decode {path} because of \033[1m{e}\033[0m")
            texts = []
    else:
        print(f"Could not decode {path} because \033[1mfiletype {filetype} is not supported by textract\033[0m")
    return texts

In [37]:
# creating dictionary of parsed files
files_data = dict()
for file in os.scandir(test_dir):  
    strings = get_file_strings(file.path)
    if strings:
        files_data[file.name] = strings

File test_files/sort.js parsed successfully
File test_files/AY16-17 Academic Calendar .pdf parsed successfully
Could not decode test_files/3cases.pdf because of decode() argument 1 must be str, not None
File test_files/rdtsc-vc.cpp parsed successfully
File test_files/Assessment Criteria (May).pdf parsed successfully
File test_files/at least this file.txt parsed successfully
File test_files/rdtsc-gcc.c parsed successfully
Could not decode test_files/Tutorial #8.pdf because of 'charmap' codec can't decode byte 0x9d in position 1197: character maps to <undefined>
File test_files/DSA_15 Lion in the desert.pptx parsed successfully
File test_files/origin-05.mp3 parsed successfully
File test_files/origin-06.mp3 parsed successfully
File test_files/students.txt parsed successfully
File test_files/L5-problems-2015.pdf parsed successfully
Could not decode test_files/grant.txt because of 'utf-8' codec can't decode byte 0x93 in position 10562: invalid start byte
File test
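
Some of the failures above are decoding errors on plain text files. One way to handle them is to try several encodings in turn. The sketch below assumes a candidate list (`utf-8`, `cp1251`, `latin-1`) that is only a guess for this dataset; the function name `read_text_with_fallback` is hypothetical.

```python
import tempfile
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8", "cp1251", "latin-1")):
    """Return the file contents decoded with the first encoding that succeeds."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Reached only if the caller's list has no catch-all encoding such as latin-1
    return raw.decode("utf-8", errors="replace")

# Demo on a temporary file written in cp1251, which is not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("Шеннон".encode("cp1251"))
    decoded = read_text_with_fallback(sample)
print(decoded)
```

The order of the list matters: a permissive encoding placed first would accept bytes that a stricter one could have decoded correctly.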

### 1.2.2. Tests for read

In [41]:
# Changed test: len=27 because of library not working in this distribution
# Changed 'deep-features-scene (1).pdf' file to cs.pdf

assert len(files_data) == 27 
print(len(files_data))

assert "Protasov" in get_file_strings(os.path.join(test_dir, 'at least this file.txt')), "TXT File parsed incorrectly"
assert "Computer Science" in get_file_strings(os.path.join(test_dir, 'cs.pdf')), "PDF File parsed incorrectly"

27
File test_files/at least this file.txt parsed successfully
File test_files/cs.pdf parsed successfully


## 1.3. Tests

In [47]:
def find(query, text_files):
    # Extremely simple search engine
    filenames = set()
    for filename, word_list in text_files.items():
        if query in word_list:
            filenames.add(filename)
    return filenames

In [51]:
queries = ["segmentation", "algorithm", "printf", "predecessor", "Huffman",
           "function", "constructor", "machine learning", "dataset", "Protasov"]

for query in queries:
    r = find(query, files_data)
#    print("Results for: ", query)
#    print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
# This assert will not work, as some documents are not parsed properly by library
#    assert len(r) > 1, "Query should return at least 2 documents" 
    assert "at least this file.txt" in r, "This file has all the queries. It should be in a result"
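
The `find` function above matches a query only when it equals a whole element of the list, which for the output of `get_file_strings` means a whole line. A substring-based, case-insensitive variant could look like the sketch below; `find_substring` and the demo data are assumptions for illustration.

```python
def find_substring(query, text_files):
    """Return names of files in which any line contains the query, ignoring case."""
    needle = query.casefold()
    return {
        filename
        for filename, lines in text_files.items()
        if any(needle in line.casefold() for line in lines)
    }

# Made-up data in the same shape as files_data: filename -> list of lines
demo = {
    "notes.txt": ["Huffman coding is a greedy algorithm", "by Protasov"],
    "nn.cpp": ['printf("hello");'],
}
print(find_substring("algorithm", demo))
```

With this variant a query such as `machine learning` would match anywhere inside a line rather than only a line consisting of exactly those words.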

# Part 2. Parse me if you can

Sometimes when crawling we have to parse websites that turn out to be SaaS, i.e., a special JS application is downloaded first and then renders the documents. Therefore, the data to be rendered initially arrives in a proprietary format. One example is Google Drive. Last time we downloaded and parsed some files from GDrive; however, we didn't parse GDrive-specific file formats, such as Google Sheets or Google Slides.

Today we will learn to obtain and parse such data using Selenium, a framework for testing web applications.

## 2.1. Getting started

Let's try to load and parse the page the way we did before:

In [52]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing")
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.body.text[:1000])

The file could not be opened because JavaScript is disabled in your browser. Enable it and reload the page.Some PowerPoint features are not supported in Google Slides. They will be removed if you edit the document.Learn more…6. Approximate nearest neighbours search 2. Trees   View  ShareSign inThe browser version you are using is no longer supported. Please install a supported browser.Closedocument.getElementById('docs-unsupported-browser-bar').addEventListener('click', function () {this.parentNode.parentNode.removeChild(this.parentNode);return false;});FileEditViewHelpAccessibilityDebugUnsaved changes: DriveLatest changes      Accessibility  View only     DOCS_timing['che'] = new Date().getTime();DOCS_timing['chv'] = new Date().getTime();Presentation as HTML(function(){/*

 Copyright The Closure Library Authors.
 SPDX-License-Identifier: Apache-2.0
*/
var a=this||self;function b(){this.g=thi

As we see, the output is not what we expect. So what can we do when a page is not loaded right away but is rendered by a script? Browser engines can help us get the data. Let's load the same web page in a different way: we give the browser some time to load and run the scripts, and then work with the DOM (Document Object Model), obtained from the browser engine itself rather than from BeautifulSoup.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called drivers interpret our commands and translate them into browser actions.


To support each browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable which passes commands from Selenium to the browser, e.g. [geckodriver (Firefox)](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location.
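
A quick way to check the last step is to ask Python whether the driver executable is visible on PATH. This is a small sketch using only the standard library; the helper name `locate_driver` is an assumption, and `geckodriver` is used because the Firefox driver is the one launched later in this notebook.

```python
import shutil

def locate_driver(name="geckodriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)

driver_path = locate_driver()
print(driver_path or "geckodriver is not on PATH; pass its location explicitly")
```

If the function returns `None`, the binary location has to be given explicitly when the browser is launched.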

### 2.1.1. Download driver

Download the driver and place it in any folder, or in a folder listed in the PATH environment variable.

### 2.1.2. Install selenium

In [None]:
!pip install -U selenium

In [53]:
from selenium import webdriver

### 2.1.3. Launch browser

This will open a browser window.

In [71]:
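# Selenium 4 takes the driver location through a Service object
from selenium.webdriver.firefox.service import Service
browser = webdriver.Firefox(
    service=Service(executable_path='/Users/osmiyg/opt/geckodriver')
)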

### 2.1.4. Download the page


In [72]:
from selenium.webdriver.common.by import By

# element lookups will retry for up to 5 seconds before giving up
browser.implicitly_wait(5)
# navigate to page
browser.get('http://tiny.cc/00dhkz')

# select all text parts from document
elements = browser.find_elements(By.CSS_SELECTOR, "g.sketchy-text-content-text")
# note: if this number differs from launch to launch, extend the wait time
print("Elements found:", len(elements))

# oh no! It glues all the words!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements(By.CSS_SELECTOR, "text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 95
What if just a silly approach: Forestsofsearchtrees
What if a smart approach: Forests of search trees
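
The comment in the cell above says that a changing element count means the wait was too short. One way to make this less fragile is to poll until the count stops changing. The helper below is a sketch of that idea; `wait_until_stable` is a hypothetical name, and the demo feeds it fake readings instead of a real browser. Selenium also offers explicit waits via `WebDriverWait`, which may be a better fit in practice.

```python
import time

def wait_until_stable(count_fn, timeout=10.0, poll=0.5):
    """Poll count_fn until two consecutive non-zero readings agree or the timeout expires."""
    deadline = time.monotonic() + timeout
    previous = count_fn()
    while time.monotonic() < deadline:
        time.sleep(poll)
        current = count_fn()
        if current == previous and current > 0:
            return current
        previous = current
    return previous  # best effort: the last reading seen before the timeout

# Fake readings that imitate a page which is still rendering
readings = iter([0, 40, 95, 95])
stable = wait_until_stable(lambda: next(readings), timeout=5, poll=0.01)
print(stable)
```

With a live browser, `count_fn` could be a lambda returning the length of the list of elements matched by the selector.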


In [73]:
browser.quit()

- Drawback: this is slow, because we have to wait for the browser to open and for the page to render.

## 2.2. Headless

Browsers (at least [FF](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [Chrome](https://intoli.com/blog/running-selenium-with-headless-chrome/)) have a headless mode, in which no window is rendered. This means it should work much faster!

In [65]:
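from selenium.webdriver.firefox.service import Service

options = webdriver.FirefoxOptions()

# Firefox takes dash-prefixed command-line flags
options.add_argument('-headless')
options.add_argument('--width=1200')
options.add_argument('--height=600')
browser = webdriver.Firefox(options=options,
                            service=Service(executable_path='/Users/osmiyg/opt/geckodriver'))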

In [66]:
## SAME CODE
from selenium.webdriver.common.by import By

# element lookups will retry for up to 5 seconds before giving up
browser.implicitly_wait(5)
# navigate to page
browser.get('http://tiny.cc/00dhkz')

# select all text parts from document
elements = browser.find_elements(By.CSS_SELECTOR, "g.sketchy-text-content-text")
# note: if this number differs from launch to launch, extend the wait time
print("Elements found:", len(elements))

# careful: in headless mode the raw text may come out differently (e.g. with extra newlines)
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements(By.CSS_SELECTOR, "text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 95
What if just a silly approach: Forestsofsearchtrees
What if a smart approach: Forests of search trees
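
The difference between the two approaches can be reproduced without a browser. The snippet below parses a simplified, hypothetical piece of markup (real Google Slides markup is more complex) in which every word sits in its own `text` node, and compares plain concatenation with joining by spaces.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one text block; not copied from a real page
snippet = (
    '<g class="sketchy-text-content-text">'
    "<text>Forests</text><text>of</text><text>search</text><text>trees</text>"
    "</g>"
)
group = ET.fromstring(snippet)

# Concatenating all text content glues the words together
glued = "".join(group.itertext())
# Joining the per-word nodes with spaces restores the phrase
spaced = " ".join(node.text for node in group.iter("text"))

print(glued)
print(spaced)
```

Since each word is a separate node with no whitespace in between, only the second approach recovers readable text.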


In [67]:
browser.quit()

### 2.2.1. NB 
Note that browser behavior can differ for the same code! In the run shown above both modes printed the same text, but this is not guaranteed.

## 2.3. Task 
Our lectures usually have lots of links. Here are the links to the original (spring 2020) versions of the documents.

[4. Vector space](https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing)

[6. search trees](https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing)

[7-8. Web basics](https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing)

Please complete the following tasks:

### 2.3.1. Search for slides with numbers
I want to type a word, and it should say which slides of which lecture have this word.
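
Once slide texts are collected, the lookup itself can be quite small. The sketch below assumes a nested dictionary `lecture title -> {slide number: slide text}`; this structure, the function name `find_slides`, and the sample texts are all assumptions for illustration, not real slide contents.

```python
def find_slides(query, lectures):
    """Return sorted (lecture, slide_number) pairs whose text contains the query."""
    needle = query.casefold()
    hits = []
    for lecture, slides in lectures.items():
        for slide_num, text in slides.items():
            if needle in text.casefold():
                hits.append((lecture, slide_num))
    return sorted(hits)

# Made-up slide texts, keyed by lecture title and slide number
lectures = {
    "4. Vector space": {1: "Vector space model", 2: "Cosine similarity"},
    "6. search trees": {1: "Forests of search trees", 2: "Cosine distance in trees"},
}
print(find_slides("cosine", lectures))
# prints: [('4. Vector space', 2), ('6. search trees', 2)]
```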

In [None]:
def getTextAndImgsFromSlides(url):    
    slides_text = dict() # dictionary slide_num : slide_text
    img_list = [] # list of image urls 
    #TODO: parse google slides and save all text and image urls in slides_text and img_list
    # You should get the contents from ALL slides. However, at any moment only a single
    # slide plus a few slide previews on the left are visible. To reach all slides
    # you will need to scroll to these previews and click them. While slide contents 
    # can be obtained from the previews themselves, speaker notes (which you also have to
    # extract) can be viewed only when a particular slide is open.
    # To scroll the element of interest into view, you can use this: 
    # browser.execute_script("arguments[0].scrollIntoView();", el)
    # To click the element, you can use the ActionChains class   
    
    
    return slides_text, img_list

Parsing three presentations

In [None]:
links = ["https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing", 
         "https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing",
         "https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing"]

all_imgs = []
all_texts = dict()

for i, link in enumerate(links):
    texts, imgs = getTextAndImgsFromSlides(link)
    all_texts[i] = texts  # lecture index -> {slide_num: slide_text}
    all_imgs.extend(imgs)


### 2.3.2. Tests

In [None]:
texts, imgs = getTextAndImgsFromSlides('http://tiny.cc/00dhkz')

assert len(texts) == 35 # equal to the total number of slides in the presentation 
print(len(texts))

assert len(imgs) > 26 # can be more than that due to visitor icons
print(len(imgs))

assert any("Navigable" in value for value in texts.values()) # word is on a slide
assert any("MINUS" in value for value in texts.values()) # word is in speaker notes

In [70]:
queries = ["architecture", "algorithm", "function", "dataset", 
           "Protasov", "cosine", "модель", "например"]

for query in queries:
    r = find(query, texts)
    print("Results for: ", query)
    print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"

NameError: name 'texts' is not defined