# 1 - Scraping the web page

Alcoholics Anonymous maintains an online directory of virtual meetings that is a good candidate for easy scraping because:

* It displays all results on a single page, which also accepts URL variables to filter results
* Results are pre-categorized by a tag system indicated by DOM selectors
* Results also contain user-generated text, which opens up the possibility of natural language analysis

The URL for this directory is: [https://aa-intergroup.org/meetings?types=English](https://aa-intergroup.org/meetings?types=English)
(The URL variable `types=English` limits the results to English-language meetings.)

In [2]:
url = "https://aa-intergroup.org/meetings?types=English"
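As a small aside, the same address can be assembled from its parts, which makes the role of the URL variable explicit. This is only a sketch: `BASE_URL` and `build_directory_url` are names made up for illustration, and `types=English` is the only filter taken from the text above; any other filter name would be an assumption about the site.

```python
from urllib.parse import urlencode

# Directory address without any filters (taken from the URL above).
BASE_URL = "https://aa-intergroup.org/meetings"


def build_directory_url(**filters):
    """Return the directory URL with the given filters as URL variables."""
    if not filters:
        return BASE_URL
    return f"{BASE_URL}?{urlencode(filters)}"


url = build_directory_url(types="English")
print(url)  # https://aa-intergroup.org/meetings?types=English
```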

### Dynamic loading means requests won't be enough

First, a quick look at why requests and BeautifulSoup alone are not enough to scrape this page. Although it is not obvious when viewing the page, the content is loaded dynamically, with more results appearing as you scroll down. As a result, the source code returned by requests contains no results at all:

In [3]:
import requests

In [5]:
page = requests.get(url)
if page.status_code == 200:
    print(page.text)
else:
    print("Bad status code:", page.status_code)

<!DOCTYPE html><html
lang=en-US><head><meta
charset="UTF-8"><link
rel=profile href=https://gmpg.org/xfn/11><title>Browse the Directory of Online Meetings | Online Intergroup of Alcoholics Anonymous</title><meta
name="viewport" content="width=device-width, initial-scale=1"><meta
name="description" content="The OIAA Directory features 1,000+ online AA meetings worldwide, ranging from video or telephone conferences to email or chat groups in many languages, available 24/7. Browse the next available or search for the right one for you."><meta
name="robots" content="index, follow"><meta
name="googlebot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"><meta
name="bingbot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"><link
rel=canonical href=https://aa-intergroup.org/meetings><meta
property="og:url" content="https://aa-intergroup.org/meetings/"><meta
property="og:site_name" content="Online Intergroup of Alcoholic
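Reading through the printed source to confirm that results are missing is tedious, so one option is to count the elements that carry a given class instead. The sketch below uses only the standard library. `ClassCounter` and `count_elements_with_class` are hypothetical helpers, and the class name `"meeting"` is a placeholder: the real class used for a result would have to be read from the rendered page's DOM.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html, class_name):
    """Return how many elements in the HTML string carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Made-up sample markup, only to show the helper working.
sample = '<ul><li class="meeting online">A</li><li class="meeting">B</li></ul>'
print(count_elements_with_class(sample, "meeting"))  # 2
```

Calling `count_elements_with_class(page.text, ...)` with the real result class should come back as 0 if the observation above holds.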

### Using Selenium to simulate scrolling down

Selenium can simulate scrolling down, which causes the page to render all of its results. In fact, because it uses a webdriver that starts an actual browser instance, it can obtain the first batch of results even without scrolling, just by waiting a few seconds after the page loads:

In [6]:
import time

In [7]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [8]:
# Chrome's webdriver's headless option keeps an actual browser window from appearing
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

In [9]:
driver.get(url)
time.sleep(3)

In [10]:
print(driver.page_source)

<html lang="en-US"><head><meta charset="UTF-8"><link rel="profile" href="https://gmpg.org/xfn/11"><title>Browse the Directory of Online Meetings | Online Intergroup of Alcoholics Anonymous</title><meta name="viewport" content="width=device-width, initial-scale=1"><meta name="description" content="The OIAA Directory features 1,000+ online AA meetings worldwide, ranging from video or telephone conferences to email or chat groups in many languages, available 24/7. Browse the next available or search for the right one for you."><meta name="robots" content="index, follow"><meta name="googlebot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"><meta name="bingbot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"><link rel="canonical" href="https://aa-intergroup.org/meetings"><meta property="og:url" content="https://aa-intergroup.org/meetings/"><meta property="og:site_name" content="Online Intergroup of Alcoholics Ano
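The scroll-down behavior itself is not shown in the cells above. A common way to do it, sketched here under stated assumptions, is to scroll to the bottom repeatedly until the page height stops growing. `scroll_to_bottom`, `pause`, and `max_rounds` are illustrative names and values, not taken from `selen.py`, and the sketch assumes the results extend the height of the document body rather than an inner scrollable container.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=100):
    """Scroll until the page height stops growing.

    Returns the number of scrolls that caused more content to load.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    for _ in range(max_rounds):
        # Jump to the current bottom of the page, then give new results time to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, so this is presumably the end
        last_height = new_height
        rounds += 1
    return rounds
```

With the `driver` created earlier, this would be called as `scroll_to_bottom(driver)` after `driver.get(url)`, and `driver.page_source` would then hold whatever the page had rendered by that point. A pause that is too short can end the loop early on a slow connection.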

To scroll all the way to the end of the dynamically loaded content, however, it is best not to use the headless option, so that you can visually verify that the scrolling works. The file `selen.py` is a script that does precisely that: it prompts at the command line before closing the browser, which gives the user time to confirm that it reached the bottom of the page, and it saves the page source to a file to be parsed in the next notebook. Usage at the command line looks like this:

    $ python selen.py www.some-page-url.com somefilename.txt
    
The file `aa_complete.txt` contains the output of this script, while `aa_short.txt` represents a smaller batch to be used in notebook 2.
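Since the source of `selen.py` is not reproduced in this notebook, the following is only a rough sketch of how a script with that command-line shape could be organized, not the actual file. It covers the argument handling, the pause before closing, and the save step; the scrolling is left as a comment placeholder. The names `parse_args`, `save_page_source`, and `main` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Read the two positional arguments: the page URL and the output file."""
    parser = argparse.ArgumentParser(description="Save a page's rendered source to a file.")
    parser.add_argument("url", help="address of the page to load")
    parser.add_argument("outfile", help="file to write the page source to")
    return parser.parse_args(argv)


def save_page_source(driver, path):
    """Write the driver's current page source to path; return characters written."""
    with open(path, "w", encoding="utf-8") as f:
        return f.write(driver.page_source)


def main(argv=None):
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver

    args = parse_args(argv)
    driver = webdriver.Chrome()  # no headless option, so the window stays visible
    try:
        driver.get(args.url)
        # (scrolling to the end of the page would happen here)
        input("Check that the browser reached the bottom, then press Enter: ")
        save_page_source(driver, args.outfile)
    finally:
        driver.quit()  # close the browser even if something above failed
```

To use this as a script, `main()` would be called under the usual `if __name__ == "__main__":` guard at the bottom of the file.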