# Step 2 - Parsing and storing data

This is a demo of using BeautifulSoup to parse the page source that was saved in step 1 with Selenium, and then saving the extracted data to a csv file for later use. For speed, this notebook uses a truncated version of the source: the first batch of entries that is rendered before scrolling down, as demonstrated in the previous notebook.

The script named `extract.py` contains everything demonstrated here, and can be run on the complete page source that was obtained by running `selen.py`.

In [67]:
from bs4 import BeautifulSoup
import pandas as pd
import re

In [68]:
with open("aa_short.txt") as file:
    soup = BeautifulSoup(file, 'html.parser')

In [69]:
# Shows first 500 characters in string-ified soup
str(soup)[:500]

'<html lang="en-US"><head><meta charset="utf-8"/><link href="https://gmpg.org/xfn/11" rel="profile"/><title>Browse the Directory of Online Meetings | Online Intergroup of Alcoholics Anonymous</title><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="The OIAA Directory features 1,000+ online AA meetings worldwide, ranging from video or telephone conferences to email or chat groups in many languages, available 24/7. Browse the next available or search for the right '

### Examining the data

Here, BeautifulSoup's `find` method is used in order to limit the search to only the first entry. This is good for exploring the various data and planning what fields that data might map to, before iterating over all entries.

This page uses HTML5 semantic elements to organize its content, in this case an `<article>` tag for each entry. Since many sites still don't use these, BeautifulSoup can also select tags based on attributes such as the class. Both strategies are shown below:

In [70]:
meeting = soup.find('article')

In [71]:
print(meeting.prettify())

<article class="css-ggcp4y">
 <div class="css-j7qwjs">
  <div class="css-9zzdmh">
   <h2 class="css-1m3h46c">
    <span>
     <span class="">
      AA Lucan
     </span>
    </span>
   </h2>
   <h3 class="css-1kn9d3w">
    Tuesday 3:00 pm
   </h3>
  </div>
  <div class="css-82qlwu">
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <path d="M16 16c0 1.104-.896 2-2 2h-12c-1.104 0-2-.896-2-2v-8c0-1.104.896-2 2-2h12c1.104 0 2 .896 2 2v8zm8-10l-6 4.223v3.554l6 4.223v-12z" fill="currentColor">
      </path>
     </svg>
     Zoom
    </button>
   </div>
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Email lucanonlinegroup@gmail.com" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <g fill="currentColor">
    

In [72]:
also_meeting = soup.find(class_='css-ggcp4y')
print(also_meeting.prettify())

<article class="css-ggcp4y">
 <div class="css-j7qwjs">
  <div class="css-9zzdmh">
   <h2 class="css-1m3h46c">
    <span>
     <span class="">
      AA Lucan
     </span>
    </span>
   </h2>
   <h3 class="css-1kn9d3w">
    Tuesday 3:00 pm
   </h3>
  </div>
  <div class="css-82qlwu">
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <path d="M16 16c0 1.104-.896 2-2 2h-12c-1.104 0-2-.896-2-2v-8c0-1.104.896-2 2-2h12c1.104 0 2 .896 2 2v8zm8-10l-6 4.223v3.554l6 4.223v-12z" fill="currentColor">
      </path>
     </svg>
     Zoom
    </button>
   </div>
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Email lucanonlinegroup@gmail.com" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <g fill="currentColor">
    

Upon examination, two important things become clear:

1. Selection by class name is the way to go, because each data field has its own class name: the title, time, description, and category elements all differ. Specificity is king. Selecting by class would also cover the case where someone may be running an older webdriver that doesn't render HTML5 tags.

2. Some string parsing will be needed. The links are not anchor tags (`<a>`); instead, each URL is found within the `title` attribute of a `<button>` tag. Date/time info will also need to be extracted from a string.

**List of useful class names:**

    Article:        css-ggcp4y
    Name h2:        css-1m3h46c
    Datetime h3:    css-1kn9d3w
    Link button:    css-1akp03c
    Description p:  css-fzcsno
    Category div:   css-108n2y7

##### Name 

In [73]:
name = soup.find(class_='css-1m3h46c')
print(name.get_text())

AA Lucan


##### Datetime

In [74]:
datetime = soup.find(class_='css-1kn9d3w').get_text()
print(datetime)

Tuesday 3:00 pm


In [75]:
print(re.match('.*day', datetime).group())

Tuesday


In [76]:
# Last 8 characters will cover 2-digit hours, just strip whitespace for single digit hours
print(datetime[-8:].strip())

3:00 pm
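
Slicing the last eight characters works for the strings seen so far. As an alternative, a single regular expression can pull out both parts at once and leave unexpected input untouched. This is a minimal sketch: the helper name `split_day_time` is hypothetical, and the pattern assumes the "Weekday H:MM am/pm" format shown above.

```python
import re

# Assumed format: "<Weekday> <H:MM> <am|pm>", e.g. "Tuesday 3:00 pm".
DAY_TIME = re.compile(
    r"(?P<day>\w+day)\s+(?P<time>\d{1,2}:\d{2}\s*[ap]m)", re.IGNORECASE
)

def split_day_time(text):
    """Return (day, time); time is None when the text has no clock time."""
    match = DAY_TIME.fullmatch(text.strip())
    if match:
        return match.group("day"), match.group("time")
    # Anything else (for example a label with no time) is kept whole as the day.
    return text.strip(), None

print(split_day_time("Tuesday 3:00 pm"))
print(split_day_time("Saturday 11:30 am"))
```

Because the time is captured by the pattern rather than by position, one-digit and two-digit hours need no special handling.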


##### Link buttons

From inspecting the page in the browser, it appears there are more types of links than just zoom and email. Let's check:

In [77]:
link_bttns = soup.find_all(class_='css-1akp03c')

In [78]:
link_titles = [link_bttn['title'] for link_bttn in link_bttns]
print(link_titles)

['Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09', 'Email lucanonlinegroup@gmail.com', 'Visit https://us02web.zoom.us/j/3728994472', 'Email aaportugalnorth@outlook.com', 'Visit https://zoom.us/j/96250625525', 'Call 16465588656', 'Email hopenonlysi@gmail.com', 'Visit https://meet.jit.si/247recovery', 'Email nofeesnodues@gmail.com', 'Visit https://us02web.zoom.us/j/802496652', 'Call 16699006833,,802496652', 'Email superdave1212@mac.com', 'Visit https://zoom.us/j/582456897', 'Email rocklinfwpaa@gmail.com', 'Visit https://zoom.us/j/188177606', 'Visit https://us02web.zoom.us/j/79540296512?pwd=MUFxNklOdFNJeFluWVJHbE5xWDlHdz09%20"', 'Call 16699006833,,,,79540296512#,,,,*736666#', 'Email thebrokenelevatorgroup@gmail.com', 'Visit https://us02web.zoom.us/j/7981521081', 'Call 6465588656,,7981521081#', 'Email WITS3333@gmail.com', 'Visit https://us04web.zoom.us/j/89136475364', 'Email 4thdimensionmtg@gmail.com']


Indeed, there is also a phone number option. Therefore, for this field let's go ahead and work with all results, to make sure all three versions are covered. This won't be the same loop used in the final script though, because the final script will work row-by-row for performance.

In [79]:
# Matching video meeting URL links
re.match('Visit (.+)', link_titles[0]).group(1)

'https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09'

In [80]:
# Matching email links
re.match('Email (.+)', link_titles[1]).group(1)

'lucanonlinegroup@gmail.com'

In [81]:
# Matching phone meetings
re.match('Call (.+)', link_titles[5]).group(1)

'16465588656'

In [82]:
stripped_links = []

for title in link_titles:
    if re.match('Visit (.+)', title):
        stripped_links.append(re.match('Visit (.+)', title).group(1))
    elif re.match('Email (.+)', title):
        stripped_links.append(re.match('Email (.+)', title).group(1))
    elif re.match('Call (.+)', title):
        stripped_links.append(re.match('Call (.+)', title).group(1))
    else:
        stripped_links.append("No match")

print(stripped_links)

['https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09', 'lucanonlinegroup@gmail.com', 'https://us02web.zoom.us/j/3728994472', 'aaportugalnorth@outlook.com', 'https://zoom.us/j/96250625525', '16465588656', 'hopenonlysi@gmail.com', 'https://meet.jit.si/247recovery', 'nofeesnodues@gmail.com', 'https://us02web.zoom.us/j/802496652', '16699006833,,802496652', 'superdave1212@mac.com', 'https://zoom.us/j/582456897', 'rocklinfwpaa@gmail.com', 'https://zoom.us/j/188177606', 'https://us02web.zoom.us/j/79540296512?pwd=MUFxNklOdFNJeFluWVJHbE5xWDlHdz09%20"', '16699006833,,,,79540296512#,,,,*736666#', 'thebrokenelevatorgroup@gmail.com', 'https://us02web.zoom.us/j/7981521081', '6465588656,,7981521081#', 'WITS3333@gmail.com', 'https://us04web.zoom.us/j/89136475364', '4thdimensionmtg@gmail.com']
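
The three branches above differ only in the leading word, so they could also be folded into one pattern plus a lookup table. A sketch of that idea, not the approach used in the final script; `PREFIX_TO_FIELD`, `TITLE_PATTERN`, and `classify_title` are hypothetical names.

```python
import re

# Maps the verb at the start of each button title to the field it fills.
PREFIX_TO_FIELD = {"Visit": "Video", "Email": "Email", "Call": "Phone"}
TITLE_PATTERN = re.compile(r"(Visit|Email|Call) (.+)")

def classify_title(title):
    """Return (field, value), or (None, title) when no known prefix matches."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return None, title
    return PREFIX_TO_FIELD[match.group(1)], match.group(2)

samples = [
    "Visit https://zoom.us/j/96250625525",
    "Call 16465588656",
    "Email hopenonlysi@gmail.com",
]
for sample in samples:
    print(classify_title(sample))
```

Each title is matched once instead of twice, and supporting a new link type would only need a new dictionary entry and pattern alternative.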


##### Description

In [83]:
desc = soup.find(class_='css-fzcsno').get_text()
print(desc)

An open AA meeting based out of Dublin, Ireland, open to all. It is on every day at 8pm GMT (Google "what time is it in Dublin")  The meeting requires the ZOOM app. Just click the link below. The password is embedded in the link. No further password is required.


##### Categories

In [84]:
category = soup.find(class_='css-108n2y7').get_text()
print(category)

Audio


### Extracting data from one entry

The final algorithm will work row-by-row, one entry at a time, rather than column-wise. Aside from the performance gain, this is necessary in this case because each entry can have both multiple links and multiple categories.
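
The alignment problem can be illustrated without any parsing. The sketch below uses the names and link titles of the first three meetings from this notebook's outputs; the variable names are illustrative only.

```python
# Names and link titles of the first three meetings, taken from the notebook outputs.
names = ["AA Lucan", "AA North Portugal", "HOPE Group"]

# Column-wise: one flat list for the whole page; entry boundaries are lost.
flat_titles = [
    "Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09",
    "Email lucanonlinegroup@gmail.com",
    "Visit https://us02web.zoom.us/j/3728994472",
    "Email aaportugalnorth@outlook.com",
    "Visit https://zoom.us/j/96250625525",
    "Call 16465588656",
    "Email hopenonlysi@gmail.com",
]
print(len(names), len(flat_titles))  # lengths differ, so zip() would misalign

# Row-wise: each entry carries its own titles, however many there are.
per_entry = {
    "AA Lucan": flat_titles[0:2],
    "AA North Portugal": flat_titles[2:4],
    "HOPE Group": flat_titles[4:7],
}
for name, titles in per_entry.items():
    print(name, len(titles))
```

Searching inside each entry's own tag, as done below, yields this per-entry grouping directly.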

Here is the logic to work on only the first entry:

In [85]:
first_entry = soup.find(class_='css-ggcp4y')

In [86]:
sample_dict = {}

In [87]:
sample_dict["Name"] = first_entry.find(class_='css-1m3h46c').get_text()
print(sample_dict["Name"])

AA Lucan


In [88]:
daytime = first_entry.find(class_='css-1kn9d3w').get_text()
sample_dict["Day"] = re.match('.*day', daytime).group()
sample_dict["Time"] = daytime[-8:].strip()
print(sample_dict["Day"])
print(sample_dict["Time"])

Tuesday
3:00 pm


In [89]:
entry_link_buttons = first_entry.find_all(class_='css-1akp03c')
entry_link_titles = [link_bttn['title'] for link_bttn in entry_link_buttons]
print(entry_link_titles)

['Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09', 'Email lucanonlinegroup@gmail.com']


In [90]:
# No need to fill empty values for unused link types as pandas will handle that
for title in entry_link_titles:
    if re.match('Visit (.+)', title):
        sample_dict["Video"] = re.match('Visit (.+)', title).group(1)
        print(sample_dict["Video"])
    elif re.match('Email (.+)', title):
        sample_dict["Email"] = re.match('Email (.+)', title).group(1)
        print(sample_dict["Email"])
    elif re.match('Call (.+)', title):
        sample_dict["Phone"] = re.match('Call (.+)', title).group(1)
        print(sample_dict["Phone"])

https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09
lucanonlinegroup@gmail.com


In [91]:
sample_dict["Desc"] = first_entry.find(class_='css-fzcsno').get_text()
print(sample_dict["Desc"])

An open AA meeting based out of Dublin, Ireland, open to all. It is on every day at 8pm GMT (Google "what time is it in Dublin")  The meeting requires the ZOOM app. Just click the link below. The password is embedded in the link. No further password is required.


In [92]:
entry_categ_buttons = first_entry.find_all(class_='css-108n2y7')
entry_categs = [button.get_text() for button in entry_categ_buttons]
print(entry_categs)

['Audio', 'Open', 'Tuesday', 'Video']


Let's remove the weekday category label, since it is redundant.

In [93]:
categ_words = [word for word in entry_categs if not re.match('.*day', word)]
print(categ_words)

['Audio', 'Open', 'Video']


In [94]:
sample_dict["Categories"] = ",".join(categ_words)
print(sample_dict["Categories"])

Audio,Open,Video
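
Note that `re.match('.*day', word)` drops any label containing "day", not only weekday names. If that ever became a concern, an explicit set of weekday names would be stricter. A sketch, with hypothetical names `WEEKDAYS` and `drop_weekdays`; the label "Birthday" is made up purely to show the difference.

```python
# Explicit weekday names; matching is exact, so other labels containing "day" survive.
WEEKDAYS = {
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
}

def drop_weekdays(labels):
    """Return the labels with weekday names removed, order preserved."""
    return [label for label in labels if label not in WEEKDAYS]

labels = ["Audio", "Open", "Tuesday", "Video"]
print(",".join(drop_weekdays(labels)))

# "Birthday" is an invented label, used only to contrast with the '.*day' pattern.
print(drop_weekdays(["Birthday", "Friday"]))
```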


In [95]:
print(sample_dict)

{'Name': 'AA Lucan', 'Day': 'Tuesday', 'Time': '3:00 pm', 'Video': 'https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09', 'Email': 'lucanonlinegroup@gmail.com', 'Desc': 'An open AA meeting based out of Dublin, Ireland, open to all. It is on every day at 8pm GMT (Google "what time is it in Dublin")  The meeting requires the ZOOM app. Just click the link below. The password is embedded in the link. No further password is required.', 'Categories': 'Audio,Open,Video'}


### Extracting data from all entries

In [96]:
all_entries = []

In [97]:
for meeting in soup.find_all(class_='css-ggcp4y'):
    this_entry = {}
    this_entry["Name"] = meeting.find(class_='css-1m3h46c').get_text()
    
    daytime = meeting.find(class_='css-1kn9d3w').get_text()
    # A sub-group of meetings have 'Ongoing' here rather than e.g. 'Tuesday 8:00 pm'
    if re.match('.*day', daytime):
        this_entry["Day"] = re.match('.*day', daytime).group()
        this_entry["Time"] = daytime[-8:].strip()
    else:
        this_entry["Day"] = daytime
    
    entry_link_buttons = meeting.find_all(class_='css-1akp03c')
    entry_link_titles = [link_bttn['title'] for link_bttn in entry_link_buttons]
    for title in entry_link_titles:
        if re.match('Visit (.+)', title):
            this_entry["Video"] = re.match('Visit (.+)', title).group(1)
        elif re.match('Email (.+)', title):
            this_entry["Email"] = re.match('Email (.+)', title).group(1)
        elif re.match('Call (.+)', title):
            this_entry["Phone"] = re.match('Call (.+)', title).group(1)
            
    # Not every meeting has a description
    if meeting.find(class_='css-fzcsno'):
        this_entry["Desc"] = meeting.find(class_='css-fzcsno').get_text()
    
    entry_categ_buttons = meeting.find_all(class_='css-108n2y7')
    entry_categs = [button.get_text() for button in entry_categ_buttons]
    categ_words = [word for word in entry_categs if not re.match('.*day', word)]
    this_entry["Categories"] = ",".join(categ_words)
    
#     entry_categs = meeting.find_all(class_='css-108n2y7')
#     this_entry["Categories"] = ",".join([button.get_text() for button in entry_categs])
    
    all_entries.append(this_entry)

In [98]:
print(all_entries)

[{'Name': 'AA Lucan', 'Day': 'Tuesday', 'Time': '3:00 pm', 'Video': 'https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09', 'Email': 'lucanonlinegroup@gmail.com', 'Desc': 'An open AA meeting based out of Dublin, Ireland, open to all. It is on every day at 8pm GMT (Google "what time is it in Dublin")  The meeting requires the ZOOM app. Just click the link below. The password is embedded in the link. No further password is required.', 'Categories': 'Audio,Open,Video'}, {'Name': 'AA North Portugal', 'Day': 'Tuesday', 'Time': '3:00 pm', 'Video': 'https://us02web.zoom.us/j/3728994472', 'Email': 'aaportugalnorth@outlook.com', 'Desc': 'Public Email Contact, if any: aaportugalnorth@outlook.com', 'Categories': 'Open,Video'}, {'Name': 'HOPE Group', 'Day': 'Tuesday', 'Time': '3:00 pm', 'Video': 'https://zoom.us/j/96250625525', 'Phone': '16465588656', 'Email': 'hopenonlysi@gmail.com', 'Desc': 'Closed Discussion. We read the 3 last paragraphs of page 276 (The last 15 years...)
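
The loop above calls `find` twice for the optional description: once to test for it and once to read it. A small helper could do both with a single lookup. This is a sketch assuming `bs4` is installed; the helper name `text_or_none` and the inline HTML are illustrative, with class names taken from the list earlier in this notebook. Unlike the loop above, it also strips surrounding whitespace.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, class_name):
    """Return the stripped text of the first descendant with class_name, or None."""
    tag = parent.find(class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None

# Made-up markup: an entry with a name but no description.
html = '<article class="css-ggcp4y"><h2 class="css-1m3h46c">AA Lucan</h2></article>'
entry = BeautifulSoup(html, "html.parser").find("article")

print(text_or_none(entry, "css-1m3h46c"))
print(text_or_none(entry, "css-fzcsno"))
```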

### Save to csv with pandas dataframe

A list of dicts is usually not the fastest way to initialize a DataFrame, but this is a small dataset, and pandas automatically fills keys that are missing from an entry with empty values (NaN).

In [99]:
df = pd.DataFrame(all_entries)

In [100]:
df.head()

| | Name | Day | Time | Video | Email | Desc | Categories | Phone |
|---|---|---|---|---|---|---|---|---|
| 0 | AA Lucan | Tuesday | 3:00 pm | https://us02web.zoom.us/j/86771088384?pwd=UmJR... | lucanonlinegroup@gmail.com | An open AA meeting based out of Dublin, Irelan... | Audio,Open,Video | |
| 1 | AA North Portugal | Tuesday | 3:00 pm | https://us02web.zoom.us/j/3728994472 | aaportugalnorth@outlook.com | Public Email Contact, if any: aaportugalnorth@... | Open,Video | |
| 2 | HOPE Group | Tuesday | 3:00 pm | https://zoom.us/j/96250625525 | hopenonlysi@gmail.com | Closed Discussion. We read the 3 last paragrap... | Audio,Discussion,Telephone,Video | 16465588656 |
| 3 | No Fees, No Dues | Tuesday | 3:00 pm | https://meet.jit.si/247recovery | nofeesnodues@gmail.com | It is encouraged that a structured meeting beg... | Audio,Closed,Telephone,Video | |
| 4 | PG & Chill | Tuesday | 3:00 pm | https://us02web.zoom.us/j/802496652 | superdave1212@mac.com | Password 960328 | Audio,Open,Telephone,Video | 16699006833,,802496652 |


In [101]:
df.to_csv("meetings_short.csv", index=False)
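
For comparison, the standard library can write the same kind of ragged list of dicts without pandas. The sketch below swaps in `csv.DictWriter`, whose `restval` argument fills keys missing from a row. It writes to an in-memory buffer rather than a file, and the two sample rows are abbreviated versions of the entries above.

```python
import csv
import io

rows = [
    {"Name": "AA Lucan", "Day": "Tuesday", "Email": "lucanonlinegroup@gmail.com"},
    {"Name": "HOPE Group", "Day": "Tuesday", "Phone": "16465588656"},
]

# Collect column names in first-seen order, since rows do not share all keys.
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The extra pass to collect column names is the bookkeeping that `pd.DataFrame(all_entries)` handles implicitly.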

### Working on complete page source code

The script named `extract.py` runs the code demonstrated in this notebook. It takes two arguments, the source code file to read from, and the csv file to write to, like so:

    $ python extract.py aa_complete.txt meetings_complete.csv

This script has been run on the complete page code and the result saved to `meetings_complete.csv`.
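
The command-line handling for a script like this might look as follows. This is a sketch and not the contents of `extract.py`: the argument names `source` and `output` and the function `parse_args` are assumptions made for illustration.

```python
import argparse

def parse_args(argv=None):
    """Parse the two positional arguments: input source file and output csv."""
    parser = argparse.ArgumentParser(description="Extract meeting data to csv.")
    parser.add_argument("source", help="saved page source to read, e.g. aa_complete.txt")
    parser.add_argument("output", help="csv file to write, e.g. meetings_complete.csv")
    return parser.parse_args(argv)

# Passing a list stands in for the real command line in this demonstration.
args = parse_args(["aa_complete.txt", "meetings_complete.csv"])
print(args.source, args.output)
```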