# Step 2 - Parsing and storing data

This notebook demonstrates using BeautifulSoup to parse the page source that was saved with Selenium in step 1, and then saving the extracted data to a CSV file for later use. For speed, it works on a truncated version of the source: the first batch of entries that is rendered before scrolling down, as demonstrated in the previous notebook.

The script named '' contains everything demonstrated here and can be run on the complete page source obtained by running `selen.py`.

In [4]:
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
with open("aa_short.txt") as file:
    soup = BeautifulSoup(file, 'html.parser')

In [3]:
# Shows first 500 characters in string-ified soup
str(soup)[:500]

'<html lang="en-US"><head><meta charset="utf-8"/><link href="https://gmpg.org/xfn/11" rel="profile"/><title>Browse the Directory of Online Meetings | Online Intergroup of Alcoholics Anonymous</title><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="The OIAA Directory features 1,000+ online AA meetings worldwide, ranging from video or telephone conferences to email or chat groups in many languages, available 24/7. Browse the next available or search for the right '

## Examining one meeting entry

Here, BeautifulSoup's `find` method limits the search to the first entry only. This is useful for exploring the data and planning which fields it might map to before iterating over all entries.
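As a minimal sketch of that difference, the snippet below runs `find` and `find_all` on a made-up two-entry fragment. The fragment and the second meeting name are illustrative assumptions, not taken from the saved page source.

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry fragment, used only for illustration.
snippet = """
<article><h2>AA Lucan</h2></article>
<article><h2>Example Meeting</h2></article>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# find returns the first matching Tag, or None when nothing matches.
first_entry = demo_soup.find("article")

# find_all returns every matching Tag in document order.
all_entries = demo_soup.find_all("article")

first_name = first_entry.h2.get_text(strip=True)
entry_count = len(all_entries)
print(first_name, entry_count)
```

Working on a single entry first keeps the printed output short enough to read, and the same selectors can later be applied inside a loop over `find_all`.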

This page uses the newer semantic HTML elements to organize content, in this case an `<article>` tag for each entry. However, since many sites still don't use these, BeautifulSoup can also select tags based on attributes such as `class`. Both strategies are shown below:

In [5]:
meeting = soup.find('article')

In [6]:
print(meeting.prettify())

<article class="css-ggcp4y">
 <div class="css-j7qwjs">
  <div class="css-9zzdmh">
   <h2 class="css-1m3h46c">
    <span>
     <span class="">
      AA Lucan
     </span>
    </span>
   </h2>
   <h3 class="css-1kn9d3w">
    Tuesday 3:00 pm
   </h3>
  </div>
  <div class="css-82qlwu">
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <path d="M16 16c0 1.104-.896 2-2 2h-12c-1.104 0-2-.896-2-2v-8c0-1.104.896-2 2-2h12c1.104 0 2 .896 2 2v8zm8-10l-6 4.223v3.554l6 4.223v-12z" fill="currentColor">
      </path>
     </svg>
     Zoom
    </button>
   </div>
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Email lucanonlinegroup@gmail.com" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <g fill="currentColor">
    

In [8]:
also_meeting = soup.find(class_='css-ggcp4y')
print(also_meeting.prettify())

<article class="css-ggcp4y">
 <div class="css-j7qwjs">
  <div class="css-9zzdmh">
   <h2 class="css-1m3h46c">
    <span>
     <span class="">
      AA Lucan
     </span>
    </span>
   </h2>
   <h3 class="css-1kn9d3w">
    Tuesday 3:00 pm
   </h3>
  </div>
  <div class="css-82qlwu">
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <path d="M16 16c0 1.104-.896 2-2 2h-12c-1.104 0-2-.896-2-2v-8c0-1.104.896-2 2-2h12c1.104 0 2 .896 2 2v8zm8-10l-6 4.223v3.554l6 4.223v-12z" fill="currentColor">
      </path>
     </svg>
     Zoom
    </button>
   </div>
   <div class="css-3zg5es">
    <button class="css-1akp03c" title="Email lucanonlinegroup@gmail.com" type="button">
     <svg class="css-h7g82p" focusable="false" role="presentation" viewbox="0 0 24 24">
      <g fill="currentColor">
    

Upon examination, two important things become clear:

1. Selection by class name is the way to go, because the class names are unique to each data field, with different names for the title, time, description, and category elements. Specificity is king.

2. Some string parsing will be needed. Anchor tags (`<a>`) are not used here for the links; instead, URLs are found within the `title` attribute of each `<button>` tag. Date/time info will also need to be extracted from a string.
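The string parsing described in point 2 might look something like the sketch below. It uses only the strings visible in the output above. The helper names `parse_button_title` and `parse_day_time` are hypothetical, and the sketch assumes every button title has the form "<word> <value>" and every time string the form "<day> <h:mm am/pm>", which may not hold for all entries.

```python
from datetime import datetime


def parse_button_title(title):
    """Split a title such as 'Visit <url>' into a (kind, value) pair.

    Hypothetical helper: assumes the first word labels the contact type.
    """
    kind, _, value = title.partition(" ")
    return kind.lower(), value


def parse_day_time(text):
    """Split 'Tuesday 3:00 pm' into a day name and a 24-hour 'HH:MM' string.

    Hypothetical helper: %p parsing assumes an English (C) locale.
    """
    day, _, clock = text.partition(" ")
    return day, datetime.strptime(clock, "%I:%M %p").strftime("%H:%M")


zoom = parse_button_title(
    "Visit https://us02web.zoom.us/j/86771088384?pwd=UmJRNzFLTHd1R3M1ZmJMaXhiMTlWZz09"
)
email = parse_button_title("Email lucanonlinegroup@gmail.com")
day, start = parse_day_time("Tuesday 3:00 pm")

print(zoom)
print(email)
print(day, start)
```

In practice the input strings would come from `button["title"]` and the text of the `<h3>` tag rather than from literals. Keeping the parsing in small functions makes it easier to handle entries that don't match the assumed pattern.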