# Challenges Scraping Non-Tabular Data

On <a href="https://sandeepmj.github.io/scrape-example-page">this demo page</a> I've reproduced several variations of issues we are likely to encounter when scraping.

- Review the scrape of a well-organized page.
- Dynamically getting column names.
- Scraping a challenging page.
- Excluding multi-classes.


Let's start by scraping <a href="https://sandeepmj.github.io/scrape-example-page/#organized">the organized CEO data</a>.

In [5]:
## import libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests



In [2]:
## target url
url = "https://sandeepmj.github.io/scrape-example-page/"

In [3]:
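## request the page; stop with an error if the server does not return a success status

response = requests.get(url, timeout=30)
response.raise_for_status()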

In [6]:
## turn into soup

soup = BeautifulSoup(response.text, "html.parser")

In [10]:
organized = soup.find(id="organized")
organized

<section id="organized">
<h2>Organized - Top 5 Compensated CEOs in 2018</h2>
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
<div class="ceo">
<p class="rank">Rank: 2</p>
<p class="name">Name: Frank Bisignano</p>
<p class="annual_compensation">Annual Compensation: $102.2 million</p>
<p class="company">Company: First Data (FDC)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 3</p>
<p class="name">Name: Michael Rapino</p>
<p class="annual_compensation">Annual Compensation: $70.6 million</p>
<p class="company">Company: Live Nation Entertainment (LYV)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 4</p>
<p class="name">Name: Leslie Moonves</p>
<p class="annual_compensation">Annual Compensation: 68.4 million</p>
<p class="company">Company: CBS</p>
</div>
<div class="ceo">
<p class="rank">Rank: 5</p>
<p class="name">Name: Gregory Ma

In [8]:
type(organized)

bs4.element.Tag

In [12]:
## isolate ceos

ceos = organized.find_all('div', class_="ceo")
ceos

[<div class="ceo">
 <p class="rank">Rank: 1</p>
 <p class="name">Name: Hock E. Tan</p>
 <p class="annual_compensation">Annual Compensation: $103.2 million</p>
 <p class="company">Company: Broadcom</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 2</p>
 <p class="name">Name: Frank Bisignano</p>
 <p class="annual_compensation">Annual Compensation: $102.2 million</p>
 <p class="company">Company: First Data (FDC)</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 3</p>
 <p class="name">Name: Michael Rapino</p>
 <p class="annual_compensation">Annual Compensation: $70.6 million</p>
 <p class="company">Company: Live Nation Entertainment (LYV)</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 4</p>
 <p class="name">Name: Leslie Moonves</p>
 <p class="annual_compensation">Annual Compensation: 68.4 million</p>
 <p class="company">Company: CBS</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 5</p>
 <p class="name">Name: Gregory Maffei</p>
 <p class="annual_compensation">Annua

### The same steps each time:

* Is the content on the page (use ```Reveal Source```)?
* Where and how is the content held on the page?
* Which classes and IDs do we target?
* Is there a pattern?
* Is there anything that breaks the pattern?
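The checklist above can be sketched in code. This is a minimal illustration, not necessarily how the rest of this notebook proceeds: it assumes a hand-copied snippet (`sample_html`, a hypothetical name) holding two of the CEO blocks shown in the output above, so it runs without a network request. The names `sample_soup`, `records`, and `missing_dollar` are also made up for this example. Each `<p>` tag's class becomes the column name, and a simple check flags the one entry that breaks the `$` pattern.

```python
from bs4 import BeautifulSoup

# Two CEO blocks copied from the output above (ranks 1 and 4)
sample_html = '''<section id="organized">
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
<div class="ceo">
<p class="rank">Rank: 4</p>
<p class="name">Name: Leslie Moonves</p>
<p class="annual_compensation">Annual Compensation: 68.4 million</p>
<p class="company">Company: CBS</p>
</div>
</section>'''

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for ceo in sample_soup.find_all("div", class_="ceo"):
    record = {}
    for p in ceo.find_all("p"):
        # The tag's first class doubles as the column name
        column = p["class"][0]
        # Text looks like "Label: value"; keep only what follows the first colon
        value = p.get_text(strip=True).split(":", 1)[1].strip()
        record[column] = value
    records.append(record)

# Anything that breaks the pattern? Flag compensation values missing the "$"
missing_dollar = [r["name"] for r in records
                  if not r["annual_compensation"].startswith("$")]

print(records)
print(missing_dollar)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` if a table is the end goal.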

# Excluding classes

Most modern sites have tags that carry multiple classes.

What if you want to target tags that have a single class, but that class also appears on tags alongside other classes that hold other types of content?

For example, capture the ```Excluding Some Classes``` section of our page in a ```BeautifulSoup``` object.



In [42]:
## RUN this cell that holds some html
some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''
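One possible way to keep only the tags whose class is exactly ```z``` is sketched below. It repeats the ```some_html``` string so the block runs on its own. The names `demo_soup`, `loose`, `exact`, `exact_css`, and `targets` are made up for this illustration. It relies on the fact that ```BeautifulSoup``` stores a tag's classes as a list, so a tag with only ```z``` has the class list `["z"]`.

```python
from bs4 import BeautifulSoup

some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''

demo_soup = BeautifulSoup(some_html, "html.parser")

# class_="z" matches ANY tag whose class list contains "z",
# so the multi-class tags come along too
loose = demo_soup.find_all("li", class_="z")

# Option 1: a function filter that keeps tags whose class list is exactly ["z"]
exact = demo_soup.find_all(
    lambda tag: tag.name == "li" and tag.get("class") == ["z"]
)

# Option 2: a CSS attribute selector that compares the whole class attribute
exact_css = demo_soup.select('li[class="z"]')

targets = [li.get_text(strip=True) for li in exact]
print(len(loose), len(exact), len(exact_css))
print(targets)
```

The function filter is the more explicit of the two; the CSS selector is shorter but would miss a tag written as `class="z "` with stray whitespace in the attribute.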



### Back to our CEOs