# Python Beautiful Soup Web Scraping (find/find_all, css select, scrape table)

**Resources**\
Simple webpage: https://keithgalli.github.io/web-scraping/example.html \
Example webpage: https://keithgalli.github.io/web-scraping/webpage.html \
Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ \
CSS Selector Reference: https://www.w3schools.com/cssref/css_selectors.asp

## Load in the necessary libraries

In [20]:
import requests  # pip install requests
from bs4 import BeautifulSoup as bs  # pip install beautifulsoup4

## Load our first page

In [21]:
# Load the webpage content
url = 'https://keithgalli.github.io/web-scraping/example.html'
r = requests.get(url)

In [22]:
# Convert to a beautifulsoup object
soup = bs(r.content, 'html.parser')
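
To follow along without a network connection, the same kind of object can be built from a string. This is a minimal sketch: `html_doc` is a hand-copied excerpt of the example page (an assumption, not fetched output), and `offline_soup` is a name chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Hand-copied excerpt of the example page (an assumption for offline use)
html_doc = """
<html>
 <head><title>HTML Example</title></head>
 <body>
  <h2>A Header</h2>
  <p><i>Some italicized text</i></p>
 </body>
</html>
"""

# BeautifulSoup accepts str or bytes; 'html.parser' is Python's built-in parser
offline_soup = BeautifulSoup(html_doc, 'html.parser')
print(offline_soup.title.string)
```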

In [23]:
# Print out our html
# print(soup)
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



## Start using beautifulsoup to scrape

In [24]:
# find and find_all
first_header = soup.find('h2')
print(first_header)

<h2>A Header</h2>


In [25]:
headers = soup.find_all('h2')
print(headers)

[<h2>A Header</h2>, <h2>Another header</h2>]
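
The two calls differ in what they return when nothing matches, which matters once pages get less predictable. A small sketch on a hypothetical two-header snippet (`demo` is not part of the notebook):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<h2>A Header</h2><h2>Another header</h2>', 'html.parser')

# find returns the first match, or None when nothing matches
missing = demo.find('h3')
print(missing)

# find_all always returns a list-like result, empty when nothing matches
print(demo.find_all('h3'))

# limit caps the number of results find_all collects
first_only = demo.find_all('h2', limit=1)
print(first_only)
```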


In [26]:
# pass in a list of elements to look for
headers = soup.find_all(['h1', 'h2'])
print(headers)

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]


In [27]:
# with only a tag name, find_all returns every <p>; attributes (next cell) narrow the search
paragraph = soup.find_all('p')
print(paragraph)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [28]:
paragraph = soup.find_all('p', attrs={'id': "paragraph-id"})
print(paragraph)

[<p id="paragraph-id"><b>Some bold text</b></p>]
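
Besides the `attrs` dictionary, attribute filters can also be written as keyword arguments. A short sketch, assuming a hypothetical snippet with a `note` class that does not appear on the example page:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p id="paragraph-id" class="note">Bold</p><p class="note">Plain</p>',
    'html.parser',
)

# Keyword arguments that are not reserved names are treated as attribute filters
by_id = demo.find_all('p', id='paragraph-id')

# class is a Python keyword, so Beautiful Soup uses class_ instead
by_class = demo.find_all('p', class_='note')

print(len(by_id), len(by_class))
```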


In [29]:
# you can nest find/find_all calls
body = soup.find('body')
div = body.find('div')
h = div.find('h1')
print(h)

<h1>HTML Webpage</h1>


In [30]:
# we can search for specific strings in our find/find_all calls
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



In [31]:
import re
paragraphs = soup.find_all('p', string=re.compile('Some'))
print(paragraphs)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [32]:
header = soup.find_all('h2', string=re.compile('(H|h)eader'))
print(header)

[<h2>A Header</h2>, <h2>Another header</h2>]
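
One caveat worth knowing: when a tag name is given, `string=` is compared against the tag's `.string`, which is `None` if the tag has more than one child. The sketch below uses a hypothetical snippet to show the difference and one possible workaround based on `get_text()`.

```python
import re
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p>Some <b>mixed</b> text</p><p>Some plain text</p>',
    'html.parser',
)

# string= is compared against tag.string, which is None for the first <p>
# because it has more than one child
by_string = demo.find_all('p', string=re.compile('Some'))

# Filtering on get_text() looks at all of the nested text instead
by_text = [p for p in demo.find_all('p') if re.search('Some', p.get_text())]

print(len(by_string), len(by_text))
```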


## Select method (CSS path selections)

In [33]:
content = soup.select('p')
print(content)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [34]:
content = soup.select('div p')
print(content)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]


In [35]:
paragraphs = soup.select('h2 ~ p')
print(paragraphs)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [36]:
bold_text = soup.select('p#paragraph-id b')
bold_text

[<b>Some bold text</b>]

In [37]:
paragraphs = soup.select('body > p')
print(paragraphs)

for p in paragraphs:
  print(p.select('i'))

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]
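
`select` always returns a list, even for a single match. When only one element is expected, `select_one` may be more convenient. A sketch on a hypothetical snippet shaped like the example page:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<body><p><i>Italic</i></p><p id="paragraph-id"><b>Bold</b></p></body>',
    'html.parser',
)

# select always returns a list, so a single match still needs indexing
bold_list = demo.select('p#paragraph-id b')

# select_one returns the first match directly, or None if there is none
bold = demo.select_one('p#paragraph-id b')
missing = demo.select_one('p#no-such-id')

print(bold.string, missing)
```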


In [38]:
# grab elements by a specific attribute
soup.select('[align=middle]')

[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]

## Grabbing the string/text from an HTML element

In [49]:
# use .string
header = soup.find('h2')
print(header.string)
print(header.text)
print(header.get_text())

# if multiple child elements use get_text()
div = soup.find('div')
print(div.prettify())
print(div.get_text())

A Header
A Header
A Header
<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
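
The blank lines in that output come from whitespace between the tags. The sketch below, on a hypothetical snippet, shows why `.string` does not work for a tag with several children and how the `separator` and `strip` arguments of `get_text()` can tidy the result.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<div><h1>HTML Webpage</h1><p>Link: <a href="#">example</a></p></div>',
    'html.parser',
)
div = demo.find('div')

# .string is None here because <div> has more than one child
print(div.string)

# separator joins the text pieces; strip=True trims each piece and drops empty ones
text = div.get_text(separator=' ', strip=True)
print(text)
```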



In [54]:
# Get a specific property from an element

link = soup.find('a')
link['href']

paragraph = soup.select('p#paragraph-id')
paragraph[0]['id']

'paragraph-id'
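
Indexing with `['href']` raises `KeyError` when the attribute is absent. A hedged alternative is `.get`, sketched here on a hypothetical link (the `example.com` URL and the class names are placeholders):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<a href="https://example.com" class="ext big">x</a>',
    'html.parser',
)
link = demo.find('a')

# .get returns None (or a supplied default) instead of raising KeyError
print(link.get('href'))
print(link.get('title'))
print(link.get('title', 'no title'))

# .attrs is a dict of every attribute; class is multi-valued, so it is a list
print(link.attrs)
```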

## Code navigation (parents, children, siblings)

In [57]:
# path syntax
soup.body.div.h1.string

'HTML Webpage'

In [63]:
# know the terms: parent, sibling, child
print(soup.body.prettify())
soup.body.find('div').find_next_siblings()

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
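
The three terms map onto navigation attributes and methods. A minimal sketch on a hypothetical, compact snippet; the markup is written without whitespace between tags so that whitespace-only text nodes do not appear among the children.

```python
from bs4 import BeautifulSoup

# Compact markup (no whitespace between tags) so that no whitespace-only
# text nodes show up among the children and siblings
demo = BeautifulSoup(
    '<body><div><h1>Title</h1></div><h2>A Header</h2><p>Text</p></body>',
    'html.parser',
)
h2 = demo.find('h2')

print(h2.parent.name)                   # tag that directly contains <h2>
print(h2.find_previous_sibling().name)  # nearest earlier sibling tag
print(h2.find_next_sibling().name)      # nearest later sibling tag
print([child.name for child in demo.body.children])
```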

## Exercise 1: Grab all social links on webpage in 3 different ways

In [65]:
# Load the webpage content
url = 'https://keithgalli.github.io/web-scraping/webpage.html'
r = requests.get(url)

# Convert to a beautifulsoup object
webpage = bs(r.content, 'html.parser')

# Print out our html
print(webpage.prettify())

<head>
 <title>
  Keith Galli's Page
 </title>
 <style>
  table {
    border-collapse: collapse;
  }
  th {
    padding:5px;
  }
  td {
    border: 1px solid #ddd;
    padding: 5px;
  }
  tr:nth-child(even) {
    background-color: #f2f2f2;
  }
  th {
    padding-top: 12px;
    padding-bottom: 12px;
    text-align: left;
    background-color: #add8e6;
    color: black;
  }
  .block {
  width: 100px;
  /*float: left;*/
    display: inline-block;
    zoom: 1;
  }
  .column {
  float: left;
  height: 200px;
  /*width: 33.33%;*/
  padding: 5px;
  }

  .row::after {
    content: "";
    clear: both;
    display: table;
  }
 </style>
</head>
<body>
 <h1>
  Welcome to my page!
 </h1>
 <img src="./images/selfie1.jpg" width="300px"/>
 <h2>
  About me
 </h2>
 <p>
  Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
 </p>
 <p>
  Here is a link to my channel:
  <a href="https://www.youtube.com/kgmit">
   youtube.com/kgmit
  </

### Grab all of the social links from webpage

Do this in at least 3 different ways

In [95]:
links = webpage.select('ul.socials a')
print(links)

actual_links = [link['href'] for link in links]
print(actual_links)
print()

# or
links = webpage.find('ul', attrs={'class':'socials'})
links = links.find_all('a')
print(links)

actual_links = [link['href'] for link in links]
print(actual_links)
print()

# or
links = webpage.select('li.social a')
print(links)
actual_links = [link['href'] for link in links]
print(actual_links)

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>, <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>, <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>, <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli']

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>, <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>, <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>, <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@
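
A further option that does not depend on the list structure is to filter on the `href` value itself. The markup below is a hypothetical stand-in that mirrors the `ul.socials` layout, and the domain pattern is an assumption chosen for this sketch.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the ul.socials structure used above
demo = BeautifulSoup(
    '<ul class="socials">'
    '<li class="social"><a href="https://twitter.com/keithgalli">Twitter</a></li>'
    '<li class="social"><a href="https://www.linkedin.com/in/keithgalli/">LinkedIn</a></li>'
    '</ul>'
    '<a href="https://www.youtube.com/kgmit">YouTube</a>',
    'html.parser',
)

# Attribute values can be matched with a regex, independent of page structure
pattern = re.compile(r'twitter\.com|linkedin\.com')
social_links = [a['href'] for a in demo.find_all('a', href=pattern)]
print(social_links)
```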

## Exercise 2: Scrape an HTML table into a Pandas Dataframe

### Scrape the Table

In [109]:
import pandas as pd

table = webpage.select('table.hockey-stats')[0]
columns = table.find('thead').find_all('th')
columns_name = [c.string for c in columns]
columns_name

table_rows = table.find('tbody').find_all('tr')

l = []
for tr in table_rows:
  cells = tr.find_all('td')
  # use a separate loop variable so the outer tr is not shadowed
  row = [cell.get_text().strip() for cell in cells]
  l.append(row)

In [114]:
df = pd.DataFrame(l, columns=columns_name)
df.loc[df.Team != 'Did not play']

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17,3,9,12,20,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9,1,1,2,2,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12,5,5,10,8,0.0,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8,5,10,15,8,,|,,,,,,,
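
Every scraped cell arrives as a string, so numeric columns usually need converting before any arithmetic. A sketch under stated assumptions: the two-row table is hypothetical and much smaller than the real one, and `stats` is a name chosen here.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row table; the real page's table has many more columns
html = """
<table class="hockey-stats">
 <thead><tr><th>S</th><th>Team</th><th>GP</th></tr></thead>
 <tbody>
  <tr><td>2014-15</td><td>MIT</td><td>17</td></tr>
  <tr><td>2017-18</td><td>Did not play</td><td></td></tr>
 </tbody>
</table>
"""
table = BeautifulSoup(html, 'html.parser').select_one('table.hockey-stats')

header = [th.get_text(strip=True) for th in table.find('thead').find_all('th')]
rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in table.find('tbody').find_all('tr')
]
stats = pd.DataFrame(rows, columns=header)

# Scraped cells are strings; errors='coerce' turns unparseable values into NaN
stats['GP'] = pd.to_numeric(stats['GP'], errors='coerce')
print(stats.dtypes)
```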


## Exercise 3: Grab all fun facts that contain the word “is”

In [127]:
import re

facts = webpage.select('ul.fun-facts li')
# \b word boundaries match the word "is" but not the "is" inside "this" or "list"
facts_with_is = [fact.find(string=re.compile(r'\bis\b')) for fact in facts]
# fwi = [fact.strip() for fact in facts_with_is if fact]
fwi = [fact.find_parent().get_text() for fact in facts_with_is if fact]
fwi

['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]

## Exercise 4: Use beautiful soup to help download an image from a webpage

In [130]:
import requests  # pip install requests
from bs4 import BeautifulSoup as bs  # pip install beautifulsoup4

# Load the webpage content
im = 'https://keithgalli.github.io/web-scraping/'
url = im + 'webpage.html'
r = requests.get(url)

# Convert to a beautifulsoup object
webpage = bs(r.content, 'html.parser')

print(webpage.select('div.row div.column img'))

images = webpage.select('div.row div.column img')
image_url = images[0]['src']
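# Join the relative src to the base directory, not to the page URL
# (url ends in 'webpage.html', so url + image_url would be an invalid address)
full_image_url = im + image_url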

img_data = requests.get(full_image_url).content

with open('image_name.jpg', 'wb') as handler:
    handler.write(img_data)

[<img alt="Lake Como" src="images/italy/lake_como.jpg" style="height:100%"/>, <img alt="Pontevecchio, Florence" src="images/italy/pontevecchio.jpg" style="height:100%"/>, <img alt="Riomaggiore, Cinque de Terre" src="images/italy/riomaggiore.jpg" style="height:100%"/>]
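
String concatenation works only when the base ends at the right directory. The standard library's `urljoin` resolves a relative `src` against the page URL directly. A short sketch using paths that appear in the page output above (`page_url` is a name chosen here):

```python
from urllib.parse import urljoin

page_url = 'https://keithgalli.github.io/web-scraping/webpage.html'

# urljoin resolves a relative src against the page it was found on
lake_como = urljoin(page_url, 'images/italy/lake_como.jpg')
selfie = urljoin(page_url, './images/selfie1.jpg')

print(lake_como)
print(selfie)
```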


## Exercise 5: Solve the mystery challenge!!!

In [143]:
files = webpage.select('div.block a')
# print(files)

relative_files = [f['href'] for f in files]

im = 'https://keithgalli.github.io/web-scraping/'
for f in relative_files:
  full_url = im + f
  page = requests.get(full_url)
  bs_page = bs(page.content, 'html.parser')  # name the parser explicitly
  # print(bs_page.body.prettify())
  swe = bs_page.find('p', attrs={'id':'secret-word'})
  sw = swe.string
  print(sw)
  break

Make
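
To collect every word rather than stopping at the first, the loop can drop the `break` and guard against pages that lack the element. A sketch with hypothetical page bodies standing in for the downloaded files; the words used are placeholders, not the real answer.

```python
from bs4 import BeautifulSoup

# Hypothetical page bodies standing in for the downloaded files
pages = [
    '<p id="secret-word">first</p>',
    '<p>No secret here</p>',
    '<p id="secret-word">second</p>',
]

words = []
for html in pages:
    page = BeautifulSoup(html, 'html.parser')
    secret = page.find('p', id='secret-word')
    # find returns None when the element is absent, so check before reading it
    if secret is not None:
        words.append(secret.get_text(strip=True))

print(' '.join(words))
```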


In [None]:
pass