<a href="https://colab.research.google.com/github/steven1174/Web_Scraping_with_Python/blob/main/C2_Advance_HTML_Parsing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Libraries

In [1]:
%%capture
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Code 1

In [2]:
url = 'https://www.pythonscraping.com/pages/warandpeace.html'

In [3]:
html = urlopen(url)  # reuse the url defined in the previous cell
bsObj = BeautifulSoup(html, "html.parser")  # name the parser explicitly

In [4]:
nameList = bsObj.findAll("span",{"class":"green"})

for name in nameList:
  print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


**When to *get_text()* and When to Preserve Tags**

.get_text() strips all tags from the document you are working
with and returns a string containing only the text. For example, if
you are working with a large block of text that contains many
hyperlinks, paragraphs, and other tags, all of those will be stripped
away, leaving a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for
in a BeautifulSoup object than in a block of text. Calling
.get_text() should always be the last thing you do, immediately
before you print, store, or manipulate your final data. In
general, you should try to preserve the tag structure of a document
as long as possible.
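
As a small illustration of this advice, the sketch below uses a made-up snippet (`sample_html` and its links are assumptions, not part of the scraped page). It collects the link targets while the tags are still present, then shows that the same information is gone once `get_text()` has run. It uses `find_all`, the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a scraped page.
sample_html = '<p>Read <a href="/war">War</a> and <a href="/peace">Peace</a>.</p>'
soup = BeautifulSoup(sample_html, "html.parser")
paragraph = soup.find("p")

# While the tags are intact, the links are easy to pick out.
link_targets = [a["href"] for a in paragraph.find_all("a")]

# get_text() flattens the paragraph to a plain string; the hrefs are lost.
plain_text = paragraph.get_text()

print(link_targets)
print(plain_text)
```

Extracting `link_targets` after calling `get_text()` would require parsing the page again, which is why the text suggests leaving that call until the end.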

# Code 2




 

*   findAll(tag, attributes, recursive, text, limit, keywords)
*   find(tag, attributes, recursive, text, keywords)

Some examples of how to use the code: 


1.   findAll({"h1","h2","h3","h4","h5","h6"})
2.   findAll("span", {"class":{"green", "red"}})
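
The two calls above can be tried on a small document. This is a minimal sketch, assuming a made-up `sample_html`; it passes lists where the examples use sets and also shows the `limit` argument. `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the colored spans in the page above.
sample_html = """
<h1>Title</h1>
<h2>Subtitle</h2>
<p><span class="green">Anna</span> met <span class="red">the prince</span>
and <span class="blue">the baron</span>.</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A collection of tag names matches any one of them.
headings = [tag.name for tag in soup.find_all(["h1", "h2", "h3"])]

# A collection of attribute values matches any one of them.
colored = [s.get_text() for s in soup.find_all("span", {"class": ["green", "red"]})]

# limit caps the number of results returned.
first_span = soup.find_all("span", limit=1)

print(headings, colored, len(first_span))
```

Note that writing the same dictionary key twice, as in `{"class": "green", "class": "red"}`, keeps only the last value, so the collection of values has to sit under a single `"class"` key.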

In [5]:
nameList = bsObj.findAll(text="the prince")
print(nameList)

['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']


In [6]:
allText = bsObj.findAll(id="text")
print(allText[0].get_text())


"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news."

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite.

All her invitations without exception, written in French, and
delivered by a scarlet-liveri

**A Caveat to the keyword Argument**
<body>
<p>The keyword argument can be very helpful in some situations.
However, it is technically redundant as a BeautifulSoup feature.
Keep in mind that anything that can be done with keyword can also
be accomplished using techniques we will discuss later in this chapter
(see Regular Expressions and Lambda Expressions).</p>

<p>For instance, the following two lines are equivalent:</p>
<ul>
  <li><code>bsObj.findAll(id="text")</code></li>
  <li><code>bsObj.findAll("", {"id":"text"})</code></li>
</ul>
<p>In addition, you might occasionally run into problems using keyword
arguments, most notably when searching for elements by their class
attribute, because class is a reserved word in Python and cannot be used
as a variable or argument name (no relation to the BeautifulSoup.findAll()
keyword argument, previously discussed). For example, the following call
raises a syntax error because of the nonstandard use of class:</p>
<ul>
  <li><code>bsObj.findAll(class="green")</code></li>
</ul>
<p>Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:</p>
<ul>
  <li><code>bsObj.findAll(class_="green")</code></li>
</ul>
<p>Alternatively, you can enclose class in quotes:</p>
<ul>
  <li><code>bsObj.findAll("", {"class":"green"})</code></li>
</ul>
</body>
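
The equivalences described in this caveat can be checked on a small document. The sketch below assumes a made-up `sample_html` and passes the dictionary through the `attrs` parameter rather than positionally.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = (
    '<div id="text"><span class="green">Anna</span> '
    '<span class="red">Vasili</span></div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# Keyword form and dictionary form select the same element.
by_keyword = soup.find_all(id="text")
by_dict = soup.find_all(attrs={"id": "text"})

# class_ (with the trailing underscore) avoids the reserved word.
green_keyword = [s.get_text() for s in soup.find_all(class_="green")]
green_dict = [s.get_text() for s in soup.find_all("span", {"class": "green"})]

print(by_keyword == by_dict, green_keyword, green_dict)
```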

# Code 3

## Dealing with children and other descendants

*   .descendants
*   .children
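
Both are attributes rather than method calls. As a rough sketch of the difference, assuming a tiny made-up table, `.children` yields only the elements directly inside a tag, while `.descendants` walks every level below it (text nodes included, which are filtered out here).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-row table for illustration.
sample_html = "<table><tr><td>Basket</td><td>$15.00</td></tr></table>"
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# Direct children only: the single <tr>.
child_names = [c.name for c in table.children]

# Every tag at any depth below the table: the <tr> and both <td> cells.
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```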

In [7]:
url = "http://www.pythonscraping.com/pages/page3.html"

In [8]:
html = urlopen(url)
bsObj = BeautifulSoup(html, "html.parser")

for child in bsObj.find("table",{"id":"giftList"}).children:
  print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


## Dealing with siblings

*   next_sibling
*   next_siblings
*   previous_sibling
*   previous_siblings
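
A minimal sketch of these attributes, assuming a made-up three-row table with no whitespace between rows: an element is never its own sibling, so iterating `next_siblings` from the header row skips the header itself.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table; the ids are invented for this example.
sample_html = (
    "<table><tr id='header'></tr><tr id='gift1'></tr><tr id='gift2'></tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
header = soup.find("tr", {"id": "header"})

# Rows after the header; the header row itself is not included.
following_ids = [row["id"] for row in header.next_siblings if isinstance(row, Tag)]

# Stepping backward one sibling from the last row.
last_row = soup.find("tr", {"id": "gift2"})
previous_id = last_row.previous_sibling["id"]

print(following_ids, previous_id)
```

On real pages the siblings often include whitespace text nodes between tags, which is why the loop filters on `Tag`.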

In [9]:
html = urlopen(url)
bsObj = BeautifulSoup(html, "html.parser")

for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
  print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

## Dealing with your parents

* .parent
* .parents

In [21]:
html = urlopen(url)
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())


$15.00
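
The chained call above can be broken into steps. This sketch assumes a made-up single-row table without whitespace between cells (`sample_html` and the file name `img1.jpg` are placeholders), so that `previous_sibling` lands directly on the price cell.

```python
from bs4 import BeautifulSoup

# Hypothetical row: title, price, image.
sample_html = (
    "<table><tr>"
    "<td>Vegetable Basket</td><td>$15.00</td>"
    '<td><img src="img1.jpg"/></td>'
    "</tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

image = soup.find("img", {"src": "img1.jpg"})
image_cell = image.parent                 # the <td> that wraps the image
price_cell = image_cell.previous_sibling  # the <td> just before it
price = price_cell.get_text()

# .parents walks upward through every ancestor of the image.
ancestor_names = [p.name for p in image.parents]

print(price)
print(ancestor_names)
```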



## Regular Expressions

Example of ***regex:*** aa*bbbbb(cc)*(d | )

* <b><i>aa* :</i></b>
<p>The letter a is written, followed by a* (read as a star) which means “any number of a’s, including 0 of them.” In this way, we can guarantee that the letter a is written at least once.</p>

* <b><i>bbbbb :</i></b>
<p>No special effects here, just five b’s in a row.</p>

* <b><i>(cc)* :</i></b>
<p>Any even number of things can be grouped into pairs, so in order to enforce this rule about even things, you can write two c’s, surround them in parentheses, and write an asterisk after it, meaning that you can have any number of pairs of c’s (note that this can mean 0 pairs, as well).</p>

* <b><i>(d | ) :</i></b>
<p>Adding a bar in the middle of two expressions means that it can be “this thing or that thing.” In this case, we are saying “add a d followed by a space or just add a space without a d.” In this way we can guarantee that there is, at most, one d, followed by a space, completing the string.</p>
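
The four parts above can be checked with Python's built-in `re` module. This is a small sketch with made-up test strings; `fullmatch` is used so that the whole string, trailing space included, has to fit the pattern.

```python
import re

# The pattern from the text; the spaces inside the last group are literal.
pattern = re.compile(r"aa*bbbbb(cc)*(d | )")

# Hypothetical candidates chosen to exercise each rule.
candidates = [
    "aaaabbbbbccccd ",  # several a's, five b's, two pairs of c's, "d "
    "abbbbb ",          # one a, five b's, no c's, just a space
    "bbbbb ",           # no leading a
    "abbbbbccc ",       # odd number of c's
    "abbbbbd",          # missing the trailing space
]
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for text, ok in results.items():
    print(repr(text), ok)
```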


Example of ***regex:*** Email Addresses [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)

1.   ***Rule 1:***

* The first part of an email address contains at least one of the following: uppercase letters, lowercase letters, the numbers 0-9, periods (.), plus signs (+), or underscores (_).

* **[A-Za-z0-9\._+]+**
* The regular expression shorthand is pretty smart. For
example, it knows that “A-Z” means “any uppercase
letter, A through Z.” By putting all these possible
sequences and symbols in brackets (as opposed to
parentheses) we are saying “this symbol can be any one
of these things we’ve listed in the brackets.” Note also
that the + sign means “these characters can occur as
many times as they want to, but must occur at least
once.”

2.   ***Rule 2:***

* After this, the email address contains the @ symbol.

* **@**
* This is fairly straightforward: the @ symbol must occur in
the middle, and it must occur exactly once.

3.   ***Rule 3:***

* The email address then must contain
at least one uppercase or lowercase
letter.

* **[A-Za-z]+**
* We may use only letters in the first part of the domain
name, after the @ symbol. Also, there must be at least
one character.

4.   ***Rule 4:***

* This is followed by a period (.).

* **\.**
* You must include a period (.) before the top-level domain.

5.   ***Rule 5:***

* Finally, the email address ends with
com, org, edu, or net (in reality, there
are many possible top-level
domains, but, these four should
suffice for the sake of example).

* **(com|org|edu|net)**
* This lists the possible sequences of letters that can occur
after the period in the second part of an email address.
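
The five rules can be tried out with Python's built-in `re` module. In this sketch the helper `looks_like_email` and the sample addresses are made up for illustration, and the pattern is the simplified one from the text, not a complete email validator.

```python
import re

# The simplified pattern built up by rules 1 through 5.
email_pattern = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)")

def looks_like_email(text):
    """Return True if the whole string fits the simplified pattern."""
    return email_pattern.fullmatch(text) is not None

print(looks_like_email("first.last+tag@example.org"))  # fits all five rules
print(looks_like_email("user@example.io"))             # .io is not in rule 5
print(looks_like_email("@example.com"))                # rule 1 needs a character
print(looks_like_email("user@sub.example.com"))        # rule 3 allows letters only

# match() only anchors at the start, so a longer ending slips through.
prefix_only = email_pattern.match("user@example.community") is not None
print(prefix_only, looks_like_email("user@example.community"))
```

Using `fullmatch` here, rather than `match`, is a design choice: it rejects strings that merely start with something email-shaped.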

Commonly used regular expression symbols:

1. <b><i>* :</i></b>
<p>Matches the preceding character, subexpression, or bracketed character,
0 or more times</p>
<p>a*b*</p>
<p>aaaaaaaa, aaabbbbb, bbbbbb</p>

2. <b><i>+ :</i></b>
<p>Matches the preceding character, subexpression, or bracketed character,
1 or more times</p>
<p>a+b+</p>
<p>aaaaaaaab,
aaabbbbb, abbbbbb</p>

3. <b><i>[] :</i></b>
<p>Matches any character within the brackets (i.e., “Pick any one of these
things”)</p>
<p>[A-Z]*</p>
<p>APPLE, CAPITALS, QWERTY</p>

4. <b><i>() :</i></b>
<p>A grouped subexpression (these are evaluated first, in the “order of
operations” of regular expressions)</p>
<p>(a*b)*</p>
<p>aaabaab, abaaab, ababaaaaab</p>

5. <b><i>{m, n} :</i></b>
<p>Matches the preceding character, subexpression, or bracketed character
between m and n times (inclusive)</p>
<p>a{2,3}b{2,3}</p>
<p>aabbb, aaabbb, aabb</p>

6. <b><i>[^] :</i></b>
<p>Matches any single character that is not in the brackets</p>
<p>[^A-Z]*</p>
<p>apple, lowercase, qwerty</p>

7. <b><i>| :</i></b>
<p>Matches any character, string of characters, or subexpression, separated
by the “|” (note that this is a vertical bar, or “pipe,” not a capital “i”)</p>
<p>b(a|i|e)d</p>
<p>bad, bid, bed</p>

8. <b><i>. :</i></b>
<p>Matches any single character (including symbols, numbers, a space, etc.); by default it does not match a newline</p>
<p>b.d</p>
<p>bad, bzd, b$d, b d</p>

9. <b><i>^ :</i></b>
<p>Indicates that a character or subexpression occurs at the beginning of a
string</p>
<p>^a</p>
<p>apple, asdf, a</p>

10. <b><i>\ :</i></b>
<p>An escape character (this allows you to use “special” characters as their
literal meaning)</p>
<p>\. \| \\</p>
<p>. | \</p>

11. <b><i>\$ :</i></b>
<p>Often used at the end of a regular expression, it means “match this up
to the end of the string.” Without it, every regular expression has a
de facto “.*” at the end of it, accepting strings where only the first part
of the string matches. This can be thought of as analogous to the ^
symbol.</p>
<p>[A-Z]*[a-z]*$</p>
<p>ABCabc, zzzyx, Bob</p>

12. <b><i>?! :</i></b>
<p>“Does not contain.” This odd pairing of symbols, immediately preceding
a character (or regular expression), indicates that that character should
not be found in that specific place in the larger string. This can be tricky
to use; after all, the character might be found in a different part of the
string. If trying to eliminate a character entirely, use in conjunction with
a ^ and \$ at either end.</p>
<p>^((?![A-Z]).)*\$</p>
<p>no-caps-here, $ymb0ls a4e f!ne</p>
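
A few of the symbols listed above, tried with Python's built-in `re` module. This is a sketch only; the test strings are partly taken from the list and partly made up.

```python
import re

checks = {
    # * allows zero occurrences, + requires at least one.
    "star_zero_a": bool(re.fullmatch(r"a*b*", "bbbbbb")),
    "plus_zero_a": bool(re.fullmatch(r"a+b+", "bbbbbb")),
    # {m,n} bounds the repetition count.
    "braces_ok": bool(re.fullmatch(r"a{2,3}b{2,3}", "aabbb")),
    "braces_short": bool(re.fullmatch(r"a{2,3}b{2,3}", "ab")),
    # | picks one alternative; . stands for any single character.
    "pipe": bool(re.fullmatch(r"b(a|i|e)d", "bid")),
    "dot_space": bool(re.fullmatch(r"b.d", "b d")),
    # Without $, match() accepts a string whose beginning fits.
    "no_dollar": bool(re.match(r"[A-Z]+", "ABC123")),
    "with_dollar": bool(re.match(r"[A-Z]+$", "ABC123")),
    # ?! combined with ^ and $ rules out a character everywhere.
    "no_caps": bool(re.match(r"^((?![A-Z]).)*$", "no-caps-here")),
    "has_caps": bool(re.match(r"^((?![A-Z]).)*$", "Some caps")),
}

for name, result in checks.items():
    print(name, result)
```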
 


In [23]:
import re

In [25]:
html = urlopen(url)

bsObj = BeautifulSoup(html, "html.parser")
# Raw string so the backslashes reach the regex engine unchanged.
images = bsObj.findAll("img", {"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})

for image in images:
  print(image["src"])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
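
The same filter can be tried without a network request. This sketch assumes a made-up `sample_html` with four invented image paths, only one of which fits the pattern; `find_all` is the current spelling of `findAll`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; none of these paths come from the real page.
sample_html = """
<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.png"/>
<img src="../img/other/img3.jpg"/>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A compiled pattern can be used wherever a plain attribute value is accepted.
gift_pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
gift_sources = [img["src"] for img in soup.find_all("img", {"src": gift_pattern})]

print(gift_sources)
```

Matching on the path, rather than on the tag's position in the page, keeps the selection stable if the layout changes.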
