## Introduction

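- #### The `bs4` library parses different page types:
  - _BeautifulSoup_ for _HTML_ pages
  - _BeautifulSoup_ with an XML parser (for example `BeautifulSoup(markup, "xml")`, which needs `lxml`) for _XML_ pages; the separate _BeautifulStoneSoup_ class comes from Beautiful Soup 3 and is deprecated in `bs4`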

- #### Combined with the `requests` module, it is commonly used as a web-scraping library.
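
As a rough illustration of how the two libraries fit together, the sketch below downloads a page with `requests` and hands the HTML to `BeautifulSoup`. The helper names `title_from_html` and `fetch_title` and the example URL are made up for this illustration, and `"html.parser"` is only one possible parser choice.

```python
import requests as rq
from bs4 import BeautifulSoup


def title_from_html(html_text):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.title.string if soup.title is not None else None


def fetch_title(url):
    """Download a page and return its title (needs network access)."""
    response = rq.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    return title_from_html(response.text)


# Offline check on a literal string; fetch_title("https://example.com")
# would be the online version of the same call.
print(title_from_html("<html><head><title>Demo</title></head></html>"))
```

Keeping the parsing step separate from the download makes it possible to try the parsing logic on saved or hand-written HTML first.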

## Imports

In [12]:
from bs4 import BeautifulSoup 
import re
import requests as rq

## Parsing

#### Parsing
  - #### _BeautifulSoup_ acts as a parser for HTML and XML pages. __It can even use heuristics to work out a sensible structure for a malformed document__. With HTML, soup sometimes does not know which tags are self-closing.

  - #### Only Unicode strings are stored in the data structure created by soup.

  - #### When creating the soup, we can also tell it which encoding the source document uses.
    > soup = BeautifulSoup( site, from_encoding="euc-jp" )

  - #### Other options, such as smart-quote conversion, can be turned off.

  - #### A parser object (soup) is a nested, connected data structure corresponding to the structure of an XML/HTML document. The parser object contains two other types of objects: Tag objects, which correspond to tags like the `<TITLE>` tag and the `<B>` tags; and NavigableString objects which correspond to strings like `Page title` and `This is paragraph`. 
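
The points above can be checked on a tiny document. This is a minimal sketch, assuming the built-in `"html.parser"` backend; the markup and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

markup = (
    "<html><head><title>Page title</title></head>"
    "<body><p>This is paragraph <b>one</b>.</p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

title_tag = soup.title          # a Tag object
title_text = title_tag.string   # a NavigableString object

print(isinstance(title_tag, Tag))
print(isinstance(title_text, NavigableString))
print(isinstance(title_text, str))  # NavigableString is a Unicode str subclass

# from_encoding only matters when the markup is given as bytes
raw = "<p>日本語</p>".encode("euc-jp")
soup_jp = BeautifulSoup(raw, "html.parser", from_encoding="euc-jp")
print(soup_jp.p.string)
```

Whatever the source encoding, the strings that come back out of the tree are ordinary Unicode strings.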

In [13]:
doc=[ '<html><head><title>TITLE OF THE PAGE</title></head>', 
      '<body> <div class="container" id="main"> <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div> <div class="info" id="head2"> <p class="para" id="para1" align="center">Here is the typical which might be provided in the page about the site <link rel="icon" href="#" sizes="414X414" /> </p>',
      '<p class="para" id="para2"> Here is some more info about the site. </p> </div> </div>',
      '</html>'
]

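# Parsing the html document. No parser is named here, so bs4 picks the best one
# installed; passing one explicitly, e.g. BeautifulSoup(markup, "html.parser"),
# keeps the result reproducible across machines.
self_soup = BeautifulSoup(''.join(doc))
print(self_soup.prettify())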

<html>
 <head>
  <title>
   TITLE OF THE PAGE
  </title>
 </head>
 <body>
  <div class="container" id="main">
   <div class="head" id="head1">
    <h1>
     #RANDOM_TITLE
    </h1>
   </div>
   <div class="info" id="head2">
    <p align="center" class="para" id="para1">
     Here is the typical which might be provided in the page about the site
     <link href="#" rel="icon" sizes="414X414"/>
    </p>
    <p class="para" id="para2">
     Here is some more info about the site.
    </p>
   </div>
  </div>
 </body>
</html>



## Navigation

#### Generally, __navigating down__ can be done in 2 ways:
  - `soup.<tag_name>`, i.e. directly referencing the element's name.
    -  Tag names can be chained like this: `div.div.a`. This returns the first such tag, or `None` if there is no such tag.

  - `soup.contents`/`soup.children`/`soup.descendants`, i.e. referencing the element's children and descendants directly.
    - `.contents` returns a list of the tag's direct children, both tags and strings.
    - `.children` returns an iterator over the same direct children.
    - `.descendants` returns a generator over all descendants recursively: the children, their children, and so on.

  - Apart from this, `.strings`/`.stripped_strings` can be used to get the strings inside the tag.
    - `.strings` returns a generator which can be used in a `for` loop; `.stripped_strings` does the same with surrounding whitespace removed and whitespace-only strings skipped.

In [14]:
# Navigating with the help of tags 
first_div = self_soup.div
print( f"1st div tag:\n\n{first_div}\n" )

first_a = self_soup.a
print(f"Since there is no <a>, there is {first_a}.", end="\n\n")

# Navigating with the help of .contents and .children
con1 = first_div.contents
print(f"The contents of the first <div> tag are:\n{con1}", end="\n\n")

# Giving the list iterator of the children of the tag
child1 = first_div.children
print(f"The list iterator is {child1}.", end="\n\n")

# Giving the descendants of the div tag
desc = first_div.descendants
print(f"The descendants of the nodes are {desc}", end="\n\n")

1st div tag:

<div class="container" id="main"> <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div> <div class="info" id="head2"> <p align="center" class="para" id="para1">Here is the typical which might be provided in the page about the site <link href="#" rel="icon" sizes="414X414"/> </p><p class="para" id="para2"> Here is some more info about the site. </p> </div> </div>

Since there is no <a>, there is None.

The contents of the first <div> tag are:
[' ', <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div>, ' ', <div class="info" id="head2"> <p align="center" class="para" id="para1">Here is the typical which might be provided in the page about the site <link href="#" rel="icon" sizes="414X414"/> </p><p class="para" id="para2"> Here is some more info about the site. </p> </div>, ' ']

The list iterator is <list_iterator object at 0x7f69284879d0>.

The descendants of the nodes are <generator object Tag.descendants at 0x7f69283af2a0>
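
The iterator and generator objects printed above only show their contents once they are consumed. The sketch below does that on a smaller, made-up snippet (not the `doc` used in this notebook), assuming the `"html.parser"` backend.

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<div id="main"><div id="head1"><h1>Title</h1></div>'
    "<p> Some <b>bold</b> text </p></div>"
)
soup = BeautifulSoup(html, "html.parser")
main = soup.div

# Direct children only: the inner <div> and the <p>
child_names = [child.name for child in main.children]

# All descendants, keeping only the tags (strings are skipped)
descendant_names = [d.name for d in main.descendants if isinstance(d, Tag)]

# Every text fragment under the tag, with whitespace stripped
texts = list(main.stripped_strings)

print(len(main.contents), child_names)
print(descendant_names)
print(texts)
```

Wrapping the iterators in a list or a comprehension is convenient for inspection; on large documents a plain `for` loop avoids building the whole list in memory.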



#### Navigating up can also be done with the attributes `parent` and `parents`.
  - `elm.parent` gives the immediate parent of the element.
  - `elm.parents` gives an iterator over all the ancestors of the element, from the closest one up to the document itself.

In [15]:
# Direct parent of an element: the inner <div class="head"> sits inside the container <div>
parent_elm = self_soup.body.div.div.parent
print(f"{parent_elm} is the immediate parent of the element div.", end="\n\n")

# Iterator over the parents of the element, collected as a list of tag names
parent_names = [parent.name for parent in self_soup.body.div.div.parents]
print(f"This is the list of the parents of the element: {parent_names}", end="\n\n")

<div class="container" id="main"> <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div> <div class="info" id="head2"> <p align="center" class="para" id="para1">Here is the typical which might be provided in the page about the site <link href="#" rel="icon" sizes="414X414"/> </p><p class="para" id="para2"> Here is some more info about the site. </p> </div> </div> is the immediate parent of the element div.

This is the list of the parents of the element: ['div', 'body', 'html', '[document]']



#### Navigating sideways can be done with the attributes `next_sibling` and `previous_sibling` and their plural counterparts `next_siblings` and `previous_siblings`.
  - `next_sibling` gives the next element on the same level as the current one. Note that it can also return the strings, such as spaces and newlines, that separate the tags. `previous_sibling` works the same way but gives the previous sibling.
  
  - `next_siblings` and `previous_siblings` simply return generator objects over all the next and previous siblings respectively.

  - Two other attributes are `next_element` and `previous_element`, along with their plural counterparts `next_elements` and `previous_elements`. They follow the order in which the document was parsed, so `next_element` of a tag is usually its first child rather than its next sibling.

In [16]:
# Giving the next sibling
nextsibling_elm= self_soup.body.div.div.next_sibling
print(f"{nextsibling_elm} is the next sibling of the element.", end="\n\n")

# Giving the previous siblings
prevsiblings_elm = self_soup.html.body.div.div.previous_siblings
print(f"{prevsiblings_elm} are the previous siblings of the element.")

  is the next sibling of the element.

<generator object PageElement.previous_siblings at 0x7f69283119c0> are the previous siblings of the element.
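
Because whitespace between tags counts as a sibling, it can help to compare the raw attributes with a search that skips strings. This is a small sketch on an invented snippet, assuming the `"html.parser"` backend.

```python
from bs4 import BeautifulSoup, Tag

html = '<div><p id="a">first <b>bold</b></p> <p id="b">second</p></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="a")

raw_sibling = first.next_sibling            # the ' ' string between the two <p> tags
tag_sibling = first.find_next_sibling("p")  # skips strings and lands on <p id="b">
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
parsed_next = first.next_element            # 'first ', the first thing parsed inside <p id="a">

print(repr(raw_sibling))
print(tag_sibling)
print(later_ids)
print(repr(parsed_next))
```

Here `next_sibling` stays on the same level of the tree, while `next_element` steps into the tag, which is the difference between the two families of attributes.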


## Searching the tree

#### Soup offers filters, methods and CSS selectors which allow us to search the tree.

  - ##### Soup offers some filters which can be used to search for elements. The filters are as follows:
    - _String_: Pass a string and soup will search for tags whose name matches it exactly.
    - _Regular Expressions_: A compiled regular expression is matched against tag names. This allows us, for example, to find the tags whose names start or end with a given letter.
    - _True_: This matches every tag, but not the text strings.
    - _List_: A list of filters; a tag matches if it matches any item in the list.
    - _function_: A function that takes a tag and returns `True` or `False`, which lets us filter tags by their attributes or content.

    > `Note`: All of the above filters can be used with the majority of the methods.
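
The five filter kinds can be compared side by side with `find_all`. This is a minimal sketch on a made-up document, assuming the `"html.parser"` backend; the function name `has_class_but_no_id` is just an illustrative choice.

```python
import re
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head>"
    "<body><h1 id='top'>H</h1><p class='para'>one</p><p>two</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Function filter: keep tags that carry a class but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = [t.name for t in soup.find_all("p")]                 # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^h"))]     # names starting with "h"
by_list = [t.name for t in soup.find_all(["title", "h1"])]       # any name in the list
by_true = [t.name for t in soup.find_all(True)]                  # every tag
by_function = [t.get_text() for t in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_true, by_function, sep="\n")
```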

  - ##### The most important methods to find elements are `find` and `find_all`. There are 10 other methods which can be used. Their arguments are similar to those of find and find_all.
    
    - `find()`: It is used to find the first element with the given properties. It has the following arguments:
      - __name__: soup gets the first tag with the given tag name.
        >soup.find("head")

      - __\*\*kwargs__: soup matches each keyword argument against the tag's attribute of the same name. `class_` is used in place of `class`, because `class` is a reserved word in Python.
        >soup.find(id="first")
      
      - _attrs_: Some attribute names, such as `data-foo`, cannot be written as Python keyword arguments. They can be passed in the form of a dictionary instead.
        >soup.find(attrs={'data-foo':'VALUE'})
      
      - _string_: This searches for strings instead of tags.
        >soup.find( string="Jameieson" )

      - _recursive_: Setting this to `False` limits the search to the immediate children of the element.
        >soup.find("html", recursive=False)
        
      - `NOTE`: The above arguments also support the filters which were discussed previously.

    - `find_all()`: This gives all the matching results. It has an additional argument __limit__ which caps the number of results returned.
      - Calling a soup or tag object like a function is a shortcut for `find_all()`: `elm('a')`/`soup('a')`

    - The other 10 methods are _find\_parents_, _find\_next\_siblings_, _find\_previous\_siblings_, _find\_all\_next_, _find\_all\_previous_ along with their single-result variants _find\_parent_, _find\_next\_sibling_, _find\_previous\_sibling_, _find\_next_ and _find\_previous_. Most of them support the arguments of the _find_ and _find\_all_ methods.

In [33]:
# Finding the head tag by directly inputting the head as a string in the method
tag_head = self_soup.find("head")
print(f"The head tag is:\n{tag_head}",end="\n\n")

# Finding the first div tag with the class head
tag_div1 = self_soup.find("div", class_="head")
print(f"The first div tag with the class 'head' is:\n{tag_div1}", end="\n\n")

# Finding the first div tag with the class head and the id head1
tag_div2 = self_soup.find("div", class_="head", id="head1")
print(f"The first div tag with the class 'head' and id 'head1' is:\n{tag_div2}", end="\n\n")

# Finding the string "#RANDOM_TITLE"
str_find= self_soup.find(string="#RANDOM_TITLE")
print(f"The string has been found as:\n{str_find}", end="\n\n")

# Using filters
str_find1= self_soup.find(string=re.compile('[Ss]ite'))
print(f"Regular expressions can be used to find strings such as this: {str_find1}", end="\n\n")

str_find2= self_soup.find("div", class_=True)
print(f"By using True as a value for class_, we get the first div tag which has a class attribute:\n{str_find2}", end="\n\n")

# Apart from this, we can also define functions which can be used to find some element.

The head tag is:
<head><title>TITLE OF THE PAGE</title></head>

The first div tag with the class 'head' is:
<div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div>

The first div tag with the class 'head' and id 'head1' is:
<div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div>

The string has been found as:
#RANDOM_TITLE

Regular expressions can be used to find strings such as this: Here is the typical which might be provided in the page about the site 

By using True as a value for class_, we get the first div tag which has a class attribute:
<div class="container" id="main"> <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div> <div class="info" id="head2"> <p align="center" class="para" id="para1">Here is the typical which might be provided in the page about the site <link href="#" rel="icon" sizes="414X414"/> </p><p class="para" id="para2"> Here is some more info about the site. </p> </div> </div>
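
The cell above only exercises `find`. As a complement, here is a minimal sketch of `limit`, `attrs`, `recursive`, the call shortcut and two of the relative search methods. It uses an invented document and assumes the `"html.parser"` backend.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    '<div data-foo="VALUE">'
    '<p class="para" id="p1">one</p>'
    '<p class="para" id="p2">two</p>'
    '<p class="para" id="p3">three</p>'
    "</div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after two matches
first_two = [p["id"] for p in soup.find_all("p", class_="para", limit=2)]

# attribute names that are not valid Python identifiers go through attrs
custom = soup.find(attrs={"data-foo": "VALUE"})

# recursive=False looks at direct children only
not_direct = soup.find("title", recursive=False)       # <title> is not a direct child of the soup
direct = soup.head.find("title", recursive=False)      # but it is a direct child of <head>

# calling the soup is shorthand for find_all
same_result = soup("p") == soup.find_all("p")

# searching relative to an element
p3 = soup.find(id="p3")
previous_ids = [p["id"] for p in p3.find_previous_siblings("p")]
enclosing_div = p3.find_parent("div")

print(first_two, custom.name, not_direct, direct, same_result, previous_ids, sep="\n")
```

Note that `find_previous_siblings` walks backwards from the element, so the nearest sibling comes first in the result.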



- ##### Using CSS selectors
  - The actual selector implementation is provided by the _soupsieve_ package.
  
  - We can use the `soup.css.select` method and pass it selectors based on tag name, tag nesting, id, class, child tags, sibling tags, attribute values and many more.
  
  - Advanced features of Soup Sieve offer other APIs such as _match_, _escape_, _filter_, _closest_. They can be found [here](https://facelessuser.github.io/soupsieve/).
  
  - The description of major methods is given here:

    - By tag name, nested tags, direct child tags, sibling tags:
      - `soup.css.select('a')`
      - `soup.css.select('html body div a')`
      - `soup.css.select('head > title')`
      - `soup.css.select('#head1 ~ .para')` [This gives all the later siblings of `#head1` with the class name para; use + instead of ~ to get only the sibling that immediately follows.]
    
    - By id, class
      - `soup.css.select('#head1')`
      - `soup.css.select('.para')`
    
    - By attribute value of the tag, or by whether a tag with a certain attribute is present at all.
      - `soup.css.select('link[href="#"]')`
      - `soup.css.select('link[href]')`
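
The selectors listed above can be tried on a small, made-up document. This sketch uses `soup.select`, which accepts the same selector strings as `soup.css.select` and is also available in `bs4` releases that predate the `.css` property; the `"html.parser"` backend is assumed.

```python
from bs4 import BeautifulSoup

html = (
    '<html><head><title>T</title><link rel="icon" href="#"></head>'
    '<body><div id="main"><h1 id="head1">H</h1>'
    '<p class="para" id="para1">one</p>'
    '<p class="para" id="para2">two</p>'
    "</div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

direct_child = [t.name for t in soup.select("head > title")]          # direct child
nested = [t["id"] for t in soup.select("html body div p")]            # nested tags
all_later = [t["id"] for t in soup.select("#head1 ~ .para")]          # all later siblings
next_only = [t["id"] for t in soup.select("#head1 + .para")]          # only the adjacent sibling
by_value = [t["rel"] for t in soup.select('link[href="#"]')]          # attribute equals a value
by_presence = len(soup.select("link[href]"))                          # attribute is present
single = soup.select_one("#para2").get_text()                         # first match only

print(direct_child, nested, all_later, next_only, by_value, by_presence, single, sep="\n")
```

`rel` comes back as a list because Beautiful Soup treats it as a multi-valued attribute in HTML.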