# Theory

- ### Parsing
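  - #### The `bs4` library has historically offered different classes for different page types:
    - ##### _BeautifulSoup_ for _HTML_ pages
    - ##### _BeautifulStoneSoup_ for _XML_ pages (deprecated in `bs4`; XML is now parsed with `BeautifulSoup(markup, "xml")`, which requires `lxml`)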

  - #### _BeautifulSoup_ acts as a parser for HTML and XML pages. __It can even use heuristics to recover a sensible tree from malformed markup__. With HTML, however, the soup does not always know which tags are self-closing.

  - #### Only Unicode strings are stored in the data structure created by the soup.

  - #### When creating the soup, we can also specify the encoding of the source document:
    > soup = BeautifulSoup(site, from_encoding="euc-jp")

  - #### Other options, such as smart-quote conversion, can be turned off.

  - #### A parser object (soup) is a nested, connected data structure corresponding to the structure of an XML/HTML document. The parser object contains two other types of objects: Tag objects, which correspond to tags like the `<TITLE>` tag and the `<B>` tags; and NavigableString objects which correspond to strings like `Page title` and `This is paragraph`. 
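
As a minimal sketch of the points above, the following builds a soup from EUC-JP bytes and inspects the objects it contains. The sample markup and the variable names (`raw`, `typed_soup`, `title_tag`, and so on) are made up for illustration, and `html.parser` is assumed as the underlying parser.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical sample page, encoded as EUC-JP bytes to mimic a downloaded document.
raw = (
    "<html><head><title>Page title</title></head>"
    "<body><p>日本語 paragraph</p></body></html>"
).encode("euc-jp")

# from_encoding tells the soup how to decode the incoming bytes.
typed_soup = BeautifulSoup(raw, "html.parser", from_encoding="euc-jp")

title_tag = typed_soup.title      # a Tag object
title_text = title_tag.string     # a NavigableString object
para_text = typed_soup.p.string   # decoded text, stored as a Unicode string

print(type(title_tag).__name__)
print(type(title_text).__name__)
print(isinstance(para_text, str))
```

`NavigableString` is a subclass of `str`, so the text can be used wherever an ordinary Python string is expected.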

# Code

In [1]:
from bs4 import BeautifulSoup 
import re
import requests as rq

## Self-made parsing  

In [2]:
doc=[ '<html><head><title>TITLE OF THE PAGE</title></head>', 
      '<body> <div class="container" id="main"> <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div> <div class="info" id="info1"> <p class="para" id="para1" align="center">Here is the typical which might be provided in the page about the site <link rel="icon" href="#" sizes="414X414" /> </p>',
      '<p class="para" id="para2"> Here is some more info about the site. </p> </div> </div>',
      '</html>'
]

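# Parsing the HTML document; naming the parser explicitly avoids the
# "no parser was explicitly specified" warning and keeps results reproducible
self_soup = BeautifulSoup(''.join(doc), 'html.parser')
print(self_soup.prettify())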

<html>
 <head>
  <title>
   TITLE OF THE PAGE
  </title>
 </head>
 <body>
  <div class="container" id="main">
   <div class="head" id="head1">
    <h1>
     #RANDOM_TITLE
    </h1>
   </div>
   <div class="info" id="info1">
    <p align="center" class="para" id="para1">
     Here is the typical which might be provided in the page about the site
     <link href="#" rel="icon" sizes="414X414"/>
    </p>
    <p class="para" id="para2">
     Here is some more info about the site.
    </p>
   </div>
  </div>
 </body>
</html>
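
Note that the `doc` list never closes `<body>`, yet the prettified output contains `</body>`: the soup repaired the markup. The same leniency can be sketched on a smaller fragment. The `broken` string below is a made-up example, and `html.parser` is an assumed parser choice; other parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p>, <b> and <i> are never closed, and <br> is a void tag.
broken = "<p>line one<br>line two <b>bold <i>both"

fixed = BeautifulSoup(broken, "html.parser")

# The serialized tree has every open tag closed.
print(fixed)

# The unclosed <i> is still reachable as an ordinary tag.
print(fixed.i.string)
```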



## Navigation

- #### Generally, navigating down the tree can be done in two ways:
  - `soup.<tag_name>`, i.e. directly referencing the element's name
  - `soup.contents`/`soup.children`, i.e. referencing the element's children

- #### Tag names can be chained directly, as in `div.div.a`. Each step returns the first such tag present, or `None` if there is no such tag.


In [15]:
# Navigating with the help of tags 
first_div = self_soup.div
print( f"1st div tag:\n\n{first_div}\n" )

first_a = self_soup.a
print(f"Since there is no <a>, there is {first_a}.", end="\n\n")

# Navigating with the help of .contents and .children
con1 = first_div.contents
print(f"The contents of the first <div> tag are:\n{con1}")


1st div tag:

<div class="container" id="main"> <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div> <div class="info" id="info1"> <p align="center" class="para" id="para1">Here is the typical which might be provided in the page about the site <link href="#" rel="icon" sizes="414X414"/> </p><p class="para" id="para2"> Here is some more info about the site. </p> </div> </div>

Since there is no <a>, there is None.

The contents of the first <div> tag are:
[' ', <div class="head" id="head1"> <h1>#RANDOM_TITLE</h1> </div>, ' ', <div class="info" id="info1"> <p align="center" class="para" id="para1">Here is the typical which might be provided in the page about the site <link href="#" rel="icon" sizes="414X414"/> </p><p class="para" id="para2"> Here is some more info about the site. </p> </div>, ' ']
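
The cell above uses `.contents` only. As a rough sketch of the remaining navigation options, the block below chains tag names and iterates over `.children`. The `html` string is a trimmed, hypothetical version of the document parsed earlier, and the names `nav_soup`, `child_ids`, and `link_target` are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, trimmed from the document parsed earlier.
html = (
    '<div class="container" id="main">'
    '<div class="head" id="head1"><h1>#RANDOM_TITLE</h1></div>'
    '<div class="info" id="info1"><p class="para" id="para1">First</p>'
    '<p class="para" id="para2">Second</p></div>'
    '</div>'
)
nav_soup = BeautifulSoup(html, "html.parser")

# Chained tag names: each step returns the first matching descendant.
heading = nav_soup.div.div.h1
print(heading.string)

# .contents is a list; .children iterates over the same direct children.
outer = nav_soup.div
child_ids = [child["id"] for child in outer.children if isinstance(child, Tag)]
print(child_ids)

# A missing tag yields None, so check before chaining any further.
link = nav_soup.a
link_target = link.get("href") if link is not None else None
print(link_target)
```

Filtering with `isinstance(child, Tag)` skips whitespace-only strings such as the `' '` entries visible in the `.contents` output above.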
