**HTML parser**

One motivation for introducing the HTML parser is to give you an appreciation for the power of derived classes. The library authors provide a base class, and we customize it through subclassing to meet many different needs.

The html.parser module in Python's standard library is used to parse HTML text, such as the contents of an HTML file.

The module provides a class called HTMLParser that does the work of scanning the text and recognizing its parts.

The parser moves through an HTML file and identifies

- opening tags
- closing tags
- data

The class has methods, which we can refer to as *handlers*,

- handle_starttag
- handle_endtag
- handle_data

for handling each of these possibilities. In the base class they do nothing; we override them to accomplish some task.

In other words, we write our own versions of these handlers in our derived class.

In so doing, we build our own customized parser.
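
As a minimal sketch of this idea, the derived class below records every handler call while parsing a short HTML string rather than a file. The class name EventLogger, the events list, and the sample string are illustrative assumptions, not part of the library.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Hypothetical derived class that records each handler call."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # ignore data that is only whitespace
        if data.strip():
            self.events.append(("data", data.strip()))

logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()
for event in logger.events:
    print(event)
```

The handlers are called in document order, so the recorded list mirrors the nesting of the tags.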

In the following example, we 

- print a message when we encounter a start tag
- print a message when we encounter an end tag

We'll use the parser to parse the short sample.html file.

The parser has a method called **getpos** that returns the current position in the text being parsed as a (line number, column number) tuple. Line numbers start at 1 and column numbers start at 0.

In [2]:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    
    # what to do if start tag encountered
    def handle_starttag(self, tag, attrs):
        pos=self.getpos()
        line=pos[0]
        col=pos[1]
        st="start tag = {:8s} at line {:5d} column {:5d}".format(tag,line,col)
        print(st)
    # what to do if end tag encountered
    def handle_endtag(self, tag):
        pos=self.getpos()
        line=pos[0]
        col=pos[1]
        st="end tag = {:8s} at line {:5d} column {:5d}".format(tag,line,col)
        print(st)
    # how to handle data
    def handle_data(self, data):
        pass

with open("sample.html") as fin:
    text=fin.read()
    print(text)
    print("\n\n")
# instantiate a parser
parser = MyHTMLParser()

# use feed method
parser.feed(text)


<html>
<head>
<title> THIS IS A TITLE </title>
</head>
<body>
<p>
This is a sentence in a paragraph.
Here is another.
</p>
<p>
This is what happens when I put a br tag <br />
in some text.
Here is a table.
</p>
<table>
<th> <td> heading1 </td> <td> heading2 </td> <td> heading3 </td> </th>
<tr> <td> row1column1 </td> <td> row1column2 </td> <td> row1column3 </td> </tr>
<tr> <td> row2column1 </td> <td> row2column2 </td> <td> row2column3 </td> </tr>
<tr> <td> row3column1 </td> <td> row3column2 </td> <td> row3column3 </td> </tr>
</table>
</body>
</html> 



start tag = html     at line     1 column     0
start tag = head     at line     2 column     0
start tag = title    at line     3 column     0
end tag = title    at line     3 column    24
end tag = head     at line     4 column     0
start tag = body     at line     5 column     0
start tag = p        at line     6 column     0
end tag = p        at line     9 column     0
start tag = p        at line    10 column     0
start tag = br 

**Extracting only certain data**

Now suppose we only want to extract data that appears somewhere in a specific context.

For example, maybe we only want the data in the title, i.e. the data enclosed in \<title> ... \</title>.

We need to do something with the handle_data function. 

But how does this function know whether data is enclosed in those tags?

We need to introduce a *flag* that tells the parser whether it is currently between a start title tag and an end title tag.

The flag will be a class attribute.

In [3]:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    inside_title_tag=False
    inside_para_tag=False
    # what to do if start tag encountered
    def handle_starttag(self, tag, attrs):
        if tag=="title":
            MyHTMLParser.inside_title_tag=True
        if tag=="p":
            MyHTMLParser.inside_para_tag=True
    # what to do if end tag encountered
    def handle_endtag(self, tag):
        if tag=="title":
            MyHTMLParser.inside_title_tag=False
        if tag=="p":
            MyHTMLParser.inside_para_tag=False
    # how to handle data
    def handle_data(self, data):
        if MyHTMLParser.inside_title_tag:
            print(data)
        if MyHTMLParser.inside_para_tag:
            print("para data = "+data)

with open("sample.html") as fin:
    text=fin.read()

# instantiate a parser
parser = MyHTMLParser()

# use feed method
parser.feed(text)


 THIS IS A TITLE 
para data = 
This is a sentence in a paragraph.
Here is another.

para data = 
This is what happens when I put a br tag 
para data = 
in some text.
Here is a table.
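
A class attribute is shared by every instance of the class, so two parsers created from the class above would see each other's flags. As a hedged alternative sketch, the flag can be stored on the instance in `__init__`. The names TitleParser, inside_title, and title_parts, and the inline HTML string, are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Hypothetical parser that keeps its flag on the instance."""

    def __init__(self):
        super().__init__()
        self.inside_title = False   # flag belongs to this parser only
        self.title_parts = []       # pieces of text found inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.inside_title = False

    def handle_data(self, data):
        if self.inside_title:
            self.title_parts.append(data)

html_text = ("<html><head><title> THIS IS A TITLE </title></head>"
             "<body><p>text</p></body></html>")
title_parser = TitleParser()
title_parser.feed(html_text)
title_parser.close()

# join the pieces in case the data arrived in more than one chunk
title = "".join(title_parser.title_parts).strip()
print(title)
```

Collecting the text in a list instead of printing it inside the handler also makes the result available to the rest of the program.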



**Extracting table data**

Now we consider extracting the data in a table. We first download a web page that contains a table and save it to a file.

In [7]:
import requests as req
url="https://www.minneapolisfed.org/about-us/monetary-policy/inflation-calculator/consumer-price-index-1913-"
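res=req.get(url)
# stop here if the download failed (for example a 404 or 500 response)
res.raise_for_status()
# save the page so we can parse it without downloading it again
with open("fed.txt","w") as fout:
    fout.write(res.text)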
 

In [8]:
         
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        if tag=="table":
            print("opening table tag found")
    def handle_endtag(self, tag):
        if tag=="table":
            print("ending table tag found")
    def handle_data(self, data):
        pass

with open("fed.txt") as fin:
    text=fin.read()
parser = MyHTMLParser()

# use feed method
parser.feed(text)

opening table tag found
ending table tag found


Evidently, there is only one table in the document.

We create variables that indicate whether we are inside a table and whether we are inside a table row.

In [12]:
         
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # class-level defaults; assigning through self gives each parser its own state
    intable=False
    inrow=False
    def handle_starttag(self, tag, attrs):
        if tag=="table":
            print("opening table tag found")
            self.intable=True
        if tag=="tr":
            print("table row started")
            self.inrow=True
    def handle_endtag(self, tag):
        if tag=="table":
            print("ending table tag found")
            self.intable=False
        # a row is closed by a "tr" end tag
        if tag=="tr":
            self.inrow=False
    def handle_data(self, data):
        pass

with open("fed.txt") as fin:
    text=fin.read()
parser = MyHTMLParser()

# use feed method
parser.feed(text)

opening table tag found
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
table row started
tabl

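Next we also track data cells (td) and header cells (th). The text found in the cells of the current row is collected in the list Lrow, and each finished row is stored in the list L.

In [58]: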
         
from html.parser import HTMLParser
intable=False
inrow=False
incell=False
inheader=False
Lrow=[]
L=[]
class MyHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        global intable, inrow, incell, inheader
        if tag=="table":
            intable=True
        if tag=="tr":
            inrow=True
        if tag=="td":
            incell=True
        if tag=="th":
            inheader=True
    def handle_endtag(self, tag):
        global intable, inrow, incell, inheader
        if tag=="table":
            intable=False
        if tag=="tr":
            inrow=False
        if tag=="td":
            incell=False
        if tag=="th":
            inheader=False
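    def handle_data(self, data):
        global intable, inrow, incell, inheader,L, Lrow
        # collect the text found inside a data cell (td)
        if intable and inrow and incell:
            Lrow.append(data)
        # collect the text found inside a header cell (th)
        if intable and inrow and inheader:
            Lrow.append(data)
        # text between rows (typically whitespace) signals that a row has ended:
        # store the row collected so far and start a new one
        if intable and not inrow:
            L.append(Lrow)
            Lrow=[]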

with open("fed.txt") as fin:
    text=fin.read()
parser = MyHTMLParser()

# use feed method
parser.feed(text)

In [59]:
print(L[0:10])

[[], [], ['Year', 'Annual Average CPI(-U)', 'Annual Percent Change', '\n            (rate of inflation)'], ['\n            ', '1913', '\n            ', '\n            ', '9.9', '\n            ', '\n            ', '\xa0', '\n            '], ['\n            ', '1914', '\n            ', '\n            ', '10.0', '\n            ', '\n            ', '1.3%', '\n            '], ['\n            ', '1915', '\n            ', '\n            ', '10.1', '\n            ', '\n            ', '0.9%', '\n            '], ['\n            ', '1916', '\n            ', '\n            ', '10.9', '\n            ', '\n            ', '7.7%', '\n            '], ['\n            ', '1917', '\n            ', '\n            ', '12.8', '\n            ', '\n            ', '17.8%', '\n            '], ['\n            ', '1918', '\n            ', '\n            ', '15.0', '\n            ', '\n            ', '17.3%', '\n            '], ['\n            ', '1919', '\n            ', '\n            ', '17.3', '\n            ',
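
The output above contains empty rows and whitespace-only entries because every piece of text is appended as it arrives. A possible alternative, sketched below under stated assumptions, keeps the state on the parser instance, builds one string per cell, and stores a row when its closing tr tag is seen. The class name TableParser, its attributes, and the small hand-written table fragment are illustrative; this is not the code used to produce the results in this section.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Hypothetical parser that collects table rows as lists of strings."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_parts = []   # pieces of text in the current cell
        self.row = []          # cells of the current row
        self.rows = []         # all finished rows

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
            self.cell_parts = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            self.in_cell = False
            # join the pieces and collapse runs of whitespace
            self.row.append(" ".join("".join(self.cell_parts).split()))
        elif tag == "tr":
            self.rows.append(self.row)

    def handle_data(self, data):
        if self.in_cell:
            self.cell_parts.append(data)

# hand-written fragment for illustration only
sample_table = """
<table>
  <tr> <th>Year</th> <th>Annual Average CPI(-U)</th> </tr>
  <tr>
    <td>
      1913
    </td>
    <td> 9.9 </td>
  </tr>
  <tr> <td>1914</td> <td>10.0</td> </tr>
</table>
"""
table_parser = TableParser()
table_parser.feed(sample_table)
table_parser.close()
for row in table_parser.rows:
    print(row)
```

Ending a row at its closing tag, rather than at the next piece of text outside a row, avoids relying on whitespace between rows.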

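The header row is L[2] and the data rows follow it. In each complete data row, the year, the CPI, and the percent change are at positions 1, 4, and 7; the other entries are whitespace. We keep only those three entries.

In [62]: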
H=L[2]
L=L[3:]
L=[[x[1],x[4],x[7]] for x in L if len(x)==9]
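
Each row now holds three strings. As a small sketch of a possible next step, the strings can be converted to numbers. The function name to_record and the choice of None for a missing percent change are assumptions; the sample rows are copied from the values shown in this section.

```python
def to_record(row):
    """Convert [year, cpi, percent change] strings to (int, float, float or None)."""
    year, cpi, change = row
    # strip() also removes the non-breaking space '\xa0' used for a missing value
    change = change.strip().rstrip("%")
    return (int(year), float(cpi), float(change) if change else None)

sample_rows = [
    ["1913", "9.9", "\xa0"],
    ["1914", "10.0", "1.3%"],
    ["1921", "17.9", "-10.9%"],
]
records = [to_record(r) for r in sample_rows]
for record in records:
    print(record)
```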

In [63]:
L

[['1916', '10.9', '7.7%'],
 ['1917', '12.8', '17.8%'],
 ['1918', '15.0', '17.3%'],
 ['1919', '17.3', '15.2%'],
 ['1920', '20.0', '15.6%'],
 ['1921', '17.9', '-10.9%'],
 ['1922', '16.8', '-6.2%'],
 ['1923', '17.1', '1.8%'],
 ['1924', '17.1', '0.4%'],
 ['1925', '17.5', '2.4%'],
 ['1926', '17.7', '0.9%'],
 ['1927', '17.4', '-1.9%'],
 ['1928', '17.2', '-1.2%'],
 ['1929', '17.2', '0.0%'],
 ['1930', '16.7', '-2.7%'],
 ['1931', '15.2', '-8.9%'],
 ['1932', '13.6', '-10.3%'],
 ['1933', '12.9', '-5.2%'],
 ['1934', '13.4', '3.5%'],
 ['1935', '13.7', '2.6%'],
 ['1936', '13.9', '1.0%'],
 ['1937', '14.4', '3.7%'],
 ['1938', '14.1', '-2.0%'],
 ['1939', '13.9', '-1.3%'],
 ['1940', '14.0', '0.7%'],
 ['1941', '14.7', '5.1%'],
 ['1942', '16.3', '10.9%'],
 ['1943', '17.3', '6.0%'],
 ['1944', '17.6', '1.6%'],
 ['1945', '18.0', '2.3%'],
 ['1946', '19.5', '8.5%'],
 ['1947', '22.3', '14.4%'],
 ['1948', '24.0', '7.7%'],
 ['1949', '23.8', '-1.0%'],
 ['1950', '24.1', '1.1%'],
 ['1951', '26.0', '7.9%'],
 ['1952',