# BeautifulSoup
Named after a Lewis Carroll poem in Alice's Adventures in Wonderland, Beautiful Soup is a Python library for pulling data out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying the parse tree.
#### To install:
    pip install beautifulsoup4
or  

    pip3 install beautifulsoup4

In [1]:
from bs4 import BeautifulSoup as bs

In [2]:
import urllib.request as r

## NEA Dengue cases
http://www.nea.gov.sg/public-health/dengue/dengue-cases  
Using the browser's inspect tool, locate the number of reported cases in the page's HTML.

In [4]:
url = "http://www.nea.gov.sg/public-health/dengue/dengue-cases"
opener = r.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open(url)
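The same request can also be built with `urllib.request.Request`, which attaches the header to a single request instead of to an opener. This is only a sketch of an alternative, not the approach the notebook uses: the helper names `build_request` and `fetch_html` and the 10-second timeout are assumptions made for illustration.

```python
import urllib.request

URL = "http://www.nea.gov.sg/public-health/dengue/dengue-cases"


def build_request(url):
    """Return a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch_html does.
request = build_request(URL)
print(request.get_header("User-agent"))
```

Calling `fetch_html(URL)` would then return bytes that can be passed to Beautiful Soup in place of `html.read()`.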

### Beautiful Soup object
Generate a Beautiful Soup object. The `html5lib` parser used here is a separate package (`pip install html5lib`):

In [5]:
bsObj = bs(html.read(), "html5lib")
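The choice of parser affects the tree you get back. As a rough illustration on a made-up snippet (not the NEA page), the built-in `html.parser` keeps the markup as written, while `html5lib` repairs it the way a browser would, for example by inserting a `<tbody>` element inside a table. The snippet and variable names below are assumptions for the example, and the `html5lib` part is skipped if that package is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

snippet = "<table><tr><td>39</td></tr></table>"

# The built-in parser does not add elements that are missing from the source.
builtin = BeautifulSoup(snippet, "html.parser")
print(builtin.find("tbody"))

# html5lib follows the HTML5 parsing rules, which wrap table rows in <tbody>.
try:
    browser_like = BeautifulSoup(snippet, "html5lib")
    print(browser_like.find("tbody") is not None)
except FeatureNotFound:
    print("html5lib is not installed")
```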

### Retrieve tags
Get the first `<h1>` heading in the document:

In [6]:
bsObj.h1

<h1 class="title" id="top">
    Dengue Cases
</h1>
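A tag object exposes its name, attributes, and text. The following minimal sketch reuses the heading shown above as an inline string, so it runs without fetching the page; the variable names are illustrative, and `html.parser` is used only to avoid the extra `html5lib` dependency.

```python
from bs4 import BeautifulSoup

html = '<h1 class="title" id="top">\n    Dengue Cases\n</h1>'
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1                    # first <h1> tag in the document
print(heading.name)                  # tag name
print(heading["id"])                 # attribute access works like a dict
print(heading["class"])              # class is multi-valued, so it is a list
print(heading.get_text(strip=True))  # text with surrounding whitespace removed
```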

## find or findAll
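HTML, like XML, is a tree of nested tags, so locating data means navigating that tree. `find` returns the first tag that matches a filter, and `findAll` returns every match (`find_all` is the current spelling; `findAll` is the older alias).  
To get the weekly dengue case counts, we first list the `id` of each table on the page and then select the table that holds them.  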

In [7]:
for table in bsObj.findAll("table"):
    if table.has_attr('id'):
        print(table['id'])

ContentPlaceHolderTitle_C008_TblReportedCases
ContentPlaceHolderTitle_C008_TblEWeek
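The loop above can also be written with attribute filters. The sketch below uses a small made-up document (the table ids `reported` and `eweek` are placeholders, not the ids on the NEA page) to show `id=True`, which matches any tag that has an `id` attribute, and `find` with a specific `id`.

```python
from bs4 import BeautifulSoup

html = """
<table id="reported"><tr><td>1</td></tr></table>
<table><tr><td>no id</td></tr></table>
<table id="eweek"><tr><td>39</td><td>52</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# id=True keeps only the tables that carry an id attribute, whatever its value.
table_ids = [table["id"] for table in soup.find_all("table", id=True)]
print(table_ids)

# find returns the first matching tag, or None when nothing matches.
eweek = soup.find("table", id="eweek")
cells = [td.get_text() for td in eweek.find_all("td")]
missing = soup.find("table", id="does-not-exist")
print(cells, missing)
```

Because `find` returns `None` for a missing tag, it is worth checking the result before chaining further calls on it.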


In [10]:
bs_table = bsObj.find("table", {"id":"ContentPlaceHolderTitle_C008_TblEWeek"})
bs_table

<table id="ContentPlaceHolderTitle_C008_TblEWeek">
	<tbody><tr>
		<td style="font-weight:bold;">E-week 39<br/>(24-30Sep17)</td><td style="font-weight:bold;">E-week 40<br/>(01-07Oct17)</td><td style="font-weight:bold;">E-week 41<br/>(08-14Oct17)</td><td style="font-weight:bold;">E-week 42<br/>(15-21Oct17)</td><td style="font-weight:bold;">E-week 43<br/>(22-28Oct17)</td><td style="font-weight:bold;">E-week 44<br/>(29Oct-04Nov17)</td><td style="font-weight:bold;">E-week 45<br/>(05-07Nov17 at 3pm)</td>
	</tr><tr>
		<td>39</td><td>52</td><td>55</td><td>63</td><td>77</td><td>65</td><td>24</td>
	</tr>
</tbody></table>

Collect the data into a pandas DataFrame:

In [3]:
import pandas as pd

In [9]:
tbl_list = []
# find_all is the current name for findAll
for row in bs_table.find_all("tr"):
    # Take only the <td> cells; iterating over the row directly also
    # yields the whitespace text nodes that sit between the tags.
    tbl_list.append([cell.get_text() for cell in row.find_all("td")])
df = pd.DataFrame(tbl_list)
df

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | E-week 39(24-30Sep17) | E-week 40(01-07Oct17) | E-week 41(08-14Oct17) | E-week 42(15-21Oct17) | E-week 43(22-28Oct17) | E-week 44(29Oct-04Nov17) | E-week 45(05-07Nov17 at 3pm) |
| 1 | 39 | 52 | 55 | 63 | 77 | 65 | 24 |
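In the DataFrame above the week labels sit in row 0 and the counts are still strings. One possible follow-up, sketched here on a two-column excerpt of the table rather than the live page, is to use the first row as column names and convert the counts to integers. The table id `eweek` and the variable names are assumptions for the example, and `get_text(" ", strip=True)` is used so the text on either side of `<br/>` is separated by a space.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table id="eweek">
  <tr><td>E-week 39<br/>(24-30Sep17)</td><td>E-week 40<br/>(01-07Oct17)</td></tr>
  <tr><td>39</td><td>52</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", id="eweek")

# One list of cell texts per table row.
rows = [
    [cell.get_text(" ", strip=True) for cell in row.find_all("td")]
    for row in table.find_all("tr")
]

# The first row holds the week labels; the remaining rows hold the counts.
header, *body = rows
weekly = pd.DataFrame(body, columns=header).astype(int)
print(weekly)
```

With integer columns, operations such as `weekly.sum(axis=1)` or plotting work without further conversion.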
