### Python's BeautifulSoup package

BeautifulSoup is a widely used Python package for parsing HTML documents and extracting elements from them.

In [None]:
import requests 
from bs4 import BeautifulSoup

We will use this package to extract the table on the Wikipedia page "List of largest financial services companies by revenue".

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue' 
r = requests.get(url) 
print(r.url)
print(r.text)

As you can see, the HTTP request has returned the HTML document that makes up the Wikipedia page. The raw markup is messy, and its structure is hard to make out at first glance.

#### 1. Creating a BeautifulSoup object
This is where the BeautifulSoup package comes in handy! Let's convert the response's "text" attribute into a BeautifulSoup object.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue' 
r = requests.get(url)
html_content = r.text

# r.text is always a string (never None), so check the HTTP status instead
if r.ok:
    # create a beautiful soup object
    html_soup = BeautifulSoup(html_content, "html.parser")
    print(type(html_soup))
else:
    raise Exception('Error getting data from {}'.format(url))

BeautifulSoup does not parse HTML itself; it relies on an underlying parser, which you name in the second argument. Commonly used parsers are:
- 'html.parser' - Python's built-in parser, no extra installation needed
- 'lxml' - external package, runs very fast
- 'html5lib' - external package, aims to parse a web page exactly the way a browser does, but is a bit slow
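As a minimal sketch of how the parser choice is passed in, the hypothetical helper `make_soup` below (our own name, not part of BeautifulSoup) tries 'lxml' first and falls back to the built-in 'html.parser' when lxml is not installed. The sample HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Hypothetical helper: parse with the preferred parser if available."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        return BeautifulSoup(html, "html.parser")


# Made-up HTML, only for illustration
soup = make_soup("<table><tr><td>Allianz</td></tr></table>")
print(soup.td.text)
```

Different parsers can repair malformed HTML in different ways, so it is worth naming the parser explicitly rather than relying on whichever one happens to be installed.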

#### 2. Methods to extract HTML elements
BeautifulSoup takes HTML content and transforms it into a tree-based representation. The two most commonly used methods for fetching data from a BeautifulSoup object are:
- find : returns the first matching element, or None if nothing matches
- find_all : returns a list of all matching elements

Both methods search for elements inside the HTML tree. You can pass the tag name you wish to find as a string, or several tag names as a list. You can also pass the attrs argument, a Python dictionary of attributes; only HTML elements with those attributes are matched. "find_all" has an extra argument called limit, which caps the number of elements retrieved.
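The arguments described above can be tried out on a small, made-up HTML snippet without making a web request. This is only a sketch; the table content and the "company" class are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td class="company">Alpha Bank</td></tr>
  <tr><td>2</td><td class="company">Beta Insurance</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr")                    # first matching element only
all_rows = soup.find_all("tr")                 # every matching element
cells = soup.find_all(["th", "td"], limit=3)   # list of tag names, capped at 3
companies = soup.find_all("td", attrs={"class": "company"})  # attribute filter
missing = soup.find("video")                   # nothing matches, so None

print(len(all_rows))
print([c.text for c in cells])
print([c.text for c in companies])
print(missing)
```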

In [None]:
html_soup.find('tr')

In [None]:
html_soup.find_all('tr')

#### 3. Extracting data using attributes of HTML elements

You can narrow a search by passing attributes, such as the class, that the element must have.
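One detail worth knowing when filtering on class: an element can carry several classes. The sketch below, on made-up HTML, suggests that a single class name matches any element carrying that class, while a multi-class string only matches the attribute value exactly as written. A CSS selector is shown as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = (
    '<table class="wikitable sortable plainrowheads"><tr><td>x</td></tr></table>'
    '<table class="infobox"></table>'
)
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any element that has it among its classes
by_one_class = soup.find_all("table", class_="sortable")

# A multi-class string matches only the exact attribute value
exact = soup.find_all("table", {"class": "wikitable sortable plainrowheads"})

# The same classes in a different order do not match
reordered = soup.find_all("table", class_="sortable wikitable")

# A CSS selector matches the classes regardless of their order
by_css = soup.select("table.sortable.wikitable")

print(len(by_one_class), len(exact), len(reordered), len(by_css))
```

If the page authors add or reorder classes, the exact-string form stops matching, so filtering on one distinctive class is often the more robust choice.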

In [None]:
html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
# html_soup.find('table', class_= 'wikitable sortable plainrowheads')

In [None]:
# keyword-argument search: class_ is used because class is a reserved word in Python
countries = html_soup.find_all(class_= 'datasortkey')
countries

#### 4. Filtering the results of find and find_all methods
The element returned by "find" can itself be searched. Calling it like a function with a tag name, as in my_table('tr'), is shorthand for my_table.find_all('tr').
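The following sketch, on a made-up table, checks that calling an element with a tag name gives the same result as find_all, and then uses that call to collect the cell text row by row.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><th>1</th><td>Alpha Bank</td></tr>
  <tr><th>2</th><td>Beta Insurance</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Calling the element gives the same result as find_all
same_result = table("tr") == table.find_all("tr")

# Collect the text of every header and data cell, one list per row
rows = [
    [cell.get_text(strip=True) for cell in row(["th", "td"])]
    for row in table("tr")
]

print(same_result)
print(rows)
```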

In [None]:
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
print(type(my_table))
my_table

In [None]:
my_table('th')

In [None]:
for eachrow in my_table('tr'):
    print('-----------------')
    print(eachrow)

In [None]:
for eachrow in my_table('tr'):
    print('----------')
    print(eachrow('th'))

In [None]:
for eachrow in my_table('tr'):
    print('----------')
    print(eachrow(['th','td']))

In [None]:
for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    print('----------')
    print(my_data)

In [None]:
countries = html_soup.find_all(class_= 'datasortkey')
for i in countries:
    img = i.find('img', class_='thumbborder')
    country = i.text
    print(f"{country}\n{img}")

In [None]:
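for i in countries:
    img = i.find('img', class_='thumbborder')
    # find returns None when an element has no flag image, so guard before reading 'src'
    src = img['src'] if img is not None else None
    country = i.text
    print(f"{country}\n{src}")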

#### 5. Extracting data based on string match
You can also pass the string argument to match on an element's text, optionally combined with a tag name and/or attributes. A plain string must match the text exactly.
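Because a plain string has to match the text exactly, cells with stray whitespace can be missed. As a sketch on made-up HTML, a compiled regular expression can be passed instead to allow surrounding whitespace or partial matches.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML, only for illustration; the second cell has trailing whitespace
html = (
    "<table><tr>"
    "<td>Insurance</td><td>Insurance \n</td>"
    "<td>Banking</td><td>Reinsurance</td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# Plain string: only the cell whose text is exactly 'Insurance'
exact = soup.find_all("td", string="Insurance")

# Regular expression that tolerates surrounding whitespace
padded = soup.find_all("td", string=re.compile(r"^\s*Insurance\s*$"))

# Case-insensitive partial match, which also picks up 'Reinsurance'
contains = soup.find_all("td", string=re.compile("insurance", re.IGNORECASE))

print(len(exact), len(padded), len(contains))
```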

In [None]:
insurance = html_soup.find_all('td', string='Insurance')
insurance

#### 6. Navigating the HTML tree using CSS selectors

There is also a "select" method that lets us navigate the HTML tree with CSS selectors. It returns a list of all matching elements, while "select_one" returns only the first match. The HTML attributes of each returned element can be accessed like a dictionary.
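Here is a short sketch of selector-based navigation and dictionary-style attribute access on made-up HTML. The link target and image URL are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<table class="wikitable">
  <tr>
    <td><a href="/wiki/Alpha" title="Alpha Bank">Alpha Bank</a></td>
    <td><img class="thumbborder" src="//example.org/flag.png"></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select returns a list of every element matching the CSS selector
links = soup.select("table.wikitable td > a")
link = links[0]

href = link["href"]               # raises KeyError if the attribute is absent
title = link.get("title")         # returns None if the attribute is absent
no_such_attr = link.get("rel")

# select_one returns only the first match (or None)
img = soup.select_one("img.thumbborder")
src = img["src"]
classes = img["class"]            # class is multi-valued, so this is a list

print(href, title, no_such_attr, src, classes)
```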

In [None]:
html_soup.select('table')

In [None]:
for i in html_soup.select('td'):
    print(i)
    print(i.text)
    print('----')

In [None]:
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
my_table.select_one('td')

#### 7. Storing the data

Now that we know exactly where the rows and columns are stored, we are ready to extract them and store them in a dictionary.

Let's begin by creating:
1. a list of items that will be the column headers and
2. a dictionary whose keys are those column headers and whose values are empty lists, which we will fill with the data we scrape.
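The two steps above can be sketched on a small, made-up table as follows. The column names and the row-length check are our own assumptions for illustration: the check skips rows whose number of cells does not match the headers, which avoids an IndexError on note or spanning rows.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration; the last row is deliberately short
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Industry</th></tr>
  <tr><th>1</th><td>Alpha Bank</td><td>Banking</td></tr>
  <tr><th>2</th><td>Beta Insurance</td><td>Insurance</td></tr>
  <tr><td colspan="3">Source: made-up data</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# 1. the list of column headers
columns = ["Rank", "Name", "Industry"]
# 2. a dictionary with one empty list per column
table_dict = {col: [] for col in columns}

for row in table("tr")[1:]:          # skip the header row
    cells = row(["th", "td"])
    if len(cells) != len(columns):   # skip rows that do not line up
        continue
    for col, cell in zip(columns, cells):
        table_dict[col].append(cell.get_text(strip=True))

print(table_dict)
```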

In [None]:
# parse the table and convert to Python dictionary
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for idx, item in enumerate(mytable_dict.keys()):
    print(idx, item)

In [None]:
for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    
    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text )
        print(idx, item)
        print(mytable_dict[item])

In [None]:
# parse the table and convert to Python dictionary
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    
    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text.strip() )
print(mytable_dict)

In [None]:
import pandas as pd

dataframe = pd.DataFrame(mytable_dict)
dataframe.head()

## Summary

In [None]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
# make an HTTP request and convert the text of the response object into a BeautifulSoup object
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue' 
r = requests.get(url)
html_content = r.text

# r.text is always a string (never None), so check the HTTP status instead
if r.ok:
    html_soup = BeautifulSoup(html_content, "html.parser")
else:
    raise Exception('Error getting data from {}'.format(url))

# isolate the table we want and save it into a dataframe
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    
    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text.strip() )
        
dataframe = pd.DataFrame(mytable_dict)
dataframe