### Python's BeautifulSoup package

BeautifulSoup is a widely used Python package for parsing HTML documents and extracting elements from them.

In [53]:
import requests 
from bs4 import BeautifulSoup

We will use this package to extract the table on the Wikipedia page "List of largest financial services companies by revenue".

In [54]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue' 
r = requests.get(url) 
print(r.url)
print(r.text)

https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of largest financial services companies by revenue - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XmEi9gpAMFQAAGSjAlMAAABL","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_largest_financial_services_companies_by_revenue","wgTitle":"List of largest financial services companies by revenue","wgCurRevisionId":942358054,"wgRevisionId":942358054,"wg

As you can see, the HTTP request has returned the HTML document that makes up the Wikipedia page. The raw text is messy, and the structure of the HTML is not clear at first glance.

#### 1. Creating a BeautifulSoup object
This is where the BeautifulSoup package comes in handy! Let's convert the "text" attribute of the response returned by requests into a BeautifulSoup object.

In [55]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue' 
r = requests.get(url)
# r.text is always a string, so check the HTTP status instead of testing for None
r.raise_for_status()
html_content = r.text

# create a BeautifulSoup object
html_soup = BeautifulSoup(html_content, "html.parser")
print(type(html_soup))

<class 'bs4.BeautifulSoup'>


BeautifulSoup does not parse HTML itself; it relies on a separate HTML parser. Three parsers are commonly used:
- 'html.parser' - Python's built-in parser, needs no extra installation
- 'lxml' - external package, runs very fast
- 'html5lib' - external package, aims to parse a web page exactly the same way a browser does, but is a bit slow
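
As a rough sketch of how the parser choice can be handled in practice, the helper below tries a preferred parser and falls back to the built-in one when the external package is not installed. The function name `make_soup` and the tiny HTML string are made up for illustration; `FeatureNotFound` is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first installed parser in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # The requested parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")


# Hypothetical input, only to show the call
soup, used_parser = make_soup("<p>Hello <b>world</b></p>")
print(used_parser, soup.b.text)
```

Which parser ends up being used depends on what is installed, so results for badly formed HTML may differ slightly between machines.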

#### 2. Methods to extract HTML elements
BeautifulSoup takes HTML content and transforms it into a tree-based representation. The two most commonly used methods to fetch data from a BeautifulSoup object are:
- find : returns the first matching element, or None if nothing matches
- find_all : returns a list of all matching elements

Both methods are used to find elements inside the HTML tree. You can pass the tag name that you wish to find as a string, or several tag names as a list. You can also pass the attrs argument, a Python dictionary of attribute names and values; only HTML elements that have those attributes are matched. "find_all" has an extra argument called limit, which caps the number of elements that are retrieved.
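
Before running these methods on the full Wikipedia page, it may help to see them on a tiny document. The sketch below uses a made-up three-row table (not the real page) to show a tag name, a list of tag names, the attrs dictionary and the limit argument.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table, used only for illustration
html = """
<table id="demo">
  <tr><th scope="col">Company</th><th>Rank</th></tr>
  <tr><td class="name">Allianz</td><td>3</td></tr>
  <tr><td class="name">AXA</td><td>4</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr")                                  # first match only
all_rows = soup.find_all("tr")                               # every match
cells = soup.find_all(["th", "td"])                          # any of several tags
col_headers = soup.find_all("th", attrs={"scope": "col"})    # filter on attributes
two_cells = soup.find_all("td", limit=2)                     # stop after two matches
missing = soup.find("video")                                 # no match gives None

print(len(all_rows), len(cells), len(col_headers), len(two_cells), missing)
```

Because find returns None when nothing matches, it is worth checking the result before chaining further calls on it.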

In [57]:
html_soup.find('tr')

<tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billions)
</p>
</th>
<th scope="col">Headquarters
</th></tr>

In [58]:
html_soup.find_all('tr')

[<tr>
 <th data-sort-type="number">Rank
 </th>
 <th scope="col">Company
 </th>
 <th scope="col">Industry
 </th>
 <th scope="col">Revenue
 <p>(USD millions)
 </p>
 </th>
 <th>Net Income
 <p>(USD millions)
 </p>
 </th>
 <th>Total Assets
 <p>(USD billions)
 </p>
 </th>
 <th scope="col">Headquarters
 </th></tr>, <tr>
 <th>1
 </th>
 <td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>
 <td>Conglomerate</td>
 <td>247,500
 </td>
 <td>4,020
 </td>
 <td>708</td>
 <td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_t

#### 3. Extracting data using attributes of HTML elements

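You can narrow a search by passing attribute values, either as a dictionary or as keyword arguments. Because class is a reserved word in Python, the keyword form uses class_.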
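
One detail worth knowing when filtering on the class attribute: a single class name matches any element that carries that class, while a string with several class names is compared against the whole attribute value, so the order matters. The snippet below is a small sketch on a made-up one-cell table that reuses the class names from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell table carrying the same classes as the Wikipedia table
html = '<table class="wikitable sortable plainrowheads"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# A single class name matches even though the element has three classes
by_one_class = soup.find("table", class_="wikitable")

# The full attribute value matches only in exactly this order
by_exact = soup.find("table", class_="wikitable sortable plainrowheads")
reordered = soup.find("table", class_="sortable wikitable plainrowheads")

# A CSS selector matches the classes in any order
by_css = soup.select_one("table.sortable.wikitable")

print(by_one_class is not None, by_exact is not None, reordered, by_css is not None)
```

If the page authors ever reorder or add classes, matching on a single distinctive class (or using a CSS selector) is likely to be more robust than matching the full string.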

In [59]:
html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
# html_soup.find('table', class_= 'wikitable sortable plainrowheads')

<table class="wikitable sortable plainrowheads">
<tbody><tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billions)
</p>
</th>
<th scope="col">Headquarters
</th></tr>
<tr>
<th>1
</th>
<td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>
<td>Conglomerate</td>
<td>247,500
</td>
<td>4,020
</td>
<td>708</td>
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia

In [60]:
# keyword-argument search: class_ matches the HTML class attribute
countries = html_soup.find_all(class_= 'datasortkey')
countries

[<span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/> </span><a href="/wiki/United_States" title="United States">United States</a></span>,
 <span class="datasortkey" data-sort-value="China"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcs

#### 4. Filtering the results of find and find_all methods
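You can narrow down the result of the "find" method by searching inside it for further tags. Calling a tag object like a function, as in my_table('th'), is shorthand for my_table.find_all('th').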
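
The call syntax used in the cells below can be checked on a small example. This sketch, on a made-up two-row table, confirms that calling a tag gives the same result as find_all and shows one way to turn each row into a list of clean strings with get_text(strip=True).

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, used only for illustration
html = """
<table>
  <tr><th>Rank</th><th>Company</th></tr>
  <tr><th>1</th><td>Berkshire Hathaway</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# Calling a tag is shorthand for find_all on that tag
same_result = table("tr") == table.find_all("tr")

# Nested calls: rows first, then header and data cells inside each row
rows = [
    [cell.get_text(strip=True) for cell in row(["th", "td"])]
    for row in table("tr")
]
print(same_result, rows)
```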

In [48]:
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
print(type(my_table))
my_table

<class 'bs4.element.Tag'>


<table class="wikitable sortable plainrowheads">
<tbody><tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billions)
</p>
</th>
<th scope="col">Headquarters
</th></tr>
<tr>
<th>1
</th>
<td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>
<td>Conglomerate</td>
<td>247,500
</td>
<td>4,020
</td>
<td>708</td>
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia

In [50]:
my_table('th')

[<th data-sort-type="number">Rank
 </th>, <th scope="col">Company
 </th>, <th scope="col">Industry
 </th>, <th scope="col">Revenue
 <p>(USD millions)
 </p>
 </th>, <th>Net Income
 <p>(USD millions)
 </p>
 </th>, <th>Total Assets
 <p>(USD billions)
 </p>
 </th>, <th scope="col">Headquarters
 </th>, <th>1
 </th>, <th>2
 </th>, <th>3
 </th>, <th>4
 </th>, <th>5
 </th>, <th>6
 </th>, <th>7
 </th>, <th>8
 </th>, <th>9
 </th>, <th>10
 </th>, <th>11
 </th>, <th>12
 </th>, <th>13
 </th>, <th>14
 </th>, <th>15
 </th>, <th>16
 </th>, <th>17
 </th>, <th>18
 </th>, <th>19
 </th>, <th>20
 </th>, <th>21
 </th>, <th>22
 </th>, <th>23
 </th>, <th>24
 </th>, <th>25
 </th>, <th>26
 </th>, <th>27
 </th>, <th>28
 </th>, <th>29
 </th>, <th>30
 </th>, <th>31
 </th>, <th>32
 </th>, <th>33
 </th>, <th>34
 </th>, <th>35
 </th>, <th>36
 </th>, <th>37
 </th>, <th>38
 </th>, <th>39
 </th>, <th>40
 </th>, <th>41
 </th>, <th>42
 </th>, <th>43
 </th>, <th>44
 </th>, <th>45
 </th>, <th>46
 </th>, <th>47
 </th>, <th>4

In [61]:
for eachrow in my_table('tr'):
    print('-----------------')
    print(eachrow)

-----------------
<tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billions)
</p>
</th>
<th scope="col">Headquarters
</th></tr>
-----------------
<tr>
<th>1
</th>
<td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>
<td>Conglomerate</td>
<td>247,500
</td>
<td>4,020
</td>
<td>708</td>
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_

In [62]:
for eachrow in my_table('tr'):
    print('----------')
    print(eachrow('th'))

----------
[<th data-sort-type="number">Rank
</th>, <th scope="col">Company
</th>, <th scope="col">Industry
</th>, <th scope="col">Revenue
<p>(USD millions)
</p>
</th>, <th>Net Income
<p>(USD millions)
</p>
</th>, <th>Total Assets
<p>(USD billions)
</p>
</th>, <th scope="col">Headquarters
</th>]
----------
[<th>1
</th>]
----------
[<th>2
</th>]
----------
[<th>3
</th>]
----------
[<th>4
</th>]
----------
[<th>5
</th>]
----------
[<th>6
</th>]
----------
[<th>7
</th>]
----------
[<th>8
</th>]
----------
[<th>9
</th>]
----------
[<th>10
</th>]
----------
[<th>11
</th>]
----------
[<th>12
</th>]
----------
[<th>13
</th>]
----------
[<th>14
</th>]
----------
[<th>15
</th>]
----------
[<th>16
</th>]
----------
[<th>17
</th>]
----------
[<th>18
</th>]
----------
[<th>19
</th>]
----------
[<th>20
</th>]
----------
[<th>21
</th>]
----------
[<th>22
</th>]
----------
[<th>23
</th>]
----------
[<th>24
</th>]
----------
[<th>25
</th>]
----------
[<th>26
</th>]
----------
[<th>27
</th>]
----------

In [63]:
for eachrow in my_table('tr'):
    print('----------')
    print(eachrow(['th','td']))

----------
[<th data-sort-type="number">Rank
</th>, <th scope="col">Company
</th>, <th scope="col">Industry
</th>, <th scope="col">Revenue
<p>(USD millions)
</p>
</th>, <th>Net Income
<p>(USD millions)
</p>
</th>, <th>Total Assets
<p>(USD billions)
</p>
</th>, <th scope="col">Headquarters
</th>]
----------
[<th>1
</th>, <td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>, <td>Conglomerate</td>, <td>247,500
</td>, <td>4,020
</td>, <td>708</td>, <td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_

In [64]:
for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    print('----------')
    print(my_data)

----------
[<th>1
</th>, <td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>, <td>Conglomerate</td>, <td>247,500
</td>, <td>4,020
</td>, <td>708</td>, <td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/> </span><a href="/wiki/United_States" title="United States">United States</a></span>
</td>]
----------
[<th>2
</th>, <td><a class="mw-redirect" href="/wiki/Ping_An_Insurance_Group" title="Ping An Insurance Group">Ping An Insurance 

In [70]:
countries = html_soup.find_all(class_= 'datasortkey')
for i in countries:
    img = i.find('img', class_='thumbborder')
    country = i.text
    print(f"{country}\n{img}")

 United States
<img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/>
 China
<img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_o

In [72]:
for i in countries:
    img = i.find('img', class_='thumbborder')['src']
    country = i.text
    print(f"{country}\n{img}")

United States
//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png
China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
Germany
//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png
France
//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/23px-Flag_of_France.svg.png
United States
//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png
China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_o
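
The src values printed above start with "//", which means they are scheme-relative and cannot be fetched as they stand. One possible way to turn them into absolute URLs is urljoin from the standard library, as sketched below. The HTML fragment and the shortened image path are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue"

# Hypothetical fragment shaped like one 'datasortkey' span (image path shortened)
html = (
    '<span class="datasortkey"><img class="thumbborder" '
    'src="//upload.wikimedia.org/flag.png"/> '
    '<a href="/wiki/United_States">United States</a></span>'
)
span = BeautifulSoup(html, "html.parser").find(class_="datasortkey")

img = span.find("img", class_="thumbborder")
# .get returns None instead of raising KeyError when the attribute is absent
src = img.get("src") if img is not None else None

# urljoin fills in the scheme (and the host, for site-relative paths) from the page URL
flag_url = urljoin(page_url, src) if src else None
link_url = urljoin(page_url, span.a["href"])

print(flag_url)
print(link_url)
```

Using img.get("src") rather than img["src"] avoids an exception on rows that happen to have no flag image.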

#### 5. Extracting data based on string match
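You can also pass a string argument to match on the text content of an element, optionally combined with a tag name and/or attributes. A plain string must match the element's text exactly.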
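
Exact matching can silently miss cells whose text carries extra whitespace, such as 'Insurance\n'. The sketch below, on a made-up row, compares a plain string with two compiled regular expressions, which the string argument also accepts.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical row: note the trailing newline inside the second cell
html = (
    "<table><tr><td>Insurance</td><td>Insurance\n</td>"
    "<td>Banking</td><td>Life Insurance</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# Plain string: the cell text must be exactly 'Insurance'
exact = soup.find_all("td", string="Insurance")

# Regular expression that tolerates surrounding whitespace
padded = soup.find_all("td", string=re.compile(r"^\s*Insurance\s*$"))

# Regular expression that matches the word anywhere in the cell
contains = soup.find_all("td", string=re.compile("Insurance"))

print(len(exact), len(padded), len(contains))
```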

In [75]:
insurance = html_soup.find_all('td', string='Insurance')
insurance

[<td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>]

#### 6. Navigating HTML tree using CSS 

There is also a "select" method that lets us navigate the HTML tree using CSS selectors. "select" returns a list of all matching elements, while "select_one" returns only the first match. The HTML attributes of each returned element can be accessed like a dictionary.
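
As a small illustration of both points, the sketch below combines a tag, a class and a child combinator in one selector, then reads attributes from the matched elements with dictionary syntax. The two-row table is a made-up excerpt modelled on the page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt modelled on two rows of the Wikipedia table
html = """
<table class="wikitable sortable">
  <tr><th>1</th><td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td></tr>
  <tr><th>2</th><td><a class="mw-redirect" href="/wiki/Ping_An_Insurance_Group" title="Ping An Insurance Group">Ping An Insurance Group</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Links that are direct children of a <td> inside a table with class 'wikitable'
links = soup.select("table.wikitable td > a")
hrefs = [a["href"] for a in links]

# select_one returns the first match (or None)
redirect = soup.select_one("a.mw-redirect")

# tag[...] raises KeyError for a missing attribute; tag.get(...) returns None
first_link_class = links[0].get("class")

print(hrefs, redirect["title"], first_link_class)
```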

In [86]:
html_soup.select('table')

[<table class="wikitable sortable plainrowheads">
 <tbody><tr>
 <th data-sort-type="number">Rank
 </th>
 <th scope="col">Company
 </th>
 <th scope="col">Industry
 </th>
 <th scope="col">Revenue
 <p>(USD millions)
 </p>
 </th>
 <th>Net Income
 <p>(USD millions)
 </p>
 </th>
 <th>Total Assets
 <p>(USD billions)
 </p>
 </th>
 <th scope="col">Headquarters
 </th></tr>
 <tr>
 <th>1
 </th>
 <td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>
 <td>Conglomerate</td>
 <td>247,500
 </td>
 <td>4,020
 </td>
 <td>708</td>
 <td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x,

In [87]:
for i in html_soup.select('td'):
    print(i)
    print(i.text)
    print('----')

<td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>
Berkshire Hathaway
----
<td>Conglomerate</td>
Conglomerate
----
<td>247,500
</td>
247,500

----
<td>4,020
</td>
4,020

----
<td>708</td>
708
----
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/> </span><a href="/wiki/United_States" title="United States">United States</a></span>
</td>
 United States

----
<td><a class="mw-redirect" href="/wiki/Ping_An_Insurance_Group" title="P

In [88]:
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
my_table.select_one('td')

<td><a href="/wiki/Berkshire_Hathaway" title="Berkshire Hathaway">Berkshire Hathaway</a></td>

#### 7. Storing the data

Now that we know exactly where the rows and columns are stored, we are ready to extract them and store them in a dictionary. 

Let's begin by creating:
1. a list of items that will be the column headers and 
2. a dictionary whose keys are the same column headers and whose values are empty lists, which we will fill with the data we scrape.
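
The two steps above can be sketched with an explicit header list and a dictionary built from it. This is only one possible layout, shown on a made-up table with three columns; the names `headers` and `table_dict` are illustrative, and the length check is an added assumption that skips incomplete rows instead of failing on them.

```python
from bs4 import BeautifulSoup

# Step 1: the column headers (shortened to three columns for this sketch)
headers = ["Rank", "Name", "Industry"]

# Step 2: one empty list per column
table_dict = {header: [] for header in headers}

# Hypothetical table; the last row is deliberately missing a cell
html = """
<table>
  <tr><th>Rank</th><th>Company</th><th>Industry</th></tr>
  <tr><th>1</th><td>Berkshire Hathaway</td><td>Conglomerate</td></tr>
  <tr><th>2</th><td>Ping An Insurance Group</td><td>Insurance</td></tr>
  <tr><th>3</th><td>Incomplete row</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

for row in table("tr")[1:]:              # skip the header row
    cells = row(["th", "td"])
    if len(cells) != len(headers):
        continue                         # skip rows that do not fit the header list
    for header, cell in zip(headers, cells):
        table_dict[header].append(cell.get_text(strip=True))

print(table_dict)
```

Pairing headers and cells with zip avoids indexing by position, which would raise an IndexError on a short row.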

In [83]:
# parse the table and convert to Python dictionary
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for idx, item in enumerate(mytable_dict.keys()):
    print(idx, item)

0 Rank
1 Name
2 Industry
3 Revenue
4 NetIncome
5 TotalAssets
6 Headquarters


In [84]:
for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    
    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text )
        print(idx, item)
        print(mytable_dict[item])

0 Rank
['1\n']
1 Name
['Berkshire Hathaway']
2 Industry
['Conglomerate']
3 Revenue
['247,500\n']
4 NetIncome
['4,020\n']
5 TotalAssets
['708']
6 Headquarters
['\xa0United States\n']
0 Rank
['1\n', '2\n']
1 Name
['Berkshire Hathaway', 'Ping An Insurance Group\n']
2 Industry
['Conglomerate', 'Insurance\n']
3 Revenue
['247,500\n', '163,597\n']
4 NetIncome
['4,020\n', '16,237\n']
5 TotalAssets
['708', '7,143\n']
6 Headquarters
['\xa0United States\n', '\xa0China\n']
0 Rank
['1\n', '2\n', '3\n']
1 Name
['Berkshire Hathaway', 'Ping An Insurance Group\n', 'Allianz']
2 Industry
['Conglomerate', 'Insurance\n', 'Insurance']
3 Revenue
['247,500\n', '163,597\n', '143,860\n']
4 NetIncome
['4,020\n', '16,237\n', '8,490\n']
5 TotalAssets
['708', '7,143\n', '973']
6 Headquarters
['\xa0United States\n', '\xa0China\n', '\xa0Germany\n']
0 Rank
['1\n', '2\n', '3\n', '4\n']
1 Name
['Berkshire Hathaway', 'Ping An Insurance Group\n', 'Allianz', 'AXA']
2 Industry
['Conglomerate', 'Insurance\n', 'Insurance', 'I

In [85]:
# parse the table and convert to Python dictionary
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    
    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text.strip() )
print(mytable_dict)

{'Rank': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '57', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80'], 'Name': ['Berkshire Hathaway', 'Ping An Insurance Group', 'Allianz', 'AXA', 'JP Morgan Chase', 'ICBC', 'China Construction Bank', 'China Life Insurance', 'Bank of America', 'Agricultural Bank of China', 'Wells Fargo', 'HSBC', 'Generali Group', "People's Insurance Company", 'Bank of China', 'Citigroup', 'MetLife', 'Bank of Communications', 'Dai-ichi Life', 'Aegon', 'Banco Bradesco', 'Prudential Financial', 'Legal & General Group', 'China Merchants Bank', 'Munich Re', 'China Pacific Insurance', 'Banco Santand

In [29]:
import pandas as pd

dataframe = pd.DataFrame(mytable_dict)
dataframe.head()

| | Rank | Name | Industry | Revenue | NetIncome | TotalAssets | Headquarters |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Berkshire Hathaway | Conglomerate | 247500 | 4020 | 708 | United States |
| 1 | 2 | Ping An Insurance Group | Insurance | 163597 | 16237 | 7143 | China |
| 2 | 3 | Allianz | Insurance | 143860 | 8490 | 973 | Germany |
| 3 | 4 | AXA | Insurance | 113130 | 2310 | 1008 | France |
| 4 | 5 | JP Morgan Chase | Banking | 105486 | 30709 | 2687 | United States |
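
The values scraped from the page are strings such as '247,500\n' or '\xa0United States\n', so numeric columns usually need a conversion step before any calculation. The helper below is a minimal sketch of one way to do that in plain Python; the function name `to_int` is made up, and returning None for empty or non-numeric cells is an assumption about how missing values should be treated.

```python
def to_int(text):
    """Convert a scraped figure with thousands separators to int, or None if not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        # Empty cells and non-numeric text end up here
        return None


# Sample values copied from the scraped output above
revenue_raw = ["247,500\n", "163,597\n", "143,860\n"]
revenue = [to_int(value) for value in revenue_raw]

# str.strip() also removes the non-breaking space (\xa0) in front of country names
headquarters = "\xa0United States\n".strip()

print(revenue, to_int(""), headquarters)
```

The same function could be applied to a dictionary column before building the DataFrame, or to a DataFrame column afterwards.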


## Summary

In [None]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd

In [92]:
# make an HTTP request and convert the text of the response object into a BeautifulSoup object
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue' 
r = requests.get(url)
r.raise_for_status()  # stop here on an HTTP error (r.text itself is never None)
html_content = r.text
html_soup = BeautifulSoup(html_content, "html.parser")

# isolate the table we want and save it into a dataframe
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    
    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text.strip() )
        
dataframe = pd.DataFrame(mytable_dict)
dataframe

| | Rank | Name | Industry | Revenue | NetIncome | TotalAssets | Headquarters |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Berkshire Hathaway | Conglomerate | 247500 | 4020 | 708 | United States |
| 1 | 2 | Ping An Insurance Group | Insurance | 163597 | 16237 | 7143 | China |
| 2 | 3 | Allianz | Insurance | 143860 | 8490 | 973 | Germany |
| 3 | 4 | AXA | Insurance | 113130 | 2310 | 1008 | France |
| 4 | 5 | JP Morgan Chase | Banking | 105486 | 30709 | 2687 | United States |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | 76 | Standard Chartered | Banking | 20976 | 1109 | | United Kingdom |
| 77 | 77 | Old Mutual | Investment Services | 20923 | 938 | | United Kingdom |
| 78 | 78 | Freddie Mac | Investment Services | 15380 | 9235 | | United States |
| 79 | 79 | Westpac Banking Group | Banking | 14710 | 5412 | | Australia |
