## Use only pandas

source: https://gist.github.com/aculich/b34868c098d94d614515


Installation requirements

```
pip install pandas
pip install lxml
pip install html5lib
pip install BeautifulSoup4
```

or

```
conda install html5lib
```

or

```
easy_install html5lib
```

## Read Wiki Tables

In [1]:
# extract tables from wikipedia
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area'

wikitables = read_html(page,  attrs={"class":"wikitable"})

print ("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 1 wikitables
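
The call above needs network access. For offline experiments, `read_html` also accepts a file-like object. The sketch below uses a made-up HTML string (not real Wikipedia markup) to show how `attrs` filters tables by class. It assumes `lxml` is installed as listed in the requirements.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for a downloaded page (not real Wikipedia markup).
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Country</th><th>Area</th></tr>
  <tr><td>1</td><td>Russia</td><td>13100000</td></tr>
  <tr><td>2</td><td>China</td><td>9596961</td></tr>
</table>
<table class="navbox">
  <tr><td>ignored</td></tr>
</table>
"""

# attrs keeps only the table whose class attribute is "wikitable";
# the "navbox" table is skipped.
tables = pd.read_html(StringIO(html), attrs={"class": "wikitable"})

print("Extracted {num} wikitables".format(num=len(tables)))
print(tables[0])
```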


In [2]:
wikitables[0]

Unnamed: 0,0,1,2,3,4
0,Rank,Country,Area (km²),Notes,
1,1,Russia*,13100000,"17,125,200 including European part",
2,2,China,9596961,"excludes Hong Kong, Macau, Taiwan and disputed...",
3,3,India,3287263,,
4,4,Kazakhstan*,2455034,"2,724,902 km² including European part",
5,5,Saudi Arabia,2149690,,
6,6,Iran,1648195,,
7,7,Mongolia,1564110,,
8,8,Indonesia*,1472639,"1,904,569 km² including Oceanian part",
9,9,Pakistan,796095,"882,363 km² including Gilgit-Baltistan and AJK",
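
In the output above the header text sits in row 0 and the columns are labelled 0 to 4. One possible cleanup is to promote that row to column labels after parsing. The frame `raw` and the helper `promote_first_row` below are hypothetical stand-ins typed by hand, not part of the original notebook. `read_html` also has a `header` parameter that may do the same at parse time.

```python
import pandas as pd

# Hand-built stand-in for wikitables[0]: the header text ended up in row 0.
raw = pd.DataFrame([
    ["Rank", "Country", "Area (km²)"],
    ["1", "Russia*", "13100000"],
    ["2", "China", "9596961"],
])


def promote_first_row(df):
    """Return a new frame whose first row becomes the column labels."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out


table = promote_first_row(raw)

# The area values are still strings at this point, so convert them to numbers.
table["Area (km²)"] = pd.to_numeric(table["Area (km²)"])

print(table)
```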


In [3]:
# extract several tables from a single Wikipedia page
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/List_of_UFC_events'

wikitables = read_html(page, index_col=0, attrs={"class":"wikitable"})

print ("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 2 wikitables


In [4]:
wikitables[0].head()

Unnamed: 0_level_0,1,2,3,4
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Event,Date,Venue,Location,Ref.
UFC on ESPN 4,"Jun 29, 2019",TBA,TBA,[9]
UFC on ESPN+ 11,"Jun 22, 2019",TBA,TBA,[9]
UFC 238,"Jun 8, 2019",TBA,TBA,[9]
UFC on ESPN+ 10,"Jun 1, 2019",TBA,TBA,[9]


In [5]:
wikitables[1].shape

(470, 6)

In [6]:
wikitables[1]

Unnamed: 0_level_0,1,2,3,4,5,6
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
#,Event,Date,Venue,Location,Attendance,Ref.
465,UFC Fight Night: Assunção vs. Moraes 2,"Feb 2, 2019",Centro de Formação Olímpica do Nordeste,"Fortaleza, Brazil",10040,[21]
–,UFC 233,"Jan 26, 2019",Honda Center,"Anaheim, California, U.S.",Cancelled,[22]
464,UFC Fight Night: Cejudo vs. Dillashaw,"Jan 19, 2019",Barclays Center,"Brooklyn, New York, U.S.",12152,[23]
463,UFC 232: Jones vs. Gustafsson 2,"Dec 29, 2018",The Forum,"Inglewood, California, U.S.",15862,[24]
462,UFC on Fox: Lee vs. Iaquinta 2,"Dec 15, 2018",Fiserv Forum,"Milwaukee, Wisconsin, U.S.",9010,[25]
461,UFC 231: Holloway vs. Ortega,"Dec 8, 2018",Scotiabank Arena,"Toronto, Ontario, Canada",19039,[26]
460,UFC Fight Night: dos Santos vs. Tuivasa,"Dec 2, 2018",Adelaide Entertainment Centre,"Adelaide, Australia",8652,[27]
459,The Ultimate Fighter: Heavy Hitters Finale,"Nov 30, 2018",Pearl Theatre,"Las Vegas, Nevada, U.S.",2020,[28]
458,UFC Fight Night: Blaydes vs. Ngannou 2,"Nov 24, 2018",Cadillac Arena,"Beijing, China",10302,[29]
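
The Attendance column above mixes numbers with text such as `Cancelled`, and the dates are plain strings. A minimal sketch of one way to convert both, using a few rows typed by hand from the output above (the column names are assumed to match the header row):

```python
import pandas as pd

# Hand-typed sample of the rows shown above (not fetched from Wikipedia).
events = pd.DataFrame({
    "Event": [
        "UFC 233",
        "UFC Fight Night: Cejudo vs. Dillashaw",
        "UFC 232: Jones vs. Gustafsson 2",
    ],
    "Date": ["Jan 26, 2019", "Jan 19, 2019", "Dec 29, 2018"],
    "Attendance": ["Cancelled", "12152", "15862"],
})

# "Cancelled" is not a number, so errors="coerce" turns it into NaN.
events["Attendance"] = pd.to_numeric(events["Attendance"], errors="coerce")

# Parse the "Mon DD, YYYY" strings into timestamps.
events["Date"] = pd.to_datetime(events["Date"], format="%b %d, %Y")

print(events.dtypes)
print(events.sort_values("Date"))
```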


In [8]:
# use a different column (here column 1) as the index
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

wikitables = read_html(page, index_col=1, attrs={"class":"wikitable"})

print ("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 1 wikitables


In [9]:
wikitables[0].head()

Unnamed: 0_level_0,0,2,3,4,5,6
1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Country or area,Rank,UN continentalregion[2],UN statisticalregion[2],Population(1 July 2016)[3],Population(1 July 2017)[3],Change
World,—,—,—,7466964280,7550262101,+1.1%
China[a],1,Asia,Eastern Asia,1403500365,1409517397,+0.4%
India,2,Asia,Southern Asia,1324171354,1339180127,+1.1%
United States,3,Americas,Northern America,322179605,324459463,+0.7%
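
Cells scraped from Wikipedia often keep footnote markers such as `[a]` or `[2]`, and percentages arrive as text. The sketch below shows one way to clean both on a small hand-typed sample of the values above. The regular expression is an assumption about what the markers look like.

```python
import pandas as pd

# Hand-typed sample of the values shown above.
pop = pd.DataFrame({
    "Country or area": ["China[a]", "India", "United States"],
    "Change": ["+0.4%", "+1.1%", "+0.7%"],
})

# Remove bracketed footnote markers such as "[a]" or "[2]".
pop["Country or area"] = pop["Country or area"].str.replace(
    r"\[[^\]]*\]", "", regex=True
)

# Drop the percent sign and convert the remaining text to a float.
pop["Change"] = pop["Change"].str.rstrip("%").astype(float)

print(pop)
```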


In [10]:
# works with other languages too (an encoding option is available if needed)
from pandas.io.html import read_html
page = 'https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E5%9B%BD%E5%AE%B6%E5%92%8C%E5%9C%B0%E5%8C%BA%E4%BA%BA%E5%8F%A3%E6%8E%92%E5%90%8D%E5%88%97%E8%A1%A8'

wikitables = read_html(page, index_col=0, attrs={"class":"wikitable"})

print ("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 1 wikitables


In [11]:
wikitables[0].head()

Unnamed: 0_level_0,1,2,3,4,5,6
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
排名,国家或者地区,大洲[2],統計地區[2],人口(2016年7月1日)[3],人口(2017年7月1日)[3],变化率
—,世界,—,—,7466964280,7550262101,+1.1%
1,中华人民共和国[a],亚洲,东亚,1403500365,1409517397,+0.4%
2,印度,亚洲,南亚,1324171354,1339180127,+1.1%
3,美國,美洲,北美,322179605,324459463,+0.7%


## Read Wiki Infoboxes

In [12]:
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/University_of_California,_Berkeley'
infoboxes = read_html(page, index_col=0, attrs={"class":"infobox"})
wikitables = read_html(page, index_col=0, attrs={"class":"wikitable"})

print ("Extracted {num} infoboxes".format(num=len(infoboxes)))
print ("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 1 infoboxes
Extracted 4 wikitables


In [13]:
infoboxes[0]

Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
University rankings,
National,
ARWU[106],4.0
Forbes[107],14.0
U.S. News & World Report[108],22.0
Washington Monthly[109],7.0
Global,
ARWU[110],5.0
QS[111],27.0
Times[112],15.0
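
In the infobox above, section labels such as `National` and `Global` have no value and appear as empty cells. A small sketch of how the actual rankings could be isolated, using a hand-typed `Series` as a stand-in for the single value column of `infoboxes[0]`:

```python
import pandas as pd

# Hand-typed stand-in for the value column of infoboxes[0].
rankings = pd.Series(
    [None, None, 4.0, 14.0, None, 5.0, 27.0],
    index=[
        "University rankings", "National", "ARWU[106]", "Forbes[107]",
        "Global", "ARWU[110]", "QS[111]",
    ],
    name="rank",
    dtype="float64",
)

# Section labels carry no value, so they are NaN and can be dropped.
values = rankings.dropna()

# Strip the numeric footnote markers from the labels.
values.index = values.index.str.replace(r"\[\d+\]", "", regex=True)

print(values)
```

After this step the `ARWU` label appears twice (national and global), so the section a row came from would need to be tracked separately if it matters.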


In [14]:
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/Lisbon'
infoboxes = read_html(page, index_col=0, attrs={"class":"infobox geography vcard"})
wikitables = read_html(page, index_col=0, attrs={"class":"wikitable"})

print ("Extracted {num} infoboxes".format(num=len(infoboxes)))
print ("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 1 infoboxes
Extracted 1 wikitables


In [15]:
infoboxes[0][10:20]

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
Country,Portugal,
NUTS II Region,Lisbon metropolitan area,
NUTS III Subregion,Lisbon metropolitan area,
District,Lisbon,
Municipality,Lisbon,
Settlement,Prior to Roman rule,
City,c. 1256,
Civil parishes,(see text),
Government,,
• Type,LAU,


## Scrape non-wiki tables

In [16]:
from pandas.io.html import read_html
page = 'https://www.esportsearnings.com/players'
infoboxes = read_html(page, index_col=0, attrs={"class":"detail_list_table"})

infoboxes[0].head(10)


Unnamed: 0_level_0,1,2,3,4,5,6,7
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
,Player ID,Player Name,Total (Overall),,Highest Paying Game,Total (Game),% of Total
1.0,KuroKy,Kuro Takhasomi,"$4,136,926.95",|,Dota 2,"$4,135,203.61",99.96%
2.0,N0tail,Johan Sundstein,"$3,742,055.59",|,Dota 2,"$3,733,903.98",99.78%
3.0,Miracle-,Amer Al-Barkawi,"$3,701,337.28",|,Dota 2,"$3,701,337.28",100.00%
4.0,MinD_ContRoL,Ivan Ivanov,"$3,492,411.76",|,Dota 2,"$3,492,411.76",100.00%
5.0,Matumbaman,Lasse Urpalainen,"$3,476,116.04",|,Dota 2,"$3,476,116.04",100.00%
6.0,JerAx,Jesse Vainikka,"$3,313,463.82",|,Dota 2,"$3,313,463.82",100.00%
7.0,SumaiL,Sumail Hassan,"$3,305,914.94",|,Dota 2,"$3,305,914.94",100.00%
8.0,GH,Maroun Merhej,"$3,095,344.84",|,Dota 2,"$3,095,344.84",100.00%
9.0,UNiVeRsE,Saahil Arora,"$3,035,737.67",|,Dota 2,"$3,035,737.67",100.00%
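
The money and percentage columns above are text, and one column only holds a `|` separator. The sketch below shows one possible cleanup on two hand-typed rows. The column name `sep` is made up here, since the separator column has no header in the scraped table.

```python
import pandas as pd

# Two rows typed by hand from the output above; "sep" is a made-up name
# for the unnamed column that only contains "|".
players = pd.DataFrame({
    "Player ID": ["KuroKy", "N0tail"],
    "Total (Overall)": ["$4,136,926.95", "$3,742,055.59"],
    "sep": ["|", "|"],
    "% of Total": ["99.96%", "99.78%"],
})

# The separator column carries no information.
players = players.drop(columns=["sep"])

# Remove "$" and thousands separators, then convert to float.
players["Total (Overall)"] = (
    players["Total (Overall)"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Remove the trailing "%" and convert to float.
players["% of Total"] = players["% of Total"].str.rstrip("%").astype(float)

print(players)
```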


## Convert HTML tables to CSV/Excel

In [17]:
from pandas.io.html import read_html
page = 'https://www.esportsearnings.com/players'
infoboxes = read_html(page, index_col=0, attrs={"class":"detail_list_table"})

file_name = './my_file.csv'
# sep='\t' writes tab-separated values, even though the file name ends in .csv
infoboxes[0].to_csv(file_name, sep='\t')


In [18]:
!find . -type f -name "*.csv" 

./my_file.csv
./csv/movie_metadata.csv
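
A file written with `sep='\t'` has to be read back with the same separator. The sketch below illustrates the round trip with an in-memory buffer instead of a file on disk. The small frame is typed by hand and only stands in for the scraped table.

```python
from io import StringIO

import pandas as pd

# Hand-typed stand-in for the scraped table.
table = pd.DataFrame(
    {
        "Player Name": ["Kuro Takhasomi", "Johan Sundstein"],
        "Highest Paying Game": ["Dota 2", "Dota 2"],
    },
    index=pd.Index(["KuroKy", "N0tail"], name="Player ID"),
)

# Write tab-separated text into a buffer, as to_csv(file_name, sep='\t') would.
buffer = StringIO()
table.to_csv(buffer, sep="\t")

# Read it back with the same separator and restore the index column.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)

print(restored)
```

For Excel output, `DataFrame.to_excel` is the analogous method. It needs an extra engine such as `openpyxl`, which is not in the installation list above.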


## Web Scraping Wikipedia Tables using BeautifulSoup and Python

source: https://github.com/stewync/Web-Scraping-Wiki-tables-using-BeautifulSoup-and-Python/blob/master/Scraping%2BWiki%2Btable%2Busing%2BPython%2Band%2BBeautifulSoup.ipynb

In [19]:
import requests
# .text holds the page's HTML, not a URL
html = requests.get('https://en.wikipedia.org/wiki/List_of_UFC_events').text

In [20]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')

In [21]:
My_table = soup.find('table',{'class':'wikitable'})

In [22]:
links = My_table.find_all('a')

In [23]:
# link.get('title') returns None for links without a title attribute
events = [link.get('title') for link in links]

In [24]:
import pandas as pd
df = pd.DataFrame({'events': events})

df.head()

Unnamed: 0,events
0,UFC on ESPN 4
1,TBA
2,
3,UFC on ESPN+ 11
4,TBA
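
The `TBA` and empty entries in the output above likely come from venue links and from citation links that have no `title`. The sketch below shows two possible ways around that on a made-up HTML snippet: skipping links without a title, or reading only the first cell of each row. It uses the standard-library `html.parser` backend so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure of the events table.
html = """
<table class="wikitable">
  <tr><th>Event</th><th>Venue</th><th>Ref.</th></tr>
  <tr>
    <td><a href="/wiki/UFC_238" title="UFC 238">UFC 238</a></td>
    <td><a href="/wiki/TBA" title="TBA">TBA</a></td>
    <td><a href="#cite_note-9">[9]</a></td>
  </tr>
  <tr>
    <td><a href="/wiki/UFC_on_ESPN%2B_10" title="UFC on ESPN+ 10">UFC on ESPN+ 10</a></td>
    <td><a href="/wiki/TBA" title="TBA">TBA</a></td>
    <td><a href="#cite_note-9">[9]</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# Option 1: keep only links that actually carry a title attribute.
titles = [a.get("title") for a in table.find_all("a") if a.get("title")]

# Option 2: take the first data cell of each row, one event name per row.
events = [
    row.find("td").get_text(strip=True)
    for row in table.find_all("tr")
    if row.find("td")  # the header row has only <th> cells
]

print(titles)
print(events)
```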


## Other

#### wiki-table-scrape
https://github.com/rocheio/wiki-table-scrape

#### Scraping Wikipedia Tables with Python
https://roche.io/2016/05/scrape-wikipedia-with-python