

---


In this notebook, you'll pull your own data from a website, such as Wikipedia, into a pandas DataFrame.


---
We'll use this Wikipedia page, which shows a list of countries by population: https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)

---
You should pick your own site with a large table and adapt this code for your own purposes.


---
See these websites for more help on this topic:

https://simpleanalytical.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas


https://www.ankuroh.com/programming/automation/web-scraping-with-python-text-scraping-wikipedia/


---



In [0]:
# Import packages needed
import urllib.request
from bs4 import BeautifulSoup

# Set the wikipedia site we'll be looking at
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
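Some sites refuse requests that do not identify themselves, in which case `urlopen(url)` raises an HTTP error. If that happens with your chosen site, one possible workaround is to send a descriptive User-Agent header. This is a minimal sketch, not part of the original notebook: `build_request` and `USER_AGENT` are hypothetical names, and the header value is a placeholder you should replace.

```python
import urllib.request

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

# Hypothetical identifying string; replace it with your own project name and contact
USER_AGENT = 'my-scraping-notebook/0.1 (contact: you@example.com)'

def build_request(target_url, user_agent=USER_AGENT):
    """Return a Request object that carries a User-Agent header."""
    return urllib.request.Request(target_url, headers={'User-Agent': user_agent})

request = build_request(url)

# Nothing is downloaded until the request is opened, for example:
# page = urllib.request.urlopen(request, timeout=30)
```

The resulting `page` object can be passed to `BeautifulSoup` exactly as in the cell above.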

In [2]:
# Find all the tables on that website
all_tables=soup.find_all("table")
all_tables

[<table align="right">
 <tbody><tr>
 <td><div class="thumb tright"><div class="thumbinner" style="width:352px;"><a class="image" href="/wiki/File:United_Nations_geographical_subregions.png"><img alt="" class="thumbimage" data-file-height="628" data-file-width="1357" decoding="async" height="162" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/08/United_Nations_geographical_subregions.png/350px-United_Nations_geographical_subregions.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/08/United_Nations_geographical_subregions.png/525px-United_Nations_geographical_subregions.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/08/United_Nations_geographical_subregions.png/700px-United_Nations_geographical_subregions.png 2x" width="350"/></a> <div class="thumbcaption"><div class="magnify"><a class="internal" href="/wiki/File:United_Nations_geographical_subregions.png" title="Enlarge"></a></div>Statistical regions as <a href="/wiki/United_Nations_geoscheme" title="Unit

In [3]:
# Find the table you actually want to pull. Many Wikipedia pages have multiple tables.
# Use your browser's find command (e.g., Control+F or Command+F) to search for your table's title
# Once you find your table, copy its 'class' value and paste it below
# Then print the table to check that you recognize the data
right_table=soup.find('table',class_='nowrap sortable mw-datatable wikitable')
right_table

<table class="nowrap sortable mw-datatable wikitable" id="main" style="text-align: right;">
<tbody><tr valign="bottom">
<th>Country or area
</th>
<th><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN continental<br/>region</a><sup class="reference" id="cite_ref-region_4-0"><a href="#cite_note-region-4">[4]</a></sup>
</th>
<th><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN statistical<br/>region</a><sup class="reference" id="cite_ref-region_4-1"><a href="#cite_note-region-4">[4]</a></sup>
</th>
<th>Population<br/>(1 July 2018)
</th>
<th>Population<br/>(1 July 2019)
</th>
<th>Change
</th></tr>
<tr>
<td align="left"><span class="datasortkey" data-sort-value="China"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_C
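Passing the full class string, as in the cell above, only matches when the page's `class` attribute is written exactly that way, in that order. If your lookup returns `None`, a looser match may help. The sketch below runs on a small hand-written HTML snippet (an assumption for illustration, not the real Wikipedia markup) and shows two alternatives: matching on a single class name, and using a CSS selector.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a page with two tables (not the real Wikipedia markup)
sample_html = """
<table class="infobox"><tr><td>Sidebar</td></tr></table>
<table class="nowrap sortable mw-datatable wikitable" id="main">
  <tr><th>Country or area</th><th>Population</th></tr>
  <tr><td>Exampleland</td><td>123</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Passing one class name matches any table that has that class among others
by_class = sample_soup.find('table', class_='wikitable')

# A CSS selector can require several classes, in any order
by_selector = sample_soup.select_one('table.wikitable.sortable')

print(by_class['id'], by_selector['id'])
```

Both lookups return the first matching table, so check the printed result against the page before relying on it.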

In [0]:
# There are 6 columns in our table, so we'll create 6 empty lists, one per column

A=[]
B=[]
C=[]
D=[]
E=[]
F=[]

# Loop through the rows and keep the first piece of text from each data cell ('td')
# Rows that don't have exactly 6 data cells (such as the header row) are skipped
for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==6:
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
        D.append(cells[3].find(string=True))
        E.append(cells[4].find(string=True))
        F.append(cells[5].find(string=True))

In [5]:
# Now we'll put this into a pandas DataFrame
# And give each column a name
import pandas as pd
df=pd.DataFrame(A,columns=['Country'])
df['Region1']=B
df['Region2']=C
df['2018Pop']=D
df['2019Pop']=E
df['PercentPopChange']=F

# Then we print to check our table
df

Unnamed: 0,Country,Region1,Region2,2018Pop,2019Pop,PercentPopChange
0,,Asia,Eastern Asia,1427647786,1433783686,+0.43%
1,,Asia,Southern Asia,1352642280,1366417754,+1.02%
2,,Americas,Northern America,327096265,329064917,+0.60%
3,,Asia,South-eastern Asia,267670543,270625568,+1.10%
4,,Asia,Southern Asia,212228286,216565318,+2.04%
...,...,...,...,...,...,...
228,,Americas,Caribbean,4993,4989,−0.08%
229,,Americas,South America,3234,3377,+4.42%
230,,Oceania,Polynesia,1620,1615,−0.31%
231,,Oceania,Polynesia,1319,1340,+1.59%


It looks like most of our data made it into the table, but there's a problem with the 'Country' column.

Check out the Wikipedia page, and you'll see that there are also images of flags next to each country name. This is probably the problem: our loop keeps only the first piece of text in each cell, and in the 'Country' column that is not the name itself. So we'll look at the tags around the country names and adjust the for loop. This is one of the most common issues you'll run into with Wikipedia data.
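To see the effect in isolation, here is a small sketch on a hand-written cell. The HTML is an assumption that imitates a flag image followed by a linked name; the real page's markup is longer. Taking the first piece of text yields only the spacing next to the flag, while the text of the first link yields the country name.

```python
from bs4 import BeautifulSoup

# Hand-written cell that imitates a flag image followed by a linked country name
cell_html = (
    '<table><tr><td>'
    '<span class="flagicon"><img alt="" src="flag.png"/>&#160;</span>'
    '<a href="/wiki/China">China</a>'
    '</td></tr></table>'
)

cell = BeautifulSoup(cell_html, 'html.parser').find('td')

# The first text node is the non-breaking space next to the flag image
first_text = cell.find(string=True)

# The country name lives inside the first link
link_text = cell.find('a').get_text()

print(repr(first_text), repr(link_text))
```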

In [0]:
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]

for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==6:
        # The country name is the text of the first link in the first cell
        mlnk=cells[0].find_all('a')
        A.append(mlnk[0].contents[0])
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
        D.append(cells[3].find(string=True))
        E.append(cells[4].find(string=True))
        F.append(cells[5].find(string=True))

In [7]:
import pandas as pd
df=pd.DataFrame(A,columns=['Country'])
df['Region1']=B
df['Region2']=C
df['2018Pop']=D
df['2019Pop']=E
df['PercentPopChange']=F

# Check our table 
df

Unnamed: 0,Country,Region1,Region2,2018Pop,2019Pop,PercentPopChange
0,China,Asia,Eastern Asia,1427647786,1433783686,+0.43%
1,India,Asia,Southern Asia,1352642280,1366417754,+1.02%
2,United States,Americas,Northern America,327096265,329064917,+0.60%
3,Indonesia,Asia,South-eastern Asia,267670543,270625568,+1.10%
4,Pakistan,Asia,Southern Asia,212228286,216565318,+2.04%
...,...,...,...,...,...,...
228,Montserrat,Americas,Caribbean,4993,4989,−0.08%
229,Falkland Islands,Americas,South America,3234,3377,+4.42%
230,Niue,Oceania,Polynesia,1620,1615,−0.31%
231,Tokelau,Oceania,Polynesia,1319,1340,+1.59%


Now that our table looks good, it's important to spot-check a few rows against the original page to make sure everything is lined up properly. After this is done, this DataFrame is ready to use!
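One caveat before doing arithmetic: every value collected by the loop is text, not a number. The sketch below shows one way to convert such strings. The helper names `parse_population` and `parse_percent` are hypothetical, and the code assumes every cell follows the formats shown in the output above (optionally with thousands separators, and with either an ASCII hyphen or the Unicode minus sign for negative changes).

```python
def parse_population(text):
    """Convert a string such as '1,433,783,686' or '1433783686' to an int."""
    return int(text.strip().replace(',', ''))

def parse_percent(text):
    """Convert a string such as '+0.43%' or '\u22120.08%' to a float (in percent)."""
    # Replace the Unicode minus sign (U+2212) with an ASCII hyphen so float() accepts it
    cleaned = text.strip().rstrip('%').replace('\u2212', '-')
    return float(cleaned)

print(parse_population('1433783686'))
print(parse_percent('+0.43%'))
print(parse_percent('\u22120.08%'))
```

With the DataFrame from this notebook, the helpers could be applied column by column, for example `df['2019Pop'].map(parse_population)` or `df['PercentPopChange'].map(parse_percent)`. A cell that does not fit the assumed format will raise a `ValueError`, which is a useful signal that the column needs a closer look.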