# The Packages

We will be using urllib and BeautifulSoup to do our scraping. urllib is part of the Python standard library, so it comes with Python. If you do not already have BeautifulSoup, you can install it with pip:

```bash
pip install beautifulsoup4
```

pandas is also imported to help store the data later.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

Once these are imported, we can access the web page using the urllib library. The following lines of code:

1) Store the URL as a string in a variable called URL

2) Open the web page at the given URL

3) Read in all the html data and store it in a variable called html

4) Close the connection to the website

In [2]:
URL = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population' # 1)
uClient = urlopen(URL) # 2)
html = uClient.read() # 3)
uClient.close() # 4)
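
A slightly more defensive variant of the same four steps is sketched below. It uses only the standard library; the helper name `fetch_html` is hypothetical, and a `data:` URL stands in for the Wikipedia page so the snippet runs without a network connection. The `with` block closes the connection even if the read fails.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Open the URL, read the raw bytes, and close the connection."""
    # The with-statement closes the connection automatically,
    # even if read() raises an exception.
    with urlopen(url) as response:
        return response.read()


# A data: URL is used here as an offline stand-in for a real page.
demo_url = 'data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E'
demo_html = fetch_html(demo_url)
print(demo_html)
```

Calling `fetch_html(URL)` with the Wikipedia address from above should return the same kind of bytes object as `uClient.read()`.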

If we look at the html variable from above, we see a mess of raw html data (stored as a bytes object) that was taken from the web page.

In [3]:
html

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of countries and dependencies by population - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XkAbmwpAIDEAAHN8u44AAADE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_and_dependencies_by_population","wgTitle":"List of countries and dependencies by population","wgCurRevisionId":939921126,"wgRevisionId":939921126,"wgArticleId":69058,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroup

We can begin to clean this up with BS4. First we create a BeautifulSoup object, which allows us to use BeautifulSoup's functions to parse through the html data. To make the object, we pass the html variable created above and the name of the parser we would like to use. We will be using 'html.parser', although other parsers are available. You can read about them [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser). Once the soup object is created, you can use the prettify() method to see a cleaner, indented view of the html data.

In [4]:
soup = BeautifulSoup(html, 'html.parser') # Create soup object

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of countries and dependencies by population - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XkAbmwpAIDEAAHN8u44AAADE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_and_dependencies_by_population","wgTitle":"List of countries and dependencies by population","wgCurRevisionId":939921126,"wgRevisionId":939921126,"wgArticleId":69058,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"
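
To see what the soup object offers without wading through the full Wikipedia page, here is a minimal sketch on a tiny hand-written html string. The names `sample_html` and `sample_soup` are made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up, tiny page standing in for the downloaded html variable.
sample_html = b'<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>'

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached as attributes of the soup object;
# .string gives the text inside a tag that holds a single string.
page_title = sample_soup.title.string
first_paragraph = sample_soup.p.string

print(page_title)
print(first_paragraph)
```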

### Navigating the HTML Data

BeautifulSoup has methods that make finding data in the html easier. They work by referencing the tags in the html (tags are what's enclosed in the angle brackets; for example, \<a\> is an anchor tag). You can search through the html by using the find_all method and passing the name of the tag you are interested in as an argument. One way to identify the tags of interest in a web page is by going to the web page, right-clicking, and selecting "Inspect" as shown below (Google Chrome is being used here; the steps may differ in other web browsers).  

![](inspects.png) 

The html data will appear in the upper right corner as shown above. As you hover over the html, the corresponding parts of the web page will be highlighted. You can use this feature to see where the data you are interested in is located. For example, below we want to start at the table. Moving through the html data shows us where we should start gathering this data.  

![](highlight.png)

# Grabbing the relevant data section

It looks like the 'td' tags contain the information we are interested in. We can use the find_all method on our soup object to grab this section by passing 'td' as an argument. This will then allow us to begin parsing through the section we are interested in for the data.

In [6]:
data = soup.find_all('td') # Grab the table data section
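
As a rough illustration of what `find_all` returns, the sketch below runs it on a small made-up table. It also shows one possible refinement, which is an assumption on my part rather than something used in this walkthrough: calling `find` for the table first so that `td` tags elsewhere on the page are not picked up. The class name in the sample is only for illustration; the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one data table plus a stray <td> in a second table.
sample_html = """
<table class="wikitable">
  <tr><td>1</td><td>China</td></tr>
  <tr><td>2</td><td>India</td></tr>
</table>
<table><tr><td>footer cell</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all on the whole document returns every <td>, wanted or not.
all_cells = sample_soup.find_all('td')

# Restricting the search to the first matching table skips the stray cell.
table = sample_soup.find('table', class_='wikitable')
table_cells = table.find_all('td')

print(len(all_cells), len(table_cells))
```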

# Parsing to grab the data of interest

There are probably better ways to do this, but this is how I do it. The data object that has been created can be treated like a list in that it can be indexed and iterated through. Typically, I find a way to extract and store the data I am interested in for one row, and then repeat the process with a for loop for the rest of the data. As you can see below, the first item in the object is the rank, the second has the country name, the third has the population, the fourth has a percentage (the share of world population), the fifth has the date, and the sixth has the source. The pattern then repeats for the next row, starting at the seventh item in the object (index 6, because of zero-indexing).

In [7]:
data[0]

<td>1</td>

In [8]:
data[1]

<td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span> <a href="/wiki/Demographics_of_China" title="Demographics of China">China</a><sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[b]</a></sup></td>

In [9]:
data[2]

<td style="text-align:right">1,401,260,600</td>

In [10]:
data[3]

<td align="right"><span data-sort-value="7001180483298739621♠" style="display:none"></span>18.0%</td>

In [11]:
data[4]

<td><span data-sort-value="000000002020-02-09-0000" style="white-space:nowrap">9 Feb 2020</span></td>

In [12]:
data[5]

<td align="left">National population clock<sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[3]</a></sup>
</td>

In [13]:
data[6]

<td>2</td>
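
The repeating pattern can be sketched with plain strings: a flat list in which every six consecutive items make up one row. The values are typed in by hand from the outputs in this walkthrough where available; cells whose text is not shown are left as `'...'` placeholders.

```python
# Flat list of cell texts, in the order the <td> tags appear.
flat_cells = [
    '1', 'China', '1,401,260,600', '18.0%', '9 Feb 2020', 'National population clock',
    '2', 'India', '1,358,409,350', '...', '9 Feb 2020', '...',
]

ITEMS_PER_ROW = 6

# Slice the flat list into consecutive chunks of six items.
rows = [flat_cells[i:i + ITEMS_PER_ROW]
        for i in range(0, len(flat_cells), ITEMS_PER_ROW)]

for row in rows:
    rank, country, population, share, as_of, source = row
    print(rank, country, population, as_of)
```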

All we have to do is pick out the data of interest and store it for each iteration. You can pick out the data by using the contents attribute. This attribute returns a tag's children (nested tags and strings) as a list. If the result isn't what you want, you can continue to narrow it down by combining indexing and contents again. Below we parse out the info we would like for the first row.

In [14]:
data[1].contents

[<span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span>,
 '\xa0',
 <a href="/wiki/Demographics_of_China" title="Demographics of China">China</a>,
 <sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[b]</a></sup>]

We just want the country name here, and the contents attribute returned far more than that. To get around this, we can index the spot in the list that contains the data we want and then use the contents attribute again.

In [15]:
data[1].contents[2] # Indexing the list

<a href="/wiki/Demographics_of_China" title="Demographics of China">China</a>

In [16]:
data[1].contents[2].contents # Grabbing the contents again.

['China']

In [17]:
data[1].contents[2].contents[0] # Grabbing just the data alone

'China'

The above is how we will grab country names in our loop. Let's look at how to grab the rest.

In [18]:
data[2].contents[0] # population

'1,401,260,600'

In [19]:
data[4].contents[0].contents[0] # Date

'9 Feb 2020'
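
Chained `contents` lookups work, but they depend on the exact nesting of each cell. As a hedged alternative (not what the loop in this walkthrough uses), BeautifulSoup's `get_text()` method and tag-name attribute access can pull the same strings with less indexing. The html here is a shortened copy of the country and date cells shown above.

```python
from bs4 import BeautifulSoup

# Shortened versions of the country and date cells shown above.
cell_html = (
    '<td align="left"><span class="flagicon"></span>\xa0'
    '<a href="/wiki/Demographics_of_China" title="Demographics of China">China</a>'
    '<sup class="reference"><a href="#cite_note-4">[b]</a></sup></td>'
    '<td><span style="white-space:nowrap">9 Feb 2020</span></td>'
)

cells = BeautifulSoup(cell_html, 'html.parser').find_all('td')

# cell.a is the first <a> tag inside the cell, however deeply it is nested.
country = cells[0].a.get_text()

# get_text() joins all text inside the tag; strip=True trims whitespace.
as_of = cells[1].get_text(strip=True)

print(country, as_of)
```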

# Iterating the process

Now that we have the general layout for where our data lies in each row, we can iterate through this process by adjusting the indices for each row. You may think that because the length of the data object is 1458 and there are 6 items per row, there are 1458/6 = 243 rows. I did. This turned out to be incorrect because the data object holds td cells past the end of the table rows. I only found this out by attempting to run the loop for that many rows, only to receive "list index out of range" errors. I found the correct number of rows by manually searching the end of the data object to see where the last country occurs. It happens at index 1441, which is the country cell of row 240 counting from zero (1441 = 1 + 240*6), so the rows occupy the first 1446 items and there are actually 1446/6 = 241 rows.  
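
The row count follows from the index of the last country cell. A short check of that arithmetic, using only the numbers quoted above:

```python
ITEMS_PER_ROW = 6
last_country_index = 1441   # index of the last country cell, found by inspection
total_cells = 1458          # length of the data object as reported above

# Country cells sit at indices 1, 7, 13, ..., i.e. 1 + row_num * 6.
last_row_num = (last_country_index - 1) // ITEMS_PER_ROW
num_of_rows = last_row_num + 1

# Cells that belong to the table rows versus trailing extras.
cells_in_rows = num_of_rows * ITEMS_PER_ROW
extra_cells = total_cells - cells_in_rows

print(num_of_rows, cells_in_rows, extra_cells)
```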

Second thing: the nesting of the contents turned out to differ between two formats of country cell. Ergo, a try-except was implemented to handle both formats. You can see the two formats in the way each country's name is extracted in the two clauses.

In [20]:
countries = []
pop = []
date = []

num_of_rows = 241  # There are 6 items per row.

for row_num in range(num_of_rows):
    start = row_num * 6  # index of the first item in this row
    country_cell = data[start + 1]

    try:
        # Usual format: flag span, non-breaking space, then the <a> link
        countries.append(country_cell.contents[2].contents[0])
    except (IndexError, AttributeError):
        # Alternate format: the same pieces are nested one level deeper
        countries.append(country_cell.contents[0].contents[2].contents[0])

    # Population and date are extracted the same way in both formats
    pop.append(data[start + 2].contents[0])
    date.append(data[start + 4].contents[0].contents[0])

In [21]:
len(countries) 

241

In [22]:
len(pop)

241

In [23]:
len(date)

241
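
One way to avoid both the hard-coded row count and the index arithmetic is to walk the table row by row instead. This is a sketch of that alternative on made-up html, not the method used above; it assumes each data row is a `tr` holding exactly six `td` cells and skips any row that does not match. All `sample_` names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up table with a header row, two data rows, and a short footer row.
sample_html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th><th>%</th><th>Date</th><th>Source</th></tr>
  <tr><td>1</td><td><a href="#">China</a></td><td>1,401,260,600</td><td>18.0%</td><td>9 Feb 2020</td><td>clock</td></tr>
  <tr><td>2</td><td><i><a href="#">India</a></i></td><td>1,358,409,350</td><td>...</td><td>9 Feb 2020</td><td>...</td></tr>
  <tr><td colspan="6">Notes</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

sample_countries, sample_pop, sample_date = [], [], []

for tr in sample_soup.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) != 6:
        continue  # header, footer, or otherwise irregular row
    # .a finds the first link in the cell regardless of nesting depth
    sample_countries.append(cells[1].a.get_text(strip=True))
    sample_pop.append(cells[2].get_text(strip=True))
    sample_date.append(cells[4].get_text(strip=True))

print(sample_countries, sample_pop, sample_date)
```

Because the loop stops when the rows run out, there is no row count to get wrong, and the nested and non-nested country cells are handled by the same line.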

# Now to prepare the data for storage

From here you can just store this data in a dataframe and export it to a csv as follows:

In [24]:
temporary_dictionary = {'Country': countries, 'Population':pop, 'Date':date}

In [25]:
populations_df = pd.DataFrame(temporary_dictionary)

In [26]:
populations_df

Unnamed: 0,Country,Population,Date
0,China,1401260600,9 Feb 2020
1,India,1358409350,9 Feb 2020
2,United States,329302812,9 Feb 2020
3,Indonesia,266911900,1 Jul 2019
4,Brazil,211102824,9 Feb 2020
...,...,...,...
236,Niue,1520,1 Jul 2018
237,Tokelau,1400,1 Jul 2018
238,Vatican City,799,1 Jul 2019
239,Cocos (Keeling) Islands,538,30 Jun 2018


In [27]:
populations_df.to_csv('populations.csv')
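
By default `to_csv` also writes the DataFrame's row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, passing `index=False` leaves it out. A small sketch, writing to in-memory buffers instead of files (the `demo_df` data is hand-copied from the first two rows above):

```python
import io

import pandas as pd

demo_df = pd.DataFrame({'Country': ['China', 'India'],
                        'Population': ['1,401,260,600', '1,358,409,350']})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)

# index=False: only the named columns are written.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]

print(header_with_index)
print(header_without_index)
```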

# But Note a Little Cleaning is Very Helpful Before Storing

The data has been successfully scraped, although it still needs to be cleaned. The Population column, for instance, holds strings, which need to be converted to integers before most plotting and numeric features will work with them. The code below does that conversion. The Date column is also a string and may need to be converted, depending on your analysis.

In [28]:
populations_df['Pop Fixed'] = populations_df['Population'].str.replace(',', '').astype(int)

In [29]:
populations_df['DateTime Date'] = pd.to_datetime(populations_df['Date'])

In [30]:
populations_df.head()

Unnamed: 0,Country,Population,Date,Pop Fixed,DateTime Date
0,China,1401260600,9 Feb 2020,1401260600,2020-02-09
1,India,1358409350,9 Feb 2020,1358409350,2020-02-09
2,United States,329302812,9 Feb 2020,329302812,2020-02-09
3,Indonesia,266911900,1 Jul 2019,266911900,2019-07-01
4,Brazil,211102824,9 Feb 2020,211102824,2020-02-09
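
For messier columns, a slightly more forgiving version of the same two conversions can be sketched as follows. It assumes the dates follow the day, abbreviated month, year pattern seen above (`%d %b %Y`); the `errors='coerce'` option turns anything unparseable into a missing value instead of raising. The small DataFrame is built by hand, with one deliberately bad date added for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Population': ['1,401,260,600', '1,358,409,350', '329,302,812'],
    'Date': ['9 Feb 2020', '9 Feb 2020', 'unknown'],  # last value is deliberately bad
})

# Remove thousands separators, then convert; bad values would become NaN.
demo_df['Pop Fixed'] = pd.to_numeric(
    demo_df['Population'].str.replace(',', '', regex=False), errors='coerce')

# An explicit format avoids day/month guessing; bad values become NaT.
demo_df['DateTime Date'] = pd.to_datetime(
    demo_df['Date'], format='%d %b %Y', errors='coerce')

print(demo_df.dtypes)
```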


# Dropping the old columns
Dropping the unnecessary columns, and renaming the new ones, we can export the relevant data to csv.

In [31]:
populations_df.drop(columns=['Population', 'Date'], inplace=True)

In [32]:
populations_df.columns = ['Country', 'Population', 'Date']

In [33]:
populations_df.head(5)

Unnamed: 0,Country,Population,Date
0,China,1401260600,2020-02-09
1,India,1358409350,2020-02-09
2,United States,329302812,2020-02-09
3,Indonesia,266911900,2019-07-01
4,Brazil,211102824,2020-02-09
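
Assigning a list to `columns` relies on the columns being in the expected order. A sketch of an order-independent alternative using `rename` with an explicit mapping (the one-row DataFrame is made up to mirror the layout above):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Country': ['China'],
    'Population': ['1,401,260,600'],
    'Date': ['9 Feb 2020'],
    'Pop Fixed': [1401260600],
    'DateTime Date': pd.to_datetime(['2020-02-09']),
})

# Drop the string columns, then rename the cleaned ones by name.
cleaned_df = (
    demo_df
    .drop(columns=['Population', 'Date'])
    .rename(columns={'Pop Fixed': 'Population', 'DateTime Date': 'Date'})
)

print(list(cleaned_df.columns))
```

This version also returns a new DataFrame rather than modifying the original in place, which makes it easier to rerun a cell without errors.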


# This is a nicer dataset to export

In [34]:
populations_df.to_csv('cleaned_populations_data.csv')

# Final Thoughts

The methods shown above can take you pretty far in gathering the data you want from a web page. Again, I typically work out the extraction for one row and then loop through the rest. There will inevitably be minor problems, as we saw above, and you will have to adjust for them; here, for example, the nesting of the country cell changed for some rows. Through trial and error you can fix the loop and grab what you need.