# Scraping the Country Demographic Information on Wikipedia

## Project Outline

1. Download the page using `requests`
2. Parse the HTML source using `BeautifulSoup4`
3. Extract country names and country URLs from the main page
4. Compile extracted information into Python lists and dictionaries
5. Extract and combine data from multiple pages
6. Save the extracted information to a CSV file.

## Import the libraries

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os

## Download the page using `requests`

To download the page, I will use the `requests.get` function from the `requests` library, which returns a response object containing the data from the web page.

In [3]:
wiki_url = 'https://en.wikipedia.org/wiki/Category:Demographics_by_country'

In [4]:
response = requests.get(wiki_url)

The `.status_code` property can be used to check if the response was successful. A successful response will have an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299. 

In [5]:
response.status_code

200

Once the response is successful, we can get the contents of the page using `response.text`. We can check the length of `response.text` to see whether it contains anything (`len(response.text)`). We can also use the `response.ok` property, which is `True` for any status code below 400.

In [6]:
len(response.text)

178770
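The 2xx rule described above can be written down as a tiny helper. This is only a sketch: `is_success` is a hypothetical name and not part of `requests`, and it uses only the standard library so that it runs without a network connection.

```python
from http import HTTPStatus

def is_success(status_code: int) -> bool:
    """Return True for status codes in the 2xx range (hypothetical helper)."""
    return 200 <= status_code <= 299

# HTTPStatus members behave like integers and carry a readable phrase
for code in (HTTPStatus.OK, HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.NOT_FOUND):
    print(int(code), code.phrase, is_success(code))
```

A strict 2xx check like this is narrower than `response.ok`, which also accepts 3xx codes.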

In [7]:
page_content = response.text
page_content[:500]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-enabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-disabled" lang="en" dir="ltr">\n<head>\n<meta charset='

What we're looking at above is the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the web page. 

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [8]:
with open('downloaded_webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

## Parse the HTML source using BeautifulSoup4

In [9]:
doc = BeautifulSoup(response.text, 'html.parser')

Once the web content is parsed, `BeautifulSoup` returns a `BeautifulSoup` object.

In [10]:
type(doc)

bs4.BeautifulSoup

In [11]:
doc.find('title')

<title>Category:Demographics by country - Wikipedia</title>

Let's combine these two sections into one function that downloads the web page using `requests` and then parses it using `BeautifulSoup`.

In [12]:
def get_countries_page():
    """Download a web page and return a BeautifulSoup doc"""
    #wiki_url = 'https://en.wikipedia.org/wiki/Category:Demographics_of_Asia_by_country'
    wiki_url = 'https://en.wikipedia.org/wiki/Category:Demographics_by_country'
    response = requests.get(wiki_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(wiki_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

The function above downloads the content of the hard-coded wiki URL using `requests`. If the download succeeds (`status_code` is 200), it parses the web content using `BeautifulSoup`. Otherwise, it raises an exception.

Now let's test whether this function returns the same value as the original code. The simplest way is to call the same method on both results (in this case, finding the title) and see whether they match.

In [13]:
doc2 = get_countries_page()

In [14]:
type(doc2)

bs4.BeautifulSoup

In [15]:
doc2.find('title')

<title>Category:Demographics by country - Wikipedia</title>

As shown above, the function returns the same `title` as the original code, so it is working as expected.

## Extract country names and country URLs from the main Wiki page

![Countries](https://i.imgur.com/8V5Niuk.png)

Now we can use the parsed object to find the `div` tags in the HTML. From manual inspection of the web page, we found that country names and URLs are embedded in `div` tags with the class `CategoryTreeItem`. The snippet below collects all of these `div` objects, each one representing a separate entry on the page.

Using these `div` tags, we can gather all the countries as well as the country URLs.

In [16]:
div_tags = doc.find_all('div', {'class': 'CategoryTreeItem'})

In [17]:
div_tags[:3]

[<div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="collapsed" data-ct-title="Demographic_history_by_country_or_region"></span> </span> <a href="/wiki/Category:Demographic_history_by_country_or_region" title="Category:Demographic history by country or region">Demographic history by country or region</a>‎ <span dir="ltr" title="Contains 12 subcategories, 29 pages, and 0 files">(12 C, 29 P)</span></div>,
 <div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="collapsed" data-ct-title="Demographics_by_former_country"></span> </span> <a href="/wiki/Category:Demographics_by_former_country" title="Category:Demographics by former country">Demographics by former country</a>‎ <span dir="ltr" title="Contains 8 subcategories, 1 page, and 0 files">(8 C, 1 P)</span></div>,
 <div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="c

Let's take one country entry from `div_tags` (index 11) and look at its `a` tag.

In [18]:
div_tags[11].a.text

'Demographics of Afghanistan'

Since the `a` tag returns "Demographics of Afghanistan", we can extract the name of the country by replacing "Demographics of " with an empty string.

In [19]:
div_tags[11].a.text.replace('Demographics of ','')

'Afghanistan'

Similarly, we can build the relative URL of the country by replacing spaces with underscores (`_`). Then we can prepend the [base URL](https://en.wikipedia.org/wiki/) to this relative URL to get the complete URL for the country.

In [20]:
div_tags[11].a.text.replace(' ','_')

'Demographics_of_Afghanistan'

In [21]:
'https://en.wikipedia.org/wiki/' + div_tags[0].a.text.replace(' ','_')

'https://en.wikipedia.org/wiki/Demographic_history_by_country_or_region'
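The two steps above (spaces to underscores, then prepending the base URL) can be wrapped in one helper. The sketch below is illustrative only: `build_wiki_url` is a hypothetical name, and the percent-encoding via `urllib.parse.quote` is an addition that the notebook itself does not perform. It is included because some country names on the page contain non-ASCII characters.

```python
from urllib.parse import quote

BASE_URL = 'https://en.wikipedia.org/wiki/'

def build_wiki_url(link_text: str) -> str:
    """Turn link text such as 'Demographics of Albania' into a full URL."""
    page_name = link_text.strip().replace(' ', '_')
    # quote() percent-encodes non-ASCII characters; underscores are left alone
    return BASE_URL + quote(page_name)

print(build_wiki_url('Demographics of Afghanistan'))
print(build_wiki_url('Demographics of São Tomé and Príncipe'))
```

`requests` generally encodes such characters on its own, so the plain string concatenation used in this notebook also works; the explicit encoding just makes the final URL visible.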

## Compile extracted information into Python lists and dictionaries

#### Let's create some helper functions to parse information from the page

To get the list of countries, we can pick the `div` tags with the class `CategoryTreeItem`.

![Countries](https://i.imgur.com/8V5Niuk.png)

Now we can create 2 functions that scrape all country names and all country URLs. We can write another function to convert these lists of country names and country URLs into a dictionary. 

In [22]:
def get_country_list(doc):
    '''
    This function finds all the "div" tags to identify the country names. The function also removes some div tags, which are not countries
    '''
    div_tags = doc.find_all('div', {'class': 'CategoryTreeItem'})
    
    countries = []

    for i in div_tags:
        countries.append(i.a.text.replace('Demographics of ',''))

    # Removes the div tags that are not countries
    countries.remove('Demographic history by country or region')
    countries.remove('Demographics by former country')
    countries.remove('Demographics by continent and country')
    countries.remove('Ageing by country')
    countries.remove('Ethnic groups by country')
    countries.remove('Expatriates by country of residence')
    countries.remove('Immigration by country')
    countries.remove('People by ethnic or national origin')
    countries.remove('Social groups by country')
    
    return(countries)

The function `get_country_list` loops over every `div` tag, extracts the country name from its link text, and appends it to the `countries` list. It then removes the entries that are not countries and returns the list.

In [23]:
cntrys = get_country_list(doc)

In [24]:
len(cntrys)

191

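The list holds 191 entries. One of them, "Censuses by country" at index 0, is not a country and is not removed by the function, which leaves 190 country entries. The preview below skips index 0 and shows the first 5 countries.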

In [25]:
cntrys[1:6]

['Abkhazia', 'Afghanistan', 'Albania', 'Algeria', 'American Samoa']
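An alternative to removing unwanted entries one by one is to keep only link texts that start with "Demographics of ". The sketch below assumes every country link follows that naming pattern, which may not hold for every entry on the page; `extract_country_names` and the sample list are hypothetical and use a handful of link texts that appear in this notebook.

```python
PREFIX = 'Demographics of '

def extract_country_names(link_texts):
    """Keep only entries that start with PREFIX and strip the prefix."""
    return [text[len(PREFIX):] for text in link_texts if text.startswith(PREFIX)]

# Small sample of link texts seen on the category page
sample_link_texts = [
    'Censuses by country',
    'Demographic history by country or region',
    'Demographics by former country',
    'Demographics of Abkhazia',
    'Demographics of Afghanistan',
]
print(extract_country_names(sample_link_texts))
# → ['Abkhazia', 'Afghanistan']
```

With this approach, an entry such as "Censuses by country" is dropped without having to be listed explicitly, and a `remove` call cannot fail with `ValueError` if Wikipedia renames a subcategory.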

To get the list of country URLs, we can pick the `div` tags with the class `CategoryTreeItem` and replace spaces with underscores to get the relative URL. Prepending the base URL to the relative URL gives the complete URL. The function below does this, following the same pattern as the previous one.

In [26]:
def get_country_url_list(doc):
    '''
    This function finds all the "div" tags to identify the country URLs. The function also removes some div tags, which are not country URLs.
    '''
    div_tags = doc.find_all('div', {'class': 'CategoryTreeItem'})
    
    country_urls = []
    base_url = 'https://en.wikipedia.org/wiki/'

    for i in div_tags:
        country_urls.append(base_url + i.a.text.replace(' ','_'))

    # Removes the div tags that are not URLs for the countries
    country_urls.remove('https://en.wikipedia.org/wiki/Demographic_history_by_country_or_region')
    country_urls.remove('https://en.wikipedia.org/wiki/Demographics_by_former_country')
    country_urls.remove('https://en.wikipedia.org/wiki/Demographics_by_continent_and_country')
    country_urls.remove('https://en.wikipedia.org/wiki/Ageing_by_country')
    country_urls.remove('https://en.wikipedia.org/wiki/Ethnic_groups_by_country')
    country_urls.remove('https://en.wikipedia.org/wiki/Expatriates_by_country_of_residence')
    country_urls.remove('https://en.wikipedia.org/wiki/Immigration_by_country')
    country_urls.remove('https://en.wikipedia.org/wiki/People_by_ethnic_or_national_origin')
    country_urls.remove('https://en.wikipedia.org/wiki/Social_groups_by_country')
    
    return(country_urls)

Country URLs are built from the same `div` tags; we only need to replace spaces with underscores. For example, "Demographics of India" becomes "Demographics_of_India", and prepending the base URL gives the full URL "https://en.wikipedia.org/wiki/Demographics_of_India". The function above also removes the entries that are not country URLs.

In [27]:
cntry_url = get_country_url_list(doc)

In [28]:
len(cntry_url)

191

There are 191 URLs: 190 country URLs plus the "Censuses by country" entry at index 0. Skipping index 0, they look like this.

In [29]:
cntry_url[1:6]

['https://en.wikipedia.org/wiki/Demographics_of_Abkhazia',
 'https://en.wikipedia.org/wiki/Demographics_of_Afghanistan',
 'https://en.wikipedia.org/wiki/Demographics_of_Albania',
 'https://en.wikipedia.org/wiki/Demographics_of_Algeria',
 'https://en.wikipedia.org/wiki/Demographics_of_American_Samoa']

#### Now that we have 2 lists created to hold the list of countries and list of country URLs, let's create a wrapper function to hold these 2 lists in a dictionary, which later can be converted into a pandas dataframe.

In [30]:
def scrape_countries():
    '''This function creates a python dictionary to combine the lists created for country names & URLs'''
    countries_dict = {
        'country': get_country_list(doc),
        'country_url': get_country_url_list(doc)
    }
    return pd.DataFrame(countries_dict)

Let's test the function to check that we get the list of countries and their URLs.

In [31]:
countries_and_url_df = scrape_countries()

In [32]:
countries_and_url_df[1:6]

| | country | country_url |
|---|---|---|
| 1 | Abkhazia | https://en.wikipedia.org/wiki/Demographics_of_... |
| 2 | Afghanistan | https://en.wikipedia.org/wiki/Demographics_of_... |
| 3 | Albania | https://en.wikipedia.org/wiki/Demographics_of_... |
| 4 | Algeria | https://en.wikipedia.org/wiki/Demographics_of_... |
| 5 | American Samoa | https://en.wikipedia.org/wiki/Demographics_of_... |


The dataframe should have the same 191 rows as the two lists (190 countries plus the "Censuses by country" entry).

In [33]:
len(countries_and_url_df)

191

Now we can write these 191 records (190 countries plus the "Censuses by country" entry), along with their demographic URLs, into a CSV file.

In [34]:
countries_and_url_df.to_csv('countries.csv', index=None)
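The dataframe above is built from two lists that must line up row by row. Both helper functions remove the same entries, so they do, but that assumption can be made explicit. The following is a plain-Python sketch of the same pairing and CSV step without pandas; `rows_to_csv` is a hypothetical helper, not part of this notebook.

```python
import csv
import io

def rows_to_csv(names, urls):
    """Pair two equally long lists and return CSV text with a header row."""
    if len(names) != len(urls):
        raise ValueError('names and urls must have the same length')
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['country', 'country_url'])
    writer.writerows(zip(names, urls))  # one row per (name, url) pair
    return buffer.getvalue()

names = ['Abkhazia', 'Afghanistan']
urls = ['https://en.wikipedia.org/wiki/Demographics_of_Abkhazia',
        'https://en.wikipedia.org/wiki/Demographics_of_Afghanistan']
print(rows_to_csv(names, urls))
```

The length check matters because `zip` silently truncates to the shorter list, which would hide a mismatch between names and URLs.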

## Extract and combine data from multiple pages

### Get the demographic information for a country

Now that we have the list of countries and URLs, we can download each country page using `requests` and parse its HTML content with `BeautifulSoup` to find the specific tags that hold demographic information.

In this specific example, we would need to find all `th` tags with class `infobox-label` to identify the headers for demographic information. For example: Population, Growth Rate, Birth Rate etc. Similarly, we can use `td` tags along with class `infobox-data` to identify data elements for demographic information. For example: 39,864,082 is the population of Afghanistan from the country page. 

![demographic info-tags and class](https://i.imgur.com/pK5BTPW.png)

Let's create a function that downloads the country URLs, finds the `th` and `td` tags along with the respective classes to scrape the demographic headers (i.e. Population, Growth rate, Birth rate, Death rate etc) and the corresponding data for these headers.  

In [35]:
import pandas as pd

def get_country_demo(country_name):
    '''
    Download a country's demographics page from Wikipedia using requests, parse the HTML with BeautifulSoup, and
    find the "th" and "td" infobox tags to scrape demographic headers and data. Returns a one-row DataFrame.
    '''
    country_page_url = 'https://en.wikipedia.org/wiki/Demographics_of_' + country_name
    response = requests.get(country_page_url)
    if response.status_code != 200:
        print('Country page could not be downloaded for {}'.format(country_name))
        
    country_doc = BeautifulSoup(response.text, 'html.parser')

    demographic_headers = country_doc.find_all('th', {'class' : 'infobox-label'})
    
    # Create the list up front so zip() below never sees an undefined name
    headers_lst = ['Country']
    if len(demographic_headers) == 0:
        print('Demographic headers do not exist for ', country_name)
    else:
        for header in demographic_headers:
            # Keep a placeholder for empty header cells to preserve positions
            headers_lst.append(header.text if len(header.text) != 0 else None)
    
    demographic_data = country_doc.find_all('td', {'class' : 'infobox-data'})
    
    if len(demographic_data) == 0:
        print('Demographic data do not exist for {}'.format(country_name))
        country_demo_dict = None
    else:
        data_lst = [country_name]
        for cell in demographic_data:
            # Keep only the first whitespace-separated token of each cell
            tokens = cell.text.split()
            data_lst.append(tokens[0] if tokens else None)

        country_demo_dict = dict(zip(headers_lst, data_lst))

    return pd.DataFrame([country_demo_dict])

The function above downloads the web content, parses the HTML, and finds the required tags to scrape demographic information. This could have been split into 2-3 smaller functions. It is kept as a single function because `country_name` is needed from top to bottom; with smaller functions, the same variable would have to be passed through each of them.

Below is a usage example of this function, showing what demographic information it retrieves from the corresponding country page.

In [36]:
get_country_demo('China')

| | Country | Population | Growth rate | Birth rate | Death rate | Life expectancy | • male | • female | Fertility rate | Infant mortality rate | 0–14 years | 15–64 years | 65 and over | At birth | Nationality | Major ethnic | Minor ethnic | Official | Spoken |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | 1411750000 | -0.06% | 6.77 | 7.37 | 78.6 | 76.0 | 81.3 | 1.08 | 6.76 | 17.29% | 70.37% | 0.9 | 1.11 | noun: | Han | Zhuang, | Standard | Various; |
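`get_country_demo` collects all header cells and all data cells separately and then zips the two lists. If a row has a label but no data cell (or the reverse), every later value shifts by one position. A possible alternative, sketched below on a made-up HTML fragment, is to look up each label's data cell within its own row. The helper name `infobox_to_dict` and the sample markup are assumptions for illustration; real Wikipedia infoboxes are larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up infobox fragment for illustration; the last row has no data cell
SAMPLE_HTML = """
<table class="infobox">
  <tr><th class="infobox-label">Population</th><td class="infobox-data">1,000 (2020 est.)</td></tr>
  <tr><th class="infobox-label">Growth rate</th><td class="infobox-data">1.5%</td></tr>
  <tr><th class="infobox-label">Birth rate</th></tr>
</table>
"""

def infobox_to_dict(html):
    """Map each infobox label to the first token of the data cell in its row."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for label in soup.find_all('th', class_='infobox-label'):
        # Only look at <td> siblings inside the same row as this label
        cell = label.find_next_sibling('td', class_='infobox-data')
        tokens = cell.get_text().split() if cell is not None else []
        result[label.get_text(strip=True)] = tokens[0] if tokens else None
    return result

print(infobox_to_dict(SAMPLE_HTML))
# → {'Population': '1,000', 'Growth rate': '1.5%', 'Birth rate': None}
```

Because each value is found relative to its own label, a missing cell produces `None` for that label only instead of misaligning the rest.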


Now that we can get all the demographic information from a specific country page, we will create a function that splits the scraped information into separate variables. For this exercise, we are interested in only 5 variables (Population, Growth rate, Birth rate, Death rate, Life expectancy). The function returns these variables so that later on we can loop through the countries and store the values together.

In [37]:
import os
def scrape_country_separate_var(country, country_url):
    '''This function splits the scraped values into separate variables, so that we can use them at a later stage'''
    individual_country_df = get_country_demo(country)
    Country_var = country
    
    try:
        Population_var = individual_country_df['Population'][0]
    except:
        Population_var = 'N/A'
        
    try:
        Growth_rate_var = individual_country_df['Growth rate'][0]
    except:
        Growth_rate_var = 'N/A'        
        
    try:
        Birth_rate_var = individual_country_df['Birth rate'][0]
    except:
        Birth_rate_var = 'N/A'
       
    try:
        Death_rate_var = individual_country_df['Death rate'][0]
    except:
        Death_rate_var = 'N/A'
        
    try:
        Life_expectancy_var = individual_country_df['Life expectancy'][0]
    except:
        Life_expectancy_var = 'N/A'        
    
    #individual_country_df.to_csv('data/'+country+'.csv', index=None)
    
    return Country_var, Population_var, Growth_rate_var, Birth_rate_var, Death_rate_var, Life_expectancy_var

In [38]:
scrape_country_separate_var('Afghanistan','https://en.wikipedia.org/wiki/Demographics_of_Afghanistan')

('Afghanistan', '40,870,394', '2.34%', '38.3', '13.7', '63.2')

In [39]:
scrape_country_separate_var('United States','https://en.wikipedia.org/wiki/Demographics_of_the_United_States')

('United States', '334,233,854', '0.4%', '11.0', '10.4', '76.1')
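The five `try`/`except` blocks in `scrape_country_separate_var` all do the same thing: read one field and fall back to `'N/A'`. A more compact way to express that, assuming the scraped values are available as a dictionary, is a loop over the wanted field names. `pick_fields` and `WANTED_FIELDS` are hypothetical names; the sample record uses the Albania values that appear in the scraping output later in this notebook.

```python
WANTED_FIELDS = ['Population', 'Growth rate', 'Birth rate', 'Death rate', 'Life expectancy']

def pick_fields(record, fields=WANTED_FIELDS, default='N/A'):
    """Return the wanted values in order, substituting default when missing."""
    values = []
    for field in fields:
        value = record.get(field)  # None when the key is absent
        values.append(default if value is None else value)
    return tuple(values)

# Sample record without a 'Life expectancy' entry
albania = {'Country': 'Albania', 'Population': '2,811,666',
           'Growth rate': '0.22%', 'Birth rate': '9.7', 'Death rate': '10.9'}
print(pick_fields(albania))
# → ('2,811,666', '0.22%', '9.7', '10.9', 'N/A')
```

Adding a sixth field then only requires one more entry in `WANTED_FIELDS`.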

## Save the extracted information to a CSV file.

In [40]:
def scrape_countries_demographics():
    '''
    This function would loop through all the countries to get the demographic information and saves them into a CSV file
    '''
    All_Countries = []
    All_Population = []
    All_Growth_rate = []
    All_Birth_rate = []
    All_Death_rate = []
    All_Life_expectancy = []
    
    countries_df = scrape_countries()
    
    os.makedirs('data',exist_ok=True)

    for index, row in countries_df.iterrows():
        print('Scraping demographic data for "{}"'.format(row['country']))
        print(row['country'], row['country_url'])
        Country_var, Population_var, Growth_rate_var, Birth_rate_var, Death_rate_var, Life_expectancy_var = scrape_country_separate_var(row['country'], row['country_url'])
        print(Country_var, Population_var, Growth_rate_var, Birth_rate_var, Death_rate_var, Life_expectancy_var)
        
        All_Countries.append(Country_var)
        All_Population.append(Population_var)
        All_Growth_rate.append(Growth_rate_var)
        All_Birth_rate.append(Birth_rate_var)
        All_Death_rate.append(Death_rate_var)
        All_Life_expectancy.append(Life_expectancy_var)
        
    final_demographic_dict = {
        'Country' : All_Countries,
        'Population' : All_Population,
        'Growth_rate' : All_Growth_rate,
        'Birth_rate' : All_Birth_rate,
        'Death_rate' : All_Death_rate,
        'Life_expectancy' : All_Life_expectancy
    }
    
    country_demographic_info_df = pd.DataFrame(final_demographic_dict)
    country_demographic_info_df.to_csv('data/country_demographic_info.csv', index=None)
    return country_demographic_info_df
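The six parallel lists in `scrape_countries_demographics` could also be replaced by a single list of dictionaries, one per country, which `pd.DataFrame` accepts directly. The sketch below shows that pattern together with a short pause between requests, a common courtesy when fetching many pages from one site. The names `collect_rows` and `fake_fetch` and the one-second default are assumptions, and the stand-in fetcher avoids any network access.

```python
import time

def collect_rows(countries, fetch_one, delay_seconds=1.0):
    """Call fetch_one(country) for each country and gather one dict per row."""
    rows = []
    for position, country in enumerate(countries):
        if position > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        rows.append(fetch_one(country))
    return rows

def fake_fetch(country):
    """Stand-in for a real scraper so the sketch runs offline."""
    return {'Country': country, 'Population': 'N/A'}

rows = collect_rows(['Abkhazia', 'Afghanistan'], fake_fetch, delay_seconds=0)
print(rows)
```

In the notebook, `fetch_one` would be a function that wraps `scrape_country_separate_var` and returns a dictionary, and the result could be passed to `pd.DataFrame(rows)`.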

Let's run `scrape_countries_demographics` to scrape the demographic information for all the countries on the page https://en.wikipedia.org/wiki/Category:Demographics_by_country

In [41]:
final_df = scrape_countries_demographics()

Scraping demographic data for "Censuses by country"
Censuses by country https://en.wikipedia.org/wiki/Censuses_by_country
Country page could not be downloaded for Censuses by country
Demographic headers do not exist for  Censuses by country
Demographic data do not exist for Censuses by country
Censuses by country N/A N/A N/A N/A N/A
Scraping demographic data for "Abkhazia"
Abkhazia https://en.wikipedia.org/wiki/Demographics_of_Abkhazia
Demographic headers do not exist for  Abkhazia
Demographic data do not exist for Abkhazia
Abkhazia N/A N/A N/A N/A N/A
Scraping demographic data for "Afghanistan"
Afghanistan https://en.wikipedia.org/wiki/Demographics_of_Afghanistan
Afghanistan 40,870,394 2.34% 38.3 13.7 63.2
Scraping demographic data for "Albania"
Albania https://en.wikipedia.org/wiki/Demographics_of_Albania
Albania 2,811,666 0.22% 9.7 10.9 N/A
Scraping demographic data for "Algeria"
Algeria https://en.wikipedia.org/wiki/Demographics_of_Algeria
Algeria 44,508,736 1.34% 18.52 4.32 78.03


Djibouti 994,017 2.23% 25.27 8.23 62.4
Scraping demographic data for "Dominica"
Dominica https://en.wikipedia.org/wiki/Demographics_of_Dominica
Demographic headers do not exist for  Dominica
Demographic data do not exist for Dominica
Dominica N/A N/A N/A N/A N/A
Scraping demographic data for "the Dominican Republic"
the Dominican Republic https://en.wikipedia.org/wiki/Demographics_of_the_Dominican_Republic
the Dominican Republic 10,694,700 0.91% 18.03 6.29 72.56
Scraping demographic data for "East Timor"
East Timor https://en.wikipedia.org/wiki/Demographics_of_East_Timor
East Timor 1,445,006 2.15% 30.94 5.61 69.92
Scraping demographic data for "Ecuador"
Ecuador https://en.wikipedia.org/wiki/Demographics_of_Ecuador
Ecuador 18,213,749 1.443% N/A N/A N/A
Scraping demographic data for "Egypt"
Egypt https://en.wikipedia.org/wiki/Demographics_of_Egypt
Egypt 103,376,607 1.68% 21.46 4.32 74.45
Scraping demographic data for "El Salvador"
El Salvador https://en.wikipedia.org/wiki/Demographics_of

Lithuania 2,830,546 −1.04% 9.26 15.12 75.78
Scraping demographic data for "Luxembourg"
Luxembourg https://en.wikipedia.org/wiki/Demographics_of_Luxembourg
Luxembourg 650,364 1.64% 11.61 7.21 82.98
Scraping demographic data for "Madagascar"
Madagascar https://en.wikipedia.org/wiki/Demographics_of_Madagascar
Madagascar 28,172,462 2.27% 28.68 6 N/A
Scraping demographic data for "Malaysia"
Malaysia https://en.wikipedia.org/wiki/Demographics_of_Malaysia
Malaysia 33,871,431 1.03% 14.55 5.69 76.13
Scraping demographic data for "Malawi"
Malawi https://en.wikipedia.org/wiki/Demographics_of_Malawi
Malawi 20,794,353 2.34% 27.94 4.58 72.44
Scraping demographic data for "the Maldives"
the Maldives https://en.wikipedia.org/wiki/Demographics_of_the_Maldives
the Maldives 561,631 -0.14% 15.54 4.15 76.94
Scraping demographic data for "Mali"
Mali https://en.wikipedia.org/wiki/Demographics_of_Mali
Mali 20,741,769 2.95% 41.07 8.53 62.41
Scraping demographic data for "Malta"
Malta https://en.wikipedia.org/w

Samoa 206,179 0.63% 19.21 5.37 75.19
Scraping demographic data for "San Marino"
San Marino https://en.wikipedia.org/wiki/Demographics_of_San_Marino
Demographic headers do not exist for  San Marino
Demographic data do not exist for San Marino
San Marino N/A N/A N/A N/A N/A
Scraping demographic data for "São Tomé and Príncipe"
São Tomé and Príncipe https://en.wikipedia.org/wiki/Demographics_of_São_Tomé_and_Príncipe
São Tomé and Príncipe 217,164 1.48% 28.19 6.2 N/A
Scraping demographic data for "Saudi Arabia"
Saudi Arabia https://en.wikipedia.org/wiki/Demographics_of_Saudi_Arabia
Saudi Arabia 35,950,396 1.49% 14.7 3.4 75.7
Scraping demographic data for "Senegal"
Senegal https://en.wikipedia.org/wiki/Demographics_of_Senegal
Senegal 17,923,036 2.57% 31.51 5.08 69.96
Scraping demographic data for "Serbia"
Serbia https://en.wikipedia.org/wiki/Demographics_of_Serbia
Serbia 6,690,887 −10.9 9.1 20.0 72.7
Scraping demographic data for "Seychelles"
Seychelles https://en.wikipedia.org/wiki/Demograp

Let's check what the data looks like. Remember, if a value is not available on the Wiki page, the function returns 'N/A'. For a lot of the smaller countries, the Wiki pages do not have this data.

In [42]:
final_df[1:]

| | Country | Population | Growth_rate | Birth_rate | Death_rate | Life_expectancy |
|---|---|---|---|---|---|---|
| 1 | Abkhazia | | | | | |
| 2 | Afghanistan | 40870394 | 2.34% | 38.3 | 13.7 | 63.2 |
| 3 | Albania | 2811666 | 0.22% | 9.7 | 10.9 | |
| 4 | Algeria | 44508736 | 1.34% | 18.52 | 4.32 | 78.03 |
| 5 | American Samoa | | | | | |
| ... | ... | ... | ... | ... | ... | ... |
| 186 | Tunisia | 11896972 | 0.69% | 14.62 | 6.36 | 76.82 |
| 187 | Turkey | 85,279,553(31 | 0.55% | 13.3 | | 78.6 |
| 188 | Turkmenistan | 5636011 | 0.99% | 17.51 | 5.95 | 71.83 |
| 189 | Tuvalu | | | | | |
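The scraped values are still strings, and some carry leftovers from the page: thousands separators, percent signs, a Unicode minus sign (as in Lithuania's growth rate), or trailing fragments such as Turkey's `85,279,553(31`. Before any numeric analysis they would need cleaning. The following is a minimal sketch of one way to do that; `parse_number` is a hypothetical helper and its rules are assumptions based only on the values visible in the output above.

```python
import re

def parse_number(raw):
    """Convert a scraped string such as '40,870,394' or '−1.04%' to a float.

    Returns None for missing or unparseable values.
    """
    if raw is None:
        return None
    # Normalise the Unicode minus sign and drop thousands separators
    text = str(raw).replace('\u2212', '-').replace(',', '').strip()
    # Take the leading number and ignore anything after it ('%', '(31', ...)
    match = re.match(r'-?\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None

for raw in ['40,870,394', '2.34%', '\u22121.04%', '85,279,553(31', 'N/A']:
    print(repr(raw), '->', parse_number(raw))
```

Applied to a column, for example with `final_df['Population'].map(parse_number)`, this would give numeric values with `None` where the page had no data.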
