**Jovian Project #1 Web Scraping**  
**Author: Samantha Roberts**  
**Date: April 21, 2021**

# Webscraping Wikipedia Country Demographics Pages

I am very interested in geography and learning statistics about other countries in the world.  I frequently use Wikipedia to quickly get this information.  **In this project I will show how I used Python programming libraries (Beautiful Soup, Requests, Pandas) to gather demographic information about the many countries of the world by scraping Wikipedia web pages.**

Wikipedia is a collaborative, publicly editable online encyclopedia.  It is the largest and most-read reference work in history, and consistently one of the 15 most-visited websites ([Ref](https://www.economist.com/international/2021/01/09/wikipedia-is-20-and-its-reputation-has-never-been-higher)). Though in theory anyone can edit a wiki page, some light governance is in place that suggests how certain pages should be structured.  One way this is done is via [WikiProjects](https://en.wikipedia.org/wiki/Wikipedia:WikiProject), where a group of contributors works together as a team to improve a topic, often by standardizing it to some extent.  ["Demographics"](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Demographics) is one such WikiProject.  Under this project, each country has a demographics page, and many of these pages have a summary table displayed in the right-hand column of the page.  This table contains information on things like population breakdown by race, sex, religion, population density, birth and death rate, language, government type and so forth. **It is this demographics summary table that this project attempts to find and scrape for each country of the world.**

<img src="https://i.imgur.com/T9OASLH.png" width="80%">


## Two Principal Challenges:
1. Though there is a suggested standard for how the individual country demographics pages should be laid out, I quickly found that it was not followed consistently across the more than 220 pages for the countries and territories of the world.  
2. As Wikipedia is publicly editable ([1.9 pages per second are edited!](https://en.wikipedia.org/wiki/Wikipedia:Statistics#:~:text=It%20may%20reflect%20varying%20levels%20of%20consensus%20and%20vetting.&text=While%20you%20read%20this%2C%20Wikipedia,598%20new%20articles%20per%20day.)), the scraping algorithm would sometimes stop working because a page had been modified since I last ran it.  

**I will show how I dealt with the challenges of the non-standardized Wiki format, and how I was able to automate the scraping of 80 individual Wikipedia country pages.**

## Project Steps:

1. **Scrape the main demographics page to build a list of countries and their Wiki-links**
    - Using the Python Requests and Beautiful Soup libraries we will scrape the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population) that lists the links to all country wiki pages
    - Parsing the HTML data we will make a list of dictionaries, one dictionary per country, each containing the country name and the link to its individual page
    - Using Pandas, we will write this data to a CSV file  


2. **Scrape individual country pages for their demographics**  

    - Read the countries and links into a dataframe using Pandas
    - Crawl the links in the CSV file and see how many country pages have the table I am trying to scrape
    - Write code specific to the majority of demographics page tables to scrape the data I want
    - Handle instances where the relevant tables do not exist in the demographics page  


3. **Clean the data scraped**  

    - Look at the data in Pandas and see if there is some 1st order way to clean the data further
    - Write the demographics info to a CSV file

***
<h1><center>OK - Let's Get Started!</center></h1>

# (Step 1)  Build a list of countries and their relative Wikipedia links

1. Scrape https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
2. Put the rows of the HTML table into a list of dictionaries, one per country
3. Convert the list to a Pandas DataFrame and write it to a CSV file
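
To make steps 1 and 2 concrete before the full code, here is a minimal sketch that parses a tiny hand-written HTML table into a list of dictionaries. The sample HTML is only a stand-in (not the real Wikipedia markup), and `rows_to_dicts` is a hypothetical helper used for illustration; it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption); the real page is larger and changes over time.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><th>1</th><td><a href="/wiki/Demographics_of_China">China</a></td><td>1,407,809,120</td></tr>
  <tr><th>2</th><td><a href="/wiki/Demographics_of_India">India</a></td><td>1,376,560,090</td></tr>
</table>
"""

def rows_to_dicts(html):
    """Hypothetical helper: turn each data row of the sample table into a dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", class_="wikitable").find_all("tr")
    records = []
    for row in rows[1:]:  # skip the header row
        records.append({
            "rank": row.th.text.strip(),
            "link": row.a["href"],
            "name": row.a.text.strip(),
            # the last cell holds the population; drop the thousands separators
            "population": row.find_all("td")[-1].text.strip().replace(",", ""),
        })
    return records

records = rows_to_dicts(SAMPLE_HTML)
print(records)
```

Each row becomes one dictionary, which is the same shape the notebook functions below produce for the real table.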

<img src="https://i.imgur.com/OzJpHZ7.png" width="80%">

### Below are all the Python functions necessary to scrape the main demographics page, select the main table, build the dictionary, and write this data to a CSV file.

In [1]:
# now let's get all of the countries from the table
# this is from the table at https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_headers(table_header_row):
    """
    Takes a HTML table header
    returns a list of strings of the headers
    """
    cols = []
    for col in table_header_row:
        cols.append(col.text.strip()) 
    return cols[:4] # we dont want the date or the source references

def get_row_values(country): 
    """
    takes in HTML for a row of table data
    returns a dictionary of that data
    """
    info = {
            'rank': country.th.text.strip(),
            'link': country.a['href'],
            'name': country.a.text.strip(),
            'population': country.find('td', style='text-align:right').text.strip().replace(',',""),
            'percent_of_world': country.find_all('td', style='text-align:right')[1].text \
                                .replace('"', '').strip().strip('%')
            }
    return info

def build_country_list(rows):
    """
    take in a beautiful soup object containing rows of an HTML table
    returns a list of dictionaries each containing the data from each row
    """
    country_list = []
    for row in rows:
        country_list.append(get_row_values(row))
    return country_list

def scrape_wiki_table():
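    """
    scrapes the main table from the Wikipedia list of countries by population
    returns a list of dictionaries, one per country
    """
    world_countries_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(requests.get(world_countries_url).text, 'html.parser')
    tables = soup.find_all('table', class_='wikitable sortable plainrowheaders')
    rows = tables[0].find_all('tr')
    #headers = get_headers(rows[0].find_all('th'))
    return build_country_list(rows[1:242]) # skip the header row and keep the 241 country rows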

def write_csv(path, list_of_dicts):
    """
    writes a list of dicts to a csv
    """
    df = pd.DataFrame(list_of_dicts)
    df.to_csv(path, index=False, header=True)

In [2]:
country_list = scrape_wiki_table()
write_csv('country_list.csv', country_list)

### We have scraped information for 241 countries and dependencies  

* We created a list that contains the data scraped for each country in a dictionary  
* Each country dictionary contains the following information:  

   * rank by population
   * relative link on the wikipedia site
   * country name
   * population
   * percent of the world
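
Because the stored links are relative to the Wikipedia site, they need to be joined onto the site's base URL before they can be requested. A small sketch of that step using the standard library is shown here; `BASE_URL` and `absolute_url` are names chosen for this illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

def absolute_url(relative_link):
    """Join a site-relative link such as '/wiki/...' onto the Wikipedia base URL."""
    return urljoin(BASE_URL, relative_link)

print(absolute_url("/wiki/Demographics_of_China"))
```

Plain string concatenation (as used later in this notebook) gives the same result for links that start with `/wiki/`.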

In [3]:
len(country_list)

241

In [4]:
country_list[:]

[{'rank': '1',
  'link': '/wiki/Demographics_of_China',
  'name': 'China',
  'population': '1407809120',
  'percent_of_world': '17.9'},
 {'rank': '2',
  'link': '/wiki/Demographics_of_India',
  'name': 'India',
  'population': '1376560090',
  'percent_of_world': '17.5'},
 {'rank': '3',
  'link': '/wiki/Demographics_of_the_United_States',
  'name': 'United States',
  'population': '331616914',
  'percent_of_world': '4.22'},
 {'rank': '4',
  'link': '/wiki/Demographics_of_Indonesia',
  'name': 'Indonesia',
  'population': '271350000',
  'percent_of_world': '3.45'},
 {'rank': '5',
  'link': '/wiki/Demographics_of_Pakistan',
  'name': 'Pakistan',
  'population': '225200000',
  'percent_of_world': '2.86'},
 {'rank': '6',
  'link': '/wiki/Demographics_of_Brazil',
  'name': 'Brazil',
  'population': '213093087',
  'percent_of_world': '2.71'},
 {'rank': '7',
  'link': '/wiki/Demographics_of_Nigeria',
  'name': 'Nigeria',
  'population': '211401000',
  'percent_of_world': '2.69'},
 {'rank': '8'

In [5]:
country_list_df = pd.DataFrame(country_list)
country_list_df

Unnamed: 0,rank,link,name,population,percent_of_world
0,1,/wiki/Demographics_of_China,China,1407809120,17.9
1,2,/wiki/Demographics_of_India,India,1376560090,17.5
2,3,/wiki/Demographics_of_the_United_States,United States,331616914,4.22
3,4,/wiki/Demographics_of_Indonesia,Indonesia,271350000,3.45
4,5,/wiki/Demographics_of_Pakistan,Pakistan,225200000,2.86
...,...,...,...,...,...
236,–,/wiki/Demographics_of_Niue,Niue,1549,0
237,–,/wiki/Demographics_of_Tokelau,Tokelau,1501,0
238,195,/wiki/Demographics_of_Vatican_City,Vatican City,825,0
239,–,/wiki/Demographics_of_Cocos_(Keeling)_Islands,Cocos (Keeling) Islands,573,0
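
The scraped values are all strings, and dependencies show a dash instead of a rank. If numeric columns are wanted, one possible approach (a sketch on made-up sample rows, not part of the original pipeline) is `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Small sample in the same shape as the scraped data (values for illustration only)
sample = pd.DataFrame({
    "rank": ["1", "2", "–"],
    "name": ["China", "India", "Niue"],
    "population": ["1407809120", "1376560090", "1549"],
})

# errors="coerce" turns non-numeric entries such as "–" into NaN
sample["rank"] = pd.to_numeric(sample["rank"], errors="coerce")
sample["population"] = pd.to_numeric(sample["population"])

print(sample.dtypes)
```

After this conversion the unranked dependencies can be found with `sample["rank"].isna()`.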


# (Step 2) Use the relative links to scrape the country sites

1. Read the country_list csv into a Pandas DataFrame
2. Visit the link of each country and scrape the tables
3. If that page has an information table **in the "correct" format**, retrieve the table rows
4. For each country, build a dictionary of demographics  

<img src="https://i.imgur.com/T9OASLH.png" width="80%">


### The python functions below read in the country CSV, go to each link, scrape the individual country page, and put this information into a list of dictionaries


In [6]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_country_page(country_link):
    """
    takes in a country demographics URL
    returns beautiful soup object of the HTML page
    """
    country_url = 'https://en.wikipedia.org' + country_link
    response = requests.get(country_url)
    if response.status_code != 200: 
        raise Exception('Failed to fetch web page ' + country_url)
    soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly
    return soup

# next get every row that has a <th> tag as the first header
def get_row_data(table_rows):
    """
    takes in a list of all the html rows of the table
    returns a dictionary of all key values from each row
    """
    info = {}
    for row in table_rows[1:]:    #skip the first row because it is just the table title
        if row.td is None:        #then it is a header and we stop scraping
            break  
        elif row.td.has_attr('class'):                 #if no 'class' this is not a row we want
            if (row.td['class'][0] == 'infobox-data'): #'class' is a list of values so get the [0]th element
                info[row.th.text] = row.td.text        #add the key:value pair to the dictionary
    return info

def get_demographics(country_link):
    """
    takes in the country link for demographics Wiki
    returns a dictionary with the items from the info table
    """
    soup = get_country_page(country_link)          #retrieves the Beautiful soup Page
    table = soup.find('table', class_='infobox')   #finds the first infobox table in the page
    if table:
        if table.has_attr('style') and table.has_attr('class'):
            table_rows = table.find_all('tr')
            info = get_row_data(table_rows)
            return info
    return   #None if country link does not contain the table we want

def build_country_dictionaries(csv_path):
    """
    takes in a csv file with country demographics links
    returns a list of dictionaries
    each dictionary contains the scraped data for a specific country
    """
    country_df = pd.read_csv(csv_path)
    demo_links = country_df['link']
    table = []
    for link in demo_links:
        temp = get_demographics(link)
        if temp is None: # the page did not have a table formatted to fit my scraping code
            continue
        else: #then we got some data from a table -- add that plus country name to the dictionary list
            temp['Country'] = country_df.loc[country_df['link'] == link, 'name'].item()
            table.append(temp)
    return table 
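
One thing to note about the loop above is that a single failed request raises an exception and ends the whole crawl. The sketch below shows one possible way to keep going and to pause between requests. It is an illustration, not the code used in this project: `crawl` and `fake_fetch` are hypothetical names, and the fetch function is passed in so the example runs offline.

```python
import time

def crawl(links, fetch, delay_seconds=1.0):
    """
    Hypothetical helper: call fetch(link) for every link, pausing between
    requests and recording failures instead of stopping at the first one.
    """
    results, failures = {}, {}
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay_seconds)  # be polite to the server
        try:
            results[link] = fetch(link)
        except Exception as err:  # broad on purpose: keep the crawl going
            failures[link] = str(err)
    return results, failures

# Stand-in fetcher so the sketch runs without network access
def fake_fetch(link):
    if "Missing" in link:
        raise ValueError("Failed to fetch web page " + link)
    return {"Country": link.rsplit("_", 1)[-1]}

results, failures = crawl(
    ["/wiki/Demographics_of_China", "/wiki/Demographics_of_Missing"],
    fake_fetch,
    delay_seconds=0,
)
print(results)
print(failures)
```

In this notebook, `get_demographics` could be passed as the `fetch` argument, and the `failures` dictionary would then list the pages that need a second look.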

### Let's look at all the information we get

In [7]:
raw_country_dicts = build_country_dictionaries('country_list.csv')

### This looks like the information we are interested in...

In [8]:
raw_country_dicts[5]

{'Population': '144,386,830 (excluding Crimea),[1]  146,748,590 (including Crimea)[1]',
 'Life expectancy': ' 73.34 years (2019)[2]',
 '\xa0•\xa0male': '67.75 years (2018)[3]',
 '\xa0•\xa0female': '77.82 years (2018)[3]',
 'Fertility rate': ' 1.507 (2019)[4]',
 'Infant mortality rate': '4.9 deaths/1,000 live births (2019)[5]',
 'Net migration rate': '1.69 migrant(s)/1,000 population (2014)',
 'Country': 'Russia'}

### ... But not all country data contains the same information

In [9]:
raw_country_dicts[47]

{'Population': '4,761,865 (2016 census)',
 'Density': '68 per km2',
 'Growth rate': '1.77%',
 'Birth rate': '13.7 births/1,000 population',
 'Death rate': '6.5 deaths/1,000 population',
 'Life expectancy': '80.19 years',
 '\xa0•\xa0male': '78 years',
 '\xa0•\xa0female': '82.6 years',
 'Fertility rate': '1.91 children born/woman',
 'Infant mortality rate': '3.85 deaths/1,000 live births',
 'Net migration rate': '0.86 migrant(s)/1,000 population',
 'Country': 'Ireland'}

### Some countries did not have any of the info we expected in the table

In [10]:
raw_country_dicts[44]

{'Country': 'Nicaragua'}

### Pages for small territories had the table but did not have the standard country information at all

In [11]:
raw_country_dicts[79]

{'Sovereign state': 'Australia',
 'Annexed by the United Kingdom': '1857',
 'Transferred from Singaporeto Australia': '23 November 1955',
 'Capital': 'West Island12°11′13″S 96°49′42″E\ufeff / \ufeff12.18694°S 96.82833°E\ufeff / -12.18694; 96.82833',
 'Largest village': 'Bantam (Home Island)',
 'Official languages': 'None',
 'Spoken languages': 'MalayEnglish[a]',
 'Government': 'Directly administered dependency',
 '•\xa0Monarch ': 'Elizabeth II',
 '•\xa0Governor-General ': 'David Hurley',
 '•\xa0Administrator ': 'Natasha Griggs',
 '•\xa0Shire President ': 'Seri Wati Iku',
 'Country': 'Cocos (Keeling) Islands'}

### To look at this a bit more quantitatively we put everything into a Pandas DataFrame

In [12]:
raw_df = pd.DataFrame(raw_country_dicts)
raw_df.head()

Unnamed: 0,Population,Growth rate,Birth rate,Death rate,Life expectancy,• male,• female,Fertility rate,Infant mortality rate,Country,...,British colony,Assigned to New Zealand,New Zealand sovereignty,• Ulu-o-Tokelau,• Sovereign entity,• Sovereign,• Secretary of State,• President of the Governorate,Annexed by the United Kingdom,Largest village
0,"1,407,562,280\n(1st)",0.59% (2016 est.) (159th),"11.673 births per 1,000 (2019 est.)","7.261 deaths per 1,000 (2019)",76.5 years (2019),74.4 years (2018),78.6 years (2018),1.5 children per woman (2018),9.595 deaths per 1000 live births (2019),China,...,,,,,,,,,,
1,"1,390,885,000\n(April, 2021 est.)[1]",1.2% (2020 est.)[2],"18.2 births/1,000 population (2020 est.)[2]","7.3 deaths/1,000 population (2020 est.)[2]",69.7 years (2020 est.)[2],68.4 years (2020 est.)[2],71.2 years (2020 est.)[2],2.17 children born/woman (2017)[3],"29.94 deaths/1,000 live births (2018)[4]",India,...,,,,,,,,,,
2,"308,401,808\n(2010 Census[a]) (3rd)\n\n• Estim...",0.72% (2020)[1],"11.6 births/1,000 population (2020)[1]","8.9 deaths/1,000 population (2020)[1]",77.8 years (2020)[2],75.1 years[2],80.5 years[2],1.706 children born/woman (2019)[3],,United States,...,,,,,,,,,,
3,"222,903,998 (2020) [1]",2.00 (2021)[2],"29.8 births / 1,000 population (2016)[2]","7.5 deaths / 1,000 population (2016)[2]",70.0 years (2020)[3],69.9 years (2020)[3],70.1 years(2020)[3],3.56 children born / woman (2016)[3],"53.86 deaths / 1,000 live births (2016)[3]",Pakistan,...,,,,,,,,,,
4,"161,376,708 (2018 est.)[1][2]",1.01% (2020 est.)[3],"17.71 births/1,000 population (2020 est.)[3]","5.54 deaths/1,000 population (2020 est.)[3]",72.72 years (2020)[3],71.1 years,74.4 years,2.00 children born/woman (2020 est.)[3],"24.73 deaths/1,000 live births (2020 est.)[3]",Bangladesh,...,,,,,,,,,,


### There are 92 different headings (!) for 80 countries scraped

In [13]:
len(raw_df.columns.values), len(raw_country_dicts)

(92, 80)
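
With this many headings, it helps to know how many countries share each one before deciding which to keep. A minimal sketch using `collections.Counter` on a few sample dictionaries (the sample values are abbreviated from the outputs above, and `key_frequencies` is a hypothetical helper):

```python
from collections import Counter

def key_frequencies(country_dicts):
    """Count in how many dictionaries each key appears."""
    counts = Counter()
    for country in country_dicts:
        counts.update(country.keys())
    return counts

sample_dicts = [
    {"Population": "144,386,830", "Life expectancy": "73.34 years", "Country": "Russia"},
    {"Population": "4,761,865", "Density": "68 per km2", "Country": "Ireland"},
    {"Country": "Nicaragua"},
]

counts = key_frequencies(sample_dicts)
for key, count in counts.most_common():
    print(key, count)
```

Running the same helper on `raw_country_dicts` would rank all 92 headings by how many of the 80 countries contain them.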

### We are not interested in those which are country specific

In [14]:
#These are some values we are not interested in sorting through
raw_df.columns.values[40:90] 

array(['Federation', 'Separate colony', 'Capital', 'Largest city',
       'Ethnic\xa0groups ', '•\xa0Deputy Governor ',
       '•\xa0Prime Minister of the UK ', '•\xa0Premier ',
       'Partition of island', 'Separated from Guadeloupe',
       '•\xa0President of France ', '•\xa0Prefect ',
       '•\xa0President of theTerritorial Council ', 'Autonomy granted',
       'First Regional Assembly (Autonomy Day)', 'EU accession',
       'Colony established', 'Swedish purchase', 'Returned to France',
       'Collectivity status',
       '•\xa0President of the Territorial Council ', '•\xa0Deputy ',
       '•\xa0Senator ', 'First settlement', 'British rule reasserted',
       'Falklands War', 'Current constitution',
       'Capitaland largest settlement', '•\xa0Chief Executive ',
       'Proclamation of British sovereignty(Annexation)',
       'Transferred from Singaporeto Australia', 'Spoken languages',
       '•\xa0Governor-General ', '•\xa0Administrator ',
       '•\xa0Shire President ', 'Sep

### We see that we have scraped a variety of information, and the keys span 92 different categories
* Many of the categories have irregular formatting  
* Only 80 of the 241 pages had a table in the format our code could scrape
* Some countries do not have any of the parameters other than the country name
* Some tables describe territories and/or contain data that is not general across all countries

# (Step 3) Cleaning the data 

### We select the data that is of interest and that is shared with the most countries

1. Some dictionary keys have special characters in them and need to be modified -- so we will add new key:value pairs for these and delete the old ones
2. We will then select a subset of 10 key:value pairs from the country data and keep only the countries that have more than three of them (the country name counts as one)  

  * Population 
  * Growth rate 
  * Birth rate
  * Death rate
  * Life expectancy
  * Fertility rate
  * Infant mortality rate
  * Country
  * male 
  * female   


3. For the above keys, we will strip all text and references and percentage signs and commas
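
To illustrate step 3 on a few raw values taken from the outputs above, here is a small sketch that pulls the first number out of each string with a regular expression. This is an alternative technique shown for clarity, not the chained `split` approach used in the notebook function below; `NUMBER_PATTERN` and `first_number` are hypothetical names.

```python
import re

# One digit, then any digits or thousands separators, then an optional decimal part
NUMBER_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")

def first_number(text):
    """Return the first number in text as a string without commas, or None."""
    match = NUMBER_PATTERN.search(text)
    if match is None:
        return None
    return match.group(0).replace(",", "")

samples = [
    "1,407,562,280\n(1st)",
    "0.59% (2016 est.) (159th)",
    "18.2 births/1,000 population (2020 est.)[2]",
    " 73.34 years (2019)[2]",
]
cleaned = [first_number(s) for s in samples]
print(cleaned)
```

Note that a value beginning with a year in parentheses would return that year, so the cleaned output still needs a visual check.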

### Below are the Python functions needed to clean the data that was scraped

In [15]:
def strip_key_bullets(country_dicts):
    """
    Takes in a list of dictionaries and 
    strips bullet characters from the keys by creating a new k:v pair
    and popping the unwanted k:v pair
    """
    vals = '•', '\xa0'
    for country in country_dicts:
        # iterate over a snapshot of the keys so the dict can be modified safely
        # (copying the outer list is not enough: the copy holds the same dict objects)
        for key in list(country):
            if vals[0] in key or vals[1] in key:
                new_key = key.replace(vals[0], '').replace(vals[1], '')
                country[new_key] = country.pop(key)
    return country_dicts

def select_keys(country_dicts):
    """ 
    copies the wanted elements from each dictionary
    returns the new list of dicts
    """
    keys_to_keep = ('Population', 
                   'Growth rate', 
                   'Birth rate', 
                   'Death rate', 
                   'Life expectancy', 
                   'Fertility rate', 
                   'Infant mortality rate', 
                   'Country', 
                   'male', 
                   'female', 
                   )
    newlist = []
    for country in country_dicts:
        tempdict = {key: country[key] for key in keys_to_keep if country.get(key)}
        if len(tempdict) > 3:         # then it has enough of the entries to make it worth parsing
            newlist.append(tempdict)
    return newlist

def strip_n_split_values(country_dicts):
    """
    Keeps only the first whitespace-separated token of each value and strips
    references such as [1], units after '/', parenthesized notes,
    percent signs, and commas
    """
    for country in country_dicts:
        for key in country:
            if key.lower() == 'country' or country[key] == '': # don't want to split country names
                continue
            else:
                country[key] = country[key].split()[0].split('[')[0].split('/')[0]. \
                split('(')[0].strip().strip('‰%').replace(',', '')
    return country_dicts
    

### This line of code below calls the cleaning functions that do almost everything we need

In [16]:
newdicts = strip_n_split_values(select_keys(strip_key_bullets(raw_country_dicts)))
write_csv('country_demographics.csv', newdicts)

In [17]:
df = pd.DataFrame(newdicts)
df.head()

Unnamed: 0,Population,Growth rate,Birth rate,Death rate,Life expectancy,Fertility rate,Infant mortality rate,Country,male,female
0,1407562280,0.59,11.673,7.261,76.5,1.5,9.595,China,74.4,78.6
1,1390885000,1.2,18.2,7.3,69.7,2.17,29.94,India,68.4,71.2
2,308401808,0.72,11.6,8.9,77.8,1.706,,United States,75.1,80.5
3,222903998,2.0,29.8,7.5,70.0,3.56,53.86,Pakistan,69.9,70.1
4,161376708,1.01,17.71,5.54,72.72,2.0,24.73,Bangladesh,71.1,74.4


**We see that 56 of the 80 sites scraped have data in the categories we selected**

In [18]:
len(newdicts)

56

### Now we want to find information on where all the NaNs are located

In [19]:
# sum the NaNs in every column
sum_nulls = pd.isnull(df).sum()
sum_nulls

Population               1
Growth rate              4
Birth rate               1
Death rate               2
Life expectancy          2
Fertility rate           2
Infant mortality rate    8
Country                  0
male                     4
female                   4
dtype: int64

In [20]:
#find the rows (ie countries) that contain NaNs
rows_with_nan = [index for index, row in df.iterrows() if row.isnull().any()]
rows_with_nan, len(rows_with_nan)

([2, 5, 10, 12, 13, 15, 18, 25, 26, 30, 41, 44, 46, 53, 54], 15)
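
The same rows can also be found without an explicit loop by building a boolean mask. The sketch below uses a small made-up DataFrame rather than the scraped one, so the names and values are for illustration only.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Country": ["China", "United States", "Russia"],
    "Growth rate": ["0.59", "0.72", None],
    "Infant mortality rate": ["9.595", None, "4.9"],
})

# Boolean mask: True for rows with at least one missing value
has_nan = sample_df.isnull().any(axis=1)
nan_rows = sample_df.index[has_nan].tolist()

print(nan_rows)
print(sample_df[has_nan])
```

Indexing with the mask (`sample_df[has_nan]`) returns the matching rows directly, which avoids `iterrows()` on larger tables.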

### Here is a DataFrame of all rows that contain NaNs

In [21]:
df.iloc[rows_with_nan, :]

Unnamed: 0,Population,Growth rate,Birth rate,Death rate,Life expectancy,Fertility rate,Infant mortality rate,Country,male,female
2,308401808.0,0.72,11.6,8.9,77.8,1.706,,United States,75.1,80.5
5,144386830.0,,,,73.34,1.507,4.9,Russia,67.75,77.82
10,83614362.0,0.55,13.0,,78.6,1.73,8.6,Turkey,75.9,81.3
12,67399000.0,,11.0,9.8,82.2,1.84,3.8,France,79.2,85.2
13,66796807.0,,11.0,9.3,81.0,1.65,,United Kingdom,,
15,51049498.0,0.8,18.9,5.8,79.0,1.8,,Colombia,76.0,83.0
18,33091113.0,1.49,14.7,3.4,75.7,1.95,,Saudi Arabia,74.2,77.3
25,17474677.0,0.29,9.8,8.8,82.1,1.574,,Netherlands,80.4,83.8
26,,1.63,22.5,6.0,69.6,2.66,24.0,Cambodia,67.3,71.6
30,2015.0,2.05,23.9,3.4,74.8,2.7,,Jordan,73.4,76.3


### OK... We will stop here, save the files, and summarize what we have accomplished in this project

In [13]:
# let us attach and save the files
import jovian
jovian.commit(filename='webscraping-wiki-country-demographics.ipynb', project='webscraping-wiki-country-demographics', files=['country_demographics.csv','country_list.csv'])

[jovian] Updating notebook "samantha-roberts/webscraping-wiki-country-demographics" on https://jovian.ai/
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/samantha-roberts/webscraping-wiki-country-demographics


'https://jovian.ai/samantha-roberts/webscraping-wiki-country-demographics'

***
<h1><center>Analysis of our Analysis</center></h1>

### We have used the Requests and Beautiful Soup libraries and Pandas DataFrames to do the following:

1. We scraped a table on Wikipedia to get a list of countries and the links to their demographics pages
2. We wrote these 241 entries to a CSV file (attached)
3. We scraped those 241 pages for a specific style of table with a specific format
4. We found that 80 demographics pages had tables that this code could scrape
5. We then cleaned the data and found that 56 country pages contained the information we were trying to tabulate
6. We read this information into a Pandas DataFrame to look at it further
7. We wrote this data to a CSV file (also attached)

It seems that scraping a wiki site can be more difficult than scraping a site that has a definite standard way of formatting the information in their HTML pages.  Though WikiProjects seeks to standardize this, the table we were scraping only existed in 80 of the 241 pages.  Additionally, even where the table itself existed, it contained data variations in the content and the formatting of the data.  This meant that we had to do some cleaning of the data to be able to create a meaningful CSV file of some key demographics information.  

Furthermore, due to people frequently editing the pages, the code for scraping the sites which contained the list of all the countries stopped working and the links have even changed.  This is a risk of all web scrapers but perhaps something to be more concerned with when dealing with Wiki specifically.  
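
One way to notice this kind of breakage early is to run a few basic checks on the scraped rows before using them. The following is only a sketch under assumptions: `REQUIRED_KEYS`, `find_bad_rows`, and the `/wiki/` prefix rule are choices made for this illustration, not checks that were part of the project.

```python
REQUIRED_KEYS = ("rank", "link", "name", "population")

def find_bad_rows(country_list, min_rows=1):
    """
    Hypothetical check: return a list of problems found in the scraped rows.
    An empty list means the data passed these basic checks.
    """
    problems = []
    if len(country_list) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(country_list)}")
    for i, row in enumerate(country_list):
        missing = [key for key in REQUIRED_KEYS if not row.get(key)]
        if missing:
            problems.append(f"row {i}: missing {missing}")
        elif not row["link"].startswith("/wiki/"):
            problems.append(f"row {i}: unexpected link {row['link']!r}")
    return problems

good = {"rank": "1", "link": "/wiki/Demographics_of_China", "name": "China", "population": "1407809120"}
bad = {"rank": "2", "link": "/wiki/Demographics_of_India", "name": "India", "population": ""}

print(find_bad_rows([good]))
print(find_bad_rows([good, bad]))
```

If the page layout changes, a check like this reports the affected rows instead of letting empty or misplaced values flow into the CSV file.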

This project demonstrates both the usefulness and the limitations of the webscraping method of acquiring data from the web.  

Overall -- a fun project!  And now I know what web scraping actually is!

## Future Work
- Learn how to use the [Wikidata SPARQL](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial) service to query and retrieve structured data directly, without web scraping 
- A [wikidata query tutorial](https://rozhon.com/sheets-for-marketers/how-to-extract-data-from-wikipedia-and-wikidata/) by Robin Rozhon
- [Another tutorial](https://janakiev.com/blog/wikidata-mayors/) on querying Wikidata with Python and SPARQL by Janakiev
    - This is the one I found most useful (in my opinion :)
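
As a first look at what such a query might involve, the sketch below builds (but does not send) a request URL for the public Wikidata SPARQL endpoint. The query text is an untested assumption about how country populations could be requested; it uses the Wikidata identifiers P31 ("instance of"), Q6256 ("country"), and P1082 ("population").

```python
from urllib.parse import urlencode

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# P31 = "instance of", Q6256 = "country", P1082 = "population"
QUERY = """
SELECT ?countryLabel ?population WHERE {
  ?country wdt:P31 wd:Q6256 ;
           wdt:P1082 ?population .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
"""

def build_query_url(query):
    """Encode the query into a GET URL that asks for JSON results."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

url = build_query_url(QUERY)
print(url[:80])
```

The resulting URL could then be fetched with Requests, as in the rest of this notebook, and the JSON response loaded into a Pandas DataFrame.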

## Acknowledgements and References
- I would like to thank the [Jovian](https://jovian.ai/) team for teaching me everything I know to date about Data Science 
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Links to Wikipedia demographics showing various formats:
**Not all Wiki demographics pages are formatted the same. Note the examples below**

* This contains a Wiki table listing of all countries in order of population:  
    - [List_of_countries_and_dependencies_by_population](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population) 
    - We can crawl the links in this table to reach each of the country demographics webpages    


* Some Wiki demographics pages like these have an easy-to-access table with the info I want to scrape
    * [Demographics_of_Switzerland](https://en.wikipedia.org/wiki/Demographics_of_Switzerland)
    * [Demographics_of_the_United_States](https://en.wikipedia.org/wiki/Demographics_of_the_United_States)
    * [Demographics_of_Chile](https://en.wikipedia.org/wiki/Demographics_of_Chile)  


* Some Demographics pages like these have the info in a different format:
    * [Demography_of_Australia](https://en.wikipedia.org/wiki/Demography_of_Australia)   
    

* Some pages like these do not have these specific summary tables at all:
    * [Demographics_of_Kenya](https://en.wikipedia.org/wiki/Demographics_of_Kenya)
    * [Demographics_of_Thailand](https://en.wikipedia.org/wiki/Demographics_of_Thailand)
    * [Demographics_of_Italy](https://en.wikipedia.org/wiki/Demographics_of_Italy)