# Wiki Country Demographics Webscraping Project  

* Link to the assignment:  
https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-1-web-scraping-with-python  


* Wikipedia list of all countries by population:  
https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population   
Crawling the links in this table should give me the exact spelling and URL format of each country's demographics page.



* Some Wiki demographics pages, like these, have an easy-to-access table with the info I want to scrape:
    * https://en.wikipedia.org/wiki/Demographics_of_Switzerland
    * https://en.wikipedia.org/wiki/Demographics_of_the_United_States
    * https://en.wikipedia.org/wiki/Demographics_of_Chile  


* Some demographics pages, like this one, have the info in a different format:
    * https://en.wikipedia.org/wiki/Demography_of_Australia   
    

* Some pages, like these, do not have these specific summary tables at all:
    * https://en.wikipedia.org/wiki/Demographics_of_Kenya
    * https://en.wikipedia.org/wiki/Demographics_of_Thailand
    * https://en.wikipedia.org/wiki/Demographics_of_Italy

# Methods of Scraping these Wiki sites

1. Write code to handle the cases where the tables exist.
2. Report the cases where the relevant tables do not exist with error messages.
3. Read the whole table in and build a dictionary of values.
4. Write this out to a CSV file that Pandas can read.
5. Crawl the Wiki list of countries and see how many countries I can retrieve information for.
6. Look at the pages I could not get info for and see if there is some second-order way to scrape some of the data.
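Steps 2, 5 and 6 can be sketched as one loop that keeps track of which pages gave data and which did not. This is only a rough sketch: `crawl_demographics` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever function actually scrapes a page (for example `get_demographics` further down in this notebook).

```python
def crawl_demographics(links, fetch):
    """Call fetch(link) for every link and separate hits from misses."""
    found = {}
    missing = []
    for link in links:
        try:
            info = fetch(link)
        except Exception as exc:  # e.g. a failed HTTP request
            print(f"Could not fetch {link}: {exc}")
            info = None
        if info:
            found[link] = info
        else:
            missing.append(link)  # candidates for a second-order scrape
    return found, missing


# Stand-in fetcher for illustration only; it returns placeholder data.
def fake_fetch(link):
    sample = {'/wiki/Demographics_of_Chile': [{'Population': 'placeholder'}]}
    return sample.get(link)


found, missing = crawl_demographics(
    ['/wiki/Demographics_of_Chile', '/wiki/Demographics_of_Kenya'],
    fake_fetch,
)
print(len(found), "scraped;", len(missing), "need a second pass:", missing)
```

Keeping the list of missing links, rather than only a count, makes it easier to go back and inspect those pages by hand.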

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

# Scraping all the links and info from the list of countries
* Things to improve: use the headers to write the keys for the get_row_values function
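One possible way to do that improvement is to pair the header texts with the cell texts of each row using `zip`. In this sketch `row_to_dict` is a hypothetical helper and the header strings are made up for illustration; the real header text on the Wikipedia table may differ. The cell values are taken from the first row of the CSV preview later in this notebook.

```python
def row_to_dict(headers, cells):
    """Pair each header with the cell in the same column."""
    keys = [h.strip().lower().replace(' ', '_') for h in headers]
    return dict(zip(keys, (c.strip() for c in cells)))


# Hypothetical header text; the actual table headers are an assumption.
headers = ['Rank', 'Country', 'Population', 'Percent of world']
cells = ['1', 'China', '1407562280', '17.9']

row = row_to_dict(headers, cells)
print(row)
```

With this approach the dictionary keys follow the table, so a renamed or reordered column shows up in the data instead of silently landing under the wrong key.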

In [16]:
# now let's get all of the countries from the table
# this is from the table at https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

def get_headers(table_header_row):
    cols = []
    for col in table_header_row:
        cols.append(col.text.strip()) 
    return cols[:4] # we don't want the date or the source

def get_row_values(country): 
    info = {
            'rank': country.th.text.strip(),
            'link': country.a['href'],
            'name': country.a.text.strip(),
            'population': country.find('td', style='text-align:right').text.strip().replace(',',""),
            'percent_of_world': country.find('td', align='right').text.strip()[:-1]  # find(), not find_all(): a ResultSet has no .text
           }
    return info

def build_country_list(rows):
    country_list = []
    for row in rows:
        country_list.append(get_row_values(row))
    return country_list

def scrape_wiki_table():
    world_countries_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    soup = BeautifulSoup(requests.get(world_countries_url).text, 'html.parser')  # name the parser explicitly
    tables = soup.find_all('table', class_='wikitable sortable plainrowheaders')
    rows = tables[0].find_all('tr')
    #headers = get_headers(rows[0].find_all('th'))
    return build_country_list(rows[1:242])

def write_country_csv(path, country_list):
    df = pd.DataFrame(country_list)
    df.to_csv(path, index=False, header=True)

In [17]:
country_list = scrape_wiki_table()
write_country_csv('country_list.csv', country_list)

AttributeError: 'NoneType' object has no attribute 'text'
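This error means that one of the lookups in `get_row_values` returned `None` (for instance `find()` when no matching tag exists) and `.text` was then called on it. Which lookup failed cannot be told from the message alone. A minimal sketch of a guard, assuming a hypothetical helper `safe_text` and a made-up HTML row that has no right-aligned `<td>`:

```python
from bs4 import BeautifulSoup


def safe_text(tag, default=None):
    """Return the stripped text of a tag, or default when the tag is None."""
    if tag is None:
        return default
    return tag.get_text(strip=True)


# Made-up row for illustration: a <th> and a link, but no right-aligned <td>.
html = '<table><tr><th>1</th><td><a href="/wiki/Example">Example</a></td></tr></table>'
row = BeautifulSoup(html, 'html.parser').find('tr')

rank = safe_text(row.th)
population = safe_text(row.find('td', style='text-align:right'))
print(rank, population)
```

Rows where a field comes back as `None` can then be logged or skipped instead of stopping the whole scrape.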

# Now that we have the .csv we scrape the demographics sites

In [4]:
def get_country_page(country_link):
    """
    Takes a link to be appended to the base URL and
    retrieves the HTML from that page.
    
    Returns the BeautifulSoup object
    that contains all the HTML from that page.
    """
    country_url = 'https://en.wikipedia.org' + country_link
    response = requests.get(country_url)
    if response.status_code != 200: 
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + country_url)
    soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly
    return soup

# next get every row that has a <th> tag as the first header
def get_row_data(table_rows):
    """
    function takes in a list of all the html rows of the table
    It then scrapes the data from each row of the table 
    returns a list of dictionaries of key value from each row
   
    If the row does not have a <td> tag then the row is a header break
    and we should stop scraping because rows below are not easily parsable
    """
    info = []
    for row in table_rows[1:]: #skip the first row because it is just the table title
        if row.td is None:  # header break: stop scraping here
            break
        elif row.td.has_attr('class'):  # without a 'class' this is not a row we want
            if row.td['class'][0] == 'infobox-data':  # 'class' is a list, so compare its first element
                info.append({row.th.text: row.td.text})  # one single-key dict per row
    return info

def get_demographics(country_link):
    """
    Takes the country link for Wiki.
    Returns a list of single-key dictionaries with the items from the info table.
    Returns None if there is no table or if it is not the table we want.
    """
    soup = get_country_page(country_link)  # retrieves the BeautifulSoup page
    table = soup.find('table', class_='infobox')  # first table with the 'infobox' class
    if table:
        # accept only an infobox with both attributes
        # (an earlier idea was to also check table['style'] == 'width: 25em')
        if table.has_attr('style') and table.has_attr('class'):
            table_rows = table.find_all('tr')
            info = get_row_data(table_rows)
            return info
    return None

In [5]:
country_info = get_demographics('/wiki/Demographics_of_the_United_States')
country_info

[{'Population': '308,401,808\n(2010 Census[a]) (3rd)\n\n•\xa0Estimate\xa0329,484,123 (pre-2020 Census) (3rd)'},
 {'Density': '86.16/sq\xa0mi (33.27/km2)'},
 {'Growth rate': ' 0.72% (2020)[1]'},
 {'Birth rate': '11.6 births/1,000 population (2020)[1]'},
 {'Death rate': '8.9 deaths/1,000 population (2020)[1]'},
 {'Life expectancy': '77.8 years (2020)[2]'},
 {'\xa0•\xa0male': '75.1 years[2]'},
 {'\xa0•\xa0female': '80.5 years[2]'},
 {'Fertility rate': '1.706 children born/woman (2019)[3]'},
 {'Net migration rate': '3 migrant(s)/1,000 population (2020)[1]'}]

In [6]:
# this removes the bullet from the male/female keys in the dictionaries
def parse_sex(country_info):
    for index, item in enumerate(country_info): 
        for key in item:
            if 'male' in key: # it could be male or female... 
                if 'female' in key:
                    country_info[index] = {'Female': item[key]} 
                else: 
                    country_info[index] = {'Male': item[key]}

In [7]:
parse_sex(country_info)
country_info

[{'Population': '308,401,808\n(2010 Census[a]) (3rd)\n\n•\xa0Estimate\xa0329,484,123 (pre-2020 Census) (3rd)'},
 {'Density': '86.16/sq\xa0mi (33.27/km2)'},
 {'Growth rate': ' 0.72% (2020)[1]'},
 {'Birth rate': '11.6 births/1,000 population (2020)[1]'},
 {'Death rate': '8.9 deaths/1,000 population (2020)[1]'},
 {'Life expectancy': '77.8 years (2020)[2]'},
 {'Male': '75.1 years[2]'},
 {'Female': '80.5 years[2]'},
 {'Fertility rate': '1.706 children born/woman (2019)[3]'},
 {'Net migration rate': '3 migrant(s)/1,000 population (2020)[1]'}]

# now let us read the links in the .csv and get all the country demos we can

In [8]:
country_df = pd.read_csv('country_list.csv')
country_df

Unnamed: 0,rank,link,name,population,percent_of_world
0,1,/wiki/Demographics_of_China,China,1407562280,17.9
1,2,/wiki/Demographics_of_India,India,1375875687,17.5
2,3,/wiki/Demographics_of_the_United_States,United States,331520958,4.22
3,4,/wiki/Demographics_of_Indonesia,Indonesia,271350000,3.45
4,5,/wiki/Demographics_of_Pakistan,Pakistan,225200000,2.86
...,...,...,...,...,...
236,–,/wiki/Demographics_of_Niue,Niue,1549,0
237,–,/wiki/Demographics_of_Tokelau,Tokelau,1501,0
238,195,/wiki/Demographics_of_Vatican_City,Vatican City,825,0
239,–,/wiki/Demographics_of_Cocos_(Keeling)_Islands,Cocos (Keeling) Islands,573,0


In [9]:
demo_links = country_df['link']

count = 0
for link in demo_links:
    temp = get_demographics(link)
    if temp is None:
        count += 1
    print(temp)
    
    

[{'Population': '1,427,647,786\n (2018 data) (1st)'}, {'Growth rate': ' 0.59% (2016 est.) (159th)'}, {'Birth rate': '11.673 births per 1,000 (2019 est.)'}, {'Death rate': '7.261 deaths per 1,000 (2019)'}, {'Life expectancy': '76.5 years (2019)'}, {'\xa0•\xa0male': '74.4 years (2018)'}, {'\xa0•\xa0female': '78.6 years (2018)'}, {'Fertility rate': '1.5 children per woman (2018)'}, {'Infant mortality rate': '9.595 deaths per 1000 live births (2019)'}]
[{'Population': ' 1,390,885,000\n(April, 2021 est.)[1]'}, {'Density': '500 people per.sq.km (2011 est.)'}, {'Growth rate': '1.2% (2020 est.)[2]'}, {'Birth rate': '18.2 births/1,000 population (2020 est.)[2]'}, {'Death rate': '7.3 deaths/1,000 population (2020 est.)[2]'}, {'Life expectancy': '69.7 years (2020 est.)[2]'}, {'\xa0•\xa0male': '68.4 years (2020 est.)[2]'}, {'\xa0•\xa0female': '71.2 years (2020 est.)[2]'}, {'Fertility rate': '2.17 children born/woman (2017)[3]'}, {'Infant mortality rate': '29.94 deaths/1,000 live births (2018)[4]'}

None
None
None
None
None
[{'Population': '33,091,113'}, {'Density': '15.322 people per sq. km of land (2017)[1]'}, {'Growth rate': '1.49% (2019 [2]'}, {'Birth rate': '14.7 births/1,000 population (2018) [3]'}, {'Death rate': '3.4 deaths/1,000 population'}, {'Life expectancy': '75.7 years'}, {'\xa0•\xa0male': '74.2 years'}, {'\xa0•\xa0female': '77.3 years'}, {'Fertility rate': '1.95 children born/woman (2020) [4]'}, {'Net migration rate': '590,000 (2017)[5]'}]
None
[{'Population': '37,466,414 (2021)[1]'}, {'Growth rate': '2.34% (2016)'}, {'Birth rate': '38.3 births/1,000 population (2016)'}, {'Death rate': '13.7 deaths/1,000 population (2016)'}, {'Life expectancy': '63.2 years (2019)[2][3]'}, {'\xa0•\xa0male': '63.3 years'}, {'\xa0•\xa0female': '63.2 years'}, {'Fertility rate': '5.33 children born/woman (2015)'}, {'Infant mortality rate': '66.3 deaths/1,000 live births[4]'}]
None
None
None
None
None
None
None
None
None
None
[{'Population': '25.55 million (2018)'}, {'Density': '199.54 in

None
None
[{'Population': '8,570,146 (30 June 2019 est.)[1]'}, {'Density': '208/km2 (48th)539/sq mi'}, {'Growth rate': '0.75% (2019 est.)'}, {'Birth rate': '10.5 births/1,000 population (2015 est.)'}, {'Death rate': '8.13 deaths/1,000 population (2015 est.)'}, {'Life expectancy': '83.8 years'}, {'\xa0•\xa0male': '81.9 years'}, {'\xa0•\xa0female': '85.6 years [2]'}, {'Fertility rate': '1.53 children born/woman (2019 est.)'}, {'Infant mortality rate': '3.67 deaths/1,000 live births'}, {'Net migration rate': '4.74 migrant(s)/1,000 population (2015 est.)[3]'}]
[{'Population': '7,650,150'}, {'Density': '80.06 inhabitants per sq km.'}, {'Growth rate': '15.40% (2004–2014 est.)'}, {'Birth rate': '37.40 births/1,000 inhabitants'}, {'Death rate': '11.03 deaths/1,000 inhabitants'}, {'Life expectancy': '57.39 years'}, {'\xa0•\xa0male': '54.85 years'}, {'\xa0•\xa0female': '60.00 years'}, {'Fertility rate': '4.2 children born/women'}, {'Infant mortality rate': '73.29 deaths/1,000 births'}]
None
None

None
None
None
None
None
None
None
[{'Population': '994,017 (2019)'}, {'Growth rate': '2.23% (2014)'}, {'Birth rate': '25.27 births/1,000 population (2011 est.)'}, {'Death rate': '8.23 deaths/1,000 population (July 2011 est.)'}, {'Life expectancy': '62.4 years (2014)'}, {'\xa0•\xa0male': '59.93 years'}, {'\xa0•\xa0female': '64.94 years'}, {'Fertility rate': '2.79 children born/woman (2010)'}, {'Infant mortality rate': '53.31 deaths/1,000 infants (2012 est.)[1]'}]
[{'Population': '884,887'}, {'Density': '49.4/km2'}, {'Birth rate': '22.5 (2017 est.)'}, {'Death rate': '8.10 (2017 est.)'}, {'Life expectancy': '72.1 (2014 est.)'}, {'\xa0•\xa0male': '65.4'}, {'\xa0•\xa0female': '68.5'}, {'Fertility rate': '2.9 (2017 est.)'}, {'Infant mortality rate': '12.5 (2017 est.)'}]
None
None
None
None
None
None
None
None
None
None
None
None
[{'Status': 'Unrecognised stateRecognised by the United Nations as de jure part of Moldova'}, {'Capitaland largest city': 'Tiraspol'}, {'Official\xa0languages': 'Ru

None
None
[{'Sovereign state': 'United Kingdom'}, {'First settlement': '1764'}, {'British rule reasserted': '3 January 1833'}, {'Falklands War': '2 April to14 June 1982'}, {'Current constitution': '1 January 2009'}, {'Capitaland largest settlement': 'Stanley51°42′S 57°51′W\ufeff / \ufeff51.700°S 57.850°W\ufeff / -51.700; -57.850'}, {'Official languages': 'English'}, {'Demonym(s)': 'Falkland Islander, Falklander'}, {'Government': 'Devolved parliamentary dependency under a constitutional monarchy'}, {'•\xa0Monarch ': 'Elizabeth II'}, {'•\xa0Governor ': 'Nigel Phillips'}, {'•\xa0Chief Executive ': 'Barry Rowland'}, {'Legislature': 'Legislative Assembly'}]
[{'Sovereign state': 'Australia'}, {'Proclamation of British sovereignty(Annexation)': '6 June 1888'}, {'Transferred from Singaporeto Australia': '1 October 1958'}, {'Capitaland largest city': 'Flying Fish Cove("The Settlement")10°25′18″S 105°40′41″E\ufeff / \ufeff10.42167°S 105.67806°E\ufeff / -10.42167; 105.67806'}, {'Official language

# now let us write this out to a csv and then read it back in with Pandas
* Question: can I somehow just convert the list to a pandas DataFrame?  
* Is there a way to put things into a Jupyter notebook that are searchable?
* Is it possible to put local links in a Jupyter notebook? For instance, I want to build a TOC with links to the sections of the notebook.
* Is it standard to extract the key (or class) names from the data, or to opt for writing out a friendlier version without spaces or special characters?
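On the first and last questions, one possible approach is to merge each country's list of single-key dictionaries into one dictionary with friendlier keys. This is only a sketch: `friendly_key` and `merge_rows` are hypothetical helper names, and the sample rows are copied from the United States output above.

```python
import re


def friendly_key(key):
    """Lower-case a label and replace runs of other characters with '_'."""
    return re.sub(r'[^0-9a-z]+', '_', key.lower()).strip('_')


def merge_rows(rows):
    """Collapse a list of single-key dicts into one dict with friendly keys."""
    merged = {}
    for row in rows:
        for key, value in row.items():
            merged[friendly_key(key)] = value.strip()
    return merged


sample = [
    {'Growth rate': ' 0.72% (2020)[1]'},
    {'Life expectancy': '77.8 years (2020)[2]'},
    {'\xa0•\xa0male': '75.1 years[2]'},
]
record = merge_rows(sample)
print(record)
```

A list of such merged dictionaries, one per country, can be passed straight to `pd.DataFrame(...)`, the same way `write_country_csv` already does for the country list.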
