# Web Scraping a Wikipedia Page

- Here we use web scraping to build a DataFrame of country names and their populations.
- Our source of information is:

    https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## Creating the Soup

We use the Beautiful Soup Python library to do the web scraping.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
source = requests.get(url)
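
The request above assumes the download succeeds. A minimal sketch of a more defensive fetch is shown below; the helper name `fetch_html`, the `USER_AGENT` string, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

# Hypothetical identifier string; replace it with your own project and contact details.
USER_AGENT = "population-table-tutorial/0.1 (contact: you@example.com)"


def fetch_html(url, timeout=10, http=requests):
    """Return the page body as text, raising on HTTP error statuses.

    `http` is anything with a requests-style `get` method (the `requests`
    module itself or a `requests.Session`), which also makes the helper
    easy to exercise without network access.
    """
    response = http.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `BeautifulSoup(fetch_html(url), "html.parser")` would replace the two-step `requests.get` and `source.text` pattern, and a failed download would stop the notebook instead of producing a soup built from an error page.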

In [3]:
soup = BeautifulSoup(source.text,"html.parser")
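
Note that `soup.table` returns the first `<table>` in the document. On this page that happened to be the population table, but a layout change could put another table first. The sketch below, run on a made-up HTML snippet (`SAMPLE_HTML` is an assumption for illustration), shows how selecting by CSS class may be a safer choice.

```python
from bs4 import BeautifulSoup

# Made-up markup: a decorative table followed by the data table.
SAMPLE_HTML = """
<table class="infobox"><tr><td>not the data</td></tr></table>
<table class="sortable wikitable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><th>1</th><td>China</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_table = sample_soup.table                             # first <table>, whatever it is
data_table = sample_soup.find("table", class_="wikitable")  # first table carrying that class

print(first_table["class"])  # ['infobox']
print(data_table["class"])   # ['sortable', 'wikitable']
```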

In [4]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of countries and dependencies by population - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"95793c86-876f-4e5b-95b6-351975b6e641","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_and_dependencies_by_population","wgTitle":"List of countries and dependencies by population","wgCurRevisionId":1029857572,"wgRevisionId":1029857572,"wgArticleId":69058,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Indonesian-language sources (id)","CS1 Arabic-language sour

In [5]:
soup.table

<table class="sortable wikitable" style="text-align:right">
<tbody><tr>
<th data-sort-type="number">Rank</th>
<th>Country or dependent territory</th>
<th>Population</th>
<th>% of world</th>
<th>Date</th>
<th>Source (official or United Nations)
</th></tr>
<tr>
<th>1
</th>
<td style="text-align:left"><span class="flagicon" style="display:inline-block;width:25px"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span></span> <a h

## Creating Initial Dataframe

Using the soup and pandas, we extract the table from the Wikipedia page. It contains all the information we need for our DataFrame.

In [6]:
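blank_list = []
for i in soup.find('tbody').select('tr'):
    try:
        text = i.text.split('\n')
        text.remove('')  # drop the first empty string produced by the split
        blank_list.append(text)
    except ValueError:  # remove() found no empty string; skip this row
        pass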

In [7]:
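for i in blank_list:
    i.remove('')  # drop one more empty string from every row
df = pd.DataFrame(blank_list[:-1])  # leave out the last row of the table
df.columns = df.loc[0,:]  # the first row holds the column names
df.dropna(axis=1,inplace=True)  # drop columns that contain missing values
df.drop(0,axis=0,inplace=True)  # drop the header row from the data
#df.drop('Rank',axis=1,inplace=True)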
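
Splitting each row's text on newlines works here, but it depends on where the page happens to place line breaks. As a rough alternative, the sketch below reads each row cell by cell. The helper `table_to_frame` and the sample markup are hypothetical, and the sketch assumes every row has the same number of cells as the header.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure of the Wikipedia table.
SAMPLE_HTML = """
<table class="sortable wikitable">
<tbody><tr>
<th>Rank</th>
<th>Country or dependent territory</th>
<th>Population</th>
</tr>
<tr>
<th>1
</th>
<td><span class="flagicon"></span> <a href="#">China</a> <sup>[b]</sup>
</td>
<td>1411778724
</td></tr>
<tr>
<th>2
</th>
<td><a href="#">India</a>
</td>
<td>1378492522
</td></tr>
</tbody></table>
"""


def table_to_frame(table):
    """Build a DataFrame from a <table> tag, one list of cell texts per row."""
    rows = []
    for tr in table.find_all("tr"):
        # get_text(" ", strip=True) joins nested text pieces and trims whitespace.
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    header, *body = rows  # first row is the header, the rest is data
    return pd.DataFrame(body, columns=header)


sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", class_="wikitable")
sample_df = table_to_frame(sample_table)
print(sample_df)
```

Because each cell is read on its own, no empty strings need to be removed afterwards.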

In [8]:
df

| | Rank | Country or dependent territory | Population | % of world | Date | Source (official or United Nations) |
|---|---|---|---|---|---|---|
| 1 | 1 | China† [b] | 1411778724 | 17.9% | 1 Nov 2020 | 2020 census result[3] |
| 2 | 2 | India† [c] | 1378492522 | 17.5% | 22 Jun 2021 | National population clock[4] |
| 3 | 3 | United States† [d] | 331887849 | 4.21% | 22 Jun 2021 | National population clock[5] |
| 4 | 4 | Indonesia† | 271350000 | 3.45% | 31 Dec 2020 | National annual estimate[6] |
| 5 | 5 | Pakistan† [e] | 225200000 | 2.86% | 1 Jul 2021 | UN projection[2] |
| ... | ... | ... | ... | ... | ... | ... |
| 237 | – | Niue† (NZ) | 1549 | 0% | 1 Jul 2021 | National annual projection[91] |
| 238 | – | Tokelau† (NZ) | 1501 | 0% | 1 Jul 2021 | National annual projection[91] |
| 239 | 195 | Vatican City† [ad] | 825 | 0% | 1 Feb 2019 | Monthly national estimate[190] |
| 240 | – | Cocos (Keeling) Islands† (Australia) | 573 | 0% | 30 Jun 2020 | National annual estimate[189] |


## Data Cleaning

Some of the text needs cleaning before the data can be analyzed properly. Here we remove the extra characters (the † marker and the bracketed footnote references) from the DataFrame we have built.

In [9]:
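# Remove the dagger and a trailing single-letter footnote such as " [b]"; longer markers such as "[ad]" are left in place.
df['Country or dependent territory'] = [i[:-5].replace('†','') if i[-3]=='[' else i.replace('†','') for i in df['Country or dependent territory']]
# Keep only the text before the first '[' (split leaves the string intact when there is no '[').
df['Source (official or United Nations)'] = [i.split('[')[0] for i in df['Source (official or United Nations)']]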

In [10]:
df

| | Rank | Country or dependent territory | Population | % of world | Date | Source (official or United Nations) |
|---|---|---|---|---|---|---|
| 1 | 1 | China | 1411778724 | 17.9% | 1 Nov 2020 | 2020 census result |
| 2 | 2 | India | 1378492522 | 17.5% | 22 Jun 2021 | National population clock |
| 3 | 3 | United States | 331887849 | 4.21% | 22 Jun 2021 | National population clock |
| 4 | 4 | Indonesia | 271350000 | 3.45% | 31 Dec 2020 | National annual estimate |
| 5 | 5 | Pakistan | 225200000 | 2.86% | 1 Jul 2021 | UN projection |
| ... | ... | ... | ... | ... | ... | ... |
| 237 | – | Niue (NZ) | 1549 | 0% | 1 Jul 2021 | National annual projection |
| 238 | – | Tokelau (NZ) | 1501 | 0% | 1 Jul 2021 | National annual projection |
| 239 | 195 | Vatican City [ad] | 825 | 0% | 1 Feb 2019 | Monthly national estimate |
| 240 | – | Cocos (Keeling) Islands (Australia) | 573 | 0% | 30 Jun 2020 | National annual estimate |
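
The output above still shows `Vatican City [ad]`, because the slicing only handles single-letter footnotes. One possible alternative is a regular expression that removes bracketed markers of any length. The sketch below uses a small hand-made Series and a hypothetical helper `strip_markers`; it is not the code that produced the output above.

```python
import pandas as pd


def strip_markers(series):
    """Remove bracketed footnotes such as [b], [ad] or [190], and the dagger sign."""
    return (
        series.str.replace(r"\[[^\]]*\]", "", regex=True)  # bracketed markers of any length
              .str.replace("†", "", regex=False)           # the dagger marker
              .str.strip()                                   # leftover surrounding spaces
    )


names = pd.Series(["China† [b]", "Vatican City† [ad]", "Niue† (NZ)", "Indonesia†"])
sources = pd.Series(["2020 census result[3]", "Monthly national estimate[190]"])

print(strip_markers(names).tolist())
# ['China', 'Vatican City', 'Niue (NZ)', 'Indonesia']
print(strip_markers(sources).tolist())
# ['2020 census result', 'Monthly national estimate']
```

Parenthesised qualifiers such as `(NZ)` are kept on purpose, since they carry information about the territory.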


## Adding Flag Images

- Finally, we extract the image links from the same page and add each country's flag to our DataFrame.

In [11]:
first_table = soup.find_all(class_ = 'wikitable')[0]
table_img = first_table.find_all('img')

In [12]:
df['flags'] = ["https:"+i['src'].strip() for i in table_img]
df

| | Rank | Country or dependent territory | Population | % of world | Date | Source (official or United Nations) | flags |
|---|---|---|---|---|---|---|---|
| 1 | 1 | China | 1411778724 | 17.9% | 1 Nov 2020 | 2020 census result | https://upload.wikimedia.org/wikipedia/commons... |
| 2 | 2 | India | 1378492522 | 17.5% | 22 Jun 2021 | National population clock | https://upload.wikimedia.org/wikipedia/en/thum... |
| 3 | 3 | United States | 331887849 | 4.21% | 22 Jun 2021 | National population clock | https://upload.wikimedia.org/wikipedia/en/thum... |
| 4 | 4 | Indonesia | 271350000 | 3.45% | 31 Dec 2020 | National annual estimate | https://upload.wikimedia.org/wikipedia/commons... |
| 5 | 5 | Pakistan | 225200000 | 2.86% | 1 Jul 2021 | UN projection | https://upload.wikimedia.org/wikipedia/commons... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 237 | – | Niue (NZ) | 1549 | 0% | 1 Jul 2021 | National annual projection | https://upload.wikimedia.org/wikipedia/commons... |
| 238 | – | Tokelau (NZ) | 1501 | 0% | 1 Jul 2021 | National annual projection | https://upload.wikimedia.org/wikipedia/commons... |
| 239 | 195 | Vatican City [ad] | 825 | 0% | 1 Feb 2019 | Monthly national estimate | https://upload.wikimedia.org/wikipedia/commons... |
| 240 | – | Cocos (Keeling) Islands (Australia) | 573 | 0% | 30 Jun 2020 | National annual estimate | https://upload.wikimedia.org/wikipedia/commons... |
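
Assigning the list of all `<img>` tags to the column relies on the table containing exactly one image per data row, in row order. If a row had no flag, the lengths would no longer match. A hedged sketch of a per-row lookup is given below; the helper `flag_urls` and the `example.org` markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup in which the second row has no flag image.
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Rank</th><th>Country</th></tr>
<tr><th>1</th><td><img src="//upload.example.org/flag_a.png"/> A-land</td></tr>
<tr><th>2</th><td>B-land</td></tr>
<tr><th>3</th><td><img src="//upload.example.org/flag_c.png"/> C-land</td></tr>
</table>
"""


def flag_urls(table):
    """Return one flag URL (or None) per data row, so the list stays aligned with the rows."""
    urls = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        img = tr.find("img")             # first image in this row, if any
        if img is not None:
            urls.append("https:" + img["src"].strip())
        else:
            urls.append(None)
    return urls


sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find(class_="wikitable")
print(flag_urls(sample_table))
# ['https://upload.example.org/flag_a.png', None, 'https://upload.example.org/flag_c.png']
```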


In [13]:
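# Reorder the columns; .copy() avoids a SettingWithCopyWarning when the column is modified later.
df = df[['Rank','flags','Country or dependent territory','Population','% of world','Date','Source (official or United Nations)']].copy()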

In [14]:
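from IPython.display import HTML  # the IPython.core.display import path is deprecated
# Wrap each URL in an <img> tag so the flag is rendered inside the table.
df['flags'] = ['<img src="'+ str(x) + '" width="50" >' for x in df['flags']]
HTML(df.to_html(escape=False))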

| | Rank | flags | Country or dependent territory | Population | % of world | Date | Source (official or United Nations) |
|---|---|---|---|---|---|---|---|
| 1 | 1 | | China | 1411778724 | 17.9% | 1 Nov 2020 | 2020 census result |
| 2 | 2 | | India | 1378492522 | 17.5% | 22 Jun 2021 | National population clock |
| 3 | 3 | | United States | 331887849 | 4.21% | 22 Jun 2021 | National population clock |
| 4 | 4 | | Indonesia | 271350000 | 3.45% | 31 Dec 2020 | National annual estimate |
| 5 | 5 | | Pakistan | 225200000 | 2.86% | 1 Jul 2021 | UN projection |
| 6 | 6 | | Brazil | 213304911 | 2.71% | 22 Jun 2021 | National population clock |
| 7 | 7 | | Nigeria | 211401000 | 2.68% | 1 Jul 2021 | UN projection |
| 8 | 8 | | Bangladesh | 170881116 | 2.17% | 22 Jun 2021 | National population clock |
| 9 | 9 | | Russia | 146171015 | 1.86% | 1 Jan 2021 | National annual estimate |
| 10 | 10 | | Mexico | 126014024 | 1.60% | 2 Mar 2020 | 2020 census result |
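
Overwriting the `flags` column with HTML strings means the plain URLs are no longer available for other uses. As an alternative sketch, `DataFrame.to_html` accepts a `formatters` mapping that builds the `<img>` tags only at render time. The helper name `render_with_flags` and the demo data are assumptions, and the `option_context` guard is included in case the installed pandas version truncates long cell strings.

```python
import pandas as pd


def render_with_flags(frame, column="flags", width=50):
    """Return an HTML table in which `column` is shown as images; `frame` is not modified."""

    def as_img(url):
        return f'<img src="{url}" width="{width}">'

    # max_colwidth=None keeps pandas from shortening long cell strings.
    with pd.option_context("display.max_colwidth", None):
        return frame.to_html(escape=False, formatters={column: as_img})


demo = pd.DataFrame(
    {"flags": ["https://upload.example.org/flag_a.png"], "Country": ["A-land"]}
)
html = render_with_flags(demo)
print(html)
```

In a notebook, wrapping the result as `HTML(render_with_flags(df))` would display the table, while `df['flags']` keeps holding the raw URLs.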


## Final DataFrame

- After the whole process, our final DataFrame looks like this:

In [15]:
HTML(df.to_html(escape=False))

| | Rank | flags | Country or dependent territory | Population | % of world | Date | Source (official or United Nations) |
|---|---|---|---|---|---|---|---|
| 1 | 1 | | China | 1411778724 | 17.9% | 1 Nov 2020 | 2020 census result |
| 2 | 2 | | India | 1378492522 | 17.5% | 22 Jun 2021 | National population clock |
| 3 | 3 | | United States | 331887849 | 4.21% | 22 Jun 2021 | National population clock |
| 4 | 4 | | Indonesia | 271350000 | 3.45% | 31 Dec 2020 | National annual estimate |
| 5 | 5 | | Pakistan | 225200000 | 2.86% | 1 Jul 2021 | UN projection |
| 6 | 6 | | Brazil | 213304911 | 2.71% | 22 Jun 2021 | National population clock |
| 7 | 7 | | Nigeria | 211401000 | 2.68% | 1 Jul 2021 | UN projection |
| 8 | 8 | | Bangladesh | 170881116 | 2.17% | 22 Jun 2021 | National population clock |
| 9 | 9 | | Russia | 146171015 | 1.86% | 1 Jan 2021 | National annual estimate |
| 10 | 10 | | Mexico | 126014024 | 1.60% | 2 Mar 2020 | 2020 census result |
