# Web Scraping using Python

Whenever we start a Machine Learning project, the first thing we require is a dataset to work on. While there are many sources where datasets are available, we might want to create a dataset using the data found on a website.

In this notebook, we'll explore how to extract information from Wikipedia and turn it into a dataset that can later be used for Data Analytics and Machine Learning applications.

## Import Libraries

We'll first import all the libraries we need to access a website's HTML and extract information from it.

In [0]:
import numpy as np
import pandas as pd

from urllib.request import urlopen
from bs4 import BeautifulSoup

In [73]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Define functions

First, we define the function getHTMLContent, which accepts a URL and uses the BeautifulSoup library to parse the HTML of the webpage.

In [0]:
def getHTMLContent(link):
    html = urlopen(link)
    soup = BeautifulSoup(html, 'html.parser')
    return soup

## Understand the data

The webpage presents the information we need as an HTML table, so we need to locate that table and extract its contents. However, the page may contain multiple tables, so we need to find the class of the one we want and then access its data.

In [75]:
content = getHTMLContent('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population')
tables = content.find_all('table')
for table in tables:
    print(table.prettify())

<table class="wikitable sortable mw-datatable" style="margin:auto;text-align:right">
 <tbody>
  <tr>
   <th data-sort-type="number">
    Rank
   </th>
   <th>
    Country
    <br/>
    <small>
     (or dependent territory)
    </small>
   </th>
   <th>
    Population
   </th>
   <th>
    % of World
    <p>
     Population
    </p>
   </th>
   <th>
    Date
   </th>
   <th class="unsortable">
    Source
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_P

The table that we will use has the class 'wikitable sortable mw-datatable'. Its first row holds the column headings, and each subsequent row holds the information for one country.
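When a table is looked up by its class attribute, matching the exact class string breaks as soon as the page adds or reorders a class. Here is a minimal sketch of a more tolerant lookup using BeautifulSoup's CSS selector support (`select_one`). The HTML snippet is a made-up stand-in, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for the Wikipedia page.
sample_html = """
<table class="navbox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable mw-datatable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td><a href="/wiki/Demographics_of_China">China</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The selector matches any table that carries both classes,
# regardless of which other classes are listed next to them.
table = soup.select_one("table.wikitable.sortable")
header_cells = [th.get_text(strip=True) for th in table.find_all("th")]
print(header_cells)
```

Like `find`, `select_one` returns `None` when nothing matches, so a real scraper would check the result before using it.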

Next, we explore the website for each country.

In [76]:
# The cell with the country name for each row includes a link to the country webpage on Wikipedia
table = content.find('table', {'class': 'wikitable sortable mw-datatable'})
rows = table.find_all('tr')

# List of all links
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 1:
        country_link = cells[1].find('a')
        print(country_link.get('href'))

/wiki/Demographics_of_China
/wiki/Demographics_of_India
/wiki/Demographics_of_United_States
/wiki/Demographics_of_Indonesia
/wiki/Demographics_of_Pakistan
/wiki/Demographics_of_Brazil
/wiki/Demographics_of_Nigeria
/wiki/Demographics_of_Bangladesh
/wiki/Demographics_of_Russia
/wiki/Demographics_of_Mexico
/wiki/Demographics_of_Japan
/wiki/Demographics_of_Philippines
/wiki/Demographics_of_Egypt
/wiki/Demographics_of_Ethiopia
/wiki/Demographics_of_Vietnam
/wiki/Demographics_of_Democratic_Republic_of_the_Congo
/wiki/Demographics_of_Germany
/wiki/Demographics_of_Iran
/wiki/Demographics_of_Turkey
/wiki/Demographics_of_France
/wiki/Demographics_of_Thailand
/wiki/Demographics_of_United_Kingdom
/wiki/Demographics_of_Italy
/wiki/Demographics_of_South_Africa
/wiki/Demographics_of_Tanzania
/wiki/Demographics_of_Myanmar
/wiki/Demographics_of_South_Korea
/wiki/Demographics_of_Colombia
/wiki/Demographics_of_Kenya
/wiki/Demographics_of_Spain
/wiki/Demographics_of_Argentina
/wiki/Demographics_of_Algeria

Each row has a link to the corresponding country page on Wikipedia. However, the links are relative (the domain is missing), so we have to prepend 'https://en.wikipedia.org' to them. The function below fetches one such page and extracts the area and nominal GDP rows from its infobox table.
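The relative links can also be completed with the standard library rather than by string concatenation. This is a minimal sketch; `BASE_URL` and `to_absolute` are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"  # assumed site root for the relative links

def to_absolute(href):
    """Turn a site-relative href such as '/wiki/...' into a full URL."""
    return urljoin(BASE_URL, href)

print(to_absolute("/wiki/Demographics_of_India"))
```

`urljoin` leaves an already absolute URL unchanged, so the helper is safe to call on every link.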

In [0]:
def getAdditionalDetails(url):
    try:
        country_page = getHTMLContent('https://en.wikipedia.org' + url)
        table = country_page.find('table', {'class': 'infobox'})
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):
                link = tr.find('a')
                if (link and (link.get_text().strip() == 'Area' or
                   (link.get_text().strip() == 'GDP' and tr.find('span').get_text().strip() == '(nominal)'))):
                    read_content = True
                if (link and (link.get_text().strip() == 'Population')):
                    read_content = False
            elif ((tr.get('class') == ['mergedrow'] or tr.get('class') == ['mergedbottomrow']) and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n')) 
                if (tr.find('div').get_text().strip() != '•\xa0Total area' and
                   tr.find('div').get_text().strip() != '•\xa0Total'):
                    read_content = False
        return additional_details
    except Exception as error:
        print('Error occured: {}'.format(error))
        return []

## Create the dataset

Now that we have identified which information needs to be extracted and how, and have compiled the whole process into the function above, we simply iterate over each row of the country list and compile its data. Only countries for which all four additional details were found are kept.

In [78]:
data_content = []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 1:
        print(cells[1].get_text())
        country_link = cells[1].find('a')
        country_info = [cell.text.strip('\n') for cell in cells]
        additional_details = getAdditionalDetails(country_link.get('href'))
        if (len(additional_details) == 4):
            country_info += additional_details
            data_content.append(country_info)

dataset = pd.DataFrame(data_content)

 China[b]
 India
 United States[c]
 Indonesia
Error occured: 'NoneType' object has no attribute 'find_all'
 Pakistan
 Brazil
 Nigeria
Error occured: 'NoneType' object has no attribute 'find_all'
 Bangladesh
 Russia[d]
 Mexico
 Japan
 Philippines
 Egypt
Error occured: 'NoneType' object has no attribute 'find_all'
 Ethiopia
 Vietnam
 DR Congo
Error occured: 'NoneType' object has no attribute 'find_all'
 Germany
 Iran
Error occured: 'NoneType' object has no attribute 'find_all'
 Turkey
 France[e]
Error occured: 'NoneType' object has no attribute 'find_all'
 Thailand
Error occured: 'NoneType' object has no attribute 'find_all'
 United Kingdom[f]
Error occured: 'NoneType' object has no attribute 'find_all'
 Italy
Error occured: 'NoneType' object has no attribute 'find_all'
 South Africa
 Tanzania[g]
 Myanmar
Error occured: 'NoneType' object has no attribute 'find_all'
 South Korea
Error occured: 'NoneType' object has no attribute 'find_all'
 Colombia
 Kenya
Error occured: 'NoneType' object 
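The repeated `'NoneType' object has no attribute 'find_all'` message most likely means that `find` returned `None` because the page has no table with the class `infobox`, and `find_all` was then called on that `None`. Here is a minimal sketch of an explicit check. The two HTML snippets are made up, and `find_infobox` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

def find_infobox(soup):
    """Return the first table whose class list includes 'infobox', or None."""
    return soup.find("table", class_="infobox")

# Two made-up pages: one with an infobox, one without.
with_box = BeautifulSoup(
    '<table class="infobox vcard"><tr><td>x</td></tr></table>', "html.parser"
)
without_box = BeautifulSoup("<p>No summary table here.</p>", "html.parser")

for name, page in [("with", with_box), ("without", without_box)]:
    table = find_infobox(page)
    if table is None:
        # Without this check, table.find_all(...) raises the AttributeError.
        print(name, "-> no infobox, skipping")
    else:
        print(name, "->", len(table.find_all("tr")), "row(s)")
```

Checking for `None` makes the reason for a skipped country explicit instead of relying on the broad `except` clause.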

Our dataset is now compiled but lacks column headers. We now add those headers and drop the columns that add no value.

In [0]:
# Define column headings
headers = rows[0].find_all('th')
headers = [header.get_text().strip('\n') for header in headers]
headers += ['Total Area', 'Percentage Water', 'Total Nominal GDP', 'Per Capita GDP']
dataset.columns = headers

drop_columns = ['Rank', 'Date', 'Source']
dataset.drop(drop_columns, axis = 1, inplace = True)
dataset.sample(3)

dataset.to_csv("/content/drive/My Drive/data science practice/data creation and cleaning/Dataset.csv", index = False)
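Header cells scraped from HTML can contain embedded newlines and stray spaces (the column `% of World\nPopulation` is one example). Here is a minimal sketch of normalising them before they become column names. `clean_header` and the sample list are illustrative only.

```python
def clean_header(text):
    """Collapse newlines and repeated spaces in a header cell into single spaces."""
    return " ".join(text.split())

# Sample header texts; the second one mirrors a heading from the scraped table.
raw_headers = ["Rank", "% of World\nPopulation", "  Date "]
cleaned_headers = [clean_header(h) for h in raw_headers]
print(cleaned_headers)
```

If headers are normalised this way, any later code that refers to `'% of World\nPopulation'` would need to use the cleaned name instead.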

# Data Cleaning

In [0]:
import pandas as pd
import re

In [0]:
data=pd.read_csv('/content/drive/My Drive/data science practice/data creation and cleaning/Dataset.csv')

In [82]:
data.head()

Unnamed: 0,Country(or dependent territory),Population,% of World\nPopulation,Total Area,Percentage Water,Total Nominal GDP,Per Capita GDP
0,Transnistria[q],469000,0.00605%,"4,163 km2 (1,607 sq mi)",2.35,US$1.0 billion,"US$2,000"
1,Northern Cyprus[r],351965,0.00454%,"3,355 km2 (1,295 sq mi) (unranked)",2.7,$4.234 billion[4],"$14,942[5]"
2,Curaçao (Netherlands),158665,0.00205%,444 km2 (171 sq mi),"158,665[3]",US$3.1 billion (149th),"$20,020 (27th)"
3,South Ossetia[u],53532,0.000690%,"3,900 km2 (1,500 sq mi)",negligible,US$0.1 billion,"US$2,000"


In [83]:
data.isnull().sum()

Country(or dependent territory)    0
Population                         0
% of World\nPopulation             0
Total Area                         0
Percentage Water                   0
Total Nominal GDP                  0
Per Capita GDP                     0
dtype: int64

**Rename the columns so that their names describe the data better.**

In [84]:
data.rename(columns={'Country(or dependent territory)': 'Country'}, inplace = True)
data.rename(columns={'% of World\nPopulation': 'Percentage of World Population'}, inplace = True)
data.rename(columns={'Total Area': 'Total Area (km2)'}, inplace = True)
data.head(5)

Unnamed: 0,Country,Population,Percentage of World Population,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Transnistria[q],469000,0.00605%,"4,163 km2 (1,607 sq mi)",2.35,US$1.0 billion,"US$2,000"
1,Northern Cyprus[r],351965,0.00454%,"3,355 km2 (1,295 sq mi) (unranked)",2.7,$4.234 billion[4],"$14,942[5]"
2,Curaçao (Netherlands),158665,0.00205%,444 km2 (171 sq mi),"158,665[3]",US$3.1 billion (149th),"$20,020 (27th)"
3,South Ossetia[u],53532,0.000690%,"3,900 km2 (1,500 sq mi)",negligible,US$0.1 billion,"US$2,000"


In [85]:
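# Remove parenthesised notes such as (1,607 sq mi) and bracketed footnote markers such as [q].
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default.
for column in data.columns:
    data[column] = data[column].str.replace(r"\(.*\)", "", regex=True)
    data[column] = data[column].str.replace(r"\[.*\]", "", regex=True)
data.head(5)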

Unnamed: 0,Country,Population,Percentage of World Population,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Transnistria,469000,0.00605%,"4,163 km2",2.35,US$1.0 billion,"US$2,000"
1,Northern Cyprus,351965,0.00454%,"3,355 km2",2.7,$4.234 billion,"$14,942"
2,Curaçao,158665,0.00205%,444 km2,158665,US$3.1 billion,"$20,020"
3,South Ossetia,53532,0.000690%,"3,900 km2",negligible,US$0.1 billion,"US$2,000"
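The pattern `\(.*\)` is greedy, so on a value with text between two parenthesised groups it would remove everything from the first `(` to the last `)`. Here is a minimal sketch of a stricter variant that removes one group at a time and trims the leftover whitespace. `strip_annotations` is a hypothetical helper, not part of the notebook.

```python
import re

# Negated character classes keep each match inside a single pair of brackets.
PAREN = re.compile(r"\s*\([^()]*\)")
BRACKET = re.compile(r"\s*\[[^\[\]]*\]")

def strip_annotations(value):
    """Remove '(...)' and '[...]' groups and trim surrounding whitespace."""
    value = PAREN.sub("", value)
    value = BRACKET.sub("", value)
    return value.strip()

print(strip_annotations("3,355 km2 (1,295 sq mi) (unranked)"))
print(strip_annotations("$4.234 billion[4]"))
```

The helper works on plain strings, so it could be applied to a text column with `Series.map`.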


In [0]:
# Strip the % sign from the two percentage columns
data['Percentage of World Population'] = data['Percentage of World Population'].str.strip('%')
data['Percentage Water'] = data['Percentage Water'].str.strip('%')
data['Percentage Water'] = data['Percentage Water'].str.strip()
# Remove the thousands separators from the population column
data['Population'] = data['Population'].str.replace(',', '')

In [87]:
data.head()

Unnamed: 0,Country,Population,Percentage of World Population,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Transnistria,469000,0.00605,"4,163 km2",2.35,US$1.0 billion,"US$2,000"
1,Northern Cyprus,351965,0.00454,"3,355 km2",2.7,$4.234 billion,"$14,942"
2,Curaçao,158665,0.00205,444 km2,158665,US$3.1 billion,"$20,020"
3,South Ossetia,53532,0.00069,"3,900 km2",negligible,US$0.1 billion,"US$2,000"


Now, we will explore the area column. The values appear in two units, sq mi and km2, so we need to convert everything to km2.

To convert sq mi to km2, multiply the value by 2.58999.

Some cells contain a range of areas, so we first split the value at '-' and keep the first number. Then, if the cell is in 'sq mi', we multiply the number by 2.58999 and convert it to an integer; otherwise we simply convert it to an integer. The result is saved back to the cell.



In [88]:
import re
data['Total Area (km2)']=data['Total Area (km2)'].str.replace(',','')
for x in range(len(data['Total Area (km2)'])):
    area = data.iloc[x]['Total Area (km2)']
    if ('sq\xa0mi' in area):
        area = area.split('-')[0]
        area = re.sub(r'[^0-9.]+', '', area)
        area = int(float(area) * 2.58999)
    else:
        area = area.split('-')[0]
        area = re.sub(r'[^0-9.]+', '', area)
        area = int(float(area))
    data.iloc[x]['Total Area (km2)'] = area

data.head(5)


Unnamed: 0,Country,Population,Percentage of World Population,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Transnistria,469000,0.00605,41632,2.35,US$1.0 billion,"US$2,000"
1,Northern Cyprus,351965,0.00454,33552,2.7,$4.234 billion,"$14,942"
2,Curaçao,158665,0.00205,4442,158665,US$3.1 billion,"$20,020"
3,South Ossetia,53532,0.00069,39002,negligible,US$0.1 billion,"US$2,000"
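Removing every non-digit character from a value such as `4163 km2` also keeps the `2` of `km2`, which is why the areas in the output above end in an extra 2 (4,163 km2 appears as 41632). Here is a minimal sketch that reads only the leading number. `parse_area_km2` and `SQ_MI_TO_KM2` are hypothetical names, and the conversion factor is the one quoted in this notebook.

```python
import re

SQ_MI_TO_KM2 = 2.58999  # conversion factor used in this notebook

def parse_area_km2(text):
    """Return the area in whole km2 from strings like '4,163 km2' or '1,607 sq mi'."""
    cleaned = text.replace(",", "").replace("\xa0", " ")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)  # first number only, so ranges keep the lower bound
    if match is None:
        return None
    value = float(match.group())
    if "sq mi" in cleaned:
        value *= SQ_MI_TO_KM2
    return int(value)

print(parse_area_km2("4,163 km2"))
print(parse_area_km2("1,607\xa0sq\xa0mi"))
```

Applied as `data['Total Area (km2)'].map(parse_area_km2)`, this assigns the whole column in one step. That also avoids writing through `data.iloc[x][...]`, which may only modify a temporary copy in newer pandas versions.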



Let's analyse the 'Percentage Water' column further. For Algeria, Afghanistan, and some other countries, the value is 'negligible'. To keep these rows rather than drop them, we set these cells to 0.0. Chile's value ends with the character 'b', which needs to be removed. In rows where the value is more than 100, the actual percentage was missing and other content was read instead, so we remove those rows.

In [89]:
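data['Percentage Water'] = data['Percentage Water'].replace('negligible', '0.0')
data['Percentage Water'] = data['Percentage Water'].replace('Negligible', '0.0')
# Keep digits and the decimal point only (regex=True is required in pandas 2.0+)
data['Percentage Water'] = data['Percentage Water'].str.replace(r'[^0-9.]+', '', regex=True)

# Values above 100 are not percentages, so drop those rows
data = data[data['Percentage Water'].astype(float) <= 100]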

data.head(5)

Unnamed: 0,Country,Population,Percentage of World Population,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Transnistria,469000,0.00605,41632,2.35,US$1.0 billion,"US$2,000"
1,Northern Cyprus,351965,0.00454,33552,2.7,$4.234 billion,"$14,942"
3,South Ossetia,53532,0.00069,39002,0.0,US$0.1 billion,"US$2,000"
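The three rules above (treat 'negligible' as 0.0, strip stray characters, reject values above 100) can also be expressed as one function that never raises on malformed input. This is a minimal sketch; `parse_water_percentage` is a hypothetical helper.

```python
import re

def parse_water_percentage(text):
    """Return the water share as a float, or None when the value is unusable."""
    cleaned = text.strip().lower()
    if cleaned == "negligible":
        return 0.0
    digits = re.sub(r"[^0-9.]+", "", cleaned)
    try:
        value = float(digits)
    except ValueError:
        return None  # empty or malformed after cleaning
    return value if value <= 100 else None

for raw in ["2.35", "Negligible", "1.07b", "158665"]:
    print(raw, "->", parse_water_percentage(raw))
```

Rows for which the helper returns `None` could then be dropped, for example with `dropna` after mapping the column.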


The Total Nominal GDP values are expressed in trillions, billions and millions. We remove the '$' sign and convert the words to numbers.



In [0]:
# '$' is a regex metacharacter, so remove it as a literal string
data['Total Nominal GDP'] = data['Total Nominal GDP'].str.replace('$', '', regex=False)

def gdp_to_number(gdp):
    # Scale the numeric part by the magnitude word, if one is present
    if 'trillion' in gdp:
        factor = 1000000000000
    elif 'billion' in gdp:
        factor = 1000000000
    elif 'million' in gdp:
        factor = 1000000
    else:
        factor = 1
    return int(float(re.sub(r'[^0-9.]+', '', gdp)) * factor)

# Assign the whole column at once; writing through data.iloc[x][...] may only change a temporary copy
data['Total Nominal GDP'] = data['Total Nominal GDP'].map(gdp_to_number)

In [91]:
data.head()

Unnamed: 0,Country,Population,Percentage of World Population,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Transnistria,469000,0.00605,41632,2.35,1000000000,"US$2,000"
1,Northern Cyprus,351965,0.00454,33552,2.7,4234000000,"$14,942"
3,South Ossetia,53532,0.00069,39002,0.0,100000000,"US$2,000"


In [0]:
# Rename the two GDP columns so that the unit (USD) is stated in the heading rather than in each value
data.rename(columns={'Total Nominal GDP':'Total Nominal GDP (USD)'},inplace=True)
data.rename(columns={'Per Capita GDP':'Per Capita GDP (USD)'},inplace=True)

In [93]:
# regex=False so that '$' is removed as a literal character
data['Per Capita GDP (USD)']=data['Per Capita GDP (USD)'].str.replace('$','',regex=False)
data['Per Capita GDP (USD)']=data['Per Capita GDP (USD)'].str.replace('US','',regex=False)
data['Per Capita GDP (USD)']=data['Per Capita GDP (USD)'].str.replace(',','',regex=False)
# Convert the whole column at once; writing through data.iloc[i][...] may only change a temporary copy
data['Per Capita GDP (USD)']=data['Per Capita GDP (USD)'].str.strip().astype(int)

data.head()

Unnamed: 0,Country,Population,Percentage of World Population,Total Area (km2),Percentage Water,Total Nominal GDP (USD),Per Capita GDP (USD)
0,Transnistria,469000,0.00605,41632,2.35,1000000000,2000
1,Northern Cyprus,351965,0.00454,33552,2.7,4234000000,14942
3,South Ossetia,53532,0.00069,39002,0.0,100000000,2000
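Columns that were cleaned as strings, such as Population and the two percentage columns, are still text at this point. Here is a minimal sketch of converting such columns to numeric dtypes with `pd.to_numeric`. The two-row frame is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical two-row frame that mimics cleaned columns still stored as text.
frame = pd.DataFrame({
    "Population": ["469000", "351965"],
    "Percentage Water": ["2.35", "negligible"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
for column in ["Population", "Percentage Water"]:
    frame[column] = pd.to_numeric(frame[column], errors="coerce")

print(frame.dtypes)
print(frame)
```

Converting before saving means that numeric operations such as sorting or averaging work directly once the CSV is loaded again.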


In [0]:
data.to_csv('/content/drive/My Drive/data science practice/data creation and cleaning/final_dataset.csv',index=False)