# Web scraping on Scrape this Site 

## Countries of the World 

Web scraping is the extraction of data from a website for analysis or for building a model that supports business decisions. In this project I scrape the "Countries of the World" page on Scrape This Site. The data to extract are each country's name, capital, population, and area in km². I use the requests library to fetch the page from its URL and BeautifulSoup to parse the HTML so the fields are easy to extract.

In [569]:
#importing libraries
import pandas as pd 
from bs4 import BeautifulSoup
import requests


In [570]:
url = 'https://www.scrapethissite.com/pages/simple/'

In [571]:
response = requests.get(url)

In [572]:
soup = BeautifulSoup(response.content,  'html.parser')
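The fetch and parse steps above can be wrapped in two small helpers that fail loudly on a bad response. This is only a minimal sketch: the names `fetch_html` and `parse_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page
    response.raise_for_status()
    return response.text


def parse_html(html):
    """Parse HTML text with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")
```

With these helpers, `soup = parse_html(fetch_html(url))` would stand in for the two cells above.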

## Data Extraction



I will start by extracting the title of the HTML page that I am about to scrape.

In [575]:
#Extracting the title using find 
title =  soup.find('title').get_text()
print(title)

Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping


I am going to use a loop to extract the data for each country and build a DataFrame.

In [577]:
countries = soup.find_all('div', class_ = 'col-md-4 country')
countries

[<div class="col-md-4 country">
 <h3 class="country-name">
 <i class="flag-icon flag-icon-ad"></i>
                             Andorra
                         </h3>
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
 <strong>Population:</strong> <span class="country-population">84000</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
 </div>
 </div>,
 <div class="col-md-4 country">
 <h3 class="country-name">
 <i class="flag-icon flag-icon-ae"></i>
                             United Arab Emirates
                         </h3>
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
 <strong>Population:</strong> <span class="country-population">4975593</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
 </div>
 </div>,
 <div class="col-md-4 country">
 <h3 class="country-name">
 

In [578]:
#empty list to store the extracted data
country_names = []
capitals = []
population = []
area_km2 = []

In [579]:
for country in countries:
    country_name = country.find('h3', class_  = "country-name").get_text()
    capital = country.find('div', class_ ='country-info').find('span', class_ = "country-capital").get_text()
    pop = country.find('span', class_ = 'country-population').get_text()
    area = country.find('span', class_ = 'country-area').get_text()
    
    #appending to the lists created above
    country_names.append(country_name)
    capitals.append(capital)
    population.append(pop)
    area_km2.append(area)

#Creating the data frame once, after the loop has finished
df = pd.DataFrame({"Country": country_names, "Capital_City":capitals, "Population":population, "Area_km2": area_km2})
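The same extraction could also be written by collecting one dictionary per country and building the DataFrame from that list. The sketch below runs on a small inline HTML sample that mirrors the structure shown earlier; `SAMPLE_HTML` and `extract_countries` are hypothetical names. It also passes `strip=True` to `get_text`, which trims the whitespace around the country name at extraction time, a change from the loop above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline sample mirroring the page structure (assumption: one country only)
SAMPLE_HTML = """
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ad"></i>
    Andorra
  </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
    <strong>Population:</strong> <span class="country-population">84000</span><br/>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
  </div>
</div>
"""


def extract_countries(soup):
    """Return one DataFrame row per country block (hypothetical helper)."""
    rows = []
    for country in soup.find_all("div", class_="country"):
        rows.append({
            # strip=True removes the newlines and padding around the name
            "Country": country.find("h3", class_="country-name").get_text(strip=True),
            "Capital_City": country.find("span", class_="country-capital").get_text(strip=True),
            "Population": country.find("span", class_="country-population").get_text(strip=True),
            "Area_km2": country.find("span", class_="country-area").get_text(strip=True),
        })
    return pd.DataFrame(rows)


sample_df = extract_countries(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(sample_df)
```

All four columns still hold strings at this point, so the type conversion step is needed either way.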
    

In [580]:
df.head()

                  Country      Capital_City  Population  Area_km2
0      \n\n Andorra\n ...  Andorra la Vella       84000     468.0
1  \n\n United Arab Em...         Abu Dhabi     4975593   82880.0
2  \n\n Afghanistan\n ...             Kabul    29121286  647500.0
3  \n\n Antigua and Ba...        St. John's       86754     443.0
4     \n\n Anguilla\n ...        The Valley       13254     102.0


The Country column needs cleaning: each name is padded with newlines and whitespace.

## Data Cleaning 

In [582]:
#removing newlines and surrounding whitespace
df['Country'] = df['Country'].str.replace('\n', '').str.strip()
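As a side note, `Series.str.strip()` called without arguments removes all leading and trailing whitespace, newlines included, so it would handle this padding on its own; the `replace` call only matters for a newline inside a name. A small sketch on made-up sample strings shaped like the scraped values:

```python
import pandas as pd

# Sample values imitating the scraped names (illustrative, not the real column)
raw = pd.Series(["\n\n    Andorra\n  ", "\n\n    United Arab Emirates\n  "])

# strip() with no arguments trims spaces, tabs and newlines from both ends
cleaned = raw.str.strip()
print(cleaned.tolist())
```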

In [583]:
df.head()

                Country      Capital_City  Population  Area_km2
0               Andorra  Andorra la Vella       84000     468.0
1  United Arab Emirates         Abu Dhabi     4975593   82880.0
2           Afghanistan             Kabul    29121286  647500.0
3   Antigua and Barbuda        St. John's       86754     443.0
4              Anguilla        The Valley       13254     102.0


In [584]:
#checking the shape
df.shape

(250, 4)

In [585]:
#checking null values
df.isna().sum()

Country         0
Capital_City    0
Population      0
Area_km2        0
dtype: int64

In [586]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Country       250 non-null    object
 1   Capital_City  250 non-null    object
 2   Population    250 non-null    object
 3   Area_km2      250 non-null    object
dtypes: object(4)
memory usage: 7.9+ KB


In [587]:
#changing the data types (assign back so the conversion persists)
df['Population'] = df['Population'].astype(int)
df['Population']

0         84000
1       4975593
2      29121286
3         86754
4         13254
         ...   
245    23495361
246      159042
247    49000000
248    13460305
249    11651858
Name: Population, Length: 250, dtype: int32

In [588]:
df["Area_km2"] = df["Area_km2"].astype(float)
df["Area_km2"]

0          468.0
1        82880.0
2       647500.0
3          443.0
4          102.0
         ...    
245     527970.0
246        374.0
247    1219912.0
248     752614.0
249     390580.0
Name: Area_km2, Length: 250, dtype: float64
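`astype` raises an error as soon as one value cannot be converted. When a scraped column might contain stray text, `pd.to_numeric` with `errors="coerce"` is a more forgiving option, since it turns unparseable values into `NaN` that can be inspected afterwards. The sketch below uses a made-up frame; the `"n/a"` entry is invented for illustration and does not come from the scraped data.

```python
import pandas as pd

# Made-up sample; "n/a" is an invented bad value for illustration only
sample = pd.DataFrame({
    "Population": ["84000", "4975593", "n/a"],
    "Area_km2": ["468.0", "82880.0", "102.0"],
})

# Unparseable strings become NaN instead of raising an error
sample["Population"] = pd.to_numeric(sample["Population"], errors="coerce")
sample["Area_km2"] = pd.to_numeric(sample["Area_km2"], errors="coerce")

# Rows that failed to convert can then be reviewed before analysis
bad_rows = sample[sample["Population"].isna()]
print(bad_rows)
```

A column that contains `NaN` ends up as float, so Population would need the bad rows fixed or dropped before converting it to an integer type.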

In [589]:
df.duplicated().sum()

0

In [590]:
df.head()

                Country      Capital_City  Population  Area_km2
0               Andorra  Andorra la Vella       84000     468.0
1  United Arab Emirates         Abu Dhabi     4975593   82880.0
2           Afghanistan             Kabul    29121286  647500.0
3   Antigua and Barbuda        St. John's       86754     443.0
4              Anguilla        The Valley       13254     102.0


After cleaning the data I am going to save it to a spreadsheet, ready for future analysis. This is all I have on web scraping in this project. Thank you.
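Saving the cleaned data could look like the sketch below. It writes a CSV file, which any spreadsheet program can open; the file name, the temporary directory, and the two-row sample frame are assumptions for illustration. Passing `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the file is read back.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the cleaned DataFrame (illustrative sample)
sample = pd.DataFrame({
    "Country": ["Andorra", "United Arab Emirates"],
    "Capital_City": ["Andorra la Vella", "Abu Dhabi"],
    "Population": [84000, 4975593],
    "Area_km2": [468.0, 82880.0],
})

# Hypothetical output location; any writable path would do
out_path = os.path.join(tempfile.gettempdir(), "countries_of_the_world.csv")

# index=False leaves the row index out of the saved file
sample.to_csv(out_path, index=False)

# Reading the file back confirms the columns survived the round trip
restored = pd.read_csv(out_path)
print(restored.columns.tolist())
```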