In [1]:
# This lab requires the 'httr' and 'rvest' packages, which are already pre-loaded into this lab environment.
# However, if you are working in your local RStudio, please uncomment the lines below and install the packages.

#install.packages("httr")
#install.packages("rvest")

In [2]:
library(httr)
library(rvest)

Loading required package: xml2
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2


## TASK 1: Get a `COVID-19 pandemic` Wiki page using HTTP request


First, let's write a function that uses an HTTP request to get a public COVID-19 Wiki page.


In [3]:

get_wiki_covid19_page <- function() {

    # Wiki page base URL and query string
    wiki_base_url <- "https://en.wikipedia.org/w/index.php"
    title <- "title=Template:COVID-19_testing_by_country"

    # Use the `GET` function in the httr library with a `url` argument and a `query` argument to get an HTTP response
    response <- GET(wiki_base_url, query = title)
    # Use the `return` function to return the response
    return(response)

}
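
The `query` argument of `GET` also accepts a named list, in which case httr takes care of the URL encoding. The sketch below shows one way to preview the resulting URL with `modify_url` without sending a request, and to fail early on an HTTP error with `stop_for_status`. The helper name `get_page_checked` is hypothetical and not part of the lab.

```r
library(httr)

wiki_base_url <- "https://en.wikipedia.org/w/index.php"
wiki_query <- list(title = "Template:COVID-19_testing_by_country")

# Preview the URL that httr would request (no network call is made here)
preview_url <- modify_url(wiki_base_url, query = wiki_query)
print(preview_url)

# Hypothetical helper: send the request and raise an error on a 4xx/5xx status
get_page_checked <- function(url, query) {
  response <- GET(url, query = query)
  stop_for_status(response)
  response
}
```

Calling `get_page_checked(wiki_base_url, wiki_query)` would then return the response only when the request succeeded.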




Call the `get_wiki_covid19_page` function to get an HTTP response containing the target HTML page.


In [4]:
# Call the get_wiki_covid19_page function and print the response
response_html<- get_wiki_covid19_page()
response_html

Response [https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country]
  Date: 2025-08-08 19:06
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 448 kB
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-fea...
<head>
<meta charset="UTF-8">
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-heade...
RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","user.st...
<script>(RLQ=window.RLQ||[]).push(function(){mw.loader.impl(function(){return...
}];});});</script>
<link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.cite.styles%...
...

## TASK 2: Extract COVID-19 testing data table from the wiki HTML page


In [5]:
# Get the root html node from the http response in task 1 
root_node<- read_html(response_html)
root_node

{xml_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

Get the tables in the HTML root node using the `html_nodes` function.


In [6]:
# Get the table node from the root html node
table_node <- html_nodes(root_node, "table")
table_node

{xml_nodeset (4)}
[1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" rol ...
[2] <table class="wikitable plainrowheaders sortable collapsible autocollapse ...
[3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentatio ...
[4] <table class="wikitable mw-templatedata-doc-params">\n<caption><p class=" ...

In [7]:
# Read the table node and convert it into a data frame, and print the data frame for review
tables <- html_table(table_node)
df<-tables[[2]]
head(df)

Country or region,Date[a],Tested,Units[b],Confirmed(cases),"Confirmed /tested,%","Tested /population,%","Confirmed /population,%",Ref.
Afghanistan,17 Dec 2020,154767,samples,49621,32.1,0.4,0.13,[1]
Albania,18 Feb 2021,428654,samples,96838,22.6,15.0,3.4,[2]
Algeria,2 Nov 2020,230553,samples,58574,25.4,0.53,0.13,[3][4]
Andorra,23 Feb 2022,300307,samples,37958,12.6,387.0,49.0,[5]
Angola,2 Feb 2021,399228,samples,20981,5.3,1.3,0.067,[6]
Antigua and Barbuda,6 Mar 2021,15268,samples,832,5.4,15.9,0.86,[7]
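
Picking the table by position (`tables[[2]]`) stops working if the page gains or loses a table. A possible alternative, sketched here on a small inline HTML string rather than the live page, is to select the table by its CSS class. The HTML snippet is illustrative only and much simpler than the real page.

```r
library(rvest)

# Illustrative HTML only; the real page is much larger
page_snippet <- '
<html><body>
  <table class="ombox"><tr><td>Notice</td></tr></table>
  <table class="wikitable">
    <tr><th>Country</th><th>Tested</th></tr>
    <tr><td>Afghanistan</td><td>154,767</td></tr>
    <tr><td>Albania</td><td>428,654</td></tr>
  </table>
</body></html>'

doc <- read_html(page_snippet)

# Select only tables carrying the "wikitable" class, then parse them
wiki_tables <- html_table(html_nodes(doc, "table.wikitable"))
testing_table <- wiki_tables[[1]]
print(testing_table)
```

In this sketch the `Tested` column comes back as character because of the thousands separators, which is the same situation the pre-processing step has to deal with.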


## TASK 3: Pre-process and export the extracted data frame

Pre-process the data frame extracted in the previous step, and export it as a CSV file.


Let's get a summary of the data frame


In [8]:
# Print the summary of the data frame
summary(df)

 Country or region    Date[a]             Tested            Units[b]        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed(cases)   Confirmed /tested,% Tested /population,%
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed /population,%     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  

As you can see from the summary, the column names are a little difficult to understand and some column data types are not correct. For example, the `Tested` column shows as `character`.

As such, the data frame read from the HTML table needs some pre-processing, such as removing irrelevant columns, renaming columns, and converting columns into proper data types.
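
To make the type-conversion step concrete, here is a small base R sketch on a hand-written vector (not the scraped data): thousands separators are stripped with `gsub` before `as.numeric`, and anything that still is not a number becomes `NA`.

```r
# Hand-written sample values for illustration
tested_raw <- c("154,767", "428,654", "n/a")

# Remove thousands separators, then convert; "n/a" cannot be parsed and becomes NA
tested_num <- suppressWarnings(as.numeric(gsub(",", "", tested_raw)))
print(tested_num)

# Count how many values failed to convert
sum(is.na(tested_num))
```

Checking the number of `NA` values after a conversion is a cheap way to spot rows that need a closer look.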


In [9]:
# Pre-processing function to clean the scraped data frame

preprocess_covid_data_frame <- function(data_frame) {

    # Remove the World row, if present
    # (html_table keeps the original column name, so it needs backticks)
    data_frame <- data_frame[data_frame$`Country or region` != "World", ]
    # Remove the last row
    data_frame <- data_frame[-nrow(data_frame), ]

    # We don't need the Units and Ref columns, so they can be removed
    data_frame["Ref."] <- NULL
    data_frame["Units[b]"] <- NULL

    # Rename the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")

    # Convert column data types (dates look like "17 Dec 2020")
    data_frame$country <- as.character(data_frame$country)
    data_frame$date <- as.Date(data_frame$date, format = "%d %b %Y")
    data_frame$tested <- as.numeric(gsub(",", "", data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",", "", data_frame$confirmed))
    data_frame$confirmed.tested.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.tested.ratio))
    data_frame$tested.population.ratio <- as.numeric(gsub(",", "", data_frame$tested.population.ratio))
    data_frame$confirmed.population.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.population.ratio))

    return(data_frame)
}


Call the `preprocess_covid_data_frame` function


In [10]:
# Run the pre-processing steps manually and assign the result to a new data frame

# Remove the World row, if present
# (html_table keeps the original column name, so it needs backticks)
clean_df <- df[df$`Country or region` != "World", ]
# Remove the last row
clean_df <- clean_df[-nrow(clean_df), ]

# We don't need the Units and Ref columns, so they can be removed
clean_df["Ref."] <- NULL
clean_df["Units[b]"] <- NULL

# Rename the columns
names(clean_df) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")

# Convert column data types (dates look like "17 Dec 2020")
clean_df$country <- as.character(clean_df$country)
clean_df$date <- as.Date(clean_df$date, format = "%d %b %Y")
clean_df$tested <- as.numeric(gsub(",", "", clean_df$tested))
clean_df$confirmed <- as.numeric(gsub(",", "", clean_df$confirmed))
clean_df$confirmed.tested.ratio <- as.numeric(gsub(",", "", clean_df$confirmed.tested.ratio))
clean_df$tested.population.ratio <- as.numeric(gsub(",", "", clean_df$tested.population.ratio))
clean_df$confirmed.population.ratio <- as.numeric(gsub(",", "", clean_df$confirmed.population.ratio))

In [11]:
head(clean_df)

country,date,tested,confirmed,confirmed.tested.ratio,tested.population.ratio,confirmed.population.ratio
Afghanistan,2020-12-17,154767,49621,32.1,0.4,0.13
Albania,2021-02-18,428654,96838,22.6,15.0,3.4
Algeria,2020-11-02,230553,58574,25.4,0.53,0.13
Andorra,2022-02-23,300307,37958,12.6,387.0,49.0
Angola,2021-02-02,399228,20981,5.3,1.3,0.067
Antigua and Barbuda,2021-03-06,15268,832,5.4,15.9,0.86


Get the summary of the processed data frame again


In [12]:
# Print the summary of the processed data frame again
summary(clean_df)

   country               date                tested            confirmed       
 Length:172         Min.   :2020-07-31   Min.   :     3880   Min.   :       0  
 Class :character   1st Qu.:2021-05-30   1st Qu.:   512037   1st Qu.:   37839  
 Mode  :character   Median :2022-01-25   Median :  3029859   Median :  281196  
                    Mean   :2022-01-17   Mean   : 31377219   Mean   : 2508340  
                    3rd Qu.:2022-09-27   3rd Qu.: 12386725   3rd Qu.: 1278105  
                    Max.   :2023-07-03   Max.   :929349291   Max.   :90749469  
 confirmed.tested.ratio tested.population.ratio confirmed.population.ratio
 Min.   : 0.00          Min.   :   0.006        Min.   : 0.000            
 1st Qu.: 5.00          1st Qu.:   9.475        1st Qu.: 0.425            
 Median :10.05          Median :  46.950        Median : 6.100            
 Mean   :11.25          Mean   : 175.504        Mean   :12.769            
 3rd Qu.:15.25          3rd Qu.: 156.500        3rd Qu.:16.250   

In [13]:
# Export the data frame to a csv file
write.csv(clean_df,"covid.csv")
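
By default `write.csv` also writes the row names as an unnamed first column, which `read.csv` then reads back as a column called `X`. If that extra column is not wanted, `row.names = FALSE` is one option. The sketch below uses temporary files and a two-row sample data frame, so it does not touch `covid.csv`.

```r
sample_df <- data.frame(
  country = c("Angola", "Armenia"),
  confirmed = c(20981, 422963),
  stringsAsFactors = FALSE
)

with_names_path <- tempfile(fileext = ".csv")
without_names_path <- tempfile(fileext = ".csv")

write.csv(sample_df, with_names_path)                        # default: row names written
write.csv(sample_df, without_names_path, row.names = FALSE)  # row names omitted

names(read.csv(with_names_path))     # includes an extra "X" column
names(read.csv(without_names_path))  # only "country" and "confirmed"
```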

In [14]:
# Get the working directory
wd <- getwd()
# Build the path of the exported file
file_path <- file.path(wd, "covid.csv")
# Print the file path and check that the file exists
print(file_path)
file.exists(file_path)

[1] "C:/Users/abcha/anaconda_projects/Online_course_projects/Introduction to R Programming for Data Science - Coursera/covid.csv"


## TASK 4: Get a subset of the extracted data frame

The goal of task 4 is to get the 5th to 10th rows from the data frame with only `country` and `confirmed` columns selected


In [15]:
# Read covid_data_frame_csv from the csv file

covid_data_frame_csv<-read.csv("covid.csv")

# Get the 5th to 10th rows, keeping only the "country" and "confirmed" columns
covid_data_frame_csv[5:10,c("country" ,"confirmed" )]


Unnamed: 0,country,confirmed
5,Angola,20981
6,Antigua and Barbuda,832
7,Argentina,9060495
8,Armenia,422963
9,Australia,10112229
10,Austria,5789991


## TASK 5: Calculate worldwide COVID testing positive ratio

The goal of task 5 is to get the total numbers of confirmed cases and tests worldwide, and to compute the overall positive ratio as `confirmed cases / tested cases`.


In [16]:
# Get the total confirmed cases worldwide

confirmed_case <- sum(covid_data_frame_csv["confirmed"])
print(paste("confirmed cases: ", confirmed_case))
# Get the total tested cases worldwide
tested_case <- sum(covid_data_frame_csv["tested"])
print(paste("tested cases: ", tested_case))

# Get the positive ratio (confirmed / tested)
ratio<- confirmed_case/tested_case
print(paste("positive ratio (confirmed / tested): ", ratio))


[1] "confirmed cases:  431434555"
[1] "tested cases:  5396881644"
[1] "positive ratio (confirmed / tested):  0.0799414520197323"
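
As a quick check of the arithmetic, the two totals printed above can be plugged in directly and the ratio formatted as a percentage. The `na.rm = TRUE` lines are an added precaution, shown on a made-up vector, for the case where a column contains missing values.

```r
confirmed_total <- 431434555
tested_total <- 5396881644

ratio <- confirmed_total / tested_total
ratio_percent <- sprintf("%.2f%%", 100 * ratio)
print(ratio_percent)  # "7.99%"

# sum() returns NA if any value is missing unless na.rm = TRUE is set
sum(c(10, NA, 5))                # NA
sum(c(10, NA, 5), na.rm = TRUE)  # 15
```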


## TASK 6: Get a list of countries that reported their testing data

The goal of task 6 is to get a sorted list of the countries that have reported their COVID-19 testing data.


In [17]:
# Get the `country` column
covid_country <- covid_data_frame_csv$country

# Check its class (factor in R < 4.0; character in R >= 4.0, where read.csv no longer converts strings to factors by default)
class(covid_country)

# Convert the country column into character so that it can be easily sorted
covid_country <- as.character(covid_data_frame_csv$country)

# Sort the countries A to Z
covid_country_AtoZ <- sort(covid_country)
# Sort the countries Z to A
covid_country_ZtoA <- sort(covid_country, decreasing = TRUE)
# Print the sorted Z-to-A list
print(covid_country_ZtoA)


  [1] "Zimbabwe"               "Zambia"                 "Vietnam"               
  [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
  [7] "United States"          "United Kingdom"         "United Arab Emirates"  
 [10] "Ukraine"                "Uganda"                 "Turkey"                
 [13] "Tunisia"                "Trinidad and Tobago"    "Togo"                  
 [16] "Thailand"               "Tanzania"               "Taiwan[m]"             
 [19] "Switzerland[l]"         "Sweden"                 "Sudan"                 
 [22] "Sri Lanka"              "Spain"                  "South Sudan"           
 [25] "South Korea"            "South Africa"           "Slovenia"              
 [28] "Slovakia"               "Singapore"              "Serbia"                
 [31] "Senegal"                "Saudi Arabia"           "San Marino"            
 [34] "Saint Vincent"          "Saint Lucia"            "Saint Kitts and Nevis" 
 [37] "Rwanda"              
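
Some names in the list still carry Wikipedia footnote markers such as `[m]` or `[l]`. A possible clean-up, sketched on three of the names printed above, removes a trailing run of bracketed lowercase letters before any sorting or matching is done.

```r
country_raw <- c("Taiwan[m]", "Switzerland[l]", "Sweden")

# Drop a trailing footnote marker like "[m]"; names without one are unchanged
country_clean <- sub("\\[[a-z]+\\]$", "", country_raw)
print(country_clean)
```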

## TASK 7: Identify country names with a specific pattern

The goal of task 7 is to use a regular expression to find any countries whose names start with `United`.


In [18]:
# Use the regular expression `^United.+` to find names that start with "United"

matches <- regexpr("^United.+", covid_data_frame_csv$country)

# Print the matched country names
regmatches(covid_data_frame_csv$country, matches)

# Alternative way
grep("^United.+", covid_data_frame_csv$country, value = TRUE)
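
Since the goal is names that start with `United`, it may help to see what the `^` anchor changes. The sketch below compares an unanchored and an anchored pattern on a short hand-written vector; "Not United Example" is a hypothetical entry added only to show the difference.

```r
# "Not United Example" is a hypothetical entry, added only to show the difference
country_names <- c("United States", "United Kingdom", "Uganda", "Not United Example")

# Unanchored: matches "United" followed by at least one character, anywhere in the string
grep("United.+", country_names, value = TRUE)

# Anchored with ^: matches only names that begin with "United"
starts_with_united <- grep("^United", country_names, value = TRUE)
print(starts_with_united)
```

Base R's `startsWith(country_names, "United")` would give the same selection as a logical vector, without a regular expression.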

## TASK 8: Pick two countries you are interested in, and then review their testing data

The goal of task 8 is to compare the COVID-19 test data between two countries. You will need to select two rows from the data frame, and select the `country`, `confirmed`, and `confirmed.population.ratio` columns.


In [19]:
# Select a subset (should be only one row) of data frame based on a selected country name and columns

country1<- covid_data_frame_csv[covid_data_frame_csv$country=="Bangladesh",c("country" ,"confirmed","confirmed.population.ratio" )]
country1
# Select a subset (should be only one row) of data frame based on a selected country name and columns
country2<- covid_data_frame_csv[covid_data_frame_csv$country=="United States",c("country" ,"confirmed","confirmed.population.ratio" )]
country2


Unnamed: 0,country,confirmed,confirmed.population.ratio
14,Bangladesh,1151644,0.7


Unnamed: 0,country,confirmed,confirmed.population.ratio
166,United States,90749469,27.4


## TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal of task 9 is to find out which of the two countries selected before has the larger ratio of confirmed cases to population, which may indicate that the country has a higher COVID-19 infection risk.


In [20]:

if (country1$confirmed.population.ratio>country2$confirmed.population.ratio){
    print(" Bangladesh has larger ratio of confirmed cases to population")
} else{
    print(" USA has larger ratio of confirmed cases to population")
    }


[1] " USA has larger ratio of confirmed cases to population"
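
The comparison can also be written without hard-coding the country names in the messages. A minimal sketch, assuming the two selected rows are stacked into one small data frame (the values are the ones shown in task 8):

```r
# Values taken from the two rows selected in task 8
selected <- data.frame(
  country = c("Bangladesh", "United States"),
  confirmed.population.ratio = c(0.7, 27.4),
  stringsAsFactors = FALSE
)

# which.max returns the index of the largest ratio
larger <- selected$country[which.max(selected$confirmed.population.ratio)]
print(paste(larger, "has the larger ratio of confirmed cases to population"))
```

With the real data, `rbind(country1, country2)` would be one way to build the stacked data frame.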


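## TASK 10: Find countries with a confirmed-to-population ratio less than a threshold

The goal of task 10 is to find out which countries have a confirmed-to-population ratio below a threshold, which may indicate that the risk in those countries is relatively low. Note that the `confirmed.population.ratio` column is expressed in percent, so the threshold of `0.01` used below selects ratios under 0.01%; a 1% cutoff would be written as `< 1`.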


In [21]:
# Get a subset of any countries with `confirmed.population.ratio` less than the threshold

less_ratiorate <- covid_data_frame_csv[covid_data_frame_csv$confirmed.population.ratio < 0.01,]
less_ratiorate

Unnamed: 0,X,country,date,tested,confirmed,confirmed.tested.ratio,tested.population.ratio,confirmed.population.ratio
28,28,Burundi,2021-01-05,90019,884,0.98,0.76,0.0074
34,34,China[c],2020-07-31,160000000,87655,0.055,11.1,0.0061
89,89,Laos,2021-03-01,114030,45,0.039,1.6,0.00063
119,119,North Korea,2020-11-25,16914,0,0.0,0.066,0.0
156,156,Tanzania,2020-11-18,3880,509,13.1,0.0065,0.00085
