<h1>Analysis of Global COVID-19 Pandemic Data</h1>

Estimated time needed: **90** minutes



## Overview:

In this final project, you will apply your knowledge of data analysis by completing a series of hands-on tasks in this lab notebook. The lab consists of 10 tasks, where you will write and execute code to demonstrate your skills.

While working through the lab, remember to save all your code and outputs, then download the Jupyter Notebook, as you will need to submit the completed notebook for the **Final Project: Submission and Evaluation**.

If you need to refresh your memories about specific coding details, you may refer to previous hands-on labs for code examples.


In [1]:
# This lab requires 'httr' and 'rvest'packages, which are already pre-loaded into this lab environment.
# However, if you are working on your local RStudio, please uncomment the below codes and install the packages.

#install.packages("httr")
#install.packages("rvest")

In [2]:
library(httr)
library(rvest)

Note: if you can import above libraries, please use install.packages() to install them first.


## TASK 1: Get a `COVID-19 pandemic` Wiki page using HTTP request


First, let's write a function to use HTTP request to get a public COVID-19 Wiki page.

Before you write the function, you can open this public page from this 

URL https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country using a web browser.

The goal of task 1 is to get the html page using HTTP request (`httr` library)


In [3]:

get_wiki_covid19_page <- function() {
    
  # Our target COVID-19 wiki page URL is: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country  
  # Which has two parts: 
    # 1) base URL `https://en.wikipedia.org/w/index.php  
    # 2) URL parameter: `title=Template:COVID-19_testing_by_country`, seperated by question mark ?
    
  # Wiki page base
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"
  # You will need to create a List which has an element called `title` to specify which page you want to get from Wiki
  # in our case, it will be `Template:COVID-19_testing_by_country`
 
  # - Use the `GET` function in httr library with a `url` argument and a `query` arugment to get a HTTP response
    
  # Use the `return` function to return the response

}
library(httr)

get_wiki_covid19_page <- function() {
  
  # Wiki page base URL
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"
  
  # Query parameters
  query_params <- list(
    title = "Template:COVID-19_testing_by_country"
  )
  
  # Send GET request
  response <- GET(
    url = wiki_base_url,
    query = query_params
  )
  
  # Return the HTTP response
  return(response)
}
response <- get_wiki_covid19_page()
response





Response [https://en.wikipedia.org/w/index.php?title=Template%3ACOVID-19_testing_by_country]
  Date: 2025-12-14 06:51
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 456 kB
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-fea...
<head>
<meta charset="UTF-8">
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-heade...
RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","user.st...
<script>(RLQ=window.RLQ||[]).push(function(){mw.loader.impl(function(){return...
}];});});</script>
<link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.cite.styles%...
...

Call the `get_wiki_covid19_page` function to get a http response with the target html page


In [4]:
# Call the get_wiki_covid19_page function and print the response


## TASK 2: Extract COVID-19 testing data table from the wiki HTML page


On the COVID-19 testing wiki page, you should see a data table `<table>` node contains COVID-19 testing data by country on the page:

<a href="https://cognitiveclass.ai/">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/covid-19-by-country.png" width="400" align="center">
</a>

Note the numbers you actually see on your page may be different from above because it is still an on-going pandemic when creating this notebook.

The goal of task 2 is to extract above data table and convert it into a data frame


Now use the `read_html` function in rvest library to get the root html node from response


In [5]:
library(rvest)
root_html <- read_html(response)





Get the tables in the HTML root node using `html_nodes` function.


In [6]:
table_nodes <- html_nodes(root_html, "table")



Read the specific table from the multiple tables in the `table_node` using the `html_table` function and convert it into dataframe using `as.data.frame`

_Hint:- Please read the `table_node` with index 2(ex:- table_node[2])._


In [7]:
covid_testing_df <- as.data.frame(
  html_table(table_nodes[[2]], fill = TRUE)
)

head(covid_testing_df)






Unnamed: 0_level_0,Country or region,Date[a],Tested,Units[b],Confirmed(cases),"Confirmed /tested,%","Tested /population,%","Confirmed /population,%",Ref.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Afghanistan,17 Dec 2020,154767,samples,49621,32.1,0.4,0.13,[1]
2,Albania,18 Feb 2021,428654,samples,96838,22.6,15.0,3.4,[2]
3,Algeria,2 Nov 2020,230553,samples,58574,25.4,0.53,0.13,[3][4]
4,Andorra,23 Feb 2022,300307,samples,37958,12.6,387.0,49.0,[5]
5,Angola,2 Feb 2021,399228,samples,20981,5.3,1.3,0.067,[6]
6,Antigua and Barbuda,6 Mar 2021,15268,samples,832,5.4,15.9,0.86,[7]


## TASK 3: Pre-process and export the extracted data frame

The goal of task 3 is to pre-process the extracted data frame from the previous step, and export it as a csv file


Let's get a summary of the data frame


In [8]:
summary(covid_testing_df)





 Country or region    Date[a]             Tested            Units[b]        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed(cases)   Confirmed /tested,% Tested /population,%
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed /population,%     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  

As you can see from the summary, the columns names are little bit different to understand and some column data types are not correct. For example, the `Tested` column shows as `character`. 

As such, the data frame read from HTML table will need some pre-processing such as removing irrelvant columns, renaming columns, and convert columns into proper data types.


We have prepared a pre-processing function for you to conver the data frame but you can also try to write one by yourself


In [9]:
preprocess_covid_data_frame <- function(data_frame) {
    
    shape <- dim(data_frame)

    # Remove the World row
    data_frame<-data_frame[!(data_frame$`Country.or.region`=="World"),]
    # Remove the last row
    data_frame <- data_frame[1:172, ]
    
    # We dont need the Units and Ref columns, so can be removed
    data_frame["Ref."] <- NULL
    data_frame["Units.b."] <- NULL
    
    # Renaming the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
    
    # Convert column data types
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",","",data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",","",data_frame$confirmed))
    data_frame$'confirmed.tested.ratio' <- as.numeric(gsub(",","",data_frame$`confirmed.tested.ratio`))
    data_frame$'tested.population.ratio' <- as.numeric(gsub(",","",data_frame$`tested.population.ratio`))
    data_frame$'confirmed.population.ratio' <- as.numeric(gsub(",","",data_frame$`confirmed.population.ratio`))
    
    return(data_frame)
}
preprocess_covid_data_frame <- function(data_frame) {

  # Remove the World row
  data_frame <- data_frame[!(data_frame$`Country or region` == "World"), ]

  # Remove the last row
  data_frame <- data_frame[1:172, ]

  # Remove unnecessary columns
  data_frame["Ref."] <- NULL
  data_frame["Units b"] <- NULL

  # Rename columns
  names(data_frame) <- c(
    "country",
    "date",
    "tested",
    "confirmed",
    "confirmed_tested_ratio",
    "tested_population_ratio",
    "confirmed_population_ratio"
  )

  # Convert column data types
  data_frame$country <- as.factor(data_frame$country)
  data_frame$date <- as.factor(data_frame$date)

  data_frame$tested <- as.numeric(gsub(",", "", data_frame$tested))
  data_frame$confirmed <- as.numeric(gsub(",", "", data_frame$confirmed))

  data_frame$confirmed_tested_ratio <- as.numeric(gsub("%", "", data_frame$confirmed_tested_ratio))
  data_frame$tested_population_ratio <- as.numeric(gsub("%", "", data_frame$tested_population_ratio))
  data_frame$confirmed_population_ratio <- as.numeric(gsub("%", "", data_frame$confirmed_population_ratio))

  return(data_frame)
}



Call the `preprocess_covid_data_frame` function


In [10]:
processed_covid_df <- preprocess_covid_data_frame(covid_testing_df)



“NAs introduced by coercion”
“NAs introduced by coercion”
“NAs introduced by coercion”


Get the summary of the processed data frame again


In [11]:
summary(processed_covid_df)



                country             date         tested            confirmed  
 Afghanistan        :  1   2 Feb 2023 :  6   Min.   :     3880   Min.   : NA  
 Albania            :  1   1 Feb 2023 :  4   1st Qu.:   512037   1st Qu.: NA  
 Algeria            :  1   31 Jan 2023:  4   Median :  3029859   Median : NA  
 Andorra            :  1   1 Mar 2021 :  3   Mean   : 31377219   Mean   :NaN  
 Angola             :  1   23 Jul 2021:  3   3rd Qu.: 12386725   3rd Qu.: NA  
 Antigua and Barbuda:  1   29 Jan 2023:  3   Max.   :929349291   Max.   : NA  
 (Other)            :166   (Other)    :149                       NA's   :172  
 confirmed_tested_ratio tested_population_ratio confirmed_population_ratio
 Min.   :  0.0          Min.   : 0.00           Min.   :  0.0065          
 1st Qu.:148.5          1st Qu.: 5.00           1st Qu.:  8.5000          
 Median :494.0          Median :10.05           Median : 40.9500          
 Mean   :486.8          Mean   :11.25           Mean   :106.9261    

After pre-processing, you can see the columns and columns names are simplified, and columns types are converted into correct types.


The data frame has following columns:

- **country** - The name of the country
- **date** - Reported date
- **tested** - Total tested cases by the reported date
- **confirmed** - Total confirmed cases by the reported date
- **confirmed.tested.ratio** - The ratio of confirmed cases to the tested cases
- **tested.population.ratio** - The ratio of tested cases to the population of the country
- **confirmed.population.ratio** - The ratio of confirmed cases to the population of the country


OK, we can call `write.csv()` function to save the csv file into a file. 


In [12]:
write.csv(processed_covid_df, "covid.csv", row.names = FALSE)



Note for IBM Waston Studio, there is no traditional "hard disk" associated with a R workspace.

Even if you call `write.csv()` method to save the data frame as a csv file, it won't be shown in IBM Cloud Object Storage asset UI automatically.

However, you may still check if the `covid.csv` exists using following code snippet:


In [13]:
# Get working directory
wd <- getwd()

# Build file path
file_path <- paste(wd, "covid.csv", sep = "/")

# Print file path
print(file_path)

# Check if file exists
file.exists(file_path)



[1] "/resources/RP0101EN/labs/M5/covid.csv"


**Optional Step**: If you have difficulties finishing above webscraping tasks, you may still continue with next tasks by downloading a provided csv file from here:


In [14]:
## Download a sample csv file
# covid_csv_file <- download.file("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/dataset/covid.csv", destfile="covid.csv")
# covid_data_frame_csv <- read.csv("covid.csv", header=TRUE, sep=",")

## TASK 4: Get a subset of the extracted data frame

The goal of task 4 is to get the 5th to 10th rows from the data frame with only `country` and `confirmed` columns selected


In [15]:
# Read the csv file
covid_data_frame_csv <- read.csv("covid.csv", stringsAsFactors = FALSE)

# Get the 5th to 10th rows with only country and confirmed columns
covid_subset <- covid_data_frame_csv[5:10, c("country", "confirmed")]

# Print the result
covid_subset




Unnamed: 0_level_0,country,confirmed
Unnamed: 0_level_1,<chr>,<lgl>
5,Angola,
6,Antigua and Barbuda,
7,Argentina,
8,Armenia,
9,Australia,
10,Austria,


## TASK 5: Calculate worldwide COVID testing positive ratio

The goal of task 5 is to get the total confirmed and tested cases worldwide, and try to figure the overall positive ratio using `confirmed cases / tested cases`


In [16]:
# Total confirmed cases worldwide
total_confirmed <- sum(processed_covid_df$confirmed, na.rm = TRUE)

# Total tested cases worldwide
total_tested <- sum(processed_covid_df$tested, na.rm = TRUE)

# Positive ratio
positive_ratio <- total_confirmed / total_tested

positive_ratio




## TASK 6: Get a country list which reported their testing data 

The goal of task 6 is to get a catalog or sorted list of countries who have reported their COVID-19 testing data


In [17]:
# Get the country column
countries <- processed_covid_df$country

# Check its class
class(countries)

# Sort A to Z
countries_AtoZ <- sort(as.character(countries))

# Sort Z to A
countries_ZtoA <- sort(as.character(countries), decreasing = TRUE)

# Print Z to A list
countries_ZtoA




##TASK 7: Identify countries names with a specific pattern

The goal of task 7 is using a regular expression to find any countires start with `United`


In [18]:
# Find countries starting with "United"
united_countries <- grep("^United.+", countries_AtoZ, value = TRUE)

# Print matched country names
united_countries




## TASK 8: Pick two countries you are interested, and then review their testing data

The goal of task 8 is to compare the COVID-19 test data between two countires, you will need to select two rows from the dataframe, and select `country`, `confirmed`, `confirmed-population-ratio` columns


In [19]:
# Select India
india_data <- processed_covid_df[
  processed_covid_df$country == "India",
  c("country", "confirmed", "confirmed_population_ratio")
]

# Select United States
us_data <- processed_covid_df[
  processed_covid_df$country == "United States",
  c("country", "confirmed", "confirmed_population_ratio")
]

india_data
us_data



Unnamed: 0_level_0,country,confirmed,confirmed_population_ratio
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>
73,India,,63


Unnamed: 0_level_0,country,confirmed,confirmed_population_ratio
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>
166,United States,,281


## TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal of task 9 is to find out which country you have selected before has larger ratio of confirmed cases to population, which may indicate that country has higher COVID-19 infection risk


In [20]:
if (india_data$confirmed_population_ratio > us_data$confirmed_population_ratio) {
  print("India has a higher confirmed-to-population ratio than the United States")
} else {
  print("United States has a higher confirmed-to-population ratio than India")
}



[1] "United States has a higher confirmed-to-population ratio than India"


## TASK 10: Find countries with confirmed to population ratio rate less than a threshold

The goal of task 10 is to find out which countries have the confirmed to population ratio less than 1%, it may indicate the risk of those countries are relatively low


In [21]:
low_risk_countries <- processed_covid_df[
  processed_covid_df$confirmed_population_ratio < 1,
  c("country", "confirmed_population_ratio")
]

low_risk_countries




Unnamed: 0_level_0,country,confirmed_population_ratio
Unnamed: 0_level_1,<fct>,<dbl>
1,Afghanistan,0.4
3,Algeria,0.53
,,
27,Burkina Faso,0.76
28,Burundi,0.76
32,Chad,0.72
NA.1,,
NA.2,,
45,DR Congo,0.14
NA.3,,
