<h1>Analysis of Global COVID-19 Pandemic Data</h1>





In [2]:
# This lab requires the 'httr' and 'rvest' packages, which are pre-loaded in this lab environment.
# If you are working in a local RStudio, uncomment the lines below to install the packages.

#install.packages("httr")
#install.packages("rvest")

In [1]:
library(httr)
library(rvest)

Loading required package: xml2


## Getting a `COVID-19 pandemic` Wiki page using HTTP request


In [28]:

get_wiki_covid19_page <- function() {

  # The target COVID-19 wiki page URL is:
  # https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country
  # It has two parts, separated by a question mark `?`:
  #   1) base URL: `https://en.wikipedia.org/w/index.php`
  #   2) URL parameter: `title=Template:COVID-19_testing_by_country`

  # Wiki page base URL
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"

  # A list with a `title` element specifies which wiki page to request;
  # here it is `Template:COVID-19_testing_by_country`
  query_params <- list(title = "Template:COVID-19_testing_by_country")

  # Use the `GET` function in the httr library with a `url` argument
  # and a `query` argument to get an HTTP response
  wiki_covid19_page <- GET(url = wiki_base_url, query = query_params)

  # Return the response
  return(wiki_covid19_page)
}
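
To see how the base URL and the `title` parameter combine, the following sketch composes the full request URL with httr's `modify_url` without sending a request. The variable name `full_url` is introduced here purely for illustration, and httr may percent-encode some characters (such as `:`) in the query string.

```r
library(httr)

# Same base URL and query parameter as in get_wiki_covid19_page()
wiki_base_url <- "https://en.wikipedia.org/w/index.php"
query_params <- list(title = "Template:COVID-19_testing_by_country")

# Compose the URL that GET() would request; no network call is made here.
# `full_url` is a name chosen for this illustration only.
full_url <- modify_url(wiki_base_url, query = query_params)
print(full_url)
```

Printing the composed URL is a quick way to check the query parameters before making the actual request.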




Calling the `get_wiki_covid19_page` function to get an HTTP response containing the target HTML page


In [31]:
# Call the get_wiki_covid19_page function and extract the response body as text
response <- get_wiki_covid19_page()
html_content <- content(response, as = "text")


## Extracting COVID-19 testing data table from the wiki HTML page


On the COVID-19 testing wiki page, you should see a data table (a `<table>` node) that contains COVID-19 testing data by country:

<a href="https://cognitiveclass.ai/">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/covid-19-by-country.png" width="400" align="center">
</a>




Now use the `read_html` function in the rvest library to get the root HTML node from the response


In [32]:
# Get the root HTML node from the HTTP response text obtained above
html_node <- read_html(html_content)

Getting the tables in the HTML root node using the `html_nodes` function.


In [33]:
# Get all table nodes from the root HTML node and count them
table_node <- html_nodes(html_node, "table")
print(length(table_node))

[1] 4


Reading the second of the tables in `table_node` using the `html_table` function and converting it into a data frame using `as.data.frame`


In [36]:
# Read the second table node and convert it into a data frame
data_frame <- as.data.frame(html_table(table_node[2], fill = TRUE))
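
The parse, select and convert steps can be tried offline on a small HTML string. This is a minimal sketch with made-up content (the `toy_*` names and the table values are assumptions, not wiki data). It picks a single node with `[[2]]`, so the column names are kept exactly as they appear in the header row.

```r
library(rvest)

# A tiny made-up page with two tables (not the real wiki content)
toy_html <- "<html><body>
  <table><tr><th>Note</th></tr><tr><td>ignore me</td></tr></table>
  <table>
    <tr><th>Country</th><th>Tested</th></tr>
    <tr><td>A</td><td>1,234</td></tr>
    <tr><td>B</td><td>56,789</td></tr>
  </table>
</body></html>"

# Parse the string and collect every <table> node
toy_root <- read_html(toy_html)
toy_tables <- html_nodes(toy_root, "table")
print(length(toy_tables))

# Convert the second table into a data frame
toy_df <- as.data.frame(html_table(toy_tables[[2]]))
print(toy_df)
```

Values with thousands separators, such as `1,234`, typically come back as text, which is why a later conversion step is needed.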

## Pre-processing and exporting the extracted data frame




Let's get a summary of the data frame


In [37]:
# Print the summary of the data frame
print(summary(data_frame))

 Country.or.region    Date.a.             Tested            Units.b.        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed.cases.   Confirmed..tested.. Tested..population..
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed..population..     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  


As we can see from the summary, the column names are a little difficult to understand and some column data types are not correct. For example, the `Tested` column is shown as `character`.

As such, the data frame read from the HTML table needs some pre-processing, such as removing irrelevant columns, renaming columns, and converting columns into proper data types.


In [39]:
preprocess_covid_data_frame <- function(data_frame) {

    # Remove the World row
    data_frame <- data_frame[!(data_frame$Country.or.region == "World"), ]
    # Remove the last row by keeping rows 1 to 172
    # (the row count is hard-coded for the table as scraped here)
    data_frame <- data_frame[1:172, ]

    # We don't need the Units and Ref columns, so they can be removed
    data_frame["Ref."] <- NULL
    data_frame["Units.b."] <- NULL

    # Rename the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")

    # Convert column data types; commas are stripped before numeric conversion
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",", "", data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",", "", data_frame$confirmed))
    data_frame$confirmed.tested.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.tested.ratio))
    data_frame$tested.population.ratio <- as.numeric(gsub(",", "", data_frame$tested.population.ratio))
    data_frame$confirmed.population.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.population.ratio))

    return(data_frame)
}
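
The numeric conversion in the function above can be tried in isolation. A minimal sketch, using made-up strings formatted like the scraped values (the variable names are for illustration only):

```r
# Made-up values formatted like the scraped table
raw_tested <- c("154,767", "35,716,069", "3880")

# as.numeric() cannot parse thousands separators, so those values become NA
direct <- suppressWarnings(as.numeric(raw_tested))
print(direct)

# Strip the commas first, then convert
cleaned <- as.numeric(gsub(",", "", raw_tested))
print(cleaned)
```

Any value that still fails to convert after the commas are removed ends up as `NA`, so checking `sum(is.na(...))` on the converted columns is a cheap sanity check.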


Calling the `preprocess_covid_data_frame` function


In [40]:
# call `preprocess_covid_data_frame` function and assign it to a new data frame
processed_data<-preprocess_covid_data_frame(data_frame)

Getting the summary of the processed data frame again


In [41]:
# Print the summary of the processed data frame again
print(summary(processed_data))

                country             date         tested         
 Afghanistan        :  1   2 Feb 2023 :  6   Min.   :     3880  
 Albania            :  1   1 Feb 2023 :  4   1st Qu.:   512037  
 Algeria            :  1   31 Jan 2023:  4   Median :  3029859  
 Andorra            :  1   1 Mar 2021 :  3   Mean   : 31377219  
 Angola             :  1   23 Jul 2021:  3   3rd Qu.: 12386725  
 Antigua and Barbuda:  1   29 Jan 2023:  3   Max.   :929349291  
 (Other)            :166   (Other)    :149                      
   confirmed        confirmed.tested.ratio tested.population.ratio
 Min.   :       0   Min.   : 0.00          Min.   :   0.006       
 1st Qu.:   37839   1st Qu.: 5.00          1st Qu.:   9.475       
 Median :  281196   Median :10.05          Median :  46.950       
 Mean   : 2508340   Mean   :11.25          Mean   : 175.504       
 3rd Qu.: 1278105   3rd Qu.:15.25          3rd Qu.: 156.500       
 Max.   :90749469   Max.   :46.80          Max.   :3223.000       
           

After pre-processing, we can see that the columns and column names are simplified and that the column types have been converted into the correct types.


The data frame has the following columns:

- **country** - The name of the country
- **date** - Reported date
- **tested** - Total tested cases by the reported date
- **confirmed** - Total confirmed cases by the reported date
- **confirmed.tested.ratio** - The ratio of confirmed cases to the tested cases
- **tested.population.ratio** - The ratio of tested cases to the population of the country
- **confirmed.population.ratio** - The ratio of confirmed cases to the population of the country


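We can now call the `write.csv()` function to export the data frame as CSV. Called with only the data frame, it prints the CSV content to the console; passing a file name as the second argument saves it to a file instead.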
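
To save the CSV to disk, `write.csv()` takes a `file` argument. A minimal sketch, using a made-up data frame and a temporary file path (both are assumptions for illustration):

```r
# Made-up two-row data frame standing in for processed_data
toy_data <- data.frame(country = c("A", "B"), confirmed = c(10, 20))

# tempfile() gives a throwaway path; a fixed name such as "covid.csv"
# (a hypothetical file name) would work the same way
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE omits the leading column of row numbers
write.csv(toy_data, file = csv_path, row.names = FALSE)

# Read the file back to confirm the round trip
toy_read <- read.csv(csv_path)
print(toy_read)
```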


In [42]:
# Export the data frame as CSV (printed to the console because no file name is given)
write.csv(processed_data)

"","country","date","tested","confirmed","confirmed.tested.ratio","tested.population.ratio","confirmed.population.ratio"
"1","Afghanistan","17 Dec 2020",154767,49621,32.1,0.4,0.13
"2","Albania","18 Feb 2021",428654,96838,22.6,15,3.4
"3","Algeria","2 Nov 2020",230553,58574,25.4,0.53,0.13
"4","Andorra","23 Feb 2022",300307,37958,12.6,387,49
"5","Angola","2 Feb 2021",399228,20981,5.3,1.3,0.067
"6","Antigua and Barbuda","6 Mar 2021",15268,832,5.4,15.9,0.86
"7","Argentina","16 Apr 2022",35716069,9060495,25.4,78.3,20
"8","Armenia","29 May 2022",3099602,422963,13.6,105,14.3
"9","Australia","9 Sep 2022",78548492,10112229,12.9,313,40.3
"10","Austria","1 Feb 2023",205817752,5789991,2.8,2312,65
"11","Azerbaijan","11 May 2022",6838458,792638,11.6,69.1,8
"12","Bahamas","28 Nov 2022",259366,37483,14.5,67.3,9.7
"13","Bahrain","3 Dec 2022",10578766,696614,6.6,674,44.4
"14","Bangladesh","24 Jul 2021",7417714,1151644,15.5,4.5,0.7
"15","Barbados","14 Oct 2022",770100,103014,13.4,268,35.9
"16","Belarus","

## Getting a subset of the extracted data frame

The goal is to get the 5th to 10th rows of the data frame, selecting only the `country` and `confirmed` columns.


In [49]:
# Use the processed data frame that is already in memory

# Get the 5th to 10th rows, with the "country" and "confirmed" columns
desired_data <- processed_data[5:10, c("country", "confirmed")]
print(desired_data)

               country confirmed
5               Angola     20981
6  Antigua and Barbuda       832
7            Argentina   9060495
8              Armenia    422963
9            Australia  10112229
10             Austria   5789991


## Calculating worldwide COVID testing positive ratio

The goal is to get the total confirmed and tested cases worldwide and compute the overall positive ratio.


In [51]:
# Get the total confirmed cases worldwide
total_confirmed_cases<-sum(processed_data[,"confirmed"])
# Get the total tested cases worldwide
total_tested_cases<-sum(processed_data[,"tested"])
# Get the positive ratio (confirmed / tested)
print(total_confirmed_cases/total_tested_cases)

[1] 0.07994145
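
The ratio above is `sum(confirmed) / sum(tested)`, and `sum()` returns `NA` if any value is missing. The printed result is a number, so no missing values were present in this run. If the source table changes, a sketch like the following shows one way to handle them; the data is made up and the variable names are for illustration only.

```r
# Made-up counts; the NA stands in for a value that failed numeric conversion
toy_counts <- data.frame(
  confirmed = c(10, NA, 30),
  tested    = c(100, 200, 300)
)

# Without na.rm, a single NA makes the whole sum NA
print(sum(toy_counts$confirmed))

# One option: keep only rows where both counts are present, so the
# numerator and denominator cover the same countries
complete_rows <- complete.cases(toy_counts[, c("confirmed", "tested")])
toy_ratio <- sum(toy_counts$confirmed[complete_rows]) /
  sum(toy_counts$tested[complete_rows])
print(toy_ratio)
```

Restricting both sums to the same rows avoids dividing the confirmed cases of some countries by the tests of a larger set.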


## Getting a list of countries that reported their testing data

The goal is to get a sorted list of the countries that have reported their COVID-19 testing data


In [58]:
# Get the `country` column
countries <- processed_data[, "country"]
# Check its class (should be factor)
print(class(countries))
# Convert the country column into character so that it sorts alphabetically
countries <- as.character(countries)
# Sort the countries A to Z
countries <- sort(countries)
# Sort the countries Z to A
countries <- sort(countries, decreasing = TRUE)
# Print the sorted Z-to-A list
print(countries)

[1] "factor"
  [1] "Zimbabwe"               "Zambia"                 "Vietnam"               
  [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
  [7] "United States"          "United Kingdom"         "United Arab Emirates"  
 [10] "Ukraine"                "Uganda"                 "Turkey"                
 [13] "Tunisia"                "Trinidad and Tobago"    "Togo"                  
 [16] "Thailand"               "Tanzania"               "Taiwan[m]"             
 [19] "Switzerland[l]"         "Sweden"                 "Sudan"                 
 [22] "Sri Lanka"              "Spain"                  "South Sudan"           
 [25] "South Korea"            "South Africa"           "Slovenia"              
 [28] "Slovakia"               "Singapore"              "Serbia"                
 [31] "Senegal"                "Saudi Arabia"           "San Marino"            
 [34] "Saint Vincent"          "Saint Lucia"            "Saint Kitts and Nevis" 
 [37] "Rwanda" 
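
Some names in the output above carry footnote markers from the wiki table, such as `Taiwan[m]`. The notebook does not remove them; as an optional cleanup, a sketch along these lines strips a trailing bracketed marker (the variable names are for illustration only):

```r
# A few names copied from the output above, footnote markers included
sample_countries <- c("Taiwan[m]", "Switzerland[l]", "Sweden")

# Remove a trailing bracketed lowercase marker such as "[m]"
clean_countries <- sub("\\[[a-z]+\\]$", "", sample_countries)
print(sort(clean_countries))
```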

## Identifying countries names with a specific pattern

Using a regular expression to find any countries whose names start with `United`


In [61]:
# Use the regular expression `^United` to find the positions of matching names
matched_countries <- grep("^United", countries)
# Print the matched country names
print(countries[matched_countries])

[1] "United States"        "United Kingdom"       "United Arab Emirates"
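
`grep` returns positions by default, which is why the result is used as an index into `countries`. The sketch below, on a made-up vector, contrasts that with `value = TRUE` and with `grepl`:

```r
# Made-up sample of country names
sample_names <- c("United States", "Sweden", "United Kingdom", "Tanzania")

# Integer positions of the matching elements
idx <- grep("^United", sample_names)
# The matching strings themselves
vals <- grep("^United", sample_names, value = TRUE)
# One TRUE/FALSE per element, handy for filtering data frame rows
flags <- grepl("^United", sample_names)

print(idx)
print(vals)
print(flags)
```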


## Picking two countries of my interest, and then review their testing data

The goal is to compare the COVID-19 testing data of two countries. I will select two rows from the data frame, keeping the `country`, `confirmed`, and `confirmed.population.ratio` columns


In [74]:
# Select a subset (should be only one row) of data frame based on a selected country name and columns
country1<-processed_data[processed_data$country=="Pakistan",c("country", "confirmed","confirmed.population.ratio")]
# Select a subset (should be only one row) of data frame based on a selected country name and columns
country2<-processed_data[processed_data$country=="India",c("country", "confirmed","confirmed.population.ratio")]

## Comparing which one of the selected countries has a larger ratio of confirmed cases to population

The goal is to find out which of the two selected countries has the larger ratio of confirmed cases to population, which may indicate a higher COVID-19 infection risk in that country


In [75]:
# Use an if-else statement to compare the two ratios
if (country1$confirmed.population.ratio > country2$confirmed.population.ratio) {
   print("Pakistan")
} else {
   print("India")
}


[1] "India"
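
The comparison relies on each selection returning exactly one row; a misspelled country name would give zero rows and make `if` fail. The following sketch wraps the comparison in a hypothetical helper, `higher_ratio_country` (not part of the notebook), with a guard for that case. The data is made up.

```r
# Made-up ratios for two hypothetical countries
toy_data <- data.frame(
  country = c("X", "Y"),
  confirmed.population.ratio = c(0.5, 2.0),
  stringsAsFactors = TRUE
)

# Hypothetical helper: returns the name of the country with the larger ratio
higher_ratio_country <- function(df, name1, name2) {
  r1 <- df$confirmed.population.ratio[df$country == name1]
  r2 <- df$confirmed.population.ratio[df$country == name2]
  # Each name must match exactly one row, otherwise if() would fail
  stopifnot(length(r1) == 1, length(r2) == 1)
  if (r1 > r2) name1 else name2
}

print(higher_ratio_country(toy_data, "X", "Y"))
```

On a tie, this sketch returns the second name, mirroring the `else` branch of the cell above.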


## Finding countries with a confirmed-to-population ratio below a threshold

The goal of this task is to find the countries whose confirmed-to-population ratio is less than 1%, which may indicate that the risk in those countries is relatively low


In [None]:
# Get the subset of countries with `confirmed.population.ratio` less than the threshold
# (the trailing comma selects rows and keeps all columns)
threshold <- 1
desired_countries <- processed_data[processed_data$confirmed.population.ratio < threshold, ]
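
Row filtering puts the logical condition in the row position, before the comma: `df[condition, ]`. If the ratio column contains `NA`, plain logical indexing also returns an all-`NA` row for each missing value. A minimal sketch on made-up data shows the difference and one way around it with `which()`:

```r
# Made-up ratios, including one missing value
toy_data <- data.frame(
  country = c("A", "B", "C"),
  confirmed.population.ratio = c(0.13, 49, NA)
)
threshold <- 1

# Plain logical indexing keeps an all-NA row for the NA comparison
with_na <- toy_data[toy_data$confirmed.population.ratio < threshold, ]
print(nrow(with_na))

# which() drops NA comparisons, leaving only rows that satisfy the test
below <- toy_data[which(toy_data$confirmed.population.ratio < threshold), ]
print(below)
```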