# India Statewise COVID-19 Pandemic Data Analysis

## Scraping the data
The HTML of the [MyGov COVID-19 statewise status page](https://www.mygov.in/corona-data/covid19-statewise-status/) is scraped to get the latest data as an R data frame.
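
As a rough illustration of the scraping pattern used in this notebook (parse the HTML, select one node by XPath, extract its text, strip the field label), here is a minimal sketch on a toy HTML string. The toy markup and the names `toy_html`, `page`, `node`, `raw_text`, and `clean_text` are assumptions made for this example; the real page is nested far more deeply.

```r
library(rvest)

# Hypothetical two-row table, much shallower than the real page
toy_html <- '<html><body><div>
  <div><div>State Name: Assam</div><div>Total Confirmed: 79667</div></div>
  <div><div>State Name: Bihar</div><div>Total Confirmed: 106307</div></div>
</div></body></html>'

# Parse the literal HTML string into a document
page <- read_html(toy_html)

# Select the first field of the second row by its absolute XPath
node <- html_node(page, xpath = "/html/body/div/div[2]/div[1]")

# Extract the text, drop the field label, and trim surrounding whitespace
raw_text   <- html_text(node)
clean_text <- trimws(sub("State Name:", "", raw_text, fixed = TRUE))
```

The same three steps (select, extract, clean) are repeated for every row and column of the real table.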

### Deducing xpath structure of web data
Say $p=$"/html/body/div[1]/div[4]/div/div/div/div[2]/div[2]/div/div/article/div/div[16]/div/div[2]"

Then the $i$th row of the table is $p$/div[i]. Let $r_i =$ "$p$/div[i]" for $i \in [2, 36]$. The first row ($i = 1$) does not follow this XPath pattern, so it is added manually.

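The $j$th variable of row $i$ is then $r_i$/div/div/div/div[j]. Let $v_{i,j} =$ "$r_i$/div/div/div/div[j]" for $j \in [1, 4]$, one index per column: state name, total confirmed, cured/discharged/migrated, and deaths.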

Name the variables accordingly.

So finally, xpath = "/html/body/div[1]/div[4]/div/div/div/div[2]/div[2]/div/div/article/div/div[16]/div/div[2]/div[i]/div/div/div/div[j]"
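
The construction above can be sketched in base R as follows. The helper name `cell_xpath` is hypothetical and only illustrates how $v_{i,j}$ is assembled from $p$, $i$, and $j$; the notebook itself fills a template string with `str_replace` instead.

```r
# Common prefix p, copied from the text above
p <- "/html/body/div[1]/div[4]/div/div/div/div[2]/div[2]/div/div/article/div/div[16]/div/div[2]"

# Hypothetical helper: XPath of variable j in row i, i.e. v_{i,j}
cell_xpath <- function(i, j) {
  sprintf("%s/div[%d]/div/div/div/div[%d]", p, i, j)
}

# XPath of the state name (j = 1) in the first pattern-conforming row (i = 2)
first_state_xpath <- cell_xpath(2, 1)

# sprintf is vectorised, so one call yields the deaths column (j = 4) for rows 2 to 36
death_xpaths <- cell_xpath(2:36, 4)
```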

In [57]:
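# Row and column index ranges of the table
i_first <-  2   # first row that follows the XPath pattern
i_last  <- 36   # last row of the table
j_first <-  1   # first column (state name)
j_last  <-  4   # last column (deaths)
url <- "https://www.mygov.in/corona-data/covid19-statewise-status/"

library(tidyverse)  # stringr and readr; also attaches dplyr
library(rvest)      # read_html(), html_node(), html_text()
library(xlsx)       # write.xlsx()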

In [26]:
# Loading HTML & setting up xpath
req_xpath <- "/html/body/div[1]/div[4]/div/div/div/div[2]/div[2]/div/div/article/div/div[16]/div/div[2]/div[row]/div/div/div/div[col]"
webpage <- read_html(url)

# Creating vectors
len <- i_last
state <- character(len)
total_confirmed <- character(len)
cured_discharged_migrated <- character(len)
deaths <- character(len)

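# Manually adding the first entry, which does not follow the XPath pattern.
# Assign to element [1] so the preallocated vectors are not overwritten.
state[1] <- "Andaman and Nicobar"
total_confirmed[1] <- 24445
cured_discharged_migrated[1] <- 1325
deaths[1] <- 29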

In [49]:
# Extracting text from nodes
for (i in i_first:i_last) {
    i_xpath <- req_xpath %>% str_replace("row", as.character(i))
    
    state_xpath   <- i_xpath %>% str_replace("col", "1")
    confirm_xpath <- i_xpath %>% str_replace("col", "2")
    cured_xpath   <- i_xpath %>% str_replace("col", "3")
    death_xpath   <- i_xpath %>% str_replace("col", "4")
    
    # For each field: extract the node text, drop the label, trim whitespace.
    # str_trim() is used because str_replace("\\s+", "") removes only the
    # first run of whitespace, wherever it occurs.
    state[i] <- html_text(html_node(webpage, xpath = state_xpath)) %>%
                str_replace("State Name:", "") %>%
                str_trim()

    total_confirmed[i] <- html_text(html_node(webpage, xpath = confirm_xpath)) %>%
                str_replace("Total Confirmed:", "") %>%
                str_trim()

    cured_discharged_migrated[i] <- html_text(html_node(webpage, xpath = cured_xpath)) %>%
                str_replace("Cured/ Discharged/ Migrated:", "") %>%
                str_trim()

    deaths[i] <- html_text(html_node(webpage, xpath = death_xpath)) %>%
                str_replace("Death:", "") %>%
                str_trim()
}

total_confirmed <- as.numeric(total_confirmed)
cured_discharged_migrated <- as.numeric(cured_discharged_migrated)
deaths <- as.numeric(deaths)

# Creating a dataframe
covid_df <- data.frame(state, total_confirmed, cured_discharged_migrated, deaths)

head(covid_df)

| state | total_confirmed | cured_discharged_migrated | deaths |
|---|---|---|---|
| Andaman and Nicobar | 24445 | 1325 | 29 |
| Andhra Pradesh | 296609 | 209100 | 2732 |
| Arunachal Pradesh | 2741 | 1893 | 5 |
| Assam | 79667 | 56734 | 197 |
| Bihar | 106307 | 76452 | 468 |
| Chandigarh | 2216 | 1183 | 30 |
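
Because the scrape depends on a hard-coded XPath, a quick sanity check on the result may help catch a changed page layout. The sketch below is one possible set of checks, run on the six rows shown above; the helper `check_covid_df` and the specific rules are assumptions for illustration, not part of the original notebook.

```r
# First six rows, copied from the output above
covid_sample <- data.frame(
  state = c("Andaman and Nicobar", "Andhra Pradesh", "Arunachal Pradesh",
            "Assam", "Bihar", "Chandigarh"),
  total_confirmed = c(24445, 296609, 2741, 79667, 106307, 2216),
  cured_discharged_migrated = c(1325, 209100, 1893, 56734, 76452, 1183),
  deaths = c(29, 2732, 5, 197, 468, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns TRUE only if every check passes
check_covid_df <- function(df) {
  counts <- df[, c("total_confirmed", "cured_discharged_migrated", "deaths")]
  all(
    !any(is.na(df)),        # a failed node lookup or conversion would leave NA
    all(nzchar(df$state)),  # no empty state names
    all(counts >= 0),       # counts cannot be negative
    # recovered plus deceased cannot exceed confirmed cases
    all(df$cured_discharged_migrated + df$deaths <= df$total_confirmed)
  )
}

sample_ok <- check_covid_df(covid_sample)
```

Running `check_covid_df(covid_df)` right after the scrape would apply the same checks to the full table.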


In [74]:
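# Saving the data frame (the data directory must already exist).
# Forward slashes work on both Windows and Unix-like systems.
save(covid_df, file = "./data/covid_statewise.Rdata")
# The file is passed positionally: the `path` argument of write_csv()
# is deprecated in recent readr versions in favour of `file`.
write_csv(covid_df, "./data/covid_statewise.csv")
write.xlsx(covid_df, file = "./data/covid_statewise.xlsx")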
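
The save step fails if the output directory is missing, so it can be worth creating it first and reading one file back as a check. The following is a minimal base R sketch under stated assumptions: it writes to a hypothetical directory under `tempdir()` instead of `./data`, uses `write.csv` and `read.csv` in place of `write_csv`, and uses a two-row stand-in (`sample_df`) for `covid_df`.

```r
# Hypothetical output directory under tempdir(), so the sketch leaves no files in the project
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Small stand-in for covid_df, using two rows from the output above
sample_df <- data.frame(
  state = c("Assam", "Bihar"),
  total_confirmed = c(79667, 106307),
  cured_discharged_migrated = c(56734, 76452),
  deaths = c(197, 468),
  stringsAsFactors = FALSE
)

# Write the CSV without the row-name column
csv_path <- file.path(out_dir, "covid_statewise.csv")
write.csv(sample_df, csv_path, row.names = FALSE)

# Read the file back and compare it with the original data frame
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
round_trip_ok <- isTRUE(all.equal(sample_df, restored))
```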