## <u>Data Preparation:</u>

> * <a href="#*-Data-Cleaning">DATA CLEANING</a>
> * <a href="#*-Data-Reduction">DATA REDUCTION</a>
    * <a href="#Scaling">Scaling</a>
> * <a href="#*-Data-transformation">DATA TRANSFORMATION</a>
    * <a href="#Arranging-data-Country-Wise">Arranging data Country-Wise</a>
    * <a href="#Explanation-of-Pooled-Datasets-(bulk-&-four)">Pooled Datasets</a>

Now that we have the raw data for our analysis, we can move on to the next phase, i.e. data preparation.<br />
* Data preparation is considered to be the <u>most time-consuming phase</u> of any data science project.<br />
* As a rough estimate, about <b>90%</b> of a typical data science project's time is spent on data collection and data preparation.<br /><br />

#### * Data Cleaning

Whenever we collect raw data from various sources, it usually contains many flaws.<br />
Most often, these are of the following types:
1. NAs and NaNs
2. Missing data values
3. Incorrect data values
<br /><br />
Checking for these flaws in our data:

In [1]:
# sample NA values
check.Confirmed[which(str_detect(check.Confirmed$Country.Region, "Cruise Ship")),]

# sample incorrect data (French Guiana): a cumulative count cannot decrease on the next day
check.Confirmed[which(str_detect(check.Confirmed$Province.State, "French Guiana")),]

# sample blank data ---> State (to be replaced by 'Others')
head(check.Recovered)

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


<br />
So, our datasets also have several issues (like <u>blanks in place of state names</u> and data of a <u>cruise ship among the countries' data</u>).<br />
To get rid of these issues, we perform data cleaning.<br />

For data cleaning, we consider either of these two methods (or both):
1. **Removal:**<br />
    Here we remove the rows/columns in which we find flaws.<br />
    These rows/columns might include NAs.<br /><br />
2. **Replacement/Filling:**<br />
    Here we replace the NAs, incorrect values or blanks with some acceptable value.<br />
    Mostly, values are replaced by the Mean or Mode, so that the overall statistical structure stays roughly the same.<br />
    Sometimes, we also fill them on the basis of some specific calculation.<br />
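
The two strategies above can be sketched on a tiny example. The data frame `toy` and its column names are made up for illustration and are not part of the project's datasets.

```r
# Hypothetical toy data: one NA in a numeric column
toy <- data.frame(id = 1:4, value = c(10, NA, 30, 20))

# 1. Removal: drop every row that contains an NA
removed <- na.omit(toy)

# 2. Replacement: fill the NA with the mean of the observed values
filled <- toy
filled$value[is.na(filled$value)] <- mean(toy$value, na.rm = TRUE)

nrow(removed)   # 3 rows remain
filled$value    # 10 20 30 20
```

Removal shrinks the dataset, while replacement keeps every row but invents a value, which is why the choice depends on what the data represents.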

### What will we do?

Our time-series datasets hold discrete, cumulative counts (people confirmed with COVID-19, people who have died of COVID-19 and people who have recovered from COVID-19), so these issues:
> 1. <u>can NOT be resolved by MEAN</u><br />
>     because in our case a data value can only remain CONSTANT or INCREASE from one day to the next,<br />
>     the MEAN need not be a whole number,<br />
>     and the MEAN can be less than the previous day's value.<br /><br />
> 2. <u>can NOT be resolved by MODE</u><br />
>     because this is medical data, so a missing value cannot be blindly replaced by the most frequently occurring number.<br />

Hence, we'll **NOT** be using **MEAN/MODE/MEDIAN replacement**.
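
A short illustration of the first point, using a made-up series of cumulative counts (the numbers are hypothetical, not taken from the datasets):

```r
# Hypothetical cumulative counts with one missing day
counts <- c(10, 12, NA, 15, 40)

# Mean replacement, as described above
mean_filled <- counts
mean_filled[is.na(mean_filled)] <- mean(counts, na.rm = TRUE)

mean_filled               # 10.00 12.00 19.25 15.00 40.00
is.unsorted(mean_filled)  # TRUE: the series now drops from day 3 to day 4
```

The filled value is not a whole number, and it makes the cumulative series decrease on the following day, which cannot happen with real counts.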

#### <u>We'll replace with maximum values</u>

We will replace each missing value or NA with the maximum value reported up to the day before.<br />
In other words, the last known count is carried forward to the day whose data is missing. A value that is lower than the previous day's count is treated as incorrect and replaced in the same way.
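
As a minimal sketch of this rule for a single row, assuming the counts are non-negative, the same result can be obtained with base R's `cummax()`. The helper name `fill_cumulative` is hypothetical and is not used elsewhere in this notebook.

```r
# Carry the running maximum forward over NAs and over values that decrease.
# Assumes x holds non-negative cumulative counts for one location.
fill_cumulative <- function(x) {
  x[is.na(x)] <- 0   # an NA never raises the running maximum
  cummax(x)          # each day becomes the largest value seen so far
}

fill_cumulative(c(NA, 1, NA, 3, 2, 5))   # 0 1 1 3 3 5
```

A leading NA becomes 0, a later NA takes the previous day's value, and the dip from 3 to 2 is corrected to 3, which mirrors the three cases handled by the nested loops in the next cell.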


In [5]:
# removing NAs, replacing incorrect values

for (i in 1:nrow(check.Confirmed)) {
  for (j in 5:ncol(check.Confirmed)) {
    if(j==5) {
      check.Confirmed[i,j] = ifelse(is.na(check.Confirmed[i, j]), 0, check.Confirmed[i,j])
    } else {
      if(is.na(check.Confirmed[i, j])){
        check.Confirmed[i,j] = check.Confirmed[i, (j-1)]
      } else if(check.Confirmed[i, (j-1)] > check.Confirmed[i, j]){
        check.Confirmed[i,j] = check.Confirmed[i, (j-1)]
      }
    }
  }
}

for (i in 1:nrow(check.Deaths)) {
  for (j in 5:ncol(check.Deaths)) {
    if(j==5) {
      check.Deaths[i,j] = ifelse(is.na(check.Deaths[i, j]), 0, check.Deaths[i,j])
    } else {
      if(is.na(check.Deaths[i, j])){
        check.Deaths[i,j] = check.Deaths[i, (j-1)]
      } else if(check.Deaths[i, (j-1)] > check.Deaths[i, j]){
        check.Deaths[i,j] = check.Deaths[i, (j-1)]
      }
    }
  }
}

for (i in 1:nrow(check.Recovered)) {
  for (j in 5:ncol(check.Recovered)) {
    if(j==5) {
      check.Recovered[i,j] = ifelse(is.na(check.Recovered[i, j]), 0, check.Recovered[i,j])
    } else {
      if(is.na(check.Recovered[i, j])){
        check.Recovered[i,j] = check.Recovered[i, (j-1)]
      } else if(check.Recovered[i, (j-1)] > check.Recovered[i, j]){
        check.Recovered[i,j] = check.Recovered[i, (j-1)]
      }
    }
  }
}





ERROR: Error in nrow(check.Confirmed): object 'check.Confirmed' not found


In [2]:
# replace blanks and incorrect country/state names

# replacing in states
states = as.character(check.Confirmed$Province.State)
states.levels = as.character(levels(check.Confirmed$Province.State))

states[states %in% ""] = "Others"
states.levels[states.levels %in% ""] = "Others"


#######
states[states %in% "From Diamond Princess"] = "Diamond Princess"
states.levels = states.levels[!states.levels %in% "From Diamond Princess"]


# replacing in countries
countries = as.character(check.Confirmed$Country.Region)
countries.levels = as.character(levels(check.Confirmed$Country.Region))

countries[countries %in% "US"] = "United States"
countries[countries %in% "UK"] = "United Kingdom"
countries[countries %in% "Taiwan*"] = "Taiwan"
countries[countries %in% "The Bahamas"] = "Bahamas"
countries[countries %in% "Gambia, The"] = "Gambia"
countries[countries %in% "Korea, South"] = "South Korea"
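# NOTE: "Congo (Brazzaville)" / "Republic of the Congo" and "Congo (Kinshasa)" are two different countries;
# this line merges all of them under one label
countries[countries %in% c("Congo (Brazzaville)", "Congo (Kinshasa)", "Republic of the Congo")] = "Democratic Republic of the Congo"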
###
countries.levels[countries.levels %in% "US"] = "United States"
countries.levels[countries.levels %in% "UK"] = "United Kingdom"
countries.levels[countries.levels %in% "Taiwan*"] = "Taiwan"
countries.levels[countries.levels %in% "The Bahamas"] = "Bahamas"
countries.levels[countries.levels %in% "Gambia, The"] = "Gambia"
countries.levels[countries.levels %in% "Korea, South"] = "South Korea"

countries.levels = countries.levels[!countries.levels %in% c("Congo (Brazzaville)", "Congo (Kinshasa)", "Republic of the Congo")]
countries.levels = c(countries.levels, "Democratic Republic of the Congo")
###############################

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


In [3]:
# rectified factors
states.factor  = factor(c(states), levels = c(states.levels))
countries.factor  = factor(countries, levels = countries.levels)


## BECAUSE THE FIRST 4 COLUMNS ARE COMMON TO ALL 3 DATASETS ##

# editing factors in datasets
check.Confirmed = cbind(
                    Province.State = states.factor,
                    Country.Region = countries.factor,
                    check.Confirmed[,3:ncol(check.Confirmed)]
                  )

check.Deaths = cbind(
                    Province.State = states.factor,
                    Country.Region = countries.factor,
                    check.Deaths[,3:ncol(check.Deaths)]
                  )

check.Recovered = cbind(
                    Province.State = states.factor,
                    Country.Region = countries.factor,
                    check.Recovered[,3:ncol(check.Recovered)]
                  )



ERROR: Error in factor(c(states), levels = c(states.levels)): object 'states' not found


**Now our data has been cleaned.** Let's view the cleaned data.

In [4]:
# sample NA val
check.Confirmed[which(str_detect(check.Confirmed$Country.Region, "Cruise Ship")),]

# sample wrong data "french guiana"
check.Confirmed[which(str_detect(check.Confirmed$Province.State, "French Guiana")),]

# sample blank data ---> State (replaced by 'Others')
head(check.Recovered)

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


<br /> 
#### * Data Reduction

Although we have cleaned the dataset, the **'Diamond Princess'** cruise ship is still present among the countries' data.<br />
It is not a country and behaves as an outlier, so it has to be separated.

**So, now we'll start the process of data reduction.**

In [5]:
# removing diamond princess
Diamond.Princess.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Country.Region, "Cruise Ship", negate = F)), ]
check.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Country.Region, "Cruise Ship", negate = T)), ]

Diamond.Princess.Deaths = check.Deaths[ which(str_detect(check.Deaths$Country.Region, "Cruise Ship", negate = F)),]
check.Deaths = check.Deaths[ which(str_detect(check.Deaths$Country.Region, "Cruise Ship", negate = T)), ]

Diamond.Princess.Recovered = check.Recovered[ which(str_detect(check.Recovered$Country.Region, "Cruise Ship", negate = F)), ]
check.Recovered = check.Recovered[ which(str_detect(check.Recovered$Country.Region, "Cruise Ship", negate = T)), ]

## Rectifying Row sequences
row.names(check.Confirmed) <- NULL
row.names(check.Deaths) <- NULL
row.names(check.Recovered) <- NULL

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


**Verifying if it is removed or not:**

In [6]:
# Let's check whether Diamond Princess is still at row 166 or not
check.Confirmed[166,]
#check.Deaths[166,]
#check.Recovered[166,]


# also checking the dimensions
cat("\nEarlier dimension: 468 X 62\n\n")    # as we saw initially

cat("New dimension: ", dim(check.Confirmed))

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


_**OK! so it's gone!**_
<br /><br />

#### Scaling

Now we have a dataset that holds the counts of COVID-19 cases for different geographical locations.<br />
Hence, we can now create a dataset to generate the map for every individual day (the one we saw earlier in this project).<br /><br />

* It means we want to <u>plot all the countries/regions</u> that are affected on a particular day
* For each country and each day, we have to decide <u>whether or not to plot that country on the world map</u>
* The **factor** on which we'll decide is <u>whether the country has any confirmed case</u> up to that day <u>or not</u>

* So, we'd also **need** the _Latitude and Longitude_ of those countries
<br /><br />

Finally, there are only 2 choices for any region, say **0** & **1**, such that:
> **0**:  don't plot on the map, because the total confirmed count on that day is 0 <br />
> **1**:  plot on the map, because there is at least 1 confirmed case on that day <br />

<br /> 
<font size="3">
    Therefore, we're going to use **Unit Scaling** to set all the values from the <u>5th to the last column</u> to either 0 or 1
</font>
<br /> 
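
As a sketch of the same 0/1 conversion without explicit loops, the comparison can be applied to whole columns at once. The data frame `toy` below is a made-up miniature with the same layout as our tables (four descriptive columns followed by daily counts); its values are hypothetical.

```r
# Hypothetical miniature of the wide table: 4 descriptive columns, then daily counts
toy <- data.frame(
  Province.State = c("A", "B"),
  Country.Region = c("X", "Y"),
  Lat = c(0, 1), Long = c(0, 1),
  d1 = c(0, 3), d2 = c(0, 7), d3 = c(2, 9)
)

scaled <- toy
day_cols <- 5:ncol(scaled)

# Each daily column becomes 1 where the count is non-zero and 0 otherwise
scaled[day_cols] <- lapply(scaled[day_cols], function(x) as.integer(x != 0))

scaled$d1   # 0 1
scaled$d3   # 1 1
```

The descriptive columns, including latitude and longitude, are left untouched, so the result can still be used to place the affected regions on a map.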

In [7]:

# in UNIT SCALING, all the data has either 0 or 1 value

ever.Affected = check.Confirmed

# Unit scaling
for (i in row.names(ever.Affected)) {
  for (j in 5:ncol(ever.Affected)) {
    if(ever.Affected[i,j] != 0)
      ever.Affected[i,j] = 1
  }
}

head(ever.Affected)


ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


<br /><br />

#### Next step is to find and remove the outliers:

We'll use **scatter plots** & **box plots** to _identify_ the outliers, and **compare the daily MEAN** with and without them to _verify_ them, so that we can remove them.

In [13]:
# Let's visualize our data to verify:
library(ggplot2)

options(repr.plot.width=14, repr.plot.height=8)
ggplot(check.Confirmed) +
  geom_point(aes(x=Province.State, y=X1.29.20), color="red", size=2) +
  theme(
          text = element_text(family = "Gill Sans")
          ,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
          ,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
          ,axis.text = element_text(size = 12)
          ,axis.title = element_text(size = 20)
          ,axis.text.x = element_blank()
  )

cat("\n\n")

#ggplot(check.Deaths) +
#  geom_point(aes(x=check.Deaths$Province.State, y=check.Deaths$X1.29.20), color="red", size=2)#

#ggplot(check.Recovered) +
#  geom_point(aes(x=check.Recovered$Province.State, y=check.Recovered$X1.29.20), color="red", size=2)
  

ERROR: Error in ggplot(check.Confirmed): object 'check.Confirmed' not found


In [8]:
# Let's find the name of this outlier:-

check.Confirmed[which(check.Confirmed$X1.29.20 > 400), c("Province.State", "Country.Region")]
#check.Deaths[which(check.Deaths$X1.29.20 > 15), c("Province.State", "Country.Region")]
#check.Recovered[which(check.Recovered$X1.29.20 > 20), c("Province.State", "Country.Region")]

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


<br /> 
So, it's **Hubei**.<br />
We'll verify it by comparing the daily mean values, <u>including and excluding the Hubei province</u>.
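
The idea behind this check can be sketched on made-up numbers: if dropping a single row changes the column means drastically, that row dominates the data. The names `toy`, `P1` and the values below are hypothetical.

```r
# Hypothetical table: one very large row (P1) and three small ones
toy <- data.frame(
  Province.State = c("P1", "A", "B", "C"),
  d1 = c(400, 4, 2, 6),
  d2 = c(900, 8, 4, 12)
)

with_all    <- colMeans(toy[, -1])                              # every row
without_big <- colMeans(toy[toy$Province.State != "P1", -1])    # P1 dropped

with_all      # d1 = 103, d2 = 231
without_big   # d1 = 4,   d2 = 8
```

Here the daily mean falls from 103 to 4 once `P1` is excluded, which is the kind of gap the comparison table in the next cell is meant to reveal.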

In [9]:
# Here we are trying to compare the mean values of everyday, including and excluding Hubei province
With.Hubei = as.numeric(apply(check.Confirmed[,5:ncol(check.Confirmed)], 2, mean))

exceptHubei = check.Confirmed[ which(str_detect(check.Confirmed$Province.State, "Hubei", negate = T)), ]
Without.Hubei = as.numeric(apply(exceptHubei[,5:ncol(exceptHubei)], 2, mean))

# creating a data frame for comparison
Mean.Comparision.Table = data.frame(
              "Date" = as.character(colnames(check.Confirmed)[5:ncol(check.Confirmed)]),
              "With Hubei" = c(With.Hubei),
              "Without Hubei" = c(Without.Hubei))

tail(Mean.Comparision.Table, 10)

ERROR: Error in apply(check.Confirmed[, 5:ncol(check.Confirmed)], 2, mean): object 'check.Confirmed' not found


<br />

So it's clear that Hubei is the outlier.<br />


In [17]:
# let's remove Hubei from our dataset:
Hubei.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Province.State, "Hubei", negate = F)), ]
check.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Province.State, "Hubei", negate = T)), ]

Hubei.Deaths = check.Deaths[ which(str_detect(check.Deaths$Province.State, "Hubei", negate = F)),]
check.Deaths = check.Deaths[ which(str_detect(check.Deaths$Province.State, "Hubei", negate = T)), ]

Hubei.Recovered = check.Recovered[ which(str_detect(check.Recovered$Province.State, "Hubei", negate = F)), ]
check.Recovered = check.Recovered[ which(str_detect(check.Recovered$Province.State, "Hubei", negate = T)), ]


## Rectifying Row sequences
row.names(check.Confirmed) <- NULL
row.names(check.Deaths) <- NULL
row.names(check.Recovered) <- NULL

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


In [18]:
Hubei.Confirmed

ERROR: Error in eval(expr, envir, enclos): object 'Hubei.Confirmed' not found


In [19]:
# Let's check the dimensions once more

# also checking the dimensions
cat("\nEarlier dimension: 467 X 62\n\n")    # after removing the cruise ship

cat("New dimension: ", dim(check.Confirmed))


Earlier dimension: 467 X 62



ERROR: Error in cat("New dimention: ", dim(check.Confirmed)): object 'check.Confirmed' not found


In [20]:
# Let's visualize once more
library(ggplot2)

options(repr.plot.width=14, repr.plot.height=8)
ggplot(check.Confirmed) +
  geom_point(aes(x=Province.State, y=X1.29.20), color="red", size=2) +
  theme(
          text = element_text(family = "Gill Sans")
          ,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
          ,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
          ,axis.text = element_text(size = 12)
          ,axis.title = element_text(size = 20)
          ,axis.text.x = element_blank()
  )

cat("\n\n")


ERROR: Error in ggplot(check.Confirmed): object 'check.Confirmed' not found


<br />
Although it's comparatively better now, we still have some outliers...

In [10]:
# Let's find them out, too:-
check.Confirmed[which(check.Confirmed$X1.29.20 > 100), c("Province.State", "Country.Region")]

ERROR: Error in eval(expr, envir, enclos): object 'check.Confirmed' not found


In [11]:
# Checking the mean comparison
With.China = as.numeric(apply(check.Confirmed[,5:ncol(check.Confirmed)], 2, mean))

exceptChina = check.Confirmed[ which(str_detect(check.Confirmed$Country.Region, "China", negate = T)), ]
Without.China = as.numeric(apply(exceptChina[,5:ncol(exceptChina)], 2, mean))

# comparison
Mean.Comparision.Table = data.frame(
              "Date" = as.character(colnames(check.Confirmed)[5:ncol(check.Confirmed)]),
              "With China" = c(With.China),
              "Without China" = c(Without.China))

head(Mean.Comparision.Table, 10)

ERROR: Error in apply(check.Confirmed[, 5:ncol(check.Confirmed)], 2, mean): object 'check.Confirmed' not found


```
So, when looking at the whole world, the whole of mainland China seems to be the outlier.
However, we'll have to verify it first.
```
<br /><br />
<u>But because it is not a single row, we will perform this step later, i.e. during data transformation.</u><br />

<hr /><br />

#### * Data transformation

In [12]:
# We've already saved the cleaned versions of all the files
# Loading the files in order to transform the dataset(s)

# loading the cleaned data
Confirmed = read.csv("Notebooks/syllabus/static/cleaned/time_series_19-covid-Confirmed.csv")
Deaths = read.csv("Notebooks/syllabus/static/cleaned/time_series_19-covid-Deaths.csv")
Recovered = read.csv("Notebooks/syllabus/static/cleaned/time_series_19-covid-Recovered.csv")

Hubei.Confirmed = read.csv("Notebooks/syllabus/static/cleaned/Hubei/time_series_19-covid-Confirmed.csv")
Hubei.Deaths = read.csv("Notebooks/syllabus/static/cleaned/Hubei/time_series_19-covid-Deaths.csv")
Hubei.Recovered = read.csv("Notebooks/syllabus/static/cleaned/Hubei/time_series_19-covid-Recovered.csv")

Diamond.Princess.Confirmed = read.csv("Notebooks/syllabus/static/cleaned/Diamond-Princess/time_series_19-covid-Confirmed.csv")
Diamond.Princess.Deaths = read.csv("Notebooks/syllabus/static/cleaned/Diamond-Princess/time_series_19-covid-Deaths.csv")
Diamond.Princess.Recovered = read.csv("Notebooks/syllabus/static/cleaned/Diamond-Princess/time_series_19-covid-Recovered.csv")

“cannot open file 'Notebooks/syllabus/static/cleaned/time_series_19-covid-Confirmed.csv': No such file or directory”


ERROR: Error in file(file, "rt"): cannot open the connection


In [13]:
# as known, all of these files have same set of columns,
# the only things that differ are data values in dates' columns

# Let's see the structure of any one dataset (as all are similar)
str(Hubei.Recovered)

ERROR: Error in str(Hubei.Recovered): object 'Hubei.Recovered' not found


<font size="3">
* Now, recalling the Problem Statement, we aim to find out the status of COVID-19 in China within the next 7 days
* In order to do so, we need to analyze the status of COVID-19 on all the previous days
</font> 

### What would it tell us?
This lets us estimate the RATE at which the coronavirus has been spreading since late January.<br /><br />
Hence, we need to transform the data so that:
> 1. rows hold the data <u>country-wise, instead of state-wise</u>
> 2. a <u>new column "Date"</u> stores the aggregate data (of the 3 datasets) in a single place
> 3. <u>unnecessary columns</u>, i.e. *States, Latitude & Longitude*, are removed
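
The first and third points can be sketched with base R's `rowsum()`, which adds up rows that share a group label. This is only an illustration on a made-up table (`toy` and its values are hypothetical); the notebook itself uses the helper functions defined further below.

```r
# Hypothetical state-level table with the same column layout as our datasets
toy <- data.frame(
  Province.State = c("S1", "S2", "Others"),
  Country.Region = c("X", "X", "Y"),
  Lat = c(0, 1, 2), Long = c(0, 1, 2),
  d1 = c(1, 2, 5), d2 = c(3, 4, 6)
)

# Sum the daily columns per country; state, latitude and longitude are left out
by_country <- rowsum(toy[, 5:ncol(toy)], group = toy$Country.Region)

by_country
#   d1 d2
# X  3  7
# Y  5  6
```

The country names end up as row names, so each row is one country and each column one day.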

#### Arranging data Country-Wise
<br /> 
**Steps:**

In [14]:
# We need country-level data:

# because many states have very few cases
tail(Confirmed)

# and most of the states' names are not identified
unknown = nrow(Recovered[which(str_detect(Recovered$Province.State, "Others")),])
cat(unknown, "/", nrow(Recovered), " States are NOT identified")

# Ultimately, any precaution, cure or action is more likely to be taken at the country level than at the level of an individual state, as this is a severe epidemic.
# Country-level data also makes it easier to estimate for the world as a whole, since we do not have much data about every single state.

ERROR: Error in tail(Confirmed): object 'Confirmed' not found


<br /> 
As the Country column is a *Factor*, we <u>can easily list the countries</u> that have reported confirmed cases (on a daily basis):

In [33]:
# factor() keeps this working even when read.csv() returns character columns (the default since R 4.0)
Countries = levels(factor(Confirmed$Country.Region))

cat("\nTotal number of affected countries: ", length(Countries), "\n\n\nCountries:")
head(as.matrix(Countries), 5) # first 5 countries (sorted by name)

ERROR: Error in levels(Confirmed$Country.Region): object 'Confirmed' not found


In [15]:
## Functions for extracting required data


# finds the total cases reported in the given country
    # (by adding up the data of all its states)
country.aggregate.daily  <-  function(dfName, country) {
  
  df <- get(dfName)
  # NOTE: str_detect() treats `country` as a regular expression and matches substrings,
  # so e.g. "Niger" would also select the rows of "Nigeria"
  df = df[which(str_detect(df$Country.Region, country)),]
  df = cbind(States = df[,1], Country = df[,2], df[,5:ncol(df)])     # eliminating the Latitude/Longitude columns
  
  row.names(df) <- NULL    
    
  temp = df                                             # all states' data of a country
  df = temp[1,] 
  
  df[3:ncol(temp)] = apply(   temp[,3:ncol(temp)],
                            2,
                            sum
                        )                               # applying sum of all the states' values
  df = df[2:ncol(df)]                                   # removing column 'States'  
  row.names(df) <- NULL  
  return(df)
}



# generates a data frame with the required data arranged country-wise
    # (by appending every single country's data)
countries.daily <-  function(dfName, cList) {
  
  n = length(cList)       # number of countries
  
  flag = 0
  
  for (i in cList) {
    
    if(flag == 0) {
      df = country.aggregate.daily(dfName, i)
      flag = 1
    } else {
      temp = country.aggregate.daily(dfName, i)
      df = rbind(df, temp)
    }    
  }
  
  row.names(df) <- NULL  
  return(df)
}

In [35]:
China.Confirmed = country.aggregate.daily("Confirmed", "China")
World.Confirmed = countries.daily("Confirmed", Countries)

China.Confirmed
cat("\n\n")
head(World.Confirmed)

China.Deaths = country.aggregate.daily("Deaths", "China")
World.Deaths = countries.daily("Deaths", Countries)
China.Recovered = country.aggregate.daily("Recovered", "China")
World.Recovered = countries.daily("Recovered", Countries)

ERROR: Error in get(dfName): object 'Confirmed' not found


<br /> 
### Moving to next step

#### We need datewise data:

* This is because we aim to analyze the data on a daily basis
>  Hence we'd have to add another column "Date" or simply "Day" (to hold the day number: 1, 2, ...)

* In order to do so, we'd have to transform our data into cross-sectional data (China, Hubei & Diamond Princess) or pooled data (countries of the world other than China)

<br /> 
#### Let's understand what <u>Cross-sectional</u> & <u>Pooled data</u> are:-
> * **Cross-sectional data:** Data of one or more variables, collected at the same point in time. <br />
> * **Pooled data:** A combination of time series data and cross-sectional data.<br />
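
A minimal sketch of the reshaping involved, on a made-up wide table (`wide`, `long` and the values are hypothetical): the daily columns are stacked so that every row becomes one country on one day.

```r
# Hypothetical wide table: one row per country, one column per day
wide <- data.frame(
  Country = c("X", "Y"),
  d1 = c(1, 0), d2 = c(2, 0), d3 = c(4, 1)
)

n_days <- ncol(wide) - 1

# Pooled (long) form: every country repeated once per day
long <- data.frame(
  Country   = rep(wide$Country, each = n_days),
  Day       = rep(seq_len(n_days), times = nrow(wide)),
  # t() puts days in rows, so as.vector() reads the counts country by country
  Confirmed = as.vector(t(as.matrix(wide[, -1])))
)

long$Confirmed   # 1 2 4 0 0 1
```

In this form, `long[long$Day == 2, ]` is a cross-section (all countries on one day), while the rows of a single country form a time series; the table as a whole is the pooled data.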

In [36]:
## Functions

countries.daily.bulk.summary = function(cList) { # date wise country data
  
  # structure of resulting dataset (initially blank)
  df <- data.frame(
    Country = NULL,
    Day = NULL,           # day no.
    Date = NULL,
    Confirmed = NULL,
    Deaths = NULL,
    Recovered = NULL
  )
  
  # calculating all countries' data (date wise) through iteration
  for(i in cList) {
    this.one.confirmed = country.aggregate.daily("Confirmed", i)
    this.one.deaths = country.aggregate.daily("Deaths", i)
    this.one.recovered = country.aggregate.daily("Recovered", i)
    
    times = ncol(this.one.confirmed)-1      # no. of days
    day = 1:times
    d = as.Date("21-01-2020", format = "%d-%m-%Y")
    
    date = format(day + d, "%d-%m-%Y")      # its length is equal to the number of days
    date = factor(c(date), levels = date)
    
    #max(Deaths.temp[1,5:ncol(Deaths.temp)])
    confirmed = as.numeric(this.one.confirmed[1,2:ncol(this.one.confirmed)])
    
    deaths = as.numeric(this.one.deaths[1,2:ncol(this.one.deaths)])
    
    recovered = as.numeric(this.one.recovered[1,2:ncol(this.one.recovered)])
    
    dataset <- data.frame(
      Country = rep(i, times),
      Day = factor(c(1:length(date)), levels = 1:length(date)),
      Date = date,
      Confirmed = confirmed,
      Deaths = deaths,
      Recovered = recovered
    )
    
    # joining this country
    df = rbind(df, dataset)
  }
    
  return(df)
}


In [16]:
bulk = countries.daily.bulk.summary(Countries)
head(bulk)

ERROR: Error in countries.daily.bulk.summary(Countries): could not find function "countries.daily.bulk.summary"


<br /> <br /> 
<font size="3">
<u>For better analysis, let's add 2 more columns:</u>
> **1. Closed.Cases** = all cases that have ended, either in death or in recovery<br />
> **2. Active.Cases** = cases that have ended neither in death nor in recovery

In [17]:
bulk$Active.Cases = bulk$Confirmed - (bulk$Deaths + bulk$Recovered)
bulk$Closed.Cases = bulk$Deaths + bulk$Recovered
tail(bulk)

ERROR: Error in eval(expr, envir, enclos): object 'bulk' not found


<br /> 
<font size="3">
So, our pooled dataset is ready.<br /><br />
<u>Let's understand this dataset</u>:-
</font>

In [18]:
# Analysing the Pooled data
str(bulk)

ERROR: Error in str(bulk): object 'bulk' not found


## Explanation of Pooled Datasets (bulk & four)
<br /> 
 \* <u>Pooled data is a combination of time series data and cross-sectional data</u> <br /><br />
 
 
   Number of columns: 8 <br /> 
   Here we are discussing the _**Bulk** dataset_
   
> #### Country:
   > * Datatype: **Factor** with 153-levels <br /> 
   > * Holds the name of the country for each daily record<br /> 
   > * Eg.: Japan
>
> #### Day:
   > * Datatype: **Factor** with 58-levels <br /> 
   > * Holds days numbered from 1 up to the last day <br /> 
   > * Eg.: for Jan 22<sup>nd</sup>, Day is 1; for Jan 23<sup>rd</sup>, Day is 2; and so on
>
> #### Date:
   > * Datatype: **Factor** with 58-levels <br /> 
   > * Holds dates in the format **dd-mm-yyyy**; each level is a character string produced from a _Date_ value <br /> 
   > * Eg.: 22-01-2020
>
> #### Confirmed:
   > * Datatype: **num** <br /> 
   > * Holds the total number of confirmed cases in a country, up to the given date/day <br /> 
   > * Eg.: up to 01-02-2020, Japan reported 20 COVID-19 cases
>
> #### Deaths:
   > * Datatype: **num** <br /> 
   > * Holds the total number of deaths in a country, up to the given date/day <br /> 
   > * Eg.: up to 01-02-2020, Japan reported no deaths
>
> #### Recovered:
   > * Datatype: **num** <br /> 
   > * Holds the total number of recoveries in a country, up to the given date/day <br /> 
   > * Eg.: up to 01-02-2020, Japan reported 1 recovery
>
> #### Active.Cases:
   > * Datatype: **num** <br /> 
   > * Holds the total Confirmed cases minus Deaths & Recoveries in a country, up to the given date/day <br /> 
   > * Eg.: up to 01-02-2020, Japan had 19 active cases
>
> #### Closed.Cases:
   > * Datatype: **num** <br /> 
   > * Holds the total number of Recoveries and Deaths in a country, up to the given date/day <br /> 
   > * Eg.: up to 01-02-2020, Japan had closed 1 COVID-19 case
   
   
#### Now we are all set to filter China out of this dataset!

In [19]:
# filtering out the China
China.dataset = bulk[which(str_detect(bulk$Country, 'China')),]

# World Pooled dataset (except china)
bulk = bulk[which(str_detect(bulk$Country, 'China', negate=T)),] # updating bulk itself

ERROR: Error in eval(expr, envir, enclos): object 'bulk' not found


In [20]:
head(China.dataset)

ERROR: Error in head(China.dataset): object 'China.dataset' not found


<br /> 
<font size="3">
In the same manner, we create <u>two</u> datasets:
* one holds <u>**all** the data of all the countries except Hubei in China</u>
* the other holds the whole data categorized into <u>four locations</u>.
<br /><br /> 
These four locations are:
> 1. Diamond Princess 
> 2. Hubei province (alone, handled separately just like the Diamond Princess cruise ship)
> 3. China alone (except Hubei province)
> 4. World (except China), collectively
    
<br />
    
The 2<sup>nd</sup> dataset is necessary because it contains all the outliers as well...
<br />
* Here we can take them into consideration because:
> 1. We are comparing them with the whole world's data collectively
> 2. This is the kind of MEDICAL data where outliers cannot be ignored! In fact, this single country and that ship are where the disease is spreading rapidly.
> 3. This 2nd dataset alone keeps track of the whole data reported up to the last date
</font>

<hr />
* We've already saved this dataset
<br />

In [21]:
## Load both datewise-datasets (world & FOUR)
# includes data of all the countries
all = read.csv('Notebooks/syllabus/static/pooled/countryWise_bulk_summary.csv')

# includes data of the four major locations
four = read.csv('Notebooks/syllabus/static/pooled/Four_dataset_locationWise.csv')

“cannot open file 'Notebooks/syllabus/static/pooled/countryWise_bulk_summary.csv': No such file or directory”


ERROR: Error in file(file, "rt"): cannot open the connection


In [22]:
str(all)

cat("\n\n")

str(four)

function (..., na.rm = FALSE)  




ERROR: Error in str(four): object 'four' not found


<br /> <br /> 
In the __*all dataset*__, everything is the same as in the 'Bulk' dataset <br />

In the __*four dataset*__: <br />

> #### Location:
   > * Datatype: **Factor** with 4-levels <br /> 
   > * Holds the name of Locations (as Countries in 'Bulk') for daily data <br /> 
   > * Levels: World, China, Hubei & Diamond Princess <br /> <br />
The remaining **7** columns are the same as those of the 'Bulk' dataset


### Let's analyze how China differs from the rest of the data using boxplots
#### Why Boxplot:-  <br /> 
> * It's a single visualization that shows several summary statistics at once <br />
> <img src="../pics/boxplot.png" height="50%" width="50%" alt="Boxplot explained"/> <br />
> * It's very easy to detect Outliers through boxplot <br />
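
The numbers behind a boxplot can also be inspected directly with `boxplot.stats()`. The vector `x` below is made up for illustration; by default, a point is flagged when it lies more than 1.5 times the box length beyond the box.

```r
# Hypothetical sample with one extreme value
x <- c(1, 2, 3, 4, 5, 100)

stats <- boxplot.stats(x)

stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # 100: the only point flagged as an outlier
```

The same rule is what `geom_boxplot()` uses by default to draw individual outlier points in the plots that follow.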

In [23]:
# Initially we plot the dataset with the major locations
options(repr.plot.width=16, repr.plot.height=8)
withChina<-ggplot(four, aes(x=Day, y=Confirmed, color=Day)) +
  geom_boxplot(aes(group=Day)) +
  labs(title="Including China") +
  theme_classic() +
  theme(
          text = element_text(family = "Gill Sans")
          ,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
          ,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
          ,axis.text = element_text(size = 12)
          ,axis.title = element_text(size = 20)
  )

cat("\n\n")
withChina

ERROR: Error in ggplot(four, aes(x = Day, y = Confirmed, color = Day)): could not find function "ggplot"


<br /><br /> 
<font size="3">
Here we get a <u>continuous sequence of **outliers**</u> for roughly the first 45 days<br />
From our previous analysis (through word clouds and mean comparison), we assumed China to be this outlier<br /><br /> 
In order to test our hypothesis, let's plot China alone, as well as the rest of the data without China
</font>

In [24]:
options(repr.plot.width=16, repr.plot.height=8)

chinaAlone <- ggplot(four[which(str_detect(four$Location, "China", negate=F)),], aes(x=Day, y=Confirmed, color=Day)) +
  geom_point(aes(group=Day)) +
  labs(title="Only China") +
  theme_classic() +
  theme(
          text = element_text(family = "Gill Sans")
          ,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
          ,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
          ,axis.text = element_text(size = 12)
          ,axis.title = element_text(size = 20)
  )

withoutChina <- ggplot(four[which(str_detect(four$Location, "China", negate=T)),], aes(x=Day, y=Confirmed, color=Day)) +
  geom_boxplot(aes(group=Day)) +
  labs(title="Excluding China") +
  theme_classic() +
  theme(
          text = element_text(family = "Gill Sans")
          ,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
          ,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
          ,axis.text = element_text(size = 12)
          ,axis.title = element_text(size = 20)
  )

cat("\n\n")
chinaAlone
cat("\n\n")
withoutChina

ERROR: Error in ggplot(four[which(str_detect(four$Location, "China", negate = F)), : could not find function "ggplot"


<br /> 
#### Comparing the above 2 plots with our previous single plot of all four data categories together:-
<font size="3"> 
1. The first plot (China alone) follows a sequence very similar to the outliers' sequence
2. Along with this, when we plot the whole data again after removing China, we find no outliers at all
<br /> 
    
So, finally we can say that **China is an outlier**, and hence we'll study China separately!
</font>

In [25]:
# Let's view few of the rows in existing datasets
head(all)
head(four)

                   
1 .Primitive("all")

ERROR: Error in head(four): object 'four' not found


<hr /> <br /> 
Now, we aim to analyze the status of COVID-19 within the next 10 days, which means that we basically want to analyze the active or closed cases within that period.<br /><br />
But these 2 are just discrete figures that vary with the Confirmed, Recovered & Death counts<br />
It means all these figures can vary dynamically
<br /><br /> 
So, in this situation, finding a direct relation between the columns Confirmed/Recovered/Deaths and Active/Closed cases isn't an easy task.<br />
To establish a relationship between them, we can take the <u>**Rate of Increase** in Active/Closed cases</u>
```
i.e. what percent (%) of Confirmed cases are Active/Closed, which depends only on the total Confirmed cases
```

<br /> 
<font size="3">
<u>Hence, before we move towards creating a suitable model for our problem from the available dataset, we have to do one last transformation, by adding two more columns to our existing dataset, i.e.</u>
    1. Active Cases(%)
    2. Closed Cases(%)
</font>
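
As a worked sketch of these two columns, take the Japan figures from the column explanation above (20 confirmed, 19 active) together with two made-up rows. Treating a row with no confirmed cases as 0% closed is an assumption of this sketch.

```r
# Rows 1 and 3 are hypothetical; row 2 uses the Japan figures quoted earlier
toy <- data.frame(
  Confirmed    = c(0, 20, 50),
  Active.Cases = c(0, 19, 10)
)

# Share of confirmed cases that are still active (0 when nothing is confirmed)
toy$percent_active <- ifelse(toy$Confirmed == 0, 0,
                             toy$Active.Cases * 100 / toy$Confirmed)

# Share of confirmed cases that are closed (assumed 0 when nothing is confirmed)
toy$percent_closed <- ifelse(toy$Confirmed == 0, 0,
                             100 - toy$percent_active)

toy$percent_active   # 0 95 20
toy$percent_closed   # 0  5 80
```

Computing the closed share simply as `100 - percent_active` would report 100% closed for a location that has no confirmed cases yet, which is worth keeping in mind when reading the early days of the series.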

In [26]:
# calculate the percent (using Confirmed cases as total)
percent <- function(dfName) {
    df <- get(dfName)
    val <- df[, "Active.Cases"]
    Total <- df[, "Confirmed"]

    # 0 when there are no active cases (this also avoids 0/0 when nothing is confirmed),
    # otherwise the active cases as a percentage of the confirmed cases
    part <- ifelse(val == 0, 0, (val * 100) / Total)

    return(part)
}

In [27]:
# CASES -> percentage
four$'percent_active' = percent("four")     # Active cases, out of every 100 Confirmed cases
four$'percent_closed' = 100-percent("four") # Closed cases, out of every 100 Confirmed cases


all$'percent_active' = percent("all")     # Active cases, out of every 100 Confirmed cases
all$'percent_closed' = 100-percent("all") # Closed cases, out of every 100 Confirmed cases


ERROR: Error in get(dfName): object 'four' not found


In [28]:
# Look at the structure to check whether the new columns were added
str(all)

cat("\n\n")

str(four)

function (..., na.rm = FALSE)  




ERROR: Error in str(four): object 'four' not found


### *OK! Everything is set. Now we can go for the model creation...*

<br /><hr /><br />