## <u>Data Understanding:</u>

> * <a href="#*-Data-&-Problem-Statement">DATA &amp; PROBLEM STATEMENT</a>
> * <a href="#Beginning-with-the-coding-section:">BEGINNING WITH THE CODING SECTION</a>
> * <a href="#*-Data-Collection">DATA COLLECTION</a>

### * Data & Problem Statement

Our aim is to analyze how the status of the epidemic will change within a week and, from that, how long it might take China to bring this infection under control.<br />
Further, we aim to generalize the analysis to any country, so that we can find the same status whenever required. We would have to change a few parameters, but we would not have to repeat the data preparation for each country.<br /><br />

<b>Hence we need the data about:</b>
> 1. <u>Confirmed</u> cases of COVID-19, throughout the world
> 2. <u>Death</u> cases due to COVID-19, throughout the world
> 3. Total <u>Recovery</u> from COVID-19, throughout the world

We've collected this data from various sources including <a href="https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6" target="_blank">CSSE @Johns Hopkins University</a>, <a href="https://www.who.int/emergencies/diseases/novel-coronavirus-2019" target="_blank">WHO</a> & <a href="https://www.mohfw.gov.in/" target="_blank">MoHFW - India</a>.<br />
<br /> 

#### Beginning with the coding section:

> Let's start by setting up our working directory and loading all the required packages:

### Initial Project Setup

In [2]:
# Suppress warnings for the whole session
options(warn = -1)
# Setting the working directory
setwd("~/Documents/A-tracking-of-2019-nCoV/COVID-19/")

#####  Loading LIBRARIES  #####
library(stringr)

library(AUCRF)
library(randomForest)
library(RFmarkerDetector)

library(caret)
library(mlbench)
library(kernlab)

Loading required package: randomForest

randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.

AUCRF 1.1


Attaching package: ‘RFmarkerDetector’


The following object is masked from ‘package:stats’:

    screeplot


Loading required package: lattice

Loading required package: ggplot2


Attaching package: ‘ggplot2’


The following object is masked from ‘package:randomForest’:

    margin



Attaching package: ‘kernlab’


The following object is masked from ‘package:ggplot2’:

    alpha




<br />
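The `library()` calls above stop with an error if a package is not installed. A minimal sketch of a check that can be run first is given here; the helper name `missing_packages` is hypothetical and not part of the original notebook, and the package list simply mirrors the calls above.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The packages loaded in the setup cell above.
required <- c("stringr", "AUCRF", "randomForest", "RFmarkerDetector",
              "caret", "mlbench", "kernlab")
not_found <- missing_packages(required)

# Report anything that still needs install.packages().
if (length(not_found) > 0) {
  message("Install before continuing: ", paste(not_found, collapse = ", "))
}
```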

### * Data Collection

We collected the <b>raw data</b> from websites and a GitHub repository.

In [4]:
# loading raw data
check.Confirmed = read.csv("Notebooks/syllabus/static/raw/time_series_19-covid-Confirmed.csv")
check.Deaths = read.csv("Notebooks/syllabus/static/raw/time_series_19-covid-Deaths.csv")
check.Recovered = read.csv("Notebooks/syllabus/static/raw/time_series_19-covid-Recovered.csv")

The data that we have collected is in the <b>CSV</b> (i.e. "<u><i>comma-separated values</i></u>") format.<br />
A CSV file stores <i>structured data</i> row-wise, where the data elements in each row are separated by a comma (,).

It's pretty similar to the following:
```
Belinda Jameson,2017,Cushing House,148,3.52
Jeff Smith,2018,Prescott House,17-D,3.20
```
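To see how R turns such text into rows and columns, the two sample lines can be parsed directly. This is a minimal sketch: the column names are assumptions made for illustration, since the sample has no header row.

```r
# The two sample lines from above, held as one string.
csv_text <- "Belinda Jameson,2017,Cushing House,148,3.52
Jeff Smith,2018,Prescott House,17-D,3.20"

# header = FALSE because the sample has no header row;
# the column names below are hypothetical labels.
students <- read.csv(
  text = csv_text,
  header = FALSE,
  col.names = c("name", "year", "house", "room", "gpa"),
  stringsAsFactors = FALSE
)

# Show the inferred type of every column.
str(students)
```

Here `room` is read as text rather than a number, because `17-D` cannot be parsed as a number.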
<br />
Let's view the head portion of our raw data:

In [5]:
# view sample
head(check.Confirmed)
head(check.Deaths)
head(check.Recovered)

Province.State,Country.Region,Lat,Long,X1.22.20,X1.23.20,X1.24.20,X1.25.20,X1.26.20,X1.27.20,...,X3.10.20,X3.11.20,X3.12.20,X3.13.20,X3.14.20,X3.15.20,X3.16.20,X3.17.20,X3.18.20,X3.19.20
,Thailand,15.0,101.0,2,3,5,7,8,8,...,53,59,70,75,82,114,147,177,212,272
,Japan,36.0,138.0,2,1,2,2,4,4,...,581,639,639,701,773,839,825,878,889,924
,Singapore,1.2833,103.8333,0,1,3,3,4,5,...,160,178,178,200,212,226,243,266,313,345
,Nepal,28.1667,84.25,0,0,0,1,1,1,...,1,1,1,1,1,1,1,1,1,1
,Malaysia,2.5,112.5,0,0,0,3,4,4,...,129,149,149,197,238,428,566,673,790,900
British Columbia,Canada,49.2827,-123.1207,0,0,0,0,0,0,...,32,39,46,64,64,73,103,103,186,231


Province.State,Country.Region,Lat,Long,X1.22.20,X1.23.20,X1.24.20,X1.25.20,X1.26.20,X1.27.20,...,X3.10.20,X3.11.20,X3.12.20,X3.13.20,X3.14.20,X3.15.20,X3.16.20,X3.17.20,X3.18.20,X3.19.20
,Thailand,15.0,101.0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
,Japan,36.0,138.0,0,0,0,0,0,0,...,10,15,16,19,22,22,27,29,29,29
,Singapore,1.2833,103.8333,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
,Nepal,28.1667,84.25,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
,Malaysia,2.5,112.5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,2,2
British Columbia,Canada,49.2827,-123.1207,0,0,0,0,0,0,...,1,1,1,1,1,1,4,4,7,7


Province.State,Country.Region,Lat,Long,X1.22.20,X1.23.20,X1.24.20,X1.25.20,X1.26.20,X1.27.20,...,X3.10.20,X3.11.20,X3.12.20,X3.13.20,X3.14.20,X3.15.20,X3.16.20,X3.17.20,X3.18.20,X3.19.20
,Thailand,15.0,101.0,0,0,0,0,2,2,...,33,34,34,35,35,35,35,41,42,42
,Japan,36.0,138.0,0,0,0,0,1,1,...,101,118,118,118,118,118,144,144,144,150
,Singapore,1.2833,103.8333,0,0,0,0,0,0,...,78,96,96,97,105,105,109,114,114,114
,Nepal,28.1667,84.25,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
,Malaysia,2.5,112.5,0,0,0,0,0,0,...,24,26,26,26,35,42,42,49,60,75
British Columbia,Canada,49.2827,-123.1207,0,0,0,0,0,0,...,4,4,4,4,4,4,4,4,4,4


#### We can see that the collected data is of type: <u>time-series</u>
> <b>Time series data</b> for a variable is a set of observations of its values at different points in time. The observations are usually collected at fixed intervals, such as daily, weekly, monthly, quarterly, or annually.

<br />
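In these files the observation date is encoded in each column name, such as `X1.22.20`; `read.csv` adds the `X` prefix because an R name may not start with a digit. The sketch below assumes the month.day.two-digit-year pattern seen in the output above. It turns such names into `Date` values and checks that the interval between them is fixed:

```r
# Column names as they appear in the head() output above.
date_cols <- c("X1.22.20", "X1.23.20", "X1.24.20", "X1.25.20")

# Drop the leading "X" and parse month.day.two-digit-year.
dates <- as.Date(sub("^X", "", date_cols), format = "%m.%d.%y")

# A fixed daily interval means every gap between consecutive dates is 1 day.
gaps <- as.integer(diff(dates))
print(dates)
print(gaps)
```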
We observe that all three datasets have the same columns as well as the same types of data.

In [6]:
# columns
cat("Number of columns in all 3 datasets:-\n\n")

matrix(
    c("Confirmed", "Deaths", "Recovered", ncol(check.Confirmed), ncol(check.Deaths), ncol(check.Recovered)),
    nrow = 2, ncol = 3, byrow = T
)

# Dimensions
cat("Dimensions of datasets:-\n")   # same for all 3
dim(check.Confirmed)


######################
# columns' name
#cat("Name of columns in all 3 datasets:-\n")

#colnames(check.Confirmed)
#colnames(check.Deaths)
#colnames(check.Recovered)

Number of columns in all 3 datasets:-



0,1,2
Confirmed,Deaths,Recovered
62,62,62


Dimensions of datasets:-


<br />
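Equal column counts do not by themselves guarantee identical columns. A stricter check compares column names and column classes directly. It is sketched here on small stand-in data frames (only a few columns and two rows, not the real files):

```r
# Small stand-ins for two of the datasets; only the layout matters here.
confirmed <- data.frame(Country.Region = c("Thailand", "Japan"),
                        Lat = c(15, 36), X1.22.20 = c(2L, 2L))
deaths    <- data.frame(Country.Region = c("Thailand", "Japan"),
                        Lat = c(15, 36), X1.22.20 = c(0L, 0L))

# Same names in the same order?
same_names <- identical(colnames(confirmed), colnames(deaths))

# Same storage class for every column?
col_classes <- function(df) vapply(df, function(col) class(col)[1], character(1))
same_types <- identical(col_classes(confirmed), col_classes(deaths))

print(c(names = same_names, types = same_types))
```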
Let's see the structure of these datasets:

In [7]:
str(check.Confirmed)
#str(check.Deaths)
#str(check.Recovered)

'data.frame':	468 obs. of  62 variables:
 $ Province.State: Factor w/ 321 levels "","Adams, IN",..: 1 1 1 1 1 25 196 298 237 1 ...
 $ Country.Region: Factor w/ 155 levels "Afghanistan",..: 142 76 129 103 90 27 8 8 8 25 ...
 $ Lat           : num  15 36 1.28 28.17 2.5 ...
 $ Long          : num  101 138 103.8 84.2 112.5 ...
 $ X1.22.20      : int  2 2 0 0 0 0 0 0 0 0 ...
 $ X1.23.20      : int  3 1 1 0 0 0 0 0 0 0 ...
 $ X1.24.20      : int  5 2 3 0 0 0 0 0 0 0 ...
 $ X1.25.20      : int  7 2 3 1 3 0 0 0 0 0 ...
 $ X1.26.20      : int  8 4 4 1 4 0 3 1 0 0 ...
 $ X1.27.20      : int  8 4 5 1 4 0 4 1 0 1 ...
 $ X1.28.20      : int  14 7 7 1 4 1 4 1 0 1 ...
 $ X1.29.20      : int  14 7 7 1 7 1 4 1 1 1 ...
 $ X1.30.20      : int  14 11 10 1 8 1 4 2 3 1 ...
 $ X1.31.20      : int  19 15 13 1 8 1 4 3 2 1 ...
 $ X2.1.20       : int  19 20 16 1 8 1 4 4 3 1 ...
 $ X2.2.20       : int  19 20 18 1 8 1 4 4 2 1 ...
 $ X2.3.20       : int  19 20 18 1 8 1 4 4 2 1 ...
 $ X2.4.20       : int  25 22 24 1

<br />
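The `Factor w/ 321 levels` lines in the output above mean that R stores each distinct string once, as a level, and keeps one integer code per row. A minimal sketch on four illustrative values follows. Note that on R 4.0 or later, `read.csv` returns plain character columns unless `stringsAsFactors = TRUE` is passed.

```r
# Four province values, two of them empty, similar to the first rows of the data.
province <- factor(c("", "", "Hubei", "British Columbia"))

levels(province)      # the distinct values, sorted
nlevels(province)     # how many distinct values there are
as.integer(province)  # the per-row integer codes that str() prints
```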

### Explanation of each column in the fetched data:-
* There are 3 dedicated datasets, one each for Confirmed/Death/Recovery cases all around the world

> #### Province.State:
   > * Data-type: **factor** (i.e. categorical; the values are specific, valid names drawn from a fixed set of unique levels)
   > * Holds the name of the City/Province/State that the data comes from
   > * Eg.: _Hubei_
>
> #### Country.Region:
   > * Data-type: **factor** (i.e. categorical; the values are specific, valid names drawn from a fixed set of unique levels)
   > * Holds the name of the country in which the reported area lies
   > * Eg.: _China_ (Hubei is a Province of China)
>
> #### Lat:
   > * Data-type: **numeric** (i.e. can have values in decimals, too)
   > * Holds the latitude of the given place (as in col. 1)
   > * Eg.: _Latitude_ position of Hubei = 30.9756
>
> #### Long:
   > * Data-type: **numeric** (i.e. can have values in decimals, too)
   > * Holds the longitude of the given place (as in col. 1)
   > * Eg.: _Longitude_ position of Hubei = 112.2707
>
> #### Col. 5 to 62:
   > * Data-type: **integer** (i.e. discrete) and _never negative_, as it counts the _no. of individuals_
   > * It's time series data, collected at regular (daily) intervals
   > * Each column holds the value for one day in the series (starting from 22/01/2020)
   > * The constant entity is the location: each row represents one location
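
The column descriptions above can be turned into checks. This is a minimal sketch on a two-row stand-in whose values are copied from the outputs in this section; the helper name `check_layout` is hypothetical.

```r
# Two rows copied from the outputs shown in this section (first three dates only).
sample_rows <- data.frame(
  Province.State = factor(c("Hubei", "")),
  Country.Region = factor(c("China", "Nepal")),
  Lat  = c(30.9756, 28.1667),
  Long = c(112.2707, 84.25),
  X1.22.20 = c(444L, 0L),
  X1.23.20 = c(444L, 0L),
  X1.24.20 = c(549L, 0L)
)

# Hypothetical helper: TRUE when the frame matches the described layout.
check_layout <- function(df) {
  counts <- df[, -(1:4), drop = FALSE]   # columns 5 onward hold the daily counts
  is.factor(df$Province.State) &&
    is.factor(df$Country.Region) &&
    is.numeric(df$Lat) && is.numeric(df$Long) &&
    all(vapply(counts, is.integer, logical(1))) &&
    all(counts >= 0)
}

check_layout(sample_rows)
```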

In [13]:
# showing data of Hubei
cat("\nSample data for one location from the \"Confirmed Cases\" dataset:\n")
check.Confirmed[which(str_detect(check.Confirmed$Province.State, "Hubei")),]


Sample data for one location from the "Confirmed Cases" dataset:


Unnamed: 0,Province.State,Country.Region,Lat,Long,X1.22.20,X1.23.20,X1.24.20,X1.25.20,X1.26.20,X1.27.20,...,X3.10.20,X3.11.20,X3.12.20,X3.13.20,X3.14.20,X3.15.20,X3.16.20,X3.17.20,X3.18.20,X3.19.20
155,Hubei,China,30.9756,112.2707,444,444,549,761,1058,1423,...,67760,67773,67781,67786,67790,67794,67798,67799,67800,67800


<br />
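The counts in each row appear to be cumulative totals (the Hubei values shown never decrease). Under that assumption, the number of newly confirmed cases per day is the difference between consecutive columns. A minimal sketch using the first six Hubei values shown above:

```r
# First six confirmed counts for Hubei, 22 to 27 January 2020.
hubei <- c(X1.22.20 = 444L, X1.23.20 = 444L, X1.24.20 = 549L,
           X1.25.20 = 761L, X1.26.20 = 1058L, X1.27.20 = 1423L)

# diff() subtracts each value from the one after it, giving new cases per day.
new_cases <- diff(hubei)
print(new_cases)
```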