## <u>Modeling:</u>

> * <a href="#Extracting-China">EXTRACTING CHINA</a>
> * <a href="#Choosing-the-Right-Algo.">CHOOSING THE RIGHT ALGO.</a>
> * <a href="#Data-Splitting:-train-test">DATA SPLITTING: Train-Test</a>
> * <a href="#Selection-of-an-algorithm">ALGORITHMS</a>
    * <a href="#1.-SVMK-Regression">SVMK Regression</a>
    * <a href="#2.-KNN-Regression">KNN Regression</a>
    * <a href="#3.-Linear-Regression">Linear Regression</a>
    * <a href="#4.-Polynomial-regression">Polynomial Regression</a>


* We now have data for each country as well as aggregate date-wise data for the whole world.
* We also have data for a few special locations that lie far from the overall trend (that is, outliers).
<br /><br /> 
* So we have to decide which country to analyze.
    * It can be selected with the help of the following function.

In [1]:

# extracting the desired dataset
library(stringr) # provides str_detect()

extractDatases <- function(region){
    if(region %in% c("Hubei", "World", "Diamond Princess")) {
        # special locations are stored separately in `four`
        temp = four[which(str_detect(four$Location, region)),]
    } else {
        # every other region is looked up by country in `all`
        temp = all[which(str_detect(all$Country, region)),]
    }
    row.names(temp) <- NULL # reset the row index

    return(temp)
}


<br /> 
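
The routing done by `extractDatases()` can be tried on toy data. The sketch below is only an illustration: `all_toy`, `four_toy` and `extract_toy()` are hypothetical stand-ins for the notebook's `all`, `four` and `extractDatases()`, the figures are made up, and base `grepl()` with a fixed-string match is used in place of `str_detect()` so that no package is needed.

```r
# Toy stand-ins for the notebook's `all` and `four` tables (hypothetical data)
all_toy <- data.frame(
  Country   = c("China", "China", "Italy"),
  Day       = c(1, 2, 1),
  Confirmed = c(100, 150, 3),
  stringsAsFactors = FALSE
)
four_toy <- data.frame(
  Location  = c("Hubei", "Hubei", "World"),
  Day       = c(1, 2, 1),
  Confirmed = c(444, 549, 555),
  stringsAsFactors = FALSE
)

# Same routing idea as extractDatases(): special locations come from one
# table, ordinary countries from the other
extract_toy <- function(region) {
  special <- c("Hubei", "World", "Diamond Princess")
  if (region %in% special) {
    temp <- four_toy[grepl(region, four_toy$Location, fixed = TRUE), ]
  } else {
    temp <- all_toy[grepl(region, all_toy$Country, fixed = TRUE), ]
  }
  row.names(temp) <- NULL  # reset the row index
  temp
}

china_rows <- extract_toy("China")  # served from the country table
hubei_rows <- extract_toy("Hubei")  # served from the special-location table
print(china_rows)
print(hubei_rows)
```

Keeping the special locations in their own table means a country-level query never silently mixes in an outlier; the cost is that China has to be recombined with Hubei by hand afterwards.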
### Extracting China

In [3]:
# country to be used throughout the analysis
rName = "China" # without Hubei, which is stored separately

In [2]:
# extracting the desired country/location
region1 = extractDatases(rName)



In [5]:
# Hubei is kept separately as an outlier, so the China rows above are incomplete;
    # adding the Hubei figures back in keeps the country-level dataset consistent

# joining Hubei for complete data of China
# (this assumes region1 and region2 have matching rows, one per day, in the same order)
region2 = extractDatases("Hubei")
region = cbind(region1[,1:3], region1[,4:8]+region2[,4:8])



In [6]:
# adding the percentage columns

region$'percent_active' = percent("region")     # Active cases, out of every 100 Confirmed cases
region$'percent_closed' = 100-percent("region") # Closed cases, out of every 100 Confirmed cases

head(region)



<br /> 
Because this dataset covers China only, the **Country** column is not necessary.<br />
Similarly, there is no need for **Date** when we already have <u>Day</u>.

In [3]:
region = region[, c(-1, -3)] # drop the Country and Date columns
head(region, 10)



<br /> 
<center> <u><em>Now our dataset is ready for modeling</em></u> </center>
<br />

### Choosing the Right Algo.

In [4]:
# setting the theme
library(ggplot2) # provides theme_set() and theme_classic()
theme_set(theme_classic())
# setting plot size
options(repr.plot.width=8, repr.plot.height=8)



<br /> <br /> 
We are working on a **Predictive Data Analysis** project (as discussed earlier).<br />
In this situation, we can choose between two types of predictive algorithms, depending on our objective.<br />
<br /> 
<font size="3">
<u>These two categories are</u>:<br />
    
1. Classification
2. Regression
</font>



<center>
    <u>To estimate the future status of COVID-19 cases in China, we'll be using <b>Regression</b></u>
</center>



<br /> 
## * Why Regression and not Classification?
<br /> 
<font size="3">
* <u>Classification</u>:
    * It assigns each data point to one of a discrete, fixed set of categories.<br />
    * Every new data point (for which the estimate is being made) must belong to one of the existing classes/categories.<br />
<br /><br /> 
* <u>Regression</u>:
    * It predicts a new numeric outcome based on the earlier trend.<br />
    * The predicted value does not have to be one of a given set of categories.<br />
</font>
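
The difference between the two can be seen on a small example. The sketch below uses a made-up series (the `toy` data frame and its values are assumptions, not the notebook's data): the regression target stays numeric, while the classification target is first reduced to three arbitrary labels.

```r
# Hypothetical toy series: active-case percentage falling over ten days
toy <- data.frame(
  Day = 1:10,
  percent_active = c(98, 96, 93, 90, 86, 81, 77, 70, 64, 59)
)

# Regression: the target is numeric, so a prediction can be any real value
reg_fit  <- lm(percent_active ~ Day, data = toy)
reg_pred <- predict(reg_fit, newdata = data.frame(Day = 11))

# Classification: the target is reduced to a fixed set of labels,
# so a prediction could only ever be one of those labels
toy$level <- cut(toy$percent_active,
                 breaks = c(0, 75, 90, 100),
                 labels = c("low", "medium", "high"))

print(reg_pred)           # a single number, not restricted to values seen before
print(levels(toy$level))  # the only outcomes a classifier could return
```

The break points 75 and 90 are arbitrary; any such binning throws away the exact percentage, which is the quantity this analysis wants to estimate.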

<hr /><br /> <br /> 
<center>
    <i>We want to estimate the upcoming figure for <em>Active Cases %</em> in China specifically, and that figure is a continuous value rather than one of a limited set of values.</i>
    <br /> <br /> 
    <font size="4">
        <u>That is why Regression has to be used!</u>
    </font>
</center>



<br /> 
We will compare <u>3 regression algorithms</u> to predict the required value from all of the remaining columns.<br />
Then, based on how accurate the predictions are with the different columns, we'll choose the column to use for prediction.

In [5]:
## REGRESSIONS

# converting from Factor
region$Day <- as.numeric(as.character(region$Day))

# predictors: every column except percent_active (column 7); target: percent_active
x <- as.matrix(region[, c(1:6, 8)])
y <- as.matrix(region[,7])

end = "\n\n#############################################################\n\n\n"

cat("\n\nLinear Regression:\n------------------\n")
# fit model
fit <- lm(percent_active~., region)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, region)
# summarize accuracy (root mean squared error on the training data)
rmse <- sqrt(mean((region$percent_active - predictions)^2))
cat("RMSE: ", rmse)



cat(end, "k-Nearest Neighbors:\n--------------------\n")
# load the libraries
library(caret) # provides knnreg()
# fit model
fit <- knnreg(x, y, k=3)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, x)
# summarize accuracy
rmse <- sqrt(mean((y - predictions)^2))
cat("RMSE: ", rmse)



cat(end, "Support Vector Machine:\n-----------------------\n")
# load the libraries
library(kernlab) # provides ksvm()
# fit model
fit <- ksvm(percent_active~., region, kernel="rbfdot")
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, region)
# summarize accuracy
rmse <- sqrt(mean((y - predictions)^2))
cat("RMSE: ", rmse)




#############################################################



<hr />
<br /><br /> 
<font size="3">
    From the summaries of the regression algorithms above, we find that the <b>Day</b> column should be given priority.

As in the summary of the linear regression:<br /><br /> 
```
                  |   Estimate   | Std. Error |  t value  |  Pr(>|t|)  
     -------------|--------------|------------|-----------|-----------
     (Intercept)  |  1.000e+02   | 1.202e-14  | 8.319e+15 |   < 2e-16  ***
     Day          |  5.221e-15   | 1.270e-15  | 4.112e+00 |  0.000140  ***
```
<br /> 

* Active Case(%) could easily be calculated if we knew Closed Case(%), or the Confirmed/Death/Recovered figures.
    * But we do not have any of these for future dates; if we did, we would already have the future estimate
    * and would not need this analysis at all.
<br /><br /> 
* So, finally, we choose **Day** as the predictor for Active Case(%).<br />
</font>
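
The reasoning above can be illustrated with a small sketch. It assumes a made-up `toy` data frame that reuses the notebook's column names; the values are not real data. Because `percent_closed` is defined as `100 - percent_active`, a model that is given `percent_closed` recovers the target exactly, whereas `Day` is the only column whose future values are known in advance.

```r
# Hypothetical toy data with the same column names as the notebook's `region`
toy <- data.frame(Day = 1:8,
                  percent_active = c(97, 94, 90, 85, 79, 72, 66, 58))
toy$percent_closed <- 100 - toy$percent_active

# With percent_closed as a predictor the target is recovered exactly,
# which says nothing about how well future days can be forecast
leaky_fit  <- lm(percent_active ~ Day + percent_closed, data = toy)
leaky_rmse <- sqrt(mean(residuals(leaky_fit)^2))

# Day is known for any future date, so it is a usable predictor
day_fit  <- lm(percent_active ~ Day, data = toy)
day_rmse <- sqrt(mean(residuals(day_fit)^2))
future   <- predict(day_fit, newdata = data.frame(Day = 9:10))

cat("RMSE with percent_closed:", leaky_rmse, "\n")
cat("RMSE with Day only:      ", day_rmse, "\n")
print(future)
```

A near-zero error from the first model is a sign of a redundant predictor rather than of a good forecast, which is why only the second model is of practical use here.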

In [13]:
# Visualizing the available data as a scatter plot to see how the Active Cases(%) in China have varied over the days since 22nd January, 2020

# Day vs Active Cases(%)
region.scatter.plot <- ggplot(region, aes(x = Day, y = percent_active)) +
                        geom_point() +
                        labs( x = "Days", y = "Active Cases (%)") +
                        theme(
                              text = element_text(family = "Gill Sans")
                              ,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
                              ,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
                              ,axis.text = element_text(size = 12)
                              ,axis.title = element_text(size = 20)
                              )
              
region.scatter.plot



<br /> 
From the above visualization, it is clear that, on average, the active case percentage has kept decreasing over the days.<br />

Now we'll **split** our China dataset into <u>*training (80%) &amp; testing (20%)*</u> sets.

<br /> 
### Data Splitting: train-test

In [6]:
library(caret)    # provides createDataPartition()
library(magrittr) # provides the %>% pipe

set.seed(20) # generates the same random sample every time

training.samples <- region$Day %>%
  createDataPartition(p = 0.8, list = FALSE)

train.data  <- region[training.samples, ]
test.data <- region[-training.samples, ]

# Dimensions of the split datasets
dim(train.data)
dim(test.data)



In [7]:
head(train.data, 3)
head(test.data, 3)



<br />
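
For readers without `caret`, the idea behind the 80/20 split can be sketched in base R. This is not what `createDataPartition()` does internally (as far as I know, for a numeric outcome it also balances the sample across quantile groups); it is a plain random split on a hypothetical `toy` data frame.

```r
set.seed(20)  # fixes the random draw so the split is reproducible

# Hypothetical stand-in for the notebook's `region` data frame
toy <- data.frame(Day = 1:50,
                  percent_active = seq(99, 50, length.out = 50))

# Draw 80% of the row indices at random for training
n_train   <- floor(0.8 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

train_toy <- toy[train_idx, ]   # rows used to fit the model
test_toy  <- toy[-train_idx, ]  # held-out rows used only for evaluation

print(dim(train_toy))
print(dim(test_toy))
```

The two sets share no rows, so any error measured on `test_toy` reflects data the model has not seen.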

### Selection of an algorithm
Now that our training and testing datasets are ready, we have to select the <u>right regression algorithm to train our model</u>.<br /> 
<br />
<font size="3">

For this, we will compare four regression algorithms:

1. Support Vector Machines Regression
2. k-Nearest Neighbor Regression
3. Linear Regression
4. Polynomial Regression
    
<br /> 
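
To give a feel for why more than one algorithm is worth trying, the sketch below compares just two of the four candidates, linear and polynomial regression, using base R only. The curved `toy` series is made up for illustration, and the degree of 2 is an assumption rather than the notebook's choice.

```r
# Hypothetical curved toy series
toy <- data.frame(Day = 1:12)
toy$percent_active <- 100 - 0.3 * toy$Day^2

# A straight line versus a degree-2 polynomial in Day
linear_fit <- lm(percent_active ~ Day, data = toy)
poly_fit   <- lm(percent_active ~ poly(Day, 2), data = toy)

# Root mean squared error of a fitted model on its own training data
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

cat("Linear RMSE:    ", rmse(linear_fit), "\n")
cat("Polynomial RMSE:", rmse(poly_fit), "\n")
```

When the trend bends, the straight line leaves a systematic error that the polynomial removes; on real data the comparison should be made on the test set, not the training set.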
### What should be compared?
* Errors (should be minimized)
* Accuracy (should be maximized)
* R-Squared (R<sup>2</sup>)
* How well the model fits the test data (as seen in the graphs), etc.
</font>
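
The numeric criteria listed above can be computed with a few lines of base R. The helper below is a hypothetical sketch (`regression_metrics()` is not defined elsewhere in this notebook) and the four observed/predicted values are made up.

```r
# Hypothetical helper: common regression metrics from observed and predicted values
regression_metrics <- function(actual, predicted) {
  err  <- actual - predicted
  rmse <- sqrt(mean(err^2))   # root mean squared error
  mae  <- mean(abs(err))      # mean absolute error
  r2   <- 1 - sum(err^2) / sum((actual - mean(actual))^2)  # R-squared
  c(RMSE = rmse, MAE = mae, R2 = r2)
}

# Made-up observed and predicted values
actual    <- c(90, 80, 70, 60)
predicted <- c(88, 82, 69, 61)

m <- regression_metrics(actual, predicted)
print(m)
```

Lower RMSE and MAE and an R<sup>2</sup> closer to 1 indicate a better fit; computing them on `test.data` predictions keeps the comparison between algorithms fair.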

<hr /><br /> 