# Multiple Regression with the iris dataset

> Which columns best identify Species in the iris data?

## Data

In [94]:
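# stringsAsFactors = TRUE keeps Species a factor (R >= 4.0 reads strings as character by default)
iris <- read.csv("./Data/iris.csv", stringsAsFactors = TRUE)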
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


In [95]:
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


# Multiple Regression

In [96]:
options(scipen=100)

In [97]:
# Species is a factor, so as.numeric() converts it to its integer level codes
# (1 = setosa, 2 = versicolor, 3 = virginica) for use as the response.
iris_reg <- lm(as.numeric(Species) ~ . , data = iris)
iris_reg


Call:
lm(formula = as.numeric(Species) ~ ., data = iris)

Coefficients:
 (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
     1.18650      -0.11191      -0.04008       0.22865       0.60925  
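
The `as.numeric(Species)` call in the formula relies on how R stores factors. A minimal sketch of that conversion follows; the `species` vector is a made-up example and not part of the notebook's data.

```r
# Illustrative factor; the values are invented for this sketch.
species <- factor(c("virginica", "setosa", "versicolor", "setosa"))

# Levels are sorted alphabetically by default.
species_levels <- levels(species)
species_levels

# as.numeric() returns the position of each value in levels(species).
codes <- as.numeric(species)
codes
```

Regressing on these codes treats the species as ordered and equally spaced (1, 2, 3), which is an assumption of this approach rather than a property of the data.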


In [98]:
summary(iris_reg)


Call:
lm(formula = as.numeric(Species) ~ ., data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59215 -0.15368  0.01268  0.11089  0.55077 

Coefficients:
             Estimate Std. Error t value      Pr(>|t|)    
(Intercept)   1.18650    0.20484   5.792 0.00000004150 ***
Sepal.Length -0.11191    0.05765  -1.941        0.0542 .  
Sepal.Width  -0.04008    0.05969  -0.671        0.5030    
Petal.Length  0.22865    0.05685   4.022 0.00009255215 ***
Petal.Width   0.60925    0.09446   6.450 0.00000000156 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2191 on 145 degrees of freedom
Multiple R-squared:  0.9304,	Adjusted R-squared:  0.9285 
F-statistic: 484.5 on 4 and 145 DF,  p-value: < 0.00000000000000022


### Multiple Regression with the relevant variables

In [99]:
species_reg_reduced <- step(iris_reg, direction = "backward")
species_reg_reduced

Start:  AIC=-450.56
as.numeric(Species) ~ Sepal.Length + Sepal.Width + Petal.Length + 
    Petal.Width

               Df Sum of Sq    RSS     AIC
- Sepal.Width   1   0.02164 6.9823 -452.09
<none>                      6.9606 -450.56
- Sepal.Length  1   0.18090 7.1415 -448.71
- Petal.Length  1   0.77649 7.7371 -436.69
- Petal.Width   1   1.99710 8.9577 -414.72

Step:  AIC=-452.09
as.numeric(Species) ~ Sepal.Length + Petal.Length + Petal.Width

               Df Sum of Sq    RSS     AIC
<none>                      6.9823 -452.09
- Sepal.Length  1   0.44324 7.4255 -444.86
- Petal.Length  1   1.51946 8.5017 -424.56
- Petal.Width   1   2.11632 9.0986 -414.38



Call:
lm(formula = as.numeric(Species) ~ Sepal.Length + Petal.Length + 
    Petal.Width, data = iris)

Coefficients:
 (Intercept)  Sepal.Length  Petal.Length   Petal.Width  
      1.1447       -0.1362        0.2521        0.5869  
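
The AIC column printed by `step()` can be reproduced from the RSS column. For a linear model the reported value is n * log(RSS / n) + 2 * (number of estimated coefficients). The sketch below plugs in the rounded RSS values from the output above; `aic_from_rss` is a hypothetical helper name.

```r
# Hypothetical helper: AIC as step() reports it for lm fits
# (n * log(RSS / n) + 2 * number of coefficients).
aic_from_rss <- function(rss, n, n_coef) {
    n * log(rss / n) + 2 * n_coef
}

# Full model: 4 predictors + intercept = 5 coefficients, RSS = 6.9606
aic_full <- aic_from_rss(6.9606, n = 150, n_coef = 5)

# After dropping Sepal.Width: 4 coefficients, RSS = 6.9823
aic_reduced <- aic_from_rss(6.9823, n = 150, n_coef = 4)

c(full = aic_full, reduced = aic_reduced)
```

Dropping Sepal.Width raises the RSS only slightly while removing one parameter, so the AIC decreases and `step()` keeps the smaller model.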


In [100]:
summary(species_reg_reduced)


Call:
lm(formula = as.numeric(Species) ~ Sepal.Length + Petal.Length + 
    Petal.Width, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.60753 -0.16188  0.01367  0.11217  0.54740 

Coefficients:
             Estimate Std. Error t value       Pr(>|t|)    
(Intercept)   1.14469    0.19478   5.877 0.000000027233 ***
Sepal.Length -0.13624    0.04475  -3.044        0.00277 ** 
Petal.Length  0.25213    0.04473   5.637 0.000000086707 ***
Petal.Width   0.58689    0.08822   6.652 0.000000000541 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2187 on 146 degrees of freedom
Multiple R-squared:  0.9302,	Adjusted R-squared:  0.9287 
F-statistic: 648.3 on 3 and 146 DF,  p-value: < 0.00000000000000022


    : as.numeric(Species) = 1.1447 + (-0.1362*Sepal.Length) + (0.2521*Petal.Length) + (0.5869*Petal.Width)
    
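    The adjusted R-squared is 0.9287: the three variables explain about 92.9% of the variance in the numeric species code. This is a goodness-of-fit measure, not a classification accuracy.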


In [131]:
irisFunc <- function(a, b, c) {
    1.1447 + (-0.1362 * a) + (0.2521 * b) + (0.5869 * c)
}

# a: Sepal.Length, b: Petal.Length, c: Petal.Width

## Validation

In [132]:
regression <- round(irisFunc(iris$Sepal.Length, iris$Petal.Length, iris$Petal.Width))
regression

In [105]:
iris$Species_num <- as.numeric(iris$Species)
iris$Species_num

In [140]:
mean(regression == iris$Species_num)*100
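
The hand-written `irisFunc` copies coefficients that were rounded to four decimals. As a sketch, the same prediction can be built from `coef()` so that no digits are lost. This assumes R's built-in `datasets::iris` holds the same data as `./Data/iris.csv`; the names `iris_df`, `fit`, `b`, and `manual` are illustrative.

```r
# Assumption: the built-in dataset matches the CSV used in this notebook.
iris_df <- datasets::iris

fit <- lm(as.numeric(Species) ~ Sepal.Length + Petal.Length + Petal.Width,
          data = iris_df)
b <- coef(fit)

# Same linear combination as irisFunc, with unrounded coefficients.
manual <- b[["(Intercept)"]] +
    b[["Sepal.Length"]] * iris_df$Sepal.Length +
    b[["Petal.Length"]] * iris_df$Petal.Length +
    b[["Petal.Width"]]  * iris_df$Petal.Width

# The manual calculation agrees with the model's own fitted values.
all.equal(manual, unname(fitted(fit)))

# Share of rows whose rounded prediction equals the species code, in percent.
accuracy <- mean(round(manual) == as.numeric(iris_df$Species)) * 100
accuracy
```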

# Simple Linear Regression

In [133]:
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_num
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>       <dbl>
1          5.1         3.5          1.4         0.2  setosa           1
2          4.9         3.0          1.4         0.2  setosa           1
3          4.7         3.2          1.3         0.2  setosa           1
4          4.6         3.1          1.5         0.2  setosa           1
5          5.0         3.6          1.4         0.2  setosa           1
6          5.4         3.9          1.7         0.4  setosa           1


In [134]:
cor(iris[-5])

             Sepal.Length Sepal.Width Petal.Length Petal.Width Species_num
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411   0.7825612
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259  -0.4266576
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654   0.9490347
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000   0.9565473
Species_num     0.7825612  -0.4266576    0.9490347   0.9565473   1.0000000
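
The last column of the correlation matrix is what motivates choosing Petal.Width for the one-variable model. A short sketch that ranks the predictors by the absolute value of that correlation, assuming the built-in `datasets::iris` matches the CSV (`iris_df`, `predictors`, and `ranked` are illustrative names):

```r
# Assumption: the built-in dataset matches the CSV used in this notebook.
iris_df <- datasets::iris
iris_df$Species_num <- as.numeric(iris_df$Species)

predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Correlation of each predictor with the numeric species code.
cors <- sapply(predictors, function(col) cor(iris_df[[col]], iris_df$Species_num))

# Rank by strength of the linear association, ignoring sign.
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```

Petal.Width and Petal.Length are also strongly correlated with each other (0.96 in the matrix above), so they carry largely overlapping information about the species code.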


In [135]:
r_PW <- lm(iris$Species_num ~ iris$Petal.Width)
r_PW
summary(r_PW)


Call:
lm(formula = iris$Species_num ~ iris$Petal.Width)

Coefficients:
     (Intercept)  iris$Petal.Width  
           0.767             1.028  



Call:
lm(formula = iris$Species_num ~ iris$Petal.Width)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.61753 -0.13156  0.02739  0.13019  0.79370 

Coefficients:
                 Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       0.76700    0.03657   20.97 <0.0000000000000002 ***
iris$Petal.Width  1.02807    0.02576   39.91 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2397 on 148 degrees of freedom
Multiple R-squared:  0.915,	Adjusted R-squared:  0.9144 
F-statistic:  1593 on 1 and 148 DF,  p-value: < 0.00000000000000022


## Validation

In [136]:
irisFuncPW <- function(x) {
    0.767 + (1.028 * x)
}

# iris$Petal.Width: x

In [137]:
regressionPW <- round(irisFuncPW(iris$Petal.Width))
regressionPW

In [139]:
mean(regressionPW == iris$Species_num)*100

    : Petal.Width alone identifies Species almost as well: the one-variable model has an adjusted R-squared of 0.9144, with a p-value below 0.00000000000000022.
    
    In this case, the one-variable model is simpler and can be more efficient to run.
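
Rounding the one-variable prediction amounts to cutting Petal.Width at two thresholds. The sketch below solves 0.767 + 1.028 * x = 1.5 and = 2.5 for x, using the coefficients from `irisFuncPW`; the three sample widths are made-up inputs.

```r
irisFuncPW <- function(x) {
    0.767 + (1.028 * x)
}

# Petal.Width values at which the rounded prediction changes class:
# below the first cutoff -> 1, between the cutoffs -> 2, above the second -> 3.
cutoffs <- (c(1.5, 2.5) - 0.767) / 1.028
cutoffs

# Illustrative widths, one on each side of the cutoffs.
example_classes <- round(irisFuncPW(c(0.2, 1.3, 2.0)))
example_classes
```

Under these coefficients the model is equivalent to a simple rule on Petal.Width with cut points near 0.71 and 1.69.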

# Applying a Machine Learning Concept

Can the trained model be generalized?

## === Multiple Regression ===

Now, we are going to divide the dataset into a training set and a test set. 
The regression model will be fit using only the training set, which is 70% of the data. 
The remaining 30% will be held out for validation.

## Data

In [141]:
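# stringsAsFactors = TRUE keeps Species a factor (R >= 4.0 reads strings as character by default)
iris <- read.csv("./Data/iris.csv", stringsAsFactors = TRUE)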
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


## Sampling and Splitting the Dataset into train and test

In [142]:
set.seed(1234)
# Stratified sample: 35 of the 50 rows of each species (rows are ordered by species)
samp <- c(sample(1:50, 35), sample(51:100, 35), sample(101:150, 35))

# Training
iris.train <- iris[samp,]

# Test
iris.test <- iris[-samp,]
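
The split above depends on the rows being ordered by species in blocks of 50. As a sketch, the same 70/30 stratified split can be drawn from the Species labels instead, so it does not depend on row order. It assumes the built-in `datasets::iris`; `iris_df`, `train_idx`, `train`, and `test` are illustrative names, and the sampled rows will generally differ from `samp`.

```r
set.seed(1234)

# Assumption: the built-in dataset matches the CSV used in this notebook.
iris_df <- datasets::iris

# Group row numbers by species, then sample 70% of each group.
rows_by_species <- split(seq_len(nrow(iris_df)), iris_df$Species)
train_idx <- unlist(
    lapply(rows_by_species, function(rows) sample(rows, size = round(0.7 * length(rows)))),
    use.names = FALSE
)

train <- iris_df[train_idx, ]
test  <- iris_df[-train_idx, ]

# Each species contributes 35 training rows and 15 test rows.
table(train$Species)
table(test$Species)
```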

## Training

In [144]:
iris_reg <- lm(as.numeric(Species) ~ . , data = iris.train)
iris_reg


Call:
lm(formula = as.numeric(Species) ~ ., data = iris.train)

Coefficients:
 (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
     1.10323      -0.10256      -0.03729       0.25290       0.55638  


In [145]:
species_reg_reduced <- step(iris_reg, direction = "backward")
species_reg_reduced

Start:  AIC=-318.52
as.numeric(Species) ~ Sepal.Length + Sepal.Width + Petal.Length + 
    Petal.Width

               Df Sum of Sq    RSS     AIC
- Sepal.Width   1   0.01250 4.6084 -320.24
<none>                      4.5959 -318.52
- Sepal.Length  1   0.10947 4.7054 -318.05
- Petal.Length  1   0.64053 5.2364 -306.82
- Petal.Width   1   1.06359 5.6595 -298.67

Step:  AIC=-320.24
as.numeric(Species) ~ Sepal.Length + Petal.Length + Petal.Width

               Df Sum of Sq    RSS     AIC
<none>                      4.6084 -320.24
- Sepal.Length  1   0.25968 4.8681 -316.48
- Petal.Width   1   1.12786 5.7363 -299.25
- Petal.Length  1   1.21215 5.8206 -297.72



Call:
lm(formula = as.numeric(Species) ~ Sepal.Length + Petal.Length + 
    Petal.Width, data = iris.train)

Coefficients:
 (Intercept)  Sepal.Length  Petal.Length   Petal.Width  
      1.0597       -0.1240        0.2746        0.5347  


In [161]:
summary(species_reg_reduced)


Call:
lm(formula = as.numeric(Species) ~ Sepal.Length + Petal.Length + 
    Petal.Width, data = iris.train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42775 -0.17111  0.01409  0.11349  0.50937 

Coefficients:
             Estimate Std. Error t value   Pr(>|t|)    
(Intercept)   1.05973    0.22659   4.677 0.00000904 ***
Sepal.Length -0.12402    0.05199  -2.386     0.0189 *  
Petal.Length  0.27459    0.05328   5.154 0.00000127 ***
Petal.Width   0.53472    0.10755   4.972 0.00000272 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2136 on 101 degrees of freedom
Multiple R-squared:  0.9342,	Adjusted R-squared:  0.9322 
F-statistic: 477.7 on 3 and 101 DF,  p-value: < 0.00000000000000022


In [147]:
irisFunc <- function(a, b, c) {
    1.0597 + (-0.1240 * a) + (0.2746 * b) + (0.5347 * c)
}

# a: Sepal.Length, b: Petal.Length, c: Petal.Width

## Prediction with test data

In [150]:
regression <- round(irisFunc(iris.test$Sepal.Length, iris.test$Petal.Length, iris.test$Petal.Width))
regression

In [151]:
iris.test$Species_num <- as.numeric(iris.test$Species)
iris.test$Species_num

## Accuracy

In [153]:
table(regression, iris.test$Species_num )

          
regression  1  2  3
         1 15  0  0
         2  0 12  1
         3  0  3 14

In [152]:
mean(regression == iris.test$Species_num)*100
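
The accuracy can also be read directly off the table above: correct predictions sit on the diagonal. A sketch that re-enters the printed counts by hand (rows are predictions, columns are true species codes; `conf_mat` is an illustrative name):

```r
# Counts copied from the table above: rows = predicted, columns = actual.
conf_mat <- matrix(
    c(15,  0,  0,
       0, 12,  1,
       0,  3, 14),
    nrow = 3, byrow = TRUE,
    dimnames = list(predicted = 1:3, actual = 1:3)
)

# Overall accuracy: correct predictions (diagonal) over all test rows.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100
accuracy

# Share of each true species that was predicted correctly.
per_class <- diag(conf_mat) / colSums(conf_mat)
per_class
```

With these counts, 41 of the 45 test rows are classified correctly, and all of the errors are confusions between species 2 and 3.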

## === Simple Linear Regression ===

In [155]:
iris.train$Species_num <- as.numeric(iris.train$Species)

In [156]:
r_PW <- lm(iris.train$Species_num ~ iris.train$Petal.Width)
r_PW
summary(r_PW)


Call:
lm(formula = iris.train$Species_num ~ iris.train$Petal.Width)

Coefficients:
           (Intercept)  iris.train$Petal.Width  
                0.7588                  1.0435  



Call:
lm(formula = iris.train$Species_num ~ iris.train$Petal.Width)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42832 -0.15874  0.03253  0.13688  0.78038 

Coefficients:
                       Estimate Std. Error t value            Pr(>|t|)    
(Intercept)             0.75878    0.04355   17.42 <0.0000000000000002 ***
iris.train$Petal.Width  1.04346    0.03098   33.68 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2379 on 103 degrees of freedom
Multiple R-squared:  0.9168,	Adjusted R-squared:  0.9159 
F-statistic:  1134 on 1 and 103 DF,  p-value: < 0.00000000000000022
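
Because the formula above references `iris.train$...` vectors directly, `predict()` cannot apply this fit to a different data frame, which is why the coefficients are copied into a function by hand in the next step. As a sketch of an alternative, fitting with column names and `data =` lets `predict(newdata = ...)` score the held-out rows. This assumes the built-in `datasets::iris`; the object names are illustrative, and the sampled rows may differ from the notebook's depending on the R version.

```r
set.seed(1234)

# Assumption: the built-in dataset matches the CSV used in this notebook.
iris_df <- transform(datasets::iris, Species_num = as.numeric(Species))

train_rows <- c(sample(1:50, 35), sample(51:100, 35), sample(101:150, 35))
train <- iris_df[train_rows, ]
test  <- iris_df[-train_rows, ]

# Column names plus data = keep the model independent of a specific data frame.
fit_pw <- lm(Species_num ~ Petal.Width, data = train)

# Score the held-out rows and round to the nearest species code.
pred <- round(predict(fit_pw, newdata = test))

table(predicted = pred, actual = test$Species_num)
```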


## Prediction

In [157]:
irisFuncPW <- function(x) {
    0.7588 + (1.0435 * x)
}

# x: Petal.Width

In [158]:
regressionPW <- round(irisFuncPW(iris.test$Petal.Width))
regressionPW

## Accuracy

In [163]:
table(regressionPW, iris.test$Species_num )

            
regressionPW  1  2  3
           1 15  0  0
           2  0 13  1
           3  0  2 14

In [159]:
mean(regressionPW == iris.test$Species_num)*100

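    : Training the model on one part of the data and testing it on unseen data (the test set) shows that it generalizes to identifying iris Species: per the tables above, the three-variable model classifies 41 of 45 test flowers correctly (91.1%) and the Petal.Width model 42 of 45 (93.3%).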