#### Explain the bias-variance tradeoff.

In [None]:
The “tradeoff” between bias and variance can be viewed in this manner – a learning algorithm with low bias must be “flexible” so that it can fit the data well. But if the learning algorithm is too flexible (for instance, too linear), it will fit each training data set differently, and hence have high variance. A key characteristic of many supervised learning methods is a built-in way to control the bias-variance tradeoff either automatically or by providing a special parameter that the data scientist can adjust.



#### Discuss the pros and cons of using the BIC to select a model.

In [None]:
Pros:
-BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.

Cons:

-BIC is only valid for sample size n much larger than the number of parameters k in the model.
-BIC cannot handle complex collections of models as in the feature selection problem in high-dimension.


# Model Selection on a Classification Model

In [None]:
iris.data = read.csv("data/iris.csv", row.names='X')

In [27]:
column_names = ('area',"perimeter","compactness","length of kernel",
                   "width of kernel","assymetry coefficient","length of kernel groove","target")

In [59]:
SEEDS_DATA_URL <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt'
seeds.data <- read.table(SEEDS_DATA_URL,header=FALSE,sep = "")

In [57]:
seeds_df
typeof(seeds_df)

area,perimeter,compactness,length_of_kernel,width_of_kernel,assymetry_coefficient,length_of_kernel_groove,target
15.26,14.84,0.8710,5.763,3.312,2.2210,5.220,1
14.88,14.57,0.8811,5.554,3.333,1.0180,4.956,1
14.29,14.09,0.9050,5.291,3.337,2.6990,4.825,1
13.84,13.94,0.8955,5.324,3.379,2.2590,4.805,1
16.14,14.99,0.9034,5.658,3.562,1.3550,5.175,1
14.38,14.21,0.8951,5.386,3.312,2.4620,4.956,1
14.69,14.49,0.8799,5.563,3.259,3.5860,5.219,1
14.11,14.10,0.8911,5.420,3.302,2.7000,5.000,1
16.63,15.46,0.8747,6.053,3.465,2.0400,5.877,1
16.44,15.25,0.8880,5.884,3.505,1.9690,5.533,1


In [61]:
colnames(seeds.data) =c("area","perimeter","compactness","length_of_kernel",
                   "width_of_kernel","assymetry_coefficient","length_of_kernel_groove","target")

In [63]:
seeds.glm = glm("target ~ 1 + area + perimeter + compactness + length_of_kernel +
                            width_of_kernel + assymetry_coefficient + length_of_kernel_groove", data = seeds.data)
summary(seeds.glm)


Call:
glm(formula = "target ~ 1 + area + perimeter + compactness + length_of_kernel +\n                            width_of_kernel + assymetry_coefficient + length_of_kernel_groove", 
    data = seeds.data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.30568  -0.24785  -0.01632   0.24198   1.22362  

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              53.44356    7.44511   7.178 1.32e-11 ***
area                      1.48907    0.26133   5.698 4.25e-08 ***
perimeter                -3.22038    0.53815  -5.984 9.77e-09 ***
compactness             -30.67744    5.24108  -5.853 1.92e-08 ***
length_of_kernel         -2.31510    0.45444  -5.094 8.01e-07 ***
width_of_kernel           0.24598    0.78571   0.313    0.755    
assymetry_coefficient     0.11489    0.02257   5.089 8.19e-07 ***
length_of_kernel_groove   2.19260    0.20358  10.770  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.

## The Log-Likelihood

Without going too far into the math, we can think of the log-likelihood as a **likelihood function** telling us how likely a model is given the data. 

This value is not human interpretable but is useful as a comparison.

In [64]:
logLik(seeds.glm)

'log Lik.' -109.228 (df=9)

"All models are wrong, but some are useful." - George Box

We might be concerned with one additional property - the **complexity** of the model. 

##### William of Occam

[**Occam's razor**](https://en.wikipedia.org/wiki/Occam's_razor) is the problem-solving principle that, when presented with competing hypothetical answers to a problem, one should select the one that makes the fewest assumptions.

<img src="https://upload.wikimedia.org/wikipedia/commons/a/ab/William_of_Ockham_-_Logica_1341.jpg" width=400px>

We can represent this idea of complexity in terms of both the number of features we use and the amount of data.

## Bayesian Information Criterion

https://en.wikipedia.org/wiki/Bayesian_information_criterion

The BIC is formally defined as

$$ \mathrm{BIC} = {\ln(n)k - 2\ln({\widehat L})}. $$

where

- $\widehat L$ = the maximized value of the likelihood function of the model $M$
- $x$ = the observed data
- $n$ = the number of data points in $x$, the number of observations, or equivalently, the sample size;
- $k$ = the number of parameters estimated by the model. For example, in multiple linear regression, the estimated parameters are the intercept, the $q$ slope parameters, and the constant variance of the errors; thus, $k = q + 2$.


It might help us to think of it as 

$$ \mathrm{BIC} = \text{complexity}-\text{likelihood}$$

In [65]:
BIC(seeds.glm)

In [66]:
n = length(seeds.glm$fitted.values)
p = length(coefficients(seeds.glm))

likelihood = 2 * logLik(seeds.glm)
complexity = log(n)*(p+1)

bic = complexity - likelihood
bic

'log Lik.' 266.5799 (df=9)

In [67]:
BIC_of_model = function (model) {
    n = length(model$fitted.values)
    p = length(coefficients(model))

    likelihood = 2 * logLik(model)
    complexity = log(n)*(p+1)

    bic = complexity - likelihood
    return(bic)
}

In [68]:
BIC_of_model(seeds.glm) #df below means degrees of freedom

'log Lik.' 266.5799 (df=9)

## Model Selection

Here, we choose the optimal model by removing features one by one.

In [72]:

model_1  = "target ~ 1 + area + perimeter + compactness + length_of_kernel + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
model_2a = "target ~ 1 + area + perimeter + compactness + length_of_kernel + width_of_kernel + assymetry_coefficient"
model_2b = "target ~ 1 + area + perimeter + compactness + length_of_kernel + width_of_kernel + length_of_kernel_groove"
model_2c = "target ~ 1 + area + perimeter + compactness + length_of_kernel + assymetry_coefficient + length_of_kernel_groove"
model_2d = "target ~ 1 + area + perimeter + compactness + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
model_2e = "target ~ 1 + area + perimeter + length_of_kernel + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
model_2f = "target ~ 1 + area + compactness + length_of_kernel + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
              

In [73]:
seeds.glm.1 = glm(model_1, data=seeds.data)
seeds.glm.2a = glm(model_2a, data=seeds.data)
seeds.glm.2b = glm(model_2b, data=seeds.data)
seeds.glm.2c = glm(model_2c, data=seeds.data)
seeds.glm.2d = glm(model_2d, data=seeds.data)
seeds.glm.2e = glm(model_2e, data=seeds.data)
seeds.glm.2f = glm(model_2f, data=seeds.data)

In [74]:
print(c('model_1', BIC_of_model(seeds.glm.1)))
print(c('model_2a', BIC_of_model(seeds.glm.2a )))
print(c('model_2b', BIC_of_model(seeds.glm.2b )))
print(c('model_2c', BIC_of_model(seeds.glm.2c )))
print(c('model_2d', BIC_of_model(seeds.glm.2d )))
print(c('model_2e', BIC_of_model(seeds.glm.2e )))
print(c('model_2f', BIC_of_model(seeds.glm.2f )))

[1] "model_1"          "266.579917042231"
[1] "model_2a"         "356.525745573155"
[1] "model_2b"         "286.569723996916"
[1] "model_2c"         "261.334679806275"
[1] "model_2d"         "286.615929930958"
[1] "model_2e"         "294.133138313643"
[1] "model_2f"         "295.505888618519"


In [75]:
print(c('model_1', BIC(seeds.glm.1)))
print(c('model_2a', BIC(seeds.glm.2a )))
print(c('model_2b', BIC(seeds.glm.2b )))
print(c('model_2c', BIC(seeds.glm.2c )))
print(c('model_2d', BIC(seeds.glm.2d )))
print(c('model_2e', BIC(seeds.glm.2e )))
print(c('model_2f', BIC(seeds.glm.2f )))

[1] "model_1"          "266.579917042231"
[1] "model_2a"         "356.525745573155"
[1] "model_2b"         "286.569723996916"
[1] "model_2c"         "261.334679806275"
[1] "model_2d"         "286.615929930958"
[1] "model_2e"         "294.133138313643"
[1] "model_2f"         "295.505888618519"


In [None]:
model_1  = "target ~ 1 + area + perimeter + compactness + length_of_kernel + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
model_2c = "target ~ 1 + area + perimeter + compactness + length_of_kernel + assymetry_coefficient + length_of_kernel_groove"

model_3a = "target ~ 1 + area + compactness + width_of_kernel + assymetry_coefficient"
model_3b = "target ~ 1 + area + perimeter + length_of_kernel + length_of_kernel_groove"
model_3c = "target ~ 1 + area + perimeter + compactness + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
model_3d = "target ~ 1 + area + perimeter + length_of_kernel + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
model_3e = "target ~ 1 + area + compactness + length_of_kernel + width_of_kernel + assymetry_coefficient + length_of_kernel_groove"
              

In [15]:
iris.glm.3a = glm(model_3a, data=iris.data)
iris.glm.3b = glm(model_3b, data=iris.data)
iris.glm.3c = glm(model_3c, data=iris.data)

In [16]:
print(c('model_1', BIC(iris.glm.1)))
print(c('model_2c', BIC(iris.glm.2c )))
print(c('model_3a', BIC(iris.glm.3a )))
print(c('model_3b', BIC(iris.glm.3b )))
print(c('model_3c', BIC(iris.glm.3c )))

[1] "model_1"           "-4.87121487462612"
[1] "model_2c"          "-9.31979403027607"
[1] "model_3a"         "25.3174210943167"
[1] "model_3b"         "15.4504250116728"
[1] "model_3c"         "-5.0467304546584"
