~~Identifying and quantifying outliers requires quantitative knowledge. Deciding what to do with outliers requires domain knowledge. For example, a GPS device determines its location by combining data from multiple signals originating in different locations. Our GPS devices exclude extreme outliers because those values are most likely the result of a technical error, and discarding data that is almost certainly wrong is part of why the devices are generally more accurate than before. Compare this to student outcomes at a data science bootcamp. Participants are not typical students: they have varying backgrounds. Many have no background knowledge, while a couple have PhDs in Physics. Participants with PhDs in Physics will naturally have better outcomes; that is significant and should not be ignored the way a GPS device ignores a completely wrong data point.~~

##### Identifying Outliers Using Tukey's Method
###### Tukey's Method of Identifying Outliers Says That Outliers Are  
1. values below (Quartile 1) – (1.5 × IQR)  
and  
2. values above (Quartile 3) + (1.5 × IQR)  
where IQR = (Quartile 3) – (Quartile 1)
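
As a worked illustration of the rule above, here is a minimal sketch in base R on a small made-up vector. The data and the variable names (`x`, `lower_fence`, `upper_fence`, `tukey_outliers`) are assumptions for illustration only and are not part of the Boston housing analysis. It relies on R's default quantile definition (type 7).

```r
# Toy data (hypothetical values, not taken from the Boston housing data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)

q1 <- quantile(x, 0.25, names = FALSE)   # 3.25 with R's default quantile type
q3 <- quantile(x, 0.75, names = FALSE)   # 7.75
iqr <- q3 - q1                           # 4.5, the same value IQR(x) returns

lower_fence <- q1 - 1.5 * iqr            # -3.5
upper_fence <- q3 + 1.5 * iqr            # 14.5

# Values outside the two fences are flagged as outliers
tukey_outliers <- x[x < lower_fence | x > upper_fence]
print(tukey_outliers)                    # 40
```

Only the value 40 falls outside the interval from -3.5 to 14.5, so it is the single value flagged here.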


In [8]:
# Load mlbench; install it first if it is not already available
is_mlbench_installed <- require("mlbench")
if (!is_mlbench_installed) {
    install.packages("mlbench")
    library(mlbench)
}

[1] TRUE


In [3]:
library(dplyr, warn.conflicts = FALSE); library(moments); library(ggplot2)

In [4]:
data(BostonHousing2); bh2 <- BostonHousing2

# Index the data frame by Census tract code
rownames(bh2) <- bh2$tract

# Drop the columns that BostonHousing2 adds to the original BostonHousing;
# tract (Census tract code) is kept only as the row names
bh2 <- subset(bh2, select = -c(cmedv, town, lon, lat, tract))

In [62]:
bh2_numeric_feat <- Filter(is.numeric, bh2)

In [6]:
summary(bh2_numeric_feat)

      medv            crim                zn             indus      
 Min.   : 5.00   Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
 1st Qu.:17.02   1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19  
 Median :21.20   Median : 0.25651   Median :  0.00   Median : 9.69  
 Mean   :22.53   Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
 3rd Qu.:25.00   3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
 Max.   :50.00   Max.   :88.97620   Max.   :100.00   Max.   :27.74  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptra

#### Using Tukey's Method to Identify Outliers for Each Feature

In [25]:
# Returns the Tukey bounds c(lower = Q1 - 1.5 * IQR, upper = Q3 + 1.5 * IQR).
# All values in data_frame are pooled, so pass a single-column data frame.
get_Tukey_bounds <- function(data_frame) {
    data_matrix <- data.matrix(data_frame)

    q1 <- quantile(data_matrix, 0.25, names = FALSE)
    q3 <- quantile(data_matrix, 0.75, names = FALSE)
    iqr15 <- 1.5 * (q3 - q1)

    return(c(lower = q1 - iqr15, upper = q3 + iqr15))
}
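
Because `get_Tukey_bounds` converts its whole argument with `data.matrix` and takes quantiles over every value, passing several columns at once would mix features that sit on different scales. One possible way to get separate bounds per feature is sketched below on a made-up data frame. The helper `tukey_fences` and the data frame `toy` are hypothetical names introduced only for this illustration.

```r
# Hypothetical toy data frame with two features on very different scales
toy <- data.frame(a = c(1, 2, 3, 4, 100),
                  b = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Hypothetical helper: Tukey fences for a single numeric vector
tukey_fences <- function(x) {
    q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
    fence_width <- 1.5 * (q[2] - q[1])
    c(lower = q[1] - fence_width, upper = q[2] + fence_width)
}

# sapply() visits each column in turn and binds the results into a matrix
# with rows "lower"/"upper" and one column per feature
bounds_by_feature <- sapply(toy, tukey_fences)
print(bounds_by_feature)
```

Each column of `bounds_by_feature` then holds the fences for one feature, for example `bounds_by_feature["upper", "a"]`.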

In [77]:
# Collect the Tukey outliers of every numeric feature in one long data frame
# with columns: tract (row name), feat_name, value
outliers_df <- data.frame()
for (i in seq_along(bh2_numeric_feat)) {
    feat_df <- bh2_numeric_feat[i]    # one-column data frame; row names are tract codes
    bounds <- get_Tukey_bounds(feat_df)

    # feat_df has a single column, so take it with [[1]];
    # indexing it with [i] fails for every i > 1
    values <- feat_df[[1]]
    is_outlier <- values < bounds[["lower"]] | values > bounds[["upper"]]

    outliers <- data.frame(
        tract = rownames(feat_df)[is_outlier],
        feat_name = rep(colnames(feat_df), sum(is_outlier)),
        value = values[is_outlier],
        stringsAsFactors = FALSE
    )
    outliers_df <- rbind(outliers_df, outliers)
}


#### Identifying Each Instance That Is an Outlier for More Than One Feature
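
One possible approach is to flag outliers feature by feature and then count the flags per row. This is sketched below on made-up data rather than on the Boston housing features. The data frame `toy`, the helper `is_tukey_outlier`, and the other variable names are hypothetical and introduced only for this illustration.

```r
# Hypothetical toy data: six instances (r1..r6) and two numeric features
toy <- data.frame(a = c(1, 2, 3, 4, 5, 100),
                  b = c(10, 11, 12, 13, -200, 500),
                  row.names = paste0("r", 1:6))

# Hypothetical helper: TRUE where a value lies outside the Tukey fences
is_tukey_outlier <- function(x) {
    q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
    fence_width <- 1.5 * (q[2] - q[1])
    x < q[1] - fence_width | x > q[2] + fence_width
}

# Logical matrix: one row per instance, one column per feature
outlier_flags <- sapply(toy, is_tukey_outlier)

# Number of features for which each instance is an outlier
outlier_counts <- rowSums(outlier_flags)
names(outlier_counts) <- rownames(toy)

# Instances flagged for more than one feature
multi_feature_outliers <- names(outlier_counts)[outlier_counts > 1]
print(multi_feature_outliers)            # "r6"
```

In this toy data, `r5` is an outlier only for `b`, while `r6` is an outlier for both features and is therefore the only instance returned.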