In [1]:
library(rdydisstools)
library(ggplot2)
library(svglite)
library(glue)
library(dplyr)


Attaching package: ‘dplyr’

The following object is masked from ‘package:glue’:

    collapse

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Ok, so first we're going to load the maxJ data for each of the 5000 simulations from our artifacts folder, then bind it all into one data frame with one row per simulation.

In [2]:
for(i in 1:5000) {
    # Build the file path for each metric's results for simulation i
    lsfn <- glue::glue('~/notebooks/dissertation/artifacts/rq1/ls/sim{i}.RData')
    eofn <- glue::glue('~/notebooks/dissertation/artifacts/rq1/eo/sim{i}.RData')
    mdfn <- glue::glue('~/notebooks/dissertation/artifacts/rq1/md/sim{i}.RData')

    # Each file restores one object into the workspace: ls, eo, or md
    load(lsfn)
    load(eofn)
    load(mdfn)

    # One row per simulation: the maxJ lookup for each metric, side by side
    row <- cbind(maxLookup(ls$informedness), maxLookup(eo$informedness), maxLookup(md$informedness))

    if(i==1) {
        df <- row
    } else {
        df <- rbind(df, row)
    }
}

colnames(df) <- c("ls.range", "ls.informedness", "ls.meanJ", "ls.sdJ",
                  "eo.range", "eo.informedness", "eo.meanJ", "eo.sdJ",
                  "md.range", "md.informedness", "md.meanJ", "md.sdJ")
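
Growing a data frame with rbind inside a loop copies it on every iteration, which gets slow at 5000 simulations. A minimal sketch of an alternative is to collect the rows in a list and bind them once at the end. Note that `fake_lookup` and `make_sim` are hypothetical stand-ins (for `maxLookup` and the saved .RData files respectively), and the toy data is made up; none of this is the rdydisstools implementation.

```r
# Hypothetical stand-in for maxLookup(): returns the row with the largest
# informedness value. An assumption for illustration only.
fake_lookup <- function(tbl) {
  tbl[which.max(tbl$informedness), , drop = FALSE]
}

# Toy stand-in for one simulation's informedness table (made-up values).
make_sim <- function(seed) {
  set.seed(seed)
  rng <- seq(-0.5, 1.5, by = 0.1)
  data.frame(range = rng, informedness = runif(length(rng)))
}

# Build one row per simulation, then bind all rows in a single call.
rows <- lapply(1:5, function(i) fake_lookup(make_sim(i)))
result <- do.call(rbind, rows)

nrow(result)  # one row per simulation
```

The same pattern would apply to the real loop by returning `row` from the function passed to lapply instead of assigning into `df`.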

Next, let's split the metrics out into their own datasets (we could avoid this step if we didn't cbind in the previous step).

In [3]:
lsmetrics <- df %>% select(ls.range, ls.informedness, ls.meanJ, ls.sdJ)
eometrics <- df %>% select(eo.range, eo.informedness, eo.meanJ, eo.sdJ)
mdmetrics <- df %>% select(md.range, md.informedness, md.meanJ, md.sdJ)

Next, we're going to calculate the measures of central tendency for the index of maxJ for each metric. My proposal was to use the value of informedness at the threshold of average(maxJ) over all samples. Mean, median, and mode are all types of averages, and there may be a strong argument for using something like the mode instead of the mean. Still, I think switching would confuse readers, so I'll stick with the mean for now unless a good reason arises.

In [4]:
mode <- rbind(indexOfMode(lsmetrics), indexOfMode(eometrics), indexOfMode(mdmetrics))
lsct <- lsmetrics %>% summarise(mean=format(mean(ls.range), digits=2), sd=format(sd(ls.range), digits=2), median=median(ls.range))
eoct <- eometrics %>% summarise(mean=format(mean(eo.range), digits=2), sd=format(sd(eo.range), digits=2), median=median(eo.range))
mdct <- mdmetrics %>% summarise(mean=format(mean(md.range), digits=2), sd=format(sd(md.range), digits=2), median=median(md.range))
ct <- cbind(rbind(lsct, eoct, mdct), mode)
colnames(ct)[4] = "mode"
labels <- rbind("Longstring", "Even-odd", "Outlier")
cbind(labels, ct)

labels,mean,sd,median,mode
<fct>,<chr>,<chr>,<dbl>,<dbl>
Longstring,0.39,0.36,0.4,0.4
Even-odd,0.22,0.74,0.1,-0.3
Outlier,0.54,0.5,0.5,0.3
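
To make the mean/median/mode distinction concrete, here is a small sketch on a made-up vector of thresholds (not the dissertation data). `mode_of` is a hypothetical helper written for this example. A single large value pulls the mean away from the median and mode, which is the kind of gap that might motivate using the mode.

```r
# Toy vector of maxJ thresholds (made-up values for illustration).
thresholds <- c(0.1, 0.4, 0.4, 0.0, 0.4, 0.1, 1.1)

# Mode: the most frequent value; ties would return every tied value.
mode_of <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(thresholds)     # pulled upward by the single 1.1
median(thresholds)   # middle value after sorting
mode_of(thresholds)  # most frequent value
```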


Now that we've found the index of avg(maxJ) for each metric, we're going to look up informedness at that threshold for each sample and metric. To do that we need to go back through each of the datasets and find the value of J at the chosen threshold. The chosen threshold is the mean from the table above rounded to one decimal place: for longstring that's .4 (mean 0.39), for even-odd it's .2 (mean 0.22), and for outlier it's .5 (mean 0.54).

In [5]:
for(i in 1:5000) {
    lsfn <- glue::glue('~/notebooks/dissertation/artifacts/rq1/ls/sim{i}.RData')
    eofn <- glue::glue('~/notebooks/dissertation/artifacts/rq1/eo/sim{i}.RData')
    mdfn <- glue::glue('~/notebooks/dissertation/artifacts/rq1/md/sim{i}.RData')

    load(lsfn)
    load(eofn)
    load(mdfn)

    # Look up J at each metric's chosen threshold for this simulation
    row <- cbind(i, jLookup(ls$informedness, .4), jLookup(eo$informedness, .2), jLookup(md$informedness, .5))
    if(i==1) {
        df2 <- row
    } else {
        df2 <- rbind(df2, row)
    }
}

# Name the columns once, after the loop has finished
colnames(df2) <- c("sample", "lsjat.4", "eojat.2", "mdjat.5")


In [7]:
df2 %>% head
evals <- df2 %>% as.data.frame() %>% mutate(ls.eval=case_when(lsjat.4 > 0 ~ 1,
                                                      TRUE ~ 0),
                                   eo.eval=case_when(eojat.2 > 0 ~ 1,
                                                      TRUE ~ 0),
                                   md.eval=case_when(mdjat.5 > 0 ~ 1,
                                                      TRUE ~ 0))

head(evals)
evals %>% summarise(ls.total=sum(ls.eval), ls.total.mean=mean(lsjat.4), ls.total.sd=sd(lsjat.4),
                    eo.total=sum(eo.eval), eo.total.mean=mean(eojat.2), eo.total.sd=sd(eojat.2),
                    md.total=sum(md.eval), md.total.mean=mean(mdjat.5), md.total.sd=sd(mdjat.5))

sample,lsjat.4,eojat.2,mdjat.5
1,0.2619138,0.01232536,0.09645013
2,0.2052767,0.05148005,0.17824968
3,0.2189474,0.06315789,0.28842105
4,0.1667888,0.07203907,0.23199023
5,0.1013825,0.14132104,0.38832565
6,0.2174665,-0.02190705,0.10860664


sample,lsjat.4,eojat.2,mdjat.5,ls.eval,eo.eval,md.eval
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0.2619138,0.01232536,0.09645013,1,1,1
2,0.2052767,0.05148005,0.17824968,1,1,1
3,0.2189474,0.06315789,0.28842105,1,1,1
4,0.1667888,0.07203907,0.23199023,1,1,1
5,0.1013825,0.14132104,0.38832565,1,1,1
6,0.2174665,-0.02190705,0.10860664,1,0,1


ls.total,ls.total.mean,ls.total.sd,eo.total,eo.total.mean,eo.total.sd,md.total,md.total.mean,md.total.sd
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
4960,0.2612381,0.1243657,4208,0.08444369,0.08499929,4865,0.1749188,0.09629145
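
The totals in the summary output are easier to compare as shares of the 5000 samples. The sketch below converts the printed counts into percentages and shows that the same share can be computed directly as the mean of a logical vector, using the six even-odd values from the head() output as a small example. The variable names here are introduced only for this illustration.

```r
# Counts of samples with J > 0, copied from the summary output above.
positive <- c(ls = 4960, eo = 4208, md = 4865)
n_samples <- 5000  # number of simulations looped over

# Percentage of samples with positive informedness, per metric.
pct <- round(100 * positive / n_samples, 1)
pct

# Same idea without an indicator column: the mean of a logical vector
# is the proportion of TRUE values. Values are the first six eojat.2 rows.
eo_head <- c(0.01232536, 0.05148005, 0.06315789,
             0.07203907, 0.14132104, -0.02190705)
n_positive <- sum(eo_head > 0)
share_positive <- mean(eo_head > 0)
```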


In [74]:
jLookup(ls$informedness, .4)

In [59]:
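jL <- function(x, i) {
  # Compare with a small tolerance: exact == on decimal thresholds
  # can miss a match because of floating-point rounding
  j = which(abs(x[,1] - i) < 1e-8)
  return(x[j,2])
}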

In [123]:
ls$informedness$range[32]
which(sapply(ls$informedness$range, function(x) isTRUE(all.equal(x, .4))))
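
The all.equal check above is a reminder that decimal thresholds built by arithmetic are not always bit-identical to the literal typed in the lookup. Here is a minimal sketch of a tolerance-based lookup on a made-up grid. `j_at`, `grid`, and `j_values` are hypothetical names for this example, and the grid spacing is an assumption about how the thresholds were generated.

```r
# Hypothetical threshold grid; seq() builds it as from + k * by,
# so individual values can carry rounding error.
grid <- seq(-0.5, 1.5, by = 0.1)
j_values <- seq_along(grid) / 100   # made-up J values, one per threshold

# Exact comparison can miss a threshold because of that rounding error.
exact_hits <- which(grid == 0.4)

# Tolerance-based lookup: match any threshold within `tol` of the target.
j_at <- function(thresholds, j, target, tol = 1e-8) {
  j[which(abs(thresholds - target) < tol)]
}

j_at(grid, j_values, 0.4)
```

A tolerance far smaller than the grid spacing keeps the match unique while still absorbing rounding error.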

In [241]:
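indexOfMode <- function(x) {
    # Count how often each value in the first column occurs
    counts <- x %>% group_by(value = .[[1]]) %>% summarise(n = n())
    # Return the most frequent value(s); ties return more than one
    counts$value[counts$n == max(counts$n)]
}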

In [243]:
lsmetrics %>% group_by(ls.range) %>% mutate(n=n())

ls.range,ls.informedness,ls.meanJ,ls.sdJ,n
0.0,0.35916593,0.14512739,0.13173527,482
0.5,0.21428571,0.10869886,0.10508109,565
0.8,0.24000000,0.12024159,0.11493342,181
0.1,0.18510379,0.08996978,0.08976647,582
1.1,0.11428571,0.04556924,0.06500177,61
0.0,0.22582415,0.10094654,0.09740270,482
0.6,0.33333333,0.16568964,0.16374488,465
0.1,0.44866032,0.22649477,0.21403421,582
0.4,0.17720307,0.07475975,0.07460465,588
-0.1,0.20506119,0.09527556,0.08859422,256


In [245]:
load('~/notebooks/dissertation/artifacts/rq1/maxJ.RData')

In [248]:
maxJ %>% group_by(ls) %>% summarise(n=n())

ls,n
-0.5,5
-0.4,23
-0.3,53
-0.2,17
-0.1,256
0.0,482
0.1,582
0.2,527
0.3,499
0.4,588
