In [None]:
library(data.table)
options(repr.matrix.max.rows=10, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

One of our classmates provided an answer to the "j operation" part of our first assignment. The answer's code, which includes two functions with nested conditional statements, runs correctly (courtesy of our classmate):

<b>(Question)</b> Make a "j" operation again, but this time assign the calculated new column back to the object with the `:=` notation. Then print the object to show that the new column has been added.

<b>(Explanation)</b>
1. <b>how_serious_quake:</b> Categorizes earthquake seriousness based on magnitude using the `how_big_quake` function.
2. <b>depth_category:</b> Categorizes earthquake depth using the `categorize_depth` function.

In [None]:
quakes_dt <- copy(quakes)
setDT(quakes_dt)
quakes_dt

In [None]:
how_big_quake <- function(x) {
    if (x < 5) {return ("No damage!")}
    else if (x < 6) {return ("Minor damage!")}
    else {return ("Slight or serious damage!")}
}

categorize_depth <- function(depth) {
  if (depth < 50) {
    return("Shallow")
  } else if (depth < 200) {
    return("Moderate")
  } else {
    return("Deep")
  }
}

quakes_dt[, how_serious_quake:= sapply(mag, how_big_quake)]
quakes_dt[, depth_category := sapply(depth, categorize_depth)]

quakes_dt

Now let's assume that the data object is much larger and the number of categories to assign to the numeric values is much higher.

Then the code would be harder to design and the execution would be slower.

In [None]:
quakes_dtl <- rbindlist(rep(list(quakes_dt), 1e3))

In [None]:
mb1 <- microbenchmark::microbenchmark({
    quakes_dtl[, how_serious_quake := sapply(mag, how_big_quake)]
    quakes_dtl[, depth_category := sapply(depth, categorize_depth)]
}, times = 10)

In [None]:
print(mb1)

On my local computer, the average execution time of this method on the 1000X larger data is 1.2 seconds:

<div class="lm-Widget p-Widget jp-RenderedText jp-mod-trusted jp-OutputArea-output" data-mime-type="application/vnd.jupyter.stdout"><pre>Unit: seconds
                                                                                                                                              expr
 {     quakes_dtl[, `:=`(how_serious_quake, sapply(mag, how_big_quake))]     quakes_dtl[, `:=`(depth_category, sapply(depth, categorize_depth))] }
      min       lq     mean   median       uq      max neval
 1.186524 1.219021 1.236421 1.235059 1.255242 1.302116    10
</pre></div>

In [The Art of Unix Programming](https://www.catb.org/~esr/writings/taoup/html/), a great book about programming and the Unix philosophy, Eric S. Raymond emphasizes the advantage of moving complexity out of the code and into data structures:

> Data-Driven Programming
>
> When doing data-driven programming, one clearly distinguishes code from the data structures on which it acts, and designs both so that one can make changes to the logic of the program by editing not the code but the data structure.
>
> Data-driven programming is sometimes confused with object orientation, another style in which data organization is supposed to be central. There are at least two differences. One is that in data-driven programming, the data is not merely the state of some object, but actually defines the control flow of the program. Where the primary concern in OO is encapsulation, the primary concern in data-driven programming is writing as little fixed code as possible. Unix has a stronger tradition of data-driven programming than of OO.
>
>Programming data-driven style is also sometimes confused with writing state machines. It is in fact possible to express the logic of a state machine as a table or data structure, but hand-coded state machines are usually rigid blocks of code that are far harder to modify than a table.
>
> An important rule when doing any kind of code generation or data-driven programming is this: always push problems upstream. Don't hack the generated code or any intermediate representations by hand — instead, think of a way to improve or replace your translation tool. Otherwise you're likely to find that hand-patching bits which should have been generated correctly by machine will have turned into an infinite time sink.

(https://www.catb.org/~esr/writings/taoup/html/ch09s01.html)

> Basics of the Unix Philosophy
>
> ...
>
> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

(https://www.catb.org/~esr/writings/taoup/html/ch01s06.html#rule5)

> Rule of Representation: Fold knowledge into data, so program logic can be stupid and robust.
>
>Even the simplest procedural logic is hard for humans to verify, but quite complex data structures are fairly easy to model and reason about. To see this, compare the expressiveness and explanatory power of a diagram of (say) a fifty-node pointer tree with a flowchart of a fifty-line program. Or, compare an array initializer expressing a conversion table with an equivalent switch statement. The difference in transparency and clarity is dramatic. See Rob Pike's Rule 5.
>
> Data is more tractable than program logic. It follows that where you see a choice between complexity in data structures and complexity in code, choose the former. More: in evolving a design, you should actively seek ways to shift complexity from code to data.
>
> The Unix community did not originate this insight, but a lot of Unix code displays its influence. The C language's facility at manipulating pointers, in particular, has encouraged the use of dynamically-modified reference structures at all levels of coding from the kernel upward. Simple pointer chases in such structures frequently do duties that implementations in other languages would instead have to embody in more elaborate procedures.

(https://www.catb.org/~esr/writings/taoup/html/ch01s06.html#id2878263)

Now, there are at least two more ways to implement this solution that yield exactly the same answer, both by translating the conditional logic into data structures.

These methods are easier to implement, easier to extend to more categories, and run faster.
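To make the idea of "folding knowledge into data" concrete before turning to the two methods, here is a minimal sketch. It deliberately uses a different base R function, `findInterval`, which is not one of the two methods discussed next, and the vectors `lower_bounds`, `labels`, and `mag` are hypothetical toy values rather than the assignment's data. The thresholds and labels live in two vectors, so adding a category means editing data, not adding another `else if` branch.

```r
# Thresholds and labels are stored as data instead of if/else branches.
# (Hypothetical toy values chosen for illustration.)
lower_bounds <- c(-Inf, 5, 6)
labels <- c("No damage!", "Minor damage!", "Slight or serious damage!")

mag <- c(4.0, 4.9, 5.0, 5.9, 6.0, 6.4)

# findInterval() returns, for each value, the index of the last
# lower bound that is <= the value (intervals are closed on the left),
# which mirrors the `x < 5`, `x < 6`, else logic.
category <- labels[findInterval(mag, lower_bounds)]
print(category)
```

The call is vectorized over `mag`, so no `sapply` loop over single values is needed.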

The first method involves discretization with the `cut` function.

Note that, to get a result identical to the one above, the inputs may need to be reversed (negated values and breaks, labels in reverse order), because `cut` closes its intervals on the right by default and this makes a difference for values that fall exactly on a break point. The resulting factor must also be converted to character.
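The break-point issue can be seen on a small toy vector. This is only a sketch with hypothetical values and labels (`x`, `labs`), not the assignment's solution. It compares the default behaviour of `cut` with the reversed inputs, and also shows `right = FALSE`, a standard `cut` argument that closes the intervals on the left instead.

```r
x <- c(4.9, 5.0, 5.9, 6.0)          # toy values, two of them exactly on a break
labs <- c("low", "mid", "high")     # hypothetical labels
brks <- c(-Inf, 5, 6, Inf)

# Default: intervals are closed on the right, so 5 falls into (-Inf, 5]
default_cut <- as.character(cut(x, breaks = brks, labels = labs))

# Reversed inputs: negate values and breaks, reverse the labels.
# Now -5 falls into (-6, -5], which corresponds to 5 <= x < 6.
reversed_cut <- as.character(cut(-x, breaks = -brks, labels = rev(labs)))

# Alternative: keep the inputs and close the intervals on the left
left_closed_cut <- as.character(cut(x, breaks = brks, labels = labs,
                                    right = FALSE))

print(data.frame(x, default_cut, reversed_cut, left_closed_cut))
```

Only the last two variants reproduce the `x < 5` / `x < 6` logic of the original functions for values that sit exactly on 5 or 6.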

In [None]:
quakes_dt2 <- copy(quakes)
setDT(quakes_dt2)
quakes_dt2

...

*Code with `cut` function*

...

The results are identical:

In [None]:
identical(quakes_dt2, quakes_dt)

<pre>TRUE</pre>

And the execution is roughly 1000X faster than the original (about 1.1 milliseconds versus 1.2 seconds on average):

<div class="lm-Widget p-Widget jp-RenderedText jp-mod-trusted jp-OutputArea-output" data-mime-type="application/vnd.jupyter.stdout"><pre>Unit: milliseconds
                                                                                                                                                                                                                                                                                                                                                                             expr
 {     quakes_dt2[, `:=`(how_serious_quake, as.character(cut(-mag,          breaks = -c(-Inf, 5, 6, Inf), labels = rev(c("No damage!",              "Minor damage!", "Slight or serious damage!")))))]     quakes_dt2[, `:=`(depth_category, as.character(cut(-depth,          breaks = -c(-Inf, 50, 200, Inf), labels = rev(c("Shallow",              "Moderate", "Deep")))))] }
     min       lq     mean   median       uq      max neval
 1.07388 1.077075 1.113923 1.083277 1.120266 1.328985    10
</pre></div>

Another method involves creating lookup tables for matching categories and doing rolling joins as mentioned in the session.

A rolling join that looks up the rows of data.table A in data.table B uses the syntax:

`B[A, on = ..., roll = ...]`

Note that in this version the columns must be put back into the order of the original answer (for example with `setcolorder`) to ensure that the results are identical.
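As a minimal sketch of how such a join behaves, consider the toy tables below. The tables `A` and `B`, their columns `id` and `label`, and all values are hypothetical and chosen only for illustration. Each row of the lookup table `B` marks the lower bound of a category, and `roll = Inf` carries the last matching lower bound forward to every value in `A`.

```r
library(data.table)

# Hypothetical lookup table: each row is the lower bound of a category
B <- data.table(mag = c(0, 5, 6),
                label = c("low", "mid", "high"))

# Hypothetical data to classify
A <- data.table(id = 1:4, mag = c(4.9, 5.0, 5.9, 6.0))

# For every row of A, take the last row of B whose key is <= the value in A
res <- B[A, on = "mag", roll = Inf]
print(res)

# The joined result lists B's columns first, then the remaining columns of A
print(names(res))
```

Because the result starts with the columns of `B`, the column order differs from that of `A`, which is why a reordering step is needed before comparing with `identical()`.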

In [None]:
quakes_dt3 <- copy(quakes)
setDT(quakes_dt3)
quakes_dt3

...

*Code with rolling joins*

...

The results are identical:

In [None]:
identical(quakes_dt4, quakes_dt)

<pre>TRUE</pre>

And the execution is ~6X faster than the original:

<div class="lm-Widget p-Widget jp-RenderedText jp-mod-trusted jp-OutputArea-output" data-mime-type="application/vnd.jupyter.stdout"><pre>Unit: milliseconds
                                                                                                                                                  expr
 {     quakes_dt4l &lt;- dc[hbq[quakes_dt3l, on = "mag", roll = Inf],          on = "depth", roll = Inf]     setcolorder(quakes_dt4l, names(quakes_dt)) }
      min       lq     mean   median       uq      max neval
 124.8671 128.7244 191.6024 176.0325 231.0694 339.6005    10
</pre></div>

Now the competition is:

Create a reproducible notebook that generates a result object identical to the original result, as confirmed by the `identical()` function, using either:

- `cut` function method

- rolling join method

The code should also run sufficiently faster than the original implementation. However, you don't have to demonstrate the performance; we will do that. In fact, if you use these methods, the code will surely run sufficiently faster.

Please submit the clean ipynb without cell outputs and the html with cell outputs to Moodle.

The first ones to submit the notebook that yields a correct answer using cut or rolling join methods will earn additional full points of a lab submission (so we will have either one or two champions).