diff --git a/exercises-pdf/nested_resampling_all.pdf b/exercises-pdf/nested_resampling_all.pdf new file mode 100644 index 000000000..18c6a02c3 Binary files /dev/null and b/exercises-pdf/nested_resampling_all.pdf differ diff --git a/exercises-pdf/nested_resampling_ex.pdf b/exercises-pdf/nested_resampling_ex.pdf new file mode 100644 index 000000000..ae8ae1d4e Binary files /dev/null and b/exercises-pdf/nested_resampling_ex.pdf differ diff --git a/exercises/nested-resampling/nested_resampling.html b/exercises/nested-resampling/nested_resampling.html index f9b09f3d7..78b7b38e9 100644 --- a/exercises/nested-resampling/nested_resampling.html +++ b/exercises/nested-resampling/nested_resampling.html @@ -2,7 +2,7 @@ - + @@ -21,7 +21,7 @@ } pre > code.sourceCode { white-space: pre; position: relative; } -pre > code.sourceCode > span { line-height: 1.25; } +pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -56,9 +56,7 @@ - - - - + + - + - + - - @@ -3129,7 +3045,7 @@

Table of contents

  • Exercise 2: AutoML
  • Exercise 3: Kaggle Challenge
  • -

    Notebooks

    +

    Notebooks

    @@ -3149,10 +3065,8 @@

    Exercise 12 – Nested Resampling

    - -
    -

    TBD

    +
      +
1. Understand the model fitting procedure in nested resampling
    2. +
    3. Discuss bias and variance in nested resampling
    4. +

    Suppose that we want to compare four different learners:

    @@ -3359,7 +3276,7 @@

    Exercise 1: T
    -

    +

    @@ -3407,7 +3324,7 @@

    Exercise 2: AutoML

    -

    TBD

    +

Build an AutoML pipeline with R/Python

    In this exercise, we build a simple automated machine learning (AutoML) system that will make data-driven choices on which learner/estimator to use and also conduct the necessary tuning.
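The selection-plus-tuning idea can be sketched in Python with scikit-learn by treating the estimator itself as a hyperparameter; the dataset, candidate learners, and grids below are illustrative assumptions, not the exercise's exact setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pipeline whose final step ("clf") is itself searched over: each dict in
# the grid swaps in a different estimator together with its own grid.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
param_grid = [
    {"clf": [LogisticRegression(max_iter=5000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [DecisionTreeClassifier(random_state=0)], "clf__max_depth": [2, 5, None]},
]
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(type(search.best_params_["clf"]).__name__, round(search.best_score_, 3))
```

`search.best_params_["clf"]` then reports which learner the data-driven choice landed on, alongside its tuned hyperparameters.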

    @@ -3431,7 +3348,7 @@

    Exercise 2: AutoML

    Solution
    -
    +
    (task <- tsk("pima"))
    <TaskClassif:pima> (768 x 9): Pima Indian Diabetes
    @@ -3489,16 +3406,12 @@ 

    Exercise 2: AutoML

    Solution
    -
    +
    ppl_combined <- ppl_preproc %>>% ppl_learners
     plot(ppl_combined)
     graph_learner <- as_learner(ppl_combined)
    -
    -
    -

    -
    -
    +

    @@ -3521,7 +3434,7 @@

    Exercise 2: AutoML

    Solution
    -
    +
    # check available hyperparameters for tuning (converting to data.table for 
     # better readability)
     tail(as.data.table(graph_learner$param_set), 10)
    @@ -3539,7 +3452,6 @@ 

    Exercise 2: AutoML

# set learner ID (used in result tables) graph_learner$id <- "graph_learner"
    -
    @@ -3706,7 +3618,6 @@

    Exercise 2: AutoML

    -

    Conveniently, there is a sugar function, tune_nested(), that takes care of nested resampling in one step. Use it to evaluate your tuned graph learner with
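`tune_nested()` is mlr3-specific, but the same one-step nesting can be sketched in Python by wrapping the tuner in an outer cross-validation; the learner, grid, and fold counts here are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: the tuner picks max_depth via 4-fold CV on each training split.
inner_tuner = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8]},
    cv=4,
)
# Outer loop: 3-fold CV evaluates the *tuned* learner, so the performance
# estimate never sees the data used for the tuning decisions.
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(inner_tuner, X, y, cv=outer_cv)
print(scores, scores.mean())
```

Averaging `scores` gives the nested-resampling estimate, analogous to `rr$aggregate()` in mlr3.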

    @@ -3749,7 +3660,6 @@

    Exercise 2: AutoML

    rr$score()
     rr$aggregate()
    -
    A data.table: 10 x 11
    @@ -3813,7 +3723,6 @@

    Exercise 2: AutoML

    A rr_score: 3 x 9
    -
    classif.ce: 0.2421875
    @@ -3974,6 +3883,7 @@

    Exercise 2: AutoML

    Solution +

    Define resampling strategies

# initialize scores with 0
    @@ -3987,6 +3897,7 @@ 

    Exercise 2: AutoML

    outer_cv = StratifiedKFold(n_splits=NUM_OUTER_FOLDS, shuffle=True, random_state=43)
    +

    Run loop

    for i, (train_index, val_index) in enumerate(outer_cv.split(X_train, y_train)):
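A self-contained sketch of that manual outer loop (the dataset and inner search below are illustrative stand-ins for the exercise's own setup):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_breast_cancer(return_X_y=True)
NUM_OUTER_FOLDS = 3

outer_cv = StratifiedKFold(n_splits=NUM_OUTER_FOLDS, shuffle=True, random_state=43)
scores = np.zeros(NUM_OUTER_FOLDS)

for i, (train_index, val_index) in enumerate(outer_cv.split(X_train, y_train)):
    # Inner loop: tune on the outer-training part only.
    inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         {"max_depth": [2, 4, 8]}, cv=4)
    inner.fit(X_train[train_index], y_train[train_index])
    # Evaluate the refitted best model on the held-out outer fold.
    scores[i] = inner.score(X_train[val_index], y_train[val_index])

print(scores, scores.mean())
```

Each outer fold thus yields one honest performance estimate for the whole tune-then-fit procedure.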
    @@ -4009,6 +3920,7 @@ 

    Exercise 2: AutoML

    Solution +

    per fold

    # print performance per outer fold
    @@ -4018,6 +3930,7 @@ 

    Exercise 2: AutoML

    +

    aggregated

    # print performance aggregated over all folds
    @@ -4027,8 +3940,9 @@ 

    Exercise 2: AutoML

    +

    detailed

    -
    +
    # Nested CV with parameter optimization for ensemble pipeline
     clf_gs_voting = GridSearchCV(
         estimator=clf_voting, 
    @@ -4128,13 +4042,12 @@ 

    Exercise 2: AutoML

    Accuracy does not account for imbalanced data! Let’s check how the test data is distributed:

    -
    +
    unique, counts = np.unique(y_test, return_counts=True)
table = pd.DataFrame(data = dict(zip(unique, counts)), index=[0]) # index necessary because only numeric values are in the dictionary
     table
    -
    @@ -4158,15 +4071,14 @@

    Exercise 2: AutoML

    - +

    Confusion matrix

    -
    +
    pred_test = clf_gs_voting.predict(X_test)
     conf_matrix = pd.DataFrame(confusion_matrix(pred_test, y_test))
     conf_matrix
    -
    @@ -4195,7 +4107,6 @@

    Exercise 2: AutoML

    -

The distribution is skewed towards ‘false’, which accounts for \(2/3\) of all test observations.
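The effect of that imbalance on accuracy can be made concrete with a small synthetic sketch (the labels below are placeholders mimicking the 2/3-vs-1/3 split, not the exercise's data):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Synthetic, imbalanced ground truth (2/3 "false") and a lazy predictor
# that always outputs the majority class.
y_test = np.array([0] * 200 + [1] * 100)
pred = np.zeros_like(y_test)

print(confusion_matrix(y_test, pred))
# Plain accuracy looks fine (2/3 = 0.667), but balanced accuracy, the mean
# of per-class recalls, exposes the failure: (1 + 0) / 2 = 0.5.
print((pred == y_test).mean(), balanced_accuracy_score(y_test, pred))
```

This is why the confusion matrix (or an imbalance-aware metric) is worth inspecting alongside accuracy.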

    @@ -4227,7 +4138,7 @@

    Exercise 3: Ka

    -

    TBD

    +

Apply course contents to a real-world problem

    Make yourself familiar with the Titanic Kaggle challenge.

    @@ -4297,33 +4208,6 @@

    Exercise 3: Ka } } } - const toggleGiscusIfUsed = (isAlternate, darkModeDefault) => { - const baseTheme = document.querySelector('#giscus-base-theme')?.value ?? 'light'; - const alternateTheme = document.querySelector('#giscus-alt-theme')?.value ?? 'dark'; - let newTheme = ''; - if(darkModeDefault) { - newTheme = isAlternate ? baseTheme : alternateTheme; - } else { - newTheme = isAlternate ? alternateTheme : baseTheme; - } - const changeGiscusTheme = () => { - // From: https://github.com/giscus/giscus/issues/336 - const sendMessage = (message) => { - const iframe = document.querySelector('iframe.giscus-frame'); - if (!iframe) return; - iframe.contentWindow.postMessage({ giscus: message }, 'https://giscus.app'); - } - sendMessage({ - setConfig: { - theme: newTheme - } - }); - } - const isGiscussLoaded = window.document.querySelector('iframe.giscus-frame') !== null; - if (isGiscussLoaded) { - changeGiscusTheme(); - } - } const toggleColorMode = (alternate) => { // Switch the stylesheets const alternateStylesheets = window.document.querySelectorAll('link.quarto-color-scheme.quarto-color-alternate'); @@ -4390,15 +4274,13 @@

    Exercise 3: Ka return localAlternateSentinel; } } - const darkModeDefault = false; - let localAlternateSentinel = darkModeDefault ? 'alternate' : 'default'; + let localAlternateSentinel = 'default'; // Dark / light mode switch window.quartoToggleColorScheme = () => { // Read the current dark / light value let toAlternate = !hasAlternateSentinel(); toggleColorMode(toAlternate); setStyleSentinel(toAlternate); - toggleGiscusIfUsed(toAlternate, darkModeDefault); }; // Ensure there is a toggle, if there isn't float one in the top right if (window.document.querySelector('.quarto-color-scheme-toggle') === null) { @@ -4477,9 +4359,10 @@

    Exercise 3: Ka // clear code selection e.clearSelection(); }); - function tippyHover(el, contentFn, onTriggerFn, onUntriggerFn) { + function tippyHover(el, contentFn) { const config = { allowHTML: true, + content: contentFn, maxWidth: 500, delay: 100, arrow: false, @@ -4489,17 +4372,8 @@

    Exercise 3: Ka interactive: true, interactiveBorder: 10, theme: 'quarto', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } window.tippy(el, config); } const noterefs = window.document.querySelectorAll('a[role="doc-noteref"]'); @@ -4513,125 +4387,6 @@

    Exercise 3: Ka const note = window.document.getElementById(id); return note.innerHTML; }); - } - const xrefs = window.document.querySelectorAll('a.quarto-xref'); - const processXRef = (id, note) => { - // Strip column container classes - const stripColumnClz = (el) => { - el.classList.remove("page-full", "page-columns"); - if (el.children) { - for (const child of el.children) { - stripColumnClz(child); - } - } - } - stripColumnClz(note) - if (id === null || id.startsWith('sec-')) { - // Special case sections, only their first couple elements - const container = document.createElement("div"); - if (note.children && note.children.length > 2) { - container.appendChild(note.children[0].cloneNode(true)); - for (let i = 1; i < note.children.length; i++) { - const child = note.children[i]; - if (child.tagName === "P" && child.innerText === "") { - continue; - } else { - container.appendChild(child.cloneNode(true)); - break; - } - } - if (window.Quarto?.typesetMath) { - window.Quarto.typesetMath(container); - } - return container.innerHTML - } else { - if (window.Quarto?.typesetMath) { - window.Quarto.typesetMath(note); - } - return note.innerHTML; - } - } else { - // Remove any anchor links if they are present - const anchorLink = note.querySelector('a.anchorjs-link'); - if (anchorLink) { - anchorLink.remove(); - } - if (window.Quarto?.typesetMath) { - window.Quarto.typesetMath(note); - } - // TODO in 1.5, we should make sure this works without a callout special case - if (note.classList.contains("callout")) { - return note.outerHTML; - } else { - return note.innerHTML; - } - } - } - for (var i=0; i res.text()) - .then(html => { - const parser = new DOMParser(); - const htmlDoc = parser.parseFromString(html, "text/html"); - const note = htmlDoc.getElementById(id); - if (note !== null) { - const html = processXRef(id, note); - instance.setContent(html); - } - }).finally(() => { - instance.enable(); - instance.show(); - }); - } - } else { - // See if we can fetch a full 
url (with no hash to target) - // This is a special case and we should probably do some content thinning / targeting - fetch(url) - .then(res => res.text()) - .then(html => { - const parser = new DOMParser(); - const htmlDoc = parser.parseFromString(html, "text/html"); - const note = htmlDoc.querySelector('main.content'); - if (note !== null) { - // This should only happen for chapter cross references - // (since there is no id in the URL) - // remove the first header - if (note.children.length > 0 && note.children[0].tagName === "HEADER") { - note.children[0].remove(); - } - const html = processXRef(null, note); - instance.setContent(html); - } - }).finally(() => { - instance.enable(); - instance.show(); - }); - } - }, function(instance) { - }); } let selectedAnnoteEl; const selectorForAnnotation = ( cell, annotation) => { @@ -4674,7 +4429,6 @@

    Exercise 3: Ka } div.style.top = top - 2 + "px"; div.style.height = height + 4 + "px"; - div.style.left = 0; let gutterDiv = window.document.getElementById("code-annotation-line-highlight-gutter"); if (gutterDiv === null) { gutterDiv = window.document.createElement("div"); @@ -4700,32 +4454,6 @@

    Exercise 3: Ka }); selectedAnnoteEl = undefined; }; - // Handle positioning of the toggle - window.addEventListener( - "resize", - throttle(() => { - elRect = undefined; - if (selectedAnnoteEl) { - selectCodeLines(selectedAnnoteEl); - } - }, 10) - ); - function throttle(fn, ms) { - let throttle = false; - let timer; - return (...args) => { - if(!throttle) { // first call gets through - fn.apply(this, args); - throttle = true; - } else { // all the others get throttled - if(timer) clearTimeout(timer); // cancel #2 - timer = setTimeout(() => { - fn.apply(this, args); - timer = throttle = false; - }, ms); - } - }; - } // Attach click handler to the DT const annoteDls = window.document.querySelectorAll('dt[data-target-cell]'); for (const annoteDlNode of annoteDls) { @@ -4789,5 +4517,4 @@

    Exercise 3: Ka - \ No newline at end of file diff --git a/exercises/nested-resampling/nested_resampling.qmd b/exercises/nested-resampling/nested_resampling.qmd index 2a3c57acb..8e5cdb57a 100644 --- a/exercises/nested-resampling/nested_resampling.qmd +++ b/exercises/nested-resampling/nested_resampling.qmd @@ -4,16 +4,16 @@ subtitle: "[Introduction to Machine Learning](https://slds-lmu.github.io/i2ml/)" notebook-view: - notebook: ex_nested_resampling_R.ipynb title: "Exercise sheet for R" - url: "https://github.com/slds-lmu/lecture_i2ml/blob/exercises/nested_resampling/ex_forests_R.ipynb" + url: "https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/nested-resampling/ex_nested_resampling_R.ipynb" - notebook: ex_nested_resampling_py.ipynb title: "Exercise sheet for Python" - url: "https://github.com/slds-lmu/lecture_i2ml/blob/exercises/nested_resampling/ex_forests_py.ipynb" + url: "https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/nested-resampling/ex_nested_resampling_py.ipynb" - notebook: sol_nested_resampling_R.ipynb title: "Solutions for R" - url: "https://github.com/slds-lmu/lecture_i2ml/blob/exercises/nested_resampling/sol_forests_R.ipynb" + url: "https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/nested-resampling/sol_nested_resampling_R.ipynb" - notebook: sol_nested_resampling_py.ipynb title: "Solutions for Python" - url: "https://github.com/slds-lmu/lecture_i2ml/blob/exercises/nested_resampling/sol_forests_py.ipynb" + url: "https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/nested-resampling/sol_nested_resampling_py.ipynb" --- ::: {.content-hidden when-format="pdf"} @@ -37,7 +37,8 @@ notebook-view: ## Exercise 1: Tuning Principles ::: {.callout-note title="Learning goals" icon=false} -TBD +1. Understand model fitting procedure in nested resampling +2. Discuss bias and variance in nested resampling ::: @@ -155,7 +156,7 @@ ii. 
False -- we are relatively flexible in choosing the outer loss, but the inne ## Exercise 2: AutoML ::: {.callout-note title="Learning goals" icon=false} -TBD +Build autoML pipeline with R/Python ::: In this exercise, we build a simple automated machine learning (AutoML) system that will make data-driven choices on which learner/estimator to use and also conduct the necessary tuning. @@ -261,7 +262,7 @@ You need to define dependencies, since the tuning process is defined by which le ::: *** -\item Conveniently, there is a sugar function, `tune_nested()`, that takes care of nested resampling in one step. Use it to evaluate your tuned graph learner with +Conveniently, there is a sugar function, `tune_nested()`, that takes care of nested resampling in one step. Use it to evaluate your tuned graph learner with - mean classification error as inner loss, @@ -421,7 +422,9 @@ for i, (train_index, val_index) in enumerate(outer_cv.split(X_train, y_train)):
    **Solution** +Define resampling strategies {{< embed sol_nested_resampling_py.ipynb#2-f-1 echo=true >}} +Run loop {{< embed sol_nested_resampling_py.ipynb#2-f-2 echo=true >}}
    @@ -434,8 +437,11 @@ Extract performance estimates per outer fold and overall (as mean). According to
    **Solution** +per fold {{< embed sol_nested_resampling_py.ipynb#2-g-1 echo=true >}} +aggregated {{< embed sol_nested_resampling_py.ipynb#2-g-2 echo=true >}} +detailed {{< embed sol_nested_resampling_py.ipynb#2-g-3 echo=true >}}
    @@ -453,8 +459,7 @@ Lastly, evaluate the performance on the test set. Think about the imbalance of y Accuracy does not account for imbalanced data! Let's check how the test data is distributed: {{< embed sol_nested_resampling_py.ipynb#2-h-2 echo=true >}} - - +Confusion matrix {{< embed sol_nested_resampling_py.ipynb#2-h-3 echo=true >}} The distribution shows a shift towards 'false' with $2/3$ of all test observations. @@ -474,7 +479,7 @@ Congrats, you just designed a turn-key AutoML system that does (nearly) all the ## Exercise 3: Kaggle Challenge ::: {.callout-note title="Learning goals" icon=false} -TBD +Apply course contents to real-world problem ::: Make yourself familiar with the [Titanic Kaggle challenge](https://www.kaggle.com/c/titanic).