# Dataset analysis, model shortlisting, and model determination: Evaluating Error: Signal vs Noise
A model is a clear-edged sketch that highlights patterns while omitting noise. Multiple stories can emerge from the same dataset.  
When modeling, we assume data contains two layers: *signal* (the true, repeatable pattern we want to capture) and *noise* (random distractions). The signal is what empowers models to generalize beyond the original data.

**Notebook 3 of 5**

This tutorial covers the fundamental concepts of modeling, starting with the idea that a model is a simplified representation of data that separates meaningful patterns (signal) from random fluctuations (noise). It explores the two core challenges of modeling, bias (oversimplifying and missing the signal) and variance (overcomplicating and fitting noise), along with strategies to address them, such as testing diverse models and using training/testing splits.

* Notebook 1: The model decision-making process
* Notebook 2: Splitting the data
* Notebook 3: Evaluating Error: Signal vs Noise
* Notebook 4: The Problem Domain: Choosing the Models
* Notebook 5: Model shortlisting


# Learning objectives
Average time to complete: 30 min

By the end of this tutorial you should be able to:
* Explain that a model is a simplified representation of data, which consists of *signal* (repeatable patterns) and *noise* (irrelevant variations).
* Identify and isolate meaningful patterns (*signal*) while minimizing the impact of random fluctuations (*noise*) in data.

## What you will need for this tutorial

* See the [introduction document](https://uottawa-it-research-teaching.github.io/machinelearning/) for general requirements and how Jupyter notebooks work.
* We'll need pandas for convenient data handling. It's a very powerful Python package that can read CSV and Excel files. It also has very good data manipulation capabilities, which come in handy for data cleaning.
* We will use scikit-learn as our machine learning package.
* numpy 
* seaborn 
* matplotlib
* requests
* ipywidgets
* The data files that should have come with this notebook.

## RDM best practices

Good data handling for machine learning begins with good Research Data Management (RDM). The quality of your source data affects the quality of your results, and the reproducibility of those results depends on the quality of your data sources and on how you organize the data so that other people (and machines!) can understand and reuse it.

We also need to respect a few research data management best practices along the way. These best practices are recommended by the Digital Research Alliance of Canada. In the first tutorial we encouraged you to respect two RDM best practices:

* SAVE YOUR RAW DATA IN ORIGINAL FORMAT<br>
* BACKUP YOUR DATA (3-2-1 rule)<br>

These practices should apply in this tutorial as well, but we will also look at best practices of data description, documentation and file naming that will streamline your data processing and project management. 

DESCRIBE YOUR DATA

* Machine Friendly: Describe your dataset with a metadata standard for discovery.
* Human Friendly: Describe your variables, so your colleagues will understand what you meant. Data without good metadata is useless. Give your variables clear names.
* Do not leave cells blank. Use numeric values clearly out of range to define missing (e.g. '99999') or not applicable (e.g. '88888') data, and describe these in your data dictionary.
* Convert your data to open, non-proprietary formats
* Name your files well with basic metadata in the file names
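
As a minimal sketch of how out-of-range codes can be handled at load time, the example below reads a small, made-up CSV (the column names and values are hypothetical, not part of this workshop's data files) and asks pandas to treat the codes `99999` and `88888` as missing values through the `na_values` argument of `pandas.read_csv`. The raw file keeps its codes; only the in-memory copy is converted.

```python
import io

import pandas as pd

# Hypothetical raw file contents for illustration only:
# 99999 = missing, 88888 = not applicable (as documented in a data dictionary).
raw_csv = io.StringIO(
    "participant_id,age_years,income_cad\n"
    "1,34,52000\n"
    "2,99999,61000\n"
    "3,45,88888\n"
)

# The sentinel codes stay untouched in the raw file; they are translated
# to NaN only when the data is loaded for analysis.
SENTINEL_CODES = ["99999", "88888"]
df = pd.read_csv(raw_csv, na_values=SENTINEL_CODES)

print(df)
print(df.isna().sum())  # number of missing entries per column
```

Note that this approach maps both codes to the same `NaN` value. If the difference between "missing" and "not applicable" matters for your analysis, one option is to load the codes as-is and record that distinction in a separate column before replacing them.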

# <b>Instructor section</b>

# https://mermaid.live
# https://mermaid.js.org/syntax/flowchart.html#interaction

import webbrowser
from textwrap import dedent
	 
html_content = dedent('''
	<!DOCTYPE html>
	<html>
     <style>
     .mermaid svg {
    display: block;
    width: 100%;
    height: 100%;
    margin: 0%;
    padding: 0;
    }
     </style>
     <head>
        <script src="https://d3js.org/d3.v6.min.js"></script>
        <script src="https://unpkg.com/mermaid@11.4.1/dist/mermaid.min.js"></script>
      </head>
  <body>
    <div class="mermaid" style="max-height: 350px;max-width: 1500px;">
---
title: 'Flow 3 - Evaluating Error: Signal vs Noise'
---
    graph TD;
    A[Model<br>'A simplified story about data'] --> B(Data<br>'Composed of Signal + Noise');
    B --> C[Signal<br>'Real, repeatable pattern<br>Goal to capture and generalize'];
    B --> D[Noise<br>'Imperfections, extraneous variation<br>Obscures the signal'] ;
    A --> E[Goal of Modeling<br>'Describe signal, ignore noise'];
    E --> F[Challenges];
    F --> G[Bias<br>'Failure to capture all signal<br>Underfitting'];
    G --> H('Example: Linear model<br>High bias, misses pattern');
    F --> I[Variance<br>'Capturing noise instead of signal<br>Overfitting'];
    I --> J('Example: Connect-the-dots<br>High variance, fits noise');
    E --> K[Solutions];
    K --> L[Against Bias<br>'Try various model types<br>Rich pool of candidates'];
    K --> M[Against Variance<br>'Test generalization<br>Train vs. Test data split'];
    M --> N('Good model: Accurate predictions<br>Low variance, captures signal');
    M --> O('Overfitted model: Poor predictions<br>High variance, captures noise');
    P[Next Step<br>'Choosing the right error function'] --> E;
    </div>

    <script type="module">
      window.addEventListener('load', function () {
      var svgs = d3.selectAll(".mermaid svg");
      svgs.each(function() {
    var svg = d3.select(this);
    svg.html("<g>" + svg.html() + "</g>");
    var inner = svg.select("g");
    var zoom = d3.zoom().on("zoom", function(event) {
      inner.attr("transform", event.transform);
    });
    svg.call(zoom);
  });
});
 
window.callback = function () {
      alert('A callback was triggered');
    };
    const config = {
      startOnLoad: true,
      flowchart: { useMaxWidth: true, htmlLabels: true, curve: 'cardinal' },
      securityLevel: 'loose',
    };
    mermaid.initialize(config);


    </script>
    
  </body>
</html>
	''')
	 
# Save to file and open in browser
with open('mermaid_flowchart3.html', 'w') as f:
	f.write(html_content)
	 
webbrowser.open('mermaid_flowchart3.html')



# <b>Participant section</b>

```mermaid
graph TD;
    A[Model<br>'A simplified story about data'] --> B(Data<br>'Composed of Signal + Noise');
    B --> C[Signal<br>'Real, repeatable pattern<br>Goal to capture and generalize'];
    B --> D[Noise<br>'Imperfections, extraneous variation<br>Obscures the signal'] ;
    A --> E[Goal of Modeling<br>'Describe signal, ignore noise'];
    E --> F[Challenges];
    F --> G[Bias<br>'Failure to capture all signal<br>Underfitting'];
    G --> H('Example: Linear model<br>High bias, misses pattern');
    F --> I[Variance<br>'Capturing noise instead of signal<br>Overfitting'];
    I --> J('Example: Connect-the-dots<br>High variance, fits noise');
    E --> K[Solutions];
    K --> L[Against Bias<br>'Try various model types<br>Rich pool of candidates'];
    K --> M[Against Variance<br>'Test generalization<br>Train vs. Test data split'];
    M --> N('Good model: Accurate predictions<br>Low variance, captures signal');
    M --> O('Overfitted model: Poor predictions<br>High variance, captures noise');
```

### 1. Central Concept:

```mermaid
graph TD;
    A[Model<br>'A simplified story about data'] --> B(Data<br>'Composed of Signal + Noise');
```

The "Model" is the starting point, defined as a simplified story about "Data," which is split into "Signal" and "Noise."


### 2. Signal and Noise:

```mermaid
graph TD;
B(Data<br>'Composed of Signal + Noise');
    B --> C[Signal<br>'Real, repeatable pattern<br>Goal to capture and generalize'];
    B --> D[Noise<br>'Imperfections, extraneous variation<br>Obscures the signal'] ;
```

• "Signal" is the repeatable pattern we aim to capture for generalization.<br>
• "Noise" represents imperfections and variations that obscure the signal.
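
To make the two layers concrete, here is a small sketch that builds a synthetic dataset as signal plus noise. The sine curve used as the "true" pattern and the noise level of 0.3 are assumptions chosen for illustration; real data does not come with its signal labelled.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

x = np.linspace(0, 2 * np.pi, 200)

# Assumed signal: a repeatable pattern that is the same every time we measure.
signal = np.sin(x)

# Assumed noise: random fluctuation that changes with every measurement.
noise = rng.normal(loc=0.0, scale=0.3, size=x.size)

# What we actually observe is the sum of the two layers.
y = signal + noise

# A second "measurement" of the same process: same signal, fresh noise.
y_again = signal + rng.normal(loc=0.0, scale=0.3, size=x.size)

# The observations track the signal closely...
print("corr(observed, signal):", np.corrcoef(y, signal)[0, 1])
# ...while the noise from two measurements is essentially unrelated.
print("corr(noise 1, noise 2):", np.corrcoef(y - signal, y_again - signal)[0, 1])
```

The signal is the part that repeats across the two measurements, which is why a model that captures it can generalize; the noise is the part that does not repeat.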

### 3. Goal of Modeling:

```mermaid
graph TD;
E[Goal of Modeling<br>'Describe signal, ignore noise'];
    E --> F[Challenges];
    E --> K[Solutions];
```

The objective is to describe the signal while ignoring the noise. Pursuing this objective gives rise to two challenges, each with a corresponding solution.


### 4. Challenges:

```mermaid
graph TD;
F[Challenges];
    F --> G[Bias<br>'Failure to capture all signal<br>Underfitting'];
    G --> H('Example: Linear model<br>High bias, misses pattern');
    F --> I[Variance<br>'Capturing noise instead of signal<br>Overfitting'];
    I --> J('Example: Connect-the-dots<br>High variance, fits noise');
```

• "Bias" (underfitting) occurs when the model is too simple to capture the signal, for example a linear model fitted to a curved pattern.<br>
• "Variance" (overfitting) occurs when the model is flexible enough to capture the noise, for example a connect-the-dots model that passes through every data point.
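
The two failure modes can be sketched on synthetic data. In the example below, the sine-shaped signal, the noise level, and the sample sizes are all assumptions made for illustration. A straight line stands in for the high-bias model, and piecewise-linear interpolation through every training point (`np.interp`) stands in for the connect-the-dots model.

```python
import numpy as np

rng = np.random.default_rng(seed=1)


def sample(n):
    """Draw n noisy observations of an assumed sine-shaped signal."""
    x = np.sort(rng.uniform(0, 2 * np.pi, n))
    y = np.sin(x) + rng.normal(0.0, 0.3, n)
    return x, y


def mse(y_true, y_pred):
    """Mean squared error between observations and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))


x_train, y_train = sample(40)
x_new, y_new = sample(40)  # fresh data from the same process

# High bias: a straight line cannot follow the curve.
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def line_model(x):
    return slope * x + intercept


# High variance: connect-the-dots passes through every training point,
# so it reproduces the training noise exactly.
def dots_model(x):
    return np.interp(x, x_train, y_train)


for name, model in [("straight line", line_model), ("connect-the-dots", dots_model)]:
    print(
        f"{name:18s} "
        f"error on training data = {mse(y_train, model(x_train)):.3f}, "
        f"error on new data = {mse(y_new, model(x_new)):.3f}"
    )
```

The straight line has a sizeable error even on the data it was fitted to, which is the signature of bias. The connect-the-dots model has zero error on its training data but a clearly larger error on new data, which is the signature of variance: part of what it memorized was noise.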


### 5. Solutions:

```mermaid
graph TD;
K[Solutions];
    K --> L[Against Bias<br>'Try various model types<br>Rich pool of candidates'];
    K --> M[Against Variance<br>'Test generalization<br>Train vs. Test data split'];
    M --> N('Good model: Accurate predictions<br>Low variance, captures signal');
    M --> O('Overfitted model: Poor predictions<br>High variance, captures noise');
```

• To combat "Bias," try a variety of model types to increase the chance that at least one of them captures the signal.<br>
• To combat "Variance," test generalization by splitting the data into training and testing sets. A model that predicts the test data about as well as the training data is good (low variance); a model that does much worse on the test data is overfitted (high variance).<br>
An example of this process can be found in our workshop on Transfer Learning:<br>
<b> [Transfer Learning - CNN Image Augmentation - High variance, captures noise](https://notebooks.githubusercontent.com/view/ipynb?&commit=556370e4ca1775a3910eff763cb27adb113b9f05&device=unknown_device&docs_host=https%3A%2F%2Fdocs.github.com&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f754f74746177612d49542d52657365617263682d7465616368696e672f5472616e736665724c6561726e696e675f434e4e2f353536333730653463613137373561333931306566663736336362323761646231313362396630352f6e6f7465626f6f6b732f322532302d2532307472616e736665724c6561726e696e67434e4e2e6970796e62&logged_in=false&nwo=uOttawa-IT-Research-teaching%2FTransferLearning_CNN&path=notebooks%2F2+-+transferLearningCNN.ipynb&platform=windows&repository_id=790769950&repository_type=Repository&version=128#CNN-Model-with-Image-Augmentation) <br>
[Transfer Learning - VGG-16 Pre-trained model - Low variance, captures signal](https://notebooks.githubusercontent.com/view/ipynb?&commit=556370e4ca1775a3910eff763cb27adb113b9f05&device=unknown_device&docs_host=https%3A%2F%2Fdocs.github.com&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f754f74746177612d49542d52657365617263682d7465616368696e672f5472616e736665724c6561726e696e675f434e4e2f353536333730653463613137373561333931306566663736336362323761646231313362396630352f6e6f7465626f6f6b732f322532302d2532307472616e736665724c6561726e696e67434e4e2e6970796e62&logged_in=false&nwo=uOttawa-IT-Research-teaching%2FTransferLearning_CNN&path=notebooks%2F2+-+transferLearningCNN.ipynb&platform=windows&repository_id=790769950&repository_type=Repository&version=128#VGG-16-Pre-trained-model)
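
Both solutions can be combined in a few lines of scikit-learn. The sketch below is only an illustration under stated assumptions: the data is synthetic (a sine-shaped signal plus noise), and the three candidate models and their settings are arbitrary picks rather than the workshop's shortlist. Each candidate is fitted on the training split and scored on both splits with mean squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration: assumed sine signal plus random noise.
rng = np.random.default_rng(seed=42)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=300)

# Against variance: hold out data the models never see during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Against bias: a pool of candidates with different levels of flexibility.
candidates = {
    "linear regression": LinearRegression(),
    "decision tree (unpruned)": DecisionTreeRegressor(random_state=42),
    "k-nearest neighbours (k=15)": KNeighborsRegressor(n_neighbors=15),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[name] = (train_mse, test_mse)
    print(f"{name:30s} train MSE = {train_mse:.3f}   test MSE = {test_mse:.3f}")
```

When reading the output, compare the two columns for each model. A large error on both splits suggests bias; a near-zero training error paired with a much larger test error suggests variance; similar, low errors on both splits suggest the model has captured the signal.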

#### <b>Summary</b>