# Dataset analysis, model shortlisting, and model determination: model decision-making process
In the previous notebooks we looked at many different models and many different datasets. However, one big question remains: which model should you pick, and when?

That is of course not an easy question to answer. In a few earlier notebooks, we applied a specific machine learning algorithm and it simply failed. For example, with k-nearest neighbours we tried to classify books and failed to produce accurate predictions of their genres. However, when we tried the same task with Naive Bayes, it worked really well!

**Notebook 1 of 5**

In this tutorial series, we’ll explore the process of selecting the right model: a balance between simplicity and complexity, accuracy and generalization. Through this journey, you’ll learn how to approach real-world data challenges with confidence and make informed decisions about which model truly serves your goals. There are five notebooks in this tutorial series, and each one explores a different step in the art of selecting the right model.

* Notebook 1: The model decision-making process
* Notebook 2: Splitting the data
* Notebook 3: Evaluating Error: Signal vs Noise
* Notebook 4: The Problem Domain: Choosing the Models
* Notebook 5: Model shortlisting

**Notebook 1 - The model decision-making process** uses a fictitious scenario representing 140 years of annual temperature records as a basic case study to understand key notions involved in deciding the model to use. Notebooks 2, 3, 4 and 5 provide a deeper exploration of those notions.


# Learning objectives
Average time to complete: 20 min

By the end of this tutorial you should be able to:
* Acquire a basic understanding of what's involved in the machine learning model selection process.

## What you will need for this tutorial

* See the [introduction document](https://uottawa-it-research-teaching.github.io/machinelearning/) for general requirements and how Jupyter notebooks work.
* To start working with Python, we need to launch a program that will interpret and execute our Python commands. Below we list several options. If you don’t have a preference, proceed with the top option in the list that is available on your machine. Otherwise, you may use any interface you like. <br>
* <b> [How to launch Jupyter notebooks](https://swcarpentry.github.io/python-novice-inflammation/#option-a-jupyter-notebook) </b><br><br>
A clean install should provide all the packages needed for this workshop.
* We will use scikit-learn as our machine learning package.
* numpy 
* seaborn 
* matplotlib
* requests
* ipywidgets

## RDM best practices

Good data handling for machine learning begins with good Research Data Management (RDM). The quality of your source data affects the quality of your results, and the reproducibility of those results depends on the quality of your data sources and on how you organize the data so that other people (and machines!) can understand and reuse it.

We also need to respect a few research data management best practices along the way. These best practices are recommended by the Digital Research Alliance of Canada. In the first tutorial we encouraged you to respect two RDM best practices:

* SAVE YOUR RAW DATA IN ORIGINAL FORMAT<br>
* BACKUP YOUR DATA (3-2-1 rule)<br>

These practices should apply in this tutorial as well, but we will also look at best practices of data description, documentation and file naming that will streamline your data processing and project management. 

DESCRIBE YOUR DATA

* Machine Friendly: Describe your dataset with a metadata standard for discovery.
* Human Friendly: Describe your variables, so your colleagues will understand what you meant. Data without good metadata is useless. Give your variables clear names.
* Do not leave cells blank: use numeric values clearly out of range to define missing (e.g. '99999') or not applicable (e.g. '88888') data, and describe these in your data dictionary.
* Convert your data to open, non-proprietary formats 
* Name your files well with basic meta-data in the file names
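
The advice on sentinel values for missing data can be illustrated with a minimal sketch. The readings below are made up, and the two codes simply reuse the example values from the list above. The sketch assumes the sentinels should be converted to NaN before any statistics are computed, so that they cannot distort the results.

```python
import numpy as np

# Sentinel codes as they would be documented in the data dictionary
# (values taken from the examples above)
MISSING = 99999
NOT_APPLICABLE = 88888

# Made-up temperature readings containing both sentinel codes
raw = np.array([12.1, 99999, 13.4, 88888, 12.9])

# Replace the sentinels with NaN so they cannot distort statistics
clean = np.where(np.isin(raw, [MISSING, NOT_APPLICABLE]), np.nan, raw)

# Mean over the valid readings only
mean_temp = float(np.nanmean(clean))
print(clean, round(mean_temp, 2))
```

Without the conversion step, the mean of `raw` would be dominated by the two out-of-range codes, which is exactly why they must be described in the data dictionary.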


# <b>Instructor section</b>

# https://mermaid.live
# https://mermaid.js.org/syntax/flowchart.html#interaction

import webbrowser
from textwrap import dedent
	 
html_content = dedent('''
	<!DOCTYPE html>
	<html>
     <style>
     .mermaid svg {
    display: block;
    width: 100%;
    height: 100%;
    margin: 0%;
    padding: 0;
    }
     </style>
     <head>
        <script src="https://d3js.org/d3.v6.min.js"></script>
        <script src="https://unpkg.com/mermaid@11.4.1/dist/mermaid.min.js"></script>
      </head>
  <body>
    <div class="mermaid" style="max-height: 350px;max-width: 1500px;">
---
title: Flow 1 - The Process
---
    graph TD;
    A[Data<br>'Annual temperatures, 140 years<br>Upward trend with curvature'] --> B[Goal of Modeling<br>'Capture underlying pattern<br>Generalize to new situations'];
    B --> C[Model Candidates];
    C --> D[Linear Model<br>'Straight line, captures tilt<br>Misses curvature'];
    C --> E[Quadratic Model<br>'Order-2 polynomial<br>Captures curvature, adds lift'];
    C --> F[Cubic Model<br>'Order-3 polynomial'];
    C --> G[Quartic Model<br>'Order-4 polynomial'];
    C --> H[Higher-Order Models<br>'Order-5 to Order-8 polynomials<br>Increasing wiggles'];
    C --> I[Interpolation Model<br>'Connects all points<br>Zero error, no generalization'];
    B --> J[Data Split<br>'Separate signal from error'];
    J --> K[Training Data<br>'70% of data<br>Fit models'];
    J --> L[Testing Data<br>'30% of data<br>Evaluate generalization'];
    K --> M[Train Models<br>'Fit candidates to training data'];
    M --> N[Evaluate Errors];
    N --> O[Training Error<br>'Hollow circles<br>Decreases with complexity'];
    N --> P[Testing Error<br>'Solid circles<br>Key for model selection'];
    O --> Q('Linear: High error<br>Misses curvature');
    O --> R('Quadratic: Lower error<br>Captures curvature');
    O --> S('Order-5: Lowest training error<br>Subtle differences');
    P --> T('Quartic: Lowest testing error<br>Best generalization');
    P --> U('Higher-order: Rising error<br>Captures quirks, not pattern');
    P --> V('Interpolation: High testing error<br>Fails to generalize');
    B --> W[Winner<br>'Quartic - Order-4 polynomial<br>Balances fit and generalization'];
    X[Next Step<br>'Deeper discussion on model qualities'] --> B;
    </div>

    <script type="module">
      window.addEventListener('load', function () {
      var svgs = d3.selectAll(".mermaid svg");
      svgs.each(function() {
    var svg = d3.select(this);
    svg.html("<g>" + svg.html() + "</g>");
    var inner = svg.select("g");
    var zoom = d3.zoom().on("zoom", function(event) {
      inner.attr("transform", event.transform);
    });
    svg.call(zoom);
  });
});
 
window.callback = function () {
      alert('A callback was triggered');
    };
    const config = {
      startOnLoad: true,
      flowchart: { useMaxWidth: true, htmlLabels: true, curve: 'cardinal' },
      securityLevel: 'loose',
    };
    mermaid.initialize(config);


    </script>
    
  </body>
</html>
	''')
	 
# Save to file and open in browser
with open('mermaid_flowchart1.html', 'w') as f:
	f.write(html_content)
	 
webbrowser.open('mermaid_flowchart1.html')


# <b>Participant section</b>

## Complete diagram

```mermaid 
flowchart TD;
    A[Data<br>'Annual temperatures, 140 years<br>Upward trend with curvature'] --> B[Goal of Modeling<br>'Capture underlying pattern<br>Generalize to new situations'];
    B --> C[Model Candidates];
    C --> D[Linear Model<br>'Straight line, captures tilt<br>Misses curvature'];
    C --> E[Quadratic Model<br>'Order-2 polynomial<br>Captures curvature, adds lift'];
    C --> F[Cubic Model<br>'Order-3 polynomial'];
    C --> G[Quartic Model<br>'Order-4 polynomial'];
    C --> H[Higher-Order Models<br>'Order-5 to Order-8 polynomials<br>Increasing wiggles'];
    C --> I[Interpolation Model<br>'Connects all points<br>Zero error, no generalization'];
    B --> J[Data Split<br>'Separate signal from error'];
    J --> K[Training Data<br>'70% of data<br>Fit models'];
    J --> L[Testing Data<br>'30% of data<br>Evaluate generalization'];
    K --> M[Train Models<br>'Fit candidates to training data'];
    M --> N[Evaluate Errors];
    N --> O[Training Error<br>'Hollow circles<br>Decreases with complexity'];
    N --> P[Testing Error<br>'Solid circles<br>Key for model selection'];
    O --> Q('Linear: High error<br>Misses curvature');
    O --> R('Quadratic: Lower error<br>Captures curvature');
    O --> S('Order-5: Lowest training error<br>Subtle differences');
    P --> T('Quartic: Lowest testing error<br>Best generalization');
    P --> U('Higher-order: Rising error<br>Captures quirks, not pattern');
    P --> V('Interpolation: High testing error<br>Fails to generalize');
    B --> W[Winner<br>'Quartic - Order-4 polynomial<br>Balances fit and generalization'];
    X[Next Step<br>'Deeper discussion on model qualities'] --> B;
```

## Explanation of the diagram

### 1. Data and Goal: 


```mermaid

flowchart TD;
A[Data<br>'Annual temperatures, 140 years<br>Upward trend with curvature'] --> B[Goal of Modeling:<br>'Capture underlying pattern<br>Generalize to new situations'];
```

* The diagram starts with the "Data" (140 years of temperatures with an upward trend and curvature) and the "Goal of Modeling" (capturing the pattern for generalization).

![alt text](./pics/scatter_plot_temp.jpg)
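
To experiment with a comparable dataset, the sketch below generates 140 fictitious annual temperatures with an upward trend and curvature. It is not the data behind the figure: the baseline, trend coefficients, noise level, and random seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the example is reproducible

years = np.arange(140)  # 140 annual records, indexed 0..139

# Assumed underlying pattern: upward trend plus curvature (made-up coefficients)
signal = 14.0 + 0.002 * years + 0.0001 * years**2

# Assumed measurement error layered on top of the pattern
noise = rng.normal(scale=0.1, size=years.size)

temps = signal + noise
print(f"first decade mean: {temps[:10].mean():.2f}")
print(f"last decade mean:  {temps[-10:].mean():.2f}")
```

Keeping `signal` and `noise` separate mirrors the goal of modeling: a good model should recover something close to `signal` and ignore `noise`.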

### 2. Model Candidates:



Many models could describe this data, but a linear model is an excellent starting point because of its simplicity. The best-fit straight line performs reasonably well: it captures the overall upward trend in the data. However, it cannot represent the curvature present in the dataset, so on closer inspection it falls short of the accuracy we want. The straight line is a good first approximation that leaves room for improvement.


![alt text](./pics/scatter_plot_temp_linear.jpg)
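
One way to see the missed curvature numerically is to look at the residuals of the straight-line fit. The sketch below uses made-up data (all coefficients and the seed are assumptions, not the data in the figure) and `np.polyfit` for the least-squares line. If the line misses an upward curve, the residuals should be positive at both ends of the record and negative in the middle.

```python
import numpy as np

# Made-up stand-in for the temperature record (all coefficients are assumptions)
rng = np.random.default_rng(0)
years = np.arange(140)
temps = 14.0 + 0.002 * years + 0.0001 * years**2 + rng.normal(scale=0.1, size=years.size)

# Best-fit straight line (order-1 polynomial) by least squares
slope, intercept = np.polyfit(years, temps, deg=1)
residuals = temps - (slope * years + intercept)

# Average residual in the early, middle, and late thirds of the record
early, middle, late = (float(chunk.mean()) for chunk in np.array_split(residuals, 3))

print(f"slope per year: {slope:.4f}")
print(f"mean residuals: early={early:+.3f}, middle={middle:+.3f}, late={late:+.3f}")
```

A systematic pattern in the residuals, rather than random scatter around zero, is a sign that the model is missing part of the signal.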

Fortunately, we have numerous alternatives at our disposal. A logical next step is to consider a quadratic model, which incorporates a squared term in addition to the linear component. This type of polynomial introduces curvature to the fit. When we examine the best-fit quadratic curve, we observe that it effectively captures the upward trend on the right side of the graph and the central curve. However, it also introduces a slight upward bend on the left side of the plot, a feature that isn't clearly evident in the original data points. This quadratic model, while more flexible than the linear approach, may be overcompensating in certain areas of the dataset.


![alt text](./pics/scatter_plot_temp_quad.jpg)
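
The effect of adding the squared term can be sketched by fitting both models to the same data and comparing their mean squared error. The data here is made up (coefficients and seed are assumptions), and `numpy.polynomial.Polynomial.fit` is used as one convenient least-squares fitter; the notebook's own figures may have been produced differently.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up stand-in for the temperature record (all coefficients are assumptions)
rng = np.random.default_rng(0)
years = np.arange(140)
temps = 14.0 + 0.002 * years + 0.0001 * years**2 + rng.normal(scale=0.1, size=years.size)

# Least-squares fits of order 1 (linear) and order 2 (quadratic)
linear = Polynomial.fit(years, temps, deg=1)
quadratic = Polynomial.fit(years, temps, deg=2)


def mse(model):
    """Mean squared error of a fitted polynomial on the full dataset."""
    return float(np.mean((temps - model(years)) ** 2))


mse_linear, mse_quadratic = mse(linear), mse(quadratic)
print(f"linear MSE:    {mse_linear:.4f}")
print(f"quadratic MSE: {mse_quadratic:.4f}")
```

Because the quadratic family contains every straight line, its error on the data it was fitted to can never be higher than the linear model's. Whether the extra flexibility helps on new data is a separate question.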

```mermaid
flowchart TD;
    	C[Model Candidates];
        C --> D[Linear Model<br>'Straight line, captures tilt<br>Misses curvature'];
        C --> E[Quadratic Model<br>'Order-2 polynomial<br>Captures curvature, adds lift'];
        C --> F[Cubic Model<br>'Order-3 polynomial'];
        C --> G[Quartic Model<br>'Order-4 polynomial'];
        C --> H[Higher-Order Models<br>'Order-5 to Order-8 polynomials<br>Increasing wiggles'];
        C --> I[Interpolation Model<br>'Connects all points<br>Zero error, no generalization'];
```

The candidates range from a simple "Linear Model" to complex "Higher-Order Models" (order-5 to order-8) and an extreme "Interpolation Model," each with strengths and weaknesses (e.g., missing curvature, adding wiggles, or overfitting).

As we progress to higher-order polynomials, the fit to the data appears to improve. However, this comes with a trade-off: the curve begins to wiggle more, introducing additional undulations. Taken to its logical extreme, we could create a model that passes exactly through every data point, resulting in zero error on our observed measurements. <br>
![alt text](./pics/scatter_plot_temp_inter.jpg)<br>
But this raises an important question: does achieving zero deviation from the data necessarily equate to the most effective model? This scenario highlights the balance between accuracy and simplicity in model selection, touching on the concept of overfitting versus generalization in data analysis. We demonstrate this question in our <b> [Linear Regression Model](https://notebooks.githubusercontent.com/view/ipynb?&commit=5ead46fe16edd62e097872af1288bc211f33521c&device=unknown_device&docs_host=https%3A%2F%2Fdocs.github.com&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f754f74746177612d49542d52657365617263682d7465616368696e672f4d4c5f636c65616e696e675f616e645f72656772657373696f6e2f356561643436666531366564643632653039373837326166313238386263323131663333353231632f6e6f7465626f6f6b732f4d4c54535f32303234303533305f444352465f4e6f7465626f6f6b5f454e5f312e302e6970796e62&logged_in=false&nwo=uOttawa-IT-Research-teaching%2FML_cleaning_and_regression&path=notebooks%2FMLTS_20240530_DCRF_Notebook_EN_1.0.ipynb?#Linear-Regression-Model-for-Machine-Learning)</b> workshop.
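
The zero-error extreme can be sketched with a simple interpolating curve. The example below is an illustration under assumptions: the data is made up, and piecewise-linear interpolation (`np.interp`) stands in for the interpolation model, which may differ from the one used in the figure. The curve is built from the even-numbered years only, then checked against the odd-numbered years it has never seen.

```python
import numpy as np

# Made-up stand-in for the temperature record (all coefficients are assumptions)
rng = np.random.default_rng(0)
years = np.arange(140)
temps = 14.0 + 0.002 * years + 0.0001 * years**2 + rng.normal(scale=0.1, size=years.size)

knots = np.arange(0, 140, 2)     # even years: points the curve passes through
held_out = np.arange(1, 139, 2)  # odd years in between, never seen by the model


def interpolate(x):
    """Piecewise-linear curve through every knot."""
    return np.interp(x, years[knots], temps[knots])


# Error on the points used to build the curve, and on the points in between
train_mse = float(np.mean((temps[knots] - interpolate(years[knots])) ** 2))
test_mse = float(np.mean((temps[held_out] - interpolate(years[held_out])) ** 2))

print(f"error on points used to build the curve: {train_mse:.6f}")
print(f"error on held-out points:                {test_mse:.6f}")
```

The curve reproduces its own points exactly, but it also reproduces their noise, so the error on the held-out years is clearly not zero.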

The value of models lies in their ability to extrapolate insights from one context to another. When we employ a model, we operate under the premise that there's a fundamental pattern we aim to identify, albeit one that's obscured by noise or error. The primary objective of an effective model is to penetrate this layer of error and reveal the underlying pattern. In essence, a good model acts as a filter, sifting through the noise to discern the true signal within the data. 
The most common way to distinguish between meaningful trends and random fluctuations is to split our data into two groups. We use one group to train our model, and then we test the model to see how closely it fits the second group. The first group is the training set, and the second group is the testing set.


### 3. Data Split: 

```mermaid
flowchart TD;
    	J[Data Split<br>'Separate signal from error'];
    J --> K[Training Data<br>'70% of data<br>Fit models'];
    J --> L[Testing Data<br>'30% of data<br>Evaluate generalization'];
```

* The data is split into "Training Data" (70%) and "Testing Data" (30%) to separate signal from error and evaluate generalization.
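
A 70/30 split can be sketched with a random permutation of the row indices. This is a minimal illustration with made-up data and an arbitrary seed; in practice, scikit-learn's `train_test_split` is a common way to do the same thing.

```python
import numpy as np

# Made-up stand-in for the temperature record (all coefficients are assumptions)
rng = np.random.default_rng(0)
years = np.arange(140)
temps = 14.0 + 0.002 * years + 0.0001 * years**2 + rng.normal(scale=0.1, size=years.size)

# Shuffle the row indices, then cut them 70% / 30%
shuffled = rng.permutation(years.size)
n_train = years.size * 7 // 10  # integer arithmetic avoids rounding surprises
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

x_train, y_train = years[train_idx], temps[train_idx]
x_test, y_test = years[test_idx], temps[test_idx]

print(f"training rows: {x_train.size}, testing rows: {x_test.size}")
```

Shuffling before cutting matters here: taking the first 98 years for training would leave the most recent years, where the curvature is strongest, entirely in the testing set.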


### 4. Train and Evaluate:


```mermaid
flowchart TD;
    	M[Train Models<br>'Fit candidates to training data'];
    M --> N[Evaluate Errors];
```

 * Models are trained on the training data, and their performance is assessed via "Training Error" (how well they fit the training set) and "Testing Error" (how well they generalize to the test set). An example is shown here <b>[SVM Kernels](https://notebooks.githubusercontent.com/view/ipynb?&commit=7cc9d6d968c92a16a792e052658cfbc72065777b&device=unknown_device&docs_host=https%3A%2F%2Fdocs.github.com&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f754f74746177612d49542d52657365617263682d7465616368696e672f53564d2f376363396436643936386339326131366137393265303532363538636662633732303635373737622f6e6f7465626f6f6b732f4d4c54535f32303234303533305f53564d5f4e6f7465626f6f6b324b65726e656c735f454e5f312e302e6970796e62&logged_in=false&nwo=uOttawa-IT-Research-teaching%2FSVM&path=notebooks%2FMLTS_20240530_SVM_Notebook2Kernels_EN_1.0.ipynb&platform=windows&repository_id=696287816&repository_type=Repository&version=128#Run-SVM-with-default-hyperparameters)
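
The train-then-evaluate step for a single candidate can be sketched as follows. The data, seed, and the helper name `mse` are assumptions for illustration. The model is fitted on the training rows only, and the same error measure is then computed on both sets.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up stand-in for the temperature record (all coefficients are assumptions)
rng = np.random.default_rng(0)
years = np.arange(140)
temps = 14.0 + 0.002 * years + 0.0001 * years**2 + rng.normal(scale=0.1, size=years.size)

# 70/30 split of the row indices
shuffled = rng.permutation(years.size)
train_idx, test_idx = shuffled[:98], shuffled[98:]


def mse(model, idx):
    """Mean squared error of `model` on the rows selected by `idx`."""
    return float(np.mean((temps[idx] - model(years[idx])) ** 2))


# Fit one candidate (order-4 polynomial) using the training rows only
quartic = Polynomial.fit(years[train_idx], temps[train_idx], deg=4)

train_error = mse(quartic, train_idx)  # how well it fits what it saw
test_error = mse(quartic, test_idx)    # how well it generalizes

print(f"training error: {train_error:.4f}")
print(f"testing error:  {test_error:.4f}")
```

The testing rows play no part in the fit, which is what makes the testing error an honest estimate of how the model behaves on new data.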

### 5. Error Insights: 
    

```mermaid
flowchart TD;
    	N[Evaluate Errors];
    N --> O[Training Error<br>'Hollow circles<br>Decreases with complexity'];
    N --> P[Testing Error<br>'Solid circles<br>Key for model selection'];
    O --> Q('Linear: High error<br>Misses curvature');
    O --> R('Quadratic: Lower error<br>Captures curvature');
    O --> S('Order-5: Lowest training error<br>Subtle differences');
    P --> T('Quartic: Lowest testing error<br>Best generalization');
    P --> U('Higher-order: Rising error<br>Captures quirks, not pattern');
    P --> V('Interpolation: High testing error<br>Fails to generalize');
```

* "Training Error" decreases with complexity, with the "Order-5" model showing the lowest error.
* "Testing Error" reveals the "Quartic Model" as the best, with higher-order models showing increased error due to overfitting, and the interpolation model failing entirely.

We see this evaluation of errors in the Support Vector Machine workshop <b>[SVM Kernel and Hyperparameter](https://notebooks.githubusercontent.com/view/ipynb?&commit=7cc9d6d968c92a16a792e052658cfbc72065777b&device=unknown_device&docs_host=https%3A%2F%2Fdocs.github.com&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f754f74746177612d49542d52657365617263682d7465616368696e672f53564d2f376363396436643936386339326131366137393265303532363538636662633732303635373737622f6e6f7465626f6f6b732f4d4c54535f32303234303533305f53564d5f4e6f7465626f6f6b324b65726e656c735f454e5f312e302e6970796e62&logged_in=false&nwo=uOttawa-IT-Research-teaching%2FSVM&path=notebooks%2FMLTS_20240530_SVM_Notebook2Kernels_EN_1.0.ipynb&platform=windows&repository_id=696287816&repository_type=Repository&version=128#Run-SVM-with-default-hyperparameters)
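
The two error curves can be sketched by repeating the fit for each polynomial order. The data and seed are made up, so the numbers printed will not match the notebook's figures, and the order with the lowest testing error may differ from the one reported here. What should carry over is the shape of the training curve, which can only go down as the order increases.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up stand-in for the temperature record (all coefficients are assumptions)
rng = np.random.default_rng(0)
years = np.arange(140)
temps = 14.0 + 0.002 * years + 0.0001 * years**2 + rng.normal(scale=0.1, size=years.size)

# 70/30 split of the row indices
shuffled = rng.permutation(years.size)
train_idx, test_idx = shuffled[:98], shuffled[98:]

train_mse, test_mse = {}, {}
for degree in range(1, 9):  # order-1 (linear) up to order-8
    model = Polynomial.fit(years[train_idx], temps[train_idx], deg=degree)
    train_mse[degree] = float(np.mean((temps[train_idx] - model(years[train_idx])) ** 2))
    test_mse[degree] = float(np.mean((temps[test_idx] - model(years[test_idx])) ** 2))
    print(f"order {degree}: train={train_mse[degree]:.5f}  test={test_mse[degree]:.5f}")
```

Each higher order contains all lower-order polynomials as special cases, so the training error cannot increase. The testing error has no such guarantee, which is why it is the one used for model selection.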

### 6. The Selected Model: 


```mermaid
flowchart TD;
    A[Data<br>'Annual temperatures, 140 years<br>Upward trend with curvature'] --> B[Goal of Modeling:<br>'Capture underlying pattern<br>Generalize to new situations'];
    B --> W[Winner<br>'Quartic - Order-4 polynomial<br>Balances fit and generalization'];
```

* In this example, the train-test approach selects the order-4 (quartic) polynomial as the best model. <br>
![Quartic](./pics/scatter_plot_temp_inter.jpg)<br>We saw this approach in the <b>[Random Forest + Noisy Dataset Tutorial](https://notebooks.githubusercontent.com/view/ipynb?&commit=2f71e04f66007fd2ea722fd786e34c05273e0b4d&device=unknown_device&docs_host=https%3A%2F%2Fdocs.github.com&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f754f74746177612d49542d52657365617263682d7465616368696e672f4465636973696f6e54726565732f326637316530346636363030376664326561373232666437383665333463303532373365306234642f6e6f7465626f6f6b732f4d4c54535f32303234303533305f445452465f4e6f7465626f6f6b3352616e646f6d466f726573744e6f69737944617461736574735f454e5f312e302e6970796e62&logged_in=false&nwo=uOttawa-IT-Research-teaching%2FDecisionTrees&path=notebooks%2FMLTS_20240530_DTRF_Notebook3RandomForestNoisyDatasets_EN_1.0.ipynb&platform=windows&repository_id=671633172&repository_type=Repository&version=128#Random-Forest-+-Noisy-Dataset-Tutorial)
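
The selection rule itself, picking the candidate with the lowest testing error, can be sketched as a small function. The function name `select_degree`, the data, and the seed are all hypothetical. With this made-up data the selected order need not be 4 as in the example above; only the rule is the same.

```python
import numpy as np
from numpy.polynomial import Polynomial


def select_degree(x_train, y_train, x_test, y_test, degrees):
    """Return (best_degree, test_errors) using the lowest testing error as the rule."""
    test_errors = {}
    for degree in degrees:
        model = Polynomial.fit(x_train, y_train, deg=degree)
        test_errors[degree] = float(np.mean((y_test - model(x_test)) ** 2))
    best = min(test_errors, key=test_errors.get)
    return best, test_errors


# Made-up stand-in for the temperature record (all coefficients are assumptions)
rng = np.random.default_rng(0)
years = np.arange(140)
temps = 14.0 + 0.002 * years + 0.0001 * years**2 + rng.normal(scale=0.1, size=years.size)

# 70/30 split of the row indices
shuffled = rng.permutation(years.size)
train_idx, test_idx = shuffled[:98], shuffled[98:]

best_degree, test_errors = select_degree(
    years[train_idx], temps[train_idx],
    years[test_idx], temps[test_idx],
    degrees=range(1, 9),
)
print(f"selected polynomial order: {best_degree}")
```

A different random split can change which order wins, especially when several candidates have similar testing errors.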

#### <b>Summary</b>
We discussed a process for modeling an example data set that exhibits an upward trend with curvature, with the goal of capturing this pattern so that it generalizes. We compared several model candidates, ranging from a simple Linear Model to complex Higher-Order Models (order-5 to order-8) and an extreme Interpolation Model, each with trade-offs such as missing curvature or overfitting. We divided the data into 70% Training Data and 30% Testing Data to distinguish signal from error and to assess how well the models generalize. During training and evaluation, Training Error decreased with model complexity, with the Order-5 model achieving the lowest error, while Testing Error identified the Quartic (Order-4) Polynomial as the best performer, because higher-order models overfit and the interpolation model fails to generalize. The Quartic Model therefore emerges as the winner under this train-test approach. Notebook 2 explores data splitting in more depth.


### Please proceed to Notebook 2