# Dataset analysis, model shortlisting, and model determination: 5 tutorials
In the real world, we rarely get the luxury of neatly packaged training and testing sets. Instead, we're usually handed one massive, unstructured dataset. The key to success? How we split that data.<br>
The way we divide our data hinges on a critical question—are we aiming for interpolation or extrapolation?<br>
• Interpolation means our model will predict within the range of its training data—like filling in the gaps of a familiar puzzle.<br>
• Extrapolation pushes the model beyond its comfort zone, asking it to generalize to entirely new scenarios—like predicting tomorrow’s weather based on historical patterns.

**Notebook 2 of 5**

In this tutorial we explore the notions of data splitting, generalization (interpolation vs. extrapolation), independence in testing data, representative splits, and theory-driven model development.

* Notebook 1: The model decision-making process
* Notebook 2: Splitting the data
* Notebook 3: Evaluating Error: Signal vs Noise
* Notebook 4: The Problem Domain: Choosing the Models
* Notebook 5: Model shortlisting


# Learning objectives
Average time to complete: 30 min

By the end of this tutorial you should be able to:
* Explain why splitting data into training and testing sets is essential for evaluating a model's ability to generalize.
* Use domain knowledge to check that testing data is independent of the training data, so the model gets no unfair advantage.
* Describe how generalization goals guide the next steps: selecting model candidates and developing hypothesis-driven approaches.


## What you will need for this tutorial

* See the [introduction document](https://uottawa-it-research-teaching.github.io/machinelearning/) for general requirements and how Jupyter notebooks work.
* We'll need pandas for convenient data handling. It's a powerful Python package that can read CSV and Excel files, and its data manipulation capabilities come in handy for data cleaning.
* We will use scikit-learn as our machine learning package.
* numpy 
* seaborn 
* matplotlib
* requests
* ipywidgets

## RDM best practices

Good data handling for machine learning begins with good Research Data Management (RDM). The quality of your source data affects the quality of your results, and the reproducibility of those results depends on both the quality of your data sources and how you organize the data so that other people (and machines!) can understand and reuse it.

We also need to respect a few research data management best practices along the way. These best practices are recommended by the Digital Research Alliance of Canada. In the first tutorial we encouraged you to respect two RDM best practices:

* SAVE YOUR RAW DATA IN ORIGINAL FORMAT<br>
* BACKUP YOUR DATA (3-2-1 rule)<br>

These practices should apply in this tutorial as well, but we will also look at best practices of data description, documentation and file naming that will streamline your data processing and project management. 

DESCRIBE YOUR DATA

* Machine Friendly: Describe your dataset with a metadata standard for discovery.
* Human Friendly: Describe your variables, so your colleagues will understand what you meant. Data without good metadata is useless. Give your variables clear names.
* Do not leave cells blank. Use numeric values clearly out of range to define missing (e.g. '99999') or not applicable (e.g. '88888') data, and describe these in your data dictionary.
* Convert your data to open, non-proprietary formats 
* Name your files well with basic meta-data in the file names
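
As a minimal sketch of the points above, the following shows one way to pair a data dictionary with out-of-range sentinel codes in pandas. The column names, the sample values, and the `SENTINELS` list are assumptions made for illustration; they are not part of the tutorial's dataset.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical raw file contents: 99999 = missing, 88888 = not applicable.
raw_csv = io.StringIO(
    "year,town,mean_temp_c\n"
    "2001,Town_A,6.1\n"
    "2002,Town_A,99999\n"
    "2003,Town_A,88888\n"
    "2004,Town_A,6.4\n"
)

# A small data dictionary: one clear description per variable,
# including the meaning of every special code.
data_dictionary = {
    "year": "Calendar year of the observation",
    "town": "Name of the town where the temperature was recorded",
    "mean_temp_c": "Annual mean temperature in degrees Celsius "
                   "(99999 = missing, 88888 = not applicable)",
}

SENTINELS = [99999, 88888]  # assumed codes, taken from the data dictionary

df = pd.read_csv(raw_csv)

# Convert the sentinel codes to NaN so they never enter a calculation.
df["mean_temp_c"] = df["mean_temp_c"].replace(SENTINELS, np.nan)

print(df)
print("Mean of valid values:", df["mean_temp_c"].mean())
```

Without the conversion step, the sentinel codes would be treated as real temperatures and would distort any summary statistic or model fitted to the column.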

# <b>Instructor section</b>

# https://mermaid.live
# https://mermaid.js.org/syntax/flowchart.html#interaction

import webbrowser
from textwrap import dedent
	 
html_content = dedent('''
	<!DOCTYPE html>
	<html>
      <head>
        <style>
          .mermaid svg {
            display: block;
            width: 100%;
            height: 100%;
            margin: 0%;
            padding: 0;
          }
        </style>
        <script src="https://d3js.org/d3.v6.min.js"></script>
        <script src="https://unpkg.com/mermaid@11.4.1/dist/mermaid.min.js"></script>
      </head>
  <body>
    <div class="mermaid" style="max-height: 350px;max-width: 1500px;">
---
title: Flow 2 - Splitting the data
---
    graph TD;
    A[Data<br>'One big grab bag<br>Annual temperatures'] --> B[Task<br>'Split into Training and Testing Sets'];
    B --> C[Goal of Splitting<br>'Test model generalization'];
    C --> D[Types of Generalization];
    D --> E[Interpolation<br>'Estimate within data range<br>e.g., missing middle years'];
    E --> F('Random split<br>Sort years into bins<br>Training vs. Testing');
    D --> G[Extrapolation<br>'Estimate beyond data range<br>e.g., future years or another town'];
    G --> H('Time-based split<br>Train on past, test on future');
    G --> I('Spatial split<br>Test on distant town');
    C --> J[Key Principle<br>'Testing data must be independent']
    J --> K('Avoid unfair advantage<br>e.g., future data tipping off trends');
    J --> L('Domain knowledge needed<br>e.g., 1km vs. 100km town distance');
    L --> M('Too close: Not independent<br>Shared patterns and quirks');
    L --> N('Far enough: Independent<br>Better test of generalization');
    C --> O[Practical Consideration<br>'Split must match model use'];
    O --> P('Representative split<br>Accurate testing error<br>Good to go');
    O --> Q('Unrepresentative split<br>Artificially low error<br>False security');
    Q --> R('High consequence applications<br>Risk of unseen weaknesses');
    S[Next Step<br>'Choosing model candidates<br>Hypothesis- and theory-driven modeling'] --> C;
    </div>

    <script type="module">
      window.addEventListener('load', function () {
      var svgs = d3.selectAll(".mermaid svg");
      svgs.each(function() {
    var svg = d3.select(this);
    svg.html("<g>" + svg.html() + "</g>");
    var inner = svg.select("g");
    var zoom = d3.zoom().on("zoom", function(event) {
      inner.attr("transform", event.transform);
    });
    svg.call(zoom);
  });
});
 
window.callback = function () {
      alert('A callback was triggered');
    };
    const config = {
      startOnLoad: true,
      flowchart: { useMaxWidth: true, htmlLabels: true, curve: 'cardinal' },
      securityLevel: 'loose',
    };
    mermaid.initialize(config);


    </script>
    
  </body>
</html>
	''')
	 
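# Save to file and open in browser
from pathlib import Path

output_path = Path('mermaid_flowchart2.html')
output_path.write_text(html_content, encoding='utf-8')

# Use an absolute file:// URI so the browser can locate the file on any platform
webbrowser.open(output_path.resolve().as_uri())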



# <b>Participant section</b>

```mermaid
graph TD;
    A[Data<br>'One big grab bag<br>Annual temperatures'] --> B[Task<br>'Split into Training and Testing Sets'];
    B --> C[Goal of Splitting<br>'Test model generalization'];
    C --> D[Types of Generalization];
    D --> E[Interpolation<br>'Estimate within data range<br>e.g., missing middle years'];
    E --> F('Random split<br>Sort years into bins<br>Training vs. Testing');
    D --> G[Extrapolation<br>'Estimate beyond data range<br>e.g., future years or another town'];
    G --> H('Time-based split<br>Train on past, test on future');
    G --> I('Spatial split<br>Test on distant town');
    C --> J[Key Principle<br>'Testing data must be independent']
    J --> K('Avoid unfair advantage<br>e.g., future data tipping off trends');
    J --> L('Domain knowledge needed<br>e.g., 1km vs. 100km town distance');
    L --> M('Too close: Not independent<br>Shared patterns and quirks');
    L --> N('Far enough: Independent<br>Better test of generalization');
    C --> O[Practical Consideration<br>'Split must match model use'];
    O --> P('Representative split<br>Accurate testing error<br>Good to go');
    O --> Q('Unrepresentative split<br>Artificially low error<br>False security');
    Q --> R('High consequence applications<br>Risk of unseen weaknesses');
    S[Next Step<br>'Choosing model candidates<br>Hypothesis- and theory-driven modeling'] --> C;
```

### 1. Data and Task: 

```mermaid
graph TD;
    A[Data<br>'One big grab bag<br>Annual temperatures'] --> B[Task<br>'Split into Training and Testing Sets'];
```

The diagram begins with "Data" (a single set of annual temperatures) and the "Task" of splitting it into training and testing sets.


### 2. Goal and Types: 

```mermaid
graph TD;
    C[Goal of Splitting<br>'Test model generalization'];
    C --> D[Types of Generalization];
    D --> E[Interpolation<br>'Estimate within data range<br>e.g., missing middle years'];
    E --> F('Random split<br>Sort years into bins<br>Training vs. Testing');
    D --> G[Extrapolation<br>'Estimate beyond data range<br>e.g., future years or another town'];
    G --> H('Time-based split<br>Train on past, test on future');
    G --> I('Spatial split<br>Test on distant town');
```

The "Goal of Splitting" is to test generalization, which branches into two types:<br>
• "Interpolation" (estimating within the data range, e.g., missing middle years), handled with a random split.<br>
• "Extrapolation" (estimating beyond the data range, e.g., future years or another town), requiring a time-based split (past vs. future) or a spatial split (distant town).
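
The two kinds of split can be sketched as follows. This is an illustration on synthetic annual temperatures, not the tutorial's actual data: the temperature values, the 25% test size, and the `cutoff_year` of 2010 are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic annual temperatures (made-up values, for illustration only).
rng = np.random.default_rng(seed=0)
years = np.arange(1980, 2020)
temps = 5.0 + 0.03 * (years - 1980) + rng.normal(0.0, 0.3, size=years.size)
df = pd.DataFrame({"year": years, "temp_c": temps})

# Interpolation: a random split scatters the testing years
# among the training years.
train_rand, test_rand = train_test_split(df, test_size=0.25, random_state=0)

# Extrapolation: a time-based split trains on the past
# and tests on the future.
cutoff_year = 2010  # assumed cutoff, chosen for this example
train_time = df[df["year"] < cutoff_year]
test_time = df[df["year"] >= cutoff_year]

print("Random split testing years:", sorted(test_rand["year"].tolist()))
print("Time-based testing years:  ", test_time["year"].tolist())
```

In the random split, most testing years sit between training years, so the model is asked to fill gaps. In the time-based split, every testing year lies after the last training year, so the model is asked to go beyond what it has seen.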


### 3. Key Principle: 

```mermaid
graph TD;
    J[Key Principle<br>'Testing data must be independent']
    J --> K('Avoid unfair advantage<br>e.g., future data tipping off trends');
    J --> L('Domain knowledge needed<br>e.g., 1km vs. 100km town distance');
    L --> M('Too close: Not independent<br>Shared patterns and quirks');
    L --> N('Far enough: Independent<br>Better test of generalization');
```

The testing data must be "Independent" to ensure a fair test, avoiding unfair advantages (e.g., future data revealing trends) and requiring domain knowledge to assess independence (e.g., distinguishing a 1km vs. 100km town distance).


Independence Example: <br>
• Too close (1km) means shared patterns, making data not independent.<br>
• Far enough (100km) ensures independence, offering a true test of generalization.
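
One way to turn the distance idea into a check is sketched below. It assumes that straight-line distance is a reasonable proxy for independence, which is itself a domain-knowledge judgment. The town coordinates are approximate, and the `MIN_DISTANCE_KM` threshold of 100 km is an assumption borrowed from the example above, not a general rule.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Approximate coordinates, used only for illustration.
training_town = ("Ottawa", 45.4215, -75.6972)
candidate_test_towns = [
    ("Gatineau", 45.4765, -75.7013),
    ("Kingston", 44.2312, -76.4860),
]

MIN_DISTANCE_KM = 100.0  # assumed threshold; set it from domain knowledge

independent_towns = []
for name, lat, lon in candidate_test_towns:
    distance = haversine_km(training_town[1], training_town[2], lat, lon)
    verdict = "far enough" if distance >= MIN_DISTANCE_KM else "too close"
    print(f"{name}: {distance:.0f} km from {training_town[0]} -> {verdict}")
    if distance >= MIN_DISTANCE_KM:
        independent_towns.append(name)

print("Candidate testing towns:", independent_towns)
```

The right threshold depends on the phenomenon: two towns that share the same weather systems may not be independent even when they are far apart, so the distance check supports domain knowledge rather than replacing it.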


### 4. Practical Consideration:

```mermaid
graph TD;
O[Practical Consideration<br>'Split must match model use'];
O --> P('Representative split<br>Accurate testing error<br>Good to go');
O --> Q('Unrepresentative split<br>Artificially low error<br>False security');
Q --> R('High consequence applications<br>Risk of unseen weaknesses');
```

The split must reflect how the model will be used:<br>
• A "Representative Split" yields accurate testing errors and confidence in the model.<br>
• An "Unrepresentative Split" risks artificially low errors, leading to false security and potential weaknesses in high-consequence scenarios.

#### <b>Summary</b>