# Dataset analysis, model shortlisting, and model determination: 5 tutorials
The Problem Domain Selection phase is a critical first step in any data science project. This is where you identify the fundamental nature of your problem, which then guides your algorithm choice.

**Notebook 4 of 5**

This tutorial offers a comprehensive guide through data science tasks, starting with problem selection and branching into four main modeling categories: supervised learning (classification and regression), unsupervised learning (clustering and dimensionality reduction), natural language processing (text analysis and large language models), and computer vision (image processing with CNNs and DNNs).

* Notebook 1: The model decision-making process
* Notebook 2: Splitting the data
* Notebook 3: Evaluating Error: Signal vs Noise
* Notebook 4: The Problem Domain: Choosing the Models
* Notebook 5: Model shortlisting


# Learning objectives
Average time to complete: 30 minutes

By the end of this tutorial you should be able to:
* Clarify project objectives by identifying whether the goal is prediction, classification, pattern discovery, or relationship analysis, and assess the data types involved (text, images, numerical, etc.).

## What you will need for this tutorial

* See the [introduction document](https://uottawa-it-research-teaching.github.io/machinelearning/) for general requirements and how Jupyter notebooks work.
* We'll need pandas for convenient data handling. It's a powerful Python package that can read CSV and Excel files, and its data manipulation capabilities come in handy for data cleaning.
* We will use scikit-learn as our machine learning package.
* numpy 
* seaborn 
* matplotlib
* requests
* ipywidgets
* The data files that should have come with this notebook.

## RDM best practices

Good data handling for machine learning begins with good Research Data Management (RDM). The quality of your source data affects the quality of your results, and the reproducibility of those results depends on both the quality of your data sources and how you organize the data so that other people (and machines!) can understand and reuse it.

We also need to respect a few research data management best practices along the way. These best practices are recommended by the Digital Research Alliance of Canada. In the first tutorial we encouraged you to respect two RDM best practices:

* SAVE YOUR RAW DATA IN ORIGINAL FORMAT<br>
* BACKUP YOUR DATA (3-2-1 rule)<br>

These practices should apply in this tutorial as well, but we will also look at best practices of data description, documentation and file naming that will streamline your data processing and project management. 

DESCRIBE YOUR DATA

* Machine Friendly: Describe your dataset with a metadata standard for discovery.
* Human Friendly: Describe your variables, so your colleagues will understand what you meant. Data without good metadata is useless. Give your variables clear names.
* Do not leave cells blank. Use numeric values clearly out of range to mark missing (e.g. '99999') or not applicable (e.g. '88888') data, and describe these in your data dictionary.
* Convert your data to open, non-proprietary formats 
* Name your files well with basic meta-data in the file names
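
The sentinel-code convention above can be illustrated with a minimal pandas sketch. The column names and the inline CSV are made up for illustration; only the codes '99999' and '88888' come from the list above. It shows one way to tell pandas to treat those codes as missing values when loading a file.

```python
import io

import pandas as pd

# Sentinel codes as they would be documented in the data dictionary.
MISSING = "99999"         # value was not recorded
NOT_APPLICABLE = "88888"  # question does not apply to this record

# Small inline CSV standing in for a real data file (hypothetical columns).
raw_csv = io.StringIO(
    "participant_id,age_years,weight_kg\n"
    "1,34,70.5\n"
    "2,99999,82.0\n"
    "3,41,88888\n"
)

# na_values tells pandas to read both sentinel codes as NaN.
df = pd.read_csv(raw_csv, na_values=[MISSING, NOT_APPLICABLE])

# Count how many values were flagged as missing in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Note that reading both codes as NaN discards the distinction between "missing" and "not applicable". If that distinction matters for your analysis, load the raw values first and recode them explicitly.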

# <b>Instructor section</b>

# https://mermaid.live
# https://mermaid.js.org/syntax/flowchart.html#interaction

import webbrowser
from textwrap import dedent
	 
html_content = dedent('''
	<!DOCTYPE html>
	<html>
      <head>
        <style>
          .mermaid svg {
            display: block;
            width: 100%;
            height: 100%;
            margin: 0;
            padding: 0;
          }
        </style>
        <script src="https://d3js.org/d3.v6.min.js"></script>
        <script src="https://unpkg.com/mermaid@11.4.1/dist/mermaid.min.js"></script>
      </head>
  <body>
    <div class="mermaid" style="max-height: 350px;max-width: 1500px;">
---
title: 'Flow 4 - The Problem Domain: Choosing the Model'
---
        flowchart TD;
	    start[Start: Data Science Project] --> problem_type{Select Problem Domain};
        %% click start href "https://www.github.com" "This is a link" _blank
	    %% Problem Type Branching;
	    problem_type -->|Supervised Learning| supervised{Supervised Learning};
	    problem_type -->|Unsupervised Learning| unsupervised{Unsupervised Learning};
	    problem_type -->|Natural Language| nlp_domain{NLP Tasks};
	    problem_type -->|Computer Vision| vision_domain{Vision Tasks};
        
         
	    %% Supervised Learning Paths;
	    supervised -->|Classification| classification{Classification Complexity};
	    supervised -->|Regression| regression{Regression Complexity};
        click supervised call callback() "Tooltip for a callback"
	    
	    %% Classification Algorithms;
	    classification -->|Small Dataset| simple_classification[/Simple Classification/];
	    classification -->|Complex Dataset <a href='https://notebooks.githubusercontent.com/view/ipynb?&commit=7cc9d6d968c92a16a792e052658cfbc72065777b&device=unknown_device&docs_host=https%3A%2F%2Fdocs.github.com&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f754f74746177612d49542d52657365617263682d7465616368696e672f53564d2f376363396436643936386339326131366137393265303532363538636662633732303635373737622f6e6f7465626f6f6b732f4d4c54535f32303234303533305f53564d5f4e6f7465626f6f6b334265796f6e644c696e6561724879706572706c616e655f454e5f312e302e6970796e62&logged_in=false&nwo=uOttawa-IT-Research-teaching%2FSVM&path=notebooks%2FMLTS_20240530_SVM_Notebook3BeyondLinearHyperplane_EN_1.0.ipynb&platform=windows&repository_id=696287816&repository_type=Repository&version=128#Visualizing-Non-linear-classification'>Example</a>| advanced_classification[/Advanced Classification/];
	    
	    simple_classification --> KNN["K-Nearest Neighbors (KNN) <a href='https://github.com/uOttawa-IT-Research-teaching/ML_k-nearest-neightbours/blob/main/notebooks/MLTS_20241205_KNN_K-nearest%20neighbours_EN_1.0.ipynb'>KNN</a>;
	    Distance-based classification;
	    Works with small datasets;
	    Low computational complexity"];
	    
	    simple_classification --> NaiveBayes["Naive Bayes <a href='https://github.com/uOttawa-IT-Research-teaching/ML_naive_bayes/blob/main/notebooks/MLTS_20240530_NB_BayesTrainingNotebook_EN_1.0.ipynb'>NaiveBayes</a>;
	    Probabilistic classifier;
	    Fast training;
	    Works with categorical data"];
	    
	    advanced_classification --> SVM["Support Vector Machines (SVM) <a href='https://github.com/uOttawa-IT-Research-teaching/SVM/blob/main/notebooks/MLTS_20240530_SVM_Notebook1Regularization_EN_1.0.ipynb'>SVM</a>;
	    Finds optimal separation hyperplane;
	    Effective in high-dimensional spaces;
	    Complex decision boundaries"];

            
	    advanced_classification --> RandomForest["Random Forest;
	    Ensemble of decision trees <a href='https://github.com/uOttawa-IT-Research-teaching/DecisionTrees/blob/main/notebooks/MLTS_20240530_DTRF_Notebook2RandomForest_EN_1.0.ipynb'>RF</a>;
	    Handles complex relationships;
	    Reduces overfitting;
	    High accuracy"];
         	    
	    %% Regression Paths;
	    regression --> DecisionTrees["Decision Trees <a href='https://github.com/uOttawa-IT-Research-teaching/DecisionTrees/blob/main/notebooks/MLTS_20240530_DTRF_Notebook1DecisionTree_EN_1.0.ipynb'>DT</a>;
	    Handles non-linear relationships;
	    Captures complex interactions;
	    Interpretable results"];
	    
	    %% NLP Domain;
	    nlp_domain -->|Text Classification| nlp_classification["Traditional NLP Techniques <a href='https://github.com/uOttawa-IT-Research-teaching/ML_Natural_Language_Processing/blob/main/notebooks/01-Sentiment%20Analysis.ipynb'>NLP</a>;
	    Naive Bayes;
	    SVM"];
	    
	    nlp_domain -->|Advanced Language Tasks| advanced_nlp{Advanced NLP};
	    
	    advanced_nlp -->|Large Language Models| LLM["Large Language Models (LLM)
	    Transformer-based ;
	    Contextual understanding;
	    Generative capabilities"];
	    
	    advanced_nlp -->|Transfer Learning| TransferLearning["Transfer Learning <a href='https://github.com/uOttawa-IT-Research-teaching/TransferLearning_CNN/blob/main/notebooks/2%20-%20transferLearningCNN.ipynb'>TransferL</a>
	    Leverage pre-trained models;
	    Reduce training time;
	    Effective with limited data"];
	    
	    %% Computer Vision;
	    vision_domain -->|Feature Extraction| CNN["Convolutional Neural Networks (CNN) <a href='https://github.com/uOttawa-IT-Research-teaching/TransferLearning_CNN/blob/main/notebooks/1%20-%20CNN_Concepts.ipynb'>CNN</a>
	    Specialized image processing;
	    Learns hierarchical features;
	    State-of-the-art computer vision"];
	    
	    vision_domain -->|Complex Visual Tasks| DNN["Deep Neural Networks (DNN) <a href='https://github.com/uOttawa-IT-Research-teaching/DNN-CNN_Intro/blob/main/2%20-%20DNN_PyTorch.ipynb'>DNN</a>
	    Multiple hidden layers;
	    Complex pattern recognition <a href='https://github.com/uOttawa-IT-Research-teaching/DeepLearning_CNN/blob/main/2%20-%20Inference.ipynb'>Inference</a>;
	    Versatile architecture"];
	    
	    %% Unsupervised Learning;
	    unsupervised -->|Clustering| clustering["Clustering Algorithms
	    K-Means;
	    Hierarchical Clustering;
	    DBSCAN"];
	    
	    unsupervised -->|Dimensionality Reduction| dim_reduction["Dimensionality Reduction
	    PCA;
        t-SNE;
	    UMAP"];
	    
	    %% Model Inference and Deployment;
	    KNN & NaiveBayes & SVM & RandomForest & DecisionTrees &  LLM & TransferLearning & CNN & DNN --> inference{Model Inference};
	    inference -->|Deployment Preparation| ModelDeployment["Model Deployment
	    Performance optimization
	    Real-world prediction
	    Scalable inference"];
	    
	    %% Styling
	    classDef decision fill:#f9e79f,stroke:#d35400;
	    classDef algorithm fill:#d4edda,stroke:#155724;
	    classDef deployment fill:#f2f3f4,stroke:#2c3e50;
	    class problem_type,supervised,unsupervised,nlp_domain,vision_domain,classification,regression,advanced_nlp,inference decision;
	    class KNN,NaiveBayes,SVM,RandomForest,DecisionTrees,LLM,TransferLearning,CNN,DNN algorithm;
	    class ModelDeployment deployment;
    </div>

    <script type="module">
      window.addEventListener('load', function () {
      var svgs = d3.selectAll(".mermaid svg");
      svgs.each(function() {
    var svg = d3.select(this);
    svg.html("<g>" + svg.html() + "</g>");
    var inner = svg.select("g");
    var zoom = d3.zoom().on("zoom", function(event) {
      inner.attr("transform", event.transform);
    });
    svg.call(zoom);
  });
});
 
window.callback = function () {
      alert('A callback was triggered');
    };
    const config = {
      startOnLoad: true,
      flowchart: { useMaxWidth: true, htmlLabels: true, curve: 'cardinal' },
      securityLevel: 'loose',
    };
    mermaid.initialize(config);


    </script>
    
  </body>
</html>
	''')
	 
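# Save to file and open in the default browser
from pathlib import Path

output_path = Path('mermaid_flowchart4.html')
output_path.write_text(html_content, encoding='utf-8')

# webbrowser expects a URL; an absolute file URI is more reliable than a bare relative path
webbrowser.open(output_path.resolve().as_uri())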



# <b>Participant section</b>

```mermaid
---
title: 'Flow 4 - The Problem Domain: Choosing the Model'
---
        flowchart TD;
	    start[Start: Data Science Project] --> problem_type{Select Problem Domain};
        click start href "https://www.github.com" "This is a link" _blank
	    %% Problem Type Branching;
	    problem_type -->|Supervised Learning| supervised{Supervised Learning};
	    problem_type -->|Unsupervised Learning| unsupervised{Unsupervised Learning};
	    problem_type -->|Natural Language| nlp_domain{NLP Tasks};
	    problem_type -->|Computer Vision| vision_domain{Vision Tasks};
        click problem_type href "https://shorturl.at/ukoga" "This is a link" _blank
         
	    %% Supervised Learning Paths;
	    supervised -->|Classification| classification{Classification Complexity};
	    supervised -->|Regression| regression{Regression Complexity};
        click supervised call callback() "Tooltip for a callback"
	    
	    %% Classification Algorithms;
	    classification -->|Small Dataset| simple_classification[/Simple Classification/];
	    classification -->|Complex Dataset| advanced_classification[/Advanced Classification/];
	    
	    simple_classification --> KNN["K-Nearest Neighbors (KNN);
	    Distance-based classification;
	    Works with small datasets;
	    Low computational complexity"];
	    
	    simple_classification --> NaiveBayes["Naive Bayes;
	    Probabilistic classifier;
	    Fast training;
	    Works with categorical data"];
	    
	    advanced_classification --> SVM["Support Vector Machines (SVM);
	    Finds optimal separation hyperplane;
	    Effective in high-dimensional spaces;
	    Complex decision boundaries"];

            
	    advanced_classification --> RandomForest["Random Forest;
	    Ensemble of decision trees <a href='https://shorturl.at/ukoga'>link</a>;
	    Handles complex relationships;
	    Reduces overfitting;
	    High accuracy"];
         	    
	    %% Regression Paths;
	    regression --> DecisionTrees["Decision Trees;
	    Handles non-linear relationships;
	    Captures complex interactions;
	    Interpretable results"];
	    
	    %% NLP Domain;
	    nlp_domain -->|Text Classification| nlp_classification["Traditional NLP Techniques;
	    Naive Bayes;
	    SVM"];
	    
	    nlp_domain -->|Advanced Language Tasks| advanced_nlp{Advanced NLP};
	    
	    advanced_nlp -->|Large Language Models| LLM["Large Language Models (LLM)
	    Transformer-based;
	    Contextual understanding;
	    Generative capabilities"];
	    
	    advanced_nlp -->|Transfer Learning| TransferLearning["Transfer Learning
	    Leverage pre-trained models;
	    Reduce training time;
	    Effective with limited data"];
	    
	    %% Computer Vision;
	    vision_domain -->|Feature Extraction| CNN["Convolutional Neural Networks (CNN)
	    Specialized image processing;
	    Learns hierarchical features;
	    State-of-the-art computer vision"];
	    
	    vision_domain -->|Complex Visual Tasks| DNN["Deep Neural Networks (DNN)
	    Multiple hidden layers;
	    Complex pattern recognition;
	    Versatile architecture"];
	    
	    %% Unsupervised Learning;
	    unsupervised -->|Clustering| clustering["Clustering Algorithms
	    K-Means;
	    Hierarchical Clustering;
	    DBSCAN"];
	    
	    unsupervised -->|Dimensionality Reduction| dim_reduction["Dimensionality Reduction
	    PCA;
	    t-SNE;
	    UMAP"];
	    
	    %% Model Inference and Deployment;
	    KNN & NaiveBayes & SVM & RandomForest & DecisionTrees &  LLM & TransferLearning & CNN & DNN --> inference{Model Inference};
	    inference -->|Deployment Preparation| ModelDeployment["Model Deployment
	    Performance optimization
	    Real-world prediction
	    Scalable inference"];
	    
	    %% Styling
	    classDef decision fill:#f9e79f,stroke:#d35400;
	    classDef algorithm fill:#d4edda,stroke:#155724;
	    classDef deployment fill:#f2f3f4,stroke:#2c3e50;
	    class problem_type,supervised,unsupervised,nlp_domain,vision_domain,classification,regression,advanced_nlp,inference decision;
	    class KNN,NaiveBayes,SVM,RandomForest,DecisionTrees,LLM,TransferLearning,CNN,DNN algorithm;
	    class ModelDeployment deployment;
```

## Understanding Problem Domain Selection:

```mermaid
flowchart TD;
	    start[Start: Data Science Project] --> problem_type{Select Problem Domain};
        click start href "https://www.github.com" "This is a link" _blank
	    %% Problem Type Branching;
	    problem_type -->|Supervised Learning| supervised{Supervised Learning};
	    problem_type -->|Unsupervised Learning| unsupervised{Unsupervised Learning};
	    problem_type -->|Natural Language| nlp_domain{NLP Tasks};
	    problem_type -->|Computer Vision| vision_domain{Vision Tasks};
```

### 1. Analyze the problem:

When beginning a data science project, you need to clearly understand what you're trying to accomplish. Ask yourself:<br>
• What is the goal? Are you trying to predict something, classify items, find patterns, or understand relationships?<br>
• What type of data do you have? Text, images, numerical data, categorical data, time series, etc.<br>
• Do you have labeled data? (examples with known outcomes)
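
The questions above can be partly answered by inspecting the data itself. The following is a minimal sketch on a made-up toy dataset. The column names and the `describe_problem` helper are hypothetical, introduced only for illustration. It reports which columns are numeric and whether a label column is present.

```python
import pandas as pd

# Hypothetical toy dataset; "churned" plays the role of the label column.
df = pd.DataFrame(
    {
        "tenure_months": [3, 24, 12, 48],
        "plan": ["basic", "premium", "basic", "premium"],
        "monthly_fee": [20.0, 55.5, 20.0, 60.0],
        "churned": ["yes", "no", "yes", "no"],
    }
)


def describe_problem(data, label_column=None):
    """Summarize data types and whether labeled outcomes are available."""
    numeric = data.select_dtypes(include="number").columns.tolist()
    non_numeric = data.select_dtypes(exclude="number").columns.tolist()
    has_labels = label_column is not None and label_column in data.columns
    return {
        "numeric_columns": numeric,
        "non_numeric_columns": non_numeric,
        "has_labels": has_labels,
    }


summary = describe_problem(df, label_column="churned")
print(summary)
```

A summary like this does not decide the goal for you, but it makes the "what type of data" and "do you have labels" answers explicit before you move on to domain selection.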

### 2. Map to Machine Learning Domains:

**Supervised Learning**<br>
Choose this when you have labeled data and want to predict outcomes:
* Classification: When predicting categories (spam/not spam, fraud/legitimate)
* Regression: When predicting numerical values (house prices, temperature)
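
As a rough illustration of the classification/regression split, scikit-learn's `type_of_target` utility can inspect a target column and report what kind of values it holds. The example targets and the `supervised_task` helper below are hypothetical. Treat this as a quick sanity check under those assumptions, not a substitute for understanding the problem.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets: category labels versus continuous numeric values.
spam_labels = ["spam", "not spam", "spam", "not spam"]
house_prices = [250000.5, 310500.75, 189999.99]


def supervised_task(target):
    """Suggest a supervised task from the kind of values in the target."""
    kind = type_of_target(target)
    if kind in ("binary", "multiclass"):
        return "classification"
    if kind == "continuous":
        return "regression"
    # Other kinds (e.g. multilabel targets) need a closer look.
    return f"other ({kind})"


print(supervised_task(spam_labels))
print(supervised_task(house_prices))
```

One caveat: `type_of_target` reports whole-number targets as class labels, so a numeric outcome stored as integers (for example, prices rounded to whole dollars) would be suggested as classification. The final call should come from what the values mean, not only how they are stored.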

**Unsupervised Learning**<br>
Choose this when you have unlabeled data and want to discover patterns:
* For finding natural groupings in data
* For reducing dimensionality
* For anomaly detection
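
The first two uses above, grouping and dimensionality reduction, can be sketched with scikit-learn on synthetic data. The dataset comes from `make_blobs`, and the choice of three clusters and two components is an assumption made for illustration. With real data these would need to be explored.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic, unlabeled data: 150 points with 5 features (the true groups are discarded).
X, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)

# Clustering: assign each point to one of 3 groups without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 5 features down to 2 for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)
print(sorted(set(cluster_ids.tolist())))
```

The two-dimensional projection `X_2d` can then be passed to a scatter plot, colored by `cluster_ids`, to inspect the groupings visually.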

<b>Natural Language Processing (NLP)</b><br>
Choose this for text-based problems:
* Text classification
* Sentiment analysis
* Language understanding
* Text generation
* Document summarization
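
For the text classification and sentiment analysis cases, a traditional approach pairs a bag-of-words representation with Naive Bayes, as in the flowchart. The sketch below uses a tiny made-up corpus, so it only illustrates the mechanics. A real project would need far more data and a proper train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus with sentiment labels.
texts = [
    "I loved this film, wonderful acting",
    "great story and wonderful music",
    "terrible plot, I hated it",
    "awful acting and a boring story",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF turns each text into a numeric vector; Naive Bayes classifies the vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["a wonderful and great film"])
print(prediction)
```

Language understanding, text generation, and summarization usually call for the transformer-based models in the "Advanced NLP" branch of the flowchart and are outside the scope of this sketch.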

<b>Computer Vision</b><br>
Choose this for image or video-based problems:
* Image classification
* Object detection
* Image generation
* Video analysis
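
The mapping in this section can be summarized as a small decision function. This is a deliberately simplified sketch. The function name and the `data_type` labels are hypothetical, and real projects often combine domains (for example, supervised learning on text).

```python
def select_problem_domain(data_type, has_labels):
    """Map two answers from the analysis step to a problem domain.

    data_type: "text", "image", "video", or "tabular" (hypothetical labels).
    has_labels: True when examples with known outcomes are available.
    """
    if data_type == "text":
        return "Natural Language Processing"
    if data_type in ("image", "video"):
        return "Computer Vision"
    if has_labels:
        return "Supervised Learning"
    return "Unsupervised Learning"


print(select_problem_domain("tabular", has_labels=True))
print(select_problem_domain("image", has_labels=False))
```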

### 3. Example Process:

1. Problem: Predicting customer churn for a company<br>
2. Analysis:<br>
	• Goal: Predict which customers will leave (Yes/No outcome)<br>
	• Data: Historical customer information with <b>known outcomes</b><br>
	• Type: A classification problem with labeled data<br>
3. Domain Selection: Supervised Learning → Classification<br>
4. Next Step: Based on dataset complexity, you might choose:<br>
	• Simple dataset: KNN or Naive Bayes<br>
	• Complex relationships: Random Forest or SVM<br>
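
The "Next Step" above can be sketched by comparing the four candidate models with cross-validation. No churn dataset is assumed here. `make_classification` generates synthetic stand-in data, and the hyperparameters are illustrative defaults, so the scores say nothing about real churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for labeled churn data (1 = churned, 0 = stayed).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Distance- and margin-based models are scaled first; tree ensembles do not need it.
candidates = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Accuracy is used here only because it is the default scorer. Churn data is often imbalanced, in which case a metric such as recall or F1 may be a better basis for comparison.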