# Dataset analysis, model shortlisting, and model determination

In all of the past notebooks we have looked at many different models and many different datasets. However, one big question remains: When to pick which model?

That is of course not an easy question to answer. We have indeed seen a few notebooks where we tried to apply a specific machine learning algorithm and it would just fail. For example, with k-nearest neighbours we tried to classify books and failed to produce any accurate predictions for the genres. However, when we tried the same thing with Naive Bayes, it worked really well!

## Learning objectives

You will be able to:
* Convert a Jupyter notebook to a regular Python script for running large projects on a compute cluster

## What you will need

## Choices!

We have seen many models in this set of modules:
* linear regression
* random forest
* K-nearest neighbours
* Support vector machines
* Image inferencing
* Naïve Bayes
* Transfer learning
* Deep and Convolutional Neural Networks (DNN/CNN)
* Recurrent neural networks and long short-term memory
* Natural language processing

It is not usually clear which model would work best for your dataset. Even natural language processing, which tells you right in the name what it is for, is not a clear choice, because NLP is more of an umbrella term for a collection of models. For example, with Naïve Bayes we already did some natural language processing by classifying books into genres based on the book summaries. For the NLP notebooks we instead used an unsupervised model for classification.

## Cleaning your data

There is no easy flowchart that will help you pick the right model. However, the very first step is always to clean your data. This is an important step that many online tutorials will leave out because it's sometimes not considered part of machine learning, but it is! Without good input data, no model will be able to work well.

The next step is to normalize your data if you can. Many models perform better if the input data is between 0 and 1 or between -1 and 1. We saw this with k-nearest neighbours, for example, where one of the features had numbers in the thousands and the other in the tens. The bigger numbers made the algorithm pretty much disregard the feature with the smaller numbers.
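
To make the effect of scaling concrete, here is a minimal sketch in plain Python. The income and age values are made up for illustration, and `min_max_scale` and `nearest_to_first` are hypothetical helper names, not functions from an earlier notebook. It rescales each feature to the range 0 to 1 and checks which point is nearest to the first one before and after scaling.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0 everywhere.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

def nearest_to_first(points):
    """Index of the point closest to points[0], measured by Euclidean distance."""
    return min(range(1, len(points)), key=lambda i: math.dist(points[0], points[i]))

# Made-up data: income is in the thousands, age is in the tens.
incomes = [30000.0, 31000.0, 45000.0, 90000.0]
ages = [25.0, 60.0, 27.0, 40.0]

raw = list(zip(incomes, ages))
scaled = list(zip(min_max_scale(incomes), min_max_scale(ages)))

print(nearest_to_first(raw))     # 1: the 60-year-old, because only income matters
print(nearest_to_first(scaled))  # 2: the 27-year-old, now that age counts too
```

With the raw numbers, the 35-year age gap is invisible next to a 1000-dollar income gap. After scaling, both features contribute on the same footing.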

In practice, you will likely try out a few different models to see how they perform, as we did with the books where we first used k-nearest neighbours with poor results and then Naïve Bayes with much better results.

## Model shortlisting
Once you have clean data, however, there are a few things you can do to narrow down the choice of your model. The first question is what you want the output to be. Do you want to predict a continuous variable, for example age or income? If so, then you need a model that does regression, such as linear regression. If you are looking for yes/no answers, for example whether a patient does or does not have a disease, then you may want models like the random forest. If you want to divide your data into groups, then you need an algorithm that can classify, such as k-nearest neighbours or Naive Bayes. There is certainly some overlap between yes/no questions and classification, since a yes/no question just classifies your data into two groups.

Another thing you need to consider is the input data. Do you have labelled data or not? That determines if you can use supervised or unsupervised models.

Another important question is how many computational resources you have available. Models like k-nearest neighbours are very fast to train; however, they become slow to use when the dataset grows because they need to check every stored data point whenever you ask them to classify a new one.

Do you need explainability? Some models, like the various neural networks, are black boxes; others, like decision trees or support vector machines, will let you see to some extent why they classify the way they do.

Let's go over the pros and cons of each model that we have seen so far.

### Linear Regression

**Pros:**

1. **Interpretability**: The coefficients in a linear regression model can be easily interpreted as the change in the response variable per unit change in each predictor, while controlling for all other predictors.
2. **Simple to implement**: Linear regression is one of the most straightforward machine learning algorithms to implement, requiring only basic mathematical operations and no complex hyperparameter tuning.
3. **Fast training time**: The optimization process for linear regression models is typically very fast, making it suitable for large datasets or when working with limited computational resources.

**Cons:**

1. **Assumes linearity**: Linear regression assumes a linear relationship between the predictors and response variable, which may not always be true in real-world data.
2. **Sensitive to outliers**: Linear regression can be heavily influenced by outliers in the data, leading to poor model performance or even incorrect conclusions if not properly handled.
3. **Not suitable for categorical variables**: Linear regression expects numerical predictors and a numerical response. Categorical predictors (e.g., text labels) must first be encoded as numbers, and a categorical response (e.g., a yes/no flag) usually calls for a different technique such as logistic regression or decision trees.
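
The interpretability and the outlier sensitivity listed above can both be seen in a few lines. The following is a minimal sketch of ordinary least squares for a single predictor, written in plain Python with made-up data; `fit_line` is a hypothetical helper, and in practice you would more likely use a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# Same data, but the last observation is replaced by a single outlier.
ys_outlier = [3.0, 5.0, 7.0, 9.0, 41.0]
slope_out, intercept_out = fit_line(xs, ys_outlier)
print(slope_out, intercept_out)  # 8.0 -11.0
```

The slope reads directly as "y increases by 2 for every unit of x". One bad data point out of five is enough to change that slope from 2 to 8.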

### K-nearest Neighbours

**Pros:**

1. **Simple to implement**: KNN is a straightforward algorithm that, as the name implies, only looks at the nearest neighbours.
2. **Interpretable results**: KNN provides transparent results: you can look up the neighbours that decided a particular prediction and see which training examples drove it.
3. **Good for small datasets**: KNN is relatively fast and efficient on smaller datasets (e.g., fewer than 10,000 samples), where it does not require significant computational resources.

**Cons:**

1. **Sensitive to noise**: KNN can be heavily influenced by noisy or outlier data points in the training set, which may lead to poor performance on test sets.
2. **Prediction cost grows with the dataset**: KNN compares each new data point against the stored training points, so the cost of calculating distances and finding the closest neighbours grows with the size of the training set, making it less suitable for large datasets.
3. **No explicit model learned**: KNN doesn't learn an explicit model or representation of the data; instead, it relies on the distance metric to make predictions. This can limit its ability to generalize well in complex domains.
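
The points above are easier to see in code. Below is a minimal brute-force sketch in plain Python, with made-up points and a hypothetical `knn_predict` helper. There is no training step at all: each prediction computes one distance per stored training point, which is why prediction slows down as the dataset grows.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Every prediction scans the whole training set: one distance per stored point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    winner = Counter(nearest_labels).most_common(1)[0][0]
    return winner, nearest_labels

# Made-up two-feature data, assumed to be scaled to comparable ranges already.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction, neighbours = knn_predict(points, labels, query=(0.2, 0.2), k=3)
print(prediction, neighbours)  # a ['a', 'a', 'a']
```

Returning the neighbours alongside the prediction shows where the transparency comes from: you can always inspect which training examples decided the vote.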

### Support Vector Machines
**Pros:**

1. **Good performance on high-dimensional data**: SVMs can handle datasets with a large number of features.
2. **Robust to noise and outliers**: With soft margins, an SVM tolerates some misclassified or outlying training points instead of bending the decision boundary around them. SVMs can also use the kernel trick, which maps the input space into a higher-dimensional feature space where it is easier to find linear separators.
3. **Interpretable results**: The decision boundary of an SVM can occasionally be visualized, which can provide valuable insights about the relationships between features and classes. Whether the boundary can be visualized depends on the number of dimensions of the input data; you can plot maybe five dimensions if you add colours and animation.

**Cons:**

1. **Computational complexity**: Training an SVM can be computationally expensive, especially for large datasets. This is because training time scales between quadratically and cubically with the number of data points. That is, if you double the number of data points in your input data, training will take between four and eight times as long.
2. **Requires careful tuning of hyperparameters**: The performance of an SVM depends heavily on the choice of kernel function, regularization parameter, and other hyperparameters. Finding the right combination of these parameters can be time-consuming and requires trial and error or just experience.
3. **Not suitable for large datasets with many classes**: While SVMs are good at handling high-dimensional data, they are not well-suited to problems with a very large number of classes because, at its core, an SVM only does binary classification. It is possible to extend SVMs to more classes, but this is usually done by dividing the problem into multiple binary classification tasks, and thus the number of classifiers to train grows with the number of classes.
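
The two scaling arguments above are simple arithmetic, sketched below. The function names are hypothetical, and the two multiclass schemes counted here (one-vs-rest and one-vs-one) are the common ways of splitting a problem into binary tasks; they are named here as an assumption about which decomposition is used.

```python
def training_time_factor(size_ratio, exponent):
    """Slow-down factor if training cost grows as n**exponent and n grows by size_ratio."""
    return size_ratio ** exponent

def binary_classifiers_needed(n_classes):
    """Number of binary SVMs for two common multiclass decompositions."""
    one_vs_rest = n_classes                        # one classifier per class
    one_vs_one = n_classes * (n_classes - 1) // 2  # one classifier per pair of classes
    return one_vs_rest, one_vs_one

# Doubling the data with quadratic and with cubic scaling.
print(training_time_factor(2, 2), training_time_factor(2, 3))  # 4 8

# Ten classes, then one hundred classes.
print(binary_classifiers_needed(10))   # (10, 45)
print(binary_classifiers_needed(100))  # (100, 4950)
```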

### Naive Bayes

**Pros:**

1. **Simple to implement**: Naive Bayes is a straightforward algorithm that can be easily implemented, even for those without extensive machine learning experience.
2. **Fast training time**: Naive Bayes models require minimal computational resources during the training process, making them suitable for large datasets or real-time applications where speed matters.
3. **Good performance on binary classification problems**: Naive Bayes is particularly effective when dealing with binary classification tasks (e.g., spam vs. not spam emails).

**Cons:**

1. **Assumes independence of features**: The "naive" part of the algorithm refers to its assumption that all features are independent, which is often unrealistic in real-world scenarios where feature correlations exist.
2. **Sensitive to class imbalance**: Naive Bayes can be affected by class imbalance issues (e.g., when one class has significantly more instances than others), leading to biased predictions and poor performance on minority classes.
3. **Not suitable for continuous-valued features**: Naive Bayes is primarily designed for categorical data and may not perform well when dealing with continuous-valued features that require more sophisticated handling.
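
To show where the independence assumption enters, here is a minimal word-count Naive Bayes classifier in plain Python. It is a simplified sketch with add-one smoothing and made-up one-line "summaries", not the implementation used in the earlier book notebook; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count documents and words per class; that is all the 'training' there is."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, text):
    """Return the class with the highest log-probability for `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        # Log prior: how common the class is in the training data.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # The "naive" step: each word contributes independently of all others.
            # Add-one smoothing avoids log(0) for words unseen in this class.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data; real book summaries would be much longer.
summaries = [
    "a detective investigates a murder",
    "the detective finds the killer",
    "a spaceship travels to a distant planet",
    "robots and aliens fight on the planet",
]
genres = ["mystery", "mystery", "scifi", "scifi"]

model = train_naive_bayes(summaries, genres)
print(predict_naive_bayes(model, "a murder and a detective"))  # mystery
print(predict_naive_bayes(model, "aliens land on a planet"))   # scifi
```

Training is a single pass of counting, which is why it is so fast. The per-word sum of log-probabilities is exactly the assumption that words do not influence each other.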

### Deep and Convolutional Neural Networks

**Pros:**

1. **State-of-the-Art Performance**: DNNs and CNNs have been shown to achieve state-of-the-art performance on many benchmark datasets, often surpassing traditional machine learning algorithms like decision trees or support vector machines.
2. **Ability to Learn Complex Patterns**: The hierarchical structure of CNNs and DNNs allows them to learn complex patterns and relationships in data that may not be easily captured by shallower models. This is particularly useful for tasks involving high-dimensional input spaces or noisy data.
3. **Flexibility and Customizability**: CNNs and DNNs can be designed with various architectures, activation functions, and optimization algorithms, making them highly flexible and customizable to specific problem domains. CNNs in particular excel at image recognition tasks.

**Cons:**

1. **High Computational Requirements**: Training large-scale DNNs and CNNs requires significant computational resources (e.g., GPUs) and memory, which can make it challenging for researchers or practitioners without access to such infrastructure.
2. **Overfitting Risk**: The complexity of DNNs makes them prone to overfitting, where the model becomes too specialized in the training data and fails to generalize well to new examples.
3. **Interpretability Challenges**: Due to their complex internal representations, it can be difficult to understand why a DNN or CNN is making certain predictions or decisions.

### Image Inferencing

Image inferencing almost always involves convolutional networks. If you are looking to do image recognition, a CNN is likely what you want. It is, however, a computationally expensive technique, and there are other techniques. For example, for just detecting markers a CNN might be overkill and you could rely on libraries like [OpenCV](https://docs.opencv.org/4.x/index.html) which take a more traditional approach using impressive image manipulation techniques.

Another way to reduce the computational cost of a CNN is to scale the input image down to as small a size as you can get away with. Reducing both dimensions of an image by a factor of two means you have four times fewer pixels as input and, depending on the algorithm, your computation time might drop by much more than a factor of four.
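
The pixel arithmetic can be checked in a few lines. The image size below is an arbitrary example and `pixel_count` is a hypothetical helper; the sketch only counts input pixels and says nothing about how a particular network's run time responds.

```python
def pixel_count(width, height, scale_factor=1):
    """Number of pixels after dividing both dimensions by `scale_factor`."""
    return (width // scale_factor) * (height // scale_factor)

original = pixel_count(1024, 768)
halved = pixel_count(1024, 768, scale_factor=2)
quartered = pixel_count(1024, 768, scale_factor=4)

print(original, halved, quartered)               # 786432 196608 49152
print(original / halved, original / quartered)   # 4.0 16.0
```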

### Transfer Learning

**Pros:**

1. **Efficient Training**: Transfer learning allows you to leverage pre-trained models on large datasets, saving time and computational resources during training.
2. **Improved Performance**: By transferring knowledge from a source task to a target task, transfer learning can often lead to better performance, especially when the target task has limited data.
3. **Domain Adaptation**: Transfer learning is useful for adapting models trained on one domain to perform well on a different but related domain, reducing the need for collecting labeled data in the target domain.

**Cons:**

1. **Overfitting**: Transfer learning can sometimes lead to overfitting, especially if the source and target domains are too dissimilar or if the transferred knowledge is not properly adapted to the target task.
2. **Negative Transfer**: In some cases, transferring knowledge from a source task can actually hurt performance on the target task if the source task is too different or irrelevant.
3. **Limited Flexibility**: Pre-trained models may not always be suitable for all target tasks, limiting the flexibility of transfer learning in certain scenarios.


### Natural Language Processing

Example 1 - Image classification: You want to build a model that can classify new images as either cats or dogs. In this case, image inferencing might be suitable for training a convolutional neural network (CNN) on the data. CNNs are particularly good at processing visual data.

Example 2 - Stock market prediction: You want to predict stock prices based on historical financial data. In this case, linear regression might be a suitable algorithm as it can model continuous variables and make predictions for new inputs. Additionally, preprocessing steps like scaling features (e.g., price changes) will help improve the model's performance.

Example 4 - Sentiment analysis: You have text data from social media posts that you want to analyze for positive or negative sentiment. In this case, natural language processing techniques like recurrent neural networks (RNNs) and long short-term memory (LSTM) models would be suitable due to their ability to process sequential data effectively.

## Running large models

When the size of your dataset or the complexity of your model grows, you will run into the limitations of your computer. You could of course buy bigger hard drives, more memory, and more powerful GPUs, but a better way to go about this is to use a dedicated compute cluster. This can either be a cluster from the Digital Research Alliance of Canada, your group's own server, or any of the commercial cloud offerings.

While many of these offerings will come with JupyterHub interfaces of their own, it is not a very efficient way of doing things because Jupyter notebooks are not designed for efficiency. For example, they don't end when your code finishes running. For dedicated commercial clouds that means you will keep paying. For compute clusters, it will mean you keep resources tied up such that other users cannot use them.

The solution to this is to use a standalone script that you start from the command-line. It will run from start to finish and save the results to disk. Afterwards, you will be able to load the saved model or results and analyze them on your own machine, since those tasks typically require far fewer resources than training a model. This does mean you need to make an adjustment to your workflow. Before, you likely had a single notebook with both the training and the analysis in it, which you will now have to split in two. There should be one notebook for training the model and writing the relevant files to disk and another notebook that loads those files and uses the trained model.
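
The split into a training part and an analysis part could look roughly like the sketch below. Everything in it is a stand-in: the "model" is a single fitted slope, the file name `results.json` and the two function names are hypothetical, and a real project would use the save and load functions of its machine learning framework instead of JSON.

```python
import json
import tempfile
from pathlib import Path

def train_and_save(output_path):
    """Stand-in for the training script: fit something, then write the results to disk."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]
    # Placeholder "model": the least-squares slope of a line through the origin.
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    results = {"slope": slope, "n_training_points": len(xs)}
    Path(output_path).write_text(json.dumps(results))

def load_and_analyse(input_path, new_x):
    """Stand-in for the analysis notebook: load the saved results and use them."""
    results = json.loads(Path(input_path).read_text())
    return results["slope"] * new_x

# A temporary folder keeps this example tidy; on a cluster you would write to
# your project or scratch space and copy the file to your own machine afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "results.json"
    train_and_save(path)                              # the expensive part
    prediction = load_and_analyse(path, new_x=10.0)   # the cheap part

print(prediction)  # 20.0
```

The only thing the two halves share is the file on disk, so the analysis can be rerun as often as you like without repeating the training.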

There are a few ways of making a Jupyter notebook into a standalone entity. One of the easiest ways is to simply export from Jupyter by going to `File` → `Save and export notebook as...` → `Executable Script`.
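
To give an idea of what such an export involves, here is a simplified sketch in plain Python. It is not how Jupyter itself performs the export; it only relies on the fact that a notebook file is JSON with a list of cells, and `notebook_to_script` is a hypothetical helper. It keeps the code cells and drops everything else.

```python
import json

def notebook_to_script(notebook_json):
    """Join the code cells of a notebook (given as JSON text) into one Python script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip())
    return "\n\n".join(chunks) + "\n"

# A tiny hand-written notebook with one markdown cell and two code cells.
example_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Training"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["x = 2\n", "y = x * 21"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": "print(y)"},
    ],
})

script = notebook_to_script(example_notebook)
print(script)
```

Whichever export route you use, check the resulting script for notebook-only syntax such as lines starting with `%` or `!`, since those are not plain Python.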

The resulting Python script is then something you can run from the command-line:
```
python nameofscript.py
```

This makes it suitable for running on remote servers. For example, for running on one of the Alliance clusters with a nice big GPU, you would create a submission script that looks roughly like this:
```
#!/bin/bash
#SBATCH --account=def-someuser    # Account name of your allocation. Usually def-[name of supervisor]
#SBATCH --gpus-per-node=1         # Number of GPUs
#SBATCH --mem=4000M               # Memory per node
#SBATCH --time=0-03:00            # Maximum run time, in days - hours : minutes

module load python                # Load the latest version of Python
source ~/tf/bin/activate          # Load the Tensorflow or Torch environment
python nameofscript.py            # Run your script
```
See https://docs.alliancecan.ca/wiki/Using_GPUs_with_Slurm for GPU job submission and https://docs.alliancecan.ca/wiki/TensorFlow and https://docs.alliancecan.ca/wiki/PyTorch for loading TensorFlow or PyTorch.