# Classification and Regression

A brief introduction to classification and regression tasks in Machine Learning.

Learning objectives:

1. Distinguish classification from regression
2. Define a loss/cost function 
3. Understand how to train a logistic/softmax regression for binary/multiclass classification
4. Know how to benchmark a classifier

## Classification:

In machine learning, classification is a supervised learning task in which the model learns to assign predefined labels (also called categories or classes) to instances, that is, the individual data points or examples in a dataset. There are two types of classification: binary classification, which sorts instances into two classes, and multiclass classification, which sorts them into three or more.


Here are some practical applications of classification in Environmental Sciences!

> 1. Wildlife Identification: Classification techniques can be used to identify animal species from images or audio recordings, supporting wildlife monitoring projects.
> 2. Land Cover Classification: Satellite imagery can be classified into various land cover types, aiding in monitoring land use changes over time.
> 3. Invasive Species Detection: Developing models that classify invasive species in images, helping conservationists identify and manage ecological threats.
> 4. Water Quality Assessment: Using classification algorithms to determine the quality of water bodies based on factors like chemical concentrations and biological indicators.

**Binary Classification**: A type of classification task where there are only two possible classes, labels or categories (e.g., spam vs. non-spam emails).

### Evaluating Classification Models:

**Cross-Validation:** The training data is split into folds using stratified sampling, so that each fold contains a representative ratio of each class. The model is trained on all folds but one, predictions are made on the held-out test fold, and the number of correct predictions is counted; the output for each fold is the ratio of correct predictions. Accuracy, the ratio of correct predictions to the total number of instances, is not the preferred performance measure when dealing with skewed datasets (i.e., when some classes are much more frequent than others).
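
To see why accuracy can mislead on skewed data, here is a minimal sketch. The dataset (90 negatives, 10 positives) and the "always predict the majority class" rule are made up for illustration and do not come from a real model.

```python
# Hypothetical skewed dataset: 90 negative instances (0) and 10 positive ones (1).
y_true = [0] * 90 + [1] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of instances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


print(accuracy(y_true, y_pred))  # 0.9
```

The score looks good even though not a single positive instance is detected, which is why the metrics described next are usually preferred in this situation.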


**Confusion Matrix:**
Summarises the model’s performance by showing true positives, true negatives, false positives, and false negatives. Each row represents an actual class and each column represents a predicted class, so each cell counts how often instances of a given actual class received a given prediction. A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal.

![confusion.jpeg](https://miro.medium.com/v2/resize:fit:1218/format:webp/1*jMs1RmSwnYgR9CsBw-z1dw.png)

[Medium : What is a confusion matrix?](https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5)
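
A confusion matrix can be built by simple counting. The sketch below assumes classes are encoded as integers starting at 0; the label lists are invented for the example.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions: rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for a binary problem (0 = negative, 1 = positive).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=2)
tn, fp = cm[0]  # first row: actual negatives
fn, tp = cm[1]  # second row: actual positives
print(cm)  # [[4, 1], [1, 4]]
```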

**Precision and Recall:**

Precision is a more concise metric: it is the accuracy of the positive predictions, precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives. If there are no false positives, precision is 100%. It is usually combined with another metric called recall, recall = TP / (TP + FN), where FN is the number of false negatives. These two metrics can be combined into a single metric called the F1 score, the harmonic mean of precision and recall, which makes it simple to compare classifiers.
The classifier will only get a high F1 score if both recall and precision are high, but a trade-off exists between the two: increasing one usually comes at the expense of the other.
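
The three metrics can be computed directly from the counts. In this sketch the counts are invented, and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from true/false positive and false negative counts."""
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 3))
```

Here precision is high (0.8) but recall is only 0.5, and the harmonic mean pulls the F1 score towards the lower of the two.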


**ROC Curve:**

It is a plot of the recall (also called the true positive rate) versus the false positive rate (FPR). A trade-off also exists between the two: the higher the recall, the more false positives the classifier produces. Classifiers are compared by measuring the area under the curve (AUC); a perfect classifier has an AUC of 1.
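
As a rough sketch of how such a curve can be obtained, the code below sweeps the decision threshold over a list of classifier scores and integrates the resulting curve with the trapezoidal rule. The function names and the toy scores are illustrative, and the code assumes both classes are present in `y_true`.

```python
def roc_curve(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the decision threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit instances from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance of a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve, computed with the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
    return area


# Toy example: one negative instance is scored above one positive instance.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve(y_true, scores)
print(auc(fpr, tpr))  # 0.75
```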

### Classification Algorithm:
Some algorithms are strictly binary classifiers (e.g., SGD classifiers, Support Vector Machine classifiers), while others (e.g., Logistic Regression, Random Forest, and naïve Bayes classifiers) handle both binary and multiclass classification natively.

To handle multiclass problems, strictly binary classifiers can be combined using two different strategies: the one-versus-the-rest (OvR) strategy and the one-versus-one (OvO) strategy.

- One-Versus-the-Rest (OvR) Strategy: Train one binary classifier for each class. When classifying, obtain scores from all classifiers and select the class with the highest score.
- One-Versus-One (OvO) Strategy: Train a binary classifier for every pair of classes. This requires N × (N – 1) / 2 classifiers for N classes (for example, 45 classifiers for 10 classes). To classify an instance, run it through all the classifiers and select the class that wins the most duels. OvO is advantageous because each classifier only needs to distinguish two classes.
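
The two strategies can be sketched as follows. The binary classifiers themselves are not implemented here: `scores_per_class` and `duel_winner` are hypothetical stand-ins for their outputs.

```python
from itertools import combinations


def n_ovo_classifiers(n_classes):
    """Number of pairwise classifiers needed by OvO: N * (N - 1) / 2."""
    return n_classes * (n_classes - 1) // 2


def ovr_predict(scores_per_class):
    """OvR: scores_per_class[k] is the score of the 'class k vs. the rest' classifier."""
    return max(range(len(scores_per_class)), key=lambda k: scores_per_class[k])


def ovo_predict(n_classes, duel_winner):
    """OvO: duel_winner(a, b) returns the class (a or b) chosen by the a-vs-b classifier."""
    votes = [0] * n_classes
    for a, b in combinations(range(n_classes), 2):
        votes[duel_winner(a, b)] += 1
    # The class that wins the most duels is the prediction.
    return max(range(n_classes), key=lambda k: votes[k])


print(n_ovo_classifiers(10))          # 45
print(ovr_predict([0.1, 0.7, 0.2]))   # 1
print(ovo_predict(3, max))            # 2 (here every duel is won by the higher class index)
```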

### Multiclass classification:

Multiclass classification, also known as multinomial classification, refers to a classification problem where instances are categorized into three or more distinct classes.
Example: Classifying images of animals into categories like "dog," "cat," "elephant," and "lion."

> Multilabel Classification:
Multilabel classification deals with instances that can belong to multiple classes simultaneously. In other words, an instance can have multiple labels associated with it. 
Example: Tagging a news article with multiple categories like "politics," "economy," and "technology" to capture its diverse content.

>Multioutput Classification (or Multioutput Regression):
Multioutput classification (or regression) involves predicting multiple output variables simultaneously for each instance. Each output variable can have multiple possible values or classes. 
Example: Predicting both the color and size of a piece of fruit, where color could be "red," "green," or "yellow," and size could be "small," "medium," or "large."


### Training a model:

**Cost Function :**
When using a model, you need to define its parameters. To estimate the best values for these parameters, you need to specify a performance measure: either a utility function that measures how good the model is, or a cost function that measures how bad it is. A cost function quantifies the discrepancy between the values predicted by the model and the true values from the training dataset. The objective of a machine learning algorithm is to minimize this cost function, which means bringing the model’s predictions as close as possible to the actual target values.
 
The choice of a cost function depends on the nature of the problem—whether it is a classification, regression, or other type of task—and the desired properties of the model’s predictions. Different algorithms and tasks require different types of cost functions.

**Types of Cost Functions:**

> Mean Squared Error (MSE): Used in regression tasks, it calculates the average squared difference between predicted and actual values. It penalizes larger errors more heavily.

> Log Loss: Commonly used in classification tasks, especially in logistic regression and neural networks. It measures the dissimilarity between predicted probabilities and actual binary class labels.

> Absolute Error (L1 Loss): Similar to MSE, but it computes the absolute difference between predicted and actual values. It is less sensitive to outliers compared to MSE.
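
The three cost functions above can be written in a few lines. This is a minimal sketch with made-up values; the clipping constant `eps` in `log_loss` is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error (L1 loss): average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Log loss for binary labels (0 or 1) and predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Two perfect predictions and one large error of 3.0.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]
print(mse(y_true, y_pred))  # 3.0
print(mae(y_true, y_pred))  # 1.0

# Confident and correct probabilities give a small log loss.
print(log_loss([1, 0], [0.9, 0.1]))
```

The single large error contributes 9.0 to the sum of squares but only 3.0 to the sum of absolute values, which is what makes MSE more sensitive to outliers.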
 


**Gradient Descent:**
A generic optimization technique that iteratively adjusts the model parameters to minimize the cost function and ultimately determine the optimal model parameters. It relies on the partial derivatives of the cost function with respect to the model parameters (the gradient) to move those parameters, step by step, in the direction that reduces the cost. An important parameter of Gradient Descent is the size of the steps, known as the learning rate hyperparameter: it affects the number of iterations needed and therefore the speed of the algorithm.

> A hyperparameter is a parameter of the learning algorithm, not of the model: it is set before training, is not affected by the learning process, and remains constant during training.

![gradient.png](https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/06/12190659/4-2.png)

[An Easy Guide to Gradient Descent in Machine Learning, Great Learning.](https://www.mygreatlearning.com/blog/gradient-descent/)
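
To illustrate the effect of the learning rate, here is a minimal sketch of Gradient Descent on a one-parameter toy cost J(θ) = (θ − 3)², whose gradient is 2(θ − 3). The cost function, the stopping tolerance, and the learning rates are all chosen for illustration.

```python
def gradient_descent(grad, theta0, learning_rate, tol=1e-8, max_iter=10_000):
    """Repeat theta <- theta - learning_rate * grad(theta) until the step is tiny."""
    theta = theta0
    for n_iter in range(1, max_iter + 1):
        step = learning_rate * grad(theta)
        theta -= step
        if abs(step) < tol:
            return theta, n_iter
    return theta, max_iter


def grad_toy_cost(theta):
    """Gradient of the toy cost J(theta) = (theta - 3) ** 2."""
    return 2 * (theta - 3)


results = {}
for lr in (0.01, 0.1, 0.5):
    theta, n_iter = gradient_descent(grad_toy_cost, theta0=0.0, learning_rate=lr)
    results[lr] = (theta, n_iter)
    print(f"learning rate {lr}: theta = {theta:.6f} after {n_iter} iterations")
```

On this toy problem, a smaller learning rate needs many more iterations to reach the minimum at θ = 3. A learning rate that is too large (above 1.0 for this particular cost) would make the updates overshoot and diverge.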

The optimization techniques using Gradient Descent include the following approaches:

> Batch Gradient Descent: Updates model parameters using the entire training dataset in each iteration, which makes it very slow when the training set is large.

> Stochastic Gradient Descent (SGD): Updates model parameters based on a single random training instance at each step. Each step is much faster than in Batch Gradient Descent, but the updates are noisy, so the algorithm may end up close to the minimum without settling exactly on it.

> Mini-batch Gradient Descent: A compromise between Batch Gradient Descent and SGD, updating parameters using a small random batch of instances at each step.
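
The three variants differ only in how many instances are used per update, so a single sketch with a `batch_size` argument can cover them. The example fits a line ŷ = w·x + b with the MSE cost on a tiny noise-free dataset; the data, learning rate, and number of epochs are illustrative choices.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.02, n_epochs=2000, seed=0):
    """Fit y ≈ w * x + b by Gradient Descent on the MSE, using batches of `batch_size`."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the instances in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE restricted to the current batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, without noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_batch, b_batch = minibatch_gd(xs, ys, batch_size=len(xs))  # Batch Gradient Descent
w_sgd, b_sgd = minibatch_gd(xs, ys, batch_size=1)            # Stochastic Gradient Descent
w_mini, b_mini = minibatch_gd(xs, ys, batch_size=2)          # Mini-batch Gradient Descent
```

Because this toy dataset has no noise, all three variants approach w = 2 and b = 1. On real, noisy data the stochastic and mini-batch versions would keep fluctuating around the minimum unless the learning rate is gradually reduced.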

**Early Stopping**: A regularisation method employed in iterative learning algorithms (e.g., Gradient Descent). It involves stopping the training procedure as soon as the validation error reaches its minimum, making it a valuable tool for mitigating overfitting.
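
In practice the minimum is only recognised after the validation error has started to rise again, so a common approach is to wait a few epochs before giving up. The sketch below assumes such a `patience` parameter (not mentioned in the text above) and uses an invented list of validation errors in place of a real training loop.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error), stopping after `patience` epochs without improvement."""
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            # New minimum: in a real loop, the model parameters would be saved here.
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch, best_error


# Made-up validation errors: they decrease, then rise as the model starts to overfit.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.44, 0.47, 0.52]
print(early_stopping_epoch(val_errors))  # (3, 0.4)
```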

**Training a Binary Classifier: Logistic Regression**

Logistic Regression is a popular model for binary classification: it estimates the probability that an instance belongs to the positive class. Training this model means finding the parameters that yield high estimated probabilities for positive instances and low estimated probabilities for negative instances. This is achieved by minimizing the log loss cost function: the partial derivatives of the log loss with respect to the parameters are computed, and a Gradient Descent algorithm uses them to iteratively optimize the parameters.
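
A minimal sketch of this training procedure for a single input feature is shown below. The dataset, learning rate, and number of iterations are illustrative assumptions, and a real implementation would typically use a library rather than this hand-written loop.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by Batch Gradient Descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Partial derivatives of the log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up 1-D data: larger x values tend to be positive, with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = train_logistic_regression(xs, ys)
p_low = sigmoid(w * 0.5 + b)   # estimated probability for a small x
p_high = sigmoid(w * 4.0 + b)  # estimated probability for a large x
print(round(p_low, 3), round(p_high, 3))
```

An instance is then assigned to the positive class when its estimated probability is at least 0.5, which is the usual default threshold.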


**Training a Multiclass Classifier:**

Softmax Regression, also called Multinomial Logistic Regression, is a generalization of Logistic Regression that can handle multiple classes directly. It uses the same principle as Logistic Regression: for a given instance, it computes a score for each class and turns these scores into estimated probabilities using the softmax function. It predicts only one class at a time, so it is strictly multiclass, not multioutput. Training this model relies on the cross entropy cost function and a Gradient Descent algorithm.
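
The softmax function and the cross entropy cost for one instance can be sketched as follows. The class scores are invented; subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result, and `eps` is an assumption added to avoid the logarithm of zero.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class, eps=1e-15):
    """Cross entropy for one instance: -log of the probability given to the true class."""
    return -math.log(max(probs[true_class], eps))


# Made-up scores for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=lambda k: probs[k])

print([round(p, 3) for p in probs])
print(predicted_class)  # 0
```

The cost is small when the true class receives a high estimated probability and grows quickly when it receives a low one.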



**Error Analysis**

Error analysis, for both binary and multiclass classification models, involves examining misclassified instances to uncover insights into the model's weaknesses and potential data-related issues. This process often begins with a visualisation of the Confusion Matrix, which shows which classes are most often confused with one another and how these confusion patterns affect the model's overall performance.

### Other Classification Methods:

> Support Vector Machines (SVM): Effective for binary classification, and for multiclass classification when combined with the OvR or OvO strategies.

> Decision Trees: Tree-like structures used for classification, providing interpretable decision rules.

> Random Forests: Ensemble methods that combine multiple decision trees for improved performance.

## Regression

In machine learning, regression is a supervised learning task where the model predicts a continuous numerical (real-valued) output based on one or more input features, by learning the relationship between the input features and the target variable.

Here are some practical applications of regression in Environmental Sciences!

> 1. Climate Modelling: Models that predict climate variables such as temperature, precipitation, and sea-level rise, taking into account factors such as gas concentrations and solar radiation, to make long-term climate predictions.
> 2. Air Quality Prediction: Models that use meteorological and emissions data to forecast air quality levels.



### Linear Regression:
Linear regression is a supervised machine learning algorithm used for predicting a continuous numerical output based on one or more input features. It assumes a linear relationship between the inputs and the target variable. The goal of linear regression is to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the difference between the predicted values and the actual target values. The most common cost function used in linear regression is the Mean Squared Error (MSE).
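
For a single input feature, the line that minimizes the MSE has a well-known closed-form solution (ordinary least squares), sketched below. The data points are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 1.99 0.05
```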

### Multiple Linear Regression:
Multiple linear regression is an extension of linear regression that deals with multiple input features. Instead of just one input, there are multiple independent variables influencing the target variable. The algorithm estimates the coefficients for each feature, determining their individual impact on the target variable while considering their interrelationships.

### Other Regression Methods:

> Polynomial Regression: This type of regression extends linear regression to capture nonlinear relationships by introducing polynomial terms of the input features. It fits a curve to the data instead of a straight line.

> Ridge Regression (L2 Regularization): Ridge regression adds a regularization term to the linear regression cost function. It helps prevent overfitting by penalizing large coefficient values, thus promoting simpler models.

>  Lasso Regression (L1 Regularization): Similar to ridge regression, lasso regression also adds a regularization term. However, it uses the absolute values of coefficients, often resulting in some coefficients being exactly zero. This leads to feature selection.

>  Elastic Net Regression: Elastic Net combines L1 and L2 regularization to balance the strengths of both. It can handle situations where there are correlated features.

>  Support Vector Regression (SVR): SVR applies the principles of support vector machines to regression problems. It aims to fit a hyperplane that captures as many instances within a specified margin as possible.

>  Decision Tree Regression: Similar to classification decision trees, decision tree regression predicts a continuous target value by partitioning the feature space into regions and assigning the average target value of instances within each region.

>  Random Forest Regression: An ensemble method combining multiple decision tree regressors. It improves predictive accuracy and reduces overfitting by averaging the predictions of individual trees.

>  Gradient Boosting Regression: A boosting technique that builds an additive model in a forward stage-wise manner. It combines the predictions of weak learners (often decision trees) to create a strong predictive model.
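
The regularization terms used by ridge, lasso, and elastic net regression in the list above can be sketched as penalties added to the cost function. The names `alpha` (regularization strength) and `l1_ratio` (mix between L1 and L2) are assumptions made for this example, and scaling conventions (for instance an extra factor of 1/2 on the L2 term) vary between textbooks and libraries.

```python
def ridge_penalty(weights, alpha):
    """L2 penalty: alpha times the sum of squared coefficients."""
    return alpha * sum(w ** 2 for w in weights)


def lasso_penalty(weights, alpha):
    """L1 penalty: alpha times the sum of absolute coefficient values."""
    return alpha * sum(abs(w) for w in weights)


def elastic_net_penalty(weights, alpha, l1_ratio):
    """Weighted mix of the L1 and L2 penalties (l1_ratio = 1 is lasso, 0 is ridge)."""
    return (l1_ratio * lasso_penalty(weights, alpha)
            + (1 - l1_ratio) * ridge_penalty(weights, alpha))


# Hypothetical coefficients of a linear model (the intercept is usually not penalized).
weights = [3.0, -4.0, 0.0]
print(ridge_penalty(weights, alpha=0.5))                      # 12.5
print(lasso_penalty(weights, alpha=0.5))                      # 3.5
print(elastic_net_penalty(weights, alpha=0.5, l1_ratio=0.5))  # 8.0
```

During training, the chosen penalty is added to the MSE, so large coefficients increase the cost and the optimizer is pushed towards smaller ones.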


Each regression method has its own strengths, weaknesses, and applicability to different types of data and problem domains. The choice of which method to use depends on the nature of the data, the problem's requirements, and the desired level of interpretability and predictive accuracy.