# Module 1: Introduction to Scikit-Learn

## Part 18: Multiclass and multioutput algorithms

Scikit-learn provides several strategies for tackling multi-class and multi-output problems in machine learning. This part covers the concepts behind those strategies and walks through a worked example of each.

### 18.1 Multi-Class Classification

Multi-class classification refers to the task of classifying data points into more than two distinct classes or categories. Several approaches are used for multi-class classification in scikit-learn:

- One-vs-Rest (OvR) or One-vs-All (OvA)<br> In OvR, one binary classifier is trained per class, treating that class as positive and all other classes as negative. During prediction, each classifier produces a decision score, and the class with the highest score is chosen as the prediction. For K classes this requires only K classifiers, and each one can be inspected on its own, which makes the approach simple and interpretable.

- Multinomial (Softmax) Logistic Regression<br> This is a generalized logistic regression for multi-class classification. It models the probabilities of each class directly and uses a softmax function to convert scores into class probabilities. This approach is well-suited for problems where classes are mutually exclusive.

- Support Vector Machines (SVM)<br> SVMs are binary classifiers by construction, so multi-class problems are handled with strategies such as one-vs-one (OvO) and one-vs-rest (OvR). In OvO, a binary classifier is trained for each pair of classes, while in OvR, a binary classifier is trained for each class against the rest. In scikit-learn, SVC always trains with OvO internally, whereas LinearSVC uses OvR by default.
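The difference between OvR and OvO is easiest to see by counting the binary classifiers each strategy trains. The following is a minimal sketch, assuming a synthetic four-class dataset from make_blobs (not part of the original examples) and the generic wrappers in sklearn.multiclass. With K classes, OvR fits K models and OvO fits K(K-1)/2.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with 4 classes (an illustrative assumption, not the Iris data).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

base = LogisticRegression(max_iter=1000)

# Each wrapper clones the base estimator once per binary sub-problem.
ovr = OneVsRestClassifier(base).fit(X, y)
ovo = OneVsOneClassifier(base).fit(X, y)

n_ovr = len(ovr.estimators_)  # one model per class: K
n_ovo = len(ovo.estimators_)  # one model per pair of classes: K * (K - 1) / 2

print(f"OvR classifiers: {n_ovr}, OvO classifiers: {n_ovo}")
```

OvO trains more models, but each one only sees the samples of two classes, which can be an advantage for estimators that scale poorly with the number of samples.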

#### Multi-Class Classification One-vs-Rest (OvR) or One-vs-All (OvA) Example

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = LogisticRegression(solver='liblinear', multi_class='ovr')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n", classification_rep)

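In this example, we use the Iris dataset, a well-known multi-class classification problem with three classes. We apply the OvR strategy by creating a Logistic Regression classifier with the multi_class='ovr' parameter and the 'liblinear' solver. After training the model on the standardized training data, we make predictions on the test data and evaluate its performance using accuracy and a classification report. Note that the multi_class parameter has been deprecated since scikit-learn 1.5, so recent versions emit a deprecation warning for this call and the parameter is scheduled for removal; wrapping the estimator in sklearn.multiclass.OneVsRestClassifier is the recommended way to get the same OvR behavior.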

This code demonstrates a straightforward use of the OvR strategy for multi-class classification in scikit-learn. The classifier builds multiple binary classifiers, one for each class, to handle the multi-class problem effectively.
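The prediction rule described above (one score per class, highest score wins) can be checked directly. This is a small sketch, assuming the OneVsRestClassifier wrapper from sklearn.multiclass rather than the multi_class argument, and fitting on the full Iris data purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression per class, fitted independently.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X, y)

# One column of decision scores per class.
scores = ovr.decision_function(X)

# Reproduce predict() by hand: pick the class whose classifier scores highest.
manual_pred = ovr.classes_[np.argmax(scores, axis=1)]
matches = np.array_equal(manual_pred, ovr.predict(X))

print("Score matrix shape:", scores.shape)
print("Manual argmax equals predict():", matches)
```

Because the three binary models are trained separately, their scores are not calibrated against each other; the argmax is a practical decision rule, not a probability comparison.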

#### Multi-Class Classification Multinomial (Softmax) Logistic Regression Example

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = LogisticRegression(solver='lbfgs', multi_class='multinomial')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n", classification_rep)

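In this example, we use the Iris dataset, a classic multi-class classification problem. We apply Multinomial Logistic Regression by creating a Logistic Regression classifier with the multi_class='multinomial' parameter and specifying the solver as 'lbfgs', which supports the multinomial loss. After training the model on the training data, we make predictions on the test data and evaluate its performance using accuracy and a classification report. Since scikit-learn 0.22, the default setting already selects the multinomial loss for 'lbfgs' when there are more than two classes, and the multi_class parameter itself has been deprecated since version 1.5, so on recent versions LogisticRegression(solver='lbfgs') fits the same model without a deprecation warning.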

This code demonstrates the use of Multinomial Logistic Regression for multi-class classification in scikit-learn, where the model directly estimates class probabilities using a softmax function, making it suitable for problems with mutually exclusive classes.
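To make the role of the softmax concrete, the sketch below recomputes the class probabilities from the raw linear scores. It assumes a recent scikit-learn release in which LogisticRegression defaults to the 'lbfgs' solver with the multinomial loss, and it uses scipy.special.softmax as the reference implementation.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Default solver is 'lbfgs'; with 3 classes this fits a multinomial model
# (assumption: scikit-learn 0.22 or newer).
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Raw linear scores, one column per class.
scores = clf.decision_function(X)

# Softmax turns each row of scores into probabilities that sum to 1.
proba_manual = softmax(scores, axis=1)
proba = clf.predict_proba(X)

same = np.allclose(proba, proba_manual)
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

print("predict_proba equals softmax(decision_function):", same)
print("Each row sums to 1:", rows_sum_to_one)
```

Because the probabilities in each row are forced to sum to 1, raising the probability of one class necessarily lowers the others, which is why this model fits mutually exclusive classes.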

#### Multi-Class Classification Support Vector Machines (SVM) Example

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = SVC(kernel='linear', decision_function_shape='ovr')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Print a classification report
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n", classification_rep)

In this example, we again use the Iris dataset, a classic multi-class classification problem. We create an SVM classifier with the SVC class, specify a linear kernel, and set decision_function_shape='ovr'. Note that this parameter only controls the shape of the output of decision_function: SVC always trains one binary classifier per pair of classes (OvO) internally, and with 'ovr' the pairwise results are aggregated into one score per class. After training the model on the training data, we make predictions on the test data and evaluate its performance using accuracy and a classification report.

This code demonstrates a straightforward use of SVM for multi-class classification in scikit-learn. For a linear SVM that actually fits one hyperplane per class against the rest, use LinearSVC (OvR by default) or wrap SVC in OneVsRestClassifier.
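The effect of decision_function_shape is hard to see on Iris, because three classes give three pairwise classifiers and three per-class scores. The sketch below therefore assumes a synthetic four-class dataset from make_blobs (not part of the original example) and compares the output shapes with a genuine OvR model built with OneVsRestClassifier.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# 4 classes, so OvO needs 4 * 3 / 2 = 6 pairwise classifiers.
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

svc_ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
svc_ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

shape_ovo = svc_ovo.decision_function(X).shape  # one column per class pair
shape_ovr = svc_ovr.decision_function(X).shape  # one column per class

# Even with 'ovr', the fitted model holds one hyperplane per class pair.
n_pairwise_hyperplanes = svc_ovr.coef_.shape[0]

# A genuine one-vs-rest SVM: one SVC per class.
true_ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
n_ovr_models = len(true_ovr.estimators_)

print("OvO decision shape:", shape_ovo)
print("'ovr' decision shape:", shape_ovr)
print("Hyperplanes inside SVC:", n_pairwise_hyperplanes)
print("Models inside OneVsRestClassifier:", n_ovr_models)
```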

### 18.2 Multi-Output Classification/Regression

Multi-output problems involve predicting multiple target variables (outputs) for each data point. Scikit-learn provides extensions to various algorithms for multi-output classification and regression tasks:

- Multi-Output Regression<br> This extends standard regression to predict multiple continuous target variables simultaneously. For example, in a multi-output regression problem, you might predict both the price and the age of a house based on its features.

- Multi-Label Classification<br> In multi-label classification, each data point can belong to multiple classes or categories simultaneously. For instance, in document classification, a document might belong to several topics at once.
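For multi-label problems, scikit-learn expects the targets as a binary indicator matrix with one column per label. The sketch below shows one way to build that matrix with MultiLabelBinarizer; the topic names are hypothetical and only illustrate the document-classification example mentioned above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical topic assignments: each document can carry several topics.
docs_topics = [
    ["politics", "economy"],
    ["sports"],
    ["economy", "sports", "tech"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(docs_topics)

# Columns follow the sorted label names stored in classes_.
print(mlb.classes_)
print(Y)
```

Each row of Y is one document and each column answers a separate yes/no question, which is what lets a document belong to several topics at once.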

#### Multi-Output Regression Example

In [None]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=100, n_features=2, n_targets=3, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

regressor = MultiOutputRegressor(LinearRegression())
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

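# multioutput='raw_values' returns one score per target instead of the default average
mse = mean_squared_error(y_test, y_pred, multioutput='raw_values')
print("Mean Squared Error for Each Output:")
print(mse)
r2 = r2_score(y_test, y_pred, multioutput='raw_values')
print("\nR-squared for Each Output:")
print(r2)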

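In this example, we generate synthetic multi-output regression data with three target variables. We wrap LinearRegression in MultiOutputRegressor, which fits one independent copy of the base estimator per target. LinearRegression also accepts a two-dimensional y directly, so the wrapper is used here for illustration; it is mainly needed for estimators that support only a single target. After training the model on the training data, we make predictions on the test data.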

We calculate the mean squared error (MSE) and R-squared (coefficient of determination) for each target variable to assess the model's performance.

This code demonstrates a basic example of multi-output regression in scikit-learn, where the model simultaneously predicts multiple continuous target variables for each input data point.
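What MultiOutputRegressor does internally can be verified with a short check. This sketch reuses the same synthetic data settings as the example and compares the wrapper with LinearRegression fitted directly on the two-dimensional target; fitting on the full data without a train/test split is a simplification for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=100, n_features=2, n_targets=3,
                       noise=0.1, random_state=42)

# The wrapper fits one independent LinearRegression per target column.
wrapped = MultiOutputRegressor(LinearRegression()).fit(X, y)
n_models = len(wrapped.estimators_)

# LinearRegression also handles a 2-D target natively.
native = LinearRegression().fit(X, y)

# Ordinary least squares treats each target separately, so both agree.
same_predictions = np.allclose(wrapped.predict(X), native.predict(X))

print("Models inside the wrapper:", n_models)
print("Wrapper and native predictions agree:", same_predictions)
```

Since the per-target models are independent, the wrapper cannot exploit correlations between targets; it simply makes single-output estimators usable on multi-output data.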

#### Multi-Label Classification Example

In [None]:
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

X, y = make_multilabel_classification(n_samples=100, n_features=5, n_classes=3, n_labels=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
classification_rep = classification_report(y_test, y_pred)
print("Classification Report:\n", classification_rep)

In this example, we generate synthetic multi-label classification data with three possible labels (n_classes=3). The n_labels=2 argument sets the average number of labels per sample, so individual samples may carry fewer or more than two. We use MultiOutputClassifier, which fits one RandomForestClassifier per label. After training the model on the training data, we make predictions on the test data.

For multi-label targets, accuracy_score reports subset accuracy: a sample counts as correct only if every one of its labels is predicted correctly. The classification report then lists precision, recall, F1-score, and support for each label, followed by micro, macro, weighted, and samples averages.

This code demonstrates multi-label classification in scikit-learn, where the model predicts multiple binary labels for each input data point, allowing for scenarios where each data point can belong to multiple classes simultaneously.
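Because subset accuracy is strict, it helps to compare it with label-wise measures. The following sketch uses a tiny hand-made indicator matrix (hypothetical values, chosen so the arithmetic is easy to follow) to contrast accuracy_score, hamming_loss, and a per-label accuracy computed with NumPy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Two samples, three labels; only one label of the first sample is wrong.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Subset accuracy: a row counts only if all of its labels match (1 of 2 rows).
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: fraction of individual label entries that are wrong (1 of 6).
ham = hamming_loss(y_true, y_pred)

# Per-label accuracy: fraction of correct entries in each column.
per_label_acc = (y_true == y_pred).mean(axis=0)

print("Subset accuracy:", subset_acc)
print("Hamming loss:", ham)
print("Per-label accuracy:", per_label_acc)
```

A single wrong label drops subset accuracy to 0.5 here, while only one of six label decisions is actually incorrect, so reporting both views gives a fairer picture of a multi-label model.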

### 18.3 Summary

Multi-class and multi-output algorithms are essential components of machine learning, addressing diverse and complex problems beyond binary classification or single-target regression. Here's a summary:

Multi-class classification deals with categorizing data into more than two distinct classes or categories. Common strategies include One-vs-Rest (OvR), One-vs-One (OvO), and multinomial logistic regression; among support vector machines, SVC uses OvO internally while LinearSVC uses OvR. Tree-based ensembles such as Random Forest and Gradient Boosting handle multi-class targets natively.

Multi-output problems involve predicting multiple target variables (outputs) for each data point, as in multi-target regression and multi-label classification. Scikit-learn offers wrappers like MultiOutputRegressor and MultiOutputClassifier, which fit one copy of a single-output estimator per target. Multi-label classification is the special case in which every output is binary, so each data point can belong to multiple classes simultaneously.