# 5.2 Classification

![Image of Runcode](https://static.javatpoint.com/tutorial/machine-learning/images/classification-algorithm-in-machine-learning.png)

Classification algorithms can be divided into two main categories:

* Linear Models
    * Logistic Regression
    * Support Vector Machines (SVM)
* Non-linear Models
    * K-Nearest Neighbors
    * Naïve Bayes
    * Decision Tree Classification
    * Random Forest Classification

## 5.2.1 Logistic regression

![Image of Runcode](https://static.javatpoint.com/tutorial/machine-learning/images/logistic-regression-in-machine-learning.png)

<b>Logistic regression</b> is a classification algorithm used for predicting the probability of a binary outcome. It is a linear method that models the relationship between the dependent variable (output) and one or more independent variables (inputs) using a logistic function.

The logistic function, also known as the sigmoid function, is defined as:

f(x) = 1 / (1 + e^(-x))

It maps any real input x to a value between 0 and 1. In logistic regression, x is a weighted sum of the input features, and the output is interpreted as the probability of the positive class (the class labeled 1 in a binary classification problem).
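As a quick illustration of this mapping, the following sketch evaluates the function at a few inputs. The helper name `sigmoid` and the sample inputs are chosen here for illustration only, and the plain `math.exp` call is only suitable for moderate values of x.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic (sigmoid) function: maps a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 gives exactly 0.5.
for x in (-4.0, 0.0, 4.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.4f}")
```

A common convention is to predict the positive class when the output is at least 0.5, which corresponds to x being non-negative.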

Here is an example of logistic regression in Python using scikit-learn:
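The listing below is a minimal sketch rather than a definitive implementation. The breast cancer dataset (a binary problem bundled with scikit-learn), `max_iter=10000`, and `random_state=0` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a binary classification dataset (assumed here for illustration)
X, y = load_breast_cancer(return_X_y=True)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Initialize the model; a generous max_iter gives the solver room to converge
# on these unscaled features
model = LogisticRegression(max_iter=10000)

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict class labels for the test data
y_pred = model.predict(X_test)

# predict_proba returns the estimated probability of each class per sample
probabilities = model.predict_proba(X_test[:3])

# Evaluate the predictions with the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```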

In this example, we split the data into training and test sets, initialize a logistic regression model, fit the model to the training data, make predictions on the test data, and evaluate the model performance using the accuracy score.

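Logistic regression is a simple and widely used method for classification, and it is effective for many applications. In its basic form it handles only binary outcomes; multiclass problems require an extension such as multinomial or one-vs-rest logistic regression. It also assumes a linear relationship between the independent variables and the logit of the outcome, so it can perform poorly when that assumption does not hold.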

## 5.2.2 Support Vector Machines (SVM)

![Image of Runcode](https://static.javatpoint.com/tutorial/machine-learning/images/support-vector-machine-algorithm5.png)

A <b>support vector machine (SVM)</b> is a supervised learning algorithm that can be used for classification or regression. In the case of classification, the algorithm creates a hyperplane or set of hyperplanes in a high-dimensional space, which can be used to classify new data points.

An SVM is particularly well-suited for classification of complex but small- or medium-sized datasets.
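To make the hyperplane idea concrete, the sketch below fits a linear SVM on two of the three iris classes and checks that its decision function is the signed score `w · x + b`. The restriction to two classes and the choice of `kernel="linear"` are assumptions made for illustration; with a non-linear kernel the hyperplane lives in an implicit feature space and `coef_` is not available.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Keep only classes 0 and 1 so that a single hyperplane separates the data
X, y = load_iris(return_X_y=True)
mask = y < 2
X2, y2 = X[mask], y[mask]

# Fit a linear SVM; coef_ and intercept_ describe the separating hyperplane
clf = SVC(kernel="linear").fit(X2, y2)
w = clf.coef_[0]
b = clf.intercept_[0]

# The decision function is the signed score w . x + b for each sample
manual_scores = X2 @ w + b

# A positive score means the second class (label 1), otherwise the first (label 0)
manual_predictions = (manual_scores > 0).astype(int)

print("Hyperplane normal vector w:", w)
print("Intercept b:", b)
```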

Here is a simple example of how to train and use an SVM in Python using the scikit-learn library:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm

# Load the iris dataset as an example
iris = datasets.load_iris()
X = iris["data"]
Y = iris["target"]

# Split the data into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Create an SVM model
model = svm.SVC()

# Train the model on the training data
model.fit(X_train, Y_train)

# Test the model on the test data
accuracy = model.score(X_test, Y_test)

# Print the accuracy
print("Accuracy:", accuracy)
```

```
Accuracy: 0.9666666666666667
```


This code trains an SVM on the iris dataset and prints the accuracy of the model on the test set. The iris dataset is a classic machine learning dataset consisting of measurements of iris flowers together with the species each set of measurements belongs to. The SVM learns from the training data to predict the species of a new iris flower from its measurements. Because `train_test_split` shuffles the data randomly, the reported accuracy changes from run to run unless `random_state` is set.

## 5.2.3 Decision Tree

![Image of Runcode](https://static.javatpoint.com/tutorial/machine-learning/images/decision-tree-classification-algorithm.png)

<b>Decision tree</b> classification is a supervised learning algorithm that builds a tree-like model of decisions based on the features of the data.

At each internal node of the tree, the algorithm selects the feature and split that best separate the classes, for example by maximizing the information gain or minimizing the Gini impurity, and then creates branches for the resulting subsets. This process is repeated recursively on each branch until a stopping condition is met, such as a node containing samples of a single class or the tree reaching a maximum depth. That node then becomes a leaf, and its predicted class is the majority class of the training samples that reach it.
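The information gain of a split can be computed as the entropy of the parent node minus the size-weighted entropy of its children. The sketch below works through this on a hypothetical toy node; the function names and the sample labels are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 4 samples of class "a" and 4 of class "b"
parent = ["a"] * 4 + ["b"] * 4

# A split that separates the classes perfectly removes all uncertainty
perfect_split = [["a"] * 4, ["b"] * 4]

# A split that leaves both children as mixed as the parent gains nothing
useless_split = [["a", "a", "b", "b"], ["a", "a", "b", "b"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree-building algorithm that uses this criterion would evaluate every candidate split in this way and keep the one with the highest gain.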

Here is an example of how to train and use a decision tree classifier in Python using the scikit-learn library:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree

# Load the iris dataset as an example
iris = datasets.load_iris()
X = iris["data"]
Y = iris["target"]

# Split the data into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Create a decision tree classifier
model = tree.DecisionTreeClassifier()

# Train the model on the training data
model.fit(X_train, Y_train)

# Test the model on the test data
accuracy = model.score(X_test, Y_test)

# Print the accuracy
print("Accuracy:", accuracy)
```

```
Accuracy: 0.9666666666666667
```


This code trains a decision tree classifier on the same iris dataset and prints the accuracy of the model on the test set. The classifier learns from the training data to predict the species of a new iris flower from its measurements.

## 5.2.4 Random Forest

![Image of Runcode](https://static.javatpoint.com/tutorial/machine-learning/images/random-forest-algorithm.png)

<b>Random forest</b> classification is an ensemble learning method. It constructs many decision trees at training time, each on a bootstrap sample of the data with a random subset of features considered at each split, and predicts the class chosen by the most trees. When used for regression, it predicts the mean of the individual trees' outputs.

One of the key features of random forest classifiers is that they can handle a large number of features, and they can still make accurate predictions even if some of the features are correlated or if there are nonlinear relationships between the features and the target.

Here is an example of how to train and use a random forest classifier in Python using the scikit-learn library:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the iris dataset as an example
iris = datasets.load_iris()
X = iris["data"]
Y = iris["target"]

# Split the data into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Create a random forest classifier
model = RandomForestClassifier()

# Train the model on the training data
model.fit(X_train, Y_train)

# Test the model on the test data
accuracy = model.score(X_test, Y_test)

# Print the accuracy
print("Accuracy:", accuracy)
```

```
Accuracy: 1.0
```


### Bagging and Boosting

Bagging and boosting are ensemble learning methods that can be used to improve the performance of a machine learning model.

<b>Bagging</b> (short for bootstrap aggregating) trains multiple models on different bootstrap samples of the training data, that is, random samples drawn with replacement, and then combines their predictions, typically by majority vote for classification or by averaging for regression. The base models can be decision trees, neural networks, or any other type of model. The idea behind bagging is that combining many models reduces variance, so the combined prediction is usually more stable, and often more accurate, than that of any individual model.
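To show what happens inside a bagging ensemble, here is a minimal hand-rolled sketch: it draws bootstrap samples, fits one decision tree per sample, and combines the trees by majority vote. The number of models (25), the seeds, and the use of the iris dataset are assumptions made for illustration; in practice `BaggingClassifier` does this work for you.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25
n_classes = len(np.unique(y))
all_predictions = []

for _ in range(n_models):
    # Bootstrap sample: draw training indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_predictions.append(tree.predict(X_test))

# Shape (n_models, n_test_samples): one row of predictions per tree
all_predictions = np.array(all_predictions)

# Majority vote: for each test sample, pick the most frequently predicted class
votes = np.array([
    np.bincount(column, minlength=n_classes).argmax()
    for column in all_predictions.T
])

bagged_accuracy = np.mean(votes == y_test)
print("Bagged accuracy:", bagged_accuracy)
```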

<b>Boosting</b> is a method that involves training multiple models sequentially, with each model attempting to correct the mistakes of the previous model. Boosting algorithms typically use decision trees, but they can also be used with other types of models. The main idea behind boosting is to train a weak model, and then to iteratively improve it by adding new models that focus on the mistakes made by the previous models. Boosting algorithms can often achieve higher accuracy than bagging, but they are also more prone to overfitting.
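The sequential nature of boosting can be observed with `AdaBoostClassifier.staged_score`, which yields the ensemble's score after each boosting round. The sketch below is illustrative only; the breast cancer dataset, `n_estimators=50`, and `random_state=0` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each boosting round adds one weak learner (a depth-1 tree by default)
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Test accuracy of the ensemble after 1, 2, 3, ... boosting rounds
staged_accuracy = list(model.staged_score(X_test, y_test))

print("After the first round:", staged_accuracy[0])
print("After the last round: ", staged_accuracy[-1])
```

The final entry equals the score of the fully trained ensemble, so comparing it with the first entry shows how much the later rounds contributed.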

Here is an example of how to use bagging and boosting in Python using the scikit-learn library:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

# Load the iris dataset as an example
iris = datasets.load_iris()
X = iris["data"]
Y = iris["target"]

# Split the data into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Create a bagging classifier
bagging_model = BaggingClassifier()

# Train the model on the training data
bagging_model.fit(X_train, Y_train)

# Test the model on the test data
bagging_accuracy = bagging_model.score(X_test, Y_test)

# Create an AdaBoost classifier
boosting_model = AdaBoostClassifier()

# Train the model on the training data
boosting_model.fit(X_train, Y_train)

# Test the model on the test data
boosting_accuracy = boosting_model.score(X_test, Y_test)

# Print the accuracies
print("Bagging accuracy:", bagging_accuracy)
print("Boosting accuracy:", boosting_accuracy)
```

```
Bagging accuracy: 0.9666666666666667
Boosting accuracy: 0.9666666666666667
```


## 5.2.5 K-nearest neighbors (KNN) classifier

![Image of Runcode](https://static.javatpoint.com/tutorial/machine-learning/images/k-nearest-neighbor-algorithm-for-machine-learning5.png)

The <b>K-nearest neighbors (KNN) classifier</b> is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. It is called a lazy learning algorithm because it has no specialized training phase. Instead, it stores the training data and, to classify a new data point, finds the k training points closest to it and assigns the class that is most common among them.

Here's a simple example of how a KNN classifier can be implemented in Python using the popular scikit-learn library:
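The listing below is a minimal sketch in the same style as the earlier examples. The iris dataset, the 80/20 split, and `random_state=0` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset as an example
X, y = load_iris(return_X_y=True)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create a KNN classifier that considers the 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)

# "Training" only stores the data; the distance computations happen at prediction time
knn.fit(X_train, y_train)

# Test the model on the test data
accuracy = knn.score(X_test, y_test)
print("Accuracy:", accuracy)
```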

The parameter 'n_neighbors' specifies the number of nearest neighbors that the classifier should consider while predicting the label of a new data point. By default, it is set to 5. You can specify any other value for n_neighbors depending on the size and nature of your dataset.

## 5.2.6 XGBoost

<b>XGBoost</b> (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosting framework for machine learning. It has a number of hyperparameters that can be tuned to achieve better performance, and it is widely used in a variety of machine learning tasks such as classification, regression, and ranking.

Here's a simple example of how XGBoost can be used in Python through the scikit-learn-compatible API of the separate xgboost package:

There are several hyperparameters that you can tune to improve the performance of the XGBoost model. Some of the important ones are:

* max_depth: maximum depth of each tree.
* learning_rate: step size that scales the contribution of each new tree.
* n_estimators: number of boosting rounds, i.e. trees in the ensemble.
* gamma: minimum loss reduction required to make a further split on a leaf node.

You can specify these hyperparameters while creating the XGBoost model like this:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, gamma=0)
```