## Linear Classifiers

Linear classifiers are supervised machine learning algorithms used for classification tasks. They learn a linear decision boundary (a line in two dimensions, a hyperplane in general) that separates the data points into different classes.

There are several types of linear classifiers, including:

- Logistic Regression
- Support Vector Machines (SVM)
- Perceptron

These classifiers make predictions based on a linear combination of the input features, where the coefficients of the linear combination are learned during the training process.
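The linear combination described above can be checked directly against a fitted model. This is a minimal sketch on a synthetic dataset (the data and the variable names `raw_output` and `manual_pred` are assumptions made for illustration); it recomputes the raw model output from the learned coefficients and intercept and compares it with what scikit-learn reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset, used here only for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)

# Raw model output: linear combination of the features plus the intercept
raw_output = X @ clf.coef_.ravel() + clf.intercept_[0]

# The predicted class depends only on the sign of the raw output
manual_pred = (raw_output > 0).astype(int)

print(np.allclose(raw_output, clf.decision_function(X)))
print(np.array_equal(manual_pred, clf.predict(X)))
```

Both checks should agree, since `decision_function` and `predict` are built on the same linear combination.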

Linear classifiers work best when the data is linearly separable or nearly so, meaning that a straight line (or, in higher dimensions, a hyperplane) can separate the data points of different classes. They can also handle data that is not linearly separable if non-linear transformations are applied to the input features first.
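As a hedged illustration of the feature-transformation idea, the sketch below fits logistic regression to two concentric circles, once on the raw features and once after adding degree-2 polynomial features. The dataset and the choice of `PolynomialFeatures` are assumptions made for this example; other transformations would work too.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric circles: not separable by a straight line
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Linear classifier on the raw features
linear_clf = LogisticRegression().fit(X, y)

# Same classifier after adding squared and interaction terms
poly_clf = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

linear_acc = linear_clf.score(X, y)
poly_acc = poly_clf.score(X, y)
print("Raw features training accuracy      :", linear_acc)
print("Quadratic features training accuracy:", poly_acc)
```

The model is still linear in its inputs; it is the inputs that have changed, so the boundary is a circle in the original space.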

To train a linear classifier, you typically need a labeled dataset, where each data point is associated with a class label. The classifier learns from this labeled data to make predictions on new, unseen data points.

## Support Vector Machines (SVM)

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression tasks. As a classifier, it looks for the hyperplane that best separates the data points of different classes.

The SVM chooses the hyperplane that maximizes the margin, which is the distance from the hyperplane (the decision boundary) to the closest training points. Those closest points are called support vectors.

SVM can handle both linearly separable and non-linearly separable data by using different types of kernels. A kernel function measures the similarity between two data points as if they had been mapped into a higher-dimensional feature space, where a separating hyperplane is easier to find, without computing that mapping explicitly.
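One way to see what a kernel does is to compare it with an explicit feature map. The sketch below uses a degree-2 polynomial kernel on 2-D inputs; the functions `phi` and `poly_kernel` are hypothetical helpers written for this illustration, not scikit-learn APIs.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product in the 3-D feature space
implicit = poly_kernel(x, z)       # same value, computed in the 2-D input space
print(explicit, implicit)
```

The two values coincide, which is why an SVM can work in the higher-dimensional space while only ever evaluating the kernel on the original inputs.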

SVM has several advantages, including:

- Effective in high-dimensional spaces
- Memory efficient
- Versatile with different kernel functions

However, SVM can be sensitive to the choice of hyperparameters and may require careful tuning to achieve optimal performance.

```
from sklearn.svm import LinearSVC

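# X_train, X_test, y_train and y_test are assumed to be defined already

# Fit a linear SVM and print the training and test accuracy
svm = LinearSVC()
svm.fit(X_train, y_train)
print(svm.score(X_train, y_train))
print(svm.score(X_test, y_test))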
```

A subset of scikit-learn's built-in wine dataset is already loaded into `X`, along with binary labels in `y`.

```
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Define the classifiers
classifiers = [LogisticRegression(),LinearSVC(),SVC(),KNeighborsClassifier()]

# Fit the classifiers
for c in classifiers:
    c.fit(X,y)

# Plot the classifiers
# (plot_4_classifiers is a helper assumed to be defined elsewhere; it is not part of scikit-learn)
plot_4_classifiers(X, y, classifiers)
plt.show()
```

![image](image.png)


## Minimizing Loss in Linear Classifiers

In linear classifiers, the goal is to find the optimal hyperplane that separates the data points into different classes. This hyperplane is determined by minimizing a loss function.

A loss function measures the error or mismatch between the predicted class labels and the true class labels. The goal is to minimize this error by adjusting the parameters of the linear classifier.

One commonly used loss function for linear classifiers is the hinge loss function, which penalizes misclassifications and encourages a larger margin between the decision boundary and the support vectors.

The hinge loss function can be visualized in a loss diagram. The x-axis represents the raw model output multiplied by the true label (with labels encoded as -1 and +1), which is positive for correctly classified points and negative for misclassified ones. The y-axis represents the loss.

![Loss Diagram](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Hinge_loss.svg/400px-Hinge_loss.svg.png)

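In the loss diagram, the loss is zero when the raw model output times the true label is at least 1, that is, when a point is on the correct side of the boundary and outside the margin. Below 1, the loss increases linearly and without bound, so points inside the margin incur a small penalty and badly misclassified points incur a large one.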
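The standard definition of the hinge loss is `max(0, 1 - y * f(x))`, where `f(x)` is the raw model output and `y` is the true label in {-1, +1}. A minimal NumPy sketch (the function name `hinge_loss` and the sample values are assumptions for illustration):

```python
import numpy as np

def hinge_loss(raw_output, y_true):
    """Hinge loss for labels in {-1, +1}: max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - y_true * raw_output)

# Raw model outputs for five points whose true label is +1
raw = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
labels = np.ones_like(raw)

losses = hinge_loss(raw, labels)
print(losses)  # [0.  0.  0.5 1.  2. ]
```

The first two points are correct and outside the margin (zero loss), the third is correct but inside the margin, and the last is misclassified.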

The goal of minimizing the loss is to find the hyperplane that maximizes the margin while keeping the loss as low as possible. This is achieved through optimization algorithms that adjust the parameters of the linear classifier based on the gradient of the loss function.

By minimizing the loss, linear classifiers can effectively separate the data points into different classes and make accurate predictions on new unseen data.

## Regularization in Logistic Regression

Regularization is a technique used in machine learning to prevent overfitting and improve the generalization of a model. In the context of logistic regression, regularization involves adding a penalty term to the loss function.

The loss function in logistic regression is typically the negative log-likelihood function, which measures the error between the predicted probabilities and the actual labels. The penalty term is added to this loss function to control the complexity of the model.
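As a concrete illustration of the negative log-likelihood, the sketch below computes its mean over four samples by hand and compares it with `sklearn.metrics.log_loss`. The labels and predicted probabilities are made-up values chosen for the example.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 0])
p_pos = np.array([0.9, 0.2, 0.7, 0.4])

# Mean negative log-likelihood of the true labels under the predictions
manual_nll = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# scikit-learn's log_loss averages over samples by default
sklearn_nll = log_loss(y_true, p_pos)

print(manual_nll, sklearn_nll)
```

Confident correct predictions contribute little to the loss, while confident wrong predictions are penalized heavily.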

There are two commonly used types of regularization in logistic regression:

1. L1 Regularization (the penalty used in lasso regression): This type of regularization adds the sum of the absolute values of the coefficients as the penalty term. It encourages sparsity in the model by shrinking some coefficients exactly to zero.

2. L2 Regularization (the penalty used in ridge regression): This type of regularization adds the sum of the squared coefficients as the penalty term. It encourages small values for all coefficients, but does not force them to zero.

The regularization strength is often denoted lambda (λ). A higher value of lambda results in stronger regularization and a simpler model, while a lower value of lambda allows the model to fit the training data more closely. scikit-learn's `LogisticRegression` instead exposes the parameter `C`, which is the inverse of the regularization strength, so a smaller `C` means stronger regularization.

Regularization helps to prevent overfitting by discouraging the model from relying too heavily on any single feature or combination of features. It can also improve the interpretability of the model by reducing the impact of irrelevant or noisy features.
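The effect of the penalty type and of `C` can be seen on a small synthetic problem. This is a sketch under assumptions chosen for illustration (50 features of which 5 are informative, and the specific `C` values); it counts non-zero coefficients for L1 versus L2 and compares coefficient norms for weak versus strong regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, only 5 of which carry signal
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# Same C, different penalty: L1 zeroes out coefficients, L2 only shrinks them
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)
n_nonzero_l1 = np.count_nonzero(l1.coef_)
n_nonzero_l2 = np.count_nonzero(l2.coef_)
print("Non-zero coefficients (L1):", n_nonzero_l1)
print("Non-zero coefficients (L2):", n_nonzero_l2)

# Same penalty (default L2), different C: smaller C means stronger regularization
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
norm_weak = np.linalg.norm(weak.coef_)
norm_strong = np.linalg.norm(strong.coef_)
print("Coefficient norm with C=100 :", norm_weak)
print("Coefficient norm with C=0.01:", norm_strong)
```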


To obtain class probabilities from a logistic regression model, use its `predict_proba` method, which returns the predicted probability of each class for every sample.

Here's an example:

```python
# Import the necessary libraries
from sklearn.linear_model import LogisticRegression

# Instantiate the logistic regression model
lr = LogisticRegression()

# Fit the model to the training data
lr.fit(X_train, y_train)

# Predict probabilities for the test data
probs = lr.predict_proba(X_test)

# Print the predicted probabilities
print(probs)
```

In this example, `X_train` and `y_train` are the training data, and `X_test` is the test data. The `predict_proba` method returns an array of shape `(n_samples, n_classes)`, where `n_samples` is the number of samples in the test data and `n_classes` is the number of classes in the target variable.
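A self-contained version on synthetic binary data makes the output easier to inspect. The dataset here is an assumption for illustration. The sketch also checks that, in the binary case, the positive-class probability is the sigmoid of the raw model output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset, used here only for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
lr = LogisticRegression().fit(X, y)

probs = lr.predict_proba(X)      # shape (n_samples, n_classes)
raw = lr.decision_function(X)    # raw model output, one value per sample

# Squash the raw output through the sigmoid to get P(class 1)
sigmoid = 1.0 / (1.0 + np.exp(-raw))

print(probs.shape)
print(np.allclose(probs.sum(axis=1), 1.0))   # each row sums to 1
print(np.allclose(probs[:, 1], sigmoid))     # column 1 is the sigmoid of the raw output
```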


```
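import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Specify L1 regularization
lr = LogisticRegression(solver='liblinear', penalty="l1")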

# Instantiate the GridSearchCV object and run the search
searcher = GridSearchCV(lr, {'C':[0.001, 0.01, 0.1, 1, 10]})
searcher.fit(X_train, y_train)

# Report the best parameters
print("Best CV params", searcher.best_params_)

# Find the number of nonzero coefficients (selected features)
best_lr = searcher.best_estimator_
coefs = best_lr.coef_
print("Total number of features:", coefs.size)
print("Number of selected features:", np.count_nonzero(coefs))

# Output:
# Best CV params {'C': 1}
# Total number of features: 2500
# Number of selected features: 1219

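# Get the indices of the sorted coefficients
# (use the fitted best_lr: GridSearchCV fits clones, so lr itself is never fitted;
#  vocab is assumed to map feature indices to words)
inds_ascending = np.argsort(best_lr.coef_.flatten())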
inds_descending = inds_ascending[::-1]

# Print the most positive words
print("Most positive words: ", end="")
for i in range(5):
    print(vocab[inds_descending[i]], end=", ")
print("\n")

# Print most negative words
print("Most negative words: ", end="")
for i in range(5):
    print(vocab[inds_ascending[i]], end=", ")
print("\n")

# Output:
# Most positive words: favorite, superb, noir, knowing, excellent,
#
# Most negative words: worst, disappointing, waste, boring, lame,

```

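`LogisticRegression(multi_class='ovr')` fits a one-vs-rest model on a multi-class target: one binary classifier per class. `multi_class='multinomial'` instead fits a single multinomial (softmax) model over all classes. Note that the `multi_class` parameter is deprecated in recent scikit-learn releases (1.5 and later), which recommend wrapping the estimator in `OneVsRestClassifier` for the one-vs-rest scheme.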

```
# Fit one-vs-rest logistic regression classifier
lr_ovr = LogisticRegression(multi_class='ovr')
lr_ovr.fit(X_train, y_train)

print("OVR training accuracy:", lr_ovr.score(X_train, y_train))
print("OVR test accuracy    :", lr_ovr.score(X_test, y_test))

# Fit softmax classifier
lr_mn = LogisticRegression(multi_class='multinomial')
lr_mn.fit(X_train, y_train)

print("Softmax training accuracy:", lr_mn.score(X_train, y_train))
print("Softmax test accuracy    :", lr_mn.score(X_test, y_test))

# Output:
# OVR training accuracy: 0.9955456570155902
# OVR test accuracy    : 0.9644444444444444
# Softmax training accuracy: 1.0
# Softmax test accuracy    : 0.9688888888888889

```
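The same comparison can be written without the `multi_class` parameter. The sketch below is one possible variant, assuming the iris dataset and `max_iter=1000` (both choices made for this example): it wraps the estimator in `OneVsRestClassifier` for one-vs-rest and relies on the default multinomial behaviour of `LogisticRegression` for three or more classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Three-class dataset, used here only for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# One-vs-rest: one binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# Softmax: a single multinomial model over all classes
softmax = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Binary classifiers in OvR:", len(ovr.estimators_))
print("OvR test accuracy        :", ovr.score(X_test, y_test))
print("Softmax test accuracy    :", softmax.score(X_test, y_test))
```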


## Support Vectors

Support vectors are the training points that lie closest to the decision boundary of a support vector machine (SVM) classifier: those on or inside the margin, plus any that are misclassified. They are the only points that determine the decision boundary. In the dual representation of the boundary, they are exactly the points with non-zero coefficients, so removing any other training point leaves the boundary unchanged.

To find the support vectors in scikit-learn, you can use the `support_vectors_` attribute of the trained SVM model. This attribute returns an array of shape (n_support_vectors, n_features) containing the support vectors.

Here is an example of how to find the support vectors using scikit-learn:

```python
from sklearn.svm import SVC

# Fit SVM classifier (X_train and y_train are assumed to be defined already)
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# Get support vectors
support_vectors = svm.support_vectors_

print('Number of support vectors:', support_vectors.shape[0])
print('Support vectors:', support_vectors)
```
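A self-contained variant on synthetic data shows how the support vectors relate to the training set. The two-cluster dataset is an assumption for illustration; `support_` and `n_support_` are attributes of a fitted `SVC`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters, used here only for illustration
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
svm = SVC(kernel="linear").fit(X, y)

# support_ holds the row indices of the support vectors in X
same_rows = np.allclose(X[svm.support_], svm.support_vectors_)
n_sv = svm.support_vectors_.shape[0]

print("Training points          :", X.shape[0])
print("Support vectors          :", n_sv)
print("Support vectors per class:", svm.n_support_)
print("support_ indexes into X  :", same_rows)
```

Only a subset of the training points ends up as support vectors; the remaining points do not influence the fitted boundary.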


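### Jointly tuning gamma and C with GridSearchCV

For an RBF-kernel `SVC`, `gamma` controls how far the influence of a single training point reaches (larger values mean a more local, more complex boundary), and `C` is the inverse of the regularization strength. The two interact, so they are usually tuned together.

```
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Instantiate an RBF SVM
svm = SVC()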

# Instantiate the GridSearchCV object and run the search
parameters = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svm, parameters)
searcher.fit(X_train,y_train)

# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)

# Report the test accuracy using these best parameters
print("Test accuracy of best grid search hypers:", searcher.score(X_test,y_test))

# Output:
# Best CV params {'C': 1, 'gamma': 0.001}
# Best CV accuracy 0.9988826815642458
# Test accuracy of best grid search hypers: 0.9988876529477196
```

## SGD Classifier

The SGD (Stochastic Gradient Descent) Classifier, `SGDClassifier` in scikit-learn, is a linear classifier that uses stochastic gradient descent to optimize the model parameters. It is a popular algorithm for large-scale machine learning tasks due to its efficiency and ability to handle large datasets.

```
# Import the necessary libraries
from sklearn.linear_model import SGDClassifier

# Instantiate the SGDClassifier
sgd_classifier = SGDClassifier()

# Print the details of the SGDClassifier
print(sgd_classifier)
```

Here are the key steps involved in the working of the SGD Classifier:

1. **Initialization**: The model parameters (weights and biases) are initialized randomly or with some predefined values.

2. **Training**: The SGD Classifier iteratively updates the model parameters using stochastic gradient descent. In each iteration, a random sample (or a mini-batch) from the training data is used to compute the gradient of the loss function with respect to the model parameters. The model parameters are then updated in the direction of the negative gradient to minimize the loss function.

3. **Convergence**: The training process continues until a stopping criterion is met, such as reaching a maximum number of iterations or achieving a desired level of convergence.

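4. **Prediction**: Once the model is trained, it can be used to make predictions on new unseen data by computing the dot product between the input features and the learned weights, adding the bias, and assigning the class according to the sign of the result (or the largest score in the multi-class case).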
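The four steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions (synthetic two-cluster data, hinge loss with an L2 penalty, a fixed learning rate, labels in {-1, +1}), not how `SGDClassifier` is implemented. The names `alpha` and `learning_rate` and their values are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two Gaussian clusters with labels -1 and +1
n = 200
X = np.vstack([rng.normal(loc=-2.0, scale=1.0, size=(n // 2, 2)),
               rng.normal(loc=2.0, scale=1.0, size=(n // 2, 2))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

# 1. Initialization: start from zero weights and bias
w = np.zeros(2)
b = 0.0
alpha = 1e-4          # L2 regularization strength (assumed value)
learning_rate = 0.01  # fixed step size (assumed value)

# 2./3. Training: one sample per update, stop after a fixed number of epochs
for epoch in range(20):
    for i in rng.permutation(n):
        margin = y[i] * (X[i] @ w + b)
        grad_w = alpha * w          # gradient of the L2 penalty
        grad_b = 0.0
        if margin < 1:              # hinge loss is active for this sample
            grad_w = grad_w - y[i] * X[i]
            grad_b = -y[i]
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

# 4. Prediction: sign of the linear combination plus bias
pred = np.where(X @ w + b >= 0, 1.0, -1.0)
accuracy = np.mean(pred == y)
print("Training accuracy:", accuracy)
```

Each update touches a single sample, which is what keeps the per-step cost independent of the dataset size.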

The SGD Classifier is particularly useful for large-scale classification problems where the number of samples or features is very large. It can handle both sparse and dense input data and supports various loss functions and regularization techniques.

```
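from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# We set random_state=0 for reproducibility
linear_classifier = SGDClassifier(random_state=0)

# Instantiate the GridSearchCV object and run the search
# alpha is the regularization strength (larger means stronger, unlike C);
# 'log_loss' is the logistic loss (named 'log' before scikit-learn 1.1)
parameters = {'alpha':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
             'loss':['hinge','log_loss']}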
searcher = GridSearchCV(linear_classifier, parameters, cv=10)
searcher.fit(X_train, y_train)

# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)
print("Test accuracy of best grid search hypers:", searcher.score(X_test, y_test))

# Output:
# Best CV params {'alpha': 0.001, 'loss': 'hinge'}
# Best CV accuracy 0.9490730158730158
# Test accuracy of best grid search hypers: 0.9611111111111111

```