# Logistic Regression & Support Vector Machines (SVM)

Logistic regression and support vector machines are both prominent machine learning algorithms used predominantly for classification tasks.

### Logistic Regression:
Logistic regression estimates class probabilities using the logistic (sigmoid) function. The sigmoid takes a linear combination of the features and maps it to a probability between 0 and 1.
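
As a minimal sketch of that mapping, the snippet below applies a sigmoid to a linear combination of two features. The weights `w`, bias `b`, and sample `x` are hypothetical values chosen only for illustration; they are not fitted to any data in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one two-feature sample (illustrative values only)
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b   # linear combination of the features
p = sigmoid(z)         # estimated probability of the positive class
print(f"z = {z:.2f}, p = {p:.3f}")
```

A probability above 0.5 corresponds to `z > 0`, which is why the decision boundary of logistic regression is linear in the features.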

### Support Vector Machines (SVM):
An SVM is a robust supervised learning algorithm designed for a broad range of tasks, including both classification and regression.

SVMs comprise a family of supervised learning methods that can be applied to classification, regression, and outlier detection. This discussion, however, focuses on classification.
SVMs handle both linearly separable and non-linearly separable datasets by using kernel functions that implicitly map the data into higher-dimensional feature spaces.
The kernel functions most frequently used in SVMs are linear, polynomial, radial basis function (RBF), and sigmoid. The non-linear kernels let the model capture complex patterns that a linear boundary cannot.
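
To make the idea of a kernel function concrete, here is a small sketch that evaluates the RBF kernel by hand and compares it with scikit-learn's `rbf_kernel`. The two points and the value of `gamma` are arbitrary and serve only as an illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative points and an arbitrary gamma (not taken from the notebook's data)
a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 0.0]])
gamma = 0.5

# RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2)
manual = np.exp(-gamma * np.sum((a - b) ** 2))
library = rbf_kernel(a, b, gamma=gamma)[0, 0]
print(manual, library)
```

The kernel value acts as a similarity score: it is 1 for identical points and decays towards 0 as the points move apart.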


### Benefits:

- Highly effective in scenarios with high-dimensional spaces.
- Still effective even when the number of features exceeds the number of samples.
- Optimized for memory efficiency.
- Versatile in kernel types, with the emphasis in this discussion being on linear kernels.

### Limitations:

- When the number of features surpasses the number of samples, regularization becomes vital to prevent overfitting.
- Unlike logistic regression, SVMs don't directly provide probability estimates.

Different SVM variants are available in Scikit-learn for classification:

### SVC:
This variant is based on the libsvm implementation. Its fit time scales at least quadratically with the number of samples, so it can become impractical for very large datasets.

### NuSVC:
Operates similarly to SVC, but uses the `nu` parameter to control the number of support vectors.

### LinearSVC:
Offers a fast implementation for linear-kernel SVMs, based on liblinear.
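
The three variants share the same `fit`/`score` interface, so they can be swapped freely. The sketch below trains each on a small synthetic moons dataset; the noise level, `nu=0.5`, and `max_iter=10000` are assumptions made for this illustration, not settings used elsewhere in the notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, NuSVC, LinearSVC

# Small synthetic dataset, used here only for illustration
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

variants = {
    "SVC": SVC(kernel="rbf"),
    "NuSVC": NuSVC(nu=0.5),                    # nu bounds the fraction of support vectors
    "LinearSVC": LinearSVC(max_iter=10000),    # linear kernel only
}

scores = {}
for name, clf in variants.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)       # mean accuracy on the held-out split
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```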

# 1. SVC

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


In [None]:
# Generate moon dataset
X, y = make_moons(n_samples=200, noise=0, random_state=2, shuffle=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)




In [None]:
class CustomSVC:
    """Custom SVC class"""
    def __init__(self, kernel='rbf', C=1.0, gamma='scale', degree=3):
        """Initialize the class with the given parameters"""
        self.kernel = kernel
        self.C = C
        self.gamma = gamma
        self.degree = degree
        self.model = None

    def fit(self, X, y):
        """Fit the model with the given data"""
        self.model = SVC(kernel=self.kernel, C=self.C, gamma=self.gamma, degree=self.degree)
        self.model.fit(X, y)

    def predict(self, X):
        """Predict the given data"""
        return self.model.predict(X)



### Plot the support vectors:

In [None]:
plt.figure(figsize=(20, 15))

# kernels to compare
kernels = ['linear', 'poly', 'rbf']

# iterate over different kernels
for i, kernel in enumerate(kernels):

    # create an SVC instance for the current kernel
    model = svm.SVC(kernel=kernel)

    # fit the model
    model.fit(X_train, y_train)

    # plot the line, the points, and the nearest vectors to the plane
    plt.subplot(2, 2, i + 1)
    plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
                s=100, facecolors='none', edgecolors='k', marker='o')
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, zorder=10, cmap=plt.cm.Paired, edgecolors='k')
    
    plt.axis('tight')

    # create grid to evaluate model
    X1, X2 = np.meshgrid(np.linspace(-3, 3, 500), np.linspace(-3, 3, 500))

    # evaluate model on a grid
    decision_function = model.decision_function(np.c_[X1.ravel(), X2.ravel()])
    plt.contour(X1, X2, decision_function.reshape(X1.shape), colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
    
    # label the subplot with the kernel name
    plt.title('SVC with {} kernel'.format(kernel))

plt.show()


### Another way of drawing these plots, just for learning:

In [None]:
def plot_decision_boundary(model, X, y, kernel):
    """Plot the decision boundary of a fitted CustomSVC model"""

    h = 0.01
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    # create a meshgrid
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # predict the function value for the whole grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)

    # plot the contour
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolors='black')
    # circle the support vectors of the wrapped SVC
    plt.scatter(
        model.model.support_vectors_[:, 0],
        model.model.support_vectors_[:, 1],
        s=50, facecolors='none', edgecolors='red')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title(kernel)
    plt.show()

for kernel in kernels:

    # create an instance of the model
    model = CustomSVC(kernel=kernel)

    # fit the model
    model.fit(X_train, y_train)

    # plot the decision boundary together with the test points
    plot_decision_boundary(model, X_test, y_test, kernel)


The linear kernel gives us a linear decision boundary, which cannot cleanly separate the two interleaving half-moons. The poly and rbf kernels provide more complex, non-linear decision boundaries that better fit the data.
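
One way to back up the visual impression with numbers is cross-validation. The following sketch regenerates the same noise-free moons data and reports the mean 5-fold accuracy per kernel; the choice of `cv=5` and the names `X_cv`, `y_cv`, and `cv_means` are assumptions for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Same generator settings as the moons dataset above
X_cv, y_cv = make_moons(n_samples=200, noise=0, random_state=2, shuffle=True)

cv_means = {}
for kernel in ['linear', 'poly', 'rbf']:
    # 5-fold cross-validated accuracy with default hyperparameters
    fold_scores = cross_val_score(SVC(kernel=kernel), X_cv, y_cv, cv=5)
    cv_means[kernel] = fold_scores.mean()
    print(f"{kernel}: mean CV accuracy = {cv_means[kernel]:.3f}")
```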

# 2. Model Evaluation

In [None]:
# Load the dataset
data = pd.read_csv('./Data/breast-cancer.csv')

data.head()

In [None]:
data.info()

In [None]:
# Separate features and target variable
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
# Map 'M' and 'B' to 1 and 0 respectively
y = y.map({'M': 1, 'B': 0})

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X = scaler.fit_transform(X)


Positive skewness values indicate a right-skewed distribution: the majority of the data is concentrated on the left side of the distribution, and the tail extends towards the right.
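
Skewness can be computed per column with pandas' `DataFrame.skew()`. The sketch below uses synthetic stand-in columns (not the breast cancer features) to show how a symmetric and a right-skewed distribution differ; the column names and sample size are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data, not the breast cancer features
demo = pd.DataFrame({
    "symmetric": rng.normal(size=1000),        # roughly zero skewness
    "right_skewed": rng.lognormal(size=1000),  # long tail to the right
})

skewness = demo.skew()
print(skewness)
```

The same call on the feature columns of `data` would give one skewness value per feature.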


In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
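
Note that fitting a scaler on the full dataset before splitting lets statistics from the test rows influence the training features. A possible alternative, sketched below, is to wrap the scaler and the classifier in a pipeline so the scaler is fitted on the training split only. The example uses scikit-learn's bundled breast cancer dataset as a stand-in for the CSV file, and the names `X_bc`, `y_bc`, and `pipe` are assumptions for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled dataset, not the CSV used in this notebook
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training split, then reuses it on the test split
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
pipe.fit(X_tr, y_tr)

pipe_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {pipe_accuracy:.3f}")
```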


In [None]:
# Model training
models = [LogisticRegression(), svm.SVC(kernel='linear'), svm.SVC(kernel='rbf')]
model_names = ['Logistic Regression', 'SVM (linear kernel)', 'SVM (RBF kernel)']

for i, model in enumerate(models):
    model.fit(X_train, y_train)

In [None]:
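# Prediction and evaluation for every fitted model
for name, model in zip(model_names, models):
    y_pred = model.predict(X_test)
    # AUC-ROC needs continuous scores rather than hard class labels
    y_score = model.decision_function(X_test)
    print(f"Classification Report for {name}:")
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("AUC-ROC Score:")
    print(roc_auc_score(y_test, y_score))
    print("\n")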

Now it's time to create some Logistic Regression and Support Vector Classifier (SVC) models:

In [None]:
# Model 1: Logistic Regression with default hyperparameters
lr_model_1 = LogisticRegression()
lr_model_1.fit(X_train, y_train)

# Model 2: Logistic Regression with different C value
lr_model_2 = LogisticRegression(C=0.1)
lr_model_2.fit(X_train, y_train)

# Model 3: Logistic Regression with different regularization penalty (L1)
lr_model_3 = LogisticRegression(penalty='l1', solver='liblinear')
lr_model_3.fit(X_train, y_train)

# Model 4: SVM with default hyperparameters
svm_model_1 = SVC(probability=True)
svm_model_1.fit(X_train, y_train)

# Model 5: SVM with different C value and linear kernel
svm_model_2 = SVC(C=0.1, kernel='linear', probability=True)
svm_model_2.fit(X_train, y_train)

# Model 6: SVM with different gamma value and RBF kernel
svm_model_3 = SVC(gamma=0.1, kernel='rbf', probability=True)
svm_model_3.fit(X_train, y_train)


In [None]:
# Create a list of models
models = [lr_model_1, lr_model_2, lr_model_3, svm_model_1, svm_model_2, svm_model_3]


In [None]:
# Evaluate each model
for model in models:
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred) 
    print("Model:", model)
    print("Accuracy:", accuracy)
    print("Classification Report:")
    print(report)
    print("------------------------------")
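
Because the SVC models above are created with `probability=True`, they also expose `predict_proba`, which allows threshold-independent metrics. The sketch below computes the ROC AUC and the area under the precision-recall curve; it uses scikit-learn's bundled breast cancer dataset as a stand-in for the CSV file, and all variable names are assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; in this bundled dataset class 1 is benign, unlike the mapping used above
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.3, random_state=42)

clf = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
clf.fit(X_tr, y_tr)

# Estimated probability of class 1 for each test sample
proba = clf.predict_proba(X_te)[:, 1]

roc_auc = roc_auc_score(y_te, proba)
precision, recall, _ = precision_recall_curve(y_te, proba)
pr_auc = auc(recall, precision)   # area under the precision-recall curve
print(f"ROC AUC = {roc_auc:.3f}, PR AUC = {pr_auc:.3f}")
```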

### Accuracy:
Accuracy is the fraction of all predictions that are correct. A higher accuracy value indicates that the model makes correct predictions more frequently, although accuracy alone can be misleading when the classes are imbalanced.

### Precision:
Precision is the ratio of true positives to the sum of true positives and false positives. In the case of breast cancer diagnosis, precision reflects how many of the cases predicted as malignant (M) are actually malignant. A higher precision value means there are fewer false positives, so a positive prediction can be trusted more.

### Recall:
Recall is the ratio of true positives to the sum of true positives and false negatives. For breast cancer diagnosis, recall measures the model's ability to correctly identify all actual positive cases. A higher recall value signifies fewer false negatives, indicating that the model accurately captures most positive cases.

### F1-score:
The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. A higher F1-score indicates a better balance between the two, signifying that the model captures positive cases while keeping both false positives and false negatives low.
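
The definitions above can be checked by hand from a confusion matrix. The sketch below uses a small set of hypothetical labels (1 = malignant, 0 = benign) made up for illustration; they are not predictions from any model in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only (1 = malignant, 0 = benign)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# The hand-computed values agree with scikit-learn's helpers
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
```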


In the case of breast cancer diagnosis, the focus is often on minimizing false negatives (missing actual positive cases). Therefore, we would prioritize models with higher recall values while maintaining a reasonable level of precision. Among the evaluated models, SVC with default parameters (Model: SVC()) achieved the highest recall for the (M) class while maintaining a high overall accuracy.

This assignment was completed with the help of Fathemeh Rakhshani-Far.