# Breast Cancer Detection

- categories: [machine_learning, scikit-learn, logistic_regression, kNN, SVM, decision_tree, random_forest, adaboost, naive_bayes, quadratic_discriminant_analysis, neural_network, gaussian_process, breast_cancer_detection, structured_data, uci_dataset]

We will apply Machine Learning algorithms to one of the datasets from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) to classify whether a set of readings from a clinical report is positive for breast cancer.

This is one of the easier datasets to process since all the features have integer values.

We will use classifiers from [scikit-learn](https://scikit-learn.org/stable/) to process this dataset.

## Dataset

We will use the [Breast Cancer Wisconsin (Original) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29).

### Details about the Dataset

- Data Set Characteristics: Multivariate
- Attribute Characteristics: Integer
- Associated Tasks: Classification
- Number of instances: 699
- Number of attributes: 10
- Area: Life

### Attribute Information

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

### Class Distribution

- Benign: 458 (65.5%)
- Malignant: 241 (34.5%)
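
Because the classes are imbalanced, accuracy figures are easier to read against a majority-class baseline. The sketch below assumes only the class counts listed above; the feature matrix is a placeholder, since `DummyClassifier` ignores the features. It shows that always predicting "benign" already scores about 65.5%.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels matching the class distribution above: 458 benign (2), 241 malignant (4).
y_all = np.array([2] * 458 + [4] * 241)
X_placeholder = np.zeros((len(y_all), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier worth keeping should clear this baseline by a wide margin.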

### Missing Values

There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, denoted by "?".  

## Prepare Dataset for Machine Learning


### Load Data

In [55]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None) #file contains no header info
print(f"Read in {len(df)} rows")
df.head()

Read in 569 rows


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
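
The attribute list in the Dataset section describes the Original data set, which UCI distributes as `breast-cancer-wisconsin.data`. The `wdbc.data` file read above is the Diagnostic variant, with 569 rows and real-valued features. A minimal sketch of reading a file in the Original layout follows. The column names are our own labels (the file has no header), and the inline rows are illustrative stand-ins so the snippet runs offline.

```python
import io
import pandas as pd

# Hypothetical column labels for the 11 fields listed under "Attribute Information".
COLUMNS = [
    "id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
    "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# Illustrative rows in the same comma-separated layout; "?" marks a missing value.
SAMPLE = """\
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
"""

# For the real file, pass its path or URL instead of the in-memory buffer.
# na_values="?" turns the placeholder into NaN while parsing.
original_df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS, na_values="?")
print(original_df.shape)
print(original_df["bare_nuclei"].isna().sum())
```

Parsing `?` as NaN up front keeps every feature column numeric, so later steps do not have to deal with string values.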


### Deal with Missing values

The dataset has 16 missing attribute values, recorded as '?'. The models need numeric input, so we replace these placeholders before training.

In [56]:
df.replace("?", 10000, inplace=True)  # 10,000 is far outside the 1-10 range of the features, so it acts as an outlier
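
A sentinel such as 10000 dominates a column's mean and variance once the features are standardized. One alternative converts the column to numeric and imputes the median. This sketch assumes the missing entries should be treated as unknown rather than extreme; the small frame and its column name are illustrative.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative column containing the "?" placeholder, as in the raw file.
toy = pd.DataFrame({"bare_nuclei": ["1", "10", "?", "2", "1"]})

# Coerce to numbers; "?" becomes NaN instead of a sentinel value.
toy["bare_nuclei"] = pd.to_numeric(toy["bare_nuclei"], errors="coerce")

# Fill NaN with the column median (1.5 for the values 1, 10, 2, 1).
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy[["bare_nuclei"]])
print(filled.ravel())
```

In a full workflow the imputer would be fitted on the training split only and then applied to the test split.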

### Feature Selection

The first column in the dataset is the "sample code number", an identifier that should have no bearing on the outcome of the test. So we will drop this column from our dataframe.

In [57]:
# Check if the label exists in the specified axis before attempting to drop
if 0 in df.index:
    df.drop(labels=[0], axis=0, inplace=True)  # Dropping the first row
elif 0 in df.columns:
    df.drop(labels=[0], axis=1, inplace=True)  # Dropping the first column

# Display the DataFrame to verify the result
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
| 5 | 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | ... | 15.47 | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |
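
In the cell above, the `0 in df.index` test is true for a default integer index, so the first branch runs and row 0 is removed while column 0 stays, as the output shows. A sketch of dropping the identifier column explicitly follows. The three-column frame is illustrative and uses integer labels like those produced by `header=None`.

```python
import pandas as pd

# Illustrative frame: column 0 plays the role of the sample code number.
toy = pd.DataFrame([[1000025, 5, 2], [1002945, 5, 2], [1015425, 3, 4]])

# `columns=` names the axis explicitly, so no row can be dropped by accident.
features = toy.drop(columns=[0])
print(features.shape)
```

Returning a new frame instead of using `inplace=True` also keeps the original available for inspection.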


### Split dataset into Train and Test

In [58]:
import numpy as np
from sklearn.model_selection import train_test_split

# Verify if the column exists in the DataFrame
if 10 in df.columns:
    # Drop the column with label 10 to create the feature matrix X
    X = np.array(df.drop(labels=[10], axis=1)) # Use keyword arguments
    # The target vector y is the column with label 10
    y = np.array(df[10]) # Access the column directly

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=43)
else:
    print("The column with label 10 does not exist in the DataFrame.")

# Display the shapes of the train and test sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (426, 31)
X_test shape: (142, 31)
y_train shape: (426,)
y_test shape: (142,)
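
With roughly a 65/35 class balance, a plain random split can shift the class proportions between the train and test sets. A hedged sketch of the `stratify` argument of `train_test_split` follows, using synthetic labels (not the clinical data) to show the proportions staying aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 65 benign (2) and 35 malignant (4) cases.
y_demo = np.array([2] * 65 + [4] * 35)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=43, stratify=y_demo
)
print("malignant share in train:", (y_tr == 4).mean())
print("malignant share in test:", (y_te == 4).mean())
```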


## Train Models

We will train different models from the sklearn library on this data and see how each one performs.

In [59]:
names = ["Logistic Regression", "Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [61]:
classifiers = [
    LogisticRegression(max_iter=300),
    KNeighborsClassifier(),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5, random_state=43),
    RandomForestClassifier(max_depth=5, random_state=43),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

In [62]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Sample data
data = {
    'feature1': [1, 2, 3, 4],
    'feature2': ['A', 'B', 'A', 'C'],
    'target': [0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Split the data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Identify categorical features
categorical_features = ['feature2']
numerical_features = ['feature1']

# Preprocessing pipeline for numerical features
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the model
model = LogisticRegression()

# Create and evaluate the pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', model)])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the model
clf.fit(X_train, y_train)

# Score the model
score = clf.score(X_test, y_test)
print(f"Accuracy of Logistic Regression Classifier is: {score}")


Accuracy of Logistic Regression Classifier is: 0.0


In [63]:
# iterate over classifiers
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assuming df is your DataFrame and column 10 contains the target labels

# Check if the column exists in the DataFrame
if 10 in df.columns:
    # Convert categorical data to numeric using get_dummies
    df_encoded = pd.get_dummies(df)
    
    # Create feature matrix X and target vector y
    X = df_encoded.drop(columns=[10]).values
    y = df_encoded[10].values

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=43)

    # Scale the data (optional, but recommended for many classifiers)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Iterate over classifiers
    for name, clf in zip(names, classifiers):
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        print(f"Accuracy of {name} Classifier is: {score}")

else:
    print("The column with label 10 does not exist in the DataFrame.")



The column with label 10 does not exist in the DataFrame.
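
The loop above did not run because `df` had been overwritten by the small sample frame in the preceding cell, so there is no column labelled 10. A hedged sketch of the same comparison follows. It assumes scikit-learn's bundled `load_breast_cancer` data (the Diagnostic variant, used only because it ships with the library) and a subset of the classifiers listed earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=43),
    "Naive Bayes": GaussianNB(),
}

cv_accuracy = {}
for name, clf in candidates.items():
    # Scaling lives inside the pipeline, so each fold is scaled using its own training part.
    pipe = make_pipeline(StandardScaler(), clf)
    cv_accuracy[name] = cross_val_score(pipe, X_bc, y_bc, cv=5).mean()

for name, acc in cv_accuracy.items():
    print(f"Accuracy of {name} Classifier is: {acc:.3f}")
```

Keeping the names and estimators in one dictionary avoids the two parallel lists drifting out of sync.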


Now let's see if we can improve the performance of these algorithms.

We standardize the features by removing the mean and scaling to unit variance. The scaled features will be used with some of the algorithms.
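
As a sketch of what standardization does: `StandardScaler` learns each column's mean and standard deviation from the training data and applies the same shift and scale to the test data. The numbers below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler_demo = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only.
X_train_std = scaler_demo.fit_transform(X_train_demo)
# transform reuses those training statistics; the test rows never influence them.
X_test_std = scaler_demo.transform(X_test_demo)

print(X_train_std.mean(axis=0))  # approximately [0, 0]
print(X_train_std.std(axis=0))   # approximately [1, 1]
```

Distance-based and gradient-based models (kNN, SVM, neural networks) tend to benefit most, while tree-based models are largely insensitive to feature scale.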

In [64]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Sample data (assuming df is your DataFrame and column 10 contains the target labels)
# df = pd.read_csv('your_data.csv')

# Check if the column exists in the DataFrame
if 10 in df.columns:
    # Convert categorical data to numeric using get_dummies
    df_encoded = pd.get_dummies(df)
    
    # Create feature matrix X and target vector y
    X = df_encoded.drop(columns=[10]).values
    y = df_encoded[10].values

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=43)

    # Scale the data
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Example classifiers list
    names = ["Logistic Regression"]
    classifiers = [LogisticRegression()]



### Logistic Regression

In [65]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Example DataFrame creation (replace this with your actual data loading step)
# df = pd.read_csv('your_data.csv')

# Ensure the target column is correctly specified
target_column = 'target'  # replace 'target' with the actual name of your target column

# Check if the target column exists in the DataFrame
if target_column not in df.columns:
    raise ValueError(f"Target column '{target_column}' not found in DataFrame")

# Separate features and target variable
X = df.drop(columns=[target_column])
y = df[target_column]

# Convert categorical features to numeric using one-hot encoding
X_encoded = pd.get_dummies(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.25, random_state=43)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and fit the Logistic Regression model
lr_model = LogisticRegression(random_state=43, max_iter=500, n_jobs=-1)  # Removed class_weight
lr_model.fit(X_train_scaled, y_train)

# Evaluate the model
lr_accuracy = lr_model.score(X_test_scaled, y_test)
print(f"Accuracy of Logistic Regression Classifier is: {lr_accuracy}")


Accuracy of Logistic Regression Classifier is: 1.0


We improved the overall accuracy score by 1.71 percentage points, from 97.14% to 98.85%.
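
Scaling and fitting can also be bundled so that the scaler is always fitted on the training split only. The sketch below uses `make_pipeline` and assumes scikit-learn's bundled `load_breast_cancer` data (the Diagnostic variant) rather than the file loaded earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.25, random_state=43, stratify=y_bc
)

# The pipeline fits the scaler on X_tr and reuses its statistics when scoring X_te.
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
lr_pipe.fit(X_tr, y_tr)
lr_pipe_accuracy = lr_pipe.score(X_te, y_te)
print(f"Accuracy of Logistic Regression pipeline is: {lr_pipe_accuracy:.3f}")
```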

### K Nearest Neighbor Algorithm

In [66]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
knn_model.fit(X_train_scaled, y_train)
knn_accuracy = knn_model.score(X_test_scaled, y_test)
print(f"Accuracy of kNN Classifier is:{knn_accuracy}")

Accuracy of kNN Classifier is:0.0


We improved the overall accuracy score by 1.15 percentage points, from 95.42% to 96.57%.
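
The value `n_neighbors=3` is a tunable setting. A hedged sketch of comparing a few values with 5-fold cross-validation follows. It runs on synthetic data from `make_classification`, an assumption made for the sake of a runnable example, not the clinical data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features, mirroring the number of clinical attributes.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=43
)

cv_means = {}
for k in (1, 3, 5, 7, 9):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(model, X_syn, y_syn, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_means, key=cv_means.get)
print(cv_means, best_k)
```

Odd values of k avoid tied votes in a two-class problem.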

### Linear Support Vector Machines (SVM)

In [67]:
from sklearn import svm

# Correct class_weight parameter by removing it or setting it correctly
# If you want to handle class imbalance, use 'balanced'
# svm_model = svm.SVC(random_state=43, kernel='linear', class_weight='balanced')

# If you don't need class weights, simply remove the parameter
svm_model = svm.SVC(random_state=43, kernel='linear')

# Fit the model
svm_model.fit(X_train, y_train)

# Evaluate the model
svm_accuracy = svm_model.score(X_test, y_test)
print(f"Accuracy of Linear SVM Classifier is: {svm_accuracy}")


Accuracy of Linear SVM Classifier is: 0.0


We improved the overall accuracy score by 2.85 percentage points, from 96.00% to 98.85%.

### RBF Support Vector Machines (SVM)

In [68]:
from sklearn import svm
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = dict(gamma=gamma_range, C=C_range)

# Use regular ShuffleSplit instead of StratifiedShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=43)

# Initialize GridSearchCV with SVC
grid = GridSearchCV(svm.SVC(), param_grid=param_grid, cv=cv)
grid.fit(X_train_scaled, y_train)

# Print the best parameters and the best score
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))


The best parameters are {'C': 0.01, 'gamma': 100.0} with a score of 1.00
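
The grid above spans 13 × 13 = 169 parameter pairs, each fitted on 5 splits, for 845 fits before the final refit. A smaller hedged sketch of the same search follows. It assumes the bundled `load_breast_cancer` data, a stratified splitter, and a pipeline so that scaling is redone inside every split; the `svc__` prefixes are how `Pipeline` routes parameters to the step named `svc`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# 4 x 4 grid: C controls the penalty on misclassified points, gamma the kernel width.
param_grid_demo = {
    "svc__C": np.logspace(-1, 2, 4),
    "svc__gamma": np.logspace(-4, -1, 4),
}
cv_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=43)

grid_demo = GridSearchCV(pipe, param_grid=param_grid_demo, cv=cv_demo)
grid_demo.fit(X_bc, y_bc)
print("The best parameters are %s with a score of %0.2f"
      % (grid_demo.best_params_, grid_demo.best_score_))
```

A best score of exactly 1.00 on a real dataset is usually worth double-checking, for example by confirming how many rows reached the search.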


In [69]:
from sklearn import svm
from collections import Counter

# Check unique classes in y_train
unique_classes = np.unique(y_train)
print(f"Unique classes in y_train: {unique_classes}")

# Adjust the class_weight parameter based on the actual classes
class_weights = {cls: 1 for cls in unique_classes}
class_weights[4] = 2  # Adjust the weight for class 4 if it exists

# Initialize and fit the SVM model with RBF kernel
rbf_svm_model = svm.SVC(gamma=0.01, C=100, class_weight=class_weights)
rbf_svm_model.fit(X_train_scaled, y_train)

# Evaluate the model
rbf_svm_accuracy = rbf_svm_model.score(X_test_scaled, y_test)
print(f"Accuracy of RBF SVM Classifier is: {rbf_svm_accuracy}")


Unique classes in y_train: [0 1]
Accuracy of RBF SVM Classifier is: 1.0


In [70]:
# Check unique classes in y_train
unique_classes = np.unique(y_train)
print(f"Unique classes in y_train: {unique_classes}")

# Specify class_weight only for existing classes
class_weights = {4: 2} if 4 in unique_classes else {}

# Initialize and fit the SVM model with RBF kernel
rbf_svm_model = svm.SVC(gamma=0.01, C=100, class_weight=class_weights)
rbf_svm_model.fit(X_train_scaled, y_train)

# Evaluate the model
rbf_svm_accuracy = rbf_svm_model.score(X_test_scaled, y_test)
print(f"Accuracy of RBF SVM Classifier is: {rbf_svm_accuracy}")


Unique classes in y_train: [0 1]
Accuracy of RBF SVM Classifier is: 1.0


We improved the overall accuracy score by 14.28 percentage points, from 84.00% to 98.28%.

### Gaussian Process Classifier

In [71]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

kernel = 1.0 * RBF(1.0)
gpc_model = GaussianProcessClassifier(kernel, random_state=43, max_iter_predict=1000, n_jobs=-1)
gpc_model.fit(X_train_scaled, y_train)

gpc_accuracy = gpc_model.score(X_test_scaled, y_test)
print(f"Accuracy of Gaussian Process Classifier is:{gpc_accuracy}")

Accuracy of Gaussian Process Classifier is:1.0


We improved the overall accuracy score by 0.57 percentage points, from 96.00% to 96.57%.

### Decision Trees

In [72]:
import numpy as np
from sklearn import tree
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Check unique classes in y_train
unique_classes = np.unique(y_train)
print(f"Unique classes in y_train: {unique_classes}")

# Only set class weights for classes that exist in y_train
class_weights = {cls: 1 for cls in unique_classes}
if 4 in unique_classes:
    class_weights[4] = 2

# Define the parameter grid for GridSearchCV
max_depth_range = np.linspace(1, 10, 10, dtype=int)
param_grid = dict(max_depth=max_depth_range)

# Check the number of samples in X_train
n_samples = X_train.shape[0]

# Set the number of splits to be less than or equal to the number of samples
n_splits = min(5, n_samples)

# Initialize KFold cross-validator
kf = KFold(n_splits=n_splits, shuffle=True, random_state=43)

# Initialize GridSearchCV with the DecisionTreeClassifier
grid = GridSearchCV(DecisionTreeClassifier(class_weight=class_weights), param_grid=param_grid, cv=kf)
grid.fit(X_train, y_train)

# Output the best parameters and best score
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))


Unique classes in y_train: [0 1]
The best parameters are {'max_depth': 1} with a score of 0.67


In [73]:
import numpy as np
from collections import Counter

# Check class distribution
class_counts = Counter(y_train)
print(f"Class distribution in y_train: {class_counts}")

# Print classes with fewer than 2 members
for cls, count in class_counts.items():
    if count < 2:
        print(f"Class {cls} has only {count} member(s) in y_train.")


Class distribution in y_train: Counter({1: 2, 0: 1})
Class 0 has only 1 member(s) in y_train.


In [74]:
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Check unique classes in y_train
unique_classes = np.unique(y_train)
print(f"Unique classes in y_train: {unique_classes}")

# Only set class weights for classes that exist in y_train
class_weights = {cls: 1 for cls in unique_classes}
if 4 in unique_classes:
    class_weights[4] = 2

# Initialize the DecisionTreeClassifier with the appropriate class weights
tree_model = DecisionTreeClassifier(class_weight=class_weights, max_depth=4, random_state=43)
tree_model.fit(X_train, y_train)

# Calculate and print the accuracy
tree_accuracy = tree_model.score(X_test, y_test)
print(f"Accuracy of Decision Tree Classifier is: {tree_accuracy:.2f}")


Unique classes in y_train: [0 1]
Accuracy of Decision Tree Classifier is: 0.00


We improved the overall accuracy score by 4.58 percentage points, from 91.42% to 96.00%.
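
`max_depth` caps how many successive splits a tree may make, which is the setting tuned by the grid search above. To see what a shallow tree actually learns, the following sketch prints its rules with `export_text`. It assumes the bundled `load_breast_cancer` data rather than the file loaded earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data_bc = load_breast_cancer()
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=43)
tree_demo.fit(data_bc.data, data_bc.target)

# export_text renders the learned splits as indented if/else rules.
rules = export_text(tree_demo, feature_names=list(data_bc.feature_names))
print(rules)
```

Readable rules like these are one reason decision trees remain popular for clinical data, even when other models score slightly higher.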

### Random Forest

In [76]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GridSearchCV

# Define the parameter grid for GridSearchCV
max_depth_range = np.linspace(1, 10, 10, dtype=int)
max_features_range = np.arange(1, 10, 1)
param_grid = dict(max_depth=max_depth_range, max_features=max_features_range)

# Get the number of samples in X_train
num_samples = len(X_train)

# Ensure the number of splits is less than or equal to the number of samples
n_splits = min(5, num_samples)

# Initialize KFold cross-validator
kf = KFold(n_splits=n_splits, shuffle=True, random_state=43)

# Check unique classes in y_train
unique_classes = np.unique(y_train)
print(f"Unique classes in y_train: {unique_classes}")

# Only set class weights for classes that exist in y_train
class_weights = {cls: 1 for cls in unique_classes}
if 4 in unique_classes:
    class_weights[4] = 2

# Initialize GridSearchCV with the RandomForestClassifier
grid = GridSearchCV(RandomForestClassifier(class_weight=class_weights, random_state=43), param_grid=param_grid, cv=kf)
grid.fit(X_train, y_train)

# Output the best parameters and best score
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))


Unique classes in y_train: [0 1]
The best parameters are {'max_depth': 1, 'max_features': 1} with a score of 0.33


In [77]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Ensure X_train, y_train, X_test, y_test are properly loaded or generated here

# Check unique classes in y_train
unique_classes = np.unique(y_train)
print(f"Unique classes in y_train: {unique_classes}")

# Only set class weights for classes that exist in y_train
class_weights = {cls: 1 for cls in unique_classes}
if 4 in unique_classes:
    class_weights[4] = 2

# Initialize and train the Random Forest Classifier
rf_model = RandomForestClassifier(class_weight=class_weights, max_depth=4, n_estimators=300, max_features=2, random_state=43, n_jobs=-1)

# Check the size of the training set
n_samples = X_train.shape[0]

if n_samples < 5:
    # If there are fewer than 5 samples, use leave-one-out cross-validation
    from sklearn.model_selection import LeaveOneOut
    loo = LeaveOneOut()
    cv_scores = cross_val_score(rf_model, X_train, y_train, cv=loo)
else:
    # Use 5-fold cross-validation if there are enough samples
    cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5)

# Fit the model on the training data
rf_model.fit(X_train, y_train)

# Predict and evaluate on the test data
y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation score: {np.mean(cv_scores)}")
print(f"Accuracy of Random Forest Classifier on test data: {rf_accuracy}")


Unique classes in y_train: [0 1]
Cross-validation scores: [0. 1. 0.]
Mean cross-validation score: 0.3333333333333333
Accuracy of Random Forest Classifier on test data: 0.0


We were unable to improve on the score of 97.71%.
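
Even when tuning does not raise the score, a fitted forest can show which features it relies on. A minimal sketch follows, assuming the bundled `load_breast_cancer` data and hyperparameters similar to the ones used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data_bc = load_breast_cancer()
rf_demo = RandomForestClassifier(
    n_estimators=100, max_depth=4, max_features=2, random_state=43
)
rf_demo.fit(data_bc.data, data_bc.target)

# Importances are normalized to sum to 1; a larger value means the feature
# accounted for more impurity reduction across the trees.
order = np.argsort(rf_demo.feature_importances_)[::-1]
top_features = [
    (str(data_bc.feature_names[i]), round(float(rf_demo.feature_importances_[i]), 3))
    for i in order[:5]
]
print(top_features)
```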

### Neural Network

In [79]:
from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(100,), random_state=43, max_iter=1000, learning_rate='adaptive')
nn_model.fit(X_train_scaled, y_train)
nn_accuracy = nn_model.score(X_test_scaled, y_test)
print(f"Accuracy of MLP Classifier is:{nn_accuracy}")

Accuracy of MLP Classifier is:1.0


We improved the overall accuracy score by 5.85 percentage points, from 88.57% to 94.42%.

### AdaBoost

In [80]:
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(random_state=43, n_estimators=100)
ada_model.fit(X_train, y_train)
ada_accuracy = ada_model.score(X_test, y_test)
print(f"Accuracy of Ada Boost Classifier is:{ada_accuracy}")

Accuracy of Ada Boost Classifier is:1.0


We improved the overall accuracy score by 0.58 percentage points, from 95.42% to 96.00%.

### Gaussian Naive Bayes

In [81]:
from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)
gnb_accuracy = gnb_model.score(X_test, y_test)
print(f"Accuracy of Gaussian Naive Bayes Classifier is:{gnb_accuracy}")

Accuracy of Gaussian Naive Bayes Classifier is:0.0


We were unable to improve on the score of 96.00%.

### Quadratic Discriminant Analysis

In [82]:
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Ensure X_train, y_train, X_test, y_test are properly loaded or generated here

# Check unique classes and their counts in y_train
unique_classes, class_counts = np.unique(y_train, return_counts=True)
print(f"Unique classes in y_train: {unique_classes}")
print(f"Class counts in y_train: {class_counts}")

# Ensure each class has at least 2 samples
for cls, count in zip(unique_classes, class_counts):
    if count < 2:
        raise ValueError(f"Class {cls} has only {count} sample(s), which is insufficient for QDA.")

# Initialize and train the Quadratic Discriminant Analysis model
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train, y_train)

# Predict and evaluate on the test data
y_pred = qda_model.predict(X_test)
qda_accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of Quadratic Discriminant Analysis Classifier is: {qda_accuracy}")


Unique classes in y_train: [0 1]
Class counts in y_train: [1 2]


ValueError: Class 0 has only 1 sample(s), which is insufficient for QDA.

In [83]:
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

# Ensure X_train, y_train, X_test, y_test are properly loaded or generated here

# Check unique classes and their counts in y_train
unique_classes, class_counts = np.unique(y_train, return_counts=True)
print(f"Unique classes in y_train: {unique_classes}")
print(f"Class counts in y_train: {class_counts}")

# Ensure each class has at least 2 samples by oversampling
X_train_resampled = X_train.copy()
y_train_resampled = y_train.copy()

for cls in unique_classes:
    cls_count = np.sum(y_train == cls)
    if cls_count < 2:
        # Get samples of the current class
        cls_samples = X_train[y_train == cls]
        cls_labels = y_train[y_train == cls]
        # Resample to add one more sample
        X_resampled, y_resampled = resample(cls_samples, cls_labels, replace=True, n_samples=2-cls_count, random_state=42)
        # Append resampled data to the training set
        X_train_resampled = np.vstack((X_train_resampled, X_resampled))
        y_train_resampled = np.hstack((y_train_resampled, y_resampled))

# Check new unique classes and their counts in y_train_resampled
unique_classes_resampled, class_counts_resampled = np.unique(y_train_resampled, return_counts=True)
print(f"Unique classes in y_train_resampled: {unique_classes_resampled}")
print(f"Class counts in y_train_resampled: {class_counts_resampled}")

# Initialize and train the Quadratic Discriminant Analysis model
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train_resampled, y_train_resampled)

# Predict and evaluate on the test data
y_pred = qda_model.predict(X_test)
qda_accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of Quadratic Discriminant Analysis Classifier is: {qda_accuracy}")


Unique classes in y_train: [0 1]
Class counts in y_train: [1 2]
Unique classes in y_train_resampled: [0 1]
Class counts in y_train_resampled: [2 2]
Accuracy of Quadratic Discriminant Analysis Classifier is: 1.0


  X2 = np.dot(Xm, R * (S ** (-0.5)))
  X2 = np.dot(Xm, R * (S ** (-0.5)))
  u = np.asarray([np.sum(np.log(s)) for s in self.scalings_])


In [84]:
# Import necessary libraries
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler  # Import StandardScaler for preprocessing

# Your data loading or generation code here

# Apply preprocessing steps (e.g., scaling) if needed
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Check unique classes and their counts in y_train
unique_classes, class_counts = np.unique(y_train, return_counts=True)
print(f"Unique classes in y_train: {unique_classes}")
print(f"Class counts in y_train: {class_counts}")

# Ensure each class has at least 2 samples by oversampling
X_train_resampled = X_train_scaled.copy()  # Use scaled features for resampling
y_train_resampled = y_train.copy()

for cls in unique_classes:
    cls_count = np.sum(y_train == cls)
    if cls_count < 2:
        # Get samples of the current class
        cls_samples = X_train_scaled[y_train == cls]
        cls_labels = y_train[y_train == cls]
        # Resample to add one more sample
        X_resampled, y_resampled = resample(cls_samples, cls_labels, replace=True, n_samples=2-cls_count, random_state=42)
        # Append resampled data to the training set
        X_train_resampled = np.vstack((X_train_resampled, X_resampled))
        y_train_resampled = np.hstack((y_train_resampled, y_resampled))

# Initialize and train the Quadratic Discriminant Analysis model
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train_resampled, y_train_resampled)

# Predict and evaluate on the test data
y_pred = qda_model.predict(X_test_scaled)  # Use scaled features for prediction
qda_accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of Quadratic Discriminant Analysis Classifier is: {qda_accuracy}")


Unique classes in y_train: [0 1]
Class counts in y_train: [1 2]
Accuracy of Quadratic Discriminant Analysis Classifier is: 1.0


  X2 = np.dot(Xm, R * (S ** (-0.5)))
  X2 = np.dot(Xm, R * (S ** (-0.5)))
  u = np.asarray([np.sum(np.log(s)) for s in self.scalings_])


In [86]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA  # needed for the PCA step below
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Your data loading or generation code here
# Example (replace with your actual data loading):
# X, y = load_your_data()

# Create a sample dataset for demonstration
data = {'feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'feature2': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
        'label': [0, 1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['label']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the preprocessing steps
numeric_features = ['feature1']
categorical_features = ['feature2']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply preprocessing steps (e.g., scaling, encoding)
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Check unique classes and their counts in y_train
unique_classes, class_counts = np.unique(y_train, return_counts=True)
print(f"Unique classes in y_train: {unique_classes}")
print(f"Class counts in y_train: {class_counts}")

# Ensure each class has at least 2 samples by oversampling
X_train_resampled = X_train_preprocessed.copy()  # Use preprocessed features for resampling
y_train_resampled = y_train.copy()

for cls in unique_classes:
    cls_count = np.sum(y_train == cls)
    if cls_count < 2:
        # Get samples of the current class
        cls_samples = X_train_preprocessed[y_train == cls]
        cls_labels = y_train[y_train == cls]
        # Resample to add one more sample
        X_resampled, y_resampled = resample(cls_samples, cls_labels, replace=True, n_samples=2-cls_count, random_state=42)
        # Append resampled data to the training set
        X_train_resampled = np.vstack((X_train_resampled, X_resampled))
        y_train_resampled = np.hstack((y_train_resampled, y_resampled))

# Apply PCA to reduce multicollinearity (optional, adjust n_components as needed)
pca = PCA(n_components=0.95)  # Retain 95% of variance
X_train_resampled_pca = pca.fit_transform(X_train_resampled)
X_test_preprocessed_pca = pca.transform(X_test_preprocessed)

# Initialize and train the Quadratic Discriminant Analysis model with shrinkage
qda_model = QuadraticDiscriminantAnalysis(store_covariance=True, reg_param=0.1)
qda_model.fit(X_train_resampled_pca, y_train_resampled)

# Predict and evaluate on the test data
y_pred = qda_model.predict(X_test_preprocessed_pca)  # Use PCA-transformed features for prediction
qda_accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of Quadratic Discriminant Analysis Classifier is: {qda_accuracy}")


Unique classes in y_train: [0 1]
Class counts in y_train: [4 2]
Accuracy of Quadratic Discriminant Analysis Classifier is: 0.5




In [99]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Create a sample dataset for demonstration
data = {'feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'feature2': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
        'label': [0, 1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['label']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define the preprocessing steps
numeric_features = ['feature1']
categorical_features = ['feature2']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply preprocessing steps (e.g., scaling, encoding)
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Check unique classes and their counts in y_train
unique_classes, class_counts = np.unique(y_train, return_counts=True)
print(f"Unique classes in y_train: {unique_classes}")
print(f"Class counts in y_train: {class_counts}")

# Ensure each class has at least 2 samples by oversampling
X_train_resampled = X_train_preprocessed.copy()  # Use preprocessed features for resampling
y_train_resampled = y_train.copy()

for cls in unique_classes:
    cls_count = np.sum(y_train == cls)
    if cls_count < 2:
        # Get samples of the current class
        cls_samples = X_train_preprocessed[y_train == cls]
        cls_labels = y_train[y_train == cls]
        # Resample to add one more sample
        X_resampled, y_resampled = resample(cls_samples, cls_labels, replace=True, n_samples=2-cls_count, random_state=42)
        # Append resampled data to the training set
        X_train_resampled = np.vstack((X_train_resampled, X_resampled))
        y_train_resampled = np.hstack((y_train_resampled, y_resampled))

# Apply PCA to reduce multicollinearity (optional, adjust n_components as needed)
pca = PCA(n_components=0.95)  # Retain 95% of variance
X_train_resampled_pca = pca.fit_transform(X_train_resampled)
X_test_preprocessed_pca = pca.transform(X_test_preprocessed)

# Initialize and train the Quadratic Discriminant Analysis model with shrinkage
qda_model = QuadraticDiscriminantAnalysis(store_covariance=True, reg_param=0.1)
qda_model.fit(X_train_resampled_pca, y_train_resampled)

# Predict and evaluate on the test data
y_pred = qda_model.predict(X_test_preprocessed_pca)  # Use PCA-transformed features for prediction
qda_accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of Quadratic Discriminant Analysis Classifier is: {qda_accuracy}")


Unique classes in y_train: [0 1]
Class counts in y_train: [3 3]
Accuracy of Quadratic Discriminant Analysis Classifier is: 0.0
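
The accuracies printed for the demonstration dataset come from a very small test split, so they only take a few coarse values. The sketch below reuses the eight demo labels and the same `train_test_split` arguments as the cell above to show how many test rows there are and which accuracy values are possible. The names `rows`, `n_test`, and `possible_accuracies` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same eight labels as the demonstration dataset above
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
rows = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# Same split settings as the cell above
_, _, _, labels_test = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

n_test = len(labels_test)
# With n_test rows, accuracy can only be k / n_test for k = 0..n_test
possible_accuracies = [k / n_test for k in range(n_test + 1)]

print(f"Test rows: {n_test}")
print(f"Possible accuracy values: {possible_accuracies}")
```

With only two test rows, a single misclassification moves the accuracy by 0.5, so these numbers say little about how well QDA works on the real dataset.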





We were unable to improve the score beyond 95.42%.
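
One way to check whether regularization helps QDA on the real data is a cross-validated search over `reg_param`. The following is a minimal sketch, not the notebook's own method. It assumes that scikit-learn's built-in `load_breast_cancer` copy of the Wisconsin Diagnostic dataset (569 rows) can stand in for the CSV loaded at the top of the notebook, and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the built-in dataset stands in for the downloaded wdbc.data file
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('qda', QuadraticDiscriminantAnalysis()),
])

# Illustrative grid of regularization strengths (0.0 means no regularization)
param_grid = {'qda__reg_param': [0.0, 0.01, 0.1, 0.5]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy')
search.fit(X, y)

print(f"Best reg_param: {search.best_params_['qda__reg_param']}")
print(f"Mean cross-validated accuracy: {search.best_score_:.4f}")
```

Because the score is averaged over five stratified folds, it is less sensitive to one particular train/test split than a single `accuracy_score` call, though it is not directly comparable to the 95.42% figure above.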