# Project Title: "Breast Cancer Classification Using Machine Learning Algorithms: Decision Tree, Random Forest, SVM, Kernel SVM, and Naive Bayes"

## Problem Statement:

Breast cancer is one of the most common types of cancer among women worldwide. Early detection and accurate diagnosis are crucial for effective treatment and improving survival rates. This project aims to develop and compare the performance of various machine learning algorithms to classify breast cancer tumors as either malignant or benign. The algorithms include Decision Tree, Random Forest, Support Vector Machine (SVM), Kernel SVM, and Naive Bayes.

The goal is to identify the most effective model for accurate classification, providing insights into the strengths and weaknesses of each algorithm. This will assist medical practitioners in making informed decisions based on the predictions of the models.

## Dataset Overview:
The Breast Cancer Wisconsin (Diagnostic) Dataset will be used for this project. This dataset is well-known and widely used for binary classification tasks. It includes 569 instances of malignant and benign tumors with 30 feature columns that describe various characteristics of the cell nuclei present in the digitized image.

## *Key Features:*
1. ID: Unique identifier for each patient.

2. Diagnosis: Binary classification target, where 'M' denotes malignant and 'B' denotes benign.

3. Features: 30 numeric columns that describe characteristics of the cell nuclei, such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

## *Example Features:*
1. Radius_mean: Mean of distances from the center to points on the perimeter.

2. Texture_mean: Standard deviation of gray-scale values.

3. Perimeter_mean: Mean size of the core tumor.

4. Area_mean: Mean area of the tumor.

5. Smoothness_mean: Mean of local variation in radius lengths.

Dataset source: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
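The 30 feature columns follow a regular naming scheme, which the short sketch below reconstructs. It assumes the Kaggle CSV's convention of three summary statistics per characteristic (`mean`, `se` for standard error, and `worst`), with all `_mean` columns first; the names `CHARACTERISTICS` and `STATISTICS` are illustrative and do not appear in the notebook.

```python
# The ten nucleus characteristics listed in the dataset overview.
CHARACTERISTICS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Assumption: three summary statistics per characteristic, named as in the Kaggle CSV.
STATISTICS = ["mean", "se", "worst"]

# All "_mean" columns come first, then "_se", then "_worst".
feature_names = [f"{c}_{s}" for s in STATISTICS for c in CHARACTERISTICS]

print(len(feature_names))   # 10 characteristics x 3 statistics
print(feature_names[:3])
print(feature_names[-1])
```

A list like this can be compared against the loaded DataFrame's columns to confirm that no unexpected column ends up in the feature matrix.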

In [14]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

In [15]:
data=pd.read_csv('breast-cancer.csv')
data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [16]:
# Encode the target: malignant ('M') -> 1, benign ('B') -> 0
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

In [17]:
# Separate the features from the target; 'id' is only an identifier, not a predictor
X = data.drop(columns=['id', 'diagnosis'])
y = data['diagnosis']

In [18]:
# Hold out 25% of the rows as a test set; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)

In [19]:
# Fit the scaler on the training set only, then apply the same transform to the test set
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
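To make the scaling step concrete, here is a minimal NumPy sketch of what `fit_transform` on the training set followed by `transform` on the test set amounts to. The toy arrays are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Toy data (made up): 4 training rows and 2 test rows with 2 features each.
train = np.array([[10.0, 200.0], [12.0, 220.0], [14.0, 240.0], [16.0, 260.0]])
test = np.array([[13.0, 230.0], [20.0, 300.0]])

# "fit": learn the per-column mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
print(test_scaled)                # test rows expressed in training-set units
```

Reusing the training statistics for the test set keeps information about the held-out rows from leaking into preprocessing.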

In [20]:
from sklearn.tree import DecisionTreeClassifier
tree_model=DecisionTreeClassifier(random_state=101)
tree_model.fit(X_train_scaled,y_train)

In [21]:
tree_pred=tree_model.predict(X_test_scaled)
print('Decision Tree')
print(classification_report(y_test,tree_pred))
print('Confusion Matrix')
print(confusion_matrix(y_test,tree_pred))
print('Accuracy:',accuracy_score(y_test,tree_pred))

Decision Tree
              precision    recall  f1-score   support

           0       0.95      0.91      0.93        88
           1       0.86      0.93      0.89        55

    accuracy                           0.92       143
   macro avg       0.91      0.92      0.91       143
weighted avg       0.92      0.92      0.92       143

Confusion Matrix
[[80  8]
 [ 4 51]]
Accuracy: 0.916083916083916
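The figures in the classification report can be derived by hand from the confusion matrix. The sketch below does this for the malignant class (label 1), assuming scikit-learn's layout in which rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the Decision Tree output above.
# Rows are true labels, columns are predicted labels; 0 = benign, 1 = malignant.
cm = [[80, 8],
      [4, 51]]

tn, fp = cm[0]  # true benign: correctly classified, wrongly flagged as malignant
fn, tp = cm[1]  # true malignant: missed, correctly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the tumors predicted malignant, the share that truly are
recall = tp / (tp + fn)     # of the truly malignant tumors, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The same arithmetic applies to the confusion matrices of the other models.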


In [22]:
from sklearn.ensemble import RandomForestClassifier
forest_model=RandomForestClassifier(n_estimators=150,random_state=101)
forest_model.fit(X_train_scaled,y_train)


In [23]:
forest_pred=forest_model.predict(X_test_scaled)
print('Random Forest')
print(classification_report(y_test,forest_pred))
print('Confusion Matrix')
print(confusion_matrix(y_test,forest_pred))
print('Accuracy:',accuracy_score(y_test,forest_pred))


Random Forest
              precision    recall  f1-score   support

           0       0.99      0.97      0.98        88
           1       0.95      0.98      0.96        55

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Confusion Matrix
[[85  3]
 [ 1 54]]
Accuracy: 0.972027972027972


In [24]:
from sklearn.svm import SVC
kernel_svm_model=SVC(kernel='rbf',random_state=101)
kernel_svm_model.fit(X_train_scaled,y_train)
kernel_svm_pred=kernel_svm_model.predict(X_test_scaled)
print('Kernel SVM')
print(classification_report(y_test,kernel_svm_pred))
print('Confusion Matrix')
print(confusion_matrix(y_test,kernel_svm_pred))
print('Accuracy:',accuracy_score(y_test,kernel_svm_pred))

Kernel SVM
              precision    recall  f1-score   support

           0       0.99      0.99      0.99        88
           1       0.98      0.98      0.98        55

    accuracy                           0.99       143
   macro avg       0.99      0.99      0.99       143
weighted avg       0.99      0.99      0.99       143

Confusion Matrix
[[87  1]
 [ 1 54]]
Accuracy: 0.986013986013986


In [25]:
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train_scaled, y_train)
nb_pred = nb_model.predict(X_test_scaled)

print('Naive Bayes')
print(classification_report(y_test, nb_pred))
print('Confusion Matrix')
print(confusion_matrix(y_test, nb_pred))
print('Accuracy:', accuracy_score(y_test, nb_pred))

Naive Bayes
              precision    recall  f1-score   support

           0       0.95      0.93      0.94        88
           1       0.89      0.93      0.91        55

    accuracy                           0.93       143
   macro avg       0.92      0.93      0.93       143
weighted avg       0.93      0.93      0.93       143

Confusion Matrix
[[82  6]
 [ 4 51]]
Accuracy: 0.9300699300699301
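With all four models evaluated, their confusion matrices can be placed side by side. The sketch below copies the matrices printed above and ranks the models by accuracy, also listing missed malignant cases (false negatives), which are arguably the costlier error in a diagnostic setting. The `results` and `summary` names are illustrative.

```python
# Confusion matrices copied from the outputs above, laid out as [[TN, FP], [FN, TP]].
results = {
    "Decision Tree": [[80, 8], [4, 51]],
    "Random Forest": [[85, 3], [1, 54]],
    "Kernel SVM": [[87, 1], [1, 54]],
    "Naive Bayes": [[82, 6], [4, 51]],
}

summary = []
for name, ((tn, fp), (fn, tp)) in results.items():
    total = tn + fp + fn + tp
    summary.append((name, (tn + tp) / total, fn, fp))

# Sort by accuracy, best first.
summary.sort(key=lambda row: row[1], reverse=True)

print(f"{'model':<15}{'accuracy':>10}{'missed malignant':>18}{'false alarms':>14}")
for name, acc, fn, fp in summary:
    print(f"{name:<15}{acc:>10.3f}{fn:>18}{fp:>14}")
```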


## Result

In [26]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

# Initialize the SVM model
svm_model = SVC(random_state=101)

# Perform grid search
grid_search = GridSearchCV(svm_model, param_grid, refit=True, verbose=2)
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and estimator
print("Best Parameters:", grid_search.best_params_)
best_svm_model = grid_search.best_estimator_

# Make predictions on the test data
best_svm_predictions = best_svm_model.predict(X_test_scaled)

# Evaluate the model
print("Best SVM Accuracy:", accuracy_score(y_test, best_svm_predictions))
print("Confusion Matrix:\n", confusion_matrix(y_test, best_svm_predictions))
print("Classification Report:\n", classification_report(y_test, best_svm_predictions))

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01
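The "16 candidates, totalling 80 fits" line in the log follows directly from the parameter grid. As a rough sketch, the combinations can be enumerated with `itertools.product`; the 5 folds are assumed to come from `GridSearchCV`'s default `cv` setting, since the notebook does not pass `cv` explicitly.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["rbf"],
}
n_folds = 5  # assumption: the default number of folds, as reported in the log

# Every combination of one value per parameter is one candidate model.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # total number of model fits
print(candidates[:2])             # first two combinations, in the order the log shows
```

Each candidate is trained once per fold, so adding values to `C` or `gamma` multiplies the total training cost.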

**Conclusion:**

Support Vector Machines (SVMs) often achieve high accuracy, for several reasons:

1. Handles Complex Data: SVMs work well with high-dimensional data (many features) and can separate classes effectively.

2. Reduces Overfitting: By choosing the boundary (hyperplane) with the maximum margin, SVMs reduce the risk of overfitting.

3. Kernel Trick: This allows SVMs to handle non-linear relationships by implicitly mapping the data into a higher-dimensional space.

4. Maximizes Margin: Keeps the boundary as far as possible from the nearest training points, which improves generalization to new data.

5. Regularization: The C parameter controls the balance between fitting the training data and keeping the model simple.

Based on the analysis of the breast cancer dataset, I found that the kernel Support Vector Machine gives the best test accuracy of the models evaluated: 98.6% (an error of about 1.4%, or 2 of 143 test samples misclassified), ahead of Random Forest (97.2%), Naive Bayes (93.0%), and Decision Tree (91.6%).
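To illustrate the kernel trick mentioned above, here is a small sketch of the RBF kernel that `SVC(kernel='rbf')` uses, K(a, b) = exp(-gamma * ||a - b||^2). The helper `rbf_kernel` and the toy points are illustrative only; they show how `gamma` controls how quickly similarity decays with distance.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Return exp(-gamma * ||a - b||^2) for two feature vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Toy points (made up): one close to x, one far away.
x = [0.0, 0.0]
near = [0.1, 0.0]
far = [3.0, 4.0]

# The gamma values tried in the grid search.
for gamma in (1, 0.1, 0.01, 0.001):
    print(gamma, rbf_kernel(x, near, gamma), rbf_kernel(x, far, gamma))
```

A large `gamma` makes only very close points count as similar, which yields a more flexible boundary; a small `gamma` gives a smoother one. This is why `gamma` is tuned together with `C`.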

In [27]:
def predict_diagnosis(user_input):
    """
    Predicts breast cancer diagnosis based on user input.

    Args:
        user_input (list): A list of 30 numerical features in the same order as the training data.

    Returns:
        str: 'Malignant' or 'Benign' prediction.
    """
    # Convert input to numpy array and reshape for scaling
    user_input_array = np.asarray(user_input).reshape(1, -1)

    # Scale the user input using the same scaler fitted on the training data
    user_input_scaled = sc.transform(user_input_array)

    # Predict using the best performing model (best_svm_model from GridSearchCV)
    prediction = best_svm_model.predict(user_input_scaled)

    if prediction[0] == 1:
        return 'Malignant'
    else:
        return 'Benign'

# Example usage: the 30 feature values of the first row of the dataset (diagnosis 'M').
# Replace them with your own data, keeping the same column order as the training data.
example_input = [
    17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
    1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
    25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189
]

predicted_diagnosis = predict_diagnosis(example_input)
print(f"The predicted diagnosis is: {predicted_diagnosis}")

The predicted diagnosis is: Malignant
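`predict_diagnosis` relies on two separate global objects, the scaler `sc` and `best_svm_model`. One possible alternative, sketched below, bundles both steps into a scikit-learn `Pipeline`, so that a single object handles scaling and prediction and the scaler is refit inside every cross-validation fold. This is a sketch under assumptions, not the notebook's method: it loads scikit-learn's bundled copy of the dataset via `load_breast_cancer` instead of `breast-cancer.csv`, and because that copy encodes malignant as 0, the labels are flipped to match the notebook's encoding. Its scores should not be expected to match the numbers above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy of the dataset stands in for the CSV file.
bunch = load_breast_cancer()
X = bunch.data
y = 1 - bunch.target  # the bundled copy uses 0 = malignant; flip so that 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101
)

# make_pipeline names each step after its lowercased class name ("standardscaler", "svc").
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)  # raw features in; scaling happens inside the pipeline

def predict_diagnosis(features, model=search.best_estimator_):
    """Return 'Malignant' or 'Benign' for one row of 30 raw (unscaled) features."""
    prediction = model.predict([list(features)])[0]
    return "Malignant" if prediction == 1 else "Benign"

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
print("First test row:", predict_diagnosis(X_test[0]))
```

Because the pipeline carries its own scaler, the fitted `search.best_estimator_` can be saved and reused on its own.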


