## **Question 3: Use a Machine Learning Model to Classify the Class Column of the Provided Dataset**
https://drive.google.com/file/d/1o51JeH-aMdC4hvHvOSBcWx2faIWCaiy6/view?usp=sharing

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [30]:
data = pd.read_csv('/content/wisconsin.csv')
data

Unnamed: 0,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,Class
0,5,1,1,1,2,1.0,3,1,1,benign
1,5,4,4,5,7,10.0,3,2,1,benign
2,3,1,1,1,2,2.0,3,1,1,benign
3,6,8,8,1,3,4.0,3,7,1,benign
4,4,1,1,3,2,1.0,3,1,1,benign
...,...,...,...,...,...,...,...,...,...,...
694,3,1,1,1,3,2.0,1,1,1,benign
695,2,1,1,1,2,1.0,1,1,1,benign
696,5,10,10,3,7,3.0,8,10,2,malignant
697,4,8,6,4,3,4.0,10,6,1,malignant


In [31]:
data.dropna(inplace=True)

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 0 to 698
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Cl.thickness     683 non-null    int64  
 1   Cell.size        683 non-null    int64  
 2   Cell.shape       683 non-null    int64  
 3   Marg.adhesion    683 non-null    int64  
 4   Epith.c.size     683 non-null    int64  
 5   Bare.nuclei      683 non-null    float64
 6   Bl.cromatin      683 non-null    int64  
 7   Normal.nucleoli  683 non-null    int64  
 8   Mitoses          683 non-null    int64  
 9   Class            683 non-null    object 
dtypes: float64(1), int64(8), object(1)
memory usage: 58.7+ KB


In [40]:
# numerical feature
numerical_feature = {feature for feature in data.columns if data[feature].dtypes != 'O'}
print(f'Count of Numerical feature: {len(numerical_feature)}')
print(f'Numerical feature are:\n {numerical_feature}')

Count of Numerical feature: 9
Numerical feature are:
 {'Normal.nucleoli', 'Cl.thickness', 'Cell.shape', 'Epith.c.size', 'Cell.size', 'Marg.adhesion', 'Mitoses', 'Bare.nuclei', 'Bl.cromatin'}


In [41]:
# Categorical feature
categorical_feature = {feature for feature in data.columns if data[feature].dtypes == 'O'}
print(f'Count of Categorical feature: {len(categorical_feature)}')
print(f'Categorical feature are:\n {categorical_feature}')

Count of Categorical feature: 1
Categorical feature are:
 {'Class'}


In [43]:
encoder = LabelEncoder()
for feature in categorical_feature:
    data[feature] = encoder.fit_transform(data[feature])
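
The later discussion refers to benign as class 0 and malignant as class 1. A minimal pure-Python sketch of why the encoding comes out that way, assuming (as scikit-learn's `LabelEncoder` does) that codes are assigned in sorted order of the unique labels. The `labels` list is a hypothetical toy sample, not the real `Class` column:

```python
# Pure-Python sketch of how a label encoder assigns integer codes.
# Assumption: codes follow the sorted order of the unique labels,
# which is how scikit-learn's LabelEncoder orders its classes.
labels = ["benign", "malignant", "benign", "benign", "malignant"]  # toy sample

classes = sorted(set(labels))  # alphabetical order of the unique labels
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

On the real encoder, the same ordering can be read back from `encoder.classes_` after fitting.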

In [45]:
X = data.drop(columns='Class')
y = data['Class']



In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [47]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [53]:
# Create an SVM classifier
clf = SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)

In [50]:

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(classification_report(y_test, y_pred))

confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", confusion)


Accuracy: 0.9562043795620438
              precision    recall  f1-score   support

           0       0.98      0.95      0.97        87
           1       0.92      0.96      0.94        50

    accuracy                           0.96       137
   macro avg       0.95      0.96      0.95       137
weighted avg       0.96      0.96      0.96       137

Confusion Matrix:
 [[83  4]
 [ 2 48]]


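## **The confusion matrix further breaks down the model's performance:**

Rows are the true classes and columns are the predicted classes. Treating malignant (class 1) as the positive class:

- True Negatives (TN): 83
- False Positives (FP): 4
- False Negatives (FN): 2
- True Positives (TP): 48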

*In summary, the SVM model performs well at classifying breast cancer tumors. It reaches about 95.6% accuracy on the test set, correctly identifying 83 of 87 benign tumors and 48 of 50 malignant tumors. The 2 false negatives are malignant tumors predicted as benign, which is typically the more costly error in a diagnostic setting.*
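
The scores in the classification report can be recomputed by hand from the four confusion-matrix counts. The sketch below uses only the counts printed above and assumes class 1 (malignant) is the positive class. The variable names are illustrative:

```python
# Counts from the SVM confusion matrix [[83, 4], [2, 48]]
# (rows = true class, columns = predicted class; class 1 = positive).
tn, fp, fn, tp = 83, 4, 2, 48

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 (malignant): how reliable and how complete the positive predictions are.
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Class 0 (benign): the same formulas with the roles of the classes swapped.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy          = {accuracy:.4f}")
print(f"class 1 precision = {precision_pos:.2f}, recall = {recall_pos:.2f}")
print(f"class 0 precision = {precision_neg:.2f}, recall = {recall_neg:.2f}")
```

Rounded to two decimals, these values agree with the per-class rows of the report printed by `classification_report`.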

In [51]:
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

pred_xgb = classifier.predict(X_test)

#Model Evaluation
xgb_acc = accuracy_score(y_test, pred_xgb)

print('Accuracy Score: ' + str(xgb_acc))

print('Precision Score: ' + str(precision_score(y_test, pred_xgb)))

print('Recall Score: ' + str(recall_score(y_test, pred_xgb)))

print('F1 Score: ' + str(f1_score(y_test, pred_xgb)))

print('Classification Report: \n' + str(classification_report(y_test, pred_xgb)))

Accuracy Score: 0.9708029197080292
Precision Score: 0.9423076923076923
Recall Score: 0.98
F1 Score: 0.9607843137254902
Classification Report: 
              precision    recall  f1-score   support

           0       0.99      0.97      0.98        87
           1       0.94      0.98      0.96        50

    accuracy                           0.97       137
   macro avg       0.97      0.97      0.97       137
weighted avg       0.97      0.97      0.97       137



## **Evaluation metrics for the XGBoost classification model**

**Accuracy Score:** Accuracy measures the proportion of correctly classified samples. The model has an accuracy of approximately 97.08%, meaning it correctly classified 133 of the 137 samples in the test set.

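**Precision Score:** Precision is the proportion of true positive predictions (correctly predicted positive cases) out of all positive predictions (true positives + false positives). The single value printed above (about 0.942) is the precision for class 1, which `precision_score` treats as the positive label by default. In the classification report, the precision for class 0 is approximately 99%, and for class 1 it is approximately 94%. High precision indicates that when the model predicts a class, the prediction is likely to be correct.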

**Recall Score:** Recall, also known as sensitivity or true positive rate, is the proportion of true positives out of all actual positives (true positives + false negatives). The single value printed above (0.98) is the recall for class 1. In the classification report, the recall for class 0 is approximately 97%, and for class 1 it is approximately 98%. High recall means that the model misses few of the actual cases of that class.

**F1 Score:** The F1 score is the harmonic mean of precision and recall. It provides a balance between precision and recall. The F1 score for class 0 is approximately 98%, and for class 1, it's approximately 96%.

**Classification Report:** This report provides a summary of precision, recall, and F1 score for both classes (0 and 1). It also includes the support column, which indicates the number of samples in each class in the test set.
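
To make the F1 and averaging rows concrete, here is a small sketch that rebuilds them from counts. The notebook does not print the XGBoost confusion matrix, so the counts below (TN=84, FP=3, FN=1, TP=49) are an assumption: they are inferred from the reported class 1 precision (49/52), class 1 recall (49/50), and the supports of 87 and 50.

```python
# Assumption: counts inferred from the printed scores and supports;
# the notebook itself does not print the XGBoost confusion matrix.
tn, fp, fn, tp = 84, 3, 1, 49

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

per_class = {
    0: {"precision": tn / (tn + fn), "recall": tn / (tn + fp), "support": tn + fp},
    1: {"precision": tp / (tp + fp), "recall": tp / (tp + fn), "support": tp + fn},
}
for scores in per_class.values():
    scores["f1"] = f1(scores["precision"], scores["recall"])

total = sum(s["support"] for s in per_class.values())

# Macro average: plain mean over classes, each class counted equally.
macro_f1 = sum(s["f1"] for s in per_class.values()) / len(per_class)
# Weighted average: mean over classes weighted by each class's support.
weighted_f1 = sum(s["f1"] * s["support"] for s in per_class.values()) / total

for label, s in per_class.items():
    print(label, f"{s['precision']:.2f} {s['recall']:.2f} {s['f1']:.2f} {s['support']}")
print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

The macro average treats both classes equally, while the weighted average leans toward the larger benign class. Here the two agree to two decimals because the per-class scores are close.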

*Overall, these metrics indicate that the XGBoost model performs well on the breast cancer classification task. Its accuracy (about 97.1%) is slightly higher than the SVM's (about 95.6%), and its precision, recall, and F1 score are at or above 0.94 for both benign (class 0) and malignant (class 1) tumors. This is a promising result for medical diagnostic purposes, although it rests on a single 137-sample test split.*