<a href="https://colab.research.google.com/github/rutripathi96/Breast-Cancer-Prediction/blob/main/Breast_Cancer_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Breast Cancer Prediction Project
This notebook contains a simple logistic regression model for predicting whether a breast tumor is benign or malignant. The model is trained using the Breast Cancer Wisconsin dataset from the UCI Machine Learning Repository.

Steps:

**Data Preprocessing:**

Checked the dataset for null values and performed an 80/20 train-test split.
Excluded the sample code number column, since it is an identifier rather than a predictive feature.

**Model Training:**

Trained a logistic regression model using scikit-learn.

**Prediction:**

Used the trained model to predict the tumor type (benign or malignant) for each test sample.

**Evaluation:**

Evaluated model performance using a confusion matrix and k-fold cross-validation.

# Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing the dataset

In [None]:
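dataset = pd.read_csv('breast_cancer.csv')
# Features: every column except the first (sample code number) and the last (Class)
X = dataset.iloc[:, 1:-1].values
# Target: the Class column
y = dataset.iloc[:, -1].values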

**Checking for null values**

In [None]:
print(dataset.isnull().sum())

Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64
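A zero count from `isnull()` only rules out values that pandas already treats as missing. The original UCI file marks missing `Bare Nuclei` entries with `?`, which `isnull()` would not flag if such rows were still present. The sketch below uses a small made-up DataFrame (`demo` and its values are hypothetical, not taken from `breast_cancer.csv`) to show one way such placeholders could be detected.

```python
import pandas as pd

# Hypothetical toy frame: one "Bare Nuclei" entry uses "?" as a placeholder.
demo = pd.DataFrame({
    "Bare Nuclei": ["1", "10", "?", "2"],
    "Class": [2, 4, 2, 2],
})

# isnull() reports no missing values, because "?" is an ordinary string.
print(demo.isnull().sum())

# Coercing to numeric turns non-numeric placeholders into NaN so they can be counted.
numeric = pd.to_numeric(demo["Bare Nuclei"], errors="coerce")
n_bad = int(numeric.isnull().sum())
print("Non-numeric entries in 'Bare Nuclei':", n_bad)
```

Checking `dataset.dtypes` is a quicker first test: a feature column reported as `object` instead of an integer type would suggest that placeholders like this are present.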


# Splitting the dataset into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
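To illustrate what `test_size=0.2` and `random_state=0` do, here is a minimal sketch on placeholder arrays. The row count of 683 is an assumption (it is consistent with the 137 test samples counted in the confusion matrix later in this notebook), and `X_demo` / `y_demo` are made-up stand-ins for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 683  # assumed row count, not read from breast_cancer.csv
X_demo = np.arange(n_rows * 9).reshape(n_rows, 9)    # 9 placeholder feature columns
y_demo = np.where(np.arange(n_rows) % 3 == 0, 4, 2)  # placeholder labels coded 2 / 4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```

Passing `stratify=y` to `train_test_split` would additionally keep the benign/malignant ratio the same in both subsets; the notebook does not do this, so it is mentioned only as an option.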

In [None]:
print(X_train)

[[10  1  1 ...  5  4  1]
 [ 1  1  1 ...  3  1  1]
 [ 5  1  1 ...  3  1  1]
 ...
 [ 1  1  1 ...  1  1  1]
 [ 3  1  1 ...  2  1  1]
 [10  9  7 ...  7  7  1]]


In [None]:
print(X_test)

[[ 1  1  1 ...  1  1  1]
 [ 3  1  1 ...  2  1  1]
 [ 5  5  5 ...  4  3  1]
 ...
 [ 4  1  1 ...  1  1  1]
 [ 4 10  4 ...  9 10  1]
 [ 2  1  1 ...  2  1  1]]


In [None]:
print(y_train)

[4 2 2 2 4 2 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 4 4 4 4 4 4 2 2 2 2 2 2 4 4 4
 2 2 2 4 2 2 4 4 2 4 2 2 4 4 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 4 4 2 4 2 2 2
 4 2 2 2 2 4 4 2 2 4 4 2 2 4 4 2 4 2 4 4 2 2 2 4 2 4 2 4 2 2 2 2 2 4 2 2 4
 2 2 4 2 2 2 2 2 4 2 2 4 2 4 2 2 4 4 4 2 2 2 2 2 2 4 4 2 2 2 2 2 2 2 4 2 2
 4 2 2 2 2 2 2 2 4 2 2 2 4 4 2 4 2 2 2 4 2 2 2 4 4 2 4 2 2 4 2 2 2 2 2 2 2
 4 4 4 4 2 4 2 4 2 4 4 4 2 2 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 4 4 2 2 2
 2 2 2 2 2 2 2 2 4 2 4 2 2 2 2 2 2 4 4 4 4 2 2 4 2 4 2 4 2 2 2 2 4 2 4 2 2
 2 2 4 2 2 4 2 2 2 2 2 2 2 4 2 4 2 2 2 4 2 2 2 2 2 4 2 2 2 2 4 2 2 4 2 2 2
 2 4 4 2 2 2 2 4 2 2 4 2 2 2 2 4 4 2 4 2 4 2 2 2 4 4 4 2 2 2 2 2 2 2 2 4 4
 2 2 2 2 2 2 2 4 4 2 2 2 2 4 4 4 2 4 2 4 2 2 2 2 2 4 2 4 4 2 2 2 2 2 2 2 4
 2 2 2 4 2 2 4 4 4 2 4 4 4 2 2 2 4 2 4 2 2 4 2 4 4 4 2 2 2 4 2 4 4 4 2 2 2
 4 2 4 2 2 2 2 4 4 2 2 2 4 4 2 2 4 2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 2 4 2 2
 2 2 4 4 2 2 4 4 2 2 4 4 4 2 2 4 2 2 2 2 2 4 2 4 4 2 2 2 2 4 2 2 2 2 4 2 4
 2 4 2 2 4 2 2 2 2 4 2 2 

In [None]:
print(y_test)

[2 2 4 4 2 2 2 4 2 2 4 2 4 2 2 2 4 4 4 2 2 2 4 2 4 4 2 2 2 4 2 4 4 2 2 2 4
 4 2 4 2 2 2 2 2 2 2 4 2 2 4 2 4 2 2 2 4 2 2 4 2 2 2 2 2 2 2 2 4 4 2 2 2 2
 2 2 4 2 2 2 4 2 4 2 2 4 2 2 4 2 4 2 4 4 4 2 4 4 4 2 2 2 4 4 2 2 4 4 2 2 4
 2 2 4 2 2 2 4 2 2 2 4 2 2 4 4 2 4 2 4 2 2 4 2 2 4 2]


# Training the Model

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)

# Predicting the test set results

In [None]:
y_pred = classifier.predict(X_test)
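`predict` returns one class label per row, using the same label coding as the training target (2 and 4 here). The sketch below shows the call on synthetic data, together with `predict_proba`, which exposes the class probabilities behind each label. The data and the names `X_demo`, `y_demo`, and `clf` are hypothetical and are not the notebook's trained model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 9 integer features in 1..10, labels coded 2 / 4.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(200, 9))
y_demo = np.where(X_demo.mean(axis=1) > 5.5, 4, 2)

clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(X_demo, y_demo)

new_samples = X_demo[:5]
labels = clf.predict(new_samples)       # one label per row
probs = clf.predict_proba(new_samples)  # one column per class, in clf.classes_ order

print(clf.classes_)
print(labels)
print(probs.round(3))
```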

# Making the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[84  3]
 [ 3 47]]
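scikit-learn orders the confusion matrix by sorted label, with true classes in rows and predicted classes in columns. Taking 2 as benign and 4 as malignant (the dataset's usual coding, assumed here), the printed matrix can be turned into summary metrics by hand. The variable names below are illustrative.

```python
# Values copied from the printed matrix: rows are true labels, columns are
# predictions, in the order 2 (assumed benign), 4 (assumed malignant).
cm = [[84, 3],
      [3, 47]]
(tn, fp), (fn, tp) = cm  # malignant (4) is treated as the positive class

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)  # share of malignant tumors that were detected
specificity = tn / (tn + fp)  # share of benign tumors correctly cleared

print(f"Test samples: {total}")           # 137
print(f"Accuracy:    {accuracy:.2%}")     # 95.62%
print(f"Sensitivity: {sensitivity:.2%}")  # 94.00%
print(f"Specificity: {specificity:.2%}")  # 96.55%
```

In a screening setting the 3 false negatives (malignant tumors predicted as benign) are usually the costlier error, so sensitivity is worth reading alongside accuracy.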


# Computing the accuracy with k-fold cross-validation

In [None]:
from sklearn.model_selection import cross_val_score
# 10-fold cross-validation on the training set only
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print("Accuracy : {:.2f} %".format(accuracies.mean() * 100))
print("Standard Deviation : {:.2f} %".format(accuracies.std() * 100))

Accuracy : 96.70 %
Standard Deviation : 1.97 %
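For a classifier and an integer `cv`, `cross_val_score` splits the data with stratified k-fold, fits a fresh copy of the model on each training portion, and scores it on the held-out fold. The sketch below reproduces that loop by hand on synthetic data to make the procedure explicit. The data from `make_classification` and the `max_iter=1000` setting are assumptions for the demo, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with 9 features (stand-in for the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

# Manual loop: train on 9 folds, score on the remaining one, repeat 10 times.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_model = LogisticRegression(random_state=0, max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The one-line helper should give the same per-fold accuracies.
model = LogisticRegression(random_state=0, max_iter=1000)
auto_scores = cross_val_score(model, X_demo, y_demo, cv=10)

print(np.allclose(manual_scores, auto_scores))
print("Mean: {:.2f} %, Std: {:.2f} %".format(
    auto_scores.mean() * 100, auto_scores.std() * 100))
```

The standard deviation across folds indicates how much the accuracy estimate depends on which samples land in the validation fold.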
