## Breast Cancer Wisconsin (Diagnostic) Dataset

### 📄 Description

Breast cancer is the most common cancer among women worldwide. It accounts for **25% of all cancer cases** and affected **over 2.1 million people in 2015**. It begins when cells in the breast grow uncontrollably, often forming tumors that can be seen on an X-ray or felt as lumps.

The primary challenge is **classifying tumors as either malignant (cancerous) or benign (non-cancerous)**.

### 📊 Feature Descriptions

| Column Name               | Description                                                   |
| ------------------------- | ------------------------------------------------------------- |
| `id`                      | Unique identifier for each patient                            |
| `diagnosis`               | Diagnosis result (M = Malignant, B = Benign)                  |
| `radius_mean`             | Mean of distances from center to points on the perimeter      |
| `texture_mean`            | Mean texture (standard deviation of gray-scale values)        |
| `perimeter_mean`          | Mean size of the core tumor perimeter                         |
| `area_mean`               | Mean area of the tumor                                        |
| `smoothness_mean`         | Mean of local variation in radius lengths                     |
| `compactness_mean`        | Mean of perimeter² / area - 1.0                               |
| `concavity_mean`          | Mean severity of concave portions of the contour              |
| `concave points_mean`     | Mean number of concave portions of the contour                |
| `symmetry_mean`           | Mean symmetry of the tumor                                    |
| `fractal_dimension_mean`  | Mean of "coastline approximation" - complexity of the contour |
| `radius_se`               | Standard error of radius                                      |
| `texture_se`              | Standard error of texture                                     |
| `perimeter_se`            | Standard error of perimeter                                   |
| `area_se`                 | Standard error of area                                        |
| `smoothness_se`           | Standard error of smoothness                                  |
| `compactness_se`          | Standard error of compactness                                 |
| `concavity_se`            | Standard error of concavity                                   |
| `concave points_se`       | Standard error of concave points                              |
| `symmetry_se`             | Standard error of symmetry                                    |
| `fractal_dimension_se`    | Standard error of fractal dimension                           |
| `radius_worst`            | Worst radius (mean of the three largest values)               |
| `texture_worst`           | Worst texture measurement                                     |
| `perimeter_worst`         | Worst perimeter measurement                                   |
| `area_worst`              | Worst area measurement                                        |
| `smoothness_worst`        | Worst smoothness measurement                                  |
| `compactness_worst`       | Worst compactness measurement                                 |
| `concavity_worst`         | Worst concavity measurement                                   |
| `concave points_worst`    | Worst number of concave points                                |
| `symmetry_worst`          | Worst symmetry measurement                                    |
| `fractal_dimension_worst` | Worst fractal dimension measurement                           |
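
The 30 feature columns in the table follow a regular pattern: ten base measurements, each reported as a mean, a standard error (`_se`), and a "worst" value. The following is a minimal sketch that rebuilds the column names from that pattern. It assumes the spelling used in this CSV (including the space in `concave points`), and the variable names are illustrative only.

```python
# Ten base measurements listed in the feature table.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each measurement is summarised three ways.
suffixes = ["mean", "se", "worst"]

# Same order as the CSV: all means, then all standard errors, then all worst values.
feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]

print(len(feature_columns))
print(feature_columns[:3])
```

A list like this can be compared against `df.columns` to catch renamed or missing columns before modelling.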



**Importing Dependencies**

In [64]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')


### Data Collection and Pre-Processing

In [65]:
# importing the data into a pandas DataFrame
file_path = "C:/Users/USER/Desktop/Datasets/breast-cancer.csv"
df = pd.read_csv(file_path)
df.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [66]:
# overview of the dataframe 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [67]:
# Checking for missing values 
df.isna().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

In [68]:
# number of rows and columns
df.shape

(569, 32)

In [69]:
# basic statistical measures of the dataset
df.describe()

,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,0.405172,1.216853,2.866059,40.337079,0.007041,0.025478,0.031894,0.011796,0.020542,0.003795,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,0.277313,0.551648,2.021855,45.491006,0.003003,0.017908,0.030186,0.00617,0.008266,0.002646,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.1115,0.3602,0.757,6.802,0.001713,0.002252,0.0,0.0,0.007882,0.000895,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,0.2324,0.8339,1.606,17.85,0.005169,0.01308,0.01509,0.007638,0.01516,0.002248,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,0.3242,1.108,2.287,24.53,0.00638,0.02045,0.02589,0.01093,0.01873,0.003187,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,0.4789,1.474,3.357,45.19,0.008146,0.03245,0.04205,0.01471,0.02348,0.004558,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,2.873,4.885,21.98,542.2,0.03113,0.1354,0.396,0.05279,0.07895,0.02984,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [70]:
# distribution of diagnosis
df['diagnosis'].value_counts()

diagnosis
B    357
M    212
Name: count, dtype: int64
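
The classes are moderately imbalanced, so it helps to know what accuracy a trivial classifier would reach. The sketch below uses only the two counts printed above and computes the share of each class and the accuracy of always predicting the majority class (roughly 63%). The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"B": 357, "M": 212}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class reaches this accuracy.
baseline_accuracy = max(shares.values())

print(shares)
print(round(baseline_accuracy, 3))
```

Any model accuracy reported later is best read against this baseline rather than against 50%.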

In [71]:
df.groupby('diagnosis').mean()

diagnosis,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
B,26543820.0,12.146524,17.914762,78.075406,462.790196,0.092478,0.080085,0.046058,0.025717,0.174186,0.062867,0.284082,1.22038,2.000321,21.135148,0.007196,0.021438,0.025997,0.009858,0.020584,0.003636,13.379801,23.51507,87.005938,558.89944,0.124959,0.182673,0.166238,0.074444,0.270246,0.079442
M,36818050.0,17.46283,21.604906,115.365377,978.376415,0.102898,0.145188,0.160775,0.08799,0.192909,0.06268,0.609083,1.210915,4.323929,72.672406,0.00678,0.032281,0.041824,0.01506,0.020472,0.004062,21.134811,29.318208,141.37033,1422.286321,0.144845,0.374824,0.450606,0.182237,0.323468,0.09153
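
One rough way to read the per-class means is to compare them as ratios. The sketch below takes five features from the output above and ranks them by how far the malignant-to-benign ratio is from 1. This is only a heuristic, since it ignores the spread within each class, and the selection of features is arbitrary.

```python
# (benign mean, malignant mean) pairs copied from the groupby output above.
class_means = {
    "radius_mean": (12.146524, 17.46283),
    "area_mean": (462.790196, 978.376415),
    "concavity_mean": (0.046058, 0.160775),
    "fractal_dimension_mean": (0.062867, 0.06268),
    "texture_se": (1.22038, 1.210915),
}

# Ratio of malignant mean to benign mean for each feature.
ratios = {name: malignant / benign for name, (benign, malignant) in class_means.items()}

# Features whose ratio is furthest from 1 come first.
ranked = sorted(ratios.items(), key=lambda item: abs(item[1] - 1), reverse=True)
for name, ratio in ranked:
    print(f"{name:<24} {ratio:.2f}")
```

Among these five, `concavity_mean` and `area_mean` differ most between the classes, while `fractal_dimension_mean` and `texture_se` are nearly identical.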


In [72]:
# label encoding (diagnosis)
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})

In [73]:
print(df['diagnosis'])

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64
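
`Series.map` returns `NaN` for any value that is missing from the mapping, so a stray label would silently become a missing target. The `int64` dtype in the output above indicates that every label was mapped here, because a `NaN` would have forced `float64`. A minimal sketch of an explicit check, using a toy series in place of `df['diagnosis']`:

```python
import pandas as pd

# Toy labels standing in for df['diagnosis']; the last value is deliberately malformed.
labels = pd.Series(["M", "B", "B", "m"])
mapping = {"B": 0, "M": 1}

encoded = labels.map(mapping)

# Series.map yields NaN for any value missing from the mapping.
unmapped = labels[encoded.isna()]
print(unmapped.tolist())
```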


**Separating the Features and Target**

In [74]:
X = df.drop(['id', 'diagnosis'], axis=1)
Y = df['diagnosis']

In [75]:
print(X)

     radius_mean  texture_mean  perimeter_mean  ...  concave points_worst  symmetry_worst  fractal_dimension_worst
0          17.99         10.38          122.80  ...                0.2654          0.4601                  0.11890
1          20.57         17.77          132.90  ...                0.1860          0.2750                  0.08902
2          19.69         21.25          130.00  ...                0.2430          0.3613                  0.08758
3          11.42         20.38           77.58  ...                0.2575          0.6638                  0.17300
4          20.29         14.34          135.10  ...                0.1625          0.2364                  0.07678
..           ...           ...             ...  ...                   ...             ...                      ...
564        21.56         22.39          142.00  ...                0.2216          0.2060                  0.07115
565        20.13         28.25          131.20  ...                0.1628       

In [76]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64


**Separating Training and Test Data**

In [77]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=13)

In [78]:
print(X.shape, X_train.shape, X_test.shape)

(569, 30) (455, 30) (114, 30)
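
The `stratify=Y` argument is meant to keep the malignant share roughly equal in both splits. The sketch below checks that behaviour on synthetic labels with the same class counts as the dataset. The feature matrix is a placeholder and the `_demo`, `_tr`, and `_te` names are illustrative, chosen so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (357 benign, 212 malignant).
y_demo = np.array([0] * 357 + [1] * 212)
# Placeholder feature matrix; only the labels matter for this check.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=13
)

# Split sizes, then the malignant share overall, in train, and in test.
print(len(y_tr), len(y_te))
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```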


### Model Training and Evaluation

In [79]:
# training the model
model = LogisticRegression()
model.fit(X_train, Y_train)
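
The features span very different scales (`area_mean` reaches 2501, while `smoothness_mean` stays below 0.2), and with warnings suppressed a solver that stops before converging would go unnoticed. One common alternative, not the approach used in this notebook, is to standardise the features in a pipeline. The sketch below assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) so that it runs without the CSV. That copy encodes malignant as 0, so the labels are flipped to match the encoding used here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of the dataset, used so the sketch runs without the CSV.
bundled = load_breast_cancer()
X_demo = bundled.data
# The bundled copy encodes malignant as 0; flip it so that 1 = malignant as in this notebook.
y_demo = 1 - bundled.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=13
)

# Standardise each feature, then fit logistic regression on the scaled values.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

scaled_test_accuracy = accuracy_score(y_te, scaled_model.predict(X_te))
print(f"Test accuracy with scaling: {scaled_test_accuracy:.3f}")
```

Because the scaler is fitted inside the pipeline, its statistics come from the training split only, which avoids leaking test data into preprocessing.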

In [80]:
# the accuracy on training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print("Accuracy on training data :",training_data_accuracy)

Accuracy on training data : 0.945054945054945


In [81]:
# the accuracy on test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print("Accuracy on test data :",test_data_accuracy)

Accuracy on test data : 0.956140350877193
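
Accuracy alone does not show which kind of error the model makes, and for a diagnostic task a missed malignant case usually matters more than a false alarm. A confusion matrix and recall for the malignant class make that visible. The sketch below uses toy labels purely for illustration. They are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("Recall (malignant):   ", recall_score(y_true, y_pred))
print("Precision (malignant):", precision_score(y_true, y_pred))
```

In the notebook, the same calls could be applied to `Y_test` and `X_test_prediction`.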


**Building Predictive System**

In [82]:
data = (12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,0.2976,1.599,2.039,23.94,0.007149,0.07217,0.07743,0.01432,0.01789,0.01008,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075)

In [83]:
# change the input data to numpy array
input_data = np.asarray(data)

# reshape the array to make prediction with one data point 
input_data = input_data.reshape(1, -1)

prediction = model.predict(input_data)  

if prediction[0] == 0:
    print('The tumor is benign')
else:
    print('The tumor is malignant')

The tumor is malignant
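
The prediction step can be wrapped in a small helper that validates the input length and also reports the predicted probability. The sketch below is one possible shape for such a helper. `predict_diagnosis`, `LABELS`, and `demo_model` are hypothetical names, and the model is trained on scikit-learn's bundled copy of the dataset (with labels flipped so that 1 = malignant) only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

LABELS = {0: "Benign", 1: "Malignant"}

def predict_diagnosis(model, values, n_features=30):
    """Return (label, probability of malignancy) for a single sample."""
    sample = np.asarray(values, dtype=float)
    if sample.shape != (n_features,):
        raise ValueError(f"expected {n_features} values, got shape {sample.shape}")
    # scikit-learn expects a 2D array: one row per sample.
    sample = sample.reshape(1, -1)
    predicted = int(model.predict(sample)[0])
    # Column 1 of predict_proba corresponds to class 1 (malignant in this encoding).
    probability = float(model.predict_proba(sample)[0, 1])
    return LABELS[predicted], probability

# Demo model trained on the bundled dataset; malignant is flipped to 1.
bundled = load_breast_cancer()
demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_model.fit(bundled.data, 1 - bundled.target)

label, probability = predict_diagnosis(demo_model, bundled.data[0])
print(label, round(probability, 3))
```

Returning the probability alongside the label lets a caller apply a stricter threshold than 0.5 when missing a malignant case is the more costly error.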
