Introduction:

This notebook builds a k-nearest neighbors (kNN) classifier to predict the binary target variable "Diagnosis" (B = benign, M = malignant) in the Breast Cancer Wisconsin (Diagnostic) dataset.
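
To make the idea behind kNN concrete, here is a minimal pure-Python sketch of the prediction step: measure the distance from a query point to every training point, keep the k closest, and take a majority vote. The function name `knn_predict` and the toy points are made up for illustration; the notebook itself relies on scikit-learn's `KNeighborsClassifier`, which this sketch does not try to reproduce in detail.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up data: two features per sample, labels 'B' and 'M'
toy_points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
toy_labels = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(toy_points, toy_labels, (1.1, 1.0)))  # query near the 'B' cluster
print(knn_predict(toy_points, toy_labels, (3.0, 3.0)))  # query near the 'M' cluster
```

Because the vote depends entirely on distances, features measured on larger scales can dominate the result, which is why the features are standardized before modeling.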

In [65]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
  
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
X = breast_cancer_wisconsin_diagnostic.data.features 
y = breast_cancer_wisconsin_diagnostic.data.targets


Data Overview:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/

The dataset has 569 samples and 30 real-valued features, with no missing values.
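
The counts above can be checked programmatically. The sketch below assumes that scikit-learn's bundled copy of the dataset (`load_breast_cancer`) is an acceptable stand-in for the `ucimlrepo` download, so that it runs offline; its column names differ from the ones used in this notebook (for example `mean radius` instead of `radius1`), and the variable names `X_check` and `y_check` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the ucimlrepo download
bunch = load_breast_cancer(as_frame=True)
X_check = bunch.data    # feature DataFrame
y_check = bunch.target  # Series of integer class codes

n_samples, n_features = X_check.shape
n_missing = int(X_check.isna().sum().sum())  # total number of missing cells

# Map the integer codes back to class names and count each class
code_to_name = dict(enumerate(bunch.target_names))
class_counts = y_check.map(code_to_name).value_counts()

print(n_samples, n_features, n_missing)
print(class_counts)
```

With the notebook's own variables, the equivalent checks would be `X.shape`, `X.isna().sum().sum()`, and `y["Diagnosis"].value_counts()`.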


In [66]:
# metadata 
breast_cancer_wisconsin_diagnostic.metadata

{'uci_id': 17,
 'name': 'Breast Cancer Wisconsin (Diagnostic)',
 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic',
 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv',
 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.',
 'area': 'Health and Medicine',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 569,
 'num_features': 30,
 'feature_types': ['Real'],
 'demographics': [],
 'target_col': ['Diagnosis'],
 'index_col': ['ID'],
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 1993,
 'last_updated': 'Fri Nov 03 2023',
 'dataset_doi': '10.24432/C5DW2B',
 'creators': ['William Wolberg',
  'Olvi Mangasarian',
  'Nick Street',
  'W. Street'],
 'intro_paper': {'ID': 230,
  'type': 'NATIVE',
  'title': 'Nuclear feature extraction for breast tumor diagnosis',
  'authors': 'W. Street, W. Wolberg, O. Mangasarian',
  'venue': 'Electronic imaging',
  'yea

In [67]:
# variable information 
breast_cancer_wisconsin_diagnostic.variables

,name,role,type,demographic,description,units,missing_values
0,ID,ID,Categorical,,,,no
1,Diagnosis,Target,Categorical,,,,no
2,radius1,Feature,Continuous,,,,no
3,texture1,Feature,Continuous,,,,no
4,perimeter1,Feature,Continuous,,,,no
5,area1,Feature,Continuous,,,,no
6,smoothness1,Feature,Continuous,,,,no
7,compactness1,Feature,Continuous,,,,no
8,concavity1,Feature,Continuous,,,,no
9,concave_points1,Feature,Continuous,,,,no


In [68]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   radius1             569 non-null    float64
 1   texture1            569 non-null    float64
 2   perimeter1          569 non-null    float64
 3   area1               569 non-null    float64
 4   smoothness1         569 non-null    float64
 5   compactness1        569 non-null    float64
 6   concavity1          569 non-null    float64
 7   concave_points1     569 non-null    float64
 8   symmetry1           569 non-null    float64
 9   fractal_dimension1  569 non-null    float64
 10  radius2             569 non-null    float64
 11  texture2            569 non-null    float64
 12  perimeter2          569 non-null    float64
 13  area2               569 non-null    float64
 14  smoothness2         569 non-null    float64
 15  compactness2        569 non-null    float64
 16  concavit

In [69]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Diagnosis  569 non-null    object
dtypes: object(1)
memory usage: 4.6+ KB


Data Preprocessing:

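Standardize the cell measurement features to zero mean and unit variance, since kNN is distance-based and features on larger scales would otherwise dominate. Then hold out 30% of the data as a test set for validating the model later. Note that the scaler here is fit on the full dataset before the split, so test-set statistics leak into the scaling; fitting it on the training split alone would avoid this.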

In [70]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# preserve column names
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

In [71]:
from sklearn.model_selection import train_test_split

# Split the dataset into a training set and a testing set
# 70% of the data will be used for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=123)
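
As a hedged alternative to scaling before splitting, the sketch below splits first and lets a scikit-learn pipeline fit the scaler on the training rows only. It assumes scikit-learn's bundled copy of the dataset so that it runs offline; the names `X_raw`, `y_raw`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `pipe` are hypothetical, and `stratify` is an addition that the cell above does not use.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the dataset, so the sketch runs offline
X_raw, y_raw = load_breast_cancer(return_X_y=True, as_frame=True)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=123, stratify=y_raw
)

# The pipeline fits the scaler on the training rows only, then reuses
# those means and standard deviations when transforming the test rows
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_tr, y_tr)

test_predictions = pipe.predict(X_te)
print(len(X_tr), len(X_te), len(test_predictions))
```

Keeping the scaler and the classifier in one pipeline also means the same object can be passed to cross-validation utilities without re-introducing the leak.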

In [72]:
X_train

,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,symmetry1,fractal_dimension1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
559,-0.743348,1.079841,-0.718729,-0.714976,-0.266890,-0.042470,0.281240,-0.202977,-1.546608,0.411444,...,-0.784675,1.869899,-0.744086,-0.714386,-0.112597,-0.016317,0.435670,-0.275239,-1.276034,0.186983
295,-0.101476,-1.400813,-0.161014,-0.205313,-0.311725,-0.798444,-0.981414,-0.767349,-0.801815,-0.521339,...,-0.331164,-1.424431,-0.389933,-0.385832,-0.673696,-0.935539,-1.126787,-0.861616,-0.125792,-0.886975
264,0.869853,0.647006,0.808603,0.777609,0.064029,-0.272730,0.022733,0.421754,0.202194,-0.991984,...,1.099776,0.594832,0.990045,0.976374,1.027137,0.015490,0.559926,1.275894,0.509996,-0.456949
125,-0.078755,-0.483948,-0.145362,-0.188249,-0.605638,-0.814553,-0.936592,-0.967510,-0.721494,-0.552527,...,-0.161357,-0.341520,-0.207346,-0.271919,-0.730683,-0.758692,-0.916512,-0.967897,-0.868353,-0.671962
280,1.429361,1.701168,1.409980,1.374017,0.401353,0.776234,1.296937,1.230911,0.329977,-0.084717,...,1.542933,1.664716,1.564912,1.482653,2.009061,0.825932,1.454664,1.105356,0.577943,0.734491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,-0.717787,-1.500877,-0.726143,-0.689095,-0.464730,-0.551317,-0.588066,-0.397721,-0.699589,0.428455,...,-0.664567,-1.386977,-0.723832,-0.647058,0.470421,-0.439986,-0.383747,-0.458568,-0.208298,0.200283
322,-0.359929,-1.389177,-0.376851,-0.426869,1.212639,-0.303242,-0.637784,-0.384824,-0.980711,0.278189,...,-0.461626,-0.748629,-0.430739,-0.494120,0.978917,-0.198253,-0.446594,0.013609,-0.839233,0.087789
382,-0.589980,0.798266,-0.544495,-0.588983,-1.922199,0.056078,-0.117631,-0.493675,-2.222032,0.537611,...,-0.766038,0.493869,-0.592774,-0.689424,-1.945375,0.427072,0.091208,-0.082318,-1.148229,0.528899
365,1.792899,0.579522,1.723026,1.814853,-0.345884,0.165996,0.115389,0.746242,-0.706891,-1.024589,...,1.665111,0.112814,1.606612,1.581096,0.014527,-0.106013,-0.009540,0.942432,-0.471997,-0.919671


Modeling:

In [73]:
from sklearn.neighbors import KNeighborsClassifier

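# classify each sample by majority vote among its 3 nearest training samples
model = KNeighborsClassifier(n_neighbors=3)
# y_train is a one-column DataFrame; flatten it to a 1-D array before fitting
model.fit(X_train, y_train.values.ravel())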

In [74]:
model.predict(X_test)

array(['B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M',
       'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M',
       'M', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B',
       'M', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'M',
       'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'M',
       'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'M',
       'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B',
       'M', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'B',
       'M', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'M',
       'B', 'M', 'B', 'M', 'M', 'B', 'M', 'B', 'M', 'B', 'B', 'M', 'B',
       'M', 'M', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M',
       'M', 'M'], dtype=object)

Model Evaluation:
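
One way the test-set predictions could be evaluated is to compare them with the held-out labels and count the four outcome types, treating malignant ('M') as the positive class. The sketch below is a minimal, standard-library illustration; the helper names `confusion_counts` and `summarize` are hypothetical, and the example labels are made up, not results from this notebook.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="M"):
    """Count true/false positives and negatives, treating `positive` as the positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, sensitivity (recall for the positive class) and specificity.

    Assumes both classes occur in the true labels; otherwise a denominator is zero.
    """
    total = sum(counts.values())
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "sensitivity": counts["tp"] / (counts["tp"] + counts["fn"]),
        "specificity": counts["tn"] / (counts["tn"] + counts["fp"]),
    }


# Made-up labels for illustration only; not results from this notebook
example_true = ["M", "M", "B", "B", "B", "M"]
example_pred = ["M", "B", "B", "B", "M", "M"]

counts = confusion_counts(example_true, example_pred)
print(dict(counts))
print(summarize(counts))
```

With the notebook's variables, the call would look like `confusion_counts(y_test.values.ravel(), model.predict(X_test))`. scikit-learn's `sklearn.metrics.confusion_matrix` and `accuracy_score` provide equivalent numbers. For a diagnostic task, sensitivity deserves attention alongside accuracy, because a missed malignant case (a false negative) is typically the costlier error.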

Model Validation:
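
A possible approach to validation is k-fold cross-validation, which can also help check whether `n_neighbors = 3` is a reasonable choice. The sketch below is an assumption about how this step might look, not the notebook's own method: it uses scikit-learn's bundled copy of the dataset so that it runs offline, and the candidate values of k, the fold count, and the names `X_cv`, `y_cv`, and `mean_accuracy` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the dataset, so the sketch runs offline
X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

mean_accuracy = {}
for k in (1, 3, 5, 7, 9):
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="accuracy")
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print(best_k)
```

In practice the search would be run on the training split only, so that the held-out test set remains untouched until the final evaluation.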