
In [98]:
import pandas as pd
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics

# Wisconsin Breast Cancer Dataset
Description: Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.


Attribute information

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Thirty real-valued features (described below)

Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)

Several of the papers listed on the UCI dataset page contain detailed descriptions of
how these features are computed.

The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features.  For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
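
The field layout described above (ten base features, repeated as mean, standard error, and worst) can be sketched as follows. The `SE_` and `W_` prefixes follow the naming used in this notebook's `headers` list; the helper name `build_headers` is hypothetical and only serves this illustration.

```python
# Ten base features measured for each cell nucleus, in file order.
BASE_FEATURES = [
    'radius', 'texture', 'perimeter', 'area', 'smoothness',
    'compactness', 'concavity', 'concave-points', 'symmetry',
    'fractal dimension',
]

def build_headers(base=BASE_FEATURES):
    """Return the 32 column names: id, type, then the mean, SE and worst blocks."""
    mean_cols = list(base)
    se_cols = [f'SE_{name}' for name in base]
    worst_cols = [f'W_{name}' for name in base]
    return ['id', 'type'] + mean_cols + se_cols + worst_cols

headers = build_headers()
# The dataset description numbers fields from 1, so field n is headers[n - 1].
for field in (3, 13, 23):
    print(field, headers[field - 1])
```

Generating the names this way keeps the three blocks aligned, so a base feature always sits ten fields away from its SE counterpart and twenty from its worst counterpart.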


https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

In [99]:
# Column headers for the dataset in the UCI repository. SE = Standard Error, W = Worst Values
headers = ['id','type','radius','texture','perimeter','area','smoothness','compactness','concavity','concave-points','symmetry','fractal dimension','SE_radius','SE_texture','SE_perimeter','SE_area','SE_smoothness','SE_compactness','SE_concavity','SE_concave-points','SE_symmetry','SE_fractal dimension','W_radius','W_texture','W_perimeter','W_area','W_smoothness','W_compactness','W_concavity','W_concave-points','W_symmetry','W_fractal dimension']

In [100]:
# URL of the dataset in the UCI repository
db_url="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"


In [101]:
# The data file has no header row, so the column names are passed explicitly.
df = pd.read_csv(db_url, names=headers)
df

Unnamed: 0,id,type,radius,texture,perimeter,area,smoothness,compactness,concavity,concave-points,symmetry,fractal dimension,SE_radius,SE_texture,SE_perimeter,SE_area,SE_smoothness,SE_compactness,SE_concavity,SE_concave-points,SE_symmetry,SE_fractal dimension,W_radius,W_texture,W_perimeter,W_area,W_smoothness,W_compactness,W_concavity,W_concave-points,W_symmetry,W_fractal dimension
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,1.0950,0.9053,8.589,153.40,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.01860,0.01340,0.01389,0.003532,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.006150,0.04006,0.03832,0.02058,0.02250,0.004571,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,0.4956,1.1560,3.445,27.23,0.009110,0.07458,0.05661,0.01867,0.05963,0.009208,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.011490,0.02461,0.05688,0.01885,0.01756,0.005115,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,1.1760,1.2560,7.673,158.70,0.010300,0.02891,0.05198,0.02454,0.01114,0.004239,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,0.7655,2.4630,5.203,99.04,0.005769,0.02423,0.03950,0.01678,0.01898,0.002498,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,0.4564,1.0750,3.425,48.55,0.005903,0.03731,0.04730,0.01557,0.01318,0.003892,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,0.7260,1.5950,5.772,86.22,0.006522,0.06158,0.07117,0.01664,0.02324,0.006185,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


# Model 1. Mean values as features

## Splitting Train and Test datasets

In [102]:
feature_cols = ['radius','texture','perimeter','area','smoothness','compactness','concavity','concave-points','symmetry','fractal dimension']

In [103]:
x = df[feature_cols]
y = df.type

In [104]:
X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=0.3, random_state=0)
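
As a rough check on this split, the subset sizes can be worked out by hand. The sketch assumes the dataset has 569 rows (the 398 training labels plus the 171 test predictions reported in this notebook) and that a fractional `test_size` is rounded up, which is how scikit-learn treats a float `test_size` when `train_size` is not given. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(569, 0.3)
print(f'train = {n_train}, test = {n_test}')
```

Because `random_state=0` is fixed, the same rows land in each subset on every run, which is why the three models can be compared on an identical test set.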

In [105]:
y_train

478    B
303    B
155    B
186    M
101    B
      ..
277    M
9      M
359    B
192    B
559    B
Name: type, Length: 398, dtype: object

## Model Building

In [106]:
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)
y_pred

array(['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B',
       'M', 'B', 'M', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'M', 'B',
       'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B',
       'M', 'M', 'B', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'M',
       'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'M',
       'B', 'M', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'B', 'B',
       'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B',
       'M', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M',
       'M', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'M', 'B', 'M', 'B',
       'B', 'B'], dtype=object)
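
To make the prediction step more concrete, here is a minimal sketch of how a binary logistic regression turns features into a label: a weighted sum of the features passes through the sigmoid function, and the resulting probability is compared with 0.5. The weights, intercept and sample values below are made up for illustration and are not the coefficients fitted above; `predict_label` is a hypothetical helper, not a scikit-learn function.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, intercept, threshold=0.5):
    """Return 'M' if the estimated probability of malignancy exceeds the threshold, else 'B'."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 'M' if sigmoid(score) > threshold else 'B'

# Made-up weights for two features (radius, texture); not fitted values.
weights = [0.9, 0.2]
intercept = -16.0

large_nucleus = [20.0, 18.0]  # hypothetical sample with a large radius
small_nucleus = [11.0, 15.0]  # hypothetical sample with a small radius

print(predict_label(large_nucleus, weights, intercept))
print(predict_label(small_nucleus, weights, intercept))
```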

## Confusion matrix and accuracy


In [107]:
cnf_matrix = pd.crosstab(y_test ,y_pred ,rownames=['Real'], colnames=['Prediction'])
cnf_matrix

Real \ Prediction,B,M
B,102,6
M,8,55


In [108]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy = {round(accuracy*100,1)}%')

Accuracy = 91.8%
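
The reported accuracy can be cross-checked against the confusion matrix: correct predictions sit on the diagonal, so accuracy is their sum divided by the total number of test samples. The sketch below hard-codes the four counts from the Model 1 matrix above; the nested-dict layout and the helper name are assumptions chosen for illustration.

```python
# Counts copied from the Model 1 confusion matrix:
# outer keys are real labels, inner keys are predicted labels.
confusion = {
    'B': {'B': 102, 'M': 6},
    'M': {'B': 8, 'M': 55},
}

def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples / all samples."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

accuracy = accuracy_from_confusion(confusion)
print(f'Accuracy = {round(accuracy * 100, 1)}%')
```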


# Model 2. Worst values as features


## Splitting Train and Test datasets

In [109]:
feature_cols = ['W_radius','W_texture','W_perimeter','W_area','W_smoothness','W_compactness','W_concavity','W_concave-points','W_symmetry','W_fractal dimension']

In [110]:
x = df[feature_cols]
y = df.type

In [111]:
X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=0.3, random_state=0)

In [112]:
y_train

478    B
303    B
155    B
186    M
101    B
      ..
277    M
9      M
359    B
192    B
559    B
Name: type, Length: 398, dtype: object

## Model Building

In [127]:
logreg = LogisticRegression(solver='lbfgs', max_iter=2000)
logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)
y_pred

array(['M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'M', 'B',
       'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B',
       'M', 'M', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'M',
       'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'M',
       'B', 'M', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B',
       'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B',
       'M', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'M', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'B', 'M',
       'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'M', 'M', 'M', 'B',
       'B', 'B'], dtype=object)

## Confusion matrix and accuracy


In [128]:
cnf_matrix = pd.crosstab(y_test ,y_pred ,rownames=['Real'], colnames=['Prediction'])
cnf_matrix

Real \ Prediction,B,M
B,102,6
M,1,62


In [130]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy = {round(accuracy*100,1)}%')

Accuracy = 95.9%


# Model 3. Mean & Worst values as features


## Splitting Train and Test datasets

In [116]:
feature_cols = ['radius','texture','perimeter','area','smoothness','compactness','concavity','concave-points','symmetry','fractal dimension','W_radius','W_texture','W_perimeter','W_area','W_smoothness','W_compactness','W_concavity','W_concave-points','W_symmetry','W_fractal dimension']

In [117]:
x = df[feature_cols]
y = df.type

In [118]:
X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=0.3, random_state=0)

In [119]:
y_train

478    B
303    B
155    B
186    M
101    B
      ..
277    M
9      M
359    B
192    B
559    B
Name: type, Length: 398, dtype: object

## Model Building

In [131]:
logreg = LogisticRegression(solver='lbfgs', max_iter=2000)
logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)
y_pred

array(['M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'M', 'B',
       'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B',
       'M', 'M', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'M',
       'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'M',
       'B', 'M', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B',
       'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B',
       'M', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'M', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'B', 'M',
       'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'M', 'M', 'M', 'B',
       'B', 'B'], dtype=object)

## Confusion matrix and accuracy


In [132]:
cnf_matrix = pd.crosstab(y_test ,y_pred ,rownames=['Real'], colnames=['Prediction'])
cnf_matrix

Real \ Prediction,B,M
B,102,6
M,1,62


In [134]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy = {round(accuracy*100,1)}%')

Accuracy = 95.9%


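# Conclusion
Model 2 gives the best trade-off: using only the ten worst-value features, it reaches 95.9% accuracy with a single false negative. Model 3 obtains the same result with twice as many features, and Model 1 (mean values) reaches 91.8% accuracy with 8 false negatives.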
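
Since the comparison hinges on false negatives, it may help to look at recall for the malignant class, meaning the share of real malignant cases that were predicted as malignant. The sketch below uses only the counts from the three confusion matrices in this notebook; the variable and function names are hypothetical.

```python
# (false negatives, true positives) for the malignant class,
# read off each model's confusion matrix.
malignant_counts = {
    'Model 1 (mean)': (8, 55),
    'Model 2 (worst)': (1, 62),
    'Model 3 (mean + worst)': (1, 62),
}

def recall(false_negatives, true_positives):
    """Recall = TP / (TP + FN): fraction of real malignant cases that were caught."""
    return true_positives / (true_positives + false_negatives)

recalls = {name: recall(fn, tp) for name, (fn, tp) in malignant_counts.items()}
for name, value in recalls.items():
    print(f'{name}: recall = {value:.3f}')
```

In a screening setting a missed malignant case is usually the costlier error, so recall on the malignant class is a useful complement to overall accuracy when choosing between these models.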