## Breast Cancer Analysis and Prediction

### Overview

Breast cancer is a type of cancer that starts in the breast. Cancer starts when cells begin to grow out of control. Breast cancer cells usually form a tumor that can often be seen on an x-ray or felt as a lump. Breast cancer occurs almost entirely in women, but men can get breast cancer, too.

It's important to understand that most breast lumps are benign and not cancer (malignant). Non-cancerous breast tumors are abnormal growths, but they do not spread outside of the breast. They are not life threatening, but some types of benign breast lumps can increase a woman's risk of getting breast cancer. Any breast lump or change needs to be checked by a health care professional to determine if it is benign or malignant (cancer) and if it might affect your future cancer risk.

[Learn more...](https://www.cancer.org/cancer/breast-cancer/about)


### About the Data
The dataset was downloaded from Kaggle.
> Click [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/download) to download it.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

#### Attribute Information:
- Field 1: ID number
- Field 2: <b>Diagnosis (M = malignant, B = benign)</b>
- Fields 3-32: Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
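The naming scheme described above can be sketched in a few lines. This is only an illustration of how ten measurements and three statistics give 30 columns; the list names `BASE_FEATURES` and `STATISTICS` are my own, and the spellings follow the column names in the CSV file.

```python
from itertools import product

# Base measurements, spelled the way the CSV column names spell them.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Suffixes for the three statistics computed per image.
STATISTICS = ["mean", "se", "worst"]

# Statistic-major order: all means, then all standard errors, then all "worst" values.
feature_columns = [
    f"{feature}_{stat}" for stat, feature in product(STATISTICS, BASE_FEATURES)
]

print(len(feature_columns))   # 30
print(feature_columns[0])     # radius_mean   (field 3)
print(feature_columns[10])    # radius_se     (field 13)
print(feature_columns[20])    # radius_worst  (field 23)
```

Since fields 1 and 2 hold the ID and the diagnosis, the field number of a feature is its list index plus 3.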

As a guide to the predictive analysis, I followed the instructions and discussion in "A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)" at Analytics Vidhya.



### 🚀 Loading/Installing the Libraries
---

In [1]:
import numpy as np 
import pandas as pd
from sklearn.covariance import EllipticEnvelope

### 🚀 Loading Database
---

In [2]:
df = pd.read_csv('files/data.csv', encoding='latin1')
#display(df)
df.head(3)

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |


### 🚀 Exploring Database
---

In [3]:
list(df.columns.values)

['id',
 'diagnosis',
 'radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst',
 'Unnamed: 32']

In [4]:
len(df)

569

In [5]:
#df.describe().T
print(df.diagnosis.unique())

['M' 'B']
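`unique()` only lists which labels occur. A quick way to also see how many samples fall in each class is `value_counts()`. The sketch below uses a made-up toy Series in place of `df.diagnosis`, so the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy labels standing in for df["diagnosis"]; the values are made up.
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

counts = diagnosis.value_counts()                # absolute count per label
shares = diagnosis.value_counts(normalize=True)  # fraction of samples per label

print(counts.to_dict())
print(shares.round(2).to_dict())
```

On the real data the same two calls would be applied to `df.diagnosis`.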


### 🚀 Preparing the Data
---

#### ▶️ Unnecessary Features

In [6]:
# Dropping unnecessary features
df.drop('id',axis=1,inplace=True)
df.drop('Unnamed: 32',axis=1,inplace=True)
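The empty `Unnamed: 32` column most likely comes from a trailing comma at the end of each line of the CSV file, which pandas reads as one extra, nameless column. As a hedged alternative to naming that column explicitly, the sketch below drops every column that is entirely empty. The toy CSV text is made up for the illustration and is much smaller than the real file.

```python
import io

import pandas as pd

# Toy CSV with a trailing comma on every line (made-up values).
raw = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,11.42,\n"
toy = pd.read_csv(io.StringIO(raw))

# Drop columns in which every value is missing, then the identifier column.
cleaned = toy.dropna(axis=1, how="all").drop(columns=["id"])

print(list(toy.columns))
print(list(cleaned.columns))  # ['diagnosis', 'radius_mean']
```

Dropping by name, as in the cell above, is more explicit; `dropna(axis=1, how="all")` is convenient when the position of the empty column is not known in advance.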

#### ▶️ Missing Data

In [7]:
# Verifying if there is any missing data
pd.DataFrame(df.isna().sum())

| | 0 |
| --- | --- |
| diagnosis | 0 |
| radius_mean | 0 |
| texture_mean | 0 |
| perimeter_mean | 0 |
| area_mean | 0 |
| smoothness_mean | 0 |
| compactness_mean | 0 |
| concavity_mean | 0 |
| concave points_mean | 0 |
| symmetry_mean | 0 |


In [8]:
# Setting up the X value (feature)
X = df.drop(['diagnosis'],axis=1)

X

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |


In [9]:
# Setting up the y value (target)
#y = df.diagnosis

#y

#### ▶️ Outlier Detector

In [10]:
# Creating the outlier detector
outlier_detector = EllipticEnvelope(contamination=.1)

# Fit detector
outlier_detector.fit(X)

# Predict outliers
outlier_detector.predict(X)

array([-1,  1,  1, -1, -1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1, -1,  1,  1,  1,  1,  1, -1, -1,  1,  1,  1, -1, -1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,
        1,  1,  1,  1,  1,  1,  1,  1, -1, -1,  1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1,  1, -1,  1, -1, -1,  1,  1,  1,  1, -1, -1,  1,
        1,  1,  1,  1,  1
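To make the meaning of the `1` and `-1` labels concrete, here is a minimal sketch on synthetic data. The data, the `random_state=0` argument and the variable names are assumptions added for the illustration; they are not part of the analysis above. `EllipticEnvelope` fits a robust Gaussian envelope and places its threshold so that roughly the `contamination` fraction of the training points lies outside it.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
# 200 synthetic two-feature points from a Gaussian; purely illustrative data.
X_toy = rng.normal(size=(200, 2))

detector = EllipticEnvelope(contamination=0.1, random_state=0)
labels = detector.fit_predict(X_toy)  # 1 = inlier, -1 = outlier

n_outliers = int((labels == -1).sum())
print(n_outliers, "of", len(labels), "points flagged as outliers")
```

Because the threshold is derived from `contamination`, the share of flagged rows is close to 10% by construction. It reflects the chosen parameter rather than how many unusual samples the data actually contains.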

In [11]:
df['outlier'] = outlier_detector.predict(X)

df[df['outlier'] == -1].count()

diagnosis                  57
radius_mean                57
texture_mean               57
perimeter_mean             57
area_mean                  57
smoothness_mean            57
compactness_mean           57
concavity_mean             57
concave points_mean        57
symmetry_mean              57
fractal_dimension_mean     57
radius_se                  57
texture_se                 57
perimeter_se               57
area_se                    57
smoothness_se              57
compactness_se             57
concavity_se               57
concave points_se          57
symmetry_se                57
fractal_dimension_se       57
radius_worst               57
texture_worst              57
perimeter_worst            57
area_worst                 57
smoothness_worst           57
compactness_worst          57
concavity_worst            57
concave points_worst       57
symmetry_worst             57
fractal_dimension_worst    57
outlier                    57
dtype: int64

In [12]:
df = df[df['outlier']==1]
df = df.drop(['outlier'],axis=1)

df.head(5)

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 5 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | 0.2087 | ... | 15.47 | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |
| 6 | M | 18.25 | 19.98 | 119.6 | 1040.0 | 0.09463 | 0.109 | 0.1127 | 0.074 | 0.1794 | ... | 22.88 | 27.66 | 153.2 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 |
| 7 | M | 13.71 | 20.83 | 90.2 | 577.9 | 0.1189 | 0.1645 | 0.09366 | 0.05985 | 0.2196 | ... | 17.06 | 28.14 | 110.6 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.1151 |


#### ▶️ Deleting Similar Features

In [13]:
# Create correlation matrix
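# 'diagnosis' still holds strings here, so restrict the correlation to numeric columns
corr_matrix = df.select_dtypes(include='number').corr().abs()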

corr_matrix

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| radius_mean | 1.0 | 0.280126 | 0.997774 | 0.992195 | 0.083748 | 0.433939 | 0.633702 | 0.766074 | 0.103793 | 0.379305 | ... | 0.973466 | 0.284896 | 0.965596 | 0.95825 | 0.091007 | 0.410077 | 0.532201 | 0.709703 | 0.214009 | 0.023511 |
| texture_mean | 0.280126 | 1.0 | 0.286795 | 0.28882 | 0.056316 | 0.199826 | 0.297899 | 0.255039 | 0.060442 | 0.087041 | ... | 0.318125 | 0.917313 | 0.327434 | 0.321102 | 0.066952 | 0.259949 | 0.291724 | 0.255825 | 0.121208 | 0.117023 |
| perimeter_mean | 0.997774 | 0.286795 | 1.0 | 0.990101 | 0.123969 | 0.4892 | 0.677979 | 0.799301 | 0.138205 | 0.330288 | ... | 0.97516 | 0.293764 | 0.973161 | 0.959959 | 0.128727 | 0.458738 | 0.575127 | 0.743085 | 0.240942 | 0.072719 |
| area_mean | 0.992195 | 0.28882 | 0.990101 | 1.0 | 0.084553 | 0.428245 | 0.640254 | 0.769817 | 0.112423 | 0.359521 | ... | 0.972079 | 0.290706 | 0.962994 | 0.971838 | 0.100932 | 0.397799 | 0.526617 | 0.698453 | 0.210934 | 0.02711 |
| smoothness_mean | 0.083748 | 0.056316 | 0.123969 | 0.084553 | 1.0 | 0.664796 | 0.489568 | 0.539374 | 0.540488 | 0.622414 | ... | 0.144975 | 0.012199 | 0.171571 | 0.143753 | 0.810281 | 0.436009 | 0.386526 | 0.468787 | 0.358952 | 0.507146 |
| compactness_mean | 0.433939 | 0.199826 | 0.4892 | 0.428245 | 0.664796 | 1.0 | 0.881245 | 0.81619 | 0.562981 | 0.563008 | ... | 0.485854 | 0.235099 | 0.548697 | 0.473939 | 0.616166 | 0.876389 | 0.822589 | 0.80593 | 0.501086 | 0.729839 |
| concavity_mean | 0.633702 | 0.297899 | 0.677979 | 0.640254 | 0.489568 | 0.881245 | 1.0 | 0.922579 | 0.458114 | 0.301189 | ... | 0.671702 | 0.333099 | 0.72043 | 0.670488 | 0.498917 | 0.803903 | 0.906278 | 0.872848 | 0.445182 | 0.577485 |
| concave points_mean | 0.766074 | 0.255039 | 0.799301 | 0.769817 | 0.539374 | 0.81619 | 0.922579 | 1.0 | 0.442535 | 0.159398 | ... | 0.797499 | 0.294018 | 0.825561 | 0.792322 | 0.498073 | 0.689489 | 0.77307 | 0.912646 | 0.42684 | 0.425719 |
| symmetry_mean | 0.103793 | 0.060442 | 0.138205 | 0.112423 | 0.540488 | 0.562981 | 0.458114 | 0.442535 | 1.0 | 0.447305 | ... | 0.157319 | 0.086559 | 0.19092 | 0.160183 | 0.424885 | 0.430359 | 0.377014 | 0.393283 | 0.696474 | 0.422251 |
| fractal_dimension_mean | 0.379305 | 0.087041 | 0.330288 | 0.359521 | 0.622414 | 0.563008 | 0.301189 | 0.159398 | 0.447305 | 1.0 | ... | 0.305997 | 0.056652 | 0.254683 | 0.285266 | 0.55607 | 0.424801 | 0.293802 | 0.156796 | 0.270397 | 0.754554 |


In [14]:
# Select upper triangle of correlation matrix
# np.bool was removed from recent NumPy releases; the built-in bool works everywhere
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

upper

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| radius_mean | NaN | 0.280126 | 0.997774 | 0.992195 | 0.083748 | 0.433939 | 0.633702 | 0.766074 | 0.103793 | 0.379305 | ... | 0.973466 | 0.284896 | 0.965596 | 0.95825 | 0.091007 | 0.410077 | 0.532201 | 0.709703 | 0.214009 | 0.023511 |
| texture_mean | NaN | NaN | 0.286795 | 0.28882 | 0.056316 | 0.199826 | 0.297899 | 0.255039 | 0.060442 | 0.087041 | ... | 0.318125 | 0.917313 | 0.327434 | 0.321102 | 0.066952 | 0.259949 | 0.291724 | 0.255825 | 0.121208 | 0.117023 |
| perimeter_mean | NaN | NaN | NaN | 0.990101 | 0.123969 | 0.4892 | 0.677979 | 0.799301 | 0.138205 | 0.330288 | ... | 0.97516 | 0.293764 | 0.973161 | 0.959959 | 0.128727 | 0.458738 | 0.575127 | 0.743085 | 0.240942 | 0.072719 |
| area_mean | NaN | NaN | NaN | NaN | 0.084553 | 0.428245 | 0.640254 | 0.769817 | 0.112423 | 0.359521 | ... | 0.972079 | 0.290706 | 0.962994 | 0.971838 | 0.100932 | 0.397799 | 0.526617 | 0.698453 | 0.210934 | 0.02711 |
| smoothness_mean | NaN | NaN | NaN | NaN | NaN | 0.664796 | 0.489568 | 0.539374 | 0.540488 | 0.622414 | ... | 0.144975 | 0.012199 | 0.171571 | 0.143753 | 0.810281 | 0.436009 | 0.386526 | 0.468787 | 0.358952 | 0.507146 |
| compactness_mean | NaN | NaN | NaN | NaN | NaN | NaN | 0.881245 | 0.81619 | 0.562981 | 0.563008 | ... | 0.485854 | 0.235099 | 0.548697 | 0.473939 | 0.616166 | 0.876389 | 0.822589 | 0.80593 | 0.501086 | 0.729839 |
| concavity_mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.922579 | 0.458114 | 0.301189 | ... | 0.671702 | 0.333099 | 0.72043 | 0.670488 | 0.498917 | 0.803903 | 0.906278 | 0.872848 | 0.445182 | 0.577485 |
| concave points_mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.442535 | 0.159398 | ... | 0.797499 | 0.294018 | 0.825561 | 0.792322 | 0.498073 | 0.689489 | 0.77307 | 0.912646 | 0.42684 | 0.425719 |
| symmetry_mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.447305 | ... | 0.157319 | 0.086559 | 0.19092 | 0.160183 | 0.424885 | 0.430359 | 0.377014 | 0.393283 | 0.696474 | 0.422251 |
| fractal_dimension_mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.305997 | 0.056652 | 0.254683 | 0.285266 | 0.55607 | 0.424801 | 0.293802 | 0.156796 | 0.270397 | 0.754554 |


In [15]:
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]

to_drop

['perimeter_mean',
 'area_mean',
 'concave points_mean',
 'perimeter_se',
 'area_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'concavity_worst',
 'concave points_worst']
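The three cells above (absolute correlation matrix, upper triangle, threshold) can be wrapped in one small helper. The following is a minimal sketch on synthetic data; the function name `highly_correlated_columns` and the toy columns `a`, `b` and `c` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_columns(frame: pd.DataFrame, threshold: float = 0.90) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only entries strictly above the diagonal so each pair is looked at once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),  # near-duplicate of "a"
    "c": rng.normal(size=100),                       # unrelated to "a"
})

to_drop_toy = highly_correlated_columns(toy)
reduced = toy.drop(columns=to_drop_toy)
print(to_drop_toy)            # ['b']
print(list(reduced.columns))  # ['a', 'c']
```

Using only the upper triangle means that, within each highly correlated pair, the column that appears later in the frame is the one that gets dropped, so the result depends on column order.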

In [16]:
# drop() returns a new DataFrame, so the result has to be assigned back to df
df = df.drop(columns=to_drop)

#X.shape

df.head(5)

| | diagnosis | radius_mean | texture_mean | smoothness_mean | compactness_mean | concavity_mean | symmetry_mean | ... | smoothness_worst | compactness_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | M | 20.57 | 17.77 | 0.08474 | 0.07864 | 0.0869 | 0.1812 | ... | 0.1238 | 0.1866 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 0.1096 | 0.1599 | 0.1974 | 0.2069 | ... | 0.1444 | 0.4245 | 0.3613 | 0.08758 |
| 5 | M | 12.45 | 15.7 | 0.1278 | 0.17 | 0.1578 | 0.2087 | ... | 0.1791 | 0.5249 | 0.3985 | 0.1244 |
| 6 | M | 18.25 | 19.98 | 0.09463 | 0.109 | 0.1127 | 0.1794 | ... | 0.1442 | 0.2576 | 0.3063 | 0.08368 |
| 7 | M | 13.71 | 20.83 | 0.1189 | 0.1645 | 0.09366 | 0.2196 | ... | 0.1654 | 0.3682 | 0.3196 | 0.1151 |


#### ▶️ Manual Adjustments

In [17]:
# Identifying all features that aren't int64 or float64
df.select_dtypes(exclude=['int64','float64']).columns

Index(['diagnosis'], dtype='object')

In [18]:
# Note: Diagnosis (M = malignant, B = benign)
df['diagnosis'] = df['diagnosis'].map({'M':1,'B':0})

df.head()

| | diagnosis | radius_mean | texture_mean | smoothness_mean | compactness_mean | concavity_mean | symmetry_mean | ... | smoothness_worst | compactness_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 20.57 | 17.77 | 0.08474 | 0.07864 | 0.0869 | 0.1812 | ... | 0.1238 | 0.1866 | 0.275 | 0.08902 |
| 2 | 1 | 19.69 | 21.25 | 0.1096 | 0.1599 | 0.1974 | 0.2069 | ... | 0.1444 | 0.4245 | 0.3613 | 0.08758 |
| 5 | 1 | 12.45 | 15.7 | 0.1278 | 0.17 | 0.1578 | 0.2087 | ... | 0.1791 | 0.5249 | 0.3985 | 0.1244 |
| 6 | 1 | 18.25 | 19.98 | 0.09463 | 0.109 | 0.1127 | 0.1794 | ... | 0.1442 | 0.2576 | 0.3063 | 0.08368 |
| 7 | 1 | 13.71 | 20.83 | 0.1189 | 0.1645 | 0.09366 | 0.2196 | ... | 0.1654 | 0.3682 | 0.3196 | 0.1151 |


In [19]:
# Setting up the X value (feature)
X = df.drop(['diagnosis'],axis=1)

# Setting up the y value (target)
y = df.diagnosis
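One thing to watch with the label mapping: `Series.map` returns `NaN` for any value that is not a key of the dictionary, which also happens if the mapping cell is run a second time on labels that are already 0 and 1. The sketch below shows the encoding and the feature/target split with a guard against that. It uses a made-up toy frame, and the names `toy`, `X_toy` and `y_toy` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [20.57, 11.42, 12.45],
})

encoded = toy["diagnosis"].map({"M": 1, "B": 0})
# Labels missing from the dictionary become NaN, so check before casting to int.
if encoded.isna().any():
    raise ValueError("unexpected diagnosis label")
toy["diagnosis"] = encoded.astype(int)

X_toy = toy.drop(columns=["diagnosis"])  # feature matrix
y_toy = toy["diagnosis"]                 # target vector

print(y_toy.tolist())  # [1, 0, 0]
print(X_toy.shape)     # (3, 1)
```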