# Breast Cancer Detection
You belong to the data team at a local research hospital. You've been tasked with developing a means to help doctors diagnose breast cancer. You've been given data about biopsied breast cells, including whether each cell is benign (not harmful) or malignant (cancerous).

- What features of a cell are the largest drivers of malignancy? 
- Build a model that predicts whether a given biopsied breast cell is benign or malignant.
- Which features drive the false positive rate of the model you built above, and which drive its false negative rate?
- How would a physician use your product?
- There is a non-zero cost in time and money to collect each feature about a given cell. How would you go about determining the most cost-effective method of detecting malignancy?

## Data:

| Name | Range or Description |
| :- | :- |
| Sample code number | id number |
| Clump Thickness | 1-10 |
| Uniformity of Cell Size | 1-10 |
| Uniformity of Cell Shape | 1-10 |
| Marginal Adhesion | 1-10 |
| Single Epithelial Cell Size | 1-10 |
| Bare Nuclei | 1-10 |
| Bland Chromatin | 1-10 |
| Normal Nucleoli | 1-10 |
| Mitoses | 1-10 |
| Class | (2 for benign, 4 for malignant) |







In [13]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
%matplotlib inline

# Dataset Information
Let's explore the dataset.

In [2]:
# Read the CSV file
df = pd.read_csv('breast-cancer-wisconsin.txt', sep=',', index_col='Index', header=0)
# Display the first and last few rows of the dataframe
df

| Index | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
| :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0 | 1241035 | 7 | 8 | 3 | 7 | 4 | 5 | 7 | 8 | 2 | 4 |
| 1 | 1107684 | 6 | 10 | 5 | 5 | 4 | 10 | 6 | 10 | 1 | 4 |
| 2 | 691628 | 8 | 6 | 4 | 10 | 10 | 1 | 3 | 5 | 1 | 4 |
| 3 | 1226612 | 7 | 5 | 6 | 3 | 3 | 8 | 7 | 4 | 1 | 4 |
| 4 | 1142706 | 5 | 10 | 10 | 10 | 6 | 10 | 6 | 5 | 2 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15850 | 1169049 | 7 | 3 | 4 | 4 | 3 | 3 | 3 | 2 | 7 | 4 |
| 15851 | 1076352 | 3 | 6 | 4 | 10 | 3 | 3 | 3 | 4 | 1 | 4 |
| 15852 | 1107684 | 6 | 10 | 5 | 5 | 4 | 10 | 6 | 10 | 1 | 4 |
| 15853 | 1111249 | 10 | 6 | 6 | 3 | 4 | 5 | 3 | 6 | 1 | 4 |


# Check dataset information
Let's find non-standard items in the dataset (if any).

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15855 entries, 0 to 15854
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   ID                           15855 non-null  int64 
 1   Clump Thickness              15855 non-null  int64 
 2   Uniformity of Cell Size      15827 non-null  object
 3   Uniformity of Cell Shape     15827 non-null  object
 4   Marginal Adhesion            15827 non-null  object
 5   Single Epithelial Cell Size  15827 non-null  object
 6   Bare Nuclei                  15827 non-null  object
 7   Bland Chromatin              15827 non-null  object
 8   Normal Nucleoli              15827 non-null  object
 9   Mitoses                      15827 non-null  object
 10  Class                        15827 non-null  object
dtypes: int64(2), object(9)
memory usage: 1.5+ MB


Although the data is supposed to contain only numerical values, pandas reads nine of the columns as `object`, which means they hold non-numeric entries. Let's find the unwanted values in each column.

In [4]:
# Print the unique values of every column named in `columns`
def print_unique(columns):
    for col in columns:
        print('Unique values in column "{}"":'.format(col))
        print(df[col].unique(), '\n')

# Inspect the object-typed columns only: skip ID and Clump Thickness, which already parsed as int64
print_unique(np.delete(df.columns.values, [0, 1], axis=0))

Unique values in column "Uniformity of Cell Size"":
['8' '10' '6' '5' '4' '9' '3' '1' 'No idea' '2' '7' '50' '100' '30' nan
 '#' '?' '80' '40' '60' '90' '20'] 

Unique values in column "Uniformity of Cell Shape"":
['3' '5' '4' '6' '10' '7' '1' 'No idea' '2' '60' '100' '40' nan '#' '9'
 '8' '?' '30' '50' '70'] 

Unique values in column "Marginal Adhesion"":
['7' '5' '10' '3' '6' '4' '1' '2' 'No idea' '30' '40' nan '#' '8' '?' '70'
 '60' '100' '50' '9' '20'] 

Unique values in column "Single Epithelial Cell Size"":
['4' '10' '3' '6' '2' '8' 'No idea' '1' '30' nan '5' '#' '?' '40' '20' '7'
 '60' '80' '100' '9'] 

Unique values in column "Bare Nuclei"":
['5' '10' '1' '8' '2' '3' '6' 'No idea' '?' '80' '60' '30' nan '100' '#'
 '9' '50' '7' '4' '20'] 

Unique values in column "Bland Chromatin"":
['7' '6' '3' '2' '4' 'No idea' '1' '5' '70' '30' nan '40' '#' '10' '8' '?'
 '20' '9' '60' '50'] 

Unique values in column "Normal Nucleoli"":
['8' '10' '5' '4' '3' '7' '2' '6' '9' '1' 'No idea' '40' 

OK! Let's replace \['No idea', '#', '?'\] with np.nan:

In [5]:
df = df.replace(['No idea', '#', '?'], np.nan)
print_unique(np.delete(df.columns.values, [0,1], axis=0))

Unique values in column "Uniformity of Cell Size"":
['8' '10' '6' '5' '4' '9' '3' '1' nan '2' '7' '50' '100' '30' '80' '40'
 '60' '90' '20'] 

Unique values in column "Uniformity of Cell Shape"":
['3' '5' '4' '6' '10' '7' '1' nan '2' '60' '100' '40' '9' '8' '30' '50'
 '70'] 

Unique values in column "Marginal Adhesion"":
['7' '5' '10' '3' '6' '4' '1' '2' nan '30' '40' '8' '70' '60' '100' '50'
 '9' '20'] 

Unique values in column "Single Epithelial Cell Size"":
['4' '10' '3' '6' '2' '8' nan '1' '30' '5' '40' '20' '7' '60' '80' '100'
 '9'] 

Unique values in column "Bare Nuclei"":
['5' '10' '1' '8' '2' '3' '6' nan '80' '60' '30' '100' '9' '50' '7' '4'
 '20'] 

Unique values in column "Bland Chromatin"":
['7' '6' '3' '2' '4' nan '1' '5' '70' '30' '40' '10' '8' '20' '9' '60'
 '50'] 

Unique values in column "Normal Nucleoli"":
['8' '10' '5' '4' '3' '7' '2' '6' '9' '1' nan '40' '90' '20' '30' '80'
 '60' '50' '100' '70'] 

Unique values in column "Mitoses"":
['2' '1' '7' nan '3' '10' '70' '8

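The dataset looks much better already! Next, we drop the rows with nan values, convert every column to a numeric type, and check the value counts of the Class column. Rows whose Class is 20 or 40 are dropped here; converting them to 2 and 4 instead is left commented out in the code. Note that this cell leaves feature values above 10 (20 through 100) untouched.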

In [6]:
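# Drop rows with missing values, then convert every column to a numeric type
df = df.dropna()
df = df.apply(pd.to_numeric)
# Drop rows whose Class is 20 or 40 (outside the documented labels 2 and 4)
df = df[~df['Class'].isin([20, 40])].copy()
# Alternative: map them back onto the documented labels instead of dropping them
# df['Class'] = df['Class'].replace({20: 2, 40: 4})
df['Class'] = df['Class'].astype('category')
df['Class'].value_counts()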


4    15162
2      442
Name: Class, dtype: int64

# Exploratory Data Analysis

Let's plot the columns in the data frame against each other. 

In [7]:

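# Pairwise plots of every feature (ID excluded), colored by Class
g = sns.PairGrid(df.drop('ID', axis=1), hue='Class')
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()
# Save before showing: outside a notebook, plt.show() can leave an empty figure behind
g.savefig('res.png')
plt.show()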
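
A pair plot with this many panels can be hard to read, so a numeric summary is a useful companion. The sketch below compares per-class feature means on a small toy frame. The numbers are made up for illustration and say nothing about the real data, and a gap in means is only a rough first look at which features separate the classes, not a measure of what drives malignancy.

```python
import pandas as pd

# Toy data standing in for the cleaned dataframe (illustrative values only)
toy = pd.DataFrame({
    'Clump Thickness': [1, 2, 8, 9, 3, 10],
    'Mitoses':         [1, 1, 2, 7, 1, 3],
    'Class':           [2, 2, 4, 4, 2, 4],
})

# Mean of every feature within each class
class_means = toy.groupby('Class').mean()

# Difference between the two class means, largest gap first
gap = (class_means.loc[4] - class_means.loc[2]).sort_values(ascending=False)
print(class_means)
print(gap)
```

On the real dataframe, `ID` would be dropped first. Since `Class` is stored as a category there, passing `observed=True` to `groupby` keeps only the labels that actually occur.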

# Classification

In [14]:
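# Features: every column except the identifier and the target itself
X = df.drop(columns=['ID', 'Class'])
y = df['Class']
# Stratify so both splits keep the same class proportions; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)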
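
With the split in place, one possible next step is to fit one of the classifiers imported at the top and look at its ROC AUC, confusion matrix, and feature importances. The following is a minimal sketch on synthetic data, not the notebook's actual model: the generated features, `n_estimators=100`, and the `random_state` values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in: rows labelled 4 tend to score higher on one feature
y = pd.Series(rng.choice([2, 4], size=n, p=[0.3, 0.7]), name='Class')
shift = np.where(y == 4, 3, 0)
X = pd.DataFrame({
    'Clump Thickness': np.clip(rng.integers(1, 8, size=n) + shift, 1, 10),
    'Mitoses': rng.integers(1, 11, size=n),  # pure noise, unrelated to y
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predicted probability of the class labelled 4
col_4 = list(model.classes_).index(4)
proba_4 = model.predict_proba(X_test)[:, col_4]
auc = roc_auc_score(y_test == 4, proba_4)

# Rows are true labels, columns are predicted labels, both ordered [2, 4]
cm = confusion_matrix(y_test, model.predict(X_test), labels=[2, 4])

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(model.feature_importances_, index=X.columns)

print('ROC AUC:', auc)
print(cm)
print(importances.sort_values(ascending=False))
```

The off-diagonal cells of the confusion matrix are the misclassified rows, which is where an analysis of false positives and false negatives would start. Impurity-based importances can overstate noisy features with many distinct values, so they are best read as a rough ranking.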