# Breast Cancer Classifier with Logistic Regression 
This notebook covers training a breast cancer classifier with logistic regression.

Logistic regression is used to classify instances based on the values of their predictor variables. The output is the probability that an input data item belongs to a certain class (compare the support vector machine, where the output is the single class that best fits the input data item).
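As a rough illustration of how a probability comes out of the predictor values, the sketch below computes a linear score and passes it through the sigmoid function. The weights, intercept and input values are made up for illustration and are not taken from the model trained later in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept for two features (illustrative values only)
weights = np.array([0.8, -0.4])
intercept = -0.5

# One hypothetical input data item
x = np.array([2.0, 1.0])

# Linear score, then the probability of the positive class
score = np.dot(weights, x) + intercept
prob_positive = sigmoid(score)

# Class decision with the usual 0.5 threshold
predicted_class = int(prob_positive >= 0.5)
print(prob_positive, predicted_class)
```

The threshold of 0.5 is only a default; moving it changes which errors the classifier makes.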

## Load and Prepare Datasets
Import the standard libraries and basic functions needed to load the data sets into the notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load data in the data frame
df_breastCancer = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

## Exploratory Analysis

In [None]:
# analyse the shape of the breastcancer dataframe
print('shape of breast cancer.csv:', df_breastCancer.shape)

In [None]:
df_breastCancer.head(3)

In [None]:
# looking for possible target values as boolean or object data type
df_breastCancer.dtypes

In [None]:
# transform the diagnosis character into a binary value for array calculation
# M = malignant, B = benign
# (map avoids chained assignment, which pandas may apply to a copy)
df_breastCancer['diagnosis'] = df_breastCancer['diagnosis'].map({'M': 1, 'B': 0})

# switch data type to bool for a clear training target value
df_breastCancer['diagnosis'] = df_breastCancer['diagnosis'].astype('bool')

# Check transform result
print(df_breastCancer.diagnosis)

## Logistic Regression with Scikit-Learn
The exploratory analysis identifies 'diagnosis' as the target value. The features for the training set are the cell attributes of the Wisconsin breast cancer data set.

In [None]:
# build features X and target variable y ('area_mean' is listed only once)
X = df_breastCancer[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
y = df_breastCancer['diagnosis']
X.shape, y.shape

In [None]:
# Split into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

# check the split by array shape
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# print(y_train)

### Principal Component Analysis
Principal Component Analysis (PCA) can be used to reduce the number of features by identifying the relevant ones, and to find and understand dependencies between features. In this case, the number of features is manageable, but I would like to see whether there are dependencies between some features and how strong they are.
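To make the idea of dependencies between features concrete, here is a minimal sketch on synthetic data, where three size-related features depend on one another and a fourth is independent. The data, the variable names and the standardization step are assumptions made for this illustration. PCA is sensitive to feature scale, so the sketch scales the columns first, which the notebook cell that follows does not do.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the Wisconsin measurements):
# 'radius' drives 'perimeter' and 'area', 'texture' is independent noise
radius = rng.normal(14.0, 3.0, size=200)
perimeter = 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200)
area = np.pi * radius**2 + rng.normal(0.0, 5.0, size=200)
texture = rng.normal(19.0, 4.0, size=200)
X_demo = np.column_stack([radius, perimeter, area, texture])

# Put all features on a comparable scale so large-valued columns do not dominate
X_scaled = StandardScaler().fit_transform(X_demo)

# Correlation matrix: values near 1 or -1 indicate strongly dependent features
corr = np.corrcoef(X_scaled, rowvar=False)
print(np.round(corr, 2))

# Keep as many components as needed to explain 90% of the variance
pca_demo = PCA(n_components=0.9)
X_reduced = pca_demo.fit_transform(X_scaled)
print('components kept:', pca_demo.n_components_)
print('explained variance ratio:', np.round(pca_demo.explained_variance_ratio_, 3))
```

In this toy setup the three dependent features collapse into a single component, so far fewer components than features are needed to reach the 90% target.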

In [None]:
from sklearn.decomposition import PCA

# instantiate the class PCA so that the components retain 90% of the variance of the original data
pca = PCA(n_components=0.9)

# fit the model on the training data, then transform the training and test data
X_train_fact = pca.fit_transform(X_train)
X_test_fact = pca.transform(X_test)

# get the result in exp_var as a NumPy array
exp_var = pca.explained_variance_ratio_
sum_exp_var = sum(exp_var)

print('explained variance by factor:', exp_var)
print('sum explained variance all factors: {:.3f}'.format(sum_exp_var))

In [None]:
# instantiation and training of the data model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluation of the data model with a confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix
y_test_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_test_pred)
matrix = confusion_matrix(y_test, y_test_pred)
print('confusion matrix: ', '\n',matrix)
print('accuracy: ', accuracy)

### Conclusion and Increasing Reliability

The accuracy of the trained logistic regression model is 89%. The confusion matrix shows that 73 cases are classified correctly, with 3 errors. 29 diagnoses are cancer-positive, and 9 cancer diagnoses are incorrect.
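The 89% figure can be checked by hand from the four counts quoted above. The short sketch below only sums them; the variable names are illustrative, and no assumption is made about which cell of the matrix each count occupies.

```python
# Counts quoted in the text above
correct_first_group = 73
errors_first_group = 3
correct_second_group = 29
errors_second_group = 9

# Total number of test cases covered by the confusion matrix
total = (correct_first_group + errors_first_group
         + correct_second_group + errors_second_group)

# Accuracy = correctly classified cases / all cases
accuracy_check = (correct_first_group + correct_second_group) / total
print(total, round(accuracy_check, 4))  # 114 0.8947
```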

To exclude the 3 missed cancer diagnoses and eliminate the risk of overlooked cancer, the prediction threshold is increased to 99%: a case is classified as benign only if the predicted probability of the benign class exceeds 99%.


In [None]:
# sample input row (not used in this cell)
X_pred = [[6., 3., 4., 1., 5., 2., 3., 9., 1.]]

# get the class probabilities for the test set (the model is not retrained)
y_test_pred_proba = model.predict_proba(X_test)

# predict benign (0) only if its probability exceeds 99%, otherwise malignant (1)
y_test_pred_proba99 = [0 if prob[0] > .99 else 1 for prob in y_test_pred_proba]

accuracy99 = accuracy_score(y_test, y_test_pred_proba99)
matrix99 = confusion_matrix(y_test, y_test_pred_proba99)
print('confusion matrix: ', '\n',matrix99)
print('accuracy: ', accuracy99)

The missed diagnoses are excluded, but the accuracy decreases by over 50%. This has a high impact on the diagnosis, but it is a big gain for reliability.
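The trade-off can be made visible by sweeping several thresholds instead of jumping straight to 99%. The following is a minimal sketch on synthetic data from `make_classification`, not on the Wisconsin data set. The names `X_demo`, `clf` and `p_benign` and the chosen thresholds are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; class 1 plays the role of 'malignant'
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     weights=[0.63, 0.37], random_state=11)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=11)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 0 of predict_proba is P(class 0), the 'benign' probability in this setup
p_benign = clf.predict_proba(X_te)[:, 0]

results = []
for threshold in [0.5, 0.7, 0.9, 0.99]:
    # Predict benign (0) only when its probability exceeds the threshold, otherwise 1
    y_pred = np.where(p_benign > threshold, 0, 1)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred, labels=[0, 1]).ravel()
    acc = accuracy_score(y_te, y_pred)
    results.append((threshold, fn, fp, acc))
    print(f'threshold={threshold:.2f}  missed positives={fn}  false alarms={fp}  accuracy={acc:.3f}')
```

Raising the threshold can only move predictions from class 0 to class 1, so the number of missed positives never increases while the number of false alarms never decreases. An intermediate threshold may keep more accuracy than 99% while still catching most positive cases.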
