# Breast Cancer Prediction - Logistic Regression & Random Forest + RFECV

The features in this dataset are computed from a digitized image of a [fine needle aspirate (FNA)](https://www.worcsacute.nhs.uk/pathology/pathology-fine-needle-aspiration) of a breast mass. It consists of 33 attributes and 569 subjects. We will classify each mass as malignant or benign with logistic regression and random forest. We will also perform Recursive Feature Elimination with Cross-Validation (RFECV) for both models to reduce the number of features and identify the best ones.
Data from Kaggle Dataset: [Breast Cancer Wisconsin (Diagnostic) Data set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)

### Objective
* We will try to predict whether a breast mass is malignant or benign from features computed from a digitized image of a fine needle aspirate (FNA).

### Techniques
* We will build models using Logistic Regression and Random Forest Classifier.
* We will use Recursive Feature Elimination, Cross-validated (RFECV) feature selection to choose the best subset for the score of the model. 

#### What is Fine Needle Aspiration (FNA)?  
* [Fine Needle Aspiration (FNA)](https://www.myvmc.com/investigations/fine-needle-aspiration-biopsy-fna/) is usually performed on a suspicious lump when an abnormality is found on a test such as an X-ray, ultrasound, or mammography.

### Table of Contents
1. [Data Description](#1.-Data-Description)
2. [Data Preparation](#2.-Data-Preparation)<br>
    2-1. [Import Libraries](#2-1.-Import-Libraries)<br>
    2-2. [Load Dataset](#2-2.-Load-Dataset)<br>
    2-3. [Preview Data](#2-3.-Preview-Data)<br>
3. [Data Cleaning](#3.-Data-Cleaning)<br>
    3-1. [Check Missing Values](#3-1.-Check-Missing-Values)<br>
    3-2. [Feature Selection](#3-2.-Feature-Selection)<br>
    3-3. [Encode Categorical Data](#3-3.-Encode-Categorical-Data)<br>
    3-4. [Recheck the Cleaned Data](#3-4.-Recheck-the-Cleaned-Data)<br>
4. [Data Visualization](#4.-Data-Visualization)<br>
    4-1. [Malignant vs Benign](#4-1.-Malignant-vs-Benin)<br>
    4-2. [Distribution of Features](#4-2.-Distribution-of-Features)<br>
    4-3. [Correlation Heatmap](#4-3.-Correlation-Heatmap)<br>
    4-4. [Feature Scaling](#4-4.-Feature-Scaling)<br>
    4-5. [Mean Features vs Diagnosis](#4-5.-Mean-Features-vs-Diagnosis)<br>
    4-6. [Standard Error Features vs Diagnosis](#4-6.-Standard-Error-Features-vs-Diagnosis)<br>
    4-7. [Worst Features vs Diagnosis](#4-7.-Worst-Features-vs-Diagnosis)<br>
5. [Training and Testing Data Split](#5.-Training-and-Testing-Data-Split)<br>
6. [Model Building](#6.-Model-Building)<br>
    6-1. [Logistic Regression](#6-1.-Logistic-Regression)<br>
    6-2. [Random Forest](#6-2.-Random-Forest)<br>  
7. [Recursive Feature Elimination with Cross-validation (RFECV)](#7.-Recursive-Feature-Elimination-with-Cross-validation-(RFECV))<br>
    7-1. [RFECV for Logistic Regression](#7-1.-RFECV-for-Logistic-Regression)  
    7-2. [RFECV for Random Forest](#7-2.-RFECV-for-Random-Forest)
8. [Confusion Matrix](#8.-Confusion-Matrix)<br>


# 1. Data Description
This dataset is computed from a digitized image of a [fine needle aspiration (FNA)](https://www.worcsacute.nhs.uk/pathology/pathology-fine-needle-aspiration) of a breast mass. It consists of 33 attributes and 569 subjects. 

The dataset from Kaggle: [Breast Cancer Wisconsin (Diagnostic) Data set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)

### Attribute Information:
1. <span style="color:blue">**id**</span>: ID number
2. <span style="color:blue">**diagnosis**</span>: M = Malignant, B = Benign  

10 real-valued features are computed for each cell nucleus:  
3. <span style="color:blue">**radius**</span>: mean of distances from center to points on the perimeter  
4. <span style="color:blue">**texture**</span>: standard deviation of gray-scale values  
5. <span style="color:blue">**perimeter**</span>:  
6. <span style="color:blue">**area**</span>:  
7. <span style="color:blue">**smoothness**</span>: local variation in radius lengths  
8. <span style="color:blue">**compactness**</span>: perimeter^2 / area - 1.0  
9. <span style="color:blue">**concavity**</span>: severity of concave portions of the contour  
10. <span style="color:blue">**concave points**</span>: number of concave portions of the contour  
11. <span style="color:blue">**symmetry**</span>:  
12. <span style="color:blue">**fractal dimension**</span>: "coastline approximation" - 1  

The <span style="color:blue">**mean**</span>, <span style="color:blue">**standard error**</span> and "<span style="color:blue">**worst**</span>" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
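
The naming scheme described above can be sketched in a few lines. This is only an illustration: it assumes the Kaggle CSV names its columns `<feature>_<statistic>` (for example `radius_mean`, `radius_se`, `radius_worst`), which matches the column names used later in this notebook. The list names `base_features`, `statistics` and `feature_columns` are made up for the example.

```python
# Ten base measurements computed for each cell nucleus
base_features = [
    'radius', 'texture', 'perimeter', 'area', 'smoothness',
    'compactness', 'concavity', 'concave points', 'symmetry',
    'fractal_dimension',
]
# Three summary statistics computed per image
statistics = ['mean', 'se', 'worst']

# Statistic-major order: all means first, then all SEs, then all worst values
feature_columns = [f'{name}_{stat}' for stat in statistics for name in base_features]

print(len(feature_columns))  # 30
# Positions 0, 10 and 20 correspond to fields 3, 13 and 23 (after id and diagnosis)
print(feature_columns[0], feature_columns[10], feature_columns[20])
```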

# 2. Data Preparation
### 2-1. Import Libraries
We will use the matplotlib and seaborn libraries for data visualization and scikit-learn for building the machine learning models. 

In [None]:
# For data processing and analysis
import numpy as np 
import pandas as pd 

# For data visualization
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set(style='darkgrid')
import plotly.graph_objs as go
import plotly.offline as py

# For preprocessing dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# For model building
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Recursive Feature Elimination with Cross-Validation
# To identify the best features by reducing less important features
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# For model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
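from sklearn.metrics import RocCurveDisplay, auc
# plot_roc_curve was removed in scikit-learn 1.2;
# RocCurveDisplay.from_estimator takes the same (model, X, y) arguments
plot_roc_curve = RocCurveDisplay.from_estimator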

### 2-2. Load Dataset

In [None]:
# Load Dataset as CSV file
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

### 2-3. Preview Data
First, let's take a look at the dataset.

In [None]:
df.head()

In [None]:
# Variable identification
df.info()

df.describe()

As the data types show, most of the variables are continuous. The exception is **'diagnosis'**, a categorical variable that is the target our model will predict. 
Next, we will clean the dataset and then explore the relationships between the features.

# 3. Data Cleaning
### 3-1. Check Missing Values
Let's check whether the dataset contains any missing values.

In [None]:
df.isnull().sum()

In [None]:
df['Unnamed: 32'].unique()

### 3-2. Feature Selection 
We will remove the <span style="color:blue">**'Unnamed: 32'**</span> column, which contains only missing values ('NaN').
The <span style="color:blue">**'id'**</span> column will also be removed, because an identifier carries no information about the diagnosis and we want the model to learn general patterns. 

### 3-3. Encode Categorical Data

In [None]:
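def drop_and_encode_features(data):
    """
    - Drop 'id' and 'Unnamed: 32' columns
    - Encode 'diagnosis' as a numerical variable (B -> 0, M -> 1)
    """
    # Note: both steps modify the DataFrame in place
    data.drop(columns=['id', 'Unnamed: 32'], inplace=True)
    
    # LabelEncoder assigns codes in sorted label order: 'B' -> 0, 'M' -> 1
    label = LabelEncoder()
    data['diagnosis'] = label.fit_transform(data['diagnosis'])
    return data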
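
`LabelEncoder` assigns integer codes in sorted label order, which matters later because the plots treat `diagnosis == 1` as malignant. Here is a minimal sketch on toy labels (the `labels` list is made up and merely stands in for the real column):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'diagnosis' column
labels = ['M', 'B', 'B', 'M', 'B']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1, 0]
```

If the original letters are needed again, `encoder.inverse_transform` maps the codes back to `'B'` and `'M'`.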

### 3-4. Recheck the Cleaned Data

In [None]:
df.head()

In [None]:
df['diagnosis'].value_counts()

# 4. Data Visualization
**Exploratory Data Analysis (EDA)** is the process of visualizing, summarizing, and interpreting a dataset to discover insights, patterns, and statistical properties.



### 1. Diagnosis (Malignant vs Benign)
The pie chart below shows the frequency of **'Malignant'** and **'Benign'** in <span style="color:blue">'diagnosis'</span>.

In [None]:
def get_freq_diagnosis(data):
    """
    Visualize the frequency of 'Malignant' and 'Benign' in 'diagnosis' using pie chart
    """
    result = data['diagnosis'].value_counts()
    values = [result['M'], result['B']]
    labels = ['Malignant', 'Benign']
    trace = go.Pie(labels=labels, values=values)
    py.iplot([trace])
    
get_freq_diagnosis(df);

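The pie chart above shows that about 62.7% of the samples are benign and 37.3% are malignant, so the classes are **imbalanced**. Accuracy alone can therefore be misleading, which is why precision and recall are also reported for each model. 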
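
The class shares can also be checked numerically. The sketch below does not read the CSV. It assumes the documented class counts for this dataset (357 benign and 212 malignant) and rebuilds a label column from them, and the variable names are illustrative only.

```python
import pandas as pd

# Assumed class counts: 357 benign (B) and 212 malignant (M) samples
diagnosis = pd.Series(['B'] * 357 + ['M'] * 212, name='diagnosis')

# normalize=True returns fractions instead of raw counts
share = diagnosis.value_counts(normalize=True).mul(100).round(1)
print({label: float(pct) for label, pct in share.items()})  # {'B': 62.7, 'M': 37.3}
```

A classifier that always predicts "benign" would already reach about 62.7% accuracy on such data, which is a useful baseline to keep in mind when reading the scores later on.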

In [None]:
# Apply the function to drop the unused columns and encode 'diagnosis' as a numerical variable
drop_and_encode_features(df);

In [None]:
# Check the statistical data
df.describe()

### 2. Distribution of Features

In [None]:
mean_features = df.loc[:, df.columns.str.contains('_mean')]
se_features = df.loc[:, df.columns.str.contains('_se')]
worst_features = df.loc[:, df.columns.str.contains('_worst')]

In [None]:
def histograms(feature_data):
    """
    Plot the distribution of each feature, annotated with its skewness and mean
    """
    fig = plt.figure(figsize=(15, 10))
    for idx, feature in enumerate(feature_data.columns):
        ax = plt.subplot(5, 2, idx+1)
        # histplot with kde=True draws a density histogram plus a smoothed curve
        sns.histplot(feature_data[feature], bins=20, kde=True, stat='density',
                     label='skewness: %.2f'%(feature_data[feature].skew()))
        ax.legend(loc='best')
        # Mark the feature mean with a dashed red line
        thresh = feature_data[feature].mean()
        ax.axvline(x=thresh, color='r', linestyle='dashed', linewidth=2)
        plt.ylabel('Density')
    plt.tight_layout()
    plt.show()

In [None]:
histograms(mean_features);

In [None]:
histograms(se_features);

The standard error measures how precisely a sample statistic, such as a mean, estimates the corresponding population value. For a mean, it is the sample standard deviation divided by the square root of the sample size.
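
As a small worked example of that definition, the sketch below computes the standard error of the mean for a handful of made-up radius values. The numbers are hypothetical, and the snippet illustrates the general formula rather than the exact procedure used to produce the `_se` columns of this dataset.

```python
import numpy as np

# Hypothetical radius measurements (arbitrary units) for the nuclei in one image
radii = np.array([12.1, 13.4, 11.8, 14.0, 12.7, 13.1])

sample_std = radii.std(ddof=1)                      # sample standard deviation
standard_error = sample_std / np.sqrt(radii.size)   # SE of the mean

print(f'mean = {radii.mean():.3f}')
print(f'standard error = {standard_error:.3f}')
```

Because of the division by the square root of the sample size, the standard error shrinks as more nuclei are measured, even if their spread stays the same.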


In [None]:
histograms(worst_features);

### 3. Distribution of Features According to Diagnosis

In [None]:
def feature_dist_diagnosis(data, features):   
    """
    Compare each feature's distribution for malignant (1) vs. benign (0) samples
    """
    fig = plt.figure(figsize=(15, 10))
    # Iterating over a DataFrame yields its column names
    for idx, feature in enumerate(features):
        plt.subplot(5, 2, idx+1)
        sns.histplot(data[data['diagnosis']==1][feature], label='Malignant', color='red', bins=20, kde=True, stat='density')
        sns.histplot(data[data['diagnosis']==0][feature], label='Benign', color='green', bins=20, kde=True, stat='density')
        plt.legend(loc='upper right')
    plt.tight_layout()
    plt.show()

In [None]:
feature_dist_diagnosis(df, mean_features);

In [None]:
feature_dist_diagnosis(df, se_features);

In [None]:
feature_dist_diagnosis(df, worst_features);

### Correlation of Features

In [None]:
def correlation_heatmap(data):
    """
    Plot the lower triangle of the feature correlation matrix as a heatmap
    """
    plt.figure(figsize=(18,12))
    
    corr = data.corr()
    # Mask the diagonal and upper triangle, which only mirror the lower triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    cmap = sns.diverging_palette(220,10,as_cmap=True)
    
    sns.heatmap(corr, mask=mask, annot=True, fmt='.1f', 
                linewidths=0.5, cmap=cmap, 
                cbar_kws={'shrink': .5})
    plt.title('Correlation of Features', fontsize=20)
    plt.tight_layout()

correlation_heatmap(df);

In [None]:
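Corr_df = df.corr(method='pearson')
# Hide the diagonal and lower triangle so that each feature pair appears once
Corr_df = Corr_df.mask(np.tril(np.ones(Corr_df.shape)).astype(bool))
# Keep only strongly correlated pairs (|r| >= 0.7) as rows: level_0, level_1, 0
Corr_df = Corr_df[abs(Corr_df) >= 0.7].stack().dropna().reset_index()
Corr_df.head(50)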
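
To make the masking and stacking steps in the cell above easier to follow, here is the same idea on a tiny hand-made correlation matrix. The matrix values and the names `corr`, `upper` and `pairs` are invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hand-made 3x3 correlation matrix for three hypothetical features
corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, -0.8],
     [0.2, -0.8, 1.0]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c'],
)

# Hide the diagonal and lower triangle so each pair appears only once
lower = np.tril(np.ones(corr.shape, dtype=bool))
upper = corr.mask(lower)

# Keep strong correlations in either direction and reshape to one row per pair
pairs = upper[upper.abs() >= 0.7].stack().dropna().reset_index()
pairs.columns = ['feature_1', 'feature_2', 'correlation']
print(pairs)
```

Only the pairs (a, b) and (b, c) survive the filter: the weak pair (a, c) falls below the 0.7 threshold, and the mirrored entries are hidden by the mask.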

In [None]:
Corr_features = Corr_df['level_0'].unique()
Corr_features

In [None]:
Corr_diagnosis = Corr_df[Corr_df['level_0'] == 'diagnosis'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_radius_m = Corr_df[Corr_df['level_0'] == 'radius_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_texture_m = Corr_df[Corr_df['level_0'] == 'texture_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_perimeter_m = Corr_df[Corr_df['level_0'] == 'perimeter_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_area_m = Corr_df[Corr_df['level_0'] == 'area_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_smoothness_m = Corr_df[Corr_df['level_0'] == 'smoothness_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_compactness_m = Corr_df[Corr_df['level_0'] == 'compactness_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_concavity_m = Corr_df[Corr_df['level_0'] == 'concavity_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_concave_points_m = Corr_df[Corr_df['level_0'] == 'concave points_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)

Corr_fractal_dimension_m = Corr_df[Corr_df['level_0'] == 'fractal_dimension_mean'].sort_values(by=[0], ascending=False).reset_index(drop=True)


### Diagnosis vs Features

In [None]:
def corr_f_diag(data):   
    """
    Box plots, split by diagnosis, of the features most correlated with 'diagnosis' (|r| >= 0.7)
    """
    fig = plt.figure(figsize=(15, 10))
    for idx, item in enumerate(Corr_diagnosis['level_1']):
        plt.subplot(2, 4, idx+1)
        sns.boxplot(x='diagnosis', y=item, hue='diagnosis', data=data)
        plt.legend(loc='best')
    plt.tight_layout()
    print(Corr_diagnosis)
    plt.show()
    
corr_f_diag(df);

### Feature vs Feature

In [None]:
def feature_feature_corr(data, corr_data):
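    """
    Scatter plots of one feature against each feature it is strongly
    correlated with (|r| >= 0.7), colored by diagnosis
    """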
    fig = plt.figure(figsize=(15, 8))
    for idx, item in enumerate(corr_data['level_1']):
        plt.subplot(2, 4, idx+1)
        y = corr_data['level_0'][0]
        sns.scatterplot(x=item, y=y, hue='diagnosis', data=data)
        plt.legend(loc='best')
    plt.tight_layout()
    print(corr_data)
    plt.show()

In [None]:
feature_feature_corr(df, Corr_radius_m);

In [None]:
feature_feature_corr(df, Corr_texture_m);

In [None]:
feature_feature_corr(df, Corr_perimeter_m);

In [None]:
feature_feature_corr(df, Corr_area_m);

In [None]:
feature_feature_corr(df, Corr_smoothness_m);

In [None]:
feature_feature_corr(df, Corr_compactness_m);

In [None]:
feature_feature_corr(df, Corr_concavity_m);

In [None]:
feature_feature_corr(df, Corr_concave_points_m);

In [None]:
feature_feature_corr(df, Corr_fractal_dimension_m);

# 5. Training and Testing Data Split

### Feature Scaling

In [None]:
def feature_scaling(data):
    """
    Separate the target from the features and standardize the features (z-score)
    """
    y = data['diagnosis']
    x = data.drop('diagnosis', axis=1)
    x = (x - x.mean()) / x.std()
    return x, y

In [None]:
def get_data_split(data):
    """
    Train-test split (Train : Test = 70% : 30%)
    """
    X, y = feature_scaling(data);
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    print("Train Shape: ", X_train.shape)
    print("Test Shape: ", X_test.shape)
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = get_data_split(df);
X_train.head()
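
One thing to be aware of: the cells above standardize the whole dataset before splitting, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data. It is not the method used in this notebook, and all names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the 0/1 diagnosis labels
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_demo = np.array([0] * 120 + [1] * 80)

# stratify keeps the class ratio equal in both splits;
# random_state makes the split reproducible
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(X_tr_demo)
X_tr_scaled = scaler.transform(X_tr_demo)
X_te_scaled = scaler.transform(X_te_demo)

print(X_tr_scaled.mean(axis=0).round(6))    # approximately 0 for every column
print(y_tr_demo.mean(), y_te_demo.mean())   # same class ratio in both splits
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to `ddof=1`, so the scaled values differ very slightly from the manual formula used above.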

# 6. Model Building

## 6-1. Logistic Regression

In [None]:
def get_test_score(model, X_test, y_test):
    """
    Get test accuracy score for models
    """
    model_score = model.score(X_test, y_test)
    return model_score

In [None]:
from sklearn.metrics import accuracy_score

# Logistic Regression Model
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_train)
train_score = accuracy_score(y_train, y_pred_logreg)
pred = logreg_model.predict(X_test)
val_score = accuracy_score(y_test, pred)

print('Logistic Regression Model')
print('Training Accuracy Score: {}'.format(train_score))
print('Test Accuracy Score: {}'.format(val_score))
print(classification_report(y_test, pred))

In [None]:
plot_roc_curve(logreg_model, X_test, y_test);
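
The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies, and the area under it (AUC) summarizes the curve in one number. As a rough illustration on four made-up samples (the labels and scores below are not model output from this notebook):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy ground truth and predicted scores for the positive class
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true_toy, scores_toy)
roc_auc = auc(fpr, tpr)
print(roc_auc)  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.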

In [None]:
def conf_matrix(y_test, y_predict):
    """
    Plot a confusion matrix of models
    """
    plt.figure(figsize=(7, 5))
    sns.heatmap(confusion_matrix(y_test, y_predict), annot=True, fmt='d', 
                cbar_kws={'shrink': .5})
    plt.xlabel('Predicted', fontsize=15)
    plt.ylabel('Actual', fontsize=15)
    plt.show()
    return plt

In [None]:
# Confusion matrix of logistic regression
conf_matrix(y_test, pred);
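
For a medical test, the individual cells of the confusion matrix matter more than overall accuracy, because a missed malignant case (false negative) is usually the most costly error. Here is a minimal sketch of how to read the cells, using invented labels where 0 = benign and 1 = malignant:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = benign, 1 = malignant
actual = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

sensitivity = tp / (tp + fn)  # share of malignant cases that were caught
specificity = tn / (tn + fp)  # share of benign cases correctly cleared
print(sensitivity)            # 0.75
```

Sensitivity here is the same quantity that `classification_report` lists as recall for class 1.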

## 6-2. Random Forest

In [None]:
# Random Forest Classification Model
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)

print('Random Forest Model')
print('Training Accuracy Score:', rf_model.score(X_train, y_train))
print('Test Accuracy Score:', rf_model.score(X_test, y_test))

y_pred_rf = rf_model.predict(X_test)
print(classification_report(y_test, y_pred_rf))

In [None]:
plot_roc_curve(rf_model, X_test, y_test);

In [None]:
# Confusion matrix of random forest
conf_matrix(y_test, y_pred_rf);

# 7. Recursive Feature Elimination with Cross-validation (RFECV)

## 7-1. RFECV for Logistic Regression

In [None]:
rfecv_logreg = RFECV(estimator=logreg_model, step=1, 
                     cv=StratifiedKFold(5), scoring='accuracy')
rfecv_logreg = rfecv_logreg.fit(X_train, y_train)

print('Optimal number of features in LogisticRegression:', rfecv_logreg.n_features_)
print('Best features in LogisticRegression:', X_train.columns[rfecv_logreg.support_])
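
Besides `n_features_` and `support_`, a fitted `RFECV` object exposes `ranking_` and `transform`, which are handy for inspecting and applying the selection. The sketch below uses synthetic data from `make_classification` rather than the breast cancer features, and the names `X_syn`, `y_syn` and `selector` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 10 features, of which only 3 carry signal
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                      # drop one feature per elimination round
    cv=StratifiedKFold(5),
    scoring='accuracy',
)
selector.fit(X_syn, y_syn)

print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask of the kept features
print(selector.ranking_)     # 1 = kept; larger values were eliminated earlier

# Keep only the selected columns
X_reduced = selector.transform(X_syn)
print(X_reduced.shape)
```

Calling `predict` or `score` on the fitted selector, as done in the cells below, applies this column selection automatically before using the underlying model.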

In [None]:
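def rfecv_grid_scores(model):
    """
    Plot the mean cross-validation score for each number of selected features
    """
    # Mean CV score per feature count (stored in cv_results_ in scikit-learn >= 1.0)
    scores = model.cv_results_['mean_test_score']
    plt.figure(figsize=(10, 7))
    plt.plot(range(1, len(scores) + 1), scores)
    plt.xlabel('Number of features selected')
    plt.ylabel('Cross validation score')
    plt.show()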

In [None]:
rfecv_grid_scores(rfecv_logreg);

In [None]:
y_rfe_logreg = rfecv_logreg.predict(X_test)

print('Logistic Regression with RFECV')
print( 'Training Accuracy Score:', rfecv_logreg.score(X_train, y_train))
print( 'Test Accuracy Score:', rfecv_logreg.score(X_test, y_test))
print(classification_report(y_test, y_rfe_logreg))

## 7-2. RFECV for Random Forest

In [None]:
rfecv_rf = RFECV(estimator=rf_model, step=1, cv=StratifiedKFold(5), scoring='accuracy')
rfecv_rf = rfecv_rf.fit(X_train, y_train)

print('Optimal number of features in RandomForest:', rfecv_rf.n_features_)
print('Best features in RandomForest:', X_train.columns[rfecv_rf.support_])

In [None]:
rfecv_grid_scores(rfecv_rf);

In [None]:
y_rfe_rf = rfecv_rf.predict(X_test)

print('Random Forest with RFECV')
print( 'Training Accuracy Score:', rfecv_rf.score(X_train, y_train))
print( 'Test Accuracy Score:', rfecv_rf.score(X_test, y_test))
print(classification_report(y_test, y_rfe_rf))

# 8. Confusion Matrix

In [None]:
# Confusion matrix of Logistic Regression with RFECV
conf_matrix(y_test, y_rfe_logreg);

In [None]:
# Confusion matrix of Random Forest with RFECV
conf_matrix(y_test, y_rfe_rf);

## Thanks for reading! If you have any advice, please leave a comment down below.