## Prediction of benign or malignant cancer tumors

Let's import the required libraries first.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score, classification_report
from sklearn.model_selection import train_test_split, KFold, learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier

In [None]:
#setting max columns display to 35 for more readability
pd.options.display.max_columns=35

In [None]:
#read data
data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv", sep=',')

#### Attribute information

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)

Now that we know what the columns mean, let's construct the column list.
Each of the ten features appears in three variants: mean, se (standard error) and worst.
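As a minimal sketch of that naming scheme, the expected column list can be built from the ten base features and the three suffixes. The exact header spellings (for example `concave points` with a space and `fractal_dimension` with an underscore) are assumptions about the CSV and should be checked against `data.columns`.

```python
# Base feature names, spelled the way the CSV headers are assumed to spell them.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# Columns are assumed to be grouped by suffix: all means, then all se, then all worst.
feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]
expected_columns = ["id", "diagnosis"] + feature_columns

print(len(expected_columns))
print(expected_columns[:5])
```

Comparing `expected_columns` with `data.columns.tolist()` is a quick way to spot unexpected extras such as an empty trailing column.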

In [None]:
data.columns

In [None]:
data = data.drop('Unnamed: 32', axis=1)

## Feature Engineering

In [None]:
len(data.columns)

In [None]:
data.dtypes

In [None]:
data.isna().sum().sum()

In [None]:
data.loc[:, 'radius_mean':].describe()

<h4>We have 32 columns in total</h4>
<ul>
    <li>All columns have appropriate data types; no conversions are needed.</li>
    <li>There are no missing values.</li>
    <li><code>describe</code> gives a quick look at a few summary statistics.</li>
</ul>

Our target variable is **diagnosis**. Let's plot the per-class mean of each feature as a bar chart and see how the features differ between the two target classes.

In [None]:
f, ax = plt.subplots(3, 10, figsize=(15,6))
i, j, jt = 0, 0, 0
for col in data.columns:
    if col not in ['id', 'diagnosis']:
        if j <= 9:
            data[['diagnosis',col]].groupby('diagnosis').mean().plot.barh(ax=ax[i, j])
            if j == 0 and i==0:
                ax[i,j].set_ylabel("Mean")
            elif j == 0 and i ==1:
                ax[i,j].set_ylabel("Standard Err")
            elif j == 0 and i==2:
                ax[i,j].set_ylabel("Worst")
            else:
                ax[i,j].set_ylabel("")
            
            if i == 2:
                ax[i,j].set_xlabel(col[:-6])
            else:
                ax[i,j].set_xlabel("")
            ax[i,j].legend("")
            if j == 9:
                j = 0
                i += 1
            else:
                j += 1
f.suptitle("Class distribution")
plt.show()

In [None]:
plt.subplots(figsize=(15,12))
sns.heatmap(round(data.loc[:, 'radius_mean':].corr(),2), annot=True)
plt.show()

The heatmap is very helpful for understanding the correlations between variables.
<ul>
    <li>There is a strong positive correlation between the mean and worst sets of features</li>
    <li>There are also strong correlations within the mean features and within the standard error (se) features</li>
</ul>

In [None]:
set1 = ['diagnosis','radius_mean', 'texture_mean', 'perimeter_mean','area_mean', 'concave points_mean', 'radius_worst', 'texture_worst','perimeter_worst', 'area_worst', 'concave points_worst']
set2 = ['diagnosis','radius_se', 'perimeter_se', 'area_se',]

In [None]:
sns.pairplot(data[set1], hue='diagnosis', corner=True)

In [None]:
sns.pairplot(data[set2], hue='diagnosis', corner=True)

From the plots above we can easily identify the highly correlated columns.
We will take out features that have a correlation greater than 0.9 with another feature.
<ul>
    <li>For example, <b>radius_mean</b> and <b>radius_worst</b> are positively correlated (0.97)</li>
    <li>We can then look at the <b>class distribution plot</b> and choose between radius_mean and radius_worst based on how the two classes are distributed for each</li>
    <li>Prepare the list of features to remove and take them out</li>
</ul>
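The pairs can also be listed programmatically rather than read off the heatmap. The sketch below is one way to do it, not the method used in this notebook: `high_corr_pairs` is a hypothetical helper, and the toy frame is synthetic data standing in for the real columns.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair is listed once
    # and self-correlations on the diagonal are skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() > threshold].sort_values(ascending=False)


# Synthetic stand-in data: the second column is built from the first.
rng = np.random.default_rng(0)
radius = rng.normal(14, 3.5, 200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "radius_worst": radius * 1.15 + rng.normal(0, 0.5, 200),
    "texture_mean": rng.normal(19, 4, 200),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data the call would be `high_corr_pairs(data.loc[:, 'radius_mean':])`; which member of each pair to drop remains a judgment call.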

In [None]:
cols_to_remove = ['radius_worst', 'texture_worst','perimeter_worst', 'area_worst', 'concave points_worst','perimeter_se','area_se','perimeter_mean', 'area_mean', 'concave points_mean']

In [None]:
columns = data.columns.tolist()
for c in cols_to_remove:
    columns.remove(c)

<h3>Data Normalization</h3><br/>
Standardize each feature to zero mean and unit standard deviation before we go ahead with modelling.
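For comparison, here is a small sketch of the same z-score formula next to scikit-learn's `StandardScaler`, using made-up numbers. The two differ only by a constant factor, because pandas' `std()` divides by n - 1 while `StandardScaler` divides by n.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values, only to illustrate the arithmetic.
toy = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 20.0],
    "texture_mean": [15.0, 18.0, 21.0, 30.0],
})

# Manual z-score, as in this notebook (pandas std uses ddof=1).
manual = (toy - toy.mean()) / toy.std()

# StandardScaler uses the population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(toy)

n = len(toy)
print(np.allclose(manual.to_numpy() * np.sqrt(n / (n - 1)), scaled))
```

With several hundred rows the factor is very close to 1, so either convention should give nearly identical features.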

In [None]:
data_mean = data[columns[2:]].mean()
data_std = data[columns[2:]].std()

In [None]:
norm_data = (data[columns[2:]] - data_mean) / data_std

### More feature engineering

In [None]:
corr_matrix = norm_data.corr()

In [None]:
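# flag columns with at least one correlation in [0.7, 1) with another column;
# norm_data has 20 columns, so fewer than 20 NaNs means at least one match
cols_mask = corr_matrix[(corr_matrix >= 0.7) & (corr_matrix < 1)].isna().sum() < 20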

In [None]:
cols_mask[cols_mask.values].index.values

Before we start modelling, let's take another look at the heatmap: there are still columns with high correlation. The code above selects the features whose correlation with at least one other feature is 0.7 or higher. There are 12 such features, so we can try PCA to reduce their number.
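A possibly more readable way to express the same selection is sketched below on synthetic data. It blanks out the diagonal explicitly instead of relying on `< 1`, which is a small change from the cells above; the column names are only placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: the second column is built from the first.
rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "radius_mean": base,
    "concavity_mean": base + 0.1 * rng.normal(size=300),
    "texture_mean": rng.normal(size=300),
    "symmetry_mean": rng.normal(size=300),
})

corr = toy.corr()
# Replace self-correlations on the diagonal with NaN so they never count.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))
# A column is flagged if any remaining correlation reaches 0.7.
flagged = (off_diag >= 0.7).any()
selected = flagged[flagged].index.tolist()
print(selected)
```

Like the original mask, this only looks at positive correlations; using `off_diag.abs()` would also catch strong negative ones.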

### PCA

In [None]:
pca = PCA(n_components=5)
pca.fit(data[cols_mask[cols_mask.values].index.values])

In [None]:
pca.explained_variance_ratio_

#### That's great: the first component alone explains 91% of the variance in these features, so let's keep only the first principal component and move ahead.

<ul>
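    <li>Eigenvalues and eigenvectors are the main ingredients for constructing principal components. </li>
    <li>The eigenvectors of the feature covariance matrix give the directions of the components and each eigenvalue gives the variance along its direction, so related variables can be compressed into fewer components. </li>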
</ul>
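To make the link between eigenvalues and explained variance concrete, the sketch below computes the principal components of a small synthetic matrix by eigendecomposition and compares the result with scikit-learn's `PCA`. The data and variable names are illustrative assumptions, not taken from this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: three columns driven by one shared latent factor plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X_toy = latent @ np.array([[1.0, 0.8, 0.6]]) + 0.1 * rng.normal(size=(300, 3))

# Eigendecomposition of the covariance matrix (eigh returns ascending order).
cov = np.cov(X_toy, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue's share of the total is the explained variance ratio.
ratio = eigvals / eigvals.sum()

# Projecting the centered data on the top eigenvector gives the first component.
pc1 = (X_toy - X_toy.mean(axis=0)) @ eigvecs[:, 0]

pca_toy = PCA(n_components=3).fit(X_toy)
sk_pc1 = pca_toy.transform(X_toy)[:, 0]
print(ratio)
print(pca_toy.explained_variance_ratio_)
```

The component itself may come out with the opposite sign in the two approaches, which does not change its meaning. The covariance matrix depends on feature scale, so these ratios change if the inputs are standardized first.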

In [None]:
pca = PCA(n_components=1)
pca.fit(data[cols_mask[cols_mask.values].index.values])
print(pca.explained_variance_ratio_)
components = pca.transform(data[cols_mask[cols_mask.values].index.values])

#### Prepare target variable

In [None]:
target = data['diagnosis'].astype('category')

In [None]:
dict(enumerate(target.cat.categories))

In [None]:
y = target.cat.codes

#### Remove those 12 features and add the first principal component to the dataframe

In [None]:
for c in cols_mask[cols_mask.values].index.values:
    columns.remove(c)

In [None]:
norm_data['PC1'] = components.reshape(len(components))
columns.append('PC1')

In [None]:
X = norm_data[columns[2:]]

In [None]:
X.head()

#### The target variable has 2 classes, so this is a binary classification problem.
Let's start with the classic go-to model for binary classification: Logistic Regression
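The cells that follow train on the first 400 rows and evaluate on the rest. As an alternative, a shuffled and stratified split keeps the class proportions similar in both parts. The sketch below shows that idea on synthetic data from `make_classification`; the sample size, class weights and `test_size` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary problem standing in for X and y.
X_toy, y_toy = make_classification(
    n_samples=569, n_features=10, weights=[0.63, 0.37], random_state=0
)

# stratify=y_toy preserves the class ratio in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_recall = recall_score(y_test, model.predict(X_test))

print(y_train.mean(), y_test.mean())
print(test_recall)
```

With the notebook's own variables the call would be `train_test_split(X, y, stratify=y, ...)`, and the scores would differ from the fixed 400-row split.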

In [None]:
logreg = LogisticRegression()
logreg.fit(X[:400], y[:400])
pred = logreg.predict(X[400:])
print(accuracy_score(y[400:], pred))

In [None]:
confusion_matrix(y[400:], pred, labels=[0,1])

In [None]:
print(classification_report(y[400:], pred))

#### Tree-based models such as Decision Tree and Random Forest were also tried, but they did not perform as well as Logistic Regression (possibly because tree-based models tend to suit categorical features better)
Let's try a support vector classification model, which works very well with continuous variables

In [None]:
svc = LinearSVC(random_state=0, fit_intercept=True)
svc.fit(X[:400], y[:400])
pred = svc.predict(X[400:])
print(accuracy_score(y[400:], pred))

In [None]:
print(classification_report(y[400:], pred))

#### Similar performance, no improvement

In this particular use case we cannot afford a **recall below 1**; that is, we need to minimize **false negatives** for malignant tumors (class 1).<br/>
**Let's try fine-tuning our logistic regression model.**
<ul>
    <li>The LR model predicted 2 false negatives</li>
    <li>This may be due to class imbalance</li>
    <li>Let's try adding class weights to the LR model</li>
</ul>
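One way to get a starting point for the weights is scikit-learn's "balanced" heuristic, sketched below. The label counts (357 benign, 212 malignant) are the counts commonly reported for this dataset and are treated here as an assumption; check them against `y.value_counts()`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed class counts: 357 benign (0) and 212 malignant (1).
labels = np.array([0] * 357 + [1] * 212)

# "balanced" sets each weight to n_samples / (n_classes * class_count).
balanced = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
weights_balanced = {0: balanced[0], 1: balanced[1]}

# Relative weight of the malignant class compared with the benign class.
relative_weight = balanced[1] / balanced[0]
print(weights_balanced)
print(relative_weight)
```

Passing `class_weight="balanced"` to `LogisticRegression` applies the same heuristic directly. A larger weight on class 1 pushes the model further toward avoiding false negatives at the cost of more false positives.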

In [None]:
y.value_counts() #class counts

In [None]:
weights = {0:1.0, 1:1.9}
logreg = LogisticRegression(class_weight=weights)
logreg.fit(X[:400], y[:400])
pred = logreg.predict(X[400:])
accuracy_score(y[400:], pred)

In [None]:
confusion_matrix(y[400:], pred, labels=[0,1])

In [None]:
print(classification_report(y[400:], pred))

**Awesome**, now there are no false negatives and recall is 100%. Precision has dropped, but for this use case the trade-off is worth it.
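To spell out the trade-off, recall and precision can be read directly off the confusion matrix. The labels below are made up for illustration and are not this notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: every malignant case (1) is caught,
# but two benign cases (0) are flagged as malignant.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 1]

# For labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

recall = tp / (tp + fn)       # share of malignant cases that were found
precision = tp / (tp + fp)    # share of malignant predictions that were right
print(recall, precision)
```

Driving false negatives to zero maximizes recall, while the extra false positives are what lower precision.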

## Learning Curve

In [None]:
train_examples, train_score, test_score = learning_curve(logreg, X, y, shuffle=True)

In [None]:
plt.plot(train_examples, train_score.mean(axis=1), marker='d', label="Train")
plt.plot(train_examples, test_score.mean(axis=1), marker='d', label="Test")
plt.legend()
plt.show()

The learning curve suggests the model is a **good fit**: the training and validation scores converge as the number of training examples grows. A large, persistent gap between a high training score and a lower validation score would indicate **overfitting**, while both curves levelling off at a low score would indicate **underfitting**.
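The gap between the two curves can also be inspected numerically. The sketch below runs `learning_curve` on synthetic data and prints the mean training score, mean validation score and their difference at each training size; the dataset and the `cv=5` setting are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary problem standing in for X and y.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_toy, y_toy,
    cv=5, shuffle=True, random_state=0,
)

# Average over the cross-validation folds at each training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

for n, tr, va, g in zip(sizes, train_mean, val_mean, gap):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}  gap={g:.3f}")
```

A gap that shrinks toward zero while both scores stay high matches the good-fit reading above.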