### Table of Contents

* [Load Libraries](#chapter1)
* [Load the data](#chapter2)
* [Cleaning and Preparing the data](#chapter3)
* [Exploratory Data Analysis](#chapter4)
* [Correlation Matrix](#chapter5)
* [Diagnosis vs Features](#chapter6)
* [Outlier Detection](#chapter7)
* [Drop Outliers](#chapter8)
* [Creating Test and Train Dataset](#chapter9)
* [Standardization](#chapter10)
* [Classification and Build a Model](#chapter11)
* [Logistic Regression](#chapter12)
* [Decision Tree](#chapter13)
* [Random Forest](#chapter14)
* [Conclusion](#chapter15)

### Load Libraries <a class="anchor" id="chapter1"></a>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap
%matplotlib inline 

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis, LocalOutlierFactor
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

### Load the data <a class="anchor" id="chapter2"></a>

In [1]:
data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

In [1]:
data.head()

In [1]:
data.info()

### Cleaning and Preparing the data <a class="anchor" id="chapter3"></a>

Since we don't need the *id* column and the empty *Unnamed: 32* column in the dataset, we remove them.

In [1]:
data.drop("id", axis=1, inplace=True)

In [1]:
data.drop('Unnamed: 32', axis=1, inplace=True)

Let's look at the values of the column we are going to predict:

In [1]:
data.diagnosis.unique()

The values are the initials of Malignant and Benign. Our algorithms cannot work with these string labels directly, so we encode Malignant as 1 and Benign as 0.

In [1]:
data.diagnosis = data['diagnosis'].map({'M':1, 'B':0})
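
`Series.map` returns `NaN` for any value that is missing from the dictionary, so it can be worth checking that no label was silently lost. A minimal sketch on a toy frame (the `toy` DataFrame and its values are made up for illustration and are not the notebook's data):

```python
import pandas as pd

# Toy stand-in for the diagnosis column (illustrative values only).
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Map the labels to integers; labels missing from the dict would become NaN.
encoded = toy["diagnosis"].map({"M": 1, "B": 0})

# Fail loudly if any label was not covered by the mapping.
assert encoded.notna().all(), "unexpected label in diagnosis column"

toy["diagnosis"] = encoded.astype(int)
print(toy["diagnosis"].tolist())
```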

### Exploratory Data Analysis <a class="anchor" id="chapter4"></a>

Let's look at the statistical information of our dataset

In [1]:
data.describe()

Let's look at the class distribution of our target:

In [1]:
sns.countplot(x="diagnosis", data=data)  # pass the column by keyword; recent seaborn versions reject a positional Series

Roughly 350 of our samples are benign and roughly 200 are malignant, so the classes are moderately imbalanced.
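
The exact counts can be read off with `value_counts` instead of being estimated from the plot. A small sketch, assuming a toy `diagnosis` Series in place of `data["diagnosis"]` (the values are invented):

```python
import pandas as pd

# Toy stand-in for data["diagnosis"] after encoding (1 = malignant, 0 = benign).
diagnosis = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="diagnosis")

counts = diagnosis.value_counts()                # absolute count per class
shares = diagnosis.value_counts(normalize=True)  # fraction of samples per class

print(counts)
print(shares.round(3))
```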

### Correlation Matrix <a class="anchor" id="chapter5"></a>

In [1]:
f, ax = plt.subplots(figsize = (20,20))
sns.heatmap(data.corr(), annot=True, fmt='.1f',
            ax=ax, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map', size=14);

Let's plot the features whose absolute correlation with *diagnosis* is greater than 0.75:

In [1]:
flt = np.abs(data.corr()['diagnosis']) > .75

In [1]:
corr_feat = data.corr().columns[flt].tolist()

In [1]:
f, ax = plt.subplots(figsize = (8,8))
sns.heatmap(data[corr_feat].corr(), annot=True, fmt='.2f',
           ax=ax, cmap='coolwarm',vmin=-1,vmax=1)
plt.xticks(rotation=60)
plt.yticks(rotation=0)
plt.title('Correlation Between Features (Th>0.75)');

The filter keeps 5 columns: the target itself and the 4 features whose absolute correlation with it exceeds 0.75. These features are also strongly correlated with one another.
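
The selection step above can be sketched on a tiny frame where the correlations are easy to check by eye. The column names `a`, `b`, `c` and all values are hypothetical; the correlation matrix is computed once and reused:

```python
import pandas as pd

toy = pd.DataFrame({
    "diagnosis": [0, 0, 0, 1, 1, 1],
    "a": [1, 2, 3, 7, 8, 9],   # rises with the target
    "b": [5, 1, 4, 2, 6, 3],   # almost unrelated to the target
    "c": [9, 8, 7, 3, 2, 1],   # falls as the target rises
})

corr = toy.corr()                       # compute once, reuse below
target_corr = corr["diagnosis"].abs()   # absolute correlation with the target
selected = target_corr[target_corr > 0.75].index.tolist()
print(selected)
```

The target always passes its own filter (its correlation with itself is 1), which is why it shows up among the selected columns.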

Now let's look at the distribution of our values

In [1]:
def melt(dataset, param):
    data_melted = pd.melt(dataset, id_vars=param,
                     var_name="features",
                     value_name="value")
    return data_melted

def boxplot(dataset, param):
    plt.figure(figsize= (14,8))
    sns.boxplot(x="features", y="value", hue=param, data=dataset)
    plt.xticks(rotation = 90)
    return plt.show()

def pairplot(dataset, param):
    sns.pairplot(dataset, diag_kind='kde', markers='+', hue=param);

In [1]:
boxplot(melt(data,"diagnosis"),"diagnosis")

It looks like there are outliers. Removing them and standardizing the features should make the plots easier to read and may improve the models.

In [1]:
pairplot(data[corr_feat],"diagnosis")

When we plot the highly correlated features colored by the target value, the benign and malignant samples fall into separate ranges. The features are also strongly correlated with each other: when one of them is high, the others tend to be high as well. Because they carry largely the same information, an analysis that treats them as independent evidence will not give exactly correct results.
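
One way to make this redundancy explicit is to list the feature pairs whose absolute correlation exceeds a cut-off. The sketch below is only an illustration: the frame, the column names `x`, `y`, `z` and the 0.9 cut-off are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10.5],   # nearly a multiple of x
    "z": [3, 1, 4, 1, 5],      # only weakly related to x and y
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

pairs = [(r, c) for r in upper.index for c in upper.columns
         if upper.loc[r, c] > 0.9]
print(pairs)
```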

### Diagnosis vs Features <a class="anchor" id="chapter6"></a>

To see which features separate the two classes, we split the data into malignant and benign samples.

In [1]:
feat = list(data.columns[1:11])
dataM = data[data.diagnosis ==1]
dataB = data[data.diagnosis ==0]

Let's plot the distribution of each feature separately for malignant and benign samples.

In [1]:
plt.rcParams.update({'font.size': 8})
f, axes = plt.subplots(nrows=5, ncols=2, figsize=(8,10))
axes = axes.ravel()
for i, ax in enumerate(axes):
    # 50 equal-width bins spanning the full range of the feature
    bins = np.linspace(data[feat[i]].min(), data[feat[i]].max(), 51)
    ax.hist([dataM[feat[i]], dataB[feat[i]]], bins=bins,
            alpha=0.5, stacked=True, density=True, label=['M','B'], color=['r','g'])
    ax.legend(loc='upper right')
    ax.set_title(feat[i])
plt.tight_layout()
plt.show()

* The mean values of cell *radius*, *perimeter*, *area*, *compactness*, *concavity* and *concave points* can be used to classify the cancer. Larger values of these parameters tend to correlate with malignant tumors.

* The mean values of *texture*, *smoothness*, *symmetry* and *fractal dimension* do not show a particular preference for one diagnosis over the other. None of the histograms shows large outliers that would warrant further cleanup.

### Outlier Detection <a class="anchor" id="chapter7"></a>

Let's detect outliers.

In [1]:
X = data.drop(['diagnosis'], axis=1)
y = data.diagnosis

In [1]:
column = X.columns.tolist()

In [1]:
LOF = LocalOutlierFactor()
y_pred = LOF.fit_predict(X)
X_score = LOF.negative_outlier_factor_

In [1]:
outlier_score = pd.DataFrame()
outlier_score['score'] = X_score
outlier_score.head()

In [1]:
# Rescale the scores to [0, 1]; the most anomalous point gets radius 1.
radius = (X_score.max() - X_score) / (X_score.max() - X_score.min())
outlier_score["radius"] = radius
# Flag points whose negative outlier factor falls below -2.5.
filt = outlier_score["score"] < -2.5
outlier_index = outlier_score[filt].index.tolist()
plt.figure(figsize = (12,8))
# Plot the first two features; the red circle grows with the outlier score.
plt.scatter(X.iloc[outlier_index,0], X.iloc[outlier_index,1], 
            color = "blue", s = 50, label = "Outliers" )
plt.scatter(X.iloc[:,0], X.iloc[:,1], color ="k", s=3, label = "Data Points" )
plt.scatter(X.iloc[:,0],X.iloc[:,1], s=1000*radius, edgecolor = "r", 
            facecolors = "none", label="Outlier Scores")
plt.legend()
plt.show()

We see outliers. Let's remove them now.
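
To make the thresholding on `negative_outlier_factor_` concrete, here is a small sketch on synthetic points: a regular 7x7 grid plus one planted far-away point. The data and the variable names are assumptions for illustration; only the `-2.5` cut-off is taken from the cell above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 7x7 grid of evenly spaced points plus one planted far-away point.
xs, ys = np.meshgrid(np.arange(7), np.arange(7))
grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
points = np.vstack([grid, [[50.0, 50.0]]])   # planted outlier at index 49

lof = LocalOutlierFactor()                   # default n_neighbors=20
lof.fit_predict(points)
scores = lof.negative_outlier_factor_        # near -1 for inliers, far below for outliers

threshold = -2.5                             # same cut-off as in the notebook
outlier_index = np.where(scores < threshold)[0].tolist()
print(outlier_index)
```

Scores close to -1 mean a point is about as dense as its neighbours; the more negative the score, the more isolated the point.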

### Drop Outliers <a class="anchor" id="chapter8"></a>

In [1]:
X = X.drop(outlier_index)
y = y.drop(outlier_index).values

### Creating Test and Train Dataset <a class="anchor" id="chapter9"></a>

Since this data set is not ordered, I am going to do a simple 70:30 split to create a training data set and a test data set.

In [1]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
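
Because the classes are imbalanced, a stratified split is a possible alternative that keeps the class ratio identical in both parts. This is a sketch of that option on synthetic labels, not the split used in this notebook; the 60/40 class sizes are invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 60 of class 0 and 40 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# stratify=y preserves the 60/40 ratio in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(np.bincount(y_train), np.bincount(y_test))
```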

### Standardization <a class="anchor" id="chapter10"></a>

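Features on very different scales (compare *area_mean* with *smoothness_mean*) can dominate scale-sensitive models such as logistic regression, so we standardize the data. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.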

In [1]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train_df = pd.DataFrame(X_train, columns= column)

In [1]:
X_train_df.head()

In [1]:
X_train_df.describe()
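
The standardization step above computes z = (x - mean_train) / std_train for every feature, using statistics from the training set only. A minimal sketch that checks this against `StandardScaler` on random numbers (the data is synthetic and the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(50, 2))
X_test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

# Fit on the training data only, then reuse the same statistics for the test data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Manual version of the same transform (population std, ddof=0).
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
manual = (X_test - mu) / sigma
print(np.allclose(manual, X_test_s))
```

After the transform the training columns have mean 0 and standard deviation 1; the test columns are only approximately so, which is expected.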

In [1]:
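df = X_train_df  # same object, not a copy: the 'diagnosis' column added below also appears in X_train_df, which the model cells rely on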

In [1]:
df['diagnosis'] = y_train

In [1]:
boxplot(melt(df,'diagnosis'),'diagnosis')

Looks better.

In [1]:
pairplot(df[corr_feat],'diagnosis')

### Classification and Build a Model <a class="anchor" id="chapter11"></a>

Here we are going to build a classification model and evaluate its performance using the training set.

In [1]:
def classification_and_fit_model(model, data, predictors, outcome):
    # Fit on the full training frame and report the (optimistic) training accuracy.
    model.fit(data[predictors], data[outcome])
    predictions = model.predict(data[predictors])
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print('Accuracy : {0:.3%}'.format(accuracy))
    # 5-fold cross-validation: fit on four folds, score on the held-out fold.
    kf = KFold(n_splits=5)
    scores = []
    for train, test in kf.split(data):
        train_predictors = data[predictors].iloc[train,:]
        train_target = data[outcome].iloc[train]
        model.fit(train_predictors, train_target)
        scores.append(model.score(data[predictors].iloc[test,:],
                                  data[outcome].iloc[test]))
        # Running mean of the fold scores seen so far.
        print('Cross-Validation Score : {0:.3%}'.format(np.mean(scores)))
    # Refit on all rows so the model can be reused after the call.
    model.fit(data[predictors], data[outcome])
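
The manual fold loop can also be written with scikit-learn's `cross_val_score`. The sketch below is an alternative, not the helper used in this notebook: it runs on synthetic data from `make_classification`, and it shuffles the folds, which the helper above does not do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data standing in for the training frame.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # one accuracy value per fold

print(scores.round(3), scores.mean().round(3))
```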
        

### Logistic Regression <a class="anchor" id="chapter12"></a>

* Based on the observations in the histogram plots, we can reasonably hypothesize that the cancer diagnosis depends on the mean cell radius, mean perimeter, mean area, mean compactness, mean concavity and mean concave points. We then fit a logistic regression using five of those features (all but mean concavity) as follows:

In [1]:
predictor = ['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean', 'concave points_mean']

In [1]:
outcome = 'diagnosis'

In [1]:
model = LogisticRegression()

In [1]:
classification_and_fit_model(model, X_train_df, predictor, outcome)

The prediction accuracy is good. What happens if we use just one predictor? Let's use *radius_mean*:

In [1]:
predictor1 = ['radius_mean']

In [1]:
classification_and_fit_model(model, X_train_df, predictor1, outcome)

This gives a similar prediction accuracy and cross-validation score.

The accuracy of the predictions is good but not great. The cross-validation scores are reasonable. Can we do better with another model?

### Decision Tree <a class="anchor" id="chapter13"></a>

In [1]:
model = DecisionTreeClassifier()

In [1]:
classification_and_fit_model(model, X_train_df, predictor, outcome)

Here we are over-fitting the model, probably due to the large number of predictors. Let's use a single predictor; the obvious choice is the radius of the cell.

In [1]:
classification_and_fit_model(model, X_train_df, predictor1, outcome)

The prediction accuracy is much better here. But does it depend on the predictor?

Using a single predictor gives a 97% prediction accuracy for this model, but the cross-validation score is not that great.
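
A gap between training accuracy and cross-validation score is the usual sign of over-fitting, and an unrestricted decision tree tends to show it. The sketch below compares an unrestricted tree with a depth-limited one on synthetic data; the dataset, the `max_depth=3` setting and the `results` dict are assumptions for illustration, not part of this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so a perfect fit means memorizing noise.
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)

results = {}
for depth in (None, 3):   # None = grow the tree until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = tree.fit(X, y).score(X, y)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()
    results[depth] = {"train": train_acc, "cv": cv_acc}
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

The unrestricted tree classifies its own training rows perfectly, so its training accuracy says little about how it will do on new samples.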

### Random Forest <a class="anchor" id="chapter14"></a>

In [1]:
features_mean = list(X_train_df.columns[:10])  # the ten *_mean features; 'diagnosis' is now the last column, not the first

In [1]:
model = RandomForestClassifier(n_estimators=100, min_samples_split=25, max_depth=7, max_features=2)

In [1]:
classification_and_fit_model(model, X_train_df, features_mean, outcome)

Using all of these features improves the prediction accuracy, and the cross-validation score is great.

An advantage of Random Forest is that it returns feature importances, which can be used to select features. So let's look at the ranking and use the top 5 features as predictors.

In [1]:
feature_importance = pd.Series(model.feature_importances_, index=features_mean).sort_values(ascending=False)
feature_importance
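
The top features can be taken from the sorted importances directly instead of being typed by hand. A minimal sketch on synthetic data; the column names `f0` to `f7` and the forest settings are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y = make_classification(n_samples=200, n_features=8,
                               n_informative=3, random_state=0)
cols = [f"f{i}" for i in range(8)]
X = pd.DataFrame(X_arr, columns=cols)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances sum to 1; sort them and keep the names of the five largest.
importance = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)
top5 = importance.head(5).index.tolist()
print(top5)
```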

Using the top 5 features (the cell below reuses the `predictor` list defined earlier):

In [1]:
model = RandomForestClassifier(n_estimators=100, min_samples_split=25, max_depth=7, max_features=2)

In [1]:
classification_and_fit_model(model, X_train_df, predictor, outcome)

Using only the top 5 features changes the prediction accuracy a bit, but I think we get a better result if we use all the predictors.

What happens if we use a single predictor as before? Let's check.

In [1]:
model = RandomForestClassifier(n_estimators=100)

In [1]:
classification_and_fit_model(model, X_train_df, predictor1, outcome)

This gives a better prediction accuracy too, but the cross-validation score is not great.

### Conclusion <a class="anchor" id="chapter15"></a>

The best model for diagnosing breast cancer found in this analysis is the Random Forest model with the top 5 predictors: 'concave points_mean', 'area_mean', 'radius_mean', 'perimeter_mean' and 'concavity_mean'. It gives a prediction accuracy of ~95% and a cross-validation score of ~95%. Both figures are computed on the training data; the held-out test set (`X_test`, `y_test`) is not scored in this notebook.
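
Scoring a model on rows it has never seen gives a less optimistic estimate than training accuracy. A sketch of such a held-out evaluation, assuming synthetic data from `make_classification` in place of the notebook's frame and a default-sized forest rather than the tuned one above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit the scaler and the model on the training part only.
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(scaler.transform(X_train), y_train)

# Evaluate on the held-out part.
y_pred = model.predict(scaler.transform(X_test))
test_acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
print("Test accuracy:", round(test_acc, 3))
print(cm)
```

The confusion matrix is worth a look for a medical task, because it separates missed malignant cases from false alarms, which a single accuracy figure hides.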

I will see if I can improve this more by tweaking the model further and trying out other models in a later version of this analysis.

#### Thanks for viewing my notebook :)