### Problem Statement
The problem at hand is a binary classification problem involving a large number of features. The inputs are the characteristics of the test culture (such as the radius, smoothness and compactness of the cells), while the output is binary, i.e., benign or malignant.

So the method to approach this would be to understand the significance of the features, execute some strategies for feature reduction, apply a binary classification algorithm and iterate this process, until performance saturates. 

In short, the objective of this study is to build a predictive model that will improve the accuracy, objectivity and reproducibility of breast cancer diagnosis by fine needle aspiration (FNA).


### Data Exploration & Exploratory Visualization

The dataset has 569 rows and 33 columns. Of the 33 columns, the first two are `ID number` and `Diagnosis (M = malignant, B = benign)`. The last column is unnamed and contains only NaN values, so it is removed right away. The other 30 columns hold the mean, the standard error (`se`) and the "worst" value (the largest values, i.e. points on the tails) of the distributions of the following 10 features computed for the cell nuclei:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values) 
- perimeter 
- area 
- smoothness (local variation in radius lengths) 
- compactness (perimeter^2 / area - 1.0) 
- concavity (severity of concave portions of the contour) 
- concave points (number of concave portions of the contour) 
- symmetry 
- fractal dimension ("coastline approximation" - 1)


All feature values are recorded with four significant digits. Of the 569 samples, 357 are benign and 212 are malignant (roughly 63% vs. 37%), so the classes are imbalanced, but only mildly.
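
As a quick check of the class balance described above, the shares can be computed directly. This is a minimal sketch: the `diagnosis` series here is a stand-in built from the counts quoted in the text, not the notebook's actual column.

```python
import pandas as pd

# Stand-in for the `diagnosis` column, built from the class counts quoted above
diagnosis = pd.Series(['B'] * 357 + ['M'] * 212, name='diagnosis')

counts = diagnosis.value_counts()
shares = diagnosis.value_counts(normalize=True)
print(counts)
print(shares.round(3))

# A classifier that always predicts the majority class reaches this accuracy,
# so it is a useful lower bound when judging the models later on
majority_baseline = shares.max()
print('Majority-class baseline accuracy: {:.3f}'.format(majority_baseline))
```

The baseline matters because accuracy alone can look good on imbalanced data; this is one reason precision and recall are also reported later.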

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
from IPython.display import display # Allows the use of display() for DataFrames
# Pretty display for notebooks
%matplotlib inline
import matplotlib.pyplot as pl
import matplotlib.patches as mpatches
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection with matplotlib
from time import time
from collections import Counter, defaultdict
import math
import random
import re
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler

random.seed(50)
np.random.seed(50)  # NumPy has its own generator; seed it too, since np.random is used later

Output and features are first extracted from the complete dataset. 

In [None]:
def als_split_data(data):
    output = data['diagnosis']
    features = pd.DataFrame(data=data)
    cols = ['id', 'diagnosis']
    for col in cols:
        if col in features.columns:
            features = features.drop(col, axis=1)
    return output, features



In [None]:
data = pd.read_csv('../input/breast-cancer.csv')
data.columns
# Removing the last unnamed columns
data = data.drop(['Unnamed: 32'], axis =1)
output, features = als_split_data(data)
data.head()

The table above shows the first few rows of the dataset. Below is a statistical description of the 30 features. As mentioned before, they correspond to the mean, standard error and "worst" values of the measured characteristics such as radius, texture, area and smoothness.

In [None]:
display(features.describe())

`area` appears to be the most widely spread feature, i.e., it has a large standard deviation, but that is probably only because area scales with the square of the radius. In the next few steps, I will explore the data further in the following order:

1. Visualize the distribution statistics; apply appropriate data preprocessing steps
2. Look for outliers
3. Observe feature correlations that will guide in feature selection

### Data Preprocessing, Outlier Detection

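The following plots show histograms of the first 10 features, i.e. the `_mean` columns (`radius_mean`, `texture_mean`, `perimeter_mean`, `area_mean`, `smoothness_mean`, `compactness_mean`, `concavity_mean`, `concave points_mean`, `symmetry_mean` and `fractal_dimension_mean`).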

In [None]:
def vs_distribution(data):
    """
    Visualization code for displaying skewed distributions of features
    """
    
    # Create figure
    fig = pl.figure(figsize = (18,15));

    # Skewed feature plotting
    for i, feature in enumerate(data.columns[:10]):
        ax = fig.add_subplot(5, 5, i+1)
        ax.hist(data[feature], bins = 25, color = '#00A0A0')
        ax.set_title("'%s'"%(feature), fontsize = 14)
        ax.set_xlabel("Value")
        ax.set_ylabel("Number of Records")
        #ax.set_ylim((0, 2000))
        #ax.set_yticks([0, 500, 1000, 1500, 2000])
        #ax.set_yticklabels([0, 500, 1000, 1500, ">2000"])

    # Set the figure title once, after all subplots have been drawn
    fig.suptitle("Feature Distributions", fontsize = 16, y = 1.03)

    fig.tight_layout()
    pl.show()

In [None]:
vs_distribution(features)

I am first going to look for outliers, remove them and observe the distributions again. 

Note that the features contain the `mean`, `se` and `worst` values of the 10 measurements described in the Data Exploration section. Since the `se` and `worst` values reflect the quality of the measured data, I am going to inspect only the last 20 features when removing outliers.

I am using `als_print_outliers()` for this. For each feature, the function flags the points lying more than `how_far` times the interquartile range (IQR) below the first quartile or above the third quartile. The number of features in which each point is flagged is counted in a dictionary, and the points to discard are those flagged most often: a point is dropped if its count is within `worst_th` of the highest count observed.
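
The rule described above can be illustrated on a small, made-up table before applying it to the real data. This is only a sketch: the `toy` array, the helper `iqr_outlier_mask` and the chosen thresholds are hypothetical and are not part of the notebook.

```python
import numpy as np
from collections import defaultdict

def iqr_outlier_mask(values, how_far=2.0):
    """True where a value lies more than how_far * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    step = how_far * (q3 - q1)
    return (values < q1 - step) | (values > q3 + step)

# Made-up data: row 5 is extreme in both features, row 6 only in the second one
toy = np.array([
    [1.00, 10.0],
    [1.20, 11.0],
    [0.90,  9.5],
    [1.10, 10.5],
    [1.00, 10.2],
    [9.00, 40.0],
    [1.30, 25.0],
    [1.10, 10.8],
    [0.95,  9.9],
])

# Count, for every row, the number of features in which it is flagged
flag_counts = defaultdict(int)
for col in range(toy.shape[1]):
    for row in np.where(iqr_outlier_mask(toy[:, col]))[0]:
        flag_counts[int(row)] += 1

# Same selection rule as in the text: keep rows whose count is within worst_th of the maximum
worst_th = 1
max_count = max(flag_counts.values())
worst_rows = [row for row, count in flag_counts.items() if count > max_count - worst_th]

print(dict(flag_counts))
print(worst_rows)
```

With `worst_th = 1` only the row flagged in every feature is dropped; a larger `worst_th` also discards rows that are extreme in fewer features.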

In [None]:
# Return the rows that are flagged as outliers in the largest number of features
def als_print_outliers(data, how_far=2, worst_th=6, to_display=False):
    # Select the last 20 features (the `se` and `worst` columns);
    # the original slice 11:30 skipped the first `se` column
    data = data.iloc[:, 10:30]
    really_bad_data = defaultdict(int)
    for col in data.columns:
        Q1 = np.percentile(data[col], 25)
        Q3 = np.percentile(data[col], 75)
        step = (Q3-Q1)*how_far
        bad_data = list(data[~((data[col]>=Q1-step)&(data[col]<=Q3+step))].index)
        # Count in how many features each row is an outlier
        for i in bad_data:
            really_bad_data[i] += 1
    if not really_bad_data:
        return []
    max_ind = max(really_bad_data.values())
    worst_points = [k for k, v in really_bad_data.items() if v > max_ind-worst_th]
    if to_display:
        print("Data points considered outliers are:")
        # .ix has been removed from pandas; the keys are index labels, so use .loc
        display(data.loc[worst_points, :])
    return worst_points
    

In [None]:
outlier_indices = als_print_outliers(features, worst_th=3)

In [None]:
# Cleaning dataset by dropping outliers (cl)
data_cl = data.drop(data.index[outlier_indices]).reset_index(drop=True) # cleaned data
output_cl, features_cl = als_split_data(data_cl)

vs_distribution(features_cl)

In [None]:
print('Size of new dataset is {0:.2f} % of the original'.format(100.0*len(data_cl)/len(data)))

As can be seen from the two sets of plots above, the distribution characteristics have clearly changed, especially for the features describing the dimensions of the cells (radius, perimeter, area), but much less so for features like `concavity_mean`, `concave points_mean`, `symmetry_mean` and `fractal_dimension_mean`.

Before proceeding with other visualization techniques, I will apply a logarithmic transformation to the features (to see if it removes the skewness) followed by min-max scaling. These preprocessing steps should be useful later on, when I analyse the effect of each feature on classification.
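
To see what these two steps do to a skewed feature, here is a minimal sketch on synthetic data. The log-normal sample, the helper names `log_minmax` and `skewness`, and the parameters are all assumptions chosen for illustration; they only stand in for a right-skewed column such as `area_mean`.

```python
import numpy as np

def log_minmax(values):
    """Apply log(1 + x), then rescale the result linearly to [0, 1]."""
    logged = np.log1p(values)
    return (logged - logged.min()) / (logged.max() - logged.min())

def skewness(values):
    """Sample skewness: third standardized moment (close to 0 for a symmetric sample)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed feature (log-normal)
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=6.0, sigma=0.8, size=500)

transformed = log_minmax(raw)
print('skewness before: {:.2f}'.format(skewness(raw)))
print('skewness after:  {:.2f}'.format(skewness(transformed)))
print('range after: [{}, {}]'.format(transformed.min(), transformed.max()))
```

Min-max scaling is a linear map, so it changes the range but not the shape of the distribution; any reduction in skewness comes from the logarithm alone.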

In [None]:
def als_transform_log_minmax(data):
    cols = data.columns
    # Work on an explicit copy so the input frame is left untouched
    data_transformed = data.copy()
    scaler = MinMaxScaler()
    for col in cols:
        # log(1 + x) to reduce right skew, then rescale the logged values to [0, 1]
        logged = np.log1p(data[col].values.reshape(-1, 1))
        data_transformed[col] = scaler.fit_transform(logged).ravel()
    return data_transformed

In [None]:
# Applying log transformation and minmax scaling (tr)
features_cl_tr = als_transform_log_minmax(features_cl) # cleaned, transformed data
data_cl_tr = pd.concat([output_cl, features_cl_tr], axis=1)

In [None]:
vs_distribution(features_cl_tr)

Skewness has reduced in most plots, except for `concavity_mean`, `concave points_mean` and `fractal_dimension_mean`. I am going to look for outliers once more and check the distributions again.

In [None]:
outlier_indices = als_print_outliers(features_cl_tr,worst_th=3)

In [None]:
# Cleaning dataset again - dropping outliers
data_cl_tr_cl = data_cl_tr.drop(data_cl_tr.index[outlier_indices]) # cleaned, transformed, cleaned data
output_cl_tr_cl, features_cl_tr_cl = als_split_data(data_cl_tr_cl)

In [None]:
print('Size of new dataset is {0:.2f} % of the original'.format(100.0*len(data_cl_tr_cl)/len(data)))

Below is a plot showing the number of benign and malignant samples before and after the data cleaning operations.

In [None]:
def als_encode_diagnosis(d):
    if d== 'B':
        ed = 0
    else:
        ed = 1
    return ed

In [None]:
def vs_show_output_classes(data, data_clean):
    '''
    Visualization code for histogram of classes
    '''
    
    # Create figure
    fig = pl.figure(figsize=(10,6))
    
    encoded_data = data.apply(lambda x: als_encode_diagnosis(x))
    encoded_data_clean = data_clean.apply(lambda x: als_encode_diagnosis(x))

    ax = fig.add_subplot(111)
    n, bins, patches = ax.hist(encoded_data, bins=np.arange(3), alpha=0.5, color='b', label='Lost data', width=0.5)
    n_c, bins_c, patches_c = ax.hist(encoded_data_clean, bins=np.arange(3), color='k', label = 'Data filtered for outliers',width=0.5)
    '''
    colors = ['r', 'g']
    for i in range(2):
        patches[i].set_fc(colors[i])
        patches_c[i].set_fc(colors[i])
    '''    
    ax.set_title('Barplot of output classes', fontsize=16)
    ax.set_xticks([b+0.25 for b in bins[:-1]])
    ax.set_xticklabels(['Benign', 'Malign'], fontsize=16)
    
    ax.legend(fontsize=16)
    ax.set_ylabel('Number of records', fontsize=16)
    fig.tight_layout()
    pl.show()

In [None]:
vs_show_output_classes(output, output_cl_tr_cl)

As seen, not much data has been lost. Outliers seem to have been removed almost equally from both classes, and since this is a classification problem, I expect this pre-processing step to help in producing a robust model.

I am now going to create violin plots of the features, split by diagnosis type. Violin plots represent the distribution of the samples, similar to histograms and box plots. But rather than showing counts of data points, violin plots use kernel density estimation (KDE) to draw a smoothed estimate of the probability density of the samples.
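
The idea behind KDE can be sketched in a few lines of NumPy: every sample contributes one small Gaussian bump, and the bumps are averaged. This is an illustration only. The function `gaussian_kde_1d`, the synthetic `benign` and `malignant` samples and the rule-of-thumb bandwidth are assumptions made for the example; seaborn's own bandwidth selection may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` at the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal-reference rule of thumb, one common default choice
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, averaged over all samples
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Synthetic stand-ins for one scaled feature, split by diagnosis
rng = np.random.default_rng(0)
benign = rng.normal(loc=0.3, scale=0.08, size=300)
malignant = rng.normal(loc=0.6, scale=0.12, size=200)

grid = np.linspace(-0.5, 1.5, 801)
dx = grid[1] - grid[0]
for name, values in [('benign', benign), ('malignant', malignant)]:
    density = gaussian_kde_1d(values, grid)
    peak = grid[np.argmax(density)]
    area = density.sum() * dx  # should be close to 1 for a proper density
    print('{:>9}: peak near {:.2f}, area under curve = {:.3f}'.format(name, peak, area))
```

Each half of a split violin is essentially such a density curve, drawn vertically for one class.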

### Visualizing feature effects

In [None]:
def als_return_select_cols(data, **kwargs):
    checks = ['radius', 'area', 'perimeter']
    cols = [c for c in data.columns for ch in checks if re.search('{}(.)'.format(ch), c)]
    if kwargs['which']=='mean_non_dims':
        cols = [c for c in data.columns if c not in cols]
        cols = [c for c in cols if re.search('(.)_mean', c)]
    elif kwargs['which']=='se_non_dims':
        cols = [c for c in data.columns if c not in cols]
        cols = [c for c in cols if re.search('(.)_se', c)]
    elif kwargs['which']=='worst_non_dims':
        cols = [c for c in data.columns if c not in cols]
        cols = [c for c in cols if re.search('(.)_worst', c)]
    elif kwargs['which']=='all':
        cols = data.columns
    return cols

In [None]:
def vs_violin_swarm_plots(data, **kwargs):
    cols = ['diagnosis'] + als_return_select_cols(data, which=kwargs['which'])
    d = pd.melt(data[cols], id_vars = 'diagnosis', var_name = 'features', value_name = 'value')
    
    sns.set(font_scale=1.5)
    fig, ax = pl.subplots(figsize=(15,10))
    ax = sns.violinplot(x='features', y = 'value', hue='diagnosis', data=d, split=True, inner='quart')
    for tick in ax.get_xticklabels():
        tick.set_rotation(45)    
    fig.tight_layout()
    pl.subplots_adjust(bottom=0.2)
    pl.show()

In [None]:
vs_violin_swarm_plots(data_cl_tr_cl,which='only_dims') # radius, perimeter and area features (mean, se, worst)

In [None]:
vs_violin_swarm_plots(data_cl_tr_cl,which='mean_non_dims') # remaining `_mean` features

In [None]:
vs_violin_swarm_plots(data_cl_tr_cl,which='se_non_dims') # remaining `_se` features

In [None]:
vs_violin_swarm_plots(data_cl_tr_cl,which='worst_non_dims') # remaining `_worst` features

I am now going to look for feature correlations. Radius, area and perimeter are expected to be correlated; even their distributions look similar in the violin plots above. This is confirmed by the following correlation heat map.

### Visualizing feature correlations

In [None]:
def vs_observe_correlations(data, **kwargs):
    cols = als_return_select_cols(data, which=kwargs['which'])
    fig,ax = pl.subplots(figsize=(10,7))
    sns.heatmap(data[cols].corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
    fig.tight_layout()
    pl.show()

In [None]:
vs_observe_correlations(features_cl_tr_cl, which='only_dims')

Almost all numbers in the correlation matrix are high, indicating strong correlation between these features, as expected. I could therefore keep just the `area` features (i.e., `area_mean`, `area_se`, `area_worst`) and drop the radius and perimeter ones. But the heatmap also shows that `area_mean` is highly correlated with `area_se` and `area_worst`, so it would not be a bad decision to keep only `area_mean` out of these 9 features.

In [None]:
vs_observe_correlations(features_cl_tr_cl, which='mean_non_dims')

The above heatmap again shows high values in the centre, implying high correlation between `compactness_mean`, `concavity_mean` and `concave points_mean`.

In [None]:
vs_observe_correlations(features_cl_tr_cl, which='se_non_dims')

Similarly, the above heatmap indicates high correlation between `compactness_se`, `concavity_se` and `concave points_se`.

In [None]:
vs_observe_correlations(features_cl_tr_cl, which='worst_non_dims')

And there is a high correlation between `compactness_worst`, `concavity_worst` and `concave points_worst`.

The conclusion from this feature selection study is that keeping only one of the correlated features in each of the groups below would probably be enough for classification:
- radius_mean, radius_se, radius_worst
- perimeter_mean, perimeter_se, perimeter_worst
- area_se, area_worst
- smoothness_se, compactness_se, concave_points_se, concavity_se, symmetry_se
- compactness_worst, concavity_worst, concave_points_worst
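
The groups above were picked by reading the heatmaps. A more mechanical alternative is sketched below: scan the upper triangle of the correlation matrix and drop every column that is strongly correlated with an earlier one. The helper `drop_highly_correlated`, the threshold of 0.9 and the synthetic `toy` frame are assumptions for illustration, not the selection actually used in this notebook.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features, threshold=0.9):
    """Drop every column whose absolute correlation with an earlier column exceeds `threshold`."""
    corr = features.corr().abs()
    # Upper triangle only, so each pair is examined once and the diagonal is ignored
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Synthetic data: perimeter and area are derived from the radius, texture is independent
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    'radius_mean': radius,
    'perimeter_mean': 2 * np.pi * radius + rng.normal(0, 1, size=200),
    'area_mean': np.pi * radius ** 2 + rng.normal(0, 20, size=200),
    'texture_mean': rng.normal(20, 4, size=200),
})

reduced, dropped = drop_highly_correlated(toy, threshold=0.9)
print('dropped:', dropped)
print('kept:   ', list(reduced.columns))
```

This sketch keeps whichever correlated column comes first, so the column order decides the survivor; to keep `area_mean` instead of `radius_mean`, the columns would have to be reordered first.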

So far the data has been scaled and transformed to reduce skewness, outliers have been removed, and feature correlations have been inspected. The next study is feature transformation. By applying PCA to the cleaned, transformed features (`features_cl_tr_cl`), new dimensions that maximize the variance of the data can be discovered. In addition to finding these dimensions, PCA also reports the variance captured by each dimension.

In [None]:
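# Linear Discriminant Analysis fitted on the first 450 rows and evaluated on the remaining rows
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
output_float = output_cl_tr_cl.apply(lambda x: als_encode_diagnosis(x))
coeffs = clf.fit(features_cl_tr_cl[:450], output_float[:450]).coef_.T
LDA_F = features_cl_tr_cl[:450].dot(coeffs)

# Start the held-out slice at 450 so that no row is skipped
preds = clf.predict(features_cl_tr_cl[450:])
fig, ax = pl.subplots()
# Prediction minus true label: 0 = correct, +1 = false positive, -1 = false negative
ax.scatter(np.arange(len(preds)), preds-output_float[450:], c = preds, cmap='winter')
ax.set_xlabel('Held-out sample')
ax.set_ylabel('Prediction - true label')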

### Feature Transformation by PCA

In [None]:
# Applying PCA
from sklearn.decomposition import PCA 

In [None]:
def vs_plot_pca_variance(pca):
    x = np.arange(1,len(pca.components_)+1)
    fig, ax = pl.subplots(figsize=(10,6))
    
    # plot the cumulative variance
    ax.plot(x, np.cumsum(pca.explained_variance_ratio_), '-o', color='black')

    # plot the components' variance
    ax.bar(x, pca.explained_variance_ratio_, align='center', alpha=0.5)

    # plot styling
    ax.set_ylim(0, 1.05)
    
    for i,j in zip(x, np.cumsum(pca.explained_variance_ratio_)):
        ax.annotate(str(j.round(2)),xy=(i+.2,j-.02))
    ax.set_xticks(range(1,len(pca.components_)+1))
    ax.set_xlabel('PCA components')
    ax.set_ylabel('Explained Variance')
    
    fig.tight_layout()
    pl.show()
    

In [None]:
pca = PCA(n_components = 6).fit(features_cl_tr_cl)
vs_plot_pca_variance(pca)

The above graph indicates that 92% of the variance in the data is captured by just 6 principal components, instead of the original 30 features.
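
The number of components needed to reach a given share of the variance can also be computed rather than read off the plot. The sketch below uses a plain SVD in NumPy instead of scikit-learn's `PCA`, on synthetic data; the helper names, the synthetic matrix `X` and the 0.92 target are assumptions made for the example.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component of X (rows = samples)."""
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the component variances
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_needed(ratios, target=0.92):
    """Smallest number of leading components whose cumulative ratio reaches `target`."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(ratios))

# Synthetic data: 10 observed features driven by 3 latent factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

ratios = explained_variance_ratio(X)
k = components_needed(ratios, target=0.92)
print('cumulative variance:', np.cumsum(ratios).round(3))
print('components needed for 92%:', k)
```

scikit-learn's `PCA` can make the same choice itself when `n_components` is given as a fraction between 0 and 1 with the full SVD solver, which may be more convenient than fixing the number of components up front.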

In [None]:
def vs_pca_results(good_data, pca, **kwargs):
    cols = als_return_select_cols(good_data, which=kwargs['which'])
    cols_indices = [i for i, j in enumerate(good_data.keys()) if j in cols]
    # Dimension indexing
    dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

    # PCA components
    components = pd.DataFrame(np.round(pca.components_[:,cols_indices], 4), columns = good_data.keys()[cols_indices])
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    # Create a bar plot visualization
    fig, ax = pl.subplots(figsize = (14,8))

    # Plot the feature weights as a function of the components
    components.plot(ax = ax, kind = 'bar');
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)

In [None]:
# Generate PCA results plot
vs_pca_results(features_cl_tr_cl, pca, which='only_dims')

In [None]:
vs_pca_results(features_cl_tr_cl, pca, which='mean_non_dims')

In [None]:
vs_pca_results(features_cl_tr_cl, pca, which='se_non_dims')

In [None]:
vs_pca_results(features_cl_tr_cl, pca, which='worst_non_dims')

The above plots indicate the following:

1. The first principal component dimension has all positive weights.

2. The second principal component dimension has positive weights for all features except those related to cell dimensions (radius, perimeter, area). Together, the first two dimensions account for up to 72% of the variance.

3. `texture_mean`, `texture_se` and `texture_worst` all contribute heavily to the third principal component dimension.

4. The "mean" and "worst" features of `radius`, `perimeter` and `area` contribute only to the first two principal component dimensions.

In [None]:
def als_return_reduced_data(good_data, pca):
    dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]
    reduced_data = pd.DataFrame(data=pca.transform(good_data), columns=dimensions)
    return reduced_data

In [None]:
reduced_features = als_return_reduced_data(features_cl_tr_cl, pca)
output_float = output_cl_tr_cl.apply(lambda x: als_encode_diagnosis(x))

In [None]:
def vs_scatter_two_dimensions(reduced_features, output_float):
    fig, ax = pl.subplots(figsize=(8,5))
    ax.scatter(reduced_features.loc[:,'Dimension 1'], reduced_features.loc[:, 'Dimension 2'], c=output_float, cmap='winter')
    ax.set_xlabel('Dimension 1')
    ax.set_ylabel('Dimension 2')
    ax.set_title('Projections of features on first two principal components')
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    
    fig.tight_layout()
    pl.show()

In [None]:
vs_scatter_two_dimensions(reduced_features, output_float)

As can be seen from the above plot, the two classes are already separated quite well with just 2 dimensions. Whether adding further dimensions improves classification cannot be read off the plot; this can be known for sure only by applying classification models to the reduced features and analysing the scores. This is another reason why PCA is so useful when dealing with data that has a lot of features: it reduces noise by keeping only the most important set of uncorrelated components.

### Algorithms and Techniques

The reduced features from above will now be used in a binary classification algorithm. There are many algorithms suitable for this problem and below, I have provided an analysis of a few of them that I think can be used. 

1) Gaussian Naive Bayes (GaussianNB)

    * The Naive Bayes method is a supervised learning algorithm based on Bayes' theorem with the naive assumption that every pair of features is independent given the class. Gaussian Naive Bayes additionally assumes that the likelihood of each feature is Gaussian.
    
    One real-world application of this method is text prediction based on sample text data. Given a test sentence, Naive Bayes can predict the next word as the one with the highest probability of occurrence, as derived from the sample data. Subsequent words can similarly be predicted from the words most likely to follow the first predicted word, and so on.

    * The advantage of this algorithm is that it is simple, fast and requires relatively little training data.

    * However, it is often considered a poor estimator because of its overly simplified assumption of independence between feature pairs.

    * The problem at hand is binary classification, and the data we are using is the reduced feature matrix obtained from PCA. The transformed features are therefore uncorrelated with each other (though not necessarily independent), which makes it more likely that the algorithm will be successful. However, if the relationship between features and output is too complex to be captured by this naive algorithm, it will not be suitable. Nevertheless, it is worth giving it a shot.

2) Decision Trees/Random forests

    * A decision tree is a non-parametric model that classifies data based on a tree of decision rules. A random forest, on the other hand, is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

    One real-world application of this could be predicting the performance of a car with a series of labels such as average, good, excellent, extraordinary etc. A car's performance is related to various features, like its make, mileage, and the size and speed of its engine. By applying a series of decision rules to a number of these features, we might be able to build a pretty good model for predicting the car's performance.
    
    * The advantages of this method are many. Decision trees are simple to understand, visualize and interpret. They require little data preparation and can handle both categorical and numerical variables without much pre-processing.

    * While the performance of a decision tree might improve with the number of rules, it can very easily overfit if the right questions are not asked or the tree grows too deep. Decision trees are also unstable under small variations in the data, and they can become biased if some classes dominate. Random forests, on the other hand, are much less prone to overfitting because they average over many trees.

    * Since the problem at hand is quite similar to the example provided above, this algorithm is well worth a shot.
 
3) K-Nearest Neighbors Classifier

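    * The k-nearest-neighbours algorithm is a "lazy", non-parametric model: it stores the training samples and classifies a new point by a majority vote among its `k` nearest neighbours. Like decision trees, it makes no assumption about the form of the decision boundary, so it can be used in the same settings.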
    
4) Support Vector Classifier

    * A support vector machine constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space to achieve classification. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

    * The advantage of the method is that it is a very formal, well-founded approach to classification. The disadvantage is the high training time complexity (more than quadratic in the number of samples), which makes it difficult to apply this algorithm to datasets with more than a couple of tens of thousands of samples.
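
To make the independence assumption of item 1 concrete, here is a small from-scratch sketch of Gaussian Naive Bayes. It is for illustration only: the class `TinyGaussianNB` and the two synthetic blobs are hypothetical, and the notebook itself uses scikit-learn's `GaussianNB`.

```python
import numpy as np

class TinyGaussianNB:
    """Illustrative Gaussian Naive Bayes; not a replacement for sklearn's GaussianNB."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One mean and one variance per class and per feature
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # The small constant keeps every variance strictly positive
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        self.log_priors_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        # log p(c) + sum_j log N(x_j | mean_cj, var_cj):
        # summing over features is exactly the "naive" independence assumption
        diff = X[:, None, :] - self.means_[None, :, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * self.vars_)[None, :, :]
                          + diff ** 2 / self.vars_[None, :, :])
        scores = log_pdf.sum(axis=2) + self.log_priors_[None, :]
        return self.classes_[np.argmax(scores, axis=1)]

# Two synthetic, well-separated classes in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = TinyGaussianNB().fit(X, y)
accuracy = (model.predict(X) == y).mean()
print('training accuracy: {:.3f}'.format(accuracy))
```

Because the per-feature log-densities are simply added, correlated features are effectively counted more than once, which is why decorrelating the inputs with PCA first can help this model.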

## Methodology

- Step 1:
    I am going to begin by applying the following models to the reduced data from above:
        1. Gaussian Naive Bayes (since it is a simple model for initial testing)
        2. Decision tree/random forests (since this is a classification problem)
        3. K-nearest neighbours (since it is a non-parametric approach, in contrast to the more sophisticated plane-separation approaches)
        4. SVM/Logistic regression

- Step 2:
    Based on my understanding of the correlation maps above, I will manually remove some features that I think are not important, apply PCA to what is left, and apply the above models again to see if accuracy improves. (I can also use the feature selection modules during this process.)
    

- Step 3:
    I will pick the best 3 models from steps 1 & 2 and fine-tune their hyperparameters.
    

- Step 4:
    Finally, I will also try AdaBoost, which combines a collection of weak learners, on the selected features as well as on the reduced features.
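
Step 1 can also be sketched as a single scikit-learn pipeline, so that scaling and PCA are re-fitted inside every cross-validation fold. This is a sketch under stated assumptions and not the code used for the results in this notebook: it loads the copy of the Wisconsin diagnostic data bundled with scikit-learn instead of the CSV file, and it skips the outlier removal and log transform applied above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Bundled copy of the dataset: 569 samples, 30 features.
# Its encoding is 0 = malignant, 1 = benign, the reverse of als_encode_diagnosis.
X, y = load_breast_cancer(return_X_y=True)
y_malignant = 1 - y  # make malignant the positive class, as in this notebook

# Scaler and PCA are fitted on the training part of each fold only
pipeline = make_pipeline(MinMaxScaler(), PCA(n_components=6), KNeighborsClassifier())
scores = cross_validate(pipeline, X, y_malignant, cv=5,
                        scoring=['accuracy', 'precision', 'recall'])

for name in ['test_accuracy', 'test_precision', 'test_recall']:
    print('{}: {:.3f}'.format(name, scores[name].mean()))
```

The design point is leakage: when the scaler and PCA are fitted on the full dataset before cross-validation, each held-out fold has already influenced the transformation, which can make test scores slightly optimistic.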

In [None]:
def als_print_evaluation_metrics(clf, x, y, scoring, cv=5, only_times=True, print_times=True):
    scores = cross_validate(clf, x, y, cv=cv, scoring=scoring, return_train_score=True)
    if print_times:
        print('Average fit time is:   {:.3f}s'.format(np.mean(scores['fit_time'])))
        print('Average score time is: {:.3f}s\n'.format(np.mean(scores['score_time'])))
    if not only_times:
        print(' {: >7} {: >10} |  {: >3}    |  {: >3}    |  {: >3}    |'.format(' ', ' ', 'Avg', 'Min', 'Max'))
        for f in ['train', 'test']:
            for s in scoring:
                key = [sc for sc in scores.keys() if re.search('{}(.){}'.format(f,s),sc)]
                print(' {: >7} {: >10} |  {: >.3f}  |  {: >.3f}  |  {: >.3f}  |'.format(f, s, np.mean(scores[key[0]]), np.min(scores[key[0]]), np.max(scores[key[0]])))

In [None]:
def vs_plot_evaluation_metrics(clfs, clf_labels, x, y, cv=5):
    scoring = ['accuracy', 'precision', 'recall']
    scores = {}
    for label, clf in zip(clf_labels, clfs):
        scores[label] = cross_validate(clf, x, y, cv=cv, scoring=scoring, return_train_score=True)
    colors = ['b', 'g', 'r', 'k', 'c']
    lab2 = ['Avg', 'Min', 'Max']
    fig, ax = pl.subplots(2,3, figsize = (20,15))
    for i, f in enumerate(['train', 'test']):
        for j, s in enumerate(scoring):
            minval=1
            for k, lab1 in enumerate(clf_labels):
                scs = scores[lab1]
                key = [sc for sc in scs.keys() if re.search('{}(.){}'.format(f,s),sc)][0]
                alphac = [1,0.2,0.5]
                for l, lval in enumerate([np.mean(scs[key]), np.min(scs[key]), np.max(scs[key])]):
                    lab = lab1 + ' ' + lab2[l]
                    if lval < minval:
                        minval = lval
                    ax[i,j].bar(k+1+(0.23*l), lval, 0.23, color=colors[k], label=lab, alpha=alphac[l]) 
            ax[i,j].legend()
            ax[i,j].set_ylim(minval-0.01,1)
            ax[i,j].set_xlim(0,8)
            ax[i,j].set_title('{} {} scores'.format(f,s))
            ax[i,j].set_ylabel('Score')
            ax[i,j].set_xticklabels([])
    fig.tight_layout()
    pl.show()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

clf_GNB = GaussianNB()
clf_RF = RandomForestClassifier()
clf_KNN = KNeighborsClassifier()
clf_SVM = SVC()
clfs = [clf_GNB, clf_RF, clf_KNN, clf_SVM]
clf_labels = ['GNB', 'RF', 'KNN', 'SVC']
vs_plot_evaluation_metrics(clfs, clf_labels, reduced_features, output_float, cv=5)
scoring=['accuracy', 'precision', 'recall', 'f1']
for label, clf in zip(clf_labels, clfs):
    print('{}:'.format(label))
    als_print_evaluation_metrics(clf, reduced_features, output_float, scoring, cv=5)

The above plot leads to the following conclusions:
1. Gaussian Naive Bayes is the weakest performer of all, as suspected.
2. Random forests have the best training scores (sometimes even perfect), but their testing scores are not as good. This gap indicates an overfitted (high-variance) model.
3. KNN and SVC are comparable.
    - KNN training accuracy scores seem higher than those of SVC, but KNN training precision scores are slightly lower.
    - The spread of scores across the folds (as seen from the mean, min and max values) is lower for KNN in the test recall scores, but the opposite holds for the test precision scores.
    - Also, KNN has a low fitting time but a higher scoring time, and vice versa for SVC. However, since the dataset is so small, this doesn't really matter here.

It would be wise to print the F1 test scores for the KNN and SVC models to make the final decision on the best model.
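
For reference, F1 is the special case `beta = 1` of the F-beta score, which combines precision and recall into one number. The sketch below spells out the formula; the precision and recall values are made up for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up example: a model that is precise but misses some positives
precision, recall = 0.95, 0.80
for beta in (0.5, 1.0, 2.0):
    print('beta = {:>3}: F = {:.3f}'.format(beta, f_beta(precision, recall, beta)))
```

A `beta` below 1 weights precision more heavily and a `beta` above 1 weights recall more heavily, so the choice of `beta` encodes whether false positives or false negatives are considered the costlier mistake.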

In [None]:
scoring=['f1']
for label, clf in zip(['KNN', 'SVC'], [clf_KNN, clf_SVM]):
    print('{}:'.format(label))
    als_print_evaluation_metrics(clf, reduced_features, output_float, scoring, cv=5, only_times=False, print_times=False)
    print('\n')

The next step of the strategy is to remove some of the features identified during data exploration/visualization (i.e. feature selection), followed by a PCA transformation and the use of the classifier models on the result.

In [None]:
cols = ['radius_mean', 'radius_se', 'radius_worst', 'perimeter_mean', 'perimeter_se', 'perimeter_worst', 'area_se', 'area_worst', 'smoothness_se', 'compactness_se', 'concave points_se', 'concavity_se', 'symmetry_se']
# Drop the correlated features identified above in one call
selected_features = features_cl_tr_cl.drop(columns=cols)
pca = PCA(n_components = 6).fit(selected_features)
vs_plot_pca_variance(pca)

In [None]:
selected_reduced_features = als_return_reduced_data(selected_features, pca)
clf_GNB = GaussianNB()
clf_RF = RandomForestClassifier()
clf_KNN = KNeighborsClassifier()
clf_SVM = SVC()
clfs = [clf_GNB, clf_RF, clf_KNN, clf_SVM]
clf_labels = ['GNB', 'RF', 'KNN', 'SVC']
output_float = output_cl_tr_cl.apply(lambda x: als_encode_diagnosis(x))
vs_plot_evaluation_metrics(clfs, clf_labels, selected_reduced_features, output_float, cv=5)

The results do not indicate any model improvement; if anything, there might be a performance drop due to selecting features before the transformation. There is also an indication of overfitting, since training scores are much better than testing scores.

The final step of model refinement is fine-tuning the model hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, make_scorer
from sklearn.model_selection import ShuffleSplit

knn = KNeighborsClassifier()
parameters = {'n_neighbors':list(range(2,7)), 'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']}
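# F-beta with beta=0.5 weights precision more heavily than recall
scoring=make_scorer(fbeta_score, beta=0.5)
# return_train_score=True is needed for the 'mean_train_score' column selected below
clf = GridSearchCV(knn, parameters, scoring=scoring, cv=5, return_train_score=True)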
clf.fit(reduced_features, output_float)
results_pd = pd.DataFrame(clf.cv_results_)

In [None]:
select_cols = ['mean_train_score', 'mean_test_score', 'param_algorithm', 'param_n_neighbors', 'rank_test_score', 'std_test_score']
results_pd[select_cols].sort_values(['rank_test_score'], ascending=True).reset_index(drop=True).head(8)

The results indicate that all values of `algorithm` give the same result; only the `n_neighbors` parameter has any significant effect on the scores.

In [None]:
best_clf = KNeighborsClassifier(n_neighbors=4, algorithm='auto')
next_best_clf = KNeighborsClassifier(n_neighbors=5, algorithm='auto')

Challenges faced during coding:

Applying the classification models was easy, thanks to the classes available for each of them in scikit-learn. The real challenge was in the sections above, which consist of data preprocessing, visualization, feature selection, transformation etc. Also, since the features were all numerical, the only encoding I had to do was to convert the diagnosis to a numeric value. So the coding challenges in this project were actually quite minimal.

## Results

In [None]:
scoring=['accuracy', 'precision', 'recall', 'f1']
print('Best obtained from grid search:\n')
als_print_evaluation_metrics(best_clf, reduced_features, output_float, scoring, only_times=False)
print('\nSecond best model from grid search, it has lower variance of test scores across folds:\n')
als_print_evaluation_metrics(next_best_clf, reduced_features, output_float, scoring, only_times=False)

#### Robustness of model

In this section, I am going to further test the model on some other data. Since I have no other data sources, I am going to do the following:

1. Corrupt the output, i.e., change some `benign` diagnoses to `malign`. 
2. We will know that the model is robust if the above change results in more false negatives



In [None]:
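def als_corrupt_output(output_float, f):
    '''Return a copy of output_float in which a fraction f of the 1-labels is flipped to 0.'''
    corrupted = output_float.copy()  # leave the caller's labels untouched
    positives = corrupted[corrupted == 1]
    turnovers = int(len(positives) * f)
    # sample without replacement so that exactly `turnovers` distinct labels are flipped
    turnover_index = np.random.choice(positives.index, turnovers, replace=False)
    corrupted.loc[turnover_index] = 0
    return corrupted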
               

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 
# Split the reduced features and the diagnosis labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(reduced_features, 
                                                    output_float, 
                                                    test_size = 0.2, 
                                                    random_state = 0)
best_clf.fit(X_train, y_train)
y_preds = best_clf.predict(X_test)
# confusion_matrix returns rows = actual labels, columns = predicted labels, in sorted label order
cm_index, cm_columns = ['actual 0', 'actual 1'], ['predicted 0', 'predicted 1']
print('Confusion matrix of best model tested on original testing data:')
print(pd.DataFrame(confusion_matrix(y_test, y_preds), index=cm_index, columns=cm_columns))
print('\n')
print('Confusion matrix of best model tested on testing data corrupted with false benign diagnoses, by 30%:')
y_test_corrupted = als_corrupt_output(y_test, 0.3)
print(pd.DataFrame(confusion_matrix(y_test_corrupted, y_preds), index=cm_index, columns=cm_columns))


The trend is just as expected. The number of false negatives has gone up. This gives me confidence that the model's performance is not random; rather, it has clearly recognized the "hyper-plane of separation" between the two classes. More experiments of this kind can be conducted.
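To make explicit which cell of the confusion matrix absorbs the flipped labels, here is a toy sketch using NumPy only. The labels, the perfect "classifier" and the helper `confusion_counts` are all hypothetical; the layout follows scikit-learn's `confusion_matrix` convention of actual labels on rows and predicted labels on columns.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """2x2 matrix with rows = actual label (0, 1), columns = predicted label (0, 1)."""
    matrix = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[int(actual), int(predicted)] += 1
    return matrix

rng = np.random.RandomState(0)
y_true = np.array([0] * 60 + [1] * 40)   # toy labels, not the notebook's data
y_pred = y_true.copy()                   # hypothetical perfect classifier

# Flip 30% of the actual 1-labels to 0, without replacement, on a copy.
ones = np.flatnonzero(y_true == 1)
n_flip = int(round(0.3 * len(ones)))
flipped = rng.choice(ones, size=n_flip, replace=False)
y_corrupted = y_true.copy()
y_corrupted[flipped] = 0

before = confusion_counts(y_true, y_pred)
after = confusion_counts(y_corrupted, y_pred)
print(before)
print(after)
```

The flipped samples move from the (actual 1, predicted 1) cell to the (actual 0, predicted 1) cell, because the predictions do not change. Whether that cell is read as false positives or false negatives depends on which class is treated as positive.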


## Conclusion

Thus, as I had set out to do, I have addressed the binary classification problem of cancer diagnosis from FNA tests, based on the following strategy:

1. I carried out an extensive exploratory data analysis and visualization of the 30 features constituting the test results. 
    - Log transformation and min-max scaling were applied.
    
    - Outliers were detected as points lying outside the interquartile range. Points that were identified as outliers in the largest number of features were dropped. This process resulted in losing about 4% of the data. The loss was more or less equally distributed between the classes, thus maintaining the class balance. 
    
    - Violin plots, which are kernel density estimation plots divided by class type, were plotted for all the features. They made it possible to see which features were most likely to affect classification. 
    
    - Correlation heat maps were also plotted to further identify the relations between the features.
   
    - The above visualization studies would serve as a guide for heuristic feature selection.

2. Following the above analysis, I attempted feature transformation based on PCA. The feature weights composing the first 6 principal component dimensions were plotted, and scatter plots of the first two and the first three dimensions were drawn to visualize the separation achieved with just 3 transformed features. 

3. After feature transformation, I used the cleaned, reduced, transformed features in classification algorithms such as Gaussian Bayes, Random Forests, (non-parametric) K-nearest neighbors and support vector classifiers. The classifiers' scores were estimated across 5 cross-validation folds, and their mean, worst and best values were plotted for comparison. The KNN classifier came out ahead on several of these measures.

4. I also tried dropping the features that the first step suggested were not useful, followed by PCA and classification. Dropping those features actually resulted in a small drop in the scores. 

5. Model hyperparameters were also optimized with the simple grid search available in scikit-learn. 

6. The final test scores that I have are accuracy: 0.973, precision: 0.995 and recall: 0.93. 

7. I have also printed out a second-best classifier, which has slightly lower scores but also lower variance of the testing scores across the cross-validation splits. However, drawing strong conclusions from this would definitely require more data points.
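The transformation and classification steps above can be sketched as a single scikit-learn pipeline. This is a sketch under stated assumptions, not the notebook's exact code: it loads scikit-learn's bundled copy of the Wisconsin diagnostic data via `load_breast_cancer` (assumed to correspond to the CSV used here), uses `np.log1p` as the log transformation, and skips the outlier removal step. Note that in the bundled copy the label 1 means benign, so precision and recall may refer to a different positive class than in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Bundled copy of the dataset: 569 samples, 30 features (0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

pipeline = make_pipeline(
    FunctionTransformer(np.log1p),        # log transformation (assumed form)
    MinMaxScaler(),                       # min-max scaling
    PCA(n_components=6),                  # keep the first 6 principal components
    KNeighborsClassifier(n_neighbors=4),  # classifier chosen by the grid search
)

# 5-fold cross-validation; the transformers are refit on each training fold.
scores = cross_validate(pipeline, X, y, cv=5,
                        scoring=['accuracy', 'precision', 'recall'])
for name in ['accuracy', 'precision', 'recall']:
    fold_scores = scores['test_' + name]
    print('%s: mean=%.3f, std=%.3f' % (name, fold_scores.mean(), fold_scores.std()))
```

Keeping the scaler and PCA inside the pipeline means they only ever see the training part of each fold, which avoids leaking information from the test fold into the transformation.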

#### Free form visualization

To illustrate the significance of the exploratory and transformation analyses, I have shown 3 different scatter plots:
1. Scatter plot of 3 'non-significant' features (`smoothness_mean`, `concavity_mean`, `compactness_se`). I am calling these non-significant for several reasons: they did not contribute much to the first few principal component dimensions, they did not seem to determine classification (as was seen from the violin plots), and they seemed to be correlated, as seen from the heat maps.

2. Scatter plot of 3 'significant' features (`area_mean`, `texture_mean`, `fractal_dimension_mean`) for all of the opposite reasons mentioned above

3. Scatter plot of first 3 principal component dimensions.

As can be seen in the plots, the separation of malign/benign cases gets progressively better. Visually, the separating hyperplane would be easiest to find in the third case. Thus the more uncorrelated the features are, the easier it is to separate the output. This is, in my opinion, the single key to this problem. 
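One property behind the third plot is that principal component scores are uncorrelated by construction. The sketch below checks this with NumPy on synthetic data; the data and the helper `max_abs_offdiag_corr` are illustrative assumptions. Decorrelation alone does not guarantee class separation, so this only illustrates the correlation part of the argument.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic stand-in: three strongly correlated features built from one latent variable.
latent = rng.normal(size=500)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=500),
    2.0 * latent + 0.1 * rng.normal(size=500),
    -latent + 0.1 * rng.normal(size=500),
])

# PCA via SVD of the centred data: the scores are the projections onto the components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

def max_abs_offdiag_corr(data):
    """Largest absolute correlation between two different columns of data."""
    corr = np.corrcoef(data, rowvar=False)
    return np.abs(corr - np.diag(np.diag(corr))).max()

print('original features:   ', max_abs_offdiag_corr(X))
print('principal components:', max_abs_offdiag_corr(scores))
```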

In [None]:
def vs_biplot(good_data, reduced_data, output_float, pca):
    '''
    Produce three 3D scatter plots, coloured by diagnosis: three 'non-significant'
    original features, three 'significant' original features, and the first three
    principal components with the projections of the original features overlaid.
    
    good_data: original data, before transformation.
               Needs to be a pandas dataframe with valid column names
    reduced_data: the reduced data (the first three dimensions are plotted)
    output_float: class labels, used to colour the points
    pca: pca object that contains the components_ attribute

    return: None (the figure is displayed with pl.show())
    
    This procedure is inspired by the script:
    https://github.com/teddyroland/python-biplot
    '''

    fig = pl.figure(figsize = (22,10))
    ax1 = fig.add_subplot(1,3,3, projection='3d')
    # scatterplot of the reduced data    
    xs = reduced_data.loc[:, 'Dimension 1']
    ys = reduced_data.loc[:, 'Dimension 2']
    zs = reduced_data.loc[:, 'Dimension 3']
    
    ax1.scatter(xs, ys, zs, c=output_float, cmap='winter')
    feature_vectors = pca.components_.T

    # we use scaling factors to make the arrows easier to see
    arrow_size, text_pos = 5.6, 6

    # projections of the original features
    for i, v in enumerate(feature_vectors):
        ax1.plot([0, arrow_size*v[0]], [0, arrow_size*v[1]], [0, arrow_size*v[2]], lw=1.5, color='red')
        ax1.text(v[0]*text_pos, v[1]*text_pos, v[2]*text_pos, good_data.columns[i], color='black', 
                 ha='center', va='center', fontsize=14)

    ax1.set_xlabel("Dimension 1", fontsize=14)
    ax1.set_ylabel("Dimension 2", fontsize=14)
    ax1.set_zlabel("Dimension 3", fontsize=14)

    ax1.set_title("Scatter on first 3 PCs", fontsize=18);
    
    ax2 = fig.add_subplot(1,3,1, projection='3d')
    cols = ['smoothness_mean', 'concavity_mean', 'compactness_se']
    # scatterplot of three 'non-significant' original features
    xs = good_data.loc[:, cols[0]]
    ys = good_data.loc[:, cols[1]]
    zs = good_data.loc[:, cols[2]]
    
    ax2.scatter(xs, ys, zs, c=output_float, cmap='winter')
    ax2.set_xlabel(cols[0], fontsize=14)
    ax2.set_ylabel(cols[1], fontsize=14)
    ax2.set_zlabel(cols[2], fontsize=14)

    ax2.set_title("Scatter on any 3 'non-significant' features", fontsize=18);
    
    ax3 = fig.add_subplot(1,3,2, projection='3d')
    cols = ['area_mean', 'texture_mean', 'fractal_dimension_mean']
    # scatterplot of three 'significant' original features
    xs = good_data.loc[:, cols[0]]
    ys = good_data.loc[:, cols[1]]
    zs = good_data.loc[:, cols[2]]
    
    ax3.scatter(xs, ys, zs, c=output_float, cmap='winter')
    ax3.set_xlabel(cols[0], fontsize=14)
    ax3.set_ylabel(cols[1], fontsize=14)
    ax3.set_zlabel(cols[2], fontsize=14)

    ax3.set_title("Scatter on any 3 'significant' features", fontsize=18);
    
    fig.tight_layout()
    pl.show()
    

In [None]:
vs_biplot(features_cl_tr_cl, reduced_features, output_float, pca)

#### Reflection on the challenges faced

As I have emphasized throughout, this project is about understanding the features, twisting and turning them and, as an analogy, identifying which are the "guys" in charge. While this is straightforward with principal component analysis, the result is a set of transformed dimensions that do not make much physical sense. On the other hand, identifying the significant features from the data visualization exercises proved very difficult, simply because there were so many of them that it was hard to arrive at a robust conclusion. This was evident at the step where I observed a higher classification error when only some features were selected: evidently, the features I dropped were not the right candidates to drop. 

Another problem I faced was the size of the dataset. It was too small for me to identify and reject most outliers. It was also too small for me to attempt cross-validation with a higher number of folds, as more folds meant a significantly smaller test set per fold, resulting in more variance in the testing scores across folds. 
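The effect of the number of folds on the test-set size can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes the raw count of 569 samples (fewer remain after outlier removal), an illustrative accuracy of 0.97, and a simple binomial standard error that treats test samples as independent. The helper `fold_std_error` is hypothetical.

```python
import math

def fold_std_error(n_samples, n_folds, accuracy):
    """Approximate test-fold size and binomial standard error of the fold accuracy."""
    test_size = n_samples // n_folds
    std_error = math.sqrt(accuracy * (1 - accuracy) / test_size)
    return test_size, std_error

n_samples = 569          # rows in the raw dataset; fewer remain after outlier removal
assumed_accuracy = 0.97  # illustrative value, close to the reported test accuracy

for n_folds in (5, 10, 20, 50):
    test_size, std_error = fold_std_error(n_samples, n_folds, assumed_accuracy)
    print('%2d folds: ~%3d test samples per fold, std. error ~ %.3f'
          % (n_folds, test_size, std_error))
```

Under these assumptions, the per-fold standard error grows roughly with the square root of the number of folds, which is consistent with the larger spread across folds described above.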

#### Improvement
1. A more useful analysis would be to perform feature selection using a simple classifier, maybe a decision tree. This would allow us to further explore the effects of feature selection before feature transformation. 

2. Deep learning algorithms could be tested as well, as it appears that the decision boundary between malign and benign cases is very non-linear. But again the dataset size is very small.

3. In general, the amount of data is a limitation. There are only 569 data points in total (fewer after outlier removal). This makes it hard to analyse the effects of overfitting, as well as the effects of increasing the number of folds in cross-validation. 
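The first improvement could start from something like the following sketch, which ranks features by the impurity-based importances of a shallow decision tree. It is a rough starting point under assumptions: it uses scikit-learn's bundled copy of the dataset (whose feature names read `mean radius` rather than `radius_mean`), and `max_depth=4` is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# A shallow tree keeps the ranking focused on the most informative splits.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X, y)

# Pair each importance with its feature name and list the largest ones first.
ranked = sorted(zip(tree.feature_importances_, data.feature_names), reverse=True)
for importance, name in ranked[:8]:
    print('%-25s %.3f' % (name, importance))
```

Features with zero importance were never used for a split and would be natural candidates to drop before PCA, though importances from a single tree can be unstable on a dataset this small.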