This project evaluates the performance of machine learning techniques in classifying malignant melanoma, benign tumours and normal moles. A skin tumour is an uncontrolled growth of cells in the skin, which may be cancerous. Melanoma is one of the common types of skin cancer. The aim is to develop computer-aided diagnosis for skin tumours. Dermal images of three types (benign tumour, malignant melanoma and normal mole) were obtained from the authorised PH2 database and other sources. The details are provided in the attached README file. Pre-processing is performed to remove hair cells. A contour-based level set technique is used to segment the lesion, from which clinical, texture and morphological features are extracted. The significant features are obtained using the Random Subset Feature Selection technique, and classification is carried out using machine learning algorithms. To begin with, we load the libraries.

In [None]:
import numpy as np
from matplotlib import pyplot
from pandas import read_csv
from pandas import set_option
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline 

The dataset is loaded from a .csv file containing 1500 instances with 15 attributes each. The 16th column is the class, coded '1' for malignant, '2' for benign and '3' for normal mole.

In [None]:
import os
print(os.getcwd())  # confirm the working directory that should contain Book1.csv
filename = 'Book1.csv'
names = ["area", 'asymmetry_index', 'GD', 'SD', 'perimeter', 'circularity_index', 
         'compact_index', 'entropy', 'color_mean', 'color_std','contrast',
         'correlation', 'energy', 'homogenity', 'mean', 'class']
data = read_csv(filename, names=names)
print(data.shape)

Take a look at the data types

In [None]:
set_option('display.max_rows', 500)
print(data.dtypes)

Take a peek at the first 20 rows of the data

In [None]:
set_option('display.width', 100)
print(data.head(20))

It's time to describe the data. Let's set the display precision to 3 decimal places.

In [None]:
set_option('display.precision', 3)  # full option name; the short form 'precision' is ambiguous in recent pandas
print(data.describe())

Take a quick look at the breakdown of the class values

In [None]:
print(data.groupby('class').size())

Let's measure the correlation between the attributes. Correlation refers to the relationship between two variables and how they may or may not change together. The most common method for calculating correlation is Pearson's Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all. Some machine learning algorithms like linear and logistic regression can show poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pairwise correlations of the attributes in your dataset.

In [None]:
set_option('display.width', 100)
correlations = data.corr(method='pearson')
print(correlations)
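
As a minimal sketch of what the Pearson coefficient measures, the snippet below computes it from its definition (covariance divided by the product of the standard deviations) and checks the -1 and +1 extremes described above. The helper `pearson_r` and the toy arrays are hypothetical and are not taken from the lesion dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: centred cross-product over the product of centred norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical toy values, used only for illustration
area = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
perimeter = 2.0 * area + 1.0    # exact increasing linear relation
inverse = -0.5 * area + 3.0     # exact decreasing linear relation

r_pos = pearson_r(area, perimeter)
r_neg = pearson_r(area, inverse)
print(r_pos, r_neg)
```

On real attributes the value falls somewhere between these extremes, which is what `data.corr(method='pearson')` reports for every pair of columns.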

Skew refers to a distribution that is assumed to be Gaussian (normal, or bell curve) but is shifted or squashed in one direction or another. Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute is skewed may allow you to perform data preparation to correct the skew and later improve the accuracy of your models. You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame.


In [None]:
skew = data.skew()
print(skew)

There appears to be quite a lot of skew in the data, which could be corrected with a Box-Cox transform or other methods. However, ensemble methods are less sensitive to the data distribution.
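
The Box-Cox correction mentioned above is not applied in this notebook. As a rough sketch of what it would do, the snippet below applies `scipy.stats.boxcox` to a synthetic, strictly positive, right-skewed feature and compares the skew before and after. The data are generated for illustration only, and Box-Cox requires all values to be greater than zero, which is an assumption that would need checking for each real attribute.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical right-skewed, strictly positive feature (not from Book1.csv)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With no lambda given, boxcox picks it by maximum likelihood and returns both values
transformed, fitted_lambda = stats.boxcox(feature)

skew_before = stats.skew(feature)
skew_after = stats.skew(transformed)
print("skew before: %.3f, after: %.3f, lambda: %.3f"
      % (skew_before, skew_after, fitted_lambda))
```

In a modelling workflow the transform parameter would normally be fitted on the training split only, for example with scikit-learn's `PowerTransformer(method='box-cox')` inside a pipeline.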

Univariate data visualizations (visualizations of individual attributes):
It is often useful to look at your data using multiple different visualizations in order to spark ideas. A fast way to get an idea of the distribution of each attribute is to look at histograms. Histograms group data into bins and provide you a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. It can also help you see possible outliers.

In [None]:
# DataFrame.hist creates its own figure, so the size is passed to it directly
data.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1, figsize=(20, 10))
pyplot.show()

We can see that most of the features have a skewed distribution. Some sort of data standardization and scaling becomes necessary.
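
To make the binning idea concrete, here is a small sketch using `numpy.histogram` on a handful of made-up values. The numbers are hypothetical and chosen so that one value sits far from the rest.

```python
import numpy as np

# Hypothetical measurements; 9 is deliberately far from the others
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 9])

# Four equal-width bins between the minimum (1) and the maximum (9)
counts, edges = np.histogram(values, bins=4)
print(edges)   # bin boundaries: 1, 3, 5, 7, 9
print(counts)  # observations per bin: 3, 5, 0, 1
```

The empty third bin followed by a single observation in the last bin is the kind of pattern that hints at a possible outlier or a long right tail.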

Density plot: 
Let's look at the same data from the perspective of density plots.

In [None]:
# figsize goes to plot() itself; calling pyplot.figure() afterwards would only open an empty figure
data.plot(kind='density', subplots=True, layout=(4,4), sharex=False, legend=False,
          fontsize=1, figsize=(20, 10))
pyplot.show()

Let's visualize the box plots.

In [None]:
# figsize goes to plot() itself; calling pyplot.figure() afterwards would only open an empty figure
data.plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False,
          figsize=(20, 10))
pyplot.show()

We can see that the spread of attributes is quite different. Some features like contrast, circularity index, and compact-index appear quite skewed towards smaller values.

We will now perform multivariate data visualizations, plotting the interactions between multiple variables in the dataset. This can be done with a Correlation Matrix Plot and a Scatter Plot Matrix. Correlation gives an indication of how related the changes are between two variables. If two variables change in the same direction they are positively correlated. If they change in opposite directions (one goes up, one goes down), they are negatively correlated. Calculating the correlation between each pair of attributes gives a correlation matrix. You can then plot the correlation matrix and get an idea of which variables have a high correlation with each other. This is useful to know, because some machine learning algorithms like linear and logistic regression can have poor performance if there are highly correlated input variables in your data.

In [None]:
# correlation matrix: visualizing the correlation between the attributes
correlations = data.corr()
# plot correlation matrix
fig = pyplot.figure(figsize=(20,10), dpi=100)
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(len(names))  # one tick per column, including 'class' (numpy is imported as np)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()

We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as the top right. This is useful as we can see two different views of the same data in one plot. We can also see that each variable is perfectly positively correlated with itself (as you would expect) along the diagonal from top left to bottom right. Patches of white show strong negative correlation among the variables. We will now plot a Scatter Plot Matrix. A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. You can create a scatter plot for each pair of attributes in your data. Drawing all these scatter plots together is called a scatter plot matrix. Scatter plots are useful for spotting structured relationships between variables, like whether you could summarize the relationship between two variables with a line. Attributes with structured relationships may also be correlated and are good candidates for removal from your dataset.

In [None]:
# seaborn's pairplot draws the scatter plot matrix (paired visualizations) on its own figure
sns.pairplot(data)
pyplot.show()
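
Reading candidate attributes for removal off a plot can be error-prone. As a hedged sketch, the helper below lists every distinct pair of columns whose absolute Pearson correlation meets a threshold. The function name `correlated_pairs`, the 0.9 threshold and the toy DataFrame are all assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for each distinct pair with |r| >= threshold."""
    corr = frame.corr(method='pearson')
    cols = corr.columns
    pairs = []
    # The matrix is symmetric, so scanning above the diagonal covers each pair once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

rng = np.random.default_rng(7)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'b': 3.0 * base + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of 'a'
    'c': rng.normal(size=200),                           # independent noise
})
pairs = correlated_pairs(toy, threshold=0.9)
print(pairs)
```

Applied to the real data, something like `correlated_pairs(data.drop(columns='class'))` would give a shortlist to inspect; whether to drop any attribute remains a judgment call.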

The plot shows some sort of linear correlation between the variables. A few appear to be linearly separable. Let's begin the evaluation by comparing the performance of different models: Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, Decision Trees, Gaussian Naive Bayes and SVM. We will perform 10-fold cross-validation.

In [None]:
# Split-out validation dataset
array = data.values
X = array[:,0:15].astype(float)
Y = array[:,15].astype(int)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y,
test_size=validation_size, random_state=seed)

In [None]:
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'accuracy'
#creating a baseline of performance for this problem
# Spot-Check Algorithms: two linear and four non-linear
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
for name, model in models:
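    # shuffle=True is needed for random_state to apply; recent scikit-learn raises an error otherwise
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)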
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)
    print(msg)
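
To show what `cross_val_score` is doing in the loop above, the sketch below repeats the same procedure by hand: split the rows into folds, fit on all folds but one, score on the held-out fold, and repeat. It runs on synthetic data from `make_classification` (an assumption, since the lesion features are not reproduced here) and uses a decision tree with a fixed seed so both routes give identical scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the 15-attribute dataset (illustration only)
X_toy, y_toy = make_classification(n_samples=300, n_features=15, n_informative=6,
                                   n_classes=3, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

manual_scores = []
for train_idx, test_idx in kfold.split(X_toy):
    model = DecisionTreeClassifier(random_state=7)
    model.fit(X_toy[train_idx], y_toy[train_idx])                         # fit on nine folds
    manual_scores.append(model.score(X_toy[test_idx], y_toy[test_idx]))  # accuracy on the tenth
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(DecisionTreeClassifier(random_state=7),
                                 X_toy, y_toy, cv=kfold, scoring='accuracy')
print(manual_scores.mean(), library_scores.mean())
```

The mean and standard deviation printed for each model in the notebook are summaries of ten such per-fold accuracies.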

Let's compare the performance of the algorithms by looking at the distribution of scores across all cross-validation folds.

In [None]:
fig = pyplot.figure(figsize=(20,10), dpi=100)
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
bp = pyplot.boxplot(results, patch_artist = True)
## change outline color, fill color and linewidth of the boxes
for box in bp['boxes']:
    box.set(color='Green')
    box.set(facecolor='white')
for whisker in bp['whiskers']:
    whisker.set(color="Black")
for cap in bp['caps']:
    cap.set(color="Gray")
for median in bp['medians']:
    median.set(color="red")
# change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.3)
ax.set_xticklabels(names)
pyplot.show()

Let's evaluate our algorithms with standardized data. We suspect that the differing distributions of the raw data may be negatively impacting the skill of some of the algorithms. Let's evaluate the same algorithms with a standardized copy of the dataset, where the data is transformed such that each attribute has a mean value of zero and a standard deviation of one. We also need to avoid data leakage when we transform the data. A good way to avoid leakage is to use pipelines that standardize the data and build the model for each fold in the cross-validation test harness. That way we can get a fair estimate of how each model with standardized data might perform on unseen data.

In [None]:
# Standardize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()), ('LR',
LogisticRegression())])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', StandardScaler()),('LDA',
LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()),('KNN',
KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('CART',
DecisionTreeClassifier())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()),('NB',
GaussianNB())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()),('SVM', SVC())])))
results = []
names = []
for name, model in pipelines:
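    # shuffle=True is needed for random_state to apply; recent scikit-learn raises an error otherwise
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)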
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)
    print(msg)
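
What the 'Scaler' step does inside each fold can be sketched by hand: the mean and standard deviation are learned from the training rows only and then reused, unchanged, on the held-out rows. The array below is synthetic and the 80/20 row split is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical feature matrix with a large offset and spread
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
train, held_out = X_toy[:80], X_toy[80:]

scaler = StandardScaler().fit(train)           # statistics come from the training rows only
train_scaled = scaler.transform(train)
held_out_scaled = scaler.transform(held_out)   # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(held_out_scaled.mean(axis=0))            # close to, but not exactly, zero
```

Fitting the scaler on all rows before splitting would let information from the held-out rows influence the transform, which is the leakage the pipelines are meant to prevent.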

Now let's compare the performance of the scaled algorithms by looking at the distribution of scores across all cross-validation folds.

In [None]:
fig = pyplot.figure(figsize=(20,10), dpi=100)
fig.suptitle('Scaled Algorithms Comparison')
ax = fig.add_subplot(111)
bp = pyplot.boxplot(results, patch_artist = True)
## change outline color, fill color and linewidth of the boxes
for box in bp['boxes']:
    box.set(color='Green')
    box.set(facecolor='white')
for whisker in bp['whiskers']:
    whisker.set(color="Black")
for cap in bp['caps']:
    cap.set(color="Gray")
for median in bp['medians']:
    median.set(color="red")
# change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.3)
ax.set_xticklabels(names)
pyplot.show()

The results suggest digging deeper into the CART and SVM algorithms. It is very likely that configuration beyond the defaults may yield even more accurate models. Before that, however, we can also try ensemble methods, as they are known to often produce better results. Let's see how they work.

In [None]:
ensembles = []
ensembles.append(('AB', AdaBoostClassifier()))
ensembles.append(('GBM', GradientBoostingClassifier()))
ensembles.append(('RF', RandomForestClassifier()))
ensembles.append(('ET', ExtraTreesClassifier()))
results = []
names = []
for name, model in ensembles:
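    # shuffle=True is needed for random_state to apply; recent scikit-learn raises an error otherwise
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # report percentages, matching the earlier spot-check output
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)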
    print(msg)

As expected, both the random forest classifier and the extra trees classifier do very well on the data in comparison to CART and SVM. Let's try the standardized data to look for an improvement in performance, using pipelines again.

In [None]:
# Standardized ensembles
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()),('AB',
AdaBoostClassifier())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()),('GBM',
GradientBoostingClassifier())])))
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF',
RandomForestClassifier())])))
ensembles.append(('ScaledET', Pipeline([('Scaler', StandardScaler()),('ET',
ExtraTreesClassifier())])))
results = []
names = []
for name, model in ensembles:
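    # shuffle=True is needed for random_state to apply; recent scikit-learn raises an error otherwise
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # report percentages, matching the earlier spot-check output
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)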
    print(msg)

With the standardized data, the extra trees classifier performs marginally better than the random forest classifier. Let's now tune our ensembles. The number of trees in the ensemble (n_estimators) defaults to 100 in recent scikit-learn releases and is a good candidate parameter to tune for these classifiers. Often, the larger the number of trees, the better the performance, but the longer the training time. Below we define a parameter grid of n_estimators values from 50 to 400 in increments of 50. Each setting is evaluated using 10-fold cross-validation.

In [None]:
# Tune scaled Random Forest Classifier
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
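# a plain list avoids the undefined name 'numpy' (the module is imported as np)
param_grid = dict(n_estimators=[50, 100, 150, 200, 250, 300, 350, 400])
model = RandomForestClassifier(random_state=seed)
# shuffle=True is needed for random_state to apply; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)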
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

In [None]:
# Tune scaled Extra Trees Classifier
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
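# a plain list avoids the undefined name 'numpy' (the module is imported as np)
param_grid = dict(n_estimators=[50, 100, 150, 200, 250, 300, 350, 400])
model = ExtraTreesClassifier(random_state=seed)
# shuffle=True is needed for random_state to apply; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)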
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
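
The two tuning cells scale `X_train` once, before the grid search splits it into folds. A possible variant, sketched below under the assumption that per-fold scaling is wanted (as in the earlier pipelines), is to search over a `Pipeline` instead. The synthetic data, the shortened grid and the 5 folds are choices made only to keep the example quick; a parameter inside a pipeline step is addressed as step name, double underscore, parameter name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the training split (illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=15, n_informative=6,
                                   n_classes=3, random_state=7)

pipe = Pipeline([('Scaler', StandardScaler()),
                 ('ET', ExtraTreesClassifier(random_state=7))])
# 'ET__n_estimators' targets n_estimators of the step named 'ET'
param_grid = {'ET__n_estimators': [50, 100]}
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring='accuracy', cv=kfold)
grid.fit(X_toy, y_toy)   # the scaler is refitted on the training part of every fold
print(grid.best_params_, grid.best_score_)
```

For tree ensembles the difference is usually small, since trees are insensitive to monotonic rescaling of individual features, but the pipeline form keeps the evaluation consistent with the leakage argument made earlier.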

We observe that the Extra Trees Classifier performs marginally better than the Random Forest Classifier. We will use it as the final model to predict on the test data.

In [None]:
#Finalize the model
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
model = ExtraTreesClassifier(random_state=seed, n_estimators=150)
model.fit(rescaledX, Y_train)
# estimate accuracy on validation dataset
rescaledValidationX = scaler.transform(X_validation)
predictions = model.predict(rescaledValidationX)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
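
The figures in the classification report can be derived from the confusion matrix. The sketch below does this for a made-up 3-class matrix, following scikit-learn's convention that rows are true classes and columns are predicted classes. The counts are hypothetical and are not the notebook's results.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[50,  2,  0],
               [ 3, 45,  2],
               [ 0,  1, 47]])

true_positives = np.diag(cm).astype(float)
precision = true_positives / cm.sum(axis=0)   # column sums: everything predicted as the class
recall = true_positives / cm.sum(axis=1)      # row sums: everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
print("accuracy: ", accuracy)
```

Checking per-class recall for the malignant class in particular is worthwhile in a diagnostic setting, because overall accuracy can hide missed malignant cases.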

On the held-out test data, the extra trees classifier again performs on par with its cross-validation results, giving around 99% precision, recall and F1-score, which makes it an excellent candidate for classifying the dermal images under study. The performance of the algorithm can be explored further by adding more features learned by a deep learning model.