# Spot Classification using Machine Learning
In this notebook, I will use the 'spot_image_data' dataset that I created in my previous notebook to train a model that classifies spots as good or bad. The original dataset was created by fitting each spot with a 2D Gaussian and then applying hard-set criteria for classification. The reason for using machine learning is that 2D Gaussian fitting is computationally very expensive for large datasets, so a good model can reduce the processing time significantly. A model can also help maintain consistency between different experimental datasets.

## Import

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pickle
import sympy

## Load dataset

In [3]:
# The dataset was pickled. It is a Bunch object.
with open('spot_image_data.pkl','rb') as fid:
    image = pickle.load(fid)

I will save the data in X and the target values in y. Data are the flattened images, and target values are the classification based on the 2D Gaussian fits.

In [4]:
X = image.data.astype(np.int32) # image data
y = image.target.values # classification
print ("Number of images = {}".format(X.shape[0]))
print ("Number of features per image = {}".format(X.shape[1]))

Number of images = 3215
Number of features per image = 81
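
Since each image is stored as a flat vector of 81 features, it can help to view one as a 2D array again. The sketch below assumes the 81 features come from a square 9 x 9 pixel window flattened in row-major order (the notebook does not state this explicitly), and it uses a random, hypothetical `flat_spot` vector in place of a real row of `X`.

```python
import math

import numpy as np

# Hypothetical stand-in for one row of X: 81 integer pixel values.
rng = np.random.default_rng(0)
flat_spot = rng.integers(0, 255, size=81).astype(np.int32)

# Assumption: the features form a square window (81 = 9 x 9),
# flattened in row-major (C) order.
side = math.isqrt(flat_spot.size)
if side * side != flat_spot.size:
    raise ValueError("feature count is not a perfect square")

spot_image = flat_spot.reshape(side, side)
print(spot_image.shape)
```

With real data, `spot_image` could be passed to `plt.imshow` to check visually that the reshaped window looks like a spot.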


## Creating Training and Test Datasets

In [22]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# Use 10% of the data for testing and 90% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
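
The split above is random, so the exact train and test sets change from run to run. As a possible variation (not what was used for the results in this notebook), the sketch below fixes `random_state` for reproducibility and uses `stratify` so that both sets keep roughly the same ratio of good to bad spots. The arrays `X_demo` and `y_demo` are synthetic placeholders, not the spot data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 200 "images" with 81 features each
# and imbalanced boolean labels (about 75% True).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 255, size=(200, 81)).astype(np.int32)
y_demo = rng.random(200) < 0.75

# stratify keeps the class ratio similar in both sets;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```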

## Logistic Regression
First, let's check the 3-fold cross-validation score on the full dataset.

In [21]:
from sklearn.linear_model import LogisticRegression
logit1 = LogisticRegression(C=0.1,penalty='l2')
cross_val_score(logit1,X,y,cv=3,n_jobs=-1)

array([ 0.93470149,  0.91044776,  0.92343604])

The scores are good: all three folds are above 0.91.

## Multi-Layer Perceptron Classifier

In [47]:
from sklearn.neural_network import MLPClassifier
# (24,) is a one-element tuple: a single hidden layer with 24 units.
nnclf_1 = MLPClassifier(hidden_layer_sizes=(24,), activation='relu', solver='adam', alpha=0.1)
cross_val_score(nnclf_1,X,y,cv=3,n_jobs=-1)

array([ 0.85261194,  0.83955224,  0.86087768])

The scores with one hidden layer are not as good as with Logistic Regression.

In [48]:
nnclf_2 = MLPClassifier(hidden_layer_sizes=(24,12),activation='relu',solver='adam',alpha=0.1)
cross_val_score(nnclf_2,X,y,cv=3,n_jobs=-1)

array([ 0.84141791,  0.82742537,  0.80485528])

There is no improvement in the scores after adding a second hidden layer. I have checked other values of alpha, but the scores don't get much better.
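
One thing not tried above is scaling the inputs. Multi-layer perceptrons are often sensitive to feature scale, and the images here are raw integer pixel values. Whether scaling would help on this dataset is untested. The sketch below only shows how a `StandardScaler` could be combined with the classifier in a pipeline so that the scaling is fit inside each cross-validation fold. The data come from `make_classification` and are synthetic; `max_iter` and `random_state` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data with the same number of features (81).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=81, n_informative=10, random_state=0
)

# The scaler is refit on the training part of every fold,
# so no information leaks from the validation part.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(24,), activation='relu',
                  solver='adam', alpha=0.1, max_iter=500, random_state=0),
)

scores = cross_val_score(scaled_mlp, X_demo, y_demo, cv=3)
print(scores)
```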

## Support Vector Machine (SVM)

#### Linear Kernel

In [52]:
from sklearn import svm
svc_linear = svm.SVC(C=0.1, kernel='linear')
cross_val_score(svc_linear,X,y,cv=3,n_jobs=-1)

array([ 0.93470149,  0.93097015,  0.9178338 ])

#### Polynomial kernel

In [54]:
svc_poly = svm.SVC(C=0.1, kernel='poly', degree=3)
cross_val_score(svc_poly,X,y,cv=3,n_jobs=-1)

array([ 0.9636194 ,  0.95335821,  0.95331466])

#### RBF kernel

In [55]:
# degree is only used by the 'poly' kernel, so it is omitted here.
svc_rbf = svm.SVC(C=0.1, kernel='rbf')
cross_val_score(svc_rbf,X,y,cv=3,n_jobs=-1)

array([ 0.77798507,  0.77798507,  0.77871148])
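
The RBF scores are almost identical across the three folds. One possible reading, which is not verified in this notebook, is that the classifier predicts the majority class for nearly every spot. A quick way to check this kind of suspicion is to compare against a baseline that always predicts the most frequent class. The sketch below does this on synthetic labels (`y_demo`, exactly 75% True), not on the spot data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data: features are irrelevant for the baseline.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 81))
y_demo = np.arange(300) % 4 != 0  # 225 True, 75 False

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=3)

print(baseline_scores)
print("Fraction of majority class:", y_demo.mean())
```

If a real classifier's fold scores sit right at the majority-class fraction, it has probably learned very little beyond the class ratio.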

SVM with a polynomial kernel of degree 3 gives the best cross-validation score. I will perform a grid search over C and gamma to find the best parameters.

WARNING: The linear kernel takes a very long time to run.

In [56]:
from sklearn.model_selection import GridSearchCV
Cs = np.logspace(-7,-1,15)
gammas = np.logspace(-7,1,15)
svc_poly3 =svm.SVC(kernel='poly',degree=3)
clf = GridSearchCV(estimator=svc_poly3,param_grid=dict(C=Cs,gamma=gammas),n_jobs=-1)

The grid search is performed on the training dataset only, so the test dataset stays unseen until the final evaluation.

In [57]:
clf.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': array([  1.00000e-07,   2.68270e-07,   7.19686e-07,   1.93070e-06,
         5.17947e-06,   1.38950e-05,   3.72759e-05,   1.00000e-04,
         2.68270e-04,   7.19686e-04,   1.93070e-03,   5.17947e-03,
         1.38950e-02,   3.72759e-02,   1.00000e-01]), 'gamma': array([  1.00000e-0...,   1.38950e-02,   5.17947e-02,   1.93070e-01,
         7.19686e-01,   2.68270e+00,   1.00000e+01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [60]:
print ("The best estimator :")
print (clf.best_estimator_)

The best estimator :
SVC(C=7.1968567300115142e-07, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1.9306977288832496e-05,
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)


In [63]:
print ("The best score : {}".format(clf.best_score_))

The best score : 0.9602488765986865


In [64]:
print ("The best parameters : {}".format(clf.best_params_))

The best parameters : {'C': 7.1968567300115142e-07, 'gamma': 1.9306977288832496e-05}
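
Besides the single best combination, it can be useful to look at the whole grid of scores, for example to see whether the best value lies on the edge of the searched range. The sketch below is a minimal example on synthetic data with a much smaller, hypothetical grid (`Cs`, `gammas`) than the one used above. It relies on `GridSearchCV` iterating over parameters in sorted key order, so `gamma` varies fastest within each `C`.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data and a small illustrative grid.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
Cs = np.logspace(-3, 0, 4)
gammas = np.logspace(-3, -1, 3)

search = GridSearchCV(
    estimator=svm.SVC(kernel='poly', degree=3),
    param_grid=dict(C=Cs, gamma=gammas),
    cv=3,
)
search.fit(X_demo, y_demo)

# Rows correspond to C, columns to gamma.
mean_scores = search.cv_results_['mean_test_score'].reshape(len(Cs), len(gammas))
print(np.round(mean_scores, 3))

# Flag a best C that sits on the boundary of the searched range.
best_C = search.best_params_['C']
on_edge = bool(np.isclose(best_C, Cs[0]) or np.isclose(best_C, Cs[-1]))
print("Best C on edge of grid:", on_edge)
```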


In [65]:
print ("Score on test dataset : {}".format(clf.score(X_test,y_test)))

Score on test dataset : 0.9596273291925466


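The accuracy on the held-out test dataset (0.960) is very close to the best cross-validation score from the grid search (0.960), which suggests that the model is not overfitting to the training data. Let's look at the predictions and metrics.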

### Predictions and Metrics

In [67]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
print (metrics.classification_report(y_true=y_test,y_pred=y_pred))

             precision    recall  f1-score   support

      False       0.89      0.95      0.92        76
       True       0.98      0.96      0.97       246

avg / total       0.96      0.96      0.96       322



The f1-scores are 0.92 for False and 0.97 for True, so the model does well on both classes even though False is the smaller one.

In [102]:
cm1 = metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
print (cm1)


[[ 72   4]
 [  9 237]]


Out of 322 test spots, there are only 4 false positives (bad spots predicted as good) and 9 false negatives (good spots predicted as bad).
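
The numbers in the classification report can be recomputed directly from this confusion matrix. The short check below uses scikit-learn's convention that rows are the true labels and columns are the predicted labels, with the labels in sorted order (False first, then True). The matrix values are copied from the output above.

```python
import numpy as np

# Confusion matrix from the output above:
# rows = true label (False, True), columns = predicted label (False, True).
cm = np.array([[72, 4],
               [9, 237]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_true = tp / (tp + fp)   # of spots predicted good, fraction truly good
recall_true = tp / (tp + fn)      # of truly good spots, fraction found
precision_false = tn / (tn + fn)  # of spots predicted bad, fraction truly bad
recall_false = tn / (tn + fp)     # of truly bad spots, fraction found

print("accuracy        = {:.4f}".format(accuracy))
print("True  precision = {:.2f}, recall = {:.2f}".format(precision_true, recall_true))
print("False precision = {:.2f}, recall = {:.2f}".format(precision_false, recall_false))
```

These values agree with the test score and the classification report shown earlier.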

In this notebook, I have tested different classifiers. An SVM with a polynomial kernel performs the best of those tested. I used a grid search to find the best parameters for the model, which reaches about 96% accuracy on the test dataset.

In the next notebook, I will try to train a model using regression to estimate the location of the centroid.