Hi, this is my first kernel on Kaggle. I didn't look at any of the state-of-the-art methods used for this dataset (and still haven't); I just went in with the purpose of implementing the algorithms I have studied while keeping an eye on the tradeoff between bias and variance. People new to scikit-learn and machine learning are the ones most likely to benefit from this kernel.

The dataset contains 30 attributes per patient and a class label telling whether the tumour is malignant or benign. There are a total of 357 benign objects and 212 malignant objects. We start by loading the packages, importing the dataset, mapping the 'B' and 'M' class labels to integers, and separating the attributes from the class labels.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn import linear_model #for logistic regression
from sklearn.neural_network import MLPClassifier #for neural network
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score, cross_val_predict, validation_curve
#GridSearchCV is used to optimize parameters of the models used
#the other functions are used for cross validation and for plotting validation curves
from sklearn.ensemble import VotingClassifier #for creating ensembles of classifiers

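df = pd.read_csv('../input/data.csv', skiprows=[0], header=None) #skip the header row
df = df.replace({'B':0, 'M':1}) #benign -> 0, malignant -> 1
x = df.iloc[:,2:] #attributes (columns 2 onwards)
y = df.iloc[:,1] #class label (column 1)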
print (x.shape, y.shape)

Now, we normalize the data. Normalization is a bit of a "controversial" subject (for lack of a better term). I tried to research this a bit by looking at questions on [quora][1] and [stackexchange][2]. If you're using regularization, it makes sense to normalize your input, because the penalty treats all coefficients alike and so depends on the scale of each feature. On the other hand, you may not want to normalize if you are trying to interpret the coefficients and relate them to the original features. Since this kernel is aimed at being more of an introduction to solving problems using scikit-learn and pandas, I decided not to focus on exploring the coefficients. In hindsight, I could have thought more about whether to normalize or not.
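
To make the normalization step concrete, here is a minimal sketch of the same column-wise z-score on a tiny made-up array (the values are illustrative and are not taken from the dataset). It uses `ddof=1` so that the standard deviation matches the default of pandas' `DataFrame.std()`.

```python
import numpy as np

# Toy feature matrix (made-up values): 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Column-wise z-score; ddof=1 matches the default of pandas' DataFrame.std().
X_mean = X.mean(axis=0)
X_std = X.std(axis=0, ddof=1)
X_norm = (X - X_mean) / X_std

print(X_norm.mean(axis=0))         # approximately 0 for each column
print(X_norm.std(axis=0, ddof=1))  # approximately 1 for each column
```

After scaling, both columns are on the same footing, so a regularization penalty no longer depends on the units each feature was measured in.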


  [1]: https://www.quora.com/Should-input-data-to-logistic-regression-be-normalized
  [2]: https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning

In [None]:
x_mean = x.mean()
x_std = x.std()
x_norm = (x - x_mean)/x_std
print (x_norm.shape)

We start by using a simple logistic regression model, and use K fold cross validation to get the accuracy on the dataset. K fold cross validation is a method used to detect overfitting (a situation where your model fits the training data *too* well but does not generalize well enough to data outside the training set), because every score is computed on data the model did not see during training.

Here, our training data is divided into 5 parts; a model is fitted on 4 of the parts and tested on the 5th. This is done 5 times, using a different part as the test set each time, and the average of the 5 test scores is used as the final accuracy.
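
The splitting described above can be sketched with plain NumPy. This is only an illustration of how 569 samples divide into 5 consecutive folds; it is not the code scikit-learn runs internally, although as far as I know an unshuffled `KFold` produces folds of the same sizes.

```python
import numpy as np

n_samples = 569  # 357 benign + 212 malignant, as stated above
n_splits = 5

# Split the sample indices into 5 consecutive, nearly equal parts.
indices = np.arange(n_samples)
folds = np.array_split(indices, n_splits)
fold_sizes = [len(f) for f in folds]
print(fold_sizes)

# Each fold is the test set once; the other 4 folds form the training set.
splits = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}")
```

Every sample ends up in a test set exactly once, which is why the averaged score is a reasonable estimate of accuracy on unseen data.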

Find out more about cross validation [here][1]; scikit-learn specific information can be found in the scikit-learn documentation.


  [1]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)

In [None]:
logreg = linear_model.LogisticRegression()
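kfold = KFold(n_splits=5) #random_state only has an effect with shuffle=True, and recent scikit-learn versions raise an error if it is set without shuffling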
cv_results = cross_val_score(logreg, x_norm, y, cv=kfold)
print (cv_results.mean()*100, "%")

So, we can see that using logistic regression gave us an accuracy of 97.7% on the dataset.

Next, we optimize the parameters of our model. For logistic regression, it makes sense to look at the parameter C, which is the inverse of the regularization strength. The lower the value of C, the more heavily we penalize the coefficients of our logistic regression.
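
To see why a smaller C means a stronger penalty, here is a rough sketch of the L2-penalized logistic regression objective in the form I understand the scikit-learn user guide to describe (half the squared norm of the weights plus C times the log-loss, with labels coded as -1/+1). The data, the weights and the helper name `penalty_share` are all made up for illustration.

```python
import numpy as np

def l2_logreg_objective(w, b, X, y_pm1, C):
    """Return (penalty, C * log-loss) for labels coded as -1/+1."""
    margins = y_pm1 * (X @ w + b)
    data_loss = np.sum(np.log1p(np.exp(-margins)))
    penalty = 0.5 * np.dot(w, w)
    return penalty, C * data_loss

def penalty_share(C, w, b, X, y_pm1):
    """Fraction of the total objective that comes from the penalty term."""
    penalty, weighted_loss = l2_logreg_objective(w, b, X, y_pm1, C)
    return penalty / (penalty + weighted_loss)

# Made-up toy data and a fixed, arbitrary weight vector.
X = np.array([[0.5, -1.0], [1.5, 0.3], [-1.0, 0.8], [-0.7, -1.2]])
y_pm1 = np.array([1, 1, -1, -1])
w = np.array([1.0, -0.5])
b = 0.0

for C in (0.01, 1.0, 100.0):
    print(C, penalty_share(C, w, b, X, y_pm1))
```

For the same weights, the penalty makes up a larger share of the objective when C is small, so the optimizer is pushed harder towards small coefficients.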

In [None]:
logreg = linear_model.LogisticRegression()
param_grid = {"C":[0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
grid = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=kfold)
grid.fit(x_norm,y)
print (grid.best_estimator_)
print (grid.best_score_*100, "%")

As we can see, our accuracy has increased slightly to 97.89%. 

Let's now look at the validation curve and confirm that we're not overfitting. For this, we need the individual training scores and test scores for each of our 5 "folds" (here, by test score I mean the average of scores on the 5 validation sets), and we plot their averages while varying C. For those values of C which give us both a low training score and a low test score, we have high bias, and our model "underfits" the dataset.

At some point, the test score starts decreasing with increase in value of C, and this is said to be "overfitting" of the dataset (because our model fits the training data too well, but fails to generalize on the test set). The middle ground, where the test score is highest, is the value of C we are looking for.
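
As a small sketch of how the curve is read, the snippet below picks the C with the highest mean validation score from arrays shaped like the output of `validation_curve` (one row per C value, one column per fold). The scores are hypothetical numbers invented for illustration; they are not results from this dataset.

```python
import numpy as np

C_values = [0.001, 0.01, 0.1, 1.0, 10.0]

# Hypothetical scores, shape (n_C_values, n_folds). Not real results.
train_scores = np.array([[0.91, 0.90, 0.91],
                         [0.96, 0.96, 0.95],
                         [0.98, 0.99, 0.98],
                         [0.99, 0.99, 0.99],
                         [1.00, 1.00, 1.00]])
valid_scores = np.array([[0.90, 0.91, 0.89],
                         [0.95, 0.96, 0.94],
                         [0.98, 0.97, 0.98],
                         [0.97, 0.96, 0.97],
                         [0.95, 0.96, 0.95]])

train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

# The C we want is where the mean validation score peaks.
best_idx = int(np.argmax(valid_mean))
best_C = C_values[best_idx]

# A growing gap between training and validation score signals overfitting.
gap = train_mean - valid_mean
print(best_C, gap)
```

In this made-up example both scores are low at the smallest C (underfitting), while at the largest C the training score keeps rising as the validation score drops (overfitting).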

In [None]:
#plot validation curve
num_splits = 5
num_C_values = 10 # we iterate over 10 possible C values
logreg = linear_model.LogisticRegression()
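kfold = KFold(n_splits=5) #no shuffling, so random_state is not needed
C_values = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
#param_name and param_range are passed as keywords, which recent scikit-learn versions require
train_scores, valid_scores = validation_curve(logreg, x_norm, y, param_name="C", param_range=C_values, cv=kfold)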
train_scores = pd.DataFrame(data=train_scores, index=np.arange(0, num_C_values), columns=np.arange(0,num_splits)) 
valid_scores = pd.DataFrame(data=valid_scores, index=np.arange(0, num_C_values), columns=np.arange(0,num_splits)) 
plt.semilogx(C_values, train_scores.mean(axis=1), label='training score')
plt.semilogx(C_values, valid_scores.mean(axis=1), label='test score')
plt.xlabel('C')
plt.legend()

As we can see, the optimum point is at C=0.1, where we get an accuracy of 97.89%.

Now we move on to the next classifier: a neural network. We choose the 'lbfgs' solver, which according to the scikit-learn documentation can converge faster and perform better on small datasets. For the architecture of the neural network, I decided to use 1 hidden layer, which is a common starting point for problems of this size.

For the number of hidden units, I initially tried to decide this experimentally: I tried 5, 10, 15 and the scikit-learn default (which is 100). 100 hidden units seemed to give the best accuracy. But then I researched a bit more on this, because I was confused about the roles that cross validation, regularization and the number of hidden units play in overfitting. According to [this][1] post on the statistics stackexchange website, cross validation and regularization will reduce the amount of overfitting in your model, but they do not guarantee that there will be no overfitting.

So I decided to use a "better" method to decide the number of hidden nodes. According to [this][2] post, the mean of the number of input and output neurons is a good approximation of the number of hidden neurons to use. So, let us go ahead with 15 hidden neurons. Even though this gives a lower cross validation score than 100 hidden units, on combining the logistic regression and neural network models, using 15 hidden neurons instead of 100 gives a better accuracy. Thus we can hypothesize, rather than claim (since I did not exhaustively vary the weights of the 2 classifiers, only did it manually), that the 15 hidden neuron network is a better and more "general" predictor than the 100 hidden neuron network.
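
The arithmetic behind the choice of 15 is short enough to write out. The sketch assumes a single output unit for the binary label, and rounding down is my own choice here.

```python
n_inputs = 30   # number of attributes in the dataset
n_outputs = 1   # assumption: one output unit for the binary class label

# Mean of input and output sizes, rounded down to a whole number of neurons.
n_hidden = (n_inputs + n_outputs) // 2
print(n_hidden)  # 15
```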

  [1]: https://stats.stackexchange.com/questions/193661/is-cross-validation-enough-to-prevent-overfitting
  [2]: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

In [None]:
clf = MLPClassifier(solver='lbfgs', random_state=1, activation='logistic', hidden_layer_sizes=(15,))
kfold = KFold(n_splits=5) #no shuffling, so random_state is not needed
cv_results = cross_val_score(clf, x_norm, y, cv=kfold)
print (cv_results.mean()*100, "%")

Next, we optimize the parameter "alpha" (the regularization parameter of the neural network) in a similar way to what we did with the regularization parameter in logistic regression.
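
The grid expression `10.0 ** -np.arange(-4, 7)` used in the next cell is a little hard to read at a glance, so here is a short sketch that spells out the values it produces.

```python
import numpy as np

exponents = -np.arange(-4, 7)  # 4, 3, 2, ..., -5, -6
alphas = 10.0 ** exponents     # 10^4 down to 10^-6, one value per power of ten

print(len(alphas))
print(alphas[0], alphas[-1])
```

The grid therefore covers 11 values of alpha, from very strong regularization (10^4) to almost none (10^-6).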

In [None]:
clf = MLPClassifier(solver='lbfgs', random_state=1, activation='logistic',  hidden_layer_sizes=(15,))
param_grid = {"alpha":10.0 ** -np.arange(-4, 7)}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=kfold)
grid.fit(x_norm,y)
print (grid.best_estimator_)
print (grid.best_score_*100, "%")

Thus, alpha = 1.0 gives an optimal accuracy of 97.7%.

Both models (logistic regression and the neural network) seem to give a good accuracy. Let's look at the misclassified examples of both models to figure out if we can combine them in some way.
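
One simple way to compare the two lists of misclassified indices is a set intersection. The index lists below are hypothetical placeholders, not the actual output of the cells that follow.

```python
# Hypothetical misclassified row indices for each model (not real output).
lr_misclassified = [3, 17, 42, 88, 101]
nn_misclassified = [17, 42, 64, 101, 130]

# Rows both models get wrong, and rows only one of them gets wrong.
both = sorted(set(lr_misclassified) & set(nn_misclassified))
only_lr = sorted(set(lr_misclassified) - set(nn_misclassified))
only_nn = sorted(set(nn_misclassified) - set(lr_misclassified))

print(both)
print(only_lr, only_nn)
```

Rows that only one model gets wrong are the ones an ensemble has a chance of fixing, while rows in `both` are likely to stay misclassified.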

In [None]:
logreg = linear_model.LogisticRegression(C=0.1)
kfold = KFold(n_splits=5) #no shuffling, so random_state is not needed
cv_results = cross_val_score(logreg, x_norm, y, cv=kfold)
predicted = cross_val_predict(logreg, x_norm, y, cv=kfold)
diff = predicted - y
misclass_indexes = diff[diff != 0].index.tolist()
print (misclass_indexes)

In [None]:
clf = MLPClassifier(solver='lbfgs', random_state=1, activation='logistic', alpha=1.0, hidden_layer_sizes=(15,))
kfold = KFold(n_splits=5) #no shuffling, so random_state is not needed
cv_results = cross_val_score(clf, x_norm, y, cv=kfold)
predicted = cross_val_predict(clf, x_norm, y, cv=kfold)
diff = predicted - y
misclass_indexes = diff[diff != 0].index.tolist()
print (misclass_indexes)

9 objects are misclassified by both classifiers, but we can improve the overall accuracy by combining the 2 classifiers and assigning weights to them. (If we had 3 classifiers, we would also have considered a majority voting ensemble.) I played around a bit with the weights manually, and assigning a weight of 2 to logistic regression and 1 to the neural network gave the best accuracy (although not by a huge margin). While it's tempting to relate this to the fact that logistic regression had a slightly better accuracy, and claim that this is a "logical" way to choose the weights, I am pretty sure that tuning the weights on the same cross validation scores is also some sort of "overfitting".
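
To show what weighted soft voting does, here is a minimal NumPy sketch: the class probabilities of the two models are averaged with weights 2 and 1, and the class with the highest averaged probability wins. The probabilities are made up for illustration, and this is only a sketch of the idea, not scikit-learn's implementation.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples; columns = [P(benign), P(malignant)].
proba_lr = np.array([[0.90, 0.10],
                     [0.30, 0.70],
                     [0.65, 0.35]])
proba_nn = np.array([[0.80, 0.20],
                     [0.60, 0.40],
                     [0.30, 0.70]])

# Weighted average of the probabilities: logistic regression counts twice.
weighted = np.average([proba_lr, proba_nn], axis=0, weights=[2, 1])
pred_weighted = weighted.argmax(axis=1)

# Plain (unweighted) average for comparison.
unweighted = np.average([proba_lr, proba_nn], axis=0)
pred_unweighted = unweighted.argmax(axis=1)

print(pred_weighted)
print(pred_unweighted)
```

In the last row the two models disagree, and the weights decide which one the ensemble follows.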

In [None]:
clf1 = linear_model.LogisticRegression(C=0.1)
clf2 = MLPClassifier(solver='lbfgs', alpha=1.0,hidden_layer_sizes=(15,), random_state=1, activation='logistic')
eclf = VotingClassifier(estimators=[('lr', clf1), ('nn', clf2)], voting='soft', weights=[2,1])
cv_results = cross_val_score(eclf, x_norm, y, cv=kfold)
print (cv_results.mean()*100, "%")

So, we observe that the combined classifier improves our overall accuracy to 98.24%.

That's it for this kernel. I learnt a lot while working on this dataset, and definitely made my share of mistakes. I would really appreciate any sort of feedback that you might have. For people new to the field, I hope this helps you understand the general flow of working through a dataset to solve a machine learning problem.