#Validation
Validation checks that your machine learning model is not overfitting and gives you an estimate of how the model will perform on data it has not seen.

##Train and Test Split
When given a dataset, it is good to split the data into two categories:
- Training data
- Testing data

The training data is used to fit your machine learning model, whereas the testing data is used to check how well the model performs on data it was not fitted on. It is good to hold out at least 10% of your data for testing (a 90/10 split). Below is an example of using train_test_split in sklearn:

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 10% of the samples for testing
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.1)
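
To make the purpose of the split concrete, here is a minimal sketch that fits a model on the training data and scores it on both parts. The choice of a linear `SVC` and the `random_state=0` seed are assumptions made for illustration only; any classifier would do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

# Hold out 10% of the samples; random_state fixes the shuffle (illustrative choice)
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0)

# Fit on the training data only
model = SVC(kernel="linear")
model.fit(features_train, labels_train)

# score() returns the accuracy on the data it is given
train_accuracy = model.score(features_train, labels_train)
test_accuracy = model.score(features_test, labels_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

A training accuracy that is much higher than the testing accuracy is the usual sign of overfitting.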

##K-fold Cross Validation

K-fold cross validation partitions your dataset into k partitions (folds). Each fold is used once as the testing data while the other k-1 folds are used as the training data. The model is fitted and evaluated k times, and the results of the k experiments are averaged.

In [3]:
from sklearn.model_selection import KFold

# 10 folds; kf.split(iris.data) yields the train/test indices of each fold
kf = KFold(n_splits=10)
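
The fit-k-times-and-average procedure can be sketched as a loop over the folds. The linear `SVC`, `shuffle=True`, and the `random_state=0` seed are assumptions for this sketch; shuffling is used because the iris samples are stored ordered by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = load_iris()

# Shuffle before splitting, since the iris rows are sorted by class
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(iris.data):
    # Fit a fresh model on the k-1 training folds
    model = SVC(kernel="linear")
    model.fit(iris.data[train_idx], iris.target[train_idx])
    # Evaluate on the single held-out fold
    fold_scores.append(model.score(iris.data[test_idx], iris.target[test_idx]))

mean_score = np.mean(fold_scores)
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", mean_score)
```

sklearn also provides `cross_val_score` in `sklearn.model_selection`, which wraps the same loop in one call.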

##GridSearch Cross Validation

GridSearch CV is a systematic way of performing parameter tuning on a machine learning algorithm. You supply multiple candidate values for each parameter; it fits the algorithm with every combination of those values, scores each combination with cross validation, and returns the one with the best performance.

In [5]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Candidate values; every kernel/C combination is tried
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)

clf.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
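
Besides the best estimator, the fitted search object also exposes the winning parameters and the score of every combination. The sketch below reuses the same grid; the name `search` and the explicit `cv=5` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
# cv=5 asks for 5-fold cross validation of each combination
search = GridSearchCV(SVC(), parameters, cv=5)
search.fit(iris.data, iris.target)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)

# One entry per combination: 2 kernels x 2 values of C = 4 candidates
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(params, mean)
```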

#Evaluation Metrics

##Accuracy Score
The accuracy metric used in sklearn is defined as the number of items labeled correctly divided by the total number of items. One of the shortcomings of the accuracy score is that it can give deceptively good results when the classes are skewed, that is, when one class has only a small number of items. A model that almost always predicts the large class will score a high accuracy while still labeling most items of the small class incorrectly.
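
A small sketch of the skewed-class problem, using made-up labels (95 items in class 0, 5 in class 1) and a hypothetical "model" that always predicts class 0:

```python
from sklearn.metrics import accuracy_score

# Made-up skewed labels: 95 items of class 0 and 5 items of class 1
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the large class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# Count how many items of the small class were labeled correctly
small_class_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print("accuracy:", accuracy)  # 0.95
print("small-class items labeled correctly:", small_class_found, "of 5")  # 0 of 5
```

The accuracy looks good even though not a single item of the small class was found.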

##Confusion Matrix
A confusion matrix tabulates, for each actual class, how many items were predicted as each class. It shows how many true positives, false positives and false negatives your predictions contain. In a confusion matrix, we hope the diagonal has the highest values, because those values represent the number of correct predictions.

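
As an illustration, here is a confusion matrix for a handful of made-up labels. In sklearn, rows correspond to the actual classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for three classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)
# [[1 1 0]
#  [0 2 1]
#  [1 0 3]]

# The diagonal holds the correct predictions
correct = matrix.trace()
print("correct predictions:", correct, "of", len(y_true))  # 6 of 9
```
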
##Precision and Recall

**Recall** is defined as the probability that your algorithm correctly identifies an item that actually belongs to a class.  **Precision** is defined as the probability that an item your algorithm predicted as belonging to a class actually belongs to it.

###True Positive
*True Positives* are values in my confusion matrix where I predicted the value to be true and the actual value is true.
###False Positive
*False Positives* are places where I predicted something to be true, but the actual value is false.
###False Negative
*False Negatives* are places where I predicted the value to be false, but the actual value was true.

Precision can now be defined as:
>$\frac{truePositives}{truePositives + falsePositives}$

and Recall can now be defined as:
>$\frac{truePositives}{truePositives + falseNegatives}$
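
The two formulas can be checked by hand on a small binary example. The labels below are made up for illustration; the counts are computed directly and then compared with sklearn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up binary labels: 1 is the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted true, actually true
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted true, actually false
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted false, actually true

precision = tp / (tp + fp)  # 3 / (3 + 2) = 0.6
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75

print("precision:", precision, "sklearn:", precision_score(y_true, y_pred))
print("recall:", recall, "sklearn:", recall_score(y_true, y_pred))
```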

In [8]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# clf was fitted on the full dataset above, so these scores are optimistic
pred = clf.predict(features_test)

# iris has three classes, so an averaging mode must be given;
# 'weighted' averages the per-class scores weighted by class size
print("recall score: ", recall_score(labels_test, pred, average='weighted'))

print("precision score: ", precision_score(labels_test, pred, average='weighted'))

# rows are actual classes, columns are predicted classes
confusion_matrix(labels_test, pred)

recall score:  1.0
precision score:  1.0