# New approach to evaluating models  
#### Joshua Poirier, [NEOS](http://www.neosgeo.com)  
2016 SEG Machine Learning Contest  

## 1 Introduction  

The purpose of this notebook is to establish a new approach to evaluating models for this contest.  I propose a method that borrows from the **K-Folds** and **Leave-one-out** methods: we build the model several times, each time leaving out one well as the test set.  This method is designed to counter a weakness of the contest setup, in which prediction accuracy on the predefined blind well (**Newby**) effectively becomes the loss function, which encourages overfitting to that well.  
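
As a rough illustration of the splitting idea described above, the sketch below holds out one well at a time on a small toy data frame. The well names `A`, `B`, `C` and the `GR` values are made up for illustration and are not taken from the contest data.

```r
# Toy data: three hypothetical wells (names and values are made up for illustration)
toy <- data.frame(
    Well.Name = rep(c("A", "B", "C"), times = c(4, 3, 5)),
    GR        = c(77, 78, 79, 86, 60, 61, 62, 90, 91, 92, 93, 94),
    stringsAsFactors = FALSE
)

# Hold out one well at a time; every other well forms the training set
fold_sizes <- do.call(rbind, lapply(unique(toy$Well.Name), function(well) {
    is_test  <- toy$Well.Name == well
    training <- toy[!is_test, ]   # logical negation selects all other wells
    testing  <- toy[is_test, ]    # the held-out well
    data.frame(held_out = well, n_train = nrow(training), n_test = nrow(testing))
}))

print(fold_sizes)
```

Each row of `fold_sizes` corresponds to one fold, and every observation appears in exactly one test set.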

Time to load supporting libraries and the data!

In [28]:
# machine learning packages
library(e1071)
library(caret)

"package 'caret' was built under R version 3.2.5"
Loading required package: lattice
Loading required package: ggplot2


In [29]:
# load data
fname <- "../facies_vectors.csv"
data <- read.csv(fname, colClasses=c(rep("factor",3), rep("numeric",6), "factor", "numeric"))

# convert NM_M channel into a binary channel "isMarine"
data$NM_M <- data$NM_M == "2"
names(data)[10] <- "isMarine"

# make the Facies channel more descriptive
levels(data$Facies) <- c("SS", "CSiS", "FSiS", "SiSh", "MS", "WS", "D", "PS", "BS")

# remove any incomplete records (we know from jpoirier001.ipynb PE channel is missing some values)
data <- data[complete.cases(data),]

# display first six rows of data set
head(data)

| Facies | Formation | Well.Name | Depth | GR | ILD_log10 | DeltaPHI | PHIND | PE | isMarine | RELPOS |
|---|---|---|---|---|---|---|---|---|---|---|
| FSiS | A1 SH | SHRIMPLIN | 2793.0 | 77.45 | 0.664 | 9.9 | 11.915 | 4.6 | False | 1.0 |
| FSiS | A1 SH | SHRIMPLIN | 2793.5 | 78.26 | 0.661 | 14.2 | 12.565 | 4.1 | False | 0.979 |
| FSiS | A1 SH | SHRIMPLIN | 2794.0 | 79.05 | 0.658 | 14.8 | 13.05 | 3.6 | False | 0.957 |
| FSiS | A1 SH | SHRIMPLIN | 2794.5 | 86.1 | 0.655 | 13.9 | 13.115 | 3.5 | False | 0.936 |
| FSiS | A1 SH | SHRIMPLIN | 2795.0 | 74.58 | 0.647 | 13.5 | 13.3 | 3.4 | False | 0.915 |
| FSiS | A1 SH | SHRIMPLIN | 2795.5 | 73.97 | 0.636 | 14.0 | 13.385 | 3.6 | False | 0.894 |


## 2 Building a support vector machine model  

Now let's define a function to build a classification model using the support vector machine algorithm.  The function will take in training data and testing data.  We will use the same method (**Well-folds**) to perform cross-validation model tuning for each iteration.  The function will return a data frame of metrics evaluating the cross-validated model's performance.  
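
To make the "data frame of metrics" idea concrete, here is a minimal sketch of collecting one row of metrics per held-out well. It assumes a stand-in majority-class predictor instead of the SVM so that it runs without any packages; the helper names `predict_majority` and `score_fold`, and the toy wells and labels, are hypothetical.

```r
# Toy data: hypothetical wells and facies labels, made up for illustration
toy <- data.frame(
    Well.Name = rep(c("A", "B", "C"), times = c(5, 4, 3)),
    Facies    = c("SS", "SS", "SS", "SS", "MS",
                  "SS", "MS", "MS", "MS",
                  "SS", "SS", "SS"),
    stringsAsFactors = FALSE
)

# Stand-in classifier: always predicts the most common training label.
# This is NOT the SVM; it only keeps the sketch free of package dependencies.
predict_majority <- function(training, testing) {
    majority <- names(which.max(table(training$Facies)))
    rep(majority, nrow(testing))
}

# Score one held-out well and return a one-row data frame of metrics
score_fold <- function(data, well) {
    is_test   <- data$Well.Name == well
    training  <- data[!is_test, ]
    testing   <- data[is_test, ]
    predicted <- predict_majority(training, testing)
    data.frame(held_out = well,
               n_test   = nrow(testing),
               accuracy = mean(predicted == testing$Facies))
}

# One row per fold, stacked into a single data frame
metrics <- do.call(rbind, lapply(unique(toy$Well.Name),
                                 function(w) score_fold(toy, w)))
print(metrics)
```

Swapping `predict_majority` for a fitted model's predictions would keep the same per-well bookkeeping.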

In [30]:
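# tune an SVM (radial kernel) over a grid of cost and gamma values
# note: `cv` is not used yet; tune() applies its default 10-fold cross-validation to `training`
cv_model <- function(training, cv) {
    tune.out <- tune(svm, Facies ~ ., data=training, kernel="radial",
                    ranges=list(cost=c(5, 10, 15, 20),
                               gamma=c(.1, .5, 1, 5)))
    print(summary(tune.out))
}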

build_model <- function(training, testing) {
    set.seed(3124)
    
    training_wells <- unique(training$Well.Name)
    
    # loop through each well - current iteration well is the tuning/cross-validation set
    for (well in training_wells) {
        trainIndex <- training$Well.Name != well
        training_set <- training[trainIndex,]
        cv_set <- training[!trainIndex,]   # negate the logical index; unary minus would only drop row 1
        
        cv_model(training_set, cv_set)
    }
}

well_folds <- function(data) {
    wells <- unique(data$Well.Name)
    
    # "Recruit F9" is a set of specially selected BS-Bafflestone observations which should always be in the training set
    wells <- wells[wells != "Recruit F9"]
    
    # loop through each well - current iteration well is the testing set
    for (well in wells) {
        trainIndex <- data$Well.Name != well
        training <- data[trainIndex,]
        testing <- data[!trainIndex,]   # negate the logical index; unary minus would only drop row 1
        
        # build model
        build_model(training, testing)
    }
}

well_folds(data)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
   20   0.1

- best performance: 0.1777653 

- Detailed performance results:
   cost gamma     error dispersion
1     5   0.1 0.2024145 0.03083812
2    10   0.1 0.1903027 0.02930776
3    15   0.1 0.1846768 0.02786623
4    20   0.1 0.1777653 0.02531147
5     5   0.5 0.1855501 0.01680914
6    10   0.5 0.1803702 0.02418437
7    15   0.5 0.1842588 0.02265360
8    20   0.5 0.1898772 0.02803774
9     5   1.0 0.1846917 0.02878923
10   10   1.0 0.1877220 0.03162660
11   15   1.0 0.1872891 0.03033712
12   20   1.0 0.1903139 0.03224029
13    5   5.0 0.2906516 0.03198535
14   10   5.0 0.2906535 0.03258735
15   15   5.0 0.2906535 0.03258735
16   20   5.0 0.2906535 0.03258735
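
The "best parameters" line in the summary above is the grid row with the lowest cross-validation error. As a small sketch, the snippet below rebuilds that table by hand (values copied from the output above) and finds the minimum; the variable names `grid` and `best` are only illustrative.

```r
# Grid and errors copied from the first tuning summary above
# (expand.grid varies cost fastest, matching the row order of the summary)
grid <- expand.grid(cost = c(5, 10, 15, 20), gamma = c(0.1, 0.5, 1, 5))
grid$error <- c(0.2024145, 0.1903027, 0.1846768, 0.1777653,
                0.1855501, 0.1803702, 0.1842588, 0.1898772,
                0.1846917, 0.1877220, 0.1872891, 0.1903139,
                0.2906516, 0.2906535, 0.2906535, 0.2906535)

# The best parameter pair is the row with the smallest error
best <- grid[which.min(grid$error), ]
print(best)
```

This picks cost 20 and gamma 0.1, in agreement with the summary.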


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
   15   0.1

- best performance: 0.1786957 

- Detailed performance results:
   cost gamma     error dispersion
1  