# Very naive deep learning on the vector of surrounding bases

---

### Data

Naive feature vectors. The source sequences of the validation/test data and the training data do not overlap. Training data points may overlap with other training data points, and validation/test data points may overlap with each other, but this within-split overlap does not cause unintentional label leakage.
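The non-overlap property can be checked mechanically. The following is a minimal sketch, assuming each data point corresponds to a half-open window described by a hypothetical `(chromosome, start, end)` tuple; the function names and the coordinates are illustrative and are not taken from the original data pipeline.

```python
def windows_overlap(a, b):
    """True if two half-open windows (chrom, start, end) share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def splits_overlap(train_windows, test_windows):
    """True if any training window overlaps any test window.

    Quadratic in the number of windows: fine as a sanity check on a
    sample, too slow for a full data set.
    """
    return any(windows_overlap(tr, te)
               for tr in train_windows
               for te in test_windows)


# Illustrative coordinates only (not taken from the real data).
train = [("chr1", 0, 1000), ("chr1", 500, 1500)]  # train may overlap train
test = [("chr1", 2000, 3000), ("chr2", 0, 1000)]

print(splits_overlap(train, test))
```

Overlap inside a split (as between the two `train` windows here) is allowed; only overlap across splits would leak information.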



### Notes


---

Instruct Theano to use the GPU. The flag has to be set before Theano is imported for the first time.

In [1]:
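import os
#must be set before theano (or keras) is imported
os.environ['THEANO_FLAGS']='device=gpu'

import sys
import time

#relative path: resolved before the working directory is changed below
sys.path.append('../my_modules')
from loading_utils import read_my_data

#create the working directory if needed, then move into it
workdir='/mnt/Data1/ribli/methylation_code/modelling'
if not os.path.isdir(workdir):
    os.makedirs(workdir)
os.chdir(workdir)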

Using gpu device 0: GeForce GTX 670 (CNMeM is disabled, CuDNN not available)
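`THEANO_FLAGS` is a comma-separated list of `key=value` pairs. A small sketch of composing it from several options follows; the helper `theano_flags` is hypothetical, and `floatX=float32` is an extra flag added here for illustration that the notebook itself does not set.

```python
import os


def theano_flags(**options):
    """Join keyword options into a THEANO_FLAGS string (keys sorted)."""
    return ",".join("%s=%s" % (key, value)
                    for key, value in sorted(options.items()))


# Set the variable before the first `import theano` in the process.
os.environ["THEANO_FLAGS"] = theano_flags(device="gpu", floatX="float32")
print(os.environ["THEANO_FLAGS"])
```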


### Load data

In [2]:
_,train_x,train_y = read_my_data(
    fname='../prepare_data/big_train_feat_vect.csv')
_,valid_x,valid_y = read_my_data(
    fname='../prepare_data/big_val_feat_vect.csv')
test_id,test_x,test_y = read_my_data(
    fname='../prepare_data/big_test_feat_vect.csv')

#make it image like: (samples, channels, rows, cols) = (N, 1, 1000, 1)
train_x,valid_x,test_x=[x.reshape((-1,1,1000,1)) for x in (train_x,valid_x,test_x)]

Loading data... 
Loading data... 
Loading data... 
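To see what the reshape does, here is a toy sketch. It assumes the loaded feature matrices are 2D arrays of shape `(n_samples, 1000)`, which the original does not state explicitly; the array below is a placeholder, not real data.

```python
import numpy as np

n_samples, seq_len = 4, 1000  # toy sample count; seq_len as in the notebook
flat = np.zeros((n_samples, seq_len), dtype="float32")

# One channel, seq_len rows, one column: a "tall, one-pixel-wide image".
image_like = flat.reshape((-1, 1, seq_len, 1))
print(image_like.shape)
```

With this layout a `Convolution2D` kernel of size `(window_size, 1)` slides only along the sequence axis, so the 2D layers effectively perform a 1D convolution.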


### Build Convnet

In [3]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D,MaxPooling2D

input_dim=train_x.shape[2]
activation='relu'
loss='binary_crossentropy'
optimizer='adadelta'
init='uniform'
pool_size=(8,1)
window_size=5
dense_n=64

model = Sequential()

#Convolution layer 1: 20 filters of size (window_size x 1)
model.add(Convolution2D(20,window_size,1, border_mode='valid',input_shape=(1,input_dim,1)))
model.add(Activation(activation))
model.add(MaxPooling2D(pool_size=pool_size))
#model.add(Dropout(0.25))

#Convolution layer 2: 50 filters of size (window_size x 1)
model.add(Convolution2D(50,window_size,1, border_mode='valid'))
model.add(Activation(activation))
model.add(MaxPooling2D(pool_size=pool_size))
#model.add(Dropout(0.25))

#Dense layer
model.add(Flatten())
model.add(Dense(dense_n,activation=activation))
#model.add(Dropout(0.5))

#final layer
model.add(Dense(1, activation='sigmoid'))

#compile model
model.compile(loss=loss,optimizer=optimizer,class_mode='binary')

Using Theano backend.
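The sequence length after each layer can be worked out by hand. The sketch below assumes 'valid' convolutions with stride 1 and non-overlapping pooling that drops any incomplete final window; the helper names are illustrative.

```python
def conv_valid(length, window):
    """Output length of a stride-1 'valid' convolution."""
    return length - window + 1


def max_pool(length, pool):
    """Output length of non-overlapping pooling (remainder dropped)."""
    return length // pool


length = 1000                       # input_dim
for n_filters in (20, 50):          # the two convolution blocks
    length = conv_valid(length, 5)  # window_size
    length = max_pool(length, 8)    # pool_size[0]
    print(n_filters, length)

flat_size = 50 * length             # number of inputs to the Dense layer
print(flat_size)
```

Under these assumptions the sequence axis shrinks 1000 → 996 → 124 → 120 → 15, so `Flatten` hands 750 values to the 64-unit dense layer.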


### Train and test the model

In [6]:
import numpy as np
from keras.callbacks import ModelCheckpoint,EarlyStopping

def fit_keras_model(model,train_x,train_y,test_x,test_y):
    start=time.time()
    
    #callbacks: save the weights with the lowest val_loss, and
    #stop early once val_loss stops improving
    best_model=ModelCheckpoint('best_model',save_best_only=True,verbose=1)
    early_stop=EarlyStopping(patience=3,verbose=1)
    
    #train it; 10% of the training data is held out for validation
    callb_hist=model.fit(train_x,train_y,nb_epoch=100,
                         show_accuracy=True,verbose=1,
                         validation_split=0.1,
                         callbacks=[best_model,early_stop])
    #predict class labels with the final (not necessarily best) weights
    train_pred=model.predict_classes(train_x).ravel()
    test_pred=model.predict_classes(test_x).ravel()

    #accuracy: fraction of correctly predicted labels
    print('train score: %s' % np.mean(train_pred==train_y))
    print('test score: %s' % np.mean(test_pred==test_y))

    print('It took: %s' % (time.time()-start))
    return train_pred,test_pred
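The patience rule used by `EarlyStopping` can be illustrated in plain Python. This is a simplified sketch, not the Keras implementation (the exact off-by-one behaviour may differ between Keras versions), and the loss values are made up for the example.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses, for illustration only.
losses = [0.50, 0.45, 0.46, 0.44, 0.47, 0.48, 0.49, 0.43]
print(stopping_epoch(losses))
```

Because training can continue for several epochs past the best one, the checkpoint written by `ModelCheckpoint(save_best_only=True)` is what preserves the best weights.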

In [7]:
N_train,N_test=118000,10000
train_pred,test_pred=fit_keras_model(
    model,train_x[:N_train],train_y[:N_train],test_x[:N_test],test_y[:N_test])

Train on 106200 samples, validate on 11800 samples
Epoch 1/100
Epoch 00000: val_loss improved from inf to 0.46414, saving model to best_model
Epoch 2/100
Epoch 00001: val_loss improved from 0.46414 to 0.43029, saving model to best_model
Epoch 3/100
Epoch 00002: val_loss improved from 0.43029 to 0.41694, saving model to best_model
Epoch 4/100
Epoch 00003: val_loss improved from 0.41694 to 0.40283, saving model to best_model
Epoch 5/100
Epoch 00004: val_loss did not improve
Epoch 6/100
Epoch 00005: val_loss improved from 0.40283 to 0.37928, saving model to best_model
Epoch 7/100
Epoch 00006: val_loss did not improve
Epoch 8/100
Epoch 00007: val_loss did not improve
Epoch 9/100
Epoch 00008: val_loss improved from 0.37928 to 0.36767, saving model to best_model
Epoch 10/100
Epoch 00009: val_loss did not improve
Epoch 11/100
Epoch 00010: val_loss improved from 0.36767 to 0.36177, saving model to best_model
Epoch 12/100
Epoch 00011: val_loss did not improve
Epoch 13/100
Epoch 00012: val_loss 
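The sample counts in the first line of the log follow from `validation_split=0.1`. A sketch of the arithmetic is given below, assuming (as I understand Keras to behave) that the validation set is cut from the end of the training arrays; `split_sizes` is an illustrative helper, not a Keras function.

```python
def split_sizes(n_samples, validation_split):
    """Return (n_train, n_validation) for a split taken from the end."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at


n_train, n_valid = split_sizes(118000, 0.1)  # N_train from the cell above
print(n_train, n_valid)
```

This reproduces the "Train on 106200 samples, validate on 11800 samples" line of the log.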

### Save test predictions

In [8]:
#load best model (weights with the lowest validation loss)
model.load_weights('best_model')
test_pred=model.predict_classes(test_x).ravel()

import pandas as pd
import numpy as np
result=pd.DataFrame({'id':test_id,'label':test_y,'prediction':test_pred})
result['error']=np.abs(result['label']-result['prediction'])
result.head()



| | id | label | prediction | error |
|---|---|---|---|---|
| 0 | cg19752143 | 1 | 1 | 0 |
| 1 | cg05219517 | 0 | 0 | 0 |
| 2 | cg05218696 | 1 | 1 | 0 |
| 3 | cg09329621 | 1 | 1 | 0 |
| 4 | cg17608706 | 1 | 0 | 1 |
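Since labels and predictions are both 0 or 1, the `error` column is 1 exactly where the prediction is wrong, so its mean is the error rate. A short sketch using only the five rows displayed above (not the full test set):

```python
import numpy as np

# The five rows shown by result.head()
label = np.array([1, 0, 1, 1, 1])
prediction = np.array([1, 0, 1, 1, 0])

error = np.abs(label - prediction)
accuracy = 1.0 - error.mean()

# Confusion counts
tp = int(np.sum((label == 1) & (prediction == 1)))
tn = int(np.sum((label == 0) & (prediction == 0)))
fp = int(np.sum((label == 0) & (prediction == 1)))
fn = int(np.sum((label == 1) & (prediction == 0)))

print(accuracy, tp, tn, fp, fn)
```

Five rows say nothing about the overall performance of the model; the same expressions applied to the full `result` columns give the test-set figures.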


In [9]:
result.to_csv('cnn_test_preds.csv',sep='\t',index=False,header=True)
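Note that the file is tab-separated despite its `.csv` extension, so it has to be read back with the same separator. A minimal round-trip sketch follows, using an in-memory buffer instead of a file and two rows copied from the table above:

```python
import io

import pandas as pd

result = pd.DataFrame({"id": ["cg19752143", "cg05219517"],
                       "label": [1, 0],
                       "prediction": [1, 0]})

buf = io.StringIO()
result.to_csv(buf, sep="\t", index=False, header=True)

# Read back with the same separator; the default (comma) would
# put every line into a single column.
buf.seek(0)
restored = pd.read_csv(buf, sep="\t")
print(restored)
```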