# Building Models for Predicting Labels

## Train and Test Data
Before going further, the first step is to split the data into training and test sets in a 70:30 ratio.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

fileName = "WhiteText(1).csv"
data = pd.read_csv('data/'+fileName,delimiter="|")

trainSet, testSet = train_test_split(data, test_size=0.3, random_state=3)

trainSet.to_csv('Train(1)',sep="|")
testSet.to_csv('Test(1)',sep="|")

Now that the data is split into training and test sets, we can build the models.

-----------

### Sentence preprocessing
Initially, the brain region mentions under consideration were tagged as BR1 and BR2 using the string `replace` function. Some further preprocessing was also done, which is described as we go along.

First, read the train and test data.

In [2]:
trainData = pd.read_csv('Train(1)',delimiter='|')
trainSen = trainData['sentence']
trainEn1 = trainData['entity1']
trainEn2 = trainData['entity2']
trainLab = trainData['connection']
trainLen = len(trainSen)
print trainLen

testData = pd.read_csv('Test(1)',delimiter='|')
testSen = testData['sentence']
testEn1 = testData['entity1']
testEn2 = testData['entity2']
testLab = testData['connection']
testLen = len(testSen)
print testLen

15792
6769
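
As a quick sanity check, the two counts printed above add up to 22,561 rows and are consistent with a 70:30 split of that total. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to the training set; that is my understanding of scikit-learn's behaviour, not something stated in this notebook.

```python
import math

n_train, n_test = 15792, 6769      # counts printed above
n_total = n_train + n_test         # 22561 rows in the full file

# Assumed rounding rule: the test size is rounded up, train gets the rest.
expected_test = int(math.ceil(0.3 * n_total))
expected_train = n_total - expected_test

print(expected_train, expected_test)
```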


Every occurrence of Entity1 and Entity2 in a sentence is replaced with its tag. After that, text inside parentheses is removed, the sentence is lowercased, and every character other than letters and digits is replaced with a space.

In [3]:
import re
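def replaceBR(s, en1, en2):
    # Tag the two entity mentions; the padding spaces keep the tags as separate tokens
    s = s.replace(en1," BR1 ")
    s = s.replace(en2," BR2 ")
    return s

def formatSen(x):
    x = re.sub(r"\((.*?)\)"," ",x.lower())   # lowercase and drop text inside parentheses
    x = re.sub(r"^[ ]*([a-z])",r"\1",x)      # strip leading spaces before the first letter
    return re.sub(r"[^a-z0-9]"," ",x)        # replace everything except letters and digits with a space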
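
To see what these two helpers do together, here is a small self-contained sketch in Python 3 that repeats them and runs them on a made-up sentence. The sentence and entity names are illustrative assumptions, not rows from the dataset.

```python
import re

def replaceBR(s, en1, en2):
    s = s.replace(en1, " BR1 ")
    s = s.replace(en2, " BR2 ")
    return s

def formatSen(x):
    x = re.sub(r"\((.*?)\)", " ", x.lower())
    x = re.sub(r"^[ ]*([a-z])", r"\1", x)
    return re.sub(r"[^a-z0-9]", " ", x)

# Hypothetical example sentence and entities
raw = "The thalamus (TH) projects to the cortex."
tagged = replaceBR(raw, "thalamus", "cortex")
clean = formatSen(tagged)

# Extra spaces are harmless because later steps split on whitespace
print(clean.split())
```

Note that the tags come out lowercased (`br1`, `br2`), and that `str.replace` is substring-based, so every occurrence of an entity string in the sentence is tagged.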

In [4]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(replaceBR(trainSen[i], trainEn1[i], trainEn2[i]))
    trainFSen[i] = formatSen(trainFSen[i])
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(replaceBR(testSen[i], testEn1[i], testEn2[i]))
    testFSen[i] = formatSen(testFSen[i])

After formatting all the sentences, we need a way to represent them as numbers, since classification models accept only numeric input.

## Word Embedding
Classification models take numbers as input; they do not accept words or letters directly. So we need to convert the words into corresponding numbers that identify each word. This process is called word embedding.

There are different word embedding techniques.
1. Count Vectorization
2. Tf-idf Vectorization
3. Word2Vec Embedding
4. FastText

Count vectorization counts how often each word occurs in a sentence, and the resulting array of counts is used as the representation. It is not used here because it would produce a purely pattern-based classifier and would also perform poorly.

#### Tf-idf Vectorization
Tf-idf vectorization combines term frequency and inverse document frequency to assign a value to each word. Words that are common across many sentences are given a lower value than rare words. We will be implementing this.
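
The weighting can be sketched by hand in a few lines. The example below is a minimal illustration on three made-up sentences. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which I believe matches the `TfidfVectorizer` defaults, but treat that as an assumption. The helper names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Made-up, already formatted sentences
docs = [
    "br1 projects to br2",
    "br1 receives input from br2",
    "no projection from br1 to br2 was found",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many sentences each word occurs
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed formula)
    return math.log((1.0 + n_docs) / (1.0 + df[word])) + 1.0

def tfidf(tokens):
    tf = Counter(tokens)
    weights = {w: tf[w] * idf(w) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

vec = tfidf(tokenized[0])
for word in sorted(vec, key=vec.get, reverse=True):
    print(word, round(vec[word], 3))
```

In the first sentence, `projects` occurs in only one document and therefore gets a larger weight than `br1` and `br2`, which occur in all three.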


#### Word2Vec
Word2Vec embedding is a model trained on a set of sentences. The hidden-layer weights after training are used as the representation of each word. The model has already been trained on the train and test data together.

To represent a sentence with Word2Vec, we take the mean of the vectors of all the words in the sentence.
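
The averaging step can be sketched without any library. The toy 3-dimensional vectors and the helper name `sentence_vector` below are assumptions for illustration only. The real vectors come from the trained Word2Vec model.

```python
def sentence_vector(sentence, w2v, dim):
    # Keep only the words that exist in the embedding vocabulary
    vectors = [w2v[w] for w in sentence.split() if w in w2v]
    if not vectors:
        # No known word: fall back to a zero vector
        return [0.0] * dim
    # Component-wise mean over the word vectors
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy vocabulary (hypothetical values)
toy = {
    "br1": [1.0, 0.0, 2.0],
    "projects": [3.0, 2.0, 0.0],
    "br2": [2.0, 4.0, 1.0],
}

mean_vec = sentence_vector("br1 projects to br2", toy, 3)   # "to" is out of vocabulary
empty_vec = sentence_vector("unknown words only", toy, 3)
print(mean_vec, empty_vec)
```

Out-of-vocabulary words are skipped, so they neither contribute to the sum nor count towards the divisor.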

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

def convertTfidf(train, test):
    vectorizer = TfidfVectorizer(ngram_range=(1,3))               #Calling tfidf Vectorizer
    train_vect = vectorizer.fit_transform(train) #Fitting the training data for getting tfidf values
    test_vect = vectorizer.transform(test)       #Transforming test sentences to their respective tfidf vector
    return train_vect, test_vect

### Classifier Models
The classifier algorithms used are:
1. Bernoulli Naive Bayes
2. Bagging Classifier
3. Decision Tree Classifier
4. Random Forest Classifier
5. Extra Trees Classifier
6. Calibrated Classifier
7. SGD Classifier
8. K-Neighbours Classifier
9. MLP Classifier

Each has its own strengths.

In [6]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
import pickle
from sklearn.metrics import confusion_matrix
from prettytable import PrettyTable


classifiers = [BernoulliNB(), 
               RandomForestClassifier(n_estimators=100, n_jobs=-1),
               BaggingClassifier(n_estimators=100, n_jobs=-1), 
               ExtraTreesClassifier(n_jobs=-1),
               DecisionTreeClassifier(criterion='gini',splitter='random'), 
               CalibratedClassifierCV(),
               SGDClassifier(n_jobs=-1), 
               KNeighborsClassifier(n_neighbors=1,weights='distance',n_jobs=-1,algorithm='kd_tree'),
               MLPClassifier(hidden_layer_sizes=(100,100,),verbose=True)]

def classify(train_vect,trainLab,test_vect,testLab):
    tableRep = PrettyTable(['Name','Precision','Recall','F1 Score','Accuracy'])
    #tableSent = []
    
    for classifier in classifiers:

        print "Training ",classifier.__class__.__name__
        classifier.fit(train_vect, trainLab)

        score = classifier.predict(test_vect)
        
        mat = confusion_matrix(testLab, score)

        print mat
        tp = mat[1][1]
        fp = mat[0][1]
        fn = mat[1][0]
        tn = mat[0][0]
        if tp == 0 :
            recall = 0.0
            precision = 0.0
            f1 = 0.0
        else :
            recall = float(tp)/float((tp+fn))
            precision = float(tp)/float((tp+fp))
            f1 = 2 * ((precision*recall)/(precision+recall))
        accuracy = float(tp+tn)/float(len(testLab))
        print precision,recall,f1
        tableRep.add_row([classifier.__class__.__name__,precision,recall,f1,accuracy])
    return tableRep
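
To make the bookkeeping inside `classify` concrete: for 0/1 labels, `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, so the matrix reads `[[tn, fp], [fn, tp]]`. The sketch below recomputes the metrics for the BernoulliNB matrix from the first Tf-idf run reported in this notebook. The helper name `binary_metrics` is hypothetical, and the code is written for Python 3, where `/` is true division.

```python
def binary_metrics(mat):
    # mat is laid out as [[tn, fp], [fn, tp]]
    (tn, fp), (fn, tp) = mat
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

# BernoulliNB confusion matrix from the first Tf-idf run
bernoulli_nb = [[3543, 2319], [190, 717]]
precision, recall, f1, accuracy = binary_metrics(bernoulli_nb)

# Accuracy of always predicting the negative class, for comparison
baseline_accuracy = (3543 + 2319) / (3543 + 2319 + 190 + 717)

print(precision, recall, f1, accuracy, baseline_accuracy)
```

The labels are imbalanced (907 positive sentences out of 6,769 in the test set), so predicting "no connection" everywhere would already reach an accuracy of about 0.87. This is why precision, recall and F1 are the more informative columns here.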

### 1. Tf-idf Vectorizer

In [9]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[3543 2319]
 [ 190  717]]
0.236166007905 0.790518191841 0.363682475273
Training  RandomForestClassifier
[[5564  298]
 [ 694  213]]
0.416829745597 0.234840132304 0.300423131171
Training  BaggingClassifier
[[5512  350]
 [ 644  263]]
0.429037520392 0.289966923925 0.346052631579
Training  ExtraTreesClassifier
[[5522  340]
 [ 698  209]]
0.380692167577 0.230429988975 0.287087912088
Training  DecisionTreeClassifier
[[5354  508]
 [ 593  314]]
0.38199513382 0.346196251378 0.363215731637
Training  CalibratedClassifierCV
[[5781   81]
 [ 821   86]]
0.51497005988 0.0948180815877 0.160148975791
Training  SGDClassifier
[[5846   16]
 [ 894   13]]
0.448275862069 0.0143329658214 0.0277777777778
Training  KNeighborsClassifier




[[5197  665]
 [ 548  359]]
0.3505859375 0.395810363837 0.371828068358
Training  MLPClassifier
Iteration 1, loss = 0.45372467
Iteration 2, loss = 0.32724173
Iteration 3, loss = 0.28322503
Iteration 4, loss = 0.25340585
Iteration 5, loss = 0.23047133
Iteration 6, loss = 0.20991659
Iteration 7, loss = 0.19354485
Iteration 8, loss = 0.17821918
Iteration 9, loss = 0.16480716
Iteration 10, loss = 0.15450122
Iteration 11, loss = 0.14341708
Iteration 12, loss = 0.13663411
Iteration 13, loss = 0.12929221
Iteration 14, loss = 0.12174132
Iteration 15, loss = 0.11587072
Iteration 16, loss = 0.11092697
Iteration 17, loss = 0.10739457
Iteration 18, loss = 0.10189406
Iteration 19, loss = 0.10012151
Iteration 20, loss = 0.09569567
Iteration 21, loss = 0.09365734
Iteration 22, loss = 0.09021999
Iteration 23, loss = 0.09009115
Iteration 24, loss = 0.08721443
Iteration 25, loss = 0.08474139
Iteration 26, loss = 0.08319176
Iteration 27, loss = 0.08179760
Iteration 28, loss = 0.07954279
Iteration 29, loss 

The Tf-idf vectorizer has an argument called `ngram_range`, which controls how many consecutive words are combined into a single term. Here we use `ngram_range=(1,3)`, as it gave better classifier results than the other ranges tried. From here on, the Tf-idf vectorizer uses these n-grams.
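
As an illustration of what the `ngram_range=(1,3)` setting in `convertTfidf` produces, the sketch below lists the word n-grams of a short made-up sentence. The helper `word_ngrams` is hypothetical and splits on whitespace only. The real vectorizer applies its own tokenisation rules, so treat this as an approximation.

```python
def word_ngrams(sentence, ngram_range=(1, 3)):
    tokens = sentence.split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("br1 projects to br2")
print(grams)
```

Each of these n-grams becomes its own feature, so phrases such as `projects to br2` can be weighted separately from the individual words.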

In [11]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[4418 1444]
 [ 338  569]]
0.282662692499 0.627342888644 0.389726027397
Training  RandomForestClassifier
[[5616  246]
 [ 688  219]]
0.470967741935 0.241455347299 0.319241982507
Training  BaggingClassifier
[[5549  313]
 [ 590  317]]
0.503174603175 0.349503858875 0.412491867274
Training  ExtraTreesClassifier
[[5591  271]
 [ 688  219]]
0.44693877551 0.241455347299 0.313528990694
Training  DecisionTreeClassifier
[[5382  480]
 [ 548  359]]
0.42789034565 0.395810363837 0.411225658648
Training  CalibratedClassifierCV
[[5715  147]
 [ 688  219]]
0.598360655738 0.241455347299 0.344069128044
Training  SGDClassifier
[[5809   53]
 [ 832   75]]
0.5859375 0.0826901874311 0.144927536232
Training  KNeighborsClassifier
[[5263  599]
 [ 551  356]]
0.37277486911 0.39250275634 0.38238453276
Training  MLPClassifier
Iteration 1, loss = 0.41399153
Iteration 2, loss = 0.25059892
Iteration 3, loss = 0.17640629
Iteration 4, loss = 0.13491671
Iteration 5, loss = 0.11339382
Iteration 6, loss =

### 2. Word2Vec


To use the Word2Vec embedding, we first need to load the trained model.

In [111]:
import numpy as np
from gensim.models import word2vec

def loadW2V(modelName):
    model = word2vec.Word2Vec.load(modelName)
    return dict(zip(model.wv.index2word, model.wv.vectors))

def convertW2V(data,w2v):
    wholeM = []
    count = 0
    le = 0
    print 'Embedding...',len(data)
    for sentence in data:
        le += 1
        arr = []    
        for word in sentence.split():
            if word in w2v:
                arr.append(np.array(w2v[word],copy=True))  
                                #Each word is checked if it is there in the word2vec vocabulary. If there then
                                #the vector space for the word is taken and then the mean is calculated.

        mean = np.zeros(100)
        for mat in arr:
            for j in range(len(mat)):
                mean[j] += mat[j]
        if len(arr) != 0:
            mean = np.array(mean/len(arr))
        else:
            count +=1
        wholeM.append(mean)
    print count,le
    return wholeM

def buildW2V(modelName, train, test):
    w2v = loadW2V("w2v_models/"+modelName)
    train_vect = convertW2V(train,w2v)
    test_vect = convertW2V(test,w2v)
    return train_vect, test_vect

#### a. The initial model, built from the XML file without any tagging

In [13]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4663 1199]
 [ 539  368]]
0.234843650287 0.405733186329 0.297493936944
Training  RandomForestClassifier
[[5678  184]
 [ 782  125]]
0.404530744337 0.137816979052 0.205592105263
Training  BaggingClassifier
[[5639  223]
 [ 765  142]]
0.38904109589 0.156560088203 0.223270440252
Training  ExtraTreesClassifier
[[5624  238]
 [ 759  148]]
0.383419689119 0.163175303197 0.228924980665
Training  DecisionTreeClassifier
[[5277  585]
 [ 636  271]]
0.316588785047 0.298787210584 0.307430516166
Training  CalibratedClassifierCV
[[5801   61]
 [ 850   57]]
0.483050847458 0.0628445424476 0.111219512195
Training  SGDClassifier
[[5817   45]
 [ 873   34]]
0.430379746835 0.0374862183021 0.0689655172414
Training  KNeighborsClassifier
[[5303  559]
 [ 590  317]]
0.361872146119 0.349503858875 0.355580482333
Training  MLPClassifier
Iteration 1, loss = 0.42508162
Iteration 2, loss = 0.34804531
Iteration 3, loss = 0.33855452
Iteration 4, loss = 0.33037543
Iteration 5, 

#### b. The Word2Vec model built after the BRs were marked

In [14]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4886  976]
 [ 551  356]]
0.267267267267 0.39250275634 0.317999106744
Training  RandomForestClassifier
[[5606  256]
 [ 742  165]]
0.391923990499 0.181918412348 0.248493975904
Training  BaggingClassifier
[[5579  283]
 [ 717  190]]
0.401691331924 0.209481808159 0.275362318841
Training  ExtraTreesClassifier
[[5562  300]
 [ 721  186]]
0.382716049383 0.205071664829 0.267049533381
Training  DecisionTreeClassifier
[[5300  562]
 [ 644  263]]
0.318787878788 0.289966923925 0.303695150115
Training  CalibratedClassifierCV
[[5806   56]
 [ 852   55]]
0.495495495495 0.0606394707828 0.108055009823
Training  SGDClassifier
[[5832   30]
 [ 882   25]]
0.454545454545 0.0275633958104 0.0519750519751
Training  KNeighborsClassifier
[[5307  555]
 [ 606  301]]
0.351635514019 0.331863285557 0.341463414634
Training  MLPClassifier
Iteration 1, loss = 0.41875534
Iteration 2, loss = 0.35524657
Iteration 3, loss = 0.34697990
Iteration 4, loss = 0.33860186
Iteration 5, 

#### c. The Word2Vec model in which BR1 and BR2 were marked

In [15]:
modelName = 'csvReplaceBR12'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4328 1534]
 [ 480  427]]
0.217746047935 0.470782800441 0.297768479777
Training  RandomForestClassifier
[[5692  170]
 [ 805  102]]
0.375 0.112458654906 0.173027989822
Training  BaggingClassifier
[[5661  201]
 [ 786  121]]
0.375776397516 0.133406835722 0.19690805533
Training  ExtraTreesClassifier
[[5648  214]
 [ 780  127]]
0.372434017595 0.140022050717 0.203525641026
Training  DecisionTreeClassifier
[[5299  563]
 [ 667  240]]
0.298879202989 0.264608599779 0.280701754386
Training  CalibratedClassifierCV
[[5797   65]
 [ 843   64]]
0.496124031008 0.0705622932745 0.123552123552
Training  SGDClassifier
[[5524  338]
 [ 753  154]]
0.313008130081 0.169790518192 0.220157255182
Training  KNeighborsClassifier
[[5284  578]
 [ 597  310]]
0.349099099099 0.341786108049 0.345403899721
Training  MLPClassifier
Iteration 1, loss = 0.41251046
Iteration 2, loss = 0.34756342
Iteration 3, loss = 0.33361988
Iteration 4, loss = 0.32395314
Iteration 5, loss = 0.31

#### d. The Word2Vec model trained with phrases

In [16]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_3_ReplaceBR_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainFSen], bigram_trans[testFSen])
print(classify(train_vect,trainLab,test_vect,testLab))



Embedding...
Embedding...
Training  BernoulliNB
[[4747 1115]
 [ 501  406]]
0.266929651545 0.44762954796 0.334431630972
Training  RandomForestClassifier
[[5613  249]
 [ 757  150]]
0.375939849624 0.165380374862 0.229709035222
Training  BaggingClassifier
[[5576  286]
 [ 724  183]]
0.390191897655 0.201764057332 0.265988372093
Training  ExtraTreesClassifier
[[5568  294]
 [ 713  194]]
0.397540983607 0.213891951488 0.278136200717
Training  DecisionTreeClassifier
[[5352  510]
 [ 649  258]]
0.3359375 0.284454244763 0.308059701493
Training  CalibratedClassifierCV
[[5820   42]
 [ 855   52]]
0.553191489362 0.0573318632856 0.103896103896
Training  SGDClassifier
[[5783   79]
 [ 863   44]]
0.357723577236 0.0485115766262 0.0854368932039
Training  KNeighborsClassifier
[[5261  601]
 [ 593  314]]
0.343169398907 0.346196251378 0.344676180022
Training  MLPClassifier
Iteration 1, loss = 0.40471328
Iteration 2, loss = 0.35736895
Iteration 3, loss = 0.34956691
Iteration 4, loss = 0.34418098
Iteration 5, loss 

--------
## Middle Sentences
Until now we considered the whole sentence. On closer inspection, however, most of the connection-related words appear near BR1 and BR2: mostly between them, and just before and just after. So this time we consider only the words between the two entities, plus a few words before and after them, using a window of size 3.

In [89]:
from nltk.tokenize import word_tokenize
def breakSen(sentences):
    middleSen = []
    for sentence in sentences:
        s = sentence.split()

        ind1 = -1
        ind2 = -1
        for j in range(0,len(s)):
            if s[j].__contains__('br'):
                if ind1 == -1:
                    ind1 = j
                else:
                    ind2 = j
        for j in range(0,len(s)):
            if s[j].__contains__('br'):
                if j > ind2:
                    ind2 = j

        if ind1-3 < 0:
            ind1 = 0
        else: ind1 -= 3

        if ind2+3 > len(s):
            ind2 = len(s)
        else: ind2 += 3

        middleSen.append(' '.join(s[ind1:ind2]))
    return middleSen

In [90]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

['divisions of the br1 br br and br2',
 'the br therefore resembles other areas of primate br1 such as the br2 where a',
 'almost all br including the br br br1 br br2 br and br displayed peak',
 'suggest that the br1 input to the br may be directed toward specific subpopulation of br neurons and may influence not only cells in the br2 but also in the br',
 'implants in the br1 many retrogradely labeled cells were observed mainly in the br the br br2 br and br']
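
Two details of `breakSen` are worth noting. The substring test `'br'` also matches ordinary words that contain those letters, such as "brain". And because the slice `s[ind1:ind2]` excludes index `ind2`, the function keeps three words before the first tag but only two after the last one. The hypothetical `middle_window` below is a variant that matches the tags exactly and keeps the window symmetric. It is a sketch for comparison, not the function used to produce the results in this notebook, and the sample sentence is made up.

```python
def middle_window(sentence, window=3, tag1="br1", tag2="br2"):
    tokens = sentence.split()
    # Positions of the two entity tags (exact token match)
    positions = [i for i, t in enumerate(tokens) if t in (tag1, tag2)]
    if not positions:
        return sentence
    start = max(positions[0] - window, 0)
    # +1 because the slice end is exclusive
    end = min(positions[-1] + window + 1, len(tokens))
    return " ".join(tokens[start:end])

sample = ("tracer injections into the br1 labeled many cells "
          "in the br2 of the adult rat brain")
trimmed = middle_window(sample)
print(trimmed)
```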

### 1. Tf-idf Vectorizer

In [30]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[5135  727]
 [ 526  381]]
0.343862815884 0.42006615215 0.378163771712
Training  RandomForestClassifier
[[5740  122]
 [ 656  251]]
0.672922252011 0.276736493936 0.3921875
Training  BaggingClassifier
[[5626  236]
 [ 539  368]]
0.609271523179 0.405733186329 0.487094639312
Training  ExtraTreesClassifier
[[5704  158]
 [ 606  301]]
0.655773420479 0.331863285557 0.440702781845
Training  DecisionTreeClassifier
[[5495  367]
 [ 490  417]]
0.531887755102 0.459757442117 0.493199290361
Training  CalibratedClassifierCV
[[5703  159]
 [ 553  354]]
0.690058479532 0.390297684675 0.498591549296
Training  SGDClassifier
[[5805   57]
 [ 737  170]]
0.748898678414 0.18743109151 0.299823633157
Training  KNeighborsClassifier
[[5369  493]
 [ 468  439]]
0.471030042918 0.48401323043 0.477433387711
Training  MLPClassifier
Iteration 1, loss = 0.42751735
Iteration 2, loss = 0.22667112
Iteration 3, loss = 0.13298977
Iteration 4, loss = 0.09406474
Iteration 5, loss = 0.07573491
Iteration 6, loss 

### 2. Word2Vec

#### a. The initial model, built from the XML file without any tagging

In [31]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4771 1091]
 [ 492  415]]
0.275564409031 0.457552370452 0.343970161625
Training  RandomForestClassifier
[[5742  120]
 [ 780  127]]
0.514170040486 0.140022050717 0.220103986135
Training  BaggingClassifier
[[5729  133]
 [ 766  141]]
0.514598540146 0.15545755237 0.238780694327
Training  ExtraTreesClassifier
[[5705  157]
 [ 735  172]]
0.522796352584 0.189636163175 0.278317152104
Training  DecisionTreeClassifier
[[5266  596]
 [ 631  276]]
0.316513761468 0.304299889746 0.310286677909
Training  CalibratedClassifierCV
[[5771   91]
 [ 838   69]]
0.43125 0.0760749724366 0.129334582943
Training  SGDClassifier
[[5546  316]
 [ 779  128]]
0.288288288288 0.141124586549 0.189489267209
Training  KNeighborsClassifier
[[5395  467]
 [ 487  420]]
0.473506200676 0.463065049614 0.468227424749
Training  MLPClassifier
Iteration 1, loss = 0.37051210
Iteration 2, loss = 0.32808467
Iteration 3, loss = 0.31497449
Iteration 4, loss = 0.30384293
Iteration 5, loss = 0.

#### b. The Word2Vec model built after the BRs were marked

In [32]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4939  923]
 [ 517  390]]
0.29702970297 0.429988974642 0.351351351351
Training  RandomForestClassifier
[[5736  126]
 [ 745  162]]
0.5625 0.178610804851 0.271129707113
Training  BaggingClassifier
[[5711  151]
 [ 713  194]]
0.56231884058 0.213891951488 0.309904153355
Training  ExtraTreesClassifier
[[5701  161]
 [ 688  219]]
0.576315789474 0.241455347299 0.340326340326
Training  DecisionTreeClassifier
[[5356  506]
 [ 607  300]]
0.372208436725 0.330760749724 0.350262697023
Training  CalibratedClassifierCV
[[5761  101]
 [ 849   58]]
0.364779874214 0.06394707828 0.108818011257
Training  SGDClassifier
[[5747  115]
 [ 845   62]]
0.350282485876 0.0683572216097 0.114391143911
Training  KNeighborsClassifier
[[5412  450]
 [ 474  433]]
0.490373725934 0.477398015436 0.483798882682
Training  MLPClassifier
Iteration 1, loss = 0.39236779
Iteration 2, loss = 0.33815081
Iteration 3, loss = 0.32479392
Iteration 4, loss = 0.31462323
Iteration 5, loss = 0.306

#### c. The Word2Vec model in which BR1 and BR2 were marked

In [33]:
modelName = 'csvReplaceBR12'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4175 1687]
 [ 416  491]]
0.225436179982 0.541345093716 0.318314424635
Training  RandomForestClassifier
[[5749  113]
 [ 799  108]]
0.488687782805 0.119073869901 0.191489361702
Training  BaggingClassifier
[[5712  150]
 [ 788  119]]
0.442379182156 0.131201764057 0.202380952381
Training  ExtraTreesClassifier
[[5702  160]
 [ 768  139]]
0.464882943144 0.153252480706 0.230514096186
Training  DecisionTreeClassifier
[[5229  633]
 [ 605  302]]
0.322994652406 0.332965821389 0.327904451683
Training  CalibratedClassifierCV
[[5769   93]
 [ 840   67]]
0.41875 0.0738699007718 0.125585754452
Training  SGDClassifier
[[5487  375]
 [ 813   94]]
0.200426439232 0.103638368247 0.136627906977
Training  KNeighborsClassifier
[[5373  489]
 [ 461  446]]
0.477005347594 0.491730981257 0.484256243214
Training  MLPClassifier
Iteration 1, loss = 0.40640677
Iteration 2, loss = 0.33248189
Iteration 3, loss = 0.31839432
Iteration 4, loss = 0.30695360
Iteration 5, loss = 0

#### d. The Word2Vec model trained with phrases

In [35]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_3_ReplaceBR_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4807 1055]
 [ 489  418]]
0.28377460964 0.460859977949 0.351260504202
Training  RandomForestClassifier
[[5727  135]
 [ 736  171]]
0.558823529412 0.188533627343 0.281945589448
Training  BaggingClassifier
[[5686  176]
 [ 696  211]]
0.545219638243 0.232635060639 0.326120556414
Training  ExtraTreesClassifier
[[5668  194]
 [ 688  219]]
0.530266343826 0.241455347299 0.331818181818
Training  DecisionTreeClassifier
[[5308  554]
 [ 585  322]]
0.367579908676 0.355016538037 0.361189007291
Training  CalibratedClassifierCV
[[5772   90]
 [ 845   62]]
0.407894736842 0.0683572216097 0.117091595845
Training  SGDClassifier
[[5308  554]
 [ 675  232]]
0.295165394402 0.25578831312 0.27406969876
Training  KNeighborsClassifier
[[5401  461]
 [ 475  432]]
0.483762597984 0.476295479603 0.48
Training  MLPClassifier
Iteration 1, loss = 0.37498792
Iteration 2, loss = 0.34110363
Iteration 3, loss = 0.33099838
Iteration 4, loss = 0.32309451
Iteration 5, loss = 0.31530

_________
## Re-tagging BR1 and BR2
Many sentences contained several replacements of BR1 or BR2, even though the additional tagged mentions did not take part in the relation. So BR1 and BR2 were re-tagged using the character offsets (CharOffset) of the two entities. The `WhiteText(2).csv` file is used here.
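
The notebook does not show how the offsets are stored or how the re-tagging was implemented, so the following is only a sketch of the idea. It assumes each entity comes with a `(start, end)` character span (end exclusive), and the helper `tag_by_offsets` and the sentence are hypothetical. Replacing from the rightmost span first keeps the earlier offsets valid.

```python
def tag_by_offsets(sentence, span1, span2):
    # Process the rightmost span first so earlier offsets stay valid
    spans = sorted(
        [(span1, " BR1 "), (span2, " BR2 ")],
        key=lambda item: item[0][0],
        reverse=True,
    )
    for (start, end), tag in spans:
        sentence = sentence[:start] + tag + sentence[end:]
    return sentence

s = "The cortex projects to the thalamus and the cortex."
start1 = s.index("cortex")        # first mention only
start2 = s.index("thalamus")
tagged = tag_by_offsets(
    s,
    (start1, start1 + len("cortex")),
    (start2, start2 + len("thalamus")),
)
print(tagged.split())
```

Unlike `str.replace`, this tags only the mention at the given offsets, so the second "cortex" in the example stays untouched.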

In [131]:
import pandas as pd
from sklearn.model_selection import train_test_split

fileName = "WhiteText(2).csv"
data = pd.read_csv('data/'+fileName,delimiter="|")

trainSet, testSet = train_test_split(data, test_size=0.3, random_state=3)

trainSet.to_csv('Train(2)',sep="|")
testSet.to_csv('Test(2)',sep="|")

In [132]:
trainData = pd.read_csv('Train(2)',delimiter='|')
trainSen = trainData['sentence']
trainLab = trainData['connection']
trainLen = len(trainSen)
print trainLen

testData = pd.read_csv('Test(2)',delimiter='|')
testSen = testData['sentence']
testLab = testData['connection']
testLen = len(testSen)
print testLen


def formatSen(x):
    x = re.sub("\((.*?)\)"," ",x.lower())
    x = re.sub("^[ ]*([a-z])",r"\1",x)
    return re.sub("[^a-z0-9]"," ",x)

15792
6769


In [133]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(formatSen(trainSen[i]))
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(formatSen(testSen[i]))

### 1. Tf-idf Vectorizer

In [42]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[5134  728]
 [ 609  298]]
0.29044834308 0.32855567806 0.308329022245
Training  RandomForestClassifier
[[5704  158]
 [ 566  341]]
0.683366733467 0.375964718853 0.48506401138
Training  BaggingClassifier
[[5568  294]
 [ 358  549]]
0.651245551601 0.605292171996 0.627428571429
Training  ExtraTreesClassifier
[[5680  182]
 [ 537  370]]
0.670289855072 0.407938257993 0.507196710075
Training  DecisionTreeClassifier
[[5452  410]
 [ 355  552]]
0.573804573805 0.608599779493 0.590690208668
Training  CalibratedClassifierCV
[[5690  172]
 [ 433  474]]
0.733746130031 0.522601984564 0.610431423052
Training  SGDClassifier
[[5805   57]
 [ 670  237]]
0.80612244898 0.261300992282 0.39467110741
Training  KNeighborsClassifier
[[5432  430]
 [ 378  529]]
0.551616266945 0.583241455347 0.566988210075
Training  MLPClassifier
Iteration 1, loss = 0.42037182
Iteration 2, loss = 0.25157603
Iteration 3, loss = 0.16215516
Iteration 4, loss = 0.11806090
Iteration 5, loss = 0.09596060
Iteration 6, lo

### 2. Word2Vec

#### a. The initial model, built from the XML file without any tagging

In [43]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4387 1475]
 [ 454  453]]
0.234958506224 0.499448732084 0.319576719577
Training  RandomForestClassifier
[[5623  239]
 [ 806  101]]
0.297058823529 0.111356119074 0.161988773055
Training  BaggingClassifier
[[5639  223]
 [ 811   96]]
0.300940438871 0.105843439912 0.15660685155
Training  ExtraTreesClassifier
[[5687  175]
 [ 832   75]]
0.3 0.0826901874311 0.129645635264
Training  DecisionTreeClassifier
[[5620  242]
 [ 812   95]]
0.281899109792 0.104740904079 0.152733118971
Training  CalibratedClassifierCV
[[5806   56]
 [ 863   44]]
0.44 0.0485115766262 0.0873882820258
Training  SGDClassifier
[[4488 1374]
 [ 423  484]]
0.260495156082 0.533627342889 0.350090415913
Training  KNeighborsClassifier
[[5206  656]
 [ 634  273]]
0.293864370291 0.300992282249 0.297385620915
Training  MLPClassifier
Iteration 1, loss = 0.39915115
Iteration 2, loss = 0.35121816
Iteration 3, loss = 0.33976074
Iteration 4, loss = 0.33113768
Iteration 5, loss = 0.32305589
Ite

#### b. The Word2Vec model built after the BRs were marked

In [44]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4403 1459]
 [ 452  455]]
0.237722048067 0.501653803749 0.322580645161
Training  RandomForestClassifier
[[5634  228]
 [ 803  104]]
0.313253012048 0.114663726571 0.16787732042
Training  BaggingClassifier
[[5629  233]
 [ 793  114]]
0.328530259366 0.125689084895 0.181818181818
Training  ExtraTreesClassifier
[[5682  180]
 [ 836   71]]
0.282868525896 0.0782800441014 0.122625215889
Training  DecisionTreeClassifier
[[5645  217]
 [ 803  104]]
0.323987538941 0.114663726571 0.169381107492
Training  CalibratedClassifierCV
[[5814   48]
 [ 861   46]]
0.489361702128 0.0507166482911 0.0919080919081
Training  SGDClassifier
[[5167  695]
 [ 638  269]]
0.279045643154 0.29658213892 0.287546766435
Training  KNeighborsClassifier
[[5259  603]
 [ 667  240]]
0.284697508897 0.264608599779 0.274285714286
Training  MLPClassifier
Iteration 1, loss = 0.40861922
Iteration 2, loss = 0.35540155
Iteration 3, loss = 0.34405066
Iteration 4, loss = 0.33609945
Iteration 5, l

#### c. The Word2Vec model in which BR1 and BR2 were marked

In [45]:
modelName = 'csv_2_ReplaceBR12'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4853 1009]
 [ 541  366]]
0.266181818182 0.403528114664 0.320771253287
Training  RandomForestClassifier
[[5620  242]
 [ 812   95]]
0.281899109792 0.104740904079 0.152733118971
Training  BaggingClassifier
[[5625  237]
 [ 804  103]]
0.302941176471 0.113561190739 0.165196471532
Training  ExtraTreesClassifier
[[5689  173]
 [ 835   72]]
0.29387755102 0.0793825799338 0.125
Training  DecisionTreeClassifier
[[5639  223]
 [ 803  104]]
0.318042813456 0.114663726571 0.168557536467
Training  CalibratedClassifierCV
[[5815   47]
 [ 868   39]]
0.453488372093 0.0429988974642 0.0785498489426
Training  SGDClassifier
[[5546  316]
 [ 733  174]]
0.355102040816 0.19184123484 0.249105225483
Training  KNeighborsClassifier
[[5302  560]
 [ 632  275]]
0.329341317365 0.303197353914 0.315729047072
Training  MLPClassifier
Iteration 1, loss = 0.37046787
Iteration 2, loss = 0.33599634
Iteration 3, loss = 0.32424094
Iteration 4, loss = 0.31612397
Iteration 5, loss = 0.3

#### d. The Word2Vec model in which BR1 and BR2 were marked, with bigrams

In [46]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_2_ReplaceBR12_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainFSen], bigram_trans[testFSen])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4707 1155]
 [ 542  365]]
0.240131578947 0.402425578831 0.300782859497
Training  RandomForestClassifier
[[5629  233]
 [ 815   92]]
0.283076923077 0.101433296582 0.149350649351
Training  BaggingClassifier
[[5631  231]
 [ 787  120]]
0.34188034188 0.13230429989 0.190779014308
Training  ExtraTreesClassifier
[[5685  177]
 [ 839   68]]
0.277551020408 0.0749724366042 0.118055555556
Training  DecisionTreeClassifier
[[5623  239]
 [ 814   93]]
0.280120481928 0.102535832415 0.150121065375
Training  CalibratedClassifierCV
[[5816   46]
 [ 865   42]]
0.477272727273 0.0463065049614 0.0844221105528
Training  SGDClassifier
[[5619  243]
 [ 798  109]]
0.309659090909 0.120176405733 0.173153296267
Training  KNeighborsClassifier
[[5367  495]
 [ 660  247]]
0.332884097035 0.272326350606 0.299575500303
Training  MLPClassifier
Iteration 1, loss = 0.37736587
Iteration 2, loss = 0.34384498
Iteration 3, loss = 0.33059085
Iteration 4, loss = 0.32021666
Iteration 5, l

### Middle Sentences

In [134]:
from nltk.tokenize import word_tokenize
def breakSen(sentences):
    middleSen = []
    for sentence in sentences:
        s = word_tokenize(sentence)

        ind1 = -1
        ind2 = -1
        for j in range(0,len(s)):
            if s[j].__contains__('br1'):
                if ind1 == -1:
                    ind1 = j
                else:
                    ind2 = j
        for j in range(0,len(s)):
            if s[j].__contains__('br2'):
                if j > ind2:
                    ind2 = j

        if ind1-3 < 0:
            ind1 = 0
        else: ind1 -= 3

        if ind2+3 > len(s):
            ind2 = len(s)
        else: ind2 += 3

        middleSen.append(' '.join(s[ind1:ind2]))
    return middleSen

In [135]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

['divisions of the br1 br br and br2',
 'areas of primate br1 such as the br2 where a',
 'the br br br1 br br2 br and',
 'suggest that the br1 input to the br may be directed toward specific subpopulation of br neurons and may influence not only cells in the br2 but also',
 'implants in the br1 many retrogradely labeled cells were observed mainly in the br the br br2 br and']

In [136]:
len(trainFSen)

15792

### 1. Tf-idf Vectorizer

In [82]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[5421  441]
 [ 614  293]]
0.399182561308 0.323042998897 0.357099329677
Training  RandomForestClassifier
[[5779   83]
 [ 441  466]]
0.848816029144 0.513781697905 0.64010989011
Training  BaggingClassifier
[[5686  176]
 [ 337  570]]
0.764075067024 0.628445424476 0.689655172414
Training  ExtraTreesClassifier
[[5755  107]
 [ 411  496]]
0.822553897181 0.546857772878 0.656953642384
Training  DecisionTreeClassifier
[[5538  324]
 [ 344  563]]
0.63472378805 0.620727673649 0.627647714604
Training  CalibratedClassifierCV
[[5686  176]
 [ 284  623]]
0.77972465582 0.686879823594 0.730363423212
Training  SGDClassifier
[[5782   80]
 [ 464  443]]
0.847036328872 0.48842337376 0.61958041958
Training  KNeighborsClassifier
[[5655  207]
 [ 371  536]]
0.721399730821 0.590959206174 0.649696969697
Training  MLPClassifier
Iteration 1, loss = 0.40705318
Iteration 2, loss = 0.18446840
Iteration 3, loss = 0.08510522
Iteration 4, loss = 0.05434422
Iteration 5, loss = 0.04010676
Iteration 6, lo

### 2. Word2Vec

#### a. First, we use the initial word2vec model, which was built from the XML file without any tagging

In [83]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4768 1094]
 [ 440  467]]
0.299167200512 0.514884233738 0.378444084279
Training  RandomForestClassifier
[[5785   77]
 [ 597  310]]
0.801033591731 0.341786108049 0.47913446677
Training  BaggingClassifier
[[5743  119]
 [ 565  342]]
0.741865509761 0.377067254686 0.5
Training  ExtraTreesClassifier
[[5743  119]
 [ 552  355]]
0.748945147679 0.391400220507 0.514120202752
Training  DecisionTreeClassifier
[[5413  449]
 [ 413  494]]
0.523860021209 0.544652701213 0.534054054054
Training  CalibratedClassifierCV
[[5764   98]
 [ 827   80]]
0.449438202247 0.0882028665932 0.147465437788
Training  SGDClassifier
[[5670  192]
 [ 824   83]]
0.301818181818 0.0915104740904 0.140439932318
Training  KNeighborsClassifier
[[5551  311]
 [ 353  554]]
0.640462427746 0.610804851158 0.625282167043
Training  MLPClassifier
Iteration 1, loss = 0.39189568
Iteration 2, loss = 0.32727486
Iteration 3, loss = 0.30870581
Iteration 4, loss = 0.29435820
Iteration 5, loss = 0.282

#### b. Second, we use the word2vec model that was built after the BRs were marked

In [84]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4951  911]
 [ 495  412]]
0.311413454271 0.454244762955 0.369506726457
Training  RandomForestClassifier
[[5769   93]
 [ 560  347]]
0.788636363636 0.382579933848 0.515219005197
Training  BaggingClassifier
[[5747  115]
 [ 530  377]]
0.766260162602 0.41565600882 0.538956397427
Training  ExtraTreesClassifier
[[5755  107]
 [ 540  367]]
0.774261603376 0.404630650496 0.531498913831
Training  DecisionTreeClassifier
[[5408  454]
 [ 467  440]]
0.492170022371 0.485115766262 0.488617434758
Training  CalibratedClassifierCV
[[5784   78]
 [ 829   78]]
0.5 0.0859977949283 0.146754468485
Training  SGDClassifier
[[5694  168]
 [ 840   67]]
0.285106382979 0.0738699007718 0.117338003503
Training  KNeighborsClassifier
[[5545  317]
 [ 346  561]]
0.638952164009 0.618522601985 0.628571428571
Training  MLPClassifier
Iteration 1, loss = 0.40962077
Iteration 2, loss = 0.33494788
Iteration 3, loss = 0.31454425
Iteration 4, loss = 0.30239191
Iteration 5, loss = 0.293

#### c. Next, we use the word2vec model in which BR1 and BR2 are marked

In [85]:
modelName = 'csv_3_ReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4921  941]
 [ 474  433]]
0.315138282387 0.477398015436 0.379658044717
Training  RandomForestClassifier
[[5775   87]
 [ 568  339]]
0.795774647887 0.373759647189 0.508627156789
Training  BaggingClassifier
[[5759  103]
 [ 532  375]]
0.784518828452 0.413450937155 0.541516245487
Training  ExtraTreesClassifier
[[5752  110]
 [ 554  353]]
0.762419006479 0.389195148842 0.515328467153
Training  DecisionTreeClassifier
[[5369  493]
 [ 453  454]]
0.479408658923 0.500551267916 0.48975188781
Training  CalibratedClassifierCV
[[5770   92]
 [ 816   91]]
0.497267759563 0.10033076075 0.166972477064
Training  SGDClassifier
[[5471  391]
 [ 681  226]]
0.366288492707 0.249173098126 0.296587926509
Training  KNeighborsClassifier
[[5523  339]
 [ 354  553]]
0.619955156951 0.609702315325 0.614785992218
Training  MLPClassifier
Iteration 1, loss = 0.40676753
Iteration 2, loss = 0.33243473
Iteration 3, loss = 0.31347340
Iteration 4, loss = 0.30124853
Iteration 5, loss

### Grouping
When we reduced the sentences to only the middle context, we observed that comma-separated BRs were a disadvantage: in some sentences the extracted middle context consisted almost entirely of BR tags, with no other words. To get more out of the words surrounding the BRs, BRs appearing together were grouped into a single BR entity.

In [137]:
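def formatSen(sentence):
    # Drop the article "the" and the word "nucleus"; replace commas with spaces.
    sentence = re.sub(r"\s(the|The)\s", " ", sentence)
    sentence = re.sub(r"^(the|The)", "", sentence)
    sentence = re.sub(r"(nucleus)", " ", sentence)
    sentence = re.sub(r",", " ", sentence)
    # Turn single-digit enumeration markers such as "(1)" into a period.
    sentence = re.sub(r"\([0-9]\)", ".", sentence)
    # Collapse runs of adjacent BR tags into one tag.
    # A run that contains BR1 or BR2 keeps that label.
    sentence = re.sub(r"[, ]*(BR[^12])([, ]*(BR[^12]))+[, ]*", " BR ", sentence)
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR1([, ]+BR[^12])*[, ]*", " BR1 ", sentence)
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR2([, ]+BR[^12])*[, ]*", " BR2 ", sentence)
    return sentence.lower()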
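
The effect of the grouping is easiest to see on a toy sentence. The sketch below copies only the comma replacement and the three grouping patterns from the cell above; the other cleaning steps are left out, and `group_brs` and the example sentence are hypothetical.

```python
import re

def group_brs(sentence):
    sentence = re.sub(r",", " ", sentence)
    # A run of plain BR tags becomes a single BR.
    sentence = re.sub(r"[, ]*(BR[^12])([, ]*(BR[^12]))+[, ]*", " BR ", sentence)
    # Plain BR tags next to BR1 (or BR2) are absorbed into that tag.
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR1([, ]+BR[^12])*[, ]*", " BR1 ", sentence)
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR2([, ]+BR[^12])*[, ]*", " BR2 ", sentence)
    return sentence.lower()

toy = "input from BR, BR, BR1, BR and the cortex reaches BR2"
# Normalise whitespace only for display.
print(' '.join(group_brs(toy).split()))
# prints: input from br1 and the cortex reaches br2
```

The four comma-separated tags collapse into a single `br1`, so the words around it ("input from", "and the cortex") now fall inside the three-token window used by `breakSen`.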

In [138]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(formatSen(trainSen[i]))
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(formatSen(testSen[i]))

In [139]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

[') divisions of br1 ( mgc ) br ( sg ) br ( lim ) and br2 .',
 'areas of primate br1 such as br2 where a',
 'all br including br1 br2 and br',
 'results suggest that br1 input to br may be directed toward specific subpopulation ( s ) of br neurons and may influence not only cells in br2 but also',
 '-gel implants in br1 many retrogradely labeled cells were observed mainly in br2 and br']

### 1. Tf-idf Vectorizer

In [125]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[5513  349]
 [ 609  298]]
0.460587326121 0.32855567806 0.383526383526
Training  RandomForestClassifier
[[5783   79]
 [ 413  494]]
0.862129144852 0.544652701213 0.667567567568
Training  BaggingClassifier
[[5697  165]
 [ 321  586]]
0.780292942743 0.646085997795 0.70687575392
Training  ExtraTreesClassifier
[[5743  119]
 [ 373  534]]
0.817764165391 0.588754134509 0.684615384615
Training  DecisionTreeClassifier
[[5589  273]
 [ 307  600]]
0.687285223368 0.661521499449 0.674157303371
Training  CalibratedClassifierCV
[[5704  158]
 [ 272  635]]
0.800756620429 0.700110253583 0.747058823529
Training  SGDClassifier
[[5794   68]
 [ 454  453]]
0.869481765835 0.499448732084 0.634453781513
Training  KNeighborsClassifier
[[5542  320]
 [ 294  613]]
0.657020364416 0.67585446527 0.666304347826
Training  MLPClassifier
Iteration 1, loss = 0.43017429
Iteration 2, loss = 0.16907243
Iteration 3, loss = 0.05609700
Iteration 4, loss = 0.02700541
Iteration 5, loss = 0.01663817
Iteration 6, 

### 2. Word2Vec

#### a. First, we use the initial word2vec model, which was built from the XML file without any tagging

In [118]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 15792
0 15792
Embedding... 6769
0 6769
Training  BernoulliNB
[[4758 1104]
 [ 425  482]]
0.303909205549 0.531422271224 0.386682711592
Training  RandomForestClassifier
[[5760  102]
 [ 629  278]]
0.731578947368 0.306504961411 0.432012432012
Training  BaggingClassifier
[[5737  125]
 [ 582  325]]
0.722222222222 0.358324145535 0.478997789241
Training  ExtraTreesClassifier
[[5723  139]
 [ 601  306]]
0.687640449438 0.337375964719 0.452662721893
Training  DecisionTreeClassifier
[[5301  561]
 [ 476  431]]
0.434475806452 0.475192943771 0.45392311743
Training  CalibratedClassifierCV
[[5754  108]
 [ 792  115]]
0.515695067265 0.126791620728 0.203539823009
Training  SGDClassifier
[[5673  189]
 [ 845   62]]
0.247011952191 0.0683572216097 0.107081174439
Training  KNeighborsClassifier
[[5512  350]
 [ 327  580]]
0.623655913978 0.6394707828 0.631464344039
Training  MLPClassifier
Iteration 1, loss = 0.37957125
Iteration 2, loss = 0.32048802
Iteration 3, loss = 0.30545818
Iteration 4, loss = 0.

#### b. Second, we use the word2vec model that was built after the BRs were marked

In [119]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 15792
6 15792
Embedding... 6769
1 6769
Training  BernoulliNB
[[4825 1037]
 [ 484  423]]
0.289726027397 0.466372657111 0.357414448669
Training  RandomForestClassifier
[[5762  100]
 [ 600  307]]
0.7542997543 0.338478500551 0.467275494673
Training  BaggingClassifier
[[5721  141]
 [ 558  349]]
0.712244897959 0.384785005513 0.499642090193
Training  ExtraTreesClassifier
[[5725  137]
 [ 596  311]]
0.694196428571 0.342888643881 0.459040590406
Training  DecisionTreeClassifier
[[5380  482]
 [ 477  430]]
0.47149122807 0.474090407938 0.472787245739
Training  CalibratedClassifierCV
[[5750  112]
 [ 796  111]]
0.497757847534 0.122381477398 0.196460176991
Training  SGDClassifier
[[5737  125]
 [ 794  113]]
0.474789915966 0.124586549063 0.197379912664
Training  KNeighborsClassifier
[[5542  320]
 [ 341  566]]
0.638826185102 0.624035281147 0.631344116007
Training  MLPClassifier
Iteration 1, loss = 0.36384746
Iteration 2, loss = 0.32002731
Iteration 3, loss = 0.30475602
Iteration 4, loss = 0.2

#### c. Next, we use the word2vec model in which BR1 and BR2 are marked

In [120]:
modelName = 'csv_2_groupBR12'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 15792
0 15792
Embedding... 6769
0 6769
Training  BernoulliNB
[[4428 1434]
 [ 394  513]]
0.263482280431 0.565600882029 0.359495444989
Training  RandomForestClassifier
[[5773   89]
 [ 627  280]]
0.758807588076 0.308710033076 0.438871473354
Training  BaggingClassifier
[[5743  119]
 [ 595  312]]
0.723897911833 0.343991179713 0.466367713004
Training  ExtraTreesClassifier
[[5751  111]
 [ 616  291]]
0.723880597015 0.320837927233 0.44461420932
Training  DecisionTreeClassifier
[[5327  535]
 [ 491  416]]
0.437434279706 0.458654906284 0.447793326157
Training  CalibratedClassifierCV
[[5737  125]
 [ 801  106]]
0.458874458874 0.116868798236 0.186291739895
Training  SGDClassifier
[[3329 2533]
 [ 247  660]]
0.206702160977 0.727673649394 0.321951219512
Training  KNeighborsClassifier
[[5519  343]
 [ 343  564]]
0.621830209482 0.621830209482 0.621830209482
Training  MLPClassifier
Iteration 1, loss = 0.35413348
Iteration 2, loss = 0.30921414
Iteration 3, loss = 0.29319977
Iteration 4, loss = 0

#### d. Last, we use the word2vec model in which BR1 and BR2 are marked, with bigrams

In [121]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_2_groupBR12_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 15792
0 15792
Embedding... 6769
0 6769
Training  BernoulliNB
[[4502 1360]
 [ 450  457]]
0.251513483764 0.503858875413 0.335535976505
Training  RandomForestClassifier
[[5788   74]
 [ 655  252]]
0.773006134969 0.277839029768 0.408759124088
Training  BaggingClassifier
[[5744  118]
 [ 630  277]]
0.701265822785 0.305402425579 0.425499231951
Training  ExtraTreesClassifier
[[5749  113]
 [ 612  295]]
0.723039215686 0.325248070562 0.448669201521
Training  DecisionTreeClassifier
[[5324  538]
 [ 466  441]]
0.450459652707 0.486218302095 0.467656415695
Training  CalibratedClassifierCV
[[5734  128]
 [ 802  105]]
0.450643776824 0.115766262404 0.184210526316
Training  SGDClassifier
[[5810   52]
 [ 876   31]]
0.373493975904 0.0341786108049 0.0626262626263
Training  KNeighborsClassifier
[[5504  358]
 [ 339  568]]
0.613390928726 0.626240352811 0.619749045281
Training  MLPClassifier
Iteration 1, loss = 0.37360274
Iteration 2, loss = 0.31832403
Iteration 3, loss = 0.30152261
Iteration 4, loss 
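
A note on the bigram step: gensim documents the input to `Phrases` as an iterable of token lists, whereas `trainFSen` and `trainMid` hold plain strings, so it may be worth checking whether the sentences need to be split into tokens first. The sketch below illustrates the kind of count-based score that phrase detection relies on. It is a simplified, dependency-free illustration and not gensim's implementation; `bigram_scores` and the toy corpus are hypothetical.

```python
from collections import Counter

def bigram_scores(tokenized_sentences, min_count=1):
    """Score adjacent pairs as (count(ab) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (n - min_count) * vocab_size / float(unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# Each sentence is a list of tokens, not a string.
corpus = [
    "br1 projects to superior colliculus".split(),
    "superior colliculus receives input from br2".split(),
    "cells in superior colliculus project to br2".split(),
]
scores = bigram_scores(corpus)
best = max(scores, key=scores.get)
print(best)   # prints ('superior', 'colliculus')
```

Pairs that occur together more often than their individual counts would suggest get a high score; a phrase detector would join them into a single token such as `superior_colliculus` before the word2vec lookup.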

### Tokenization
Until now we were splitting the sentences on whitespace. As a result, brackets stayed attached to the adjacent words instead of being treated as separate tokens. So before computing the vectors, we tokenize each sentence with NLTK's `word_tokenize`. This applies only to word2vec.
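
The difference is easy to see on a short example. The sketch below uses a simple regular expression as a rough stand-in for `word_tokenize` (an assumption made to keep the example dependency-free; NLTK's tokenizer handles many more cases), and `rough_tokenize` is a hypothetical helper.

```python
import re

def rough_tokenize(sentence):
    # Runs of word characters, or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

toy = "br1 (mgc) and br2"
print(toy.split())           # ['br1', '(mgc)', 'and', 'br2']
print(rough_tokenize(toy))   # ['br1', '(', 'mgc', ')', 'and', 'br2']
```

With whitespace splitting, `(mgc)` is looked up as one token and is unlikely to be in the word2vec vocabulary; after tokenization, `mgc` can be looked up on its own.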

In [140]:
import numpy as np
from nltk.tokenize import word_tokenize

def convertW2V(data, w2v):
    wholeM = []
    print('Embedding...')
    for sentence in data:
        # Keep the vector of every token that is in the word2vec vocabulary.
        arr = []
        for word in word_tokenize(sentence):
            if word in w2v:
                arr.append(np.array(w2v[word], copy=True))

        # The sentence vector is the mean of those word vectors
        # (it stays all zeros when no token is in the vocabulary).
        # 100 is assumed to be the dimensionality of the word2vec vectors.
        mean = np.zeros(100)
        for mat in arr:
            mean += mat
        if len(arr) != 0:
            mean = mean / len(arr)
        wholeM.append(mean)
    return wholeM
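
The averaging step can be checked by hand on a toy vocabulary. The sketch below uses a plain dictionary with made-up 3-dimensional vectors in place of the word2vec model; `sentence_vector` is a hypothetical helper.

```python
def sentence_vector(tokens, vectors, dim):
    # Only tokens present in the vocabulary contribute to the mean.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

toy_vectors = {"br1": [1.0, 0.0, 2.0], "projects": [3.0, 2.0, 0.0]}
print(sentence_vector(["br1", "projects", "to"], toy_vectors, dim=3))   # [2.0, 1.0, 1.0]
print(sentence_vector(["unknown"], toy_vectors, dim=3))                 # [0.0, 0.0, 0.0]
```

Out-of-vocabulary tokens are skipped rather than counted as zeros, so they do not pull the mean toward the origin; a sentence with no known token at all maps to the zero vector.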

## Word2Vec

#### a. First, we use the initial word2vec model, which was built from the XML file without any tagging

In [141]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4758 1104]
 [ 425  482]]
0.303909205549 0.531422271224 0.386682711592
Training  RandomForestClassifier
[[5754  108]
 [ 626  281]]
0.72236503856 0.309812568908 0.433641975309
Training  BaggingClassifier
[[5736  126]
 [ 566  341]]
0.730192719486 0.375964718853 0.496360989811
Training  ExtraTreesClassifier
[[5718  144]
 [ 585  322]]
0.690987124464 0.355016538037 0.469045884924
Training  DecisionTreeClassifier
[[5371  491]
 [ 480  427]]
0.4651416122 0.470782800441 0.467945205479
Training  CalibratedClassifierCV
[[5754  108]
 [ 792  115]]
0.515695067265 0.126791620728 0.203539823009
Training  SGDClassifier
[[5772   90]
 [ 829   78]]
0.464285714286 0.0859977949283 0.14511627907
Training  KNeighborsClassifier
[[5512  350]
 [ 327  580]]
0.623655913978 0.6394707828 0.631464344039
Training  MLPClassifier
Iteration 1, loss = 0.38301467
Iteration 2, loss = 0.31924996
Iteration 3, loss = 0.30405911
Iteration 4, loss = 0.29076975
Iteration 5, loss = 

#### b. Second, we use the word2vec model that was built after the BRs were marked

In [142]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4825 1037]
 [ 484  423]]
0.289726027397 0.466372657111 0.357414448669
Training  RandomForestClassifier
[[5761  101]
 [ 598  309]]
0.753658536585 0.340683572216 0.469248291572
Training  BaggingClassifier
[[5712  150]
 [ 563  344]]
0.696356275304 0.379272326351 0.49107780157
Training  ExtraTreesClassifier
[[5741  121]
 [ 586  321]]
0.726244343891 0.353914002205 0.475908080059
Training  DecisionTreeClassifier
[[5372  490]
 [ 470  437]]
0.471413160734 0.481808158765 0.476553980371
Training  CalibratedClassifierCV
[[5750  112]
 [ 796  111]]
0.497757847534 0.122381477398 0.196460176991
Training  SGDClassifier
[[4767 1095]
 [ 479  428]]
0.281024294156 0.471885336273 0.352263374486
Training  KNeighborsClassifier
[[5542  320]
 [ 341  566]]
0.638826185102 0.624035281147 0.631344116007
Training  MLPClassifier
Iteration 1, loss = 0.37786374
Iteration 2, loss = 0.32143615
Iteration 3, loss = 0.30537608
Iteration 4, loss = 0.29240312
Iteration 5, los

#### c. Next, we use the word2vec model in which BR1 and BR2 are marked

In [123]:
modelName = 'csv_2_groupBR12_token'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[5138  724]
 [ 566  341]]
0.320187793427 0.375964718853 0.34584178499
Training  RandomForestClassifier
[[5782   80]
 [ 622  285]]
0.780821917808 0.314222712238 0.448113207547
Training  BaggingClassifier
[[5767   95]
 [ 589  318]]
0.769975786925 0.350606394708 0.481818181818
Training  ExtraTreesClassifier
[[5746  116]
 [ 573  334]]
0.742222222222 0.368246968026 0.492262343405
Training  DecisionTreeClassifier
[[5340  522]
 [ 440  467]]
0.47219413549 0.514884233738 0.492616033755
Training  CalibratedClassifierCV
[[5751  111]
 [ 818   89]]
0.445 0.0981256890849 0.160794941283
Training  SGDClassifier
[[5370  492]
 [ 677  230]]
0.318559556787 0.253583241455 0.282381829343
Training  KNeighborsClassifier
[[5541  321]
 [ 332  575]]
0.641741071429 0.633958103638 0.637825845813
Training  MLPClassifier
Iteration 1, loss = 0.37784889
Iteration 2, loss = 0.31608462
Iteration 3, loss = 0.29659850
Iteration 4, loss = 0.28065110
Iteration 5, loss = 0.265

#### d. Last, we use the word2vec model in which BR1 and BR2 are marked, with bigrams

In [124]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_2_groupBR12_bigramToken'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4469 1393]
 [ 450  457]]
0.247027027027 0.503858875413 0.331519767864
Training  RandomForestClassifier
[[5776   86]
 [ 629  278]]
0.763736263736 0.306504961411 0.437450826121
Training  BaggingClassifier
[[5744  118]
 [ 599  308]]
0.723004694836 0.339581036384 0.462115528882
Training  ExtraTreesClassifier
[[5728  134]
 [ 582  325]]
0.708061002179 0.358324145535 0.475841874085
Training  DecisionTreeClassifier
[[5303  559]
 [ 480  427]]
0.433062880325 0.470782800441 0.451135763339
Training  CalibratedClassifierCV
[[5744  118]
 [ 820   87]]
0.424390243902 0.0959206174201 0.156474820144
Training  SGDClassifier
[[3961 1901]
 [ 269  638]]
0.251280031508 0.70341786108 0.370284387696
Training  KNeighborsClassifier
[[5503  359]
 [ 330  577]]
0.616452991453 0.636163175303 0.626153011394
Training  MLPClassifier
Iteration 1, loss = 0.35300818
Iteration 2, loss = 0.30916939
Iteration 3, loss = 0.29196523
Iteration 4, loss = 0.27825120
Iteration 5, lo

------------------------
# Fin
For the final run we use another CSV file (4). Grouping the BRs had produced duplicate rows; this file contains no duplicates. The sentences were preprocessed as needed.
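
How the duplicates were removed is not shown here. One plausible approach, sketched below under that assumption, is to keep the first occurrence of each (sentence, connection) pair; `drop_duplicate_rows` and the toy rows are hypothetical.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each (sentence, connection) pair, in order."""
    seen = set()
    unique = []
    for sentence, connection in rows:
        key = (sentence, connection)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

rows = [
    ("BR1 projects to BR2 and BR", 1),
    ("BR1 projects to BR2 and BR", 1),   # identical once the BRs are grouped
    ("BR1 lies next to BR2", 0),
]
print(len(drop_duplicate_rows(rows)))   # prints 2
```

With the data in a pandas DataFrame, `data.drop_duplicates(subset=['sentence', 'connection'])` would have the same effect.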

In [143]:
import pandas as pd
from sklearn.model_selection import train_test_split

fileName = "WhiteText_re(Fin).csv"
data = pd.read_csv('data/'+fileName,delimiter="|")

trainSet, testSet = train_test_split(data, test_size=0.3, random_state=3)

trainSet.to_csv('Train(Fin)',sep="|")
testSet.to_csv('Test(Fin)',sep="|")

In [144]:
trainData = pd.read_csv('Train(Fin)',delimiter='|')
trainSen = trainData['sentence']
trainLab = trainData['connection']
trainLen = len(trainSen)
print(trainLen)

testData = pd.read_csv('Test(Fin)',delimiter='|')
testSen = testData['sentence']
testLab = testData['connection']
testLen = len(testSen)
print(testLen)


def formatSen(sentence):
    sentence = re.sub(r"\s(the|The)\s", " ", sentence)
    sentence = re.sub(r"^(the|The)", "", sentence)
    sentence = re.sub(r"(nucleus)", " ", sentence)
    sentence = re.sub(r",", " ", sentence)
    sentence = re.sub(r"\([0-9]\)", ".", sentence)
    return sentence.lower()

11578
4963


In [145]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(formatSen(trainSen[i]))
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(formatSen(testSen[i]))

### 1. Tf-idf Vectorizer

In [146]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[3714  430]
 [ 616  203]]
0.320695102686 0.247863247863 0.279614325069
Training  RandomForestClassifier
[[3998  146]
 [ 563  256]]
0.636815920398 0.312576312576 0.419328419328
Training  BaggingClassifier
[[3888  256]
 [ 418  401]]
0.610350076104 0.489621489621 0.543360433604
Training  ExtraTreesClassifier
[[3975  169]
 [ 529  290]]
0.631808278867 0.35409035409 0.453834115806
Training  DecisionTreeClassifier
[[3765  379]
 [ 382  437]]
0.535539215686 0.533577533578 0.534556574924
Training  CalibratedClassifierCV
[[4017  127]
 [ 484  335]]
0.725108225108 0.409035409035 0.523028883685
Training  SGDClassifier
[[4040  104]
 [ 565  254]]
0.709497206704 0.310134310134 0.4316057774
Training  KNeighborsClassifier
[[3730  414]
 [ 411  408]]
0.496350364964 0.498168498168 0.497257769653
Training  MLPClassifier
Iteration 1, loss = 0.46197860
Iteration 2, loss = 0.29437933
Iteration 3, loss = 0.18331209
Iteration 4, loss = 0.12009028
Iteration 5, loss = 0.08769208
Iteration 6, 

### 2. Word2Vec

#### a. First, we use the initial word2vec model, which was built from the XML file without any tagging

In [149]:
len(trainFSen)

11578

In [150]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[3263  881]
 [ 418  401]]
0.3127925117 0.489621489621 0.381722989053
Training  RandomForestClassifier
[[3915  229]
 [ 700  119]]
0.341954022989 0.145299145299 0.203941730934
Training  BaggingClassifier
[[3914  230]
 [ 681  138]]
0.375 0.168498168498 0.23251895535
Training  ExtraTreesClassifier
[[3956  188]
 [ 732   87]]
0.316363636364 0.106227106227 0.159049360146
Training  DecisionTreeClassifier
[[3889  255]
 [ 706  113]]
0.307065217391 0.137973137973 0.190395956192
Training  CalibratedClassifierCV
[[4099   45]
 [ 764   55]]
0.55 0.0671550671551 0.119695321001
Training  SGDClassifier
[[4105   39]
 [ 796   23]]
0.370967741935 0.028083028083 0.0522133938706
Training  KNeighborsClassifier
[[3603  541]
 [ 568  251]]
0.316919191919 0.306471306471 0.311607697083
Training  MLPClassifier
Iteration 1, loss = 0.43822278
Iteration 2, loss = 0.40154866
Iteration 3, loss = 0.39449492
Iteration 4, loss = 0.38697838
Iteration 5, loss = 0.37901194
Iter

#### b. Second, we use the word2vec model that was built after the BRs were marked

In [151]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[3355  789]
 [ 443  376]]
0.322746781116 0.459096459096 0.379032258065
Training  RandomForestClassifier
[[3922  222]
 [ 697  122]]
0.354651162791 0.148962148962 0.209802235598
Training  BaggingClassifier
[[3917  227]
 [ 694  125]]
0.355113636364 0.152625152625 0.213492741247
Training  ExtraTreesClassifier
[[3954  190]
 [ 719  100]]
0.344827586207 0.1221001221 0.180342651037
Training  DecisionTreeClassifier
[[3899  245]
 [ 699  120]]
0.328767123288 0.14652014652 0.202702702703
Training  CalibratedClassifierCV
[[4098   46]
 [ 754   65]]
0.585585585586 0.0793650793651 0.139784946237
Training  SGDClassifier
[[4122   22]
 [ 807   12]]
0.352941176471 0.014652014652 0.0281359906213
Training  KNeighborsClassifier
[[3670  474]
 [ 577  242]]
0.337988826816 0.295482295482 0.315309446254
Training  MLPClassifier
Iteration 1, loss = 0.46671889
Iteration 2, loss = 0.40723854
Iteration 3, loss = 0.39853633
Iteration 4, loss = 0.39064242
Iteration 5, los

#### c. Third, we use the word2vec model built with bigrams

In [152]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_Fin_bigramToken'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainFSen], bigram_trans[testFSen])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[3579  565]
 [ 535  284]]
0.334511189635 0.346764346764 0.340527577938
Training  RandomForestClassifier
[[3912  232]
 [ 695  124]]
0.348314606742 0.151404151404 0.211063829787
Training  BaggingClassifier
[[3914  230]
 [ 691  128]]
0.357541899441 0.156288156288 0.217502124044
Training  ExtraTreesClassifier
[[3945  199]
 [ 726   93]]
0.318493150685 0.113553113553 0.167416741674
Training  DecisionTreeClassifier
[[3865  279]
 [ 689  130]]
0.317848410758 0.15873015873 0.211726384365
Training  CalibratedClassifierCV
[[4089   55]
 [ 736   83]]
0.601449275362 0.101343101343 0.173458725183
Training  SGDClassifier
[[4094   50]
 [ 758   61]]
0.54954954955 0.0744810744811 0.131182795699
Training  KNeighborsClassifier
[[3559  585]
 [ 560  259]]
0.306872037915 0.316239316239 0.311485267589
Training  MLPClassifier
Iteration 1, loss = 0.44234723
Iteration 2, loss = 0.40346794
Iteration 3, loss = 0.39269022
Iteration 4, loss = 0.38808562
Iteration 5, los

### Middle context of sentences

In [153]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

['than br ; br1 ; br other than br ; br ; br2 ; purkinje',
 'to specification of br1 in mouse br2 .',
 'influenced by sound-evoked br1 br2 activity .',
 ') located in br1 at about 1.2 mm from lateral edge of br ; br ( po ) projects to a zone ( br ) about 0.7 mm from lateral edge of br2 ; finally',
 '( 1 ) br1 ( ppb ) ( m ) region ( br and waist area ) diffusely projects to br ( br ( bst ) ) br ( including br and br ) and to a lesser extent br ; ( 2 ) br ( ppb ) ( 1 ) region [ br ( ppb ) cl ) and br ] densely projects to br2 ( bst']

### 1. Tf-idf Vectorizer

In [154]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[3944  200]
 [ 618  201]]
0.501246882793 0.245421245421 0.329508196721
Training  RandomForestClassifier
[[4076   68]
 [ 468  351]]
0.837708830549 0.428571428571 0.56704361874
Training  BaggingClassifier
[[3972  172]
 [ 380  439]]
0.718494271686 0.53601953602 0.613986013986
Training  ExtraTreesClassifier
[[4046   98]
 [ 425  394]]
0.80081300813 0.481074481074 0.601067887109
Training  DecisionTreeClassifier
[[3832  312]
 [ 339  480]]
0.606060606061 0.586080586081 0.595903165736
Training  CalibratedClassifierCV
[[3996  148]
 [ 315  504]]
0.773006134969 0.615384615385 0.685248130523
Training  SGDClassifier
[[4058   86]
 [ 433  386]]
0.817796610169 0.471306471306 0.59798605732
Training  KNeighborsClassifier
[[3858  286]
 [ 337  482]]
0.627604166667 0.588522588523 0.607435412728
Training  MLPClassifier
Iteration 1, loss = 0.44715200
Iteration 2, loss = 0.22163679
Iteration 3, loss = 0.07758928
Iteration 4, loss = 0.03701396
Iteration 5, loss = 0.02253073
Iteration 6, l

### 2. Word2Vec

#### a. First, we use the initial word2vec model, which was built from the XML file without any tagging

In [155]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[3269  875]
 [ 386  433]]
0.331039755352 0.528693528694 0.407146215327
Training  RandomForestClassifier
[[4082   62]
 [ 668  151]]
0.708920187793 0.184371184371 0.292635658915
Training  BaggingClassifier
[[4069   75]
 [ 629  190]]
0.716981132075 0.23199023199 0.350553505535
Training  ExtraTreesClassifier
[[4047   97]
 [ 642  177]]
0.64598540146 0.216117216117 0.323879231473
Training  DecisionTreeClassifier
[[3626  518]
 [ 528  291]]
0.359703337454 0.355311355311 0.357493857494
Training  CalibratedClassifierCV
[[4063   81]
 [ 711  108]]
0.571428571429 0.131868131868 0.214285714286
Training  SGDClassifier
[[3642  502]
 [ 496  323]]
0.391515151515 0.394383394383 0.392944038929
Training  KNeighborsClassifier
[[3844  300]
 [ 390  429]]
0.588477366255 0.52380952381 0.554263565891
Training  MLPClassifier
Iteration 1, loss = 0.43545675
Iteration 2, loss = 0.37700587
Iteration 3, loss = 0.36014847
Iteration 4, loss = 0.34741174
Iteration 5, loss 

#### b. Second, we use the word2vec model that was built after the BRs were marked

In [156]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[3414  730]
 [ 429  390]]
0.348214285714 0.47619047619 0.402269210933
Training  RandomForestClassifier
[[4066   78]
 [ 642  177]]
0.694117647059 0.216117216117 0.329608938547
Training  BaggingClassifier
[[4035  109]
 [ 606  213]]
0.66149068323 0.260073260073 0.373356704645
Training  ExtraTreesClassifier
[[4022  122]
 [ 633  186]]
0.603896103896 0.227106227106 0.33007985803
Training  DecisionTreeClassifier
[[3649  495]
 [ 517  302]]
0.378920953576 0.368742368742 0.373762376238
Training  CalibratedClassifierCV
[[4070   74]
 [ 721   98]]
0.56976744186 0.119658119658 0.197780020182
Training  SGDClassifier
[[3692  452]
 [ 520  299]]
0.398135818908 0.365079365079 0.380891719745
Training  KNeighborsClassifier
[[3843  301]
 [ 407  412]]
0.577840112202 0.503052503053 0.537859007833
Training  MLPClassifier
Iteration 1, loss = 0.43734837
Iteration 2, loss = 0.37979072
Iteration 3, loss = 0.36600347
Iteration 4, loss = 0.35585196
Iteration 5, loss =

#### c. Third, we use the word2vec model built with bigrams

In [157]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_Fin_bigramToken'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[3590  554]
 [ 506  313]]
0.361014994233 0.382173382173 0.371293001186
Training  RandomForestClassifier
[[4094   50]
 [ 697  122]]
0.709302325581 0.148962148962 0.246215943491
Training  BaggingClassifier
[[4080   64]
 [ 660  159]]
0.713004484305 0.194139194139 0.305182341651
Training  ExtraTreesClassifier
[[4062   82]
 [ 633  186]]
0.694029850746 0.227106227106 0.342226310948
Training  DecisionTreeClassifier
[[3605  539]
 [ 510  309]]
0.364386792453 0.377289377289 0.370725854829
Training  CalibratedClassifierCV
[[4058   86]
 [ 714  105]]
0.549738219895 0.128205128205 0.207920792079
Training  SGDClassifier
[[2349 1795]
 [ 160  659]]
0.268541157294 0.80463980464 0.402688664833
Training  KNeighborsClassifier
[[3853  291]
 [ 362  457]]
0.610962566845 0.557997557998 0.583280153159
Training  MLPClassifier
Iteration 1, loss = 0.42691503
Iteration 2, loss = 0.38011916
Iteration 3, loss = 0.36562674
Iteration 4, loss = 0.35620714
Iteration 5, los