# Building Models for Predicting Labels

## Train and Test Data
Before going forward, the first step is to divide the data into training and test sets in a 70:30 ratio. Here the two sets are read from separate pipe-delimited CSV files and saved again as Train(1) and Test(1).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

fileName = "WhiteText(1).csv"
trainSet = pd.read_csv('data/'+fileName,delimiter="|")

fileEval = "WhiteTextUnseenEval(1).csv"
testSet = pd.read_csv('data/'+fileEval,delimiter="|")


trainSet.to_csv('Train(1)',sep="|")
testSet.to_csv('Test(1)',sep="|")
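
The cell above reads two files that are already separate, so no splitting happens in code. If a single list of rows had to be divided 70:30, a minimal sketch could look like the following. It is written in Python 3 syntax, and `split_rows`, the fixed seed and the toy rows are assumptions for illustration, not part of the original pipeline.

```python
import random

def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and cut it at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = ["sentence %d" % i for i in range(10)]  # toy stand-in for CSV rows
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

The `train_test_split` function imported above offers the same kind of split through its `test_size` argument.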

Now that the data is divided into train and test sets, we can start building models.

-----------

### Sentence preprocessing
Initially, the brain region mentions under consideration were tagged as BR1 and BR2 using the string replace function. Some further preprocessing was also done, which is introduced step by step below.

First, read the train and test data.

In [2]:
trainData = pd.read_csv('Train(1)',delimiter='|')
trainSen = trainData['sentence']
trainEn1 = trainData['entity1']
trainEn2 = trainData['entity2']
trainLab = trainData['connection']
trainLen = len(trainSen)
print trainLen

testData = pd.read_csv('Test(1)',delimiter='|')
testSen = testData['sentence']
testEn1 = testData['entity1']
testEn2 = testData['entity2']
testLab = testData['connection']
testLen = len(testSen)
print testLen

22572
11219


Every occurrence of Entity1 and Entity2 in a sentence is replaced with a tag. The sentence is then lowercased, text inside brackets is removed, and every character other than a letter or digit is replaced with a space.

In [3]:
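import re

def replaceBR(s, en1, en2):
    # Tag every occurrence of the two entity mentions
    s = s.replace(en1," BR1 ")
    s = s.replace(en2," BR2 ")
    return s

def formatSen(x):
    x = re.sub(r"\((.*?)\)"," ",x.lower())   # lowercase, drop bracketed text
    x = re.sub(r"^[ ]*([a-z])",r"\1",x)      # strip spaces before a leading letter
    return re.sub(r"[^a-z0-9]"," ",x)        # keep only letters and digits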
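
To make the effect of these two steps concrete, here is a small worked example in Python 3 syntax. The helper names `replace_br` and `format_sen` mirror the functions above, and the sentence and entity names are made up for illustration.

```python
import re

def replace_br(s, en1, en2):
    # Tag every occurrence of the two entity mentions
    s = s.replace(en1, " BR1 ")
    s = s.replace(en2, " BR2 ")
    return s

def format_sen(x):
    x = re.sub(r"\((.*?)\)", " ", x.lower())  # lowercase, drop bracketed text
    x = re.sub(r"^[ ]*([a-z])", r"\1", x)     # strip spaces before a leading letter
    return re.sub(r"[^a-z0-9]", " ", x)       # keep only letters and digits

sentence = "The Amygdala (AMY) projects to the Hippocampus."  # hypothetical input
tagged = replace_br(sentence, "Amygdala", "Hippocampus")
tokens = format_sen(tagged).split()
print(tokens)
```

Because the sentence is lowercased after tagging, the tags end up as the tokens br1 and br2, and the bracketed abbreviation disappears.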

In [4]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(replaceBR(trainSen[i], trainEn1[i], trainEn2[i]))
    trainFSen[i] = formatSen(trainFSen[i])
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(replaceBR(testSen[i], testEn1[i], testEn2[i]))
    testFSen[i] = formatSen(testFSen[i])

After formatting all the sentences, we need a way to represent them as numbers, because classification models accept only numeric input.

## Word Embedding
Classification models take numbers as input; they do not accept words or letters. So we need to convert the words into corresponding numbers that identify each word. This process is called word embedding.

There are different word embedding techniques.
1. Count Vectorization
2. Tf-idf Vectorization
3. Word2Vec Embedding
4. FastText

Count vectorization counts how often each word occurs in a sentence, and the resulting array of counts is used as the representation. It is not used here because it would produce a pattern-based classifier and perform badly.
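
For reference, the idea of a count vector can be sketched in a few lines of plain Python 3. The two sentences and the helper `count_vector` are toy assumptions; this is an illustration of the concept, not the scikit-learn implementation.

```python
from collections import Counter

sentences = ["br1 projects to br2", "br1 and br2 and br1"]  # toy examples
# One column per distinct word, in alphabetical order
vocabulary = sorted({word for s in sentences for word in s.split()})

def count_vector(sentence, vocabulary):
    """Return how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(s, vocabulary) for s in sentences]
print(vocabulary)
print(vectors)
```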

#### Tf-idf Vec
Tf-idf vectorization takes both term frequency and inverse document frequency into account when assigning a value to a word. Words that appear in many sentences are given a lower value than rare words. We will be implementing this.
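
The weighting can be illustrated with a small hand-rolled sketch in Python 3. It uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each sentence vector. This is one common variant and is believed to be close to the TfidfVectorizer defaults, but that correspondence is an assumption here; the sentences and the function name `tfidf` are made up.

```python
import math
from collections import Counter

def tfidf(sentences):
    docs = [s.split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {w: tf[w] * (math.log((1.0 + n) / (1.0 + df[w])) + 1.0) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({w: v / norm for w, v in raw.items()})
    return weighted

sentences = ["br1 projects to br2", "br1 and br2 were stained", "br1 inhibits br2"]
weights = tfidf(sentences)
print(weights[0])
```

In the first sentence, "projects" occurs in only one of the three sentences and so receives a higher weight than "br1", which occurs in all of them.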


#### Word2Vec
Word2Vec embedding is a model that is trained on a set of sentences. The hidden-layer weights after training are used as the representation of a word. The model has already been trained on the train and test data together.

To represent a sentence with word2vec, we take the mean of the vectors of all the words present in the sentence.
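
The averaging step can be sketched without any model at all. In the Python 3 example below, `toy_vectors` is a made-up three-dimensional lookup table standing in for a trained word2vec vocabulary, and `mean_vector` is a hypothetical helper; words missing from the table are skipped, and a sentence with no known words maps to a zero vector.

```python
def mean_vector(sentence, vectors, size):
    """Average the vectors of the known words in a sentence."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        return [0.0] * size  # no known word: fall back to a zero vector
    # zip(*found) walks the vectors dimension by dimension
    return [sum(column) / len(found) for column in zip(*found)]

toy_vectors = {
    "br1": [1.0, 0.0, 2.0],
    "projects": [3.0, 2.0, 0.0],
    "br2": [2.0, 4.0, 1.0],
}
sentence_vector = mean_vector("br1 projects to br2", toy_vectors, 3)
print(sentence_vector)
```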

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

def convertTfidf(train, test):
    vectorizer = TfidfVectorizer(ngram_range=(1,3))               #Calling tfidf Vectorizer
    train_vect = vectorizer.fit_transform(train) #Fitting the training data for getting tfidf values
    test_vect = vectorizer.transform(test)       #Transforming test sentences to their respective tfidf vector
    return train_vect, test_vect

### Classifier Models
The classifier algorithms that are going to be used are:
1. Bernoulli Naive Bayes
2. Bagging Classifier
3. Decision Tree Classifier
4. Random Forest Classifier
5. Extra Trees Classifier
6. Calibrated Classifier
7. SGD Classifier
8. K-Neighbours Classifier
9. MLP Classifier

Each has its own strengths.

In [7]:
from sklearn.naive_bayes import *
from sklearn.dummy import *
from sklearn.ensemble import *
from sklearn.neighbors import *
from sklearn.tree import *
from sklearn.calibration import *
from sklearn.linear_model import *
from sklearn.multiclass import *
from sklearn.svm import *
from sklearn.neural_network import *
import pickle
from sklearn.metrics import confusion_matrix
# from prettytable import PrettyTable


classifiers = [BernoulliNB(), 
               RandomForestClassifier(n_estimators=100, n_jobs=-1),
               BaggingClassifier(n_estimators=100, n_jobs=-1), 
               ExtraTreesClassifier(n_jobs=-1),
               DecisionTreeClassifier(criterion='gini',splitter='random'), 
               CalibratedClassifierCV(),
               SGDClassifier(n_jobs=-1), 
               #KNeighborsClassifier(n_neighbors=1,weights='distance',n_jobs=2,algorithm='kd_tree'),
               MLPClassifier(hidden_layer_sizes=(100,100,),verbose=True)]

def classify(train_vect,trainLab,test_vect,testLab):
#     tableRep = PrettyTable(['Name','Precision','Recall','F1 Score','Accuracy'])
    tableSent = []
    
    for classifier in classifiers:

        print "Training ",classifier.__class__.__name__
        classifier.fit(train_vect, trainLab)

        score = classifier.predict(test_vect)
        
        mat = confusion_matrix(testLab, score)

        print mat
        tp = mat[1][1]
        fp = mat[0][1]
        fn = mat[1][0]
        tn = mat[0][0]
        if tp == 0 :
            recall = 0.0
            precision = 0.0
            f1 = 0.0
        else :
            recall = float(tp)/float((tp+fn))
            precision = float(tp)/float((tp+fp))
            f1 = 2 * ((precision*recall)/(precision+recall))
        accuracy = float(tp+tn)/float(len(testLab))
        print precision,recall,f1
#         tableRep.add_row([classifier.__class__.__name__,precision,recall,f1,accuracy])
        tableSent.append([classifier.__class__.__name__,precision,recall,f1,accuracy])
    for i in tableSent:
        print i
    return 'OK'
#     return tableSent
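
As a check on the metric formulas used in `classify`, the Python 3 sketch below recomputes them from a single confusion matrix. The matrix is the one reported for BernoulliNB in the whole-sentence Tf-idf run of this notebook; `binary_metrics` is a hypothetical helper name. Rows are true labels and columns are predicted labels, so the layout is [[tn, fp], [fn, tp]].

```python
def binary_metrics(mat):
    """Precision, recall, F1 and accuracy from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = mat
    if tp == 0:
        precision = recall = f1 = 0.0  # avoids division by zero
    else:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

precision, recall, f1, accuracy = binary_metrics([[7541, 1671], [1129, 878]])
print(precision, recall, f1, accuracy)
```

The three values printed by the notebook for that classifier (0.344448803452, 0.437468858994, 0.385425812116) are precision, recall and F1 in that order.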

### 1) Tf-idf Vectorizer

In [None]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

TfidfVectorizer has an argument called ngram_range, which controls how many consecutive words are grouped into a single feature. Here we use ngram_range=(1,3), as this gave a better classifier report than the other n-gram settings. From here on, the Tf-idf vectorizer uses these n-grams.
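
To show what a range of (1,3) means in practice, the Python 3 sketch below lists the word n-grams of one short sentence. `word_ngrams` is a hypothetical helper written for illustration; it only mimics the grouping of words and does not compute any weights.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all runs of min_n to max_n consecutive tokens."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("br1 projects to br2".split())
print(features)
```

A four-word sentence therefore yields four unigrams, three bigrams and two trigrams, so the feature space grows quickly with the upper bound of the range.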

In [8]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[7541 1671]
 [1129  878]]
0.344448803452 0.437468858994 0.385425812116
Training  RandomForestClassifier
[[9126   86]
 [1827  180]]
0.676691729323 0.0896860986547 0.158380994281
Training  BaggingClassifier
[[8772  440]
 [1534  473]]
0.518072289157 0.23567513702 0.32397260274
Training  ExtraTreesClassifier
[[9048  164]
 [1817  190]]
0.536723163842 0.0946686596911 0.160948750529
Training  DecisionTreeClassifier
[[7860 1352]
 [1300  707]]
0.343370568237 0.352267065272 0.347761928185
Training  CalibratedClassifierCV
[[9115   97]
 [1767  240]]
0.712166172107 0.119581464873 0.204778156997
Training  SGDClassifier
[[9192   20]
 [1921   86]]
0.811320754717 0.0428500249128 0.0814008518694
Training  MLPClassifier




Iteration 1, loss = 0.36876203
Iteration 2, loss = 0.23150532
Iteration 3, loss = 0.16845859
Iteration 4, loss = 0.13692128
Iteration 5, loss = 0.11581572
Iteration 6, loss = 0.10286865
Iteration 7, loss = 0.09371703
Iteration 8, loss = 0.08640060
Iteration 9, loss = 0.08104282
Iteration 10, loss = 0.07766270
Iteration 11, loss = 0.07349220
Iteration 12, loss = 0.07100190
Iteration 13, loss = 0.06882192
Iteration 14, loss = 0.06695088
Iteration 15, loss = 0.06325177
Iteration 16, loss = 0.06256271
Iteration 17, loss = 0.06135273
Iteration 18, loss = 0.05938897
Iteration 19, loss = 0.05805327
Iteration 20, loss = 0.05720237
Iteration 21, loss = 0.05595367
Iteration 22, loss = 0.05525690
Iteration 23, loss = 0.05525735
Iteration 24, loss = 0.05488095
Iteration 25, loss = 0.05446911
Iteration 26, loss = 0.05301586
Iteration 27, loss = 0.05258843
Iteration 28, loss = 0.05157864
Iteration 29, loss = 0.05084661
Iteration 30, loss = 0.05041137
Iteration 31, loss = 0.04959424
Iteration 32, los

### 2) Word2Vec


To use the word2vec embedding, we first need to load the trained model.

In [21]:
import numpy as np   # needed by convertW2V below
from gensim.models import word2vec

def loadW2V(modelName):
    model = word2vec.Word2Vec.load(modelName)
    return dict(zip(model.wv.index2word, model.wv.vectors))

def convertW2V(data,w2v):
    wholeM = []
    count = 0
    le = 0
    print 'Embedding...',len(data)
    for sentence in data:
        le += 1
        arr = []    
        for word in sentence.split():
            if word in w2v:
                arr.append(np.array(w2v[word],copy=True))  
                                #Each word is checked if it is there in the word2vec vocabulary. If there then
                                #the vector space for the word is taken and then the mean is calculated.

        mean = np.zeros(100)
        for mat in arr:
            for j in range(len(mat)):
                mean[j] += mat[j]
        if len(arr) != 0:
            mean = np.array(mean/len(arr))
        else:
            count +=1
        wholeM.append(mean)
    print count,le
    return wholeM

def buildW2V(modelName, train, test):
    w2v = loadW2V("w2v_models/"+modelName)
    train_vect = convertW2V(train,w2v)
    test_vect = convertW2V(test,w2v)
    return train_vect, test_vect

#### a. For the first task we will be taking the initial model that was built from the xml file without any tagging.

In [22]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
0 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7748 1464]
 [1166  841]]
0.364859002169 0.419033383159 0.390074211503
Training  RandomForestClassifier
[[9144   68]
 [1903  104]]
0.604651162791 0.0518186347783 0.0954566314823
Training  BaggingClassifier
[[9108  104]
 [1874  133]]
0.561181434599 0.0662680617838 0.118538324421
Training  ExtraTreesClassifier
[[9027  185]
 [1883  124]]
0.401294498382 0.061783756851 0.107081174439
Training  DecisionTreeClassifier
[[7713 1499]
 [1515  492]]
0.247112004018 0.24514200299 0.246123061531
Training  CalibratedClassifierCV
[[9169   43]
 [1924   83]]
0.65873015873 0.0413552566019 0.0778246601031
Training  SGDClassifier
[[8987  225]
 [1812  195]]
0.464285714286 0.0971599402093 0.160692212608
Training  KNeighborsClassifier
[[7991 1221]
 [1566  441]]
0.265342960289 0.219730941704 0.240392477514
Training  MLPClassifier
Iteration 1, loss = 0.37752531
Iteration 2, loss = 0.33732279
Iteration 3, loss = 0.32549862
Iteration 4, l

#### b. Second, the word2vec model that was built after the BRs were marked

In [23]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
5 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7786 1426]
 [1277  730]]
0.338589981447 0.363726955655 0.350708623589
Training  RandomForestClassifier
[[9142   70]
 [1879  128]]
0.646464646465 0.0637767812656 0.116099773243
Training  BaggingClassifier
[[9118   94]
 [1849  158]]
0.626984126984 0.0787244643747 0.139884904825
Training  ExtraTreesClassifier
[[9045  167]
 [1869  138]]
0.452459016393 0.0687593423019 0.11937716263
Training  DecisionTreeClassifier
[[7764 1448]
 [1446  561]]
0.279243404679 0.279521674141 0.27938247012
Training  CalibratedClassifierCV
[[9180   32]
 [1938   69]]
0.683168316832 0.034379671151 0.0654648956357
Training  SGDClassifier
[[9011  201]
 [1820  187]]
0.481958762887 0.0931738913802 0.156158663883
Training  KNeighborsClassifier
[[7922 1290]
 [1525  482]]
0.272009029345 0.240159441953 0.255093940196
Training  MLPClassifier
Iteration 1, loss = 0.38729613
Iteration 2, loss = 0.34390020
Iteration 3, loss = 0.33299805
Iteration 4, lo

#### c. Next, the word2vec model where BR1 and BR2 are marked

In [24]:
modelName = 'csvReplaceBR12'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
0 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7461 1751]
 [1192  815]]
0.317614964926 0.406078724464 0.356439973759
Training  RandomForestClassifier
[[9170   42]
 [1931   76]]
0.64406779661 0.0378674638764 0.0715294117647
Training  BaggingClassifier
[[9124   88]
 [1879  128]]
0.592592592593 0.0637767812656 0.115159694107
Training  ExtraTreesClassifier
[[9117   95]
 [1936   71]]
0.427710843373 0.0353761833582 0.0653474459273
Training  DecisionTreeClassifier
[[7992 1220]
 [1588  419]]
0.255643685174 0.208769307424 0.229840921558
Training  CalibratedClassifierCV
[[9189   23]
 [1946   61]]
0.72619047619 0.0303936223219 0.0583452893352
Training  SGDClassifier
[[8921  291]
 [1756  251]]
0.463099630996 0.125062282013 0.196939976461
Training  KNeighborsClassifier
[[8048 1164]
 [1528  479]]
0.291539866099 0.238664673642 0.262465753425
Training  MLPClassifier
Iteration 1, loss = 0.38545695
Iteration 2, loss = 0.34045677
Iteration 3, loss = 0.32568083
Iteration 4, 

#### d. Next, the word2vec model that was trained with phrases

In [25]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_3_ReplaceBR_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainFSen], bigram_trans[testFSen])
print(classify(train_vect,trainLab,test_vect,testLab))



Embedding... 22572
5 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7271 1941]
 [1154  853]]
0.30529706514 0.425012456403 0.355342636951
Training  RandomForestClassifier
[[9146   66]
 [1904  103]]
0.609467455621 0.0513203786746 0.0946691176471
Training  BaggingClassifier
[[9099  113]
 [1872  135]]
0.54435483871 0.067264573991 0.119733924612
Training  ExtraTreesClassifier
[[9047  165]
 [1880  127]]
0.434931506849 0.0632785251619 0.110482818617
Training  DecisionTreeClassifier
[[7919 1293]
 [1526  481]]
0.271138669673 0.23966118585 0.254430044962
Training  CalibratedClassifierCV
[[9176   36]
 [1933   74]]
0.672727272727 0.0368709516692 0.0699102503543
Training  SGDClassifier
[[9177   35]
 [1967   40]]
0.533333333333 0.0199302441455 0.0384245917387
Training  KNeighborsClassifier
[[7821 1391]
 [1441  566]]
0.289218191109 0.282012954659 0.285570131181
Training  MLPClassifier
Iteration 1, loss = 0.38366683
Iteration 2, loss = 0.34687879
Iteration 3, loss = 0.33910307
Iteration 4, l

--------
## Middle Sentences
Until now we were taking the whole sentence into consideration. On closer inspection, however, most of the connection-related words appear near BR1 and BR2: mostly in between them, and just before and just after. So this time we consider only the words between the two entities, plus a few words before and after them, say a window of size 3.

In [12]:
from nltk.tokenize import word_tokenize
def breakSen(sentences):
    middleSen = []
    for sentence in sentences:
        s = sentence.split()

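        # ind1: first token containing 'br', ind2: last such token
        # (note that 'br' also matches ordinary words such as 'brain')
        ind1 = -1
        ind2 = -1
        for j in range(0,len(s)):
            if 'br' in s[j]:
                if ind1 == -1:
                    ind1 = j
                else:
                    ind2 = j
        # covers the case where only one matching token was found
        for j in range(0,len(s)):
            if 'br' in s[j]:
                if j > ind2:
                    ind2 = j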

        if ind1-3 < 0:
            ind1 = 0
        else: ind1 -= 3

        if ind2+3 > len(s):
            ind2 = len(s)
        else: ind2 += 3

        middleSen.append(' '.join(s[ind1:ind2]))
    return middleSen
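
The windowing idea can also be written more compactly. The Python 3 sketch below is a variation on the cell above rather than a copy of it: it matches the tokens br1 and br2 exactly, keeps the same number of tokens on both sides, and returns the whole sentence when no tag is found. The function name `middle_window`, the fallback and the example sentence are assumptions for illustration.

```python
def middle_window(tokens, tags=("br1", "br2"), window=3):
    """Keep the span from the first to the last tag, plus `window` tokens each side."""
    positions = [i for i, tok in enumerate(tokens) if tok in tags]
    if not positions:
        return tokens  # assumed fallback: no tag found, keep everything
    start = max(positions[0] - window, 0)
    end = min(positions[-1] + window + 1, len(tokens))
    return tokens[start:end]

tokens = ("in the rat the br1 sends dense projections to the br2 "
          "and also to other nuclei in the brainstem").split()
middle = middle_window(tokens)
print(" ".join(middle))
```

Exact matching avoids treating a word such as "brainstem" as an entity tag, which a substring test on 'br' would do.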

In [13]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

['for this', 'for this', 'for this', 'for this', 'for this']

### 1. Tf-idf Vectorizer

In [14]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[8284  928]
 [1428  579]]
0.384207033842 0.288490284006 0.32953898691
Training  RandomForestClassifier
[[9121   91]
 [1796  211]]
0.698675496689 0.105132037867 0.182763100909
Training  BaggingClassifier
[[8695  517]
 [1353  654]]
0.558497011102 0.325859491779 0.411579609817
Training  ExtraTreesClassifier
[[9090  122]
 [1833  174]]
0.587837837838 0.0866965620329 0.151107251411
Training  DecisionTreeClassifier
[[8120 1092]
 [1265  742]]
0.404580152672 0.369706028899 0.386357719344
Training  CalibratedClassifierCV
[[9037  175]
 [1761  246]]
0.58432304038 0.122571001495 0.202635914333
Training  SGDClassifier
[[9129   83]
 [1853  154]]
0.649789029536 0.0767314399601 0.137254901961
Training  MLPClassifier
Iteration 1, loss = 0.39351100
Iteration 2, loss = 0.20047421
Iteration 3, loss = 0.13153227
Iteration 4, loss = 0.10019751
Iteration 5, loss = 0.08356953
Iteration 6, loss = 0.07469413
Iteration 7, loss = 0.06617498
Iteration 8, loss = 0.06290669
Iteration 9, loss = 

### 2. Word2Vec

#### a. For the first task we will be taking the initial model that was built from the xml file without any tagging.

In [28]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
11 22572
Embedding... 11219
1 11219
Training  BernoulliNB
[[7594 1618]
 [1133  874]]
0.350722311396 0.435475834579 0.388530784619
Training  RandomForestClassifier
[[9161   51]
 [1893  114]]
0.690909090909 0.0568011958146 0.104972375691
Training  BaggingClassifier
[[9119   93]
 [1864  143]]
0.60593220339 0.0712506228201 0.127507802051
Training  ExtraTreesClassifier
[[9077  135]
 [1857  150]]
0.526315789474 0.0747384155456 0.130890052356
Training  DecisionTreeClassifier
[[7902 1310]
 [1527  480]]
0.268156424581 0.239162929746 0.252831182513
Training  CalibratedClassifierCV
[[9154   58]
 [1958   49]]
0.457943925234 0.0244145490782 0.046357615894
Training  SGDClassifier
[[9144   68]
 [1936   71]]
0.510791366906 0.0353761833582 0.0661696178938
Training  KNeighborsClassifier
[[7817 1395]
 [1354  653]]
0.31884765625 0.325361235675 0.322071516646
Training  MLPClassifier
Iteration 1, loss = 0.35755027
Iteration 2, loss = 0.31784005
Iteration 3, loss = 0.30402039
Iteration 4, 

#### b. Second, the word2vec model that was built after the BRs were marked

In [29]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
29 22572
Embedding... 11219
7 11219
Training  BernoulliNB
[[7784 1428]
 [1224  783]]
0.354138398915 0.390134529148 0.371266002845
Training  RandomForestClassifier
[[9138   74]
 [1867  140]]
0.654205607477 0.0697558545092 0.126069338136
Training  BaggingClassifier
[[9054  158]
 [1794  213]]
0.574123989218 0.106128550075 0.179142136249
Training  ExtraTreesClassifier
[[9033  179]
 [1838  169]]
0.485632183908 0.0842052815147 0.143524416136
Training  DecisionTreeClassifier
[[7872 1340]
 [1488  519]]
0.279182356105 0.258594917788 0.268494568029
Training  CalibratedClassifierCV
[[9171   41]
 [1974   33]]
0.445945945946 0.01644245142 0.031715521384
Training  SGDClassifier
[[9057  155]
 [1889  118]]
0.432234432234 0.0587942202292 0.10350877193
Training  KNeighborsClassifier
[[7722 1490]
 [1361  646]]
0.302434456929 0.32187344295 0.311851315472
Training  MLPClassifier
Iteration 1, loss = 0.36606436
Iteration 2, loss = 0.32712804
Iteration 3, loss = 0.31491537
Iteration 4, loss

#### c. Next, the word2vec model where BR1 and BR2 are marked

In [30]:
modelName = 'csvReplaceBR12'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
1 22572
Embedding... 11219
4 11219
Training  BernoulliNB
[[7102 2110]
 [ 991 1016]]
0.325015994882 0.506228201295 0.395869861679
Training  RandomForestClassifier
[[9180   32]
 [1915   92]]
0.741935483871 0.0458395615346 0.0863444392304
Training  BaggingClassifier
[[9145   67]
 [1880  127]]
0.654639175258 0.0632785251619 0.115402089959
Training  ExtraTreesClassifier
[[9100  112]
 [1852  155]]
0.580524344569 0.0772296960638 0.136323658751
Training  DecisionTreeClassifier
[[8026 1186]
 [1490  517]]
0.303581914269 0.25759840558 0.278706199461
Training  CalibratedClassifierCV
[[9169   43]
 [1972   35]]
0.448717948718 0.0174389636273 0.0335731414868
Training  SGDClassifier
[[8691  521]
 [1636  371]]
0.415919282511 0.184853014449 0.255950327699
Training  KNeighborsClassifier
[[7878 1334]
 [1364  643]]
0.325240263025 0.320378674639 0.322791164659
Training  MLPClassifier
Iteration 1, loss = 0.35462162
Iteration 2, loss = 0.31490376
Iteration 3, loss = 0.29902402
Iteration 4, 

#### d. Next, the word2vec model that was trained with phrases

In [31]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_3_ReplaceBR_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
29 22572
Embedding... 11219
7 11219
Training  BernoulliNB
[[7260 1952]
 [1025  982]]
0.33469665985 0.489287493772 0.397490386561
Training  RandomForestClassifier
[[9127   85]
 [1859  148]]
0.635193133047 0.0737419033383 0.132142857143
Training  BaggingClassifier
[[9047  165]
 [1796  211]]
0.561170212766 0.105132037867 0.177087704574
Training  ExtraTreesClassifier
[[9022  190]
 [1795  212]]
0.52736318408 0.105630293971 0.17600664176
Training  DecisionTreeClassifier
[[7799 1413]
 [1465  542]]
0.277237851662 0.270054808171 0.273599192327
Training  CalibratedClassifierCV
[[9176   36]
 [1982   25]]
0.409836065574 0.0124564025909 0.0241779497099
Training  SGDClassifier
[[9078  134]
 [1943   64]]
0.323232323232 0.0318883906328 0.0580498866213
Training  KNeighborsClassifier
[[7900 1312]
 [1353  654]]
0.332655137335 0.325859491779 0.329222250189
Training  MLPClassifier
Iteration 1, loss = 0.36678657
Iteration 2, loss = 0.32997380
Iteration 3, loss = 0.31787366
Iteration 4, lo

_________
## BR1 BR2 tags re-tagging
Many sentences contained several replacements of BR1 or BR2, although the additional tags did not signify any relation. So BR1 and BR2 were instead tagged using CharOffset. The (2) csv files are used here.
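
The difference between replacing by string and replacing by character offset can be sketched as follows (Python 3). How the offsets are stored in the data is not shown in this notebook, so the (start, end) span format, the helper `tag_by_offsets` and the example sentence are all assumptions for illustration.

```python
def tag_by_offsets(sentence, span1, span2):
    """Replace two (start, end) character spans with BR1 and BR2 tags."""
    # Apply the right-most edit first so earlier offsets stay valid
    edits = sorted([(span1, " BR1 "), (span2, " BR2 ")], reverse=True)
    for (start, end), tag in edits:
        sentence = sentence[:start] + tag + sentence[end:]
    return sentence

sentence = ("The amygdala and thalamus were injected; "
            "only the amygdala projects to the hippocampus.")
# Hypothetical offsets: the second mention of "amygdala" and the hippocampus
second = sentence.index("amygdala", sentence.index("amygdala") + 1)
span1 = (second, second + len("amygdala"))
start2 = sentence.index("hippocampus")
span2 = (start2, start2 + len("hippocampus"))

tagged = tag_by_offsets(sentence, span1, span2)
naive = sentence.replace("amygdala", " BR1 ")
print(tagged)
```

With offsets, only the mention that takes part in the candidate relation is tagged, whereas the plain string replace tags both mentions of the same region.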

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split

fileName = "WhiteText(2).csv"
trainSet = pd.read_csv('data/'+fileName,delimiter="|")

fileEval = "WhiteTextUnseenEval(2).csv"
testSet = pd.read_csv('data/'+fileEval,delimiter="|")


trainSet.to_csv('Train(2)',sep="|")
testSet.to_csv('Test(2)',sep="|")

In [16]:
trainData = pd.read_csv('Train(2)',delimiter='|')
trainSen = trainData['sentence']
trainLab = trainData['connection']
trainLen = len(trainSen)
print trainLen

testData = pd.read_csv('Test(2)',delimiter='|')
testSen = testData['sentence']
testLab = testData['connection']
testLen = len(testSen)
print testLen


def formatSen(x):
    x = re.sub("\((.*?)\)"," ",x.lower())
    x = re.sub("^[ ]*([a-z])",r"\1",x)
    return re.sub("[^a-z0-9]"," ",x)

22572
11219


In [17]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(formatSen(trainSen[i]))
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(formatSen(testSen[i]))

### 1. Tf-idf Vectorizer

In [18]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[8628  584]
 [1654  353]]
0.376734258271 0.175884404584 0.239809782609
Training  RandomForestClassifier
[[9104  108]
 [1710  297]]
0.733333333333 0.14798206278 0.246268656716
Training  BaggingClassifier
[[8468  744]
 [1262  745]]
0.500335795836 0.37120079721 0.426201372998
Training  ExtraTreesClassifier
[[9057  155]
 [1760  247]]
0.614427860697 0.123069257598 0.205064342051
Training  DecisionTreeClassifier
[[8068 1144]
 [1289  718]]
0.385606874329 0.357747882412 0.371155337296
Training  CalibratedClassifierCV
[[8956  256]
 [1396  611]]
0.704728950404 0.304434479322 0.425191370912
Training  SGDClassifier
[[9103  109]
 [1648  359]]
0.767094017094 0.178873941206 0.290101010101
Training  MLPClassifier
Iteration 1, loss = 0.39001203
Iteration 2, loss = 0.20489178
Iteration 3, loss = 0.14073977
Iteration 4, loss = 0.11332832
Iteration 5, loss = 0.09523400
Iteration 6, loss = 0.08469402
Iteration 7, loss = 0.07419547
Iteration 8, loss = 0.06793087
Iteration 9, loss = 0.

### 2. Word2Vec

#### a. For the first task we will be taking the initial model that was built from the xml file without any tagging.

In [35]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
15 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[6759 2453]
 [1064  943]]
0.277679623086 0.46985550573 0.349065334074
Training  RandomForestClassifier
[[9017  195]
 [1865  142]]
0.421364985163 0.0707523667165 0.121160409556
Training  BaggingClassifier
[[9049  163]
 [1838  169]]
0.509036144578 0.0842052815147 0.14450619923
Training  ExtraTreesClassifier
[[8648  564]
 [1749  258]]
0.313868613139 0.128550074738 0.182396606575
Training  DecisionTreeClassifier
[[7670 1542]
 [1509  498]]
0.244117647059 0.248131539611 0.246108228317
Training  CalibratedClassifierCV
[[9150   62]
 [1982   25]]
0.287356321839 0.0124564025909 0.0238777459408
Training  SGDClassifier
[[9126   86]
 [1968   39]]
0.312 0.0194319880419 0.0365853658537
Training  KNeighborsClassifier
[[7710 1502]
 [1385  622]]
0.292843691149 0.309915296462 0.301137739046
Training  MLPClassifier
Iteration 1, loss = 0.38515211
Iteration 2, loss = 0.34243146
Iteration 3, loss = 0.32866795
Iteration 4, loss = 0.

#### b. Second, the word2vec model that was built after the BRs were marked

In [36]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
15 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[6405 2807]
 [1062  945]]
0.251865671642 0.470852017937 0.328181976038
Training  RandomForestClassifier
[[9050  162]
 [1869  138]]
0.46 0.0687593423019 0.119635890767
Training  BaggingClassifier
[[9033  179]
 [1847  160]]
0.47197640118 0.079720976582 0.136402387042
Training  ExtraTreesClassifier
[[8892  320]
 [1845  162]]
0.336099585062 0.0807174887892 0.130172760145
Training  DecisionTreeClassifier
[[7536 1676]
 [1468  539]]
0.243340857788 0.26856003986 0.255329227854
Training  CalibratedClassifierCV
[[9204    8]
 [1989   18]]
0.692307692308 0.00896860986547 0.0177078209543
Training  SGDClassifier
[[8898  314]
 [1839  168]]
0.348547717842 0.0837070254111 0.134993973483
Training  KNeighborsClassifier
[[7531 1681]
 [1360  647]]
0.277920962199 0.322371699053 0.298500576701
Training  MLPClassifier
Iteration 1, loss = 0.37988354
Iteration 2, loss = 0.34390767
Iteration 3, loss = 0.33076545
Iteration 4, loss = 0.3

#### c. Next, the word2vec model where BR1 and BR2 are marked

In [37]:
modelName = 'csv_2_ReplaceBR12'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
15 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7213 1999]
 [1229  778]]
0.280158444364 0.38764324863 0.32525083612
Training  RandomForestClassifier
[[9143   69]
 [1925   82]]
0.543046357616 0.0408570004983 0.0759962928638
Training  BaggingClassifier
[[9086  126]
 [1894  113]]
0.47280334728 0.056302939711 0.100623330365
Training  ExtraTreesClassifier
[[9007  205]
 [1872  135]]
0.397058823529 0.067264573991 0.115040477205
Training  DecisionTreeClassifier
[[7556 1656]
 [1556  451]]
0.214048410062 0.22471350274 0.219251336898
Training  CalibratedClassifierCV
[[9200   12]
 [1980   27]]
0.692307692308 0.0134529147982 0.0263929618768
Training  SGDClassifier
[[6627 2585]
 [1041  966]]
0.272036046184 0.481315396114 0.347607052897
Training  KNeighborsClassifier
[[7448 1764]
 [1450  557]]
0.239982766049 0.277528649726 0.257393715342
Training  MLPClassifier
Iteration 1, loss = 0.37111459
Iteration 2, loss = 0.33596986
Iteration 3, loss = 0.32389924
Iteration 4, loss

#### d. Next, the word2vec model where BR1 and BR2 are marked, with bigrams

In [38]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_2_ReplaceBR12_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainFSen], bigram_trans[testFSen])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
15 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7774 1438]
 [1317  690]]
0.324248120301 0.34379671151 0.333736396614
Training  RandomForestClassifier
[[9160   52]
 [1924   83]]
0.614814814815 0.0413552566019 0.077497665733
Training  BaggingClassifier
[[9072  140]
 [1871  136]]
0.492753623188 0.0677628300947 0.119141480508
Training  ExtraTreesClassifier
[[8952  260]
 [1871  136]]
0.343434343434 0.0677628300947 0.113191843529
Training  DecisionTreeClassifier
[[7873 1339]
 [1496  511]]
0.276216216216 0.254608868959 0.26497277677
Training  CalibratedClassifierCV
[[9201   11]
 [1981   26]]
0.702702702703 0.0129546586946 0.0254403131115
Training  SGDClassifier
[[9099  113]
 [1905  102]]
0.474418604651 0.050822122571 0.0918091809181
Training  KNeighborsClassifier
[[7676 1536]
 [1454  553]]
0.264719961704 0.275535625311 0.27001953125
Training  MLPClassifier
Iteration 1, loss = 0.37543477
Iteration 2, loss = 0.34189493
Iteration 3, loss = 0.32883007
Iteration 4, l

### Middle Sentences taken

In [19]:
from nltk.tokenize import word_tokenize
def breakSen(sentences):
    middleSen = []
    for sentence in sentences:
        s = word_tokenize(sentence)

        # ind1: first token containing 'br1'; later 'br1' tokens move ind2
        ind1 = -1
        ind2 = -1
        for j in range(0,len(s)):
            if 'br1' in s[j]:
                if ind1 == -1:
                    ind1 = j
                else:
                    ind2 = j
        # ind2: extended to the last token containing 'br2'
        for j in range(0,len(s)):
            if 'br2' in s[j]:
                if j > ind2:
                    ind2 = j

        if ind1-3 < 0:
            ind1 = 0
        else: ind1 -= 3

        if ind2+3 > len(s):
            ind2 = len(s)
        else: ind2 += 3

        middleSen.append(' '.join(s[ind1:ind2]))
    return middleSen

In [20]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

['for this', 'for this', 'for this', 'for this', 'for this']

In [21]:
len(trainFSen)

22572

### 1. Tf-idf Vectorizer

In [22]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[8739  473]
 [1719  288]]
0.378449408673 0.143497757848 0.208092485549
Training  RandomForestClassifier
[[9062  150]
 [1576  431]]
0.74182444062 0.214748380668 0.333075734158
Training  BaggingClassifier
[[8760  452]
 [1269  738]]
0.620168067227 0.367713004484 0.461682827651
Training  ExtraTreesClassifier
[[9029  183]
 [1600  407]]
0.689830508475 0.20279023418 0.31343858298
Training  DecisionTreeClassifier
[[8280  932]
 [1218  789]]
0.458454386984 0.39312406577 0.423283261803
Training  CalibratedClassifierCV
[[8850  362]
 [1363  644]]
0.640159045726 0.320876930742 0.427480916031
Training  SGDClassifier
[[8961  251]
 [1479  528]]
0.677792041078 0.26307922272 0.37903804738
Training  MLPClassifier
Iteration 1, loss = 0.39875737
Iteration 2, loss = 0.15347717
Iteration 3, loss = 0.07824594
Iteration 4, loss = 0.05566423
Iteration 5, loss = 0.04385750
Iteration 6, loss = 0.03726237
Iteration 7, loss = 0.03400020
Iteration 8, loss = 0.03073874
Iteration 9, loss = 0.0284
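
The three numbers printed under each confusion matrix are consistent with precision, recall and F1 for the positive (connection) class, assuming the matrix rows are true labels and the columns are predictions, as in scikit-learn's `confusion_matrix`. The `classify` helper is defined elsewhere, so this reading is an inference from the arithmetic, and `prf_from_confusion` is a hypothetical name. A small sketch that reproduces the BernoulliNB line:

```python
def prf_from_confusion(matrix):
    """Precision, recall and F1 of the positive class from a 2x2 matrix.

    Assumed layout: [[TN, FP], [FN, TP]] (rows = true, columns = predicted).
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrix printed for BernoulliNB in the Tf-idf run above.
bernoulli_nb = [[8739, 473], [1719, 288]]
p, r, f = prf_from_confusion(bernoulli_nb)
print(p, r, f)
```

Because roughly four out of five test sentences are negative, F1 on the positive class is a more informative summary here than plain accuracy.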

### 2. Word2Vec

#### a. First, we take the initial model that was built from the XML file without any tagging.

In [42]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
1193 22572
Embedding... 11219
489 11219
Training  BernoulliNB
[[7109 2103]
 [ 914 1093]]
0.341989987484 0.544593921276 0.420142225639
Training  RandomForestClassifier
[[9110  102]
 [1827  180]]
0.63829787234 0.0896860986547 0.157273918742
Training  BaggingClassifier
[[9003  209]
 [1776  231]]
0.525 0.11509715994 0.188802615447
Training  ExtraTreesClassifier
[[9014  198]
 [1779  228]]
0.535211267606 0.113602391629 0.187422934649
Training  DecisionTreeClassifier
[[7925 1287]
 [1376  631]]
0.328988529718 0.314399601395 0.32152866242
Training  CalibratedClassifierCV
[[9126   86]
 [1947   60]]
0.41095890411 0.0298953662182 0.0557361820715
Training  SGDClassifier
[[8371  841]
 [1575  432]]
0.339355852317 0.215246636771 0.263414634146
Training  KNeighborsClassifier
[[8089 1123]
 [1363  644]]
0.364459535937 0.320876930742 0.34128245893
Training  MLPClassifier
Iteration 1, loss = 0.35862211
Iteration 2, loss = 0.30612505
Iteration 3, loss = 0.28904674
Iteration 4, loss = 0.27

#### b. Second, we take the word2vec model that was built after the BRs were marked

In [43]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
1214 22572
Embedding... 11219
506 11219
Training  BernoulliNB
[[7329 1883]
 [ 979 1028]]
0.353143249742 0.512207274539 0.418056120374
Training  RandomForestClassifier
[[9104  108]
 [1795  212]]
0.6625 0.105630293971 0.1822088526
Training  BaggingClassifier
[[8973  239]
 [1744  263]]
0.52390438247 0.131041355257 0.209645277003
Training  ExtraTreesClassifier
[[8994  218]
 [1759  248]]
0.532188841202 0.123567513702 0.200566114032
Training  DecisionTreeClassifier
[[7988 1224]
 [1397  610]]
0.332606324973 0.303936223219 0.317625618329
Training  CalibratedClassifierCV
[[9137   75]
 [1945   62]]
0.452554744526 0.0308918784255 0.0578358208955
Training  SGDClassifier
[[9144   68]
 [1998    9]]
0.116883116883 0.00448430493274 0.00863723608445
Training  KNeighborsClassifier
[[8011 1201]
 [1303  704]]
0.369553805774 0.350772296961 0.359918200409
Training  MLPClassifier
Iteration 1, loss = 0.37564167
Iteration 2, loss = 0.31416076
Iteration 3, loss = 0.29497308
Iteration 4, loss 

#### c. Next is the word2vec model in which BR1 and BR2 are marked

In [44]:
modelName = 'csv_3_ReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
1214 22572
Embedding... 11219
506 11219
Training  BernoulliNB
[[7085 2127]
 [ 924 1083]]
0.33738317757 0.539611360239 0.415181138585
Training  RandomForestClassifier
[[9118   94]
 [1817  190]]
0.669014084507 0.0946686596911 0.165866433872
Training  BaggingClassifier
[[9039  173]
 [1740  267]]
0.606818181818 0.133034379671 0.218226399673
Training  ExtraTreesClassifier
[[9034  178]
 [1828  179]]
0.501400560224 0.0891878425511 0.151438240271
Training  DecisionTreeClassifier
[[7959 1253]
 [1477  530]]
0.297251822771 0.264075734928 0.279683377309
Training  CalibratedClassifierCV
[[9135   77]
 [1929   78]]
0.503225806452 0.0388639760837 0.0721554116559
Training  SGDClassifier
[[8929  283]
 [1921   86]]
0.233062330623 0.0428500249128 0.0723905723906
Training  KNeighborsClassifier
[[8010 1202]
 [1298  709]]
0.371009942439 0.353263577479 0.361919346605
Training  MLPClassifier
Iteration 1, loss = 0.35711549
Iteration 2, loss = 0.30939865
Iteration 3, loss = 0.29294240
Iteratio

### Grouping
When we broke the sentences down to only the middle context, comma-separated BRs turned out to be a disadvantage: in some sentences the middle context consisted of nothing but BR mentions, with no other words. To get more out of the words surrounding a BR, BRs that appear together were grouped into a single BR entity.

In [23]:
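def formatSen(sentence):
    # Drop "the"/"The" between whitespace and at the start of the sentence,
    # and the string "nucleus".
    sentence = re.sub(r"\s(the|The)\s"," ",sentence)
    sentence = re.sub(r"^(the|The)","",sentence)
    sentence = re.sub(r"(nucleus)"," ",sentence)
    # Replace commas with spaces and markers such as "(1)" with a full stop.
    sentence = re.sub(r","," ", sentence)
    sentence = re.sub(r"\([0-9]\)",".",sentence)
    # Collapse runs of adjacent BR mentions into a single BR, BR1 or BR2.
    sentence = re.sub(r"[, ]*(BR[^12])([, ]*(BR[^12]))+[, ]*"," BR ",sentence)
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR1([, ]+BR[^12])*[, ]*"," BR1 ",sentence)
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR2([, ]+BR[^12])*[, ]*"," BR2 ",sentence)
    return sentence.lower()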

In [24]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(formatSen(trainSen[i]))
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(formatSen(testSen[i]))

In [25]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

['examined optic ( br1 so ) intermediate gray ( br2 sgi )',
 'examined optic ( br1 so ) intermediate gray ( br sgi ) intermediate white ( br sai ) and deep gray ( br2 sgp )',
 'intermediate gray ( br1 sgi ) intermediate white ( br sai ) and deep gray ( br2 sgp )',
 'examined optic ( br1 so ) intermediate gray ( br sgi ) intermediate white ( br2 sai )',
 'intermediate gray ( br1 sgi ) intermediate white ( br2 sai )']
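
As an illustration of the grouping step, the sketch below applies the comma replacement and the three grouping regexes from `formatSen` to a made-up sentence in which unnamed BRs follow BR1 directly. The function name `group_brs` and the example sentence are hypothetical, and the other clean-up substitutions are left out for brevity.

```python
import re

def group_brs(sentence):
    # Same comma replacement and grouping regexes as in formatSen.
    sentence = re.sub(r",", " ", sentence)
    sentence = re.sub(r"[, ]*(BR[^12])([, ]*(BR[^12]))+[, ]*", " BR ", sentence)
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR1([, ]+BR[^12])*[, ]*", " BR1 ", sentence)
    sentence = re.sub(r"[, ]*(BR[, ]+)*BR2([, ]+BR[^12])*[, ]*", " BR2 ", sentence)
    return sentence.lower()

# Made-up sentence: two unnamed BRs listed right after BR1.
example = "Fibers from BR1, BR, BR reach BR2."
tokens = group_brs(example).split()
print(tokens)
```

The list `BR1, BR, BR` collapses into a single `br1` token, so the words around the mention ("from", "reach") end up inside the middle context.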

### 1. Tf-idf Vectorizer

In [26]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[9021  191]
 [1845  162]]
0.458923512748 0.0807174887892 0.137288135593
Training  RandomForestClassifier
[[9060  152]
 [1559  448]]
0.746666666667 0.223218734429 0.343690065209
Training  BaggingClassifier
[[8650  562]
 [1276  731]]
0.565351894818 0.364225211759 0.44303030303
Training  ExtraTreesClassifier
[[9006  206]
 [1613  394]]
0.656666666667 0.196312904833 0.302263137706
Training  DecisionTreeClassifier
[[8396  816]
 [1169  838]]
0.506650544135 0.417538614848 0.457798415733
Training  CalibratedClassifierCV
[[8720  492]
 [1267  740]]
0.600649350649 0.368709516692 0.45693115159
Training  SGDClassifier
[[8919  293]
 [1514  493]]
0.627226463104 0.245640259093 0.353025420695
Training  MLPClassifier
Iteration 1, loss = 0.35870140
Iteration 2, loss = 0.11035559
Iteration 3, loss = 0.04746342
Iteration 4, loss = 0.02892450
Iteration 5, loss = 0.02160953
Iteration 6, loss = 0.01584850
Iteration 7, loss = 0.01319013
Iteration 8, loss = 0.01166109
Iteration 9, loss = 0

### 2. Word2Vec

#### a. First, we take the initial model that was built from the XML file without any tagging.

In [48]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
0 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7058 2154]
 [ 942 1065]]
0.330848089469 0.530642750374 0.40757749713
Training  RandomForestClassifier
[[9133   79]
 [1873  134]]
0.629107981221 0.0667663178874 0.120720720721
Training  BaggingClassifier
[[9014  198]
 [1807  200]]
0.502512562814 0.0996512207275 0.16632016632
Training  ExtraTreesClassifier
[[9004  208]
 [1834  173]]
0.45406824147 0.0861983059292 0.144891122278
Training  DecisionTreeClassifier
[[7715 1497]
 [1483  524]]
0.259277585354 0.261086198306 0.260178748759
Training  CalibratedClassifierCV
[[9015  197]
 [1858  149]]
0.43063583815 0.074240159442 0.126646833829
Training  SGDClassifier
[[8917  295]
 [1833  174]]
0.371002132196 0.0866965620329 0.140549273021
Training  KNeighborsClassifier
[[8288  924]
 [1425  582]]
0.386454183267 0.289985052317 0.331340734415
Training  MLPClassifier
Iteration 1, loss = 0.35100552
Iteration 2, loss = 0.30122300
Iteration 3, loss = 0.28292462
Iteration 4, loss 

#### b. Second, we take the word2vec model that was built after the BRs were marked

In [49]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
10 22572
Embedding... 11219
1 11219
Training  BernoulliNB
[[7234 1978]
 [1024  983]]
0.331982438365 0.489785749875 0.395732689211
Training  RandomForestClassifier
[[9116   96]
 [1847  160]]
0.625 0.079720976582 0.141405214317
Training  BaggingClassifier
[[9031  181]
 [1751  256]]
0.585812356979 0.127553562531 0.209492635025
Training  ExtraTreesClassifier
[[8943  269]
 [1820  187]]
0.410087719298 0.0931738913802 0.151847340641
Training  DecisionTreeClassifier
[[7778 1434]
 [1468  539]]
0.27318803852 0.26856003986 0.270854271357
Training  CalibratedClassifierCV
[[9037  175]
 [1822  185]]
0.513888888889 0.0921773791729 0.156316011829
Training  SGDClassifier
[[6034 3178]
 [ 932 1075]]
0.252762755702 0.53562531141 0.343450479233
Training  KNeighborsClassifier
[[8026 1186]
 [1358  649]]
0.353678474114 0.323368211261 0.337844872462
Training  MLPClassifier
Iteration 1, loss = 0.35300460
Iteration 2, loss = 0.30599332
Iteration 3, loss = 0.28854990
Iteration 4, loss = 0.27350

#### c. Next is the word2vec model in which BR1 and BR2 are marked

In [50]:
modelName = 'csv_2_groupBR12'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
0 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[7136 2076]
 [ 977 1030]]
0.331616226658 0.513203786746 0.402894582437
Training  RandomForestClassifier
[[9150   62]
 [1879  128]]
0.673684210526 0.0637767812656 0.116522530724
Training  BaggingClassifier
[[9038  174]
 [1808  199]]
0.533512064343 0.0991529646238 0.167226890756
Training  ExtraTreesClassifier
[[9023  189]
 [1835  172]]
0.476454293629 0.0857000498256 0.14527027027
Training  DecisionTreeClassifier
[[7690 1522]
 [1435  572]]
0.273161413563 0.285002491281 0.278956352109
Training  CalibratedClassifierCV
[[9018  194]
 [1865  142]]
0.422619047619 0.0707523667165 0.121212121212
Training  SGDClassifier
[[7703 1509]
 [1184  823]]
0.352915951973 0.410064773293 0.379350080664
Training  KNeighborsClassifier
[[8291  921]
 [1384  623]]
0.403497409326 0.310413552566 0.350887074064
Training  MLPClassifier
Iteration 1, loss = 0.36131800
Iteration 2, loss = 0.30452838
Iteration 3, loss = 0.28842600
Iteration 4, lo

#### d. Next is the word2vec model in which BR1 and BR2 are marked, with bigrams

In [51]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_2_groupBR12_bigram'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding... 22572
0 22572
Embedding... 11219
0 11219
Training  BernoulliNB
[[6787 2425]
 [ 874 1133]]
0.31843732434 0.564524165421 0.407187780773
Training  RandomForestClassifier
[[9138   74]
 [1885  122]]
0.622448979592 0.0607872446437 0.110758057195
Training  BaggingClassifier
[[9068  144]
 [1819  188]]
0.566265060241 0.0936721474838 0.160752458316
Training  ExtraTreesClassifier
[[9032  180]
 [1851  156]]
0.464285714286 0.0777279521674 0.133162612036
Training  DecisionTreeClassifier
[[7818 1394]
 [1411  596]]
0.299497487437 0.296960637768 0.298223667751
Training  CalibratedClassifierCV
[[8972  240]
 [1846  161]]
0.401496259352 0.0802192326856 0.133720930233
Training  SGDClassifier
[[9006  206]
 [1862  145]]
0.413105413105 0.0722471350274 0.122985581001
Training  KNeighborsClassifier
[[8054 1158]
 [1372  635]]
0.354155047407 0.31639262581 0.334210526316
Training  MLPClassifier
Iteration 1, loss = 0.34518678
Iteration 2, loss = 0.30647807
Iteration 3, loss = 0.29051010
Iteration 4, lo
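
To show what a bigram transformation does to a sentence, here is a simplified sketch. It is not gensim's `Phrases` scoring: it merges any adjacent word pair that occurs at least `min_count` times, which is an assumption made to keep the example short. The function name and the toy corpus are hypothetical. The sketch works on token lists, which is the input form gensim's `Phrases` is documented to take.

```python
from collections import Counter

def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent token pairs seen at least min_count times with '_'."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_sentences = []
    for tokens in sentences:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2   # skip both halves of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_sentences.append(merged)
    return merged_sentences

# Toy corpus, already tokenized (made up for illustration).
corpus = [
    ["superior", "colliculus", "projects", "to", "br2"],
    ["br1", "and", "superior", "colliculus"],
]
result = merge_frequent_bigrams(corpus)
print(result)
```

A word2vec model trained on such merged tokens learns one vector for `superior_colliculus` instead of two separate word vectors.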

### Tokenization
Until now we were splitting the sentences on whitespace, so brackets stayed attached to the neighbouring words instead of being treated as separate tokens. Before computing the vectors we therefore tokenize each sentence with nltk's `word_tokenize`. This applies only to word2vec.
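
The difference can be seen on a short example. The sketch below uses a small regex tokenizer as a stand-in for nltk's `word_tokenize` (an assumption, so that the example has no dependencies); the name `simple_tokenize` and the sentence are hypothetical.

```python
import re

def simple_tokenize(sentence):
    # Stand-in for word_tokenize: runs of word characters, or a single
    # punctuation character, each become one token.
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "optic (br1 so) intermediate gray (br2 sgi)"
by_space = sentence.split()
by_token = simple_tokenize(sentence)
print(by_space)
print(by_token)
```

With whitespace splitting the tag appears as `(br1`, which would not match a vocabulary entry `br1`; after tokenization the bracket and the tag are separate tokens.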

In [27]:
from nltk.tokenize import word_tokenize
import numpy as np

def convertW2V(data, w2v):
    # Represent each sentence as the mean of the word2vec vectors of its tokens.
    wholeM = []
    print('Embedding...')
    for sentence in data:
        # Each word is checked against the word2vec vocabulary; if it is
        # present, its vector is kept for the mean.
        arr = [np.array(w2v[word], copy=True)
               for word in word_tokenize(sentence) if word in w2v]

        if arr:
            mean = np.mean(arr, axis=0, dtype=np.float64)
        else:
            # No known word in the sentence: fall back to a 100-d zero vector.
            mean = np.zeros(100)
        wholeM.append(mean)
    return wholeM
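
The averaging in `convertW2V` can be checked by hand on a toy vocabulary. The sketch below uses plain Python lists and a made-up two-dimensional vocabulary instead of a real word2vec model; `mean_vector` and `toy_vectors` are hypothetical names.

```python
def mean_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`.

    Falls back to a zero vector when no token is known, mirroring the
    zero initialisation in convertW2V.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Made-up 2-d "embeddings" for illustration only.
toy_vectors = {"br1": [1.0, 0.0], "projects": [0.0, 1.0]}
seen = mean_vector(["br1", "projects", "to"], toy_vectors, 2)
unseen = mean_vector(["unknown", "words"], toy_vectors, 2)
print(seen)
print(unseen)
```

The out-of-vocabulary token `to` is skipped rather than counted, so it does not pull the mean towards zero.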

### Word2Vec

#### a. First, we take the initial model that was built from the XML file without any tagging.

In [53]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[7058 2154]
 [ 942 1065]]
0.330848089469 0.530642750374 0.40757749713
Training  RandomForestClassifier
[[9120   92]
 [1875  132]]
0.589285714286 0.0657698056801 0.118332586284
Training  BaggingClassifier
[[9010  202]
 [1824  183]]
0.475324675325 0.0911808669656 0.153010033445
Training  ExtraTreesClassifier
[[9016  196]
 [1789  218]]
0.526570048309 0.108619830593 0.180090871541
Training  DecisionTreeClassifier
[[7808 1404]
 [1422  585]]
0.294117647059 0.291479820628 0.292792792793
Training  CalibratedClassifierCV
[[9015  197]
 [1858  149]]
0.43063583815 0.074240159442 0.126646833829
Training  SGDClassifier
[[8969  243]
 [1943   64]]
0.208469055375 0.0318883906328 0.0553154710458
Training  KNeighborsClassifier
[[8288  924]
 [1425  582]]
0.386454183267 0.289985052317 0.331340734415
Training  MLPClassifier
Iteration 1, loss = 0.34221632
Iteration 2, loss = 0.29992809
Iteration 3, loss = 0.28144065
Iteration 4, loss = 0.26681330
Iteration 5, 

#### b. Second, we take the word2vec model that was built after the BRs were marked

In [54]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[7234 1978]
 [1024  983]]
0.331982438365 0.489785749875 0.395732689211
Training  RandomForestClassifier
[[9137   75]
 [1841  166]]
0.688796680498 0.0827105132038 0.14768683274
Training  BaggingClassifier
[[8957  255]
 [1759  248]]
0.493041749503 0.123567513702 0.197609561753
Training  ExtraTreesClassifier
[[9028  184]
 [1785  222]]
0.546798029557 0.110612855007 0.184003315375
Training  DecisionTreeClassifier
[[7795 1417]
 [1511  496]]
0.259278619969 0.247135027404 0.25306122449
Training  CalibratedClassifierCV
[[9037  175]
 [1822  185]]
0.513888888889 0.0921773791729 0.156316011829
Training  SGDClassifier
[[8880  332]
 [1736  271]]
0.449419568823 0.135027404086 0.207662835249
Training  KNeighborsClassifier
[[8026 1186]
 [1358  649]]
0.353678474114 0.323368211261 0.337844872462
Training  MLPClassifier
Iteration 1, loss = 0.34707201
Iteration 2, loss = 0.30414611
Iteration 3, loss = 0.28779226
Iteration 4, loss = 0.27312975
Iteration 5, lo

#### c. Next is the word2vec model in which BR1 and BR2 are marked

In [55]:
modelName = 'csv_2_groupBR12_token'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[7514 1698]
 [1213  794]]
0.318619582665 0.395615346288 0.352967326072
Training  RandomForestClassifier
[[9144   68]
 [1919   88]]
0.564102564103 0.0438465371201 0.081368469718
Training  BaggingClassifier
[[9107  105]
 [1868  139]]
0.569672131148 0.0692575984056 0.123500666371
Training  ExtraTreesClassifier
[[9032  180]
 [1865  142]]
0.44099378882 0.0707523667165 0.121940747102
Training  DecisionTreeClassifier
[[7707 1505]
 [1455  552]]
0.268351968887 0.275037369208 0.271653543307
Training  CalibratedClassifierCV
[[8969  243]
 [1875  132]]
0.352 0.0657698056801 0.110831234257
Training  SGDClassifier
[[8955  257]
 [1889  118]]
0.314666666667 0.0587942202292 0.0990764063812
Training  KNeighborsClassifier
[[8160 1052]
 [1370  637]]
0.377146240379 0.317389138017 0.344696969697
Training  MLPClassifier
Iteration 1, loss = 0.33603187
Iteration 2, loss = 0.29617287
Iteration 3, loss = 0.27917034
Iteration 4, loss = 0.26075980
Iteration 5, loss =

#### d. Next is the word2vec model in which BR1 and BR2 are marked, with bigrams

In [56]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_2_groupBR12_bigramToken'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[6763 2449]
 [ 908 1099]]
0.309751972943 0.547583457897 0.395679567957
Training  RandomForestClassifier
[[9155   57]
 [1872  135]]
0.703125 0.067264573991 0.12278308322
Training  BaggingClassifier
[[9096  116]
 [1844  163]]
0.584229390681 0.0812157448929 0.142607174103
Training  ExtraTreesClassifier
[[9030  182]
 [1862  145]]
0.443425076453 0.0722471350274 0.124250214225
Training  DecisionTreeClassifier
[[7709 1503]
 [1421  586]]
0.280516993777 0.291978076731 0.2861328125
Training  CalibratedClassifierCV
[[8979  233]
 [1852  155]]
0.399484536082 0.0772296960638 0.129436325678
Training  SGDClassifier
[[8851  361]
 [1785  222]]
0.380789022298 0.110612855007 0.171428571429
Training  KNeighborsClassifier
[[8185 1027]
 [1422  585]]
0.362903225806 0.291479820628 0.323293727549
Training  MLPClassifier
Iteration 1, loss = 0.35546664
Iteration 2, loss = 0.30747280
Iteration 3, loss = 0.28913525
Iteration 4, loss = 0.27500203
Iteration 5, loss = 0

------------------------
# Fin
For the final run we use another CSV file (4). Grouping the BRs had produced duplicate rows; this file contains no duplicates. The sentences were preprocessed as needed.
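
How the duplicate-free file was produced is not shown here. As a rough sketch of the idea, rows that become identical after grouping can be dropped while keeping the first occurrence; the function name and the sample rows below are hypothetical.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each (sentence, label) row, in order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Made-up rows: the first two collapse to the same sentence after grouping.
rows = [
    ("br1 projects to br2", 1),
    ("br1 projects to br2", 1),
    ("br1 lies near br2", 0),
]
deduped = drop_duplicates(rows)
print(deduped)
```

Removing such rows keeps repeated sentences from being counted several times in training and evaluation.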

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split

fileName = "WhiteText(Fin).csv"
trainSet = pd.read_csv('data/'+fileName,delimiter="|")

fileEval = "WhiteTextUnseenEval(Fin).csv"
testSet = pd.read_csv('data/'+fileEval,delimiter="|")

trainSet.to_csv('Train(Fin)',sep="|")
testSet.to_csv('Test(Fin)',sep="|")

In [29]:
trainData = pd.read_csv('Train(Fin)',delimiter='|')
trainSen = trainData['sentence']
trainLab = trainData['connection']
trainLen = len(trainSen)
print(trainLen)

testData = pd.read_csv('Test(Fin)',delimiter='|')
testSen = testData['sentence']
testLab = testData['connection']
testLen = len(testSen)
print(testLen)


def formatSen(sentence):
    # Drop "the"/"The" between whitespace and at the start of the sentence,
    # and the string "nucleus".
    sentence = re.sub(r"\s(the|The)\s"," ",sentence)
    sentence = re.sub(r"^(the|The)","",sentence)
    sentence = re.sub(r"(nucleus)"," ",sentence)
    # Replace commas with spaces and markers such as "(1)" with a full stop.
    sentence = re.sub(r","," ", sentence)
    sentence = re.sub(r"\([0-9]\)",".",sentence)
    return sentence.lower()

16547
8475


In [30]:
trainFSen = []
for i in range(0,trainLen):
    trainFSen.append(formatSen(trainSen[i]))
    
testFSen = []
for i in range(0,testLen):
    testFSen.append(formatSen(testSen[i]))

### 1. Tf-idf Vectorizer

In [31]:
train_vect, test_vect = convertTfidf(trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[6647   80]
 [1632  116]]
0.591836734694 0.0663615560641 0.119341563786
Training  RandomForestClassifier
[[6623  104]
 [1446  302]]
0.743842364532 0.172768878719 0.280408542247
Training  BaggingClassifier
[[5959  768]
 [1042  706]]
0.478968792402 0.403890160183 0.438237119801
Training  ExtraTreesClassifier
[[6586  141]
 [1473  275]]
0.661057692308 0.157322654462 0.25415896488
Training  DecisionTreeClassifier
[[5471 1256]
 [ 977  771]]
0.380365071534 0.441075514874 0.408476821192
Training  CalibratedClassifierCV
[[6513  214]
 [1293  455]]
0.680119581465 0.260297482838 0.376499793132
Training  SGDClassifier
[[6634   93]
 [1433  315]]
0.772058823529 0.180205949657 0.292207792208
Training  MLPClassifier
Iteration 1, loss = 0.42868549
Iteration 2, loss = 0.26328427
Iteration 3, loss = 0.16206960
Iteration 4, loss = 0.11424762
Iteration 5, loss = 0.08879925
Iteration 6, loss = 0.07187474
Iteration 7, loss = 0.05938229
Iteration 8, loss = 0.05084608
Iteration 9, loss = 

### 2. Word2Vec

#### a. First, we take the initial model that was built from the XML file without any tagging.

In [60]:
len(trainFSen)

16547

In [61]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[4925 1802]
 [ 958  790]]
0.304783950617 0.451945080092 0.364055299539
Training  RandomForestClassifier
[[6634   93]
 [1633  115]]
0.552884615385 0.0657894736842 0.117586912065
Training  BaggingClassifier
[[6620  107]
 [1604  144]]
0.573705179283 0.0823798627002 0.144072036018
Training  ExtraTreesClassifier
[[6490  237]
 [1573  175]]
0.424757281553 0.100114416476 0.162037037037
Training  DecisionTreeClassifier
[[5284 1443]
 [1198  550]]
0.275965880582 0.314645308924 0.294039026998
Training  CalibratedClassifierCV
[[6670   57]
 [1673   75]]
0.568181818182 0.0429061784897 0.0797872340426
Training  SGDClassifier
[[6692   35]
 [1717   31]]
0.469696969697 0.0177345537757 0.0341786108049
Training  KNeighborsClassifier
[[5627 1100]
 [1261  487]]
0.306868304978 0.278604118993 0.292053973013
Training  MLPClassifier
Iteration 1, loss = 0.42658315
Iteration 2, loss = 0.38934386
Iteration 3, loss = 0.37827551
Iteration 4, loss = 0.37150356
Iteration

#### b. Second, we take the word2vec model that was built after the BRs were marked

In [62]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainFSen, testFSen)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[5311 1416]
 [1066  682]]
0.325071496663 0.390160183066 0.354654186167
Training  RandomForestClassifier
[[6635   92]
 [1629  119]]
0.563981042654 0.0680778032037 0.121490556406
Training  BaggingClassifier
[[6602  125]
 [1602  146]]
0.538745387454 0.08352402746 0.144626052501
Training  ExtraTreesClassifier
[[6516  211]
 [1606  142]]
0.402266288952 0.0812356979405 0.135173726797
Training  DecisionTreeClassifier
[[5642 1085]
 [1355  393]]
0.265899864682 0.224828375286 0.243645381277
Training  CalibratedClassifierCV
[[6660   67]
 [1675   73]]
0.521428571429 0.04176201373 0.0773305084746
Training  SGDClassifier
[[3093 3634]
 [ 509 1239]]
0.254258157193 0.70881006865 0.374263706389
Training  KNeighborsClassifier
[[5284 1443]
 [1178  570]]
0.283159463487 0.326086956522 0.303110874767
Training  MLPClassifier
Iteration 1, loss = 0.43922314
Iteration 2, loss = 0.39524567
Iteration 3, loss = 0.38395140
Iteration 4, loss = 0.37637043
Iteration 5, lo

#### c. bigram

In [63]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_Fin_bigramToken'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainFSen], bigram_trans[testFSen])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[5923  804]
 [1387  361]]
0.309871244635 0.20652173913 0.247854445589
Training  RandomForestClassifier
[[6702   25]
 [1685   63]]
0.715909090909 0.0360411899314 0.0686274509804
Training  BaggingClassifier
[[6689   38]
 [1661   87]]
0.696 0.0497711670481 0.0928990923652
Training  ExtraTreesClassifier
[[6579  148]
 [1651   97]]
0.395918367347 0.0554919908467 0.0973406924235
Training  DecisionTreeClassifier
[[5774  953]
 [1367  381]]
0.285607196402 0.217963386728 0.247242050616
Training  CalibratedClassifierCV
[[6660   67]
 [1659   89]]
0.570512820513 0.0509153318078 0.093487394958
Training  SGDClassifier
[[6381  346]
 [1484  264]]
0.432786885246 0.151029748284 0.223918575064
Training  KNeighborsClassifier
[[5886  841]
 [1364  384]]
0.313469387755 0.219679633867 0.258324924319
Training  MLPClassifier
Iteration 1, loss = 0.43982131
Iteration 2, loss = 0.40195343
Iteration 3, loss = 0.39163010
Iteration 4, loss = 0.38470989
Iteration 5, loss 

### Middle Sentences taken

In [32]:
trainMid = breakSen(trainFSen)
testMid = breakSen(testFSen)
trainMid[:5]

['examined optic ( br1 so ) intermediate gray ( br2 sgi )',
 'examined optic ( br1 so ) intermediate gray ( br sgi ) intermediate white ( br sai ) and deep gray ( br2 sgp )',
 'intermediate gray ( br1 sgi ) intermediate white ( br sai ) and deep gray ( br2 sgp )',
 'examined optic ( br1 so ) intermediate gray ( br sgi ) intermediate white ( br2 sai )',
 'intermediate gray ( br1 sgi ) intermediate white ( br2 sai )']

### 1. Tf-idf Vectorizer

In [33]:
train_vect, test_vect = convertTfidf(trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Training  BernoulliNB
[[6638   89]
 [1634  114]]
0.56157635468 0.0652173913043 0.116863147104
Training  RandomForestClassifier
[[6578  149]
 [1336  412]]
0.73440285205 0.235697940503 0.356864443482
Training  BaggingClassifier
[[6303  424]
 [1103  645]]
0.603367633302 0.368993135011 0.457933972311
Training  ExtraTreesClassifier
[[6580  147]
 [1376  372]]
0.71676300578 0.212814645309 0.328187031319
Training  DecisionTreeClassifier
[[5835  892]
 [1032  716]]
0.445273631841 0.409610983982 0.426698450536
Training  CalibratedClassifierCV
[[6363  364]
 [1141  607]]
0.625128733265 0.347254004577 0.446487679294
Training  SGDClassifier
[[6447  280]
 [1236  512]]
0.646464646465 0.29290617849 0.403149606299
Training  MLPClassifier
Iteration 1, loss = 0.41572883
Iteration 2, loss = 0.15999117
Iteration 3, loss = 0.06183704
Iteration 4, loss = 0.03497055
Iteration 5, loss = 0.02390761
Iteration 6, loss = 0.01634154
Iteration 7, loss = 0.01267299
Iteration 8, loss = 0.00971220
Iteration 9, loss = 0.0
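
To compare the full-sentence and middle-context Tf-idf runs on the final data, the F1 score can be recomputed from the printed confusion matrices. The sketch below does this for two of the classifiers, assuming the `[[TN, FP], [FN, TP]]` layout; the helper name `f1` and the dictionary names are hypothetical, and the matrices are copied from the outputs in this section.

```python
def f1(matrix):
    # F1 of the positive class from [[TN, FP], [FN, TP]].
    (tn, fp), (fn, tp) = matrix
    return 2.0 * tp / (2 * tp + fp + fn)

# Tf-idf on full sentences (final data).
full_sentence = {
    "RandomForestClassifier": [[6623, 104], [1446, 302]],
    "SGDClassifier": [[6634, 93], [1433, 315]],
}
# Tf-idf on middle context (final data).
middle_context = {
    "RandomForestClassifier": [[6578, 149], [1336, 412]],
    "SGDClassifier": [[6447, 280], [1236, 512]],
}
for name in sorted(full_sentence):
    print("%s %.3f %.3f" % (name, f1(full_sentence[name]), f1(middle_context[name])))
```

For these two classifiers the middle context raises F1, from about 0.280 to 0.357 for the random forest and from about 0.292 to 0.403 for SGD.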

### 2. Word2Vec

#### a. First, we take the initial model that was built from the XML file without any tagging.

In [65]:
modelName = 'xmlOriginalSen'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[5111 1616]
 [ 901  847]]
0.34388956557 0.484553775744 0.402279743529
Training  RandomForestClassifier
[[6662   65]
 [1602  146]]
0.691943127962 0.08352402746 0.149055640633
Training  BaggingClassifier
[[6606  121]
 [1549  199]]
0.621875 0.113844393593 0.192456479691
Training  ExtraTreesClassifier
[[6574  153]
 [1573  175]]
0.533536585366 0.100114416476 0.16859344894
Training  DecisionTreeClassifier
[[5554 1173]
 [1237  511]]
0.303444180523 0.29233409611 0.297785547786
Training  CalibratedClassifierCV
[[6572  155]
 [1583  165]]
0.515625 0.0943935926773 0.159574468085
Training  SGDClassifier
[[5010 1717]
 [ 849  899]]
0.343654434251 0.514302059497 0.412007332722
Training  KNeighborsClassifier
[[5969  758]
 [1265  483]]
0.389202256245 0.276315789474 0.32318501171
Training  MLPClassifier
Iteration 1, loss = 0.40381963
Iteration 2, loss = 0.35897360
Iteration 3, loss = 0.34312163
Iteration 4, loss = 0.33232742
Iteration 5, loss = 0.32169671


#### b. Second, we take the word2vec model that was built after the BRs were marked

In [66]:
modelName = 'xmlReplaceBR'
train_vect, test_vect = buildW2V(modelName,trainMid, testMid)
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[5451 1276]
 [ 988  760]]
0.373280943026 0.434782608696 0.401691331924
Training  RandomForestClassifier
[[6655   72]
 [1580  168]]
0.7 0.0961098398169 0.169014084507
Training  BaggingClassifier
[[6575  152]
 [1519  229]]
0.601049868766 0.131006864989 0.215124471583
Training  ExtraTreesClassifier
[[6602  125]
 [1575  173]]
0.580536912752 0.0989702517162 0.169110459433
Training  DecisionTreeClassifier
[[5699 1028]
 [1296  452]]
0.305405405405 0.258581235698 0.280049566295
Training  CalibratedClassifierCV
[[6572  155]
 [1568  180]]
0.537313432836 0.102974828375 0.172827652424
Training  SGDClassifier
[[4923 1804]
 [ 864  884]]
0.328869047619 0.505720823799 0.398557258792
Training  KNeighborsClassifier
[[5818  909]
 [1193  555]]
0.379098360656 0.317505720824 0.345579078456
Training  MLPClassifier
Iteration 1, loss = 0.42374897
Iteration 2, loss = 0.36568700
Iteration 3, loss = 0.35143264
Iteration 4, loss = 0.34003247
Iteration 5, loss = 0.32

#### c. bigram

In [67]:
from gensim.models import Phrases
bigram_trans = Phrases(trainFSen)

modelName = 'csv_Fin_bigramToken'
train_vect, test_vect = buildW2V(modelName,bigram_trans[trainMid], bigram_trans[testMid])
print(classify(train_vect,trainLab,test_vect,testLab))

Embedding...
Embedding...
Training  BernoulliNB
[[6073  654]
 [1342  406]]
0.383018867925 0.232265446224 0.289173789174
Training  RandomForestClassifier
[[6688   39]
 [1668   80]]
0.672268907563 0.045766590389 0.0856989823246
Training  BaggingClassifier
[[6655   72]
 [1621  127]]
0.638190954774 0.0726544622426 0.130457113508
Training  ExtraTreesClassifier
[[6578  149]
 [1636  112]]
0.429118773946 0.0640732265446 0.11149825784
Training  DecisionTreeClassifier
[[5574 1153]
 [1262  486]]
0.296522269677 0.278032036613 0.286979627989
Training  CalibratedClassifierCV
[[6589  138]
 [1612  136]]
0.496350364964 0.0778032036613 0.134520276954
Training  SGDClassifier
[[5693 1034]
 [1084  664]]
0.391048292108 0.379862700229 0.385374347069
Training  KNeighborsClassifier
[[5874  853]
 [1186  562]]
0.397173144876 0.321510297483 0.355358836548
Training  MLPClassifier
Iteration 1, loss = 0.41130927
Iteration 2, loss = 0.37081959
Iteration 3, loss = 0.35622398
Iteration 4, loss = 0.34662213
Iteration 5,