# Converting XML to CSV file
The task is to determine whether a given sentence expresses a connection between any two of the brain regions mentioned in it.

The brain regions are already annotated alongside the sentences in an XML file. To work with the data, it is converted to a DataFrame and stored in CSV format for further manipulation.

## Reading XML file
The first step is to read the XML file. Two XML files are given:
1. WhiteText.xml - This is the initial data
2. WhiteTextUnseen.xml - This is the test data that will be used later for testing and evaluating the model

The data is organised in XML tags. Each sentence is under a "sentence" tag, its entities are in "entity" tags, and each candidate interaction is given in a "pair" tag. If there is an interaction, the attribute interaction of the pair is 'True', otherwise it is 'False'.
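To make the layout concrete, here is a minimal sketch that parses a tiny hand-written corpus with the same tag and attribute names used in this notebook (`text`, `charOffset`, `e1`, `e2`, `interaction`). The sentence, the `id` values and the nesting under a `document` tag are assumptions made up for illustration, and the snippet is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus; ids and text are invented for illustration.
SAMPLE = """
<corpus>
  <document id="d0">
    <sentence id="d0.s0" text="The thalamus projects to the cortex.">
      <entity id="d0.s0.e0" charOffset="4-11" text="thalamus"/>
      <entity id="d0.s0.e1" charOffset="29-34" text="cortex"/>
      <pair e1="d0.s0.e0" e2="d0.s0.e1" interaction="True"/>
    </sentence>
  </document>
</corpus>
"""

sample_root = ET.fromstring(SAMPLE)
summary = []
for sentence in sample_root.iter('sentence'):
    # Entities and pairs are direct children of the sentence element.
    entities = [e.attrib['text'] for e in sentence.findall('entity')]
    pairs = [(p.attrib['e1'], p.attrib['e2'], p.attrib['interaction'] == 'True')
             for p in sentence.findall('pair')]
    summary.append((sentence.attrib['text'], entities, pairs))

print(summary)
```

Each tuple in `summary` holds the sentence text, the entity mentions and the annotated pairs with their interaction flag as a boolean.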

In [1]:
fileName = "WhiteText_re" #Specify the xml file to be read
# fileName = "WhiteTextUnseen"

In [2]:
import xml.etree.ElementTree as ET
tree = ET.parse('data/'+fileName+'.xml')
root = tree.getroot()
print root.tag #Checking the tag of the root to see if file was read

corpus


As the WhiteTextNeg.xml file has extra tags that are not used here, the unwanted 'sentenceanalyses' tags are deleted below to decrease memory usage, and the trimmed tree is written to a new file.

In [3]:
for child in root:
    for c in child:
        for se in list(c):   # iterate over a copy: removing while iterating can skip elements
            if se.tag == 'sentenceanalyses':
                c.remove(se) # removing the tag
tree.write('data/'+fileName+'_re.xml')
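Removing children from an element while looping over that same element can skip siblings, because the child list shifts under the loop. The following is a small sketch of the safe pattern on a made-up sentence element (the element contents are hypothetical, and the snippet is written for Python 3).

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence element with two adjacent unwanted children.
sentence = ET.fromstring(
    '<sentence text="x">'
    '<entity id="e0"/>'
    '<sentenceanalyses/>'
    '<sentenceanalyses/>'
    '<pair e1="e0" e2="e0"/>'
    '</sentence>'
)

# list(sentence) takes a snapshot of the children, so removing
# from the live element does not disturb the loop.
for child in list(sentence):
    if child.tag == 'sentenceanalyses':
        sentence.remove(child)

remaining = [child.tag for child in sentence]
print(remaining)
```

Only the `entity` and `pair` children are left afterwards.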

Viewing and counting the number of sentences in the corpus

In [5]:
count = 0
for elem in root.iter('sentence'):
    #print elem.attrib['text']
    count += 1
print "\nThe number of sentences in the xml document = ",count


The number of sentences in the xml document =  4334


## Convert to DataFrame
Now that we have seen the data in the XML file, we look at all the connections and store them in a Pandas DataFrame.

The DataFrame has the following columns: [connection, entity1, entity2, sentence]

The sentences are read one by one, taking two brain region mentions at a time. If a sentence has more than two brain region mentions, it is processed again for each of the other annotated pairs of mentions, so it yields one entry per pair.

The two brain regions under consideration are denoted in the sentence by 'BR1' and 'BR2'. All other mentions of brain regions are denoted by 'BR'. The result is then stored in the DataFrame.
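The BR1/BR2/BR labelling can be sketched as a small helper. This is only an illustration under stated assumptions: `mask_entities` is a hypothetical name, the spans are given as (start, end) with an exclusive end, and they are assumed to be sorted by position and non-overlapping. The sentence is made up, and the snippet is written for Python 3.

```python
def mask_entities(sentence, spans, ind1, ind2):
    """Replace each entity span with BR1, BR2 or BR.

    spans: list of (start, end) pairs, end exclusive, assumed to be
    sorted by position and non-overlapping.
    """
    parts = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        parts.append(sentence[cursor:start])   # text between mentions
        if i == ind1:
            parts.append("BR1")
        elif i == ind2:
            parts.append("BR2")
        else:
            parts.append("BR")
        cursor = end
    parts.append(sentence[cursor:])            # text after the last mention
    return "".join(parts)


text = "The thalamus projects to the cortex and the striatum."
spans = [(4, 12), (29, 35), (44, 52)]

masked_first_last = mask_entities(text, spans, 0, 2)
masked_first_second = mask_entities(text, spans, 0, 1)
print(masked_first_last)
print(masked_first_second)
```

The same sentence gives a different string for each pair, which is why one sentence can produce several entries.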

The different methods that were tried are numbered below. Each method applies an additional processing step before storing the result in a CSV file.

### 1.
At first, the sentence was left untouched when creating the CSV file.

In [5]:
import pandas as pd
data = pd.DataFrame(columns = ['connection','entity1','entity2','sentence'])
print 'Reading xml file'
for corpus in root:
    for document in corpus:
        sentence = document.attrib['text']  #the text was in the argument 'text' of sentence tag
        entity = []                         #to store all the entities included in the sentence
        br1 = ""
        br2 = ""
        direction = -1
        for connection in document:
            if connection.tag == 'entity':
                entity.append(connection.attrib['text'])
            if connection.tag == 'pair':                 #Whenever an pair tag encountered
                if connection.attrib['interaction'] == 'False':
                    direction = 0
                elif connection.attrib['interaction'] == 'True' :
                    direction = 1
                    
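                # the number after the last 'e' of the id is the index of the entity;
                # taking only the last character would break for indices of 10 or more
                br1 = entity[int(connection.attrib['e1'].rsplit('e', 1)[1])]
                br2 = entity[int(connection.attrib['e2'].rsplit('e', 1)[1])]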
                data.loc[len(data)] = [direction,br1,br2,sentence]  #storing into the DataFrame

print 'Finished...'

Reading xml file
Finished...
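Appending with `data.loc[len(data)]` one row at a time gets slow as the DataFrame grows. A possible alternative, sketched here with made-up rows, is to collect plain tuples in a list and build the DataFrame once at the end. The name `data_sketch` and the row values are hypothetical, and the snippet is written for Python 3.

```python
import pandas as pd

# Hypothetical rows standing in for the values produced inside the loop.
rows = [
    (0, "region A", "region B", "sentence one"),
    (1, "region A", "region C", "sentence one"),
    (0, "region D", "region E", "sentence two"),
]

# Build the DataFrame in one step instead of growing it row by row.
data_sketch = pd.DataFrame(
    rows, columns=['connection', 'entity1', 'entity2', 'sentence'])

no_connection = int((data_sketch['connection'] == 0).sum())
print(len(data_sketch), no_connection)
```

Inside the parsing loop, the `data.loc[...]` assignment would become `rows.append((direction, br1, br2, sentence))`.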


After storing all the connection sentences in the DataFrame, we look at the number of entries in the data

In [6]:
print "Number of entries in DataFrame = ", len(data)
count = 0
for i in data['connection']:
    if i == 0:
        count += 1
print "Number of sentences with no connection = ", count
print "Number of sentences with connection = ", len(data) - count
data.head()

Number of entries in DataFrame =  22561
Number of sentences with no connection =  19464
Number of sentences with connection =  3097


| | connection | entity1 | entity2 | sentence |
|---|---|---|---|---|
| 0 | 0 | stratum opticum | stratum griseum intermedium | For this study, we examined the optic (stratum... |
| 1 | 0 | stratum opticum | stratum griseum profundum | For this study, we examined the optic (stratum... |
| 2 | 0 | stratum griseum intermedium | stratum griseum profundum | For this study, we examined the optic (stratum... |
| 3 | 0 | stratum album intermedium | stratum opticum | For this study, we examined the optic (stratum... |
| 4 | 0 | stratum album intermedium | stratum griseum intermedium | For this study, we examined the optic (stratum... |


With the data in a DataFrame, we store it in a CSV file for later use.

In [7]:
data.to_csv('data/'+fileName+'(1).csv',sep='|') # the output file is named after fileName; sep denotes the delimiter
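For reading such a file back later, a round trip might look like the sketch below. It writes to an in-memory buffer instead of a file under `data/`, and the single row is made up. It also shows that a `|` inside a sentence survives, because pandas quotes such fields by default. The snippet is written for Python 3.

```python
import io
import pandas as pd

# Hypothetical one-row frame; the sentence deliberately contains the delimiter.
frame = pd.DataFrame(
    [(1, "A", "B", "A projects to B | see text")],
    columns=['connection', 'entity1', 'entity2', 'sentence'],
)

buffer = io.StringIO()
frame.to_csv(buffer, sep='|')      # same delimiter as in the notebook
buffer.seek(0)

# index_col=0 restores the row index instead of creating an 'Unnamed: 0' column.
restored = pd.read_csv(buffer, sep='|', index_col=0)
print(restored)
```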

### 2.
On reflection, before creating the CSV file, the brain region mentions are replaced with the associated BR label. Each mention is located using the "charOffset" attribute of its "entity" tag. This is done because a sentence may contain multiple mentions of the same brain region without showing any relation to the other brain region, so marking the exact mentions makes clear which pair an entry refers to.
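The two parsing steps this method relies on can be sketched in isolation. Both helper names are hypothetical. The sketch assumes that `charOffset` holds a single `start-end` range with an inclusive end, and that an entity id ends in `e` followed by the entity number; the id `corpus.d0.s0.e12` is made up. The snippet is written for Python 3.

```python
def parse_char_offset(char_offset):
    """Turn 'start-end' (end inclusive) into slice bounds (start, end + 1)."""
    start, end = char_offset.split('-')
    return int(start), int(end) + 1


def entity_index(entity_id):
    """Return the number after the final 'e' of an id such as 'corpus.d0.s0.e12'."""
    return int(entity_id.rsplit('e', 1)[1])


text = "The thalamus projects to the cortex."
start, end = parse_char_offset("4-11")
mention = text[start:end]
index = entity_index("corpus.d0.s0.e12")
print(mention, index)
```

Adding 1 to the end offset is what lets the inclusive XML range be used directly as a Python slice.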

In [9]:
import pandas as pd
data = pd.DataFrame(columns = ['connection','entity1','entity2','sentence'])

print 'Reading xml file..'
for corpus in root:
    for document in corpus:
        sentence = document.attrib['text']
        entityStart = []
        entityEnd = []
        br1 = ""
        br2 = ""
        direction = -1
        for connection in document:
            if connection.tag == 'entity':
                sep = connection.attrib['charOffset'].index('-')  #to get the charOffset index
                entityStart.append(int(connection.attrib['charOffset'][:sep]))
                entityEnd.append(int(connection.attrib['charOffset'][sep+1:])+1)
            
            if connection.tag == 'pair':           #If tag is pair tag make entry
                if connection.attrib['interaction'] == 'False':
                    direction = 0
                elif connection.attrib['interaction'] == 'True' :
                    direction = 1
                
                ind1 = int(connection.attrib['e1'][connection.attrib['e1'].index('e',20)+1:])
                ind2 = int(connection.attrib['e2'][connection.attrib['e2'].index('e',20)+1:])
                if ind1 > ind2:         # BR1 should be the mention with the lower index, so swap if needed
                    ind1, ind2 = ind2, ind1
                br1 = sentence[entityStart[ind1]:entityEnd[ind1]]
                br2 = sentence[entityStart[ind2]:entityEnd[ind2]]
                
                if ind1 == 0:
                    s = sentence[:entityStart[0]] + "BR1"
                else :
                    s = sentence[:entityStart[0]] + "BR"
                for i in range(1,len(entityStart)):
                    if i == ind1:
                        s += sentence[entityEnd[i-1]:entityStart[i]] + "BR1"
                        continue
                    if i == ind2:
                        s += sentence[entityEnd[i-1]:entityStart[i]] + "BR2"
                        continue
                    s += sentence[entityEnd[i-1]:entityStart[i]] + "BR"
                s += sentence[entityEnd[-1]:]
                
                data.loc[len(data)] = [direction,br1,br2,s]

print 'Finished reading...'

print "Number of entries in DataFrame = ", len(data)
count = 0
for i in data['connection']:
    if i == 0:
        count += 1
print "Number of sentences with no connection = ", count
print "Number of sentences with connection = ", len(data) - count

data.to_csv('data/'+fileName+'(2).csv',sep='|') #fileName is the name of the file going to be created
data.head()

Reading xml file..
Finished reading...
Number of entries in DataFrame =  22561
Number of sentences with no connection =  19464
Number of sentences with connection =  3097


| | connection | entity1 | entity2 | sentence |
|---|---|---|---|---|
| 0 | 0 | stratum opticum | stratum griseum intermedium | For this study, we examined the optic (BR1, SO... |
| 1 | 0 | stratum opticum | stratum griseum profundum | For this study, we examined the optic (BR1, SO... |
| 2 | 0 | stratum griseum intermedium | stratum griseum profundum | For this study, we examined the optic (BR, SO)... |
| 3 | 0 | stratum opticum | stratum album intermedium | For this study, we examined the optic (BR1, SO... |
| 4 | 0 | stratum griseum intermedium | stratum album intermedium | For this study, we examined the optic (BR, SO)... |


### 3.
Another type of corpus was built directly from the XML file. The reasoning was that the other methods produce several sentences with the same structure that differ only in where the BR1 and BR2 tags are placed. So each sentence is taken from the XML file once and all brain regions are denoted simply as BR. As only the connection is being examined, denoting the regions uniformly should not affect it.

The generated file is only to be used for building the word2vec vector model.
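When every mention gets the same label, the replacement can also be done from the last mention to the first, so that the offsets of earlier mentions stay valid. This is a different technique from the left-to-right concatenation used in the cell below, shown only as a sketch. `mask_all` is a hypothetical helper, spans use an exclusive end, and the sentence is made up. The snippet is written for Python 3.

```python
def mask_all(sentence, spans, label="BR"):
    """Replace every (start, end) span, end exclusive, with the same label."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        sentence = sentence[:start] + label + sentence[end:]
    return sentence


masked_all = mask_all(
    "Connections of the thalamus with the cortex in the rat.",
    [(19, 27), (37, 43)],
)
print(masked_all)
```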

In [6]:
import pandas as pd
data = pd.DataFrame(columns = ['sentence'])

print 'Reading xml file..'
for corpus in root:
    for document in corpus:
        sentence = document.attrib['text']
        entityStart = []
        entityEnd = []
        
        for connection in document:
            if connection.tag == 'entity':
                sep = connection.attrib['charOffset'].index('-')  #to get the charOffset index
                entityStart.append(int(connection.attrib['charOffset'][:sep]))
                entityEnd.append(int(connection.attrib['charOffset'][sep+1:])+1)
                
        s = sentence[:entityStart[0]] + "BR"
        for i in range(1,len(entityStart)):
            s += sentence[entityEnd[i-1]:entityStart[i]] + "BR"
        s += sentence[entityEnd[-1]:]

        data.loc[len(data)] = [s]

print 'Finished reading...'

print "Number of entries in DataFrame = ", len(data)

data.to_csv('data/'+fileName+'(3).csv',sep='|') #fileName is the name of the file going to be created
data.head()

Reading xml file..
Finished reading...
Number of entries in DataFrame =  4334


| | sentence |
|---|---|
| 0 | For this study, we examined the optic (BR, SO)... |
| 1 | Connections of the BR with BR in the rat. |
| 2 | The present study was undertaken to establish ... |
| 3 | The anterograde tracer Phaseolus vulgaris-leuc... |
| 4 | The results of these tracing experiments confi... |


### 4. 
Same as above, but with BR1 and BR2 denoted, so each sentence yields multiple entries. There were duplicates, i.e., entries with the same sentence structure. These were mostly sentences containing comma-separated lists of brain regions. The duplicates were therefore removed.

In [5]:
sentences = []
result = []
print 'Reading xml file..'
for corpus in root:
    for document in corpus:
        sentence = document.attrib['text']
        entityStart = []
        entityEnd = []
        direction = ""
        
        for connection in document:
            if connection.tag == 'entity':
                sep = connection.attrib['charOffset'].index('-')
                entityStart.append(int(connection.attrib['charOffset'][:sep]))
                entityEnd.append(int(connection.attrib['charOffset'][sep+1:])+1)
            
            if connection.tag == 'pair':
                if connection.attrib['interaction'] == 'False':
                    direction = "0"
                elif connection.attrib['interaction'] == 'True' :
                    direction = "1"
                
                ind1 = int(connection.attrib['e1'][connection.attrib['e1'].index('e',20)+1:])
                ind2 = int(connection.attrib['e2'][connection.attrib['e2'].index('e',20)+1:])
                if ind1 > ind2:
                    ind1, ind2 = ind2, ind1
               
                if ind1 == 0:
                    s = sentence[:entityStart[0]] + "BR1"
                else :
                    s = sentence[:entityStart[0]] + "BR"
                for i in range(1,len(entityStart)):
                    if i == ind1:
                        s += sentence[entityEnd[i-1]:entityStart[i]] + "BR1"
                        continue
                    if i == ind2:
                        s += sentence[entityEnd[i-1]:entityStart[i]] + "BR2"
                        continue
                    s += sentence[entityEnd[i-1]:entityStart[i]] + "BR"
                s += sentence[entityEnd[-1]:]
                
                sentences.append(s)
                result.append(direction)

print 'Finished reading...'


Reading xml file..
Finished reading...


After denoting all the BRs, we now remove the duplicates after tokenizing the sentences.

In [7]:
from nltk.tokenize import word_tokenize
import re
import pandas as pd

uniqueSent = []
for sentence in sentences:
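    sentence = re.sub(r"\s(the|The)\s"," ",sentence)   # drop the article "the"
    # collapse a list of plain BR mentions into a single BR
    sentence = re.sub("[, ]*(BR)([, ]*(BR[, ]))+"," BR ",sentence)
    # collapse plain BR mentions listed around BR1 (then BR2) into the tagged mention
    sentence = re.sub("[, ]*(BR[, ]+)*(BR1[, ]*)(BR[, ])*"," BR1 ",sentence)
    sentence = re.sub("[, ]*(BR[, ]+)*(BR2[, ]*)(BR[, ])*"," BR2 ",sentence)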
    sentence = word_tokenize(sentence)
    uniqueSent.append(' '.join(sentence))
uniqueSent

data = pd.DataFrame(columns = ['connection','sentence'])
count = 0
for i,isent in enumerate(uniqueSent):
    flag = 0
    for j,jsent in enumerate(uniqueSent[:i]): #Checking if the sentence occurred anywhere before
        if isent == jsent:
            if result[i] == result[j]:     #Sentence same, but to be sure ground truth also checked
                flag = 1
                break
            else:
                flag = 2
    if flag == 2:
        print i,isent
        count += 1
    if flag == 0:
        data.loc[len(data)] = [result[i],isent]
print count
len(data)


print "Number of entries in DataFrame = ", len(data)

data.to_csv('data/'+fileName+'(Fin).csv',sep='|') #fileName is the name of the file going to be created
data.head()

595 The areas which had only efferent connections from BR1 ( MP ) were BR2 and BR among which BR and BR were previously reported .
1142 Among layer 5 pyramidal cells , approximately 27.4 % in infralimbic ( IL ) / BR ( PL ) / BR1 projected to BR2 ( LH ) , 22.9 % in infralimbic ( IL ) /BR ( PL ) to BR , 18.3 % in BR/BR ( PL ) to BR , and 8.1 % in areas infralimbic ( IL ) / BR ( PL ) to BR ( BLA ) ; and 37 % of layer 6 pyramidal cells in infralimbic ( IL ) / BR ( PL ) /BR projected to BR ( MD ) .
2656 The BR1 ( EP ) is a major outflow nucleus of BR and innervates BR2 ( VL ) , BR ( VM ) , and BR .
3624 Nearly every BR1 site received a projection from BR2 and BR .
3710 Most BR1 were also innervated by unique combinations of BR such as BR and BR ; BR2 and BR ; and , BR .
4040 The results are as follows : 1 ) part of BR located around BR ( BR1 ) receives its main input from BR and PEip ( BR2 of Pandya and Seltzer , [ 1982 ] J . Comp . )
4221 The nuclei composing BR1 ( BR and BR ) displayed di

| | connection | sentence |
|---|---|---|
| 0 | 0 | For this study , we examined optic ( BR1 SO ) ... |
| 1 | 0 | For this study , we examined optic ( BR1 SO ) ... |
| 2 | 0 | For this study , we examined optic ( BR , SO )... |
| 3 | 0 | For this study , we examined optic ( BR1 SO ) ... |
| 4 | 0 | For this study , we examined optic ( BR , SO )... |
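The duplicate check above compares every sentence with all earlier ones, which grows quadratically with the number of entries. A dictionary-based variant that is meant to follow the same rules (keep the first occurrence, and report a later copy whose label disagrees with every earlier copy) could look like this sketch. `deduplicate` is a hypothetical helper and the inputs are made up. The snippet is written for Python 3.

```python
def deduplicate(sentences, labels):
    """Keep first occurrences and report label conflicts among duplicates."""
    seen = {}        # sentence text -> set of labels already seen for it
    kept = []        # (label, sentence) pairs, first occurrence only
    conflicts = []   # indices whose label differs from every earlier copy
    for i, (sent, label) in enumerate(zip(sentences, labels)):
        if sent not in seen:
            seen[sent] = {label}
            kept.append((label, sent))
        elif label not in seen[sent]:
            conflicts.append(i)
            seen[sent].add(label)
    return kept, conflicts


sample_sentences = ["a", "b", "a", "a", "a"]
sample_labels = ["0", "1", "0", "1", "1"]
kept, conflicts = deduplicate(sample_sentences, sample_labels)
print(kept, conflicts)
```

Here the fourth entry (index 3) is reported as a conflict because the same sentence appeared earlier only with label "0", while the fifth is treated as an ordinary duplicate.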
