A simple (i.e. no error checking or sensible engineering) notebook to extract the student answer data from an XML file.

The SemEval data used here comes from the [SemEval 2013 website](https://www.cs.york.ac.uk/semeval-2013/task7/index.php%3Fid=data.html).

I'm not 100% sure what we actually need yet, so for now I'm just going to extract the student answer data from a single file. That is, I'm leaving out the reference answers etc. to begin with.

In [1]:
filename='semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.xml'

In [2]:
import pandas as pd

In [3]:
from xml.etree import ElementTree as ET

In [4]:
tree=ET.parse(filename)

The student answers are the third child node of the root element:

In [5]:
r=tree.getroot()
r[2]

<Element 'studentAnswers' at 0x114b01598>
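
Indexing the root by position (`r[2]`) only works while the children stay in that order. A slightly more defensive sketch looks the element up by tag name instead. The XML string below is a made-up miniature of the file: only the `studentAnswers` tag and the `accuracy` attribute come from this notebook, and the other element names are assumptions.

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-in for the real file; only 'studentAnswers' and the
# 'accuracy' attribute are taken from the notebook, the rest is assumed.
sample_xml = """
<question>
  <questionText>Why?</questionText>
  <referenceAnswers>
    <referenceAnswer>Terminal 1 is separated from the positive terminal</referenceAnswer>
  </referenceAnswers>
  <studentAnswers>
    <studentAnswer accuracy="correct">terminal 1 is not connected</studentAnswer>
    <studentAnswer accuracy="non_domain">no</studentAnswer>
  </studentAnswers>
</question>
"""

root = ET.fromstring(sample_xml)

# find() returns the first direct child with this tag, or None if there is none
answers_el = root.find('studentAnswers')
if answers_el is None:
    raise ValueError("no <studentAnswers> element found")

sample_responses = [
    {'accuracy': a.attrib['accuracy'], 'text': a.text, 'idx': i}
    for i, a in enumerate(answers_el)
]
print(sample_responses)
```

With the real file, `tree.getroot().find('studentAnswers')` should be a drop-in replacement for `r[2]`.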

Now iterate over the student answers to get the specific responses. For the moment, we'll just stick to the text and the accuracy. I'll also add an index term to make it a bit easier to convert to a dataframe.

In [6]:
responses_ls=[{'accuracy':a.attrib['accuracy'], 'text':a.text, 'idx':i} for (i, a) in enumerate(r[2])]

responses_ls

[{'accuracy': 'correct',
  'idx': 0,
  'text': 'positive battery terminal is separated by a gap from terminal 1'},
 {'accuracy': 'correct',
  'idx': 1,
  'text': 'terminal 1 is not connected to the positive terminal'},
 {'accuracy': 'contradictory',
  'idx': 2,
  'text': 'Because terminal 1 is connected to the positive battery terminal'},
 {'accuracy': 'contradictory',
  'idx': 3,
  'text': 'because terminal 1 is not seperated by any gaps'},
 {'accuracy': 'contradictory',
  'idx': 4,
  'text': 'because terminal one is  connected to both the negative and positive battery terminal'},
 {'accuracy': 'non_domain', 'idx': 5, 'text': 'no'},
 {'accuracy': 'non_domain', 'idx': 6, 'text': 'i do not understand'},
 {'accuracy': 'correct',
  'idx': 7,
  'text': 'Terminal 1 is seperated from the positive terminal'},
 {'accuracy': 'correct',
  'idx': 8,
  'text': 'the positive battery terminal is not connected to terminal 1.'},
 {'accuracy': 'contradictory',
  'idx': 9,
  'text': 'because terminal on

Next, we need to carry out whatever analysis we want on the answers. In this case, we'll split on whitespace, convert to lower case, and strip punctuation. Feel free to redefine the `to_tokens` function to do whatever analysis you prefer.

In [7]:
from string import punctuation

def to_tokens(textIn):
    '''Convert the input textIn to a list of tokens'''
    tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
    # remove any empty tokens
    return [t for t in tokens_ls if t]

# use a name that doesn't shadow the built-in str
sample_str='"Help!" yelped the banana, who was obviously scared out of his skin.'
print(sample_str)
print(to_tokens(sample_str))

"Help!" yelped the banana, who was obviously scared out of his skin.
['help', 'yelped', 'the', 'banana', 'who', 'was', 'obviously', 'scared', 'out', 'of', 'his', 'skin']
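
Note that `strip(punctuation)` only removes punctuation from the two ends of each token, so anything inside a token (apostrophes, decimal points) survives. As a rough comparison, here is the same tokeniser next to a hypothetical regex-based alternative (`to_tokens_regex` is not part of the notebook) that splits on every non-word character:

```python
import re
from string import punctuation

def to_tokens(text_in):
    """Same behaviour as the notebook's to_tokens: split, lower-case, strip edge punctuation."""
    tokens_ls = [t.lower().strip(punctuation) for t in text_in.split()]
    return [t for t in tokens_ls if t]

def to_tokens_regex(text_in):
    """Hypothetical alternative: keep only runs of letters, digits and underscores."""
    return re.findall(r"\w+", text_in.lower())

sample = "The bulb isn't lit... it's 1.5 volts!"
print(to_tokens(sample))        # internal punctuation is kept
print(to_tokens_regex(sample))  # internal punctuation splits the token
```

Which behaviour is preferable depends on the analysis; keeping `1.5` as one token is probably what we want for this data.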


So now we can apply the `to_tokens` function to each of the student responses:

In [8]:
for resp_dict in responses_ls:
    resp_dict['tokens']=to_tokens(resp_dict['text'])
responses_ls

[{'accuracy': 'correct',
  'idx': 0,
  'text': 'positive battery terminal is separated by a gap from terminal 1',
  'tokens': ['positive',
   'battery',
   'terminal',
   'is',
   'separated',
   'by',
   'a',
   'gap',
   'from',
   'terminal',
   '1']},
 {'accuracy': 'correct',
  'idx': 1,
  'text': 'terminal 1 is not connected to the positive terminal',
  'tokens': ['terminal',
   '1',
   'is',
   'not',
   'connected',
   'to',
   'the',
   'positive',
   'terminal']},
 {'accuracy': 'contradictory',
  'idx': 2,
  'text': 'Because terminal 1 is connected to the positive battery terminal',
  'tokens': ['because',
   'terminal',
   '1',
   'is',
   'connected',
   'to',
   'the',
   'positive',
   'battery',
   'terminal']},
 {'accuracy': 'contradictory',
  'idx': 3,
  'text': 'because terminal 1 is not seperated by any gaps',
  'tokens': ['because',
   'terminal',
   '1',
   'is',
   'not',
   'seperated',
   'by',
   'any',
   'gaps']},
 {'accuracy': 'contradictory',
  'idx': 4,
  '

OK, good. So now let's see how big the vocabulary is for the complete set:

In [9]:
vocab_set=set()
for resp_dict in responses_ls:
    vocab_set.update(resp_dict['tokens'])
    
len(vocab_set)

97

Now we can set up a document frequency dict, mapping each token to the number of responses that contain it:

In [10]:
docFreq_dict={}

for t in vocab_set:
    docFreq_dict[t]=len([resp_dict for resp_dict in responses_ls if t in resp_dict['tokens']])
    
docFreq_dict

{'1': 40,
 '1.5': 3,
 '2': 1,
 'a': 31,
 'and': 20,
 'answer': 1,
 'any': 1,
 'are': 12,
 'aren"t': 1,
 'at': 3,
 'batteries': 1,
 'battery': 39,
 'battery"s': 1,
 'becaquse': 1,
 'because': 28,
 'becuase': 1,
 'between': 9,
 'both': 2,
 'bulb': 7,
 'by': 10,
 'c': 1,
 'charge': 2,
 'circuit': 3,
 'closed': 1,
 'closing': 1,
 'components': 1,
 'connected': 50,
 'connection': 5,
 'contact': 1,
 'created': 1,
 'damaged': 3,
 'difference': 1,
 'different': 3,
 'dint': 1,
 'direct': 1,
 'do': 1,
 'dont': 1,
 'each': 2,
 'electrical': 3,
 'end': 1,
 'from': 6,
 'gap': 27,
 'gaps': 1,
 'get': 1,
 'had': 2,
 'has': 1,
 'have': 1,
 'he': 2,
 'i': 4,
 'in': 4,
 'is': 54,
 'it': 6,
 'its': 2,
 'know': 2,
 'making': 1,
 'me': 1,
 'negative': 13,
 'no': 9,
 'not': 26,
 'of': 2,
 'on': 2,
 'one': 8,
 'other': 2,
 'path': 1,
 'positive': 52,
 'posittive': 1,
 'positve': 1,
 'postive': 1,
 'psoitive': 1,
 'reading': 1,
 'same': 1,
 'separated': 6,
 'separates': 1,
 'separation': 2,
 'separted': 1,
 '
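
The loop above rescans every response once per vocabulary item. A possible single-pass sketch uses `collections.Counter`, which also gives the vocabulary for free. The token lists here are made up and stand in for the `resp_dict['tokens']` values.

```python
from collections import Counter

# Made-up token lists standing in for resp_dict['tokens']
sample_tokens = [
    ['terminal', '1', 'is', 'connected'],
    ['terminal', '1', 'is', 'not', 'connected'],
    ['no'],
    ['terminal', 'terminal', 'gap'],
]

doc_freq = Counter()
for tokens in sample_tokens:
    # set() makes sure a token repeated within one response is counted once
    doc_freq.update(set(tokens))

print(len(doc_freq))         # vocabulary size
print(doc_freq['terminal'])  # number of responses containing 'terminal'
```

The keys of `doc_freq` play the role of `vocab_set`, so the two steps collapse into one.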

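Now add a tf.idf-style dict to each of the responses. Note that the weight used here is simply the token's count in the response divided by its document frequency, with no log scaling: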

In [11]:
for resp_dict in responses_ls:
    resp_dict['tfidf']={t:resp_dict['tokens'].count(t)/docFreq_dict[t] for t in resp_dict['tokens']}
    
responses_ls[6]

{'accuracy': 'non_domain',
 'idx': 6,
 'text': 'i do not understand',
 'tfidf': {'do': 1.0,
  'i': 0.25,
  'not': 0.038461538461538464,
  'understand': 1.0},
 'tokens': ['i', 'do', 'not', 'understand']}
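
For comparison, a more conventional tf.idf weighting multiplies the term count by the log of (number of documents / document frequency). The sketch below is one common variant among several, not what the cell above computes; the token lists and the `tfidf` helper are made up for illustration.

```python
import math

# Made-up token lists standing in for the student responses
sample_tokens = [
    ['i', 'do', 'not', 'understand'],
    ['terminal', '1', 'is', 'not', 'connected'],
    ['terminal', '1', 'is', 'connected'],
]

n_docs = len(sample_tokens)

# document frequency: number of responses containing each token
doc_freq = {}
for tokens in sample_tokens:
    for t in set(tokens):
        doc_freq[t] = doc_freq.get(t, 0) + 1

def tfidf(tokens):
    """Raw term count times log(N / df); an assumed variant, not the notebook's formula."""
    return {t: tokens.count(t) * math.log(n_docs / doc_freq[t]) for t in set(tokens)}

weights = tfidf(sample_tokens[0])
print(weights)
```

With this form, a token that appears in every response gets a weight of zero, whereas the count/df version above still gives it a small positive weight.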

Finally, convert the response data into a dataframe, with one row per response and one column per token:

In [14]:
out_df=pd.DataFrame(index=docFreq_dict.keys())
for resp_dict in responses_ls:
    out_df[resp_dict['idx']]=pd.Series(resp_dict['tfidf'], index=out_df.index)

out_df=out_df.fillna(0).T
out_df.head()

   its       the      from  components    one  are  other  bulb  any  different  ...  understand   no    v  postive   he  circuit       to  get   at  making
0  0.0       0.0  0.166667         0.0    0.0  0.0    0.0   0.0  0.0        0.0  ...         0.0  0.0  0.0      0.0  0.0      0.0      0.0  0.0  0.0     0.0
1  0.0  0.014085       0.0         0.0    0.0  0.0    0.0   0.0  0.0        0.0  ...         0.0  0.0  0.0      0.0  0.0      0.0  0.02381  0.0  0.0     0.0
2  0.0  0.014085       0.0         0.0    0.0  0.0    0.0   0.0  0.0        0.0  ...         0.0  0.0  0.0      0.0  0.0      0.0  0.02381  0.0  0.0     0.0
3  0.0       0.0       0.0         0.0    0.0  0.0    0.0   0.0  1.0        0.0  ...         0.0  0.0  0.0      0.0  0.0      0.0      0.0  0.0  0.0     0.0
4  0.0  0.014085       0.0         0.0  0.125  0.0    0.0   0.0  0.0        0.0  ...         0.0  0.0  0.0      0.0  0.0      0.0  0.02381  0.0  0.0     0.0
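
The column-by-column loop followed by a transpose could probably be shortened: passing a list of dicts to `pd.DataFrame` gives one row per dict directly. A minimal sketch, using made-up responses in place of `responses_ls`:

```python
import pandas as pd

# Made-up responses standing in for responses_ls
sample_responses = [
    {'idx': 0, 'tfidf': {'gap': 0.5, 'from': 1.0}},
    {'idx': 1, 'tfidf': {'gap': 0.5, 'not': 1.0}},
]

# one row per response, one column per token; tokens missing from a response become 0
sample_df = pd.DataFrame(
    [r['tfidf'] for r in sample_responses],
    index=[r['idx'] for r in sample_responses],
).fillna(0)

print(sample_df)
```

The column order may differ from the loop version, since it follows the order in which tokens are first seen rather than the order of `docFreq_dict`.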


In [15]:
accuracy_ss=pd.Series({r['idx']:r['accuracy'] for r in responses_ls})
accuracy_ss.head()

0          correct
1          correct
2    contradictory
3    contradictory
4    contradictory
dtype: object