Exploring how to identify the nationality of authors who answered exam questions in the ESL dataset

tomelf/CNIT623-Native-Language-Identification-On-English-Learner-Dataset

Dataset

Introduction

UD English-ESL/TLE is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. Each sentence is annotated in both its original and its error-corrected form. The annotations follow the standard English UD guidelines, along with a set of supplementary guidelines for ESL. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were randomly drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus. The treebank is split randomly into a training set of 4,124 sentences, a development set of 500 sentences and a test set of 500 sentences. Further information is available at esltreebank.org.

File format

Raw data

Exam questions: dataset/UD_English-ESL/fce-released-dataset/prompts/[folders]/doc[number].xml

Learner answers: dataset/UD_English-ESL/fce-released-dataset/dataset/[folders]/doc[number].xml

Each XML file contains the textual answers for 2 exams written by one English learner. The learner-level attribute tags are:

  • language: native language of the learner
  • age: age range of the learner
  • score: ??

For each exam:

  • question_number
  • exam_score
  • coded_answer: text content of answer (with tags of FCE error codes)

The details of exams and tags of FCE error codes can be found in dataset/UD_English-ESL/fce-released-dataset/dataset/README
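
The coded_answer text carries its error annotations inline, in the pattern visible in the sample data further down: <ns type="S"><i>shoked</i><c>shocked</c></ns>. The following is a minimal sketch of how the original and corrected readings could be recovered from such markup, assuming that <i> holds the learner's text and <c> the correction (the dataset README is the authoritative description of the codes). The function name render and the sample string are made up for illustration and are not part of this repository.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def render(elem, corrected=False):
    """Flatten an element to plain text.

    Keeps the content of <i> (original reading) by default, or the
    content of <c> (corrected reading) when corrected=True.
    """
    skip = "i" if corrected else "c"
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            # Recurse so that nested <ns> annotations are handled too.
            parts.append(render(child, corrected))
        # Text that follows the child belongs to the parent either way.
        parts.append(child.tail or "")
    return "".join(parts)


# Hand-written sample in the style of a coded_answer paragraph (assumption).
sample = (
    '<p>I was <ns type="S"><i>shoked</i><c>shocked</c></ns> because I had '
    '<ns type="S"><i>alredy</i><c>already</c></ns> spoken with them.</p>'
)

root = ET.fromstring(sample)
original = render(root)
corrected = render(root, corrected=True)

# Count error codes per type, similar in shape to the "errors" column.
errors = Counter(ns.get("type") for ns in root.iter("ns"))

print(original)
print(corrected)
print(dict(errors))
```

Skipping one of the two child tags, rather than deleting nodes from the tree, leaves the parsed document untouched, so both readings can be produced from the same element.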

Labeled data in CoNLL-U format

The labeled dataset is provided in CoNLL-U format.

Original sentences:

  • dataset/UD_English-ESL/data/en_esl-ud-train.conllu
  • dataset/UD_English-ESL/data/en_esl-ud-dev.conllu
  • dataset/UD_English-ESL/data/en_esl-ud-test.conllu

Corrected sentences:

  • dataset/UD_English-ESL/data/corrected/en_cesl-ud-train.conllu
  • dataset/UD_English-ESL/data/corrected/en_cesl-ud-dev.conllu
  • dataset/UD_English-ESL/data/corrected/en_cesl-ud-test.conllu

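The following are the attributes for each word in a sentence:
['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']

  • id: index of the word in the sentence (starting at 1)
  • form: word form or punctuation symbol
  • lemma: lemma or stem of the word form (left as "_" in the example below)
  • upostag: universal POS tag
  • xpostag: language-specific POS tag (Penn Treebank tags here, e.g. PRP)
  • feats: morphological features
  • head: id of the word's syntactic head, or 0 for the root
  • deprel: Universal Dependencies relation to the head
  • deps: enhanced dependency graph, as a list of head-deprel pairs
  • misc: any other annotation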

("_" means null)

Example of the representation of a word:
([('id', 1), ('form', 'I'), ('lemma', '_'), ('upostag', 'PRON'), ('xpostag', 'PRP'), ('feats', None), ('head', 3), ('deprel','nsubj'), ('deps', None), ('misc', None)])
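
To make the column layout concrete, here is a minimal sketch that turns one tab-separated CoNLL-U token line into a dictionary with the ten field names listed above. It uses only the standard library and is not the parser this repository relies on; parse_token_line is a hypothetical helper. It converts "_" to None for every field except form and lemma, where an underscore can be a literal value, which matches the example representation.

```python
FIELDS = ['id', 'form', 'lemma', 'upostag', 'xpostag',
          'feats', 'head', 'deprel', 'deps', 'misc']


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    token = dict(zip(FIELDS, values))

    # "_" marks an unspecified value. form and lemma are left untouched
    # because an underscore there may be the actual token.
    for key in FIELDS[3:]:
        if token[key] == "_":
            token[key] = None

    # Plain word ids and heads become integers. Multiword ranges ("1-2")
    # and empty nodes ("1.1") stay as strings in this sketch.
    for key in ("id", "head"):
        if token[key] is not None and token[key].isdigit():
            token[key] = int(token[key])
    return token


line = "1\tI\t_\tPRON\tPRP\t_\t3\tnsubj\t_\t_"
token = parse_token_line(line)
print(token)
```

Comment lines (starting with "#") and the blank lines that separate sentences would need to be filtered out before calling this helper on a real file.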

Data Loader

To use the data loader, you need to first install the CoNLL-U Parser built by Emil Stenström.

The following is an example of using data_loader:

import data_loader

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=True, load_test=True)

train_meta, train_meta_corrected, \
dev_meta, dev_meta_corrected, \
test_meta, test_meta_corrected = meta_list

train_data, train_data_corrected, \
dev_data, dev_data_corrected, \
test_data, test_data_corrected = data_list

train_meta.head()

id  doc_id   sent                                               errors                                native_language  age_range  score
1   doc2664  I was <ns type="S"><i>shoked</i><c>shocked</c>...  {'S': 2, 'RV': 1}                     Russian          21-25      21.0
2   doc648   I am very sorry to say it was definitely not a...  {'MT': 1, 'RT': 1}                    French           26-30      38.0
3   doc1081  Of course, I became aware of her feelings sinc...  {'AGQ': 1}                            Spanish          16-20      36.0
4   doc724   I also suggest that more plays and films shoul...  {'FV': 1, 'RV': 1}                    Japanese         21-25      33.0
5   doc567   Although my parents were very happy <ns type="...  {'FD': 1, 'RT': 1, 'RJ': 1, 'MT': 1}  Spanish          31-40      34.0

train_data[0]

id  form        lemma  upostag  xpostag  feats  head  deprel  deps  misc  meta_id
1   I           _      PRON     PRP      None   3     nsubj   None  None  1
2   was         _      VERB     VBD      None   3     cop     None  None  1
3   shoked      _      ADJ      JJ       None   0     root    None  None  1
4   because     _      SCONJ    IN       None   8     mark    None  None  1
5   I           _      PRON     PRP      None   8     nsubj   None  None  1
6   had         _      AUX      VBD      None   8     aux     None  None  1
7   alredy      _      ADV      RB       None   8     advmod  None  None  1
8   spoken      _      VERB     VBN      None   3     advcl   None  None  1
9   with        _      ADP      IN       None   10    case    None  None  1
10  them        _      PRON     PRP      None   8     nmod    None  None  1
11  and         _      CONJ     CC       None   8     cc      None  None  1
12  I           _      PRON     PRP      None   14    nsubj   None  None  1
13  had         _      AUX      VBD      None   14    aux     None  None  1
14  taken       _      VERB     VBN      None   8     conj    None  None  1
15  two         _      NUM      CD       None   16    nummod  None  None  1
16  autographs  _      NOUN     NNS      None   14    dobj    None  None  1
17  .           _      PUNCT    .        None   3     punct   None  None  1

Dumped files are under ./preprocessed/[name]/.

  • meta.csv: the same format as the above variable "train_meta"
  • [number].csv: the same format as the above variable "train_data[0]"
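
Each token table carries a meta_id column that points back to the id column of the metadata table, which is how a sentence is linked to its author's native language. The following is a rough sketch of that join using plain dictionaries instead of the loader's own data structures (which are not reproduced here). The rows are abbreviated copies of the samples above, and pos_sequences_by_language is a hypothetical helper name.

```python
# Abbreviated metadata rows (only the columns needed for the join).
meta_rows = [
    {"id": 1, "doc_id": "doc2664", "native_language": "Russian"},
    {"id": 2, "doc_id": "doc648", "native_language": "French"},
]

# First three tokens of the sentence whose meta_id is 1.
token_rows = [
    {"id": 1, "form": "I", "upostag": "PRON", "meta_id": 1},
    {"id": 2, "form": "was", "upostag": "VERB", "meta_id": 1},
    {"id": 3, "form": "shoked", "upostag": "ADJ", "meta_id": 1},
]


def pos_sequences_by_language(meta_rows, token_rows):
    """Pair each sentence's POS tag sequence with its native language."""
    meta_by_id = {row["id"]: row for row in meta_rows}

    # Group POS tags by sentence, preserving token order.
    sentences = {}
    for tok in token_rows:
        sentences.setdefault(tok["meta_id"], []).append(tok["upostag"])

    return [
        (meta_by_id[meta_id]["native_language"], tags)
        for meta_id, tags in sentences.items()
    ]


pairs = pos_sequences_by_language(meta_rows, token_rows)
print(pairs)
```

Metadata rows with no matching tokens (the French row here) simply produce no pair, so partially loaded data does not raise an error.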
