
Comparing Speech and Text Classification of Native and Non-native English Speakers

This repository contains a dataset of speech and text features extracted from the International Corpus Network of Asian Learners of English (ICNALE).

Description:

  • the data is stored in libsvm format, with one file for each pair of native languages
  • speech.ftrs and function_words.ftrs list the features together with the integer index that identifies each feature in the libsvm representation
  • the first line of each pair file carries a libsvm comment that indicates the labels of the two native languages used; it has the form: 1 1:2 2:1 # , LANG_1, LANG_2
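
The layout described above can be read with a few lines of Python. This is a minimal sketch rather than code shipped with the repository: the function names are hypothetical, and it assumes that labels are integers, that the language names are the comma-separated fields after the `#` on the first line, and that the first line serves only as a header.

```python
from typing import Dict, List, Tuple


def parse_pair_header(line: str) -> List[str]:
    """Return the language names listed after '#' on the first line."""
    _, _, comment = line.partition("#")
    # The comment starts with a comma, so empty fields are dropped.
    return [name.strip() for name in comment.split(",") if name.strip()]


def parse_libsvm_line(line: str) -> Tuple[int, Dict[int, float]]:
    """Parse one 'label index:value ...' line, ignoring a trailing comment."""
    fields = line.split("#", 1)[0].split()
    label = int(float(fields[0]))  # assumption: labels are integers
    features: Dict[int, float] = {}
    for item in fields[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def read_pair_file(path: str):
    """Read a pair file; the first line is treated as a header (assumption)."""
    with open(path, encoding="utf-8") as handle:
        lines = [ln for ln in handle if ln.strip()]
    languages = parse_pair_header(lines[0])
    rows = [parse_libsvm_line(ln) for ln in lines[1:]]
    return languages, rows
```

Calling `parse_pair_header` on the example line from the description yields the two language names in the order they are written. Feature indices can then be looked up in speech.ftrs or function_words.ftrs, whose exact column layout is not specified here.
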
Speech:
  • the speech files are split into 2-second chunks
  • each class (native language) is represented by an equal number of randomly sampled chunks
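
The chunking and balancing steps above could look roughly like the sketch below. It is an illustration only, not the procedure used to build the dataset: the function names are hypothetical, and it assumes that a trailing remainder shorter than 2 seconds is discarded and that every class is reduced to the size of the smallest one.

```python
import random
from typing import Dict, List, Sequence


def split_into_chunks(samples: Sequence[float], sample_rate: int,
                      seconds: float = 2.0) -> List[Sequence[float]]:
    """Cut a signal into consecutive fixed-length chunks.

    Assumption: a trailing remainder shorter than `seconds` is dropped.
    """
    size = int(sample_rate * seconds)
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def balance_classes(chunks_by_class: Dict[str, list],
                    seed: int = 0) -> Dict[str, list]:
    """Randomly sample the same number of chunks for every class.

    Assumption: the shared count is the size of the smallest class.
    """
    rng = random.Random(seed)
    count = min(len(chunks) for chunks in chunks_by_class.values())
    return {label: rng.sample(chunks, count)
            for label, chunks in chunks_by_class.items()}
```

A fixed seed keeps the sample reproducible; the seed behind the published files is not stated.
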
Text:
  • the text files are short (~110 words per file) and are used as they are
  • no preprocessing was applied to these files apart from lowercasing and tokenization
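
For readers who want to preprocess their own text in a comparable way, lowercasing and tokenization can be sketched as follows. The tokenizer used for this dataset is not specified, so the regular expression below is an assumption: it keeps words (with an optional apostrophe part) and emits each punctuation mark as its own token.

```python
import re
from typing import List

# Assumption: words with an optional apostrophe suffix, or a single
# punctuation character. The dataset's actual tokenizer may differ.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def preprocess(text: str) -> List[str]:
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())
```

For example, `preprocess("I don't agree, really.")` returns `['i', "don't", 'agree', ',', 'really', '.']`.
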

For more details about this dataset, email: sergiu nisioi gmail dot com