Lineflow: Framework-Agnostic NLP Data Loader in Python

Lineflow is a simple text dataset loader for NLP deep learning tasks.

  • Lineflow is designed to be used with any deep learning framework.
  • Lineflow enables you to build data pipelines.
  • Lineflow supports a functional API and lazy evaluation.

Lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
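
To make "lazy evaluation" and "pipelines" concrete, here is a minimal pure-Python sketch of a lazily mapped dataset. It is not Lineflow's implementation: `LazyMapDataset` is a hypothetical class written only for illustration, and it merely mirrors the `map`, `first`, and `all` calls shown in this README.

```python
from typing import Any, Callable, List, Optional, Sequence


class LazyMapDataset:
    """Hypothetical illustration: mapped functions run only when an item is read."""

    def __init__(self, data: Sequence[Any],
                 funcs: Optional[List[Callable]] = None) -> None:
        self._data = data
        self._funcs = list(funcs or [])

    def map(self, func: Callable) -> "LazyMapDataset":
        # Return a new dataset with one more stage; nothing is computed yet.
        return LazyMapDataset(self._data, self._funcs + [func])

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> Any:
        # Apply every stage of the pipeline to this single item on demand.
        item = self._data[index]
        for func in self._funcs:
            item = func(item)
        return item

    def first(self) -> Any:
        return self[0]

    def all(self) -> List[Any]:
        return [self[i] for i in range(len(self))]


calls = []


def tokenize(line: str) -> List[str]:
    calls.append(line)  # record each call to show when work actually happens
    return line.split()


lines = ["i 'm a line 1 .", "i 'm a line 2 ."]
pipeline = LazyMapDataset(lines).map(tokenize).map(len)
```

In this sketch, building `pipeline` does not call `tokenize` at all; the function runs only when an item is requested, for example through `pipeline.first()`.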

Installation

To install Lineflow:

pip install lineflow

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all()  # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
ds.map(lambda x: x.split()).first()  # ["i", "'m", "a", "line", "1", "."]
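
To try the snippet above without an existing corpus, you can first write the three example lines to a temporary file. The sketch below uses only the standard library; `write_sample_corpus` and `SAMPLE_LINES` are hypothetical names introduced here, and the plain-Python reads at the end simply reproduce the values documented in the comments above.

```python
import os
import tempfile

# The three lines from the docstring in the usage example.
SAMPLE_LINES = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]


def write_sample_corpus() -> str:
    """Write the example lines to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("\n".join(SAMPLE_LINES) + "\n")
    return path


path = write_sample_corpus()

# Plain-Python equivalents of the documented results, for comparison.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines[0])          # first line of the file
print(len(lines))        # number of lines
print(lines[0].split())  # whitespace-tokenized first line
```

The returned `path` can then be passed to `lf.TextDataset(path)` in place of `'/path/to/text'`.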

Example

Load the predefined dataset:

>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train')
>>> train.first()
("i can 't tell who will arrive first .", '誰 が 一番 に 着 く か 私 に は 分か り ま せ ん 。')

Split each sentence into words:

>>> # continuing from above
>>> train = train.map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'],
 ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。'])

Obtain the words in the dataset:

>>> # continuing from above
>>> import lineflow as lf
>>> en_tokens = lf.flat_map(lambda x: x[0], train)
>>> en_tokens[:5]  # This is useful for building a vocabulary.
['i', 'can', "'t", 'tell', 'who']
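
As one way to use such a flat token sequence, the sketch below builds a token-to-index vocabulary with `collections.Counter`. It is not part of Lineflow: `build_vocab`, its `min_freq` and `specials` parameters, and the `<pad>`/`<unk>` tokens are assumptions made for illustration, and a short hard-coded list stands in for `en_tokens`.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_vocab(tokens: Iterable[str],
                min_freq: int = 1,
                specials: Sequence[str] = ("<pad>", "<unk>")) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer id (hypothetical helper)."""
    counter = Counter(tokens)
    # Special tokens take the lowest ids.
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


# Stand-in for the flattened English tokens obtained above.
tokens = ['i', 'can', "'t", 'tell', 'who', 'i', 'can', 'i']
vocab = build_vocab(tokens)

# Unknown words fall back to the "<unk>" id.
ids = [vocab.get(t, vocab["<unk>"]) for t in ['i', 'will', 'tell']]
```

Sorting by frequency keeps the most common tokens at low indices, which makes it easy to truncate the vocabulary to a fixed size later.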

Datasets

CNN / Daily Mail:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

IMDB:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

small_parallel_enja:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

SQuAD:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

WikiText-2 (added by sobamchan, thanks):

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')