## DataPipes development tutorial: loader DataPipes

Because a DataSet is now constructed by stacking `DataPipe`s, it is recommended to keep the functionality of each `DataPipe` as primitive as possible. For example, loading data from CSV files becomes a sequence of DataPipes: ListFiles, FileLoader, CSVParser.



`ExampleListFilesDataPipe` scans all files in the `root` folder and yields their full file names. To save memory, avoid building the entire file list in `__init__`; produce the names lazily in `__iter__` instead.

In [1]:
import csv
import io
import os

from torch.utils.data import IterDataPipe, functional_datapipe


class ExampleListFilesDataPipe(IterDataPipe):
    def __init__(self, *, root):
        self.root = root

    def __iter__(self):
        for (dirpath, dirnames, filenames) in os.walk(self.root):
            for file_name in filenames:
                yield os.path.join(dirpath, file_name)
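
The memory advice above can be illustrated without PyTorch. The following is a minimal sketch, assuming a plain generator function (the hypothetical `list_files_lazily`) in place of the DataPipe class and a temporary directory tree created only for the demonstration. It shows that a consumer can take a few file names without the whole tree being walked first.

```python
import itertools
import os
import tempfile


def list_files_lazily(root):
    """Yield full file names one at a time instead of building a list."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for file_name in filenames:
            yield os.path.join(dirpath, file_name)


# Build a small throwaway tree (3 folders, 2 files each) so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for folder_index in range(3):
        folder = os.path.join(root, f"part_{folder_index}")
        os.makedirs(folder)
        for file_index in range(2):
            with open(os.path.join(folder, f"example_{file_index}.csv"), "w") as handle:
                handle.write("1,a\n")

    # os.walk scans one directory at a time, so taking two names
    # does not require walking the whole tree.
    first_two = list(itertools.islice(list_files_lazily(root), 2))

print(len(first_two))
```

An eager version would call `list(...)` on the whole walk inside `__init__` and hold every name in memory for the lifetime of the object.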

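`ExampleFileLoaderDataPipe`, registered as `load_files_as_string`, consumes file names from `source_datapipe` and yields `(file_name, contents)` tuples, where the contents (the `lines` variable in the code) are the whole file read into a single string.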

In [2]:
@functional_datapipe('load_files_as_string')
class ExampleFileLoaderDataPipe(IterDataPipe):
    def __init__(self, source_datapipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for file_name in self.source_datapipe:
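            # newline='' leaves line endings untouched, as the csv module recommends.
            with open(file_name, newline='') as file:
                # Read the whole file into one string.
                lines = file.read()
                yield (file_name, lines)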


`ExampleCSVParserDataPipe`, registered as `parse_csv_files`, consumes `(file_name, contents)` tuples and expands each file's contents into CSV rows, prefixing every row with the file name.

In [3]:
@functional_datapipe('parse_csv_files')
class ExampleCSVParserDataPipe(IterDataPipe):
    def __init__(self, source_datapipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for file_name, lines in self.source_datapipe:
            reader = csv.reader(io.StringIO(lines))
            for row in reader:
                yield [file_name] + row


In [4]:
FOLDER = 'define your folder with csv files here'
FOLDER = '/home/vitaly/dataset/data'
# List files, keep only the CSV files, read each one, then parse its rows.
dp = (
    ExampleListFilesDataPipe(root=FOLDER)
    .filter(lambda filename: filename.endswith('.csv'))
    .load_files_as_string()
    .parse_csv_files()
)

for data in dp:
    print(data)

['/home/vitaly/dataset/data/datapipes/load/iter/test/example_2.csv', '10', " 'foo'"]
['/home/vitaly/dataset/data/datapipes/load/iter/test/example_2.csv', '11', " 'bar'"]
['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '12', " 'aaaa'"]
['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '13', " 'bbbb'"]
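
The second column in the output above keeps its leading space and single quotes. This is consistent with the defaults of `csv.reader` (quote character `"`, initial spaces kept), assuming the example files contain lines such as `10, 'foo'`. The file contents are not shown in this tutorial, so that is an inference from the printed rows. A small sketch of how dialect options would change the parsed rows:

```python
import csv
import io

# Assumed file content, inferred from the printed rows above.
text = "10, 'foo'\n11, 'bar'\n"

# Default dialect: delimiter ',', quotechar '"', spaces after the delimiter kept.
default_rows = list(csv.reader(io.StringIO(text)))

# Treat single quotes as the quote character and skip the space after each comma.
clean_rows = list(
    csv.reader(io.StringIO(text), quotechar="'", skipinitialspace=True)
)

print(default_rows)
print(clean_rows)
```

If cleaner values are wanted, options like these could be passed through to `csv.reader` inside the parser DataPipe; the tutorial's class does not expose them as written.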


This approach lets you replace any DataPipe in the chain to get different functionality. For example, you can pick individual files by passing a list of file names directly to the loader instead of listing a folder.


In [5]:
FILE = 'define your file with csv data here'
FILE = '/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv'
dp = ExampleFileLoaderDataPipe([FILE]).parse_csv_files()

for data in dp:
    print(data)

['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '12', " 'aaaa'"]
['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '13', " 'bbbb'"]
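
Swapping the source works because each stage only iterates over whatever it is given, so any iterable of file names is acceptable. Below is a minimal sketch of that idea, assuming plain generator functions in place of the DataPipe classes so that it runs without PyTorch. The names `load_files` and `parse_csv` are hypothetical and not part of `torch.utils.data`, and the CSV content is made up for the demonstration.

```python
import csv
import io
import os
import tempfile


def load_files(file_names):
    """Yield (file_name, contents) for every name in any iterable."""
    for file_name in file_names:
        with open(file_name, newline="") as handle:
            yield file_name, handle.read()


def parse_csv(named_contents):
    """Expand each (file_name, contents) pair into rows prefixed by the name."""
    for file_name, contents in named_contents:
        for row in csv.reader(io.StringIO(contents)):
            yield [file_name] + row


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "example_1.csv")
    with open(path, "w", newline="") as handle:
        handle.write("12,aaaa\n13,bbbb\n")

    # A plain list is a valid source: the loader never checks where names come from.
    rows = list(parse_csv(load_files([path])))

# Print the rows without the temporary path prefix.
print([row[1:] for row in rows])
```

The same two stages would work unchanged if the list were replaced by a generator that walks a folder, which is the substitution the tutorial performs in the opposite direction.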
