# Test the Datasets Class

Import the Lexos API classes and define the paths to the available test datasets. The `LABELS` list provides labels for the texts available in all of the test datasets.

In [1]:
from lexos.io.dataset import Dataset, DatasetLoader
from lexos.io.smart import Loader

dataset_types = {
    "raw_str_no_headers": "Test1\nTest2",
    "local_path": "../test_data/datasets/base.txt",
    "local_csv_valid": "../test_data/datasets/csv_valid.csv",
    "local_tsv_valid": "../test_data/datasets/csv_valid.tsv",
    "local_json_valid": "../test_data/datasets/json_valid.json",
    "local_jsonl_valid": "../test_data/datasets/jsonl_valid.jsonl",
    "local_excel_valid": "../test_data/datasets/excel_valid.xlsx",
    "local_csv_invalid": "../test_data/datasets/csv_invalid.csv",
    "local_tsv_invalid": "../test_data/datasets/csv_invalid.tsv",
    "local_json_invalid": "../test_data/datasets/json_invalid.json",
    "local_jsonl_invalid": "../test_data/datasets/jsonl_invalid.jsonl",
    "local_excel_invalid": "../test_data/datasets/excel_invalid.xlsx",
    "local_zip_csv_valid_short": "../test_data/datasets/csv_valid_short.zip",
    "local_dir_csv_valid": "../test_data/datasets/dir_csv_valid",
    "remote_csv_valid": "https://github.com/scottkleinman/lexos/raw/main/tests/test_data/datasets/csv_valid.csv",
    "remote_dir_csv_valid": "https://github.com/scottkleinman/lexos/tree/main/tests/test_data/datasets/dir_csv_valid"
}

LABELS = [
    "Ainsworth_Guy_Fawkes",
    "Ainsworth_Lancashire_Witches",
    "Ainsworth_Old_Saint_Pauls",
    "Ainsworth_Tower_of_London",
    "Ainsworth_Windsor_Castle"
]

In the cell below, select the dataset to test by its key in `dataset_types`.

In [2]:
source = dataset_types["local_csv_valid"]
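
The right constructor depends on what kind of source was selected. As a rough aid, the sketch below guesses the kind of a source from the shape of the string alone. The `describe_source` helper is hypothetical and not part of Lexos; it never touches the file system or the network, so it cannot tell whether a path actually exists.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def describe_source(source: str) -> str:
    """Guess the kind of a dataset source from the string alone."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Remote sources have an http(s) scheme.
        kind, path = "remote", parsed.path
    elif "\n" in source:
        # A string containing newlines is treated as raw text, not a path.
        return "raw string"
    else:
        kind, path = "local", source
    # Use the file extension, if any, to name the format.
    suffix = PurePosixPath(path).suffix.lstrip(".").lower()
    return f"{kind} {suffix}" if suffix else f"{kind} directory"


print(describe_source("../test_data/datasets/csv_valid.csv"))
print(describe_source("../test_data/datasets/dir_csv_valid"))
print(describe_source("Test1\nTest2"))
```

A source with no file extension is reported as a directory, which is only a guess based on the naming used in `dataset_types`.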

### Test the `Dataset` Class

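Note that you may need to change the constructor method or its arguments, depending on the format of the source data you are using. For example, the commented-out line below uses `Dataset.parse_string` with `labels=LABELS`, which suits a raw string without headers.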

In [3]:
# dataset = Dataset.parse_string(source, labels=LABELS)
dataset = Dataset.parse_csv(source)
for item in dataset:
    print(f"{item['title']}: {item['text'][0:50]}...")

Ainsworth_Guy_Fawkes: ﻿The Project Gutenberg EBook of Guy Fawkes, by Wil...
Ainsworth_Lancashire_Witches: ﻿Project Gutenberg's The Lancashire Witches, by Wi...
Ainsworth_Old_Saint_Pauls: ﻿Project Gutenberg's Old Saint Paul's, by William ...
Ainsworth_Tower_of_London: ﻿ Project Gutenberg's The Tower of London, by Will...
Ainsworth_Windsor_Castle: ﻿The Project Gutenberg EBook of Windsor Castle, by...
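
The loop above treats each item as a mapping with `title` and `text` keys. For orientation, here is a rough standard-library analogue of reading such a CSV and printing an excerpt of each row. It is a sketch, not how Lexos implements `parse_csv`; the `read_title_text_csv` helper, its error handling, and the inline sample data are all hypothetical.

```python
import csv
import io


def read_title_text_csv(handle) -> list[dict]:
    """Read rows from a CSV that has `title` and `text` header columns."""
    reader = csv.DictReader(handle)
    # Fail early if either expected column is absent from the header row.
    missing = {"title", "text"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return [{"title": row["title"], "text": row["text"]} for row in reader]


# Hypothetical sample data standing in for a file such as csv_valid.csv.
sample = io.StringIO(
    'title,text\nDoc_A,"First text, with a comma."\nDoc_B,Second text.\n'
)
items = read_title_text_csv(sample)
for item in items:
    # Same excerpt pattern as the notebook cell: first 50 characters.
    print(f"{item['title']}: {item['text'][0:50]}...")
```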


### Test the `DatasetLoader` Class

Note that you may need to change the arguments, depending on the format of the source data you are using. For example, the commented-out line below passes `labels=LABELS`.

In [4]:
# loader = DatasetLoader(source, labels=LABELS)
loader = DatasetLoader(source)
for item in loader:
    print(f"{item['title']}: {item['text'][0:50]}...")

Ainsworth_Guy_Fawkes: ﻿The Project Gutenberg EBook of Guy Fawkes, by Wil...
Ainsworth_Lancashire_Witches: ﻿Project Gutenberg's The Lancashire Witches, by Wi...
Ainsworth_Old_Saint_Pauls: ﻿Project Gutenberg's Old Saint Paul's, by William ...
Ainsworth_Tower_of_London: ﻿ Project Gutenberg's The Tower of London, by Will...
Ainsworth_Windsor_Castle: ﻿The Project Gutenberg EBook of Windsor Castle, by...


### Display Object Properties

Using `Dataset`:

In [5]:
print("Dataset Names:")
print(dataset.names)

print()

print("Dataset Excerpts:")
print([f"{x[0:46]}..." for x in dataset.texts])

print()

print("Export to Pandas Dataframe:")
display(dataset.df().head())

Dataset Names:
['Ainsworth_Guy_Fawkes', 'Ainsworth_Lancashire_Witches', 'Ainsworth_Old_Saint_Pauls', 'Ainsworth_Tower_of_London', 'Ainsworth_Windsor_Castle']

Dataset Excerpts:
['\ufeffThe Project Gutenberg EBook of Guy Fawkes, by...', "\ufeffProject Gutenberg's The Lancashire Witches, b...", "\ufeffProject Gutenberg's Old Saint Paul's, by Will...", "\ufeff Project Gutenberg's The Tower of London, by ...", '\ufeffThe Project Gutenberg EBook of Windsor Castle...']

Export to Pandas Dataframe:


|   | title | text |
|---|-------|------|
| 0 | Ainsworth_Guy_Fawkes | The Project Gutenberg EBook of Guy Fawkes, by... |
| 1 | Ainsworth_Lancashire_Witches | Project Gutenberg's The Lancashire Witches, b... |
| 2 | Ainsworth_Old_Saint_Pauls | Project Gutenberg's Old Saint Paul's, by Will... |
| 3 | Ainsworth_Tower_of_London | Project Gutenberg's The Tower of London, by ... |
| 4 | Ainsworth_Windsor_Castle | The Project Gutenberg EBook of Windsor Castle... |
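
The excerpts above begin with `\ufeff`, the Unicode byte order mark (BOM), which suggests the source texts were saved as UTF-8 with a BOM. If that character is unwanted, one option is to strip it after loading. The sketch below works on plain strings with a hypothetical `strip_bom` helper; it does not modify the Lexos objects themselves.

```python
BOM = "\ufeff"


def strip_bom(text: str) -> str:
    """Remove a single leading byte order mark, if present."""
    return text[len(BOM):] if text.startswith(BOM) else text


# Stand-in for values such as those in `dataset.texts`.
texts = ["\ufeffThe Project Gutenberg EBook of Guy Fawkes", "No BOM here"]
cleaned = [strip_bom(t) for t in texts]
print(cleaned)

# When reading a file yourself, the "utf-8-sig" codec drops the BOM on decode.
raw = "\ufeffProject Gutenberg's Old Saint Paul's".encode("utf-8")
decoded = raw.decode("utf-8-sig")
print(decoded)
```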


Using `DatasetLoader`:

In [6]:
print("Loader Names:")
print(loader.names)

print()

print("Loader Excerpts:")
print([f"{x[0:46]}..." for x in loader.texts])

print()

print("Export to Pandas Dataframe:")
display(loader.df().head())

Loader Names:
['Ainsworth_Guy_Fawkes', 'Ainsworth_Lancashire_Witches', 'Ainsworth_Old_Saint_Pauls', 'Ainsworth_Tower_of_London', 'Ainsworth_Windsor_Castle']

Loader Excerpts:
['\ufeffThe Project Gutenberg EBook of Guy Fawkes, by...', "\ufeffProject Gutenberg's The Lancashire Witches, b...", "\ufeffProject Gutenberg's Old Saint Paul's, by Will...", "\ufeff Project Gutenberg's The Tower of London, by ...", '\ufeffThe Project Gutenberg EBook of Windsor Castle...']

Export to Pandas Dataframe:


|   | title | text |
|---|-------|------|
| 0 | Ainsworth_Guy_Fawkes | The Project Gutenberg EBook of Guy Fawkes, by... |
| 1 | Ainsworth_Lancashire_Witches | Project Gutenberg's The Lancashire Witches, b... |
| 2 | Ainsworth_Old_Saint_Pauls | Project Gutenberg's Old Saint Paul's, by Will... |
| 3 | Ainsworth_Tower_of_London | Project Gutenberg's The Tower of London, by ... |
| 4 | Ainsworth_Windsor_Castle | The Project Gutenberg EBook of Windsor Castle... |


### Add Datasets to a Standard Loader

The example below uses the `dataset` and `loader` variables defined in the cells above.

In [None]:
standard_loader = Loader()

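# Option 1: copy the texts and names from a `Dataset`
standard_loader.texts = dataset.texts
standard_loader.names = dataset.names

# Option 2: copy them from a `DatasetLoader` instead
# (as written, this replaces the values assigned in option 1)
standard_loader.texts = loader.texts
standard_loader.names = loader.names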
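
As written, the cell above assigns `texts` and `names` twice, so with plain attribute assignment the second pair would replace the first. If the goal is to combine texts from several sources, the lists can be extended instead. The sketch below uses a hypothetical `SimpleLoader` stand-in with `names` and `texts` list attributes, not the Lexos `Loader`; whether `Loader` can be extended in the same way is an assumption to verify against the Lexos documentation.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleLoader:
    """Hypothetical stand-in holding parallel lists of names and texts."""

    names: list = field(default_factory=list)
    texts: list = field(default_factory=list)

    def add(self, names, texts, skip_duplicates=True):
        """Append name/text pairs, optionally skipping names already present."""
        if len(names) != len(texts):
            raise ValueError("names and texts must have the same length")
        for name, text in zip(names, texts):
            if skip_duplicates and name in self.names:
                continue
            self.names.append(name)
            self.texts.append(text)


combined = SimpleLoader()
combined.add(["Doc_A", "Doc_B"], ["Text A", "Text B"])
# Doc_B is already present, so only Doc_C is added here.
combined.add(["Doc_B", "Doc_C"], ["Text B again", "Text C"])
print(combined.names)
print(combined.texts)
```

Skipping duplicate names matters here because `dataset` and `loader` were built from the same source, so adding both without a check would store every text twice.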
