# `Loader` Tutorial
   
Python has numerous ways to open files on your computer or download them from the internet. The Lexos `Loader` is a "helper" that invisibly takes care of many of the gotchas (like non-standard character encodings) so that you can get on with your work.

The `Loader` is in active development and exists in several versions. The most advanced is currently the "smart" version, which is the one referenced in the `import` statement below.

You use the `Loader` by first instantiating the `Loader` class and then calling its `load()` method. Because loading is a separate step, you can call `load()` more than once to keep adding texts to the same loader.

## Import the `Loader` Module

In [None]:
from lexos.io.smart import Loader

## Load a Local File or a List of Local Files

Notice in the list below that you can load `.txt`, `.docx`, or `.pdf` formats.

When files are loaded into a `Loader`, their character encoding is automatically converted into UTF-8 format.

You can see the names of the texts you have loaded by printing `Loader.names`. The file paths can be accessed from `Loader.locations`.

In [None]:
# A single file
data = "../test_data/txt/Austen_Pride.txt"

loader1 = Loader()
loader1.load(data)

# A list of files
data = ["../test_data/txt/Austen_Pride.txt",
        "../test_data/docx/Austen_Sense_sm.docx",
        "../test_data/pdf/Austen_Pride_sm.pdf"]

loader2 = Loader()
loader2.load(data)

print(f"Loader 1: {loader1.names}")
print()
print(f"Loader 2: {loader2.locations}")
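
The character encoding conversion described above can be pictured with a small sketch. This is not the Lexos implementation: `decode_to_text` is a hypothetical helper, and the list of candidate encodings is an assumption chosen for illustration. It only shows the general idea of turning bytes in a legacy encoding into text that can be stored as UTF-8.

```python
def decode_to_text(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    """Decode raw bytes by trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if the candidate list has no catch-all such as latin-1
    return raw.decode("utf-8", errors="replace")


# "café" encoded as Windows-1252 is not valid UTF-8
legacy_bytes = "café".encode("cp1252")
text = decode_to_text(legacy_bytes)

# Once decoded, the text can be re-encoded as UTF-8
utf8_bytes = text.encode("utf-8")
print(text, utf8_bytes)
```

The order of the candidates matters, because `latin-1` can decode any byte sequence and so must come last.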


## Accessing Texts in a `Loader`

Texts are accessed with `Loader.texts`. This is a list, so, if you wish to access a single text, you must do so by its index in the list (e.g. `Loader.texts[0]`). For example:

In [None]:
# Print a single text (first 100 characters)
print("Text1:")
print("==========================")
print(f"{loader2.texts[0][0:100]}...\n")

# Print multiple texts (first 100 characters)
for i, item in enumerate(loader2.texts):
    print(f"Text{i + 1}:")
    print("==========================")
    print(f"{item[0:100]}...\n")

You can also loop through the `Loader` directly and print the text of each item with the `text` property. Each item also exposes its `name`, `location`, and `source`.

In [None]:

# Loop through the Loader
for i, item in enumerate(loader2):
    print(f"Text{i + 1}: {item.name}")
    print("==========================")
    print(f"{item.text[0:100]}...\n")

## Loading Local Directories or Zip Files   

Directories or zip files containing files with `.txt`, `.docx`, and `.pdf` extensions can be loaded just like other files.

In [None]:
# Get all the files in the docx directory
loader1 = Loader()
loader1.load("../test_data/docx")

# Get all the files in a zip file (use a fresh Loader so it holds only these files)
loader2 = Loader()
loader2.load("../test_data/zip/txt.zip")

# Print the first 100 characters of the first file in the directory
print(loader1.texts[0][0:100])

# Print the first 100 characters of the first file in the zip file
print(loader2.texts[0][0:100])
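
Before loading an archive, it can help to check which of its members have one of the extensions listed above. The sketch below uses only the standard library. `list_supported_files` is a hypothetical helper, not part of Lexos, and the tutorial does not say how the `Loader` treats other file types, so this is only a way to inspect an archive.

```python
import io
import zipfile

SUPPORTED_EXTENSIONS = (".txt", ".docx", ".pdf")


def list_supported_files(archive) -> list:
    """Return the names of archive members with a supported extension."""
    with zipfile.ZipFile(archive) as zf:
        return [
            name for name in zf.namelist()
            if name.lower().endswith(SUPPORTED_EXTENSIONS)
        ]


# Build a small in-memory archive so the example runs without test data
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Austen_Pride.txt", "It is a truth universally acknowledged...")
    zf.writestr("notes.md", "not a supported format")

buffer.seek(0)
print(list_supported_files(buffer))
```

The same function accepts a path such as `"../test_data/zip/txt.zip"` in place of the in-memory buffer.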

## Load Texts from a URL

Use the same technique to download a text or texts from a URL or a list of URLs.

In [None]:
loader = Loader()
loader.load("https://www.gutenberg.org/files/84/84-0.txt")

print(loader.texts[0][0:1000])

## Load a Dataset

Text analysis often requires datasets consisting of a large number of documents. Such datasets are often packaged with multiple documents in a single file, in a variety of formats. The `DatasetLoader` class provides a convenient means of loading many common formats. Valid inputs are:

- Plain text files with one document per line
- CSV and TSV files with one document per line
- Excel files
- JSON files
- JSONL files (newline-delimited JSON)
- Folders and zip archives containing files in the above formats
- URLs to files in the above formats

A simple example of the use of the `DatasetLoader` class is given below. In this example we have a plain text file with one document per line and no titles. If we try to load it with `DatasetLoader(source)`, we will receive an error. In order to get around this problem, we supply a list of titles using the `labels` parameter. If you do not know the number of lines in your dataset, you can use `labels=[1]`, and you will get an error telling you how many lines are in the file (and thus how many labels you need to supply).

Note that the other formats listed above often require you to specify metadata information. These requirements are discussed further below.

In [None]:
from lexos.io.dataset import DatasetLoader

source = "../test_data/datasets/base.txt"
labels = [
    "Ainsworth_Guy_Fawkes",
    "Ainsworth_Lancashire_Witches",
    "Ainsworth_Old_Saint_Pauls",
    "Ainsworth_Tower_of_London",
    "Ainsworth_Windsor_Castle"
]

dataset_loader = DatasetLoader(source, labels=labels)

# Print a list of titles in the dataset
print(dataset_loader.names)

print("\n==========================\n")

# Iterate through the DatasetLoader and print items from its data dict
for item in dataset_loader:
    print(f"{item['title']}: {item['text'][0:50]}...")
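
If you would rather not trigger an error to find out how many labels a line-delimited file needs, you can count the lines yourself. The sketch below is one way to do that with the standard library. `count_documents` and `make_labels` are hypothetical helpers, and it assumes that every non-empty line is one document, which may not match how `DatasetLoader` treats blank lines.

```python
import tempfile
from pathlib import Path


def count_documents(path) -> int:
    """Count non-empty lines, assuming one document per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def make_labels(n: int, prefix: str = "doc") -> list:
    """Generate placeholder labels such as doc_1, doc_2, ..."""
    return [f"{prefix}_{i}" for i in range(1, n + 1)]


# Write a tiny line-delimited dataset to a temporary file for demonstration
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text(
        "First document.\nSecond document.\nThird document.\n",
        encoding="utf-8",
    )
    n_docs = count_documents(sample)

placeholder_labels = make_labels(n_docs)
print(n_docs, placeholder_labels)
```

A list built this way could then be supplied through the `labels` parameter in place of hand-written titles.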


You can access the titles and texts using `DatasetLoader.names` and `DatasetLoader.texts`, or you can access them together in a dict as `DatasetLoader.data`.

The code below loads a CSV file.

In [None]:
source = "../test_data/datasets/csv_valid.csv"

dataset_loader = DatasetLoader(source)

for item in dataset_loader: 
    print(f"{item['title']}: {item['text'][0:50]}...")

Notice that the test file is called `csv_valid.csv`. This naming convention indicates that the first line of the CSV file is "title,text", the two headers required by the `DatasetLoader`. If your CSV file has different headers for the title and text, you can indicate the headers that should be converted with `title_col` and `text_col`. You can see this in action in the following example:

In [None]:
source = "../test_data/datasets/csv_invalid.csv"

dataset_loader = DatasetLoader(source, title_col="label", text_col="content")

for item in dataset_loader: 
    print(f"{item['title']}: {item['text'][0:50]}...")
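
To make the header mapping concrete, here is a minimal standard-library sketch of the same idea: read rows under arbitrary column names and return them under the standard `title` and `text` keys. `read_title_text` is a hypothetical function written for this illustration, and the sample data is invented. It is not how `DatasetLoader` is implemented.

```python
import csv
import io


def read_title_text(csv_text, title_col="title", text_col="text", sep=","):
    """Read CSV rows and return dicts with standard 'title' and 'text' keys."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=sep)
    return [
        {"title": row[title_col], "text": row[text_col]}
        for row in reader
    ]


# Hypothetical CSV whose headers are "label" and "content"
csv_text = 'label,content\nDoc1,"First text, with a comma."\nDoc2,Second text.\n'
records = read_title_text(csv_text, title_col="label", text_col="content")

for item in records:
    print(f"{item['title']}: {item['text'][0:50]}...")
```

Passing `sep="\t"` to this sketch handles tab-separated data in the same way.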

If you have a tab-separated file (TSV), simply add the parameter `sep="\t"`.

The `DatasetLoader` will also load Excel files and takes the `title_col` and `text_col` parameters.

Loading a file from JSON works the same way, except that, if you don't have `title` and `text` fields, you should specify which fields should be used with the `title_field` and `text_field` parameters. Additionally, if your JSON is newline-delimited, you should specify `lines=True`.

In [None]:
source = "../test_data/datasets/json_valid.json"
# source = "../test_data/datasets/json_invalid.json"
# source = "../test_data/datasets/jsonl_valid.jsonl"

dataset_loader = DatasetLoader(source, title_field="label", text_field="content")

for item in dataset_loader: 
    print(f"{item['title']}: {item['text'][0:50]}...")
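
The difference between a standard JSON file and a newline-delimited one is easy to see with the standard `json` module. The sketch below uses invented sample records and does not call Lexos. It shows why the two layouts need different handling: a JSONL file is a sequence of separate JSON objects, so it has to be parsed line by line.

```python
import json

# A standard JSON file holds one array of records
json_text = (
    '[{"title": "Doc1", "text": "First text."}, '
    '{"title": "Doc2", "text": "Second text."}]'
)
records_from_json = json.loads(json_text)

# A newline-delimited JSON (JSONL) file holds one record per line
jsonl_text = (
    '{"title": "Doc1", "text": "First text."}\n'
    '{"title": "Doc2", "text": "Second text."}\n'
)
records_from_jsonl = [
    json.loads(line) for line in jsonl_text.splitlines() if line.strip()
]

# Both layouts yield the same records once parsed
print(records_from_json == records_from_jsonl)
```

Parsing the JSONL string with a single `json.loads()` call raises a `json.JSONDecodeError`, which is why a newline-delimited source has to be flagged separately.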

### The `Dataset` Class

`DatasetLoader` is actually a wrapper for the `Dataset` class, which has parsing methods for different dataset formats. These methods can be called individually with commands like `dataset = Dataset.parse_string(source, labels=LABELS)`. The available methods are:

- `parse_string()`
- `parse_csv()`
- `parse_excel()`
- `parse_json()`
- `parse_jsonl()`

Each method takes the same arguments described for the `DatasetLoader` class.

Apart from `parse_string()`, these methods read their source using methods from the pandas library: `pandas.read_csv()`, `pandas.read_excel()`, `pandas.read_json()`. Any keywords accepted by those methods can also be passed through their equivalent `Dataset` methods.

An example is provided below.

In [None]:
from lexos.io.dataset import Dataset

source = "../test_data/datasets/csv_valid.csv"

dataset = Dataset.parse_csv(source)

for item in dataset:
    print(f"{item['title']}: {item['text'][0:50]}...")

## Adding Datasets to a Standard Lexos Loader

If you already have a `Loader`, it is easy to add datasets to it.

In [None]:

# Import the loaders
from lexos.io.smart import Loader
from lexos.io.dataset import Dataset, DatasetLoader

# Create an empty `Loader`
loader = Loader()

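# Create a `DatasetLoader` from the line-delimited text file
# (`labels` is the list of titles defined earlier in this tutorial)
dataset_loader = DatasetLoader("../test_data/datasets/base.txt", labels=labels)

# Load a CSV dataset with `Dataset`
dataset = Dataset.parse_csv("../test_data/datasets/csv_valid.csv")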

# Add the text and names for each dataset to the standard loader
for item in [dataset_loader, dataset]:
    loader.names.extend(item.names)
    loader.texts.extend(item.texts)

# Print the names of the first 10 documents
print(loader.names[0:10])