# Loading Texts

The `IO` module contains classes and methods for loading texts and text data from various sources and formats into a consistent structure so they can be analyzed within the Lexos environment.

This module contains three main components:

1. `BaseLoader`: Central abstract class. Not used directly for loading files, but provides a blueprint and common features for the other loader classes.
2. `Loader`: The main loader used for Lexos. Designed to handle individual files (.txt, .pdf, and .docx), directories of files, and zip archives.
3. `DataLoader`: A specialized loader for structured data files such as CSVs, JSON, or Excel files.

## Common Features in `BaseLoader`

`BaseLoader` is an abstract base class that defines the methods and features shared by the other loader types. Both `DataLoader` and `Loader` inherit the key features listed here.

### Key Features:

All loaders built on `BaseLoader` have the following attributes for storing loaded data:
- `paths`: File paths or other sources of the loaded texts.
- `mime_types`: MIME types of the loaded items.
- `names`: Names assigned to each loaded text.
- `texts`: The text content of the loaded items.
- `errors`: Any errors encountered during loading.

Additionally, loaders have access to the following properties and methods:
- `records` **(property)**: Returns a list of dictionaries, with each representing a loaded item with keys such as `name`, `path`, and `mime_type`.
- `data` **(property)**: Returns a single dictionary containing all of the data stored in the loader.
- `df` **(property)**: Returns the loaded file records in the form of a Pandas DataFrame.
- `load_dataset` **(method)**: Abstract method to be implemented by loaders.
- `dedupe` **(method)**: Removes duplicate entries from the loaded data and returns a DataFrame with the duplicates removed. The fields to check for duplication can be specified.
- `show_duplicates` **(method)**: Returns a DataFrame containing any duplicates found in the data. The fields to check for duplication can be specified.
- `reset` **(method)**: Clears all data from a loader instance, returning it to an empty loader.
- `to_csv`, `to_excel`, `to_json` **(methods)**: Methods for saving the loaded data to various common file formats.
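
The relationship between the parallel storage attributes and the `records` property can be sketched with a toy class. This is a minimal illustration built on Python's `abc` module, not the Lexos implementation: `SketchBaseLoader` and `ListLoader` are hypothetical names, only three of the attributes are modeled, and the record keys are simplified.

```python
from abc import ABC, abstractmethod


class SketchBaseLoader(ABC):
    """Hypothetical stand-in for an abstract loader (not the Lexos class)."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Return the loader to its empty state."""
        # Parallel lists: index i in each list describes the same item.
        self.paths = []
        self.names = []
        self.texts = []

    @property
    def records(self):
        """Return one dictionary per loaded item."""
        return [
            {"name": name, "path": path, "text": text}
            for name, path, text in zip(self.names, self.paths, self.texts)
        ]

    @abstractmethod
    def load_dataset(self, other):
        """Subclasses decide how another loader's data is merged in."""


class ListLoader(SketchBaseLoader):
    """Hypothetical concrete loader that copies items from another loader."""

    def load_dataset(self, other):
        self.paths.extend(other.paths)
        self.names.extend(other.names)
        self.texts.extend(other.texts)


source = ListLoader()
source.paths.append("sample_files/simple_text1.txt")
source.names.append("simple_text1")
source.texts.append("Hello, Lexos.")

target = ListLoader()
target.load_dataset(source)
print(target.records)

# The abstract class itself cannot be instantiated.
try:
    SketchBaseLoader()
except TypeError as error:
    print(f"TypeError: {error}")
```

Keeping the data in parallel lists makes merging a matter of extending each list, while `records` offers a per-item view when one is more convenient.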

`BaseLoader` is abstract and therefore cannot be instantiated directly. These features are demonstrated below using a `Loader` instance.

## Loading Files with `Loader`

The `Loader` (from `lexos.io.loader`) is the main class used for loading individual file types such as plain text, PDFs, and Word documents. It contains all the features from `BaseLoader`. It can also scan entire directories or zip files. `Loader` automatically detects file types using their MIME types.
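
To make the idea of MIME-type detection concrete, here is a small sketch using Python's standard `mimetypes` module. It guesses the type from the file extension only; how `Loader` performs detection internally is not described here, so treat this as an illustration of the concept rather than of the Lexos code. The helper name `guess_mime_type` and the file names are hypothetical.

```python
import mimetypes
from pathlib import Path


def guess_mime_type(path):
    """Guess a MIME type from the file extension (None if unknown)."""
    mime_type, _encoding = mimetypes.guess_type(Path(path).name)
    return mime_type


# Print the guessed type for a few illustrative file names.
for candidate in ["simple_text1.txt", "report.pdf", "archive.zip"]:
    print(candidate, "->", guess_mime_type(candidate))
```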

In [None]:
# Imports
from lexos.io.loader import Loader
from pathlib import Path

### Loading Individual Files and Directories

The main method is `load()`, which accepts a single file/directory path or a list of paths.

In [None]:
# Create a Loader instance
file_loader = Loader()

# Load a single text file
file_path = Path("sample_files") / "simple_text1.txt"
file_loader.load([file_path])

print(f"\nLoaded {len(file_loader.texts)} item(s) so far.")
print(file_loader.names)

# Load all files located in the sample_folder directory
directory_path = Path("sample_files") / "sample_folder"
file_loader.load([directory_path])

print(f"\nLoaded {len(file_loader.texts)} item(s) so far.")
print(file_loader.names)

# Load multiple files using a list of paths
paths = [Path("sample_files/simple_text2.txt"), Path("sample_files/simple_text3.docx")]
file_loader.load(paths)

print(f"\nLoaded {len(file_loader.texts)} item(s) so far.")
print(file_loader.names)

## Loading Structured Data with `DataLoader`

The `DataLoader` (from `lexos.io.data_loader`) is designed for importing data sources that are already structured. It works well for documents that are organized within a single file or a similar set of files. It contains all the features from `BaseLoader`.

In [None]:
# Imports
from lexos.io.data_loader import DataLoader
from pathlib import Path

### CSV files

The `load_csv()` method reads data from CSV (Comma-Separated Values) files. The optional `name_col` and `text_col` parameters specify which columns contain the document names/identifiers and the text.

TSV (Tab-Separated Values) files can also be read by adding `sep="\t"` to the arguments for `load_csv`.
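
What changing the separator means can be seen in a small standalone sketch that uses Python's standard `csv` module instead of Lexos. The sample content is made up for illustration, and the column names `text_name` and `text_content` are the ones this section uses for `name_col` and `text_col`.

```python
import csv
import io

# Hypothetical TSV content: fields are separated by tabs, so commas
# inside a field need no special handling.
tsv_text = (
    "text_name\ttext_content\n"
    "doc1\tFirst text, with a comma.\n"
    "doc2\tSecond text.\n"
)

# delimiter="\t" plays the same role here as sep="\t" does for load_csv.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

# Selecting a name column and a text column yields two parallel lists.
names = [row["text_name"] for row in rows]
texts = [row["text_content"] for row in rows]
print(names)
print(texts)
```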

In [None]:
# Create a new DataLoader instance
data_loader_csv = DataLoader()

# Load a CSV file
csv_file_path = Path("sample_files") / "test_data.csv"
data_loader_csv.load_csv(csv_file_path, name_col="text_name", text_col="text_content") # Specify the columns for names and content

# Print the loaded data
data_loader_csv.data


### JSON, Excel, and Lineated Texts

The remaining file types supported by `DataLoader` follow the same workflow shown for the CSV file above. The key difference is the parameter names used to pass in the fields for the names and text.

#### CSV, Excel, JSON

All three follow the same workflow as the example shown above for a CSV file. JSON, however, uses different parameter names for the name and text fields.

- `load_csv()` / `load_excel()`
    - Name: `name_col`
    - Text: `text_col`
- `load_json()`
    - Name: `name_field`
    - Text: `text_field`
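
The role of `name_field` and `text_field` can be illustrated with a sketch based on Python's standard `json` module. It assumes the JSON holds a list of objects, one per document; the layout that `load_json()` actually expects is not specified here, and the keys `title` and `body` are hypothetical.

```python
import json

# Hypothetical JSON content: a list of objects, one per document.
json_text = """
[
    {"title": "doc1", "body": "First text."},
    {"title": "doc2", "body": "Second text."}
]
"""

# These play the role of the name_field and text_field parameters.
name_field = "title"
text_field = "body"

items = json.loads(json_text)
names = [item[name_field] for item in items]
texts = [item[text_field] for item in items]
print(names)
print(texts)
```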


#### Lineated Text

To read data from a lineated text file with custom document names, supply a list containing one name per line in the file, for example `custom_doc_names = ["L3", "L2", "L1"]`. Pass this list to `load_lineated_text()` using `names_list=custom_doc_names`.
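
The one-name-per-line requirement can be sketched without Lexos. The helper `pair_lines_with_names` below is hypothetical and only shows how each line lines up with its label; how `load_lineated_text()` handles a length mismatch is not covered here, so the `ValueError` is an assumption made for this sketch.

```python
def pair_lines_with_names(text, names_list):
    """Pair each line of a lineated text with the name at the same position."""
    lines = text.splitlines()
    if len(lines) != len(names_list):
        raise ValueError(
            f"Expected {len(lines)} names, one per line, got {len(names_list)}."
        )
    return list(zip(names_list, lines))


# Hypothetical three-line file content and the matching list of names.
lineated_text = "Text of the third item\nText of the second item\nText of the first item"
custom_doc_names = ["L3", "L2", "L1"]

pairs = pair_lines_with_names(lineated_text, custom_doc_names)
for name, line in pairs:
    print(f"{name}: {line}")
```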


#### Merging DataLoaders

Lastly, data can be merged from one `DataLoader` into another using the `load_dataset()` method. The following example shows the data from the previously used `data_loader_csv` being merged into a new, empty `DataLoader`.

In [None]:
# Create a new empty DataLoader instance
new_data_loader = DataLoader()

# Merge data from the existing DataLoader into the new one
new_data_loader.load_dataset(data_loader_csv)

# Print the merged data
new_data_loader.data

### Merging Data into Loaders

Like `DataLoader`, the `Loader` class also has a `load_dataset()` method. It takes a `DataLoader` instance as an argument and merges its data into the data already held by the `Loader`.



In [None]:
# Merge the CSV data from the first DataLoader example into file_loader
print(f"\nLoaded {len(file_loader.texts)} item(s) so far.")
print(file_loader.names)

file_loader.load_dataset(data_loader_csv)

print(f"\nLoaded {len(file_loader.texts)} item(s) so far.")
print(file_loader.names)