A fast and memory-efficient indexing tool for reading large-scale corpora.
CorpusReader is a Python class that uses an index to access documents in a corpus efficiently. It is particularly useful for large-scale corpora, where standard reading methods can be memory-intensive.
from corpus_reader import CorpusReader
reader = CorpusReader(index_path="path_to_index", verbose=True)
You can fetch a document by its ID (string) or by its index (integer).
doc = reader["document_id"]
doc = reader[index]
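The dual lookup above can be pictured with a small sketch. It assumes, purely for illustration, a JSON Lines data file plus an index that stores one byte offset per document and an ID-to-position map. OffsetReader and its fields are hypothetical names and are not CorpusReader's actual internals.

```python
import json
import os
import tempfile
from typing import Dict, List, Union


class OffsetReader:
    """Hypothetical reader: fetches one line at a time by byte offset."""

    def __init__(self, data_path: str, offsets: List[int], ids: Dict[str, int]):
        self._file = open(data_path, "rb")
        self._offsets = offsets  # position -> byte offset where the line starts
        self._ids = ids          # document ID -> position

    def __getitem__(self, index: Union[int, str]) -> str:
        # A string is treated as a document ID, an integer as a position.
        position = self._ids[index] if isinstance(index, str) else index
        self._file.seek(self._offsets[position])
        return self._file.readline().decode("utf-8").rstrip("\n")

    def __len__(self) -> int:
        return len(self._offsets)

    def close(self) -> None:
        self._file.close()


# Build a tiny two-document corpus and its offsets for the demonstration.
docs = [{"id": "a", "text": "first"}, {"id": "b", "text": "second"}]
offsets, ids = [], {}
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as f:
    for position, doc in enumerate(docs):
        offsets.append(f.tell())
        ids[doc["id"]] = position
        f.write((json.dumps(doc) + "\n").encode("utf-8"))
    demo_path = f.name

demo = OffsetReader(demo_path, offsets, ids)
by_id = json.loads(demo["b"])
by_position = json.loads(demo[1])
demo.close()
os.remove(demo_path)
```

Only the offsets and the ID map stay in memory in this sketch. Each document is read from disk on demand, which is the general idea behind index-based access to a large corpus.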
num_docs = len(reader)  # number of documents in the corpus
str_data = reader.to_str(data)  # convert a str or list into its string representation
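The exact conversion rule used by to_str is not described here. As a rough illustration only, the sketch below assumes that strings pass through unchanged and that lists are flattened and joined with newlines. to_str_sketch is a hypothetical stand-in, and the real method may behave differently.

```python
from typing import Union


def to_str_sketch(data: Union[str, list]) -> str:
    """Hypothetical stand-in for to_str; the real conversion rule may differ."""
    if isinstance(data, str):
        return data  # strings pass through unchanged
    # Assumption: nested lists are flattened and items are joined with newlines.
    parts = []
    for item in data:
        parts.append(to_str_sketch(item) if isinstance(item, list) else str(item))
    return "\n".join(parts)


plain = to_str_sketch("already text")
joined = to_str_sketch(["title", ["para 1", "para 2"]])
```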
The build_index method builds an index file for a given data file.
reader.build_index(data_path="path_to_data", index_path="path_for_index", keys=["key1", "key2"], verbose=True)
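To make the idea of building an index concrete, here is a minimal sketch under stated assumptions: the data file is JSON Lines, the index is a JSON list of ID and byte-offset pairs, and keys names the fields whose values are joined to form a document ID. None of this is confirmed as the actual index format or the actual meaning of keys. build_offset_index is a hypothetical function.

```python
import json
import os
import tempfile
from typing import List


def build_offset_index(data_path: str, index_path: str, keys: List[str]) -> int:
    """Hypothetical indexer: one pass over a JSON Lines file.

    Assumes the values of the fields in `keys` together form the document ID.
    """
    entries = []
    with open(data_path, "rb") as data:
        while True:
            offset = data.tell()  # byte position where this line starts
            line = data.readline()
            if not line:
                break  # end of file
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            doc_id = "/".join(str(record[k]) for k in keys)
            entries.append({"id": doc_id, "offset": offset})
    with open(index_path, "w", encoding="utf-8") as index:
        json.dump(entries, index)
    return len(entries)


# Demonstration on a throwaway two-line file.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.jsonl")
    index_path = os.path.join(tmp, "data.index.json")
    with open(data_path, "w", encoding="utf-8", newline="\n") as f:
        f.write('{"source": "wiki", "id": 1, "text": "hello"}\n')
        f.write('{"source": "wiki", "id": 2, "text": "world"}\n')
    count = build_offset_index(data_path, index_path, keys=["source", "id"])
    with open(index_path, encoding="utf-8") as f:
        index_entries = json.load(f)
```

The data file is scanned once and never loaded whole, so memory use in this sketch grows with the number of documents rather than with their total size.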
- __init__(index_path: str, verbose: bool=False): Initialize the corpus reader.
- __del__(): Clean up resources.
- __getitem__(index: Union[int, str]) -> str: Fetch a document by its ID or index.
- __len__() -> int: Return the number of documents in the corpus.
- to_str(data: Union[str, list]) -> str: Convert data into string representation.
- build_index(data_path: str, index_path: str, keys: List[str], verbose: bool=False): Build an index file.