Corpus Reader

A fast and memory-efficient indexing tool for reading large-scale corpora.

Introduction

CorpusReader is a Python class that uses an index to access individual documents in a corpus efficiently. It is particularly useful for large-scale corpora, where reading the whole corpus into memory would be too memory-intensive.
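To illustrate why an index keeps memory use low, here is a minimal sketch of one common approach: record the byte offset of each document once, then seek directly to the document you need. This is not the CorpusReader implementation. The one-document-per-line layout and the helper names build_offsets and read_document are assumptions made for the example.

```python
import os
import tempfile
from typing import List


def build_offsets(data_path: str) -> List[int]:
    """Record the byte offset at which each line (document) starts."""
    offsets = []
    position = 0
    with open(data_path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_document(data_path: str, offsets: List[int], i: int) -> str:
    """Seek straight to document i and read only that one line."""
    with open(data_path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("utf-8").rstrip("\n")


# Demo on a small temporary file (hypothetical data, one document per line).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("first document\nsecond document\nthird document\n")
    offsets = build_offsets(path)
    second = read_document(path, offsets, 1)
```

Only the list of offsets stays in memory; the document text is read from disk on demand.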

Usage

Initialization

from corpus_reader import CorpusReader
reader = CorpusReader(index_path="path_to_index", verbose=True)

Fetching a Document

You can fetch a document by its ID (string) or by its index (integer).

doc = reader["document_id"]  # look up by document ID (string)
doc = reader[index]          # look up by position, where index is an integer
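The two lookup styles can be supported by a single __getitem__ that dispatches on the type of the key. The sketch below shows the idea with a hypothetical in-memory class, TinyReader. It is an assumption-laden illustration, not how CorpusReader is implemented, and the zero-based positions are part of that assumption.

```python
from typing import Dict, List, Union


class TinyReader:
    """Hypothetical in-memory stand-in; not the CorpusReader implementation."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self._ids: List[str] = list(docs)        # position -> document ID
        self._docs: Dict[str, str] = dict(docs)  # document ID -> text

    def __len__(self) -> int:
        return len(self._ids)

    def __getitem__(self, key: Union[int, str]) -> str:
        if isinstance(key, int):
            # Integer key: treat it as a position in insertion order.
            return self._docs[self._ids[key]]
        if isinstance(key, str):
            # String key: treat it as a document ID.
            return self._docs[key]
        raise TypeError("key must be an int position or a str document ID")


tiny = TinyReader({"doc-a": "alpha", "doc-b": "beta"})
by_id = tiny["doc-b"]
by_position = tiny[0]
```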

Getting the Number of Documents

num_docs = len(reader)
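Since the reader supports both len() and integer indexing, the two can be combined to walk the corpus one document at a time. The helper below, iter_documents, is a hypothetical name introduced for this sketch. It assumes positions run from 0 to len(reader) - 1, which the README does not state explicitly, and it is demonstrated on a plain list standing in for a reader.

```python
from typing import Iterator


def iter_documents(reader) -> Iterator[str]:
    """Yield documents one by one from any object with len() and integer indexing."""
    for i in range(len(reader)):
        yield reader[i]


# A plain list stands in for a CorpusReader instance here.
stand_in = ["doc zero", "doc one", "doc two"]
collected = list(iter_documents(stand_in))
```

Because the helper is a generator, only one document needs to be held in memory at a time.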

Converting Data to String

str_data = reader.to_str(data)  # data may be a str or a list

Building an Index for Data

This method allows you to build an index file for a given data file.

reader.build_index(
    data_path="path_to_data",
    index_path="path_for_index",
    keys=["key1", "key2"],
    verbose=True,
)

Methods

__init__(index_path: str, verbose: bool = False): Initialize the corpus reader.
__del__(): Clean up resources.
__getitem__(index: Union[int, str]) -> str: Fetch a document by its ID or index.
__len__() -> int: Return the number of documents in the corpus.
to_str(data: Union[str, list]) -> str: Convert data into its string representation.
build_index(data_path: str, index_path: str, keys: List[str], verbose: bool = False): Build an index file.
