onefilellm: Command Line Data Aggregation Tool for LLM Ingestion

onefilellm is a command-line tool that aggregates and preprocesses data from various sources into one text file and the clipboard for easier ingestion into large language models (LLMs).

Features

  • Automatic source type detection based on the provided path, URL, or identifier (see the sketch after this list)
  • Support for local files and directories, GitHub repositories, academic papers from ArXiv, YouTube transcripts, web page documentation, and Sci-Hub-hosted papers via DOI or PMID
  • Handling of multiple file formats, including Jupyter Notebooks (.ipynb) and PDFs
  • Web crawling functionality to extract content from linked pages up to a specified depth
  • Integration with Sci-Hub for automatic downloading of research papers using DOIs or PMIDs
  • Text preprocessing, including compressed and uncompressed outputs, stopword removal, and lowercase conversion
  • Automatic copying of uncompressed text to the clipboard for easy pasting into LLMs
  • Token count reporting for both compressed and uncompressed outputs
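As a rough sketch of how source type detection could work (the function name detect_source_type and the rules below are simplified illustrations, not the script's actual implementation):

import os
import re

def detect_source_type(source: str) -> str:
    # Illustrative only: simplified checks for routing an input to the matching handler.
    if os.path.exists(source):
        return "local directory" if os.path.isdir(source) else "local file"
    if "github.com/" in source:
        return "github repository"
    if "arxiv.org/" in source:
        return "arxiv paper"
    if "youtube.com/watch" in source or "youtu.be/" in source:
        return "youtube transcript"
    if re.fullmatch(r"10\.\d{4,9}/\S+", source):
        return "sci-hub doi"
    if source.isdigit():
        return "sci-hub pmid"
    return "web page"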

Installation

Prerequisites

Install the required dependencies:

pip install -U -r requirements.txt

Optionally, create a virtual environment for isolation:

python -m venv .venv
source .venv/bin/activate
pip install -U -r requirements.txt

GitHub Personal Access Token

To access private GitHub repositories, generate a personal access token as described in the 'Obtaining a GitHub Personal Access Token' section.

Setup

Clone the repository or download the source code.
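For example, to clone this repository:

git clone https://github.com/tribixbite/1filellm.git
cd 1filellm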

Usage

Run the script using the following command:

python onefilellm.py

At the prompt, enter a local file or folder path, a documentation, paper, or repository URL, or, for Sci-Hub papers, the DOI or PMID of the data source you want to process:

Enter the local or remote path, URL, DOI, or PMID for ingestion:

The tool supports the following input options (an example session follows the list):

  • Local file path (e.g., C:\documents\report.pdf)
  • Local directory path (e.g., C:\projects\research) -> (files of selected filetypes segmented into one flat text file)
  • GitHub repository URL (e.g., https://github.com/username/repo) -> (Repo files of selected filetypes segmented into one flat text file)
  • ArXiv paper URL (e.g., https://arxiv.org/abs/2401.14295) -> (Full paper PDF to text file)
  • YouTube video URL (e.g., https://www.youtube.com/watch?v=video_id) -> (Video transcript to text file)
  • Webpage URL (e.g., https://example.com/page or https://example.com/page/page.pdf) -> (Pages crawled to the configured depth, or the remote file, segmented into one flat text file)
  • Sci-Hub paper DOI (Digital Object Identifier of a Sci-Hub-hosted paper) (e.g., 10.1234/example.doi) -> (Full Sci-Hub paper PDF to text file)
  • Sci-Hub paper PMID (PubMed Identifier of a Sci-Hub-hosted paper) (e.g., 12345678) -> (Full Sci-Hub paper PDF to text file)
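For example, aggregating a GitHub repository looks like this (the repository URL is a placeholder):

python onefilellm.py
Enter the local or remote path, URL, DOI, or PMID for ingestion: https://github.com/username/repo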

The script generates the following output files in a subdirectory named after the input source, following the naming convention {base_name}_{token_count}_{type}.txt where {type} is either full for uncompressed or min for compressed output:

  • {base_name}_{token_count}_full.txt: Contains the full text output, which is also automatically copied to the clipboard.
  • {base_name}_{token_count}_min.txt: Contains the cleaned and compressed text.
  • {base_name}_processed_urls.txt: Lists all URLs processed during web crawling, if applicable.

The output files are located within a dynamically named subdirectory under the output folder, structured as follows:

output/
    |- <input_source_name>/
        |- <input_source_name>_<token_count>_full.txt
        |- <input_source_name>_<token_count>_min.txt
        |- <input_source_name>_processed_urls.txt
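A minimal sketch of how these paths follow the naming convention (the helper build_output_paths is illustrative, not part of the script):

import os

def build_output_paths(base_name: str, token_count: int) -> dict:
    # Illustrative only: mirrors the documented naming convention.
    out_dir = os.path.join("output", base_name)
    return {
        "full": os.path.join(out_dir, f"{base_name}_{token_count}_full.txt"),
        "min": os.path.join(out_dir, f"{base_name}_{token_count}_min.txt"),
        "urls": os.path.join(out_dir, f"{base_name}_processed_urls.txt"),
    }

# e.g. build_output_paths("repo", 12345)["full"] -> output/repo/repo_12345_full.txt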

Configuration

  • To modify the allowed file types for repository processing, update the allowed_extensions list in the code.
  • To change the depth of web crawling, adjust the max_depth variable in the code. Both settings are shown in the snippet below.
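For reference, the two settings look roughly like this in onefilellm.py (the values mirror the defaults listed under Notes below):

# File extensions included when processing a local directory or GitHub repository
allowed_extensions = ['.py', '.txt', '.ts', '.tsx', '.js', '.rst', '.sh', '.md',
                      '.pyx', '.html', '.yaml', '.json', '.jsonl', '.ipynb',
                      '.h', '.c', '.sql', '.csv']

# How many links deep to follow from the starting URL when crawling
max_depth = 2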

Obtaining a GitHub Personal Access Token

To access private GitHub repositories, you need a personal access token. Follow these steps:

  1. Log in to your GitHub account and go to Settings.
  2. Navigate to Developer settings > Personal access tokens.
  3. Click on "Generate new token" and provide a name.
  4. Select the necessary scopes (at least repo for private repositories).
  5. Click "Generate token" and copy the token value.

In the onefilellm.py script, replace GITHUB_TOKEN with your actual token or set it as an environment variable (a sketch of reading it from the environment follows the options below):

  • For Windows:

    setx GITHUB_TOKEN "YourGitHubToken"
  • For Linux:

    echo 'export GITHUB_TOKEN="YourGitHubToken"' >> ~/.bashrc
    source ~/.bashrc
  • .env file: Create a .env file in the root directory of the project and add the following line:

    GITHUB_TOKEN="YourGitHubToken"
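A minimal sketch of reading the token from the environment inside the script (the exact lookup in onefilellm.py may differ; the placeholder fallback is illustrative):

import os

# Prefer the GITHUB_TOKEN environment variable; fall back to a placeholder
# value only if it is not set.
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "YourGitHubToken")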

Notes

  • For repositories, modify this line of code to add or remove the file types processed: allowed_extensions = ['.py', '.txt', '.ts', '.tsx', '.js', '.rst', '.sh', '.md', '.pyx', '.html', '.yaml','.json', '.jsonl', '.ipynb', '.h', '.c', '.sql', '.csv']
  • For web scraping, modify this line of code to change how many links deep from the starting URL are included: max_depth = 2
  • Token counts are displayed in the console for both output files.
