Don't crash on non-unicode files #5

Closed
simonw opened this issue Apr 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@simonw
Owner

simonw commented Apr 8, 2024

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 10

Got this error running against a folder with a binary in it.

>>> import pdb
>>> pdb.pm()
> /opt/homebrew/Caskroom/miniconda/base/lib/python3.10/codecs.py(322)decode()
-> (result, consumed) = self._buffer_decode(data, self.errors, final)
(Pdb) u
> /Users/simon/.local/pipx/venvs/files-to-prompt/lib/python3.10/site-packages/files_to_prompt/cli.py(66)process_path()
-> file_contents = f.read()
(Pdb) list
 61  	                ]
 62  	
 63  	            for file in files:
 64  	                file_path = os.path.join(root, file)
 65  	                with open(file_path, "r") as f:
 66  ->	                    file_contents = f.read()
 67  	
 68  	                click.echo(file_path)
 69  	                click.echo("---")
 70  	                click.echo(file_contents)
 71  	                click.echo()
@simonw added the bug label Apr 8, 2024
@simonw
Owner Author

simonw commented Apr 8, 2024

Easiest option: silently ignore files that cannot be treated as UTF-8 (maybe showing a warning).

But what if users want to run this against files with different encodings? For the moment I'll leave them to convert those files themselves; future releases might add some kind of encoding option.
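
A minimal sketch of that first option, not the actual files-to-prompt code: `read_text_files` is a hypothetical helper, the warning text is made up, and `print(..., file=sys.stderr)` stands in for `click.echo(..., err=True)` so the example runs with the standard library only.

```python
import os
import sys


def read_text_files(root_dir):
    """Yield (path, contents) for each file under root_dir that decodes as UTF-8.

    Hypothetical helper for illustration. Files that fail to decode are
    reported on stderr and skipped instead of crashing the whole run.
    """
    for root, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            file_path = os.path.join(root, name)
            try:
                # Explicit encoding so behaviour does not depend on the locale.
                with open(file_path, "r", encoding="utf-8") as f:
                    contents = f.read()
            except UnicodeDecodeError:
                # Stand-in for click.echo(..., err=True): warn and move on.
                print(
                    f"Warning: skipping {file_path} (not valid UTF-8)",
                    file=sys.stderr,
                )
                continue
            yield file_path, contents
```

Sending the warning to stderr keeps stdout clean, which matters when the output is piped into another tool.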

@simonw
Owner Author

simonw commented Apr 8, 2024

Easy way to replicate this problem in the files-to-prompt checkout itself:

python -m pip install build
python -m build
files-to-prompt .

It crashes on the binary wheel that was built and dropped into dist/.

@simonw
Owner Author

simonw commented Apr 8, 2024

files-to-prompt files_to_prompt/cli.py | llm -m opus --system \
  'catch unicodedecodeerror reading the file and output a click warning about the file, skipping it and moving on'

Took a few follow-ups:

llm -c 'remember to use err=True on those click echo lines'
llm -c 'How would I show those in a different color?'

https://gist.github.com/simonw/9b83f42a1b87d3fcb3b4b8e6f482af38

Then to get it to write the tests:

git diff > diff.txt
files-to-prompt diff.txt tests/test_files_to_prompt.py | llm -m opus -s \
  'output one more test that can exercise the new code that writes warnings about binary files'
llm -c 'modify that test to capture stdout and stderr separately and check for the message in stderr'
llm -c 'ValueError: stderr not separately captured'
llm -c "TypeError: CliRunner.__init__() got an unexpected keyword argument 'stderr'"
# I had to give it a clue:
llm -c 'Use CliRunner(mix_stderr=False)'

https://gist.github.com/simonw/511e1dbede6aba25b2d7027c55cdf759

The test it added failed because the binary string it wrote, b"\x00\x01\x02\x03\x04\x05", is actually valid UTF-8, so no UnicodeDecodeError was raised. I switched that out for \xff instead.
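
The difference is easy to check: bytes 0x00 through 0x05 are ASCII control characters, and every ASCII byte is valid UTF-8 on its own, whereas 0xFF never appears in well-formed UTF-8. A quick sketch (the helper name `is_valid_utf8` is made up for illustration):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# ASCII control characters decode without error.
print(is_valid_utf8(b"\x00\x01\x02\x03\x04\x05"))  # True
# 0xFF is never a valid byte in UTF-8.
print(is_valid_utf8(b"\xff"))  # False
```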

@simonw closed this as completed in 84df8a6 Apr 8, 2024
simonw added a commit that referenced this issue Apr 8, 2024