Preview for text files in version 1.6.0 #122

Open
NataliaBondarenko opened this issue Jun 15, 2020 · 10 comments
Labels
enhancement New feature or request
Milestone

Comments

@NataliaBondarenko
Collaborator

I propose that we fully resolve the text file preview issue in this version.
Previews for binary files can wait until the next version of the package.

Option 1

  1. Try to open all files in text mode.
  2. Show a preview of the file's text if possible; otherwise, show an informational or error message.

Option 2: skip some known binaries, for example files with known signatures.

# short list of extensions for known binary formats
binaries = ['gz', 'jpg', 'png', 'mp3', ...]

def generate_preview(filepath: str, max_size: int = 390) -> str:
    # get_file_extension and generic_text_preview are assumed to be defined elsewhere
    extension = get_file_extension(filepath, case_sensitive=False).lower()

    if extension in binaries:
        # skip known binary extensions in version 1.6.0
        return "[A preview of this file type is not yet implemented.]"

    # try to open all other files in text mode;
    # excerpt is either the file's text or an error string
    excerpt = generic_text_preview(filepath, max_size)
    if excerpt:
        return excerpt
    return "[This file may be empty.]"
@victordomingos
Owner

I would rather release 1.6 with the current feature set and then make 1.7 the release with the file previews, if that's OK with you. We will still need to see what's missing from the documentation with regard to the latest changes, and make sure the translations are in sync.

@victordomingos victordomingos added the enhancement New feature or request label Jun 19, 2020
@victordomingos victordomingos added this to the 1.7 milestone Jun 19, 2020
@NataliaBondarenko
Collaborator Author

Hello!
The CLI help was updated in the previous PR.
The English docs were last updated when search by pattern (the --filename-match argument) was added. Updating them is a priority task.
The other docs are more outdated, but few people will notice: judging by traffic, those pages are not viewed often.

TODO:

  • add tags so that documentation can be generated on Read the Docs
    Tags are needed in this repository on commits f9559b6 (1.4.0) and e94d152 (1.5.0)
    Could you add tags for the corresponding versions?

  • conduct tests for this version and update the list of tested operating systems

@NataliaBondarenko
Collaborator Author

Also:

  • add version branches similar to Django
    To be able to fix errors for active versions (Table issue #118).
    To be able to support incompatible active versions/extensions.

@NataliaBondarenko
Collaborator Author

NataliaBondarenko commented Jun 19, 2020

Previewing text files is an old issue. Features that were not even planned were added to this version. Why postpone the preview solution?

@victordomingos
Owner

Ok. Let’s improve the preview for text files for 1.6 and leave binary formats for later. Special care must be taken in deciding which files are binary or text, and in handling any exceptions properly.

Regarding branches, until now all versions were intended to be backward compatible, so it made sense to fix any bugs in the next update within the same branch. Our public releases are published on PyPI, not on GitHub’s development repo. When we decide to switch to v2.x, then yes, we must keep a separate v1.x branch for bug fixes.

I believe I have added tags for all previous releases, could you please check again? I missed one release, so I added a new tag recently. Maybe that’s the one you were referring to?

With regard to tests, I can test on macOS Catalina, iOS/Pythonista, Haiku R1/beta2 and maybe a few virtual machines. The last time I tested on macOS, I got one failing test. I believe it has something to do with the creation of a comparison file, and you have already explained that to me but I confess I can’t remember. I will submit an issue to see if you are able to help, ok?

Finally, keeping documentation in sync across different languages can easily become a mess. I would like to find some sort of technical solution to help keep them synchronized, but I’m not sure what the best solution is. I know there are some specialized web apps, like Pootle, which I have used for Haiku, but that would require setting up a server and probably some costs. I have heard of GlobalSight and OmegaT, but I haven’t tried any of those yet.

@victordomingos victordomingos modified the milestones: 1.7, 1.6 Jun 20, 2020
@NataliaBondarenko
Collaborator Author

Hello!
I have updated the preview for text files.
New branch: https://github.com/NataliaBondarenko/Count-files/tree/textpreview/count_files
This version allows us to extend the preview capabilities without external dependencies.

I am proposing this version for discussion. It has its pros and cons. What do you think?

Updated def generic_text_preview

https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_preview.py#L12

Added encoding in open(filepath, mode='r', encoding='utf-8').

UTF-8 is one of the most commonly used encodings (w3techs.com stats).
UTF-8 has several convenient properties: docs.python.org Unicode HOWTO

Also, this encoding renders text with mixed characters (like Cyrillic and Latin) quite correctly.
I tried this with the README files in the repository as well as a Japanese text file.

The previous version of this function used open(filepath, mode='r').
Docs: If encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
This option is left as a fallback for opening files.
First, we try to open a file with encoding='utf-8'. If this fails (UnicodeDecodeError), then we try to open the file with the user's preferred encoding.
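
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the textpreview branch: read_text_excerpt is a hypothetical name, and the message returned when both attempts fail is an assumption.

```python
import locale


def read_text_excerpt(filepath: str, max_size: int = 390) -> str:
    """Return up to max_size characters, trying UTF-8 before the locale encoding."""
    # Hypothetical helper; the real generic_text_preview may differ.
    candidates = ["utf-8", locale.getpreferredencoding(False)]
    for encoding in candidates:
        try:
            with open(filepath, mode="r", encoding=encoding) as f:
                # In text mode, read(n) returns at most n characters.
                return f.read(max_size)
        except UnicodeDecodeError:
            # Wrong guess: move on to the next candidate encoding.
            continue
    return "[This file could not be decoded as text.]"
```

Note that some locale encodings (single-byte code pages, for example) accept almost any byte sequence, so the second attempt can succeed on a binary file and return unreadable text. Errors such as a missing file (OSError) are deliberately left to the caller in this sketch.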

Added new shell-command argument to Search group

https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/help_text.py#L397

The idea is to use the Unix "file" command, a file type detector (wiki).
Using this program allows the CLI to detect text files with or without an extension and display a preview of those files.
The file type is determined by running this command through the subprocess module.
In essence, it captures the output of $ file /path/to/file.ext.
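
As a rough sketch of that idea, the snippet below runs the command through subprocess and applies a simple check to its output. The function names, the use of the -b (brief) option to leave the file name out of the output, and the rule that a description containing the word "text" indicates a text file are all assumptions made for illustration; the branch may do this differently.

```python
import subprocess


def describe_file(filepath: str, command: str = "file") -> str:
    """Return the description printed by the command, or '' if the call fails."""
    try:
        result = subprocess.run(
            [command, "-b", filepath],  # -b: print the description only
            capture_output=True, text=True, timeout=5, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # Command missing, not executable, timed out, or exited with an error.
        return ""
    return result.stdout.strip()


def looks_like_text(description: str) -> bool:
    """Heuristic: descriptions of text files usually contain the word 'text'."""
    return "text" in description.lower()
```

A caller could then show a text preview only when looks_like_text(describe_file(path)) is true and fall back to an informational message otherwise.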

Depending on whether this program is available, the preview is created by one of two functions:
def generate_preview_with_file
https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_preview.py#L91
or
def generate_preview
https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_preview.py#L61
With generate_preview, a preview is only available for files with certain extensions,
but this function can be used on all operating systems.

I have added two functions to check if the Unix "file" command is available and works as expected.
https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_handlers.py#L78
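
One way such a check might look is sketched below: run the command on a small probe file whose type is known and see whether the answer is sensible. The name file_command_works, the probe approach, and the -b option are assumptions for illustration and are not taken from the linked file_handlers.py.

```python
import os
import subprocess
import tempfile


def file_command_works(command: str = "file") -> bool:
    """Return True if `command` runs and describes a known text file as text."""
    # Create a small probe file whose type is known in advance.
    fd, probe = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="ascii") as f:
            f.write("plain text probe\n")
        result = subprocess.run(
            [command, "-b", probe],
            capture_output=True, text=True, timeout=5,
        )
        return result.returncode == 0 and "text" in result.stdout.lower()
    except (OSError, subprocess.SubprocessError):
        # The command is missing, not executable, or did not finish in time.
        return False
    finally:
        os.remove(probe)
```

Checking the output, and not only the exit status, guards against an unrelated program that happens to be named "file" being picked up.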

This utility works with files pretty quickly.
The "file" command is a standard program on Unix and Unix-like operating systems.
It has also been ported to Windows and can be used there, for example if the user installed it along with Git (https://git-scm.com) and added it to the PATH environment variable.
I don't see anything like this for Haiku and StaSh.
Thus, the use of this argument is limited to desktop operating systems such as Linux, macOS, and Windows.

If this version is appropriate, there will be no more significant changes for v1.6.

TODO:

@NataliaBondarenko
Collaborator Author

NataliaBondarenko commented Jul 24, 2020

> Ok. Let’s improve the preview for text files for 1.6 and leave binary formats for later.

I think we can preview some binaries using the Python standard library.
For example, when the user requests a list of files with the same extension:
count-files --file-extension ext_name --preview
When all files in a directory are requested (--file-extension ..), choosing the correct function and processing each file can slow down the program.
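
As one illustration of a standard-library preview for a binary format, the sketch below reads the beginning of a gzip-compressed text file with the gzip module. The function name, the assumption that the compressed content is UTF-8 text, and the fallback message are hypothetical and were not proposed in this thread.

```python
import gzip
import zlib


def gzip_text_preview(filepath: str, max_size: int = 390) -> str:
    """Return the first max_size characters of a gzip-compressed text file."""
    try:
        # Mode "rt" decompresses and decodes on the fly, so only the start
        # of the archive has to be processed.
        with gzip.open(filepath, mode="rt", encoding="utf-8") as f:
            return f.read(max_size)
    except (OSError, EOFError, zlib.error, UnicodeDecodeError):
        # Not a gzip file, a damaged archive, or non-UTF-8 content.
        return "[A preview of this file could not be generated.]"
```

A per-extension function like this keeps the cost low when the user has already selected a single extension; picking the right function for every file in a directory is where the slowdown mentioned above would come from.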

> Special care must be taken in deciding which files are binary or text, and in handling any exceptions properly.

Determining which files are binary or text files is difficult.
To increase the likelihood of correctly detecting the file type, we can use OS utilities.
I already mentioned the "file" command in the comment above.

> Regarding branches, until now all versions were intended to be backward compatible, so it made sense to fix any bugs in the next update within the same branch. Our public releases are published on PyPI, not on GitHub’s development repo. When we decide to switch to v2.x, then yes, we must keep a separate v1.x branch for bug fixes.

Ok. It makes sense to me.

> I believe I have added tags for all previous releases, could you please check again? I missed one release, so I added a new tag recently. Maybe that’s the one you were referring to?

The documentation for those versions changed after these tags were created.
Later I made small clarifications to the text of the documentation, not to the code itself.
The existing tags do not cover several pull requests.

> With regard to tests, I can test on macOS Catalina, iOS/Pythonista, Haiku R1/beta2 and maybe a few virtual machines.

I have Windows and Linux.

> The last time I tested on macOS, I got one failing test. I believe it has something to do with the creation of a comparison file, and you have already explained that to me but I confess I can’t remember. I will submit an issue to see if you are able to help, ok?

Comparison files are generated automatically in the latest tests. It might be an old test file.

> Finally, keeping documentation in sync across different languages can easily become a mess. I would like to find some sort of technical solution to help keep them synchronized, but I’m not sure what the best solution is. I know there are some specialized web apps, like Pootle, which I have used for Haiku, but that would require setting up a server and probably some costs. I have heard of GlobalSight and OmegaT, but I haven’t tried any of those yet.

I suggest maintaining only English documentation (Read The Docs and README) after v1.6.

@victordomingos
Owner

Hi! I had a quick look over your new branch and it seems a nice improvement indeed. Thanks.

As usual, documentation must be clear about availability issues and IMO it should also include some guidance on how to get it to work on Windows.

> This utility works with files pretty quickly.
> The "file" command is a standard program on Unix and Unix-like operating systems.
> It has also been ported to Windows and can be used there, for example if the user installed it along with Git (https://git-scm.com) and added it to the PATH environment variable.
> I don't see anything like this for Haiku and StaSh.
> Thus, the use of this argument is limited to desktop operating systems such as Linux, macOS, and Windows.

Actually, I believe we can also count on file being available on Haiku:

[Screenshot, 2020-07-25 at 16:47:57]

iOS/StaSh has no file binary, so in this case we must make sure that a proper message is given to the user.

With regard to multilingual documentation, I haven't given up on it yet. The English version will be the master, but any changes should be properly identified so that the translators know where to look. I intend to keep maintaining at least the Portuguese translation (it can be kept in that single markdown file).

@NataliaBondarenko
Collaborator Author

> As usual, documentation must be clear about availability issues and IMO it should also include some guidance on how to get it to work on Windows.

> Actually, I believe we can also count on file being available on Haiku:

Currently, command availability checking is limited to specific operating systems (win, linux, darwin).
https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_handlers.py#L130
This limitation can be removed. We can try using the "file" command on any operating system.

The --shell-command argument can take either a command name or the path to an executable file.

--shell-command file
or
--shell-command /path/to/file
This can be useful on systems where the "file" command is not standard.
That is, you can install the program and use it without adding it to your PATH environment variable.

> With regard to multilingual documentation, I haven't given up on it yet. The English version will be the master, but any changes should be properly identified so that the translators know where to look. I intend to keep maintaining at least the Portuguese translation (it can be kept in that single markdown file).

A shorter version of the documentation in one markdown file for each language?

@victordomingos
Owner

> Currently, command availability checking is limited to specific operating systems (win, linux, darwin).
> https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_handlers.py#L130
> This limitation can be removed. We can try using the "file" command on any operating system.
>
> The --shell-command argument can take either a command name or the path to an executable file.
>
> --shell-command file
> or
> --shell-command /path/to/file
> This can be useful on systems where the "file" command is not standard.
> That is, you can install the program and use it without adding it to your PATH environment variable.

This information may be useful, especially the shutil.which(command) part:

https://stackoverflow.com/questions/11210104/check-if-a-program-exists-from-a-python-script
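
A minimal sketch of how shutil.which could cover both forms of the --shell-command value, and produce a message for systems without the command, is shown below. The function name and the message wording are hypothetical.

```python
import shutil


def availability_message(command: str) -> str:
    """Report how a --shell-command value (name or path) resolves."""
    # shutil.which searches PATH for a bare name; a value that contains a
    # directory component is checked directly, without consulting PATH.
    resolved = shutil.which(command)
    if resolved is None:
        return f"[The command '{command}' was not found or is not executable.]"
    return f"Using {resolved}"
```

Because the lookup does not depend on the operating system name, the same check would apply on Haiku, and on iOS/StaSh it would simply yield the "not found" message.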

> > With regard to multilingual documentation, I haven't given up on it yet. The English version will be the master, but any changes should be properly identified so that the translators know where to look. I intend to keep maintaining at least the Portuguese translation (it can be kept in that single markdown file).
>
> A shorter version of the documentation in one markdown file for each language?

I am not sure if we can make it much shorter without leaving some features undocumented, but we may consider keeping it in a single file if it helps. At this time, we have that situation in Portuguese (a short Readme and a longer single-file documentation). The simplest workflow (not necessarily the best one, though) would be going back to a single file per language, merging the readme and the documentation back together. That would leave us with a single documentation file for each language.

Now, IMHO the most important bit is to establish a workflow. For instance, whenever the user interface changes (e.g. a feature is added or removed, or gets a new behaviour), the developer could also open a new issue describing the changes that need to be reflected in the documentation. If possible, the English version should be updated together with the code pull request itself, so that at least the English documentation is always up to date. The issue tracker would let us keep track of any sections that need to have their translation updated. What do you think?
