More meaningful warnings for unknown filetypes in detect_filetype #208

Closed
MthwRobinson opened this issue Feb 8, 2023 · 1 comment · Fixed by #355
Labels: help wanted (Extra attention is needed), python (Pull requests that update Python code)

Comments

@MthwRobinson
Contributor

Currently, the detect_filetype function in unstructured.file_utils.filetype emits the following warning when it detects an unknown MIME type. This isn't especially helpful because you still don't know what the file type is. The goal of this issue is to update the warning to print the file extension or filename, so it's more obvious which file type caused the issue.

MIME type was inode/x-empty. This file type is not currently supported in unstructured.
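
As a rough illustration of the requested change, the sketch below builds a warning that names the file and its extension. `unknown_filetype_warning` is a hypothetical helper written for this example, not a function from unstructured, and the wording of the new message is only an assumption about what a more useful warning could look like.

```python
import os
from typing import Optional


def unknown_filetype_warning(mime_type: str, filename: Optional[str] = None) -> str:
    """Build a warning for an unsupported MIME type (hypothetical helper).

    Without a filename, this falls back to the message quoted in the issue.
    """
    suffix = "This file type is not currently supported in unstructured."
    if not filename:
        return f"MIME type was {mime_type}. {suffix}"

    # Report only the base name so the message stays short for deep paths.
    base = os.path.basename(filename)
    _, extension = os.path.splitext(base)

    subject = f"The MIME type of {base!r}"
    if extension:
        subject += f" (extension {extension!r})"
    return f"{subject} is {mime_type}. {suffix}"


if __name__ == "__main__":
    print(unknown_filetype_warning("inode/x-empty", "docs/empty.xyz"))
```

Keeping the no-filename fallback matters because a caller may only have a file-like object, in which case there is no name to report.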

MthwRobinson added the help wanted and python labels on Feb 8, 2023
@sparkbrains
Contributor

@cragwolfe

I have updated the code and added the filename to the warning message. However, this warning appears when detect_filetype is given an empty file. Should the warning message stay the same in that case, or does it need to change?
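
One possible answer to this question, sketched under assumptions: since `inode/x-empty` is the MIME type libmagic reports for a zero-length file, the warning could treat that value as "the file is empty" rather than "the file type is unsupported". `filetype_warning` and both message strings are hypothetical and are not taken from the change that closed this issue.

```python
# MIME type that libmagic reports for a zero-length file.
EMPTY_FILE_MIME_TYPE = "inode/x-empty"


def filetype_warning(mime_type: str, filename: str) -> str:
    """Return a warning that distinguishes empty files (hypothetical helper)."""
    if mime_type == EMPTY_FILE_MIME_TYPE:
        # An empty file has no content to sniff, so say that directly.
        return f"The file {filename} is empty, so its file type could not be detected."
    return (
        f"The MIME type of {filename} is {mime_type}. "
        "This file type is not currently supported in unstructured."
    )


if __name__ == "__main__":
    print(filetype_warning("inode/x-empty", "notes.txt"))
    print(filetype_warning("application/x-foo", "data.foo"))
```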

cragwolfe pushed a commit that referenced this issue Sep 22, 2023
Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`
### Testing


```python
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# NOTE: `strategy` is not defined in this snippet; set it before running.

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
# (replace the <directory path> placeholder with a real path)
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)
```