Currently, the `detect_filetype` function in `unstructured.file_utils.filetype` emits the following warning if it detects an unknown MIME type. This isn't especially helpful because you still don't know which file triggered it. The goal of this issue is to update the warning to print the file extension or filename so it's more obvious which file caused the issue.
MIME type was inode/x-empty. This file type is not currently supported in unstructured.
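A minimal sketch of what the improved warning text could look like, using only the standard library. The helper name `build_unsupported_mime_warning` and its parameters are hypothetical and chosen for illustration; this is not the actual `detect_filetype` code.

```python
import os
from typing import Optional


def build_unsupported_mime_warning(mime_type: str, filename: Optional[str] = None) -> str:
    """Hypothetical helper: build the warning text, naming the file when known."""
    base = "This file type is not currently supported in unstructured."
    if filename is None:
        # No filename available (e.g. a file-like object): keep the current wording.
        return f"MIME type was {mime_type}. {base}"
    # splitext returns (root, ext); ext is "" when the name has no extension.
    extension = os.path.splitext(filename)[1] or "<none>"
    return f"The MIME type of {filename!r} (extension: {extension}) is {mime_type}. {base}"


print(build_unsupported_mime_warning("inode/x-empty", "docs/empty.txt"))
```

Falling back to the existing wording when no filename is available keeps the message meaningful for callers that pass a file object rather than a path.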
I have updated the code and added the filename to the warning message. However, this warning appears when `detect_filetype` encounters an empty file. Should the warning message stay the same in that case, or does it need to change?
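One possible answer, sketched under assumptions: since `inode/x-empty` is the MIME type libmagic reports for a zero-byte file, the empty-file case could get its own message. The function name `describe_undetected_file` is hypothetical and is not part of `unstructured`.

```python
EMPTY_FILE_MIME_TYPE = "inode/x-empty"  # reported by libmagic for zero-byte files


def describe_undetected_file(filename: str, mime_type: str) -> str:
    """Hypothetical helper: choose a message depending on why detection failed."""
    if mime_type == EMPTY_FILE_MIME_TYPE:
        # The file has no content, so there is no type to support or reject.
        return f"The file {filename!r} is empty, so its file type could not be detected."
    return (
        f"The MIME type of {filename!r} is {mime_type}. "
        "This file type is not currently supported in unstructured."
    )


print(describe_undetected_file("empty.txt", "inode/x-empty"))
```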
Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`
### Testing
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)