More meaningful warnings for unknown filetypes in `detect_filetype` #208

MthwRobinson · 2023-02-08T20:38:15Z

Currently the `detect_filetype` function in `unstructured.file_utils.filetype` emits the following warning when it detects an unknown MIME type. This isn't especially helpful, because the warning doesn't tell you which file triggered it. The goal of this issue is to update the warning to include the file extension or filename, so it's more obvious which file caused the issue.

MIME type was inode/x-empty. This file type is not currently supported in unstructured.
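For illustration, here is a minimal sketch of what a more informative warning could look like. It is not the code in `unstructured`: the helper names `build_unsupported_message` and `warn_unsupported`, the logger name, and the exact wording are all assumptions made for this example.

```python
import logging
import os
from typing import Optional

# Hypothetical logger name; the real library configures its own logger.
logger = logging.getLogger("filetype_sketch")


def build_unsupported_message(mime_type: str, filename: Optional[str] = None) -> str:
    """Hypothetical helper: describe an unsupported MIME type, naming the file if known."""
    if filename:
        # splitext returns ("path/name", ".ext"); the extension may be empty.
        _, extension = os.path.splitext(filename)
        source = f"The MIME type of {filename!r} is {mime_type!r}"
        if extension:
            source += f" (extension {extension!r})"
    else:
        # No filename available, e.g. when only a file-like object was passed.
        source = f"The MIME type is {mime_type!r}"
    return f"{source}. This file type is not currently supported in unstructured."


def warn_unsupported(mime_type: str, filename: Optional[str] = None) -> str:
    """Log the warning and return the message so callers can inspect it."""
    message = build_unsupported_message(mime_type, filename)
    logger.warning(message)
    return message
```

Calling `warn_unsupported("inode/x-empty", "docs/report.xyz")` would log a message that names both the file and its extension, which is the information the current warning lacks.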


sparkbrains · 2023-02-10T10:56:00Z

@cragwolfe

I have updated the code and added the filename to the warning message. However, this warning is emitted when `detect_filetype` encounters an empty file. Should the warning message stay the same in that case, or does it need to change?
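One possible answer to the question above is to give empty files their own message. The sketch below is only an illustration under assumptions: `choose_warning` and `EMPTY_MIME_TYPE` are hypothetical names, and treating `inode/x-empty` as the marker for an empty file is inferred from the warning quoted in this issue, not taken from the library's code.

```python
import os

# MIME type quoted in the warning in this issue; assumed here to indicate an empty file.
EMPTY_MIME_TYPE = "inode/x-empty"


def choose_warning(filename: str, mime_type: str) -> str:
    """Hypothetical helper: pick a message depending on whether the file is empty."""
    # Treat the file as empty if the detected MIME type says so, or if the
    # path exists on disk and has a size of zero bytes.
    is_empty = mime_type == EMPTY_MIME_TYPE or (
        os.path.isfile(filename) and os.path.getsize(filename) == 0
    )
    if is_empty:
        return f"The file {filename!r} is empty, so its type could not be detected."
    return (
        f"The MIME type of {filename!r} is {mime_type!r}. "
        "This file type is not currently supported in unstructured."
    )
```

With this split, an empty file produces a message about emptiness, while a genuinely unsupported type keeps the original wording plus the filename.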

Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).

### Summary

- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`

### Testing

```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)
```

MthwRobinson added the `help wanted` (Extra attention is needed) and `python` (Pull requests that update Python code) labels Feb 8, 2023

MthwRobinson mentioned this issue Feb 8, 2023

feat: file info dataframe from filenames and file content #204

Merged

MthwRobinson mentioned this issue Feb 16, 2023

ISD dictionaries are not JSON serializable if the filename has a POSIX path #232

Closed

tomaarsen mentioned this issue Mar 10, 2023

Enhancement: improve detect_filetype warning to include filename #355

Merged

MthwRobinson closed this as completed in #355 Mar 10, 2023
