Documentation xtractmime #7

Merged
merged 44 commits into from
Aug 24, 2021
Commits
9f65cc6
Add main sniffing algo
akshaysharmajs Jul 2, 2021
1d2895e
Add 7.1, 7.2, and 7.3
akshaysharmajs Jul 5, 2021
8c1d006
Merge branch 'highlvl-API' into compute-mime
akshaysharmajs Jul 5, 2021
0c9424e
Complete 7.3
akshaysharmajs Jul 5, 2021
0f27318
Small Fix
akshaysharmajs Jul 5, 2021
981b9ab
Update test_utils.py
akshaysharmajs Jul 9, 2021
8edacda
Merge branch 'highlvl-API' into compute-mime
akshaysharmajs Jul 9, 2021
92317e9
Add tests for extract_mime
akshaysharmajs Jul 9, 2021
de958a1
Add more tests
akshaysharmajs Jul 11, 2021
438ee50
Add tests for mislabled feed
akshaysharmajs Jul 12, 2021
948a456
Small fix
akshaysharmajs Jul 12, 2021
88afd26
Add left over tests
akshaysharmajs Jul 12, 2021
55823ee
Static fix
akshaysharmajs Jul 12, 2021
2371a4d
Fix typing check
akshaysharmajs Jul 13, 2021
4b503fc
Conflicts
akshaysharmajs Jul 13, 2021
8ec5c1a
Merge branch 'main' into compute-mime
akshaysharmajs Jul 13, 2021
75d202e
Add mime groups
akshaysharmajs Jul 18, 2021
15fe0ff
Add tests
akshaysharmajs Jul 18, 2021
91ecd5a
Test all mime_types
akshaysharmajs Jul 25, 2021
9da49ba
Add more tests
akshaysharmajs Jul 25, 2021
8f665af
Remove text mime
akshaysharmajs Jul 26, 2021
7645470
All mime types
akshaysharmajs Jul 27, 2021
88f540c
mime type split
akshaysharmajs Jul 30, 2021
d19d5c5
Add mime groups (#5)
akshaysharmajs Aug 1, 2021
236029b
Merge branch 'compute-mime' into html-fix
akshaysharmajs Aug 1, 2021
9970a64
Add Description
akshaysharmajs Aug 4, 2021
d8b3d84
Function definition
akshaysharmajs Aug 4, 2021
698a85e
content_types
akshaysharmajs Aug 4, 2021
37fabb3
http_origin
akshaysharmajs Aug 4, 2021
d8b137d
no_sniff
akshaysharmajs Aug 4, 2021
f9de94a
Update README.md
akshaysharmajs Aug 5, 2021
f1b3095
Merge branch 'main' into docs
akshaysharmajs Aug 5, 2021
c56a0e8
Format change
akshaysharmajs Aug 5, 2021
03f349b
Update README.md
akshaysharmajs Aug 5, 2021
68517d4
Update README.md
akshaysharmajs Aug 5, 2021
3396c15
Update README.md
akshaysharmajs Aug 5, 2021
d912c40
More Docs
akshaysharmajs Aug 6, 2021
03fcbe6
Docs for binary data function
akshaysharmajs Aug 6, 2021
f1ca39e
Update README.md
akshaysharmajs Aug 6, 2021
3f42116
Update Docs
akshaysharmajs Aug 6, 2021
40f1968
Small fix
akshaysharmajs Aug 6, 2021
0fe76b3
Small fix
akshaysharmajs Aug 9, 2021
bf2eb31
Some final changes
akshaysharmajs Aug 16, 2021
6a9721b
Small changes
akshaysharmajs Aug 16, 2021
120 changes: 120 additions & 0 deletions README.md
@@ -1 +1,121 @@
# xtractmime

`xtractmime` is a [BSD-licensed](https://opensource.org/licenses/BSD-3-Clause)
Python 3.6+ implementation of the [MIME Sniffing
Standard](https://mimesniff.spec.whatwg.org/).

Install from [`PyPI`](https://pypi.python.org/pypi/xtractmime):

```
pip install xtractmime
```

---

## Basic usage

Below are some simple examples of using `xtractmime.extract_mime`:

```python
>>> from xtractmime import extract_mime
>>> extract_mime(b'Sample text content')
b'text/plain'
>>> extract_mime(b'', content_types=(b'text/html',))
b'text/html'
```

You can also check whether a MIME type belongs to a specific MIME type group using the
functions included in `xtractmime.mimegroups`:

```python
>>> from xtractmime.mimegroups import is_html_mime_type, is_image_mime_type
>>> mime_type = b'text/html'
>>> is_html_mime_type(mime_type)
True
>>> is_image_mime_type(mime_type)
False
```

---

## API Reference

### function `xtractmime.extract_mime(*args, **kwargs) -> Optional[bytes]`
**Parameters:**

* `body: bytes`
* `content_types: Optional[Tuple[bytes]] = None`
* `http_origin: bool = True`
* `no_sniff: bool = False`
* `extra_types: Optional[Tuple[Tuple[bytes, bytes, Optional[Set[bytes]], bytes], ...]] = None`
* `supported_types: Optional[Set[bytes]] = None`

Return the [MIME type essence](https://mimesniff.spec.whatwg.org/#mime-type-essence) (e.g. `text/html`) matching the input data, or
`None` if no match can be found.

The `body` parameter is the byte sequence whose MIME type is to be determined. `xtractmime` only considers the first few
bytes of `body`; the exact number of bytes read is defined by the `xtractmime.RESOURCE_HEADER_BUFFER_LENGTH` constant.

`content_types` is a tuple of MIME types given in the resource metadata. For example, for resources retrieved via HTTP, users should pass a tuple of the MIME types listed in the `Content-Type` header.

`http_origin` indicates if the resource has been retrieved via HTTP (`True`, default) or not (`False`).

`no_sniff` is a flag which is *`True`* if the user agent does not wish to
perform sniffing on the resource and *`False`* (by default) otherwise. Users may want to set
this parameter to *`True`* if the [`X-Content-Type-Options`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options) response header is set to `nosniff`. For more info, see [here](https://mimesniff.spec.whatwg.org/#no-sniff-flag).
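
As a minimal sketch of how a caller might derive the `no_sniff` value from response headers: the helper name `no_sniff_from_headers` is hypothetical and is not part of `xtractmime`, and the headers are assumed to be available as a plain string mapping.

```python
from typing import Mapping


def no_sniff_from_headers(headers: Mapping[str, str]) -> bool:
    """Hypothetical helper (not part of xtractmime).

    Return True if the X-Content-Type-Options header is set to "nosniff".
    Header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "x-content-type-options":
            return value.strip().lower() == "nosniff"
    # Header absent: sniffing is allowed.
    return False
```

The result could then be passed as `extract_mime(body, no_sniff=no_sniff_from_headers(headers))`.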

`extra_types` is a tuple of patterns to support detecting additional MIME types. Each entry in the tuple should follow the format
**(Byte Pattern, Pattern Mask, Leading Bytes, MIME type)**:

* **Byte Pattern** is a byte sequence to compare with the first few bytes (``xtractmime.RESOURCE_HEADER_BUFFER_LENGTH``) of the `body`.
* **Pattern Mask** is a byte sequence that indicates the significance of **Byte Pattern** bytes: `b"\xff"` indicates the matching byte is strictly significant, `b"\xdf"` indicates that the byte is significant in an ASCII case-insensitive way, and `b"\x00"` indicates that the byte is not significant.
* **Leading Bytes** is a set of bytes to be ignored while matching the leading bytes in the content.
* **MIME type** is the value returned if the pattern matches.

**Sample `extra_types`:**
```python
extra_types = ((b'test', b'\xff\xff\xff\xff', None, b'text/test'), ...)
```
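
To illustrate how a pattern, its mask, and the leading bytes interact, here is a rough sketch modelled on the pattern matching algorithm of the MIME Sniffing Standard. The function `matches_pattern` is written for illustration only; it is an assumption about the behaviour, not the code `xtractmime` actually runs.

```python
from typing import Optional, Set


def matches_pattern(
    data: bytes,
    pattern: bytes,
    mask: bytes,
    leading_bytes: Optional[Set[bytes]] = None,
) -> bool:
    """Illustrative sketch, not the xtractmime implementation."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    leading_bytes = leading_bytes or set()

    # Skip any leading bytes that should be ignored (e.g. whitespace).
    start = 0
    while start < len(data) and data[start : start + 1] in leading_bytes:
        start += 1

    # Not enough data left to hold the whole pattern.
    if len(data) - start < len(pattern):
        return False

    # Apply the mask to each data byte and compare with the pattern byte.
    for offset, (pattern_byte, mask_byte) in enumerate(zip(pattern, mask)):
        if data[start + offset] & mask_byte != pattern_byte:
            return False
    return True
```

Note that with a `b"\xdf"` mask byte the corresponding pattern byte has to be the uppercase letter, and with a `b"\x00"` mask byte the pattern byte has to be `b"\x00"`, because the mask is applied to the data before the comparison.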

---
**NOTE**

*Be careful while using the `extra_types` argument, as it may introduce some privilege escalation vulnerabilities for `xtractmime`. For more info, see [here](https://mimesniff.spec.whatwg.org/#ref-for-mime-type%E2%91%A1%E2%91%A8).*

---

Optional `supported_types` is a set of all [MIME types supported by the user agent](https://mimesniff.spec.whatwg.org/#supported-by-the-user-agent). If `supported_types` is not
specified, all MIME types are assumed to be supported. Using this parameter can improve the performance of `xtractmime`.

### function `xtractmime.is_binary_data(input_bytes: bytes) -> bool`

Return *`True`* if the provided byte sequence contains any binary data bytes, else *`False`*.
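
For reference, the MIME Sniffing Standard defines a binary data byte as a byte in the ranges 0x00-0x08, 0x0E-0x1A, or 0x1C-0x1F, or the byte 0x0B. The sketch below applies that definition to a whole byte sequence; `contains_binary_data` is a hypothetical name, and whether `xtractmime.is_binary_data` inspects the full input or only a prefix is not stated here.

```python
# Binary data bytes as defined by the MIME Sniffing Standard.
BINARY_DATA_BYTES = (
    frozenset(range(0x00, 0x09))    # NUL to BS
    | {0x0B}                        # VT
    | frozenset(range(0x0E, 0x1B))  # SO to SUB
    | frozenset(range(0x1C, 0x20))  # FS to US
)


def contains_binary_data(data: bytes) -> bool:
    """Illustrative sketch: True if any byte of `data` is a binary data byte."""
    return any(byte in BINARY_DATA_BYTES for byte in data)
```

Tab, line feed, form feed, carriage return, and escape (0x09, 0x0A, 0x0C, 0x0D, 0x1B) are deliberately outside these ranges, so ordinary text does not count as binary.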

### MIME type group functions

The following functions return `True` if a given MIME type belongs to a certain
[MIME type group](https://mimesniff.spec.whatwg.org/#mime-type-groups), or
`False` otherwise:
```
xtractmime.mimegroups.is_archive_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_audio_video_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_font_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_html_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_image_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_javascript_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_json_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_scriptable_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_xml_mime_type(mime_type: bytes) -> bool
xtractmime.mimegroups.is_zip_mime_type(mime_type: bytes) -> bool
```
**Example**
```python
>>> from xtractmime.mimegroups import is_html_mime_type, is_image_mime_type, is_zip_mime_type
>>> mime_type = b'text/html'
>>> is_html_mime_type(mime_type)
True
>>> is_image_mime_type(mime_type)
False
>>> is_zip_mime_type(mime_type)
False
```
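
As a rough idea of what such group checks involve, the sketch below follows the definitions of the XML, JSON, and ZIP groups in the MIME Sniffing Standard (a fixed set of essences plus a `+suffix` rule). The `sketch_*` names are hypothetical and this is not the code in `xtractmime.mimegroups`; in particular, stripping parameters and lowercasing the input are assumptions of the sketch.

```python
from typing import FrozenSet


def _essence(mime_type: bytes) -> bytes:
    # Assumption: drop parameters such as b"; charset=utf-8" and lowercase.
    return mime_type.split(b";", 1)[0].strip().lower()


def _in_group(mime_type: bytes, essences: FrozenSet[bytes], suffix: bytes) -> bool:
    essence = _essence(mime_type)
    return essence in essences or essence.endswith(suffix)


def sketch_is_xml_mime_type(mime_type: bytes) -> bool:
    return _in_group(mime_type, frozenset({b"text/xml", b"application/xml"}), b"+xml")


def sketch_is_json_mime_type(mime_type: bytes) -> bool:
    return _in_group(mime_type, frozenset({b"application/json", b"text/json"}), b"+json")


def sketch_is_zip_mime_type(mime_type: bytes) -> bool:
    return _in_group(mime_type, frozenset({b"application/zip"}), b"+zip")
```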
3 changes: 1 addition & 2 deletions xtractmime/__init__.py
@@ -66,7 +66,6 @@ def _sniff_mislabled_feed(input_bytes: bytes, supplied_type: bytes) -> Optional[

     while index < input_size:
         while True:
-            print(input_bytes[index : index + 1])
             if not input_bytes[index : index + 1]:
                 return supplied_type

@@ -186,7 +185,7 @@ def extract_mime(
     http_origin: bool = True,
     no_sniff: bool = False,
     extra_types: Optional[Tuple[Tuple[bytes, bytes, Optional[Set[bytes]], bytes], ...]] = None,
-    supported_types: Set[bytes] = None,
+    supported_types: Optional[Set[bytes]] = None,
 ) -> Optional[bytes]:
     extra_types = extra_types or tuple()
     supplied_type = content_types[-1] if content_types else b""