Add a fuzz target for `_elementtree.XMLParser._parse_whole` #111477

bradlarsen · 2023-10-30T02:17:13Z

This PR adds a new fuzz target for the _elementtree.XMLParser._parse_whole method, which is implemented in C. The fuzz target merely attempts to parse the given input data using that method, checking that it does not crash. No higher-level properties are checked.

A dictionary of XML-relevant tokens is included, as is a seed corpus based on the XML files from Lib/test/xmltestdata.

These files were copied from `Lib/test/xmltestdata` in this same repo.

bradlarsen · 2023-10-30T02:17:51Z

@alex @gpshead here's a new fuzzing target!

bradlarsen · 2023-10-30T03:15:02Z

It looks like there is CIFuzz CI job. But that is failing to get a seed corpus, even though I've added one to the source tree. I wonder if I'm missing a step to ensure that the corpus is used?

2023-10-30 02:28:42,174 - root - INFO - Downloading corpus from OSS-Fuzz: https://storage.googleapis.com/cpython3-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/cpython3_fuzz_elementtree_parsewhole/public.zip
2023-10-30 02:28:42,177 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): storage.googleapis.com:443
2023-10-30 02:28:42,256 - urllib3.connectionpool - DEBUG - [https://storage.googleapis.com:443](https://storage.googleapis.com/) "GET /cpython3-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/cpython3_fuzz_elementtree_parsewhole/public.zip HTTP/1.1" 403 298
2023-10-30 02:28:42,257 - root - ERROR - Unable to download from: https://storage.googleapis.com/cpython3-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/cpython3_fuzz_elementtree_parsewhole/public.zip. Code: 403. Content: b"<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details></Error>".
2023-10-30 02:28:42,260 - root - WARNING - Failed to download corpus for fuzz_elementtree_parsewhole.

bradlarsen · 2023-10-30T03:17:50Z

It also doesn't look like the provided dictionary file for the new fuzz target is being used in the CIFuzz job...

gvanrossum · 2023-10-30T03:20:06Z

Why 60 files?

bradlarsen · 2023-10-30T03:25:29Z

Oh, it looks like I made a typo in the corpus and dictionary pathnames ("elementree" instead of "elementtree"), which was probably responsible for those being missed.

bradlarsen · 2023-10-30T03:33:24Z

Why 60 files?

2 files are changed to include the actual new fuzzer target implementation (Modules/_xxtestfuzz/fuzzer.c and Modules/_xxtestfuzz/fuzz_tests.txt)
1 file is changed to appease a global variable static checker, similar to how the other fuzz targets were handled (Tools/c-analyzer/cpython/ignored.tsv)
1 file is added to provide a fuzzing dictionary for the XML parser
56 files were added as seed corpus files, used to bootstrap the fuzzing process. They were copied from Lib/test/xmltestdata.

So it's actually just 3 existing files that were changed. I suppose the seed corpus could be shrunk down if that's objectionable.

gvanrossum · 2023-10-30T04:46:55Z

So it's actually just 3 existing files that were changed. I suppose the seed corpus could be shrunk down if that's objectionable.

No worries, I was just curious (few PRs touch that many files so I thought there might have been a mistake).

alex · 2023-10-30T22:42:54Z

Have you run this locally and verified that it makes progress / reaches new functions / gains coverage?

bradlarsen · 2023-10-31T01:21:36Z

Have you run this locally and verified that it makes progress / reaches new functions / gains coverage?

Yes! Since it has a relatively small starting corpus, it gets new coverage and hits new functions very quickly. In my own non-oss-fuzz setup, I have a corpus of some 15k inputs that would get regularly minimized and pruned.

You can see in the CIFuzz GitHub Actions job (look at the Run fuzzers (address) step) some evidence of the new coverage and functions being hit:

2023-10-30 03:38:19,704 - root - INFO - Starting fuzzing
Fuzzing logs:
/github/workspace/build-out/fuzz_elementtree_parsewhole -timeout=25 -rss_limit_mb=2560 -dict=/github/workspace/build-out/fuzz_elementtree_parsewhole.dict -len_control=0 -seed=1337 -artifact_prefix=/tmp/tmpmh82bt33/ -max_total_time=60 -print_final_stats=1 /github/workspace/cifuzz-corpus/fuzz_elementtree_parsewhole >fuzz-0.log 2>&1
================== Job 0 exited with exit code 0 ============
Dictionary: 129 entries
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1337
INFO: Loaded 1 modules   (139811 inline 8-bit counters): 139811 [0x15ee0f0, 0x1610313), 
INFO: Loaded 1 PC tables (139811 PCs): 139811 [0x119fea0,0x13c20d0), 
INFO:       56 files found in /github/workspace/cifuzz-corpus/fuzz_elementtree_parsewhole
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: seed corpus: files: 56 min: 13b max: 1388b total: 14704b rss: 66Mb
#113	INITED cov: 2192 ft: 5036 corp: 53/13786b exec/s: 0 rss: 71Mb
#115	NEW    cov: 2196 ft: 5057 corp: 54/14173b lim: 4096 exec/s: 0 rss: 71Mb L: 387/1388 MS: 1 ManualDict- DE: "\\"-
#119	NEW    cov: 2198 ft: 5062 corp: 55/14281b lim: 4096 exec/s: 0 rss: 71Mb L: 108/1388 MS: 2 InsertByte-CrossOver-
#121	NEW    cov: 2202 ft: 5068 corp: 56/14678b lim: 4096 exec/s: 0 rss: 71Mb L: 397/1388 MS: 1 InsertRepeatedBytes-
#127	NEW    cov: 2203 ft: 5070 corp: 57/15023b lim: 4096 exec/s: 0 rss: 72Mb L: 345/1388 MS: 3 ChangeBinInt-InsertByte-ChangeBit-
#129	NEW    cov: 2203 ft: 5074 corp: 58/15124b lim: 4096 exec/s: 0 rss: 72Mb L: 101/1388 MS: 1 EraseBytes-
	NEW_FUNC[1/2]: 0x7f10c3b78a00 in utf8_isName2 /src/cpython3/./Modules/expat/xmltok.c:140
	NEW_FUNC[2/2]: 0x7f10c3b78c20 in utf8_isNmstrt2 /src/cpython3/./Modules/expat/xmltok.c:154
#133	NEW    cov: 2207 ft: 5084 corp: 59/15149b lim: 4096 exec/s: 133 rss: 72Mb L: 25/1388 MS: 2 CopyPart-CrossOver-
#139	NEW    cov: 2207 ft: 5085 corp: 60/15257b lim: 4096 exec/s: 139 rss: 72Mb L: 108/1388 MS: 3 PersAutoDict-ChangeBit-ShuffleBytes- DE: "\\"-
#141	NEW    cov: 2208 ft: 5091 corp: 61/15404b lim: 4096 exec/s: 141 rss: 72Mb L: 147/1388 MS: 1 CopyPart-
#143	NEW    cov: 2210 ft: 5095 corp: 62/15449b lim: 4096 exec/s: 143 rss: 72Mb L: 45/1388 MS: 1 InsertByte-
#145	NEW    cov: 2211 ft: 5097 corp: 63/15700b lim: 4096 exec/s: 145 rss: 72Mb L: 251/1388 MS: 1 CMP- DE: "$\000\000\000\000\000\000\000"-
#147	NEW    cov: 2213 ft: 5099 corp: 64/15801b lim: 4096 exec/s: 147 rss: 72Mb L: 101/1388 MS: 1 PersAutoDict- DE: "$\000\000\000\000\000\000\000"-
#149	NEW    cov: 2214 ft: 5106 corp: 65/15905b lim: 4096 exec/s: 149 rss: 72Mb L: 104/1388 MS: 1 CMP- DE: "\001\000\000w"-
#436	NEW    cov: 2285 ft: 5405 corp: 112/28Kb lim: 4096 exec/s: 436 rss: 81Mb L: 280/1397 MS: 1 EraseBytes-
#448	REDUCE cov: 2285 ft: 5405 corp: 112/28Kb lim: 4096 exec/s: 448 rss: 81Mb L: 176/1397 MS: 1 EraseBytes-
	NEW_FUNC[1/1]: 0x7f10c3b31a00 in prologProcessor /src/cpython3/./Modules/expat/xmlparse.c:4604
...
#727	NEW    cov: 2339 ft: 5574 corp: 144/35Kb lim: 4096 exec/s: 727 rss: 84Mb L: 614/1397 MS: 1 CopyPart-
	NEW_FUNC[1/306]: 0x587c40 in _Py_CheckSlotResult /src/cpython3/Objects/call.c:76
	NEW_FUNC[2/306]: 0x588290 in _PyObject_VectorcallDictTstate /src/cpython3/Objects/call.c:114
#735	NEW    cov: 4237 ft: 7773 corp: 145/35Kb lim: 4096 exec/s: 735 rss: 84Mb L: 62/1397 MS: 1 ShuffleBytes-
...
#814	NEW    cov: 4239 ft: 7788 corp: 151/37Kb lim: 4096 exec/s: 814 rss: 84Mb L: 266/1397 MS: 5 EraseBytes-InsertByte-EraseBytes-CopyPart-InsertRepeatedBytes-
#818	NEW    cov: 4240 ft: 7790 corp: 152/38Kb lim: 4096 exec/s: 818 rss: 84Mb L: 517/1397 MS: 2 ShuffleBytes-ChangeBit-
	NEW_FUNC[1/2]: 0x7f10c3b793d0 in little2_prologTok /src/cpython3/./Modules/expat/xmltok_impl.c:1018
	NEW_FUNC[2/2]: 0x7f10c3b81c80 in little2_updatePosition /src/cpython3/./Modules/expat/xmltok_impl.c:1779
...
#1029	REDUCE cov: 4262 ft: 7817 corp: 161/39Kb lim: 4096 exec/s: 1029 rss: 85Mb L: 370/1397 MS: 1 EraseBytes-
#1037	NEW    cov: 4262 ft: 7819 corp: 162/40Kb lim: 4096 exec/s: 1037 rss: 85Mb L: 572/1397 MS: 1 ChangeBinInt-
#1041	NEW    cov: 4262 ft: 7821 corp: 163/40Kb lim: 4096 exec/s: 1041 rss: 85Mb L: 371/1397 MS: 3 CMP-ChangeBinInt-CrossOver- DE: "\001w"-
#1050	NEW    cov: 4264 ft: 7823 corp: 164/41Kb lim: 4096 exec/s: 1050 rss: 85Mb L: 520/1397 MS: 5 InsertRepeatedBytes-PersAutoDict-CopyPart-ChangeByte-ChangeBinInt- DE: "2"-

gpshead

overall, makes sense to me. A couple of minor edits to make in the C code that really only matter for pedantic correctness in the never likely MemoryError situation during initialization.

Modules/_xxtestfuzz/fuzzer.c

pitrou · 2023-10-31T12:10:38Z

Not related to this PR, but I'm curious: what is _Py_FUZZ_ONE for? Is there a mode where all fuzzers are run on the same fuzz inputs? That sounds a bit weird.

pitrou · 2023-10-31T12:15:50Z

Also another question: was it considered to write the fuzzers as Python code and execute them using the PyCompile / PyEval APIs?

alex · 2023-10-31T13:37:51Z

LGTM modulo @gpshead's comments.

Modules/_xxtestfuzz/fuzzer.c

gpshead · 2023-10-31T23:12:22Z

[off-topic-ish]

Also another question: was it considered to write the fuzzers as Python code and execute them using the PyCompile / PyEval APIs?

This PR is for the set oss-fuzz runs which we maintain directly (_xxtestfuzz) which is configured in https://github.com/google/oss-fuzz/tree/master/projects/cpython3. (via #73691)

There is a second Python fuzzing project not in our repo using a different approach from a later contributor, also linked from the above issue that can be found in https://github.com/guidovranken/python-library-fuzzers configured to run via https://github.com/google/oss-fuzz/tree/master/projects/python3-libraries. (it tends to write things directly in Python)

bradlarsen · 2023-11-01T01:22:17Z

Not related to this PR, but I'm curious: what is _Py_FUZZ_ONE for? Is there a mode where all fuzzers are run on the same fuzz inputs? That sounds a bit weird.

Yes, it looks like something along those lines. The plumbing is a bit weird.

All the fuzz targets in the cpython tree are defined in that one fuzzer.c source file, and that gets compiled N different ways to produce N different targets, based on the names written in Modules/_xxtestfuzz/fuzz_tests.txt. It is the external oss-fuzz project that does that build.

The entire file also gets compiled into a Python module at Modules/_xxtestfuzz/_xxtestfuzz.c, which gets very briefly tested in Lib/test/test_xxtestfuzz.py. It is cpython itself that does this build.

bradlarsen · 2023-11-01T01:31:35Z

[off-topic-ish]

Also another question: was it considered to write the fuzzers as Python code and execute them using the PyCompile / PyEval APIs?

This PR is for the set oss-fuzz runs which we maintain directly (_xxtestfuzz) which is configured in https://github.com/google/oss-fuzz/tree/master/projects/cpython3. (via #73691)

There is a second Python fuzzing project not in our repo using a different approach from a later contributor, also linked from the above issue that can be found in https://github.com/guidovranken/python-library-fuzzers configured to run via https://github.com/google/oss-fuzz/tree/master/projects/python3-libraries. (it tends to write things directly in Python)

Indeed, a different approach to fuzz-testing Python would be to write the fuzz targets in normal Python code. Guido Vranken's repo above takes that approach. I also experimented with that years ago when I was first doing fuzzing of CPython.

The advantages of writing fuzz targets in Python, when trying to fuzz CPython itself:

It's much more expressive per line of code than C, and the targets end up more readable
It's easier to write small bits of Python code than small bits of C code that call the Python C APIs

The disadvantages of writing fuzz targets in Python, when trying to fuzz CPython itself:

With fuzzing, you are usually trying to find memory errors, which are difficult to express in pure Python (barring horrible bugs). It's the underlying C code that would exhibit this problems. So writing targets in Python distances you from the code you actually want to test.
The performance of a fuzz target written carefully against the Python C API is likely to be an order of magnitude faster than going through interpreted Python code. (This may have changed somewhat in the meantime with things like Atheris, which didn't exist when I started.) A fuzzing target that can do 10k iterations per second is going to be much more effective than a functionally equivalent target that does <1k iterations per second.

bradlarsen · 2023-11-01T01:33:26Z

@gpshead @alex I believe I've addressed all your comments on this PR. If so, just awaiting merge. Thank you for the reviews!

…11477) * Add a fuzzer for `_elementtree.XMLParser._parse_whole`

bradlarsen added 4 commits October 29, 2023 18:26

Add a fuzzer for _elementtree.XMLParser._parse_whole

ae96a12

Add a dictionary for fuzz_elementtree_parsewhole

5366c0e

Add a corpus for fuzz_elementtree_parsewhole

e8202e0

These files were copied from `Lib/test/xmltestdata` in this same repo.

Merge branch 'python:main' into main

53e36d9

bedevere-app bot added the awaiting review label Oct 30, 2023

Note new fuzzer global variables to appease the checker

446f1c0

Fix typo in dictionary and corpus names

d1f489a

alex added skip issue skip news labels Oct 30, 2023

gpshead approved these changes Oct 31, 2023

View reviewed changes

Modules/_xxtestfuzz/fuzzer.c Outdated Show resolved Hide resolved

Modules/_xxtestfuzz/fuzzer.c Outdated Show resolved Hide resolved

bedevere-app bot added awaiting merge and removed awaiting review labels Oct 31, 2023

gpshead reviewed Oct 31, 2023

View reviewed changes

Modules/_xxtestfuzz/fuzzer.c Show resolved Hide resolved

Rename global variable; fix refcounting leaks

38d5abc

Merge branch 'main' into main

a1186c9

move DECREF io_module to the right spot

fd53745

gpshead approved these changes Nov 1, 2023

View reviewed changes

gpshead merged commit f21b230 into python:main Nov 3, 2023
28 checks passed

bedevere-app bot removed the awaiting merge label Nov 3, 2023

gpshead added the tests Tests in the Lib/test dir label Nov 3, 2023

aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024

Add a fuzz target for _elementtree.XMLParser._parse_whole (python#1…

53e2ef9

…11477) * Add a fuzzer for `_elementtree.XMLParser._parse_whole`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a fuzz target for `_elementtree.XMLParser._parse_whole` #111477

Add a fuzz target for `_elementtree.XMLParser._parse_whole` #111477

bradlarsen commented Oct 30, 2023

bradlarsen commented Oct 30, 2023

bradlarsen commented Oct 30, 2023 •

edited

bradlarsen commented Oct 30, 2023

gvanrossum commented Oct 30, 2023

bradlarsen commented Oct 30, 2023

bradlarsen commented Oct 30, 2023

gvanrossum commented Oct 30, 2023

alex commented Oct 30, 2023

bradlarsen commented Oct 31, 2023

gpshead left a comment

pitrou commented Oct 31, 2023

pitrou commented Oct 31, 2023

alex commented Oct 31, 2023

gpshead commented Oct 31, 2023 •

edited

bradlarsen commented Nov 1, 2023

bradlarsen commented Nov 1, 2023

bradlarsen commented Nov 1, 2023

Add a fuzz target for _elementtree.XMLParser._parse_whole #111477

Add a fuzz target for _elementtree.XMLParser._parse_whole #111477

Conversation

bradlarsen commented Oct 30, 2023

bradlarsen commented Oct 30, 2023

bradlarsen commented Oct 30, 2023 • edited

bradlarsen commented Oct 30, 2023

gvanrossum commented Oct 30, 2023

bradlarsen commented Oct 30, 2023

bradlarsen commented Oct 30, 2023

gvanrossum commented Oct 30, 2023

alex commented Oct 30, 2023

bradlarsen commented Oct 31, 2023

gpshead left a comment

Choose a reason for hiding this comment

pitrou commented Oct 31, 2023

pitrou commented Oct 31, 2023

alex commented Oct 31, 2023

gpshead commented Oct 31, 2023 • edited

bradlarsen commented Nov 1, 2023

bradlarsen commented Nov 1, 2023

bradlarsen commented Nov 1, 2023

Add a fuzz target for `_elementtree.XMLParser._parse_whole` #111477

Add a fuzz target for `_elementtree.XMLParser._parse_whole` #111477

bradlarsen commented Oct 30, 2023 •

edited

gpshead commented Oct 31, 2023 •

edited