Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a fuzz target for _elementtree.XMLParser._parse_whole #111477

Merged
merged 9 commits into from Nov 3, 2023

Conversation

bradlarsen
Copy link
Contributor

This PR adds a new fuzz target for the _elementtree.XMLParser._parse_whole method, which is implemented in C. The fuzz target merely attempts to parse the given input data using that method, checking that it does not crash. No higher-level properties are checked.

A dictionary of XML-relevant tokens is included, as is a seed corpus based on the XML files from Lib/test/xmltestdata.

@bradlarsen
Copy link
Contributor Author

@alex @gpshead here's a new fuzzing target!

@bradlarsen
Copy link
Contributor Author

bradlarsen commented Oct 30, 2023

It looks like there is CIFuzz CI job. But that is failing to get a seed corpus, even though I've added one to the source tree. I wonder if I'm missing a step to ensure that the corpus is used?

2023-10-30 02:28:42,174 - root - INFO - Downloading corpus from OSS-Fuzz: https://storage.googleapis.com/cpython3-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/cpython3_fuzz_elementtree_parsewhole/public.zip
2023-10-30 02:28:42,177 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): storage.googleapis.com:443
2023-10-30 02:28:42,256 - urllib3.connectionpool - DEBUG - [https://storage.googleapis.com:443](https://storage.googleapis.com/) "GET /cpython3-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/cpython3_fuzz_elementtree_parsewhole/public.zip HTTP/1.1" 403 298
2023-10-30 02:28:42,257 - root - ERROR - Unable to download from: https://storage.googleapis.com/cpython3-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/cpython3_fuzz_elementtree_parsewhole/public.zip. Code: 403. Content: b"<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details></Error>".
2023-10-30 02:28:42,260 - root - WARNING - Failed to download corpus for fuzz_elementtree_parsewhole.

@bradlarsen
Copy link
Contributor Author

It also doesn't look like the provided dictionary file for the new fuzz target is being used in the CIFuzz job...

@gvanrossum
Copy link
Member

Why 60 files?

@bradlarsen
Copy link
Contributor Author

Oh, it looks like I made a typo in the corpus and dictionary pathnames ("elementree" instead of "elementtree"), which was probably responsible for those being missed.

@bradlarsen
Copy link
Contributor Author

Why 60 files?

  • 2 files are changed to include the actual new fuzzer target implementation (Modules/_xxtestfuzz/fuzzer.c and Modules/_xxtestfuzz/fuzz_tests.txt)
  • 1 file is changed to appease a global variable static checker, similar to how the other fuzz targets were handled (Tools/c-analyzer/cpython/ignored.tsv)
  • 1 file is added to provide a fuzzing dictionary for the XML parser
  • 56 files were added as seed corpus files, used to bootstrap the fuzzing process. They were copied from Lib/test/xmltestdata.

So it's actually just 3 existing files that were changed. I suppose the seed corpus could be shrunk down if that's objectionable.

@gvanrossum
Copy link
Member

So it's actually just 3 existing files that were changed. I suppose the seed corpus could be shrunk down if that's objectionable.

No worries, I was just curious (few PRs touch that many files so I thought there might have been a mistake).

@alex
Copy link
Member

alex commented Oct 30, 2023

Have you run this locally and verified that it makes progress / reaches new functions / gains coverage?

@bradlarsen
Copy link
Contributor Author

Have you run this locally and verified that it makes progress / reaches new functions / gains coverage?

Yes! Since it has a relatively small starting corpus, it gets new coverage and hits new functions very quickly. In my own non-oss-fuzz setup, I have a corpus of some 15k inputs that would get regularly minimized and pruned.

You can see in the CIFuzz GitHub Actions job (look at the Run fuzzers (address) step) some evidence of the new coverage and functions being hit:

2023-10-30 03:38:19,704 - root - INFO - Starting fuzzing
Fuzzing logs:
/github/workspace/build-out/fuzz_elementtree_parsewhole -timeout=25 -rss_limit_mb=2560 -dict=/github/workspace/build-out/fuzz_elementtree_parsewhole.dict -len_control=0 -seed=1337 -artifact_prefix=/tmp/tmpmh82bt33/ -max_total_time=60 -print_final_stats=1 /github/workspace/cifuzz-corpus/fuzz_elementtree_parsewhole >fuzz-0.log 2>&1
================== Job 0 exited with exit code 0 ============
Dictionary: 129 entries
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1337
INFO: Loaded 1 modules   (139811 inline 8-bit counters): 139811 [0x15ee0f0, 0x1610313), 
INFO: Loaded 1 PC tables (139811 PCs): 139811 [0x119fea0,0x13c20d0), 
INFO:       56 files found in /github/workspace/cifuzz-corpus/fuzz_elementtree_parsewhole
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: seed corpus: files: 56 min: 13b max: 1388b total: 14704b rss: 66Mb
#113	INITED cov: 2192 ft: 5036 corp: 53/13786b exec/s: 0 rss: 71Mb
#115	NEW    cov: 2196 ft: 5057 corp: 54/14173b lim: 4096 exec/s: 0 rss: 71Mb L: 387/1388 MS: 1 ManualDict- DE: "\\"-
#119	NEW    cov: 2198 ft: 5062 corp: 55/14281b lim: 4096 exec/s: 0 rss: 71Mb L: 108/1388 MS: 2 InsertByte-CrossOver-
#121	NEW    cov: 2202 ft: 5068 corp: 56/14678b lim: 4096 exec/s: 0 rss: 71Mb L: 397/1388 MS: 1 InsertRepeatedBytes-
#127	NEW    cov: 2203 ft: 5070 corp: 57/15023b lim: 4096 exec/s: 0 rss: 72Mb L: 345/1388 MS: 3 ChangeBinInt-InsertByte-ChangeBit-
#129	NEW    cov: 2203 ft: 5074 corp: 58/15124b lim: 4096 exec/s: 0 rss: 72Mb L: 101/1388 MS: 1 EraseBytes-
	NEW_FUNC[1/2]: 0x7f10c3b78a00 in utf8_isName2 /src/cpython3/./Modules/expat/xmltok.c:140
	NEW_FUNC[2/2]: 0x7f10c3b78c20 in utf8_isNmstrt2 /src/cpython3/./Modules/expat/xmltok.c:154
#133	NEW    cov: 2207 ft: 5084 corp: 59/15149b lim: 4096 exec/s: 133 rss: 72Mb L: 25/1388 MS: 2 CopyPart-CrossOver-
#139	NEW    cov: 2207 ft: 5085 corp: 60/15257b lim: 4096 exec/s: 139 rss: 72Mb L: 108/1388 MS: 3 PersAutoDict-ChangeBit-ShuffleBytes- DE: "\\"-
#141	NEW    cov: 2208 ft: 5091 corp: 61/15404b lim: 4096 exec/s: 141 rss: 72Mb L: 147/1388 MS: 1 CopyPart-
#143	NEW    cov: 2210 ft: 5095 corp: 62/15449b lim: 4096 exec/s: 143 rss: 72Mb L: 45/1388 MS: 1 InsertByte-
#145	NEW    cov: 2211 ft: 5097 corp: 63/15700b lim: 4096 exec/s: 145 rss: 72Mb L: 251/1388 MS: 1 CMP- DE: "$\000\000\000\000\000\000\000"-
#147	NEW    cov: 2213 ft: 5099 corp: 64/15801b lim: 4096 exec/s: 147 rss: 72Mb L: 101/1388 MS: 1 PersAutoDict- DE: "$\000\000\000\000\000\000\000"-
#149	NEW    cov: 2214 ft: 5106 corp: 65/15905b lim: 4096 exec/s: 149 rss: 72Mb L: 104/1388 MS: 1 CMP- DE: "\001\000\000w"-
#436	NEW    cov: 2285 ft: 5405 corp: 112/28Kb lim: 4096 exec/s: 436 rss: 81Mb L: 280/1397 MS: 1 EraseBytes-
#448	REDUCE cov: 2285 ft: 5405 corp: 112/28Kb lim: 4096 exec/s: 448 rss: 81Mb L: 176/1397 MS: 1 EraseBytes-
	NEW_FUNC[1/1]: 0x7f10c3b31a00 in prologProcessor /src/cpython3/./Modules/expat/xmlparse.c:4604
...
#727	NEW    cov: 2339 ft: 5574 corp: 144/35Kb lim: 4096 exec/s: 727 rss: 84Mb L: 614/1397 MS: 1 CopyPart-
	NEW_FUNC[1/306]: 0x587c40 in _Py_CheckSlotResult /src/cpython3/Objects/call.c:76
	NEW_FUNC[2/306]: 0x588290 in _PyObject_VectorcallDictTstate /src/cpython3/Objects/call.c:114
#735	NEW    cov: 4237 ft: 7773 corp: 145/35Kb lim: 4096 exec/s: 735 rss: 84Mb L: 62/1397 MS: 1 ShuffleBytes-
...
#814	NEW    cov: 4239 ft: 7788 corp: 151/37Kb lim: 4096 exec/s: 814 rss: 84Mb L: 266/1397 MS: 5 EraseBytes-InsertByte-EraseBytes-CopyPart-InsertRepeatedBytes-
#818	NEW    cov: 4240 ft: 7790 corp: 152/38Kb lim: 4096 exec/s: 818 rss: 84Mb L: 517/1397 MS: 2 ShuffleBytes-ChangeBit-
	NEW_FUNC[1/2]: 0x7f10c3b793d0 in little2_prologTok /src/cpython3/./Modules/expat/xmltok_impl.c:1018
	NEW_FUNC[2/2]: 0x7f10c3b81c80 in little2_updatePosition /src/cpython3/./Modules/expat/xmltok_impl.c:1779
...
#1029	REDUCE cov: 4262 ft: 7817 corp: 161/39Kb lim: 4096 exec/s: 1029 rss: 85Mb L: 370/1397 MS: 1 EraseBytes-
#1037	NEW    cov: 4262 ft: 7819 corp: 162/40Kb lim: 4096 exec/s: 1037 rss: 85Mb L: 572/1397 MS: 1 ChangeBinInt-
#1041	NEW    cov: 4262 ft: 7821 corp: 163/40Kb lim: 4096 exec/s: 1041 rss: 85Mb L: 371/1397 MS: 3 CMP-ChangeBinInt-CrossOver- DE: "\001w"-
#1050	NEW    cov: 4264 ft: 7823 corp: 164/41Kb lim: 4096 exec/s: 1050 rss: 85Mb L: 520/1397 MS: 5 InsertRepeatedBytes-PersAutoDict-CopyPart-ChangeByte-ChangeBinInt- DE: "2"-

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall, makes sense to me. A couple of minor edits to make in the C code that really only matter for pedantic correctness in the never likely MemoryError situation during initialization.

Modules/_xxtestfuzz/fuzzer.c Outdated Show resolved Hide resolved
Modules/_xxtestfuzz/fuzzer.c Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Oct 31, 2023

Not related to this PR, but I'm curious: what is _Py_FUZZ_ONE for? Is there a mode where all fuzzers are run on the same fuzz inputs? That sounds a bit weird.

@pitrou
Copy link
Member

pitrou commented Oct 31, 2023

Also another question: was it considered to write the fuzzers as Python code and execute them using the PyCompile / PyEval APIs?

@alex
Copy link
Member

alex commented Oct 31, 2023

LGTM modulo @gpshead's comments.

@gpshead
Copy link
Member

gpshead commented Oct 31, 2023

[off-topic-ish]

Also another question: was it considered to write the fuzzers as Python code and execute them using the PyCompile / PyEval APIs?

This PR is for the set oss-fuzz runs which we maintain directly (_xxtestfuzz) which is configured in https://github.com/google/oss-fuzz/tree/master/projects/cpython3. (via #73691)

There is a second Python fuzzing project not in our repo using a different approach from a later contributor, also linked from the above issue that can be found in https://github.com/guidovranken/python-library-fuzzers configured to run via https://github.com/google/oss-fuzz/tree/master/projects/python3-libraries. (it tends to write things directly in Python)

@bradlarsen
Copy link
Contributor Author

Not related to this PR, but I'm curious: what is _Py_FUZZ_ONE for? Is there a mode where all fuzzers are run on the same fuzz inputs? That sounds a bit weird.

Yes, it looks like something along those lines. The plumbing is a bit weird.

All the fuzz targets in the cpython tree are defined in that one fuzzer.c source file, and that gets compiled N different ways to produce N different targets, based on the names written in Modules/_xxtestfuzz/fuzz_tests.txt. It is the external oss-fuzz project that does that build.

The entire file also gets compiled into a Python module at Modules/_xxtestfuzz/_xxtestfuzz.c, which gets very briefly tested in Lib/test/test_xxtestfuzz.py. It is cpython itself that does this build.

@bradlarsen
Copy link
Contributor Author

[off-topic-ish]

Also another question: was it considered to write the fuzzers as Python code and execute them using the PyCompile / PyEval APIs?

This PR is for the set oss-fuzz runs which we maintain directly (_xxtestfuzz) which is configured in https://github.com/google/oss-fuzz/tree/master/projects/cpython3. (via #73691)

There is a second Python fuzzing project not in our repo using a different approach from a later contributor, also linked from the above issue that can be found in https://github.com/guidovranken/python-library-fuzzers configured to run via https://github.com/google/oss-fuzz/tree/master/projects/python3-libraries. (it tends to write things directly in Python)

Indeed, a different approach to fuzz-testing Python would be to write the fuzz targets in normal Python code. Guido Vranken's repo above takes that approach. I also experimented with that years ago when I was first doing fuzzing of CPython.

The advantages of writing fuzz targets in Python, when trying to fuzz CPython itself:

  • It's much more expressive per line of code than C, and the targets end up more readable
  • It's easier to write small bits of Python code than small bits of C code that call the Python C APIs

The disadvantages of writing fuzz targets in Python, when trying to fuzz CPython itself:

  • With fuzzing, you are usually trying to find memory errors, which are difficult to express in pure Python (barring horrible bugs). It's the underlying C code that would exhibit this problems. So writing targets in Python distances you from the code you actually want to test.
  • The performance of a fuzz target written carefully against the Python C API is likely to be an order of magnitude faster than going through interpreted Python code. (This may have changed somewhat in the meantime with things like Atheris, which didn't exist when I started.) A fuzzing target that can do 10k iterations per second is going to be much more effective than a functionally equivalent target that does <1k iterations per second.

@bradlarsen
Copy link
Contributor Author

@gpshead @alex I believe I've addressed all your comments on this PR. If so, just awaiting merge. Thank you for the reviews!

@gpshead gpshead merged commit f21b230 into python:main Nov 3, 2023
28 checks passed
@gpshead gpshead added the tests Tests in the Lib/test dir label Nov 3, 2023
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
…11477)

* Add a fuzzer for `_elementtree.XMLParser._parse_whole`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip issue skip news tests Tests in the Lib/test dir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants