
process.py locking up on certain pdfs #54

Open
mallorybobalice opened this issue Feb 22, 2016 · 26 comments

@mallorybobalice

hi,

We seem to have a 'lucky stream' of corrupted PCAP extracts whose PDFs cause processing to hang indefinitely (normally executables and other PDFs submit, process, and report OK).
While I'll try to post a PCAP, that might not be possible (can Brad message me privately about that?).

Anyhow, we're running processing tasks not within the main cuckoo process, but via the utils/process.py helper.

e.g. /utils/process.py -d -p 7 auto

Eventually enough of these PDFs queue up that processing halts completely, because every process.py worker ends up hung:
2016-02-18 23:03:57,947 [modules.processing.static] DEBUG: Starting to load PDF
2016-02-18 23:04:00,365 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:03,622 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:06,989 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:07,711 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:18,115 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:18,923 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:19,794 [modules.processing.static] DEBUG: About to parse with PDFParser
^ and then nothing happens after that. Without adding extra debug statements in static.py after the PDFParser is instantiated, I can't tell whether it freezes there or during result processing; I suspect it's in PDFParser / peepdf's PDFCore.

Occasionally processing seems to restart (probably when I kick it over), and then the same set of tasks freezes again.

As far as I can tell, static analysis happens via peepdf (the version in their repo is a bit newer, by the way; using that still makes processing freeze). While on the subject: did they merge the Google Summer of Code PDF malscore changes, and did we pull them into csb or brad-csb?

So anyhow, I was hoping someone who knows a bit more Python (and about safely killing processing tasks and marking them as failed) could help with a quick workaround, for example a watchdog timer in process.py controlled by an extra config setting: if processing takes over, say, 600 seconds, the task gets marked as failed and the helper process for that task ID terminates itself, preferably cleanly. A rough sketch of what I mean is below.
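Something along these lines, maybe (untested sketch only; run_with_watchdog and the failure-marking step are illustrative, not actual process.py internals):

    import logging
    import multiprocessing

    log = logging.getLogger(__name__)

    def run_with_watchdog(target, task_id, timeout=600):
        """Run target(task_id) in a child process and kill it if it exceeds the timeout."""
        worker = multiprocessing.Process(target=target, args=(task_id,))
        worker.start()
        worker.join(timeout)
        if worker.is_alive():
            # Processing for this task hung (e.g. on a bad PDF): kill the worker
            # and let the caller mark the task as failed in the database.
            worker.terminate()
            worker.join()
            log.error("Task #%d exceeded %d s of processing, worker killed", task_id, timeout)
            return False
        return True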

Thoughts/questions? Please help.
Mb.

@mallorybobalice
Author

PS: I was eyeing the CSB2 process.py, but it was different enough that I gave up on that idea (just dropping it in over the spender-sandbox one doesn't work).

@Fryyyyy

Fryyyyy commented Feb 22, 2016

We see this too, extremely frequently.
Enough that I wrapped it in an interruptingcow (https://pypi.python.org/pypi/interruptingcow) timeout/try-except so the rest of the processing module can proceed.
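The basic shape of it is roughly this (sketch only; parse_pdf and path stand in for the real peepdf-based call inside static.py):

    from interruptingcow import timeout

    try:
        # Give the static PDF stage a bounded amount of time, then move on.
        with timeout(60, exception=RuntimeError):
            results = parse_pdf(path)  # placeholder for the actual parse call
    except RuntimeError:
        results = {}  # PDF static analysis is skipped; the rest of processing continues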

@mallorybobalice
Author

Hmm, the workaround tip is much appreciated.
It looks like this way I don't necessarily have to fail the task.

Do you wrap the self._parse call here, or further out, and somehow indicate that part or all of the static analysis failed? (Would you mind sharing?)

log.debug("Starting to load PDF")
results = self._parse(self.file_path)
return results

I was thinking of wrapping just the results bit, but that might be too deep, and the layers above might not like partially populated results.

À la:

from interruptingcow import timeout

log.debug("Starting to load PDF")
results = {}
try:
    with timeout(40, exception=RuntimeError):
        results = self._parse(self.file_path)
except RuntimeError:
    log.error("PDF analysis timed out (task id / name would go here)")
return results

While on the subject of PDFs, @Fryyyyy: do you also often get corrupted PDFs that lock up Reader when opened in the guest VM and produce large .bson logs that consume a lot of memory during processing on the CSB host? Any thoughts on that? (I just increased RAM in line with the number of parallel tasks and reduced the max log size, but that doesn't make me feel warm and fuzzy.)

@Fryyyyy

Fryyyyy commented Feb 23, 2016

Unfortunately I'm not permitted to share code, though I am working on getting permission.
I haven't noticed lockups or large logs, but then again we have a lot of hosts with a lot of guests doing a lot of analyses so it's not easy to keep an eye on individual submissions.

@mallorybobalice
Author

I see. No worries and thanks again

@spender-sandbox
Owner

If someone gives me a hash for one of these PDFs, I can fix the actual problem instead of working around it.

-Brad

@mallorybobalice
Author

Thanks Brad. I think Fryyyyy and I assumed (or Fry may have checked) that it's upstream in peepdf. (I suppose checking is one break or print statement away.) Hence looking for workarounds.

I'll have a look tomorrow, and if it's already shared on malwr or HA I'll supply the hash. If it's a public sample unrelated to my employer, I'm happy to share it. I can probably try to get permission to share a work-related one, but it would have to go out of band with an archive password. Is the latter an option?
I'd also like to try to share the huge Reader-freezing .bson PDF ones; maybe those operations can be coalesced or... I don't know, I haven't really looked closely at it.

@mallorybobalice
Author

PS: the PDF-specific issue aside, are you sure there shouldn't be a timeout per processing module (or per signature and reporting step), or is that not worth it because it normally works? I'd rather find out from a syslog alert saying a task's processing failed than find all tasks hung next time. From an ops monitoring perspective, bounded run times feel all warm and fuzzy inside.

@Fryyyyy

Fryyyyy commented Feb 23, 2016

I've modified static.py to dump the PDFs out to a temp directory when it triggers the timeout. Once we've got a few, I'll see if I can get some cleared for release or at least do some analysis on them to see if there's anything similar about them.
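Roughly the shape of it (not our actual code, which I can't share; the directory and helper name here are just illustrative):

    import os
    import shutil

    DUMP_DIR = "/tmp/hung_pdfs"  # illustrative location

    def dump_hung_sample(file_path, task_id):
        """Copy a sample that hit the static-analysis timeout somewhere for later triage."""
        if not os.path.isdir(DUMP_DIR):
            os.makedirs(DUMP_DIR)
        dest = os.path.join(DUMP_DIR, "task_%s_%s" % (task_id, os.path.basename(file_path)))
        shutil.copy2(file_path, dest)
        return dest

Called from the except branch that handles the timeout.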

@Fryyyyy

Fryyyyy commented Feb 23, 2016

3eaef2ca2c9d29e936919c7c6f8e5614aef6edf8cec6c92008291bafea0388d0 took more than a minute to statically analyse, which hits our timeout.

Edit for additional samples:

1e3db20bb77178cabe8e32a47510a027bb38bc585ed02a95052e3965ac9a9b26
2c2c956d74dcc245655a6c56aa052212ac1a933e22e5a41d63afc1aa9d2eccf5
3eaef2ca2c9d29e936919c7c6f8e5614aef6edf8cec6c92008291bafea0388d0
63e0063d43ae9578c328b4683b53c868497dd41c8c112f3365308907ad444a84

@housemusic42

I seem to stumble upon one of these every once in a while. Here's the latest one, where I turned up as much debugging as I could. The jserror log does not have any timestamps, though. If you need the PDF, just let me know.

SHA256 2ab11d83ae2cbd12f0f6c30aacad8a8e16df5255646d08e923054b9f521c4b83
debug_log_pdf.txt

@mallorybobalice
Author

PS: if anyone's interested in the specifics of what Fryyyyy was suggesting, what we've done for now as a workaround is wrap the whole static analysis case statement in interruptingcow (appreciate the tip, Fry). We could optionally copy the offending files off as well, but for now we just log the task ID info in case it's needed later.

in case anyone wants it:

pip install interruptingcow

vi modules/processing/static.py

add new import:
from interruptingcow import timeout

replace the Static(Processing) class with a wrapped version:

class Static(Processing):
    """Static analysis."""

    def run(self):
        """Run analysis.
        @return: results dict.
        """
        self.key = "static"
        static = {}
        thetype = ""
        TOSeconds = 60  # you can replace this with a config value
        try:
            with timeout(TOSeconds, exception=RuntimeError):
                if self.task["category"] == "file":
                    thetype = File(self.file_path).get_type()
                    if HAVE_PEFILE and ("PE32" in thetype or "MS-DOS executable" in thetype):
                        static = PortableExecutable(self.file_path, self.results).run()
                        if static and "Mono" in thetype:
                            static.update(DotNETExecutable(self.file_path, self.results).run())
                    elif "PDF" in thetype or self.task["target"].endswith(".pdf"):
                        static = PDF(self.file_path).run()
                    elif "Word 2007" in thetype or "Excel 2007" in thetype or "PowerPoint 2007" in thetype or "MIME entity" in thetype:
                        static = Office(self.file_path).run()
                    elif "Composite Document File" in thetype:
                        static = Office(self.file_path).run()
                    elif self.task["target"].endswith((".doc", ".docx", ".rtf", ".xls", ".mht", ".mso", ".xlsx", ".ppt", ".pptx", ".pps", ".ppsx", ".pptm", ".potm", ".potx", ".ppsm")):
                        static = Office(self.file_path).run()
                    elif "Java Jar" in thetype or self.task["target"].endswith(".jar"):
                        decomp_jar = self.options.get("procyon_path", None)
                        if decomp_jar and not os.path.exists(decomp_jar):
                            log.error("procyon_path specified in processing.conf but the file does not exist.")
                        static = Java(self.file_path, decomp_jar).run()
                    # It's possible to fool libmagic into thinking our 2007+ file is a
                    # zip. So until we have static analysis for zip files, we can use
                    # oleid to fail us out silently, yielding no static analysis
                    # results for actual zip files.
                    elif "Zip archive data, at least v2.0" in thetype:
                        static = Office(self.file_path).run()
                elif self.task["category"] == "url":
                    enabled_whois = self.options.get("whois", True)
                    if HAVE_WHOIS and enabled_whois:
                        static = URL(self.task["target"]).run()
        except RuntimeError:
            log.error("Error performing static analysis for task %d within %d s, type: %s, file: %s",
                      self.task["id"], TOSeconds, thetype, self.file_path)
        return static
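If you'd rather not hard-code the 60 seconds, you could presumably read it from the static section of processing.conf instead (untested; assumes you add a static_timeout option there):

    # Inside Static.run(), instead of the hard-coded value:
    TOSeconds = int(self.options.get("static_timeout", 60))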



@housemusic42

Here's a PDF that I can consistently get PDFParser to stall out on. Let me know if you need more information.

ed2732bce8351b839e924a0cf5512ce90fcdfd4274796824bab40eb2d1850ff0.pdf

@housemusic42

Garrr... I'm still having problems with this. @mallorybobalice, I tried your wrapper but it doesn't seem to send an interrupt? Any help with this would be great. I'm thinking about just restarting the processing script every hour, or slapping something together to regenerate unprocessed tasks.

@mallorybobalice
Author

Hmm, processing log please (or the cuckoo log, if processing is not done via the utils helper). Mine interrupts OK. Are you sure you installed interruptingcow and added the import, in addition to the wrapper?
Try setting it to 20 s. It would be good if Brad or someone merged it in; it's getting a bit old merging it myself on every git pull. Has anyone bothered to report this upstream to peepdf? I haven't :(

The thing is, restarting processing won't help if it tries the same file again. Interrupting partially fails static analysis but lets the rest continue.

@mallorybobalice
Author

PS: and please attach a copy of your modified modules/processing/static.py.

@mallorybobalice
Author

And a sample freezing PDF to try, if different from the ones above.

@mallorybobalice
Author

PS: but really we need to open an issue here: https://github.com/jesparza/peepdf
Otherwise our CSB servers needlessly chew CPU on these PDFs for the whole timeout period, and for some of us that happens quite often.

@mallorybobalice
Author

mallorybobalice commented Aug 26, 2016

jesparza/peepdf#59

That said, personally I'd rather keep the interrupt code so that static analysis stays time-bounded.

@housemusic42

housemusic42 commented Aug 26, 2016

class Static(Processing):
    """Static analysis."""

    def run(self):
        """Run analysis.
        @return: results dict.
        """
        self.key = "static"
        static = {}
        thetype = ""
        TOSeconds = 60  # you can replace this with a config value
        try:
            with timeout(TOSeconds, exception=RuntimeError):
                if self.task["category"] == "file":
                    package = ""
                    if "info" in self.results and "package" in self.results["info"]:
                        package = self.results["info"]["package"]

                    thetype = File(self.file_path).get_type()
                    if HAVE_PEFILE and ("PE32" in thetype or "MS-DOS executable" in thetype):
                        static = PortableExecutable(self.file_path, self.results).run()
                        if static and "Mono" in thetype:
                            static.update(DotNETExecutable(self.file_path, self.results).run())
                    elif "PDF" in thetype or self.task["target"].endswith(".pdf"):
                        static = PDF(self.file_path).run()
                    elif package in ("doc", "ppt", "xls"):
                        static = Office(self.file_path).run()
                    elif "Java Jar" in thetype or self.task["target"].endswith(".jar"):
                        decomp_jar = self.options.get("procyon_path", None)
                        if decomp_jar and not os.path.exists(decomp_jar):
                            log.error("procyon_path specified in processing.conf but the file does not exist.")
                        static = Java(self.file_path, decomp_jar).run()
                    # It's possible to fool libmagic into thinking our 2007+ file is a
                    # zip. So until we have static analysis for zip files, we can use
                    # oleid to fail us out silently, yielding no static analysis
                    # results for actual zip files.
                    elif "Zip archive data, at least v2.0" in thetype:
                        static = Office(self.file_path).run()
                    elif package == "wsf" or thetype == "XML document text" or self.task["target"].endswith(".wsf") or package == "hta":
                        static = WindowsScriptFile(self.file_path).run()
                    elif package == "js" or package == "vbs":
                        static = EncodedScriptFile(self.file_path).run()
                elif self.task["category"] == "url":
                    enabled_whois = self.options.get("whois", True)
                    if HAVE_WHOIS and enabled_whois:
                        static = URL(self.task["target"]).run()
        except RuntimeError:
            log.error("Error performing static analysis for task %d within %d s, type: %s, file: %s",
                      self.task["id"], TOSeconds, thetype, self.file_path)
        return static

==> jserror.log <==
SyntaxError: Unexpected number ( @ 2 : 9 ) -> 0 >> BDC 0.357 0.608 0.835 rg
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
'ascii' codec can't decode byte 0xf3 in position 2443: ordinal not in range(128)
'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 15 ) -> 0 >> BDC / GS5 gs
'ascii' codec can't decode byte 0xd6 in position 2: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 9 ) -> 0 >> BDC BT / F1 9.6 Tf
SyntaxError: Unexpected number ( @ 1 : 3 ) -> 52 0 56 52 55 113 57 175 59 227 61 280 62 333 60 402 58 465 64 534 65 585 66 6
'ascii' codec can't decode byte 0xe9 in position 15: ordinal not in range(128)
'ascii' codec can't decode byte 0xd4 in position 17: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 13 ) -> findresource begin
'ascii' codec can't decode byte 0xd4 in position 18: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 13 ) -> findresource begin
'ascii' codec can't decode byte 0xe9 in position 15: ordinal not in range(128)
'ascii' codec can't decode byte 0xaf in position 16: ordinal not in range(128)
'ascii' codec can't decode byte 0xa5 in position 16: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 13 ) -> findresource begin
'ascii' codec can't decode byte 0xd4 in position 17: ordinal not in range(128)
'ascii' codec can't decode byte 0xff in position 5: ordinal not in range(128)

==> errors.txt <==
Traceback (most recent call last):
  File "/opt/cuckoo-modified/utils/../lib/cuckoo/common/peepdf/JSAnalysis.py", line 105, in analyseJS
    escapedVars = re.findall('(\w*?)\s*?=\s*?(unescape\((.*?)\))', code, re.DOTALL)
  File "/usr/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/local/lib/python2.7/dist-packages/interruptingcow/__init__.py", line 24, in handler
    raise exception
RuntimeError

==> jserror.log <==
'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
'ascii' codec can't decode byte 0xf3 in position 1406: ordinal not in range(128)
'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Additionally, and strangely, this seems to happen most when I submit things over the API, which I didn't think should matter, but I see the behavior a lot when I submit that way.

@mallorybobalice
Author

mallorybobalice commented Aug 27, 2016

File "/usr/local/lib/python2.7/dist-packages/interruptingcow/__init__.py", line 24, in handler
    raise exception
RuntimeError

That sounds like interruptingcow is working? The expectation is that it raises a RuntimeError after hitting the timeout.

@muhviehstah
Contributor

I analysed the PDF posted by housemusic42 a while ago.

Processing ( 60.978 seconds )

The system is Manjaro/Arch with the very latest versions of the modules installed.
analyse.tar.gz

I've uploaded the whole storage folder; maybe it can be helpful for you guys. :)

@Nwinternights

Same problem here. I have two cuckoos (supposed to have the same config and updates), but on one the processing ends with no problem while on the other the process locks up in pdfparser.py.
50.zip
I even tried to use the most recent modules from https://github.com/DidierStevens/DidierStevensSuite, with no luck.

@mallorybobalice
Author

Which bits are from Didier Stevens - pdfid?

@Nwinternights

both: pdfparser and pdfid
