
process.py locking up on certain pdfs #54

Open
mallorybobalice opened this issue Feb 22, 2016 · 26 comments

@mallorybobalice

hi,

We seem to have a 'lucky stream' of corrupted PCAP extracts whose PDFs cause processing to hang indefinitely (normally executables and other PDFs submit, process, and report OK).
While I'll try to post a PCAP, that might not be possible (can Brad message me privately about that?).

Anyhow, we're running processing tasks not within the main cuckoo process, but via the utils/process.py helper.

e.g. /utils/process.py -d -p 7 auto

Eventually enough of these PDFs queue up that processing halts completely, because every process.py worker ends up hung:
2016-02-18 23:03:57,947 [modules.processing.static] DEBUG: Starting to load PDF
2016-02-18 23:04:00,365 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:03,622 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:06,989 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:07,711 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:18,115 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:18,923 [modules.processing.static] DEBUG: About to parse with PDFParser
2016-02-18 23:04:19,794 [modules.processing.static] DEBUG: About to parse with PDFParser
^ and then nothing happens after that. Without adding extra debug statements in static.py after the PDFParser is instantiated, I can't tell whether it freezes there or during result processing; I suspect it's in PDFParser / peepdf's PDFCore.

Occasionally processing seems to restart (probably when I kick it over), and then the same set of tasks freezes again.

As far as I can tell, static analysis happens via peepdf (the version in their repo is a bit newer, by the way; using that still makes processing freeze). While on the subject: did they merge the Google Summer of Code PDF malscore changes, and did we pull them into csb or brad-csb?

So anyhow, I was hoping someone who knows a bit more Python (and about safely killing processing tasks and marking them as failed) could help with a quick workaround, for example a watchdog timer in process.py controlled by an extra config setting: if processing takes over, say, 600 seconds, the task gets marked as failed and the helper process for that task ID terminates itself, preferably cleanly. A rough sketch of what I mean is below.
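Something along these lines, maybe (untested sketch only; run_with_watchdog and the failure-marking step are illustrative, not actual process.py internals):

    import logging
    import multiprocessing

    log = logging.getLogger(__name__)

    def run_with_watchdog(target, task_id, timeout=600):
        """Run target(task_id) in a child process and kill it if it exceeds the timeout."""
        worker = multiprocessing.Process(target=target, args=(task_id,))
        worker.start()
        worker.join(timeout)
        if worker.is_alive():
            # Processing for this task hung (e.g. on a bad PDF): kill the worker
            # and let the caller mark the task as failed in the database.
            worker.terminate()
            worker.join()
            log.error("Task #%d exceeded %d s of processing, worker killed", task_id, timeout)
            return False
        return True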

Thoughts/questions? Please help.
Mb.

@mallorybobalice
Author

PS: I was eyeing the CSB2 process.py, but it was different enough that I gave up on that idea (just dropping it in over the spender-sandbox one doesn't work).

@Fryyyyy

Fryyyyy commented Feb 22, 2016

We see this too, extremely frequently.
Enough that I wrapped it in an interruptingcow (https://pypi.python.org/pypi/interruptingcow) timeout/try-except so the rest of the processing module can proceed.
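The basic shape of it is roughly this (sketch only; parse_pdf and path stand in for the real peepdf-based call inside static.py):

    from interruptingcow import timeout

    try:
        # Give the static PDF stage a bounded amount of time, then move on.
        with timeout(60, exception=RuntimeError):
            results = parse_pdf(path)  # placeholder for the actual parse call
    except RuntimeError:
        results = {}  # PDF static analysis is skipped; the rest of processing continues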

@mallorybobalice
Author

Hmm, the workaround tip is much appreciated.
It looks like this way I don't necessarily have to fail the task.

Do you wrap the self._parse call here, or further out, and somehow indicate that part or all of the static analysis failed? (Would you mind sharing?)

log.debug("Starting to load PDF")
results = self._parse(self.file_path)
return results

I was thinking of wrapping just the results bit, but that might be too deep, and the layers above might not like partially populated results.

À la:

from interruptingcow import timeout

log.debug("Starting to load PDF")
results = {}
try:
    with timeout(40, exception=RuntimeError):
        results = self._parse(self.file_path)
except RuntimeError:
    log.error("PDF analysis timed out (task id / name would go here)")
return results

While on the subject of PDFs, @Fryyyyy: do you also often get corrupted PDFs that lock up Reader when opened in the guest VM and produce large .bson logs that consume a lot of memory during processing on the CSB host? Any thoughts on that? (I just increased RAM in line with the number of parallel tasks and reduced the max log size, but that doesn't make me feel warm and fuzzy.)

@Fryyyyy

Fryyyyy commented Feb 23, 2016

Unfortunately I'm not permitted to share code, though I am working on getting permission.
I haven't noticed lockups or large logs, but then again we have a lot of hosts with a lot of guests doing a lot of analyses so it's not easy to keep an eye on individual submissions.

@mallorybobalice
Author

I see. No worries and thanks again

@spender-sandbox
Owner

If someone gives me a hash for one of these PDFs, I can fix the actual problem instead of working around it.

-Brad

@mallorybobalice
Author

Thanks Brad. I think Fryyyyy and I assumed (or Fry may have checked) that it's upstream in peepdf. (I suppose checking is one break or print statement away.) Hence looking for workarounds.

I'll have a look tomorrow, and if it's already shared on malwr or HA I'll supply the hash. If it's a public sample unrelated to my employer, I'm happy to share it. I can probably try to get permission to share a work-related one, but it would have to go out of band with an archive password. Is the latter an option?
I'd also like to try to share the huge Reader-freezing .bson PDF ones; maybe those operations can be coalesced or... I don't know, I haven't really looked closely at it.

@mallorybobalice
Author

PS: the PDF-specific issue aside, are you sure there shouldn't be a timeout per processing module (or per signature and reporting step), or is that not worth it because it normally works? I'd rather find out from a syslog alert saying a task's processing failed than find all tasks hung next time. From an ops monitoring perspective, bounded run times feel all warm and fuzzy inside.

@Fryyyyy

Fryyyyy commented Feb 23, 2016

I've modified static.py to dump the PDFs out to a temp directory when it triggers the timeout. Once we've got a few, I'll see if I can get some cleared for release or at least do some analysis on them to see if there's anything similar about them.
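Roughly the shape of it (not our actual code, which I can't share; the directory and helper name here are just illustrative):

    import os
    import shutil

    DUMP_DIR = "/tmp/hung_pdfs"  # illustrative location

    def dump_hung_sample(file_path, task_id):
        """Copy a sample that hit the static-analysis timeout somewhere for later triage."""
        if not os.path.isdir(DUMP_DIR):
            os.makedirs(DUMP_DIR)
        dest = os.path.join(DUMP_DIR, "task_%s_%s" % (task_id, os.path.basename(file_path)))
        shutil.copy2(file_path, dest)
        return dest

Called from the except branch that handles the timeout.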

@Fryyyyy

Fryyyyy commented Feb 23, 2016

3eaef2ca2c9d29e936919c7c6f8e5614aef6edf8cec6c92008291bafea0388d0 took more than a minute to statically analyse, which hits our timeout.

Edit for additional samples:

1e3db20bb77178cabe8e32a47510a027bb38bc585ed02a95052e3965ac9a9b26
2c2c956d74dcc245655a6c56aa052212ac1a933e22e5a41d63afc1aa9d2eccf5
3eaef2ca2c9d29e936919c7c6f8e5614aef6edf8cec6c92008291bafea0388d0
63e0063d43ae9578c328b4683b53c868497dd41c8c112f3365308907ad444a84

@housemusic42

I seem to stumble upon one of these every once in a while. Here's the latest one, where I turned up as much debugging as I could. The jserror log does not have any timestamps, though. If you need the PDF, just let me know.

SHA256 2ab11d83ae2cbd12f0f6c30aacad8a8e16df5255646d08e923054b9f521c4b83
debug_log_pdf.txt

@mallorybobalice
Author

PS: if anyone's interested in the specifics of what Fryyyyy was suggesting, what we've done for now as a workaround is wrap the whole static analysis case statement in interruptingcow (appreciate the tip, Fry). We could optionally copy the offending files off as well, but for now we just log the task ID info in case it's needed later.

in case anyone wants it:

pip install interruptingcow

vi modules/processing/static.py

add new import:
from interruptingcow import timeout

replace the Static(Processing) class with a wrapped version:

class Static(Processing):
    """Static analysis."""

    def run(self):
        """Run analysis.
        @return: results dict.
        """
        self.key = "static"
        static = {}
        thetype = ""
        TOSeconds = 60  # you can replace this with a config value
        try:
            with timeout(TOSeconds, exception=RuntimeError):
                if self.task["category"] == "file":
                    thetype = File(self.file_path).get_type()
                    if HAVE_PEFILE and ("PE32" in thetype or "MS-DOS executable" in thetype):
                        static = PortableExecutable(self.file_path, self.results).run()
                        if static and "Mono" in thetype:
                            static.update(DotNETExecutable(self.file_path, self.results).run())
                    elif "PDF" in thetype or self.task["target"].endswith(".pdf"):
                        static = PDF(self.file_path).run()
                    elif "Word 2007" in thetype or "Excel 2007" in thetype or "PowerPoint 2007" in thetype or "MIME entity" in thetype:
                        static = Office(self.file_path).run()
                    elif "Composite Document File" in thetype:
                        static = Office(self.file_path).run()
                    elif self.task["target"].endswith((".doc", ".docx", ".rtf", ".xls", ".mht", ".mso", ".xlsx", ".ppt", ".pptx", ".pps", ".ppsx", ".pptm", ".potm", ".potx", ".ppsm")):
                        static = Office(self.file_path).run()
                    elif "Java Jar" in thetype or self.task["target"].endswith(".jar"):
                        decomp_jar = self.options.get("procyon_path", None)
                        if decomp_jar and not os.path.exists(decomp_jar):
                            log.error("procyon_path specified in processing.conf but the file does not exist.")
                        static = Java(self.file_path, decomp_jar).run()
                    # It's possible to fool libmagic into thinking our 2007+ file is a
                    # zip. So until we have static analysis for zip files, we can use
                    # oleid to fail us out silently, yielding no static analysis
                    # results for actual zip files.
                    elif "Zip archive data, at least v2.0" in thetype:
                        static = Office(self.file_path).run()
                elif self.task["category"] == "url":
                    enabled_whois = self.options.get("whois", True)
                    if HAVE_WHOIS and enabled_whois:
                        static = URL(self.task["target"]).run()
        except RuntimeError:
            log.error("Error performing static analysis for task %d within %d s, type: %s, file: %s",
                      self.task["id"], TOSeconds, thetype, self.file_path)
        return static
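If you'd rather not hard-code the 60 seconds, you could presumably read it from the static section of processing.conf instead (untested; assumes you add a static_timeout option there):

    # Inside Static.run(), instead of the hard-coded value:
    TOSeconds = int(self.options.get("static_timeout", 60))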



@housemusic42

Here's a PDF that I can consistently get PDFParser to stall out on. Let me know if you need more information.

ed2732bce8351b839e924a0cf5512ce90fcdfd4274796824bab40eb2d1850ff0.pdf

@housemusic42

Garrr... I'm still having problems with this. @mallorybobalice, I tried your wrapper but it doesn't seem to send an interrupt? Any help with this would be great. I'm thinking about just restarting the processing script every hour, or slapping something together to regenerate unprocessed tasks.

@mallorybobalice
Author

Hmm, processing log please (or the cuckoo log, if processing is not done via the utils helper). Mine interrupts OK. Are you sure you installed interruptingcow and added the import, in addition to the wrapper?
Try setting it to 20 s. It would be good if Brad or someone merged it in; it's getting a bit old merging it myself on every git pull. Has anyone bothered to report this upstream to peepdf? I haven't :(

The thing is, restarting processing won't help if it tries the same file again. Interrupting partially fails static analysis but lets the rest continue.

@mallorybobalice
Author

PS: and please attach a copy of your modified modules/processing/static.py.

@mallorybobalice
Author

And a sample freezing PDF to try, if different from the ones above.

@mallorybobalice
Author

PS: but really we need to open an issue here: https://github.com/jesparza/peepdf
Otherwise our CSB servers needlessly chew CPU on these PDFs for the whole timeout period, and for some of us that happens quite often.

@mallorybobalice
Author

mallorybobalice commented Aug 26, 2016

jesparza/peepdf#59

That said, personally I'd rather keep the interrupt code so that static analysis stays time-bounded.

@housemusic42

housemusic42 commented Aug 26, 2016

class Static(Processing):
    """Static analysis."""

    def run(self):
        """Run analysis.
        @return: results dict.
        """
        self.key = "static"
        static = {}
        thetype = ""
        TOSeconds = 60  # you can replace this with a config value
        try:
            with timeout(TOSeconds, exception=RuntimeError):
                if self.task["category"] == "file":
                    package = ""
                    if "info" in self.results and "package" in self.results["info"]:
                        package = self.results["info"]["package"]

                    thetype = File(self.file_path).get_type()
                    if HAVE_PEFILE and ("PE32" in thetype or "MS-DOS executable" in thetype):
                        static = PortableExecutable(self.file_path, self.results).run()
                        if static and "Mono" in thetype:
                            static.update(DotNETExecutable(self.file_path, self.results).run())
                    elif "PDF" in thetype or self.task["target"].endswith(".pdf"):
                        static = PDF(self.file_path).run()
                    elif package in ("doc", "ppt", "xls"):
                        static = Office(self.file_path).run()
                    elif "Java Jar" in thetype or self.task["target"].endswith(".jar"):
                        decomp_jar = self.options.get("procyon_path", None)
                        if decomp_jar and not os.path.exists(decomp_jar):
                            log.error("procyon_path specified in processing.conf but the file does not exist.")
                        static = Java(self.file_path, decomp_jar).run()
                    # It's possible to fool libmagic into thinking our 2007+ file is a
                    # zip. So until we have static analysis for zip files, we can use
                    # oleid to fail us out silently, yielding no static analysis
                    # results for actual zip files.
                    elif "Zip archive data, at least v2.0" in thetype:
                        static = Office(self.file_path).run()
                    elif package == "wsf" or thetype == "XML document text" or self.task["target"].endswith(".wsf") or package == "hta":
                        static = WindowsScriptFile(self.file_path).run()
                    elif package == "js" or package == "vbs":
                        static = EncodedScriptFile(self.file_path).run()
                elif self.task["category"] == "url":
                    enabled_whois = self.options.get("whois", True)
                    if HAVE_WHOIS and enabled_whois:
                        static = URL(self.task["target"]).run()
        except RuntimeError:
            log.error("Error performing static analysis for task %d within %d s, type: %s, file: %s",
                      self.task["id"], TOSeconds, thetype, self.file_path)
        return static

==> jserror.log <==
SyntaxError: Unexpected number ( @ 2 : 9 ) -> 0 >> BDC 0.357 0.608 0.835 rg
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
SyntaxError: Unexpected token ILLEGAL ( @ 1 : 0 ) ->
'ascii' codec can't decode byte 0xf3 in position 2443: ordinal not in range(128)
'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 15 ) -> 0 >> BDC / GS5 gs
'ascii' codec can't decode byte 0xd6 in position 2: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 9 ) -> 0 >> BDC BT / F1 9.6 Tf
SyntaxError: Unexpected number ( @ 1 : 3 ) -> 52 0 56 52 55 113 57 175 59 227 61 280 62 333 60 402 58 465 64 534 65 585 66 6
'ascii' codec can't decode byte 0xe9 in position 15: ordinal not in range(128)
'ascii' codec can't decode byte 0xd4 in position 17: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 13 ) -> findresource begin
'ascii' codec can't decode byte 0xd4 in position 18: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 13 ) -> findresource begin
'ascii' codec can't decode byte 0xe9 in position 15: ordinal not in range(128)
'ascii' codec can't decode byte 0xaf in position 16: ordinal not in range(128)
'ascii' codec can't decode byte 0xa5 in position 16: ordinal not in range(128)
SyntaxError: Unexpected identifier ( @ 2 : 13 ) -> findresource begin
'ascii' codec can't decode byte 0xd4 in position 17: ordinal not in range(128)
'ascii' codec can't decode byte 0xff in position 5: ordinal not in range(128)

==> errors.txt <==
Traceback (most recent call last):
  File "/opt/cuckoo-modified/utils/../lib/cuckoo/common/peepdf/JSAnalysis.py", line 105, in analyseJS
    escapedVars = re.findall('(\w*?)\s*?=\s*?(unescape\((.*?)\))', code, re.DOTALL)
  File "/usr/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/local/lib/python2.7/dist-packages/interruptingcow/__init__.py", line 24, in handler
    raise exception
RuntimeError

==> jserror.log <==
'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
'ascii' codec can't decode byte 0xf3 in position 1406: ordinal not in range(128)
'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Additionally, and strangely, this seems to happen most when I submit things over the API, which I didn't think should matter, but I see the behavior a lot when I submit that way.

@mallorybobalice
Author

mallorybobalice commented Aug 27, 2016

File "/usr/local/lib/python2.7/dist-packages/interruptingcow/__init__.py", line 24, in handler
    raise exception
RuntimeError

That sounds like interruptingcow is working? The expectation is that it raises a RuntimeError after hitting the timeout.

@muhviehstah
Contributor

I analysed the PDF posted by housemusic42 a while ago.

Processing ( 60.978 seconds )

The system is Manjaro/Arch with the very latest versions of the modules installed.
analyse.tar.gz

I've uploaded the whole storage folder; maybe it can be helpful for you guys. :)

@Nwinternights

Same problem here. I have two cuckoos (supposed to have the same config and updates), but on one the processing ends with no problem while on the other the process locks up in pdfparser.py.
50.zip
I even tried to use the most recent modules from https://github.com/DidierStevens/DidierStevensSuite, with no luck.

@mallorybobalice
Author

Which bits are from Didier Stevens - pdfid?

@Nwinternights

both: pdfparser and pdfid
