# Research
This notebook is used to research functionality details and methods of implementation for the `badsnakes` project. 

Contents:
- [Determining if a file is binary](#Determining-if-a-file-is-binary)
- [Using `ast` to inspect code structure for keywords](#Using-ast-to-inspect-code-structure-for-keywords)
- Using `dis` to inspect bytecode for keywords
- Detecting (very) long strings 

---
## Determining if a file is binary
This section contains the research involved (and associated links) for determining if a file is plain-text or binary.

Links:
- [How can I detect if a file is binary (non-text) in Python?](https://stackoverflow.com/a/7392391/6340496)
- https://dnmtechs.com/detecting-binary-files-in-python-3/

### Solution(s)
The first linked answer (and its associated links) appears to be the most robust cross-platform solution, and the *fastest*.

#### Timings
- `isbinary_file`: On Linux, using a subprocess to call `file -ib` (brief MIME type and encoding output) and parsing the result.
  - 2.49 ms ± 256 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- `isbinary_read`: Reading a chunk of the file and testing for remaining characters after printable characters are removed. 
  - 84.6 μs ± 2.27 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

#### Chosen method
The `isbinary_read` method was chosen as it's:
1) Cross-platform, meaning two different implementations are not needed.
2) ~29.4 times *faster* than the `isbinary_file` method.
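
As a rough illustration of why the read-based check works, the sketch below applies the same set difference to in-memory bytes rather than to a file. The helper name `looks_binary` and the sample byte strings are assumptions made for this example only; the byte table mirrors the one used by `isbinary_read`.

```python
# Byte values treated as text: a few control characters plus 0x20-0xff,
# excluding DEL (0x7f). This mirrors the table used by isbinary_read.
TEXTCHARS = {7, 8, 9, 10, 12, 13, 27} | (set(range(0x20, 0x100)) - {0x7f})


def looks_binary(chunk: bytes) -> bool:
    """Hypothetical helper: True if any byte falls outside TEXTCHARS."""
    # Iterating over bytes yields ints, so set(chunk) is a set of byte values.
    return bool(set(chunk) - TEXTCHARS)


# A script header contains only text bytes.
print(looks_binary(b'#!/usr/bin/env python3\nprint("hi")\n'))
# An ELF header starts with 0x7f and contains low control bytes.
print(looks_binary(b'\x7fELF\x02\x01\x01\x00'))
# An empty chunk has nothing left after the subtraction.
print(looks_binary(b''))
```

An empty chunk has no bytes outside the table, so an empty file is reported as text under this scheme.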

### Rejected:
Using the standard library's [`mimetypes`](https://docs.python.org/3/library/mimetypes.html) module is **insufficient**: according to the source code for [`guess_type`](https://github.com/python/cpython/blob/3.12/Lib/mimetypes.py#L103), the file is never opened and the type is "guessed" from the file extension (and subsequent mappings) only.
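
A small sketch of this limitation: `mimetypes.guess_type` returns a type for a path that does not exist, which suggests the content is never read. The path below is a made-up placeholder, chosen only for illustration.

```python
import mimetypes

# Hypothetical path: nothing exists at this location, yet a type is returned,
# because the guess is derived from the '.txt' extension alone.
mime, encoding = mimetypes.guess_type('/no/such/dir/payload.txt')
print(mime, encoding)
```

A binary payload renamed with a `.txt` extension would therefore be reported as text, which is why this approach was not pursued.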

In [70]:
import os
import subprocess as sp
from glob import glob

In [107]:
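def isbinary_file(file: str) -> bool:
    """Test for a binary file by calling the 'file' utility (Linux)."""
    # -i: output the MIME type and encoding; -b: omit the filename from the output.
    with sp.Popen(['file', '-ib', os.path.realpath(file)], stdout=sp.PIPE) as proc:
        stdout, _ = proc.communicate()
    # Output looks like 'text/plain; charset=us-ascii'; anything not 'text/...' is binary.
    return stdout.decode().split('/', maxsplit=1)[0] != 'text'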

In [100]:
# Byte values treated as text: common control characters plus 0x20-0xff, excluding DEL (0x7f).
textchars = {7, 8, 9, 10, 12, 13, 27} | (set(range(0x20, 0x100)) - {0x7f})

def isbinary_read(file: str, chunksize: int=1024) -> bool:
    """Read a file chunk and test for bytes which remain after text characters are removed."""
    if not os.path.isfile(file):
        return True  # Non-files are considered binary.
    with open(os.path.realpath(file), 'rb') as f:
        return bool(set(f.read(chunksize)) - textchars)

In [92]:
# Verify the output of the two methods agree. No output here is good!
for f in glob('/usr/local/bin/*'):
    tf = isbinary_file(f)
    tr = isbinary_read(f, chunksize=1024)
    if tf != tr:
        print(f'{f}: file={tf}, read={tr}, equal={tf==tr}')

In [94]:
%%timeit
isbinary_file('/usr/local/bin/python3.10')

2.49 ms ± 259 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [105]:
%%timeit
isbinary_read('/usr/local/bin/python3.10', 1024)

84.1 μs ± 2.75 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [106]:
print(f'The isbinary_read method is {round(2.49 / (84.6 / 1000), 1)} times faster.')

The isbinary_read method is 29.4 times faster.


---
## Using `ast` to inspect code structure for keywords
This section contains the research involved in using the builtin `ast` module to parse module code and search for dangerous or suspicious keyword usage. For example, `exec`, `eval`, `compile`, etc.

This approach was researched first, before the 'heavy-handed' approach of simply searching for strings in the code.
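
A minimal sketch of the idea, assuming a small inline source string rather than a real module: parse the code, walk the tree, and report calls whose callee is a plain name in a watch list. The names `SUSPECT`, `find_suspect_calls` and the sample source are hypothetical and exist only for this example.

```python
import ast

# Keywords taken from the examples named in this section.
SUSPECT = {'eval', 'exec', 'compile'}

# Hypothetical sample source, used only for illustration.
source = '''
import os
x = eval("1 + 1")
exec("print(x)")
print(len("abc"))
'''


def find_suspect_calls(code: str) -> list:
    """Return (name, line number) pairs for calls to names in SUSPECT."""
    found = []
    for node in ast.walk(ast.parse(code)):
        # Only direct calls by name are matched here, e.g. eval(...).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPECT:
                found.append((node.func.id, node.lineno))
    return sorted(found, key=lambda pair: pair[1])


hits = find_suspect_calls(source)
print(hits)
```

This sketch only matches direct calls by name; callees reached through attributes or built at runtime need additional handling.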

Links:
- ...

### Solution(s)
...

#### Timings
- ...
- ...
  
#### Chosen method
...

### Rejected:
...


In [2]:
import ast

In [49]:
#with open('./scripts/evil.py', 'r') as f:
with open('/var/devmt/py/silvar_0.1.1/silvar/rware/__init__.py', 'r') as f:
    code = f.read()

p = ast.parse(code)
#print(ast.dump(p, indent=4))

In [50]:
# List all function calls.
calls = [node for node in ast.walk(p) if isinstance(node, ast.Call)]
d = {ast.Name: 'id', ast.Attribute: 'attr'}

names = []
for c in calls:
    key = d.get(type(c.func))
    if key is None:
        continue  # The callee is neither a Name nor an Attribute (e.g. a subscript or call).
    name = getattr(c.func, key)
    print(key, type(name), name)
    names.append(name)

# Display potentially dangerous function calls.
dangerous = {'eval', 'exec', 'compile', '_', '__'}
print(f"\nDangerous calls: {dangerous & set(names)}\n")

id <class 'str'> _0xb0
attr <class 'str'> __getattribute__
id <class 'str'> _0xb0
id <class 'str'> _0xb0
id <class 'str'> _
id <class 'str'> __import__
id <class 'str'> __import__
attr <class 'str'> decode
attr <class 'str'> join
id <class 'str'> _0x1
id <class 'str'> __

Dangerous calls: {'__', '_'}



In [286]:
print(ast.dump(p, indent=4))

Module(
    body=[
        Expr(
            value=Constant(value="\nA module docstring.\n\nA long of text here to make for a long string which might be detected, as\nwe'll need to guard against this.\n\nA long of text here to make for a long string which might be detected, as\nwe'll need to guard against this.\n\nA long of text here to make for a long string which might be detected, as\nwe'll need to guard against this.\n\nA long of text here to make for a long string which might be detected, as\nwe'll need to guard against this.\n\n")),
        Assign(
            targets=[
                Name(id='e', ctx=Store())],
            value=Call(
                func=Attribute(
                    value=Name(id='__builtins__', ctx=Load()),
                    attr='__getattribute__',
                    ctx=Load()),
                args=[
                    Subscript(
                        value=Constant(value='lave'),
                        slice=Slice(
                            ste

In [1]:
# Test for long strings as these may be b64 encodings.
const = [n.value for n in ast.walk(p) if isinstance(n, ast.Constant) and isinstance(n.value, str)]

for c in const:
    if len(c) > 100:
        print(f'Long string: {c}')

NameError: name 'ast' is not defined
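
The long-string check above flags docstrings as well as encoded payloads, so one possible refinement is to test whether a long string also decodes as strict base64. The sketch below is an assumption about how that could look, not the project's implementation; `long_strings`, `is_base64` and the generated sample source are hypothetical, and the 100-character threshold is the one used in the cell above.

```python
import ast
import base64
import binascii


def long_strings(code: str, threshold: int = 100):
    """Yield (line number, value) for string constants longer than threshold."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if len(node.value) > threshold:
                yield node.lineno, node.value


def is_base64(text: str) -> bool:
    """True if text decodes as strict base64 (no characters outside the alphabet)."""
    try:
        base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


# Hypothetical sample: a long docstring and a long base64 payload.
docstring = 'A long line of text. ' * 10
payload = base64.b64encode(b'print("hello")' * 10).decode()
source = f'"""{docstring}"""\nx = "{payload}"\n'

results = list(long_strings(source))
for lineno, value in results:
    print(lineno, len(value), is_base64(value))
```

Ordinary prose contains spaces and punctuation outside the base64 alphabet, so the strict decode rejects it, while the encoded payload passes.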

In [7]:
assigns = list(filter(lambda x: isinstance(x, ast.Assign), ast.walk(p)))

In [8]:
assigns

[<ast.Assign at 0x7fe44e62d790>,
 <ast.Assign at 0x7fe44e62ce10>,
 <ast.Assign at 0x7fe44e635510>,
 <ast.Assign at 0x7fe44e635250>,
 <ast.Assign at 0x7fe44e634cd0>]

In [6]:
assigns[-1].value.value

NameError: name 'assigns' is not defined

In [9]:
for a in assigns:
    print(type(a.value))
    #if isinstance(a.value, ast.Constant):
    #    print(a.value.value)

<class 'ast.Call'>
<class 'ast.Call'>
<class 'ast.Constant'>
<class 'ast.Constant'>
<class 'ast.Constant'>


In [29]:
a.targets[0].id

'z'

In [51]:
calls

[<ast.Call at 0x7fe44d8ffb90>,
 <ast.Call at 0x7fe44d8fdc90>,
 <ast.Call at 0x7fe44d8ff950>,
 <ast.Call at 0x7fe44d8fd1d0>,
 <ast.Call at 0x7fe44d923dd0>,
 <ast.Call at 0x7fe44d8fe790>,
 <ast.Call at 0x7fe44d8fd7d0>,
 <ast.Call at 0x7fe44d920710>,
 <ast.Call at 0x7fe44d923cd0>,
 <ast.Call at 0x7fe44d9201d0>,
 <ast.Call at 0x7fe44d9217d0>]

In [59]:
for c in calls:
    print(c.args)

[<ast.Subscript object at 0x7fe44d8fd350>]
[<ast.Subscript object at 0x7fe44d8fc590>]
[<ast.Subscript object at 0x7fe44d8fce90>]
[<ast.Subscript object at 0x7fe44d8ffe10>]
[<ast.BinOp object at 0x7fe44d921150>]
[<ast.Subscript object at 0x7fe44d8fed50>]
[<ast.Subscript object at 0x7fe44d8fcf10>]
[]
[<ast.Call object at 0x7fe44d9201d0>]
[<ast.Name object at 0x7fe44d920150>, <ast.Tuple object at 0x7fe44d923e50>]
[<ast.Constant object at 0x7fe44d921c50>]


In [54]:
c.args[0].value

'XzB4ZD1fX2ltcG9ydF9fO2kwMT1fMHhkKCdzc2FwdGVnJ1s6Oi0xXSk7aTAyPV8weGQoJ2JvbGcnWzo6LTFdKS5fX2dldGF0dHJpYnV0ZV9fKCdib2xnJ1s6Oi0xXSk7aTAzPV8weGQoJ2h0YXAuc28nWzo6LTFdLCBmcm9tbGlzdD1bTm9uZV0pO2kwND1fMHhkKCd0ZWtjb3MnWzo6LTFdKTtfMHg0OT1fMHhkKCdzbml0bGl1YidbOjotMV0pLl9fZ2V0YXR0cmlidXRlX187YjAxPV8weDQ5KCduZWwnWzo6LTFdKTtiMDI9XzB4NDkoJ2V0YXJlbXVuZSdbOjotMV0pO2IwMz1fMHg0OSgncmhjJ1s6Oi0xXSk7YjA0PV8weDQ5KCdkcm8nWzo6LTFdKTtiMDU9XzB4NDkoJ25lcG8nWzo6LTFdKTtiMDY9Z2V0YXR0cihfMHg0OSgncnRzJ1s6Oi0xXSksICduaW9qJ1s6Oi0xXSkKZGVmIF8weDcyNzc2MTcyNjUoKToKICAgIEs9Zid7aTAxLmdldHVzZXIoKX17aTA0LmdldGhvc3RuYW1lKCl9Jy5sb3dlcigpWzo6LTFdO0w9YjAxKEspCiAgICB2MDE9aTAzLmV4cGFuZHVzZXIoJ34vRGVza3RvcCcpO3YwMj1pMDIoaTAzLmpvaW4odjAxLCdyd2FyZWRlbW8nLCcqJykpCiAgICBmb3IgdjAzIGluIHYwMjoKICAgICAgICB3aXRoIGIwNSh2MDMsJ3JiKycpIGFzIHYwNDoKICAgICAgICAgICAgXz1iMDYoJycsKGIwMyhfX15iMDQoS1tfJUxdKSkgZm9yIF8sX18gaW4gYjAyKHYwNC5yZWFkKCkpKSkKICAgICAgICAgICAgdjA0LnNlZWsoMCk7djA0LndyaXRlKF8uZW5jb2RlKCkpCiAgICB3aXRoIGIwNShpMDMuam9pbih2MDEsJ1JFQURNRS5

In [63]:
for c in calls:
    for arg in c.args:
        #print(c, arg)
        if isinstance(arg, ast.Constant):
            print(arg.value)

XzB4ZD1fX2ltcG9ydF9fO2kwMT1fMHhkKCdzc2FwdGVnJ1s6Oi0xXSk7aTAyPV8weGQoJ2JvbGcnWzo6LTFdKS5fX2dldGF0dHJpYnV0ZV9fKCdib2xnJ1s6Oi0xXSk7aTAzPV8weGQoJ2h0YXAuc28nWzo6LTFdLCBmcm9tbGlzdD1bTm9uZV0pO2kwND1fMHhkKCd0ZWtjb3MnWzo6LTFdKTtfMHg0OT1fMHhkKCdzbml0bGl1YidbOjotMV0pLl9fZ2V0YXR0cmlidXRlX187YjAxPV8weDQ5KCduZWwnWzo6LTFdKTtiMDI9XzB4NDkoJ2V0YXJlbXVuZSdbOjotMV0pO2IwMz1fMHg0OSgncmhjJ1s6Oi0xXSk7YjA0PV8weDQ5KCdkcm8nWzo6LTFdKTtiMDU9XzB4NDkoJ25lcG8nWzo6LTFdKTtiMDY9Z2V0YXR0cihfMHg0OSgncnRzJ1s6Oi0xXSksICduaW9qJ1s6Oi0xXSkKZGVmIF8weDcyNzc2MTcyNjUoKToKICAgIEs9Zid7aTAxLmdldHVzZXIoKX17aTA0LmdldGhvc3RuYW1lKCl9Jy5sb3dlcigpWzo6LTFdO0w9YjAxKEspCiAgICB2MDE9aTAzLmV4cGFuZHVzZXIoJ34vRGVza3RvcCcpO3YwMj1pMDIoaTAzLmpvaW4odjAxLCdyd2FyZWRlbW8nLCcqJykpCiAgICBmb3IgdjAzIGluIHYwMjoKICAgICAgICB3aXRoIGIwNSh2MDMsJ3JiKycpIGFzIHYwNDoKICAgICAgICAgICAgXz1iMDYoJycsKGIwMyhfX15iMDQoS1tfJUxdKSkgZm9yIF8sX18gaW4gYjAyKHYwNC5yZWFkKCkpKSkKICAgICAgICAgICAgdjA0LnNlZWsoMCk7djA0LndyaXRlKF8uZW5jb2RlKCkpCiAgICB3aXRoIGIwNShpMDMuam9pbih2MDEsJ1JFQURNRS50

In [66]:
arg.lineno

1

In [67]:
names = []
for c in calls:
    if isinstance(c.func, ast.Name):
        names.append(c.func.id)
    elif isinstance(c.func, ast.Attribute):
        names.append(c.func.attr)
        # names.append(c.func.value.id)
        names.append(c.func.id)  # Raises: ast.Attribute has no 'id'; the object is in c.func.value.


AttributeError: 'Attribute' object has no attribute 'id'

In [70]:
c.func.value.func.id

'__import__'
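
The last two cells suggest that a callee can be a chain: an attribute whose value is itself a call, as in `__import__(...).something()`. A rough sketch of one way to collect every name along such a chain is shown below; the helper `callee_names` and the sample expression are assumptions for illustration only.

```python
import ast


def callee_names(call: ast.Call) -> list:
    """Collect identifiers along the callee expression, outermost first."""
    names = []
    node = call.func
    while True:
        if isinstance(node, ast.Name):
            names.append(node.id)    # Reached the root name of the chain.
            break
        if isinstance(node, ast.Attribute):
            names.append(node.attr)  # Record the attribute, then step inwards.
            node = node.value
        elif isinstance(node, ast.Call):
            node = node.func         # Step through a nested call to its callee.
        else:
            break                    # Subscripts, lambdas, etc. are not resolved.
    return names


# Hypothetical sample expression, parsed only and never executed.
tree = ast.parse("__import__('so'[::-1]).getcwd()")
outer = tree.body[0].value
print(callee_names(outer))
```

Walking the chain this way avoids the `AttributeError` above, since `id` is only read from `ast.Name` nodes.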