# Seclists reply parse

__Example: http://seclists.org/fulldisclosure/2017/Jan/0__

With each reply, we'll attempt to parse out the following:
* raw reply text, without html tags
  * the reply text with any signatures stripped out
* an analysis of what html tags are in the message
* a listing of which domains are referenced in links in the message

In [1]:
import re
import requests

from bs4 import BeautifulSoup

We'll gather the contents of a single message. 2017_Jan_0 includes a personal signature as well as the standard Full Disclosure footer. The run shown below fetched 2005/Jan/0.

2017_Jan_45 is a message that includes a PGP signature.

In [30]:
from IPython.display import Pretty

year = '2005'
month = 'Jan'
msg_id = '0'
url = 'http://seclists.org/fulldisclosure/' + year + '/' + month + '/' + msg_id

# fetch the archived message page
r = requests.get(url)
content = r.text
Pretty(content)

<!-- MHonArc v2.6.19 -->
<!--X-Head-End-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                      "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://seclists.org/rss/fulldisclosure.rss">
<title>Full Disclosure: Re: /bin/rm file access vulnerability</title>
<meta property="og:image" content="http://seclists.org/images/fulldisclosure-img.png" />
<link rel="image_src" href="http://seclists.org/images/fulldisclosure-img.png" />
<meta name="Subject" content="Re: /bin/rm file access vulnerability"/>
<meta name="Author" content="bkfsec"/>
<link REL="SHORTCUT ICON" HREF="/shared/images/tiny-eyeicon.png" TYPE="image/png">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
<meta name="theme-color" content="#2A0D45">
<link rel="stylesheet" href="/shared/css/insecdb.css" type="text/css">
<!--Google Analytics Code-->
<script type="text/javascript">
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject

Each message in the FD list is wrapped in seclists.org code, including navigation, ads, and trackers, all irrelevant to us. The body of the reply is contained between two comments, `<!--X-Body-of-Message-->` and `<!--X-Body-of-Message-End-->`.

BeautifulSoup isn't great at handling comments, so we first use simple string indexing to extract the characters between the two markers. We then pass that fragment through BeautifulSoup so we can use its __.text__ property to strip out the html tags. BS4 automatically adds tags to create valid html, so remember to read the text from the generated `<body>` tag.

What we end up with is a plaintext version of the message's body. 
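
As an offline illustration of the same slicing idea, here is a minimal sketch that runs without a network request. The page in `sample_page` is made up, and the standard library's `HTMLParser` stands in for BeautifulSoup's `.text` so that the example needs no third-party packages; the names `extract_body`, `to_text`, and `TextOnly` are hypothetical helpers, not part of the notebook.

```python
from html.parser import HTMLParser

START = '<!--X-Body-of-Message-->'
END = '<!--X-Body-of-Message-End-->'


class TextOnly(HTMLParser):
    """Collect text nodes and ignore tags (stdlib stand-in for BS4's .text)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_body(page):
    # slice out everything between the two marker comments
    start = page.index(START) + len(START)
    end = page.index(END, start)
    return page[start:end]


def to_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.parts)


# made-up page, shaped roughly like a seclists.org message
sample_page = (
    '<html><body><div>navigation</div>'
    '<!--X-Body-of-Message-->'
    '<pre>Hello list,\nsee <a href="http://example.com/">this</a>.</pre>'
    '<!--X-Body-of-Message-End-->'
    '<div>footer</div></body></html>'
)

sample_body = extract_body(sample_page)
sample_text = to_text(sample_body)
print(sample_text)
```

Using `len(START)` instead of a literal offset keeps the slice correct if the marker text ever changes.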

In [45]:
start_marker = '<!--X-Body-of-Message-->'
end_marker = '<!--X-Body-of-Message-End-->'
start = content.index(start_marker) + len(start_marker)
end = content.index(end_marker)
body = content[start:end]

soup = BeautifulSoup(body, 'html5lib')
bodyhtml = soup.find('body')
raw = bodyhtml.text
Pretty(raw)

Yeah, I think that someone mistook the new year for April 1st.

Seriously, we seem to be getting more crap like this.  Are people just 
bored? 

            -Barry



Jörg Eschke wrote:

Sure, a user with admin rights is able to access/delete every local
file, regardless of the specific filepermissions.
Your 'exploit' will work with e.g. /bin/cat as well.
But i can't see a vulnerability anyway.

Am i missunderstanding something ?

Am Do, den 30.12.2004 schrieb Lennart Hansen um 2:18:
 

/bin/rm file access vulnerability

Affected Products:
        /bin/rm (all versions, tested on FreeBSD and linux)
        (http://www.freebsd.org    http://www.kernel.org)

Author:
        Xenzeo (Ablazed, Ultralaser, Lennart A. Hansen)
        xenzeo at blackhat dot dk


/bin/rm is a program that removes the named file arguments on unix systems.
When /bin/rm is called it checks the file's permissions and the id of the user
trying to remove the file. If the user does not have the required permissions
to

## Signature extraction

Messages to the FD list usually end with a common footer:

2002-2005:

`_______________________________________________
Full-Disclosure - We believe in it.
Charter: http://lists.netsys.com/full-disclosure-charter.html`

2005-2014:

`_______________________________________________
Full-Disclosure - We believe in it.
Charter: http://lists.grok.org.uk/full-disclosure-charter.html
Hosted and sponsored by Secunia - http://secunia.com/`

2014-onward:

`_______________________________________________
Sent through the Full Disclosure mailing list
http://nmap.org/mailman/listinfo/fulldisclosure
Web Archives & RSS: http://seclists.org/fulldisclosure/`

We'll look for the first line (47 underscores), then test the lines below to make sure it's a match. If so, we'll strip out that footer from our content.
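
The matching rule can also be expressed line by line rather than with character offsets. The following is a rough sketch of that alternative, not the notebook's method: it assumes the footer lines arrive unwrapped and in the order listed above, and `FOOTERS`, `footer_length`, and `strip_footers` are hypothetical names. Both `http://` and `https://` forms of the nmap.org line are accepted.

```python
import re

# a footer starts with a line of exactly 47 underscores
RULE = re.compile(r'_{47}\s*')

# line prefixes expected after the rule; the longer 2005-2014 footer is tried
# before the 2002-2005 one because it begins with the same two lines
FOOTERS = [
    ('Full-Disclosure', 'Charter:', 'Hosted and sponsored'),
    ('Sent through the', ('http://nmap.org/', 'https://nmap.org/'), 'Web Archives &'),
    ('Full-Disclosure', 'Charter:'),
]


def footer_length(lines, i):
    """Return how many lines of footer start at lines[i], or 0 if none."""
    if not RULE.fullmatch(lines[i]):
        return 0
    for prefixes in FOOTERS:
        following = lines[i + 1:i + 1 + len(prefixes)]
        if len(following) == len(prefixes) and all(
                line.startswith(p) for line, p in zip(following, prefixes)):
            return 1 + len(prefixes)
    return 0


def strip_footers(text):
    lines = text.split('\n')
    kept = []
    i = 0
    while i < len(lines):
        skip = footer_length(lines, i)
        if skip:
            i += skip          # drop the whole footer
        else:
            kept.append(lines[i])
            i += 1
    return '\n'.join(kept)


# made-up message ending in the 2005-2014 footer
sample = (
    'Thanks for the report.\n'
    '\n'
    '-- \n'
    'Alice\n'
    + '_' * 47 + '\n'
    'Full-Disclosure - We believe in it.\n'
    'Charter: http://lists.grok.org.uk/full-disclosure-charter.html\n'
    'Hosted and sponsored by Secunia - http://secunia.com/\n'
)
stripped = strip_footers(sample)
print(stripped)
```

Working on whole lines avoids hard-coded footer lengths, at the cost of missing footers that a mail client has re-wrapped.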

In [60]:
workcopy = raw
# every run of 47 underscores is a candidate footer start
footers = [m.start() for m in re.finditer('_{47}', workcopy)]
# work backwards so earlier offsets stay valid after each removal
for f in reversed(footers):
    # four-line footers: 2005-2014, then 2014-onward
    lines = workcopy[f:f+190].splitlines()
    if (len(lines) == 4
            and lines[1].startswith('Full-Disclosure')
            and lines[2].startswith('Charter:')
            and lines[3].startswith('Hosted and sponsored')):
        workcopy = workcopy[:f] + workcopy[f+213:]
        continue

    # the listinfo link appears with either scheme, so accept both
    if (len(lines) == 4
            and lines[1].startswith('Sent through the')
            and lines[2].startswith(('http://nmap.org/', 'https://nmap.org/'))
            and lines[3].startswith('Web Archives &')):
        workcopy = workcopy[:f] + workcopy[f+211:]
        continue

    # three-line footer: 2002-2005
    lines = workcopy[f:f+146].splitlines()
    if (len(lines) == 3
            and lines[1].startswith('Full-Disclosure')
            and lines[2].startswith('Charter:')):
        workcopy = workcopy[:f] + workcopy[f+146:]

print(workcopy)

Yeah, I think that someone mistook the new year for April 1st.

Seriously, we seem to be getting more crap like this.  Are people just 
bored? 

            -Barry



Jörg Eschke wrote:

Sure, a user with admin rights is able to access/delete every local
file, regardless of the specific filepermissions.
Your 'exploit' will work with e.g. /bin/cat as well.
But i can't see a vulnerability anyway.

Am i missunderstanding something ?

Am Do, den 30.12.2004 schrieb Lennart Hansen um 2:18:
 

/bin/rm file access vulnerability

Affected Products:
        /bin/rm (all versions, tested on FreeBSD and linux)
        (http://www.freebsd.org    http://www.kernel.org)

Author:
        Xenzeo (Ablazed, Ultralaser, Lennart A. Hansen)
        xenzeo at blackhat dot dk


/bin/rm is a program that removes the named file arguments on unix systems.
When /bin/rm is called it checks the file's permissions and the id of the user
trying to remove the file. If the user does not have the required permissions
to

### PGP messages
As can be expected, many messages carry a PGP signature. It isn't useful for our processing, so we'll take it out. First, we define `get_raw_message` from the code we used previously. We then create `strip_pgp`, which looks for the PGP signature, any public key block, and the signed-message header. Simple text searches are enough again, except for the `Hash:` line, whose value can change, so we match it with a regular expression.

http://seclists.org/fulldisclosure/2017/Oct/11 is a message that includes a PGP signature, so it makes a good test case. The run shown below uses http://seclists.org/fulldisclosure/2005/Jan/719 instead.
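
For comparison, the same clean-up can be sketched with regular expressions only. This is an alternative under stated assumptions, not the notebook's `strip_pgp`: it assumes each armored block has matching BEGIN and END lines, it does not undo dash-escaping inside the signed text, and the sample message (including its signature data) is made up. `ARMOR`, `HEADER`, and `strip_pgp_sketch` are hypothetical names.

```python
import re

# armored blocks: a signature or a public key block, BEGIN line through END line
ARMOR = re.compile(
    r'-----BEGIN PGP (SIGNATURE|PUBLIC KEY BLOCK)-----.*?-----END PGP \1-----\n?',
    re.DOTALL)

# signed-message header, any "Hash: ..." lines after it, and the blank separator
HEADER = re.compile(r'-----BEGIN PGP SIGNED MESSAGE-----\n(?:Hash:[^\n]*\n)*\n?')


def strip_pgp_sketch(text):
    text = ARMOR.sub('', text)
    # only the first header is removed; a quoted one further down is left alone
    return HEADER.sub('', text, count=1)


# made-up signed message; the signature body is placeholder data
sample_signed = (
    '-----BEGIN PGP SIGNED MESSAGE-----\n'
    'Hash: SHA1\n'
    '\n'
    'Advisory text\n'
    '-----BEGIN PGP SIGNATURE-----\n'
    'Version: GnuPG v1\n'
    '\n'
    'AAAA\n'
    '-----END PGP SIGNATURE-----\n'
    'list footer\n'
)

print(strip_pgp_sketch(sample_signed))
```

A message without any PGP markers passes through unchanged, since neither pattern matches.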

In [13]:
def get_raw_message(url):
    r = requests.get(url)
    content = r.text
    start_marker = '<!--X-Body-of-Message-->'
    end_marker = '<!--X-Body-of-Message-End-->'
    start = content.index(start_marker) + len(start_marker)
    end = content.index(end_marker)
    body = content[start:end]

    soup = BeautifulSoup(body, 'html5lib')
    bodyhtml = soup.find('body')
    return bodyhtml.text

#rawmsg = get_raw_message('http://seclists.org/fulldisclosure/2017/Oct/11')
rawmsg = get_raw_message('http://seclists.org/fulldisclosure/2005/Jan/719')

def strip_pgp(raw):
    """Strip the PGP signature, any public key block, and the signed-message header."""
    sig_begin = '-----BEGIN PGP SIGNATURE-----'
    sig_end = '-----END PGP SIGNATURE-----'
    try:
        start = raw.index(sig_begin)
        end = raw.index(sig_end, start) + len(sig_end)
    except ValueError:
        # no complete signature block, so there is nothing to strip
        return raw
    cleaned = raw[:start] + raw[end:]

    # if we find a public key block, then strip that out;
    # search in `cleaned`, because offsets into `raw` no longer line up
    pk_begin = '-----BEGIN PGP PUBLIC KEY BLOCK-----'
    pk_end = '-----END PGP PUBLIC KEY BLOCK-----'
    try:
        start = cleaned.index(pk_begin)
        end = cleaned.index(pk_end, start) + len(pk_end)
        cleaned = cleaned[:start] + cleaned[end:]
    except ValueError:
        pass

    # finally, try to remove the signed message header
    header = '-----BEGIN PGP SIGNED MESSAGE-----'
    try:
        start = cleaned.index(header)
    except ValueError:
        # no header: keep the signature removal done so far
        return cleaned
    end = start + len(header)
    # if a hash designation immediately follows the header, strip that too
    pgp_hash = re.compile(r'\r?\nHash:.*\n').match(cleaned, end)
    if pgp_hash is not None:
        end = pgp_hash.end()
    return cleaned[:start] + cleaned[end:]

unpgp = strip_pgp(rawmsg)
Pretty(unpgp)
#Pretty(strip_pgp(raw))






______________________________________________________________________________

                        SUSE Security Announcement

        Package:                realplayer 8
        Announcement-ID:        SUSE-SA:2005:004
        Date:                   Monday, Jan 24th 2005 16:00 MET
        Affected products:      8.1, 8.2, 9.0, 9.1
                                SUSE Linux Desktop 1.0
        Vulnerability Type:     remote code execution
        Severity (1-10):        8
        SUSE default package:   yes
        Cross References:       none

    Content of this advisory:
        1) security vulnerability discussed:
               - integer overflow
           problem description
        2) solution/workaround
        3) standard appendix (further information)

______________________________________________________________________________

1) problem description, brief discussion


   RealPlayer is a combined audio and video player for RealMedia formatted
   streaming data.

### Talon processing

Next, we'll attempt to use __talon__ to strip out the signature from the message. Talon provides two different ways to find the signature, "brute force" and "machine learning". 

We'll try the brute force method first. 

In [28]:
import talon
from talon.signature.bruteforce import extract_signature

reply, signature = extract_signature(raw)
if signature is not None:
    # a bare Pretty(...) inside an if block is not displayed, so print instead
    print(signature)

In [29]:
Pretty(reply)

Zend Framework < 2.4.11    Remote Code Execution (CVE-2016-10034)
zend-mail < 2.7.2

Discovered by Dawid Golunski (@dawid_golunski)
https://legalhackers.com

Desc:
An independent research uncovered a critical vulnerability in zend-mail, a
Zend Framework's component that could potentially be used by (unauthenticated)
remote attackers to achieve remote arbitrary code execution in the context
of the web server user and remotely compromise the target web application.

To exploit the vulnerability an attacker could target common website
components such as contact/feedback forms, registration forms, password
email resets and others that send out emails with the help of a vulnerable
version of the zend-mail class.

Full advisory / PoC exploit at:

http://legalhackers.com/advisories/ZendFramework-Exploit-ZendMail-Remote-Code-Exec-CVE-2016-10034-Vuln.html

Video / PoC:

https://legalhackers.com/videos/ZendFramework-Exploit-Remote-Code-Exec-Vuln-CVE-2016-10034-PoC.html

For updates, follow:

htt

For 2017_Jan_0, at least, the brute force method is fairly effective; for 2017_Jan_45 it was not successful at all. Next, we'll try the machine learning method for comparison.

In [8]:
talon.init()
from talon import signature
reply_ml, sig_ml = signature.extract(raw, sender="dawid@legalhackers.com")
print(sig_ml)
#reply_ml

None


The machine learning method returns no signature here. It's unclear to me whether the library ships already trained; the documentation states that it was trained on the authors' personal email and an Enron set. There is an open issue on GitHub <https://github.com/mailgun/talon/issues/143> from July asking about the same thing. We will stick with the "brute force" method for now, and continue to look for other libraries.

## Extract HTML tags
We'll use a fairly simple regex to extract any tags from the reply. 

`<([^\s>]+)(\s|/>)+`
  * `[^\s>]+` one or more characters that are neither whitespace nor `>` (the tag name), __followed by__:
  * `(\s|/>)+` one or more occurrences of either a whitespace character or `/>`, the end of a self-closing tag.


We then use a dictionary to count the instances of each unique tag. 
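
To see what this pattern does and does not catch, here is a small sketch on a made-up fragment, using `collections.Counter` in place of a hand-maintained dictionary. Note that the pattern requires whitespace or `/>` after the tag name, so a tag written without attributes, such as `<b>`, is not counted. `TAG`, `sample_html`, and `counts` are illustrative names.

```python
import re
from collections import Counter

TAG = re.compile(r'<([^\s>]+)(\s|/>)+')

# made-up fragment: two links, one self-closing tag, one attribute-less tag
sample_html = (
    '<pre style="margin: 0em;">'
    'see <a href="http://example.com/">one</a> and '
    '<a href="http://example.org/">two</a><br/>'
    '<b>bold</b></pre>'
)

# group 1 is the tag name; names starting with "/" would be closing tags
names = [m.group(1) for m in TAG.finditer(sample_html)]
counts = Counter(name for name in names if not name.startswith('/'))
print(dict(counts))
```

If attribute-less tags matter for the analysis, a parser-based count would be a safer choice than the regex.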

In [9]:
rx = re.compile(r'<([^\s>]+)(\s|/>)+')
tags = {}
for tag in rx.findall(str(bodyhtml)):
    tagtype = tag[0]
    if not tagtype.startswith('/'):
        if tagtype in tags:
            tags[tagtype] = tags[tagtype] + 1
        else:
            tags[tagtype] = 1
print(tags)

{'a': 7, 'pre': 1}


## Extract link domains

We'll record what domains are linked to in each message. We use BeautifulSoup to pull out all `<a>` tags, then urlparse to determine the domain within.
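
The counting step can be tried offline with the sketch below. It is only an illustration: the standard library's `HTMLParser` stands in for BeautifulSoup so the example has no third-party dependency, the HTML fragment is made up, and `HrefCollector` and `count_domains` are hypothetical names. Unlike the cell that follows, it skips links with an empty domain (relative links, for example).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def count_domains(fragment):
    collector = HrefCollector()
    collector.feed(fragment)
    collector.close()
    # netloc is empty for relative links, so those are dropped
    netlocs = (urlparse(h).netloc.lower() for h in collector.hrefs)
    return Counter(n for n in netlocs if n)


# made-up fragment with absolute links, a relative link, and a bare anchor
sample_links = (
    '<a href="https://legalhackers.com/a">a</a> '
    '<a href="http://legalhackers.com/b">b</a> '
    '<a href="https://twitter.com/x">x</a> '
    '<a href="/relative">r</a> '
    '<a name="anchor">no href</a>'
)

domain_counts = count_domains(sample_links)
print(dict(domain_counts))
```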

In [10]:
from urllib.parse import urlparse

sites = {}

# href=True skips anchors without an href, which would otherwise yield None
atags = bodyhtml.find_all('a', href=True)
hrefs = [link['href'] for link in atags]

for link in hrefs:
    parsedurl = urlparse(link)
    site = parsedurl.netloc
    if site in sites:
        sites[site] = sites[site] + 1
    else:
        sites[site] = 1

sites

{'legalhackers.com': 4, 'nmap.org': 1, 'seclists.org': 1, 'twitter.com': 1}