# SecLists Index parsing

Example: http://seclists.org/fulldisclosure/2017/Jan

We will parse the monthly index page, mainly to extract the thread hierarchy encoded in its nested `<ul>` tags.

Normally, we have the file already downloaded via `seclists_crawler_raw.py`, but for this notebook, we'll download a fresh copy.
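
As a rough sketch of the "use the downloaded copy if we have one" workflow, the helper below reads a cached file and only falls back to a download when the file is missing. The name `load_index`, the `fetch` callback and the cache path are assumptions made for illustration; the actual on-disk layout used by `seclists_crawler_raw.py` is not shown in this notebook.

```python
from pathlib import Path


def load_index(path, fetch):
    """Return cached HTML from `path`; if the file is missing, call `fetch()` and cache the result."""
    cache = Path(path)
    if cache.exists():
        return cache.read_text(encoding='utf-8')
    html = fetch()  # hypothetical callback, e.g. a function wrapping requests.get(...).text
    cache.write_text(html, encoding='utf-8')
    return html
```

In this notebook it could be called as `load_index('2017_Jan.html', lambda: requests.get(url).text)`, where the file name is only a placeholder.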


In [1]:
import requests
from bs4 import BeautifulSoup
from IPython.display import Pretty
import pprint

pp = pprint.PrettyPrinter(indent=4)


url = 'http://seclists.org/fulldisclosure/2017/Jan'
r = requests.get(url)
raw = r.text
Pretty(raw)



<!-- SecLists-Message-Count: 99 -->
<!-- MHonArc v2.6.19 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                      "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://seclists.org/rss/fulldisclosure.rss">
<meta property="og:image" content="http://seclists.org/images/fulldisclosure-img.png" />
<link rel="image_src" href="http://seclists.org/images/fulldisclosure-img.png" />
<title>Full Disclosure: by thread</title>
<link REL="SHORTCUT ICON" HREF="/shared/images/tiny-eyeicon.png" TYPE="image/png">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
<meta name="theme-color" content="#2A0D45">
<link rel="stylesheet" href="/shared/css/insecdb.css" type="text/css">
<!--Google Analytics Code-->
<script type="text/javascript">
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getEl

The HTML generated by seclists.org contains an unterminated anchor tag (`<a name="begin">`). To make things easier for BeautifulSoup's parser, we close it manually with a string replacement. This anchor is a good locator for the start of the messages section, so it is worth making sure it is valid HTML.

Our soup "query" consists of two steps:
1. Find the tag whose `name` attribute is "begin".
2. Find the first `<ul>` after this tag and take its direct `<li>` children.
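
To see the two steps in isolation, here is a minimal sketch on a made-up miniature of the index page. The `sample` markup, the author names and the variable names are hypothetical, and `html.parser` is used only so the snippet does not depend on `html5lib`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the index page; the real markup is far larger.
sample = (
    '<ul><li>navigation</li></ul>'
    '<a name="begin">'
    '<ul>'
    '<li><a href="0" name="0">First</a> <em>alice (Jan 03)</em></li>'
    '<li><a href="1" name="1">Second</a> <em>bob (Jan 04)</em>'
    '<ul><li><a href="2" name="2">Re: Second</a> <em>carol (Jan 05)</em></li></ul>'
    '</li>'
    '</ul>'
)

# Close the unterminated anchor before parsing.
fixed = sample.replace('<a name="begin">', '<a name="begin"></a>')
mini = BeautifulSoup(fixed, 'html.parser')

anchor = mini.find(attrs={'name': 'begin'})                         # step 1
top_level = anchor.find_next('ul').find_all('li', recursive=False)  # step 2
hrefs = [li.find('a')['href'] for li in top_level]
print(hrefs)  # ['0', '1']
```

Because of `recursive=False`, only the two top-level entries come back; the reply with `href="2"` stays nested inside the second `<li>`.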


In [2]:
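# close the unterminated anchor before parsing
raw = raw.replace('<a name="begin">', '<a name="begin"></a>')
soup = BeautifulSoup(raw, 'html5lib')

begin = soup.find(attrs={'name': 'begin'})  # anchor marking the start of the message links
# direct <li> children only; replies stay nested inside their parent <li>
items = begin.find_next('ul').find_all('li', recursive=False)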

pp.pprint(items)

[   <li><a href="0" name="0">Zend Framework / zend-mail &lt; 2.4.11 Remote Code Execution	(CVE-2016-10034)</a> <em>Dawid Golunski (Jan 03)</em></li>,
    <li><a href="1" name="1">CINtruder v0.3 released...</a> <em>psy (Jan 03)</em></li>,
    <li><a href="2" name="2">Advisories Unsafe Dll in Audacity, telegram and Akamai</a> <em>filipe (Jan 03)</em></li>,
    <li><a href="3" name="3">Persisted Cross-Site Scripting (XSS) in Confluence Jira	Software</a> <em>jlss (Jan 03)</em>
<ul>
<li><a href="9" name="9">Re: Persisted Cross-Site Scripting (XSS) in Confluence Jira Software</a> <em>Moritz Naumann (Jan 04)</em>
<ul>
<li><a href="12" name="12">Re: Persisted Cross-Site Scripting (XSS) in Confluence Jira Software</a> <em>jlss (Jan 06)</em>
</li>
</ul>
</li>
<li>&lt;Possible follow-ups&gt;</li>
<li><a href="11" name="11">Re: Persisted Cross-Site Scripting (XSS) in Confluence Jira	Software</a> <em>David Black (Jan 06)</em>
</li>
 </ul>
</li>,
    <li><a href="4" name="4">0-day: QNAP NAS Devices 

We end up with a list of top-level `<li>` tags. Note that replies are encoded as `<ul>` elements nested inside their parent's `<li>`.

Example:

`<li><a href="3" name="3">Persisted Cross-Site Scripting (XSS) in Confluence Jira	Software</a> <em>jlss (Jan 03)</em>
     <ul>
         <li><a href="9" name="9">Re: Persisted Cross-Site Scripting (XSS) in Confluence Jira Software</a> <em>Moritz Naumann (Jan 04)</em>
             <ul>
                 <li><a href="12" name="12">Re: Persisted Cross-Site Scripting (XSS) in Confluence Jira Software</a> <em>jlss (Jan 06)</em>
                 </li>`
 
We'll need recursion to walk through each nesting level, so let's define a function.

Args:
* items: `<li>` tags at the current thread level
* messages: list that accumulates the results
* idroot: prefix prepended to each message's `href` to build its id (e.g. `2017_Jan_`)
* parent: id of the parent message (`None` for a top-level message)

In [3]:
import re

# Matches "author (date)"; compiled once, as a raw string so '\(' is not an invalid escape
WHOWHEN_RX = re.compile(r'(.+) \((.+)\)')

def read_messages(items, messages, idroot, parent):
    for li in items:
        msg = li.find('a')
        if msg is None:
            # some entries just read "Possible follow-ups" with no link; skip them
            continue
        msg_id = idroot + msg['href']  # avoid shadowing the built-in id()
        title = msg.text

        whowhen = li.find('em').text
        m = WHOWHEN_RX.search(whowhen)
        who = m.group(1)
        when = m.group(2)

        messages.append({
            'index': msg['href'],
            'id': msg_id,
            'title': title,
            'parent': parent,
            'author': who,
            'date': when
        })

        # replies live in a <ul> nested inside this <li>; recurse with this message as parent
        replies = li.find('ul')
        if replies is not None:
            read_messages(replies.find_all('li', recursive=False), messages, idroot, msg_id)

    return messages

messages = []
idroot = '2017_Jan_'
read_messages(items, messages, idroot, None)
pp.pprint(messages)


[   {   'author': 'Dawid Golunski',
        'date': 'Jan 03',
        'id': '2017_Jan_0',
        'index': '0',
        'parent': None,
        'title': 'Zend Framework / zend-mail < 2.4.11 Remote Code '
                 'Execution\t(CVE-2016-10034)'},
    {   'author': 'psy',
        'date': 'Jan 03',
        'id': '2017_Jan_1',
        'index': '1',
        'parent': None,
        'title': 'CINtruder v0.3 released...'},
    {   'author': 'filipe',
        'date': 'Jan 03',
        'id': '2017_Jan_2',
        'index': '2',
        'parent': None,
        'title': 'Advisories Unsafe Dll in Audacity, telegram and Akamai'},
    {   'author': 'jlss',
        'date': 'Jan 03',
        'id': '2017_Jan_3',
        'index': '3',
        'parent': None,
        'title': 'Persisted Cross-Site Scripting (XSS) in Confluence Jira\t'
                 'Software'},
    {   'author': 'Moritz Naumann',
        'date': 'Jan 04',
        'id': '2017_Jan_9',
        'index': '9',
        'parent': '2017_J

        'date': 'Jan 19',
        'id': '2017_Jan_49',
        'index': '49',
        'parent': None,
        'title': 'Persistent XSS in Ghost 0.11.3'},
    {   'author': 'Julien Ahrens',
        'date': 'Jan 19',
        'id': '2017_Jan_51',
        'index': '51',
        'parent': None,
        'title': '[RCESEC-2016-012] Mattermost <= 3.5.1 "/error" '
                 'Unauthenticated Reflected Cross-Site Scripting / Content '
                 'Injection'},
    {   'author': 'Curesec Research Team (CRT)',
        'date': 'Jan 19',
        'id': '2017_Jan_52',
        'index': '52',
        'parent': None,
        'title': "Tap 'n' Sniff"},
    {   'author': 'Vulnerability Lab',
        'date': 'Jan 20',
        'id': '2017_Jan_53',
        'index': '53',
        'parent': None,
        'title': 'Apple iOS 10.2 (Notify - iTunes) - Filter Bypass & '
                 'Persistent Vulnerability'},
    {   'author': 'Stefan Kanthak',
        'date': 'Jan 22',
        'id': '2017_Jan_54',
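
One detail worth checking is how the `author (date)` pattern behaves when the author itself contains parentheses, as in `Curesec Research Team (CRT)` above. The sketch below uses the same pattern as `read_messages`, written as a raw string. The helper name `split_who_when` and its fallback for non-matching text are assumptions added for illustration.

```python
import re

# Same pattern as in read_messages, written as a raw string.
WHO_WHEN = re.compile(r'(.+) \((.+)\)')


def split_who_when(text):
    """Split 'author (date)' into (author, date); return (text, None) if it does not match."""
    m = WHO_WHEN.search(text)
    if m is None:
        return text, None  # assumed fallback, not part of the original function
    return m.group(1), m.group(2)


print(split_who_when('Curesec Research Team (CRT) (Jan 19)'))
# ('Curesec Research Team (CRT)', 'Jan 19')
print(split_who_when('psy (Jan 03)'))
# ('psy', 'Jan 03')
```

The first group is greedy, so it extends to the last ` (` in the string, which keeps `(CRT)` with the author and leaves only the final parenthesized part as the date.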

The index file only summarizes each message, but we would like the full author (with email address) and a complete timestamp. To get these, we need to look into the message's actual raw.html. For this notebook, we'll download the page again, but in actual usage we would just open the existing file.

The full author and timestamp are contained between the `<!--X-Head-of-Message-->` and `<!--X-Head-of-Message-End-->` comments.
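
Slicing out the text between two marker comments can be sketched with a small helper that derives the offset from the marker's length instead of hard-coding it. The helper name `between` and the `page` fragment are hypothetical; the real header markup may differ.

```python
def between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Raises ValueError (from str.index) if either marker is missing.
    """
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end]


# Hypothetical page fragment, for illustration only.
page = (
    '<html><!--X-Head-of-Message-->'
    '<em>From</em>: someone<br><em>Date</em>: some date'
    '<!--X-Head-of-Message-End-->body</html>'
)
head_fragment = between(page, '<!--X-Head-of-Message-->', '<!--X-Head-of-Message-End-->')
print(head_fragment)
```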

In [9]:
import pendulum

message = messages[4]
reply_url = url + '/' + message['index']
r = requests.get(reply_url)
reply = r.text

start_marker = '<!--X-Head-of-Message-->'
start = reply.index(start_marker) + len(start_marker)  # skip past the opening marker
end = reply.index('<!--X-Head-of-Message-End-->')

head = reply[start:end]
soup = BeautifulSoup(head, 'html5lib')
ems = soup.find_all('em')

for em in ems:
    if em.text == 'From':
        author = em.next_sibling
        # the list obfuscates emails by replacing '@' with ' () ' and the periods in the domain name with spaces
        if author.startswith(': '):
            author = author[2:]
        author = author.replace(' () ', '@')
        at = author.find('@')
        author = author[:at] + author[at:].replace(' ', '.') 
        message['author'] = author
    elif em.text == 'Date':
        date = em.next_sibling
        if date.startswith(': '):
            date = date[2:]
        message['date'] = str(pendulum.parse(date).in_timezone('UTC'))

print(message)


{'id': '2017_Jan_9', 'author': 'Moritz Naumann <moritz.naumann@unbelievable-machine.com>', 'index': '9', 'parent': '2017_Jan_3', 'title': 'Re: Persisted Cross-Site Scripting (XSS) in Confluence Jira Software', 'date': '2017-01-04T11:57:15+00:00'}
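
The two clean-up steps can also be factored into small helpers. This is a sketch under assumptions: the obfuscated author string and the date string below are made-up inputs chosen to reproduce the output above, the helper names are hypothetical, and the standard library's `email.utils` is swapped in for `pendulum`, which only works if the date is in RFC 2822 style.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def deobfuscate(author):
    """Undo the list's obfuscation: ' () ' stands for '@', spaces in the domain for '.'."""
    if ' () ' not in author:
        return author  # nothing to restore
    local, domain = author.split(' () ', 1)
    return local + '@' + domain.replace(' ', '.')


def to_utc(date_header):
    """Parse an RFC 2822 style date string and return it as ISO 8601 in UTC."""
    return parsedate_to_datetime(date_header).astimezone(timezone.utc).isoformat()


# Hypothetical inputs, assumed for illustration.
print(deobfuscate('Moritz Naumann <moritz.naumann () unbelievable-machine com>'))
# Moritz Naumann <moritz.naumann@unbelievable-machine.com>
print(to_utc('Wed, 4 Jan 2017 12:57:15 +0100'))
# 2017-01-04T11:57:15+00:00
```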


Finally, let's write this data out in CSV format.

In [None]:
import csv
import sys

output = csv.writer(sys.stdout)
output.writerow(['id', 'title', 'date', 'author', 'parent'])
for x in messages:
    output.writerow([x['id'],
                     x['title'],
                     x['date'],
                     x['author'],
                     x['parent']])
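
An alternative sketch uses `csv.DictWriter`, which picks the columns by key and can ignore the extra `index` field. The helper name `messages_to_csv` and the `FIELDS` constant are assumptions for illustration; the sample row is taken from the parsed output above.

```python
import csv
import io

FIELDS = ['id', 'title', 'date', 'author', 'parent']


def messages_to_csv(rows):
    """Serialize message dicts to CSV text, ignoring keys not listed in FIELDS."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample_rows = [{
    'index': '2',
    'id': '2017_Jan_2',
    'title': 'Advisories Unsafe Dll in Audacity, telegram and Akamai',
    'date': 'Jan 03',
    'author': 'filipe',
    'parent': None,
}]
csv_text = messages_to_csv(sample_rows)
print(csv_text)
```

The title is quoted automatically because it contains a comma, and a `None` parent is written as an empty field.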

