# Parsing raw XML from the Stack Overflow [data dump](https://archive.org/details/stackexchange)

To use these scripts, download the "Comments" and "Posts" XML files from the data dump. The scripts extract any attributes you choose from these files. I used them to extract the titles and body text from the Posts file, and the text from the Comments file.

Note: on Stack Overflow, "posts" include both the questions asked by users and the answers supplied by other users. Only questions have titles, so the number of titles extracted will be lower than the number of posts processed.
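
To make the question/answer distinction concrete, here is a minimal sketch on a made-up two-row sample. It assumes the dump's layout of one `<row>` element per post with the fields stored as XML attributes, and `PostTypeId` 1 for questions and 2 for answers. The sample rows and the helper name `titles_and_bodies` are invented for illustration, and the standard library's `xml.etree.ElementTree` is used only so the snippet runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of Posts.xml: one <row> per post, fields as attributes.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I parse large XML files?" Body="&lt;p&gt;The file does not fit in memory.&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="&lt;p&gt;Use an incremental parser.&lt;/p&gt;" />
</posts>"""


def titles_and_bodies(xml_text):
    """Return (titles, bodies) found on the <row> elements of xml_text."""
    titles, bodies = [], []
    for row in ET.fromstring(xml_text).iter("row"):
        if "Title" in row.attrib:  # only questions carry a Title
            titles.append(row.attrib["Title"])
        if "Body" in row.attrib:
            bodies.append(row.attrib["Body"])
    return titles, bodies


titles, bodies = titles_and_bodies(SAMPLE)
print(len(titles), "title(s) from", len(bodies), "posts")
```

Both rows yield a body, but only the question yields a title, which is why the title count lags the post count.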

Note: the Posts file is huge, so these scripts read it serially, one post at a time, and write the extracted attributes to a file as they go rather than loading the whole document into memory.

Inspiration:

http://boscoh.com/programming/reading-xml-serially.html

https://www.ibm.com/developerworks/library/x-hiperfparse/index.html

In [1]:
from lxml import etree

In [2]:
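def serially_parse_xml(source, destination, attributes, start=0, stop=100000):
    """Stream `source` and write the requested attributes of each element
    to `destination`, one value per write.

    There are many millions of posts. Use `start` and `stop` to parse one
    slice of the file now and add another slice later: the first `start`
    elements are skipped, and parsing ends after element number `stop`
    (pass stop=None to read to the end of the file)."""
    if isinstance(attributes, str):
        attributes = [attributes]
    context = etree.iterparse(source)  # yields each element once its end tag is read
    i = 0  # count from the top of the file so that `start` really skips elements
    for _, elem in context:
        i += 1
        if i > start:
            for attribute in attributes:
                try:
                    destination.write(elem.attrib[attribute] + '\n')
                except KeyError:
                    continue  # e.g. answers have no Title
        # free memory for every element, including the skipped ones
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]  # clean up preceding siblings
        if i > start and i % 1000000 == 0:
            print("{} posts parsed".format(i))
        if stop and i >= stop:
            break
    print("Parsing completed at {}".format(i))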
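
The same streaming pattern can also be written as a generator, which keeps the parsing separate from the file writing. The sketch below is an illustration rather than part of the original scripts: `iter_attributes` is a hypothetical helper, it uses the standard library's `xml.etree.ElementTree.iterparse` instead of lxml, and it assumes every record is a `<row>` element directly under the root. The in-memory sample is invented.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_attributes(source, attributes, tag="row"):
    """Yield a dict of the requested attributes for each <tag> element in source."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {a: elem.attrib[a] for a in attributes if a in elem.attrib}
            root.clear()  # drop finished rows so memory use stays flat


# Invented three-row sample standing in for Posts.xml.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" Title="A" Body="a"/>'
    b'<row Id="2" Body="b"/>'
    b'<row Id="3" Title="C" Body="c"/>'
    b'</posts>'
)

# islice plays the role of start/stop: skip the first row, take the next two.
rows = list(itertools.islice(iter_attributes(sample, ["Title", "Body"]), 1, 3))
print(rows)
```

With a generator, slicing, filtering, and writing can each be changed without touching the parsing loop.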

### Comments

In [None]:
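path = "/"  # directory that holds Comments.xml
# With the default stop=100000 only the first 100,000 elements are processed;
# pass stop=None to process the whole file.
with open(path+'comments2.txt', 'w', encoding='utf-8') as dest:
    serially_parse_xml(path+'Comments.xml', dest, 'Text')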

### Full Posts (title, if exists, and body)

In [None]:
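path = "/"  # directory that holds Posts.xml
# stop=None reads to the end of the file; the default stop=100000 would end the
# run long before the millions of posts logged below.
with open(path+'posts.txt', 'w', encoding='utf-8') as dest:
    serially_parse_xml(path+'Posts.xml', dest, ['Title','Body'], stop=None)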

1000000 posts parsed
2000000 posts parsed
3000000 posts parsed
4000000 posts parsed
5000000 posts parsed
6000000 posts parsed
7000000 posts parsed
8000000 posts parsed
9000000 posts parsed
10000000 posts parsed
11000000 posts parsed
12000000 posts parsed
13000000 posts parsed
14000000 posts parsed
15000000 posts parsed
16000000 posts parsed
17000000 posts parsed
18000000 posts parsed
19000000 posts parsed
20000000 posts parsed
21000000 posts parsed
22000000 posts parsed
23000000 posts parsed
24000000 posts parsed
25000000 posts parsed
26000000 posts parsed
27000000 posts parsed
28000000 posts parsed
29000000 posts parsed
30000000 posts parsed
31000000 posts parsed
32000000 posts parsed
33000000 posts parsed
34000000 posts parsed
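
One thing to watch for in `posts.txt`: a `Body` value in the dump is the post's rendered HTML and usually contains newlines of its own, so a single post can span several lines of the output file. If one post per line is wanted, the value can be flattened before it is written. The helper below, `html_to_line`, is a hypothetical addition built on the standard library's `html.parser`, not part of the original scripts, and the sample body is invented.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_line(html_text):
    """Return the visible text of html_text collapsed onto a single line."""
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()
    # Joining with a space keeps adjacent paragraphs apart (at the cost of a
    # stray space around inline tags); split() then collapses every run of
    # whitespace, newlines included.
    return " ".join(" ".join(collector.chunks).split())


body = "<p>Use <code>iterparse</code>\nfor big files.</p>\n<p>It reads &amp; frees as it goes.</p>"
print(html_to_line(body))
```

Calling `html_to_line(elem.attrib['Body'])` before writing would give one line per post, at the cost of losing the markup.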


### Just titles

In [3]:
# resume after the first 50,000,000 posts and append the titles (for posts that
# have them) up to post 75,000,000
srcpath = '/'
destpath = '/'
with open(destpath+'posts_titles_50M.txt', 'a', encoding='utf-8') as dest:
    serially_parse_xml(srcpath+'Posts.xml', dest, ['Title'], start=50000000, stop=75000000)

51000000 posts parsed
52000000 posts parsed
53000000 posts parsed
54000000 posts parsed
55000000 posts parsed
56000000 posts parsed
57000000 posts parsed
58000000 posts parsed
59000000 posts parsed
60000000 posts parsed
61000000 posts parsed
62000000 posts parsed
63000000 posts parsed
64000000 posts parsed
65000000 posts parsed
66000000 posts parsed
67000000 posts parsed
68000000 posts parsed
69000000 posts parsed
70000000 posts parsed
71000000 posts parsed
72000000 posts parsed
73000000 posts parsed
74000000 posts parsed
75000000 posts parsed
Parsing completed at 75000000
