finishXML.py
#!/usr/bin/env python
import sys
from xml.parsers import expat


def finishXMLFile(filename):
    """
    Read a file containing a truncated well-formed XML document and repair
    the end of it so that the result is well-formed.

    What it does *not* do:
    1) Cope with ill-formed files (i.e. files that could not be transformed
       into a well-formed file simply by appending characters).
    2) Make any guarantees about the validity of the output file.

    Known bugs:
    1) If the file has no root element, or the root element's opening tag
       is incomplete, this will not create one, so the result will be
       ill-formed. Probably this should raise an exception.
    2) If the file is Unicode and the truncation happens halfway through
       a multi-byte character, the result may not be well-formed.
    3) No check is made that the input file really is nearly well-formed.

    Warning: this *overwrites* the file it is given. (Done this way to save
    space, since I am dealing with files of several hundred megabytes.)
    """
    tagStack = []  # names of the elements still open when parsing stops

    def start(name, attributes):
        tagStack.append(name)

    def end(name):
        tagStack.pop()

    p = expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end

    # Binary mode: ParseFile needs bytes, and relative seeks need a binary file.
    with open(filename, 'rb+') as fIn:
        try:
            p.ParseFile(fIn)
        except expat.ExpatError as exc:
            err = exc
        else:
            return  # already well-formed; nothing to do
        message = str(err)

        # Cut the file back to the start of the line the parser stopped on.
        fIn.seek(0, 0)
        for i in range(err.lineno - 1):
            fIn.readline()
        lastLine = fIn.readline()
        fIn.seek(-len(lastLine), 1)
        fIn.truncate()
        lastLine = lastLine.rstrip()  # readline() keeps the trailing newline

        if message.startswith("no element found"):
            # We're in a text section, carry on
            pass
        elif message.startswith("unclosed token"):
            # throw away the final token and finish
            lastLine = lastLine[:err.offset]
        elif message.startswith("unclosed CDATA section"):
            lastLine = lastLine + b']]>'
        elif message.startswith("not well-formed (invalid token)"):
            # We need to worry about where we are. These are the possibilities
            if lastLine.endswith(b'/'):
                # We have "<tagName /"
                lastLine = lastLine + b'>'
            elif lastLine.endswith(b'<'):
                # We have simply "<"
                lastLine = lastLine[:-1]
            elif lastLine.endswith(b'<'):
                # We have "</t" with offset before the "<".
                # NOTE: same test as the branch above, so this is never reached.
                lastLine = lastLine[:err.offset]
            elif lastLine.endswith(b'!'):
                # We have "<!"
                lastLine = lastLine[:-2]
            elif lastLine.endswith(b'-'):
                if lastLine.endswith(b'<!-'):
                    lastLine = lastLine[:-3]
                elif lastLine.endswith(b'<!--'):
                    lastLine = lastLine[:-4]
                else:
                    # We have "<!-- blah --"
                    lastLine = lastLine + b'>'
            elif lastLine.endswith(b'A'):
                # We have "<![CDATA"
                lastLine = lastLine[:-8]
            elif lastLine.endswith(b'?'):
                # We have "<?"
                lastLine = lastLine[:-2]

        fIn.write(lastLine)
        # Close the still-open elements, innermost first.
        # Tag names are written as UTF-8 (assumes the file uses that encoding).
        for tag in reversed(tagStack):
            fIn.write(b"</" + tag.encode("utf-8") + b">")


if __name__ == '__main__':
    finishXMLFile(sys.argv[1])
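
The core idea of the script, tracking open elements on a stack and appending their closing tags, can be tried on an in-memory byte string without touching any file. The following is only a minimal sketch under stated assumptions: `close_open_tags` is a hypothetical helper that is not part of finishXML.py, it handles only the "no element found" case (truncation inside text content), and it assumes UTF-8 input.

```python
from xml.parsers import expat


def close_open_tags(data):
    """Append closing tags for the elements still open at the end of `data`.

    Hypothetical helper for illustration; handles only truncation that
    falls inside text content, not inside a tag, comment or CDATA section.
    """
    open_tags = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: open_tags.append(name)
    parser.EndElementHandler = lambda name: open_tags.pop()
    try:
        parser.Parse(data, True)
    except expat.ExpatError as exc:
        if not str(exc).startswith("no element found"):
            raise  # truncated inside markup: outside the scope of this sketch
    else:
        return data  # already well-formed
    # Close the innermost element first.
    closers = "".join("</%s>" % name for name in reversed(open_tags))
    return data + closers.encode("utf-8")


repaired = close_open_tags(b"<log><entry>first</entry><entry>sec")
# A fresh parser raises ExpatError here if the result is still not well-formed.
expat.ParserCreate().Parse(repaired, True)
```

Working on bytes rather than text mirrors what expat itself consumes, and re-parsing the result with a fresh parser is a cheap way to confirm the repair.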