We currently use the following regular expression to replace the escaped newline characters present in file PostHistory.xml, which is part of the official Stack Overflow data dump:
((?:
|
)?
)
The head of PostHistory.xml looks like this:
In some cases, this may break posts that themselves contain the escaped newline character sequence.
One example is this post, others can be found using Stack Overflow's search feature.
The escaped newline sequences themselves are escaped within the posts:
We have to make sure that those sequences are preserved while the newlines are replaced.
We use the same character sequence when exporting the SOTorrent dataset versions, thus our export and import scripts are also affected.
Actually, I am wondering whether you could replace every `&` in the raw text with `&amp;`, which works like an HTML escape character, but only for `&`. That would guarantee that the escaped newline sequences exist only where you append them to the output, because every such sequence in the original raw text would have its `&` characters escaped to `&amp;`.
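The suggestion above can be sketched in Python roughly as follows. This is only an illustration, not the project's actual export code: the function names `encode_post` and `decode_post` are hypothetical, and the marker `&#xD;&#xA;` is an assumption about which escaped newline sequence the export uses.

```python
# Assumption: the escaped newline marker written by the export script.
NEWLINE_MARKER = "&#xD;&#xA;"


def encode_post(text: str) -> str:
    """Escape '&' first, then replace real line breaks with the marker."""
    escaped = text.replace("&", "&amp;")
    # Normalize CRLF and CR to LF so each line break yields exactly one marker.
    normalized = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", NEWLINE_MARKER)


def decode_post(line: str) -> str:
    """Reverse encode_post: restore line breaks first, then unescape '&'."""
    restored = line.replace(NEWLINE_MARKER, "\n")
    return restored.replace("&amp;", "&")


# A post that literally mentions the marker and also contains a real newline.
post = "Use &#xD;&#xA; in XML\nto encode a line break."
encoded = encode_post(post)
decoded = decode_post(encoded)
```

Because every `&` in the escaped text is followed by `amp;`, the marker can only occur where `encode_post` inserted it, so the order of the two steps in `decode_post` matters. Note that this sketch normalizes all line breaks to `\n`, so the original distinction between `\r\n` and `\n` is not preserved.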
Should be fixed in the most recent database versions (2020-08-31 and 2020-11-16).
I'm now keeping the newlines, hence I had to switch to SQL dumps instead of CSV files.
MySQL's CSV export is broken, see:
Thanks @laitingsheng for pointing this out to me.