Improve escaping/unescaping newline characters #19

Closed
sbaltes opened this issue Jun 11, 2020 · 2 comments

@sbaltes
Member

sbaltes commented Jun 11, 2020

We currently use the following regular expression to replace the escaped newline characters present in file PostHistory.xml, which is part of the official Stack Overflow data dump:

((?:
|
)?
)

The head of PostHistory.xml looks like this:

[Screenshot: head of PostHistory.xml]

In some cases, this may break posts that themselves contain the escaped-newline character sequence as literal text.
One example is this post; others can be found using Stack Overflow's search feature.

The escaped-newline sequences themselves are escaped within the posts:
[Screenshot: escaped newline sequences inside a post]
We have to make sure that those sequences are preserved while the newlines are replaced.

We use the same character sequence when exporting the SOTorrent dataset versions, thus our export and import scripts are also affected.
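
To illustrate the collision, here is a minimal Python sketch. It assumes newlines are exported as the entity text &#xD;&#xA; and decoded again with a simple pattern; the function names and the pattern are hypothetical and are not the project's actual export and import scripts.

```python
import re

# Assumption: newlines are written out as this entity text on export.
NEWLINE_ENTITY = "&#xD;&#xA;"
# Hypothetical import-side pattern: an optional CR entity followed by an LF entity.
ESCAPED_NEWLINE = re.compile(r"(?:&#xD;)?&#xA;")


def naive_export(text: str) -> str:
    """Replace real line breaks with the entity text, without escaping anything else."""
    return text.replace("\r\n", NEWLINE_ENTITY).replace("\n", NEWLINE_ENTITY)


def naive_import(text: str) -> str:
    """Turn every entity occurrence back into a line break."""
    return ESCAPED_NEWLINE.sub("\n", text)


# A post that talks about the entity itself and also contains a real line break.
post = "Use &#xA; for a line feed in XML.\nSecond line."
restored = naive_import(naive_export(post))

# The literal "&#xA;" in the post body was turned into a line break as well.
print(restored == post)  # False
```

After the round trip, the importer cannot tell which entities came from real line breaks and which were part of the post text.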

Thanks @laitingsheng for pointing this out to me.

@laitingsheng

Actually, I am wondering if you could replace every & in the raw text with &amp;, which works like an escape character in HTML, but only for &. You can then guarantee that &#xD; or &#xA; only exists where you appended it to the output. All original &#xD; and &#xA; sequences in the raw text will be escaped to &amp;#xD; and &amp;#xA;, respectively.
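
A rough Python sketch of this idea, under the assumption that carriage returns and line feeds are encoded as &#xD; and &#xA;. The function names are hypothetical, and this is only one reading of the suggestion, not necessarily the fix adopted in the project.

```python
def escape_newlines(text: str) -> str:
    """Escape & first, so no original text can look like a newline entity."""
    return (
        text.replace("&", "&amp;")   # must run before any entity is appended
            .replace("\r", "&#xD;")
            .replace("\n", "&#xA;")
    )


def unescape_newlines(text: str) -> str:
    """Reverse the steps in the opposite order: entities first, then &amp;."""
    return (
        text.replace("&#xD;", "\r")
            .replace("&#xA;", "\n")
            .replace("&amp;", "&")
    )


samples = [
    "plain text",
    "two\r\nlines",
    "Use &#xA; for a line feed in XML.\nSecond line.",
    "already escaped: &amp;#xD;&amp;#xA;",
]
for sample in samples:
    # Every sample survives the round trip unchanged.
    assert unescape_newlines(escape_newlines(sample)) == sample

encoded = escape_newlines("a & b\r\nc &#xA; d")
print(encoded)
```

The order matters: escaping & before appending the entities on export, and decoding &amp; last on import, keeps literal entity text in a post distinct from real line breaks.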

@sbaltes
Member Author

sbaltes commented Nov 25, 2020

@sbaltes sbaltes closed this as completed Nov 25, 2020