Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent identification of PostTypeId in PostReferenceGH #16

Closed
CosmosAtlas opened this issue Mar 29, 2019 · 9 comments

Comments

Projects
None yet
2 participants
@CosmosAtlas
Copy link

commented Mar 29, 2019

temp

In the table PostReferenceGH.csv, some entries are self conflicting by PostId and PostTypeId. Example given above, the last row with same PostId identified with different PostTypeId. This issue exists in more than 1 row.

@CosmosAtlas

This comment has been minimized.

Copy link
Author

commented Mar 29, 2019

Just for better reviewability, attached the problematic GHUrl to the misclassified row.

https://raw.githubusercontent.com/Demonware/cog/trunk/cog/util.py

@sbaltes sbaltes self-assigned this Mar 29, 2019

@sbaltes sbaltes added the bug label Mar 29, 2019

@sbaltes

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Thanks for reporting!
I'm currently investigating the issue and will get back to you soon.

@sbaltes

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

The PostTypeId in table PostReferenceGH is based on the link found in the corresponding file. In the file you linked, the user simply posted the wrong link (using a instead of q). However, Stack Overflow directs such requests to the correct thread.

I thought about using data from table Posts to set the PostTypeId correctly, but than that column would be inconsistent with the matched content. Therefore, if you don't have further thoughts on this, I'll to keep the current behavior, but add a comment to the SOTorrent project page, mentioning that column PostTypeId in table PostReferenceGH is based on the extracted link, not on the data from Stack Overflow.

@CosmosAtlas

This comment has been minimized.

Copy link
Author

commented Mar 29, 2019

My take on this is that PostTypeId and PostId have a one to one relation. If remaining the current way, there contain errors in the table.

Since the SOUrl is often not the actual matched Url anyway (actual matched is in GHMatches.csv, why not have the correct SOUrl and PostTypeId in PostReferenceGH.csv?

@sbaltes

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

My take on this is that PostTypeId and PostId have a one to one relation. If remaining the current way, there contain errors in the table.

Yes, but those inconsistencies are caused by GitHub users posting wrong Stack Overflow links. The question is whether those issues should be fixed by the SOTorrent extraction scripts. For example, I also don't check whether the PostIds are valid.

Since the SOUrl is often not the actual matched Url anyway (actual matched is in GHMatches.csv, why not have the correct SOUrl and PostTypeId in PostReferenceGH.csv?

The SOUrls are normalized to sharing links.
There is no one-to-one connection between the individual extracted URLs in PostReferenceGH and the matches in GHMatches. Those tables are only connected via column FileId. One line may contain several links pointing to Stack Overflow posts.

I will consider fixing broken PostTypeIds or excluding invalidPostIds in an upcoming release. Until then, feel free to ignore column PostTypeId. You can always join PostReferenceGH and Posts based in PostId to retrieve the correct PostTypeId.

@sbaltes sbaltes added the enhancement label Mar 29, 2019

@sbaltes

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Also, we use PostTypeId 99 to mark links to comments, which does not exist in the official data dump.
To avoid confusion, it may be better to drop column PostTypeId for table PostReferenceGH. Comments can also identified by column CommentId not being NULL and for posts, the PostTypeId can be retrieved from table Posts.

@CosmosAtlas

This comment has been minimized.

Copy link
Author

commented Mar 29, 2019

Yes, but those inconsistencies are caused by GitHub users posting wrong Stack Overflow links. The question is whether those issues should be fixed by the SOTorrent extraction scripts. For example, I also don't check whether the PostIds are valid.

I can confirm that it's true, there are PostId from PostReferenceGH.csv that have no entry in Post.XML

I guess it's a really hard problem to format the data then. Maybe better documentation on the subject could help people have a more thourough understanding of the dataset.

Thanks for the fast update on the problem!

@sbaltes

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

No worries and thanks for reporting this!
I'll try to figure out how many PostIds and CommentIds are invalid, maybe I can filter them out even in the current release. The upload to Zenodo is not finished yet and I can always modify the BigQuery version. And you are right: I should update the documentation afterwards.

@sbaltes

This comment has been minimized.

Copy link
Member

commented Mar 30, 2019

About 0.36% of the links in table PostReferenceGH have either an invalid PostId or an invalid CommentId. I used the distinct Ids in table Comment and the distinct PostIds in table PostHistory to check this. The latter implies that there could still be some PostIds that are not in table Posts (e.g., deleted posts with history in PostHistory).

I decided to remove column PostTypeId and also the custom id 99 for comments. To retrieve the PostTypeId of links pointing to questions or answers, one could use:

SELECT links.*, PostTypeId
FROM `sotorrent-org.2019_03_17.PostReferenceGH` links
JOIN `sotorrent-org.2019_03_17.Posts` posts
ON links.PostId = posts.Id
WHERE CommentId IS NULL;

I further added a column PostIds to table GHMatches, containing a space-separated list of post ids found in the matched line.

sbaltes added a commit that referenced this issue Mar 30, 2019

@sbaltes sbaltes closed this Mar 30, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.