distinguish POSSE posts vs non-POSSE mentions and handle accordingly #51

Closed
snarfed opened this Issue Jan 31, 2014 · 51 comments

@snarfed
Owner
snarfed commented Jan 31, 2014

this would be nice for catching when other people post a link to your post in a silo.

i did this for a while in mid 2012, before bridgy's re-release with webmentions. i stopped because the POSSEd posts showed up as comments on the original posts, and i kept that decision in the re-release because i didn't see enough people using rel-syndication links, which meant i couldn't prevent the same thing happening to them.

on the other hand, we've been thinking more about de-duping and similar issues recently, and @tantek proposed that this kind of noise might help motivate people to make their mention handling smarter. worth a thought.

@snarfed
Owner
snarfed commented Jan 31, 2014

concretely, these would only differ from current webmentions in that they wouldn't have an in-reply-to, since they truly are "mentions."

@snarfed snarfed changed the title from send webmentions for original POSSE silo posts to send webmentions for posts as well as responses Apr 14, 2014
@snarfed
Owner
snarfed commented Apr 14, 2014

two possible approaches for distinguishing the original author's POSSEd posts:

  • don't bother. ideally, webmention handlers would detect them and filter them out, or whatever they want. (@tantek advocates this.)
  • omit original silo posts from the author, but not from other people.

both are reasonable, and this would be a good feature. promoting to now.

@snarfed snarfed added now listen and removed maybe labels Apr 14, 2014
@snarfed
Owner
snarfed commented Aug 26, 2014

lots of discussion about this on IRC today.

summary: when a tweet links to a post, but isn't the official POSSE tweet of that post, responses are backfed and rendered as if they were responses to the original post. two examples. some people like this somewhat (e.g. @snarfed, @kevinmarks, maybe @kylewm); others don't (@aaronpk, @tantek).

it's hard to prevent this. @tantek correctly notes that we can use rel=me to identify the original author, and only treat their tweets as POSSE candidates. that's a good step.

however, the common case is that the original author later links to their post from a different (non-POSSE) tweet. we could use u-syndication and permashortcitations to distinguish that from the original POSSE tweet, but both of those have low adoption rates among bridgy users, so we'd end up muzzling the majority of responses, which i don't want to do.

@kevinmarks suggests that we use time as a heuristic. if the author links to their post over 24h after it's originally posted, don't consider that a POSSE. definitely a good idea!

(i'd re-emphasize that this is all tradeoffs. given real world usage, i don't see a single best answer so far, and leaving the current behavior is on the table. good to hash through options though!)

@snarfed snarfed changed the title from send webmentions for posts as well as responses to send webmentions for (non-POSSE) posts as well as responses Aug 26, 2014
@snarfed snarfed added now and removed now labels Aug 26, 2014
@snarfed snarfed removed the later label Sep 4, 2014
@snarfed snarfed referenced this issue in aaronpk/webmention.io Dec 13, 2014
Closed

"favorited a tweet linking to"... #34

@snarfed
Owner
snarfed commented Jan 8, 2015

current proposal from @tantek in IRC today: only consider a link to be the original copy if it's on a domain in the user's silo profile. sounds ok to me, we could consider implementing it.

@kylewm
Collaborator
kylewm commented Apr 13, 2015

In case it's useful, here's an example where Bridgy is being overly aggressive in assuming a tweet is the POSSE copy of an original.

here's the original: https://adactio.com/journal/8710
here's a tweet from someone else (another bridgy user) linking to the original: https://twitter.com/jgarber/status/587245857034133504

and then a bunch of RT's of that tweet are backfed to the original as if they are RTs of the original. e.g., https://brid-gy.appspot.com/repost/twitter/jgarber/587245857034133504/587680705938907136

@snarfed
Owner
snarfed commented Apr 13, 2015

thanks @kylewm!

one way to mitigate: when the post's domain isn't one of the tweet author's domains, demote to u-mention.

@snarfed
Owner
snarfed commented Aug 28, 2015

some new thoughts from #452:

here's a concrete example. i recently tweeted this:

My silly privacy antics landed me in a @VICE @motherboard article on prepaid credit cards. Fun, mildly embarrassing. http://motherboard.vice.com/read/the-simple-trick-ashley-madisons-users-could-have-used-to-protect-themselves

with this new feature, we'd attempt to send a webmention with this tweet as the source and the motherboard.vice.com link as the target. of course, the source wouldn't actually be the twitter.com permalink, it'd be the bridgy proxy URL that renders the tweet as mf2.

one interesting question is whether to consider this part of "listen" or "publish." ie should we start doing this when you sign up for backfeed? or only when you enable publish? it's not clear to me which one it belongs to. i'm leaning toward listen (backfeed), but not sure.

also, a catch: POSSE/PESOSed silo posts would end up sending multiple wms, one from the original post and one from each silo post, so the target would end up showing duplicates. bridgy already causes this for POSSEd comments/likes/reposts, though, so it's not a new problem, and we've pretty much agreed that it's the recipient's job to use syndication links, etc to de-dupe.
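A minimal sketch of the recipient-side de-duping mentioned above, assuming each received webmention has already been parsed into a dict with a `url` (the source post's canonical URL) and an optional `syndication` list. Those field names and the `dedupe_mentions` helper are hypothetical, not part of bridgy or any particular receiver.

```python
def dedupe_mentions(mentions):
    """Drop mentions that are syndicated copies of another received mention.

    Each mention is a dict with 'url' and an optional 'syndication' list.
    These field names are assumptions for illustration only.
    """
    # Collect every URL that some mention declares as its own syndicated copy.
    syndicated = set()
    for m in mentions:
        syndicated.update(m.get('syndication', []))

    seen = set()
    kept = []
    for m in mentions:
        url = m['url']
        if url in syndicated or url in seen:
            continue  # silo copy of an original we already have, or an exact repeat
        seen.add(url)
        kept.append(m)
    return kept
```

This only works when the original post publishes syndication links, which is the same adoption caveat raised earlier in the thread.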

@snarfed
Owner
snarfed commented Aug 28, 2015

an idea for expanding this: search silos for any posts, from anyone, that link to the user's domain(s), and send wms for them too. these are effectively mentions.

silo support for this is mixed:

moved this to #456

@snarfed
Owner
snarfed commented Aug 29, 2015

added the full set of OPD heuristics to the IWC wiki. the important part for implementing is:

When considering a backlink in a silo post, use most or all of these heuristics to determine whether it's a POSSE:

  • The backlink must be at or near the end. (Allow e.g. a close paren after the link.)
  • The backlink must point to one of the user's domains, as determined by rel-me and links in their silo profile.
  • The silo post must be published within 24h of the original post.
  • New: compare the silo post's text and the original post's name, summary, and/or content, taking prefixes if they're meaningfully longer. (If the silo post has an ellipsis at or near the end, that's a strong hint to use a prefix.) The edit distance should be below a certain threshold, disregarding common differences like @-usernames in silo posts vs human names in original posts (e.g. this OP vs this POSSE).

current plan is to skip the last one due to complexity. i think the first three get us 80-95% of the value.
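The first three heuristics could be sketched roughly like this. It is an illustration under stated assumptions, not granary's or bridgy's actual code: the function names, the 4-character slack, and the subdomain matching are all hypothetical choices.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse


def link_near_end(text: str, link: str, slack: int = 4) -> bool:
    """True if at most `slack` non-whitespace-trimmed characters follow the link."""
    idx = text.rfind(link)
    if idx < 0:
        return False
    trailing = text[idx + len(link):].strip()
    return len(trailing) <= slack  # allows e.g. a close paren after the link


def on_user_domain(link: str, user_domains) -> bool:
    """True if the link's host is one of the user's domains, or a subdomain of one."""
    host = (urlparse(link).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in user_domains)


def within_24h(silo_published: datetime, original_published: datetime) -> bool:
    """True if the two publish times are at most 24 hours apart."""
    return abs(silo_published - original_published) <= timedelta(hours=24)


def looks_like_posse(text, link, user_domains, silo_published, original_published):
    """Require all three heuristics; the edit distance check is skipped."""
    return (link_near_end(text, link)
            and on_user_domain(link, user_domains)
            and within_24h(silo_published, original_published))
```

Requiring all three is the strictest reading of "most or all"; a looser combination would only change the last function.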

@snarfed
Owner
snarfed commented Sep 1, 2015

reorganizing this slightly. this issue will cover implementing the algorithm above for determining whether a silo post is a POSSE. if it is, we won't send a wm from it to the original post, but we will send its responses. if it isn't a POSSE, we'll send wms to each link in its text (and attachments, etc), as mentions, but we won't send wms for its responses anywhere.
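As a rough sketch of the behavior proposed in this comment, the routing might look like the following. The `plan_webmentions` helper and the dict keys are hypothetical; in bridgy the sources would be its proxy URLs rather than raw silo permalinks.

```python
def plan_webmentions(silo_post, is_posse):
    """Return (source, target) pairs to send for one silo post.

    `silo_post` is a dict with hypothetical keys:
      'url'       the silo post's URL
      'original'  the original post's URL (only meaningful for a POSSE copy)
      'links'     URLs found in its text and attachments
      'responses' URLs of replies, likes, reposts, etc. to the silo post
    """
    pairs = []
    if is_posse:
        # POSSE copy: don't send the post itself, only its responses.
        for response in silo_post.get('responses', []):
            pairs.append((response, silo_post['original']))
    else:
        # Plain mention: send the post to each link, and skip its responses.
        for link in silo_post.get('links', []):
            pairs.append((silo_post['url'], link))
    return pairs
```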

@kylewm @tantek @kevinmarks @aaronpk @kartikprabhu i know this has been controversial for a while now. does that sound like the ideal behavior?

i'm opening a new issue for the feature to search all silo posts for links to users' sites and send mentions for those: #456

@snarfed snarfed changed the title from send webmentions for (non-POSSE) posts as well as responses to distinguish POSSE posts vs non-POSSE mentions and handle accordingly Sep 1, 2015
@snarfed snarfed added the now label Sep 1, 2015
@kevinmarks

Not sure that is ideal - the pattern I get currently is that I quote an old post, my link to it is assumed to be POSSE, and so it isn't shown, but replies are. If it shows my non-POSSE link, the follow-ups are often interesting too, with that context.

@snarfed
Owner
snarfed commented Sep 2, 2015

@kevinmarks thanks for reviewing, and good point! ok, so for non-POSSE mentions, we backfeed replies, but not likes or reposts. sound good?

@snarfed
Owner
snarfed commented Sep 2, 2015

@kevinmarks on second thought, comparing to pure indieweb behavior...if i include a link in a post, I'd send a mention to it, but i wouldn't also send wms to it for each comment i get on my post, nor would i expect the commenters to send wms directly from their comment posts, since they're not replying to or mentioning that link. so... maybe we shouldn't backfeed replies to mentions after all?

@kylewm
Collaborator
kylewm commented Sep 2, 2015

I agree with that last bit -- Instead of backfeeding only the responses to a mention, it should only backfeed the mention itself. Replies to a mention are not replies to the original.

Unfortunately that means it matters even more that Bridgy guess correctly that something is a mention rather than a syndication (or err on the side of assuming syndication unless proven otherwise)... @snarfed in particular often rewords the silo copy so that I don't think edit distance would find them very similar at all, even though all the same information is contained (e.g. https://snarfed.org/2015-08-26_15313).

@kylewm
Collaborator
kylewm commented Sep 2, 2015

I suppose u-syndication could always represent a stronger claim on the posse copy. If publishers are having trouble with Bridgy classifying their dissimilar posts as mentions, they could start publishing u-syndication links.

@snarfed
Owner
snarfed commented Sep 2, 2015

right! syndication links override all of this. and as kevin mentioned in our initial IRC discussion, occasional false positives for high edit distances can probably be forgiven. deleting an occasional unwanted comment here and there generally shouldn't be too hard.

@kylewm
Collaborator
kylewm commented Sep 2, 2015

> occasional false positives for high edit distances can probably be forgiven

if we adopt the convention of not backfeeding replies-to-mentions though, a false positive (true posse copy that bridgy thinks is a mere mention) would mean losing all replies to that post :(

maybe that's a good argument in favor of backfeeding replies to mentions

@kartikprabhu

How does this interact with "salmentions"? https://indiewebcamp.com/salmention Are salmentions to be sent only for reply posts? not mentions, likes etc...?

@snarfed
Owner
snarfed commented Sep 2, 2015

@kartikprabhu good timing! #458 may be relevant to your interests. :P

short answer: bridgy already kinda does its own salmentions for silo posts, and i'm not sure we have a concrete use case yet where bridgy responses need to interoperate with real salmentions. i'd love to see one!

@kartikprabhu

But this issue seems like salmention from silo via bridgy. If someone only mentions a link on Twitter shouldn't the replies be sent to the link too like salmentions?

@snarfed
Owner
snarfed commented Sep 2, 2015

i don't know if they should. we don't really have a concrete spec for expected salmention behavior afaik (@kylewm @acegiak @dissolve correct me if i'm wrong). https://indiewebcamp.com/comment-propagation only talks about direct replies/comments, not mentions, but it's not clear if that's intentional.

we're working hard enough here just getting the POSSE-or-mention? logic right and agreeing on the expected behavior in each case. i'm inclined to punt discussion of salmention interop to #458 or elsewhere, if that's ok.

@kartikprabhu

Of course. They seemed to be related and so I brought this up.

@snarfed
Owner
snarfed commented Sep 4, 2015

i'm hoping to start working on this over the weekend. i know people have felt strongly about this, so i'd love to hear more of you weigh in on #51 (comment) before i start, even if it's just "sounds good" or "not so sure, let's discuss more first." thanks in advance!

@kartikprabhu

@snarfed looks good enough to try out and see if additional issues crop up

@snarfed
Owner
snarfed commented Sep 4, 2015

thanks @kartikprabhu!

@kylewm
Collaborator
kylewm commented Sep 4, 2015

@snarfed I'm totally ambivalent on backfeeding replies to mentions*, but I think the heuristic and all looks great.

* I'm leaning toward backfeeding the whole chain because the cost of guessing wrong (i.e. whether something is a reply or a mention) is lower. and for most of us, we're probably interested in all tweets that mention us even if they're not totally 100% relevant.

@snarfed snarfed referenced this issue in snarfed/granary Sep 6, 2015
Merged

original post discovery v2.0 #36

@snarfed
Owner
snarfed commented Sep 6, 2015

hey @kylewm, mind reviewing snarfed/granary#36 and #465 when you get a chance? they made my head hurt, but i think they implement the first two (checked) parts of #51 (comment). if we can convince ourselves they're correct, the 24h check should be easy to bolt on.

(the hard part was converting granary and bridgy to the explicit POSSE-vs-mention logic and data model in AS upstreamDuplicates vs tags. the two checked checkboxes in the new algorithm were pretty small compared to that.)

thanks in advance!

@kylewm
Collaborator
kylewm commented Sep 7, 2015

I looked over the changes, and to the extent that I understand it, everything looked good. I actually hadn't realized that upstreamDuplicates vs tags weren't already mapping directly to mention vs in-reply-to; but the mapping seems clearer now.

@snarfed
Owner
snarfed commented Sep 7, 2015

thanks for reviewing!

ok, so the two things left before closing this are the 24h check and only backfeeding replies, not likes/reposts/rsvps, to mentions.

@snarfed snarfed closed this in #465 Sep 7, 2015
@snarfed snarfed reopened this Sep 7, 2015
@snarfed
Owner
snarfed commented Sep 8, 2015

i may drop the near-the-end requirement. within a day after pushing out the changes i have so far, i got bug reports from bridgy users. example POSSE tweet with backlink in the middle (or at least >4 chars from the end): https://twitter.com/alohastone/status/639771864647135232 , reported in acegiak/Semantic-Linkbacks#26.

@snarfed
Owner
snarfed commented Sep 8, 2015

(should have attached here: snarfed/granary@08d0493)

@snarfed snarfed added a commit to snarfed/granary that referenced this issue Sep 9, 2015
@snarfed OPD: consider tags as original post candidates too bade949
@snarfed snarfed added a commit that referenced this issue Sep 9, 2015
@snarfed OPD: fully support post handler, drop mentions for likes/reposts/rsvps
for snarfed/bridgy#51.

this is an ok refactoring, but it'd be better to merge it into granary's Source.original_post_discovery() entirely, except it depends on get_webmention_target(), which uses memcache etc. bleh.
f931252
@snarfed snarfed added a commit to snarfed/granary that referenced this issue Sep 9, 2015
@snarfed move bridgy.util.follow_redirects() to granary.source d69b38d
@snarfed snarfed added a commit to snarfed/webutil that referenced this issue Sep 9, 2015
@snarfed move bridgy.util.clean_webmention_url() to util.clean_url()
for snarfed/bridgy#51. also add CacheDict.set().
5b67820
@snarfed snarfed added a commit that referenced this issue Sep 9, 2015
@snarfed mf2 handlers: handle AS tags without urls
fixes #468. also for #51.
a6d0f39
@snarfed
Owner
snarfed commented Sep 11, 2015

current estimate of cost of implementing this, across all commits here and many others that didn't get attached: >1kloc. whee!

@snarfed snarfed self-assigned this Sep 11, 2015
@snarfed snarfed added a commit that referenced this issue Sep 12, 2015
@snarfed bump app version for OPD changes (#469, #51)
also rename test due to review feedback
812bc58
@snarfed
Owner
snarfed commented Sep 12, 2015

just fyi all, the first pass at this is running in prod. the two key changes are that we only interpret links as original posts if they're on one of the user's domain(s), and we only backfeed likes/reposts/rsvps to original posts, not mentions.

please let me know if you see anything that seems wrong!
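The second of those two changes amounts to a small filter. Here is a hedged sketch; the type strings and the `should_backfeed` name are made up for illustration and are not bridgy's identifiers.

```python
# Response types that are only backfed to original posts, not to mentions.
# The type strings here are hypothetical labels.
ORIGINAL_ONLY_TYPES = {'like', 'repost', 'rsvp'}


def should_backfeed(response_type, target_is_original):
    """True if a response of this type should be sent to the target."""
    if target_is_original:
        return True  # original posts receive every response type
    return response_type not in ORIGINAL_ONLY_TYPES
```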

@snarfed
Owner
snarfed commented Sep 13, 2015

@armingrewe reported in #470 that this is making him miss some backfeed since he POSSEs from a number of different web sites and doesn't have all of them in his silo profiles. may be one real world counterexample to the domain check.

@armingrewe

Thing is, certainly for Twitter you can only have one site in your profile, not sure how I could add the others?
For Google+ I've re-linked the account, it seems to have picked up the other sites now. Fingers crossed that will fix it.

@voxpelli

For Twitter one would probably have to look up the rel-me-links of the linked-to profile and include them as if they were linked to directly by Twitter (ideally maybe resolve the entire identity graph, but that would require something like relspider which would be a first practical use for such a graph in the community – not even IndieAuth uses it yet – and as such pretty experimental).

As the Twitter account claims to have the same identity as the webpage that has the rel-me links any link there can safely be assumed to also be a claimed identity of the Twitter account (although of course not the reverse – the Twitter account can not be safely assumed to be a claimed identity of those pages unless they somehow have a verified chain back to it by eg. linking back with rel-me to the original site or by themselves linking to the Twitter account)
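A one-hop version of this idea could be sketched as below, using only the standard library. It is an assumption-laden illustration: `claimed_domains` and the `fetch` callable (URL in, HTML string out) are hypothetical, and it follows rel-me links forward only, without resolving the full identity graph or verifying links back.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class RelMeParser(HTMLParser):
    """Collects href values from <a> and <link> tags whose rel includes "me"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ('a', 'link'):
            return
        attrs = dict(attrs)
        rels = (attrs.get('rel') or '').lower().split()
        if 'me' in rels and attrs.get('href'):
            self.hrefs.append(attrs['href'])


def claimed_domains(profile_urls, fetch):
    """Domains linked from the silo profile, plus one hop of their rel-me links.

    `fetch` is a caller-supplied function that returns the HTML for a URL.
    """
    domains = set()
    for url in profile_urls:
        host = urlparse(url).hostname
        if host:
            domains.add(host)
        parser = RelMeParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            linked_host = urlparse(href).hostname
            if linked_host:  # skips relative links, which have no host
                domains.add(linked_host)
    return domains
```

Passing `fetch` in keeps the sketch free of network code; a real caller would also want to exclude the silo's own domain from the result.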

@snarfed
Owner
snarfed commented Sep 13, 2015

@armingrewe we actually pull urls from all text in twitter profiles, including the description, so you can put others there. same with other silos.

@voxpelli good points about rel-me links! we don't currently look at them now, but we definitely could.
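For illustration, pulling URLs out of all profile text might look something like this. The `profile_domains` helper, the `url` and `description` keys, and the regular expression are assumptions for the sketch, not the code bridgy uses.

```python
import re
from urllib.parse import urlparse

# Deliberately simple URL pattern; an assumption, not a full URL grammar.
URL_RE = re.compile(r'https?://[^\s<>"\')]+')


def profile_domains(profile):
    """Collect domains from a profile's url field and any URLs in its description."""
    candidates = []
    if profile.get('url'):
        candidates.append(profile['url'])
    candidates.extend(URL_RE.findall(profile.get('description', '')))

    domains = set()
    for url in candidates:
        # Strip sentence punctuation that the pattern may have swallowed.
        host = urlparse(url.rstrip('.,;')).hostname
        if host:
            domains.add(host)
    return domains
```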

@armingrewe

@snarfed ah, thanks for that, hadn't realised that. Updated and relinked my profile, I'll watch out if that works now.

@armingrewe

Just to confirm, as far as I can tell the Twitter and G+ mentions are now flowing through again. On the blog with the most activity I usually post in the morning (UK, ~6:30 GMT/BST) and the majority of mentions come over the next few hours. All fine so far.

@snarfed
Owner
snarfed commented Sep 14, 2015

thanks for the update @armingrewe! glad to hear it.

btw Facebook should work in general too, but I know you mentioned it hasn't for you. feel free to post details if you want!

@armingrewe

Facebook was fine all the time ;-) There might be something where bridgy isn't picking up something when I post via WordPress, but I need to look at that before I can be sure if there's an issue.

@snarfed
Owner
snarfed commented Sep 15, 2015

i've updated the discussion of these OPD heuristics in https://indiewebcamp.com/original-post-discovery#Brainstorming . tldr: there are four, and we've hit real world counterexamples for all of them in bridgy, so none are ideal.

  • user's domain
  • within 24h
  • near the end of the silo post
  • nearly the same text as the silo post, ie edit distance is below a given threshold
@kylewm
Collaborator
kylewm commented Sep 15, 2015

few random thoughts...

Another possible heuristic: have we already seen a POSSE for this post on this service? if so, it's more likely that subsequent links are mentions. It's not that strong of a criterion because many people will tweet links to the same piece throughout the day (e.g. Dave Winer), and of course tweets are deleted and reposted as edits.

It's much more costly to incorrectly identify a POSSE copy as a mention, i.e. no backfeed for that post. So the threshold for qualifying as a POSSE copy should probably be way lower, maybe matching some subset of the criteria, like off the top of my head:

* any two of the first three
* any one of the first three + lower than 50% edit distance
* lower than 30% edit distance
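The three rules above could be combined roughly as follows. This is a sketch, not anything bridgy implements: "N% edit distance" is interpreted here as Levenshtein distance divided by the longer string's length, which is an assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def relative_distance(silo_text, original_text):
    """Edit distance as a fraction of the longer string's length (0.0 = identical)."""
    longest = max(len(silo_text), len(original_text))
    return edit_distance(silo_text, original_text) / longest if longest else 0.0


def is_posse(on_domain, within_24h, near_end, distance):
    """Apply the three rules: two heuristics, one heuristic + <50%, or <30% alone."""
    matched = sum([on_domain, within_24h, near_end])
    return (matched >= 2
            or (matched >= 1 and distance < 0.5)
            or distance < 0.3)
```

The thresholds are taken straight from the list; anything that fails all three rules would be treated as a mention.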

It's very difficult to correctly categorize the "Kevin tweets a link to his post within 24h" case without throwing out a lot of legitimate POSSEs. In the specific case on the wiki, we could say it looks like he is tweeting at someone but the original isn't in-reply-to anything...wonder if that applies more generally to self-mentions.

@snarfed
Owner
snarfed commented Sep 15, 2015

thanks @kylewm! interesting idea to record inferred POSSE links and check them later. kind of an extension of the way we already store syndication links. and you're right, the standard way to handle a complicated inference like this based on heuristics is to combine them with weights into a score... and that in this case, false negatives hurt much more than false positives. (I've always described bridgy as deliberately "promiscuous." :P)

I'm already second guessing all this added complexity, though, and it looks like the domain check is comfortably the strongest so far, so I'm kind of leaning toward just that. meh.

@kylewm
Collaborator
kylewm commented Sep 15, 2015

> I'm already second guessing all this added complexity, though, and it looks like the domain check is comfortably the strongest so far, so I'm kind of leaning toward just that. meh.

I would support that too. Fight that sunk cost fallacy!

@tinokremer

I'm not sure if it's this issue. I came here when searching for the "No post links found" message in this repository. For me Bridgy behaves a bit oddly. I have posted my links as usual to Google+ (manually from my Known instance) and the favorites are fed back to my site as normal, but the replies are not; instead I get the message "No post links found". I checked my Google+ profile and https://stream.tinokremer.nl is mentioned. On my own Known instance, my Google+ profile is mentioned too and IndieAuth sees it as normal.

I'm puzzled why Bridgy cannot see post links, can you shed light on that @snarfed ?

(screenshot: 2015-09-22_184303)

@snarfed
Owner
snarfed commented Sep 23, 2015

@tinokremer sorry for the trouble! you're right, it probably is due to this. current status: trying to track down the memory leak in #456 (comment), which is blocking further fixes here. wish me luck!

@tinokremer

Memory leaks are the hardest issues to solve, and I'm a C# .NET developer. The reference system and garbage collector clean up most of my mess. Good luck indeed!

@snarfed snarfed added a commit to snarfed/granary that referenced this issue Sep 23, 2015
@snarfed PPD: for redirects, use final URL for domain check, and attach pre-re…
…direct URL

for snarfed/bridgy#51, snarfed/bridgy#485. thanks to @kylewm for help debugging!
2186a97
@snarfed snarfed added a commit to snarfed/granary that referenced this issue Sep 24, 2015
@snarfed add include_redirect_sources kwarg to Source.original_post_discovery()
matches same kwarg in bridgy's original_post_discovery.discover(). for snarfed/bridgy#51, snarfed/bridgy#485
8fcd45f
@snarfed snarfed added a commit that referenced this issue Sep 24, 2015
@snarfed on redirects, only include final URLs in webmention targets, not init…
…ial ones

uses new include_redirect_sources kwarg in Source.original_post_discovery(). for #51, #485
9b47032
@snarfed
Owner
snarfed commented Sep 26, 2015

tentatively closing. this has been running in prod and stable for a few days. I'm sure there are more bugs left to fix, but we can open new issues for them.

@snarfed snarfed closed this Sep 26, 2015