facebook HTML processing pipeline #44

Open
vecna opened this Issue Dec 29, 2016 · 4 comments

Member

vecna commented Dec 29, 2016

HTML processing pipeline

From the raw Facebook HTML you can extract meaningful metadata and append your own results to the database, so that other researchers can benefit from them, and so on, in a collaborative effort to create a

The goal is a distributed network of parsers: independent developers can run their own analysis tools on top of validated metadata. It is a distributed parsing effort that tries to emulate the analysis Facebook itself does. Not exactly the same, because that would be impossible, but enough to create a working pipeline that might:

  • show the user (restricted access) more information about what they receive
  • compute statistics on topics, the penetration of fake news, and the shape of its spread
  • observe online trends as an open source, independent third party, like an Alexa for Facebook
  • provide an API for algorithm analysis to researchers, working groups, policy makers, and journalists

To begin, we have to extract the smallest chunks of metadata and make progress through a binary tree of parsers.
image

We can save the submitted metadata if the information is meaningful, as privacy preserving as possible, and minimized to resist decontextualisation attacks at the API level.

Processed metadata empowers the data analysis and the capability of this network; the dataset and the analysis might follow.

This is what is in the database after some iterations. Every iteration extends the metadata in MongoDB:
image

A simple kind of parser:

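var cheerio = require('cheerio');
var moment = require('moment');
// namespace chosen to match the "parser:postType" prefix in the log output
var debug = require('debug')('parser:postType');
// assumption: path of the shared helper that talks to the server
var parse = require('./lib/parse');

function getPostType(snippet) {

    var $ = cheerio.load(snippet.html);
    var retVal;

    // either selector marks the snippet as a promoted post
    if ($('.uiStreamSponsoredLink').length > 0)
        retVal = "promoted";
    else if ($('.uiStreamAdditionalLogging').length > 0)
        retVal = "promoted";
    else
        retVal = "feed";

    // TODO, don't use exclusion condition, but find a selector
    // for 'feed' too, and associate postType: fail so we can investigate on it later
    debug("・%s ∩ %s", snippet.id, retVal);
    return { 'postType': true,
             'type': retVal };
}

var postType = {
    'name': 'postType',
    'requirements': {},
    'implementation': getPostType,
    'since': "2016-11-13",
    'until': moment().toISOString(),
};
return parse.please(postType);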
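
Since postType declares empty `requirements`, it can run on every snippet; a parser further down the tree would presumably name the metadata it depends on. Here is a minimal sketch of that gating, assuming `requirements` is matched key by key against the metadata already stored for a snippet. The matching rule, the `canRun` helper, and the `promotedInfo` parser are all hypothetical and not taken from the repository:

```javascript
// Hypothetical helper: true when every required key has the required value
// in the metadata already attached to the snippet.
function canRun(parser, snippetMetadata) {
    return Object.keys(parser.requirements).every(function (key) {
        return snippetMetadata[key] === parser.requirements[key];
    });
}

// postType has no requirements, as in the definition above
var postType = { name: 'postType', requirements: {} };
// Illustrative second-level parser that only makes sense once postType succeeded
var promotedInfo = { name: 'promotedInfo', requirements: { postType: true } };

var fresh = {};                                   // snippet not parsed yet
var typed = { postType: true, type: 'promoted' }; // snippet after postType ran

console.log(canRun(postType, fresh));      // true
console.log(canRun(promotedInfo, fresh));  // false
console.log(canRun(promotedInfo, typed));  // true
```

With a rule like this, each parser only sees snippets that an earlier parser has already marked, which is what makes the sequence behave like a tree.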

*The HTML snippets are collected via the web extension and saved at the end of this backend handler:* https://github.com/tracking-exposed/facebook/blob/master/lib/events.js#L52

More complicated parsers exist; they are located in https://github.com/tracking-exposed/facebook/tree/master/parsers

@nolash do you have any suggestions? You were the first to contribute 👍 I'm committing to the branch feedBasicInfo, and @fievelk is writing the Python version: https://github.com/fievelk/fbt_pyparsers

This is the first script run in the sequence. postType, pasted above, just extends the 'html' table with metadata on the server. It is a binary decision tree.

$ DEBUG=* node parsers/postType.js 
  parser:⊹core Connecting to https://facebook.tracking.exposed/api/v1/snippet/status
{
  "since": "2016-11-13",
  "until": "2016-12-29T19:11:36.938Z",
  "parserName": "postType",
  "requirements": {}
} +0ms
  parser:⊹core 46638 HTMLs, 300 per request = 155 requests +1s
  parser:⊹core Connecting to https://facebook.tracking.exposed/api/v1/snippet/content
{
  "since": "2016-11-13",
  "until": "2016-12-29T19:11:36.938Z",
  "parserName": "postType",
  "requirements": {}
} +5ms
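
The log line `46638 HTMLs, 300 per request = 155 requests` is a plain division. A small sketch of that arithmetic, for anyone sizing their own parser run; the function name is made up here, and whether the core rounds down or fetches a final partial batch is an assumption, not something stated above:

```javascript
// Hypothetical helper: split a total number of snippets into fixed-size requests
function planRequests(total, perRequest) {
    var full = Math.floor(total / perRequest);   // requests that are completely filled
    var leftover = total % perRequest;           // snippets that do not fill a request
    return {
        fullRequests: full,
        leftover: leftover,
        requestsToCoverAll: full + (leftover > 0 ? 1 : 0)
    };
}

var plan = planRequests(46638, 300);
console.log(plan);
// { fullRequests: 155, leftover: 138, requestsToCoverAll: 156 }
```

Rounding down gives the 155 shown in the log; the remaining 138 snippets would need a 156th request to be covered.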

This is the output of the execution: for every HTML snippet, it looks for two patterns. It would be better if the condition ceased to be exclusive. If we can understand how to spot a non-promoted post too, the information becomes more robust and everything works better.
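
One possible shape for a non-exclusive version, as a sketch only: the selector for ordinary feed posts is still unknown, so `feedSelector` below is a placeholder, and `classifyPost` and `matcherFor` are hypothetical names. The function takes a matcher instead of cheerio directly so it can be tried without real HTML; with cheerio the matcher would be `function (sel) { return $(sel).length > 0; }`.

```javascript
// hasSelector: function (selector) -> boolean
// feedSelector: placeholder, the real selector for non-promoted posts is yet to be found
function classifyPost(hasSelector, feedSelector) {
    if (hasSelector('.uiStreamSponsoredLink') ||
        hasSelector('.uiStreamAdditionalLogging'))
        return { postType: true, type: 'promoted' };

    if (hasSelector(feedSelector))
        return { postType: true, type: 'feed' };

    // Nothing matched: record the failure instead of guessing 'feed'
    return { postType: false };
}

// Fake matcher standing in for a parsed snippet
function matcherFor(selectors) {
    return function (selector) { return selectors.indexOf(selector) !== -1; };
}

var FEED = '.hypotheticalFeedMarker';
console.log(classifyPost(matcherFor(['.uiStreamAdditionalLogging']), FEED));
console.log(classifyPost(matcherFor([FEED]), FEED));
console.log(classifyPost(matcherFor([]), FEED));
```

The third case is the useful one: a snippet that matches neither pattern is marked as failed, so it can be investigated later rather than silently counted as 'feed'.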

  parser:postType ・fdb795f8c2394d23dd2280ad4eedf9f7c897b98e ∩ feed +6ms
  parser:postType ・e41f623d1cf4e3737aaf8396ee0f52383622c145 ∩ feed +4ms
  parser:postType ・f55e0ba360454fd295070b8ac4231cfd75a4dc21 ∩ promoted +11ms
  parser:postType ・d76f8d8e8f21162f21a291cccbe5101699bb585e ∩ feed +274ms
  parser:postType ・4ccd0d6090490d9afd0c9c0a4cdb24b47eaa68c6 ∩ feed +729ms
  parser:postType ・916ebb01da701f417391ab30928298a6c24428eb ∩ feed +130ms

@vecna vecna added the help wanted label Dec 29, 2016

vecna added a commit that referenced this issue Dec 29, 2016

Member

vecna commented Dec 29, 2016

This is the branch comparison: log-messages-cleaning...feedBasicInfo

@vecna vecna changed the title from processing pipeline to facebook HTML processing pipeline Dec 29, 2016

Member

vecna commented Jan 3, 2017

This is the basis of the revision interface: https://facebook.tracking.exposed/revision
It lets you see the parsed metadata compared with the original content, so a user can 'report' if something has been parsed wrongly. How can it be improved? What should it allow?

Member

vecna commented Jan 22, 2017

This, for example, is a recurring problem:
image

Promoted posts stop being recognized as promoted. postType will be improved to cover the new condition, and the posts after a certain date will be reparsed.
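
A reparse could reuse the `since` / `until` window that the parser definition already carries. A rough sketch, assuming the server would hand back snippets inside that window even if they were parsed before (not verified); `reparseWindow` is a hypothetical helper and the date below is only a placeholder:

```javascript
// Hypothetical helper: copy a parser definition, narrowing its time window
// so that only snippets saved after `changedAt` are requested again.
function reparseWindow(parserDefinition, changedAt, now) {
    return Object.assign({}, parserDefinition, {
        since: changedAt,
        until: now.toISOString()
    });
}

var postType = { name: 'postType', requirements: {}, since: '2016-11-13' };

// '2017-01-15' is a placeholder, not the real date the markup changed
var again = reparseWindow(postType, '2017-01-15', new Date('2017-01-22T00:00:00.000Z'));
console.log(again.since, again.until);
// 2017-01-15 2017-01-22T00:00:00.000Z
```

The original definition is left untouched, so the full-history run and the reparse run can coexist.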
