Skip to content

FriendlyToS is a project to explain Terms of Services in a friendly way to the users of the web.

Notifications You must be signed in to change notification settings

sethwoodworth/FriendlyToS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FriendlyToS

FriendlyToS is a project to explain Terms of Services in a friendly way to the users of the web.

This is the technical implementation of FriendlyToS and contains four subprojects:

  • HTML and MD of ToSes being monitored ('./documents')
  • FriendlyToS' Django website ('./frndlyts')
  • Terms of Service monitoring framework ('./tosWatch')
  • Natural language ToS annotation system ('./tosGrokr')
  • Web Crawler for ToS' ('./tosCrawl')

TODOs

  • create Goal words list from Anton, Veil, &etc for parsing
    • pg. 14 of Anton & Earp
    • (i.e. TRACK, AGGREGATE, PROVIDE)
  • Pull goal/vulnerability descriptions from Anton & Earp
    • type up in spreadsheet with extra fields of metadata
    • (i.e. "PREVENT use of cookies to send spam")
  • Initial Crawl list (source ToSBACK)
    • Source for list of trackers: Ghostery
  • Decide if ToS diff engine is in tosWatch or tosGrokr
  • Decide when a paragraph has changed enough to be considered a different paragraph instead of merly a revision of a paragraph.
  • Decide if and how comments on a paragraph will relate to later versions of that paragraph.

Installation Requirements

  • OS packages
    • python-lxml
      • or libxml2-dev and libxslt1-dev plus lxml in requirements.txt (haven't tried this yet)
    • git
  • Python
    • Use requirements.txt
    • nltk punkt
      • Use nltk.download() in an interactive session to download the punkt package

Database

Database diagram at https://cacoo.com/diagrams/DFHkFqrvdvr863xi

TODO: finish db schema (draft 1) TODO: decide what tool implements database schema TODO: research South for db migrations

Paragraph
    key > prev Paragarph
    key > next Paragraph
    fk > first Sentence
    fk > prev revision of paragraph;
    key > version
Sentence
    fk > parent Setence
    key > prev Paragarph
    key > next Sentence
Comments
    fkey > user
    fkey > paragraph_id

Logging

Use Django's logging, which is basically just Python's logging with a couple of added methods.

Django and Python both do not provide a handler for logging to a database, so we'll need to write that.

Need to define log formats and the dictConfig

We should define what happens for each log level, and when they are used in FriendlyToS:

  • critical - use DB and AdminEmailHandler handlers
  • error - use DB handler
  • warning - use DB handler
  • info - use DB handler
  • debug - ???

Should filter for DB messages and send those to a file.

Notes & Bookmarks

Pro Bono Privacy Initiative - Provides data privacy expertise to non-profits

Python Parsing: lxml - http://lxml.de/ - looks like a possible solution for building DOMs that can be queried via Xpath. I've had some success retreiving the content I want from a page, but more experimentation is needed.

lxml supports at least three seperate parsers:

  • libxml2 - The default
  • BeautifulSoup - Via lxml.html.ElementSoup
  • html5lib - For parsing HTML 5, via lxml.html.html5parser

There are some functions/methods in Lxml that might be of interest:

  • lxml.html.HtmlElement.interlinks() - Returns (element, attribute, link, pos) for every link in the element
  • lxml.html.diff.htmldiff(doc1, doc2) - A diff function that wraps differences in <ins> and <del> elements
  • lxml.html.diff.html_annotate - Another diff function that behaves like svn blame
  • See http://lxml.de/lxmlhtml.html#html-diff for the above two

Python Markdown Reading: python-markdown2

Python and Git GitPython - Python library for interacting with git repos

Thoughts on Scraping

Default lxml seems to work on the few sites tried so far. However, it might be a good idea to support multiple forms of scraping. Lxml includes three parsers (default, BeautifulSoup, HTML5). Regexs could be a fallback. BTE is a potential last result.

Dumping content into the database: convert <div>, <span>, <p>, <hX>, <tr>, &nbsp into paragraphs in the table. Will have to split <pre> on newlines. Will we keep the formatting of lists? And will we keep links?

Thoughts on Errors in Scraping

Should log IOErrors when they are thrown.

A message should be generated when a urlopen results in some error response (4xx). The message should include the url attempted, the error returned, and the timestamp of the attempt.

A message (ScrapeError ?) should be generated when html.parse returns an empty list, as we can assume that the xpath query in the database has successfully tested before. The generated message should include a timestamp of when the scrape was attempted, the last-modified header from the response, the url called, and the xpath query attempted.

Javascript Runtimes

Future Blog Content Ideas

Discussing history of law, some background on privacy policies, policy analysis of various new laws/cases/etc..., some background of privacy theory

Important Laws and Statutes

  • Communications Decency Act
  • DMCA
  • Video Privacy Protection Act

Interesting or Related Projects

  • Collusion - Firefox plugin that lets you see connections between trackers you have encountered.
  • Privacy Score - As of 3/5/2012, scores 1600+ websites based on how risky a privacy policy is to a user.
  • OpenCalais - Semantic analysis and linking service provided for free by Thomson Reuters.
  • WikiSummary - WikiSummary created summaries of a handful of ToSes five years ago.

Academic References

About

FriendlyToS is a project to explain Terms of Services in a friendly way to the users of the web.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages