Skip to content

This repository contains python code to create a corpus of 12,215 terms of service documents scraped from TOSDR, intended for legal, privacy, and natural language processing research.

Notifications You must be signed in to change notification settings

sonu-gupta/tosdr-terms-of-service-corpus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 

Repository files navigation

TOSDR Terms of Service Corpus

UsageContributorsAcknowledgmentLicense

Description

This repository contains a corpus of 12,215 terms of service (TOS) documents, which were scraped from the TOSDR website using the Beautiful Soup and Requests libraries in Python. The dataset includes both HTML and text versions of the documents.

The data collection process involved retrieving the document text from TOSDR, saving it in an HTML file, removing non-English files detected through the LangDetect library, removing files less than 2B in size and less than 6 words, and converting the remaining files to a text format.

Usage

This corpus can be used by legal, privacy, and natural language processing (NLP) researchers to study terms of service agreements and their implications for user privacy and rights. It can be used to develop models and tools for analyzing terms of service agreements, identifying patterns and trends in the language used, and predicting the impact of certain clauses on users.

To use the dataset, simply download the files from this repository and import them into your preferred analysis tool or programming language.

Contributors

This corpus was created by Sonu Gupta. Contributions to the corpus and repository are welcome and encouraged. To contribute, please fork this repository, make your changes, and submit a pull request.

Acknowledgment

I would like to acknowledge the TOSDR team for their good work and the valuable resource they have created in the form of the TOSDR website. Their efforts in creating a platform that summarizes and rates the terms of service of various online services has been invaluable in raising awareness about online privacy and user rights.

License

This corpus is based on data from the Terms of Service; Didn't Read (TOS; DR) project, licensed under the GNU Affero General Public License version 3 (AGPLv3). As such, this work is also licensed under GNU AGPLv3. For more information about the AGPLv3 license, please visit https://www.gnu.org/licenses/licenses.html#AGPL.

Things in Progress

I am continuously updating and improving this corpus. Currently, I am working on removing the non-TOS documents from the corpus. I am also exploring the use of machine learning and natural language processing techniques to further analyze and extract insights from this corpus.

I am working on a new corpus that would contain TOS documents that are not included in this corpus.

If you are interested in contributing to this project, please feel free to submit a pull request or reach out to us directly. I welcome any and all contributions and feedback.

About

This repository contains python code to create a corpus of 12,215 terms of service documents scraped from TOSDR, intended for legal, privacy, and natural language processing research.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages