ADDReC: Anonymized Disability Discourse Reddit Corpus أدرك
Paper: Large-Scale Anonymized Text-based Disability Discourse Dataset
Reddit comments from three subreddits over a five-year period (January 2015 - December 2019). Subreddits:
- r/Blind
- r/Disability
- r/ADHD
The following information is anonymized:
- names
- usernames of the form `/u/username`
- locations
- zip codes
- public named locations
- links to external sites
Attribute | Description |
---|---|
commentator_id | Unique ID assigned to the poster's username |
created_utc | Timestamp (in UTC) of the post |
anonymized_body | Body of the comment after it has been passed through the Presidio anonymizer |
anonymized_masks | Indicates what data was masked out of the original comment body. This information is sensitive and should not be included in any publicly available data |
ups | Community upvotes |
downs | Community downvotes |
score | ups - downs |
controversiality | A score based on the ratio of upvotes to downvotes |
gilded | A reward given by users to other users for a good post, bought with real money |
distinguished | Official statement by a moderator |
Identifiers masked by the Presidio anonymization system:
- CREDIT_CARD
- CRYPTO
- EMAIL_ADDRESS
- IBAN_CODE
- IP_ADDRESS
- LOCATION
- PHONE_NUMBER
- PERSON
- URL
- US_BANK_NUMBER
- US_DRIVER_LICENSE
- US_ITIN
- US_PASSPORT
- US_SSN
- UK_NHS
- NIF
- FIN
- AU_ABN
- AU_ACN
- AU_TFN
- AU_MEDICARE
- USER
  - this one was manually defined to catch `/u/username`
  - regex string: `r'/u/([a-zA-Z0-9_]*)\b'`
  - every time a username is identified, an additional regex is created just for that username.
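As a rough illustration of the `USER` identifier, the sketch below applies the regex quoted above with Python's standard `re` module. The helper names `find_usernames` and `username_pattern` are hypothetical, and the shape of the "additional regex" (the bare name matched as a whole word) is an assumption, not necessarily what the project's code does.

```python
import re

# Pattern quoted above: captures the name that follows "/u/".
USER_PATTERN = re.compile(r'/u/([a-zA-Z0-9_]*)\b')


def find_usernames(text):
    """Return the non-empty usernames mentioned as /u/<name> in text."""
    return {name for name in USER_PATTERN.findall(text) if name}


def username_pattern(name):
    """Build the additional regex for one identified username.

    Assumption: the per-username regex matches the bare name as a whole
    word, so mentions without the "/u/" prefix are caught as well.
    """
    return re.compile(r'\b' + re.escape(name) + r'\b')


comment = "Thanks /u/alice_99! I think alice_99 is right."
names = find_usernames(comment)
extra = [username_pattern(n) for n in sorted(names)]
# Every occurrence of an identified name, with or without the prefix.
bare_hits = [m.group(0) for p in extra for m in p.finditer(comment)]
```

In the real pipeline these patterns would presumably be registered with Presidio alongside its built-in recognizers; that wiring is left out here.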
Anonymize a single DataFrame. Each DataFrame holds a single month of a subreddit.
- scrape the file for usernames to add to the registry
- load the original registry and masks
- mask every comment and store the results in the next column over
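The three steps above can be sketched roughly as follows. To stay dependency-free, this uses a list of dicts as a stand-in for a DataFrame and only covers the `USER` identifier; the other identifiers would go through Presidio. The function name `anonymize_month`, the input key `body`, and the `<USER>` placeholder are assumptions made for illustration.

```python
import re

# Username pattern quoted in this README.
USER_PATTERN = re.compile(r'/u/([a-zA-Z0-9_]*)\b')


def anonymize_month(rows, registry, placeholder="<USER>"):
    """Anonymize one month of comments in place.

    rows: list of dicts with a "body" key (stand-in for DataFrame rows).
    registry: set of usernames seen so far; new names are added to it.
    """
    # Step 1: scrape the month for usernames and add them to the registry.
    for row in rows:
        registry.update(n for n in USER_PATTERN.findall(row["body"]) if n)

    # Step 2: build one mask from the base pattern plus every known name.
    parts = [USER_PATTERN.pattern]
    parts += [r'\b' + re.escape(name) + r'\b' for name in sorted(registry)]
    mask = re.compile('|'.join(parts))

    # Step 3: mask every comment and store the results in new columns.
    for row in rows:
        found = []

        def replace(match, found=found):
            found.append(match.group(0))  # remember what was masked out
            return placeholder

        row["anonymized_body"] = mask.sub(replace, row["body"])
        row["anonymized_masks"] = found
    return rows


rows = [
    {"body": "Ask /u/alice_99 about it."},
    {"body": "alice_99 already answered."},
    {"body": "No names here."},
]
registry = set()
anonymize_month(rows, registry)
```

Passing the same `registry` set to each month in turn lets names found in earlier months be masked in later ones.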
Run the anonymization process on every DataFrame.
Load all CSVs from the Reddit dataset into DataFrames.
- subreddit: dict[subreddit, dict]
  - year: dict[year, dict]
    - month: dict[month, dataframe]

Print out the whole loaded dataset.
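A minimal sketch of the nested subreddit, year, month layout is shown below. The directory layout `<root>/<subreddit>/<year>-<month>.csv` and the names `load_dataset` and `print_dataset` are hypothetical; the actual file naming in the repository may differ. The `read` argument is a plain callable so the sketch runs without pandas.

```python
import re
from pathlib import Path

# Hypothetical file layout: <root>/<subreddit>/<year>-<month>.csv
FILE_NAME = re.compile(r'(\d{4})-(\d{2})\.csv')


def load_dataset(root, read=Path.read_text):
    """Return {subreddit: {year: {month: table}}} for every CSV under root.

    read: callable applied to each file path. Pass pandas.read_csv to get
    DataFrames; the default simply returns the raw file text.
    """
    dataset = {}
    for path in sorted(Path(root).glob('*/*.csv')):
        match = FILE_NAME.fullmatch(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming
        year, month = int(match.group(1)), int(match.group(2))
        by_year = dataset.setdefault(path.parent.name, {})
        by_year.setdefault(year, {})[month] = read(path)
    return dataset


def print_dataset(dataset):
    """Print one line per loaded subreddit and month."""
    for subreddit, years in dataset.items():
        for year, months in years.items():
            for month in months:
                print(f"{subreddit} {year}-{month:02d}")
```

With pandas installed, `load_dataset(root, read=pandas.read_csv)` would yield one DataFrame per month, matching the structure listed above.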
We used the Presidio anonymizer, but found issues with its name recognition system. It worked very well for identifying raw numeric IDs, but the named entity recognition tends to miss named locations and to flag medication names as entities.
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).