ADDReC: Anonymized Disability Discourse Reddit Corpus أدرك

Citation

Paper: Large-Scale Anonymized Text-based Disability Discourse Dataset

Data

Reddit comments from three subreddits over a five-year period (January 2015 to December 2019). Subreddits:

r/Blind
r/Disability
r/ADHD

Sensitive data anonymized

names
usernames of the form /u/username
locations
- zip codes
- public named locations
links to external sites

Attributes of each entry

Attribute	Description
commentator_id	Unique id of username of poster
created_utc	Timestamp (in UTC) of post
anonymized_body	Body of comment that has been passed through presidio anonymizer
anonymized_masks	Indicates which data was masked out of the original comment body. This information is sensitive and should not be included in any publicly released data
ups	community up votes
downs	community down votes
score	ups - downs
controversiality	a score based on ratio of up votes to down votes
gilded	a reward given by users to other users for a good post, bought with real money
distinguished	official statement by a moderator
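
Because `anonymized_masks` is flagged above as sensitive, a release step would need to drop it before rows are shared. The following is a minimal sketch using only the standard library. It assumes each monthly file is a CSV whose header matches the attribute names above; the helper name `strip_sensitive_columns`, the sample row, and the format of the mask value are made up for illustration and are not part of the repository.

```python
import csv
import io

# Column flagged as sensitive in the attribute table above.
SENSITIVE_COLUMNS = {"anonymized_masks"}


def strip_sensitive_columns(csv_text: str) -> str:
    """Return the CSV text with the sensitive columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in SENSITIVE_COLUMNS]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the columns that are not in `kept`.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample row, not taken from the corpus; the mask format is assumed.
SAMPLE = (
    "commentator_id,created_utc,anonymized_body,anonymized_masks,ups,downs,score\n"
    "17,1420070400,Hello <PERSON>,PERSON: Sam,5,1,4\n"
)
PUBLIC = strip_sensitive_columns(SAMPLE)
```

Filtering by column name rather than position keeps the step correct even if the column order differs between files.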

Presidio Anonymization

Identifiers masked by the Presidio anonymization system.

Global identifiers

CREDIT_CARD
CRYPTO
EMAIL_ADDRESS
IBAN_CODE
IP_ADDRESS
LOCATION
PHONE_NUMBER
PERSON
URL

United States

US_BANK_NUMBER
US_DRIVER_LICENSE
US_ITIN
US_PASSPORT
US_SSN

UK

UK_NHS

Spain

NIF

Singapore

FIN

Australia

AU_ABN
AU_ACN
AU_TFN
AU_MEDICARE

Custom

USER
- manually defined to catch mentions of the form /u/username
- regex string: r'/u/([a-zA-Z0-9_]*)\b'
- every time a username is identified, an additional regex is created just for that username
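
The custom recognizer can be illustrated with Python's `re` module alone. This is a sketch under assumptions, not the code in `username_recognizer.py`: the function name `mask_usernames` and the `<USER>` placeholder are hypothetical, and "an additional regex for that username" is read here as a word-bounded pattern that also catches the bare name without the /u/ prefix.

```python
import re

# Pattern quoted in the list above; the capture group is the bare username.
USER_PATTERN = re.compile(r'/u/([a-zA-Z0-9_]*)\b')


def mask_usernames(text: str, placeholder: str = "<USER>") -> str:
    """Mask /u/ mentions, then bare repeats of the same usernames."""
    # Collect the names first, before any text is rewritten.
    names = {m.group(1) for m in USER_PATTERN.finditer(text) if m.group(1)}
    masked = USER_PATTERN.sub(placeholder, text)
    # One extra pattern per discovered username (assumed interpretation).
    # Longer names go first so a short name cannot clip a longer one.
    for name in sorted(names, key=len, reverse=True):
        masked = re.sub(r'\b' + re.escape(name) + r'\b', placeholder, masked)
    return masked


EXAMPLE = mask_usernames("Thanks /u/alice_99, I agree with alice_99 here.")
```

Note that the character class in the quoted pattern does not include hyphens, so a username containing one would only be matched up to the hyphen.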

Important Functions

`reddit_anonymizer.py`

`anonymize_dataframe(csv_df: pd.DataFrame) -> pd.DataFrame`

Anonymize a single dataframe. Each DataFrame is a single month of a subreddit.

scrape file for usernames to add to registry
load original registry and masks
mask every comment and store any results in next col over
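
The three steps above can be sketched as follows. To stay runnable without pandas or Presidio, this stand-in works on a list of dicts instead of a DataFrame and masks usernames only; the function name `anonymize_rows`, the input column name `body`, the `<USER>` placeholder, and the `|`-joined mask format are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Set

USER_PATTERN = re.compile(r'/u/([a-zA-Z0-9_]*)\b')


def anonymize_rows(rows: List[Dict[str, str]], registry: Set[str]) -> List[Dict[str, str]]:
    """Stand-in for the three steps, with rows as dicts instead of a DataFrame."""
    # Step 1: scrape every comment for usernames and add them to the registry.
    for row in rows:
        registry.update(n for n in USER_PATTERN.findall(row["body"]) if n)

    # Step 2: build one pattern from the full registry (old and new names).
    names = sorted(registry, key=len, reverse=True)
    bare = (
        re.compile(r'\b(?:' + '|'.join(map(re.escape, names)) + r')\b')
        if names
        else None
    )

    # Step 3: mask each comment and store the results in adjacent columns.
    result = []
    for row in rows:
        masks: List[str] = []

        def record(match: "re.Match[str]") -> str:
            # Remember what was removed, then substitute the placeholder.
            masks.append(match.group(0))
            return "<USER>"

        body = USER_PATTERN.sub(record, row["body"])
        if bare is not None:
            body = bare.sub(record, body)
        result.append(
            {**row, "anonymized_body": body, "anonymized_masks": "|".join(masks)}
        )
    return result


# Hypothetical rows: the second comment names a user only mentioned in the first.
REGISTRY = {"carol"}
ROWS = anonymize_rows(
    [{"body": "Ask /u/bob_1 about it."}, {"body": "bob_1 knows a lot."}],
    REGISTRY,
)
```

Scraping the whole file before masking is what lets a username found in one comment be masked in another comment that omits the /u/ prefix.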

`anonymize_text(sentence: str, analyzer) -> str:`

Run the anonymization process on every dataframe.

`reddit_dataset.py`

`load_dataset() -> Dict[str, Dict[int, Dict[int, pd.DataFrame]]]:`

Load all CSVs from the Reddit dataset into DataFrames.

structure

subreddit: dict[subreddit, dict]
- year: dict[year, dict]
  - month: dict[month, dataframe]
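
A short sketch of how the nested structure might be traversed. Plain lists stand in for `pd.DataFrame` so the example runs without pandas; the helper `iter_months`, the miniature dataset, and the exact spelling of the subreddit keys are assumptions, not taken from `reddit_dataset.py`.

```python
from typing import Dict, Iterator, List, Tuple

# Placeholder for pd.DataFrame so the sketch runs without pandas.
Frame = List[dict]
Dataset = Dict[str, Dict[int, Dict[int, Frame]]]


def iter_months(data: Dataset) -> Iterator[Tuple[str, int, int, Frame]]:
    """Yield (subreddit, year, month, frame) in sorted order."""
    for subreddit in sorted(data):
        for year in sorted(data[subreddit]):
            for month in sorted(data[subreddit][year]):
                yield subreddit, year, month, data[subreddit][year][month]


# Hypothetical miniature dataset; the key spellings are assumptions.
DATA: Dataset = {
    "Blind": {2015: {1: [{"score": 3}], 2: []}},
    "ADHD": {2019: {12: [{"score": 1}, {"score": 7}]}},
}
KEYS = [(s, y, m) for s, y, m, _ in iter_months(DATA)]
JANUARY_2015 = DATA["Blind"][2015][1]
```

Indexing by subreddit, then year, then month gives direct access to a single month, while the generator covers the whole corpus in a stable order.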

`print_dataset(data: Dict[str, dict]) -> None:`

Print the entire loaded dataset.

Notes:

Current State

We used the Presidio anonymizer but found issues with its named entity recognition. It worked very well for identifying raw numeric IDs, but the named entity recognizer tends to miss named locations and to flag medication names as sensitive entities.

License

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
