This data is sourced from the "Blog Authorship Corpus", available here. The original dataset was tokenized and split into sentences using spaCy. Sentences with fewer than 5 tokens or more than 30 tokens were discarded. Number-like tokens were replaced by "<#>". All tokens outside the 9999 most common were replaced by "", giving a vocabulary of 10000 words. Each sentence was tagged with the author's gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s), and the results were placed into a pandas DataFrame.
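The length filter, number replacement, and vocabulary cut described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: it assumes the sentences are already tokenized into lists of strings, it uses a simple regular expression as a stand-in for "number-like", and the out-of-vocabulary token name `<unk>` and the function name `preprocess` are hypothetical.

```python
import re
from collections import Counter

NUM_TOKEN = "<#>"    # replacement for number-like tokens, as described above
UNK_TOKEN = "<unk>"  # hypothetical name for the out-of-vocabulary token

# Rough stand-in for "number-like": integers, decimals, comma-grouped digits.
NUMBER_RE = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?$")


def preprocess(sentences, min_len=5, max_len=30, vocab_size=10000):
    """Filter and normalize tokenized sentences.

    sentences: list of token lists.
    Returns (processed_sentences, vocabulary).
    """
    # Keep sentences with between min_len and max_len tokens, inclusive.
    kept = [s for s in sentences if min_len <= len(s) <= max_len]

    # Replace number-like tokens with a single placeholder.
    kept = [[NUM_TOKEN if NUMBER_RE.match(t) else t for t in s] for s in kept]

    # Keep the (vocab_size - 1) most common tokens; one slot is reserved
    # for the out-of-vocabulary token.
    counts = Counter(t for s in kept for t in s)
    vocab = {t for t, _ in counts.most_common(vocab_size - 1)}

    # Map every remaining token to the out-of-vocabulary placeholder.
    processed = [[t if t in vocab else UNK_TOKEN for t in s] for s in kept]
    return processed, vocab | {UNK_TOKEN}
```

Calling `preprocess(token_lists)` with the default arguments matches the limits stated above (5 to 30 tokens, 10000-word vocabulary); the gender and age labels would be carried alongside each sentence separately.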
spitis/blogs_data