README.md

Modified Blog Authorship Dataset

This data is sourced from the "Blog Authorship Corpus", available here. The original dataset was tokenized and split into sentences using spaCy. Sentences with fewer than 5 tokens or more than 30 tokens were discarded. Number-like tokens were replaced by "<#>". All tokens outside the 9999 most common were replaced by a single unknown-word token, for a vocabulary of 10000 words. Each sentence was tagged with the author's gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s), and the tagged sentences were placed into a pandas DataFrame.
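The filtering and vocabulary steps described above could be sketched roughly as follows. This is an illustration, not the script used to build the dataset. It uses only the standard library, so whitespace splitting stands in for spaCy tokenization and a list of dicts stands in for the pandas DataFrame. The names `preprocess` and `is_number_like`, the number-matching regex, the `"<unk>"` placeholder string, and the choice to count token frequencies after number replacement are all assumptions made for the example.

```python
import re
from collections import Counter

NUM_TOKEN = "<#>"       # placeholder for number-like tokens (from the description above)
UNK_TOKEN = "<unk>"     # assumption: the actual out-of-vocabulary string is not stated
MIN_LEN, MAX_LEN = 5, 30  # sentences outside this token range are discarded

# Assumption: a simple pattern for "number-like"; the original rule is not specified.
_NUMBER_RE = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?$")


def is_number_like(token):
    """Return True for tokens such as '3', '1,000' or '-2.5'."""
    return bool(_NUMBER_RE.match(token))


def preprocess(records, vocab_size=10000):
    """Filter and normalize (sentence, gender, age_bracket) records.

    Whitespace splitting is a stand-in for spaCy tokenization.
    Returns a list of dicts with keys 'tokens', 'gender' and 'age'.
    """
    rows = []
    for sentence, gender, age in records:
        tokens = [NUM_TOKEN if is_number_like(t) else t for t in sentence.split()]
        # Keep only sentences with 5 to 30 tokens, inclusive.
        if MIN_LEN <= len(tokens) <= MAX_LEN:
            rows.append({"tokens": tokens, "gender": gender, "age": age})

    # Keep the (vocab_size - 1) most common tokens; the remaining slot
    # in the vocabulary is taken by the unknown-word token.
    counts = Counter(t for row in rows for t in row["tokens"])
    keep = {t for t, _ in counts.most_common(vocab_size - 1)}
    for row in rows:
        row["tokens"] = [t if t in keep else UNK_TOKEN for t in row["tokens"]]
    return rows
```

With real data, the returned rows could be passed to `pandas.DataFrame(rows)` to obtain a frame with one sentence per row alongside its gender and age labels.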
