.gitignore first commit Nov 15, 2016
README.md
__init__.py first commit Nov 15, 2016
blogs.h5_0 first commit Nov 15, 2016
blogs.h5_1 first commit Nov 15, 2016
blogs.h5_2 first commit Nov 15, 2016
blogs_vocab.pickle first commit Nov 15, 2016

Modified Blog Authorship Dataset

This data is sourced from the "Blog Authorship Corpus", available here. The original dataset was tokenized and split into sentences using spaCy. Sentences with fewer than 5 tokens or more than 30 tokens were discarded. Number-like tokens were replaced by "<#>". Tokens outside the 9999 most common were replaced by a single out-of-vocabulary token, giving a vocabulary of 10000 words. Each sentence was tagged with the author's gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s), and the results were placed into a pandas DataFrame.
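
The preprocessing described above could look roughly like the following sketch. It is not the script used to build this dataset. It assumes the sentences are already tokenized (the original used spaCy), uses a simplified regular expression as a stand-in for a number-like check, assumes the vocabulary is counted after number replacement, and uses `<unk>` as a hypothetical name for the out-of-vocabulary token. The function and constant names are illustrative only.

```python
import re
from collections import Counter

NUM_TOKEN = "<#>"      # number placeholder named in the README
UNK_TOKEN = "<unk>"    # hypothetical name for the out-of-vocabulary token
MIN_LEN, MAX_LEN = 5, 30
VOCAB_SIZE = 10000

# Simplified stand-in for a number-like test (an assumption, not spaCy's rule).
_NUM_RE = re.compile(r"^[+-]?\d+([.,]\d+)*$")

# Label encodings as stated in the README.
GENDER = {"male": 0, "female": 1}
AGE_BRACKET = {"teens": 0, "20s": 1, "30s": 2}


def preprocess(sentences, vocab_size=VOCAB_SIZE):
    """Filter and normalize tokenized sentences.

    sentences: list of token lists.
    Returns (processed_sentences, vocabulary_set).
    """
    # Keep only sentences with 5 to 30 tokens.
    kept = [s for s in sentences if MIN_LEN <= len(s) <= MAX_LEN]
    # Replace number-like tokens with the number placeholder.
    kept = [[NUM_TOKEN if _NUM_RE.match(t) else t for t in s] for s in kept]
    # Keep the (vocab_size - 1) most common tokens; everything else maps to UNK.
    counts = Counter(t for s in kept for t in s)
    common = {t for t, _ in counts.most_common(vocab_size - 1)}
    processed = [[t if t in common else UNK_TOKEN for t in s] for s in kept]
    return processed, common | {UNK_TOKEN}


def tag(tokens, gender, age_bracket):
    """Attach the integer gender and age-bracket labels to one sentence."""
    return {
        "tokens": tokens,
        "gender": GENDER[gender],
        "age": AGE_BRACKET[age_bracket],
    }
```

A list of records produced by `tag` can be passed to `pandas.DataFrame` to obtain a table with one row per sentence.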