DeepLearning research project.
Here are some things that will need to be done if we move forward with this code:
- generalize the code to work with the other datasets
- find a general way to specify which fields of a given document vector have numerical values
- figure out how to avoid loading all the numerical values into memory to sort and reverse (for the percentile calculation)
- play with the settings for gensim to determine the optimal dimensionality for each feature vector; currently I'm using the defaults.
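
One possible direction for the percentile item above is to estimate percentiles from a bounded random sample instead of sorting every value. The sketch below uses reservoir sampling. It is not what the current code does. The function names and the default sample size `k` are made up for illustration, and the result is only an approximation once the stream holds more than `k` values.

```python
import random


def reservoir_sample(values, k, seed=0):
    """Return a uniform random sample of at most k items from an iterable.

    Only k items are held in memory, however long the stream is.
    """
    rng = random.Random(seed)
    sample = []
    for i, value in enumerate(values):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(value)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = value
    return sample


def approx_percentile(values, q, k=10000, seed=0):
    """Estimate the q-th percentile (0-100) from a bounded-size sample.

    Exact (nearest-rank on the sorted values) when the stream has at most
    k items; an estimate otherwise.
    """
    sample = sorted(reservoir_sample(values, k, seed))
    if not sample:
        raise ValueError("no values to compute a percentile from")
    index = round(q / 100 * (len(sample) - 1))
    return sample[index]
```

Since `values` can be any iterable, the numerical fields could be streamed from disk with a generator rather than collected into a list first. A larger `k` trades memory for accuracy.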