This is the GitHub repository for data analysis of OSF.
To run the scripts here, you will need the following packages installed. conda is the recommended package manager.
Packages (updating):
- Python: 3.6
- jupyter notebook: latest
- matplotlib: latest
- tensorflow: 1.*
An environment.yml is also provided in the root directory. To quickly set up the environment, run:
conda env create -f environment.yml
source activate osf
gender_classifier calculates the gender distribution of active OSF users.
I have built a CNN for text classification based on Kim's Convolutional Neural Networks for Sentence Classification and dennybritz's work.
Major changes include allowing multiclass classification, adopting the static word embeddings discussed in Kim's paper, and allowing an "unknown" choice for classification given noisy data.
The training data includes more than 60,000 paper abstracts in 10 categories, based on the taxonomy and data crawled from Digital Commons Network. The pre-trained word embeddings are GoogleNews-vectors-negative300 from word2vec.
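One common way to allow an "unknown" choice is to reject a prediction when the classifier's top softmax probability is too low. The sketch below illustrates that idea only. The confidence threshold, the `UNKNOWN` label, the function names, and the three example category names are all assumptions made for illustration; the classifier in this repository may handle noisy data differently.

```python
import math

UNKNOWN = "unknown"  # hypothetical label for inputs with no confident category


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_unknown(logits, labels, threshold=0.5):
    """Return the most probable label, or UNKNOWN if its probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else UNKNOWN


# Hypothetical category names and scores, for illustration only.
labels = ["biology", "law", "physics"]
confident = predict_with_unknown([4.0, 0.5, 0.1], labels)
uncertain = predict_with_unknown([1.0, 1.1, 0.9], labels)
```

Here the first input has one clearly dominant score and keeps its label, while the second has nearly equal scores and falls back to `UNKNOWN`.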
crawler is available here. It uses lxml to parse the page source and can resume crawling after being halted without losing any data.
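Resuming after a halt can be done by checkpointing progress to disk after every page. The sketch below shows one possible approach using only the standard library. The JSON state file, the function names, and the `fetch` callback (which stands in for downloading and parsing a page) are assumptions for illustration, not the crawler's actual code.

```python
import json
import os


def load_state(path):
    """Load a saved checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"done": [], "records": []}


def save_state(state, path):
    """Write to a temp file and rename it, so a crash mid-write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def crawl(urls, fetch, path):
    """Fetch every URL not already checkpointed, saving state after each page."""
    state = load_state(path)
    done = set(state["done"])
    for url in urls:
        if url in done:
            continue  # already crawled in a previous run
        state["records"].append(fetch(url))
        state["done"].append(url)
        done.add(url)
        save_state(state, path)
    return state["records"]
```

Calling `crawl` again with the same state file skips every URL already recorded, so an interrupted run picks up at the first page that was not saved.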
consistency_check measures how consistent a given contributor is across the categories of the projects they contribute to.
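One simple way to quantify this kind of consistency is the share of a contributor's projects that fall in their most frequent category. The sketch below assumes that definition. The actual metric used by consistency_check is not stated here, and the function name and category names are hypothetical.

```python
from collections import Counter


def consistency(categories):
    """Share of projects in the contributor's most frequent category (0.0 if no projects)."""
    if not categories:
        return 0.0
    counts = Counter(categories)
    return counts.most_common(1)[0][1] / len(categories)


# Hypothetical contributor with four projects, three of them in one category.
score = consistency(["biology", "biology", "physics", "biology"])
```

Under this definition, a score of 1.0 means every project shares one category, and lower scores mean the contributor's projects are spread across more categories.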