Topic Classification #5

TyJK · 2017-05-09T07:18:38Z

Creating an Initial Topic Identification Model

We have trained vector models with both Word2Vec and Doc2Vec. The next step is to use these vectors as features for a classifier or topic model that correctly identifies when a topic from a predefined list is being discussed in a comment. We are considering several data options, including custom, admittedly imperfect datasets that use subreddit names as labels (generalized into broader topics), or a classic dataset such as 20 Newsgroups as a proof of concept.
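To make the subreddit-as-label idea concrete, here is a minimal sketch of how distant labels could be assigned. The mapping `SUBREDDIT_TO_TOPIC`, the topic names, and the helper `distant_label` are all hypothetical and only illustrate the approach; they are not taken from the project's code.

```python
# Hypothetical mapping from subreddit name to a broader topic label.
# The real topic list and groupings would have to be chosen by hand.
SUBREDDIT_TO_TOPIC = {
    "nba": "sports",
    "soccer": "sports",
    "python": "programming",
    "learnprogramming": "programming",
    "politics": "politics",
}


def distant_label(comments):
    """Turn (subreddit, text) pairs into (text, topic) training pairs.

    Comments from subreddits missing from the mapping are skipped
    rather than guessed, since the labels are already noisy.
    """
    labelled = []
    for subreddit, text in comments:
        topic = SUBREDDIT_TO_TOPIC.get(subreddit.lower())
        if topic is not None:
            labelled.append((text, topic))
    return labelled


sample = [
    ("NBA", "That buzzer beater was unreal."),
    ("Python", "Use a list comprehension here."),
    ("aww", "Look at this puppy."),
]
labelled_sample = distant_label(sample)
```

The labels this produces are only as good as the assumption that a comment is about its subreddit's topic, which is why the dataset is described as imperfect.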

We will be using the gensim library to create the model and hope to have it completed by the end of the week.

Any expertise or advice on topic modeling would be appreciated.

TyJK · 2017-05-16T02:09:01Z

Our Doc2Vec model is set up with a training suite that lets us compare distantly labelled comments and return a list of related subreddits. While this is not a true classifier by any stretch, we feel it is the best we can do without a labelled dataset, so this issue is currently pending data collection. We will attempt to cluster the 'documents' (subreddits) into more cohesive, unsupervised categories to see whether that improves results, but supervised learning will most likely be the solution.
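For readers unfamiliar with this kind of lookup, the sketch below shows one way a "related subreddits" list could be computed: rank each subreddit's vector by cosine similarity to a comment's vector. It is plain Python with toy three-dimensional vectors; in practice the vectors would come from the trained Doc2Vec model. The function names and data here are assumptions for illustration, not the project's actual training suite.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_subreddits(comment_vec, subreddit_vecs, topn=3):
    """Return the topn (subreddit, similarity) pairs, most similar first."""
    scored = [(name, cosine(comment_vec, vec)) for name, vec in subreddit_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors standing in for learned subreddit embeddings (hypothetical values).
toy_vectors = {
    "nba": [0.9, 0.1, 0.0],
    "soccer": [0.8, 0.2, 0.1],
    "python": [0.0, 0.1, 0.9],
}
ranking = related_subreddits([1.0, 0.0, 0.0], toy_vectors, topn=2)
```

A ranking like this returns neighbours rather than a decision, which is why it falls short of a true classifier: there is no threshold or held-out labelled data to say when the top result is actually correct.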

TyJK added addition nlp python expertise wanted pending on prerequisite labels May 9, 2017

TyJK closed this as completed Mar 16, 2019
