multiclass_tagger

This algorithm ingests a set of documents and splits it into training and test sets. Using associated metadata (currently a .csv file), it learns to apply multiple tags to a body of text. It can also extract relevant entities from the text, including people, places, and organizations.
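To make the metadata and split step concrete, here is a minimal sketch. The CSV layout (a `filename` column and a semicolon-separated `tags` column), the sample rows, and the `load_metadata` helper are all assumptions for illustration; the real metadata file may look different. It is written against a recent scikit-learn, so import paths may differ in older versions.

```python
import csv
import io

from sklearn.model_selection import train_test_split

# Hypothetical metadata layout: one row per document, tags separated by ';'.
SAMPLE_CSV = u"""filename,tags
doc1.txt,trade;asia
doc2.txt,health
doc3.txt,trade;energy
doc4.txt,energy;asia
"""


def load_metadata(handle):
    """Return parallel lists of filenames and tag lists from a CSV handle."""
    filenames, tags = [], []
    for row in csv.DictReader(handle):
        filenames.append(row["filename"])
        # Split the tag cell and drop empty entries.
        tags.append([t.strip() for t in row["tags"].split(";") if t.strip()])
    return filenames, tags


filenames, tags = load_metadata(io.StringIO(SAMPLE_CSV))

# Hold out a quarter of the documents for testing; the ratio is an assumption.
train_files, test_files, train_tags, test_tags = train_test_split(
    filenames, tags, test_size=0.25, random_state=0
)
```

Splitting filenames and tag lists in a single call keeps each document paired with its own tags on both sides of the split.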

This requires Python 2.7, scikit-learn, NLTK, and NumPy. The tagging algorithm is a linear support vector classifier trained on TF-IDF features of n-grams (3), set up as a one-vs-rest classifier.
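The classifier described above can be sketched roughly as follows. This is not the code from documentagger.py: the toy texts and tags are made up, and reading "n-grams (3)" as `ngram_range=(1, 3)` (unigrams through trigrams) is an assumption. It is written against a recent scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
train_texts = [
    "trade agreement signed with asian partners on tariffs",
    "new vaccine program improves public health outcomes",
    "tariffs and trade deficits dominate the summit",
    "regional health ministers meet in asia",
]
train_tags = [
    ["trade", "asia"],
    ["health"],
    ["trade"],
    ["health", "asia"],
]

# Turn the tag lists into a binary indicator matrix, one column per tag.
binarizer = MultiLabelBinarizer()
y_train = binarizer.fit_transform(train_tags)

# TF-IDF over word n-grams feeding one linear SVM per tag (one-vs-rest).
tagger = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # assumed reading of "n-grams (3)"
    OneVsRestClassifier(LinearSVC()),
)
tagger.fit(train_texts, y_train)

# Predict an indicator row for a new document and map it back to tag names.
predicted = tagger.predict(["ministers discuss trade tariffs"])
predicted_tags = binarizer.inverse_transform(predicted)
```

Because each tag gets its own binary classifier, a document can receive several tags or none at all, which is what makes the setup multi-label.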

How this works

First, e-mail me and I'll send you the small set of documents as a .zip so you can run this. The algorithm doesn't perform well on the test set for a variety of reasons, including poor human tagging and over-tagging. However, I plan to test it on a much larger corpus to see whether it gets better results. The entity extraction works OK, though. documentagger.py and documentagger.ipynb are the same, except that the .ipynb file runs in IPython Notebook.

USStateDept/multiclass_tagger
