An NLP (Natural Language Processing) application: a simple GUI app, written in Python, for tokenizing and tagging text.
- Install Python (if you have not already done so, click here for instructions).
- Install Tkinter, which provides the GUI for Python. The link in the first step also contains instructions for this installation.
- Install NLTK (Natural Language Toolkit) and the Stanford NER (Named Entity Recognition) Tagger; click on this link for instructions.
Tokenization is the process of breaking a text down into tokens, either words or sentences, depending on the option you select. The app can also display the number of tokens in the text.
The image above shows the text tokenized into words, along with the token count.
The image above shows the text tokenized into sentences.
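The word and sentence options described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function name `tokenize` and its `mode` parameter are hypothetical. It assumes NLTK's `word_tokenize` and `sent_tokenize` (which need the `punkt` data package) and, if NLTK or its data is unavailable, falls back to simple regular expressions that are only a rough stand-in for NLTK's tokenizers.

```python
import re


def tokenize(text, mode="word"):
    """Split text into word or sentence tokens and return (tokens, count).

    `tokenize` and `mode` are hypothetical names used for illustration.
    """
    try:
        # NLTK's tokenizers raise LookupError if the 'punkt' data is missing.
        from nltk.tokenize import sent_tokenize, word_tokenize
        tokens = word_tokenize(text) if mode == "word" else sent_tokenize(text)
    except (ImportError, LookupError):
        # Simplified regex stand-in, far less robust than NLTK.
        if mode == "word":
            # Runs of word characters, or single punctuation marks.
            tokens = re.findall(r"\w+|[^\w\s]", text)
        else:
            # Split after sentence-ending punctuation followed by whitespace.
            tokens = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return tokens, len(tokens)


words, word_count = tokenize("Hello world. How are you?", mode="word")
sentences, sentence_count = tokenize("Hello world. How are you?", mode="sentence")
print(word_count, words)
print(sentence_count, sentences)
```

Returning the count together with the tokens mirrors the token count that the app displays next to the tokenized text.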
Tagging first tokenizes the text into words behind the scenes, and then each token is tagged with a specific tag by one of the following:
- Stanford NER - Stanford's Named Entity Recognition (named entities can be thought of as "brands") recognizes and tags each token as 'O' (i.e. not a named entity), 'PERSON' (name of a person), 'LOCATION' (name of a location), 'ORGANIZATION' (name of an organization), etc. It is not perfect: it can miss a named entity or produce a false positive (tagging a token that is not actually a named entity), but most of the time it gets it right.
- NLTK POS - Part-of-Speech tagging is part of the Natural Language Toolkit. While Stanford's NER algorithm specializes in named entities among nouns, NLTK's POS tagger assigns a part-of-speech tag to every tokenized word (see here for the complete list of tags).
The screenshot above shows tagging with Stanford's NER algorithm.
The screenshot above shows tagging with NLTK's POS algorithm.
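Both taggers produce a list of (token, tag) pairs, which is the shape returned by NLTK's `pos_tag` and by the `tag` method of its `StanfordNERTagger` wrapper. As a small sketch of how NER output in that shape might be read, the hypothetical helper `group_entities` below merges consecutive tokens that share a non-'O' tag into a single named entity. The sample list is hand-written for illustration and is not real tagger output.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' tag into one entity.

    `group_entities` is a hypothetical helper, not part of NLTK or the app.
    """
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":  # 'O' marks tokens that are not named entities
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


# Hand-written sample in the (token, tag) shape; not real tagger output.
sample = [
    ("Barack", "PERSON"), ("Obama", "PERSON"), ("visited", "O"),
    ("Stanford", "ORGANIZATION"), ("University", "ORGANIZATION"),
    ("in", "O"), ("California", "LOCATION"), (".", "O"),
]

print(group_entities(sample))
```

One limitation of this simple grouping is that two different entities of the same type that appear directly next to each other would be merged into one.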