# Twitter POS Tagging 
The goal of this tutorial is to introduce a Part-of-Speech (POS) tagger developed for tweets, which was released as part of the [TweetNLP](https://www.ark.cs.cmu.edu/TweetNLP/) toolkit. The tagger is written in Java, and the Python wrapper for tokenization comes from [this](https://github.com/myleott/ark-twokenize-py) GitHub repository. This tutorial uses code from the [TweetNLP](https://github.com/brendano/ark-tweet-nlp/) GitHub repository as well as the Python wrapper from [this](https://github.com/ianozsvald/ark-tweet-nlp-python) repository.

## POS tagging
- POS tagging involves identifying the part of speech of each token in a given text. It can be viewed as labeling a sentence w_1, w_2, ..., w_n with POS tags, one for each word: t_1, t_2, ..., t_n.
- The 8 common parts of speech in English are:
  1. Noun
  2. Verb
  3. Pronoun
  4. Preposition
  5. Adverb
  6. Conjunction
  7. Participle
  8. Article  
- Twitter data differs from standard language data in that it contains tokens such as hashtags (#), at-mentions (@), emoticons, and URLs. The tagset for Twitter therefore needs tags for these new token types. The tags used to annotate tweets are as follows:

<img src="pos_tags.png">
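
For readers who cannot view the figure, the tagset can also be written down as a lookup table. The mapping below is a sketch transcribed from the coarse tagset described in the paper; the names `TWITTER_TAGSET` and `describe` are hypothetical helpers introduced here, and the figure above remains the authoritative reference if the two disagree.

```python
# Coarse Twitter POS tags, transcribed from the tagset in the paper.
# Assumption: descriptions are paraphrased; verify against the figure above.
TWITTER_TAGSET = {
    "N": "common noun",
    "O": "pronoun (personal/WH; not possessive)",
    "^": "proper noun",
    "S": "nominal + possessive",
    "Z": "proper noun + possessive",
    "L": "nominal + verbal",
    "M": "proper noun + verbal",
    "V": "verb (including copula and auxiliaries)",
    "A": "adjective",
    "R": "adverb",
    "!": "interjection",
    "D": "determiner",
    "P": "pre- or postposition, or subordinating conjunction",
    "&": "coordinating conjunction",
    "T": "verb particle",
    "X": "existential 'there', predeterminers",
    "Y": "X + verbal",
    "#": "hashtag",
    "@": "at-mention",
    "~": "discourse marker",
    "U": "URL or email address",
    "E": "emoticon",
    "$": "numeral",
    ",": "punctuation",
    "G": "other abbreviations, foreign words, symbols, garbage",
}


def describe(tagged_tokens):
    """Replace each tag symbol in (token, tag) pairs with its description."""
    return [(tok, TWITTER_TAGSET.get(tag, "unknown tag")) for tok, tag in tagged_tokens]


example = [("@thecamion", "@"), ("I", "O"), ("like", "V"), ("monkeys", "N")]
for token, description in describe(example):
    print(token, "->", description)
```

A lookup like this is handy later in the notebook for turning the single-character tags in the tagger output into readable labels.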

## Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
- This tutorial covers how to accomplish POS tagging for Twitter data based on this paper: https://aclanthology.org/P11-2008.pdf
- The nature of Twitter data makes standard POS taggers hard to apply. The paper develops the tagset above, which adds tags for tokens that are rarely encountered outside of Twitter.
- Around 1,800 tweets were manually annotated with POS tags.
- A conditional random field (CRF) was used with features specific to Twitter POS tagging. The features for the CRF are listed below (see the paper for more details):
  - Twitter orthography - these features are rules that detect @, #, and URLs.
  - Names - these features check whether a token appears in a compiled dictionary of frequently capitalized tokens.
  - Traditional Tag Dictionary - these are features for the tags that each word occurs with in the Penn Treebank (PTB).
  - Distributional Similarity - these features are constructed from the successor and predecessor probabilities for the 10,000 most common terms.
  - Phonetic normalization - words are normalized with the Metaphone algorithm so that alternate spellings collapse to one form; e.g., {thangs, thanks, thanksss, thanx, thinks, thnx} are all mapped to 0NKS.
- The 1,827 annotated tweets are divided into a training set of 1,000 tweets, a development set of 327 tweets, and a test set of 500 tweets. Compared with the standard Stanford tagger, the tagger using the above feature set reduces error on Twitter data by about 25%.
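
To make the "Twitter orthography" features more concrete, here is a minimal sketch of rule-based detectors for at-mentions, hashtags, and URLs. The regular expressions and the function name `orthography_features` are illustrative assumptions, not the rules used in the paper or in the released tagger.

```python
import re

# Illustrative patterns only; the paper's actual rules may differ.
AT_MENTION = re.compile(r"^@\w+$")
HASHTAG = re.compile(r"^#\w+$")
URL = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def orthography_features(token):
    """Return binary orthographic features for a single token."""
    return {
        "is_at_mention": bool(AT_MENTION.match(token)),
        "is_hashtag": bool(HASHTAG.match(token)),
        "is_url": bool(URL.match(token)),
    }


for tok in ["@eye_e", "#COSTCO", "http://tl.gd/6meogh", "monkeys"]:
    print(tok, orthography_features(tok))
```

In a CRF, binary indicators like these would be combined with the other feature groups listed above, so the model can learn, for example, that a token matching the URL rule almost always receives the URL tag.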

## Instructions 
To run the tutorial you will need to download the POS tagger from https://code.google.com/archive/p/ark-tweet-nlp/downloads

The tagger requires Java 6: https://www.oracle.com/java/technologies/java-platform.html

Place this IPython notebook, which contains the Python wrappers, inside the ark-tweet-nlp-0.3.2 folder.
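
Before running the cells, it can help to confirm that the setup described above is in place. The following is a small sketch that assumes Python 3 (for `shutil.which`) and assumes the jar is named `ark-tweet-nlp-0.3.2.jar` and sits in the current working directory; `check_environment` is a hypothetical helper, not part of the toolkit.

```python
import os
import shutil

# Assumption: the jar keeps its default name and sits next to the notebook.
JAR_NAME = "ark-tweet-nlp-0.3.2.jar"


def check_environment(jar_path=JAR_NAME):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    if shutil.which("java") is None:
        problems.append("java was not found on PATH")
    if not os.path.isfile(jar_path):
        problems.append("%s was not found in %s" % (jar_path, os.getcwd()))
    return problems


problems = check_environment()
print("Environment OK" if not problems else "\n".join(problems))
```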

The first step is to import the packages required for the notebook.

In [40]:
from __future__ import unicode_literals

import operator
import re
import sys
import os

import subprocess
import shlex

try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser
  

try:
    import html
except ImportError:
    pass  

## Python Wrapper for POS Tagger
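- The functions below run the tagger to get POS tag predictions for the tokenized tweets.
- The tagger jar is invoked directly through `java` (the command in `RUN_TAGGER_CMD`) rather than through the runTagger.sh script, so the wrapper also works on Windows.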

In [41]:
RUN_TAGGER_CMD = "java -XX:ParallelGCThreads=2 -Xmx500m -jar ark-tweet-nlp-0.3.2.jar"

def _split_results(rows):
    """Parse the tab-delimited returned lines, modified from: https://github.com/brendano/ark-tweet-nlp/blob/master/scripts/show.py"""
    for line in rows:
        line = line.strip()  # remove '\n'
        if len(line) > 0:
            if line.count('\t') == 2:
                parts = line.split('\t')
                tokens = parts[0]
                tags = parts[1]
                confidence = float(parts[2])
                yield tokens, tags, confidence
                
                
def _call_runtagger(tweets, run_tagger_cmd=RUN_TAGGER_CMD):
    """Call the tagger, passing the tweets via stdin (one tweet per line)"""

    # remove newlines inside each tweet, as they are the tweet separators
    # for the stdin interface
    tweets_cleaned = [tw.replace('\n', ' ') for tw in tweets]
    message = "\n".join(tweets_cleaned)

    # force UTF-8 encoding (from internal unicode type) to avoid .communicate encoding error as per:
    # http://stackoverflow.com/questions/3040101/python-encoding-for-pipe-communicate
    message = message.encode('utf-8')

    # build a list of args
    args = shlex.split(run_tagger_cmd)
    args.append('--output-format')
    args.append('conll')
    po = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # old call - made a direct call to runTagger.sh (not Windows friendly)
    #po = subprocess.Popen([run_tagger_cmd, '--output-format', 'conll'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result = po.communicate(message)
    # expect a tuple of 2 items like:
    # ('hello\t!\t0.9858\nthere\tR\t0.4168\n\n',
    # 'Listening on stdin for input.  (-h for help)\nDetected text input format\nTokenized and tagged 1 tweets (2 tokens) in 7.5 seconds: 0.1 tweets/sec, 0.3 tokens/sec\n')

    pos_result = result[0].decode('utf-8').strip('\n')  # take stdout, remove trailing newlines
    pos_result = pos_result.split('\n\n')  # tweets are separated by a blank line
    pos_results = [pr.split('\n') for pr in pos_result]  # one line per token within a tweet
    return pos_results


def runtagger_parse(tweets, run_tagger_cmd=RUN_TAGGER_CMD):
    """Call runTagger.sh on a list of tweets, parse the result, return lists of tuples of (term, type, confidence)"""
    pos_raw_results = _call_runtagger(tweets, run_tagger_cmd)
    pos_result = []
    for pos_raw_result in pos_raw_results:
        pos_result.append([x for x in _split_results(pos_raw_result)])
    return pos_result


def check_script_is_present(run_tagger_cmd=RUN_TAGGER_CMD):
    """Simple test to make sure we can see the script"""
    success = False
    try:
        args = shlex.split(run_tagger_cmd)
        args.append("--help")
        po = subprocess.Popen(args, stdout=subprocess.PIPE)
        # old call - made a direct call to runTagger.sh (not Windows friendly)
        #po = subprocess.Popen([run_tagger_cmd, '--help'], stdout=subprocess.PIPE)
        # read all of stdout and wait for the process to exit
        lines = po.communicate()[0].decode('utf-8').splitlines()
        # we expect the first line of --help to look like the following:
        assert "RunTagger [options]" in lines[0]
        success = True
    except OSError as err:
        print("Caught an OSError, have you specified the correct tagger command? We are using \"%s\". Exception: %r" % (run_tagger_cmd, err))
    return success
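

To see what the parsing steps above do without starting Java, the sketch below applies the same splitting logic to the sample stdout string quoted in the comments of `_call_runtagger`. The function `parse_conll_output` is a hypothetical stand-alone helper written for this illustration; it mirrors, but is not, the wrapper's own code.

```python
def parse_conll_output(raw):
    """Split CoNLL-style tagger output into one list of tuples per tweet."""
    tweets = []
    # A blank line separates consecutive tweets in the output.
    for block in raw.strip("\n").split("\n\n"):
        rows = []
        for line in block.split("\n"):
            parts = line.strip().split("\t")
            if len(parts) == 3:  # token, tag, confidence
                token, tag, confidence = parts
                rows.append((token, tag, float(confidence)))
        tweets.append(rows)
    return tweets


sample = "hello\t!\t0.9858\nthere\tR\t0.4168\n\n"
print(parse_conll_output(sample))
# prints [[('hello', '!', 0.9858), ('there', 'R', 0.4168)]]
```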


## Read tokenized tweets
We will now load tweets that have already been tokenized for POS tagging.

In [42]:
# use a context manager so the file is closed after reading
with open("tweets_tokenized.txt", "r") as tweet_file:
    tweets_tokenized = tweet_file.readlines()
print(tweets_tokenized)

["I won't win a single game I bet on !! Got Mr. Cliff Lee , if he loses its on me U.S.A ! $ 5 0.0 .\n", '@thecamion I like monkeys , but I still hate COSTCO parking lots .. oO o.O #COSTCO 2:15 PM\n', 'RT @eye_e : this poster-print costs $ 12 . 40 , which is 40% of the normal price ! http://tl.gd/6meogh\n', 'LMBO ! This man filed an EMERGENCY Motion for Continuance on account of the Rangers game tonight !\n', 'Texas Rangers are in the World Series ! Go Rangers !!!!!!!!! : > <3 ♥❤♡ http://fb.me/D2LsXBJx\n']


## Apply POS tagger
For each tweet, the POS tagger returns a list of tuples, each containing the token, the predicted tag, and the confidence.
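
Output of this shape is easy to post-process. As a sketch, the snippet below counts tags and lists low-confidence tokens; the sample tuples are copied from tagger output for the second tweet, while the helper names and the 0.8 threshold are arbitrary assumptions made for illustration.

```python
from collections import Counter

# A few (token, tag, confidence) tuples taken from the tagger output.
tagged = [[('@thecamion', '@', 0.9995), ('I', 'O', 0.9953), ('like', 'V', 0.9027),
           ('monkeys', 'N', 0.9408), ('parking', 'N', 0.6863), ('#COSTCO', '^', 0.5895)]]


def tag_counts(tagged_tweets):
    """Count how often each tag occurs across all tweets."""
    return Counter(tag for tweet in tagged_tweets for _, tag, _ in tweet)


def low_confidence(tagged_tweets, threshold=0.8):
    """Return the tuples whose confidence falls below the threshold."""
    return [(tok, tag, conf)
            for tweet in tagged_tweets
            for tok, tag, conf in tweet
            if conf < threshold]


print(tag_counts(tagged))
print(low_confidence(tagged))
```

Tokens flagged this way, such as a hashtag that doubles as a proper noun, are good candidates for manual inspection.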

In [43]:
# print(RUN_TAGGER_CMD)
# success = check_script_is_present()
# if success:

# tag one tweet at a time; the context manager closes the file when done
with open('tweets_tokenized.txt') as inp_file:
    for t in inp_file:
        print(runtagger_parse([t]))
        input("press enter ...")

[[('I', 'O', 0.9942), ("won't", 'V', 0.9982), ('win', 'V', 0.9993), ('a', 'D', 0.9987), ('single', 'A', 0.9758), ('game', 'N', 0.9988), ('I', 'O', 0.9922), ('bet', 'V', 0.9995), ('on', 'P', 0.9162), ('!!', ',', 0.9873), ('Got', 'V', 0.9965), ('Mr.', '^', 0.9863), ('Cliff', '^', 0.9993), ('Lee', '^', 0.9999), (',', ',', 0.9975), ('if', 'P', 0.9987), ('he', 'O', 0.9979), ('loses', 'V', 0.9996), ('its', 'L', 0.9865), ('on', 'P', 0.9871), ('me', 'O', 0.9823), ('U.S.A', '^', 0.7232), ('!', ',', 0.9442), ('$', 'G', 0.7774), ('5', '$', 0.9919), ('0.0', '$', 0.9866), ('.', ',', 0.993)]]
press enter ...
[[('@thecamion', '@', 0.9995), ('I', 'O', 0.9953), ('like', 'V', 0.9027), ('monkeys', 'N', 0.9408), (',', ',', 0.999), ('but', '&', 0.9974), ('I', 'O', 0.9988), ('still', 'R', 0.9858), ('hate', 'V', 0.9938), ('COSTCO', '^', 0.953), ('parking', 'N', 0.6863), ('lots', 'N', 0.9851), ('..', ',', 0.9939), ('oO', '!', 0.9601), ('o.O', 'E', 0.7939), ('#COSTCO', '^', 0.5895), ('2:15', '$', 0.9761), ('PM