Parts-of-speech tagging for Twitter via SQL

gp-ark-tweet-nlp is a PL/Java Wrapper for Ark-Tweet-NLP - a state-of-the-art parts-of-speech tagger for Twitter. This package enables you to perform part-of-speech tagging on Tweets, using SQL. If your environment is an MPP system like Pivotal's Greenplum Database you can piggyback on the MPP architecture and achieve implicit parallelism in your part-of-speech tagging tasks.

The GitHub page can be found at: http://vatsan.github.io/gp-ark-tweet-nlp/

Installation

To be able to use the PL/Java wrapper you'll need to ensure some pre-requisites are satisfied in your environment.

1. Ensure PL/Java is installed and it works

--Install PL/Java
CREATE LANGUAGE pljava;
--Install PL/Javau
CREATE LANGUAGE pljavau;

-- Define function mapping
DROP FUNCTION IF EXISTS getsysprop(varchar);
CREATE FUNCTION getsysprop(varchar) 
     RETURNS varchar
AS
'java.lang.System.getProperty'
LANGUAGE pljava;
				
--Invoke function
SELECT getsysprop('user.home');

2. Ensure `JAVA_HOME` is set on all hosts

As root run the following

gpssh into all hosts

[gpadmin@mdw ~]$ gpssh -f <hostfile> ;

In the gpssh prompt, run the following:

[gpadmin@mdw ~]$ echo "export JAVA_HOME=/usr/java/jdk1.6.0_24" >> /home/gpadmin/.bashrc

###3. Prepare the Greenplum/Postgres Environment to talk to our PL/Java wrapper

Copy the Libraries: The packaged jar file gp-ark-tweet-nlp.jar that located in the folder build/

[gpadmin@mdw ~]$ gpscp -f <hostfile> gp-ark-tweet-nlp.jar =:/usr/local/greenplum-db/lib/postgresql/java/
[gpadmin@mdw ~]$ gpscp -f <hostfile> ark-tweet-nlp-0.3.2.jar =:/usr/local/greenplum-db/lib/postgresql/java/

Update the CLASSPATH for PL/Java so that it can find our libraries

[gpadmin@mdw ~]$ gpconfig -c pljava_classpath -v \'gp-ark-tweet-nlp.jar:ark-tweet-nlp-0.3.2.jar:examples.jar\'

Updating the pljava_classpath environment variable requires you to also instruct the Greenplum database to reload

[gpadmin@mdw ~]$ gpstop -u

You will have to reconnect to your database for this new parameter to take effect in your session.

To increase memory available to PL/Java run the following commands on the Master Segment

[gpadmin@mdw ~]$ gpconfig -c pljava_vmoptions -v \'-Xmx512M\' 
[gpadmin@mdw ~]$ gpstop -r

Ensure the options have taken effect

[gpadmin@mdw ~]$ gpconfig --show pljava_vmoptions
Values on all segments are consistent
GUC          : pljava_vmoptions
Master  value: -Xmx512M
Segment value: -Xmx512M
[gpadmin@mdw ~]$

You are now set to run the part-of-speech tagger in SQL by defining a User Defined Function to invoke the PL/Java wrapper and a User Defined Type to define the type of the returned result.

Usage

We'll first declare a UDT to define the type of the result return by our part-of-speech tagger

-- Define a type to hold [tweet_id, token_index, token, tag] items
DROP TYPE IF EXISTS token_tag;
CREATE TYPE token_tag
AS
(
	indx int, 
	token text,
	tag text
);

We'll then declare a UDF to invoke the PL/Java wrapper

DROP FUNCTION IF EXISTS posdemo.tag_pos(varchar);
CREATE FUNCTION posdemo.tag_pos(varchar)
	RETURNS SETOF token_tag
AS 
	'postagger.nlp.POSTagger.tagTweet'
IMMUTABLE LANGUAGE PLJAVAU;

Note the use of pljavau instead of just java in the UDF. This is because we are using the untrusted version of PL/Java as the part-of-speech tagger has to read a model file from within gp-ark-tweet-nlp.jar Without the untrusted language clause, we will encounter a java.lang.SecurityException when the code tries to read the model file embedded in gp-ark-tweet-nlp.jar. The reason Greenplum/PostGreSQL classifies languages such as PL/Python, PL/R and PL/Java(u) as untrusted is because a malicious user can run arbitrary system commands which could manipulate with the file system a DBMS is running on. Therefore, creation of UDFs in untrusted languages requires super-user privileges. However, executing these UDFs don't require such a restriction. Since a superuser can review a UDF and inspect it for any malicious code before executing it, it is not a security loophole.

Finally we can invoke the parts-of-speech tagger on a table containing a tweet column like so:

vatsandb=# \d+ posdemo.training_data
                        Table "sentidemo.training_data"
   Column   |            Type             | Modifiers | Storage  | Description 
------------+-----------------------------+-----------+----------+-------------
 rating     | integer                     |           | plain    | 
 id         | bigint                      |           | plain    | 
 ts         | timestamp without time zone |           | plain    | 
 query      | text                        |           | extended | 
 poster     | text                        |           | extended | 
 tweet_body | text                        |           | extended | 
Has OIDs: no
Distributed by: (id)

Here are some sample rows from the table

vatsandb=# select id, tweet_body from sentidemo.training_data limit 10;

     id     |                                                                tweet_body                                                                 
------------+-------------------------------------------------------------------------------------------------------------------------------------------
 1467820906 | @localtweeps Wow, tons of replies from you, may have to unfollow so I can see my friends' tweets, you're scrolling the feed a lot. 
 1467862806 | @MySteezRadio I'm goin' to follow u, since u didn't  LOL  GO ANGELS!
 1467891880 | Argh! I was suuuper sleepy an hour ago, now I'm wide awake.  Hope I don't stay up all night. :-/
 1467896211 | michigan state you make me sad 
 1467911846 | @bananaface IM SORRY I GOT YOU SICK.  lol. going to bed too. NIGHT!
 1467962634 | Im in the mood for some chocolate. I want..... Miniature Reeses cups. Now 
 1467979881 | @redvinylgirl my mom has it. I wish you the best of luck 
 1467987712 | Change of plans, we ordered Macs instead. Time to hit the books! 
 1467996032 | Funny how the little things make me homesick, criminals breakn n2 a brownstone on LawNOrd CI made me misty 
 1468000208 | Man... taxes suck.  I'm horrified that i did something wrong on them.  TurboTax decided to keep around a lot of the stuff I turned off.  
(10 rows)

Time: 329.507 ms

Now we can perform parts-of-speech tagging of the tweets

select id, 
       (t).indx, 
       (t).token, 
       (t).tag
from
(
    select id, 
           posdemo.tag_pos(tweet_body) as t
    from posdemo.training_data
) q

Which returns (limitng to 10 rows for sampel output)

     id     | indx |  token   | tag 
------------+------+----------+-----
 1467810672 |    0 | is       | V
 1467810672 |    1 | upset    | A
 1467810672 |    2 | that     | P
 1467810672 |    3 | he       | O
 1467810672 |    4 | can't    | V
 1467810672 |    5 | update   | V
 1467810672 |    6 | his      | D
 1467810672 |    7 | Facebook | ^
 1467810672 |    8 | by       | P
 1467810672 |    9 | texting  | V
(10 rows)

Questions?

You can reach me at vatsan.cs@utexas.edu

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.settings		.settings
build		build
lib		lib
sql		sql
src/postagger		src/postagger
.classpath		.classpath
.gitignore		.gitignore
.project		.project
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parts-of-speech tagging for Twitter via SQL

Installation

1. Ensure PL/Java is installed and it works

2. Ensure `JAVA_HOME` is set on all hosts

Usage

Questions?

About

Releases

Packages

Languages

License

vatsan/gp-ark-tweet-nlp

Folders and files

Latest commit

History

Repository files navigation

Parts-of-speech tagging for Twitter via SQL

Installation

1. Ensure PL/Java is installed and it works

2. Ensure JAVA_HOME is set on all hosts

Usage

Questions?

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

2. Ensure `JAVA_HOME` is set on all hosts

Packages