Project2

Course project, Phase 2

DUE: February 18, 2013, 1pm CST

Preamble

This project phase involves using the twitter4j library to access tweets from the Twitter stream using Scala, and starting to do some interesting processing of those tweets. Specifically, you'll detect whether tweets are in English and do very simple polarity classification based on a sentiment lexicon. There are also several mini-lessons about Scala programming embedded in the exercises below.

If anything is unclear, get in touch with me right away. However, I also absolutely encourage you all to ask questions about and discuss these exercises on the class mailing list. Talk, help each other, etc., but write your own code.

Submitting your solutions

For each activity in this phase, fill in the information requested in answers_p2.txt (located in appliednlp/project/phase2 of the appliednlp repository). Submit this on Blackboard.

Your code submission will be contained in your fork of the tshrdlu repository (see step 2).

1. Follow the twitter4j tutorial

Read and work through the Bcomposes twitter4j tutorial.

Written: Show the last five tweets produced from running bcomposes.twitter.LocationStreamer for Austin, San Francisco, and New York City.

2. Fork the tshrdlu repository

The build setup, supporting code, and stub code are available in the tshrdlu repository. Fork this repository by clicking on the "Fork" button in the top right of the tshrdlu main page. For the rest of the homework you will commit and push changes to your fork.

Written: Provide the web address of your Github fork.

Note: Your fork will be public -- this is intentional as I want you all to get used to putting code out in public and to start building interesting Github profiles. This of course means that you will be able to see each other's forks and that you thus can cheat pretty easily. While I don't want you to cheat in this way, I have no interest in making an effort to stop it. Basically, you will be cheating yourself, so it's on you if you do. Much better to ask questions on the mailing list (or to me directly)!

3. Set up environment variables for OAuth

In order to make it so that your OAuth details are not contained in any code, I have set things up so that tshrdlu can access them if you declare certain environment variables. Here's what you need to add to your bash profile:

export TWITTER4J_OAUTH_CONSUMERKEY="[your consumer key here]"
export TWITTER4J_OAUTH_CONSUMERSECRET="[your consumer secret here]"
export TWITTER4J_OAUTH_ACCESSTOKEN="[your access token here]"
export TWITTER4J_OAUTH_ACCESSTOKENSECRET="[your access token secret here]"
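
Before moving on, it can help to confirm that a freshly opened shell actually exposes these four variables to a JVM process. The following is a minimal sketch of such a check. OAuthEnvCheck is a hypothetical helper that is not part of tshrdlu, and it only looks at whether the variables are set, not whether the credentials are valid.

```scala
// Hypothetical helper, not part of tshrdlu: reports which OAuth variables are missing.
object OAuthEnvCheck {
  // The four variable names from the export lines above.
  val required: Seq[String] = Seq(
    "TWITTER4J_OAUTH_CONSUMERKEY",
    "TWITTER4J_OAUTH_CONSUMERSECRET",
    "TWITTER4J_OAUTH_ACCESSTOKEN",
    "TWITTER4J_OAUTH_ACCESSTOKENSECRET"
  )

  // Returns the names that are unset or blank in the given environment.
  def missing(env: Map[String, String]): Seq[String] =
    required.filterNot(name => env.get(name).exists(_.trim.nonEmpty))

  def main(args: Array[String]): Unit = {
    val absent = missing(sys.env)
    if (absent.isEmpty) println("All four OAuth variables are set.")
    else absent.foreach(name => println("Missing: " + name))
  }
}
```

If any name is reported as missing, re-source your bash profile (or open a new terminal) and try again.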

Verify that you can successfully pull tweets in all the ways covered in the Bcomposes twitter4j tutorial, but now using tshrdlu. Rather than using SBT as in the tutorial, you will now use the tshrdlu run command (though you'll still use SBT for compiling). For example, here's how to obtain randomly sampled tweets (after you have compiled tshrdlu, of course).

$ tshrdlu run tshrdlu.twitter.StatusStreamer

Note: The tshrdlu main methods don't have sleep timers on them, so they will continue to pull tweets until you kill the process.

Written: Show the command line calls you made to do the following.

follow @wired, @theeconomist, @nytimes, and @wsj
search for scala, java and python
search the bounding boxes around Austin, San Francisco, and New York City

4. Inspect the code

Look at the tshrdlu.twitter package (meaning the files in src/main/scala/tshrdlu/twitter and the definitions contained in them). Look at how the classes, traits, and objects are related and used.

Written: Briefly describe how the authorization properties you specified above make their way from your Unix environment to the Configuration object. (Hint: look at the bin/tshrdlu script.)

Written: In two to three paragraphs, explain what you understand about how the code works, including how classes, traits, mix-ins, and objects are used.

Tip: focus on how LocationStreamer enables locations to be followed with just the following declaration.

object LocationStreamer extends FilteredStreamer with LocationFilter
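
To make the general pattern concrete, here is a toy sketch of how a one-line declaration like this can produce a working object. Every name in it (ToyFilter, ToyStreamer, and so on) is hypothetical, and the sketch does not reproduce the actual tshrdlu code. It only illustrates the idea: one trait owns the control flow and leaves a method abstract, and the mixed-in trait supplies it.

```scala
// Toy model of the mix-in pattern; none of these names exist in tshrdlu.
trait ToyFilter {
  // Each concrete filter decides how to interpret the command-line arguments.
  def describe(args: Array[String]): String
}

// Interprets the arguments as numbers, four per bounding box.
trait ToyLocationFilter extends ToyFilter {
  def describe(args: Array[String]): String =
    args.map(_.toDouble).grouped(4).map(_.mkString("box(", ", ", ")")).mkString(" ")
}

// Interprets the arguments as search terms.
trait ToyTermFilter extends ToyFilter {
  def describe(args: Array[String]): String = args.mkString("terms(", ", ", ")")
}

// Owns the shared behavior and relies on whichever filter is mixed in.
trait ToyStreamer extends ToyFilter {
  def run(args: Array[String]): String = "streaming with " + describe(args)
}

// One line each: the base behavior plus a specific filter.
object ToyLocationStreamer extends ToyStreamer with ToyLocationFilter
object ToyTermStreamer extends ToyStreamer with ToyTermFilter
```

Calling ToyTermStreamer.run(Array("scala", "java")) and ToyLocationStreamer.run(Array("-74", "40", "-73", "41")) runs the same run method with two different filters behind it.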

Written: List any questions you have about what is going on with the code. (I'll answer them later in class.)

Also, feel free to discuss this question on the class mailing list! My focus is not on grading you, but on making sure you are learning some of the strategies for composing complex behaviors using Scala.

5. Enhance the English tweet listener

Look at src/main/scala/tshrdlu/project/Project2.scala. You'll see several definitions of traits, classes and objects, many of which are incomplete stubs. One object is the EnglishStatusStreamer, which mixes in the EnglishStatusListener. That listener has a method isEnglish that simply checks whether the word the is in the tweet, returning true if so and false otherwise. Try running it:

$ tshrdlu run tshrdlu.project.EnglishStatusStreamer

You will see a bunch of tweets go by, the majority of which should be English. However, what you are not seeing are the many English tweets that are going undetected. To see both types, run the EnglishStatusStreamerDebug object instead. Now you'll see every tweet prefaced by either [ENGLISH] or [OTHER], with obvious semantics. As will be clear, there are many English tweets being incorrectly assigned to [OTHER]. Your task is to improve this by using the vocabulary of English words that is contained in the English class in src/main/scala/tshrdlu/util/LanguageUtil.scala. You are also welcome to use the crude SimpleTokenizer in that file (and improve it if you like).

Tip: A good basic strategy is to ensure that some number and/or proportion of words in the tweet match the vocabulary of English.

Tip: You may want to remove/ignore the non-word items like web links, at mentions and hashtags.
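
The two tips above can be sketched roughly as follows. This is only one possible shape for the idea, not the expected solution. The object EnglishGuess, the tiny toyVocabulary, and the thresholds minCount and minRatio are all made up for illustration. A real version would use the vocabulary from the English class, and the thresholds would need tuning against actual tweets.

```scala
// Illustrative sketch only; names, vocabulary, and thresholds are assumptions.
object EnglishGuess {
  // Stand-in vocabulary; the real one would come from the English class.
  val toyVocabulary: Set[String] =
    Set("the", "is", "a", "this", "great", "day", "what", "to", "i", "love")

  // Drops links, at-mentions, and hashtags, then keeps lowercase alphabetic tokens.
  def words(text: String): Seq[String] =
    text.split("\\s+").toSeq
      .filterNot(t => t.startsWith("http") || t.startsWith("@") || t.startsWith("#"))
      .map(_.toLowerCase.replaceAll("[^a-z']", ""))
      .filter(_.nonEmpty)

  // English if enough tokens, and a large enough share of tokens, are in the vocabulary.
  def isEnglish(
      text: String,
      vocab: Set[String] = toyVocabulary,
      minCount: Int = 2,
      minRatio: Double = 0.4): Boolean = {
    val tokens = words(text)
    val hits = tokens.count(vocab)
    hits >= minCount && hits.toDouble / tokens.size >= minRatio
  }
}
```

Requiring both a count and a ratio guards against very short tweets passing on a single common word and against long non-English tweets passing on a couple of borrowed words.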

Written: When you are satisfied with your development, describe in a paragraph or two what you did in the isEnglish method to improve it.

Written: Run the EnglishStatusStreamerDebug object, get twenty tweets and then score their accuracy (how many of the twenty were correctly assigned to English or non-English). Discuss a few of the errors and what you might do to fix them.

6. Create a simple sentiment analyzer for the sample stream

The PolarityStatusListener stub implementation provides the start of an implementation for assigning a label of positive, negative or neutral to every English tweet. Its current implementation is very bad: it randomly chooses one of the three labels. You can run it as follows.

$ tshrdlu run tshrdlu.project.PolarityStatusStreamer

The PolarityStatusListener produces summary output about the number and percentage of positive, negative, and neutral tweets every 100 tweets. It looks like this:

----------------------------------------
Number of English tweets processed: 2000
+	-	~
700.0	653.0	647.0
35	32.65	32.35
----------------------------------------

The first column gives the positive (+) counts (700 tweets) and percentage of positive tweets (35%); the second and third columns give the same for negative (-) and neutral (~) tweets.

Change the implementation of the getPolarity method in PolarityStatusListener to the following:

For each tweet:
- count the number of positive words
- count the number of negative words
- if the number of positive words is greater than the number of negative words, the tweet is positive
- if the number of positive words is less than the number of negative words, the tweet is negative
- otherwise, the tweet is neutral
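
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: PolaritySketch, the two tiny toy lexicons, and the string labels are all hypothetical, and the actual getPolarity stub may take different arguments and return a different type.

```scala
// Illustrative sketch; the object name, toy lexicons, and string labels are assumptions.
object PolaritySketch {
  // Tiny stand-in lexicons; the real ones are files in the repository.
  val toyPositive: Set[String] = Set("good", "great", "love", "happy")
  val toyNegative: Set[String] = Set("bad", "awful", "hate", "sad")

  // Compares the number of positive and negative words among the tokens.
  def polarity(
      tokens: Seq[String],
      positive: Set[String] = toyPositive,
      negative: Set[String] = toyNegative): String = {
    val lowered = tokens.map(_.toLowerCase)
    val numPositive = lowered.count(positive)
    val numNegative = lowered.count(negative)
    if (numPositive > numNegative) "positive"
    else if (numPositive < numNegative) "negative"
    else "neutral"
  }
}
```

Note that a tweet with no lexicon words at all and a tweet with equal positive and negative counts both come out neutral under this rule.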

There are positive and negative lexicons in the src/main/resources/lang/eng/lexicon directory. Look at LanguageUtil.scala to see how you can get those read in as a Set[String] that you can use to determine the above counts. You are welcome to add other resources that you think might help.

Written: Give the summary output after at least 300 tweets.

Written: Get a sequential group of ten tweets, include them in your write-up, and for each one say whether you think the polarity label that was assigned to it was correct or incorrect.

Written: It is fine if you only do the above simple method of determining polarity. However, if you did anything extra, say what you did and why.

7. Run the sentiment analyzer with term filters

The PolarityTermStreamer object is inert in the stub file. Extend the classes and traits necessary to make it into a streaming app that filters by terms and assigns polarity to them.

Note: This requires nothing more than a single line using extends and with.

Written: Having done this, choose two related terms to compare, e.g. book vs movie, and output the summary after 100 tweets for each.

Tip: If you use a term that is very uncommon on Twitter, you won't get many tweets, so it's best to choose something common.

8. Run the sentiment analyzer with location filters

The PolarityLocationStreamer is inert in the stub file. Fix that as you did for PolarityTermStreamer, but also change the outputInterval to 10 instead of its default value of 100. If you don't already know how to override a value in a trait, you'll need to figure it out by looking at books and resources (e.g. Stack Overflow) and using the mailing list. (Don't worry, it's not hard, but it is useful to know that you can do it and how to do it.)
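
As a generic illustration of the language feature involved, here is a toy sketch of overriding a value declared in a trait. ToyReporter, DefaultReporter, and FrequentReporter are hypothetical names, and I am assuming the value is declared as a val with a default. The stub file may declare outputInterval differently, so check its actual definition.

```scala
// Toy example of overriding a trait member; these names are not from tshrdlu.
trait ToyReporter {
  // Default reporting interval, which implementers may override.
  val outputInterval: Int = 100

  // True whenever a full interval of items has been processed.
  def shouldReport(numProcessed: Int): Boolean =
    numProcessed > 0 && numProcessed % outputInterval == 0
}

// Uses the default interval of 100.
object DefaultReporter extends ToyReporter

// Replaces the default with 10 using the override keyword.
object FrequentReporter extends ToyReporter {
  override val outputInterval: Int = 10
}
```

The override keyword is required here because the trait already gives the member a concrete value.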

Written: Provide the summary output after at least 50 tweets for each of the following cities:

Austin: -97.8 30.25 -97.65 30.35
San Francisco: -122.75 36.8 -121.75 37.8
New York City: -74 40 -73 41
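
Each line above gives four numbers. My reading, which you should confirm against the Twitter streaming documentation, is that they are ordered as west longitude, south latitude, east longitude, north latitude (that is, the southwest corner followed by the northeast corner). The sketch below parses such a line and checks whether a point falls inside the box. BoundingBoxSketch and Box are hypothetical names and are not part of tshrdlu or twitter4j.

```scala
// Illustrative sketch; assumes the order west-lon, south-lat, east-lon, north-lat.
object BoundingBoxSketch {
  case class Box(west: Double, south: Double, east: Double, north: Double) {
    // True if the point lies inside the box or on its boundary.
    def contains(lon: Double, lat: Double): Boolean =
      lon >= west && lon <= east && lat >= south && lat <= north
  }

  // Parses four whitespace-separated numbers into a Box.
  def parse(spec: String): Box =
    spec.trim.split("\\s+").map(_.toDouble) match {
      case Array(w, s, e, n) => Box(w, s, e, n)
      case other =>
        throw new IllegalArgumentException("expected 4 numbers, got " + other.length)
    }
}
```

A quick sanity check is to parse the Austin line and confirm that a point you know to be in central Austin is contained in the resulting box.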

Extra

Take any of the above further, e.g.:

Use an off-the-shelf language classifier rather than the one you created.
Use an off-the-shelf polarity classifier rather than the lexicon-based one.
Modify the location streamers so that they output latitude and longitude as well, and then plot the tweets using an application like Google Earth.
Compute unigram distributions for the different cities based on the tweets that are produced in each and produce word clouds for them.
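
For the last idea, a unigram distribution is just each word's relative frequency over all tokens. Here is a minimal sketch, assuming plain whitespace tokenization and lowercasing. UnigramSketch is a hypothetical name, and you would probably want a better tokenizer and some stopword filtering before making word clouds.

```scala
// Illustrative sketch; whitespace tokenization is a simplifying assumption.
object UnigramSketch {
  // Maps each lowercased token to its share of all tokens in the given tweets.
  def distribution(tweets: Seq[String]): Map[String, Double] = {
    val tokens = tweets.flatMap(_.toLowerCase.split("\\s+").toSeq).filter(_.nonEmpty)
    val total = tokens.size.toDouble
    tokens.groupBy(identity).map { case (word, occurrences) =>
      word -> occurrences.size / total
    }
  }
}
```

Running this once per city gives you one distribution per city to compare or to feed into a word cloud tool.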
