Project3

jasonbaldridge edited this page Mar 6, 2013 · 5 revisions

Course project, Phase 3

DUE: March 8, 2013, 1pm CST

Preamble

This project phase involves using the twitter4j library to interact with Twitter as an automated user. This phase is dramatically underspecified compared to the previous ones. Please ask questions about it and discuss code and ideas on the class mailing list. Talk, help each other, etc. Collaborate as a class to do something better than you could on your own.

You are welcome to work in groups of two or three if you like. Please let me know via email if you have chosen to do so.

Submitting your solutions

Write up what you did in a file named <lastname>_<firstname>_p3.txt (you may also submit a PDF if you want to use something other than plain text). Submit this on Blackboard.

Your code submission will be contained in your fork of the tshrdlu repository. You should tag the submission version of your repository as "PP3". Search for "git tag" to see how to tag. Make sure to use git push --tags to ensure that your tag is pushed to GitHub. (You can verify that it has been pushed by clicking on "tags" on your repository's main page and looking for the tag PP3.)


1. Perform user actions using Twitter4j

Read the Twitter4j tutorial blog post about doing user actions.

Update your tshrdlu fork to the latest version of tshrdlu (v0.1.3) and add code that automatically follows all "_anlp" followers of the @appliednlp account. Then invoke that code, so that everyone in the class ends up following everyone else.

Note: you should add your twitter4j.properties file to the main tshrdlu directory, as version 0.1.3 of tshrdlu needs it there for authentication.
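
One way to structure the follow-everyone step is to keep the selection logic separate from the Twitter calls. The Scala sketch below is only an illustration under stated assumptions: FollowClient and its three methods are hypothetical names invented here (a real implementation could back them with Twitter4j's getFollowersIDs, lookupUsers, and createFriendship), "_anlp" is assumed to be a screen-name suffix, and skipping your own account is an added assumption.

```scala
// Hypothetical wrapper around the three Twitter operations this task needs.
// These method names are NOT Twitter4j's API; a real implementation would
// delegate to Twitter4j's getFollowersIDs, lookupUsers and createFriendship.
trait FollowClient {
  def followerIds(account: String): Seq[Long]
  def screenNames(ids: Seq[Long]): Map[Long, String]
  def follow(id: Long): Unit
}

object FollowClassmates {
  // Assumption: class accounts are recognizable by this screen-name suffix.
  val Suffix = "_anlp"
  // Twitter's user lookup accepts at most 100 ids per request.
  val LookupBatch = 100

  /** Follow every follower of `account` whose screen name ends in the suffix,
    * except `self`. Returns the followed screen names in sorted order. */
  def followAll(client: FollowClient, account: String, self: String): Seq[String] = {
    val classmates = for {
      batch <- client.followerIds(account).grouped(LookupBatch).toList
      (id, name) <- client.screenNames(batch).toSeq
      if name.endsWith(Suffix) && name != self
    } yield (id, name)
    classmates.foreach { case (id, _) => client.follow(id) }
    classmates.map(_._2).sorted
  }
}
```

Keeping the Twitter calls behind a small trait makes it possible to try the selection logic against an in-memory stub before spending any rate-limited requests.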

2. Adapt and extend the starting code to create new behaviors and/or capabilities

The code in tshrdlu allows mixing of stream and user access via Twitter4j so that it is possible to do automatic responses with a twitter bot. What is there provides a scaffolding with very limited capabilities.

  • tshrdlu.twitter.ReactiveBot: a sample bot that has a listener which allows it to
    • process some simple messages (e.g. you can tell it to "follow @foo @bar" and it will follow them)
    • respond to messages sent to it (as @-replies)---it does this by searching twitter for several terms in the initiating message and then selecting one of the found tweets at random (provided it doesn't mention any users)
  • tshrdlu.twitter.ClusterStream: cluster tweets from the Twitter stream
    • note the tshrdlu.twitter.StatusClusterer object, which does most of the work in this regard and should be fairly reusable for other clustering needs
  • the code in the twitter4j-tutorial repository may also be useful
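
The two ReactiveBot behaviors described above can be pictured with a small, self-contained Scala sketch. This is not the code in tshrdlu: the names ReplySketch, followTargets, and pickReply are hypothetical, and the handle pattern is a simplifying assumption about what counts as a user mention.

```scala
import scala.util.Random

object ReplySketch {
  // Simplified handle pattern: "@" followed by letters, digits or underscores.
  private val Handle = """@(\w+)""".r

  /** Handles named in a message like "follow @foo @bar".
    * Returns an empty sequence if the message is not a follow command. */
  def followTargets(text: String): Seq[String] = {
    val tokens = text.trim.split("\\s+").toSeq
    // Skip leading @-mentions, e.g. the bot's own name at the start of a reply.
    val body = tokens.dropWhile(_.startsWith("@"))
    if (body.headOption.exists(_.equalsIgnoreCase("follow")))
      body.tail.collect { case Handle(name) => name }
    else
      Seq.empty
  }

  /** Pick one candidate tweet at random, considering only those that
    * mention no users. Returns None if every candidate mentions someone. */
  def pickReply(candidates: Seq[String], rng: Random): Option[String] = {
    val clean = candidates.filterNot(t => Handle.findFirstIn(t).isDefined)
    if (clean.isEmpty) None else Some(clean(rng.nextInt(clean.length)))
  }
}
```

Passing the random generator in as a parameter makes the selection reproducible with a fixed seed, which helps when you want to include example interactions in a write-up.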

Expand on this in some interesting way(s), e.g.

  • Improve the search ability in the responder bot so that it returns more interesting/unique responses that are also more relevant to the message sent to it. For example, you could use tf-idf values to pick interesting words to use as search terms, or search for bigrams. (You should also consider using a stream rather than the user access to get these tweets.)
  • Given a search for possible responses, improve the selection of the response message, e.g. by selecting those that are the best matches (probably using cosine similarity), and that are most unique in the set of returned results (perhaps using relative frequency ratios).
  • Pre-compute a topic model and use the topics as ways to find words to search for that aren't exact matches to any of the words in the original tweet that you must respond to.
  • Keep state about interactions, e.g. so that if someone tweets to your bot, it might actually proactively tweet to them (for that session) based on the initial reactions.
  • Use information about the person tweeting to your bot to customize the response, e.g. by using their name (or at least filtering out names of other people) or by gathering their previous tweets to get more relevant responses.
  • Use an n-gram language model based on all the tweets you got from searching to generate new text for a response tweet.
  • Additionally, start creating new tweets (not necessarily responses) using a language model.
  • Give your bot some personality by having it follow certain kinds of users (e.g. skeptics, musicians, architects) and having it select its responses from them or their followers.
  • Expand the word cloud code from the tutorial so that you cluster the followers of a given user. (Note: because of rate limits, you'll want to do this in two steps---the first to obtain the descriptions and save them to disk, and then another to actually work with them.)
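
To make the tf-idf and cosine similarity suggestions in the list above concrete, here is a minimal Scala sketch. It assumes you already have a background collection of tokenized tweets (for example, saved from the stream) from which to estimate document frequencies. The object name ResponseRanking, its method names, the crude tokenizer, and the smoothed idf formula are all choices made for this illustration, not part of tshrdlu.

```scala
object ResponseRanking {
  type Vec = Map[String, Double]

  /** Lowercased tokens; a deliberately crude tokenizer for illustration. */
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9#@']+").filter(_.nonEmpty).toSeq

  /** Smoothed inverse document frequency: log((N + 1) / (df + 1)) + 1,
    * where N is the number of background documents. */
  def idf(docs: Seq[Seq[String]]): String => Double = {
    val n = docs.size.toDouble
    val df = docs.flatMap(_.distinct).groupBy(identity)
      .map { case (w, ws) => w -> ws.size.toDouble }
    w => math.log((n + 1) / (df.getOrElse(w, 0.0) + 1)) + 1
  }

  /** Sparse tf-idf vector for one tokenized text. */
  def tfidf(tokens: Seq[String], idfOf: String => Double): Vec =
    tokens.groupBy(identity).map { case (w, ws) => w -> (ws.size * idfOf(w)) }

  /** The k highest-weighted words of a message, usable as search terms.
    * Ties are broken alphabetically so the result is deterministic. */
  def queryTerms(message: String, idfOf: String => Double, k: Int): Seq[String] =
    tfidf(tokenize(message), idfOf).toSeq
      .sortBy { case (w, x) => (-x, w) }
      .take(k)
      .map(_._1)

  /** Cosine similarity of two sparse vectors; 0.0 if either has zero norm. */
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.iterator.map { case (w, x) => x * b.getOrElse(w, 0.0) }.sum
    val norm = math.sqrt(a.values.map(x => x * x).sum) *
      math.sqrt(b.values.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  /** Candidate responses sorted from most to least similar to the message. */
  def rank(message: String, candidates: Seq[String], idfOf: String => Double): Seq[String] = {
    val m = tfidf(tokenize(message), idfOf)
    candidates.sortBy(c => -cosine(m, tfidf(tokenize(c), idfOf)))
  }
}
```

In this sketch, queryTerms would choose what to search for and rank would order whatever the search returns. Words that never occur in the background collection get the highest idf, so a very small background set will favor typos and rare names; a larger sample from the stream should give more sensible weights.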

I'm of course happy to discuss any of these ideas or your own further, either in person or over email.
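
For the language-model ideas in the list above, a bigram model is probably the simplest place to start. The Scala sketch below counts bigrams over tokenized tweets and samples a new word sequence from those counts. The names BigramSketch, train, sample, and generate, the boundary markers, and the maxWords cutoff are all assumptions made for this illustration; it does no smoothing, so it can only produce word pairs it has already seen.

```scala
import scala.util.Random

object BigramSketch {
  // Boundary markers added around every tweet (illustrative choices).
  val Start = "<s>"
  val End = "</s>"

  /** For each context word, the counts of the words that followed it. */
  def train(tweets: Seq[Seq[String]]): Map[String, Map[String, Int]] = {
    val pairs = tweets.flatMap { toks =>
      val padded = (Start +: toks) :+ End
      padded.zip(padded.tail)
    }
    pairs.groupBy(_._1).map { case (prev, ps) =>
      prev -> ps.groupBy(_._2).map { case (next, ns) => next -> ns.size }
    }
  }

  /** Draw one word with probability proportional to its count. */
  def sample(counts: Map[String, Int], rng: Random): String = {
    // Sort so that a fixed seed always reproduces the same choice.
    val items = counts.toSeq.sortBy(_._1)
    var r = rng.nextInt(items.map(_._2).sum)
    items.find { case (_, c) => r -= c; r < 0 }.get._1
  }

  /** Generate words until the end marker is drawn or maxWords is reached. */
  def generate(model: Map[String, Map[String, Int]], rng: Random,
               maxWords: Int = 25): Seq[String] = {
    @annotation.tailrec
    def loop(prev: String, acc: Vector[String]): Vector[String] = {
      val next = sample(model(prev), rng)
      if (next == End || acc.size >= maxWords) acc else loop(next, acc :+ next)
    }
    loop(Start, Vector.empty)
  }
}
```

Trained on the tweets returned by a search, generate yields text that stays on the search topic but is often ungrammatical; moving to trigrams, or checking the result against Twitter's length limit before posting, would be natural next steps.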

You do not need to use the code I did, at least not in any way that preserves it in your final solution: just use what you need and throw away anything you don't need. Use external libraries that you need, but ensure that they work with the sbt build (either as dependencies or by placing them in the lib directory). Also, you are welcome to mix and match code from other students in order to get further capabilities (recall that these are all public forks of tshrdlu, so you can find the others easily). If you do this, you should give credit to them in your write-up.

With any of these, you must write down what you did, give pointers to the relevant code, and discuss interesting things that happened or that you found. As part of your write-up, include at least one paragraph that gives some initial thoughts regarding the course project: given what we've done so far, what do you think you might be able to do in the context of automating analysis and/or interaction on Twitter using natural language processing and machine learning?

Rubric

Your submission will be scored using the following rubric. Qualities of full-point submissions are given below each area.

  • Coding: 30
    • The code implements non-trivial extensions of the starting code.
    • The code demonstrates thought about program dependencies and flow.
    • The code is organized and documented.
  • Writing: 40
    • The write-up clearly explains what was done.
    • The write-up has examples, including relevant output (e.g. interactions, clusters, etc).
    • The write-up provides analysis of output, as appropriate.
    • The write-up has references to any papers, blog posts, or other resources that were used to complete the work.
    • The write-up is professionally done (organized, free of spelling and grammar errors).
  • Creativity: 10
    • The work shows original thought in selection of task, choosing algorithms for solving it, solutions to coding challenges, and analyzing their output.
    • The work combines different ideas from the class in new ways.
  • Overall quality: 20
    • The work as a whole is high quality.

In giving this rubric in this way, I don't wish to make you feel you must submit something akin to a semester final report. This project stage is of course worth only 4% of the total grade, so it doesn't need to be "big". Do something interesting, do it well, and write it up clearly.