Processing raw text from Twitter to perform feature extraction on tweets for classification using the WEKA machine learning package.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.

There are two parts to this project, and

  1. Preprocessing, tokenizing, and tagging parts-of-speech takes in unprocessed tweets and removes noise by

  • removing all HTML tags and attributes
  • convert HTML character codes to ASCII equivalent (e.g. "<" is changed to "<")
  • all URLs are removed
  • @ and # tags are removed
  • each sentence in each tweet in on a new line
  • each tweet is delimited with '|'
  • ellipsis (...) and combinations of ! and ? are not split (e.g. "?!?!?!?!???")
  • each token, including punctuation and clitics, is separated by a space (e.g. "we've" changed to "we 've")
  • tagging each token with its part-of-speech

To run, from the terminal, type: python TWEETS OUTPUT

where ...

  • TWEETS is the input file containing tweets (unprocessed) from Twitter from a particular user
  • OUTPUT is the output file (*.twt) where each tweet is parsed, tokenized and tagged

e.g. python justinbieber justinbieber.twt

  1. Feature extraction takes .twt files to build an ARFF datafile. The produced ARFF datafile gathers feature information that will be used for the classification of tweets.

The features computed for each .twt files are the number of

  • first person pronouns
  • second person pronouns
  • third person pronouns
  • coordinating conjunctions
  • past-tense verbs
  • future-tense verbs
  • commas
  • colons and semi-colons
  • dashes
  • parentheses
  • common nouns
  • proper nouns
  • adverbs
  • wh-words
  • slang
  • all-caps

as well as the average length of sentences for a tweet, the average length of tokens, and the number of sentences in a tweet.

The WEKA machine learning package will be used to classify tweets given the .arff files.

To run, from the terminal, type: python -X CLASS:FILE CLASS:FILE+FILE+FILE ... OUTPUT

where ...

  • X denotes the number of tweets to read; exclude this if you wish to read all the tweets from each FILE
  • CLASS denotes the class (optional, else takes the file name)
  • FILE is the parsed, tokenized and tagged file (.twt) produced from; if there is more than one file for a given class, this is denoted by a + in between each file
  • OUTPUT the generated arff file (*.arff)

e.g. python pop:justinbeiber.twt+katyperry.twt+ladygaga.twt news:cnn.twt+bbcnews.twt+nytimes.twt POPvsNEWS.arff