# Askweet - Identifying questions in Twitter streams

Humans are inquisitive by nature! The popularity of the internet has led to the creation of various portals for asking
questions. Search engines are one of the most prominent go-to sources for users seeking answers to their questions. The recent growth of social networks like Twitter, Facebook and Reddit has made it possible for people to direct questions towards their friends, followers, and other like-minded people.

As an example, Twitter generates around 500 million tweets per day. Research has shown that roughly 10% of
these tweets are questions and many of them go unanswered. Imagine if we could develop an AI that automatically classifies tweets as questions and, better yet, answers them directly.


## Data gathering from the Twitter Firehose API
Twitter Firehose is an API provided by Twitter that allows users to stream tweets in real time. It is also possible to filter tweets based on certain criteria.

In [None]:
#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contains the user credentials to access Twitter API
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""


#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print data
        return True

    def on_error(self, status):
        print status


if __name__ == '__main__':

    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filters the Twitter stream to capture tweets containing a question mark or common question words
    stream.filter(track=['?','who', 'what', 'when', 'where', 'why', 'how', 'do', 'is', 'could', 'can', "can't", 'cant', 'would', "wouldn't", "wouldnt", 'should', "shouldn't", "shouldnt", 'did', 'will', 'has', 'have', "won't", 'does', 'wont', 'doesnt', "doesn\'t", 'had', 'are'])


## Sample data
Once the data is captured in a text file, it is annotated with labels to create training data. Here is a sample set - 

tweet_id	| tweet_original	| is_question	| is_answerable
-----------  |    ---------      | ---------|--------
29 | @TheGRAMMYs: When a band member leaves, what happens next? http://t.co/QOjFGFYdjv http://t.co/gR1a7WNyKg	| yes	| no
30	| @ThingsAKidSaid: When someone calls you ugly http://t.co/ewFB1RPENW	| no |no
34	| RT @ambermakda: I have no words http://t.co/zjkPWUd7LZ	| no	| no 
35	| So apparently the NHL replaced its refs with those from the NBA because that shouldn't have been a goaltenders interference penalty.	| no | no 
24	| Is David De Gea ready to extend his Manchester United contract?: ESPN FC's Craig Burley analyses Lukas Podolsk... http://t.co/eizdhI5xog	| yes	| yes

As you can see, every tweet has two labels, which identify whether it is a question and whether it is a machine-answerable question. The second label is important if we want to consider only those tweets that can be answered automatically by machines. As a first step, building an AI to automatically answer factual questions (e.g. What is Barack Obama's height?) is a smaller problem than trying to answer questions that ask for opinions or, worse, rhetorical questions.

## Generating Feature Set
The types of features extracted were broadly classified into 4 categories - 
## 1. Lexical features
Given a tweet, its unigrams, bigrams and trigrams were used as features.
An n-gram is a sequence of n consecutive terms from a given tweet. The presence of n-grams such as "what", "when will" or "why is US" could imply that a tweet is indeed a question.
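
As a minimal illustration of the idea (a standalone sketch in Python 3 syntax, not the notebook's own code, which relies on `nltk.bigrams` and `nltk.trigrams`), n-grams can be produced by sliding a window of size n over the token list. The helper name `ngrams` and the sample sentence are made up for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample tweet, already lower-cased and split on whitespace
tokens = "why is us so quiet today".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:2])   # [('why', 'is'), ('is', 'us')]
print(trigrams[0])   # ('why', 'is', 'us')
```

A tweet shorter than n tokens simply yields no n-grams of that size.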


### Cleaned data
Lexical features require the data to be cleaned before unigrams, bigrams and trigrams are computed. During cleaning, URLs, Twitter handles referring to usernames, and retweet markers were removed from the tweet.

tweet_id	| tweet_original	| is_question	| is_answerable
-----------  |    ---------      | ---------|--------
29 |  When a band member leaves, what happens next? | yes	| no
30	|  When someone calls you ugly | no |no
34	| I have no words 	| no	| no 
35	| So apparently the NHL replaced its refs with those from the NBA because that shouldn't have been a goaltenders interference penalty.	| no | no 
24	| Is David De Gea ready to extend his Manchester United contract?: ESPN FC's Craig Burley analyses Lukas Podolsk... 	| yes	| yes
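
The cleaning step can be sketched roughly as follows (Python 3 syntax). The pattern mirrors the one used in the notebook code, with one assumption added here: the match is case-insensitive so that an upper-case `RT` marker is also stripped. The names `CLEAN_RE` and `clean_tweet` are illustrative only.

```python
import re

# Leading retweet marker, @mentions (with optional colon), and URLs.
# re.IGNORECASE is an assumption of this sketch, not part of the original pattern.
CLEAN_RE = re.compile(r"^rt\s+|@\w+:*|https?://[\w./]*", re.IGNORECASE)

def clean_tweet(text):
    """Strip the retweet marker, mentions and URLs, then collapse whitespace."""
    stripped = CLEAN_RE.sub("", text)
    return " ".join(stripped.split())

print(clean_tweet("RT @ambermakda: I have no words http://t.co/zjkPWUd7LZ"))
# I have no words
```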



In [6]:
import csv
import re
import json
import nltk
from nltk.stem.snowball import SnowballStemmer
from collections import defaultdict

class LexicalFeatures:
    def __init__(self, filename):
        self.filename = filename
        self.featureIndex = defaultdict(int)
        self.features = []
        self.unigram_vocab = defaultdict(int)
        self.bigram_vocab  = defaultdict(int)
        self.trigram_vocab = defaultdict(int)
        self.data = {}
        self.get_data_from_file()

    def get_data_from_file(self):
        stemmer = SnowballStemmer("english")
        infile = open(self.filename,"r")
        self.data = json.load(infile)
        infile.close()
        print "Input tweet format - "
        print self.data["5264"]
        print "\n"
        for ids in self.data:
            tweet = self.data[ids][u'tags']
            tweet_text = [ words[0] for words in tweet ]
            # print tweet_text
            clean_tweet = ' '.join(tweet_text)
            regex_form = '^rt\s+|@\w+:*|https?://[\w\.\/]*'
            clean_tweet = re.sub(regex_form, '', clean_tweet)
            clean_tweet = [stemmer.stem(x) for x in clean_tweet.split()]
            if ids == "5264":
                print "Cleaned tweet -"
                print clean_tweet
                print "\n"
                
            for item in clean_tweet:
                self.unigram_vocab[item] += 1

            for item in list(nltk.bigrams(clean_tweet)):
                self.bigram_vocab[item] += 1

            for item in list(nltk.trigrams(clean_tweet)):
                self.trigram_vocab[item] += 1
            
        # Use unigrams only if they occur 5 or more times
        temp = [ k for k,v in self.unigram_vocab.iteritems() if v >= 5 ]
        self.unigram_vocab = temp

        temp = [ k for k,v in self.bigram_vocab.iteritems() if v >= 5 ]
        self.bigram_vocab = temp

        temp = [ k for k,v in self.trigram_vocab.iteritems() if v >= 3 ]
        self.trigram_vocab = temp
        print "Sample Unigram features - "
        print self.unigram_vocab[0:10]
        print "\n"
        print "Sample Bigram features - "
        print self.bigram_vocab[0:10]
        print "\n"
        print "Sample Trigram features - "
        print self.trigram_vocab[0:10]
        self.features = self.unigram_vocab + self.bigram_vocab + self.trigram_vocab

        print "\nThe number of Unigram Features are %s" % (len(self.unigram_vocab))
        print "\nThe number of Bigram Features are %s" % (len(self.bigram_vocab))
        print "\nThe number of Trigram Features are %s" % (len(self.trigram_vocab))
        print "\nTotal number of Lexical Features are %s" % (len(self.features))

        for index, item in enumerate(self.features):
            self.featureIndex[item] = index
        
if __name__ == "__main__":
    inputFile = "tweet_tags_15k.json"
    trainData = LexicalFeatures(inputFile)



Input tweet format - 
{u'is_question': 0, u'is_answerable': 0, u'tags': [[u'RT', u'~', 0.9979], [u'@disneywords', u'@', 0.9991], [u':', u'~', 0.9803], [u'Many', u'A', 0.8671], [u'of', u'P', 0.9994], [u'the', u'D', 0.9996], [u'things', u'N', 0.9981], [u'that', u'P', 0.8976], [u'seem', u'V', 0.9083], [u'impossible', u'A', 0.9445], [u'now', u'R', 0.993], [u'will', u'V', 0.9931], [u'become', u'V', 0.9985], [u'realities', u'N', 0.9923], [u'tomorrow', u'N', 0.9558], [u'.', u',', 0.9968], [u'Walt', u'^', 0.9331], [u'Disney', u'^', 0.8178]]}


Cleaned tweet -
[u'rt', u':', u'mani', u'of', u'the', u'thing', u'that', u'seem', u'imposs', u'now', u'will', u'becom', u'realiti', u'tomorrow', u'.', u'walt', u'disney']


Sample Unigram features - 
[u'yellow', u'four', u'otra', u'lord', u'digit', u'sehun', u'dell', u'prize', u'wednesday', u'solid']


Sample Bigram features - 
[(u'my', u'final'), (u'hoodi', u'are'), (u'have', u'made'), (u'want', u'you'), (u'now', u'that'), (u'need', u'it'), (u'here', u'

## 2. Meta features

These features describe the characteristics of the tweet. Five meta features were used: the length of the tweet, the number of words in the tweet, whether the tweet has a URL, whether the tweet mentions other users, and the coverage of the words. Coverage measures the number of unique terms in the tweet relative to the dictionary built from the training data.
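
A rough standalone sketch of these five values is shown below (Python 3 syntax). It works on raw text with simple regular expressions rather than on tagger output, and it assumes one possible reading of coverage: the fraction of the tweet's unique words that appear in the training vocabulary. The names `meta_features`, `URL_RE`, `MENTION_RE` and the tiny `vocab` set are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def meta_features(text, vocabulary):
    """Return [length, word count, has_url, has_mention, coverage]."""
    words = text.split()
    unique_words = set(w.lower() for w in words)
    known = unique_words & vocabulary
    # Assumed definition: share of the tweet's unique words found in the vocabulary
    coverage = len(known) / len(unique_words) if unique_words else 0.0
    return [
        len(text),
        len(words),
        int(bool(URL_RE.search(text))),
        int(bool(MENTION_RE.search(text))),
        coverage,
    ]

vocab = {"what", "is", "the", "time"}   # hypothetical training vocabulary
features = meta_features("@bob what is the time http://t.co/abc", vocab)
print(features)
```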

In [24]:
class FeatureVectors:
    def __init__(self):
        self.featureVectors = defaultdict()

    def compute_features(self, trainData):
        stemmer = SnowballStemmer("english")
        # pre save the sets of ngram
        unigram_vocab_set = set(trainData.unigram_vocab)
        bigram_vocab_set = set(trainData.bigram_vocab)
        trigram_vocab_set = set(trainData.trigram_vocab)

        for ids in trainData.data:
            # print ids
            features = [0]*len(trainData.features)
            # meta features
            length, number, has_url, has_mention, coverage = (0,)*5

            #get the tweet
            tweet = trainData.data[ids][u'tags']
            tweet_is_question = trainData.data[ids][u'is_question']
            tweet_is_answerable = trainData.data[ids][u'is_answerable']
           
            #get the tags (each entry is [word, tag, confidence])
            temp = set([x[1] for x in tweet])
            #check for url and mentions
            if 'U' in temp:
                has_url = 1
            if '@' in temp:
                has_mention = 1

            #compute len of tweet and number of words
            length = sum([len(word[0]) for word in tweet])
            number = len(tweet)

            #process tweet for coverage and other Lexical features
            tweet_text = [ words[0] for words in tweet ]
            clean_tweet = ' '.join(tweet_text)
            regex_form = '^rt\s+|@\w+:*|https?://[\w\.\/]*'
            clean_tweet = re.sub(regex_form, '', clean_tweet)
            clean_tweet = [stemmer.stem(x) for x in clean_tweet.split()]

            #compute coverage
            # float() avoids integer division, which would always give 0 in Python 2
            coverage = float(len(set(clean_tweet))) / len(trainData.unigram_vocab)
            #compute lexical features
            for item in clean_tweet:
                if item in unigram_vocab_set:
                    features[trainData.featureIndex[item]] = 1

            for item in list(nltk.bigrams(clean_tweet)):
                if item in bigram_vocab_set:
                    features[trainData.featureIndex[item]] = 1

            for item in list(nltk.trigrams(clean_tweet)):
                if item in trigram_vocab_set:
                    features[trainData.featureIndex[item]] = 1

            #append the lexical and meta features
            lex_meta_features = features + [length, number, has_url, has_mention, coverage]
            self.featureVectors[ids] = {'features':lex_meta_features,'is_question':tweet_is_question, 'is_answerable':tweet_is_answerable}



## 3. POS Features

These features consist of unigrams, bigrams and trigrams computed over part-of-speech (POS) tags. Because POS tags are not tied to particular words, POS features also capture context. Since the characteristics of tweets are slightly different from normal English sentences, a specialized POS tagger for tweets developed by CMU was used: http://www.ark.cs.cmu.edu/TweetNLP/. The tagger uses separate tags for URLs and hashtags, and also has specific tags for known short forms (e.g. ikr for "I know right").
Once the tags are identified for the tweet, unigrams, bigrams and trigrams are computed over these tags.
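
To make this concrete, here is a small sketch (Python 3 syntax) that builds POS n-grams from (word, tag) pairs while keeping the tags in tweet order. The helper `pos_ngrams` is hypothetical; the four tagged tokens are the first entries of the sample tagger output shown further down, with the confidence values dropped.

```python
def pos_ngrams(tagged_tokens, n):
    """Return the set of n-grams over the POS tags, in tweet order."""
    tags = [tag for _word, tag in tagged_tokens]
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

# (word, tag) pairs as produced by a tweet POS tagger
tagged = [("I", "O"), ("cut", "V"), ("my", "D"), ("mouth", "N")]

print(sorted(pos_ngrams(tagged, 2)))
# [('D', 'N'), ('O', 'V'), ('V', 'D')]
```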

<img src="img.png" style="max-width:80%; ">

In [7]:
import csv
import CMUTweetTagger
import json
import codecs
import re
import nltk
from collections import defaultdict

class TweetPOS_LM:
    def __init__(self):
        self.trainFile = 'Train1.csv'
        self.tagFile = 'tweet_tags_15k.json'
        self.featureFile = 'postags_features_15k.json'
        self.tweet_id = []
        self.tweet_original = []
        self.is_question = []
        self.is_answerable = []
        self.tweet_unigram = defaultdict(set)
        self.tweet_bigram = defaultdict(set)
        self.tweet_trigram = defaultdict(set)
        self.tweet_feature_list = []
        self.tweet_features = {}

    # read the file containing tweets
    def readTweets(self):
        with open(self.trainFile, 'rb') as csvfile:
            tweetreader = csv.reader(csvfile, delimiter=',', quotechar='"')
            header = tweetreader.next()
            for row in tweetreader:
                self.tweet_id.append(row[0])
                self.tweet_original.append(row[1])
                self.is_question.append(row[2])
                self.is_answerable.append(row[3])

    # This function computes the POS feature set
    def pos_tagger(self):
        tweets = []
        for tw in self.tweet_original:
            try:
                tw = tw.decode('unicode_escape').encode('ascii','ignore')
            except:
                tw = re.sub(r'\\+', '', tw)
                tw = tw.decode('unicode_escape').encode('ascii','ignore')
            tweets.append(tw)
        
        pos_unig = []
        pos_big = []
        pos_trig = []
        # tag tweets using python version of CMU Tweet NLP
        sent_tags = CMUTweetTagger.runtagger_parse(tweets)
        # fil_tweet = open('tweet_tags.json','w')
        
        print "Result of CMU tagger"
        print sent_tags[5264]
        i = 0
        for sent in sent_tags:
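            # Note: the tags are collected into a set first, so the bigrams and
            # trigrams below are built over the tweet's unique tags, not in tweet order
            unigrams = set([tag_tuple[1] for tag_tuple in sent])
            bigrams = set(nltk.bigrams(unigrams))
            trigrams = set(nltk.trigrams(unigrams))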
            self.tweet_unigram[self.tweet_id[i]] = unigrams
            self.tweet_bigram[self.tweet_id[i]] = bigrams
            self.tweet_trigram[self.tweet_id[i]] = trigrams
            
            pos_unig.extend(list(unigrams))
            pos_big.extend(list(bigrams))
            pos_trig.extend(list(trigrams))
            i += 1
        
        pos_unig = list(set(pos_unig))
        pos_big= list(set(pos_big))
        pos_trig = list(set(pos_trig))
        print "Sample POS Unigram features - "
        print pos_unig[0:3]
        print "\n"
        print "Sample Bigram features - "
        print pos_big[0:3]
        print "\n"
        print "Sample Trigram features - "
        print pos_trig[0:3]
        print "\nThe number of POS Unigram Features are %s" % (len(pos_unig))
        print "\nThe number of POS Bigram Features are %s" % (len(pos_big))
        print "\nThe number of POS Trigram Features are %s" % (len(pos_trig))
        self.tweet_feature_list = set(pos_unig + pos_big + pos_trig)
        print "\nTotal number of POS Features are %s" % (len(self.tweet_feature_list))

    # This function creates a feature vector for every tweet
    def get_features(self):
        feature_index = {}
        i = 0
        for feature in self.tweet_feature_list:
            feature_index[feature] = i
            i += 1
        i = 0
        for tweet in self.tweet_id:
            # feature_dict = {}
            features = [0,]*len(feature_index)
            for unigram in self.tweet_unigram[tweet]:
                features[feature_index[unigram]] = 1
            for bigram in self.tweet_bigram[tweet]:
                features[feature_index[bigram]] = 1
            for trigram in self.tweet_trigram[tweet]:
                features[feature_index[trigram]] = 1
            is_question = 1 if self.is_question[i] == "yes" else 0
            is_answerable = 1 if self.is_answerable[i] == "yes" else 0
            self.tweet_features[tweet] = {"features":features, "is_question":is_question, "is_answerable":is_answerable}
            i += 1


if __name__ == "__main__":
    tp = TweetPOS_LM()
    tp.readTweets()
    tp.pos_tagger()



Result of CMU tagger
[('I', 'O', 0.9972), ('cut', 'V', 0.9982), ('my', 'D', 0.9987), ('mouth', 'N', 0.9968), ('man', 'N', 0.6624), ('it', 'O', 0.9847), ('hurt', 'V', 0.9871), ('so', 'R', 0.9597), ('bad', 'A', 0.9412), ('when', 'R', 0.9858), ('I', 'O', 0.9981), ('eat', 'V', 0.9996), ('something', 'N', 0.9918)]
Sample POS Unigram features - 
['!', '#', '$']


Sample Bigram features - 
[('R', ','), ('U', 'E'), ('$', '&')]


Sample Trigram features - 
[('X', ',', '^'), ('U', 'D', 'O'), ('P', 'N', 'T')]

The number of POS Unigram Features are 24

The number of POS Bigram Features are 340

The number of POS Trigram Features are 1613

Total number of POS Features are 1977


## 4. WordNet features

The last category is WordNet features, which are derived from word relationships. WordNet is a semantic lexicon for the English language that organizes words into groups called synsets (i.e., synonym sets). Synonyms and hypernyms of every word in the tweet are extracted from the WordNet database and incorporated as features.
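
The expansion step might look something like the sketch below (Python 3 syntax). To keep it self-contained, the `RELATED` table is a hypothetical stand-in for real WordNet lookups; with NLTK the same information could be gathered from `nltk.corpus.wordnet` via `wordnet.synsets(word)`, each synset's `lemma_names()` and `hypernyms()`. The function name `expand_tokens` is also made up for illustration.

```python
# Hypothetical stand-in for WordNet: word -> (synonyms, hypernyms)
RELATED = {
    "car": (["auto", "automobile"], ["motor_vehicle"]),
    "dog": (["domestic_dog"], ["canine"]),
}

def expand_tokens(tokens, related=RELATED):
    """Return the tokens plus any synonyms and hypernyms found for them."""
    expanded = set(tokens)
    for token in tokens:
        synonyms, hypernyms = related.get(token, ([], []))
        expanded.update(synonyms)
        expanded.update(hypernyms)
    return expanded

print(sorted(expand_tokens(["my", "car"])))
# ['auto', 'automobile', 'car', 'motor_vehicle', 'my']
```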

## Feature Selection - Bi-Normal Separation

The full feature set contains roughly 22,000 features, which means that every tweet is now represented by a vector of 22,000 dimensions! With only about 15,000 tweets to work with, these vectors are very sparse, which makes it hard to get good classification results. This is expected given the huge and open vocabulary on Twitter. We therefore adopted a state-of-the-art feature selection method named Bi-Normal Separation (BNS), as it outperforms other well-known metrics such as Information Gain and Chi-squared, especially when the dataset is very skewed.

More information on bi-normal separation can be found here - http://www.hpl.hp.com/techreports/2007/HPL-2007-32R1.pdf. It is calculated using the formula:

$$\mathrm{BNS} = \left| F^{-1}(tpr) - F^{-1}(fpr) \right|$$

where
* pos and neg are the numbers of positive and negative tweets,
* tp = number of positive tweets containing the feature,
* fp = number of negative tweets containing the feature,
* fn = pos - tp,
* tn = neg - fp,
* tpr = P(feature | positive class) = tp/pos,
* fpr = P(feature | negative class) = fp/neg,
* $F^{-1}$ is the inverse Normal cumulative distribution function, as commonly available from statistical tables.

All the features are ranked by their BNS value: the higher the value, the better the feature.
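
As a small worked example of the score for a single feature, the sketch below uses `statistics.NormalDist` from the Python 3 standard library for the inverse Normal CDF (the notebook code uses `scipy.stats.norm.ppf` for the same purpose). The function name `bns_score` and the counts in the example call are made up. Like the notebook code, it skips features whose rate is exactly 0 or 1, where the inverse CDF is unbounded.

```python
from statistics import NormalDist

def bns_score(tp, fp, pos, neg):
    """Return |F^-1(tpr) - F^-1(fpr)|, or None when a rate is exactly 0 or 1."""
    tpr = tp / pos
    fpr = fp / neg
    if tpr in (0.0, 1.0) or fpr in (0.0, 1.0):
        return None  # inverse Normal CDF is unbounded at 0 and 1
    inverse_cdf = NormalDist().inv_cdf
    return abs(inverse_cdf(tpr) - inverse_cdf(fpr))

# Hypothetical counts: the feature occurs in 300 of 1,500 questions
# and in 50 of 13,500 non-questions
print(bns_score(tp=300, fp=50, pos=1500, neg=13500))
```

A feature that is equally common in both classes scores 0, and the score grows as the two rates move apart.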

In [None]:
import numpy as np
import json
import csv
import sys
from collections import Counter, defaultdict
from scipy.stats import norm
from operator import itemgetter


def load_features_and_labels(filename):
    print "\nStage: Load Features"
    #Feature1
    infile1 = open(filename,"r")
    feature1 = json.load(infile1)
    infile1.close()
    return feature1


##### Input feature_dict is a dictionary where the keys are tweet IDs and the value is a tuple/list of feature vector, label
##### Output is bns_features which is a dictionary of feature id (just an index number) and the BNS value.
def count_metrics(feature_dict, label_type, bns_filename):
    vals = feature_dict.values()
    tweet_features = [f_tuple["features"] for f_tuple in vals]
    tweet_labels = [f_tuple[label_type] for f_tuple in vals]
    num_features = len(tweet_features[0])
    label_counts = Counter(tweet_labels)
    pos = label_counts[1]
    neg = label_counts[0]
    print 'num_features',num_features
    print 'pos',pos
    print 'neg',neg

    true_pos = defaultdict(int)
    true_neg = defaultdict(int)
    false_pos = defaultdict(int)
    false_neg = defaultdict(int)

    true_pos_rate = {}
    false_pos_rate = {}

    np_labels = np.array(tweet_labels)
    for i in xrange(num_features):
        vals = [features[i] for features in tweet_features]
        # print vals
        np_vals = np.array(vals)
        true_pos[i] = np.sum(np.logical_and(np_vals, np_labels))
        false_pos[i] = np.sum(np.logical_and(np_vals, np.logical_not(np_labels)))
        true_pos_rate[i] = float(true_pos[i])/pos
        false_pos_rate[i] = float(false_pos[i])/neg

    bns_features = {}
    for i in xrange(num_features):
        if i%1000 == 0:
            print 'Calculating',i,'of',num_features,'...'
        print 'generating bns_features',i,'of',num_features,'...'
        if true_pos_rate[i] != 0 and false_pos_rate[i] != 0 and true_pos_rate[i] != 1 and false_pos_rate[i] != 1:
            bns_features[i] = abs(norm.ppf(true_pos_rate[i]) - norm.ppf(false_pos_rate[i]))

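    # Rank feature indices by BNS score, best first, and save the ranking
    top_features = sorted(bns_features, key=bns_features.get, reverse=True)
    with open(bns_filename, "w") as outfile:
        json.dump(top_features, outfile)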

if __name__ == "__main__":
    features = load_features_and_labels("lexical_tags_15k.json")
    print 'getting bns scores for is_question...'
    label_type = "is_question" 
    bns_filename = "lexical_bns_scores_"+label_type+".json"
    count_metrics(features, label_type, bns_filename)
    print 'getting bns scores for is_answerable...'
    label_type = "is_answerable" 
    bns_filename = "lexical_bns_scores_"+label_type+".json"
    count_metrics(features, label_type, bns_filename)




## Support Vector Machines

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. The optimal hyperplane is the one that achieves the maximum separation between the two classes. Since classifying tweets is a binary classification problem, we use a linear SVM.

<img src="img2.png" style="max-width:80%; ">

In [2]:
import json
from collections import defaultdict
from sklearn import svm
from sklearn import cross_validation
import csv
import numpy

def run_svm(feature, labels):
    '''Fit an SVM classifier with a linear kernel and return the mean
            accuracy over five-fold cross validation
    '''
    print "\nStage: SVM"

    X = feature
    Y = labels
    
    clf = svm.SVC(kernel='linear', C=1)
    scores = cross_validation.cross_val_score(clf, X, Y, cv=5)
    accuracy = scores.mean()

    return accuracy

def load_json(filename):
    print "\nStage: Load Features"
    #Feature1
    json_file = open(filename,"r")
    new_json = json.load(json_file)
    json_file.close()

    return new_json

def select_top_k_features(feature_dict, k, label_type,bns_scores):
    top_features = bns_scores[:k]
    tweet_features = []
    tweet_labels = []
    for tweet_id,features in feature_dict.iteritems():
        feature_list_n = numpy.array(features["features"])
        feature_list_k = feature_list_n[top_features]
        tweet_features.append(feature_list_k)
        tweet_labels.append(features[label_type])

    return (tweet_features, tweet_labels)

def run_svm_for_bns_features(feature_type):
    feature_dict = load_json(feature_type+"_features_15k.json")
    label_type_list = ["is_question","is_answerable"]
    result = {}

    print "Running for feature type",feature_type

    for label_type in label_type_list:
        print "Running svm for label:",label_type,"\n\n"
        bns_scores = load_json(feature_type+"_bns_scores_"+label_type+".json")
        result[label_type] = {}

        k =len(bns_scores)
        # print bns_scores
        print 'k:'+str(k)
        (features_k, labels_k) = select_top_k_features(feature_dict, k, label_type, bns_scores)
        accuracy = run_svm(features_k, labels_k)
        print 'accuracy:'+str(accuracy)
        result[label_type][k] = accuracy

        for k in xrange(200,len(bns_scores),200):
            print 'k:'+str(k)
            (features_k, labels_k) = select_top_k_features(feature_dict, k, label_type, bns_scores)
            accuracy = run_svm(features_k, labels_k)
            print 'accuracy:'+str(accuracy)
            result[label_type][k] = accuracy

    return result

def run_svm_for_all_features(feature_type):

    feature_dict = load_json(feature_type+"_features_15k.json")
    label_type_list = ["is_question","is_answerable"]
    result = {}

    for label_type in label_type_list:
        tweet_features = []
        tweet_labels = []
        for tweet_id,features in feature_dict.iteritems():
            feature_list_n = numpy.array(features["features"])
            tweet_features.append(feature_list_n)
            tweet_labels.append(features[label_type])

        result[label_type] = {}
        accuracy = run_svm(tweet_features, tweet_labels)
        result[label_type] = accuracy
    return result

    



## Results

For every feature type, an SVM was trained for each of the two labels, 'is_question' and 'is_answerable'. Each classifier was first run on the raw features without feature selection, and then again using only the top 'k' features in each category.
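
The top-'k' selection can be pictured with the following sketch (Python 3 syntax, plain lists instead of NumPy arrays). It assumes the ranking is a list of feature indices ordered from best to worst BNS score; the function name `select_top_k` and the toy data are hypothetical.

```python
def select_top_k(feature_vectors, ranked_indices, k):
    """Keep only the k best-ranked feature columns of every vector."""
    keep = ranked_indices[:k]
    return [[vector[i] for i in keep] for vector in feature_vectors]

vectors = [[1, 0, 0, 1], [0, 1, 0, 1]]   # hypothetical binary feature vectors
ranking = [3, 0, 1, 2]                   # hypothetical feature indices, best first

print(select_top_k(vectors, ranking, 2))   # [[1, 1], [1, 0]]
```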

<img src="img3.png">

<img src="img4.png">