# I'm positive! :)

I'm planning on doing a high-risk, high-reward (ideally low-investment) project on how people process emoticons in Tweets. Many studies have used emoticons to label Tweets for supervised(-ish) classifier training, a procedure known as ["Distant Supervision"](http://web.stanford.edu/~jurafsky/mintz.pdf). In its basic implementation:

1. A corpus of Tweets is collected
2. Tweets containing positive ( :) ) and negative emoticons ( :( ) are identified and labeled as positive or negative
3. The emoticons are then stripped from those Tweets and the classifier is trained on them
4. The classifier is applied to a testing set.

And this works pretty well and has the perk of not requiring you to get a bunch of people on mturk to label Tweets by hand! 
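The four steps above can be sketched in a few lines. This is a minimal illustration written for Python 3, not the pipeline from any particular paper: the emoticon lists, the `distant_label` helper, and the example Tweets are all made up for the sketch.

```python
import re

# Hypothetical emoticon sets for illustration; real studies use longer lists.
POSITIVE = [":)", ":-)", ":D"]
NEGATIVE = [":(", ":-("]

def _pattern(emoticons):
    # Escape each emoticon so characters like ")" are matched literally.
    return re.compile("|".join(re.escape(e) for e in emoticons))

POS_RE = _pattern(POSITIVE)
NEG_RE = _pattern(NEGATIVE)

def distant_label(tweet):
    """Return (text_without_emoticons, label), or None if the Tweet is unusable.

    label is 1 for positive and 0 for negative. Tweets with no emoticon, or
    with both kinds, are skipped because the emoticon gives no clear label.
    """
    has_pos = bool(POS_RE.search(tweet))
    has_neg = bool(NEG_RE.search(tweet))
    if has_pos == has_neg:
        return None
    stripped = NEG_RE.sub("", POS_RE.sub("", tweet))
    return " ".join(stripped.split()), int(has_pos)

# Made-up example Tweets.
corpus = [
    "Finally got tickets :)",
    "My flight is delayed again :(",
    "Heading to the store",
]
labeled = [pair for pair in map(distant_label, corpus) if pair is not None]
print(labeled)
```

The emoticon is removed before training so the classifier has to learn from the words rather than from the label itself.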

The project I'm planning is going to spice this up by adding some psychology into the mix: how do people remember Tweets if we randomly add positive and negative emoticons to them? If this technique works, it could have some useful implications for classification tasks in general. There are many situations without clearly labeled examples, such as topics where emoticons are considered inappropriate (e.g., deaths) or where the base rate of category expression is low ([inferring political preferences based on Tweets is hindered by most Republicans and Democrats who aren't elected officials posting very little political content](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/viewFile/6128/6347)).

But first, I want to establish how well a rudimentary classifier can do using 1) Tweets that have the correct labels (a reasonable upper bound) and 2) neutral Tweets that we've randomly given positive and negative emoticons.


Various utility packages


In [1]:
%matplotlib inline
import csv
import numpy as np
import matplotlib.pyplot as plt
import os
import random
import pickle

# Weird unicode error processing some of the tweets
import re
import sys


Machine learning and string processing packages


In [2]:
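try:
    from sklearn.model_selection import train_test_split  # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import train_test_split  # older releases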
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
import tweet_preprocess
from sklearn.naive_bayes import BernoulliNB


# 1. Setting up models and data processing functions



## Naive Bayes classifier

For this project, I'm going to initially use a simple Naive Bayes model that uses the binary presence or absence of a word. While tf-idf (term frequency-inverse document frequency) is useful in many situations, Tweets are short and meaningful words are unlikely to repeat within one, so we shouldn't get much more information using tf-idf.
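A quick way to check the "words rarely repeat" intuition is to compare term counts with binary presence on a few short texts. The sketch below (Python 3, standard library only) uses made-up Tweets and hypothetical helper names, and it only speaks to the term-frequency half of tf-idf, not the inverse-document-frequency weighting.

```python
from collections import Counter

# Made-up example Tweets, not drawn from the corpus used in this post.
tweets = [
    "so happy about the concert tonight",
    "worst delay ever",
    "love love love this band",
]

def count_features(tweet):
    """Term counts: how many times each word occurs in the Tweet."""
    return Counter(tweet.split())

def binary_features(tweet):
    """Binary presence: 1 for every word that occurs at least once."""
    return {word: 1 for word in set(tweet.split())}

# Every (Tweet, distinct word) pair, and how many of them get the same
# value under both encodings.
pairs = [(t, w) for t in tweets for w in set(t.split())]
agree = sum(count_features(t)[w] == binary_features(t)[w] for t, w in pairs)
print(agree, "of", len(pairs), "word features are identical under both encodings")
```

Only the repeated "love" differs between the two encodings here, which is the kind of case the binary model gives up on.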

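Concretely, given training and testing data, the code below selects the K most informative words (by a chi-squared test against the labels), trains a classifier on whether each of those words is present or not (binary), and then reports accuracy on the training set and on the held-out test set.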
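To make the feature-selection step less of a black box, here is a small standard-library sketch (Python 3) of scoring words with a chi-squared statistic and keeping the top K. It is meant to illustrate the idea of comparing observed with expected counts, not to reproduce scikit-learn's `chi2` exactly; `chi2_scores`, `select_k_best`, and the toy documents are hypothetical.

```python
def chi2_scores(docs, labels):
    """Chi-squared score of each word against a binary label.

    docs is a list of sets of words (binary presence); labels is a list of 0/1.
    Assumes both classes occur at least once. For each word, the observed
    number of documents containing it in each class is compared with the
    number expected if word and class were independent.
    """
    n = len(docs)
    class_prob = {c: labels.count(c) / n for c in (0, 1)}
    vocab = sorted(set().union(*docs))
    scores = {}
    for word in vocab:
        total = sum(word in d for d in docs)
        score = 0.0
        for c in (0, 1):
            observed = sum(word in d for d, y in zip(docs, labels) if y == c)
            expected = class_prob[c] * total
            score += (observed - expected) ** 2 / expected
        scores[word] = score
    return scores

def select_k_best(docs, labels, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    scores = chi2_scores(docs, labels)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Toy data: "great"/"bad" track the label, "show"/"day" do not.
docs = [{"great", "show"}, {"great", "day"}, {"bad", "show"}, {"bad", "day"}]
labels = [1, 1, 0, 0]
print(select_k_best(docs, labels, 2))
```

Words that appear equally often in both classes score zero and are dropped first, which is why small K values keep mostly sentiment-bearing words.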


In [11]:
def nb_classifier(tweet_train, tweet_test, label_train, label_test,K):
	# Run Naive Bayes classifier on binary presence

	## Vectorize words as binary presence/absence, fitting the vocabulary on the training Tweets only
	train_count_vec=CountVectorizer(binary=True)
	train_count_words=train_count_vec.fit_transform(tweet_train)
	test_count_words=train_count_vec.transform(tweet_test)

	# Reduce feature space to K best features
	ch2 = SelectKBest(chi2, k=K)
	X_train = ch2.fit_transform(train_count_words, label_train)
	X_test = ch2.transform(test_count_words)

	feature_names=train_count_vec.get_feature_names()
	k_feature_inds=[]
	for ii in ch2.get_support(indices=True): 
		k_feature_inds.append(feature_names[ii])

	# NB model
	nb_class=BernoulliNB()
	nb_class.fit(X_train,label_train)

	print 'Training accuracy'
	train_acc=nb_class.score(X_train,label_train)
	print train_acc

	print 'Testing accuracy'
	test_acc=nb_class.score(X_test,label_test)
	print test_acc

	return train_count_vec,k_feature_inds,test_acc
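
For readers who want to see what a Bernoulli Naive Bayes model computes, the following is a minimal standard-library sketch in Python 3. It is not the scikit-learn implementation used in this notebook; the function names, the toy documents, and the choice of Laplace smoothing with `alpha=1.0` are assumptions made for illustration.

```python
import math

def train_bernoulli_nb(docs, labels, vocab, alpha=1.0):
    """Estimate class priors and per-class word-presence probabilities.

    docs: list of sets of words; labels: list of 0/1; vocab: list of words.
    alpha is the Laplace smoothing constant. Assumes both classes occur.
    """
    model = {}
    for c in (0, 1):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        prior = n_c / len(docs)
        # P(word present | class), smoothed so it is never exactly 0 or 1.
        p_word = {w: (sum(w in d for d in class_docs) + alpha) / (n_c + 2 * alpha)
                  for w in vocab}
        model[c] = (prior, p_word)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior for a set of words."""
    best_class, best_score = None, -math.inf
    for c, (prior, p_word) in model.items():
        score = math.log(prior)
        for w, p in p_word.items():
            # Absent words contribute too: this is what makes it "Bernoulli".
            score += math.log(p if w in doc else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy training data (made up): 1 = positive, 0 = negative.
docs = [{"love", "show"}, {"great", "day"}, {"bad", "delay"}, {"sad", "day"}]
labels = [1, 1, 0, 0]
vocab = sorted(set().union(*docs))
model = train_bernoulli_nb(docs, labels, vocab)
print(predict(model, {"love", "day"}), predict(model, {"bad"}))
```

Because every vocabulary word contributes to the score whether or not it appears, shrinking the vocabulary to the K best features also shrinks the number of mostly uninformative "absent" terms in each prediction.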

## Load all tweets and return each Tweet's text and label (0 for negative, 1 for positive)

This code loads our [corpus](http://www.cs.york.ac.uk/semeval-2013/task2/). These Tweets have all been rated by hand. I already ran the code that downloads all the tweets from Twitter and stores them into a tsv (tweet-b-actual.tsv in the file below).

There are also some NLP steps in here to remove stopwords (e.g., the, a) and to stem the words (e.g., amazing->amaz, happy->happi).
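
The stopword step relies on a regular expression that only matches whole words. Here is a minimal Python 3 sketch of that idea using a tiny, hypothetical stopword list (the loader below uses NLTK's English list) and a single compiled pattern instead of one substitution per stopword. Stemming is left out because it needs NLTK.

```python
import re

# A tiny hypothetical stopword list, for illustration only.
STOPWORDS = ["the", "a", "is", "to"]

# One alternation, guarded by lookarounds so that only whole words match
# (for example, "a" must not match inside "amazing").
STOPWORD_RE = re.compile(
    r"(?<![a-z])(?:" + "|".join(re.escape(s) for s in STOPWORDS) + r")(?![a-z])"
)

def remove_stopwords(text):
    """Lowercase the text, drop stopwords, and collapse leftover whitespace."""
    cleaned = STOPWORD_RE.sub(" ", text.lower())
    return " ".join(cleaned.split())

print(remove_stopwords("The concert is going to be amazing"))
```

Without the lookarounds, removing "a" or "to" would also chew letters out of words like "amazing" and "tonight".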

In [44]:
## Load tweets and positive (1), negative (0) labels
def load_tweets(fname):
	# Load tweets text

	if not os.path.exists(fname):
		print 'Processing tweets...'

		# Identify stopwords to remove
		sw=stopwords.words('english')

		# create stemmer to stem words
		stemmer=SnowballStemmer("english")

		fold_name='semval'
		#file_name='tweet-a.tsv' # a is context rating
		file_name='tweet-b-actual.tsv' # b is message rating
		full_name=os.path.join(fold_name,file_name)
		tweet_text_temp,tweet_label_temp=tweet_preprocess.load_tweets(full_name)

		tweet_text=[]
		tweet_label=[]
		for ttt,tlt in zip(tweet_text_temp,tweet_label_temp):

			if tlt=='positive' or tlt=='negative':
				# Store string minus stopwords
				ttt=ttt.lower()
				for s in sw:
					ttt=re.sub('(?<![a-z])'+s+'(?![a-z])',' ',ttt)
					#ttt=str.replace(str(ttt),'\s'+s+'\s','') # Remove stopwords

				# stem words
				ttt_split=str.split(str(ttt),' ')
				new_ttt=''
				for it in ttt_split:
					new_ttt=new_ttt+' '+stemmer.stem(it)

				tweet_text.append(new_ttt)

				# labels
				if tlt=='positive':
					tweet_label.append(1)
				else:
					tweet_label.append(0)
		with open(fname, 'wb') as f:
			pickle.dump([tweet_text,tweet_label], f)
	else:
		print 'Loading processed tweets...'
		with open(fname, 'rb') as f:
			tweet_text,tweet_label = pickle.load(f)
	print 'Load complete'


	return tweet_text,tweet_label

fname='semval/loaded_tweets.pickle'
tweet_text_raw,tweet_label_raw=load_tweets(fname)

Loading processed tweets...
Load complete


As is typical for Twitter data, the corpus is skewed towards positive Tweets (perhaps because people are less inclined to post their complaints):

In [45]:
prob_positive=np.mean(np.array(tweet_label_raw))

print '\nBaserate of positive Tweets'
print prob_positive


Baserate of positive Tweets
0.732074263764
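
With a base rate this skewed, a classifier that always answers "positive" would already be right about 73% of the time, which makes raw accuracy hard to read. One common remedy, and the one the next cell takes, is to downsample the larger class until both are the same size. A minimal Python 3 sketch of that idea, using the standard library and made-up data (`balance_classes` is a hypothetical name, not the function defined below):

```python
import random

def balance_classes(texts, labels, seed=0):
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    neg = [t for t, y in zip(texts, labels) if y == 0]
    pos = [t for t, y in zip(texts, labels) if y == 1]
    n = min(len(neg), len(pos))
    neg_sample = rng.sample(neg, n)
    pos_sample = rng.sample(pos, n)
    return neg_sample + pos_sample, [0] * n + [1] * n

# Toy data with a similar kind of skew: 3 positive, 1 negative.
texts = ["great", "love it", "so fun", "awful"]
labels = [1, 1, 1, 0]

# Accuracy of always predicting the majority class.
majority_baseline = max(labels.count(0), labels.count(1)) / len(labels)
balanced_texts, balanced_labels = balance_classes(texts, labels)
print(majority_baseline)
print(sum(balanced_labels) / len(balanced_labels))
```

After balancing, chance performance is 0.5, so any accuracy above that reflects signal in the words rather than the class skew.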


In [50]:
def equal_size_samples(tweet_text,tweet_label):
    # Downsample the larger class so positive and negative Tweets are equally frequent
    neg_ind=np.array(tweet_label)==0
    num_neg=np.sum(neg_ind)

    tweet_neg=[tweet_text[i] for i in np.where(neg_ind)[0]]
    
    pos_ind=np.array(tweet_label)==1
    num_pos=np.sum(pos_ind)
    tweet_pos=[tweet_text[i] for i in np.where(pos_ind)[0]]
    
    num_samp=np.min([num_neg,num_pos])
    tweet_neg_samp=np.random.choice(tweet_neg,size=num_samp,replace=False)
    tweet_pos_samp=np.random.choice(tweet_pos,size=num_samp,replace=False)
  
    new_tweet_text=[]
    for tw in tweet_neg_samp: new_tweet_text.append(tw)
    for tw in tweet_pos_samp: new_tweet_text.append(tw)        

    new_tweet_label=[0]*num_samp+[1]*num_samp
    return new_tweet_text,new_tweet_label
    

    

tweet_text,tweet_label=equal_size_samples(tweet_text_raw,tweet_label_raw)  

prob_positive=np.mean(np.array(tweet_label))
print '\nBaserate of positive Tweets'
print prob_positive


Baserate of positive Tweets
0.5


## How well can we do with correctly labeled training examples?

Let's run our classifier on the correctly labeled data

In [52]:
num_samp=len(tweet_text)
tweet_text=tweet_text[0:num_samp]
tweet_label=tweet_label[0:num_samp]	

tweet_train, tweet_test, label_train, label_test = train_test_split(tweet_text, tweet_label, \
                                                                    test_size=0.33, random_state=42)

print '\nUsing labeled sentiment tweets to predict sentiment tweets'
num_K=[1,5,10,100,500,1000]
best_k=0
best_acc=0
best_classifier=None
best_feature_names=None
for K in num_K:
    print '\n'+str(K)+' Best Features'
    labeled_classifier,labeled_feature_names,labeled_acc= \
        nb_classifier(tweet_train, tweet_test, label_train, label_test,K)
    if labeled_acc>best_acc:
        best_k=K
        best_acc=labeled_acc
        best_classifier=labeled_classifier
        best_feature_names=labeled_feature_names

print '\nBest K: '+str(best_k)
print 'Test accuracy: '+str(best_acc)        


Using labeled sentiment tweets to predict sentiment tweets

1 Best Features
Training accuracy
0.542372881356
Testing accuracy
0.533453887884

5 Best Features
Training accuracy
0.619982158787
Testing accuracy
0.598553345389

10 Best Features
Training accuracy
0.655664585192
Testing accuracy
0.654611211573

100 Best Features
Training accuracy
0.800178412132
Testing accuracy
0.669077757685

1000 Best Features
Training accuracy
0.942908117752
Testing accuracy
0.660036166365

Best K: 100
Test accuracy: 0.669077757685


We get a modest improvement over the 50% baseline (about 67% test accuracy at K=100) using a pretty stupid model. In the future, I'll want to add bigrams and other features, but I seem to have at least the basic tools for answering my questions.
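
One thing worth reading off the numbers printed above is how the gap between training and testing accuracy grows with K. The short Python 3 snippet below just re-tabulates those printed accuracies; nothing is re-run, and the `results` dictionary is a hand-copied summary rather than output from the notebook.

```python
# (training accuracy, testing accuracy) as printed above, keyed by K.
results = {
    1: (0.542372881356, 0.533453887884),
    5: (0.619982158787, 0.598553345389),
    10: (0.655664585192, 0.654611211573),
    100: (0.800178412132, 0.669077757685),
    1000: (0.942908117752, 0.660036166365),
}

for k in sorted(results):
    train_acc, test_acc = results[k]
    print("K=%4d  test=%.3f  train-test gap=%.3f" % (k, test_acc, train_acc - test_acc))

best_k = max(results, key=lambda k: results[k][1])
print("Best K by test accuracy:", best_k)
```

The gap widens sharply past K=10 while test accuracy barely moves, which looks like overfitting to the training vocabulary. Note also that choosing K by test accuracy makes the reported best score slightly optimistic; a separate validation split would avoid that.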

In [53]:
print '\n'
print 'K='+str(best_k) +' best words'
print best_feature_names



K=100 best words
[u'10th', u'1st', u'8pm', u'amaz', u'award', u'awesom', u'bad', u'band', u'biggest', u'birthday', u'bit', u'bless', u'boehner', u'bro', u'cancel', u'cast', u'come', u'concert', u'day', u'dead', u'deal', u'delay', u'demitra', u'didn', u'die', u'dont', u'enjoy', u'excit', u'fail', u'fb', u'feb', u'feel', u'final', u'forward', u'free', u'fuck', u'fun', u'good', u'great', u'happen', u'happi', u'harvey', u'hey', u'homecom', u'honor', u'hope', u'http', u'instagr', u'instead', u'iv', u'japan', u'justinbieb', u'kill', u'kinda', u'know', u'like', u'll', u'loss', u'love', u'may', u'mon', u'movie', u'need', u'news', u'next', u'novemb', u'november', u'obama', u'pavol', u'perfect', u'pj', u'proud', u'put', u'right', u'riot', u'road', u'rock', u'rugbi', u'sad', u'saturday', u'school', u'see', u'serious', u'shit', u'show', u'side', u'sorri', u'stop', u'stupid', u'super', u'thank', u'ticket', u'tri', u'twat', u'use', u'weekend', u'win', u'window', u'without', u'wors']


And a lot of the words seem to make sense too: stems like "awesom", "happi", and "love" sit alongside "bad", "fail", and "stupid".

## How about with incorrectly labeled neutral examples?

And for a null baseline, let's try to do this classification with neutral Tweets that we assign labels to randomly. Our key manipulation is going to be trying to use people's interpretation of emoticons to get some sentiment-word signal out of these.
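
Why should random labels give chance performance? Because the labels are independent of the words, anything the model "learns" is noise that will not carry over to new data. The simulation below is a Python 3 sketch of that argument with synthetic data: the "Tweets" are random sets of word ids and the classifier is a crude per-word vote, both invented for illustration and unrelated to the corpus or model used in this post.

```python
import random

rng = random.Random(42)

# Hypothetical neutral "Tweets": each is a set of 5 word ids with no sentiment.
n_tweets, vocab_size = 2000, 50
tweets = [frozenset(rng.sample(range(vocab_size), 5)) for _ in range(n_tweets)]

# Labels assigned by coin flip, independent of the words.
labels = [int(rng.random() < 0.5) for _ in tweets]

train = (tweets[:1000], labels[:1000])
test = (tweets[1000:], labels[1000:])

# "Learn" a vote per word: +1 each time it co-occurs with label 1 in
# training, -1 each time it co-occurs with label 0.
votes = {}
for words, y in zip(*train):
    for w in words:
        votes[w] = votes.get(w, 0) + (1 if y == 1 else -1)

def predict(words):
    """Predict 1 if the summed word votes are positive, else 0."""
    return int(sum(votes.get(w, 0) for w in words) > 0)

def accuracy(split):
    xs, ys = split
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)

print("train accuracy:", accuracy(train))
print("test accuracy:", accuracy(test))
```

Training accuracy can drift above 0.5 because the votes are fit to the training noise, but held-out accuracy should hover around 0.5.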

## Load neutral tweets

In [54]:
def load_neutral_tweets(fname,prob_positive):
	# Load tweets text

	if not os.path.exists(fname):
		print 'Processing tweets...'

		# Identify stopwords to remove
		sw=stopwords.words('english')

		# create stemmer to stem words
		stemmer=SnowballStemmer("english")

		fold_name='semval'
		#file_name='tweet-a.tsv' # a is context rating
		file_name='tweet-b-actual.tsv' # b is message rating
		full_name=os.path.join(fold_name,file_name)
		tweet_text_temp,tweet_label_temp=tweet_preprocess.load_tweets(full_name)

		tweet_text=[]
		tweet_label=[]
		for ttt,tlt in zip(tweet_text_temp,tweet_label_temp):

			if (not tlt=='positive') and (not tlt=='negative'):
				# Store string minus stopwords
				ttt=ttt.lower()
				for s in sw:
					ttt=re.sub('(?<![a-z])'+s+'(?![a-z])',' ',ttt)
					#ttt=str.replace(str(ttt),'\s'+s+'\s','') # Remove stopwords

				# stem words
				ttt_split=str.split(str(ttt),' ')
				new_ttt=''
				for it in ttt_split:
					new_ttt=new_ttt+' '+stemmer.stem(it)

				tweet_text.append(new_ttt)

				# labels
				is_pos=random.random()<prob_positive
				if is_pos:
					tweet_label.append(1)
				else:
					tweet_label.append(0)
		with open(fname, 'wb') as f:
			pickle.dump([tweet_text,tweet_label], f)
	else:
		print 'Loading processed tweets...'
		with open(fname, 'rb') as f:
			tweet_text,tweet_label = pickle.load(f)


	return tweet_text,tweet_label

In [57]:

fname='semval/loaded_neutral_tweets.pickle'
neutral_tweet_text,neutral_tweet_label=load_neutral_tweets(fname,.5)

neutral_tweet_train, neutral_tweet_test, neutral_label_train, neutral_label_test = \
    train_test_split(neutral_tweet_text, neutral_tweet_label, test_size=0.33, random_state=42)

print '\nUsing neutral tweets to predict sentiment tweets'
num_K=[1,5,10,100,500,1000]
best_n_k=0
best_n_acc=0
best_n_classifier=None
best_n_feature_names=None
for K in num_K:
    print '\n'+str(K)+' Best Features'
    neutral_classifier,neutral_feature_names,neutral_acc= \
        nb_classifier(neutral_tweet_train, tweet_test, neutral_label_train, label_test,K)
    if neutral_acc>best_n_acc:
        best_n_k=K
        best_n_acc=neutral_acc
        best_n_classifier=neutral_classifier
        best_n_feature_names=neutral_feature_names

print '\nBest K: '+str(best_n_k)
print 'Test accuracy: '+str(best_n_acc)

Loading processed tweets...

Using neutral tweets to predict sentiment tweets

1 Best Features
Training accuracy
0.740088105727
Testing accuracy
0.497287522604

5 Best Features
Training accuracy
0.748409202154
Testing accuracy
0.499095840868

10 Best Features
Training accuracy
0.754772393539
Testing accuracy
0.499095840868

100 Best Features
Training accuracy
0.817914831131
Testing accuracy
0.48643761302

500 Best Features
Training accuracy
0.84140969163
Testing accuracy
0.493670886076

1000 Best Features
Training accuracy
0.844836025453
Testing accuracy
0.490054249548

Best K: 5
Test accuracy: 0.499095840868


And as we would expect, there is no signal: the model performs at chance on the test set.

# Interim Summary

Using a classifier trained on Tweets whose sentiment was labeled by hand, we were able to classify held-out Tweets as positive or negative at about 67% accuracy. A classifier trained on neutral Tweets with randomly assigned sentiments performed at chance. And now comes the challenging part. Let's see if we can make some lemonade out of this by adding some emoticons and having people remember the Tweets.