In [1]:
import pandas as pd
import numpy as np

import re, nltk
from sklearn.feature_extraction.text import CountVectorizer        
from nltk.stem.porter import PorterStemmer


In [2]:
train_df = pd.read_csv("Sentimental training dataset.txt", header=None, delimiter="\t", quoting=3)
train_df.columns = ["Sentiment","Text"]
test_df = pd.read_csv("Sentimental test dataset.txt", header=None, delimiter="\t", quoting=3)
test_df.columns = ["Text"]

In [3]:
train_df.head()

Unnamed: 0,Sentiment,Text
0,1,The Da Vinci Code book is just awesome.
1,1,this was the first clive cussler i've ever rea...
2,1,i liked the Da Vinci Code a lot.
3,1,i liked the Da Vinci Code a lot.
4,1,I liked the Da Vinci Code but it ultimatly did...


In [4]:
test_df.head()

Unnamed: 0,Text
0,""" I don't care what anyone says, I like Hillar..."
1,have an awesome time at purdue!..
2,"Yep, I'm still in London, which is pretty awes..."
3,"Have to say, I hate Paris Hilton's behavior bu..."
4,i will love the lakers.


Let's count how many examples we have for each sentiment class, and what percentage of the training set each class represents.

In [5]:
train_df.Sentiment.value_counts()

1    3995
0    3091
Name: Sentiment, dtype: int64

In [6]:
(train_df.Sentiment.value_counts()/len(train_df))*100

1    56.378775
0    43.621225
Name: Sentiment, dtype: float64

Let's calculate the average number of words per sentence.

In [7]:
np.mean([len(s.split(" ")) for s in train_df.Text])

10.886819079875812
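
One caveat about the word count above: `split(" ")` splits on every single space, so doubled or trailing spaces produce empty strings that are counted as words. The sketch below uses a made-up sentence (not from the dataset) to show the difference from `split()` with no argument. Whether this affects the average here depends on how clean the text is, which has not been checked.

```python
# Hypothetical example sentence with a double space and a trailing space.
sentence = "i liked  the Da Vinci Code "

# split(" ") splits on every single space, so empty strings are kept.
tokens_single_space = sentence.split(" ")

# split() with no argument splits on runs of whitespace and drops empties.
tokens_whitespace = sentence.split()

print(len(tokens_single_space), tokens_single_space)
print(len(tokens_whitespace), tokens_whitespace)
```

If the counts differ noticeably on the real data, `len(s.split())` is probably the safer measure.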

### Preparing a corpus
The class sklearn.feature_extraction.text.CountVectorizer in the wonderful scikit-learn Python library converts a collection of text documents to a matrix of token counts. This is just what we need to implement our bag-of-words linear classifier later on.
First we need to initialize the vectorizer. We want to remove punctuation, lowercase the text, remove stop words, and stem words. CountVectorizer can perform all of these steps if we pass the right parameter values; punctuation removal and stemming are handled by the custom tokenizer we pass in. We can do this as follows.

In [8]:
#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 85
)

In [9]:
corpus_data_features = vectorizer.fit_transform(train_df.Text.tolist() + test_df.Text.tolist())

In [10]:
corpus_data_features_nd = corpus_data_features.toarray()
corpus_data_features_nd.shape

(40138, 85)
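
To make the shape above concrete, here is a minimal sketch of what a matrix of token counts looks like, built by hand with the standard library. The three documents are made up, and the sketch skips stop-word removal, stemming and the `max_features` cap, so it illustrates the idea rather than reproducing what CountVectorizer does internally.

```python
from collections import Counter

# Hypothetical toy documents (not taken from the dataset).
docs = ["i love the lakers", "i hate the lakers", "love love love"]

# Build a sorted vocabulary over all documents.
vocab_toy = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab_toy])

print(vocab_toy)
for row in matrix:
    print(row)
```

In the real matrix each of the 40138 rows is a sentence (training and test together) and each of the 85 columns is one of the most frequent stems.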

Let's take a look at the words in the vocabulary:

In [11]:
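# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
vocab = vectorizer.get_feature_names_out().tolist()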
print(vocab)

['aaa', 'amaz', 'angelina', 'awesom', 'beauti', 'becaus', 'boston', 'brokeback', 'citi', 'code', 'cool', 'cruis', 'd', 'da', 'drive', 'francisco', 'friend', 'fuck', 'geico', 'good', 'got', 'great', 'ha', 'harri', 'harvard', 'hate', 'hi', 'hilton', 'honda', 'imposs', 'joli', 'just', 'know', 'laker', 'left', 'like', 'littl', 'london', 'look', 'lot', 'love', 'm', 'macbook', 'make', 'miss', 'mission', 'mit', 'mountain', 'movi', 'need', 'new', 'oh', 'onli', 'pari', 'peopl', 'person', 'potter', 'purdu', 'realli', 'right', 'rock', 's', 'said', 'san', 'say', 'seattl', 'shanghai', 'stori', 'stupid', 'suck', 't', 'thi', 'thing', 'think', 'time', 'tom', 'toyota', 'ucla', 've', 'vinci', 'wa', 'want', 'way', 'whi', 'work']
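
The vocabulary contains fragments such as 's', 't', 'm', 'd' and 've'. A likely explanation is the regular expression in `tokenize`, which replaces apostrophes with spaces and so splits contractions in two. The sketch below applies the same substitution to a made-up sentence; `str.split()` stands in for `nltk.word_tokenize` as a simplifying assumption, so the sketch has no dependencies.

```python
import re

# Hypothetical input sentence containing contractions.
text = "I've said it's great, I'm sure he'd agree and don't care"

# Same substitution as in tokenize(): every non-letter becomes a space.
cleaned = re.sub("[^a-zA-Z]", " ", text.lower())

# str.split() is a stand-in for nltk.word_tokenize (an assumption).
tokens = cleaned.split()

# Keep only the contraction fragments that also show up in the vocabulary.
fragments = [t for t in tokens if t in {"s", "t", "m", "d", "ve"}]
print(tokens)
print(fragments)
```

Stems such as 'wa', 'thi' and 'hi' probably arise in a similar way: the stemmer shortens "was", "this" and "his", and the shortened forms no longer match the entries in the English stop-word list.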


We can also print the counts of each word in the vocabulary as follows:

In [12]:
# Sum up the counts of each vocabulary word
dist = np.sum(corpus_data_features_nd, axis=0)
    
# For each, print the vocabulary word and the number of times it 
# appears in the data set
for tag, count in zip(vocab, dist):
    print(count,tag)

1179 aaa
485 amaz
1765 angelina
3170 awesom
2146 beauti
1694 becaus
2190 boston
2000 brokeback
423 citi
2003 code
481 cool
2031 cruis
439 d
2087 da
433 drive
1926 francisco
477 friend
452 fuck
1085 geico
773 good
571 got
1178 great
776 ha
2094 harri
2103 harvard
4492 hate
794 hi
2086 hilton
2192 honda
1098 imposs
1764 joli
1054 just
896 know
2019 laker
425 left
4080 like
507 littl
2233 london
811 look
421 lot
10334 love
1568 m
1059 macbook
631 make
1098 miss
1101 mission
1340 mit
2081 mountain
1207 movi
1220 need
459 new
551 oh
674 onli
2094 pari
1018 peopl
454 person
2093 potter
1167 purdu
2126 realli
661 right
475 rock
3914 s
495 said
2038 san
627 say
2019 seattl
1189 shanghai
467 stori
2886 stupid
4614 suck
1455 t
1705 thi
662 thing
1524 think
781 time
2117 tom
2028 toyota
2008 ucla
774 ve
2001 vinci
3703 wa
1656 want
932 way
547 whi
512 work


Before fitting a logistic regression to our training dataset, we need to split off part of it as an evaluation set. Our original test set has no labels, so the only way to evaluate the classifier is to hold out some of the labeled training data. We will use train_test_split:

In [13]:
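# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split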
    
# remember that corpus_data_features_nd contains all of our 
# original train and test data, so we need to exclude
# the unlabeled test entries
X_train, X_test, y_train, y_test  = train_test_split(
        corpus_data_features_nd[0:len(train_df)], 
        train_df.Sentiment,
        train_size=0.85, 
        random_state=1234)
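
As a rough illustration of what a hold-out split does, the sketch below shuffles row indices with a fixed seed and cuts them at 85%. The helper `holdout_split` is hypothetical and is not how scikit-learn implements `train_test_split`; the shuffled order will differ from the one above even with the same seed. Only the resulting set sizes are comparable.

```python
import random

def holdout_split(n_samples, train_fraction, seed):
    """Return (train_indices, eval_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round the training size down; the remainder is the evaluation set.
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

# 7086 is the number of labeled training sentences (3995 + 3091).
train_idx, eval_idx = holdout_split(7086, 0.85, seed=1234)
print(len(train_idx), len(eval_idx))
```

Fixing the seed makes the split reproducible, which is also the role `random_state` plays in the cell above.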




Now we are ready to train our classifier:

In [14]:
from sklearn.linear_model import LogisticRegression
    
log_model = LogisticRegression()
log_model = log_model.fit(X=X_train, y=y_train)
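
For intuition, a fitted logistic regression scores a sentence by taking a weighted sum of its word counts plus an intercept and passing the result through the sigmoid function. The sketch below shows that computation with made-up weights; they are not the coefficients learned above, which scikit-learn exposes as `log_model.coef_` and `log_model.intercept_`.

```python
import math

# Hypothetical weights for three vocabulary stems and a hypothetical
# intercept; these are NOT the coefficients learned by log_model.
weights = {"love": 2.0, "awesom": 1.5, "hate": -2.5}
intercept = 0.1

def predict_proba_positive(counts):
    """Return P(sentiment = 1) for a dict mapping stems to counts."""
    z = intercept + sum(weights.get(stem, 0.0) * n for stem, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

p_pos = predict_proba_positive({"love": 1})
p_neg = predict_proba_positive({"hate": 1, "awesom": 1})
print(round(p_pos, 3), round(p_neg, 3))
```

A probability above 0.5 is mapped to class 1 and anything below to class 0, which is what `predict` returns.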

Once the model has been trained, we can predict labels for the held-out evaluation set:

In [15]:
y_pred = log_model.predict(X_test)

In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.98      0.99      0.98       467
          1       0.99      0.98      0.99       596

avg / total       0.98      0.98      0.98      1063
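
The columns in the report can be reproduced by hand. The sketch below computes precision, recall, F1 and support for class 1 on a small set of made-up labels, as a way to read the table above rather than a replacement for `classification_report`.

```python
# Hypothetical true and predicted labels (not from the evaluation set).
y_true_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of the predicted 1s, how many are right
recall = tp / (tp + fn)     # of the true 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
support = tp + fn           # number of true examples of class 1

print(precision, recall, f1, support)
```

In the report above, support is 467 for class 0 and 596 for class 1, which add up to the 1063 sentences held out for evaluation.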



Finally, we can re-train our model with all the training data and use it for
sentiment classification with the original (unlabeled) test set.

In [17]:
# train classifier
log_model = LogisticRegression()
log_model = log_model.fit(X=corpus_data_features_nd[0:len(train_df)], y=train_df.Sentiment)
    
# get predictions
test_pred = log_model.predict(corpus_data_features_nd[len(train_df):])
    
# sample some of them
import random
spl = random.sample(range(len(test_pred)), 15)
    
# print text and labels
for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print(sentiment, text)

1 have an awesome time at purdue!..
0 Lakers suck under pressure and they don't win unless Kobe passes the ball which I doubt will happen in game 7.
1 Purdue's is quite ugly, however, I'm a boilermaker and always will be despite the outlook;
1 Three days at Purdue with three awesome people.
1 My Purdue Cal friends are awesome!..
0 i try to enjoy my time here in san francisco.......
0 I've been working on an article, and Antid Oto has been, er, so upset about the shitty Harvard plagiarizer that he hasn't been able to even look at keyboards.
1 Proud parents Tom Cruise, 44, and Katie Holmes, 27, were on the cover with their beautiful baby girl.
0 Stupid UCLA.
1 I also love the new rabbits, I still want an x-terra-but Luke's will do, and I kinda like Honda Elements, but they got a bad safety rating: (..
0 UCLA is stupid, I realized.
1 i love Kappo Honda, which is across the street..
0 gawssh i hate london i hope he blows up and his guts fly everywhere and then birds eat his guts.
0 UCLA is