In [1]:
"""
Sentiment analysis has two fairly distinct stages. The first processes natural language, and the second trains a model. 
The first stage turns the text into a form where, by the time we are ready to train the model, we already know which 
variables it takes as inputs. The model then learns to determine the sentiment of a piece of text from those variables. 

For the model part we will introduce and use linear models. They are not the most accurate methods, 
but they are simple enough that their results can be interpreted, as we will see. Linear methods define the output 
variable in terms of a linear combination of the input variables. In this case we will use logistic regression.

Finally we will need some data to train our model. For this we will use data from the Kaggle competition UMICH SI650. 
"""
import pandas as pd
import numpy as np
import re, nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

# define local file names
test_data_file_name = '6_1_testdata.txt'
train_data_file_name = '6_1_training.txt'
#load files into data frames for processing.
test_data_df = pd.read_csv(test_data_file_name, header=None, delimiter="\t", quoting=3)
test_data_df.columns = ["Text"]
train_data_df = pd.read_csv(train_data_file_name, header=None, delimiter="\t", quoting=3)
train_data_df.columns = ["Sentiment","Text"]
print("Shape of Test data-" + str(test_data_df.shape) + "\n" + "Shape of training data - " + str(train_data_df.shape))
print(train_data_df.head())
print(test_data_df.head())
print("No of labels we have for each sentiment class - \n" + str(train_data_df.Sentiment.value_counts()))
print("AVG no. of words per sentence - "+str(np.mean([len(s.split(" ")) for s in train_data_df.Text])))
"""
We will process our text sentences and create a corpus, then extract the most frequent words and use them as input 
variables for our classifier. Only basic transformations are needed, because the requirements of a bag-of-words 
classifier are minimal. We just need to count words, so the process reduces to simplifying and unifying terms and then 
counting them. The simplification mostly consists of removing punctuation, lowercasing, removing stop words, and 
reducing words to their lexical roots (i.e. stemming).
    The class sklearn.feature_extraction.text.CountVectorizer in the wonderful scikit-learn Python library converts a 
collection of text documents to a matrix of token counts. This is just what we need for our bag-of-words 
linear classifier.
    First we need to initialise the vectorizer. Lowercasing and stop-word removal are handled by CountVectorizer itself 
through its lowercase and stop_words parameters. Removing punctuation and stemming are done by the custom tokenize 
function that we pass as the tokenizer parameter. We use the Porter stemmer for stemming. Note that stop words are 
filtered after tokenization, so stemmed forms such as 'thi' and 'wa' no longer match the stop-word list and remain in 
the vocabulary.
    This approach is called a bag-of-words model. In this kind of model we simplify each document to a
multiset of term frequencies. That means that, for our model, a document's sentiment tag depends on which words appear 
in that document, discarding grammar and word order but keeping multiplicity. This is what the code below does: it uses 
our text entries to build term frequencies. We end up with the same entries in our dataset but, instead of being defined 
by the whole text, they are now defined by the counts of the most frequent words in the whole corpus (the 85 most 
frequent, set by max_features). We then use these vectors as features to train a classifier.
"""
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    # reduce every token to its stem
    return [stemmer.stem(item) for item in tokens]
def tokenize(text):
    # remove non letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems
########
vectorizer = CountVectorizer(
    analyzer = 'word', tokenizer = tokenize, lowercase = True, stop_words = 'english', max_features = 85 )
corpus_data_features = vectorizer.fit_transform(train_data_df.Text.tolist() + test_data_df.Text.tolist())
corpus_data_features_nd = corpus_data_features.toarray()
print(corpus_data_features_nd.shape)
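# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
vocab = vectorizer.get_feature_names_out().tolist()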
print(vocab)

Shape of Test data-(33052, 1)
Shape of training data - (7086, 2)
   Sentiment                                               Text
0          1            The Da Vinci Code book is just awesome.
1          1  this was the first clive cussler i've ever rea...
2          1                   i liked the Da Vinci Code a lot.
3          1                   i liked the Da Vinci Code a lot.
4          1  I liked the Da Vinci Code but it ultimatly did...
                                                Text
0  " I don't care what anyone says, I like Hillar...
1                  have an awesome time at purdue!..
2  Yep, I'm still in London, which is pretty awes...
3  Have to say, I hate Paris Hilton's behavior bu...
4                            i will love the lakers.
No of labels we have for each sentiment class - 
1    3995
0    3091
Name: Sentiment, dtype: int64
AVG no. of words per sentence - 10.8868190799
(40138, 85)
['aaa', 'amaz', 'angelina', 'awesom', 'beauti', 'becaus', 'boston', 'brokeba
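
The bag-of-words steps described above can be sketched without scikit-learn. This is a minimal illustration, not what CountVectorizer does internally: the three-sentence corpus, the tiny stop-word set, and the helper names simple_tokenize and bag_of_words are all made up for the example, and stemming is left out to keep it free of dependencies.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list; the notebook uses scikit-learn's built-in 'english' list.
STOP_WORDS = {"i", "the", "a", "is", "it", "and"}

def simple_tokenize(text):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    words = re.sub("[^a-zA-Z]", " ", text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(docs, max_features):
    """Return (vocabulary, count vectors) using the max_features most frequent terms."""
    tokenized = [simple_tokenize(d) for d in docs]
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Keep the most frequent terms (ties broken alphabetically), then order them alphabetically
    top = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    vocabulary = sorted(top)
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One count per vocabulary term; words outside the vocabulary are ignored
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = [
    "I love love the Da Vinci Code.",
    "I hate the movie, it sucks.",
    "The movie is awesome and I love it!",
]
vocabulary, vectors = bag_of_words(docs, max_features=3)
print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one entry per vocabulary term, so a word repeated in a sentence shows up as a count greater than one, while word order is lost.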

In [10]:
# Sum up the counts of each vocabulary word
dist = np.sum(corpus_data_features_nd, axis=0)
    
# For each vocabulary word, print the number of times it appears in the data set
for tag, count in zip(vocab, dist):
    print(str(count) + ':' + str(tag))

1179:aaa
485:amaz
1765:angelina
3170:awesom
2146:beauti
1694:becaus
2190:boston
2000:brokeback
423:citi
2003:code
481:cool
2031:cruis
439:d
2087:da
433:drive
1926:francisco
477:friend
452:fuck
1085:geico
773:good
571:got
1178:great
776:ha
2094:harri
2103:harvard
4492:hate
794:hi
2086:hilton
2192:honda
1098:imposs
1764:joli
1054:just
896:know
2019:laker
425:left
4080:like
507:littl
2233:london
811:look
421:lot
10334:love
1568:m
1059:macbook
631:make
1098:miss
1101:mission
1340:mit
2081:mountain
1207:movi
1220:need
459:new
551:oh
674:onli
2094:pari
1018:peopl
454:person
2093:potter
1167:purdu
2126:realli
661:right
475:rock
3914:s
495:said
2038:san
627:say
2019:seattl
1189:shanghai
467:stori
2886:stupid
4614:suck
1455:t
1705:thi
662:thing
1524:think
781:time
2117:tom
2028:toyota
2008:ucla
774:ve
2001:vinci
3703:wa
1656:want
932:way
547:whi
512:work


In [13]:
# Bag of Words Representation
print(corpus_data_features_nd)
#print(corpus_data_features)

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


In [5]:
"""
To perform logistic regression in Python we use scikit-learn's LogisticRegression. But first we split our training data to obtain a held-out evaluation set.
""" 
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

# corpus_data_features_nd contains both the training and the test rows, so we keep only the labelled training rows here
X_train, X_test, y_train, y_test  = train_test_split(
        corpus_data_features_nd[0:len(train_data_df)], 
        train_data_df.Sentiment,
        train_size=0.85, 
        random_state=1234)
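
As a sanity check on the split sizes, the arithmetic can be worked out by hand. This sketch assumes the training count is the floor of train_size times the number of rows and that the evaluation set gets the remainder, which is my understanding of how train_test_split behaves when only train_size is given. The shuffle uses Python's random module purely for illustration and will not reproduce scikit-learn's actual row assignment.

```python
import math
import random

n_samples = 7086      # rows in the labelled training file, as printed above
train_size = 0.85

# Assumed rounding rule: floor for the training part, remainder for evaluation
n_train = math.floor(train_size * n_samples)
n_eval = n_samples - n_train
print(n_train, n_eval)

# Illustrative shuffled split of row indices (not scikit-learn's permutation)
rng = random.Random(1234)
indices = list(range(n_samples))
rng.shuffle(indices)
train_idx, eval_idx = indices[:n_train], indices[n_train:]
```

Under that assumption the evaluation set has 1063 rows, which agrees with the total support (467 + 596) shown in the classification report.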

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train our classifier on the training split
log_model = LogisticRegression()
log_model = log_model.fit(X=X_train, y=y_train)
# Predict the labels of the held-out evaluation split
y_pred = log_model.predict(X_test)
# classification_report calculates several types of predictive scores
# (precision, recall, F1 and support per class) for a classification model
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.98      0.99      0.98       467
          1       0.99      0.98      0.99       596

avg / total       0.98      0.98      0.98      1063
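
The columns of the report can be computed by hand from true and predicted labels. The sketch below uses a small made-up pair of label lists (not the notebook's data) and a hypothetical helper, per_class_scores, to show how precision, recall, F1 and support are defined for one class.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)   # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)   # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)   # missed instances of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true instances of this class
    return precision, recall, f1, support

# Made-up labels for illustration only
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 1, 0]

for label in (0, 1):
    precision, recall, f1, support = per_class_scores(y_true_toy, y_pred_toy, label)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

Precision asks how many of the rows predicted as a class really belong to it, while recall asks how many rows of that class were found. F1 is their harmonic mean.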



In [19]:
# Finally, we can re-train our model with all the training data and use it for sentiment classification with the original (unlabeled) test set.
# train classifier
log_model = LogisticRegression()
log_model = log_model.fit(X=corpus_data_features_nd[0:len(train_data_df)], y=train_data_df.Sentiment)
    
# get predictions
test_pred = log_model.predict(corpus_data_features_nd[len(train_data_df):])
    
# sample some of them
import random
spl = random.sample(range(len(test_pred)), 15)
    
# print text and labels
for text, sentiment in zip(test_data_df.Text[spl], test_pred[spl]):
    print(str(sentiment) + '\t' + text)

0	As much as I hate Tom Cruise..
0	All I can say is Northwest airlines are a bunch of idiots as well as Delta and if at all possible I won't be flying with them anymore...
1	i miss AAA...
1	Well, my job at GEICO has been great so far.
1	i love mit and harvard both..
0	oh look, it's a parody of those crappy mastercard ads..
1	I LOVE MY TOYOTA COROLLA S! Except...
0	TBS's new stuff sucks, AAA's stuff is boring, and I basically only like Graduation Day and Beating Heart Baby from Head Automatica...
1	I just switched to Allstate insurance, because I miss David " Allstate " Palmer..
0	Bottomline American Airlines sucks and Jet Blue rocks.
1	" I LOVE SHANGHAI VERY MUCH!
0	Which makes me think I need AAA more than AA...
1	My new desk is in a windowed corner with an awesome view of Seattle and the Olympic Mountains.
1	Oh my god I LOVE Pommes mit Mayo.
1	mostly these stupid guys with their shitty honda civics trying to sup them up like they are badass.
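
To make the "linear combination" from the introduction concrete: logistic regression scores a document as a weighted sum of its word counts plus an intercept, and passes that score through the logistic (sigmoid) function to get a probability for class 1. The weights and intercept below are invented for illustration. They are not the coefficients learned by log_model, which are available as log_model.coef_ and log_model.intercept_ after fitting. The helper names are also hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(counts, weights, intercept):
    """P(sentiment = 1) for one document, given its word counts."""
    score = intercept + sum(weights[word] * n for word, n in counts.items())
    return sigmoid(score)

# Hypothetical weights: positive values push towards class 1, negative towards class 0
weights = {"love": 2.0, "awesom": 1.5, "hate": -2.0, "suck": -2.5}
intercept = 0.0

doc_counts = {"love": 1, "suck": 1}   # stems counted in one made-up sentence
p = predict_proba_positive(doc_counts, weights, intercept)
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

Because each weight belongs to one vocabulary term, the sign and size of a weight can be read as that term's contribution to the score, which is what makes a linear model comparatively easy to interpret.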
