# Logistic Regression for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

<br>
<br>

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset that was collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), and the dataset contains 25,000 positive and 25,000 negative reviews.
For simplicity, I assembled the reviews in a single CSV file.


In [1]:
import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

|       | review | sentiment |
|-------|--------|-----------|
| 49995 | OK, lets start with the best. the building. al... | 0 |
| 49996 | The British 'heritage film' industry is out of... | 0 |
| 49997 | I don't even know where to begin on this one. ... | 0 |
| 49998 | Richard Tyler is a little boy who is scared of... | 0 |
| 49999 | I waited long to watch this movie. Also becaus... | 1 |


Let us shuffle the rows of the dataset (the reviews together with their class labels).

In [2]:
## uncomment these lines if you have downloaded the original file:
#import numpy as np
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
#df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)

<br>
<br>

## Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. Furthermore, we will use a few simple regular expressions to remove HTML markup and all non-letter characters except "emoticons," convert the text to lower case, remove stopwords, and apply the Porter stemming algorithm to reduce the words to their root form.

In [3]:
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

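def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace non-word characters by spaces and re-append the emoticons (without "noses")
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]  # remove stopwords
    tokenized = [porter.stem(w) for w in text]  # apply Porter stemming
    return tokenized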

## This tokenizer removes HTML markup, punctuation, and stopwords, but does not apply stemming:
## the words are kept as they are for the word embeddings
def tokenizer_we( text ) :
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]

    return text
    

Let's give it a try:

In [4]:
tokenizer_we('This :) is a <a> test! :-)</br>')

['test', ':)', ':)']
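
To see what each step of the tokenizer contributes, the following sketch applies the same regular expressions one at a time. It uses only the standard library; `TOY_STOP` is a tiny hypothetical stopword set standing in for NLTK's English list, and `clean_steps` is a name made up for this illustration.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list.
TOY_STOP = {'this', 'is', 'a'}

def clean_steps(text):
    """Return intermediate results of the tokenizer's cleaning steps."""
    # 1. drop HTML tags
    no_html = re.sub(r'<[^>]*>', '', text)
    # 2. collect emoticons such as :) ;-( =D
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html.lower())
    # 3. turn every run of non-word characters into a single space
    words = re.sub(r'[\W]+', ' ', no_html.lower())
    # 4. re-append the emoticons, with the "nose" character removed
    merged = words + ' '.join(emoticons).replace('-', '')
    # 5. remove stopwords
    tokens = [w for w in merged.split() if w not in TOY_STOP]
    return no_html, emoticons, tokens

no_html, emoticons, tokens = clean_steps('This :) is a <a> test! :-)</br>')
print(no_html)
print(emoticons)
print(tokens)
```

Note that step 4 appends the emoticons without a separator. This works here because the cleaned text ends in a space, but a text that ends in a word character would get its first emoticon glued to the last word.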

## Learning (SciKit)

First, we define a generator that returns the document body and the corresponding class label:

In [5]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

To confirm that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [6]:
next(stream_docs(path='shuffled_movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f
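
`stream_docs` relies on every line ending in a comma, a one-digit label, and a newline, and it returns the text with its CSV quoting intact (as the output above shows). A sketch of a more defensive variant, assuming the file is a standard CSV with a header row and the review in the first column, could use the standard-library `csv` module. The name `stream_docs_csv` and the sample reviews are made up for this illustration.

```python
import csv
import io

def stream_docs_csv(handle):
    """Yield (text, label) pairs from an open CSV file with a header row."""
    reader = csv.reader(handle)
    next(reader)                      # skip the header row
    for row in reader:
        text, label = row[0], int(row[1])
        yield text, label

# Small in-memory stand-in for shuffled_movie_data.csv (made-up reviews).
sample = io.StringIO(
    'review,sentiment\n'
    '"A ""good"" movie, really.",1\n'
    'Dull and far too long.,0\n'
)
for text, label in stream_docs_csv(sample):
    print(label, text)
```

With a real file, the handle would be opened with `open(path, newline='', encoding='utf-8')` so that the `csv` module can handle quoted line breaks itself.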

Having confirmed that our `stream_docs` function works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [7]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
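
As written, `get_minibatch` raises `StopIteration` if the stream runs out before `size` documents have been read. A minimal sketch of a variant that returns a shorter final batch instead, using `itertools.islice` (the name `get_minibatch_safe` is hypothetical):

```python
from itertools import islice

def get_minibatch_safe(doc_stream, size):
    """Return up to `size` documents and labels; fewer if the stream ends."""
    docs, y = [], []
    for text, label in islice(doc_stream, size):
        docs.append(text)
        y.append(label)
    return docs, y

# Toy stream of five (text, label) pairs.
stream = iter([('doc %d' % i, i % 2) for i in range(5)])
print(get_minibatch_safe(stream, 3))
print(get_minibatch_safe(stream, 3))   # only two documents are left
```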

Next, we will make use of the "hashing trick" through scikit-learn's [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).

In [8]:
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)
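
The idea behind the hashing trick can be sketched in a few lines: each token is hashed to a column index, so no vocabulary has to be kept in memory. The sketch below is an illustration only. It uses `hashlib.md5` as a stand-in hash function, whereas scikit-learn's `HashingVectorizer` uses MurmurHash3 together with sign alternation and normalization, so the indices and values will not match. The function name `hashed_bag_of_words` is hypothetical.

```python
import hashlib
from collections import Counter

def hashed_bag_of_words(tokens, n_features=2**21):
    """Map tokens to a sparse {column index: count} bag-of-words."""
    counts = Counter()
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        index = int(digest, 16) % n_features   # column for this token
        counts[index] += 1
    return dict(counts)

vec = hashed_bag_of_words(['good', 'movie', 'good'])
print(vec)
```

Two different tokens can land in the same column (a collision); a large `n_features`, such as the `2**21` used above, keeps this unlikely.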



## Exercise 1
Define features based on word embeddings (pre-trained word2vec vectors can be used). Choose a suitable embedding dimension d and sequence length.

### Training the embeddings on our corpus

We use gensim, which can be installed with `conda install -c anaconda gensim`.

In [9]:

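# import gensim to train our embeddings
# (this cell uses the gensim 3.x API; gensim 4 renamed `size` to `vector_size` and
# `iter` to `epochs`, and replaced `LabeledSentence` with `TaggedDocument`)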
import gensim
from gensim.models.word2vec import Word2Vec
from gensim.models.doc2vec import LabeledSentence

## Define parameters
EMBEDDINGS_DIM = 300
MIN_COUNT = 10
MODEL_NAME = 'w2v_model'

def labelize( docs, labels, label_type ) :
    _labelized = []
    _len = len( docs )
    
    for i in range( _len ) :
        _label = '%s_%s' % ( label_type, labels[i] )
        _doc_tokenized = tokenizer_we( docs[i] )
        _labelized.append( LabeledSentence( _doc_tokenized, [_label] ) )
        
    return _labelized

_docs, _labels = get_minibatch( stream_docs(path='shuffled_movie_data.csv'), 50000 )

_sentences = labelize( _docs, _labels, 'TRAIN' )

USE_PRETRAINED_EMBEDDINGS = True

_model = None

if USE_PRETRAINED_EMBEDDINGS :
    print( 'using pretrained embeddings' )
    _model = Word2Vec.load( MODEL_NAME )
else :
    print( 'training embeddings' )
    _model = Word2Vec( size = EMBEDDINGS_DIM, min_count = MIN_COUNT )
    _model.build_vocab( [ _sentence.words for _sentence in _sentences ] )
    _model.train( [ _sentence.words for _sentence in _sentences ], 
                  total_examples = len( _sentences ), 
                  epochs = _model.iter )
    _model.save( MODEL_NAME )


using pretrained embeddings


In [10]:
## Test the model
#print( np.array( _model['good'] ) )
print( _model.most_similar( 'good' ) )

[('decent', 0.739240288734436), ('alright', 0.6402304172515869), ('great', 0.6112497448921204), ('ok', 0.5998105406761169), ('okay', 0.5934993028640747), ('fine', 0.5813044309616089), ('nice', 0.5799181461334229), ('bad', 0.5750390887260437), ('excellent', 0.5701903104782104), ('cool', 0.5602899789810181)]
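
The scores returned by `most_similar` are cosine similarities between word vectors. A small sketch with made-up three-dimensional vectors shows the computation; the `toy_vectors` table is hypothetical and not taken from the trained model, and the functions are simplified stand-ins for what gensim does.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings, for illustration only.
toy_vectors = {
    'good':   [0.9, 0.1, 0.0],
    'decent': [0.8, 0.2, 0.1],
    'boring': [-0.7, 0.3, 0.2],
}

def most_similar(word, vectors, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar('good', toy_vectors))
```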


### Create features from the embeddings

Here we represent each document by a single vector: the average of the embedding vectors of its words. Words that are missing from the embedding vocabulary contribute a zero vector.

In [11]:
## Create the features from the trained embeddings

def vectorize( doc, w2vModel ) :

    _doc_tokenized = tokenizer_we( doc )
    _docVec = np.zeros( w2vModel.vector_size )
    _len = len( _doc_tokenized )

    for _word in _doc_tokenized :
        # get the embedding from the model
        if _word in w2vModel :
            _wvec = np.array( w2vModel[_word] )
            # add it to the doc vector
            _docVec = _docVec + _wvec

    # average over all tokens; an empty document keeps the zero vector
    if _len > 0 :
        _docVec = _docVec / _len

    return _docVec

#print( vectorize( _docs[1], _model ) )
#print( len( _docs ) )
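
The averaging step can be illustrated without gensim. The sketch below uses a hypothetical two-dimensional embedding table, `toy_embeddings`. Unlike `vectorize`, it divides by the number of words actually found in the table, which is one possible design choice and not the method used in this notebook.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    total = [0.0] * dim
    found = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue                     # out-of-vocabulary word: skip it
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    if found == 0:
        return total                     # no known words: all-zero vector
    return [t / found for t in total]

# Made-up embeddings, for illustration only.
toy_embeddings = {'good': [1.0, 0.0], 'movie': [0.0, 1.0]}
print(average_embedding(['good', 'movie', 'unseenword'], toy_embeddings, dim=2))
```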

Using the `SGDClassifier` from scikit-learn, we can instantiate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent.

In [12]:
from sklearn.linear_model import SGDClassifier
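# loss='log' selects logistic regression (newer scikit-learn releases call it
# 'log_loss' and replace n_iter with max_iter)
clf = SGDClassifier(loss='log', random_state=1, n_iter=2)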
doc_stream = stream_docs(path='shuffled_movie_data.csv')
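
To make the learning rule concrete, here is a minimal pure-Python sketch of logistic regression trained by stochastic gradient descent on the log loss. It is not scikit-learn's implementation, which adds regularization and a learning-rate schedule. The function names, the fixed learning rate `lr`, and the toy data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, b, x, y, lr=0.1):
    """One SGD update on the log loss for a single example (x, y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    error = p - y                       # gradient of the loss w.r.t. the logit
    w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    b = b - lr * error
    return w, b

# Toy data: the label is 1 exactly when the first feature is 1.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
w, b = [0.0, 0.0], 0.0
for _ in range(200):                    # 200 passes over the toy data
    for x, y in data:
        w, b = sgd_step(w, b, x, y)

predictions = [int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5)
               for x, _ in data]
print(predictions)
```

Calling `partial_fit` on a mini-batch plays the same role as the inner loop here: the weights are updated from the examples seen so far without revisiting earlier data.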


## Exercise 2 

*  Define a neural network with at least three layers.
*  Define its structure (number of hidden neurons, etc.).
*  Define a nonlinear activation function for the hidden layers.
*  Define a suitable loss function for binary classification.
*  Implement the backpropagation algorithm for this structure.
*  Train the model using SGD.

In [13]:
## Neural Network impl

def sigmoid( v ) :
    return 1. / ( 1. + np.exp( -v ) )

def d_sigmoid( v ) :
    return sigmoid( v ) * ( 1. - sigmoid( v ) )

def tanh( v ) :
    return np.tanh( v )

def d_tanh( v ) :
    return 1. - tanh( v ) ** 2

# note: this is a "leaky" ReLU with a small slope of 0.001 for negative inputs
def relu( v ) :
    return np.where( v > 0, v, 0.001 * v )

def d_relu( v ) :
    return np.where( v > 0, 1, 0.001 )

ACTIVATION_SIGMOID = 'sigmoid'
ACTIVATION_TANH    = 'tanh'
ACTIVATION_RELU    = 'relu'

class LLayer :
    
    def __init__( self, sIn, sOut, activation = ACTIVATION_SIGMOID ) :
        
        self.sIn = sIn
        self.sOut = sOut
        
        self.alpha = 0.01
        
        self.g = sigmoid
        self.dg = d_sigmoid
        
        self.z = np.zeros( ( sOut, 1 ) )
        self.a = np.zeros( ( sOut, 1 ) )
        self.a_1 = np.zeros( ( sIn, 1 ) )
        
        if activation == ACTIVATION_TANH :
            self.g = tanh
            self.dg = d_tanh
            print( 'tanh activation chosen' )
        elif activation == ACTIVATION_RELU :
            self.g = relu
            self.dg = d_relu
            print( 'relu activation chosen' )
        else :
            print( 'sigmoid activation chosen' )
        
        self.W = np.random.rand( sOut, sIn ) * 0.01
        self.b = np.random.rand( sOut, 1 ) * 0.01
    
    def forward( self, a_l_1 ) :
        
        self.a_1 = a_l_1
        self.z = np.dot( self.W, self.a_1 ) + self.b
        self.a = self.g( self.z )
        
        return self.a
        
    def backprop( self, da_l ) :
        #print( 'z: ', self.z )
        dz_l = da_l * self.dg( self.z )
        #print( 'dz_l: ', dz_l )
        dW_l = np.dot( dz_l, self.a_1.T )
        db_l = dz_l
        da_l_1 = np.dot( self.W.T, dz_l )
        
        self.W = self.W - self.alpha * dW_l
        self.b = self.b - self.alpha * db_l
        
        #print( 'w,b?' )
        #print( self.W )
        #print( self.b )
        #print( '????' )
        
        return da_l_1
    
    
    
class LNetwork :
    
    def __init__( self, iDim, oDim, alpha = 0.5, nEpochs = 10 ) :
        
        self.iDim = iDim
        self.oDim = oDim
        self.nEpochs = nEpochs
        self.alpha = alpha
        self.hlayers = []
        
        self.nlayers = 0
        
    def addLayer( self, layer ) :
        
        # check that this layer matches the previous
        
        if self.nlayers == 0 :
            if layer.sIn != self.iDim :
                print( 'LNetwork::addLayer> layers mismatch - input layer' )
        elif self.hlayers[-1].sOut != layer.sIn :
            print( 'LNetwork::addLayer> layers mismatch' )
        
        self.hlayers.append( layer )
        layer.alpha = self.alpha
        self.nlayers += 1
    
    def _trainEpoch( self, X, Y ) :
        
        _n = X.shape[0]
        _m = X.shape[1]
        
        _cost = 0
        
        for i in range( _m ) :
            
            x = np.array( [X[:,i]] ).T
            y = Y[i]
            
            # forward pass
            a = x
            
            for l in range( self.nlayers ) :
                a = self.hlayers[l].forward( a )
            #print( 'a: ', a )
            # compute gradient at output
            _cost += -y * np.log( a ) - ( 1 - y ) * np.log( 1 - a )
            #print( 'cost: ', _cost )
            da = -( y / a ) + ( 1 - y ) / ( 1 - a )
            #print( 'da: ', da )
            # backpropagate the gradient
            for layer in reversed( self.hlayers ) :
                da = layer.backprop( da )
        
        print( 'cost: ', _cost )
    
    def train( self, X, Y ) :
        print( 'started training' )
        _n = X.shape[0]
        _m = X.shape[1]
        
        if _n != self.iDim :
            print( 'LNetwork::train> dimension mismatch' )
            
        for i in range( self.nEpochs ) :
            # train given the number of epochs
            self._trainEpoch( X, Y )
            
            print( 'end epoch: ', ( i + 1 ) )
            
        print( 'done training' )
    
    def score( self, X, Y ) :
        _n = X.shape[0]
        _m = X.shape[1]
        
        if _n != self.iDim :
            print( 'LNetwork::score> dimension mismatch' )
            
        _count = 0
        
        for i in range( _m ) :
            x = np.array( [X[:,i]] ).T
            y = Y[i]
            
            _y = self.predict( x )
            
            if _y > 0.5 :
                _y = 1
            else :
                _y = 0
                
            if _y == y :
                _count += 1
                
        return _count / _m
    
    def predict( self, x ) :
        a = x
            
        for l in range( self.nlayers ) :
            a = self.hlayers[l].forward( a )
        
        
        return a

    @staticmethod
    def saveModel( model, modelname ) :
        import joblib

        joblib.dump( model, './nnet_' + modelname + '.pkl')
        
    @staticmethod
    def loadModel( modelname ) :        
        import joblib
        
        _model = joblib.load('./nnet_' + modelname + '.pkl')

        return _model
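
For reference, the updates implemented in `LLayer.backprop` and `LNetwork._trainEpoch` correspond to the following equations for a single training example. The notation is introduced here for illustration and does not appear in the code: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and $a^{[l]} = g(z^{[l]})$ are the pre-activation and activation of layer $l$, $\hat{y}$ is the network output, and $\alpha$ is the learning rate.

$$
\begin{aligned}
\mathcal{L} &= -y \log \hat{y} - (1 - y) \log(1 - \hat{y}), &
\frac{\partial \mathcal{L}}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \\
\delta^{[l]} &= \frac{\partial \mathcal{L}}{\partial a^{[l]}} \odot g'\!\left(z^{[l]}\right), &
\frac{\partial \mathcal{L}}{\partial a^{[l-1]}} &= \left(W^{[l]}\right)^{\top} \delta^{[l]} \\
W^{[l]} &\leftarrow W^{[l]} - \alpha\, \delta^{[l]} \left(a^{[l-1]}\right)^{\top}, &
b^{[l]} &\leftarrow b^{[l]} - \alpha\, \delta^{[l]}
\end{aligned}
$$

With a sigmoid output layer, the product of the loss derivative and $g'(z) = \hat{y}(1 - \hat{y})$ simplifies to $\delta = \hat{y} - y$, which avoids dividing by $\hat{y}$ or $1 - \hat{y}$ when the output saturates.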


#### Just for testing, train a NN to learn the OR gate

In [14]:
## testing the nn on the OR gate

_l1 = LLayer( 2, 2, ACTIVATION_RELU )
_l2 = LLayer( 2, 1 )
    
_nn = LNetwork( 2, 1, alpha = 1, nEpochs = 20 )
_nn.addLayer( _l1 )
_nn.addLayer( _l2 )

X = np.array( [[0.,0.],[0.,1.],[1.,0.],[1.,1.]] ).T
Y = [0., 1., 1., 1.]

_nn.train( X, Y )

#xtest = np.array( [[0,0]] ).T
#ytest = 0

#_nn.predict( xtest )
_nn.score( X, Y )
LNetwork.saveModel( _nn, 'OR_gate' )

_foo = LNetwork.loadModel( 'OR_gate' )
_foo.score( X, Y )

relu activation chosen
sigmoid activation chosen
started training
cost:  [[ 2.73854732]]
end epoch:  1
cost:  [[ 2.58519799]]
end epoch:  2
cost:  [[ 2.61787598]]
end epoch:  3
cost:  [[ 2.64016894]]
end epoch:  4
cost:  [[ 2.64818565]]
end epoch:  5
cost:  [[ 2.64426141]]
end epoch:  6
cost:  [[ 2.6237359]]
end epoch:  7
cost:  [[ 2.57148835]]
end epoch:  8
cost:  [[ 2.4624069]]
end epoch:  9
cost:  [[ 2.04036251]]
end epoch:  10
cost:  [[ 2.34297247]]
end epoch:  11
cost:  [[ 1.40707489]]
end epoch:  12
cost:  [[ 1.7894413]]
end epoch:  13
cost:  [[ 0.86652788]]
end epoch:  14
cost:  [[ 0.9766562]]
end epoch:  15
cost:  [[ 0.52845325]]
end epoch:  16
cost:  [[ 0.40127087]]
end epoch:  17
cost:  [[ 0.31802599]]
end epoch:  18
cost:  [[ 0.32357093]]
end epoch:  19
cost:  [[ 0.24415527]]
end epoch:  20
done training


1.0
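
Another quick check of a hand-written backpropagation implementation is to compare each activation derivative with a finite-difference estimate. The sketch below re-implements scalar versions of the activation functions with the `math` module so that it runs on its own. The helper `numeric_derivative` and the step size `eps` are assumptions for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def d_sigmoid(v):
    return sigmoid(v) * (1.0 - sigmoid(v))

def d_tanh(v):
    return 1.0 - math.tanh(v) ** 2

def leaky_relu(v, slope=0.001):
    return v if v > 0 else slope * v

def d_leaky_relu(v, slope=0.001):
    return 1.0 if v > 0 else slope

def numeric_derivative(f, v, eps=1e-6):
    """Central finite-difference estimate of f'(v)."""
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

checks = [(sigmoid, d_sigmoid), (math.tanh, d_tanh), (leaky_relu, d_leaky_relu)]
for f, df in checks:
    for v in (-2.0, -0.5, 0.5, 2.0):   # avoid v = 0, where the leaky ReLU has a kink
        assert abs(numeric_derivative(f, v) - df(v)) < 1e-6
print('all activation derivatives match')
```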

In [15]:

## Generating training and test set
doc_stream = stream_docs(path='shuffled_movie_data.csv')

TOTAL_SIZE = 50000
TRAIN_SIZE = 45000
TEST_SIZE = TOTAL_SIZE - TRAIN_SIZE

_t_docs, _t_labels = get_minibatch( doc_stream, size=TRAIN_SIZE )

X_train = np.zeros( ( TRAIN_SIZE, EMBEDDINGS_DIM ) )
Y_train = np.array( _t_labels )

for i in range( len( _t_docs ) ) :
    _docvec = vectorize( _t_docs[i], _model )
    X_train[i] = _docvec

X_train = X_train.T

_t_docs, _t_labels = get_minibatch( doc_stream, size = TEST_SIZE )

X_test = np.zeros( ( TEST_SIZE, EMBEDDINGS_DIM ) )
Y_test = np.array( _t_labels )

for i in range( len( _t_docs ) ) :
    _docvec = vectorize( _t_docs[i], _model )
    X_test[i] = _docvec

X_test = X_test.T

In [16]:
## Defining our nnet architecture

NET_MODEL_NAME = 'tanh'
USE_PRETRAINED_NNET = True

_net = None

if USE_PRETRAINED_NNET :
    print( 'using pretrained model' )
    _net = LNetwork.loadModel( NET_MODEL_NAME )
else :
    print( 'training model ...' )
    _layer1 = LLayer( EMBEDDINGS_DIM, 100, ACTIVATION_TANH )
    _layer2 = LLayer( 100, 150, ACTIVATION_TANH )
    _layer3 = LLayer( 150, 150, ACTIVATION_TANH )
    _layer4 = LLayer( 150, 1, ACTIVATION_SIGMOID )

    _net = LNetwork( EMBEDDINGS_DIM, 1, alpha = 0.025, nEpochs = 50 )
    _net.addLayer( _layer1 )
    _net.addLayer( _layer2 )
    _net.addLayer( _layer3 )
    _net.addLayer( _layer4 )

    _net.train( X_train, Y_train )
    LNetwork.saveModel( _net, NET_MODEL_NAME )
    
print( 'score in training set: ', _net.score( X_train, Y_train ) )
print( 'score in test set: ', _net.score( X_test, Y_test ) )

using pretrained model
score in training set:  0.8763555555555556
score in test set:  0.8668


Depending on your machine, it will take about 2-3 minutes to stream the documents and learn the weights for the logistic regression model to classify "new" movie reviews. Executing the following code, we use the first 45,000 movie reviews to train the classifier, which leaves 5,000 reviews for testing:

In [17]:
import pyprind
pbar = pyprind.ProgBar( 45 )

doc_stream = stream_docs(path='shuffled_movie_data.csv')
classes = np.array([0, 1])

for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train = vect.fit_transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
    
print( 'Accuracy on test set: %.3f' % clf.score( X_test, y_test ) )

0%                          100%


[##############################]
Total time elapsed: 103.082 sec


Accuracy on test set: 0.867


## Exercise 3 

Compare with your neural network (accuracy on the 5,000 test reviews):

* Logistic regression        -> 0.8670
* Neural network - 50 epochs -> 0.8668

I think that the predictive performance, an accuracy of ~87%, is quite "reasonable" given that we "only" used the default parameters and didn't do any hyperparameter optimization. 

Having estimated the model performance, let us use those last 5,000 test samples to update our model.

In [18]:
clf = clf.partial_fit(X_test, y_test)



<br>
<br>

# Model Persistence

In the previous section, we successfully trained a model to predict the sentiment of a movie review. Unfortunately, if we closed this IPython notebook at this point, we would have to go through the whole learning process again every time we wanted to make a prediction on "new data."

So, to reuse this model, we could use the [`pickle`](https://docs.python.org/3.5/library/pickle.html) module to "serialize a Python object structure". Or even better, we could use the [`joblib`](https://pypi.python.org/pypi/joblib) library, which handles large NumPy arrays more efficiently.

To install:
conda install -c anaconda joblib

In [19]:
import joblib
joblib.dump(vect, './vectorizer.pkl')
joblib.dump(clf, './clf.pkl')

['./clf.pkl']

Using the code above, we "pickled" the `HashingVectorizer` and the `SGDClassifier` so that we can reuse those objects later. However, `pickle` and `joblib` have a known issue with pickling objects or functions defined in a `__main__` block: we would get an `AttributeError: Can't get attribute [x] on <module '__main__'>` when unpickling them later. Thus, to pickle the `tokenizer` function, we can write it to a file and import it to get the namespace "right".

In [20]:
%%writefile tokenizer.py
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

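def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace non-word characters by spaces and re-append the emoticons (without "noses")
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]  # remove stopwords
    tokenized = [porter.stem(w) for w in text]  # apply Porter stemming
    return tokenized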

Overwriting tokenizer.py


In [21]:
from tokenizer import tokenizer
joblib.dump(tokenizer, './tokenizer.pkl')

['./tokenizer.pkl']
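
The reason the import matters is that `pickle` (and therefore `joblib`) stores a function by reference, that is, as its module and name, not as its code. A small standard-library sketch illustrates this with `json.dumps`, chosen arbitrarily as an importable function:

```python
import json
import pickle

payload = pickle.dumps(json.dumps)

# The payload holds only the module and function names, not the function body.
print(b'json' in payload, b'dumps' in payload)

# Unpickling imports the module again and looks the name up there.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

By the same logic, `tokenizer.py` has to be importable wherever `tokenizer.pkl` is loaded; the pickle file alone does not contain the function.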

Now, let us restart this IPython notebook and check if we can load our serialized objects:

In [22]:
import joblib
tokenizer = joblib.load('./tokenizer.pkl')
vect = joblib.load('./vectorizer.pkl')
clf = joblib.load('./clf.pkl')

After loading the `tokenizer`, the `HashingVectorizer`, and the trained logistic regression model, we can use them to make predictions on new data, which can be useful, for example, if we want to embed our classifier into a web application, a topic for another IPython notebook.

In [23]:
example = ['I did not like this movie']
X = vect.transform(example)
clf.predict(X)

array([0])

In [24]:
example = ['I loved this movie']
X = vect.transform(example)
clf.predict(X)

array([1])