# Logistic Regression for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

<br>
<br>

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset that has been collected by Maas et. al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Lin- guistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1=positive and 0=negative) and contain 25,000 positive and 25,000 negative movie reviews, respectively.
For simplicity, I assembled the reviews in a single CSV file.


In [1]:
import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


Let us shuffle the class labels.

In [2]:
import numpy as np
## uncomment these lines if you have dowloaded the original file:
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
#df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)

<br>
<br>

## Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. Furthermore, we will use some simple regular expression to remove HTML markup and all non-letter characters but "emoticons," convert the text to lower case, remove stopwords, and apply the Porter stemming algorithm to convert the words into their root form.

In [3]:
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return text

Let's give it at try:

In [4]:
tokenizer('This :) is a <a> test! :-)</br>')

['test', ':)', ':)']

## Learning (SciKit)

First, we define a generator that returns the document body and the corresponding class label:

In [5]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

To conform that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [6]:
text = next(stream_docs(path='shuffled_movie_data.csv'))

After we confirmed that our `stream_docs` functions works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [7]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y

Next, we will make use of the "hashing trick" through scikit-learns [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at  [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).

In [8]:
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

# Excercise 1: define new features according to https://web.stanford.edu/~jurafsky/slp3/5.pdf

## Excercise 1:


Define new features

We are going to rewrite the tokenizer function to use normalized terms, using the Porter steamming algorithm, instead of normal terms.
"The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."

In [36]:
def feature_extractor(sentence, plot=False):
    
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank
    from math import log1p

    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    no_word = 0
    pronous_word = 0
    exc_word = 0
    count_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]

    x = list(range(len(tokenized_sent))) # x axis for the plot
    y = []
    features = []
    for word in tokenized_sent:
        count_words +=1
        if word in opinion_lexicon.positive():
            pos_words += 1
            y.append(1) # positive
        elif word in opinion_lexicon.negative():
            neg_words += 1
            y.append(-1) # negative
        elif word in 'no':
            no_word = 1 
            y.append(-1) #word = no
        elif word in 'me' or word in 'you' or word in 'I' or word in 'your' or word in 'mine':
            pronous_word += 1 
            y.append(-1) #contiene pronous 1st, 2rd
        elif word in '!':
            exc_word = 1 
            y.append(-1) #contiene signo de exclamación
        else:
            y.append(0) # neutral
            
    #print("x1= ",pos_words)
    #print("x2= ",neg_words)
    #print("x3= ",no_word)
    #print("x4= ",pronous_word)
    #print("x5= ",exc_word)
    #print("x6= ",log1p(count_words))
    #print(x)
    features.append(pos_words)
    features.append(neg_words)
    features.append(no_word)
    features.append(pronous_word)
    features.append(exc_word)
    features.append(log1p(count_words))
    print(features)
    
    return features

    #if pos_words > neg_words:
     #   print('Positive')
    #elif pos_words < neg_words or no_word == True:
      #print('Negative')
    #elif pos_words == neg_words:
     #   print('Neutral')
    

    if plot == True:
        _show_plot(x, y, x_labels=tokenized_sent, y_labels=['Negative', 'Neutral', 'Positive']) 
#feature_extractor(X_test, plot=False)

In [40]:
doc_stream = stream_docs(path='shuffled_movie_data.csv')
X_train, y_train = get_minibatch(doc_stream, size=2500)

In [41]:
def convert(x):
    features=[]
    for i in range(len(x)):
        features.append(feature_extractor(x[i], plot=False))
    return features

In [None]:
X_train = convert(X_train)

[8, 13, 0, 10, 0, 5.648974238161206]
[12, 10, 0, 13, 1, 5.662960480135946]
[12, 20, 0, 11, 1, 5.981414211254481]
[4, 0, 0, 6, 0, 4.406719247264253]
[4, 1, 0, 3, 0, 5.075173815233827]
[9, 3, 1, 8, 0, 5.111987788356544]
[12, 4, 0, 9, 0, 5.83773044716594]
[3, 4, 0, 3, 1, 5.099866427824199]
[6, 2, 0, 5, 1, 4.867534450455582]
[8, 4, 0, 3, 1, 5.003946305945459]
[0, 2, 0, 8, 0, 4.07753744390572]
[11, 0, 1, 5, 0, 5.375278407684165]
[13, 8, 0, 16, 0, 5.783825182329737]
[10, 3, 0, 11, 0, 5.883322388488279]
[3, 5, 0, 2, 0, 4.248495242049359]
[6, 2, 1, 4, 0, 5.288267030694535]
[7, 12, 1, 5, 1, 5.293304824724492]
[2, 0, 0, 7, 0, 4.51085950651685]
[6, 10, 1, 4, 0, 5.332718793265369]
[13, 13, 1, 10, 0, 6.035481432524756]
[10, 3, 1, 2, 0, 4.795790545596741]
[4, 8, 0, 6, 0, 4.9344739331306915]
[4, 3, 0, 4, 0, 4.770684624465665]
[3, 3, 0, 5, 1, 4.61512051684126]
[5, 9, 0, 8, 1, 5.111987788356544]
[7, 7, 0, 19, 0, 5.720311776607412]
[15, 11, 0, 18, 1, 5.966146739123692]
[10, 22, 1, 25, 0, 6.1312264894831

[23, 10, 0, 13, 0, 6.113682179832232]
[12, 7, 1, 16, 0, 5.8664680569332965]
[18, 19, 0, 3, 0, 6.021023349349527]
[7, 4, 1, 3, 0, 5.0106352940962555]
[6, 6, 0, 7, 0, 5.099866427824199]
[1, 5, 0, 2, 0, 4.143134726391533]
[9, 15, 0, 20, 0, 5.929589143389895]
[15, 18, 0, 10, 1, 6.169610732491456]
[9, 6, 1, 22, 1, 5.877735781779639]
[7, 16, 0, 9, 0, 5.8664680569332965]
[9, 5, 0, 2, 0, 5.1647859739235145]
[16, 11, 0, 6, 0, 5.968707559985366]
[11, 12, 1, 21, 0, 5.929589143389895]
[9, 6, 0, 7, 0, 5.480638923341991]
[10, 3, 0, 10, 1, 5.497168225293202]
[6, 8, 0, 14, 0, 5.402677381872279]
[3, 3, 0, 11, 1, 5.135798437050262]
[7, 9, 0, 7, 1, 5.771441123130016]
[0, 1, 0, 3, 0, 4.204692619390966]
[29, 11, 1, 16, 1, 6.07993319509559]
[8, 5, 0, 7, 1, 5.0238805208462765]
[7, 10, 1, 5, 0, 5.752572638825633]
[18, 21, 1, 13, 1, 6.416732282512326]
[4, 7, 0, 10, 0, 5.231108616854587]
[24, 10, 0, 12, 0, 5.749392985908253]
[7, 0, 0, 3, 0, 3.970291913552122]
[2, 3, 0, 2, 0, 3.6375861597263857]
[2, 3, 0, 10, 0,

[3, 12, 0, 2, 0, 4.976733742420574]
[7, 5, 0, 3, 1, 5.288267030694535]
[7, 10, 0, 5, 0, 5.393627546352361]
[10, 18, 0, 27, 1, 5.8805329864007]
[5, 3, 0, 9, 0, 5.204006687076795]
[11, 27, 1, 31, 0, 6.628041376179533]
[8, 4, 0, 12, 0, 5.318119993844216]
[5, 5, 0, 12, 1, 5.247024072160486]
[3, 4, 0, 8, 0, 4.564348191467836]
[2, 3, 1, 10, 0, 5.147494476813453]
[3, 5, 1, 3, 0, 5.303304908059076]
[3, 3, 1, 2, 1, 4.248495242049359]
[9, 1, 0, 7, 0, 5.062595033026967]
[19, 8, 0, 18, 0, 6.476972362889683]
[29, 17, 0, 27, 0, 6.558197802812269]
[6, 4, 0, 4, 1, 5.308267697401205]
[6, 8, 0, 1, 0, 4.820281565605037]
[3, 0, 0, 6, 1, 4.882801922586371]
[8, 4, 0, 14, 0, 5.3230099791384085]
[34, 30, 0, 21, 1, 7.1098794630722715]
[5, 4, 0, 8, 0, 5.003946305945459]
[7, 1, 0, 7, 0, 4.955827057601261]
[7, 8, 0, 19, 0, 6.077642243349034]
[42, 23, 1, 25, 0, 6.654152520183219]
[16, 2, 0, 11, 0, 5.25227342804663]
[7, 7, 1, 11, 0, 5.616771097666572]
[10, 18, 1, 18, 0, 6.118097198041348]
[1, 4, 0, 4, 1, 5.06259503

[10, 2, 0, 10, 0, 5.3981627015177525]
[5, 4, 1, 9, 0, 5.198497031265826]
[6, 9, 1, 7, 1, 4.962844630259907]
[10, 4, 1, 18, 1, 5.3471075307174685]
[3, 2, 0, 7, 0, 5.1298987149230735]
[6, 7, 0, 4, 1, 5.056245805348308]
[13, 4, 1, 5, 0, 5.575949103146316]
[11, 6, 1, 5, 0, 5.572154032177765]
[11, 5, 0, 7, 0, 5.3230099791384085]
[31, 5, 1, 18, 1, 6.7464121285733745]
[14, 14, 0, 7, 1, 5.950642552587727]
[16, 27, 1, 16, 1, 6.47543271670409]
[32, 65, 0, 33, 1, 7.249215057114389]
[16, 23, 1, 14, 0, 6.375024819828097]
[5, 19, 1, 11, 1, 5.420534999272286]
[14, 11, 1, 17, 0, 5.902633333401366]
[6, 2, 0, 12, 1, 5.293304824724492]
[3, 5, 0, 7, 0, 4.90527477843843]
[9, 4, 0, 6, 1, 4.997212273764115]
[11, 15, 0, 11, 0, 5.54907608489522]
[0, 4, 0, 1, 0, 4.330733340286331]
[4, 7, 0, 6, 0, 5.288267030694535]
[2, 8, 0, 3, 0, 5.049856007249537]
[6, 11, 0, 7, 0, 5.472270673671475]
[3, 2, 0, 1, 0, 4.276666119016055]
[19, 11, 0, 11, 0, 5.780743515792329]
[14, 28, 0, 13, 0, 5.993961427306569]
[2, 5, 0, 9, 0, 5

[9, 6, 0, 5, 0, 5.723585101952381]
[8, 5, 1, 12, 1, 5.37989735354046]
[13, 16, 0, 6, 0, 5.91350300563827]
[0, 0, 0, 7, 0, 4.912654885736052]
[3, 4, 0, 5, 0, 4.820281565605037]
[24, 34, 0, 11, 0, 6.668228248417403]
[13, 17, 0, 8, 0, 5.726847747587197]
[9, 17, 0, 4, 1, 5.5254529391317835]
[12, 5, 0, 9, 1, 5.713732805509369]
[4, 9, 1, 4, 0, 4.955827057601261]
[4, 3, 0, 5, 1, 5.0238805208462765]
[5, 5, 0, 4, 1, 4.897839799950911]
[16, 11, 0, 14, 1, 5.823045895483019]
[2, 13, 0, 3, 0, 5.123963979403259]
[5, 7, 0, 2, 0, 4.736198448394496]
[14, 13, 1, 25, 0, 6.3784261836515865]
[14, 2, 0, 6, 1, 5.241747015059643]
[6, 8, 1, 15, 0, 5.634789603169249]
[21, 15, 1, 26, 0, 6.52795791762255]
[15, 6, 1, 9, 1, 6.016157159698354]
[4, 15, 0, 0, 0, 5.666426688112432]
[8, 3, 1, 11, 0, 5.529429087511423]
[11, 3, 0, 8, 0, 5.723585101952381]
[7, 1, 0, 8, 1, 5.0369526024136295]
[8, 1, 0, 4, 0, 4.574710978503383]
[2, 5, 1, 1, 1, 5.10594547390058]
[6, 10, 1, 2, 0, 5.241747015059643]
[5, 2, 0, 0, 1, 4.7535901911

[20, 20, 0, 20, 1, 6.646390514847729]
[15, 14, 0, 13, 0, 6.587550014824796]
[7, 9, 1, 10, 1, 5.093750200806762]
[4, 2, 0, 21, 1, 5.181783550292085]
[20, 20, 1, 23, 1, 6.635946555686647]
[5, 6, 0, 9, 0, 5.272999558563747]
[5, 7, 0, 3, 0, 5.198497031265826]
[3, 14, 0, 17, 0, 5.272999558563747]
[6, 4, 0, 15, 0, 5.342334251964811]
[8, 6, 0, 13, 0, 5.176149732573829]
[4, 1, 0, 5, 0, 4.9344739331306915]
[2, 12, 1, 8, 1, 5.7899601708972535]
[8, 23, 0, 9, 1, 5.948034989180646]
[9, 7, 1, 4, 1, 5.389071729816501]
[18, 6, 0, 13, 0, 5.69035945432406]
[12, 14, 0, 11, 1, 5.717027701406222]
[3, 8, 0, 8, 0, 5.3706380281276624]
[9, 6, 0, 11, 0, 5.488937726156687]
[7, 6, 0, 7, 1, 5.298317366548036]
[7, 1, 0, 4, 0, 4.90527477843843]
[9, 14, 1, 19, 1, 5.955837369464831]
[12, 6, 1, 3, 0, 5.147494476813453]
[6, 11, 1, 3, 1, 5.1647859739235145]
[8, 6, 0, 1, 0, 5.293304824724492]
[14, 14, 0, 17, 0, 5.91350300563827]
[34, 46, 1, 22, 0, 7.051855622955894]
[10, 1, 1, 12, 0, 6.013715156042802]
[8, 2, 0, 21, 0, 5.

[5, 2, 0, 4, 0, 5.0106352940962555]
[1, 4, 0, 9, 0, 4.634728988229636]
[10, 3, 0, 3, 1, 5.303304908059076]
[10, 6, 0, 7, 0, 5.484796933490655]
[15, 18, 0, 16, 0, 6.391917113392602]
[7, 4, 0, 8, 0, 5.231108616854587]
[16, 2, 0, 11, 1, 5.147494476813453]
[3, 4, 0, 5, 1, 4.110873864173311]
[15, 11, 1, 16, 0, 6.042632833682381]
[6, 2, 0, 8, 0, 4.976733742420574]
[8, 25, 1, 21, 1, 6.467698726104354]
[13, 6, 1, 20, 0, 5.765191102784844]
[12, 4, 0, 5, 0, 5.147494476813453]
[11, 8, 1, 3, 1, 5.730099782973575]
[17, 1, 1, 30, 0, 5.634789603169249]
[10, 22, 1, 11, 1, 5.529429087511423]
[18, 15, 1, 8, 1, 6.251903883165888]
[5, 4, 0, 6, 0, 4.997212273764115]
[4, 6, 1, 4, 1, 5.60947179518496]
[2, 3, 1, 4, 0, 4.6443908991413725]
[2, 5, 0, 4, 0, 4.912654885736052]
[14, 6, 0, 27, 0, 6.056784013228625]
[5, 10, 0, 16, 0, 5.799092654460526]
[7, 0, 1, 8, 0, 4.787491742782046]
[13, 13, 0, 9, 0, 5.883322388488279]
[3, 5, 0, 6, 1, 5.10594547390058]
[12, 6, 1, 4, 0, 5.10594547390058]
[8, 2, 0, 7, 0, 4.88280192

Using the [SGDClassifier]() from scikit-learn, we will can instanciate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent. 

In [13]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
doc_stream = stream_docs(path='shuffled_movie_data.csv')
# Excercise 2: implement a Logistic Regression classifier, using regularization, according to https://web.stanford.edu/~jurafsky/slp3/5.pdf

# Excercise 2:

## Implementing my Logistic Regression classifier, using regularization

In [14]:
def sigmoid(x):
    from math import exp
    
    return 1 / (1 + exp(-x))

In [15]:
def Positive_Sentiment(w,features,bias):
    import numpy as np
    return sigmoid(np.dot(w,features) + bias)
    

In [16]:
def Negative_Seniment(w,features,bias):
    import numpy as np
    return 1 - sigmoid(np.dot(w,features) + bias)
    

In [17]:
def CrossEntropyLoss(y_guess, y_true):
    from math import log1p

    return - (y_true*log1p(y_guess) + (1-y)*loglp(1-y_guess))
    

In [18]:
def Cost(x, y, w,bias):
    m = len(y)
    cost=0
    for i in range(m):
        cost = CrossEntropyLoss(Positive_Sentiment(w,x[i],bias),y[i])
    return 1/m * cost

In [19]:
def Gradient_j(x,y,j,w,bias):
    m=len(x)
    suma=0
    for i in range(m):
        features=x[i]
        suma=suma+features[j]*(Positive_Sentiment(w,features,bias)-y[i])
    return suma

In [20]:
def Gradient(x,y,w,bias):
    m=len(x)
    suma=0
    for i in range(m):
        features=x[i]
        suma=suma+(Positive_Sentiment(w,features,bias)-y[i])
    return suma

In [21]:
def entrenar(x,y):
    n=len(x[0])
    w=[0]*n
    bias=0
    learning_rate=0.1
    gradient=[0]*(n)
    gradient_b=0
    for it in range(10):
        for j in range(n):
            gradient[j]= Gradient_j(x,y,j,w,bias)
        gradient_b=Gradient(x,y,w,bias)
        for i in range(n):
            w[i]=w[i]-learning_rate*gradient[i]
        bias=bias-learning_rate*gradient_b
    tetha=w
    tetha.append(bias)
    
    
    return tetha
            
            

In [22]:
#X_train, y_train = get_minibatch(doc_stream, size=1)
#X_test = vect_2.transform(X_test)
#print('Accuracy: %.3f' % clf_2.score(X_test, y_test))
print(X_train)

[[8, 13, 0, 0, 0, 5.648974238161206], [12, 10, 0, 2, 1, 5.662960480135946], [12, 20, 0, 0, 1, 5.981414211254481], [4, 0, 0, 0, 0, 4.406719247264253], [4, 1, 0, 0, 0, 5.075173815233827]]


In [23]:
entrenar(X_train,y_test)

NameError: name 'X_test' is not defined

In [24]:
def training_Regularization(x,y):
    from math import log1p
    n=len(x[0])
    w=[0]*n
    bias=0
    learning_rate=0.1
    gradient=[0]*(n)
    gradient_b=0
    alpha=0.0001
    count=0
    count2=0
    resultado =0 
    for it in range(10):
        for j in range(n):
            gradient[j]= Gradient_j(x,y,j,w,bias)
        gradient_b=Gradient(x,y,w,bias)
        for i in range(n):
            w[i]=w[i]-learning_rate*gradient[i]
        bias=bias-learning_rate*gradient_b
    
    for j in range(n):
        count = count + w[j]**2
    
    #for j in range(n)
    #   count2 = max(count2 + log1p(Positive_Sentiment(w,features,bias)-y[i]))
    
    w_max = max(w)
    indice=w.index(w_max)    
    w[indice]=w_max-alpha*count
    tetha=w
    tetha.append(bias)
    return tetha
            
    

In [25]:
Tetha=training_Regularization(X_train,y_train)

In [26]:
X_test, y_test = get_minibatch(doc_stream, size=100)
X_test = convert(X_test)
#X_test = vect_2.transform(X_test)
#print('Accuracy: %.3f' % clf_2.score(X_test, y_test))

[8, 13, 0, 0, 0, 5.648974238161206]
[12, 10, 0, 2, 1, 5.662960480135946]
[12, 20, 0, 0, 1, 5.981414211254481]
[4, 0, 0, 0, 0, 4.406719247264253]
[4, 1, 0, 0, 0, 5.075173815233827]
[9, 3, 1, 0, 0, 5.111987788356544]
[12, 4, 0, 1, 0, 5.83773044716594]
[3, 4, 0, 0, 1, 5.099866427824199]
[6, 2, 0, 0, 1, 4.867534450455582]
[8, 4, 0, 0, 1, 5.003946305945459]
[0, 2, 0, 1, 0, 4.07753744390572]
[11, 0, 1, 0, 0, 5.375278407684165]
[13, 8, 0, 0, 0, 5.783825182329737]
[10, 3, 0, 1, 0, 5.883322388488279]
[3, 5, 0, 0, 0, 4.248495242049359]
[6, 2, 1, 0, 0, 5.288267030694535]
[7, 12, 1, 0, 1, 5.293304824724492]
[2, 0, 0, 0, 0, 4.51085950651685]
[6, 10, 1, 1, 0, 5.332718793265369]
[13, 13, 1, 0, 0, 6.035481432524756]
[10, 3, 1, 0, 0, 4.795790545596741]
[4, 8, 0, 0, 0, 4.9344739331306915]
[4, 3, 0, 0, 0, 4.770684624465665]
[3, 3, 0, 0, 1, 4.61512051684126]
[5, 9, 0, 0, 1, 5.111987788356544]
[7, 7, 0, 0, 0, 5.720311776607412]
[15, 11, 0, 2, 1, 5.966146739123692]
[10, 22, 1, 1, 0, 6.131226489483141]
[5, 9

In [33]:
def Prueba(x,y,tetha):
    aciertos=0
    total=0
    w=tetha[0:6]
    print(w)
    bias=tetha[6]
    print(bias)
    for i in range(len(x)):
        y_guess=Positive_Sentiment(w,x[i],bias)
        if y_guess > 0.5:
            y_guess=1
        else:
            y_guess=0
        print(y_guess)
        if y[i] == y_guess:
            aciertos=aciertos+1
        total=total+1
    return aciertos/total
        
        

In [34]:
Prueba(X_test,y_test,Tetha)

[-2.9449552376049257, -2.704653621112209, -0.001710931412452596, -0.7070261078804156, -0.701696546283828, -0.35935720744505484]
0.033044747040050504
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


0.47

In [None]:
#import pyprind
#pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    #pbar.update()

Depending on your machine, it will take about 2-3 minutes to stream the documents and learn the weights for the logistic regression model to classify "new" movie reviews. Executing the preceding code, we used the first 45,000 movie reviews to train the classifier, which means that we have 5,000 reviews left for testing:

In [None]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

I think that the predictive performance, an accuracy of ~87%, is quite "reasonable" given that we "only" used the default parameters and didn't do any hyperparameter optimization. 

After we estimated the model perfomance, let us use those last 5,000 test samples to update our model.

In [None]:
clf = clf.partial_fit(X_test, y_test)

<br>
<br>

# Model Persistence

In the previous section, we successfully trained a model to predict the sentiment of a movie review. Unfortunately, if we'd close this IPython notebook at this point, we'd have to go through the whole learning process again and again if we'd want to make a prediction on "new data."

So, to reuse this model, we could use the [`pickle`](https://docs.python.org/3.5/library/pickle.html) module to "serialize a Python object structure". Or even better, we could use the [`joblib`](https://pypi.python.org/pypi/joblib) library, which handles large NumPy arrays more efficiently.

To install:
conda install -c anaconda joblib

In [None]:
import joblib
import os
if not os.path.exists('./pkl_objects'):
    os.mkdir('./pkl_objects')
    
joblib.dump(vect, './vectorizer.pkl')
joblib.dump(clf, './clf.pkl')

Using the code above, we "pickled" the `HashingVectorizer` and the `SGDClassifier` so that we can re-use those objects later. However, `pickle` and `joblib` have a known issue with `pickling` objects or functions from a `__main__` block and we'd get an `AttributeError: Can't get attribute [x] on <module '__main__'>` if we'd unpickle it later. Thus, to pickle the `tokenizer` function, we can write it to a file and import it to get the `namespace` "right".

In [None]:
%%writefile tokenizer.py
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return text

In [None]:
from tokenizer import tokenizer
joblib.dump(tokenizer, './tokenizer.pkl')

Now, let us restart this IPython notebook and check if the we can load our serialized objects:

In [None]:
import joblib
tokenizer = joblib.load('./tokenizer.pkl')
vect = joblib.load('./vectorizer.pkl')
clf = joblib.load('./clf.pkl')

After loading the `tokenizer`, `HashingVectorizer`, and the tranined logistic regression model, we can use it to make predictions on new data, which can be useful, for example, if we'd want to embed our classifier into a web application -- a topic for another IPython notebook.

In [None]:
example = ['I did not like this movie']
X = vect.transform(example)
clf.predict(X)

In [None]:
example = ['I loved this movie']
X = vect.transform(example)
clf.predict(X)