# Chapter 8: Applying Machine Learning to Sentiment Analysis

## Preparing the IMDb movie review data for text processing

*Sentiment analysis*, also called *opinion mining*, classifies the attitude a writer expresses in a text, typically as positive or negative.

### Preprocessing the movie dataset into a convenient format

The individual review files need to be combined into a single `.csv` file.

First, loop through all the files and collect them in a single dataframe:

In [1]:
import pandas as pd
import os
# change 'basepath' to the directory of the unzipped movie dataset
basepath = 'aclImdb'
labels = {'pos': 1, 'neg': 0}
# collect [text, label] rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
df = pd.DataFrame(rows, columns=['review', 'sentiment'])


Shuffle the dataframe, save it to a CSV file, and then read it back:

In [2]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [3]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
# the following column renaming is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})
df.sample(3)

| | review | sentiment |
|---|---|---|
| 28921 | This film deals with the Irish rebellion in th... | 1 |
| 11971 | This movie is pure guano. Mom always said if y... | 0 |
| 15919 | Well the plot is entertaining but it is full o... | 0 |


In [4]:
df.shape

(50000, 2)

## Introducing the bag of words model

*Bag-of-words* represents texts as numerical feature vectors. This is done by:
1. Building a vocabulary of unique tokens from the entire set of documents
2. Constructing a feature vector for each document that contains the counts of how often each token occurs in it

These feature vectors tend to be *sparse*, i.e. they contain a lot of zeros.
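The two steps can be sketched by hand in plain Python. This is only an illustration, not how scikit-learn is implemented: the three example sentences, the helper `tokenize`, and the choice to keep only lowercase tokens of two or more characters are assumptions made for this sketch.

```python
import re
from collections import Counter

# three short example sentences (assumed for illustration)
docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(doc):
    # lowercase the text and keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())

tokens_per_doc = [tokenize(d) for d in docs]

# step 1: vocabulary of unique tokens, mapped to column indices (sorted alphabetically)
unique_tokens = sorted(set(t for toks in tokens_per_doc for t in toks))
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

# step 2: one count vector per document; iterating over the dict follows
# insertion order, which here equals the column order
bag = []
for toks in tokens_per_doc:
    counts = Counter(toks)
    bag.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for row in bag:
    print(row)
```

Most entries of the first two vectors are zero, which is the sparsity mentioned above.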

### Transforming words into feature vectors

`CountVectorizer` is built into scikit-learn and builds a bag-of-words model automatically:

In [5]:
from sklearn.feature_extraction.text import CountVectorizer 
count = CountVectorizer() #make the object
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two']) 
bag = count.fit_transform(docs) #fit and transform

In [6]:
#can get the counts
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [7]:
#and the vectorized text
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])

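Each row of `bag.toarray()` corresponds to a document, and each column corresponds to an index in `count.vocabulary_`; the entry is the number of times that token occurs in the document. For example, the token `is` has index 1 and occurs three times in the third document. These values are called the *raw term frequencies* $tf(t, d)$: the number of times a term $t$ occurs in a document $d$.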

#### N-Gram models
The bag-of-words model above is also called the *unigram* (1-gram) model, since each token is a single word. More generally, an *n-gram* model uses contiguous sequences of $n$ words as tokens. `CountVectorizer` implements this via the `ngram_range=(x, y)` parameter, where `x` and `y` are the smallest and largest $n$ to extract; for example, `ngram_range=(2, 2)` yields only 2-grams.
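To make the idea of an n-gram concrete, here is a minimal plain-Python sketch. The helper `ngrams` is a hypothetical name introduced for illustration; it is not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()
print(ngrams(tokens, 1))  # ['the', 'sun', 'is', 'shining']
print(ngrams(tokens, 2))  # ['the sun', 'sun is', 'is shining']
```

Larger values of $n$ capture more context but also produce many more distinct features.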

### Assessing word relevancy via term frequency-inverse document frequency

Words that occur frequently in documents of both classes do not carry much discriminatory information. *Term frequency-inverse document frequency* (tf-idf) downweights such frequently occurring terms. It is defined as the product of the term frequency and the inverse document frequency:
$$
    \text{tf-idf}(t, d) = tf(t, d) \times idf(t, d)
$$
$idf$ is defined by:
$$
    idf(t, d) = \log\frac{n_d}{1 + df(t,d)}
$$
where $n_d$ is the total number of documents, and $df(t, d)$ is the number of documents $d$ that contain the term $t$. Adding 1 to the denominator avoids a division by zero for terms that occur in none of the documents. The logarithm ensures that low document frequencies are not given too much weight. This is implemented in scikit-learn via `TfidfTransformer`. Compare the output below with the raw counts above: the word `is` has the highest count in the third document, but because it appears in every document it gets a relatively low tf-idf score.

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer 
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


Note that scikit-learn calculates tf-idf slightly differently. With `smooth_idf=True`, the inverse document frequency is:
$$
    idf = \log\frac{1 + n_d}{1 + df(t, d)}
$$
and 
$$
    \text{tf-idf} = tf(t, d) \times (idf(t, d) + 1)
$$

Moreover, with `norm='l2'` the tf-idf vector $v$ of each document is scaled to unit length:
$$
    v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{(\sum_{i=1}^n v_i^2)^{1/2}}
$$
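As a check, the scikit-learn variant of the formulas can be reproduced by hand with NumPy. This is a sketch that assumes the natural logarithm and starts from the raw counts shown earlier; the variable names (`tf`, `doc_freq`, `tfidf_norm`) are chosen for illustration.

```python
import numpy as np

# raw term frequencies of the three example documents (rows),
# copied from the CountVectorizer output shown earlier
tf = np.array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
               [0, 1, 0, 0, 0, 1, 1, 0, 1],
               [2, 3, 2, 1, 1, 1, 2, 1, 1]])

n_d = tf.shape[0]                           # number of documents
doc_freq = np.count_nonzero(tf, axis=0)     # number of documents containing each term
idf = np.log((1 + n_d) / (1 + doc_freq))    # smoothed idf (natural log)
tfidf_raw = tf * (idf + 1)                  # tf-idf before normalization

# L2-normalize each row (document) to unit length
tfidf_norm = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)

np.set_printoptions(precision=2)
print(tfidf_norm)
```

For the word `is` in the third document, $tf = 3$ and $idf = \log(4/4) = 0$, so the unnormalized tf-idf is $3 \times (0 + 1) = 3$; dividing by the length of that row gives the 0.45 shown in the `TfidfTransformer` output.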

### Cleaning text data

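Punctuation and HTML markup are generally removed from the text before it is vectorized; this can be done with regular expressions. The function below keeps emoticons such as `:)` and appends them to the end of the cleaned text: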

In [9]:
import re
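def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML markup
    # find emoticons such as :) ;-( =D before the punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # lowercase, replace runs of non-word characters with a space, then
    # append the emoticons with the "nose" character (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text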

Compare the unprocessed and processed text:

In [10]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [11]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

### Processing documents into tokens

Split the text into individual words.

In [12]:
def tokenizer(text):
    return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Words can also be *stemmed*, i.e. reduced to their root form, so that related words map to the same stem. This is done here with the Porter stemmer from `nltk`:

In [13]:
from nltk.stem.porter import PorterStemmer 
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()] 
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Another common preprocessing step is to remove *stop words*, i.e. common words that usually carry little information. The `nltk` stop-word list has to be downloaded once with `nltk.download('stopwords')`.

In [14]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

## Training a logistic regression model for document classification

In [15]:
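# .loc slicing includes the end label, so stop at 24999 to keep
# row 25000 out of the training set (it belongs to the test set)
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values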

Note that the following grid search is computationally quite expensive because of the number of features in the data set.

In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer 
tfidf = TfidfVectorizer(strip_accents=None,
    lowercase=False,
    preprocessor=None) 
small_param_grid = [
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [None],
        'vect__tokenizer': [tokenizer, tokenizer_porter],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]},
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [stop, None],
        'vect__tokenizer': [tokenizer],
        'vect__use_idf': [False],
        'vect__norm':[None],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]  
    },
    ]
# the following grid search will take a while

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy', cv=5,
                           verbose=2, n_jobs=1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f84c8b59b80>; total time=   2.6s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f84c8b59b80>; total time=   2.7s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f84c8b59b80>; total time=   2.7s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f84c8b59b80>; total time=   2.8s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f84c8b59b80>; total time=   2.8s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter 

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False)),
                                       ('clf',
                                        LogisticRegression(solver='liblinear'))]),
             n_jobs=1,
             param_grid=[{'clf__C': [1.0, 10.0], 'clf__penalty': ['l2'],
                          'vect__ngram_range': [(1, 1)],
                          'vect__stop_words': [None],
                          'vect__tokenizer': [<function tokenizer at 0x7f84c8b59b80>,
                                              <function tokenizer_porter at 0x7f84c8acfee0>...
                          'vect__stop_words': [['i', 'me', 'my', 'myself', 'we',
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've",
                                                "you'll", "you'd", 'your',
                                 

Note that `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` in a single step, so the pipeline takes the raw review text as input.

In [17]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f84c8b59b80>}
CV Accuracy: 0.887
Test Accuracy: 0.893


## Working with bigger data -- online algorithms and out-of-core learning

The dataset used here is relatively small, with only 50,000 documents. Real-world datasets are often significantly larger. When the data exceeds the computer's memory, *out-of-core learning* can be used: parts of the training set are streamed to the model, which is fitted incrementally. For example, stochastic gradient descent (see Chapter 2) updates the weights and the bias after each training example $i$ using the following deltas:
$$
    \Delta w_j = \eta(y^{(i)} - \sigma(z^{(i)}))x_j^{(i)}
$$
$$
    \Delta b = \eta(y^{(i)} - \sigma(z^{(i)}))
$$
(This is in contrast to gradient descent, which computes each update from the accumulated errors over all training examples.)
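The update rule can be written out as a short NumPy sketch for logistic regression. The function names `sigmoid` and `sgd_update`, the learning rate, and the toy example are assumptions for illustration; `SGDClassifier` has its own, more elaborate implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_update(w, b, x_i, y_i, eta=0.1):
    """Apply one stochastic gradient descent step for a single training example."""
    error = y_i - sigmoid(np.dot(w, x_i) + b)   # y^(i) - sigma(z^(i))
    w = w + eta * error * x_i                   # w_j <- w_j + delta w_j
    b = b + eta * error                         # b <- b + delta b
    return w, b

# one update starting from zero weights with a toy example
w, b = np.zeros(2), 0.0
w, b = sgd_update(w, b, x_i=np.array([1.0, 2.0]), y_i=1)
print(w, b)  # w is [0.05, 0.1] and b is 0.05
```

Because each step needs only one example (or one small batch), the full training set never has to be in memory at once.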

First define a tokenizer function:

In [18]:
import numpy as np
import re
from nltk.corpus import stopwords 

stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) \
        + ' '.join(emoticons).replace('-', '')
    # split into words and drop stop words
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

Next, define a generator function that reads the CSV file and yields one document at a time:

In [19]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv: 
        next(csv) # skip header
        for line in csv:
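            # each line ends with ',<label>\n': the label is the second-to-last
            # character and the review text is everything before the comma
            text, label = line[:-3], int(line[-2])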
            yield text, label
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

Also define a function that pulls a batch of `size` documents from the stream:

In [20]:
def get_minibatch(doc_stream, size): 
    docs, y = [], []
    try:
        for _ in range(size):
            # pull the next (text, label) pair from the generator
            text, label = next(doc_stream) 
            docs.append(text) 
            y.append(label)
    except StopIteration: 
        return None, None
    return docs, y
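Note that `get_minibatch` returns `None, None` as soon as the stream runs out, so an incomplete final batch is discarded. One possible variant, sketched here with `itertools.islice`, returns whatever is left instead. The name `get_minibatch_islice` and the toy stream are hypothetical and only serve as an illustration.

```python
from itertools import islice

def get_minibatch_islice(doc_stream, size):
    """Return up to `size` documents and labels from the stream as two lists."""
    batch = list(islice(doc_stream, size))   # stops early if the stream is exhausted
    docs = [text for text, _ in batch]
    y = [label for _, label in batch]
    return docs, y

# toy stream standing in for the document generator
toy_stream = iter([('good film', 1), ('bad film', 0), ('great cast', 1)])
print(get_minibatch_islice(toy_stream, size=2))  # (['good film', 'bad film'], [1, 0])
print(get_minibatch_islice(toy_stream, size=2))  # (['great cast'], [1])
print(get_minibatch_islice(toy_stream, size=2))  # ([], [])
```

An exhausted stream yields empty lists, which are falsy, so a check such as `if not docs: break` still works.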

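Now use `HashingVectorizer`, since `CountVectorizer` needs to hold the complete vocabulary in memory and `TfidfVectorizer` needs all feature vectors of the training set in memory to compute the inverse document frequencies: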

In [21]:
from sklearn.feature_extraction.text import HashingVectorizer 
from sklearn.linear_model import SGDClassifier
#use the above defined tokenizer function and set the number of features
vect = HashingVectorizer(decode_error='ignore',
    n_features=2**21,
    preprocessor=None,
    tokenizer=tokenizer) 
# logistic regression loss; 'log_loss' was named 'log' before scikit-learn 1.1
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path='movie_data.csv')
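The idea behind `HashingVectorizer` (the "hashing trick") is that a token's column index is computed directly from a hash of the token, so no vocabulary has to be stored. The sketch below illustrates this in plain Python under some assumptions: `hash_vectorize` is a hypothetical helper, `zlib.crc32` stands in for the MurmurHash3 function that scikit-learn uses, and the sign alternation and normalization that `HashingVectorizer` applies by default are omitted.

```python
import zlib

def hash_vectorize(tokens, n_features=2**21):
    """Map tokens to a sparse {column index: count} dict without a vocabulary."""
    vec = {}
    for tok in tokens:
        # crc32 is only a stand-in hash for this sketch
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec

v1 = hash_vectorize(['good', 'movie', 'good'])
v2 = hash_vectorize(['movie'])
print(v1)
print(v2)  # 'movie' lands in the same column as in v1
```

Different tokens can collide in the same column; a large `n_features` such as `2**21` keeps such collisions rare.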

#### Starting the partial fit of the classifier

The loop below runs over 45 mini-batches (45,000 documents); each iteration does the following:
1. Get a batch of 1000 documents
2. Transform the documents using the `HashingVectorizer`
3. Partially fit the classifier

In [22]:
classes = np.array([0, 1]) 
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)

In [23]:
X_test, y_test = get_minibatch(doc_stream, size=5000) 
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.868


Note that the accuracy is lower than that of the grid-searched logistic regression above (0.868 versus 0.893), but the partial-fit approach is memory efficient.

## Topic modeling with latent Dirichlet allocation

*Topic modeling* assigns topics to unlabeled documents. 

### Decomposing text documents with LDA

*Latent Dirichlet Allocation (LDA)* is a generative probabilistic model that looks for groups of words that frequently appear together across documents; these groups represent the topics. It takes a bag-of-words matrix as input and decomposes it into two matrices:
- a document-to-topic matrix 
- a word-to-topic matrix

The decomposition is chosen so that the product of these two matrices reproduces the bag-of-words matrix with the lowest possible error. The number of topics is a hyperparameter that must be specified by the user. 
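The shapes involved in this decomposition can be illustrated with NumPy. The matrices below are filled with random probability distributions and are not the result of fitting LDA; the toy sizes and variable names are assumptions for this sketch.

```python
import numpy as np

n_docs, n_words, n_topics = 4, 6, 2   # hypothetical toy sizes
rng = np.random.default_rng(0)

# each row is a probability distribution (rows sum to 1)
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)     # shape (4, 2)
topic_word = rng.dirichlet(np.ones(n_words), size=n_topics)   # shape (2, 6)

# the product has the shape of the bag-of-words matrix: documents x words
reconstruction = doc_topic @ topic_word
print(reconstruction.shape)  # (4, 6)
```

Each row of the product is a mixture of the topics' word distributions, weighted by how strongly the document belongs to each topic.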

### LDA with scikit-learn
Use `LatentDirichletAllocation` in scikit-learn. The steps are:
- Load a dataset
- Create a bag-of-words matrix via `CountVectorizer`
- Fit a `LatentDirichletAllocation` object
- Use `LatentDirichletAllocation.components_` to analyze the most important words for each topic

In [25]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
from sklearn.feature_extraction.text import CountVectorizer 
count = CountVectorizer(stop_words='english', max_df=.1, max_features=5000)
X = count.fit_transform(df['review'].values)

Note that `max_df=0.1` excludes words that occur in more than 10% of the documents, and `max_features=5000` keeps only the 5,000 most frequent of the remaining words. Both are hyperparameters that can be tuned depending on the use case. 

In [26]:
from sklearn.decomposition import LatentDirichletAllocation 
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch') 
X_topics = lda.fit_transform(X)

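With `learning_method='batch'`, the estimator uses all of the training data (the whole bag-of-words matrix) in each update. The alternative, `'online'`, updates the model from mini-batches of documents; it is computationally more efficient but can be less accurate. 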

In [29]:
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool


The code above prints the five most important words for each of the 10 topics. These words can be used to guess what each topic is about; for example, the first topic appears to be "bad movies" and the ninth "movies based on books".
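The slice `[:-n_top_words - 1:-1]` applied to `topic.argsort()` is compact but easy to misread. The sketch below shows what it does on a small array of made-up weights; the values are hypothetical.

```python
import numpy as np

n_top_words = 2
weights = np.array([0.1, 0.7, 0.3, 0.9])   # hypothetical word weights for one topic

order = weights.argsort()                  # indices from smallest to largest weight
top = order[:-n_top_words - 1:-1]          # walk backwards from the end, n_top_words steps
print(order)  # [0 2 1 3]
print(top)    # [3 1]
```

The result lists the indices of the `n_top_words` largest weights in descending order, which is why the most important word of each topic is printed first.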