# Chapter 8: Applying Machine Learning to Sentiment Analysis

## Preparing the IMDb movie review data for text processing

*Sentiment analysis*, also called *opinion mining*, is the task of classifying the attitude a writer expresses in a text, typically as positive or negative.

### Preprocessing the movie dataset into a convenient format

The individual review files need to be combined into a single `.csv` file.

First, loop through all the files and collect them into a single DataFrame:

In [6]:
import pandas as pd
import os
import sys
# # change the 'basepath' to the directory of the unzipped movie dataset
# basepath = 'aclImdb'
# labels = {'pos': 1, 'neg': 0}
# df = pd.DataFrame()
# for s in ('test', 'train'):
#     for l in ('pos', 'neg'):
#         path = os.path.join(basepath, s, l) 
#         for file in sorted(os.listdir(path)):
#             with open(os.path.join(path, file), 'r', encoding = 'utf-8') as infile:
#                 txt = infile.read()
#             df = df.append([[txt, labels[l]]], ignore_index = True)
# df.columns = ['review', 'sentiment']
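
The loop above grows the DataFrame with `DataFrame.append`, which was removed in pandas 2.0. A minimal sketch of the same idea that collects the rows in a plain list first is shown below. The function name `load_reviews` and the tiny stand-in directory tree are hypothetical and only serve as illustration; on the real data, `basepath` would point to the unzipped `aclImdb` folder.

```python
import os
import tempfile

import pandas as pd


def load_reviews(basepath, labels=None):
    """Read every review file under basepath/<split>/<label> into one DataFrame.

    Hypothetical helper: assumes the same folder layout as the loop above.
    """
    if labels is None:
        labels = {'pos': 1, 'neg': 0}
    rows = []
    for split in ('test', 'train'):
        for label_name, label_value in labels.items():
            path = os.path.join(basepath, split, label_name)
            for fname in sorted(os.listdir(path)):
                with open(os.path.join(path, fname), 'r', encoding='utf-8') as infile:
                    rows.append((infile.read(), label_value))
    # Build the DataFrame once instead of appending row by row
    return pd.DataFrame(rows, columns=['review', 'sentiment'])


# Demo on a tiny made-up directory tree (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    for split in ('test', 'train'):
        for label_name in ('pos', 'neg'):
            folder = os.path.join(tmp, split, label_name)
            os.makedirs(folder)
            with open(os.path.join(folder, '0.txt'), 'w', encoding='utf-8') as f:
                f.write(f'{split} {label_name} review')
    demo_df = load_reviews(tmp)

print(demo_df.shape)  # (4, 2)
```

Building the DataFrame in a single call also avoids copying the whole frame on every iteration, which matters with 50,000 files.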


Shuffle the rows, save the DataFrame to a CSV file, and then read it back:

In [8]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [10]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
# the following column renaming is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})
df.sample(3)

| | review | sentiment |
|---|---|---|
| 28921 | This film deals with the Irish rebellion in th... | 1 |
| 11971 | This movie is pure guano. Mom always said if y... | 0 |
| 15919 | Well the plot is entertaining but it is full o... | 0 |


In [11]:
df.shape

(50000, 2)

## Introducing the bag of words model

The *bag-of-words* model represents text as numerical feature vectors. This is done by:
1. Creating a vocabulary of unique tokens from the entire set of documents
2. Constructing a feature vector for each document that contains the count of how often each token occurs in that document

These feature vectors tend to be *sparse*: each document uses only a small subset of the vocabulary, so most entries are zero.
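
The two steps can be sketched by hand in plain Python. This is only an illustration under simplifying assumptions: the `tokenize` helper is hypothetical and very crude (lowercase, drop commas, split on whitespace), and the three example sentences are toy inputs.

```python
from collections import Counter

toy_docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]


def tokenize(text):
    # Crude tokenizer for illustration: lowercase, drop commas, split on whitespace
    return text.lower().replace(',', ' ').split()


# Step 1: vocabulary of unique tokens, each mapped to a column index
unique_tokens = sorted({tok for doc in toy_docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

# Step 2: one count vector per document, one entry per vocabulary token
vectors = []
for doc in toy_docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts.get(tok, 0) for tok in unique_tokens])

print(vocabulary)
for vec in vectors:
    print(vec)
```

Even in this tiny example the first two vectors are mostly zeros, since each short sentence uses only four of the nine vocabulary tokens.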

### Transforming words into feature vectors

`CountVectorizer` is built into scikit-learn and builds a bag-of-words model automatically:

In [17]:
from sklearn.feature_extraction.text import CountVectorizer 
count = CountVectorizer()  # create the vectorizer object
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)  # build the vocabulary and transform the documents into count vectors

In [18]:
# vocabulary_ maps each token to its column index in the feature vectors
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [19]:
# the count vectors, converted from a sparse matrix to a dense array
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])

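Note that each row of `bag.toarray()` corresponds to a document, and each column corresponds to a token at the index given in `count.vocabulary_`. The values are the *raw term frequencies* $tf(t, d)$: the number of times a term $t$ occurs in a document $d$. For example, the token `'is'` has index 1 and occurs three times in the third document, so the entry in the third row, second column is 3.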
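
To make the reading of the matrix concrete, the sketch below looks up single term frequencies through `count.vocabulary_`. The helper name `tf` is hypothetical; it is a thin wrapper around indexing the sparse matrix returned by `fit_transform`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
count = CountVectorizer()
bag = count.fit_transform(docs)


def tf(term, doc_index):
    """Raw term frequency of `term` in the document at row `doc_index` (hypothetical helper)."""
    column = count.vocabulary_[term]  # column index assigned to the term
    return int(bag[doc_index, column])


print(tf('is', 2))   # 3: 'is' occurs three times in the third document
print(tf('sun', 1))  # 0: 'sun' does not occur in the second document
```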

#### N-Gram models
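The bag-of-words model shown above is also called the *unigram model*, since each token is a single word. This can be extended to *n-grams*, which use contiguous sequences of $n$ words. This is implemented in `CountVectorizer` via the `ngram_range=(x, y)` parameter, where `x` and `y` are the smallest and largest values of $n$ to extract. For example, the default `ngram_range=(1, 1)` uses only single words, while `ngram_range=(2, 2)` uses only two-word sequences.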
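
As a small illustration of `ngram_range`, the sketch below builds a 2-gram representation of two toy sentences and then a combined 1-gram plus 2-gram vocabulary. The variable names are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

two_docs = ['The sun is shining', 'The weather is sweet']

# Only 2-grams: every token is a sequence of two consecutive words
bigram = CountVectorizer(ngram_range=(2, 2))
bigram_bag = bigram.fit_transform(two_docs)

print(sorted(bigram.vocabulary_))
# ['is shining', 'is sweet', 'sun is', 'the sun', 'the weather', 'weather is']
print(bigram_bag.toarray())

# 1-grams and 2-grams together: 6 single words plus 6 word pairs
uni_bigram = CountVectorizer(ngram_range=(1, 2))
uni_bigram.fit(two_docs)
print(len(uni_bigram.vocabulary_))  # 12
```

Larger values of $n$ capture more word order but also enlarge the vocabulary, which makes the feature vectors even sparser.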