__Chapter 8 - Applying Machine Learning to Sentiment Analysis__

1. [Preparing the IMDb movie review data for text processing](#Preparing-the-IMDb-movie-review-data-for-text-processing)
1. [Bag-of-words](#Bag-of-words)
1. [Transforming words into feature vectors](#Transforming-words-into-feature-vectors)


In [1]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings; warnings.simplefilter('ignore')
dataPath = os.path.abspath(os.path.join('../../Data'))
modulePath = os.path.abspath(os.path.join('../../CustomModules'))
sys.path.append(modulePath) if modulePath not in sys.path else None
from IPython.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))


# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.6f}'.format


# Modeling extensions
import sklearn.base as base
import sklearn.cluster as cluster
import sklearn.datasets as datasets
import sklearn.decomposition as decomposition
import sklearn.discriminant_analysis as discriminant_analysis
import sklearn.ensemble as ensemble
import sklearn.feature_extraction as feature_extraction
import sklearn.feature_selection as feature_selection
import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import sklearn.svm as svm
import sklearn.tree as tree
import sklearn.utils as utils


# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt


# Custom extensions and settings
from quickplot import qp, qpUtil, qpStyle
from mlTools import powerGridSearch
sns.set(rc = qpStyle.rcGrey)


# Magic functions
%matplotlib inline


<a id = 'Preparing-the-IMDb-movie-review-data-for-text-processing'></a>

# Preparing the IMDb movie review data for text processing

Sentiment analysis is a subdiscipline of natural language processing (NLP) concerned with analyzing the polarity of documents. One common task is to classify documents based on the emotions or opinions their authors express about a topic.

The IMDb movie review dataset consists of 50,000 individual user reviews. Each review is labeled as positive or negative, where positive means the movie received more than 6 stars and negative means it received fewer than 5 stars.

In [None]:
# Extract the IMDb archive into the current working directory

import tarfile
with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall()
    

In [None]:


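import pyprind

# Placeholder: set this to the directory that holds the extracted dataset
basepath = 'my path'

# Map the folder names to integer class labels
labels = {'pos' : 1, 'neg' : 0}
pbar = pyprind.ProgBar(50000)
rows = []
for s in ('test','train'):
    for l in ('pos','neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding = 'utf-8') as infile:
                txt = infile.read()
            # Collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns = ['review','sentiment'])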

In [None]:
# Shuffle the reviews, write them to CSV, and reload to confirm the file is readable

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index = False, encoding = 'utf-8')

df = pd.read_csv('movie_data.csv', encoding = 'utf-8')
df[:5]

<a id = 'Bag-of-words'></a>

# Bag-of-words

Bag-of-words is a method for representing text data as numerical feature vectors. It involves two key steps:

1. Create a vocabulary of unique tokens (for example, words) from the entire set of documents
2. Construct a feature vector for each document that contains the counts of how often each word occurs in that specific document. These individual feature vectors are typically very sparse because a single document contains only a small subset of the overall corpus vocabulary
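
The two steps above can be sketched in plain Python before turning to scikit-learn. This is a minimal illustration, not how any library implements it: the helper names `tokenize` and `to_vector` and the two-sentence toy corpus are assumptions made for this example, and the tokenizer simply lowercases the text and keeps runs of two or more word characters.

```python
import re
from collections import Counter

# Toy corpus (illustrative only)
docs = [
    'The sun is shining',
    'The weather is sweet',
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Step 1: vocabulary of unique tokens, mapped to column indices in sorted order
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

def to_vector(text):
    # Step 2: count how often each vocabulary token occurs in this document
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in unique_tokens]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)
print(vectors)
```

Each vector has one entry per vocabulary token, so with a realistic vocabulary of tens of thousands of tokens most entries would be zero, which is why sparse storage is the usual choice.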

<a id = 'Transforming-words-into-feature-vectors'></a>

## Transforming words into feature vectors

In [4]:
# CountVectorizer() example

count = feature_extraction.text.CountVectorizer()
docs = np.array([
    'The sun is shining'
    ,'The weather is sweet'
    ,'The sun is shining and the weather is sweet, and one and one is two'
])
bag = count.fit_transform(docs)


In [5]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
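
The vocabulary maps each word to a column index, so the next natural step is to look at the count vectors themselves. The sketch below rebuilds the same three documents so that it runs on its own, then converts the sparse result to a dense array with `toarray()`. The variable name `dense` is introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array([
    'The sun is shining'
    ,'The weather is sweet'
    ,'The sun is shining and the weather is sweet, and one and one is two'
])

count = CountVectorizer()
bag = count.fit_transform(docs)   # sparse matrix, one row per document

# Dense view: column j holds the count of the word whose vocabulary index is j
dense = bag.toarray()
print(dense)
# [[0 1 0 1 1 0 1 0 0]
#  [0 1 0 0 0 1 1 0 1]
#  [3 3 2 1 1 1 2 1 1]]
```

Reading the last row against the vocabulary: index 0 is 'and', which occurs three times in the third document, and index 6 is 'the', which occurs twice. These raw counts are often called term frequencies.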

