- Note - Don't run the cells as a live demo - some tasks can take 10 minutes or longer...
# Text Classification
- applying machine learning to classify natural language text for various tasks
- a comprehensive article on Text Classification: https://arxiv.org/pdf/2004.03705.pdf
- some common text classification tasks:
    1. sentiment analysis
    2. news categorization
    3. topic analysis
    4. question answering (QA)
    5. natural language inference (NLI)
    
## Sentiment Analysis
- subfield of **natural language processing (NLP)** 
- also called **opinion mining**
- apply ML algorithms to classify documents based on their polarity:
    - the attitude of the writer

## General steps
1. clean and prepare text data
2. build feature vectors from text documents
3. train a machine learning model to classify positive and negative movie reviews
4. test and evaluate the model

## IMDb dataset
- contains 50,000 labeled movie reviews from the Internet Movie Database (IMDb)
- task is to classify reviews as **positive** or **negative**
- compressed archive can be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment


### Download and untar IMDb dataset
- on Linux and Mac use the following cells
- on Windows, manually download the archive and untar using 7Zip or other application
- or use the provided Python code

In [1]:
# let's download the file
# FYI - file is ~ 84 MB; takes a while...
! curl -o data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  1335k      0  0:01:01  0:01:01 --:--:--  970k


In [2]:
# let's see the contents of the data folder
! ls data

aclImdb_v1.tar.gz         titanic.csv               wdbc.data
melb_data.csv             titanic.xlsx              wdbc.names
melbourne-housingdata.zip titanic_sorted_age.xlsx


In [4]:
# let's untar the compressed aclImdb_v1.tar.gz file
! tar -zxf data/aclImdb_v1.tar.gz --directory data

### Optional Python code to download and extract the tar file

In [5]:
import os
import sys
import tarfile
import time
import urllib.request


source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'data/aclImdb_v1.tar.gz'


def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                    (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()


if not os.path.isdir('data/aclImdb') and not os.path.isfile(target):
    urllib.request.urlretrieve(source, target, reporthook)

In [6]:
# untar the file
if not os.path.isdir('data/aclImdb'): # if the directory doesn't exist untar the target to path
    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall(path="./data")

### Preprocess the movie dataset into a more convenient format
- extract and load the movie dataset into Pandas DataFrame
- NOTE: can take up to **10 minutes**
- use the Python Progress Indicator (PyPrind) package to show a progress bar from Python code

In [3]:
! pip install pyprind

Collecting pyprind
  Using cached PyPrind-2.11.2-py3-none-any.whl (8.6 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.2


In [4]:
import pyprind
import pandas as pd
import os

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'data/aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
# collect rows in a list first: DataFrame.append was removed in pandas 2.0
# and appending row by row to a DataFrame is slow
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:47


### Shuffle and save the assembled data as CSV file
- pickle the DataFrame as a binary file for faster load

In [8]:
import numpy as np
import pickle

In [5]:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index)) # randomize files

In [9]:
df

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
...,...,...
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0


In [10]:
# save csv format
df.to_csv('data/movie_data.csv', index=False, encoding='utf-8')

In [12]:
# save DataFrame as a pickle dump
# the with-block closes the file handle after writing
with open('data/movie_data.pd', 'wb') as f:
    pickle.dump(df, f)

In [16]:
# directly load the pickled file as DataFrame
with open('data/movie_data.pd', 'rb') as f:
    df = pickle.load(f)

In [17]:
df

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
...,...,...
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0


### bag-of-words model
- ML algorithms only work on numerical values
- need to encode/transform text data into numerical values using the **bag-of-words** model
- the **bag-of-words** technique allows us to represent text as numerical feature vectors:
    1. extract all the unique tokens (e.g., words) from the entire set of documents to build a vocabulary
    2. construct a feature vector for each document that contains the frequency of each vocabulary word in that particular document
    3. the order of the words in the document doesn't matter, hence bag-of-words
- since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will be **sparse**, consisting mostly of zeros
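
- the three steps above can be sketched by hand with the standard library only; this is a minimal illustration, not how scikit-learn implements it
- the `tokenize` and `to_vector` helpers are hypothetical names introduced for this example, and splitting on whitespace is a simplifying assumption

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
]


def tokenize(text):
    # hypothetical helper: lowercase the text and split on whitespace
    return text.lower().split()


# step 1: vocabulary = every unique token across all documents, sorted alphabetically
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def to_vector(doc):
    # step 2: count how often each vocabulary word occurs in this document
    counts = Counter(tokenize(doc))
    return [counts[token] for token in vocabulary]


# step 3: word order is discarded; only the counts remain
vectors = [to_vector(doc) for doc in docs]

print(vocabulary)
print(vectors)
```

- each vector has one position per vocabulary word, so most entries become zero as the vocabulary grows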

### transform words into feature vectors
- use `CountVectorizer` class implemented in scikit-learn
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- `CountVectorizer` takes an array of text data and returns the bag-of-words feature vectors as a sparse matrix

In [24]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [25]:
# let's look at the vocabulary_ contents of count object
count.vocabulary_

{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}

In [26]:
bag

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [27]:
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])

### bag-of-words feature vector
- the values in the feature vectors are also called the **raw term frequencies**
    - $x^i = tf(t^i, d)$
    - the number of times a term $t$ appears in a document $d$
- indices of terms are usually assigned alphabetically

### N-gram models
- the above model is **1-gram** or **unigram** model
    - each item or token in the vocabulary represents a single word
- if the sentence is: "The sun is shining"
- 1-gram: "the", "sun", "is", "shining"
- 2-gram: "the sun", "sun is", "is shining"
- `CountVectorizer` class allows us to use different n-gram models via its `ngram_range` parameter
- e.g., `ngram_range=(2, 2)` will use a 2-gram model
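
- to make the 1-gram vs. 2-gram split concrete, here is a small standard-library sketch that slides a window of `n` tokens over the sentence
- `ngrams` is a hypothetical helper written for this illustration; lowercasing and whitespace splitting are simplifying assumptions, and `CountVectorizer` tokenizes with its own rules

```python
def ngrams(text, n):
    # hypothetical helper: lowercase, split on whitespace,
    # then join every run of n consecutive tokens
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = 'The sun is shining'
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)

print(unigrams)
print(bigrams)
```

- with scikit-learn, the corresponding setting would be `CountVectorizer(ngram_range=(2, 2))` for 2-grams only, or `ngram_range=(1, 2)` to keep both 1-grams and 2-grams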

## Assess word relevancy via term frequency-inverse document frequency
- words often occur across multiple documents from all the classes (positive and negative in IMDb)
- such frequently occurring words typically don't contain discriminatory information
- the **tf-idf** model can be used to downweight these frequently occurring words in the feature vectors
    - $tf\mbox{-}idf(t, d) = tf(t, d) \times idf(t, d)$
    - $idf(t, d) = \log\frac{n_d}{1+df(d, t)}$
    - $n_d$ - total number of documents
    - $df(d, t)$ - number of documents $d$ that contain the term $t$
    - the $\log$ ensures that low document frequencies are not given too much weight
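
- the formulas above can be checked by hand on the three example sentences; this is a minimal sketch under stated assumptions, not the scikit-learn implementation
- assumptions: the natural logarithm is used for $\log$, punctuation is dropped and text is lowercased beforehand, and the helper names `tf`, `df`, `idf`, and `tf_idf` are hypothetical

```python
import math

docs = [
    'the sun is shining',
    'the weather is sweet',
    'the sun is shining the weather is sweet and one and one is two',
]
tokenized = [doc.split() for doc in docs]
n_d = len(tokenized)  # total number of documents


def tf(term, doc_tokens):
    # raw term frequency: how often the term occurs in one document
    return doc_tokens.count(term)


def df(term):
    # document frequency: number of documents that contain the term
    return sum(1 for tokens in tokenized if term in tokens)


def idf(term):
    # idf(t, d) = log(n_d / (1 + df(d, t))), natural log assumed
    return math.log(n_d / (1 + df(term)))


def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)


# weights of three terms in the third document
for term in ('is', 'sun', 'and'):
    print(term, round(tf_idf(term, tokenized[2]), 3))
```

- 'is' appears in every document, so under this formula its weight is not positive, while 'and' appears in only one document and receives a positive weight
- scikit-learn's `TfidfTransformer` uses a slightly different, smoothed idf formula by default, so its numbers will not match this sketch exactly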