# Feature Extraction



In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

In [2]:
DATA_DIR = "data/aclImdb/"
RANDOM_SEED = 7575

## loading data

In [3]:
import os
import re
from pathlib import Path

def read_text_dir(datadir):
    path = Path(datadir)
    for fil in path.glob("*.txt"):
        yield fil
        
# Regex matching every character that is not a letter, digit, whitespace, or one of ! ? .
SPECIAL_CHARS = re.compile(r'([^a-z\d!?.\s])', re.IGNORECASE)

def read_files(files):
    for fil in files:
        with open(fil, "r", encoding="utf-8") as f:  # explicit encoding, independent of platform default
            yield SPECIAL_CHARS.sub("", f.read())
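
It helps to see what this cleaning step actually does to a review. The sketch below applies the same pattern to a made-up string (the sample text is an assumption, not taken from the dataset). Apostrophes are removed, which is why forms such as "Ive" and "Im" show up in the loaded data, and HTML line breaks lose their brackets but leave a stray `br` token behind.

```python
import re

# Same pattern as in the cell above.
SPECIAL_CHARS = re.compile(r'([^a-z\d!?.\s])', re.IGNORECASE)

# Made-up review text, for illustration only.
sample = "I've seen it twice... It's great!<br /><br />10/10, really?"
cleaned = SPECIAL_CHARS.sub("", sample)

print(cleaned)
# → Ive seen it twice... Its great!br br 1010 really?
```

If the leftover `br` tokens turn out to matter, one option would be to strip HTML tags before applying this pattern.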

In [4]:
from itertools import product

def read_data(_dir):
    splits = ["train", "test"]
    labels = {"pos": 1, "neg": 0}
    
    dfs = []
    
    for split, label in product(splits, labels.keys()):
        datadir = os.path.join(_dir, f"{split}/{label}")
        text_gen = read_files(read_text_dir(datadir))
        dfs.append(
            pd.DataFrame({"text": list(tqdm(text_gen)), "split": split, "label": labels[label]})
        )
    return pd.concat(dfs)

In [5]:
imdb_df = read_data(DATA_DIR)

In [6]:
train_df = imdb_df[imdb_df.split == "train"]
test_df = imdb_df[imdb_df.split == "test"]

imdb_df.head()

| | text | split | label |
|---|---|---|---|
| 0 | Fear of a Black Hat is a superbly crafted film... | train | 1 |
| 1 | Many reviews Ive read reveals that most people... | train | 1 |
| 2 | A nicely done thriller with plenty of sex in i... | train | 1 |
| 3 | Im going to keep this fairly brief as to not s... | train | 1 |
| 4 | At first I thought the Ring would be a more th... | train | 1 |


## Vectorizing our data

To extract information from text, 
we need to vectorize our word sequences. 
In other words, we'll transform each document into numerical features. 
There are many vectorization and embedding techniques, such as Bag of Words 
or pre-trained word embeddings, but here we'll use a representation known as [**TF-IDF**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

TF-IDF stands for "Term Frequency, Inverse Document Frequency", 
and is the product of two independent statistics: Term Frequency (how often a word occurs in a document) 
and Inverse Document Frequency (a decreasing function of the number of documents containing the word).
There are [various modifications often made](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition), but the simplest
formulation uses the raw count as the term frequency:
$$
tf(i, d) = \text{raw count} = f_{i, d}
$$

for a word $i$ and document $d$, and the inverse document frequency is
$$
idf(i, D) = \log{\frac{N}{1 + |\{ d \in D : i \in d \}|}}
$$

for a collection $D$ of $N$ documents, so that our TF-IDF value for a word $i$ and document $d$ is:
$$
TFIDF(i, d) = tf(i, d) * idf(i, D)
$$

Note that a 1 is included in the denominator of the definition of `idf` to prevent
division by zero.  Of course, if our vocabulary is generated from our
corpus (collection of documents), every word appears in at least one document
and we won't run into this problem.
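
To make the definitions concrete, here is a minimal sketch that computes them by hand in plain Python. The three-document toy corpus and the whitespace tokenization are assumptions made for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its values will differ from these.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Raw count of `term` in one tokenized document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus_tokens):
    """log(N / (1 + number of documents containing `term`))."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / (1 + doc_freq))


def tfidf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Toy corpus (an assumption for this sketch), tokenized on whitespace.
corpus = [
    "the movie was great great fun".split(),
    "the movie was dull".split(),
    "the plot was thin".split(),
]

# Score every distinct word of the first document.
scores = {term: tfidf(term, corpus[0], corpus) for term in set(corpus[0])}
for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:>6}: {score:+.3f}")
```

Here "great" scores highest (it occurs twice and only in this document), "movie" scores exactly zero, and "the" and "was" come out slightly negative: with the 1 in the denominator, a word found in all $N$ documents has $idf = \log\frac{N}{N + 1} < 0$.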

### intuition

The TF-IDF score of a word is high when it is frequently found in a document. 
However, if the word appears in many documents, i.e. is not a good discriminator, 
it will have a lower score. 
For example, common words such as "the" or "and" will have low scores since they appear in many documents. 

### implementation

We'll use the scikit-learn implementation 
[`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
for the most part, but you're encouraged to implement your own in
`feature.py`.  A class is stubbed out there that is designed to comply
with the sklearn transformer API.
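
As a starting point, a transformer in this style might look roughly like the sketch below. The class name `SimpleTfidf`, the whitespace tokenizer, and the dense list-of-lists output are all assumptions for illustration; this is not the stub in `feature.py`. It only mirrors the `fit` / `transform` / `fit_transform` convention and uses the simple formulas from above.

```python
import math
from collections import Counter


class SimpleTfidf:
    """Hypothetical TF-IDF transformer following the fit/transform convention."""

    def fit(self, raw_documents, y=None):
        docs = [doc.split() for doc in raw_documents]
        n_docs = len(docs)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(term for doc in docs for term in set(doc))
        terms = sorted(doc_freq)
        # Trailing underscores mark state learned during fit.
        self.vocabulary_ = {term: idx for idx, term in enumerate(terms)}
        self.idf_ = [math.log(n_docs / (1 + doc_freq[term])) for term in terms]
        return self

    def transform(self, raw_documents):
        rows = []
        for doc in raw_documents:
            row = [0.0] * len(self.vocabulary_)
            for term, count in Counter(doc.split()).items():
                idx = self.vocabulary_.get(term)
                if idx is not None:  # terms unseen during fit are ignored
                    row[idx] = count * self.idf_[idx]
            rows.append(row)
        return rows

    def fit_transform(self, raw_documents, y=None):
        return self.fit(raw_documents).transform(raw_documents)


# Toy data (assumed): fit on "train", then reuse the learned state on "test".
train = ["good fun film", "dull film", "good cast", "slow plot"]
test = ["good surprising film"]

tfidf = SimpleTfidf()
train_rows = tfidf.fit_transform(train)
test_rows = tfidf.transform(test)
print(len(tfidf.vocabulary_), len(test_rows[0]))
```

A version meant to plug into scikit-learn pipelines would typically also inherit from `sklearn.base.BaseEstimator` and `TransformerMixin` and return a sparse matrix; those parts are left out here to keep the sketch dependency-free.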

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = train_df.text
test_corpus = test_df.text

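vectorizer = TfidfVectorizer(
    ngram_range=(1, 2), # use unigrams and bigrams
    lowercase=False, # keep the original casing (no lowercasing)
    min_df=0.001, # drop terms found in fewer than 0.1% of documents
    max_df=0.95, # drop terms found in more than 95% of documents
    max_features=5000 # keep the 5000 most frequent remaining terms
)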
train_feat = vectorizer.fit_transform(train_corpus)
test_feat = vectorizer.transform(test_corpus)

In [8]:
print(f"Shapes:\ntrain: {train_feat.shape}\ntest: {test_feat.shape}")

Shapes:
train: (25000, 5000)
test: (25000, 5000)


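So, we have 25000 examples with 5000 features each, as desired.
Note that we fit the vectorizer only on the training corpus, but
transform both the training and test corpora with it.  This is very important:
fitting on the test data would leak its vocabulary and document frequencies into the features!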

## modeling our data

Now we have a simple task: binary classification given numerical feature data.
We can now use many common models to predict the sentiment (positive or negative)
of this data!  

### logistic regression

In [9]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression()
lr_clf.fit(train_feat, imdb_df[imdb_df.split == "train"].label)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## testing

In [10]:
from sklearn.metrics import accuracy_score, f1_score

In [11]:
print("Logistic Regression:")
y_pred = lr_clf.predict(test_feat)
y_true = imdb_df[imdb_df.split == "test"].label
print(f"accuracy: {accuracy_score(y_true, y_pred)}")
print(f"f1: {f1_score(y_true, y_pred)}")

Logistic Regression:
accuracy: 0.88712
f1: 0.887980311209908


Overall we do well! Hyper-parameter tuning could plausibly push our accuracy and F1
above 0.9, which would be quite good.  Keep in mind that this model only considers 5,000
unigram and bigram features and their weighted counts, not where they occur in a review.
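
The point about word placement can be illustrated with a small sketch. The two sentences and the `ngrams` helper are made up for illustration: the sentences contain exactly the same words, so unigram counts cannot tell them apart, while bigrams (which the vectorizer above includes through `ngram_range=(1, 2)`) recover a little local order.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two made-up sentences with the same words in a different order.
a = "the film was good not bad".split()
b = "the film was bad not good".split()

same_unigrams = Counter(a) == Counter(b)
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_unigrams)  # → True
print(same_bigrams)   # → False
```

Anything that depends on longer-range structure than adjacent word pairs is still invisible to this representation.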

## visualizing / understanding

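Recent work has attempted to explain why a model emits 
a particular classification for a given input.
Below we use LIME (via the `lime` package) to highlight which words push a prediction toward each class.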
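
The general idea behind perturbation-based explanations can be sketched in a few lines. Note that this is a deliberately simplified stand-in, not LIME's algorithm: LIME samples many perturbed versions of the text and fits a weighted local linear model to the classifier's outputs, whereas the sketch below simply removes one word at a time and records how the positive-class probability changes. The toy classifier `toy_positive_proba` and its word lists are assumptions made for illustration.

```python
import math


def toy_positive_proba(text):
    """Stand-in classifier (assumed): sigmoid of positive minus negative word counts."""
    positive = {"funniest", "great", "loved"}
    negative = {"boring", "worst", "dull"}
    tokens = text.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return 1 / (1 + math.exp(-score))


def leave_one_out_importance(text, proba_fn):
    """Drop each word in turn; importance = how much the probability falls."""
    tokens = text.split()
    base = proba_fn(text)
    importances = []
    for i, token in enumerate(tokens):
        perturbed = " ".join(tokens[:i] + tokens[i + 1:])
        importances.append((token, base - proba_fn(perturbed)))
    return importances


importances = leave_one_out_importance(
    "one of the funniest but boring movies", toy_positive_proba
)
for token, delta in importances:
    print(f"{token:>10}: {delta:+.3f}")
```

Words with a positive value push the prediction toward "positive", words with a negative value push it toward "negative", and words the toy classifier ignores come out as zero.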

In [12]:
def clf_fn(clf):
    """
    Returns a function that takes as parameter a text instance and 
    returns the output of clf.predict_proba of the given TFIDF repr.
    """
    
    def _predict_proba(_input):
        if isinstance(_input, str):  # wrap a single string so transform() receives a list
            feat = vectorizer.transform([_input])
        else:
            feat = vectorizer.transform(_input)
    
        return clf.predict_proba(feat)
    
    return _predict_proba

In [13]:
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["negative", "positive"])

In [14]:
from ipywidgets import interact

@interact(text_instance="The Room was one of the funniest movies I've ever seen!")
def lime_example(text_instance):
    exp = explainer.explain_instance(text_instance, classifier_fn=clf_fn(lr_clf))
    exp.show_in_notebook()
