Bag-of-words representation
---

Goal: Represent each text as a vector of numbers for our ML tasks!

In [1]:
# We cannot use X: ML models work with numerical features!
X = [
    "Scikit-learn makes ML easy, easy as 123", 
    "Learning TensorFlow for deep learning"
]

Idea: Create one feature per word, whose value is the number of times the word appears in the text.
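To make the idea concrete, here is a minimal pure-Python sketch of the counting step. The regular expression is an assumption meant to mirror the default tokenization of scikit-learn's `CountVectorizer` (lowercased tokens of two or more word characters), and the helper name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count_rows) for a list of strings."""
    # Lowercase, then keep tokens made of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in texts]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this text
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

texts = [
    "Scikit-learn makes ML easy, easy as 123",
    "Learning TensorFlow for deep learning",
]
vocabulary, rows = bag_of_words(texts)
print(vocabulary)
for row in rows:
    print(row)
```

Each row has one value per vocabulary word, so both texts end up as vectors of the same length regardless of how many words they contain.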

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Scikit-learn provides a vectorizer transformer for that
vect = CountVectorizer()
X_encoded = vect.fit_transform(X);



In [5]:
# Print vocabulary (the features) with their column index in the feature matrix
for word, index in vect.vocabulary_.items():
    print('"{}" with index {}'.format(word, index))

"scikit" with index 9
"learn" with index 5
"makes" with index 7
"ml" with index 8
"easy" with index 3
"as" with index 1
"123" with index 0
"learning" with index 6
"tensorflow" with index 10
"for" with index 4
"deep" with index 2


In [6]:
# Input size
X_encoded.shape

(2, 11)

Note: In practice, this approach usually leads to a large number of features (one per word) with only a few non-zero values per entry. For this reason, Scikit-learn stores the data as a "sparse matrix" that only keeps the non-zero values, which is more memory-efficient.
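The storage idea can be sketched by hand. The snippet below builds the three arrays used by the Compressed Sparse Row layout (values, column indices, row pointers) from a small made-up matrix. It is an illustration only, not SciPy's actual implementation, and the helper name `to_csr` is hypothetical.

```python
def to_csr(dense_rows):
    """Convert a list of rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for column, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(column)  # the column it belongs to
        indptr.append(len(data))        # position in data where the next row starts
    return data, indices, indptr

# Made-up count matrix: 3 texts, 8 words, mostly zeros
dense = [
    [0, 2, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 3, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 1],
]
data, indices, indptr = to_csr(dense)
print(data)
print(indices)
print(indptr)

stored = len(data)
total = len(dense) * len(dense[0])
print(f"{stored} stored values instead of {total}")
```

The values of row `i` are `data[indptr[i]:indptr[i + 1]]`, so the zeros never need to be stored.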

In [7]:
# Scikit-learn uses sparse matrices instead of NumPy arrays
X_encoded

<2x11 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

But we can always convert the data back to a dense NumPy array.

In [8]:
X_encoded.toarray()

array([[1, 1, 0, 2, 0, 1, 0, 1, 1, 1, 0],
       [0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 1]], dtype=int64)

Notes

* `X_encoded[0, 3]` is 2 because "easy" (with index 3) appears twice in the first entry
* `X_encoded[1, 6]` is 2 because "learning" (with index 6) appears twice in the second entry

Concrete use case - Sentiment analysis
---

Task: Classify movie reviews as being `positive` or `negative` about their movie. This is a binary classification task with text input.

Download the `Large Movie Review Dataset v1.0` data from https://ai.stanford.edu/~amaas/data/sentiment/ and extract it into an `aclImdb` folder next to this notebook.
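The download can also be scripted. The sketch below is one possible way to do it with the standard library. The archive URL and the assumption that the archive unpacks into a top-level `aclImdb` folder are not stated in this notebook, so check them against the dataset page first. The helper name `download_dataset` is hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed archive location; verify it on the dataset page before relying on it
ARCHIVE_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

def download_dataset(target_dir=".", url=ARCHIVE_URL):
    """Download and extract the archive into target_dir unless aclImdb already exists."""
    target_dir = Path(target_dir)
    dataset_dir = target_dir / "aclImdb"
    if dataset_dir.is_dir():
        return dataset_dir  # already extracted, nothing to do
    archive_path = target_dir / "aclImdb_v1.tar.gz"
    if not archive_path.exists():
        urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Assumption: the archive contains a top-level aclImdb folder
        archive.extractall(target_dir)
    return dataset_dir
```

Calling `download_dataset()` from the notebook's folder should then leave the data where the cells below expect it.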

In [13]:
from sklearn.datasets import load_files
import os

# Train set
train_data = load_files(os.path.join('aclImdb', 'train'), categories=['pos', 'neg'])

In [14]:
# First review
train_data.data[0]

b"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty."

In [15]:
# Classes
train_data.target_names

['neg', 'pos']

In [16]:
train_data.target[0] # "positive"

1

In [17]:
# Create X, y arrays
vect = CountVectorizer()
y = train_data.target
X = train_data.data
X_encoded = vect.fit_transform(X)

# Vocabulary and input
print('Training size:', len(train_data.data))
print('Vocabulary size:', len(vect.vocabulary_))
print('Input shape:', X_encoded.shape)
print('Target shape:', y.shape)

Training size: 25000
Vocabulary size: 74849
Input shape: (25000, 74849)
Target shape: (25000,)


In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Use logistic regression to classify reviews
grid_cv = GridSearchCV(
    estimator=LogisticRegression(solver='liblinear'),
    param_grid={
        'C': [0.1, 1, 10]
    },
    cv=3,
    return_train_score=True,
    n_jobs=-1,
    verbose=1
)
grid_cv.fit(X_encoded, y);

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:   43.3s remaining:   54.1s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   54.0s finished


In [19]:
import pandas as pd

# Collect results in a DataFrame
cv_results = pd.DataFrame(grid_cv.cv_results_)
cols = ['param_C', 'mean_test_score', 'std_test_score', 'mean_train_score', 'std_train_score']
cv_results[cols].sort_values('mean_test_score', ascending=False)

| | param_C | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|
| 0 | 0.1 | 0.88524 | 0.001181 | 0.9799 | 0.000129 |
| 1 | 1.0 | 0.87604 | 0.001177 | 0.99924 | 0.000057 |
| 2 | 10.0 | 0.87032 | 0.001643 | 1.0 | 0.0 |


In [24]:
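# Predicted class for review 10; wrap this call in print() to display it,
# since a notebook cell only shows the value of its last expression
grid_cv.predict(X_encoded[10])
X[10]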

b"Probably the worst Dolph film ever. There's nothing you'd want or expect here. Don't waste your time. Dolph plays a miserable cop with no interests in life. His brother gets killed and Dolph tries to figure things out. The character is just plain stupid and stumbles around aimlessly. Pointless."

Variants
---

There are also slightly more sophisticated methods like TF-IDF, which down-weights the counts of words that appear in many entries. The idea is that such words are less relevant to our task. Common examples are stop words like "the", "is", and "a", but common words can also be specific to the data set, e.g. "movie" and "actor" are frequent in our case and do not help to classify a review as positive or negative.
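As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on made-up counts. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each row, which to my understanding matches the `TfidfVectorizer` defaults. The helper name `tfidf` and the toy data are hypothetical.

```python
import math

def tfidf(count_rows):
    """Weight raw counts by smoothed inverse document frequency, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: a term present in every document gets the minimum weight of 1
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(value * value for value in weighted))
        weighted_rows.append([value / norm if norm else 0.0 for value in weighted])
    return idf, weighted_rows

# Made-up counts for the columns ["movie", "great", "boring"]
counts = [
    [2, 1, 0],  # "movie" twice, "great" once
    [1, 0, 1],
    [1, 0, 2],
]
idf, weights = tfidf(counts)
print([round(value, 3) for value in idf])
print([round(value, 3) for value in weights[0]])
```

In the first row, "movie" occurs twice as often as "great", but because "movie" appears in every document its weight ends up much closer to that of "great" than the raw counts suggest.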

In addition to logistic regression, the Naive Bayes classifier is another common baseline to evaluate when working with text.
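For intuition, a multinomial Naive Bayes classifier on word counts can be sketched in a few lines of plain Python. This is a simplified illustration with add-one (Laplace) smoothing, not scikit-learn's implementation. The function names and the toy data are hypothetical.

```python
import math

def fit_multinomial_nb(count_rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities for each class."""
    classes = sorted(set(labels))
    n_terms = len(count_rows[0])
    log_prior, log_likelihood = {}, {}
    for cls in classes:
        rows = [row for row, label in zip(count_rows, labels) if label == cls]
        log_prior[cls] = math.log(len(rows) / len(count_rows))
        # Total count of each word over all documents of this class
        totals = [sum(row[j] for row in rows) for j in range(n_terms)]
        denominator = sum(totals) + alpha * n_terms
        log_likelihood[cls] = [math.log((t + alpha) / denominator) for t in totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {
        cls: log_prior[cls] + sum(c * lp for c, lp in zip(row, log_likelihood[cls]))
        for cls in log_prior
    }
    return max(scores, key=scores.get)

# Made-up counts for the columns ["great", "boring", "movie"]; 1 = positive, 0 = negative
counts = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 2]]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = fit_multinomial_nb(counts, labels)
print(predict([1, 0, 1], log_prior, log_likelihood))  # "great movie"
print(predict([0, 1, 1], log_prior, log_likelihood))  # "boring movie"
```

The smoothing term `alpha` keeps a word that was never seen in a class from forcing that class's probability to zero.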

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Create a pipeline this time; both steps are placeholders
# that the grid search below replaces with real estimators
pipe = Pipeline([
    ('vect', None),
    ('clf', None)
])

# Try simple counts vs. TF-IDF, and logistic regression vs. Naive Bayes
grid_cv = GridSearchCV(
    estimator=pipe,
    param_grid={
        'vect': [CountVectorizer(), TfidfVectorizer()],
        'clf': [LogisticRegression(solver='liblinear'), MultinomialNB()]
    },
    cv=3,
    return_train_score=True,
    scoring='accuracy'
)
grid_cv.fit(X, y)

# Collect results in a DataFrame
cv_results = pd.DataFrame({
    'clf': [type(step).__name__ for step in grid_cv.cv_results_['param_clf']],
    'vect': [type(step).__name__ for step in grid_cv.cv_results_['param_vect']],
    'mean_test_score': grid_cv.cv_results_['mean_test_score'],
    'std_test_score': grid_cv.cv_results_['std_test_score'],
    'mean_train_score': grid_cv.cv_results_['mean_train_score'],
    'std_train_score': grid_cv.cv_results_['std_train_score']
})
cv_results.sort_values('mean_test_score', ascending=False)

Final test evaluation
---

In [None]:
test_data = load_files(os.path.join('aclImdb', 'test'), categories=['pos', 'neg'])
grid_cv.score(test_data.data, test_data.target) # Score best estimator

Additional resources
---

Scikit-learn provides many options for vectorizers. Make sure to check the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) documentation.
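As a starting point, the sketch below tries three of these options (`ngram_range`, `min_df`, and `binary`) on a made-up corpus. The corpus and variable names are hypothetical, and the right settings for the movie reviews would have to be found by cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good good plot"]

# ngram_range=(1, 2) adds pairs of consecutive words (bigrams) next to single words
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(bigram_vect.vocabulary_))

# min_df=2 keeps only the terms that appear in at least two documents
frequent_vect = CountVectorizer(min_df=2).fit(docs)
print(sorted(frequent_vect.vocabulary_))

# binary=True records presence (0 or 1) instead of the number of occurrences
binary_vect = CountVectorizer(binary=True)
binary_counts = binary_vect.fit_transform(docs).toarray()
print(binary_counts)
```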