# Machine Learning with Python

In [1]:
import numpy as np
import pandas as pd

In [2]:
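from sklearn.datasets import load_files

# load_files reads one sub-folder per class; .data holds each review as raw bytes
reviews_train = load_files("../assets/imdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target
# the reviews contain HTML line breaks, which we replace with spaces
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

# same preparation for the held-out test reviews
reviews_test = load_files("../assets/imdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]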

# 3.2 Sentiment Analysis

### The task

Our first machine learning task will be a *binary classification*, trying to make a predictor for the class label (i.e. positive or negative review) based on the text. This is a form of [*sentiment analysis*](https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17).

Unlike our previous examples using structured data, we do not yet have any features that can be fed to a learning algorithm. Finding a useful data representation is therefore a key component of NLP.

### Bag of Words

A very simple but usually quite effective approach is the so-called *bag of words*.

Here, we __discard the information contained in the document structure and the order of the words in each sentence,__ and just represent each document as a frequency table showing how often each word occurs therein.

The three stages are

* *Tokenization* - break each document into a list of words.
* *Vocabulary building* - collect all the words found in the corpus and sort them in alphabetical order.
* *Encoding* - for each document, create a frequency table over the vocabulary.

Here is a simple example on two short documents:

In [3]:
bards_words = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

In [5]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))

Vocabulary size: 13
Vocabulary content:
 {'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}


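The vectorizer lowercases the text by default, so "The" and "the" are counted as the same word. The default tokenizer also keeps only tokens of two or more characters, which is why "a" is missing from the vocabulary.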

In [6]:
bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>


The sparse matrix format is much more memory-efficient for data where we expect many zeros.
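To see what a Compressed Sparse Row layout actually stores, here is a minimal pure-Python sketch. The toy matrix and the helper names `to_csr` and `to_dense` are made up for illustration; SciPy's real implementation is far more optimized.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


def to_dense(data, indices, indptr, n_cols):
    """Rebuild the dense rows from the three CSR arrays."""
    rows = []
    for start, end in zip(indptr, indptr[1:]):
        row = [0] * n_cols
        for value, col in zip(data[start:end], indices[start:end]):
            row[col] = value
        rows.append(row)
    return rows


# Hypothetical toy matrix with mostly zeros.
toy = [
    [0, 0, 2, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [3, 0, 0, 0, 1, 0],
]
data, indices, indptr = to_csr(toy)
print(data, indices, indptr)
```

Only the non-zero values, their column indices and one pointer per row are kept, so the saving grows with the share of zeros in the matrix.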

For inspection, we can use `toarray()` to change this back into a "dense" numpy array:

In [7]:
print("Dense representation of bag_of_words:\n{}".format(
    bag_of_words.toarray()))

Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]


Notice that the words with index 3 ("fool"), 9 ("the") and 12 ("wise") appear in both documents.
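The three stages can also be sketched by hand in plain Python. This is only an illustration, not how scikit-learn is implemented: the regular expression is assumed to mirror the vectorizer's default behaviour (lowercasing, tokens of two or more word characters), and the helper names are hypothetical.

```python
import re
from collections import Counter

docs = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]

TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Stage 1: lowercase and split into tokens of at least two characters."""
    return TOKEN_RE.findall(doc.lower())


tokenized = [tokenize(doc) for doc in docs]

# Stage 2: collect every distinct token and sort alphabetically.
vocabulary = sorted({tok for toks in tokenized for tok in toks})


def encode(tokens):
    """Stage 3: count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


matrix = [encode(toks) for toks in tokenized]
shared = [w for w in vocabulary if all(w in toks for toks in tokenized)]

print(len(vocabulary))
for row in matrix:
    print(row)
print(shared)
```

The rows printed here should agree with the dense array above, and `shared` lists the words that occur in both documents.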

Let's try the same approach with the IMDB data:

In [8]:
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3431196 stored elements in Compressed Sparse Row format>
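As a rough check on why the sparse format matters here, the numbers reported above can be turned into a density and an estimate of the dense memory footprint. The 8 bytes per cell assume the `int64` type shown in the output; the real overhead of either format will differ somewhat.

```python
n_docs, n_features = 25000, 74849   # shape reported for X_train
stored = 3431196                    # non-zero entries reported for X_train

cells = n_docs * n_features
density = stored / cells
dense_bytes = cells * 8             # 8 bytes per int64 cell

print(f"{cells:,} cells, density {density:.4%}")
print(f"a dense int64 array would need about {dense_bytes / 1e9:.1f} GB")
```

Fewer than two cells in a thousand are non-zero, which is why a dense representation would be impractical at this size.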


In [10]:
feature_names = vect.get_feature_names_out()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 74849
First 20 features:
['00' '000' '0000000000001' '00001' '00015' '000s' '001' '003830' '006'
 '007' '0079' '0080' '0083' '0093638' '00am' '00pm' '00s' '01' '01pm' '02']
Features 20010 to 20030:
['dratted' 'draub' 'draught' 'draughts' 'draughtswoman' 'draw' 'drawback'
 'drawbacks' 'drawer' 'drawers' 'drawing' 'drawings' 'drawl' 'drawled'
 'drawling' 'drawn' 'draws' 'draza' 'dre' 'drea']
Every 2000th feature:
['00' 'aesir' 'aquarian' 'barking' 'blustering' 'bête' 'chicanery'
 'condensing' 'cunning' 'detox' 'draper' 'enshrined' 'favorit' 'freezer'
 'goldman' 'hasan' 'huitieme' 'intelligible' 'kantrowitz' 'lawful' 'maars'
 'megalunged' 'mostey' 'norrland' 'padilla' 'pincher' 'promisingly'
 'receptionist' 'rivals' 'schnaas' 'shunning' 'sparse' 'subset'
 'temptations' 'treatises' 'unproven' 'walkman' 'xylophonist']


Notice that the definition of "word" is currently very basic: any run of two or more letters, digits or underscores delimited by white space or punctuation. As a result, numbers, alternative word forms and typos all appear as different features.
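To get a feel for this definition, the pattern below is, to the best of my knowledge, the default `token_pattern` of `CountVectorizer`. The sample sentence is invented, and the lowercasing step is applied by hand to mimic the vectorizer's default.

```python
import re

# Assumed to match CountVectorizer's default token_pattern.
pattern = re.compile(r"(?u)\b\w\w+\b")

sample = "I rated it 10/10 at 00am: a greeeat film, realy!"
tokens = pattern.findall(sample.lower())
print(tokens)
# ['rated', 'it', '10', '10', 'at', '00am', 'greeeat', 'film', 'realy']
```

Single-character tokens such as "I" and "a" are dropped, while numbers and misspellings survive as features of their own.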

We can now try a classifier, using logistic regression on these features:

In [11]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.88


Not bad for such a simple model!

One simple improvement is to restrict the allowed features to words that appear in at least a minimum number of documents in the corpus. This helps to weed out typos and other uninformative text, and also reduces the size of the feature space, which is helpful in itself. Here we will set `min_df=5`:

In [12]:
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'
	with 3354014 stored elements in Compressed Sparse Row format>


In [14]:
feature_names = vect.get_feature_names_out()

print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))

First 50 features:
['00' '000' '007' '00s' '01' '02' '03' '04' '05' '06' '07' '08' '09' '10'
 '100' '1000' '100th' '101' '102' '103' '104' '105' '107' '108' '10s'
 '10th' '11' '110' '112' '116' '117' '11th' '12' '120' '12th' '13' '135'
 '13th' '14' '140' '14th' '15' '150' '15th' '16' '160' '1600' '16mm' '16s'
 '16th']
Features 20010 to 20030:
['repentance' 'repercussions' 'repertoire' 'repetition' 'repetitions'
 'repetitious' 'repetitive' 'rephrase' 'replace' 'replaced' 'replacement'
 'replaces' 'replacing' 'replay' 'replayable' 'replayed' 'replaying'
 'replays' 'replete' 'replica']
Every 700th feature:
['00' 'affections' 'appropriately' 'barbra' 'blurbs' 'butchered' 'cheese'
 'commitment' 'courts' 'deconstructed' 'disgraceful' 'dvds' 'eschews'
 'fell' 'freezer' 'goriest' 'hauser' 'hungary' 'insinuate' 'juggle'
 'leering' 'maelstrom' 'messiah' 'music' 'occasional' 'parking'
 'pleasantville' 'pronunciation' 'recipient' 'reviews' 'sas' 'shea'
 'sneers' 'steiger' 'swastika' 'thrusting' 't

In [15]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.88


The cross-validation score is essentially unchanged, but the number of features has been reduced to roughly a third of the original (from 74,849 to 27,271).

On the test dataset:

In [16]:
lr = LogisticRegression(max_iter=100000)
lr.fit(X_train, y_train)

X_test = vect.transform(text_test)

print("Test score: {:.2f}".format(lr.score(X_test, y_test)))

Test score: 0.86


### Stopwords

Sometimes we have a list of words that we do not want to use as features, for example because they do not add any information. We can eliminate these using the `stop_words` argument.
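The effect of a stop list can be sketched in a few lines of plain Python. The stop list below is a tiny hypothetical one chosen for the two example documents, not scikit-learn's built-in list, and the tokenizer is a simplified stand-in for the vectorizer's default.

```python
import re

# Hypothetical miniature stop list, for illustration only.
stop_words = {"the", "is", "he", "but", "to", "be", "himself"}

docs = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]


def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


# Drop every token that appears in the stop list.
filtered = [[tok for tok in tokenize(doc) if tok not in stop_words] for doc in docs]
for tokens in filtered:
    print(tokens)
```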

In [17]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword:
['hasnt', 'then', 'six', 'while', 'up', 'our', 'until', 'whence', 'nobody', 'may', 'seems', 'mill', 'whether', 'forty', 'made', 'less', 'by', 'down', 'nine', 'too', 'yourself', 'anywhere', 'anyone', 'no', 'before', 'but', 'you', 'onto', 'thereby', 'in', 'any', 'last']


In [18]:
# Specifying stop_words="english" uses the built-in list.
# We could also augment it and pass our own.
vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))

X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
	with 2149958 stored elements in Compressed Sparse Row format>


In [19]:
scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.88


Using this fixed list has not improved our model. However, if the dataset were much smaller, we might find that the use of stopwords helps to focus the model on more informative words.

A simple corpus-specific option is to use the `max_df` argument to eliminate words that appear in too large a share of the documents.
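Both `min_df` and `max_df` act on document frequency, that is, on the number of documents containing a word rather than on its total count. The following is a minimal sketch under that reading: `min_df` is treated as an absolute count and `max_df` as a proportion, the function names are hypothetical, and the four toy documents are invented.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # each word counts at most once per document
    return df


def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep words with min_df <= document frequency <= max_df * n_docs."""
    upper = max_df * len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    return sorted(word for word, count in df.items() if min_df <= count <= upper)


docs = [
    ["great", "film", "great", "cast"],
    ["boring", "film"],
    ["great", "film", "plot"],
    ["film", "typo0"],
]

print(filter_vocabulary(docs, min_df=2))               # rare words removed
print(filter_vocabulary(docs, min_df=2, max_df=0.75))  # ubiquitous "film" removed too
```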

### Rescaling with *tf-idf*

We can try to use the corpus itself to determine feature importance. One common approach is *term frequency-inverse document frequency* (tf-idf).

tf-idf for a word *w* in a document *d* is given by


\begin{equation*}
\text{tfidf}(w, d) = \text{tf} \cdot \left(\log\left(\frac{N + 1}{N_w + 1}\right) + 1\right)
\end{equation*}

where

*tf* is the number of times the word *w* appears in document *d*.

*N* is the total number of documents in the training set.

*N<sub>w</sub>* is the number of documents in the training set that contain *w*.
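A small worked computation may help. The sketch below uses the smoothed form tf · (log((N + 1) / (N_w + 1)) + 1), which to my understanding is what scikit-learn applies with its default settings, and leaves out the optional length normalisation (as with `norm=None`). The helper names and the hand-tokenized documents are illustrative.

```python
import math

# The two example documents, already tokenized and lowercased.
docs = [
    ["the", "fool", "doth", "think", "he", "is", "wise"],
    ["but", "the", "wise", "man", "knows", "himself", "to", "be", "fool"],
]


def idf(word, docs):
    """Smoothed inverse document frequency: log((N + 1) / (N_w + 1)) + 1."""
    n_docs = len(docs)
    n_w = sum(1 for doc in docs if word in doc)
    return math.log((n_docs + 1) / (n_w + 1)) + 1


def tfidf(word, doc, docs):
    """Raw term frequency in doc multiplied by the smoothed idf."""
    return doc.count(word) * idf(word, docs)


print(tfidf("fool", docs[0], docs))  # word found in every document
print(tfidf("doth", docs[0], docs))  # word found in only one document
```

A word that occurs in every document gets the minimum idf of 1, while a word confined to fewer documents is weighted up.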

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df=5, norm=None).fit(text_train)
X_train = vect.transform(text_train)

scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.87


This gives no improvement in performance, but rescaling the features in this way can make the model more interpretable:

In [22]:
max_value = X_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()
# get feature names
feature_names = np.array(vect.get_feature_names_out())

print("Features with lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:20]]))
print()
print("Features with highest tfidf: \n{}".format(feature_names[sorted_by_tfidf[-20:]]))


Features with lowest tfidf:
['poignant' 'disagree' 'instantly' 'importantly' 'lacked' 'occurred'
 'currently' 'altogether' 'nearby' 'undoubtedly' 'directs' 'fond'
 'stinker' 'avoided' 'emphasis' 'commented' 'disappoint' 'realizing'
 'downhill' 'inane']

Features with highest tfidf: 
['coop' 'homer' 'dillinger' 'hackenstein' 'gadget' 'taker' 'macarthur'
 'vargas' 'jesse' 'basket' 'dominick' 'the' 'victor' 'bridget' 'victoria'
 'khouri' 'zizek' 'rob' 'timon' 'titanic']


Features with low inverse document frequency are the most commonly occurring words, which presumably have low information content:

In [23]:
sorted_by_idf = np.argsort(vect.idf_)
print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))


Features with lowest idf:
['the' 'and' 'of' 'to' 'this' 'is' 'it' 'in' 'that' 'but' 'for' 'with'
 'was' 'as' 'on' 'movie' 'not' 'have' 'one' 'be' 'film' 'are' 'you' 'all'
 'at' 'an' 'by' 'so' 'from' 'like' 'who' 'they' 'there' 'if' 'his' 'out'
 'just' 'about' 'he' 'or' 'has' 'what' 'some' 'good' 'can' 'more' 'when'
 'time' 'up' 'very' 'even' 'only' 'no' 'would' 'my' 'see' 'really' 'story'
 'which' 'well' 'had' 'me' 'than' 'much' 'their' 'get' 'were' 'other'
 'been' 'do' 'most' 'don' 'her' 'also' 'into' 'first' 'made' 'how' 'great'
 'because' 'will' 'people' 'make' 'way' 'could' 'we' 'bad' 'after' 'any'
 'too' 'then' 'them' 'she' 'watch' 'think' 'acting' 'movies' 'seen' 'its'
 'him']


### Investigating the model

Because the features in *bag of words* are just natural language terms, the resulting models can be highly interpretable.


In [24]:
lr.coef_[0]

array([-0.1651748 , -0.0304902 , -0.11180878, ...,  0.18723533,
       -0.00373665, -0.10921337])

In [26]:
coefs = pd.Series(lr.coef_[0], vect.get_feature_names_out() )
coefs = coefs.sort_values(ascending=False)

In [27]:
coefs.head(10)

refreshing      1.676745
erotic          1.380814
wonderfully     1.368901
carrey          1.363813
funniest        1.327075
surprisingly    1.304696
flawless        1.296816
excellent       1.291702
vengeance       1.290208
appreciated     1.283186
dtype: float64

In [28]:
coefs.tail(10)

laughable        -1.518168
unfunny          -1.521221
boring           -1.544886
mess             -1.558348
awful            -1.729991
lacks            -1.801141
poorly           -1.932531
worst            -2.155337
disappointment   -2.162398
waste            -2.193415
dtype: float64
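To illustrate how such coefficients turn into a prediction, the sketch below adds up the coefficients of the words in a toy review and passes the sum through the logistic function. The coefficients are copied from the output above; the intercept is not shown in the text, so it is assumed to be zero, and every word outside this small dictionary is treated as having no weight. Both are simplifications.

```python
import math

# A few coefficients copied from the head/tail output above.
coefs = {
    "refreshing": 1.676745,
    "excellent": 1.291702,
    "boring": -1.544886,
    "awful": -1.729991,
    "waste": -2.193415,
}
intercept = 0.0  # assumption: the fitted intercept is not shown in the text


def positive_probability(tokens):
    """Sigmoid of the summed coefficients; repeated words count repeatedly."""
    z = intercept + sum(coefs.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-z))


print(positive_probability(["an", "excellent", "and", "refreshing", "film"]))
print(positive_probability(["boring", "waste", "of", "time"]))
```

Each word shifts the log-odds by its coefficient, which is what makes it possible to read the model directly from the sorted list.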

### n-Grams

One major problem with *bag of words* is that the meaning contained in __word order__ is completely lost. Compare the sentiment of

* She was happy not to be going to the party.
* She was not happy to be going to the party.

If we extend our features to __encompass two or more consecutive tokens,__ then we will have a chance to extract at least some meaning from word order. This may be very important for languages whose word boundaries are not so easily recognised as in English. 

This approach is called *n-grams*. We use a "sliding window" of a specified number of tokens.
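The two party sentences make the point concrete. In this sketch (helper names are hypothetical and the tokenizer is a simplified stand-in for the vectorizer's default), the single-word counts of the two sentences are identical, while their two-token windows differ.

```python
import re
from collections import Counter

a = "She was happy not to be going to the party."
b = "She was not happy to be going to the party."


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def ngrams(tokens, n):
    """Slide a window of n tokens over the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


same_unigram_counts = Counter(tokenize(a)) == Counter(tokenize(b))
only_a = sorted(set(ngrams(tokenize(a), 2)) - set(ngrams(tokenize(b), 2)))
only_b = sorted(set(ngrams(tokenize(b), 2)) - set(ngrams(tokenize(a), 2)))

print(same_unigram_counts)  # True
print(only_a)               # ['happy not', 'not to', 'was happy']
print(only_b)               # ['happy to', 'not happy', 'was not']
```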


`n=1` is just the same as *bag of words*:

In [29]:
print("bards_words:\n{}".format(bards_words))

bards_words:
['The fool doth think he is wise,', 'but the wise man knows himself to be a fool']


In [30]:
cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

Vocabulary size: 13
Vocabulary:
['be' 'but' 'doth' 'fool' 'he' 'himself' 'is' 'knows' 'man' 'the' 'think'
 'to' 'wise']


With *n* fixed at 2, we get different features:

In [31]:
cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

Vocabulary size: 14
Vocabulary:
['be fool' 'but the' 'doth think' 'fool doth' 'he is' 'himself to'
 'is wise' 'knows himself' 'man knows' 'the fool' 'the wise' 'think he'
 'to be' 'wise man']


In [32]:
print("Transformed data (dense):\n{}".format(cv.transform(bards_words).toarray()))

Transformed data (dense):
[[0 0 1 1 1 0 1 0 0 1 0 1 0 0]
 [1 1 0 0 0 1 0 1 1 0 1 0 1 1]]


Or we can specify a range for *n*:

In [33]:
cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

Vocabulary size: 39
Vocabulary:
['be' 'be fool' 'but' 'but the' 'but the wise' 'doth' 'doth think'
 'doth think he' 'fool' 'fool doth' 'fool doth think' 'he' 'he is'
 'he is wise' 'himself' 'himself to' 'himself to be' 'is' 'is wise'
 'knows' 'knows himself' 'knows himself to' 'man' 'man knows'
 'man knows himself' 'the' 'the fool' 'the fool doth' 'the wise'
 'the wise man' 'think' 'think he' 'think he is' 'to' 'to be' 'to be fool'
 'wise' 'wise man' 'wise man knows']


Notice how increasing the value of *n* will cause the number of features to explode rapidly.
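The vocabulary sizes above can be reproduced with a hand-written sliding window. This is a sketch with hypothetical helper names, assuming the same lowercasing and two-character minimum as the vectorizer's defaults, so the n-grams are formed after "a" has been dropped.

```python
import re

docs = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def ngrams(tokens, n):
    """Slide a window of n tokens over the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(docs, n_min, n_max):
    """Collect all distinct n-grams for n_min <= n <= n_max."""
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        for n in range(n_min, n_max + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)


for n_range in [(1, 1), (2, 2), (1, 3)]:
    print(n_range, len(vocabulary(docs, *n_range)))
```

Even on two short sentences, allowing up to three tokens triples the vocabulary compared with single words.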

Applying this to the IMDB data:

In [34]:
vect = CountVectorizer(min_df=5, ngram_range=(2, 2)).fit(text_train)
print("Vocabulary size: {}".format(len(vect.vocabulary_)))

Vocabulary size: 128166


In [35]:
X_train = vect.transform(text_train)

scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.88


In [36]:
lr = LogisticRegression(max_iter=100000)
lr.fit(X_train, y_train)

coefs = pd.Series(lr.coef_[0], vect.get_feature_names_out() )
coefs = coefs.sort_values(ascending=False)

In [37]:
coefs

definitely worth      1.380835
well worth            1.357000
highly recommended    1.298359
10 10                 1.284409
must see              1.233707
                        ...   
not even             -1.177808
than this            -1.265824
not worth            -1.399645
waste of             -1.755501
the worst            -2.363117
Length: 128166, dtype: float64

### Exercise

Try applying *bag of words* to the Spanish language paper reviews. 

Can you fit a linear regression to predict the `evaluation` score?

Apply some of the variations discussed above to see if you can improve cross-validation performance.

In [38]:
import json

# load the reviews with Python's json module
with open('../assets/reviews.json', 'r') as f:
    data = json.load(f)

reviews = pd.json_normalize(data, record_path=['review'])

In [39]:
reviews.keys()

Index(['confidence', 'evaluation', 'id', 'lan', 'orientation', 'remarks',
       'text', 'timespan'],
      dtype='object')

In [40]:
reviews.head()

|   | confidence | evaluation | id | lan | orientation | remarks | text | timespan |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 1 | es | 0 |  | - El artículo aborda un problema contingente y... | 2010-07-05 |
| 1 | 4 | 1 | 2 | es | 1 |  | El artículo presenta recomendaciones prácticas... | 2010-07-05 |
| 2 | 5 | 1 | 3 | es | 1 |  | - El tema es muy interesante y puede ser de mu... | 2010-07-05 |
| 3 | 4 | 2 | 1 | es | 1 |  | Se explica en forma ordenada y didáctica una e... | 2010-07-05 |
| 4 | 4 | 2 | 2 | es | 0 |  |  | 2010-07-05 |


In [41]:
reviews_es = reviews.query('lan == "es"')

In [42]:
reviews_es.text

0      - El artículo aborda un problema contingente y...
1      El artículo presenta recomendaciones prácticas...
2      - El tema es muy interesante y puede ser de mu...
3      Se explica en forma ordenada y didáctica una e...
4                                                       
                             ...                        
400    El trabajo pretende ofrecer una visión del uso...
401    El paper está bien escrito y de fácil lectura....
402    Observación de fondo:  No se presenta un ejemp...
403    Se propone un procedimiento para elaborar máqu...
404    El artículo describe básicamente los component...
Name: text, Length: 388, dtype: object

In [43]:
from sklearn.model_selection import train_test_split

text_train, text_test, y_train, y_test = train_test_split(reviews_es.text, reviews_es.evaluation, random_state=1)

In [77]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorizer = CountVectorizer(min_df=1,token_pattern=r'(?u)\b[^_\d\W][^_\d\W]+\b').fit(text_train)
vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r'(?u)\b[^_\d\W][^_\d\W]+\b').fit(text_train)
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))
# print("Vocabulary content:\n {}".format(vectorizer.vocabulary_))

Vocabulary size: 65826


In [78]:
X_train = vectorizer.transform(text_train)
print("X_train: {}".format(repr(X_train)))

X_train: <291x65826 sparse matrix of type '<class 'numpy.int64'>'
	with 109348 stored elements in Compressed Sparse Row format>


In [79]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.37


In [80]:
lr = LogisticRegression(max_iter=100000)
lr.fit(X_train, y_train)

X_test = vectorizer.transform(text_test)

print("Test score: {:.2f}".format(lr.score(X_test, y_test)))

Test score: 0.36
