# Machine Learning with Python

In [6]:
import numpy as np
import matplotlib.pyplot as plt

In [1]:
from sklearn.datasets import load_files

reviews_train = load_files("../assets/imdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

reviews_test = load_files("../assets/imdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

## 3.3 Unsupervised Approaches

There are a number of unsupervised approaches in NLP which can give us useful information about the substructure of a corpus, or suggest relevance rankings.

### Topic Modelling

The idea of *topic modelling* is to assign each document to one or more topic classes. For example, news data can usually be assigned to *politics*, *sport*, *finance*, etc.

The *Latent Dirichlet Allocation* (LDA) approach models each document as a weighted *mixture* of topics. Topics are derived from groups of words that tend to occur together. For example, "MP", "minister" and "vote" will tend to be found together, while "goal", "team" and "score" will tend to be found together. Because this is an unsupervised approach, the topics obtained may not coincide with our understanding of the relevant groups within the corpus. However, LDA can still be a useful way to reduce the dimensionality of the data for further exploration.


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=10000, max_df=.15)
X = vect.fit_transform(text_train)

In [3]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,learning_method="batch",
                                max_iter=25, random_state=0)
# We build the model and transform the data in one step
# Computing transform takes some time,
# and we can save time by doing both at once
document_topics = lda.fit_transform(X)

In [4]:
print("lda.components_.shape: {}".format(lda.components_.shape))

lda.components_.shape: (10, 10000)


In [7]:
# for each topic (a row in the components_), sort the features (ascending).
# Invert rows with [:, ::-1] to make sorting descending
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
# get the feature names from the vectorizer:
feature_names = np.array(vect.get_feature_names_out())

Here are the top-weighted features for the ten topics identified:

In [8]:
for topic_id in range(len(sorting)):
    features = feature_names[sorting[topic_id]]
    print( "topic", topic_id, features[0:5] )


topic 0 ['between' 'young' 'family' 'real' 'performance']
topic 1 ['war' 'world' 'us' 'our' 'american']
topic 2 ['funny' 'worst' 'comedy' 'thing' 'guy']
topic 3 ['show' 'series' 'episode' 'tv' 'episodes']
topic 4 ['didn' 'saw' 'am' 'thought' 'years']
topic 5 ['horror' 'action' 'effects' 'budget' 'nothing']
topic 6 ['kids' 'action' 'animation' 'game' 'fun']
topic 7 ['cast' 'role' 'john' 'version' 'novel']
topic 8 ['performance' 'role' 'john' 'actor' 'oscar']
topic 9 ['house' 'woman' 'gets' 'killer' 'girl']


We can use a larger number of topics to get finer resolution on the substructure of the corpus.

When little labelled training data is available, transforming documents into the lower-dimensional LDA feature space may be a helpful preprocessing step for supervised learning.
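
As a rough illustration of this idea, the sketch below chains a `CountVectorizer`, an LDA model and a logistic regression classifier into a single scikit-learn pipeline. The six-sentence corpus, its labels and the choice of two topics are invented purely for illustration; they are not taken from the IMDb data, and nothing here should be read as a tuned setup.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only (1 = positive, 0 = negative)
toy_docs = [
    "great acting and a wonderful story",
    "wonderful cast and great direction",
    "a great story with wonderful acting",
    "terrible plot and awful acting",
    "awful script and a terrible ending",
    "boring terrible film with an awful plot",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Counts -> topic mixture -> classifier
pipe = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
pipe.fit(toy_docs, toy_labels)

# The classifier only ever sees the 2-dimensional topic mixture of each document
topic_features = pipe[:-1].transform(toy_docs)
print(topic_features.shape)

predictions = pipe.predict(["a wonderful story", "an awful plot"])
print(predictions)
```

Each row of `topic_features` is a document's topic mixture, so the classifier works with two features per document instead of one per vocabulary word. Whether this helps in practice depends on the data, so it is worth comparing against a plain bag-of-words baseline.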


In [9]:
# sort documents by the weight of topic 7 (top words: cast, role, john, version, novel)
topic_7 = np.argsort(document_topics[:, 7])[::-1]
# print the five documents where the topic is most important
for i in topic_7[:5]:
    # show first two sentences
    print(b".".join(text_train[i].split(b".")[:2]) + b".\n")

b'In late 1800s San Francisco, poor well-dressed Errol Flynn (as James J. Corbett) works at a bank, and enjoys attending local "fights" (boxing) with co-worker and drinking buddy Jack Carson (as Walter Lowrie).\n'
b'"The Ex-Mrs. Bradford" (1936), starring Thin Man series star William Powell (this film was released the same year as the second Thin Man film, "After The Thin Man," comes very close to duplicating the fun and style of the Thin Man films, but it nonetheless misses.\n'
b"The year is 1896.Jeff Webster (James Stewart) doesn't like people.\n"
b'Josef Von Sternberg directs this magnificent silent film about silent Hollywood and the former Imperial General to the Czar of Russia who has found himself there. Emil Jannings won a well-deserved Oscar, in part, for his role as the general who ironically is cast in a bit part in a silent picture as a Russian general.\n'
b"Sidney Stratton is having trouble maintaining jobs at various textile mills mainly because of his experimentation in 

### Finding Similar Documents

We can imagine using a distance metric within the *bag of words*, *n-grams* or LDA feature space to find documents that are "similar" to a given input document.

However, these features are poor at capturing the *semantic* similarity between documents. Consider the following two sentences, which mean nearly the same thing but share no content words:

* The cat sat on the mat
* The feline rested on the rug
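
To make the problem concrete, here is a minimal sketch that computes the cosine similarity between the bag-of-words count vectors of these two sentences. The helper `bow_cosine` and the two-word stop list are assumptions made for this example, not part of any library.

```python
import math
from collections import Counter


def bow_cosine(a, b, stop_words=frozenset()):
    """Cosine similarity between the bag-of-words count vectors of two strings."""
    counts_a = Counter(w for w in a.lower().split() if w not in stop_words)
    counts_b = Counter(w for w in b.lower().split() if w not in stop_words)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = "The cat sat on the mat"
s2 = "The feline rested on the rug"

raw = bow_cosine(s1, s2)                                     # approximately 0.625
content_only = bow_cosine(s1, s2, stop_words={"the", "on"})  # 0.0
print(raw, content_only)
```

All of the apparent similarity comes from the shared function words "the" and "on". Once those are removed, the two sentences look completely unrelated, even though a human reader would consider them near-paraphrases.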


`word2vec` is a very popular approach to this problem that is based on neural networks. Using a large unannotated plain text corpus, `word2vec` constructs a *continuous* feature space with a predefined dimensionality, within which individual words are located. Words with similar semantic meaning are located close together in this feature space. More remarkably, certain linear relationships can be identified:

vec("king") - vec("man") + vec("woman") ≈ vec("queen")
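
The arithmetic behind such an analogy can be sketched with a hand-made embedding. The two-dimensional vectors below are hypothetical and chosen so that one axis loosely stands for "royalty" and the other for "gender"; real embeddings are learned from text and have no such clean axes.

```python
import numpy as np

# Hypothetical 2-d embedding: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)  # queen
```

With a loaded `gensim` `KeyedVectors` model, the corresponding query is `model.most_similar(positive=["king", "woman"], negative=["man"])`, which likewise leaves the query words out of the results.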

We can use the `gensim` package to explore some of the possibilities of this approach.

https://radimrehurek.com/gensim/index.html

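Firstly, we will download a pre-trained model with 100 dimensions. Note that this is a GloVe model, a word-embedding method closely related to `word2vec`, trained on the 2014 Wikipedia dump and the Gigaword 5 news corpus; its vocabulary contains 400,000 words.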

In [10]:
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")




In [11]:
model

<gensim.models.keyedvectors.KeyedVectors at 0x7f170f11dd80>

The model gives the vector representation of any word in its vocabulary:

In [12]:
model['potato']

array([-0.59242  ,  0.64671  , -0.17863  , -0.32955  , -0.33752  ,
       -0.3927   ,  1.5196   , -0.51104  ,  0.11168  , -0.36983  ,
       -0.99754  ,  0.097873 , -0.057832 ,  0.72555  , -0.071104 ,
        0.57666  ,  0.23989  , -0.090268 , -0.15606  , -0.077584 ,
       -0.77263  ,  0.53222  ,  0.91333  , -0.54726  ,  0.2099   ,
        1.0181   ,  0.068781 , -0.073462 , -0.62079  , -0.13303  ,
       -0.29637  ,  0.70528  ,  0.72128  , -0.73807  , -0.12144  ,
        0.74609  , -0.062772 ,  0.12818  ,  0.56983  , -0.60158  ,
        0.09011  , -1.3489   , -0.2133   , -0.80586  ,  0.33789  ,
        0.14686  ,  0.057557 ,  0.12637  , -0.41219  , -0.20943  ,
       -0.87134  , -0.1932   ,  0.15099  ,  0.57204  , -1.0344   ,
       -0.16244  ,  0.15535  ,  0.51579  ,  0.55834  ,  0.2793   ,
        0.77384  ,  0.91842  ,  0.13026  , -0.19378  ,  0.86953  ,
       -0.81084  ,  0.26939  , -0.39627  ,  0.42148  , -0.56352  ,
       -0.19354  , -0.0036075,  0.40456  ,  0.14934  , -0.5237

In [13]:
model['potato'].shape

(100,)

The model can also list the words closest to a given word, ranked by cosine similarity.

In [14]:
model.most_similar("potato")

[('potatoes', 0.778667151927948),
 ('peanut', 0.7455057501792908),
 ('tomato', 0.745377779006958),
 ('bread', 0.736748218536377),
 ('cheese', 0.7155595421791077),
 ('baked', 0.7084025740623474),
 ('pumpkin', 0.7009469866752625),
 ('fried', 0.7007490396499634),
 ('cabbage', 0.6885643005371094),
 ('bean', 0.6814960837364197)]

By embedding whole documents into this space, for example by averaging the vectors of their words, document similarity can be computed in the same way.
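
One simple way to embed a document is to average the vectors of its words and compare documents by cosine similarity. The sketch below uses a hypothetical three-dimensional toy embedding so that it runs without downloading anything. With the `gensim` model loaded above, one would look words up with `model[word]` and check membership against `model.key_to_index` instead.

```python
import numpy as np

# Hypothetical toy embedding; the values are invented for illustration
toy_embedding = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "feline": np.array([0.8, 0.2, 0.0]),
    "mat":    np.array([0.1, 0.9, 0.0]),
    "rug":    np.array([0.2, 0.8, 0.0]),
    "sat":    np.array([0.0, 0.2, 0.9]),
    "rested": np.array([0.0, 0.3, 0.8]),
    "stock":  np.array([-0.7, 0.0, -0.6]),
    "prices": np.array([-0.8, -0.1, -0.5]),
    "fell":   np.array([-0.2, -0.9, 0.1]),
}


def embed_document(text, embedding):
    """Average the vectors of all in-vocabulary words; return None if there are none."""
    vectors = [embedding[w] for w in text.lower().split() if w in embedding]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


doc_a = embed_document("The cat sat on the mat", toy_embedding)
doc_b = embed_document("The feline rested on the rug", toy_embedding)
doc_c = embed_document("Stock prices fell", toy_embedding)

sim_ab = cosine(doc_a, doc_b)  # near-paraphrases: close to 1
sim_ac = cosine(doc_a, doc_c)  # unrelated topic: much lower
print(sim_ab, sim_ac)
```

Averaging ignores word order and gives every word equal weight, so it is only a baseline. Weighting words by tf-idf or using a dedicated document-embedding model are common refinements.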