# Text mining

We show how to use scikit-network for text mining, taking as an example the novel [Les Misérables](https://en.wikipedia.org/wiki/Les_Misérables) by Victor Hugo (Project Gutenberg edition, translated by Isabel F. Hapgood). By building the bipartite graph between paragraphs and words, we can embed both words and paragraphs in the same vector space and compute cosine similarities between them.

Each word is kept as it appears in the text, after lowercasing and removing punctuation; more advanced [tokenizers](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) can be used instead.
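As one possible refinement, a regular-expression tokenizer can keep internal apostrophes and hyphens, so that a form such as "o'clock" stays whole instead of becoming "o clock" as it does with the punctuation removal used in this notebook. This is only a minimal sketch: the helper `tokenize` and its pattern are illustrative choices and are not part of scikit-network.

```python
import re

# Hypothetical pattern: runs of letters (accented letters included),
# optionally joined by a single apostrophe or hyphen.
TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Return the lowercase word tokens of a string."""
    return TOKEN.findall(text.lower())


print(tokenize("Up since five o'clock, at the barrier of l'Étoile!"))
```

Digits and underscores are dropped by this pattern; whether that is desirable depends on the corpus.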

Other graphs can be considered, like the graph of co-occurrence of words within a window of 5 words, or the graph of chapters and words. These graphs can be combined to get richer information and better embeddings.
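The co-occurrence graph mentioned above is not built in this notebook. A minimal sketch of how its edge weights could be counted, using only the standard library, is given here; the function name `cooccurrence_counts` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count unordered pairs of distinct words that fall in the same window.

    Two tokens co-occur when they are at most `window - 1` positions apart.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        # Tokens that follow `word` inside the same sliding window.
        for other in tokens[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


tokens = "the bishop gave the silver to the convict".split()
counts = cooccurrence_counts(tokens, window=3)
print(counts.most_common(3))
```

Each key is a sorted pair of words and each value is the weight of the corresponding edge, so the result could then be converted into a weighted adjacency matrix.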

In [1]:
from re import sub

In [2]:
import numpy as np

In [3]:
from sknetwork.data import from_adjacency_list
from sknetwork.embedding import Spectral

## Load data

In [4]:
filename = 'miserables-en.txt'

In [5]:
with open(filename, 'r', encoding='utf-8') as f:
    text = f.read()

In [6]:
len(text)

3254333

In [7]:
print(text[:494])

﻿The Project Gutenberg EBook of Les Misérables, by Victor Hugo

This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.org


Title: Les Misérables
       Complete in Five Volumes

Author: Victor Hugo

Translator: Isabel F. Hapgood

Release Date: June 22, 2008 [EBook #135]
Last Updated: January 18, 2016




## Pre-processing

In [8]:
# extract main text
main = text.split('LES MISÉRABLES')[-2].lower()

In [9]:
len(main)

3215017

In [10]:
# remove punctuation
main = sub(r"[,.;:()@#?!&$'_*]", " ", main)
main = sub(r'["-]', ' ', main)

In [11]:
# extract paragraphs
sep = '|||'
main = sub(r'\n\n+', sep, main)
main = sub('\n', ' ', main)
paragraphs = main.split(sep)

In [12]:
len(paragraphs)

13499

In [13]:
paragraphs[1000]

'after leaving the asses there was a fresh delight  they crossed the seine in a boat  and proceeding from passy on foot they reached the barrier of l étoile  they had been up since five o clock that morning  as the reader will remember  but  bah  there is no such thing as fatigue on sunday   said favourite   on sunday fatigue does not work  '

## Build graph

In [14]:
paragraph_words = [paragraph.split(' ') for paragraph in paragraphs]

In [15]:
graph = from_adjacency_list(paragraph_words, bipartite=True)

In [16]:
biadjacency = graph.biadjacency
words = graph.names_col

In [17]:
biadjacency

<13499x23093 sparse matrix of type '<class 'numpy.int64'>'
	with 416331 stored elements in Compressed Sparse Row format>

In [18]:
len(words)

23093
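To make the structure of the biadjacency matrix concrete, the following sketch builds a small dense count matrix from two toy paragraphs in plain NumPy. It assumes that repeated words within a paragraph are summed into edge weights, which is consistent with the integer matrix above but should be checked against the `from_adjacency_list` documentation. All variable names are illustrative.

```python
import numpy as np

toy_paragraphs = [["on", "sunday", "fatigue", "does", "not", "work"],
                  ["on", "sunday", "on", "foot"]]

# Column index of each distinct word, in sorted order.
vocabulary = sorted({word for paragraph in toy_paragraphs for word in paragraph})
column = {word: j for j, word in enumerate(vocabulary)}

# Entry (i, j) counts the occurrences of word j in paragraph i.
counts = np.zeros((len(toy_paragraphs), len(vocabulary)), dtype=int)
for i, paragraph in enumerate(toy_paragraphs):
    for word in paragraph:
        counts[i, column[word]] += 1

print(vocabulary)
print(counts)
print(counts.sum(axis=1))  # number of tokens per paragraph
print(counts.sum(axis=0))  # number of occurrences per word
```

Under that assumption, the row sums play the role of `paragraph_lengths` and the column sums the role of `word_counts` in the statistics that follow.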

## Statistics

In [19]:
n_row, n_col = biadjacency.shape

In [20]:
paragraph_lengths = biadjacency.dot(np.ones(n_col))

In [21]:
np.quantile(paragraph_lengths, [0.1, 0.5, 0.9, 0.99])

array([  6.,  23., 127., 379.])

In [22]:
word_counts = biadjacency.T.dot(np.ones(n_row))

In [23]:
np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])

array([  1.  ,   2.  ,  23.  , 282.08])

## Embedding

In [24]:
dimension = 50
spectral = Spectral(dimension, regularization=100)

In [25]:
spectral.fit(biadjacency)

Spectral(n_components=50, decomposition='rw', regularization=100, normalized=True)

In [26]:
embedding_paragraph = spectral.embedding_row_
embedding_word = spectral.embedding_col_
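For intuition about how rows and columns can end up in the same space, here is a simplified stand-in for the spectral embedding: a singular value decomposition of the degree-normalized biadjacency matrix, computed with NumPy on a toy matrix. It is not the `Spectral` implementation (there is no regularization and the handling of the components is an assumption), and the function name `co_embed` is hypothetical.

```python
import numpy as np


def co_embed(biadjacency, dimension):
    """Embed rows and columns of a non-negative matrix via a normalized SVD."""
    b = np.asarray(biadjacency, dtype=float)
    d_row = b.sum(axis=1)
    d_col = b.sum(axis=0)
    # Symmetric degree normalization; assumes no empty row or column.
    normalized = b / np.sqrt(np.outer(d_row, d_col))
    u, s, vt = np.linalg.svd(normalized, full_matrices=False)
    # Skip the first singular pair (trivial for a connected graph)
    # and keep the next `dimension` ones.
    rows = u[:, 1:dimension + 1]
    cols = vt[1:dimension + 1].T
    # Project onto the unit sphere so that dot products are cosine similarities.
    rows = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    cols = cols / np.linalg.norm(cols, axis=1, keepdims=True)
    return rows, cols


# Toy matrix: rows 0-1 mostly use columns 0-1, rows 2-3 mostly use columns 2-3.
toy = np.array([[3, 2, 0, 0],
                [2, 3, 1, 0],
                [0, 0, 3, 2],
                [0, 0, 2, 4]])
rows, cols = co_embed(toy, dimension=2)
print(np.round(rows @ cols.T, 2))  # cosine between each row and each column
```

Because the left and right singular vectors come from the same decomposition, a row and a column that are strongly connected tend to point in similar directions, which is what makes cross-comparisons between paragraphs and words meaningful.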

In [27]:
# index of a query word
i = int(np.flatnonzero(words == 'love')[0])

In [28]:
# most similar words (with normalized=True the embedding vectors have unit norm, so dot products are cosines)
cosines_word = embedding_word.dot(embedding_word[i])
words[np.argsort(-cosines_word)[:20]]

array(['love', 'kiss', 'ye', 'celestial', 'hearts', 'loved', 'tender',
       'roses', 'joys', 'sweet', 'wedded', 'charming', 'angelic', 'adore',
       'aurora', 'pearl', 'voluptuousness', 'chaste', 'innumerable',
       'heart'], dtype='<U21')

In [29]:
np.quantile(cosines_word, [0.01, 0.1, 0.5, 0.9, 0.99])

array([-0.24307366, -0.14047851, -0.02607974,  0.14319717,  0.42843234])

In [30]:
# some paragraph
i = 1000
print(paragraphs[i])

after leaving the asses there was a fresh delight  they crossed the seine in a boat  and proceeding from passy on foot they reached the barrier of l étoile  they had been up since five o clock that morning  as the reader will remember  but  bah  there is no such thing as fatigue on sunday   said favourite   on sunday fatigue does not work  


In [31]:
# most similar paragraphs
cosines_paragraph = embedding_paragraph.dot(embedding_paragraph[i])
for j in np.argsort(-cosines_paragraph)[:3]:
    print(paragraphs[j])
    print()

after leaving the asses there was a fresh delight  they crossed the seine in a boat  and proceeding from passy on foot they reached the barrier of l étoile  they had been up since five o clock that morning  as the reader will remember  but  bah  there is no such thing as fatigue on sunday   said favourite   on sunday fatigue does not work  

he was a man of lofty stature  half peasant  half artisan  he wore a huge leather apron  which reached to his left shoulder  and which a hammer  a red handkerchief  a powder horn  and all sorts of objects which were upheld by the girdle  as in a pocket  caused to bulge out  he carried his head thrown backwards  his shirt  widely opened and turned back  displayed his bull neck  white and bare  he had thick eyelashes  enormous black whiskers  prominent eyes  the lower part of his face like a snout  and besides all this  that air of being on his own ground  which is indescribable 

this was the state which the shepherd idyl  begun at five o clock in t

In [32]:
np.quantile(cosines_paragraph, [0.01, 0.1, 0.5, 0.9, 0.99])


array([-0.30671191, -0.17309593, -0.00319729,  0.21574375,  0.45969887])
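Since words and paragraphs share the same vector space, the same dot product can also rank paragraphs against a word, a query that the cells above do not run. The sketch below uses small made-up vectors in place of `embedding_word` and `embedding_paragraph`; the helper `top_k` is hypothetical.

```python
import numpy as np


def top_k(query, candidates, k=3):
    """Return the indices of the k candidates closest to `query`, and all cosines."""
    # Normalize explicitly so the result is a cosine even for non-unit vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    cosines = candidates.dot(query)
    return np.argsort(-cosines)[:k], cosines


# Made-up 2-D embeddings standing in for the real ones.
toy_word = np.array([1.0, 0.0])
toy_paragraphs = np.array([[0.0, 1.0],
                           [0.8, 0.6],
                           [-1.0, 0.0],
                           [0.6, -0.8]])
indices, cosines = top_k(toy_word, toy_paragraphs, k=2)
print(indices, cosines[indices])
```

With the notebook's variables, a call such as `top_k(embedding_word[j], embedding_paragraph)` would return the paragraphs closest to the word at index `j`.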