# Text mining

We show how to use scikit-network for text mining, taking as an example the novel [Les Misérables](https://en.wikipedia.org/wiki/Les_Misérables) by Victor Hugo (Project Gutenberg edition, translated by Isabel F. Hapgood). By building the bipartite graph between paragraphs and words, we can embed both words and paragraphs in the same vector space and compute cosine similarities between them.

Words are taken as they appear in the original text, after lowercasing, removing punctuation and splitting on spaces; more advanced [tokenizers](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) can be used instead.
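
As a minimal sketch of a slightly more advanced tokenizer, the regular expression below keeps runs of letters, optionally joined by an apostrophe or a hyphen, and drops digits and punctuation. The helper `tokenize` and its pattern are illustrative assumptions, not part of scikit-network.

```python
import re

# Hypothetical helper, not part of scikit-network.
# A token is a run of letters, optionally joined by apostrophes or hyphens.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Return the lowercase tokens of text, in order of appearance."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Jean Valjean's well-known story, told in 1862!")
print(tokens)
```

Unlike the pre-processing used in this notebook, this keeps contractions and hyphenated words as single tokens; whether that is desirable depends on the application.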

Other graphs can be considered, such as the co-occurrence graph of words within a window of 5 words, or the graph of chapters and words. These graphs can be combined to get richer information and better embeddings.
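
The co-occurrence graph can be sketched as follows, assuming that two words are linked whenever they appear within the same sliding window and that the edge weight is the number of such co-occurrences. The function `cooccurrence_counts` is a hypothetical helper written for illustration only.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count unordered pairs of distinct words at most window - 1 positions apart."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so that each pair of positions is counted once.
        for other in tokens[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


tokens = "the bishop gave the silver to the man".split()
counts = cooccurrence_counts(tokens, window=3)
print(counts.most_common(3))
```

Each key of `counts` is a pair of words and each value is a weight, so the result can be read as a weighted edge list of the co-occurrence graph.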

In [None]:
import numpy as np

In [None]:
import re

In [None]:
from sknetwork.data import convert_edge_list
from sknetwork.embedding import Spectral
from sknetwork.linalg import normalize

## Load data

In [None]:
filename = 'miserables-en.txt'

In [None]:
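# the text contains non-ASCII characters such as 'É'; the file is assumed to be UTF-8 encoded
with open(filename, 'r', encoding='utf-8') as f:
    text = f.read()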

In [None]:
len(text)

In [None]:
print(text[:494])

## Pre-processing

In [None]:
# extract main text: keep the second-to-last chunk delimited by the title, in lower case
main = text.split('LES MISÉRABLES')[-2].lower()

In [None]:
len(main)

In [None]:
# remove punctuation
main = re.sub(r"[,.;:()@#?!&$'_*]", " ", main)
main = re.sub(r'["-]', ' ', main)

In [None]:
# extract paragraphs
sep = '|||'
main = re.sub(r'\n\n+', sep, main)
main = re.sub('\n', ' ', main)
paragraphs = main.split(sep)
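
The pre-processing steps above can be checked on a small string. The sketch below is a compact variant, not the exact code of this notebook: the helper `split_paragraphs` is hypothetical, it additionally collapses repeated spaces, and it drops empty paragraphs.

```python
import re

# Same punctuation characters as in the cells above, merged into one pattern.
PUNCTUATION = re.compile("[,.;:()@#?!&$'_*\"-]")


def split_paragraphs(raw):
    """Lowercase, strip punctuation and split on blank lines."""
    cleaned = PUNCTUATION.sub(" ", raw.lower())
    # One or more blank lines separate paragraphs.
    blocks = re.split(r"\n\n+", cleaned)
    # Single newlines and repeated spaces inside a paragraph become one space.
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = "Hello, world!\nSecond line.\n\n\nNew paragraph - here."
toy_paragraphs = split_paragraphs(sample)
print(toy_paragraphs)
```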

In [None]:
len(paragraphs)

In [None]:
paragraphs[1000]

## Build graph

In [None]:
paragraph_words = [paragraph.split(' ') for paragraph in paragraphs]

In [None]:
graph = convert_edge_list(paragraph_words, bipartite=True)

In [None]:
biadjacency = graph.biadjacency
words = graph.names_col
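
To make the structure of the biadjacency matrix concrete, here is a toy version built with NumPy only. It assumes that rows are paragraphs, columns are words and entries count occurrences; the column order chosen here (sorted vocabulary) is an assumption and may differ from the one used by the library.

```python
import numpy as np

# Toy corpus; each paragraph is already split into words.
toy_paragraphs = [
    ["the", "bishop", "the"],
    ["the", "convict"],
    ["bishop", "convict", "candlesticks"],
]

# Index the vocabulary in sorted order (illustrative choice).
vocabulary = sorted({word for paragraph in toy_paragraphs for word in paragraph})
index = {word: j for j, word in enumerate(vocabulary)}

# Dense paragraph-by-word count matrix; the real matrix is stored in sparse format.
toy_biadjacency = np.zeros((len(toy_paragraphs), len(vocabulary)), dtype=int)
for i, paragraph in enumerate(toy_paragraphs):
    for word in paragraph:
        toy_biadjacency[i, index[word]] += 1

print(vocabulary)
print(toy_biadjacency)
```

Under these assumptions, row sums give the paragraph lengths and column sums give the total number of occurrences of each word.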

In [None]:
biadjacency

In [None]:
len(words)

## Statistics

In [None]:
n_row, n_col = biadjacency.shape

In [None]:
paragraph_lengths = biadjacency.dot(np.ones(n_col))

In [None]:
np.quantile(paragraph_lengths, [0.1, 0.5, 0.9, 0.99])

In [None]:
word_counts = biadjacency.T.dot(np.ones(n_row))

In [None]:
np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])

## Embedding

In [None]:
dimension = 50
spectral = Spectral(dimension, regularization=100)

In [None]:
spectral.fit(biadjacency)

In [None]:
embedding_paragraph = spectral.embedding_row_
embedding_word = spectral.embedding_col_

In [None]:
# index of some word (first match; avoids converting a 2D array to int)
i = int(np.flatnonzero(words == 'love')[0])

In [None]:
# most similar words (dot products are cosine similarities if the embedding vectors have unit norm)
cosines_word = embedding_word.dot(embedding_word[i])
words[np.argsort(-cosines_word)[:20]]

In [None]:
np.quantile(cosines_word, [0.01, 0.1, 0.5, 0.9, 0.99])
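
The dot products above are cosine similarities only if the embedding vectors have unit norm. To our understanding, `Spectral` normalizes its embedding by default, but this should be checked against the documentation of the installed version. The sketch below, with a hypothetical helper `cosine_similarities` and a made-up toy embedding, shows the explicit computation when normalization cannot be assumed.

```python
import numpy as np


def cosine_similarities(embedding, i):
    """Cosine similarity between row i and every row of embedding."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    # Guard against zero vectors to avoid division by zero.
    unit = embedding / np.where(norms > 0, norms, 1)
    return unit @ unit[i]


# Made-up 2D embedding of four items, for illustration only.
toy_embedding = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
similarities = cosine_similarities(toy_embedding, 0)
ranking = np.argsort(-similarities)
print(similarities)
print(ranking)
```

Rows 0 and 1 point in the same direction and get similarity 1 despite their different norms, which is the behavior wanted when comparing short and long paragraphs.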

In [None]:
# some paragraph
i = 1000
print(paragraphs[i])

In [None]:
# most similar paragraphs
cosines_paragraph = embedding_paragraph.dot(embedding_paragraph[i])
for j in np.argsort(-cosines_paragraph)[:3]:
    print(paragraphs[j])
    print()

In [None]:
np.quantile(cosines_paragraph, [0.01, 0.1, 0.5, 0.9, 0.99])