# text_wrangler API

1. [Constructor](#constructor)
2. [Sentence Matrix](#sentence_matrix)
3. [Vocabulary](#vocabulary)
4. [Stopwords](#stopwords)
5. [Filtered Vocabulary](#filtered_vocabulary)
6. [Word2Vec Demo](#word2vec_demo)

In [1]:
import os
os.chdir("..")

In [2]:
from src.graph.Graph import UndirectedGraph
from src.text.text_wrangler import Corpus

## 1 Constructor <a name="constructor"></a>
Loads a text corpus from a file path into a `text_wrangler.Corpus` object.

In [3]:
shakespeare = Corpus("docs/shakespeare.txt")

## 2 Sentence Matrix <a name="sentence_matrix"></a>
A matrix in which each row is one tokenized sentence, stored as a list of lowercase word tokens. This is the input format that the Word2Vec model expects.

In [10]:
print(len(shakespeare.sentence_matrix))
print(shakespeare.sentence_matrix[200])

99624
['ah', 'wherefore', 'with', 'infection', 'should', 'he', 'live', 'and', 'with', 'his', 'presence', 'grace', 'impiety', 'that', 'sin', 'by', 'him', 'advantage', 'should', 'achieve', 'and', 'lace', 'it', 'self', 'with', 'his', 'society']
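How `Corpus` builds this matrix is not shown here. As a rough sketch of the idea, assuming sentences are split on terminal punctuation and tokens are lowercase alphabetic runs, a hypothetical helper (`build_sentence_matrix`, not part of text_wrangler) might look like this:

```python
import re


def build_sentence_matrix(text):
    """Hypothetical helper: split raw text into sentences, then into word tokens."""
    # Assumption: a sentence ends at one or more of '.', '!' or '?'
    sentences = re.split(r"[.!?]+", text)
    matrix = []
    for sentence in sentences:
        # Assumption: a token is a run of letters, lowercased
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if tokens:  # skip empty fragments left over from the split
            matrix.append(tokens)
    return matrix


sample = "Ah! Wherefore with infection should he live? And lace it self with his society."
matrix = build_sentence_matrix(sample)
print(len(matrix))
print(matrix[1])
```

This simplified tokenizer splits contractions at the apostrophe and ignores abbreviations, so a real implementation would likely need more careful rules.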


## 3 Vocabulary <a name="vocabulary"></a>
A set containing every unique word in the corpus.

In [5]:
list(shakespeare.vocab)[:10]

['forgave',
 'openness',
 'irremovable',
 'strangles',
 'napkin',
 'bencher',
 'mural',
 'truepenny',
 'malkin',
 'unhallow']

In [6]:
len(shakespeare.vocab)

22444
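Conceptually, a vocabulary like this can be derived by flattening the sentence matrix into a set. The sketch below uses a hypothetical `build_vocab` function and a toy matrix; it illustrates the idea and is not necessarily how `Corpus` computes `vocab`.

```python
def build_vocab(sentence_matrix):
    """Hypothetical helper: collect every distinct token across all sentences."""
    return {token for sentence in sentence_matrix for token in sentence}


toy_matrix = [
    ["to", "be", "or", "not", "to", "be"],
    ["that", "is", "the", "question"],
]
toy_vocab = build_vocab(toy_matrix)

# Repeated tokens ("to", "be") are counted once
print(len(toy_vocab))

# Sets are unordered, so sort before slicing for a reproducible preview
print(sorted(toy_vocab)[:3])
```

Because a set has no defined order, `list(vocab)[:10]` can return a different selection of words from one run to the next; sorting first gives a stable preview.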

## 4 Stopwords <a name="stopwords"></a>
A set of common words (such as 'not', 'they', and 'their') that usually carry little meaning on their own and are candidates for filtering.

In [7]:
list(shakespeare.stopwords)[:10]

['below',
 'not',
 'how',
 'they',
 'should',
 'while',
 'ma',
 'their',
 'what',
 'herself']

## 5 Filtered Vocabulary <a name="filtered_vocabulary"></a>
A set containing the corpus's vocabulary with the stopwords removed.

In [11]:
print(list(shakespeare.filtered_vocab)[:10])
print(len(shakespeare.filtered_vocab))

['forgave', 'openness', 'irremovable', 'strangles', 'napkin', 'bencher', 'mural', 'truepenny', 'unhallow', 'nayward']
22316
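If the filtered vocabulary is a plain set difference (an assumption, since the implementation is not shown here), the sizes above are consistent: 22444 - 22316 = 128 stopwords would occur in the corpus vocabulary. A minimal sketch with toy data:

```python
# Toy stand-ins for shakespeare.vocab and shakespeare.stopwords
toy_vocab = {"romeo", "juliet", "napkin", "not", "they"}
toy_stopwords = {"not", "they", "herself"}

# Set difference keeps only vocabulary words that are not stopwords
toy_filtered = toy_vocab - toy_stopwords

# Stopwords that actually appear in the vocabulary
removed = toy_vocab & toy_stopwords

print(sorted(toy_filtered))
print(len(toy_vocab) - len(toy_filtered) == len(removed))
```

Stopwords that never occur in the corpus ('herself' in the toy data) do not affect the result, which is why the size drops by the size of the intersection rather than by the size of the full stopword set.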


## 6 Word2Vec Demo with text_wrangler.Corpus <a name="word2vec_demo"></a>

In [9]:
from gensim.models import Word2Vec

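# Skip-gram model (sg=1) with 120-dimensional vectors. Passing the corpus to the
# constructor builds the vocabulary and already trains the model once.
# Note: `size` is the gensim 3.x parameter name; gensim 4+ renamed it to `vector_size`.
model = Word2Vec(shakespeare.sentence_matrix, size=120,
                 window=5, min_count=5, workers=8, sg=1)
for i in range(5):
    # Train for one additional epoch per round
    model.train(shakespeare.sentence_matrix, total_examples=len(shakespeare.sentence_matrix),
                epochs=1, compute_loss=True)
    # Training loss for this round (retrieved but not printed)
    loss = model.get_latest_training_loss()
    # Quick glimpse at the words Word2Vec finds most similar to "romeo"
    sim = model.wv.most_similar("romeo")
    print("Round {} ==================".format(i))
    for s in sim:
        print(s)
    print("\n\n")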



('tybalt', 0.8616291880607605)
('juliet', 0.8234875202178955)
('arthur', 0.7686914801597595)
('mercutio', 0.7674623727798462)
('harry', 0.7403349876403809)
('cell', 0.7379287481307983)
('rutland', 0.7296293377876282)
('nurse', 0.728378176689148)
('percy', 0.7262632846832275)
('cato', 0.7247920036315918)



('tybalt', 0.865548312664032)
('juliet', 0.8079246282577515)
('mercutio', 0.7740050554275513)
('arthur', 0.7464786171913147)
('cell', 0.7256070375442505)
('cato', 0.7171488404273987)
('rutland', 0.7166009545326233)
('hector', 0.708516001701355)
('imogen', 0.707841157913208)
('percy', 0.7043918371200562)



('tybalt', 0.856055498123169)
('juliet', 0.7969727516174316)
('mercutio', 0.760789155960083)
('arthur', 0.7261384129524231)
('cell', 0.7158156633377075)
('cato', 0.7057111263275146)
('outright', 0.6867507696151733)
('rutland', 0.6862920522689819)
('marcus', 0.6840404272079468)
('murdered', 0.6761247515678406)



('tybalt', 0.8447830677032471)
('juliet', 0.7775887250900269)
('mercut