# text_wrangler API

1. [Constructor](#constructor)
2. [Sentence Matrix](#sentence_matrix)
3. [Vocabulary](#vocabulary)
4. [Stopwords](#stopwords)
5. [Filtered Vocabulary](#filtered_vocabulary)
6. [Nouns](#nouns)
7. [Adjectives](#adjectives)
8. [Verbs](#verbs)
9. [Demo with Word2Vec](#word2vec_demo)

In [1]:
import os
os.chdir("..")

In [2]:
from src.graph.Graph import UndirectedGraph
from src.text.text_wrangler import Corpus

## 1 Constructor <a name="constructor"></a>
Loads a corpus from a text file into a text_wrangler.Corpus object.

In [3]:
shakespeare = Corpus("data/input/shakespeare.txt")

## 2 Sentence Matrix <a name="sentence_matrix"></a>
A matrix (a list of lists) in which each row is one sentence, stored as a list of word tokens. This is the format in which the Word2Vec model expects to receive the corpus.

In [4]:
print(len(shakespeare.sentence_matrix))
print(shakespeare.sentence_matrix[200])

99624
['ah', 'wherefore', 'with', 'infection', 'should', 'he', 'live', 'and', 'with', 'his', 'presence', 'grace', 'impiety', 'that', 'sin', 'by', 'him', 'advantage', 'should', 'achieve', 'and', 'lace', 'it', 'self', 'with', 'his', 'society']
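As a rough illustration of the row format, the sketch below builds a sentence matrix from a raw string. It is not the Corpus implementation: the function name `build_sentence_matrix`, the sentence-splitting rule, and the letters-only tokenizer are all assumptions made for this example.

```python
import re

def build_sentence_matrix(text):
    """Split raw text into sentences, then each sentence into lowercase tokens."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    matrix = []
    for sentence in sentences:
        # Assumption: a token is a run of letters; punctuation is dropped.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if tokens:
            matrix.append(tokens)
    return matrix

sample = "Ah! wherefore with infection should he live? And lace it self with his society."
matrix = build_sentence_matrix(sample)
print(len(matrix))   # number of sentences
print(matrix[1])     # one tokenized sentence
```

A letters-only tokenizer like this one would split contractions such as "don't" into two tokens, so a real tokenizer probably needs a more careful rule.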


## 3 Vocabulary <a name="vocabulary"></a>
A set containing every unique word in the corpus.

In [5]:
print(list(shakespeare.vocab)[:10])

['thawing',
 'conceal',
 'insisture',
 'money',
 'plod',
 'endless',
 'seymour',
 'tainture',
 'agrees',
 'meanes']

In [12]:
print(len(shakespeare.vocab))

22444
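One plausible way to derive such a set from a sentence matrix is to flatten the rows into a set comprehension. The helper `build_vocab` below is a hypothetical name used only for this sketch and is not part of text_wrangler.

```python
def build_vocab(sentence_matrix):
    """Collect every distinct token that appears in any sentence."""
    return {token for sentence in sentence_matrix for token in sentence}

toy_matrix = [
    ["to", "be", "or", "not", "to", "be"],
    ["that", "is", "the", "question"],
]
vocab = build_vocab(toy_matrix)
print(len(vocab))
# Sets are unordered, so sort before slicing when a reproducible sample is needed.
print(sorted(vocab)[:3])
```

Because a set has no defined order, slicing `list(vocab)` as in the cell above returns an arbitrary sample that can differ between runs.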


## 4 Stopwords <a name="stopwords"></a>
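A set of stopwords: very common words (pronouns, negations, contractions, and the like) that carry little meaning on their own and can usually be filtered out.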

In [13]:
list(shakespeare.stopwords)[:10]

["hasn't",
 "don't",
 'her',
 'both',
 'no',
 'there',
 'yourselves',
 'yourself',
 'when',
 'you']

## 5 Filtered Vocabulary <a name="filtered_vocabulary"></a>
A set containing the corpus vocabulary with the stopwords removed.

In [8]:
print(list(shakespeare.filtered_vocab)[:10])
print(len(shakespeare.filtered_vocab))

['thawing', 'insisture', 'conceal', 'money', 'plod', 'endless', 'seymour', 'tainture', 'agrees', 'meanes']
22316
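The filtering step can be pictured as a plain set difference. The sketch below uses small made-up sets in place of the real attributes, and it is an assumption that Corpus computes the result this way.

```python
# Toy stand-ins for the vocab and stopwords sets
vocab = {"romeo", "and", "juliet", "the", "money"}
stopwords = {"and", "the", "you", "don't"}

# Words in the vocabulary that are not stopwords
filtered_vocab = vocab - stopwords
# Stopwords that actually occur in the vocabulary
removed = vocab & stopwords

print(sorted(filtered_vocab))
print(len(vocab) - len(filtered_vocab) == len(removed))
```

Applied to the counts recorded in this notebook, the same arithmetic gives 22444 - 22316 = 128 stopwords present in the Shakespeare vocabulary.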


## 6 Nouns <a name="nouns"></a>
A set containing the words in the vocabulary identified as nouns.

In [14]:
print(list(shakespeare.nouns)[:10])

['wrist', 'chief', 'welcom', 'horologe', 'pantaloon', 'insisture', 'charm', 'money', 'toucheth', 'plod']


## 7 Adjectives <a name="adjectives"></a>
A set containing the words in the vocabulary identified as adjectives.

In [15]:
print(list(shakespeare.adjectives)[:10])

['bread', 'recant', 'conceal', 'defendant', 'conceive', 'seymour', 'bona', 'encourag', 'smallest', 'dragonish']


## 8 Verbs <a name="verbs"></a>
A set containing the words in the vocabulary identified as verbs.

In [16]:
print(list(shakespeare.verbs)[:10])

['oaken', 'purposed', 'thawing', 'youngling', 'clay', 'qualify', 'preserved', 'moonlight', 'thrilling', 'bring']
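The nouns, adjectives, and verbs sets can be thought of as a partition of the vocabulary by part-of-speech tag. The sketch below shows that grouping with a pluggable tagger. Everything in it is hypothetical: `split_by_pos`, `toy_tagger`, and the lookup table are stand-ins, and the tagger that Corpus actually uses is not shown in this notebook.

```python
from collections import defaultdict

def split_by_pos(vocab, tag_word):
    """Group words by the coarse tag that tag_word (any callable) returns."""
    groups = defaultdict(set)
    for word in vocab:
        groups[tag_word(word)].add(word)
    return groups

# Hypothetical lookup table standing in for a real part-of-speech tagger
TOY_TAGS = {
    "wrist": "noun", "money": "noun",
    "smallest": "adjective",
    "bring": "verb", "qualify": "verb",
}

def toy_tagger(word):
    # Unknown words fall into a catch-all group
    return TOY_TAGS.get(word, "other")

groups = split_by_pos({"wrist", "money", "smallest", "bring", "qualify", "ah"}, toy_tagger)
print(sorted(groups["noun"]))
print(sorted(groups["adjective"]))
print(sorted(groups["verb"]))
```

Tagging isolated words without sentence context tends to be noisy, which may explain entries such as 'bread' among the adjectives and 'clay' among the verbs in the outputs above.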


## 9 Word2Vec Demo with text_wrangler.Corpus <a name="word2vec_demo"></a>

In [10]:
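from gensim.models import Word2Vec

# Passing the sentences to the constructor builds the vocabulary and trains
# the model once. sg=1 selects the skip-gram architecture.
# Note: gensim 4.0 renamed the `size` argument to `vector_size`.
model = Word2Vec(shakespeare.sentence_matrix, size=120,
                 window=5, min_count=5, workers=8, sg=1)
for i in range(5):
    # Train for one additional epoch per round
    model.train(shakespeare.sentence_matrix, total_examples=len(shakespeare.sentence_matrix),
                epochs=1, compute_loss=True)
    loss = model.get_latest_training_loss()  # retrieved but not printed in this demo
    # Quick glimpse at what Word2Vec finds to be the most similar to "romeo"
    # (most_similar ranks words by cosine similarity)
    sim = model.wv.most_similar("romeo")
    print("Round {} ==================".format(i))
    for s in sim:
        print(s)
    print("\n\n")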



('tybalt', 0.8562448024749756)
('juliet', 0.7816978096961975)
('arthur', 0.7632609605789185)
('cato', 0.7622706890106201)
('mercutio', 0.7500942945480347)
('harry', 0.7439703345298767)
('hector', 0.7414019107818604)
('cell', 0.7264564037322998)
('montague', 0.7242982387542725)
('thisbe', 0.7203301787376404)



('tybalt', 0.8603017330169678)
('juliet', 0.7692221403121948)
('cato', 0.7610849738121033)
('mercutio', 0.7580196857452393)
('arthur', 0.752902626991272)
('rutland', 0.7148600816726685)
('hector', 0.7136489152908325)
('cell', 0.7121493816375732)
('aeneas', 0.7036502361297607)
('imogen', 0.7032143473625183)



('tybalt', 0.8507821559906006)
('mercutio', 0.7540134191513062)
('cato', 0.7527130842208862)
('juliet', 0.7486774921417236)
('arthur', 0.7279664874076843)
('cell', 0.6948233842849731)
('aeneas', 0.6902241706848145)
('outright', 0.6883275508880615)
('edgar', 0.6873379945755005)
('bassianus', 0.6869306564331055)



('tybalt', 0.8400412201881409)
('mercutio', 0.7457845211029053