## Homework 1
### NLP Basics & NLP Pipelines

Welcome to Homework 1! 

The homework contains several tasks. You can find the amount of points that you get for the correct solution in the task header. Maximum amount of points for each homework is _six_.

The **grading** for each task is the following:
- correct answer - **full points**
- insufficient solution or solution resulting in the incorrect output - **half points**
- no answer or completely wrong solution - **no points**

Even if you don't know how to solve the task, we encourage you to write down your thoughts and progress and try to address the issues that stop you from completing the task.

When working on the written tasks, try to make your answers short and accurate. Most of the times, it is possible to answer the question in 1-3 sentences.

When writing code, make it readable. Choose appropriate names for your variables (`a = 'cat'` - not good, `word = 'cat'` - good). Avoid constructing lines of code longer than 100 characters (79 characters is ideal). If needed, provide the commentaries for your code, however, a good code should be easily readable without them :)

Finally, all your answers should be written only by yourself. If you copy them from other sources it will be considered as an academic fraud. You can discuss the tasks with your classmates but each solution must be individual.

<font color='red'>**Important!:**</font> **before sending your solution, do the `Kernel -> Restart & Run All` to ensure that all your code works.**

In [22]:
import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

import spacy
nlp = spacy.load("en_core_web_sm")

from tqdm import tqdm
import re
from collections import defaultdict, Counter

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Task 1. Find the data (0.5 points)

Find large enough text data in English or [any other language supported by Spacy](https://spacy.io/usage/models). If the resources for your language are very limited, you may use English or other language of your preference.

**What is the language of your data?**

<font color='green'>English</font>

**Where did you get the text data?**

<font color='green'>[Project Gutenberg](https://www.gutenberg.org/ebooks/1998)</font>

**What kind of text is it? (books, magazines, news articles, etc.)**

<font color='green'>Text from a book ["Thus Spake Zarathustra by"](https://www.gutenberg.org/files/1998/1998-0.txt)</font>

**What style(s) of text does your data have? (user commetaries, scientific, neutral, etc.)**

<font color='green'>Philosophy, Liturature. I assume there might be complications as it uses non-conventional language to some extend.</font>

**Was it easy to download the data? If no, desribe what difficulties you had and how you resolved them.**

<font color='green'>It was fine. I clicked "Ctrl+S" on [this page](https://www.gutenberg.org/files/1998/1998-0.txt) and it saved the file on my Desktop.</font>

### Task 2. Tokenize and count statistics (0.5 points)

Using either NLTK or Spacy tools, tokenize your text data that you found in the previous exercise.

P.S. if you are using Spacy, don't forget to load an appropriate module for it

Compute and output the following:
- number of sentences 
- number of tokens 
- number of unique tokens (or types)
- average length of a sentence
- average length of a token

In [0]:
# Replace the path with the name of your data file
data_path = "text.txt"

data = open(data_path, encoding='utf-8').read()
data
# Split the data into sentences and tokens
print('Using NLTK:')
print(word_tokenize(data))
print(sent_tokenize(data))
print('------------------')
print('------------------')


print('Using Spacy:')
doc = nlp(data)
print([token.text for token in doc])
print([sents for sents in doc.sents])

In [3]:
num_sentences = len(list(doc.sents))
num_tokens = len([token.text for token in doc])
num_unique_tokens = len(set([token.text for token in doc]))
avg_sentence_len = round(num_tokens/num_sentences,2)
avg_token_len = round(sum([len(token.text) for token in doc])/len([token.text for token in doc]),2)

print("Number of sentences:", num_sentences)
print("Number of tokens:", num_tokens)
print("Number of unique tokens (or types):", num_unique_tokens)
print("Average sentence length:", avg_sentence_len)
print("Average token length:", avg_token_len)

Number of sentences: 9716
Number of tokens: 151095
Number of unique tokens (or types): 12209
Average sentence length: 15.55
Average token length: 3.68


### Task 3. Byte pair encoding (BPE) tokenization (1 point)

#### Task 3.1 (0.25 points)

[Byte pair encoding (BPE)](https://en.wikipedia.org/wiki/Byte_pair_encoding) is a simple algorithm of data compression. It looks for the most frequent pair of bytes in the data and replaces it with a new byte which is not seen in the data. 

Recently, this idea became [used in the tokenization](https://www.aclweb.org/anthology/P16-1162.pdf). Let's say that we want to train a network that captures the meaning of words. We can have in out data the following words: `low`, `lower`, `lowest`. If we tokenize the text in a simple way by splitting the words as a whole, the model will probably learn the relation between `low`, `lower`, `lowest`. Now, imagine that we get some new text that the model didn't see during training and it has the words `small`, `smaller`, `smallest` and in the training data we had only the word `small`. Since the model didn't see `smaller` and `smallest` during the training, it will most likely fail to capture the relation.

One of the ways to solve this is BPE tokenization. It learns the most frequent sequences and can split an unknown word into **subwords**. In our case, it can split `smaller` into `['small', 'er']` since we had `small` in the training data and probably many other words ending with -er. Now. instead of one unknown word, the model have two known subwords from which it can take the information.

The code below builds the subwords from the text data. For the purpose of time saving, we set the number of merges to 1000. 

Study the code below and answer the questions after it.

In [4]:
filename = data
def get_vocab(filename):
    """Gets the text from a file and splits it with spaces."""
    
    vocab = Counter()
    with open(filename, encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            for word in words:
                vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

def get_stats(vocab):
    """Computes the frequencies for each pair of characters in the vocab."""

    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, in_vocab):
    """Merges the most frequent pair.

    Arguments:
    pair -- the most frequent word pair (tuple(str, str))
    in_vocab -- vocabulary with frequencies (dict)
    """
    
    out_vocab = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in in_vocab:
        out_word = p.sub(''.join(pair), word)
        out_vocab[out_word] = in_vocab[word]
    return out_vocab

def get_tokens_from_vocab(vocab):
    tokens_frequencies = Counter()
    vocab_tokenization = {}
    for word, freq in vocab.items():
        word_tokens = word.split()
        for token in word_tokens:
            tokens_frequencies[token] += freq
        vocab_tokenization[''.join(word_tokens)] = word_tokens
    return tokens_frequencies, vocab_tokenization

def measure_token_length(token):
    if token[-4:] == '</w>':
        return len(token[:-4]) + 1
    else:
        return len(token)

vocab = get_vocab(data_path)

print('\n==========')
print('Tokens Before BPE')
tokens_frequencies, vocab_tokenization = get_tokens_from_vocab(vocab)
print('All tokens: {}'.format(tokens_frequencies.keys()))
print('Number of tokens: {}'.format(len(tokens_frequencies.keys())))
print('==========')

num_merges = 1000
for i in tqdm(range(num_merges)):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)

tokens_frequencies, vocab_tokenization = get_tokens_from_vocab(vocab)

print('\nAll tokens: {}'.format(tokens_frequencies.keys()))
print('Number of tokens: {}'.format(len(tokens_frequencies.keys())))
print('==========')

  0%|          | 1/1000 [00:00<01:55,  8.68it/s]


Tokens Before BPE
All tokens: dict_keys(['\ufeff', 'T', 'h', 'e', '</w>', 'P', 'r', 'o', 'j', 'c', 't', 'G', 'u', 'n', 'b', 'g', 'E', 'B', 'k', 'f', 's', 'S', 'p', 'a', 'Z', ',', 'y', 'F', 'i', 'd', 'N', 'z', 'w', 'l', 'm', 'v', '.', 'Y', '-', 'L', ':', 'A', 'C', 'D', '7', '2', '0', '8', '[', '#', '1', '9', ']', 'R', 'U', '6', '*', 'O', 'H', 'I', 'J', 'K', '’', 'q', 'W', 'M', 'V', 'X', '“', '”', ';', 'x', '‘', '3', '5', '?', '(', ')', '!', 'Q', '4', '_', '/', '%', '@', '$'])
Number of tokens: 86


100%|██████████| 1000/1000 [01:36<00:00, 10.38it/s]


All tokens: dict_keys(['\ufeff', 'The</w>', 'Project</w>', 'Gutenber', 'g</w>', 'E', 'B', 'ook</w>', 'of</w>', 'Thus</w>', 'S', 'p', 'ake</w>', 'Zarathustra,</w>', 'by</w>', 'F', 'ri', 'ed', 'ch</w>', 'Nietzsche</w>', 'This</w>', 'e', 'is</w>', 'for</w>', 'the</w>', 'use</w>', 'an', 'y', 'one</w>', 'where</w>', 'at</w>', 'no</w>', 'co', 'st</w>', 'and</w>', 'with</w>', 'al', 'most</w>', 'r', 'est', 'c', 'tion', 's</w>', 'w', 'hat', 'so', 'ever', '.</w>', 'Y', 'ou</w>', 'may</w>', 'op', 'y</w>', 'it,</w>', 'give</w>', 'it</w>', 'away</w>', 'or</w>', 're', '-', 'un', 'der</w>', 'ter', 'm', 'L', 'ic', 'en', 'se</w>', 'in', 'cl', 'ud', 'ed</w>', 'this</w>', 'on', 'l', 'ine</w>', '.', 'g', 'utenber', 'or', 'T', 'it', 'le', ':</w>', 'Zarathustra</w>', 'A</w>', 'All</w>', 'N', 'A', 'u', 'th', 's', 'at', 'Th', 'om', 'as</w>', 'C', 'on</w>', 'P', 'o', 'st', 'ing</w>', 'D', 'e:</w>', 'ov', 'em', 'b', 'er</w>', '7', ',</w>', '2', '0', '8', '</w>', '[', '#', '1', '9', ']', 'R', 'a', 'ce', 'er,</w




Answer the following questions:

**Study the subwords from your data. Do you see any subwords that make sense from the liguistic point of view? (e.g. suffixes, prefixes, common roots etc.). Provide examples.**

<font color='green'>"Zarathustra" and other names make sense to be stand alone as they are specific names and repeat over and over in the text. </font>

<font color='green'>The 'ay\</w\>' and alike make sense to be among new tokens as it pops up in such words as "may", "way", "Gay", "bay". However, such token is hard to comprehend at first sight - part of which word it is.</font>

<font color='green'>There are stand alone specific words like "with", "most", "Thus" which make sense as those are words repeating in the language frequently.</font>

<font color='green'>Interesting that it has created multiple tokens like 'howev', 'however,\</w\>', 'How\</w\>', 'how', which appear mostly in words like "how" and "however". But as they are presented in different variations (for example, with comma, capitalized first letter), then those would require their own tokens.</font>

**What will happen if you increase the number of merges?**

<font color='green'>I increased the number of merges to 3000 and then to 5000. I noticed that every time the number of tokens increased almost 3 and 5 times. However, I noticed that with increase in merges, there are more tokens looking like a full word rather than a set of letters. I assume that at some point the number of tokens would stop increasing.</font>

#### Task 3.2 (0.75 points)

Now, you are going to implement the function that splits the an unknown word into subwords using the vocab that we built above. 

One way to do it is the following:
1. Sort our vocab by the length in the descending order.
2. Find the boundaries of the "window" that is going to search if a candidate word has a corresponding subword in the vocab. In the beginning, the starting index is 0, since we start to scan the word from the first characher. The end index is the length of the longest subword in the vocab or the length of the word if it is smaller.
3. In a while loop, start looking at the possible subwords. If the subword you are looking at is in the vocab, append it to the result. Now, your new starting index is your previous end index. Your new end index is your new start index plus the length of the longest subword in the vocab or the length of the word if it is smaller than the resulting sum. If the subword is not in the vocab, we reduce the end index by one thus narrowing our search window. Finally, is the length of our window is equal to one, we put an unknown subword in the result and update our window as above.
4. End the loop when we reach the end of the word.

After you finish with the function, test the tokenizer on a very common word and on a very unusual word (you can even try to invent a word yourself).

In [16]:
# Sorting the subwords by the length in the descending order
sorted_tokens_tuple = sorted(tokens_frequencies.items(), key=lambda item: (measure_token_length(item[0]), item[1]), reverse=True)
sorted_tokens = [token for (token, freq) in sorted_tokens_tuple]

def tokenize_word(string, sorted_tokens, unknown_token='</u>'):
    """
    Tokenizes the word into subword using learned BPE vocab
    
    Arguments:
    string -- a word to tokenize. Must end with </w>
    sorted_tokens -- sorted vocab by frequency in descending order
    unknown_token -- a token to replace the words not found in the vocab
    """
    
    if string == '':
        return []
    if sorted_tokens == []:
        return [unknown_token]

    # We are going to store our subwords here
    string_tokens = []
    
    # Find the maximum length of the ngram in vocab
    ngram_max_len = len(sorted_tokens[0])
    # End index is the maximum lenth of the ngram or the length of the string is it's smaller
    end_idx = (ngram_max_len if len(string)<ngram_max_len else len(string))
    # Starting index is 0 in the beginning
    start_idx = 0
    
    while start_idx < len(string):
        subword = string[start_idx:end_idx]
        if subword in sorted_tokens:
            string_tokens.append(subword)
            start_idx = end_idx
            end_idx = start_idx + end_idx
        elif len(subword) == 1:
            string_tokens.append([unknown_token])
            start_idx = end_idx
            end_idx = start_idx + end_idx
        else:
            end_idx-=1
            
    return string_tokens

# The word should end with "</w>". For example, "cat</w>".
word_known = 'God</w>'
word_unknown = 'Serendipity^~</w>'

print('Tokenizing word: {}...'.format(word_known))
if word_known in vocab_tokenization:
    print(vocab_tokenization[word_known])
else:
    print(tokenize_word(string=word_known, sorted_tokens=sorted_tokens, unknown_token='</u>'))
    

print('Tokenizing word: {}...'.format(word_unknown))
if word_unknown in vocab_tokenization:
    print(vocab_tokenization[word_unknown])
else:
    print(tokenize_word(string=word_unknown, sorted_tokens=sorted_tokens, unknown_token='</u>'))

Tokenizing word: God</w>...
['God</w>']
Tokenizing word: Serendipity^~</w>...
['S', 'e', 're', 'n', 'di', 'pi', 'ty', ['</u>'], ['</u>'], '</w>']


### Task 4. Lemmatization and normalization (1 point)

#### Task 4.1 (0.5 points)

Using either NTLK or Spacy, lemmatize your data.
Make a copy of your data but this time transform all the tokens and lemmas into the lowercase.

Provide the following statistics:
- Number of unique lemmas (original case)
- Number of unique lemmas (lower case)
- Number of unique tokens (original case)
- Number of unique tokens (lower case)

In [17]:
# Lemmatize your data
lemmas = ([token.lemma_ for token in doc])


# Make a copy of your tokens but in lowercase
lemmas_lower = ([token.lemma_.lower for token in doc])


# Count statistics (no need to calculate the number of unique tokens in original case since we did it in Task 2)
num_unique_lemmas = len(set(lemmas))
num_unique_lemmas_lower = len(set(lemmas_lower))
num_unique_tokens_lower = len(set([token.text.lower for token in doc]))

# Print out the numbers
print("Number of unique lemmas (original case):", num_unique_lemmas)
print("Number of unique lemmas (lower case):", num_unique_lemmas_lower)
print("Number of unique tokens (original case):", num_unique_tokens)
print("Number of unique tokens (lower case):", num_unique_tokens_lower)

Number of unique lemmas (original case): 9668
Number of unique lemmas (lower case): 120571
Number of unique tokens (original case): 12209
Number of unique tokens (lower case): 120173


#### Task 4.2 (0.5 points)

Look at the numbers you got. 

**Imagine that you want to use your data to train a network that captures the meaning of the words. Do you want to use tokens or lemmas? Original or lowercase? Explain your choice.**

<font color='green'>If we lemmatize - we get the base form of the word. We lose prefixes, suffixes which make morphological meaning of the word. For example if we remove "al" from "critical", we are left with "critic" which has a different meaning. Same applies to lowercase.</font>

**Imagine that you want to use your data to train a system that detects named entities, i.e. names of people, places, companies etc. Do you want to use tokens or lemmas? Original or lowercase? Explain your choice.**

<font color='green'>For ner I would prefer to have original casing (not lowercase) as it would be harder to detect NE right in lowercase English text. Capital letters can be a great hint. That is the reason why Chinese language can be hard for NE.
Regarding Lemmatization - I would use lemmas, as for example token can be "Jason's", but we are interested in name "Jason".</font>

### Task 5. Choose your pipeline (0.5 points)

Choose the pipeline between [Spacy](https://spacy.io/) and [StanfordNLP](https://github.com/stanfordnlp/stanfordnlp).

**Which pipeline did you choose? Why?**

<font color='green'>StanfordNLP. Based on [this page](https://spacy.io/usage/facts-figures), comparison provided by Spacy made me note that StanfordNLP seems to be more advanced.</font>

**What components does the pipeline have?**

<font color='green'>“tokenizer, mwt, part-of-speech, lemmatization, dependency parsing”. Based on [this page](https://stanfordnlp.github.io/stanfordnlp/pipeline.html)</font>

**What languages does the pipeline support?**

<font color='green'>Stanfordnlp has more languages than SpaCy, as it is more "mature" and Spacy is "younger" and did not cover same langugages to same extend. Full list of languages can be found [here](https://stanfordnlp.github.io/stanfordnlp/models.html).</font>

In [28]:
# import your pipeline here
import stanfordnlp
stanfordnlp.download('en')   
nlp = stanfordnlp.Pipeline()

Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: en_ewt
Download location: /root/stanfordnlp_resources/en_ewt_models.zip


100%|██████████| 235M/235M [02:08<00:00, 1.80MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.
Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/root/stanfordnlp_resources

### Task 6. Process your text (1.5 points)

#### Task 6.1 (1 point)

Process the text data from the first task with the pipeline of your choice. 

Select one sentence from the processed document and print out all the results (tokens, pos-tags, lemmas, depparse, etc.).

In [30]:
# Process the text
doc = nlp(data)


# Print out the results
doc.sentences[4].print_tokens()
doc.sentences[4].print_dependencies() 



<Token index=1;words=[<Word index=1;text=Author;lemma=author;upos=NOUN;xpos=NN;feats=Number=Sing;governor=0;dependency_relation=root>]>
<Token index=2;words=[<Word index=2;text=:;lemma=:;upos=PUNCT;xpos=:;feats=_;governor=1;dependency_relation=punct>]>
<Token index=3;words=[<Word index=3;text=Friedrich;lemma=Friedrich;upos=PROPN;xpos=NNP;feats=Number=Sing;governor=1;dependency_relation=appos>]>
<Token index=4;words=[<Word index=4;text=Nietzsche;lemma=Nietzsche;upos=PROPN;xpos=NNP;feats=Number=Sing;governor=3;dependency_relation=flat>]>
('Author', '0', 'root')
(':', '1', 'punct')
('Friedrich', '1', 'appos')
('Nietzsche', '3', 'flat')


#### Task 6.2 (0.5 points)

**Look at your output above. Are the results correct? If no, provide the examples of the mistakes.**

<font color='green'>In my case, specifically for 5th sentence in my text - it is all good. However if we check sentence 4, there is a word "Book", and lemma= "Book". IMO it is not correct.</font>

**What is the difference between a POS tag and morphological tag?**

<font color='green'>POS is about distinguishing which role the word plays in the sentence and based on it it is assigned to lexical category like noun, verb, adjective, adverb and other categories which depend on language. 

Morphological tagging aims to distinguish additional lexical and grammatical properties of words for morfologically rich languages like Turkish. The tags would distinguish Gender, Case etc.</font>

**What is the difference between tagging and parsing?**

<font color='green'>Tagging is about assigning a marker to each word in a text.
Parsing (or dependency parsing) refers to building/showing lexical/syntatic dependencies between words. It can be found useful for free order languages.</font>

**Analyze the dependency parsing result. Does it make sense? Briefly describe the meaning behind the relations.**

<font color='green'>For sentence 5:
Result makes sense.
there is flat relation between Friedrich Nietzsche as it is a multiword expressions. Author is a root (head/governer) as it is the main gist of the sentence.</font>

### Task 7. Statistics (1 point)

In your processed output, compute and print out (in a human readable format) the following stats:
- POS tag frequency for each tag (in descending order)
- 50 most frequent lemmas
- 10 least frequent lemmas

In [31]:
print('POS tag frequency:')
# Compute and print out POS tag frequency
for words in (set([word.upos for sent in doc.sentences for word in sent.words])): 
  print(words, 'frequency is:', [word.upos for sent in doc.sentences for word in sent.words].count(words)) 

print('------\n\n50 most frequent lemmas:')
# Compute and print out 50 most frequent lemmas
print(Counter([word.lemma for sent in doc.sentences for word in sent.words]).most_common(50))

print('------\n\n10 least frequent lemmas:')
# Compute and print out 10 least frequent lemmas
print(Counter([word.lemma for sent in doc.sentences for word in sent.words]).most_common()[:-11:-1])

POS tag frequency:
PUNCT frequency is: 24451
NOUN frequency is: 22057
CCONJ frequency is: 5817
VERB frequency is: 14187
ADJ frequency is: 9041
PART frequency is: 2711
SCONJ frequency is: 1987
INTJ frequency is: 740
NUM frequency is: 795
X frequency is: 172
PROPN frequency is: 2473
DET frequency is: 10697
AUX frequency is: 7560
SYM frequency is: 106
PRON frequency is: 17156
ADP frequency is: 12422
ADV frequency is: 9218
------

50 most frequent lemmas:
[(',', 8612), ('the', 6119), ('and', 4699), ('.', 4358), ('be', 4090), ('of', 3120), ('to', 2786), ('I', 2644), ('he', 2404), ('!', 2276), ('a', 2084), ('"', 2082), (':', 1677), ('--', 1634), ('it', 1597), ('in', 1553), ('they', 1421), ('that', 1364), ('-', 1246), ('for', 1171), ('not', 1053), ('one', 1002), ('have', 996), ('my', 981), ('do', 979), ('all', 916), (';', 915), ('thou', 881), ('with', 875), ('?', 869), ('you', 731), ('but', 716), ('Zarathustra', 699), ('this', 682), ('ye', 659), ('man', 631), ('on', 564), ('we', 561), ('as', 