# <center>HW #1: Analyze Documents by Numpy</center>

**Instructions**: 
- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order shown in the problem description.
- Follow the Submission Instructions to submit your assignment.
- Code of academic integrity:
    - **Each assignment needs to be completed independently. This is NOT a group assignment**. 
    - Never copy others' work (even with minor modifications, e.g., changing variable names).
    - If you generate code using large language models (although this is not encouraged), make sure to adapt the generated code so that it meets all requirements and is executable.
    - Anti-Plagiarism software will be used to check similarities between all submissions.
    - Check Syllabus for more details.

**Problem Description**

In this assignment, you'll write functions to analyze an article to find out the word distributions and key concepts. 

The packages you'll need for this assignment include `numpy` and `string`. Some useful functions:
- string, list, dictionary: `split`, `join`, `count`, `index`, `strip`
- numpy: `sum`, `where`, `log`, `argsort`, `argmin`, `argmax`

## Q1. Define a function to analyze word counts in a document


Define a function named `tokenize(doc)` which processes an input document (denoted as `doc`) as follows: 

* First convert the document to lower case.
* Split the document into a list of tokens by **whitespace** (spaces, tabs, and new lines). For example, `hello, it's a helloooo world!` -> `["hello,", "it's", "a", "helloooo", "world!"]` 
* Remove leading and trailing punctuation from each token. For example, `world!` -> `world`, but `it's` is not changed because the punctuation is in the middle. 
    - Hint: you can import the module *string*, use `string.punctuation` to get a string of punctuation characters (say `puncts`), and then use the function `strip(puncts)` to remove leading and trailing punctuation from each token.
* Find the count of each unique non-empty token and save the counts as a dictionary named `vocab`, e.g., `{"hello": 1, "a": 1, ...}` 
* Return the dictionary
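
As a rough illustration of the hint above, the snippet below shows how `strip(string.punctuation)` treats a few tokens. The sample tokens are made up for this sketch and are not part of the assignment data.

```python
import string

# Hypothetical sample tokens, chosen only to illustrate the hint.
tokens = ["world!", "it's", "--hello,", "..."]

# strip() removes leading/trailing characters found in string.punctuation,
# but leaves punctuation in the middle of a token untouched.
stripped = [t.strip(string.punctuation) for t in tokens]
print(stripped)

# Tokens made up entirely of punctuation become empty strings and are dropped.
non_empty = [t for t in stripped if t]
print(non_empty)
```

Note that a token consisting only of punctuation collapses to an empty string, which is why the requirement asks for non-empty tokens only.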
    

In [374]:
import numpy as np
import string
import pprint as pp
# add your import statement

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [375]:
def tokenize(doc):
    
    vocab = {}
    
    # add your code here
    
    doc=doc.lower() # Convert the doc to lower case
    tokens=doc.split() # Split the doc into a list of tokens by space
    tokens=[token.strip(string.punctuation) for token in tokens if token.strip(string.punctuation)] 
    # Remove leading or trailing punctuations of each token and filter out empty strings
    for token in tokens:
        if token in vocab:
            vocab[token]+=1
        else:
            vocab[token]=1 
    return vocab

#test
doc = "Hello , it's a helloooo world!"
vocab = tokenize(doc)
vocab

{'hello': 1, "it's": 1, 'a': 1, 'helloooo': 1, 'world': 1}

## Q2: Split unusual words into common pieces 


Notice that some words contain extra characters or punctuation. Next, we'll find the common subwords in each word (e.g., split "helloooo" into "hello" and "ooo").

**Q2.1.** Define a function `get_pair_count(vocab)` to count the frequency of each pair of consecutive subwords in a word as follows:


- The input is a dictionary (denoted as `vocab`) which maps each word to its count. Each word contains subwords delimited by spaces. For example, at the beginning, we treat each character as a subword. Thus, the `vocab` from Q1 becomes `{"h e l l o":1, "a":1, ...}`
- Count every pair of consecutive subwords in each word and create a new dictionary that records the total count of each pair across all the words, e.g. `{('e', 'l'): 2}`.
- Return the dictionary for the subword pairs.

In [376]:
def get_pair_count(vocab):

    pairs = {}

    # add your code here
    for word, count in vocab.items():
        subwords = word.split() # Split the word into its space-delimited subwords
        for i in range(len(subwords) - 1):
            pair = (subwords[i], subwords[i + 1]) # Consider each adjacent pair of subwords
            pairs[pair] = pairs.get(pair, 0) + count # Weight the pair by the word's count
    return pairs

# Test

# At the start, treat each character as a subword. 
# Add spaces as delimiters of subwords in each word 

init_vocab = {' '.join(list(word)) : count for word, count in vocab.items()}
pp.pprint(init_vocab)
print("\n")

pairs = get_pair_count(init_vocab)
pp.pprint(pairs)

{'a': 1, 'h e l l o': 1, 'h e l l o o o o': 1, "i t ' s": 1, 'w o r l d': 1}


{("'", 's'): 1,
 ('e', 'l'): 2,
 ('h', 'e'): 2,
 ('i', 't'): 1,
 ('l', 'd'): 1,
 ('l', 'l'): 2,
 ('l', 'o'): 2,
 ('o', 'o'): 3,
 ('o', 'r'): 1,
 ('r', 'l'): 1,
 ('t', "'"): 1,
 ('w', 'o'): 1}


**Q2.2**. Define a function `merge_subwords(pair, vocab)` as follows:


- The inputs include a subword pair (denoted as `pair`), and the original vocabulary dictionary (denoted as `vocab`).
- For each word in `vocab`, if it contains `pair`, remove the space delimiter between the pair. Now this pair becomes a new subword. 
    - Hint: if you know regular expressions, feel free to use them here. Otherwise, you can simply use the function `replace`. Don't worry about minor cross-boundary issues, e.g., `('hell', 'o')` may be matched with `hell oo`.
- Return the new vocabulary dictionary.
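
For readers who want to avoid the cross-boundary issue mentioned in the hint, one possible regular-expression approach is sketched below. The function name `merge_subwords_regex` and the sample vocabulary are assumptions made for this illustration; the assignment itself only requires `merge_subwords`.

```python
import re

def merge_subwords_regex(pair, vocab):
    """Merge `pair` only where both parts are whole space-delimited subwords."""
    # (?<!\S) and (?!\S) require whitespace (or a string edge) on both sides,
    # so ('hell', 'o') does not match inside 'hell oo'.
    pattern = re.compile(r"(?<!\S)" + re.escape(pair[0] + " " + pair[1]) + r"(?!\S)")
    merged = pair[0] + pair[1]
    # A function replacement keeps any backslashes in `merged` literal.
    return {pattern.sub(lambda m: merged, word): count
            for word, count in vocab.items()}

# Hypothetical vocabulary for illustration.
example_vocab = {"hell oo oo": 1, "hell o": 1}
result = merge_subwords_regex(("hell", "o"), example_vocab)
print(result)
```

With plain `replace`, the first word would become `helloo oo`; the lookarounds leave it unchanged and merge only the second word.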

In [377]:
def merge_subwords(pair, vocab):

    # initialize output vocab
    vocab_out = {}

    # add your code here
    for word, count in vocab.items():
        # replace subword pair separated by space with merged subword
        new_word = word.replace(pair[0] + ' ' + pair[1], pair[0] + pair[1])
        vocab_out[new_word] = count
    return vocab_out

# Test

pair = ('h', 'e')

# replace all 'h e' substrings by 'he'
new_vocab = merge_subwords(pair, init_vocab)

pp.pprint(new_vocab)

{'a': 1, 'he l l o': 1, 'he l l o o o o': 1, "i t ' s": 1, 'w o r l d': 1}


**Q2.3**. Define a function `subword_tokenize(doc, num_merges = 5)` to put all functions together.


- The inputs include a document (denoted as `doc`) and the number of times to merge subwords.
- Call `tokenize(doc)` to get the initial vocabulary dictionary, denoted as `vocab`
- For each word in `vocab`, add a space delimiter between characters to indicate that each character is treated as a subword initially. Save these characters into a list named `subwords`
- Repeat the following steps `num_merges` times:
    - Call `get_pair_count(vocab)` to get the frequency of each subword pair across the words
    - Find the subword pair with the highest count, denoted as `pair`. If there is a tie, take any pair.
    - Call `merge_subwords(pair, vocab)` to merge the selected subwords and update the vocabulary `vocab`. Add the new subword to the list `subwords`.
- Finally, split each word in `vocab` by space to generate a new dictionary for the count of each subword.
- Return the subword dictionary and also the `subwords` list.
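
The final counting step can be sketched on a small example. The dictionary `merged_vocab` below is a hypothetical vocabulary after a few merges, not the actual output of this assignment.

```python
# Hypothetical vocabulary after a few merges (word -> count).
merged_vocab = {"hell o": 1, "wo r l d": 2, "hell oo oo": 1}

subword_counts = {}
for word, count in merged_vocab.items():
    # Each space-delimited piece is one subword; every occurrence
    # contributes the count of the word it appears in.
    for sub in word.split():
        subword_counts[sub] = subword_counts.get(sub, 0) + count

print(subword_counts)
```

Here `oo` appears twice in a word with count 1, so it ends up with a total of 2.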

In [378]:
def subword_tokenize(doc, num_merges = 5):
    vocab_out = {}
    subwords = []
    # add your code here
    vocab = tokenize(doc)
    
    # collect the unique characters as the initial subwords
    for word in vocab:
        for char in word:
            if char not in subwords:
                subwords.append(char)

    # add a space delimiter between the characters of each word
    vocab = {" ".join(list(word)) : count for word, count in vocab.items()} 

    for i in range(num_merges):
        pairs = get_pair_count(vocab)
        if not pairs: # nothing left to merge
            break
        print("Merge #"+str(i)+":")
        pair = max(pairs, key=pairs.get) # most frequent pair; ties are broken arbitrarily
        print("pairs: ", dict(sorted(pairs.items(), key=lambda item: item[1], reverse=True)))
        print("pair: ",pair)

        vocab = merge_subwords(pair, vocab) 
        subwords.append(pair[0] + pair[1]) 
        print("Vocab:", vocab)
        print("Subwords:", subwords)

        print("\n")
    
    # split each word by space and count each subword across all words
    for word, count in vocab.items():
        for subword in word.split():
            vocab_out[subword] = vocab_out.get(subword, 0) + count

    return vocab_out, subwords

# test
# for debugging, you can print out the result of each merge as shown below.

doc = "Hello world, it's a helloooo world!"

vocab_out, subwords = subword_tokenize(doc, num_merges = 9)

print("vocab:")
pp.pprint(vocab_out)

print("subwords:")
pp.pprint(subwords)


Merge #0:
pairs:  {('o', 'o'): 3, ('h', 'e'): 2, ('e', 'l'): 2, ('l', 'l'): 2, ('l', 'o'): 2, ('w', 'o'): 2, ('o', 'r'): 2, ('r', 'l'): 2, ('l', 'd'): 2, ('i', 't'): 1, ('t', "'"): 1, ("'", 's'): 1}
pair:  ('o', 'o')
Vocab: {'h e l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'h e l l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo']


Merge #1:
pairs:  {('h', 'e'): 2, ('e', 'l'): 2, ('l', 'l'): 2, ('w', 'o'): 2, ('o', 'r'): 2, ('r', 'l'): 2, ('l', 'd'): 2, ('l', 'o'): 1, ('i', 't'): 1, ('t', "'"): 1, ("'", 's'): 1, ('l', 'oo'): 1, ('oo', 'oo'): 1}
pair:  ('h', 'e')
Vocab: {'he l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'he l l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo', 'he']


Merge #2:
pairs:  {('he', 'l'): 2, ('l', 'l'): 2, ('w', 'o'): 2, ('o', 'r'): 2, ('r', 'l'): 2, ('l', 'd'): 2, ('l', 'o'): 1, ('i', 't'): 1, ('t', "'"): 1, ("'", 's'): 1, ('l', 'oo'): 1, ('oo', 'oo'): 1}
pair:  ('he', 'l')


In [16]:
# test
# for debugging, you can print out the result of each merge as shown below.

doc = "Hello world, it's a helloooo world!"

vocab_out, subwords = subword_tokenize(doc, num_merges = 9)

print("vocab:")
pp.pprint(vocab_out)

print("subwords:")
pp.pprint(subwords)


Merge #0: 
pair: ('o', 'o') 
Vocab:	{'h e l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'h e l l oo oo': 1} 
Subwords:	['e', 'r', 'a', 'i', 's', 'h', 'l', "'", 't', 'd', 'o', 'w', 'oo'] 

Merge #1: 
pair: ('h', 'e') 
Vocab:	{'he l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'he l l oo oo': 1} 
Subwords:	['e', 'r', 'a', 'i', 's', 'h', 'l', "'", 't', 'd', 'o', 'w', 'oo', 'he'] 

Merge #2: 
pair: ('he', 'l') 
Vocab:	{'hel l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'hel l oo oo': 1} 
Subwords:	['e', 'r', 'a', 'i', 's', 'h', 'l', "'", 't', 'd', 'o', 'w', 'oo', 'he', 'hel'] 

Merge #3: 
pair: ('hel', 'l') 
Vocab:	{'hell o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'hell oo oo': 1} 
Subwords:	['e', 'r', 'a', 'i', 's', 'h', 'l', "'", 't', 'd', 'o', 'w', 'oo', 'he', 'hel', 'hell'] 

Merge #4: 
pair: ('w', 'o') 
Vocab:	{'hell o': 1, 'wo r l d': 2, "i t ' s": 1, 'a': 1, 'hell oo oo': 1} 
Subwords:	['e', 'r', 'a', 'i', 's', 'h', 'l', "'", 't', 'd', 'o', 'w', 'oo', 'he', 'hel', 'hell', 'w

## Q3. Generate a document term matrix (DTM) as a numpy array


Define a function `get_dtm(docs)` as follows:
- The input is a list of documents, denoted as `docs`
- For each document, call `tokenize(doc)` defined in **Q1** (let's only use the simple version for now) to get the vocabulary dictionary 
- Pool the keys from all the dictionaries to get a list of unique words, denoted as `unique_words` 
- Create a numpy array (denoted as `dtm`) with a shape of (# of documents x # of unique words), and set the initial values to 0. 
- Fill cell `dtm[i,j]` with the count of the `j`th word in the `i`th document 
- Return `dtm` and `unique_words`

In [379]:
# A test document collection. This document can be found at https://hbr.org/2022/04/the-power-of-natural-language-processing

# treat each paragraph as a document

docs = open("chatgpt.txt", 'r').readlines()

dtm, words = get_dtm(docs)

In [380]:
dtm.shape

# check words in a paragraph
p = 0 # paragraph id
docs[p]
[w for i,w in enumerate(words) if dtm[p][i]>0] 

pp.pprint(sorted(words))

(26, 314)

"Ethan Mollick has a message for the humans and the machines: can't we all just get along?\n"

['ethan',
 'mollick',
 'has',
 'a',
 'message',
 'for',
 'the',
 'humans',
 'and',
 'machines',
 "can't",
 'we',
 'all',
 'just',
 'get',
 'along']

['22-year-old',
 'a',
 'a.i',
 'about',
 'abroad',
 'academic',
 'access',
 'acknowledge',
 'adapt',
 'admits',
 'adopted',
 'after',
 'again',
 'against',
 'agrees',
 'all',
 'allowing',
 'almost',
 'along',
 'already',
 'alternates',
 'an',
 'and',
 'anxiety',
 'any',
 'app',
 'are',
 'artificial',
 'as',
 'asked',
 'asking',
 'assessments',
 'associate',
 'at',
 'away',
 'b',
 'b-minus',
 'banned',
 'be',
 'been',
 'before',
 'behalf',
 'believes',
 'between',
 'bot',
 "bot's",
 'but',
 'by',
 'calculators',
 'can',
 "can't",
 'capability',
 'challenge',
 'change',
 'changed',
 'changes',
 'chatbot',
 'chatgpt',
 'cheating',
 'check',
 'cites',
 'class',
 'classes',
 'classroom',
 'code',
 'come',
 'company',
 'compose',
 'computer',
 'concerns',
 'convinced',
 'core',
 'could',
 "couldn't",
 'course',
 'crashed',
 'created',
 'deserve',
 'despite',
 'detect',
 'did',
 "didn't",
 'differently',
 'districts',
 'do',
 "don't",
 'earlier',
 'early',
 'educators',
 'edward',
 'emerging'

In [372]:
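def get_dtm(docs):
    
    # add your code here
    # tokenize each document into a {word: count} dictionary
    tokenized_docs = [tokenize(doc) for doc in docs]
    
    # pool the unique words across documents, keeping first-seen order
    all_words = []
    for vocab in tokenized_docs:
        for word in vocab:
            if word not in all_words:
                all_words.append(word)
    
    # map each word to its column index for fast lookup
    word_index = {word: j for j, word in enumerate(all_words)}
    
    # rows are documents, columns are unique words
    dtm = np.zeros((len(docs), len(all_words)))
    for i, vocab in enumerate(tokenized_docs):
        for word, count in vocab.items():
            dtm[i, word_index[word]] = count
    
    return dtm, all_words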

# test
docs = ["Hello , it's a helloooo world!",
       "Again, it is hello world!"]

dtm, words = get_dtm(docs)
dtm
words

array([[1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 1., 1., 1.]])

['hello', "it's", 'a', 'helloooo', 'world', 'again', 'it', 'is']

## Q4 Analyze DTM Array (4 points)


**Don't use any loop in this task**. You should use array operations to take advantage of vectorized, high-performance computation.

Define a function named `analyze_dtm(dtm, words, docs)` as follows:
- It takes an array `dtm`, an array of `words`, and an array of documents (denoted `docs`) as inputs, where `dtm` is the array you created from `docs` in Q3 with a shape of $(m \times n)$, and `words` corresponds to the columns of `dtm`.
- Calculate the document frequency for each word $j$, i.e., how many documents contain word $j$. Save the result to an array $df$, which has a shape of $(n,)$ or $(1, n)$. 
- Normalize the word count per document: divide each word count, i.e., $dtm_{i,j}$, by the total number of words in document $i$. Save the result as an array named $tf$, which has a shape of $(m, n)$. 
* For each $dtm_{i,j}$, calculate $tfidf_{i,j} = \frac{tf_{i,j}}{1+\log(df_j)}$, i.e., divide each normalized word count by one plus the log of the document frequency of the word (adding 1 keeps the denominator from being 0 when $df_j = 1$). $tfidf$ has a shape of $(m, n)$. 
* Print out the following:
    
    - the total number of words in the documents represented by `dtm` 
    - the number of documents and the number of unique words
    - the top 10 most frequent words across the documents    
    - the top 5 words that appear in the most documents, i.e., words with the 5 largest $df$ values (print the words first, then their values) 
    - the longest document in terms of the number of words. Print out this document.
    - the top 5 words with the largest $tfidf$ values in the longest document (show words and values) 
    - documents that contain the word `intelligence`.

Note, for all the steps, **do not use any loop**. Just use array functions and broadcasting for high performance computation.

Your answer may be different from the example output, since words may have the same values in the dtm but are kept in different positions
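
To make the formulas above concrete, here is a minimal sketch on a toy array, assuming a made-up 2 x 3 `dtm` (two documents, three words). It only illustrates the broadcasting involved and is not the full `analyze_dtm` function.

```python
import numpy as np

# Hypothetical toy document-term matrix: 2 documents, 3 unique words.
dtm = np.array([[2., 1., 0.],
                [1., 0., 3.]])

# Document frequency: number of documents with a non-zero count per word, shape (3,).
df = np.sum(dtm > 0, axis=0)

# Term frequency: keepdims=True gives shape (2, 1), so the division
# broadcasts each row total across that row.
tf = dtm / dtm.sum(axis=1, keepdims=True)

# tf-idf: the (3,) denominator broadcasts across both rows.
tfidf = tf / (1 + np.log(df))

print(df)
print(tf)
print(tfidf)
```

The third word appears in only one document, so $\log(df_j) = 0$ and its $tfidf$ equals its $tf$.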

In [384]:
def analyze_dtm(dtm, words, docs):

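    # document frequency: number of documents containing each word, shape (n,)
    df=np.sum(dtm>0, axis=0) 
    
    # term frequency: divide each row by its total word count, shape (m, n)
    word_counts_per_doc = np.sum(dtm, axis=1, keepdims=True)
    tf=dtm/word_counts_per_doc
    
    # tf-idf: the (1, n) denominator broadcasts across all rows
    tfidf=tf/(1+np.log(df)[np.newaxis, :])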
    
    #1.
    total_words=np.sum(dtm)
    print(f"Total number of words in the documents: {total_words}\n")
    
    #2.
    num_docs=dtm.shape[0]
    num_unique_words=dtm.shape[1]
    print(f"Number of documents: {num_docs}, Number of unique words: {num_unique_words}\n")

    
    #3.
    word_frequencies=np.sum(dtm, axis=0)
    top_10_index=np.argsort(word_frequencies)[-10:][::-1]  
    top_10_words=words[top_10_index]
    top_10_values= word_frequencies[top_10_index]
    print("Most frequent top 10 words:", top_10_words)
    print("values:", top_10_values, "\n")

    
    #4.
    top_5_df_index=np.argsort(df)[-5:][::-1]
    top_5_words_df=words[top_5_df_index]
    top_5_df_values=df[top_5_df_index]
    print("Top-5 words in most documents:", top_5_words_df)
    print("values", top_5_df_values, "\n")
    
    #5.
    longest_doc_index=np.argmax(word_counts_per_doc)
    longest_doc=docs[longest_doc_index]
    print("The longest document in terms of the number of words:", longest_doc, "\n")
    
    #6.
    top_5_tfidf_indices_longest_doc=np.argsort(tfidf[longest_doc_index])[-5:][::-1]
    top_5_words_tfidf_longest_doc=words[top_5_tfidf_indices_longest_doc]
    top_5_tfidf_values_longest_doc=tfidf[longest_doc_index, top_5_tfidf_indices_longest_doc]
    print("Top-5 words with the largest TF-IDF values in the longest document:", top_5_words_tfidf_longest_doc, top_5_tfidf_values_longest_doc, "\n")
    
    #7.
    intelligence_index=np.where(words == 'intelligence')[0]
    # rows with a positive count for 'intelligence' (empty if the word is absent)
    docs_intelligence_index=np.where(dtm[:,intelligence_index]>0)[0]
    docs_containing_intelligence=docs[docs_intelligence_index]
    print("Document IDs:", docs_intelligence_index)
    print("Documents containing 'intelligence':", docs_containing_intelligence)


# test
# convert to arrays before the call so that index arrays can be used inside the function
words = np.array(words)
docs = np.array(docs)

analyze_dtm(dtm, words, docs)

Total number of words in the documents: 705.0

Number of documents: 26, Number of unique words: 314

Most frequent top 10 words: ['the' 'to' 'and' 'a' 'that' 'he' 'it' 'in' 'of' 'is']
values: [31. 25. 19. 17. 13. 12. 12. 12. 11. 10.] 

Top-5 words in most documents: ['the' 'and' 'to' 'a' 'in']
values [20 15 14 12 11] 

The longest document in terms of the number of words: """I think everybody is cheating ... I mean, it's happening. So what I'm asking students to do is just be honest with me,"" he said. ""Tell me what they use ChatGPT for, tell me what they used as prompts to get it to do what they want, and that's all I'm asking from them. We're in a world where this is happening, but now it's just going to be at an even grander scale."""
 

Top-5 words with the largest TF-IDF values in the longest document: ['what' 'me' "i'm" "it's" 'happening'] [0.05479452 0.04109589 0.02739726 0.02739726 0.02739726] 

Document IDs: [ 4 14]
Documents containing 'intelligence': ['"Some school district

In [26]:
words = np.array(words)
docs = np.array(docs)

analyze_dtm(dtm, words, docs)

The total number of words:
705.0

the number of documents: 26, the number of unique words in the documents: 314

The top 10 frequent words:
['the' 'to' 'and' 'a' 'that' 'it' 'he' 'in' 'of' 'is'],
 values: [31. 25. 19. 17. 13. 12. 12. 12. 11. 10.]

The top 5 words with highest df values:
['the' 'and' 'to' 'a' 'in'],
values: [20 15 14 12 11]

The longest document:
 24: """I think everybody is cheating ... I mean, it's happening. So what I'm asking students to do is just be honest with me,"" he said. ""Tell me what they use ChatGPT for, tell me what they used as prompts to get it to do what they want, and that's all I'm asking from them. We're in a world where this is happening, but now it's just going to be at an even grander scale."""


The top 5 words with highest tf-idf values in the longest document:
['what' 'me' 'happening' "i'm" 'tell'],
values: [0.05479452 0.04109589 0.02739726 0.02739726 0.02739726]

documents that contain word 'intelligence': 
Document IDs:[ 4 14],
Text: ['"Some

## Q5 (Bonus). Generating DTM by subword tokenization (2 points)

Assume you only need to keep the top N most frequent words (e.g., N = 200) in the collection of documents. Redo Q3-Q4 as follows:

- Use the subword tokenization you developed in Q2 to tokenize documents
- Generate a dtm with only the top-N most frequent words in the entire collection.
- Then analyze the dtm as in Q4.


Describe and implement your ideas. Again, no loop should be used in your solution to Q4. **Don't just submit code. You need to explain your idea in markdown cells. No score will be given if only code is submitted.**
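
One possible way to approach the "top-N most frequent words" step without loops is sketched below. The helper name `keep_top_n` and the toy data are assumptions for illustration; the sketch covers only the column selection, not the subword tokenization.

```python
import numpy as np

def keep_top_n(dtm, words, n):
    """Keep only the n columns of `dtm` with the largest total counts."""
    totals = dtm.sum(axis=0)              # total count of each word
    top = np.argsort(totals)[::-1][:n]    # column indices, most frequent first
    return dtm[:, top], words[top]

# Hypothetical toy data: 2 documents, 4 words.
dtm = np.array([[2., 1., 0., 1.],
                [2., 0., 3., 1.]])
words = np.array(["a", "b", "c", "d"])

reduced_dtm, kept_words = keep_top_n(dtm, words, 2)
print(kept_words)
print(reduced_dtm)
```

The reduced array could then be passed to `analyze_dtm` in place of the full `dtm`, with `kept_words` as the matching column labels.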

# Put everything together and test using the main block

In [16]:
# best practice to test your class
# if your script is exported as a module,
# the following part is ignored
# this is equivalent to main() in Java

if __name__ == "__main__":  
    
    
    print("\n=======Q1: =========\n")
    doc = "Hello , it's a helloooo world!"
    vocab = tokenize(doc)
    pp.pprint(vocab)
    
    print("\n=======Q2: =========\n")
    doc = "Hello world, it's a helloooo world!"

    vocab_out, subwords = subword_tokenize(doc, num_merges = 9)

    print("vocab:")
    pp.pprint(vocab_out)

    print("subwords:")
    pp.pprint(subwords)
    
    print("\n=======Q3: =========\n")
    
    docs = ["Hello , it's a helloooo world!",
       "Again, it is hello world!"]

    dtm, words = get_dtm(docs)
    pp.pprint(dtm)
    pp.pprint(words)
    
    print("\n=======Q4: =========\n")
    
    docs = open("chatgpt.txt", 'r').readlines()

    dtm, words = get_dtm(docs)

    words = np.array(words)
    docs = np.array(docs)

    analyze_dtm(dtm, words, docs)