# <center>HW 1: Document Term Matrix</center>

<div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>

**Instructions**: 
- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order shown in the problem description
- Follow the Submission Instruction to submit your assignment

**Problem Description**

In this assignment, you'll write a class and functions to analyze an article to find out the word distributions and key concepts. 

The packages you'll need for this assignment include numpy and pandas. Some useful functions: 
- string: `split`,`strip`, `count`,`index`
- numpy: `argsort`,`argmax`, `sum`, `where`
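
As a quick illustration of the NumPy helpers listed above, here is a minimal sketch on made-up data. The names `counts` and `labels` are hypothetical and are not part of the assignment.

```python
import numpy as np

# hypothetical word counts and labels, for illustration only
counts = np.array([3, 0, 7, 1])
labels = np.array(["ai", "nlp", "the", "model"])

order = np.argsort(counts)            # indices that sort counts in ascending order
top2 = labels[order[-2:][::-1]]       # labels of the two largest counts, biggest first
largest = labels[np.argmax(counts)]   # label of the single largest count
nonzero = np.where(counts > 0)[0]     # positions of the non-zero entries
total = np.sum(counts)                # sum of all counts

print(top2, largest, nonzero, total)
```

The same pattern (`argsort` followed by slicing the last `k` indices) is one common way to get "top k" items without writing a loop.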

## Q1. Define a function to analyze word counts in an input sentence 


Define a function named `tokenize(text)` which does the following:
* accepts a sentence (i.e., `text` parameter) as an input
* splits the sentence into a list of tokens by **space** (including tab, and new line). 
    - e.g., `it's a hello world!!!` will be split into tokens `["it's", "a","hello","world!!!"]`  
* removes the **leading/trailing punctuation or spaces** of each token, if any
    - e.g., `world!!! -> world`, while `it's` does not change
    - hint: import the *string* module, use `string.punctuation` to get the punctuation characters (say `puncts`), and then call `strip(puncts)` on each token to remove leading or trailing punctuation
* only keeps tokens with 2 or more characters, i.e. `len(token)>1` 
* converts all tokens into lower case 
* finds the count of each unique token and saves the counts as a dictionary, e.g., `{"hello": 1, "world": 1, ...}`
* returns the dictionary 
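
The steps above can be sketched as follows. This is only one possible reading of the requirements: the name `tokenize_sketch` is hypothetical, and using `collections.Counter` for the counting step is an assumption rather than something the assignment prescribes.

```python
import string
from collections import Counter

def tokenize_sketch(text):
    """Hypothetical helper illustrating the Q1 steps; not the graded solution."""
    # characters to strip from both ends of every token
    strip_chars = string.punctuation + string.whitespace
    # split on whitespace (spaces, tabs, new lines), strip the ends, lower-case
    tokens = [tok.strip(strip_chars).lower() for tok in text.split()]
    # keep only tokens with 2 or more characters
    tokens = [tok for tok in tokens if len(tok) > 1]
    # count each unique token
    return dict(Counter(tokens))

print(tokenize_sketch("it's a hello world!!!"))
# {"it's": 1, 'hello': 1, 'world': 1}
```

Because `strip` only touches the ends of a token, the apostrophe inside `it's` is kept while the trailing `!!!` is removed.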
    

In [1]:
import string
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
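def tokenize(text):
    
    # punctuation to remove; the apostrophe is left out so "it's" stays intact
    puncts = '!"#$%&\\()*+,-./:;<=>?@[]^_`{|}~'
    
    text = text.lower()
    
    # delete every punctuation character in one pass
    # (note: this removes punctuation anywhere in a token, not only at the ends)
    text = text.translate(str.maketrans("", "", puncts))
    
    # split on whitespace and keep tokens with 2 or more characters;
    # building a new list avoids removing items from a list while iterating over it
    tokens = [tok for tok in text.split() if len(tok) > 1]
    
    # count each unique token, in order of first appearance
    vocab = {}
    for tok in tokens:
        vocab[tok] = vocab.get(tok, 0) + 1
    
    return vocab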


In [3]:
# test your code
text = """it's a hello world!!!
           it is hello world again."""
tokenize(text)

{"it's": 1, 'hello': 2, 'world': 2, 'it': 1, 'is': 1, 'again': 1}

## Q2. Generate a document term matrix (DTM) as a numpy array


Define a function `get_dtm(sents)` as follows:
- accepts a list of sentences, i.e., `sents`, as an input
- uses the `tokenize` function you defined in Q1 to get the count dictionary for each sentence
- pools the words from all the sentences together to get a list of unique words, denoted as `unique_words`
- creates a numpy array, say `dtm`, with shape (# of sentences x # of unique words), and sets the initial values to 0
- fills cell `dtm[i,j]` with the count of the `j`th word in the `i`th sentence
- returns `dtm` and `unique_words`
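
To make the target shape concrete, here is a small worked example on two made-up count dictionaries. The variable names (`counts`, `col`, `toy_dtm`) are hypothetical, and the dictionaries stand in for whatever a Q1-style tokenizer would return.

```python
import numpy as np

# hypothetical per-sentence count dictionaries (stand-ins for tokenize output)
counts = [{"hello": 1, "world": 1}, {"hello": 2, "again": 1}]

# unique words in order of first appearance
unique_words = list(dict.fromkeys(w for c in counts for w in c))

# column index of each unique word
col = {w: j for j, w in enumerate(unique_words)}

# one row per sentence, one column per unique word
toy_dtm = np.zeros((len(counts), len(unique_words)))
for i, c in enumerate(counts):
    for w, n in c.items():
        toy_dtm[i, col[w]] = n

print(unique_words)   # ['hello', 'world', 'again']
print(toy_dtm)        # rows: [1, 1, 0] and [2, 0, 1]
```

Row `i` of the matrix can be read back as the count dictionary of sentence `i` by pairing its non-zero entries with `unique_words`.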

In [4]:
def get_dtm(sents):
    
    # count dictionary for each sentence (works for a list or a pandas Series)
    counts = [tokenize(s) for s in sents]
    
    # pool the words of all sentences, keeping the order of first appearance
    all_words = list(dict.fromkeys(w for vocab in counts for w in vocab))
    
    # map each unique word to its column index
    col = {w: j for j, w in enumerate(all_words)}
    
    dtm = np.zeros((len(counts), len(all_words)))
    
    # dtm[i, j] = count of the j-th word in the i-th sentence
    for i, vocab in enumerate(counts):
        for w, c in vocab.items():
            dtm[i, col[w]] = c
    
    return dtm, all_words

In [5]:
# A test document. This document can be found at https://hbr.org/2022/04/the-power-of-natural-language-processing

sents = pd.read_csv("sents.csv")
sents.head()

| Unnamed: 0 | text |
|---|---|
| 0 | The Power of Natural Language Processing |
| 1 | Until recently, the conventional wisdom was th... |
| 2 | But in the past two years language-based AI ha... |
| 3 | The most visible advances have been in what’s ... |
| 4 | It has been used to write an article for The G... |


In [6]:
dtm, all_words = get_dtm(sents.text)

# Check if the array is correct
dtm.shape

# randomly check one sentence
idx = 3

# get the dictionary using the function in Q1
vocab = tokenize(sents["text"].loc[idx])
print(sorted(vocab.items(), key = lambda item: item[0]))

# get all non-zero entries in dtm[idx] and create a dictionary
# these two dictionaries should be the same
sents.loc[idx]
vocab1 ={all_words[j]: dtm[idx][j] for j in np.where(dtm[idx]>0)[0]}
print(sorted(vocab1.items(), key = lambda item: item[0]))


(81, 678)

[('advances', 1), ('ai', 1), ('been', 1), ('branch', 1), ('called', 1), ('can', 1), ('computers', 1), ('do', 1), ('focused', 1), ('have', 1), ('how', 1), ('humans', 1), ('in', 1), ('language', 2), ('like', 1), ('most', 1), ('nlp', 1), ('of', 1), ('on', 1), ('process', 1), ('processing”', 1), ('the', 2), ('visible', 1), ('what’s', 1), ('“natural', 1)]


text    The most visible advances have been in what’s ...
Name: 3, dtype: object

[('advances', 1.0), ('ai', 1.0), ('been', 1.0), ('branch', 1.0), ('called', 1.0), ('can', 1.0), ('computers', 1.0), ('do', 1.0), ('focused', 1.0), ('have', 1.0), ('how', 1.0), ('humans', 1.0), ('in', 1.0), ('language', 2.0), ('like', 1.0), ('most', 1.0), ('nlp', 1.0), ('of', 1.0), ('on', 1.0), ('process', 1.0), ('processing”', 1.0), ('the', 2.0), ('visible', 1.0), ('what’s', 1.0), ('“natural', 1.0)]


## Q3 Analyze DTM Array 


**Don't use any loop in this task**. Use array operations to take advantage of high-performance computing.

Define a function named `analyze_dtm(dtm, words, sents)` which:
* takes $dtm$, $words$, and $sents$ as inputs, where $dtm$ is the array you get in Q2 with shape $(m \times n)$, $words$ is an array of the words corresponding to the columns of $dtm$, and $sents$ is the list of sentences you used in Q2.
* calculates the sentence frequency for each word $j$, i.e., how many sentences contain word $j$. Save the result to an array $df$ ($df$ has shape $(n,)$ or $(1, n)$).
* normalizes the word count per sentence: divides each word count $dtm_{i,j}$ by the total number of words in sentence $i$. Save the result as an array named $tf$ ($tf$ has shape $(m,n)$).
* for each $dtm_{i,j}$, calculates $tf\_idf_{i,j} = \frac{tf_{i, j}}{df_j}$, i.e., divides each normalized word count by the sentence frequency of the word. The reason is that a word that appears in most sentences has little discriminative power and is often called a `stop` word; dividing by $df$ downweights such words. $tf\_idf$ has shape $(m,n)$.
* prints out the following:
    
    - the total number of words in the document represented by $dtm$
    - the top 10 most frequent words in this document    
    - words with the top 10 largest $df$ values (show words and their $df$ values)
    - the longest sentence (i.e., the one with the most words)
    - top-10 words with the largest $tf\_idf$ values in the longest sentence (show words and values) 
* returns the $tf\_idf$ array.



Note, for all the steps, **do not use any loop**. Just use array functions and broadcasting for high performance computation.
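
As a worked example of the three quantities defined above, the following sketch computes $df$, $tf$, and $tf\_idf$ on a tiny made-up array using broadcasting only. The array `toy` is hypothetical, and the sketch assumes that no row sums to zero.

```python
import numpy as np

# hypothetical 2-sentence, 3-word count array
toy = np.array([[1., 1., 0.],
                [2., 0., 1.]])

# sentence frequency: number of rows with a non-zero count, shape (n,)
df = np.count_nonzero(toy, axis=0)

# normalize each row by its total; keepdims gives shape (m, 1) so it broadcasts
tf = toy / toy.sum(axis=1, keepdims=True)

# divide every column j by df[j]; (m, n) / (n,) broadcasts along rows
tf_idf = tf / df

print(df)       # [2 1 1]
print(tf_idf)   # rows: [0.25, 0.5, 0.] and [0.333..., 0., 0.333...]
```

Using `keepdims=True` (or `reshape(-1, 1)`) avoids hard-coding the number of rows or columns, so the same lines work for any $(m \times n)$ array.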

In [7]:
def analyze_dtm(dtm, words, sents):
    
    # section 1: total number of words in the document
    s1 = dtm.sum()
    print('\nThe total number of words:\n',s1)
    
    # section 2: top 10 most frequent words (largest column sums)
    col_sum = dtm.sum(axis=0)
    idx2 = np.argsort(col_sum)[-10:]
    s2 = list(zip(words[idx2],col_sum[idx2]))[::-1]
    print('\nThe top 10 frequent words: \n',s2)
    
    # section 3: df = number of sentences that contain each word
    df = np.count_nonzero(dtm,axis=0)
    idx3 = np.argsort(df)[-10:]
    s3 = list(zip(words[idx3],df[idx3]))[::-1]
    print('\nThe top 10 words with highest df values: \n',s3)
    
    # section 4: longest sentence = row with the largest total word count
    row_sum = dtm.sum(axis=1)
    i = np.argmax(row_sum)
    s4 = sents[i]
    print('\nThe longest sentence : \n',s4)
    
    # section 5: tf-idf via broadcasting; reshape(-1, 1) and reshape(1, -1)
    # adapt to any (m, n) shape instead of hard-coding 81 and 678
    tf = dtm/row_sum.reshape(-1,1)
    tfidf = tf/df.reshape(1,-1)
    tfidf_sentence_vals = tfidf[i,:]
    idx5 = np.argsort(tfidf_sentence_vals)[-10:]
    s5 = list(zip(words[idx5], tfidf_sentence_vals[idx5]))[::-1]
    print('\nThe top 10 words with highest tf-idf values in the longest sentence: \n',s5)
    
    return tfidf

In [11]:
# convert the list to array so you can leverage array operations
words = np.array(all_words)

analyze_dtm(dtm, words, sents.text)


The total number of words:
 1856.0

The top 10 frequent words: 
 [('the', 69.0), ('to', 65.0), ('and', 52.0), ('of', 50.0), ('for', 37.0), ('ai', 26.0), ('in', 24.0), ('is', 23.0), ('are', 22.0), ('tasks', 20.0)]

The top 10 words with highest df values: 
 [('the', 47), ('to', 42), ('and', 41), ('of', 38), ('for', 32), ('ai', 23), ('in', 22), ('like', 20), ('is', 20), ('tasks', 19)]

The longest sentence : 
 Language models are already reshaping traditional text analytics, but GPT-3 was an especially pivotal language model because, at 10x larger than any previous model upon release, it was the first large language model, which enabled it to perform even more advanced tasks like programming and solving high school–level math problems.

The top 10 words with highest tf-idf values in the longest sentence: 
 [('pivotal', 0.02), ('reshaping', 0.02), ('school–level', 0.02), ('math', 0.02), ('problems', 0.02), ('perform', 0.02), ('enabled', 0.02), ('release', 0.02), ('upon', 0.02), ('larger',

array([[0.0035461 , 0.16666667, 0.00438596, ..., 0.        , 0.        ,
        0.        ],
       [0.00073368, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00092507, 0.        , 0.00114416, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.00092507, 0.        , 0.00114416, ..., 0.        , 0.        ,
        0.        ],
       [0.00073368, 0.        , 0.        , ..., 0.03448276, 0.        ,
        0.        ],
       [0.00085106, 0.        , 0.        , ..., 0.        , 0.04      ,
        0.04      ]])

## Q4. Find keywords of the document (Bonus) 

Can you leverage the $dtm$ array you generated to find a few keywords that can be used to tag this document, e.g., AI, language models, tools, etc.?


Describe your ideas in a PDF file and implement them.
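
One possible starting point, offered only as a sketch, is to score each word by its $tf\_idf$ (as defined in Q3) summed over all sentences and keep the highest-scoring words. The helper `top_keywords` and the toy data are hypothetical, and the sketch assumes no sentence is empty and every word occurs at least once. Whether this heuristic surfaces good tags depends on the data; common function words may still need to be filtered out.

```python
import numpy as np

def top_keywords(dtm, words, k=5):
    """Hypothetical helper: rank words by tf_idf summed over all sentences."""
    tf = dtm / dtm.sum(axis=1, keepdims=True)   # assumes no empty sentence
    df = np.count_nonzero(dtm, axis=0)          # assumes every word occurs somewhere
    scores = (tf / df).sum(axis=0)              # one score per word
    top = np.argsort(scores)[-k:][::-1]         # indices of the k largest scores
    return words[top].tolist()

# toy data for illustration only
toy_dtm = np.array([[1., 1., 0.],
                    [2., 0., 1.]])
toy_words = np.array(["hello", "world", "again"])

print(top_keywords(toy_dtm, toy_words, k=2))   # ['hello', 'world']
```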

## Put everything together and test using main block

In [10]:
# best practice to test your code
# if this script is imported as a module,
# the following part is skipped
# this is similar to main() in Java

if __name__ == "__main__":  
    
    # Test Question 1
    text = """it's a hello world!!!
           it is hello world again."""
    print("Test Question 1")
    print(tokenize(text))
    
    
    # Test Question 2
    print("\nTest Question 2")
    sents = pd.read_csv("sents.csv")
    
    dtm, all_words = get_dtm(sents.text)
    print(dtm.shape)
    
    
    # Test Question 3
    print("\nTest Question 3")
    words = np.array(all_words)

    tfidf= analyze_dtm(dtm, words, sents.text)
    
    

Test Question 1
{"it's": 1, 'hello': 2, 'world': 2, 'it': 1, 'is': 1, 'again': 1}

Test Question 2
(81, 678)

Test Question 3

The total number of words:
 1856.0

The top 10 frequent words: 
 [('the', 69.0), ('to', 65.0), ('and', 52.0), ('of', 50.0), ('for', 37.0), ('ai', 26.0), ('in', 24.0), ('is', 23.0), ('are', 22.0), ('tasks', 20.0)]

The top 10 words with highest df values: 
 [('the', 47), ('to', 42), ('and', 41), ('of', 38), ('for', 32), ('ai', 23), ('in', 22), ('like', 20), ('is', 20), ('tasks', 19)]

The longest sentence : 
 Language models are already reshaping traditional text analytics, but GPT-3 was an especially pivotal language model because, at 10x larger than any previous model upon release, it was the first large language model, which enabled it to perform even more advanced tasks like programming and solving high school–level math problems.

The top 10 words with highest tf-idf values in the longest sentence: 
 [('pivotal', 0.02), ('reshaping', 0.02), ('school–level', 