# <center>Assignment 1</center>

## Q1. Define a function to analyze the frequency of words in a string ## (5 points)
 - Define a function named "**count_token**" which (0.5 point)
     * has a string as an input (0.5 point)
     * splits the string into a list of tokens by space. For example, "it's hello world!" will be split into three tokens ["it's", "hello", "world!"] (1 point)
     * removes all spaces around each token (including tabs, newline characters ("\n")) (0.5 point)
     * removes empty tokens, i.e. *len*(token)==0 (0.5 point)
     * converts all tokens into lower case (0.5 point)
     * creates a dictionary containing the count of every unique token, e.g. {"it's": 5, 'hello': 1, ...} (1 point)
     * returns the dictionary as the output (0.5 point)
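
The steps above map onto a few lines of Python. The following is a minimal sketch rather than the reference solution: it assumes `collections.Counter` is acceptable for the counting step, and the name `count_token_sketch` is hypothetical, chosen so it does not clash with the required `count_token`.

```python
from collections import Counter

def count_token_sketch(text):
    # Split on the space character only, as the question specifies.
    raw_tokens = text.split(" ")
    # Strip surrounding whitespace (spaces, tabs, newlines) and lowercase.
    cleaned = [t.strip().lower() for t in raw_tokens]
    # Drop tokens that are empty after stripping.
    cleaned = [t for t in cleaned if len(t) > 0]
    # Counter is a dict subclass; convert to a plain dict for the return value.
    return dict(Counter(cleaned))

print(count_token_sketch("it's hello  world!\n Hello"))
# {"it's": 1, 'hello': 2, 'world!': 1}
```

Note that the double space in the input produces an empty token, which is why the empty-token filter is needed even after stripping.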

## Q2. Define a class to analyze a document ## (5 points)
 - Define a new class called "**Text_Analyzer**" which does the following : (0.5 point)
    - has two attributes: 
        * **input_string**, which receives the string value passed by users when creating an object of this class. (0.5 point)
        * **token_count**, which is set to {} when an object of this class is created. (0.5 point)
        
    - a function named "**analyze**" that does the following: 
      * calls the function "count_token" to get a token-count dictionary. (1 point)
      * saves this dictionary to the token_count attribute (0.5 point)
      
    - another function named "**save_to_file**", which 
      * has a string parameter which specifies the full path of the file to be created (0.5 point)
      * saves the token_count dictionary into this file, with each key-value pair written as one comma-delimited line (see "foo.csv" in Exercise 10.3 for examples). (1.5 point)
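
One way to check the file format is to write a small dictionary and read it back. The sketch below is illustrative only: `save_counts` and `load_counts` are hypothetical helper names, not part of the required class, and the file goes to a temporary directory rather than a fixed path.

```python
import csv
import os
import tempfile

def save_counts(token_count, output_filepath):
    # newline='' lets the csv module control line endings itself.
    with open(output_filepath, "w", newline="") as f:
        writer = csv.writer(f, delimiter=",")
        # Each (token, count) pair becomes one "token,count" line.
        writer.writerows(token_count.items())

def load_counts(input_filepath):
    # Read the file back; counts come back as strings, so convert to int.
    with open(input_filepath, newline="") as f:
        return {token: int(count) for token, count in csv.reader(f)}

counts = {"hello": 2, "world": 1, "it's": 1}
path = os.path.join(tempfile.mkdtemp(), "token_count.csv")
save_counts(counts, path)
print(load_counts(path) == counts)  # True
```

A round trip like this confirms that the file has one line per token and exactly two columns.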
      

## Q3. (Bonus) Segment documents by punctuation ## (4 points)
 - Create a new function called "**corpus_analyze**" which does the following :
     * takes **a list of strings** as an input
     * for each string, do the following:
         * splits the string into a list of tokens by **any space** or **any punctuation** (i.e. any character from the list <font color="blue">!"#$%&'()\*+,-./:;<=>?@[\\]^_`{|}~</font> ), e.g. "it's hello world!" should be split into a list ["it", "s", "hello", "world"] (2 points)
         * removes leading and trailing spaces of each token 
         * removes any empty token or token with only 1 character
         * converts all tokens into lower case 
     * creates a token count dictionary named **token_freq**, which gives the **total count** of each unique token in all the input strings, e.g. {'the': 100, 'of': 50, ...} (1 point)
     * creates another dictionary called **token_to_doc**, where each key is a unique token, and the corresponding value is the list of (0-based) indexes of the input strings that contain the token. For example, {'the': [2, 5], 'of': [3, 4], ...} means that the 3rd and 6th strings contain the token "the", and the 4th and 5th strings contain the token "of". (1 point)
     * returns (token_freq, token_to_doc) as the output
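
The splitting rule in Q3 can also be expressed as a regular expression. This is a sketch under assumptions, not the required approach: it uses `string.punctuation` (which is the same character list given above) and the standard `re` module, and `SPLIT_PATTERN` and `tokenize_sketch` are hypothetical names.

```python
import re
import string

# Character class matching any whitespace or any ASCII punctuation character.
# re.escape makes characters such as ']' and '\' safe inside the class.
SPLIT_PATTERN = re.compile("[\\s" + re.escape(string.punctuation) + "]+")

def tokenize_sketch(text):
    tokens = SPLIT_PATTERN.split(text.lower())
    # Drop empty tokens and single-character tokens, as Q3 requires.
    return [t for t in tokens if len(t) > 1]

print(SPLIT_PATTERN.split("it's hello world!"))
# ['it', 's', 'hello', 'world', '']
print(tokenize_sketch("it's hello world!"))
# ['it', 'hello', 'world']
```

The trailing empty string in the first result comes from the final "!", and the one-character token "s" is removed by the length filter.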

In [8]:
# Structure of your solution to Assignment 1 

import csv

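def count_token(text):
    """Return a dict mapping each lower-cased token in text to its count."""

    # split by space, strip surrounding whitespace, and drop empty tokens
    tokens=text.split(" ")
    tokens=[token.lower().strip() for token in tokens if len(token.strip())>0]
    # count the occurrences of each unique token
    token_count={token:tokens.count(token) for token in set(tokens)}

    return token_count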

class Text_Analyzer(object):
    
    def __init__(self, doc):
        
        self.input_string=doc
        self.token_count ={}
          
    def analyze(self):
        self.token_count = count_token(self.input_string)
        
    def save_to_file(self, output_filepath):

        # newline='' prevents the csv module from writing blank rows on Windows
        with open(output_filepath, 'w', newline='') as f:
            writer=csv.writer(f, delimiter=",")
            items=self.token_count.items()
            writer.writerows(items)

def corpus_analyze(docs):
    
    token_freq, token_to_doc = {}, {}
    
    # if PunktWordTokenizer from NLTK is used, it's also OK
    
    for doc_id,doc in enumerate(docs):
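        # replace every punctuation character with a space
        char_list = list(doc)
        for idx, w in enumerate(char_list):
            if w in "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~":
                char_list[idx]=' '
                
        # split() with no argument splits on any whitespace and drops empty tokens
        tokens=''.join(char_list).lower().split()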
        
        for token in tokens:
            if len(token)>1:
                if token in token_freq:
                    token_freq[token]+=1
                else:
                    token_freq[token]=1

                if token not in token_to_doc:
                    token_to_doc[token]=[doc_id]
                elif token_to_doc[token][-1]!=doc_id:
                    # record each document only once per token
                    token_to_doc[token].append(doc_id)
                
    return token_freq, token_to_doc

# best practice for testing your code:
# if this script is imported as a module,
# the following part is skipped;
# this is similar to main() in Java

if __name__ == "__main__":  
    
    # Test Question 1
    text='''Hello world!
        It's is a hello world example !'''   
    print("Test Q1:\n",count_token(text))
    
    # The output of your test should be:
    # {'world': 1, '!': 1, 'world!': 1, 'a': 1, "it's": 1, 
    # 'example': 1, 'hello': 2, 'is': 1}
    
    # Test Question 2
    analyzer=Text_Analyzer(text)
    analyzer.analyze()
    analyzer.save_to_file("/Users/rliu/temp/test.csv")
    # You should be able to find the csv file with 8 lines, 2 columns
    
    # Test Question 3
    docs=['Hello world!', "It's is a hello world example !"]
    word_freq, token_to_doc=corpus_analyze(docs)
    
    print("Test Q3:\n", word_freq)
    # output should be {'hello': 2, 'world': 2, 'it': 1, 'is': 1, 'example': 1}

    print(token_to_doc)
    # output should be {'hello': [0, 1], 'world': [0, 1], 'it': [1], 'is': [1], 'example': [1]}

{'world': 1, '!': 1, 'world!': 1, 'a': 1, "it's": 1, 'example': 1, 'hello': 2, 'is': 1}
{'hello': 2, 'world': 2, 'it': 1, 'is': 1, 'example': 1}
{'hello': [0, 1], 'world': [0, 1], 'it': [1], 'is': [1], 'example': [1]}
