# <center>Assignment 2</center>

## Q1. Define a function to analyze a numpy array (5 points)
 - Assume we have an array that contains the term frequencies of a set of documents, where each row is a document, each column is a word, and each value is the frequency of that word in that document. Define a function named "analyze_tf" which:
      * has two input parameters: (1 point)
        * a rank 2 input array
        * a parameter "binary" with a default value set to False
      * does the following steps in sequence:
        1. if "binary" is True, binarizes the input array, i.e. if a value is greater than 1, change it to 1. (1 point)
        2. normalizes the frequency of each word as: word frequency divided by the length of the document (i.e. sum of each row). Save the result as an array named **tf** (i.e. term frequency). The sum of each row of tf should be 1. (1 point)
        3. calculates the document frequency (**df**) of each word, i.e. how many documents contain a specific word (0.5 point)
        4. calculates the inverse document frequency (**idf**) of each word as **N/df** (N divided by df), where N is the number of documents (0.5 point)
        5. calculates the **tf_idf** array as **tf * log(idf)** (tf multiplied by the natural log of idf). The reason is that a word appearing in most documents has little discriminative power and is often called a "stop" word; the inverse of df downweights such words. (1 point)
      * returns the tf_idf array.
 - Note: for all the steps, **do not use any loops**. Use only array functions and broadcasting for high-performance computation.
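
As an illustration of the loop-free style asked for above, the following minimal sketch applies the normalization, document-frequency, and tf-idf steps to a small made-up matrix. The array `counts` and its values are hypothetical, and the sketch assumes that no row and no column is entirely zero (otherwise the divisions would produce `nan` or `inf`).

```python
import numpy as np

# Hypothetical 3-document, 4-word count matrix (rows: documents, columns: words).
counts = np.array([[2, 0, 1, 1],
                   [0, 3, 0, 1],
                   [1, 1, 0, 2]])

# Row sums with keepdims=True have shape (3, 1), so the division broadcasts per row.
tf = counts / counts.sum(axis=1, keepdims=True)

# Document frequency: number of rows in which each column is non-zero, shape (4,).
df = (counts > 0).sum(axis=0)

# Inverse document frequency N / df, then tf * ln(idf); idf broadcasts across rows.
n_docs = counts.shape[0]
tf_idf = tf * np.log(n_docs / df)

print(tf.sum(axis=1))  # every row of tf sums to 1
print(df)
print(tf_idf)
```

The last word occurs in every document, so its idf is 1 and its tf-idf weight is 0 in every row, which is the "stop word" effect described in step 5.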

## Q2. Define a function to analyze car dataset using pandas (5 points)
 - Define a function named "analyze_cars" to do the following:
   * Take a csv file path string as an input. Assume the csv file is in the format of the provided sample file. (0.5 point)
   * Read the csv file as a dataframe with the first row as column names (0.5 point)
   * Find cars with top 3 mpg among those of origin = 1. Print the names (i.e. "car" column) and mpg of these three cars. (1 point)
   * Create a new column called "brand" to store the brand name as the first word in "car" column (hint: use "apply" function) (1 point)
   * Show the mean, min, and max mpg values for each of these brands: "ford", "buick" and "honda" (1 point)
   * Create a cross tab to show the average mpg of each brand and each origin value. Use "brand" as row index and "origin" as column index. (1 point)
 - This function does not have any return. Just print out the result of each calculation step.
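
The pandas operations named in these steps can be tried out on a tiny in-memory frame before working with the real file. This is only a sketch: the rows in `cars` are invented for illustration, and the column names "car", "mpg", and "origin" are assumed to match the sample csv.

```python
import pandas as pd

# Hypothetical rows that mimic the columns the assignment refers to.
cars = pd.DataFrame({
    "car": ["ford pinto", "ford torino", "honda civic", "buick skylark 320"],
    "mpg": [26.0, 17.0, 33.0, 18.0],
    "origin": [1, 1, 3, 1],
})

# Brand is the first whitespace-separated word of the "car" column.
cars["brand"] = cars["car"].apply(lambda name: name.split()[0])

# Highest-mpg cars among origin == 1 (top 2 here, because the sample is tiny).
top = cars[cars["origin"] == 1].nlargest(2, "mpg")[["car", "mpg"]]

# Per-brand summary using string aggregation names.
summary = cars.groupby("brand")["mpg"].agg(["mean", "min", "max"])

# Average mpg per brand (rows) and origin (columns); missing combinations are NaN.
table = pd.crosstab(index=cars["brand"], columns=cars["origin"],
                    values=cars["mpg"], aggfunc="mean")

print(top)
print(summary)
print(table)
```

Passing aggregation names as strings gives result columns named "mean", "min", and "max", whereas passing NumPy functions can yield different column labels depending on the pandas version.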

## Q3 (Bonus). More sophisticated analyze_tf function (3 points)
 - Assume we have an array that contains the term frequencies of a set of documents, where each row is a document, each column is a word, and each value is the frequency of that word in that document. Define a function named "advanced_analyze_tf" which:
      * has three input parameters: 
        * a rank 2 input array
        * a parameter "min_df" (minimum document frequency) with a default value set to 0 
        * a parameter "max_words" (maximum number of words) with a default value set to None
      * processes the input array as follows, in sequence:
        1. if "min_df" > 0, removes words with document frequency (df) less than "min_df", i.e. the corresponding columns are removed (1 point)
        2. if "max_words" > 0 and "max_words" < the total number of columns (M), keeps only the "max_words" words with the highest document frequency (df), so M - "max_words" columns are removed from the array (1 point)
      * calls the analyze_tf function from Q1 on the resulting array to get a tf_idf array (0.5 point)
      * returns tf_idf and the original indexes of remaining words. (0.5 point)
 - Note: for all the steps, **do not use any loops**. Use only array functions and broadcasting for high-performance computation.
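
The column selection in steps 1 and 2 can be sketched on its own, without the tf-idf part. The names `kept`, `order`, and `top` are hypothetical; `min_df=2` and `max_words=2` are arbitrary example settings. Ties in document frequency are broken by the sort order, so which of several equally frequent words survives is an implementation detail of this sketch, not something the assignment specifies.

```python
import numpy as np

counts = np.array([[0, 1, 0, 2, 0, 1],
                   [1, 0, 1, 1, 2, 0],
                   [0, 0, 2, 0, 0, 1],
                   [0, 0, 1, 1, 1, 1]])
min_df, max_words = 2, 2

# Document frequency of every word (column).
df = (counts > 0).sum(axis=0)

# Step 1: original indexes of columns whose df is at least min_df.
kept = np.flatnonzero(df >= min_df)

# Step 2: positions (within kept) of the max_words largest df values.
order = np.argsort(df[kept], kind="stable")[-max_words:]

# Map back to original column indexes and restore ascending column order.
top = np.sort(kept[order])

reduced = counts[:, top]
print(kept, top, reduced.shape)
```

Indexing `kept` with `order` is what recovers the original word indexes after the first filter has already shifted the column positions.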

## Submission Guidelines
- Follow the solution template provided below. Use the `__main__` block to test your functions.
- Save your code into a Python file (e.g. assign2.py) that can be run in a Python 3 environment. In Jupyter Notebook, you can export a notebook as a .py file from the menu "File->Download as".
- Make sure you have all import statements. To test your code, open a command window in your current Python working folder and type "python assign2.py" to see if it runs successfully.

In [3]:
# Structure of your solution to Assignment 2

import numpy as np
import csv
import pandas as pd

def car_analysis(filepath):
    
    # read data
    df=pd.read_csv(filepath,  header=0)
    #print(df)
    # sort
    
    print(df[df["origin"]==1].sort_values(by="mpg", ascending=False).iloc[0:3][["car", "mpg"]])
    
    # get brand column
    df['brand']=df.apply(lambda x: x["car"].split(" ")[0], axis=1)
    
    # get mean, min, max
    print(df[df["brand"].isin(["ford","buick", "honda"])].groupby("brand")\
['mpg'].agg([np.mean, np.min, np.max]))

    # get cross tab
    print(pd.crosstab(columns=df.origin,index=df.brand, values=df.mpg, aggfunc=np.mean ))
    # add your code


def analyze_tf(arr, binary=False):
    # suppose arr has shape (m,n)
    
    # binarize if binary=True
    if binary:
        arr=np.where(arr>0,1,0)
    
    # normalize, tf shape: (m,n)
    # np.sum(arr, axis=1) has shape (m,)
    # use [:,None] to make it (m,1) for broadcasting
    tf=arr/(np.sum(arr, axis=1)[:,None])

    # get df, shape (n,)
    df=np.sum(np.where(arr>0, 1, 0), axis=0)

    # get idf, shape (n,)
    idf=arr.shape[0]/df
    
    # get tf_idf
    tf_idf=tf*np.log(idf[None,:])  
    
    return tf_idf

def advanced_analyze_tf(arr, binary=False, min_df=0, max_words=None):
    # suppose arr has shape (m,n)
    
    # by default, all words are returned
    selected_words=np.arange(arr.shape[1])
    
    df=np.sum(np.where(arr>0,1,0), axis=0)
    
    # process min_df
    if min_df>0:
        # get indexes of words with df>=min_df. 
        # selected_words is a one-dimensional array
        selected_words=np.where(df>=min_df)[0]
        
        # select columns from df
        df=df[selected_words]
        
        
    # process max_words
    if max_words is not None:
        if max_words>0 and max_words<arr.shape[1]:
            
            # sort df to get indexes of top max_words 
            # note that df may have been modified by min_df condition
            # indexes returned are not the original word index
           
            max_words_selection = np.argsort(df)[-max_words:]
            
            # get original word indexes
            selected_words=selected_words[max_words_selection]
            
            
    arr=arr[:, selected_words]
    
    tf_idf=analyze_tf(arr, binary)
    
    return tf_idf, selected_words

# best practice for testing your functions
# if your script is imported as a module,
# the following part is ignored
# this is similar to main() in Java

if __name__ == "__main__":  
    
    # Test Question 1
    arr=np.array([[0,1,0,2,0,1],[1,0,1,1,2,0],[0,0,2,0,0,1], [0,0,1,1,1,1]])
    
    print(arr)
    
    tf_idf=analyze_tf(arr)
    print("\nQ1, binary=False",tf_idf)
    
    tf_idf=analyze_tf(arr, binary=True)
    print("\nQ1, binary=True",tf_idf)
    
    # test question 2 
    print("\nQ2")
    car_analysis('../../dataset/cars.csv')
    
    # test question 3
    tf_idf, selected_words=advanced_analyze_tf(arr)
    print("\nQ3")
    print(tf_idf)
    print(selected_words)
    
    tf_idf, selected_words=advanced_analyze_tf(arr, min_df=2)
    print("\nQ3, min_df=2", tf_idf)
    print(selected_words)
    
    tf_idf, selected_words=advanced_analyze_tf(arr, max_words=3)
    print("\nQ3, max_words=3")
    print(tf_idf)
    print(selected_words)
    
   # tf_idf, selected_words=advanced_analyze_tf(arr, min_df=1, max_words=3)
   # print("\nQ3, min_df=1, max_words=3")
   # print(tf_idf)
   # print(selected_words)

[[0 1 0 2 0 1]
 [1 0 1 1 2 0]
 [0 0 2 0 0 1]
 [0 0 1 1 1 1]]

Q1, binary=False [[0.         0.34657359 0.         0.14384104 0.         0.07192052]
 [0.27725887 0.         0.05753641 0.05753641 0.27725887 0.        ]
 [0.         0.         0.19178805 0.         0.         0.09589402]
 [0.         0.         0.07192052 0.07192052 0.1732868  0.07192052]]

Q1, binary=True [[0.         0.46209812 0.         0.09589402 0.         0.09589402]
 [0.34657359 0.         0.07192052 0.07192052 0.1732868  0.        ]
 [0.         0.         0.14384104 0.         0.         0.14384104]
 [0.         0.         0.07192052 0.07192052 0.1732868  0.07192052]]

Q2
                     car   mpg
194   chevrolet chevette  29.0
82       dodge colt (sw)  28.0
29   chevrolet vega 2300  28.0
           mean  amin  amax
brand                      
buick  14.75000  12.0  21.0
ford   17.12069  10.0  26.0
honda  28.50000  24.0  33.0
origin              1          2          3
brand                                 