# <center>Assignment 2</center>

## Q1. Define a function to analyze a numpy array
 - Assume we have an array (with shape (M,N)) which contains term frequency of each document, where each row is a document, each column is a word, and the corresponding value denotes the frequency of the word in the document. Define a function named "analyze_tf_idf" which:
      * takes the **array**, and an integer **K** as the parameters.
      * normalizes the frequency of each word as: word frequency divided by the length of the document. Save the result as an array named **tf** (i.e. term frequency)
      * calculates the document frequency (**df**) of each word, i.e. how many documents contain a specific word
      * calculates **tf_idf** array as: **tf / (log(df)+1)** (tf divided by log(df) plus 1). The reason is that a word which appears in most documents has little discriminative power and is often called a "stop" word. Dividing by a function of df downgrades the weight of such words.
      * for each document, finds out the **indexes of words with top K largest values in the tf_idf array**, ($0<K<=N$). These indexes form an array, say **top_K**, with shape (M, K)
      * returns the tf_idf array, and the top_K array.
 - Note: for all the steps, **do not use any loops**. Use only array functions and broadcasting for high-performance computation.
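
The steps above can be illustrated on a toy array. This is a minimal sketch, not a reference solution: the array `counts` and the value of `K` are made up for illustration, and the natural logarithm is assumed for `log`.

```python
import numpy as np

# Hypothetical term-count matrix: 2 documents (rows) x 3 words (columns).
counts = np.array([[2, 0, 2],
                   [1, 3, 0]])

# tf: divide each row by its document length.
# keepdims=True keeps the sums as shape (2, 1) so they broadcast across columns.
tf = counts / counts.sum(axis=1, keepdims=True)

# df: number of documents in which each word appears at least once.
df = (counts > 0).sum(axis=0)

# tf_idf as defined in the task: tf / (log(df) + 1), natural log assumed.
tf_idf = tf / (np.log(df) + 1)

# Indexes of the K largest values in each row:
# sort the negated values ascending, then keep the first K columns.
K = 2
top_K = np.argsort(-tf_idf, axis=1)[:, :K]

print(tf_idf)
print(top_K)
```

Every step operates on whole arrays, so no explicit loop is needed. Note that a word with df equal to 0 would make `np.log(df)` return `-inf`; the toy array avoids that case.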

## Q2. Define a function to analyze stackoverflow dataset using pandas
 - Define a function named "analyze_data" to do the following:
   * Take a csv file path string as an input. Assume the csv file is in the format of the provided sample file (question.csv).
   * Read the csv file as a dataframe with the first row as column names
   * Find the questions with the top 3 viewcounts among the answered questions (i.e., answercount>0). Print the title and viewcount columns of these questions.
   * Find the top 5 users (i.e. quest_name) who asked the most questions.
   * Create a new column called "first_tag" to store the very first tag in the "tags" column (hint: use the "apply" function; tags are separated by ", ")
   * Show the mean, min, and max viewcount values for each of these tags: "python", "pandas" and "dataframe"
   * Create a cross tab with answercount as row indexes, first_tag as column names, and the count of samples as the value. For "python" questions (i.e., first_tag="python"), how many questions were not answered (i.e., answercount=0), how many questions were answered once (i.e., answercount=1), and how many questions were answered twice (i.e., answercount=2)? Print these numbers.
 - This function does not have any return. Just print out the result of each calculation step.
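
The pandas operations involved can be sketched on a small in-memory table. This is only an illustration under stated assumptions: the rows below are invented, and only the column names (title, viewcount, answercount, quest_name, tags) are taken from the task description, so the real question.csv may differ.

```python
import pandas as pd

# Hypothetical stand-in for question.csv; the values are made up.
questions = pd.DataFrame({
    "title": ["q1", "q2", "q3", "q4", "q5"],
    "viewcount": [10, 50, 30, 5, 20],
    "answercount": [0, 1, 2, 1, 0],
    "quest_name": ["ann", "bob", "ann", "cat", "ann"],
    "tags": ["python, pandas", "python, numpy", "pandas, dataframe",
             "python", "python, csv"],
})

# Top 3 viewcounts among answered questions.
answered = questions[questions["answercount"] > 0]
top_viewed = answered.nlargest(3, "viewcount")[["title", "viewcount"]]

# Users who asked the most questions.
top_users = questions["quest_name"].value_counts().head(5)

# First tag of each question; tags are separated by ", ".
questions["first_tag"] = questions["tags"].apply(lambda s: s.split(", ")[0])

# Mean, min and max viewcount for the selected first tags.
selected = questions[questions["first_tag"].isin(["python", "pandas", "dataframe"])]
stats = selected.groupby("first_tag")["viewcount"].agg(["mean", "min", "max"])

# Cross tab, then the "python" column for answercount 0, 1 and 2.
# reindex fills in 0 for any answercount that never occurs.
ct = pd.crosstab(questions["answercount"], questions["first_tag"])
python_counts = ct["python"].reindex([0, 1, 2], fill_value=0)

print(top_viewed, top_users, stats, python_counts, sep="\n\n")
```

Selecting the "python" column of the cross tab and reindexing on 0, 1 and 2 gives the three requested numbers directly, without reading them off the printed table.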

## Q3 (Bonus). Analyze a collection of documents
 - Define a function named "analyze_corpus" to do the following:
   * Similar to Q2, take a csv file path string as an input. Assume the csv file is in the format of the provided sample file (question.csv).
   * Read the "title" column from the csv file and convert it to lower case
   * Split each string in the "title" column by space to get tokens. Create an array where each row represents a title, each column denotes a unique token, and each value denotes the count of the token in the document
   * Call your function in Q1 (i.e. analyze_tf_idf) to analyze this array
   * Print out the top 5 words by tf-idf score for the first 20 questions. Do you think these top words allow you to find similar questions or differentiate a question from dissimilar ones? Write your analysis as a pdf file.
   
- This function does not have any return. Just print out the result if asked.
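
One way to build the title-by-token count array from space-separated tokens is sketched below. It is an illustrative approach, not the required one: the three titles are invented, and `np.unique` with `np.add.at` is just one of several ways to count tokens.

```python
import numpy as np

# Hypothetical titles standing in for the "title" column.
titles = ["How to sort a list",
          "sort a dataframe by column",
          "a list of a list"]

# Lower-case each title and split it by space.
tokens_per_title = [t.lower().split(" ") for t in titles]

# Flatten all tokens and record which title each token came from.
lengths = [len(tokens) for tokens in tokens_per_title]
doc_ids = np.repeat(np.arange(len(titles)), lengths)
flat_tokens = np.concatenate(tokens_per_title)

# vocab holds the sorted unique tokens; col_ids maps each token to its column.
vocab, col_ids = np.unique(flat_tokens, return_inverse=True)

# Accumulate counts; np.add.at handles repeated (row, column) pairs correctly.
counts = np.zeros((len(titles), len(vocab)), dtype=int)
np.add.at(counts, (doc_ids, col_ids), 1)

print(vocab)
print(counts)
```

The resulting `counts` array has the (M, N) layout expected by analyze_tf_idf, and `vocab` maps the returned column indexes back to words.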
   

## Submission Guideline
- Follow the solution template provided below. Use the `__main__` block to test your functions.
- Save your code into a python file (e.g. assign2.py) that can be run in a python 3 environment. In Jupyter Notebook, you can export notebook as .py file in menu "File->Download as".
- Make sure you have all import statements. To test your code, open a command window in your current python working folder, type "python assign2.py" to see if it can run successfully.
- **Each homework assignment should be completed independently. Never ever copy others' work**

In [1]:
# Structure of your solution to Assignment 2
import pandas as pd
import numpy as np
from termcolor import colored
from sklearn.feature_extraction.text import CountVectorizer


def analyze_data(filepath):
    
    # add your code here
    data = pd.read_csv(filepath)
    print(colored("Top 3 viewcounts where answercount is greater than 0" , "blue", attrs=['bold']))

    #question with top 3 viewcount where answercount>0
    print(data[data.answercount>0].nlargest(3,'viewcount')[['title', 'viewcount']])
    print('\n')

    print(colored("Top 5 users who has asked most freqent ques", 'blue', attrs=['bold']))

    # top 5 users who have asked the most questions
    print(data.groupby('quest_name').count().nlargest(5, 'id')['id'])
    print("\n")

    # adding column first_tag with the first tag from the tags column
    data['first_tag'] = data['tags'].apply(lambda x:x.split(',')[0])

    print(colored("Min, Max and Mean of viewcount of each first_tag 'python', 'pandas' & 'dataframe'", 'blue', attrs=['bold']))

    #getting min max and mean of viewcount of each first_tag
    print(data.loc[data.first_tag.isin(['python', 'pandas', 'dataframe'])].groupby('first_tag').viewcount.agg([np.min, np.max, np.mean]))
    print('\n')

    print(colored("Crosstab with answercount as row indexes and first_tag as column names", 'blue', attrs=['bold']))
    
    #crosstab with answercount as row indexes and first_tag as column name
    print(pd.crosstab(data.answercount,data.first_tag))

    
def analyze_tf_idf(arr,K):
    
    tf_idf=None
    top_k=None
    
    # sum of each row, i.e. the length of each document
    row_sum = np.sum(arr, axis=1)
    # normalize each row by its document length (broadcast over columns)
    tf = arr / row_sum[:, np.newaxis]
    # count how many documents contain each word
    tf_1 = np.where(arr>0,1,0)
    df = np.sum(tf_1, axis=0)
    # downweight words that appear in many documents
    tf_idf = tf/(np.log(df)+1)
    # sort word indexes by descending tf_idf within each document
    top_k = (-tf_idf).argsort()
    return tf_idf, top_k[:,0:K]

def analyze_corpus(filepath):
    
    # add your code here
    data = pd.read_csv(filepath)
    
    #converting title column into lowercase
    titles = data['title'].str.lower()
    
    vectorizer = CountVectorizer()
    
    #using CountVectorizer to get the count of each unique token and its index
    X = vectorizer.fit_transform(titles)
    
    #getting list of all unique tokens
    #(get_feature_names_out replaces get_feature_names in scikit-learn >= 1.0)
    unique_tokens = vectorizer.get_feature_names_out().tolist()
    
    #converting into array
    y = X.toarray()
    
    print(colored("Calling analyze_tf_idf function to analyze array", 'blue', attrs=['bold']))
    print(analyze_tf_idf(y,5))
    print('\n')
    
    tf_idf, top_k=analyze_tf_idf(y[0:20,],5)
    print(colored("top 5 words for the first 20 question", 'blue', attrs=['bold']))
    for i in top_k:
        list1=[]
        for j in i:
            list1.append(unique_tokens[j])
        print(list1)
    
    

# best practice to test your functions
# if your script is imported as a module,
# the following part is ignored
# this is equivalent to main() in Java
if __name__ == "__main__":  

    # Test Question 1
    arr=np.array([[0,1,0,2,0,1],[1,0,1,1,2,0],[0,0,2,0,0,1]])

    print("\nQ1")
    print(colored("Indixes of top K values from array", 'blue', attrs=['bold']))
    tf_idf, top_k=analyze_tf_idf(arr,3)
    print(top_k)

    print("\nQ2")
    analyze_data('question.csv')

    # Test Question 3
    print("\nQ3")
    analyze_corpus('question.csv')


Q1
Indixes of top K values from array
[[3 1 5]
 [4 0 2]
 [2 5 0]]

Q2
Top 3 viewcounts where answercount is greater than 0
                                                 title  viewcount
75   Python: Pandas pd.read_excel giving ImportErro...      33297
163                     Python convert object to float      16658
886                  Subtract two columns in dataframe      11176


Top 5 users who has asked most freqent ques
quest_name
Rahul rajan     7
Shuvayan Das    7
Danny W         6
el323           6
Hana            5
Name: id, dtype: int64


Min, Max and Mean of viewcount of each first_tag 'python', 'pandas' & 'dataframe'
           amin   amax        mean
first_tag                         
pandas       14   4499  454.687500
python        5  33297  428.670091


Crosstab with answercount as row indexes and first_tag as column names
first_tag    arrays  c++  django  excel  function  json  machine-learning  \
ans

