<div style="text-align: center; display:block">
    <div style="display: inline-block">
        <h1  style="text-align: center">Text EDA Module</h1>
        <div style="width:80%; text-align: center"><i>Author:</i> <strong>Soham Mullick</strong> </div>
    </div>
</div>

The purpose of this module is to perform EDA on the processed text output by the previous text cleaner module, in order to gain insight into the classification problem. To better understand how the features differ between the two classes (in our case RMA and No RMA), the same EDA steps are performed on the examples of each class.

<b>Input</b> - The clean file from the text cleaner module

<b>Output</b> - TextStat calculation results (mainly counts, log frequency, TF-IDF, etc.) for both the RMA and No RMA cases

### Importing important modules

In [None]:
#Core modules
from collections import Counter, defaultdict
import operator
import math as m
import pandas as pd
import configparser
import logging
import time

#Gensim modules for text processing
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models import word2vec
from gensim.models.word2vec import Word2Vec

### Get File

In [None]:
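def getFile(fileName):
    # Read the cleaned CSV; change the file name in the config to use a different dataset
    try:
        raw_data=pd.read_csv(fileName,encoding='latin-1')
    except FileNotFoundError:
        # Retrying with the same name would recurse forever, so report the problem and re-raise
        print('\n File name not correct. Please check Input_file in the config and try again')
        raise
    return raw_data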

### Text EDA Steps

The following functions perform the calculations.
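
As a reading aid, the statistics can be written out as formulas. This is a summary of what the code in this section computes as written, not a textbook definition. The symbols are introduced here for illustration: $c_w$ is the count of word $w$ in the class being analysed, $V$ is the set of words kept after the count filter, $N$ is the number of documents in that class, and $d_w$ is the number of those documents containing $w$.

$$
\begin{aligned}
\text{TF}(w) &= \frac{c_w}{|V|} \\
\text{LogFreq}(w) &= 1 + \ln c_w \\
\text{AugFreq}(w) &= 0.5 + 0.5\,\frac{c_w}{\max_{v \in V} c_v} \\
\text{IDF}(w) &= \ln\!\left(1 + \frac{N}{1 + d_w}\right) \\
\text{TF-IDF}(w) &= \text{TF}(w)\cdot\text{IDF}(w) \\
\text{CIF}(w) &= \frac{\text{AugFreq}(w)}{\text{IDF}(w)}
\end{aligned}
$$

Note that the term frequency here is normalised by the number of retained words $|V|$ rather than by the total number of tokens.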

In [None]:
# Count the number of occurrences of each word
def count_words(list_of_doc):
    count=Counter()
    for document in list_of_doc:
        document=str(document)
        count.update(document.split())
    return count

# Count the number of documents containing a word
def doc_count(word,docs):
    count=0
    for doc in docs:
        if word in str(doc).split():
            count+=1
    return count

# Log of Word Count
def wordCountlog(Wordcount):
    wordCountDictLog = {}
    for key, val in Wordcount.items():
        wordCountDictLog[key] = 1 + m.log(float(val))
    return wordCountDictLog.values()

# Augmented frequency (Word Count Normalised by max count)
def Augfreq(Wordcount):
    augmentedFrequency = {}
    maxfreq = max(list(Wordcount.values()))
    for key, val in Wordcount.items():
        augmentedFrequency[key] = 0.5 + 0.5 * (val/maxfreq)
    return augmentedFrequency.values()

# Count Inverse Document Frequency
def invDocFreq(Wordcount,docs):
    N = len(docs)
    wordCountDictLog = {}
    for key, val in Wordcount.items():
        num_docs_having_the_word = 1+doc_count(key,docs)
        wordCountDictLog[key] =m.log(1 + N/num_docs_having_the_word)
    return wordCountDictLog.values()


#To Get all the relevant Text statistics in a table
def TextStat(wc_dict,docs,fileName,saveFile=False):
    log_freq=wordCountlog(wc_dict)
    aug_freq=Augfreq(wc_dict)
    idf=invDocFreq(wc_dict,docs)
    TextDataframe = pd.DataFrame(index=wc_dict.keys())
    TextDataframe['Counts']= (wc_dict.values())
    # Note: this normalises by the number of retained words (rows), not by the total token count
    TextDataframe['Term Frequency']=TextDataframe['Counts']/len(TextDataframe)
    TextDataframe['Log Frequency']= (log_freq)
    TextDataframe['Augmented Frequency']= (aug_freq)
    TextDataframe['Inverse Document Term Frequency']= idf
    TextDataframe['TF-IDF'] = TextDataframe['Term Frequency'] * TextDataframe['Inverse Document Term Frequency']
    
    ### CIF (Classification Importance Factor) is a derived metric used to compare word importance across classes.
    ### The assumption is that a word occurring with similar importance in both classes does not help much
    ### in deciding between the classes.
    
    TextDataframe['CIF'] = TextDataframe['Augmented Frequency']*(1/TextDataframe['Inverse Document Term Frequency'])
    if saveFile:
        TextDataframe.to_csv(fileName,index=True)
    return TextDataframe

# To create dictionary and filter based on count(for differential treatment of RMA and NO RMA cases)
def count_filter(textList):
    dic_count=count_words(textList)
    sorted_dic=sorted(dic_count.items(), key=operator.itemgetter(1),reverse=True)
    min_count=EDA_min_count  # global threshold, set in the Control-Box cell of each class
    output_dict={k: v for k, v in dict(sorted_dic).items() if ((v > min_count))}
    return output_dict
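
To make the calculations above concrete, here is a minimal self-contained sketch that reproduces the same statistics on a tiny corpus. The three documents and the names `toy_docs` and `toy_stats` are made up for illustration. The sketch re-implements the formulas compactly instead of calling the functions above, so it can be run on its own.

```python
import math
from collections import Counter

import pandas as pd

# Hypothetical toy corpus: each string stands in for one cleaned document.
toy_docs = ["fan noise fan", "fan failure", "disk failure"]

# Word counts over all documents (what count_words computes).
counts = Counter(word for doc in toy_docs for word in doc.split())
words = list(counts)

# Number of documents containing each word (what doc_count computes).
doc_freq = {w: sum(w in doc.split() for doc in toy_docs) for w in words}

n_docs = len(toy_docs)
max_count = max(counts.values())

toy_stats = pd.DataFrame(
    {
        "Counts": [counts[w] for w in words],
        # Same normalisation as TextStat: divide by the number of distinct words.
        "Term Frequency": [counts[w] / len(words) for w in words],
        "Log Frequency": [1 + math.log(counts[w]) for w in words],
        "Augmented Frequency": [0.5 + 0.5 * counts[w] / max_count for w in words],
        "Inverse Document Term Frequency": [
            math.log(1 + n_docs / (1 + doc_freq[w])) for w in words
        ],
    },
    index=words,
)
toy_stats["TF-IDF"] = (
    toy_stats["Term Frequency"] * toy_stats["Inverse Document Term Frequency"]
)
toy_stats["CIF"] = (
    toy_stats["Augmented Frequency"] / toy_stats["Inverse Document Term Frequency"]
)

print(toy_stats.round(3))
```

In this toy corpus "fan" occurs three times in two of the three documents, so its count is 3, its augmented frequency is 1.0, and its inverse document frequency is ln(1 + 3/3) = ln 2. No count filter is applied here; in the notebook, `count_filter` would first drop words at or below `EDA_min_count`.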


### Read Config and create logger

In [None]:
# Loading config file
config = configparser.ConfigParser()
config.read('./config.ini')

# Read config file
input_file=str(config['Text_EDA']['Input_file'])
datacol_ug=str(config['Text_EDA']['Datacol_UG'])
datacol_bg=str(config['Text_EDA']['Datacol_BG'])
rma_ug_output=str(config['Text_EDA']['RMA_UG_TextStat'])
rma_bg_output=str(config['Text_EDA']['RMA_BG_TextStat'])
norma_ug_output=str(config['Text_EDA']['NoRMA_UG_TextStat'])
norma_bg_output=str(config['Text_EDA']['NoRMA_BG_TextStat'])

common_list = config['Text_processing']['added_stop_list']
common_list = list(common_list.replace(' ', '').split(','))

# Create logger file
logging.basicConfig(filename="Text_EDA_{}.log".format(time.strftime('%b-%d-%Y_%H%M',time.localtime())),level=logging.DEBUG)
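
For reference, the lookups above imply a `config.ini` with a `Text_EDA` section and a `Text_processing` section. The sketch below shows one possible layout, parsed from a string so it runs without a file on disk. The section and key names are taken from the code above; every value (file names, column names, stop words) is a placeholder assumption, not the project's real configuration.

```python
import configparser

# Hypothetical config contents: keys follow the lookups above, values are placeholders.
SAMPLE_CONFIG = """
[Text_EDA]
Input_file = ./clean_data.csv
Datacol_UG = clean_text_unigram
Datacol_BG = clean_text_bigram
RMA_UG_TextStat = ./rma_ug_textstat.csv
RMA_BG_TextStat = ./rma_bg_textstat.csv
NoRMA_UG_TextStat = ./norma_ug_textstat.csv
NoRMA_BG_TextStat = ./norma_bg_textstat.csv

[Text_processing]
added_stop_list = please, thanks, regards
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)

# Option names are case-insensitive in configparser; section names are not.
sample_input_file = sample["Text_EDA"]["Input_file"]

# Same parsing as common_list above: strip spaces, then split on commas.
sample_stop_list = sample["Text_processing"]["added_stop_list"].replace(" ", "").split(",")

print(sample_input_file)
print(sample_stop_list)
```

Because the stop list is parsed by removing all spaces before splitting, multi-word entries would be collapsed into a single token.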

### Get Data

In [None]:
clean_data=getFile(input_file)
clean_data.head()

### Basic Info about the Dataset

In [None]:
logging.debug('Total No. of cases in clean data '+str(len(clean_data)))
logging.debug('Total No. of RMA cases in clean data '+str(len(clean_data[clean_data['rma_flag']==0])))
logging.debug('Total No. of NO RMA cases in clean data '+str(len(clean_data[clean_data['rma_flag']==1])))


### Split Data on Class

In [None]:
# Divide the dataset based on RMA and No RMA cases
rma_data=clean_data[clean_data['rma_flag']==0]
norma_data=clean_data[clean_data['rma_flag']==1]
logging.debug('In the RMA dataset no. of cases are '+str(len(rma_data)))
logging.debug('In the No RMA dataset no. of cases are '+str(len(norma_data)))

### Apply EDA steps

The whole analysis is divided into two aspects:

- Analysing how words are distributed between the classes (RMA and No RMA)
- Analysing how unigrams and bigrams behave differently

#### Operations on RMA Data

##### Control-Box

In [None]:
EDA_min_count=20           #To filter based on count - RMA Cases

##### Uni-gram Analysis

In [None]:
rma_dict=count_filter(rma_data[datacol_ug])
RMADataframeUG=TextStat(rma_dict,rma_data[datacol_ug],rma_ug_output,saveFile=True)

In [None]:
logging.debug("The total number of words in RMA UG dictionary: "+str(len(rma_dict)))

##### Bi-gram Analysis

In [None]:
rma_bigram_dict=count_filter(rma_data[datacol_bg])
RMADataframeBG=TextStat(rma_bigram_dict,rma_data[datacol_bg],rma_bg_output,saveFile=True)

In [None]:
logging.debug("The total number of words in RMA BG dictionary: "+str(len(rma_bigram_dict)))

#### Operations on No RMA Data

##### Control-Box

In [None]:
EDA_min_count=30   #To filter based on count - No RMA Cases

##### Uni-gram Analysis

In [None]:
norma_dict=count_filter(norma_data[datacol_ug])
NoRMADataframeUG=TextStat(norma_dict,norma_data[datacol_ug],norma_ug_output,saveFile=True)

In [None]:
logging.debug("The total number of words in No RMA UG dictionary: "+str(len(norma_dict)))

##### Bi-gram Analysis

In [None]:
norma_bigram_dict=count_filter(norma_data[datacol_bg])
NoRMADataframeBG=TextStat(norma_bigram_dict,norma_data[datacol_bg],norma_bg_output,saveFile=True)

In [None]:
logging.debug("The total number of words in No RMA BG dictionary: "+str(len(norma_bigram_dict)))
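
One possible follow-up, in line with the CIF assumption that words of similar importance in both classes carry little signal, is to put the two TextStat tables side by side and rank words by how far their CIF values differ. The sketch below uses two small made-up tables in place of the real outputs. The names `toy_rma`, `toy_norma`, `cif_compare` and `CIF_gap`, and the choice to treat a word missing from one class as CIF 0, are assumptions for illustration and are not part of the module.

```python
import pandas as pd

# Hypothetical stand-ins for the RMA and No RMA TextStat tables (CIF column only).
toy_rma = pd.DataFrame({"CIF": [1.2, 0.8, 0.5]}, index=["fan", "failure", "disk"])
toy_norma = pd.DataFrame({"CIF": [1.1, 0.3]}, index=["fan", "reboot"])

# Outer join keeps words seen in only one class; a missing CIF is set to 0 (an assumption).
cif_compare = toy_rma[["CIF"]].join(
    toy_norma[["CIF"]], how="outer", lsuffix="_rma", rsuffix="_norma"
).fillna(0.0)

# A large gap suggests the word is weighted differently in the two classes.
cif_compare["CIF_gap"] = (cif_compare["CIF_rma"] - cif_compare["CIF_norma"]).abs()
cif_compare = cif_compare.sort_values("CIF_gap", ascending=False)

print(cif_compare)
```

With the real outputs, `toy_rma` and `toy_norma` would be replaced by the unigram or bigram DataFrames produced above. Note that the two classes use different `EDA_min_count` thresholds, so a word absent from one table may simply have been filtered out.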