<h2>PREPROCESSING TEXT DATA WITH TF-IDF</h2>

<li>The TF-IDF representation of a document (context) d in corpus D is the vector:
$$r_{d} = [\text{tf-idf}(w_{1},d,D),\ \text{tf-idf}(w_{2},d,D),\ \dots,\ \text{tf-idf}(w_{|V|},d,D)]$$
where $r_{d} \in \mathbb{R}^{|V|}$ and $V = \{w_{i}\}$ is the vocabulary of all words in D.
</li>
<br>
<li>Each component $\text{tf-idf}(w_{i},d,D)$ is the product of a term frequency and an inverse document frequency:
$$\text{tf-idf}(w_{i},d,D) = \text{tf}(w_{i},d,D) \cdot \text{idf}(w_{i},d,D)$$
where
$$\text{tf}(w_{i},d,D) = \frac{f(w_{i},d)}{\max\{f(w_{j},d): w_{j} \in V\}}, \qquad
\text{idf}(w_{i},d,D) = \log_{10} \frac{|D|}{|\{d' \in D: w_{i} \in d'\}|}$$
and $f(w_{i},d)$ is the number of occurrences of $w_{i}$ in document d.
</li>
<br>
<li>Identify V. For each document d in D:
    <ul>
        <li>Split d into words at punctuation and whitespace, and collect them in $W_{d}$.</li>
        <li>Remove stop words from $W_{d}$.</li>
        <li>Stem the words in $W_{d}$.</li>
    </ul>
    Then $V = \bigcup_{d \in D} W_{d}$. The code below additionally keeps only non-numeric words that occur in more than 10 documents.
</li>
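The formulas above can be checked by hand on a small example. The following is a minimal sketch, not the notebook's implementation: the function name `tf_idf_vectors` and the toy corpus are hypothetical, and each document is assumed to be non-empty and already tokenized.

```python
import math
from collections import Counter

def tf_idf_vectors(corpus):
    """Return (vocabulary, vectors) for a list of tokenized, non-empty documents."""
    vocabulary = sorted({word for doc in corpus for word in doc})
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        max_count = max(counts.values())
        # tf = f(w, d) / max_j f(w_j, d);  idf = log10(|D| / df(w))
        vectors.append([
            counts[word] / max_count * math.log10(len(corpus) / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors

# Hypothetical toy corpus, already tokenized.
corpus = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "sat", "dog", "dog"]]
vocabulary, vectors = tf_idf_vectors(corpus)
for doc, vector in zip(corpus, vectors):
    print(doc, [round(value, 4) for value in vector])
```

In this corpus "sat" occurs in every document, so its idf is $\log_{10}(3/3) = 0$ and it contributes nothing to any vector, which is the intended effect of the idf factor.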

<h3>GATHER DATA</h3>

In [3]:
#import libraries
import os
from nltk.stem.porter import PorterStemmer
import re
import numpy as np
from collections import defaultdict

In [4]:
def gather_20newsgroup_data():
    path="C:/Users/Admin/DSLab Training sessions/Session 1/data/20news-bydate/"
    #get the dirpath of train set and test set
    dirs = [path + dirname + '/' 
            for dirname in os.listdir(path) if not os.path.isfile(path+dirname)]
    train_dir, test_dir = (dirs[0],dirs[1]) if 'train' in dirs[0]\
        else (dirs[1],dirs[0])
    #get the sorted list of newsgroups in the train set (the test set uses the same names)
    list_newsgroup = sorted(os.listdir(train_dir))
    #load stop words into a set for fast membership tests
    with open(path + "stop_words.txt") as f:
        stopwords = set(f.read().splitlines())
    stemmer = PorterStemmer()
    
    def collect_data_from(parent_dir,newsgroup_list):
        data=list()
        for group_id,newsgroup in enumerate(newsgroup_list):
            label = group_id
            #parent_dir already ends with '/'
            dir_path = parent_dir + newsgroup + '/'
            #list of filename and path to file
            files = [(filename,dir_path + filename)
                     for filename in os.listdir(dir_path) if os.path.isfile(dir_path + filename)]
            files.sort()
            for filename,filepath in files:
                #read the file and lowercase its content
                with open(filepath) as f:
                    text = f.read().lower()
                    #split on non-word characters, drop empty tokens and stop words, then stem
                    words = [stemmer.stem(word)
                             for word in re.split(r'\W+',text)
                             if word and word not in stopwords]
                    #join the words into a single-line string
                    content = ' '.join(words)
                    assert len(content.splitlines())==1
                    #one record per document: label<fff>filename<fff>content
                    data.append(str(label) + '<fff>' + filename + '<fff>' + content)
        return data
    #collect train data, test data, full data
    train_data = collect_data_from(parent_dir = train_dir, newsgroup_list=list_newsgroup)
    test_data = collect_data_from(parent_dir = test_dir, newsgroup_list=list_newsgroup)
    full_data = train_data+test_data
    #write the processed data to files, one document per line
    with open(path + "20news-train-processed.txt",'w') as f:
        f.write('\n'.join(train_data))
    with open(path + "20news-test-processed.txt",'w') as f:
        f.write('\n'.join(test_data))
    with open(path + "20news-full-processed.txt",'w') as f:
        f.write('\n'.join(full_data))

In [23]:
#read data and gather data
gather_20newsgroup_data()
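Each line of the processed files has the form `label<fff>filename<fff>content`. The sketch below illustrates that record format on a single string. It is only an illustration under stated assumptions: `make_record` and `parse_record` are hypothetical helpers, the stop-word set is made up, and stemming is left out so that the example needs no third-party package.

```python
import re

SEPARATOR = '<fff>'  # field separator used in the processed files

def make_record(label, filename, text, stop_words):
    """Build one processed line: label<fff>filename<fff>content (no stemming here)."""
    words = [word for word in re.split(r'\W+', text.lower())
             if word and word not in stop_words]
    return SEPARATOR.join([str(label), filename, ' '.join(words)])

def parse_record(line):
    """Split a processed line back into (label, filename, list of words)."""
    label, filename, content = line.split(SEPARATOR)
    return int(label), filename, content.split()

# Hypothetical label, filename, text and stop words.
record = make_record(3, '51060', 'The cat, and the dog!', {'the', 'and'})
print(record)
print(parse_record(record))
```

Because `\W+` also matches newlines, the content is always a single line, which is what the `assert` in `collect_data_from` relies on.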

<h3>PROCESSING DATA</h3>

In [5]:
def generate_vocalbulary(data_path):
    #compute idf from a word's document frequency
    def compute_idf(df, corpus_size):
        assert df>0
        return np.log10(corpus_size / df)
    #read data
    with open(data_path) as f:
        lines = f.read().splitlines()
    #map each word to the number of documents that contain it
    doc_count = defaultdict(int)
    corpus_size = len(lines)
    for line in lines:
        feature = line.split('<fff>')
        text=feature[-1]
        #use a set so that each word is counted once per document
        for word in set(text.split()):
            doc_count[word]+=1
    #keep non-numeric words that appear in more than 10 documents, with their idf
    words_idfs = [(word,compute_idf(document_freq,corpus_size))
        for word,document_freq in doc_count.items() if document_freq > 10 and not word.isdigit()]
    #sort by decreasing idf
    words_idfs.sort(key = lambda x:-x[1])
    #write data to file
    print("Vocabulary size is:{}".format(len(words_idfs)))
    with open("C:/Users/Admin/DSLab Training sessions/Session 1/data/20news-bydate/words_idfs.txt",'w') as f:
        f.write('\n'.join([word+'<fff>'+str(idf) for word,idf in words_idfs]))

In [6]:
generate_vocalbulary("C:/Users/Admin/DSLab Training sessions/Session 1/data/20news-bydate/20news-full-processed.txt")

Vocabulary size is:14234
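The vocabulary is much smaller than the set of all stemmed words because of the document-frequency threshold and the digit filter. The following sketch shows the effect of that pruning on a few in-memory documents. It is an assumption-laden illustration: `vocabulary_with_idf` and its `min_doc_freq` parameter are hypothetical, and the threshold is lowered to 1 so that the toy data is not filtered out entirely.

```python
import math
from collections import defaultdict

def vocabulary_with_idf(documents, min_doc_freq):
    """Return (word, idf) pairs for non-numeric words with df > min_doc_freq."""
    doc_count = defaultdict(int)
    for text in documents:
        for word in set(text.split()):  # count each word once per document
            doc_count[word] += 1
    kept = [(word, math.log10(len(documents) / df))
            for word, df in doc_count.items()
            if df > min_doc_freq and not word.isdigit()]
    # Sort by decreasing idf, breaking ties alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept

# Hypothetical, already stemmed documents.
documents = ["appl appl banana 2020",
             "appl cherri 2020",
             "banana cherri appl durian",
             "appl 2020"]
result = vocabulary_with_idf(documents, min_doc_freq=1)
print(result)
```

Here "durian" is dropped because it occurs in only one document, "2020" is dropped because it is numeric, and "appl" is kept with an idf of 0 because it occurs in every document.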


In [21]:
#TF-IDF
def get_tf_idf(datapath):
    #get pre-computed idf values
    with open(r"C:\Users\Admin\DSLab Training sessions\Session 1\data\20news-bydate\words_idfs.txt") as f:
        words_idfs = [(word,float(idf))
                      for word,idf in (line.split('<fff>') for line in f.read().splitlines())]
    #map each word to its index in the vocabulary
    word_IDS = {word:index for index,(word,idf) in enumerate(words_idfs)}
    idfs = dict(words_idfs)
    #read data: each line is label<fff>doc_id<fff>text
    with open(datapath) as f:
        documents = []
        for line in f.read().splitlines():
            label,doc_id,text = line.split('<fff>')
            documents.append((int(label),int(doc_id),text))
    data_tf_idf=[]
    for document in documents:
        #get label,doc_id,text
        label,doc_id,text=document
        #count how often each vocabulary word occurs in the document
        term_freqs = defaultdict(int)
        for word in text.split():
            if word in idfs:
                term_freqs[word] += 1
        #determine the max term frequency (the default avoids an error on a document with no vocabulary word)
        max_term_freq = max(term_freqs.values(), default=1)
        #store tf-idf values of words
        words_tfidf = []
        sum_square = 0.0
        for word,term_freq in term_freqs.items():
            #calculate tf_idf value of word
            tf_idf_value = term_freq / max_term_freq * idfs[word]
            words_tfidf.append((word_IDS[word],tf_idf_value))
            #accumulate the squared L2 norm
            sum_square += tf_idf_value**2
        #normalize the vector to unit L2 norm (an all-zero vector is left unchanged)
        norm = np.sqrt(sum_square) if sum_square > 0 else 1.0
        words_tfidf_normalized = [str(index) + ':' + str(tf_idf_value/norm)
                                  for index, tf_idf_value in words_tfidf]
        #store data as space-separated index:value pairs
        sparse_rep = ' '.join(words_tfidf_normalized)
        data_tf_idf.append([label,doc_id,sparse_rep])
    return data_tf_idf

data_tf_idf = get_tf_idf("C:/Users/Admin/DSLab Training sessions/Session 1/data/20news-bydate/20news-full-processed.txt")

In [22]:
#write tf-idf to file, one document per line: label<fff>doc_id<fff>sparse_rep
with open("C:/Users/Admin/DSLab Training sessions/Session 1/data/20news-bydate/data_tf_idf.txt",'w') as f:
    f.write('\n'.join('<fff>'.join([str(label),str(doc_id),sparse_rep])
                      for label,doc_id,sparse_rep in data_tf_idf))
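Each line of `data_tf_idf.txt` stores a document as `label<fff>doc_id<fff>index:value index:value ...`, where the indices refer to positions in `words_idfs.txt`. As a rough sketch of how such a line might be read back, the hypothetical helper `parse_tf_idf_line` below turns it into a dictionary and checks that the vector has unit L2 norm. The example line is made up and does not come from the real output.

```python
import math

def parse_tf_idf_line(line):
    """Parse label<fff>doc_id<fff>sparse_rep into (label, doc_id, {index: value})."""
    label, doc_id, sparse_rep = line.split('<fff>')
    vector = {}
    for pair in sparse_rep.split():
        index, value = pair.split(':')
        vector[int(index)] = float(value)
    return int(label), int(doc_id), vector

# Hypothetical line with two non-zero components.
line = '0<fff>53068<fff>12:0.6 407:0.8'
label, doc_id, vector = parse_tf_idf_line(line)
norm = math.sqrt(sum(value * value for value in vector.values()))
print(label, doc_id, vector, norm)
```

Only the non-zero components are stored, so a document that uses a few hundred distinct words takes a few hundred pairs instead of a dense vector of length $|V|$.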