# CPTW vs TF-IDF
This Jupyter notebook implements the methods described in the paper Contextually Propagated Term Weights for Document. We compare the results with the popular TF-IDF algorithm.

In [53]:
# TODO: this cell may need to be deleted
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Processing Training Data
Below, the training data is processed so that the labels are stored in the variable **train_labels** and the corresponding texts in **train_texts**.

In [8]:
train_texts = []
train_labels = []
# Each line holds a label followed by the document's terms.
with open("reuters/r8-train-all-terms.txt", "r") as train_file:
    for line in train_file:
        tokens = line.split()
        train_labels.append(tokens[0])
        train_texts.append(" ".join(tokens[1:]))

## Processing Test Data
We repeat the same steps with the test data, with the respective variables being **test_labels** and **test_texts**.

In [9]:
test_texts = []
test_labels = []
# Same layout as the training file: label first, then the terms.
with open("reuters/r8-test-all-terms.txt", "r") as test_file:
    for line in test_file:
        tokens = line.split()
        test_labels.append(tokens[0])
        test_texts.append(" ".join(tokens[1:]))
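
Since the training and test files are parsed in the same way, the two loops could share one helper. The following is a minimal sketch, not part of the original notebook: `load_labeled_lines` is a hypothetical name, and the in-memory sample only imitates the "label, then terms" layout assumed above.

```python
import io

def load_labeled_lines(handle):
    """Split each non-empty line into (label, text); the first token is the label."""
    labels, texts = [], []
    for line in handle:
        tokens = line.split()
        if not tokens:  # skip blank lines instead of raising IndexError
            continue
        labels.append(tokens[0])
        texts.append(" ".join(tokens[1:]))
    return labels, texts

# Hypothetical sample in the same layout as the Reuters files.
sample = io.StringIO("earn\tprofit rose in the quarter\n\nacq\tcompany buys rival\n")
labels, texts = load_labeled_lines(sample)
print(labels)
print(texts)
```

With real data the helper would be called once per file, for example inside `with open("reuters/r8-train-all-terms.txt") as f:`.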

## Training TF-IDF Model
We train a prediction model using TF-IDF as a baseline to compare with the CPTW algorithm that will be implemented later. The code is taken from [Stack Overflow](https://stackoverflow.com/questions/43494059/list-of-tfidf-points-for-scikit-nearest-neighbor). It uses K = 1 for KNN.

In [None]:
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Fit the vocabulary and IDF weights on the training texts only.
training = train_texts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Index the training vectors for 1-nearest-neighbour lookup.
neigh = NearestNeighbors(n_neighbors=1, n_jobs=-1)
neigh.fit(X_train_tfidf)

# Project the test texts into the same TF-IDF space.
test = test_texts
X_test_counts = count_vect.transform(test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# comp[i] holds the index of the training document closest to test document i.
comp = neigh.kneighbors(X_test_tfidf, return_distance=False)
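
To make the baseline less of a black box, here is a rough pure-Python sketch of what the TF-IDF plus 1-NN pipeline computes. It assumes scikit-learn's default settings (smoothed IDF, L2 normalisation, Euclidean distance) and simplifies tokenisation to whitespace splitting; the function names and the toy corpus are invented for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, assumed to match the sklearn default.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(doc, idf):
    # Terms unseen at fit time are dropped; the vector is L2-normalised.
    counts = Counter(t for t in doc.split() if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec

def nearest_label(query, train_vecs, train_labels):
    # Return the label of the training vector with the smallest Euclidean distance.
    def dist(a, b):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))
    best = min(range(len(train_vecs)), key=lambda i: dist(query, train_vecs[i]))
    return train_labels[best]

# Hypothetical toy corpus.
toy_texts = ["oil prices rise", "company acquires rival company", "crude oil exports fall"]
toy_labels = ["crude", "acq", "crude"]
idf = fit_idf(toy_texts)
train_vecs = [tfidf_vector(t, idf) for t in toy_texts]
pred = nearest_label(tfidf_vector("rival company bid", idf), train_vecs, toy_labels)
print(pred)
```

The query shares terms only with the second toy document, so that document is its nearest neighbour and supplies the predicted label.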

## Baseline result of TF-IDF
Micro- and macro-averaged F1 scores of KNN on TF-IDF are shown below.

In [18]:
from sklearn.metrics import f1_score
# Each row of comp is a length-1 array holding the nearest training index.
pred_labels = [train_labels[idx[0]] for idx in comp]
print(f1_score(test_labels, pred_labels, average="micro"))
print(f1_score(test_labels, pred_labels, average="macro"))

0.840109639104614
0.7862565745939916
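
The two scores above differ because micro averaging pools all decisions (for single-label classification it equals accuracy), while macro averaging gives every class equal weight regardless of size. The sketch below recomputes both by hand on invented labels; `f1_micro_macro` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def f1_micro_macro(y_true, y_pred):
    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    labels = sorted(set(y_true) | set(y_pred))

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: one F1 over pooled counts. Macro: unweighted mean of per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Hypothetical labels: the rare class "crude" is always missed.
y_true = ["earn", "earn", "earn", "acq", "crude"]
y_pred = ["earn", "earn", "acq", "acq", "earn"]
micro, macro = f1_micro_macro(y_true, y_pred)
print(micro, macro)
```

In this toy case the missed rare class pulls the macro score well below the micro score, which is the same direction of gap seen in the baseline numbers.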


## Implementing CPTW
To implement CPTW, we first need word embeddings, which we obtain from a word2vec model.

In [None]:
"""    model1 = gensim.models.Word2Vec(
        documents,
        size=150,
        window=10,
        min_count=2,
        workers=10)
    model1.train(documents, total_examples=len(documents), epochs=10)
"""

In [140]:
# Flat lists of all in-vocabulary tokens, used only to build the word set.
ftrain_text = [w for row in train_texts for w in row.split() if w in model.vocab]
ftest_text = [w for row in test_texts for w in row.split() if w in model.vocab]

# Tokenised documents restricted to words that have an embedding.
X_train = [[w for w in row.split() if w in model.vocab] for row in train_texts]
X_test = [[w for w in row.split() if w in model.vocab] for row in test_texts]
combined = ftrain_text + ftest_text
unique_words = set(combined)
words = list(unique_words)
len(unique_words)

15587

In [123]:
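import numpy as np
# Calling most_similar first also fills model.vectors_norm, which restrict_w2v reads.
print(model.most_similar("cat"))
def restrict_w2v(w2v, restricted_word_set):
    # Shrink the embedding model in place to the words in restricted_word_set.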
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)
restrict_w2v(model, unique_words)
model

[('cats', 0.8099379539489746), ('dog', 0.7609456777572632), ('kitten', 0.7464985251426697), ('feline', 0.7326233983039856), ('beagle', 0.7150583267211914), ('puppy', 0.7075453996658325), ('pup', 0.6934291124343872), ('pet', 0.6891531348228455), ('felines', 0.6755931377410889), ('chihuahua', 0.6709762215614319)]


In [None]:
import numpy as np

def gamma(w_k, d_i, cossim):
    # Contribution of neighbour w_k: its frequency in document d_i
    # scaled by its cosine similarity to the target word.
    return d_i.count(w_k) * cossim

def cptw(d_i, tau=5):
    # d_i is a list of tokens; tau is interpreted here as the number of
    # nearest neighbours considered per word.
    result = np.zeros(len(words))
    for idx, w_j in enumerate(words):
        # The word itself (similarity 1.0) plus its tau most similar words.
        most_similar = [(w_j, 1.0)] + model.similar_by_word(w_j, topn=tau)
        # Normalise by the total similarity mass of the neighbourhood.
        alpha_j = 1 / sum(c for (_, c) in most_similar)
        gammas = [gamma(w_k, d_i, c_k) for (w_k, c_k) in most_similar]
        result[idx] = alpha_j * sum(gammas)
    return result

TrData = []
# Iterate over tokenised documents (X_train), not the flat word list ftrain_text.
for count, doc in enumerate(X_train, start=1):
    TrData.append(cptw(doc))
    if count % 10 == 0:
        print("Currently processed " + str(count) + " files")


Currently processed 10 files
Currently processed 20 files
Currently processed 30 files
Currently processed 40 files
Currently processed 50 files
Currently processed 60 files
Currently processed 70 files
Currently processed 80 files
Currently processed 90 files
Currently processed 100 files
Currently processed 110 files
Currently processed 120 files
Currently processed 130 files
Currently processed 140 files
Currently processed 150 files
Currently processed 160 files
Currently processed 170 files
Currently processed 180 files
Currently processed 190 files
Currently processed 200 files
Currently processed 210 files
Currently processed 220 files
Currently processed 230 files
Currently processed 240 files
Currently processed 250 files
Currently processed 260 files
Currently processed 270 files
Currently processed 280 files
Currently processed 290 files
Currently processed 300 files
Currently processed 310 files
Currently processed 320 files
Currently processed 330 files
Currently processed

1.1920929e-07
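
The weighting used in this notebook's `cptw` function can be checked by hand on a tiny vocabulary. The sketch below is only an illustration under stated assumptions: the 2-d vectors are invented, gensim's `similar_by_word` is replaced by a brute-force cosine ranking, and `toy_cptw` and `neighbours` are hypothetical names. Each word's weight is its neighbours' document frequencies weighted by cosine similarity, divided by the summed similarities.

```python
import math

# Hypothetical 2-d unit-length embeddings, invented for illustration only.
toy_vectors = {
    "oil": (1.0, 0.0),
    "crude": (0.8, 0.6),
    "petrol": (0.6, 0.8),
    "bank": (0.0, 1.0),
}
toy_vocab = sorted(toy_vectors)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def neighbours(word, topn):
    # The word itself (similarity 1.0) followed by its topn most similar words.
    others = [(w, cosine(toy_vectors[word], toy_vectors[w]))
              for w in toy_vocab if w != word]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return [(word, 1.0)] + others[:topn]

def toy_cptw(doc_tokens, topn=2):
    weights = {}
    for w_j in toy_vocab:
        neigh = neighbours(w_j, topn)
        alpha_j = 1.0 / sum(c for _, c in neigh)  # normalisation term
        weights[w_j] = alpha_j * sum(doc_tokens.count(w_k) * c for w_k, c in neigh)
    return weights

weights = toy_cptw(["oil", "oil", "bank"], topn=2)
for word in toy_vocab:
    print(word, round(weights[word], 4))
```

Although "crude" never occurs in the toy document, it receives a non-zero weight because its neighbour "oil" does. This propagation of weight to similar words is what distinguishes the representation from plain term counts.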