<h2>
    Implementation of a paper on Natural Language Processing
</h2>
<p>
    <ul>
        <li> Title: Simple and Effective Dimensionality Reduction for Word Embeddings </li>
        <li> Author: Vikas Raunak </li>
        <li> Published in: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.</li>
        <li> <a href = "https://github.com/vaibagga/NLP-Paper"> Link for code </a> </li>
    </ul>
</p>
<hr>

Importing all the dependencies:
<ol>
    <li> NumPy </li>
    <li> pandas </li>
    <li> Gensim </li>
    <li> scikit-learn </li>
    <li> tqdm </li>
</ol>

In [6]:
import numpy as np
import pandas as pd
from gensim.models import word2vec
from tqdm import tqdm
from sklearn.decomposition import PCA
import pickle
import sys
import os

Loading the GloVe embeddings (saved on the local machine)
<a href = "https://nlp.stanford.edu/projects/glove/"> Link for word embeddings </a>

In [7]:
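def loadGloveModel(gloveFile):
    """Load GloVe vectors from a text file into a {word: vector} dict."""
    model = {}
    # The GloVe files are UTF-8 text; the context manager closes the file when done
    with open(gloveFile, 'r', encoding='utf-8') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding
    print("Word embeddings successfully loaded")
    return model

In [8]:
glove300 = loadGloveModel('glove.6B.300d.txt')

Word embeddings successfully loaded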
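
Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. The parsing step can be tried without the full download on an in-memory sample. This is a minimal sketch: the helper name `parse_glove_lines` and the two sample lines are made up for illustration and are not real GloVe values.

```python
import io

import numpy as np


def parse_glove_lines(lines):
    """Parse an iterable of 'word v1 v2 ...' lines into a {word: vector} dict."""
    model = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        model[parts[0]] = np.array([float(v) for v in parts[1:]], dtype=np.float32)
    return model


# Made-up 3-dimensional sample (assumption: not taken from the GloVe files)
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
vectors = parse_glove_lines(sample)
print(sorted(vectors), vectors["cat"].shape)
```

Storing the vectors as `float32` roughly halves the memory needed compared with NumPy's default `float64`, which can matter for a 400,000-word vocabulary.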


In [9]:
## reducing dimensions of words
X_train = []
X_train_names = []
for x in tqdm(glove300):
    X_train.append(glove300[x])
    X_train_names.append(x)

X_train = np.asarray(X_train)
pca_embeddings = {}

# PCA to get the top components
pca = PCA(n_components=300)
# Note: np.mean without an axis returns one scalar over all entries,
# not the per-dimension mean vector (that would be np.mean(X_train, axis=0))
X_train = X_train - np.mean(X_train)
X_fit = pca.fit_transform(X_train)
U1 = pca.components_

z = []

# Remove the projections on the top 7 components from every vector
for x in X_train:
    for u in U1[0:7]:
        x = x - np.dot(u, x) * u
    z.append(x)

z = np.asarray(z)

# PCA dimensionality reduction from 300 to 150 dimensions
pca = PCA(n_components=150)
X_train = z - np.mean(z)
X_new_final = pca.fit_transform(X_train)

# PCA again, to get the top components of the reduced embeddings
pca = PCA(n_components=150)
X_new = X_new_final - np.mean(X_new_final)
X_new = pca.fit_transform(X_new)
Ufit = pca.components_

X_new_final = X_new_final - np.mean(X_new_final)

final_pca_embeddings = {}
embedding_file = open('pca_embed2.txt', 'w')

100%|██████████| 400000/400000 [00:00<00:00, 1337255.11it/s]
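
The cell above runs three stages: remove the projections on the top principal components, reduce the dimensionality with PCA, then compute the top components of the reduced vectors for a second removal pass. The sketch below shows one way to express those stages as functions and to write the result to a text file. It is not the author's code. The function names are hypothetical, the data is random, the vectors are centered per dimension (`axis=0`) rather than by the scalar mean used in the cell, and the `word v1 v2 ...` output format is an assumption chosen so that a whitespace-splitting reader can parse it.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import PCA


def remove_top_components(X, n_remove):
    """Center X per dimension, then subtract its projections on the top principal components."""
    X = X - X.mean(axis=0)
    top = PCA(n_components=n_remove).fit(X).components_  # (n_remove, dim), orthonormal rows
    return X - (X @ top.T) @ top


def reduce_embeddings(X, new_dim, n_remove=7):
    """Post-process, reduce to new_dim with PCA, then post-process again."""
    X = remove_top_components(X, n_remove)
    X = PCA(n_components=new_dim).fit_transform(X)
    return remove_top_components(X, n_remove)


def write_embeddings(path, words, X):
    """Write one 'word v1 v2 ...' line per embedding."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, vec in zip(words, X):
            f.write(word + ' ' + ' '.join('%f' % v for v in vec) + '\n')


# Synthetic stand-in for the embedding matrix (assumption: 200 words, 20 dimensions)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
words = ['w%d' % i for i in range(len(X))]

reduced = reduce_embeddings(X, new_dim=10, n_remove=2)
print(reduced.shape)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'reduced_demo.txt')
    write_embeddings(out_path, words, reduced)
    with open(out_path, encoding='utf-8') as f:
        n_lines = sum(1 for _ in f)
```

The matrix form `X - (X @ top.T) @ top` gives the same result as subtracting each projection in a nested loop, because the rows of `components_` are orthonormal, and it avoids a Python-level loop over every word.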


In [13]:
import gzip
import math

def read_word_vectors(filename):
    """Read word vectors from a text file (optionally gzipped), lowercased and normalized."""
    word_vecs = {}
    # Open in text mode so each line is a str for both plain and gzipped files
    if filename.endswith('.gz'):
        file_object = gzip.open(filename, 'rt')
    else:
        file_object = open(filename, 'r')

    with file_object:
        for line_num, line in enumerate(file_object):
            parts = line.strip().lower().split()
            word = parts[0]
            word_vecs[word] = np.array([float(vec_val) for vec_val in parts[1:]], dtype=float)
            # Scale to unit length; the small constant guards against division by zero
            word_vecs[word] /= math.sqrt((word_vecs[word]**2).sum() + 1e-6)

    sys.stderr.write("Vectors read from: " + filename + " \n")
    return word_vecs
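
The `all_wordsim.py` script used in the next cell is not shown in this notebook. Word-similarity benchmarks of this kind are commonly scored by computing the cosine similarity of each word pair and then the Spearman rank correlation against the human ratings, skipping pairs with a missing word. The sketch below illustrates that idea under those assumptions. The function names, the toy vectors, and the toy ratings are all hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    """Cosine similarity of two vectors; the small constant avoids division by zero."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def evaluate_pairs(word_vecs, pairs):
    """Score (word1, word2, human_score) triples; return (spearman_rho, not_found)."""
    model_scores, human_scores, not_found = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in word_vecs or w2 not in word_vecs:
            not_found += 1  # pair skipped because a word has no vector
            continue
        model_scores.append(cosine(word_vecs[w1], word_vecs[w2]))
        human_scores.append(score)
    rho, _ = spearmanr(human_scores, model_scores)
    return rho, not_found


# Toy 2-dimensional vectors and ratings (made up for illustration)
toy_vecs = {
    'cat': np.array([1.0, 0.0]),
    'dog': np.array([0.9, 0.1]),
    'car': np.array([0.0, 1.0]),
    'bus': np.array([0.3, 0.9]),
}
toy_pairs = [
    ('cat', 'dog', 9.0),
    ('car', 'bus', 8.0),
    ('cat', 'car', 1.0),
    ('cat', 'unicorn', 2.0),  # 'unicorn' has no vector, so this pair is not found
]
rho, not_found = evaluate_pairs(toy_vecs, toy_pairs)
print(rho, not_found)
```

Because Spearman's rho depends only on rank order, the model's similarity scores do not need to be on the same scale as the human ratings.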


In [16]:
print("Results for the Embedding")
!python all_wordsim.py pca_embed2.txt data/word-sim/
print("Results for Glove")
!python all_wordsim.py glove.6B.300d.txt data/word-sim/


Results for the Embedding
Vectors read from: pca_embed2.txt 
   Serial             Dataset Num Pairs Not Found       Rho
0       1  EN-RW-STANFORD.txt      2034       252  0.436566
1       2    EN-MTurk-287.txt       287         0  0.660352
2       3   EN-SIMLEX-999.txt       999         0  0.385396
3       4    EN-MEN-TR-3k.txt      3000         0  0.748888
4       5       EN-YP-130.txt       130         0  0.549244
5       6     EN-VERB-143.txt       144         0  0.393999
6       7    EN-MTurk-771.txt       771         0  0.652786
7       8   EN-WS-353-ALL.txt       353         0  0.675084
8       9   EN-WS-353-REL.txt       252         0  0.626766
9      10        EN-MC-30.txt        30         0  0.747969
10     11        EN-RG-65.txt        65         0  0.775491
11     12   EN-WS-353-SIM.txt       203         0  0.715894
Results for Glove
Vectors read from: glove.6B.300d.txt 
   Serial             Dataset Num Pairs Not Found       Rho
0       1  EN-RW-STANFORD.txt      2034    